In scanning Bryan Alexander’s recent Academia and AI newsletter issue on Generative AI at the end of 2024, I was intrigued by his examples of image descriptions by Mistral AI, so I did the obvious and tried a few examples (see below).
I’m curious to hear how others are faring with using generative AI for image descriptions.
But, maybe more for those deeper into this (cough cough @moodler), is Mistral open enough? Or open at all? Their site lists under Openness: “We lead the market of open source generative technologies to bring trust and transparency in the field and foster decentralised technology development.”
Their model weights are available, though I have no idea what I would do with them. Their Pixtral Large model “demonstrates frontier-level image understanding. Particularly, the model is able to understand documents, charts and natural images, while maintaining the leading text-only understanding of Mistral Large 2.”
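Running the downloadable weights is beyond me, but for anyone curious, Mistral also offers a hosted API. Below is a minimal, hedged sketch of asking Pixtral for an alt text style description via their Python client. I did not run it this way (I used the chat interface); the model name, message structure, and image URL here are my assumptions based on their docs, not a verified recipe.

```python
# Hedged sketch only: assumes the mistralai Python package, a MISTRAL_API_KEY
# in the environment, and that "pixtral-large-latest" is a valid vision model id.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="pixtral-large-latest",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Write a one-paragraph alt text description of this image."},
                # placeholder URL; point this at your own image
                {"type": "image_url", "image_url": "https://example.com/photo.jpg"},
            ],
        }
    ],
)

# The generated description comes back like any chat completion
print(response.choices[0].message.content)
```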
Alan’s Test (n=3)
Yes, the natural response is to try it on images. I chose three for which, in use elsewhere, I had already written my own alt text descriptions from my own model (my brain).
Photo by Alan Levine shared into the public domain using CC0
I thought I had posted this somewhere with a description but am not finding it. Mistral AI described it (used in the alt text above):
This image depicts a tractor operating in a snowy field. The tractor is equipped with a front loader attachment, which is currently raised and appears to be holding snow. The tractor has large tires, suitable for navigating through snowy terrain. The background shows a flat, open field covered in snow, with some trees and a clear sky visible in the distance. The scene suggests that the tractor is being used for snow removal or other winter-related tasks.
Here is a second test.
The Virtuous Properties flickr photo by cogdogblog shared into the public domain using Creative Commons Public Domain Dedication (CC0)
I used this for a blog post where my not-so-great alt text was written (I am now cringing at how weak it is):
Taped to the side of an abandoned refrigerator is a hand written sign reading “It Works, Free”
Mistral AI provided:
This image shows a refrigerator placed outdoors. Attached to the front of the refrigerator is a handwritten note on a piece of lined paper. The note reads, “It works. Free.” The note is secured to the refrigerator with green tape. The background shows a natural setting with trees and grass, indicating that the refrigerator is likely being offered for free to anyone who wants it.
And finally, my third test was a photo I took in a local coffee shop.
Pixelfed photo by @cogdog licensed under Public Domain (CC0)
My alt text, which I feel pretty good about, was:
A small round shaped cactus sits in a bowl on a counter. It has small, delicate bright colored flower on its crown. Through the window behind is a snow covered street suggesting very cold temperature out there.
And here is Mistral AI’s description:
This image shows a close-up of a cactus plant with several bright pink flowers blooming at the top. The cactus has a round shape and is covered with numerous white, hair-like spines. The background is slightly blurred but appears to show a snowy outdoor scene with buildings and vehicles, suggesting that the cactus is indoors, possibly near a window. The contrast between the vibrant flowers and the snowy backdrop creates an interesting visual juxtaposition.
I have been making a focused effort for more than a year to practice regular image descriptions, and one of my go-to guides is Alex Chen’s How to Write an Image Description for its approach of “object, action, context.”
Is Mistral achieving this? From my bit of reading, an alt text description does not need a reference to itself as an image, so I would remove the “This image shows” or “This image depicts”; it should just be the description. That’s minor. Mistral is doing a good job of describing and then ending in a bit of context or suggestion of meaning.
From my limited tests, Mistral does well. I’d bet it might struggle (as Bryan’s examples suggest) with words on the screen, although when you look at the examples for Mistral Large, it shows not only transcription of text in images but also interpretation (I should try some charts and graphs).
But I have to say, in my small test: wow.