AI Navigate

Mistral Small 4 is kind of awful with images

Reddit r/LocalLLaMA / 3/18/2026

💬 OpinionTools & Practical UsageModels & Research

Key Points

  • The author tests Mistral Small 4's image recognition and finds the results to be notably poor, both with the Q4_K_M setup and the official API.
  • The provided caption for a festival image misidentifies the scene, describing a stadium, track, staff, and other elements that do not exist in the image.
  • A comparison with Qwen3.5 35B A3B is cited, suggesting smaller models may produce more accurate captions for the same image.
  • The post highlights reliability concerns about Mistral Small 4's visual understanding and implies cautious use for image-related tasks.

I first started testing with the Q4_K_M and the image recognition was so bad that I assumed there was something wrong with my setup. So, I tested Mistral's official API and the image capabilities are just as terrible, so I believe this may just be the model's actual ability.

Given the prompt "Describe this image in detail in around 200 words" and this picture of a music festival, here's the nonsense the official API for Mistral Small 4 came up with:

The image captures a vibrant scene at a large stadium during what appears to be an outdoor event, possibly a sports game or concert. The stadium is filled with a dense crowd of spectators, suggesting that the event is popular and well-attended. The photograph is taken from an elevated vantage point, providing a comprehensive view of the stadium and its surroundings. The sky above is a clear blue, dotted with a few clouds, indicating pleasant weather. The stadium is bathed in natural daylight, with areas of shadow cast by the architectural structures. Illuminating lights are suspended above, ready to illuminate the event as dusk falls. In the lower central part of the image, there are a few individuals, possibly staff or security personnel, walking within the confines of the stadium's track area. The seating areas are vibrant with various colors from the crowd's clothing and banners. Vehicles are parked around the perimeter of the stadium, and a few structures, including light poles and possibly concession stands, are visible. The overall atmosphere is bustling and energetic, characteristic of a large-scale public event. 

To save you from having to look at the picture yourself, here's everything that's just completely wrong about this description:

  • It's very obviously not a sports game. The musicians are clearly visible in the foreground.
  • The lower central part of the image is the stage with the musicians. Not security personnel/staff. It's obviously the musicians.
  • There is no stadium
  • There is no "track area"
  • There is no seating
  • The crowd does not have any banners
  • There are no vehicles in the image at all
  • There are no light poles

For comparison, here is Qwen3.5 35B A3B (keep in mind, ~1/4 the total parameters and ~1/2 the active parameters):

This image captures a breathtaking, high-angle view of a massive outdoor music concert, likely a festival, taken from the perspective of the stage looking out. In the foreground, the dark stage deck is busy with activity; musicians are performing, including a figure in a cowboy hat and another near a microphone stand, while a camera operator films the action. The stage is framed by towering stacks of speakers and a complex overhead truss system laden with stage lighting rigs. Stretching out into the distance is an immense, dense crowd of thousands of attendees, creating a sea of humanity that fills the entire mid-ground. The setting is clearly coastal, with the ocean visible on the horizon under a vast sky transitioning from bright blue to the soft oranges and pinks of sunset. Along the far edge of the crowd, near the water, rows of colorful umbrellas and white tents line the beach, indicating a festival atmosphere with vendors or food stalls. The scene is bathed in natural sunlight, highlighting the sheer scale of the event and the energetic atmosphere of a summer concert by the sea. 

If you look at the image, you'll see that the Qwen description is incredibly detailed and accurate, and Mistral sounds like something from over a year ago.

I also tested the last-generation Mistral Small 3.2 24B, as well as Ministral 3B, 8B, and 14B. None of the other Mistral models I tested had any issues with interpreting the image.

This issue also isn't specific to just this image, it thought Lenna was an ornate bird sculpture.

Could this just be an issue with the model being so recent? Like, the image recognition is completely unusable.

submitted by /u/EffectiveCeilingFan
[link] [comments]