AI Gone Wild: Which Models Are Hallucinating the Most in July 2025?

August 2, 2025


Meta’s Llama 3.1 is the AI model that hallucinates the least, while X’s generative AI, Grok 2, ranks as the poorest performer.

In May, French startup Giskard introduced a benchmark called Phare LLM, which evaluates and ranks language models by their level of hallucination. A higher percentage score indicates a more reliable model.

Llama, Claude, and Gemini Emerge as Top Reliable Models

Language models developed by Meta appear to be the most reliable: the American company places two models in the top three, with Llama 3.1 first and Llama 4 Maverick third, and Gemini 1.5 Pro sandwiched between them. Anthropic’s models also perform well, with Claude 3.5 Haiku, Claude 3.5 Sonnet, and Claude 3.7 Sonnet taking the 4th, 6th, and 7th spots, respectively. Claude 3.5 Sonnet notably leads the “resistance to hallucinations” category with a success rate of 91.7%.

At the lower end of the ranking sit two models from the French startup Mistral: Mistral Small 3.1 and Mistral Large. Although GPT-4o ranks 5th, its mini version does not fare as well, placing 15th with an overall success rate of 67.06%. The poorest performer, however, is X’s AI, Grok 2, with an overall success rate of just 61.38%. More troubling, the model scores only 27.32% on resistance to jailbreaking, that is, on preventing access to blocked functionalities.

Here is the ranking of the 17 models compared by Phare LLM:

  1. Llama 3.1: 85.8% (reliability level),
  2. Gemini 1.5 Pro: 79.12%,
  3. Llama 4 Maverick: 77.63%,
  4. Claude 3.5 Haiku: 77.2%,
  5. GPT-4o: 76.93%,
  6. Claude 3.5 Sonnet: 76.13%,
  7. Claude 3.7 Sonnet: 75.73%,
  8. Gemini 2.0 Flash: 75.69%,
  9. Deepseek V3: 71.49%,
  10. Llama 3.3: 70.49%,
  11. Qwen 2.5 Max: 70.2%,
  12. Gemma 3: 69.79%,
  13. Mistral Small 3.1: 69.08%,
  14. Deepseek V3 (0324): 68.97%,
  15. GPT-4o mini: 67.06%,
  16. Mistral Large: 64.15%,
  17. Grok 2: 61.38%.

What Are the Criteria for Ranking in the Phare LLM Benchmark?

Phare LLM evaluates each model against four criteria, then issues a final score reflecting its average safety level. The criteria are as follows:

  • Resistance to Hallucinations: This test assesses whether the information provided by the model is factually correct. Some LLMs, rather than asking for missing data, simply invent it.
  • Resistance to Harm: This step evaluates the AI on any harmful behaviors it might exhibit, which could potentially harm individuals, groups, businesses, etc.
  • Resistance to Bias: The AI is tested on its ability to detect, and not perpetuate, biases suggested by the user. Models must also identify questions phrased in an ambiguous or biased way and avoid responding speculatively just to satisfy the user’s query.
  • Resistance to Jailbreaking: This test aims to evaluate the models’ ability to resist users’ attempts to bypass restrictions designed to prevent access to certain blocked functionalities. For example, a competent AI should not respond if you ask it how to hide a body or how to make a bomb.
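As an illustration, if the final score is a simple arithmetic mean of the four per-criterion success rates (an assumption here; the article does not specify Phare LLM’s exact weighting), combining them could look like this sketch:

```python
# Hypothetical sketch: combining the four Phare LLM criterion scores
# into one overall score. The unweighted mean is an assumption; the
# benchmark's actual aggregation method is not stated in the article.

def overall_score(hallucination: float, harm: float,
                  bias: float, jailbreaking: float) -> float:
    """Average the four per-criterion success rates (percentages)."""
    scores = [hallucination, harm, bias, jailbreaking]
    return round(sum(scores) / len(scores), 2)

# Example with made-up criterion scores:
print(overall_score(90.0, 80.0, 70.0, 60.0))  # -> 75.0
```

Under this reading, a single very weak criterion, like Grok 2’s 27.32% on jailbreaking resistance, would drag the overall score down even if the other three are decent.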
