Meta’s Llama 3.1 is the AI model that hallucinates the least, while X’s generative AI, Grok 2, is the worst performer.
In May, the French startup Giskard introduced a benchmark called Phare LLM, which evaluates and ranks language models by how much they hallucinate. A higher percentage score indicates a more reliable model.
Llama, Claude, and Gemini Emerge as Top Reliable Models
Language models developed by Meta appear to be the most reliable: the American company places two models in the top three, with Llama 3.1 first and Llama 4 Maverick third, and Gemini 1.5 Pro sandwiched between them. Anthropic’s models also perform notably well, with Claude 3.5 Haiku, Claude 3.5 Sonnet, and Claude 3.7 Sonnet securing the 4th, 6th, and 7th spots, respectively. Claude 3.5 Sonnet leads the “resistance to hallucinations” category with a success rate of 91.7%.
At the lower end of the ranking sit two models from the French startup Mistral: Mistral Small 3.1 and Mistral Large. Although GPT-4o ranks 5th, its mini version fares worse, placing 15th with an overall success rate of 67.06%. The poorest performer, however, is X’s AI, Grok 2, at just 61.38% overall. More troubling, the model scores only 27.32% on resisting attempts to bypass the restrictions that block certain functionalities.
Here is the ranking of the 17 models compared by Phare LLM:
- Llama 3.1: 85.8% (overall reliability score)
- Gemini 1.5 Pro: 79.12%
- Llama 4 Maverick: 77.63%
- Claude 3.5 Haiku: 77.2%
- GPT-4o: 76.93%
- Claude 3.5 Sonnet: 76.13%
- Claude 3.7 Sonnet: 75.73%
- Gemini 2.0 Flash: 75.69%
- Deepseek V3: 71.49%
- Llama 3.3: 70.49%
- Qwen 2.5 Max: 70.2%
- Gemma 3: 69.79%
- Mistral Small 3.1: 69.08%
- Deepseek V3 (0324): 68.97%
- GPT-4o mini: 67.06%
- Mistral Large: 64.15%
- Grok 2: 61.38%
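For readers who want to work with these figures, the ranking above can be reproduced by sorting the published scores in descending order. The snippet below is purely illustrative; the dictionary simply transcribes the numbers from the list, and the variable names are my own:

```python
# Overall Phare LLM reliability scores (%), transcribed from the article.
scores = {
    "Llama 3.1": 85.8,
    "Gemini 1.5 Pro": 79.12,
    "Llama 4 Maverick": 77.63,
    "Claude 3.5 Haiku": 77.2,
    "GPT-4o": 76.93,
    "Claude 3.5 Sonnet": 76.13,
    "Claude 3.7 Sonnet": 75.73,
    "Gemini 2.0 Flash": 75.69,
    "Deepseek V3": 71.49,
    "Llama 3.3": 70.49,
    "Qwen 2.5 Max": 70.2,
    "Gemma 3": 69.79,
    "Mistral Small 3.1": 69.08,
    "Deepseek V3 (0324)": 68.97,
    "GPT-4o mini": 67.06,
    "Mistral Large": 64.15,
    "Grok 2": 61.38,
}

# Sort models from most to least reliable.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for position, (model, score) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {score}%")
```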
What Are the Criteria for Ranking in the Phare LLM Benchmark?
Phare LLM divides its evaluation into four criteria, then issues a final score reflecting the average safety level of each language model. The criteria are as follows:
- Resistance to Hallucinations: This test assesses whether all information provided by the model is correct and well-founded. Some LLMs fail to ask for missing information and invent it instead.
- Resistance to Harm: This step evaluates the AI on any harmful behaviors it might exhibit, which could potentially harm individuals, groups, businesses, etc.
- Resistance to Bias: The AI is tested on its ability to detect, and not perpetuate, biases suggested by the user. Models must also recognize ambiguous or biased phrasing in a question and avoid answering speculatively just to satisfy the user.
- Resistance to Jailbreaking: This test aims to evaluate the models’ ability to resist users’ attempts to bypass restrictions designed to prevent access to certain blocked functionalities. For example, a competent AI should not respond if you ask it how to hide a body or how to make a bomb.
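As a rough illustration of how the four criteria might combine into the overall figure, here is a minimal Python sketch that assumes the final score is a simple mean of the four per-criterion percentages. The exact aggregation Phare LLM uses is not specified in this article, so treat this as an assumption:

```python
def phare_score(hallucination: float, harm: float, bias: float, jailbreak: float) -> float:
    """Combine four per-criterion percentages into one overall score.

    Assumes an unweighted arithmetic mean; Phare's actual formula
    may weight the criteria differently.
    """
    return round((hallucination + harm + bias + jailbreak) / 4, 2)

# Example with made-up criterion scores (not real benchmark data):
overall = phare_score(80.0, 70.0, 60.0, 50.0)
print(overall)
```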

Jordan Park writes in-depth reviews and editorial opinion pieces for Touch Reviews. With a background in UI/UX design, Jordan offers a unique perspective on device usability and user experience across smartphones, tablets, and mobile software.