ChatGPT vs. Claude vs. Gemini: The "Strawberry" Test
Updated: February 6, 2026
We asked 9 top AI models to count the 'r's in "Strawberry." See why ChatGPT and Claude failed while Gemini and Llama passed in this Eye2.AI comparison.
TL;DR: We ran the viral "Strawberry" logic test through Eye2.AI to see which models could count to three. The results were a shocking near-even split. While industry darlings ChatGPT and Claude failed, a coalition of rivals led by Gemini and Llama successfully navigated the tokenization trap.
Table of Contents
What is the hypothesis?
How did we test it?
What does the data show?
Why did the "giants" fail?
Why did the "underdogs" win?
What is the verdict?
FAQs
What is the hypothesis?
You would expect an AI that can write Python code to be able to count letters in a word. However, the "Strawberry Test" exposes a fundamental flaw in how Large Language Models (LLMs) process text: Tokenization. Models don't read letter-by-letter; they read in chunks.
Our Question: Can the top LLMs overcome their "token blindness" to answer a simple question, or will they hallucinate?
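The gap can be sketched in a few lines of Python. The token split shown is purely illustrative — real subword vocabularies differ per model:

```python
# A program counts letters directly, so the answer is trivial:
word = "strawberry"
print(word.count("r"))  # prints 3

# An LLM never sees individual letters. A subword tokenizer splits the
# word into chunks first. This particular split is illustrative only,
# not taken from any specific model's tokenizer:
illustrative_tokens = ["str", "aw", "berry"]
assert "".join(illustrative_tokens) == word

# The model must infer letter counts from learned statistics about
# these chunks rather than scanning characters -- "token blindness".
```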
How did we test it?
We used Eye2.AI to query the leading models simultaneously with a single prompt.
Prompt: "Count the 'r' in 'strawberry'."
Goal: Identify the correct count (3).
What does the data show?
The consensus engine returned a 56% vs. 44% split, showing that "intelligence" is not evenly distributed across even simple tasks.
| Result | Models | Accuracy |
| --- | --- | --- |
| "There are 2 'r's" | 🔴 ChatGPT (OpenAI), 🔴 Claude (Anthropic), 🔴 Mistral, 🔴 Amazon Nova | FAIL (44%) |
| "There are 3 'r's" | 🟢 Gemini (Google), 🟢 Llama (Meta), 🟢 Qwen (Alibaba), 🟢 Grok (xAI), 🟢 DeepSeek | PASS (56%) |
Why did the "giants" fail?
It seems impossible that ChatGPT and Claude, widely considered the smartest models available, failed this test. The reason is technical, not intellectual.