
ChatGPT vs. Claude vs. Gemini: The "Strawberry" Test

Updated: February 6, 2026

We asked 9 top AI models to count the 'r's in "Strawberry." See why ChatGPT and Claude failed while Gemini and Llama passed in this Eye2.AI comparison.

TL;DR: We ran the viral "Strawberry" logic test through Eye2.AI to see which models could count to three. The results were a shocking near-even split. While industry darlings ChatGPT and Claude failed, a coalition of rivals led by Gemini and Llama successfully navigated the tokenization trap.


Table of Contents

  • What is the hypothesis?

  • How did we test it?

  • What does the data show?

  • Why did the "giants" fail?

  • Why did the "underdogs" win?

  • What is the verdict?

  • FAQs


What is the hypothesis?

You would expect an AI that can write Python code to be able to count letters in a word. However, the "Strawberry Test" exposes a fundamental flaw in how Large Language Models (LLMs) process text: Tokenization. Models don't read letter-by-letter; they read in chunks.

Our Question: Can the top LLMs overcome their "token blindness" to answer a simple question, or will they hallucinate?
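The tokenization point above can be sketched in a few lines of Python. Counting characters is trivial for ordinary code, but an LLM never sees individual characters — it sees chunks. (The token split shown below is illustrative only, not the output of any real model's tokenizer.)

```python
# Ordinary code counts letters character-by-character:
word = "strawberry"
print(word.count("r"))  # → 3

# An LLM, by contrast, reads the word as token chunks,
# e.g. (illustrative split, not an actual tokenizer's output):
tokens = ["str", "aw", "berry"]

# The model never sees the letters inside each chunk directly;
# it must infer "how many r's" from learned associations with
# these chunks — which is where miscounts like "2" come from.
print(tokens)
```

This "token blindness" is why a model that can write flawless counting code can still miscount letters when asked directly.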


How did we test it?

We used Eye2.AI to query the leading models simultaneously with a single prompt.

  • Prompt: "Count the 'r' in 'strawberry'."

  • Goal: Identify the correct count (3).


What does the data show?

The consensus engine returned a 56% vs. 44% split, proving that "intelligence" is not evenly distributed across simple tasks.

| Result | Models | Accuracy |
| --- | --- | --- |
| "There are 2 'r's" | 🔴 ChatGPT (OpenAI), 🔴 Claude (Anthropic), 🔴 Mistral, 🔴 Amazon Nova | FAIL (44%) |
| "There are 3 'r's" | 🟢 Gemini (Google), 🟢 Llama (Meta), 🟢 Qwen (Alibaba), 🟢 Grok (xAI), 🟢 DeepSeek | PASS (56%) |


Why did the "giants" fail?

It seems impossible that ChatGPT and Claude — widely considered the smartest models available — failed this test. The reason is technical, not intellectual.