Word Search
LLM Research

Investigating spatial reasoning capabilities through word search puzzles. Testing how language models locate, track, and report grid coordinates.

Published January 17, 2026

623 Evaluations · 47 Puzzles · 7 Models

LLMs are remarkably good at finding patterns. Give them a wall of text and they'll pull out names, dates, themes, contradictions. But here's a question I've been chewing on: can they tell you where they found something? Not just what they found, but the precise coordinates in a grid, the exact position in a structure?

Word search puzzles turn out to be a clean way to test this. The task is simple enough that any model can understand it: find the hidden words in a grid of letters, then report their start and end coordinates. The interesting part isn't whether models can find the words - spoiler, they're great at that - it's whether they can accurately report where those words are.

What we found surprised us. There's a fundamental gap between finding and locating, and it reveals something important about how these models process positional information.

Metrics Definitions

Word Accuracy

The percentage of hidden words the model successfully identifies in the puzzle. A model that finds 4 out of 5 words has 80% word accuracy. This measures pattern recognition ability.

Position Accuracy

Of the words found, the percentage where the model reports the correct start and end coordinates. A model may find a word but give wrong grid positions. This measures spatial reasoning ability.
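To make the two metrics concrete, here is a minimal scorer sketch. The `(start, end)` answer format and the `score_submission` helper are my own illustration, not the study's actual harness:

```python
def score_submission(answer_key, submission):
    """Compute (word accuracy, position accuracy) for one puzzle.

    answer_key / submission: dicts mapping word -> (start, end),
    where start and end are (row, col) tuples.
    """
    found = [w for w in submission if w in answer_key]
    word_acc = len(found) / len(answer_key) if answer_key else 0.0

    # Position accuracy is conditional: only found words count toward it.
    correct_pos = [w for w in found if submission[w] == answer_key[w]]
    pos_acc = len(correct_pos) / len(found) if found else 0.0
    return word_acc, pos_acc
```

A model that finds both hidden words but places only one of them correctly scores 100% word accuracy and 50% position accuracy.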

Model Performance
Model              Evaluations  Word Accuracy  Position Accuracy  Avg Latency  Total Cost
GPT-4o                      94          99.0%              17.9%         2.6s       $0.30
GPT-4 Turbo                 94          98.0%              15.4%         4.9s       $0.99
Claude 3.5 Haiku            94          95.8%               9.3%         3.8s       $0.18
Claude Opus 4               94          95.7%              58.8%         8.2s       $3.58
Claude Sonnet 4             94          95.0%              60.6%        10.9s       $1.10
GPT-4o Mini                 94          95.0%              11.5%         4.6s       $0.02
Gemini 2.0 Flash            59          94.4%              33.2%         2.0s       $0.01
Performance Charts

[Chart: Word Finding Accuracy by Model]
[Chart: Position Accuracy by Model]

Condition Breakdown

[Chart: Position Accuracy by Grid Size]
[Chart: Position Accuracy Scaling (5x5 to 20x20)]
Scaling Experiment Results

84 evaluations on larger grids (12x12, 15x15, 20x20) with 5-10 words per puzzle.

Model              12x12  15x15  20x20  Word Acc
Claude Sonnet 4    58.1%  46.7%  40.0%     100%
Claude Opus 4      51.9%  19.2%  50.0%     100%
Gemini 2.0 Flash   24.2%  21.4%  10.0%      95%
Claude 3.5 Haiku    8.1%   1.7%   0.0%     100%
GPT-4o             13.1%   8.3%   0.0%    97.9%
GPT-4 Turbo        11.2%   6.7%   0.0%     100%
GPT-4o Mini         5.0%   0.0%   0.0%    99.2%

Position accuracy shown. Gemini 2.0 Flash maintains 10% at 20x20 while GPT models drop to 0%.

Key Findings
Direction reversal is the primary error pattern

Models often find words but report them backwards, claiming a word runs "RIGHT" when it actually runs "LEFT" and swapping its start and end coordinates. This reversal accounts for most position errors.
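This failure mode is easy to detect mechanically: a reversed answer is a wrong answer that becomes right when its endpoints are swapped. A sketch, assuming answers are reported as zero-indexed (row, col) start/end pairs (my own illustration, not the study's code):

```python
def classify_position_error(true_start, true_end, got_start, got_end):
    """Label a reported word position as exact, reversed, or other."""
    if (got_start, got_end) == (true_start, true_end):
        return "exact"
    # Direction reversal: the model reported the word back-to-front.
    if (got_start, got_end) == (true_end, true_start):
        return "reversed"
    return "other"
```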

Grid size dramatically impacts position accuracy

Position accuracy drops from 42% (5x5) to 34% (8x8) to 23% (10x10). Claude Opus degrades from 80% to 46% on larger grids.

OpenAI models find words but give wrong coordinates 80% of the time

GPT-4o achieves 99% word-finding but only 21% position accuracy. Models locate patterns but struggle with spatial coordinate reporting.

Claude Opus/Sonnet have 3x better spatial reasoning

Position accuracy: Claude Opus 67%, Claude Sonnet 64% vs GPT-4o 21%, GPT-4 Turbo 18%. A fundamental capability difference.

GPT models hit 0% accuracy at 20x20 grids

Scaling experiment: GPT-4o, GPT-4 Turbo, and GPT-4o Mini all drop to 0% position accuracy on 20x20 grids while still finding 100% of words. Claude Sonnet maintains 40%.

What This Means

The pattern here is striking: models can achieve near-perfect word finding while simultaneously failing at position reporting. GPT-4o finds 99% of words but only reports correct coordinates 21% of the time. This isn't a small gap - it's a fundamental disconnect between two capabilities we might naively assume go hand in hand.

Why does this happen? LLMs are trained on likelihood, not spatial truth. When a model scans a grid and recognizes "PYTHON" running diagonally, it's doing pattern matching, something these models excel at. But when asked to report that the word starts at row 3, column 5 and ends at row 8, column 10, it's doing something fundamentally different. It's not predicting the next likely token - it's supposed to be doing precise coordinate math, and that's where things fall apart.

The scaling results make this even clearer. As grids get larger, the coordinate space expands and position accuracy craters even while word-finding stays strong. Claude models degrade more gracefully than GPT models, but everyone struggles. At 20x20, most models are essentially guessing coordinates while still finding every word.

The implication? Be careful when asking LLMs to work with positional data. They might confidently tell you they found what you're looking for, but their sense of where is unreliable. For applications that require coordinate accuracy - parsing structured documents, navigating spatial data, referencing specific locations - verification isn't optional. The model might be right about what exists, but wrong about where it is.
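For grid-like data, that verification can be purely mechanical: walk the reported span through the grid and check it actually spells the word. A minimal sketch, assuming the grid is a list of equal-length strings and coordinates are zero-indexed, inclusive (row, col) pairs:

```python
def coords_spell_word(grid, word, start, end):
    """Check that the straight line from start to end spells `word`."""
    (r0, c0), (r1, c1) = start, end
    steps = max(abs(r1 - r0), abs(c1 - c0))
    if steps != len(word) - 1:
        return False  # span length doesn't match the word
    dr = (r1 - r0) // steps if steps else 0
    dc = (c1 - c0) // steps if steps else 0
    # Reject non-straight spans (must be horizontal, vertical, or diagonal).
    if (r0 + dr * steps, c0 + dc * steps) != (r1, c1):
        return False
    for i, ch in enumerate(word):
        r, c = r0 + i * dr, c0 + i * dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            return False
        if grid[r][c] != ch:
            return False
    return True
```

A check like this catches both reversed spans (when you also compare against the swapped endpoints) and outright wrong coordinates, without needing the answer key.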


Methodology

Experimental Setup

Two-phase experiment testing 6 models (3 Anthropic, 3 OpenAI) on word search puzzles. Phase 1: Baseline evaluation on standard grids. Phase 2: Scaling study on larger grids up to 20x20.

Phase 1: Baseline

  • Grid sizes: 5x5, 8x8, 10x10
  • Directions: H-only, H+V, All-8
  • Word counts: 3, 5 words
  • Evaluations: 420 (35 puzzles x 6 models x 2 prompts)

Phase 2: Scaling

  • Grid sizes: 12x12, 15x15, 20x20
  • Word counts: 5, 8, 10 words
  • Evaluations: 72 (12 puzzles x 6 models)
  • Total: 492 evaluations
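For readers who want to reproduce a setup like this, a word-search generator is straightforward. This sketch is my own, with direction sets named after the baseline conditions above; it places words (allowing overlaps where letters agree) and fills the remaining cells with random letters:

```python
import random

# Direction sets matching the baseline conditions.
DIRECTIONS = {
    "H-only": [(0, 1)],
    "H+V": [(0, 1), (1, 0)],
    "All-8": [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)],
}

def place_words(words, size, mode="All-8", max_tries=1000):
    """Build a size x size puzzle; returns (grid, answer_key)."""
    grid = [[None] * size for _ in range(size)]
    key = {}
    for word in words:
        for _ in range(max_tries):
            dr, dc = random.choice(DIRECTIONS[mode])
            r, c = random.randrange(size), random.randrange(size)
            end_r, end_c = r + dr * (len(word) - 1), c + dc * (len(word) - 1)
            if not (0 <= end_r < size and 0 <= end_c < size):
                continue  # word would run off the grid
            cells = [(r + i * dr, c + i * dc) for i in range(len(word))]
            # Allow crossings only where the letters already agree.
            if any(grid[x][y] not in (None, ch)
                   for (x, y), ch in zip(cells, word)):
                continue
            for (x, y), ch in zip(cells, word):
                grid[x][y] = ch
            key[word] = (cells[0], cells[-1])
            break
        else:
            raise ValueError(f"could not place {word!r}")
    # Fill empty cells with random uppercase letters.
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    rows = ["".join(ch or random.choice(alphabet) for ch in row)
            for row in grid]
    return rows, key
```

The answer key returned here is exactly what the position-accuracy metric compares against: the inclusive (row, col) start and end of each placed word.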
