Word Search
LLM Research

Investigating spatial reasoning capabilities through word search puzzles. Testing how language models locate, track, and report grid coordinates.

Published January 17, 2026

623 Evaluations · 47 Puzzles · 7 Models

LLMs are remarkably good at finding patterns. Give them a wall of text and they'll pull out names, dates, themes, contradictions. But here's a question I've been chewing on: can they tell you where they found something? Not just what they found, but the precise coordinates in a grid, the exact position in a structure?

Word search puzzles turn out to be a clean way to test this. The task is simple enough that any model can understand it: find the hidden words in a grid of letters, then report their start and end coordinates. The interesting part isn't whether models can find the words - spoiler, they're great at that - it's whether they can accurately report where those words are.

What we found surprised us. There's a fundamental gap between finding and locating, and it reveals something important about how these models process positional information.

Metrics Definitions

Word Accuracy

The percentage of hidden words the model successfully identifies in the puzzle. A model that finds 4 out of 5 words has 80% word accuracy. This measures pattern recognition ability.

Position Accuracy

Of the words found, the percentage where the model reports the correct start and end coordinates. A model may find a word but give wrong grid positions. This measures spatial reasoning ability.
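To make the two metrics concrete, here is a minimal scorer sketch. The `(start, end)` answer format and the `score_submission` helper are my own illustration, not the study's actual harness:

```python
def score_submission(answer_key, submission):
    """Compute (word accuracy, position accuracy) for one puzzle.

    answer_key / submission: dicts mapping word -> (start, end),
    where start and end are (row, col) tuples.
    """
    found = [w for w in submission if w in answer_key]
    word_acc = len(found) / len(answer_key) if answer_key else 0.0

    # Position accuracy is conditional: only found words count toward it.
    correct_pos = [w for w in found if submission[w] == answer_key[w]]
    pos_acc = len(correct_pos) / len(found) if found else 0.0
    return word_acc, pos_acc
```

A model that finds both hidden words but places only one of them correctly scores 100% word accuracy and 50% position accuracy.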

Model Performance
Model              Evaluations  Word Accuracy  Position Accuracy  Avg Latency  Total Cost
GPT-4o                      94          99.0%              17.9%         2.6s       $0.30
GPT-4 Turbo                 94          98.0%              15.4%         4.9s       $0.99
Claude 3.5 Haiku            94          95.8%               9.3%         3.8s       $0.18
Claude Opus 4               94          95.7%              58.8%         8.2s       $3.58
Claude Sonnet 4             94          95.0%              60.6%        10.9s       $1.10
GPT-4o Mini                 94          95.0%              11.5%         4.6s       $0.02
Gemini 2.0 Flash            59          94.4%              33.2%         2.0s       $0.01
Performance Charts

[Chart: Word Finding Accuracy by Model]
[Chart: Position Accuracy by Model]

Condition Breakdown

[Chart: Position Accuracy by Grid Size]
[Chart: Position Accuracy Scaling (5x5 to 20x20)]
Scaling Experiment Results

84 evaluations on larger grids (12x12, 15x15, 20x20) with 5-10 words per puzzle.

Model              12x12  15x15  20x20  Word Acc
Claude Sonnet 4    58.1%  46.7%  40.0%     100%
Claude Opus 4      51.9%  19.2%  50.0%     100%
Gemini 2.0 Flash   24.2%  21.4%  10.0%      95%
Claude 3.5 Haiku    8.1%   1.7%   0.0%     100%
GPT-4o             13.1%   8.3%   0.0%    97.9%
GPT-4 Turbo        11.2%   6.7%   0.0%     100%
GPT-4o Mini         5.0%   0.0%   0.0%    99.2%

Position accuracy shown. Gemini 2.0 Flash maintains 10% at 20x20 while GPT models drop to 0%.

Key Findings
Direction reversal is the primary error pattern

Models often find words but report them backwards, claiming a word runs "RIGHT" when it actually runs "LEFT" and swapping its start and end coordinates. This reversal accounts for most position errors.
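This failure mode is easy to detect mechanically: a reversed answer is a wrong answer that becomes right when its endpoints are swapped. A sketch, assuming answers are reported as zero-indexed (row, col) start/end pairs (my own illustration, not the study's code):

```python
def classify_position_error(true_start, true_end, got_start, got_end):
    """Label a reported word position as exact, reversed, or other."""
    if (got_start, got_end) == (true_start, true_end):
        return "exact"
    # Direction reversal: the model reported the word back-to-front.
    if (got_start, got_end) == (true_end, true_start):
        return "reversed"
    return "other"
```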

Grid size dramatically impacts position accuracy

Position accuracy drops from 42% (5x5) to 34% (8x8) to 23% (10x10). Claude Opus degrades from 80% to 46% on larger grids.

OpenAI models find words but give wrong coordinates 80% of the time

GPT-4o achieves 99% word-finding but only 21% position accuracy. Models locate patterns but struggle with spatial coordinate reporting.

Claude Opus/Sonnet have 3x better spatial reasoning

Position accuracy: Claude Opus 67%, Claude Sonnet 64% vs GPT-4o 21%, GPT-4 Turbo 18%. A fundamental capability difference.

GPT models hit 0% accuracy at 20x20 grids

Scaling experiment: GPT-4o, GPT-4 Turbo, and GPT-4o Mini all drop to 0% position accuracy on 20x20 grids while still finding 100% of words. Claude Sonnet maintains 40%.

What This Means

The pattern here is striking: models can achieve near-perfect word finding while simultaneously failing at position reporting. GPT-4o finds 99% of words but only reports correct coordinates 21% of the time. This isn't a small gap - it's a fundamental disconnect between two capabilities we might naively assume go hand in hand.

Why does this happen? LLMs are trained on likelihood, not spatial truth. When a model scans a grid and recognizes "PYTHON" running diagonally, it's doing pattern matching, something these models excel at. But when asked to report that the word starts at row 3, column 5 and ends at row 8, column 10, it's doing something fundamentally different. It's not predicting the next likely token - it's supposed to be doing precise coordinate math, and that's where things fall apart.

The scaling results make this even clearer. As grids get larger, the coordinate space expands and position accuracy craters even while word-finding stays strong. Claude models degrade more gracefully than GPT models, but everyone struggles. At 20x20, most models are essentially guessing coordinates while still finding every word.

The implication? Be careful when asking LLMs to work with positional data. They might confidently tell you they found what you're looking for, but their sense of where is unreliable. For applications that require coordinate accuracy - parsing structured documents, navigating spatial data, referencing specific locations - verification isn't optional. The model might be right about what exists, but wrong about where it is.
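For grid-like data, that verification can be purely mechanical: walk the reported span through the grid and check it actually spells the word. A minimal sketch, assuming the grid is a list of equal-length strings and coordinates are zero-indexed, inclusive (row, col) pairs:

```python
def coords_spell_word(grid, word, start, end):
    """Check that the straight line from start to end spells `word`."""
    (r0, c0), (r1, c1) = start, end
    steps = max(abs(r1 - r0), abs(c1 - c0))
    if steps != len(word) - 1:
        return False  # span length doesn't match the word
    dr = (r1 - r0) // steps if steps else 0
    dc = (c1 - c0) // steps if steps else 0
    # Reject non-straight spans (must be horizontal, vertical, or diagonal).
    if (r0 + dr * steps, c0 + dc * steps) != (r1, c1):
        return False
    for i, ch in enumerate(word):
        r, c = r0 + i * dr, c0 + i * dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            return False
        if grid[r][c] != ch:
            return False
    return True
```

A check like this catches both reversed spans (when you also compare against the swapped endpoints) and outright wrong coordinates, without needing the answer key.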


Methodology

Experimental Setup

Two-phase experiment testing 6 models (3 Anthropic, 3 OpenAI) on word search puzzles. Phase 1: Baseline evaluation on standard grids. Phase 2: Scaling study on larger grids up to 20x20.

Phase 1: Baseline

  • Grid sizes: 5x5, 8x8, 10x10
  • Directions: H-only, H+V, All-8
  • Word counts: 3, 5 words
  • Evaluations: 420 (35 puzzles x 6 models x 2 prompts)

Phase 2: Scaling

  • Grid sizes: 12x12, 15x15, 20x20
  • Word counts: 5, 8, 10 words
  • Evaluations: 72 (12 puzzles x 6 models)
  • Total: 492 evaluations
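For readers who want to reproduce a setup like this, a word-search generator is straightforward. This sketch is my own, with direction sets named after the baseline conditions above; it places words (allowing overlaps where letters agree) and fills the remaining cells with random letters:

```python
import random

# Direction sets matching the baseline conditions.
DIRECTIONS = {
    "H-only": [(0, 1)],
    "H+V": [(0, 1), (1, 0)],
    "All-8": [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)],
}

def place_words(words, size, mode="All-8", max_tries=1000):
    """Build a size x size puzzle; returns (grid, answer_key)."""
    grid = [[None] * size for _ in range(size)]
    key = {}
    for word in words:
        for _ in range(max_tries):
            dr, dc = random.choice(DIRECTIONS[mode])
            r, c = random.randrange(size), random.randrange(size)
            end_r, end_c = r + dr * (len(word) - 1), c + dc * (len(word) - 1)
            if not (0 <= end_r < size and 0 <= end_c < size):
                continue  # word would run off the grid
            cells = [(r + i * dr, c + i * dc) for i in range(len(word))]
            # Allow crossings only where the letters already agree.
            if any(grid[x][y] not in (None, ch)
                   for (x, y), ch in zip(cells, word)):
                continue
            for (x, y), ch in zip(cells, word):
                grid[x][y] = ch
            key[word] = (cells[0], cells[-1])
            break
        else:
            raise ValueError(f"could not place {word!r}")
    # Fill empty cells with random uppercase letters.
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    rows = ["".join(ch or random.choice(alphabet) for ch in row)
            for row in grid]
    return rows, key
```

The answer key returned here is exactly what the position-accuracy metric compares against: the inclusive (row, col) start and end of each placed word.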
