labs · benchmark

Which cheap model reads Reddit as well as Opus 4.8?

An open benchmark that pits low-cost models against Claude Opus 4.8 on one job — reading a Reddit thread and pulling out the buyer pains, competitor mentions, and leads a product can act on.

dist0 reads Reddit threads and pulls out three things a marketer can act on: a buyer pain, a competitor mention, or a lead. That reading job runs on every post we track, so the model behind it sets both the quality of what you get and what the product costs to run.

Opus 4.8 does this well. The question this benchmark answers is whether a cheaper model does it just as well — without dropping the high-intent pains and leads that matter most. So we froze the exact job, ran a field of low-cost models against it, and scored every one against an Opus 4.8 answer key that a human checked by hand.

Nothing here is a moat. The prompt, the dataset, the scoring, and the results are all on this page.

The job we test

We benchmark the production reading step exactly as it ships — same prompt, same tool, same inputs. No simplified version. The model reads one post (the original text plus its comments) together with what the project knows about itself, and records each finding through a single emit_signal action with a kind, a verbatim quote, and the author it belongs to.
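As a rough picture of what one finding looks like, here is a minimal sketch of the payload an emit_signal call carries. The three fields come from the description above; the kind labels ("pain", "competitor", "lead"), the class name, and the validation rules are assumptions for illustration, not the production schema.

```python
from dataclasses import dataclass

# Hypothetical kind labels: the page names the three categories,
# but not the exact identifiers the production tool uses.
ALLOWED_KINDS = ("pain", "competitor", "lead")


@dataclass(frozen=True)
class Signal:
    kind: str    # one of ALLOWED_KINDS
    quote: str   # verbatim text copied from the post or a comment
    author: str  # the Reddit user the quote belongs to


def parse_emit_signal(args: dict) -> Signal:
    """Validate the arguments of one emit_signal call and build a Signal."""
    kind = args.get("kind")
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    quote = args.get("quote") or ""
    author = args.get("author") or ""
    if not quote.strip() or not author.strip():
        raise ValueError("quote and author are both required")
    return Signal(kind=kind, quote=quote, author=author)
```

A call with a kind outside the three categories, or with an empty quote or author, is rejected here instead of being stored as a finding.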

The rules it follows are the interesting part: a pain has to be a concrete problem the person has themselves, not advice they're giving or a feeling they're venting; a lead has to be someone who fits the customer profile and has a current pain the product can address. Here is the full prompt.

analyze-post prompt

<role>
You are analyzing one Reddit post to find buyer pains, competitor
mentions, and leads relevant to the project. Emit signals with the

The models

Every model runs through OpenRouter at medium reasoning effort. Spotting a buyer pain is an extraction job, not a hard reasoning problem, so the test is whether cheaper models keep up with Opus 4.8 at that same effort setting.

Baseline: Claude Opus 4.8 — the answer key.
Candidates: Gemini 3.5 Flash, Qwen3.7 Max, Grok 4.3, GLM 5.1, MiMo v2.5 Pro, and DeepSeek V4 Pro.

A model that can't call the emit_signal tool over OpenRouter at all is marked tool-incompatible and left out of scoring — failing to call the tool is not the same as reading a thread and finding nothing.
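The distinction between "could not call the tool" and "found nothing" can be sketched as follows. The representation is an assumption: each post's outcome is either None (the request failed at the tool-calling layer) or a list of signals, which may be empty. The function name and the "every post failed" rule are hypothetical.

```python
from typing import List, Optional


def classify_model(per_post: List[Optional[list]]) -> str:
    """Label a model's run as 'tool-incompatible' or 'scored'.

    per_post holds one entry per post:
      None -> the tool call itself failed (assumed representation)
      []   -> the model read the thread and emitted no signals
      [..] -> the model emitted one or more signals
    """
    if per_post and all(result is None for result in per_post):
        return "tool-incompatible"
    return "scored"
```

Under this sketch a model that returns an empty list for every post is still scored, and simply earns zero valid signals.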

The dataset

The input is a frozen snapshot of real Reddit posts from dist0.com's own tracked subreddits — 137 posts from a single digest run, with their comments. The exact post list and the project memory are pinned and committed alongside the results, so every model reads the identical set and the run can be reproduced. The set is small enough to judge every signal by hand.
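One way to confirm that every model read the identical set is to fingerprint the pinned post list and compare it with the posts a run actually saw. This is a sketch under assumptions: the function names are hypothetical, and post IDs are treated as plain strings.

```python
import hashlib
from typing import List


def fingerprint(post_ids: List[str]) -> str:
    """Order-independent SHA-256 over the de-duplicated post IDs."""
    canonical = "\n".join(sorted(set(post_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_inputs(pinned_ids: List[str], seen_ids: List[str]) -> bool:
    """True when a run covered exactly the pinned set of posts."""
    return fingerprint(pinned_ids) == fingerprint(seen_ids)
```

A run that skipped or added a post produces a different fingerprint, so it can be flagged before any scoring happens.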

How we score

Opus 4.8 runs first. A human then reviews every baseline signal — checking the quote is really in the source, the attribution is right, the kind is right, and the signal is genuinely high-intent — and that curated set becomes the answer key. Every candidate is scored against it.
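The first two review checks, that the quote is really in the source and that it is attributed to the right person, are mechanical enough to sketch. This is an illustrative pre-check, not the human review itself; whitespace normalization and case-sensitive matching are assumptions, and the thread is modeled as a list of (author, text) pairs.

```python
import re
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks don't defeat a match."""
    return re.sub(r"\s+", " ", text).strip()


def check_signal(quote: str, author: str,
                 parts: List[Tuple[str, str]]) -> Dict[str, bool]:
    """Check a quote against the post and its comments.

    parts is a list of (author, text) pairs: the original post first,
    then each comment (an assumed representation).
    """
    needle = normalize(quote)
    in_source = bool(needle) and any(
        needle in normalize(text) for _, text in parts
    )
    by_author = bool(needle) and any(
        who == author and needle in normalize(text) for who, text in parts
    )
    return {"quote_in_source": in_source, "attribution_ok": by_author}
```

Whether the kind is right and the signal is genuinely high-intent still needs a person's judgment.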

A second Opus 4.8 pass acts as the judge. For each post it sees the baseline signals and one candidate's signals, and sorts the candidate's output into:

Valid — matches a baseline signal: right kind, the quote really appears in the source, correct attribution.
Invalid — breaks a rule: a quote that isn't in the post, advice or venting emitted as a pain, or a pain pinned on the wrong person.
Misaligned — plausible but off-target, or a restatement of a pain already counted.

Any baseline signal a candidate never found is a miss — the error that matters most, because a missed lead is a customer you never saw.
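The sorting above can be tallied with a few lines. This is a sketch, assuming the judge returns one record per candidate signal with a verdict and, for valid signals, the ID of the baseline signal it matched; the field names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def tally(baseline_ids: Set[str], judged: List[Dict]) -> Dict:
    """Count verdicts for one candidate and list the baseline signals it missed."""
    counts = Counter(record["verdict"] for record in judged)
    matched = {
        record["baseline_id"]
        for record in judged
        if record["verdict"] == "valid" and record.get("baseline_id") is not None
    }
    return {
        "valid": counts["valid"],
        "invalid": counts["invalid"],
        "misaligned": counts["misaligned"],
        # A miss is any baseline signal no valid candidate signal matched.
        "misses": sorted(baseline_ids - matched),
    }
```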

Models are ranked by total valid signals. Recall (with leads broken out), precision, and cost are reported alongside but don't set the ranking. Whether a valid signal is trivial is too subjective to score, so a human spot-checks the misses and the top models' signals before we call a winner.
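A minimal sketch of the reported numbers and the ranking rule. The formulas are assumptions, since the page does not spell them out: precision is taken as valid signals over everything the candidate emitted, and recall as valid signals over the size of the answer key, which presumes each valid signal matches a distinct baseline signal. The model names in the usage note are placeholders, not results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Scorecard:
    name: str
    valid: int
    invalid: int
    misaligned: int
    leads_found: int   # baseline leads the candidate matched
    cost_usd: float


def metrics(card: Scorecard, baseline_total: int, baseline_leads: int) -> Dict[str, float]:
    """Precision, recall, and lead recall under the assumed definitions."""
    emitted = card.valid + card.invalid + card.misaligned
    return {
        "precision": card.valid / emitted if emitted else 0.0,
        "recall": card.valid / baseline_total if baseline_total else 0.0,
        "lead_recall": card.leads_found / baseline_leads if baseline_leads else 0.0,
    }


def rank(cards: List[Scorecard]) -> List[Scorecard]:
    """Rank by total valid signals only; other metrics are reported, not ranked on."""
    return sorted(cards, key=lambda c: c.valid, reverse=True)
```

For example, rank([Scorecard("model-a", 8, 1, 1, 2, 0.40), Scorecard("model-b", 9, 4, 2, 3, 0.90)]) puts model-b first on valid count, even though model-a has the higher precision and the lower cost.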

The leaderboard

Benchmark runs are pending. The leaderboard fills in once the Claude Opus 4.8 baseline and the candidate models have been scored over the 137 pinned posts.

The model we ship is the cheapest one whose valid-signal count and lead recall hold up under the manual review — not simply the top of the table.
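The shipping rule can be sketched as a filter followed by a cost comparison. What "hold up" means is decided by the manual review, so the numeric floors and the passed_review flag below are hypothetical stand-ins for that judgment, not published thresholds.

```python
from typing import Dict, List, Optional


def pick_model(rows: List[Dict], min_valid: int,
               min_lead_recall: float) -> Optional[str]:
    """Return the cheapest model that clears both floors and the manual review.

    Each row is a dict with keys: name, valid, lead_recall, cost_usd,
    passed_review (all assumed field names).
    """
    eligible = [
        row for row in rows
        if row["passed_review"]
        and row["valid"] >= min_valid
        and row["lead_recall"] >= min_lead_recall
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda row: row["cost_usd"])["name"]
```

Under this sketch the top-ranked model loses to a cheaper one whenever the cheaper one also clears the floors and the review.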

Caveats

OpenRouter's Anthropic-compatible layer doesn't guarantee tool-calling for every model; any model that trips the tool-compatibility guard is marked tool-incompatible and left out of scoring, as described under The models.
The judge is Opus 4.8 scoring against its own baseline. The human review pass is the backstop — disagreements are settled by a person, not the judge.