Lookalike Benchmark
An independent benchmark of company lookalike / similar-companies APIs — Exa, Ocean.io, OpenFunnel, Parallel, PredictLeads — ranked on how relevant the companies each one returns actually are, across 24 B2B seed companies.
Each vendor returns up to 100 lookalikes per seed; an LLM judge scores every returned company for relevance (judge: gpt-5.4-mini). The cell value is Precision@100, with Precision@10 and Precision@50 for top-of-list quality.
There is no single permanent winner: top-of-list quality, long-list quality, cost, and agent-readiness favor different vendors, so read the P@10/P@50/P@100 columns against the workflow you care about.
Lookalike Precision@10/50/100
Scan the company type, then compare vendors. Each cell highlights Precision@100 and shows Precision@10/50 for how clean the top of the ranked list is.
Can an AI agent actually use this vendor?
Same agent-readiness lens as the technographics benchmark. Vendors that let an autonomous agent obtain a working key on its own (OTP-via-email or device-code) work end-to-end without human handoff.
| Vendor | Agent sign-up | API docs | llms.txt | MCP | Try it |
|---|---|---|---|---|---|
| OpenFunnel | ✓ readyotp-email | docs ↗ | llms.txt ↗ | mcp ↗ | sign up → |
| Ocean.io | manual signup | docs ↗ | llms.txt ↗ | mcp ↗ | — |
| Exa | ✓ readyotp-email | docs ↗ | llms.txt ↗ | mcp ↗ | sign up → |
| Parallel | ✓ readyotp-email | docs ↗ | llms.txt ↗ | mcp ↗ | sign up → |
| PredictLeads | manual signup | docs ↗ | — | mcp ↗ | — |
[02] methodology, metric definitions, and known limitations+
How the matrix is built
- Fix a canonical list of 24 seed companies across 13 verticals. Each seed has a name, domain, and short description - the exact inputs every vendor sees.
- For every (seed, vendor) cell, call the vendor's lookalike API with
K = 100. Capture the ordered top-K result list and credit cost. - Feed the seed + each returned candidate to the LLM judge (
gpt-5.4-mini). Judge returns a binary relevance label per candidate plus a one-line rationale. Identical prompt and rubric across all vendors. - Persist
Precision@10,Precision@50, andPrecision@100. Aggregate per vendor asavg_precision_at_10,avg_precision_at_50, andavg_precision_at_100. - A vendor that returns fewer than K candidates for a seed has the cell rendered as
-rather than scored on a truncated denominator. Keeps cells comparable.
What each metric means
Precision@10/50/100· fixed-cutoff precision. Of the top N lookalikes a vendor returned for the seed, the fraction the LLM judge labeled relevant.avg Precision@100· headline comparison metric. Mean Precision@100 across all judged seeds. Higher is better.total relevant· sum of relevant lookalikes across all seeds. Reach metric - useful when comparing two vendors with similar precision.cost per relevant· vendor credit spend ÷ total relevant lookalikes. The economics metric.
Why an LLM judge instead of a hand-labeled set
A fully hand-labeled lookalike set would require labeling K × seeds × vendors candidates (100 × 24 × 5 = 12000judgements) every time we re-run a snapshot. That doesn't scale, and it isn't how the buyer actually evaluates a vendor in the wild - the buyer reads the list and decides "close enough to my ICP, yes or no".
The judge approximates that decision with a consistent rubric: given the seed's name, domain, and description, is this returned candidate plausibly the same kind of company a B2B seller would target as a lookalike? The judge's rationale is persisted alongside the binary label so any cell can be audited by a human in seconds. When the model swaps, the cohort re-runs with the same prompt; deltas are visible.
How each vendor was queried
- OpenFunnel · embeddings over the OpenFunnel company index with the seed input as the query. Top-K by embedding similarity.
- Ocean.io ·
/companies/lookalikeswith seed domain. Default similarity model, K = 100. - Exa ·
/searchwithcategory: companyand query text likecompanies like HubSpot, K = 100. Uses Exa's company vertical and structured company metadata where present. - Parallel· agentic research task: "find 100 companies similar to {seed}". The agent decides its own retrieval strategy. We record the final ranked list.
- PredictLeads ·
/api/v3/companies/{domain}/similar_companies; ranks via shared tech, news, and jobs co-signals.
What this benchmark does not tell you
- Judge bias.A single LLM judge has its own priors about what "similar" means. We publish the judge model and the full rationale so the bias is auditable, but expect ±5% drift across model versions.
- K-tail vs precision tradeoff. Vendors with thin catalogs can win Precision@10 by refusing to return tail results. We mitigate by requiring ≥K results for a cell to be scored - thin cells render as
-, not a high Precision number with a small denominator. - No recall metric.Precision@10/50/100 doesn't measure how many real lookalikes the vendor missed. That requires a held-out ground truth set we don't yet have.
- Domain-only seeding. All vendors receive the same compact input (name + domain + 1-line description). Vendors that benefit from richer inputs (e.g. headcount filters, ARR band, geography) may underperform their in-product behavior. The flip side is that this matches how an agent would query them.
- Cohort coverage. 24 seeds across software, fintech, commerce, industrials, logistics, hospitality, energy, healthcare, real estate, and services.
Verify any number end-to-end
The full benchmark — runner code, judge prompt, benchmark snapshot, and per-cell raw audit trail (the literal HTTP request/response we sent each vendor + the literal LLM judge prompt/response per candidate) — is mirrored in a public repo: openbenchmarks-labs/lookalikes. Auth headers are scrubbed via an allow-list; everything else is verbatim.
To audit a single cell, open data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json in that repo and replay any of the vendor_calls[] with your own credentials, or re-score with your own LLM by replaying judge_calls[].messages against any OpenAI-v1 compatible model — useful for measuring judge bias or drift across model versions.
Inclusion queue and how to request a provider
Live: OpenFunnel, Ocean.io, Exa, Parallel, PredictLeads.
Requested but not directly comparable: ZoomInfo (company lookalikes are sales-gated, no self-serve API), Clay (lookalike runs inside Clay tables), Apollo (no public lookalike endpoint), Lusha (`/v3/companies/lookalike` requires 5-100 seeds per request, incompatible with the per-seed cell unit of this benchmark).
Under review next: Common Room, Koala, LeadGenius, 6sense, Demandbase.
To request a provider, email founders@openbenchmarks.com with a link to the public API docs and pricing page.
Running an open benchmark means spending real API credits on every vendor call. We're grateful to PredictLeads for providing credits to support fair, reproducible, open benchmarking. Credits cover the cost of calling an API — they do not influence scores or ranking, which are decided by the LLM judge on identical inputs. Any vendor can provide credits on the same terms; more credits let us test deeper and at greater scale — founders@openbenchmarks.com.








