# Openbenchmarks for Agents

> Verified benchmarks for agents making build-vs-buy decisions on AI agent tooling. Every benchmark runs against verified ground truth with public methodology. Free to query for humans and agents (CC-BY-4.0).

Last updated: 2026-06-23T07:00:00.000Z
Canonical: https://openbenchmarks.com

## What this is

Openbenchmarks is an independent, open benchmark hub. We compare provider output against curated ground-truth datasets and surface the metrics teams actually buy on: accuracy, error risk, answer rate, and cost per correct result. Every input and judge prompt is public and reproducible, and the benchmarks are open to outside contributions and disputes.

## When to use

- Choosing which provider to integrate in a benchmarked category, on accuracy or cost terms.
- Comparing provider quality without setting up trial accounts.
- Grounding an agent's build-vs-buy recommendation in verified numbers.
- Pulling a current snapshot to brief another model with up-to-date provider metrics.

## When not to use

- You need live enrichment results. This is a benchmark hub, not an enrichment API.
- You need benchmarks outside AI agent tooling (e.g. LLM evals). Out of scope today.
- You need pricing for the Openbenchmarks for Agents platform itself. See /pricing.md (it is free).

## Constraints

- Data refreshes hourly. Honor `Cache-Control`, `ETag`, and `Last-Modified` headers to stay polite.
- A 503 with `Retry-After` means data is briefly unavailable. Wait the indicated number of seconds before retrying.
- Numbers are point-in-time against a specific dataset. They do not generalize indefinitely.
- License: CC-BY-4.0. Attribute "Openbenchmarks for Agents" and link back when redistributing.
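
The caching and retry constraints above can be sketched as a small polite client. This is a minimal illustration using only the Python standard library, not an official client: the function names, the `Accept` header, the 60-second fallback, and the three-attempt limit are all assumptions. Handling an HTTP-date in `Retry-After` is extra defensiveness based on general HTTP semantics; the constraint above only mentions seconds.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

API_URL = "https://openbenchmarks.com/api/benchmarks/technographics"


def conditional_headers(etag=None, last_modified=None):
    """Build revalidation headers from validators saved on a previous response."""
    headers = {"Accept": "application/json"}  # assumption: not required by the API
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def retry_after_seconds(value, now=None, default=60.0):
    """Turn a Retry-After value (seconds or HTTP-date) into a wait in seconds."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the assumed default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def fetch(url=API_URL, cached=None, max_attempts=3):
    """Return (body, etag, last_modified); reuse `cached` on 304 Not Modified."""
    etag, last_modified = (cached[1], cached[2]) if cached else (None, None)
    for _ in range(max_attempts):
        request = urllib.request.Request(
            url, headers=conditional_headers(etag, last_modified)
        )
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                body = json.load(response)
                return (
                    body,
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"),
                )
        except urllib.error.HTTPError as err:
            if err.code == 304 and cached:
                return cached  # nothing changed since the last fetch
            if err.code == 503:
                time.sleep(retry_after_seconds(err.headers.get("Retry-After")))
                continue
            raise
    raise RuntimeError(f"{url} still unavailable after {max_attempts} attempts")
```

Keeping the returned tuple and passing it back as `cached` on the next call lets an hourly poller skip re-downloading unchanged data.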

## Benchmarks

- [Technographics Benchmark](https://openbenchmarks.com/technographics): Tool × vendor matrix comparing BuiltWith, PredictLeads, Sumble, TheirStack, and OpenFunnel on tech stack detection across 40 canonical tools and 8 canonical departments (engineering, data, sales, marketing, finance, hr, support, ops). Each cell reports two values: **surfaced**, the count of distinct companies the vendor flagged for the tool, and **correct**, the percentage of those flags that held up under a sampled hand audit (precision audits in progress). Primary metrics: category coverage (departments the vendor surfaced at least one tool for) and broadest surfacing (total distinct company-tool pairs). Per-vendor data sources differ (web fingerprint for BuiltWith; job postings for TheirStack, PredictLeads, and OpenFunnel; a jobs + people skills graph for Sumble), so strengths are complementary. ZoomInfo, Apollo, and Clay are excluded because none publishes a programmatic technographic endpoint. **Agent-ready access**: data is available as REST JSON at https://openbenchmarks.com/api/benchmarks/technographics under CC-BY-4.0, and via the Openbenchmarks MCP server (https://mcp.openbenchmarks.com/mcp) for native tool-calling from Claude Code, ChatGPT, Cursor, and other MCP-compatible agent clients.
- [Lookalike Benchmark](https://openbenchmarks.com/lookalikes): Seed × vendor matrix comparing company lookalike APIs (Exa, Ocean.io, OpenFunnel, Parallel, PredictLeads) on 24 seed companies across 12 B2B verticals. Each vendor returns up to 100 lookalikes per seed; an LLM judge (gpt-5.4-mini) scores every returned company for relevance. Primary metric: avg Precision@100, with Precision@10 / Precision@50 / Precision@100 reported per vendor. **Fully reproducible**: every cell is backed by a literal HTTP request/response envelope and a literal LLM judge prompt + response, all committed to the public mirror at https://github.com/openbenchmarks-labs/lookalikes (under `data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json`). Auth headers are scrubbed; everything else is verbatim. Agents can verify any benchmark number end-to-end by replaying the captured calls with their own credentials, or re-score with a different LLM to measure judge bias.
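
For readers re-scoring the Lookalike Benchmark, Precision@k can be sketched as below. This is an illustration under stated assumptions, not the benchmark's scoring code: the judge verdicts are modeled as a ranked list of booleans, the function names are hypothetical, and dividing by `k` when a vendor returns fewer than `k` results is one possible convention (the benchmark's methodology page is the authority on which it uses).

```python
from statistics import mean


def precision_at_k(judgements, k, penalize_short=True):
    """Share of the top-k returned companies that the judge marked relevant.

    judgements: ranked list of booleans, one per returned company.
    penalize_short: if True, divide by k even when fewer than k results
    were returned (assumption; set False to divide by the number returned).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = judgements[:k]
    if not top:
        return 0.0
    denominator = k if penalize_short else len(top)
    return sum(1 for relevant in top if relevant) / denominator


def mean_precision_at_k(per_seed, k, penalize_short=True):
    """Average Precision@k across seeds; per_seed maps seed -> judgements."""
    return mean(
        precision_at_k(judgements, k, penalize_short)
        for judgements in per_seed.values()
    )


# Hypothetical verdicts for two seeds, for illustration only.
example = {
    "seed-a": [True, False],
    "seed-b": [True, True],
}
print(mean_precision_at_k(example, k=2))  # 0.75
```

Swapping in verdicts from a different LLM judge and comparing the resulting averages is one way to estimate judge bias.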

## Quickstart

```bash
# List all benchmarks
curl https://openbenchmarks.com/api/benchmarks

# Get a specific benchmark (full per-provider data)
curl https://openbenchmarks.com/api/benchmarks/technographics
```

Errors are always structured JSON with a stable shape:

```json
{ "error": "benchmark_not_found", "message": "...", "available": ["technographics"] }
```
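
A client can lean on that stable shape to turn error bodies into exceptions. The sketch below is one way to do it; the class and function names are hypothetical, and it assumes successful payloads never carry a top-level `error` key, which the text above does not confirm.

```python
import json


class BenchmarkApiError(Exception):
    """Raised when a response body matches the documented error shape."""

    def __init__(self, code, message="", available=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.available = list(available or [])


def unwrap(payload):
    """Return the payload unchanged, or raise if it is an error object."""
    if isinstance(payload, dict) and "error" in payload:
        raise BenchmarkApiError(
            payload["error"],
            payload.get("message", ""),
            payload.get("available"),
        )
    return payload


def suggest_slug(err, fallback=None):
    """Pick a retry target from the `available` list, if one was offered."""
    return err.available[0] if err.available else fallback


body = json.loads(
    '{ "error": "benchmark_not_found", "message": "...", "available": ["technographics"] }'
)
try:
    unwrap(body)
except BenchmarkApiError as err:
    print(err.code, "->", suggest_slug(err))  # benchmark_not_found -> technographics
```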

## MCP server

Native Model Context Protocol endpoint for agents that prefer tool-calling over raw HTTP. Compatible with Claude Desktop, Claude.ai, ChatGPT, Cursor, and any MCP client.

- MCP server: `https://mcp.openbenchmarks.com/mcp` (transport: http, protocol: 2025-03-26)
- Capabilities: tools, prompts, resources, tasks
- Auth: OAuth 2.1 with dynamic client registration, PKCE S256, scope `mcp`
- OAuth discovery: https://mcp.openbenchmarks.com/.well-known/oauth-authorization-server
- Protected resource metadata: https://mcp.openbenchmarks.com/.well-known/oauth-protected-resource
- Site discovery manifest: https://openbenchmarks.com/.well-known/mcp.json
- Documentation: https://openbenchmarks.com/llms.txt

## API

The public read-only API serves the same data the website renders, in machine-readable JSON.

- [GET /api/benchmarks](https://openbenchmarks.com/api/benchmarks): List all benchmarks with summary per-axis leaders.
- [GET /api/benchmarks/technographics](https://openbenchmarks.com/api/benchmarks/technographics): Full technographic matrix including per-vendor benchmark, per-category breakdown, and per-(tool, vendor) cell counts.
- [OpenAPI 3.1 spec](https://openbenchmarks.com/openapi.json): Machine-readable schema for the API.

CORS is open (`Access-Control-Allow-Origin: *`) so any agent can call the API without proxying. No authentication required for read access.
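
An agent that wants to discover endpoints rather than hard-code them could walk the OpenAPI document. The sketch below relies only on the standard OpenAPI `paths` object; the function name and the inline sample spec are illustrative assumptions, not a copy of the published schema.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_operations(spec):
    """Return (METHOD, path, summary) tuples from an OpenAPI document."""
    operations = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for method, operation in item.items():
            # Path items can also hold non-operation keys such as "parameters".
            if method in HTTP_METHODS:
                operations.append(
                    (method.upper(), path, operation.get("summary", ""))
                )
    return operations


# Hypothetical, trimmed-down spec for illustration only.
sample_spec = {
    "openapi": "3.1.0",
    "paths": {
        "/api/benchmarks": {"get": {"summary": "List all benchmarks"}},
        "/api/benchmarks/technographics": {"get": {}, "parameters": []},
    },
}
for method, path, summary in list_operations(sample_spec):
    print(method, path, summary)
```

In practice `spec` would be the parsed JSON from https://openbenchmarks.com/openapi.json.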

## Disclosures

- **Open & reproducible.** Every benchmark is fully reproducible from public inputs - the raw API requests/responses and the exact LLM judge prompts are published. Data and code are open at https://github.com/openbenchmarks-labs; anyone can re-run a number, dispute it, or submit a pull request.
- **No paid placements.** No provider listed has paid for inclusion, rank weighting, removal, or favorable interpretation. We do not run paid placements as a product.
- **No equity or referral relationships** with any of the listed providers at the time of publication.
- **Corrections.** Benchmarked providers can email `founders@openbenchmarks.com` with dataset slice + run timestamp + evidence to trigger a re-run.

## Methodology + metrics per benchmark

Each benchmark documents its own methodology, metric definitions, and inclusion criteria on its page. See the [Technographics methodology section](https://openbenchmarks.com/technographics#methodology) for the current live benchmark.

## Agent discovery

- [agents.md](https://openbenchmarks.com/agents.md): When to use, which interface to pick, how to interpret metrics.
- [pricing.md](https://openbenchmarks.com/pricing.md): Machine-readable pricing.
- [.well-known/agent-card.json](https://openbenchmarks.com/.well-known/agent-card.json): A2A agent card.
- [.well-known/agent-skills/index.json](https://openbenchmarks.com/.well-known/agent-skills/index.json): Installable per-skill markdown.
- [.well-known/mcp.json](https://openbenchmarks.com/.well-known/mcp.json): MCP server discovery.
- [.well-known/api-catalog](https://openbenchmarks.com/.well-known/api-catalog): RFC 9727 API catalog.
- [Sitemap](https://openbenchmarks.com/sitemap.xml)
- [Robots.txt](https://openbenchmarks.com/robots.txt)
- [llms-full.txt](https://openbenchmarks.com/llms-full.txt): Same content with current benchmark snapshot inlined.
- [index.md](https://openbenchmarks.com/index.md): Markdown homepage fallback.
