Fermi Estimate of Episteme's Costs

note · draft

Part 12 of 12 in Episteme

← Maintaining neutrality while avoiding nihilism

What we're estimating

This is a Fermi estimate of the all-in cost of compiling and serving Episteme's claim graph. The dominant term is LLM inference; everything else (storage, the API, graph serving) is almost certainly negligible by comparison, with one caveat for vector search at scale, addressed below. The build is estimated bottom-up: (number of claims) × (agent runs per claim) × (tokens per run) × (price per token, at a model quality appropriate to the work).

The claim count is uncertain before we even start, for two reasons: it is sensitive to how liberally we qualify something as a claim, and to how aggressively we split versus lump. Splitting vs. lumping sets that policy; here we only note its leverage. Lumping more aggressively shrinks every number below, and a liberal claim qualifier inflates the long tail of trivia faster than anything else. Treat the totals as good to within about one order of magnitude.

The power law of detail

A claim page's cost is not uniform. It follows a power law in the claim's importance, salience, complexity, and contestedness. The lab-leak hypothesis will have a very long page: many subclaims and related claims, but more importantly a core claim that is approached many different ways, with many arguments for and against and a heavy stream of contributions to handle. At the other extreme, an obscure and uncontroversial point made only a couple of times decomposes into nothing, attracts no argument and no contributions, and deserves no more than a handful of tokens of basic due diligence from a sub-frontier model.

So the right unit of analysis is not the average claim but the tier. Cost concentrates at the top by per-claim spend and at the bottom by sheer count; the interesting question is where the product of the two peaks.
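A minimal sketch of that tier-level product follows. The tier names follow the L0-L4 labels used in this note, but every count and per-claim spend in it is a made-up placeholder chosen only to show the shape of the calculation, not an estimate from this document.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    claims: float         # canonical claims in the tier (PLACEHOLDER value)
    usd_per_claim: float  # lifetime inference spend per claim (PLACEHOLDER value)

    @property
    def total_usd(self) -> float:
        # Tier total = count x per-claim spend.
        return self.claims * self.usd_per_claim


def peak_tier(tiers):
    """Return the tier whose count x per-claim spend is largest."""
    return max(tiers, key=lambda t: t.total_usd)


# Hypothetical numbers only: counts fall and per-claim spend rises with tier.
tiers = [
    Tier("L0", 1e8, 0.001),
    Tier("L1", 1e7, 0.05),
    Tier("L2", 1e6, 1.0),
    Tier("L3", 1e5, 5.0),
    Tier("L4", 1e3, 100.0),
]

for t in tiers:
    print(f"{t.name}: {t.claims:.0e} claims x ${t.usd_per_claim} = ${t.total_usd:,.0f}")
print("peak tier:", peak_tier(tiers).name)
```

With these placeholder inputs the product peaks in the middle rather than at either extreme. Swapping in real tier counts and per-claim costs is the whole exercise.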

Claim tiers (standards)

Five tiers, by how much page a claim earns:

How many of each?

Three triangulating methods, which should agree to within an order of magnitude. (1) Top-down from the text corpus: the open web is in the low billions of pages, but distinct propositions are far fewer because of redundancy. (2) Wikipedia as a calibration anchor: ~7M English articles at tens of claims each bounds the encyclopedic core. (3) Bottom-up by area of work, summing tier counts per field. Where the three disagree, that is a signal that we have mis-set the splitting/lumping dial rather than mis-counted.
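The Wikipedia anchor in (2) is simple enough to check directly. In this sketch, reading "tens of claims each" as 10 to 100 claims per article is an assumption made here to get a bracket, not a figure from the text.

```python
ARTICLES = 7_000_000  # ~7M English Wikipedia articles, as stated above

# ASSUMPTION: "tens of claims each" is read as a 10-100 range per article.
CLAIMS_PER_ARTICLE_LOW = 10
CLAIMS_PER_ARTICLE_HIGH = 100

low = ARTICLES * CLAIMS_PER_ARTICLE_LOW
high = ARTICLES * CLAIMS_PER_ARTICLE_HIGH

print(f"encyclopedic core: {low:.1e} to {high:.1e} claims")
```

Under that reading the encyclopedic core lands somewhere between tens and hundreds of millions of claims, which is the order of magnitude the other two methods should be compared against.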

A few hundred million canonical claims is far below the number of claim-instances on the internet: the same claim appears in thousands of articles, and once it is decomposed the work applies everywhere it occurs. Ingestion volume (instances) is a separate, larger number handled under tokens below.

Split by area of work

The tiers redistribute sharply by field. Politics, economics, finance, law, science, and medicine carry most of the L3-L4 mass, because that is where contested, high-salience claims live. History, technology, philosophy, and religion carry deep L2 mass. Mathematics is the outlier: an enormous L0/L1 base (every theorem and lemma is a checkable claim) but almost no L3+, because there are few live empirical disputes. The personal, fiction, and humor categories contribute almost nothing to the graph. A first cut of the expensive (L3+) mass: economics/finance ~25%, politics/law ~25%, science/medicine ~25%, history/philosophy/religion ~15%, everything else ~10%.

Agents per tier

The agents divide into two scaling regimes. Ingestion agents (Extractor, Matcher, and the embedding step) run once per ingested instance, so they scale with the text firehose, not with canonical claims. The rest (Decomposer, Assessor, Claim Steward, Contribution Reviewer, Dispute Arbitrator, Audit Agent) run per canonical claim, and the table gives their average lifetime run counts, amortized over the claim's life.

Tokens per run

The constitution and policies are large, stable system prompts (order 5-15k tokens) shared across every run, so they bill at ~0.1x input under prompt caching. Per-run cost is therefore dominated by fresh context (the claim, its neighbours, retrieved evidence) and by output, which is large here because we keep the full reasoning trace for auditability. Representative profiles (cached / fresh-input / output):

Frontier pricing

Prices per million tokens. The Anthropic figures are current. The OpenAI and Google figures are approximate early-2026 list prices for their frontier and efficient tiers and move frequently; treat them as order-of-magnitude. Cache reads run ~0.1x input; cache writes ~1.25x (5-min) to ~2x (1-hr). I agree with the instinct to model only the major closed labs here: open-weight inference is cheaper per token but, at this date, is not on the Pareto frontier for the judgment-heavy agents that actually matter.

Model quality per agent and tier

Two columns: a lean build for scarce funding, and Doing It Right. Under Doing It Right the default is a frontier model (Opus), with Fable reserved for flagship synthesis and arbitration, and Haiku admitted only for L0's mechanical due diligence. The agent whose quality I would never compromise is the Assessor, because it does the actual epistemic work. I would not personally use a graph whose assessments were written by sub-frontier models when frontier intelligence is available all day and can likely investigate better than an arbitrarily well-scaffolded smaller model, simply by being more intelligent.

Cost roll-up

Worked example: an L3 assessment on Opus with the profile above is 12k cached read (~$0.006) + 10k fresh input (~$0.05) + 5k output (~$0.125), about $0.18 per run; three over the claim's life is ~$0.55 for assessment alone. A flagship L4 claim, built on Fable/Opus across ~10 decompositions, ~10 assessments, ~30 steward passes, ~50 contribution reviews, multi-model arbitration and audit, runs ~$50-300 all-in. Rolling that up:

Two things stand out. First, the flagship claims, the ones that look most expensive, are a rounding error in aggregate, because there are only thousands of them; the cost lives in the broad middle (L1-L2) and in the ingestion firehose. Second, the whole thing is fundable: a frontier-quality compilation of the world's claims is single-digit millions of dollars of inference (about $6M, range $5-10M), not a moonshot. A lean build (cheaper models down the tiers, capped depth) lands closer to $1-2M.
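The per-run arithmetic in the worked L3 example can be reproduced with a short sketch. The per-million-token prices below are back-derived from the dollar figures in that example ($0.006 for 12k cached tokens, $0.05 for 10k fresh input, $0.125 for 5k output); they are not quoted from a price list, and the function name is illustrative.

```python
# Prices in USD per million tokens, IMPLIED by the worked example above
# (assumption: derived from its dollar figures, not from a published list).
PRICES_PER_MTOK = {
    "cache_read": 0.50,  # 12k tokens -> ~$0.006, i.e. ~0.1x the input price
    "input": 5.00,       # 10k tokens -> ~$0.05
    "output": 25.00,     # 5k tokens  -> ~$0.125
}


def run_cost(cached_tokens, fresh_tokens, output_tokens, prices=PRICES_PER_MTOK):
    """USD cost of one agent run, given its token profile."""
    total = (
        cached_tokens * prices["cache_read"]
        + fresh_tokens * prices["input"]
        + output_tokens * prices["output"]
    )
    return total / 1_000_000


per_run = run_cost(cached_tokens=12_000, fresh_tokens=10_000, output_tokens=5_000)
lifetime = 3 * per_run  # three assessments over the claim's life

print(f"per run: ${per_run:.3f}, three runs: ${lifetime:.2f}")
```

Output tokens carry most of the per-run cost in this profile, which is the price of keeping the full reasoning trace.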

Serving the graph

Serving is separate from building, and dominated by the browser extension. For each page a user engages, we run claim identification on the visible text (an Extractor-like pass), match the identified spans to the graph, and traverse to assemble context. Identification and matching are the LLM cost; the precomputed graph read is cheap. The cost per engaged page is ~$0.002-0.02 depending on model and page length. A heavy user at ~100 pages/day costs ~$0.20-2/day, or ~$6-60/month. That is exactly the cost a subscription should offset.
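The heavy-user figures follow from a one-line product. The sketch below assumes a 30-day month, which is what the $6-60 range implies; the function name is illustrative.

```python
def monthly_serving_cost(usd_per_page, pages_per_day=100, days_per_month=30):
    """USD per user per month for extension serving (30-day month assumed)."""
    return usd_per_page * pages_per_day * days_per_month


low = monthly_serving_cost(0.002)  # cheap model, short pages
high = monthly_serving_cost(0.02)  # stronger model, long pages

print(f"heavy user: ${low:.0f}-{high:.0f} per month")
```

A lighter user scales linearly: at 10 pages/day the same per-page range gives a tenth of these figures.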

MCP traversal by agents is the cheaper surface: an agent that already has its claims needs no identification, only matching and traversal, which it can drive itself, at a cost on the order of $0.001 per query. That is a natural metered-API line.

Non-AI costs

I agree they are negligible, with one caveat. Postgres + pgvector, the API layer, and graph serving are, even at billions of rows and millions of users, plausibly thousands to low tens of thousands of dollars a month, two to three orders of magnitude below the AI line. The caveat is vector search at scale: an HNSW index over ~1e8-1e9 embeddings at ~1024 dimensions needs hundreds of gigabytes to several terabytes of RAM-resident index (roughly 0.4-4 TB for uncompressed float32 vectors alone), a real recurring infra cost. It is still secondary to inference, but it is the one non-AI line that could reach six figures a year and deserves its own estimate rather than being waved away.
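A rough size check for the vector-search caveat is sketched below. It assumes uncompressed float32 components (4 bytes each) and counts only the raw vectors; HNSW graph links add overhead on top, and quantization would shrink the total, so treat the result as an order-of-magnitude figure.

```python
def raw_vector_bytes(n_vectors, dims=1024, bytes_per_component=4):
    """Bytes needed for the raw embeddings alone.

    ASSUMPTIONS: float32 components, no quantization, and no HNSW
    graph-link overhead (which comes on top of this figure).
    """
    return n_vectors * dims * bytes_per_component


for n in (10**8, 10**9):
    terabytes = raw_vector_bytes(n) / 1e12
    print(f"{n:.0e} embeddings: ~{terabytes:.2f} TB of raw vectors")
```

The low end of the range fits on a single large-memory machine, while the high end does not, which is why this line deserves its own estimate.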

What to bear, and when

Under scarce funding, spend frontier tokens only where they change the answer: the contested and flagship claims (L3-L4), and the Assessor everywhere it matters. Let L0-L1 run on Haiku/Sonnet, cap decomposition depth, and defer stewardship until a claim shows discourse. That lean build is ~$1-2M and still produces a usable graph.

The case for Doing It Right anyway: any text worth a human's time to read in full is worth well more than the most expensive output tokens on the market. At $50/M output, a frontier model writing a claim page costs cents, far less than the reading time it saves would cost even at minimum wage; and if Episteme becomes epistemic infrastructure that AI systems ground against, each word is read and reused many times. The per-claim economics are not the binding constraint. Ambition and ingestion throughput are.

Revenue and profit centers line up neatly with the cost structure. The browser extension's serving cost is naturally offset by subscription; MCP/API traversal is a metered line saleable to AI labs grounding their models; and the expensive flagship pages double as the most valuable and most cite-able public surface. The things that cost the most to serve are the things users will pay for.

Headline: a frontier-quality build is ~$6M (range $5-10M), dominated by the broad middle tiers and ingestion, not the flagship claims. Serving is a per-user, subscription-offsettable line. Non-AI costs are negligible except for vector search at scale. Lean build: ~$1-2M.
