What a token actually costs
Three ways to buy tokens, and why the self-hosted option is the one worth a calculator.
A token is a chunk of text, usually a short word or part of one. "Inference" is one token. "Inferecon" is three. An average English word runs about 1.3 tokens. A paragraph of this article is around 80. A full novel is 150,000. Language models read and write in these chunks, and every commercial provider bills by the chunk.
Until 2023 or so, that billing line sat in a research budget. In 2026 it sits in cost of goods sold. Any product with an LLM feature attached (a support assistant, a coding tool, a drafting helper, an agent loop) is now buying tokens the way a factory buys kilowatt-hours. The unit cost matters because volume is large (a serious agentic workload burns tens of millions of tokens a day), prices vary by more than 10× across routes, and the cheapest answer is not the same for every workload. This piece walks through the three ways you can buy tokens, what each costs in practice, and why the self-hosted route needs a calculator while the other two mostly don't.
Three ways to buy tokens
There's a subscription route, a hosted-API route, and a self-hosted route. They're priced on different units (access, tokens, and throughput-hours respectively), so comparing them requires picking a workload and backing out an equivalent $ per million tokens for each.
Subscription. Flat monthly fee, capped usage. ChatGPT Plus at $20/month. Claude Max at $100–200/month. Gemini Advanced at $20/month. Priced for one person reading answers on a screen. A heavy individual user who burns, say, 3 million tokens a month through ChatGPT Plus is paying an implied $7 per million tokens. A light user on the same plan pays $50/M or more. At any volume beyond single-seat human use, you start hitting rate limits quickly.
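The implied figures above are just the flat fee divided by usage. A minimal sketch, assuming a $20 plan; the function name and the light user's 400,000 tokens a month are illustrative assumptions, not figures from any provider:

```python
def implied_cost_per_million(monthly_fee_usd: float, tokens_per_month: float) -> float:
    """Implied $ per million tokens for a flat-fee plan at a given monthly usage."""
    if tokens_per_month <= 0:
        raise ValueError("tokens_per_month must be positive")
    return monthly_fee_usd / tokens_per_month * 1_000_000


# Heavy user from the text: 3 million tokens a month on a $20 plan.
heavy = implied_cost_per_million(20, 3_000_000)
# Light user: 400,000 tokens a month is an assumed figure for illustration.
light = implied_cost_per_million(20, 400_000)

print(f"heavy: ${heavy:.2f}/M, light: ${light:.2f}/M")
```

The same fee spread over fewer tokens is what pushes the light user's implied rate up.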
Hosted API. Pay per token, metered. The closed-model providers (OpenAI, Anthropic, Google) publish rate cards in the $0.10 to $15 per million tokens range depending on model tier. As of April 2026: GPT-5 at roughly $2.50/M input, $10/M output; Claude Sonnet 4.6 at $3/M and $15/M; Gemini 2.5 Pro at $1.25/M and $5/M. On the open-weight side, Together AI and Fireworks, two AI inference platforms, serve Llama 3.3 70B at around $0.89 and $0.90/M output respectively. No commitment and no capacity to manage. This is the default for production workloads from the first API call up to several hundred million tokens a month.
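Because input and output are priced separately, the effective rate depends on the workload's token mix. A rough sketch using the rate-card figures quoted above; the 80M input / 20M output monthly workload and the helper names are hypothetical:

```python
# Rate card from the paragraph above: ($/M input, $/M output).
RATES = {
    "GPT-5": (2.50, 10.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 5.00),
}


def monthly_bill(input_tokens, output_tokens, rate_in, rate_out):
    """Metered bill in dollars for one month of traffic."""
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


def blended_rate(input_tokens, output_tokens, rate_in, rate_out):
    """Effective $ per million tokens across input and output combined."""
    bill = monthly_bill(input_tokens, output_tokens, rate_in, rate_out)
    return bill / (input_tokens + output_tokens) * 1_000_000


# Hypothetical workload: 80M input tokens and 20M output tokens a month.
IN_TOKENS, OUT_TOKENS = 80_000_000, 20_000_000

bills = {m: monthly_bill(IN_TOKENS, OUT_TOKENS, *r) for m, r in RATES.items()}
blended = {m: blended_rate(IN_TOKENS, OUT_TOKENS, *r) for m, r in RATES.items()}

for model in RATES:
    print(f"{model:<18} ${bills[model]:>7,.0f}/month  ${blended[model]:.2f}/M blended")
```

An input-heavy workload lands much closer to the input price than the headline output price suggests.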
Self-hosted. You provision GPUs (owned, co-located, or leased on a long-term contract) and serve an open-weight model yourself. No per-token bill. Instead you pay for hardware, power, cooling, and the team that keeps the cluster up. Capital-intensive and operationally demanding. The unit cost depends entirely on your choices, which is why it's the only route that needs a calculator.
How the three routes compare
| Route | Typical unit cost | Upfront cost | Ops burden | Typical user |
|---|---|---|---|---|
| Subscription | $2–50/M (implied) | None | None | An individual, or a team of a few humans |
| Hosted API | $0.10–$15/M by model | None | Minimal | Most products with LLM features |
| Self-hosted | $0.80–$3.00/M on open-weight 70B-class models | $35k and up per new GPU | Real engineering capacity | High-volume products; regulated workloads |
As you move down the table, upfront cost and operational burden rise; unit cost can fall, but doesn't have to. Which route wins depends on two things: how many tokens you actually consume, and whether the workload has constraints that push it off the default.
For a human reading answers, subscription wins. The flat fee caps exposure, the rate limits don't bind, and the alternatives are worse priced for single-seat use. No need to over-engineer it.
For most products with LLM features, the hosted API wins. You get frontier models without capex and can scale up and down with demand, and at any volume below a few hundred million tokens a month the math rarely favours running your own GPUs. The operational simplicity is worth a lot, too.
For a narrow but growing band of workloads, self-hosting wins, or wins by default because nothing else is allowed. Three cases make up most of them. The first: sustained volume high enough to push a dedicated GPU above 50% utilisation, which is where the unit cost starts competing with the API. The second: data residency constraints (EU public sector, healthcare under strict GDPR readings, regulated finance, defence) where shipping tokens to a US inference provider is simply off the table. The third: model choice the hosted side won't support, whether that's a specific fine-tune or an uncommon open-weight model the majors don't bother to serve.
The economics of self-hosting
Self-hosting's unit cost isn't a price you look up. It's an output of your choices: which GPU, how hard you run it, how much you paid for power, how long you amortise the hardware over. You can derive it from a formula, the levelized cost of inference (LCOI), by direct analogy to LCOE in the power sector. Three cost buckets go in:
- Capital, amortised. GPU purchase price divided by useful life. Three years is the industry default. Some operators push to five on older silicon.
- Electricity. Thermal design power (kW), multiplied by hours running, multiplied by utilisation, multiplied by the tariff ($/kWh).
- Cooling and facility overhead. Every watt the GPU draws pulls extra watts with it (HVAC, UPS, lighting, conversion losses). The ratio is captured in PUE, power usage effectiveness. A PUE of 1.4 means 40% on top of IT load for the non-GPU infrastructure.
Divide the annual total by the tokens produced in that year and you have $/M tokens:
    annual_tokens      = tps × 3600 × 8760 × utilisation
    capex_annual       = gpu_price / amortisation_years
    electricity_annual = tdp_kw × 8760 × utilisation × tariff
    cooling_annual     = tdp_kw × (pue - 1) × 8760 × utilisation × tariff
    lcoi               = (capex_annual + electricity_annual + cooling_annual) / annual_tokens × 1,000,000
tps here is server-side blended output throughput in tokens per second per GPU, at production batch sizes: what the hardware actually delivers when pushed with many concurrent requests. It is not the single-user streaming speed you see in a chat window, which runs 5 to 10 times slower.
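The formula translates directly into a few lines of Python. This is a sketch rather than the calculator's actual code: the function and parameter names are my own, `tariff` means $/kWh, and the inputs are the H100 figures used in the worked example later in this piece.

```python
HOURS_PER_YEAR = 8760


def lcoi_per_million(gpu_price, amortisation_years, tdp_kw, pue, tariff, tps, utilisation):
    """Levelized cost of inference in $ per million tokens (three-bucket version)."""
    annual_tokens = tps * 3600 * HOURS_PER_YEAR * utilisation
    capex_annual = gpu_price / amortisation_years
    electricity_annual = tdp_kw * HOURS_PER_YEAR * utilisation * tariff
    cooling_annual = tdp_kw * (pue - 1) * HOURS_PER_YEAR * utilisation * tariff
    total_annual = capex_annual + electricity_annual + cooling_annual
    return total_annual / annual_tokens * 1_000_000


# H100 SXM, three-year amortisation, 0.7 kW, PUE 1.4, $0.08/kWh, 1,000 tps, 40% utilisation.
h100 = lcoi_per_million(
    gpu_price=35_000,
    amortisation_years=3,
    tdp_kw=0.7,
    pue=1.4,
    tariff=0.08,
    tps=1_000,
    utilisation=0.40,
)
print(f"${h100:.2f} per million tokens")
```

Note that `tps` must be the batched server-side figure; plugging in a single-stream chat speed would overstate the cost several times over.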
The formula leaves out networking, storage, cluster-ops labour, software licensing, maintenance, residual hardware value, and the time value of money. Each is real, and each is on the assumptions page. For now, treat LCOI as a lower bound on a fully-loaded self-hosted cost. v0.2 will expand on that, but I expect the correct number to land 15–30% above the v0.1 figures.
A worked example
Take an H100 SXM at $35,000 with three-year amortisation, serving Llama 3 70B at 1,000 tps. Assume a 0.7 kW TDP, 40% utilisation, a PUE of 1.4, and power at $0.08/kWh.
- Capex: $35,000 / 3 = $11,667/year.
- Electricity: 0.7 kW × 8,760 h × 40% × $0.08/kWh = $196/year.
- Cooling: 0.7 kW × (1.4 – 1) × 8,760 h × 40% × $0.08 = $78/year.
- Annual tokens: 1,000 tps × 3,600 × 8,760 × 40% = 12.6 billion.
- LCOI: ($11,667 + $196 + $78) / 12,600 million tokens = $0.95 per million tokens.
Capex is roughly 98% of the total. Electricity is under 2%. Cooling is under 1%. That breakdown is the single most important output of the model, and almost everything noteworthy about self-hosting follows from it.
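To see where the breakdown comes from, the three buckets can be computed separately and expressed as shares of the annual total. A small sketch using the worked example's inputs; the function name and dictionary layout are illustrative:

```python
HOURS_PER_YEAR = 8760


def annual_costs(gpu_price, amortisation_years, tdp_kw, pue, tariff, utilisation):
    """Annual cost per bucket in dollars, following the three buckets in the formula."""
    energy_kwh = tdp_kw * HOURS_PER_YEAR * utilisation  # GPU energy drawn per year
    return {
        "capex": gpu_price / amortisation_years,
        "electricity": energy_kwh * tariff,
        "cooling": energy_kwh * (pue - 1) * tariff,
    }


# Worked-example inputs: H100 at $35,000, 3 years, 0.7 kW, PUE 1.4, $0.08/kWh, 40%.
costs = annual_costs(35_000, 3, 0.7, 1.4, 0.08, 0.40)
total = sum(costs.values())
shares = {bucket: cost / total for bucket, cost in costs.items()}

for bucket, cost in costs.items():
    print(f"{bucket:<12} ${cost:>9,.0f}  {shares[bucket]:6.1%}")
```

Changing `utilisation` or `tariff` here is a quick way to check how far the split can move before power stops being a rounding error.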
What the breakdown implies
Utilisation dominates. Doubling utilisation roughly halves unit cost, because capex is spread over twice as many tokens. Going from 40% to 70% on the same H100 in the same building drops LCOI from $0.95 to $0.55/M.
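A utilisation sweep makes the effect concrete. This sketch holds every other input at the worked-example values (H100, $35,000, three years, 0.7 kW, PUE 1.4, $0.08/kWh, 1,000 tps) and varies only utilisation; the function name and the chosen steps are my own:

```python
def lcoi(utilisation, gpu_price=35_000, years=3, tdp_kw=0.7, pue=1.4, tariff=0.08, tps=1_000):
    """LCOI in $ per million tokens at a given utilisation (0 < utilisation <= 1)."""
    hours = 8760
    # tdp_kw * pue folds electricity and cooling into one term: tdp * (1 + (pue - 1)).
    annual_cost = gpu_price / years + tdp_kw * pue * hours * utilisation * tariff
    annual_tokens = tps * 3600 * hours * utilisation
    return annual_cost / annual_tokens * 1_000_000


sweep = {pct: lcoi(pct / 100) for pct in (20, 40, 60, 70, 80, 100)}

for pct, cost in sweep.items():
    print(f"{pct:>3}% utilisation: ${cost:.2f}/M")
```

The curve is steepest at the low end, so an idle cluster is punished much harder than a busy one is rewarded.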
The hosted API's advantage is almost entirely utilisation arbitrage. Together AI sells Llama 3.3 70B at $0.89/M, cheaper than the naive self-hosted $0.95/M. Their margin on this is thin. What they have is pooled demand across thousands of customers, smoothed bursts, and GPUs running at 60–80% instead of your 40%. On top of that, hyperscale buyers negotiate GPU prices 20–30% below retail.
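One way to frame the arbitrage is to ask what utilisation a self-hosted GPU needs before its LCOI matches a given API price. Under the three-bucket formula the power cost per token does not depend on utilisation, so the break-even has a closed form. A sketch with the H100 worked-example inputs as defaults; the function name is hypothetical, and since LCOI is a lower bound the real break-even sits somewhat higher:

```python
def breakeven_utilisation(api_price_per_m, gpu_price=35_000, years=3,
                          tdp_kw=0.7, pue=1.4, tariff=0.08, tps=1_000):
    """Utilisation at which self-hosted LCOI equals the API price, or None if unreachable.

    LCOI(u) = capex_per_m_full / u + power_per_m, solved for u.
    A result above 1.0 means the GPU cannot match the price even when fully loaded.
    """
    tokens_per_year_full_m = tps * 3600 * 8760 / 1_000_000  # million tokens/year at 100%
    capex_per_m_full = (gpu_price / years) / tokens_per_year_full_m
    power_per_m = tdp_kw * pue * tariff / (tps * 3600) * 1_000_000
    if api_price_per_m <= power_per_m:
        return None  # the API is cheaper than the power bill alone
    return capex_per_m_full / (api_price_per_m - power_per_m)


u = breakeven_utilisation(0.89)  # the Llama 3.3 70B output price quoted above
print(f"break-even utilisation: {u:.0%}")
```

Any GPU discount or longer amortisation window lowers the break-even; any of the costs the formula leaves out raises it.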
Hardware price matters more than power price. A 10% discount on the GPU moves unit cost by roughly 10%. A 10% discount on power moves it by 0.2%.
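The two sensitivities can be checked by perturbing one input at a time. A minimal sketch, assuming the worked-example H100 inputs as the baseline; names are illustrative:

```python
def lcoi(gpu_price=35_000, tariff=0.08, years=3, tdp_kw=0.7, pue=1.4,
         tps=1_000, utilisation=0.40):
    """LCOI in $ per million tokens; defaults are the worked-example H100 inputs."""
    hours = 8760
    annual_cost = gpu_price / years + tdp_kw * pue * hours * utilisation * tariff
    annual_tokens = tps * 3600 * hours * utilisation
    return annual_cost / annual_tokens * 1_000_000


base = lcoi()
# Apply a 10% discount to one input at a time and measure the relative change.
gpu_change = lcoi(gpu_price=35_000 * 0.9) / base - 1
power_change = lcoi(tariff=0.08 * 0.9) / base - 1

print(f"10% off the GPU:  {gpu_change:+.1%}")
print(f"10% off the power: {power_change:+.2%}")
```

Each sensitivity is simply the discount multiplied by that bucket's share of total cost, which is why the breakdown matters so much.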
Geography matters less than it looks. Swap US industrial electricity ($0.08/kWh) for Irish industrial ($0.28/kWh) and, by the formula above, LCOI rises from $0.95 to about $1.00/M: a 6% jump for a tariff 3.5× higher. It matters, just less than I had guessed. At high utilisation it becomes more relevant: run the same H100 at 80% in Ireland and electricity plus cooling reaches about 14% of total cost. That's the regime hyperscalers operate in, which is why they sign 20-year PPAs in the Nordics and Texas.
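The tariff comparison can be reproduced with the same three-bucket formula. This is a sketch under the worked-example inputs (H100, $35,000, three years, 0.7 kW, PUE 1.4, 1,000 tps); the calculator's regional presets may differ from these assumptions, so treat the output as indicative:

```python
def lcoi_and_power_share(tariff, utilisation, gpu_price=35_000, years=3,
                         tdp_kw=0.7, pue=1.4, tps=1_000):
    """Return (LCOI in $/M tokens, share of total cost spent on electricity plus cooling)."""
    hours = 8760
    capex = gpu_price / years
    power = tdp_kw * pue * hours * utilisation * tariff  # electricity + cooling
    annual_tokens = tps * 3600 * hours * utilisation
    return (capex + power) / annual_tokens * 1_000_000, power / (capex + power)


# Tariffs quoted in the text, in $/kWh.
scenarios = {
    "US, 40%": lcoi_and_power_share(0.08, 0.40),
    "Ireland, 40%": lcoi_and_power_share(0.28, 0.40),
    "Ireland, 80%": lcoi_and_power_share(0.28, 0.80),
}

for name, (cost, share) in scenarios.items():
    print(f"{name:<13} ${cost:.2f}/M  power share {share:.1%}")
```

The power share roughly doubles between 40% and 80% utilisation because capex is fixed while the energy bill scales with load.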
Older silicon is often cheaper per token than new. Per-token economics follow price-to-throughput, not absolute throughput. A used A100 80GB at $12,000 produces Llama 3 70B tokens at $0.82/M; a new H100 at $35,000 comes in at $0.95/M. The H100 has 2.5× the throughput of an A100, but it costs 3× as much, so the A100 wins on unit cost.
Consumer GPUs lose on large models, unsurprisingly. An RTX 4090 costs $2,000, roughly one-seventeenth the price of an H100. At first pass that looks like a dramatically cheaper self-hosting route, but the math doesn't hold. Running Llama 3 70B on a 4090 requires aggressive quantisation to fit the weights in 24GB of VRAM, and throughput collapses from 1,000 tps (H100, FP8) to 45 tps per card. The calculator puts the 4090 at $1.48/M, more expensive than either data-centre card. Consumer cards are fine for development, smaller models, and on-device work. On 70B-class models at production scale, they fail.
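The three cards can be lined up with the same formula to show that price-to-throughput, not raw throughput, drives the ranking. A sketch with several assumed inputs: the A100's 400 tps is derived from the 2.5× ratio above, and the 0.4 kW (A100) and 0.45 kW (RTX 4090) power draws are my assumptions rather than the calculator's presets, so results may differ from its figures by a cent or so.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    price: float   # purchase price in dollars
    tdp_kw: float  # power draw in kW (A100 and 4090 values are assumed)
    tps: float     # batched output tokens per second on Llama 3 70B


def lcoi(gpu, years=3, pue=1.4, tariff=0.08, utilisation=0.40):
    """LCOI in $ per million tokens for one card under the worked-example conditions."""
    hours = 8760
    annual_cost = gpu.price / years + gpu.tdp_kw * pue * hours * utilisation * tariff
    annual_tokens = gpu.tps * 3600 * hours * utilisation
    return annual_cost / annual_tokens * 1_000_000


CARDS = [
    Gpu("H100 SXM (new)", 35_000, 0.70, 1_000),
    Gpu("A100 80GB (used)", 12_000, 0.40, 400),  # 400 tps = 1,000 / 2.5
    Gpu("RTX 4090", 2_000, 0.45, 45),
]

ranked = sorted(CARDS, key=lcoi)  # cheapest per token first
for gpu in ranked:
    print(f"{gpu.name:<17} ${gpu.price / gpu.tps:>5.1f} per tps  ${lcoi(gpu):.2f}/M")
```

The dollars-per-tps column carries most of the story: the 4090 is cheap in absolute terms but pays the most for each token per second it delivers.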
Try the calculator
The LCOI calculator runs this formula against whatever assumptions you want to supply. Swap the hardware. Change utilisation. Switch the electricity region. Change the amortisation window. The assumptions page documents every preset with sources and capture dates.
If a preset in the calculator looks wrong, check its source on the assumptions page and push back with a link.