inferecon.com
THE ECONOMICS OF INFERENCE

v0.2 — captured 2026-04-18

Assumptions

Every default value in the LCOI calculator has a source and a date. This page lists them alongside the formula and the simplifications baked into v0.2. Figures are captured quarterly — the market moves faster than any calculator default can.

Methodology

The calculator computes levelized cost of inference per million tokens, separately for input (prefill) and output (decode), plus a blended figure. Combined throughput is a token-weighted harmonic mean of hardware-specific prefill and decode rates. CapEx (net of hardware-specific residual value) is annualised via a Capital Recovery Factor. OpEx splits into a fixed-per-cluster bucket and a per-GPU bucket. Utilisation is a 24-hour schedule whose time-average drives cost. All preset-linked defaults — cluster size, power cap, residual value, facility overhead, OpEx, PUE — update automatically when hardware or region is changed.

prefill_time/req       = avg_prompt_tokens  / prefill_tps
decode_time/req        = avg_output_tokens / decode_tps
prefill_time_fraction  = prefill_time/req / (prefill_time/req + decode_time/req)
input_token_fraction   = avg_prompt_tokens  / (avg_prompt_tokens + avg_output_tokens)
output_token_fraction  = 1 − input_token_fraction
combined_tps           = (avg_prompt_tokens + avg_output_tokens)
                       / (prefill_time/req + decode_time/req)

cluster_efficiency     = max(0, 1 − clusterOverheadBase × log2(cluster_size))
effective_tps          = combined_tps × cluster_size × cluster_efficiency
avg_utilisation        = mean(utilisation_schedule)
peak_utilisation       = max(utilisation_schedule)
headroom_fraction      = 1 − avg_utilisation / peak_utilisation
annual_tokens          = effective_tps × 3600 × 8760 × avg_utilisation
annual_input_tokens    = annual_tokens × input_token_fraction
annual_output_tokens   = annual_tokens × output_token_fraction

CRF(r, n)              = r(1+r)^n / ((1+r)^n − 1)          (= 1/n at r = 0)
PV(future, r, n)       = future / (1+r)^n
gpu_capex_total        = gpu_price × cluster_size
salvage_total          = gpu_capex_total × residual_value_pct     ← hardware preset
net_gpu_capex          = gpu_capex_total − PV(salvage_total, r, n)
capex_annual           = net_gpu_capex × CRF(r, n)
overhead_annual        = gpu_capex_total × facility_overhead × CRF(r, n)  ← hardware preset

effective_power_kw     = power_cap_watts / 1000                   ← hardware preset
electricity_annual     = effective_power_kw × cluster_size × 8760 × avg_utilisation × price_per_kwh
cooling_annual         = effective_power_kw × cluster_size × (pue − 1) × 8760 × avg_utilisation × price_per_kwh
                                                                    ← pue from region preset

opex_annual            = opex_fixed_per_cluster + opex_per_gpu × cluster_size
                                                                    ← both from hardware preset

total_annual           = capex_annual + overhead_annual
                       + electricity_annual + cooling_annual + opex_annual

# Joint costs allocated to input vs output by GPU time (prefill vs decode):
input_cost_annual      = total_annual × prefill_time_fraction
output_cost_annual     = total_annual × (1 − prefill_time_fraction)

lcoi_input_per_M       = input_cost_annual  / annual_input_tokens  × 1,000,000
lcoi_output_per_M      = output_cost_annual / annual_output_tokens × 1,000,000
blended_lcoi_per_M     = total_annual / annual_tokens × 1,000,000
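As a cross-check, the formula block above can be transcribed into a short Python sketch. The inputs are the H100 SXM, Llama 3 70B (FP8) and US presets listed further down this page. The 3-year lifetime (n_years) is an assumption, since this page does not state a default amortisation period, and every function and field name is illustrative rather than the calculator's actual code.

```python
from dataclasses import dataclass
from math import log2


@dataclass
class Inputs:
    # H100 SXM + Llama 3 70B (FP8) + US presets from this page.
    avg_prompt_tokens: float = 1000
    avg_output_tokens: float = 300
    prefill_tps: float = 3000
    decode_tps: float = 310
    cluster_size: int = 32
    cluster_overhead_base: float = 0.03
    utilisation_schedule: tuple = (0.40,) * 24
    gpu_price: float = 35_000
    residual_value_pct: float = 0.30
    facility_overhead: float = 0.15
    power_cap_watts: float = 500
    pue: float = 1.40
    price_per_kwh: float = 0.08
    opex_fixed_per_cluster: float = 30_000
    opex_per_gpu: float = 1_000
    discount_rate: float = 0.08
    n_years: int = 3  # ASSUMPTION: lifetime is not stated on this page


def crf(r: float, n: int) -> float:
    """Capital Recovery Factor; straight-line 1/n when r = 0."""
    if r == 0:
        return 1 / n
    growth = (1 + r) ** n
    return r * growth / (growth - 1)


def lcoi(p: Inputs) -> dict:
    # Throughput and annual token volume
    prefill_time = p.avg_prompt_tokens / p.prefill_tps
    decode_time = p.avg_output_tokens / p.decode_tps
    total_tokens = p.avg_prompt_tokens + p.avg_output_tokens
    prefill_time_fraction = prefill_time / (prefill_time + decode_time)
    input_token_fraction = p.avg_prompt_tokens / total_tokens
    combined_tps = total_tokens / (prefill_time + decode_time)
    efficiency = max(0.0, 1 - p.cluster_overhead_base * log2(p.cluster_size))
    effective_tps = combined_tps * p.cluster_size * efficiency
    avg_util = sum(p.utilisation_schedule) / len(p.utilisation_schedule)
    annual_tokens = effective_tps * 3600 * 8760 * avg_util

    # Capital: GPU CapEx net of discounted salvage, plus facility overhead
    k = crf(p.discount_rate, p.n_years)
    gpu_capex = p.gpu_price * p.cluster_size
    salvage_pv = gpu_capex * p.residual_value_pct / (1 + p.discount_rate) ** p.n_years
    capex_annual = (gpu_capex - salvage_pv) * k
    overhead_annual = gpu_capex * p.facility_overhead * k

    # Energy and OpEx
    it_kwh = p.power_cap_watts / 1000 * p.cluster_size * 8760 * avg_util
    electricity = it_kwh * p.price_per_kwh
    cooling = it_kwh * (p.pue - 1) * p.price_per_kwh
    opex = p.opex_fixed_per_cluster + p.opex_per_gpu * p.cluster_size

    total = capex_annual + overhead_annual + electricity + cooling + opex

    # Joint costs split by GPU time, then divided by each token stream
    input_tokens = annual_tokens * input_token_fraction
    output_tokens = annual_tokens * (1 - input_token_fraction)
    return {
        "input_per_M": total * prefill_time_fraction / input_tokens * 1e6,
        "output_per_M": total * (1 - prefill_time_fraction) / output_tokens * 1e6,
        "blended_per_M": total / annual_tokens * 1e6,
    }


result = lcoi(Inputs())
for name, value in result.items():
    print(f"{name}: ${value:.2f}")
```

Under these assumptions the blended figure lands between $1 and $2 per million tokens, with the output price several times the input price. Changing n_years moves the result noticeably, so treat the number as indicative only.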

What v0.2 leaves out

  • Cost allocation is by GPU time, not by token type. Joint costs are split between input and output streams by the share of GPU wall-clock in each phase. Other defensible bases (FLOPs, memory bandwidth) would weight phases differently; time is the most defensible default, and the alternatives rarely give materially different results.
  • Topology-blind cluster model. The log₂ derate does not distinguish NVLink within a node from InfiniBand or Ethernet across nodes; see Cluster overhead below.
  • Idle power is zero. Utilisation scales both throughput and power linearly. Real GPUs draw 5–10% of TDP at idle; small effect at moderate utilisation, larger below 20%.
  • Residual is recovered as one lump at end-of-life. Real resale happens over months and depends on secondary-market clearing. The PV-of-salvage approximation is adequate for 3–5 year windows.
  • Training costs excluded. Inference is on the tin.

v0.2 modelling levers

Several levers now carry hardware- or region-specific defaults: switching the hardware preset in the calculator automatically refills cluster size, power cap, residual value, facility overhead, and OpEx. Region switching refills electricity price and PUE. The values shown in the Hardware and Region preset sections below are the authoritative defaults.

Workload shape (prompt vs. output tokens)

Llama 3 70B (FP8)
1,000 in / 300 out
Llama 3 8B (FP8)
500 in / 200 out
GPT-4-class MoE (proxy)
1,500 in / 500 out
Captured
2026-04

Defaults are model-specific and update when the model preset changes. Combined throughput is a token-weighted harmonic mean of the prefill and decode rates (see TPS section below): an average request takes prompt/prefillTps + output/decodeTps seconds, and effective throughput is total tokens divided by total time. When prefillTps ≫ decodeTps, prefill time becomes negligible and combined throughput approaches total tokens divided by decode time, so decode is the bottleneck. Values reflect typical 2026 chat workloads; RAG and coding tasks skew toward longer prompts. Sources: Towards Data Science, Prefill Is Compute-Bound. Decode Is Memory-Bound (2024); Agrawal et al., Sarathi-Serve, USENIX OSDI 2024.
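A minimal sketch of the combined-throughput calculation described above, using the H100 SXM and Llama 3 70B (FP8) figures from this page. The function name is illustrative. The second call shows the limiting case: when prefill becomes very fast, combined throughput is capped by decode time rather than growing without bound.

```python
def combined_tps(prompt_tokens: float, output_tokens: float,
                 prefill_tps: float, decode_tps: float) -> float:
    """Total tokens per request divided by total GPU time per request."""
    seconds = prompt_tokens / prefill_tps + output_tokens / decode_tps
    return (prompt_tokens + output_tokens) / seconds


# Llama 3 70B (FP8) on H100 SXM: 1,000 in / 300 out at 3,000 / 310 tok/s.
baseline = combined_tps(1000, 300, 3000, 310)

# Hypothetical 100x faster prefill: throughput rises by only about a third.
fast_prefill = combined_tps(1000, 300, 300_000, 310)

# Upper bound if prefill took no time at all.
decode_bound = (1000 + 300) / (300 / 310)

print(round(baseline), round(fast_prefill), round(decode_bound))
```

The baseline comes out at roughly 999 tok/s, in line with the 1,000 tok/s blended figure quoted for this hardware and model pair.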

Joint-cost allocation: input vs output prices

Allocation basis
GPU wall-clock time
Captured
2026-04

All costs are joint — CapEx, electricity, cooling, and OpEx accrue while the GPU is busy regardless of phase. The calculator splits the total annual cost by prefillTimeFraction (share of GPU wall-clock in prefill), then divides each bucket by the corresponding annual token volume. With separate prefill and decode rates set per hardware preset, input and output prices differ meaningfully by default: on H100 + Llama 3 70B, prefill runs at 3,000 tok/s vs. decode at 310 tok/s (~10:1 ratio), so output tokens are ~10× more expensive per token than input tokens. The throughput asymmetry is documented in Agrawal et al., Sarathi-Serve, USENIX OSDI 2024 — the paper shows one decode step is computationally equivalent to ~128 prefill tokens, though at production batch sizes with continuous batching the effective ratio at the system level is 2–10×. Time remains the most defensible joint-cost allocation basis; FLOPs or bandwidth allocation would give broadly similar results.
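The ~10× figure follows directly from the allocation rule, and a short sketch makes that visible. Under GPU-time allocation the common factors cancel, so the output-to-input price ratio equals prefill_tps / decode_tps whatever the workload shape. The function name is illustrative, and the second workload shape is hypothetical.

```python
def output_to_input_price_ratio(prompt_tokens: float, output_tokens: float,
                                prefill_tps: float, decode_tps: float) -> float:
    """Ratio of per-token output cost to per-token input cost."""
    prefill_time = prompt_tokens / prefill_tps
    decode_time = output_tokens / decode_tps
    prefill_share = prefill_time / (prefill_time + decode_time)
    # Per-token cost in each bucket, up to a factor common to both buckets.
    input_cost_per_token = prefill_share / prompt_tokens
    output_cost_per_token = (1 - prefill_share) / output_tokens
    return output_cost_per_token / input_cost_per_token


# H100 SXM + Llama 3 70B (FP8) default shape: 1,000 in / 300 out.
default_shape = output_to_input_price_ratio(1000, 300, 3000, 310)

# Hypothetical generation-heavy shape: the ratio does not move.
long_output = output_to_input_price_ratio(200, 2000, 3000, 310)

print(round(default_shape, 2), round(long_output, 2))
```

Workload shape therefore changes the blended price but not the relative price of input and output tokens; only the two throughput rates do.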

Discount rate / NPV

Default
8.0%
Captured
2026-04

CapEx is annualised via the Capital Recovery Factor: the constant annual payment whose present value equals the net-of-residual up-front cost. The 8% default is benchmarked against the US 10-year Treasury yield (FRED) at ~4.3–4.5% in early 2026 plus the Kroll recommended equity risk premium of 5.0% (reaffirmed Feb 2025), which gives ~9.3–9.5% for a base-case US business. 8% sits below that range, appropriate for an operator using leverage or secured financing; hyperscalers borrow at 4–6%. At r = 0 the CRF collapses to 1/n (straight-line).
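To illustrate the definition, the sketch below computes the CRF and checks that the resulting annual payments discount back to the up-front amount. The $100,000 net CapEx and the 3-year term are illustrative values, not calculator defaults.

```python
def crf(r: float, n: int) -> float:
    """Capital Recovery Factor; straight-line 1/n when r = 0."""
    if r == 0:
        return 1 / n
    growth = (1 + r) ** n
    return r * growth / (growth - 1)


def present_value(payment: float, r: float, n: int) -> float:
    """Present value of n equal end-of-year payments."""
    return sum(payment / (1 + r) ** year for year in range(1, n + 1))


net_capex = 100_000  # illustrative
annual_payment = net_capex * crf(0.08, 3)

# The payments discount back to the original outlay.
recovered = present_value(annual_payment, 0.08, 3)

for rate in (0.0, 0.05, 0.08):
    print(f"r = {rate:.0%}: CRF = {crf(rate, 3):.4f}")
```

A lower discount rate gives a lower CRF, which is why an operator with hyperscaler-grade borrowing costs sees a smaller annualised CapEx for the same hardware.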

Residual / salvage value

H100 SXM
30%
A100 80GB
35%
RTX 4090
40%
Captured
2026-04

Defaults are hardware-specific and update when the hardware preset changes. Values represent conservative 3-year forward projections anchored to current secondary-market data: Hashrate Index — Used GPU Market (2024–25) and BestValueGPU — RTX 4090 price history. H100 SXM units sell at 50–70% of current retail today; 30% is conservative for 3 years out as B200/GB300 displace them. A100 80GB currently trade at 50–75% of original; 35% accounts for further depreciation as H100 supply increases. RTX 4090 sells at 44–62% of MSRP (trade-in as low as $699 per VideoCardz); 40% is a conservative 3-year projection. Salvage is discounted to present value at the same discount rate as CapEx and subtracted before annualising.
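The effect of discounting the salvage value can be seen in a small sketch. Prices and residual percentages are the hardware presets from this page and the 8% rate is the calculator default; the 3-year horizon is an assumption taken from the "3-year forward projections" wording above, and the function name is illustrative.

```python
PRESETS = {
    # name: (gpu_price_usd, residual_value_pct)
    "H100 SXM": (35_000, 0.30),
    "A100 80GB": (12_000, 0.35),
    "RTX 4090": (2_000, 0.40),
}


def net_capex_per_gpu(price: float, residual_pct: float,
                      r: float = 0.08, n: int = 3) -> float:
    """GPU price minus the present value of its end-of-life resale."""
    salvage_pv = price * residual_pct / (1 + r) ** n
    return price - salvage_pv


for name, (price, residual) in PRESETS.items():
    net = net_capex_per_gpu(price, residual)
    print(f"{name}: net CapEx ${net:,.0f} ({1 - net / price:.1%} effective credit)")
```

Because the resale happens three years out, a 30% nominal residual is worth only about 24% of the purchase price in present-value terms.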

Power cap (vs. nameplate TDP)

H100 SXM
500 W cap / 700 W TDP
A100 80GB
350 W cap / 400 W TDP
RTX 4090
400 W cap / 450 W TDP
Captured
2026-04

Defaults are hardware-specific and update when the hardware preset changes. Power is capped via nvidia-smi -pl with minimal throughput impact at inference-typical batch sizes. The NVIDIA DGX H100 User Guide documents the three mechanisms for power budget control; the H100 PCIe configurable range is 200–350 W (NVIDIA Developer Forums, 2023). The 500 W SXM cap used here is an informed operator estimate — no public NVIDIA doc prescribes an inference-specific SXM cap. A100 (350 W cap vs. 400 W TDP) and RTX 4090 (400 W cap vs. 450 W TDP) are similarly conservative. The power cap drives electricity and cooling; TDP is retained as the nameplate spec.

Facility overhead

H100 SXM
15%
A100 80GB
10%
RTX 4090
2%
Captured
2026-04

Defaults are hardware-specific and update when the hardware preset changes. H100 production colo includes networking switches, NVMe storage, racks, and PDUs (15%). A100 deployments share more infrastructure (10%). RTX 4090 workstations sit on a shelf with near-zero overhead (2%). An Introl GPU Infrastructure TCO model (2025) puts space + facilities at ~$240k/year for a 100-GPU H100 cluster (against ~$3.5M GPU CapEx), consistent with 6–8% annually — the 15% one-time CapEx multiplier used here is comparable over a 3-year amortisation window. Facility overhead is CapEx, annualised via the same CRF as the GPU. Excludes cooling (PUE) and the building shell.
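The comparison between a one-time 15% multiplier and an annual facilities charge can be worked through numerically. The sketch assumes the 8% default discount rate and the 3-year window mentioned above; the Introl figures are the ones quoted in this section.

```python
def crf(r: float, n: int) -> float:
    """Capital Recovery Factor; straight-line 1/n when r = 0."""
    if r == 0:
        return 1 / n
    growth = (1 + r) ** n
    return r * growth / (growth - 1)


# Calculator: 15% of GPU CapEx, annualised over 3 years at 8%.
calculator_share = 0.15 * crf(0.08, 3)

# Introl model: ~$240k/year facilities against ~$3.5M GPU CapEx.
introl_share = 240_000 / 3_500_000

print(f"calculator: {calculator_share:.1%} of GPU CapEx per year")
print(f"Introl:     {introl_share:.1%} of GPU CapEx per year")
```

On these assumptions the calculator charges a little under 6% of GPU CapEx per year, slightly below the Introl-implied figure of just under 7%.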

Hybrid OpEx (fixed + per-GPU)

H100 SXM
$30k fixed + $1,000/GPU
A100 80GB
$15k fixed + $800/GPU
RTX 4090
$0 fixed + $200/GPU
Captured
2026-04

Defaults are hardware-specific and update when the hardware preset changes. Labor doesn’t scale linearly with GPU count (one SRE supports many GPUs), so it lives in the fixed bucket. The Introl TCO model (2025) puts 5 FTE at $900k/year for a 100-GPU H100 cluster, implying ~$9k/GPU/year all-in staff. Our $30k fixed + $1k/GPU is much lower (about $1.9k/GPU/year at the default 32-GPU cluster) and models a deployment that shares staff across other workloads. Staffing ratios of 20–50 GPUs per engineer are consistent with BroadStaff Global, Data Center Staffing Levels (2024). RTX 4090 ($0 fixed) models hobbyist self-service. Per-GPU costs (software licences, maintenance, HBM/PSU replacement) scale with the fleet.
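A short sketch of the hybrid OpEx formula shows how the fixed bucket dilutes as the cluster grows. The H100 SXM preset values come from this page; the cluster sizes other than the 32-GPU default are illustrative.

```python
def opex_annual(fixed_per_cluster: float, per_gpu: float, cluster_size: int) -> float:
    """Hybrid OpEx: one fixed bucket plus a per-GPU marginal cost."""
    return fixed_per_cluster + per_gpu * cluster_size


# H100 SXM preset: $30k fixed + $1,000 per GPU.
for size in (8, 32, 100):
    total = opex_annual(30_000, 1_000, size)
    print(f"{size:>3} GPUs: ${total:,.0f}/year, ${total / size:,.0f} per GPU")
```

Per-GPU OpEx falls from $4,750 at 8 GPUs to $1,300 at 100 GPUs, well below the ~$9k/GPU/year implied by the Introl staffing model.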

Cluster overhead

Coefficient
3% / doubling
Captured
2026-04

Effective throughput per GPU is derated by 3% × log₂(clusterSize): ~9% at 8 GPUs, ~15% at 32 GPUs, ~18% at 64 GPUs, clamped at 0. Coarse empirical proxy — does not distinguish NVLink within a node from InfiniBand or Ethernet across nodes, nor the discontinuity at the node boundary (~8 GPUs). All-NVLink clusters see 2–5% real overhead (model overstates); large multi-node clusters may see more (model understates). Source: [source needed] — empirical proxy; no published benchmark was found that directly calibrates this log₂ coefficient; topology-aware model on the roadmap.
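The derate can be reproduced in a few lines. This is a sketch of the formula as stated on this page, with an illustrative function name; the 3% coefficient is the documented default.

```python
from math import log2


def cluster_efficiency(cluster_size: int, overhead_per_doubling: float = 0.03) -> float:
    """Per-GPU throughput multiplier, clamped so it never goes negative."""
    return max(0.0, 1 - overhead_per_doubling * log2(cluster_size))


for size in (1, 8, 32, 64):
    eff = cluster_efficiency(size)
    print(f"{size:>2} GPUs: efficiency {eff:.2f}, "
          f"{size * eff:.1f} GPU-equivalents of throughput")
```

A single GPU carries no derate, since log₂(1) = 0, while a 64-GPU cluster delivers the throughput of about 52.5 ideal GPUs.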

Hourly utilisation schedule

Default
flat 40%
Length
24 hours
Captured
2026-04

Cost math uses the time-average utilisation; peak and headroom (1 − avg/peak) are reported separately so operators can see over-provisioning cost. A flat schedule produces zero headroom; a consumer app with peak business hours and quiet nights often shows 50%+ headroom. The slider drives a flat schedule; the Advanced textarea accepts comma-separated hourly values for diurnal patterns. The 40% default is a neutral starting point — no single published figure exists for small-operator inference utilisation; hyperscalers report 65%+ via multi-tenant continuous batching. Source: [source needed] — workload-specific; adjust to your deployment.
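The average, peak and headroom figures can be sketched as below. The input format (24 comma-separated fractions between 0 and 1) is an assumption about the Advanced textarea, the function names are illustrative, and the diurnal pattern is a made-up example.

```python
def parse_schedule(text: str) -> list[float]:
    """Parse 24 comma-separated hourly utilisation values (fractions, 0 to 1)."""
    values = [float(v) for v in text.split(",")]
    if len(values) != 24:
        raise ValueError(f"expected 24 hourly values, got {len(values)}")
    return values


def summarise(schedule: list[float]) -> tuple[float, float, float]:
    """Return (average utilisation, peak utilisation, headroom fraction)."""
    avg = sum(schedule) / len(schedule)
    peak = max(schedule)
    headroom = 1 - avg / peak if peak > 0 else 0.0
    return avg, peak, headroom


flat = parse_schedule(",".join(["0.40"] * 24))

# Hypothetical diurnal pattern: quiet night, busy working day, moderate evening.
diurnal = parse_schedule(",".join(["0.10"] * 8 + ["0.80"] * 8 + ["0.30"] * 8))

print(summarise(flat))
print(summarise(diurnal))
```

Both schedules average 40%, so they produce the same cost per token, but the diurnal one reports 50% headroom because the cluster has to be sized for the 80% peak.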

Hardware presets

Each hardware preset carries all deployment-linked defaults: GPU price, nameplate TDP, power cap for energy math, typical cluster size, residual value, facility overhead, and OpEx (fixed + per-GPU). All refill automatically when hardware is changed in the calculator.

NVIDIA H100 SXM

GPU price
$35,000
TDP (nameplate)
700 W
Power cap
500 W
Default cluster
32 GPUs
Residual value
30%
Facility overhead
15%
OpEx fixed
$30,000 / cluster
OpEx marginal
$1,000 / GPU

Source: GMI Cloud / Northflank pricing analysis (2026). Market is $35–40k for SXM units; PCIe variants trade at $25–30k. The $35k default sits at the low end of the SXM range.

NVIDIA A100 80GB

GPU price
$12,000
TDP (nameplate)
400 W
Power cap
350 W
Default cluster
8 GPUs
Residual value
35%
Facility overhead
10%
OpEx fixed
$15,000 / cluster
OpEx marginal
$800 / GPU

Source: JarvisLabs / Northflank (2026). New units $9.5–14k; refurbished $4–9k. $12k is a midpoint for new/refurb. A100 being cheaper per token than H100 at current used pricing is a real finding: throughput is lower but not in proportion to price.

NVIDIA RTX 4090

GPU price
$2,000
TDP (nameplate)
450 W
Power cap
400 W
Default cluster
1 GPU
Residual value
40%
Facility overhead
2%
OpEx fixed
$0 / cluster
OpEx marginal
$200 / GPU

Source: Consumer retail (NVIDIA MSRP, 2026). Not a production-inference part. Listed for reference: Llama 3 70B FP8 does not fit in 24 GB VRAM without Q4/Q5 quantisation, which changes output quality. Figures assume quantised deployment.

Model presets & throughput

Each model preset carries its typical workload shape (prompt / output tokens) and is associated with two throughput matrices — blended (for benchmark reference) and separate prefill / decode values (for the input/output cost split). Prefill and decode rates refill automatically when hardware or model is changed.

Default workload shape

Model                      Avg prompt (tokens)   Avg output (tokens)
Llama 3 70B (FP8)          1,000                 300
Llama 3 8B (FP8)           500                   200
GPT-4-class MoE (proxy)    1,500                 500

Blended throughput — benchmark reference (tok/s per GPU)

Combined rate across a production workload mix. Used as the anchor for calibrating the prefill/decode split below.

Model ↓ / Hardware →       H100 SXM   A100 80GB   RTX 4090
Llama 3 70B (FP8)          1,000      400         45
Llama 3 8B (FP8)           3,500      1,500       900
GPT-4-class MoE (proxy)    500        200         15

Prefill throughput (tok/s per GPU)

Calibrated so the harmonic mean at each model’s default workload shape equals the blended figure above. Prefill:decode ratio ~10:1 reflects production continuous-batching behaviour — prefill is compute-parallel (fast); decode is KV-cache bandwidth-bound (slow). The fundamental asymmetry is documented in Agrawal et al., Sarathi-Serve, USENIX OSDI 2024 (one decode step ≈ 128 prefill tokens computationally); at production batch sizes the system-level ratio compresses to 2–10×.

Model ↓ / Hardware →       H100 SXM   A100 80GB   RTX 4090
Llama 3 70B (FP8)          3,000      1,200       140
Llama 3 8B (FP8)           10,000     4,500       3,200
GPT-4-class MoE (proxy)    1,600      650         50

Decode throughput (tok/s per GPU)

Model ↓ / Hardware →       H100 SXM   A100 80GB   RTX 4090
Llama 3 70B (FP8)          310        120         14
Llama 3 8B (FP8)           1,333      563         320
GPT-4-class MoE (proxy)    160        65          5
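One way to reproduce the calibration is to fix the blended target and the prefill rate, then solve the harmonic-mean identity for the decode rate. The sketch below does this for the two Llama rows on H100 SXM; it illustrates the arithmetic and is not necessarily the procedure used to build the matrices. The function name is hypothetical.

```python
def decode_tps_for_target(blended_tps: float, prefill_tps: float,
                          prompt_tokens: float, output_tokens: float) -> float:
    """Decode rate that makes the combined rate equal the blended target."""
    total_time = (prompt_tokens + output_tokens) / blended_tps
    decode_time = total_time - prompt_tokens / prefill_tps
    if decode_time <= 0:
        raise ValueError("prefill alone already uses the whole time budget")
    return output_tokens / decode_time


# H100 SXM, Llama 3 70B (FP8): blended 1,000, prefill 3,000, shape 1,000 / 300.
llama_70b = decode_tps_for_target(1000, 3000, 1000, 300)

# H100 SXM, Llama 3 8B (FP8): blended 3,500, prefill 10,000, shape 500 / 200.
llama_8b = decode_tps_for_target(3500, 10_000, 500, 200)

print(round(llama_70b), round(llama_8b))
```

The solved values round to the 310 and 1,333 tok/s entries in the decode matrix.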

Model citations

  • Llama 3 70B (FP8): [source needed]. Model identity is public; throughput figures are hardware-specific, see TPS benchmark sources below.
  • Llama 3 8B (FP8): [source needed]. Model identity is public; throughput figures are hardware-specific, see TPS benchmark sources below.
  • GPT-4-class MoE (proxy): [source needed]. Rough proxy for a large MoE of similar class. No verified public benchmark; flagged as estimate until a reference lands.

TPS benchmark sources

  • dlewis.io — Llama 3.3 70B on 4×H100 SXM5 (BF16) (2026-04) — 2,600 tps at 250 concurrent users across 4 GPUs → ~650 tps/GPU BF16. FP8 on H100 yields ~1.5× → ~1,000 tps/GPU.
  • VALDI — Llama 3.1 8B inference on H100 (2026-04) — 3,621 tps at batch 64; rounded down to 3,500 for a production-conservative default.
  • hardware-corner.net — RTX 4090 LLM benchmarks (2026-04) — ~52 tps peak with Q4_K_M quantisation; 45 tps as production-realistic. Requires quantisation — 70B does not fit in 24 GB VRAM at full precision.
  • No public benchmark: rough estimate for a large MoE proxy. [source needed]
  • Derived from H100×70B (bandwidth ratio): H100 HBM3 at 3.35 TB/s vs A100 HBM2e at 2.0 TB/s ≈ 1.68×, plus the A100's lack of native FP8 support, giving a combined ~2.3–2.5× gap. Estimate, not measurement. [source needed]
  • Derived from H100×8B (bandwidth ratio, ~2.3×). Estimate. [source needed]
  • Bandwidth-derived estimate: 8B fits in 24 GB VRAM; throughput estimated from H100×8B via bandwidth ratio. [source needed]

Electricity & PUE presets

Industrial tariffs in USD/kWh and typical facility PUE. Both refill automatically when region is changed in the calculator. PUE reflects climate and typical data-centre build standard in each region.

US avg (industrial)

$0.08 / kWh · PUE 1.40

Source: EIA Electricity Monthly Update (industrial forecast, 2025) — captured 2026-04. PUE 1.40 reflects large-colo average; Uptime Institute 13th Annual Survey (2023) puts global capacity-weighted average at 1.47.

EU avg (industrial)

$0.15 / kWh · PUE 1.36

Source: Eurostat — non-household electricity, H1 2025 — captured 2026-04. PUE 1.36 from European Commission EED aggregate data (2025); member-state range 1.15–1.66.

Germany (industrial)

$0.18 / kWh · PUE 1.30

Source: SMARD / Bundesnetzagentur — industrial tariff, Jan 2025 — captured 2026-04. PUE 1.30 from German Datacenter Association Impact Report 2024 (modern colo range 1.2–1.3).

Ireland (industrial)

$0.28 / kWh · PUE 1.20

Source: Eurostat — non-household, Ireland H1 2025 — captured 2026-04. PUE 1.20 is conservative for Ireland; Google Dublin achieved 1.08 TTM (Q4 2024); Microsoft EMEA (incl. Dublin) 1.16 FY2025.
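To show how tariff and PUE interact, the sketch below applies the electricity and cooling formulas from the Methodology section to each region preset. It assumes the H100 SXM defaults from this page (500 W cap, 32 GPUs) and the flat 40% utilisation default; the function and dictionary names are illustrative.

```python
REGIONS = {
    # name: (price_per_kwh_usd, pue)
    "US avg": (0.08, 1.40),
    "EU avg": (0.15, 1.36),
    "Germany": (0.18, 1.30),
    "Ireland": (0.28, 1.20),
}


def energy_cost_annual(power_cap_watts: float, cluster_size: int,
                       avg_utilisation: float, price_per_kwh: float,
                       pue: float) -> float:
    """Electricity plus cooling, with cooling modelled as (PUE - 1) x IT energy."""
    it_kwh = power_cap_watts / 1000 * cluster_size * 8760 * avg_utilisation
    electricity = it_kwh * price_per_kwh
    cooling = it_kwh * (pue - 1) * price_per_kwh
    return electricity + cooling


costs = {
    name: energy_cost_annual(500, 32, 0.40, price, pue)
    for name, (price, pue) in REGIONS.items()
}
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}/year")
```

Ireland's low PUE does not offset its tariff: the annual energy bill is roughly three times the US figure for the same cluster.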

Market reference cross-checks

At H100 (32 GPUs, US defaults) the calculator’s blended LCOI lands in the low single digits per million tokens, converging toward the public serverless API range at realistic utilisation. The output price is materially higher than the blended — reflecting the 10:1 prefill:decode throughput ratio allocating most cost to the decode phase. Public serverless pricing at comparable quality for comparison:

Hyperscale providers run higher utilisation (65%+ via continuous batching across many tenants), buy GPUs in bulk below list price, and amortise fixed engineering costs across far more GPUs. The calculator is priced for an honest small-to-mid operator — match cluster size and utilisation to your deployment before comparing.

Prompt Economics — model pricing & token assumptions

The Prompt Economics Calculator uses the following per-model pricing and token size defaults, sourced from Anthropic public documentation and community benchmarks. All figures are as of April 2026.

Anthropic model pricing (USD per million tokens)

Model               Input    Output   Cache write   Cache read
Claude Sonnet 4.6   $3.00    $15.00   $3.00         $0.30
Claude Opus 4.7     $5.00    $25.00   $5.00         $0.50
Claude Haiku 4.5    $1.00    $5.00    $1.00         $0.10

Source: Anthropic pricing page, 2026-04. Cache writes are billed at the same rate as input; cache reads are 10% of input. No markup on cache writes.

Default token size assumptions

Assumption                                                               Default (tokens)   Notes
Anthropic tool use docs – average schema size for a typical function     500                Based on examples and community patterns; smaller schemas (e.g., get_weather) are ~200 tokens, larger ones (e.g., API connectors) up to 1,000.
MCP protocol – bundle overhead (server description + multiple tools)     3,000              Approximation; an MCP exposing 3-5 tools typically adds 2,000–4,000 tokens of description and schemas.
Typical tool result size (file read, DB query, web fetch)                1,000              Estimate; adjust for your workload (e.g., reading a whole file might be 10k+).
Average tool_use message size                                            150                tool_use block includes JSON with name and arguments; usually <200 tokens.
Extended thinking – tokens per turn (if enabled)                         0                  Set to 0 by default; Sonnet 4 typically uses 1,000–4,000 thinking tokens per turn in complex tasks.

Cache & cost model notes

  • The prefix (system prompt + attached context + tool schemas + MCP bundles) is written to cache once at the start of a session, then read on every subsequent model invocation — including mid-turn tool call roundtrips.
  • History (previous turns’ user messages, tool results, and outputs) and the current user message are never cached and are billed at the full uncached input rate.
  • Tool calls create multiple API invocations per turn: each tool call adds a tool_use output message and a tool_result input (the returned data). The formula models this quadratically growing input size accurately.
  • Extended thinking tokens are billed as output and are not added to conversation history (they are stripped by the API before the next turn). Disabled by default.
  • The crossover point on the cumulative chart is where maintaining one session (shared cache) becomes cheaper than starting a fresh session for every turn. For short conversations fresh sessions win; for long ones caching wins.
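The invocation structure described in these notes can be sketched as follows. This is one illustrative reading of the notes, not the calculator's actual formula. The function name and the per-turn user message and answer sizes (100 and 300 tokens) are hypothetical; the prices are the Claude Sonnet row from the pricing table and the tool sizes are the defaults listed above. Extended thinking is left out, matching its default of 0.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """USD per million tokens."""
    input_rate: float
    output_rate: float
    cache_write_rate: float
    cache_read_rate: float


SONNET = Pricing(input_rate=3.00, output_rate=15.00,
                 cache_write_rate=3.00, cache_read_rate=0.30)


def session_cost(pricing: Pricing, turns: int, tool_calls_per_turn: int,
                 prefix_tokens: int,
                 user_tokens: int = 100,      # hypothetical
                 answer_tokens: int = 300,    # hypothetical
                 tool_use_tokens: int = 150,
                 tool_result_tokens: int = 1000) -> float:
    """Cost in USD of one session that keeps a shared cached prefix."""
    micro_usd = 0.0   # accumulates tokens x (USD per million tokens)
    history = 0       # uncached tokens resent on every invocation
    first_invocation = True
    for _ in range(turns):
        history += user_tokens
        # Each tool call adds one extra model invocation to the turn.
        for call in range(tool_calls_per_turn + 1):
            prefix_rate = (pricing.cache_write_rate if first_invocation
                           else pricing.cache_read_rate)
            first_invocation = False
            micro_usd += prefix_tokens * prefix_rate + history * pricing.input_rate
            if call < tool_calls_per_turn:
                micro_usd += tool_use_tokens * pricing.output_rate
                history += tool_use_tokens + tool_result_tokens
            else:
                micro_usd += answer_tokens * pricing.output_rate
                history += answer_tokens
    return micro_usd / 1_000_000


# Prefix of 3,500 tokens: one MCP bundle (3,000) plus one tool schema (500).
one_turn_no_tools = session_cost(SONNET, turns=1, tool_calls_per_turn=0, prefix_tokens=3500)
one_turn_one_tool = session_cost(SONNET, turns=1, tool_calls_per_turn=1, prefix_tokens=3500)
print(f"${one_turn_no_tools:.4f}  ${one_turn_one_tool:.4f}")
```

Because the uncached history is resent on every invocation, each additional tool call or turn adds more input cost than the one before it.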
