Units and quantities
Token
A chunk of text, usually a short word or part of one. "Inference" is one token; "Inferecon" is three. For English prose, a word averages roughly 1.3 tokens, though this varies by domain and language. Models read and write in tokens, and every commercial provider bills in them. Tokens come in two flavours: input (the prompt) and output (what the model generates).
The token is the unit of account for the entire inference market. When prices vary by 10× across providers, that is $/M tokens varying. Input and output are priced separately, and output typically runs 3–5× more on hosted APIs. The reason is structural. Inference has two phases. In "prefill", the model processes the entire prompt in a single forward pass: weights are loaded from memory once and that load is amortised across every input token. In "decode", the model generates output one token at a time, and each new token requires a fresh pass through the full set of parameters. Since LLM inference is memory-bandwidth-bound, decode cost scales roughly with the number of weights read per token, while prefill spreads the same memory load across the whole prompt. Hosted-API pricing tracks this compute structure, with market dynamics layered on top.
The economic consequence: input-heavy products (summarisation, classification, retrieval) are cheap to run; output-heavy ones (long-form generation, agentic workflows) are not.
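The split billing can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function name and both prices are hypothetical, chosen so that output costs 4× input, inside the 3–5× range quoted above.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request when input and output are billed at separate $/M-token rates."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000

# Hypothetical prices (assumptions, not any provider's real rates).
INPUT_PRICE = 1.00   # $ per million input tokens
OUTPUT_PRICE = 4.00  # $ per million output tokens

# Same total token count, opposite shapes.
summarisation = request_cost_usd(20_000, 500, INPUT_PRICE, OUTPUT_PRICE)  # input-heavy
long_form = request_cost_usd(500, 20_000, INPUT_PRICE, OUTPUT_PRICE)      # output-heavy

print(f"summarisation: ${summarisation:.4f}  long-form: ${long_form:.4f}")
```

Both requests move 20,500 tokens, yet under these assumed prices the output-heavy one costs several times more.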
Parameter
A learnable weight inside a neural network. Modern LLMs are described by their parameter count: a 70B model has 70 billion such weights, each a number stored in memory and read every time the model produces a token. Parameter count is the rough proxy for model size and the binding number for how much GPU memory a model occupies. A 70B model at FP16 needs roughly 140 GB of memory just for weights, before KV cache or activations.
Parameter count drives most of the inference cost structure but is only one input to capability: training compute and data matter as well, and Mixture of Experts architectures decouple total parameters from active parameters per token. Doubling parameters roughly doubles per-token decode cost in a memory-bandwidth-bound regime, which is why quantisation (compressing weights into fewer bits) is the most direct lever on serving cost.
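The memory figures in this entry follow from one multiplication: parameters times bytes per parameter. A minimal sketch, with a hypothetical function name and counting weights only (no KV cache, activations, or quantisation metadata):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_memory_gb(n_params, precision):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("FP16", "FP8", "INT4"):
    print(precision, weight_memory_gb(70e9, precision), "GB")
```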
Model weights
The trained parameters serialised to disk — the artifact distributed when a model is "released." A 70B model at FP16 occupies roughly 140 GB across one or more files; at INT4, roughly 35 GB. Standard formats include `safetensors` for HuggingFace distribution and `.gguf` for llama.cpp.
Weights, not architectures, are what gets shared. "Open-weights" releases (Llama, Mistral, DeepSeek, Qwen) ship the file and a permissive licence; training data, code, and recipes typically do not travel with them, which is why "open-weights" and "open-source" are not synonyms. On the serving side, weight file size sets the floor on which GPUs can host a model at all, before KV cache or activations. On the supply side, open-weights releases turn a fixed training cost into a freely redistributable artifact and put downward pressure on hosted-API prices for any task on which the open model is competitive.
Context window
The maximum number of tokens a model can attend to in a single request, covering both the prompt and any growing conversation history. Frontier models in 2026 offer 200k to 2M tokens; older models cap at 4k–32k.
Hosted-API price scales linearly with input length. Most providers charge the same per-token rate regardless of context size, so a 1M-token prompt costs roughly 100× the input cost of a 10k-token one at the same model. The economic catch lives on the serving side, not the headline price. KV cache grows linearly with context length, eating GPU memory and capping the batch size a single GPU can serve concurrently. Effective utilisation, and therefore unit cost, suffers even when the per-token API price is flat. Quality also degrades at long context, so high-context use cases often pay the same per token and get less reliable output. Long context is sold as a headline feature; for self-hosters, it is also the regime where the LCOI calculator's utilisation assumption can quietly fall apart.
FLOPS
Floating-point operations per second, the headline measure of raw compute throughput. A floating-point operation is a single arithmetic step on numbers stored in floating-point format, typically a multiply or add inside a matrix multiplication. The precision (FP16, FP8, FP4) refers to how many bits each number uses. Quoted by precision: an H100 SXM does roughly 1 PFLOPS at FP16 and 2 PFLOPS at FP8, both without sparsity.
A marketing trap for inference. LLM inference is almost always memory-bandwidth-bound, not compute-bound: the GPU spends most of its time waiting for weights to arrive from high-bandwidth memory (HBM, see memory bandwidth) rather than multiplying matrices. The B200's headline FP4 spec sounds dramatic; the actual inference throughput improvement over an H100 on 70B-class models is closer to 2–3×, set by memory bandwidth gains rather than raw FLOPS [MLPerf Inference v4.1, NVIDIA]. FLOPS matter for training. For inference they are a ceiling, not a forecast.
Performance
Throughput and latency
Throughput is how many tokens per second a system produces. Latency is wall-clock time per request, usually broken into TTFT (time to first token, the initial wait while the request is queued and the prompt is prefilled) and TPOT (time per output token, the streaming speed). The two trade against each other on the batch size dial: bigger batches raise throughput but increase wait time per individual user.
Different workloads want different points on this throughput–latency curve. Real-time chat needs low latency: TTFT under roughly 500ms is a common rule-of-thumb threshold for "feels instant", though the right target depends on the application. Agentic workflows running without a human in the loop often care only about throughput and tolerate seconds or minutes of wait. The unit-cost calculation lives on the throughput side. The user-experience constraint lives on the latency side. Most serving decisions are trades between the two.
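How TTFT and TPOT combine into what a user actually waits for can be sketched in one function. The formula below is a common simplification (constant TPOT, no network time), and the example values are assumptions rather than measurements.

```python
def request_latency_s(ttft_s, tpot_s, output_tokens):
    """End-to-end latency: wait for the first token, then stream the rest at TPOT each."""
    remaining = max(output_tokens - 1, 0)
    return ttft_s + tpot_s * remaining

# Assumed values: 0.5 s to first token, 20 ms per subsequent token.
chat_reply = request_latency_s(ttft_s=0.5, tpot_s=0.02, output_tokens=501)
stream_speed = 1 / 0.02  # tokens per second seen by this one user

print(f"total: {chat_reply:.1f} s at {stream_speed:.0f} tok/s")
```

For short replies TTFT dominates the wait; for long ones TPOT does, which is why the two are tracked separately.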
Batch size
The number of requests a GPU processes simultaneously in a single forward pass. Larger batches amortise the cost of loading model weights from memory across more concurrent users.
The single biggest economic lever for utilisation. Going from batch size 1 to batch size 64 can lift per-GPU throughput by several multiples; the GPU was always doing the same amount of memory-loading work, just on behalf of more users. This is the core arbitrage hosted-API providers run: pooled demand across thousands of customers pushes batch sizes — and effective utilisation — far higher than any single customer could achieve self-hosting. It's also why context window length puts pressure on unit economics: longer contexts shrink the number of users a GPU can serve concurrently, compressing batch size even if headline throughput holds.
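The amortisation argument can be sketched with a toy model. It assumes each decode step is purely memory-bound, reads the weights once, and reads one KV cache per active user. Real systems hit compute and scheduling limits well before this idealised curve, so treat the output as an upper bound. The model size and per-user KV figure are hypothetical; the bandwidth is the H100 SXM figure quoted in this glossary.

```python
def decode_step_time_s(batch, weight_bytes, kv_bytes_per_user, bandwidth_bytes_per_s):
    """Time for one decode step, assuming memory traffic is the only cost."""
    bytes_moved = weight_bytes + batch * kv_bytes_per_user
    return bytes_moved / bandwidth_bytes_per_s

def aggregate_tokens_per_s(batch, weight_bytes, kv_bytes_per_user, bandwidth_bytes_per_s):
    """Each step yields one token per user in the batch."""
    step = decode_step_time_s(batch, weight_bytes, kv_bytes_per_user, bandwidth_bytes_per_s)
    return batch / step

WEIGHTS = 16e9       # assumed: an 8B model at FP16
KV_PER_USER = 0.5e9  # assumed KV cache per active user
BANDWIDTH = 3.35e12  # H100 SXM, bytes per second

solo = aggregate_tokens_per_s(1, WEIGHTS, KV_PER_USER, BANDWIDTH)
batched = aggregate_tokens_per_s(64, WEIGHTS, KV_PER_USER, BANDWIDTH)
per_user_batched = batched / 64

print(f"batch 1:  {solo:.0f} tok/s total")
print(f"batch 64: {batched:.0f} tok/s total, {per_user_batched:.0f} tok/s per user")
```

Aggregate throughput rises steeply while each individual user's stream slows, which is the throughput–latency trade in miniature. The KV term also shows why longer contexts hurt: a larger `KV_PER_USER` raises step time and lowers the batch that fits in memory.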
Memory bandwidth
The rate at which a GPU can move data between its high-bandwidth memory (HBM) and its compute units. H100 SXM: 3.35 TB/s. A100 SXM: 2.0 TB/s. RTX 4090: 1.0 TB/s [NVIDIA H100 datasheet, NVIDIA A100 datasheet, NVIDIA RTX 4090 specs].
The actual binding constraint for LLM inference. To generate each token, the entire model has to be read from memory at least once, so throughput scales roughly with bandwidth, not FLOPS. The H100/A100 ratios make this concrete. Raw FP16 FLOPS gap: roughly 3× (989 vs. 312 TFLOPS). Bandwidth gap: 1.68× (3.35 vs. 2.0 TB/s). On Llama 3 70B at matched precision (FP16), an H100 delivers roughly 1.5–2× the single-user throughput of an A100, tracking the bandwidth ratio far more closely than the FLOPS ratio. At FP8, which the A100 does not support, the H100 advantage on transformer inference widens to 3–4× via the Transformer Engine [Runpod H100 guide, Mar 2026]. Quantisation can shift the calculus by reducing the volume of weights that need to transit memory each step.
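A back-of-envelope ceiling follows directly from the "read every weight once per token" rule: single-user decode speed cannot exceed bandwidth divided by the bytes of weights read. The sketch below uses the datasheet figures quoted above and a 35 GB weight file (70B at INT4) as an assumed workload; it ignores KV cache reads and kernel overheads, so measured numbers will be lower.

```python
def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s, weight_bytes):
    """Upper bound on single-user decode speed when every weight is read once per token."""
    return bandwidth_bytes_per_s / weight_bytes

WEIGHT_BYTES = 35e9  # assumed: 70B parameters at INT4

h100 = decode_ceiling_tokens_per_s(3.35e12, WEIGHT_BYTES)
a100 = decode_ceiling_tokens_per_s(2.0e12, WEIGHT_BYTES)

bandwidth_ratio = h100 / a100  # independent of the model chosen
flops_ratio = 989 / 312        # FP16 TFLOPS, for comparison

print(f"H100 ceiling {h100:.1f} tok/s, A100 ceiling {a100:.1f} tok/s")
print(f"bandwidth ratio {bandwidth_ratio:.2f}x vs FLOPS ratio {flops_ratio:.2f}x")
```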
Hardware
Datacenter GPU (H100, A100)
NVIDIA's H100 (Hopper, 2022) and A100 (Ampere, 2020) are the two GPUs serving most of the world's LLM inference in 2026. Both come in SXM (server module, fast interconnect, ~700W H100 / ~400W A100) and PCIe (lower bandwidth, lower power) variants. The successor B200 (Blackwell) is shipping in volume through 2026 and roughly doubles inference throughput on 70B-class workloads via higher memory bandwidth and new FP4 tensor cores.
Prices are approximate and move: an H100 SXM runs ~$35–40k new and an A100 80GB ~$8–15k on the secondary market as of mid-2026 [IntuitionLabs pricing guide, Jarvis Labs A100 guide]. Against that gap, an H100 delivers roughly 1.5–2× the single-user throughput of an A100 on models like Llama 3 70B. Whether that ratio justifies the price gap depends on your utilisation and workload mix. At many real-world operating points the A100 wins on $/M tokens. This is the counter-intuitive result that drives the "older silicon" arbitrage in the inference market, and one reason hyperscalers have not rushed to retire their A100 fleets. See the LCOI calculator for the worked numbers.
TPU
Tensor Processing Unit, Google's custom AI silicon. Designed around pod-scale interconnects rather than single-chip performance, with high-bandwidth optical links between chips and tight integration with the JAX/XLA software stack. The fundamental design choice is to optimise for matrix multiplication at pod scale rather than per-chip headline specs, betting that real workloads scale across many chips and that interconnect bandwidth is the binding constraint at that scale.
Current production generations as of April 2026: Trillium (v6e, GA late 2024) and Ironwood (v7, GA November 2025) [Google Cloud Blog]. At Cloud Next on 22 April 2026, Google announced its eighth generation, splitting training (8t/Sunfish, Broadcom-designed) and inference (8i/Zebrafish, MediaTek-designed) silicon for the first time, both targeting TSMC 2nm with late-2027 availability [The Next Web]. The training/inference chip split signals that Google is treating inference as a separate design problem, which the memory-bandwidth constraints governing decode-phase serving make a defensible call.
Available only via Google Cloud, which makes TPU inference economics opaque. Google does not publish per-chip pricing, the relevant unit is pod-hours, and direct comparisons against NVIDIA require workload-specific benchmarks that few outside parties are equipped to run. TPUs are clearly competitive on training and on inference at very large scale. The case at the volumes most operators actually run is harder to verify from outside the Google fence.
PUE
Power Usage Effectiveness, the ratio of total facility power to IT power. PUE 1.4 means an extra 40% on top of the IT load itself goes to cooling, lighting, UPS losses, and power conversion. Hyperscale data centres run 1.1–1.2; older enterprise sites 1.5–2.0.
A multiplier on every kWh of inference power you pay for. In the LCOI breakdown it's a second-order term when CapEx dominates — at 40% utilisation with US industrial electricity, cooling is roughly 1% of unit cost. It matters more at high utilisation and in expensive electricity markets (where opex share of unit cost rises), and it's what drove the shift to direct liquid cooling for B200/H200-generation hardware. The engineering investment required to reach sub-1.2 PUE is substantial.
Serving and architecture
Quantisation
Compressing model weights from higher to lower numerical precision. A 70B-parameter model at FP16 is ~140 GB; at FP8, ~70 GB; at INT4, around 35 GB. Quantisation can run at load time (weights stored at lower precision) and sometimes during compute (operations performed at lower precision via FP8/FP4 tensor cores).
The primary lever for fitting larger models onto smaller hardware, and the most direct way to lift throughput in a memory-bandwidth-bound regime: fewer bits per weight means fewer bytes that need to traverse HBM each decode step. The bandwidth ceiling itself does not move; the demand placed on it falls. Quality loss is real but smaller than naive bit-counting suggests. Modern 4-bit quantisation schemes typically degrade aggregate benchmark scores by roughly 1–2% on instruction-tuned models, with task-specific variation: multi-step reasoning (e.g. GSM8K) is more sensitive than commonsense multiple-choice, and aggressive 3-bit settings can push degradation past 5% [Kurt 2026, Llama-3.1-8B GGUF evaluation, Huang et al. 2024, Llama 3 quantization study]. The economic implication is most visible at the consumer end: running Llama 3 70B on a 24 GB RTX 4090 at all requires INT4 quantisation at minimum, and because roughly 35 GB of INT4 weights still exceed 24 GB, it typically also needs a more aggressive quantisation level, partial offload to system memory, or a second card. The throughput penalty is severe at single-user load (a Q4 RTX 4090 produces roughly 40–50 tok/s versus 100–200 tok/s on an H100 in FP8) and widens to roughly 10× under batched serving conditions, because KV cache and batch size constraints compound the memory limitations of consumer hardware.
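For readers who want to see what "fewer bits per weight" means mechanically, here is a minimal sketch of symmetric absmax quantisation in plain Python. It is a deliberately simplified scheme with one scale for the whole tensor, and it is not the method used by GGUF or any specific library. Production schemes typically use per-group scales and more careful rounding. All names are hypothetical.

```python
def quantise(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using a single shared scale."""
    qmax = 2 ** (bits - 1) - 1
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantise(quantised, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantised]

def max_abs_error(weights, bits):
    """Worst-case round-trip error at a given bit width."""
    quantised, scale = quantise(weights, bits)
    restored = dequantise(quantised, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))

weights = [0.12, -0.5, 0.33, 0.07]  # toy values, not real model weights
q4, scale4 = quantise(weights, bits=4)

print("4-bit integers:", q4)
print("max error at 4 bits:", max_abs_error(weights, 4))
print("max error at 8 bits:", max_abs_error(weights, 8))
```

Storage falls in proportion to bit width while the round-trip error grows as the integer grid coarsens, which is the trade-off the benchmark figures above are measuring.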
KV cache
A working memory that lets transformer models generate long outputs without redoing past work. When the model produces each new token, it has to attend to every prior token in the sequence. The KV cache (key-value cache) stores the intermediate attention tensors for each already-processed token, so generating token N only requires fresh computation for token N rather than redoing the work for tokens 1 through N−1. The cache lives in GPU VRAM and grows linearly with context window length and the number of active users.
The real bottleneck for serving concurrent users. For a 70B model at 32k context, KV cache runs roughly 5–15 GB per active user, depending on architecture (grouped-query attention reduces it significantly). VRAM not occupied by parameters gets split among active users, which caps how many users a single GPU can serve simultaneously. That cap sets effective batch size, which sets utilisation, which sets LCOI. KV-cache management is the most consequential inference-engineering work of the past two years and is why open-source serving systems like vLLM and TensorRT-LLM dominate self-hosting decisions.
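The per-user figure can be reproduced from the model's shape. The sketch below assumes a Llama-3-70B-like configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) with FP16 cache entries. The 640 GB node and the function names are illustrative assumptions, and activations and allocator overhead are ignored.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """One key tensor and one value tensor are stored per layer for every token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache for one user at a given context length, in decimal gigabytes."""
    per_token = kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value)
    return per_token * context_tokens / 1e9

def max_concurrent_users(vram_gb, weights_gb, kv_gb_per_user):
    """Whole users that fit in the VRAM left over after the weights."""
    return int((vram_gb - weights_gb) // kv_gb_per_user)

# Assumed shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache, 32k context.
per_user = kv_cache_gb(32_768, n_layers=80, n_kv_heads=8, head_dim=128)

# Assumed node: 8 x 80 GB GPUs hosting 140 GB of FP16 weights.
users = max_concurrent_users(vram_gb=640, weights_gb=140, kv_gb_per_user=per_user)

print(f"{per_user:.2f} GB per user, {users} concurrent users at full context")
```

Setting `n_kv_heads` to 64 (no grouped-query attention) multiplies the per-user figure by eight, which is the architectural sensitivity noted above.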
Mixture of Experts (MoE)
An architecture where the model contains many specialised "expert" subnetworks but routes each token through only a small subset, typically two to eight experts out of dozens or hundreds. A 220B-parameter MoE with 20B parameters active per token has the memory footprint of 220B but reads only ~20B worth of weights from memory per forward pass.
Decouples model capacity from per-token serving cost. The economic catch: all expert weights have to reside in GPU memory to be selectable, even though only a fraction are read per token. MoE therefore shifts the binding constraint toward HBM capacity. Where dense models hit a memory-bandwidth wall, MoE hits a memory-capacity wall first. Per token, MoE is cheap because only the active experts' weights traverse memory bandwidth at decode time; in aggregate, MoE is expensive to host because the inactive experts still occupy VRAM. This is the architectural pattern that lets frontier models keep growing in capability without proportional growth in inference cost, and it is why MoE is now dominant above roughly 100B parameters.
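The capacity-versus-bandwidth split is easy to quantify. This sketch takes the 220B-total, 20B-active example from the definition and assumes FP8 storage (one byte per parameter). The function name is hypothetical, and the per-token figure applies to a single token: in batched serving, different tokens route to different experts, so more of the model is read per step.

```python
def moe_memory_profile(total_params, active_params, bytes_per_param):
    """Return (GB resident in memory, GB read per token) for an MoE model."""
    resident_gb = total_params * bytes_per_param / 1e9
    read_per_token_gb = active_params * bytes_per_param / 1e9
    return resident_gb, read_per_token_gb

# 220B total / 20B active, at an assumed 1 byte per parameter (FP8).
resident, read = moe_memory_profile(220e9, 20e9, bytes_per_param=1.0)

# A dense model of the same total size reads everything it stores.
dense_resident, dense_read = moe_memory_profile(220e9, 220e9, bytes_per_param=1.0)

print(f"MoE:   {resident:.0f} GB resident, {read:.0f} GB read per token")
print(f"Dense: {dense_resident:.0f} GB resident, {dense_read:.0f} GB read per token")
```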
Decoder-only
A transformer architecture that generates tokens one at a time, each conditioned on all preceding tokens, with no separate "encoder" stage. GPT, Llama, Claude, and Gemini are all decoder-only. Contrasts with encoder-decoder designs (the original transformer, T5) and encoder-only models (BERT).
Won the LLM race because next-token prediction on raw text is a simple, scalable training objective and most NLP tasks can be reframed as generation problems. The economic consequence: most inference cost models, including the LCOI calculator, implicitly assume the decoder-only serving pattern (autoregressive decode, KV cache, prefill/decode split). Encoder-decoder and encoder-only architectures still serve narrow applications (translation, embedding, classification) but do not shape the inference-cost conversation.
Industry and infrastructure
Hyperscaler
Cloud providers operating data centres at multi-gigawatt scale: AWS, Microsoft Azure, Google Cloud, Oracle. The "neoclouds" (CoreWeave, Lambda, Crusoe) are smaller and AI-specialised, growing fast on inference demand but operating at a fraction of hyperscaler footprint. Operationally distinguished by buying GPUs in five- and six-figure lots, signing long-term power purchase agreements, and building or leasing purpose-built data centre capacity measured in hundreds of megawatts per campus.
They dominate the inference supply curve, and their cost base, not retail GPU prices and consumer electricity tariffs, is what self-hosters are really benchmarking against. Bulk GPU procurement at meaningful discounts to list price, PUE near 1.1 from modern liquid-cooled facilities, and pooled customer demand pushing effective utilisation well above what self-hosters can achieve give them structural unit-cost advantages no smaller operator can match. This is why hosted-API prices can undercut naive self-hosted economics: what looks like a thin margin at the surface is built on deep infrastructure leverage. See LCOI.
Co-location
Renting space, power, and cooling in someone else's data centre while owning the hardware yourself. Distinct from cloud (provider owns the hardware) and on-premise (you own everything, including the building).
The middle ground for serious self-hosters. Avoids the capital cost of building a facility and the per-GPU markup on cloud rental. Most non-hyperscaler AI operators running their own GPUs do it via "colo" with providers like Equinix, Digital Realty, or Crusoe. Power availability is now the primary binding constraint on new colo capacity: pre-built shells with committed grid connections command significant premiums in major markets.
Economics and demand
Inference (vs training)
Inference is running a trained model on new input: it is what happens every time someone sends a prompt. Training is the (much larger) one-time process of producing the model weights. A frontier model might cost $100M+ to train once and fractions of a cent per query to serve thereafter.
The spending mix has shifted. Through 2022, AI compute spending was dominated by training: a one-time capital outlay, planned in advance, executed by a small number of labs. By 2026, inference is the larger and faster-growing share of GPU demand. Training compute is bounded by the number of frontier runs being executed; inference grows with end-user volume, which keeps climbing as agentic workloads expand token consumption per task by orders of magnitude over simple chat.
LCOE
Levelized Cost of Energy, the power sector's standard tool for comparing generation assets. Total lifetime cost of building and operating a power plant, divided by total lifetime energy output, expressed in $/MWh. Used to put nuclear, solar, gas, and wind on a common basis.
The direct ancestor of LCOI. The logic is identical (capital amortisation plus operating costs, divided by cumulative output) applied to compute instead of electrons. The analogy is tight except for timescales: power generation assets amortise over 25–40 years; GPUs over three. That compressed window is why CapEx so thoroughly dominates inference unit cost in the LCOI breakdown. You do not have decades for the depreciated value to fade into the background.
LCOI
Levelized Cost of Inference, the all-in cost of producing one million output tokens from a self-hosted model, amortised over the life of the hardware. Direct analogy to LCOE in the power sector. Three annual buckets: capital (GPU purchase price divided by useful life in years), electricity (TDP × utilisation × hours per year × power tariff), and cooling (electricity × (PUE − 1)). Divide total annual cost by annual token output and scale to per million tokens.
Useful as a benchmark against hosted-API prices, and as a diagnostic for what drives unit cost. Capital cost dominates the breakdown at typical operating utilisations; electricity and cooling are second-order. The mechanism is amortisation horizon — three years for a GPU against decades for the power assets underlying LCOE. That structure explains why utilisation matters more than power tariffs, why used A100s often beat new H100s on $/M tokens, and why hyperscalers can undercut naive self-hosted economics by running fleets at higher utilisation than self-hosters can sustain. See the calculator and the assumptions page.
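The three-bucket calculation can be sketched in a few lines. This is not the calculator's actual implementation: the function name, the tariff, and the throughput figure are assumptions for illustration, and it treats energy use and token output as both scaling linearly with utilisation. The GPU price is the midpoint of the H100 range quoted in this glossary.

```python
HOURS_PER_YEAR = 8760

def lcoi(gpu_price_usd, life_years, tdp_kw, utilisation,
         tariff_usd_per_kwh, pue, tokens_per_s_at_full_load):
    """Annual cost buckets and the resulting $ per million output tokens."""
    capital = gpu_price_usd / life_years
    electricity = tdp_kw * utilisation * HOURS_PER_YEAR * tariff_usd_per_kwh
    cooling = electricity * (pue - 1)
    total = capital + electricity + cooling
    annual_tokens = tokens_per_s_at_full_load * utilisation * HOURS_PER_YEAR * 3600
    return {
        "capital_share": capital / total,
        "electricity_share": electricity / total,
        "cooling_share": cooling / total,
        "usd_per_m_tokens": total / (annual_tokens / 1e6),
    }

result = lcoi(
    gpu_price_usd=37_500,             # midpoint of the H100 SXM range above
    life_years=3,
    tdp_kw=0.7,                       # ~700 W H100 SXM
    utilisation=0.4,
    tariff_usd_per_kwh=0.08,          # assumed industrial tariff
    pue=1.4,
    tokens_per_s_at_full_load=1_000,  # assumed aggregate throughput
)

for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

Under these assumptions capital is well over 90% of the total, so doubling utilisation roughly halves `usd_per_m_tokens`, while halving the tariff barely moves it.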
Agentic workflow
An application pattern where the LLM runs in a loop, calling tools, reading results, planning next steps, writing code, observing output, rather than producing a single response. Cursor, Claude Code, browser agents, and autonomous coding tools are the current generation.
The single largest demand-side driver of inference growth in 2025–2026. A human chatting consumes a few thousand tokens per turn. Bai et al. (2026) find that agentic coding tasks consume roughly 1000× more tokens than chat or one-shot reasoning, with input tokens (not output) dominating cost, and run-to-run variance up to 30× on the same task [arXiv:2604.22750]. At Anthropic's Series G in February 2026, Claude Code (an agentic coding product) had a run-rate revenue of $2.5B against the company's total $14B run-rate — roughly 18%, more than doubled in the prior six weeks [Anthropic, Feb 2026]. Agentic workloads also reshape model selection. Agents are price-sensitive in ways human users are not, because total cost is dominated by volume rather than per-query quality. The result is Jevons' paradox in direct form: per-token costs have fallen sharply through 2024–2025, but per-task token consumption has expanded fast enough that total inference spend keeps rising.