Cerebras
Tokenomics analysis of the Cerebras WSE-3 for inference with leading open source models. Source: SemiAnalysis Tokenomics team.
Default (96.3k) is the P50 input sequence length from our internal testing across Claude Code, Codex, Cursor, OpenCode, and Pi. Output tokens are derived from the workload mix in the Cost to serve panel below.
Hardware cost per Mtok = (systems × $41.96/hr) ÷ cluster throughput (Mtok/hr). Each concurrent request splits its wall time between output decode (at the Interactivity rate) and non-output tokens (cache reads, cache writes, and input prefill, all processed at the Cache read throughput).
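A minimal Python sketch of that formula, assuming the time-splitting model described above. The $41.96/hr system rate comes from the text; every value in the example call (systems, concurrency, rates, output mix) is a hypothetical placeholder, not a calculator default.

```python
SYSTEM_RATE_USD_PER_HR = 41.96  # WSE-3 system $/hr quoted above

def hardware_cost_per_mtok(
    systems: int,
    concurrent_requests: int,
    interactivity_tok_s: float,  # output decode rate per request
    cache_read_tok_s: float,     # rate for cache read/write/prefill tokens
    output_mix: float,           # fraction of all tokens that are output tokens
) -> float:
    """$ per million tokens, blended across all token types."""
    non_output_mix = 1.0 - output_mix
    # Wall time per token for one request, split between decode and
    # non-output work as described in the note above.
    sec_per_token = output_mix / interactivity_tok_s + non_output_mix / cache_read_tok_s
    per_request_tok_s = 1.0 / sec_per_token
    cluster_mtok_per_hr = concurrent_requests * per_request_tok_s * 3600 / 1e6
    return systems * SYSTEM_RATE_USD_PER_HR / cluster_mtok_per_hr

# Hypothetical example: 4 systems, 32 concurrent requests, 1,200 tok/s decode,
# 10x-faster cache reads, 5% of tokens are output -> about $0.18/Mtok.
print(hardware_cost_per_mtok(4, 32, 1_200, 12_000, 0.05))
```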
Note on interactivity: Interactivity (output decode rate) and Cache read throughput both feed the hardware cost. In the market, interactivity also drives the selling-price tier (faster = premium), but this calculator leaves selling price under your control.
Note on cache read throughput: We assume cache read throughput is 5x to 20x the output decode rate. That ratio is reflected in market pricing: across the listed providers, cache-read tokens are typically priced at roughly 1/10 to 1/20 of the output-token rate.
Inputs from the first card: the Avg ISL and Concurrent requests values you set above feed directly into this cost calculation. As ISL and concurrent requests grow, the output throughput each request can sustain drops (more KV bandwidth contention per decode step), so the interactivity here is the effective rate after that contention.
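To illustrate that contention effect, here is a toy, bandwidth-bound decode model. This is purely a sketch under assumed numbers (the bandwidth, weight bytes, and KV bytes per token below are all hypothetical), not the calculator's internal model; it only shows the direction of the effect.

```python
def effective_interactivity(
    mem_bw_bytes_s: float,      # aggregate memory bandwidth (hypothetical)
    active_param_bytes: float,  # bytes of weights streamed per decode step
    kv_bytes_per_token: float,  # KV-cache bytes per cached token (hypothetical)
    isl_tokens: int,            # average input sequence length
    concurrency: int,           # concurrent requests sharing the system
) -> float:
    """Per-request decode tok/s under a simple bandwidth-bound assumption."""
    # Each batched decode step streams the active parameters once plus
    # every concurrent request's KV cache over the ISL.
    kv_bytes_per_step = concurrency * kv_bytes_per_token * isl_tokens
    steps_per_sec = mem_bw_bytes_s / (active_param_bytes + kv_bytes_per_step)
    # In batched decode, every request advances one token per step.
    return steps_per_sec

# All numbers hypothetical; only the trend matters: longer ISL (or more
# concurrency) inflates the KV term and drags per-request decode rate down.
for isl in (8_000, 96_300, 200_000):
    print(isl, round(effective_interactivity(1e15, 10e9, 70e3, isl, 32)))
```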
Assumes prefill runs on Cerebras at the WSE-3 $/hr above. In practice, prefill can be offloaded to a separate GPU or Trainium fleet at a lower $/hr, which would push the hardware cost per Mtok below what this calculator shows.
| Token type | $/Mtok | Mix (%) | Token spend ($) | Token spend (%) |
|---|---|---|---|---|
| Cache read | $ | | $0.166 | 55.8% |
| Cache write | $ | | $0.082 | 27.7% |
| Input | $ | | $0.007 | 2.4% |
| Output | $ | | $0.042 | 14.1% |
| Total | | 100% | $0.297 | 100.0% |
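A quick arithmetic check of the spend columns, using the rounded per-type spends from the table (shares differ from the table's by at most 0.1 point because the inputs are rounded to three decimals):

```python
# Per-type spends sum to the $0.297 total; each share is spend / total.
spend = {"cache_read": 0.166, "cache_write": 0.082, "input": 0.007, "output": 0.042}
total = sum(spend.values())  # 0.297
shares = {k: 100 * v / total for k, v in spend.items()}
print(f"${total:.3f}", {k: f"{v:.1f}%" for k, v in shares.items()})
# -> $0.297 {'cache_read': '55.9%', 'cache_write': '27.6%', 'input': '2.4%', 'output': '14.1%'}
```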
Where the workload mix comes from
Token-type breakdown observed across our internal workload and spend on the leading coding assistants (Claude Code, Codex, Cursor, OpenCode, Pi). Cache reads dominate by volume, but cache writes and output tokens disproportionately drive spend relative to their share of volume.
Where our ISL assumption comes from
Input sequence length distribution across agentic coding harnesses (Claude Code, Codex, Cursor, OpenCode, Pi). P50 lands at ~96.3k tokens.
Where our OSL assumption comes from
Output sequence length across the same harnesses. P50 lands at ~213 tokens: most turns are short replies.
Where our interactivity assumption comes from
Interactivity (output tok/s) on Cerebras: smaller models go faster, larger models go slower.
Source: Artificial Analysis: Cerebras provider page.
Fit to interactivity ≈ 3007 / active_params_B^0.234, using gpt-oss 120B (2,059 tok/s) and GLM 4.7 (1,201 tok/s) as anchors. DeepSeek V3/V4 are bumped slightly above the curve to reflect MLA's smaller per-step KV bandwidth. (A quick evaluation of the fit follows the table.)
| Model | Active params | Interactivity (tok / sec) |
|---|---|---|
| DeepSeek V4 | 80 B | 1,150 |
| Kimi K2.6 | 32 B | 1,400 |
| gpt-oss 120B | 5.1 B | 2,059 |
| GLM 4.7 | 32 B | 1,201 |
| DeepSeek V3 | 37 B | 1,350 |
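Evaluating the rounded fit against the table is a one-liner. Because the published constants are rounded, the anchor models reproduce only approximately, and the table values include manual adjustments such as the MLA bump:

```python
# Power-law fit from the text: interactivity ~ 3007 / active_params_B^0.234
def fit_interactivity(active_params_b: float) -> float:
    return 3007 / active_params_b ** 0.234

for name, params in [("gpt-oss 120B", 5.1), ("GLM 4.7", 32), ("DeepSeek V4", 80)]:
    print(f"{name}: {fit_interactivity(params):,.0f} tok/s")
# gpt-oss 120B: 2,054 tok/s  (table: 2,059)
# GLM 4.7: 1,336 tok/s       (table: 1,201)
# DeepSeek V4: 1,078 tok/s   (table: 1,150, bumped above the curve for MLA)
```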
Cerebras WSE-3 vs other chips
Comparing chip and system specs.