1. Latency dominance (especially LLM inference)
- SRAM access: ~0.3–1 ns
- HBM access (effective): ~50–100 ns
- DDR access: 100+ ns
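To make the gap concrete, here is a minimal back-of-envelope sketch in Python. The latencies are rough midpoints of the ranges above; the chain of one million dependency-serialized reads is a hypothetical workload, chosen only to show how the gap compounds when accesses cannot be overlapped, as in autoregressive decoding.

```python
# Back-of-envelope: how access latency compounds over dependent reads.
# The latencies are rough midpoints of the ranges listed above; the
# access count is a made-up illustration, not a measured workload.

LATENCY_NS = {
    "SRAM": 0.7,   # ~0.3–1 ns
    "HBM": 75.0,   # ~50–100 ns effective
    "DDR": 120.0,  # 100+ ns
}

DEPENDENT_ACCESSES = 1_000_000  # hypothetical chain of serialized reads

for mem, ns in LATENCY_NS.items():
    total_ms = DEPENDENT_ACCESSES * ns / 1e6
    print(f"{mem:>4}: {total_ms:8.2f} ms of pure access latency")
```

Caches and prefetching hide part of this in practice, but long dependent chains are exactly where DRAM latency stops hiding.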
2. Energy efficiency
Approximate energy per access:
- SRAM: ~0.1–1 pJ/bit
- HBM: ~3–5 pJ/bit
- DDR: 10+ pJ/bit
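A quick sketch of what those per-bit figures mean for a decoded token, assuming a hypothetical 7B-parameter INT8 model whose weights are read once per token. The energy-per-bit values are representative points from the ranges above, not measurements of any specific chip.

```python
# Back-of-envelope: energy spent on weight movement alone for one decoded token.
# Energy-per-bit values are representative points from the ranges above;
# the model size and precision are hypothetical.

PJ_PER_BIT = {"SRAM": 0.5, "HBM": 4.0, "DDR": 10.0}

params = 7e9         # hypothetical 7B-parameter model
bits_per_param = 8   # assume INT8 weights
bits_moved = params * bits_per_param  # each weight read once per token

for mem, pj in PJ_PER_BIT.items():
    joules = bits_moved * pj * 1e-12
    print(f"{mem:>4}: ~{joules:.3f} J per token for weight movement alone")
```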
3. Deterministic performance
- No DRAM scheduling, refresh, or bank conflicts
- Enables cycle-accurate pipelines (important for real-time systems)
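The determinism argument reduces to simple arithmetic: with a static schedule and no DRAM in the path, latency is exactly cycles divided by clock frequency. The clock rate and cycle count below are hypothetical illustrations, not figures from any real chip.

```python
# Minimal sketch: with only on-chip SRAM and a static schedule, latency is
# exactly cycles / frequency, so worst case equals best case (zero jitter).
# Clock frequency and cycle count are hypothetical, not from any real chip.

CLOCK_HZ = 1.0e9            # assumed 1 GHz clock
SCHEDULED_CYCLES = 250_000  # cycle count fixed at compile time by the schedule

best_case_us = SCHEDULED_CYCLES / CLOCK_HZ * 1e6
worst_case_us = best_case_us  # no refresh stalls, bank conflicts, or cache misses

print(f"best case:  {best_case_us:.1f} µs")
print(f"worst case: {worst_case_us:.1f} µs (jitter: {worst_case_us - best_case_us:.1f} µs)")
```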
| Chip class | On-chip SRAM |
|---|---|
| Mobile NPU | 4–32 MB |
| Edge inference ASIC | 32–128 MB |
| Datacenter inference ASIC | 100–300 MB |
| Wafer-scale (Cerebras) | 10s of GB |
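These capacities set a hard ceiling on what can run without touching off-chip memory. The sketch below checks whether a hypothetical 100M-parameter INT8 model fits in each class, using the upper ends of the listed ranges and an illustrative 40 GB for wafer-scale.

```python
# Quick check: does a given model's weight footprint fit entirely in on-chip
# SRAM for each chip class above? Capacities use the upper ends of the listed
# ranges (and an illustrative 40 GB for wafer-scale); the model is hypothetical.

SRAM_BYTES = {
    "Mobile NPU": 32e6,
    "Edge inference ASIC": 128e6,
    "Datacenter inference ASIC": 300e6,
    "Wafer-scale (Cerebras)": 40e9,
}

params = 100e6       # hypothetical 100M-parameter model
bits_per_param = 8   # assume INT8 weights
model_bytes = params * bits_per_param / 8

for chip, cap in SRAM_BYTES.items():
    verdict = "fits" if model_bytes <= cap else "does NOT fit"
    print(f"{chip:<26} ({cap/1e6:>7.0f} MB): {model_bytes/1e6:.0f} MB model {verdict}")
```

Large LLMs obviously exceed a few hundred MB, which is why pure-SRAM designs typically shard a model across many chips or target smaller, fixed models.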
Famous examples (and what they optimized for)
Groq
- All on-chip SRAM
- Static schedule, no caches
- Very low per-token latency
- Limited flexibility and capacity
Google TPU v1–v3
- Large SRAM buffers
- Matrix-centric workloads
- Inference-only in v1; training + inference from v2/v3
Cerebras
- Wafer-scale SRAM + compute
- Avoids off-chip memory for models that fit in on-wafer SRAM
- Extreme cost, extreme performance for certain models
When on-chip SRAM AI ASICs are the right answer
- Ultra-low latency LLM inference
- Real-time systems (finance, robotics, telecom)
- Edge or power-constrained environments
- Predictable workloads with known model shapes