
AI Chip Design and Custom Silicon: Comprehensive Comparison Report (2026)

This report compiles a comprehensive comparison based on AI chip developments through early 2025, noting where information is confirmed versus projected.

Dhawal Chheda, AI Leader at Accel4



1. NVIDIA Blackwell Architecture (B200 / B300)

B200 (Shipping since late 2024)

| Metric | Specification |
| --- | --- |
| Process | TSMC 4NP, dual-die design |
| Transistors | ~208 billion |
| FP8 Performance | 9 PFLOPS (per GB200 system: 2 GPUs + 1 Grace CPU) |
| FP4 Performance | 18 PFLOPS (per GB200 system) |
| HBM3e Memory | 192 GB per GPU |
| Memory Bandwidth | 8 TB/s per GPU |
| TDP | ~1,000 W per GPU |
| NVLink 5 | 1.8 TB/s bidirectional |
| Key Feature | 2nd-gen Transformer Engine with FP4 support |
| Availability | Generally available in NVL72 rack-scale systems |

NVIDIA claimed a 4x training speedup and 30x inference speedup over H100 for large language models. The GB200 NVL72 rack integrates 72 GPUs and 36 Grace CPUs into a single liquid-cooled system with up to 13.5 TB of unified HBM3e memory.
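
As a sanity check on those rack-level numbers, here is a back-of-the-envelope sketch that simply multiplies the per-GPU figures quoted above; NVIDIA's official rack-level figures may differ slightly (sparsity, usable-memory accounting), but the quoted 13.5 TB pool falls out directly when the 72-GPU total is expressed in binary terabytes.

```python
# Rack-level aggregates derived purely from the per-GPU figures quoted above.
GPUS_PER_RACK = 72
GB200_SUPERCHIPS_PER_RACK = 36           # each pairs 2 GPUs with 1 Grace CPU
HBM_PER_GPU_GB = 192
HBM_BW_PER_GPU_TB_S = 8
FP4_PFLOPS_PER_GB200 = 18                # naive sum; ignores sparsity and SKU differences

total_hbm_gb = GPUS_PER_RACK * HBM_PER_GPU_GB
total_bw_tb_s = GPUS_PER_RACK * HBM_BW_PER_GPU_TB_S
total_fp4_pflops = GB200_SUPERCHIPS_PER_RACK * FP4_PFLOPS_PER_GB200

print(f"HBM3e pool:       {total_hbm_gb} GB (~{total_hbm_gb / 1024:.1f} TiB)")  # 13824 GB, ~13.5 TiB
print(f"Aggregate HBM BW: {total_bw_tb_s} TB/s")                                # 576 TB/s
print(f"Naive FP4 sum:    {total_fp4_pflops} PFLOPS")                           # 648 PFLOPS
```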

B300 (Announced GTC 2025)

| Metric | Specification |
| --- | --- |
| HBM3e Memory | 288 GB per GPU (up from 192 GB) |
| Memory Bandwidth | ~12 TB/s (estimated improvement) |
| FP4 Performance | Significant uplift over B200 |
| Form Factor | GB300 NVL72 rack-scale |
| Availability | Expected H2 2025 / early 2026 ramp |

The B300 is an enhanced Blackwell variant, not a new architecture. The key improvement is the move to higher-capacity HBM3e stacks (12-Hi), enabling 288 GB per GPU. Jensen Huang also previewed the Vera Rubin architecture (successor to Blackwell) at GTC 2025, expected in 2026-2027.

Software Ecosystem

CUDA remains the dominant ecosystem. NVIDIA’s moat is not just hardware but the CUDA toolkit, cuDNN, TensorRT, TensorRT-LLM, Triton Inference Server, NeMo, and the broader ecosystem. Virtually all ML frameworks (PyTorch, JAX, etc.) have first-class CUDA support.
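
A minimal PyTorch sketch of what "first-class CUDA support" looks like in practice: the same model code targets whatever CUDA device is present and falls back to CPU otherwise. The layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Vendor-neutral model code; PyTorch dispatches to CUDA kernels when a GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).to(device)
x = torch.randn(8, 4096, device=device)

# Mixed precision via autocast; bf16 is widely supported on recent data-center GPUs.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)

print(y.shape, y.dtype)
```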

Pricing: B200 GPUs estimated at $30,000-$40,000 per unit; GB200 NVL72 racks reportedly $2-3 million+.


2. AMD CDNA 4 / MI400 Series

MI300X (Shipping, CDNA 3)

| Metric | Specification |
| --- | --- |
| Process | TSMC 5nm/6nm chiplet design |
| HBM3 Memory | 192 GB |
| Memory Bandwidth | 5.3 TB/s |
| FP8 Performance | 2.6 PFLOPS |
| FP16 Performance | 1.3 PFLOPS |
| TDP | 750 W |
| Interconnect | Infinity Fabric |

The MI300X has gained significant traction with cloud providers (Microsoft Azure, Oracle Cloud) and has been adopted by several hyperscalers for inference workloads due to its large memory capacity.

MI350 (Announced, CDNA 4 generation)

| Metric | Specification |
| --- | --- |
| Process | TSMC 3nm |
| Expected Performance | ~35x inference performance over MI300X (AMD's claim for FP4/FP6) |
| New Data Types | FP4, FP6 support |
| Availability | Expected mid-2025 |

MI400 (CDNA “Next”)

AMD has disclosed that an MI400 series is on the roadmap, expected to follow MI350. Details are sparse, but it is expected to leverage more advanced packaging and potentially HBM4.

Estimated availability: Late 2026 or 2027.

Software Ecosystem

ROCm has improved substantially but still trails CUDA in maturity. PyTorch has official ROCm support. JAX support exists but is less mature. Key pain points remain in library compatibility, debugging tools, and third-party framework support. AMD has invested heavily in closing the gap, including hiring and open-source contributions.
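
One concrete sign of the narrowing gap: official ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda API, so most CUDA-targeted code runs unchanged. A minimal portability check, as a sketch:

```python
import torch

# On a ROCm build of PyTorch, AMD GPUs (e.g. MI300X) appear under the familiar
# torch.cuda namespace; torch.version.hip distinguishes the backend.
if torch.cuda.is_available():
    backend = "ROCm/HIP" if torch.version.hip is not None else "CUDA"
    print(f"Accelerator backend: {backend}")
    print(f"Device 0: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU visible to PyTorch; running on CPU.")
```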

Pricing: MI300X is priced competitively below the H100/B200, typically 20-30% less for comparable SKUs.


3. Google TPU v6 (Trillium)

| Metric | Specification |
| --- | --- |
| Generation | TPU v6e ("Trillium"), announced mid-2024 |
| Performance | ~4.7x improvement in peak compute per chip over TPU v5e |
| HBM | Significant increase over v5e |
| Interconnect | ICI (Inter-Chip Interconnect) for pod-scale training |
| Availability | Google Cloud only |
| Pod Scale | Up to 256 chips per pod slice |

Google’s TPU strategy focuses on tight integration with its JAX/XLA software stack and Google Cloud infrastructure. TPUs are not sold as discrete chips but consumed as cloud services.
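
A minimal JAX sketch of the workflow the TPU stack is built around: functions are jit-compiled and lowered through XLA, the same compiler path used to target TPU pods. The shapes are placeholders, and on a machine without TPUs JAX simply compiles for CPU or GPU instead.

```python
import jax
import jax.numpy as jnp

# jit traces the function once and lowers it through XLA; on Cloud TPU the same
# code compiles to TPU executables without modification.
@jax.jit
def ffn(x, w1, w2):
    return jnp.maximum(x @ w1, 0.0) @ w2

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(k1, (8, 1024))
w1 = jax.random.normal(k2, (1024, 4096))
w2 = jax.random.normal(k3, (4096, 1024))

print("Devices visible to JAX:", jax.devices())
print("Output shape:", ffn(x, w1, w2).shape)
```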

TPU v6 (Full, non-“e” variant)

A full TPU v6 (beyond the efficiency-focused v6e) has been anticipated but details are limited in public disclosures. Google tends to deploy internally first (for Gemini model training) before making available on Cloud.

Software Ecosystem

  • JAX is the primary framework, with XLA compilation
  • PyTorch/XLA provides PyTorch compatibility but adds complexity
  • Vertex AI integration for managed training/serving
  • The ecosystem is strong but Google Cloud-locked; no on-premise option

Pricing: Competitive on a $/FLOP basis for large-scale training when using committed use discounts. TPU v6e pricing is roughly $1.50-$3.00/chip/hour depending on region and commitment.


4. Groq LPU (Language Processing Unit)

| Metric | Specification |
| --- | --- |
| Architecture | TSP (Tensor Streaming Processor), deterministic dataflow |
| Process | 14nm (first generation) |
| SRAM | 230 MB on-chip (no HBM) |
| Key Advantage | Ultra-low-latency inference, deterministic execution |
| Performance | ~750 tokens/second per user for Llama-2 70B (demonstrated) |
| Target | Inference only (not training) |

Groq’s approach is architecturally unique: a fully deterministic, software-scheduled dataflow processor with no caches, no branch prediction, and no out-of-order execution. This yields extremely predictable, low-latency inference.
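
To put the demonstrated throughput in perspective, a quick arithmetic sketch converts the per-user token rate from the table above into per-token latency, which is where the deterministic design pays off:

```python
# Per-user decode latency implied by the demonstrated figure above
# (a vendor demo, not an independent benchmark).
tokens_per_second = 750                                            # Llama-2 70B, per user
print(f"~{1000 / tokens_per_second:.2f} ms per generated token")   # ~1.33 ms
print(f"500-token response in ~{500 / tokens_per_second:.2f} s")   # ~0.67 s
```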

GroqCloud

Groq operates a cloud inference service (GroqCloud) that has demonstrated best-in-class latency for LLM inference. They have served models like Llama 3, Mixtral, and Gemma.

Limitations

  • Inference only – cannot train models
  • Model size constrained by SRAM capacity; very large models require multi-chip configurations
  • 14nm process limits power efficiency compared to leading-edge nodes
  • Limited availability – primarily through GroqCloud API

Groq has announced next-generation chips on more advanced process nodes, which would address power efficiency concerns.

Pricing: GroqCloud API pricing has been very aggressive, sometimes 5-10x cheaper per token than competitors for supported models.


5. Cerebras WSE-3 (Wafer-Scale Engine 3)

| Metric | Specification |
| --- | --- |
| Die Size | Full wafer (~46,225 mm²) |
| Transistors | 4 trillion |
| Cores | 900,000 AI-optimized cores |
| On-chip SRAM | 44 GB |
| Memory Bandwidth | 21 PB/s on-chip |
| Process | TSMC 5nm |
| FP16 Performance | ~125 PFLOPS (claimed) |
| System | CS-3 (with MemoryX and SwarmX) |

Cerebras takes the most radical approach: an entire wafer as a single chip. The WSE-3 eliminates off-chip memory bottlenecks by having 44 GB of on-chip SRAM with enormous aggregate bandwidth.

Key Developments

  • Cerebras Inference: Launched cloud inference service demonstrating 1,800+ tokens/second for Llama 3.1 70B
  • Training at scale: Demonstrated training of models up to 100B+ parameters using MemoryX (external memory) and SwarmX (multi-system clustering)
  • CS-3 System: A single CS-3 replaces clusters of GPUs for many workloads
  • IPO Plans: Cerebras filed for IPO, indicating business traction

Limitations

  • Extremely expensive per-system cost ($2-5 million estimated per CS-3)
  • Custom software stack (Cerebras SDK), though PyTorch compatibility has improved
  • Limited to Cerebras cloud or on-premise CS-3 systems
  • Yield and manufacturing complexity of wafer-scale chips

6. Etched Sohu

| Metric | Specification |
| --- | --- |
| Architecture | ASIC purpose-built exclusively for Transformer inference |
| Process | TSMC 4nm |
| Key Claim | 500,000+ tokens/second for Llama 3 70B (8-chip server) |
| Memory | 144 GB HBM3e per chip |
| Approach | Transformer operations hardcoded in silicon |

Etched is a startup that has taken the most aggressive specialization approach: burning the Transformer architecture directly into silicon, eliminating the overhead of general-purpose compute. The Sohu chip does one thing – Transformer inference – but does it at extreme throughput.

Key Claims (as of announcements)

  • 10x+ better performance/$ than NVIDIA B200 for Transformer inference
  • Supports any model that is a Transformer (various sizes)
  • 8-chip Sohu server claimed to outperform 160 H100 GPUs
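
Taking those claims at face value, a short arithmetic sketch shows what they imply per chip; these are the vendor's numbers, not independent benchmarks:

```python
# Per-chip figures implied by Etched's public claims.
claimed_server_tokens_per_sec = 500_000     # Llama 3 70B, 8-chip Sohu server
chips_per_server = 8
claimed_h100_equivalents = 160              # "outperforms 160 H100 GPUs"

print(f"~{claimed_server_tokens_per_sec / chips_per_server:,.0f} tokens/s per chip (claimed)")
print(f"~{claimed_h100_equivalents / chips_per_server:.0f} H100s replaced per chip (claimed)")
```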

Risks

  • Architecture lock-in: If the field moves beyond Transformers (e.g., state-space models, RWKV, next-gen architectures), Sohu becomes obsolete
  • Startup risk: Pre-revenue as of last public information
  • Software ecosystem: Nascent; must build tooling from scratch
  • Availability: Announced but limited shipping information

7. MatX

| Metric | Detail |
| --- | --- |
| Founded By | Former Google TPU engineers |
| Focus | Custom AI inference accelerator |
| Approach | Highly efficient ASIC for LLM inference |
| Status | Stealth/early-stage as of last reports |

MatX has been relatively secretive. Founded by engineers from the Google TPU team, the company is building custom silicon optimized for large language model inference. Limited public specifications are available.

Key differentiator is reportedly extreme memory bandwidth efficiency and cost-optimized inference, targeting the growing demand for cheap, high-throughput LLM serving.


8. Intel Gaudi 3

| Metric | Specification |
| --- | --- |
| Process | TSMC 5nm |
| FP8 Performance | 1,835 TFLOPS |
| HBM2e Memory | 128 GB |
| Memory Bandwidth | 3.7 TB/s |
| TDP | ~600 W |
| Ethernet Networking | 24x 200 GbE RoCE built-in |
| Key Feature | Native Ethernet networking (no proprietary interconnect) |

Context

Intel acquired Habana Labs (Gaudi’s designer) in 2019. Gaudi 3 represents a significant leap from Gaudi 2 (roughly 2x FP8 performance and 1.5x memory bandwidth).

Software Ecosystem

  • Intel Gaudi Software Suite with PyTorch integration
  • Hugging Face Optimum Habana for easy model deployment (see the sketch after this list)
  • Growing but significantly behind CUDA and even ROCm in breadth
  • Intel has struggled with software ecosystem adoption
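
A minimal sketch of the Optimum Habana path mentioned above, assuming a Gaudi host with the optimum-habana and transformers packages installed; the class names follow the optimum.habana API, but treat the specific arguments and the "Habana/bert-base-uncased" config repo as illustrative rather than a verified recipe.

```python
# Hypothetical Gaudi fine-tuning sketch via Hugging Face Optimum Habana.
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

gaudi_config = GaudiConfig.from_pretrained("Habana/bert-base-uncased")  # published Gaudi config
args = GaudiTrainingArguments(
    output_dir="./gaudi-out",
    use_habana=True,          # run on Gaudi HPUs rather than CPU/GPU
    use_lazy_mode=True,       # lazy-graph execution, the usual mode on Gaudi
    per_device_train_batch_size=8,
)

trainer = GaudiTrainer(model=model, gaudi_config=gaudi_config, args=args, tokenizer=tokenizer)
# trainer.train()  # requires a train_dataset; omitted in this sketch
```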

Market Position

  • Gaudi 3 targets price/performance: significantly cheaper than H100/B200
  • Adopted by some cloud providers (AWS EC2 DL1 instances use first-generation Gaudi; Gaudi 3 cloud instances are expected to follow)
  • Intel’s AI accelerator division has faced strategic uncertainty, with reports of potential restructuring

Pricing: Gaudi 3 is priced aggressively, estimated at $15,000-$20,000 per chip, significantly below NVIDIA equivalents.


9. Notable Startup Chips

| Company | Chip/Product | Approach | Status |
| --- | --- | --- | --- |
| Tenstorrent | Wormhole, Grayskull, Blackhole | RISC-V based AI accelerator, open-source philosophy | Shipping dev hardware; Jim Keller-led |
| SambaNova | SN40L (Cardinal) | Reconfigurable Dataflow Architecture | Available as cloud/on-prem service |
| d-Matrix | Corsair | Digital in-memory compute for inference | Early production |
| Mythic | M1076 | Analog compute for edge AI | Shipping for edge/embedded |
| Graphcore | IPU (Bow) | Massively parallel processor for AI | Acquired by SoftBank in 2024 |
| Luminous / Lightmatter | Envise | Photonic AI accelerator | Research/early commercial stage |
| Rain AI | NPU | Neuromorphic analog compute | Pre-production |
| Positron AI | (unnamed) | Inference-optimized ASIC | Early stage, founded by former Meta engineers |

Tenstorrent (Noteworthy)

Led by legendary chip architect Jim Keller, Tenstorrent uses a RISC-V based approach with open-source tooling. Their roadmap includes advanced chips on 3nm+ nodes. They have secured significant funding and partnerships (Hyundai, Samsung). Their open-source ethos is a deliberate contrast to NVIDIA’s proprietary ecosystem.


10. Comprehensive Comparison Matrix

| Chip | Peak TFLOPS (precision noted) | Memory | Mem BW | TDP | Training | Inference | SW Ecosystem | Est. Price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA B200 | ~9,000 (FP8) | 192 GB HBM3e | 8 TB/s | ~1,000 W | Yes | Yes | CUDA (gold standard) | $30-40K |
| NVIDIA B300 | >9,000 (FP8) | 288 GB HBM3e | ~12 TB/s | ~1,000 W+ | Yes | Yes | CUDA | $40-50K (est.) |
| AMD MI300X | ~2,600 (FP8) | 192 GB HBM3 | 5.3 TB/s | 750 W | Yes | Yes | ROCm (improving) | $20-25K |
| AMD MI350 | TBD (~35x MI300X inference) | TBD | TBD | TBD | Yes | Yes | ROCm | TBD |
| Google TPU v6e | ~4.7x v5e | HBM (undisclosed) | High | N/A (cloud) | Yes | Yes | JAX/XLA | ~$2-3/hr (cloud) |
| Groq LPU | N/A (latency-optimized) | 230 MB SRAM | N/A | ~300 W (est.) | No | Yes | Groq SDK | Cloud API only |
| Cerebras WSE-3 | ~125,000 (FP16) | 44 GB SRAM | 21 PB/s on-chip | ~23 kW (system) | Yes | Yes | Cerebras SDK | $2-5M (system) |
| Etched Sohu | Claimed 10x B200 (Transformer inference) | 144 GB HBM3e | TBD | TBD | No | Transformer only | Nascent | TBD |
| Intel Gaudi 3 | ~1,835 (FP8) | 128 GB HBM2e | 3.7 TB/s | ~600 W | Yes | Yes | Gaudi SW Suite | $15-20K |
| Tenstorrent | Varies by config | Varies | Varies | Low | Yes | Yes | Open-source RISC-V | Competitive |

11. Key Trends and Analysis

1. The Memory Wall is the Defining Challenge

Across all vendors, the limiting factor for LLM workloads is memory capacity and bandwidth, not raw compute. This explains why B300’s main improvement is memory (288 GB vs 192 GB), and why Cerebras and Groq emphasize on-chip SRAM. HBM4 (expected 2026) will be the next major enabler.
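
A rough roofline-style sketch of why bandwidth, not peak FLOPS, caps decode throughput: generating one token requires streaming essentially every weight from memory once, so single-stream tokens/s is bounded by bandwidth divided by model size (ignoring KV-cache traffic, batching, and overlap). The model size and precision below are illustrative.

```python
# Upper bound on single-stream decode throughput: tokens/s <= bandwidth / model_bytes.
def max_tokens_per_sec(params_billions: float, bytes_per_param: float, bw_tb_s: float) -> float:
    model_bytes = params_billions * 1e9 * bytes_per_param
    return (bw_tb_s * 1e12) / model_bytes

for name, bw in [("B200 (8 TB/s)", 8.0), ("B300 (~12 TB/s)", 12.0), ("MI300X (5.3 TB/s)", 5.3)]:
    # 70B-parameter model held in FP8 (1 byte per parameter)
    print(f"{name}: ~{max_tokens_per_sec(70, 1, bw):.0f} tokens/s ceiling for a 70B FP8 model")
```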

2. Inference is Diverging from Training

Training remains dominated by NVIDIA (and to a lesser extent, Google TPUs and AMD). But inference – which represents the majority of production AI compute cost – is attracting specialized silicon (Groq, Etched, MatX, d-Matrix). The economics favor specialization because inference workloads are more predictable and can be optimized aggressively.

3. NVIDIA’s Moat Remains Formidable but Not Unassailable

CUDA’s ecosystem advantage is worth years of head start. However:
  • AMD’s ROCm is improving, with major customer wins
  • Google’s JAX/TPU stack is mature for those in Google Cloud
  • PyTorch’s compiler stack (torch.compile, the Triton language) is becoming more hardware-agnostic (see the sketch after this list)
  • MLIR and OpenXLA are creating hardware abstraction layers
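
A minimal sketch of the hardware-agnostic compiler path referenced above: torch.compile captures the model graph and hands it to a backend code generator (Triton kernels on NVIDIA and ROCm GPUs) while the model code itself stays vendor-neutral. Sizes are placeholders.

```python
import torch
import torch.nn as nn

# torch.compile lowers the captured graph through the default inductor backend,
# which emits Triton kernels on supported GPUs; the model code is unchanged.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(2048, 8192), nn.GELU(), nn.Linear(8192, 2048)).to(device)

compiled = torch.compile(model)
x = torch.randn(4, 2048, device=device)
print(compiled(x).shape)
```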

4. The Rise of “Good Enough” Inference

For many production LLM serving workloads, the question is not “what is fastest” but “what is cheapest per token at acceptable latency.” This favors:
  • Groq for latency-critical applications
  • AMD MI300X for cost-sensitive GPU inference
  • Custom ASICs (Etched, MatX) for high-volume serving
  • Cerebras for batch throughput
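
Putting the "cheapest per token at acceptable latency" framing into numbers, a small sketch converts an hourly accelerator price and a sustained throughput into dollars per million tokens; both inputs are placeholders to be replaced with measured values.

```python
# $/million tokens from an hourly price and sustained aggregate throughput.
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    return hourly_usd / (tokens_per_sec * 3600) * 1e6

# e.g. a hypothetical $3.00/hr accelerator sustaining 2,000 tokens/s across a batch
print(f"${cost_per_million_tokens(3.00, 2000):.3f} per million tokens")  # ~$0.417
```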

5. Chiplet and Advanced Packaging are Standard

AMD pioneered chiplets; now NVIDIA (B200 dual-die), Intel, and others use multi-die designs. TSMC’s CoWoS (Chip-on-Wafer-on-Substrate) packaging is the bottleneck, not fab capacity for leading-edge nodes.

6. Sovereign AI and Geographic Diversification

Export controls on advanced AI chips (particularly to China) have created a market for “compliant” chips and spurred domestic chip development in China (Huawei Ascend 910B/C), Europe, and other regions.


12. Recommendations by Use Case

| Use Case | Recommended | Rationale |
| --- | --- | --- |
| Large-scale training (>100B params) | NVIDIA B200/B300 or Google TPU v6 | Ecosystem maturity, proven scaling |
| Cost-sensitive training | AMD MI300X/MI350 | Lower $/FLOP, improving software |
| Low-latency inference | Groq LPU | Best-in-class latency |
| High-throughput inference | Cerebras CS-3 or Etched Sohu | Extreme throughput per system |
| Cost-optimized inference | AMD MI300X or Intel Gaudi 3 | Best $/token for GPU-based inference |
| Edge / embedded AI | Specialized edge chips | Power-constrained environments |
| Maximum flexibility | NVIDIA (any) | CUDA ecosystem, widest model support |

Caveats

This report synthesizes information available through early-to-mid 2025. Several items covered here (MI400, B300, the full TPU v6, Etched Sohu's shipping status, MatX details) may have seen significant updates between then and March 2026. Performance claims from startups (particularly Etched and Groq's next-generation parts) should be verified against independent benchmarks. Pricing information is approximate and subject to market conditions, volume discounts, and regional variation.
