
AI Chip Design and Custom Silicon: Comprehensive Comparison Report (2026)

This report compiles a comprehensive comparison based on AI chip developments through early 2025, noting where information is confirmed versus projected.

Dhawal Chheda, AI Leader at Accel4



1. NVIDIA Blackwell Architecture (B200 / B300)

B200 (Shipping since late 2024)

| Metric | Specification |
| --- | --- |
| Process | TSMC 4NP, dual-die design |
| Transistors | ~208 billion |
| FP8 Performance | 9 PFLOPS (per GB200 system: 2 GPUs + 1 Grace CPU) |
| FP4 Performance | 18 PFLOPS (per GB200 system) |
| HBM3e Memory | 192 GB per GPU |
| Memory Bandwidth | 8 TB/s per GPU |
| TDP | ~1,000 W per GPU |
| NVLink 5 | 1.8 TB/s bidirectional |
| Key Feature | 2nd-gen Transformer Engine with FP4 support |
| Availability | Generally available in NVL72 rack-scale systems |

NVIDIA claimed a 4x training speedup and 30x inference speedup over H100 for large language models. The GB200 NVL72 rack integrates 72 GPUs and 36 Grace CPUs into a single liquid-cooled system with up to 13.5 TB of unified HBM3e memory.
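
As a sanity check on those rack-level numbers, here is a back-of-the-envelope sketch that simply multiplies the per-GPU figures quoted above; NVIDIA's official rack-level figures may differ slightly (sparsity, usable-memory accounting), but the quoted 13.5 TB pool falls out directly when the 72-GPU total is expressed in binary terabytes.

```python
# Rack-level aggregates derived purely from the per-GPU figures quoted above.
GPUS_PER_RACK = 72
GB200_SUPERCHIPS_PER_RACK = 36           # each pairs 2 GPUs with 1 Grace CPU
HBM_PER_GPU_GB = 192
HBM_BW_PER_GPU_TB_S = 8
FP4_PFLOPS_PER_GB200 = 18                # naive sum; ignores sparsity and SKU differences

total_hbm_gb = GPUS_PER_RACK * HBM_PER_GPU_GB
total_bw_tb_s = GPUS_PER_RACK * HBM_BW_PER_GPU_TB_S
total_fp4_pflops = GB200_SUPERCHIPS_PER_RACK * FP4_PFLOPS_PER_GB200

print(f"HBM3e pool:       {total_hbm_gb} GB (~{total_hbm_gb / 1024:.1f} TiB)")  # 13824 GB, ~13.5 TiB
print(f"Aggregate HBM BW: {total_bw_tb_s} TB/s")                                # 576 TB/s
print(f"Naive FP4 sum:    {total_fp4_pflops} PFLOPS")                           # 648 PFLOPS
```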

B300 (Announced GTC 2025)

| Metric | Specification |
| --- | --- |
| HBM3e Memory | 288 GB per GPU (up from 192 GB) |
| Memory Bandwidth | ~12 TB/s (estimated improvement) |
| FP4 Performance | Significant uplift over B200 |
| Form Factor | GB300 NVL72 rack-scale |
| Availability | Expected H2 2025 / early 2026 ramp |

The B300 is an enhanced Blackwell variant, not a new architecture. The key improvement is the move to higher-capacity HBM3e stacks (12-Hi), enabling 288 GB per GPU. Jensen Huang also previewed the Vera Rubin architecture (successor to Blackwell) at GTC 2025, expected in 2026-2027.

Software Ecosystem

CUDA remains the dominant ecosystem. NVIDIA’s moat is not just hardware but the CUDA toolkit, cuDNN, TensorRT, TensorRT-LLM, Triton Inference Server, NeMo, and the broader ecosystem. Virtually all ML frameworks (PyTorch, JAX, etc.) have first-class CUDA support.
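
A minimal PyTorch sketch of what "first-class CUDA support" looks like in practice: the same model code targets whatever CUDA device is present and falls back to CPU otherwise. The layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Vendor-neutral model code; PyTorch dispatches to CUDA kernels when a GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).to(device)
x = torch.randn(8, 4096, device=device)

# Mixed precision via autocast; bf16 is widely supported on recent data-center GPUs.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)

print(y.shape, y.dtype)
```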

Pricing: B200 GPUs estimated at $30,000-$40,000 per unit; GB200 NVL72 racks reportedly $2-3 million+.


2. AMD CDNA 4 / MI400 Series

MI300X (Shipping, CDNA 3)

| Metric | Specification |
| --- | --- |
| Process | TSMC 5nm/6nm chiplet design |
| HBM3 Memory | 192 GB |
| Memory Bandwidth | 5.3 TB/s |
| FP8 Performance | 2.6 PFLOPS |
| FP16 Performance | 1.3 PFLOPS |
| TDP | 750 W |
| Interconnect | Infinity Fabric |

The MI300X has gained significant traction with cloud providers (Microsoft Azure, Oracle Cloud) and has been adopted by several hyperscalers for inference workloads due to its large memory capacity.

MI350 (Announced, CDNA 4 generation)

| Metric | Specification |
| --- | --- |
| Process | TSMC 3nm |
| Expected Performance | ~35x inference performance over MI300X (AMD's claim for FP4/FP6) |
| New Data Types | FP4, FP6 support |
| Availability | Expected mid-2025 |

MI400 (CDNA “Next”)

AMD has disclosed that an MI400 series is on the roadmap, expected to follow MI350. Details are sparse, but it is expected to leverage more advanced packaging and potentially HBM4.

Estimated availability: Late 2026 or 2027.

Software Ecosystem

ROCm has improved substantially but still trails CUDA in maturity. PyTorch has official ROCm support. JAX support exists but is less mature. Key pain points remain in library compatibility, debugging tools, and third-party framework support. AMD has invested heavily in closing the gap, including hiring and open-source contributions.
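
One concrete sign of the narrowing gap: official ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda API, so most CUDA-targeted code runs unchanged. A minimal portability check, as a sketch:

```python
import torch

# On a ROCm build of PyTorch, AMD GPUs (e.g. MI300X) appear under the familiar
# torch.cuda namespace; torch.version.hip distinguishes the backend.
if torch.cuda.is_available():
    backend = "ROCm/HIP" if torch.version.hip is not None else "CUDA"
    print(f"Accelerator backend: {backend}")
    print(f"Device 0: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU visible to PyTorch; running on CPU.")
```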

Pricing: MI300X is priced competitively below the H100/B200, typically 20-30% less for comparable SKUs.


3. Google TPU v6 (Trillium)

| Metric | Specification |
| --- | --- |
| Generation | TPU v6e ("Trillium"), announced mid-2024 |
| Performance | ~4.7x improvement in peak compute per chip over TPU v5e |
| HBM | Significant increase over v5e |
| Interconnect | ICI (Inter-Chip Interconnect) for pod-scale training |
| Availability | Google Cloud only |
| Pod Scale | Up to 256 chips per pod slice |

Google’s TPU strategy focuses on tight integration with its JAX/XLA software stack and Google Cloud infrastructure. TPUs are not sold as discrete chips but consumed as cloud services.
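
A minimal JAX sketch of the workflow the TPU stack is built around: functions are jit-compiled and lowered through XLA, the same compiler path used to target TPU pods. The shapes are placeholders, and on a machine without TPUs JAX simply compiles for CPU or GPU instead.

```python
import jax
import jax.numpy as jnp

# jit traces the function once and lowers it through XLA; on Cloud TPU the same
# code compiles to TPU executables without modification.
@jax.jit
def ffn(x, w1, w2):
    return jnp.maximum(x @ w1, 0.0) @ w2

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(k1, (8, 1024))
w1 = jax.random.normal(k2, (1024, 4096))
w2 = jax.random.normal(k3, (4096, 1024))

print("Devices visible to JAX:", jax.devices())
print("Output shape:", ffn(x, w1, w2).shape)
```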

TPU v6 (Full, non-“e” variant)

A full TPU v6 (beyond the efficiency-focused v6e) has been anticipated but details are limited in public disclosures. Google tends to deploy internally first (for Gemini model training) before making available on Cloud.

Software Ecosystem

  • JAX is the primary framework, with XLA compilation
  • PyTorch/XLA provides PyTorch compatibility but adds complexity
  • Vertex AI integration for managed training/serving
  • The ecosystem is strong but Google Cloud-locked; no on-premise option

Pricing: Competitive on a $/FLOP basis for large-scale training when using committed use discounts. TPU v6e pricing is roughly $1.50-$3.00/chip/hour depending on region and commitment.


4. Groq LPU (Language Processing Unit)

| Metric | Specification |
| --- | --- |
| Architecture | TSP (Tensor Streaming Processor), deterministic dataflow |
| Process | 14nm (first generation) |
| SRAM | 230 MB on-chip (no HBM) |
| Key Advantage | Ultra-low-latency inference, deterministic execution |
| Performance | ~750 tokens/second per user for Llama-2 70B (demonstrated) |
| Target | Inference only (not training) |

Groq’s approach is architecturally unique: a fully deterministic, software-scheduled dataflow processor with no caches, no branch prediction, and no out-of-order execution. This yields extremely predictable, low-latency inference.
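
To put the demonstrated throughput in perspective, a quick arithmetic sketch converts the per-user token rate from the table above into per-token latency, which is where the deterministic design pays off:

```python
# Per-user decode latency implied by the demonstrated figure above
# (a vendor demo, not an independent benchmark).
tokens_per_second = 750                                            # Llama-2 70B, per user
print(f"~{1000 / tokens_per_second:.2f} ms per generated token")   # ~1.33 ms
print(f"500-token response in ~{500 / tokens_per_second:.2f} s")   # ~0.67 s
```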

GroqCloud

Groq operates a cloud inference service (GroqCloud) that has demonstrated best-in-class latency for LLM inference. They have served models like Llama 3, Mixtral, and Gemma.

Limitations

  • Inference only – cannot train models
  • Model size constrained by SRAM capacity; very large models require multi-chip configurations
  • 14nm process limits power efficiency compared to leading-edge nodes
  • Limited availability – primarily through GroqCloud API

Groq has announced next-generation chips on more advanced process nodes, which would address power efficiency concerns.

Pricing: GroqCloud API pricing has been very aggressive, sometimes 5-10x cheaper per token than competitors for supported models.


5. Cerebras WSE-3 (Wafer-Scale Engine 3)

| Metric | Specification |
| --- | --- |
| Die Size | Full wafer (~46,225 mm²) |
| Transistors | 4 trillion |
| Cores | 900,000 AI-optimized cores |
| On-chip SRAM | 44 GB |
| Memory Bandwidth | 21 PB/s on-chip |
| Process | TSMC 5nm |
| FP16 Performance | ~125 PFLOPS (claimed) |
| System | CS-3 (with MemoryX and SwarmX) |

Cerebras takes the most radical approach: an entire wafer as a single chip. The WSE-3 eliminates off-chip memory bottlenecks by having 44 GB of on-chip SRAM with enormous aggregate bandwidth.

Key Developments

  • Cerebras Inference: Launched cloud inference service demonstrating 1,800+ tokens/second for Llama 3.1 70B
  • Training at scale: Demonstrated training of models up to 100B+ parameters using MemoryX (external memory) and SwarmX (multi-system clustering)
  • CS-3 System: A single CS-3 replaces clusters of GPUs for many workloads
  • IPO Plans: Cerebras filed for IPO, indicating business traction

Limitations

  • Extremely expensive per-system cost ($2-5 million estimated per CS-3)
  • Custom software stack (Cerebras SDK), though PyTorch compatibility has improved
  • Limited to Cerebras cloud or on-premise CS-3 systems
  • Yield and manufacturing complexity of wafer-scale chips

6. Etched Sohu

| Metric | Specification |
| --- | --- |
| Architecture | ASIC purpose-built exclusively for Transformer inference |
| Process | TSMC 4nm |
| Key Claim | 500,000+ tokens/second for Llama 3 70B (8-chip server) |
| Memory | 144 GB HBM3e per chip |
| Approach | Transformer operations hardcoded in silicon |

Etched is a startup that has taken the most aggressive specialization approach: burning the Transformer architecture directly into silicon, eliminating the overhead of general-purpose compute. The Sohu chip does one thing – Transformer inference – but does it at extreme throughput.

Key Claims (as of announcements)

  • 10x+ better performance/$ than NVIDIA B200 for Transformer inference
  • Supports any model that is a Transformer (various sizes)
  • 8-chip Sohu server claimed to outperform 160 H100 GPUs
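
Taking those claims at face value, a short arithmetic sketch shows what they imply per chip; these are the vendor's numbers, not independent benchmarks:

```python
# Per-chip figures implied by Etched's public claims.
claimed_server_tokens_per_sec = 500_000     # Llama 3 70B, 8-chip Sohu server
chips_per_server = 8
claimed_h100_equivalents = 160              # "outperforms 160 H100 GPUs"

print(f"~{claimed_server_tokens_per_sec / chips_per_server:,.0f} tokens/s per chip (claimed)")
print(f"~{claimed_h100_equivalents / chips_per_server:.0f} H100s replaced per chip (claimed)")
```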

Risks

  • Architecture lock-in: If the field moves beyond Transformers (e.g., state-space models, RWKV, next-gen architectures), Sohu becomes obsolete
  • Startup risk: Pre-revenue as of last public information
  • Software ecosystem: Nascent; must build tooling from scratch
  • Availability: Announced but limited shipping information

7. MatX

| Metric | Detail |
| --- | --- |
| Founded By | Former Google TPU engineers |
| Focus | Custom AI inference accelerator |
| Approach | Highly efficient ASIC for LLM inference |
| Status | Stealth/early-stage as of last reports |

MatX has been relatively secretive. Founded by engineers from the Google TPU team, the company is building custom silicon optimized for large language model inference. Limited public specifications are available.

Key differentiator is reportedly extreme memory bandwidth efficiency and cost-optimized inference, targeting the growing demand for cheap, high-throughput LLM serving.


8. Intel Gaudi 3

| Metric | Specification |
| --- | --- |
| Process | TSMC 5nm |
| FP8 Performance | 1,835 TFLOPS |
| HBM2e Memory | 128 GB |
| Memory Bandwidth | 3.7 TB/s |
| TDP | ~600 W |
| Ethernet Networking | 24x 200 GbE RoCE built-in |
| Key Feature | Native Ethernet networking (no proprietary interconnect) |

Context

Intel acquired Habana Labs (Gaudi’s designer) in 2019. Gaudi 3 represents a significant leap from Gaudi 2 (roughly 2x FP8 performance and 1.5x memory bandwidth).

Software Ecosystem

  • Intel Gaudi Software Suite with PyTorch integration
  • Hugging Face Optimum Habana for easy model deployment (see the sketch after this list)
  • Growing but significantly behind CUDA and even ROCm in breadth
  • Intel has struggled with software ecosystem adoption
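
A minimal sketch of the Optimum Habana path mentioned above, assuming a Gaudi host with the optimum-habana and transformers packages installed; the class names follow the optimum.habana API, but treat the specific arguments and the "Habana/bert-base-uncased" config repo as illustrative rather than a verified recipe.

```python
# Hypothetical Gaudi fine-tuning sketch via Hugging Face Optimum Habana.
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

gaudi_config = GaudiConfig.from_pretrained("Habana/bert-base-uncased")  # published Gaudi config
args = GaudiTrainingArguments(
    output_dir="./gaudi-out",
    use_habana=True,          # run on Gaudi HPUs rather than CPU/GPU
    use_lazy_mode=True,       # lazy-graph execution, the usual mode on Gaudi
    per_device_train_batch_size=8,
)

trainer = GaudiTrainer(model=model, gaudi_config=gaudi_config, args=args, tokenizer=tokenizer)
# trainer.train()  # requires a train_dataset; omitted in this sketch
```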

Market Position

  • Gaudi 3 targets price/performance: significantly cheaper than H100/B200
  • Adopted by some cloud providers (AWS EC2 DL1 instances use first-generation Gaudi; Gaudi 3 cloud instances are expected to follow)
  • Intel’s AI accelerator division has faced strategic uncertainty, with reports of potential restructuring

Pricing: Gaudi 3 is priced aggressively, estimated at $15,000-$20,000 per chip, significantly below NVIDIA equivalents.


9. Notable Startup Chips

| Company | Chip/Product | Approach | Status |
| --- | --- | --- | --- |
| Tenstorrent | Wormhole, Grayskull, Blackhole | RISC-V based AI accelerator, open-source philosophy | Shipping dev hardware; Jim Keller-led |
| SambaNova | SN40L (Cardinal) | Reconfigurable Dataflow Architecture | Available as cloud/on-prem service |
| d-Matrix | Corsair | Digital in-memory compute for inference | Early production |
| Mythic | M1076 | Analog compute for edge AI | Shipping for edge/embedded |
| Graphcore | IPU (Bow) | Massively parallel processor for AI | Acquired by SoftBank in 2024 |
| Luminous / Lightmatter | Envise | Photonic AI accelerator | Research/early commercial stage |
| Rain AI | NPU | Neuromorphic analog compute | Pre-production |
| Positron AI | (unnamed) | Inference-optimized ASIC | Early stage, founded by former Meta engineers |

Tenstorrent (Noteworthy)

Led by legendary chip architect Jim Keller, Tenstorrent uses a RISC-V based approach with open-source tooling. Their roadmap includes advanced chips on 3nm+ nodes. They have secured significant funding and partnerships (Hyundai, Samsung). Their open-source ethos is a deliberate contrast to NVIDIA’s proprietary ecosystem.


10. Comprehensive Comparison Matrix

| Chip | Peak TFLOPS (precision noted) | Memory | Mem BW | TDP | Training | Inference | SW Ecosystem | Est. Price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA B200 | ~9,000 (FP8) | 192 GB HBM3e | 8 TB/s | ~1,000 W | Yes | Yes | CUDA (gold standard) | $30-40K |
| NVIDIA B300 | >9,000 (FP8) | 288 GB HBM3e | ~12 TB/s | ~1,000 W+ | Yes | Yes | CUDA | $40-50K (est.) |
| AMD MI300X | ~2,600 (FP8) | 192 GB HBM3 | 5.3 TB/s | 750 W | Yes | Yes | ROCm (improving) | $20-25K |
| AMD MI350 | TBD (~35x MI300X inference) | TBD | TBD | TBD | Yes | Yes | ROCm | TBD |
| Google TPU v6e | ~4.7x v5e | HBM (undisclosed) | High | N/A (cloud) | Yes | Yes | JAX/XLA | ~$2-3/hr (cloud) |
| Groq LPU | N/A (latency-optimized) | 230 MB SRAM | N/A | ~300 W (est.) | No | Yes | Groq SDK | Cloud API only |
| Cerebras WSE-3 | ~125,000 (FP16) | 44 GB SRAM | 21 PB/s on-chip | ~23 kW (system) | Yes | Yes | Cerebras SDK | $2-5M (system) |
| Etched Sohu | Claimed 10x B200 (Transformer inference) | 144 GB HBM3e | TBD | TBD | No | Transformer only | Nascent | TBD |
| Intel Gaudi 3 | ~1,835 (FP8) | 128 GB HBM2e | 3.7 TB/s | ~600 W | Yes | Yes | Gaudi SW Suite | $15-20K |
| Tenstorrent | Varies by config | Varies | Varies | Low | Yes | Yes | Open-source RISC-V | Competitive |

11. Key Trends and Analysis

1. The Memory Wall is the Defining Challenge

Across all vendors, the limiting factor for LLM workloads is memory capacity and bandwidth, not raw compute. This explains why B300’s main improvement is memory (288 GB vs 192 GB), and why Cerebras and Groq emphasize on-chip SRAM. HBM4 (expected 2026) will be the next major enabler.
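
A rough roofline-style sketch of why bandwidth, not peak FLOPS, caps decode throughput: generating one token requires streaming essentially every weight from memory once, so single-stream tokens/s is bounded by bandwidth divided by model size (ignoring KV-cache traffic, batching, and overlap). The model size and precision below are illustrative.

```python
# Upper bound on single-stream decode throughput: tokens/s <= bandwidth / model_bytes.
def max_tokens_per_sec(params_billions: float, bytes_per_param: float, bw_tb_s: float) -> float:
    model_bytes = params_billions * 1e9 * bytes_per_param
    return (bw_tb_s * 1e12) / model_bytes

for name, bw in [("B200 (8 TB/s)", 8.0), ("B300 (~12 TB/s)", 12.0), ("MI300X (5.3 TB/s)", 5.3)]:
    # 70B-parameter model held in FP8 (1 byte per parameter)
    print(f"{name}: ~{max_tokens_per_sec(70, 1, bw):.0f} tokens/s ceiling for a 70B FP8 model")
```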

2. Inference is Diverging from Training

Training remains dominated by NVIDIA (and to a lesser extent, Google TPUs and AMD). But inference – which represents the majority of production AI compute cost – is attracting specialized silicon (Groq, Etched, MatX, d-Matrix). The economics favor specialization because inference workloads are more predictable and can be optimized aggressively.

3. NVIDIA’s Moat Remains Formidable but Not Unassailable

CUDA’s ecosystem advantage is worth years of head start. However:
  • AMD’s ROCm is improving, with major customer wins
  • Google’s JAX/TPU stack is mature for those in Google Cloud
  • PyTorch’s compiler stack (torch.compile, the Triton language) is becoming more hardware-agnostic (see the sketch after this list)
  • MLIR and OpenXLA are creating hardware abstraction layers
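
A minimal sketch of the hardware-agnostic compiler path referenced above: torch.compile captures the model graph and hands it to a backend code generator (Triton kernels on NVIDIA and ROCm GPUs) while the model code itself stays vendor-neutral. Sizes are placeholders.

```python
import torch
import torch.nn as nn

# torch.compile lowers the captured graph through the default inductor backend,
# which emits Triton kernels on supported GPUs; the model code is unchanged.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(2048, 8192), nn.GELU(), nn.Linear(8192, 2048)).to(device)

compiled = torch.compile(model)
x = torch.randn(4, 2048, device=device)
print(compiled(x).shape)
```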

4. The Rise of “Good Enough” Inference

For many production LLM serving workloads, the question is not “what is fastest” but “what is cheapest per token at acceptable latency.” This favors:
  • Groq for latency-critical applications
  • AMD MI300X for cost-sensitive GPU inference
  • Custom ASICs (Etched, MatX) for high-volume serving
  • Cerebras for batch throughput
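
Putting the "cheapest per token at acceptable latency" framing into numbers, a small sketch converts an hourly accelerator price and a sustained throughput into dollars per million tokens; both inputs are placeholders to be replaced with measured values.

```python
# $/million tokens from an hourly price and sustained aggregate throughput.
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    return hourly_usd / (tokens_per_sec * 3600) * 1e6

# e.g. a hypothetical $3.00/hr accelerator sustaining 2,000 tokens/s across a batch
print(f"${cost_per_million_tokens(3.00, 2000):.3f} per million tokens")  # ~$0.417
```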

5. Chiplet and Advanced Packaging are Standard

AMD pioneered chiplets; now NVIDIA (B200 dual-die), Intel, and others use multi-die designs. TSMC’s CoWoS (Chip-on-Wafer-on-Substrate) packaging is the bottleneck, not fab capacity for leading-edge nodes.

6. Sovereign AI and Geographic Diversification

Export controls on advanced AI chips (particularly to China) have created a market for “compliant” chips and spurred domestic chip development in China (Huawei Ascend 910B/C), Europe, and other regions.


12. Recommendations by Use Case

| Use Case | Recommended | Rationale |
| --- | --- | --- |
| Large-scale training (>100B params) | NVIDIA B200/B300 or Google TPU v6 | Ecosystem maturity, proven scaling |
| Cost-sensitive training | AMD MI300X/MI350 | Lower $/FLOP, improving software |
| Low-latency inference | Groq LPU | Best-in-class latency |
| High-throughput inference | Cerebras CS-3 or Etched Sohu | Extreme throughput per system |
| Cost-optimized inference | AMD MI300X or Intel Gaudi 3 | Best $/token for GPU-based inference |
| Edge / embedded AI | Specialized edge chips | Power-constrained environments |
| Maximum flexibility | NVIDIA (any) | CUDA ecosystem, widest model support |

Caveats

This report synthesizes information available through early-to-mid 2025. Several items covered here (MI400, B300, the full TPU v6, Etched Sohu's shipping status, MatX details) may have seen significant updates between then and March 2026. Performance claims from startups (particularly Etched and Groq's next-generation parts) should be verified against independent benchmarks. Pricing information is approximate and subject to market conditions, volume discounts, and regional variation.
