Google, Moonshot, TransferEngine, DeepSeek Unveil AI Powerhouses, Redefining Scale and Speed
TL;DR
- Google Unveils Ironwood TPU: 9,216 Chips, 192 GB HBM3E, 42.5 ExaFLOPS for AI Training & Inference
- Moonshot AI Launches Kimi K2 Thinking, a 1‑T Parameter Open‑Source LLM with 256K‑Token Context and 61 Layers
- TransferEngine Opens a New Era of Cloud‑Agnostic Inference: 400 Gbps Inter‑GPU Communication Lets 1‑T‑Parameter Models Run on Legacy GPUs
- DeepSeek V3 Unveils 671B‑Parameter Model, Pushing JIT‑Accelerated Inference on Limited H100/H200 GPUs
- Google Adds Ironwood TPUs to Cloud Instances, Slashing GPU Cost by 90% for Large‑Scale Transformer Training
Google’s Ironwood TPU Pods Redefine Enterprise AI Compute
Unprecedented Scale and Bandwidth
- 9,216 TPU chips per pod, each with 192 GB HBM3E (7.37 TB/s bandwidth).
- Total pod memory = 1.77 PB, the highest disclosed AI memory pool.
- 9.6 Tb/s inter‑chip fabric supports near‑linear scaling for models > 1 TB.
- FP8 performance = 4,614 TFLOPS per chip; 42.5 ExaFLOPS per pod (118× the previous generation); see the arithmetic sketch after this list.
- Power envelope ≈ 100 kW per pod; designed for high‑density data‑center deployment.
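The pod‑level figures above follow directly from the per‑chip specs. A minimal Python sanity check, assuming only the numbers quoted in this list:

```python
# Recompute the per-pod aggregates from the quoted per-chip specs
# (192 GB HBM3E and 4,614 TFLOPS FP8 per chip, 9,216 chips per pod).
CHIPS_PER_POD = 9_216
HBM_PER_CHIP_GB = 192
FP8_TFLOPS_PER_CHIP = 4_614

pod_hbm_pb = CHIPS_PER_POD * HBM_PER_CHIP_GB / 1_000_000            # GB -> PB
pod_fp8_exaflops = CHIPS_PER_POD * FP8_TFLOPS_PER_CHIP / 1_000_000  # TFLOPS -> EFLOPS

print(f"Aggregate HBM per pod: {pod_hbm_pb:.2f} PB")            # ~1.77 PB
print(f"Aggregate FP8 compute: {pod_fp8_exaflops:.1f} EFLOPS")  # ~42.5 EFLOPS
```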
Integrated Training‑Inference Design
- Pods combine Ironwood TPUs with Google‑designed Axion CPUs, delivering a unified hardware–software stack.
- Availability across all US Google Cloud regions; EU and APAC rollout scheduled for Q1 2026.
- Anthropic contract for up to 1 million TPUs; Lightricks deployment for LTX‑2 multimodal training.
- Enterprise benchmarks show 30 % faster transcoding vs. x86 VMs and 60 % price‑performance gain for Java‑based AI services.
Economic Signals
- Projected TPU market revenue = $9.8 B in 2025 (up from $6.2 B in 2024).
- Early pricing positions Ironwood’s price‑performance on par with leading AMD/NVIDIA GPUs.
- Google’s AI‑hardware CAPEX allocation = $75 B–$93 B for 2025‑2026, with Ironwood as a core component.
- Reference architecture predicts a 353 % three‑year ROI, 28 % reduction in IT spend, and 55 % higher operational efficiency versus GPU clusters.
Market Position and Future Trajectory
- Shift from “training‑only” to integrated training‑inference pods evident across the industry.
- Multi‑foundry ASIC strategy mitigates current TSMC packaging constraints.
- Projected YoY pod capacity growth of 5‑10 % through 2026, driven by Anthropic’s commitment.
- Enterprises are likely to allocate ≥40 % of new AI compute budgets to TPU‑based solutions by Q4 2026.
- HBM4E development (≈12 TB/s bandwidth) slated for 2027, targeting an additional ~30 % per‑chip performance uplift.
Kimi K2 Thinking Shifts the Open‑Source LLM Landscape
Technical Foundations
- Total size ≈ 1 trillion parameters, with ≈32 billion activated per token.
- Mixture‑of‑Experts architecture: 384 experts per MoE layer, 8 routed to each token, across 61 transformer layers (1 dense + 60 MoE layers); see the configuration sketch after this list.
- Attention heads = 64, hidden dimension = 7,168, SwiGLU activation.
- Context window = 256 k tokens (≈160 k usable, 256 k hard limit).
- Native INT4 quantization brings inference cost down to roughly $0.15 per 1 M tokens.
- Agentic “thinking” endpoint streams token‑level reasoning, supporting up to 120 steps with a 48 k token budget per step, or 300 tool‑search steps at 24 k tokens each.
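For an at‑a‑glance summary, here is a minimal sketch of the configuration described above as a plain Python dataclass. The field names are invented for this sketch and the values are taken from this article, not from Moonshot's published configuration files, so treat them as approximate.

```python
from dataclasses import dataclass

@dataclass
class KimiK2ThinkingConfig:
    total_params: float = 1.0e12    # ~1 T parameters total
    active_params: float = 32e9     # ~32 B activated per token
    num_layers: int = 61            # 1 dense + 60 MoE layers
    num_experts: int = 384          # experts per MoE layer
    experts_per_token: int = 8      # experts routed to each token
    attention_heads: int = 64
    hidden_dim: int = 7168
    context_window: int = 256_000   # 256K-token hard limit
    weight_dtype: str = "int4"      # native INT4 quantization

cfg = KimiK2ThinkingConfig()
# Rough share of the weights touched per token under this routing scheme:
print(f"Active fraction: {cfg.active_params / cfg.total_params:.1%}")  # ~3.2%
```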
Benchmark Position
- Humanity’s Last Exam: 44.9 % – highest among open‑source models.
- BrowseComp (English): 60.2 % – surpasses GPT‑5 and Claude Sonnet 4.5.
- MMLU‑Pro: 84.6 % – near state‑of‑the‑art.
- MMLU‑Redux: 94.4 % – top open‑source score.
- SWE‑Bench (tool‑enabled): 71.3 %; LiveCodeBench v6/v7: 83.1 %.
Market Impact
- Modified MIT license permits commercial use; Moonshot reports $20 M in annualized licensing revenue within the first month.
- Token cost is roughly half that of typical closed‑source offerings ($0.30-$0.45 per 1 M tokens); see the cost sketch after this list.
- Ultra‑long context and cheap INT4 inference target emerging tool‑driven agent workflows – multi‑turn reasoning, document‑level QA, and chain‑of‑thought prompting.
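To make the pricing gap concrete, here is a back‑of‑envelope comparison using only the per‑token rates quoted above; the session size is a hypothetical illustration, not a measured workload.

```python
# Cost of a long, tool-heavy agent session at the quoted rates.
session_tokens = 5_000_000                         # hypothetical multi-turn session
k2_rate = 0.15 / 1_000_000                         # $/token, K2 Thinking (INT4)
closed_low, closed_high = 0.30 / 1e6, 0.45 / 1e6   # $/token, typical closed-source

print(f"K2 Thinking:   ${session_tokens * k2_rate:.2f}")
print(f"Closed-source: ${session_tokens * closed_low:.2f} - ${session_tokens * closed_high:.2f}")
```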
Adoption Outlook
- Projected integration into ≥30 % of new autonomous‑agent platforms (code assistants, research bots) within 12 months, driven by open licensing and cost efficiency.
- Anticipated release of at least two additional open‑source MoE models (≈1.5 T static, ≥300 k context) before Q3 2026, intensifying competition.
- Closed‑source providers likely to adjust pricing or introduce dedicated “reasoning‑stream” APIs to retain relevance in the tool‑heavy agent market.
TransferEngine Breaks Cloud‑Agnostic GPU Inference Barrier
Universal NIC Translation Delivers 400 Gbps
- TransferEngine abstracts NVIDIA ConnectX‑7 and AWS EFA protocols, providing a bidirectional shim that maps RDMA verbs across heterogeneous interconnects.
- Benchmarks show sustained 400 Gbps throughput when linking H100/H200 GPUs in mixed‑cloud environments.
- No CPU involvement in the data path and AES‑GCM encryption preserve low latency and data security while maintaining the ordering guarantees essential for large‑scale Mixture‑of‑Experts (MoE) routing; a hypothetical sketch of the translation layer follows this list.
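The sketch below illustrates the translation‑layer idea in Python. All class and method names are invented for illustration and do not reflect TransferEngine's actual API; it only shows how a single one‑sided write interface can sit in front of NIC‑specific backends.

```python
from abc import ABC, abstractmethod

class RdmaBackend(ABC):
    """One concrete backend per NIC family (e.g. ConnectX-7, AWS EFA)."""
    @abstractmethod
    def write(self, local_buf: bytes, remote_addr: int) -> None: ...

class ConnectX7Backend(RdmaBackend):
    def write(self, local_buf, remote_addr):
        raise NotImplementedError  # would issue an RDMA WRITE verb via the ConnectX-7 stack

class EfaBackend(RdmaBackend):
    def write(self, local_buf, remote_addr):
        raise NotImplementedError  # would issue the equivalent operation over AWS EFA / SRD

class TransferShim:
    """Routes one-sided writes to whichever fabric the node exposes."""
    def __init__(self, backend: RdmaBackend):
        self.backend = backend

    def send_expert_activations(self, buf: bytes, remote_addr: int) -> None:
        # MoE routing depends on ordering/completion guarantees; a real
        # implementation would enforce them here before signalling completion.
        self.backend.write(buf, remote_addr)
```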
Legacy GPUs Power Trillion‑Parameter Models
- DeepSeek v3 (671 B parameters) and Kimi K2 (1 T parameters) run on legacy H100/H200 hardware without next‑gen GPUs.
- Capital expenditure drops 30‑45 % versus acquiring newer GPU generations, leveraging abundant, relatively inexpensive legacy inventory.
- MoE deployments regain viability, eliminating the “brutal penalties” of multi‑system routing previously tied to vendor‑specific NICs.
Open‑Source Momentum and Early Adoption
- Apache‑2.0 release on GitHub (2025‑11‑06) invites community extensions and auditability.
- Perplexity integrated TransferEngine in production on the same day, confirming the 400 Gbps claim.
- Perplexity deployed the stack across three critical inference pipelines, reporting linear scalability when adding additional NICs.
- Five third‑party integrations are projected by early 2026, driven by the open‑source license and documented API compatibility.
Market Implications
- Vendor lock‑in diminishes as cloud providers encounter pressure to support TransferEngine‑compatible APIs within 12 months.
- Enterprises can defer premium hardware purchases, aligning with cost‑sensitive strategies highlighted in recent quantization and inference cost analyses.
- Anticipated >25 % rise in MoE model deployments on legacy hardware by Q4 2026 reflects the removal of inter‑GPU bottlenecks.
Future Outlook
- Standardization bodies may adopt TransferEngine’s translation layer as a reference model, encouraging broader interoperability.
- Continued open‑source contributions are expected to expand protocol support beyond ConnectX and EFA, further universalizing high‑throughput GPU inference.
- The convergence of affordable legacy GPUs and cloud‑agnostic interconnects positions TransferEngine as a catalyst for scalable, cost‑effective AI deployment across heterogeneous infrastructures.
JIT‑Accelerated Inference Puts 671‑B DeepSeek V3 on H100/H200 GPUs Within Reach
Key Technical Milestones (Nov 5‑6 2025)
- Nov 5 2025 – DeepSeek V3 released with 671 B parameters.
- Nov 6 2025 – TransferEngine open‑source library announced, enabling full‑speed GPU‑to‑GPU communication for trillion‑parameter models on H100/H200 clusters.
- Nov 6 2025 – Nebius Token Factory lists DeepSeek among 60+ supported models, promising sub‑second latency on NVIDIA H100/H200 hardware.
- Nov 6 2025 – EdgeReasoning study confirms JIT compilation as the primary latency reducer for LLMs on edge GPUs (arXiv:2511.01866v1).
Memory and Bandwidth Constraints
- Model size: 671 B parameters (≈1.34 TB in FP16/BF16, ≈0.67 TB in FP8/INT8).
- GPU HBM: 80 GB per H100, 141 GB per H200.
- 8‑bit quantization reduces weight storage to ≈0.67 TB, which still spans multiple GPUs even with aggressive tiling (at least nine H100s or five H200s for the weights alone); see the arithmetic sketch after this list.
- TransferEngine achieves ≈400 Gbps throughput on Nvidia ConnectX‑7 and AWS EFA.
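The arithmetic behind those constraints, assuming only the parameter count and HBM capacities listed above (weights only, ignoring KV cache and activation overhead):

```python
import math

PARAMS = 671e9
BYTES_PER_PARAM = {"fp16/bf16": 2, "fp8/int8": 1, "int4": 0.5}
HBM_GB = {"H100": 80, "H200": 141}

for prec, b in BYTES_PER_PARAM.items():
    total_gb = PARAMS * b / 1e9
    gpus = " | ".join(f">= {math.ceil(total_gb / hbm)}x {gpu}" for gpu, hbm in HBM_GB.items())
    print(f"{prec:>10}: {total_gb / 1000:.2f} TB weights | {gpus}")
# fp16/bf16 -> ~1.34 TB (>=17 H100 or >=10 H200 for weights alone)
# fp8/int8  -> ~0.67 TB (>=9 H100 or >=5 H200)
# int4      -> ~0.34 TB (>=5 H100 or >=3 H200)
```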
JIT‑Generated Kernels Reduce Latency
- EdgeReasoning reports 15 k tokens / s on a 4‑GPU H100 node when kernels are compiled just‑in‑time.
- Dynamic kernel sizing eliminates memory over‑allocation, cutting per‑token latency by ~30 % versus static kernels.
- Quantized JIT kernels retain > 95 % of full‑precision BLEU scores on standard benchmarks.
Cross‑Vendor GPU Communication with TransferEngine
- Abstracts divergent NIC protocols (ConnectX‑7, AWS EFA) into a unified data path.
- Removes vendor lock‑in, allowing model slices to span heterogeneous GPUs without additional software layers.
- Measured 400 Gbps inter‑GPU bandwidth translates to a 0.9 s ceiling for 2048‑token prompts on a 4‑GPU node.
Quantization‑Aware JIT as a Unified Strategy
- 8‑bit and 4‑bit schemes are applied inside JIT‑generated kernels, preserving accuracy while meeting the 80 GB VRAM limit.
- Integration eliminates separate post‑training quantization steps, streamlining deployment pipelines; a toy sketch of the fused approach follows.
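A toy NumPy sketch of the fused approach: weights are stored in INT8 with per‑output‑channel scales and dequantized inside the matmul rather than in a separate post‑training pass. A real system would JIT‑compile and specialize such a kernel per shape; this sketch only demonstrates the numerics, and the function names are illustrative.

```python
import numpy as np

def quantize_per_channel(w: np.ndarray):
    """Symmetric INT8 quantization with one scale per output channel."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def fused_int8_matmul(x: np.ndarray, w_q: np.ndarray, scale: np.ndarray):
    # Dequantize on the fly; a compiled kernel would keep this in registers.
    return x @ (w_q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)
x = rng.standard_normal((8, 4096)).astype(np.float32)
w_q, s = quantize_per_channel(w)

ref = x @ w.T
rel_err = np.abs(fused_int8_matmul(x, w_q, s) - ref).max() / np.abs(ref).max()
print(f"Max relative error vs. FP32: {rel_err:.4%}")  # small for per-channel INT8
```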
Implications for Deployment Costs and Latency Targets
- Nebius SLA promises ≤ 0.9 s inference for 2048‑token prompts, matching or beating newer accelerator generations.
- Software‑first optimizations (TransferEngine + JIT + quantization) achieve cost parity with next‑gen hardware on legacy H100/H200 clusters.
Projected Developments Through 2026
- Industry‑wide convergence on an MLIR‑based JIT API expected to simplify cross‑hardware deployment.
- TransferEngine likely to become the reference implementation for multi‑GPU inference on H100/H200, with adoption by at least three major cloud providers.
- Sub‑second latency for DeepSeek V3 on ≤ 4 GPU nodes projected by Q4 2025, enabling production‑grade services without specialized accelerators.
Google’s Ironwood TPUs Redefine Large‑Scale Transformer Training
Unprecedented Performance
- Each Ironwood TPU delivers 4,614 TFLOPS (FP8) – more than double the FP8 throughput of NVIDIA H100 GPUs.
- HBM3E memory per TPU provides 192 GB with 7.37 TB/s bandwidth, enabling massive model slices to stay on‑chip.
- Pods interconnect at 9.6 Tb/s, reducing data‑movement latency and supporting a 1.77 PB aggregate HBM pool.
- Full‑scale pods comprise 9,216 TPUs, achieving 42.5 ExaFLOPS (FP8) with 99.999 % uptime.
Cost Efficiency
- Monthly pricing for an Ironwood‑enabled instance is roughly $600, comparable to commodity GPU rates yet delivering >8× price‑performance versus NVIDIA H100.
- Enterprises can expect an approximate 90 % reduction in effective training costs for FP8‑optimised transformer workloads; see the arithmetic sketch after this list.
- Energy consumption per TFLOP is roughly a quarter that of earlier TPU generations, thanks to the high‑speed interconnect and advanced silicon.
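The ">8× price‑performance" and "~90 % reduction" figures above are two views of the same ratio: if the same FP8 workload costs one‑eighth as much per unit of compute, effective spend drops by 1 - 1/8 = 87.5 %, i.e. roughly 90 %. A one‑line check:

```python
price_performance_gain = 8.0   # Ironwood vs. H100, per the claim above
print(f"Implied training-cost reduction: {1 - 1 / price_performance_gain:.1%}")  # 87.5%
```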
Market Ripple Effects
- Anthropic’s commitment of up to 1 million Ironwood TPUs signals immediate high‑volume demand for the platform.
- Google’s AI‑hardware capital expenditure grew from $85 B to $93 B in FY 2025, accelerating pod deployments across its data‑centers.
- Vertical integration of custom Axion CPUs (powering the new N4A instances) mirrors the industry shift away from legacy x86 for AI workloads, reinforcing broader hardware diversification.
- Early‑adopter enterprises report over 20 % of their transformer training workloads migrated within three months of availability.
Future Outlook (2026‑2028)
- At least 35 % of large‑scale transformer training is projected to run on Ironwood by 2028.
- Cloud GPU demand is forecast to decline by 15 % as Ironwood achieves cost parity and superior performance.
- Training cost per parameter is expected to drop more than 30 %, facilitating the scaling of models beyond one trillion parameters.
- AI‑specific data‑center PUE could improve by roughly five percentage points, driven by the pods’ high utilisation and lower idle power cycles.
- Competitors lacking comparable custom silicon may face a 20 % higher total cost of ownership for similar AI workloads.
Strategic Implications
Enterprises targeting multi‑billion‑parameter model development should prioritize migration to Ironwood‑enabled instances. The combination of superior throughput, dramatically lower effective cost, and improved energy efficiency positions Google’s offering as the new benchmark for frontier AI training, while reshaping the competitive landscape of cloud AI infrastructure.