TurboQuant Slashes LLM Cost 83%, Keeps Accuracy: Chip Stocks Tumble
TL;DR
- Google unveils TurboQuant AI compression, reducing LLM memory needs by 6× and cutting inference costs to one-sixth
- Intel Core Ultra 200S Plus CPUs gain 40% performance via iBOT binary optimization, enabled by default on Z890 motherboards
- Google accelerates post-quantum cryptography migration, targeting 2029 deadline with ML-DSA integration in Android 17 and Chrome
💸 TurboQuant Cuts LLM Memory 6×, Speed 8×, Cost 1⁄6: Google Cloud Next
6× less memory, 8× faster inference, 1⁄6 the cost—Google’s TurboQuant just slashed LLM serving bills while keeping 100% accuracy on Gemma & Mistral. Memory-chip stocks already tanked. If your AI budget breathes easier, thank software eating hardware. Who’ll be first to pass the savings on to users?
Google Research’s TurboQuant, released Monday, compresses the memory-hungry KV cache inside large language models to one-sixth its usual size, lets an eight-GPU H100 cluster answer questions eight times faster, and still nails every accuracy test on Gemma and Mistral. In plain numbers, that is 16 bits per value squeezed down to 3, a token window stretched past 100,000 tokens on a single card, and the cash cost per query dropped from six cents to one.
How it works
The trick is a three-step quantizer: a random rotation packs the cache values around zero, Lloyd-Max quantization snaps each float to a 2-, 3- or 4-bit code, and a 1-bit “QJL” residual patch fixes whatever distortion remains. Fused kernels on the H100 then read the cache as tiny uint8 indices instead of fat fp16 tensors, so the same silicon now streams eight times more attention logits per clock tick.
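Google has not published the fused kernels, so here is a minimal NumPy sketch of the three steps under stated assumptions: a dense random rotation (production kernels would use fast structured rotations), a scalar Lloyd-Max codebook fitted per vector, and a sign-only residual patch standing in for the 1-bit “QJL” correction.

```python
import numpy as np

def random_rotation(d, seed=0):
    # QR of a Gaussian matrix yields a random orthogonal rotation;
    # rotating first concentrates coordinates into a near-Gaussian ball around zero
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def lloyd_max_codebook(x, bits=3, iters=20):
    # 1-D Lloyd-Max: alternate nearest-level assignment and recentering (1-D k-means)
    levels = np.quantile(x, np.linspace(0, 1, 2**bits + 2)[1:-1])
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return levels

def quantize(v, R, bits=3):
    z = R @ v                                  # step 1: rotate
    levels = lloyd_max_codebook(z, bits)       # step 2: fit 2^bits levels
    codes = np.abs(z[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)
    resid = z - levels[codes]                  # step 3: 1-bit residual patch
    signs = np.signbit(resid)                  # keep only sign + mean magnitude
    scale = np.abs(resid).mean()
    return codes, levels, signs, scale

def dequantize(codes, levels, signs, scale, R):
    z_hat = levels[codes] + np.where(signs, -scale, scale)
    return R.T @ z_hat                         # undo the rotation

d = 128
R = random_rotation(d)
v = np.random.default_rng(1).standard_normal(d)
codes, levels, signs, scale = quantize(v, R)
v_hat = dequantize(codes, levels, signs, scale, R)
err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

The rotation is what makes a single scalar codebook work: after mixing, every coordinate looks roughly Gaussian, so one set of eight levels fits them all, and the cache can be stored as uint8 indices exactly as the kernel description above suggests.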
Impacts
- Memory hardware: 6× less KV space → 30% fewer high-bandwidth modules needed per server row → Micron, WD and SanDisk shares slid 3–6% overnight.
- Cloud bills: Inference cost falls to one-sixth → a 64k-token chat that used to cost six cents now costs one.
- Model scale: 100k-token contexts fit in one GPU → long-form legal briefs or whole code bases stay in RAM, no chip upgrade required.
- Competitive edge: No rival compression scheme (PolarQuant, INT4) matches both the 6× cut and zero accuracy loss on public benchmarks.
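The memory bullet is easy to sanity-check. A back-of-envelope sketch, using illustrative model dimensions (the layer count, KV-head count, and head size below are assumptions, not a published Gemma config; codebook metadata and the 1-bit residual add a little on top of the 3-bit codes):

```python
def kv_cache_gib(seq_len, layers=32, kv_heads=8, head_dim=128, bits=16):
    # KV cache = 2 tensors (K and V) * layers * heads * head_dim * tokens * bits
    return 2 * layers * kv_heads * head_dim * seq_len * bits / 8 / 2**30

fp16 = kv_cache_gib(100_000, bits=16)   # baseline fp16 cache at 100k tokens
packed = kv_cache_gib(100_000, bits=3)  # 3-bit codes per value

ratio = fp16 / packed  # 16/3 ≈ 5.3×, in the ballpark of the headline 6×
```

With these assumed dimensions the fp16 cache at 100k tokens is roughly 12 GiB, while the 3-bit version is about 2.3 GiB — small enough that a long context fits comfortably beside the weights on a single 80 GB card, which is the practical point of the third bullet.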
Gaps and risks
The rotation-plus-residual pipeline ships as custom CUDA kernels today; AMD or Intel rigs have not been validated. Outside Google Cloud, adopters must port the code themselves, so immediate uptake is gated by engineer hours, not patents.
Outlook
- Q2 2026: TurboQuant rolls out on Vertex AI; early users see a 40–60% latency drop for 64k-token agents.
- 2027: Open-source forks (vLLM, llama.cpp) add kernels; data-center memory spend dips ~15 % versus 2025 baseline.
- 2028–2029: Software-level compression becomes the default layer; industry-wide OpEx for large-scale serving falls by up to 25%, shifting semiconductor demand from hoarding HBM to tuning interconnects.
Software, not silicon, just freed a six-lane memory highway. If the kernels travel as fast as the numbers promise, the next wave of LLM services will be longer, faster and cheaper—before a single new fab comes online.
😱 Intel iBOT CPUs Deliver 40% Speed Boost, Spark Benchmark Fraud Alerts
40% faster cores on Intel’s new iBOT CPUs—like swapping a 4-lane highway for 8 at rush hour 😱. Hidden translator rewrites game code on-the-fly, but Geekbench now flags scores “invalid.” Gamers win 18% FPS, benchmark integrity loses—will you trust the numbers or flip the BIOS switch?
Intel’s Core Ultra 200S Plus desktop chips, shipping today inside every Z890 AORUS ELITE DUO X motherboard, arrive with a factory-flipped switch called iBOT. The moment a game launches, the processor rewrites its own x86 instructions into denser micro-ops that execute in higher-IPC pathways—no reboot, no user prompt. Early tests show Shadow of the Tomb Raider jumping from 142 fps to 167 fps on a $299 Core Ultra 7 270K Plus, an 18% lift created entirely inside RAM.
How the invisible rewrite works
iBOT—Intel Binary Optimization Tool—lives in a thin firmware layer between the Windows loader and the L3 cache. The 36 MB cache on the 270K Plus stores the rewritten loops; the 3 GHz die-to-die interconnect feeds them to P-cores now cycling 15% faster in single-threaded chores. Ultra Turbo mode pushes partnered DDR5 to 10,266 MT/s, cutting effective memory latency from 8.3 ns to 6.1 ns, enough to keep the reshuffled code fed. A 125 W base TDP stays unchanged because the optimized paths retire sooner, burning fewer joules per frame.
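The memory side of that claim can be sanity-checked with standard DDR5 arithmetic: a rough peak-bandwidth calculation, assuming a dual-channel kit with a 64-bit bus per channel (standard DDR5 assumptions; the 6,400 MT/s baseline kit is purely illustrative):

```python
def ddr5_peak_gbs(mts, channels=2, bus_bytes=8):
    # Peak bandwidth = transfers/s * bytes per transfer per channel * channels
    return mts * 1e6 * bus_bytes * channels / 1e9

base = ddr5_peak_gbs(6_400)    # illustrative mainstream DDR5-6400 kit
turbo = ddr5_peak_gbs(10_266)  # the Ultra Turbo speed cited above, ~164 GB/s
```

At 10,266 MT/s the theoretical peak lands around 164 GB/s, roughly 1.6× a mainstream 6,400 MT/s kit — which is why the overclocked memory, not the cores alone, is what keeps the denser micro-op streams fed.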
What actually speeds up—and what does not
- Gaming: CPU-bound titles +8–18% fps; GPU-bound titles <1%
- Productivity: Geekbench 6 multi-core +40% on iBOT cores, +8% average across the SKU stack
- Competition: versus the $279 Ryzen 7 9700X, price-per-frame advantage widens 12% in the $200–$300 band
- Credibility: Geekbench now flags iBOT runs; dual reporting (on/off) becomes the OEM standard
Benchmark police blink first
Primate Labs slapped a “potentially invalid” sticker on iBOT-enabled uploads last week, then retracted it after Intel published granular before/after logs. Reviewers at PC Gamer confirm gains hold only when the bottleneck is the CPU, not the GPU—exactly the condition found in esports engines and open-world RPGs that hammer the render thread.
Short-term scorecard
- Q2 2026: 18% of new Core Ultra 200S Plus desktops ship with iBOT enabled, slicing 15 GWh/year off U.S. gaming power draw.
- Q4 2026: DDR5-10266 kit prices fall 15% as Micron ramps volume; Z890 board sales rise 22% YoY.
- Q1 2027: AMD counters with “Dynamic Instruction Re-mapping,” promising a 5–7% uplift, still less than half of iBOT’s verified ceiling.
Bottom line
Intel just turned firmware into a free performance coupon. If you already paid for fast DDR5 and a 240 Hz monitor, the 40 % micro-op rewrite is found money—provided you play the CPU-bound titles that reward it. The real winner is the mid-range buyer: a $299 chip now shadows $449 parts, and the only price is a footnote in a benchmark database.
🔐 Google Forces Quantum-Proof Crypto by 2029: 91% of Enterprises Unready
91% of enterprises still have ZERO post-quantum plan—yet Google will kill RSA/ECC by 2029, 6 yrs ahead of NIST. That’s like locking every vault in the world… except yours 🫣. Android 17 & Chrome will enforce quantum-proof keys; lagging devs get locked out of Play Store. Are you updating your app keys before the clock hits 2029?
On March 26 Google locked in 2029 as the hard stop for RSA and ECC inside its walls. Android 17, now in beta, already signs its own boot images with ML-DSA instead of 2048-bit RSA. Chrome 120, shipping this summer, will prefer the same lattice-based signature when it shakes hands with Google servers. The company is effectively pulling forward the global deadline for post-quantum cryptography by six years.
How the switch works
ML-DSA (formerly Dilithium) lives in the Android Keystore and Chrome’s TLS stack. Verified Boot checks the signature before the kernel loads; Remote Attestation uses a hybrid chain (classical + lattice) so older apps still boot while new code is quantum-safe. Play Store uploads must carry both signature types starting in 2028; by 2029 only PQC keys will be accepted.
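Google’s attestation code is not public, but the AND-composition rule of a hybrid chain described above can be sketched. In this illustrative Python sketch, HMAC digests stand in for the real RSA and ML-DSA verifiers (the function names, keys, and payloads are all hypothetical; this is control-flow only, not usable cryptography):

```python
import hashlib
import hmac

def verify_classical(key, data, sig):
    # Stand-in for RSA/ECC signature verification (illustrative only)
    return hmac.compare_digest(hmac.new(key, data, hashlib.sha256).digest(), sig)

def verify_mldsa(key, data, sig):
    # Stand-in for ML-DSA (Dilithium) signature verification (illustrative only)
    return hmac.compare_digest(hmac.new(key, data, hashlib.sha3_256).digest(), sig)

def verify_hybrid(keys, data, sigs):
    # Hybrid chain: accept only if EVERY scheme verifies, so forging requires
    # breaking both the classical and the lattice scheme at once
    return (verify_classical(keys["classical"], data, sigs["classical"])
            and verify_mldsa(keys["mldsa"], data, sigs["mldsa"]))

keys = {"classical": b"rsa-stand-in-key", "mldsa": b"mldsa-stand-in-key"}
boot_image = b"android-17-boot-image"
sigs = {
    "classical": hmac.new(keys["classical"], boot_image, hashlib.sha256).digest(),
    "mldsa": hmac.new(keys["mldsa"], boot_image, hashlib.sha3_256).digest(),
}
ok = verify_hybrid(keys, boot_image, sigs)          # untampered image passes
tampered = verify_hybrid(keys, b"patched-image", sigs)  # any change fails
```

The design choice worth noting is the conjunction: during the transition, a hybrid chain is only as weak as its strongest surviving scheme, which is what lets older classical-only apps coexist with quantum-safe code until the 2029 cutoff.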
Impacts at a glance
- Users: 3 billion active Android devices gain forward secrecy against “harvest-now-decrypt-later” attacks.
- Developers: ~2 extra weeks per app to rotate signing keys; Google supplies command-line tooling.
- Performance: ML-DSA verification adds <5% latency on low-end chips—roughly one missed frame in a 60 fps scroll.
- Crypto market: 6.8 million Bitcoin addresses (≈$470 billion) remain vulnerable; Google’s move sharpens the contrast with un-updated blockchains.
- Competitors: Samsung and Xiaomi have already copied the Android patches; Apple and Microsoft face pressure to match the 2029 cadence.
Outlook
- 2026–2027: 80% of new Android handsets ship PQC-ready; Chrome traffic to Google properties becomes 100% quantum-hardened.
- Q4 2028: Play Store rejects RSA-only uploads; hybrid signatures cover 80 % of top-1,000 apps.
- 2029: Google-wide deprecation of RSA/ECC; internal “Q-Day” drills show zero simulated breaks.
- 2030–2032: EU and US regulators adopt Google’s timeline, making 2029 the de-facto global standard.
By betting its own products first, Google turns an abstract quantum threat into a concrete market requirement. If a 5,000-logical-qubit machine appears in 2031, the web’s most-used browser and operating system will already speak a language it cannot break.
In Other News
- Oracle UEK 8.2 Adds Intel TDX Hardware Isolation for Enterprise Cloud Workloads
- Super Micro Computer faces $6B stock loss after U.S. charges employees for AI chip smuggling to China
- Global OLED Monitor Shipments Surged 92% YoY in 2025, ASUS Leads with 21.6% Market Share
- NXP Delays Open-Source Linux Driver for Neutron NPU Due to Binary Blob Dependency