Type "What is 17 * 24?" into a browser tab. Watch the model pause, write out its reasoning step by step inside visible thinking tags, then produce the answer. The entire process happens on your graphics card. No token ever leaves your machine. No API key. No server. Liquid AI's LFM2.5-1.2B-Thinking is a 1.2-billion-parameter reasoning model that now runs entirely inside Chrome using WebGPU.
That last sentence sounds like it should be impossible, so let's unpack it. Below: what the model is, why it matters for privacy, and three ways to run it yourself (including one that takes four terminal commands and zero prior JavaScript experience).
At a glance: 1.2B parameters · <900MB memory (Q4) · 0 server calls · ~52 tok/s (NPU)
The Privacy Problem with Cloud AI
Every prompt you send to ChatGPT, Claude, or Gemini crosses a network and lands on someone else's GPU. A company promises to handle your data responsibly. You trust a privacy policy. For casual questions, that is fine. For medical notes, financial models, legal documents, or a personal journal, it is a real problem.
Browser-based inference eliminates the trust question entirely. When the model runs on your own hardware, there is no network request to intercept, no server to subpoena, no third party to trust. Privacy stops being a policy. It becomes physics. Data that never leaves a device cannot be leaked from a device.
Three Technologies Made This Possible
WebGPU is a browser feature that lets websites run math on your graphics card. Think of it as giving JavaScript direct access to the same GPU that renders your video games. Chrome, Edge, and Safari 26+ ship with it enabled. That covers roughly 85-90% of global browser traffic.
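Before loading anything heavy, an app can feature-detect WebGPU from JavaScript. A minimal sketch (`navigator.gpu` is the entry point defined by the WebGPU spec; the helper takes the navigator object as a parameter, which is an illustrative choice so the logic can run anywhere):

```javascript
// Feature-detect WebGPU. navigator.gpu is the spec-defined entry point.
// Guarding against null matters because typeof null === "object".
function supportsWebGPU(nav) {
  return typeof nav === "object" && nav !== null && "gpu" in nav;
}

// In a real page: if (!supportsWebGPU(navigator)) { /* fall back to WASM/CPU */ }
```
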
Transformers.js is a JavaScript library from Hugging Face that loads and runs ML models in the browser. Think of it as the Python transformers library, but for JavaScript. It handles tokenization, model loading, and text generation with a nearly identical interface.
Quantization to Q4 shrinks the model weights from full precision down to 4 bits per parameter. This compresses the 1.2B-parameter LFM2.5 to under 900MB, small enough to download once and cache in the browser's local storage. After that first download, the model loads from cache. It even works offline.
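The arithmetic behind that size is worth a moment. A back-of-envelope sketch (the real artifact is larger than the raw weights because embeddings, higher-precision layers, and file-format overhead add to the total):

```javascript
// Back-of-envelope: 4 bits per parameter, 1.2 billion parameters.
const params = 1.2e9;
const bitsPerParam = 4;
const rawMB = (params * bitsPerParam) / 8 / 1e6; // bits -> bytes -> MB

console.log(rawMB); // 600 MB of raw weights; overhead brings the
                    // actual download into the ~700-900MB range
```
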
How the Pieces Fit Together
Prompt → Web Worker (background thread) → WebGPU (your graphics card) → streamed tokens

🔒 Everything stays inside your browser tab. Zero network requests during inference.

Source: LiquidAI/LFM2.5-1.2B-Thinking-WebGPU
You type a prompt. The app hands it to a Web Worker (a background thread, so the chat UI stays responsive while the model computes). The Worker uses Transformers.js to run the model on your GPU via WebGPU. The model writes its reasoning steps, then a final answer. Tokens stream back to the screen one at a time, just like ChatGPT, except the "server" is your own laptop.
The Model: LFM2.5-1.2B-Thinking
This is not a standard Transformer. Liquid AI designed a hybrid architecture that mixes two types of layers: 10 convolution blocks (fast at spotting local patterns) and 6 grouped-query attention blocks (good at connecting distant context). Think of it as welding a CNN onto a Transformer. That hybrid design is how 1.2B parameters fit under 900MB and still outperform Qwen3-1.7B, a pure-Transformer model with roughly 40% more parameters.
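To make the layer mix concrete, here is a purely illustrative sketch of the block inventory. The counts (10 + 6) come from Liquid AI's published description; the actual interleaving order lives in the model config and is not shown here:

```javascript
// Illustrative only: block counts per the published architecture.
// The real interleaving order is defined by the model config.
const convBlocks = Array(10).fill("conv"); // short-range local patterns
const gqaBlocks = Array(6).fill("gqa");    // long-range attention

const totalBlocks = convBlocks.length + gqaBlocks.length;
console.log(totalBlocks); // 16 blocks total
```
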
Liquid AI trained the Thinking variant on 28 trillion tokens, then refined it with multi-stage reinforcement learning. The result: the model writes its reasoning inside dedicated thinking tags before committing to a final answer. You can watch it work through a math problem, verify an intermediate step, and correct itself, all visible in the output stream.
On AMD Ryzen NPUs, the model sustains approximately 52 tokens per second at 16K context. It supports 32K context, function calling, and tool use. Liquid AI recommends it for math, logic, data extraction, and agentic workflows. For casual chat and creative writing, they suggest the non-thinking Instruct variant instead.
Run It Yourself: 3 Paths
Path 1: Try it now (zero setup)
Open Liquid AI's Hugging Face Space in Chrome or Edge. The model downloads on first visit (1-3 minutes depending on connection speed). Once loaded, type a question. The thinking tokens will stream in real-time. After the first load, the model is cached in your browser, so future visits start in seconds.
Path 2: Run locally (4 commands)
This clones an open-source React chat app that wraps the model. The prerequisite is Node.js (version 18+). If you do not have it, download it from nodejs.org (pick the LTS version). Node.js comes bundled with npm, the package manager used below. If you come from Python, think of npm as pip for JavaScript.
# 1. Clone the project
git clone https://github.com/sitammeur/lfm2.5-thinking-web.git
# 2. Enter the folder
cd lfm2.5-thinking-web
# 3. Install dependencies (~30 seconds)
npm install
# 4. Start the dev server
npm run dev
Your terminal will print a URL, usually http://localhost:5173. Open it in Chrome. On first visit, the browser downloads the Q4 model weights from Hugging Face (roughly 700-900MB). You will see a progress bar. This is the only time the download happens. After that, the weights are cached in the browser (IndexedDB), and the app loads them from that cache even if you disconnect from the internet.
What to expect on first run
After the model downloads, your first prompt will feel slow (10-30 seconds). That is WebGPU compiling GPU shaders, and it only happens once per browser session. After that warm-up, tokens stream at usable speed. A discrete GPU (like an RTX or Radeon card) will be noticeably faster than integrated laptop graphics. If everything looks frozen for a minute, wait. The shader compilation is working.
Path 3: Build your own app
Want to embed the model into your own project instead of using the prebuilt chat? The core logic is surprisingly short. You need two npm packages: @huggingface/transformers and onnxruntime-web. The code below loads the model, sends a prompt, and streams tokens back to the main thread.
// worker.js (runs in a Web Worker so the UI stays smooth)
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

// Step 1: Load model + tokenizer (cached after first download)
const modelId = "LiquidAI/LFM2.5-1.2B-Thinking-ONNX";
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu", // run on the GPU via WebGPU
  dtype: "q4",      // 4-bit quantized weights
});

// Step 2: Format the prompt as a chat message
const messages = [
  { role: "user", content: "What is 17 * 24?" }
];
const inputs = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true, // returns { input_ids, attention_mask } for generate()
});

// Step 3: Generate tokens, streaming each one back to the main thread
const streamer = new TextStreamer(tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
  callback_function: (text) => {
    self.postMessage({ type: "token", text });
  }
});

await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: true,
  temperature: 0.1,
  streamer
});
Three things to know. device: "webgpu" tells the library to use your GPU instead of CPU. dtype: "q4" selects the 4-bit quantized weights (smallest download, recommended). And self.postMessage is how the Web Worker sends each token back to the main thread so your UI can display it. That pattern, model in a Worker and UI in the main thread, is the standard way to build responsive browser-based AI apps.
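For completeness, here is a hedged sketch of the main-thread side of that pattern. The message shape `{ type: "token", text }` matches what the worker posts; the worker filename and the `#output` element are assumptions. The token-appending step is factored into a pure function so the core logic is easy to verify outside a browser:

```javascript
// main.js — sketch of the UI side of the Worker protocol.
// Pure helper: fold an incoming worker message into the transcript.
function appendToken(transcript, msg) {
  return msg && msg.type === "token" ? transcript + msg.text : transcript;
}

// Browser wiring (assumes the worker.js above and an #output element):
//   const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });
//   let transcript = "";
//   worker.onmessage = (e) => {
//     transcript = appendToken(transcript, e.data);
//     document.querySelector("#output").textContent = transcript;
//   };
//   worker.postMessage({ prompt: "What is 17 * 24?" });
```

Keeping the reducer pure also makes it trivial to ignore non-token messages (progress updates, errors) without touching the display logic.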
If Something Goes Wrong
Symptom: "WebGPU not available"
Fix: Use Chrome or Edge. Type chrome://gpu in the address bar to confirm WebGPU status. If disabled, enable it at chrome://flags/#enable-unsafe-webgpu and restart.

Symptom: First prompt takes 30+ seconds
Fix: Normal. WebGPU compiles GPU shaders on first inference. Wait it out. The second prompt will be much faster.

Symptom: npm install fails
Fix: Check your Node.js version with node --version. You need 18 or newer. Download the LTS release from nodejs.org.

Symptom: Firefox or Safari does not work
Fix: Firefox requires a flag: set dom.webgpu.enabled to true in about:config. Safari supports WebGPU from version 26.0 onward; older versions do not.
Want to share it with others? Run npm run build. The output is a folder of static files (HTML, JS, CSS, WASM). Deploy it to Vercel, Netlify, GitHub Pages, or any file server. There is no backend. The "server" is the visitor's own GPU. Your hosting cost is zero for inference.
What This Means
Privacy becomes a property of the architecture, not a promise in a document. When inference runs on the user's device, data physically cannot leave. No privacy policy required. No data processing agreement. For HIPAA, GDPR, or any context where data sensitivity matters, this is a qualitatively different level of protection.
Model architecture matters more at the edge than in the cloud. A pure Transformer at 1.7B parameters would need more memory and run slower in the browser. LFM2.5's hybrid convolution-plus-grouped-query design is what makes sub-900MB possible while matching that larger model on reasoning benchmarks. When your deployment target is "a browser tab on someone's laptop," architecture efficiency is not optional.
The distribution model for AI just got simpler. You can now ship a reasoning model the same way you ship a website: as a URL. No Python environment. No Docker container. No GPU rental. The model downloads once, caches locally, and works offline after that. The era of "private AI with zero infrastructure" is a link someone can bookmark.
Two years ago, chain-of-thought reasoning required a data center. Today it runs in a browser tab you can bookmark, with your data going nowhere. The question worth asking: what do people build when private AI costs nothing to deploy?