Type "What is 17 * 24?" into a browser tab. Watch the model pause, write out its reasoning step by step inside visible thinking tags, then produce the answer. The entire process happens on your graphics card. No token ever leaves your machine. No API key. No server. Liquid AI's LFM2.5-1.2B-Thinking is a 1.2-billion-parameter reasoning model that now runs entirely inside Chrome using WebGPU.
That last sentence sounds like it should be impossible, so let's unpack it. Below: what the model is, why it matters for privacy, and three ways to run it yourself (including one that takes four terminal commands and zero prior JavaScript experience).
At a glance: 1.2B parameters · <900MB memory (Q4) · 0 server calls · ~52 tok/s (NPU)
The Privacy Problem with Cloud AI
Every prompt you send to ChatGPT, Claude, or Gemini crosses a network and lands on someone else's GPU. A company promises to handle your data responsibly. You trust a privacy policy. For casual questions, that is fine. For medical notes, financial models, legal documents, or a personal journal, it is a real problem.
Browser-based inference eliminates the trust question entirely. When the model runs on your own hardware, there is no network request to intercept, no server to subpoena, no third party to trust. Privacy stops being a policy. It becomes physics. Data that never leaves a device cannot be leaked from a device.
Three Technologies Made This Possible
WebGPU is a browser feature that lets websites run math on your graphics card. Think of it as giving JavaScript direct access to the same GPU that renders your video games. Chrome, Edge, and Safari 26+ ship with it enabled. That covers roughly 85-90% of global browser traffic.
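Before loading anything heavy, an app can feature-detect WebGPU from JavaScript. A minimal sketch (`navigator.gpu` is the entry point defined by the WebGPU spec; the helper takes the navigator object as a parameter, which is an illustrative choice so the logic can run anywhere):

```javascript
// Feature-detect WebGPU. navigator.gpu is the spec-defined entry point.
// Guarding against null matters because typeof null === "object".
function supportsWebGPU(nav) {
  return typeof nav === "object" && nav !== null && "gpu" in nav;
}

// In a real page: if (!supportsWebGPU(navigator)) { /* fall back to WASM/CPU */ }
```
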
Transformers.js is a JavaScript library from Hugging Face that loads and runs ML models in the browser. Think of it as the Python transformers library, but for JavaScript. It handles tokenization, model loading, and text generation with a nearly identical interface.
Quantization to Q4 shrinks the model weights from full precision down to 4 bits per parameter. This compresses the 1.2B-parameter LFM2.5 to under 900MB, small enough to download once and cache in the browser's local storage. After that first download, the model loads from cache. It even works offline.
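The arithmetic behind that size is worth a moment. A back-of-envelope sketch (the real artifact is larger than the raw weights because embeddings, higher-precision layers, and file-format overhead add to the total):

```javascript
// Back-of-envelope: 4 bits per parameter, 1.2 billion parameters.
const params = 1.2e9;
const bitsPerParam = 4;
const rawMB = (params * bitsPerParam) / 8 / 1e6; // bits -> bytes -> MB

console.log(rawMB); // 600 MB of raw weights; overhead brings the
                    // actual download into the ~700-900MB range
```
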
How the Pieces Fit Together
Prompt → Web Worker (background thread) → WebGPU (your graphics card) → streamed tokens

🔒 Everything stays inside your browser tab. Zero network requests during inference.

Source: LiquidAI/LFM2.5-1.2B-Thinking-WebGPU
You type a prompt. The app hands it to a Web Worker (a background thread, so the chat UI stays responsive while the model computes). The Worker uses Transformers.js to run the model on your GPU via WebGPU. The model writes its reasoning steps, then a final answer. Tokens stream back to the screen one at a time, just like ChatGPT, except the "server" is your own laptop.
The Model: LFM2.5-1.2B-Thinking
This is not a standard Transformer. Liquid AI designed a hybrid architecture that mixes two types of layers: 10 convolution blocks (fast at spotting local patterns) and 6 grouped-query attention blocks (good at connecting distant context). Think of it as welding a CNN onto a Transformer. That hybrid design is how 1.2B parameters fit under 900MB and still outperform Qwen3-1.7B, a pure-Transformer model with roughly 40% more parameters.
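To make the layer mix concrete, here is a purely illustrative sketch of the block inventory. The counts (10 + 6) come from Liquid AI's published description; the actual interleaving order lives in the model config and is not shown here:

```javascript
// Illustrative only: block counts per the published architecture.
// The real interleaving order is defined by the model config.
const convBlocks = Array(10).fill("conv"); // short-range local patterns
const gqaBlocks = Array(6).fill("gqa");    // long-range attention

const totalBlocks = convBlocks.length + gqaBlocks.length;
console.log(totalBlocks); // 16 blocks total
```
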
Liquid AI trained the Thinking variant on 28 trillion tokens, then refined it with multi-stage reinforcement learning. The result: the model writes its reasoning inside dedicated thinking tags before committing to a final answer. You can watch it work through a math problem, verify an intermediate step, and correct itself, all visible in the output stream.
On AMD Ryzen NPUs, the model sustains approximately 52 tokens per second at 16K context. It supports 32K context, function calling, and tool use. Liquid AI recommends it for math, logic, data extraction, and agentic workflows. For casual chat and creative writing, they suggest the non-thinking Instruct variant instead.
Run It Yourself: 3 Paths
Path 1: Try it now (zero setup)
Open Liquid AI's Hugging Face Space in Chrome or Edge. The model downloads on first visit (1-3 minutes depending on connection speed). Once loaded, type a question. The thinking tokens will stream in real-time. After the first load, the model is cached in your browser, so future visits start in seconds.
Path 2: Run locally (4 commands)
This clones an open-source React chat app that wraps the model. The prerequisite is Node.js (version 18+). If you do not have it, download it from nodejs.org (pick the LTS version). Node.js comes bundled with npm, the package manager used below. If you come from Python, think of npm as pip for JavaScript.
# 1. Clone the project
git clone https://github.com/sitammeur/lfm2.5-thinking-web.git
# 2. Enter the folder
cd lfm2.5-thinking-web
# 3. Install dependencies (~30 seconds)
npm install
# 4. Start the dev server
npm run dev
Your terminal will print a URL, usually http://localhost:5173. Open it in Chrome. On first visit, the browser downloads the Q4 model weights from Hugging Face (roughly 700-900MB). You will see a progress bar. This is the only time the download happens. After that, the weights are cached in the browser (IndexedDB), and the app loads them from that cache even if you disconnect from the internet.
What to expect on first run
After the model downloads, your first prompt will feel slow (10-30 seconds). That is WebGPU compiling GPU shaders, and it only happens once per browser session. After that warm-up, tokens stream at usable speed. A discrete GPU (like an RTX or Radeon card) will be noticeably faster than integrated laptop graphics. If everything looks frozen for a minute, wait. The shader compilation is working.
Path 3: Build your own app
Want to embed the model into your own project instead of using the prebuilt chat? The core logic is surprisingly short. You need two npm packages: @huggingface/transformers and onnxruntime-web. The code below loads the model, sends a prompt, and streams tokens back to the main thread.
// worker.js (runs in a Web Worker so the UI stays smooth)
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

// Step 1: Load model + tokenizer (cached after first download)
const modelId = "LiquidAI/LFM2.5-1.2B-Thinking-ONNX";
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu", // run on the GPU via WebGPU
  dtype: "q4",      // 4-bit quantized weights
});

// Step 2: Format the prompt as a chat message
const messages = [
  { role: "user", content: "What is 17 * 24?" }
];
const inputs = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true, // returns { input_ids, attention_mask } for generate()
});

// Step 3: Generate tokens, streaming each one back to the main thread
const streamer = new TextStreamer(tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
  callback_function: (text) => {
    self.postMessage({ type: "token", text });
  }
});

await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: true,
  temperature: 0.1,
  streamer
});
Three things to know. device: "webgpu" tells the library to use your GPU instead of CPU. dtype: "q4" selects the 4-bit quantized weights (smallest download, recommended). And self.postMessage is how the Web Worker sends each token back to the main thread so your UI can display it. That pattern, model in a Worker and UI in the main thread, is the standard way to build responsive browser-based AI apps.
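For completeness, here is a hedged sketch of the main-thread side of that pattern. The message shape `{ type: "token", text }` matches what the worker posts; the worker filename and the `#output` element are assumptions. The token-appending step is factored into a pure function so the core logic is easy to verify outside a browser:

```javascript
// main.js — sketch of the UI side of the Worker protocol.
// Pure helper: fold an incoming worker message into the transcript.
function appendToken(transcript, msg) {
  return msg && msg.type === "token" ? transcript + msg.text : transcript;
}

// Browser wiring (assumes the worker.js above and an #output element):
//   const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });
//   let transcript = "";
//   worker.onmessage = (e) => {
//     transcript = appendToken(transcript, e.data);
//     document.querySelector("#output").textContent = transcript;
//   };
//   worker.postMessage({ prompt: "What is 17 * 24?" });
```

Keeping the reducer pure also makes it trivial to ignore non-token messages (progress updates, errors) without touching the display logic.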
If Something Goes Wrong
Symptom: "WebGPU not available"
Fix: Use Chrome or Edge. Type chrome://gpu in the address bar to confirm WebGPU status. If disabled, enable it at chrome://flags/#enable-unsafe-webgpu and restart.

Symptom: First prompt takes 30+ seconds
Fix: Normal. WebGPU compiles GPU shaders on first inference. Wait it out. The second prompt will be much faster.

Symptom: npm install fails
Fix: Check your Node.js version with node --version. You need 18 or newer. Download the LTS release from nodejs.org.

Symptom: Firefox or Safari does not work
Fix: Firefox requires a flag: set dom.webgpu.enabled to true in about:config. Safari supports WebGPU from version 26.0 onward; older versions do not.
Want to share it with others? Run npm run build. The output is a folder of static files (HTML, JS, CSS, WASM). Deploy it to Vercel, Netlify, GitHub Pages, or any file server. There is no backend. The "server" is the visitor's own GPU. Your hosting cost is zero for inference.
What This Means
Privacy becomes a property of the architecture, not a promise in a document. When inference runs on the user's device, data physically cannot leave. No privacy policy required. No data processing agreement. For HIPAA, GDPR, or any context where data sensitivity matters, this is a qualitatively different level of protection.
Model architecture matters more at the edge than in the cloud. A pure Transformer at 1.7B parameters would need more memory and run slower in the browser. LFM2.5's hybrid convolution-plus-grouped-query design is what makes sub-900MB possible while matching that larger model on reasoning benchmarks. When your deployment target is "a browser tab on someone's laptop," architecture efficiency is not optional.
The distribution model for AI just got simpler. You can now ship a reasoning model the same way you ship a website: as a URL. No Python environment. No Docker container. No GPU rental. The model downloads once, caches locally, and works offline after that. The era of "private AI with zero infrastructure" is a link someone can bookmark.
Two years ago, chain-of-thought reasoning required a data center. Today it runs in a browser tab you can bookmark, with your data going nowhere. The question worth asking: what do people build when private AI costs nothing to deploy?