SOMA.

A tiny mixture-of-experts language model running entirely in this tab — weights, sampling, everything. No server. Nothing you type leaves your device.

No transformers. No attention. Trained without backprop — zeroth-order optimization only (SPSA).

params of weights experts nats/char
fetching weights…
Opened directly from disk? Browsers block pages on file:// from fetching local files, so pick the weights manually — choose (or drop) the soma.bin sitting next to this page.
For the full experience serve the folder instead: python3 -m http.serverhttp://localhost:8000

temp 0.80 top-k 40 top-p 0.95 repeat 1.05 length 400
copied ✓
tok/s ms/tok ctx 0 experts entropy ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ bits mem backend

What is this

SOMA is a character-level language model with a sparse mixture-of-experts architecture — recurrent networks, not transformers — trained from scratch on ~100M characters of web text without backpropagation: weights were updated with SPSA, a zeroth-order method that needs only forward passes. It reads and writes raw bytes — no tokenizer — and the whole thing is 2.5 MB of fp32 weights: smaller than most hero images.

It writes plausible-looking almost-English. At this scale, that is the point: the interesting part is watching a sub-megabyte network learn spelling, morphology, and the rhythm of prose — and seeing exactly which expert handles which character as it happens.

How it works

A one-layer LSTM router reads the stream byte by byte and, for each byte, scores 16 expert networks — each a two-layer LSTM with a GELU MLP. Every expert advances its recurrent state on every byte (so any expert can be picked at any time), but only the argmax winner's hidden state is decoded into next-byte logits, which keeps decoding O(1) in the number of experts. The color by expert toggle above shows the router's choice per character; the sparkline in the stats bar shows how evenly the load spreads.

Because it is a recurrent network there is no attention window and no KV cache: context is unbounded, memory is constant, and every token costs the same ~1 MFLOP.

How it runs here

Inference happens on your device. The matrix kernels are hand-written WebAssembly SIMD (f32x4, dual accumulators) running in a Web Worker; layer norms and gate nonlinearities stay in JavaScript, which keeps the browser bit-for-bit consistent with the reference implementation (verified to <10⁻⁶ against the training math). Typical throughput is thousands of tokens per second on a laptop. If WASM SIMD is unavailable the page falls back to a plain typed-array engine.

The page fetches one file of weights (soma.bin: a JSON header + raw fp32 tensors), caches it, and all inference after that is local — generation makes zero network requests. The only telemetry is an anonymous page-view count; the text you type and generate never leaves the tab.