Overview
Release build is around 170 KB on wasm32-unknown-unknown (panic=abort, opt-level=z, lto=true, codegen-units=1). Pipeline: LUT-driven lexer -> single-pass Pratt parser emitting SSA-versioned bytecode directly -> peephole constant-folding optimiser -> token-threaded interpreter with two layers of adaptive specialisation.
No AST, no IR, bytecode is the only intermediate representation. Around 13,000 lines of Rust; production deps are hashbrown and itoa (SHA-256 in-tree). WASM build adds lol_alloc for a single-threaded leaking bump allocator.
Classes support single-level inheritance, super(), full dunder dispatch, @property / @x.setter. Functional-first: composition preferred, monomorphic dispatch optimised via instance-dunder IC.
Concepts
- Offset-based tokens:
(kind, line, start, end)indices into the source buffer. No string copies during lexing; parser slices lazily. - Single-pass SSA codegen: Variables versioned per assignment (
x_1,x_2). Control-flow joins emitPhiopcodes resolved at runtime. - Token-threaded dispatch:
Vec<Instruction>where each is(opcode: OpCode, operand: u16). Hot loop is a flatmatch; Rust lowers it to a jump table. Not direct threading (computed-goto isn’t available in safe Rust). - Per-instruction inline caching: Each binary op records operand type tags. After
QUICK_THRESH = 4stable hits the IC stores a typedFastOp(AddInt,AddFloat,AddStr,LtFloat,EqStr,ModInt, …) as a speculative fast path with type-guard deopt. - Template memoisation: Pure user functions cache
(args) -> resultafterTPL_THRESH = 2hits, capped at 256 entries per function. Gated on no-kw calls, an outer scope free of impure ops (StoreItem,StoreAttr,Raise,Yield,Global,Nonlocal,Import, …), and byte-stable arguments (mutable containers disqualify). Hashing is an FNV-like fold over rawVal.0bits with a value-eq verify. - NaN-boxed values:
Valis a 64-bit union: 47-bit signed ints (inline), IEEE-754 floats (NaNs canonicalised), bools, None, an undef sentinel, and 28-bit heap indices. - Mark-and-sweep GC: Triggered when
live >= gc_thresholdoralloc_count >= max(live/4, 4096). After each sweepgc_threshold = max(live * 2, 512). Roots: stack, with-stack, yields, event queue, slots and live-slot snapshots, slot templates, globals, every iterator frame’siter_stack, opcode-cache constants, active const pools, function templates.
Bytecode shape
Each Instruction is 4 bytes: 1-byte OpCode (#[repr(u8)] planned), 2-byte operand, 1 byte padding. Opcodes span 17 categories, load, store, arith, bitwise, compare, logic, identity, control flow, iter, build, container, comprehension, function, ssa (Phi), yield, side effects, unsupported (raises at runtime). Around 40 specialised Call* variants for hot builtins; LoadAttr + Call(0) pairs fuse into CallMethod + CallMethodArgs after first dispatch.
OpCode::LoadConst operand = constant index
OpCode::LoadName operand = name slot
OpCode::StoreName operand = name slot
OpCode::Add / Sub operand = 0 (IC slot derived from ip)
OpCode::Call operand = (kw << 8) | pos
OpCode::Phi operand = target slot, sources in chunk.phi_sources
OpCode::ForIter operand = jump target on iterator exhaustionDispatch shape
Hot loop reads cache.fused_ref()[ip], a snapshot of the instruction stream with LoadAttr + Call(0) pairs fused into CallMethod + CallMethodArgs. Fusion runs once per chunk, cached.
For arith/compare opcodes, the loop checks cache.get_fast(ip): if a FastOp is present, it runs inline without a function call; type-guard miss invalidates the slot and falls back to the generic handler. IC is per-instruction, so monomorphic sites stabilise independently.
LoadConst reads a pre-materialised Vec<Val> (OpcodeCache::const_vals) built on first dispatch. Inline-range ints (47-bit) stored inline; 2⁴⁷,2¹²⁷ allocate a HeapObj::LongInt slot. Literals beyond ±2¹²⁷ are rejected at parse time.
Memory model
Val is 64 bits NaN-boxed (QNAN = 0x7FFC_0000_0000_0000, SIGN = 0x8000...):
| Tag | Pattern | Notes |
|---|---|---|
| Float | any non-canonical IEEE-754 | Quiet NaNs remapped to 0x7FF8... |
| Int | QNAN | SIGN | i48 | 47-bit signed inline; auto-promotes to HeapObj::LongInt (i128) on overflow |
| Undef | QNAN | Unbound-local sentinel |
| None | QNAN | 1 | |
| True | QNAN | 2 | |
| False | QNAN | 3 | |
| Heap | QNAN | 4 | (i28 << 4) | 28-bit index into HeapPool (max 1 << 28 slots) |
INT_MAX = 140_737_488_355_327, INT_MIN = -140_737_488_355_328. Inline ints take one ALU op per arithmetic; overflow promotes to HeapObj::LongInt(i128) until results fit inline again. LongInts are interned by value so equal values share a heap index (consistent hash/eq). Hard cap ±2¹²⁷; wider raises OverflowError. Arbitrary-precision bigints would need a Vec<u32>-limb variant (heap-allocs per op, Knuth D / Karatsuba code) or dropping NaN-boxing, both regress WASM-size and inner-loop goals.
PartialEq / Hash for Val funnel value-equal numerics through f64 bits so 1 == 1.0 and hash(1) == hash(1.0), dicts/sets see them as one key. FxBuildHasher uses a fixed seed for reproducible iteration order across runs.
Heap is a Vec<HeapSlot> arena with a free list (capped 524,288, sorted to prefer low indices). Strings, bytes (<=128 B), and LongInts are interned in side hashes so equal values collapse to one slot, short literals short-circuit through identity and dict/set lookups stay consistent across allocations. Live-object cap is Limits.heap (default 10M; sandbox 100K). Single-colour mark-and-sweep, no refcount, cycles reclaimed natively.
HeapObj variants: Str, Bytes, List (Rc<RefCell<Vec<Val>>>), Dict (insertion-ordered), Set, FrozenSet, Tuple, Func(fn_idx, defaults, captures), Range, Slice, Ellipsis (singleton, distinct from '...'), Type, ExcInstance, BoundMethod, NativeFn, Class(name, members), Instance(class, attrs), BoundUserMethod(recv, fn), Coroutine(ip, slots, stack, body, iter_stack, sync_frames) (shared by generators, async def, and the implicit module-body coro; body is BodyRef::Fn(usize) or BodyRef::Module; sync_frames stacks suspended sync sub-calls so a plain def hitting a yielding builtin can resume mid-body), Module(spec, attrs), Extern(Arc<dyn Fn>).
What the compiler intentionally does not do
- No SSA-wide constant propagation through
LoadName(preserved to keep IC, super-op, template paths fast). - No CSE, GVN, LICM, inlining, branch DCE, loop folding. Optimiser is constant folding + phi-noop elimination + dead-instruction compaction with jump-operand remap.
- No dead-store elimination beyond what falls out of constant folding.
- No IR, one representation between source and dispatch.
- No JIT (single-tier, pure Rust). Method JITs need per-arch stencils; trace JITs duplicate the execution model and complicate GC.
- No runtime module system, imports resolve at parse time through a host-injected
Resolver. See Imports. - No bigints, complex numbers,
bytearray,memoryview,Decimal,Fraction. Nogen.send/throw/close. Noasynciomodule,run,sleep,gather,with_timeout,cancel,receiveare top-level builtins.
Coroutine and context-manager dispatch
async def and yield-bearing def both produce HeapObj::Coroutine. run() drives the scheduler; sleep, gather, with_timeout, cancel, receive are top-level builtins. No asyncio module.
A plain def inside a coroutine calling a yielding builtin gets its state (ip, slots, stack/iter deltas) snapshotted as a SyncFrame pushed on the enclosing Coroutine’s sync_frames (innermost-last). resume_coroutine walks this stack inside-out before re-entering the outer body, so each helper’s return value lands at the original Call site, otherwise the outer’s resume_ip would skip past the unfinished helper.
vm.run() wraps the module body as an implicit coroutine with BodyRef::Module, top-level statements suspend like async def bodies. Single-driver: top_loop is the only place that picks coros; run / gather / with_timeout are non-driving, they push targets to the scheduler, park the outer in CoroState::WaitingForChildren { tasks, kind: WaitKind }, yield. WaitKind picks finalize behavior: Run(target) returns its value, Gather returns the list of results, Timeout { deadline_ns, target } enforces a deadline. wake_waiting_outers (gated by waiting_for_children_count) drops terminal children, splices the result into the outer’s saved stack placeholder, marks the outer Ready, or raise_into_outer injects the exception.
Coroutines carry their own exception_frames (7th tuple field). resume_coroutine denormalises stored depths (relative to saved_stack_len / saved_iter_len) on entry, pushes them onto the live exception_stack, renormalises on yield-save; dispatch.rs::exec honors pending_exec_exc_base so handler search includes restored frames. Net: try/except survives yields, try: run(coro) except E: catches a child’s raise across multiple run_resume cycles. SyncFrame.exception_delta does the same for sync-helper try blocks spanning a yield.
with invokes __enter__ / __exit__(exc_type, exc_val, traceback); truthy __exit__ return suppresses. async with reuses sync __enter__ / __exit__ (no async dunders).
References
- Aho, Sethi & Ullman. Compilers: Principles, Techniques and Tools (1986). LUT-based lexer.
- Pratt. Top Down Operator Precedence (POPL 1973).
- Cytron et al. Efficiently Computing Static Single Assignment Form (TOPLAS 1991).
- Gudeman. Representing Type Information in Dynamically Typed Languages (1993). NaN-boxing.
- Deutsch & Schiffman. Efficient Implementation of the Smalltalk-80 System (POPL 1984). Inline caching.
- Ertl & Gregg. The Structure and Performance of Efficient Interpreters (JILP 2003). Threaded dispatch.
- Casey et al. Towards Superinstructions for Java Interpreters (SCOPES 2003). LoadAttr+Call fusion.
- Michie. Memo Functions and Machine Learning (Nature 1968). Pure-function memoization.
- McCarthy. Recursive Functions of Symbolic Expressions (CACM 1960). Mark-sweep GC.
- Backus. Can Programming Be Liberated from the von Neumann Style? (CACM 1978). Function-level paradigm.