Instruction Set
Complete reference for the SBBE instruction set, with stack semantics, encoding details, and backend mapping notes.
Note: SBBE is a work in progress and all instructions are subject to change. Exact instruction ordinals (opcode numbers) are omitted here for the time being.
Instruction encoding
Every instruction is 12 bytes:
- 8-bit opcode identifying the operation
- 24-bit argument whose meaning depends on the opcode (type, index, immediate value, packed flags, etc.)
- 32-bit line number for source mapping (0 if absent)
- 32-bit column number for source mapping (0 if absent)
This fixed-size encoding makes the IR cache-friendly and trivial to index: the Nth instruction of a function is always at byte offset N * 12.
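As a sketch of this layout in Python (the field widths come from the list above; the little-endian byte order and the exact packing of opcode and argument into one word are assumptions, not part of the spec):

```python
import struct

def encode(opcode: int, arg: int, line: int, col: int) -> bytes:
    # 8-bit opcode and 24-bit argument share one 32-bit word (assumed layout),
    # followed by 32-bit line and column numbers. 4 + 4 + 4 = 12 bytes total.
    assert 0 <= opcode < (1 << 8) and 0 <= arg < (1 << 24)
    return struct.pack("<III", opcode | (arg << 8), line, col)

def decode(insn: bytes):
    word, line, col = struct.unpack("<III", insn)
    return word & 0xFF, word >> 8, line, col

raw = encode(0x12, 42, 7, 3)
assert len(raw) == 12                     # fixed 12-byte encoding
assert decode(raw) == (0x12, 42, 7, 3)    # lossless round trip
```

Because every instruction is exactly 12 bytes, random access needs no decoding pass: instruction N lives at `raw[N * 12 : (N + 1) * 12]`.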
How the stack works
SBBE is a stack machine. Every value produced by an instruction is pushed onto an implicit operand stack. Every value consumed by an instruction is popped from it. There are no named registers or SSA variables at the IR level.
For example, to compute (3 + 4) * 2:
ldi 3 // stack: [3]
ldi 4 // stack: [3, 4]
add.s i32 // stack: [7] — pops 4 and 3, pushes 7
ldi 2 // stack: [7, 2]
mul.s i32 // stack: [14] — pops 2 and 7, pushes 14
Notice that operands are pushed in the order they appear in the expression, but binary operations pop the right operand first (b = pop(); a = pop()). This matches natural evaluation order: push the left side, push the right side, then combine them.
When a backend lowers this to native code, each stack slot becomes a register assignment or a spill to the native stack. The stack abstraction means frontends never need to think about register allocation.
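The evaluation above can be mimicked with a minimal Python interpreter over a plain list (a sketch of the stack semantics only, not of any real SBBE tooling; integer wrapping is ignored here):

```python
def run(program):
    """Interpret a tiny subset: ldi, add, mul. Each step mirrors the trace above."""
    stack = []
    for op, *args in program:
        if op == "ldi":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()   # right operand popped first
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack

# (3 + 4) * 2
assert run([("ldi", 3), ("ldi", 4), ("add",), ("ldi", 2), ("mul",)]) == [14]
```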
Special instructions
nop (no operation)
nop
Does nothing. Used as a placeholder or for alignment. Optimization passes may remove nop instructions freely.
hold (non-removable nop)
hold // preserved through optimization
A macro mnemonic that encodes as nop with arg = 1. Optimization passes treat it as non-removable. Useful for cycle timing in embedded targets or for maintaining instruction alignment.
Loading values
There are two ways to push a constant value onto the stack: ldi for small integers and ldc for everything else.
ldi (load immediate)
ldi 42 // push(42)
ldi -1 // push(-1)
ldi 0 // push(0)
Pushes a small integer onto the stack as i32. The value is encoded directly in the 24-bit argument field as a signed integer (range: -8,388,608 to 8,388,607), so no constant table lookup is needed. This is the fastest way to load a constant since the value travels with the instruction.
On x86, this maps to mov reg, imm32. On ARM, small values use mov or movz; larger 24-bit values may require a movw/movt pair.
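The 24-bit signed encoding can be sketched in Python (a model of the argument-field semantics described above, not the assembler's actual code):

```python
def encode_imm24(value: int) -> int:
    # Store a signed value in the 24-bit argument field (two's complement).
    assert -(1 << 23) <= value < (1 << 23), "out of ldi range"
    return value & 0xFFFFFF

def decode_imm24(field: int) -> int:
    # Sign-extend the 24-bit field back to a full integer.
    return field - (1 << 24) if field & (1 << 23) else field

assert decode_imm24(encode_imm24(-1)) == -1
assert decode_imm24(encode_imm24(8_388_607)) == 8_388_607
assert encode_imm24(-1) == 0xFFFFFF   # -1 occupies all 24 bits
```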
ldc (load constant)
ldc f64 3.14 // push(3.14)
ldc i32 1000000 // push(1000000)
ldc f32 -0.5 // push(-0.5)
Pushes a typed value from the constant pool. Used for float values (which cannot be encoded as immediates) and for integers outside the 24-bit range. The assembler is smart: if you write ldc i32 42, it will silently emit ldi 42 instead since the value fits in an immediate.
On x86, float constants are typically loaded from a .rodata section via movsd xmm0, [rip + constant]. On ARM, ldr d0, [pc, #offset].
Stack manipulation
These instructions rearrange values on the stack without performing any computation. They are the stack machine equivalent of register moves in a register machine.
drop (discard top of stack)
drop // pop()
Pops and discards the top value. Common after calling a function whose return value is unused.
TODO: allow drop <n> to drop multiple values at once (e.g., drop 3 drops the top three values).
dup (duplicate top of stack)
dup // a = pop(); push(a); push(a)
Duplicates the top value. Useful when a value is consumed by one operation but still needed afterward. For example, to compute x * x from a single x on the stack: dup then mul.s i32.
TODO: allow dup <n> to duplicate values deeper in the stack (e.g., dup 2 duplicates the second value from the top).
swap (swap top two values)
swap // b = pop(); a = pop(); push(b); push(a)
Swaps the top two stack values. Useful when operands arrive in the wrong order for a non-commutative operation. For example, to compute a - b when b was pushed before a: swap then sub.s i32.
On register machines, this is typically free (the backend just reassigns register names). On the VM, it is a physical swap of two stack slots.
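The dup and swap patterns above can be sketched on a plain Python list standing in for the operand stack (a model of the semantics, not real tooling):

```python
def dup(stack):
    stack.append(stack[-1])

def swap(stack):
    b, a = stack.pop(), stack.pop()   # b was on top
    stack.extend([b, a])              # push b first, then a: a is now on top... no, b
    # after extend, the former top (b) is buried and a... see asserts below

# x * x from a single x on the stack: dup, then multiply
s = [9]
dup(s)
b, a = s.pop(), s.pop()
s.append(a * b)
assert s == [81]

# a - b when b was pushed before a: swap, then subtract
s = [3, 10]        # b = 3 pushed first, then a = 10 (a is on top)
swap(s)            # now [10, 3]: b is on top, in position for sub
b2, a2 = s.pop(), s.pop()
assert a2 - b2 == 7
```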
sel (branchless conditional select)
sel // c = pop(); b = pop(); a = pop(); push(c ? a : b)
Pops a condition and two values. Pushes the first value if the condition is nonzero, the second otherwise. This is always branchless, making it ideal for small conditional expressions where a branch would be more expensive than computing both sides.
On x86, this maps directly to cmov. On ARM, it maps to csel. These are single-cycle instructions on modern hardware, making sel significantly cheaper than a jmp.if/jmp pair for simple cases like min(a, b) or clamping.
Variable access
Variables (both locals and globals) are accessed by name using ld and str. The parser resolves $name to the correct scope automatically: locals are checked first, then globals (locals shadow globals with the same name). For unnamed function parameters, use ldl with a numeric index.
In a register machine, local variables live in registers. In SBBE’s stack machine, locals are a separate indexed storage area. You push a local’s value onto the stack with ld/ldl, operate on it, then store the result back with str/strl. This is more explicit than a register machine but has the same effect.
ld (load variable by name)
ld $x // push(vars[$x])
ld $counter // push(vars[$counter])
Loads a variable by name. The parser resolves $name to either a local or global and emits the correct internal opcode. This means ld $x might emit a local-get or global-get depending on context.
str (store variable by name)
str $x // vars[$x] = pop()
str $counter // vars[$counter] = pop()
Pops a value and stores it into a variable by name.
tee (store without popping)
tee $x // vars[$x] = peek()
Stores the top of stack into a local variable without popping it. This is equivalent to dup followed by str, but in a single instruction. It is useful for patterns like “store and continue using the value.” Only valid for locals; using a global is a parse error.
ldl / strl (access local by numeric index)
ldl 0 // push(locals[0]) — first parameter
ldl 1 // push(locals[1]) — second parameter
strl 2 // locals[2] = pop()
Load or store a local variable by numeric index. Parameters occupy indices 0 through param_count - 1; declared locals follow. Most compilers emit indexed access for all locals and let the assembler or debugging tools attach names. Named access (ld $x) is primarily for hand-written IR.
Integer arithmetic
All integer arithmetic instructions take a type suffix (i8, i16, i32, i64) that specifies the operand width. Values are stored on the stack at their natural width, so an i8 operation really does operate on 8 bits, wrapping modulo 256.
Signed and unsigned variants
SBBE integers are untyped bit patterns. Whether a bit pattern is “signed” or “unsigned” is determined by the instruction, not the value. Instructions that behave differently for signed vs. unsigned inputs come in .s and .u variants.
For addition, subtraction, and multiplication, the bit-level result is identical for signed and unsigned two’s complement. However, .s and .u suffixes are still required because they document the frontend’s intent and will be used by future optimization flags (nsw/nuw). A compiler targeting SBBE should emit .s when the source language treats the values as signed and .u when unsigned.
Wrapping behavior
All arithmetic wraps modulo 2^N, where N is the bit width of the type. This means overflow does not trap or produce undefined behavior at the IR level. For example, add.s i8 of 127 + 1 yields -128 (0x80). If your source language requires overflow to be undefined (like C’s signed overflow), the frontend should emit the appropriate nsw flags once they are supported.
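The wrapping rule can be modeled directly in Python (a sketch of the semantics stated above):

```python
def wrap(value: int, bits: int) -> int:
    """Reduce modulo 2^bits, then reinterpret as signed two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def add_i8(a: int, b: int) -> int:
    return wrap(a + b, 8)

assert add_i8(127, 1) == -128   # 127 + 1 wraps to 0x80
assert add_i8(-1, -1) == -2     # ordinary arithmetic within range
```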
add.s / add.u / sub.s / sub.u / mul.s / mul.u
add.s i32 // b = pop(); a = pop(); push((a + b) mod 2^32)
add.u i32 // (same result, unsigned intent)
sub.s i32 // b = pop(); a = pop(); push((a - b) mod 2^32)
mul.s i32 // b = pop(); a = pop(); push((a * b) mod 2^32)
On x86, all of these map to add, sub, imul regardless of the .s/.u suffix. The suffix matters when the LLVM backend emits nsw/nuw flags, or when a future trap-on-overflow mode is enabled.
div.s / div.u
div.s i32 // b = pop(); a = pop(); push(a / b) signed
div.u i32 // b = pop(); a = pop(); push(a / b) unsigned
Integer division. This is where signedness produces genuinely different results. div.s i8 treats 0xFF as -1; div.u i8 treats it as 255. The signed variant truncates toward zero (matching C99); the unsigned variant produces the floor of the exact quotient. Division by zero is undefined behavior.
On x86, div.s maps to idiv and div.u maps to div. On ARM, div.s maps to sdiv and div.u maps to udiv.
rem.s / rem.u
rem.s i32 // b = pop(); a = pop(); push(a % b) signed
rem.u i32 // b = pop(); a = pop(); push(a % b) unsigned
Integer remainder. The sign of rem.s follows the dividend (C99 behavior): rem.s(-7, 3) yields -1, not 2. For rem.u, both operands are unsigned and the result is always non-negative.
On x86, the remainder is produced as a side-effect of idiv/div (it lands in edx/rdx). An optimizer aware of this can fuse a div + rem pair into a single division.
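The truncating semantics differ from Python's floor-based `//` and `%`, which makes them worth modeling explicitly (a sketch of the div.s/rem.s rules stated above):

```python
def div_s(a: int, b: int) -> int:
    # Truncate toward zero (C99 semantics); Python's // floors instead.
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

def rem_s(a: int, b: int) -> int:
    # Sign follows the dividend, so a == div_s(a, b) * b + rem_s(a, b) holds.
    return a - div_s(a, b) * b

assert div_s(-7, 3) == -2    # Python's -7 // 3 would give -3
assert rem_s(-7, 3) == -1    # Python's -7 % 3 would give 2
```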
neg
neg i32 // a = pop(); push((-a) mod 2^32)
Two’s complement negation, equivalent to sub(0, x) or ~x + 1. The minimum signed value wraps to itself: neg i8 of -128 (0x80) produces -128. On x86, this is neg.
Float arithmetic
Float instructions are prefixed with f and operate on f32 or f64 values following IEEE 754 semantics. Unlike integer arithmetic, there is no signed/unsigned distinction for floats since the sign is part of the IEEE 754 encoding.
fadd / fsub / fmul / fdiv
fadd f64 // b = pop(); a = pop(); push(a + b)
fsub f64 // b = pop(); a = pop(); push(a - b)
fmul f64 // b = pop(); a = pop(); push(a * b)
fdiv f64 // b = pop(); a = pop(); push(a / b)
On x86 with SSE2, these map to addsd/subsd/mulsd/divsd for f64 and addss/subss/mulss/divss for f32. On ARM, fadd/fsub/fmul/fdiv (NEON/VFP).
fneg / fabs
fneg f64 // a = pop(); push(-a)
fabs f64 // a = pop(); push(|a|)
fneg flips the sign bit. fabs clears the sign bit. Both are typically single-cycle operations implemented by toggling or masking the MSB of the float encoding, not by arithmetic.
fsqrt / fceil / fflr
fsqrt f64 // a = pop(); push(sqrt(a))
fceil f64 // a = pop(); push(ceil(a))
fflr f64 // a = pop(); push(floor(a))
Unary float math operations. fsqrt maps to sqrtsd/sqrtss on x86 (hardware instruction). fceil and fflr map to roundsd/roundss with the appropriate rounding mode on x86 with SSE4.1, or to frintp/frintm on ARM.
fmin / fmax
fmin f64 // b = pop(); a = pop(); push(min(a, b))
fmax f64 // b = pop(); a = pop(); push(max(a, b))
IEEE 754 minimum and maximum. On x86, minsd/maxsd. On ARM, fmin/fmax. NaN propagates: if either operand is NaN, the result is NaN (the IEEE 754-2019 minimum/maximum semantics; note that the 754-2008 minNum/maxNum operations instead return the non-NaN operand).
Bitwise operations
All bitwise instructions take a type suffix and operate on integer types. These instructions are sign-agnostic since they operate on raw bit patterns.
and / or / xor
and i32 // b = pop(); a = pop(); push(a & b)
or i32 // b = pop(); a = pop(); push(a | b)
xor i32 // b = pop(); a = pop(); push(a ^ b)
not
not i32 // a = pop(); push(~a)
Bitwise complement (one’s complement). On x86, this is not. Note that not i32 followed by ldi 1 and add.u i32 is equivalent to neg i32 (two’s complement).
shl / shr.s / shr.u
shl i32 // b = pop(); a = pop(); push(a << b)
shr.s i32 // b = pop(); a = pop(); push(a >> b) arithmetic
shr.u i32 // b = pop(); a = pop(); push(a >> b) logical
shl shifts left, filling vacated low bits with zero.
shr.s is an arithmetic right shift: the sign bit is replicated into the vacated high bits. This preserves the sign of signed values. For example, shr.s i8 of -4 (0xFC) by 1 yields -2 (0xFE). On x86, this is sar. On ARM, asr.
shr.u is a logical right shift: vacated high bits are filled with zero. This treats the value as unsigned. For example, shr.u i8 of 0xFC by 1 yields 126 (0x7E). On x86, this is shr. On ARM, lsr.
The shift amount is masked to the type width minus one (e.g., b & 31 for i32), matching the behavior of x86 shl/shr/sar instructions.
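These three shift behaviors, including the amount masking, can be modeled in Python (a sketch of the stated semantics; Python's `>>` on a sign-extended int already replicates the sign bit):

```python
def shr_u(a: int, b: int, bits: int = 32) -> int:
    # Logical right shift: operate on the raw unsigned bit pattern.
    mask = (1 << bits) - 1
    return ((a & mask) >> (b & (bits - 1))) & mask

def shr_s(a: int, b: int, bits: int = 32) -> int:
    # Arithmetic right shift: sign-extend first, shift, then re-wrap.
    b &= bits - 1                     # shift amount masked to width - 1
    a &= (1 << bits) - 1
    if a & (1 << (bits - 1)):
        a -= 1 << bits
    return (a >> b) & ((1 << bits) - 1)

assert shr_s(0xFC, 1, bits=8) == 0xFE   # -4 >> 1 == -2 arithmetically
assert shr_u(0xFC, 1, bits=8) == 0x7E   # 252 >> 1 == 126 logically
assert shr_u(1, 32) == 1                # amount 32 masks to 0 for i32
```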
rotl / rotr
rotl i32 // b = pop(); a = pop(); push(rotate_left(a, b))
rotr i32 // b = pop(); a = pop(); push(rotate_right(a, b))
Bitwise rotation. Unlike shifts, no bits are lost; bits shifted out of one end re-enter from the other. On x86, rol/ror. These are useful for hash functions and cryptographic algorithms.
clz / ctz / popcnt
clz i32 // a = pop(); push(count_leading_zeros(a))
ctz i32 // a = pop(); push(count_trailing_zeros(a))
popcnt i32 // a = pop(); push(popcount(a))
Bit-counting operations. clz returns the number of leading zero bits (e.g., clz(0x00F00000) for i32 = 8). ctz counts trailing zeros. popcnt counts the number of set bits.
On x86, these map to lzcnt, tzcnt, and popcnt (requires BMI1/POPCNT extensions; older CPUs use bsr/bsf fallbacks). On ARM, clz is a base instruction; ctz and popcnt require rbit + clz or NEON sequences.
For clz and ctz, the result when the input is zero is equal to the bit width of the type (e.g., 32 for i32).
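The bit-counting semantics, including the zero-input rule, can be sketched in Python:

```python
def clz(a: int, bits: int = 32) -> int:
    a &= (1 << bits) - 1
    return bits - a.bit_length()          # bit_length() of 0 is 0, so clz(0) == bits

def ctz(a: int, bits: int = 32) -> int:
    a &= (1 << bits) - 1
    if a == 0:
        return bits
    return (a & -a).bit_length() - 1      # isolate lowest set bit, take its index

def popcnt(a: int, bits: int = 32) -> int:
    return bin(a & ((1 << bits) - 1)).count("1")

assert clz(0x00F00000) == 8
assert ctz(0x00F00000) == 20
assert popcnt(0x00F00000) == 4
assert clz(0) == 32 and ctz(0) == 32      # zero input yields the bit width
```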
Integer comparison
All comparison instructions push an i32 result (0 or 1) regardless of the operand type. This boolean-as-integer convention means comparison results can be directly consumed by jmp.if, sel, or integer arithmetic (e.g., accumulating a count of true conditions).
eq, ne, and eqz are sign-agnostic since equality depends only on the bit pattern.
The ordered comparisons (lt, gt, le, ge) come in .s and .u variants. The .s variant interprets operands as signed two’s complement: 0xFF as i8 is -1 and compares less than 0. The .u variant interprets them as unsigned: 0xFF as i8 is 255 and compares greater than 0.
eq / ne / eqz
eq i32 // b = pop(); a = pop(); push(a == b)
ne i32 // b = pop(); a = pop(); push(a != b)
eqz i32 // a = pop(); push(a == 0)
eqz is a shorthand for comparing against zero. It is slightly more efficient than ldi 0 followed by eq since it avoids pushing the zero constant.
lt.s / lt.u / gt.s / gt.u
lt.s i32 // b = pop(); a = pop(); push(a < b) signed
lt.u i32 // b = pop(); a = pop(); push(a < b) unsigned
gt.s i32 // b = pop(); a = pop(); push(a > b) signed
gt.u i32 // b = pop(); a = pop(); push(a > b) unsigned
On x86, signed comparisons use cmp + setl/setg (setl tests SF != OF; setg additionally requires ZF = 0). Unsigned comparisons use cmp + setb/seta (setb tests CF; seta requires CF = 0 and ZF = 0).
le.s / le.u / ge.s / ge.u
le.s i32 // b = pop(); a = pop(); push(a <= b) signed
le.u i32 // b = pop(); a = pop(); push(a <= b) unsigned
ge.s i32 // b = pop(); a = pop(); push(a >= b) signed
ge.u i32 // b = pop(); a = pop(); push(a >= b) unsigned
Float comparison
Float comparisons follow IEEE 754 semantics. All push an i32 result (0 or 1). There are no signed/unsigned variants since IEEE 754 floats have a single well-defined ordering; the one complication is NaN, which is unordered with respect to every value, including itself.
feq / fne / flt / fgt / fle / fge
feq f64 // b = pop(); a = pop(); push(a == b)
fne f64 // b = pop(); a = pop(); push(a != b)
flt f64 // b = pop(); a = pop(); push(a < b)
fgt f64 // b = pop(); a = pop(); push(a > b)
fle f64 // b = pop(); a = pop(); push(a <= b)
fge f64 // b = pop(); a = pop(); push(a >= b)
Note that comparisons involving NaN always return 0 (false), except fne which returns 1 (true) when either operand is NaN. This matches the IEEE 754 compareQuiet semantics. On x86, these map to ucomisd + setX. On ARM, fcmp + cset.
Type conversions
Conversion instructions take two type operands: the source type followed by the destination type.
ext.s / ext.u (integer widening)
ext.s i16 i32 // a = pop(); push(sign_extend(a))
ext.u i8 i32 // a = pop(); push(zero_extend(a))
Widen a narrower integer to a wider one. ext.s sign-extends: the sign bit is replicated into the new high bits, preserving the signed value. For example, ext.s i8 i32 of 0xFF yields 0xFFFFFFFF (-1 stays -1). ext.u zero-extends: high bits are filled with zero. For example, ext.u i8 i32 of 0xFF yields 0x000000FF (255).
On x86, ext.s maps to movsx/movsxd and ext.u maps to movzx. On ARM, sxtb/sxth and uxtb/uxth.
trunc (integer narrowing)
trunc i32 i8 // a = pop(); push(truncate(a))
Truncate a wider integer to a narrower one. High bits are simply discarded. For example, trunc i32 i8 of 0x12345678 yields 0x78. This is a no-op in most backends since the value is already in a register and the backend just ignores the high bits on subsequent operations.
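The widening and narrowing rules can be modeled in Python (a sketch of the ext.s/ext.u/trunc semantics described above):

```python
def ext_s(a: int, from_bits: int, to_bits: int) -> int:
    # Sign-extend: replicate the sign bit into the new high bits.
    a &= (1 << from_bits) - 1
    if a & (1 << (from_bits - 1)):
        a |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return a

def ext_u(a: int, from_bits: int, to_bits: int) -> int:
    # Zero-extend: just mask to the source width.
    return a & ((1 << from_bits) - 1)

def trunc(a: int, to_bits: int) -> int:
    # Narrow: discard the high bits.
    return a & ((1 << to_bits) - 1)

assert ext_s(0xFF, 8, 32) == 0xFFFFFFFF   # -1 stays -1
assert ext_u(0xFF, 8, 32) == 0x000000FF   # 255
assert trunc(0x12345678, 8) == 0x78
```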
f2i.s / f2i.u (float to integer)
f2i.s f64 i32 // a = pop(); push((i32)a) signed
f2i.u f64 i32 // a = pop(); push((u32)a) unsigned
Convert a float to an integer, truncating toward zero. f2i.s treats the result as signed: converting 200.0 to i8 overflows signed range and is undefined. f2i.u treats the result as unsigned: 200.0 to i8 yields 200. Converting NaN or infinity is undefined behavior.
On x86, cvttsd2si/cvttss2si. On ARM, fcvtzs/fcvtzu.
i2f.s / i2f.u (integer to float)
i2f.s i32 f64 // a = pop(); push((f64)signed(a))
i2f.u i32 f64 // a = pop(); push((f64)unsigned(a))
Convert an integer to a float. i2f.s interprets the input as signed: 0xFF as i8 yields -1.0. i2f.u interprets it as unsigned: 0xFF as i8 yields 255.0. Large integers may lose precision since f32 has only 24 bits of mantissa and f64 has 53.
On x86, cvtsi2sd/cvtsi2ss. On ARM, scvtf/ucvtf.
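Both conversion directions can be sketched in Python (a model of the semantics above; the range assertion mirrors the "out of range is undefined" rule rather than implementing any particular trap behavior):

```python
import math

def f2i_s(x: float, bits: int = 32) -> int:
    # Truncate toward zero; out-of-range, NaN, and infinity are undefined in SBBE.
    v = math.trunc(x)
    assert -(1 << (bits - 1)) <= v < (1 << (bits - 1)), "undefined: out of range"
    return v

def i2f_s(a: int, bits: int = 8) -> float:
    # Interpret the bit pattern as signed two's complement, then convert.
    a &= (1 << bits) - 1
    if a & (1 << (bits - 1)):
        a -= 1 << bits
    return float(a)

def i2f_u(a: int, bits: int = 8) -> float:
    # Interpret the bit pattern as unsigned.
    return float(a & ((1 << bits) - 1))

assert f2i_s(-2.9) == -2      # truncation toward zero, not floor
assert i2f_s(0xFF) == -1.0    # signed interpretation of 0xFF as i8
assert i2f_u(0xFF) == 255.0   # unsigned interpretation
```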
fconv (float width conversion)
fconv f32 f64 // a = pop(); push(f64(a)) — widen, lossless
fconv f64 f32 // a = pop(); push(f32(a)) — narrow, may lose precision
On x86, cvtss2sd (widen) and cvtsd2ss (narrow). On ARM, fcvt.
cast (bit reinterpretation)
cast i32 f32 // a = pop(); push(reinterpret_bits(a))
cast f64 i64 // a = pop(); push(reinterpret_bits(a))
Reinterprets the raw bits as a different type with no arithmetic conversion. The source and destination types must have the same size (e.g., i32/f32, i64/f64). Because no conversion occurs, this is cheap on all backends: often zero-cost, at worst a register-to-register move.
Useful for float bit manipulation (e.g., extracting the exponent of a float) and for serialization. On x86, this maps to movd/movq between integer and SSE registers when needed, but is often free if the value stays in a general-purpose register.
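The reinterpretation (and the exponent-extraction use case mentioned above) can be demonstrated with Python's struct module:

```python
import struct

def cast_i32_to_f32(bits: int) -> float:
    # Reinterpret the same 4 bytes; no arithmetic conversion happens.
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def cast_f32_to_i32(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

assert cast_i32_to_f32(0x3F800000) == 1.0      # IEEE 754 bit pattern of 1.0f
assert cast_f32_to_i32(1.0) == 0x3F800000      # and back

# Extract the biased exponent field of a float via bit manipulation:
assert (cast_f32_to_i32(1.0) >> 23) & 0xFF == 127
```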
Control flow
SBBE uses unstructured control flow with labeled blocks and explicit jumps, like traditional assembly. There is no structured block/loop/if nesting as in WebAssembly. A function body is a flat list of labeled blocks, each containing a straight-line sequence of instructions.
This design is simpler for frontends to generate and maps directly to the basic-block structure that all backends already work with. The trade-off is that structured control flow (if needed for WASM output) must be reconstructed from the flat block graph.
jmp (unconditional jump)
jmp loop_body // goto loop_body
Transfers control to the named block. No stack effect.
jmp.if (conditional jump)
jmp.if done // cond = pop(); if (cond) goto done
Pops an i32 condition. If nonzero, jumps to the named block. Otherwise, falls through to the next instruction. This is the primary branching mechanism. On x86, this maps to test + jnz. On ARM, cbnz or tst + b.ne.
jmpt (branch table)
jmpt 0 // idx = pop(); goto table[idx]
Pops an i32 index and jumps via a branch table. This is the stack machine equivalent of a computed goto or switch statement. On x86, this typically compiles to an indirect jump through a table: jmp [table + idx*8].
Note: Branch table data structures are not yet implemented in the IR.
ret (return from function)
ret // return pop() or return void
Returns from the current function. If the function has a non-void return type, pops the return value from the stack. On x86, this maps to ret (after placing the return value in rax/xmm0).
Function calls
call (direct call)
call $add // push args, then call
Calls a function by name. Arguments must be pushed onto the stack in declaration order before the call. The return value (if any) is left on the stack. On native backends, this follows the platform’s calling convention (System V for x86_64, AAPCS for ARM64).
calli (indirect call)
calli 0 // target = pop(); call funcs[target]
Pops a function index from the stack and calls that function. Arguments must already be on the stack below the index. This is used for function pointers, vtable dispatch, and closures. The argument is reserved for future signature validation.
Memory
SBBE uses a flat linear memory model. All memory is a single contiguous byte array addressed by ptr values (which resolve to the target’s native pointer width: 32 bits on 32-bit targets, 64 bits on 64-bit targets). There is no distinction between stack, heap, or global memory at the IR level; the frontend manages memory layout.
This is similar to WebAssembly’s linear memory. On native backends, the linear memory becomes a pointer to a heap-allocated buffer, and all ldm/stm instructions compile to loads and stores through that base pointer.
ldm / stm (typed memory access)
ldm i32 // addr = pop(); push(mem[addr])
ldm i32 align=4 // aligned load
stm i32 // val = pop(); addr = pop(); mem[addr] = val
stm i64 align=8 // aligned store
Loads pop an address and push a value. Stores pop a value then an address (note the order: value first, then address). The type determines the width of the access (1, 2, 4, or 8 bytes).
The optional align=N hint tells backends the minimum alignment of the address. ARM requires aligned access for most instructions; x86 tolerates unaligned access but aligned access enables SIMD vectorization. When omitted, backends may assume natural alignment.
Narrow loads and stores
ldm.s8 // addr = pop(); push(sign_extend(mem8[addr]))
ldm.u8 // addr = pop(); push(zero_extend(mem8[addr]))
ldm.s16 // addr = pop(); push(sign_extend(mem16[addr]))
ldm.u16 // addr = pop(); push(zero_extend(mem16[addr]))
stm8 // val = pop(); addr = pop(); mem8[addr] = trunc(val)
stm16 // val = pop(); addr = pop(); mem16[addr] = trunc(val)
Narrow loads read 8 or 16 bits from memory and widen the result to i32. The .s variant sign-extends; .u zero-extends. This is the typical pattern for loading char or short values in C.
Narrow stores truncate a wider value before writing. There is no signed/unsigned distinction for stores since truncation is the same operation regardless.
On x86, ldm.s8 maps to movsx from a byte, ldm.u8 maps to movzx. On ARM, ldrsb/ldrb.
System calls
syscall0 // num = pop(); push(syscall(num))
syscall1 // a1 = pop(); num = pop(); push(syscall(num, a1))
syscall2 // a2 = pop(); a1 = pop(); num = pop(); push(syscall(num, a1, a2))
syscall3 // a3..a1 = pop(); num = pop(); push(syscall(num, a1, a2, a3))
syscall4 // a4..a1 = pop(); ...
syscall5 // a5..a1 = pop(); ...
syscall6 // a6..a1 = pop(); ...
Invoke a host system call. The syscall number and arguments are popped from the stack (topmost = last argument, then earlier arguments, then the syscall number at the bottom). The result is pushed.
The argument count is baked into the opcode name so the stack effect is always statically known. On Linux x86_64, syscall numbers and arguments map directly to rax, rdi, rsi, rdx, r10, r8, r9. On ARM64, x8 for the number and x0-x5 for arguments.
Atomic operations
Atomic instructions provide thread-safe memory access. Each takes a type suffix and a memory ordering that specifies how the operation synchronizes with other threads.
Memory orderings (from weakest to strongest):
- relaxed — no ordering constraints; only atomicity is guaranteed
- acquire — no reads/writes after this can be reordered before it
- release — no reads/writes before this can be reordered after it
- acq_rel — both acquire and release
- seq_cst — full sequential consistency (strongest, most expensive)
When in doubt, use seq_cst. It is the most conservative ordering and the easiest to reason about. Use weaker orderings only when profiling shows the cost of seq_cst is a bottleneck.
ald / ast (atomic load / store)
ald i32 seq_cst // addr = pop(); push(atomic_load(mem[addr]))
ast i32 release // val = pop(); addr = pop(); atomic_store(mem[addr], val)
On x86, seq_cst loads and stores use mov with an mfence or xchg. On ARM, ldar/stlr for acquire/release, ldaxr/stlxr sequences for stronger orderings.
armw.* (atomic read-modify-write)
armw.add i32 seq_cst // val = pop(); addr = pop(); old = mem[addr]; mem[addr] += val; push(old)
armw.sub i32 seq_cst // ...mem[addr] -= val...
armw.and i32 seq_cst // ...mem[addr] &= val...
armw.or i32 seq_cst // ...mem[addr] |= val...
armw.xor i32 seq_cst // ...mem[addr] ^= val...
armw.xchg i32 seq_cst // ...mem[addr] = val...
Atomically read a value, apply the operation, write back, and push the old value (before the modification). This is the building block for lock-free data structures.
On x86, lock add, lock xadd, lock cmpxchg, etc. On ARM, ldaxr/stlxr loops or LSE atomics (ldadd, swp, etc.).
acas (atomic compare-and-swap)
acas i32 seq_cst // new = pop(); exp = pop(); addr = pop(); push(success ? 1 : 0)
The fundamental lock-free primitive. Atomically compares mem[addr] to expected. If equal, writes new and pushes 1 (success). If not equal, pushes 0 (failure) and the memory is unchanged.
On x86, this is lock cmpxchg. On ARM, ldaxr/stlxr loop or cas (ARMv8.1 LSE).
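The acas retry loop at the heart of lock-free code can be sketched in Python (a toy model: the lock here merely stands in for the hardware's atomicity guarantee, and the Cell class is hypothetical, not part of SBBE):

```python
import threading

class Cell:
    """A toy memory cell with an acas-style compare-and-swap."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def cas(self, expected, new) -> int:
        with self._lock:
            if self.value == expected:
                self.value = new
                return 1                # success
            return 0                    # failure, memory unchanged

def atomic_add(cell: Cell, delta: int) -> int:
    # The classic retry loop: read, compute, attempt to publish, repeat on failure.
    while True:
        old = cell.value
        if cell.cas(old, old + delta):
            return old                  # armw-style: value before modification

c = Cell(41)
assert atomic_add(c, 1) == 41
assert c.value == 42
```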
fence (memory fence)
fence seq_cst // memory barrier, no stack effect
Prevents reordering of memory operations across the fence. On x86, mfence. On ARM, dmb ish.
Stack allocation
salloc
salloc // size = pop(); push(alloc(size))
Pops a size (as a ptr-width integer) and allocates that many bytes on the current function’s stack frame. Pushes a ptr to the allocated region. The allocation is automatically freed when the function returns.
This is the stack machine equivalent of C’s alloca(). On native backends, it compiles to a stack pointer adjustment (sub rsp, size on x86). On the VM, it bumps a watermark in the linear memory. The returned pointer is valid only for the duration of the current function call.
SIMD / Vector operations
Status: SIMD opcodes are not yet implemented. This section documents the planned design so that the opcode space and text format can be reserved.
SBBE’s SIMD support operates on the v128 type, a 128-bit value that holds multiple lanes of a narrower scalar type. The v128 type itself carries no lane information; the lane descriptor on each instruction determines how the 128 bits are interpreted.
Lane descriptors
A lane descriptor specifies how a v128 is partitioned into lanes. It is not a type in the SBBE type system. It is an operand to the instruction, encoded in the argument field the same way scalar types are for non-SIMD instructions.
| Descriptor | Lanes | Lane width | Total |
|---|---|---|---|
| i8x16 | 16 | 8-bit int | 128 |
| i16x8 | 8 | 16-bit int | 128 |
| i32x4 | 4 | 32-bit int | 128 |
| i64x2 | 2 | 64-bit int | 128 |
| f32x4 | 4 | 32-bit float | 128 |
| f64x2 | 2 | 64-bit float | 128 |
On x86, v128 maps to an XMM register (SSE) or the lower 128 bits of a YMM register (AVX). On ARM, it maps to a NEON Q register.
Arithmetic
vadd i32x4 // b = pop(); a = pop(); push(a + b) per-lane
vsub i32x4 // b = pop(); a = pop(); push(a - b) per-lane
vmul i32x4 // b = pop(); a = pop(); push(a * b) per-lane
vneg i32x4 // a = pop(); push(-a) per-lane
Integer vector arithmetic wraps per-lane, just like scalar integer arithmetic. There are no .s/.u variants for vadd/vsub/vmul since the bit-level result is the same.
vfadd f32x4 // b = pop(); a = pop(); push(a + b) per-lane
vfsub f32x4 // b = pop(); a = pop(); push(a - b) per-lane
vfmul f32x4 // b = pop(); a = pop(); push(a * b) per-lane
vfdiv f32x4 // b = pop(); a = pop(); push(a / b) per-lane
vfsqrt f32x4 // a = pop(); push(sqrt(a)) per-lane
Float vector arithmetic follows IEEE 754 per-lane. On x86, these map directly to addps/mulps/divps (f32x4) or addpd/mulpd (f64x2). On ARM, fadd/fmul (NEON).
Bitwise
vand // b = pop(); a = pop(); push(a & b) full 128-bit
vor // b = pop(); a = pop(); push(a | b) full 128-bit
vxor // b = pop(); a = pop(); push(a ^ b) full 128-bit
vnot // a = pop(); push(~a) full 128-bit
Bitwise operations operate on the full 128-bit value regardless of lane interpretation, so they take no lane descriptor.
Shifts
vshl i32x4 // b = pop(); a = pop(); push(a << b) per-lane, b is scalar
vshr.s i32x4 // b = pop(); a = pop(); push(a >> b) per-lane, arithmetic
vshr.u i32x4 // b = pop(); a = pop(); push(a >> b) per-lane, logical
Each lane is shifted by the same scalar amount (popped from the stack as an i32). On x86, pslld/psrad/psrld. On ARM, vshl/vshr.
Comparison
veq i32x4 // b = pop(); a = pop(); push(mask) per-lane equality
vlt.s i32x4 // b = pop(); a = pop(); push(mask) per-lane signed less-than
vlt.u i32x4 // b = pop(); a = pop(); push(mask) per-lane unsigned less-than
vgt.s i32x4 // per-lane signed greater-than
vgt.u i32x4 // per-lane unsigned greater-than
Comparisons produce a v128 bitmask where each lane is all-ones (true) or all-zeros (false). This bitmask can be used with vand/vor for branchless lane selection, similar to how sel works for scalars.
Min / Max
vmin.s i32x4 // b = pop(); a = pop(); push(min(a, b)) per-lane signed
vmin.u i32x4 // per-lane unsigned
vmax.s i32x4 // per-lane signed
vmax.u i32x4 // per-lane unsigned
vfmin f32x4 // per-lane float min
vfmax f32x4 // per-lane float max
Splat, extract, replace
vsplat i32x4 // a = pop(); push(v128 with a in all lanes)
vext i32x4 // lane_idx in arg; a = pop(); push(scalar from lane)
vrep i32x4 // lane_idx in arg; val = pop(); vec = pop(); push(vec with lane replaced)
vsplat broadcasts a scalar to all lanes. This is the primary way to create a vector from a scalar value. On x86, vbroadcastss/vpbroadcastd. On ARM, dup.
vext extracts a single lane as a scalar value. vrep replaces a single lane within a vector.
Shuffle
vshuf i8x16 // mask = pop(); a = pop(); push(shuffle(a, mask))
Rearranges the bytes of a v128 according to a mask vector. Each byte of the mask selects a byte from the source (by index 0-15). This is the most general permutation operation. On x86, pshufb (SSSE3). On ARM, tbl.
Vector memory access
vldm // addr = pop(); push(mem128[addr])
vldm align=16 // aligned 128-bit load
vstm // val = pop(); addr = pop(); mem128[addr] = val
Load or store a full v128 from/to memory. Aligned access (align=16) enables the backend to use aligned instructions (movaps on x86, ld1 with alignment hint on ARM) which may be faster.
Vector conversions
vi2f.s i32x4 f32x4 // per-lane signed int to float
vi2f.u i32x4 f32x4 // per-lane unsigned int to float
vf2i.s f32x4 i32x4 // per-lane float to signed int
vf2i.u f32x4 i32x4 // per-lane float to unsigned int
vext.s i16x8 i32x4 // widen lower lanes, sign-extend
vext.u i16x8 i32x4 // widen lower lanes, zero-extend
vtrunc i32x4 i16x8 // narrow with truncation
Lane-wise type conversion. Widening conversions take the lower half of the lanes (e.g., lower 4 of 8 i16 lanes become 4 i32 lanes). Narrowing conversions pack results into the lower half.