Building architectures from scratch (MLA, MoE, auxiliary-loss-free load balancing), scaled progressively from 344k to 100M+ parameters. All weights open. All code open.
Each version scales the same architecture (same MLA, same MoE) with increasing capacity and capability.
Simulated demo: press Run to see example outputs from Axion1. Connect to real inference by pointing it at your own server.
Axion models require PyTorch, Transformers, and Safetensors. No GPU required.
Clone the model repository:
Load with AutoModelForCausalLM. The custom BPE tokenizer must be loaded separately.
Axion ships with a built-in Flask server and dark-themed HTML interface.
Expected: ~1,000 tok/s · ~330 s/epoch · ~115 min total on a Ryzen 5 5600G.
| Key | Axion1 | Description |
|---|---|---|
| d_model | 64 | Embedding dimension |
| n_layers | 4 | Transformer blocks |
| kv_lora_rank | 8 | KV compression rank (MLA) |
| n_routed_experts | 4 | Expert pool size (MoE) |
| n_active_experts | 2 | Experts activated per token |
| vocab_size | 1024 | BPE vocabulary size |
| max_seq_len | 512 | Maximum context length |
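The table above maps directly onto a plain configuration object. A minimal sketch (the dict below is illustrative and mirrors the table keys; it is not the repo's actual config class):

```python
# Illustrative Axion1 configuration mirroring the table above.
# This is a sketch for reference, not the repository's real config object.
axion1_config = {
    "d_model": 64,           # embedding dimension
    "n_layers": 4,           # transformer blocks
    "kv_lora_rank": 8,       # KV compression rank (MLA)
    "n_routed_experts": 4,   # expert pool size (MoE)
    "n_active_experts": 2,   # experts activated per token
    "vocab_size": 1024,      # BPE vocabulary size
    "max_seq_len": 512,      # maximum context length
}
```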
MLA compresses the keys and values into a shared latent vector of rank kv_lora_rank, then expands them back. This reduces the KV cache from O(n·d) to O(n·r), where r ≪ d, which is critical for CPU performance.
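The compression can be sketched in a few lines of NumPy. This is a minimal illustration using the Axion1 table values (d_model=64, kv_lora_rank=8); the weight initializations are arbitrary, and details such as decoupled positional dimensions are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, kv_lora_rank, seq_len = 64, 8, 16  # d and r from the Axion1 table

# Down-projection compresses each token's hidden state into a rank-r latent.
# Only this latent is cached, so the cache stores seq_len * r floats
# instead of 2 * seq_len * d for separate K and V.
W_down = rng.standard_normal((d_model, kv_lora_rank)) / np.sqrt(d_model)
# Separate up-projections expand the shared latent back into keys and values.
W_up_k = rng.standard_normal((kv_lora_rank, d_model)) / np.sqrt(kv_lora_rank)
W_up_v = rng.standard_normal((kv_lora_rank, d_model)) / np.sqrt(kv_lora_rank)

h = rng.standard_normal((seq_len, d_model))  # token hidden states
latent_cache = h @ W_down                    # (seq_len, r): all that is stored
k = latent_cache @ W_up_k                    # keys reconstructed on use
v = latent_cache @ W_up_v                    # values reconstructed on use

naive_cache_size = 2 * seq_len * d_model     # entries for full K + V caching
mla_cache_size = latent_cache.size           # entries for the shared latent
```

With these numbers the latent cache holds 16×8 = 128 entries versus 2×16×64 = 2048 for a naive K/V cache, a 16× reduction.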
Each FFN is replaced by a mixture of experts: shared experts always process every token, plus top-K routed experts selected via sigmoid gating with a dynamic bias for load balancing, so no auxiliary loss is needed.
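The routing step can be sketched as follows. This is a simplified NumPy illustration of sigmoid gating with a bias-based balancing update, using the Axion1 table values (4 routed experts, top-2); the update rule and the step size `gamma` are illustrative, not the repo's exact hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_routed, n_active, n_tokens = 64, 4, 2, 32  # from the Axion1 table

W_gate = rng.standard_normal((d_model, n_routed)) / np.sqrt(d_model)
bias = np.zeros(n_routed)  # per-expert bias, nudged over time to balance load
tokens = rng.standard_normal((n_tokens, d_model))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sigmoid gating gives each token an affinity score per expert. The bias is
# added only when *selecting* experts; expert outputs would still be weighted
# by the unbiased scores.
scores = sigmoid(tokens @ W_gate)                    # (n_tokens, n_routed)
topk = np.argsort(-(scores + bias), axis=1)[:, :n_active]

# Auxiliary-loss-free balancing: after each batch, lower the bias of
# overloaded experts and raise it for underloaded ones.
load = np.bincount(topk.ravel(), minlength=n_routed)  # tokens per expert
target = topk.size / n_routed                         # perfectly even load
gamma = 0.01                                          # illustrative step size
bias -= gamma * np.sign(load - target)
```

Because balancing happens through the selection bias rather than a loss term, it never perturbs the gradient of the language-modeling objective.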
Every Axion release is a scaling experiment. Same architecture, increasing capacity.