AxionLab Research

Scaling Intelligence
from Zero

Building architectures from scratch — MLA, MoE, auxiliary-loss-free load balancing — scaled progressively from 344k to 100M+ parameters. All weights open. All code open.

344k Parameters — Axion1
~160k Active per token
1,100 Tokens/sec on CPU
100% Open Source

The Axion Series

Each version scales the same architecture — same MLA, same MoE — with increasing capacity and capability.

v0.1 — Released
Axion1
Proof-of-concept: DeepSeek-V3 architecture at extreme miniaturization. Trained on GSM8K in 115 minutes on a Ryzen 5 CPU.
MLA · MoE · RoPE · SwiGLU · CPU
344k total · ~160k active
● Live
View on HuggingFace →
v0.2 — Coming Soon
Axion1-v0.2
Same architecture, 4× the capacity. Expanded vocabulary and noticeably more coherent language generation.
MLA · MoE · d_model 128 · 6 layers
~1.5M total · ~800k active
v0.3 — Planned
Axion1-v0.3
First model expected to produce grammatically coherent multi-sentence responses. Scaling laws in action.
MLA · MoE · d_model 256
~6M total
Planned
v0.4–0.5 — Future
Axion1-v0.4 / Axion1-v0.5
Scaling to 24M and 100M parameters. Instruction tuning and multi-language support planned.
24M → 100M · Multilingual
24M–100M params
Planned

Try Axion1

Simulated demo — press Run to see example outputs from Axion1. Connect to real inference by pointing to your server.

Axion1-350k-A250k — inference ● model loaded

Updates & Research

Architecture · March 8, 2025
Why MLA makes small models faster on CPU
Multi-head Latent Attention compresses KV into a low-rank latent space, reducing KV cache memory ~8× vs standard MHA. At 344k parameters this means the entire KV cache fits in L2 cache on the Ryzen 5.
Read more →
Scaling · Coming Soon
Axion2: what happens at 1.5M parameters?
The jump from 344k to 1.5M is not linear — empirically, this is where grammatical structure starts emerging. We document the full training run and compare outputs side by side with Axion1.
Coming soon →

Get Started


Installation

Axion models require PyTorch, Transformers, and Safetensors. No GPU required.

pip install torch transformers safetensors flask

Clone the model repository:

git clone https://huggingface.co/AxionLab-Co/Axion1-350k-A250k
cd Axion1-350k-A250k

Inference

Load with AutoModelForCausalLM. The custom BPE tokenizer must be loaded separately.

from transformers import AutoModelForCausalLM, LogitsProcessor, LogitsProcessorList
from tokenizer import BPETokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "AxionLab-Co/Axion1-350k-A250k",
    trust_remote_code=True,
)
model.eval()

# The custom BPE tokenizer ships with the repo and is loaded separately.
tok = BPETokenizer.load("model.vocab", "model.model")

class MinNewTokens(LogitsProcessor):
    """Block <eos>/<pad> for the first n generated tokens."""
    def __init__(self, n, eos, pad):
        self.n = n
        self.bad = [eos, pad]
        self.i = 0

    def __call__(self, ids, scores):
        if self.i < self.n:
            for b in self.bad:
                scores[:, b] = float("-inf")
        self.i += 1
        return scores

# Prompt follows the GSM8K-style training format (Portuguese: "What is 5 + 3?").
prompt = "# Pergunta:\nQuanto e 5 + 3?\n--\n# Resposta:\n"
ids = tok.encode(prompt, add_bos=True, add_eos=False)
out = model.generate(
    torch.tensor([ids]),
    max_new_tokens=80,
    temperature=0.9,
    do_sample=True,
    use_cache=False,
    logits_processor=LogitsProcessorList(
        [MinNewTokens(5, tok.token2id["<eos>"], tok.token2id["<pad>"])]
    ),
)
print(tok.decode(out[0][len(ids):].tolist()))

Chat Interface

Axion ships with a built-in Flask server and dark-themed HTML interface.

python chat.py # Open http://localhost:5000
python chat.py --port 8080 --host 0.0.0.0

Dataset Preparation

# Download GSM8K (requires internet)
python convert.py
# Synthetic math data (no internet needed)
python convert.py --synthetic --max 2000

Training

# 1. Train tokenizer
python tokenizer.py
# 2. Train model
python train.py --epochs 20 --batch-size 8 --grad-accum 4
# Resume interrupted training
python train.py --resume --epochs 20

Expected: ~1,000 tok/s · ~330 s/epoch · ~115 min total on a Ryzen 5 5600G.
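As a quick cross-check, the per-epoch figure is consistent with the end-to-end time (20 epochs taken from the train.py command above; all numbers approximate):

```python
# Rough wall-clock estimate: 20 epochs at ~330 s each.
s_per_epoch, epochs = 330, 20
total_min = s_per_epoch * epochs / 60
print(total_min)  # 110.0, in line with the ~115 min end-to-end figure
```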

Config Reference

Key              | Axion1 | Description
d_model          | 64     | Embedding dimension
n_layers         | 4      | Transformer blocks
kv_lora_rank     | 8      | KV compression rank (MLA)
n_routed_experts | 4      | Expert pool size (MoE)
n_active_experts | 2      | Experts activated per token
vocab_size       | 1024   | BPE vocabulary size
max_seq_len      | 512    | Maximum context length
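For reference, the table above as a plain Python dict. Key names follow the table; the repo's actual config class may spell or group them differently:

```python
# Axion1 hyperparameters from the config reference table.
axion1_config = {
    "d_model": 64,            # embedding dimension
    "n_layers": 4,            # transformer blocks
    "kv_lora_rank": 8,        # KV compression rank (MLA)
    "n_routed_experts": 4,    # expert pool size (MoE)
    "n_active_experts": 2,    # experts activated per token
    "vocab_size": 1024,       # BPE vocabulary size
    "max_seq_len": 512,       # maximum context length
}
```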

MLA & MoE

Multi-head Latent Attention

MLA compresses KV into a shared latent vector of rank kv_lora_rank, then expands back. This reduces KV cache from O(n·d) to O(n·r) where r ≪ d — critical for CPU performance.
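To make the asymptotics concrete, a back-of-the-envelope sketch using the Axion1 numbers from the config reference (n = max_seq_len 512, d = d_model 64, r = kv_lora_rank 8; fp32 storage assumed):

```python
# Per-layer cache footprint at full context, following O(n*d) vs O(n*r).
n, d, r, bytes_per_value = 512, 64, 8, 4  # fp32: 4 bytes/value

standard_cache = n * d * bytes_per_value  # conventional KV cache: O(n*d)
mla_cache = n * r * bytes_per_value       # MLA latent cache:      O(n*r)

print(standard_cache, mla_cache, standard_cache // mla_cache)
# 131072 16384 8 -> 128 KiB vs 16 KiB per layer, an 8x reduction
```

At 16 KiB per layer, four layers of latent cache total 64 KiB, which is how the whole cache can sit in a desktop CPU's L2.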

DeepSeekMoE

Each FFN is replaced by a mixture of experts: shared experts always process every token, plus top-K routed experts selected via sigmoid gating with dynamic bias for load balancing — no auxiliary loss needed.
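A minimal PyTorch sketch of that routing step, using the Axion1 sizes (4 routed experts, top-2). The bias update rule here is illustrative of the auxiliary-loss-free idea, not the exact schedule the training code uses:

```python
import torch

n_experts, top_k, n_tokens = 4, 2, 8
logits = torch.randn(n_tokens, n_experts)  # router logits, one row per token
bias = torch.zeros(n_experts)              # dynamic load-balancing bias

scores = torch.sigmoid(logits)             # sigmoid gating, not softmax
_, idx = torch.topk(scores + bias, top_k)  # bias steers expert selection only...
gates = torch.gather(scores, 1, idx)       # ...mixing weights stay bias-free
gates = gates / gates.sum(-1, keepdim=True)

# After each step, nudge the bias: overloaded experts down, idle ones up,
# balancing load without any auxiliary loss term in the objective.
load = torch.bincount(idx.flatten(), minlength=n_experts).float()
bias -= 0.01 * torch.sign(load - load.mean())
```

Because the bias enters only the top-K selection and not the gate values, it reshapes traffic across experts without perturbing the gradient of the language-modeling loss.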

Where We're Going

Every Axion release is a scaling experiment. Same architecture, increasing capacity.

Axion1-v0.1 — 344k params · March 2025 · Released
Proof of Architecture
Full DeepSeek-V3 pipeline from scratch. MLA + MoE + BPE tokenizer + HuggingFace integration. Trained on GSM8K in 115 minutes on CPU.
MLA · MoE · GSM8K · HuggingFace
Axion1-v0.3 — ~6M params · Planned
Reliable Math Reasoning
d_model 256. Consistent step-by-step reasoning on arithmetic. Broader dataset planned.
d_model 256 · Multi-dataset
Axion1-v0.4 — ~24M params · Planned
Instruction Following
First Axion with instruction tuning. Target: answer general questions in Portuguese and English.
Instruction SFT · PT + EN
Axion1-v0.5 — ~100M params · Planned
General Purpose
The flagship. Real conversation, multi-turn context, and a full evaluation suite.
100M · Multi-turn · Eval suite