Building architectures from scratch (MLA, MoE, auxiliary-loss-free load balancing), scaled progressively from 344k to 100M+ parameters. All weights open. All code open.
Each version scales the same architecture (same MLA, same MoE) with increasing capacity and capability.
Simulated demo: press Run to see example outputs from Axion1. Connect to real inference by pointing it at your own server.
Axion models require PyTorch, Transformers, and Safetensors. No GPU required.
Clone the model repository:
Load with AutoModelForCausalLM. The custom BPE tokenizer must be loaded separately.
Axion ships with a built-in Flask server and dark-themed HTML interface.
Expected: ~1,000 tok/s · ~330 s/epoch · ~115 min total on a Ryzen 5 5600G.
| Key | Axion1 | Description |
|---|---|---|
| d_model | 64 | Embedding dimension |
| n_layers | 4 | Transformer blocks |
| kv_lora_rank | 8 | KV compression rank (MLA) |
| n_routed_experts | 4 | Expert pool size (MoE) |
| n_active_experts | 2 | Experts activated per token |
| vocab_size | 1024 | BPE vocabulary size |
| max_seq_len | 512 | Maximum context length |
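The table above maps directly onto a plain configuration object. A minimal sketch (the dict below is illustrative and mirrors the table keys; it is not the repo's actual config class):

```python
# Illustrative Axion1 configuration mirroring the table above.
# This is a sketch for reference, not the repository's real config object.
axion1_config = {
    "d_model": 64,           # embedding dimension
    "n_layers": 4,           # transformer blocks
    "kv_lora_rank": 8,       # KV compression rank (MLA)
    "n_routed_experts": 4,   # expert pool size (MoE)
    "n_active_experts": 2,   # experts activated per token
    "vocab_size": 1024,      # BPE vocabulary size
    "max_seq_len": 512,      # maximum context length
}
```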
MLA compresses the keys and values into a shared latent vector of rank kv_lora_rank, then expands them back. This reduces the KV cache from O(n·d) to O(n·r), where r ≪ d, which is critical for CPU performance.
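The compression can be sketched in a few lines of NumPy. This is a minimal illustration using the Axion1 table values (d_model=64, kv_lora_rank=8); the weight initializations are arbitrary, and details such as decoupled positional dimensions are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, kv_lora_rank, seq_len = 64, 8, 16  # d and r from the Axion1 table

# Down-projection compresses each token's hidden state into a rank-r latent.
# Only this latent is cached, so the cache stores seq_len * r floats
# instead of 2 * seq_len * d for separate K and V.
W_down = rng.standard_normal((d_model, kv_lora_rank)) / np.sqrt(d_model)
# Separate up-projections expand the shared latent back into keys and values.
W_up_k = rng.standard_normal((kv_lora_rank, d_model)) / np.sqrt(kv_lora_rank)
W_up_v = rng.standard_normal((kv_lora_rank, d_model)) / np.sqrt(kv_lora_rank)

h = rng.standard_normal((seq_len, d_model))  # token hidden states
latent_cache = h @ W_down                    # (seq_len, r): all that is stored
k = latent_cache @ W_up_k                    # keys reconstructed on use
v = latent_cache @ W_up_v                    # values reconstructed on use

naive_cache_size = 2 * seq_len * d_model     # entries for full K + V caching
mla_cache_size = latent_cache.size           # entries for the shared latent
```

With these numbers the latent cache holds 16×8 = 128 entries versus 2×16×64 = 2048 for a naive K/V cache, a 16× reduction.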
Each FFN is replaced by a mixture of experts: shared experts always process every token, plus top-K routed experts selected via sigmoid gating with a dynamic bias for load balancing, so no auxiliary loss is needed.
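The routing step can be sketched as follows. This is a simplified NumPy illustration of sigmoid gating with a bias-based balancing update, using the Axion1 table values (4 routed experts, top-2); the update rule and the step size `gamma` are illustrative, not the repo's exact hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_routed, n_active, n_tokens = 64, 4, 2, 32  # from the Axion1 table

W_gate = rng.standard_normal((d_model, n_routed)) / np.sqrt(d_model)
bias = np.zeros(n_routed)  # per-expert bias, nudged over time to balance load
tokens = rng.standard_normal((n_tokens, d_model))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sigmoid gating gives each token an affinity score per expert. The bias is
# added only when *selecting* experts; expert outputs would still be weighted
# by the unbiased scores.
scores = sigmoid(tokens @ W_gate)                    # (n_tokens, n_routed)
topk = np.argsort(-(scores + bias), axis=1)[:, :n_active]

# Auxiliary-loss-free balancing: after each batch, lower the bias of
# overloaded experts and raise it for underloaded ones.
load = np.bincount(topk.ravel(), minlength=n_routed)  # tokens per expert
target = topk.size / n_routed                         # perfectly even load
gamma = 0.01                                          # illustrative step size
bias -= gamma * np.sign(load - target)
```

Because balancing happens through the selection bias rather than a loss term, it never perturbs the gradient of the language-modeling objective.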
Every Axion release is a scaling experiment. Same architecture, increasing capacity.