CRAYON: Hardware-Accelerated Tokenization

CRAYON is a hardware-accelerated tokenizer built for high throughput and instant vocabulary swapping. It is designed to eliminate data-preprocessing bottlenecks in LLM pipelines and manages vocabularies through a cartridge system. XERV is dedicated to Integrating Intelligence in Physical Intelligence.


Architecture

Core Engine: Written in C++17; vocabulary training uses a linked-list BPE (Byte Pair Encoding) implementation.
Hardware Support: Native GPU kernels in CUDA (NVIDIA) and HIP (AMD), plus a CPU path with AVX2 and AVX-512 SIMD.
Memory Loading: Uses zero-copy mmap loading of .DAT files for instant startup.
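
To make the zero-copy idea concrete, here is a minimal Python sketch of memory-mapping a binary file and reading it as an integer array without copying it into process memory. The file layout (a flat array of native 32-bit integers) and the names MappedInt32Array and write_demo_file are assumptions made for illustration; the actual .DAT format and CRAYON's C++ loader are not described in this document.

```python
import mmap
import os
import struct
import tempfile


class MappedInt32Array:
    """Read-only int32 view over a file, served from the OS page cache.

    Hypothetical helper: assumes the file is a flat array of native int32.
    """

    def __init__(self, path):
        self._file = open(path, "rb")
        # Length 0 maps the whole file; pages are faulted in lazily on access.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._bytes = memoryview(self._map)
        # cast() reinterprets the mapped bytes in place; no data is copied.
        self.values = self._bytes.cast("i")

    def close(self):
        # Views must be released before the mapping itself can be closed.
        self.values.release()
        self._bytes.release()
        self._map.close()
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


def write_demo_file(path, values):
    """Write values as native int32 so the mapped view can read them back."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}i", *values))


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.dat")
    write_demo_file(demo_path, [7, -3, 42, 0])
    with MappedInt32Array(demo_path) as arr:
        print(list(arr.values))
```

Because the mapping is read-only and backed by the file, startup cost in a design like this is dominated by the mmap call itself rather than by parsing or copying the vocabulary.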

Cartridge System

CRAYON represents vocabularies as pre-built binary profiles (cartridges) that are loaded instantly via zero-copy memory mapping. Developers can switch vocabularies at runtime without rebuilding the tokenizer state.

Lite Cartridge (50k): 50,000 subwords. DAT: ~1.17 MB, JSON: ~520 KB. General-purpose LLM tokenization.
Standard Cartridge (206k): 206,373 subwords. DAT: ~5.23 MB, JSON: ~2.28 MB. Rich multi-domain/multilingual vocabularies.
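
The swap-without-rebuild behavior can be illustrated with a small Python sketch in which each cartridge is an immutable, pre-built object and switching is a single reference assignment. All names here (Cartridge, make_cartridge, SwappableTokenizer, swap_to) are hypothetical and are not CRAYON's API, and the toy greedy matcher over a dictionary stands in for the real engine.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Cartridge:
    """Hypothetical stand-in for a pre-built, read-only vocabulary profile."""
    name: str
    token_ids: MappingProxyType  # read-only token -> id table
    max_len: int                 # longest token, bounds the match window


def make_cartridge(name, tokens):
    """Build a cartridge once, up front; IDs follow the order of `tokens`."""
    table = {tok: i for i, tok in enumerate(tokens)}
    return Cartridge(name, MappingProxyType(table), max(map(len, tokens)))


class SwappableTokenizer:
    def __init__(self, cartridges):
        self._cartridges = {c.name: c for c in cartridges}
        self._active = None

    def swap_to(self, name):
        # Swapping is one reference assignment; no tokenizer state is rebuilt.
        self._active = self._cartridges[name]

    def tokenize(self, text):
        cart = self._active
        if cart is None:
            raise RuntimeError("no cartridge selected")
        ids, pos = [], 0
        while pos < len(text):
            # Greedy longest match against the active cartridge.
            for length in range(min(cart.max_len, len(text) - pos), 0, -1):
                token_id = cart.token_ids.get(text[pos:pos + length])
                if token_id is not None:
                    ids.append(token_id)
                    pos += length
                    break
            else:
                raise ValueError(f"no token matches at offset {pos}")
        return ids


if __name__ == "__main__":
    small = make_cartridge("small", ["a", "b", "c"])
    large = make_cartridge("large", ["a", "b", "c", "ab", "abc"])
    tok = SwappableTokenizer([small, large])
    for name in ("small", "large"):
        tok.swap_to(name)
        print(name, tok.tokenize("abcab"))
```

The design point is that all expensive work happens when a cartridge is built, so the tokenizer itself holds nothing that depends on the previous vocabulary.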

Double-Array Trie (DAT) Data Layout

To eliminate pointer chasing and dynamic hash lookups, CRAYON stores each vocabulary as a unified, cache-aligned Double-Array Trie made up of three parallel arrays.

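BASE Array: Holds each state's base offset; the target child slot is computed from this offset and the input symbol.
CHECK Array: Verifies a transition by confirming that the target slot is owned by the parent state.
VALUES Array: Stores the token ID for each state that completes a vocabulary entry, so matches are captured during traversal.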
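
The interplay of the three arrays is easier to follow in code. The Python sketch below builds a small double-array trie over UTF-8 bytes and tokenizes by greedy longest match. It uses the textbook additive transition (slot = BASE[state] + code, valid when CHECK[slot] == state) and a naive first-fit builder; whether CRAYON uses exactly this transition rule, byte coding, or placement strategy is an assumption, and the function names are illustrative only.

```python
def build_dat(vocab):
    """Build (base, check, values) from a dict of token string -> token ID."""
    # Step 1: an ordinary pointer-based trie; byte b is coded as b + 1.
    root = {"children": {}, "value": -1}
    for token, token_id in vocab.items():
        node = root
        for byte in token.encode("utf-8"):
            node = node["children"].setdefault(
                byte + 1, {"children": {}, "value": -1})
        node["value"] = token_id

    # Slot 0 is the root state; check == -1 marks a free slot.
    base, check, values = [0], [0], [-1]

    def ensure(size):
        while len(base) < size:
            base.append(0)
            check.append(-1)
            values.append(-1)

    # Step 2: place each node's children at the first base offset that fits.
    queue = [(root, 0)]
    while queue:
        node, state = queue.pop(0)
        values[state] = node["value"]
        codes = sorted(node["children"])
        if not codes:
            continue
        b = 1
        while True:
            ensure(b + codes[-1] + 1)
            if all(check[b + c] == -1 for c in codes):
                break
            b += 1
        base[state] = b
        for c in codes:
            check[b + c] = state  # the parent state owns this slot
            queue.append((node["children"][c], b + c))
    return base, check, values


def longest_match(dat, data, start):
    """Return (token_id, end) of the longest token starting at `start`."""
    base, check, values = dat
    state, best_id, best_end = 0, -1, start
    for pos in range(start, len(data)):
        slot = base[state] + data[pos] + 1
        if slot >= len(check) or check[slot] != state:
            break  # no such transition from this state
        state = slot
        if values[state] != -1:
            best_id, best_end = values[state], pos + 1
    return best_id, best_end


def tokenize(dat, text):
    data = text.encode("utf-8")
    ids, pos = [], 0
    while pos < len(data):
        token_id, end = longest_match(dat, data, pos)
        if token_id == -1:
            raise ValueError(f"no token covers byte offset {pos}")
        ids.append(token_id)
        pos = end
    return ids


if __name__ == "__main__":
    dat = build_dat({"a": 0, "b": 1, "ab": 2, "abc": 3, "c": 4})
    print(tokenize(dat, "abcab"))
```

Each step of the lookup touches only flat integer arrays, which is what removes pointer chasing and hashing from the hot loop.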

Extreme Stress Test (~100M Characters)

To evaluate the absolute throughput limit and memory stability under heavy load, we tokenized a single text block of ~100 million characters (~95.37 MiB) with the Standard (206k) profile in a Google Colab environment with a Tesla T4 GPU.

Device	Total Tokens	Time (s)	Throughput (MiB/s)	Token Rate (M tokens/s)
CPU	21,212,115	0.8899	107.16	23.84
CUDA	21,212,115	17.3300	5.50	1.22
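
The MiB/s and M tokens/s columns follow directly from the token count and the timings. The short Python check below recomputes them, assuming the input was exactly 100,000,000 one-byte characters (an assumption consistent with the stated ~95.37 MiB, but not confirmed here).

```python
MIB = 1024 * 1024

# Assumption: 100,000,000 one-byte characters, i.e. about 95.37 MiB.
INPUT_BYTES = 100_000_000
TOTAL_TOKENS = 21_212_115
RUNS = {"CPU": 0.8899, "CUDA": 17.3300}  # wall-clock seconds from the table


def throughput(num_bytes, num_tokens, seconds):
    """Return (MiB per second, millions of tokens per second)."""
    return num_bytes / MIB / seconds, num_tokens / seconds / 1e6


if __name__ == "__main__":
    for device, seconds in RUNS.items():
        mib_s, mtok_s = throughput(INPUT_BYTES, TOTAL_TOKENS, seconds)
        print(f"{device}: {mib_s:.2f} MiB/s, {mtok_s:.2f} M tokens/s")
    print(f"CPU speedup over CUDA: {RUNS['CUDA'] / RUNS['CPU']:.1f}x")
```

Under that assumption the recomputed figures agree with the table to within rounding of the reported times, and the ratio of the two timings shows the CPU run finishing roughly 19.5 times faster than the CUDA run in this test.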