Original Research & Case Studies
Cutting-edge work in AI infrastructure, tokenizer design, social algorithms, and human-centered systems.
Crayon — Next-Gen Production Tokenizer
Crayon is a production-grade tokenizer built from scratch for maximum speed, a minimal memory footprint, and safe, stable vocabulary evolution in live LLM deployments.
- 2M+ tokens/second on CPU
- ~500K target vocabulary size
- <0.01¢ per million tokens (goal)
Built on information-theoretic foundations, Crayon combines a longest-match trie with perfect hashing, SIMD acceleration, a cache-aware memory layout, zero-copy processing, GIL-free multithreading, and incremental vocabulary updates. Together, these deliver substantially higher throughput and lower memory use in production than conventional tokenizers such as SentencePiece or tiktoken.
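To make the core lookup strategy concrete, here is a minimal sketch of greedy longest-match tokenization over a character trie. This is an illustration only, not Crayon's actual implementation: the vocabulary, token IDs, and `TrieNode` structure are hypothetical, and it omits the perfect hashing, SIMD, and zero-copy machinery described above.

```python
class TrieNode:
    """One trie node; token_id is set when a vocabulary piece ends here."""
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}
        self.token_id = None


def build_trie(vocab):
    """Insert each vocabulary piece character by character."""
    root = TrieNode()
    for token_id, piece in enumerate(vocab):
        node = root
        for ch in piece:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def tokenize(text, root, unk_id=-1):
    """Greedy longest match: at each position, walk the trie as far as
    possible and emit the longest vocabulary piece seen; fall back to a
    single unknown character if nothing matches."""
    ids, i, n = [], 0, len(text)
    while i < n:
        node = root
        last_id, last_len = unk_id, 1
        j = i
        while j < n and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                last_id, last_len = node.token_id, j - i
        ids.append(last_id)
        i += last_len
    return ids


# Hypothetical toy vocabulary for demonstration.
vocab = ["t", "th", "the", "he", "e", " ", "c", "a", "cat"]
root = build_trie(vocab)
print(tokenize("the cat", root))  # → [2, 5, 8]
```

A production version would replace the per-node Python dicts with a flat, cache-friendly array indexed via perfect hashing, which is what makes the trie walk amenable to the SIMD and cache-aware optimizations mentioned above.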