Original Research & Case Studies
Cutting-edge work in AI infrastructure, tokenizer design, social algorithms, and human-centered systems.
Crayon — Next-Gen Production Tokenizer
Crayon is a production-grade tokenizer built from scratch for maximum speed, a minimal memory footprint, and safe, stable vocabulary evolution in live LLM deployments.
- 2M+ tokens/second on CPU
- ~500K target vocabulary size
- <0.01¢ per million tokens (goal)
Built on information-theoretic foundations, Crayon combines a longest-match trie with perfect hashing, SIMD acceleration, a cache-aware memory layout, zero-copy processing, GIL-free multithreading, and incremental vocabulary updates. Together, these deliver substantially higher throughput and lower memory use in production than conventional tokenizers such as SentencePiece or tiktoken.
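To make the core lookup strategy concrete, here is a minimal sketch of greedy longest-match tokenization over a character trie. This is an illustration only, not Crayon's actual implementation: the vocabulary, token IDs, and `TrieNode` structure are hypothetical, and it omits the perfect hashing, SIMD, and zero-copy machinery described above.

```python
class TrieNode:
    """One trie node; token_id is set when a vocabulary piece ends here."""
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}
        self.token_id = None


def build_trie(vocab):
    """Insert each vocabulary piece character by character."""
    root = TrieNode()
    for token_id, piece in enumerate(vocab):
        node = root
        for ch in piece:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def tokenize(text, root, unk_id=-1):
    """Greedy longest match: at each position, walk the trie as far as
    possible and emit the longest vocabulary piece seen; fall back to a
    single unknown character if nothing matches."""
    ids, i, n = [], 0, len(text)
    while i < n:
        node = root
        last_id, last_len = unk_id, 1
        j = i
        while j < n and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                last_id, last_len = node.token_id, j - i
        ids.append(last_id)
        i += last_len
    return ids


# Hypothetical toy vocabulary for demonstration.
vocab = ["t", "th", "the", "he", "e", " ", "c", "a", "cat"]
root = build_trie(vocab)
print(tokenize("the cat", root))  # → [2, 5, 8]
```

A production version would replace the per-node Python dicts with a flat, cache-friendly array indexed via perfect hashing, which is what makes the trie walk amenable to the SIMD and cache-aware optimizations mentioned above.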