CRAYON 🖍️

Hardware-Accelerated Vocabulary Swapping Engine

Live Demo GitHub Repository Benchmarks

System Specs

  • CORE COMPILED: C++17 EXT
  • CPU VECTORIZATION: AVX2 / AVX-512
  • GPU SUPPORT: CUDA / ROCm (HIP)
  • IO PROTOCOL: ZERO-COPY MMAP
  • PEAK THROUGHPUT: 18.4M TOK/S (LITE)
  • RESOLUTION SPEED: 0.54 MS COLD LOAD

1. Executive Summary & System Overview SEC_01

In large language model (LLM) pipelines, tokenization represents the initial gate between raw, unformatted sequence inputs and neural network layer embeddings. Traditional implementations represent vocabularies as dynamic pointer-heavy graph structures or complex hash tables. While flexible, these representations suffer from severe memory fragmentation, high latency CPU page table walks, and an inability to swap vocabularies dynamically without rebuilding internal pointer states.

CRAYON resolves this systemic bottleneck by shifting vocabulary representations from dynamic heap allocations to a memory-aligned, rigid binary Double-Array Trie (DAT) format.

Rather than chasing heap-allocated node pointers, transitions are resolved in O(1) constant-time via array indexing. When a tokenizer state needs to be loaded, Crayon maps the binary representation directly into memory space using the zero-copy system call mmap. This reduces vocabulary profile swap latency to a constant 0.54 milliseconds, allowing dynamic domain-specific adjustments on-the-fly.

2. Active Profiles (The Cartridge System) SEC_02

Rather than loading monolithic, slow-parsing text configurations, Crayon packages its vocabulary sets into binary files known as Cartridges. Cartridges are structured to align directly with CPU/GPU L2/L3 cache blocks and are loaded via zero-copy mappings, which allow the operating system to map files directly to the address space. Crayon supports both standardized production sizes and specialized academic layouts.

LITE PROFILE (lite)

  • Target Size: 50,000 subword tokens
  • DAT Binary Size: ~1.17 MB
  • JSON Spec Size: ~520 KB

Primary Use Case: General-purpose, low-footprint inference pipelines targeting web platforms, edge computing devices, or serverless models.

STANDARD PROFILE (standard)

  • Target Size: 206,373 subword tokens
  • DAT Binary Size: ~5.23 MB
  • JSON Spec Size: ~2.28 MB

Primary Use Case: High-capacity LLM vocabulary layouts requiring multilingual and code representation across mixed distributions.

3. Double-Array Trie (DAT) Data Layout SEC_03

The core data structure of Crayon is the Double-Array Trie (DAT). In classic graph-based tries, each node consists of dynamic pointer tables. Transitioning from state s via a character byte c requires chasing memory addresses, resulting in cache-misses.

DAT compresses the entire node tree structure into three parallel, contiguous, cache-aligned integer arrays of equal length: BASE, CHECK, and VALUES.

// Transition Mathematics for parent state 's' and byte 'c'

1. Compute Child Array Index: t = BASE[s] + c

2. Validate Node Ancestry: CHECK[t] == s

3. Extract Token ID Value: TokenID = VALUES[t]

If the ancestry validation fails (CHECK[t] != s), the state transition is marked invalid, terminating the current lookup path. Backtracking is resolved by reading the last valid VALUES[t] captured prior to state exhaustion.

4. DAT Construction: First-Fit Packing Compiler SEC_04

Compiling a standard prefix trie into a Double-Array Trie is an offline packing problem. For each parent state, we must find an offset base value b in the flat array such that the computed child slots b + c_i (for all child characters c_i) are free from existing collisions. This operation is computationally heavy: O(N × M) where N is vocabulary size and M is alphabet length.

In Crayon, the compiler module (compiler.cpp) is implemented in native C++17. The packing algorithm performs a First-Fit Linear Scan:

  • Extract the children character bytes of a parent node: {c_1, c_2, ..., c_k}.
  • Initialize candidate base index scan: b = 1.
  • Interrogate array boundaries: test if CHECK[b + c_i] == -1 for all child elements.
  • If a collision occurs (any child slot is occupied), increment b and repeat step 3.
  • Once a collision-free base offset b is located, commit the value to BASE[parent] = b, and claim the slots by setting CHECK[b + c_i] = parent.

By moving this compiler layer from Python to native C++17 and releasing the Global Interpreter Lock (GIL) during compilation loops, Crayon compiles vocabularies ~500x faster than naive Python-based trie packers.

5. Hardware-Aligned Engines (The Omni-Backend) SEC_05

To maintain high throughput across varying environments, Crayon does not rely on a single execution path. It implements an Omni-Backend structure, targeting CPUs and GPUs via dedicated hardware-specific compilation layers:

1. CPU Engine: AVX2 ASCII-Optimistic Paths

Implemented in cpu_engine.cpp, the CPU backend features a dual-execution pathway. When tokenizing raw strings, the engine processes chunks of 32 characters in a single cycle using 256-bit SIMD registers to verify ASCII status.

inline int is_ascii_32_avx2(const char* ptr) {
    __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ptr));
    int mask = _mm256_movemask_epi8(chunk);
    return mask == 0;
}

If all 32 bytes are verified as ASCII, the engine activates the Fast Path loop, bypassing UTF-8 multibyte boundary validation. This allows the compiler to unroll loops, achieving over 18 million tokens/second on commodity CPU cores. When non-ASCII bytes are detected, it switches to the Safe UTF-8 Path.

2. CUDA Engine: Parallel Threads

Implemented in gpu_engine_cuda.cu. Instead of processing a single document using multiple blocks, the engine maps each sequence or document in a batch to an individual CUDA thread. The compiled DAT arrays are copied to global GPU VRAM, allowing parallel state lookups. The kernel uses a lookahead limit of 128 characters to process subword sequences without needing block-wide shared memory locks or syncs, ensuring PyTorch context interoperability.

3. ROCm HIP Engine: AMD CDNA/RDNA Support

Implemented in rocm_engine.hip. This backend achieves performance parity on AMD hardware using the ROCm ecosystem. During setup, the build script checks for the hipcc compiler. If present, it swaps compilation directives to generate the crayon_rocm extension. This prevents vendor lock-in for enterprise deployments.

6. The Hyper-Fast BPE Trainer SEC_06

Training subword vocabularies on large texts is often a bottleneck in NLP systems, as traditional implementations scan the entire text block to count frequencies and merge pairs in each iteration. Crayon's BPE training engine (trainer.cpp) implements a single-core trainer using three coordinated system structures:

  • Parallel Array Linked-List: The corpus is stored as four contiguous integer arrays: tokens, prev_pos, next_pos, and active. This keeps the training data inside the CPU cache. Merging adjacent elements is reduced to updating index pointers in constant time, rather than modifying string memory on the heap.
  • Inverted Index (pair_locations): A hash map mapping each unique subword pair (A, B) to its list of corpus index locations. The trainer updates only the affected sites, avoiding full corpus rescans.
  • Lazy Max-Heap: A priority queue that stores pairs sorted by frequency. If a merge event invalidates adjacent pairs, the trainer does not adjust their counts in the heap. Instead, when a pair is popped, the trainer compares its heap count with the true count in the hash map. If they mismatch, it is discarded in O(1) time, reducing heap restructure operations.

7. Concurrency & Memory Models SEC_07

For production-grade deployments, Crayon incorporates several memory structures to handle large text data streams without allocation overheads:

Zero-Copy OS Memory Mapping

In Crayon, vocabulary profiles are not loaded via standard heap allocations. The binary .dat file is mapped directly to the virtual memory space of the calling process using the operating system's mmap syscall and Python's buffer protocol (Py_buffer). The system loads the vocabulary in under 1ms because the operating system lazily pages memory blocks from storage only when traversed by index lookups.

Lock-Free Optimistic L1 Caching

To prevent thread contention on multi-core architectures, each execution thread has its own private L1 Cache (2048 entry capacity). Crayon uses optimistic concurrency: reads inspect a version counter before and after copying keys/values to identify concurrent modifications. This avoids mutex lock contention and cache-line false-sharing during tokenization.

8. Verified Production Benchmarks SEC_09

Below are the officially verified benchmark execution speeds capturing throughput characteristics of XERV Crayon against leading industry subword extraction tokenizers. Workloads were evaluated using a commodity AMD64 CPU platform on a standardized mixed-linguistic and syntax corpus.

Engine Profile Vocab Size English Prose Code Syntax Unicode Characters
Crayon (lite) 50,000 18,407,951 tok/s 33,161,787 tok/s 43,921,330 tok/s
Crayon (standard) 206,373 17,154,914 tok/s 18,707,550 tok/s 29,227,498 tok/s
tiktoken (p50k_base) 50,000 2,027,728 tok/s 1,229,651 tok/s 2,272,876 tok/s
tiktoken (cl100k_base) 100,000 1,198,631 tok/s 916,869 tok/s 1,696,065 tok/s
HF GPT-2 (BPE) 50,257 237,117 tok/s

9. Technical FAQ & Search Typo Lexicon SEC_10

Q: How does Crayon achieve sub-millisecond cold loads?

A: Using OS-level memory mapping via the mmap system call. The binary data structure maps directly to the virtual memory table of the calling process, avoiding fread execution loops. The operating system page cache maps blocks on demand, meaning the CPU reads only the nodes required for active state lookups.

Q: What happens when Unicode text causes a state lookup mismatch?

A: If a byte transition fails validation (CHECK[t] != s), the engine halts forward search, emits the last encountered valid Token ID value, and restarts searching from root state 0 at the index following the matched block. If no subword matches, it defaults to raw byte fallback representation.

XERV CRAYON

Systems Reference Manual

COMPILED VIA C++17 ASSEMBLY | PRODUCTION GRADE v5.1.0