WP_Token_Map

High-performance data structure for string key-to-value lookups with optimized memory layout.

Source: wp-includes/class-wp-token-map.php
Since: 6.6.0

Overview

WP_Token_Map provides efficient string token lookup and transformation. It’s optimized for static datasets like HTML named character references, smilies, or other fixed mappings.

The class uses a packed binary format internally to maximize cache locality and minimize memory access during lookups. It supports precomputation for zero-cost initialization in production.

Constants

ConstantValueDescription
STORAGE_VERSION'6.6.0-trunk'Format version for precomputed tables
MAX_LENGTH256Maximum bytes for keys and values

Properties

PropertyTypeVisibilityDescription
$key_lengthintprivatePrefix length for grouping (default 2)
$large_wordsarrayprivatePacked strings for tokens longer than key_length
$groupsstringprivateNull-delimited group prefixes
$small_wordsstringprivatePacked short tokens
$small_mappingsstring[]privateReplacements for short tokens

Methods

from_array()

Creates a token map from an associative array.

php
public static function from_array( array $mappings, int $key_length = 2 ): ?WP_Token_Map
ParameterTypeDescription
$mappingsarrayKey-value pairs (token โ†’ replacement)
$key_lengthintGroup prefix length (default 2)

Returns: WP_Token_Map on success, null if any token exceeds MAX_LENGTH.

Example:

php
$smilies = WP_Token_Map::from_array( array(
    '8O' => '๐Ÿ˜ฏ',
    ':(' => '๐Ÿ™',
    ':)' => '๐Ÿ™‚',
    ':?' => '๐Ÿ˜•',
) );

from_precomputed_table()

Creates a token map from precomputed data.

php
public static function from_precomputed_table( array $state ): ?WP_Token_Map
ParameterTypeDescription
$statearrayPrecomputed state array

State Array Keys:

  • storage_version โ€” Must match STORAGE_VERSION
  • key_length โ€” Group prefix length
  • groups โ€” Null-delimited group prefixes
  • large_words โ€” Packed token strings
  • small_words โ€” Packed short tokens
  • small_mappings โ€” Short token replacements

Returns: WP_Token_Map or null on version mismatch/missing data.


contains()

Checks if a word exists as a lookup key.

php
public function contains( string $word, string $case_sensitivity = 'case-sensitive' ): bool
ParameterTypeDescription
$wordstringToken to check
$case_sensitivitystring'case-sensitive' or 'ascii-case-insensitive'

Returns: true if token exists in map.

Example:

php
$smilies->contains( ':)' );  // true
$smilies->contains( 'nope' ); // false

read_token()

Reads a token starting at the given offset and returns its mapping.

php
public function read_token(
    string $text,
    int $offset = 0,
    ?int &$matched_token_byte_length = null,
    string $case_sensitivity = 'case-sensitive'
): ?string
ParameterTypeDescription
$textstringText to search in
$offsetintStarting byte position
&$matched_token_byte_lengthint|nullReceives matched token length
$case_sensitivitystring'case-sensitive' or 'ascii-case-insensitive'

Returns: Mapped value if token found, null otherwise.

Example:

php
$result = $smilies->read_token( 'Hello :) world', 6, $length );
// $result = '๐Ÿ™‚'
// $length = 2

to_array()

Exports the map back to an associative array.

php
public function to_array(): array

Returns: Array of token => replacement pairs.


precomputed_php_source_table()

Generates PHP source code for precomputed loading.

php
public function precomputed_php_source_table( string $indent = "t" ): string
ParameterTypeDescription
$indentstringIndentation string

Returns: PHP code that can be pasted into a source file.

Example Output:

php
WP_Token_Map::from_precomputed_table(
    array(
        "storage_version" => "6.6.0-trunk",
        "key_length" => 2,
        "groups" => "",
        "large_words" => array(),
        "small_words" => "8Ox00:)x00:(x00:?x00",
        "small_mappings" => array( "๐Ÿ˜ฏ", "๐Ÿ™‚", "๐Ÿ™", "๐Ÿ˜•" )
    )
);

Internal Methods

read_small_token() (private)

Searches for short tokens (โ‰ค key_length).

php
private function read_small_token(
    string $text,
    int $offset = 0,
    ?int &$matched_token_byte_length = null,
    string $case_sensitivity = 'case-sensitive'
): ?string

longest_first_then_alphabetical() (private static)

Comparison function ensuring longer matches take priority.

php
private static function longest_first_then_alphabetical( string $a, string $b ): int

Sorts by length descending, then alphabetically. Prevents substring matches masking longer tokens.


Architecture

Large vs. Small Words

Tokens are classified by length relative to key_length:

  • Small words: Length โ‰ค key_length โ€” stored in packed $small_words string
  • Large words: Length > key_length โ€” grouped by prefix in $large_words array

Packed Format

Large words use a binary packed format:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Length of rest โ”‚ Rest of key   โ”‚ Length of value โ”‚ Value  โ”‚
โ”‚ of key (bytes) โ”‚               โ”‚ (bytes)         โ”‚        โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 0x08           โ”‚ nterDot;      โ”‚ 0x02            โ”‚ ยท      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Choosing Key Length

The key_length parameter affects performance:

  • Too long: Many single-token groups, wasted overhead
  • Too short: Large groups requiring linear search

For HTML character references (2,000+ entries), key_length = 2 provides good distribution. For smaller sets like smilies, key_length = 1 may be better.


Performance Tips

  1. Precompute for production: Use precomputed_php_source_table() to generate static code
  2. Strip common prefixes: If all tokens share a prefix (like & for HTML entities), exclude it and check manually
  3. Experiment with key_length: Test different values for your specific dataset

Usage in WordPress

Primary use case is HTML named character reference decoding in the HTML API:

php
// In wp-includes/html-api/html5-named-character-references.php
return WP_Token_Map::from_precomputed_table( array(
    // ... precomputed HTML entity mappings
) );