WP_Token_Map

High-performance data structure for string key-to-value lookups with optimized memory layout.

Source: wp-includes/class-wp-token-map.php
Since: 6.6.0

Overview

WP_Token_Map provides efficient string token lookup and transformation. It’s optimized for static datasets like HTML named character references, smilies, or other fixed mappings.

The class uses a packed binary format internally to improve cache locality and reduce memory access during lookups. It also supports precomputation: a table generated ahead of time can be loaded directly from PHP source, so the map does not need to be rebuilt on each request.

Constants

Constant         Value          Description
STORAGE_VERSION  '6.6.0-trunk'  Format version for precomputed tables
MAX_LENGTH       256            Maximum bytes for keys and values

Properties

Property         Type      Visibility  Description
$key_length      int       private     Prefix length for grouping (default 2)
$large_words     array     private     Packed strings for tokens longer than key_length
$groups          string    private     Null-delimited group prefixes
$small_words     string    private     Packed short tokens
$small_mappings  string[]  private     Replacements for short tokens

Methods

from_array()

Creates a token map from an associative array.

public static function from_array( array $mappings, int $key_length = 2 ): ?WP_Token_Map

Parameter    Type   Description
$mappings    array  Key-value pairs (token → replacement)
$key_length  int    Group prefix length (default 2)

Returns: WP_Token_Map on success, or null if any token or replacement is too long (byte lengths must stay below MAX_LENGTH).

Example:

$smilies = WP_Token_Map::from_array( array(
    '8O' => '😯',
    ':(' => '🙁',
    ':)' => '🙂',
    ':?' => '😕',
) );

from_precomputed_table()

Creates a token map from precomputed data.

public static function from_precomputed_table( array $state ): ?WP_Token_Map

Parameter  Type   Description
$state     array  Precomputed state array

State Array Keys:

  • storage_version: Must match STORAGE_VERSION
  • key_length: Group prefix length
  • groups: Null-delimited group prefixes
  • large_words: Packed token strings
  • small_words: Packed short tokens
  • small_mappings: Short token replacements

Returns: WP_Token_Map or null on version mismatch/missing data.
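
The checks described above (all keys present, matching storage version) can be sketched in isolation. This is only an illustration: sketch_is_loadable_state() is a hypothetical helper and not part of WordPress, and the real method may validate more or in a different order.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress: mirrors the documented
 * reasons for from_precomputed_table() to return null.
 */
function sketch_is_loadable_state( array $state, string $expected_version ): bool {
	$required_keys = array(
		'storage_version',
		'key_length',
		'groups',
		'large_words',
		'small_words',
		'small_mappings',
	);

	foreach ( $required_keys as $key ) {
		if ( ! array_key_exists( $key, $state ) ) {
			return false; // Missing data.
		}
	}

	// A table built for another storage format must not be loaded.
	return $state['storage_version'] === $expected_version;
}

$state = array(
	'storage_version' => '6.6.0-trunk',
	'key_length'      => 2,
	'groups'          => '',
	'large_words'     => array(),
	'small_words'     => "8O\x00:)\x00",
	'small_mappings'  => array( '😯', '🙂' ),
);

var_dump( sketch_is_loadable_state( $state, '6.6.0-trunk' ) );
```

Checking the version first matters because the packed layout may change between releases, and a stale table would otherwise be read with the wrong layout.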


contains()

Checks if a word exists as a lookup key.

public function contains( string $word, string $case_sensitivity = 'case-sensitive' ): bool

Parameter          Type    Description
$word              string  Token to check
$case_sensitivity  string  'case-sensitive' or 'ascii-case-insensitive'

Returns: true if the token exists in the map, false otherwise.

Example:

$smilies->contains( ':)' );  // true
$smilies->contains( 'nope' ); // false

read_token()

Reads a token starting at the given offset and returns its mapping.

public function read_token(
    string $text,
    int $offset = 0,
    ?int &$matched_token_byte_length = null,
    string $case_sensitivity = 'case-sensitive'
): ?string

Parameter                    Type      Description
$text                        string    Text to search in
$offset                      int       Starting byte position
&$matched_token_byte_length  int|null  Receives matched token length
$case_sensitivity            string    'case-sensitive' or 'ascii-case-insensitive'

Returns: Mapped value if token found, null otherwise.

Example:

$result = $smilies->read_token( 'Hello :) world', 6, $length );
// $result = '🙂'
// $length = 2
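
Because read_token() only inspects a single offset, replacing every token in a string needs a scanning loop around it. The following is a minimal sketch of such a loop. sketch_replace_tokens() is a hypothetical helper, and $map is assumed to be any object exposing read_token() with the signature documented above.

```php
<?php
/**
 * Hypothetical helper: replaces every token found in $text with its mapping.
 * $map is assumed to expose read_token() as documented above.
 */
function sketch_replace_tokens( $map, string $text ): string {
	$output = '';
	$at     = 0;
	$end    = strlen( $text );

	while ( $at < $end ) {
		$token_length = null;
		$replacement  = $map->read_token( $text, $at, $token_length );

		if ( null === $replacement ) {
			// No token starts here: copy one byte and move on.
			$output .= $text[ $at ];
			++$at;
			continue;
		}

		// Emit the mapped value and skip past the matched token.
		$output .= $replacement;
		$at     += $token_length;
	}

	return $output;
}

// Usage, only when WordPress has loaded the class.
if ( class_exists( 'WP_Token_Map' ) ) {
	$smilies = WP_Token_Map::from_array( array( ':)' => '🙂', ':(' => '🙁' ) );
	echo sketch_replace_tokens( $smilies, 'Hello :) world' ), "\n";
}
```

Copying unmatched text byte by byte keeps the sketch short. A production loop would more likely jump ahead with strpos() or strcspn() to the next byte that could start a token.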

to_array()

Exports the map back to an associative array.

public function to_array(): array

Returns: Array of token => replacement pairs.


precomputed_php_source_table()

Generates PHP source code for precomputed loading.

public function precomputed_php_source_table( string $indent = "\t" ): string

Parameter  Type    Description
$indent    string  Indentation string (default: a tab character)

Returns: PHP code that can be pasted into a source file.

Example Output:

WP_Token_Map::from_precomputed_table(
    array(
        "storage_version" => "6.6.0-trunk",
        "key_length" => 2,
        "groups" => "",
        "large_words" => array(),
        "small_words" => "8O\x00:)\x00:(\x00:?\x00",
        "small_mappings" => array( "😯", "🙂", "🙁", "😕" )
    )
);

Internal Methods

read_small_token() (private)

Searches for short tokens (≤ key_length).

private function read_small_token(
    string $text,
    int $offset = 0,
    ?int &$matched_token_byte_length = null,
    string $case_sensitivity = 'case-sensitive'
): ?string
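
The method is private, so the sketch below is not its implementation. It only illustrates one way a short-token lookup could work, under an assumption taken from the example output of precomputed_php_source_table(): each short token occupies key_length + 1 bytes in $small_words, padded with NUL bytes, and $small_mappings holds the replacements in the same order. The function name and parameters are hypothetical.

```php
<?php
/**
 * Hypothetical stand-alone lookup; the real private method may differ.
 * Assumes NUL-padded records of $key_length + 1 bytes each.
 */
function sketch_read_small_token(
	string $small_words,
	array $small_mappings,
	int $key_length,
	string $text,
	int $offset = 0,
	?int &$matched_length = null
): ?string {
	$stride = $key_length + 1;
	$count  = intdiv( strlen( $small_words ), $stride );

	for ( $i = 0; $i < $count; $i++ ) {
		// Strip the NUL padding to recover the stored token.
		$token  = rtrim( substr( $small_words, $i * $stride, $stride ), "\x00" );
		$length = strlen( $token );

		if (
			$length > 0 &&
			$offset + $length <= strlen( $text ) &&
			0 === substr_compare( $text, $token, $offset, $length )
		) {
			$matched_length = $length;
			return $small_mappings[ $i ]; // Same index as the token record.
		}
	}

	return null;
}

$small_words = "8O\x00:)\x00:(\x00:?\x00";
$mappings    = array( '😯', '🙂', '🙁', '😕' );
echo sketch_read_small_token( $small_words, $mappings, 2, 'Oh no :(', 6 ), "\n";
```

The sketch returns the first record that matches, so it relies on the stored order putting longer tokens before shorter ones that share a prefix.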

longest_first_then_alphabetical() (private static)

Comparison function ensuring longer matches take priority.

private static function longest_first_then_alphabetical( string $a, string $b ): int

Sorts by length descending, then alphabetically. This prevents a shorter token from masking a longer token that starts with it.
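
A comparator with this ordering can be sketched as follows. The function below is a stand-alone illustration with a hypothetical name. It follows the description above rather than reproducing the private method.

```php
<?php
/**
 * Illustrative comparator: longer strings sort first, ties are broken
 * by byte-wise (strcmp) order.
 */
function sketch_longest_first_then_alphabetical( string $a, string $b ): int {
	$length_a = strlen( $a );
	$length_b = strlen( $b );

	if ( $length_a !== $length_b ) {
		// Reversed operands: the longer string compares as "smaller".
		return $length_b <=> $length_a;
	}

	return strcmp( $a, $b );
}

$tokens = array( 'amp', 'amp;', 'a', 'AMP;' );
usort( $tokens, 'sketch_longest_first_then_alphabetical' );
print_r( $tokens ); // 'AMP;', 'amp;', 'amp', 'a'
```

With this order a linear scan that stops at the first match reaches 'amp;' before 'amp', so the longer token wins whenever both would match.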


Architecture

Large vs. Small Words

Tokens are classified by length relative to key_length:

  • Small words: length ≤ key_length, stored in the packed $small_words string
  • Large words: length > key_length, grouped by prefix in the $large_words array
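
The classification can be illustrated with a small stand-alone function. This is a sketch only: sketch_partition_tokens() is hypothetical and returns plain nested arrays, whereas the class stores both parts in packed strings.

```php
<?php
/**
 * Hypothetical helper: splits mappings into small words and
 * prefix-grouped large words, following the rule described above.
 */
function sketch_partition_tokens( array $mappings, int $key_length = 2 ): array {
	$small = array();
	$large = array();

	foreach ( $mappings as $token => $replacement ) {
		$token = (string) $token; // Numeric-looking keys arrive as integers.

		if ( strlen( $token ) <= $key_length ) {
			$small[ $token ] = $replacement;
			continue;
		}

		// The first $key_length bytes select the group,
		// the remainder identifies the token inside it.
		$prefix = substr( $token, 0, $key_length );
		$rest   = substr( $token, $key_length );

		$large[ $prefix ][ $rest ] = $replacement;
	}

	return array(
		'small' => $small,
		'large' => $large,
	);
}

$parts = sketch_partition_tokens(
	array(
		'8O'    => '😯',
		'amp;'  => '&',
		'apos;' => "'",
		'AMP;'  => '&',
	),
	2
);
print_r( $parts );
```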

Packed Format

Large words use a binary packed format:

┌────────────────┬───────────────┬─────────────────┬────────┐
│ Length of rest │ Rest of key   │ Length of value │ Value  │
│ of key (bytes) │               │ (bytes)         │        │
├────────────────┼───────────────┼─────────────────┼────────┤
│ 0x08           │ nterDot;      │ 0x02            │ ·      │
└────────────────┴───────────────┴─────────────────┴────────┘
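
The record layout in the table can be reproduced with a few lines of PHP. This is a sketch under the assumption that each length is stored as a single byte, which is consistent with lengths being bounded by MAX_LENGTH. Both function names are hypothetical.

```php
<?php
/**
 * Hypothetical encoder for one record:
 * [length of rest][rest of key][length of value][value].
 */
function sketch_pack_record( string $rest_of_key, string $value ): string {
	// chr() stores each length in one byte, so both parts must be under 256 bytes.
	return chr( strlen( $rest_of_key ) ) . $rest_of_key . chr( strlen( $value ) ) . $value;
}

/**
 * Hypothetical decoder: walks a string of back-to-back records.
 */
function sketch_unpack_records( string $packed ): array {
	$records = array();
	$at      = 0;
	$end     = strlen( $packed );

	while ( $at < $end ) {
		$rest_length = ord( $packed[ $at++ ] );
		$rest        = substr( $packed, $at, $rest_length );
		$at         += $rest_length;

		$value_length = ord( $packed[ $at++ ] );
		$value        = substr( $packed, $at, $value_length );
		$at          += $value_length;

		$records[ $rest ] = $value;
	}

	return $records;
}

// The row from the table: 8 bytes of key remainder, then a 2-byte UTF-8 value.
$packed = sketch_pack_record( 'nterDot;', '·' );
echo bin2hex( $packed ), "\n";
print_r( sketch_unpack_records( $packed . sketch_pack_record( 'dil;', '¸' ) ) );
```

Storing lengths inline lets a reader skip from one record to the next without a separate index, so a whole group stays in one contiguous string.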

Choosing Key Length

The key_length parameter affects performance:

  • Too long: Many single-token groups, wasted overhead
  • Too short: Large groups requiring linear search

For HTML character references (2,000+ entries), key_length = 2 provides good distribution. For smaller sets like smilies, key_length = 1 may be better.
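
One way to compare candidate values is to count how many tokens land in each prefix group. The helper below is a hypothetical measuring aid, not part of the class, and the token list is a made-up sample rather than real profiling data.

```php
<?php
/**
 * Hypothetical helper: counts large words per prefix group
 * for a given key length. Small words are not grouped.
 */
function sketch_group_sizes( array $tokens, int $key_length ): array {
	$sizes = array();

	foreach ( $tokens as $token ) {
		if ( strlen( $token ) <= $key_length ) {
			continue; // Would be stored as a small word.
		}

		$prefix           = substr( $token, 0, $key_length );
		$sizes[ $prefix ] = ( $sizes[ $prefix ] ?? 0 ) + 1;
	}

	return $sizes;
}

// Sample tokens, chosen only for illustration.
$tokens = array( 'amp;', 'apos;', 'aacute;', 'lt;', 'le;', 'gt;' );

foreach ( array( 1, 2, 3 ) as $key_length ) {
	$sizes = sketch_group_sizes( $tokens, $key_length );
	printf(
		"key_length=%d groups=%d largest=%d\n",
		$key_length,
		count( $sizes ),
		$sizes ? max( $sizes ) : 0
	);
}
```

For this sample, key_length = 1 yields three groups with up to three tokens each, while key_length = 2 yields six single-token groups. That is the trade-off described above in miniature.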


Performance Tips

  1. Precompute for production: Use precomputed_php_source_table() to generate static code
  2. Strip common prefixes: If all tokens share a prefix (like & for HTML entities), exclude it and check manually
  3. Experiment with key_length: Test different values for your specific dataset
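
Tip 2 can be sketched as a thin wrapper that tests the shared prefix itself and only then consults the map. sketch_read_character_reference() is a hypothetical function. $map is assumed to hold names without the leading '&' (for example 'amp;') and to expose read_token() as documented above.

```php
<?php
/**
 * Hypothetical wrapper: checks the shared '&' prefix manually,
 * then looks up the remainder in a map that stores names without it.
 */
function sketch_read_character_reference( $map, string $text, int $offset, ?int &$matched_length = null ): ?string {
	// Cheap manual check before touching the map.
	if ( $offset < 0 || $offset >= strlen( $text ) || '&' !== $text[ $offset ] ) {
		return null;
	}

	$name_length = null;
	$value       = $map->read_token( $text, $offset + 1, $name_length );

	if ( null === $value ) {
		return null;
	}

	// Report the full length, including the '&' that the map never saw.
	$matched_length = 1 + $name_length;
	return $value;
}

// Usage, only when WordPress has loaded the class.
if ( class_exists( 'WP_Token_Map' ) ) {
	$entities = WP_Token_Map::from_array( array( 'amp;' => '&', 'lt;' => '<' ) );
	echo sketch_read_character_reference( $entities, 'a &lt; b', 2, $length ), "\n";
}
```

Dropping the prefix shortens every stored key by one byte and keeps the first key_length bytes of each key distinctive, which spreads tokens across more groups.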

Usage in WordPress

The primary use case is decoding HTML named character references in the HTML API:

// In wp-includes/html-api/html5-named-character-references.php
return WP_Token_Map::from_precomputed_table( array(
    // ... precomputed HTML entity mappings
) );