WP_Token_Map
High-performance data structure for string key-to-value lookups with optimized memory layout.
Source: wp-includes/class-wp-token-map.php
Since: 6.6.0
Overview
WP_Token_Map provides efficient string token lookup and transformation. It’s optimized for static datasets like HTML named character references, smilies, or other fixed mappings.
The class uses a packed binary format internally to improve cache locality and reduce memory access during lookups. A map can also be exported as a precomputed table and loaded directly, so production code avoids rebuilding it from an array on every request.
Constants
| Constant | Value | Description |
|---|---|---|
| STORAGE_VERSION | '6.6.0-trunk' | Format version for precomputed tables |
| MAX_LENGTH | 256 | Maximum bytes for keys and values |
Properties
| Property | Type | Visibility | Description |
|---|---|---|---|
| $key_length | int | private | Prefix length for grouping (default 2) |
| $large_words | array | private | Packed strings for tokens longer than key_length |
| $groups | string | private | Null-delimited group prefixes |
| $small_words | string | private | Packed short tokens |
| $small_mappings | string[] | private | Replacements for short tokens |
Methods
from_array()
Creates a token map from an associative array.
public static function from_array( array $mappings, int $key_length = 2 ): ?WP_Token_Map
| Parameter | Type | Description |
|---|---|---|
| $mappings | array | Key-value pairs (token → replacement) |
| $key_length | int | Group prefix length (default 2) |
Returns: WP_Token_Map on success, null if any token or replacement is too long to store (see MAX_LENGTH).
Example:
$smilies = WP_Token_Map::from_array( array(
    '8O' => '😯',
    ':(' => '🙁',
    ':)' => '🙂',
    ':?' => '😕',
) );
from_precomputed_table()
Creates a token map from precomputed data.
public static function from_precomputed_table( array $state ): ?WP_Token_Map
| Parameter | Type | Description |
|---|---|---|
| $state | array | Precomputed state array |
State Array Keys:
- storage_version: Must match STORAGE_VERSION
- key_length: Group prefix length
- groups: Null-delimited group prefixes
- large_words: Packed token strings
- small_words: Packed short tokens
- small_mappings: Short token replacements
Returns: WP_Token_Map or null on version mismatch/missing data.
contains()
Checks if a word exists as a lookup key.
public function contains( string $word, string $case_sensitivity = 'case-sensitive' ): bool
| Parameter | Type | Description |
|---|---|---|
| $word | string | Token to check |
| $case_sensitivity | string | 'case-sensitive' or 'ascii-case-insensitive' |
Returns: true if the token exists in the map, false otherwise.
Example:
$smilies->contains( ':)' ); // true
$smilies->contains( 'nope' ); // false
read_token()
Reads a token starting at the given offset and returns its mapping.
public function read_token(
    string $text,
    int $offset = 0,
    ?int &$matched_token_byte_length = null,
    string $case_sensitivity = 'case-sensitive'
): ?string
| Parameter | Type | Description |
|---|---|---|
| $text | string | Text to search in |
| $offset | int | Starting byte position |
| &$matched_token_byte_length | int\|null | Receives matched token length |
| $case_sensitivity | string | 'case-sensitive' or 'ascii-case-insensitive' |
Returns: Mapped value if token found, null otherwise.
Example:
$result = $smilies->read_token( 'Hello :) world', 6, $length );
// $result = '🙂'
// $length = 2
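Because read_token() reports how many bytes it matched, it lends itself to a left-to-right scan that replaces every token in a string. The following is a minimal sketch of that pattern, not code from WordPress: replace_all_tokens() is a hypothetical helper, and $map is assumed to be any object exposing read_token() with the signature shown above.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): replaces every token
 * found in $text with its mapped value, scanning left to right.
 *
 * @param object $map  Any object exposing read_token() as documented above,
 *                     for example a WP_Token_Map instance.
 * @param string $text Input text.
 * @return string Text with all matched tokens replaced.
 */
function replace_all_tokens( $map, string $text ): string {
    $output = '';
    $at     = 0;
    $end    = strlen( $text );

    while ( $at < $end ) {
        $length = null;
        $value  = $map->read_token( $text, $at, $length );

        if ( null === $value ) {
            // No token starts at this byte: copy it and move on.
            $output .= $text[ $at ];
            ++$at;
            continue;
        }

        // A token matched: emit its replacement and skip past the token.
        $output .= $value;
        $at     += $length;
    }

    return $output;
}
```

With the $smilies map from the earlier example, a call would look like replace_all_tokens( $smilies, 'Hello :) world' ).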
to_array()
Exports the map back to an associative array.
public function to_array(): array
Returns: Array of token => replacement pairs.
precomputed_php_source_table()
Generates PHP source code for precomputed loading.
public function precomputed_php_source_table( string $indent = "\t" ): string
| Parameter | Type | Description |
|---|---|---|
| $indent | string | Indentation string |
Returns: PHP code that can be pasted into a source file.
Example Output:
WP_Token_Map::from_precomputed_table(
    array(
        "storage_version" => "6.6.0-trunk",
        "key_length" => 2,
        "groups" => "",
        "large_words" => array(),
        "small_words" => "8O\x00:)\x00:(\x00:?\x00",
        "small_mappings" => array( "😯", "🙂", "🙁", "😕" )
    )
);
Internal Methods
read_small_token() (private)
Searches for short tokens (≤ key_length).
private function read_small_token(
    string $text,
    int $offset = 0,
    ?int &$matched_token_byte_length = null,
    string $case_sensitivity = 'case-sensitive'
): ?string
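The small_words string in the example output for precomputed_php_source_table() suggests that short tokens are stored as fixed-width records padded with null bytes. Under that assumption, a lookup could be sketched as follows. find_small_word() is a hypothetical function for illustration and may differ from what the private method actually does; the plain-text mappings stand in for the emoji.

```php
<?php
/**
 * Hypothetical sketch: finds $word in a packed string of fixed-width,
 * null-padded records and returns its record index, or null if absent.
 * Assumes each record is $key_length + 1 bytes wide.
 */
function find_small_word( string $small_words, string $word, int $key_length = 2 ): ?int {
    $record_width = $key_length + 1; // Word bytes plus at least one null byte.

    if ( '' === $word || strlen( $word ) > $key_length ) {
        return null;
    }

    $needle = str_pad( $word, $record_width, "\x00", STR_PAD_RIGHT );
    $total  = strlen( $small_words );

    // Step through the packed string one whole record at a time.
    for ( $at = 0; $at + $record_width <= $total; $at += $record_width ) {
        if ( substr( $small_words, $at, $record_width ) === $needle ) {
            return intdiv( $at, $record_width );
        }
    }

    return null;
}

$small_words    = "8O\x00:)\x00:(\x00:?\x00";
$small_mappings = array( 'wide-eyed', 'smile', 'frown', 'confused' );
$index          = find_small_word( $small_words, ':(' );
// $index is 2, so $small_mappings[ $index ] is 'frown'.
```

The record index doubles as the index into the replacements array, which is why the two structures must be kept in the same order.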
longest_first_then_alphabetical() (private static)
Comparison function ensuring longer matches take priority.
private static function longest_first_then_alphabetical( string $a, string $b ): int
Sorts by length descending, then alphabetically. This ordering ensures a longer token is tried before any shorter token that is a prefix of it.
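A comparator with this ordering can be sketched in a few lines. The function below is a hypothetical stand-in written for illustration; the private method in WP_Token_Map may differ in detail.

```php
<?php
/**
 * Hypothetical stand-in for a "longest first, then alphabetical"
 * comparator, suitable for usort().
 */
function longest_first_then_alphabetical_sketch( string $a, string $b ): int {
    if ( $a === $b ) {
        return 0;
    }

    $length_a = strlen( $a );
    $length_b = strlen( $b );

    // Longer words sort first so they are tried before their own prefixes.
    if ( $length_a !== $length_b ) {
        return $length_b - $length_a;
    }

    // Same length: fall back to byte-wise alphabetical order.
    return strcmp( $a, $b );
}

$words = array( 'not', 'notin;', 'no', 'notin' );
usort( $words, 'longest_first_then_alphabetical_sketch' );
// $words is now array( 'notin;', 'notin', 'not', 'no' ).
```

If the words were tried in the opposite order, a scan over "notin;" would stop at "no" and never reach the longer match.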
Architecture
Large vs. Small Words
Tokens are classified by length relative to key_length:
- Small words: Length ≤ key_length, stored in the packed $small_words string
- Large words: Length > key_length, grouped by prefix in the $large_words array
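The split can be illustrated with a short sketch. classify_tokens() is a hypothetical helper that only demonstrates the length rule and the prefix grouping; it returns plain nested arrays rather than the packed strings the class stores.

```php
<?php
/**
 * Hypothetical helper: splits tokens into "small" and "large" sets by
 * comparing their byte length against $key_length.
 */
function classify_tokens( array $mappings, int $key_length = 2 ): array {
    $small = array();
    $large = array();

    foreach ( $mappings as $token => $replacement ) {
        $token = (string) $token; // PHP casts numeric-looking keys to int.

        if ( strlen( $token ) <= $key_length ) {
            $small[ $token ] = $replacement;
        } else {
            // Large words are grouped under their first $key_length bytes.
            $prefix                     = substr( $token, 0, $key_length );
            $large[ $prefix ][ $token ] = $replacement;
        }
    }

    return array(
        'small' => $small,
        'large' => $large,
    );
}

$result = classify_tokens(
    array(
        ':)'  => 'smile',
        ':-)' => 'smile',
        ':-(' => 'frown',
        '8O'  => 'wide-eyed',
    )
);
// $result['small'] holds ':)' and '8O'.
// $result['large'][':-'] holds ':-)' and ':-('.
```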
Packed Format
Large words use a binary packed format:
┌────────────────┬───────────────┬─────────────────┬────────┐
│ Length of rest │ Rest of key   │ Length of value │ Value  │
│ of key (bytes) │               │ (bytes)         │        │
├────────────────┼───────────────┼─────────────────┼────────┤
│ 0x08           │ nterDot;      │ 0x02            │ ·      │
└────────────────┴───────────────┴─────────────────┴────────┘
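To make the row layout concrete, the sketch below packs and unpacks a single entry using one length byte per field. pack_entry() and unpack_entry() are hypothetical names for illustration only; the class builds and reads these strings internally.

```php
<?php
/**
 * Hypothetical sketch of one packed row: a length byte, the rest of the
 * key, a second length byte, then the value.
 */
function pack_entry( string $rest_of_key, string $value ): string {
    // Each length has to fit in a single byte.
    if ( strlen( $rest_of_key ) > 255 || strlen( $value ) > 255 ) {
        throw new InvalidArgumentException( 'Key remainder and value must each be at most 255 bytes.' );
    }

    return chr( strlen( $rest_of_key ) ) . $rest_of_key . chr( strlen( $value ) ) . $value;
}

/**
 * Reads one row starting at byte $at.
 * Returns array( rest of key, value, offset of the next row ).
 */
function unpack_entry( string $packed, int $at = 0 ): array {
    $rest_length  = ord( $packed[ $at ] );
    $rest_of_key  = substr( $packed, $at + 1, $rest_length );
    $at          += 1 + $rest_length;
    $value_length = ord( $packed[ $at ] );
    $value        = substr( $packed, $at + 1, $value_length );

    return array( $rest_of_key, $value, $at + 1 + $value_length );
}

// "\u{00B7}" is the middle dot, which takes two bytes in UTF-8.
$packed = pack_entry( 'nterDot;', "\u{00B7}" );
list( $rest, $value, $next ) = unpack_entry( $packed );
```

A single length byte can represent values from 0 to 255, which is consistent with the MAX_LENGTH limit on keys and values.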
Choosing Key Length
The key_length parameter affects performance:
- Too long: Many single-token groups, wasted overhead
- Too short: Large groups requiring linear search
For HTML character references (2,000+ entries), key_length = 2 provides good distribution. For smaller sets like smilies, key_length = 1 may be better.
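One way to compare candidate values is to count how many groups a dataset would produce and how large the biggest group would be. The sketch below does this with a hypothetical group_stats() helper and a tiny, made-up token list; it measures only the distribution, not actual lookup speed.

```php
<?php
/**
 * Hypothetical helper: reports how tokens would spread across prefix
 * groups for a given $key_length.
 */
function group_stats( array $tokens, int $key_length ): array {
    $sizes = array();

    foreach ( $tokens as $token ) {
        if ( strlen( $token ) <= $key_length ) {
            continue; // Small words are stored separately and not grouped.
        }

        $prefix           = substr( $token, 0, $key_length );
        $sizes[ $prefix ] = ( $sizes[ $prefix ] ?? 0 ) + 1;
    }

    return array(
        'groups'        => count( $sizes ),
        'largest_group' => $sizes ? max( $sizes ) : 0,
    );
}

// Illustrative sample only; a real test would use the full dataset.
$tokens = array( 'amp;', 'apos;', 'gt;', 'lt;', 'lambda;', 'larr;' );

foreach ( array( 1, 2, 3 ) as $key_length ) {
    $stats = group_stats( $tokens, $key_length );
    printf(
        "key_length=%d groups=%d largest_group=%d\n",
        $key_length,
        $stats['groups'],
        $stats['largest_group']
    );
}
```

Many groups holding a single token each point to a key_length that is too long, while a few very large groups point to one that is too short.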
Performance Tips
- Precompute for production: Use precomputed_php_source_table() to generate static code
- Strip common prefixes: If all tokens share a prefix (like & for HTML entities), exclude it and check manually
- Experiment with key_length: Test different values for your specific dataset
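The prefix-stripping tip can be sketched as a thin wrapper around read_token(). read_reference() is a hypothetical helper, and $map is assumed to be a token map whose keys omit the leading "&" (for example "amp;" rather than "&amp;").

```php
<?php
/**
 * Hypothetical helper: reads a named character reference at byte $at,
 * where the map's keys omit the "&" shared by every token.
 *
 * @param object   $map          Any object exposing read_token() as documented above.
 * @param string   $text         Text to search in.
 * @param int      $at           Byte offset where the reference should start.
 * @param int|null $match_length Receives the matched length, including the "&".
 * @return string|null Decoded value, or null if nothing matched.
 */
function read_reference( $map, string $text, int $at, ?int &$match_length = null ): ?string {
    // Check the shared prefix by hand instead of storing it in every key.
    if ( $at >= strlen( $text ) || '&' !== $text[ $at ] ) {
        return null;
    }

    $name_length = null;
    $value       = $map->read_token( $text, $at + 1, $name_length );

    if ( null === $value ) {
        return null;
    }

    // Account for the "&" that was skipped before the lookup.
    $match_length = 1 + $name_length;

    return $value;
}
```

Leaving the shared byte out of the keys means the group prefixes are built from bytes that actually differ between tokens.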
Usage in WordPress
The primary use case is decoding HTML named character references in the HTML API:
// In wp-includes/html-api/html5-named-character-references.php
global $html5_named_character_references;

$html5_named_character_references = WP_Token_Map::from_precomputed_table( array(
    // ... precomputed HTML entity mappings
) );