WP_HTML_Decoder
Decodes HTML character references in text content and attributes.
Source: wp-includes/html-api/class-wp-html-decoder.php
Since: 6.6.0
Methods
attribute_starts_with()
Checks if a raw attribute value starts with a given string after decoding.
public static function attribute_starts_with( $haystack, $search_text, $case_sensitivity = 'case-sensitive' ): bool
| Parameter | Type | Description |
|---|---|---|
| $haystack | string | Raw attribute value |
| $search_text | string | Plain string to match |
| $case_sensitivity | string | "case-sensitive" or "ascii-case-insensitive" |
Returns: true if the decoded attribute value starts with the search text, false otherwise.
This compares the decoded value, handling all HTML character reference encodings.
Example:
$value = 'http&colon;//wordpress.org/';
WP_HTML_Decoder::attribute_starts_with( $value, 'http:', 'ascii-case-insensitive' );
// Returns: true
WP_HTML_Decoder::attribute_starts_with( $value, 'https:', 'ascii-case-insensitive' );
// Returns: false
// Works with any encoding of the same character
$encoded = 'http&#x3A;';
WP_HTML_Decoder::attribute_starts_with( $encoded, 'http:' );
// Returns: true
decode_text_node()
Decodes a text node’s content.
public static function decode_text_node( $text ): string
| Parameter | Type | Description |
|---|---|---|
| $text | string | Raw text content |
Returns: Decoded UTF-8 string.
Use this for text between tags (DATA sections), not for attribute values.
Example:
$raw = '&#x93;&#x1f604;&#x94;';
$decoded = WP_HTML_Decoder::decode_text_node( $raw );
// Returns: "“😄”"
decode_attribute()
Decodes an attribute value.
public static function decode_attribute( $text ): string
| Parameter | Type | Description |
|---|---|---|
| $text | string | Raw attribute value |
Returns: Decoded UTF-8 string.
Attribute values have different decoding rules than text nodes. Use this for attribute values.
Example:
$raw = 'Eggs &amp; Milk';
$decoded = WP_HTML_Decoder::decode_attribute( $raw );
// Returns: "Eggs & Milk"
decode()
Low-level decoder for arbitrary HTML text.
public static function decode( $context, $text ): string
| Parameter | Type | Description |
|---|---|---|
| $context | string | "attribute" or "data" |
| $text | string | Raw text to decode |
Returns: Decoded UTF-8 string.
Example:
WP_HTML_Decoder::decode( 'data', '&copy;' );
// Returns: "©"
read_character_reference()
Reads a character reference at a specific position.
public static function read_character_reference( $context, $text, $at = 0, &$match_byte_length = null )
| Parameter | Type | Description |
|---|---|---|
| $context | string | "attribute" or "data" |
| $text | string | Text containing reference |
| $at | int | Byte offset to start reading |
| &$match_byte_length | int | Set to length of matched reference |
Returns: Decoded character or null if no reference found.
Example:
$text = 'Ships&hellip;';
// No reference at position 0
$result = WP_HTML_Decoder::read_character_reference( 'attribute', $text, 0 );
// Returns: null
// Reference at position 5
$result = WP_HTML_Decoder::read_character_reference( 'attribute', $text, 5, $length );
// Returns: "…"
// $length: 8 (byte length of "&hellip;")
code_point_to_utf8_bytes()
Converts a Unicode code point to UTF-8 bytes.
public static function code_point_to_utf8_bytes( $code_point ): string
| Parameter | Type | Description |
|---|---|---|
| $code_point | int | Unicode code point |
Returns: UTF-8 encoded character or "οΏ½" for invalid code points.
Example:
WP_HTML_Decoder::code_point_to_utf8_bytes( 0x1f170 );
// Returns: "🅰"
// Invalid code point (half of surrogate pair)
WP_HTML_Decoder::code_point_to_utf8_bytes( 0xd83c );
// Returns: "οΏ½"
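The conversion follows the standard UTF-8 bit layout. Below is a minimal standalone sketch of that layout, not the WordPress implementation; the function name `sketch_code_point_to_utf8` is hypothetical and exists only for illustration.

```php
<?php
/**
 * Illustrative only: encode a Unicode code point as UTF-8 bytes.
 * Hypothetical helper, not part of WordPress.
 */
function sketch_code_point_to_utf8( int $code_point ): string {
    // Surrogate halves and values outside the Unicode range are invalid.
    $is_surrogate = $code_point >= 0xD800 && $code_point <= 0xDFFF;
    if ( $code_point < 0 || $code_point > 0x10FFFF || $is_surrogate ) {
        return "\u{FFFD}"; // U+FFFD REPLACEMENT CHARACTER.
    }

    if ( $code_point < 0x80 ) {
        // 1 byte: 0xxxxxxx
        return chr( $code_point );
    }

    if ( $code_point < 0x800 ) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr( 0xC0 | ( $code_point >> 6 ) )
            . chr( 0x80 | ( $code_point & 0x3F ) );
    }

    if ( $code_point < 0x10000 ) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr( 0xE0 | ( $code_point >> 12 ) )
            . chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
            . chr( 0x80 | ( $code_point & 0x3F ) );
    }

    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr( 0xF0 | ( $code_point >> 18 ) )
        . chr( 0x80 | ( ( $code_point >> 12 ) & 0x3F ) )
        . chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
        . chr( 0x80 | ( $code_point & 0x3F ) );
}

echo bin2hex( sketch_code_point_to_utf8( 0x1F170 ) ), "\n"; // f09f85b0
echo bin2hex( sketch_code_point_to_utf8( 0xD83C ) ), "\n";  // efbfbd
```

The hex output makes the byte sequence visible regardless of terminal encoding: `f09f85b0` is "🅰" and `efbfbd` is U+FFFD.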
Character Reference Types
Named References
&amp; → &
&lt; → <
&gt; → >
&quot; → "
&copy; → ©
&hellip; → …
Numeric References (Decimal)
&#38; → &
&#60; → <
&#169; → ©
&#128512; → 😀
Numeric References (Hexadecimal)
&#x26; → &
&#x3C; → <
&#xA9; → ©
&#x1F600; → 😀
Context Differences
Attribute Context
In attributes, ambiguous references (not ending in ;) that are followed by alphanumeric characters or = are NOT decoded:
// "&not" followed by "in" is ambiguous in attributes
WP_HTML_Decoder::decode( 'attribute', '&notin' );
// Returns: "&notin" (unchanged)
WP_HTML_Decoder::decode( 'attribute', '&notin;' );
// Returns: "∉" (decoded)
Data Context
In text content, ambiguous references are decoded:
// "&not" is decoded even without a trailing ";"
WP_HTML_Decoder::decode( 'data', '&notin' );
// Returns: "¬in" (¬ = "not")
WP_HTML_Decoder::decode( 'data', '&notin;' );
// Returns: "∉" (decoded as single character)
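The attribute-context rule can be sketched as a small predicate. This is a standalone illustration of the rule as described above, not WordPress code; `sketch_attribute_blocks_decoding` and its `$after_name` parameter (the byte offset just past the matched reference name) are hypothetical names.

```php
<?php
/**
 * Illustrative only: would the attribute context leave this reference undecoded?
 * Hypothetical helper, not part of WordPress.
 *
 * @param string $text       Raw attribute value.
 * @param int    $after_name Byte offset just past the matched reference.
 */
function sketch_attribute_blocks_decoding( string $text, int $after_name ): bool {
    // A reference that ends with ";" is never ambiguous.
    if ( $after_name > 0 && ';' === $text[ $after_name - 1 ] ) {
        return false;
    }

    // Nothing follows the reference, so nothing makes it ambiguous.
    if ( $after_name >= strlen( $text ) ) {
        return false;
    }

    // Ambiguous when followed by an ASCII alphanumeric or "=".
    return 1 === preg_match( '/[A-Za-z0-9=]/', $text[ $after_name ] );
}

var_export( sketch_attribute_blocks_decoding( '&notin', 4 ) );  // true: "&not" then "i"
echo "\n";
var_export( sketch_attribute_blocks_decoding( '&not;in', 5 ) ); // false: ends in ";"
echo "\n";
var_export( sketch_attribute_blocks_decoding( '&not=1', 4 ) );  // true: "&not" then "="
echo "\n";
```

The rule exists so that query strings in URLs such as `?a=1&not=2` survive attribute decoding intact.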
C1 Control Character Mapping
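Numeric references to code points in the C1 control range (0x80-0x9F) are remapped as if the number were a Windows-1252 byte value. Only numeric references are remapped; raw bytes in the text are left alone.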
WP_HTML_Decoder::decode( 'data', '&#x80;' );
// Returns: "€" (Euro sign, not control character)
WP_HTML_Decoder::decode( 'data', '&#x93;' );
// Returns: "“" (left double quote)
This matches browser behavior for legacy compatibility.
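As a rough illustration of the remapping, the sketch below uses the replacement table from the HTML specification's handling of numeric character references. It is a standalone approximation, not the WordPress implementation, and `sketch_remap_c1` is a hypothetical name.

```php
<?php
/**
 * Illustrative only: remap a C1-range code point from a numeric reference
 * to its Windows-1252 equivalent. Hypothetical helper, not part of WordPress.
 */
function sketch_remap_c1( int $code_point ): int {
    // 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no replacement and pass through.
    static $map = array(
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    );

    // Anything without an entry, inside or outside the C1 range, is unchanged.
    return $map[ $code_point ] ?? $code_point;
}

printf( "U+%04X\n", sketch_remap_c1( 0x80 ) ); // U+20AC (Euro sign)
printf( "U+%04X\n", sketch_remap_c1( 0x93 ) ); // U+201C (left double quote)
printf( "U+%04X\n", sketch_remap_c1( 0x81 ) ); // U+0081 (no replacement)
```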