WP_HTML_Decoder

Decodes HTML character references in text content and attributes.

Source: wp-includes/html-api/class-wp-html-decoder.php
Since: 6.6.0


Methods

attribute_starts_with()

Checks if a raw attribute value starts with a given string after decoding.

public static function attribute_starts_with( $haystack, $search_text, $case_sensitivity = 'case-sensitive' ): bool
Parameter          Type    Description
$haystack          string  Raw attribute value
$search_text       string  Plain string to match
$case_sensitivity  string  "case-sensitive" or "ascii-case-insensitive"

Returns: true if attribute starts with the search text.

This compares the decoded value, handling all HTML character reference encodings.

Example:

$value = 'http://wordpress.org/';

WP_HTML_Decoder::attribute_starts_with( $value, 'http:', 'ascii-case-insensitive' );
// Returns: true

WP_HTML_Decoder::attribute_starts_with( $value, 'https:', 'ascii-case-insensitive' );
// Returns: false

// Also matches when the prefix is written with character references
$encoded = 'http&colon;';
WP_HTML_Decoder::attribute_starts_with( $encoded, 'http:' );
// Returns: true
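
To see why the comparison has to run against the decoded value, the following sketch contrasts a raw byte comparison with a decode-first comparison. It is plain PHP that runs without WordPress. Both naive_starts_with() and decoded_starts_with() are hypothetical helpers, and PHP's built-in html_entity_decode() is only a stand-in decoder here, so its handling of edge cases may differ from WP_HTML_Decoder.

```php
<?php
// Hypothetical helpers for illustration; neither is part of WordPress.

// Naive check: compares the raw, still-encoded attribute bytes.
function naive_starts_with( string $raw_value, string $prefix ): bool {
	return 0 === strncasecmp( $raw_value, $prefix, strlen( $prefix ) );
}

// Decode-first check: html_entity_decode() is a stand-in decoder (assumption).
function decoded_starts_with( string $raw_value, string $prefix ): bool {
	$decoded = html_entity_decode( $raw_value, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
	return 0 === strncasecmp( $decoded, $prefix, strlen( $prefix ) );
}

// "&#x6A;" is the hexadecimal reference for "j".
$raw = '&#x6A;avascript:alert(1)';

var_dump( naive_starts_with( $raw, 'javascript:' ) );   // bool(false)
var_dump( decoded_starts_with( $raw, 'javascript:' ) ); // bool(true)
```

The raw comparison misses the encoded prefix, which is the kind of case a decoding-aware prefix check is meant to catch.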

decode_text_node()

Decodes a text node’s content.

public static function decode_text_node( $text ): string
Parameter  Type    Description
$text      string  Raw text content

Returns: Decoded UTF-8 string.

Use this for text between tags (DATA sections), not for attribute values.

Example:

$raw = '&#x201C;&#x1F604;&#x201D;';
$decoded = WP_HTML_Decoder::decode_text_node( $raw );
// Returns: "“😄”"

decode_attribute()

Decodes an attribute value.

public static function decode_attribute( $text ): string
Parameter  Type    Description
$text      string  Raw attribute value

Returns: Decoded UTF-8 string.

Attribute values have different decoding rules than text nodes. Use this for attribute values.

Example:

$raw = 'Eggs &amp; Milk';
$decoded = WP_HTML_Decoder::decode_attribute( $raw );
// Returns: "Eggs & Milk"

decode()

Low-level decoder for arbitrary HTML text.

public static function decode( $context, $text ): string
Parameter  Type    Description
$context   string  "attribute" or "data"
$text      string  Raw text to decode

Returns: Decoded UTF-8 string.

Example:

WP_HTML_Decoder::decode( 'data', '&copy;' );
// Returns: "©"

read_character_reference()

Reads a character reference at a specific position.

public static function read_character_reference( $context, $text, $at = 0, &$match_byte_length = null )
Parameter            Type    Description
$context             string  "attribute" or "data"
$text                string  Text containing the reference
$at                  int     Byte offset at which to start reading
&$match_byte_length  int     Set to the byte length of the matched reference

Returns: Decoded character or null if no reference found.

Example:

$text = 'Ships&hellip;';

// No reference at byte offset 0
$result = WP_HTML_Decoder::read_character_reference( 'attribute', $text, 0 );
// Returns: null

// Reference at byte offset 5
$result = WP_HTML_Decoder::read_character_reference( 'attribute', $text, 5, $length );
// Returns: "…"
// $length: 8 (byte length of "&hellip;")
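
The byte offset and the match length are meant to be used together when walking a string. The loop below is a minimal sketch of that pattern, not the WordPress implementation: sketch_decode_with() is a hypothetical function, and $stub_reader is a stand-in that only recognizes "&hellip;" so the example runs without WordPress. It assumes the reader returns null when nothing matches, as described above.

```php
<?php
// Hypothetical driver loop; $reader is assumed to behave like
// WP_HTML_Decoder::read_character_reference() as documented above.
function sketch_decode_with( callable $reader, string $context, string $text ): string {
	$decoded = '';
	$at      = 0;
	$end     = strlen( $text );

	while ( $at < $end ) {
		$next_amp = strpos( $text, '&', $at );
		if ( false === $next_amp ) {
			break; // No more candidates; the tail is appended after the loop.
		}

		// Copy the plain bytes that precede the ampersand.
		$decoded .= substr( $text, $at, $next_amp - $at );

		$length    = null;
		$character = $reader( $context, $text, $next_amp, $length );
		if ( null === $character ) {
			// Not a reference: keep the "&" literally and move one byte on.
			$decoded .= '&';
			$at       = $next_amp + 1;
			continue;
		}

		// A reference matched: emit it and skip its raw byte length.
		$decoded .= $character;
		$at       = $next_amp + $length;
	}

	return $decoded . substr( $text, $at );
}

// Stand-in reader that only knows "&hellip;" (assumption, for illustration).
$stub_reader = function ( $context, $text, $at = 0, &$match_byte_length = null ) {
	if ( 0 === substr_compare( $text, '&hellip;', $at, 8 ) ) {
		$match_byte_length = 8;
		return "\u{2026}";
	}
	return null;
};

echo sketch_decode_with( $stub_reader, 'attribute', 'Ships&hellip; & more' ), "\n";
// Prints: Ships… & more
```

With WordPress 6.6 or later loaded, the callable array( 'WP_HTML_Decoder', 'read_character_reference' ) could presumably be passed as $reader in place of the stub.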

code_point_to_utf8_bytes()

Converts a Unicode code point to UTF-8 bytes.

public static function code_point_to_utf8_bytes( $code_point ): string
Parameter Type Description
$code_point int Unicode code point

Returns: UTF-8 encoded character or "οΏ½" for invalid code points.

Example:

WP_HTML_Decoder::code_point_to_utf8_bytes( 0x1f170 );
// Returns: "πŸ…°"

// Invalid code point (half of surrogate pair)
WP_HTML_Decoder::code_point_to_utf8_bytes( 0xd83c );
// Returns: "οΏ½"
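
For readers curious about what such a conversion involves, here is a minimal sketch of standard UTF-8 encoding in plain PHP. The function sketch_code_point_to_utf8() is hypothetical and is not the WordPress source. It rejects surrogate halves and out-of-range values, and its treatment of other edge cases may differ from the real method.

```php
<?php
// Hypothetical sketch of UTF-8 encoding; not the WordPress implementation.
function sketch_code_point_to_utf8( int $code_point ): string {
	// Surrogate halves and out-of-range values are not valid scalar values.
	if (
		$code_point < 0 ||
		$code_point > 0x10FFFF ||
		( $code_point >= 0xD800 && $code_point <= 0xDFFF )
	) {
		return "\u{FFFD}";
	}

	// One byte: 0xxxxxxx
	if ( $code_point <= 0x7F ) {
		return chr( $code_point );
	}

	// Two bytes: 110xxxxx 10xxxxxx
	if ( $code_point <= 0x7FF ) {
		return chr( 0xC0 | ( $code_point >> 6 ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}

	// Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
	if ( $code_point <= 0xFFFF ) {
		return chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}

	// Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
	return chr( 0xF0 | ( $code_point >> 18 ) )
		. chr( 0x80 | ( ( $code_point >> 12 ) & 0x3F ) )
		. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
		. chr( 0x80 | ( $code_point & 0x3F ) );
}

echo bin2hex( sketch_code_point_to_utf8( 0x1F170 ) ), "\n"; // f09f85b0
echo bin2hex( sketch_code_point_to_utf8( 0xD83C ) ), "\n";  // efbfbd
```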

Character Reference Types

Named References

&amp;     → &
&lt;      → <
&gt;      → >
&quot;    → "
&copy;    → ©
&hellip;  → …

Numeric References (Decimal)

&#38;     → &
&#60;     → <
&#169;    → ©
&#128512; → 😀

Numeric References (Hexadecimal)

&#x26;    → &
&#x3C;    → <
&#xA9;    → ©
&#x1F600; → 😀

Context Differences

Attribute Context

In attributes, ambiguous references (not ending in ;) that are followed by alphanumeric characters or = are NOT decoded:

// "&not" followed by "in" is ambiguous in attributes
WP_HTML_Decoder::decode( 'attribute', '&notin' );
// Returns: "&notin" (unchanged)

WP_HTML_Decoder::decode( 'attribute', '&notin;' );
// Returns: "∉" (decoded)
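
The attribute rule can be expressed as a check on the single byte that follows a matched name lacking its semicolon. The helper below is a hypothetical sketch of that check only, in plain PHP. It is not taken from WordPress, and it assumes "alphanumeric" means ASCII letters and digits.

```php
<?php
// Hypothetical helper; not part of WordPress.
// $next_byte is the byte right after a matched name that has no trailing ";"
// (pass an empty string when the name ends the input).
function sketch_is_ambiguous_in_attribute( string $next_byte ): bool {
	// Ambiguous when followed by an ASCII letter, a digit, or "=".
	return 1 === preg_match( '/^[A-Za-z0-9=]/', $next_byte );
}

// "&notin": the name "&not" is followed by "i", so it is left undecoded.
var_dump( sketch_is_ambiguous_in_attribute( 'i' ) ); // bool(true)

// "&not " (followed by a space) is not ambiguous.
var_dump( sketch_is_ambiguous_in_attribute( ' ' ) ); // bool(false)
```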

Data Context

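In text content, a reference that is missing its semicolon is still decoded, provided its name is one of the legacy names that may omit it (such as &not):

// "&not" is decoded even though it has no semicolon and is followed by "in"
WP_HTML_Decoder::decode( 'data', '&notin' );
// Returns: "¬in" (¬ is the "not" sign)

WP_HTML_Decoder::decode( 'data', '&notin;' );
// Returns: "∉" (decoded as a single character)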

C1 Control Character Mapping

Numeric references in the C1 control range (0x80-0x9F) are remapped as if encoded in Windows-1252:

WP_HTML_Decoder::decode( 'data', '&#x80;' );
// Returns: "€" (Euro sign, not control character)

WP_HTML_Decoder::decode( 'data', '&#x93;' );
// Returns: "“" (left double quotation mark)

This matches browser behavior for legacy compatibility.
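
As an illustration of the remapping, the sketch below looks up a handful of C1 code points in a deliberately partial Windows-1252 table. SKETCH_C1_TO_CP1252 and sketch_remap_c1() are hypothetical names, and the table covers only a few of the 32 values in the range. It is not the table WordPress uses.

```php
<?php
// Hypothetical, deliberately partial lookup table; not the WordPress source.
const SKETCH_C1_TO_CP1252 = array(
	0x80 => "\u{20AC}", // Euro sign
	0x85 => "\u{2026}", // Horizontal ellipsis
	0x91 => "\u{2018}", // Left single quotation mark
	0x92 => "\u{2019}", // Right single quotation mark
	0x93 => "\u{201C}", // Left double quotation mark
	0x94 => "\u{201D}", // Right double quotation mark
	0x99 => "\u{2122}", // Trade mark sign
);

// Returns the remapped character, or null when the code point
// is not in this partial table.
function sketch_remap_c1( int $code_point ): ?string {
	return SKETCH_C1_TO_CP1252[ $code_point ] ?? null;
}

echo sketch_remap_c1( 0x80 ), "\n"; // €
echo sketch_remap_c1( 0x93 ), "\n"; // “
```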