WP_HTML_Processor

Full HTML5 parser with tree construction, extending WP_HTML_Tag_Processor.

Source: wp-includes/html-api/class-wp-html-processor.php
Since: 6.4.0
Extends: WP_HTML_Tag_Processor

Constants

Bookmarks

Constant Value Description
MAX_BOOKMARKS 100 Maximum bookmarks (overrides parent’s 10)

Node Processing

Constant Value Description
PROCESS_NEXT_NODE Step to next node
PROCESS_CURRENT_NODE Reprocess current node

Static Methods

create_fragment()

Creates an HTML processor for parsing an HTML fragment.

public static function create_fragment( $html, $context = '<body>', $encoding = 'UTF-8' )
Parameter Type Description
$html string HTML fragment to process
$context string Context element (currently only <body> supported)
$encoding string Text encoding (currently only UTF-8 supported)

Returns: static|null — Processor instance or null on failure.

Example:

$processor = WP_HTML_Processor::create_fragment( '<div><p>Hello</p></div>' );
while ( $processor->next_tag( 'p' ) ) {
    $processor->add_class( 'text' );
}

create_full_parser()

Creates an HTML processor for parsing a complete HTML document.

public static function create_full_parser( $html, $known_definite_encoding = 'UTF-8' )
Parameter Type Description
$html string Complete HTML document
$known_definite_encoding string Document encoding

Returns: static|null — Processor instance or null on failure.


normalize()

Normalizes an HTML fragment to well-formed HTML.

public static function normalize( string $html ): ?string
Parameter Type Description
$html string HTML to normalize

Returns: Normalized HTML string or null if normalization fails.


is_special()

Determines if a tag name is a "special" element.

public static function is_special( $tag_name ): bool
Parameter Type Description
$tag_name string Tag name to check

Returns: true if the element is special.

Special elements have unique parsing rules (e.g., TABLE, TEMPLATE, SCRIPT).


is_void()

Determines if a tag name represents a void element.

public static function is_void( $tag_name ): bool
Parameter Type Description
$tag_name string Tag name to check

Returns: true if the element is void.

Void elements cannot have content (e.g., IMG, BR, INPUT).


Instance Methods

__construct()

Constructor. Do not use directly; use create_fragment() or create_full_parser().

public function __construct( $html, $use_the_static_create_methods_instead = null )

next_tag()

Finds the next tag matching the query, with structure-aware options.

public function next_tag( $query = null ): bool
Parameter Type Description
$query array|string|null Search criteria

Extended Query Options:

Key Type Description
breadcrumbs array Required ancestor path

Returns: true if tag found.

Example:

// Find IMG inside FIGURE
$processor->next_tag( array(
    'breadcrumbs' => array( 'FIGURE', 'IMG' ),
) );

// Find EM inside FIGCAPTION inside FIGURE
$processor->next_tag( array(
    'breadcrumbs' => array( 'FIGURE', 'FIGCAPTION', 'EM' ),
) );

next_token()

Advances to the next token in the HTML document.

public function next_token(): bool

Returns: true if token found.


step()

Runs one step of the tree construction algorithm.

public function step( $node_to_process = self::PROCESS_NEXT_NODE ): bool
Parameter Type Description
$node_to_process string Processing mode constant

Returns: true if step succeeded.


get_breadcrumbs()

Returns the stack of open element names from root to current.

public function get_breadcrumbs(): array

Returns: Array of tag names (e.g., ['HTML', 'BODY', 'DIV', 'P']).

Example:

// For '<html><body><div><p>text</p></div></body></html>'
// When positioned on the P tag:
$breadcrumbs = $processor->get_breadcrumbs();
// Returns: ['HTML', 'BODY', 'DIV', 'P']

matches_breadcrumbs()

Checks if the current position matches a breadcrumb pattern.

public function matches_breadcrumbs( $breadcrumbs ): bool
Parameter Type Description
$breadcrumbs array Expected ancestor path

Returns: true if breadcrumbs match.


get_current_depth()

Returns the nesting depth of the current element.

public function get_current_depth(): int

Returns: Nesting depth (0 = document root).


expects_closer()

Determines if a node expects a closing tag.

public function expects_closer( ?WP_HTML_Token $node = null ): ?bool
Parameter Type Description
$node WP_HTML_Token|null Node to check (current if null)

Returns: true if closer expected, false if void, null if unknown.


is_tag_closer()

Returns whether the current tag is a closing tag.

public function is_tag_closer(): bool

Returns: true for closing tags.


get_tag()

Returns the uppercase tag name for the current match.

public function get_tag(): ?string

Returns: Tag name or null.


get_namespace()

Returns the namespace of the current element.

public function get_namespace(): string

Returns: "html", "svg", or "math".


has_self_closing_flag()

Indicates if the current tag has the self-closing flag.

public function has_self_closing_flag(): bool

Returns: true if flag present.


get_token_name()

Returns the name of the current token.

public function get_token_name(): ?string

Returns: Token name or null.


get_token_type()

Returns the type of the current token.

public function get_token_type(): ?string

Returns: Token type or null.


get_attribute()

Returns the value of an attribute.

public function get_attribute( $name )
Parameter Type Description
$name string Attribute name

Returns: Attribute value, true for boolean, or null.


set_attribute()

Sets an attribute on the current tag.

public function set_attribute( $name, $value ): bool
Parameter Type Description
$name string Attribute name
$value string|bool Attribute value

Returns: true if set successfully.


remove_attribute()

Removes an attribute from the current tag.

public function remove_attribute( $name ): bool
Parameter Type Description
$name string Attribute name

Returns: true if removed.


get_attribute_names_with_prefix()

Returns attribute names starting with a prefix.

public function get_attribute_names_with_prefix( $prefix ): ?array
Parameter Type Description
$prefix string Attribute name prefix

Returns: Array of matching names or null.


add_class()

Adds a CSS class to the current tag.

public function add_class( $class_name ): bool
Parameter Type Description
$class_name string Class to add

Returns: true if added.


remove_class()

Removes a CSS class from the current tag.

public function remove_class( $class_name ): bool
Parameter Type Description
$class_name string Class to remove

Returns: true if removed.


has_class()

Checks if the current tag has a CSS class.

public function has_class( $wanted_class ): ?bool
Parameter Type Description
$wanted_class string Class to check

Returns: true, false, or null.


class_list()

Generator yielding all CSS classes.

public function class_list()

get_modifiable_text()

Returns the text content for the current token.

public function get_modifiable_text(): string

Returns: Decoded text content.


get_comment_type()

Returns the type of comment token.

public function get_comment_type(): ?string

Returns: Comment type constant or null.


serialize()

Serializes the processed HTML.

public function serialize(): ?string

Returns: Serialized HTML or null on failure.


serialize_token()

Serializes just the current token.

public function serialize_token(): string

Returns: Serialized token HTML.


set_bookmark()

Creates a named bookmark at the current position.

public function set_bookmark( $bookmark_name ): bool
Parameter Type Description
$bookmark_name string Bookmark name

Returns: true if set.


has_bookmark()

Checks if a bookmark exists.

public function has_bookmark( $bookmark_name ): bool
Parameter Type Description
$bookmark_name string Bookmark name

Returns: true if exists.


release_bookmark()

Releases a bookmark.

public function release_bookmark( $bookmark_name ): bool
Parameter Type Description
$bookmark_name string Bookmark name

Returns: true if released.


seek()

Moves to a bookmark position.

public function seek( $bookmark_name ): bool
Parameter Type Description
$bookmark_name string Bookmark name

Returns: true if seek succeeded.


get_last_error()

Returns the last error message.

public function get_last_error(): ?string

Returns: Error message or null.


get_unsupported_exception()

Returns the exception for unsupported markup.

public function get_unsupported_exception()

Returns: WP_HTML_Unsupported_Exception or null.


Insertion Mode Methods

These internal methods handle HTML5 tree construction. They are invoked by step().

Method Insertion Mode
step_initial() Initial
step_before_html() Before HTML
step_before_head() Before HEAD
step_in_head() In HEAD
step_in_head_noscript() In HEAD NOSCRIPT
step_after_head() After HEAD
step_in_body() In BODY
step_in_table() In TABLE
step_in_table_text() In TABLE text
step_in_caption() In CAPTION
step_in_column_group() In COLGROUP
step_in_table_body() In TBODY
step_in_row() In TR
step_in_cell() In TD/TH
step_in_select() In SELECT
step_in_select_in_table() In SELECT in TABLE
step_in_template() In TEMPLATE
step_after_body() After BODY
step_in_frameset() In FRAMESET
step_after_frameset() After FRAMESET
step_after_after_body() After after BODY
step_after_after_frameset() After after FRAMESET
step_in_foreign_content() In foreign content

Usage Examples

Structure-Aware Modification

$html = '<article><figure><img src="cat.jpg"><figcaption>A cat</figcaption></figure></article>';
$processor = WP_HTML_Processor::create_fragment( $html );

// Only match IMG inside FIGURE
while ( $processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
    $processor->add_class( 'figure-image' );
    $processor->set_attribute( 'loading', 'lazy' );
}

echo $processor->get_updated_html();

Walking the Document Tree

$processor = WP_HTML_Processor::create_fragment( $html );

while ( $processor->next_token() ) {
    $depth = $processor->get_current_depth();
    $breadcrumbs = $processor->get_breadcrumbs();
    $name = $processor->get_token_name();
    
    echo str_repeat( '  ', $depth ) . $name . "n";
}

Normalizing HTML

$messy = '<p>one<p>two</p><div>three';
$clean = WP_HTML_Processor::normalize( $messy );
// Returns properly nested HTML

Checking Element Nesting

$processor = WP_HTML_Processor::create_fragment( $html );

while ( $processor->next_tag( 'a' ) ) {
    // Check if link is inside a heading
    if ( $processor->matches_breadcrumbs( array( 'H1', 'A' ) ) ||
         $processor->matches_breadcrumbs( array( 'H2', 'A' ) ) ) {
        $processor->add_class( 'heading-link' );
    }
}