Core Filters Reference
Comprehensive reference for all WordPress filters used by Data Machine for service discovery, configuration, and data processing.
Service Discovery Filters
datamachine_handlers
Purpose: Register fetch, publish, and upsert handlers
Parameters:
$handlers (array) – Current handlers array
Return: Array of handler definitions
Handler Structure:
$handlers['handler_slug'] = [
'type' => 'fetch|publish|upsert',
'class' => 'HandlerClassName',
'label' => __('Human Readable Name', 'data-machine'),
'description' => __('Handler description', 'data-machine'),
'requires_auth' => true // Optional: Metadata flag for auth detection
];
Usage Example:
add_filter('datamachine_handlers', function($handlers) {
$handlers['twitter'] = [
'type' => 'publish',
'class' => 'DataMachine\Core\Steps\Publish\Handlers\Twitter\Twitter',
'label' => __('Twitter', 'data-machine'),
'description' => __('Post content to Twitter with media support', 'data-machine'),
'requires_auth' => true // Eliminates auth provider instantiation overhead
];
return $handlers;
});
Handler Metadata:
requires_auth (boolean): Optional metadata flag for performance optimization
- Eliminates auth provider instantiation during handler settings modal load
- Auth-enabled handlers: Twitter, Bluesky, Facebook, Threads, Google Sheets (publish & fetch), Reddit (fetch)
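Because every registry entry carries a type, a consumer can narrow the registry to a single handler type. The sketch below relies only on the entry shape documented above; the helper name dm_example_handlers_by_type() and the sample registry are hypothetical and not part of Data Machine.

```php
<?php
/**
 * Hypothetical helper: keep only the handlers registered with a given type.
 * Keys (handler slugs) are preserved.
 */
function dm_example_handlers_by_type(array $handlers, string $type): array
{
    return array_filter(
        $handlers,
        static function ($definition) use ($type) {
            return is_array($definition) && ($definition['type'] ?? null) === $type;
        }
    );
}

// Sample registry in the documented shape (values are illustrative only).
$handlers = [
    'twitter' => ['type' => 'publish', 'class' => 'Twitter', 'requires_auth' => true],
    'rss'     => ['type' => 'fetch', 'class' => 'RSS'],
];

$publish = dm_example_handlers_by_type($handlers, 'publish');
echo implode(', ', array_keys($publish)), PHP_EOL; // twitter
```

Inside WordPress the registry would normally come from apply_filters('datamachine_handlers', []) rather than a literal array.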
datamachine_step_types
Purpose: Register step types for pipeline execution
Parameters:
$steps (array) – Current steps array
Return: Array of step definitions
Step Structure:
$steps['step_type'] = [
'name' => __('Step Display Name', 'data-machine'),
'class' => 'StepClassName',
'position' => 50 // Display order
];
datamachine_get_oauth1_handler
Purpose: Service discovery for OAuth 1.0a handler
Parameters:
$handler (OAuth1Handler|null) – Current handler instance
Return: OAuth1Handler instance
Location: /inc/Core/OAuth/OAuth1Handler.php
Usage Example:
$oauth1 = apply_filters('datamachine_get_oauth1_handler', null);
$request_token = $oauth1->get_request_token($url, $key, $secret, $callback, 'twitter');
$auth_url = $oauth1->get_authorization_url($authorize_url, $oauth_token, 'twitter');
$result = $oauth1->handle_callback('twitter', $access_url, $key, $secret, $account_fn);
Methods:
- get_request_token() – Obtain OAuth request token (step 1)
- get_authorization_url() – Build authorization URL (step 2)
- handle_callback() – Complete OAuth flow (step 3)
Providers: Twitter
datamachine_get_oauth2_handler
Purpose: Service discovery for OAuth 2.0 handler
Parameters:
$handler (OAuth2Handler|null) – Current handler instance
Return: OAuth2Handler instance
Location: /inc/Core/OAuth/OAuth2Handler.php
Usage Example:
$oauth2 = apply_filters('datamachine_get_oauth2_handler', null);
$state = $oauth2->create_state('provider_key');
$auth_url = $oauth2->get_authorization_url($base_url, $params);
$result = $oauth2->handle_callback($provider_key, $token_url, $token_params, $account_fn);
Methods:
- create_state() – Generate OAuth state nonce
- verify_state() – Verify OAuth state nonce
- get_authorization_url() – Build authorization URL
- handle_callback() – Complete OAuth flow with token exchange
Providers: Reddit, Facebook, Threads, Google Sheets
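Both OAuth discovery filters are called with null as the default, so a caller gets null back if nothing is registered on the filter. A minimal defensive sketch follows, assuming only that behavior; dm_example_require_service() is a hypothetical helper, not a Data Machine function.

```php
<?php
/**
 * Hypothetical guard: make sure a discovery filter returned a usable object
 * before any method is called on it.
 */
function dm_example_require_service($service, string $filter_name): object
{
    if (!is_object($service)) {
        throw new RuntimeException("No service registered on {$filter_name}.");
    }
    return $service;
}

// Inside WordPress (assumption: the plugin is active) the call would look like:
// $oauth2 = dm_example_require_service(
//     apply_filters('datamachine_get_oauth2_handler', null),
//     'datamachine_get_oauth2_handler'
// );

// Stand-alone demonstration with a stand-in object and a missing service.
$service = dm_example_require_service(new stdClass(), 'datamachine_get_oauth2_handler');

try {
    dm_example_require_service(null, 'datamachine_get_oauth2_handler');
    $error = null;
} catch (RuntimeException $e) {
    $error = $e->getMessage();
}
```

Failing early with a named filter makes a missing registration easier to diagnose than a "call to a member function on null" error further down the OAuth flow.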
Handler Registration (via HandlerRegistrationTrait @since v0.2.2)
Modern handler registration uses HandlerRegistrationTrait which automatically registers with all required filters.
Filters Registered by Trait
The HandlerRegistrationTrait (/inc/Core/Steps/HandlerRegistrationTrait.php) automatically registers handlers with the following filters:
datamachine_handlers
Handler metadata registration (always registered)
datamachine_auth_providers
Authentication provider registration (conditional on requires_auth=true)
datamachine_handler_settings
Settings class registration (always registered if settings_class provided)
datamachine_tools (handler tools)
AI tool registration via callback (conditional on tools_callback provided). The
trait wires the callback into the unified datamachine_tools registry as a
deferred _handler_callable entry resolved at pipeline execution time.
Usage Pattern
use DataMachine\Core\Steps\HandlerRegistrationTrait;
class MyHandlerFilters {
use HandlerRegistrationTrait;
public static function register(): void {
self::registerHandler(
$handler_slug, // 'my_handler'
$handler_type, // 'fetch', 'publish', or 'upsert'
$handler_class, // MyHandler::class
$label, // __('My Handler', 'textdomain')
$description, // __('Handler description', 'textdomain')
$requires_auth, // true or false
$auth_class, // MyHandlerAuth::class or null
$settings_class, // MyHandlerSettings::class
$tools_callback // Callback function or null
);
}
}
function datamachine_register_my_handler_filters() {
MyHandlerFilters::register();
}
datamachine_register_my_handler_filters();
Example Implementation
Publish Handler with OAuth:
use DataMachine\Core\Steps\HandlerRegistrationTrait;
class TwitterFilters {
use HandlerRegistrationTrait;
public static function register(): void {
self::registerHandler(
'twitter',
'publish',
Twitter::class,
__('Twitter', 'datamachine'),
__('Post content to Twitter with media support', 'datamachine'),
true, // Requires OAuth
TwitterAuth::class,
TwitterSettings::class,
function($handler_slug, $handler_config, $engine_data) {
return [
'twitter_publish' => datamachine_get_twitter_tool($handler_config),
];
}
);
}
}
Fetch Handler without Auth:
use DataMachine\Core\Steps\HandlerRegistrationTrait;
class RSSFilters {
use HandlerRegistrationTrait;
public static function register(): void {
self::registerHandler(
'rss',
'fetch',
RSS::class,
__('RSS Feed', 'datamachine'),
__('Fetch content from RSS/Atom feeds', 'datamachine'),
false, // No auth required
null,
RSSSettings::class,
null // No AI tools for fetch handlers
);
}
}
Benefits
- Code Reduction: Reduces handler registration code by ~70%
- Consistency: Ensures uniform registration patterns across all handlers
- Maintainability: Centralizes filter registration logic
- Type Safety: Method signature provides clear parameter requirements
See Handler Registration Trait for complete documentation.
AI Integration Filters
datamachine_tools
Purpose: Unified registry for every AI tool — static global tools AND per-handler
runtime-generated tools. Consumed by ToolPolicyResolver when gathering the
available tool set for a pipeline or chat context.
Parameters:
$tools(array) – Current tools registry (keyed by tool name or internal wrapper key)
Return: Modified tools array
Static tool entry (global tools)
add_filter('datamachine_tools', function($tools) {
$tools['my_tool'] = [
'_callable' => [$this, 'getToolDefinition'], // Lazy resolution
'modes' => ['chat', 'pipeline'],
'ability' => 'datamachine/my-ability', // Links to an ability for permission resolution
'access_level' => 'admin', // Fallback when no ability is linked
];
return $tools;
});
The _callable resolves to the full tool definition array (name, description,
parameters, etc.) at first access. See Tool Registration for the resolved
definition contract.
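The lazy _callable pattern can be pictured as follows. This is an illustrative sketch, not the plugin's actual resolver: dm_example_resolve_tool() is a hypothetical name, and the merge rule (registry-level keys only fill keys the definition lacks) is an assumption based on the "filled by registry wrapper if absent" note in the contract.

```php
<?php
/**
 * Hypothetical resolver: turn a registry entry into a full tool definition.
 * The callable runs only when the entry is resolved, not at registration.
 */
function dm_example_resolve_tool(array $entry): array
{
    if (!isset($entry['_callable']) || !is_callable($entry['_callable'])) {
        return $entry; // Already a plain definition.
    }

    $definition = call_user_func($entry['_callable']);
    unset($entry['_callable']);

    // Array union: keys from the definition win, registry keys fill the gaps.
    return is_array($definition) ? $definition + $entry : $entry;
}

$calls = 0;
$entry = [
    '_callable' => function () use (&$calls) {
        $calls++;
        return ['description' => 'Example tool', 'parameters' => []];
    },
    'modes'        => ['chat', 'pipeline'],
    'access_level' => 'admin',
];

// Nothing has run yet; the definition is built on resolution.
$resolved = dm_example_resolve_tool($entry);
```

Deferring the callable keeps registration cheap: tools that are never offered in the current mode never build their parameter schemas.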
Handler tool entry (dynamic, runtime-generated)
Handler tools are shaped by the runtime handler configuration of the adjacent
pipeline step (e.g. ai_decides taxonomy choices produce different tool
parameter schemas). The registry entry contains a _handler_callable that
receives runtime context and returns one or more tool definitions.
add_filter('datamachine_tools', function($tools) {
$tools['__handler_tools_wordpress_publish'] = [
'_handler_callable' => function($handler_slug, $handler_config, $engine_data) {
return [
'wordpress_publish' => [
'class' => WordPressPublishTool::class,
'method' => 'handle_tool_call',
'handler' => $handler_slug,
'description' => 'Publish content to WordPress',
'parameters' => build_params_from_config($handler_config),
],
];
},
'handler' => 'wordpress_publish', // Exact slug match against adjacent step
'modes' => ['pipeline'],
'access_level' => 'admin',
];
return $tools;
});
Matching modes:
- 'handler' => 'slug' – entry applies only when the adjacent step’s handler slug equals 'slug'.
- 'handler_types' => ['fetch', 'event_import'] – entry applies to any handler whose registered type is in the list. Used for cross-cutting tools (e.g. skip_item exposed to every fetch-type handler).
The callback signature is (string $handler_slug, array $handler_config, array $engine_data): array.
Returned array is ['tool_name' => $tool_definition] (empty array to opt out).
Preferred pattern: use HandlerRegistrationTrait::registerHandler() — the
trait wires the callback into this filter with the correct wrapper shape. Manual
registration is only needed for cross-cutting tools that register against
handler_types.
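The two matching modes described above can be sketched as a small predicate. This is an assumption-laden illustration: dm_example_entry_matches() is a hypothetical name, and giving 'handler' precedence over 'handler_types' when both are present is a choice made for the example, not documented behavior.

```php
<?php
/**
 * Hypothetical matcher: does a handler tool entry apply to the adjacent step?
 */
function dm_example_entry_matches(array $entry, string $handler_slug, string $handler_type): bool
{
    // Exact slug match against the adjacent step's handler.
    if (isset($entry['handler'])) {
        return $entry['handler'] === $handler_slug;
    }

    // Cross-cutting match against the handler's registered type.
    if (isset($entry['handler_types']) && is_array($entry['handler_types'])) {
        return in_array($handler_type, $entry['handler_types'], true);
    }

    return false;
}

$exact = ['handler' => 'wordpress_publish'];
$cross = ['handler_types' => ['fetch', 'event_import']];

var_dump(dm_example_entry_matches($exact, 'wordpress_publish', 'publish')); // bool(true)
var_dump(dm_example_entry_matches($cross, 'rss', 'fetch'));                 // bool(true)
var_dump(dm_example_entry_matches($cross, 'twitter', 'publish'));           // bool(false)
```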
Resolved tool definition contract
Whether a tool comes from a static _callable or a handler _handler_callable,
the resolved definition follows the same shape:
[
'class' => 'ToolClassName',
'method' => 'handle_tool_call',
'description' => 'Tool description for AI',
'parameters' => [
'param_name' => [
'type' => 'string|integer|boolean',
'required' => true|false,
'description' => 'Parameter description',
],
],
'handler' => 'handler_slug', // Optional: handler-owned tool
'requires_config' => true|false, // Optional: UI configuration indicator
'handler_config' => $handler_config, // Optional: passed to tool execution
'modes' => ['pipeline'], // Filled by registry wrapper if absent
'ability' => 'datamachine/...', // Optional: permission link
'access_level' => 'admin', // Optional: permission fallback
]
chubes_ai_request (removed runtime path)
Purpose: Historical provider-dispatch filter. Agent runtime requests now go
through RequestBuilder::build() and the wp-ai-client adapter instead.
Parameters:
- $request (array) – AI request data
- $provider (string) – AI provider slug
- $streaming_callback (mixed) – Streaming callback function
- $tools (array) – Available tools array
- $pipeline_step_id (string|null) – Pipeline step ID for context
Return: Historical array response shape.
Universal Engine Directive System (@since v0.2.0): Centralized AI request construction via RequestBuilder with hierarchical directive application through filter-based architecture.
Directive Application via RequestBuilder:
All AI requests now use RequestBuilder::build() which integrates with PromptBuilder for unified directive management with priority-based ordering:
- Unified Directives (datamachine_directives filter) – Centralized directive registration with priority and agent targeting
Request Structure:
$ai_response = RequestBuilder::build(
$messages, // Messages array with role/content
$provider, // AI provider name
$model, // Model identifier
$tools, // Raw tools array from filters
$agent_mode, // 'chat', 'pipeline', 'system', or extension mode
$context // Agent-specific context
);
Current Directive Implementations:
- CoreMemoryFilesDirective – Registered memory files from shared, agent, and user layers (priority 20)
- AgentModeDirective – Mode-specific guidance for chat, pipeline, system, and extension modes (priority 22)
- CallerContextDirective – Authenticated cross-site caller identity (priority 25)
- AgentDailyMemoryDirective and ClientContextDirective – Optional daily memory and client context (priority 35)
- PipelineMemoryFilesDirective, FlowMemoryFilesDirective, and PipelineSystemPromptDirective – Pipeline-specific context (priorities 40, 45, 50)
- ChatPipelinesDirective – Pipeline and flow inventory for chat (priority 45)
Unified Directive Registration:
// Register directives with priority and mode targeting
add_filter('datamachine_directives', function($directives) {
$directives[] = [
'class' => MyGlobalDirective::class,
'priority' => 20, // Lower = applied first
'modes' => ['all'] // 'all', 'pipeline', 'chat', 'system', or extension mode
];
$directives[] = [
'class' => MyPipelineDirective::class,
'priority' => 30,
'modes' => ['pipeline']
];
return $directives;
});
Note: All AI request building now uses RequestBuilder::build() to ensure consistent request structure and directive application. Do not add new chubes_ai_request dispatch sites.
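Priority ordering and mode targeting can be illustrated with a small selection routine. This is a sketch of the idea only, not the PromptBuilder implementation: dm_example_directives_for_mode() is a hypothetical name, and treating a missing 'modes' key as "no match" and a missing 'priority' as 0 are assumptions made for the example.

```php
<?php
/**
 * Hypothetical selector: directives that apply to a mode, lowest priority first.
 */
function dm_example_directives_for_mode(array $directives, string $mode): array
{
    $applicable = array_filter(
        $directives,
        static function ($directive) use ($mode) {
            $modes = $directive['modes'] ?? []; // Assumption: no modes means no match.
            return in_array('all', $modes, true) || in_array($mode, $modes, true);
        }
    );

    // Lower priority value is applied first.
    usort(
        $applicable,
        static function ($a, $b) {
            return ($a['priority'] ?? 0) <=> ($b['priority'] ?? 0);
        }
    );

    return $applicable;
}

// Class names here are placeholders, not real directive classes.
$directives = [
    ['class' => 'PipelineOnly', 'priority' => 30, 'modes' => ['pipeline']],
    ['class' => 'Everywhere',   'priority' => 20, 'modes' => ['all']],
    ['class' => 'ChatOnly',     'priority' => 25, 'modes' => ['chat']],
];

$ordered = dm_example_directives_for_mode($directives, 'pipeline');
echo implode(', ', array_column($ordered, 'class')), PHP_EOL; // Everywhere, PipelineOnly
```

Directives that share a priority value may come back in either order on PHP versions before 8.0, where usort() is not stable, so distinct priorities are the safer choice when order matters.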
datamachine_session_title_prompt
Purpose: Customize or replace the AI prompt used for generating session titles.
Parameters:
- $default_prompt (string) – The default prompt for title generation
- $context (array) – Conversation context with the following keys:
  - first_user_message (string) – The first message from the user
  - first_assistant_response (string) – The assistant’s first response
  - conversation_context (string) – Combined conversation context
Return: String – The prompt to use for title generation
Location: /inc/Abilities/SystemAbilities.php
Usage Example:
// Generate code names instead of descriptive titles
add_filter('datamachine_session_title_prompt', function($prompt, $context) {
return "Generate a two-word code name like 'cosmic-owl' or 'azure-phoenix'. " .
"Return ONLY the code name, nothing else.";
}, 10, 2);
// Add custom context to the default prompt
add_filter('datamachine_session_title_prompt', function($prompt, $context) {
return $prompt . "\n\nAdditional instruction: Keep titles under 5 words.";
}, 10, 2);
// Generate privacy-safe titles without chat content
add_filter('datamachine_session_title_prompt', function($prompt, $context) {
$words = ['cosmic', 'azure', 'golden', 'silent', 'swift'];
$nouns = ['owl', 'phoenix', 'river', 'mountain', 'forest'];
return sprintf(
"Return exactly this title: %s-%s",
$words[array_rand($words)],
$nouns[array_rand($nouns)]
);
}, 10, 2);Use Cases:
first_user_message(string) – The first message from the userfirst_assistant_response(string) – The assistant’s first responseconversation_context(string) – Combined conversation context
Preview & Approval Filters
Data Machine ships one preview/approve primitive: PendingActionStore
plus ResolvePendingActionAbility. Any tool that wants the user to see a
change before it takes effect stages its invocation via
PendingActionHelper::stage() and registers an apply callback on
datamachine_pending_action_handlers. The core content abilities
(edit_post_blocks, replace_post_blocks, insert_content), the socials
publishers, and anything else opting into action_policy=preview all route
through the same lane.
Which preview primitive should I use? There is only one. Call
PendingActionHelper::stage()to stage a pending invocation and register your apply callback ondatamachine_pending_action_handlers. TheResolvePendingActionAbility(ability slugdatamachine/resolve-pending-action, REST routePOST /datamachine/v1/actions/resolve, chat toolresolve_pending_action) finalizes every kind.
datamachine_pending_action_handlers
Which preview primitive should I use? There is only one. Call
PendingActionHelper::stage() to stage a pending invocation and register
your apply callback on datamachine_pending_action_handlers. The
ResolvePendingActionAbility (ability slug
datamachine/resolve-pending-action, REST route
POST /datamachine/v1/actions/resolve, chat tool
resolve_pending_action) finalizes every kind.
add_filter( 'datamachine_pending_action_handlers', function ( $handlers ) {
$handlers['my_kind'] = array(
'apply' => array( MyAbility::class, 'execute' ),
'can_resolve' => function ( array $payload, string $decision, int $user_id ) {
// Return true, false, or a WP_Error. Optional — defaults to
// "any user who can call resolve_pending_action".
return current_user_can( 'edit_posts' );
},
);
return $handlers;
} );Which preview primitive should I use? There is only one. Call
PendingActionHelper::stage() to stage a pending invocation and register
your apply callback on datamachine_pending_action_handlers. The
ResolvePendingActionAbility (ability slug
datamachine/resolve-pending-action, REST route
POST /datamachine/v1/actions/resolve, chat tool
resolve_pending_action) finalizes every kind.
Content Format Filters
Purpose: Register the apply + permission callbacks for a pending-action kind.
apply receives the stored apply_input array and must return either a
value (which is wrapped into the resolver response) or a WP_Error to
surface failure.
datamachine_post_content_format
Data Machine separates authoring/source format from stored format. AI-facing tools should treat normal authored prose as markdown unless a workflow explicitly pins another source format. Storage-aware abilities then convert that source through Block Format Bridge into the post type’s canonical stored shape.
add_filter( 'datamachine_post_content_format', function ( string $format, string $post_type ): string {
return 'wiki' === $post_type ? 'markdown' : $format;
}, 10, 2 );Block Format Bridge is bundled by Data Machine and is an internal substrate for
these boundaries. Consumers should not require a standalone BFB plugin on a
DM-powered site just to use Data Machine content abilities. Keep format repair,
mixed-content detection, and malformed-input normalization in BFB (bfb_normalize())
rather than duplicating those checks in Data Machine call sites.
Purpose: Choose the canonical post_content storage format for a post type.
Core defaults to blocks, while storage-layer plugins can return formats such as
markdown for post types they own.
datamachine_pending_action_staged
content_format on abilities is the caller’s source format (markdown, html,
or blocks). It is not the storage decision. For example, the upsert_post
chat tool defaults omitted content_format to markdown so agents can author
ordinary prose naturally; raw ability/API calls keep the legacy omitted-format
default of block markup for compatibility.
datamachine_pending_action_resolved
Publish handlers follow the same contract: wordpress_publish accepts
content_format, appends source attribution in that source format, then stores
the final content in the post type’s canonical format.
datamachine_tool_action_policy
Purpose: Fires when a tool invocation has been staged and is awaiting user resolution. Use this to notify users, log audit trails, or mirror the payload into a visible queue.
Pipeline Operations Filters
datamachine_create_pipeline
Purpose: Fires after a staged action is accepted or rejected. Receives
$decision, $action_id, $kind, $payload, $result.
Purpose: Last-layer override of the resolved action policy
(direct | preview | forbidden) for a single tool invocation. Runs after
ActionPolicyResolver has consulted deny lists, per-agent overrides, tool
declarations, and mode presets.
Purpose: Create new pipeline
first_user_message(string) – The first message from the userfirst_assistant_response(string) – The assistant’s first responseconversation_context(string) – Combined conversation context
Abilities Integration: Handled by datamachine/create-pipeline ability.
Parameters:
$data = [
'pipeline_name' => 'Pipeline Name',
'pipeline_config' => $config_array
];Return: Integer pipeline ID or false
// Abilities API
$ability = wp_get_ability( 'datamachine/create-pipeline' );
$result = $ability->execute( [ 'pipeline_name' => 'Pipeline Name', 'options' => $options ] );
// Filter Hook (for extensibility)
$pipeline_id = apply_filters('datamachine_create_pipeline', null, $data);datamachine_create_flow
Data Structure:
Usage:
Purpose: Create new flow instance
- Generate code names instead of descriptive titles
- Add custom instructions to title generation
- Create privacy-safe titles that don’t expose chat content
- Customize title style per site or plugin
Abilities Integration: Handled by datamachine/create-flow ability.
Parameters:
// Abilities API
$ability = wp_get_ability( 'datamachine/create-flow' );
$result = $ability->execute( [ 'pipeline_id' => $pipeline_id, 'flow_name' => 'Flow Name' ] );
// Filter Hook (for extensibility)
$flow_id = apply_filters('datamachine_create_flow', null, $data);datamachine_get_pipelines
Return: Integer flow ID or false
Usage:
$pipeline_id(null) – Placeholder for return value$data(array) – Pipeline creation data
Purpose: Retrieve pipeline data
datamachine_get_flow_config
Parameters:
Return: Array of pipeline data
$flow_id(null) – Placeholder for return value$data(array) – Flow creation data
Purpose: Get flow configuration
datamachine_get_flow_step_config
Parameters:
Return: Array of flow configuration
$pipelines(array) – Empty array for return data$pipeline_id(int|null) – Specific pipeline ID or null for all
Purpose: Get specific flow step configuration
Authentication Filters
datamachine_auth_providers
Parameters:
Return: Array containing flow step configuration
$config(array) – Empty array for return data$flow_id(int) – Flow ID
Purpose: Register OAuth authentication providers
Parameters:
$providers['provider_slug'] = new AuthProviderClass();datamachine_retrieve_oauth_account
Return: Array of authentication provider instances
Structure:
$config(array) – Empty array for return data$flow_step_id(string) – Composite flow step ID
Purpose: Get stored OAuth account data
datamachine_oauth_callback
Parameters:
Return: Array of account information
$providers(array) – Current auth providers
Purpose: Generate OAuth authorization URL
Configuration Filters
datamachine_tool_configured
Parameters:
Return: OAuth authorization URL string
$account(array) – Empty array for return data$handler(string) – Handler slug
Purpose: Check if tool is properly configured
datamachine_get_tool_config
Parameters:
Return: Boolean configuration status
$url(string) – Empty string for return data$provider(string) – Provider slug
Purpose: Retrieve tool configuration data
datamachine_handler_settings
Parameters:
Return: Array of tool configuration
$configured(bool) – Default configuration status$tool_id(string) – Tool identifier
Purpose: Register handler settings classes
Parameter Processing Filters
datamachine_engine_data
Parameters:
Return: Array of settings class instances
$config(array) – Empty array for return data$tool_id(string) – Tool identifier
Purpose: Centralized engine data access filter for retrieving stored engine parameters
Parameters:
$engine_data = [
'source_url' => $source_url, // For link attribution and content updates
'image_url' => $image_url, // For media handling
// Additional engine parameters as needed
];Return: Array containing engine data (source_url, image_url, etc.)
add_filter('datamachine_engine_data', function($engine_data, $job_id) {
if (empty($job_id)) {
return [];
}
// Use direct database class instantiation
$db_jobs = new DataMachineCoreDatabaseJobsJobs();
$retrieved_data = $db_jobs->retrieve_engine_data($job_id);
return $retrieved_data ?: [];
}, 10, 2);Engine Data Structure:
// Steps access engine data as needed
$engine_data = apply_filters('datamachine_engine_data', [], $job_id);
$source_url = $engine_data['source_url'] ?? null;
$image_url = $engine_data['image_url'] ?? null;Core Implementation (EngineData.php):
// Fetch handlers store engine parameters in database via centralized filter (array storage)
if ($job_id) {
apply_filters('datamachine_engine_data', null, $job_id, [
'source_url' => $source_url,
'image_url' => $image_url
]);
}Usage by Steps:
$settings(array) – Current settings array
Centralized Handler Filters
datamachine_timeframe_limit
Engine Data Storage (by Fetch Handlers):
Benefits:
$engine_data(array) – Default empty array for return data$job_id(int) – Job ID to retrieve engine data for
Purpose: Shared timeframe parsing across fetch handlers with discovery and conversion modes
Parameters:
$timeframe_options = apply_filters('datamachine_timeframe_limit', null, null);
// Returns:
[
'all_time' => __('All Time', 'data-machine'),
'24_hours' => __('Last 24 Hours', 'data-machine'),
'72_hours' => __('Last 72 Hours', 'data-machine'),
'7_days' => __('Last 7 Days', 'data-machine'),
'30_days' => __('Last 30 Days', 'data-machine'),
]Return: Array of options (discovery mode) or timestamp (conversion mode) or null
$cutoff_timestamp = apply_filters('datamachine_timeframe_limit', null, '24_hours');
// Returns: Unix timestamp for 24 hours ago or null for 'all_time'datamachine_keyword_search_match
Discovery Mode (when $timeframe_limit is null):
Conversion Mode (when $timeframe_limit is a string):
- ✅ Centralized Access: Single filter for all engine data retrieval
- ✅ Filter-Based Discovery: Uses established database service discovery pattern
- ✅ Clean Separation: Engine data separate from AI data packets
- ✅ Flexible: Steps access only what they need via filter call
Purpose: Universal keyword matching with OR logic for all fetch handlers
Parameters:
$matches = apply_filters('datamachine_keyword_search_match', true, $content, 'wordpress,ai,automation');
// Returns true if content contains 'wordpress' OR 'ai' OR 'automation'Return: Boolean indicating if any keyword matches
$default(mixed) – Default value (null or timestamp)$timeframe_limit(string|null) – Timeframe specification
datamachine_data_packet
Usage:
Features:
$default(bool) – Default match result$content(string) – Content to search in$search_term(string) – Comma-separated keywords
Purpose: Centralized data packet creation with standardized structure
Parameters:
$data = apply_filters('datamachine_data_packet', $data, $packet_data, $flow_step_id, $step_type);Return: Array with new packet added to front
- OR Logic: Any keyword match passes the filter
- Case Insensitive: Uses
mb_stripos()for Unicode-safe matching - Comma Separated: Supports multiple keywords separated by commas
- Empty Filter: Returns true when no search term provided (match all)
Data Processing Filters
datamachine_should_reprocess_item
Usage:
Features:
Since: v0.71.0
Purpose: Opt into time-windowed revisit semantics for fetch-side deduplication without every handler growing its own --revisit-days flag.
$data(array) – Current data packet array$packet_data(array) – Packet data to add$flow_step_id(string) – Flow step identifier$step_type(string) – Step type
Wire point: ExecutionContext::isItemProcessed() — applied after the default seen/not-seen check runs. The filter is not invoked in direct or standalone execution modes, or when flow_step_id is empty.
Parameters:
Return: Boolean. true to skip (default seen-before behavior). false to process anyway (revisit).
use DataMachineCoreDatabaseProcessedItemsProcessedItems;
add_filter( 'datamachine_should_reprocess_item', function ( $skip, $ctx ) {
if ( ! $skip ) {
return false;
}
if ( 'wiki_post' !== $ctx['source_type'] ) {
return $skip;
}
$fresh = ( new ProcessedItems() )->has_been_processed_within(
$ctx['flow_step_id'],
$ctx['source_type'],
$ctx['item_identifier'],
7
);
// skip=false means "process"; return true to keep skipping when still fresh.
return $fresh;
}, 10, 2 );Default behavior (no filter): The filter never returns a different value than was passed in; existing deployments behave identically to pre-0.71 installs.
Duplicate Detection Filters
datamachine_duplicate_strategies
Example — reprocess stale wiki posts:
See also: ProcessedItems::get_processed_at(), ProcessedItems::has_been_processed_within(), ProcessedItems::find_stale(), ProcessedItems::find_never_processed() — the time-windowed read API introduced in the same release.
Since: v0.39.0
- Standardized Structure: Ensures type and timestamp fields are present
- Preserves All Fields: Merges packet_data while adding missing structure
- Front Addition: Uses
array_unshift()to add new packets to the beginning
Purpose: Register domain-specific duplicate detection strategies for the datamachine/check-duplicate ability. Extensions use this to add post-type-specific matching logic (e.g., event identity via venue + date + ticket URL) that runs before core’s generic title/source-URL strategies.
Parameters:
[
'id' => 'event_identity_index', // string, required. Stable id, surfaced as `strategy` in the ability result.
'post_type' => 'data_machine_events', // string, required. Specific post type or '*' for all types.
'callback' => [Strategy::class, 'check'], // callable, required. See callback contract below.
'priority' => 5, // int, optional (default: 50). Lower runs first.
]Return: Array of strategy definitions
- Extension strategies registered on this filter (sorted by
priority, lowest first). - Core
published_post_source_urlmatch (exact source URL viaPostIdentityIndex). - Core
published_posttitle match (similarity engine). - Core
queue_itemJaccard match (only whenscopeincludesqueue).
Strategy Definition Structure:
Cascade Order:
First strategy to return a duplicate verdict short-circuits the cascade.
function(array $input): ?array {
// $input['title'] string — incoming title
// $input['post_type'] string — resolved post type
// $input['context'] array — domain-specific payload (venue, startDate, ticketUrl, ...)
// $input['source_url'] string — optional canonical source URL
// ...plus any other fields the caller passed to datamachine/check-duplicate
}Callback Contract:
[
'verdict' => 'duplicate', // string, required — must be 'duplicate' to short-circuit
'source' => 'identity_index', // string, optional — origin of the match
'match' => [ // array, required — match details
'post_id' => 123,
'title' => 'Existing Post',
'url' => 'https://example.com/existing',
// strategy-specific fields are allowed
],
'reason' => 'Matched existing ...', // string, optional — human-readable explanation
'strategy' => 'event_identity_index', // string, optional — overrides `id` in the final result
]The callback receives the full ability input merged with normalized title, post_type, and context:
Return null to pass (let the cascade continue), or an array with:
namespace DataMachineEventsCoreDuplicateDetection;
class EventDuplicateStrategy {
public static function register(): void {
add_filter( 'datamachine_duplicate_strategies', [ static::class, 'addStrategy' ] );
}
public static function addStrategy( array $strategies ): array {
$strategies[] = [
'id' => 'event_identity_index',
'post_type' => 'data_machine_events',
'callback' => [ static::class, 'check' ],
'priority' => 5, // Run before core strategies.
];
return $strategies;
}
public static function check( array $input ): ?array {
$title = $input['title'] ?? '';
$context = $input['context'] ?? [];
$venue = $context['venue'] ?? '';
$date = $context['startDate'] ?? '';
if ( empty( $title ) || empty( $date ) ) {
return null;
}
// ... domain-specific lookup against PostIdentityIndex ...
$post_id = $this->lookup( $title, $venue, $date );
if ( ! $post_id ) {
return null;
}
return [
'verdict' => 'duplicate',
'source' => 'identity_index',
'match' => [
'post_id' => $post_id,
'title' => get_the_title( $post_id ),
'url' => get_permalink( $post_id ),
],
'reason' => 'Matched existing event via venue + date.',
];
}
}Any non-duplicate verdict (or missing verdict) is treated as a pass.
Usage Example (from data-machine-events):
- Use the index for lookups (indexed columns → fast). Safe for reading. See
EventDuplicateStrategy::findByTicketUrl()for a canonical example. - Write to the index via the same writers core uses (e.g.,
EventIdentityWriter::syncIdentityRow()indata-machine-events). Recommended when your extension owns a custom post type and wants fast identity lookups. - Maintain your own lookup (e.g., an existing indexed column on
wp_postslikepost_name+post_parent). Valid for cases where the identity index would be redundant.
Working with PostIdentityIndex:
Core ships DataMachineCoreDatabasePostIdentityIndexPostIdentityIndex — an indexed lookup table (post_id, source_url, title_hash, event-related columns) used by the core source-URL strategy. Extensions have three options:
There is no requirement to use PostIdentityIndex — the filter accepts any callback. Choose based on what’s already indexed for your post type.
Stability:
$skip(bool) — Current skip decision.truemeans "skip — already processed";falsemeans "process".$context(array):flow_step_id(string)source_type(string)item_identifier(string)job_id(int) — 0 when unavailable.
Files Repository Filters
datamachine_files_repository
This filter, the strategy definition shape, the callback signature, and the return array shape are considered a public API as of 0.39.0. They will not change in a backward-incompatible way without a deprecation cycle.
See Also:
flow_step_id(string)source_type(string)item_identifier(string)job_id(int) — 0 when unavailable.
Purpose: Access files repository service
Directive System Filters
datamachine_directives
Parameters:
Return: Array with ‘files’ key containing repository instance
Since: v0.2.5
Purpose: Unified directive registration with priority-based ordering and agent type targeting
Parameters:
[
'class' => DirectiveClass::class, // Directive class name
'priority' => 20, // Priority (lower = applied first)
'modes' => ['all'] // 'all', 'pipeline', 'chat', 'system', or extension mode
]
Return: Modified directives array
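How the priority and modes keys could interact is sketched below. This is a minimal illustration, not Data Machine's implementation: `example_select_directives()` is a hypothetical name, and the defaults used when a key is missing (`['all']` for modes, `0` for priority) are assumptions.

```php
<?php
/**
 * Hypothetical sketch: keep directives whose 'modes' include 'all' or the
 * requested mode, then order them so that lower priority values come first.
 */
function example_select_directives( array $directives, string $mode ): array {
	$matching = array_filter(
		$directives,
		static function ( array $directive ) use ( $mode ): bool {
			$modes = $directive['modes'] ?? [ 'all' ]; // Assumed default.
			return in_array( 'all', $modes, true ) || in_array( $mode, $modes, true );
		}
	);

	// Ascending sort: lower priority number = applied first.
	usort(
		$matching,
		static fn( array $a, array $b ): int => ( $a['priority'] ?? 0 ) <=> ( $b['priority'] ?? 0 )
	);

	return $matching;
}

$directives = [
	[ 'class' => 'MyPipelineDirective', 'priority' => 35, 'modes' => [ 'pipeline' ] ],
	[ 'class' => 'MyGlobalDirective', 'priority' => 25, 'modes' => [ 'all' ] ],
	[ 'class' => 'MyChatDirective', 'priority' => 10, 'modes' => [ 'chat' ] ],
];

$ordered = array_column( example_select_directives( $directives, 'pipeline' ), 'class' );
// $ordered is [ 'MyGlobalDirective', 'MyPipelineDirective' ]
```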
add_filter('datamachine_directives', function($directives) {
// Global directive (all agents)
$directives[] = [
'class' => MyGlobalDirective::class,
'priority' => 25,
'modes' => ['all']
];
// Pipeline-specific directive
$directives[] = [
'class' => MyPipelineDirective::class,
'priority' => 35,
'modes' => ['pipeline']
];
return $directives;
});
Directive Configuration Structure:
- $strategies (array) – Array of strategy definitions (see structure below)
- $post_type (string) – The post type being checked
datamachine_global_directives (LEGACY — use datamachine_directives)
Usage Example:
Priority Guidelines:
Deprecated: v0.2.5
Replacement: Use datamachine_directives with modes => ['all']
// LEGACY (pre-v0.2.5)
add_filter('datamachine_global_directives', function($directives) {
$directives[] = [
'priority' => 25,
'content' => 'Custom global directive'
];
return $directives;
});
// CURRENT (v0.2.5+)
add_filter('datamachine_directives', function($directives) {
$directives[] = [
'class' => MyGlobalDirective::class,
'priority' => 25,
'modes' => ['all']
];
return $directives;
});
datamachine_agent_directives (LEGACY — use datamachine_directives)
Purpose: Modify global AI system directives applied across all AI interactions (pipeline + chat)
Migration Example:
Deprecated: v0.2.5
Replacement: Use datamachine_directives with mode-specific modes targeting
- Source: inc/Abilities/DuplicateCheck/DuplicateCheckAbility.php::getStrategies()
- Canonical consumer: the event duplicate strategy in the data-machine-events extension plugin
- Ability docs: datamachine/check-duplicate in ai-tools reference
Purpose: Modify AI system directives for specific agent types (pipeline or chat)
Parameters:
// LEGACY (pre-v0.2.5)
add_filter('datamachine_agent_directives', function($request, $agent_type, $provider, $tools, $context) {
if ($agent_type === 'pipeline') {
$request['messages'][] = [
'role' => 'system',
'content' => 'Pipeline-specific directive'
];
}
return $request;
}, 10, 5);
// CURRENT (v0.2.5+)
add_filter('datamachine_directives', function($directives) {
$directives[] = [
'class' => MyPipelineDirective::class,
'priority' => 30,
'modes' => ['pipeline']
];
return $directives;
});
Navigation Filters
datamachine_get_next_flow_step_id
Return: Modified request array
Migration Example:
$repositories(array) – Empty array for repository services
Purpose: Find next step in flow execution sequence
Universal Engine Architecture
Parameters:
Return: String next flow step ID or null if last step
ToolParameters (/inc/Engine/AI/Tools/ToolParameters.php)
Since: 0.2.0
Location: /inc/Engine/AI/
Data Machine’s Universal Engine provides shared AI infrastructure serving both Pipeline and Chat agents. See /docs/core-system/universal-engine.md for complete architecture documentation.
buildParameters()
DataMachine\Engine\AI\ToolParameters::buildParameters(array $data, ?string $job_id, ?string $flow_step_id): array
Purpose: Centralized parameter building for all AI tools with unified flat structure.
Core Methods:
[
'content_string' => 'Clean content text',
'title' => 'Original title',
'job_id' => '123',
'flow_step_id' => 'step_uuid_flow_123'
]
buildForHandlerTool()
DataMachine\Engine\AI\ToolParameters::buildForHandlerTool(array $data, array $tool_def, ?string $job_id, ?string $flow_step_id): array
Builds flat parameter structure for standard AI tools with content extraction and job context.
Returns:
[
// Standard parameters
'content_string' => 'Clean content',
'title' => 'Title',
'job_id' => '123',
'flow_step_id' => 'step_uuid_flow_123',
// Tool metadata
'tool_definition' => [...],
'tool_name' => 'twitter_publish',
'handler_config' => [...],
// Engine parameters (from database)
'source_url' => 'https://example.com/post',
'image_url' => 'https://example.com/image.jpg'
]
Builds parameters for handler-specific tools with engine data merging (source_url, image_url).
$directives(array) – Array of directive configurations
ToolExecutor (/inc/Engine/AI/Tools/ToolExecutor.php)
Returns:
Key Features:
ToolPolicyResolver::resolve()
Purpose: Universal tool discovery and execution infrastructure.
$resolver = new \DataMachine\Engine\AI\Tools\ToolPolicyResolver();
$tools = $resolver->resolve( array(
'mode' => ToolPolicyResolver::MODE_PIPELINE, // or MODE_CHAT, MODE_SYSTEM
'previous_step_config' => $previous_step_config,
'next_step_config' => $next_step_config,
'pipeline_step_id' => $flow_step_id,
'engine_data' => $engine_data,
) );
Core Method:
- Handler Tools – Retrieved via datamachine_tools filter (runtime-resolved _handler_callable entries)
- Global Tools – Retrieved via datamachine_global_tools filter
- Chat Tools – Retrieved via datamachine_chat_tools filter (chat only)
- Enablement Check – Each tool filtered through datamachine_tool_enabled
AIConversationLoop (/inc/Engine/AI/AIConversationLoop.php)
Tool discovery moved from ToolExecutor::getAvailableTools() (removed in 0.79)
to ToolPolicyResolver::resolve(). Single entry point for chat and pipeline
modes.
Discovery Process:
$final_response = DataMachine\Engine\AI\AIConversationLoop::run(
array $messages,
array $tools,
string $provider,
string $model,
string $context, // 'pipeline', 'chat', etc.
array $payload = [],
int $max_turns = 25,
bool $single_turn = false
): array
Purpose: Multi-turn conversation execution with automatic tool calling.
Canonical entry point:
apply_filters(
'agents_api_conversation_runner',
null, // Return non-null array to short-circuit
$messages, $tools, $provider, $model,
$context, $payload, $max_turns, $single_turn
);
run() internally applies the agents_api_conversation_runner filter, giving
a registered runtime adapter the chance to short-circuit the built-in loop. If
no adapter returns an array, Data Machine’s built-in execute() runs.
Filter: agents_api_conversation_runner
- 20: Registered memory files
- 22: Runtime agent-mode guidance
- 25-35: Caller, daily memory, and client-reported context
- 40-50: Pipeline, flow, chat inventory, and workflow-specific directives
ConversationStoreInterface (/inc/Core/Database/Chat/ConversationStoreInterface.php)
Return an array matching execute()'s documented return shape to replace the
built-in loop. Return null (the default) to let Data Machine run the
conversation. See ai-conversation-loop.md
for the full adapter contract.
Features:
apply_filters(
'datamachine_conversation_store',
ConversationStoreInterface $default // the built-in MySQL-table Chat store
);
Purpose: Single seam between chat session persistence and the underlying
storage backend. The default implementation (Chat)
preserves byte-for-byte the MySQL-table behavior the codebase used before
this seam was introduced; self-hosted users see no change.
Filter: datamachine_conversation_store
add_filter( 'datamachine_conversation_store', function ( $store ) {
if ( $store instanceof My_AIFramework_Conversation_Store ) {
return $store; // already swapped
}
if ( ! function_exists( 'my_host_is_wpcom' ) || ! my_host_is_wpcom() ) {
return $store; // self-hosted — keep MySQL default
}
return new My_AIFramework_Conversation_Store();
}, 10, 1 );
Return a different ConversationStoreInterface implementation to swap the
backend. Return the default (or anything not implementing the interface) to
keep the built-in store. Misuse falls back to the default and logs via
datamachine_log.
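The fallback rule described above can be sketched as follows. Everything here is a stand-in: `Example_Store_Interface`, the two store classes, and `example_resolve_store()` are hypothetical names used for illustration, and the logging step is reduced to a comment because the signature of the `datamachine_log` hook is not documented in this section.

```php
<?php
// Hypothetical stand-ins for ConversationStoreInterface and two implementations.
interface Example_Store_Interface {}
final class Example_Default_Store implements Example_Store_Interface {}
final class Example_Framework_Store implements Example_Store_Interface {}

/**
 * Accept the filtered value only when it implements the interface;
 * otherwise keep the built-in default.
 *
 * @param mixed $filtered Whatever the filter callbacks returned.
 */
function example_resolve_store( $filtered, Example_Store_Interface $default ): Example_Store_Interface {
	if ( $filtered instanceof Example_Store_Interface ) {
		return $filtered;
	}
	// Misuse: per the text above, the real code would log here before falling back.
	return $default;
}

$default = new Example_Default_Store();
$swapped = example_resolve_store( new Example_Framework_Store(), $default );
$misused = example_resolve_store( 'not-a-store', $default );
```

With these stand-ins, `$swapped` is the framework store and `$misused` is the default instance.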
Use case: managed-host environments where chat sessions should live in
a framework-provided conversation store rather than the site DB (e.g.
Intelligence on WordPress.com routing through WPCOMAIServicesConversation_Storage).
A consumer plugin ships an adapter and registers it conditionally:
Single consumer of the store: DataMachine\Core\Database\Chat\ConversationStoreFactory::get().
Every core caller — ChatOrchestrator, the five Chat Session abilities
(via ChatSessionHelpers), ChatCommand, SystemAbilities, the
scheduled cleanup action — resolves the store through the factory. The
factory caches the store per request and applies the filter exactly once.
[
'schema' => 'agents-api.message',
'version' => 1,
'type' => 'text'|'tool_call'|'tool_result'|...,
'role' => 'user'|'assistant'|'system'|'tool',
'content' => string|array,
'payload' => array, // Type-specific fields.
'metadata' => array, // Extension/provider details.
'id' => string, // Optional stable message identifier.
'created_at' => string, // Optional MySQL DATETIME (UTC).
'updated_at' => string, // Optional MySQL DATETIME (UTC).
]
Message shape contract
Stores MUST normalize messages on read to the canonical agent message envelope,
documented in
ai-message-envelope.md. The
chat/session storage contract is this JSON-friendly envelope shape:
- $request (array) – Current AI request being built
- $agent_type (string) – Legacy agent type ('pipeline' or 'chat')
- $provider (string) – AI provider (openai, anthropic, etc.)
- $tools (array) – Available tools for the agent
- $context (array) – Agent-specific context data
The five Chat Session abilities and the DM chat UI consume this shape.
Adapter stores that wrap another host runtime are responsible for translating
host-specific message objects into the canonical envelope at the boundary and
returning envelopes on the way out. Provider-specific role/content/metadata
arrays are projection shapes at provider boundaries, not the store contract.
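A boundary translation of this kind might look like the sketch below. It is an illustration only: `example_to_envelope()` is a hypothetical helper, the field names come from the envelope shape listed in this section, and the fallback values (`'text'` for type, `'user'` for role, empty content) are assumptions rather than documented defaults.

```php
<?php
/**
 * Hypothetical adapter helper: wraps a loosely shaped host message array in
 * the canonical envelope. Optional keys are copied only when present.
 */
function example_to_envelope( array $message ): array {
	$envelope = [
		'schema'   => 'agents-api.message',
		'version'  => 1,
		'type'     => $message['type'] ?? 'text',   // Assumed fallback.
		'role'     => $message['role'] ?? 'user',   // Assumed fallback.
		'content'  => $message['content'] ?? '',
		'payload'  => $message['payload'] ?? [],
		'metadata' => $message['metadata'] ?? [],
	];

	foreach ( [ 'id', 'created_at', 'updated_at' ] as $optional ) {
		if ( isset( $message[ $optional ] ) ) {
			$envelope[ $optional ] = (string) $message[ $optional ];
		}
	}

	return $envelope;
}

$envelope = example_to_envelope( [ 'role' => 'assistant', 'content' => 'Hi' ] );
```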
- $next_id (null) – Placeholder for return value
- $current_flow_step_id (string) – Current step ID
AgentMemoryStoreInterface (/agents-api/inc/Core/FilesRepository/AgentMemoryStoreInterface.php)
Swap boundary
Contract summary (full signatures in ConversationStoreInterface.php):
apply_filters(
'agents_api_memory_store',
null, // Return AgentMemoryStoreInterface to short-circuit
AgentMemoryScope $scope // Identifies (layer, user_id, agent_id, filename)
);
Purpose: Single seam between agent memory operations and the underlying
persistence backend. The contract is generic agent-memory persistence: it does
not own section parsing, scaffold/default-file creation, editability, ability
permissions, prompt injection, flows, jobs, or pipeline behavior. The disk
default (DiskAgentMemoryStore)
preserves byte-for-byte the filesystem behavior the codebase used before this
seam was introduced.
Current filter: agents_api_memory_store
Return an AgentMemoryStoreInterface
implementation to replace the disk default for this scope. Return null (the
default) to let Data Machine read and write through the filesystem.
add_filter( 'agents_api_memory_store', function ( $store, $scope ) {
if ( $store instanceof AgentMemoryStoreInterface ) {
return $store; // someone else already swapped
}
if ( filesystem_is_writable_here() ) {
return $store; // disk default wins
}
return new My_PluginDB_Agent_Memory_Store();
}, 10, 2 );
This is the only runtime filter in Data Machine today. It replaces the earlier
datamachine_memory_store name in-place so the memory-store seam already uses
Agents API vocabulary before physical extraction. Data Machine does not mirror
the old name under a second alias because that would create a permanent
compatibility ladder instead of moving ownership.
- Content/title extraction from data packets
- Flat parameter structure for AI simplicity
- Tool metadata integration
- Engine parameter injection for handlers (source_url for link attribution, image_url for media handling)
Use case: managed-host environments where the local filesystem is not writable (e.g. WordPress.com, VIP). A consumer plugin (e.g. Intelligence) ships a DB-backed implementation and registers it conditionally:
Contract:
Section parsing, scaffolding, editability gating, ability permissions,
prompt-injection policy, and registry-driven convention-path semantics stay in
AgentMemory and its higher-level callers. The store is the persistence layer
underneath.
- Automatic tool execution during conversation turns
- Conversation completion detection
- Turn-based state management with chronological ordering
- Duplicate message prevention
- Maximum turn limiting (default: 25)
- Runtime-swappable via agents_api_conversation_runner
Single consumer of the store: DataMachine\Core\FilesRepository\AgentMemory.
- ✅ What stays stable: all 5 chat abilities, REST endpoints, the DM chat UI, the session switcher, title generation, unread counts, last-read logic.
- 🔄 What swaps: concrete storage (MySQL table vs. framework-managed store vs. in-memory test fixture).
- ❌ What is NOT a replacement point: session ownership checks, agent adoption, token resolution, title generation. Those stay in the higher-level callers.
AgentMemory is the only class in core that talks to AgentMemoryStoreFactory. It exposes:
ConversationManager (/inc/Engine/AI/ConversationManager.php)
Higher-level consumers all go through this facade rather than instantiating store types directly:
Outside plugins and extensions should follow the same pattern: instantiate AgentMemory, never reach for AgentMemoryStoreFactory directly.
- create_session / get_session / update_session / delete_session
- get_user_sessions / get_user_session_count: switcher data
- get_recent_pending_session: timeout-retry dedup
- update_title / mark_session_read: UI state
- count_unread: pure derivation from a messages array
- cleanup_expired_sessions / cleanup_old_sessions / cleanup_orphaned_sessions: scheduled cleanup
- list_sessions_for_day: day-scoped summary rows for the Daily Memory Task
- get_storage_metrics: row count + on-disk size for the wp datamachine retention status CLI; return null to opt out
RequestBuilder (/inc/Engine/AI/RequestBuilder.php)
Purpose: Message formatting utilities for AI requests.
Key Features:
build()
$response = DataMachine\Engine\AI\RequestBuilder::build(
array $messages,
string $provider,
string $model,
array $tools,
string $agent_type, // 'pipeline' or 'chat'
array $context
): array
Purpose: Centralized AI request construction for all agents.
- read( $scope ) → AgentMemoryReadResult { exists, content, hash, bytes, updated_at }
- write( $scope, $content, $if_match = null ) → AgentMemoryWriteResult (implementations supporting concurrency MUST honor $if_match and return error = 'conflict' on hash mismatch)
- exists( $scope ) → bool
- delete( $scope ) → AgentMemoryWriteResult (idempotent)
- list_layer( $scope_query ) → AgentMemoryListEntry[] (enumerates one layer)
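The `$if_match` rule in the contract is a form of optimistic concurrency, sketched below under loud assumptions. `Example_In_Memory_Store` is a hypothetical class that does not implement the real AgentMemoryStoreInterface: scopes are plain strings rather than AgentMemoryScope objects, results are plain arrays rather than the result value objects, and SHA-256 is an assumed hash choice.

```php
<?php
/**
 * Hypothetical in-memory store illustrating read/write with $if_match.
 */
final class Example_In_Memory_Store {
	/** @var array<string,string> File content keyed by a string scope. */
	private array $files = [];

	public function read( string $scope ): array {
		$exists  = array_key_exists( $scope, $this->files );
		$content = $exists ? $this->files[ $scope ] : '';

		return [
			'exists'  => $exists,
			'content' => $content,
			'hash'    => $exists ? hash( 'sha256', $content ) : null, // Assumed algorithm.
			'bytes'   => strlen( $content ),
		];
	}

	public function write( string $scope, string $content, ?string $if_match = null ): array {
		// Reject the write when the caller's expected hash no longer matches.
		if ( null !== $if_match && $this->read( $scope )['hash'] !== $if_match ) {
			return [ 'success' => false, 'error' => 'conflict' ];
		}

		$this->files[ $scope ] = $content;

		return [ 'success' => true, 'error' => null, 'hash' => hash( 'sha256', $content ) ];
	}
}

$store = new Example_In_Memory_Store();
$first = $store->write( 'agent/MEMORY.md', 'v1' );
$stale = $first['hash'];

$second = $store->write( 'agent/MEMORY.md', 'v2', $stale ); // Hash matches: accepted.
$third  = $store->write( 'agent/MEMORY.md', 'v3', $stale ); // Hash is now stale: conflict.
```

In this sketch the third write is rejected and the stored content stays at `v2`, so a caller would re-read, merge, and retry with the fresh hash.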