FetchHandler Base Class

Overview

The FetchHandler class (/inc/Core/Steps/Fetch/Handlers/FetchHandler.php) is the abstract base class for all fetch handlers in the Data Machine system. Introduced in version 0.2.1, it provides standardized functionality for data fetching operations including deduplication, engine data storage, filtering, and logging.

Architecture

Location: /inc/Core/Steps/Fetch/Handlers/FetchHandler.php
Inheritance: Abstract base class extending Step
Since: 0.2.1

Core Functionality

Single Item Execution Model

All fetch handlers implement the Single Item Execution Model, processing exactly one item per job execution. This ensures that failures are isolated to individual items and prevents batch processing timeouts.

Deduplication Management

The base class tracks which items have already been processed so that the same item is not processed more than once:

// Check if item was already processed
if ($this->isItemProcessed($item_id, $flow_step_id)) {
    return $this->emptyResponse();
}

// Mark item as processed
$this->markItemProcessed($item_id, $flow_step_id, $job_id);
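The interaction between deduplication and the Single Item Execution Model can be illustrated with a small standalone sketch. It is not the plugin's implementation: the InMemoryProcessedItems class, the pick_next_item() function, and the 'id' key are hypothetical, and an in-memory array stands in for whatever persistent store the real isItemProcessed() and markItemProcessed() use.

```php
<?php
// Hypothetical in-memory stand-in for the processed-items store.
final class InMemoryProcessedItems
{
    /** @var array<string, array<string, bool>> flow step ID => item ID => true */
    private array $seen = [];

    public function isProcessed(string $flow_step_id, string $item_id): bool
    {
        return isset($this->seen[$flow_step_id][$item_id]);
    }

    public function markProcessed(string $flow_step_id, string $item_id): void
    {
        $this->seen[$flow_step_id][$item_id] = true;
    }
}

/**
 * Return the first candidate not yet processed for this flow step, or null
 * when every candidate has been seen. The returned item is marked as
 * processed, so the next call skips it.
 */
function pick_next_item(array $candidates, string $flow_step_id, InMemoryProcessedItems $store): ?array
{
    foreach ($candidates as $item) {
        $item_id = (string) $item['id']; // assumed identifier key
        if ($store->isProcessed($flow_step_id, $item_id)) {
            continue;
        }
        $store->markProcessed($flow_step_id, $item_id);
        return $item; // exactly one item per execution
    }
    return null;
}
```

Each call yields at most one item, and tracking is scoped per flow step, so the same source item can still be picked up by a different flow step.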

Engine Data Storage

Store handler-specific parameters for downstream handlers:

$this->storeEngineData($job_id, [
    'source_url' => $source_url,
    'image_url' => $image_url
]);

Standardized Responses

Consistent response methods for success and error cases:

// Success response with data packets
return $this->successResponse([$dataPacket]);

// Empty response (no new items)
return $this->emptyResponse();

// Error response
return $this->errorResponse('Error message', ['details' => $details]);

Exclude Keywords Filtering (@since v0.3.1)

Filter content based on negative keywords to exclude unwanted items:

// Check if content contains any exclude keywords
$exclude_keywords = $config['exclude_keywords'] ?? '';
if (!empty($exclude_keywords) && $this->applyExcludeKeywords($content, $exclude_keywords)) {
    // Content contains excluded keywords, skip this item
    continue;
}

The applyExcludeKeywords() method returns true if any exclude keyword is found in the text (case-insensitive), indicating the item should be filtered out.

Required Implementation

All fetch handlers must implement the executeFetch() method:

abstract protected function executeFetch(
    int $pipeline_id,
    array $config,
    ?string $flow_step_id,
    int $flow_id,
    ?string $job_id
): array;

Standard Implementation Pattern

use DataMachine\Core\Steps\Fetch\Handlers\FetchHandler;

class MyFetchHandler extends FetchHandler {
    public function __construct() {
        parent::__construct('my_handler');
    }

    protected function executeFetch(
        int $pipeline_id,
        array $config,
        ?string $flow_step_id,
        int $flow_id,
        ?string $job_id
    ): array {
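        // Fetch data from source. fetch_from_source() and the array keys
        // read below are hypothetical placeholders for handler-specific logic.
        $fetched_data = $this->fetch_from_source($config);
        if (empty($fetched_data)) {
            return $this->emptyResponse();
        }

        $item_id        = $fetched_data['id'];
        $content_string = $fetched_data['content'];
        $source_url     = $fetched_data['source_url'] ?? '';
        $image_url      = $fetched_data['image_url'] ?? '';

        // Check deduplication
        if ($this->isItemProcessed($item_id, $flow_step_id)) {
            return $this->emptyResponse();
        }

        // Mark as processed
        $this->markItemProcessed($item_id, $flow_step_id, $job_id);

        // Store engine data for downstream handlers
        $this->storeEngineData($job_id, [
            'source_url' => $source_url,
            'image_url' => $image_url
        ]);

        // Create standardized data packet
        $dataPacket = new \DataMachine\Core\DataPacket(
            ['content_string' => $content_string, 'file_info' => null],
            ['source_type' => 'my_handler', 'item_identifier_to_log' => $item_id],
            'fetch'
        );

        return $this->successResponse([$dataPacket->addTo([])]);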
    }
}

Engine Data Parameters

Fetch handlers should store relevant parameters for publish/update handlers:

Parameter  | Description               | Used By
source_url | Source URL of the content | Update handlers, logging
image_url  | URL of associated image   | Publish handlers with image support

Handler-Specific Engine Parameters

Different fetch handlers store different engine parameters:

  • Reddit: source_url (post URL), image_url (stored image URL)
  • WordPress Local: source_url (permalink), image_url (featured image URL)
  • WordPress API: source_url (post link), image_url (featured image URL)
  • WordPress Media: source_url (parent post permalink), image_url (media URL)
  • RSS: source_url (item link), image_url (enclosure URL)
  • Universal Web Scraper: source_url (page URL), image_url (detected image)
  • Google Sheets: source_url (empty), image_url (empty)
  • Files: image_url (public URL for images only)
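Because some handlers store empty values (Google Sheets) or only one of the two parameters (Files), code that consumes engine data may want to normalize it first. The following is a minimal sketch under that assumption; normalize_engine_data() is a hypothetical helper, not part of the FetchHandler API.

```php
<?php
/**
 * Reduce an engine data array to the two documented keys, substituting
 * empty strings for anything missing. Hypothetical helper for illustration.
 *
 * @return array{source_url: string, image_url: string}
 */
function normalize_engine_data(array $engine_data): array
{
    $defaults = ['source_url' => '', 'image_url' => ''];

    // Drop keys other than source_url and image_url.
    $known = array_intersect_key($engine_data, $defaults);

    // Fill in missing keys, then cast every value (including null) to string.
    return array_map('strval', array_merge($defaults, $known));
}

// A Files-style payload carries only image_url.
$engine_data = normalize_engine_data(['image_url' => 'https://example.com/a.png']);
$has_source  = $engine_data['source_url'] !== ''; // false for this payload
```

With both keys always present, a consumer can test for an empty string instead of checking for the key's existence each time.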

File Handling

For file-based fetch handlers, use the FilesRepository components:

use DataMachine\Core\FilesRepository\FileStorage;

$file_storage = new FileStorage();
$stored_path = $file_storage->store_file($file_content, $filename, $job_id);

Key Methods (@since v0.3.1)

applyExcludeKeywords()

Filter content based on negative keywords:

protected function applyExcludeKeywords(string $text, string $exclude_keywords): bool

Parameters:

  • $text: Text content to search
  • $exclude_keywords: Comma-separated list of keywords to exclude

Returns: true if any exclude keyword is found (item should be filtered out), false otherwise

Features:

  • Case-insensitive matching
  • Unicode-safe via mb_stripos()
  • Handles comma-separated keyword lists
  • Returns false for empty keyword lists
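The behavior listed above can be approximated by a standalone function. This is a sketch written from the description, not the actual method body: contains_excluded_keyword() is a hypothetical name, and trimming whitespace around each keyword is an assumption. It requires the mbstring extension.

```php
<?php
/**
 * Return true when $text contains any keyword from a comma-separated list.
 * Hypothetical re-creation of the documented applyExcludeKeywords() behavior.
 */
function contains_excluded_keyword(string $text, string $exclude_keywords): bool
{
    // Split on commas, trim whitespace (assumed), and drop empty entries.
    $keywords = array_filter(
        array_map('trim', explode(',', $exclude_keywords)),
        static fn (string $keyword): bool => $keyword !== ''
    );

    foreach ($keywords as $keyword) {
        // mb_stripos() performs a case-insensitive, multibyte-aware search.
        if (mb_stripos($text, $keyword) !== false) {
            return true;
        }
    }

    // No keywords given, or none matched.
    return false;
}
```

Matching is by substring, so a keyword such as "ad" would also match "advice"; whether the real method guards against this is not stated in this document.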

Benefits

  • Deduplication: Automatic prevention of duplicate processing
  • Consistency: Standardized response patterns across all fetch handlers
  • Engine Integration: Seamless data flow to downstream handlers
  • Error Handling: Centralized error response formatting
  • Maintainability: Reduced code duplication and consistent patterns
  • Negative Filtering (@since v0.3.1): Built-in exclude keyword filtering

Implementations

All fetch handlers extend this base class:

  • RSS Handler
  • Reddit Handler
  • Universal Web Scraper Handler
  • WordPress Local Handler
  • WordPress Media Handler
  • WordPress API Handler
  • Google Sheets Handler
  • Files Handler

See Fetch Handlers Overview for comparison.