Universal Web Scraper Handler

Last updated: January 26, 2026

Web scraping handler that extracts event data from arbitrary HTML pages using structured-data extractors first, then an HTML-section fallback.

Overview

The Universal Web Scraper handler prioritizes structured data extraction for maximum accuracy, falling back to HTML-section extraction when structured data is unavailable. It supports multiple extractor implementations, XPath-based event section detection, automatic pagination (up to MAX_PAGES = 20), and ProcessedItems tracking to prevent duplicate processing. It features a "Smart Fallback" mechanism that retries requests with standard headers if browser-mode requests are blocked by captchas (SiteGround/Cloudflare) or return 403 errors.

Features

Structured Data Extraction (Priority Order)

The handler registers 22 extraction layers (21 specialized extractors + HTML fallback) in UniversalWebScraper::getExtractors().

AEG/AXS JSON feed (AegAxsExtractor): Extracts events from AEG/AXS venue JSON feeds
Red Rocks (RedRocksExtractor): Parses Red Rocks Amphitheatre event pages (@since v0.8.0)
Freshtix (FreshtixExtractor): Parses embedded JavaScript event objects on Freshtix platform pages (@since v0.8.0)
Firebase Realtime Database (FirebaseExtractor): Detects Firebase SDK and fetches events from the Firebase REST API (@since v0.8.12)
Embedded Calendars (EmbeddedCalendarExtractor): Detects and scrapes embedded Google Calendar, SeeTickets, and Turntable widgets. Decodes Base64-encoded Google Calendar IDs. Consolidates legacy standalone GoogleCalendar handler logic. (@since v0.8.0)
Squarespace (SquarespaceExtractor): Extracts events from Static.SQUARESPACE_CONTEXT JavaScript objects. Enhanced to search for events in "Summary Blocks" and hidden template data if the main collection is empty. (@since v0.8.12)
Craftpeak (CraftpeakExtractor): Parses Craftpeak brewery/venue calendar pages
SpotHopper (SpotHopperExtractor): Auto-detects SpotHopper platform and extracts events from their public API (@since v0.8.12)
Gigwell (GigwellExtractor): Detects gigwell-gigstream embeds and extracts events from the Gigwell API
Bandzoogle (BandzoogleExtractor): Extracts events from Bandzoogle-powered venue calendar pages (@since v0.8.16)
GoDaddy (GoDaddyExtractor): Extracts events from GoDaddy Website Builder calendars with REST API detection (@since v0.8.16)
Timely (TimelyExtractor): Specialized extractor for Time.ly All-in-One Event Calendar platform (@since v0.8.18)
Elfsight Events (ElfsightEventsExtractor): Extracts events from Elfsight Event Calendar widgets embedded on venue websites (@since v0.8.30)
Schema.org JSON-LD (JsonLdExtractor): Parses <script type="application/ld+json">
External WordPress (WordPressExtractor): Extracts events from external WordPress sites via REST API or structured HTML. Consolidates legacy standalone WordPressEventsAPI handler logic. (@since v0.8.0)
Prekindle (PrekindleExtractor): Auto-detects Prekindle widgets/links and extracts high-fidelity data from their mobile API (@since v0.8.12)
Wix Events JSON (WixEventsExtractor): Extracts events from <script id="wix-warmup-data">
MusicItem (MusicItemExtractor): Extracts Schema.org MusicEvent items from structured data
RHP Events plugin HTML (RhpEventsExtractor): Extracts events from .rhpSingleEvent markup. Enhanced to merge page-level venue data (@since v0.8.21).
OpenDate.io (OpenDateExtractor): Two-step extraction (listing → detail page). Prioritizes React JSON datetime values over JSON-LD for improved time accuracy.
Schema.org Microdata (MicrodataExtractor): Parses itemtype/itemprop markup
HTML section extraction (fallback): Uses XPath selector rules to extract one candidate section at a time for downstream processing. Includes high-confidence selectors for brownbearsw.com (Sahara Lounge).

Page Venue Extraction

The handler includes a PageVenueExtractor (@since v0.8.18) that provides multi-layered venue metadata parsing from site headers and footers. This is particularly useful when event listings are sparse but the parent site contains rich venue information.

JSON-LD Support: Prioritizes MusicVenue, EntertainmentBusiness, and NightClub schema types found in site-wide headers or footers (@since v0.8.21).
Address Detection: Robust identification of addresses in Squarespace announcement bars, footers, and page titles. Handles non-standard US address strings more effectively.
Squarespace Map Blocks: Detects and extracts structured venue data from Squarespace map widgets.
Contextual Merging: Automatically merges discovered page-level venue data into event listings that lack complete address information (e.g., in the RhpEventsExtractor).

Wix Events Support

The handler automatically detects and parses Wix Events platform data:

Automatic Detection: Identifies wix-warmup-data script tags in page HTML
Full Event Data: Extracts title, dates, venue, location, coordinates, ticket URLs, and images
Timezone Handling: Properly converts dates using the event’s configured timezone
Ticket URL capture: Extractors capture ticket URLs when present in source data

Firebase Realtime Database Support

The handler extracts events from websites using Firebase Realtime Database for event storage:

Automatic Detection: Identifies Firebase SDK scripts and databaseURL config in page HTML
REST API Extraction: Fetches events directly from {databaseURL}/events.json
Published Filter: Only imports events where isPublished is true
Full Event Data: Extracts title, dates, description, ticket URLs, and poster images
Date Parsing: Handles Firebase JS date strings with timezone information

Squarespace Support

The handler extracts events from Squarespace platform websites, even when standard JSON-LD is missing or incomplete:

Automatic Detection: Identifies Static.SQUARESPACE_CONTEXT script tags in page HTML
Deep Extraction: Recursively searches the Squarespace context for userItems or items collections
Full Event Data: Extracts title, description, ticket URLs (via buttonLink or clickthroughUrl), and images (assetUrl)
Smart Date Parsing: Handles millisecond timestamps and ISO strings; falls back to parsing dates from description text
Ticket URL capture: Captures specific event call-to-action links common in Squarespace templates

SpotHopper Support

The handler supports seamless extraction from venues using the SpotHopper platform:

Automatic Detection: Detects SpotHopper scripts, widgets, or platform URLs in the page HTML
Automatic ID Extraction: Extracts the venue’s spot_id from JavaScript variables, static asset URLs, or widget configurations
API Extraction: Fetches structured event data directly from SpotHopper’s public API
Linked Data Parsing: Correctly resolves linked venue and image objects from the API response
Full Venue Data: Captures complete venue metadata including coordinates, phone, and address

Prekindle Support

The handler provides high-fidelity extraction for venues using the Prekindle ticketing platform:

Automatic Detection: Detects Prekindle widgets (pk-cal-widget) or organizer links in the page HTML
Automatic ID Extraction: Automatically "sniffs" the org_id from widget attributes or script source URLs
Hybrid Extraction: Fetches the Prekindle mobile grid widget which provides both Schema.org JSON-LD and precise HTML time blocks
Precise Timing: Scrapes the pk-times HTML blocks to capture door and show times that are often missing from standard JSON-LD
Full Metadata: Extracts title, description, ticket URLs, images, price, and organizer information

RHP Events Plugin Support

The handler extracts events from WordPress sites using the RHP Events plugin:

Automatic Detection: Identifies .rhpSingleEvent elements in page HTML
Event Data: Extracts title, date, doors/show times, venue, price, age restriction, and images
Year Detection: Parses month separators (e.g., "December 2025") to determine event year
Ticket Links: Captures external ticket URLs (Etix, etc.) and event detail URLs
Price Parsing: Handles advance and day-of pricing formats

Schema.org JSON-LD Extraction

Standard Compliance: Fully compatible with Schema.org Event specification
Comprehensive Coverage: Supports all major Schema.org Event properties
Nested Structures: Handles @graph structures and arrays of events

HTML Section Extraction (Fallback)

EventSectionFinder selects candidate event HTML sections using XPath rules
Extracted HTML is sent downstream for AI extraction

Processing Features

Engine Data Integration: Stores extracted venue and event data in pipeline engine data
Duplicate Prevention: Uses ProcessedItems tracking (via handler base methods) to skip already-processed identifiers
Automatic Pagination: Follows "next page" links to traverse multi-page event listings (MAX_PAGES=20)
XPath Selector Rules: Prioritized selector system for identifying candidate event sections

Data Flow

When structured data (Wix or JSON-LD) is found:

Extract and normalize event data to standardized format
Apply venue config override if configured in handler settings
Store venue metadata via EventEngineData::storeVenueContext()
Store core event fields (title, dates, ticketUrl, etc.) in engine data
Create DataPacket with both event (normalized) and raw_source (original)
AI step receives pre-extracted data for description generation and enrichment

When falling back to HTML parsing:

EventSectionFinder locates event HTML sections using XPath rules
Raw HTML is sent to AI for extraction
AI extracts all event fields from unstructured content

Configuration

Required Settings

Target URL: The web page URL to scrape for event data

Optional Settings

Venue Selection: Select existing venue or create new with address details
Keyword Filtering: Include/exclude events based on keywords

Venue Configuration

The handler uses VenueFieldsTrait for venue configuration:

Venue Dropdown: Select from existing venue taxonomy terms
Create New Venue: Enter venue name and address details
Address Autocomplete: Venue fields support coordinate lookup via the plugin’s OpenStreetMap/Nominatim-based geocoding endpoint
Override Behavior: Handler venue config overrides extracted venue data

Configuration Notes

Required: source_url
Optional: search, exclude_keywords, and venue override fields (via VenueFieldsTrait in the settings UI)

This handler returns a single eligible event per run (or a single HTML section for downstream extraction), keeping pipeline imports incremental.

Supported Platforms

Wix Events

Websites built on Wix platform with the Wix Events widget:

Event listings with embedded JSON data
Full venue and location information
External ticketing links when present
Event images and descriptions

RHP Events (WordPress Plugin)

WordPress sites using the RHP Events plugin for venue event management:

Structured HTML with consistent CSS classes
Event listings with date, time, price, and venue information
External ticket links (Etix, Ticketmaster, etc.)
Age restriction information

Schema.org Compliant Sites

Any website implementing Schema.org Event markup:

JSON-LD structured data in script tags
Schema.org microdata in HTML elements
Standard Event, MusicEvent, Festival, etc. types

Firebase Sites

Websites using Firebase Realtime Database for event management:

Single-page applications that load events via JavaScript
Sites with Firebase SDK and public database configuration
Event data stored in /events path with metadata structure

Custom HTML Sites

Websites without structured data (AI fallback):

Table-based event calendars
Event card/list layouts
Widget-based calendars (Google Calendar, SeeTickets, Turntable Tickets)

Integration Architecture

Handler Structure

UniversalWebScraper.php: Main handler with structured data extraction and AI fallback
StructuredDataProcessor.php: Common processing logic for all extractors
EventSectionFinder.php: Locates eligible event sections using XPath selector rules
EventSectionSelectors.php: Defines prioritized XPath rules for event section detection
UniversalWebScraperSettings.php: Admin configuration interface using VenueFieldsTrait

Extractors (Extractors/ directory)

ExtractorInterface.php: Contract for all extractor implementations
AegAxsExtractor.php: Parses AEG/AXS JSON feeds
RedRocksExtractor.php: Parses Red Rocks Amphitheatre event pages
FreshtixExtractor.php: Parses Freshtix platform pages
FirebaseExtractor.php: Fetches events from Firebase Realtime Database REST API
SquarespaceExtractor.php: Parses Squarespace context JSON
SpotHopperExtractor.php: Detects SpotHopper and parses their API
WixEventsExtractor.php: Parses Wix warmup-data JSON
RhpEventsExtractor.php: Parses RHP Events plugin HTML
OpenDateExtractor.php: Handles OpenDate.io listing/detail extraction
JsonLdExtractor.php: Parses Schema.org JSON-LD script tags
MicrodataExtractor.php: Parses Schema.org microdata attributes
ElfsightEventsExtractor.php: Parses Elfsight Event Calendar widget data
CraftpeakExtractor.php: Parses Craftpeak brewery/venue calendar pages
MusicItemExtractor.php: Extracts Schema.org MusicEvent items

DataPacket Format

When structured data is extracted:

{
  "event": {
    "title": "Event Name",
    "startDate": "2024-01-15",
    "startTime": "19:00",
    "venue": "Venue Name",
    "ticketUrl": "https://..."
  },
  "raw_source": { /* Original Wix/JSON-LD data */ },
  "venue_metadata": {
    "venueAddress": "123 Main St",
    "venueCity": "City",
    "venueState": "ST"
  },
  "import_source": "universal_web_scraper",
  "extraction_method": "wix_events"
}

When using HTML fallback:

{
  "raw_html": "<div class="event">...</div>",
  "source_url": "https://...",
  "import_source": "universal_web_scraper"
}

Pagination Details

Automatic Link Following: Detects "next page" links using multiple patterns (rel="next", common navigation selectors)
Domain Validation: Only follows links on the same domain to prevent crawl drift
URL Tracking: Maintains visited URL tracking to prevent re-scraping the same page
Max Pages Limit: Respects MAX_PAGES=20 constant to prevent infinite loops

Data Output

Structured extraction returns a DataPacket with JSON body containing:
- event
- venue_metadata
- import_source
- extraction_method
HTML section fallback returns a DataPacket containing raw_html + source_url for downstream extraction.

Supported Sources

Wix (Wix Events)
Squarespace (Context JS)
SpotHopper (API)
Prekindle (Hybrid API/HTML)
Red Rocks Amphitheatre
Freshtix platform sites
Firebase Realtime Database sites
WordPress sites using the RHP Events plugin
OpenDate.io calendars
Sites with Schema.org Event JSON-LD or microdata
AEG/AXS venue feeds embedded/linked from venue pages
Google Calendar (via EmbeddedCalendarExtractor)
External WordPress Events (via WordPressExtractor)