Universal Web Scraper Handler

Web scraping handler that extracts event data from arbitrary HTML pages using structured-data extractors first, then an HTML-section fallback.

Overview

The Universal Web Scraper handler prioritizes structured data extraction for maximum accuracy, falling back to HTML-section extraction when structured data is unavailable. It supports multiple extractor implementations, XPath-based event section detection, automatic pagination (up to MAX_PAGES = 20), and ProcessedItems tracking to prevent duplicate processing. It features a "Smart Fallback" mechanism that retries requests with standard headers if browser-mode requests are blocked by captchas (SiteGround/Cloudflare) or return 403 errors.

Features

Structured Data Extraction (Priority Order)

The handler registers 22 extraction layers (21 specialized extractors + HTML fallback) in UniversalWebScraper::getExtractors().

  1. AEG/AXS JSON feed (AegAxsExtractor): Extracts events from AEG/AXS venue JSON feeds
  2. Red Rocks (RedRocksExtractor): Parses Red Rocks Amphitheatre event pages (@since v0.8.0)
  3. Freshtix (FreshtixExtractor): Parses embedded JavaScript event objects on Freshtix platform pages (@since v0.8.0)
  4. Firebase Realtime Database (FirebaseExtractor): Detects Firebase SDK and fetches events from the Firebase REST API (@since v0.8.12)
  5. Embedded Calendars (EmbeddedCalendarExtractor): Detects and scrapes embedded Google Calendar, SeeTickets, and Turntable widgets. Decodes Base64-encoded Google Calendar IDs. Consolidates legacy standalone GoogleCalendar handler logic. (@since v0.8.0)
  6. Squarespace (SquarespaceExtractor): Extracts events from Static.SQUARESPACE_CONTEXT JavaScript objects. Enhanced to search for events in "Summary Blocks" and hidden template data if the main collection is empty. (@since v0.8.12)
  7. Craftpeak (CraftpeakExtractor): Parses Craftpeak brewery/venue calendar pages
  8. SpotHopper (SpotHopperExtractor): Auto-detects SpotHopper platform and extracts events from their public API (@since v0.8.12)
  9. Gigwell (GigwellExtractor): Detects gigwell-gigstream embeds and extracts events from the Gigwell API
  10. Bandzoogle (BandzoogleExtractor): Extracts events from Bandzoogle-powered venue calendar pages (@since v0.8.16)
  11. GoDaddy (GoDaddyExtractor): Extracts events from GoDaddy Website Builder calendars with REST API detection (@since v0.8.16)
  12. Timely (TimelyExtractor): Specialized extractor for Time.ly All-in-One Event Calendar platform (@since v0.8.18)
  13. Elfsight Events (ElfsightEventsExtractor): Extracts events from Elfsight Event Calendar widgets embedded on venue websites (@since v0.8.30)
  14. Schema.org JSON-LD (JsonLdExtractor): Parses <script type="application/ld+json">
  15. External WordPress (WordPressExtractor): Extracts events from external WordPress sites via REST API or structured HTML. Consolidates legacy standalone WordPressEventsAPI handler logic. (@since v0.8.0)
  16. Prekindle (PrekindleExtractor): Auto-detects Prekindle widgets/links and extracts high-fidelity data from their mobile API (@since v0.8.12)
  17. Wix Events JSON (WixEventsExtractor): Extracts events from <script id="wix-warmup-data">
  18. MusicItem (MusicItemExtractor): Extracts Schema.org MusicEvent items from structured data
  19. RHP Events plugin HTML (RhpEventsExtractor): Extracts events from .rhpSingleEvent markup. Enhanced to merge page-level venue data (@since v0.8.21).
  20. OpenDate.io (OpenDateExtractor): Two-step extraction (listing → detail page). Prioritizes React JSON datetime values over JSON-LD for improved time accuracy.
  21. Schema.org Microdata (MicrodataExtractor): Parses itemtype/itemprop markup
  22. HTML section extraction (fallback): Uses XPath selector rules to extract one candidate section at a time for downstream processing. Includes high-confidence selectors for brownbearsw.com (Sahara Lounge).

Page Venue Extraction

The handler includes a PageVenueExtractor (@since v0.8.18) that provides multi-layered venue metadata parsing from site headers and footers. This is particularly useful when event listings are sparse but the parent site contains rich venue information.

  • JSON-LD Support: Prioritizes MusicVenue, EntertainmentBusiness, and NightClub schema types found in site-wide headers or footers (@since v0.8.21).
  • Address Detection: Robust identification of addresses in Squarespace announcement bars, footers, and page titles. Handles non-standard US address strings more effectively.
  • Squarespace Map Blocks: Detects and extracts structured venue data from Squarespace map widgets.
  • Contextual Merging: Automatically merges discovered page-level venue data into event listings that lack complete address information (e.g., in the RhpEventsExtractor).

Wix Events Support

The handler automatically detects and parses Wix Events platform data:

  • Automatic Detection: Identifies wix-warmup-data script tags in page HTML
  • Full Event Data: Extracts title, dates, venue, location, coordinates, ticket URLs, and images
  • Timezone Handling: Properly converts dates using the event’s configured timezone
  • Ticket URL capture: Extractors capture ticket URLs when present in source data

Firebase Realtime Database Support

The handler extracts events from websites using Firebase Realtime Database for event storage:

  • Automatic Detection: Identifies Firebase SDK scripts and databaseURL config in page HTML
  • REST API Extraction: Fetches events directly from {databaseURL}/events.json
  • Published Filter: Only imports events where isPublished is true
  • Full Event Data: Extracts title, dates, description, ticket URLs, and poster images
  • Date Parsing: Handles Firebase JS date strings with timezone information

Squarespace Support

The handler extracts events from Squarespace platform websites, even when standard JSON-LD is missing or incomplete:

  • Automatic Detection: Identifies Static.SQUARESPACE_CONTEXT script tags in page HTML
  • Deep Extraction: Recursively searches the Squarespace context for userItems or items collections
  • Full Event Data: Extracts title, description, ticket URLs (via buttonLink or clickthroughUrl), and images (assetUrl)
  • Smart Date Parsing: Handles millisecond timestamps and ISO strings; falls back to parsing dates from description text
  • Ticket URL capture: Captures specific event call-to-action links common in Squarespace templates

SpotHopper Support

The handler supports seamless extraction from venues using the SpotHopper platform:

  • Automatic Detection: Detects SpotHopper scripts, widgets, or platform URLs in the page HTML
  • Automatic ID Extraction: Extracts the venue’s spot_id from JavaScript variables, static asset URLs, or widget configurations
  • API Extraction: Fetches structured event data directly from SpotHopper’s public API
  • Linked Data Parsing: Correctly resolves linked venue and image objects from the API response
  • Full Venue Data: Captures complete venue metadata including coordinates, phone, and address

Prekindle Support

The handler provides high-fidelity extraction for venues using the Prekindle ticketing platform:

  • Automatic Detection: Detects Prekindle widgets (pk-cal-widget) or organizer links in the page HTML
  • Automatic ID Extraction: Automatically "sniffs" the org_id from widget attributes or script source URLs
  • Hybrid Extraction: Fetches the Prekindle mobile grid widget which provides both Schema.org JSON-LD and precise HTML time blocks
  • Precise Timing: Scrapes the pk-times HTML blocks to capture door and show times that are often missing from standard JSON-LD
  • Full Metadata: Extracts title, description, ticket URLs, images, price, and organizer information

RHP Events Plugin Support

The handler extracts events from WordPress sites using the RHP Events plugin:

  • Automatic Detection: Identifies .rhpSingleEvent elements in page HTML
  • Event Data: Extracts title, date, doors/show times, venue, price, age restriction, and images
  • Year Detection: Parses month separators (e.g., "December 2025") to determine event year
  • Ticket Links: Captures external ticket URLs (Etix, etc.) and event detail URLs
  • Price Parsing: Handles advance and day-of pricing formats

Schema.org JSON-LD Extraction

  • Standard Compliance: Fully compatible with Schema.org Event specification
  • Comprehensive Coverage: Supports all major Schema.org Event properties
  • Nested Structures: Handles @graph structures and arrays of events

HTML Section Extraction (Fallback)

  • EventSectionFinder selects candidate event HTML sections using XPath rules
  • Extracted HTML is sent downstream for AI extraction

Processing Features

  • Engine Data Integration: Stores extracted venue and event data in pipeline engine data
  • Duplicate Prevention: Uses ProcessedItems tracking (via handler base methods) to skip already-processed identifiers
  • Automatic Pagination: Follows "next page" links to traverse multi-page event listings (MAX_PAGES=20)
  • XPath Selector Rules: Prioritized selector system for identifying candidate event sections

Data Flow

When structured data (Wix or JSON-LD) is found:

  1. Extract and normalize event data to standardized format
  2. Apply venue config override if configured in handler settings
  3. Store venue metadata via EventEngineData::storeVenueContext()
  4. Store core event fields (title, dates, ticketUrl, etc.) in engine data
  5. Create DataPacket with both event (normalized) and raw_source (original)
  6. AI step receives pre-extracted data for description generation and enrichment

When falling back to HTML parsing:

  1. EventSectionFinder locates event HTML sections using XPath rules
  2. Raw HTML is sent to AI for extraction
  3. AI extracts all event fields from unstructured content

Configuration

Required Settings

  • Target URL: The web page URL to scrape for event data

Optional Settings

  • Venue Selection: Select existing venue or create new with address details
  • Keyword Filtering: Include/exclude events based on keywords

Venue Configuration

The handler uses VenueFieldsTrait for venue configuration:

  • Venue Dropdown: Select from existing venue taxonomy terms
  • Create New Venue: Enter venue name and address details
  • Address Autocomplete: Venue fields support coordinate lookup via the plugin’s OpenStreetMap/Nominatim-based geocoding endpoint
  • Override Behavior: Handler venue config overrides extracted venue data

Configuration Notes

  • Required: source_url
  • Optional: search, exclude_keywords, and venue override fields (via VenueFieldsTrait in the settings UI)

This handler returns a single eligible event per run (or a single HTML section for downstream extraction), keeping pipeline imports incremental.

Supported Platforms

Wix Events

Websites built on Wix platform with the Wix Events widget:

  • Event listings with embedded JSON data
  • Full venue and location information
  • External ticketing links when present
  • Event images and descriptions

RHP Events (WordPress Plugin)

WordPress sites using the RHP Events plugin for venue event management:

  • Structured HTML with consistent CSS classes
  • Event listings with date, time, price, and venue information
  • External ticket links (Etix, Ticketmaster, etc.)
  • Age restriction information

Schema.org Compliant Sites

Any website implementing Schema.org Event markup:

  • JSON-LD structured data in script tags
  • Schema.org microdata in HTML elements
  • Standard Event, MusicEvent, Festival, etc. types

Firebase Sites

Websites using Firebase Realtime Database for event management:

  • Single-page applications that load events via JavaScript
  • Sites with Firebase SDK and public database configuration
  • Event data stored in /events path with metadata structure

Custom HTML Sites

Websites without structured data (AI fallback):

  • Table-based event calendars
  • Event card/list layouts
  • Widget-based calendars (Google Calendar, SeeTickets, Turntable Tickets)

Integration Architecture

Handler Structure

  • UniversalWebScraper.php: Main handler with structured data extraction and AI fallback
  • StructuredDataProcessor.php: Common processing logic for all extractors
  • EventSectionFinder.php: Locates eligible event sections using XPath selector rules
  • EventSectionSelectors.php: Defines prioritized XPath rules for event section detection
  • UniversalWebScraperSettings.php: Admin configuration interface using VenueFieldsTrait

Extractors (Extractors/ directory)

  • ExtractorInterface.php: Contract for all extractor implementations
  • AegAxsExtractor.php: Parses AEG/AXS JSON feeds
  • RedRocksExtractor.php: Parses Red Rocks Amphitheatre event pages
  • FreshtixExtractor.php: Parses Freshtix platform pages
  • FirebaseExtractor.php: Fetches events from Firebase Realtime Database REST API
  • SquarespaceExtractor.php: Parses Squarespace context JSON
  • SpotHopperExtractor.php: Detects SpotHopper and parses their API
  • WixEventsExtractor.php: Parses Wix warmup-data JSON
  • RhpEventsExtractor.php: Parses RHP Events plugin HTML
  • OpenDateExtractor.php: Handles OpenDate.io listing/detail extraction
  • JsonLdExtractor.php: Parses Schema.org JSON-LD script tags
  • MicrodataExtractor.php: Parses Schema.org microdata attributes
  • ElfsightEventsExtractor.php: Parses Elfsight Event Calendar widget data
  • CraftpeakExtractor.php: Parses Craftpeak brewery/venue calendar pages
  • MusicItemExtractor.php: Extracts Schema.org MusicEvent items

DataPacket Format

When structured data is extracted:

{
  "event": {
    "title": "Event Name",
    "startDate": "2024-01-15",
    "startTime": "19:00",
    "venue": "Venue Name",
    "ticketUrl": "https://..."
  },
  "raw_source": { /* Original Wix/JSON-LD data */ },
  "venue_metadata": {
    "venueAddress": "123 Main St",
    "venueCity": "City",
    "venueState": "ST"
  },
  "import_source": "universal_web_scraper",
  "extraction_method": "wix_events"
}

When using HTML fallback:

{
  "raw_html": "<div class="event">...</div>",
  "source_url": "https://...",
  "import_source": "universal_web_scraper"
}

Pagination Details

  • Automatic Link Following: Detects "next page" links using multiple patterns (rel="next", common navigation selectors)
  • Domain Validation: Only follows links on the same domain to prevent crawl drift
  • URL Tracking: Maintains visited URL tracking to prevent re-scraping the same page
  • Max Pages Limit: Respects MAX_PAGES=20 constant to prevent infinite loops

Data Output

  • Structured extraction returns a DataPacket with JSON body containing:
    • event
    • venue_metadata
    • import_source
    • extraction_method
  • HTML section fallback returns a DataPacket containing raw_html + source_url for downstream extraction.

Supported Sources

  • Wix (Wix Events)
  • Squarespace (Context JS)
  • SpotHopper (API)
  • Prekindle (Hybrid API/HTML)
  • Red Rocks Amphitheatre
  • Freshtix platform sites
  • Firebase Realtime Database sites
  • WordPress sites using the RHP Events plugin
  • OpenDate.io calendars
  • Sites with Schema.org Event JSON-LD or microdata
  • AEG/AXS venue feeds embedded/linked from venue pages
  • Google Calendar (via EmbeddedCalendarExtractor)
  • External WordPress Events (via WordPressExtractor)