Universal Web Scraper Handler
Web scraping handler that extracts event data from arbitrary HTML pages using structured-data extractors first, then an HTML-section fallback.
Overview
The Universal Web Scraper handler prioritizes structured data extraction for maximum accuracy, falling back to HTML-section extraction when structured data is unavailable. It supports multiple extractor implementations, XPath-based event section detection, automatic pagination (up to MAX_PAGES = 20), and ProcessedItems tracking to prevent duplicate processing. It features a "Smart Fallback" mechanism that retries requests with standard headers if browser-mode requests are blocked by captchas (SiteGround/Cloudflare) or return 403 errors.
Features
Structured Data Extraction (Priority Order)
The handler registers 22 extraction layers (21 specialized extractors + HTML fallback) in UniversalWebScraper::getExtractors().
- AEG/AXS JSON feed (
AegAxsExtractor): Extracts events from AEG/AXS venue JSON feeds - Red Rocks (
RedRocksExtractor): Parses Red Rocks Amphitheatre event pages (@since v0.8.0) - Freshtix (
FreshtixExtractor): Parses embedded JavaScript event objects on Freshtix platform pages (@since v0.8.0) - Firebase Realtime Database (
FirebaseExtractor): Detects Firebase SDK and fetches events from the Firebase REST API (@since v0.8.12) - Embedded Calendars (
EmbeddedCalendarExtractor): Detects and scrapes embedded Google Calendar, SeeTickets, and Turntable widgets. Decodes Base64-encoded Google Calendar IDs. Consolidates legacy standaloneGoogleCalendarhandler logic. (@since v0.8.0) - Squarespace (
SquarespaceExtractor): Extracts events fromStatic.SQUARESPACE_CONTEXTJavaScript objects. Enhanced to search for events in "Summary Blocks" and hidden template data if the main collection is empty. (@since v0.8.12) - Craftpeak (
CraftpeakExtractor): Parses Craftpeak brewery/venue calendar pages - SpotHopper (
SpotHopperExtractor): Auto-detects SpotHopper platform and extracts events from their public API (@since v0.8.12) - Gigwell (
GigwellExtractor): Detects gigwell-gigstream embeds and extracts events from the Gigwell API - Bandzoogle (
BandzoogleExtractor): Extracts events from Bandzoogle-powered venue calendar pages (@since v0.8.16) - GoDaddy (
GoDaddyExtractor): Extracts events from GoDaddy Website Builder calendars with REST API detection (@since v0.8.16) - Timely (
TimelyExtractor): Specialized extractor for Time.ly All-in-One Event Calendar platform (@since v0.8.18) - Elfsight Events (
ElfsightEventsExtractor): Extracts events from Elfsight Event Calendar widgets embedded on venue websites (@since v0.8.30) - Schema.org JSON-LD (
JsonLdExtractor): Parses<script type="application/ld+json"> - External WordPress (
WordPressExtractor): Extracts events from external WordPress sites via REST API or structured HTML. Consolidates legacy standaloneWordPressEventsAPIhandler logic. (@since v0.8.0) - Prekindle (
PrekindleExtractor): Auto-detects Prekindle widgets/links and extracts high-fidelity data from their mobile API (@since v0.8.12) - Wix Events JSON (
WixEventsExtractor): Extracts events from<script id="wix-warmup-data"> - MusicItem (
MusicItemExtractor): Extracts Schema.org MusicEvent items from structured data - RHP Events plugin HTML (
RhpEventsExtractor): Extracts events from.rhpSingleEventmarkup. Enhanced to merge page-level venue data (@since v0.8.21). - OpenDate.io (
OpenDateExtractor): Two-step extraction (listing → detail page). Prioritizes React JSON datetime values over JSON-LD for improved time accuracy. - Schema.org Microdata (
MicrodataExtractor): Parses itemtype/itemprop markup - HTML section extraction (fallback): Uses XPath selector rules to extract one candidate section at a time for downstream processing. Includes high-confidence selectors for
brownbearsw.com(Sahara Lounge).
Page Venue Extraction
The handler includes a PageVenueExtractor (@since v0.8.18) that provides multi-layered venue metadata parsing from site headers and footers. This is particularly useful when event listings are sparse but the parent site contains rich venue information.
- JSON-LD Support: Prioritizes
MusicVenue,EntertainmentBusiness, andNightClubschema types found in site-wide headers or footers (@since v0.8.21). - Address Detection: Robust identification of addresses in Squarespace announcement bars, footers, and page titles. Handles non-standard US address strings more effectively.
- Squarespace Map Blocks: Detects and extracts structured venue data from Squarespace map widgets.
- Contextual Merging: Automatically merges discovered page-level venue data into event listings that lack complete address information (e.g., in the
RhpEventsExtractor).
Wix Events Support
The handler automatically detects and parses Wix Events platform data:
- Automatic Detection: Identifies
wix-warmup-datascript tags in page HTML - Full Event Data: Extracts title, dates, venue, location, coordinates, ticket URLs, and images
- Timezone Handling: Properly converts dates using the event’s configured timezone
- Ticket URL capture: Extractors capture ticket URLs when present in source data
Firebase Realtime Database Support
The handler extracts events from websites using Firebase Realtime Database for event storage:
- Automatic Detection: Identifies Firebase SDK scripts and
databaseURLconfig in page HTML - REST API Extraction: Fetches events directly from
{databaseURL}/events.json - Published Filter: Only imports events where
isPublishedis true - Full Event Data: Extracts title, dates, description, ticket URLs, and poster images
- Date Parsing: Handles Firebase JS date strings with timezone information
Squarespace Support
The handler extracts events from Squarespace platform websites, even when standard JSON-LD is missing or incomplete:
- Automatic Detection: Identifies
Static.SQUARESPACE_CONTEXTscript tags in page HTML - Deep Extraction: Recursively searches the Squarespace context for
userItemsoritemscollections - Full Event Data: Extracts title, description, ticket URLs (via
buttonLinkorclickthroughUrl), and images (assetUrl) - Smart Date Parsing: Handles millisecond timestamps and ISO strings; falls back to parsing dates from description text
- Ticket URL capture: Captures specific event call-to-action links common in Squarespace templates
SpotHopper Support
The handler supports seamless extraction from venues using the SpotHopper platform:
- Automatic Detection: Detects SpotHopper scripts, widgets, or platform URLs in the page HTML
- Automatic ID Extraction: Extracts the venue’s
spot_idfrom JavaScript variables, static asset URLs, or widget configurations - API Extraction: Fetches structured event data directly from SpotHopper’s public API
- Linked Data Parsing: Correctly resolves linked venue and image objects from the API response
- Full Venue Data: Captures complete venue metadata including coordinates, phone, and address
Prekindle Support
The handler provides high-fidelity extraction for venues using the Prekindle ticketing platform:
- Automatic Detection: Detects Prekindle widgets (
pk-cal-widget) or organizer links in the page HTML - Automatic ID Extraction: Automatically "sniffs" the
org_idfrom widget attributes or script source URLs - Hybrid Extraction: Fetches the Prekindle mobile grid widget which provides both Schema.org JSON-LD and precise HTML time blocks
- Precise Timing: Scrapes the
pk-timesHTML blocks to capture door and show times that are often missing from standard JSON-LD - Full Metadata: Extracts title, description, ticket URLs, images, price, and organizer information
RHP Events Plugin Support
The handler extracts events from WordPress sites using the RHP Events plugin:
- Automatic Detection: Identifies
.rhpSingleEventelements in page HTML - Event Data: Extracts title, date, doors/show times, venue, price, age restriction, and images
- Year Detection: Parses month separators (e.g., "December 2025") to determine event year
- Ticket Links: Captures external ticket URLs (Etix, etc.) and event detail URLs
- Price Parsing: Handles advance and day-of pricing formats
Schema.org JSON-LD Extraction
- Standard Compliance: Fully compatible with Schema.org Event specification
- Comprehensive Coverage: Supports all major Schema.org Event properties
- Nested Structures: Handles @graph structures and arrays of events
HTML Section Extraction (Fallback)
- EventSectionFinder selects candidate event HTML sections using XPath rules
- Extracted HTML is sent downstream for AI extraction
Processing Features
- Engine Data Integration: Stores extracted venue and event data in pipeline engine data
- Duplicate Prevention: Uses ProcessedItems tracking (via handler base methods) to skip already-processed identifiers
- Automatic Pagination: Follows "next page" links to traverse multi-page event listings (MAX_PAGES=20)
- XPath Selector Rules: Prioritized selector system for identifying candidate event sections
Data Flow
When structured data (Wix or JSON-LD) is found:
- Extract and normalize event data to standardized format
- Apply venue config override if configured in handler settings
- Store venue metadata via
EventEngineData::storeVenueContext() - Store core event fields (title, dates, ticketUrl, etc.) in engine data
- Create DataPacket with both
event(normalized) andraw_source(original) - AI step receives pre-extracted data for description generation and enrichment
When falling back to HTML parsing:
- EventSectionFinder locates event HTML sections using XPath rules
- Raw HTML is sent to AI for extraction
- AI extracts all event fields from unstructured content
Configuration
Required Settings
- Target URL: The web page URL to scrape for event data
Optional Settings
- Venue Selection: Select existing venue or create new with address details
- Keyword Filtering: Include/exclude events based on keywords
Venue Configuration
The handler uses VenueFieldsTrait for venue configuration:
- Venue Dropdown: Select from existing venue taxonomy terms
- Create New Venue: Enter venue name and address details
- Address Autocomplete: Venue fields support coordinate lookup via the plugin’s OpenStreetMap/Nominatim-based geocoding endpoint
- Override Behavior: Handler venue config overrides extracted venue data
Configuration Notes
- Required:
source_url - Optional:
search,exclude_keywords, and venue override fields (viaVenueFieldsTraitin the settings UI)
This handler returns a single eligible event per run (or a single HTML section for downstream extraction), keeping pipeline imports incremental.
Supported Platforms
Wix Events
Websites built on Wix platform with the Wix Events widget:
- Event listings with embedded JSON data
- Full venue and location information
- External ticketing links when present
- Event images and descriptions
RHP Events (WordPress Plugin)
WordPress sites using the RHP Events plugin for venue event management:
- Structured HTML with consistent CSS classes
- Event listings with date, time, price, and venue information
- External ticket links (Etix, Ticketmaster, etc.)
- Age restriction information
Schema.org Compliant Sites
Any website implementing Schema.org Event markup:
- JSON-LD structured data in script tags
- Schema.org microdata in HTML elements
- Standard Event, MusicEvent, Festival, etc. types
Firebase Sites
Websites using Firebase Realtime Database for event management:
- Single-page applications that load events via JavaScript
- Sites with Firebase SDK and public database configuration
- Event data stored in
/eventspath withmetadatastructure
Custom HTML Sites
Websites without structured data (AI fallback):
- Table-based event calendars
- Event card/list layouts
- Widget-based calendars (Google Calendar, SeeTickets, Turntable Tickets)
Integration Architecture
Handler Structure
- UniversalWebScraper.php: Main handler with structured data extraction and AI fallback
- StructuredDataProcessor.php: Common processing logic for all extractors
- EventSectionFinder.php: Locates eligible event sections using XPath selector rules
- EventSectionSelectors.php: Defines prioritized XPath rules for event section detection
- UniversalWebScraperSettings.php: Admin configuration interface using VenueFieldsTrait
Extractors (Extractors/ directory)
- ExtractorInterface.php: Contract for all extractor implementations
- AegAxsExtractor.php: Parses AEG/AXS JSON feeds
- RedRocksExtractor.php: Parses Red Rocks Amphitheatre event pages
- FreshtixExtractor.php: Parses Freshtix platform pages
- FirebaseExtractor.php: Fetches events from Firebase Realtime Database REST API
- SquarespaceExtractor.php: Parses Squarespace context JSON
- SpotHopperExtractor.php: Detects SpotHopper and parses their API
- WixEventsExtractor.php: Parses Wix warmup-data JSON
- RhpEventsExtractor.php: Parses RHP Events plugin HTML
- OpenDateExtractor.php: Handles OpenDate.io listing/detail extraction
- JsonLdExtractor.php: Parses Schema.org JSON-LD script tags
- MicrodataExtractor.php: Parses Schema.org microdata attributes
- ElfsightEventsExtractor.php: Parses Elfsight Event Calendar widget data
- CraftpeakExtractor.php: Parses Craftpeak brewery/venue calendar pages
- MusicItemExtractor.php: Extracts Schema.org MusicEvent items
DataPacket Format
When structured data is extracted:
{
"event": {
"title": "Event Name",
"startDate": "2024-01-15",
"startTime": "19:00",
"venue": "Venue Name",
"ticketUrl": "https://..."
},
"raw_source": { /* Original Wix/JSON-LD data */ },
"venue_metadata": {
"venueAddress": "123 Main St",
"venueCity": "City",
"venueState": "ST"
},
"import_source": "universal_web_scraper",
"extraction_method": "wix_events"
}
When using HTML fallback:
{
"raw_html": "<div class="event">...</div>",
"source_url": "https://...",
"import_source": "universal_web_scraper"
}
Pagination Details
- Automatic Link Following: Detects "next page" links using multiple patterns (rel="next", common navigation selectors)
- Domain Validation: Only follows links on the same domain to prevent crawl drift
- URL Tracking: Maintains visited URL tracking to prevent re-scraping the same page
- Max Pages Limit: Respects MAX_PAGES=20 constant to prevent infinite loops
Data Output
- Structured extraction returns a
DataPacketwith JSONbodycontaining:eventvenue_metadataimport_sourceextraction_method
- HTML section fallback returns a
DataPacketcontainingraw_html+source_urlfor downstream extraction.
Supported Sources
- Wix (Wix Events)
- Squarespace (Context JS)
- SpotHopper (API)
- Prekindle (Hybrid API/HTML)
- Red Rocks Amphitheatre
- Freshtix platform sites
- Firebase Realtime Database sites
- WordPress sites using the RHP Events plugin
- OpenDate.io calendars
- Sites with Schema.org Event JSON-LD or microdata
- AEG/AXS venue feeds embedded/linked from venue pages
- Google Calendar (via EmbeddedCalendarExtractor)
- External WordPress Events (via WordPressExtractor)