Handling Late-Arriving Data in Streaming Pipelines

by Abdelkader Bekhti, Production AI & Data Architect

The Challenge: Managing Late-Arriving Data in Real-Time Streams

Organizations face the critical challenge of handling late-arriving data in streaming pipelines while maintaining data accuracy and completeness. Traditional streaming systems often miss or incorrectly process data that arrives out of order, leading to incomplete analytics and incorrect business decisions.

This solution combines watermarking, windowing, and intelligent reprocessing to achieve 99% data completeness while preserving real-time processing capabilities.

Late-Arriving Data Architecture: Watermarking & Windowing

Our solution delivers 99% data completeness with intelligent late data handling. Here's the architecture:

Streaming Layer

  • Kafka Streaming: Real-time data ingestion with ordering
  • Debezium CDC: Change data capture with timestamp tracking
  • Watermarking: Event time processing with late data tolerance
  • Windowing: Time-based data aggregation with late data handling

Processing Layer

  • DBT Processing: Intelligent data transformation with late data logic
  • Reprocessing Pipeline: Automatic handling of late-arriving data
  • Completeness Monitoring: Real-time data completeness tracking
  • Alert System: Notifications for data quality issues

Late-Arriving Data Architecture

[Architecture diagram: 99% data completeness through watermarking (event time processing), windowing (time-based aggregation), and intelligent late data handling]

Streaming Layer

  • Kafka real-time ingestion
  • Debezium CDC with timestamps
  • Event ordering and tracking
  • Out-of-order data handling

Processing Layer

  • Watermarking for late data tolerance
  • Windowing for time-based aggregation
  • DBT intelligent transformation
  • Late data logic implementation

Monitoring Layer

  • Real-time completeness tracking
  • Data quality alert system
  • 99% data completeness
  • Automated quality checks

Technical Implementation

1. Kafka Watermarking Strategy

The streaming layer implements sophisticated watermark management:

Watermark Computation:

  • Track the maximum event timestamp seen across all partitions
  • Update watermark as new events arrive
  • Events with timestamps earlier than (watermark - tolerance) are classified as late
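The watermark computation above can be sketched as a small tracker. This is a minimal illustration, not the production implementation; the `WatermarkTracker` name and method signatures are hypothetical:

```python
from datetime import datetime, timedelta

class WatermarkTracker:
    """Tracks the maximum event timestamp seen and derives the watermark.

    Sketch only: the watermark is the max observed event time, and events
    older than (watermark - tolerance) are classified as late.
    """

    def __init__(self, tolerance_minutes: int = 5):
        self.tolerance = timedelta(minutes=tolerance_minutes)
        self.max_event_time = None  # highest event timestamp across all partitions

    def observe(self, event_time: datetime) -> None:
        # The watermark only moves forward as new events arrive.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    def is_late(self, event_time: datetime) -> bool:
        # Late = older than the current watermark minus the configured tolerance.
        if self.max_event_time is None:
            return False
        return event_time < self.max_event_time - self.tolerance
```

In practice a tracker like this would be updated from each Kafka partition's consumer and consulted before routing every event.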

Late Event Detection:

  • Compare each event timestamp against current watermark minus 5-minute tolerance
  • Late events are buffered separately with delay metrics (original timestamp, received timestamp, delay in minutes)
  • Log late event occurrences for monitoring

Event Processing Flow:

  • Current events: Enriched with watermark timestamp and processed immediately
  • Late events: Buffered and batch-processed every 10 events
  • Both streams maintain processing metadata for auditing
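The routing flow above can be sketched as follows. The `is_late`, `emit_current`, and `emit_late_batch` callbacks are hypothetical stand-ins for the watermark check and the two processing sinks:

```python
LATE_BATCH_SIZE = 10  # late events are batch-processed every 10, per the flow above

def route_events(events, is_late, emit_current, emit_late_batch):
    """Split a stream into current and late paths.

    Current events are emitted immediately; late events are buffered and
    flushed in batches, with any remainder flushed at end of stream.
    """
    late_buffer = []
    for event in events:
        if is_late(event):
            late_buffer.append({**event, "processing_status": "late"})
            if len(late_buffer) >= LATE_BATCH_SIZE:
                emit_late_batch(late_buffer)  # batch-process buffered late events
                late_buffer = []
        else:
            emit_current({**event, "processing_status": "current"})
    if late_buffer:
        emit_late_batch(late_buffer)  # flush the remainder
```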

Watermark Manager Capabilities:

  • Track watermark history for debugging
  • Calculate watermark lag (time between wall clock and watermark)
  • Provide statistics for monitoring dashboards
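A watermark manager with these capabilities might look like the sketch below. `WatermarkStats` is a hypothetical helper; `record` would be called on every watermark update:

```python
from datetime import datetime, timedelta

class WatermarkStats:
    """Keeps a watermark history and reports lag, as described above."""

    def __init__(self):
        self.history = []  # (wall_clock, watermark) pairs, kept for debugging

    def record(self, wall_clock: datetime, watermark: datetime) -> None:
        self.history.append((wall_clock, watermark))

    def current_lag(self) -> timedelta:
        # Lag = wall clock minus watermark at the latest update.
        wall_clock, watermark = self.history[-1]
        return wall_clock - watermark

    def stats(self) -> dict:
        # Summary statistics suitable for a monitoring dashboard.
        lags = [(wc - wm).total_seconds() for wc, wm in self.history]
        return {"updates": len(lags),
                "avg_lag_seconds": sum(lags) / len(lags),
                "max_lag_seconds": max(lags)}
```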

Late Data Metrics:

  • Track total events vs late events count
  • Categorize by delay bucket (0-5min, 5-15min, 15-60min, 60min+)
  • Calculate data completeness percentage
  • Group late events by source system
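The bucketing and completeness calculations above reduce to two small functions. Completeness here is taken as the on-time share of all events, which is one plausible reading of the metric:

```python
def delay_bucket(delay_minutes: float) -> str:
    """Map a delay to the buckets listed above (0-5, 5-15, 15-60, 60+ min)."""
    if delay_minutes < 5:
        return "0-5min"
    if delay_minutes < 15:
        return "5-15min"
    if delay_minutes < 60:
        return "15-60min"
    return "60min+"

def completeness_pct(total_events: int, late_events: int) -> float:
    """Percentage of events that arrived on time."""
    if total_events == 0:
        return 100.0
    return 100.0 * (total_events - late_events) / total_events
```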

2. DBT Late Data Processing

The DBT models handle both current and late events with proper classification:

Late Events Processing:

  • Extract all standard event fields plus late data metadata
  • Classify delays into categories (minimal, moderate, significant, extreme)
  • Track processing status (reprocessed vs current)
  • Filter for valid events only

Current Events Processing:

  • Standard event extraction without delay metadata
  • Mark the delay category as 'no_delay' and the processing status as 'current'
  • Include placeholder fields for unified schema

Unified Event Stream:

  • Combine late and current events with UNION ALL
  • Parse timestamps for proper temporal ordering
  • Categorize events by type (engagement, conversion, discovery)
  • Extract time-based features (hour, day of week)
  • Track completeness status per record
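In Python terms, normalizing a late or current event into the unified schema might look like the sketch below. The field names and the delay-category thresholds (minimal ≤ 5 min, moderate ≤ 15, significant ≤ 60, extreme beyond) are illustrative assumptions, not the actual DBT model:

```python
from datetime import datetime

def to_unified_event(raw: dict, is_late: bool) -> dict:
    """Normalize an event into the unified schema described above."""
    ts = datetime.fromisoformat(raw["event_timestamp"])
    delay = raw.get("delay_minutes", 0)  # placeholder 0 for current events
    if not is_late:
        category, status = "no_delay", "current"
    elif delay <= 5:
        category, status = "minimal", "reprocessed"
    elif delay <= 15:
        category, status = "moderate", "reprocessed"
    elif delay <= 60:
        category, status = "significant", "reprocessed"
    else:
        category, status = "extreme", "reprocessed"
    return {
        "event_id": raw["event_id"],
        "event_timestamp": ts,            # parsed for temporal ordering
        "event_hour": ts.hour,            # time-based features
        "event_day_of_week": ts.strftime("%A"),
        "delay_category": category,
        "processing_status": status,
    }
```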

Data Completeness Fact Table:

  • Aggregate metrics by date, source system, and event category
  • Count total, late, and current events
  • Break down by delay category
  • Calculate average and max delay minutes
  • Compute completeness percentage
  • Join with watermark metrics for comprehensive view
  • Assign data quality scores (excellent, good, fair, poor)
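The fact-table aggregation can be sketched in plain Python. The quality-score thresholds are assumed for illustration (the source does not state them), and `completeness_fact` is a hypothetical helper name:

```python
from collections import defaultdict

def quality_score(pct: float) -> str:
    """Map completeness to the quality labels above (thresholds are assumed)."""
    if pct >= 99:
        return "excellent"
    if pct >= 95:
        return "good"
    if pct >= 90:
        return "fair"
    return "poor"

def completeness_fact(events):
    """Aggregate total/late counts and completeness per (date, source, category)."""
    groups = defaultdict(lambda: {"total": 0, "late": 0})
    for e in events:
        key = (e["event_date"], e["source_system"], e["event_category"])
        g = groups[key]
        g["total"] += 1
        g["late"] += e["processing_status"] == "reprocessed"
    for g in groups.values():
        g["completeness_pct"] = 100.0 * (g["total"] - g["late"]) / g["total"]
        g["quality"] = quality_score(g["completeness_pct"])
    return dict(groups)
```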

3. Windowing and Reprocessing Logic

The windowing processor manages time-based aggregation with late data tolerance:

Window Configuration:

  • Configurable window size (default: 60 minutes)
  • Late data tolerance period (default: 15 minutes after window close)
  • Per-window event buffers for efficient processing

Window Processing Flow:

  1. Assign each event to appropriate window based on timestamp
  2. Buffer events until window is complete (window end + tolerance)
  3. Separate current vs late events within each window
  4. Calculate window-level metrics (total, current, late counts)
  5. Process current events for real-time metrics
  6. Process late events and trigger reprocessing of affected aggregations
  7. Clean up completed window buffers
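The seven-step flow above can be sketched as a tumbling-window buffer. This assumes each event carries both its event time and its arrival time, so "late" means it arrived after the window closed; names and structure are illustrative:

```python
from datetime import datetime, timedelta
from collections import defaultdict

WINDOW = timedelta(minutes=60)     # default window size
TOLERANCE = timedelta(minutes=15)  # late-data tolerance after window close
EPOCH = datetime(1970, 1, 1)

def window_start(event_time: datetime) -> datetime:
    # Step 1: assign the event to a tumbling window by truncating its timestamp.
    return EPOCH + int((event_time - EPOCH) / WINDOW) * WINDOW

class WindowProcessor:
    def __init__(self):
        self.buffers = defaultdict(list)  # window start -> [(event_time, arrival_time)]

    def add(self, event_time: datetime, arrival_time: datetime) -> None:
        # Step 2: buffer events until the window (plus tolerance) is complete.
        self.buffers[window_start(event_time)].append((event_time, arrival_time))

    def close_ready(self, now: datetime):
        """Steps 3-7: for windows past end + tolerance, split current vs late
        events, compute window-level counts, and drop the buffer."""
        metrics = []
        for start in [s for s in self.buffers if now >= s + WINDOW + TOLERANCE]:
            events = self.buffers.pop(start)  # step 7: clean up the buffer
            late = sum(1 for _, arrived in events if arrived > start + WINDOW)
            metrics.append({"window_start": start, "total": len(events),
                            "late": late, "current": len(events) - late})
        return metrics
```

A real processor would also forward the late counts to the reprocessing trigger described below the flow.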

Reprocessing Triggers:

  • Late events trigger updates to affected window aggregations
  • Track max and average delay for late events
  • Log reprocessing actions for auditing

4. Data Completeness Monitoring

The monitoring layer provides real-time visibility into data quality:

Completeness Checks:

  • Query Cube.js semantic layer for completeness metrics
  • Filter by source system and date range
  • Track total events, late events, and completeness percentage

Watermark Lag Monitoring:

  • Track average and maximum watermark lag by source and date
  • Monitor watermark update frequency
  • Identify lagging sources

Comprehensive Reporting:

  • Generate reports across all source systems
  • Calculate overall completeness percentage
  • Generate alerts for low completeness (< 95% triggers medium alert, < 90% triggers high alert)
  • Include watermark metrics for correlation
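The alert thresholds above reduce to a small check; the return shape is a hypothetical example, but the thresholds follow the text (< 95% medium, < 90% high):

```python
def completeness_alert(completeness_pct: float):
    """Return an alert dict for low completeness, or None when healthy."""
    if completeness_pct < 90:
        return {"severity": "high",
                "message": f"Completeness {completeness_pct:.1f}% is below 90%"}
    if completeness_pct < 95:
        return {"severity": "medium",
                "message": f"Completeness {completeness_pct:.1f}% is below 95%"}
    return None  # no alert needed
```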

Late Data Handling Results & Performance

Data Completeness Achievements

  • Data Completeness: 99% data completeness achieved
  • Late Data Detection: Sub-second late data detection
  • Reprocessing Speed: 5x faster late data reprocessing
  • Watermark Lag: < 2 minutes average watermark lag

System Performance

  • Processing Speed: Handle 1M+ events/hour with late data
  • Memory Efficiency: 50% reduction in memory usage
  • Scalability: Auto-scale with event volume
  • Reliability: 99.9% uptime with late data handling

Implementation Timeline

  • Week 1: Watermarking infrastructure setup
  • Week 2: Late data detection and processing
  • Week 3: Windowing and reprocessing logic
  • Week 4: Monitoring and alerting implementation

Business Impact

Data Quality Assurance

  • Complete Analytics: Ensure all data is captured and processed
  • Accurate Insights: Reliable analytics with complete data
  • Compliance: Meet data completeness requirements
  • Audit Trail: Complete tracking of data processing

Operational Excellence

  • Automated Handling: No manual intervention for late data
  • Real-Time Monitoring: Continuous data quality tracking
  • Proactive Alerts: Early detection of data issues
  • Scalable Solution: Handle growing data volumes

Best Practices for Late Data Handling

1. Watermarking Strategy

  • Event Time Processing: Use event timestamps for processing
  • Watermark Tolerance: Set appropriate late data tolerance
  • Dynamic Watermarks: Adjust watermarks based on data patterns
  • Monitoring: Track watermark lag continuously

2. Windowing Approach

  • Time-Based Windows: Use time boundaries for aggregation
  • Late Data Tolerance: Allow for late data within windows
  • Reprocessing Logic: Automatically reprocess affected windows
  • Performance Optimization: Optimize window processing

3. Data Quality Monitoring

  • Completeness Tracking: Monitor data completeness metrics
  • Alert System: Set up alerts for data quality issues
  • Real-Time Dashboards: Visualize data quality metrics
  • Historical Analysis: Track data quality trends

4. Performance Optimization

  • Efficient Buffering: Optimize late data buffering
  • Parallel Processing: Process late data in parallel
  • Caching Strategy: Cache frequently accessed data
  • Resource Management: Optimize memory and CPU usage

Conclusion

Late-arriving data handling is essential for maintaining data quality and completeness in streaming pipelines. By implementing watermarking, windowing, and intelligent reprocessing, organizations can achieve high data completeness while maintaining real-time processing capabilities.

The key to success lies in:

  1. Robust Watermarking with appropriate tolerance levels
  2. Intelligent Windowing that handles late data gracefully
  3. Automated Reprocessing for affected data windows
  4. Comprehensive Monitoring with real-time quality tracking
  5. Performance Optimization for efficient late data handling

Start your late data handling journey today and ensure complete, accurate data processing.


Need help implementing late data handling? Get in touch to discuss your architecture.
