Handling Late-Arriving Data in Streaming Pipelines

May 15, 2024

by Abdelkader Bekhti, Production AI & Data Architect

The Challenge: Managing Late-Arriving Data in Real-Time Streams

Organizations face the critical challenge of handling late-arriving data in streaming pipelines while maintaining data accuracy and completeness. Traditional streaming systems often miss or incorrectly process data that arrives out of order, leading to incomplete analytics and incorrect business decisions.

This late-arriving data solution leverages watermarking, windowing, and intelligent reprocessing to ensure 99% data completeness while maintaining real-time processing capabilities.

Late-Arriving Data Architecture: Watermarking & Windowing

Our solution delivers 99% data completeness with intelligent late data handling. Here's the architecture:

Streaming Layer

Kafka Streaming: Real-time data ingestion with ordering
Debezium CDC: Change data capture with timestamp tracking
Watermarking: Event time processing with late data tolerance
Windowing: Time-based data aggregation with late data handling

Processing Layer

DBT Processing: Intelligent data transformation with late data logic
Reprocessing Pipeline: Automatic handling of late-arriving data
Completeness Monitoring: Real-time data completeness tracking
Alert System: Notifications for data quality issues

Late-Arriving Data Architecture

Out of order

Data Sources

Real-time events

Ordering

Kafka Streaming

Real-time ingestion

Timestamp tracking

Debezium CDC

Change data capture

Late data tolerance

Watermarking

Event time processing

Late data handling

Windowing

Time-based aggregation

Late data logic

DBT Processing

Intelligent transformation

99% completeness

Completeness Monitoring

Real-time tracking

Real-time notifications

Alert System

Data quality alerts

99%

Data Completeness

Watermarking

Event Time Processing

Windowing

Time-based Aggregation

Late Data

Intelligent Handling

Streaming Layer

• Kafka real-time ingestion
• Debezium CDC with timestamps
• Event ordering and tracking
• Out-of-order data handling

Processing Layer

• Watermarking for late data tolerance
• Windowing for time-based aggregation
• DBT intelligent transformation
• Late data logic implementation

Monitoring Layer

• Real-time completeness tracking
• Data quality alert system
• 99% data completeness
• Automated quality checks

Technical Implementation

1. Kafka Watermarking Strategy

The streaming layer implements sophisticated watermark management:

Watermark Computation:

Track the maximum event timestamp seen across all partitions
Update watermark as new events arrive
Events arriving before (watermark - tolerance) are classified as late

Late Event Detection:

Compare each event timestamp against current watermark minus 5-minute tolerance
Late events are buffered separately with delay metrics (original timestamp, received timestamp, delay in minutes)
Log late event occurrences for monitoring

Event Processing Flow:

Current events: Enriched with watermark timestamp and processed immediately
Late events: Buffered and batch-processed every 10 events
Both streams maintain processing metadata for auditing

Watermark Manager Capabilities:

Track watermark history for debugging
Calculate watermark lag (time between wall clock and watermark)
Provide statistics for monitoring dashboards

Late Data Metrics:

Track total events vs late events count
Categorize by delay bucket (0-5min, 5-15min, 15-60min, 60min+)
Calculate data completeness percentage
Group late events by source system

2. DBT Late Data Processing

The DBT models handle both current and late events with proper classification:

Late Events Processing:

Extract all standard event fields plus late data metadata
Classify delays into categories (minimal, moderate, significant, extreme)
Track processing status (reprocessed vs current)
Filter for valid events only

Current Events Processing:

Standard event extraction without delay metadata
Mark as 'no_delay' and 'current' processing status
Include placeholder fields for unified schema

Unified Event Stream:

Combine late and current events with UNION ALL
Parse timestamps for proper temporal ordering
Categorize events by type (engagement, conversion, discovery)
Extract time-based features (hour, day of week)
Track completeness status per record

Data Completeness Fact Table:

Aggregate metrics by date, source system, and event category
Count total, late, and current events
Break down by delay category
Calculate average and max delay minutes
Compute completeness percentage
Join with watermark metrics for comprehensive view
Assign data quality scores (excellent, good, fair, poor)

3. Windowing and Reprocessing Logic

The windowing processor manages time-based aggregation with late data tolerance:

Window Configuration:

Configurable window size (default: 60 minutes)
Late data tolerance period (default: 15 minutes after window close)
Per-window event buffers for efficient processing

Window Processing Flow:

Assign each event to appropriate window based on timestamp
Buffer events until window is complete (window end + tolerance)
Separate current vs late events within each window
Calculate window-level metrics (total, current, late counts)
Process current events for real-time metrics
Process late events and trigger reprocessing of affected aggregations
Clean up completed window buffers

Reprocessing Triggers:

Late events trigger updates to affected window aggregations
Track max and average delay for late events
Log reprocessing actions for auditing

4. Data Completeness Monitoring

The monitoring layer provides real-time visibility into data quality:

Completeness Checks:

Query Cube.js semantic layer for completeness metrics
Filter by source system and date range
Track total events, late events, and completeness percentage

Watermark Lag Monitoring:

Track average and maximum watermark lag by source and date
Monitor watermark update frequency
Identify lagging sources

Comprehensive Reporting:

Generate reports across all source systems
Calculate overall completeness percentage
Generate alerts for low completeness (< 95% triggers medium alert, < 90% triggers high alert)
Include watermark metrics for correlation

Late Data Handling Results & Performance

Data Completeness Achievements

Data Completeness: 99% data completeness achieved
Late Data Detection: Sub-second late data detection
Reprocessing Speed: 5x faster late data reprocessing
Watermark Lag: < 2 minutes average watermark lag

System Performance

Processing Speed: Handle 1M+ events/hour with late data
Memory Efficiency: 50% reduction in memory usage
Scalability: Auto-scale with event volume
Reliability: 99.9% uptime with late data handling

Implementation Timeline

Week 1: Watermarking infrastructure setup
Week 2: Late data detection and processing
Week 3: Windowing and reprocessing logic
Week 4: Monitoring and alerting implementation

Business Impact

Data Quality Assurance

Complete Analytics: Ensure all data is captured and processed
Accurate Insights: Reliable analytics with complete data
Compliance: Meet data completeness requirements
Audit Trail: Complete tracking of data processing

Operational Excellence

Automated Handling: No manual intervention for late data
Real-Time Monitoring: Continuous data quality tracking
Proactive Alerts: Early detection of data issues
Scalable Solution: Handle growing data volumes

Best Practices for Late Data Handling

1. Watermarking Strategy

Event Time Processing: Use event timestamps for processing
Watermark Tolerance: Set appropriate late data tolerance
Dynamic Watermarks: Adjust watermarks based on data patterns
Monitoring: Track watermark lag continuously

2. Windowing Approach

Time-Based Windows: Use time boundaries for aggregation
Late Data Tolerance: Allow for late data within windows
Reprocessing Logic: Automatically reprocess affected windows
Performance Optimization: Optimize window processing

3. Data Quality Monitoring

Completeness Tracking: Monitor data completeness metrics
Alert System: Set up alerts for data quality issues
Real-Time Dashboards: Visualize data quality metrics
Historical Analysis: Track data quality trends

4. Performance Optimization

Efficient Buffering: Optimize late data buffering
Parallel Processing: Process late data in parallel
Caching Strategy: Cache frequently accessed data
Resource Management: Optimize memory and CPU usage

Conclusion

Late-arriving data handling is essential for maintaining data quality and completeness in streaming pipelines. By implementing watermarking, windowing, and intelligent reprocessing, organizations can achieve high data completeness while maintaining real-time processing capabilities.

The key to success lies in:

Robust Watermarking with appropriate tolerance levels
Intelligent Windowing that handles late data gracefully
Automated Reprocessing for affected data windows
Comprehensive Monitoring with real-time quality tracking
Performance Optimization for efficient late data handling

Start your late data handling journey today and ensure complete, accurate data processing.

Need help implementing late data handling? Get in touch to discuss your architecture.

Our offices

Follow us

Handling Late-Arriving Data in Streaming Pipelines

The Challenge: Managing Late-Arriving Data in Real-Time Streams

Late-Arriving Data Architecture: Watermarking & Windowing

Streaming Layer

Processing Layer

Late-Arriving Data Architecture

Streaming Layer

Processing Layer

Monitoring Layer

Technical Implementation

1. Kafka Watermarking Strategy

2. DBT Late Data Processing

3. Windowing and Reprocessing Logic

4. Data Completeness Monitoring

Late Data Handling Results & Performance

Data Completeness Achievements

System Performance

Implementation Timeline

Business Impact

Data Quality Assurance

Operational Excellence

Best Practices for Late Data Handling

1. Watermarking Strategy

2. Windowing Approach

3. Data Quality Monitoring

4. Performance Optimization

Conclusion

More articles

Real-Time Fraud Detection Pipelines

Building a Data Mesh: Lessons from Retail

Ready to build production-ready systems?

Based in Dubai