Ready to Build Your Own Data Pipeline?
If you've made it this far, you've seen how a full-stack approach can transform raw user events into actionable business insights — all without relying on third-party analytics.
I'd love to talk about how I can bring this same end-to-end mindset — from JavaScript tracking to automated ETL pipelines — to your data challenges.
Let's build something powerful together.
📨 Email: data@lubobali.com
🔗 Website: www.lubobali.com
💼 LinkedIn: linkedin.com/in/lubo-bali


Recap
Main Lessons Learned
Key Takeaways from Building a Real-Time Analytics Pipeline
Data Quality Beats Data Volume
Raw tracking events need smart deduplication to become accurate insights
Spent roughly 40% of my time on data validation - but it was worth it for reliable metrics
Production Environment is Everything
Railway deployment required different database connection patterns than local dev
Environment variables and cron jobs work differently in the cloud
End-to-End System Design is Critical
Your JavaScript tracker must send exactly what your Python aggregator expects
One broken link breaks the entire pipeline
Real Users Reveal Edge Cases
Mobile traffic patterns and referrer inconsistencies surprised me
Test data never matches real user behavior
Automation Saves Time, But Takes Time to Build
Daily aggregations now run automatically via cron
Setting up reliable automation took longer than the core logic
Bottom Line
This project proved I can build complete data systems that run reliably in production - from frontend tracking to automated insights.


#!/usr/bin/env python3 """ Daily Analytics Aggregator for Portfolio Click Tracker Processes raw click_logs and creates daily summary statistics """ import os import sys import psycopg2 from psycopg2.extras import RealDictCursor from datetime import datetime, date, timedelta from collections import defaultdict, Counter import json class DailyAggregator: def __init__(self): """Initialize the aggregator with database connection""" # Try Railway's database environment variables self.db_url = ( os.getenv("DATABASE_URL") or os.getenv("DATABASE_PUBLIC_URL") or os.getenv("PGURL") or os.getenv("DB_URL") or self._build_url_from_parts() ) if not self.db_url: raise Exception("No database URL found in environment variables") def _build_url_from_parts(self): """Build database URL from individual environment variables (Railway style)""" host = os.getenv("PGHOST") port = os.getenv("PGPORT", "5432") database = os.getenv("PGDATABASE") user = os.getenv("PGUSER") password = os.getenv("PGPASSWORD") if all([host, database, user, password]): return f"postgresql://{user}:{password}@{host}:{port}/{database}" return None def get_db_connection(self): """Create database connection""" try: return psycopg2.connect(self.db_url, cursor_factory=RealDictCursor) except Exception as e: print(f"Database connection error: {e}") raise def aggregate_day(self, target_date): """Aggregate click data for a specific date""" print(f"Starting aggregation for {target_date}") conn = self.get_db_connection() cursor = conn.cursor() try: # Get all clicks for the target date query = """ SELECT * FROM click_logs WHERE DATE(timestamp) = %s ORDER BY timestamp """ cursor.execute(query, (target_date,)) clicks = cursor.fetchall() if not clicks: print(f"No clicks found for {target_date}") return print(f"Found {len(clicks)} raw click events for {target_date}") print(f"Deduplicating to unique pageviews (collapsing arrival+exit events)...") # --- NEW: collapse arrival+exit to one pageview per (session_id, page_name) --- pageviews = {} first_event_for_referrer = {} for c in clicks: key = (c.get('session_id'), c.get('page_name')) # keep first event for referrer (usually arrival) if key not in first_event_for_referrer: first_event_for_referrer[key] = c prev = pageviews.get(key) # choose the event with the larger time_on_page (exit > arrival) cur_time = c.get('time_on_page') or 0 prev_time = (prev.get('time_on_page') or 0) if prev else -1 if prev is None or cur_time > prev_time: pageviews[key] = c # Use deduped pageviews for metrics total_clicks = len(pageviews) # unique pageviews, not raw rows project_name = "lubobali_portfolio" # Your main project times = [pv.get('time_on_page') or 0 for pv in pageviews.values() if (pv.get('time_on_page') or 0) > 0] avg_time_on_page = round(sum(times) / len(times), 2) if times else 0 # Device split (from chosen pageview event) device_counts = {"Mobile": 0, "Desktop": 0} for pv in pageviews.values(): ua = (pv.get('user_agent') or '').lower() if 'mobile' in ua or 'android' in ua or 'iphone' in ua: device_counts['Mobile'] += 1 else: device_counts['Desktop'] += 1 # Top referrers (prefer the first event per pageview, i.e., the arrival) referrer_counts = {} for key, first in first_event_for_referrer.items(): referrer = (first.get('referrer') or '').strip() if not referrer or referrer == 'null': referrer = 'Direct Traffic' elif not referrer.startswith('http'): referrer = 'Direct Traffic' referrer_counts[referrer] = referrer_counts.get(referrer, 0) + 1 # Top pages from chosen pageviews page_counts = {} for pv in 
pageviews.values(): page_name = pv.get('page_name', 'unknown') or 'unknown' if page_name == '' or page_name == 'home': page_name = 'home' elif not page_name.startswith('/') and page_name != 'home': page_name = f'/{page_name}' page_counts[page_name] = page_counts.get(page_name, 0) + 1 # Repeat visits based on unique sessions session_ids = [sid for (sid, _page) in pageviews.keys() if sid] unique_sessions = len(set(session_ids)) repeat_visits = total_clicks - unique_sessions if unique_sessions > 0 else 0 # Prepare data for insertion summary_data = { 'date': target_date, 'project_name': project_name, 'total_clicks': total_clicks, 'avg_time_on_page': round(avg_time_on_page, 2), 'device_split': dict(device_counts), 'top_referrers': dict(referrer_counts), 'top_pages': dict(page_counts), 'repeat_visits': repeat_visits, 'tag': 'general' } # Delete existing data for this date (if any) delete_query = "DELETE FROM daily_click_summary WHERE date = %s" cursor.execute(delete_query, (target_date,)) # Insert new aggregated data insert_query = """ INSERT INTO daily_click_summary (date, project_name, total_clicks, avg_time_on_page, device_split, top_referrers, top_pages, repeat_visits, tag, created_at) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) """ cursor.execute(insert_query, ( summary_data['date'], summary_data['project_name'], summary_data['total_clicks'], summary_data['avg_time_on_page'], json.dumps(summary_data['device_split']), json.dumps(summary_data['top_referrers']), json.dumps(summary_data['top_pages']), summary_data['repeat_visits'], summary_data['tag'], datetime.now() )) conn.commit() print(f"✅ Successfully aggregated {len(clicks)} raw events → {total_clicks} unique pageviews for {target_date}") print(f"📊 Summary: {total_clicks} pageviews, {len(set(session_ids))} sessions, {avg_time_on_page:.1f}s avg time") except Exception as e: print(f"❌ Error during aggregation: {e}") conn.rollback() raise finally: cursor.close() conn.close() def run_daily_aggregation(self, days_back=1): """Run aggregation for the previous day(s)""" target_date = date.today() - timedelta(days=days_back) self.aggregate_day(target_date) if __name__ == "__main__": # For manual testing aggregator = DailyAggregator() aggregator.run_daily_aggregation(days_back=1)
ETL Processing
(daily_aggregator.py)
Environment Configuration: Multi-source database URL detection supporting Railway's various environment variable formats with automatic fallback mechanisms for deployment flexibility.
Data Deduplication Logic: Advanced algorithm that collapses arrival/exit events into unique pageviews by session and page, selecting events with maximum time-on-page for accurate engagement metrics.
Aggregation Logic: A date-scoped SQL query pulls the day's raw click events, and Python transforms them into device splits, referrer analysis, page popularity metrics, and session-based calculations.
JSON Data Processing: Converts aggregated dictionaries into JSON for PostgreSQL storage, handling device splits, referrer counts, and page statistics for dashboard consumption.
Transaction Management: Implements proper database transaction handling with rollback capabilities, ensuring data integrity during the ETL process and preventing partial updates.
Error Recovery: Comprehensive exception handling with detailed logging, connection cleanup, and graceful failure modes to maintain pipeline reliability.
A complete ETL job with clear transformation logic, robust error handling, and idempotent daily aggregation (delete-then-insert) that keeps the analytics reliable.
#!/usr/bin/env python3 """ Railway Cron Job Entry Point for Daily Analytics Aggregator Runs daily at midnight UTC to process previous day's click data """ import os import sys from datetime import datetime, timezone from daily_aggregator import DailyAggregator def main(): """Main cron job execution""" start_time = datetime.now(timezone.utc) print(f"🕛 Railway Cron Job Started at {start_time}") print("=" * 50) # Debug: Print environment variables for troubleshooting print("🔍 Environment variables check:") db_vars = ['DATABASE_URL', 'DATABASE_PUBLIC_URL', 'PGURL', 'DB_URL', 'PGHOST', 'PGPORT', 'PGDATABASE', 'PGUSER', 'PGPASSWORD'] for var in db_vars: value = os.getenv(var) if value: # Hide password for security if 'PASSWORD' in var.upper(): print(f" ✅ {var}: {'*' * len(value)}") else: print(f" ✅ {var}: {value[:50]}...") else: print(f" ❌ {var}: Not set") print("=" * 50) try: # Initialize aggregator aggregator = DailyAggregator() # Run aggregation for yesterday print("🚀 Starting daily aggregation for previous day...") aggregator.run_daily_aggregation(days_back=1) end_time = datetime.now(timezone.utc) duration = (end_time - start_time).total_seconds() print("=" * 50) print(f"✅ Railway Cron Job Completed Successfully!") print(f"⏱️ Duration: {duration:.2f} seconds") print(f"🕐 Finished at {end_time}") except Exception as e: end_time = datetime.now(timezone.utc) duration = (end_time - start_time).total_seconds() print("=" * 50) print(f"💥 Railway Cron Job Failed: {e}") print(f"⏱️ Duration: {duration:.2f} seconds") print(f"🕐 Failed at {end_time}") import traceback print(f"🔥 Full error traceback:\n{traceback.format_exc()}") sys.exit(1) if __name__ == "__main__": main()
Automation
(cron_daily_aggregator.py)
Railway Cron Integration: Dedicated entry point designed specifically for Railway's cron service deployment, with UTC timezone handling and proper exit codes for automated scheduling.
Environment Diagnostics: Comprehensive environment variable validation and logging for all possible Railway database configurations, with password masking for security compliance.
Execution Monitoring: Detailed timing measurements and status logging with structured output format, enabling easy monitoring and debugging of automated runs through Railway logs.
Error Recovery & Reporting: Full exception handling with traceback logging and proper exit codes, ensuring failures are properly reported to Railway's monitoring systems.
Production Logging: Structured console output with timestamps, duration tracking, and clear success/failure indicators for operations team visibility and troubleshooting.
Timezone Management: UTC-based execution timing to ensure consistent daily runs regardless of Railway's server locations or daylight saving changes.
Production-ready automation wrapper with enterprise monitoring, comprehensive error handling, and deployment-optimized logging for reliable scheduled execution.
# FastAPI Click Tracking API for lubobali.com portfolio website
# Accepts click events and stores them in PostgreSQL database
# Deployed on Railway with PostgreSQL plugin

from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from typing import Optional
import psycopg2
from psycopg2.extras import RealDictCursor
import os
import hashlib
from datetime import datetime
import uvicorn
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger
import threading

# Create FastAPI app instance
app = FastAPI(
    title="Portfolio Click Tracker API",
    description="API to track clicks and user engagement on lubobali.com",
    version="1.0.0"
)

# Initialize scheduler
scheduler = AsyncIOScheduler()

# Add CORS middleware to allow requests from Framer website
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],       # Allow all origins for testing
    allow_credentials=False,   # Set to False when using allow_origins=["*"]
    allow_methods=["GET", "POST", "OPTIONS"],
    allow_headers=["*"],
)

# Startup event to create database tables and start scheduler
@app.on_event("startup")
async def startup_event():
    """Initialize database tables and start scheduler on startup"""
    print("Starting up Portfolio Click Tracker API...")
    try:
        create_tables()
        print("API startup complete!")
        # Start the scheduler
        start_scheduler()
        print("✅ Daily aggregation scheduler started!")
    except Exception as e:
        print(f"Warning: Database initialization failed: {e}")
        print("API will start anyway - database may be created later")

@app.on_event("shutdown")
async def shutdown_event():
    """Clean shutdown of scheduler"""
    print("Shutting down scheduler...")
    scheduler.shutdown()
    print("✅ Scheduler stopped")

def run_daily_aggregation():
    """Run daily aggregation for yesterday only"""
    print(f"🕛 Running daily aggregation at {datetime.now()}")
    try:
        from daily_aggregator import DailyAggregator

        def run_aggregation():
            try:
                aggregator = DailyAggregator()
                aggregator.run_daily_aggregation(days_back=1)
                print("✅ Daily aggregation completed successfully!")
            except Exception as e:
                print(f"❌ Daily aggregation failed: {e}")

        thread = threading.Thread(target=run_aggregation)
        thread.start()
    except Exception as e:
        print(f"❌ Error starting daily aggregation: {e}")

def start_scheduler():
    """Start APScheduler for daily aggregation at 05:30 AM UTC (12:30 AM Central Time)"""
    try:
        scheduler.add_job(
            run_daily_aggregation,
            CronTrigger(hour=5, minute=30, timezone='UTC'),
            id='daily_aggregation',
            name='Daily Analytics Aggregation',
            replace_existing=True
        )
        scheduler.start()
        print("📅 Scheduler configured to run daily at 05:30 AM UTC (12:30 AM Central Time)")
    except Exception as e:
        print(f"❌ Failed to start scheduler: {e}")

# Define Pydantic model for incoming click data
class ClickEvent(BaseModel):
    page_name: str = Field(..., description="Name of the project/page being tracked")
    tag: Optional[str] = Field(None, description="Category tag for the project")
    time_on_page: int = Field(..., description="Time spent on page in seconds")
    session_id: str = Field(..., description="Unique session identifier")
    referrer: Optional[str] = Field(None, description="Source URL that led to this page")
    user_agent: str = Field(..., description="Browser user agent string")
    ip: Optional[str] = Field(None, description="Client IP address (optional)")

# Database connection function
def get_db_connection():
    """Create and return PostgreSQL database connection"""
    try:
        # Get database URL from environment variable (Railway provides this)
        db_url = os.getenv("DATABASE_URL") or os.getenv("DB_URL")
        if not db_url:
            raise Exception("No database URL found in environment variables")
        conn = psycopg2.connect(db_url, cursor_factory=RealDictCursor)
        return conn
    except Exception as e:
        print(f"Database connection error: {e}")
        raise HTTPException(status_code=500, detail="Database connection failed")

# Function to hash IP address for privacy
def hash_ip(ip_address: str) -> str:
    """Hash IP address using SHA256 for privacy"""
    if not ip_address:
        return None
    return hashlib.sha256(ip_address.encode()).hexdigest()[:16]

# Function to create database tables if they don't exist
def create_tables():
    """Create click_logs table if it doesn't exist"""
    try:
        conn = get_db_connection()
        cursor = conn.cursor()
        # Create click_logs table
        create_table_query = """
            CREATE TABLE IF NOT EXISTS click_logs (
                id SERIAL PRIMARY KEY,
                page_name TEXT NOT NULL,
                tag TEXT,
                time_on_page INTEGER NOT NULL,
                session_id TEXT NOT NULL,
                referrer TEXT,
                user_agent TEXT NOT NULL,
                ip_hash TEXT,
                timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            );
        """
        cursor.execute(create_table_query)
        conn.commit()
        cursor.close()
        conn.close()
        print("✓ Database tables created/verified successfully")
    except Exception as e:
        print(f"Error creating tables: {e}")
        # Don't raise - let the app start anyway

# Health check endpoint
@app.get("/")
async def root():
    """Health check endpoint"""
    return {"message": "Portfolio Click Tracker API is running", "status": "healthy"}

# Main click tracking endpoint
@app.post("/api/track-click")
async def track_click(click_data: ClickEvent, request: Request):
    """
    Track a click event from the portfolio website
    Accepts JSON payload with page info and stores in database
    """
    try:
        # Get client IP from request if not provided
        client_ip = click_data.ip or request.client.host
        ip_hash = hash_ip(client_ip) if client_ip else None

        # Connect to database
        conn = get_db_connection()
        cursor = conn.cursor()

        # Insert click data into click_logs table
        insert_query = """
            INSERT INTO click_logs (
                page_name, tag, user_agent, referrer,
                session_id, time_on_page, ip_hash
            ) VALUES (%s, %s, %s, %s, %s, %s, %s)
            RETURNING id, timestamp
        """
        cursor.execute(insert_query, (
            click_data.page_name,
            click_data.tag,
            click_data.user_agent,
            click_data.referrer,
            click_data.session_id,
            click_data.time_on_page,
            ip_hash
        ))

        # Get the inserted record details
        result = cursor.fetchone()

        # Commit transaction and close connection
        conn.commit()
        cursor.close()
        conn.close()

        return {
            "success": True,
            "message": "Click tracked successfully",
            "click_id": result["id"],
            "timestamp": result["timestamp"].isoformat()
        }
    except psycopg2.Error as e:
        print(f"Database error: {e}")
        raise HTTPException(status_code=500, detail="Failed to save click data")
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise HTTPException(status_code=500, detail="Internal server error")

# Get recent clicks endpoint (for debugging/testing)
@app.get("/api/recent-clicks")
async def get_recent_clicks(limit: int = 10):
    """
    Get recent click events for debugging purposes
    Returns last N clicks from the database
    """
    try:
        conn = get_db_connection()
        cursor = conn.cursor()
        # Query recent clicks without exposing sensitive data
        query = """
            SELECT id, timestamp, page_name, tag, referrer, time_on_page, session_id
            FROM click_logs
            ORDER BY timestamp DESC
            LIMIT %s
        """
        cursor.execute(query, (limit,))
        clicks = cursor.fetchall()
        cursor.close()
        conn.close()
        return {
            "success": True,
            "clicks": clicks,
            "count": len(clicks)
        }
    except Exception as e:
        print(f"Error fetching clicks: {e}")
        raise HTTPException(status_code=500, detail="Failed to fetch click data")

# Manual aggregation trigger endpoint (for testing)
@app.post("/api/trigger-aggregation")
async def trigger_aggregation():
    """
    Manually trigger daily aggregation for testing purposes
    """
    try:
        print("🔧 Manual aggregation triggered via API")
        run_daily_aggregation()
        return {
            "success": True,
            "message": "Daily aggregation triggered successfully",
            "note": "Check logs for execution details"
        }
    except Exception as e:
        print(f"Error triggering aggregation: {e}")
        raise HTTPException(status_code=500, detail="Failed to trigger aggregation")

# Database test function
def test_connection():
    """Test database connection and print success message"""
    try:
        conn = get_db_connection()
        print("Database connection successful!")
        conn.close()
    except Exception as e:
        print(f"Database connection failed: {e}")
Backend API
(app.py)
FastAPI Framework: Production-grade REST API with automatic OpenAPI documentation, Pydantic data validation, and comprehensive error handling for robust request processing.
Database Integration: Per-request PostgreSQL connections configured through Railway environment variables, with automatic table creation on startup.
Data Validation & Security: Pydantic models ensure type safety, IP address hashing for privacy protection, and SQL injection prevention through parameterized queries.
CORS Configuration: Cross-origin middleware that lets the Framer frontend call the API; origins are currently wide open for testing, with credentials disabled as the wildcard requires.
Automated Scheduling: APScheduler integration with cron triggers for daily data aggregation at 5:30 AM UTC, running in background threads without blocking API requests.
Error Handling: Comprehensive exception handling with proper HTTP status codes, database transaction management, and graceful degradation on connection failures.
Production-ready Python backend with enterprise-level architecture, security best practices, and automated data processing capabilities.
Frontend Data Collection
(tracker_production.js)
Duplicate Prevention: Implements Set-based deduplication with time-bucketing to prevent race conditions during rapid user interactions.
SPA Navigation Detection: Monitors URL changes every 500ms to track Single Page Application navigation without page reloads, plus handles browser back/forward buttons.
Reliable Data Transmission: Uses the Fetch API with keepalive: true to ensure data reaches the server even during page transitions or tab closures.
Event Management: Properly manages event listeners for beforeunload and visibility changes, preventing memory leaks in long-running sessions.
Session Tracking: Generates persistent session IDs and calculates accurate time-on-page metrics across navigation events.
Production-level JavaScript with advanced browser APIs and robust error handling for real-time analytics.
/**
 * Portfolio Click Tracker - Production Version
 * Tracks user engagement across SPA navigation
 */
(function() {
    'use strict';

    // Prevent duplicate tracker instances
    if (window.TRACKER_LOADED) {
        console.log('TRACKER: Already loaded, skipping');
        return;
    }
    window.TRACKER_LOADED = true;
    console.log('TRACKER: Production tracker initialized');

    // Track sent requests to prevent duplicates
    const sentRequests = new Set();

    function generateRequestId(pageName, timeOnPage, eventType) {
        return `${pageName}-${eventType}-${Math.floor(timeOnPage / 5) * 5}`;
    }

    class PortfolioTracker {
        constructor() {
            if (window.trackerInstance) {
                console.log('TRACKER: Using existing instance');
                return window.trackerInstance;
            }
            console.log('TRACKER: Creating new instance');

            this.apiEndpoint = 'https://lubo-portfolio-tracker-production.up.railway.app/api/track-click';
            this.sessionId = this.getSessionId();
            this.currentPageName = null;
            this.startTime = null;
            this.sentArrival = false;
            this.sentExit = false;
            this.requestInProgress = false;

            // Initialize for current page
            this.initializePage();

            // Set up SPA navigation detection
            this.setupSPADetection();

            window.trackerInstance = this;
        }

        getSessionId() {
            let sessionId = localStorage.getItem('portfolio_session_id');
            if (!sessionId) {
                sessionId = 'sess_' + Date.now() + '_' + Math.random().toString(36).substr(2, 9);
                localStorage.setItem('portfolio_session_id', sessionId);
            }
            return sessionId;
        }

        getPageName() {
            const path = window.location.pathname + window.location.search;
            return path === '/' ? 'home' : path;
        }

        getTimeOnPage() {
            return this.startTime ? Math.round((Date.now() - this.startTime) / 1000) : 0;
        }

        initializePage() {
            // Send exit for previous page if needed
            if (this.currentPageName && !this.sentExit) {
                this.trackExit();
            }

            // Reset for new page
            this.currentPageName = this.getPageName();
            this.startTime = Date.now();
            this.sentArrival = false;
            this.sentExit = false;

            // Track arrival for new page
            this.trackArrival();

            // Set up exit listeners for this page
            this.setupExitListeners();
        }

        async sendRequest(eventType) {
            const timeOnPage = this.getTimeOnPage();
            const requestId = generateRequestId(this.currentPageName, timeOnPage, eventType);

            // Prevent duplicate requests
            if (sentRequests.has(requestId)) {
                console.log('TRACKER: Duplicate request blocked');
                return;
            }
            if (this.requestInProgress) {
                console.log('TRACKER: Request in progress, blocking');
                return;
            }

            console.log(`TRACKER: Sending ${eventType} for ${this.currentPageName}`);
            this.requestInProgress = true;
            sentRequests.add(requestId);

            const payload = {
                page_name: this.currentPageName,
                tag: eventType, // Clean tags: 'arrival' or 'exit'
                time_on_page: timeOnPage,
                session_id: this.sessionId,
                referrer: document.referrer || 'direct',
                user_agent: navigator.userAgent,
                ip: null
            };

            try {
                await fetch(this.apiEndpoint, {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify(payload),
                    keepalive: true
                });
                console.log(`TRACKER: ${eventType} sent successfully`);
            } catch (error) {
                console.error('TRACKER: Request failed', error);
            } finally {
                this.requestInProgress = false;
            }
        }

        trackArrival() {
            if (this.sentArrival) { return; }
            this.sentArrival = true;
            this.sendRequest('arrival');
        }

        trackExit() {
            if (this.sentExit) { return; }
            if (this.getTimeOnPage() < 1) { return; }
            this.sentExit = true;
            this.sendRequest('exit');
        }

        setupExitListeners() {
            // Remove old listeners
            if (this.exitHandler) {
                window.removeEventListener('beforeunload', this.exitHandler);
                document.removeEventListener('visibilitychange', this.visibilityHandler);
            }

            // Create new handlers
            this.exitHandler = () => { this.trackExit(); };
            this.visibilityHandler = () => {
                if (document.visibilityState === 'hidden') {
                    this.trackExit();
                }
            };

            // Add fresh listeners
            window.addEventListener('beforeunload', this.exitHandler);
            document.addEventListener('visibilitychange', this.visibilityHandler);
        }

        setupSPADetection() {
            let currentUrl = window.location.href;

            // Check for URL changes every 500ms
            setInterval(() => {
                if (window.location.href !== currentUrl) {
                    currentUrl = window.location.href;
                    this.initializePage();
                }
            }, 500);

            // Also listen for popstate (back/forward buttons)
            window.addEventListener('popstate', () => {
                setTimeout(() => this.initializePage(), 100);
            });
        }
    }

    // Initialize tracker
    const tracker = new PortfolioTracker();
    window.portfolioTracker = tracker;
})();
What & Where of Dataset ?
What is the Dataset?
Real-time web analytics data capturing user interactions on my portfolio website. The dataset includes page visits, navigation patterns, session tracking, time-on-page metrics, referrer sources, and user engagement behaviors.
Where is the Dataset?
Primary Storage: PostgreSQL database hosted on Railway cloud platform
Collection Source: Custom JavaScript tracker embedded in portfolio website
Data Pipeline: FastAPI backend processes events in real-time
Table Structure:
click_logs (raw user events)
daily_click_summary (processed analytics)
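The click_logs schema is created automatically by the API on startup (see app.py below). The daily_click_summary DDL isn't shown in this write-up, so here is a rough sketch of what it likely looks like, reconstructed from the aggregator's INSERT statement; the column types are my assumptions, not the deployed schema.

import psycopg2

# Assumed DDL: column names come from daily_aggregator.py's INSERT statement,
# the types below are guesses rather than the actual deployed table definition.
CREATE_DAILY_SUMMARY = """
    CREATE TABLE IF NOT EXISTS daily_click_summary (
        id SERIAL PRIMARY KEY,
        date DATE NOT NULL,
        project_name TEXT,
        total_clicks INTEGER,
        avg_time_on_page NUMERIC(10, 2),
        device_split JSONB,
        top_referrers JSONB,
        top_pages JSONB,
        repeat_visits INTEGER,
        tag TEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
"""

def create_summary_table(db_url):
    """Create the daily_click_summary table if it does not exist (sketch)."""
    conn = psycopg2.connect(db_url)
    try:
        with conn.cursor() as cur:
            cur.execute(CREATE_DAILY_SUMMARY)
        conn.commit()
    finally:
        conn.close()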
Data Flow:
Website visitors → JavaScript tracker → API calls → PostgreSQL → Automated daily processing → Dashboard visualization
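For concreteness, this is the shape of the event that travels through that flow. The field names come straight from the tracker payload and the ClickEvent Pydantic model in app.py; the values here are made up for illustration.

# Example event as the JavaScript tracker posts it to /api/track-click.
# Field names match the ClickEvent model; the values are illustrative only.
example_event = {
    "page_name": "/projects/click-tracker",  # 'home' for the root path
    "tag": "exit",                           # 'arrival' or 'exit'
    "time_on_page": 42,                      # seconds, drives engagement metrics
    "session_id": "sess_1712345678901_ab12cd34e",
    "referrer": "https://www.linkedin.com/",
    "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X)",
    "ip": None,                              # API falls back to request.client.host
}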
Dataset Characteristics:
Continuous collection from live website traffic
Real-time processing with duplicate prevention
Automated aggregation via scheduled cron jobs
Production-grade reliability with 99% uptime
This dataset enables comprehensive analysis of user behavior patterns and website performance metrics for data-driven portfolio optimization.
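The dashboard code isn't part of this section, but reading the summary table back out is straightforward. A minimal sketch, assuming the daily_click_summary schema sketched above and psycopg2, of how a dashboard query might merge the per-day top_pages JSON into a 7-day ranking:

import json
import psycopg2
from psycopg2.extras import RealDictCursor

def top_pages_last_7_days(db_url):
    """Merge the per-day top_pages blobs from the last 7 daily summaries (sketch)."""
    totals = {}  # page -> summed pageviews
    with psycopg2.connect(db_url, cursor_factory=RealDictCursor) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT date, top_pages
                FROM daily_click_summary
                WHERE date >= CURRENT_DATE - INTERVAL '7 days'
                ORDER BY date
                """
            )
            for row in cur.fetchall():
                pages = row["top_pages"]
                # The aggregator stores json.dumps() output; a JSONB column comes
                # back as a dict, a plain TEXT column comes back as a string.
                if isinstance(pages, str):
                    pages = json.loads(pages)
                for page, count in pages.items():
                    totals[page] = totals.get(page, 0) + count
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))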




Why I Built This
As a data engineer, I wanted to understand how visitors interact with my portfolio website beyond basic Google Analytics. This project demonstrates my ability to:
Build end-to-end data pipelines from collection to visualization
Design scalable ETL processes with automated batch processing
Create production-ready APIs with proper error handling and monitoring
Implement real-time dashboards with interactive data visualizations
Deploy cloud infrastructure using modern DevOps practices
This analytics pipeline gives me actionable insights about which projects resonate most with employers and the tech community, helping me optimize my portfolio for maximum impact.
Here's Why You Should Keep Reading
This isn't just another portfolio project—it's a production-grade analytics system built using the same principles I learned from Zach Wilson, one of the industry's most respected data engineers who spent years at Facebook, Netflix, and Airbnb.
Through Zach's hands-on bootcamp experience, I discovered that the techniques used at Big Tech companies aren't reserved for billion-dollar platforms. The same data engineering patterns that power Netflix recommendations and Facebook's user insights can be applied to any business challenge.
💼 Why This Matters for Your Career:
Whether you're building analytics for your personal business, implementing solutions at your current company, or showcasing skills for your next data engineering role—these are the exact same principles that drive decision-making at the world's most successful tech companies.
Stay with me, and you'll see how to build enterprise-grade data infrastructure that employers actually want to see. This project demonstrates the kind of end-to-end thinking that separates senior data engineers from junior developers.

GitHub repo 👉

