Python, PyTorch, PostgreSQL, Spark

ETA Accuracy & Route Quality Analysis

End-to-end analytics for evaluating ETA accuracy using NYC Taxi trip data

This lightweight project evaluates ETA accuracy and route-quality proxy metrics on NYC Taxi trip data. Accurate Estimated Time of Arrival (ETA) predictions matter for user experience, driver planning, and operations. The project trains a PyTorch neural network to predict trip duration, computes accuracy metrics across segments, identifies failure modes, and proposes an experiment design for production deployment. The repo is structured for clarity and reproducibility: data load → schema and derived columns → model training → metrics computation → analysis and visualizations.

Architecture overview

The pipeline is organized into clear steps. Raw NYC taxi data (e.g. from NYC TLC) is loaded from data/raw.csv, and column names are normalized across schema variants (e.g. tpep_pickup_datetime vs lpep_pickup_datetime). The data is then cleaned: trip duration and Haversine distance are computed, and invalid rows (negative duration, missing coordinates, unrealistic trips) are filtered out. The clean data is loaded into a PostgreSQL table (trips_clean).
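The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the repo's code: the function names (haversine_km, is_valid_trip) and the thresholds for what counts as an "unrealistic" trip (maximum duration, maximum average speed) are assumptions chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_valid_trip(pickup, dropoff, duration_s,
                  max_duration_s=4 * 3600,   # assumed cutoff, not from the repo
                  max_speed_kmh=120.0):      # assumed cutoff, not from the repo
    """Return True if a trip passes the basic sanity filters.

    pickup and dropoff are (lat, lon) tuples whose entries may be None.
    """
    coords = (*pickup, *dropoff)
    if any(c is None for c in coords):
        return False  # missing coordinates
    if duration_s <= 0 or duration_s > max_duration_s:
        return False  # negative, zero, or implausibly long duration
    distance_km = haversine_km(*coords)
    if distance_km <= 0:
        return False  # pickup and dropoff coincide
    speed_kmh = distance_km / (duration_s / 3600)
    return speed_kmh <= max_speed_kmh  # reject implausibly fast trips
```

For example, is_valid_trip((40.7580, -73.9855), (40.6413, -73.7781), 1800) accepts a 30-minute trip from Midtown to JFK, while the same trip with a 60-second duration is rejected by the speed filter.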

SQL scripts define derived columns (hour of day, day of week, distance buckets), indexes, and train/eval splits (70/30). A PyTorch model is trained on the training set and saved; a metrics script then loads the model, generates predictions for the evaluation trips, and computes MAE, MedAE, P90 absolute error, and the percentage of trips within error thresholds. Analysis and visualizations (error histograms, error by distance/hour/day, calibration plots) are produced in Python. Spark is optionally supported for large datasets via a --use-spark flag.

Technical depth

The model is a feedforward neural network (MLP) with the following inputs: Haversine distance (km), hour of day and day of week (cyclical sin/cos encoding), distance bucket (one-hot: <1mi, 1–3mi, 3–5mi, 5–10mi, 10+mi), and normalized pickup/dropoff coordinates. The hidden layers are [128, 64, 32] with ReLU activations and dropout (0.2); the output is the predicted trip duration in seconds. The loss is MAE and the optimizer is Adam (lr 0.001). Training uses early stopping on validation loss, with 20% of the training data held out for validation; the train/val/test sets are split temporally to avoid leakage.
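To make the input layout concrete, here is a minimal sketch of how one trip could be turned into a feature vector with the inputs listed above. The helper names (cyclical, distance_bucket_one_hot, build_features), the feature order, and the choice to treat each bucket's lower edge as inclusive are assumptions for illustration; the repo may arrange these differently.

```python
import math

KM_PER_MILE = 1.609344
# Upper edges of the first four buckets in miles; the fifth bucket is 10+ mi.
BUCKET_EDGES_MI = (1.0, 3.0, 5.0, 10.0)


def cyclical(value, period):
    """Encode a periodic value as (sin, cos) so that e.g. hour 23 sits next to hour 0."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance_bucket_one_hot(distance_km):
    """One-hot vector over the buckets <1mi, 1-3mi, 3-5mi, 5-10mi, 10+mi."""
    miles = distance_km / KM_PER_MILE
    # Count how many edges the distance has reached (lower edges inclusive: an assumption).
    index = sum(miles >= edge for edge in BUCKET_EDGES_MI)
    return [1.0 if i == index else 0.0 for i in range(len(BUCKET_EDGES_MI) + 1)]


def build_features(distance_km, hour, day_of_week, norm_coords):
    """Assemble one input row.

    norm_coords is (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon),
    already normalized upstream. day_of_week runs 0..6.
    """
    return [
        distance_km,
        *cyclical(hour, 24),
        *cyclical(day_of_week, 7),
        *distance_bucket_one_hot(distance_km),
        *norm_coords,
    ]
```

Under these assumptions the vector has 14 entries (1 distance, 2 + 2 cyclical, 5 bucket, 4 coordinate), which would be the input width of the first linear layer.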

Primary metrics: Mean Absolute Error (MAE), Median Absolute Error (MedAE), P90 absolute error, and % of trips within ±10% or ±20% of actual duration. Metrics are segmented by distance bucket, hour of day, and day of week so failure modes can be localized. The project documents expected column formats and supports automatic column mapping for common NYC taxi schema variations.

Tradeoffs

The design prioritizes end-to-end reproducibility and analytical clarity over real-time serving. PostgreSQL holds the source of truth and supports flexible SQL for segmentation; the PyTorch model is trained offline and used for batch prediction in the metrics step. Using a single market (NYC) allows tying results to known traffic and geography; expanding to more cities would require schema and segment definitions to stay comparable.

Spark is optional so the project runs on a laptop with Pandas for smaller datasets; for large files, --use-spark speeds up load and compute. The model does not use route geometry or real-time traffic (only trip-level and temporal features), so it reflects a baseline that can be improved with richer data in a production setting.

Failure modes & experiment design

Documented failure modes include: short trips (<1 mile) with higher relative error due to fixed overhead (lights, stops); peak hours with higher traffic variability; dense urban areas with complex routing; and long trips (10+ miles) with mixed highway/city conditions. The repo proposes an online experiment: control (current model) vs treatment (e.g. segment-aware calibration, additional features). Primary metrics would be MedAE and P90; guardrails would include driver cancel rate, reroute rate, and user satisfaction. Duration, traffic split, and sample size are outlined for power and guardrail analysis.
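As a rough illustration of the sample-size step, the standard two-sample normal approximation can give a first estimate of trips needed per arm. This sketch is not taken from the repo: the function name and all input values are hypothetical, and the formula targets a difference in mean absolute error. Quantile metrics such as MedAE and P90 would more likely need a bootstrap or simulation-based power analysis.

```python
import math
from statistics import NormalDist


def samples_per_arm(sigma, min_detectable_diff, alpha=0.05, power=0.80):
    """Approximate trips per arm for a two-sided test on a difference in means.

    sigma: standard deviation of per-trip absolute error (seconds), assumed equal in both arms.
    min_detectable_diff: smallest difference in mean absolute error worth detecting (seconds).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    n = 2 * ((z_alpha + z_beta) * sigma / min_detectable_diff) ** 2
    return math.ceil(n)
```

With hypothetical inputs such as samples_per_arm(sigma=300, min_detectable_diff=10), the estimate grows with the square of sigma divided by the detectable difference, so halving the detectable difference roughly quadruples the required sample.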

Limitations & future work

Current limitations: no route information (only origin-destination coordinates and straight-line distance), no real-time traffic, and a static model that requires retraining. Future directions include route-based features (road type, intersections), real-time traffic integration, uncertainty quantification (e.g. confidence intervals), and continuous or online learning. The scaling-to-production section in the repo outlines feature stores, model serving, monitoring, and A/B testing infrastructure.