Tens of thousands of Bay Area residents commute every day on Caltrain. Unfortunately, the system is unreliable, and the reported delay predictions are often completely wrong. Silicon Valley Data Science has created an extensive data architecture to collect different types of data (video streams, audio streams, GPS data, and web data) for predicting train arrival delays. In the larger data science project, we have conducted train classification using video and audio streams, sentiment analysis using Twitter data, and train arrival delay prediction using various machine learning and statistical methods.
In this talk, we will focus on two aspects of the larger data science project: (1) train classification using video streams and (2) train arrival delay prediction using various machine learning and statistical methods.