Patterns for Successful Data Science Projects

Tuesday, April 24, 2018 - 10:35 am

Running data science workloads is a challenge regardless of whether you are running them on your laptop, on an on-premises cluster, or in the cloud. While buying a fully managed service is an option, these tools are usually quite expensive and lack extensibility. Therefore, many companies opt for open source data science tools like scikit-learn and Apache Spark's MLlib in order to balance functionality and cost.

However, even if a project succeeds at a point in time with a given set of tools, it becomes harder and harder to maintain as data volumes increase and the demand for real-time processing pushes the technology to its limits. New projects also struggle as new challenges of scale invalidate previous assumptions. This talk will discuss some patterns that we see at Databricks that companies use to succeed with their data science projects.

Patterns for Successful Data Science Projects (DataEDGE 2018)

Bill Chambers
Product Manager
Databricks

William Chambers is a product manager at Databricks, where he works on Structured Streaming and data science products. He is the lead author of Spark: The Definitive Guide, coauthored with Matei Zaharia. Bill also created SparkTutorials.net as a way to teach Apache Spark basics. Bill holds a master’s degree in information management and systems from UC Berkeley’s School of Information. During his time at school, Bill also created the Data Analysis in Python with pandas course for Udemy and was co-creator of and first instructor for Python for Data Science, part of UC Berkeley’s Master of Information and Data Science (MIDS) program.