Name: The BP Data Science Sandbox: An Internal Platform for Large Scale Data Science Research @ BP
Start: 2018-10-09T11:50:00-0500
End: 2018-10-09T12:10:00-0500


The BP Data Science Sandbox: An Internal Platform for Large Scale Data Science Research @ BP


Recent years have seen major advances in the state of the art in machine learning, particularly in fields such as natural language processing and 2D computer vision.

These advances have naturally spurred interest in applying similar techniques to new fields in medicine, science, and engineering. However, problems in these fields differ from previous machine learning successes in the level of domain expertise they require. While classifying an image as a cat, dog, or horse is a task that anyone can understand, automatically identifying malignant tumors, subsurface faults, or financial fraud often requires far more background in the specific domain. Unfortunately, it is rare today for one person to have the skills of both a data scientist or statistician and a domain expert (e.g. an oncologist or petroleum engineer).

This problem can generally be solved in two ways: (1) through education (of your data scientists and/or domain experts), or (2) through co-location of these two groups of people such that they can work closely together.

This talk will introduce the BP Data Science Sandbox (DSS) – an internal environment at BP that supports both of the above solutions. The sandbox is a platform made up of hardware, software, and people. On the hardware front, the sandbox includes everything from large-memory machines to GPU machines to compute clusters, so users can choose the platform that meets their resource requirements. On the software front, the sandbox is built entirely on free and open source software, including common tools such as Jupyter, JupyterHub, Spark, Dask, TensorFlow, and other packages in the Conda ecosystem. On the people front, the sandbox is staffed by a dedicated team of data scientists and infrastructure engineers who support its users and internal customers.

The DSS team operates in a mix of low-touch and high-touch project modes. In low-touch mode, users of the sandbox use the platform's software and hardware to build their own solutions, relying on the sandbox team only for intermittent support and advice. In high-touch mode, the DSS team executes projects on behalf of external business units that lack the in-house expertise to build their own data science and machine learning projects.

The DSS represents a unique internal capability – an environment where BP data scientists can work with BP domain experts to research and develop advanced machine learning solutions in-house rather than relying on third parties.

This talk will cover the technical architecture of the sandbox, the processes that govern access and prioritize projects, and a few of the high-touch research projects that the sandbox team has completed in seismic processing, digital rock processing, and temperature time series modeling.

Speakers

Stefan Garrard

Keith Gray

Manager, High Performance Computing, BP

Keith Gray is Manager of High Performance Computing for BP. The HPC Team supports the computing requirements for BP’s Advanced Seismic Imaging Research efforts. This team supports one of the largest Linux Clusters dedicated to research in Oil and Gas. Mr. Gray graduated from Virginia...

Max Grossman

Presenter, 7pod Technologies LLC


Tuesday October 9, 2018 11:50am - 12:10pm CDT
BioScience Research Collaborative 6500 Main Street, Houston, TX 77030

Data Science Methods and Platforms

Location Auditorium

2018 Data Science Conference
