Go From Notebook To Pipeline For Your Data Science Projects With Orchest

The Python Podcast.init

Summary
Jupyter notebooks are a dominant tool for data scientists, but they lack a number of conveniences for building reusable and maintainable systems. Machine learning projects in particular require the ability to pivot from exploring a particular dataset or problem to integrating that solution into a larger workflow. Rick Lamers and Yannick Perrenet were tired of struggling with one-off solutions when they created the Orchest platform. In this episode they explain how Orchest allows you to turn your notebooks into executable components that are integrated into a graph of execution for running end-to-end machine learning workflows.

Announcements

Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Your host as usual is Tobias Macey and today I’m interviewing Rick Lamers and Yannick Perrenet about Orchest, a development environment designed for building data science pipelines from notebooks and scripts.

Interview

Introductions
How did you get introduced to Python?
Can you start by giving an overview of what Orchest is and the story behind it?
Who are the users that you are building Orchest for and what are their biggest challenges?

What are some examples of the types of tools or workflows that they are using now?

What are some of the other tools or strategies in the data science ecosystem that Orchest might replace? (e.g. MLflow, Metaflow, etc.)
What problems does Orchest solve?
Can you describe how Orchest is implemented?

How have the design and goals of the project changed since you first started working on it?

What is the workflow for someone who is using Orchest?
What are some of the sharp edges that they might run into?
What is the deployable unit once a pipeline has been created?

How do you handle verification and promotion of pipelines across staging and production environments?

What are the interfaces available for integrating with or extending Orchest?

How might an organization incorporate a pipeline defined in Orchest with the rest of their data orchestration workflows?

How are you approaching governance and sustainability of the Orchest project?
What are the most interesting, innovative, or unexpected ways that you have seen Orchest used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Orchest?
When is Orchest the wrong choice?
What do you have planned for the future of the project and company?

Keep In Touch

Rick

ricklamers on GitHub
LinkedIn
@RickLamers on Twitter

Yannick

yannickperrenet on GitHub
LinkedIn

Picks

Tobias

Fresh Bagels

Rick

Vaex

Yannick

Cookiecutter
Pyenv

Links

Orchest
Geoffrey Hinton
Yann LeCun
CoffeeScript
Vim
GAN == Generative Adversarial Network
Git
SQL
BigQuery
Software Carpentry

Podcast Episode

Google Colab
Airflow

Podcast Episode

Kedro

Data Engineering Podcast Episode

nbdev

Podcast Episode

Papermill

Data Engineering Podcast Episode

MLflow
Metaflow

Podcast Episode

DVC

Podcast Episode

Andrew Ng
Kubeflow
Lua
Caddy
Traefik
DAG == Directed Acyclic Graph
Jupyter Enterprise Gateway
Streamlit
Kubernetes
Dagster

Podcast.__init__ Episode
Data Engineering Podcast Episode

DBT

Data Engineering Podcast Episode

GitLab
Spark
ETL

The intro and outro music is from Requiem for a Fish The F