59 min

Giving Your Data Science Projects And Teams A Home At DagsHub
The Python Podcast.__init__

Summary
Collaborating on software projects is largely a solved problem, with a variety of hosted or self-managed platforms to choose from. For data science projects, collaboration is still an open question. There are a number of projects that aim to bring collaboration to data science, but each one solves a different aspect of the problem. Dean Pleban and Guy Smoilovsky created DagsHub to give individuals and teams a place to store and version their code, data, and models. In this episode they explain how DagsHub is designed to make it easier to create and track machine learning experiments and to promote collaboration on open source data science projects.

Announcements

Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Your host as usual is Tobias Macey and today I’m interviewing Dean Pleban and Guy Smoilovsky about DagsHub, a platform to track experiments, and version data, models & pipelines for your data science and machine learning projects.

Interview

Introduction
How did you first get introduced to Python?
Can you start by describing what the DagsHub platform is and why you built it?
There are a number of projects and platforms that aim to support collaboration among data scientists. What are the distinguishing features of DagsHub and how does it compare to the other options in the ecosystem?

What are the biggest opportunities for improvement that you still see in the space of collaboration on data projects?

What do you see as the biggest points of friction for building experiments and managing source data collaboratively?
Can you describe how the DagsHub platform is implemented?

How have the design and goals of the system changed or evolved since you first began working on it?
How have your own understanding and practices of working on data science/ML projects changed?

GitHub has a number of convenience features beyond just storing a git repository. What are the capabilities that you are focusing on to add value to the data science workflow within DagsHub?
How are you approaching the bootstrapping problem of building a critical mass of users to be able to generate a beneficial network effect?
Are there any conventions that make it easier or more familiar for newcomers to a given project? (e.g. code layout, data labeling/tagging formats, etc.)
What are your recommendations for managing ownership/licensing of data assets in public projects?
What are some of the most interesting, innovative, or unexpected ways that you have seen DagsHub used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building DagsHub?
When is DagsHub the wrong choice?
What do you have planned for the future of the platform and business?

Keep In Touch
Follow us on Twitter or LinkedIn, join our Discord, sign up to DAGsHub

@DeanPlbn
@Guy_T_Sky
@TheRealDAGsHub
DagsHub Discord

Picks

Tobias

The Remarkable Journey of Prince Jen by Lloyd Alexander

Dean

Quantum Computing Since Democritus by Scott Aaronson
The Expanse TV Series

Guy

Try to consume only the very best of available content, not the things that are coming out right now.
Applies to textbooks, TV shows, movies
Less Wrong blog
Slate Star Codex / Astral Codex Ten
Avatar: The Last Airbender
3Blue1Brown YouTube Channel