1 hr 23 min

Kodsnack 567 - Arrow straight through, with Matt Topol and Lars Wikman
Kodsnack in English

    • Technology

Fredrik has Matt Topol and Lars Wikman over for a deep and wide chat about Apache Arrow and many, many topics in the orbit of the language-independent columnar memory format for flat and hierarchical data. What does that even mean? What is the point? And why does Arrow only feel more and more interesting and useful the more you think about deeply integrating it into your systems?

Feeding data to systems fast enough is a problem that gets much less attention than it ought to. With Arrow you can send data over the network, process it on the CPU - or GPU for that matter - and send it along to the database. All without parsing, transformation, or copies unless absolutely necessary.

Thank you Cloudnet for sponsoring our VPS!

Comments, questions or tips? We are @kodsnack, @tobiashieta, @oferlund and @bjoreman on Twitter, have a page on Facebook and can be emailed at info@kodsnack.se if you want to write longer. We read everything we receive.

If you enjoy Kodsnack we would love a review in iTunes! You can also support the podcast by buying us a coffee (or two!) through Ko-fi.

Links

Lars
Matt
Øredev
Matt’s Øredev presentations: State of the Apache Arrow ecosystem: How your project can leverage Arrow! and Leveraging Apache Arrow for ML workflows
Kallbadhuset
Apache Arrow
Lars talks about his Arrow rabbit hole in Regular programming
SIMD/vectorization
Spark
Explorer - builds on Polars
Null bitmap
Zeromq
Airbyte
Arrow flight
Dremio
Arrow flight SQL
Influxdb
Arrow flight RPC
Kafka
Pulsar
Opentelemetry
Arrow IPC format - also known as Feather
ADBC - Arrow database connectivity
ODBC and JDBC
Snowflake
DBT - SQL to SQL
Jinja
Datafusion
Ibis
Substrait
Meta’s Velox engine
Arrow’s project management committee (PMC)
Voltron data
Matt’s Arrow book - In-memory analytics with Apache Arrow
Rapids and Cudf
The Theseus engine - accelerator-native distributed compute engine using Arrow
The composable codex
The standards chapter
Dremio
Hugging face
Apache Hop - orchestration data scheduling thing
Directed acyclic graph
UCX - libraries for finding fast routes for data
Infiniband
NUMA
CUDA
GRPC
Foam bananas
Turkish pepper - Tyrkisk peber
Plopp
Marianne

Titles

For me, it started during the speaker’s dinner
Old, dated, and Java
A real nerd snipe
Identical representation in memory
Working on columns
It’s already laid out that way
Pass the memory, as is
Null plus null is null
A wild perk
Arrow into the thing
So many curly brackets you need to store
Arrow straight through
Something data people like to do
So many backends
The SQL string is for people
I’m rude, and he’s polite
Feed the data fast enough
A depressing amount of JSON
Arrow the whole way through
These are the problems in data
Reference the bytes as they are
Boiling down to Arrow
Data lakehouses
Removing inefficiency

