4 episodes

News and discussions about data engineering

DEbrief

    • News

    003. Hadoop is Dead and Kylin is Not

    This episode was delayed due to the ongoing situation in Ukraine. Thank you for understanding.



    Hot updates



    Pulsar is updated
    Apache Kylin
        "Extreme OLAP Engine for Big Data"
        Three main versions: 2.4, 3.0, and the most recent, 4.0.1
            v4 released in autumn 2021
        Brings OLAP back to data
        Been around since 2015, brought to you by eBay
        Not a friend to HBase, but likes Parquet
        Web interface for all data steps
        Official Python client with pandas support
    Ambari is killed (put in the attic)
    Apache Hop 1.1
        https://www.leanwithdata.com/blog/2022/02/hop-1.1.0/
        Graduated from the Incubator on January 18th
        Apache Hop Sucks!
    Dolphin Scheduler



    Lightning news



    Apache Arrow for Rust
    Apache Iceberg 0.13.0
    Hudi 0.10.1
    Apache HBase 2.4.9
    Apache Seatunnel
        easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data
    Apache ORC 1.6.13
    Apache Beam 2.35.0
    Apache Airflow 2.2.3



    Discussion: DataSecOps



    OWASP
    Data debiasing
    Data anonymization






    Dr. Igor Mosyagin
    Data Engineer @ Klarna
    Igor identifies himself as a pragmatic engineer with a strong academic background. A theoretical physicist by training, he eventually assumed he had enough PhDs and left academia to work with Data-* related things. As of 2022, Igor works as a Data Platform Engineer at Klarna. On top of that, he’s a huge fan of cephalopods, math rock, and quantum mechanics. He also hates baked carrots so much he decided to mention it in this bio.



    Pasha Finkelshteyn
    Developer advocate @ JetBrains
    With 14 years of experience in IT, Pasha has been through fire and water, from technical support to developer, team lead, and data engineer. Now Pasha works as a developer advocate for Data Engineering at JetBrains. He helps develop the Big Data Tools plugin and gives talks on Kotlin, various aspects of data engineering, and working with data. He is also the author and maintainer of the Kotlin API for Apache Spark.

    • 35 min
    002. Graphs and documents

    Hot updates



    dbt 1.0.0 released
        dbt is gaining popularity
        A great tool that solves real, existing problems
    RedisJSON is out for public preview: https://redis.com/blog/redisjson-public-preview-performance-benchmarking/
        Requires Redis 6.x or later
            probably a good point to talk once again that
        RedisJSON* is faster than MongoDB and Elasticsearch on direct read, write, and update workloads.
        Available in Redis Cloud
            or you can always build it yourself
        Basically a bunch of JSON commands for a "native" JSON experience:
            JSON.SET
            JSON.GET
            JSON.NUMINCRBY
        Client libraries for Go/Node.js/Python/Java/.NET/PHP/Ruby
            only Java and Python are official libs
        You can index your JSON documents using RediSearch, and you can set it up to update indexes on every write
        Check the benchmarks
    Why is the community dissing MongoDB recently? Their SSPL license is to blame
        FerretDB: a relevant, interesting solution that provides a MongoDB interface over PostgreSQL
    Neo4j 4.4 came out last December
        user impersonation is the main new feature
        All cloud providers have their own graph DB
            AuraDB, the managed Neo4j cloud service, is available on GCP and AWS
            Amazon Neptune
            Azure CosmosDB
        Neo4j has two main language engines: Cypher, Gremlin
            Gremlin is a Java API for graph DBs. Technically, Gremlin itself is a database engine

            "Cypher" is also a term used to describe freestyle rap in a group setting, which might be something to consider when you search online for Cypher tips
            Some say Cypher looks very Lua-ish
        Neo4j Desktop and a browser interface
            additional plugins for easier visualizations/explorations
        "I wonder if we could solve this year's AoC problems with some DB like that"
        O'Reilly often gives out their Neo4j book for free
        Why would you ever need a graph database?
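
    To make the three RedisJSON commands mentioned in the notes above (JSON.SET, JSON.GET, JSON.NUMINCRBY) a bit more concrete, here is a toy in-memory model of what they do. This is only a sketch: `ToyJsonStore`, its method names, and the `article:1` key are hypothetical and are not part of Redis or any client library, and the path handling assumes simple dotted paths such as `.stats.views`, which is far less than what the real module accepts.

```python
import json


class ToyJsonStore:
    """Hypothetical in-memory stand-in for a server with RedisJSON (not a real API)."""

    def __init__(self):
        self._docs = {}

    @staticmethod
    def _split(path):
        # "." means the document root; ".stats.views" means doc["stats"]["views"].
        return [part for part in path.split(".") if part]

    def _parent(self, key, parts):
        # Walk down to the container that holds the last path element.
        node = self._docs[key]
        for part in parts[:-1]:
            node = node[part]
        return node

    def json_set(self, key, path, value):
        """Rough analogue of: JSON.SET key path value"""
        value = json.loads(json.dumps(value))  # keep a JSON-safe deep copy
        parts = self._split(path)
        if not parts:
            self._docs[key] = value
        else:
            self._parent(key, parts)[parts[-1]] = value
        return "OK"

    def json_get(self, key, path="."):
        """Rough analogue of: JSON.GET key path"""
        node = self._docs[key]
        for part in self._split(path):
            node = node[part]
        return node

    def json_numincrby(self, key, path, amount):
        """Rough analogue of: JSON.NUMINCRBY key path amount"""
        parts = self._split(path)
        if not parts:
            raise ValueError("this toy model only increments nested numbers")
        parent = self._parent(key, parts)
        parent[parts[-1]] += amount
        return parent[parts[-1]]


store = ToyJsonStore()
store.json_set("article:1", ".", {"title": "Graphs and documents", "stats": {"views": 10}})
new_views = store.json_numincrby("article:1", ".stats.views", 5)
title = store.json_get("article:1", ".title")
```

    The point of the "native" JSON experience is visible even in the toy: the counter is updated in place by path, without reading the whole document, changing it on the client, and writing it back.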



    Lightning news



    Apache IoTDB 0.12.4
        group by multi level
    Lots of major vendor release updates due to the well-known log4j vulnerabilities
        Do not forget to update log4j if you haven't yet
    Calcite 1.29.0
        Log4j, obviously
    Apache Beam new release
        Minor release, 3 breaking changes
        Last time we mention it here, because they release new versions weekly
    LakeFS new releases
        They just never stop:
            performance improvements
            new OpenAPI method
            security check
    Apache ORC 1.7.2 released
        It's just good to know that this format is still alive and is being developed.
        row-level filtering in a columnar storage format
        now row-level predicates will work on rows (at the reader level)!
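
    The ORC note above on row-level predicates can be illustrated with a toy model. This is a conceptual sketch only, not ORC's reader API: `RowGroup`, `read_with_stats_only`, and `read_with_row_filter` are hypothetical names, and the assumption is that statistics-based pushdown can skip a whole group of rows but cannot drop individual non-matching rows inside a group it keeps.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical group of rows for a single numeric column."""
    values: list

    @property
    def max_value(self):
        # Stands in for the per-group statistics a columnar file keeps.
        return max(self.values)


def read_with_stats_only(groups, lower):
    """Skip groups that cannot match `value >= lower`; return the rest whole."""
    out = []
    for group in groups:
        if group.max_value >= lower:
            out.extend(group.values)  # non-matching rows leak through
    return out


def read_with_row_filter(groups, lower):
    """Same group skipping, plus the predicate applied to each row in the reader."""
    out = []
    for group in groups:
        if group.max_value >= lower:
            out.extend(v for v in group.values if v >= lower)
    return out


groups = [RowGroup([1, 2, 3]), RowGroup([4, 9, 5]), RowGroup([10, 11, 2])]
coarse = read_with_stats_only(groups, lower=9)
exact = read_with_row_filter(groups, lower=9)
```

    In the coarse version the caller still has to re-check every returned row; with the filter applied in the reader, only matching rows leave it.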



    Discussion: ETL and Reverse ETL







    • 38 min
    001. MQs, storages, and dataframes

    A few hot updates



    Apache Geode 1.12.5
        enterprise edition is known as GemFire
        geo-distributed storage
        has native clients in Java, C#, and C++ (!)
        JTA-compliant transaction support
    Pinot released 0.9.0
        Added Segment Merge and Rollup
            Rollup is a technique for tree-like groupby
                example: city, streets, houses
        General info about Pinot
            Made by guys from LinkedIn and Uber
                has ZooKeeper as a dependency
            column-oriented database
            It's an OLAP tool for real-time analytics
                there are BI tools focused on dashboards and reports
                    used by analysts etc.
                this is more for data exploration
                    for DE / DS folks
            Near real-time ingestion from streams (Kafka, Kinesis) and batch ingestion from Hadoop/S3 and the like
            It has a built-in UI for SQL edits and general BI for exploration
                focus on real-time analytics
            You can connect Pinot to various BI tools such as Superset, Tableau, or PowerBI to visualize data in Pinot
    RocketMQ 4.9.0 / 4.9.2
        Comparison Table
        Based on ActiveMQ
        Does not need ZooKeeper
        Has a concept of strict message ordering
        Focuses on a perfect configuration OOTB
        Rich web interface
    SQLite 3.37
        new STRICT table definition and ANY type
            works as cast on write
        CLI client update:
            multiple connections from the same client
            security mode with `-safe`
        the author is well known as a supporter of flexible typing, have a read: https://sqlite.org/flextypegood.html
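
    The SQLite STRICT note above ("works as cast on write") can be tried from Python's standard `sqlite3` module. This is a minimal sketch: the table name `t`, its columns, and the `strict_demo` helper are made up for illustration, and it assumes the SQLite library bundled with your Python is 3.37 or newer (the function returns `None` otherwise).

```python
import sqlite3


def strict_demo():
    """Insert a few rows into a STRICT table and report what was stored."""
    if sqlite3.sqlite_version_info < (3, 37, 0):
        return None  # STRICT tables are not available in this SQLite build

    conn = sqlite3.connect(":memory:")
    # `id` must hold integers; `payload` is declared ANY and accepts any type.
    conn.execute("CREATE TABLE t (id INTEGER, payload ANY) STRICT")
    conn.execute("INSERT INTO t VALUES (1, 'text')")
    # The text '2' converts losslessly to an integer, so it is cast on write.
    conn.execute("INSERT INTO t VALUES ('2', 3.5)")

    try:
        # 'abc' cannot be converted to an integer, so a STRICT table rejects it.
        conn.execute("INSERT INTO t VALUES ('abc', 1)")
        rejected = False
    except sqlite3.Error:
        rejected = True

    rows = conn.execute(
        "SELECT id, typeof(id), payload, typeof(payload) FROM t ORDER BY id"
    ).fetchall()
    conn.close()
    return rows, rejected


result = strict_demo()
```

    In an ordinary (non-STRICT) table the third insert would succeed and store the text 'abc' in the INTEGER column, which is exactly the flexible typing the linked article defends.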



    Lightning



    Superset 1.3.2
        bugfixes
        if you never saw what 1.3.0 has to offer, check it out: they have funnels
        also a revised treemap viz
    Beam 2.34.0
    NiFi new release (1.15.0)
        main feature is parameter context inheritance
    Apache Ratis release
        Raft
    Airflow 2.2.2
        bugfixes 🤷‍♀️
    NATS 2.6.5 recent release
        bugfixes



    Discussion: Are dataframes necessary?



    Kotlin DataFrame



    Music by https://t.me/red_hands







    • 38 min
    000. Pilot

    Hot Updates



    Spark 3.2 with pandas API support
    Apache Beam 2.33.0
    Arrow 6
    Airflow 2.2.1



    Lightning news



    Streamlit Cloud released
    TestContainers Cloud
    Greenplum 0.16 released



    Discussion



    Classical CI/CD vs GitOps







    • 25 min
