
13 episodes

Data Engineering Weekly
Technology
5.0 • 1 Rating
Data Engineering Weekly is a podcast reflection of the popular data engineering newsletter www.dataengineeringweekly.com
DEW #133: How to Implement Write-Audit-Publish (WAP), Vector Database - Concepts and examples & Data Warehouse Testing Strategies for Better Data Quality
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives.
On DEW #133, we selected the following articles:
LakeFs: How to Implement Write-Audit-Publish (WAP)
I wrote extensively about the WAP pattern in my latest article, An Engineering Guide to Data Quality - A Data Contract Perspective. I'm super excited to see a complete guide on implementing the WAP pattern in Iceberg, Hudi, and, of course, with LakeFs.
https://lakefs.io/blog/how-to-implement-write-audit-publish/
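To make the pattern concrete, here is a minimal sketch of WAP in Python, using DuckDB as a stand-in engine; the table names and audit rules are hypothetical, and a real implementation would use Iceberg branches, Hudi, or LakeFs as the article describes.

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE orders_published (order_id INTEGER, amount DOUBLE)")

    # Write: land new data in a staging table, never directly in the published one.
    con.execute("""
        CREATE TABLE orders_staging AS
        SELECT * FROM (VALUES (1, 9.99), (2, 24.50)) AS t(order_id, amount)
    """)

    # Audit: run data quality checks against the staging table only.
    nulls = con.execute("SELECT count(*) FROM orders_staging WHERE order_id IS NULL").fetchone()[0]
    rows = con.execute("SELECT count(*) FROM orders_staging").fetchone()[0]
    assert nulls == 0 and rows > 0, "audit failed: nothing gets published"

    # Publish: only audited data becomes visible to consumers.
    con.execute("BEGIN TRANSACTION")
    con.execute("INSERT INTO orders_published SELECT * FROM orders_staging")
    con.execute("DROP TABLE orders_staging")
    con.execute("COMMIT")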
Jatin Solanki: Vector Database - Concepts and examples
Staying with vector search, a new class of vector databases is emerging in the market to improve semantic search experiences. The author writes an excellent introduction to vector databases and their applications.
https://blog.devgenius.io/vector-database-concepts-and-examples-f73d7e683d3e
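As a toy illustration of the core operation a vector database serves, here is brute-force cosine-similarity search over a few made-up embeddings; production systems replace the loop with approximate indexes such as HNSW or IVF.

    import numpy as np

    # Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
    docs = {
        "doc_a": np.array([0.9, 0.1, 0.0]),
        "doc_b": np.array([0.1, 0.8, 0.1]),
        "doc_c": np.array([0.2, 0.2, 0.9]),
    }

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    query = np.array([0.85, 0.15, 0.05])  # embedding of the user's search query
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    print(ranked)  # the semantically closest document comes first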
Policy Genius: Data Warehouse Testing Strategies for Better Data Quality
Data Testing and Data Observability are widely discussed topics in Data Engineering Weekly. However, both techniques test the data only after the transformation task is completed. Can we test the SQL business logic during the development phase itself? Perhaps unit test the pipeline?
The author writes an exciting article about adopting unit testing in the data pipeline by producing sample tables during development. We will see more tools around unit test frameworks for data pipelines soon. I don't think testing data quality on every PR against the production database is a cost-effective solution. We can do better than that, tbh.
https://medium.com/policygenius-stories/data-warehouse-testing-strategies-for-better-data-quality-d5514f6a0dc9
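As a rough sketch of the idea, assuming hypothetical table and column names: build a tiny sample table in an in-memory engine (DuckDB here) and assert on the output of the transformation SQL, so the business logic is tested before any PR touches production.

    import duckdb

    TRANSFORM = """
        SELECT customer_id, sum(amount) AS total_spend
        FROM payments
        GROUP BY customer_id
    """

    def test_total_spend_aggregation():
        con = duckdb.connect()
        # Sample table stands in for the production source.
        con.execute("""
            CREATE TABLE payments AS
            SELECT * FROM (VALUES (1, 10.0), (1, 5.0), (2, 7.5)) AS t(customer_id, amount)
        """)
        result = dict(con.execute(TRANSFORM).fetchall())
        assert result == {1: 15.0, 2: 7.5}

    test_total_spend_aggregation()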
DEW #132: The New Generative AI Infra Stack, Databricks cost management at Coinbase, Exploring an Entity Resolution Framework Across Various Use Cases & What's the hype behind DuckDB?
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives.
On DEW #132, we selected the following articles:
Cowboy Ventures: The New Generative AI Infra Stack
Generative AI has taken the tech industry by storm. In Q1 2023, a whopping $1.7B was invested in gen AI startups. Cowboy Ventures unbundles the various categories of the generative AI infra stack here.
https://medium.com/cowboy-ventures/the-new-infra-stack-for-generative-ai-9db8f294dc3f
Coinbase: Databricks cost management at Coinbase
Effective cost management in data engineering is crucial, as it maximizes the value gained from data insights while minimizing expenses. It ensures sustainable and scalable data operations, fostering a balanced business growth path in the data-driven era. Coinbase writes a case study about cost management for Databricks and how they use the open-source Overwatch tool to manage Databricks costs.
https://www.coinbase.com/blog/databricks-cost-management-at-coinbase
Walmart: Exploring an Entity Resolution Framework Across Various Use Cases
Entity resolution, a crucial process that identifies and links records representing the same entity across various data sources, is indispensable for generating powerful insights about relationships and identities. This process, often leveraging fuzzy matching techniques, not only enhances data quality but also facilitates nuanced decision-making by effectively managing relationships and tracking potential matches among data records. Walmart writes about the pros and cons of approaching fuzzy matching with rule-based and ML-based matching.
https://medium.com/walmartglobaltech/exploring-an-entity-resolution-framework-across-various-use-cases-cb172632e4ae
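For intuition, here is a toy rule-based fuzzy match, the simpler of the two approaches the post compares; the records and the 0.8 threshold are invented, and an ML-based matcher would learn such weights from labeled pairs instead.

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    records = ["Wal-Mart Stores Inc.", "Walmart Inc", "Target Corporation"]
    query = "WALMART, INC."

    # Rule: treat two records as the same entity when similarity exceeds 0.8.
    for record in records:
        score = similarity(query, record)
        print(record, round(score, 2), "match" if score > 0.8 else "no match")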
Matt Palmer: What's the hype behind DuckDB?
So, DuckDB: is it hype, or does it have real potential to bring architectural changes to the data warehouse? The author explains how DuckDB works and its potential impact on data engineering.
https://mattpalmer.io/posts/whats-the-hype-duckdb/
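If you have not tried it, the appeal is easy to demo: DuckDB runs in-process and answers analytical SQL with no server to operate. A quick hypothetical example (pip install duckdb):

    import duckdb

    con = duckdb.connect()  # an in-memory analytical database inside the Python process
    result = con.execute("""
        SELECT fare_class, avg(fare) AS avg_fare
        FROM (VALUES ('economy', 120.0), ('economy', 95.5), ('business', 480.0))
             AS trips(fare_class, fare)
        GROUP BY fare_class
    """).fetchall()
    print(result)
    # Swap the VALUES clause for read_csv_auto('file.csv') or
    # read_parquet('file.parquet') to query local files directly.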
DEW #131: dbt model contract, Instacart ads modularization in LakeHouse Architecture, Jira to automate Glue tables, Server-Side Tracking
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives.
On DEW #131, we selected the following articles:
Ramon Marrero: DBT Model Contracts - Importance and Pitfalls
dbt introduced model contracts with the 1.5 release. There were a few critiques of the dbt implementation, such as The False Promise of dbt Contracts. I found the argument made in The False Promise of dbt Contracts surprising, especially the comment below.
As a model owner, if I change the columns or types in the SQL, it's usually intentional. - My immediate reaction was: Hmm, not really.
However, as with any initial iteration of a system, the dbt model contract implementation has pros and cons. I'm sure it will evolve as adoption increases. The author did an amazing job writing a balanced view of dbt model contracts.
https://medium.com/geekculture/dbt-model-contracts-importance-and-pitfalls-20b113358ad7
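dbt declares model contracts in YAML model configuration; purely to illustrate the concept in a language-neutral way, here is a hypothetical sketch of what enforcement amounts to: fail the build when a model's output schema drifts from what it promised downstream consumers.

    # Hypothetical contract for one model; dbt itself captures this in YAML.
    CONTRACT = {"order_id": "BIGINT", "amount": "DOUBLE"}

    def enforce_contract(actual_schema: dict) -> None:
        # An unintentional column or type change stops the build here.
        if actual_schema != CONTRACT:
            raise RuntimeError(f"contract violated: expected {CONTRACT}, got {actual_schema}")

    enforce_contract({"order_id": "BIGINT", "amount": "DOUBLE"})   # passes
    # enforce_contract({"order_id": "BIGINT", "amount": "VARCHAR"})  # would fail the build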
Instacart: How Instacart Ads Modularized Data Pipelines With Lakehouse Architecture and Spark
Instacart writes about its journey of building its ads measurement platform. A couple of things stand out for me in the blog.
The event store is moving from S3/Parquet storage to Delta Lake storage, a sign of Lakehouse format adoption across the board.
Instacart's adoption of the Databricks ecosystem alongside Snowflake.
The move to rewrite SQL into a composable Spark SQL pipeline for better readability and testing; a small sketch of the idea follows the link below.
https://tech.instacart.com/how-instacart-ads-modularized-data-pipelines-with-lakehouse-architecture-and-spark-e9863e28488d
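On that last point, here is what "composable" can mean in practice, with invented column names: break one monolithic query into named intermediate DataFrames so each step can be unit-tested in isolation (pip install pyspark).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").appName("ads").getOrCreate()
    events = spark.createDataFrame(
        [("ad1", "click"), ("ad1", "view"), ("ad2", "view")],
        ["ad_id", "event_type"],
    )

    clicks = events.filter(F.col("event_type") == "click")  # step 1: testable alone
    clicks_per_ad = clicks.groupBy("ad_id").count()         # step 2: testable alone
    clicks_per_ad.show()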
Timo Dechau: The extensive guide for Server-Side Tracking
The blog is an excellent overview of server-side event tracking. The author highlights how frontend event tracking is always closer to the UI flow than the business flow, and all the things that can go wrong with frontend event tracking. A must-read article if you're passionate about event tracking like me.
https://hipsterdatastack.substack.com/p/the-extensive-guide-for-server-side
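The core idea is easy to sketch: emit the business event from the backend at the exact point where the business fact happens, rather than trusting a frontend snippet to fire. The event shape and names here are hypothetical; a real setup would ship to a collector or queue instead of printing.

    import json
    import time
    import uuid

    def track(event_name: str, properties: dict) -> None:
        event = {
            "id": str(uuid.uuid4()),
            "name": event_name,
            "ts": time.time(),
            "properties": properties,
        }
        print(json.dumps(event))  # stand-in for sending to a collector / queue

    def complete_order(order_id: str, amount: float) -> None:
        # ... business logic commits the order here ...
        track("order_completed", {"order_id": order_id, "amount": amount})

    complete_order("o-123", 49.90)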
This Schema change could’ve been a JIRA ticket!!!
I found the article an excellent example of workflow automation on top of the familiar ticketing system, JIRA. The blog narrates the challenges with Glue Crawler and how selectively applying database change management through JIRA helped overcome the technical debt of running a custom crawler for 6+ hours.
https://medium.com/credit-saison-india/using-jira-to-automate-updations-and-additions-of-glue-tables-58d39adf9940
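To give a flavor of the targeted alternative to a long-running crawler, here is a hedged boto3 sketch that applies one known column addition straight to the Glue catalog; the database, table, and column names are hypothetical, and error handling plus the JIRA-driven approval flow are omitted.

    import boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName="analytics", Name="orders")["Table"]

    # update_table takes a TableInput, which is a subset of get_table's output.
    table_input = {k: v for k, v in table.items()
                   if k in ("Name", "StorageDescriptor", "PartitionKeys",
                            "TableType", "Parameters")}
    table_input["StorageDescriptor"]["Columns"].append(
        {"Name": "coupon_code", "Type": "string"})  # the change the ticket asked for

    glue.update_table(DatabaseName="analytics", TableInput=table_input)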
DEW #129: DoorDash's Generative AI, Europe data salary, Data Validation with Great Expectations, Expedia's Event Sourcing
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives.
On DEW #129, we selected the following articles:
DoorDash identifies Five big areas for using Generative AI
Generative AI has taken the industry by storm, and every company is trying to determine what it means to them. DoorDash writes about its discovery of Generative AI and its application to boost its business.
The assistance of customers in completing tasks
Better tailored and interactive discovery [Recommendation]
Generation of personalized content and merchandising
Extraction of structured information
Enhancement of employee productivity
https://doordash.engineering/2023/04/26/doordash-identifies-five-big-areas-for-using-generative-ai/
Mikkel Dengsøe: Europe data salary benchmark 2023
Fascinating findings on Europe's data salaries across various countries. The key findings are:
Germany-based roles pay lower.
London and Dublin-based roles have the highest compensations. The Dublin sample is skewed to more senior roles, with 55% of reported salaries being senior, which is more indicative of the sample than jobs in Dublin paying higher than in London.
Jobs at the 75th percentile in Amsterdam, London, and Dublin pay nearly 50% more than those in Berlin.
https://medium.com/@mikldd/europe-data-salary-benchmark-2023-b68cea57923d
Trivago: Implementing Data Validation with Great Expectations in Hybrid Environments
The article by Trivago discusses the integration of data validation with Great Expectations. It presents a well-balanced case study that emphasizes the significance of data validation and the necessity for sophisticated statistical validation methods.
https://tech.trivago.com/post/2023-04-25-implementing-data-validation-with-great-expectations-in-hybrid-environments.html
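For readers new to the library, a tiny validation looks roughly like this; the exact interface varies across Great Expectations versions, and this follows the classic pandas-style API.

    import great_expectations as ge
    import pandas as pd

    # Wrap a pandas DataFrame so expectations can be run against it.
    df = ge.from_pandas(pd.DataFrame({
        "hotel_id": [1, 2, 3],
        "price": [80.0, 120.0, None],
    }))
    result = df.expect_column_values_to_not_be_null("price")
    print(result.success)  # False: one null price, so the check fails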
Expedia: How Expedia Reviews Engineering Is Using Event Streams as a Source Of Truth
“Events as a source of truth” is a simple but powerful idea to persist the state of the business entity as a sequence of state-changing events. How to build such a system? Expedia writes about the review stream system to demonstrate how it adopted the event-first approach.
https://medium.com/expedia-group-tech/how-expedia-reviews-engineering-is-using-event-streams-as-a-source-of-truth-d3df616cccd8
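A minimal sketch of the idea, with invented event types: the review's current state is never stored directly, it is derived by replaying the events.

    events = [
        {"type": "ReviewSubmitted", "review_id": "r1", "rating": 4},
        {"type": "ReviewEdited", "review_id": "r1", "rating": 5},
        {"type": "ReviewPublished", "review_id": "r1"},
    ]

    def replay(events: list) -> dict:
        state: dict = {}
        for e in events:  # the event log is the source of truth
            if e["type"] == "ReviewSubmitted":
                state = {"review_id": e["review_id"], "rating": e["rating"], "status": "pending"}
            elif e["type"] == "ReviewEdited":
                state["rating"] = e["rating"]
            elif e["type"] == "ReviewPublished":
                state["status"] = "published"
        return state

    print(replay(events))  # {'review_id': 'r1', 'rating': 5, 'status': 'published'}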
DEW #124: State of Analytics Engineering, ChatGPT, LLM & the Future of Data Consulting, Unified Streaming & Batch Pipeline, and Kafka Schema Management
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives.
On DEW #124 [https://www.dataengineeringweekly.com/p/data-engineering-weekly-124], we selected the following articles:
dbt: State of Analytics Engineering
dbt publishes the state of analytical [data???🤔] engineering. If you follow Data Engineering Weekly, we actively talk about data contracts & how data is a collaboration problem, not just an ETL problem. The survey validates this, as two of the top 5 concerns are data ownership & collaboration between data producers & consumers. Here are the top 5 key learnings from the report.
46% of respondents plan to invest more in data quality and observability this year— the most popular area for future investment.
Lack of coordination between data producers and data consumers is perceived by all respondents to be this year’s top threat to the ecosystem.
Data and analytics engineers are most likely to believe they have clear goals and are most likely to agree their work is valued.
71% of respondents rated data team productivity and agility positively, while data ownership ranked as a top concern for most.
Analytics leaders are most concerned with stakeholder needs. 42% say their top concern is “Data isn’t where business users need it.”
https://www.getdbt.com/state-of-analytics-engineering-2023/
Rittman Analytics: ChatGPT, Large Language Models and the Future of dbt and Analytics Consulting
Very fascinating to read about the potential impact of LLMs on the future of dbt and analytics consulting. The author predicts we are at the beginning of the industrial revolution of computing.
Future iterations of generative AI, public services such as ChatGPT, and domain-specific versions of these underlying models will make IT and computing to date look like the spinning jenny that was the start of the industrial revolution.
🤺🤺🤺🤺🤺🤺🤺🤺🤺May the best LLM win!! 🤺🤺🤺🤺🤺🤺
https://www.rittmananalytics.com/blog/2023/3/26/chatgpt-large-language-models-and-the-future-of-dbt-and-analytics-consulting
LinkedIn: Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam
One of the curses of adopting Lambda Architecture is the need to rewrite business logic in both the streaming and batch pipelines. Spark attempts to solve this by creating a unified RDD model for streaming and batch; Flink introduces the Table API to bridge the gap with batch processing. LinkedIn writes about its experience adopting Apache Beam, which follows a unified pipeline abstraction that can run on any target data processing runtime, such as Samza, Spark, and Flink.
https://engineering.linkedin.com/blog/2023/unified-streaming-and-batch-pipelines-at-linkedin--reducing-proc
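The unified abstraction is easiest to see in code: the same pipeline definition runs on different runners in batch or streaming mode. A minimal word-count-style sketch with invented event names (pip install apache-beam):

    import apache_beam as beam

    # Runs on the local DirectRunner by default; the same code targets
    # Flink, Spark, or Samza by swapping the runner in the pipeline options.
    with beam.Pipeline() as p:
        (
            p
            | beam.Create(["page_view", "click", "page_view"])
            | beam.Map(lambda event: (event, 1))
            | beam.CombinePerKey(sum)
            | beam.Map(print)
        )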
Wix: How Wix manages Schemas for Kafka (and gRPC) used by 2000 microservices
Wix writes about managing schemas for 2000 (😬) microservices by standardizing the schema structure with protobuf and the Kafka schema registry. Exciting parts include patterns like the internal Wix Docs approach and integrating documentation publishing into the CI/CD pipelines.
https://medium.com/wix-engineering/how-wix-manages-schemas-for-kafka-and-grpc-used-by-2000-microservices-2117416ea17b
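As a hedged sketch of the registration step with the confluent-kafka client (the subject name, message definition, and registry URL are all hypothetical):

    from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

    client = SchemaRegistryClient({"url": "http://localhost:8081"})

    # Register a protobuf schema under the subject for a topic's value.
    proto = 'syntax = "proto3"; message UserCreated { string user_id = 1; }'
    schema_id = client.register_schema("user-created-value", Schema(proto, "PROTOBUF"))
    print(schema_id)  # producers and consumers resolve the schema by this id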
DEW #123: Generative AI at BuzzFeed, Building OnCall Culture & Dimensional Modeling at WhatNot
Welcome to another episode of Data Engineering Weekly Radio. Ananth and Aswin discussed a blog from BuzzFeed that shares lessons learned from building products powered by generative AI. The blog highlights how generative AI can be integrated into a company's work culture and workflow to enhance creativity rather than replace jobs. BuzzFeed provided their employees with intuitive access to APIs and integrated the technology into Slack for better collaboration.
Some of the lessons learned from BuzzFeed's experience include:
Getting the technology into the hands of creative employees to amplify their creativity.
Effective prompts are a result of close collaboration between writers and engineers.
Moderation is essential and requires building guardrails into the prompts.
Demystifying the technical concepts behind the technology can lead to better applications and tools.
Educating users about the limitations and benefits of generative AI.
The economics of using generative AI can be challenging, especially for hands-on business models.
The conversation also touched upon the non-deterministic nature of generative AI systems, the importance of prompt engineering, and the potential challenges in integrating generative AI into data engineering workflows. As technology progresses, it is expected that the economics of generative AI will become more favorable for businesses.
https://tech.buzzfeed.com/lessons-learned-building-products-powered-by-generative-ai-7f6c23bff376
Moving on, we discuss the importance of on-call culture in data engineering teams. We emphasize the significance of data pipelines and their impact on businesses. With a focus on communication, ownership, and documentation, we highlight how data engineers should prioritize and address issues in data systems.
We also discuss the importance of on-call rotation, runbooks, and tools like PagerDuty and Airflow to streamline alerts and responses. Additionally, we mention the value of having an on-call handoff process, where one engineer summarizes their experiences and alerts during their on-call period, allowing for improvements and a better understanding of common issues.
Overall, this conversation stresses the need for a learning culture within data engineering teams, focusing on building robust systems, improving team culture, and increasing productivity.
https://towardsdatascience.com/how-to-build-an-on-call-culture-in-a-data-engineering-team-7856fac0c99
Finally, Ananth and Aswin discuss an article about adopting dimensional data modeling in hyper-growth companies. We appreciate the learning culture and emphasize balancing speed, maturity, scale, and stability.
We highlight how dimensional modeling was initially essential due to limited computing and expensive storage. However, as storage became cheaper and computing more accessible, dimensional modeling was often overlooked, leading to data junkyards. In the current landscape, it's important to maintain business-aware domain-driven data marts and acknowledge that dimensional modeling still has a role.
The conversation also touches upon the challenges of tracking slowly changing dimensions and the responsibility of data architects, engineers, and analytical engineers in identifying and implementing such dimensions. We discuss the need for a fine balance between design thinking and experimentation and stress the importance of finding the right mix of correctness and agility for each company.
https://medium.com/whatnot-engineering/same-data-sturdier-frame-layering-in-dimensional-data-modeling-at-whatnot-5e6a548ee713
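On the slowly-changing-dimension point, here is a toy Type 2 update in DuckDB with invented names: instead of overwriting a changed attribute, close out the old row and open a new one so history stays queryable.

    import duckdb

    con = duckdb.connect()
    con.execute("""
        CREATE TABLE dim_seller (
            seller_id INTEGER,
            tier VARCHAR,
            valid_from DATE,
            valid_to DATE  -- NULL marks the current row
        )
    """)
    con.execute("INSERT INTO dim_seller VALUES (7, 'standard', DATE '2023-01-01', NULL)")

    # Seller 7 upgrades to 'power' on 2023-06-01: close the old row, open a new one.
    con.execute("""
        UPDATE dim_seller SET valid_to = DATE '2023-06-01'
        WHERE seller_id = 7 AND valid_to IS NULL
    """)
    con.execute("INSERT INTO dim_seller VALUES (7, 'power', DATE '2023-06-01', NULL)")

    print(con.execute("SELECT * FROM dim_seller ORDER BY valid_from").fetchall())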