Data Engineering Weekly

Ananth Packkildurai

The Weekly Data Engineering Newsletter www.dataengineeringweekly.com

  1. 6 days ago

    The Future of Data Lakehouses: A Fireside Chat with Vinoth Chandar - Founder CEO Onehouse & PMC Chair of Apache Hudi

    Exploring the Evolution of Lakehouse Technology: A Conversation with Vinoth Chandar, Founder and CEO of Onehouse

    In this episode, Ananth, author of Data Engineering Weekly, and Vinoth Chandar, founder and CEO of Onehouse, discuss the latest developments in the Lakehouse technology space, focusing on Apache Hudi, Iceberg, and Delta Lake. They cover the intricacies of building high-scale data ecosystems, the impact of table format standardization, and technical advances in incremental processing and indexing. The conversation also explores the role of open source in shaping the future of data engineering and addresses community questions about integrating various databases and improving operational efficiency.

    00:00 Introduction and New Year Greetings
    01:19 Introduction to Apache Hudi and Its Impact
    02:22 Challenges and Innovations in Data Engineering
    04:16 Technical Deep Dive: Hudi's Evolution and Features
    05:57 Comparing Hudi with Other Data Formats
    13:22 Hudi 1.0: New Features and Enhancements
    20:37 Industry Perception and the Future of Data Formats
    24:29 Technical Differentiators and Project Longevity
    26:05 Open Standards and Vendor Games
    26:41 Standardization and Data Platforms
    28:43 Competition and Collaboration in Data Formats
    33:38 Future of Open Source and Data Community
    36:14 Technical Questions from the Audience
    47:26 Closing Remarks and Future Outlook

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com

    48 min
  2. 28/06/1446 AH

    Agents of Change: Navigating 2025 with AI and Data Innovation

    In this episode of DEW, the hosts and guests discuss their predictions for 2025, focusing on the rise and impact of agentic AI. The conversation covers three main categories:

    1. The role of agentic AI
    2. The future workforce dynamic involving humans and AI agents
    3. Innovations in data platforms heading into 2025

    Highlights include insights from Aswin and our special guest, Rajesh, on building robust agent systems, strategies for data engineers and AI engineers to remain relevant, data quality and observability, and the evolving landscape of Lakehouse architectures. The discussion also covers the challenges of integrating multi-agent systems and the economic implications of AI sovereignty and data privacy.

    00:00 Introduction and Predictions for 2025
    01:49 Exploring Agentic AI
    04:44 The Evolution of AI Models
    16:36 Enterprise Data and AI Integration
    25:06 Managing AI Agents
    36:37 Opportunities in AI and Agent Development
    38:02 The Evolving Role of AI and Data Engineers
    38:31 Managing AI Agents and Data Pipelines
    39:05 The Future of Data Scientists in AI
    40:03 Multi-Agent Systems and Interoperability
    44:09 Economic Viability of Multi-Agent Systems
    47:06 Data Platforms and Lakehouse Implementations
    53:14 Data Quality, Observability, and Governance
    01:02:20 The Rise of Multi-Cloud and Multi-Engine Systems
    01:06:21 Final Thoughts and Future Outlook

    1 hr 11 min
  3. 12/06/1445 AH

    Data Engineering Trends With Aswin & Ananth

    Welcome to another insightful edition of Data Engineering Weekly. As we approach the end of 2023, it's an opportune time to reflect on the key trends and developments that have shaped the field of data engineering this year. In this article, we'll summarize the crucial points from a recent podcast featuring Ananth and Aswin, two prominent voices in the data engineering community.

    Understanding the Maturity Model in Data Engineering
    A significant part of our discussion revolved around the maturity model in data engineering. Organizations must recognize their current position in the data maturity spectrum to make informed decisions about adopting new technologies. This approach ensures that adopting new tools and practices aligns with the organization's readiness and specific needs.

    The Rising Impact of AI and Large Language Models
    2023 witnessed a substantial impact of AI and large language models in data engineering. These technologies are increasingly automating processes like ETL, improving data quality management, and reshaping the landscape of data tools. Integrating AI into data workflows is not just a trend but a paradigm shift, making data processes more efficient and intelligent.

    Lakehouse Architectures: The New Frontier
    Lakehouse architectures have been at the forefront of data engineering discussions this year. The key focus has been interoperability among different data lake formats and the seamless integration of structured and unstructured data. This evolution marks a significant step towards more flexible and powerful data management systems.

    The Modern Data Stack: A Critical Evaluation
    The modern data stack (MDS) has been a hot topic, with debates around its sustainability and effectiveness. While MDS has driven hyper-specialization in product categories, challenges in integration and overlapping tool categories have raised questions about its long-term viability. The future of MDS remains a subject of keen interest as we move into 2024.

    Embracing Cost Optimization
    Cost optimization has emerged as a priority in data engineering projects. With the shift to cloud services, managing costs effectively while maintaining performance has become a critical concern. This trend underscores the need for efficient architectures that balance performance with cost-effectiveness.

    Streaming Architectures and the Rise of Apache Flink
    Streaming architectures have gained significant traction, with Apache Flink leading the way. Its growing adoption highlights the industry's shift towards real-time data processing and analytics. The support and innovation around Apache Flink suggest a continued focus on streaming architectures in the coming year.

    Looking Ahead to 2024
    As we look towards 2024, there's a sense of excitement about potential changes in fundamental layers, such as S3 Express, and about the broader impact of large language models. The anticipation is for more intelligent data platforms that effectively combine AI capabilities with human expertise, driving innovation and efficiency in data engineering.

    In conclusion, 2023 has been a year of significant developments and shifts in data engineering. As we move into 2024, the focus will likely be on refining these trends and exploring new frontiers in AI, lakehouse architectures, and streaming technologies. Stay tuned for more updates and insights in the next editions of Data Engineering Weekly. Happy holidays, and here's to a groundbreaking 2024 in data engineering!

    38 min
  4. 17/12/1444 AH

    DEW #133: How to Implement Write-Audit-Publish (WAP), Vector Database - Concepts and examples & Data Warehouse Testing Strategies for Better Data Quality

    Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives. On DEW #133, we selected the following articles.

    lakeFS: How to Implement Write-Audit-Publish (WAP)
    I wrote extensively about the WAP pattern in my latest article, An Engineering Guide to Data Quality - A Data Contract Perspective. Super excited to see a complete guide on implementing the WAP pattern in Iceberg, Hudi, and, of course, lakeFS.
    https://lakefs.io/blog/how-to-implement-write-audit-publish/

    Jatin Solanki: Vector Database - Concepts and examples
    On the topic of vector search, a new class of vector databases is emerging in the market to improve semantic search experiences. The author writes an excellent introduction to vector databases and their applications.
    https://blog.devgenius.io/vector-database-concepts-and-examples-f73d7e683d3e

    Policy Genius: Data Warehouse Testing Strategies for Better Data Quality
    Data testing and data observability are widely discussed topics in Data Engineering Weekly. However, both techniques run only after the transformation task is completed. Can we test SQL business logic during the development phase itself? Perhaps unit test the pipeline? The author writes an exciting article about adopting unit testing in the data pipeline by producing sample tables during development. We will see more tools around unit test frameworks for data pipelines soon. I don’t think testing data quality on all the PRs against the production database is a cost-effective solution. We can do better than that, tbh.
    https://medium.com/policygenius-stories/data-warehouse-testing-strategies-for-better-data-quality-d5514f6a0dc9
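    The idea of unit testing SQL business logic against small sample tables can be sketched roughly as follows. This is an illustrative sketch only, not the approach from the Policygenius article: it assumes an in-memory SQLite database as a stand-in for the warehouse, and the `orders` table, the `DAILY_REVENUE_SQL` query, and the `run_on_sample` helper are all hypothetical names invented for the example.

```python
import sqlite3

# Hypothetical business logic under test: revenue per customer per day,
# counting only completed orders.
DAILY_REVENUE_SQL = """
SELECT customer_id, order_date, SUM(amount) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY customer_id, order_date
ORDER BY customer_id, order_date
"""


def run_on_sample(sql, rows):
    """Run `sql` against an in-memory `orders` table seeded with sample rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders ("
            "customer_id INTEGER, order_date TEXT, amount REAL, status TEXT)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# A tiny hand-written sample table covering the cases we care about:
# multiple orders on one day, a cancelled order, and a second customer.
SAMPLE_ORDERS = [
    (1, "2023-06-01", 10.0, "completed"),
    (1, "2023-06-01", 5.0, "completed"),
    (1, "2023-06-01", 99.0, "cancelled"),
    (2, "2023-06-02", 7.5, "completed"),
]

result = run_on_sample(DAILY_REVENUE_SQL, SAMPLE_ORDERS)

# The cancelled order must not contribute to revenue.
assert result == [(1, "2023-06-01", 15.0), (2, "2023-06-02", 7.5)]
```

    Because the sample table lives in memory, a check like this can run on every pull request without touching the production database; SQL dialect differences between SQLite and a real warehouse remain a caveat.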

    23 min
  5. 17/12/1444 AH

    DEW #132: The New Generative AI Infra Stack, Databricks cost management at Coinbase, Exploring an Entity Resolution Framework Across Various Use Cases & What's the hype behind DuckDB?

    Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives. On DEW #132, we selected the following articles.

    Cowboy Ventures: The New Generative AI Infra Stack
    Generative AI has taken the tech industry by storm. In Q1 2023, a whopping $1.7B was invested into gen AI startups. Cowboy Ventures unbundles the various categories of the Generative AI infra stack here.
    https://medium.com/cowboy-ventures/the-new-infra-stack-for-generative-ai-9db8f294dc3f

    Coinbase: Databricks cost management at Coinbase
    Effective cost management in data engineering is crucial as it maximizes the value gained from data insights while minimizing expenses. It ensures sustainable and scalable data operations, fostering a balanced business growth path in the data-driven era. Coinbase shares a case study on cost management for Databricks and how they use the open-source Overwatch tool to manage Databricks costs.
    https://www.coinbase.com/blog/databricks-cost-management-at-coinbase

    Walmart: Exploring an Entity Resolution Framework Across Various Use Cases
    Entity resolution, a crucial process that identifies and links records representing the same entity across various data sources, is indispensable for generating powerful insights about relationships and identities. This process, often leveraging fuzzy matching techniques, not only enhances data quality but also facilitates nuanced decision-making by effectively managing relationships and tracking potential matches among data records. Walmart writes about the pros and cons of approaching fuzzy matching with rule-based and ML-based matching.
    https://medium.com/walmartglobaltech/exploring-an-entity-resolution-framework-across-various-use-cases-cb172632e4ae

    Matt Palmer: What's the hype behind DuckDB?
    So, DuckDB: is it hype, or does it have the real potential to bring architectural changes to the data warehouse? The author explains how DuckDB works and the potential impact of DuckDB in Data Engineering.
    https://mattpalmer.io/posts/whats-the-hype-duckdb/
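    To make the rule-based side of entity resolution concrete, here is a minimal sketch of fuzzy matching between two record sets. It is not Walmart's framework: it assumes Python's standard-library `difflib.SequenceMatcher` as the similarity measure, and the `normalize`, `similarity`, and `resolve` helpers, the 0.85 threshold, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve(records_a, records_b, threshold=0.85):
    """Return (id_a, id_b, score) for every cross-source pair at or above threshold.

    The rule here is a single hard-coded cutoff; an ML-based matcher would
    instead learn the decision from labeled pairs.
    """
    matches = []
    for id_a, name_a in records_a:
        for id_b, name_b in records_b:
            score = similarity(name_a, name_b)
            if score >= threshold:
                matches.append((id_a, id_b, round(score, 2)))
    return matches


# Hypothetical records from two sources describing overlapping companies.
crm = [("c1", "Acme Corp."), ("c2", "Globex Inc")]
billing = [("b1", "ACME Corp"), ("b2", "Initech LLC")]

matches = resolve(crm, billing)
matched_ids = [(a, b) for a, b, _ in matches]
```

    The nested loop compares every pair, which is fine for a toy example but quadratic in the number of records; real systems typically add a blocking step to limit candidate pairs.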

    35 min
  6. 20/11/1444 AH

    DEW #131: dbt model contract, Instacart ads modularization in LakeHouse Architecture, Jira to automate Glue tables, Server-Side Tracking

    Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives. On DEW #131, we selected the following articles.

    Ramon Marrero: DBT Model Contracts - Importance and Pitfalls
    dbt introduced model contracts with the 1.5 release. There were a few critiques of the dbt model contract implementation, such as The False Promise of dbt Contracts. I found the argument made in The False Promise of dbt Contracts surprising, especially this comment: “As a model owner, if I change the columns or types in the SQL, it's usually intentional.” My immediate reaction was, hmm, not really. However, as with any initial system iteration, the dbt model contract implementation has pros and cons. I’m sure it will evolve as adoption increases. The author did an amazing job writing a balanced view of the dbt model contract.
    https://medium.com/geekculture/dbt-model-contracts-importance-and-pitfalls-20b113358ad7

    Instacart: How Instacart Ads Modularized Data Pipelines With Lakehouse Architecture and Spark
    Instacart writes about its journey of building its ads measurement platform. A couple of things stand out for me in the blog.
    * The event store is moving from S3/Parquet storage to Delta Lake storage, a sign of Lakehouse format adoption across the board.
    * Instacart's adoption of the Databricks ecosystem along with Snowflake.
    * The move to rewrite SQL into a composable Spark SQL pipeline for better readability and testing.
    https://tech.instacart.com/how-instacart-ads-modularized-data-pipelines-with-lakehouse-architecture-and-spark-e9863e28488d

    Timo Dechau: The extensive guide for Server-Side Tracking
    The blog is an excellent overview of server-side event tracking. The author highlights how event tracking is always closer to the UI flow than to the business flow, and covers all the things that can go wrong with frontend event tracking. A must-read article if you’re passionate about event tracking like me.

    Credit Saison: Using Jira to Automate Updations and Additions of Glue Tables
    This schema change could’ve been a JIRA ticket!!! I found the article an excellent example of workflow automation on top of the familiar ticketing system, JIRA. The blog narrates the challenges with Glue Crawler and how selectively applying database change management through JIRA helps overcome the technical debt of running a custom crawler for 6+ hours.
    https://medium.com/credit-saison-india/using-jira-to-automate-updations-and-additions-of-glue-tables-58d39adf9940

    28 min
  7. 07/11/1444 AH

    DEW #129: DoorDash's Generative AI, Europe data salary, Data Validation with Great Expectations, Expedia's Event Sourcing

    Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives. On DEW #129, we selected the following articles.

    DoorDash: DoorDash identifies Five big areas for using Generative AI
    Generative AI took the industry by storm, and every company is trying to figure out what it means to them. DoorDash writes about its discovery of Generative AI and its application to boost its business.
    * The assistance of customers in completing tasks
    * Better tailored and interactive discovery [Recommendation]
    * Generation of personalized content and merchandising
    * Extraction of structured information
    * Enhancement of employee productivity
    https://doordash.engineering/2023/04/26/doordash-identifies-five-big-areas-for-using-generative-ai/

    Mikkel Dengsøe: Europe data salary benchmark 2023
    Fascinating findings on data salaries across European countries. The key findings are:
    * German-based roles pay lower.
    * London and Dublin-based roles have the highest compensation. The Dublin sample is skewed to more senior roles, with 55% of reported salaries being senior, so the result says more about the sample than about jobs in Dublin paying more than those in London.
    * Jobs at the 75th percentile in Amsterdam, London, and Dublin pay nearly 50% more than those in Berlin.
    https://medium.com/@mikldd/europe-data-salary-benchmark-2023-b68cea57923d

    Trivago: Implementing Data Validation with Great Expectations in Hybrid Environments
    The article by Trivago discusses the integration of data validation with Great Expectations. It presents a well-balanced case study that emphasizes the significance of data validation and the necessity for sophisticated statistical validation methods.
    https://tech.trivago.com/post/2023-04-25-implementing-data-validation-with-great-expectations-in-hybrid-environments.html

    Expedia: How Expedia Reviews Engineering Is Using Event Streams as a Source Of Truth
    “Events as a source of truth” is a simple but powerful idea: persist the state of a business entity as a sequence of state-changing events. How do you build such a system? Expedia writes about its review stream system to demonstrate how it adopted the event-first approach.
    https://medium.com/expedia-group-tech/how-expedia-reviews-engineering-is-using-event-streams-as-a-source-of-truth-d3df616cccd8
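    The "events as a source of truth" idea, where an entity's state is derived by replaying its state-changing events, can be sketched in a few lines. This is a generic illustration and not Expedia's design: the `Event` class, the event kinds ("submitted", "edited", "removed"), and the `apply_event` and `replay` helpers are hypothetical names chosen for the example, and an in-memory list stands in for a durable event stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    review_id: str
    kind: str      # hypothetical kinds: "submitted", "edited", "removed"
    payload: dict


def apply_event(state, event):
    """Fold one event into the current state and return the new state."""
    if event.kind == "submitted":
        return {
            "rating": event.payload["rating"],
            "text": event.payload["text"],
            "visible": True,
        }
    if event.kind == "edited":
        # Only the fields present in the payload change.
        return {**state, **event.payload}
    if event.kind == "removed":
        return {**state, "visible": False}
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, review_id):
    """Rebuild the current state of one review from the ordered event log."""
    state = {}
    for event in events:
        if event.review_id == review_id:
            state = apply_event(state, event)
    return state


# The append-only log is the source of truth; current state is always derived.
log = [
    Event("r1", "submitted", {"rating": 4, "text": "Nice stay"}),
    Event("r2", "submitted", {"rating": 2, "text": "Noisy"}),
    Event("r1", "edited", {"rating": 5}),
    Event("r2", "removed", {}),
]

r1_state = replay(log, "r1")
r2_state = replay(log, "r2")
```

    Because state is never stored directly, replaying a prefix of the log reconstructs what a review looked like at any earlier point, which is one of the main attractions of the approach.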

    32 min
  8. 09/10/1444 AH

    DEW #124: State of Analytics Engineering, ChatGPT, LLM & the Future of Data Consulting, Unified Streaming & Batch Pipeline, and Kafka Schema Management

    Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives. On DEW #124, we selected the following articles.

    dbt: State of Analytics Engineering
    dbt publishes the state of analytics [data???🤔] engineering. If you follow Data Engineering Weekly, you know we actively talk about data contracts and how data is a collaboration problem, not just an ETL problem. The state of analytics engineering survey validates this, as two of the top 5 concerns are data ownership and collaboration between data producers and consumers. Here are the top 5 key learnings from the report.
    * 46% of respondents plan to invest more in data quality and observability this year, the most popular area for future investment.
    * Lack of coordination between data producers and data consumers is perceived by all respondents to be this year’s top threat to the ecosystem.
    * Data and analytics engineers are most likely to believe they have clear goals and are most likely to agree their work is valued.
    * 71% of respondents rated data team productivity and agility positively, while data ownership ranked as a top concern for most.
    * Analytics leaders are most concerned with stakeholder needs. 42% say their top concern is “Data isn’t where business users need it.”
    https://www.getdbt.com/state-of-analytics-engineering-2023/

    Rittman Analytics: ChatGPT, Large Language Models and the Future of dbt and Analytics Consulting
    It is fascinating to read about the potential impact of LLMs on the future of dbt and analytics consulting. The author predicts we are at the beginning of the industrial revolution of computing. Future iterations of generative AI, public services such as ChatGPT, and domain-specific versions of these underlying models will make IT and computing to date look like the spinning jenny that was the start of the industrial revolution. 🤺🤺🤺 May the best LLM win!! 🤺🤺🤺
    https://www.rittmananalytics.com/blog/2023/3/26/chatgpt-large-language-models-and-the-future-of-dbt-and-analytics-consulting

    LinkedIn: Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam
    One of the curses of adopting Lambda Architecture is the need to rewrite business logic in both streaming and batch pipelines. Spark attempts to solve this by creating a unified RDD model for streaming and batch; Flink introduces the table format to bridge the gap in batch processing. LinkedIn writes about its experience adopting Apache Beam, which offers a unified pipeline abstraction that can run on any target data processing runtime, such as Samza, Spark, and Flink.
    https://engineering.linkedin.com/blog/2023/unified-streaming-and-batch-pipelines-at-linkedin--reducing-proc

    Wix: How Wix manages Schemas for Kafka (and gRPC) used by 2000 microservices
    Wix writes about managing schemas for 2000 (😬) microservices by standardizing schema structure with protobuf and the Kafka schema registry. Some exciting reads include patterns like an internal Wix Docs approach and the integration of documentation publishing into the CI/CD pipelines.
    https://medium.com/wix-engineering/how-wix-manages-schemas-for-kafka-and-grpc-used-by-2000-microservices-2117416ea17b

    37 min
