Disseminate: The Computer Science Research Podcast

Jack Waudby
This podcast features interviews with Computer Science researchers. Host Dr. Jack Waudby interviews researchers about the problems they tackled, the solutions they developed, and how their findings can be applied in practice. The podcast is for industry practitioners, researchers, and students; it aims to narrow the gap between research and practice and to make awesome Computer Science research more accessible. There are two types of episode: (i) Cutting Edge (red/blue logo), where we talk to researchers about their latest work, and (ii) High Impact (gold/silver logo), where we talk to researchers about their influential work. You can support the show through Buy Me a Coffee. A donation of $3 will help us keep making awesome Computer Science research podcasts. Hosted on Acast. See acast.com/privacy for more information.

  1. Raunak Shah | R2D2: Reducing Redundancy and Duplication in Data Lakes | #59

    Oct 28

    In this episode, Raunak Shah joins us to discuss data redundancy in enterprise data lakes, which can lead to soaring storage and maintenance costs. Raunak explains how large-scale data environments, ranging from terabytes to petabytes, often contain duplicate and redundant datasets that are difficult to manage. He introduces the concept of "dataset containment" and explains its role in identifying and reducing redundancy at the table level in these data lakes, an area where there has been little prior work. Raunak then dives into the details of R2D2, a novel three-step hierarchical pipeline for detecting dataset containment efficiently. Using schema containment graphs, statistical min-max pruning, and content-level pruning, R2D2 progressively reduces the search space to pinpoint redundant data. Raunak also discusses how the system, implemented on platforms like Azure Databricks and AWS, improves significantly on existing methods, processing TB-scale data lakes in just a few hours with high accuracy. He concludes with a discussion of how R2D2 balances storage savings against performance by identifying datasets that can be deleted and reconstructed on demand, which is useful for enterprises aiming to streamline their data management. Materials: SIGMOD'24 Paper - R2D2: Reducing Redundancy and Duplication in Data Lakes; ICDE'24 - Towards Optimizing Storage Costs in the Cloud.

    31 min
  2. Matt Perron | Analytical Workload Cost and Performance Stability With Elastic Pools | #57

    Jul 22

    In this episode, we dive into the complexities of managing analytical query workloads with our guest, Matt Perron. Matt explains how rapid and unpredictable fluctuations in resource demand make provisioning difficult. Traditional methods lead either to over-provisioning, which results in excessive costs, or to under-provisioning, which causes poor query latency during demand spikes. Matt shares insights from recent research showing that cloud functions can dynamically match compute supply with workload demand without prior resource provisioning. While effective for low query volumes, this approach becomes cost-prohibitive as query volumes increase, so a more balanced strategy is needed. Matt introduces a novel strategy that combines the rapid scalability of cloud functions with the cost-effectiveness of virtual machines: fast but expensive cloud functions run alongside slow-starting yet inexpensive virtual machines to provide elasticity without sacrificing cost efficiency. He elaborates on how their implementation, called Cackle, achieves consistent performance and cost savings across a wide range of workloads and conditions. Tune in to learn how Cackle avoids the pitfalls of traditional approaches, delivering stable query performance and minimizing costs even as demand fluctuates wildly. Links: Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools [SIGMOD'24]; Matt's Homepage.

    52 min
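
As a rough illustration of the three-step pruning idea mentioned in the R2D2 episode description (#59), the sketch below checks whether one table is contained in another by applying a cheap schema test, then a min-max range test, then a full row-level comparison. It is a minimal sketch under stated assumptions, not the paper's implementation: the in-memory `Table` type, all function names, and the example data are hypothetical, "min-max pruning" is interpreted here as per-column value-range containment, and candidate pairs are enumerated by brute force rather than through a schema containment graph.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory table: column name -> list of values (all columns same length).
Table = Dict[str, List[int]]


def schema_contained(child: Table, parent: Table) -> bool:
    """Step 1 (cheapest): every column of child must also exist in parent."""
    return set(child) <= set(parent)


def minmax_contained(child: Table, parent: Table) -> bool:
    """Step 2: each child column's value range must lie inside the parent's range."""
    for col, values in child.items():
        if not values:
            continue  # an empty column cannot violate the range test
        if not parent[col]:
            return False
        if min(values) < min(parent[col]) or max(values) > max(parent[col]):
            return False
    return True


def content_contained(child: Table, parent: Table) -> bool:
    """Step 3 (most expensive): every distinct child row must appear in the
    parent, after projecting the parent onto the child's columns."""
    cols = sorted(child)
    parent_rows = set(zip(*(parent[c] for c in cols)))
    child_rows = set(zip(*(child[c] for c in cols)))
    return child_rows <= parent_rows


def find_contained(tables: Dict[str, Table]) -> List[Tuple[str, str]]:
    """Return (child, parent) name pairs that survive all three checks.
    The `and` chain short-circuits, so later steps only run on survivors."""
    pairs = []
    for child_name, child in tables.items():
        for parent_name, parent in tables.items():
            if child_name == parent_name:
                continue
            if (schema_contained(child, parent)
                    and minmax_contained(child, parent)
                    and content_contained(child, parent)):
                pairs.append((child_name, parent_name))
    return pairs


# Hypothetical example data.
example_tables: Dict[str, Table] = {
    "orders": {"id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]},
    "orders_q1": {"id": [1, 2], "amount": [10, 20]},
    "orders_other": {"id": [2, 5], "amount": [20, 50]},
    "users": {"uid": [1, 2]},
}
result = find_contained(example_tables)
print(result)
```

The ordering is the point of the sketch: the schema and min-max tests need only metadata, so most candidate pairs can be discarded before any rows are compared. Row comparison here uses sets and therefore ignores duplicate rows.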

Ratings & Reviews

5/5
5 Ratings

