by Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Language: 🇺🇲 English
Publishing since: 1/8/2017
August 4, 2024
Summary<br />In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Bergh, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges faced by data engineers, such as constant system failures, the need for rapid changes, and high customer demands. He delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability. He emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to <a href="https://www.dataengineeringpodcast.com/starburst" target="_blank">dataengineeringpodcast.com/starburst</a> and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.</li><li>Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe what DataKitchen is and the story behind it?</li><li>You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?</li><li>Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?</li><li>The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?</li><li>What are the challenges that never went away?</li><li>You recently open-sourced the dataops-testgen and dataops-observability tools. 
What are the outcomes that you are trying to produce with those projects?</li><li>What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?</li><li>Can you talk through the technical implementation of your new observability and quality testing platform?</li><li>What does the onboarding and integration process look like?</li><li>Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?</li><li>What do you have planned for the future of your work at DataKitchen?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/chrisbergh/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Links<br /><ul><li><a href="https://datakitchen.io/" target="_blank">DataKitchen</a></li><li><a href="https://www.dataengineeringpodcast.com/episodepage/datakitchen-dataops-with-chris-bergh-episode-26" target="_blank">Podcast Episode</a></li><li><a href="https://www.nasa.gov/ames/core-area-of-expertise-air-traffic-management/" target="_blank">NASA</a></li><li><a href="https://dataopsmanifesto.org/en/" target="_blank">DataOps Manifesto</a></li><li><a href="https://thenewstack.io/its-time-for-data-reliability-engineering/?utm_referrer=https%3A%2F%2Fwww.google.com%2F" target="_blank">Data Reliability Engineering</a></li><li><a href="https://www.ibm.com/topics/data-observability" target="_blank">Data Observability</a></li><li><a href="https://www.getdbt.com/" target="_blank">dbt</a></li><li><a href="https://itrevolution.com/product/enterprise-technology-leadership-summit-las-vegas-2024/" target="_blank">DevOps Enterprise Summit</a></li><li><a href="https://amzn.to/46BsRSo" target="_blank">Building The Data Warehouse</a> by Bill Inmon (affiliate link)</li><li><a href="https://github.com/DataKitchen/data-observability-installer" target="_blank">dataops-testgen, dataops-observability</a></li><li><a href="https://info.datakitchen.io/data-observability-and-data-quality-testing-certification" target="_blank">Free Data Quality and Data Observability Certification</a></li><li><a href="https://www.databricks.com/" target="_blank">Databricks</a></li><li><a href="https://dora.dev/" target="_blank">DORA Metrics</a></li><li><a href="https://datakitchen.io/two-downs-make-two-ups-the-only-success-metrics-that-matter-for-your-data-analytics-team/" target="_blank">DORA for data</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
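The DataOps TestGen and DataOps Observability projects discussed in this episode center on automating data quality validation and watching data as it moves through production. As a rough, hypothetical sketch of the kind of checks such a tool automates (plain Python and SQLite only, not the TestGen API; the table, rule names, and `run_checks` helper are invented for illustration):

```python
# Hypothetical illustration of automated data quality checks: plain Python
# plus SQLite, not the DataOps TestGen API. Table, column, and rule names
# are invented for the example.
import sqlite3

def run_checks(conn, table, rules):
    """Evaluate simple quality rules against a table and return failures."""
    failures = []
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for column, rule in rules.items():
        if rule.get("not_null"):
            nulls = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
            ).fetchone()[0]
            if nulls:
                failures.append(f"{column}: {nulls}/{total} NULL values")
        if rule.get("unique"):
            distinct = conn.execute(
                f"SELECT COUNT(DISTINCT {column}) FROM {table}"
            ).fetchone()[0]
            if distinct < total:
                failures.append(f"{column}: {total - distinct} duplicate value(s)")
        if "min" in rule:
            below = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {column} < ?",
                (rule["min"],),
            ).fetchone()[0]
            if below:
                failures.append(f"{column}: {below} value(s) below {rule['min']}")
    return failures

# Small in-memory example with deliberately dirty data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, -5.0), (2, 7.5), (4, None)],
)

problems = run_checks(
    conn,
    "orders",
    {"order_id": {"not_null": True, "unique": True},
     "amount": {"not_null": True, "min": 0}},
)
for problem in problems:
    print("FAILED:", problem)
```

In a real deployment the rules would typically be profiled or generated from the data itself, and failures would be routed to an alerting or observability layer rather than printed.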
July 28, 2024
Summary<br />Data contracts are both an enforcement mechanism for data quality and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.<br /><br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to <a href="https://www.dataengineeringpodcast.com/starburst" target="_blank">dataengineeringpodcast.com/starburst</a> and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.</li><li>At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to <a href="https://motifica.ai" target="_blank">motific.ai</a> today to learn more!</li><li>Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you describe the scope and purpose of data contracts in the context of this conversation?</li><li>In what way(s) do they differ from data quality/data observability?</li><li>Data contracts are also known as the API for data; can you elaborate on this?</li><li>What are the types of guarantees and requirements that you can enforce with these data contracts?</li><li>What are some examples of constraints or guarantees that cannot be represented in these contracts?</li><li>Are data contracts related to the shift-left movement?</li><li>The obvious application of data contracts is in the context of pipeline execution flows, to prevent failing checks from propagating further in the data flow. 
What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?</li><li>How did you approach the design of the syntax and implementation for Soda's data contracts?</li><li>Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap with, e.g., dbt and Great Expectations?</li><li>Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?</li><li>What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?</li><li>When are data contracts the wrong choice?</li><li>What do you have planned for the future of data contracts?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/tombaeyens/?originalSubdomain=be" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! 
Email [email protected] with your story.</li></ul>Links<br /><ul><li><a href="https://www.soda.io/" target="_blank">Soda</a></li><li><a href="https://www.dataengineeringpodcast.com/soda-data-quality-management-episode-178" target="_blank">Podcast Episode</a></li><li><a href="https://en.wikipedia.org/wiki/JBoss_Enterprise_Application_Platform" target="_blank">JBoss</a></li><li><a href="https://datacreation.substack.com/p/what-is-and-what-isnt-a-data-contract" target="_blank">Data Contract</a></li><li><a href="https://airflow.apache.org/" target="_blank">Airflow</a></li><li><a href="https://en.wikipedia.org/wiki/Unit_testing" target="_blank">Unit Testing</a></li><li><a href="https://en.wikipedia.org/wiki/Integration_testing" target="_blank">Integration Testing</a></li><li><a href="https://www.openapis.org/" target="_blank">OpenAPI</a></li><li><a href="https://graphql.org/" target="_blank">GraphQL</a></li><li><a href="https://martinfowler.com/bliki/CircuitBreaker.html" target="_blank">Circuit Breaker Pattern</a></li><li><a href="https://docs.soda.io/soda/quick-start-sodacl.html" target="_blank">SodaCL</a></li><li><a href="https://docs.soda.io/soda/data-contracts.html" target="_blank">Soda Data Contracts</a></li><li><a href="https://www.datamesh-architecture.com/" target="_blank">Data Mesh</a></li><li><a href="https://greatexpectations.io/" target="_blank">Great Expectations</a></li><li><a href="https://docs.getdbt.com/docs/build/unit-tests" target="_blank">dbt Unit Tests</a></li><li><a href="https://opendatacontract.com/" target="_blank">Open Data Contracts</a></li><li><a href="https://bitol-io.github.io/open-data-contract-standard/latest/" target="_blank">ODCS == Open Data Contract Standard</a></li><li><a href="https://opendataproducts.org/" target="_blank">ODPS == Open Data Product Specification</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
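Since the episode frames data contracts as an API and enforcement mechanism for data, a minimal sketch of that idea in plain Python follows. This is not Soda's contract syntax (see the SodaCL and Soda Data Contracts links above for that); the `CONTRACT` structure and the `violations` and `publish` helpers are invented, and the gate mirrors the circuit-breaker pattern linked in the notes.

```python
# Minimal data contract sketch in plain Python -- not Soda's contract
# syntax. The contract, column specs, and publish() gate are invented
# to illustrate the "API for data" / circuit-breaker idea.
CONTRACT = {
    "dataset": "customers",
    "columns": {
        "customer_id": {"type": int, "required": True},
        "email": {"type": str, "required": True},
        "signup_date": {"type": str, "required": False},
    },
}

def violations(record, contract):
    """Return the list of contract violations for a single record."""
    found = []
    for name, spec in contract["columns"].items():
        value = record.get(name)
        if value is None:
            if spec["required"]:
                found.append(f"{name}: required field is missing")
            continue
        if not isinstance(value, spec["type"]):
            found.append(
                f"{name}: expected {spec['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    return found

def publish(records, contract):
    """Circuit-breaker style gate: refuse to ship a batch that breaks the contract."""
    bad = {}
    for index, record in enumerate(records):
        problems = violations(record, contract)
        if problems:
            bad[index] = problems
    if bad:
        raise ValueError(f"contract '{contract['dataset']}' violated: {bad}")
    return records  # downstream consumers only ever see conforming data

publish([{"customer_id": 1, "email": "[email protected]"}], CONTRACT)  # passes
# publish([{"customer_id": "1"}], CONTRACT)                       # raises ValueError
```

The design point is that violations stop a batch at the producer boundary instead of surfacing later in a downstream dashboard.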
July 21, 2024
Summary<br />Generative AI has rapidly gained adoption for numerous use cases. To support those applications, organizational data platforms need to add new features and data teams have increased responsibility. In this episode Lior Gavish, co-founder of Monte Carlo, discusses the various ways that data teams are evolving to support AI powered features and how they are incorporating AI into their work.<br />Announcements<br /><ul><li>Hello and welcome to the Data Engineering Podcast, the show about modern data management</li><li>Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to <a href="https://www.dataengineeringpodcast.com/starburst" target="_blank">dataengineeringpodcast.com/starburst</a> and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.</li><li>Your host is Tobias Macey and today I'm interviewing Lior Gavish about the impact of AI on data engineers</li></ul>Interview<br /><ul><li>Introduction</li><li>How did you get involved in the area of data management?</li><li>Can you start by clarifying what we are discussing when we say "AI"?</li><li>Previous generations of machine learning (e.g. deep learning, reinforcement learning, etc.) required new features in the data platform. What new demands is the current generation of AI introducing?</li><li>Generative AI also has the potential to be incorporated in the creation/execution of data pipelines. What are the risk/reward tradeoffs that you have seen in practice?<ul><li>What are the areas where LLMs have proven useful/effective in data engineering?</li></ul></li><li>Vector embeddings have rapidly become a ubiquitous data format as a result of the growth in retrieval augmented generation (RAG) for AI applications. What are the end-to-end operational requirements to support this use case effectively?<ul><li>As with all data, the reliability and quality of the vectors will impact the viability of the AI application. What are the different failure modes/quality metrics/error conditions that they are subject to?</li></ul></li><li>As much as vectors, vector databases, RAG, etc. seem exotic and new, it is all ultimately shades of the same work that we have been doing for years. 
What are the areas of overlap in the work required for running the current generation of AI, and what are the areas where it diverges?<ul><li>What new skills do data teams need to acquire to be effective in supporting AI applications?</li></ul></li><li>What are the most interesting, innovative, or unexpected ways that you have seen AI impact data engineering teams?</li><li>What are the most interesting, unexpected, or challenging lessons that you have learned while working with the current generation of AI?</li><li>When is AI the wrong choice?</li><li>What are your predictions for the future impact of AI on data engineering teams?</li></ul>Contact Info<br /><ul><li><a href="https://www.linkedin.com/in/lgavish/" target="_blank">LinkedIn</a></li></ul>Parting Question<br /><ul><li>From your perspective, what is the biggest gap in the tooling or technology for data management today?</li></ul>Closing Announcements<br /><ul><li>Thank you for listening! Don't forget to check out our other shows. <a href="https://www.pythonpodcast.com" target="_blank">Podcast.__init__</a> covers the Python language, its community, and the innovative ways it is being used. The <a href="https://www.aiengineeringpodcast.com" target="_blank">AI Engineering Podcast</a> is your guide to the fast-moving world of building AI systems.</li><li>Visit the <a href="https://www.dataengineeringpodcast.com" target="_blank">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li><li>If you've learned something or tried out a project from the show then tell us about it! Email <a target="_blank">[email protected]</a> with your story.</li></ul>Links<br /><ul><li><a href="https://www.montecarlodata.com/" target="_blank">Monte Carlo</a><ul><li><a href="https://www.dataengineeringpodcast.com/monte-carlo-observability-data-quality-episode-155" target="_blank">Podcast Episode</a></li></ul></li><li><a href="https://en.wikipedia.org/wiki/Natural_language_processing" target="_blank">NLP == Natural Language Processing</a></li><li><a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank">Large Language Models</a></li><li><a href="https://en.wikipedia.org/wiki/Generative_artificial_intelligence" target="_blank">Generative AI</a></li><li><a href="https://en.wikipedia.org/wiki/MLOps" target="_blank">MLOps</a></li><li><a href="https://www.coursera.org/articles/what-is-machine-learning-engineer" target="_blank">ML Engineer</a></li><li><a href="https://www.featurestore.org/what-is-a-feature-store" target="_blank">Feature Store</a></li><li><a href="https://github.blog/2024-04-04-what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/" target="_blank">Retrieval Augmented Generation (RAG)</a></li><li><a href="https://www.langchain.com/" target="_blank">Langchain</a></li></ul>The intro and outro music is from <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug" target="_blank">The Hug</a> by <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/" target="_blank">CC BY-SA</a>
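The retrieval augmented generation questions above hinge on one operational primitive: embedding text as vectors and retrieving the nearest ones for a query. The sketch below is a deliberately toy version of that primitive, using hashed bag-of-words vectors and brute-force cosine similarity in place of a learned embedding model and a vector database; all names are illustrative.

```python
# Toy illustration of the retrieval step in RAG: embed documents as vectors,
# then return the most similar ones for a query. Real systems use learned
# embedding models and a vector database; a hashed bag-of-words stands in
# here so the example stays self-contained.
import hashlib
import math

DIM = 64

def embed(text):
    """Map text to a fixed-size, unit-normalized vector by hashing its tokens."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity; a dot product, since both vectors are unit-normalized."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, corpus, k=2):
    """Rank documents by similarity to the query and return the top k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

docs = [
    "Airflow schedules and monitors data pipelines",
    "Vector databases store embeddings for similarity search",
    "dbt transforms data inside the warehouse",
]
print(retrieve("where do embeddings live for similarity search", docs, k=1))
```

Swapping the toy `embed` function for a real model, and the brute-force ranking for an index, is where the operational and reliability requirements discussed in the episode come in.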