The Databricks Data Engineer

Jakub Lasak

Helping 18k+ Databricks data engineers become seniors: interview like seniors, execute like seniors, think like seniors.

  1. The Databricks interview round nobody studies for (and almost everybody fails)

    4 h ago

    Picture the debrief room after a Databricks loop. Two candidates went through that day. On paper, a coin flip: SQL tied, Spark internals solid, system design clean for both. Score only the rounds with a rubric and you cannot separate them. And yet the room isn't split. One gets the offer, and the thing that decided it wasn't any of the rounds they studied for. It was the conversation everyone treats as filler. There's a reason your strongest technical answers can't win it for you, and it's not the one you'd guess.

    In this episode:
    - Why the round with no whiteboard and no visible rubric is the one that decides close calls
    - What it actually measures, and why your best technical round can't test it
    - The three ways strong engineers fail it without ever noticing
    - Why "tell me about a decision you'd make differently" is a trap baited with your own best work
    - The one prep move that needs zero new facts, just a few honest walks

    This episode is for Databricks data engineers cleared past the technical bar who keep losing the close calls. Whether you're prepping for your next loop or wondering why a strong interview still ended in a no, you'll walk away with a specific way to rehearse the conversation everyone treats as a throwaway.

    ---
    Helping 18,000+ Databricks data engineers become seniors: interview like seniors, execute like seniors, think like seniors.
    Follow The Databricks Data Engineer for new episodes every Monday, Wednesday, and Friday.
    LinkedIn: linkedin.com/in/jrlasak
    Newsletter: dataengineer.wiki
    #DataEngineering #Databricks #DataEngineer #CareerGrowth #ApacheSpark #DeltaLake

    12 min
  2. The Spark Shuffle is baggage claim: why your job waits instead of computes (and more workers won't fix it)

    Jun 15

    Your Spark job has been running for forty minutes. The dashboard shows your cluster isn't even busy. So you do the obvious thing: add more workers. And it changes nothing. Here's why. During a shuffle, Spark is barely computing at all. It's tagging every row by destination, piling rows together, spilling the overflow to disk, and hauling data across the network between executors. It's an airport rerouting every passenger's bag to a new carousel, and more baggage handlers can't speed up a single overloaded belt.

    In this episode:
    - Why your slowest wide transformation spends most of its time on logistics, not computing
    - The four-step model that lets you explain the shuffle to a teammate in sixty seconds
    - Why adding workers can make a skewed job slower, not faster
    - The two numbers in the Spark UI that tell you whether it's skew, partition count, or spill
    - The one diagnostic to run before you ever resize the cluster again

    This episode is for Databricks data engineers whose joins and aggregations crawl for reasons the cluster size never seems to fix. Whether you're mid-level and tired of guessing, or senior and tired of paying for compute that doesn't help, you'll walk away able to read a slow shuffle instead of throwing hardware at it.

    ---
    LinkedIn: linkedin.com/in/jrlasak
    Newsletter: dataengineer.wiki
    #DataEngineering #Databricks #DataEngineer #CareerGrowth #ApacheSpark #DeltaLake
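
    To make the "single overloaded belt" concrete, here is a minimal sketch in plain Python (no Spark required) of tagging rows by destination with a hash of the key. It is an illustration, not Spark's actual implementation: `zlib.crc32` stands in for Spark's own hash function, and the key names and row counts are made up. The point it shows is that every row with the same key lands in the same partition, so a hot key keeps one partition overloaded no matter how many partitions you add.

```python
import zlib
from collections import Counter


def destination(key: str, num_partitions: int) -> int:
    """Tag a row with its destination partition (stand-in for Spark's hash partitioning)."""
    # crc32 is used only because it is stable across runs; Spark uses a different hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows each partition receives after the shuffle."""
    counts = Counter(destination(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical skewed dataset: one hot key plus 1,000 keys with a single row each.
keys = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]

for n in (8, 64):
    sizes = partition_sizes(keys, n)
    # The largest partition never drops below the hot key's 9,000 rows,
    # while the average partition shrinks as n grows.
    print(f"partitions={n} largest={max(sizes)} average={sum(sizes) / n:.0f}")
```

    In this toy model, going from 8 to 64 partitions shrinks the average partition but leaves the largest one at 9,000 rows or more, which is the pattern the episode describes: extra workers sit idle while one task does most of the work.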

    11 min
  3. Your Databricks data quality framework is a Yeti: everyone talks about it, nobody has seen it work

    Jun 8

    An architecture review. A platform team is presenting their data quality setup, and honestly, it's impressive. Expectations on every ingestion table. Drift metrics on the dashboard. A dedicated alerts channel. Then a finance engineer asks the only question that counts: when did this last catch something before one of us did? Silence. That silence is the whole problem. The decks, the suites, the dashboards are everywhere. The proof that any of it actually works is somewhere else entirely.

    In this episode:
    - Why the discipline every Databricks team talks about is the one with the fewest confirmed wins
    - A 30-second test that tells you if your data quality framework is alive or just well documented
    - Why checks chosen by what's easy to write miss the incidents that actually break you
    - The one deadline-night decision that quietly kills more frameworks than any outage
    - The reframe that separates teams who get caught off guard from teams who don't

    This episode is for Databricks data engineers who own pipelines feeding the numbers their company argues about. Whether you've got expectations in warn mode you never reverted, or a monitoring dashboard nobody reads, you'll walk away with a way to tell whether your framework is real and what to do if it isn't.

    ---
    LinkedIn: linkedin.com/in/jrlasak
    Newsletter: dataengineer.wiki
    #DataEngineering #Databricks #DataEngineer #DataQuality #ApacheSpark #DeltaLake

    12 min
  4. Why senior Databricks engineers write less code than mid-level ones

    Jun 2

    Two engineers, same team, both five years in. Last quarter Mark shipped forty-seven pull requests across three pipelines. Sam shipped nine. On any dashboard, Mark wins by a mile. Sam got the staff offer. Mark got a kind note about continuing to demonstrate impact. This isn't politics, and it isn't luck. It's a pattern that specifically catches the engineers who are best at shipping, because the most valuable work a senior Databricks data engineer does is invisible by construction. You can't put a ticket number on a problem that never happened.

    In this episode:
    - Why the exact behavior that makes you great at mid-level is the behavior that keeps you stuck there
    - The three categories of senior work that produce zero lines of code but move the entire platform
    - How to tell leveraged work apart from work that just feels safe to ship
    - Why "less code" is a symptom and not a goal, and the failure mode of engineers who get that backwards
    - The one question to ask before your hands hit the keyboard that changes what you volunteer for next sprint

    This episode is for Databricks data engineers who ship more than anyone on the team and quietly wonder why it isn't landing at review time. Whether you're a mid-level engineer optimizing the wrong line on the chart, or a senior tired of watching your highest-leverage work go uncounted, you'll walk away with language to make prevented work legible and a lens for spending your hours where they actually compound.

    ---
    LinkedIn: linkedin.com/in/jrlasak
    Newsletter: dataengineer.wiki
    #DataEngineering #Databricks #DataEngineer #CareerGrowth #ApacheSpark #DeltaLake

    11 min
  5. 4 habits that quietly turn your Databricks Delta Lake into a swamp

    May 26

    You built the table right. Well-partitioned, documented, fast enough that the row count came back before you finished reading your own Slack. Six months later it takes four minutes to return that same count, and nobody on your team ever decided to make it that way. There was no meeting, no design doc, no ticket titled "let's make this unqueryable by Q3." A swamp is not a decision. It's the sum of a few dozen reasonable shortcuts that compound into something nobody would have signed off on if you'd proposed it all at once. Which is why telling people to "be more careful" never fixes it. They were already careful.

    In this episode:
    - Why your slowest Delta table isn't slow because the data is big, and what it's actually choking on
    - The storage-bill surprise that's invisible in every query until the invoice lands
    - How the most generous thing you do for a blocked teammate quietly destroys whether anyone can trust the table
    - Why nobody can clean up a swamp where nobody knows what's load-bearing, and the cheapest fix in the whole estate
    - When you should ignore all of this advice, because over-governing a throwaway table is just a different swamp

    This episode is for Databricks data engineers staring at the one table everyone groans about, the one that actually matters, wondering how it got like this. Whether you run batch, streaming, or DLT, you'll walk away able to name exactly which kind of rot is filling your worst table and the specific senior counter-move that reverses it.

    ---
    LinkedIn: linkedin.com/in/jrlasak
    Newsletter: dataengineer.wiki
    #DataEngineering #Databricks #DataEngineer #CareerGrowth #ApacheSpark #DeltaLake

    12 min
  6. Liquid Clustering vs Z-Ordering: 4 questions that decide

    May 18

    You open your Databricks workspace. Two Delta tables. Same size, same downstream BI workload. Table A was partitioned and z-ordered in 2023, runs fine. Table B is greenfield this quarter, liquid clustering by default. Your tech lead asks how aggressive you want to be with migration tickets. Whatever you type back is probably wrong. This is not a feature swap. It's a paradigm shift, and the migration math only makes sense once you can name what actually moved underneath you. Migrate-everything is wrong. Migrate-nothing is wrong. The right answer is per-table, with named criteria.

    In this episode:
    - What actually changed when liquid clustering shipped, and the one phrase that simplifies every migration debate you'll have for the next two years
    - The four-question filter to run table by table, in order, before you commit to a layout decision
    - The surviving cases where the old paradigm still wins, including the one the evangelism crowd never names
    - Why liquid clustering and partitioning on a Delta table are mutually exclusive, and the operational property you give up if you migrate the wrong tables
    - The named audit that turns six hundred legacy tables into three buckets in an afternoon
    - What kind of senior engineer your tech lead remembers when the promotion conversation happens

    This episode is for Databricks data engineers staring at a migration backlog, defending a greenfield default, or trying to explain to a platform team why some tables shouldn't be touched. Whether you're a mid-level engineer running your first migration, or a senior engineer setting the standard for the next two years of greenfield Delta tables, you'll walk away with a defended per-table answer and the vocabulary to back it up.

    ---
    LinkedIn: linkedin.com/in/jrlasak
    Newsletter: dataengineer.wiki
    #DataEngineering #Databricks #DataEngineer #CareerGrowth #ApacheSpark #DeltaLake

    18 min
  7. The compounding curve: why some Databricks engineers' salaries grow 5x faster than others

    May 11

    Year one. Two new juniors join the same Databricks platform org. Same starting salary, same skills, same desk. Year three, five thousand bucks apart. Year eight, household-car-and-a-half apart. Every year. Forever. Both worked hard. Both stayed technical. Both got positive reviews. Neither did anything wrong. So what happened? Salary in this field isn't one curve. It's two that look identical for the first three years, then peel apart. The choice between them gets made on a handful of small Tuesdays most engineers don't even remember.

    In this episode:
    - Why skill is the floor and leverage is the ceiling, and why the better technician is often the worse-paid engineer
    - The four small Tuesday choices that decide which curve a Databricks data engineer walks up
    - The difference between expanding what you ship and expanding what you own, and why your manager only fights for one of them
    - How a junior with twelve hours of writing across four years out-leveraged engineers with twice her tenure
    - The compass question to run on every career fork before the curve runs you

    This episode is for Databricks data engineers who suspect their salary trajectory isn't matching their effort, and who want to know what the highest-paid engineers on their team are doing differently. Whether you're a mid-level wondering why peers at the same level make fifty grand more, or a senior trying to understand why your raises keep shrinking, you'll walk away with a four-part audit you can run on your last six months and your next decision.

    ---
    LinkedIn: linkedin.com/in/jakublasak
    Newsletter: dataengineer.wiki
    #DataEngineering #Databricks #DataEngineer #CareerGrowth #ApacheSpark #DeltaLake

    23 min
  8. The 90/9/1 rule of Databricks performance work - how to triage Spark optimization in 60 seconds

    May 4

    Your team is three weeks into a Databricks performance push. Broadcast hints in PRs. AQE flags toggled like Christmas lights. Partition counts re-tuned for the third time. The manager is asking, gently, when the gains are showing up in the bill. The staff DE on the next team finished theirs in two afternoons. Same workloads, bigger drop. They were running a triage you have never been taught.

    In this episode:
    - Why most of what your team calls Spark optimization is cosmetic and will never move the bill, no matter how clean the PR
    - The two named tests senior Databricks engineers run on every workload before they touch a config
    - Why the same change (caching, salted joins, skew handling) can be cosmetic on one workload and structural on the one next to it
    - Where the real leverage in a Spark workload actually lives, and why it is almost always visible from outside the code

    This episode is for Databricks data engineers stuck in a performance push that is not converting effort into runtime or bill drops. Whether you are mid-level drowning in config tweaks, or senior watching the bill refuse to move, you will walk away with a one-minute triage you can run on any Spark workload tomorrow morning.

    ---
    LinkedIn: linkedin.com/in/jakublasak
    Newsletter: dataengineer.wiki
    #DataEngineering #Databricks #DataEngineer #CareerGrowth #ApacheSpark #DeltaLake

    17 min
