
Databricks vs AWS EMR – Theory and Real Life.

I saw a recent post on r/dataengineering, a question centered around why Databricks is so popular when tools like EMR have been floating around for so long. It got me thinking. It really isn’t all about the technical side and offerings, although that does play a large role. There are always proponents for every technology, old or new … like with our favorite band or sports team, we fight to the death for what we love and cherish. I want to talk theoretically and technically about Databricks and EMR, and why you should use Databricks. 🙂

Databricks vs EMR – The battle has been won.

Now some internet genius will probably argue that EMR and Databricks shouldn’t really be compared. Why?

Databricks = Spark + Delta Lake + Notebooks + ….

EMR = Spark, Hive, HBase, Flink, Hudi, and Presto …

Yeah …. well … when you read those lists you know what everyone is using … Spark of course. In its heyday that is what made EMR so huge … the fact that it was pre-packaged Spark clusters. I mean who seriously wants to set up and manage their own Spark clusters from scratch? If that were the case Spark would end up going the same way as Hadoop … dead and left in the bygone server rooms of some monstrous corporation.

So why has Databricks won the battle? Because it’s the hot new tool .. whether you like it or not. It’s gotten huge adoption, and if you start to peruse job postings in the Data Engineering space it’s becoming more and more common as a requirement.

But it does beg the question of why. Why did Databricks kick EMR like Ron kicked Baxter?

Why is Databricks better?

It’s funny … sometimes people re-invent the wheel, and sometimes they just make the wheel better and get rid of all the bad spots.

I think Databricks is better than EMR for two reasons.

  • EMR is still very painful to use … it’s too HARD to use.
  • Databricks makes Spark easy to use, and plops nice, handy features on top, like icing on the cake.

Again, it’s not like what Databricks did was rocket science.

  • Provide the smoothest ramp to using Spark at scale. Lower the learning curve.
  • Provide notebooks connected to Spark at scale in this Data Science and Analytics crazed world.
  • Add a bunch of nice to have features to make Data Engineering easier.

What did EMR get wrong?

I could go on and on here. EMR is a great tool with great power … but with great power comes great responsibility. Any seasoned developer knows that the devil is always in the details.

The little EMR devils that always get you ….

  • Long startup and shutdown times costing money and adding overhead to every job.
  • Ridiculous complexity when trying to debug and dig through EMR logs to find problems and troubleshoot.
  • Configuration and use of EMR clusters is way more complex than it should or needs to be.

Don’t believe the last one? Here is some light reading for you. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html and https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html
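To give a feel for it, here is a rough sketch of what a single throwaway Spark job on EMR can involve, written as a plain Python dict shaped like the request that boto3's `run_job_flow` takes. Nothing here calls AWS. The bucket name, script path, instance types, and release label are all made-up placeholders, and `build_emr_request` is a hypothetical helper, not anything from the EMR docs.

```python
def build_emr_request(job_name, script_uri, log_uri, worker_count=2):
    """Build a dict shaped like the keyword arguments of boto3's EMR run_job_flow.

    This only assembles the request; it never talks to AWS.
    """
    return {
        "Name": job_name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label, pick your own
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # placeholder instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "workers",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",  # placeholder instance type
                    "InstanceCount": worker_count,
                },
            ],
            # Shut the cluster down once the step finishes (transient cluster).
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "run-spark-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical bucket and script names, for illustration only.
request = build_emr_request(
    job_name="nightly-etl",
    script_uri="s3://my-example-bucket/jobs/etl.py",
    log_uri="s3://my-example-bucket/emr-logs/",
)
print(sorted(request))
```

And that is the short version, before networking, security groups, bootstrap actions, or Spark tuning enter the picture.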

I mean I get that when you provide a managed service like EMR to make running Spark clusters easy … it’s a great thing … but someone at some point should have stopped and asked … in the end does the experience of using this tool actually save time and money, or does it create EXTRA headache and overhead that will eventually drive people mad … and to other solutions?
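Some back-of-the-napkin math makes the startup complaint concrete. The numbers below are invented for illustration, not measurements of EMR or Databricks, and `overhead_share` is a hypothetical helper. It assumes you pay for the cluster while it starts up and shuts down, and computes what share of the cluster's total time goes to that overhead rather than to the job itself.

```python
def overhead_share(startup_min, job_min, shutdown_min=0.0):
    """Return the fraction of total cluster time spent on startup and shutdown."""
    total = startup_min + job_min + shutdown_min
    if total <= 0:
        raise ValueError("total cluster time must be positive")
    return (startup_min + shutdown_min) / total


# Made-up durations in minutes: the same startup cost on a short and a long job.
short_job = overhead_share(startup_min=10, job_min=5)
long_job = overhead_share(startup_min=10, job_min=120)

print(f"short job overhead: {short_job:.0%}")
print(f"long job overhead: {long_job:.0%}")
```

With these made-up numbers, two thirds of the short job's cluster time is pure overhead, while the long job barely notices it. The pain lands hardest on lots of small, frequent jobs.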

What Databricks got right.

It’s all rainbows and unicorns with Databricks. Well maybe not … but if you’ve been fighting EMR clusters, coming to Databricks will feel like it.

Databricks got the simple stuff right … very right.

  • Make using Spark at scale so easy anyone can do it.
  • Give less technical people access to the power of Spark at scale. Think Data Scientists and other Analysts via Notebooks.
  • Provide built-in Data Warehouse support (Delta Lake).
  • Provide tools for developers to make life easy. (APIs, Integrations, features.)

Musings on why Databricks will/is/whatever beating the pooh out of EMR.

In short, Databricks decided to solve the pain points of working with Big Data at scale .. with a popular tool … Spark. It isn’t always technical reasons that decide which tools win. I mean of course you have to execute well … which Databricks has … with a fully functioning myriad of features and tools. But in the end they just save developers around the world time, energy, blood, sweat, and tears.

EMR will still play a huge part in the Big Data world now and going forward. It’s just too big and embedded in a lot of organizations. It’s like Hadoop, sometimes it can just be hard to extract yourself from it. But it will happen. Just like on-premises relational databases gave way to RDS, and Hadoop died with cloud storage … EMR will eventually roll over in its death throes while Databricks skips happily away into the sunset.

2 replies
  1. rguillome
    rguillome says:

    Hi,

    Don’t really want to debate in your personal article, Reddit is better for that, but just to illustrate how subjective an article like this can be:
    – AWS EMR provides Notebooks as well
    – Hudi is an “equivalent” of Delta Lake

    I like both of them. Maybe on a “daily user basis” Databricks could be more user-friendly and easy-to-dig-in than EMR, but maybe from an automation point of view, EMR and the AWS ecosystem are better…
    Agree about the starting time of EMR, it’s really annoying.

    Regards,

    • Dave
      Dave says:

      I’ve found Databricks notebooks to be much smoother to use. I think comparing Hudi to Delta is a bit of a stretch. Hudi is trying to get to the same level, but Delta is still pretty far ahead in my opinion.
