5 things I wish I knew about Databricks … before I started.

Photo by Saad Chaudhry on Unsplash

How many times in your life, that is but a mist, have you thought, “If I had only known that in the beginning?” I feel as if I’ve committed that cardinal sin as a developer and Data Engineer … falling in love with a tool to the exclusion of all else. I mean truly, Databricks has brought Big Data to the masses; all you need is your laptop and 10 minutes of PySpark training before you’re spending gobs of money, processing massive amounts of data. Where else, and with what else, can you do such things? Try it with EMR, and good luck to you.

That being said, when you love something you start to notice the slight imperfections and problems with that something. You get kinda nit-picky. Such is life. I want to save some poor soul out there some heartache, that moment when you’ve been writing code for hours or days and come upon a little surprise that makes your heart drop into your shoes and the blood run to your face. Here are 5 things I wish I knew about Databricks before I started. Maybe it will save you time, help you, who knows.

5 Databricks tips before you disappear into Neverland.

These tips might not apply to everyone; like most things in life, it depends on context, architecture, and the like. But hopefully I can save someone some pain and suffering, maybe offer some food for thought as you are considering Databricks as part of your “Modern Data Stack,” insert eye roll. Unfortunately, most technology decisions are just made, just like that; if someone actually does a POC, that’s probably considered unique. Most folks just flock to the lights like moths. I mean, I’m probably a moth too.

Enough pontificating.

1. Databricks is better with Delta Lake.

It’s a package deal, my friend. Ketchup with your french fries, butter on your toast, it’s that sort of thing, you know? Databricks is a package deal in my opinion, otherwise what’s the point? If you’re going to use it, use it. I have a feeling many folks think Databricks = Spark, and that used to be true. But times have changed, my friend; it’s so much more than that.

You should think of Databricks as a system of tools that all fit well together, and you will get the best results by using all of them. Of course, Databricks makes using Spark at scale a trivial affair, but combine it with Delta Lake as a storage layer and now you have a juggernaut. Delta Lake brings the Data Warehouse / Data Lake / Lakehouse to you on a platter, with a cherry on top. The ability to process hundreds of TBs, or PBs, of data, plus a structured Data Store to deposit all that data into, in a manner you are familiar with … now that’s something worth using.

  • Delta Lake extends the usefulness of Spark.
  • Delta Lake brings classic Data Warehouses to data operations at scale.
  • Data Engineering is about data storage, Delta Lake is a first-class data storage option.

Don’t even think about using Databricks without Delta Lake. Delta Lake makes the unimaginable possible: constraints, partitions, DDL, DML, the list goes on.
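To make that concrete, here is a minimal PySpark sketch of the sort of thing Delta Lake enables: a partitioned table plus warehouse-style DDL and DML. The schema, table, and column names are made up for illustration, and it assumes a Databricks cluster (or any SparkSession with Delta Lake enabled).

```python
# Minimal sketch: write a partitioned Delta table, then manage it with DDL/DML.
# Assumes a running SparkSession named "spark" with Delta Lake available.
# The "analytics.events" table and its columns are hypothetical examples.
from pyspark.sql import Row

spark.sql("CREATE SCHEMA IF NOT EXISTS analytics")

df = spark.createDataFrame([
    Row(event_id=1, event_date="2023-01-01", status="ok"),
    Row(event_id=2, event_date="2023-01-02", status="ok"),
])

# Write the DataFrame out as a partitioned, managed Delta table.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("event_date")
   .saveAsTable("analytics.events"))

# Classic warehouse-style DDL and DML work right on top of it.
spark.sql("ALTER TABLE analytics.events ADD CONSTRAINT valid_id CHECK (event_id IS NOT NULL)")
spark.sql("UPDATE analytics.events SET status = 'late' WHERE event_date < '2023-01-02'")
```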

2. There is more to life than Notebooks

Something that amazes me still to this day is the emphasis put on Notebooks when folks are talking about Databricks. Sure, I get it, using a Notebook for interactive development and data exploration is nice (and very nice for Databricks’ pocketbook, with the cluster running to support it). But my friend, there is more to life. Think about Production workloads; Notebooks should not be used for such things.

Databricks has wonderful APIs for things like Jobs, with which a person can stitch together and automate massive and complex Spark pipelines. There is also a full suite of Machine Learning tools, which can be of much use. Don’t forget about Workflows if you want to orchestrate those pipelines inside the tool itself. You can even integrate your git Repos with Databricks, for crying out loud.
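For example, here is a rough sketch of submitting a one-time run through the Jobs API with plain Python. The workspace URL, token, notebook path, and cluster settings are placeholders, not real values; adjust them to your workspace and cloud.

```python
# Rough sketch: submit a one-time run via the Databricks Jobs API
# (POST /api/2.1/jobs/runs/submit). Host, token, paths, and cluster
# settings below are hypothetical placeholders.
import requests

HOST = "https://my-workspace.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapiXXXXXXXX"                               # placeholder personal access token

payload = {
    "run_name": "nightly-etl",
    "tasks": [
        {
            "task_key": "transform",
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # returns a run_id you can poll with /api/2.1/jobs/runs/get
```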

I could go on and on, but this advice goes for any tool … learn what is offered. Once you’re using a tool, it’s usually a small lift to fold other features into your architecture.

Photo by Michael Olsen on Unsplash

3. Beware of Cost

As with any tool, cost can be the silent killer. People get excited about new tools, they start going to town, creating this and that resource, so many people using so many Notebooks, spinning up massive clusters, writing to Delta Lake here and there and everywhere. But wait! You have to pay for that. Don’t forget that big data running on Spark clusters can get resource-consuming real fast. Big data sets, big instances, these all cost you money.

Here are some tips to keep things under control.

  • Clusters should use a single ON-DEMAND driver node, and SPOT or SPOT-WITH-FALLBACK for all worker nodes (see the cluster spec sketch after this list).
  • Don’t use small instance sizes for clusters; Spark needs memory. Favor fewer large nodes over lots of small ones, it makes things run faster.
  • Set up a single shared cluster for development use by many people. Everyone doesn’t need their own.
  • Use Jobs over Notebooks for running workloads, they cost less. (Compare All-Purpose compute to Jobs compute, huge difference).
  • Delta Tables are easy to use, don’t get carried away, you pay for storage somewhere.
  • SQL Compute is expensive; unless you really need this feature, don’t use it. You can work around it.
  • Use a Standard account; Premium accounts are expensive and increase the cost of Compute. Trust me, you can get away with a Standard account.
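As a reference point, here is a hedged sketch of a cluster spec that follows the first few tips above: an on-demand driver, spot-with-fallback workers, and a handful of large nodes instead of many small ones. Field names follow the Databricks Clusters API; the node type and counts are placeholders for whatever fits your workload.

```python
# Hedged sketch of a cost-conscious cluster spec (Databricks Clusters API fields).
# Node type, worker count, and DBR version are placeholders.
cluster_spec = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.2xlarge",       # fewer, larger nodes
    "num_workers": 4,
    "autotermination_minutes": 30,      # don't pay for idle clusters
    "aws_attributes": {
        "first_on_demand": 1,           # keep the driver ON-DEMAND
        "availability": "SPOT_WITH_FALLBACK",
    },
}
```

The same block drops into the `new_cluster` field of a Jobs API call, which also keeps you on the cheaper Jobs compute rate instead of All-Purpose.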

Be judicious in your use and management of compute. Things can get out of control quickly, death by a thousand cuts or all at once. Be smart, plan ahead.

4. Pick the correct orchestration tool for Databricks.

This one may seem a little strange at first, but bear with me. Most of the time in Data Engineering our pipelines exist across multiple technologies, rarely can we simply “always use Databricks” or some other tool. We need good orchestration and dependency management tools for our pipelines. Hence the popularity of tools like Apache Airflow.

Sure, Databricks recently started offering Workflows, but trust me, it can’t compete with the classic and popular scheduling and orchestration tools out there today. Once you really start getting into Databricks you will need a tool to manage your large and complex pipelines.

Pick a tool that has awesome and wide-ranging Databricks integration. I would suggest Apache Airflow as an option (a minimal sketch follows the list below), but you pick whatever you want.

  • Databricks is an awesome tool, but an orchestration and scheduling tool is key for complex pipelines and future happiness.
  • Databricks has a great API, find tools that have done the work of integrating those APIs and making it easy to use.
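If Airflow is your pick, a minimal sketch looks something like this, using the official Databricks provider package (apache-airflow-providers-databricks). The connection id, job id, and schedule are hypothetical.

```python
# Rough sketch: trigger an existing Databricks job from Apache Airflow.
# The connection id, job id, and cron schedule below are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="databricks_nightly_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",  # run nightly at 02:00
    catchup=False,
) as dag:
    run_etl = DatabricksRunNowOperator(
        task_id="run_etl",
        databricks_conn_id="databricks_default",  # Airflow connection to your workspace
        job_id=12345,                             # id of an existing Databricks job
    )
```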
Photo by Cesar Carlevarino Aragon on Unsplash

5. Learn to use advanced features.

Another part of Databricks + Delta Lake that many folks never take advantage of is the advanced features, the ones required to use the tools to their fullest and that give the best performance and results. Like any tool(s) you can just jump in the deep end, but that usually doesn’t work out so well in the long run. You should be utterly familiar with things like table maintenance, data layout, and merge semantics before building solutions with Databricks.
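As an illustration, here is a hedged sketch of the kind of maintenance and upsert operations I mean; the table name, Z-ORDER column, and the `updates` source are hypothetical.

```python
# Hedged sketch of advanced Delta Lake operations worth knowing before production:
# compaction/data skipping, cleanup of stale files, and upserts instead of blind appends.
# "analytics.events" and the "updates" view are assumed to exist for illustration.
spark.sql("OPTIMIZE analytics.events ZORDER BY (event_id)")  # compact small files, co-locate data
spark.sql("VACUUM analytics.events RETAIN 168 HOURS")        # remove stale files (7 days)
spark.sql("""
    MERGE INTO analytics.events AS t
    USING updates AS u
      ON t.event_id = u.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")                                                          # idempotent upsert
```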

I could go on, but you get the idea, know what you’re using before you start.

Musings

It’s hard to know where to stop sometimes. Hopefully, I’ve given you a few things to think about that you would not have otherwise. Personally, it’s always the details that come back to get me, or that I discover after the fact and start thinking, “why didn’t I know about this before?” The problem with Databricks is that it’s such a massive tool and offers an amazing amount of power and features.

The problem with so many easy-to-use features is that you don’t know what you don’t know. Some things are better documented than others. Before jumping into Databricks for the first time for some new project, think about …

  • Delta Lake
  • Think past Notebooks
  • Beware of Cost
  • Pick the best orchestration tool to pair with Databricks, like a fine wine.
  • Dive into the advanced features, for your own good. This is where experts separate themselves.

Best of luck in your Databricks journey.