6 Tips for Optimizing Databricks Cluster and Pipeline Performance and Cost

Databricks, easily the hottest tool these days for Data Lakes and Data Warehousing, is a beast. As with any new technology there are growing pains, learnings, and tips and tricks that might not be obvious to those dipping their toes in the water. Not understanding certain concepts, and being unaware of specific configurations, can easily cost you time and money when running large ETL pipelines on Databricks.

I want to share 6 tips for Databricks newbies, and oldies, that are foundational to good Data Engineering architecture, affecting both performance and cost.

6 Tips for Data Engineers when working on Databricks.

Most Data Engineers understand that the devil is in the details, especially when it comes to technology. Databricks is no different. It’s become the de facto tool for Data Warehousing and Data Lakes, and it offers many great features: Delta Lake, Notebooks, APIs. But what comes along with all those awesome features? Configurations, complexity, hidden costs … the list goes on. That’s not surprising though, it’s the way of life for Data Engineers.

Sometimes the hardest part of the job is learning enough so you even know what to look for.

I want to give you a head start, a gentle nudge in the right direction.

Some of the tips are about understanding important concepts, some about configuration, others about what not to do.

The Tips

  • Understand Job clusters vs All-Purpose.
  • Understand Notebooks, when to use them, when not to.
  • Understand APIs, especially for Job Clusters.
  • Understand configuration of On-Demand vs Spot. (on AWS)
  • OPTIMIZE and VACUUM your Delta Lake tables.
  • Partition, partition, partition.

Let’s dive into each one.

Job vs All-Purpose Clusters.

Boy-o-boy, this one is going to save you money $$$$. A Spark cluster is a Spark cluster, isn’t it? Yes and no in the Databricks world.

What do you need to know about these two different clusters?

  • Job Clusters – exist for the duration of the Spark “job” you are running, then they go bye-bye all by themselves. Oh, and they are cheaper.
  • All-Purpose clusters can be created from the UI (and API), and stick around till you kill them. Oh, and they are more expensive.

Ehhh-emm….

  • Job Cluster – $0.10/ DBU
  • All-Purpose – $0.40/ DBU

Now that adds up my friend.
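
To make that concrete, here is some back-of-the-napkin math. The DBU burn rate and hours below are made up purely for illustration, your real numbers depend on instance types and workloads.

# Hypothetical: a cluster burning ~100 DBUs per hour, 8 hours a day, 30 days a month.
dbus_per_month = 100 * 8 * 30               # 24,000 DBUs

job_cluster_cost = dbus_per_month * 0.10    # $2,400
all_purpose_cost = dbus_per_month * 0.40    # $9,600

print(f"Job cluster: ${job_cluster_cost:,.0f} vs All-Purpose: ${all_purpose_cost:,.0f}")

Same work, same month, a few thousand dollars of difference.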

All-Purpose clusters are great for interactive work: using a Notebook, doing dev work or exploratory analysis. Don’t use them for continual production workloads, and be careful how much you use them for development, because they are not cheap. They are an awesome feature, one of the reasons why Databricks is so great. These easy, ready-to-use clusters that can be turned off and on, attached and detached from Notebooks, are amazing.

But they are expensive and should not be used for Production pipelines.

Job clusters are what you should mostly use for your daily workloads. They are way cheaper, and they go away when the pipeline job you are running on the cluster finishes. Don’t worry though, your logs and all that stuff are still available.

Beware the Notebook.

I won’t park here long, because this has been said before … but it needs to be said again. STOP USING NOTEBOOKS FOR PRODUCTION.

I’m not being unclear am I?

Besides being way more expensive, per the above, Notebooks enforce and re-enforce bad habits. I know they are easy to use and wonderful for quick dev and exploratory work … but I really doubt you unit test your Notebooks, for starters.

Job Clusters force you to codify and hone your configurations and code, because there is no UI, so by default code built for Job clusters will always be more Productionized than any Notebook crap.

Just don’t do it.

The APIs … learn them.

The key to true success with Databricks is the use of their many and wonderful APIs. This isn’t really specific advice for Databricks alone, it’s just that the APIs they provide are very nice, well documented and easy to use.

As with any Data Engineering project and pipeline, success often lies in the ability to codify jobs, test them, re-use them etc.

If you want bullet proof Databricks architecture then you need to interact with the platform via the APIs and code.

The two main APIs you should really learn are ….

  • Cluster API
  • Job API

The Cluster API for example is pretty straightforward …. but the options and configurations are endless. These REST APIs take JSON payloads, of course.

my_cluster = {
  "cluster_name": "etl-job-cluster",
  "spark_version": "9.0.x-scala2.12",
  "node_type_id": "c4.8xlarge",
  "aws_attributes": {
    "zone_id": "us-east-1a"
  },
  "num_workers": 3,
  "init_scripts": [
    {
      "s3": {
        "destination": "s3://some-bucket/databricks-scripts/init.sh",
        "region": "us-east-1"
      }
    }
  ]
}
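
For example, here is roughly what creating a cluster from that config looks like against the Clusters API. The workspace URL and token below are placeholders, swap in your own workspace host and a personal access token.

import requests

# Placeholder host and token -- use your own workspace URL and personal access token.
DATABRICKS_HOST = "https://my-workspace.cloud.databricks.com"
TOKEN = "dapiXXXXXXXXXXXX"

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=my_cluster,
)
response.raise_for_status()
print(response.json())  # returns the cluster_id of the new cluster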

The Jobs API is very similar, with enough configuration options to drive you crazy.

my_job_run = {
  "new_cluster": my_cluster,
  "spark_submit_task": {
    "parameters": [
      "--py-files",
      "s3://some-bucket/libs/libs.zip",
      "s3://some-bucket/databricks-scripts/etl_script.py"
    ]
  }
}

Oh, and the easiest way to schedule and orchestrate all your Databricks API calls …. Airflow of course, check out their Operators.
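
For example, with the apache-airflow-providers-databricks package installed and a Databricks connection configured in Airflow, kicking off a one-off run on a Job cluster looks something like this. The DAG and task names are made up, and the payload is the Jobs API dict from above.

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="daily_databricks_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_etl = DatabricksSubmitRunOperator(
        task_id="run_etl",
        databricks_conn_id="databricks_default",  # Databricks connection configured in Airflow
        json=my_job_run,  # the new_cluster + spark_submit_task payload from above
    )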

On-Demand vs Spot instances. (AWS)

This is a little stinker. It should really have a more prominent spot in the Databricks documentation, but for some reason it doesn’t, it’s kinda buried and not pointed out.

This is all about the $$$ dollars my friend.

If you know anything about cloud compute, on AWS or any other platform, you know there are big price differences between On-Demand instances (that won’t disappear) and Spot instances (that could go poof). This can be a tricky topic when you’re talking about Spark clusters, but here is what you need to know.

  • You should always have at least one On-Demand instance to host the Driver node for Spark. (You don’t want your driver to disappear.)
  • If you have too many Spot instances and they start to randomly disappear during the job run, those tasks will have to be re-run by Spark, causing longer job runtimes.
  • If you have large clusters for heavy workloads, you probably don’t want ALL On-Demand instances, you need to mix it up to reduce cost.

Your cluster configuration might look like something as follows …

"aws_attributes": {
    "zone_id": "us-east-1",
    "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
  },

This is a fine line to walk, but you need to walk it. Your wallet will thank you later.

OPTIMIZE AND VACUUM

If you’re using Databricks, most likely you are using Delta Lake. Delta Lake is a file-based storage system that gives you ACID transactions and familiar SQL semantics for your Data Lake with Spark.

You have to learn and understand the basics of Spark and Delta Lake to be successful on Databricks. They have tried to make everything simple and easy to use, and they did, but there are some things you have to learn and overcome.

Small files will kill any Spark job; if you have tons of small files they will create a large bottleneck for any pipeline.

OPTIMIZE delta_table;

This is something that has to be done DAILY on all Delta Lake tables that are being updated daily or more often. Too many small files and your ETL and transformations will turn into little turtles. OPTIMIZE smashes files together, and will drastically increase performance, saving you time and money.

Closely tied to this is VACUUM. It will remove old, unused files that are no longer referenced by the Delta Lake transaction log. You’re probably familiar with this concept from Postgres.

Add this to the list of things to run on your tables daily.
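
Here is a minimal sketch of what that daily maintenance might look like from a scheduled job. The table name is hypothetical, and `spark` is the SparkSession Databricks gives you out of the box.

# Hypothetical table name -- assumes this runs on a Databricks cluster with Delta Lake.
table = "events"

# Compact small files, optionally co-locating data on a column you filter on a lot.
spark.sql(f"OPTIMIZE {table} ZORDER BY (event_date)")

# Remove files no longer referenced by the Delta transaction log.
# 168 hours (7 days) is the default retention -- think twice before going lower.
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")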

Partition, partition, partition.

This really has nothing to do with Databricks itself, and everything to do with Spark and Delta Lake. Good practices around data modeling and file storage are key to success.

Don’t blame Databricks for poor performance when you aren’t taking the time to set up your data stores for success.

Data modeling lessons and tutorials are severely lacking when it comes to Data Lakes and tools like Delta Lake. Big Data is everywhere and the key to unlocking its potential is Partition strategy.

It should be a very rare case that you design a Delta Lake table on Databricks without one or more partitions.

If you don’t partition, it will affect speed and cost; ETL jobs, pipelines, and analytics will be slow, leaving you blaming Databricks.
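
Here is a minimal sketch of writing a partitioned Delta table with Spark. The paths and column names are made up, the point is to pick a low-cardinality column your queries constantly filter on (dates are the classic choice).

# Hypothetical paths and columns, purely for illustration.
df = spark.read.parquet("s3://some-bucket/raw/events/")

(
    df.write
      .format("delta")
      .mode("overwrite")
      .partitionBy("event_date")   # the column most of your queries filter on
      .save("s3://some-bucket/delta/events/")
)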

Musings

I know I covered a lot, but some of these concepts are key to understanding and taking full advantage of Databricks. So many people jump straight into new technologies, expecting all their problems to disappear. That will only happen when you understand the basics of what it takes to be successful on Databricks.

It comes down to cost and performance.

You must take the time to understand the features and functionality of Databricks, the different kinds of Clusters, what they should be used for, and how to configure them to balance cost and performance.

Forgetting maintenance like OPTIMIZE and VACUUM will have you pulling your hair out with slow ETL.

If you don’t learn to use the APIs you will never have a solid Production ready platform running on Databricks. Check out the Airflow Operators.