
Introduction to dbt … for Data Engineers.

Photo by Josh Rakower on Unsplash

So, you’ve heard about dbt, have you? I honestly can’t decide if it’s here to stay or not. It probably is; enough folks are using it and preaching about it. I personally have always been a little skeptical of dbt, not because it can’t do what it says it can do (it can), but because I’m old and bitter from my many years of Data Engineering, and I always see the problems in things.

But, I will let you judge that for yourself. Today I want to give a brief overview of dbt, kick the tires, muse about its features, and most importantly, look at dbt from a Data Engineering perspective, ferret out the good, the bad, and the ugly. I will try my best to be nice but don’t count on it. Code is on GitHub.

What is dbt anyways?

I’m just going to pretend you’ve been living under a rock, or a desk, put there by your angry boss, unable to see the light of day, and therefore have no idea what dbt is and what it’s used for. dbt bills itself as the “transformation tool“ and “the analytics engineering workflow“, which I find a little strange. Just call it what it is … another ETL tool. I know, I know, it’s more than that: testing, documentation, etc. I get it. But it is mostly a data transformation tool.

What isn’t dbt? Well, it’s not an extract or load tool, it doesn’t do that, just transformation. This does bring up the old grump in me, and I have to complain about it.

As Data Engineers building complex pipelines and data architecture, we should always keep an eye on complexity when adding tools to the stack … they must give back more than they take. Just piling tools on top of tools inevitably creates issues. It should give you pause that you will have to use a different tool to do your data extraction and loading … that same tool most likely already has transformation capabilities! Fine, fine, I know you want all the other benefits of dbt, but I’m just saying. Even if dbt integrates with that tool, it’s just another abstraction and layer on top of what is probably already a complex system.

My take on dbt.

What is dbt to me? It’s heaven for heavily SQL-based teams that rely on SQL for all their transformations, especially those working on Data Warehouse and Data Lake teams. If you’re on Postgres, MySQL, SparkSQL, Snowflake, Redshift, or the like … one of those SQL-focused teams struggling with ETL, testing, complex SQL, and warehousing … then of course dbt will be the light shining in the darkness for you.

Photo by Josh Boot on Unsplash

Diving into dbt.

So how do we even start learning dbt? Let’s start by listing some of the main ideas and concepts of dbt, and casually mention how we might implement those features in real life. Once we’re done with that, we’ll try a simple dbt project on Postgres. So what are the concepts to be aware of with dbt?

  • “A dbt project is a directory of .sql and .yml files.” – dbt docs
  • “A model is a single .sql file.” – dbt docs
  • “Each model contains a single select statement.” – dbt docs
  • A project file (dbt_project.yml) configures and defines your dbt project.
  • “Tests are assertions you make about your models and other resources in your dbt project.” – dbt docs
  • “Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables.” – dbt docs
  • “Seeds are CSV files in your dbt project.” – dbt docs
  • “Models have materializations. (build strategies for your select query)” – dbt docs
    • view
    • table
    • ephemeral
    • incremental
  • dbt CLI makes it easy to do dbt things at the command line.
  • dbt does documentation well.

I think we pretty much have a good idea of what to expect from dbt with that list.
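Since the little project below won’t get into tests, here’s a rough sketch of what they look like, so that bullet isn’t pure hand-waving. Tests are typically declared in a YAML file that sits next to your models; the model and column names here are purely hypothetical:

version: 2

models:
  - name: fct_trips           # hypothetical model name
    columns:
      - name: ride_id
        tests:
          - unique
          - not_null
      - name: started_at
        tests:
          - not_null

Running dbt test then executes each of those assertions as a query against the materialized model and reports any failures.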

Example dbt project with Postgres.

Photo by Simon Hurry on Unsplash

What better way to learn dbt than to do a little project? And what easier technology to use than Postgres? First things first, let’s set up Docker and docker-compose with Postgres and dbt, plus some prepopulated schema and data we can use for a sample dbt project.

Ok, we’ve got some grunt work to do, so let’s get to it. First, let’s build our Dockerfile, which won’t be much.

FROM python:3

WORKDIR /app
COPY . /app

RUN pip3 install dbt-postgres==1.2.2

Well, there’s that. Now, let’s get our docker-compose going, which will be a little more involved, as we require Postgres for our example.

version: "3.9"
services:
  postgres:
    image: postgres:10.5
    restart: always
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - '5432:5432'
    volumes:
      - ./postgres-data:/var/lib/postgresql/data
      - ./DDL/create_tables.sql:/docker-entrypoint-initdb.d/create_tables.sql
    healthcheck:
      test: [ "CMD-SHELL", "pg_isready -U postgres" ]
      interval: 5s
      timeout: 5s
      retries: 5
  run:
    environment:
      - postgres_host=postgres
      - postgres_database=postgres
      - postgres_user=postgres
      - postgres_password=postgres
    image: "dbt-example"
    depends_on:
      postgres:
        condition: service_healthy
    volumes:
      - .:/app
    command: dbt --profiles-dir ./.dbt  debug

A few things in the docker-compose file are worth noting for our example dbt project. First, note under volumes that we are mounting a DDL script for Postgres to run when it spins up.

volumes:
      - ./postgres-data:/var/lib/postgresql/data
      - ./DDL/create_tables.sql:/docker-entrypoint-initdb.d/create_tables.sql

Here is that DDL, which includes a few rows of real data for our Postgres table. We are using the open-source Divvy bike trip data. It creates a raw trips table and has an INSERT statement to load data into that raw table when Postgres starts up. This will allow us to write a simple dbt model that moves data from its raw state into a fact model.

I mean, this is the whole point after all.

CREATE TABLE IF NOT EXISTS trips
    (
        ride_id TEXT PRIMARY KEY,
        rideable_type TEXT,
        started_at timestamp without time zone,
        ended_at timestamp without time zone,
        start_station_name TEXT NULL,
        start_station_id INTEGER NULL,
        end_station_name TEXT NULL,
        end_station_id INTEGER NULL,
        start_lat NUMERIC(4,2),
        start_lng NUMERIC(4,2),
        end_lat NUMERIC(4,2),
        end_lng NUMERIC(4,2),
        member_casual TEXT NOT NULL
);

INSERT INTO trips VALUES
('85522531742CB311','electric_bike','2022-08-26 23:21:28','2022-08-26 23:48:03',NULL,NULL,NULL,NULL,41.95,-87.66,42.03,-87.67,'casual'),
('D789EFF6D43AB2A1','electric_bike','2022-08-26 19:25:46','2022-08-26 19:37:02',NULL,NULL,'Western Ave & Roscoe St','15634',41.93,-87.69,41.943034,-87.687288,'casual'),
('D21A8992F5A8CDA3','electric_bike','2022-08-26 19:14:07','2022-08-26 19:19:58',NULL,NULL,NULL,NULL,41.93,-87.7,41.92,-87.7,'casual');

Let’s get to the dbt part.

Now that we have a framework set up that will allow us to do an example dbt project, let’s get to the dbt work. Apparently, the first thing we need to do is add a profile that can connect to our Postgres warehouse. You should really use environment variables to hold your credentials; in our profiles.yml we will just be using the default Postgres creds.

trips:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      password: postgres
      port: 5432
      dbname: postgres
      schema: postgres
      threads: 2
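As an aside, if you do want to keep credentials out of the file, dbt can read them from environment variables via its env_var() function; a minimal sketch, with DBT_* variable names I just made up (you would pass them in through the environment section of docker-compose):

trips:
  target: dev
  outputs:
    dev:
      type: postgres
      host: "{{ env_var('DBT_HOST') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: postgres
      schema: postgres
      threads: 2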

Let’s also add a super basic dbt_project.yml file. Note, I added the model I’m going to create later with a materialization of table. Aka, I have raw data I want to transform into a fact table.

name: trips

config-version: 2
version: 1.0

profile: trips

models:
  trips:
    raw_trips:
      # materialize the raw_trips model as a table
      +materialized: table
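Worth knowing: you don’t have to set materializations in dbt_project.yml. dbt also lets you configure it at the top of the model’s .sql file with a config() block, and that model-level config wins over the project-level setting. Something like:

{{ config(materialized='table') }}

SELECT * FROM trips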

Let’s go ahead and build our Docker image so we can test the connection to our Postgres warehouse.

docker build . --tag=dbt-example

We will put the following command in the docker-compose file as the command we want to run.

dbt --profiles-dir ./.dbt  debug

Now we can run our connection test. Per the output below, it looks like all is well; our Docker container with dbt can talk to our Postgres warehouse.

>> docker-compose up run
Attaching to dbtsampleproject-run-1
dbtsampleproject-run-1  | 23:40:07  Running with dbt=1.2.2
dbtsampleproject-run-1  | dbt version: 1.2.2
dbtsampleproject-run-1  | python version: 3.10.7
dbtsampleproject-run-1  | python path: /usr/local/bin/python
dbtsampleproject-run-1  | os info: Linux-5.10.104-linuxkit-aarch64-with-glibc2.31
dbtsampleproject-run-1  | Using profiles.yml file at /app/.dbt/profiles.yml
dbtsampleproject-run-1  | Using dbt_project.yml file at /app/dbt_project.yml
dbtsampleproject-run-1  | 
dbtsampleproject-run-1  | Configuration:
dbtsampleproject-run-1  |   profiles.yml file [OK found and valid]
dbtsampleproject-run-1  |   dbt_project.yml file [OK found and valid]
dbtsampleproject-run-1  | 
dbtsampleproject-run-1  | Required dependencies:
dbtsampleproject-run-1  |  - git [OK found]
dbtsampleproject-run-1  | 
dbtsampleproject-run-1  | Connection:
dbtsampleproject-run-1  |   host: postgres
dbtsampleproject-run-1  |   port: 5432
dbtsampleproject-run-1  |   user: postgres
dbtsampleproject-run-1  |   database: postgres
dbtsampleproject-run-1  |   schema: postgres
dbtsampleproject-run-1  |   search_path: None
dbtsampleproject-run-1  |   keepalives_idle: 0
dbtsampleproject-run-1  |   sslmode: None
dbtsampleproject-run-1  |   Connection test: [OK connection ok]
dbtsampleproject-run-1  | 
dbtsampleproject-run-1  | All checks passed!

Writing our first dbt model.

Photo by Tim De Pauw on Unsplash

Ok, after all that hard work I think a nap is in order. But alas, in the words of Gandalf, “The world is not in your books and maps, it’s out there.” So we must carry on. The model is apparently where all the magic happens in dbt. You can read more about models here, but they are essentially just SELECT statements that make up the different transforms we want to run.

The SQL can be templated, the model can have different materializations as we mentioned before, etc. But, let’s keep it simple for our example. We just want a model that will take our raw table data and do some slight transformations and then produce a fact model that could be materialized as a table.

SELECT
        trips.ride_id,
        trips.rideable_type,
        trips.started_at,
        trips.ended_at,
        EXTRACT(EPOCH FROM (trips.ended_at - trips.started_at)) as trip_duration, -- trip duration in seconds
        CASE WHEN trips.start_station_name IS NULL
             THEN 'Unknown' ELSE trips.start_station_name
             END as start_station_name,
        trips.start_station_id,
        CASE WHEN trips.end_station_name IS NULL
             THEN 'Unknown' ELSE trips.end_station_name
             END as end_station_name,
        trips.end_station_id,
        trips.start_lat,
        trips.start_lng,
        trips.end_lat,
        trips.end_lng,
        CASE WHEN trips.member_casual = 'member' THEN 1 ELSE 0 END as member_casual
FROM trips

That is our very boring dbt model, a single .sql file saved under the models directory (named raw_trips.sql so it matches the configuration above). It doesn’t do much of anything besides a duration calculation and some CASE statements. But, since in our project configuration we set the materialization of this model to table, we should now have a raw_trips table that exists with our transformed data.
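One thing I glossed over: the model just selects FROM trips directly. In a more typical dbt project you would declare the raw table as a source and reference it through the source() macro, so dbt can track lineage from raw table to model. Roughly, with a source name I invented:

version: 2

sources:
  - name: divvy
    schema: public
    tables:
      - name: trips

And the model’s FROM clause would become:

SELECT *
FROM {{ source('divvy', 'trips') }}

Downstream models would then reference this one with {{ ref('raw_trips') }}, which is how dbt builds its dependency graph.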

By the way, I changed the docker-compose command to be the following line to run the actual model.

command: dbt --profiles-dir ./.dbt run

It all looks like it ran ok.

(venvrm) (base) danielbeach@Daniels-MacBook-Pro dbtSampleProject % docker-compose up run 
[+] Running 2/0
 ⠿ Container dbtsampleproject-postgres-warehouse-1  Running                                                                                                            0.0s
 ⠿ Container dbtsampleproject-run-1                 Created                                                                                                            0.0s
Attaching to dbtsampleproject-run-1
dbtsampleproject-run-1  | 13:39:59  Running with dbt=1.2.2
dbtsampleproject-run-1  | 13:39:59  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 256 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
dbtsampleproject-run-1  | 13:39:59  
dbtsampleproject-run-1  | 13:39:59  Concurrency: 2 threads (target='dev')
dbtsampleproject-run-1  | 13:39:59  
dbtsampleproject-run-1  | 13:40:00  1 of 1 START table model postgres.raw_trips .................................... [RUN]
dbtsampleproject-run-1  | 13:40:00  1 of 1 OK created table model postgres.raw_trips ............................... [SELECT 3 in 0.27s]
dbtsampleproject-run-1  | 13:40:00  
dbtsampleproject-run-1  | 13:40:00  Finished running 1 table model in 0 hours 0 minutes and 0.59 seconds (0.59s).
dbtsampleproject-run-1  | 13:40:00  
dbtsampleproject-run-1  | 13:40:00  Completed successfully
dbtsampleproject-run-1  | 13:40:00  
dbtsampleproject-run-1  | 13:40:00  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
dbtsampleproject-run-1 exited with code 0

Let’s connect to our Postgres instance and see if that table exists with our transformed data.

psql -h 0.0.0.0 -U postgres
psql (14.1, server 10.5 (Debian 10.5-2.pgdg90+1))
Type "help" for help.

postgres=# SELECT * FROM raw_trips;
     ride_id      | rideable_type |     started_at      |      ended_at       | trip_duration | start_station_name | start_station_id |    end_station_name     | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual 
------------------+---------------+---------------------+---------------------+---------------+--------------------+------------------+-------------------------+----------------+-----------+-----------+---------+---------+---------------
 85522531742CB311 | electric_bike | 2022-08-26 23:21:28 | 2022-08-26 23:48:03 |          1595 | Unknown            |                  | Unknown                 |                |     41.95 |    -87.66 |   42.03 |  -87.67 |             0
 D789EFF6D43AB2A1 | electric_bike | 2022-08-26 19:25:46 | 2022-08-26 19:37:02 |           676 | Unknown            |                  | Western Ave & Roscoe St |          15634 |     41.93 |    -87.69 |   41.94 |  -87.69 |             0
 D21A8992F5A8CDA3 | electric_bike | 2022-08-26 19:14:07 | 2022-08-26 19:19:58 |           351 | Unknown            |                  | Unknown                 |                |     41.93 |    -87.70 |   41.92 |  -87.70 |             0
(3 rows)

(END)

Well, look at that, it worked. We can see the Unknown values in start_station_name where the raw data was NULL, the trip_duration in seconds, and of course the model was materialized as a table, as we configured.
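If we had also defined tests like the ones sketched earlier, or wanted the documentation site dbt can generate, those are just more commands run the same way inside the container, roughly:

dbt --profiles-dir ./.dbt test
dbt --profiles-dir ./.dbt docs generate
dbt --profiles-dir ./.dbt docs serve

(docs serve would also need a port published in docker-compose to be reachable from the host.)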

Musings on dbt for Data Engineers.

Short and to the point, that one. I think that’s enough for most folks to get at least a basic understanding of dbt and generally what it does and provides. Of course, it only scratches the surface and doesn’t explore things like tests and documentation in any depth. That being said, I do have a few comments and thoughts about dbt as a tool for Data Engineers whose pipelines include lots of SQL.

  • I would not switch my pipeline to all SQL just to use dbt.
  • If my pipelines included at least 30% or more of SQL, I would use dbt.
  • The configuration and setup/use of dbt is extremely straightforward and easy to understand.
  • The concepts around dbt projects are very approachable.
  • The testing and docs ability of dbt are the icing on the cake.
  • dbt and its models don’t solve the nasty spaghetti SQL code that is the doom of many SQL-based data teams.

There is much to love about dbt, and much to hate. I love what dbt does for those teams that are mostly SQL-based; it gives them a framework, testability, and generally a better approach to managing complex SQL pipelines. It’s better than stored procedures and ad-hoc SQL strung together in a never-ending trail of misery. But I can see that it would not necessarily stop data teams from going down that path. You would have to be rigorous and die-hard in your approach to the models and their complexity.

I can also see where the cost of dbt, the way it wants you to develop, could easily spiral out of control. If you’re writing small models all the way down the line in your pipelines, and someone is materializing those as tables … your storage and compute cost is probably going to skyrocket. Reading and writing all that data … ugh. But, do the costs outweigh the benefits? Probably for some yes, for some no. It is interesting that a lot of complaints about dbt costs have been surfacing lately; who knows if that is simply folks not implementing the technology correctly (usually the case), or if there is some truth to it all … is it simply a more expensive way to do things? Time will tell.
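One partial answer to that cost question is dbt’s incremental materialization, which only processes new rows on each run instead of rebuilding the whole table. A rough sketch of what that might look like for our trips model, with a filter column chosen purely for illustration:

{{ config(materialized='incremental', unique_key='ride_id') }}

SELECT *
FROM trips
{% if is_incremental() %}
-- on incremental runs, only pull rows newer than what is already in the target table
WHERE started_at > (SELECT MAX(started_at) FROM {{ this }})
{% endif %}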