Data Warehousing Archives - Page 5 of 5 - Confessions of a Data Guy

Data, Data Engineering, Data Warehousing, SQL

Apparently Apache Hive is still a thing…. I should probably learn it.

So what’s up with Apache Hive? It’s been around a long time… but all of a sudden it seems like it’s a requirement in every other job posting these days. “It’s not you… it’s me.” That’s what I would tell Hive if it suddenly materialized as Mr. Smith via the Matrix that I’m pretty sure is the new reality these days. I’ve been around Hadoop and Spark for a while now and I feel like Hive is that weird 2nd cousin who shows up at Thanksgiving. You know you should like him and be nice to him, but you’re not sure why. It seems like Hive sits in a strange world. It’s not an RDBMS, although it does ACID, but it’s touted as a Data Warehousing tool. Time to dig in.

October 19, 2020

Data, Data Engineering, Data Warehousing, Python, SQL

PySpark SQLContext….tired of your decades old ETL process?

Seriously. Haven’t you had enough of SSIS, SAP Data Services, Informatica, blah blah blah? It’s been the same old ETL process for the last 20 years. CSV files appear somewhere, some poor, aging and angry Developer soul in a cubicle pulls up the same old GUI ETL tool and maps a bunch of columns to some SQL Server or, if you’re in a forward-thinking shop… maybe Postgres. This is after painstakingly designing the Data Warehouse with good ole’ Kimball in mind. Data flows from some staging table to some facts and dimensions. Eventually some SQL queries are run and a Data Mart is produced summarizing a year’s worth of data for a crabby Sales or Product department. Brings a tear to my eye. And this is all because Apache Spark sounds scary to some people?

September 25, 2020

Data, Data Engineering, Data Warehousing, SQL

SQL Database (RDBMS) Design for Data Engineers

Database design… hmmm. There is probably nothing more all over the board in tech. Data warehousing, analytics, OLTP… everyone with their own “defend this hill to the death” ideas. Kimball vs Inmon. Hmmm… what to do, what to do? After defending my own hills to the death over the years and arguing over whiteboards, I’ve come to a conclusion. The right answer is somewhere in the middle. Understanding a few basic design principles will help any data engineer master writing DDL for anything from a Data Warehouse to a high-load OLTP system… across all RDBMS platforms.

July 20, 2020

Data, Data Warehousing, SQL

Columnstore Indexes – Always Faster Uh?

Columnstore indexes promise to be the savior of every data warehouse. So, what are they, when should you use them, and when should you stay away? Columnstore indexes are just what they sound like: data physically stored in a columnar way. This is what makes them so fast when it comes to aggregating large amounts of data. The data is compressed and similar values are stored together, so the database engine can very quickly grab all the values it needs to SUM, for example. This all leads to faster query results.
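
To make the row vs. column idea a little more concrete, here is a tiny pure-Python sketch. It is only an illustration of the layout idea, not how SQL Server actually implements columnstore indexes. The sales rows, column names, and helper functions are all made up for the example, and run-length encoding is just one simple stand-in for the kind of compression that works well when similar values sit next to each other.

```python
# Hypothetical sales rows; names and values are illustrative only.
rows = [
    {"order_id": 1, "region": "East", "amount": 100},
    {"order_id": 2, "region": "East", "amount": 250},
    {"order_id": 3, "region": "West", "amount": 75},
    {"order_id": 4, "region": "West", "amount": 125},
]


def sum_row_store(rows, column):
    """Row layout: every whole row is visited just to read one field."""
    return sum(row[column] for row in rows)


def to_column_store(rows):
    """Pivot the rows into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_column_store(columns, column):
    """Column layout: only the one column being aggregated is read."""
    return sum(columns[column])


def run_length_encode(values):
    """Collapse runs of repeated adjacent values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = to_column_store(rows)
total_from_rows = sum_row_store(rows, "amount")
total_from_columns = sum_column_store(columns, "amount")
region_compressed = run_length_encode(columns["region"])

print(total_from_rows, total_from_columns)
print(region_compressed)
```

Both sums come out the same, of course. The difference is how much data has to be touched to get there: the column layout reads one list, while the row layout walks every field of every row.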

March 10, 2019

Data, Data Engineering, Data Warehousing, Python

Python and Apache Parquet. Yes Please.

Update: Check out my new Parquet post.
Recently, while delving into and burying myself alive in AWS Glue and PySpark, I ran across a new-to-me file format: Apache Parquet.

It promised to be the unicorn of data formats. I’ve not been disappointed yet.

September 29, 2018

Data Warehousing

Where Good Data Warehousing Goes Wrong.

It wouldn’t be the first time.

The story is usually the same: lots of people, contractors, software installation, months of ETL work, months of database work, testing, testing and more testing. And then it arrives, a beautiful spiffy Enterprise Data Warehouse with all its facts and dimensions in all their Kimball glory.

April 18, 2018

Data, Data Warehousing, SQL

T-SQL Basics : Running Totals

Some of the most underused yet powerful functions in T-SQL are Window functions. These functions are powerful because they allow calculations on a Window of the data you specify, even while the calculation scrolls through your data row by row.
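
Here is a rough sketch of a running total using a window function. To keep it runnable without a SQL Server instance, it uses Python's built-in `sqlite3` module instead of T-SQL, and assumes the bundled SQLite is version 3.25 or newer (when window functions were added). The `sales` table and its values are made up for the example. The `SUM(...) OVER (ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` clause itself should look the same in T-SQL.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales (sale_date, amount) VALUES (?, ?)",
    [("2018-03-01", 10), ("2018-03-02", 20), ("2018-03-03", 5)],
)

# The window is every row from the start of the ordered set
# up to and including the current row.
query = """
SELECT sale_date,
       amount,
       SUM(amount) OVER (
           ORDER BY sale_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM sales
ORDER BY sale_date
"""

running_totals = conn.execute(query).fetchall()
conn.close()

for sale_date, amount, running_total in running_totals:
    print(sale_date, amount, running_total)
```

Each output row carries its own amount plus the sum of everything before it, with no self-join or cursor needed.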

March 13, 2018

Data, Data Warehousing

It’s Called a Non-Lookup Dude

Seriously… it’s called a non-lookup, dude. Probably one of the most annoying situations I’ve come across when working on Enterprise Data Warehouse (EDW) teams and projects is the non-lookup problem.

February 22, 2018
