Home - Confessions of a Data Guy

Big Data, Data, Data Engineering, Python

Databricks Access Control – The 3 Most Important Steps

It’s not often I yearn for the good old days of SQL Server, but I’ve had a few of those moments lately. Some things I miss, some I don’t, and it’s probably because I’m getting old and crusty, stuck in my ways, but permissioning is one of those topics where I think about the good […]

March 3, 2022

Data, Data Engineering, Python, Ramblings

boto3 or aws cli for s3 … Python vs Bash and other thoughts.

For any Data Engineer working on AWS for any length of time, there is one task that always seems to come up and never go away. Manipulating files in an S3 bucket on AWS is something I’ve had to do for years; it just never goes away. It’s always something … listing files, moving files, […]

February 28, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 4 – Keys To Success – Idempotency and Partitioning.

As the road winds on, we come to Part 4 of our 5 Part Series on Data Warehouses, Lakes, and Lake Houses. Finally, we are getting to some fun topics after all the boring stuff. Today I want to talk about the two keys to success in your Data Lakes … Idempotency and Partitioning. I […]

February 9, 2022

Big Data, Data, Data Engineering, Data Warehousing

Databricks + Delta Lake MERGE duplicates – Deterministic vs Non-Deterministic ETL.

Is there any problem more classic to Data Lakes and Data Warehouses than duplicate records? You would think after doing the same ETL for over a decade I could avoid the issue, but apparently not. It’s good never to think too highly of oneself; the duplicates can get us all. Today I want to […]

February 3, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

End-to-End Pipeline Integration Testing for Databricks + Delta Lake.

The testing never ends. Tests, tests, tests, and more tests. When it comes to data engineering and data pipelines, it seems good practices are finally catching up after years. In the past, the data engineering community took a lot of heat, and rightly so, for not adopting good software engineering principles, especially in data pipelines. […]

February 1, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Now we are getting to the crux of the matter. I would say Data Modeling is probably one of the most unaddressed, yet important parts of Data Warehousing, Data Lakes, and Lake Houses. It raises the most questions and concerns and is responsible for the rise and fall of many Data Engineers. This is what […]

January 17, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 2 – How Technology Platforms affect your Data Warehouse, Data Lake, and Lake Houses.

This is a start of a 5 part series on Demystifying Data Warehouses / Data Lakes / Lake Houses. In Part 2 we are digging into the common Big Data tools and how those technologies have a direct impact on Data Models and what kind of Datastore ends up being designed. Part 1 – What […]

January 15, 2022

Big Data, Data, Data Engineering, Data Warehousing

5 Part Series – Demystifying Data Warehouses / Data Lakes / Lake Houses

Even I get confused these days. Data Warehouse, Data Lake, and Lake Houses … why do we have three, and what are the differences? Is it all just marketing huff-a-luff? Technology and life in the data world seem to be changing fast these days. Lots of new vendors on the streets trying to hawk their tools […]

January 1, 2022

Big Data, Data, Data Engineering

2 Useful PySpark Functions

I’ve come to have a great love for PySpark; it’s such an easy and powerful tool to use. I use it every day to crunch tens to hundreds of terabytes of data, without even blinking an eye. And all this with the ease of Python; it’s almost too good to be true. I have to […]

December 28, 2021

Big Data, Data, Data Engineering, Ramblings

DataFrames vs SparkSQL – Which One Should You Choose?

I’ve been amazed at the growth of Spark over the last few years. I remember 5 years ago, when I first started writing about Spark here and there, it was popular, but still not used that widely at smaller companies. AWS Glue was just starting to get popular, it seemed the barrier to widely adopted Spark […]

December 27, 2021
