Python Archives - Page 6 of 13 - Confessions of a Data Guy

Data, Data Engineering, Golang, Python, Rust

Working with Cloud Storage (s3). Golang vs Rust vs Python. Who shall emerge victorious?

I’ve always been a firm believer in using the right tool for the job. Sometimes I look at a piece of code … and ask … why? I mean just because you can do something doesn’t mean that you should. I see a lot of my job as someone who writes code … as not just my ability to write code, but the ability to reason about problems and design simple and elegant solutions that solve the problem at hand.

I try not to let my love of a tool, language, or package color my view of the world as it is. In fact, there is wisdom to be found in being critical of those languages and tools you love the most. Be aware of their shortcomings and failures. This leads to better software and architecture designs, and less complexity. Too often I’ve seen folks picking their tool of choice and then sticking with it till the bitter end, and it usually is bitter. There is more to life than writing obtuse Scala code that is illegible for some mundane task.

This sort of thing is a blight on everyone and every system. Now I must descend from my high horse and join the peasants on the dusty road of life. Today I want to look at some very common Data Engineering tasks, namely cloud storage, and what it is like to do such a thing with Golang, Rust, and Python. I will let you draw your own conclusions. Maybe. Code available on GitHub.

September 10, 2022

Big Data, Data, Data Engineering, Python

A Few Wonderful PySpark Features.

Just when I think it cannot get more popular, it does. I have to admit, PySpark is probably the best thing that ever happened to Big Data. It made what was once a myth, approachable to the average person. No need for esoteric Java skills, no more MapReduce, just plain old Python. Another amazing thing about Spark in general, and by extension PySpark, is the sheer amount of out-of-the-box capabilities. I wanted to dedicate this post to a few amazing and wonderful features of PySpark that make Data Engineering fun and powerful.

June 24, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Great Expectations with Databricks and Apache Spark. A Tale of Data Quality.

It still seems like the wild west of Data Quality these days. Tools like Apache Deque are just too much for most folks, and Data Quality is still new enough to the scene as a serious thought topic that most tools haven’t matured that much, and companies dropping money on some tool is still a little suspect. I’ve probably heard more about Great Expectations as a DQ tool than most.

With the popularity of PySpark as a Big Data tool, and Great Expectations coming into its own, I’ve been meaning to dive into what it would actually look like to to use Great Expectations at scale and answer some simple questions. How easy is it to get up and running with Spark, what’s the path of least resistance to getting some basic Data Quality checks in place in a data pipeline.

June 2, 2022

Big Data, Data, Data Engineering, Python

Data Engineering/Data Pipeline repo Project Template (free).

Probably one of the hardest hurdles to jump over when starting out in anything new, including Data Engineering and Data Pipelines, is knowing where to start. It always can be a little daunting. One aspect that can make or break any project, giving you the confidence to move forward like Sparticus to conquer, is having a good project template for your repository of code and logic that will encapsulate and present your code to others.

I’ve created a free and hopefully helpful Python blank GitHub project template that you can clone, change, and steal to your heart’s desire. I hope it will be helpful and set you going in the right direction for your next project.

May 8, 2022

Big Data, Data, Data Engineering, Python, Ramblings

Review of Prefect for Data Engineers

Not going to lie, I do enjoy the vendor wars that this marketing craze called “The Modern Data Stack” has created. I like to keep just about everything in life at arm’s length. Kinda like the way you look at your crazy third cousin out of the corner of your eye at the family reunion. I mean it’s nice to have all these options to choose from these days when building data pipelines.

One tool I haven’t been able to poke the tires on yet is Prefect. It appears to be another data orchestration tool for Python, but we shall find out. I want this to be an introduction to Prefect, we shall just try it out and let the chips fall where they may.

April 30, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

Reducing Complexity with Databricks + Delta Lake COPY INTO

As the years drag by in Data Engineering, there are a few things that I have come to appreciate more and more. One of those topics that is close to number one on the list is complexity reduction. Today’s modern data stacks are filled to the brim with technologies and tools, full to the brim, and overflowing. So many tools with such wonderful features, sometimes all the magic comes with a downside. Complexity. Complexity can turn something wonderful into a nightmare.

Reducing (not avoiding) complexity seems to be one of the main tenets I work on these days when designing resilient, reliable, and repeatable data pipelines that can process terabytes of data. One of those tools is COPY INTO feature of Databricks + Delta Lake.

April 14, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Data Quality – Great Expectations for Data Engineers

Mmmm … Data Quality … it is a thing these days. I look forlornly back to the ancient days of SQL Server when nobody cared about such things. Alas, we live in a different world, where hundreds of terabytes of data are the norm, and Data Quality becomes a thing. I’ve been meaning to give Great Expectations a poke for like a year, but just haven’t had the time or inclination to do so, but times are changing, and so should I.

I’m not really planning on giving an in-depth guide to Data Quality with Great Expectations, what I’m more interested in are topics like, how easy is it to set up and use, what’s the overhead, what are the main features and concepts and are they easy to understand. I find this sort of review of Data Engineering tools to be more helpful than simply a regurgitation of the documentation.

March 17, 2022

Big Data, Data, Data Engineering, Python

Databricks Access Control – The 3 Most Important Steps

It’s not often I yearn for the good old days of SQL Server, but I’ve had a few of those moments lately. Some things I miss, some I don’t, and it’s probably because I’m getting old and crusty, stuck in my ways, by permissioning is one of those topics where I think about the good old days. Data access control and permissions are topics that we all kinda ignore as not that important … until we actually are trying to do something with them. Then all of sudden we start complaining about complexity and why this isn’t easier. That’s a good way to describe my reaction to having to work on Databricks Access control.

But, I learned a few things and I think they will be helpful for someone. Read on for the basics of handling permissions and access control in Databricks and Delta Lake.

March 3, 2022

Data, Data Engineering, Python, Ramblings

boto3 or aws cli for s3 … Python vs Bash and other thoughts.

For any Data Engineer working on aws for any length of time, there is one task that always seems to come up and never go away. Manipulating files on s3 a bucket on aws is something I’ve had to do for years, it just never goes away. It’s always something … listing files, moving files, copying files, checking for files, getting the last modified file, checking file sizes, downloading files … it pretty much never ends.

Luckily aws provides a few tools to make these easy, their handy cli for command-line work, or the trusty boto3 Python package. I want to give an introduction to the common commands Data Engineers have to run with both the aws cli and boto3 to perform various common tasks. We will then compare and contrast which tool to use in our pipelines and the pros and cons of each.

February 28, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

End-to-End Pipeline Integration Testing for Databricks + Delta Lake.

The testing never ends. Tests tests tests, and more tests. When it comes to data engineering and data pipelines it seems good practices are finally catching up after years. In the past, the data engineering community took a lot of heat, and rightly so, for not adopting good software engineering principles, especially in data pipelines.

In the defense of many data engineers, because of the varied backgrounds people come from, some were never taught or realized the importance of good software design and testing practices. Sure, it always “takes more time” upfront to design data pipelines with code that is functional and unit-testable, and worse, able to be integration tested from end to end. It requires some foresight and thought in both data architecture and pipeline design to enable complete testability.

Integration testing end-to-end in an automated manner is a tough nut to crack. How can you do such a thing on massive pipelines that crunch hundreds of TBs of data? With a little creativity.

February 1, 2022

Working with Cloud Storage (s3). Golang vs Rust vs Python. Who shall emerge victorious?

A Few Wonderful PySpark Features.

Great Expectations with Databricks and Apache Spark. A Tale of Data Quality.

Data Engineering/Data Pipeline repo Project Template (free).

Review of Prefect for Data Engineers

Reducing Complexity with Databricks + Delta Lake COPY INTO

Data Quality – Great Expectations for Data Engineers

Databricks Access Control – The 3 Most Important Steps

boto3 or aws cli for s3 … Python vs Bash and other thoughts.

End-to-End Pipeline Integration Testing for Databricks + Delta Lake.

Interesting links

Pages

Categories

Archive