Python Archives - Page 7 of 13 - Confessions of a Data Guy

Performance Testing Postgres Inserts with Python

Sometimes I get to feeling nostalgic for the good ol’ days. What days am I talking about? My Data Engineering days when all I had to worry about was reading files with Python and throwing stuff into Postgres or some other database. The good ol’ days. The other day I was reminiscing about what I worked on a lot during the beginning of my data career. Relational databases plus Python was pretty much the name of the game.

One of the struggles I always had was how fast can I load this data into Postgres? psycopg2 was always my Python package of choice for working with Postgres, it’s a wonderful library. Today I want to give a shout-out to my old self by performance testing Python inserts into Postgres. There are about a million ways and sizes and shapes to getting a bunch of records from some CSV file, through Python, and into Postgres.

I also enjoy making people mad … there’s always that. Nothing makes people mad at you like a good ol’ performance test 🙂

December 17, 2021

Data, Data Engineering, Python, SQL

ORM’s are the Cigarettes of the Data Engineering World.

Seriously, just don’t do it, they are bad for you. Listen to your mother, just say no. The dreaded ORM’s ( Object Relational Mapping ) that do all the hard SQL work for you. But, they come with many unintended consequences that are bad for your health and wellness in the long term. Many unsuspecting victims have been sucked into ORMs with the promise of an easier transition to allow programmers a familiar object-oriented design pattern for manipulating the data in a relational database, say Postgres or MySQL.

Again I tell you, don’t fall for the siren songs, there are tears and sorrow down the long and lonely ORM road.

November 15, 2021

Big Data, Data, Data Engineering, Python

3 Tips for Unit Testing PySpark Pipelines

I’m not sure what it is, but some prevailing evil in the Data Engineering world has made it not so common for PySpark pipelines to be unit tested. Who knows, it’s probably a combination of things. Data Engineers have been accused of not having good Software Engineering principles. Functional testing is a hot commodity in the Software Engineering world but probably takes a while to trickle its way into mainstream Data Engineering. It can require good Docker skills. Also, generally speaking, the old school Data and ETL Developers that preceded Data Engineers in the bygone days never unit tested …. so neither do their ancestors.

Who knows? All that being said I want to give you 3 tips to help you unit test your PySpark ETL data pipelines.

November 9, 2021

Big Data, Data, Data Engineering, Python, Ramblings

Review of Airbyte for Data Engineers

It’s hard to keep up with the never-ending stream of new Data Engineering tools these days. Always something new around the next bend. I find it interesting to kick the tries on the new kids on the block. It’s always interesting to see what angle or pain point a new tool tries to hone in on. I mean if you think about Data Engineering in general, the fundamentals really haven’t changed that much over the years, the tools change, but what we do hasn’t. We are expected to move data from point A to point B in a reliable, scalable, and efficient manner.

Today I’m going to be reviewing a tool called Airbyte. When I review a new product I’m usually incredibly basic about what I look for and I try to answer some easy and obvious questions. How easy is it to set up and use? What does the documentation look like? When I run into a problem can I solve it? Is the overhead of adding this new tool to a tech stack worth what features it offers? This is how we will explore Airbyte.

October 13, 2021

Big Data, Data, Data Engineering, Python

Bitwise Operations for Data Engineers

Ugh. Cursed bitwise operations … something usually reserved for the low-level mythical engineers writing code no one should have to write. I’ve escaped all but twice during my meager existence, recently I had to use a bitwise operation while converting a Python hashing algorithm into PySpark code. It made my brain hurt. What is this wizardry all about anyways? It got me thinking, I should really attempt to learn something about bitwise operations since it comes up once every 10 years.

October 13, 2021

Data, Data Engineering, Python

gRPC for Data Engineers

If you’ve been around Data Engineering for a while, like me, you’ve noticed a few trends in the industry at wide, and in individual data engineers themselves. There seem to be a few types of data engineers, and it depends on where you’ve worked, and what your projects have looked like that put you here or there. Some data engineers focus on general ETL, Data Warehousing, and such things. They move data around and transform it using a myriad of tools. The other set of data engineers are more focused on infrastructure at a low level, they provide the underlying tools and services others use to make that data move around and transfer.

Which are you? One of those topics you may or may not be familiar with depending on your background is RPC or more specifically gRPC. What is it?

September 27, 2021

Big Data, Data, Data Engineering, Python, Ramblings

The Wild West of Parallel Computing – Review of Bodo.ai

It truly is the Wild West of parallel computing these days. It seems that big data has brought out an onslaught of companies trying to either take advantage of making it easier to use any number of big data platforms or making up their own. Most of them usually take shots at tools like Spark and Dask, probably two of the more well-known big data engines. Of course with Python’s rise, especially in Data Science and ML, many of these tools target that audience.

One such newcomer is Bodo.ai, and I’ve seen them pop up on places like r/dataengineering. Fortunately, they have a free community edition, so let’s kick the tires and see what’s going on.

September 24, 2021

Big Data, Data, Data Engineering, Machine Learning, Python

Dask vs PySpark – Performance and Other Thoughts.

Every once in awhile I see someone talking about their wonder distributed cluster of Dask machines, and my curiosity gets aroused. I know plenty of people use Dask, mostly on their local machines, but it seems like the meteoric rise of Spark, especially with tools like EMR and Databricks, that Dask is slowly slipping into the shadows. I’ve had bad experiences with Dask in the past, trying to get it work well in production. I suppose that comes from working with tried and true Spark and other bullet proof distributed system. I’ve been meaning to return to Dask for awhile, compare a similar Dask and Spark cluster on performance … and other things like ease of setup and writing code. Let’s get too it.

September 6, 2021

Big Data, Data, Data Engineering, Python

“Don’t mess with the dials,” they said. Spark (PySpark) Shuffle Partition Configuration and Performance.

Sometimes I amaze myself. I’ve been using PySpark for a few years now, happily crunching hundreds of TBs of data without much problem. Sure you randomly run into OOM errors and other such nonsense. Usually inspecting the code for something silly, throwing in a persist() or cache() here and there will solve 99% of the problems. I’ve always approached Spark performance with an overly pragmatic approach. Spark being the beast that it is, it’s easy to hide performance problems with more resources etc. I’ve generally tried to stay away from UDF's just using good coding practices and out of the box functionality. Ensuring good predicate pushdown’s, data partitioning etc are all helpful and important. But in the end… I don’t really know much about the out-of-the-box Spark configurations and how they affect performance.

Do the configurations change based on data size and partitioning strategy plus resources and cluster size? Probably. Does that seem complicated to figure out? Yes. Is the internet full of conflicting, vague and confusing advice? Of course.

August 13, 2021

Big Data, Data, Data Engineering, Python, Scala

String Slicing Performance – Python vs Scala vs Spark.

Good ole’ string slicing. That’s one thing that never changes in Data Engineering, working with strings. You would think we would all get to row up some day and do the complicated stuff, but apparently you can’t outrun your past. I blame this mostly on the data and old schools companies. Plain text and flat files are still incredibly popular and common for storing and exporting data between systems. Hence string work comes upon us all like some terrible overload. The one you should fear the most is fixed width delimited files. I ran into a problem recently where PySpark was surprisingly terrible at processing fixed with delimited files and “string slicing.” It got me wondering … is it me or you?

July 17, 2021

Performance Testing Postgres Inserts with Python

ORM’s are the Cigarettes of the Data Engineering World.

3 Tips for Unit Testing PySpark Pipelines

Review of Airbyte for Data Engineers

Bitwise Operations for Data Engineers

gRPC for Data Engineers

The Wild West of Parallel Computing – Review of Bodo.ai

Dask vs PySpark – Performance and Other Thoughts.

“Don’t mess with the dials,” they said. Spark (PySpark) Shuffle Partition Configuration and Performance.

String Slicing Performance – Python vs Scala vs Spark.

Interesting links

Pages

Categories

Archive