5 Basic and Undervalued Data Engineering Skills

What is the standard for most data engineers these days? Turns out SQL and Python are still running the show pretty much across the board. There’s always a variety of skills in those areas, some better, some worse, although with a little work and repetition it’s pretty easy to master both SQL and Python. I’ve found that Python and SQL … or Java … or Scala … having good development skills is really only half the battle. It seems there is always a few basic data engineering skills that come up over and over. They are simple skills, foundational skills that allow an average data engineer to be better. They make a person more versatile and able solve more complex problems and work across a wide variety of of tech stacks and cloud providers. What are they? Read on my fair weathered friend.

Read more

String Slicing Performance – Python vs Scala vs Spark.

Good ole’ string slicing. That’s one thing that never changes in Data Engineering, working with strings. You would think we would all get to row up some day and do the complicated stuff, but apparently you can’t outrun your past. I blame this mostly on the data and old schools companies. Plain text and flat files are still incredibly popular and common for storing and exporting data between systems. Hence string work comes upon us all like some terrible overload. The one you should fear the most is fixed width delimited files. I ran into a problem recently where PySpark was surprisingly terrible at processing fixed with delimited files and “string slicing.” It got me wondering … is it me or you?

Read more

Data Lake vs Data Warehouse – What’s the Dealio?

Data Lake, Data Warehouse, Lake House, Data Mart, it’s always something isn’t it? Don’t get me started on Data Mesh. Yikes, it’s hard to keep up these days. I want to explore the Data Lake vs the Data Warehouse and what it really all boils down to, what is the real difference. Is it data modeling, architecture, storage? I think their are a few different things that differentiate a Data Lake from a true Data Warehouse, let’s talk.

Read more