
Gentle Introduction to Geospatial for Data Engineers

A quick view of the geospatial data landscape.

What does a data engineer need to know about working with geospatial data? I’m going to give my two cents on what is and is not important. First, prepare to be annoyed, as you will most likely spend hours debugging strange, non-obvious errors. You should run screaming the other way, but in case that is not an option, here are the basics.

Geospatial Tools

The landscape of tools for building data pipelines in the geospatial world isn’t very large. There are a couple of ways to look at it: either you’re just moving data from point A to point B with no transform, or you’re moving data with something happening in between source and destination.

  1. If you’re just moving files/data around, just use BytesIO (Python’s in-memory byte buffer).
  2. GDAL
  3. Rasterio and Shapely
  4. PostGIS (Postgres)
  5. GeoTrellis (Big Data)
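
For the first case, moving data with no transform, a minimal sketch might look like this. It assumes both ends are ordinary binary streams; the `move_bytes` helper and the fake payload are made up for illustration, and in a real pipeline the two ends would be something like a download and an upload target.

```python
import io


def move_bytes(source, destination, chunk_size=1024 * 1024):
    """Copy a binary stream to another in chunks, without parsing the content."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        total += len(chunk)
    return total


# Stand-in for a file pulled from storage: a little-endian TIFF magic number plus padding.
payload = b"II*\x00" + b"\x00" * 16
source = io.BytesIO(payload)
destination = io.BytesIO()

copied = move_bytes(source, destination)
```

The point is that nothing here knows or cares that the bytes are geospatial. If you are not transforming the data, you do not need a geospatial library at all.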

GDAL is the grand-daddy of geospatial tools and probably one of the most powerful. What can GDAL do? Pretty much any complex and crazy geospatial wizardry you can think of, and plenty you have never heard of. It has a Python package and API. Be warned: it’s a bear to install. It has many small, strange dependencies that break the moment the smallest thing changes. I would highly suggest you find a Dockerfile that contains GDAL and use it.

Rasterio and Shapely. Rasterio is built on GDAL, and Shapely uses GEOS, the same geometry engine that powers PostGIS. They are both fine and probably the best choice when you’re doing very simple transformations on geospatial data, though they will probably be slower than their GDAL counterparts.

PostGIS is the grand-daddy of RDBMS geospatial tools. SQL is easy to learn and familiar to many, so having the power of geospatial transforms, like the intersection of two geometries, available in plain SQL is amazing. If the size of your data permits it, I would use this as a first choice for storing and working with geospatial data.
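
To make the "intersection of two geometries in SQL" idea concrete, here is a rough sketch. `ST_Intersects` and `ST_Intersection` are real PostGIS functions, but the `fields` and `zones` tables, their `id` and `geom` columns, and the `fetch_intersections` helper are hypothetical. The `%s` placeholder assumes a psycopg2-style driver.

```python
# Hypothetical tables: fields(id, geom) and zones(id, geom).
INTERSECTION_SQL = """
SELECT f.id AS field_id,
       z.id AS zone_id,
       ST_Intersection(f.geom, z.geom) AS shared_geom
FROM fields AS f
JOIN zones AS z
  ON ST_Intersects(f.geom, z.geom)
WHERE z.id = %s;
"""


def fetch_intersections(connection, zone_id):
    """Run the query on a DB-API 2.0 connection (psycopg2 is assumed here)."""
    cursor = connection.cursor()
    try:
        # The zone id is passed as a parameter rather than formatted into the SQL.
        cursor.execute(INTERSECTION_SQL, (zone_id,))
        return cursor.fetchall()
    finally:
        cursor.close()
```

The join on `ST_Intersects` keeps only the pairs that actually overlap, and `ST_Intersection` returns the shared geometry for each of those pairs.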

You can work with terabytes of geospatial data without an official big data tool, and I do every day, but GeoTrellis was made to solve this problem. It’s made to be integrated with Spark, the king of big data.

Geospatial File Types

The likelihood of running into the following file types is high. These are the most common formats for storing geospatial data.

  1. Shapefile
  2. GeoTIFF
  3. GeoJSON

Shapefiles have been around for a while. Don’t let the name confuse you: “one” shapefile actually consists of multiple files, sometimes up to a dozen. It’s easily the most annoying format to work with as a data engineer because you have so many files to keep together and keep track of. If you have shapefiles, put them into PostGIS and move on with life.
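
Here is a small sketch of what "keeping the files together" can mean in practice: group a directory listing by base name and flag any shapefile that is missing one of its mandatory parts (`.shp`, `.shx`, `.dbf`). The helper names and the example listing are made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# The three parts every shapefile needs; .prj, .cpg and others are optional extras.
REQUIRED_PARTS = {".shp", ".shx", ".dbf"}


def group_shapefile_parts(filenames):
    """Map each base name to the set of file extensions found for it."""
    groups = defaultdict(set)
    for name in filenames:
        path = PurePosixPath(name)
        groups[str(path.with_suffix(""))].add(path.suffix.lower())
    return dict(groups)


def missing_parts(groups):
    """Return only the base names that lack a required part."""
    report = {}
    for stem, extensions in groups.items():
        missing = REQUIRED_PARTS - extensions
        if missing:
            report[stem] = missing
    return report


# Hypothetical listing: "roads" is complete, "parcels" has lost its .shx file.
listing = ["roads.shp", "roads.shx", "roads.dbf", "roads.prj",
           "parcels.shp", "parcels.dbf"]
groups = group_shapefile_parts(listing)
incomplete = missing_parts(groups)
```

A check like this is worth running before you move or load anything, because a shapefile with a missing part tends to fail later with an error that says nothing useful.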

GeoTIFFs are everywhere. They store data as grids of pixels with geospatial information baked into them. They are easy to work with but can be quite large because of the multiple bands of data they can hold.
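
The "geospatial information baked in" part is mostly a small set of numbers that ties pixel positions to real-world coordinates. The sketch below uses the six-number geotransform convention from GDAL; the values and the tiny two-band raster are invented, and a real file would be read with GDAL or Rasterio rather than typed in by hand.

```python
def pixel_to_coords(geotransform, row, col):
    """Return the map coordinates of the top-left corner of a pixel."""
    x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height = geotransform
    x = x_origin + col * pixel_width + row * row_rotation
    y = y_origin + col * col_rotation + row * pixel_height
    return x, y


# Invented example: 10 m pixels, north-up image (so pixel_height is negative).
geotransform = (500000.0, 10.0, 0.0, 4600000.0, 0.0, -10.0)

# Two bands of a 2x2 raster, indexed as bands[band][row][col].
bands = [
    [[0.10, 0.20], [0.30, 0.40]],  # band 1
    [[0.50, 0.60], [0.70, 0.80]],  # band 2
]

top_left = pixel_to_coords(geotransform, 0, 0)
moved = pixel_to_coords(geotransform, 2, 3)
```

Every extra band is another full grid of values, which is why these files grow so quickly.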

Another common way of dealing with geospatial data is GeoJSON. It is becoming more popular because JSON is so widely used. It can be annoying because it gets messy with large amounts of data, and reading large JSON files is slow. There are good uses for GeoJSON, and it can be an easy go-to option when you’re dealing with small, simple geospatial datasets.
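
Because GeoJSON is just JSON, the standard library is enough for small jobs. The sketch below builds a tiny FeatureCollection, round-trips it through text, and pulls out the point features. The feature names and values are made up; the one thing to remember is that coordinates are in longitude, latitude order.

```python
import json

# A made-up collection with one point and one line.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-93.265, 44.9778]},
            "properties": {"name": "sensor-a", "reading": 12.5},
        },
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [[-93.30, 44.95], [-93.20, 45.00]],
            },
            "properties": {"name": "pipeline-1"},
        },
    ],
}


def point_features(collection):
    """Yield (lon, lat, properties) for every Point feature."""
    for feature in collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] == "Point":
            lon, lat = geometry["coordinates"][:2]
            yield lon, lat, feature["properties"]


text = json.dumps(feature_collection)   # what would be written to a .geojson file
parsed = json.loads(text)               # what a reader gets back
points = list(point_features(parsed))
```

Note that `json.loads` reads the whole document into memory at once, which is exactly why this approach stops being pleasant once the files get big.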

What is geospatial data?

My intention isn’t to give an in-depth review of geospatial data, just the bare minimum you need to know to be dangerous.

The first thing you need to know is what the data “looks like” inside your file. It’s probably in one of two formats, raster or vector. Think of a raster like an Excel sheet: a grid of pixels/boxes that each contain information. Vectors are built from points, think x and y: a bunch of little points (and the lines and polygons drawn between them) with information tied to each one. Obviously it gets more complicated than this, but just think about which one your data would fit into.
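
One way to see the difference is to turn one into the other. The sketch below takes a handful of vector points and counts how many fall into each cell of a small raster grid. The function name, the points and the grid size are all invented for illustration.

```python
def rasterize_points(points, x_min, y_max, cell_size, n_rows, n_cols):
    """Count the points that fall inside each cell of a north-up grid."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = int((x - x_min) // cell_size)
        row = int((y_max - y) // cell_size)  # row 0 is the top of the grid
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] += 1
    return grid


# Vector data: individual (x, y) locations. The last one lies outside the grid.
points = [(0.5, 2.5), (0.7, 2.2), (2.5, 0.5), (5.0, 5.0)]

# Raster data: a 3x3 grid of 1-unit cells whose top-left corner is at (0, 3).
grid = rasterize_points(points, x_min=0.0, y_max=3.0, cell_size=1.0, n_rows=3, n_cols=3)
```

The vector side keeps the exact location of every point, while the raster side only knows how many landed in each box.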

Next, it’s important to remember that just because you have two sets of geospatial data doesn’t mean that they are compatible. What do I mean? You need to read about CRS (coordinate reference systems) and projections. This topic can feel a little abstract the first time, but once you get it, it will stick. Just know that you should find out the CRS and projection of the data you are working with before you start mixing datasets, because funny stuff will start happening if you don’t watch out.
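
To show why this matters, here is the same location expressed in two common systems: longitude/latitude in degrees (EPSG:4326) and Web Mercator in meters (EPSG:3857). This is a hand-rolled sketch of the spherical Web Mercator formula using only the standard library. The function name and the sample point are my own, and in a real pipeline you would normally let a library such as pyproj do the conversion.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# The same place, two very different pairs of numbers.
lon, lat = -93.265, 44.9778
x, y = lonlat_to_web_mercator(lon, lat)
```

In degrees the coordinates are two- and three-digit numbers, and in meters they run into the millions. Mix the two in one join or overlay and nothing will line up, usually without any error telling you why.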

In conclusion

Geospatial data can be fun when you’re looking for something new and challenging to work on. Just try to learn the basics: the different tools, the file types, a few specifics about the type of data (raster/vector), and the CRS/projection systems that define that data. It’s a smaller community that works on these problems, but there is plenty of help out on the web. In the future we will look at some actual geospatial data and how to use Python to work with it.