
Big Data File Showdown – Part 2 – ORC with Python.

In part 1 of this big data file format series we reviewed Parquet vs Avro. It was apparent from the start that the two file formats were built for different things. Avro is clearly a row-structured file format built for communication and transactions, where schema is king and nested structures are no problem. Parquet, on the other hand, has risen to the top with the popularity of Spark; it is columnar storage and well suited to structured, tabular data. But, lest the annals of the inter-webs call us uncouth and forgetful, we must add the ORC file format to the list.

ORC file format. Who are you and where did you come from?

Anyone who has been around data engineering for a while, especially the Hadoop ecosystem, has probably at least heard of ORC files. And, probably like me, most people have heard of them and that’s about it; I’ve never run across them in the wild. With Spark + Parquet taking over the world, I’m not keeping my hopes up of stumbling onto some behemoth cloud HDFS/Hive/s3 sink of ORCs. So it’s time to take matters into my own hands. Let’s try to learn something about ORC and how it relates to Parquet and Avro, our other two champions.

Understanding the ORC file format.

Of course the first thing to do is cruise around the documentation and see what’s going on. So what is ORC?

“… Optimized Row Columnar …”

ORC documentation

Well … so we have Avro that is row optimized, Parquet that is columnar, and now we have ORC that is both? Reading the documentation, it was designed for use with Hive, but I don’t know if that means … only use it with Hive and HDFS, as opposed to dropping ORC into some s3 bucket. Let’s just list some of the main points about ORC.

  • ORC files are made up of Stripes … groups of row data (see the sketch after this list).
  • It supports data types like datetime, decimal, and the complex types (struct, list, map, and union).
  • The file has/can have “indexes,” which apparently help in seeking rows and skipping row groups when a read comes in with a predicate … aka a push down filter.
  • The file footer contains meta information about the ORC file as a whole.
  • Compression can be Snappy, Zlib, or none.
  • It supports ACID when used with Hive.
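For a feel of what that structure looks like, here is a minimal sketch of poking at an ORC file’s footer and stripes with pyarrow. This assumes a pyarrow build that includes ORC support, and the file name is hypothetical.

import pyarrow.orc as orc

# Open a single ORC file (hypothetical path) and inspect its footer metadata.
reader = orc.ORCFile("trips.orc")
print(reader.schema)     # column names and types, read from the file footer
print(reader.nrows)      # total row count
print(reader.nstripes)   # how many stripes the file is broken into

# Data inside each stripe is stored column by column, so you can
# ask for only the columns you actually need.
table = reader.read(columns=["started_at", "ended_at"])
print(table.num_rows)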

Boiling down ORC.

So what do you really need to know about ORC? From reading all the documentation it comes down to this ….

Stripes are integral to ORC. They are 64MB in size by default and are “independent” of each other … allowing distributed work to happen on a single file. Columns are stored separately from each other inside a stripe, so only the needed data has to be read.

The indexes in an ORC file allow read filters to be pushed down into the file, signaling which Stripes actually need to be read.
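A quick way to sanity check that a filter really is being pushed down is to look at Spark’s physical plan. This is just a sketch, assuming a SparkSession named spark (like the one used later in this post) and a hypothetical ORC directory.

# Look for "PushedFilters" in the physical plan to confirm the predicate
# is handed to the ORC reader instead of being applied after the scan.
df = spark.read.orc("some/orc/directory")
df.where(df["member_casual"] == "member").explain()
# The FileScan line should show something along the lines of:
#   PushedFilters: [IsNotNull(member_casual), EqualTo(member_casual,member)]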

The more I read the documentation, the more it reminded me of Parquet files. There was even mention of ORC being faster than Parquet and being chosen by Facebook for Data Warehouse work. As someone who has used Parquet + Spark for a long time, I’m very curious to see how ORC compares to Parquet. And, there are other things I think about besides just performance, although that is important.

What I really want to know is how easy is it to develop with?

ORC files with Python

Many times in the life of a data engineer I find myself opening Parquet files with Pandas locally, and writing them, just to inspect data, run tests, and do general development work. Of course a lot of development is done with Spark, but I need to be able to work with file formats easily on my local machine. So the big question is, how easy is it to read and write ORC files with Python?

Pandas + ORC

I was super happy to see Pandas support for ORC files; this just makes development life so much easier … without it, a new file format would probably be a non-starter. It appears the backend uses pyarrow … that great package. The documentation says it’s only good for reading a single file though. I’m not sure if you can read directories of ORCs, which is very common with Parquet files. Hopefully Spark supports that.
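Reading is about as simple as it gets. Here is a minimal sketch, assuming pandas >= 1.0 with a pyarrow build that includes ORC support; the file path is hypothetical.

import pandas as pd

# Read a single ORC file into a DataFrame, optionally pulling only a few columns.
df = pd.read_orc("trips.orc", columns=["started_at", "ended_at", "member_casual"])
print(df.head())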

Time to try it. On one of my Linodes I downloaded (and unzipped) a bunch of Divvy’s bike trip data.

wget https://divvy-tripdata.s3.amazonaws.com/202004-divvy-tripdata.zip
>>>
root@localhost:~/bikes# ls
202004-divvy-tripdata.csv  202007-divvy-tripdata.csv  202010-divvy-tripdata.csv
202005-divvy-tripdata.csv  202008-divvy-tripdata.csv  202011-divvy-tripdata.csv
202006-divvy-tripdata.csv  202009-divvy-tripdata.csv  202012-divvy-tripdata.csv

Well, first disappointment: Pandas cannot write to ORC format. Not the end of the world, but not the best thing either.

Pyarrow + ORC

Well, if I can’t write ORC with Pandas, just do it with pyarrow I guess.

>>> import pyarrow.orc as orc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/orc.py", line 24, in <module>
    import pyarrow._orc as _orc
ModuleNotFoundError: No module named 'pyarrow._orc'

When an import blows up like that, I usually just back off the version and try again, and that worked; pyarrow==0.15.0 was the trick. Well, then I was in for another surprise: I can’t write ORC files with PyArrow either. Apparently I can’t write ORC files with anything in Python. Pooh. After reading some more articles here and here, I started to come to a realization. When they said ORC was made for Hive, they were not kidding. Hive seems to be the only real option for writing ORC files … even the Spark functionality depends on Hive.

What that all really means to me is that if I had to work with ORCs the same way I use and develop with Parquet on a daily basis, it would be a pain.

Go ahead and Google … “convert csv to orc” and have some fun reading. Apparently this pain point is very common.

How in the world do I create ORC data?

Well, apparently until recently it was basically black magic, performed after meeting the devil under the old oak tree at midnight. But recently the ORC project released some Java command line utilities to convert a CSV to an ORC file. Yeah, that looks like a ton of fun.

I’m going to skip all that jazz and go straight to Spark.

>>> df = spark.read.csv('bikes/*.csv', header='true')
>>> df.show()
+----------------+-------------+-------------------+-------------------+--------------------+----------------+--------------------+--------------+------------------+------------------+------------------+------------------+-------------+
|         ride_id|rideable_type|         started_at|           ended_at|  start_station_name|start_station_id|    end_station_name|end_station_id|         start_lat|         start_lng|           end_lat|           end_lng|member_casual|
+----------------+-------------+-------------------+-------------------+--------------------+----------------+--------------------+--------------+------------------+------------------+------------------+------------------+-------------+
|322BD23D287743ED|  docked_bike|2020-08-20 18:08:14|2020-08-20 18:17:51|Lake Shore Dr & D...|             329|Clark St & Lincol...|           141|         41.932588|        -87.636427|         41.915689|          -87.6346|       member|

>>> df.write.format("orc").save("bikes/orcs/orcy.orc")

There it is … the elusive ORC file finally.

root@localhost:~# ls bikes/orcs/
>>> orcy.orc

I’m surprised the partitions didn’t get written as separate ORC files, but what do I know about ORC?

>>> df.rdd.getNumPartitions()                                                   
6
>>> df.write.format("parquet").save("bikes/parquets/")
root@localhost:~# ls bikes/parquets/
part-00000-e51c91a5-23d4-4dd6-95a8-213db9b9f23e-c000.snappy.parquet
part-00001-e51c91a5-23d4-4dd6-95a8-213db9b9f23e-c000.snappy.parquet
part-00002-e51c91a5-23d4-4dd6-95a8-213db9b9f23e-c000.snappy.parquet
part-00003-e51c91a5-23d4-4dd6-95a8-213db9b9f23e-c000.snappy.parquet
part-00004-e51c91a5-23d4-4dd6-95a8-213db9b9f23e-c000.snappy.parquet
part-00005-e51c91a5-23d4-4dd6-95a8-213db9b9f23e-c000.snappy.parquet

That’s me just confirming that the Parquet writer would write out the partitions as separate files. For some background, there are about 3,114,796 records in the dataframe.

Performance of Parquet vs ORC reads with predicates.

So now that I have ORC and Parquet files of the same data, I do want to try a predicate push down read with Spark on both file types and see if there is any significant difference.

from pyspark.sql import SparkSession
from datetime import datetime

spark = SparkSession \
    .builder \
    .appName("parquet vs orc") \
    .getOrCreate()


# Time a push down predicate read against the Parquet files
t1 = datetime.now()
df = spark.read.parquet('bikes/parquets/')
electric = df.where(
    (df['member_casual'] == 'member') & (df['rideable_type'] == 'electric_bike')
).select('started_at', 'ended_at')
electric.collect()
t2 = datetime.now()
print(t2 - t1)

That took 0:00:26.062626 with Spark and Parquet. Now let’s try ORC. So all I am going to do is swap out this line.

df = spark.read.orc('bikes/orcs/orcy.orc')

Well, that was a little faster: 0:00:24.599139. It probably doesn’t seem like a ton, but if you work with huge amounts of data being written and read through complex ETL, that would make a difference. That means ORC is roughly 6% faster here. I know there is a lot of tuning you can do with both Parquet and ORC, but out of the box it’s good to know ORC can handle itself and offers great performance.

Who knows, maybe that out-of-the-box performance gain with ORC gets even better as the data grows?

Also, what if I add one of those indexes the documentation talked about?

df.write.option("orc.bloom.filter.columns", "member_casual").option("orc.dictionary.key.threshold", "1.0").format("orc").save("bikes/orcs2/orcy.orc")

Above I re-wrote that ORC file while adding a bloom filter on the column member_casual, which is part of the push down filter predicate in Spark. Let’s re-run the above query and see if we can squeak out any more time.

Wow, we got a little more: 0:00:24.015829. That puts ORC at roughly an 8% performance increase over Parquet.
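For what it’s worth, here is the simple arithmetic behind those rough percentages, using the wall-clock times from the runs above.

# Rough relative speedups based on the timings reported in this post.
parquet_s = 26.062626
orc_s = 24.599139
orc_bloom_s = 24.015829

print(f"ORC vs Parquet: {(parquet_s - orc_s) / parquet_s:.1%} faster")                       # ~5.6%
print(f"ORC + bloom filter vs Parquet: {(parquet_s - orc_bloom_s) / parquet_s:.1%} faster")  # ~7.9%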

Musings on ORC files.

I have to say, I was a little disappointed in the general support for ORC files floating around out there in the world. Now I think I know why Parquet took first place in everyone’s heart. It may not seem like a big deal to some people that there isn’t good Python support for ORC, since it is an HDFS/Hadoop tool more than anything. But it would make day-to-day development with ORC files a pain.

For PySpark developers there is usually a lot of downloading Parquet files from s3, testing them, writing sample files for unit tests, examining data, and doing analysis. Of course I use the Spark REPL for that kind of work, and I could do it with ORC. But it just wouldn’t be the same.

I was impressed by reading the ORC documentation and testing its performance in a rudimentary way. It seems to be up and coming, offering some seriously cool features. Even a modest 8% performance gain is impressive to think about when talking big data and complex ETL with lots of push down predicates.

I’m going to keep my eye on ORC in the future and hopefully try it out on a project, to see if it’s as promising as it seems to be.