Ever heard of something called a file object in Python? Ever heard of BytesIO or StringIO? You're missing out. They're easy, fast, and wonderful. In short, they're the best. For some reason, IO streams are an underused feature that rarely comes up in most code. We all know that memory is faster than disk IO, and that is what I use IO streams for.
Simple Introduction to StringIO and BytesIO in Python.
Why would you want to use from io import StringIO, BytesIO? Think about it. If you are the typical data engineer, business intelligence engineer, report developer, or data analyst starting out, what is one of the first things you learn? Probably opening a file: maybe a csv file, a text file, a JSON file, or a zip file full of other files. Maybe you're downloading a bunch of data into files. Whatever. Most of the time it involves a file of some sort. You may never have thought about it, but reading and writing files on disk can be a bottleneck in programs, especially if you are reading and writing the same file multiple times. Say you're downloading a file over HTTP to disk, then unzipping it, then reading the file in. That's a lot of disk IO.
This is where StringIO and BytesIO come in. StringIO holds text (str) and BytesIO holds binary data (bytes). One way to think about these streams of data is that they act like a file object. What does that mean? It means you can treat and interact with these objects like you would any other file, because they have the file API on top of them. You can read them, write to them, etc. A file but not a file, get it?
The best part is that they live in memory. That means fast. Of course you need enough memory on your machine or server to hold the data, but that usually isn't a problem. One minor detail to remember about a StringIO/BytesIO is that when created, it acts like an already opened file!
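To make the "already opened" point concrete, here is a minimal sketch. The variable names and sample values are made up for illustration; the only things used are the standard write(), read(), and getvalue() methods.

```python
from io import BytesIO, StringIO

# A StringIO is usable the moment it is created: no open() call needed.
text_file = StringIO()
text_file.write("hello\n")
text_file.write("world\n")
is_open = not text_file.closed

# getvalue() returns everything written so far, wherever the cursor is.
contents = text_file.getvalue()

# A BytesIO works the same way, but holds bytes instead of str.
# Passing initial data leaves the cursor at the start, ready to read.
binary_file = BytesIO(b"\x00\x01\x02")
first_two = binary_file.read(2)
rest = binary_file.read()

print(repr(contents), first_two, rest, is_open)
```

Just like a real file, you can call close() when you are done, which frees the buffer.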
Let’s look at some examples of how this could work. Simplistic, but to the point.
    from io import StringIO, BytesIO
    import csv

    in_memory_file = StringIO()
    csv_writer = csv.writer(in_memory_file)
    csv_writer.writerows([[1, 2, 3], [4, 5, 6]])
    in_memory_file.seek(0)
    for row in in_memory_file:
        print(row)
So what's going on here? First, we create an in-memory open file object, in_memory_file = StringIO(). Next, I'm showing you how we could create what is basically a csv file, just without calling with open('file.csv', 'w') as csv_file. So we are swapping out the usual step where we create a file on disk to write data to.
A csv writer takes a file object. Well, we have one of those, don't we! Next we call .writerows() on our csv_writer object to write two separate rows. The next part may seem a little strange to you: seek(0). Since we are dealing with a stream, a file-like object, and we've written two lines, that open file object is now positioned at the "end." To read back out of that file object and print the lines we wrote, we need to be at the beginning.
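If the cursor business still feels fuzzy, this small sketch shows what happens when you forget seek(0). The stream contents and variable names are invented for the example.

```python
from io import StringIO

stream = StringIO()
stream.write("first line\n")
stream.write("second line\n")

# The cursor sits right after the last character written,
# so reading now finds nothing left to read.
read_at_end = stream.read()

# Rewind to the start and the data is all there.
stream.seek(0)
lines = stream.readlines()

print(repr(read_at_end))
print(lines)
```

If all you want is the full contents, getvalue() skips the rewinding entirely, but seek(0) is what you need when handing the stream to code that expects to read it like a file.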
Let’s take this just slightly further to show how StringIO/BytesIO could be useful. Let’s say we have a boring job and our boss asks us to download information about Livestock and Meat International Trade data from the government, and insert relevant information into a database. The typical workflow would be to download the zipfile, unpack it, read in the relevant csv file from disk, find the data and off to the races.
Well, we know better now, don't we? That sounds like a few spots of file IO: writing the zip to disk, unzipping the files to disk, then reading the csv file from disk. But there is a more excellent way.
    import requests
    from io import BytesIO
    from zipfile import ZipFile

    url = 'https://www.ers.usda.gov/webdocs/DataFiles/81475/LivestockMeatTrade.zip'

    try:
        response = requests.get(url)
    except requests.RequestException:
        # No response to inspect if the request itself failed, so stop here.
        raise SystemExit('Problem downloading zip file.')

    if response.status_code == 200:
        # Wrap the downloaded bytes in a file-like object; nothing touches disk.
        in_memory_zip = BytesIO(response.content)
        with ZipFile(in_memory_zip) as zippy:
            for item in zippy.infolist():
                if 'Exports' in item.filename:
                    # Members are opened in binary mode, so decode each line.
                    with zippy.open(item.filename) as export_file:
                        for row in export_file:
                            print(row.decode('utf-8'))
Easy! It's fast because we are doing everything in memory, and the code is simple and straightforward. Really what I'm showing you here is that many of the packages and methods you use in Python can take a file-like object. In this example, with ZipFile(in_memory_zip) as zippy doesn't care whether it's an actual file sitting on your disk or a file-like object sitting in memory.
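If you want to try the same idea without hitting the network, here is a rough offline sketch. It builds a tiny zip in a BytesIO, then reads a csv back out of it. The file name exports.csv and the rows are made up for illustration and have nothing to do with the real USDA data. It also wraps the zip member in TextIOWrapper so csv.reader gets text instead of bytes.

```python
import csv
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile

# Build a small zip archive entirely in memory (hypothetical sample data).
archive_buffer = BytesIO()
with ZipFile(archive_buffer, "w") as archive:
    archive.writestr("exports.csv", "country,tons\nCanada,10\nMexico,7\n")

# Rewind, then read the archive back as if it were a file on disk.
archive_buffer.seek(0)
with ZipFile(archive_buffer) as archive:
    with archive.open("exports.csv") as raw_member:
        # Zip members are binary; TextIOWrapper decodes them for csv.reader.
        text_member = TextIOWrapper(raw_member, encoding="utf-8", newline="")
        rows = list(csv.reader(text_member))

print(rows)
```

ZipFile, csv.reader, and TextIOWrapper all just see something with a file API, which is the whole trick.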
I feel like that's enough for you to get started with StringIO and BytesIO. Next time you're writing Python and working with a file, think of io and give it a try!