
Pandas DataFrame.to_sql() … and how you should configure it to not be that guy.

Sometimes Pandas is slow like this…. until you tweak it.

I never understand it when someone comes up with a great tool, then defaults it to work poorly… leaving the rest up to imagination. The Pandas dataframe has a great and underutilized tool… to_sql(). Lesson learned, always read the fine print I guess. I’m usually guilty of this myself… wondering why something is slow and sucks… and not taking the time to read the documentation. Here are some musings on using to_sql() in Pandas and how you should configure it so you don’t pull your hair out.

Setting up to test Pandas.DataFrame.to_sql()

So what I want to do is test the effect some of the configuration values of Pandas.DataFrame.to_sql() have on performance with a reasonably sized dataset. Firstly, I don’t use Pandas that much myself. I’ve found it nearly impossible to use for big data; it’s a memory hog and only useful on what I would consider small datasets that fit in the memory footprint of a non-high-mem instance. True Believers use Spark DataFrames instead. Either way, all you raving Pandas fans can keep your comments to yourself. I digress.

I’m going to use a 2,438,457-row dataframe built from the free Divvy Bike Trip datasets, then try dumping that dataframe straight to a Postgres table. This isn’t that unusual of a data flow. First I’m going to run just a basic Docker Postgres instance.

FYI. Code is available on GitHub.

Run local Postgres

docker run --name postgres -e POSTGRES_PASSWORD=postgres -d -p 5432:5432 postgres

Now you will just have the base Docker postgres image running locally at localhost or 0.0.0.0, listening on port 5432. The default user is postgres, and so is the password.
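If you want to sanity-check the container before pointing Pandas at it, a quick connection test does the trick. Here’s a minimal sketch, assuming the default postgres/postgres credentials from the docker run above.

from sqlalchemy import create_engine, text

# minimal connectivity check -- assumes the default postgres/postgres
# credentials from the docker run command above.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
with engine.connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())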

Mash data together.

Next I downloaded 5 csv files from 2020 using the above Divvy link.

202006-divvy-tripdata.csv       202007-divvy-tripdata.csv       202008-divvy-tripdata.csv       202009-divvy-tripdata.csv       202010-divvy-tripdata.csv

There are about 2.4 million rows so this should be a good test.

from glob import glob
import pandas as pd


def gather_file_names() -> iter:
    file_names = glob("*divvy*.csv")
    return file_names


def combine_files(incoming_files: iter) -> pd.DataFrame:
    blah = pd.concat(
        [
            pd.read_csv(csv_file, index_col=None, header=0)
            for csv_file in incoming_files
        ],
        axis=0,
        ignore_index=True,
    )
    return blah


def main():
    my_lovely_frame = combine_files(incoming_files=gather_file_names())
    print(len(my_lovely_frame.index))


if __name__ == "__main__":
    main()

Testing out Pandas.DataFrame.to_sql()

So now let’s test out the “out of the box” dataframe-to-SQL function of Pandas. The to_sql function takes a sqlalchemy engine to connect to the database, in our case a local Postgres running via Docker. (You will need to pip install psycopg2 as well… you will have the best luck on macOS using pip3 install psycopg2-binary.)

Let’s add a few lines of code to create the sqlalchemy engine and call the to_sql method without messing with any of the parameters.

from datetime import datetime

from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')
t1 = datetime.now()
print("starting to insert records.")
my_lovely_frame.to_sql(con=engine, name="trip_data", if_exists="replace")
t2 = datetime.now()
x = t2 - t1
print(f"finished inserting records... it took {x}")

Yikes. 47 minutes, huh? Bahaha!!

starting to insert records.
finished inserting records... it took 0:47:14.091238

Time to play, let’s try setting chunksize.

my_lovely_frame.to_sql(
        con=engine, name="trip_data", if_exists="replace", chunksize=50000
    )

Hmm…

starting to insert records.
finished inserting records... it took 0:46:55.817583

Wow… that was useless. Let’s try method="multi" (multiple rows per insert statement).

my_lovely_frame.to_sql(
        con=engine, name="trip_data", if_exists="replace", chunksize=50000, method="multi"
    )

Well, that was an improvement, no surprise there. Not doing an insert statement for every single row seems like a no-brainer…

starting to insert records.
finished inserting records... it took 0:09:15.084440

Moral Of Story

Configurations and documentation matter. From 47 minutes to 9 minutes is a bit of a jump for just adding method="multi", wouldn’t you say?

Just Because I Can…

I was curious about the fastest way with Python to get 2.4 million rows into the database without using COPY.

Multiprocessing? Let’s add a new function.

from concurrent.futures import ProcessPoolExecutor
from random import randint

def pandas_smandas(file_uri: str) -> None:
    # each worker reads one csv and dumps it into its own randomly suffixed table
    my_lovely_frame = pd.read_csv(file_uri)
    engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
    print("starting to insert records.")
    say_what = randint(1, 100)
    my_lovely_frame.to_sql(
        con=engine, name=f"trip_data_{say_what}", if_exists="replace", chunksize=50000, method="multi"
    )

def main():
    t1 = datetime.now()
    with ProcessPoolExecutor(max_workers=5) as PoolParty:
        PoolParty.map(pandas_smandas, gather_file_names())
    t2 = datetime.now()
    x = t2 - t1
    print(f"finished inserting records... it took {x}")

Well of course that was much faster.

finished inserting records... it took 0:02:40.825578
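For reference, the COPY route I skipped above would look roughly like this. It’s just a sketch using psycopg2’s copy_expert, and it assumes the trip_data table already exists (say, from an earlier to_sql run).

import io

import pandas as pd
import psycopg2


def copy_frame(frame: pd.DataFrame, table_name: str) -> None:
    # dump the dataframe to an in-memory csv buffer, then stream it via COPY
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False, header=False)
    buffer.seek(0)
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="postgres",
        user="postgres",
        password="postgres",
    )
    with conn, conn.cursor() as cur:
        cur.copy_expert(f"COPY {table_name} FROM STDIN WITH (FORMAT csv)", buffer)
    conn.close()

On a frame this size that route tends to beat to_sql handily, though your mileage may vary.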

Got any other ideas to make it faster? Leave a comment below!
