polars.exceptions.ComputeError: schema lengths differ

So, you are happily using the new Rust GOAT dataframe tool Polars to munge messy data, maybe like me, messing with 40GB of CSV data spread over multiple files. You are pretty much going to run into this error.

polars.exceptions.ComputeError: schema lengths differ
This error occurred with the following context stack:
[1] 'csv scan'
[2] 'select'

Surprise surprise, schemas that don’t match.

Heck, you think you have fixed it, and then:

polars.exceptions.SchemaError: provided schema does not match number of columns in file (193 != 197 in file)

Why can’t a guy get a simple merge schema option like Spark??!!

Mismatched schemas … a problem as old as time.

To be honest, I’m surprised Polars is struggling this much, but it’s also what one would expect from a “newish” tool that hasn’t reached its full potential among the hungry Python throngs spewing out “software” in the name of Data.

If you google around you might find things that look attractive. I found this open issue buried on the Polars GitHub.

So what do you do when you have a bunch of mismatched CSV files that SHOULD match schemas but don’t?

You suck it up. You need to find the superset … aka which file has the MOST or ALL of the columns? Your only option is to munge together a schema yourself that works and supply it to Polars.
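Here is a rough sketch of what that munging might look like. The helper names (read_header, widest_header, scan_with_schema, scan_all) are made up for illustration. It assumes every file has a header row, that one file really does contain all of the columns, and that reading everything as strings is acceptable as a first pass.

```python
import csv


def read_header(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="") as fh:
        return next(csv.reader(fh))


def widest_header(headers):
    """Pick the header with the most columns and check it covers all the others."""
    widest = max(headers, key=len)
    for header in headers:
        missing = set(header) - set(widest)
        if missing:
            raise ValueError(f"widest header lacks columns: {sorted(missing)}")
    return widest


def scan_with_schema(path, columns):
    """Lazily scan one file, adding any absent columns as NULLs."""
    import polars as pl  # imported here so the header helpers work without Polars

    present = set(read_header(path))
    # infer_schema_length=0 makes Polars read every column as a string
    lf = pl.scan_csv(path, infer_schema_length=0)
    absent = [pl.lit(None, dtype=pl.Utf8).alias(c) for c in columns if c not in present]
    if absent:
        lf = lf.with_columns(absent)
    # Same columns, same order, for every file
    return lf.select(columns)


def scan_all(paths):
    """Scan every file against the widest header and stack the results."""
    import polars as pl

    columns = widest_header([read_header(p) for p in paths])
    return pl.concat([scan_with_schema(p, columns) for p in paths])
```

Calling scan_all on the list of file paths and then .collect() should give one frame with the full column set. Casting the string columns to real types would be a separate step once the shapes agree.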

I do wish there was an option to simply merge schemas and match based on name across files … just adding columns that only exist in one file to the end, and NULLing out the records from files where they don’t exist.