Gandalf's Beard! DataFrames in Golang. - Confessions of a Data Guy

Photo by Thomas Schweighofer on Unsplash

I’m not sure if DataFrames in Golang were created by Gandalf or by Saruman; it is still unclear to me. I mean, if I want a DataFrame that bad … why not just use something normal like Python, Spark, or pretty much anything else but Golang? But if Rust gets DataFusion, then Golang can’t be left out in the cold, can it!? I guess if you’re hardcore Golang and nothing else will do, and you’re playing around with CSV files, then maybe? Seems like kind of a stretch. But I have a hard time saying no to Golang, it’s just so much fun. Kinda like when Gandalf told Bilbo and the dwarves not to stray from the path going through Mirkwood, and those little buggers did it anyways. Code available on GitHub.

DataFrames in Golang, just because we can.

I still think it’s a stretch, but I guess we could probably find some real-life use cases for DataFrames in Golang, maybe. One thing is for sure, it should be fast like everything else Golang does. Let’s contrive an example so we can play around with Golang DataFrames and see what they have to offer, and how horrible or pleasant it is to work with them.

Problem Statement

Let’s say one day your Over-Lord shows up in your Slack. This Over-Lord brandishes a freshly polished sword and exclaims that there is a new quest that needs to be completed, one that is perfect for such a peasant and lowly creature as yourself. Your Over-Lord declares you are subject to a recent edict from the High-Powers that only Golang can be used for such quests, and that using Python or other dynamically typed languages will be punishable by a gruesome and horrible death.

You, being the paltry knave you are, immediately agree to this most honorable quest, and execute it with much vigor.

The quest is to use Golang to process incoming CSV files that contain detailed bike trip data, filter them to member-only records, then aggregate the data by counting the number of bike rides per station to find the most popular stations. So …

read incoming CSV files as a DataFrame.
filter the DataFrame to member-only records.
count the number of bike rides per start_station_name.
order the results by the count descending.

Quest Begins.

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/go-gota/gota/dataframe"
)

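func main() {
	csvfile, err := os.Open("data/202206-divvy-tripdata.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer csvfile.Close()

	// ReadCSV reports parse failures on the DataFrame's Err field, not as a return value.
	df := dataframe.ReadCSV(csvfile)
	if df.Err != nil {
		log.Fatal(df.Err)
	}
	fmt.Println("df: ", df)
}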

Well, it works, I had no doubt.

(base) danielbeach@Daniels-MacBook-Pro goframes % go run frames.go
df:  [769204x13] DataFrame

    ride_id          rideable_type started_at          ended_at            ...
 0: 600CFD130D0FD2A4 electric_bike 2022-06-30 17:27:53 2022-06-30 17:35:15 ...
 1: F5E6B5C1682C6464 electric_bike 2022-06-30 18:39:52 2022-06-30 18:47:28 ...
 2: B6EB6D27BAD771D2 electric_bike 2022-06-30 11:49:25 2022-06-30 12:02:54 ...
 3: C9C320375DE1D5C6 electric_bike 2022-06-30 11:15:25 2022-06-30 11:19:43 ...
 4: 56C055851023BE98 electric_bike 2022-06-29 23:36:50 2022-06-29 23:45:17 ...
 5: B664188E8163D045 electric_bike 2022-06-30 16:42:10 2022-06-30 16:58:22 ...
 6: 338C05A3E90D619B electric_bike 2022-06-30 18:39:07 2022-06-30 19:05:02 ...
 7: C037F5F4107788DE electric_bike 2022-06-30 12:46:14 2022-06-30 14:12:48 ...
 8: C19B08D794D1C89E electric_bike 2022-06-30 11:09:38 2022-06-30 11:10:25 ...
 9: 6E9E3A041C14E960 electric_bike 2022-06-30 11:05:46 2022-06-30 11:09:11 ...
    ...              ...           ...                 ...                 ...
    <string>         <string>      <string>            <string>            ...

Not Showing: start_station_name <string>, start_station_id <string>,
end_station_name <string>, end_station_id <string>, start_lat <float>, start_lng <float>,
end_lat <float>, end_lng <float>, member_casual <string>

Next, we need to apply the filter to find member-only records.

// series.Eq comes from "github.com/go-gota/gota/series", which must be added to the imports.
members_df := df.Filter(dataframe.F{Colname: "member_casual", Comparator: series.Eq, Comparando: "member"})

It’s honestly a little annoying and verbose to filter a DataFrame in Golang, but I guess it is what it is. Generally, it makes sense: first pass in the column, then the comparison operator, and then the value to filter on. It just feels a little awkward.
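To see what that column / comparator / value trio boils down to, here is a minimal sketch of the same equality filter using only the standard library. The helper name filterEq and the three-row sample are made up for illustration; this is not how gota implements Filter, just the idea behind it.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// filterEq is a hypothetical helper (not part of gota). It keeps the header
// row plus every data row whose value in column col equals want.
func filterEq(records [][]string, col, want string) ([][]string, error) {
	if len(records) == 0 {
		return nil, fmt.Errorf("no header row")
	}
	// Look up the column position from the header row.
	idx := -1
	for i, name := range records[0] {
		if name == col {
			idx = i
			break
		}
	}
	if idx < 0 {
		return nil, fmt.Errorf("column %q not found", col)
	}
	out := [][]string{records[0]}
	for _, row := range records[1:] {
		if idx < len(row) && row[idx] == want {
			out = append(out, row)
		}
	}
	return out, nil
}

func main() {
	// Made-up sample in the shape of the trip data, not real rides.
	sample := "ride_id,start_station_name,member_casual\n" +
		"A1,Clark St & Elm St,member\n" +
		"B2,Clark St & Elm St,casual\n" +
		"C3,Wells St & Elm St,member\n"
	records, err := csv.NewReader(strings.NewReader(sample)).ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	members, err := filterEq(records, "member_casual", "member")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(members)-1, "member rows")
}
```

Seen this way, the gota one-liner is not that bad. It is mostly the struct field names that make it feel heavy.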

Next, we need to aggregate by start_station_name and then count the number of ride_ids happening per start_station_name.

station_groups := members_df.GroupBy("start_station_name")
station_rides := station_groups.Aggregation([]dataframe.AggregationType{dataframe.Aggregation_COUNT}, []string{"ride_id"})
sorted := station_rides.Arrange(dataframe.RevSort("ride_id_COUNT"))
fmt.Println("df: ", sorted)

And the result. Note that row 0 is the bucket of rides with a blank start_station_name.

(base) danielbeach@Daniels-MacBook-Pro goframes % go run frames.go
df:  [1084x2] DataFrame

    ride_id_COUNT start_station_name
 0: 46090.000000
 1: 3143.000000   DuSable Lake Shore Dr & North Blvd
 2: 2964.000000   Kingsbury St & Kinzie St
 3: 2811.000000   Streeter Dr & Grand Ave
 4: 2737.000000   Wells St & Concord Ln
 5: 2617.000000   Clark St & Elm St
 6: 2580.000000   Theater on the Lake
 7: 2575.000000   Wells St & Elm St
 8: 2362.000000   Michigan Ave & Oak St
 9: 2300.000000   Broadway & Barry Ave

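Ok, the grouping is pretty normal: station_groups := members_df.GroupBy("start_station_name"). But the aggregation is a little wonky: station_rides := station_groups.Aggregation([]dataframe.AggregationType{dataframe.Aggregation_COUNT}, []string{"ride_id"}). You pass one slice of aggregation types and a parallel slice of column names, and the result column comes back named ride_id_COUNT.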

Again, a little verbose. I mean, even DataFusion with Rust looks a little better, although not much: let df = df.aggregate(vec![col("member_casual")], vec![count(col("ride_id"))])?;
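For a sense of what group, count, and sort descending amount to underneath, here is a minimal standard-library sketch. The names stationCount and countByStation are hypothetical, the input is a made-up slice of station names, and none of this is gota's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// stationCount pairs a station name with its ride count (hypothetical type).
type stationCount struct {
	Station string
	Rides   int
}

// countByStation tallies one ride per entry in stations and returns the
// tallies sorted by count descending, breaking ties by station name.
func countByStation(stations []string) []stationCount {
	counts := make(map[string]int)
	for _, s := range stations {
		counts[s]++
	}
	out := make([]stationCount, 0, len(counts))
	for s, n := range counts {
		out = append(out, stationCount{Station: s, Rides: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Rides != out[j].Rides {
			return out[i].Rides > out[j].Rides
		}
		return out[i].Station < out[j].Station
	})
	return out
}

func main() {
	// Made-up input: one station name per ride.
	rides := []string{"Clark St & Elm St", "Wells St & Elm St", "Clark St & Elm St"}
	for _, sc := range countByStation(rides) {
		fmt.Printf("%d\t%s\n", sc.Rides, sc.Station)
	}
}
```

A map and a sort.Slice call cover the whole aggregation, so the DataFrame API is mostly buying convenience here, not anything you could not write by hand.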

The gota version took about 6.9 seconds (main took 6.897392875s, per the Golang timer). I’m genuinely curious how this stacks up to just plain Pandas with Python.

import pandas as pd
from datetime import datetime

def main():
    t1 = datetime.now()
    df = pd.read_csv("data/202206-divvy-tripdata.csv")
    df = df[df.member_casual == 'member']
    df2 = df.groupby(['start_station_name'])['ride_id'].count().reset_index(name='count') \
                             .sort_values(['count'], ascending=False)
    print(df2)
    t2 = datetime.now()
    print("it took {x} to run".format(x=t2-t1))

if __name__ == '__main__':
    main()

And the performance … about 2.99 seconds (0:00:02.991810).

(base) danielbeach@Daniels-MacBook-Pro goframes % python3 test_with_python.py
                      start_station_name  count
291   DuSable Lake Shore Dr & North Blvd   3143
489             Kingsbury St & Kinzie St   2964
959              Streeter Dr & Grand Ave   2811
1012               Wells St & Concord Ln   2737
189                    Clark St & Elm St   2617
...                                  ...    ...
it took 0:00:02.991810 to run

Interesting. Of course the Python Pandas version is easier to read and write, but it’s also way faster than the Golang DataFrame, roughly 3.0 seconds versus 6.9 seconds on this file, although that probably says more about the gota package than about Golang. Just goes to show that when someone says Golang or x language will always be faster than poor old Python, they are forgetting about implementation.

Musings on DataFrames with Golang

Although I was surprised that Python’s C-based Pandas was faster than Golang, at least the gota implementation, I guess it’s not that surprising after all. A lot of work has gone into Pandas compared to the newish gota with Golang. It is nice to have the option to use DataFrames in a fairly easy manner with Golang, if you’re a poor old peasant who’s at the beck and call of an Over-Lord who demands you use “better” and “fast” languages like Golang.

It looks like there are plenty more features and functions in gota DataFrames with Golang, although based on the verbosity and the performance, I doubt I will ever use it again. I think I would prefer pretty much anything else first: Pandas, Spark, or even DataFusion with Rust. Code available on GitHub.
