
Golang – Useful for everyday Data Engineering?

I periodically try to pick up a new programming language on my journey through Data Engineering life. There are many reasons to do that: personal growth, boredom, seeing what others like, and learning to think differently about my code. Golang has been on my list for at least a year. I don’t hear much about it in the Data Engineering world, at least in the places I haunt like r/dataengineering and LinkedIn.

I know tools like Kubernetes and Docker are written in Go, so it must be powerful and wonderful. But what about Data Engineering work, and everyday Data Engineering work at that? Is Go useful as an everyday tool for simple Data Engineering tasks? Read on, my friend.

Golang for everyday Data Engineering.

Before all the crazy people from the internet start bothering me about my Go code, let me be clear: this is me learning, and I’m learning from a certain viewpoint. I want to know if Golang is reasonably usable for day-to-day Data Engineering tasks, for me. You can decide for yourself what you think about Golang. (All code below is on GitHub.)

For me, and what I do day to day, learning a new language comes down to a few main points.

  • What’s the learning curve like?
  • How hard is it to do simple tasks?
  • How does the language make me think about solving problems?
  • How well does the language seem to fit data pipelines?

Sure, that’s probably not why a lot of folks use Go, but for me, that’s what I would use it for. On the flip side, I always enjoy learning a new language; it helps me be more expressive and a better problem solver in my daily Data Engineering life. Each language has its own nuances and favors certain ideas and approaches. Combined, this sort of learning is good for the soul and mind.

First Example Golang Project – Reading CSV files.

One of the first tasks I always try to complete when working with a new language is processing some CSV files. It’s usually pointless CSV processing, but it’s more for the learning and the experience: to get a feel for the language, in this case Go, and to start to understand how easy or difficult certain things are in each language.

I find it a good indicator if it’s “easy” to process a CSV file. We are simply going to read some CSV files from the free Divvy bike trip data set. We will read each file, and count the number of member records each file contains.

All that being said, let’s take a look at my first attempt at writing Go and then talk about it.

package main

import (
	"encoding/csv"
	"fmt"
	"io/fs"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"strings"
	"time" // needed for time.Now and time.Since in main
)

func read_dir() []fs.FileInfo {
	files, err := ioutil.ReadDir("data")
	if err != nil {
		log.Fatal(err)
	}
	return files
}

func get_paths(files []fs.FileInfo) []string {
	var fs []string
	for _, f := range files {
		thepath, err := filepath.Abs(filepath.Dir(f.Name()))
		if err != nil {
			log.Fatal(err)
		}
		if strings.Contains(f.Name(), ".csv") {
			fs = append(fs, filepath.Join(thepath, "data", f.Name()))
		}
	}
	return fs
}

func read_csv(filePath string) [][]string {
	f, err := os.Open(filePath)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

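	csvReader := csv.NewReader(f)
	records, err := csvReader.ReadAll()
	if err != nil {
		// stop on a malformed file instead of silently returning partial data
		log.Fatal(err)
	}
	return records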
}

func work_records(records [][]string) {
	sum := 0
	for _, r := range records {
		if r[12] == "member" {
			sum += 1
		}
	}
	result := fmt.Sprintf("the file had %v member rides in it", sum)
	fmt.Println(result)
}

func main() {
	start := time.Now()
	fs := read_dir()
	paths := get_paths(fs)
	fmt.Println(paths)
	for _, p := range paths {
		rcrds := read_csv(p)
		work_records(rcrds)
	}
	duration := time.Since(start)
	fmt.Println(duration)
}

>> go run csv.go
[/Users/danielbeach/code/csv_go/data/202004-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202005-divvy-tripdata.csv /U
sers/danielbeach/code/csv_go/data/202006-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202007-divvy-tripdata.csv /User
s/danielbeach/code/csv_go/data/202008-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202009-divvy-tripdata.csv /Users/d
anielbeach/code/csv_go/data/202010-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202011-divvy-tripdata.csv /Users/dani
elbeach/code/csv_go/data/202012-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202101-divvy-tripdata.csv /Users/danielb
each/code/csv_go/data/202102-divvy-tripdata.csv]
the file had 61148 member rides in it
the file had 113365 member rides in it
the file had 188287 member rides in it
the file had 282184 member rides in it
the file had 332700 member rides in it
the file had 302266 member rides in it
the file had 243641 member rides in it
the file had 171617 member rides in it
the file had 101493 member rides in it
the file had 78717 member rides in it
the file had 39491 member rides in it
>> 4.583988333s

Thoughts on my first Golang script.

I honestly loved my first experience writing Go for my silly little CSV pipeline. I learned a lot about Go and got a decent feel for how things fit together. Compared to learning Scala for the first time, for example, Go seemed a little more approachable to me.

Here are some of the first things I noticed …

  • common imports like encoding/csv and strings, even path/filepath, make simple tasks easy.
  • easy to define functions and types.
  • catching errors is easy.
  • syntax is easy and straightforward.
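
To make the points about standard packages and error handling a bit more concrete, here is a rough sketch of listing CSV files where the error is returned to the caller rather than ending the program. The isCSV and csvPaths names are hypothetical, and the sketch assumes Go 1.16 or later for os.ReadDir. It also matches on the file extension with filepath.Ext, which is a swapped-in alternative to the strings.Contains check in the script.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isCSV reports whether name ends in a .csv extension (case-insensitive).
// Unlike a substring check, it does not match names like "notes.csv.bak".
func isCSV(name string) bool {
	return strings.EqualFold(filepath.Ext(name), ".csv")
}

// csvPaths lists the CSV files directly inside dir. It hands any error back
// to the caller, wrapped with context, instead of exiting the program.
func csvPaths(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir) // assumes Go 1.16 or later
	if err != nil {
		return nil, fmt.Errorf("listing %s: %w", dir, err)
	}
	var paths []string
	for _, e := range entries {
		if !e.IsDir() && isCSV(e.Name()) {
			paths = append(paths, filepath.Join(dir, e.Name()))
		}
	}
	return paths, nil
}

func main() {
	paths, err := csvPaths("data")
	if err != nil {
		fmt.Println("could not list files:", err)
		return
	}
	fmt.Println(paths)
}
```

Returning the error keeps the decision with the caller: a small script can still log and exit, while a longer pipeline could skip the directory and carry on.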

Even thinking back to my first time writing Scala, this Go script was just more straightforward to write when all I wanted was to process a CSV file. It might not seem like much, but having a package like encoding/csv where I can simply and easily load a CSV file …

csvReader := csv.NewReader(f)
records, err := csvReader.ReadAll()
return records

It’s refreshing, and to me it’s a good sign that Go can solve simple tasks in a simple way, making it a decent choice for everyday Data Engineering. Another sign of Go’s usefulness in common DE tasks was the nice strings package …

if strings.Contains(f.Name(), ".csv")

It’s the simple things in life that make things easier. I’m a fan of Golang. But, out of curiosity what would this code look like in Python? I’m mostly curious about the performance.

import csv
from glob import glob
from datetime import datetime

def get_files(dir: str = 'data') -> list:
    files = glob(f'{dir}/*.csv')
    return files

def read_csv(file: str) -> list:
    with open(file, "r") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip header
        rows = [row for row in reader]
    return rows


def work_records(records: list) -> None:
    total = 0
    for record in records:
        if 'member' in record[12]:
            total += 1
    print("the file had {v} member rides in it".format(v=str(total)))


def main():
    t1 = datetime.now()
    files = get_files()
    for file in files:
        records = read_csv(file)
        work_records(records)
    t2 = datetime.now()
    print(f"{t2}")  # prints the end timestamp; print(t2 - t1) would give the elapsed time

main()
 >> python3 main.py
the file had 171617 member rides in it
the file had 101493 member rides in it
the file had 61148 member rides in it
the file had 302266 member rides in it
the file had 78717 member rides in it
the file had 39491 member rides in it
the file had 282184 member rides in it
the file had 188287 member rides in it
the file had 113365 member rides in it
the file had 332700 member rides in it
the file had 243641 member rides in it
 13:57:54.929582

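Sure, that Python code is a little cleaner. But look closely at the timing: the script prints t2, the wall-clock time when it finished (13:57:54), not the elapsed time t2 - t1. So this run gives no Python number to set against Go’s 4.583 seconds. I would still expect Go to be faster, but that is an expectation, not something this output shows.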

What gets me excited about Go is not only the speed, but also that the Go script itself was easy to write, even for a beginner!

Musings on Golang as a day-to-day Data Engineering tool.

I’m excited to continue to learn Golang; it seems like a fun tool to use and write. I’m looking forward to testing some more code with Go, like maybe doing some more HTTP requests and file manipulations. I’m curious about its integrations with cloud tools like AWS, its concurrency options, and just generally how well it will continue to perform and how easy it will be to use.
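
On the concurrency point, here is a rough sketch of where that could go for this kind of per-file counting: one goroutine per input, with a sync.WaitGroup to wait for them all. Everything here is an assumption for illustration, including the processAll and countMembers names and the made-up in-memory CSV text standing in for files. I have not benchmarked it against the sequential script.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
	"sync"
)

// processAll runs work on every item in its own goroutine and returns the
// results in input order. Hypothetical helper for illustration.
func processAll(items []string, work func(string) int) []int {
	results := make([]int, len(items))
	var wg sync.WaitGroup
	for i, item := range items {
		wg.Add(1)
		go func(i int, item string) {
			defer wg.Done()
			results[i] = work(item) // each goroutine writes only its own slot
		}(i, item)
	}
	wg.Wait() // block until every goroutine has finished
	return results
}

// countMembers counts rows in CSV text whose second column is "member".
// It returns -1 if the text cannot be parsed.
func countMembers(text string) int {
	records, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return -1
	}
	count := 0
	for _, r := range records {
		if len(r) > 1 && r[1] == "member" {
			count++
		}
	}
	return count
}

func main() {
	// Made-up CSV contents standing in for two monthly files.
	files := []string{
		"ride_id,member_casual\na1,member\na2,casual\n",
		"ride_id,member_casual\nb1,member\nb2,member\n",
	}
	fmt.Println(processAll(files, countMembers)) // prints [1 2]
}
```

Writing each result into its own slice index avoids needing a mutex, since no two goroutines touch the same element.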

At the end of the day, learning Go is going to be a good exercise in keeping myself moving forward, thinking in new ways, and solving problems with a new tool that will keep me agile and open-minded. I like the syntax and data structures so far; they’re easy to understand and use, and I feel the learning curve is less than what I experienced with Scala.

Golang gets a big thumbs up from me. You will see more Go in my blogs in the future!
