My Journey from Python to Scala – Part 1

UPDATE: If you want to know how my Scala SHOULD have been written, check out this link!

I feel like a frontiersman heading west, into the unknown. I’ve been successful using Python as a Data Engineer for some time, processing terabytes of data with what “real” programmers sneer at as barely even a real language. Whatever. But some of my favorite tools, like Spark, are written in Scala, and it’s on the rise, so I should probably join the lemmings in their mad dash. If for no other reason than to expand my horizons.

What to Learn First with Scala.

I asked an acquaintance who is one of the best developers ever what he recommended for starting out. Scala for the Impatient was one book that came up, and it’s been great so far. Honestly I’m not that worried about picking up a different syntax, but Scala is notorious for having a steep learning curve. Beyond functional programming it just seems like you have to think differently about a problem and become used to solving it differently in Scala.

I’m one of those people who just isn’t going to sit around watching hours of tutorials. I’m just going to start solving, with Scala, the common problems I find myself solving in Python. That is how I will learn. Hopefully the books I read will teach me how to write Scala better, along with the style and idioms that are commonplace among Scala developers.

So as a data engineer, what are some of the very basic problems or things that I do all the time?

  • download files
  • download data from APIs
  • manipulate files
  • read and write files.
  • push and pull data from SQL databases.

Sounds like I should just start with some basic file manipulation. I’m not really going to cover much about the different syntax of Scala. I’m just going to do some common tasks and share what it’s like to do them in Scala after working in Python for years. I’m going to make up a simple ETL task.

Say we have to process a number of csv files, in this case the free Divvy bike trip files, parsing each file and breaking all the records into groups by from_station_id, then storing these records as separate files for use in another data pipeline. These are my first attempts at Scala, so I’m sure nothing I’m doing will be right, but it will be a good window into what you could expect moving from Python to Scala.

So to summarize: how can we take input CSV files, group the records inside each large CSV file, and write out smaller CSV files that each contain one group?

This is a sample of records from the Divvy csv files.

So, the first thing I would do in Python is just use the csv module or Pandas, which would make this task simple. But I want to learn Scala, so here it goes. My first attempt was just to use normal Scala functionality to read and print the lines of the csv file, while also trying to read the columns so I can later use the column I need, from_station_id, to group the records.

import scala.io.Source

object ReadCSV {
  def main(args: Array[String]): Unit = {
    val csv_file = Source.fromFile("Divvy_Trips_2019_Q4.csv")
    // drop(1) skips the header row
    for (line <- csv_file.getLines.drop(1)){
      val cols = line.split(",")
      println(cols(0))
    }
    csv_file.close() // release the file handle once all lines are read
  }
}

The first thing I had to wrap my head around is that everything is an object in Scala. Also, it isn’t clear to me when I should embrace OOP in the form of a class, or the more functional style Scala is known for, using an object as seen above. I’m sure both work nicely together, but for now I think I will just try to write an object that has one or more functions that do the CSV file processing I need. I can learn the error of my ways later.

Right away I noticed the import statements won’t be that new or unique coming from Python, I’m sure there will be things to learn, but it doesn’t appear to be all that different. Also it appears objects/functions can just be defined with a name and {} to delineate code blocks. Not a big deal.

object ReadCSV {.....}

I’m used to type hinting my code in Python, so writing my function wasn’t that different either. I think this is one thing I’m going to like about Scala. Having to think through the data structures going both IN and OUT of a function is a good practice, but Python is too forgiving in this respect. As you can see below, the main function takes an Array of Strings and returns nothing, or VOID, otherwise known in Scala as Unit.

def main(args: Array[String]): Unit =

Defining a main function, specifying the input arguments and type, and return type, Unit.
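For comparison, here is roughly how that same signature looks with Python type hints. This is only a sketch: the body of main is made up for illustration, and unlike Scala, Python does not enforce the hints when the code runs.

```python
from typing import List


def main(args: List[str]) -> None:
    """Rough Python counterpart of `def main(args: Array[String]): Unit`."""
    # The hints document the IN and OUT types; a checker such as mypy
    # verifies them, but the interpreter itself ignores them.
    for arg in args:
        print(arg)


if __name__ == "__main__":
    # Illustrative call only, the file name is just passed through and printed.
    main(["Divvy_Trips_2019_Q4.csv"])
```

The big difference is that the Scala compiler refuses to build the program if the types don’t line up, while Python will happily run it.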

Reading the file into a value (immutable) is straightforward.

val csv_file = Source.fromFile("Divvy_Trips_2019_Q4.csv")

The for loop in Scala is pretty straightforward. The syntax is for (x <- collection) {}. So, for each element in some collection, do something. drop(1) just gets me past the header row.

for (line <- csv_file.getLines.drop(1))

In my case I wanted to split the columns out of the line/row.

val cols = line.split(",")

The value I am looking for lives in cols(5), which is indexed from zero of course. This didn’t work. Some of the values in the CSV file are double-quoted and contain commas, so my split doesn’t do the trick for some lines. Being a newbie to Scala, I figured the only thing to do was to turn to Google, and that surprisingly turned up fewer options than I would have thought. It basically appears most people would resort to using Java packages to do this. Seriously? Blah.
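The same trap exists in Python, so here is a small sketch of what goes wrong with a plain split and why a real CSV parser fixes it. The sample row is made up for illustration and is not copied from the Divvy file.

```python
import csv
import io

# Made-up data: the second field is quoted because it contains a comma.
sample = 'trip_id,tripduration,from_station_id\n1,"1,200.0",35\n'

lines = sample.splitlines()

# Naive split: the comma inside the quotes creates an extra column,
# so every index after it is shifted by one.
naive_cols = lines[1].split(",")
print(naive_cols)

# A CSV parser honours the quotes and keeps the field whole.
parsed_rows = list(csv.reader(io.StringIO(sample)))
print(parsed_rows[1])
```

With the naive split the station id ends up one position further right than expected, which is exactly the kind of silent misalignment that makes indexing into cols unreliable.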

I did turn up one Scala package that seemed to be fairly popular, scala-csv (the com.github.tototoshi.csv package). Found here. It worked fine for reading the CSV and dealt with the quoted columns seamlessly for me.

This is literally the first Scala script I’ve ever written, so it’s probably completely wrong, but it did the trick for me. It reads in the main CSV file, iterates over all the rows, assigns the rows to groups based on their station id, then writes those groups out to separate files. I was also able to immediately break the only rule I was told (only use immutable data types), so that’s fun.

import com.github.tototoshi.csv._
import scala.collection.mutable.Map
import scala.collection.mutable.ListBuffer

trait StationCollection {
  val station_collection = Map.empty[String, ListBuffer[Seq[String]]]
}

object ReadCSV extends StationCollection {

  def main(args: Array[String]): Unit = {
    val csv_rows = csv_iterator()
    csv_rows foreach assigner
    write_records(station_collection)
  }

  def csv_iterator(): Iterator[Seq[String]] = {
    val reader = CSVReader.open("Divvy_Trips_2019_Q4.csv")
    val csv_rows = reader.iterator
    csv_rows.next() //get past header
    csv_rows
  }

  def assigner(row_list: Seq[String]): Unit = {
    val station_id = row_list(5)
    if (station_collection.contains(station_id)){
      station_collection(station_id) += row_list
    }
    else {
      station_collection += (station_id -> ListBuffer(row_list))
    }
  }

  def write_records(records_collection: Map[String, ListBuffer[Seq[String]]]): Unit = {
    for ((k,v) <- records_collection) {
      val writer = CSVWriter.open(s"$k.csv")
      for (value <- v) writer.writeRow(value)
      writer.close() // flush and release each output file
    }
  }
}
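For anyone reading along from the Python side, the same group-then-write logic can be sketched with nothing but the standard library. This is a rough mirror of the Scala above, not a tuned implementation; group_rows and write_groups are names I made up for the sketch.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List

Row = List[str]


def group_rows(rows: Iterable[Row], key_index: int) -> Dict[str, List[Row]]:
    """Bucket rows by the value found at key_index (the mutable-map approach)."""
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return groups


def write_groups(groups: Dict[str, List[Row]], out_dir: Path) -> None:
    """Write one <key>.csv file per group into out_dir."""
    for key, rows in groups.items():
        # The with-block closes each file as soon as its rows are written.
        with open(out_dir / f"{key}.csv", "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
```

To use it on the real file you would open the CSV, wrap it in csv.reader, call next() once to skip the header, and pass the reader to group_rows with key_index=5.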

Thoughts on my first Scala script.

Not going to lie, the first little while trying to figure out what SBT is, how to compile a Scala project, and what a Scala project directory should look like was a little annoying. Part of writing Python all day long is getting spoiled, I suppose. That’s probably no surprise; of course learning to develop in a new language is going to take a little effort.

What about Syntax moving from Python to Scala?

This was not as big a deal as I thought it would be. After an hour had gone by, I had kinda moved on already. This is especially true if you’re a Python developer who was/is using type hinting. Defining objects and traits in Scala might take a little bit of work if you haven’t been writing OOP code in Python. If you have, you will easily grasp at least the basics of what’s going on in Scala programs.

What is the most difficult part of moving from Python to Scala?

The most difficult part of moving from Python to Scala in this instance was the total mind-shift in how I think about writing code that Scala forced upon me. It also relates to what I love most about Scala. It has a feeling of being less forgiving and more verbose than Python. This is of course well known and talked about. It requires you to think more about what you’re going to do before you do it.

Having a complete picture and grasp of data types, and of the methods and interactions those data types provide, seems to be extremely important in Scala. Do you use a list, map, blah blah blah? What functionality do those data types offer? Using immutable data structures mixed with this new thought paradigm also led me to what I find attractive about Scala.

What I love the most about moving from Python to Scala.

I love the obvious, verbose, and straightforward way in which Scala forces you to think and write code. I can see why it is the scalable language. It’s English-like enough to understand, yet explicit enough to be clear. I love the way a for loop and what to do in the loop is a one-liner. Unpack a sequence of values and do the same thing to those values.

for (value <- v) writer.writeRow(value)

Or again, take a sequence and for each element apply a function.

csv_rows foreach assigner

I know programming style like this can be achieved in Python, but it is not the norm. Scala however is built this way, it’s in its blood.
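To show what I mean, here is a minimal Python sketch of the same "for each element apply a function" style. The foreach helper is hypothetical, something I made up for illustration; it is not part of the standard library, and most Python code would just write the loop inline.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def foreach(items: Iterable[T], action: Callable[[T], None]) -> None:
    """Hypothetical helper: apply action to every item for its side effect."""
    for item in items:
        action(item)


# Stand-in for `csv_rows foreach assigner`: collect each value into a list.
seen: List[str] = []
foreach(["35", "91", "35"], seen.append)
print(seen)  # ['35', '91', '35']
```

It works, but you have to build the helper yourself, whereas in Scala foreach comes with every collection.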

By the way, there are about 704,054 rows in this CSV file I’m processing. Both my Scala and my Python code output the full list of files without trouble.

My not perfect Python code… but short and to the point…

import pandas as pd
from datetime import datetime

t1 = datetime.now()
df = pd.read_csv('Divvy_Trips_2019_Q4.csv')
df.sort_values(by='from_station_id', inplace=True)

stations = df['from_station_id'].unique().tolist()

for station in stations:
    to_file_df = df.loc[df.from_station_id == station]
    to_file_df.to_csv(f'{station}.csv')
t2 = datetime.now()
print(t2- t1)

That took all of 8 seconds for the Python run. The Scala code took less than 6 seconds.

There is quite a big difference between the Scala code and the Python code, but to be fair I will probably try to rewrite the Scala code to try some of what I’m doing in Pandas. Just sorting a sequence, etc. Even though my most likely inefficient Scala code is 4 times as long as the Python code, it is still faster, no surprise there.
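The sort-then-split idea is not tied to Pandas, by the way. Here is a minimal sketch of it in plain Python, assuming each row is a list of strings; group_sorted is a name I made up, and I haven’t timed this against either version above.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, List

Row = List[str]


def group_sorted(rows: Iterable[Row], key_index: int) -> Dict[str, List[Row]]:
    """Sort rows by one column, then split the sorted run into groups."""
    key = itemgetter(key_index)
    # groupby only merges adjacent rows with equal keys, hence the sort first.
    ordered = sorted(rows, key=key)
    return {k: list(group) for k, group in groupby(ordered, key=key)}
```

Because sorted is stable, rows inside each group keep the order they had in the input, and no mutable buffers are needed along the way.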

Conclusion.

Scala took me an hour or two to wrap my mind around. It seems to enforce a different programming style that is much less forgiving and more verbose than Python. But that is what I’m enjoying about learning Scala: it seems to drive a deeper understanding of data types and what you are programming. I’m sure this will only make me a better writer of even Python code. Scala will be a learning curve, but those coming from Python who are familiar with basic OOP and type hinting/data structures will probably be able to jump on the bandwagon quicker than those who are not.