My Journey from Python to Scala – Part Deux

In Part 1 of my laborious journey from Python to Scala, I did some work with file operations, CSV files, and messing with the data. It took me a little longer than I expected to wrap my head around the Scala functional/object/immutable approach to software design. But in the end it felt satisfying, and I’m starting to be a convert. Scala makes you think a little harder than Python, is less forgiving, and requires more of you as the developer. In part deux, I figured the next topic to grapple with would be some simple retrieval of remote files and writing those files to disk. Also, I wanted to take a crack at Classes in Scala.

My data source this time will be the public and open source Gutenberg project. I’ve already written another version of this code in Python, so I figured in my journey to understand Scala, re-writing something would be a great option. All the books are free of copyright and available through numerous FTP mirrors. The idea will be to open a CSV file and iterate through the records, searching for these files on the FTP server and downloading them locally.

Here is the code first, then I will talk a little about my experience.

Scala code to download files from FTP server.

import java.io.{BufferedOutputStream, File, FileOutputStream}
import com.github.tototoshi.csv._
import org.apache.commons.net.ftp.{FTPClient, FTPClientConfig, FTPFile}


class GutenbergFTP(host_name: String) {
  val ftp = new FTPClient
  val config = new FTPClientConfig

  def setup_ftp()= {
    ftp.configure(config)
    ftp.connect(host_name)
    ftp.enterLocalPassiveMode()
    ftp.login("anonymous", "")
    val reply = ftp.getReplyCode
    println(s"Reply code was $reply")
  }

  def list_ftp_files(): Array[FTPFile] = {
    // Keep only regular files whose name contains ".txt"
    ftp.listFiles(".").filter(f => f.isFile && f.getName.contains(".txt"))
  }

  def write_ftp_file(file: FTPFile)= {
    val output_file = new File(file.getName)
    val out_stream = new BufferedOutputStream(new FileOutputStream(output_file))
    // Close the stream even if the download throws
    try {
      ftp.retrieveFile(file.getName, out_stream)
    } finally {
      out_stream.close()
    }
  }
}

class sGutenberg extends GutenbergFTP(host_name = "aleph.gutenberg.org") {
  val input_csv: String = "input.csv"

  def csv_iterator(): Iterator[Seq[String]] = {
    val reader = CSVReader.open(input_csv)
    val csv_rows = reader.iterator
    csv_rows.next() //get past header
    csv_rows
  }

  def get_file_location(file_number: String): String = {
    // Files are structured into directories by splitting each number, up UNTIL the last number. Then a folder
    // named with the file number. So if a file number is 418, it is located at 4/1/418. Below 10 is just 0/filenumber.
    val leading_digits: String = file_number.slice(0, file_number.length - 1)
    // A single-digit file number has no leading digits, so it lives under "0"
    val folder_numbers: String = if (leading_digits.isEmpty) "0" else leading_digits.toList.mkString("/")
    s"$folder_numbers/$file_number"
  }

}

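object gutenberg {
  def main(args: Array[String]): Unit = {
    val sG = new sGutenberg
    sG.setup_ftp()
    // Remember the starting directory so every relative path resolves from the same place
    val start_directory: String = sG.ftp.printWorkingDirectory()
    val csv_rows = sG.csv_iterator()
    for (row <- csv_rows) {
      val file_number: String = row(1)
      val remote_location: String = sG.get_file_location(file_number)
      println(s"Working on directory $remote_location")
      sG.ftp.changeWorkingDirectory(remote_location)
      val files = sG.list_ftp_files()
      for (f <- files) {
        val file_name: String = f.getName
        println(s"Downloading file $file_name")
        sG.write_ftp_file(f)
      }
      // changeToParentDirectory() only climbs one level, so go back to the start instead
      sG.ftp.changeWorkingDirectory(start_directory)
    }
    sG.ftp.logout()
    sG.ftp.disconnect()
  }
}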
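
The directory rule inside get_file_location is easy to check on its own, without touching the FTP server. Here is a minimal sketch of that rule as a standalone function. The names GutenbergPaths and remotePath are made up for illustration, and the single-digit case assumes the rule stated in the code (below 10 lives under 0/).

```scala
object GutenbergPaths {
  // Hypothetical helper: builds the relative directory for a Gutenberg file number.
  // Every digit except the last becomes a folder, followed by a folder named after the number.
  def remotePath(fileNumber: String): String = {
    val leadingDigits = fileNumber.dropRight(1)
    // Single-digit numbers have no leading digits; by the stated rule they live under "0".
    val folders = if (leadingDigits.isEmpty) "0" else leadingDigits.toList.mkString("/")
    s"$folders/$fileNumber"
  }

  def main(args: Array[String]): Unit = {
    println(remotePath("418"))   // 4/1/418
    println(remotePath("7"))     // 0/7
    println(remotePath("12345")) // 1/2/3/4/12345
  }
}
```

Pulling the rule out like this means it can be tested with a handful of file numbers before any download starts.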

My Struggles with Scala.

I would have to say, the second time around writing some Scala felt a little more familiar, probably because of the Classes. But I still struggle with the design/implementation of a Scala program. I’m used to writing OOP with Python, so I did feel like I was on more familiar ground when using two Classes to describe what I wanted to do.

Why do I struggle with implementation, you ask? Because there are many ways to write code in Scala. Here is an Alpakka sample of downloading all files from an FTP server. That code just flows and seems to be how Scala was made to be written/implemented. Yet I could never have done that myself the first time approaching the problem; the code above was my first shot without any cheating.

val ftpSettings = FtpSettings(InetAddress.getByName("localhost")).withPort(port)

val fetchedFiles: Future[immutable.Seq[(String, IOResult)]] =
  Ftp
    .ls("/", ftpSettings)                                    //: FtpFile (1)
    .filter(ftpFile => ftpFile.isFile)                       //: FtpFile (2)
    .mapAsyncUnordered(parallelism = 5) { ftpFile =>         // (3)
      val localPath = targetDir.resolve("." + ftpFile.path)
      val fetchFile: Future[IOResult] = Ftp
        .fromPath(ftpFile.path, ftpSettings)
        .runWith(FileIO.toPath(localPath))                   // (4)
      fetchFile.map { ioResult =>                            // (5)
        (ftpFile.path, ioResult)
      }
    }                                                        //: (String, IOResult)
    .runWith(Sink.seq)                                       // (6)

When I look at my code compared to the sample, which in essence does pretty much the same thing, they are just totally different styles. Is that bad? Probably. Scala seems to be designed to be written concisely in a scalable manner, stringing methods together one after the other. It’s just a different way to approach a problem that I’m not used to.
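
To make the “stringing methods together” style concrete without pulling in Alpakka, here is a small sketch that uses only the standard collections. It takes a directory listing and produces the names to download in one chained expression, roughly what my list_ftp_files filter does. The Entry case class and the sample listing are made up for illustration; they stand in for FTPFile and a real server response.

```scala
object ChainedStyle {
  // Hypothetical stand-in for an FTP directory entry (not the real FTPFile class).
  final case class Entry(name: String, isFile: Boolean)

  // One chained expression: keep regular files, take their names, keep the .txt ones.
  def textFileNames(listing: Seq[Entry]): Seq[String] =
    listing
      .filter(_.isFile)
      .map(_.name)
      .filter(_.contains(".txt"))

  def main(args: Array[String]): Unit = {
    // Made-up listing for the sketch
    val listing = Seq(
      Entry("418.txt", isFile = true),
      Entry("old", isFile = false),
      Entry("418.zip", isFile = true),
      Entry("418-0.txt", isFile = true)
    )
    textFileNames(listing).foreach(println)
  }
}
```

Each step returns a new collection, so there is no loop variable and nothing gets mutated along the way.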

What I like about Scala Classes …. more than Python Classes.

One of the pieces that I love about Scala Classes as compared to most Python Classes I write, is that they are more concise.

  • I like that I don’t have to use self in my Scala classes for member properties. I find it obvious that if I’m writing a Class that is what I want to do.
  • I like not having to write an __init__(self) in Scala, but can pass constructor parameters directly into my Scala Class. GutenbergFTP(host_name = "aleph.gutenberg.org")
  • I enjoy the immutability of using val everywhere and knowing I can’t overwrite these values and screw something up.
  • I like how Scala Classes can simply extend each other and I don’t have to do a super confusing super init.
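
The points in that list fit in a few lines. This is a stripped-down sketch, not my real classes: FtpSource and GutenbergSource are hypothetical names, the default port of 21 is an assumption, and nothing here talks to a server.

```scala
object ClassSketch {
  // Constructor parameters go right in the class header: no __init__ and no self.
  class FtpSource(val hostName: String, val port: Int = 21) {
    // A val member cannot be reassigned after construction.
    val address: String = s"$hostName:$port"

    def describe(): String = s"FTP source at $address"
  }

  // Extending passes the parent's constructor arguments inline; no explicit super call.
  class GutenbergSource extends FtpSource(hostName = "aleph.gutenberg.org") {
    val inputCsv: String = "input.csv"
  }

  def main(args: Array[String]): Unit = {
    val source = new GutenbergSource
    println(source.describe())
    // source.inputCsv = "other.csv" // would not compile: reassignment to val
  }
}
```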

Here is a sample snippet from my Python Gutenberg class just for reference…

import csv
import os


class Gutenberg():
    def __init__(self):
        self.request_url = ''
        self.csv_file_path = 'ingest_file/Gutenberg_files.csv'
        self.csv_data = []
        self.cwd = os.getcwd()
        self.ftp_uri = 'aleph.gutenberg.org'
        self.ftp_object = None
        self.download_uris = []
        self.file_download_locattion = 'downloads/'

    def load_csv_file(self):
        absolute_path = self.cwd
        file_path = self.csv_file_path
        with open(f'{absolute_path}/{file_path}', 'r') as file:
            data = csv.reader(file, delimiter=',')
            next(data, None)
            for row in data:
                self.csv_data.append({"author": row[0],
                                      "FileNumber": row[1],
                                      "Title": row[2]})

What else did I struggle with? Writing a file in Scala wasn’t all that obvious, and neither was working with Streams. I think there are just multiple ways to write files, and combining that with a Stream for the first time was a little bit cumbersome for me.
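
For what it’s worth, here is one of those multiple ways, sketched with only the standard library. It assumes Scala 2.13 or later for scala.util.Using, which closes the streams for you. The StreamToFile object and the save method are hypothetical names. Any InputStream works as the source, so in this example a byte array stands in for the remote file.

```scala
import java.io.{BufferedOutputStream, ByteArrayInputStream, FileOutputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.util.{Try, Using}

object StreamToFile {
  // Copies everything from `in` to `target` and returns the number of bytes written.
  // Using.Manager closes both streams, even if the copy throws.
  def save(in: InputStream, target: Path): Try[Long] =
    Using.Manager { use =>
      val source = use(in)
      val out = use(new BufferedOutputStream(new FileOutputStream(target.toFile)))
      val buffer = new Array[Byte](8192)
      var total = 0L
      var read = source.read(buffer)
      while (read != -1) {
        out.write(buffer, 0, read)
        total += read
        read = source.read(buffer)
      }
      total
    }

  def main(args: Array[String]): Unit = {
    val target = Files.createTempFile("gutenberg-sketch", ".txt")
    val bytes = "Call me Ishmael.".getBytes(StandardCharsets.UTF_8)
    val result = save(new ByteArrayInputStream(bytes), target)
    println(s"Result $result for $target")
    Files.delete(target)
  }
}
```

The result comes back as a Try, so a failed write shows up as a value you can inspect rather than an exception that escapes.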

Conclusion

Overall I enjoyed my second foray into Scala. I’m looking forward to working more with reading/writing files, and doing some HTTP stuff in the future. I think what I realized this time is how much more “advanced”, for lack of a better term, Scala is than Python. I can see why people complain about the learning curve. I still have very little concept of how to structure my Scala programs. Even as I’m writing them, I can feel how cumbersome they are and how much I have to learn. Scala feels powerful and clean, but I can tell it will take a few years of practicing off and on before I can ever say I can “write Scala.”