Getting My Spark On

There sure has been a lot of kerfuffle around Spark lately. Spark this, Spark that, Spark is the best thing ever, and so on and so forth. I recently had some small exposure to PySpark while working on a Glue project; at the time a lot of the functions reminded me of Pandas, and I've been trying to find time to explore Spark a little more.

What better way to try out Spark than with Docker? My experience with Docker has been limited, but it seemed like a great tool, especially when playing around with new technology that you know nothing about.

docker pull ubuntu
docker run -it [image-id] /bin/bash
apt-get update
apt-get install openjdk-8-jdk
apt-get install python2.7 python-pip
apt-get install wget

Basically, pull down an Ubuntu image from Docker Hub, run it, and open its bash command line.
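A side note on Docker: docker run creates a brand new container each time, so if you exit and later run the image again, anything installed inside is gone. A rough sketch of how to keep coming back to the same container (the spark-sandbox name here is just an example):

docker run -it --name spark-sandbox ubuntu /bin/bash   # give the container a name up front
docker ps -a                                           # list containers, including stopped ones
docker start -ai spark-sandbox                         # restart and reattach to it later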

The first prerequisite for installing Spark is Java. Don't make the mistake of grabbing the latest version of Java; a lot of tutorials tell you to use …

apt-get install default-jdk

This seems to be incorrect: when I did this, Spark installed fine, but I got strange errors when trying to submit a job. After a little research, it seems Java 8 is the more stable choice with Spark.
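If you want to double-check which JDK the container actually ended up with before going any further, a quick sanity check along these lines should do it:

java -version                 # should report a 1.8.x JDK after installing openjdk-8-jdk
readlink -f $(which java)     # shows which JDK the java binary resolves to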

Installing wget above lets us pull down the Spark release. The following commands download Spark, unpack it, and create a spark symlink.

wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar xvzf spark-2.3.0-bin-hadoop2.7.tgz
ln -s spark-2.3.0-bin-hadoop2.7 spark
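Before going any further, it's worth a quick check that Spark and the JDK can talk to each other; asking spark-submit for its version at least proves the JVM starts:

spark-2.3.0-bin-hadoop2.7/bin/spark-submit --version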

Next I needed to install Vim. I'm not its biggest fan, but it makes it easy to modify files from the command line.

apt-get install vim
vi ~/.bashrc

The vi command opens the .bashrc file so we can insert the following lines at the bottom. Point SPARK_HOME at wherever the spark symlink from the previous step actually lives.

export SPARK_HOME=/LinuxHint/spark
export PATH=$SPARK_HOME/bin:$PATH
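The new variables only show up in a fresh shell, or after re-sourcing the file. A quick way to confirm the PATH change took (assuming SPARK_HOME points at the right place):

source ~/.bashrc
echo $SPARK_HOME
which spark-submit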

Alright, ready to go now. So, this wasn't so obvious at first, but external PySpark scripts have to be submitted via…

spark/bin/spark-submit

Next, I need something easy to try out for my first script. I wrote this little piece to download a txt file that I could mess with. Of course the great St. Augustine would make for an interesting read.

# Python 2 (hence urllib2), since that's what we installed above
import urllib2

url = 'https://www.ccel.org/ccel/schaff/npnf101.txt'
response = urllib2.urlopen(url)
with open('StAugustine.txt', 'w') as f:
    f.write(response.read())

This downloaded a text file of Confessions.

from pyspark import SparkContext

# local SparkContext; pyFiles would list any extra .py files to ship to the workers
sc = SparkContext("local", "App Name", pyFiles=[])

# classic word count: split each line on spaces, emit (word, 1), sum the 1s per word
text_file = sc.textFile("./spark/StAugustine.txt")
counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# saveAsTextFile writes a directory of part files, not a single file
counts.saveAsTextFile("./spark/counts.txt")

That's basically the simple word count example from the Apache Spark website. Easy enough, so just submit the file…

spark/bin/spark-submit sparktest.py

Run that and out pops a counts.txt directory with the results split across part files.
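Since saveAsTextFile writes a directory rather than a single file, the (word, count) pairs end up inside it in part files. Something along these lines will show them (the exact part file names can vary with the number of partitions):

ls spark/counts.txt/
head -n 20 spark/counts.txt/part-00000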

So, that's the extent of my Spark exploring; gotta start somewhere, right? The documentation seems good, and I know next to nothing about configuration, RDDs, etc., but the first step is getting it running and being able to submit a job, right? So till next time.