Create Your Very Own Apache Spark/Hadoop Cluster….then do something with it?

I’ve never seen so many posts about Apache Spark before; I’m not sure if it’s the 3.0 release, or because the world is burning down. I’ve written about Spark a few times, even 2 years ago, but it still seems to be steadily increasing in popularity, albeit still missing from many companies’ tech stacks. With the continued rise of AWS Glue and GCP DataProc, running Spark scripts without managing a cluster has never been easier. Granted, most people never work on datasets large enough to warrant the use of Spark… and Pandas works fine for them. Also, very annoyingly, most videos/posts on Spark are either about shuffling/joins and other internals that make no sense to someone who doesn’t use Spark on a daily basis, or they are so “Hello World” as to be useless in the real world. Let’s solve that problem by setting up our own Spark/Hadoop cluster and doing some “real” data things with it.

Firstly, I used a 3-node setup on Linode, of course. Three nodes with 2GB of RAM and 1 CPU each will cost about $30 per month. All the code for this is available on GitHub.

Next, you probably need to set up ufw on all your nodes, blocking all traffic except between your node IPs, plus your personal IP address to the master (you can just ssh to the other nodes from the master). That’s just a friendly reminder to keep your Hadoop cluster from getting hacked… like what happened to me the first time.
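
Something like the following is roughly what I mean, run on each node; the IPs are the example addresses used in /etc/hosts below, so swap in your own master, worker, and home addresses (and the personal-IP rule only really needs to go on the master).

sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 173.255.199.161   # master
sudo ufw allow from 198.58.124.54     # worker1
sudo ufw allow from <worker2-ip>      # worker2
sudo ufw allow from <your-home-ip>    # so you can still ssh in and reach the Spark UI
sudo ufw enable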

Step-by-step: installing the Apache Spark and Hadoop cluster.

  1. Install Java, of course: sudo apt-get -y install openjdk-8-jdk-headless default-jre
    1. This must be done on all nodes.
  2. Install Scala: sudo apt install scala
    1. I’m not sure this has to be done on all nodes, but I did it on all of them anyway.
  3. Set up password-less ssh between all nodes (there’s a quick sketch of this right after the list).
    1. sudo apt install openssh-server openssh-client
    2. Create the keys: ssh-keygen -t rsa -P ""
    3. Append the master’s .pub key to each worker node’s ~/.ssh/authorized_keys file.
    4. sudo vim /etc/hosts
      1. Add a line for the master and each worker with its name and IP address… something like the following.
      2. 173.255.199.161 master and maybe 198.58.124.54 worker1
  4. You’ll probably want to install Spark on the master and all worker nodes…. wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
    1. Unpack it: tar xvf spark-2.4.3-bin-hadoop2.7.tgz
    2. Move it: sudo mv spark-2.4.3-bin-hadoop2.7/ /usr/local/spark
    3. We need to set the Spark/Java paths etc.… sudo vim ~/.bashrc
      1. export PATH=/usr/local/spark/bin:$PATH
      2. export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
      3. export PATH=$JAVA_HOME/bin:$PATH
      4. source ~/.bashrc
      5. vim /usr/local/spark/conf/spark-env.sh (copy spark-env.sh.template to spark-env.sh first if it doesn’t exist yet)
        1. export SPARK_MASTER_HOST=<master-ip> <- fill in the IP address of your master node here.
        2. export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
        3. Modify the ridiculously named file: vim /usr/local/spark/conf/slaves
          1. Add the names of the master and workers from the /etc/hosts file above.
      6. Finally, start Spark…. sh /usr/local/spark/sbin/start-all.sh
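
For the password-less ssh and /etc/hosts pieces in step 3, here is a rough sketch of what I mean; the hostnames and IPs are the example values from above, and the worker2 address is a placeholder for your own.

# /etc/hosts (the same entries on the master and every worker)
173.255.199.161 master
198.58.124.54 worker1
<worker2-ip> worker2

# on the master: generate a key and push it to each worker
ssh-keygen -t rsa -P ""
ssh-copy-id worker1
ssh-copy-id worker2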

After all that, you should be able to get to the Spark master UI in a browser (it runs on port 8080 of the master node by default).
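
A quick way to confirm the workers actually registered with the master is to run one of the examples that ship with Spark against the cluster. This is a minimal sketch, assuming the default standalone port (7077) and the master hostname from /etc/hosts above:

# estimate Pi on the cluster using the bundled example
/usr/local/spark/bin/run-example --master spark://master:7077 SparkPi 10

# or just open an interactive shell against the cluster
/usr/local/spark/bin/spark-shell --master spark://master:7077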

Getting Hadoop (hdfs) installed on your Spark Cluster.

Just a few extra steps here and we will have HDFS up and running on the Spark cluster as well. In real life, Spark would probably be reading from some sort of AWS or GCP bucket, or from a Hadoop/HDFS filesystem… not your local drives. (A Spark worker can’t access a file that’s stored only on the master’s normal drive.)
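
To make that concrete, here is roughly what the difference looks like from spark-shell once HDFS is running; the file paths are made up for illustration.

// only works if the file exists at this exact path on every node in the cluster
val localDf = spark.read.csv("file:///home/beach/some_local_file.csv")

// works from any worker, because HDFS serves the blocks over the network
val hdfsDf = spark.read.csv("hdfs://master:9000/some_file.csv")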

  1. wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
  2. tar -xvf hadoop-2.7.3.tar.gz
  3. mv hadoop-2.7.3  hadoop
  4. vim ~/hadoop/etc/hadoop/hadoop-env.sh
    1. export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  5. vim ~/hadoop/etc/hadoop/core-site.xml
    <configuration>
    <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
    </property>
    </configuration>
  6. vim ~/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/beach/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/beach/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
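
It doesn’t hurt to create those storage directories on the master and each worker before starting anything, using the same paths as in the config above (adjust /home/beach to your own user):

mkdir -p /home/beach/data/nameNode
mkdir -p /home/beach/data/dataNode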

  7. cd ~/hadoop/etc/hadoop
    1. mv mapred-site.xml.template mapred-site.xml
    2. vim mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

  8. vim ~/hadoop/etc/hadoop/yarn-site.xml

<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

  9. vim ~/hadoop/etc/hadoop/slaves (listing localhost here means the master will run a datanode too)
localhost
worker1
worker2

  10. vim ~/hadoop/etc/hadoop/yarn-site.xml again, adding these memory limits to the properties from step 8 (they are small because these nodes only have 2GB of RAM).

<configuration>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>800</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>800</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>400</value>
</property>
</configuration>

  11. vim ~/hadoop/etc/hadoop/mapred-site.xml again, adding memory settings for the MapReduce side as well.

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>800</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>400</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>400</value>
</property>
</configuration>

  12. sudo vim ~/.bashrc
    1. export PATH=/home/beach/hadoop/bin:/home/beach/hadoop/sbin:$PATH
    2. source ~/.bashrc so the new PATH takes effect.
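
Before the report in the next step will show anything, HDFS actually has to be formatted and started. Roughly, from the master:

# one time only: format the namenode (this wipes any existing HDFS metadata)
hdfs namenode -format

# start HDFS: the namenode on the master, a datanode on every host in the slaves file
start-dfs.sh

# and YARN, since we configured it above
start-yarn.sh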

  13. Running hdfs dfsadmin -report, you can see the Hadoop cluster, with the workers showing up as live datanodes.
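
And a quick sanity check that you can actually write to it; the directory and file name here are just examples:

hdfs dfs -mkdir -p /test
hdfs dfs -put /etc/hosts /test/hosts.txt
hdfs dfs -ls /test

If that works, Spark on any of the workers can read the same file back as hdfs://master:9000/test/hosts.txt.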

Yikes! Now I have a Spark cluster with HDFS installed, and the sky is the limit! Next time we will dig into running Spark scripts in real life.