Compute Cluster Info

Connecting

In your ~/.ssh/config:

Host cluster.cs.sfu.ca
  User YOUR_USERNAME
  Port 24
  HostName cluster.cs.sfu.ca
  LocalForward 9870 controller.local:9870
  LocalForward 8088 controller.local:8088
  LocalForward 18080 controller.local:18080

Then, after connecting, you can access the HDFS web frontend at http://localhost:9870/ and the YARN web frontend at http://localhost:8088/ (port 18080, also forwarded, is the standard Spark history server web UI).
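
With that configuration in place, connecting (and setting up the port forwards) is just:

ssh cluster.cs.sfu.ca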

It is also a good idea to create an SSH key and add the public key to your ~/.ssh/authorized_keys on the cluster gateway (with ssh-copy-id or similar).
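
For example (the key type is up to you; ssh-copy-id will pick up the username and port from the config above):

ssh-keygen -t ed25519
ssh-copy-id cluster.cs.sfu.ca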

HDFS

The HDFS NameNode is controller.local:54310, and the standard hdfs commands will work. Each user should have an HDFS home directory at /user/YOUR_USERNAME.

hdfs dfs -ls /user/YOUR_USERNAME
hdfs dfs -copyFromLocal some_dataset .
hdfs dfs -copyToLocal some_results .
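
Paths can also be given as full HDFS URIs against the NameNode address above, which can be handy when a path has to be spelled out in code or configuration (the path here is just an example):

hdfs dfs -ls hdfs://controller.local:54310/user/YOUR_USERNAME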

YARN

Jobs can be submitted to YARN in the standard way. JAR files can be uploaded to the gateway, or created there. The commands to compile and submit a job will look something like this:

javac -classpath $(hadoop classpath) MapReduceJob.java
jar cf job.jar *.class
yarn jar job.jar MapReduceJob
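
Running jobs can be watched from the YARN web frontend (forwarded above) or from the command line, and output is read back from HDFS; the application ID and output directory here are placeholders:

yarn application -list
yarn logs -applicationId <application-id>
hdfs dfs -cat output/part-r-00000 | less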

Spark

Spark jobs can be started with the usual Spark commands once you are logged into the gateway. Python jobs can be submitted directly; Scala code must first be compiled and packaged into a JAR (e.g., with sbt package) and submitted with its main class:

spark-submit --class ClassName job.jar
spark-submit code.py
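
As a rough sketch, a Python job submitted as code.py above might look like this; the input and output paths are placeholders:

from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName('example job').getOrCreate()
    # Read a CSV dataset from an HDFS path (placeholder).
    df = spark.read.csv('/user/YOUR_USERNAME/some_dataset', header=True, inferSchema=True)
    # A trivial aggregation, just so there is something to write back.
    counts = df.groupBy(df.columns[0]).count()
    counts.write.csv('/user/YOUR_USERNAME/some_results', mode='overwrite')

if __name__ == '__main__':
    main()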

Interactive Spark sessions can also be started with the usual commands:

spark-shell
pyspark
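
In the interactive shells, a SparkSession is already created for you as spark (and a SparkContext as sc), so for example in pyspark:

spark.range(10).show()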

Kafka

Kafka is installed on all of the cluster's worker nodes, and can be contacted on any of them on port 9092. For example, with the Python kafka-python package (imported as kafka), connecting looks like this:

import kafka

bootstrap_servers = ['node1.local:9092', 'node2.local:9092']
producer = kafka.KafkaProducer(bootstrap_servers=bootstrap_servers)
consumer = kafka.KafkaConsumer(topic, bootstrap_servers=bootstrap_servers)
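
A short usage example with those objects, sending one message and then printing whatever arrives on the topic (the message content is arbitrary):

producer.send(topic, b'hello kafka')
producer.flush()
for message in consumer:
    print(message.value)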

Spark DataFrame jobs that work with Kafka need the spark-sql-kafka package:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 streaming_job.py
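
As a sketch of what such a streaming job might contain (the topic name is a placeholder), reading Kafka messages as a streaming DataFrame and echoing the values to the console:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('kafka streaming example').getOrCreate()
messages = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', 'node1.local:9092,node2.local:9092') \
    .option('subscribe', 'some-topic').load()
# Kafka message values arrive as bytes; cast to strings before displaying.
values = messages.select(messages['value'].cast('string'))
stream = values.writeStream.format('console').start()
stream.awaitTermination()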

Cassandra

Each of the worker nodes in the cluster has Cassandra installed, and you can connect to any of them as an initial contact point.

cqlsh node1.local

Or in Python code with the cassandra driver:

from cassandra.cluster import Cluster
cluster = Cluster(['node1.local', 'node2.local'])
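
And a small usage example with that cluster object (the keyspace, table, and column names are placeholders):

session = cluster.connect('your_keyspace')
rows = session.execute('SELECT id, data FROM some_table')
for row in rows:
    print(row.id, row.data)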

Or from a Spark job with the Spark Cassandra Connector:

from pyspark.sql import SparkSession

cluster_seeds = 'node1.local,node2.local'
spark = SparkSession.builder.config('spark.cassandra.connection.host', cluster_seeds).getOrCreate()
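
With the session configured like that, Cassandra tables can be read and written as DataFrames (the keyspace and table names are placeholders):

df = spark.read.format('org.apache.spark.sql.cassandra') \
    .options(table='some_table', keyspace='your_keyspace').load()
df.write.format('org.apache.spark.sql.cassandra') \
    .options(table='output_table', keyspace='your_keyspace').mode('append').save()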

Starting a Spark job that uses Cassandra requires the Spark Cassandra Connector package and its SQL extensions. A command line like this should work:

spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.0 --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions cassandra-job.py