Spark3
Modules# There are a few Spark modules; the command module avail Spark will list them. If you require an update or an updated combination, please make your install request using the
software install form
.
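A typical workflow on the module system looks like this (the exact module names and versions shown by module avail will differ between installations):
module avail Spark          # list the available Spark modules
module load devel/Spark     # load one of them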
Security of a Spark Cluster# Security in Spark is OFF by default, which means an unconfigured cluster may be vulnerable to attack.
This documentation
covers the subject comprehensively.
The following script walks through the basic steps. It can be customized and used to generate a security configuration. Please be aware: the script overwrites ~/.spark-config/spark-defaults.conf
.
#!/bin/bash
#
# -- Secure a Standalone Spark Cluster:
# -- Generate a keystore and truststore and create a spark config file
#
# Generate a random password (users should never really need to know this)
# Being lazy and using a single password throughout (as password and spark secret etc)
echo ""
echo " Generating Secure Spark config"
echo ""
module purge
module load lang/Java/1.8.0_202
SPARK_PASSWORD=$(tr -dc A-Za-z0-9_ < /dev/urandom | head -c 12)
# echo "Generating Trust and Key store files"
# Generate the keystore
keytool -genkey -alias spark \
-keyalg RSA -keystore spark-keystore.jks \
-dname "cn = spark, ou=MPCDF, o=MPG, c=DE" \
-storepass $SPARK_PASSWORD -keypass $SPARK_PASSWORD
# Export the public cert
keytool -export -alias spark -file spark.cer -keystore spark-keystore.jks -storepass $SPARK_PASSWORD
# Import public cert into truststore
keytool -import -noprompt -alias spark -file spark.cer -keystore spark-truststore.ts -storepass $SPARK_PASSWORD
# Move files to config dir and clean up
MY_SPARK_CONF_DIR=~/.spark-config
mkdir -p $MY_SPARK_CONF_DIR
chmod 700 $MY_SPARK_CONF_DIR
# echo "Moving generated trust and key store files to $MY_SPARK_CONF_DIR"
mv -f spark-keystore.jks $MY_SPARK_CONF_DIR
mv -f spark-truststore.ts $MY_SPARK_CONF_DIR
# clean up intermediate file
rm -f spark.cer
# Create the spark default conf file with the secure configs
# echo "Creating SPARK Config files in $MY_SPARK_CONF_DIR"
cat << EOF > $MY_SPARK_CONF_DIR/spark-defaults.conf
spark.ui.enabled false
spark.authenticate true
spark.authenticate.secret $SPARK_PASSWORD
spark.ssl.enabled true
spark.ssl.needClientAuth true
spark.ssl.protocol TLS
spark.ssl.keyPassword $SPARK_PASSWORD
spark.ssl.keyStore $MY_SPARK_CONF_DIR/spark-keystore.jks
spark.ssl.keyStorePassword $SPARK_PASSWORD
spark.ssl.trustStore $MY_SPARK_CONF_DIR/spark-truststore.ts
spark.ssl.trustStorePassword $SPARK_PASSWORD
EOF
echo " Secure Spark config created in - $MY_SPARK_CONF_DIR"
One node job# If your interactive SLURM job uses only one node, you can start a spark-shell directly on that node. By default, spark-shell starts with the master local[*], which means it uses all available resources of the local node. The spark-shell allows you to work with your data interactively using the Scala language.
spark-shell
21/08/06 18:21:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://login21:4040
Spark context available as 'sc' (master = local[*], app id = local-1628266920018).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
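If spark-shell should not occupy all cores of the node, the number of local worker threads and the driver memory can be limited explicitly; the values below are only examples:
spark-shell --master local[4] --driver-memory 8G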
Job Based Usage# If you already have a packaged application, it should be submitted as a job. The following script serves as a template for starting a Scala application (packaged into myJar.jar as an example) with the entry point in the class Main in the package main.
#!/bin/bash
# further SLURM job settings go here
# load module
module load devel/Spark
# start application
spark-submit --driver-memory 8G --master local[*] --class main.Main myJar.jar
The option --master local[*]
lets Spark choose the number of worker threads on its own.
The option --driver-memory
sets the driver memory; the memory of the workers typically requires no changes.
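If the number of threads should instead match the SLURM allocation exactly, it can be pinned; a sketch using the standard SLURM environment variable SLURM_CPUS_ON_NODE:
# pin the local worker threads to the cores SLURM allocated on this node
spark-submit --driver-memory 8G --master local[$SLURM_CPUS_ON_NODE] --class main.Main myJar.jar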
Spawning a Spark cluster# Explanation in interactive mode# First, start an interactive SLURM job.
Within the SLURM job, the Spark cluster can only be started in standalone mode.
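For example, a two-node interactive allocation could be requested like this (partition, time limit and further options depend on your site; treat this as a sketch):
salloc -N 2 -t 00:30:00 -p parallel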
Setting the important environment variables for the Spark configuration# For example, if the current directory should be used as the working directory:
CLUSTERWORKDIR="$PWD"
export SPARK_LOG_DIR="$CLUSTERWORKDIR/log"
export SPARK_WORKER_DIR="$CLUSTERWORKDIR/run"
If a customized configuration should be used (for example, the security settings generated above), SPARK_CONF_DIR must be set.
Example:
export SPARK_CONF_DIR="$HOME/.spark-config"
Master process# The Spark folders should exist:
mkdir -p $SPARK_LOG_DIR $SPARK_WORKER_DIR
If the infrastructure is ready, the master process can be started on the head node of your SLURM job:
export MASTER=$(hostname -f):7077
start-master.sh
Starting the master process can take a while.
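One way to make sure the master is really up before continuing is to wait for the corresponding line in its log file; the message and the log location below follow the standalone scripts' defaults and may differ between Spark versions:
# wait until the standalone master reports that it has started
until grep -qs "Starting Spark master" "$SPARK_LOG_DIR"/*.out; do
    sleep 2
done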
Workers# Now the worker processes can be spawned on the nodes reserved for your job:
srun spark-class org.apache.spark.deploy.worker.Worker $MASTER -d $SPARK_WORKER_DIR &
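Whether all workers registered successfully can be checked in the master log; the exact file name pattern and message wording may differ between Spark versions:
# each successfully registered worker produces one "Registering worker" line in the master log
grep -c "Registering worker" "$SPARK_LOG_DIR"/*Master*.out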
Script example# #!/bin/bash
CLUSTERWORKDIR="$PWD"
export SPARK_LOG_DIR="$CLUSTERWORKDIR/log"
export SPARK_WORKER_DIR="$CLUSTERWORKDIR/run"
# export SPARK_CONF_DIR="$HOME/.spark-config" # uncomment this line if you want to create a secure cluster
export MASTER=$(hostname -f):7077
export MASTER_URL=spark://$MASTER
WAITTIME=10s
echo Starting master on $MASTER
mkdir -p $SPARK_LOG_DIR $SPARK_WORKER_DIR
start-master.sh
echo "wait $WAITTIME to allow master to start"
sleep $WAITTIME
echo Starting workers
srun spark-class org.apache.spark.deploy.worker.Worker $MASTER -d $SPARK_WORKER_DIR &
echo "wait $WAITTIME to allow workers to start"
sleep $WAITTIME
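If the lines above are saved as create_spark_cluster.sh, source the file instead of executing it, so that MASTER and MASTER_URL remain set in the calling shell:
source ./create_spark_cluster.sh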
Spark submit# Be sure that all workers have started before you submit any Spark jobs.
Example:
spark-submit --master spark://$MASTER --total-executor-cores 20 --executor-memory 5G /path/example.py file:///path/Spark/Data/project.txt
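If the cluster was secured as described above, the submitting shell must use the same configuration directory, otherwise authentication against the cluster will fail; this is simply the client-side counterpart of the setting shown earlier:
export SPARK_CONF_DIR="$HOME/.spark-config"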
SLURM submission script example# In the following example, the previously described example script create_spark_cluster.sh
is used.
#!/bin/bash
#SBATCH -N 2
#SBATCH -t 00:10:00
#SBATCH --mem 20000
#SBATCH --ntasks-per-node 8
#SBATCH --cpus-per-task 5
#SBATCH -p parallel
#SBATCH -C anyarch
echo ""
echo " Starting the Spark Cluster "
echo ""
# source (not execute) the script so that MASTER and MASTER_URL are visible below
source ./create_spark_cluster.sh
echo $MASTER
echo ""
echo " About to run the spark job"
echo ""
spark-submit --master $MASTER_URL --total-executor-cores 16 --executor-memory 4G /path/example.py file:///path/Spark/Data/project.txt
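Save the submission script under a name of your choice (spark_job.slurm is just an example) and hand it over to SLURM:
sbatch spark_job.slurm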