SPARK DEPLOYMENT MODE
Apache Spark supports two deployment modes for job submission:
1) Client mode (default)
2) Cluster mode
The difference between the two deployment modes is where the driver program runs.
· Client mode: the driver program runs on the edge node.
· Cluster mode: the driver program runs on a worker node inside the cluster.
A few terms to help us understand the concept better:
1) An edge node is a gateway node to the cluster, i.e. a node at the edge of the cluster that connects the user to the cluster.
2) The driver program is the main program that drives the Spark job.
3) The executor programs do the actual data processing on the worker nodes and are controlled by the driver program.
Note that if the driver program fails, the executor programs also fail.
A Spark job is launched by executing the “spark-submit” command on the edge node:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
(from the Apache Spark website)
To understand the deployment modes, we will assume a setup with 1 edge node and 3 worker nodes.
In client mode, spark-submit is executed on the edge node and the driver program is launched on that same edge node. The driver program uses the edge node's resources (memory and CPU).
Thus, in our example setup of 1 edge node and 3 worker nodes, when a Spark job (job 1) is submitted from the edge node with client-mode deployment, the driver program (d1) runs on the edge node and spawns executor programs on the 3 worker nodes (w1, w2 and w3).
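As an illustration, a client-mode submission for this setup might look like the sketch below, assuming a YARN cluster; the class name, jar and input path are only placeholders:
# Client mode: the driver runs on the edge node where spark-submit is executed
# (com.example.WordCount, wordcount.jar and /data/input.txt are hypothetical)
./bin/spark-submit \
--class com.example.WordCount \
--master yarn \
--deploy-mode client \
--num-executors 3 \
--executor-memory 2g \
wordcount.jar \
/data/input.txt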
If multiple jobs are submitted in client mode, that many drivers will run on that single edge node. All these drivers (say d1 for job 1 and d2 for job 2) compete for the memory and CPU of the edge node, so jobs may fail with an “Out of memory” error.
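If several jobs must still run in client mode, the memory each driver claims on the edge node can at least be capped with --driver-memory. A sketch, reusing the placeholder job from above; the 1g value is only illustrative:
# Cap the edge-node memory taken by this client-mode driver
./bin/spark-submit \
--class com.example.WordCount \
--master yarn \
--deploy-mode client \
--driver-memory 1g \
wordcount.jar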
Client mode deployment is advisable when the edge node is local to the cluster.
In cluster mode, spark-submit is still executed on the edge node, but the driver program runs on one of the worker nodes in the cluster. So, for a given job, one of the worker nodes becomes the driver node and controls the remaining worker nodes.
Thus, in our example setup of 1 edge node and 3 worker nodes, when a Spark job (job 1) is submitted from the edge node with cluster-mode deployment, the driver program (d1) runs on one of the worker nodes and spawns executor programs on the remaining 3 - 1 = 2 worker nodes (w1 and w2).
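The same placeholder job submitted in cluster mode differs only in the --deploy-mode value; the driver is then launched on a worker node instead of the edge node (again a sketch assuming a YARN cluster):
# Cluster mode: the driver runs on one of the worker nodes inside the cluster
./bin/spark-submit \
--class com.example.WordCount \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--executor-memory 2g \
wordcount.jar \
/data/input.txt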
If multiple jobs are submitted in cluster mode, that many drivers (d1 and d2 in our example) will run on different worker nodes. Each driver uses the resources of its own machine, so there is no single point of memory pressure.
Cluster mode deployment is advisable when the edge node is remote, i.e. away from the cluster.
Thus, depending on how close or remote the edge node is, we can choose the deployment mode at Spark job submission.
Happy Learning.