Getting started Apache Spark with Java

In this tutorial, we will learn how to set up Apache Spark for local development and getting started with Java application. We will use an Apache log file to show few basic RDD operations.

Tools & Libraries used:

  1. Apache spark 2.1.1
  2. Gradle (3.x) / Maven
  3. Eclipse IDE
  4. JDK 1.8
  5. Mac OS X or Windows

This is a part of Apache Spark Tutorials series. 

Apache Spark Installation

Download the distribution

Go to official Apache Spark download website and select spark release, package type (use Pre-built for Hadoop 2.7 and later) and download type (direct download). Click on Download tar link.

Apache Spark Download Site

Apache Spark Download Site

Extract the distribution

Use UI based application (Unarchiver, WinZip or others) to extract the downloaded tar file content. You can place the Apache Spark folder in any place of your like.

You can also use the following command to untar:

Apache Spark Binary Folder Structure

Apache Spark Folder Structure

Running Apache Spark

As you know Apache Spark is the cluster of computational engines. In the typical setting, we need at least one master node (driver program) and many executor nodes.

Apache Spark supports testable code and rapid development. For it, they give ways to start Apache Spark on a single machine (local mode). Though as a developer you have to be careful that you always code keeping cluster computational in mind.

The behavior of program might change when it executes on the cluster as computation is done on many machines in parallel and then the result is aggregated by the master node.

For development

For ease of development, there are many ways to start Apache Spark on the local machine.

Running Spark CLI (REPL) on local machine

Note: Our tutorial is focused on Java-based spark application and now Apache doesn’t support Java CLI. This example is for giving you an idea about Apache Spark CLI.

Starting Scala CLI (REPL), which have SparkContext initialize and available as variable sc, in local mode with 4 worker threads.

Go to Apache Spark home directory and execute the following command

$ ./bin/spark-shell --master local[4]

Note:

  1. spark-shell internally use the more basic spark-submit script, which is explained later in the section
  2. Starting Apache Spark in local mode with following parameters
    1. local – Single thread with no parallel processing worker thread.
    2. local[K] – K worker thread for parallel processing. Ideally, it should be not greater than no of cores available on CPU.
    3. local[*] – Number of worker thread assigned will be equal to no of cores available on CPU.
Apache Spark scala shell initialize with local 4 threads

Apache Spark Scala shell initialise with local 4 threads

You can execute exploratory queries in the shell using Scala programming language. We can also start the shell for Python and R. How to start a shell in Python and R is beyond the scope of this tutorial.

Running Spark application on local machine

Go to Apache Spark home directory and execute the following command

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode client
  --master local[*] \
  /path/to/examples.jar \
  args

In above command, script spark-submit will start a local Spark executor node with as many worker threads as the number of cores in the CPUs. With deploy-mode as client spark-submit starts driver program and connect with the worker for execution. It also displays console outputs. This is very good for development.

Note: We will learn how to create Java-based Spark application and package them further in the tutorial, so we can run that using spark-submit.

For production

For production deployment we require Apache Spark running in cluster mode using Standalone Cluster Manager, Apache Mesos or Hadoop YARN. Discussing more in this tutorial is beyond the scope of this tutorial.

Apache Spark Application in Java

Let’s start building a sample Apache Spark application which read an apache log files (NASA sample log file) and does the following analysis:

  1. Total lines
  2. Total characters
  3. How many times files with extension.html and.gif accessed

You may clone the source code from GitHub repository.

Github URL – https://github.com/geekmj/apache-spark-examples

Git clone URL – https://github.com/geekmj/apache-spark-examples.git

Source code Download Link – https://github.com/geekmj/apache-spark-examples/archive/master.zip 

It has several Apache Spark sample projects. The log file apache-log04-aug-31-aug-2011-nasa.log for analysis is kept in the data folder.

Import project inside folder 01-getting-started as Gradle or maven project. Make sure you have the Gradle plugin installed for Eclipse. Maven plugin come by default with Eclipse.

Note: You may fresh start by creating a maven or java based project by your own and using the source code provided in this tutorial.

Apache Spark Java getting started project structure

After Importing source code in Eclipse, Apache Spark Java getting started project structure

Gradle & Maven build file

We have provided Gradle and maven build file for the project.

File: build.gradle 

apply plugin: 'java-library'

repositories {
    jcenter()
}

dependencies {
    compileOnly 'org.apache.spark:spark-core_2.11:2.1.0'
    testImplementation 'org.apache.spark:spark-core_2.11:2.1.0','junit:junit:4.12'
}

File: pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.geekmj</groupId>
  <artifactId>apache-spark-examples</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>apache-spark-examples</name>
  
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>
  
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
     <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.1.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>

Resilient Distributed Dataset (RDD)

Data and its manipulation is the key for any computational platform. Apache Spark provides its own flavor of a data structure called, Resilient Distributed Dataset (RDD), which can be created with the Parallelized collection or external dataset. RDD is partitioned into multiple sets and one set is given to one node for distributed processing. This data structure is the base for any high-level implementation in Apache Spark.

Apache Spark Resilient Distributed Dataset

Apache Spark Resilient Distributed Dataset Lifecycle

Apache Spark driver program load parallelized collection or external data source in resilient distributed dataset (RDD) format. Few salient features of RDD are:

  1. They distributed on many processing nodes for operation.
  2. It can also be persisted in memory for reusability.
  3. RDD recovers automatically from failure.
  4. While creating parallelized collection we can specify the number of partition for the data set.

For our example, RDD will be created for a log file.

Driver Program

File: SparkDriverProgram.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
/**
 * (c) geekmj.org All right reserved  
 * 
 * Main Spark Driver Program Class
 * It creates Resilient Distributed Dataset for a log file (external data source)
 * It does following operation on RDD
 * 1. Total no. of lines in log file
 * 2. Total characters in log file
 * 3. Total no. of URL in log file with HTML and GIF extension
 */
public class SparkDriverProgram {
    public static void main(String args[]) {
    	/* Define Spark Configuration */
		SparkConf conf = new SparkConf().setAppName("01-Getting-Started").setMaster(args[0]);
        
		/* Create Spark Context with configuration */
		JavaSparkContext sc = new JavaSparkContext(conf);
        
		/* Create a Resilient Distributed Dataset for a log file 
		 * Each line in log file become a record in RDD
		 * */
        JavaRDD<String> lines = sc.textFile(
        		"/Volumes/Drive2/projects/project_workspace/github/apache-spark-examples/data/apache-log04-aug-31-aug-2011-nasa.log");
        
		System.out.println("Total lines in log file " + lines.count());
        
		/* Map operation -> Mapping number of characters into each line as RDD */
		JavaRDD<Integer> lineCharacters = lines.map(s -> s.length());
		/* Reduce operation -> Calculating total characters */
		int totalCharacters = lineCharacters.reduce((a, b) -> a + b);
        
		System.out.println("Total characters in log file " + totalCharacters);
        
		/* Reduce operation -> checking each line for .html character pattern */
		System.out.println("Total URL with html extension in log file " 
				+ lines.filter(oneLine -> oneLine.contains(".html")).count());
		
		/* Reduce operation -> checking each line for .gif character pattern */
		System.out.println("Total URL with gif extension in log file "
				+ lines.filter(oneLine -> oneLine.contains(".gif")).count());
        
		sc.close();
    }
}

Above code snippet is main Apache Spark driver program.

Note: Please change the log file path as per your system.

Run the application

Using spark-submit

Go to project home directory and build the distribution Jar.

Gradle command:

$ gradle build

It creates Jar file 01-getting-started.jar inside project_home/build/libs/.

Maven command:

mvn package

It creates Jar file apache-spark-examples-0.0.1-SNAPSHOT.jar inside project_home/target/.

Go inside Apache Spark installation directory.

Run spark-submit command:

./bin/spark-submit 
--class SparkDriverProgram 
--deploy-mode client 
--master local[*] path/to/jar.jar 
local

It executes the Spark Application as explained in an earlier section.

Console output:

17/05/10 19:36:44 INFO Executor: Starting executor ID driver on host localhost
17/05/10 19:36:44 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54530.
17/05/10 19:36:44 INFO NettyBlockTransferService: Server created on 10.11.2.101:54530
17/05/10 19:36:46 INFO DAGScheduler: ResultStage 0 (count at SparkDriverProgram.java:28) finished in 0.723 s
17/05/10 19:36:46 INFO DAGScheduler: Job 0 finished: count at SparkDriverProgram.java:28, took 0.904987 s
....
Total lines in log file 784979
.....
17/05/10 19:36:46 INFO SparkContext: Starting job: reduce at SparkDriverProgram.java:33
17/05/10 19:36:47 INFO DAGScheduler: Job 1 finished: reduce at SparkDriverProgram.java:33, took 0.388462 s
.....
Total characters in log file 82949738
.....
17/05/10 19:36:47 INFO SparkContext: Starting job: count at SparkDriverProgram.java:39
17/05/10 19:36:47 INFO DAGScheduler: Job 2 finished: count at SparkDriverProgram.java:39, took 0.392778 s
......
Total URL with html extension in log file 168469
......
17/05/10 19:36:47 INFO SparkContext: Starting job: count at SparkDriverProgram.java:43
17/05/10 19:36:48 INFO DAGScheduler: Job 3 finished: count at SparkDriverProgram.java:43, took 0.404912 s
.....
Total URL with gif extension in log file 467323
.....
17/05/10 19:36:48 INFO SparkUI: Stopped Spark web UI at http://10.11.2.101:4040
17/05/10 19:36:48 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/05/10 19:36:48 INFO MemoryStore: MemoryStore cleared
17/05/10 19:36:48 INFO BlockManager: BlockManager stopped
17/05/10 19:36:48 INFO BlockManagerMaster: BlockManagerMaster stopped
17/05/10 19:36:48 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/05/10 19:36:48 INFO SparkContext: Successfully stopped SparkContext
17/05/10 19:36:48 INFO ShutdownHookManager: Shutdown hook called
17/05/10 19:36:48 INFO ShutdownHookManager: Deleting directory /private/var/folders/9v/qhdgf5gn6_ggdcp_4qdyl8s80000gn/T/spark-55700647-9ca3-431c-bc96-eebf82227fb7

Using Eclipse IDE

We should execute SparkDriverProgram as general Java application and that’s it. We can also debug it, just like another java application.

Download Icon SmallDownload the Full ProjectGithub IconFollow Project On Github

References

  1. We have used Apache Log (04 Aug – 31 Aug 2011) of NASA Kennedy Space Center WWW server in Florida.
  2. More about submitting Apache Spark application.
  3. Apache Spark official programming guide.

One Reply to “Getting started Apache Spark with Java”

  1. adam

    It’s great to come across a blog every once in a while that isn’t the same out of date rehashed material. Fantastic read.
    I’ve bookmarked your site, and I’m adding your RSS feeds to my Google account.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *