Big Data has come a long way. Apache Spark is one of the fastest big data computation engines. In this article, we will answer frequently asked questions about the basics of Apache Spark.
What problems does Apache Spark solve, and how does it solve them?
Big data computation problem
When data runs into terabytes, it is slow and inefficient to load it into a single machine's memory and process it there. The cost of running computations on high-end machines (large memory, multiple cores and processors) is also very high.
Apache Spark is a cluster-based parallel processing engine that runs efficiently on clusters of low-end machines. It can process data in memory as well as on disk.
Limitation in MapReduce processing
MapReduce is a parallel, distributed programming model for processing and generating large data sets on a cluster. It is the model Apache Hadoop uses for big data computation.
MapReduce processes everything on disk (a cluster of disks) in the following sequential steps:
- Read data from disk
- Map data
- Reduce data
- Write result on disk
Disk I/O (input and output) takes up most of the time in a MapReduce operation. This becomes very inefficient when a problem needs multiple iterations over the same data set, as is common in graph manipulation, machine learning algorithms, and other iterative problems.
Apache Spark overcomes these MapReduce bottlenecks with its in-memory resilient distributed dataset (RDD) data structure, a read-only multiset of data items distributed across the cluster. In memory, Spark can be up to 100x faster than Apache Hadoop MapReduce; on disk, up to 10x faster.
With RDDs, it becomes practical to run iterative algorithms that visit the same dataset in a loop, as well as repeated, database-style querying.
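A minimal sketch of what this enables, assuming a hypothetical input file with one numeric value per line: caching the RDD keeps it in memory, so a loop can reuse it without re-reading from disk on every pass.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))

    // Hypothetical input: one numeric value per line.
    val points = sc.textFile("points.txt").map(_.toDouble)

    // cache() keeps the parsed RDD in memory, so every iteration
    // below reuses it instead of re-reading and re-parsing the file.
    points.cache()

    var threshold = 0.0
    for (_ <- 1 to 10) {
      // Each pass scans the cached, in-memory dataset, not the disk.
      threshold = points.filter(_ > threshold).mean()
    }
    println(s"Converged threshold: $threshold")

    sc.stop()
  }
}
```

Under MapReduce, each of those ten passes would read the input from disk again; here the read and parse happen once.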
Complex big data ecosystem
Apache Hadoop, another big data computation platform, has grown into a complex ecosystem of tools and libraries for real-time streaming, structured data analysis, machine learning, and so on. Teams often have to adopt yet another framework for each new need, which increases the cost of maintenance.
Apache Spark provides a unified ecosystem: a low-level core API along with high-level APIs and tools for real-time streaming, machine learning, and more.
What are the various components of Apache Spark?
Spark Core – Spark Core includes the RDD (Resilient Distributed Dataset) API, cluster management, scheduling, data source handling, memory management, fault tolerance, and other core functionality. This general-purpose, fast computational core provides the foundation on which the higher-level APIs are built. A benefit of this tightly coupled architecture is that whenever the Core improves, the high-level APIs benefit as well.
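As an illustration, here is the classic word count written directly against the Core RDD API (the input path is hypothetical); the higher-level components build on primitives like these:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("WordCount").setMaster("local[*]"))

val counts = sc.textFile("input.txt")   // hypothetical input file
  .flatMap(_.split("\\s+"))             // split each line into words
  .map(word => (word, 1))               // pair each word with a count of 1
  .reduceByKey(_ + _)                   // sum the counts per word

counts.take(10).foreach(println)
sc.stop()
```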
Spark SQL, DataFrames, and Datasets – Provides an API for processing structured data (JSON, relational databases, and others). You can query a dataset using SQL syntax.
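A small sketch, assuming a hypothetical people.json file with name and age fields; the same data can be queried through the DataFrame API or through plain SQL on a temporary view:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical JSON file with "name" and "age" fields.
val people = spark.read.json("people.json")

// Query with the DataFrame API...
people.filter(people("age") > 21).show()

// ...or with SQL syntax against a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

spark.stop()
```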
Spark Streaming – Provides an API for processing real-time streams of data from sources such as Kafka, Flume, Kinesis, or TCP sockets. These streams can then be processed further with machine learning, graph, and other algorithms.
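For example, a minimal streaming word count over a TCP socket (host and port are assumptions; Kafka, Flume, and Kinesis sources plug into the same DStream abstraction):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
// Process the incoming stream in 10-second micro-batches.
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical source: text lines arriving on a local TCP socket.
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the stream is stopped
```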
Spark MLlib (Machine Learning) – Provides an API for running machine learning algorithms such as classification, regression, and collaborative filtering.
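A brief sketch using the DataFrame-based spark.ml API, assuming training data in the LIBSVM format (label plus feature vector) that Spark can load out of the box:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MLlibSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical training data in LIBSVM format.
val training = spark.read.format("libsvm").load("training_data.txt")

// Fit a logistic regression classifier on the training set.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)

println(s"Model coefficients: ${model.coefficients}")
spark.stop()
```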
Spark GraphX – Provides an API for parallel computation on graph data (e.g., a Facebook friendship graph). It also ships with built-in graph algorithms such as PageRank and triangle counting.
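A short sketch of the built-in PageRank, assuming a hypothetical edge-list file with one "sourceId destinationId" pair per line:

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("GraphXSketch").setMaster("local[*]"))

// Hypothetical edge list: one "sourceId destinationId" pair per line.
val graph = GraphLoader.edgeListFile(sc, "followers.txt")

// Built-in PageRank: iterate until changes fall below the tolerance.
val ranks = graph.pageRank(0.0001).vertices

ranks.take(5).foreach {
  case (id, rank) => println(s"vertex $id has rank $rank")
}
sc.stop()
```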
In which programming languages can you write a Spark application?
Apache Spark provides libraries and tools for writing applications in Scala, Java, Python, and R.
You can also interact with Apache Spark through the Scala, Python, and R shells (command-line interfaces) to run exploratory queries, which is very useful for data scientists.
How does an Apache Spark application execute?
The lifecycle of a Spark program running on a cluster:
- You write a Spark application, package it, and submit it to the cluster (not directly to a worker node). In your application's main program (the driver program) you create a SparkContext; the application then runs on the cluster as a set of independent processes coordinated by that SparkContext (see the driver sketch after this list).
- The SparkContext can connect to the worker nodes through any of several supported cluster managers.
- The SparkContext acquires executor processes on the worker nodes. These processes run computations and store data for the application.
- The SparkContext sends the application code (JARs or Python files) to the executors.
- The SparkContext sends tasks to the executors for processing.
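A minimal driver-program sketch tying these steps together (the standalone master URL and host name are assumptions and would differ per cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The driver program: creating a SparkContext connects the application
// to a cluster manager (here a hypothetical standalone master URL).
val conf = new SparkConf()
  .setAppName("MyDriverApp")
  .setMaster("spark://master-host:7077") // assumption: standalone master address

val sc = new SparkContext(conf)

// Work submitted through sc is split into tasks and shipped to the
// executor processes that the cluster manager allocated.
val doubledSum = sc.parallelize(1 to 1000000).map(_ * 2).sum()
println(doubledSum)

sc.stop()
```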
A few notable features of this architecture:
- Each application gets its own executor processes, which stay up for the whole duration of the application and run tasks in multiple threads. This isolates Spark applications from one another, but it also means data cannot be shared between applications without writing it to disk.
- Spark can work with any cluster manager that is able to acquire executor processes on its behalf.
- A network connection between the driver program and the worker nodes is required for the lifetime of the application.
- Keeping the driver program and the worker nodes close to each other (preferably on the same local network) reduces task-scheduling latency.
Which data storage systems does Apache Spark support?
Apache Spark does not have its own storage layer. However, it supports several storage systems, including the Hadoop Distributed File System (HDFS), HBase, Cassandra, Apache Hive, Amazon S3, and others, and it allows custom data sources to be plugged in.
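Illustrative reads from a few backends (all paths and host names are hypothetical); only the URI scheme changes, the read API stays the same:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StorageSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical locations; only the URI scheme differs per backend.
val localFile = spark.read.textFile("file:///tmp/events.log")
val hdfsFile  = spark.read.textFile("hdfs://namenode:8020/logs/events.log")
val s3Data    = spark.read.parquet("s3a://my-bucket/events/") // needs the hadoop-aws connector

println(localFile.count() + hdfsFile.count() + s3Data.count())
spark.stop()
```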
Which cluster managers does Apache Spark support?
Apache Spark comes with its own native (standalone) cluster manager, which is good for small deployments. It can also use Hadoop YARN or Apache Mesos as its cluster manager.
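The cluster manager is selected through the master URL; a sketch of the common forms (host names and ports are placeholders):

```scala
import org.apache.spark.SparkConf

// Only the master URL changes; the application code stays the same.
val standalone = new SparkConf().setMaster("spark://master-host:7077") // Spark's native manager
val yarn       = new SparkConf().setMaster("yarn")                     // Hadoop YARN
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Apache Mesos
val local      = new SparkConf().setMaster("local[*]")                 // single machine, for testing
```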
Why is Apache Hadoop bundled with the Apache Spark distribution?
Apache Spark supports the Hadoop YARN and Mesos cluster managers, and it depends on the Hadoop client libraries to talk to YARN and HDFS. You can also download a Spark build without the Hadoop bundle and point it at an existing Hadoop installation on the same machine as Apache Spark.