Tech Update: Apache Spark: A Spark of Light in the World of Analytics


Today, we live in a ‘Data Age’. Few businesses can stay competitive without data analysis, which helps them make important market decisions and improve customer relationships by revealing what customers like and dislike. Apache Spark is a name that has been making the rounds in Big Data analysis circles, and some experts claim that it might soon take over the world of analytics. Let us see what makes Apache Spark such a hot topic of discussion –

What is Apache Spark?

Apache Spark was developed at UC Berkeley’s AMPLab as an in-memory computing framework that overcomes the drawbacks of Hadoop MapReduce. The broad architecture of Apache Spark is shown below:

 Apache Spark Architecture


Some of the main components of the framework include –

  • Driver Program: The driver is the process that runs the application’s main program and creates the SparkContext, the object that coordinates the application on the cluster. The driver’s tasks include listening for and accepting incoming connections from its executors and scheduling jobs on the cluster. The driver should usually run close to the worker nodes and must be reachable from them over the network.
  • Cluster Manager: The cluster manager allocates the cluster’s resources to Spark applications. Spark can use its own standalone cluster manager or an external one such as Hadoop YARN or Apache Mesos.
  • Worker Nodes: Worker nodes host executors, whose job is to run the tasks scheduled by the driver program.
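
The division of labour between driver and executors can be illustrated with a toy model in plain Python. This is only a sketch and does not use Spark: the thread pool stands in for executors, and `run_job` and `num_executors` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(partitions, task, num_executors=2):
    """Toy 'driver': send one task per partition to a pool of 'executors'."""
    # The thread pool stands in for executors on worker nodes; in a real
    # cluster the cluster manager would allocate them on separate machines.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # map() returns the results in the same order as the partitions.
        return list(executors.map(task, partitions))


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # data split across workers
partial_sums = run_job(partitions, sum)          # one task per partition
total = sum(partial_sums)                        # the driver combines results
print(partial_sums, total)
```

The driver never touches the individual records here; it only hands out tasks and combines the per-partition results, which is the role described above.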

Building Blocks of Apache Spark:

The base of Spark depends on the following two concepts:

  1. Resilient Distributed Dataset (RDD): An RDD is a read-only, partitioned dataset. An RDD is created by performing operations on either data in stable storage or other RDDs. Because an RDD is immutable, Spark, unlike Hadoop, does not need to replicate the data for fault tolerance; instead, each RDD records its lineage (the operations used to build it), and only the lost partition is re-created from it. Hence, there is no additional overhead of maintaining replicas for fault tolerance. An RDD lets the programmer control partitioning (how the data is split up) and persistence (which data is stored where). RDDs support the following two kinds of operations:
       • Transformations: Functions such as groupByKey that are applied to an RDD and produce a new RDD. Transformations are evaluated lazily, i.e. they are executed only when an action is applied.
       • Actions: Functions such as collect(), count(), and take(). When an action is applied to an RDD, the pending transformations are executed and the action returns the result to the driver program.
  2. Directed Acyclic Graph (DAG): Spark provides an advanced DAG execution engine that supports acyclic data flow. For each job, the driver builds a DAG of the stages of tasks to be executed on the Spark cluster. Hadoop MapReduce always creates exactly two stages, Map and Reduce, whereas a DAG can have any number of stages. Splitting a complex job into several stages rather than several separate jobs is part of what makes Spark faster than Hadoop MapReduce.
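
Lazy transformations, actions, and lineage-based recovery can be sketched with a toy class in plain Python. This is not Spark's API: `MiniRDD` and `compute_partition` are hypothetical names, the data lives in one process, and the DAG scheduler is not modelled.

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable, partitioned, lazily evaluated."""

    def __init__(self, source_partitions, lineage=()):
        self._source = tuple(tuple(p) for p in source_partitions)
        self._lineage = tuple(lineage)  # recorded transformations, in order

    # Transformations: return a new MiniRDD and only record the step.
    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        """Rebuild one partition from the source by replaying the lineage."""
        rows = list(self._source[index])
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:  # "filter"
                rows = [r for r in rows if fn(r)]
        return rows

    # Actions: run the recorded transformations and return a result.
    def collect(self):
        return [r for i in range(len(self._source))
                for r in self.compute_partition(i)]

    def count(self):
        return len(self.collect())


calls = []


def square(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * x


base = MiniRDD([[1, 2, 3], [4, 5, 6]])
evens = base.map(square).filter(lambda x: x % 2 == 0)
ran_before_action = len(calls)     # transformations alone execute nothing
result = evens.collect()           # the action triggers execution
lost = evens.compute_partition(1)  # recompute only the "lost" partition
print(ran_before_action, result, lost)
```

Because each `MiniRDD` keeps the source partitions and the list of steps rather than the computed rows, a single partition can be rebuilt on its own, which is the idea behind recovering from lineage instead of from replicas.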

Spark provides a wide variety of components that are tightly integrated with each other. The components are shown in the figure below:

 Spark Ecosystem


Some of the key features of Spark include –

  • RDDs are the heart of Spark’s core component and the basis for its fast execution.
  • Spark’s core component also manages basic I/O functionality, scheduling, and distributed computation.
  • Spark SQL lets the user run SQL-like queries.
  • Spark SQL provides a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3). A table’s schema can be inferred automatically for data in JSON format or specified by the user. Both structured and unstructured data are supported, and a JDBC/ODBC driver interface is available for connections.
  • Spark’s streaming component processes real-time data through DStreams (discretized streams), each of which is a series of RDDs.
  • MLlib, the machine-learning library, provides several classification and clustering algorithms. The algorithms are designed to make effective use of Spark’s distributed and parallel computation.
  • The GraphX component supports graph analysis by introducing the Resilient Distributed Property Graph, a directed multigraph with properties attached to every edge and vertex.
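
To make the last item concrete, here is a minimal single-machine sketch of a property graph in plain Python. It is not GraphX and is not distributed; the class, method names, and sample data are hypothetical and only illustrate the "directed multigraph with properties" data structure.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Directed multigraph with a property dict on every vertex and edge."""
    vertices: dict = field(default_factory=dict)  # vertex id -> properties
    edges: list = field(default_factory=list)     # (src, dst, properties)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, dst, **props):
        # A list keeps parallel edges between the same pair of vertices,
        # which is what makes this a multigraph.
        self.edges.append((src, dst, props))

    def out_degree(self, vid):
        return sum(1 for src, _, _ in self.edges if src == vid)


g = PropertyGraph()
g.add_vertex(1, name="alice", role="student")
g.add_vertex(2, name="bob", role="professor")
g.add_edge(1, 2, relation="advisee")
g.add_edge(1, 2, relation="coauthor")  # parallel edge, same direction
g.add_edge(2, 1, relation="advisor")
print(g.out_degree(1), g.out_degree(2))
```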

 Working of Spark

The figure below shows how the map and reduce functions work in Spark. As shown in the figure, the output of the map function is written to the OS buffer cache, and the OS decides whether to keep it in the cache or spill it to disk. The spill files are not merged or partitioned; however, the map outputs from the same core are stored in a single file.

Map and Reduce function in Spark



The intermediate results of the map phase are passed to the reducers in the form of shuffle files. Finally, the shuffle data is held in the reducer’s memory.
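
The hand-off from map tasks to reducers can be sketched as a toy word count in plain Python. This is a simplified in-memory model, not Spark's shuffle implementation: the function names are hypothetical, and the per-reducer buckets stand in for shuffle files.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    # crc32 is stable across runs, unlike Python's salted hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side(records, num_reducers):
    """Map task: emit (word, 1) pairs into one bucket per reducer."""
    buckets = defaultdict(list)
    for word in records:
        buckets[partition_for(word, num_reducers)].append((word, 1))
    return buckets


def reduce_side(reducer_id, all_map_outputs):
    """Reduce task: gather its bucket from every map output, sum per key."""
    counts = defaultdict(int)
    for buckets in all_map_outputs:
        for word, n in buckets.get(reducer_id, []):
            counts[word] += n
    return dict(counts)


NUM_REDUCERS = 2
map_inputs = [["spark", "hadoop", "spark"], ["hadoop", "spark", "yarn"]]
map_outputs = [map_side(part, NUM_REDUCERS) for part in map_inputs]
reduced = [reduce_side(r, map_outputs) for r in range(NUM_REDUCERS)]
word_counts = {w: n for part in reduced for w, n in part.items()}
print(word_counts)
```

Every occurrence of a given key lands in the same bucket number, so each reducer can total its keys without talking to the other reducers.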

Apache Spark vs The Elephant (Apache Hadoop) for Big Data Analysis:

Spark is gaining popularity because it overcomes the drawbacks of Hadoop MapReduce. Apache Spark is a fast, general-purpose cluster computing system that is popular for its processing speed: according to the project, it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Hadoop, though a popular framework for data-intensive applications, does not perform well on iterative processes (such as data analysis) because of the cost of reloading data from disk on every iteration. Spark, being an in-memory computing framework, uses memory efficiently by keeping RDD data in the cache, and handles fault tolerance efficiently through RDD lineage. The iterative operations of Spark and Hadoop are compared in the figure below.

Comparison of Spark and Hadoop for iterative operations


Hadoop MapReduce is inefficient for multi-pass applications that require quick response times and low-latency sharing of data across several parallel operations. These applications are common in data analytics and include iterative machine learning algorithms, such as classification and clustering algorithms, and graph algorithms. Spark provides support for streaming applications, several machine learning algorithms, and graph processing. Data mining algorithms usually perform different operations on the same data. Hence, storing the data in cache (as in Spark) rather than on disk (as in Hadoop) greatly reduces data access time.
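
The cost difference for iterative work can be illustrated by counting reads in a toy model. This plain-Python sketch makes a simplifying assumption: each "MapReduce-style" pass reloads its input from storage, while the "Spark-style" loop loads it once and reuses it. `CountingStorage` and both functions are hypothetical names, and no real disk or cluster is involved.

```python
class CountingStorage:
    """Pretend disk: counts how many times the dataset is read."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._data)


def iterate_reloading(storage, iterations):
    """MapReduce-style: every pass reads its input from storage again."""
    total = 0
    for step in range(iterations):
        data = storage.load()
        total += sum(x * (step + 1) for x in data)
    return total


def iterate_cached(storage, iterations):
    """Spark-style: read once, keep the data in memory, reuse it per pass."""
    cached = storage.load()
    total = 0
    for step in range(iterations):
        total += sum(x * (step + 1) for x in cached)
    return total


disk_a, disk_b = CountingStorage(range(5)), CountingStorage(range(5))
same_answer = iterate_reloading(disk_a, 10) == iterate_cached(disk_b, 10)
print(same_answer, disk_a.reads, disk_b.reads)
```

Both loops compute the same result; the only difference is that the number of reads grows with the number of iterations in the first and stays at one in the second.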

Apache Spark is taking the Big Data world by storm. It greatly simplifies the work of data scientists by integrating capabilities such as machine learning, graph algorithms, DataFrames, and real-time streaming within a single framework. Unlike with the Hadoop framework, a single installation gives you access to all these analytics tools.
