Apache Spark Basics

Tanja Adžić
9 min read · Apr 7, 2024


In this blog, I explore Apache Spark and its importance in handling Big Data. I’ll showcase its distributed computing capabilities, efficiency in processing large datasets, and integration with Big Data frameworks. I’ll go through parallel processing, resilience, and scalability features, and provide insights for data engineers and other interested individuals.

Contents

Why use Apache Spark?
Functional Programming
Parallel Programming using Resilient Distributed Datasets
Scale out — Data Parallelism in Apache Spark
Dataframes and Apache Spark

Why use Apache Spark?

Apache Spark is an open-source, in-memory application framework designed for distributed data processing and iterative analysis of massive datasets. Here are the key points of Apache Spark:

Open Source Foundation

Spark’s foundation within the Apache ecosystem shows its collaborative and community-driven nature. Being open source encourages transparency, innovation, and widespread adoption across various industries and organizations.

In-Memory Processing

One of Spark’s defining features is its ability to perform operations in memory (RAM) across the computing cluster. By reducing the need for frequent disk access, Spark achieves very high processing speed and efficiency, making it well suited for applications requiring real-time or near-real-time analytics.

Distributed Data Processing

Spark uses the power of distributed computing, where multiple nodes or machines work together to process and analyze data in parallel. This distributed architecture enables Spark to tackle massive datasets that would overwhelm traditional single-machine systems. Also, it provides scalability, as additional resources can be seamlessly added to the cluster to accommodate growing data volumes and processing demands.

Scalability

Spark’s architecture is designed for horizontal scalability, allowing organizations to effortlessly scale their data processing capabilities by adding more nodes to the Spark cluster. This modular approach to growth ensures that Spark remains adaptable to evolving business needs and data requirements, without compromising on performance or efficiency.

Fault Tolerance and Redundancy

Spark prioritizes fault tolerance and redundancy, crucial for maintaining data integrity and system reliability in large-scale distributed environments. By tracking how each dataset is derived (its lineage) so that lost partitions can be recomputed, and by relying on the redundancy of the underlying storage layer, Spark ensures that processing can continue even in the event of node failures or network disruptions.

Comparison with MapReduce

In contrast to traditional big data processing frameworks like MapReduce, Spark offers significant performance improvements by minimizing costly read/write operations to disk. By keeping data predominantly in-memory and leveraging efficient data processing techniques, Spark achieves orders of magnitude faster processing speeds, reducing overall computation time and resource overhead.

Versatility

Spark’s versatility extends beyond its core capabilities for distributed data processing. It provides a comprehensive ecosystem of libraries and tools tailored for various data engineering and data science tasks. From data manipulation and querying with Spark SQL to machine learning with SparkML (MLlib) and streaming analytics with Spark Streaming, Spark offers a unified platform for end-to-end big data solutions.

Functional Programming

Functional Programming (FP) is a programming style rooted in the mathematical notion of a function, like the f(x) notation from algebra classes. Unlike imperative programming, FP emphasizes declarative syntax, focusing on “what” needs to be done rather than “how” it is done. This abstraction of implementation details streamlines code comprehension and keeps the emphasis on the final output, “the what.”

Historically, the LISt Processing Language (LISP) pioneered FP in the 1950s, but today, various languages, including Scala, Python, R, and Java, offer functional programming capabilities. Scala, in particular, stands out as a recent addition to this family of languages and serves as the primary language for Apache Spark’s development.

Scala’s treatment of functions as first-class citizens empowers functional programming. Functions can be passed as arguments, returned by other functions, and treated as variables, facilitating elegant and concise code structures.
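
Python, which most of the examples below use, treats functions the same way. Here is a minimal sketch (not from the original post) of a function being passed around as an ordinary value:

def apply_twice(f, x):
    # f arrives as a regular argument and is called like any other value
    return f(f(x))

def increment(x):
    return x + 1

print(apply_twice(increment, 3))  # 5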

Consider a simple functional program that increments numbers by one. Defining a function f(x) = x + 1 allows it to be applied directly to an entire list or array, exemplifying the conciseness of functional programming compared to procedural approaches.

Here is a procedural (imperative) version of the increment problem in Python:

def increment_by_one(ls):
    # Imperative approach: loop over the indices and mutate the list in place
    n = len(ls)
    for i in range(n):
        ls[i] += 1
    return ls
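
For contrast, a functional version of the same problem (a sketch, not part of the original course material) applies f(x) = x + 1 across the whole list with map, with no explicit loop and no mutation:

# Functional style: apply f(x) = x + 1 to every element at once
increment = lambda x: x + 1

numbers = [1, 2, 3, 4, 5]
print(list(map(increment, numbers)))  # [2, 3, 4, 5, 6]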

Parallelization

Functional programming’s emphasis on “what” rather than “how” extends to parallelization, a significant benefit. By breaking work into chunks that can be computed in parallel across nodes, FP enables efficient use of resources without altering function definitions or code. This implicit parallelization scales seamlessly, accommodating datasets of any size by simply adding compute resources, as the sketch below illustrates.
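
As a rough illustration outside of Spark, using Python’s standard multiprocessing module rather than anything Spark-specific, the same pure function can be run serially or in parallel without changing its definition:

from multiprocessing import Pool

def add_one(x):
    return x + 1

if __name__ == "__main__":
    data = list(range(10_000))

    # Serial execution
    serial = list(map(add_one, data))

    # Parallel execution across 4 worker processes; add_one itself is unchanged
    with Pool(processes=4) as pool:
        parallel = pool.map(add_one, data)

    print(serial == parallel)  # True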

Lambda calculus

Underlying functional programming is lambda calculus, a mathematical model of computation in which every computation is expressed as the application of (possibly anonymous) functions to their arguments. Lambda functions, or operators, facilitate concise and expressive code, as seen in the Scala and Python examples:

// Scala: anonymous (lambda) function
val add: (Int, Int) => Int = (x: Int, y: Int) => x + y

val result = add(3, 5)
println(s"Result: $result")

# Python: anonymous (lambda) function
add = lambda x, y: x + y

result = add(3, 5)
print("Result:", result)

Apache Spark leverages functional programming principles to efficiently process big data. By distributing work between worker nodes and parallelizing computations, Spark programs inherently scale, regardless of data size. Adding resources to the Spark cluster enables seamless handling of datasets ranging from kilobytes to petabytes, exemplifying the scalability and efficiency of functional programming in data processing.

Parallel Programming using Resilient Distributed Datasets

A Resilient Distributed Dataset (RDD) serves as the foundational data abstraction in Apache Spark. RDDs are collections of fault-tolerant elements distributed across nodes in a cluster, capable of undergoing parallel operations. Immutable in nature, RDDs cannot be changed once created, ensuring data integrity and resilience.

Spark applications

Every Spark application revolves around a driver program responsible for executing the main functions and coordinating parallel operations across the cluster. RDDs support various file types, including text, sequence files, Avro, Parquet, and Hadoop input formats, as well as integration with local filesystems, Cassandra, HBase, HDFS, Amazon S3, and numerous relational and NoSQL databases.

Creating an RDD in Spark

Creating an RDD in Spark can be achieved through several methods. Firstly, one can utilize an external or local file from a supported Hadoop filesystem. Alternatively, the parallelize function can be applied to an existing collection within the driver program, which can be written in Python, Java, or Scala. The provided code snippets illustrate the creation of an RDD from an existing collection, such as a list:

// SCALA
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

# PYTHON
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
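
The other creation method mentioned above, loading an external or local file, looks like this in Python (a minimal sketch; the file name data.txt is a hypothetical placeholder):

# Each element of the resulting RDD is one line of the file
lines = sc.textFile("data.txt")
print(lines.count())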

When creating parallel collections, specifying the number of partitions is crucial to efficient task distribution. Spark typically sets the number of partitions automatically based on the cluster configuration, but manual specification is also possible. Each partition corresponds to a task executed by a worker node, with the recommended ratio being two to four partitions per CPU.
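
As a small illustration (a sketch, not from the original post), the partition count can be passed as the second argument to parallelize:

data = list(range(100))

# Let Spark choose the number of partitions automatically...
rdd_auto = sc.parallelize(data)

# ...or set it explicitly, here 4 partitions as an arbitrary example
rdd_manual = sc.parallelize(data, 4)
print(rdd_manual.getNumPartitions())  # 4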

Parallel Programming

Parallel programming, similar to distributed programming, leverages multiple compute resources simultaneously to solve computational tasks. Tasks are divided into discrete parts, solved concurrently by multiple processors accessing a shared memory pool. RDDs facilitate parallel programming by enabling operations to be inherently performed in parallel based on how RDDs are created and distributed across the cluster.
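
To make this concrete, here is a minimal sketch of a map followed by a reduce; each partition is processed in parallel by the executors, and the partial results are combined:

rdd = sc.parallelize(range(1, 11), 4)

# map runs independently on each partition; reduce merges the partial results
squares = rdd.map(lambda x: x * x)
total = squares.reduce(lambda a, b: a + b)
print(total)  # 385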

Resilience and Spark

RDDs ensure resilience in Spark through immutability and caching. As immutable data structures, RDDs are inherently recoverable, while caching allows datasets to be stored in memory across operations, enhancing performance significantly. By persisting partitions in memory, RDDs enable faster execution of subsequent actions, particularly beneficial for iterative algorithms and interactive use cases.
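
A minimal caching sketch (reusing the hypothetical data.txt from earlier): the first action materializes the RDD and keeps its partitions in memory, so later actions reuse them instead of re-reading the file:

lines = sc.textFile("data.txt")
lines.cache()  # mark the RDD to be kept in memory once computed

print(lines.count())                                 # first action: reads the file and fills the cache
print(lines.filter(lambda l: "error" in l).count())  # later action: served from the cached partitions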

Scale out — Data Parallelism in Apache Spark

The Apache Spark architecture consists of three main components that enable distributed data processing.

1. Data Storage

The first component is data storage, which is responsible for loading datasets from storage into memory. Apache Spark accepts datasets from any Hadoop-compatible data source, ensuring flexibility in data ingestion.

2. Compute Interface

The second component is the compute interface: high-level programming APIs available in Scala, Python, and Java. These APIs facilitate seamless interaction with Spark, allowing developers to write code in their preferred language.

3. Management

The final component is the cluster management framework, which orchestrates the distributed computing aspects of Spark. This framework, which can operate as a standalone server, Mesos, Yet Another Resource Negotiator (YARN), or another compatible platform, is vital for scaling big data processing tasks.

Visualizing these components, data flows from a Hadoop file system into the compute interface/API, which then delegates tasks to different nodes for distributed, parallel processing.

Spark Core

At the heart of Spark lies the Spark Core, often referred to simply as “Spark.” This fault-tolerant engine manages memory, task scheduling, and the distribution of RDDs (Resilient Distributed Datasets) and other data types across the cluster. Spark Core serves as the base engine for large-scale parallel and distributed data processing.

Scaling

To understand how Spark scales with big data, let’s look at the Spark application architecture, which consists of a driver program and executor programs.

Executor programs operate on worker nodes, with Spark capable of starting additional executor processes on workers if sufficient memory and cores are available. Executors can also utilize multiple cores for multithreaded calculations.

Communication between the driver and the executors is crucial. The driver, much like executive management in a company, oversees Spark jobs, splitting them into tasks that are submitted to the executors. The driver then receives the results as tasks complete, orchestrating the overall data processing workflow.

In this analogy, the executor programs represent junior employees executing assigned tasks with provided resources, while worker nodes correspond to physical office space. Scaling big data processing involves adding additional worker nodes incrementally to accommodate growing processing demands.
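
As a rough sketch of what this looks like in code (the property names are standard Spark configuration keys, but the values are placeholders, and the cluster manager is normally chosen when the application is submitted):

from pyspark.sql import SparkSession

# Request two executors, each with 2 cores and 2 GB of memory (placeholder values)
spark = (
    SparkSession.builder
    .appName("scaling-example")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)

sc = spark.sparkContext  # the driver's entry point for creating RDDs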

Dataframes and Apache Spark

Spark SQL serves as an important module within Apache Spark, and is meant for structured data processing. It offers developers the flexibility to interact with data through SQL queries and the DataFrame API, supporting Java, Scala, Python, and R APIs. Regardless of the chosen API or language, Spark SQL employs the same execution engine to compute results, ensuring consistency across computations.

Developers can leverage the API that best suits the transformation they intend to perform, ensuring a natural and intuitive coding experience. For instance, executing a SQL query in Spark SQL, as demonstrated in the Python example, requires registering the desired data entity as a table view before execution:

results = spark.sql("SELECT * FROM people")

# In PySpark, row-level transformations like map go through the underlying RDD
names = results.rdd.map(lambda p: p.name)

In contrast to the basic Spark RDD API, Spark SQL offers advanced features such as a cost-based optimizer, columnar storage, and code generation. Because the engine has additional information about both the structure of the data and the computation being performed, it can use these features to improve performance.

DataFrames

But what exactly is a DataFrame? A DataFrame represents a collection of data organized into named columns, akin to a relational database table or a data frame in R/Python. Built on top of Spark’s RDD API, DataFrames leverage RDDs to execute relational queries efficiently.

Creating and manipulating DataFrames is straightforward, as demonstrated in the Python code snippet, which reads data from a JSON file and registers it as a DataFrame for further processing:

# Read a JSON file into a DataFrame
df = spark.read.json('people.json')
df.show()
df.printSchema()

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createTempView('people')
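
With the people view registered, the data can now be queried with SQL (a small sketch that assumes the JSON file contains name and age fields, as in Spark’s standard example file):

adults = spark.sql("SELECT name FROM people WHERE age > 21")
adults.show()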

Benefits of DataFrames

DataFrames offer scalability, supporting datasets ranging from a few kilobytes on a single machine to petabytes on large clusters. They also support a wide range of data formats and storage systems, and their integration with Spark and APIs for Python, Java, Scala, and R makes them developer-friendly.
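
As a brief illustration of that developer-friendly API (a sketch reusing the df from the earlier snippet and again assuming an age column), the same kind of query can be written without SQL:

from pyspark.sql import functions as F

# Filter and aggregate using DataFrame methods instead of a SQL string
df.filter(F.col("age") > 21).groupBy("age").count().show()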

DISCLAIMER

Disclaimer: The notes and information presented in this blog post were compiled during the course “Introduction to Big Data with Spark and Hadoop” and are intended to provide an educational overview of the subject matter for personal use.
