Apache Spark : The Good & The Bad

You could also describe Spark as a distributed, data processing engine for batch and streaming modes featuring SQL queries, graph processing, and machine learning.

In contrast to Hadoop's two-stage disk-based MapReduce computation engine, Spark's multi-stage in-memory computing engine allows for running most computations in memory.

Hence provides better performance for certain applications, e.g. iterative algorithms or interactive data mining.

Aims at speed, ease of use, extensibility and interactive analytics.

Using Spark Application Frameworks, Spark simplifies access to machine learning and predictive analytics at scale.

Access any data type across any data source.

The Apache Spark project is an umbrella for SQL (with Datasets), streaming, machine learning (pipelines) and graph processing engines built on top of the Spark Core.

You can run them all in a single application using a consistent API.

Spark runs locally as well as in clusters, on-premises or in cloud. It runs on top of Hadoop YARN, Apache Mesos, standalone or in the cloud (Amazon EC2 or IBM Bluemix).

Apache Spark | The Good

It does in-memory, distributed and iterative computation, which is particularly useful when working with machine learning algorithms. Other tools might require writing intermediate results to disk and reading them back into memory, which can make using iterative algorithms painfully slow.

Spark's API is truly appealing. Users can choose from multiple languages: Python, R, Scala and Java. Spark offers a data frame abstraction with object-oriented methods for transformations, joins, filters and more. This object orientation makes it easy to create custom reusable code that is also testable with mature testing frameworks.

"Lazy execution" is especially helpful as it allows you to define a complex series of transformations represented as an object. Further, you can inspect the structure of the end result without even executing the individual intermediate steps.

And Spark checks for errors in the execution plan before submitting so that bad code fails fast.

Easy Conversion

PySpark offers a " toPandas()" method to seamlessly convert Spark DataFrames to Pandas, and its " SparkSession.createDataFrame() " can do the reverse.

"toPandas()" method allows you to work in-memory once Spark has crunched the data into smaller datasets. When combined with Pandas' plotting method, you can chain together commands to join your large datasets, filter, aggregate and plot all in one command.

Python is a language that enables rapid operationalization of data and the PySpark package extends this functionality to massive datasets.

Also Read | Think ElasticSearch Is Just for Search? Stretch Again!

Easy Transformations

Pivoting is a challenge for many big data frameworks. In SQL, it typically requires many case statements. Spark has an easy and intuitive way of pivoting a DataFrame. The user simply performs a "groupBy" on the target index columns, a pivot of the target field to use as columns and finally an aggregation step. It executes surprisingly fast and is also easy to use.

Another asset of Spark is the " map-side join " broadcast method. This method speeds up joins significantly when one of the tables is smaller than the other and can fit in its entirety on individual machines.

The smaller one gets sent to all nodes so the data from the bigger table doesn't need to be moved around. This also helps mitigate problems from skew. If the big table has a lot of skew on the join keys, it will try to send a large amount of data from the big table to a small number of nodes to perform the join and overwhelm those nodes.

Open Source Community

Spark has a massive open-source community behind it. The community improves the core software and contributes practical add-on packages.

For example, a team has developed a natural language processing library for Spark. Previously, a user would either have to use other software or rely on slow user-defined functions to leverage Python packages such as Natural Language Toolkit.

Must Read | Technological Trend Can Help Your App Keep The Warmth!

Apache Spark | The Bad

Cluster Management

Apache Spark is notoriously difficult to tune and maintain.

If your cluster isn't expertly managed, this can negate "the Good" as we described above. Jobs failing with out-of-memory errors is very common and having many concurrent users makes resource management even more challenging.

Do you go with fixed or dynamic memory allocation? How many of your cluster's cores do you allow Spark to use?

How much memory does each executor get? How many partitions should Spark use when it shuffles data?

Getting all these settings right for data science workloads is difficult.

Debugging

Debugging Spark can be frustrating. The client-side type checking for " DataFrame" operations in PySpark can catch some bugs. But memory errors and errors occurring within user-defined functions can be difficult to track down.

Distributed systems are inherently complex, and so it goes for Spark. Error messages can be misleading or suppressed. Logging from a PySpark User Defined Function ( UDF) is difficult and introspection into current processes is not feasible.

Creating tests for your UDFs that run locally helps, but sometimes a function that passes local tests fail when running on the cluster. Figuring out the cause in those cases is challenging.

Slowness of PySpark UDFs

PySpark UDFs are much slower and more memory-intensive than Scala and Java UDFs are. The performance skew towards Scala and Java is understandable, since Spark is written in Scala and runs on the Java Virtual Machine (JVM).

Python UDFs require moving data from the executor's JVM to a Python interpreter, which is slow. If Python UDF performance is problematic, Spark does enable a user to create Scala UDFs, which can be run in Python. However, this slows down development time.

Technology | How IoT ( Internet of Things) works ? Step by step process explained

Hard-to-Guarantee Maximal Parallelism

One of Spark's key value propositions is distributed computation, yet it can be difficult to ensure Spark parallelizes computations as much as possible. Spark tries to elastically scale how many executors a job uses based on the job's needs, but it often fails to scale up on its own.

So if you set the minimum number of executors too low, your job may not utilize more executors when it needs them.

Spark divides RDDs (Resilient Distributed Dataset)/DataFrames into partitions, which is the smallest unit of work that an executor takes on. If you set too few partitions, then there may not be enough chunks of work for all the executors to work on. Also, fewer partitions means larger partitions, which can cause executors to run out of memory.