#bigdata 31e — Apache Spark, Storm, and Flink

Batch processing consists of operations on large sets of static data, based on reads and writes to disk, with the result returned only after the computation has been completed.

Hadoop with MapReduce is a typical batch operation and is therefore slower than the so-called "in-memory" approaches.

Remember that we are talking about terabytes (TB), petabytes (PB), and exabytes (EB) of data to be processed.

Spark is a framework that does not use Hadoop's MapReduce layer. Its primary motivation is to carry out processing in memory instead of on disk, as MapReduce does.

Figure — Apache Spark
(Credits Apache Foundation)

The performance gain is enormous because in-memory data access takes nanoseconds, while disk access takes milliseconds.

Spark emerged at the University of California, Berkeley in 2009 as a research project to accelerate the execution of machine-learning algorithms on the Hadoop platform and became a top-level project of the Apache Foundation.

The creators of Spark founded the company Databricks (www.databricks.com), which continues the development of Spark and offers a platform and services around it.

Spark was developed to address the limitations of MapReduce, which does not keep data in memory after processing for immediate reuse and analysis.

Figure — Spark Ecosystem (Credits databricks)

Spark provides libraries for Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing), which allow the use and construction of new software libraries.
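
To make this concrete, below is a minimal sketch in Java of how the in-memory model and the Spark SQL layer work together; the input file people.json, the column age, and the local master setting are placeholders for this example.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkInMemoryExample {
    public static void main(String[] args) {
        // Local session for the example; on a cluster the master is set at submit time
        SparkSession spark = SparkSession.builder()
                .appName("spark-in-memory-example")
                .master("local[*]")
                .getOrCreate();

        // "people.json" is a placeholder input file
        Dataset<Row> people = spark.read().json("people.json");

        // cache() keeps the dataset in memory after the first action,
        // so later queries reuse it instead of rereading the disk
        people.cache();
        people.createOrReplaceTempView("people");

        spark.sql("SELECT COUNT(*) FROM people").show();         // first action: loads and caches
        spark.sql("SELECT * FROM people WHERE age > 30").show();  // reuses the cached, in-memory data

        spark.stop();
    }
}
```

The second query runs against the cached dataset rather than going back to disk, which is where the in-memory gain over MapReduce comes from.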

Spark developers can use Scala, Python, Java, and R.

Spark is becoming essential for companies that want to implement Big Data and is also crucial in the training of data scientists.

Apache Storm

Streaming data is created, recorded, and analyzed in real time. It includes a wide variety of data, such as device-generated logs, e-commerce transactions, sensor data from the Internet of Things, social networking information, and geospatial services, among others.

Storm is a data-streaming framework in the Hadoop ecosystem that focuses on workloads requiring near-real-time processing.

Figure — Apache Storm
(Credits Apache Foundation)

It is scalable, fault-tolerant, and easy to install, configure, and program. It can manage large volumes of data and produce results in less time than similar solutions.

Storm can be used for real-time data analysis, machine learning, the Internet of Things, and ETL, among other applications.
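
As an illustration of the programming model, here is a minimal sketch assuming the Storm 2.x Java API; the class names, readings, and threshold are invented for this example. A spout emits simulated sensor values and a bolt reacts to them as they arrive.

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SensorTopology {

    // Invented spout: emits a random temperature reading every 100 ms
    public static class TemperatureSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(20 + random.nextInt(20)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("temperature"));
        }
    }

    // Invented bolt: prints readings above an arbitrary threshold
    public static class AlertBolt extends BaseRichBolt {
        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context,
                            OutputCollector collector) {
        }

        @Override
        public void execute(Tuple tuple) {
            int temperature = tuple.getIntegerByField("temperature");
            if (temperature > 35) {
                System.out.println("High temperature: " + temperature);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("readings", new TemperatureSpout());
        builder.setBolt("alerts", new AlertBolt()).shuffleGrouping("readings");

        // Runs locally for the example; a production topology would use StormSubmitter
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("sensor-topology", new Config(), builder.createTopology());
            Utils.sleep(10_000);
        }
    }
}
```

In Storm's terminology, spouts are the stream sources and bolts do the processing; the shuffleGrouping call distributes tuples from the spout across bolt instances.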

Apache Flink

Flink is a data-streaming framework for the Hadoop ecosystem that also handles batch processing.

In this case, the data stream is finite: you collect a certain amount of data, such as 500,000 tweets from Twitter, and handle it as a batch to be processed and analyzed.

Figure — Apache Flink
(Credits Apache Foundation)

Flink handles batch processing as a subset of stream processing: streaming is the primary processing model, and finite streams are used to represent batches and deliver the results of data analysis.

Its applications can be considered largely the same as those we associate with Storm, but Flink also works with finite streams, that is, bounded portions of the data to be analyzed.
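
To illustrate the bounded-stream idea, here is a minimal sketch using Flink's DataStream API in Java; the sample strings and job name are invented, and a real application would read from a source such as Kafka instead.

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedStreamExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A bounded (finite) stream: a fixed set of elements processed like a batch
        DataStream<String> tweets = env.fromElements(
                "big data", "apache flink", "stream processing", "batch processing");

        tweets
            .filter(new FilterFunction<String>() {
                @Override
                public boolean filter(String value) {
                    return value.contains("processing");
                }
            })
            .print();

        env.execute("bounded-stream-example");
    }
}
```

The same pipeline would handle an unbounded source in exactly the same way; only the source changes, which is what treating batch as a special case of streaming means in practice.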

Flink is free software from the Apache Foundation (https://flink.apache.org).

Apache Kafka and Samza

The Hadoop ecosystem grows every day in search of performance.

Two more tools oriented toward streaming data have emerged recently: Apache Kafka and Apache Samza.

Apache Kafka is a distributed streaming platform used to build applications on top of data flows captured in real time (called pipelines), running on tens or hundreds of clusters at the same time.

Figure — Apache Kafka
(Credits Apache Foundation)
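
For illustration, a minimal sketch in Java of a producer feeding such a pipeline; the broker address, topic name, key, and value are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Records published to a topic can be read by any number of consumers,
        // including stream processors such as Storm, Flink, Samza, or Spark Streaming
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clickstream", "user-42", "page_view"));
        }
    }
}
```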

Apache Samza is a framework for distributed processing of streaming data.

Samza uses Kafka for messaging and YARN for fault tolerance and resource management.

CURIOSITIES

  1. Spark can be 100 times faster than MapReduce using “in-memory” processing.
  2. The difference between Storm and Flink for streaming-data analysis is that Flink also works with finite batches of data treated as streams, while Storm focuses on real-time, performance-oriented data analysis.
  3. Hadoop lays the foundation for Spark to work, especially with distributed HDFS storage.
  4. Hadoop was developed in Java, and Spark in Scala, a programming language that contains compelling properties for data manipulation.
  5. The choice of a data streaming framework depends on the type of application developed, the configurations of the servers, and the size and resources offered by the Hadoop network.

More information about this article

Article selected from the eBook “Big Data for Executives and Market Professionals.”
eBook in English: Amazon or Apple Store
eBook in Portuguese: Amazon or Apple Store

--

🇵🇹 José Antonio Ribeiro Neto (Zezinho).

🇵🇹 Lisbon. AI Researcher & Author | USA WebCT IT Exec | Portuguese-European Citizen | Tech Educ Exec | Digitalis Portugal Partner | Ex-athlete | Peace to All.