#bigdata 31e — Apache Spark, Storm, and Flink
Batch processing operates on large sets of static data, reading from and writing to disk and returning the result only after the entire computation has completed.
Hadoop with MapReduce is a typical batch system, and therefore slower than so-called "in-memory" processing.
Keep in mind that we are talking about terabytes, petabytes, and even exabytes of data to be processed.
Spark is a framework that does not use Hadoop's MapReduce layer. Its primary motivation is to process data in memory rather than on disk, as MapReduce does.
The performance gain is enormous: access to in-memory data takes nanoseconds, while disk access takes milliseconds.
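To make the contrast concrete, here is a toy plain-Python sketch, not Spark code (all function names here are illustrative), of why keeping data in memory helps iterative jobs: a MapReduce-style job re-reads its input from disk on every pass, while a Spark-style job reads it once and caches the data in memory.

```python
# Toy sketch (plain Python, not Spark) of in-memory caching vs.
# re-reading from disk in an iterative job.

disk_reads = {"count": 0}

def read_from_disk():
    """Stand-in for an expensive HDFS read."""
    disk_reads["count"] += 1
    return list(range(1_000))

def iterative_job_rereading(iterations):
    """MapReduce style: input is re-read from disk on every pass."""
    total = 0
    for _ in range(iterations):
        data = read_from_disk()      # hits "disk" every iteration
        total += sum(data)
    return total

def iterative_job_cached(iterations):
    """Spark style: read once, then iterate over the in-memory copy."""
    data = read_from_disk()          # one read, then cached in memory
    total = 0
    for _ in range(iterations):
        total += sum(data)
    return total

disk_reads["count"] = 0
iterative_job_rereading(10)
print(disk_reads["count"])           # 10 disk reads

disk_reads["count"] = 0
iterative_job_cached(10)
print(disk_reads["count"])           # 1 disk read
```

Both jobs compute the same total; the cached version simply avoids nine of the ten simulated disk reads, which is the essence of Spark's advantage for iterative workloads such as machine learning.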
Spark emerged at the University of California, Berkeley in 2009 as a research project to accelerate the execution of machine learning algorithms on the Hadoop platform, and it became a core project of the Apache Foundation.
The creators of Spark founded the company Databricks (www.databricks.com), which continues Spark's development and offers a commercial platform and services.
Spark was developed to address a limitation of MapReduce, which does not keep data in memory after processing for immediate reuse and analysis.
Spark provides layers such as Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing), which also allow new software libraries to be built on top of them.
Spark developers can use Scala, Python, Java, and R as programming languages.
Spark is becoming essential for companies that want to implement Big Data, and it is also a crucial part of data scientist training.
Streaming data is created, recorded, and analyzed in real time. It includes a wide variety of data, such as device-generated logs, e-commerce transactions, sensor data from the Internet of Things, social network activity, geospatial services, and more.
Storm is a framework in the Hadoop ecosystem for streaming data, focused on workloads that require near-real-time processing.
It is scalable, fault tolerant, and easy to install, configure, and program. It can manage large amounts of data and produce results faster than similar solutions.
Storm can be used for real-time data analysis, machine learning, the Internet of Things, ETL, and more.
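Storm programs are topologies of spouts (data sources) and bolts (processing steps) that handle one tuple at a time. The sketch below models that idea in plain Python generators; it is not Storm's actual API (Storm is programmed in Java/Clojure), and the names are illustrative.

```python
# Plain-Python sketch of Storm's spout/bolt model: a spout emits tuples
# one at a time, and bolts process each tuple as it arrives, so results
# are available continuously rather than at end-of-job.

def sentence_spout(sentences):
    """Spout: emits one tuple (a sentence) at a time."""
    for sentence in sentences:
        yield sentence

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream, counts):
    """Bolt: keeps a running count, updated per tuple."""
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]   # intermediate result for every tuple

counts = {}
topology = count_bolt(split_bolt(sentence_spout(
    ["big data", "fast data", "big fast data"])), counts)

for word, running_total in topology:
    print(word, running_total)     # e.g. "data 1", then later "data 2" ...
```

Note that the running totals are emitted while data is still flowing; a batch word count would only produce the final totals after reading all input.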
Flink is a streaming-data framework in the Hadoop ecosystem that also handles batch processing.
In this model a data stream may be finite (bounded): you collect a certain amount of data, such as 500,000 tweets from Twitter, and handle it as a batch to be processed and analyzed.
Flink treats batch processing as a subset of stream processing: streaming is the primary processing method, and bounded streams are used to produce batch-style results from data analysis.
Its applications are much the same as those mentioned for Storm, but framed as streams, including finite portions of the data to be analyzed.
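The "batch is a bounded stream" idea can be sketched in plain Python (this is not Flink's API; the pipeline and source names are illustrative): the same pipeline code runs over an unbounded stream or over a finite one, and the bounded run simply terminates with a final result.

```python
import itertools

def word_count(stream):
    """One pipeline, usable for both bounded and unbounded input."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield dict(counts)           # emit the current state per element

def endless_words():
    """Unbounded source: cycles forever."""
    return itertools.cycle(["spark", "flink", "storm"])

# Batch = bounded stream: feed a finite collection through the pipeline
# and take the last emitted state as the batch result.
batch = ["flink", "batch", "flink", "stream"]
*_, final = word_count(iter(batch))
print(final)                         # {'flink': 2, 'batch': 1, 'stream': 1}

# Streaming = the same pipeline over an unbounded source; here we just
# peek at the first few intermediate states instead of waiting for an end.
for state in itertools.islice(word_count(endless_words()), 3):
    print(state)
```

The key point is that no separate "batch engine" is needed: bounded input is just a stream that happens to end, which is exactly how Flink unifies the two models.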
Flink is free software from the Apache Foundation (https://flink.apache.org).
Apache Kafka and Samza
The Hadoop ecosystem grows every day in search of performance.
Two more streaming-oriented tools have recently emerged: Apache Kafka and Apache Samza.
Apache Kafka is a distributed streaming platform used to build applications around data pipelines captured in real time, running across tens or hundreds of clusters at the same time.
Apache Samza is a framework for distributed processing of streaming data.
Samza uses Kafka for messaging and YARN for fault tolerance and resource management.
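Kafka's core abstraction is an append-only log per topic, read by consumers that each track their own offset, so many independent consumers (for example, Samza jobs) can replay the same stream. The toy in-memory class below sketches that idea; it is not the real Kafka API, and all names are illustrative.

```python
# Toy in-memory sketch (not Kafka's real API) of an append-only log
# with consumer-tracked offsets.

class ToyLog:
    """Append-only log for one topic."""
    def __init__(self):
        self.records = []

    def produce(self, value):
        """Append a record and return its offset in the log."""
        self.records.append(value)
        return len(self.records) - 1

    def consume(self, offset):
        """Return records from `offset` on; the consumer keeps its offset."""
        return self.records[offset:]

log = ToyLog()
for event in ["click", "view", "click"]:
    log.produce(event)

# A consumer keeps its own offset into the shared log.
offset_a = 0
batch_a = log.consume(offset_a)      # sees all three events
offset_a += len(batch_a)

log.produce("purchase")
print(log.consume(offset_a))         # ['purchase'] -- only the new event
```

Because records are never removed by reading, a second consumer starting at offset 0 would see the full history, which is what lets tools like Samza reprocess a stream after a failure.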
- Spark can be up to 100 times faster than MapReduce by using in-memory processing.
- The difference between Storm and Flink for streaming analysis is that Flink also targets finite (batch) data analysis using streams, while Storm focuses on real-time, performance-oriented data analysis.
- Hadoop lays the foundation for Spark to work on, especially through distributed HDFS storage.
- Hadoop was developed in Java, and Spark in Scala, a programming language with compelling properties for data manipulation.
- The choice of a data streaming framework depends on the type of application developed, the configurations of the servers, and the size and resources offered by the Hadoop network.