#bigdata 24e — Hadoop System Technology

I — BIG DATA HADOOP TECHNOLOGY

The technology that enabled data scalability in Big Data is called Hadoop.

It is a free software platform, written in Java, for cluster-oriented distributed computing and for processing large volumes of data, with attention to fault tolerance.

If Windows is the operating system of microcomputers, Hadoop is the operating system of Big Data.

Doug Cutting, the creator of Hadoop, built it on two Google papers published in 2003 and 2004: one on a distributed file system and another on a programming model called MapReduce.

Figure — Doug Cutting (credits Doug Cutting)

He is currently Chief Architect at Cloudera, the most significant Hadoop distributor in the world.

I.1 — CURIOSITIES:

  1. Doug Cutting named the system he was developing after his son's toy elephant, affectionately called “Hadoop.”
  2. Hadoop hides much of the complexity of high-performance computing. It can be installed on conventional machines, takes advantage of parallel processing across a computer network, is fault-tolerant, and requires few administrators and developers.
  3. Hadoop is open source, written in Java, and was inspired by Google’s GFS and MapReduce.
  4. Hadoop became a subproject of Apache Lucene in 2006, and in 2008 became a top-level Apache Software Foundation project.
  5. Yahoo has long been the most significant contributor to the Hadoop Project.
  6. In 2010, Facebook announced that it was running Hadoop on 2,900 servers with 30 PB of data.
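
To make the MapReduce idea mentioned above more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: it only imitates, on a single machine, the three steps the model is usually described with (map, shuffle, reduce). The function names and the sample sentences are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this between map and reduce;
    here it is simulated with an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence over a list of text splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input: each string stands in for a block of a large file.
    splits = ["big data needs hadoop", "hadoop stores big data"]
    print(word_count(splits))
```

On a real cluster, each call to the map step would run on the machine that already holds that block of data, which is what lets the approach scale across conventional servers.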

II — HADOOP INSTALLATIONS

Hadoop is excellent for working with massive volumes of data in Big Data.

Figure — Hadoop logo (credits Apache Foundation)

Let’s describe some Hadoop installations.

1 — Yahoo

  • Yahoo has more than 120,000 servers running Hadoop and 800 PB of data storage. Few companies in the world have a comparable infrastructure, and it is worth remembering that Doug Cutting developed Hadoop in a project within Yahoo.

2 — Facebook

  • One of the largest Hadoop clusters in the world belongs to Facebook. It has over 4,000 machines and holds hundreds of millions of gigabytes. Developers use Hive, which offers a SQL-like query language (HiveQL), to search data on Hadoop servers.

3 — Hortonworks

  • According to Hortonworks, a Hadoop platform provider, just one of its leading customers runs 4,500 servers and manages 200 PB of data, with more than one billion files and data blocks on the Hadoop platform.

II.1 — CURIOSITIES:

  1. Google, Yahoo, and Facebook are Hadoop’s darlings: they grew up with this technology from the ground up, developing new Big Data knowledge for the free software market.
  2. Hadoop has become accessible to small and medium businesses through cloud-based Big Data services.
  3. World data volumes are now measured in petabytes and, by 2020, should reach zettabytes (1 ZB = 1,000,000,000,000,000,000,000 bytes). Hadoop is the technology that can process all these volumes of data.
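
The storage units used in this section are easy to mix up, so here is a small Python sketch of the arithmetic. It assumes decimal (SI) units, where each step is a factor of 1,000; the `UNITS` table and the helper names are illustrative, not part of any Hadoop tool.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def to_bytes(value, unit):
    """Convert a quantity in the given unit to bytes."""
    return value * UNITS[unit]


def convert(value, from_unit, to_unit):
    """Convert a quantity from one unit to another."""
    return to_bytes(value, from_unit) / UNITS[to_unit]


if __name__ == "__main__":
    # One zettabyte, written out in bytes and in petabytes.
    print(f"{to_bytes(1, 'ZB'):,} bytes")
    print(f"{convert(1, 'ZB', 'PB'):,.0f} PB")
    # The 30 PB Facebook installation from the first section, in bytes.
    print(f"{to_bytes(30, 'PB'):,} bytes")
```

In these units, one zettabyte equals one million petabytes, which gives a sense of the gap between the installations described above and the projected world total.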

III — HADOOP DISTRIBUTIONS

The stories of Big Data, Hadoop, and Data Science are connected. Many professionals working in these areas at Google, Yahoo, and other large Silicon Valley companies such as LinkedIn have left to create new Hadoop-based companies.

Three Hadoop distributors have stood out:

  1. CLOUDERA (www.cloudera.com)
  2. MAPR (https://mapr.com)
  3. HORTONWORKS (https://br.hortonworks.com)

They allow you to download Hadoop for free and install it on your computer for testing and study.

Amazon Elastic MapReduce, Microsoft Azure HDInsight, Google Cloud, IBM, SAP (Altiscale), and Dell, among others, offer their own Hadoop products and services or support the use of Hadoop from one of these distributors.

Another essential company is Databricks (https://databricks.com), founded by the creators of Spark. It popularized the use of Spark, which processes data in memory and addresses the limitations of batch processing with MapReduce.

Cloudera and Hortonworks recently merged.

III.1 — CURIOSITIES:

  1. Three engineers from Google, Yahoo, and Facebook (Christophe Bisciglia, Amr Awadallah, and Jeff Hammerbacher) teamed up with Mike Olson, a former Oracle executive, to create Cloudera in 2008.
  2. Cloudera was ranked as one of the fastest growing companies in North America (Deloitte’s 2017 Technology Fast 500).
  3. Hortonworks was founded in 2011 by 24 engineers from the original Hadoop team at Yahoo, and it has accumulated more experience in Hadoop than any other organization in the world.
  4. MapR was founded in 2009 with US $9 million in financing from Lightspeed Venture Partners and New Enterprise Associates. Its executives come from Google, Lightspeed Venture Partners, and EMC Corporation.

More information about this article

Article selected from the eBook “Big Data for Executives and Market Professionals.”
eBook in English: Amazon or Apple Store
eBook in Portuguese: Amazon or Apple Store
eBook Web Sites: Portuguese | English

Author. Big Data Researcher. USA WebCT IT Executive. Director of Education and Technology. Portuguese Brazilian citizen. Peace for everyone. bit.ly/2WDtZUA

Jose Antonio Ribeiro Neto (Zezinho)