#bigdata 24e — Hadoop System Technology

I — BIG DATA HADOOP TECHNOLOGY

The technology that enabled data scalability in Big Data is called Hadoop.

It is a free software platform, written in Java, for cluster-oriented distributed computing and for processing large volumes of data, with an emphasis on fault tolerance.

If Windows is the operating system of microcomputers, Hadoop is the operating system of Big Data.

To create Hadoop, Doug Cutting, its inventor, drew on two Google papers published in 2003 and 2004: one on the Google File System and another on a programming model called MapReduce.

Figure — Doug Cutting (credits Doug Cutting)

He is currently Chief Architect at Cloudera, the most significant Hadoop distributor in the world.
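To make the MapReduce programming model mentioned above more concrete, here is a minimal sketch in plain Python of the classic word-count example. It is an illustration only and does not use Hadoop's actual API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in a single process, whereas a real Hadoop job would spread the map and reduce work across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # Assumption: on a cluster, each document (or data block) would be mapped
    # on a different node; here the phases simply run one after another.
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    docs = ["big data hadoop", "hadoop big cluster"]
    print(word_count(docs))
```

The point of splitting the work this way is that the map and reduce steps are independent for each document and each key, so in principle they can be distributed over many machines and rerun on another machine if one fails.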

I.1 — CURIOSITIES:

  1. Doug Cutting named the system he was developing after his son's toy elephant, which the boy had affectionately called “Hadoop.”

II — HADOOP INSTALLATIONS

Hadoop is well suited to working with the massive volumes of data found in Big Data.

Figure — Hadoop logo (credits Apache Foundation)

Let’s describe some Hadoop installations.

1 — Yahoo

  • Yahoo has more than 120,000 servers running Hadoop and 800 PB of data storage. Few companies in the world have comparable infrastructure, and it is worth remembering that Doug Cutting started Hadoop as a project within Yahoo.

2 — Facebook

  • One of the largest Hadoop clusters in the world belongs to Facebook. It has over 4,000 machines and holds hundreds of millions of gigabytes, that is, hundreds of petabytes. Developers use Hive, which offers a SQL-like query language (HiveQL), to search data on Hadoop servers.

3 — Hortonworks

  • According to Hortonworks, a Hadoop platform provider, just one of its leading Hadoop services customers runs 4,500 servers and manages 200 PB of data, with more than one billion files and data blocks on the Hadoop platform.

II.1 — CURIOSITIES:

  1. Google, Yahoo, and Facebook are Hadoop’s darlings: they grew from the ground up with this technology, developing new Big Data knowledge for the free software market.

III — HADOOP DISTRIBUTIONS

The stories of Big Data, Hadoop, and Data Science are connected: many professionals who worked in these areas at Google, Yahoo, and other large Silicon Valley companies such as LinkedIn have left to found new Hadoop-based companies.

Three Hadoop distributors have stood out:

  1. CLOUDERA (www.cloudera.com)

They allow you to download Hadoop for free and install it on your computer for testing and study.

Amazon Elastic MapReduce, Microsoft Azure HDInsight, Google Cloud, IBM, SAP (Altiscale), and Dell, among others, offer their own Hadoop products and services or support the use of Hadoop from one of these distributors.

Another important company is Databricks (https://databricks.com), founded by the creators of Apache Spark. It popularized Spark, which processes data in memory and so addresses the limitations of MapReduce batch processing.

Recently Cloudera and Hortonworks have merged.

III.1 — CURIOSITIES:

  1. Three engineers from Google, Yahoo, and Facebook (Christophe Bisciglia, Amr Awadallah, and Jeff Hammerbacher) teamed up with Mike Olson, a former Oracle executive, to create Cloudera in 2008.

More information about this article

Article selected from the eBook “Big Data for Executives and Market Professionals.”

Author. Big Data Researcher. USA WebCT IT Executive. Director of Technology and Education. Portuguese Brazilian citizen. Soccer player. bit.ly/2WDtZUA