
Hadoop Ecosystem

Dr. Syarif Hidayat, S.Kom., M.I.T.
(Dr. SH)

Big Data

Program Studi Informatika – Program Sarjana

Chapter 4


Foundations for Big Data Systems
and Programming


Why do we worry about foundations?

In a chemistry lab, we really need to understand the chemistry, or theory, behind the practical work before we start handling the test tubes.

Similarly, learning these concepts now will help you understand the practical concepts in Hadoop.


What is a Distributed File System?


This is where the name “file system” comes from.


Long-term information storage:

1. Access the result of a process later
2. Store large amounts of information
3. Enable access by multiple processes

For all these reasons, we store information in files on a hard disk.


There are many of these files, and they are managed by the operating system.

How the operating system manages files is called a file system.
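As a small illustration, every read and write a program performs goes through the operating system's file-system layer; the program never addresses the disk directly. This sketch (file name and contents invented for the example) shows all three storage needs above:

```python
from pathlib import Path
import tempfile

# A scratch directory managed by the OS file system.
tmpdir = Path(tempfile.mkdtemp())
path = tmpdir / "result.txt"

path.write_text("stored for later access")       # 1. long-term storage
contents = path.read_text()                       # 2. access the result later
names = sorted(p.name for p in tmpdir.iterdir())  # 3. the OS tracks the files
print(contents, names)
```

Any other process with the right permissions could open the same path, which is the third requirement above.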


Accessing a File


Personal laptops or desktop computers with a single hard drive

WHAT IF YOU HAVE MORE DATA?

Buy a bigger disk?
Copy data to an external hard drive?


PERSONAL

WORK


Data sets, or parts of a data set, can be replicated across the nodes of a distributed file system.

Distributed file systems replicate data between racks, and also across computers in different geographical regions.

Data replication makes the system more fault tolerant.
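The idea can be sketched in a few lines of Python. This is illustrative only, not how a real distributed file system is implemented; the node names and replication factor are invented for the example:

```python
import random

# Hypothetical cluster: each block is replicated on 3 distinct nodes,
# spread over two racks.
NODES = ["rack1-node1", "rack1-node2", "rack2-node1", "rack2-node2"]
REPLICATION_FACTOR = 3

def place_block(block_id, nodes, replication=REPLICATION_FACTOR):
    """Choose distinct nodes to hold copies of one block."""
    return {block_id: random.sample(nodes, replication)}

def surviving_copies(placement, failed_node):
    """Copies of each block that remain readable after one node fails."""
    return {blk: [n for n in nodes if n != failed_node]
            for blk, nodes in placement.items()}

placement = place_block("block-A", NODES)
# Fail the first node holding block-A; with 3 copies, 2 remain readable.
after_failure = surviving_copies(placement, placement["block-A"][0])
print(after_failure)
```

This is why losing a node, or even a rack, does not make the data unavailable.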


So when some nodes or a rack goes down, the same data can still be found and analyzed in other parts of the system.


Data replication also helps with scaling access to this data by many users.

Often, if the data is popular, many reader processes will want access to it, and each reader can use its own node to access and analyze the data.

This increases overall system performance.


Note: a problem with such distributed replication is that it is hard to make changes to the data over time. However, in most big data systems, the data is written once, and updates are maintained as additional data sets over time.


Scalable Computing over the Internet

Most computing is done on a single compute node.
If the computation needs more than one node, or parallel processing, as in many scientific computing problems, we can use parallel computers.


A parallel computer is a very large number of computing nodes with specialized capabilities, connected by a specialized network.

For example, the Gordon supercomputer at the San Diego Supercomputer Center has 1,024 compute nodes with 16 cores each, or 16,384 compute cores in total.

This type of specialized computer is quite costly compared to its more recent cousin, the commodity cluster.


Architecture of a commodity cluster


Enables data-parallelism

Such architectures enable what we call data-parallelism: the same computation runs independently on many partitions of the data at once. We will refer to it as data-parallelism in the context of big data computing.

Large volumes and varieties of big data can be analyzed using this mode of parallelism, achieving scalability, performance, and cost reduction.
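Data-parallelism can be sketched on a single machine with Python's standard library. This is illustrative only: on a commodity cluster each partition would live on a different node, and the partition sizes here are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

# Split the data set into 4 partitions of 1,000 values each.
partitions = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]

def analyze(partition):
    """The same computation runs independently on each partition."""
    return sum(partition)

# Apply the same function to every partition in parallel...
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(analyze, partitions))

# ...then combine the per-partition results into one answer.
total = sum(partial_results)
print(total)
```

The key property is that no partition's computation depends on another's, so adding more partitions (and nodes) scales the work out.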


Distributed computing


Common failures in commodity clusters


It is this type of distributed computing that pushed the change towards cost-effective, reliable, and fault-tolerant systems for the management and analysis of big data.


Programming Models for Big Data

How do we take advantage of these infrastructure advances? What are the right programming models?


Requirements for Big Data Programming Models

1. Support big data operations
Split volumes of data
Access data fast
Distribute computations to nodes

2. Handle fault tolerance
Replicate data partitions
Recover files when needed


3. Enable adding more racks


4. Optimized for specific data types

Document

Graph

Table

Key-value

Stream

Multimedia


Natural model for independent parallel tasks over multiple

resources!


What you have just seen is an example of big data modeling in action, only here the data is processed by human processors.

This scenario can be modeled by a common programming model for big data, namely MapReduce.

MapReduce is a big data programming model that supports all the requirements of big data modeling.


Getting Started with Hadoop


Why Hadoop?


Major Goals

1. Enable scalability
Commodity hardware is cheap


2. Handle Fault Tolerance
Be ready: crashes happen


3. Optimized for a Variety of Data Types
Big data comes in a variety of flavors


4. Facilitate a Shared Environment
Since even modest-sized clusters can have many cores, it is
important to allow multiple jobs to execute simultaneously.


5. Provide Value
Community-supported
Wide range of applications


The Hadoop Ecosystem


Main Hadoop Components


Cloud Computing


When to Use Hadoop?


The Hadoop Ecosystem

In 2004, Google published a paper about their in-house processing framework, which they called MapReduce.

The next year, Yahoo released an open-source implementation based on this framework, called Hadoop.

In the following years, other frameworks and tools were released to the community as open-source projects.


Let's look at one set of tools in the Hadoop ecosystem as a
layer diagram.


Distributed file system as foundation
Scalable storage
Fault tolerance


Flexible scheduling and resource management


Simplified programming model

Map = apply()
Reduce = summarize()
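The apply/summarize idea can be sketched in plain Python. This is illustrative only; Hadoop's actual MapReduce API is Java-based, and the word list here is invented for the example:

```python
from functools import reduce

# Map = apply an operation to every element.
lengths = list(map(len, ["hadoop", "hdfs", "yarn"]))

# Reduce = summarize the mapped results into one value.
total_length = reduce(lambda acc, n: acc + n, lengths, 0)
print(lengths, total_length)
```

Because each map call is independent, the map step is exactly the kind of data-parallel work a cluster can spread across nodes; the reduce step then combines the partial results.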


Higher-level programming models
Pig = dataflow scripting
Hive = SQL-like queries


Specialized models for graph processing


Real-time and in-memory processing

In-memory: 100x faster for some tasks


NoSQL for non-files
Key-values
Sparse tables


Zookeeper for management
Synchronization
Configuration
High-availability


A major benefit of the Hadoop ecosystem is that all these tools are open-source projects.

Large community for support
Download separately or as part of a pre-built image


The Hadoop Distributed File System (HDFS)

HDFS = foundation for Hadoop ecosystem

It provides two capabilities that are essential for managing big data: scalability to large data sets and reliability to cope with hardware failures.


HDFS allows you to store and access large datasets.

According to Hortonworks, a leading vendor of Hadoop services, HDFS has shown production scalability up to 200 petabytes of storage, 4,500 servers, and 1 billion files and blocks.


What happens next? Do we lose the information
stored in block C?


HDFS is designed for fault tolerance in such cases.

HDFS replicates, or makes a copy of, file blocks on different nodes to prevent data loss.


Customized reading to handle variety of file types


Two key components of HDFS

1. NameNode for metadata
Usually one per cluster
The NameNode coordinates operations and keeps track of each file's name, its location in the directory tree, and the mapping of its contents onto DataNodes.

2. DataNode for block storage
Usually one per machine
Listens to the NameNode for block creation, deletion, and replication, which enables data locality and fault tolerance.
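The division of labor can be sketched as a toy Python model. This is not real HDFS code, and all file, block, and node names are invented; it only shows that the NameNode holds metadata while DataNodes hold the bytes:

```python
# NameNode side: metadata only — file name -> ordered list of block ids.
namenode = {
    "/logs/web.log": ["blk_1", "blk_2"],
}

# NameNode side: block id -> DataNodes holding a replica of that block.
block_locations = {
    "blk_1": ["datanode1", "datanode2", "datanode3"],
    "blk_2": ["datanode2", "datanode3", "datanode4"],
}

def read_file(path):
    """A client asks the NameNode for block locations, then reads each
    block directly from one of the DataNodes that stores it."""
    return [(blk, block_locations[blk][0]) for blk in namenode[path]]

plan = read_file("/logs/web.log")
print(plan)
```

Note that the actual file bytes never pass through the NameNode; clients go to DataNodes for the data itself, which keeps the single NameNode from becoming a bandwidth bottleneck.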


YARN: The Resource Manager for Hadoop

YARN is a resource management layer that sits just above the storage layer, HDFS.

YARN interacts with applications and schedules resources for their use. By enabling multiple applications to run over HDFS, YARN increases resource efficiency.
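A toy sketch of the scheduling idea, in Python rather than real YARN code (node names, capacities, and application names are invented): nodes report capacity, and a central manager grants containers, slices of a node's memory, to applications until capacity runs out.

```python
# Free memory (GB) reported by each node's manager.
node_capacity = {"node1": 8, "node2": 8}
allocations = []  # granted containers as (app, node, GB) tuples

def request_container(app, memory_gb):
    """Grant a container on the first node with enough free memory."""
    for node, free in node_capacity.items():
        if free >= memory_gb:
            node_capacity[node] = free - memory_gb
            allocations.append((app, node, memory_gb))
            return node
    return None  # cluster full; a real scheduler would queue the request

request_container("mapreduce-job", 4)   # lands on node1 (4 GB left there)
request_container("spark-job", 6)       # node1 too full, lands on node2
request_container("mapreduce-job", 4)   # fits back on node1
print(allocations)
```

Real YARN schedulers add queues, priorities, and data-locality preferences on top of this basic accounting, but the core exchange of container requests and grants is the same.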


One dataset

many applications


(Diagram: the Resource Manager and per-node Node Managers form YARN's computation framework, layered over the data.)


Essential gears in the YARN engine

Resource Manager
Node Manager
Container
Application Master


YARN

More Applications


MapReduce: Simple Programming for Big
Results

MapReduce is a programming model for the Hadoop ecosystem.
It relies on YARN to schedule and execute parallel processing over the

distributed file blocks in HDFS.


Parallel Programming = Requires Expertise

MapReduce


Based on Functional Programming

Map = apply an operation to all elements
Reduce = summarize the results over the elements

Example MapReduce application: WordCount

f(x) = y
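WordCount can be sketched as a single-process Python program. This is illustrative only, not the Hadoop API: map emits (word, 1) pairs, a shuffle step groups the pairs by key, and reduce sums each group. The input lines are invented for the example:

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted counts by their key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Summarize each group into a single total per word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big results", "big data tools"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'results': 1, 'tools': 1}
```

On a cluster, each map call would run on the node holding its block of input, and the shuffle would move pairs across the network so that all counts for one word reach the same reducer.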


When to reconsider Hadoop?

Frequently changing data
Dependent tasks
Interactive analysis

Do you really need Hadoop?

Future anticipated data growth
Long-term availability of data
Many platforms over a single data store
High variety
High volume

Cloud Computing: An Important Big Data Enabler

The cloud is one of the two influences behind the launch of the big data era.

We called it on-demand computing, and we said that it enables us to compute anytime, anywhere.

Simply, whenever we demand it.


BUILD RESOURCES

BUILDING HARDWARE -> MORE WORK

Buying networking hardware and storage disks, upgrading hardware when it becomes obsolete, and so on.

HARDWARE ESTIMATION IS HARD

How do you estimate the size of your hardware needs?

SOFTWARE STACKS ARE COMPLEX

Getting software that fits your needs is equally challenging.

HIGH CAPITAL INVESTMENTS

All of this requires high initial capital investment and the efficient operation of several departments in the business.


CLOUD

PAY AS YOU GO
QUICK IMPLEMENTATION
DEPLOY CLOSER TO YOUR CLIENT
RESOURCE ESTIMATION SOLVED
WORK ON YOUR DOMAIN EXPERTISE
INSTANTLY GET DIFFERENT RESOURCES
DESIGN YOUR OWN COMPUTING PLATFORM


Cloud Service Models: Exploration of Choices


The decision of which service you want to explore is a

function of several variables. It depends on


Value from Hadoop and Pre-built Hadoop Images

Assembling your own software stack from scratch can be messy and a lot of work for beginners.

Getting pre-built software images is similar to buying pre-assembled furniture.

The packaging of these pre-built software images is enabled by virtual machines using virtualization software.


Pre-built Images for Hadoop

