
Hadoop Ecosystem

Dr. Syarif Hidayat, S.Kom., M.I.T.
(Dr. SH)

Big Data

Program Studi Informatika – Program Sarjana

Chapter 4


Foundations for Big Data Systems
and Programming


Why do we worry about foundations?

In a chemistry lab, we really need to understand the chemistry, or theory, behind the practical work before we start handling the test tubes.

Similarly, learning these concepts now will help you understand the practical concepts in Hadoop.


What is a Distributed File System?


This is where the name “file system” comes from.


Long-term information storage:

1. Access the result of a process later
2. Store large amounts of information
3. Enable access by multiple processes

For all these reasons, we store information in files on a hard disk.


There are many of these files, and they are managed by the operating system.

How the operating system manages files is called a file system.
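As a small illustration, every read and write a program performs goes through the operating system's file-system layer; the program never addresses the disk directly. This sketch (file name and contents invented for the example) shows all three storage needs above:

```python
from pathlib import Path
import tempfile

# A scratch directory managed by the OS file system.
tmpdir = Path(tempfile.mkdtemp())
path = tmpdir / "result.txt"

path.write_text("stored for later access")       # 1. long-term storage
contents = path.read_text()                       # 2. access the result later
names = sorted(p.name for p in tmpdir.iterdir())  # 3. the OS tracks the files
print(contents, names)
```

Any other process with the right permissions could open the same path, which is the third requirement above.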


Accessing a File


Personal laptops or desktop computers with a single hard drive

WHAT IF YOU HAVE MORE DATA?

Buy a bigger disk?
Copy data to an external hard drive?


PERSONAL

WORK


Data sets, or parts of a data set, can be replicated across the nodes of a distributed file system.

Distributed file systems replicate data between racks, and also across computers in different geographical regions.

Data replication makes the system more fault tolerant.
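The idea can be sketched in a few lines of Python. This is illustrative only, not how a real distributed file system is implemented; the node names and replication factor are invented for the example:

```python
import random

# Hypothetical cluster: each block is replicated on 3 distinct nodes,
# spread over two racks.
NODES = ["rack1-node1", "rack1-node2", "rack2-node1", "rack2-node2"]
REPLICATION_FACTOR = 3

def place_block(block_id, nodes, replication=REPLICATION_FACTOR):
    """Choose distinct nodes to hold copies of one block."""
    return {block_id: random.sample(nodes, replication)}

def surviving_copies(placement, failed_node):
    """Copies of each block that remain readable after one node fails."""
    return {blk: [n for n in nodes if n != failed_node]
            for blk, nodes in placement.items()}

placement = place_block("block-A", NODES)
# Fail the first node holding block-A; with 3 copies, 2 remain readable.
after_failure = surviving_copies(placement, placement["block-A"][0])
print(after_failure)
```

This is why losing a node, or even a rack, does not make the data unavailable.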


So when some nodes or a rack goes down, the same data can still be found and analyzed in other parts of the system.


Data replication also helps with scaling access to this data by many users.

Often, if the data is popular, many reader processes will want access to it, and each reader can use its own node to access and analyze the data.

This increases overall system performance.


Note: a problem with such distributed replication is that it is hard to make changes to the data over time. However, in most big data systems, the data is written once, and updates are maintained as additional data sets over time.


Scalable Computing over the Internet

Most computing is done on a single compute node.
If the computation needs more than one node, or parallel processing, as in many scientific computing problems, we can use parallel computers.


A parallel computer is a very large number of computing nodes with specialized capabilities, connected by a specialized network.

For example, the Gordon supercomputer at the San Diego Supercomputer Center has 1,024 compute nodes with 16 cores each, or 16,384 compute cores in total.

This type of specialized computer is quite costly compared to its more recent cousin, the commodity cluster.


Architecture of a commodity cluster


Enables data-parallelism

Such architectures enable what we call data-parallelism: the same computation runs independently on many partitions of the data at once. We will refer to it as data-parallelism in the context of big data computing.

Large volumes and varieties of big data can be analyzed using this mode of parallelism, achieving scalability, performance, and cost reduction.
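Data-parallelism can be sketched on a single machine with Python's standard library. This is illustrative only: on a commodity cluster each partition would live on a different node, and the partition sizes here are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

# Split the data set into 4 partitions of 1,000 values each.
partitions = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]

def analyze(partition):
    """The same computation runs independently on each partition."""
    return sum(partition)

# Apply the same function to every partition in parallel...
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(analyze, partitions))

# ...then combine the per-partition results into one answer.
total = sum(partial_results)
print(total)
```

The key property is that no partition's computation depends on another's, so adding more partitions (and nodes) scales the work out.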


Distributed computing


Common failures in commodity clusters


It is this type of distributed computing that pushed the change towards cost-effective, reliable, and fault-tolerant systems for the management and analysis of big data.


Programming Models for Big Data

How do we take advantage of these infrastructure advances? What are the right programming models?


Requirements for Big Data Programming Models

1. Support big data operations
Split volumes of data
Access data fast
Distribute computations to nodes

2. Handle fault tolerance
Replicate data partitions
Recover files when needed


3. Enable adding more racks


4. Optimized for specific data types

Document

Graph

Table

Key-value

Stream

Multimedia


Natural model for independent parallel tasks over multiple

resources!


What you have just seen is an example of big data modeling in action, only here the data is processed by human processors.

This scenario can be modeled by a common programming model for big data, namely MapReduce.

MapReduce is a big data programming model that supports all the requirements of big data modeling.


Getting Started with Hadoop


Why Hadoop?


Major Goals

1. Enable scalability
Commodity hardware is cheap


2. Handle Fault Tolerance
Be ready: crashes happen


3. Optimized for a Variety of Data Types
Big data comes in a variety of flavors


4. Facilitate a Shared Environment
Since even modest-sized clusters can have many cores, it is
important to allow multiple jobs to execute simultaneously.


5. Provide Value
Community-supported
Wide range of applications


The Hadoop Ecosystem


Main Hadoop Components


Cloud Computing


When to Use Hadoop?


The Hadoop Ecosystem

In 2004, Google published a paper about their in-house processing framework, which they called MapReduce.

The next year, Yahoo released an open-source implementation based on this framework, called Hadoop.

In the following years, other frameworks and tools were released to the community as open-source projects.


Let's look at one set of tools in the Hadoop ecosystem as a
layer diagram.


Distributed file system as foundation
Scalable storage
Fault tolerance


Flexible scheduling and resource management


Simplified programming model

Map = apply()
Reduce = summarize()
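The apply/summarize idea can be sketched in plain Python. This is illustrative only; Hadoop's actual MapReduce API is Java-based, and the word list here is invented for the example:

```python
from functools import reduce

# Map = apply an operation to every element.
lengths = list(map(len, ["hadoop", "hdfs", "yarn"]))

# Reduce = summarize the mapped results into one value.
total_length = reduce(lambda acc, n: acc + n, lengths, 0)
print(lengths, total_length)
```

Because each map call is independent, the map step is exactly the kind of data-parallel work a cluster can spread across nodes; the reduce step then combines the partial results.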


Higher-level programming models
Pig = dataflow scripting
Hive = SQL-like queries


Specialized models for graph processing


Real-time and in-memory processing

In-memory: 100x faster for some tasks


NoSQL for non-files
Key-values
Sparse tables


Zookeeper for management
Synchronization
Configuration
High-availability


A major benefit of the Hadoop ecosystem is that all these tools are open-source projects.

Large community for support
Download separately or as part of a pre-built image


The Hadoop Distributed File System (HDFS)

HDFS = foundation for Hadoop ecosystem

It provides two capabilities that are essential for managing big data: scalability to large data sets and reliability to cope with hardware failures.


HDFS allows you to store and access large datasets.

According to Hortonworks, a leading vendor of Hadoop services, HDFS has shown production scalability up to 200 petabytes of storage, 4,500 servers, and 1 billion files and blocks.


What happens next? Do we lose the information
stored in block C?


HDFS is designed for fault tolerance in such cases.

HDFS replicates, or makes a copy of, file blocks on different nodes to prevent data loss.


Customized reading to handle variety of file types


Two key components of HDFS

1. NameNode for metadata
Usually one per cluster
The NameNode coordinates operations and keeps track of each file's name, its location in the directory tree, and the mapping of its contents onto DataNodes.

2. DataNode for block storage
Usually one per machine
Listens to the NameNode for block creation, deletion, and replication, which enables data locality and fault tolerance.
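The division of labor can be sketched as a toy Python model. This is not real HDFS code, and all file, block, and node names are invented; it only shows that the NameNode holds metadata while DataNodes hold the bytes:

```python
# NameNode side: metadata only — file name -> ordered list of block ids.
namenode = {
    "/logs/web.log": ["blk_1", "blk_2"],
}

# NameNode side: block id -> DataNodes holding a replica of that block.
block_locations = {
    "blk_1": ["datanode1", "datanode2", "datanode3"],
    "blk_2": ["datanode2", "datanode3", "datanode4"],
}

def read_file(path):
    """A client asks the NameNode for block locations, then reads each
    block directly from one of the DataNodes that stores it."""
    return [(blk, block_locations[blk][0]) for blk in namenode[path]]

plan = read_file("/logs/web.log")
print(plan)
```

Note that the actual file bytes never pass through the NameNode; clients go to DataNodes for the data itself, which keeps the single NameNode from becoming a bandwidth bottleneck.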


YARN: The Resource Manager for Hadoop

YARN is a resource management layer that sits just above the storage layer, HDFS.

YARN interacts with applications and schedules resources for their use. By enabling multiple applications to run over HDFS, YARN increases resource efficiency.
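A toy sketch of the scheduling idea, in Python rather than real YARN code (node names, capacities, and application names are invented): nodes report capacity, and a central manager grants containers, slices of a node's memory, to applications until capacity runs out.

```python
# Free memory (GB) reported by each node's manager.
node_capacity = {"node1": 8, "node2": 8}
allocations = []  # granted containers as (app, node, GB) tuples

def request_container(app, memory_gb):
    """Grant a container on the first node with enough free memory."""
    for node, free in node_capacity.items():
        if free >= memory_gb:
            node_capacity[node] = free - memory_gb
            allocations.append((app, node, memory_gb))
            return node
    return None  # cluster full; a real scheduler would queue the request

request_container("mapreduce-job", 4)   # lands on node1 (4 GB left there)
request_container("spark-job", 6)       # node1 too full, lands on node2
request_container("mapreduce-job", 4)   # fits back on node1
print(allocations)
```

Real YARN schedulers add queues, priorities, and data-locality preferences on top of this basic accounting, but the core exchange of container requests and grants is the same.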


One dataset

many applications


(Diagram: the Resource Manager and per-node Node Managers form YARN's computation framework, layered over the data.)


Essential gears in the YARN engine

Resource Manager
Node Manager
Container
Application Master


YARN

More Applications


MapReduce: Simple Programming for Big
Results

MapReduce is a programming model for the Hadoop ecosystem.
It relies on YARN to schedule and execute parallel processing over the

distributed file blocks in HDFS.


Parallel Programming = Requires Expertise

MapReduce


Based on Functional Programming

Map = apply an operation to all elements
Reduce = summarize the results over the elements

Example MapReduce application: WordCount

f(x) = y
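WordCount can be sketched as a single-process Python program. This is illustrative only, not the Hadoop API: map emits (word, 1) pairs, a shuffle step groups the pairs by key, and reduce sums each group. The input lines are invented for the example:

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted counts by their key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Summarize each group into a single total per word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big results", "big data tools"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'results': 1, 'tools': 1}
```

On a cluster, each map call would run on the node holding its block of input, and the shuffle would move pairs across the network so that all counts for one word reach the same reducer.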


When to reconsider Hadoop?

Frequently changing data
Dependent tasks
Interactive analysis

Do you really need Hadoop?

Future anticipated data growth
Long-term availability of data
Many platforms over a single data store
High variety
High volume

Cloud Computing: An Important Big Data Enabler

The cloud is one of the two influences behind the launch of the big data era.

We called it on-demand computing, and we said that it enables us to compute anytime, anywhere.

Simply, whenever we demand it.


BUILD RESOURCES

BUILDING HARDWARE -> MORE WORK

Buying networking hardware and storage disks, upgrading hardware when it becomes obsolete, and so on.

HARDWARE ESTIMATION IS HARD

How do you estimate the size of your hardware needs?

SOFTWARE STACKS ARE COMPLEX

Getting software that fits your needs is equally challenging.

HIGH CAPITAL INVESTMENTS

All of this requires high initial capital investment and the efficient operation of several departments in the business.


CLOUD

PAY AS YOU GO
QUICK IMPLEMENTATION
DEPLOY CLOSER TO YOUR CLIENT
RESOURCE ESTIMATION SOLVED
WORK ON YOUR DOMAIN EXPERTISE
INSTANTLY GET DIFFERENT RESOURCES
DESIGN YOUR OWN COMPUTING PLATFORM


Cloud Service Models: Exploration of Choices


The decision of which service you want to explore is a

function of several variables. It depends on


Value from Hadoop and Pre-built Hadoop Images

Assembling your own software stack from scratch can be messy and a lot of work for beginners.

Getting pre-built software images is similar to buying pre-assembled furniture.

The packaging of these pre-built software images is enabled by virtual machines using virtualization software.


Pre-built Images for Hadoop

