

1
Hadoop Ecosystem
Dr. Syarif Hidayat, S.Kom., M.I.T.
(Dr. SH)
Big Data
Program Studi Informatika – Program Sarjana
Chapter 4
2
Foundations for Big Data Systems
and Programming
3
Why do we worry about foundations?
• In a chemistry lab, we need to understand the chemistry, or theory, behind the practical work before we start handling the test tubes.
• Similarly, learning these concepts now will help you understand the practical concepts in Hadoop.
4
What is a Distributed File System?
5
• This is where the name “file system” comes from.
6
• Long-term information storage:
1. Access the result of a process later
2. Store large amounts of information
3. Enable access by multiple processes
• For all these reasons, we store information in files on a hard disk.
7
• There are many of these files, and they are managed by the operating system.
• How the operating system manages files is called a file system.
8
Accessing Files
9
Personal laptops or desktop computers with a single hard drive
WHAT IF YOU HAVE MORE DATA?
Buy a bigger disk? Copy data to an external hard drive?
10
PERSONAL
WORK
11
12
13
• Data sets, or parts of a data set, can be replicated across the
nodes of a distributed file system.
• Distributed file systems replicate the data across racks, and also across computers distributed over geographical regions.
• Data replication makes the system more fault tolerant.
14
• So when some nodes or a rack goes down, the same data can still be found and analyzed in other parts of the system.
15
• Data replication also helps with scaling access to this data by many users.
• Often, if the data is popular, many reader processes will want access to it, and each reader can use its own node to access and analyze the data.
• This increases overall system performance.
16
Note: a problem with such distributed replication is that it is hard to make changes to the data over time. However, in most big data systems the data is written once, and updates are maintained as additional data sets over time.
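The write-once idea above can be sketched as a toy append-only log (a hypothetical illustration in Python, not any real big data system's API): records are only ever appended, and an "update" is simply a newer record.

```python
import json

# Hypothetical append-only store: records are never edited in place.
log = []

def append(record):
    log.append(json.dumps(record))

def latest(key):
    # The current value of a key is the last record written for it.
    value = None
    for line in log:
        rec = json.loads(line)
        if rec["key"] == key:
            value = rec["value"]
    return value

append({"key": "temp", "value": 20})
append({"key": "temp", "value": 25})  # the update is an additional record
print(latest("temp"))  # -> 25
```

Because nothing is modified in place, replicas never need to coordinate on edits; readers simply see more records over time.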
17
18
Scalable Computing over the Internet
• Most computing is done on a single compute node.
• If a computation needs more than one node, or parallel processing, as many scientific computing problems do, we can use parallel computers.
19
• A parallel computer is a very large number of computing nodes with specialized capabilities, connected by a specialized network.
• For example, the Gordon supercomputer at the San Diego Supercomputer Center has 1,024 compute nodes with 16 cores each, for a total of 16,384 compute cores.
• This type of specialized computer is pretty costly compared to its most
recent cousin, the commodity cluster.
20
21
22
Architecture of a commodity cluster
23
Enables data-parallelism
• Such architectures enable what we call data-parallelism.
• We will refer to it as data-parallelism in the context of big data computing.
• Large volumes and varieties of big data can be analyzed using
this mode of parallelism, achieving scalability, performance and
cost reduction.
24
Distributed computing
25
Common failures in commodity clusters
26
27
• It is this type of distributed computing that pushed for a change towards cost-effective, reliable, and fault-tolerant systems for the management and analysis of big data.
28
Programming Models for Big Data
How to take advantage of these
infrastructure advances. What are
the right programming models?
29
30
31
Requirements for Big Data Programming Models
1. Support big data operations
• Split volumes of data
• Access data fast
• Distribute computations to nodes
2. Handle fault tolerance
• Replicate data partitions
• Recover files when needed
32
3. Enable adding more racks
33
4. Optimized for specific data types
Document
Graph
Table
Key-value
Stream
Multimedia
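As a minimal illustration, one made-up fact ("user 42 is named Ana") can be represented in three of the data models listed above (the field names are invented for the example):

```python
# The same fact in three data models (illustrative only).
key_value = ("user:42:name", "Ana")        # key-value: opaque value per key
document = {"_id": 42, "name": "Ana",      # document: nested, self-describing
            "tags": ["admin", "beta"]}
row = (42, "Ana", None)                    # table row: fixed columns (id, name, email)
print(key_value, document, row)
```

A programming model optimized for one of these shapes (say, tables) can be a poor fit for another (say, graphs), which is why the ecosystem offers specialized tools.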
34
Natural model for independent parallel tasks over multiple
resources!
35
36
37
38
39
• What you have just seen is an example of big data processing in action, only the data was processed by human processors.
• This scenario can be modeled by a common programming model for big data, namely MapReduce.
• MapReduce is a big data programming model that supports all the requirements of big data modeling.
40
41
Getting Started with Hadoop
42
Why Hadoop?
43
1. Enable scalability
Commodity hardware is cheap
Major Goals
44
2. Handle Fault Tolerance
Be ready: crashes happen
45
3. Optimized for a Variety of Data Types
Big data comes in a variety of flavors
46
4. Facilitate a Shared Environment
Since even modest-sized clusters can have many cores, it is
important to allow multiple jobs to execute simultaneously.
47
5. Provide Value
Community-supported
Wide range of applications
48
The Hadoop Ecosystem
49
Main Hadoop Components
50
Cloud Computing
51
When to use Hadoop?
52
The Hadoop Ecosystem
• In 2004 Google published a paper about their in-house
processing framework they called MapReduce.
• The next year, Yahoo released an open-source implementation based on this framework, called Hadoop.
• In the following years, other frameworks and tools were
released to the community as open-source projects.
53
54
55
Let's look at one set of tools in the Hadoop ecosystem as a
layer diagram.
56
• Distributed file system as foundation
Scalable storage
Fault tolerance
57
• Flexible scheduling and resource management
58
• Simplified programming model
Map
apply()
Reduce
summarize()
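Python's built-in map and functools.reduce follow the same two ideas, which may help make the model concrete (a toy example, not Hadoop code):

```python
from functools import reduce

numbers = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, numbers))   # map: apply to every element
total = reduce(lambda a, b: a + b, squares)     # reduce: summarize to one value
print(squares, total)  # -> [1, 4, 9, 16] 30
```

Because the map step is applied to each element independently, the framework is free to run it in parallel across file blocks; the reduce step then summarizes the results.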
59
• Higher-level programming models
Pig = dataflow scripting
Hive = SQL-like queries
60
• Specialized models for graph processing
61
• Real-time and in-memory processing
In-memory
100x faster for some tasks
62
• NoSQL for non-files
Key-values
Sparse tables
63
• Zookeeper for management
Synchronization
Configuration
High-availability
64
A major benefit of the Hadoop ecosystem is that all these
tools are open-source projects.
Large community for
support
Download separately or as part of a pre-built image
65
66
The Hadoop Distributed File System (HDFS)
• HDFS = foundation for Hadoop ecosystem
• It provides two capabilities that are essential for managing big data: scalability to large data sets, and reliability to cope with hardware failures.
67
• HDFS allows you to store and access massively large data sets.
• According to Hortonworks, a leading vendor of Hadoop services, HDFS has shown production scalability of up to 200 petabytes, on 4,500 servers, with 1 billion files and blocks.
68
69
What happens next? Do we lose the information
stored in block C?
70
• HDFS is designed for fault tolerance in such cases.
• HDFS replicates, or makes copies of, file blocks on different nodes to prevent data loss. In this example:
71
Customized reading to handle variety of file types
72
Two key components of HDFS
1. NameNode for metadata
Usually one per cluster
• The NameNode coordinates operations.
• It keeps track of each file's name, its location in the directory tree, etc.
• It maps file contents to blocks on the DataNodes.
2. DataNode for block storage
Usually one per machine
• Listens to the NameNode for block creation, deletion, and replication, which enables data locality and fault tolerance.
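The division of labor between the two components can be sketched as toy bookkeeping (illustrative Python only; the put function and the round-robin placement are invented, and real HDFS additionally uses rack-aware placement):

```python
import itertools

# Toy NameNode bookkeeping (not real HDFS code): map each file to its
# blocks, and each block to the DataNodes holding a replica.
REPLICATION = 3
datanodes = ["dn1", "dn2", "dn3", "dn4"]
namenode = {}                            # filename -> list of (block_id, replicas)
placement = itertools.cycle(datanodes)   # naive round-robin placement

def put(filename, n_blocks):
    blocks = []
    for b in range(n_blocks):
        # Put each replica of a block on a distinct DataNode.
        replicas = [next(placement) for _ in range(REPLICATION)]
        blocks.append((f"{filename}_blk{b}", replicas))
    namenode[filename] = blocks

put("weblog.txt", 2)
for block_id, replicas in namenode["weblog.txt"]:
    print(block_id, replicas)
```

Only the metadata lives on the NameNode; the block contents themselves stay on the DataNodes, which is why losing one DataNode costs nothing as long as replicas survive elsewhere.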
73
74
YARN: The Resource Manager for Hadoop
• YARN is a resource management layer that sits just above the storage layer, HDFS.
• YARN interacts with applications and schedules resources for their use.
• YARN enables running multiple applications over HDFS, which increases resource efficiency.
75
76
One dataset
many applications
77
78
Resource Manager
Data Computation Framework
Node Manager
79
80
Essential gears in the YARN engine
Resource Manager
Node Manager
Container
Application Master
81
82
YARN
More Applications
83
MapReduce: Simple Programming for Big
Results
• MapReduce is a programming model for the Hadoop ecosystem.
• It relies on YARN to schedule and execute parallel processing over the
distributed file blocks in HDFS.
84
Parallel Programming = Requires Expertise
MapReduce
85
Based on Functional Programming
Map = apply operation to all elements
Reduce = summarize operation on elements
Example MapReduce Application: WordCount
f(x) = y
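The WordCount application named above can be sketched in miniature as pure Python (an in-memory illustration of the map, shuffle, and reduce phases; a real Hadoop job distributes these tasks across the cluster):

```python
from collections import defaultdict

def mapper(line):
    # Map: emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    # Reduce: summarize the grouped values into a single count.
    return (word, sum(counts))

lines = ["my apple is red", "my rose is red"]  # made-up input
pairs = [p for line in lines for p in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())
print(result)  # e.g. {'my': 2, 'apple': 1, 'is': 2, 'red': 2, 'rose': 1}
```

Note that the mapper sees one line at a time and the reducer sees one word at a time, which is what lets the framework run many copies of each in parallel.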
86
87
88
89
90
91
92
93
94
95
96
97
98
• Frequently changing data
• Dependent tasks
• Interactive analysis
99
When to reconsider Hadoop?
Future anticipated data growth
Long term availability of data
Many platforms over single data store
High variety
High volume
100
Do you really need Hadoop?
101
Cloud Computing: An Important Big Data
Enabler
• The cloud is one of the two influences behind the launch of the big data era.
• We call it on-demand computing, and it enables us to compute anytime, anywhere.
• Simply, whenever we demand it.
102
103
104
105
BUILD RESOURCES
• BUILDING HARDWARE -> MORE WORK
Buying networking hardware, storage disks, upgrading hardware when it
becomes obsolete, and so on.
• HARDWARE ESTIMATION IS HARD
How do you estimate the size of your hardware needs?
• SOFTWARE STACKS ARE COMPLEX
Getting the software that fits your needs is equally challenging.
• HIGH CAPITAL INVESTMENTS
This requires high initial capital investments and efficient operation of
several departments in business.
106
CLOUD
• PAY AS YOU GO
• QUICK IMPLEMENTATION
• DEPLOY CLOSER TO YOUR CLIENT
• RESOURCE ESTIMATION SOLVED
• WORK ON YOUR DOMAIN EXPERTISE
• INSTANTLY GET DIFFERENT RESOURCES
• DESIGN YOUR OWN COMPUTING PLATFORM
107
108
Cloud Service Models: Exploration of Choices
109
110
111
112
• The decision of which service model to explore is a function of several variables; it depends on:
113
114
115
Value from Hadoop and Pre-built Hadoop
Images
• Assembling your own software stack from scratch can be
messy and a lot of work for beginners.
• Getting pre-built software images is similar to buying pre-
assembled furniture.
• Packaging of these pre-built software images is enabled by
virtual machines using virtualization software.
116
Pre-built Images for Hadoop
117
118
119
120
121