15. What is HDFS?
Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
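To make the block/replica model concrete, here is a minimal sketch of writing a file through the HDFS Java API with an explicit replication factor; the path, the replication factor of 3, and the 128 MB block size are illustrative assumptions rather than values taken from these slides.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(conf);       // connects to the default NameNode

          // create(path, overwrite, bufferSize, replication, blockSize):
          // each 128 MB block of this file is kept in 3 replicas on DataNodes.
          Path file = new Path("/tmp/example.txt");   // hypothetical path
          FSDataOutputStream out =
              fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024);
          out.writeUTF("hello hdfs");
          out.close();
          fs.close();
      }
  }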
106. Cloudera
Cloudera -> Hadoop
Red Hat -> Linux Kernel
(Cloudera packages and supports Hadoop in the same way that Red Hat packages and supports the Linux kernel.)
109. Cloudera
Apache Hadoop is a new way for enterprises to store and analyze data.
Relational and data warehouse products excel at OLAP and OLTP workloads over structured data. Hadoop, however, was designed to solve a different problem: the fast, reliable analysis of both structured data and complex data.
As a result, many enterprises deploy Hadoop alongside their legacy IT systems, which allows them to combine old data and new data sets in powerful new ways.
110. Why Hadoop?
Apache Hadoop is an ideal platform for consolidating large-scale data from a variety of new and legacy sources. It complements existing data management solutions with new analyses and processing tools, and it delivers immediate value to companies in a variety of vertical markets.
111. Why Hadoop?
114. Cloudera’s Distribution for Apache Hadoop (CDH)
116. CDH - Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational databases.
You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
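As a sketch of that flow, the import step can be driven from the sqoop command line or programmatically via Sqoop 1.x's org.apache.sqoop.Sqoop.runTool entry point (assumed here to be on the classpath); the JDBC URL, credentials, table name, and HDFS target directory below are made-up placeholders.

  import org.apache.sqoop.Sqoop;

  public class SqoopImportExample {
      public static void main(String[] args) {
          // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
          String[] importArgs = {
              "import",
              "--connect", "jdbc:mysql://dbhost/sales",  // hypothetical MySQL database
              "--username", "etl",
              "--table", "orders",                       // hypothetical source table
              "--target-dir", "/user/etl/orders",        // HDFS destination directory
              "--num-mappers", "4"                       // parallel map tasks for the import
          };
          int ret = Sqoop.runTool(importArgs);           // returns 0 on success
          System.exit(ret);
      }
  }

The reverse direction uses the export tool, with --export-dir pointing at the HDFS data to be written back into the RDBMS.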
120. Usage at Facebook
121. Data Flow Architecture at Facebook
[Architecture diagram: Web Servers, Scribe MidTier, Filers, Production Cluster, Oracle RAC, Federated MySQL, Scribe-Hadoop Clusters, Adhoc cluster]
122. Warehousing at Facebook
Instrumentation
Realtime (work in progress)
Metadata Discovery (CoHive)
Workflow specification and execution (Chronos)
Monitoring and alerting
123. Hadoop & Hive Cluster @ Facebook
Production Cluster
300 nodes/2400 cores
3PB of raw storage
Adhoc Cluster
1200 nodes/9600 cores
12PB of raw storage
Node (DataNode + TaskTracker) configuration
2 CPUs, 4 cores per CPU
12 x 1TB disks (900GB usable per disk)
124. Hive & Hadoop Usage @ Facebook
Statistics per day:
10TB of compressed new data added per day
135TB of compressed data scanned per day
7500+ Hive jobs per day
80K compute hours per day
Hive simplifies Hadoop:
New engineers go through a Hive training session
~200 people/month run jobs on Hadoop/Hive
Analysts (non-engineers) use Hadoop through Hive
95% of Hadoop jobs are Hive jobs
125. Hive & Hadoop Usage @ Facebook
Types of Applications:
Reporting
Eg: Daily/Weekly aggregations of impression/click counts (a query sketch follows this list)
Measures of user engagement
Ad hoc Analysis
Eg: how many group admins broken down by state/country
Machine Learning (Assembling training data)
Eg: User Engagement as a function of user attributes
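As a hedged illustration of the reporting case above, such a daily rollup might be expressed as a HiveQL GROUP BY, run here through the later HiveServer2 JDBC driver; the ad_events table, its columns, the ds partition value, and the connection URL are all invented for this sketch.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class DailyClickAggregation {
      public static void main(String[] args) throws Exception {
          Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
          Connection conn = DriverManager.getConnection(
              "jdbc:hive2://hive-gateway:10000/default", "analyst", "");

          // Hypothetical daily aggregation of impression/click counts per ad.
          String query =
              "SELECT ad_id, SUM(impressions), SUM(clicks) " +
              "FROM ad_events WHERE ds = '2010-01-01' " +
              "GROUP BY ad_id";

          Statement stmt = conn.createStatement();
          ResultSet rs = stmt.executeQuery(query);
          while (rs.next()) {
              System.out.println(rs.getString(1) + "\t" + rs.getLong(2) + "\t" + rs.getLong(3));
          }
          rs.close();
          stmt.close();
          conn.close();
      }
  }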