Cloud Computing and Big Data

qkylin

Contributed on 2014-02-18

Keywords: distributed systems / cloud computing / big data

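Cloud Computing and Big Data
Xiaofeng Meng, Renmin University of China
Data Mining Teaching Workshop, Beijing, August 9, 2012

Outline
1. Introduction to Big Data
2. Cloud Computing and Big Data
3. Challenging Problems
4. Our Work
5. Conclusion

1. Introduction to Big Data

Big Data is so hot!
• Google Trends of Big Data
• Big Data Across the Federal Government (USA, March 2012)

What is Big Data?

DB (Database) vs. BD (Big Data)
• "Small data", Very Large Database (VLDB): MB scale, structured data. Data is the object; the goal is to solve its storage and management problems. This is data engineering.
• Big Data, Extremely Large Database (XLDB): more than PB scale, unstructured data. Data is a resource; the goal is to solve problems in many domains. This is data thinking.

Digitization of society and socialization of data
• Digitization of society: data footprints (data print)
  • In the digital age, people of all kinds leave ever richer data footprints, intentionally or not.
  • Data footprints carry social meaning and embody social structure.
• Socialization of data:
  • Data footprints and their structure are themselves a part of social structure and process, and keep shaping new social orders and relationships.

Data thinking: computational social science
• All social explanation, monitoring, prediction and planning depend on collecting, organizing and analyzing data footprints.
• The method of computational social science: driven by a specific social need and guided by a specific social theory, collect, organize and analyze data footprints in order to explain, monitor, predict and plan.

What Can Big Data Do? Prediction

What Can Big Data Do?
• Wall Street sells stocks based on public sentiment.
• Hedge funds analyze product sales from customer reviews on shopping sites.
• Banks infer the employment rate from the number of postings on job sites.
• Investment firms collect and analyze the statements of listed companies, looking for early signs of bankruptcy.
• The US Centers for Disease Control and Prevention analyzes the global spread of influenza and other epidemics from web searches.
• US President Obama's campaign team analyzes voters' preferences among the candidates in real time from their microblog posts.

What Can Big Data Do?
• Fraud detection
• Healthcare
• Transportation
• Telecommunications
• Life sciences
• Financial transactions
• ...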
The Yinxu ruins at Anyang (1300 BC, 3,300 years ago)
The oracle bone pit: more than 17,000 pieces

Big Data Application

Application          | Users | Precision      | Reliability   | Data volume | Response
Scientific computing | Few   | Very high      | Low to medium | Tera        | Slow
Stock trading        | Many  | High           | Very high     | Giga        | Fast
Web data             | Many  | Medium to high | Medium        | Peta        | Fast
Microblog data       | Many  | Medium to high | Medium        | 100 Peta    | Fast
...

2. Cloud Computing and Big Data

Cloud Computing and Big Data
• Cloud computing is like a highway that can support many kinds of transportation.
• Big Data can be seen as one vehicle on that highway.
• Cloud computing is the infrastructure; Big Data is what it serves.

Big Data Analysis Pipeline
Acquisition -> Extraction & Cleaning -> Integration -> Analysis -> Interpretation
• Cloud computing can greatly support each of these steps.

Acquisition
• Many data sources and a huge amount of data
• Much of this data is of no interest
• Data reduction is important

Extraction & Cleaning
• Various data types: structured and unstructured
• Extraction is often highly application dependent
• Missing and erroneous information should be cleaned

Integration
• The right metadata is needed
• Data has to be translated as it flows from one model (platform) to another, e.g. when transferring data from Hadoop to DB2

Analysis
• Fundamentally different from traditional statistical analysis on small samples
• Real-time analysis
• Lack of coordination between database systems

[Figure: "Big Data Analytics Tools in Use", bar chart, axis 0% to 40%. Categories as extracted: Oracle Exadata, Microsoft SQL PDW, IBM DB2 Smart Analytics System, Hadoop/MapReduce, IBM Netezza, HP Vertica, Teradata EDW, EMC Greenplum, Sybase IQ, Infobright, Kognitio WX2, ParAccel Analytic Database, Other, "We aren't using big data analytics tools", "Don't know". Values as extracted, in the same order: 21%, 18%, 12%, 11%, 10%, 9%, 9%, 8%, 4%, 2%, 1%, 1%, 3%, 35%, 11%.]

• Batch processing: MapReduce
• Stream processing: Storm (Twitter), S4 (Yahoo!)

Interpretation
• Big data is of limited value if users cannot understand the analysis
• The provenance of the result data
• Data visualization

3. Challenging Problems

Data, Data and Data!

Difficult to get the data
• Data is all around you!
• Data comes in many types
• Most data is held by companies
• It is hard for researchers to get the data

No Size Fits All
• Web data
• Science data
• Financial data
• Moving object data
• ...

Scale
• We must store everything because we don't know which part of the data is valuable.
• Finding a needle in a haystack

"Data is widely available; what is scarce is the ability to extract wisdom from it."
Hal Varian, Google's chief economist

"Fishing in the ocean" vs. "fishing in a pond"

Timeliness
• Many situations need the result of the analysis immediately.
• Real-time processing can be a challenge with big data, especially in dynamic data environments like financial trading and social media.
• Develop partial results in advance and then compute incrementally.
• New index structures are required.

Parallelism
• Parallelism across the nodes of a cluster
• Parallelism within a single node
• Cloud computing
• New hardware: SSD, PCM, ...

[Figure: evolution of the memory and storage hierarchy for 1980, 2008 and 2013+. Levels: logic (CPU), memory (RAM), active storage (disk, flash SSD, SCM) and archival (tape). The scale runs from fast, synchronous, memory-like to slow, asynchronous, storage-like.]

Privacy
• Managing privacy is both a technical and a sociological problem.
• New data sources bring new problems: LBS, microblogs, ...
• Share private data while limiting disclosure and ensuring sufficient utility in the shared data.
• Differential privacy is a very important step, but it reduces the information content too far to be useful in most practical cases.

4. Our Work

Big data management framework
Characteristics of big data:
• Multi-source and heterogeneous: considerable heterogeneity
• Widely distributed: spread across many regions
• Dynamic growth: grows fast, updates fast
• Data before schema: the data exists first, the schema comes later
How can massive data be managed efficiently?
Web Data Management (2000-now)
• 2001: Surface Web Data Extraction
• 2006: Deep Web Integration
• 2009: C-DBLP
• 2010: EasyScholar
• 2011-present: ScholarSpace

Domain-oriented Web data integration
Several online systems were built and deployed, validating the data integration techniques:
• ScholarSpace (学术空间): more than 3.5 million visits
• Gongzuotong (工作通) data integration system: more than 3 million integrated records
• Public opinion monitoring platform: more than 4.5 million integrated records
• Book price comparison site: dynamic integration, real-time data

ScholarSpace
• Publications: 500,000
• Authors: 400,000
• Cumulative visits: 4 million
• Daily visits: 6,000

ScholarSpace
• Entities: authors, papers, journals, conferences, research institutions, ...
• Relations: author relations, publication, co-authorship, affiliation, advisor, reference, ...
• Pipeline: data extraction -> data integration -> relation evolution (relation discovery, deletion, update)
• Use: browsing, query and analysis; task-based, in multiple forms, rich and diverse
[Figure: entity graph with edge labels Advisor, Co-Author, Author-Of, Published-In, Member, Classmate, Reference.]

Web data management framework
[Figure: Web data management framework.]

Significance of the results
• Established an approach to managing data in structured form, laying a foundation for big data integration in specific domains
• This in turn offers a new approach to big data management

Cloud Data Management (2008-now)
[Figure: timeline with milestones 2008/04, 2010/01, 2010/06, 2011/06 and present. Labels as extracted: Query Process & Benchmark_v2.0, TaijiDB_v1.0, System Survey & Benchmark_v1.0, Storage & Index for Cloud, Extensive Research & TaijiDB_v2.0. Topics: join query, distribution strategy, progress estimate, multidimensional index, query optimization, online aggregation.]

Our work: Cloud Data Management Motivated by Practical Industry Applications
• Privacy
• Benchmark & Demo
• Online Aggregation
• Cloud Storage & Index
• Timeliness

COLA: A Cloud-based On-Line Aggregation System

Online Aggregation in the Cloud: Motivation
Wikipedia Page Traffic Statistics

    SELECT language, SUM(pageviews)
    FROM table
    WHERE language IN ('en', 'ja', 'de', 'es', 'fr', 'it', 'pl')
    GROUP BY language

• Data size: 20 TB
• Cluster: Amazon EC2, 60 nodes
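As a rough illustration of the online aggregation idea behind the query above, the following Python sketch estimates SUM(pageviews) per language from a uniform random sample and attaches a normal-approximation 95% confidence interval. It is only a sketch under stated assumptions: the in-memory `rows` list, the function name `estimate_group_sums` and the simple-random-sampling estimator are illustrative choices and are not taken from COLA, whose MapReduce-based implementation is not shown in the slides.

```python
import math
import random

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def estimate_group_sums(rows, sample_size, wanted, seed=0):
    """Estimate SUM(pageviews) per language from a uniform sample.

    rows        : list of (language, pageviews) tuples (hypothetical table)
    sample_size : number of rows to sample without replacement (>= 2)
    wanted      : languages to report, as in the IN (...) list
    Returns {language: (estimate, half_width_of_95_percent_interval)}.
    """
    n_total = len(rows)
    sample = random.Random(seed).sample(rows, sample_size)
    result = {}
    for lang in wanted:
        # A sampled row contributes its pageviews if it belongs to the
        # group and 0 otherwise, so the group sum is a plain total.
        contrib = [views if row_lang == lang else 0.0
                   for row_lang, views in sample]
        mean = sum(contrib) / sample_size
        var = sum((c - mean) ** 2 for c in contrib) / (sample_size - 1)
        estimate = n_total * mean
        # Finite population correction: the interval shrinks to zero
        # once the whole table has been read.
        fpc = 1.0 - sample_size / n_total
        half_width = Z_95 * n_total * math.sqrt(var / sample_size * fpc)
        result[lang] = (estimate, half_width)
    return result


if __name__ == "__main__":
    demo = [("en", 10 + i % 7) if i % 2 == 0 else ("ja", 5 + i % 3)
            for i in range(1000)]
    for size in (50, 200, 1000):
        print(size, estimate_group_sums(demo, size, ["en", "ja"]))
```

Calling the function with growing sample sizes mimics how an online aggregation system reports a running estimate whose interval narrows as more data is processed.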
Online Aggregation in the Cloud: Motivation (continued)
• Batch processing of the full 20 TB: 95 h, $1,400
• Online aggregation: results with 95% confidence after 1 h
• Saves cost!

COLA: Architecture
• Online Aggregation Executor: state management, estimation, progress prediction
• Query Engine: backward compatibility, transparent user interface, 2 interfaces, 2 processing modes
• Data Manager: data sampling, metadata management

COLA: Implementation
• OLA Translator: Map Translator, Combine Translator, Reduce Translator, No Translator
• Result Estimator: result estimation and confidence interval computation; Combiner + Reducer
• Data Sampler: split-based queue, one queue per table, equal length
• State Manager: one State Manager per Reducer; stateful incremental computation
• Progress Predictor: MapReduce DAG; task-based PERT network; critical path

System Performance
Experiment 1: one-table aggregation query
• 10-node cluster, ~320 GB input file
• Frequency: 0.02
• Confidence level: 95%

    SELECT language, SUM(pageviews)
    FROM table
    WHERE language IN ('en', 'ja', 'de', 'es', 'fr', 'it', 'pl')
    GROUP BY language

Experiment 2: two-table aggregation query
• 10-node cluster, ~300 GB and 30 GB input files
• Frequency: 0.02
• Confidence level: 95%

    SELECT table1.language, SUM(pageviews)
    FROM table1, table2
    WHERE table1.pagename = table2.pagename
      AND table1.language = table2.language
      AND table2.pagesize >= 5000
    GROUP BY table1.language

System Performance: One Table
• High accuracy
• Rapid convergence
• Low cost: 1797 seconds vs. 1728 seconds

System Performance: Two Tables
[Figure: results for the two-table query.]

Storage and Index for Big Data Management

Introduction: Cloud DBs
Pros:
• Fault tolerant
• High availability
• Good scalability
Cons:
• No SQL support
• No support for complex queries
• No support for multi-dimensional queries
• Lack of an effective index

Introduction: Cloud DBs
Table: Product

rowkey | name | price | number
1      | beer | $3.00 | 1000
2      | beer | $7.00 | 2500
3      | milk | $2.00 | 1300
4      | milk | $4.50 | 2100

Query on the rowkey:

    select * from Product
    where product.rowkey > 1 and product.rowkey <= 3

• Fast query on the rowkey
• The data is organized by rowkey

Query on non-rowkey columns:

    select sum(number) from Product
    where product.name = 'beer'
      and product.price <= 10 and product.price >= 5

• Multi-dimensional queries on non-rowkey columns are not supported
• The whole table must be scanned
• 100 TB on 1000 nodes: 33 minutes

Multi-dimensional indexes and multi-field query optimization are important for big data.

Research works (application driven)
• Multi-dimensional index in the cloud
  • Application scenario: Intelligent Transportation System (ITS), a successful use case of the Internet of Things (IoT)
• Multi-Fields Query Processing in the Cloud
  • Application scenario: Telecom

Multi-dimensional Index in the Cloud: Motivation
• Massive
  • Millions of sensors or GPS-enabled devices
  • 10^6 × 2 × 60 × 24 × 1 KB ≈ 3 TB/day
• High update frequency
  • Data collection frequency: one record every 10 s or 30 s
  • Hundreds of thousands of insertions per second
• Multi-dimensional
  • Inherent attributes: spatio-temporal attributes
  • Other attributes: speed, direction, ...

Examples: Toyota G-Book, 1 million+ members; GM OnStar, 5 million+ members

Intelligent Transportation System
• Traffic condition analysis
• Traffic planning
• City management
All of these need multi-dimensional queries.

Limits of Current Approaches
• Traditional DBMS
  • Trouble with scalability
  • Cannot support high insert throughput
• Cloud DBs
  • Pros: high scalability, availability and fault tolerance; efficient random reads and writes; high insertion throughput
  • Cons: only rowkey-based queries are fast; multi-dimensional queries are not supported efficiently

Requirement
• Design a new index model that supports efficient multi-dimensional queries, according to the characteristics of IoT applications
• The index model must support high insert throughput at the same time
• Implement the new index on top of HBase

Multi-level Index Framework
• Divide the data into current data and historical data, and index them at different granularities
• For current data, index time intervals and subspaces at a high level; for historical data, index each record in batch

System Architecture
[Figure: system architecture.]

Demonstration: Insert Throughput
[Figure: insert throughput on a uniformly distributed data set and on a skewed data set.]

Demonstration: Range Query
[Figure: range query performance on a uniformly distributed data set and on a skewed data set.]

Research works
• Multi-dimensional index in the cloud
  • Application scenario: Intelligent Transportation System (ITS)
• Multi-Fields Query Processing in the Cloud
  • Application scenario: Telecom

Multi-Fields Query Processing in the Cloud: Motivation
• VAS: Value-Added Service
• NBG: NSN Browsing Gateway
• CDR (Call Detail Record)
  • 450 bytes per record, 3 months of data
  • Data set: 22 TB to 47 TB
  • Queries: msisdn + time range, url + time range
• Schema (input and output): R(msisdn, url, ts, size, otherData)
• Traditional solution: Oracle. Limitations: poor performance, expensive.

Query based on msisdn and time range:

    Select Top 100 url, sum(size) s, count(msisdn) c
    From R
    Where msisdn = 861346672558 And ts > 20120205 And ts < 20120429
    Group by url
    Order by c

Query based on url and time range:

    Select Top 100 msisdn, sum(size) s, count(url) c
    From R
    Where url = 'www.baidu.com' And ts > 20120205 And ts < 20120429
    Group by msisdn
    Order by c

[Figure: WapServer -> DB for CDR -> report. Goal: replace the RDBMS with a cloud DB.]

Multiple Layer Grid Tree (MLGT) for Telecom

    typedef struct MLGT {
        int n;
        int m;
        boolean bm[m][n];
        long insets[m][n];
        long trange;                 /* extent of the ts axis */
        long mspace;                 /* extent of the msisdn axis */
        Map<RegionID, MLGT> SR;      /* sub-MLGTs of split regions */
    } MLGT;

[Figure: grid over the msisdn axis (0 to mspace) and the ts axis (0 to trange). Each cell is a region; a split region points to a sub-MLGT.]

Solution: HBase + MapReduce
• Organize all the regions of a given HBase table into a multiple layer grid tree (MLGT)

Query processing
[Figure: data flow in five numbered steps between the components: query, query decomposition, tablet meta info, job settings and parameters, map tasks over HBase regions (Map/Reduce), query results.]

Experiment Results
• Platform: 8 nodes, each with two Intel quad-core 2.4 GHz CPUs, 16 GB of memory and a 7 TB hard disk; Ubuntu 10.10; 1 Gbps network
• Dataset: CDR data, 1 TB, 90 days
• Performance evaluation experiment
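Before the measurements, the region-pruning idea behind the MLGT described above can be sketched in a few lines of Python. This is a hedged, single-layer illustration only: the class name `GridLayer`, its methods and the integer cell arithmetic are assumptions made for this example, the sub-MLGTs of split regions are omitted, and nothing here is taken from the authors' HBase and MapReduce implementation. The field names `m`, `n`, `mspace`, `trange` and `bm` follow the struct on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class GridLayer:
    """One grid layer over the (msisdn, ts) plane (illustrative only)."""
    m: int        # number of cells along the msisdn axis
    n: int        # number of cells along the ts axis
    mspace: int   # msisdn values are assumed to lie in [0, mspace)
    trange: int   # ts values are assumed to lie in [0, trange)
    bm: list = field(init=False)  # bm[i][j] is True when cell (i, j) holds data

    def __post_init__(self):
        self.bm = [[False] * self.n for _ in range(self.m)]

    def _cell(self, msisdn, ts):
        # Map a point to its cell, clamping to the last cell on each axis.
        i = min(msisdn * self.m // self.mspace, self.m - 1)
        j = min(ts * self.n // self.trange, self.n - 1)
        return i, j

    def insert(self, msisdn, ts):
        i, j = self._cell(msisdn, ts)
        self.bm[i][j] = True

    def candidate_cells(self, msisdn, ts_lo, ts_hi):
        """Cells a query on one msisdn and a time range has to scan."""
        i, j_lo = self._cell(msisdn, ts_lo)
        _, j_hi = self._cell(msisdn, ts_hi)
        # Empty cells are pruned by the bitmap, so they are never scanned.
        return [(i, j) for j in range(j_lo, j_hi + 1) if self.bm[i][j]]


if __name__ == "__main__":
    grid = GridLayer(m=4, n=10, mspace=1000, trange=100)
    grid.insert(250, 15)
    grid.insert(250, 95)
    grid.insert(900, 20)
    print(grid.candidate_cells(250, 0, 99))
```

The point of the sketch is the design choice rather than the details: a query on msisdn plus a time range touches only the non-empty cells it overlaps, instead of every region of the table.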
Results of the performance evaluation experiment
[Figure: query time (s), first chart: Baseline 374, MDRO 43, MDRO-OPT 21.]
[Figure: query time (s), second chart: Baseline 582, MDRO 553, MDRO-OPT 104.]

Conclusion
• Multi-dimensional index
  • Uses an index to improve query performance
  • High insert throughput
• Multi-Fields Query Processing
  • Uses a novel region organization method and MapReduce to improve query performance
• Goal: improve multi-dimensional query performance for Big Data

Benchmark of Cloud DBMS and TaijiDB

Existing CloudDB
[Figure: existing cloud databases classified by application (data analysis, Web data management), architecture, storage, and data model (key-value).]

Benchmarking the Cloud DBMS: Motivation
• How do the systems perform?
• How to choose the most appropriate system?
• How to evaluate the systems?

Benchmark Design
• Goals: standardization, broad representation, efficiency
• Scenario: a real application scenario from industry
• Metrics: a series of metrics to evaluate performance
• Operation: representative operations in the business application
• Test case: a business process in the application

Benchmark Scenario
[Figure: benchmark scenario with input and output.]

Benchmark Operations
• PUT: PUT(KEY, VALUE)
• GET: VALUE = GET(KEY)
• SCAN: RESULT = SCAN(STARTKEY, ENDKEY)
• LOAD: LOAD(PATH)

Evaluation Results
[Figure: four panels plotted against the number of nodes: response time, data import, file load, scalability; with and without partition.]

TaijiDB: Motivation

RDBMS                     | Cloud DBMS
Easy to use               | Needs a relatively long time to learn
Structured Query Language | No such language
High cost                 | Low cost

• We want to build a low-cost system that is easy to use.
• Naive users should feel comfortable with the system.

TaijiDB: Motivation
• Real-world applications, Big Data, cloud computing, cloud-based DBMS
• There is no one-to-all solution in the cloud
• TaijiDB: A TitAnIc and Just-In-time DataBase

System architecture
[Figure: TaijiDB architecture. Applications: e-commerce, Internet of Things, telecom. Front-end interface: basic SQL interface and application interfaces. Query processing: query optimization, random sampling algorithm, progress estimating, online aggregation. Unified execution engine and unified API. Index management: multi-level index, MLGT algorithm. Storage manager: storage management, SSD and buffer management. Operation and management service: security, lock, monitoring, load balance, metadata, testing. Back ends: HBase (HMaster, HRegionServer, HDFS: tables, files, logs) and Cassandra (Thrift interface, keyspace, super column).]

• Different applications have various requirements.
• To fit the requirements, we need to evaluate the performance (Benchmark).
• Based on the benchmark, we know the advantages and disadvantages of the different systems.
• Take advantage of the different systems (TaijiDB).

Platform
• Platform 1: 50 PCs, Intel Core2 Q9550 2.83 GHz, 8 GB RAM
• Platform 2: 10 servers, 2 × Intel Xeon E5620 (2.4 GHz), 16 GB RAM, 4 × 2 TB SATA disks

5. Conclusion

Summary
• Cloud computing helps organizations store, manage, share and analyze their Big Data in an affordable and easy-to-use way.
• The concept of Big Data is broad and vague. We must focus on one or a few domains.
• Data thinking: nothing can be done without data.
• Different situations need different types of processing: batch or stream.
• Both hardware and software need to be updated.

Xiangshan Science Conference: Network Data Science and Engineering
Li Guojie (李国杰), Hua Yunsheng (华云生), Yao Qizhi (姚期智), Cheng Siwei (成思危)
Main topics:
• Applications of network big data in society, the economy and IT
• Common theoretical foundations of network data science
• Building a healthy ecosystem for network big data
Source: China Science Daily, Li Guojie

XLDB Asia 2012
Invited talks
• Reference cases from scientific communities: astroinformatics, geoinformatics, Earth, ...
• Reference cases from industry: Facebook, eBay, EMC, Taobao, ...
• Research on Big Data management: Laura (IBM), Xiaodong Zhang (Ohio), Martin (MonetDB), ...
Panel discussion
• Handling Extremely Large Scientific Data
• NoSQL: the Cure for Big Data?
• Evolution or Revolution: Database Research for Big Data
Lightning talks

Useful Resources

About our Lab
• Innovative Data Management Research
• Http://idke.ruc.edu.cn
• Google: wamdm

"The amount of data produced every 18 months from now on will equal the total amount of data produced in all of history before."
Jim Gray, 1998 Turing Award lecture

Thank you!
