Big Data Analytics with R and Hadoop

Set up an integrated infrastructure of R and Hadoop to turn your data analytics into Big Data analytics

Vignesh Prajapati

BIRMINGHAM - MUMBAI

Big Data Analytics with R and Hadoop

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2013
Production Reference: 1181113

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78216-328-2

www.packtpub.com

Cover Image by Duraid Fatouhi (firstname.lastname@example.org)

Credits

Author: Vignesh Prajapati

Reviewers: Krishnanand Khambadkone, Muthusamy Manigandan, Vidyasagar N V, Siddharth Tiwari

Acquisition Editor: James Jones

Lead Technical Editor: Mandar Ghate

Technical Editors: Shashank Desai, Jinesh Kampani, Chandni Maishery

Project Coordinator: Wendell Palmar

Copy Editors: Roshni Banerjee, Mradula Hegde, Insiya Morbiwala, Aditya Nair, Kirti Pai, Shambhavi Pai, Laxmi Subramanian

Proofreaders: Maria Gould, Lesley Harrison, Elinor Perry-Smith

Indexer: Mariammal Chettiyar

Graphics: Ronak Dhruv, Abhinash Sahu

Production Coordinator: Pooja Chiplunkar

Cover Work: Pooja Chiplunkar

About the Author

Vignesh Prajapati, from India, is a Big Data enthusiast, a Pingax (www.pingax.com) consultant, and a software professional at Enjay. He is an experienced machine learning data engineer, skilled in Big Data technologies such as R, Hadoop, Mahout, Pig, Hive, and related Hadoop components, which he uses to analyze datasets and derive informative insights through data analytics cycles.

He completed his B.E. at Gujarat Technological University in 2012 and started his career as a data engineer at Tatvic. His professional experience includes developing various data analytics algorithms for Google Analytics data sources, to provide economic value to products. To put machine learning into action, he implemented several analytical apps in collaboration with the Google Analytics and Google Prediction API services. He also contributes to the R community by developing the RGoogleAnalytics R library as an open source Google Code project, and writes articles on data-driven technologies.

Vignesh is not limited to a single domain; he has also developed various interactive apps using Google APIs, such as the Google Analytics API, Realtime API, Google Prediction API, Google Chart API, and Translate API, on the Java and PHP platforms. He is highly interested in the development of open source technologies.

Vignesh has also reviewed Apache Mahout Cookbook for Packt Publishing. That book provides a fresh, scope-oriented approach to the Mahout world for beginners as well as advanced users.
Apache Mahout Cookbook is specially designed to make users aware of the different possible machine learning applications, strategies, and algorithms for producing intelligent Big Data applications.

Acknowledgment

First and foremost, I would like to thank my loving parents and younger brother Vaibhav for standing beside me throughout my career as well as while writing this book. Without their support, it would have been totally impossible to achieve this knowledge sharing. As I started writing this book, I was continuously motivated by my father (Prahlad Prajapati), while my mother (Dharmistha Prajapati) regularly followed up on my progress. Also, thanks to my friends for encouraging me to start writing about big technologies such as Hadoop and R.

During this writing period, I went through some critical phases of my life, which were challenging for me at all times. I am grateful to Ravi Pathak, CEO and founder at Tatvic, who introduced me to this vast field of machine learning and Big Data and helped me realize my potential. And yes, I can't forget James, Wendell, and Mandar from Packt Publishing for their valuable support, motivation, and guidance in achieving these heights. Special thanks to them for bridging the communication gap on the technical and graphical sections of this book. Thanks to Big Data and machine learning.

Finally, a big thanks to God; you have given me the power to believe in myself and pursue my dreams. I could never have done this without the faith I have in you, the Almighty. Let us go forward together into the future of Big Data analytics.

About the Reviewers

Krishnanand Khambadkone has over 20 years of overall experience. He is currently working as a senior solutions architect in the Big Data and Hadoop practice of TCS America, where he architects and implements Hadoop solutions for Fortune 500 clients, mainly large banking organizations. Prior to this, he delivered middleware and SOA solutions using the Oracle middleware stack, and built and delivered software using the J2EE product stack. He is an avid evangelist and enthusiast of Big Data and Hadoop. He has written several articles and white papers on this subject, and has also presented these at conferences.

Muthusamy Manigandan is the Head of Engineering and Architecture at Ozone Media. Mani has more than 15 years of experience in designing large-scale software systems in the areas of virtualization, distributed version control systems, ERP, supply chain management, machine learning and recommendation engines, behavior-based retargeting, and behavioral targeting creatives. Prior to joining Ozone Media, Mani handled various responsibilities at VMware, Oracle, AOL, and Manhattan Associates. At Ozone Media he is responsible for products, technology, and research initiatives. Mani can be reached at mmaniga@yahoo.co.uk and http://in.linkedin.com/in/mmanigandan/.

Vidyasagar N V has had an interest in computer science since an early age. Some of his serious work in computers and computer networks began during his high school days. Later, he went to the prestigious Institute of Technology, Banaras Hindu University, for his B.Tech. He is working as a software developer and data expert, developing and building scalable systems. He has worked with a variety of second, third, and fourth generation languages. He has also worked with flat files, indexed files, hierarchical databases, network databases, and relational databases, as well as NoSQL databases, Hadoop, and related technologies.
Currently, he is working as a senior developer at Collective Inc., developing Big-Data-based structured data extraction techniques using the Web and local information. He enjoys developing high-quality software and web-based solutions, and designing secure and scalable data systems.

I would like to thank my parents, Mr. N Srinivasa Rao and Mrs. Latha Rao, and my family, who supported and backed me throughout my life, and my friends for being friends. I would also like to thank all those people who willingly donate their time, effort, and expertise by participating in open source software projects. Thanks to Packt Publishing for selecting me as one of the technical reviewers of this wonderful book. It is my honor to be a part of it. You can contact me at email@example.com.

Siddharth Tiwari has been in the industry for the past three years, working on machine learning, text analytics, Big Data management, and information search and management. Currently, he is employed by EMC Corporation's Big Data management and analytics initiative and product engineering wing for their Hadoop distribution. He is a part of the TeraSort and MinuteSort world records, achieved while working with a large financial services firm.

He earned his Bachelor of Technology degree from Uttar Pradesh Technical University with an equivalent CGPA of 8.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at firstname.lastname@example.org for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read, and search across Packt's entire library of books.

Why Subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Table of Contents

Preface

Chapter 1: Getting Ready to Use R and Hadoop
    Installing R
    Installing RStudio
    Understanding the features of R language
    Using R packages
    Performing data operations
    Increasing community support
    Performing data modeling in R
    Installing Hadoop
    Understanding different Hadoop modes
    Understanding Hadoop installation steps
    Installing Hadoop on Linux, Ubuntu flavor (single node cluster)
    Installing Hadoop on Linux, Ubuntu flavor (multinode cluster)
    Installing Cloudera Hadoop on Ubuntu
    Understanding Hadoop features
    Understanding HDFS
    Understanding the characteristics of HDFS
    Understanding MapReduce
    Learning the HDFS and MapReduce architecture
    Understanding the HDFS architecture
    Understanding HDFS components
    Understanding the MapReduce architecture
    Understanding MapReduce components
    Understanding the HDFS and MapReduce architecture by plot
    Understanding Hadoop subprojects
    Summary

Chapter 2: Writing Hadoop MapReduce Programs
    Understanding the basics of MapReduce
    Introducing Hadoop MapReduce
    Listing Hadoop MapReduce entities
    Understanding the Hadoop MapReduce scenario
    Loading data into HDFS
    Executing the Map phase
    Shuffling and sorting
    Reducing phase execution
    Understanding the limitations of MapReduce
    Understanding Hadoop's ability to solve problems
    Understanding the different Java concepts used in Hadoop programming
    Understanding the Hadoop MapReduce fundamentals
    Understanding MapReduce objects
    Deciding the number of Maps in MapReduce
    Deciding the number of Reducers in MapReduce
    Understanding MapReduce dataflow
    Taking a closer look at Hadoop MapReduce terminologies
    Writing a Hadoop MapReduce example
    Understanding the steps to run a MapReduce job
    Learning to monitor and debug a Hadoop MapReduce job
    Exploring HDFS data
    Understanding several possible MapReduce definitions to solve business problems
    Learning the different ways to write Hadoop MapReduce in R
    Learning RHadoop
    Learning RHIPE
    Learning Hadoop streaming
    Summary

Chapter 3: Integrating R and Hadoop
    Introducing RHIPE
    Installing RHIPE
    Installing Hadoop
    Installing R
    Installing protocol buffers
    Environment variables
    The rJava package installation
    Installing RHIPE
    Understanding the architecture of RHIPE
    Understanding RHIPE samples
    RHIPE sample program (Map only)
    Word count
    Understanding the RHIPE function reference
    Initialization
    HDFS
    MapReduce
    Introducing RHadoop
    Understanding the architecture of RHadoop
    Installing RHadoop
    Understanding RHadoop examples
    Word count
    Understanding the RHadoop function reference
    The hdfs package
    The rmr package
    Summary

Chapter 4: Using Hadoop Streaming with R
    Understanding the basics of Hadoop streaming
    Understanding how to run Hadoop streaming with R
    Understanding a MapReduce application
    Understanding how to code a MapReduce application
    Understanding how to run a MapReduce application
    Executing a Hadoop streaming job from the command prompt
    Executing the Hadoop streaming job from R or an RStudio console
    Understanding how to explore the output of MapReduce application
    Exploring an output from the command prompt
    Exploring an output from R or an RStudio console
    Understanding basic R functions used in Hadoop MapReduce scripts
    Monitoring the Hadoop MapReduce job
    Exploring the HadoopStreaming R package
    Understanding the hsTableReader function
    Understanding the hsKeyValReader function
    Understanding the hsLineReader function
    Running a Hadoop streaming job
    Executing the Hadoop streaming job
    Summary

Chapter 5: Learning Data Analytics with R and Hadoop
    Understanding the data analytics project life cycle
    Identifying the problem
    Designing data requirement
    Preprocessing data
    Performing analytics over data
    Visualizing data
    Understanding data analytics problems
    Exploring web pages categorization
    Identifying the problem
    Designing data requirement
    Preprocessing data
    Performing analytics over data
    Visualizing data
    Computing the frequency of stock market change
    Identifying the problem
    Designing data requirement
    Preprocessing data
    Performing analytics over data
    Visualizing data
    Predicting the sale price of blue book for bulldozers – case study
    Identifying the problem
    Designing data requirement
    Preprocessing data
    Performing analytics over data
    Understanding Poisson-approximation resampling
    Summary

Chapter 6: Understanding Big Data Analysis with Machine Learning
    Introduction to machine learning
    Types of machine-learning algorithms
    Supervised machine-learning algorithms
    Linear regression
    Linear regression with R
    Linear regression with R and Hadoop
    Logistic regression
    Logistic regression with R
    Logistic regression with R and Hadoop
    Unsupervised machine learning algorithm
    Clustering
    Clustering with R
    Performing clustering with R and Hadoop
    Recommendation algorithms
    Steps to generate recommendations in R
    Generating recommendations with R and Hadoop
    Summary

Chapter 7: Importing and Exporting Data from Various DBs
    Learning about data files as database
    Understanding different types of files
    Installing R packages
    Importing the data into R
    Exporting the data from R
    Understanding MySQL
    Installing MySQL
    Installing RMySQL
    Learning to list the tables and their structure
    Importing the data into R
    Understanding data manipulation
    Understanding Excel
    Installing Excel
    Importing data into R
    Exporting the data to Excel
    Understanding MongoDB
    Installing MongoDB
    Mapping SQL to MongoDB
    Mapping SQL to MongoQL
    Installing rmongodb
    Importing the data into R
    Understanding data manipulation
    Understanding SQLite
    Understanding features of SQLite
    Installing SQLite
    Installing RSQLite
    Importing the data into R
    Understanding data manipulation
    Understanding PostgreSQL
    Understanding features of PostgreSQL
    Installing PostgreSQL
    Installing RPostgreSQL
    Exporting the data from R
    Understanding Hive
    Understanding features of Hive
    Installing Hive
    Setting up Hive configurations
    Installing RHive
    Understanding RHive operations
    Understanding HBase
    Understanding HBase features
    Installing HBase
    Installing thrift
    Installing RHBase
    Importing the data into R
    Understanding data manipulation
    Summary

Appendix: References
    R + Hadoop help materials
    R groups
    Hadoop groups
    R + Hadoop groups
    Popular R contributors
    Popular Hadoop contributors

Index

Preface

The volume of data that enterprises acquire every day is increasing exponentially.
It is now possible to store these vast amounts of information on low-cost platforms such as Hadoop. The conundrum these organizations now face is what to do with all this data and how to glean key insights from it. This is where R comes into the picture. R is an amazing tool that makes it a snap to run advanced statistical models on data, translate the derived models into colorful graphs and visualizations, and perform many more functions related to data science.

One key drawback of R, though, is that it is not very scalable. The core R engine can process and work on only a very limited amount of data. As Hadoop is very popular for Big Data processing, combining R with Hadoop for scalability is the next logical step. This book is dedicated to R and Hadoop, and to the intricacies of how the data analytics operations of R can be made scalable using a platform such as Hadoop.

With this agenda in mind, this book caters to a wide audience, including data scientists, statisticians, data architects, and engineers who are looking for solutions to process and analyze vast amounts of information using R and Hadoop. Using R with Hadoop provides an elastic data analytics platform that scales depending on the size of the dataset to be analyzed. Experienced programmers can then write MapReduce modules in R and run them using Hadoop's parallel-processing MapReduce mechanism to identify patterns in the dataset.

Introducing R

R is an open source software package for performing statistical analysis on data. R is a programming language used by data scientists, statisticians, and others who need to perform statistical analysis of data and glean key insights from it using mechanisms such as regression, clustering, classification, and text analysis. R is licensed under the GNU General Public License. It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently maintained by the R Development Core Team. R can be considered a different implementation of S, which was developed by John Chambers at Bell Labs. There are some important differences, but a lot of code written for S runs unaltered under the R interpreter engine.

R provides a wide variety of statistical and machine learning techniques (linear and nonlinear modeling, classic statistical tests, time-series analysis, classification, and clustering), as well as graphical techniques, and it is highly extensible. R has various built-in as well as extended functions for statistical, machine learning, and visualization tasks, such as:

• Data extraction
• Data cleaning
• Data loading
• Data transformation
• Statistical analysis
• Predictive modeling
• Data visualization

It is one of the most popular open source statistical analysis packages available on the market today. It is cross-platform, has very wide community support, and has a large and ever-growing user community that adds new packages every day. With its growing list of packages, R can now connect with other data stores, such as MySQL, SQLite, MongoDB, and Hadoop, for data storage activities.

Understanding features of R

Let's look at the different useful features of R:

• Effective programming language
• Relational database support
• Data analytics
• Data visualization
• Extension through the vast library of R packages

Studying the popularity of R

The graph provided by KDnuggets suggests that R is the most popular language for data analysis and mining. The following graph provides details about the total number of R packages released by R users from 2005 to 2013, which is one way to explore the growth of the R user base. The growth was exponential in 2012, and 2013 seems on track to beat it.

R allows performing data analytics through various statistical and machine learning operations, such as the following (a short sketch follows this list):

• Regression
• Classification
• Clustering
• Recommendation
• Text mining
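To give a quick taste of these operations, here is a minimal sketch using only base R and its bundled datasets (an illustration, not an excerpt from the chapters that follow); it fits a regression, runs a clustering, and draws a plot:

    # Regression: model stopping distance as a function of speed.
    model <- lm(dist ~ speed, data = cars)
    summary(model)                   # coefficients, R-squared, p-values

    # Clustering: group iris flowers into three clusters by measurements.
    clusters <- kmeans(iris[, 1:4], centers = 3)
    table(clusters$cluster, iris$Species)

    # Visualization: scatter plot with the fitted regression line.
    plot(cars)
    abline(model, col = "red")

Each of these analyses is a single function call, which is precisely the productivity that makes R so attractive for data analysis.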
Introducing Big Data

Big Data deals with large and complex datasets that can be structured, semi-structured, or unstructured, and that will typically not fit into memory to be processed. They have to be processed in place, which means that computation has to be done where the data resides. When we talk to developers, the people actually building Big Data systems and applications, we get a better idea of what they mean by the 3Vs model of Big Data: velocity, volume, and variety.

Velocity refers to the low-latency, real-time speed at which analytics need to be applied. A typical example of this would be performing analytics on a continuous stream of data originating from a social networking site, or on an aggregation of disparate sources of data.

Volume refers to the size of the dataset. It may be in KB, MB, GB, TB, or PB, based on the type of application that generates or receives the data.

Variety refers to the various types of data that can exist, for example, text, audio, video, and photos.

Big Data usually includes datasets with sizes beyond the ability of commonly used software tools to capture, manage, and process within the time frame mandated by the business. Big Data volumes are a constantly moving target; as of 2012 they ranged from a few dozen terabytes to many petabytes of data in a single dataset. Faced with this seemingly insurmountable challenge, entirely new platforms have emerged, called Big Data platforms.

Getting information about popular organizations that hold Big Data

Some of the popular organizations that hold Big Data are as follows:

• Facebook: It has 40 PB of data and captures 100 TB/day
• Yahoo!: It has 60 PB of data
• Twitter: It captures 8 TB/day
• eBay: It has 40 PB of data and captures 50 TB/day

How much data is considered Big Data differs from company to company. Though it is true that one company's Big Data is another's small data, there is something in common: the data doesn't fit in memory or on a single disk, there is a rapid influx of data that needs to be processed, and the organization would benefit from distributed software stacks. For some companies, 10 TB of data would be considered Big Data; for others, 1 PB would be. So only you can determine whether your data is really Big Data. It is sufficient to say that it starts in the low terabyte range.

Also, a question well worth asking is: if you are not capturing and retaining enough of your data, are you sure you do not have a Big Data problem now? In some scenarios, companies literally discard data because there isn't a cost-effective way to store and process it. With platforms such as Hadoop, it is possible to start capturing and storing all that data.

Introducing Hadoop

Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. Hadoop is a top-level Apache project, initiated and led by Yahoo! and Doug Cutting. It relies on an active community of contributors from all over the world for its success. With a significant technology investment by Yahoo!, Apache Hadoop has become an enterprise-ready cloud computing technology. It is becoming the industry's de facto framework for Big Data processing.
Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics: Hadoop enables scalable, cost-effective, flexible, and fault-tolerant solutions.

Exploring Hadoop features

Apache Hadoop has two main features:

• HDFS (Hadoop Distributed File System)
• MapReduce

Studying Hadoop components

Hadoop includes an ecosystem of other products built over the core HDFS and MapReduce layer to enable various types of operations on the platform. A few popular Hadoop components are as follows:

• Mahout: This is an extensive library of machine learning algorithms.
• Pig: Pig is a high-level language (similar to Perl) for analyzing large datasets, with its own language syntax for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
• Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in HDFS. It has its own SQL-like query language called Hive Query Language (HQL), which is used to issue query commands to Hadoop.
• HBase: HBase (Hadoop Database) is a distributed, column-oriented database. HBase uses HDFS for the underlying storage. It supports both batch-style computations using MapReduce and atomic queries (random reads).
• Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured relational databases. Sqoop is an abbreviation of (SQ)L to Had(oop).
• ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services, which are very useful for a variety of distributed systems.
• Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

Understanding the reason for using R and Hadoop together

Sometimes the data resides on HDFS (in various formats), and since a lot of data analysts are very productive in R, it is natural to use R to compute with the data stored through Hadoop-related tools.

As mentioned earlier, the strengths of R lie in its ability to analyze data using a rich library of packages, but it falls short when it comes to working on very large datasets. The strength of Hadoop, on the other hand, is to store and process very large amounts of data, in the TB and even PB range. Such vast datasets cannot be processed in memory, as the RAM of each machine cannot hold them. The options would be either to run the analysis on limited chunks of data (also known as sampling) or to combine the analytical power of R with the storage and processing power of Hadoop, arriving at an ideal solution. Such solutions can also be achieved in the cloud using platforms such as Amazon EMR. A small sketch of this combination follows.
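To make the idea concrete, here is a minimal sketch in the style of the RHadoop rmr2 package (introduced in Chapter 3). This is an assumption-laden illustration, not a recipe from the chapters: it presumes rmr2 is installed and that environment variables such as HADOOP_CMD already point to a working Hadoop installation.

    # Assumes the RHadoop 'rmr2' package and a configured Hadoop setup.
    library(rmr2)

    # Write a small vector of integers into HDFS.
    ints <- to.dfs(1:1000)

    # A map-only MapReduce job that squares every value in parallel.
    squares <- mapreduce(
      input = ints,
      map   = function(k, v) keyval(v, v^2)
    )

    # Read the (key, value) results back into the local R session.
    result <- from.dfs(squares)
    head(result$val)

The same script, unchanged, scales from a single-node test setup to a large cluster or a cloud service such as Amazon EMR, because Hadoop, not R, is responsible for distributing the work.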
What this book covers

Chapter 1, Getting Ready to Use R and Hadoop, gives an introduction to R and Hadoop as well as the process of installing them.

Chapter 2, Writing Hadoop MapReduce Programs, covers the basics of Hadoop MapReduce and ways to execute MapReduce using Hadoop.

Chapter 3, Integrating R and Hadoop, shows the deployment and running of sample MapReduce programs for RHadoop and RHIPE through various data-handling processes.

Chapter 4, Using Hadoop Streaming with R, shows how to use Hadoop streaming with R.

Chapter 5, Learning Data Analytics with R and Hadoop, introduces the data analytics project life cycle by demonstrating it with real-world data analytics problems.

Chapter 6, Understanding Big Data Analysis with Machine Learning, covers performing Big Data analytics using machine learning techniques with RHadoop.

Chapter 7, Importing and Exporting Data from Various DBs, covers how to interface with popular databases to perform data import and export operations with R.

Appendix, References, lists links to additional resources related to the content of all the chapters.

What you need for this book

As we are going to perform Big Data analytics with R and Hadoop, you should have basic knowledge of R and Hadoop, and you will need to have R and Hadoop installed and configured in order to work through the practical exercises. It would be great if you already have a large dataset and a problem definition that can be solved with data-driven technologies, such as R and Hadoop functions.

Who this book is for

This book is great for R developers who are looking for a way to perform Big Data analytics with Hadoop. It covers the techniques for integrating R and Hadoop, explains how to write Hadoop MapReduce, and provides tutorials for developing and running Hadoop MapReduce within R. This book is also aimed at those who know Hadoop and want to build intelligent applications over Big Data with R packages. It would be helpful if readers have basic knowledge of R.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Preparing the Map() input."

A block of code is set as follows:

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.
      </description>
    </property>
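For orientation, a property block like the one above does not stand alone: in a typical single-node Hadoop 1.x setup (an assumption about the environment, not text from this book), it would live in conf/mapred-site.xml inside that file's top-level configuration element:

    <?xml version="1.0"?>
    <!-- conf/mapred-site.xml: a sketch of the enclosing file; the value
         shown matches the block above. -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
      </property>
    </configuration>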