HADOOP: Strengths and Limitations in National Security Missions
How SAP Solutions Complement Hadoop to Enable Mission Success
Bob Palmer | Senior Director, SAP National Security Services
June 2012
www.SAPNS2.com

Hadoop – a distributed file system for managing large data sets – is a top-of-mind subject in the U.S. Intelligence Community and Department of Defense, as well as in large commercial operations such as Yahoo, eBay and Amazon. This paper seeks to empower our clients and partners with a fundamental understanding of:

■ The basic architecture of Hadoop;
■ Which features and benefits of Hadoop are driving its adoption;
■ Misconceptions that lead to the idea that Hadoop is the "hammer for every nail;" and
■ How SAP solutions can extend and complement Hadoop to achieve optimal mission or business outcomes for your organization.

What Is Hadoop?

Hadoop is a "distributed file system," not a database. The Hadoop Distributed File System (HDFS) manages the splitting up and storage of large files of data across many inexpensive commodity servers, which are known as "worker nodes." When Hadoop splits up the files, it puts redundant copies of the chunks of each file on more than one disc drive, providing "self-healing" redundancy if a low-cost commodity server fails.

Hadoop also manages the distribution of scripts that perform business logic on the data files that are split up across those many server nodes. This splitting up of the business logic to the CPUs and RAM of many inexpensive worker nodes is what makes Hadoop work well on very large "Big Data" files. Analysis logic is performed in parallel on all of the server nodes at once, on each of the 64MB or 128MB chunks of the file. Hadoop software is written in Java and is licensed for free; it was developed as an open-source initiative of the Apache Foundation.
The name Hadoop is not an acronym; it was the name of a toy elephant belonging to the son of Doug Cutting, the original author of the Hadoop framework. Companies such as Cloudera and Hortonworks sell support for Hadoop; they maintain their own distributions and perform support and maintenance for an annual fee based on the number of nodes the customer has. They also sell licensed software utilities to manage and enhance the freeware Hadoop software. Thus, we can say that Cloudera and Hortonworks are to Hadoop as Red Hat and SUSE are to Linux.

Hadoop Architecture

The Secret Sauce: MapReduce

The exploration or analysis of the data in the Hadoop file system is done through a software process called MapReduce. MapReduce scripts are written in Java by developers on the "edge server." The MapReduce engine breaks up jobs into many small tasks, run in parallel and launched on all nodes that hold a part of the file; this is how Hadoop addresses the problem of storing and analyzing Big Data. Both the data and the computations on that data reside in the RAM memory of the node computer.

The map function of the MapReduce engine distributes the logic that the developer wants to execute on a given data file to each of the CPU processors on the nodes that hold a chunk of that file on their discs. The reduce function takes the raw results of the mapping function and performs the desired business-logic computations against them: aggregation, sums, word counts, averages, et cetera.

As an example of a MapReduce function, let's say that you want to find out the average temperature for every city in America in 1901. The mapper function would organize data from all the chunks of your one-terabyte "temperatures for the 20th century" file using the key of "year 1901." Values returned from each of the nodes would be 78°, 58°, 32°, 64°, and so on. The reducer task would then take the mapped data organized by year and average them.
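The temperature example can be sketched in miniature. The following is a local Python sketch of the map/shuffle/reduce flow; real Hadoop jobs are written in Java against much larger files, and the records, city names, and function names here are purely illustrative:

```python
from collections import defaultdict

# Illustrative records: (year, city, temperature in degrees F).
records = [
    (1901, "Boston", 78), (1901, "Boston", 58),
    (1901, "Denver", 32), (1902, "Boston", 64),
    (1901, "Denver", 40),
]

def mapper(record):
    """Emit (key, value) pairs for the year of interest, like a map task."""
    year, city, temp = record
    if year == 1901:              # filter on the key "year 1901"
        yield (city, temp)

def reducer(city, temps):
    """Aggregate the mapped values for one key, like a reduce task."""
    return city, sum(temps) / len(temps)

# Shuffle/sort phase: group the mapped values by key.
grouped = defaultdict(list)
for record in records:
    for city, temp in mapper(record):
        grouped[city].append(temp)

averages = dict(reducer(c, t) for c, t in grouped.items())
print(averages)  # {'Boston': 68.0, 'Denver': 36.0}
```

In a real cluster the mapper runs in parallel on every node holding a chunk of the file, and the shuffle happens over the network; the logic, however, is the same.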
Hadoop is self-healing because HDFS splits large files into 64MB or 128MB chunks spread out over the many disc drives of the nodes, and redundantly puts copies of each chunk on more than one disc drive. This is why one can build Hadoop deployments using hundreds of inexpensive commodity servers with plain-old disc drives. If a box fails, you don't even fix it; you just yank it out and trash it, and Hadoop manages the redundant data with no loss. Scalability is simple to achieve: if the CPUs or RAM memory are being taxed by the MapReduce scripts, just add more inexpensive commodity server CPUs and disc drives.

Hadoop as a Threat to Hardware Manufacturers

Because Hadoop was conceived as a way to slash the per-terabyte cost of storing very large amounts of data (the software is free, and cheap commodity servers and disc drives hold the data), it is certainly a threat to high-end hardware manufacturers. One response of the hardware vendors has been to create "Hadoop appliances." For example, enterprise hardware manufacturers will deliver a "refrigerator box" to your loading dock that has thousands of nodes of Cloudera- or Hortonworks-brand Hadoop already loaded on it, with redundant cooling and network interfaces. This approach appeals to a customer who has a long-standing relationship with a hardware vendor and wants a turn-key way to store large amounts of semi-structured data. But it runs against the original concept of Hadoop, namely the use of cheap commodity hardware with self-healing software to store data at a minimum cost per terabyte. Hadoop appliances also appeal to customers who do not have the existing data-center real estate and infrastructure to power and cool hundreds of commodity servers.

Utilities that Interface with Hadoop

MapReduce scripts can also be generated automatically by utilities such as HIVE and PIG.
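The self-healing replication behavior can be illustrated with a small sketch. Assuming a toy placement of three chunks over four nodes (the chunk and node names are hypothetical, and production HDFS defaults to three replicas per chunk rather than the two shown here), losing any single node still leaves a live copy of every chunk:

```python
# Chunks of one large file, each replicated on two of four worker nodes.
# Names are illustrative; HDFS defaults to three replicas per chunk.
placement = {
    "chunk-0": {"node-a", "node-b"},
    "chunk-1": {"node-b", "node-c"},
    "chunk-2": {"node-c", "node-d"},
}

def file_survives(failed_node):
    """True if every chunk still has at least one live replica."""
    return all(nodes - {failed_node} for nodes in placement.values())

# Yank out any single commodity server: the file stays fully readable.
all_nodes = {n for replicas in placement.values() for n in replicas}
assert all(file_survives(n) for n in all_nodes)
print("file survives any single-node failure")
```

When a node is lost, Hadoop also re-replicates the affected chunks onto surviving nodes to restore the replica count, which is what makes "yank it out and trash it" a viable operational policy.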
HIVE is a utility for developers who know SQL (Structured Query Language); it takes a subset of ANSI SQL and generates MapReduce scripts from it. Note that not all SQL commands are appropriate for a distributed file system like Hadoop. HIVE could be used to bridge from a SQL-based business intelligence (BI) reporting front-end to Hadoop, but only part of the full SQL functionality could be executed by Hadoop; the HIVE user documentation lists which SQL functions are available. Compared to highly optimized, handwritten MapReduce scripts, HIVE-generated scripts typically run about 1.7 times slower.

PIG is a similar utility that creates MapReduce jobs, except that it is intended for developers who are conversant in Perl or Python scripting. It uses a scripting language called PIG Latin.

Sqoop is a bulk loader for Hadoop; it transfers data from structured sources such as relational databases into Hadoop.

Mahout is a utility that performs data mining, text analysis, and complex statistical analysis on Hadoop files. It is a highly configurable, robust, and complex development tool.

The Perceived Business Benefits of Hadoop

The attractions of Hadoop are similar to those of other open-source business drivers. The procurement process is simpler, without the need for capital expenditure, because of the absence of initial software licensing costs. Paid support from vendors like Cloudera and Hortonworks can be an operations and maintenance (O&M) budget expenditure. The skilled labor needed for MapReduce scripting (and for updating those scripts to reflect changing end-user demand) may be seen as a sunk cost if a large employee base of developers already exists in a client's shop or amongst the system-integrator staff. Aside from the initial cost, the freedom and flexibility of deployment unconstrained by licensing considerations make Hadoop attractive to customers.
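To see why a utility like HIVE can translate SQL into MapReduce scripts, note that a GROUP BY aggregation decomposes naturally into a map step (emit the grouping key) and a reduce step (aggregate per key). Below is a minimal Python sketch of the equivalent of a `SELECT word, COUNT(*) ... GROUP BY word` query; this is a conceptual illustration with invented data, not the code HIVE actually generates:

```python
from collections import Counter
from itertools import chain

# Each "chunk" stands in for a 64MB/128MB block of a large file.
chunks = [
    "signal data log signal",
    "log log sensor",
]

def mapper(chunk):
    # Project the GROUP BY key: emit a (word, 1) pair per word.
    return [(word, 1) for word in chunk.split()]

def reduce_counts(pairs):
    # COUNT(*) per group: sum the 1s for each key.
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

mapped = chain.from_iterable(mapper(c) for c in chunks)
result = reduce_counts(mapped)
print(result)  # {'signal': 2, 'data': 1, 'log': 3, 'sensor': 1}
```

SQL constructs that decompose this way (filters, projections, aggregations, many joins) are the subset HIVE supports; constructs that assume indexes or in-place updates are the ones that do not map onto a distributed file system.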
What Hadoop Is Not So Good At

Since Hadoop is a file system and not a database, there is no way to change the data in the files.1 There is no such thing as an update or delete function in Hadoop as there is in a database-management system, and no concept of committing or rolling back data as in a transaction-processing database system. So we can say that Hadoop is a "write-once, read-many" affair. Therefore, Hadoop is best used for storing large amounts of unstructured or semi-structured data from streaming data sources like log files, internet search results (click-stream data), signals intelligence, or sensor data. It is not as well suited for storing and managing changes to discrete business or mission data elements that have precise meta-data definitions. In order to do analysis or exploration of a file in Hadoop, the whole file must be read for every computational process, because by its nature there are no predefined data schemas or indexes in Hadoop.2

1 There are schemes which can append log files to Hadoop files to emulate the deletion or changing of data elements in Hadoop files. On query, the log file is invoked to indicate that the result of the MapReduce should not include a given element.
2 Hadoop is "schema-on-read."

Since Hadoop is architected to accommodate very large data files by splitting each file into small chunks over many worker nodes, it does not perform well if many small files are stored in the distributed file system. If only one or two nodes are needed for the file size, the overhead of managing the distribution has no economy of scale. Therefore we can generalize that Hadoop is not a good way to organize and analyze large numbers of smaller files.
Hadoop is good for asking novel questions of raw data that you never even knew you would need as you collected it, when you expect that it will take a lot of batch-processing time to get the results you are looking for.3 It is less useful when you know the types of questions that will be asked of well-structured data sets. And Hadoop will not return query results in the sub-second response times that business intelligence users are accustomed to for report queries based on well-structured and indexed data warehouses. For example, Hadoop is not the best solution when a CEO asks the question: "Who were our 10 biggest customers in North America in Q3, by product line?" Conversely, if you tried to read and sort every single row of Big Data on a relational database management system without any indexes, it could take hours, instead of the half hour that Hadoop might need for the same-sized file. It's really apples and oranges, i.e. different use-cases for Hadoop versus data-warehouse business intelligence.

In other words, Hadoop falls short in performance when a user wants to ask a specific business question to find just three records in a large data set.4 A well-indexed relational database or a columnar database like SAP's Sybase IQ can deliver those three records out of billions with sub-second response times. Hadoop doesn't know how to look at just a few records to get an answer returned in milliseconds; it would have to go through the whole file to generate the result set that the script was looking for. Therefore Hadoop would be a poor choice for a master recruiting database, but it would be a very good repository to track click-stream responses to the recruiting website.

3 For example, an anecdotal data fable is that a large national retailer captures every single check-out in Hadoop.
By analyzing the data to understand how products should be placed on shelves, the retailer found that beer and diapers often appeared in the same checkout and should be conveniently shelved near each other, presumably because male buyers who are sent to the store for diapers often also buy beer.
4 Note that this is the very use-case that is right in the wheelhouse of SAP Sybase IQ and of SAP's in-memory data warehouse, HANA.

Hadoop as an ETL Tool to Load Databases

Hadoop is essentially batch-oriented; a developer builds a MapReduce script to look at a whole one- or two-terabyte file and expects it to run for 20 minutes to an hour or longer. The idea of Hadoop is to be cheap enough per terabyte that you can store all of the raw data,5 even data that currently has no anticipated uses, which is certainly not the philosophy of a well-constructed data warehouse. Because of this, an important use-case for Hadoop is as an ETL (Extract, Transform and Load)6 tool: culling specific, important data elements that have business or mission value out of the huge data files and loading them into a relational database management system or a columnar database, for transactional application use or business intelligence analysis.

Optimizing Mission Performance Using a Hybrid Solution: SAP's Sybase and BusinessObjects Solutions with Hadoop

This is where SAP's Sybase IQ Columnar Database Management System, combined with SAP BusinessObjects reporting, can add important functionality to the total Big Data environment. The result of the reduce function may be loaded into an "analysis repository," where it can then be exposed to end-users through a self-service business intelligence tool like SAP BusinessObjects Web Intelligence and Explorer. The SAP Sybase columnar RDBMS is uniquely appropriate as a repository for the result of the MapReduce from the Hadoop system.
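The hybrid pattern (reduce output culled from raw files, then loaded into a relational store for fast querying) can be sketched end-to-end with Python's built-in sqlite3 module standing in for the target analysis repository. This is a deliberate simplification: Sybase IQ is not SQLite, and the table, column, and page names below are invented for illustration.

```python
import sqlite3

# Pretend this is the culled output of a reduce job over raw
# click-stream files (illustrative page names and hit counts).
reduce_output = [
    ("recruit-landing", 1042),
    ("recruit-apply", 311),
    ("recruit-faq", 77),
]

conn = sqlite3.connect(":memory:")  # stand-in for the analysis repository
conn.execute("CREATE TABLE page_hits (page TEXT, hits INTEGER)")
conn.executemany("INSERT INTO page_hits VALUES (?, ?)", reduce_output)

# A BI-style question answered without re-scanning any raw Hadoop file:
row = conn.execute(
    "SELECT page, hits FROM page_hits ORDER BY hits DESC LIMIT 1"
).fetchone()
print(row)  # ('recruit-landing', 1042)
```

The point of the design is the division of labor: the batch MapReduce job does the heavy scan once, and every subsequent analytical question runs against the small, structured result set.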
It is optimized in such a way that there is no bottleneck in reporting performance, even for extremely large data sets coming from a reduce job. Sybase IQ by its nature is "self-indexing." In addition, the columnar (versus row-based) storage of data means that the "seek time" response is much shorter, even when analyzing the very large data sets that may result from a Hadoop MapReduce script.

The technical reason that Sybase IQ's columnar database is much faster for analysis than a traditional row-oriented database is that row-oriented databases serialize all of the attributes of an entity together in one place on the disc drive. For example, let's say you had the following data:

John Doe     E8    $65,000    12/01/1975
Mary Smith   E4    $22,000    05/02/1989
James Jones  O2    $23,000    06/14/1990
etc.         etc.  etc.       etc.

5 For example, anecdotal rumor has it that a large telecommunications vendor couldn't figure out what caused a system-wide outage for several days because the log files were on too many different servers and could not be analyzed together. Now it collects all the logs into one large Hadoop system.
6 Or, more precisely in most cases, Extract, Load and Transform.

On the disc, the row-based storage would look like this:

1: John Doe,E8,$65,000,12/01/1975,etc.
2: Mary Smith,E4,$22,000,05/02/1989,etc.
3: James Jones,O2,$23,000,06/14/1990,etc.

So if you were doing an analysis of salary level, you would have to query every row, which may be in different segments of the disc, and look for the attributes in which you have analytical interest. This process can be accelerated by building an indexing strategy to reduce the number of rows that need to be queried to get to an analytical result, but that presupposes that the queries to be asked are planned and known ahead of time; such "tuning" is against the paradigm of true ad-hoc query and analysis.
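To make the row-versus-column trade-off concrete, here is a small Python sketch of both layouts using the illustrative personnel table above. The in-memory lists are stand-ins for on-disc serialization; counting the values touched shows why a salary analysis is cheaper against the column layout:

```python
# Row-oriented layout: each record's attributes serialized together
# (the illustrative personnel table from the text).
rows = [
    ("John Doe",    "E8", 65000, "12/01/1975"),
    ("Mary Smith",  "E4", 22000, "05/02/1989"),
    ("James Jones", "O2", 23000, "06/14/1990"),
]

# Column-oriented layout: each attribute serialized together.
columns = {
    "name":   [r[0] for r in rows],
    "grade":  [r[1] for r in rows],
    "salary": [r[2] for r in rows],
    "dob":    [r[3] for r in rows],
}

# Salary analysis on the row store: every attribute of every row is read.
row_reads = sum(len(r) for r in rows)            # 12 values touched
row_avg = sum(r[2] for r in rows) / len(rows)

# Salary analysis on the column store: one contiguous list is read.
col_reads = len(columns["salary"])               # 3 values touched
col_avg = sum(columns["salary"]) / len(columns["salary"])

assert row_avg == col_avg                        # same answer either way
print(row_reads, col_reads)  # 12 3
```

With three rows the difference is trivial, but with billions of rows and dozens of attributes the column store reads a small fraction of the data for the same analytical answer.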
In contrast, a columnar data store makes all the attributes of the entities quickly accessible on the disc in one place, e.g.:

1: John Doe,Mary Smith,James Jones,etc.
2: E8,E4,O2,etc.
3: $65,000,$22,000,$23,000,etc.
4: 12-01-1975,05-02-1989,06-14-1990,etc.

So all the attributes that are needed for analysis are found together in the same area of the disc; this allows the system to use far fewer disc "hits" when accessing data for analysis. It also allows powerful data compression, because bit maps can be used to represent the data elements instead of the full depiction of the data. This can result in significant savings in data-center infrastructure costs. Typical compression rates are 3.5 times compression versus raw data, and 10 times compression versus row-based database storage of that data.7

7 For example, one large U.S. federal government user of Sybase IQ realized a significant reduction in rack space, from 45 racks to 13 racks, including database, extract-transform-and-load, and storage components.

The current state of the practice for delivering the output of Hadoop MapReduce jobs is often handwritten Java reports or delimited files, which by their nature are not flexible or ad hoc. In addition, if the requirements for analysis are not static, there is an ongoing need to refine and rewrite MapReduce jobs to execute analysis. This typically has been done as a collaboration between the data scientist or MapReduce developer and the end-user, who is a subject matter expert in the mission.

Iterative Process of Analysis Using MapReduce

As shown above, this could produce a bottleneck in analysis that has nothing to do with data-node scalability; there may be limited man-hours available for data scientists to react to the ever-changing analysis needs of the end-user analysts. Instead, we propose that Hadoop could be considered a catch-all collection point for continuously generated data.
Mission-useful data could then be extracted for further analysis using business logic in Hadoop MapReduce jobs. The output of the MapReduce jobs could be used to load SAP Sybase IQ, and SAP BusinessObjects could then be used to analyze and visualize that result to make it actionable for the mission, in a graphical and flexible user interface (see diagram below). This would empower analysts to get different analytical views and juxtapositions of the data elements, with sub-second response, without the need to go back to the IT shop for a revision to a Reduce job or a new Java report.

Total-cost-of-ownership discussions about Hadoop must include the labor it takes to handwrite and constantly update the MapReduce scripts as driven by ever-changing end-user demands. Here the strengths of SAP BusinessObjects' ad hoc reporting capabilities can be important. To the extent that end-user requirements to analyze data can be pushed to a more self-service front-end using SAP BusinessObjects, the need for IT involvement to create new MapReduce scripts can be constrained. This will shorten the time between end-user desire and fulfillment.

Thus one could re-frame the use of MapReduce. Instead of having end-user analysis logic distributed to the nodes, use Hadoop to do what it does best, i.e. the heavy lifting against the raw data files with MapReduce; then load the pertinent data that is likely to be needed to answer a given mission need into a Sybase IQ "analysis repository," which can be analyzed with high performance and ad hoc flexibility, without rewriting scripts, using SAP BusinessObjects Explorer and Web Intelligence. The SAP BusinessObjects Web Intelligence and Explorer user interfaces are designed to empower non-IT subject matter experts to navigate the data domain and "slice and dice" the data set without any expertise in query language or the underlying data structure.
Hadoop with Ad-Hoc BI Analysis Capabilities

The inherent architecture of Sybase IQ as an analytical repository with sub-second response times, linked to the SAP BusinessObjects self-service graphical user interface, means that the organization can be more agile and responsive to user-driven needs for new analysis. In addition, Sybase IQ is well-suited to be the landing place for the results files of a MapReduce job, because the self-indexing nature of its columnar data store means that it is not necessary to plan a well-designed star-schema data structure to receive the data, as it would be in a row-based database. If desired, SAP's Sybase mobility solutions can even enable the delivery of the analysis product to mobile or hand-held devices in the field.

The analysis repository may also be the nexus between Hadoop data and other systems in the enterprise; reports can thus seamlessly present analysis of data from heterogeneous systems, data warehouses, and even transactional application data. In the recruiting example cited earlier, this methodology could relate the well-structured data elements about a prospective recruit to the semi-structured data coming into Hadoop files from a recruiting website, and then present that information to the staff in a highly graphical and flexible self-service user interface.

Sybase IQ as Nexus for Hadoop Data and Structured Data

Data Transparency to the Analyst

Summary

With this paradigm, we can help clients to optimize mission performance by leveraging the best of both worlds: the ability to store absolutely all incoming data, without regard to size or structure, in a cost-effective Hadoop file system; and then the use of MapReduce to create a universe (or universes) of data elements that can be delivered to non-technical end-users for analysis with ad hoc flexibility and sub-second response times, using SAP's industry-leading solutions for reporting and analysis.
Author:
Bob Palmer
Senior Director, SAP National Security Services (SAP NS2)
firstname.lastname@example.org
301.641.7785

SAP, R/3, xApps, xApp, SAP NetWeaver, Duet, SAP Business ByDesign, ByDesign, PartnerEdge and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects S.A. in the United States and in several other countries. Business Objects is an SAP Company. All other product and service names mentioned and associated logos displayed are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.

The information in this document is proprietary to SAP. This document is a preliminary version and not subject to your license agreement or any other agreement with SAP. This document contains only intended strategies, developments, and functionalities of the SAP® product and is not intended to be binding upon SAP to any particular course of business, product strategy, and/or development. Please note that this document is subject to change and may be changed by SAP at any time without notice. SAP assumes no responsibility for errors or omissions in this document. SAP does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
SAP shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. This limitation shall not apply in cases of intent or gross negligence.