Algorithms and Data Structures for External Memory
Jeffrey Scott Vitter

Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. Algorithms and Data Structures for External Memory surveys the state of the art in the design and analysis of external memory (or EM) algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs. A variety of EM paradigms are considered for solving batched and online problems efficiently in external memory. The book describes several useful paradigms for the design and implementation of efficient EM algorithms and data structures. The problem domains considered include sorting, permuting, FFT, scientific computing, computational geometry, graphs, databases, geographic information systems, and text and string processing. Algorithms and Data Structures for External Memory is an invaluable reference for anybody interested in, or conducting research in, the design, analysis, and implementation of algorithms and data structures. This book was originally published as Foundations and Trends® in Theoretical Computer Science, Volume 2, Issue 4, ISSN 1551-305X.
Foundations and Trends® in Theoretical Computer Science 2:4

Algorithms and Data Structures for External Memory

Jeffrey Scott Vitter
Department of Computer Science
Purdue University
West Lafayette, Indiana 47907-2107
USA
jsv@purdue.edu

Boston – Delft

Foundations and Trends® in Theoretical Computer Science

Published, sold and distributed by:
now Publishers Inc.
PO Box 1024
Hanover, MA 02339
USA
Tel. +1-781-985-4510
www.nowpublishers.com
sales@nowpublishers.com

Outside North America:
now Publishers Inc.
PO Box 179
2600 AD Delft
The Netherlands
Tel. +31-6-51115274

The preferred citation for this publication is J. S. Vitter, Algorithms and Data Structures for External Memory, Foundations and Trends® in Theoretical Computer Science, vol. 2, no. 4, pp. 305–474, 2006.

ISBN: 978-1-60198-106-6
© 2008 J. S. Vitter

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers.

Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc. for users registered with the Copyright Clearance Center (CCC). The 'services' for users can be found on the internet at: www.copyright.com. For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale.
In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com; sales@nowpublishers.com

now Publishers Inc. has an exclusive license to publish this material worldwide. Permission to use this content must be obtained from the copyright license holder. Please apply to now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail: sales@nowpublishers.com

Foundations and Trends® in Theoretical Computer Science
Volume 2, Issue 4, 2006

Editorial Board

Editor-in-Chief: Madhu Sudan
Department of CS and EE
MIT, Stata Center, Room G640
32 Vassar Street, Cambridge MA 02139, USA
madhu@mit.edu

Editors
Bernard Chazelle (Princeton)
Oded Goldreich (Weizmann Inst.)
Shafi Goldwasser (MIT and Weizmann Inst.)
Jon Kleinberg (Cornell University)
László Lovász (Microsoft Research)
Christos Papadimitriou (UC Berkeley)
Prabhakar Raghavan (Yahoo! Research)
Peter Shor (MIT)
Madhu Sudan (MIT)
Éva Tardos (Cornell University)
Avi Wigderson (IAS)

Editorial Scope

Foundations and Trends® in Theoretical Computer Science will publish survey and tutorial articles in the following topics:
• Algorithmic game theory
• Computational algebra
• Computational aspects of combinatorics and graph theory
• Computational aspects of communication
• Computational biology
• Computational complexity
• Computational geometry
• Computational learning
• Computational models and complexity
• Computational number theory
• Cryptography and information security
• Data structures
• Database theory
• Design and analysis of algorithms
• Distributed computing
• Information retrieval
• Operations research
• Parallel algorithms
• Quantum computation
• Randomness in computation

Information for Librarians
Foundations and Trends® in Theoretical Computer Science, 2006, Volume 2, 4 issues. ISSN paper version 1551-305X. ISSN online version 1551-3068.
Also available as a combined paper and online subscription.

Foundations and Trends® in Theoretical Computer Science
Vol. 2, No. 4 (2006) 305–474
© 2008 J. S. Vitter
DOI: 10.1561/0400000014

Algorithms and Data Structures for External Memory

Jeffrey Scott Vitter
Department of Computer Science, Purdue University, West Lafayette, Indiana, 47907-2107, USA, jsv@purdue.edu

Abstract

Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this manuscript, we survey the state of the art in the design and analysis of algorithms and data structures for external memory (or EM for short), where the goal is to exploit locality and parallelism in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory.

For the batched problem of sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices, geometric data, and graphs.

In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide convenient means in online data structures to make effective use of the data accessed from disk.
We also re-examine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length such as character strings, when the data structure is compressed to save space, or when the allocated amount of internal memory can change dynamically.

Programming tools and environments are available for simplifying the EM programming task. We report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than other methods used in practice.

Preface

I first became fascinated about the tradeoffs between computing and memory usage while a graduate student at Stanford University. Over the following years, this theme has influenced much of what I have done professionally, not only in the field of external memory algorithms, which this manuscript is about, but also on other topics such as data compression, data mining, databases, prefetching/caching, and random sampling.

The reality of the computer world is that no matter how fast computers are and no matter how much data storage they provide, there will always be a desire and need to push the envelope. The solution is not to wait for the next generation of computers, but rather to examine the fundamental constraints in order to understand the limits of what is possible and to translate that understanding into effective solutions.

In this manuscript you will consider a scenario that arises often in large computing applications, namely, that the relevant data sets are simply too massive to fit completely inside the computer's internal memory and must instead reside on disk. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck.
This manuscript provides a detailed overview of the design and analysis of algorithms and data structures for external memory (or simply EM), where the goal is to exploit locality and parallelism in order to reduce the I/O costs. Along the way, you will learn a variety of EM paradigms for solving batched and online problems efficiently.

For the batched problem of sorting and related problems like permuting and fast Fourier transform, the two fundamental paradigms are distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently, including an elegant duality property that yields state-of-the-art algorithms. You will encounter other useful techniques for batched EM problems involving matrices (such as matrix multiplication and transposition), geometric data (such as finding intersections and constructing convex hulls), and graphs (such as list ranking, connected components, topological sorting, and shortest paths).

In the online domain, which involves constructing data structures to answer queries, we discuss two canonical EM search applications: dictionary lookup and range searching. Two important paradigms for developing indexed data structures for these problems are hashing (including extendible hashing) and tree-based search (including B-trees). The paradigms of filtering and bootstrapping provide convenient means in online data structures to make effective use of the data accessed from disk.

You will also be exposed to some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length (e.g., strings of text), when the data structure is compressed to save space, and when the allocated amount of internal memory can change dynamically.
Programming tools and environments are available for simplifying the EM programming task. You will see some experimental results in the domain of spatial databases using the TPIE system, which stands for Transparent Parallel I/O programming Environment. The newly developed EM algorithms and data structures that incorporate the paradigms discussed in this manuscript are significantly faster than other methods used in practice.

I would like to thank my colleagues for several helpful comments, especially Pankaj Agarwal, Lars Arge, Ricardo Baeza-Yates, Adam Buchsbaum, Jeffrey Chase, Michael Goodrich, Wing-Kai Hon, David Hutchinson, Gonzalo Navarro, Vasilis Samoladas, Peter Sanders, Rahul Shah, Amin Vahdat, and Norbert Zeh. I also thank the referees and editors for their help and suggestions, as well as the many wonderful staff members I've had the privilege to work with. Figure 1.1 is a modified version of a figure by Darren Vengroff, and Figures 2.1 and 5.2 come from [118, 342]. Figures 5.4–5.8, 8.2–8.3, 10.1, 12.1, 12.2, 12.4, and 14.1 are modified versions of figures in [202, 47, 147, 210, 41, 50, 158], respectively.

This manuscript is an expanded and updated version of the article in ACM Computing Surveys, Vol. 33, No. 2, June 2001. I am very appreciative for the support provided by the National Science Foundation through research grants CCR-9522047, EIA-9870734, CCR-9877133, IIS-0415097, and CCF-0621457; by the Army Research Office through MURI grant DAAH04-96-1-0013; and by IBM Corporation. Part of this manuscript was done at Duke University, Durham, North Carolina; the University of Aarhus, Århus, Denmark; INRIA, Sophia Antipolis, France; and Purdue University, West Lafayette, Indiana.

I especially want to thank my wife Sharon and our three kids (or more accurately, young adults) Jillian, Scott, and Audrey for their ever-present love and support. I most gratefully dedicate this manuscript to them.

West Lafayette, Indiana —J.S.V.
March 2008

Contents

1 Introduction
  1.1 Overview
2 Parallel Disk Model (PDM)
  2.1 PDM and Problem Parameters
  2.2 Practical Modeling Considerations
  2.3 Related Models, Hierarchical Memory, and Cache-Oblivious Algorithms
3 Fundamental I/O Operations and Bounds
4 Exploiting Locality and Load Balancing
  4.1 Locality Issues with a Single Disk
  4.2 Disk Striping and Parallelism with Multiple Disks
5 External Sorting and Related Problems
  5.1 Sorting by Distribution
  5.2 Sorting by Merging
  5.3 Prefetching, Caching, and Applications to Sorting
  5.4 A General Simulation for Parallel Disks
  5.5 Handling Duplicates: Bundle Sorting
  5.6 Permuting
  5.7 Fast Fourier Transform and Permutation Networks
6 Lower Bounds on I/O
  6.1 Permuting
  6.2 Lower Bounds for Sorting and Other Problems
7 Matrix and Grid Computations
  7.1 Matrix Operations
  7.2 Matrix Transposition
8 Batched Problems in Computational Geometry
  8.1 Distribution Sweep
  8.2 Other Batched Geometric Problems
9 Batched Problems on Graphs
  9.1 Sparsification
  9.2 Special Cases
  9.3 Sequential Simulation of Parallel Algorithms
10 External Hashing for Online Dictionary Search
  10.1 Extendible Hashing
  10.2 Directoryless Methods
  10.3 Additional Perspectives
11 Multiway Tree Data Structures
  11.1 B-trees and Variants
  11.2 Weight-Balanced B-trees
  11.3 Parent Pointers and Level-Balanced B-trees
  11.4 Buffer Trees
12 Spatial Data Structures and Range Search
  12.1 Linear-Space Spatial Structures
  12.2 R-trees
  12.3 Bootstrapping for 2-D Diagonal Corner and Stabbing Queries
  12.4 Bootstrapping for Three-Sided Orthogonal 2-D Range Search
  12.5 General Orthogonal 2-D Range Search
  12.6 Other Types of Range Search
  12.7 Lower Bounds for Orthogonal Range Search
13 Dynamic and Kinetic Data Structures
  13.1 Dynamic Methods for Decomposable Search Problems
  13.2 Continuously Moving Items
14 String Processing
  14.1 Inverted Files
  14.2 String B-Trees
  14.3 Suffix Trees and Suffix Arrays
  14.4 Sorting Strings
15 Compressed Data Structures
  15.1 Data Representations and Compression Models
  15.2 External Memory Compressed Data Structures
16 Dynamic Memory Allocation
17 External Memory Programming Environments
Conclusions
Notations and Acronyms
References

1 Introduction

The world is drowning in data! In recent years, we have been deluged by a torrent of data from a variety of increasingly data-intensive applications, including databases, scientific computations, graphics, entertainment, multimedia, sensors, web applications, and email. NASA's Earth Observing System project, the core part of the Earth Science Enterprise (formerly Mission to Planet Earth), produces petabytes (10^15 bytes) of raster data per year [148]. A petabyte corresponds roughly to the amount of information in one billion graphically formatted books. The online databases of satellite images used by Microsoft TerraServer (part of MSN Virtual Earth) [325] and Google Earth [180] are multiple terabytes (10^12 bytes) in size. Wal-Mart's sales data warehouse contains over a half petabyte (500 terabytes) of data. A major challenge is to develop mechanisms for processing the data, or else much of the data will be useless.

For reasons of economy, general-purpose computer systems usually contain a hierarchy of memory levels, each level with its own cost and performance characteristics. At the lowest level, CPU registers and caches are built with the fastest but most expensive memory. For internal main memory, dynamic random access memory (DRAM) is typical. At a higher level, inexpensive but slower magnetic disks are used for external mass storage, and even slower but larger-capacity devices such as tapes and optical disks are used for archival storage. These devices can be attached via a network fabric (e.g., Fibre Channel or iSCSI) to provide substantial external storage capacity. Figure 1.1 depicts a typical memory hierarchy and its characteristics.

Fig. 1.1 The memory hierarchy of a typical uniprocessor system, including registers, instruction cache, data cache (level 1 cache), level 2 cache, internal memory, and disks. Some systems have in addition a level 3 cache, not shown here. Memory access latency ranges from less than one nanosecond (ns, 10^-9 seconds) for registers and level 1 cache to several milliseconds (ms, 10^-3 seconds) for disks. Typical memory sizes for each level of the hierarchy are shown at the bottom. Each value of B listed at the top of the figure denotes a typical block transfer size between two adjacent levels of the hierarchy. All sizes are given in units of bytes (B), kilobytes (KB, 10^3 B), megabytes (MB, 10^6 B), gigabytes (GB, 10^9 B), and petabytes (PB, 10^15 B). (In the PDM model defined in Chapter 2, we measure the block size B in units of items rather than in units of bytes.) In this figure, 8 KB is the indicated physical block transfer size between internal memory and the disks. However, in batched applications we often use a substantially larger logical block transfer size.

Most modern programming languages are based upon a programming model in which memory consists of one uniform address space. The notion of virtual memory allows the address space to be far larger than what can fit in the internal memory of the computer. Programmers have a natural tendency to assume that all memory references require the same access time. In many cases, such an assumption is reasonable (or at least does not do harm), especially when the data sets are not large. The utility and elegance of this programming model are to a large extent why it has flourished, contributing to the productivity of the software industry.

However, not all memory references are created equal.
Large address spaces span multiple levels of the memory hierarchy, and accessing the data in the lowest levels of memory is orders of magnitude faster than accessing the data at the higher levels. For example, loading a register can take a fraction of a nanosecond (10^-9 seconds), and accessing internal memory takes several nanoseconds, but the latency of accessing data on a disk is multiple milliseconds (10^-3 seconds), which is about one million times slower! In applications that process massive amounts of data, the Input/Output communication (or simply I/O) between levels of memory is often the bottleneck.

Many computer programs exhibit some degree of locality in their pattern of memory references: Certain data are referenced repeatedly for a while, and then the program shifts attention to other sets of data. Modern operating systems take advantage of such access patterns by tracking the program's so-called "working set" — a vague notion that roughly corresponds to the recently referenced data items [139]. If the working set is small, it can be cached in high-speed memory so that access to it is fast. Caching and prefetching heuristics have been developed to reduce the number of occurrences of a "fault," in which the referenced data item is not in the cache and must be retrieved by an I/O from a higher level of memory. For example, in a page fault, an I/O is needed to retrieve a disk page from disk and bring it into internal memory.

Caching and prefetching methods are typically designed to be general-purpose, and thus they cannot be expected to take full advantage of the locality present in every computation. Some computations themselves are inherently nonlocal, and even with omniscient cache management decisions they are doomed to perform large amounts of I/O and suffer poor performance.
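The payoff of locality for a cache can be seen in a tiny simulation. The sketch below (illustrative Python; the trace shapes and cache size are hypothetical, not drawn from any measured system) counts faults under least-recently-used (LRU) replacement for a trace whose working set shifts slowly versus one with no reuse at all:

```python
from collections import OrderedDict

def count_faults(accesses, cache_size):
    """Count cache faults for an access trace under LRU replacement."""
    cache = OrderedDict()  # cached items, ordered from least to most recent
    faults = 0
    for item in accesses:
        if item in cache:
            cache.move_to_end(item)        # hit: refresh recency
        else:
            faults += 1                    # fault: item must be fetched via an I/O
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used item
            cache[item] = True
    return faults

# Strong locality: items 0..99, each referenced 100 times in a row.
local_trace = [i // 100 for i in range(10_000)]
# No locality: every reference touches a brand-new item.
nonlocal_trace = list(range(10_000))

assert count_faults(local_trace, cache_size=8) == 100        # one fault per working-set shift
assert count_faults(nonlocal_trace, cache_size=8) == 10_000  # every access faults
```

With strong locality the cache absorbs 99% of the references; with none, even an omniscient policy would fault on every access, matching the point made above.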
Substantial gains in performance may be possible by incorporating locality directly into the algorithm design and by explicit management of the contents of each level of the memory hierarchy, thereby bypassing the virtual memory system.

We refer to algorithms and data structures that explicitly manage data placement and movement as external memory (or EM) algorithms and data structures. Some authors use the terms I/O algorithms or out-of-core algorithms. We concentrate in this manuscript on the I/O communication between the random access internal memory and the magnetic disk external memory, where the relative difference in access speeds is most apparent. We therefore use the term I/O to designate the communication between the internal memory and the disks.

1.1 Overview

In this manuscript, we survey several paradigms for exploiting locality and thereby reducing I/O costs when solving problems in external memory. The problems we consider fall into two general categories:

(1) Batched problems, in which no preprocessing is done and the entire file of data items must be processed, often by streaming the data through the internal memory in one or more passes.

(2) Online problems, in which computation is done in response to a continuous series of query operations. A common technique for online problems is to organize the data items via a hierarchical index, so that only a very small portion of the data needs to be examined in response to each query. The data being queried can be either static, which can be preprocessed for efficient query processing, or dynamic, where the queries are intermixed with updates such as insertions and deletions.

We base our approach upon the parallel disk model (PDM) described in the next chapter. PDM provides an elegant and reasonably accurate model for analyzing the relative performance of EM algorithms and data structures.
The three main performance measures of PDM are the number of (parallel) I/O operations, the disk space usage, and the (parallel) CPU time. For reasons of brevity, we focus on the first two measures. Most of the algorithms we consider are also efficient in terms of CPU time. In Chapter 3, we list four fundamental I/O bounds that pertain to most of the problems considered in this manuscript. In Chapter 4, we show why it is crucial for EM algorithms to exploit locality, and we discuss an automatic load balancing technique called disk striping for using multiple disks in parallel.

Our general goal is to design optimal algorithms and data structures, by which we mean that their performance measures are within a constant factor of the optimum or best possible.¹ In Chapter 5, we look at the canonical batched EM problem of external sorting and the related problems of permuting and fast Fourier transform. The two important paradigms of distribution and merging — as well as the notion of duality that relates the two — account for all well-known external sorting algorithms. Sorting with a single disk is now well understood, so we concentrate on the more challenging task of using multiple (or parallel) disks, for which disk striping is not optimal. The challenge is to guarantee that the data in each I/O are spread evenly across the disks so that the disks can be used simultaneously. In Chapter 6, we cover the fundamental lower bounds on the number of I/Os needed to perform sorting and related batched problems. In Chapter 7, we discuss grid and linear algebra batched computations.

For most problems, parallel disks can be utilized effectively by means of disk striping or the parallel disk techniques of Chapter 5, and hence we restrict ourselves starting in Chapter 8 to the conceptually simpler single-disk case. In Chapter 8, we mention several effective paradigms for batched EM problems in computational geometry.
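The sorting bound covered in Chapters 5 and 6 — Θ((n/D) log_m n) parallel I/Os, where n = N/B and m = M/B denote the problem and memory sizes in units of blocks — is easy to evaluate for concrete parameters. The sketch below (the function name and parameter values are illustrative, and only the leading term is computed, ignoring constant factors and lower-order terms):

```python
import math

def sort_io_bound(N, M, B, D=1):
    """Leading term of the PDM sorting bound: (n/D) * ceil(log_m n) parallel I/Os,
    where n = N/B and m = M/B are the problem and memory sizes in blocks."""
    n, m = N // B, M // B
    return (n // D) * math.ceil(math.log(n, m))

# Example: 2^33 items with 2^27 items of internal memory and 2^10 items per block.
N, M, B = 8 * 2**30, 2**27, 2**10
print(sort_io_bound(N, M, B))  # → 16777216, i.e., about two passes over the n = 2^23 blocks
```

Because the logarithm's base m is so large, the number of passes over the data is tiny even for huge inputs — here log_m n = 23/17, which rounds up to two passes.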
The paradigms include distribution sweep (for spatial join and finding all nearest neighbors), persistent B-trees (for batched point location and visibility), batched filtering (for 3-D convex hulls and batched point location), external fractional cascading (for red-blue line segment intersection), external marriage-before-conquest (for output-sensitive convex hulls), and randomized incremental construction with gradations (for line segment intersections and other geometric problems). In Chapter 9, we look at EM algorithms for combinatorial problems on graphs, such as list ranking, connected components, topological sorting, and finding shortest paths. One technique for constructing I/O-efficient EM algorithms is to simulate parallel algorithms; sorting is used between parallel steps in order to reblock the data for the simulation of the next parallel step.

¹ In this manuscript we generally use the term "optimum" to denote the absolute best possible and the term "optimal" to mean within a constant factor of the optimum.

In Chapters 10–12, we consider data structures in the online setting. The dynamic dictionary operations of insert, delete, and lookup can be implemented by the well-known method of hashing. In Chapter 10, we examine hashing in external memory, in which extra care must be taken to pack data into blocks and to allow the number of items to vary dynamically. Lookups can be done generally with only one or two I/Os. Chapter 11 begins with a discussion of B-trees, the most widely used online EM data structure for dictionary operations and one-dimensional range queries. Weight-balanced B-trees provide a uniform mechanism for dynamically rebuilding substructures and are useful for a variety of online data structures. Level-balanced B-trees permit maintenance of parent pointers and support cut and concatenate operations, which are used in reachability queries on monotone subdivisions.
The buffer tree is a so-called "batched dynamic" version of the B-tree for efficient implementation of search trees and priority queues in EM sweep line applications. In Chapter 12, we discuss spatial data structures for multidimensional data, especially those that support online range search. Multidimensional extensions of the B-tree, such as the popular R-tree and its variants, use a linear amount of disk space and often perform well in practice, although their worst-case performance is poor. A nonlinear amount of disk space is required to perform 2-D orthogonal range queries efficiently in the worst case, but several important special cases of range searching can be done efficiently using only linear space. A useful design paradigm for EM data structures is to "externalize" an efficient data structure designed for internal memory; a key component of how to make the structure I/O-efficient is to "bootstrap" a static EM data structure for small-sized problems into a fully dynamic data structure of arbitrary size. This paradigm provides optimal linear-space EM data structures for several variants of 2-D orthogonal range search.

In Chapter 13, we discuss some additional EM approaches useful for dynamic data structures, and we also investigate kinetic data structures, in which the data items are moving. In Chapter 14, we focus on EM data structures for manipulating and searching text strings. In many applications, especially those that operate on text strings, the data are highly compressible. Chapter 15 discusses ways to develop data structures that are themselves compressed, but still fast to query.

Table 1.1 Paradigms for I/O efficiency discussed in this manuscript.
Paradigm                           Section
Batched dynamic processing         11.4
Batched filtering                  8
Batched incremental construction   8
Bootstrapping                      12
Buffer trees                       11.4
B-trees                            11, 12
Compression                        15
Decomposable search                13.1
Disk striping                      4.2
Distribution                       5.1
Distribution sweeping              8
Duality                            5.3
External hashing                   10
Externalization                    12.3
Fractional cascading               8
Filtering                          12
Lazy updating                      11.4
Load balancing                     4
Locality                           4.1
Marriage before conquest           8
Merging                            5.2
Parallel block transfer            4.2
Parallel simulation                9
Persistence                        11.1
Random sampling                    5.1
R-trees                            12.2
Scanning (or streaming)            2.2
Sparsification                     9
Time-forward processing            11.4

In Chapter 16, we discuss EM algorithms that adapt optimally to dynamically changing internal memory allocations. In Chapter 17, we discuss programming environments and tools that facilitate high-level development of efficient EM algorithms. We focus primarily on the TPIE system (Transparent Parallel I/O Environment), which we use in the various timing experiments in this manuscript. We conclude with some final remarks and observations in the Conclusions. Table 1.1 lists several of the EM paradigms discussed in this manuscript.

2 Parallel Disk Model (PDM)

When a data set is too large to fit in internal memory, it is typically stored in external memory (EM) on one or more magnetic disks. EM algorithms explicitly control data placement and transfer, and thus it is important for algorithm designers to have a simple but reasonably accurate model of the memory system's characteristics.

A magnetic disk consists of one or more platters rotating at constant speed, with one read/write head per platter surface, as shown in Figure 2.1. The surfaces of the platters are covered with a magnetizable material capable of storing data in nonvolatile fashion. The read/write heads are held by arms that move in unison. When the arms are stationary, each read/write head traces out a concentric circle on its platter called a track.
The vertically aligned tracks that correspond to a given arm position are called a cylinder. For engineering reasons, data to and from a given disk are typically transmitted using only one read/write head (i.e., only one track) at a time. Disks use a buffer for caching and staging data for I/O transfer to and from internal memory.

To store or retrieve a data item at a certain address on disk, the read/write heads must mechanically seek to the correct cylinder and then wait for the desired data to pass by on a particular track.

Fig. 2.1 Magnetic disk drive: (a) Data are stored on magnetized platters that rotate at a constant speed. Each platter surface is accessed by an arm that contains a read/write head, and data are stored on the platter in concentric circles called tracks. (b) The arms are physically connected so that they move in unison. The tracks (one per platter) that are addressable when the arms are in a fixed position are collectively referred to as a cylinder.

The seek time to move from one random cylinder to another is often on the order of 3 to 10 milliseconds, and the average rotational latency, which is the time for half a revolution, has the same order of magnitude. Seek time can be avoided if the next access is on the current cylinder. The latency for accessing data, which is primarily a combination of seek time and rotational latency, is typically on the order of several milliseconds. In contrast, it can take less than one nanosecond to access CPU registers and cache memory — more than one million times faster than disk access! Once the read/write head is positioned at the desired data location, subsequent bytes of data can be stored or retrieved as fast as the disk rotates, which might correspond to over 100 megabytes per second. We can thus amortize the relatively long initial delay by transferring a large contiguous group of data items at a time.
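The amortization argument can be made concrete with a back-of-envelope sketch. Assuming (hypothetically, but in line with the figures above) an 8 ms combined seek-plus-rotational latency and a 100 MB/s peak transfer rate, the effective bandwidth of random block accesses grows sharply with the block size:

```python
def effective_bandwidth_mb_s(block_kb, latency_ms=8.0, peak_mb_s=100.0):
    """Effective transfer rate when each random I/O pays the access latency
    (latency_ms) before streaming block_kb kilobytes at the peak rate."""
    transfer_ms = block_kb / 1024 / peak_mb_s * 1000  # time to stream the block
    total_ms = latency_ms + transfer_ms
    return (block_kb / 1024) / (total_ms / 1000)      # MB delivered per second

for kb in (4, 64, 1024, 16 * 1024):
    print(f"{kb:>6} KB blocks: {effective_bandwidth_mb_s(kb):6.1f} MB/s")
# roughly 0.5, 7.2, 55.6, and 95.2 MB/s respectively
```

Tiny 4 KB transfers waste over 99% of the time on latency, while multi-megabyte transfers approach the assumed peak rate — which is why batched applications favor large logical blocks.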
We use the term block to refer to the amount of data transferred to or from one disk in a single I/O operation. Block sizes are typically on the order of several kilobytes and are often larger for batched applications. Other levels of the memory hierarchy have similar latency issues and as a result also use block transfer. Figure 1.1 depicts typical memory sizes and block sizes for various levels of memory. Because I/O is done in units of blocks, algorithms can run considerably faster when the pattern of memory accesses exhibits locality of reference as opposed to a uniformly random distribution. However, even if an application can structure its pattern of memory accesses and exploit locality, there is still a substantial access gap between internal and external memory performance. In fact, the access gap is growing, since the latency and bandwidth of memory chips are improving more quickly than those of disks. Use of parallel processors (or multicores) further widens the gap. As a result, storage systems such as RAID deploy multiple disks that can be accessed in parallel in order to get additional bandwidth [101, 194].

In the next section, we describe the high-level parallel disk model (PDM), which we use throughout this manuscript for the design and analysis of EM algorithms and data structures. In Section 2.2, we consider some practical modeling issues dealing with the sizes of blocks and tracks and the corresponding parameter values in PDM. In Section 2.3, we review the historical development of models of I/O and hierarchical memory.

2.1 PDM and Problem Parameters

We can capture the main properties of magnetic disks and multiple disk systems by the commonly used parallel disk model (PDM) introduced by Vitter and Shriver [345]. The two key mechanisms for efficient algorithm design in PDM are locality of reference (which takes advantage of block transfer) and parallel disk access (which takes advantage of multiple disks).
In a single I/O, each of the D disks can simultaneously transfer a block of B contiguous data items. PDM uses the following main parameters:

N = problem size (in units of data items);
M = internal memory size (in units of data items);
B = block transfer size (in units of data items);
D = number of independent disk drives;
P = number of CPUs, where M

5.1 Sorting by Distribution

For any parameters ε, δ > 0, assuming that m ≥ D(ln 2 + δ)/ε + 3D, the average number of I/Os performed by RCD is

(2 + ε + O(e^(−δD))) (n/D) ⌈log_{m−3D−D(ln 2+δ)/ε}(n/m)⌉ + 2n/D + o(n/D).   (5.2)

When D = o(m), for any desired constant 0 < α < 1, we can choose ε and δ appropriately to bound (5.2) as follows with a constant of proportionality of 2:

∼ 2 (n/D) ⌈log_{αm} n⌉.   (5.3)

The only differences between (5.3) and the leading term of the lower bound we derive in Chapter 6 are the presence of the ceiling around the logarithm term and the fact that the base of the logarithm is arbitrarily close to m but not exactly m. RCD operates very fast in practice. Figure 5.2 shows a typical simulation [342] that indicates that RCD operates with small buffer memory requirements; the layout discipline associated with the SRM method discussed in Section 5.2.1 performs similarly. Randomized cycling distribution sort and the related merge sort algorithms discussed in Sections 5.2.1 and 5.3.4 are the methods of
Fig.
5.2 Simulated distribution of memory usage during a distribution pass with n = 2 × 10^6, D = 10, S = 50, ε = 0.1 for four methods: RCD (randomized cycling distribution), SRD (simple randomized distribution — striping with a random starting disk), RSD (randomized striping distribution — striping with a random starting disk for each stripe), and FRD (fully randomized distribution — each bucket is independently and randomly assigned to a disk). For these parameters, the performance of RCD and SRD are virtually identical.

choice for sorting with parallel disks. Distribution sort algorithms may have an advantage over the merge approaches presented in Section 5.2 in that they typically make better use of lower levels of cache in the memory hierarchy of real systems, based upon analysis of distribution sort and merge sort algorithms on models of hierarchical memory, such as the RUMH model of Vitter and Nodine [344]. On the other hand, the merge approaches can take advantage of replacement selection to start off with larger run sizes.

5.2 Sorting by Merging

The merge paradigm is somewhat orthogonal to the distribution paradigm of the previous section. A typical merge sort algorithm works as follows [220]: In the "run formation" phase, we scan the n blocks of data, one memoryload at a time; we sort each memoryload into a single "run," which we then output onto a series of stripes on the disks. At the end of the run formation phase, there are N/M = n/m (sorted) runs, each striped across the disks. In actual implementations, we can use the "replacement selection" technique to get runs of 2M data items, on the average, when M ≫ B [136, 220]. After the initial runs are formed, the merging phase begins, as shown in Figure 5.3 for the case D = 1. In each pass of the merging phase, we merge groups of R runs.
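The two phases just described can be sketched compactly. In the following minimal single-disk Python sketch (an illustration, not the book's implementation), runs are plain in-memory lists standing in for striped disk files, and M and R are small illustrative parameters:

```python
# Minimal single-disk external merge sort sketch: a run-formation
# phase that sorts one memoryload (M items) at a time, followed by
# merge passes that each merge groups of R runs, until one run remains.
import heapq

def form_runs(items, M):
    """Run formation: sort each memoryload of M items into a run."""
    return [sorted(items[i:i + M]) for i in range(0, len(items), M)]

def merge_pass(runs, R):
    """One merge pass: merge groups of R runs into longer runs."""
    return [list(heapq.merge(*runs[i:i + R])) for i in range(0, len(runs), R)]

def external_merge_sort(items, M=4, R=2):
    runs = form_runs(items, M)
    while len(runs) > 1:
        runs = merge_pass(runs, R)
    return runs[0] if runs else []

print(external_merge_sort([9, 3, 7, 1, 8, 2, 6, 5, 4, 0]))
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With N/M initial runs and merge order R, the number of merge passes is ⌈log_R(N/M)⌉, mirroring the O(log_m n) pass count discussed below.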
For each merge, we scan the R runs and merge the items in an online manner as they stream through internal memory.

Fig. 5.3 Schematic illustration of a merge during the course of a single-disk (D = 1) merge sort. (For simplicity, the input and output use separate disks.) Each of the R sorted runs on the disk on the right is streamed through internal memory and merged into a single sorted run that is output to the disk on the left. Each run has double buffers of total size at least 2B to allow the input from the runs to be overlapped with the output of the merged run.

Double buffering is used to overlap I/O and computation. At most R = Θ(m) runs can be merged at a time, and the resulting number of passes is O(log_m n). To achieve the optimal sorting bound (5.1), we must perform each merging pass in O(n/D) I/Os, which is easy to do for the single-disk case. In the more general multiple-disk case, each parallel input operation during the merging must on the average bring in the next Θ(D) blocks needed. The challenge is to ensure that those blocks reside on different disks so that they can be input in a single parallel I/O (or a small constant number of I/Os). The difficulty lies in the fact that the runs being merged were themselves formed during the previous merge pass. Their blocks were output to the disks in the previous pass without knowledge of how they would interact with other runs in later merges. For the binary merging case R = 2, we can devise a perfect solution, in which the next D blocks needed for the merge are guaranteed to be on distinct disks, based upon the Gilbreath principle [172, 220]: We stripe the first run into ascending order by disk number, and we stripe the other run into descending order.
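The guarantee can be checked empirically. The following Python harness (illustrative; it is not from the original text) stripes one run in ascending disk order and the other in descending order, merges them at block granularity, and verifies that every group of D consecutively consumed blocks lies on D distinct disks:

```python
# Simulation of the Gilbreath principle for binary merging (R = 2).
# One run is striped across the D disks in ascending disk order, the
# other in descending order; then at every I/O step boundary the next
# D blocks consumed by the merge lie on D distinct disks. Each block
# is represented by a single random key for simplicity.
import random

def gilbreath_check(D=4, blocks_per_run=40, trials=200):
    for _ in range(trials):
        run_a = sorted(random.random() for _ in range(blocks_per_run))
        run_b = sorted(random.random() for _ in range(blocks_per_run))
        disk_a = [i % D for i in range(blocks_per_run)]            # ascending
        disk_b = [(D - 1 - j) % D for j in range(blocks_per_run)]  # descending
        # Merge order of (key, disk) pairs across the two runs.
        merged = sorted([(k, d) for k, d in zip(run_a, disk_a)] +
                        [(k, d) for k, d in zip(run_b, disk_b)])
        # Check each window of D blocks starting at an I/O step boundary.
        for t in range(0, len(merged) - D + 1, D):
            window = {d for _, d in merged[t:t + D]}
            assert len(window) == D, "collision: principle violated"
    return True

print(gilbreath_check())
# -> True
```

The reason the check never fails: at a step boundary, the window contains some number a of consecutive blocks from the ascending run and D − a consecutive blocks from the descending run, and the two groups cover complementary residues modulo D.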
Regardless of how the items in the two runs interleave during the merge, it is always the case that we can access the next D blocks needed for the output via a single I/O operation, and thus the amount of internal memory buffer space needed for binary merging is minimized. Unfortunately there is no analogue to the Gilbreath principle for R > 2, and as we have seen above, we need the value of R to be large in order to get an optimal sorting algorithm. The Greed Sort method of Nodine and Vitter [274] was the first optimal deterministic EM algorithm for sorting with multiple disks. It handles the case R > 2 by relaxing the condition on the merging process. In each step, two blocks from each disk are brought into internal memory: the block b1 with the smallest data item value and the block b2 whose largest item value is smallest. If b1 = b2, only one block is input into memory, and it is added to the next output stripe. Otherwise, the two blocks b1 and b2 are merged in memory; the smaller B items are added to the output stripe, and the remaining B items are output back to the disks. The resulting run that is produced is only an "approximately" merged run, but its saving grace is that no two inverted items are very far apart. A final application of Columnsort [232] suffices to restore total order; partial striping is employed to meet the memory constraints. One disadvantage of Greed Sort is that the input and output I/O operations involve independent disks and are not done in a striped manner, thus making it difficult to write parity information for error correction and recovery. Chaudhry and Cormen [97] show experimentally that oblivious algorithms such as Columnsort work well in the context of cluster-based sorting. Aggarwal and Plaxton [22] developed an optimal deterministic merge sort based upon the Sharesort hypercube parallel sorting algorithm [126].
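The core Greed Sort step described above can be sketched for a single disk's blocks. In this illustrative Python sketch (the full multi-disk machinery, striping, and the Columnsort fix-up are omitted), blocks are sorted lists of B items:

```python
# One Greed Sort merging step on a single disk's blocks: pick b1, the
# block with the smallest item, and b2, the block whose largest item
# is smallest; merge them, emit the smaller B items to the output
# stripe, and return the larger B items to the disk.

def greed_step(disk_blocks):
    """disk_blocks: list of sorted lists (blocks). Returns (emitted, disk_blocks)."""
    b1 = min(disk_blocks, key=lambda b: b[0])   # smallest first item
    b2 = min(disk_blocks, key=lambda b: b[-1])  # smallest last item
    if b1 is b2:
        disk_blocks.remove(b1)                  # b1 = b2: emit it whole
        return b1, disk_blocks
    B = len(b1)
    disk_blocks.remove(b1)
    disk_blocks.remove(b2)
    merged = sorted(b1 + b2)
    disk_blocks.append(merged[B:])              # larger B items back to disk
    return merged[:B], disk_blocks              # smaller B items to output

blocks = [[1, 9], [2, 3], [7, 8]]
emitted, rest = greed_step(blocks)
print(emitted, rest)
# -> [1, 2] [[7, 8], [3, 9]]
```

Repeating the step drains the disk into an approximately sorted run, in which no two inverted items are far apart, exactly the property the final Columnsort pass exploits.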
To guarantee even distribution during the merging, it employs two high-level merging schemes in which the scheduling is almost oblivious. Like Greed Sort, the Sharesort algorithm is theoretically optimal (i.e., within a constant factor of the optimum), but the constant factor is larger than for the distribution sort methods.

5.2.1 Simple Randomized Merge Sort

One approach to merge sort is to stripe each run across the disks and use the disk striping technique of Section 4.2. However, disk striping devotes too much internal memory (namely, 2RD blocks) to cache blocks not yet merged, and thus the effective order of the merge is reduced to R = Θ(m/D) (cf. (4.2)), which gives a nonoptimal result. A better approach is the simple randomized merge sort (SRM) algorithm of Barve et al. [68, 72], referred to as "randomized striping" by Knuth [220]. It uses much less space in internal memory for caching blocks and thus allows R to be much larger. Each run is striped across the disks, but with a random starting point (the only place in the algorithm where randomness is utilized). During the merging process, the next block needed from each disk is input into memory, and if there is not enough room, the least needed blocks are "flushed" back to disk (without any I/Os required) to free up space. Barve et al. [68] derive an asymptotic upper bound on the expected I/O performance, with no assumptions about the original distribution of items in the file. A more precise analysis, which is related to the so-called cyclic occupancy problem, is an interesting open problem. The cyclic occupancy problem is similar to the classical occupancy problem we discussed in Section 5.1 in that there are b balls distributed into d bins. However, in the cyclic occupancy problem, the b balls are grouped into c chains of length b1, b2, ..., bc, where ∑_{1≤i≤c} bi = b.
Only the head of each chain is randomly inserted into a bin; the remaining balls of the chain are inserted into the successive bins in a cyclic manner (hence the name "cyclic occupancy"). We conjecture that the expected maximum bin size in the cyclic occupancy problem is at most that of the classical occupancy problem [68, 220, problem 5.4.9–28]. The bound has been established so far only in an asymptotic sense [68]. The expected performance of SRM is not optimal for some parameter values, but it significantly outperforms the use of disk striping for reasonable values of the parameters. Barve and Vitter [72] give experimental confirmation of the speedup with six fast disk drives and a 500 megahertz CPU, as shown in Table 5.1. When the internal memory is large enough to provide Θ(D log D) blocks of cache space and 3D blocks for output buffers, SRM provably achieves the optimal I/O bound (5.1). For any parameter ε > 0, assuming that m ≥ D(log D)/ε² + 3D, the average number of I/Os performed by SRM is

(2 + ε) (n/D) ⌈log_{m−3D−D(log D)/ε²}(n/m)⌉ + 2n/D + o(n/D).   (5.4)

When D = o(m/log m), for any desired constant 0 < α < 1, we can choose ε to bound (5.4) as follows with a constant of proportionality of 2:

∼ 2 (n/D) ⌈log_{αm} n⌉.   (5.5)

Table 5.1 The ratio of the number of I/Os used by simple randomized merge sort (SRM) to the number of I/Os used by merge sort with disk striping, during a merge of kD runs, where kD ≈ M/2B. The figures were obtained by simulation.

         D = 5    D = 10    D = 50
k = 5     0.56     0.47      0.37
k = 10    0.61     0.52      0.40
k = 50    0.71     0.63      0.51

In the next section, we show how to get further improvements in merge sort by a more careful prefetch schedule for the runs, combined with the randomized cycling strategy discussed in Section 5.1.

5.3 Prefetching, Caching, and Applications to Sorting

In this section, we consider the problem of prefetch scheduling for parallel disks: We are given a sequence of blocks Σ = b1, b2, ..., bN.
The initial location of the blocks on the D disks is prespecified by an arbitrary function disk: Σ → {1, 2, ..., D}. That is, block bi is located on disk disk(bi). The object of prefetching is to schedule the fewest possible number of input I/O operations from the D disks so that the blocks can be read by an application program in the order given by Σ. When a block is given to the application, we say that it is "read" and can be removed from internal memory. (Note that "read" refers to the handover of the block from the scheduling algorithm to the application, whereas "input" refers to the prior I/O operation that brought the block into internal memory.) We use the m blocks of internal memory as prefetch buffers to store blocks that will soon be read. If the blocks in Σ are distinct, we call the prefetching problem read-once scheduling. If some blocks are repeated in Σ (i.e., they must be read at more than one time by the application in internal memory), then it may be desirable to cache blocks in the prefetch buffers in order to avoid re-inputting them later, and we call the problem read-many scheduling. One way to make more informed prefetching decisions is to use knowledge or prediction of future read requests, which we can informally call lookahead. Cao et al. [95], Kimbrel and Karlin [217], Barve et al. [69], Vitter and Krishnan [343], Curewitz et al. [125], Krishnan and Vitter [226], Kallahalla and Varman [204, 205], Hutchinson et al. [202], Albers and Büttner [28], Shah et al. [312, 313], and Hon et al. [198] have developed competitive and optimal methods for prefetching blocks in parallel I/O systems. We focus in the remainder of this section on prefetching with knowledge of future read requests. We use the general framework of Hutchinson et al.
[202], who demonstrate a powerful duality between prefetch scheduling and the following problem "in the reverse direction," which we call output scheduling: We are given a sequence Σ′ = b′1, b′2, ..., b′N of blocks that an application program produces (or "writes") in internal memory. A mapping disk: Σ′ → {1, 2, ..., D} specifies the desired final disk location for each block; that is, the target location for block b′i is disk disk(b′i). The goal of output scheduling is to construct the shortest schedule of parallel I/O output operations so that each block b′i is output to its proper target location disk disk(b′i). (Note that "write" refers to the handover of the block from the application to the scheduling algorithm, whereas "output" refers to the subsequent I/O operation that moves the block onto disk.) We use the m blocks of internal memory as output buffers to queue the blocks that have been written but not yet output to the disks. If the blocks in Σ′ are distinct, we call the output scheduling problem write-once scheduling. If some blocks are repeated in Σ′, we call the problem write-many scheduling. If we are able to keep a block in an output buffer long enough until it is written again by the application program, then we need to output the block at most once rather than twice. The output scheduling problem is generally easy to solve optimally. In Sections 5.3.2–5.3.4, we exploit a duality [202] of the output scheduling problems of write-once scheduling, write-many scheduling, and distribution in order to derive optimal algorithms for the dual prefetch problems of read-once scheduling, read-many scheduling, and merging. Figure 5.4 illustrates the duality and information flow for prefetch and output scheduling.
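The "easy" greedy solution to write-once output scheduling can be sketched directly. In the following illustrative Python sketch (not the authors' implementation; buffer bookkeeping is simplified to one block per buffer slot), blocks arrive in write order, each is queued for its target disk, and each parallel I/O step outputs one queued block per nonempty disk queue:

```python
# Greedy write-once output scheduling: accept writes while buffer
# space remains, then perform one parallel I/O step that outputs a
# block to every disk with a nonempty queue.
from collections import deque

def greedy_output_schedule(sigma, disk, D, m):
    """Return the parallel output steps for write sequence `sigma`.

    sigma : blocks in the order the application writes them
    disk  : dict mapping block -> target disk in {0, ..., D-1}
    m     : number of one-block output buffers
    """
    queues = [deque() for _ in range(D)]
    buffered, steps, i = 0, [], 0
    while i < len(sigma) or buffered > 0:
        # Accept writes while output-buffer space remains.
        while i < len(sigma) and buffered < m:
            queues[disk[sigma[i]]].append(sigma[i])
            buffered += 1
            i += 1
        # One parallel I/O: output a block to every nonempty disk queue.
        step = [q.popleft() for q in queues if q]
        buffered -= len(step)
        steps.append(step)
    return steps

sigma = list("abcdef")
disk = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 0, "f": 1}
print(greedy_output_schedule(sigma, disk, D=3, m=4))
```

Each step outputs at most one block per disk, and writing stalls exactly when all m buffers are full, which is why the resulting schedule length is optimum for the write-once case.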
Fig. 5.4 Duality between prefetch scheduling and output scheduling. The prefetch scheduling problem for sequence Σ proceeds from bottom to top. Blocks are input from the disks, stored in the prefetch buffers, and ultimately read by the application program in the order specified by the sequence Σ. The output scheduling problem for the reverse sequence ΣR proceeds from top to bottom. Blocks are written by the application program in the order specified by ΣR, queued in the output buffers, and ultimately output to the disks. The hatched blocks illustrate how the blocks of disk 2 might be distributed.

5.3.1 Greedy Read-Once Scheduling

Before we discuss an optimum prefetching algorithm for read-once scheduling, we shall first look at the following natural approach adopted by SRM [68, 72] in Section 5.2.1, which unfortunately does not achieve the optimum schedule length. It uses a greedy approach: Suppose that blocks b1, b2, ..., bi of the sequence Σ have already been read in prior steps and are thus removed from the prefetch buffers. The current step consists of reading the next blocks of Σ that are already in the prefetch buffers. That is, suppose blocks bi+1, bi+2, ..., bj are in the prefetch buffers, but bj+1 is still on a disk. Then blocks bi+1, bi+2, ..., bj are read and removed from the prefetch buffers. The second part of the current step involves input from the disks. For each of the D disks, consider its highest priority block not yet input,
Fig.
5.5 Greedy prefetch schedule for sequence Σ = a, b, c, ..., r of n = 18 blocks, read using D = 3 disks and m = 6 prefetch buffers. The shading of each block indicates the disk it belongs on. The schedule uses T = 9 I/O steps, which is one step more than optimum.

according to the order specified by Σ. We use P to denote the set of such blocks. (It necessarily follows that bj+1 must be in P.) Let Res be the blocks already resident in the prefetch buffer, and let Res′ be the m highest priority blocks in P ∪ Res. We input the blocks of Res′ ∩ P (i.e., the blocks of Res′ that are not already in the prefetch buffer). At the end of the I/O step, the prefetch buffer contains the blocks of Res′. If a block b in Res does not remain in Res′, then the algorithm has effectively "kicked out" b from the prefetch buffers, and it will have to be re-input in a later I/O step. An alternative greedy policy is to not evict records from the prefetch buffers; instead we input the m − |Res| highest priority blocks of P. Figure 5.5 shows an example of the greedy read schedule for the case of m = 6 prefetch buffers and D = 3 disks. (Both greedy versions described above give the same result in the example.) An optimum I/O schedule has T = 8 I/O steps, but the greedy schedule uses a nonoptimum T = 9 I/O steps. Records g and h are input in I/O steps 2 and 3 even though they are not read until much later, and as a result they take up space in the prefetch buffers that prevents block l (and thus blocks o, p, q, and r) from being input earlier.

5.3.2 Prefetching via Duality: Read-Once Scheduling

Hutchinson et al. [202] noticed a natural correspondence between a prefetch schedule for a read-once sequence Σ and an output schedule for the write-once sequence ΣR, where ΣR denotes the sequence of blocks of Σ in reverse order.
In the case of the write-once problem, the following natural greedy output algorithm is optimum: As the blocks of ΣR are written, we put each block into an output buffer for its designated disk. There are m output buffers, each capable of storing one block, so the writing can proceed only as quickly as space is freed up in the write buffers. In each parallel I/O step, we free up space by outputting a queued block to each disk that has at least one block in an output buffer. The schedule of I/O output steps is called the output schedule, and it is easy to see that it is optimum in terms of the number of parallel I/O steps required. Figure 5.6, when read right to left, shows the output schedule for the reverse sequence ΣR. When we run any output schedule for ΣR in reverse, we get a valid prefetch schedule for the original sequence Σ, and vice versa. Therefore, an optimum output schedule for ΣR, which is easy to compute in a greedy fashion, can be reversed to get an optimum prefetch schedule for Σ. Figure 5.6, when considered from right to left, shows an optimum output schedule for sequence ΣR. When looked at left to right, it shows an optimum prefetch schedule for sequence Σ. The prefetch schedule of Figure 5.6 is "lazy" in the sense that the input I/Os seem to be artificially delayed. For example, could block e not have been input earlier than step 4? Hutchinson et al. [202] give a prudent prefetching algorithm that guarantees the same optimum prefetch schedule length, but performs I/Os earlier when possible. It works by redefining the priority of a block to be the order it appears in the input step order of the lazy schedule (i.e., a, b, f, c, i, ... in Figure 5.6). Prefetch buffers are "reserved" for blocks in the same order of priority. In the current I/O step, using the language of Section 5.3.1, we input every block in P that has a reserved prefetch buffer. Figure 5.7 shows the prudent version of the lazy schedule of Figure 5.6.
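The reversal construction can be sketched directly. In this illustrative Python sketch (again one block per buffer slot, with an arbitrary example disk assignment; it is not the authors' implementation), we compute the greedy output schedule for ΣR and reverse it to obtain the lazy prefetch schedule for Σ:

```python
# Duality construction: compute the greedy output schedule for the
# reversed sequence Sigma^R, then reverse that schedule to obtain an
# optimum (lazy) prefetch schedule for Sigma.
from collections import deque

def greedy_output(sigma, disk, D, m):
    """Greedy write-once output schedule (one parallel step per list)."""
    queues, buffered, steps, i = [deque() for _ in range(D)], 0, [], 0
    while i < len(sigma) or buffered > 0:
        while i < len(sigma) and buffered < m:       # accept writes
            queues[disk[sigma[i]]].append(sigma[i])
            buffered += 1
            i += 1
        step = [q.popleft() for q in queues if q]    # one parallel output
        buffered -= len(step)
        steps.append(step)
    return steps

def lazy_prefetch(sigma, disk, D, m):
    """Optimum prefetch schedule for sigma via the write-once dual."""
    out = greedy_output(sigma[::-1], disk, D, m)
    return out[::-1]          # reversed output steps become input steps

sigma = list("abcdefgh")
disk = {b: i % 3 for i, b in enumerate(sigma)}
for t, step in enumerate(lazy_prefetch(sigma, disk, D=3, m=4), 1):
    print(f"input step {t}: {step}")
```

Each output step of the dual becomes an input step of the prefetch schedule, so the two schedules have identical length, which is exactly why the greedy output optimum carries over.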
In fact, if we iterate the prudent algorithm, using as the block priorities the input step order of the prudent schedule of Figure 5.7, we get an even more prudent schedule, in which the e and n blocks move up one more I/O step than in Figure 5.7. Further iteration in this particular example will not yield additional improvement.

Fig. 5.6 Optimum dual schedules: read-once schedule for Σ via lazy prefetching and write-once schedule for ΣR via greedy output. The read sequence Σ = a, b, c, ..., r of n = 18 blocks and its reverse sequence ΣR are read/written using D = 3 disks and m = 6 prefetch/output buffers. The shading of each block indicates the disk it belongs on. The input steps are pictured left to right, and the output steps are pictured right to left. The schedule uses T = 8 I/O steps, which is optimum.

Fig. 5.7 Prudent prefetch schedule for the example in Figure 5.6. The blocks are prioritized using the input step order from the lazy prefetch schedule in Figure 5.6. The prefetch schedule remains optimum with T = 8 I/O steps, but some of the blocks are input earlier than in the lazy schedule.

To get an idea of how good optimum is, suppose that Σ is composed of S subsequences interlaced arbitrarily and that each of the S subsequences is striped in some manner on the disks.

Theorem 5.2 ([202]). If each of the S subsequences forming Σ is stored on the disks in a randomized cycling layout [342] (discussed in Section 5.1), the expected number of I/O steps for our optimum (lazy or prudent) prefetching method is

(1 + O(D/m)) n/D + min{S + n/D, O((m/D) log m)}.
(5.6)

If the internal memory size is large enough (i.e., m > S(D − 1)) to guarantee outputting D blocks per I/O at each output step, then for both striped layouts and randomized cycling layouts, the number of I/Os is

n/D + S.   (5.7)

The second term in (5.7) can be reduced further for a randomized cycling layout. In each of the expressions (5.6) and (5.7), the second of its two terms corresponds to the I/Os needed for initial loading of the prefetch buffers in the case of the lazy prefetch schedule (or equivalently, the I/Os needed to flush the output buffers in the case of the greedy output schedule).

5.3.3 Prefetching with Caching via Duality: Read-Many Scheduling

When any given block can appear multiple times in Σ, as in read-many scheduling, it can help to cache blocks in the prefetch buffers in order to avoid re-inputting them for later reads. For the corresponding write-many problem, we can construct an optimum output schedule via a modified greedy approach. In particular, at each output step and for each disk, we choose as the block to output to the disk the one whose next appearance in ΣR is furthest into the future — a criterion akin to the least-recently-used heuristic developed by Belady [78] for cache replacement. Intuitively, the "furthest" block will be of least use to keep in memory for a later write. Hutchinson et al. [202] show that the output schedule is provably optimum, and hence by duality, if we solve the write-many problem for ΣR, then when we run the output schedule in reverse, we get an optimum read-many prefetch schedule for Σ.

5.3.4 Duality between Merge Sort and Distribution Sort

In the case of merging R runs together, if the number R of runs is small enough so that we have RD block-sized prefetch buffers in internal memory, then it is easy to see that the merge can proceed in an optimum manner. However, this constraint limits the size of R, as in disk striping, which can be suboptimal for sorting.
The challenge is to make use of substantially fewer prefetch buffers so that we can increase R to be as large as possible. The larger R is, the faster we can do merge sort, or equivalently, the larger the files that we can sort in a given number of passes. We saw in Section 5.2.1 that Θ(D log D) prefetch buffers suffice for SRM to achieve optimal merge sort performance. A tempting approach is duality: We know from Section 5.1.3 that we need only Θ(D) output buffers to do a distribution pass if we lay out the buckets on the disks in a randomized cycling (RCD) pattern. If we can establish duality, then we can merge runs using Θ(D) prefetch buffers, assuming the runs are stored on the disks using randomized cycling. Figure 5.8 illustrates the duality between merging and distribution. However, one issue must first be resolved before we can legitimately apply duality. In each merge pass of merge sort, we merge R runs at a time into a single run. In order to apply duality, which deals with read and write sequences, we need to predetermine the read order Σ for the merge. That is, if we can specify the proper read order Σ of the blocks, then we can legitimately apply Theorem 5.2 to the write problem on ΣR. The solution to determining Σ is to partition internal memory so that not only does it consist of several prefetch buffers but it also includes R merging buffers, where R is the number of runs. Each merging buffer stores a (partially filled) block from a run that is participating in the merge.
Fig. 5.8 Duality between merging with R = 8 runs and distribution with S = 8 buckets, using D = 6 disks. The merge of the R runs proceeds from bottom to top. Blocks are input from the disks, stored in the prefetch buffers, and ultimately read into the merging buffers. The blocks of the merged run are moved to the output buffers and then output to the disks. The order in which blocks enter the merging buffers determines the sequence Σ, which can be predetermined by ordering the blocks based upon their smallest key values. The distribution into S buckets proceeds from top to bottom. Blocks are input from the disks into input buffers and moved to the partitioning buffers. The blocks of the resulting buckets are written in the order ΣR to the output buffers and then output to the appropriate disks.

We say that a block is read when it is moved from a prefetch buffer to a merging buffer, where it stays until its items are exhausted by the merging process. When a block expires, it is replaced by the next block in the read sequence Σ (unless Σ has expired) before the merging is allowed to resume. The first moment that a block absolutely must be read and moved to the merging buffer is when its smallest key value enters into the merging process. We therefore define the read priority of a block b to be its smallest key value.
We can sort the smallest key values (one per block) to form the read order Σ. Computing the read sequence Σ is fast to do because sorting N/B key values is a considerably smaller problem than sorting the entire file of N records. A subtle point is to show that this Σ ordering actually "works," namely, that at each step of the merging process, the item r with the smallest key value not yet in the merged run will be added next to the merged run. It may be, for example, that the R merging buffers contain multiple blocks from one run but none from another. However, at the time when item r should be added to the merged run, there can be at most one other nonempty merging buffer from each of the other R − 1 runs. Therefore, since there are R merging buffers and since the merging proceeds only when all R merging buffers are nonempty (unless Σ is expired), it will always be the case that the block containing r will be resident in one of the merging buffers before the merging proceeds. We need to use a third partition of internal memory to serve as output buffers so that we can output the merged run in a striped fashion to the D disks. Knuth [220, problem 5.4.9–26] has shown that we may need as many output buffers as prefetch buffers, but about 3D output buffers typically suffice. So the remaining m′ = m − R − 3D blocks of internal memory are used as prefetch buffers. We get an optimum merge schedule for read sequence Σ by computing the greedy output schedule for the reverse sequence ΣR. Figure 5.8 shows the flow through the various components in internal memory. When the runs are stored on the disks using randomized cycling, the length of the greedy output schedule corresponds to the performance of a distribution pass in RCD, which is optimal. We call the resulting merge sort randomized cycling merge sort (RCM).
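The predetermination of Σ can be sketched in a few lines. In this illustrative Python sketch (runs and blocks are small in-memory lists; block identities are (run, block-index) pairs introduced here for the example), each block contributes its smallest key, and sorting those keys yields the order in which blocks must enter the merging buffers:

```python
# Predetermining the read sequence Sigma for a merge: tag each block
# with its smallest key value (its read priority), then sort the
# one-key-per-block list. This sorts only N/B keys, a much smaller
# problem than sorting all N records.

def read_order(runs_as_blocks):
    """runs_as_blocks: list of runs, each run a list of sorted blocks.

    Returns Sigma as a list of (run index, block index) pairs.
    """
    tagged = [(block[0], r, i)          # (smallest key, run id, block id)
              for r, run in enumerate(runs_as_blocks)
              for i, block in enumerate(run)]
    tagged.sort()                       # sort the N/B read priorities
    return [(r, i) for _, r, i in tagged]

runs = [[[1, 4], [6, 9]],               # run 0: two blocks
        [[2, 3], [5, 7], [8, 10]]]      # run 1: three blocks
print(read_order(runs))
# -> [(0, 0), (1, 0), (1, 1), (0, 1), (1, 2)]
```

Note that blocks within each run stay in run order in Σ, since a run's smallest block keys are themselves nondecreasing.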
It has the identical I/O performance bound (5.2) as does RCD, except that each level of merging requires some extra overhead to fill the prefetch buffers to start the merge, corresponding to the additive terms in Theorem 5.2. For any parameters ε, δ > 0, assuming that m ≥ D(ln 2 + δ)/ε + 3D, the average number of I/Os for RCM is

(2 + ε + O(e^(−δD))) (n/D) ⌈log_{m−3D−D(ln 2+δ)/ε}(n/m)⌉ + 2n/D + min{n/D, O((log m)/ε)} + o(n/D).   (5.8)

When D = o(m) = o(n/log n), for any desired constant 0 < α < 1, we can choose ε and δ appropriately to bound (5.8) as follows with a constant of proportionality of 2:

∼ 2 (n/D) ⌈log_{αm} n⌉.   (5.9)

Dementiev and Sanders [136] show how to overlap computation effectively with I/O in the RCM method. We can apply the duality approach to other methods as well. For example, we could get a simple randomized distribution sort that is dual to the SRM method of Section 5.2.1.

5.4 A General Simulation for Parallel Disks

Sanders et al. [304] and Sanders [303] give an elegant randomized technique to simulate the Aggarwal–Vitter model of Section 2.3, in which D simultaneous block transfers are allowed regardless of where the blocks are located on the disks. On the average, the simulation realizes each I/O in the Aggarwal–Vitter model by only a constant number of I/Os in PDM. One property of the technique is that the input and output I/O steps use the disks independently. Armen [60] had earlier shown that deterministic simulations resulted in an increase in the number of I/Os by a multiplicative factor of log(N/D)/log log(N/D). The technique of Sanders et al. consists of duplicating each disk block and storing the two copies on two independently and uniformly chosen disks (chosen by a hash function). In terms of the occupancy model, each ball (block) is duplicated and stored in two random bins (disks). Let us consider the problem of retrieving a specific set of k = O(D) blocks from the disks.
For each block, there is a choice of two disks from which it can be input. Regardless of which k blocks are requested, Sanders et al. show that with high probability $\lceil k/D\rceil$ or $\lceil k/D\rceil + 1$ I/Os suffice to retrieve all k blocks. They also give a simple linear-time greedy algorithm that achieves optimal input schedules with high probability. A natural application of this technique is to the layout of data on multimedia servers in order to support multiple stream requests, as in video on demand. When outputting blocks of data to the disks, each block must be output to both disks where a copy is stored. Sanders et al. prove that with high probability D blocks can be output in O(1) I/O steps, assuming that there are O(D) blocks of internal buffer space to serve as output buffers. The I/O bounds can be improved with a corresponding tradeoff in redundancy and internal memory space.

5.5 Handling Duplicates: Bundle Sorting

Arge et al. [42] describe a single-disk merge sort algorithm for the problem of duplicate removal, in which there are a total of K distinct items among the N items. When duplicates get grouped together during a merge, they are replaced by a single copy of the item and a count of the occurrences. The algorithm runs in $O\bigl(n\max\{1,\log_m(K/B)\}\bigr)$ I/Os, which is optimal in the comparison model. The algorithm can be used to sort the file, assuming that a group of equal items can be represented by a single item and a count. A harder instance of sorting called bundle sorting arises when we have K distinct key values among the N items, but all the items have different secondary information that must be maintained, and therefore items cannot be aggregated with a count. Abello et al. [2] and Matias et al. [249] develop optimal distribution sort algorithms for bundle sorting using

$$\mathrm{BundleSort}(N,K) \;=\; O\bigl(n\cdot\max\{1,\,\log_m \min\{K,n\}\}\bigr) \qquad(5.10)$$

I/Os, and Matias et al. [249] prove a matching lower bound. Matias et al.
[249] also show how to do bundle sorting (and sorting in general) in place (i.e., without extra disk space). In distribution sort, for example, the blocks for the subfiles can be allocated from the blocks freed up from the file being partitioned; the disadvantage is that the blocks in the individual subfiles are no longer consecutive on the disk. The algorithms can be adapted to run on D disks with a speedup of O(D) using the techniques described in Sections 5.1 and 5.2.

5.6 Permuting

Permuting is the special case of sorting in which the key values of the N data items form a permutation of {1, 2, ..., N}.

Theorem 5.3 ([345]). The average-case and worst-case number of I/Os required for permuting N data items using D disks is

$$\Theta\Bigl(\min\Bigl\{\frac{N}{D},\,\mathrm{Sort}(N)\Bigr\}\Bigr). \qquad(5.11)$$

The I/O bound (5.11) for permuting can be realized by one of the optimal external sorting algorithms except in the extreme case for which $B\log m = o(\log n)$, when it is faster to move the data items one by one in a nonblocked way. The one-by-one method is trivial if D = 1, but with multiple disks there may be bottlenecks on individual disks; one solution for doing the permuting in O(N/D) I/Os is to apply the randomized balancing strategies of [345]. An interesting theoretical question is to determine the I/O cost for each individual permutation, as a function of some simple characterization of the permutation, such as the number of inversions. We examine special classes of permutations having to do with matrices, such as matrix transposition, in Chapter 7.

5.7 Fast Fourier Transform and Permutation Networks

Computing the Fast Fourier Transform (FFT) in external memory consists of a series of I/Os that permit each computation implied by the FFT directed graph (or butterfly) to be done while its arguments are in internal memory. A permutation network computation consists of an oblivious (fixed) pattern of I/Os that can realize any of the N!
possible permutations; data items can only be reordered when they are in internal memory. A permutation network can be constructed by a series of three FFTs [358].

Theorem 5.4 ([23]). With D disks, the number of I/Os required for computing the N-input FFT digraph or an N-input permutation network is Sort(N).

Cormen and Nicol [119] give some practical implementations for one-dimensional FFTs based upon the optimal PDM algorithm of Vitter and Shriver [345]. The algorithms for FFT are faster and simpler than for sorting because the computation is nonadaptive in nature, and thus the communication pattern is fixed in advance.

6 Lower Bounds on I/O

In this chapter, we prove the lower bounds from Theorems 5.1–5.4, including a careful derivation of the constants of proportionality in the permuting and sorting lower bounds. We also mention some related I/O lower bounds for the batched problems in computational geometry and graphs that we cover later in Chapters 8 and 9.

6.1 Permuting

The most trivial batched problem is that of scanning (a.k.a. streaming or touching) a file of N data items, which can be done in a linear number O(N/DB) = O(n/D) of I/Os. Permuting is one of several simple problems that can be done in linear CPU time in the (internal memory) RAM model. But if we assume that the N items are indivisible and must be transferred as individual entities, permuting requires a nonlinear number of I/Os in PDM because of the locality constraints imposed by the block parameter B. Our main result for parallel disk sorting is that we close the gap between the upper and lower bounds up to lower-order terms. The lower bound from [23] left open the nature of the constant factor of proportionality of the leading term; in particular, it was not clear what happens if the number of output steps and input steps differ.

Theorem 6.1 ([202]).
Assuming that m = M/B is an increasing function, the number of I/Os required to sort or permute N indivisible items, up to lower-order terms, is at least

$$\begin{cases}\dfrac{2N}{D}\,\dfrac{\log n}{B\log m + 2\log N} \;\sim\; \dfrac{2n}{D}\log_m n & \text{if } B\log m = \omega(\log N);\\[2ex] \dfrac{N}{D} & \text{if } B\log m = o(\log N).\end{cases} \qquad(6.1)$$

The main case in Theorem 6.1 is the first one, and this theorem shows that the constant of proportionality in the Sort(N) bound (5.1) of Theorem 5.1 is at least 2. The second case in the theorem is the pathological case in which the block size B and internal memory size M are so small that the optimum way to permute the items is to move them one at a time in the naive manner, not making use of blocking. We devote the rest of this section to a proof of Theorem 6.1. For the lower bound calculation, we can assume without loss of generality that there is only one disk, namely, D = 1. The I/O lower bound for general D follows by dividing the lower bound for one disk by D. We call an input operation simple if each item that is transferred from the disk gets removed from the disk and deposited into an empty location in internal memory. Similarly, an output is simple if the transferred items are removed from internal memory and deposited into empty locations on disk.

Lemma 6.2 ([23]). For each computation that implements a permutation of the N items, there is a corresponding computation strategy involving only simple I/Os such that the total number of I/Os is no greater.

The lemma can be demonstrated easily by starting with a valid permutation computation and working backwards. At each I/O step, in backwards order, we cancel the transfer of an item if its transfer is not needed for the final result; if it is needed, we make the transfer simple. The resulting I/O strategy has only simple I/Os. For the lower bound, we use the basic approach of Aggarwal and Vitter [23] and bound the maximum number of permutations that can be produced by at most t I/Os.
If we take the value of t for which the bound first reaches N!, we get a lower bound on the worst-case number of I/Os. In a similar way, we can get a lower bound on the average case by computing the value of t for which the bound first reaches N!/2. In particular, we say that a permutation $p_1, p_2, \ldots, p_N$ of the N items can be produced after $t_I$ input operations and $t_O$ output operations if there is some intermixed sequence of $t_I$ input operations and $t_O$ output operations so that the items end up in the permuted order $p_1, p_2, \ldots, p_N$ in extended memory. (By extended memory we mean the memory locations of internal memory followed by the memory locations on disk, in sequential order.) The items do not have to be in contiguous positions in internal memory or on disk; there can be arbitrarily many empty locations between adjacent items. As mentioned above, we can assume that I/Os are simple. Each I/O causes the transfer of exactly B items, although some of the items may be nil. In the PDM model, the I/Os obey block boundaries, in that all the non-nil items in a given I/O come from or go to the same block on disk. Initially, before any I/Os are performed and the items reside on disk, the number of producible permutations is 1. Let us consider the effect of an output. There can be at most N/B + o − 1 nonempty blocks before the oth output operation, and thus the items in the oth output can go into one of N/B + o places relative to the other blocks. Hence, the oth output boosts the number of producible permutations by a factor of at most N/B + o, which can be bounded trivially by

$$N(1 + \log N). \qquad(6.2)$$

For the case of an input operation, we first consider an input I/O from a specific block on disk.
If the b items involved in the input I/O were together in internal memory at some previous time (e.g., if the block was created by an earlier output operation), then the items could have been arranged in an arbitrary order by the algorithm while they were in internal memory. Thus, the b! possible orderings of the b input items relative to themselves could already have been produced before the input operation. This implies in a subtle way that rearranging the newly input items among the other M − b items in internal memory can boost the number of producible permutations by a factor of at most $\binom{M}{b}$, which is the number of ways to intersperse b indistinguishable items within a group of size M. The above analysis applies to input from a specific block. If the input was preceded by a total of o output operations, there are at most $N/B + o \le N(1 + \log N)$ blocks to choose from for the I/O, so the number of producible permutations is boosted further by at most $N(1 + \log N)$. Therefore, assuming that at some prior time the b input items were together in internal memory, an input operation can boost the number of producible permutations by at most

$$N(1 + \log N)\binom{M}{b}. \qquad(6.3)$$

Now let us consider an input operation in which some of the input items were not together previously in internal memory (e.g., the first time a block is input). By rearranging the relative order of the items in internal memory, we can increase the number of producible permutations by a factor of B!. Given that there are N/B full blocks initially, we get the maximum increase when all N/B blocks are input in full, which boosts the number of producible permutations by a factor of

$$(B!)^{N/B}. \qquad(6.4)$$

Let I be the total number of input I/O operations. In the ith input operation, let $b_i$ be the number of items brought into internal memory. By the simplicity property, some of the items in the block being accessed may not be brought into internal memory, but rather may be left on disk.
In this case, $b_i$ counts only the number of items that are removed from disk and put into internal memory. In particular, we have $0 \le b_i \le B$. By the simplicity property, we need to make room in internal memory for the new items that arrive, and in the end all items are stored back on disk. Therefore, we get the following lower bound on the number O of output operations:

$$O \;\ge\; \frac{1}{B}\sum_{1\le i\le I} b_i. \qquad(6.5)$$

Combining (6.2), (6.3), and (6.4), we find that

$$\bigl(N(1 + \log N)\bigr)^{I+O} \prod_{1\le i\le I}\binom{M}{b_i} \;\ge\; \frac{N!}{(B!)^{N/B}}, \qquad(6.6)$$

where O satisfies (6.5). Let $\tilde{B} \le B$ be the average number of items input during the I input operations. By a convexity argument, the left-hand side of (6.6) is maximized when each $b_i$ has the same value, namely, $\tilde{B}$. We can rewrite (6.5) as $O \ge I\tilde{B}/B$, and thus we get $I \le (I + O)/(1 + \tilde{B}/B)$. Combining these facts with (6.6), we get

$$\bigl(N(1 + \log N)\bigr)^{I+O}\binom{M}{\tilde{B}}^{I} \;\ge\; \frac{N!}{(B!)^{N/B}}; \qquad(6.7)$$

$$\bigl(N(1 + \log N)\bigr)^{I+O}\binom{M}{\tilde{B}}^{(I+O)/(1+\tilde{B}/B)} \;\ge\; \frac{N!}{(B!)^{N/B}}. \qquad(6.8)$$

By the assumption that m = M/B is an increasing function, the left-hand side of (6.8) is maximized when $\tilde{B} = B$, so we get

$$\bigl(N(1 + \log N)\bigr)^{I+O}\binom{M}{B}^{(I+O)/2} \;\ge\; \frac{N!}{(B!)^{N/B}}. \qquad(6.9)$$

The lower bound on I + O for D = 1 follows by taking logarithms of both sides of (6.9) and solving for I + O using Stirling's formula. Indeed, taking logarithms of (6.9) gives

$$(I+O)\log\bigl(N(1+\log N)\bigr) + \frac{I+O}{2}\log\binom{M}{B} \;\ge\; \log N! - \frac{N}{B}\log B! \;=\; N\log n\,\bigl(1 - o(1)\bigr),$$

and since $\log\binom{M}{B} \le B\log(eM/B) = B\log m + O(B)$ and $\log\bigl(N(1+\log N)\bigr) = \log N + O(\log\log N)$, solving for I + O yields $I + O \ge 2N\log n/(B\log m + 2\log N)$ up to lower-order terms, which is the first case of (6.1). We get the general lower bound of Theorem 6.1 for D disks by dividing the result by D.

6.2 Lower Bounds for Sorting and Other Problems

Permuting is a special case of sorting, and hence, the permuting lower bound of Theorem 6.1 applies also to sorting. In the unlikely case that $B\log m = o(\log n)$, the permuting bound is only Ω(N/D), and we must resort to the comparison model to get the full lower bound (5.1) of Theorem 5.1 [23]. In the typical case in which $B\log m = \Omega(\log n)$, the comparison model is not needed to prove the sorting lower bound; the difficulty of sorting in that case arises not from determining the order of the data but from permuting (or routing) the data.
The constant of proportionality of 2 in the lower bound (6.1) of Theorem 6.1 is nearly realized by randomized cycling distribution sort (RCD) in Section 5.1.3, simple randomized merge sort (SRM) in Section 5.2.1, and the dual methods of Section 5.3.4. The derivation of the permuting lower bound in Section 6.1 also works for permutation networks, in which the communication pattern is oblivious (fixed). Since the choice of disk block is fixed for each I/O step, there is no $N(1 + \log N)$ term as there is in (6.6), and correspondingly there is no additive $2\log N$ term as there is in the denominator of the left-hand side of Theorem 6.1. Hence, when we solve for I + O, we get the lower bound (5.1) rather than (5.11). The lower bound follows directly from the counting argument; unlike the sorting derivation, it does not require the comparison model for the case $B\log m = o(\log n)$. The lower bound also applies directly to FFT, since permutation networks can be formed from three FFTs in sequence. Arge et al. [42] show for the comparison model that any problem with an $\Omega(N\log N)$ lower bound in the (internal memory) RAM model requires $\Omega(n\log_m n)$ I/Os in PDM for a single disk. Their argument leads to a matching lower bound of $\Omega\bigl(n\max\{1,\log_m(K/B)\}\bigr)$ I/Os in the comparison model for duplicate removal with one disk. Erickson [150] extends the sorting and element distinctness lower bound to the more general algebraic decision tree model. For the problem of bundle sorting, in which the N items have a total of K distinct key values (but the secondary information of each item is different), Matias et al. [249] derive the matching lower bound

$$\mathrm{BundleSort}(N,K) \;=\; \Omega\bigl(n\max\{1,\,\log_m \min\{K,n\}\}\bigr).$$

The proof consists of the following parts. The first part is a simple proof of the same lower bound as for duplicate removal, but without resorting to the comparison model (except for the pathological case $B\log m = o(\log n)$).
It suffices to replace the right-hand side of (6.9) by $N!/\bigl((N/K)!\bigr)^K$, which is the maximum number of permutations of N numbers having K distinct values. Solving for I + O gives the lower bound $\Omega\bigl(n\max\{1,\log_m(K/B)\}\bigr)$, which is equal to the desired lower bound for BundleSort(N,K) when $K = B^{1+\Omega(1)}$ or $M = B^{1+\Omega(1)}$. Matias et al. [249] derive the remaining case of the lower bound for BundleSort(N,K) by a potential argument based upon the derivation of the transposition lower bound (Theorem 7.2). Dividing by D gives the lower bound for D disks. Chiang et al. [105], Arge [30], Arge and Miltersen [45], Munagala and Ranade [264], and Erickson [150] give models and lower bound reductions for several computational geometry and graph problems. The geometry problems discussed in Chapter 8 are equivalent to sorting in both the internal memory and PDM models. Problems such as list ranking and expression tree evaluation have the same nonlinear I/O lower bound as permuting. Other problems such as connected components, biconnected components, and minimum spanning forest of sparse graphs with E edges and V vertices require as many I/Os as E/V instances of permuting V items. This situation is in contrast with the (internal memory) RAM model, in which the same problems can all be done in linear CPU time. In some cases there is a gap between the best known upper and lower bounds, which we examine further in Chapter 9. The lower bounds mentioned above assume that the data items are in some sense "indivisible," in that they are not split up and reassembled in some magic way to get the desired output. It is conjectured that the sorting lower bound (5.1) remains valid even if the indivisibility assumption is lifted. However, for an artificial problem related to transposition, Adler [5] showed that removing the indivisibility assumption can lead to faster algorithms.
A similar result is shown by Arge and Miltersen [45] for the decision problem of determining if N data item values are distinct. Whether the conjecture is true is a challenging theoretical open problem.

7 Matrix and Grid Computations

7.1 Matrix Operations

Dense matrices are generally represented in memory in row-major or column-major order. For certain operations such as matrix addition, both representations work well. However, for standard matrix multiplication (using only semiring operations) and LU decomposition, a better representation is to block the matrix into square $\sqrt{B} \times \sqrt{B}$ submatrices, which gives the upper bound of the following theorem:

Theorem 7.1 ([199, 306, 345, 357]). The number of I/Os required for standard matrix multiplication of two K × K matrices or to compute the LU factorization of a K × K matrix is

$$\Theta\biggl(\frac{K^3}{\min\{K,\sqrt{M}\}\,DB}\biggr).$$

The lower bound follows from the related pebbling lower bound by Savage and Vitter [306] for the case D = 1 and then dividing by D. Hong and Kung [199] and Nodine et al. [272] give optimal EM algorithms for iterative grid computations, and Leiserson et al. [233] reduce the number of I/Os of naive multigrid implementations by a $\Theta(M^{1/5})$ factor. Gupta et al. [188] show how to derive efficient EM algorithms automatically for computations expressed in tensor form. If a K × K matrix A is sparse, that is, if the number $N_z$ of nonzero elements in A is much smaller than $K^2$, then it may be more efficient to store only the nonzero elements. Each nonzero element $A_{i,j}$ is represented by the triple $(i, j, A_{i,j})$. Vengroff and Vitter [337] report on algorithms and benchmarks for dense and sparse matrix operations. For further discussion of numerical EM algorithms we refer the reader to the surveys by Toledo [328] and Kowarschik and Weiß [225]. Some issues regarding programming environments are covered in [115] and Chapter 17.
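The tiling behind Theorem 7.1 can be illustrated with a toy in-memory sketch. Here `s` stands in for the tile side (roughly $\sqrt{B}$, or $\sqrt{M/3}$ when three tiles are kept resident at once); it is assumed that `s` divides K. This is only an illustration of the access pattern, not an external-memory implementation.

```python
def blocked_matmul(A, B_mat, s):
    """Multiply two K x K matrices in s x s tiles.

    For each s x s tile of the product C, the algorithm streams through
    K/s pairs of tiles of A and B_mat; in the EM setting each tile would
    be one unit of blocked I/O, so every element is moved only
    O(K / min{K, sqrt(M)}) times.  (Plain-Python sketch; s divides K.)
    """
    K = len(A)
    C = [[0] * K for _ in range(K)]
    for i0 in range(0, K, s):
        for j0 in range(0, K, s):
            for k0 in range(0, K, s):          # one tile pair "in memory"
                for i in range(i0, i0 + s):
                    for j in range(j0, j0 + s):
                        C[i][j] += sum(A[i][k] * B_mat[k][j]
                                       for k in range(k0, k0 + s))
    return C
```

The result is independent of the tile size, which only changes the order in which the scalar products are accumulated.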
7.2 Matrix Transposition

Matrix transposition is the special case of permuting that involves conversion of a matrix from row-major order to column-major order.

Theorem 7.2 ([23]). With D disks, the number of I/Os required to transpose a p × q matrix from row-major order to column-major order is

$$\Theta\Bigl(\frac{n}{D}\,\log_m \min\{M, p, q, n\}\Bigr), \qquad(7.1)$$

where N = pq and n = N/B.

When B is relatively large (say, $\frac{1}{2}M$) and N is $O(M^2)$, matrix transposition can be as hard as general sorting, but for smaller B, the special structure of the transposition permutation makes transposition easier. In particular, the matrix can be broken up into square submatrices of $B^2$ elements such that each submatrix contains B blocks of the matrix in row-major order and also B blocks of the matrix in column-major order. Thus, if $B^2$ […] d-dimensional orthogonal range queries can be done in $O(n^{1-1/d} + z)$ I/Os, and inserts and deletes take $O(\log_B N)$ I/Os. The O-tree of Kanth and Singh [211] provides similar bounds. Cross trees also support the dynamic operations of cut and concatenate in $O(n^{1-1/d})$ I/Os. In some restricted models for linear-space data structures, the 2-D range search query performance of cross trees and O-trees can be considered to be optimal, although it is much larger than the logarithmic bound of Criterion 1. One way to get multidimensional EM data structures is to augment known internal memory structures, such as quad trees and kd-trees, with block-access capabilities. Examples include kd-B-trees [293], buddy trees [309], hB-trees [151, 238], and Bkd-trees [285]. Grid files [196, 268, 353] are a flattened data structure for storing the cells of a two-dimensional grid in disk blocks. Another technique is to "linearize" the multidimensional space by imposing a total ordering on it (a so-called space-filling curve), and then the total order is used to organize the points into a B-tree [173, 207, 277].
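The linearization idea is simple to sketch. The Z-order (Morton) curve is shown here because it is the easiest space-filling curve to code; the Peano-Hilbert curve mentioned later in connection with R-trees has better locality but a more involved bit manipulation. The function name and bit width are illustrative choices, not from the text.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of (x, y) to get a Z-order (Morton) key.

    Sorting points by this key imposes a total order on the plane with
    good spatial locality, so the points can then be stored in an
    ordinary B-tree keyed on the Morton value.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits -> even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits -> odd positions
    return key
```

For example, the four cells of the unit quadrant come out in the "Z" visiting order (0,0), (1,0), (0,1), (1,1).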
Linearization can also be used to represent nonpoint data, in which the data items are partitioned into one or more multidimensional rectangular regions [1, 276]. All the methods described in this paragraph use linear space, and they work well in certain situations; however, their worst-case range query performance is no better than that of cross trees, and for some methods, such as grid files, queries can require Θ(n) I/Os, even if there are no points satisfying the query. We refer the reader to [18, 170, 270] for a broad survey of these and other interesting methods. Space-filling curves arise again in connection with R-trees, which we describe next.

12.2 R-trees

The R-tree of Guttman [190] and its many variants are a practical multidimensional generalization of the B-tree for storing a variety of geometric objects, such as points, segments, polygons, and polyhedra, using linear disk space. Internal nodes have degree Θ(B) (except possibly the root), and leaves store Θ(B) items. Each node in the tree has associated with it a bounding box (or bounding polygon) of all the items in its subtree. A big difference between R-trees and B-trees is that in R-trees the bounding boxes of sibling nodes are allowed to overlap. If an R-tree is being used for point location, for example, a point may lie within the bounding box of several children of the current node in the search. In that case the search must proceed to all such children. In the dynamic setting, there are several popular heuristics for where to insert new items into an R-tree and how to rebalance it; see [18, 170, 183] for a survey. The R*-tree variant of Beckmann et al. [77] seems to give the best overall query performance. To insert an item, we start at the root and recursively insert the item into the subtree whose bounding box would expand the least in order to accommodate the item.
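The least-enlargement rule for choosing a subtree can be sketched as follows. This is only a sketch of the descent criterion (including the tie-break by smallest resulting box); real R*-trees weigh additional criteria such as overlap minimization, and the function names and the `(x1, y1, x2, y2)` box representation are illustrative choices.

```python
def enlargement(box, item):
    """Return (area increase, resulting area) if `item` joins `box`.

    Boxes and items are axis-aligned rectangles (x1, y1, x2, y2).
    """
    (x1, y1, x2, y2), (a1, b1, a2, b2) = box, item
    nx1, ny1 = min(x1, a1), min(y1, b1)
    nx2, ny2 = max(x2, a2), max(y2, b2)
    new_area = (nx2 - nx1) * (ny2 - ny1)
    return new_area - (x2 - x1) * (y2 - y1), new_area

def choose_subtree(children, item):
    """Pick the index of the child bounding box that expands least to
    hold `item`; ties go to the smaller resulting bounding box."""
    return min(range(len(children)),
               key=lambda i: enlargement(children[i], item))
```

Because `min` compares the `(increase, resulting area)` tuples lexicographically, the tie-breaking rule falls out of the comparison for free.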
In case of a tie (e.g., if the item already fits inside the bounding boxes of two or more subtrees), we choose the subtree with the smallest resulting bounding box. In the normal R-tree algorithm, if a leaf node gets too many items or if an internal node gets too many children, we split it, as in B-trees. Instead, in the R*-tree algorithm, we remove a certain percentage of the items from the overflowing node and reinsert them into the tree. The items we choose to reinsert are the ones whose centroids are furthest from the center of the node's bounding box. This forced reinsertion tends to improve global organization and reduce query time. If the node still overflows after the forced reinsertion, we split it. The splitting heuristics try to partition the items into nodes so as to minimize intuitive measures such as coverage, overlap, or perimeter. During deletion, in both the normal R-tree and R*-tree algorithms, if a leaf node has too few items or if an internal node has too few children, we delete the node and reinsert all its items back into the tree by forced reinsertion. The rebalancing heuristics perform well in many practical scenarios, especially in low dimensions, but they result in poor worst-case query bounds. An interesting open problem is whether nontrivial query bounds can be proven for the "typical-case" behavior of R-trees for problems such as range searching and point location. Similar questions apply to the methods discussed in Section 12.1. New R-tree partitioning methods by de Berg et al. [128], Agarwal et al. [17], and Arge et al. [38] provide some provable bounds on overlap and query performance. In the static setting, in which there are no updates, constructing the R*-tree by repeated insertions, one by one, is extremely slow. A faster alternative to the dynamic R-tree construction algorithms mentioned above is to bulk-load the R-tree in a bottom-up fashion [1, 206, 276].
Such methods use some heuristic for grouping the items into leaf nodes of the R-tree, and then recursively build the nonleaf nodes from bottom to top. As an example, in the so-called Hilbert R-tree of Kamel and Faloutsos [206], each item is labeled with the position of its centroid on the Peano-Hilbert space-filling curve, and a B+-tree is built upon the totally ordered labels in a bottom-up manner. Bulk loading a Hilbert R-tree is therefore easy to do once the centroid points are presorted. These static construction methods are very different in spirit from the dynamic insertion methods: The dynamic methods explicitly try to reduce the coverage, overlap, or perimeter of the bounding boxes of the R-tree nodes, and as a result, they usually achieve good query performance. The static construction methods do not consider the bounding box information at all. Instead, the hope is that the improved storage utilization (up to 100%) of these packing methods compensates for a higher degree of node overlap. A dynamic insertion method related to [206] was presented in [207]. The quality of the Hilbert R-tree in terms of query performance is generally not as good as that of an R*-tree, especially for higher-dimensional data [84, 208]. In order to get the best of both worlds (the query performance of R*-trees and the bulk construction efficiency of Hilbert R-trees), Arge et al. [41] and van den Bercken et al. [333] independently devised fast bulk loading methods based upon buffer trees that do top-down construction in $O(n\log_m n)$ I/Os, which matches the performance of the bottom-up methods within a constant factor. The former method is especially efficient and supports dynamic batched updates and queries.

Fig. 12.2 Costs for R-tree processing (in units of 1000 I/Os) using the naive repeated insertion method and the buffer R-tree for various buffer sizes: (a) cost for bulk-loading the R-tree; (b) query cost.
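The bottom-up packing step can be sketched in a few lines. Here a Z-order key on the centroid stands in for the Peano-Hilbert labels used by Kamel and Faloutsos; the function name, rectangle representation `(x1, y1, x2, y2)`, and the simple "cut into full leaves" policy are illustrative assumptions.

```python
def bulk_load_leaves(rects, B):
    """Pack rectangles into R-tree leaves of B items each.

    Hilbert-R-tree-style bottom-up loading: label each rectangle by its
    centroid's position on a space-filling curve (Z-order here, as a
    stand-in for the Peano-Hilbert curve), sort by label, and cut the
    sorted list into full leaves, giving near-100% storage utilization.
    """
    def z_key(cx, cy, bits=16):
        k = 0
        for i in range(bits):
            k |= ((int(cx) >> i) & 1) << (2 * i)
            k |= ((int(cy) >> i) & 1) << (2 * i + 1)
        return k

    ordered = sorted(rects, key=lambda r: z_key((r[0] + r[2]) // 2,
                                                (r[1] + r[3]) // 2))
    return [ordered[i:i + B] for i in range(0, len(ordered), B)]
```

The nonleaf levels would then be built the same way over the leaf bounding boxes, level by level, until a single root remains.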
In Figure 12.2 and Table 12.1, we report on some experiments that test the construction, update, and query performance of various R-tree methods. The experimental data came from TIGER/line data sets from four US states [327]; the implementations were done using the TPIE system, described in Chapter 17. Figure 12.2 compares the construction cost for building R-trees and the resulting query performance in terms of I/Os for the naive sequential method for constructing R*-trees (labeled "naive") and the newly developed buffer R*-tree method [41] (labeled "buffer"). An R-tree was constructed on the TIGER road data for each state and for each of four possible buffer sizes. The four buffer sizes were capable of storing 0, 600, 1,250, and 5,000 rectangles, respectively; buffer size 0 corresponds to the naive method, and the larger buffers correspond to the buffer method. The query performance of each resulting R-tree was measured by posing rectangle intersection queries using rectangles taken from TIGER hydrographic data. The results, depicted in Figure 12.2, show that buffer R*-trees, even with relatively small buffers, achieve a tremendous speedup in number of I/Os for construction without any worsening in query performance, compared with the naive method. The CPU costs of the two methods are comparable.

Table 12.1 Summary of the costs (in number of I/Os) for R-tree updates and queries, using an update with 50% of the data. Packing refers to the percentage storage utilization.

Data set   Update method     Building   Querying   Packing (%)
RI         Naive              259,263      6,670        64
           Hilbert             15,865      7,262        92
           Buffer              13,484      5,485        90
CT         Naive              805,749     40,910        66
           Hilbert             51,086     40,593        92
           Buffer              42,774     37,798        90
NJ         Naive            1,777,570     70,830        66
           Hilbert            120,034     69,798        92
           Buffer             101,017     65,898        91
NY         Naive            3,736,601    224,039        66
           Hilbert            246,466    230,990        92
           Buffer             206,921    227,559        90
The storage utilization of buffer R*-trees tends to be in the 90% range, as opposed to roughly 70% for the naive method. Bottom-up methods can build R-trees even more quickly and more compactly, but they generally do not support bulk dynamic operations, which is a big advantage of the buffer tree approach. Kamel et al. [208] develop a way to do bulk updates with Hilbert R-trees, but at a cost in terms of query performance. Table 12.1 compares dynamic update methods for the naive method, for buffer R-trees, and for Hilbert R-trees [208] (labeled "Hilbert"). A single R-tree was built for each of the four US states, containing 50% of the road data objects for that state. Using each of the three algorithms, the remaining 50% of the objects were inserted into the R-tree, and the construction time was measured. Query performance was then tested as before. The results in Table 12.1 indicate that the buffer R*-tree and the Hilbert R-tree achieve a similar degree of packing, but the buffer R*-tree provides better update and query performance.

12.3 Bootstrapping for 2-D Diagonal Corner and Stabbing Queries

An obvious paradigm for developing an efficient dynamic EM data structure, given an existing data structure that works well when the problem fits into internal memory, is to "externalize" the internal memory data structure. If the internal memory data structure uses a binary tree, then a multiway tree such as a B-tree must be used instead. However, when searching a B-tree, it can be difficult to report all the items in the answer to the query in an output-sensitive manner. For example, in certain searching applications, each of the Θ(B) subtrees of a given node in a B-tree may contribute one item to the query answer, and as a result each subtree may need to be explored (costing several I/Os) just to report a single item of the answer.
Fortunately, we can sometimes achieve output-sensitive reporting by augmenting the data structure with a set of filtering substructures, each of which is a data structure for a smaller version of the same problem. We refer to this approach, which we explain shortly in more detail, as the bootstrapping paradigm. Each substructure typically needs to store only $O(B^2)$ items and to answer queries in $O(\log_B B^2 + Z/B) = O(\lceil Z/B\rceil)$ I/Os, where Z is the number of items reported. A substructure can even be static if it can be constructed in O(B) I/Os, since we can keep updates in a separate buffer and do a global rebuilding in O(B) I/Os whenever there are Θ(B) updates. Such a rebuilding costs O(1) I/Os (amortized) per update. We can often remove the amortization and make it worst-case using the weight-balanced B-trees of Section 11.2 as the underlying B-tree structure. Arge and Vitter [58] first uncovered the bootstrapping paradigm while designing an optimal dynamic EM data structure for diagonal corner two-sided 2-D queries (see Figure 12.1(a)) that meets all three design criteria listed in Chapter 12. Diagonal corner two-sided queries are equivalent to stabbing queries, which have the following form: "Given a set of one-dimensional intervals, report all the intervals 'stabbed' by the query value x." (That is, report all intervals that contain x.) A diagonal corner query x on a set of 2-D points $(a_1,b_1), (a_2,b_2), \ldots$ is equivalent to a stabbing query x on the set of closed intervals $[a_1,b_1], [a_2,b_2], \ldots$. The EM data structure for stabbing queries is a multiway version of the well-known interval tree data structure [144, 145] for internal memory, which supports stabbing queries in $O(\log N + Z)$ CPU time and updates in $O(\log N)$ CPU time and uses O(N) space. We can externalize it by using a weight-balanced B-tree as the underlying base tree, where the nodes have degree $\Theta(\sqrt{B})$.
Each node in the base tree corresponds in a natural way to a one-dimensional range of x-values; its Θ(√B) children correspond to subranges called slabs, and the Θ((√B)^2) = Θ(B) contiguous sets of slabs are called multislabs, as in Section 8.1 for a similar batched problem. Each interval in the problem instance is stored in the lowest node v in the base tree whose range completely contains the interval. The interval is decomposed by v's Θ(√B) slabs into at most three pieces: the middle piece that completely spans one or more slabs of v, the left end piece that partially protrudes into a slab of v, and the right end piece that partially protrudes into another slab of v, as shown in Figure 12.3.

Fig. 12.3 Internal node v of the EM interval tree, for B = 64 with √B = 8 slabs. Node v is the lowest node in the tree completely containing the indicated interval. The middle piece of the interval is stored in the multislab list corresponding to slabs 3–5. (The multislab lists are not pictured.) The left and right end pieces of the interval are stored in the left-ordered list of slab 2 and the right-ordered list of slab 6, respectively.

The three pieces are stored in substructures of v. In the example in Figure 12.3, the middle piece is stored in a list associated with the multislab it spans (corresponding to the contiguous range of slabs 3–5), the left end piece is stored in a one-dimensional list for slab 2 ordered by left endpoint, and the right end piece is stored in a one-dimensional list for slab 6 ordered by right endpoint. Given a query value x, the intervals stabbed by x reside in the substructures of the nodes of the base tree along the search path from the root to the leaf for x.
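The three-way decomposition at a node can be sketched as follows. This is an illustrative fragment with our own slab representation; we assume the interval crosses at least one slab boundary of v, since otherwise it would be stored at a lower node.

```python
import bisect

def split_interval(lo, hi, bounds):
    """Decompose [lo, hi] at a node whose slabs are [bounds[i], bounds[i+1]).
    Returns (left_piece, middle_slab_range, right_piece); middle_slab_range
    is the (first_slab, last_slab) pair naming the spanned multislab, or
    None if no slab is fully spanned."""
    i = bisect.bisect_right(bounds, lo) - 1    # slab containing lo
    j = bisect.bisect_right(bounds, hi) - 1    # slab containing hi
    assert i < j, "interval must cross a slab boundary of this node"
    left = (lo, bounds[i + 1])                 # protrudes into slab i
    right = (bounds[j], hi)                    # protrudes into slab j
    middle = (i + 1, j - 1) if j - i >= 2 else None
    return left, middle, right
```

For example, with 8 slabs of width 10, the interval [13, 57] yields a left end piece in slab 1, a middle piece spanning slabs 2–4, and a right end piece in slab 5.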
For each such node v, we consider each of v's multislabs that contains x and report all the intervals in the multislab list. We also walk sequentially through the right-ordered list and left-ordered list for the slab of v that contains x, reporting intervals in an output-sensitive way.

The big problem with this approach is that we have to spend at least one I/O per multislab containing x, regardless of how many intervals are in the multislab lists. For example, there may be Θ(B) such multislab lists, with each list containing only a few stabbed intervals (or worse yet, none at all). The resulting query performance will be highly nonoptimal. The solution, according to the bootstrapping paradigm, is to use a substructure in each node consisting of an optimal static data structure for a smaller version of the same problem; a good choice is the corner data structure developed by Kanellakis et al. [210]. The corner substructure in this case is used to store all the intervals from the "sparse" multislab lists, namely, those that contain fewer than B intervals, and thus the substructure contains only O(B^2) intervals. When visiting node v, we access only v's nonsparse multislab lists, each of which contributes Z ≥ B intervals to the answer, at an output-sensitive cost of O(Z/B) I/Os, for some Z. The remaining Z stabbed intervals stored in v can be found by a single query to v's corner substructure, at a cost of O(log_B B^2 + Z/B) = O(1 + Z/B) I/Os. Since there are O(log_B N) nodes along the search path in the base tree, the total collection of Z stabbed intervals is reported in O(log_B N + z) I/Os, which is optimal. Using a weight-balanced B-tree as the underlying base tree allows the static substructures to be rebuilt in worst-case optimal I/O bounds.

The above bootstrapping approach also yields dynamic EM segment trees with optimal query and update bound and O(n log_B N)-block space usage.
Stabbing queries are important because, when combined with one-dimensional range queries, they provide a solution to dynamic interval management, in which one-dimensional intervals can be inserted and deleted, and intersection queries can be performed. These operations support indexing of one-dimensional constraints in constraint databases. Other applications of stabbing queries arise in graphics and GIS. For example, Chiang and Silva [106] apply the EM interval tree structure to extract at query time the boundary components of the isosurface (or contour) of a surface. A data structure for a related problem, which in addition has optimal output complexity, appears in [10]. Range-max and stabbing-max queries are studied in [13, 15].

12.4 Bootstrapping for Three-Sided Orthogonal 2-D Range Search

Arge et al. [50] provide another example of the bootstrapping paradigm by developing an optimal dynamic EM data structure for three-sided orthogonal 2-D range searching (see Figure 12.1(c)) that meets all three design criteria. In internal memory, the optimal structure is the priority search tree [251], which answers three-sided range queries in O(log N + Z) CPU time, does updates in O(log N) CPU time, and uses O(N) space. The EM structure of Arge et al. [50] is an externalization of the priority search tree, using a weight-balanced B-tree as the underlying base tree. Each node in the base tree corresponds to a one-dimensional range of x-values, and its Θ(B) children correspond to subranges consisting of vertical slabs. Each node v contains a small substructure called a child cache that supports three-sided queries. Its child cache stores the "Y-set" Y(w) for each of the Θ(B) children w of v. The Y-set Y(w) for child w consists of the highest Θ(B) points in w's slab that are not already stored in the child cache of some ancestor of v. There are thus a total of Θ(B^2) points stored in v's child cache.
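A minimal sketch of how a child's Y-set might be selected (our own naive formulation with invented names; the real structure maintains these sets incrementally under updates):

```python
def y_set(slab_points, ancestor_cached, capacity):
    """Y(w): the `capacity` highest points of child w's slab that are
    not already stored in the child cache of an ancestor."""
    available = [p for p in slab_points if p not in ancestor_cached]
    # points are (x, y) pairs; keep the highest ones by y-coordinate
    return sorted(available, key=lambda p: p[1], reverse=True)[:capacity]
```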
We can answer a three-sided query of the form [x1, x2] × [y1, +∞) by visiting a set of nodes in the base tree, starting with the root. For each visited node v, we pose the query [x1, x2] × [y1, +∞) to v's child cache and output the results. The following rules are used to determine which of v's children to visit: We visit v's child w if either (1) w is along the leftmost search path for x1 or the rightmost search path for x2 in the base tree, or (2) the entire Y-set Y(w) is reported when v's child cache is queried. (See Figure 12.4.)

There are O(log_B N) nodes w that are visited because of rule 1. When rule 1 is not satisfied, rule 2 provides an effective filtering mechanism to guarantee output-sensitive reporting: The I/O cost for initially accessing a child node w can be "charged" to the Θ(B) points of Y(w) reported from v's child cache; conversely, if not all of Y(w) is reported, then the points stored in w's subtree will be too low to satisfy the query, and there is no need to visit w (see Figure 12.4(b)). Provided that each child cache can be queried in O(1) I/Os plus the output-sensitive cost to output the points satisfying the query, the resulting overall query time is O(log_B N + z), as desired. All that remains is to show how to query a child cache in a constant number of I/Os, plus the output-sensitive cost.

Fig. 12.4 Internal node v of the EM priority search tree, with slabs (children) w1, w2, ..., w5. The Y-sets of each child, which are stored collectively in v's child cache, are indicated by the bold points. (a) The three-sided query is completely contained in the x-range of w2. The relevant (bold) points are reported from v's child cache, and the query is recursively answered in w2. (b) The three-sided query spans several slabs. The relevant (bold) points are reported from v's child cache, and the query is recursively answered in w2, w3, and w5. The query is not extended to w4 in this case because not all of its Y-set Y(w4) (stored in v's child cache) satisfies the query, and as a result none of the points stored in w4's subtree can satisfy the query.

Arge et al. [50] provide an elegant and optimal static data structure for three-sided range search, which can be used in the EM priority search tree described above to implement the child caches of size O(B^2). The static structure is a persistent B-tree optimized for batched construction. When used for O(B^2) points, it occupies O(B) blocks, can be built in O(B) I/Os, and supports three-sided queries in O(1 + Z/B) I/Os per query, where Z is the number of points reported. The static structure is so simple that it may be useful in practice on its own.

Both the three-sided structure developed by Arge et al. [50] and the structure for two-sided diagonal queries discussed in Section 12.3 satisfy Criteria 1–3 of Chapter 12. So in a sense, the three-sided query structure subsumes the diagonal two-sided structure, since three-sided queries are more general. However, the diagonal two-sided structure may prove to be faster in practice, because in each of its corner substructures, the data accessed during a query are always in contiguous blocks, whereas the static substructures used in three-sided search do not guarantee block contiguity.

On a historical note, earlier work on two-sided and three-sided queries was done by Ramaswamy and Subramanian [289] using the notion of path caching; their structure met Criterion 1 but had higher storage overheads and amortized and/or nonoptimal update bounds. Subramanian and Ramaswamy [323] subsequently developed the p-range tree data structure for three-sided queries, with optimal linear disk space and nearly optimal query and amortized update bounds.
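The reporting and child-visit rules at a single node can be sketched as a plain in-memory simulation (the data layout is our own: each child is a half-open slab [lo, hi) whose listed Y-set stands in for that child's portion of v's child cache):

```python
def query_node(children, x1, x2, y1):
    """Answer [x1, x2] x [y1, +inf) at one node: report cache hits and
    decide which children to recurse into (rules 1 and 2 of the text)."""
    reported, visit = [], []
    for lo, hi, y_set in children:
        if hi <= x1 or lo > x2:
            continue                                   # slab misses [x1, x2]
        hits = [(x, y) for x, y in y_set if x1 <= x <= x2 and y >= y1]
        reported.extend(hits)
        rule1 = lo <= x1 < hi or lo <= x2 < hi         # boundary search path
        rule2 = bool(y_set) and len(hits) == len(y_set)  # entire Y-set reported
        if rule1 or rule2:
            visit.append((lo, hi))
    return reported, visit
```

In the example below, the middle child is pruned by rule 2: only part of its Y-set satisfies the query, so its subtree cannot contribute.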
12.5 General Orthogonal 2-D Range Search

The dynamic data structure for three-sided range searching can be generalized using the filtering technique of Chazelle [98] to handle general four-sided queries with optimal I/O query bound O(log_B N + z) and optimal disk space usage O(n (log n)/log(log_B N + 1)) [50]. The update bound becomes O((log_B N)(log n)/log(log_B N + 1)), which may not be optimal.

The outer level of the structure is a balanced (log_B N + 1)-way 1-D search tree with Θ(n) leaves, oriented, say, along the x-dimension. It therefore has about (log n)/log(log_B N + 1) levels. At each level of the tree, each point of the problem instance is stored in four substructures (described below) that are associated with the particular tree node at that level that spans the x-value of the point. The space and update bounds quoted above follow from the fact that the substructures use linear space and can be updated in O(log_B N) I/Os.

To search for the points in a four-sided query rectangle [x1, x2] × [y1, y2], we decompose the four-sided query in the following natural way into two three-sided queries, a stabbing query, and log_B N − 1 list traversals: We find the lowest node v in the tree whose x-range contains [x1, x2]. If v is a leaf, we can answer the query in a single I/O. Otherwise we query the substructures stored in those children of v whose x-ranges intersect [x1, x2]. Let 2 ≤ k ≤ log_B N + 1 be the number of such children. The range query, when restricted to the leftmost such child of v, is a three-sided query of the form [x1, +∞) × [y1, y2], and when restricted to the rightmost such child of v, the range query is a three-sided query of the form (−∞, x2] × [y1, y2]. Two of the substructures at each node are devoted to three-sided queries of these types; using the linear-sized data structures of Arge et al. [50] in Section 12.4, each such query can be done in O(log_B N + z) I/Os.
For the k − 2 intermediate children of v, their x-ranges are completely contained inside the x-range of the query rectangle, and thus we need only do k − 2 list traversals in y-order and retrieve the points whose y-values are in the range [y1, y2]. If we store the points in each node in y-order (in the third type of substructure), the Z output points from a node can be found in O(Z/B) I/Os, once a starting point in the linear list is found. We can find all k − 2 starting points via a single query to a stabbing query substructure S associated with v. (This structure is the fourth type of substructure.) For each two y-consecutive points (a_i, b_i) and (a_{i+1}, b_{i+1}) associated with a child of v, we store the y-interval [b_i, b_{i+1}] in S. Note that S contains intervals contributed by each of the log_B N + 1 children of v. By a single stabbing query with query value y1, we can thus identify the k − 2 starting points in only O(log_B N) I/Os [58], as described in Section 12.3. (We actually get starting points for all the children of v, not just the k − 2 ones of interest, but we can discard the starting points that we do not need.) The total number of I/Os to answer the range query is thus O(log_B N + z), which is optimal.

Atallah and Prabhakar [62] and Bhatia et al. [86] consider the problem of how to tile a multidimensional array of blocks onto parallel disks so that range queries on the array can be answered in near-optimal time.

12.6 Other Types of Range Search

Govindarajan et al. [181] develop data structures for 2-D range-count and range-sum queries. For other types of range searching, such as in higher dimensions and for nonorthogonal queries, different filtering techniques are needed. So far, relatively little work has been done, and many open problems remain.

Vengroff and Vitter [336] develop the first theoretically near-optimal EM data structure for static three-dimensional orthogonal range searching.
They create a hierarchical partitioning in which all the points that dominate a query point are densely contained in a set of blocks. Compression techniques are needed to minimize disk storage. By using (B log n)-approximate boundaries rather than B-approximate boundaries [340], queries can be done in O(log_B N + z) I/Os, which is optimal, and the space usage is O(n (log n)^(k+1) / (log(log_B N + 1))^k) disk blocks to support (3 + k)-sided 3-D range queries, in which k of the dimensions (0 ≤ k ≤ 3) have finite ranges. The result also provides optimal O(log N + Z)-time query performance for three-sided 3-D queries in the (internal memory) RAM model, but using O(N log N) space.

By the reduction in [100], a data structure for three-sided 3-D queries also applies to 2-D homothetic range search, in which the queries correspond to scaled and translated (but not rotated) transformations of an arbitrary fixed polygon. An interesting special case is "fat" orthogonal 2-D range search, where the query rectangles are required to have bounded aspect ratio (i.e., when the ratio of the longest side length to the shortest side length of the query rectangle is bounded). For example, every rectangle with bounded aspect ratio can be covered by a constant number of overlapping squares. An interesting open problem is to develop linear-sized optimal data structures for fat orthogonal 2-D range search. By the reduction, one possible approach would be to develop optimal linear-sized data structures for three-sided 3-D range search.

Agarwal et al. [9] consider halfspace range searching, in which a query is specified by a hyperplane and a bit indicating one of its two sides, and the answer to the query consists of all the points on that side of the hyperplane. They give various data structures for halfspace range searching in two, three, and higher dimensions, including one that works for simplex (polygon) queries in two dimensions, but with a higher query I/O cost.
They have subsequently improved the storage bounds for halfspace range queries in two dimensions to obtain an optimal static data structure satisfying Criteria 1 and 2 of Chapter 12.

The number of I/Os needed to build the data structures for 3-D orthogonal range search and halfspace range search is rather large (more than Ω(N)). Still, the structures shed useful light on the complexity of range searching and may open the way to improved solutions. An open problem is to design efficient construction and update algorithms and to improve upon the constant factors.

Cheng et al. [102, 103] develop efficient indexes for range queries and join queries over data with uncertainty, in which each data point has an estimated distribution of possible locations. Chien et al. [107] derive a duality relation that links the number of I/O steps and the space bound for range searching to the corresponding measures for text indexing.

Callahan et al. [94] develop dynamic EM data structures for several online problems in d dimensions. For any fixed ε > 0, they can find an approximately nearest neighbor of a query point (within a 1 + ε factor of optimum) in O(log_B N) I/Os; insertions and deletions can also be done in O(log_B N) I/Os. They use a related approach to maintain the closest pair of points; each update costs O(log_B N) I/Os. Govindarajan et al. [182] achieve the same bounds for closest pair by maintaining a well-separated pair decomposition. For finding nearest neighbors and approximate nearest neighbors, two other approaches are partition trees [8, 9] and locality-sensitive hashing [176]. Planar point location is studied in [37, 332], and the dual problem of planar point enclosure is studied in [51]. Numerous other data structures have been developed for range queries and related problems on spatial data. We refer to [18, 31, 170, 270] for a broad survey.
12.7 Lower Bounds for Orthogonal Range Search

We mentioned in the introduction to the chapter that Subramanian and Ramaswamy [323] prove that no EM data structure for 2-D range searching can achieve design Criterion 1 using less than O(n (log n)/log(log_B N + 1)) disk blocks, even if we relax the criterion to allow O((log_B N)^c + z) I/Os per query, for any constant c. The result holds for an EM version of the pointer machine model, based upon the approach of Chazelle [99] for the (internal memory) RAM model.

Hellerstein et al. [193] consider a generalization of the layout-based lower bound argument of Kanellakis et al. [210] for studying the tradeoff between disk space usage and query performance. They develop a model for indexability, in which an "efficient" data structure is expected to contain the Z points in the answer to a query compactly within O(Z/B) = O(z) blocks. One shortcoming of the model is that it considers only data layout and ignores the search component of queries, and thus it rules out the important filtering paradigm discussed earlier in Chapter 12. For example, it is reasonable for any query algorithm to perform at least log_B N I/Os, so if the answer size Z is at most B, an algorithm may still be able to satisfy Criterion 1 even if the items in the answer are contained within O(log_B N) blocks rather than O(z) = O(1) blocks. Arge et al. [50] modify the model to rederive the same nonlinear-space lower bound O(n (log n)/log(log_B N + 1)) of Subramanian and Ramaswamy [323] for 2-D range searching by considering only answer sizes Z larger than (log_B N)^c B, for which the number of blocks allowed to hold the items in the answer is Z/B = O((log_B N)^c + z). This approach ignores the complexity of how to find the relevant blocks, but as mentioned in Section 12.5 the authors separately provide an optimal 2-D range search data structure that uses the same amount of disk space and does queries in the optimal O(log_B N + z) I/Os.
Thus, despite its shortcomings, the indexability model is elegant and can provide much insight into the complexity of blocking data in external memory. Further results in this model appear in [224, 301].

One intuition from the indexability model is that less disk space is needed to efficiently answer 2-D queries when the queries have bounded aspect ratio. An interesting question is whether R-trees and the linear-space structures of Sections 12.1 and 12.2 can be shown to perform provably well for such queries. Another interesting scenario is where the queries correspond to snapshots of the continuous movement of a sliding rectangle.

When the data structure is restricted to contain only a single copy of each point, Kanth and Singh [211] show for a restricted class of index-based trees that d-dimensional range queries in the worst case require Ω(n^(1−1/d) + z) I/Os, and they provide a data structure with a matching bound. Another approach to achieve the same bound is the cross tree data structure [186] mentioned in Section 12.1, which in addition supports the operations of cut and concatenate.

13 Dynamic and Kinetic Data Structures

In this chapter, we consider two scenarios where data items change: dynamic (in which items are inserted and deleted) and kinetic (in which the data items move continuously along specified trajectories). In both cases, queries can be done at any time. It is often useful for kinetic data structures to allow insertions and deletions; for example, if the trajectory of an item changes, we must delete the old trajectory and insert the new one.

13.1 Dynamic Methods for Decomposable Search Problems

In Chapters 10–12, we have already encountered several dynamic data structures for the problems of dictionary lookup and range search. In Chapter 12, we saw how to develop optimal EM range search data structures by externalizing some known internal memory data structures.
The key idea was to use the bootstrapping paradigm, together with weight-balanced B-trees as the underlying data structure, in order to consolidate several static data structures for small instances of range searching into one dynamic data structure for the full problem. The bootstrapping technique is specific to the particular data structures involved. In this section, we look at a technique that is based upon the properties of the problem itself rather than upon those of the data structure.

We call a problem decomposable if we can answer a query by querying individual subsets of the problem data and then computing the final result from the solutions to each subset. Dictionary search and range searching are obvious examples of decomposable problems. Bentley developed the logarithmic method [83, 278] to convert efficient static data structures for decomposable problems into general dynamic ones. In the internal memory setting, the logarithmic method consists of maintaining a series of static substructures, at most one each of sizes 1, 2, 4, 8, . . . . When a new item is inserted, it is initialized in a substructure of size 1. If a substructure of size 1 already exists, the two substructures are combined into a single substructure of size 2. If there is already a substructure of size 2, they in turn are combined into a single substructure of size 4, and so on. For the current value of N, it is easy to see that the kth substructure (i.e., of size 2^k) is present exactly when the kth bit in the binary representation of N is 1. Since there are at most log N substructures, the search time bound is log N times the search time per substructure. As the number of items increases from 1 to N, the kth structure is built a total of N/2^k times (assuming N is a power of 2). If it can be built in O(2^k) time, the total time for all insertions and all substructures is thus O(N log N), making the amortized insertion time O(log N).
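In internal memory the logarithmic method is short enough to sketch directly. In this illustrative fragment (class and method names are ours), sorted Python lists stand in for the static substructures, and an insertion cascades merges exactly like a binary counter:

```python
import bisect

class LogarithmicMethod:
    """Bentley's logarithmic method (sketch): static substructures of
    distinct sizes 1, 2, 4, ...; a membership query probes each one."""

    def __init__(self):
        self.subs = []                         # sorted lists, distinct sizes

    def insert(self, item):
        carry = [item]
        while True:
            same = next((s for s in self.subs if len(s) == len(carry)), None)
            if same is None:
                break
            self.subs.remove(same)
            carry = sorted(carry + same)       # "rebuild" = merge two halves
        self.subs.append(carry)

    def member(self, item):
        for s in self.subs:                    # at most log N substructures
            i = bisect.bisect_left(s, item)
            if i < len(s) and s[i] == item:
                return True
        return False
```

After N insertions the substructure sizes mirror the binary representation of N, which is the counting argument behind the O(log N) amortized insertion bound.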
If we use up to three substructures of size 2^k at a time, we can do the reconstructions in advance and convert the amortized update bounds to worst-case [278].

In the EM setting, in order to eliminate the dependence upon the binary logarithm in the I/O bounds, the number of substructures must be reduced from log N to log_B N, and thus the maximum size of the kth substructure must be increased from 2^k to B^k. As the number of items increases from 1 to N, the kth substructure has to be built NB/B^k times (when N is a power of B), each time taking O(B^k (log_B N)/B) I/Os. The key point is that the extra factor of B in the numerator of the first term is cancelled by the factor of B in the denominator of the second term, and thus the resulting total insertion time over all N insertions and all log_B N structures is O(N (log_B N)^2) I/Os, which is O((log_B N)^2) I/Os amortized per insertion. By global rebuilding we can do deletions in O(log_B N) I/Os amortized. As in the internal memory case, the amortized updates can typically be made worst-case.

Arge and Vahrenhold [56] obtain I/O bounds for dynamic point location in general planar subdivisions similar to those of [6], but without use of level-balanced trees. Their method uses a weight-balanced base structure at the outer level and a multislab structure for storing segments similar to that of Arge and Vitter [58] described in Section 12.3. They use the logarithmic method to construct a data structure to answer vertical ray shooting queries in the multislab structures. The method relies upon a total ordering of the segments, but such an ordering can be changed drastically by a single insertion. However, each substructure in the logarithmic method is static (until it is combined with another substructure), and thus a static total ordering can be used for each substructure.
The authors show by a type of fractional cascading that the queries in the log_B N substructures do not have to be done independently, which saves a factor of log_B N in the I/O cost, but at the cost of making the data structures amortized instead of worst-case. Agarwal et al. [11] apply the logarithmic method (in both the binary form and B-way variant) to get EM versions of kd-trees, BBD trees, and BAR trees.

13.2 Continuously Moving Items

Early work on temporal data generally concentrated on time-series or multiversion data [298]. A question of growing interest in this mobile age is how to store and index continuously moving items, such as mobile telephones, cars, and airplanes (e.g., see [283, 297, 356]).

There are two main approaches to storing moving items: The first technique is to use the same sort of data structure as for nonmoving data, but to update it whenever items move sufficiently far so as to trigger important combinatorial events that are relevant to the application at hand [73]. For example, an event relevant for range search might be triggered when two items move to the same horizontal displacement (which happens when the x-ordering of the two items is about to switch). A different approach is to store each item's location and speed trajectory, so that no updating is needed as long as the item's trajectory plan does not change. Such an approach requires fewer updates, but the representation for each item generally has higher dimension, and the search strategies are therefore less efficient.

Kollios et al. [223] developed a linear-space indexing scheme for moving points along a (one-dimensional) line, based upon the notion of partition trees. Their structure supports a variety of range search and approximate nearest neighbor queries. For example, given a range and time, the points in that range at the indicated time can be retrieved in O(n^(1/2+ε) + k) I/Os, for arbitrarily small ε > 0.
Updates require O((log n)^2) I/Os. Agarwal et al. [8] extend the approach to handle range searches in two dimensions, and they improve the update bound to O((log_B n)^2) I/Os. They also propose an event-driven data structure with the same query times as the range search data structure of Arge et al. [50] discussed in Section 12.5, but with the potential need to do many updates. A hybrid data structure combining the two approaches permits a tradeoff between query performance and update frequency.

R-trees offer a practical generic mechanism for storing multidimensional points and are thus a natural alternative for storing mobile items. One approach is to represent time as a separate dimension and to cluster trajectories using the R-tree heuristics. However, the orthogonal nature of the R-tree does not lend itself well to diagonal trajectories. For the case of points moving along linear trajectories, Šaltenis et al. [297] build an R-tree (called the TPR-tree) upon only the spatial dimensions, but parameterize the bounding box coordinates to account for the movement of the items stored within. They maintain an outer approximation of the true bounding box, which they periodically update to refine the approximation. Agarwal and Har-Peled [19] show how to maintain a provably good approximation of the minimum bounding box using only a constant number of refinement events. Agarwal et al. [12] develop persistent data structures where query time degrades in proportion to how far the time frame of the query is from the current time. Xia et al. [359] develop change-tolerant versions of R-trees with fast update capabilities in practice.

14 String Processing

In this chapter, we survey methods used to process strings in external memory, such as inverted files, search trees, suffix trees, suffix arrays, and sorting, with particular attention to more recent developments.
For the case of strings, we make the following modifications to our standard notation:

K = number of strings;
N = total length of all strings (in units of characters);
M = internal memory size (in units of characters);
B = block transfer size (in units of characters);

where M < N.

For h > 0, we define the hth-order empirical entropy H_h(T) as

H_h(T) = Σ_{x ∈ Σ^h} Σ_{y ∈ Σ} Prob[y, x] · log( 1 / Prob[y | x] ),    (15.3)

where Prob[y, x] represents the empirical joint probability that a random sequence of h + 1 characters from the text consists of the h-character sequence x followed by the character y, and Prob[y | x] represents the empirical conditional probability that the character y occurs next, given that the preceding h characters are x. The expression (15.3) is similar to expression (15.2) for the 0th-order entropy, except that we partition the probability space in (15.3) according to context, to capture statistically significant patterns from the text. We always have H_h(T) ≤ H_0(T). To continue the English text example, with context length h = 1, the letter "h" would be encoded in very few bits for context "t," since "h" often follows "t." On the other hand, "h" would be encoded in a large number of bits in context "b," since "h" rarely follows a "b."

15.2 External Memory Compressed Data Structures

Some of the basic primitives we might want to perform on a set S of N items from universe U are the following dictionary operations:

• Member(P): determine whether P occurs in S;
• Rank(P): count the number of items s in S such that s ≤ P;
• Select(i): return item s_i, the ith smallest item in S.

Raman et al. [288] have shown for the (internal) RAM model how to represent S in log(|U| choose N) + O(|U| log log|U| / log|U|) bits so that each of the three primitive operations can be performed in constant time. When storing a set S of K variable-length text strings containing a total of N characters, we use the lexicographic ordering to compare strings.
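For a sorted in-memory set, the three primitives reduce to binary search. The sketch below (bisect-based, with our own function names) fixes their semantics before the compressed representations are discussed:

```python
import bisect

def member(sorted_items, p):
    """Member(P): does P occur in S?"""
    i = bisect.bisect_left(sorted_items, p)
    return i < len(sorted_items) and sorted_items[i] == p

def rank(sorted_items, p):
    """Rank(P): the number of items s in S with s <= P."""
    return bisect.bisect_right(sorted_items, p)

def select(sorted_items, i):
    """Select(i): the ith smallest item in S (1-indexed)."""
    return sorted_items[i - 1]
```

The succinct structure of Raman et al. achieves the same constant-time semantics without storing the items explicitly; this list-based version only pins down what the operations return.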
We may want to support an additional primitive:

• Prefix Range(P): return all strings in S that have P as a prefix.

To handle this scenario, Ferragina et al. [159] discuss some edge linearizations of the classic trie dictionary data structure that are simultaneously cache-friendly and compressed. The front-coding scheme is one example of linearization; it is at the core of prefix B-trees [75] and many other disk-conscious compressed indexes for string collections. However, it is largely thought of as a space-effective heuristic without efficient search support. Ferragina et al. introduce new insights on front-coding and other novel linearizations, and they study how close their space occupancy is to the information-theoretic minimum L, which is given by the minimum encoding needed for a labeled binary cardinal tree. They adapt these linearizations along with an EM version of centroid decomposition to design a novel cache-oblivious (and therefore EM) dictionary that offers competitive I/O search time and achieves nearly optimal space in the information-theoretic sense. In particular, the data structures in [159] have the following properties:

• Space usage of (1 + o(1))L + 4K bits in blocked format; query bounds of O(|P|/B + log K) I/Os for the primitives Member(P) and Rank(P), O(|s_i|/B + log K) I/Os for Select(i), and O(|P|/B + log K + Z/B) I/Os for Prefix Range(P);
• Space usage of (1 + ε)L + O(K) bits in blocked format; query bounds of O((|P| + |succ(P)|)/εB + log_B K) I/Os for Member(P) and Rank(P), O(|s_i|/εB + log_B K) I/Os for Select(i), and O((|P| + |succ(P)|)/εB + log_B K + Z/B) I/Os for Prefix Range(P),

where Z is the number of occurrences, succ(P) denotes the successor of P in S, and 0 < ε < 1 is a user-defined parameter. The data structures can also be tuned optimally to handle any particular distribution of queries. Eltabakh et al.
[146] consider sets of variable-length strings encoded using run-length coding in order to exploit space savings when there are repeated characters. They adapt string B-trees [157, 158] (see Section 14.2) with the EM priority search data structure for three-sided range searching [50] (see Section 12.4) to develop a dynamic compressed data structure for the run-length encoded data. The data structure supports substring matching, prefix matching, and range search queries. The data structure uses O(N̂/B) blocks of space, where N̂ is the total length of the compressed strings. Insertions and deletions of t run-length encoded suffixes take O(t log_B(N̂ + t)) I/Os. Queries for a pattern P can be performed in O(log_B N̂ + (|P̂| + Z)/B) I/Os, where |P̂| is the size of the search pattern in run-length encoded format.

One of the major advances in text indexing in the last decade has been the development of entropy-compressed data structures. Until fairly recently, the best general-purpose data structures for pattern matching queries were the suffix tree and its more space-efficient version, the suffix array, which we studied in Section 14.3. However, the suffix array requires several times more space than the text being indexed. The basic reason is that suffix arrays store an array of pointers, each requiring at least log N bits, whereas the original text being indexed consists of N characters, each of size log |Σ| bits. For a terabyte of ASCII text (i.e., N = 2^40), each text character occupies 8 bits. The suffix array, on the other hand, consists of N pointers to the text, each pointer requiring log N = 40 bits, which is five times larger.¹

For reasonable values of h, the compressed suffix array of Grossi et al. [185] requires an amount of space in bits per character only as large as the hth-order entropy H_h(T) of the original text, plus lower-order terms.
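The pointer-versus-character space imbalance is easy to see in code. The following Python sketch (naive quadratic construction, hypothetical function names) builds a plain suffix array and answers pattern queries by binary search; each entry of the array is an integer of about log N bits, while each text character needs only log |Σ| bits:

```python
def suffix_array(text):
    """Naive construction: sort all suffix start positions
    lexicographically.  This is the array of N pointers discussed in
    the text; practical construction algorithms are far faster."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pat):
    """All positions where pat occurs, via binary search on sa: the
    matching suffixes form a contiguous interval of the sorted order."""
    def prefix(i):
        return text[sa[i]:sa[i] + len(pat)]
    lo, hi = 0, len(sa)
    while lo < hi:                # leftmost suffix with prefix >= pat
        mid = (lo + hi) // 2
        if prefix(mid) < pat:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                # leftmost suffix with prefix > pat
        mid = (lo + hi) // 2
        if prefix(mid) <= pat:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

sa = suffix_array("banana")
print(occurrences("banana", sa, "ana"))
```

Compressed suffix arrays and the Burrows–Wheeler-based indexes discussed next eliminate precisely this pointer overhead while still supporting such queries.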
In addition, the compressed suffix array is self-indexing, in that it encodes the original text and provides random access to the characters of the original text. Therefore, the original text does not need to be stored, and we can delete it. The net result is a big improvement over conventional suffix arrays: Rather than having to store both the original text and a suffix array that is several times larger, we can instead get fast lookup using only a compressed suffix array, whose size is just a fraction of the original text size. Similar results are obtained by Ferragina and Manzini [162] and Ferragina et al. [163] using the Burrows–Wheeler transform. In fact, the two transforms of [185] and [162] are in a sense inverses of one another. A more detailed survey of compressed text indexes in the internal memory setting appears in [267].

¹ Imagine going to a library and finding that the card catalog is stored in an annex that is five times larger than the main library! Suffix arrays support general pattern matching, which card catalogs do not, but it is still counterintuitive for an index to require so much more space than the text it is indexing.

In the external memory setting, however, the approaches of [162, 185] have the disadvantage that the algorithms exhibit little locality and thus do not achieve the desired I/O bounds. Chien et al. [107] introduce a new variant of the Burrows–Wheeler transform called the Geometric Burrows–Wheeler Transform (GBWT). Unlike the BWT, which merely permutes the text, the GBWT converts the text into a set of points in two-dimensional geometry. There is a corresponding equivalence between text pattern matching and geometric range searching. Using this transform, we can answer several open questions about compressed text indexing:

(1) Can compressed data structures be designed in external memory with similar I/O performance as their uncompressed counterparts?
(2) Can compressed data structures be designed for position-restricted pattern matching, in which the answers to a query must be located in a specified range in the text?

The data structure of Chien et al. [107] has size O(N log |Σ|) bits and can be stored in fully blocked format; pattern matching queries for a pattern P can be done in O(|P|/B + (log_|Σ| N)(log_B N) + Z log_B N) I/Os, where Z is the number of occurrences.

The size of the Chien et al. data structure [107] is on the order of the size of the problem instance. An important open problem is to design a pattern matching data structure whose size corresponds to a higher-order compression of the original text, as exists for the (internal memory) RAM model. Another open problem is to reduce the second-order terms in the I/O bound.

Arroyuelo and Navarro [61] propose an EM version of a self-index based upon the Lempel–Ziv compression method [266, 365]. It uses 8N H_h(T) + o(N log |Σ|) bits of space in blocked format, but does not provide provably good I/O bounds. In practice, it uses about double the space occupied by the original text and has reasonable I/O performance. Mäkinen et al. [245] achieve space bounds of O(N(H_0 + 1)) bits and O(N H_h (log |Σ| + log log N + 1)) bits in blocked format and a pattern matching query bound of O(|P| log_B N + Z) I/Os. The index of González and Navarro [178] uses O(R (log N) log(1 + N/R)) bits in blocked format, where R ≤ min{N, N H_h + |Σ|^h}, and does pattern matching queries in O(|P| + Z/B) I/Os.

Compressed data structures have also been explored for graphs, but in a less formal sense. The object is to represent the graph succinctly and still provide fast support for the basic primitives, such as accessing the vertices and edges of the graph. For example, consider a large graph that represents a subset of the World Wide Web, in which each web page corresponds to a vertex and each link from one web page to another corresponds to an edge.
Such graphs can often be compressed by a factor of 3–5 by exploiting certain characteristics. For example, the indegrees and outdegrees of the vertices tend to be distributed according to a power law; the probability that a web page has i links is proportional to 1/i^θ, where θ ≈ 2.1 for incoming links and θ ≈ 2.72 for outgoing links. Most of the links from a web page tend to point to another page on the same site with a nearby address, and thus gap methods can give a compressed representation. Adjacent web pages often share the same outgoing links, and thus the adjacency lists can be compressed relative to one another. We refer the reader to [112] for a survey of graph models and initial work in the EM setting.

16 Dynamic Memory Allocation

The amount of internal memory allocated to a program may fluctuate during the course of execution because of demands placed on the system by other users and processes. EM algorithms must be able to adapt dynamically to whatever resources are available so as to preserve good performance [279]. The algorithms in the previous chapters assume a fixed memory allocation; they must resort to virtual memory if the memory allocation is reduced, often causing a severe degradation in performance.

Barve and Vitter [71] discuss the design and analysis of EM algorithms that adapt gracefully to changing memory allocations. In their model, without loss of generality, an algorithm (or program) P is allocated internal memory in phases: During the ith phase, P is allocated m_i blocks of internal memory, and this memory remains allocated to P until P completes 2m_i I/O operations, at which point the next phase begins. The process continues until P finishes execution. The model makes the reasonable assumption that the duration of each memory allocation phase is long enough to allow all the memory in that phase to be used by the algorithm.
For sorting, the lower bound approach of Theorem 6.1 implies that

Σ_i 2m_i log m_i = Ω(n log n).   (16.1)

We say that P is dynamically optimal for sorting if

Σ_i 2m_i log m_i = O(n log n)   (16.2)

for all possible sequences m_1, m_2, . . . of memory allocations. Intuitively, if P is dynamically optimal, no other algorithm can perform more than a constant number of sorts in the worst case for the same sequence of memory allocations.

Barve and Vitter [71] define the model in generality and give dynamically optimal strategies for sorting, matrix multiplication, and buffer tree operations. Their work represents the first theoretical model of dynamic allocation and the first algorithms that can be considered dynamically optimal. Previous work was done on memory-adaptive algorithms for merge sort [279, 363] and hash join [280], but those algorithms handle only special cases and can be made to perform nonoptimally for certain patterns of memory allocation.

17 External Memory Programming Environments

There are three basic approaches to supporting the development of I/O-efficient code, which we call access-oriented, array-oriented, and framework-oriented.

Access-oriented systems preserve the programmer abstraction of explicitly requesting data transfer. They often extend the I/O interface to include data type specifications and collective specification of multiple transfers, sometimes involving the memories of multiple processing nodes. Examples of access-oriented systems include the UNIX file system at the lowest level, higher-level parallel file systems such as Whiptail [316], Vesta [116], PIOUS [262], and the High Performance Storage System [351], and the I/O libraries MPI-IO [115] and LEDA-SM [124].

Array-oriented systems access data stored in external memory primarily by means of compiler-recognized data types (typically arrays) and operations on those data types.
The external computation is directly specified via iterative loops or explicitly data-parallel operations, and the system manages the explicit I/O transfers. Array-oriented systems are effective for scientific computations that make regular strides through arrays of data and can deliver high-performance parallel I/O in applications such as computational fluid dynamics, molecular dynamics, and weapon system design and simulation. Array-oriented systems are generally ill-suited to irregular or combinatorial computations. Examples of array-oriented systems include PASSION [326], Panda [308] (which also has aspects of access orientation), PI/OT [281], and ViC* [113].

Instead of viewing batched computation as an enterprise in which code reads data, operates on it, and writes results, a framework-oriented system views computation as a continuous process during which a program is fed streams of data from an outside source and leaves trails of results behind. TPIE (Transparent Parallel I/O Environment) [41, 49, 330, 337] provides a framework-oriented interface for batched computation, as well as an access-oriented interface for online computation. For batched computations, TPIE programmers do not need to worry about making explicit calls to I/O routines. Instead, they merely specify the functional details of the desired computation, and TPIE automatically choreographs a sequence of data movements to feed the computation.¹

TPIE [41, 49, 330, 337], which serves as the implementation platform for the experiments described in Sections 8.1 and 12.2, as well as in several of the references, is a comprehensive set of C++ templated classes and functions for EM paradigms, together with a run-time library. Its goal is to help programmers develop high-level, portable, and efficient implementations of EM algorithms.
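The framework-oriented style of programming can be illustrated with a toy stream pipeline. The sketch below is in Python and uses hypothetical stage names; it conveys only the idea that the programmer supplies per-item functional details while the framework moves the data, and it is not TPIE's actual C++ interface:

```python
import heapq

def scan(items):
    """Source stage: feed a stream of items into the pipeline."""
    yield from items

def transform(stream, f):
    """Per-item stage: the programmer supplies only f; the framework
    decides how the items are buffered and moved."""
    for x in stream:
        yield f(x)

def merge(*streams):
    """Merge several sorted input streams into one sorted stream."""
    yield from heapq.merge(*streams)

# The "program" is just a composition of stages; no explicit I/O calls.
evens = transform(scan(range(5)), lambda x: 2 * x)
odds = transform(scan(range(5)), lambda x: 2 * x + 1)
result = list(merge(evens, odds))
```

Because each stage is a lazy generator, items flow through the pipeline one at a time; in a real framework the same structure is driven by block-sized buffers and asynchronous disk I/O rather than in-memory iterators.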
It consists of three main components: a block transfer engine (BTE), a memory manager (MM), and an access method interface (AMI). The BTE is responsible for moving blocks of data to and from the disk. It is also responsible for scheduling asynchronous “read-ahead” and “write-behind” when necessary to allow computation and I/O to overlap. The MM is responsible for managing internal memory in coordination with the AMI and BTE. The AMI provides the high-level uniform interface for application programs. The AMI is the only component that programmers normally need to interact with directly. Applications that use the AMI are portable across hardware platforms, since they do not have to deal with the underlying details of how I/O is performed on a particular machine.

¹ The TPIE software distribution is available free of charge on the World Wide Web at http://www.cs.duke.edu/TPIE/.

We have seen in the previous chapters that many batched problems in spatial databases, GIS, scientific computing, graphs, and string processing can be solved optimally using a relatively small number of basic paradigms, such as scanning (or streaming), multiway distribution, and merging, which TPIE supports as access mechanisms. Batched programs in TPIE thus consist primarily of a call to one or more of these standard access mechanisms. For example, a distribution sort can be programmed by using the access mechanism for multiway distribution. The programmer has to specify the details as to how the partitioning elements are formed and how the buckets are defined. Then the multiway distribution is invoked, during which TPIE automatically forms the buckets and outputs them to disk using double buffering. For online data structures such as hashing, B-trees, and R-trees, TPIE supports more traditional block access, like the access-oriented systems.
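The distribution-sort recipe just described can be sketched as follows. This is an in-memory Python simulation with hypothetical names: memory_capacity and fanout stand in for choices that in a real EM setting depend on M and B, and the recursion plays the role of writing buckets to disk and processing them in later passes:

```python
import random

def distribution_sort(data, memory_capacity, fanout):
    """Sketch of multiway-distribution sort: choose partitioning
    elements, distribute the input into buckets in one pass, and
    recurse on each bucket until it fits in "memory".  Assumes the
    keys are distinct; a real implementation would stream buckets to
    disk with double buffering rather than hold Python lists."""
    if len(data) <= memory_capacity:
        return sorted(data)                       # fits: sort internally
    pivots = sorted(random.sample(data, fanout - 1))  # partitioning elements
    buckets = [[] for _ in range(fanout)]
    for x in data:                                # one distribution pass
        i = sum(1 for p in pivots if p < x)       # bucket index (linear
        buckets[i].append(x)                      # scan, for clarity)
    out = []
    for b in buckets:
        out.extend(distribution_sort(b, memory_capacity, fanout))
    return out
```

The two details the programmer supplies in the framework setting correspond exactly to the two hand-written pieces here: how the partitioning elements (pivots) are formed, and how each item is assigned to a bucket.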
STXXL (Standard Template Library for XXL Data Sets) [135] supports all three basic approaches: access-oriented via a block management layer, array-oriented via the vector class, and framework-oriented via pipelining and iteration. STXXL’s support for pipelined scanning and sorting, in which the output of one computation is fed directly into the input of a subsequent computation, can save a factor of about 2 in the number of I/Os for some batched problems. STXXL also supports a library of standard data structures, such as stacks, queues, search trees, and priority queues.

The FG programming environment [117, 127] combines elements of both access-oriented and framework-oriented systems. The programmer creates software pipelines to mitigate data accesses that exhibit high latency, such as disk accesses or interprocessor communication. Buffers traverse each pipeline; each pipeline stage repeatedly accepts a buffer from its predecessor stage, performs some action on the buffer, and conveys the buffer to its successor stage. FG maps each stage to its own thread, so that multiple stages can overlap. Programmers can implement many batched EM algorithms efficiently — in terms of both I/O complexity and programmer effort — by structuring each pipeline to implement a single pass of a PDM algorithm, matching the buffer size to the block size, and running a copy of the pipeline on each node of a cluster.

Google’s MapReduce [130] is a framework-oriented system that supports a simple functional style of batched programming. The input data are assumed to be in the form of a list of key-value pairs. The programmer specifies a Map function and a Reduce function. The system applies Map to each key-value pair, which produces a set of intermediate key-value pairs.
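The Map/Reduce dataflow can be modeled in a few lines. The following is a single-machine Python sketch with hypothetical names, not Google's API; the real system performs the same grouping, routing, and scheduling across a cluster:

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Single-machine model of the MapReduce dataflow: apply Map to
    every input pair, group the intermediate pairs by key, then apply
    Reduce once per distinct key."""
    grouped = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):   # Map emits intermediate pairs
            grouped[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# Classic illustration: counting word frequencies across documents.
docs = [("d1", "to be or not to be"), ("d2", "to err is human")]
counts = map_reduce(
    docs,
    map_fn=lambda name, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
```

Here the user code is only the two lambdas; everything else is the framework's responsibility, which is what makes the style well suited to batched computation on clusters.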
For each key k, the system groups together all the intermediate key-value pairs that have the same key k and passes them to the Reduce function; Reduce merges together those key-value pairs and forms a possibly smaller set of values for key k. The system handles the details of data routing, parallel scheduling, and buffer management. This framework is useful for a variety of massive data applications on computer clusters, such as pattern matching, counting access frequencies of web pages, constructing inverted indexes, and distribution sort.

Conclusions

In this manuscript, we have described several useful paradigms for the design and implementation of efficient external memory (EM) algorithms and data structures. The problem domains we have considered include sorting, permuting, FFT, scientific computing, computational geometry, graphs, databases, geographic information systems, and text and string processing. Interesting challenges remain in virtually all these problem domains, as well as for related models such as data streaming and cache-oblivious algorithms.

One difficult problem is to prove lower bounds for permuting and sorting without the indivisibility assumption. Another promising area is the design and analysis of EM algorithms for efficient use of multiple disks. Optimal I/O bounds have not yet been determined for several basic EM graph problems, such as topological sorting, shortest paths, breadth-first search, depth-first search, and connected components. Several problems remain open in the dynamic and kinetic settings, such as range searching, ray shooting, point location, and finding nearest neighbors.

With the growing power of multicore processors, consideration of parallel CPU time will become increasingly important. There is an intriguing connection between problems that have good I/O speedups and problems that have fast and work-efficient parallel algorithms.
A continuing goal is to develop optimal EM algorithms and to translate theoretical gains into observable improvements in practice. For some of the problems that can be solved optimally up to a constant factor, the constant overhead is too large for the algorithm to be of practical use, and simpler approaches are needed. In practice, algorithms cannot assume a static internal memory allocation; they must adapt in a robust way when the memory allocation changes.

Many interesting algorithmic challenges arise from newly developed architectures. The EM concepts and techniques discussed in this book may well apply. Examples of new architectures include computer clusters, multicore processors, hierarchical storage devices, wearable computers, small mobile devices and sensors, disk drives with processing capabilities, and storage devices based upon microelectromechanical systems (MEMS). For example, active (or intelligent) disks, in which disk drives have some processing capability and can filter information sent to the host, have been proposed to further reduce the I/O bottleneck, especially in large database applications [4, 292]. MEMS-based nonvolatile storage has the potential to serve as an intermediate level in the memory hierarchy between DRAM and disks. It could ultimately provide better latency and bandwidth than disks, at less cost per bit than DRAM [307, 338].

Notations and Acronyms

Several of the external memory (EM) paradigms discussed in this manuscript are listed in Table 1.1 at the end of Chapter 1. The parameters of the parallel disk model (PDM) are defined in Section 2.1:

N = problem size (in units of data items);
M = internal memory size (in units of data items);
B = block transfer size (in units of data items);
D = number of independent disk drives;
P = number of CPUs;
Q = number of queries (for a batched problem);
Z = answer size (in units of data items).

The parameter values satisfy M < N and 1 ≤ DB ≤ M/2. We use the lowercase notation n = N/B, m = M/B, q = Q/B, and z = Z/B to denote the corresponding quantities in units of disk blocks.

f(k) = O(g(k))   f(k) is big-oh of g(k), as k → ∞: there exist constants c > 0 and K > 0 such that f(k) ≤ c g(k), for all k ≥ K.
f(k) = Ω(g(k))   f(k) is big-omega of g(k), as k → ∞: g(k) = O(f(k)).
f(k) = Θ(g(k))   f(k) is big-theta of g(k), as k → ∞: f(k) = O(g(k)) and f(k) = Ω(g(k)).
f(k) = o(g(k))   f(k) is little-oh of g(k), as k → ∞: lim_{k→∞} f(k)/g(k) = 0.
f(k) = ω(g(k))   f(k) is little-omega of g(k), as k → ∞: g(k) = o(f(k)).
⌈x⌉   ceiling of x: the smallest integer k satisfying k ≥ x.
⌊x⌋   floor of x: the largest integer k satisfying k ≤ x.
min{x, y}   minimum of x and y.
max{x, y}   maximum of x and y.
Prob{R}   probability that relation R is true.
log_b x   base-b logarithm of x; if b is not specified, we use b = 2.
ln x   natural logarithm of x: log_e x.
(k choose ℓ)   binomial coefficient: if ℓ = 0, it is 1; else, it is k(k − 1)···(k − ℓ + 1) / ℓ(ℓ − 1)···1.

References

[1] D. J. Abel, “A B+-tree structure for large quadtrees,” Computer Vision, Graphics, and Image Processing, vol. 27, pp. 19–31, July 1984.
[2] J. Abello, A. L. Buchsbaum, and J. Westbrook, “A functional approach to external graph algorithms,” Algorithmica, vol. 32, no. 3, pp. 437–458, 2002.
[3] J. Abello, P. M. Pardalos, and M. G. Resende, eds., Handbook of Massive Data Sets. Norwell, Mass.: Kluwer Academic Publishers, 2002.
[4] A. Acharya, M. Uysal, and J. Saltz, “Active disks: Programming model, algorithms and evaluation,” ACM SIGPLAN Notices, vol. 33, pp. 81–91, November 1998.
[5] M. Adler, “New coding techniques for improved bandwidth utilization,” in Proceedings of the IEEE Symposium on Foundations of Computer Science, (Burlington, VT), pp. 173–182, October 1996.
[6] P. K. Agarwal, L. Arge, G. S. Brodal, and J. S. Vitter, “I/O-efficient dynamic point location in monotone planar subdivisions,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 11–20, ACM Press, 1999.
[7] P. K. Agarwal, L. Arge, and A. Danner, “From LIDAR to GRID DEM: A scalable approach,” in Proceedings of the International Symposium on Spatial Data Handling, 2006.
[8] P. K. Agarwal, L. Arge, and J.
Erickson, “Indexing moving points,” Journal of Computer and System Sciences, vol. 66, no. 1, pp. 207–243, 2003.
[9] P. K. Agarwal, L. Arge, J. Erickson, P. G. Franciosa, and J. S. Vitter, “Efficient searching with linear constraints,” Journal of Computer and System Sciences, vol. 61, pp. 194–216, October 2000.
[10] P. K. Agarwal, L. Arge, T. M. Murali, K. Varadarajan, and J. S. Vitter, “I/O-efficient algorithms for contour line extraction and planar graph blocking,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 117–126, ACM Press, 1998.
[11] P. K. Agarwal, L. Arge, O. Procopiuc, and J. S. Vitter, “A framework for index bulk loading and dynamization,” in Proceedings of the International Colloquium on Automata, Languages, and Programming, pp. 115–127, Crete, Greece: Springer-Verlag, 2001.
[12] P. K. Agarwal, L. Arge, and J. Vahrenhold, “Time responsive external data structures for moving points,” in Proceedings of the Workshop on Algorithms and Data Structures, pp. 50–61, 2001.
[13] P. K. Agarwal, L. Arge, J. Yang, and K. Yi, “I/O-efficient structures for orthogonal range-max and stabbing-max queries,” in Proceedings of the European Symposium on Algorithms, pp. 7–18, Springer-Verlag, 2003.
[14] P. K. Agarwal, L. Arge, and K. Yi, “I/O-efficient construction of constrained Delaunay triangulations,” in Proceedings of the European Symposium on Algorithms, pp. 355–366, Springer-Verlag, 2005.
[15] P. K. Agarwal, L. Arge, and K. Yi, “An optimal dynamic interval stabbing-max data structure?,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 803–812, ACM Press, 2005.
[16] P. K. Agarwal, L. Arge, and K. Yi, “I/O-efficient batched union-find and its applications to terrain analysis,” in Proceedings of the ACM Symposium on Computational Geometry, ACM Press, 2006.
[17] P. K. Agarwal, M. de Berg, J. Gudmundsson, M. Hammar, and H. J.
Haverkort, “Box-trees and R-trees with near-optimal query time,” Discrete and Computational Geometry, vol. 28, no. 3, pp. 291–312, 2002.
[18] P. K. Agarwal and J. Erickson, “Geometric range searching and its relatives,” in Advances in Discrete and Computational Geometry, (B. Chazelle, J. E. Goodman, and R. Pollack, eds.), pp. 1–56, Providence, Rhode Island: American Mathematical Society Press, 1999.
[19] P. K. Agarwal and S. Har-Peled, “Maintaining the approximate extent measures of moving points,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 148–157, Washington, DC: ACM Press, January 2001.
[20] A. Aggarwal, B. Alpern, A. K. Chandra, and M. Snir, “A model for hierarchical memory,” in Proceedings of the ACM Symposium on Theory of Computing, pp. 305–314, New York: ACM Press, 1987.
[21] A. Aggarwal, A. Chandra, and M. Snir, “Hierarchical memory with block transfer,” in Proceedings of the IEEE Symposium on Foundations of Computer Science, pp. 204–216, Los Angeles, 1987.
[22] A. Aggarwal and C. G. Plaxton, “Optimal parallel sorting in multi-level storage,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 659–668, ACM Press, 1994.
[23] A. Aggarwal and J. S. Vitter, “The Input/Output complexity of sorting and related problems,” Communications of the ACM, vol. 31, no. 9, pp. 1116–1127, 1988.
[24] C. Aggarwal, Data Streams: Models and Algorithms. Springer-Verlag, 2007.
[25] M. Ajtai, M. Fredman, and J. Komlós, “Hash functions for priority queues,” Information and Control, vol. 63, no. 3, pp. 217–225, 1984.
[26] D. Ajwani, I. Malinger, U. Meyer, and S. Toledo, “Characterizing the performance of flash memory storage devices and its impact on algorithm design,” in Proceedings of the International Workshop on Experimental Algorithmics, (Provincetown, Mass.), pp. 208–219, Springer-Verlag, 2008.
[27] D. Ajwani, U. Meyer, and V.
Osipov, “Improved external memory BFS implementation,” in Proceedings of the Workshop on Algorithm Engineering and Experiments, (New Orleans), pp. 3–12, January 2007.
[28] S. Albers and M. Büttner, “Integrated prefetching and caching in single and parallel disk systems,” Information and Computation, vol. 198, no. 1, pp. 24–39, 2005.
[29] B. Alpern, L. Carter, E. Feig, and T. Selker, “The uniform memory hierarchy model of computation,” Algorithmica, vol. 12, no. 2–3, pp. 72–109, 1994.
[30] L. Arge, “The I/O-complexity of ordered binary-decision diagram manipulation,” in Proceedings of the International Symposium on Algorithms and Computation, pp. 82–91, Springer-Verlag, 1995.
[31] L. Arge, “External memory data structures,” in Handbook of Massive Data Sets, (J. Abello, P. M. Pardalos, and M. G. Resende, eds.), ch. 9, pp. 313–358, Norwell, Mass.: Kluwer Academic Publishers, 2002.
[32] L. Arge, “The buffer tree: A technique for designing batched external data structures,” Algorithmica, vol. 37, no. 1, pp. 1–24, 2003.
[33] L. Arge, G. Brodal, and R. Fagerberg, “Cache-oblivious data structures,” in Handbook on Data Structures and Applications, (D. Mehta and S. Sahni, eds.), CRC Press, 2005.
[34] L. Arge, G. S. Brodal, and L. Toma, “On external-memory MST, SSSP, and multi-way planar graph separation,” Journal of Algorithms, vol. 53, no. 2, pp. 186–206, 2004.
[35] L. Arge, J. S. Chase, P. Halpin, L. Toma, J. S. Vitter, D. Urban, and R. Wickremesinghe, “Efficient flow computation on massive grid datasets,” GeoInformatica, vol. 7, pp. 283–313, December 2003.
[36] L. Arge, A. Danner, H. J. Haverkort, and N. Zeh, “I/O-efficient hierarchical watershed decomposition of grid terrains models,” in Proceedings of the International Symposium on Spatial Data Handling, 2006.
[37] L. Arge, A. Danner, and S.-H. Teh, “I/O-efficient point location using persistent B-trees,” in Workshop on Algorithm Engineering and Experimentation, 2003.
[38] L. Arge, M. de Berg, H. J.
Haverkort, and K. Yi, “The priority R-tree: A practically efficient and worst-case optimal R-tree,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 347–358, ACM Press, 2004.
[39] L. Arge, P. Ferragina, R. Grossi, and J. S. Vitter, “On sorting strings in external memory,” in Proceedings of the ACM Symposium on Theory of Computing, pp. 540–548, ACM Press, 1997.
[40] L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava, “Fundamental parallel algorithms for private-cache chip multiprocessors,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, Munich: ACM Press, June 2008.
[41] L. Arge, K. H. Hinrichs, J. Vahrenhold, and J. S. Vitter, “Efficient bulk operations on dynamic R-trees,” Algorithmica, vol. 33, pp. 104–128, January 2002.
[42] L. Arge, M. Knudsen, and K. Larsen, “A general lower bound on the I/O-complexity of comparison-based algorithms,” in Proceedings of the Workshop on Algorithms and Data Structures, pp. 83–94, Springer-Verlag, 1993.
[43] L. Arge, U. Meyer, and L. Toma, “External memory algorithms for diameter and all-pairs shortest-paths on sparse graphs,” in Proceedings of the International Colloquium on Automata, Languages, and Programming, pp. 146–157, Springer-Verlag, 2004.
[44] L. Arge, U. Meyer, L. Toma, and N. Zeh, “On external-memory planar depth first search,” Journal of Graph Algorithms and Applications, vol. 7, no. 2, pp. 105–129, 2003.
[45] L. Arge and P. Miltersen, “On showing lower bounds for external-memory computational geometry problems,” in External Memory Algorithms and Visualization, (J. Abello and J. S. Vitter, eds.), pp. 139–159, Providence, Rhode Island: American Mathematical Society Press, 1999.
[46] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, J. Vahrenhold, and J. S.
Vitter, “A unified approach for indexed and non-indexed spatial joins,” in Proceedings of the International Conference on Extending Database Technology, Konstanz, Germany: ACM Press, March 2000.
[47] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter, “Scalable sweeping-based spatial join,” in Proceedings of the International Conference on Very Large Databases, pp. 570–581, New York: Morgan Kaufmann, August 1998.
[48] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter, “Theory and practice of I/O-efficient algorithms for multidimensional batched searching problems,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 685–694, ACM Press, January 1998.
[49] L. Arge, O. Procopiuc, and J. S. Vitter, “Implementing I/O-efficient data structures using TPIE,” in Proceedings of the European Symposium on Algorithms, pp. 88–100, Springer-Verlag, 2002.
[50] L. Arge, V. Samoladas, and J. S. Vitter, “Two-dimensional indexability and optimal range search indexing,” in Proceedings of the ACM Conference on Principles of Database Systems, pp. 346–357, Philadelphia: ACM Press, May–June 1999.
[51] L. Arge, V. Samoladas, and K. Yi, “Optimal external memory planar point enclosure,” in Proceedings of the European Symposium on Algorithms, pp. 40–52, Springer-Verlag, 2004.
[52] L. Arge and L. Toma, “Simplified external memory algorithms for planar DAGs,” in Proceedings of the Scandinavian Workshop on Algorithmic Theory, pp. 493–503, 2004.
[53] L. Arge and L. Toma, “External data structures for shortest path queries on planar digraphs,” in Proceedings of the International Symposium on Algorithms and Computation, pp. 328–338, Springer-Verlag, 2005.
[54] L. Arge, L. Toma, and J. S. Vitter, “I/O-efficient algorithms for problems on grid-based terrains,” in Workshop on Algorithm Engineering and Experimentation, San Francisco: Springer-Verlag, January 2000.
[55] L. Arge, L. Toma, and N.
Zeh, “I/O-efficient topological sorting of planar DAGs,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pp. 85–93, ACM Press, 2003.
[56] L. Arge and J. Vahrenhold, “I/O-efficient dynamic planar point location,” Computational Geometry, vol. 29, no. 2, pp. 147–162, 2004.
[57] L. Arge, D. E. Vengroff, and J. S. Vitter, “External-memory algorithms for processing line segments in geographic information systems,” Algorithmica, vol. 47, pp. 1–25, January 2007.
[58] L. Arge and J. S. Vitter, “Optimal external memory interval management,” SIAM Journal on Computing, vol. 32, no. 6, pp. 1488–1508, 2003.
[59] L. Arge and N. Zeh, “I/O-efficient strong connectivity and depth-first search for directed planar graphs,” in Proceedings of the IEEE Symposium on Foundations of Computer Science, pp. 261–270, 2003.
[60] C. Armen, “Bounds on the separation of two parallel disk models,” in Proceedings of the Workshop on Input/Output in Parallel and Distributed Systems, pp. 122–127, May 1996.
[61] D. Arroyuelo and G. Navarro, “A Lempel–Ziv text index on secondary storage,” in Proceedings of the Symposium on Combinatorial Pattern Matching, pp. 83–94, Springer-Verlag, 2007.
[62] M. J. Atallah and S. Prabhakar, “(Almost) optimal parallel block access for range queries,” Information Sciences, vol. 157, pp. 21–31, 2003.
[63] R. A. Baeza-Yates, “Expected behaviour of B+-trees under random insertions,” Acta Informatica, vol. 26, no. 5, pp. 439–472, 1989.
[64] R. A. Baeza-Yates, “Bounded disorder: The effect of the index,” Theoretical Computer Science, vol. 168, pp. 21–38, 1996.
[65] R. A. Baeza-Yates and P.-A. Larson, “Performance of B+-trees with partial expansions,” IEEE Transactions on Knowledge and Data Engineering, vol. 1, pp. 248–257, June 1989.
[66] R. A. Baeza-Yates and B. Ribeiro-Neto, eds., Modern Information Retrieval. Addison Wesley Longman, 1999.
[67] R. A. Baeza-Yates and H.
Soza-Pollman, “Analysis of linear hashing revisited,” Nordic Journal of Computing, vol. 5, pp. 70–85, 1998.
[68] R. D. Barve, E. F. Grove, and J. S. Vitter, “Simple randomized mergesort on parallel disks,” Parallel Computing, vol. 23, no. 4, pp. 601–631, 1997.
[69] R. D. Barve, M. Kallahalla, P. J. Varman, and J. S. Vitter, “Competitive analysis of buffer management algorithms,” Journal of Algorithms, vol. 36, pp. 152–181, August 2000.
[70] R. D. Barve, E. A. M. Shriver, P. B. Gibbons, B. K. Hillyer, Y. Matias, and J. S. Vitter, “Modeling and optimizing I/O throughput of multiple disks on a bus,” in Proceedings of ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 83–92, Atlanta: ACM Press, May 1999.
[71] R. D. Barve and J. S. Vitter, “A theoretical framework for memory-adaptive algorithms,” in Proceedings of the IEEE Symposium on Foundations of Computer Science, pp. 273–284, New York: IEEE Computer Society Press, October 1999.
[72] R. D. Barve and J. S. Vitter, “A simple and efficient parallel disk mergesort,” Theory of Computing Systems, vol. 35, pp. 189–215, March/April 2002.
[73] J. Basch, L. J. Guibas, and J. Hershberger, “Data structures for mobile data,” Journal of Algorithms, vol. 31, pp. 1–28, 1999.
[74] R. Bayer and E. McCreight, “Organization of large ordered indexes,” Acta Informatica, vol. 1, pp. 173–189, 1972.
[75] R. Bayer and K. Unterauer, “Prefix B-trees,” ACM Transactions on Database Systems, vol. 2, pp. 11–26, March 1977.
[76] B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer, “An asymptotically optimal multiversion B-tree,” VLDB Journal, vol. 5, pp. 264–275, December 1996.
[77] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: An efficient and robust access method for points and rectangles,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 322–331, ACM Press, 1990.
[78] A. L.
Belady, “A study of replacement algorithms for virtual storage computers,” IBM Systems Journal, vol. 5, pp. 78–101, 1966.
[79] M. A. Bender, E. D. Demaine, and M. Farach-Colton, “Cache-oblivious B-trees,” SIAM Journal on Computing, vol. 35, no. 2, pp. 341–358, 2005.
[80] M. A. Bender, M. Farach-Colton, and B. Kuszmaul, “Cache-oblivious string B-trees,” in Proceedings of the ACM Conference on Principles of Database Systems, pp. 233–242, ACM Press, 2006.
[81] M. A. Bender, J. T. Fineman, S. Gilbert, and B. C. Kuszmaul, “Concurrent cache-oblivious B-trees,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pp. 228–237, ACM Press, 2005.
[82] J. L. Bentley, “Multidimensional divide and conquer,” Communications of the ACM, vol. 23, no. 6, pp. 214–229, 1980.
[83] J. L. Bentley and J. B. Saxe, “Decomposable searching problems I: Static-to-dynamic transformations,” Journal of Algorithms, vol. 1, pp. 301–358, December 1980.
[84] S. Berchtold, C. Böhm, and H.-P. Kriegel, “Improving the query performance of high-dimensional index structures by bulk load operations,” in Proceedings of the International Conference on Extending Database Technology, pp. 216–230, Springer-Verlag, 1998.
[85] M. Berger, E. R. Hansen, R. Pagh, M. Pătrașcu, M. Ružić, and P. Tiedemann, “Deterministic load balancing and dictionaries in the parallel disk model,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, ACM Press, 2006.
[86] R. Bhatia, R. K. Sinha, and C.-M. Chen, “A hierarchical technique for constructing efficient declustering schemes for range queries,” The Computer Journal, vol. 46, no. 4, pp. 358–377, 2003.
[87] N. Blum and K. Mehlhorn, “On the average number of rebalancing operations in weight-balanced trees,” Theoretical Computer Science, vol. 11, pp. 303–320, July 1980.
[88] R. S. Boyer and J. S. Moore, “A fast string searching algorithm,” Communications of the ACM, vol. 20, pp. 762–772, October 1977.
[89] C. Breimann and J. Vahrenhold, “External memory computational geometry revisited,” in Algorithms for Memory Hierarchies, pp. 110–148, 2002.
[90] K. Brengel, A. Crauser, P. Ferragina, and U. Meyer, “An experimental study of priority queues in external memory,” ACM Journal of Experimental Algorithmics, vol. 5, p. 17, 2000.
[91] G. S. Brodal and R. Fagerberg, “Lower bounds for external memory dictionaries,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 546–554, ACM Press, 2003.
[92] G. S. Brodal and J. Katajainen, “Worst-case efficient external-memory priority queues,” in Proceedings of the Scandinavian Workshop on Algorithmic Theory, pp. 107–118, Stockholm: Springer-Verlag, July 1998.
[93] A. L. Buchsbaum, M. Goldwasser, S. Venkatasubramanian, and J. R. Westbrook, “On external memory graph traversal,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 859–860, ACM Press, January 2000.
[94] P. Callahan, M. T. Goodrich, and K. Ramaiyer, “Topology B-trees and their applications,” in Proceedings of the Workshop on Algorithms and Data Structures, pp. 381–392, Springer-Verlag, 1995.
[95] P. Cao, E. W. Felten, A. R. Karlin, and K. Li, “Implementation and performance of integrated application-controlled file caching, prefetching and disk scheduling,” ACM Transactions on Computer Systems, vol. 14, pp. 311–343, November 1996.
[96] L. Carter and K. S. Gatlin, “Towards an optimal bit-reversal permutation program,” in Proceedings of the IEEE Symposium on Foundations of Computer Science, pp. 544–553, November 1998.
[97] G. Chaudhry and T. H. Cormen, “Oblivious vs. distribution-based sorting: An experimental evaluation,” in Proceedings of the European Symposium on Algorithms, pp. 317–328, Springer-Verlag, 2005.
[98] B. Chazelle, “Filtering search: A new approach to query-answering,” SIAM Journal on Computing, vol. 15, pp. 703–724, 1986.
[99] B. Chazelle, “Lower bounds for orthogonal range searching: I.
The reporting case,” Journal of the ACM, vol. 37, pp. 200–212, April 1990.
[100] B. Chazelle and H. Edelsbrunner, “Linear space data structures for two types of range search,” Discrete and Computational Geometry, vol. 2, pp. 113–126, 1987.
[101] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson, “RAID: High-performance, reliable secondary storage,” ACM Computing Surveys, vol. 26, pp. 145–185, June 1994.
[102] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter, “Efficient indexing methods for probabilistic threshold queries over uncertain data,” in Proceedings of the International Conference on Very Large Databases, Toronto: Morgan Kaufmann, August 2004.
[103] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter, “Efficient join processing over uncertain-valued attributes,” in Proceedings of the International ACM Conference on Information and Knowledge Management, Arlington, Va.: ACM Press, November 2006.
[104] Y.-J. Chiang, “Experiments on the practical I/O efficiency of geometric algorithms: Distribution sweep vs. plane sweep,” Computational Geometry: Theory and Applications, vol. 8, no. 4, pp. 211–236, 1998.
[105] Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter, “External-memory graph algorithms,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 139–149, ACM Press, January 1995.
[106] Y.-J. Chiang and C. T. Silva, “External memory techniques for isosurface extraction in scientific visualization,” in External Memory Algorithms and Visualization, (J. Abello and J. S. Vitter, eds.), pp. 247–277, Providence, Rhode Island: American Mathematical Society Press, 1999.
[107] Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter, “Geometric Burrows–Wheeler transform: Linking range searching and text indexing,” in Proceedings of the Data Compression Conference, Snowbird, Utah: IEEE Computer Society Press, March 2008.
[108] R. A. Chowdhury and V.
Ramachandran, “External-memory exact and approximate all-pairs shortest-paths in undirected graphs,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 735–744, ACM Press, 2005.
[109] V. Ciriani, P. Ferragina, F. Luccio, and S. Muthukrishnan, “Static optimality theorem for external memory string access,” in Proceedings of the IEEE Symposium on Foundations of Computer Science, pp. 219–227, 2002.
[110] D. R. Clark and J. I. Munro, “Efficient suffix trees on secondary storage,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 383–391, Atlanta: ACM Press, June 1996.
[111] K. L. Clarkson and P. W. Shor, “Applications of random sampling in computational geometry, II,” Discrete and Computational Geometry, vol. 4, pp. 387–421, 1989.
[112] F. Claude and G. Navarro, “A fast and compact Web graph representation,” in Proceedings of the International Symposium on String Processing and Information Retrieval, pp. 105–116, Springer-Verlag, 2007.
[113] A. Colvin and T. H. Cormen, “ViC*: A compiler for virtual-memory C*,” in Proceedings of the International Workshop on High-Level Programming Models and Supportive Environments, pp. 23–33, 1998.
[114] D. Comer, “The ubiquitous B-Tree,” ACM Computing Surveys, vol. 11, no. 2, pp. 121–137, 1979.
[115] P. Corbett, D. Feitelson, S. Fineberg, Y. Hsu, B. Nitzberg, J.-P. Prost, M. Snir, B. Traversat, and P. Wong, “Overview of the MPI-IO parallel I/O interface,” in Input/Output in Parallel and Distributed Computer Systems, (R. Jain, J. Werth, and J. C. Browne, eds.), ch. 5, pp. 127–146, Kluwer Academic Publishers, 1996.
[116] P. F. Corbett and D. G. Feitelson, “The Vesta parallel file system,” ACM Transactions on Computer Systems, vol. 14, pp. 225–264, August 1996.
[117] T. H. Cormen and E. R.
Davidson, “FG: A framework generator for hiding latency in parallel programs running on clusters,” in Proceedings of the 17th International Conference on Parallel and Distributed Computing Systems, pp. 137–144, September 2004.
[118] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, Mass.: MIT Press, 2nd ed., September 2001.
[119] T. H. Cormen and D. M. Nicol, “Performing out-of-core FFTs on parallel disk systems,” Parallel Computing, vol. 24, pp. 5–20, January 1998.
[120] T. H. Cormen, T. Sundquist, and L. F. Wisniewski, “Asymptotically tight bounds for performing BMMC permutations on parallel disk systems,” SIAM Journal on Computing, vol. 28, no. 1, pp. 105–136, 1999.
[121] A. Crauser and P. Ferragina, “A theoretical and experimental study on the construction of suffix arrays in external memory,” Algorithmica, vol. 32, no. 1, pp. 1–35, 2002.
[122] A. Crauser, P. Ferragina, K. Mehlhorn, U. Meyer, and E. A. Ramos, “I/O-optimal computation of segment intersections,” in External Memory Algorithms and Visualization, (J. Abello and J. S. Vitter, eds.), pp. 131–138, Providence, Rhode Island: American Mathematical Society Press, 1999.
[123] A. Crauser, P. Ferragina, K. Mehlhorn, U. Meyer, and E. A. Ramos, “Randomized external-memory algorithms for line segment intersection and other geometric problems,” International Journal of Computational Geometry and Applications, vol. 11, no. 3, pp. 305–337, 2001.
[124] A. Crauser and K. Mehlhorn, “LEDA-SM: Extending LEDA to secondary memory,” in Proceedings of the Workshop on Algorithm Engineering, (J. S. Vitter and C. Zaroliagis, eds.), pp. 228–242, London: Springer-Verlag, July 1999.
[125] K. Curewitz, P. Krishnan, and J. S. Vitter, “Practical prefetching via data compression,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 257–266, Washington, DC: ACM Press, May 1993.
[126] R. Cypher and G.
Plaxton, “Deterministic sorting in nearly logarithmic time on the hypercube and related computers,” Journal of Computer and System Sciences, vol. 47, no. 3, pp. 501–548, 1993.
[127] E. R. Davidson, FG: Improving Parallel Programs and Parallel Programming Since 2003. PhD thesis, Dartmouth College Department of Computer Science, August 2006.
[128] M. de Berg, J. Gudmundsson, M. Hammar, and M. H. Overmars, “On R-trees with low query complexity,” Computational Geometry, vol. 24, no. 3, pp. 179–195, 2003.
[129] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications. Berlin: Springer-Verlag, 1997.
[130] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proceedings of the Symposium on Operating Systems Design and Implementation, pp. 137–150, USENIX, December 2004.
[131] F. K. H. A. Dehne, W. Dittrich, and D. A. Hutchinson, “Efficient external memory algorithms by simulating coarse-grained parallel algorithms,” Algorithmica, vol. 36, no. 2, pp. 97–122, 2003.
[132] F. K. H. A. Dehne, W. Dittrich, D. A. Hutchinson, and A. Maheshwari, “Bulk synchronous parallel algorithms for the external memory model,” Theory of Computing Systems, vol. 35, no. 6, pp. 567–597, 2002.
[133] R. Dementiev, Algorithm Engineering for Large Data Sets. PhD thesis, Saarland University, 2006.
[134] R. Dementiev, J. Kärkkäinen, J. Mehnert, and P. Sanders, “Better external memory suffix array construction,” ACM Journal of Experimental Algorithmics, in press.
[135] R. Dementiev, L. Kettner, and P. Sanders, “STXXL: Standard Template Library for XXL Data Sets,” Software — Practice and Experience, vol. 38, no. 6, pp. 589–637, 2008.
[136] R. Dementiev and P. Sanders, “Asynchronous parallel disk sorting,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pp. 138–148, ACM Press, 2003.
[137] R. Dementiev, P. Sanders, D. Schultes, and J.
Sibeyn, “Engineering an external memory minimum spanning tree algorithm,” in Proceedings of IFIP International Conference on Theoretical Computer Science, Toulouse: Kluwer Academic Publishers, 2004.
[138] H. B. Demuth, Electronic data sorting. PhD thesis, Stanford University, 1956.
[139] P. J. Denning, “Working sets past and present,” IEEE Transactions on Software Engineering, vol. SE-6, pp. 64–84, 1980.
[140] D. J. DeWitt, J. F. Naughton, and D. A. Schneider, “Parallel sorting on a shared-nothing architecture using probabilistic splitting,” in Proceedings of the International Conference on Parallel and Distributed Information Systems, pp. 280–291, December 1991.
[141] W. Dittrich, D. A. Hutchinson, and A. Maheshwari, “Blocking in parallel multisearch problems,” Theory of Computing Systems, vol. 34, no. 2, pp. 145–189, 2001.
[142] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan, “Making data structures persistent,” Journal of Computer and System Sciences, vol. 38, pp. 86–124, 1989.
[143] M. C. Easton, “Key-sequence data sets on indelible storage,” IBM Journal of Research and Development, vol. 30, pp. 230–241, 1986.
[144] H. Edelsbrunner, “A new approach to rectangle intersections, Part I,” International Journal of Computer Mathematics, vol. 13, pp. 209–219, 1983.
[145] H. Edelsbrunner, “A new approach to rectangle intersections, Part II,” International Journal of Computer Mathematics, vol. 13, pp. 221–229, 1983.
[146] M. Y. Eltabakh, W.-K. Hon, R. Shah, W. Aref, and J. S. Vitter, “The SBC-tree: An index for run-length compressed sequences,” in Proceedings of the International Conference on Extending Database Technology, Nantes, France: Springer-Verlag, March 2008.
[147] R. J. Enbody and H. C. Du, “Dynamic hashing schemes,” ACM Computing Surveys, vol. 20, pp. 85–113, June 1988.
[148] “NASA’s Earth Observing System (EOS) web page, NASA Goddard Space Flight Center,” http://eospso.gsfc.nasa.gov/.
[149] D. Eppstein, Z. Galil, G.
F. Italiano, and A. Nissenzweig, “Sparsification — a technique for speeding up dynamic graph algorithms,” Journal of the ACM, vol. 44, no. 5, pp. 669–696, 1997.
[150] J. Erickson, “Lower bounds for external algebraic decision trees,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 755–761, ACM Press, 2005.
[151] G. Evangelidis, D. B. Lomet, and B. Salzberg, “The hBΠ-tree: A multiattribute index supporting concurrency, recovery and node consolidation,” VLDB Journal, vol. 6, pp. 1–25, 1997.
[152] R. Fagerberg, A. Pagh, and R. Pagh, “External string sorting: Faster and cache oblivious,” in Proceedings of the Symposium on Theoretical Aspects of Computer Science, pp. 68–79, Springer-Verlag, 2006.
[153] R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong, “Extendible hashing — a fast access method for dynamic files,” ACM Transactions on Database Systems, vol. 4, no. 3, pp. 315–344, 1979.
[154] M. Farach-Colton, P. Ferragina, and S. Muthukrishnan, “On the sorting-complexity of suffix tree construction,” Journal of the ACM, vol. 47, no. 6, pp. 987–1011, 2000.
[155] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan, “An approximate L1-difference algorithm for massive data streams,” SIAM Journal on Computing, vol. 32, no. 1, pp. 131–151, 2002.
[156] W. Feller, An Introduction to Probability Theory and its Applications. Vol. 1, New York: John Wiley & Sons, 3rd ed., 1968.
[157] P. Ferragina and R. Grossi, “Fast string searching in secondary storage: Theoretical developments and experimental results,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 373–382, Atlanta: ACM Press, June 1996.
[158] P. Ferragina and R. Grossi, “The String B-tree: A new data structure for string search in external memory and its applications,” Journal of the ACM, vol. 46, pp. 236–280, March 1999.
[159] P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S.
Vitter, “On searching compressed string collections cache-obliviously,” in Proceedings of the ACM Conference on Principles of Database Systems, Vancouver: ACM Press, June 2008.
[160] P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava, “Two-dimensional substring indexing,” Journal of Computer and System Sciences, vol. 66, no. 4, pp. 763–774, 2003.
[161] P. Ferragina and F. Luccio, “Dynamic dictionary matching in external memory,” Information and Computation, vol. 146, pp. 85–99, November 1998.
[162] P. Ferragina and G. Manzini, “Indexing compressed texts,” Journal of the ACM, vol. 52, no. 4, pp. 552–581, 2005.
[163] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro, “Compressed representations of sequences and full-text indexes,” ACM Transactions on Algorithms, vol. 3, p. 20, May 2007.
[164] P. Flajolet, “On the performance evaluation of extendible hashing and trie searching,” Acta Informatica, vol. 20, no. 4, pp. 345–369, 1983.
[165] R. W. Floyd, “Permuting information in idealized two-level storage,” in Complexity of Computer Computations, (R. Miller and J. Thatcher, eds.), pp. 105–109, Plenum, 1972.
[166] W. Frakes and R. A. Baeza-Yates, eds., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
[167] G. Franceschini, R. Grossi, J. I. Munro, and L. Pagli, “Implicit B-trees: A new data structure for the dictionary problem,” Journal of Computer and System Sciences, vol. 68, no. 4, pp. 788–807, 2004.
[168] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, “Cache-oblivious algorithms,” in Proceedings of the IEEE Symposium on Foundations of Computer Science, pp. 285–298, 1999.
[169] T. A. Funkhouser, C. H. Sequin, and S. J. Teller, “Management of large amounts of data in interactive building walkthroughs,” in Proceedings of the ACM Symposium on Interactive 3D Graphics, pp. 11–20, Boston: ACM Press, March 1992.
[170] V. Gaede and O.
Günther, “Multidimensional access methods,” ACM Computing Surveys, vol. 30, pp. 170–231, June 1998.
[171] G. R. Ganger, “Generating representative synthetic workloads: An unsolved problem,” in Proceedings of the Computer Measurement Group Conference, pp. 1263–1269, December 1995.
[172] M. Gardner, Magic Show, ch. 7. New York: Knopf, 1977.
[173] I. Gargantini, “An effective way to represent quadtrees,” Communications of the ACM, vol. 25, pp. 905–910, December 1982.
[174] T. M. Ghanem, R. Shah, M. F. Mokbel, W. G. Aref, and J. S. Vitter, “Bulk operations for space-partitioning trees,” in Proceedings of IEEE International Conference on Data Engineering, Boston: IEEE Computer Society Press, April 2004.
[175] G. A. Gibson, J. S. Vitter, and J. Wilkes, “Report of the working group on storage I/O issues in large-scale computing,” ACM Computing Surveys, vol. 28, pp. 779–793, December 1996.
[176] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the International Conference on Very Large Databases, pp. 78–89, Edinburgh, Scotland: Morgan Kaufmann, 1999.
[177] R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina, “Proximity search in databases,” in Proceedings of the International Conference on Very Large Databases, pp. 26–37, August 1998.
[178] R. González and G. Navarro, “A compressed text index on secondary memory,” in Proceedings of the International Workshop on Combinatorial Algorithms, (Newcastle, Australia), pp. 80–91, College Publications, 2007.
[179] M. T. Goodrich, J.-J. Tsay, D. E. Vengroff, and J. S. Vitter, “External-memory computational geometry,” in Proceedings of the IEEE Symposium on Foundations of Computer Science, pp. 714–723, Palo Alto: IEEE Computer Society Press, November 1993.
[180] “Google Earth online database of satellite images,” Available on the World-Wide Web at http://earth.google.com/.
[181] S. Govindarajan, P. K. Agarwal, and L.
Arge, “CRB-tree: An efficient indexing scheme for range-aggregate queries,” in Proceedings of the International Conference on Database Theory, pp. 143–157, Springer-Verlag, 2003.
[182] S. Govindarajan, T. Lukovszki, A. Maheshwari, and N. Zeh, “I/O-efficient well-separated pair decomposition and its applications,” Algorithmica, vol. 45, pp. 385–614, August 2006.
[183] D. Greene, “An implementation and performance analysis of spatial data access methods,” in Proceedings of IEEE International Conference on Data Engineering, pp. 606–615, 1989.
[184] J. L. Griffin, S. W. Schlosser, G. R. Ganger, and D. F. Nagle, “Modeling and performance of MEMS-based storage devices,” in Proceedings of ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 56–65, Santa Clara, Cal.: ACM Press, June 2000.
[185] R. Grossi, A. Gupta, and J. S. Vitter, “High-order entropy-compressed text indexes,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, ACM Press, January 2003.
[186] R. Grossi and G. F. Italiano, “Efficient cross-trees for external memory,” in External Memory Algorithms and Visualization, (J. Abello and J. S. Vitter, eds.), pp. 87–106, Providence, Rhode Island: American Mathematical Society Press, 1999.
[187] R. Grossi and G. F. Italiano, “Efficient splitting and merging algorithms for order decomposable problems,” Information and Computation, vol. 154, no. 1, pp. 1–33, 1999.
[188] S. K. S. Gupta, Z. Li, and J. H. Reif, “Generating efficient programs for two-level memories from tensor-products,” in Proceedings of the IASTED/ISMM International Conference on Parallel and Distributed Computing and Systems, pp. 510–513, October 1995.
[189] D. Gusfield, Algorithms on Strings, Trees, and Sequences. Cambridge, UK: Cambridge University Press, 1997.
[190] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 47–57, ACM Press, 1984.
[191] H. J. Haverkort and L. Toma, “I/O-efficient algorithms on near-planar graphs,” in Proceedings of the Latin American Theoretical Informatics Symposium, pp. 580–591, 2006.
[192] T. Hazel, L. Toma, R. Wickremesinghe, and J. Vahrenhold, “Terracost: A versatile and scalable approach to computing least-cost-path surfaces for massive grid-based terrains,” in Proceedings of the ACM Symposium on Applied Computing, pp. 52–57, ACM Press, 2006.
[193] J. M. Hellerstein, E. Koutsoupias, and C. H. Papadimitriou, “On the analysis of indexing schemes,” in Proceedings of the ACM Symposium on Principles of Database Systems, pp. 249–256, Tucson: ACM Press, May 1997.
[194] L. Hellerstein, G. Gibson, R. M. Karp, R. H. Katz, and D. A. Patterson, “Coding techniques for handling failures in large disk arrays,” Algorithmica, vol. 12, no. 2–3, pp. 182–208, 1994.
[195] M. R. Henzinger, P. Raghavan, and S. Rajagopalan, “Computing on data streams,” in External Memory Algorithms and Visualization, (J. Abello and J. S. Vitter, eds.), pp. 107–118, Providence, Rhode Island: American Mathematical Society Press, 1999.
[196] K. Hinrichs, “Implementation of the grid file: Design concepts and experience,” BIT, vol. 25, no. 4, pp. 569–592, 1985.
[197] W.-K. Hon, T.-W. Lam, R. Shah, S.-L. Tam, and J. S. Vitter, “Cache-oblivious index for approximate string matching,” in Proceedings of the Symposium on Combinatorial Pattern Matching, pp. 40–51, London, Ontario, Canada: Springer-Verlag, July 2007.
[198] W.-K. Hon, R. Shah, P. J. Varman, and J. S. Vitter, “Tight competitive ratios for parallel disk prefetching and caching,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, Munich: ACM Press, June 2008.
[199] J. W. Hong and H. T. Kung, “I/O complexity: The red-blue pebble game,” in Proceedings of the ACM Symposium on Theory of Computing, pp. 326–333, ACM Press, May 1981.
[200] D. Hutchinson, A. Maheshwari, J.-R. Sack, and R.
Velicescu, “Early experiences in implementing the buffer tree,” in Proceedings of the Workshop on Algorithm Engineering, Springer-Verlag, 1997.
[201] D. A. Hutchinson, A. Maheshwari, and N. Zeh, “An external memory data structure for shortest path queries,” Discrete Applied Mathematics, vol. 126, no. 1, pp. 55–82, 2003.
[202] D. A. Hutchinson, P. Sanders, and J. S. Vitter, “Duality between prefetching and queued writing with parallel disks,” SIAM Journal on Computing, vol. 34, no. 6, pp. 1443–1463, 2005.
[203] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala, “Locality-preserving hashing in multidimensional spaces,” in Proceedings of the ACM Symposium on Theory of Computing, pp. 618–625, El Paso: ACM Press, May 1997.
[204] M. Kallahalla and P. J. Varman, “Optimal prefetching and caching for parallel I/O systems,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, Crete, Greece: ACM Press, July 2001.
[205] M. Kallahalla and P. J. Varman, “Optimal read-once parallel disk scheduling,” Algorithmica, vol. 43, no. 4, pp. 309–343, 2005.
[206] I. Kamel and C. Faloutsos, “On packing R-trees,” in Proceedings of the International ACM Conference on Information and Knowledge Management, pp. 490–499, 1993.
[207] I. Kamel and C. Faloutsos, “Hilbert R-tree: An improved R-tree using fractals,” in Proceedings of the International Conference on Very Large Databases, pp. 500–509, 1994.
[208] I. Kamel, M. Khalil, and V. Kouramajian, “Bulk insertion in dynamic R-trees,” in Proceedings of the International Symposium on Spatial Data Handling, pp. 3B, 31–42, 1996.
[209] P. C. Kanellakis, G. M. Kuper, and P. Z. Revesz, “Constraint query languages,” Journal of Computer and System Sciences, vol. 51, no. 1, pp. 26–52, 1995.
[210] P. C. Kanellakis, S. Ramaswamy, D. E. Vengroff, and J. S. Vitter, “Indexing for data models with constraints and classes,” Journal of Computer and System Sciences, vol. 52, no. 3, pp. 589–612, 1996.
[211] K. V. R.
Kanth and A. K. Singh, “Optimal dynamic range searching in non-replicating index structures,” in Proceedings of the International Conference on Database Theory, pp. 257–276, Springer-Verlag, January 1999.
[212] J. Kärkkäinen and S. S. Rao, “Full-text indexes in external memory,” in Algorithms for Memory Hierarchies, (U. Meyer, P. Sanders, and J. Sibeyn, eds.), ch. 7, pp. 149–170, Berlin: Springer-Verlag, 2003.
[213] I. Katriel and U. Meyer, “Elementary graph algorithms in external memory,” in Algorithms for Memory Hierarchies, (U. Meyer, P. Sanders, and J. Sibeyn, eds.), ch. 4, pp. 62–84, Berlin: Springer-Verlag, 2003.
[214] R. Khandekar and V. Pandit, “Offline sorting buffers on line,” in Proceedings of the International Symposium on Algorithms and Computation, pp. 81–89, Springer-Verlag, December 2006.
[215] S. Khuller, Y. A. Kim, and Y.-C. J. Wan, “Algorithms for data migration with cloning,” SIAM Journal on Computing, vol. 33, no. 2, pp. 448–461, 2004.
[216] M. Y. Kim, “Synchronized disk interleaving,” IEEE Transactions on Computers, vol. 35, pp. 978–988, November 1986.
[217] T. Kimbrel and A. R. Karlin, “Near-optimal parallel prefetching and caching,” SIAM Journal on Computing, vol. 29, no. 4, pp. 1051–1082, 2000.
[218] D. G. Kirkpatrick and R. Seidel, “The ultimate planar convex hull algorithm?,” SIAM Journal on Computing, vol. 15, pp. 287–299, 1986.
[219] S. T. Klein and D. Shapira, “Searching in compressed dictionaries,” in Proceedings of the Data Compression Conference, Snowbird, Utah: IEEE Computer Society Press, 2002.
[220] D. E. Knuth, Sorting and Searching. Vol. 3 of The Art of Computer Programming, Reading, MA: Addison-Wesley, 2nd ed., 1998.
[221] D. E. Knuth, MMIXware. Berlin: Springer-Verlag, 1999.
[222] D. E. Knuth, J. H. Morris, and V. R. Pratt, “Fast pattern matching in strings,” SIAM Journal on Computing, vol. 6, pp. 323–350, 1977.
[223] G. Kollios, D. Gunopulos, and V. J.
Tsotras, “On indexing mobile objects,” in Proceedings of the ACM Symposium on Principles of Database Systems, pp. 261–272, ACM Press, 1999.
[224] E. Koutsoupias and D. S. Taylor, “Tight bounds for 2-dimensional indexing schemes,” in Proceedings of the ACM Symposium on Principles of Database Systems, pp. 52–58, Seattle: ACM Press, June 1998.
[225] M. Kowarschik and C. Weiß, “An overview of cache optimization techniques and cache-aware numerical algorithms,” in Algorithms for Memory Hierarchies, (U. Meyer, P. Sanders, and J. Sibeyn, eds.), ch. 10, pp. 213–232, Berlin: Springer-Verlag, 2003.
[226] P. Krishnan and J. S. Vitter, “Optimal prediction for prefetching in the worst case,” SIAM Journal on Computing, vol. 27, pp. 1617–1636, December 1998.
[227] V. Kumar and E. Schwabe, “Improved algorithms and data structures for solving graph problems in external memory,” in Proceedings of the IEEE Symposium on Parallel and Distributed Processing, pp. 169–176, October 1996.
[228] K. Küspert, “Storage utilization in B*-trees with a generalized overflow technique,” Acta Informatica, vol. 19, pp. 35–55, 1983.
[229] P.-A. Larson, “Performance analysis of linear hashing with partial expansions,” ACM Transactions on Database Systems, vol. 7, pp. 566–587, December 1982.
[230] R. Laurini and D. Thompson, Fundamentals of Spatial Information Systems. Academic Press, 1992.
[231] P. L. Lehman and S. B. Yao, “Efficient locking for concurrent operations on B-trees,” ACM Transactions on Database Systems, vol. 6, pp. 650–670, December 1981.
[232] F. T. Leighton, “Tight bounds on the complexity of parallel sorting,” IEEE Transactions on Computers, vol. C-34, pp. 344–354, Special issue on sorting, E. E. Lindstrom, C. K. Wong, and J. S. Vitter, eds., April 1985.
[233] C. E. Leiserson, S. Rao, and S. Toledo, “Efficient out-of-core algorithms for linear relaxation using blocking covers,” Journal of Computer and System Sciences, vol. 54, no. 2, pp. 332–344, 1997.
[234] Z.
Li, P. H. Mills, and J. H. Reif, “Models and resource metrics for parallel and distributed computation,” Parallel Algorithms and Applications, vol. 8, pp. 35–59, 1996.
[235] W. Litwin, “Linear hashing: A new tool for files and tables addressing,” in Proceedings of the International Conference on Very Large Databases, pp. 212–223, October 1980.
[236] W. Litwin and D. Lomet, “A new method for fast data searches with keys,” IEEE Software, vol. 4, pp. 16–24, March 1987.
[237] D. Lomet, “A simple bounded disorder file organization with good performance,” ACM Transactions on Database Systems, vol. 13, no. 4, pp. 525–551, 1988.
[238] D. B. Lomet and B. Salzberg, “The hB-tree: A multiattribute indexing method with good guaranteed performance,” ACM Transactions on Database Systems, vol. 15, no. 4, pp. 625–658, 1990.
[239] D. B. Lomet and B. Salzberg, “Concurrency and recovery for index trees,” VLDB Journal, vol. 6, no. 3, pp. 224–240, 1997.
[240] T. Lukovszki, A. Maheshwari, and N. Zeh, “I/O-efficient batched range counting and its applications to proximity problems,” Foundations of Software Technology and Theoretical Computer Science, pp. 244–255, 2001.
[241] A. Maheshwari and N. Zeh, “I/O-efficient algorithms for graphs of bounded treewidth,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 89–90, Washington, DC: ACM Press, January 2001.
[242] A. Maheshwari and N. Zeh, “I/O-optimal algorithms for planar graphs using separators,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 372–381, ACM Press, 2002.
[243] A. Maheshwari and N. Zeh, “A survey of techniques for designing I/O-efficient algorithms,” in Algorithms for Memory Hierarchies, (U. Meyer, P. Sanders, and J. Sibeyn, eds.), ch. 3, pp. 36–61, Berlin: Springer-Verlag, 2003.
[244] A. Maheshwari and N. Zeh, “I/O-optimal algorithms for outerplanar graphs,” Journal of Graph Algorithms and Applications, vol. 8, pp. 47–87, 2004.
[245] V. Mäkinen, G. Navarro, and K.
Sadakane, “Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays,” in Proceedings of the International Symposium on Algorithms and Computation, pp. 681–692, Springer-Verlag, 2004.
[246] U. Manber and G. Myers, “Suffix arrays: A new method for on-line string searches,” SIAM Journal on Computing, vol. 22, pp. 935–948, October 1993.
[247] U. Manber and S. Wu, “GLIMPSE: A tool to search through entire file systems,” in Proceedings of the Winter USENIX Conference, (USENIX Association, ed.), pp. 23–32, San Francisco: USENIX, January 1994.
[248] G. N. N. Martin, “Spiral storage: Incrementally augmentable hash addressed storage,” Technical Report CS-RR-027, University of Warwick, March 1979.
[249] Y. Matias, E. Segal, and J. S. Vitter, “Efficient bundle sorting,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 839–848, San Francisco: ACM Press, January 2000.
[250] E. M. McCreight, “A space-economical suffix tree construction algorithm,” Journal of the ACM, vol. 23, no. 2, pp. 262–272, 1976.
[251] E. M. McCreight, “Priority search trees,” SIAM Journal on Computing, vol. 14, pp. 257–276, May 1985.
[252] K. Mehlhorn and U. Meyer, “External-memory breadth-first search with sublinear I/O,” in Proceedings of the European Symposium on Algorithms, pp. 723–735, Springer-Verlag, 2002.
[253] H. Mendelson, “Analysis of extendible hashing,” IEEE Transactions on Software Engineering, vol. SE-8, pp. 611–619, November 1982.
[254] U. Meyer, “External memory BFS on undirected graphs with bounded degree,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 87–88, Washington, DC: ACM Press, January 2001.
[255] U. Meyer, “On dynamic breadth-first search in external-memory,” in Proceedings of the Symposium on Theoretical Aspects of Computer Science, (Schloss Dagstuhl, Germany), pp. 551–560, Internationales Begegnungs- und Forschungszentrum für Informatik, 2008.
[256] U.
Meyer, “On trade-offs in external-memory diameter approximation,” in Proceedings of the Scandinavian Workshop on Algorithm Theory, (Gothenburg, Sweden), Springer-Verlag, July 2008.
[257] U. Meyer, P. Sanders, and J. Sibeyn, eds., Algorithms for Memory Hierarchies. Berlin: Springer-Verlag, 2003.
[258] U. Meyer and N. Zeh, “I/O-efficient undirected shortest paths,” in Proceedings of the European Symposium on Algorithms, pp. 435–445, Springer-Verlag, 2003.
[259] U. Meyer and N. Zeh, “I/O-efficient undirected shortest paths with unbounded weights,” in Proceedings of the European Symposium on Algorithms, Springer-Verlag, 2006.
[260] C. Mohan, “ARIES/KVL: A key-value locking method for concurrency control of multiaction transactions on B-tree indices,” in Proceedings of the International Conference on Very Large Databases, pp. 392–405, August 1990.
[261] D. R. Morrison, “Patricia: Practical algorithm to retrieve information coded in alphanumeric,” Journal of the ACM, vol. 15, pp. 514–534, 1968.
[262] S. A. Moyer and V. Sunderam, “Characterizing concurrency control performance for the PIOUS parallel file system,” Journal of Parallel and Distributed Computing, vol. 38, pp. 81–91, October 1996.
[263] J. K. Mullin, “Spiral storage: Efficient dynamic hashing with constant performance,” The Computer Journal, vol. 28, pp. 330–334, July 1985.
[264] K. Munagala and A. Ranade, “I/O-complexity of graph algorithms,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 687–694, Baltimore: ACM Press, January 1999.
[265] S. Muthukrishnan, Data Streams: Algorithms and Applications. Vol. 1, issue 2 of Foundations and Trends in Theoretical Computer Science, Hanover, Mass.: now Publishers, 2005.
[266] G. Navarro, “Indexing text using the Ziv–Lempel trie,” Journal of Discrete Algorithms, vol. 2, no. 1, pp. 87–114, 2004.
[267] G. Navarro and V. Mäkinen, “Compressed full-text indexes,” ACM Computing Surveys, vol. 39, no. 1, p. 2, 2007.
[268] J.
Nievergelt, H. Hinterberger, and K. C. Sevcik, “The grid file: An adaptable, symmetric multi-key file structure,” ACM Transactions on Database Systems, vol. 9, pp. 38–71, 1984.
[269] J. Nievergelt and E. M. Reingold, “Binary search trees of bounded balance,” SIAM Journal on Computing, vol. 2, pp. 33–43, March 1973.
[270] J. Nievergelt and P. Widmayer, “Spatial data structures: Concepts and design choices,” in Algorithmic Foundations of GIS, (M. van Kreveld, J. Nievergelt, T. Roos, and P. Widmayer, eds.), pp. 153–197, Springer-Verlag, 1997.
[271] M. H. Nodine, M. T. Goodrich, and J. S. Vitter, “Blocking for external graph searching,” Algorithmica, vol. 16, pp. 181–214, August 1996.
[272] M. H. Nodine, D. P. Lopresti, and J. S. Vitter, “I/O overhead and parallel VLSI architectures for lattice computations,” IEEE Transactions on Communications, vol. 40, pp. 843–852, July 1991.
[273] M. H. Nodine and J. S. Vitter, “Deterministic distribution sort in shared and distributed memory multiprocessors,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pp. 120–129, Velen, Germany: ACM Press, June–July 1993.
[274] M. H. Nodine and J. S. Vitter, “Greed Sort: An optimal sorting algorithm for multiple disks,” Journal of the ACM, vol. 42, pp. 919–933, July 1995.
[275] P. E. O’Neil, “The SB-tree: An index-sequential structure for high-performance sequential access,” Acta Informatica, vol. 29, pp. 241–265, June 1992.
[276] J. A. Orenstein, “Redundancy in spatial databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 294–305, Portland: ACM Press, June 1989.
[277] J. A. Orenstein and T. H. Merrett, “A class of data structures for associative searching,” in Proceedings of the ACM Conference on Principles of Database Systems, pp. 181–190, ACM Press, 1984.
[278] M. H. Overmars, The Design of Dynamic Data Structures. Springer-Verlag, 1983.
[279] H. Pang, M. Carey, and M.
Livny, “Memory-adaptive external sorts,” in Proceedings of the International Conference on Very Large Databases, pp. 618–629, 1993.
[280] H. Pang, M. J. Carey, and M. Livny, “Partially preemptive hash joins,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, (P. Buneman and S. Jajodia, eds.), pp. 59–68, Washington, DC: ACM Press, May 1993.
[281] I. Parsons, R. Unrau, J. Schaeffer, and D. Szafron, “PI/OT: Parallel I/O templates,” Parallel Computing, vol. 23, pp. 543–570, June 1997.
[282] J. M. Patel and D. J. DeWitt, “Partition based spatial-merge join,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 259–270, ACM Press, June 1996.
[283] D. Pfoser, C. S. Jensen, and Y. Theodoridis, “Novel approaches to the indexing of moving object trajectories,” in Proceedings of the International Conference on Very Large Databases, pp. 395–406, 2000.
[284] F. P. Preparata and M. I. Shamos, Computational Geometry. Berlin: Springer-Verlag, 1985.
[285] O. Procopiuc, P. K. Agarwal, L. Arge, and J. S. Vitter, “Bkd-tree: A dynamic scalable kd-tree,” in Proceedings of the International Symposium on Spatial and Temporal Databases, Santorini, Greece: Springer-Verlag, July 2003.
[286] S. J. Puglisi, W. F. Smyth, and A. Turpin, “Inverted files versus suffix arrays for locating patterns in primary memory,” in Proceedings of the International Symposium on String Processing Information Retrieval, pp. 122–133, Springer-Verlag, 2006.
[287] N. Rahman and R. Raman, “Adapting radix sort to the memory hierarchy,” in Workshop on Algorithm Engineering and Experimentation, Springer-Verlag, January 2000.
[288] R. Raman, V. Raman, and S. S. Rao, “Succinct indexable dictionaries with applications to encoding k-ary trees and multisets,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242, ACM Press, 2002.
[289] S. Ramaswamy and S.
Subramanian, “Path caching: A technique for optimal external searching,” in Proceedings of the ACM Conference on Principles of Database Systems, pp. 25–35, Minneapolis: ACM Press, 1994.
[290] J. Rao and K. Ross, “Cache conscious indexing for decision-support in main memory,” in Proceedings of the International Conference on Very Large Databases, (M. Atkinson et al., eds.), pp. 78–89, Los Altos, Cal.: Morgan Kaufmann, 1999.
[291] J. Rao and K. A. Ross, “Making B+-trees cache conscious in main memory,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, (W. Chen, J. Naughton, and P. A. Bernstein, eds.), pp. 475–486, Dallas: ACM Press, 2000.
[292] E. Riedel, G. A. Gibson, and C. Faloutsos, “Active storage for large-scale data mining and multimedia,” in Proceedings of the International Conference on Very Large Databases, pp. 62–73, August 1998.
[293] J. T. Robinson, “The k-d-b-tree: A search structure for large multidimensional dynamic indexes,” in Proceedings of the ACM Conference on Principles of Database Systems, pp. 10–18, ACM Press, 1981.
[294] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod, “Using the SimOS machine simulator to study complex computer systems,” ACM Transactions on Modeling and Computer Simulation, vol. 7, pp. 78–103, January 1997.
[295] C. Ruemmler and J. Wilkes, “An introduction to disk drive modeling,” IEEE Computer, pp. 17–28, March 1994.
[296] K. Salem and H. Garcia-Molina, “Disk striping,” in Proceedings of IEEE International Conference on Data Engineering, pp. 336–342, Los Angeles, 1986.
[297] S. Šaltenis, C. S. Jensen, S. T. Leutenegger, and M. A. Lopez, “Indexing the positions of continuously moving objects,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, (W. Chen, J. Naughton, and P. A. Bernstein, eds.), pp. 331–342, Dallas: ACM Press, 2000.
[298] B. Salzberg and V. J.
Tsotras, “Comparison of access methods for time-evolving data,” ACM Computing Surveys, vol. 31, pp. 158–221, June 1999.
[299] H. Samet, Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, 1989.
[300] H. Samet, The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989.
[301] V. Samoladas and D. Miranker, “A lower bound theorem for indexing schemes and its application to multidimensional range queries,” in Proceedings of the ACM Symposium on Principles of Database Systems, pp. 44–51, Seattle: ACM Press, June 1998.
[302] P. Sanders, “Fast priority queues for cached memory,” ACM Journal of Experimental Algorithmics, vol. 5, no. 7, pp. 1–25, 2000.
[303] P. Sanders, “Reconciling simplicity and realism in parallel disk models,” Parallel Computing, vol. 28, no. 5, pp. 705–723, 2002.
[304] P. Sanders, S. Egner, and J. Korst, “Fast concurrent access to parallel disks,” Algorithmica, vol. 35, no. 1, pp. 21–55, 2002.
[305] J. E. Savage, “Extending the Hong-Kung model to memory hierarchies,” in Proceedings of the International Conference on Computing and Combinatorics, pp. 270–281, Springer-Verlag, August 1995.
[306] J. E. Savage and J. S. Vitter, “Parallelism in space-time tradeoffs,” in Advances in Computing Research, (F. P. Preparata, ed.), pp. 117–146, JAI Press, 1987.
[307] S. W. Schlosser, J. L. Griffin, D. F. Nagle, and G. R. Ganger, “Designing computer systems with MEMS-based storage,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 1–12, November 2000.
[308] K. E. Seamons and M. Winslett, “Multidimensional array I/O in Panda 1.0,” Journal of Supercomputing, vol. 10, no. 2, pp. 191–211, 1996.
[309] B. Seeger and H.-P. Kriegel, “The buddy-tree: An efficient and robust access method for spatial data base systems,” in Proceedings of the International Conference on Very Large Databases, pp. 590–601, 1990.
[310] M.
Seltzer, K. A. Smith, H. Balakrishnan, J. Chang, S. McMains, and V. Padmanabhan, “File system logging versus clustering: A performance comparison,” in Proceedings of the Annual USENIX Technical Conference, pp. 249–264, New Orleans, 1995.
[311] S. Sen, S. Chatterjee, and N. Dumir, “Towards a theory of cache-efficient algorithms,” Journal of the ACM, vol. 49, no. 6, pp. 828–858, 2002.
[312] R. Shah, P. J. Varman, and J. S. Vitter, “Online algorithms for prefetching and caching on parallel disks,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pp. 255–264, ACM Press, 2004.
[313] R. Shah, P. J. Varman, and J. S. Vitter, “On competitive online read-many parallel disks scheduling,” in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, p. 217, ACM Press, 2005.
[314] E. A. M. Shriver, A. Merchant, and J. Wilkes, “An analytic behavior model for disk drives with readahead caches and request reordering,” in Proceedings of ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 182–191, Madison, Wisc.: ACM Press, June 1998.
[315] E. A. M. Shriver and M. H. Nodine, “An introduction to parallel I/O models and algorithms,” in Input/Output in Parallel and Distributed Computer Systems, (R. Jain, J. Werth, and J. C. Browne, eds.), ch. 2, pp. 31–68, Kluwer Academic Publishers, 1996.
[316] E. A. M. Shriver and L. F. Wisniewski, “An API for choreographing data accesses,” Tech. Rep. PCS-TR95-267, Dept. of Computer Science, Dartmouth College, November 1995.
[317] J. F. Sibeyn, “From parallel to external list ranking,” Technical Report MPI–I–97–1–021, Max-Planck-Institut, September 1997.
[318] J. F. Sibeyn, “External selection,” Journal of Algorithms, vol. 58, no. 2, pp. 104–117, 2006.
[319] J. F. Sibeyn and M. Kaufmann, “BSP-like external-memory computation,” in Proceedings of the Italian Conference on Algorithms and Complexity, pp. 229–240, 1997.
[320] R. Sinha, S. Puglisi, A.
Moffat, and A. Turpin, “Improving suffix array locality for fast pattern matching on disk,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, Vancouver: ACM Press, June 2008.
[321] B. Srinivasan, “An adaptive overflow technique to defer splitting in B-trees,” The Computer Journal, vol. 34, no. 5, pp. 397–405, 1991.
[322] A. Srivastava and A. Eustace, “ATOM: A system for building customized program analysis tools,” ACM SIGPLAN Notices, vol. 29, pp. 196–205, June 1994.
[323] S. Subramanian and S. Ramaswamy, “The P-range tree: A new data structure for range searching in secondary memory,” in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 378–387, ACM Press, 1995.
[324] R. Tamassia and J. S. Vitter, “Optimal cooperative search in fractional cascaded data structures,” Algorithmica, vol. 15, pp. 154–171, February 1996.
[325] “TerraServer-USA: Microsoft’s online database of satellite images,” Available on the World-Wide Web at http://terraserver.microsoft.com/.
[326] R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi, “Passion: Optimized I/O for parallel applications,” IEEE Computer, vol. 29, pp. 70–78, June 1996.
[327] “Topologically Integrated Geographic Encoding and Referencing system, TIGER/Line 1992 datafiles,” Available on the World-Wide Web at http://www.census.gov/geo/www/tiger/, 1992.
[328] S. Toledo, “A survey of out-of-core algorithms in numerical linear algebra,” in External Memory Algorithms and Visualization, (J. Abello and J. S. Vitter, eds.), pp. 161–179, Providence, Rhode Island: American Mathematical Society Press, 1999.
[329] L. Toma and N. Zeh, “I/O-efficient algorithms for sparse graphs,” in Algorithms for Memory Hierarchies, (U. Meyer, P. Sanders, and J. Sibeyn, eds.), ch. 5, pp. 85–109, Berlin: Springer-Verlag, 2003.
[330] TPIE User Manual and Reference, “The manual and software distribution,” available on the web at http://www.cs.duke.edu/TPIE/, 1999.
[331] J. D.
Ullman and M. Yannakakis, “The input/output complexity of transitive closure,” Annals of Mathematics and Artificial Intelligence, vol. 3, pp. 331–360, 1991.
[332] J. Vahrenhold and K. Hinrichs, “Planar point location for large data sets: To seek or not to seek,” ACM Journal of Experimental Algorithmics, vol. 7, p. 8, August 2002.
[333] J. van den Bercken, B. Seeger, and P. Widmayer, “A generic approach to bulk loading multidimensional index structures,” in Proceedings of the International Conference on Very Large Databases, pp. 406–415, 1997.
[334] M. van Kreveld, J. Nievergelt, T. Roos, and P. Widmayer, eds., Algorithmic Foundations of GIS. Vol. 1340 of Lecture Notes in Computer Science, Springer-Verlag, 1997.
[335] P. J. Varman and R. M. Verma, “An efficient multiversion access structure,” IEEE Transactions on Knowledge and Data Engineering, vol. 9, pp. 391–409, May–June 1997.
[336] D. E. Vengroff and J. S. Vitter, “Efficient 3-D range searching in external memory,” in Proceedings of the ACM Symposium on Theory of Computing, pp. 192–201, Philadelphia: ACM Press, May 1996.
[337] D. E. Vengroff and J. S. Vitter, “I/O-efficient scientific computation using TPIE,” in Proceedings of NASA Goddard Conference on Mass Storage Systems, pp. II, 553–570, September 1996.
[338] P. Vettiger, M. Despont, U. Drechsler, U. Dürig, W. Häberle, M. I. Lutwyche, E. Rothuizen, R. Stutz, R. Widmer, and G. K. Binnig, “The ‘Millipede’ — more than one thousand tips for future AFM data storage,” IBM Journal of Research and Development, vol. 44, no. 3, pp. 323–340, 2000.
[339] J. S. Vitter, “Efficient memory access in large-scale computation,” in Proceedings of the Symposium on Theoretical Aspects of Computer Science, pp. 26–41, Springer-Verlag, 1991. Invited paper.
[340] J. S. Vitter, Notes. 1999.
[341] J. S. Vitter and P. Flajolet, “Average-case analysis of algorithms and data structures,” in Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, (J.
van Leeuwen, ed.), ch. 9, pp. 431–524, Elsevier and MIT Press, 1990.
[342] J. S. Vitter and D. A. Hutchinson, “Distribution sort with randomized cycling,” Journal of the ACM, vol. 53, pp. 656–680, July 2006.
[343] J. S. Vitter and P. Krishnan, “Optimal prefetching via data compression,” Journal of the ACM, vol. 43, pp. 771–793, September 1996.
[344] J. S. Vitter and M. H. Nodine, “Large-scale sorting in uniform memory hierarchies,” Journal of Parallel and Distributed Computing, vol. 17, pp. 107–114, 1993.
[345] J. S. Vitter and E. A. M. Shriver, “Algorithms for parallel memory I: Two-level memories,” Algorithmica, vol. 12, no. 2–3, pp. 110–147, 1994.
[346] J. S. Vitter and E. A. M. Shriver, “Algorithms for parallel memory II: Hierarchical multilevel memories,” Algorithmica, vol. 12, no. 2–3, pp. 148–169, 1994.
[347] J. S. Vitter and M. Wang, “Approximate computation of multidimensional aggregates of sparse data using wavelets,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 193–204, Philadelphia: ACM Press, June 1999.
[348] J. S. Vitter, M. Wang, and B. Iyer, “Data cube approximation and histograms via wavelets,” in Proceedings of the International ACM Conference on Information and Knowledge Management, pp. 96–104, Washington, DC: ACM Press, November 1998.
[349] M. Wang, B. Iyer, and J. S. Vitter, “Scalable mining for classification rules in relational databases,” in Herman Rubin Festschrift, Hayward, CA: Institute of Mathematical Statistics, Fall 2004.
[350] M. Wang, J. S. Vitter, L. Lim, and S. Padmanabhan, “Wavelet-based cost estimation for spatial queries,” in Proceedings of the International Symposium on Spatial and Temporal Databases, pp. 175–196, Redondo Beach, Cal.: Springer-Verlag, July 2001.
[351] R. W. Watson and R. A. Coyne, “The parallel I/O architecture of the high-performance storage system (HPSS),” in Proceedings of the IEEE Symposium on Mass Storage Systems, pp. 27–44, September 1995.
[352] P.
Weiner, “Linear pattern matching algorithms,” in Proceedings of the IEEE Symposium on Switching and Automata Theory, pp. 1–11, 1973.
[353] K.-Y. Whang and R. Krishnamurthy, “Multilevel grid files — a dynamic hierarchical multidimensional file structure,” in Proceedings of the International Symposium on Database Systems for Advanced Applications, pp. 449–459, World Scientific Press, 1992.
[354] D. E. Willard and G. S. Lueker, “Adding range restriction capability to dynamic data structures,” Journal of the ACM, vol. 32, no. 3, pp. 597–617, 1985.
[355] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images. Los Altos, Cal.: Morgan Kaufmann, 2nd ed., 1999.
[356] O. Wolfson, P. Sistla, B. Xu, J. Zhou, and S. Chamberlain, “DOMINO: Databases for moving objects tracking,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 547–549, Philadelphia: ACM Press, June 1999.
[357] D. Womble, D. Greenberg, S. Wheat, and R. Riesen, “Beyond core: Making parallel computer I/O practical,” in Proceedings of the DAGS Symposium on Parallel Computation, pp. 56–63, June 1993.
[358] C. Wu and T. Feng, “The universality of the shuffle-exchange network,” IEEE Transactions on Computers, vol. C-30, pp. 324–332, May 1981.
[359] Y. Xia, S. Prabhakar, S. Lei, R. Cheng, and R. Shah, “Indexing continuously changing data with mean-variance tree,” in Proceedings of the ACM Symposium on Applied Computing, pp. 52–57, ACM Press, March 2005.
[360] A. C. Yao, “On random 2-3 trees,” Acta Informatica, vol. 9, pp. 159–170, 1978.
[361] S. B. Zdonik and D. Maier, eds., Readings in Object-Oriented Database Systems. Morgan Kaufmann, 1990.
[362] N. Zeh, I/O-Efficient Algorithms for Shortest Path Related Problems. PhD thesis, School of Computer Science, Carleton University, 2002.
[363] W. Zhang and P.-A.
Larson, “Dynamic memory adjustment for external mergesort,” in Proceedings of the International Conference on Very Large Databases, pp. 376–385, 1997.
[364] B. Zhu, “Further computational geometry in secondary memory,” in Proceedings of the International Symposium on Algorithms and Computation, pp. 514–522, Springer-Verlag, 1994.
[365] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Transactions on Information Theory, vol. 24, pp. 530–536, September 1978.