k clusters in each bucket.

Suppose, to be specific, that we want to put a limit on the sum of the distances between all the points of a cluster and its centroid. Then, in addition to the count of points and the centroid of a cluster, we can include an estimate of this sum in the record for the cluster. When we initialize a bucket, we can compute the sum exactly. But as we merge clusters, this parameter becomes only an estimate. Suppose we merge two clusters and want to compute the sum of distances for the merged cluster. Use the notation for centroids and counts in Example 7.14, and in addition, let s1 and s2 be the sums for the two clusters. Then we may estimate this sum for the merged cluster to be

    n1|c1 − c| + n2|c2 − c| + s1 + s2

That is, we estimate the distance between any point x and the new centroid c to be the distance of that point to its old centroid (these distances sum to s1 + s2, the last two terms in the expression above) plus the distance from the old centroid to the new one (these distances sum to the first two terms of the expression above). Note that this estimate is an upper bound, by the triangle inequality.

An alternative is to replace the sum of distances by the sum of the squares of the distances from the points to the centroid. If these sums for the two clusters are t1 and t2, respectively, then we can estimate the same sum for the new cluster as

    n1|c1 − c|^2 + n2|c2 − c|^2 + t1 + t2

This estimate is close to correct if the space is high-dimensional, by the "curse of dimensionality." □

Example 7.16: Our third example will assume a non-Euclidean space and no constraint on the number of clusters. We shall borrow several of the techniques from the GRGPF Algorithm of Section 7.5. Specifically, we represent clusters by their clustroid and rowsum (the sum of the squares of the distances from each point of the cluster to its clustroid).
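When a cluster is small enough to examine all of its points, the clustroid and rowsum just described can be computed directly. Here is a minimal sketch (the function names are ours, and the distance function is a parameter, since the space is non-Euclidean):

```python
def rowsum(p, cluster, dist):
    # Sum of the squares of the distances from p to every point of the cluster.
    return sum(dist(p, q) ** 2 for q in cluster)

def clustroid(cluster, dist):
    # The clustroid is the point of the cluster with the minimum rowsum.
    return min(cluster, key=lambda p: rowsum(p, cluster, dist))

# Illustration in a 1-dimensional space with absolute difference as distance.
points = [0, 1, 2, 10]
d = lambda x, y: abs(x - y)
print(clustroid(points, d))  # prints 2; its rowsum 4 + 1 + 64 = 69 is smallest
```

In a streaming setting we would not rescan all points like this; the point of the BDMO records is precisely to maintain these quantities incrementally.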
We include in the record for a cluster information about a set of points at maximum distance from the clustroid, including their distances from the clustroid and their rowsums. Recall that the purpose of these points is to suggest a new clustroid when this cluster is merged with another.

When we merge buckets, we may choose among many ways to decide which clusters to merge. For example, we may consider pairs of clusters in order of the distance between their clustroids. We may also choose to merge clusters as we consider them, provided the sum of their rowsums is below a certain limit. Alternatively, we may perform the merge if the sum of the rowsums, divided by the number of points in the clusters, is below a limit. Any of the other strategies discussed for deciding when to merge clusters may be used as well, provided we arrange to maintain the data (e.g., cluster diameter) necessary to make those decisions.

We then must pick a new clustroid from among the points most distant from the clustroids of the two merged clusters. We can compute rowsums for each of these candidate clustroids using the formulas given in Section 7.5.4. We also follow the strategy given in that section to pick a subset of the distant points from each cluster to be the set of distant points for the merged cluster, and to compute the new rowsum and distance-to-clustroid for each. □

7.6.5 Answering Queries

Recall that we assume a query is a request for the clusters of the most recent m points in the stream, where m ≤ N. Because of the strategy we have adopted of combining buckets as we go back in time, we may not be able to find a set of buckets that covers exactly the last m points. However, if we choose the smallest set of buckets that covers the last m points, we shall include in these buckets no more than the last 2m points. We shall produce, as the answer to the query, the centroids or clustroids of all the points in the selected buckets.
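The bucket-selection step can be sketched as follows. We assume (the representation is ours) that each bucket is a pair of its size and its cluster summaries, listed from the most recent bucket to the oldest; we take buckets until the cumulative size first reaches m, which by the argument above covers at most the last 2m points:

```python
def buckets_for_query(buckets, m):
    # buckets: list of (size, cluster_summaries) pairs, most recent first.
    # Return the smallest prefix of buckets covering the last m points,
    # together with the number of points those buckets actually cover.
    chosen, covered = [], 0
    for bucket in buckets:
        chosen.append(bucket)
        covered += bucket[0]
        if covered >= m:
            break
    return chosen, covered

# Hypothetical buckets of sizes 1, 1, 2, 4, 8 (newest first); query m = 5.
chosen, covered = buckets_for_query(
    [(1, []), (1, []), (2, []), (4, []), (8, [])], 5)
print(len(chosen), covered)  # 4 buckets, covering 8 points (8 <= 2 * 5)
```

The clusters in the chosen buckets are then pooled and merged as described next.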
For the result to be a good approximation to the clusters for exactly the last m points, we must assume that the points between positions m + 1 and 2m (counting back from the most recent) do not have radically different statistics from the most recent m points. However, if the statistics vary too rapidly, recall from Section 4.6.6 that a more complex bucketing scheme can guarantee that we can find buckets covering at most the last m(1 + ε) points, for any ε > 0.

Having selected the desired buckets, we pool all their clusters. We then use some methodology for deciding which clusters to merge. Examples 7.14 and 7.16 illustrate two approaches to this merger. For instance, if we are required to produce exactly k clusters, then we can merge the clusters with the closest centroids until we are left with only k clusters, as in Example 7.14. Alternatively, we can decide whether or not to merge clusters in various ways, as we sampled in Example 7.16.

7.6.6 Clustering in a Parallel Environment

Now, let us briefly consider the use of the parallelism available in a computing cluster.³ We assume we are given a very large collection of points and wish to exploit parallelism to compute the centroids of their clusters. The simplest approach is to use a map-reduce strategy, but in most cases we are constrained to use a single Reduce task.

³Do not forget that the term "cluster" has two completely different meanings in this section.

Begin by creating many Map tasks. Each task is assigned a subset of the points. The Map function's job is to cluster the points it is given. Its output is a set of key-value pairs with a fixed key 1 and a value that is the description of one cluster. This description can be any of the possibilities suggested in Section 7.6.2, such as the centroid, count, and diameter of the cluster.

Since all key-value pairs have the same key, there can be only one Reduce task. This task gets descriptions of the clusters produced by each of the Map
tasks, and must merge them appropriately. We may use the discussion in Section 7.6.4 as representative of the various strategies we might use to produce the final clustering, which is the output of the Reduce task.

7.6.7 Exercises for Section 7.6

Exercise 7.6.1: Execute the BDMO Algorithm with p = 3 on the following 1-dimensional, Euclidean data:

1, 45, 80, 24, 56, 71, 17, 40, 66, 32, 48, 96, 9, 41, 75, 11, 58, 93, 28, 39, 77

The clustering algorithm is k-means with k = 3. Only the centroid of a cluster, along with its count, is needed to represent a cluster.

Exercise 7.6.2: Using your clusters from Exercise 7.6.1, produce the best centroids in response to a query asking for a clustering of the last 10 points.

7.7 Summary of Chapter 7

✦ Clustering: Clusters are often a useful summary of data that is in the form of points in some space. To cluster points, we need a distance measure on that space. Ideally, points in the same cluster have small distances between them, while points in different clusters have large distances between them.

✦ Clustering Algorithms: Clustering algorithms generally have one of two forms. Hierarchical clustering algorithms begin with all points in a cluster of their own, and nearby clusters are merged iteratively. Point-assignment clustering algorithms consider points in turn and assign them to the cluster in which they best fit.

✦ The Curse of Dimensionality: Points in high-dimensional Euclidean spaces, as well as points in non-Euclidean spaces, often behave unintuitively. Two unexpected properties of these spaces are that random points are almost always at about the same distance from each other, and random vectors are almost always orthogonal.

✦ Centroids and Clustroids: In a Euclidean space, the members of a cluster can be averaged, and this average is called the centroid.
In non-Euclidean spaces, there is no guarantee that points have an "average," so we are forced to use one of the members of the cluster as a representative or typical element of the cluster. That representative is called the clustroid.

✦ Choosing the Clustroid: There are many ways we can define a typical point of a cluster in a non-Euclidean space. For example, we could choose the point with the smallest sum of distances to the other points, the smallest sum of the squares of those distances, or the smallest maximum distance to any other point in the cluster.

✦ Radius and Diameter: Whether or not the space is Euclidean, we can define the radius of a cluster to be the maximum distance from the centroid or clustroid to any point in that cluster. We can define the diameter of the cluster to be the maximum distance between any two points in the cluster. Alternative definitions, especially of the radius, are also known, for example, the average distance from the centroid to the other points.

✦ Hierarchical Clustering: This family of algorithms has many variations, which differ primarily in two areas. First, we may choose in various ways which two clusters to merge next. Second, we may decide in various ways when to stop the merge process.

✦ Picking Clusters to Merge: One strategy for deciding on the best pair of clusters to merge in a hierarchical clustering is to pick the clusters with the closest centroids or clustroids. Another approach is to pick the pair of clusters with the closest points, one from each cluster. A third approach is to use the average distance between points from the two clusters.

✦ Stopping the Merger Process: A hierarchical clustering can proceed until there are a fixed number of clusters left. Alternatively, we could merge until it is impossible to find a pair of clusters whose merger is sufficiently compact, e.g., the merged cluster has a radius or diameter below some threshold.
Another approach involves merging as long as the resulting cluster has a sufficiently high "density," which can be defined in various ways, but is the number of points divided by some measure of the size of the cluster, e.g., the radius.

✦ K-Means Algorithms: This family of algorithms is of the point-assignment type and assumes a Euclidean space. It is assumed that there are exactly k clusters for some known k. After picking k initial cluster centroids, the points are considered one at a time and assigned to the closest centroid. The centroid of a cluster can migrate during point assignment, and an optional last step is to reassign all the points, while holding the centroids fixed at their final values obtained during the first pass.

✦ Initializing K-Means Algorithms: One way to find k initial centroids is to pick a random point, and then choose k − 1 additional points, each as far away as possible from the previously chosen points. An alternative is to start with a small sample of points and use a hierarchical clustering to merge them into k clusters.

✦ Picking K in a K-Means Algorithm: If the number of clusters is unknown, we can use a binary-search technique, trying a k-means clustering with different values of k. We search for the largest value of k for which a decrease below k clusters results in a radically higher average diameter of the clusters. This search can be carried out in a number of clustering operations that is logarithmic in the true value of k.

✦ The BFR Algorithm: This algorithm is a version of k-means designed to handle data that is too large to fit in main memory. It assumes clusters are normally distributed about the axes.

✦ Representing Clusters in BFR: Points are read from disk one chunk at a time. Clusters are represented in main memory by the count of the number of points, the vector sum of all the points, and the vector formed by summing the squares of the components of the points in each dimension.
Other collections of points, too far from a cluster centroid to be included in a cluster, are represented as "miniclusters" in the same way as the k clusters, while still other points, which are not near any other point, are represented as themselves and called "retained" points.

✦ Processing Points in BFR: Most of the points in a main-memory load will be assigned to a nearby cluster, and the parameters for that cluster will be adjusted to account for the new points. Unassigned points can be formed into new miniclusters, and these miniclusters can be merged with previously discovered miniclusters or retained points. After the last memory load, the miniclusters and retained points can be merged to their nearest cluster or kept as outliers.

✦ The CURE Algorithm: This algorithm is of the point-assignment type. It is designed for a Euclidean space, but clusters can have any shape. It handles data that is too large to fit in main memory.

✦ Representing Clusters in CURE: The algorithm begins by clustering a small sample of points. It then selects representative points for each cluster, by picking points in the cluster that are as far away from each other as possible. The goal is to find representative points on the fringes of the cluster. However, the representative points are then moved a fraction of the way toward the centroid of the cluster, so they lie somewhat in the interior of the cluster.

✦ Processing Points in CURE: After creating representative points for each cluster, the entire set of points can be read from disk and assigned to a cluster. We assign a given point to the cluster of the representative point that is closest to the given point.

✦ The GRGPF Algorithm: This algorithm is of the point-assignment type. It handles data that is too big to fit in main memory, and it does not assume a Euclidean space.
✦ Representing Clusters in GRGPF: A cluster is represented by the count of points in the cluster, the clustroid, a set of points nearest the clustroid, and a set of points furthest from the clustroid. The nearby points allow us to change the clustroid if the cluster evolves, and the distant points allow us to merge clusters efficiently in appropriate circumstances. For each of these points, we also record the rowsum, that is, the sum of the squares of the distances from that point to all the other points of the cluster.

✦ Tree Organization of Clusters in GRGPF: Cluster representations are organized into a tree structure like a B-tree, where nodes of the tree are typically disk blocks and contain information about many clusters. The leaves hold the representations of as many clusters as possible, while interior nodes hold a sample of the clustroids of the clusters at their descendant leaves. We organize the tree so that the clusters whose representatives are in any subtree are as close as possible.

✦ Processing Points in GRGPF: After initializing clusters from a sample of points, we insert each point into the cluster with the nearest clustroid. Because of the tree structure, we can start at the root and choose to visit the child with the sample clustroid nearest to the given point. Following this rule down one path in the tree leads us to a leaf, where we insert the point into the cluster with the nearest clustroid on that leaf.

✦ Clustering Streams: A generalization of the DGIM Algorithm (for counting 1's in the sliding window of a stream) can be used to cluster points that are part of a slowly evolving stream. The BDMO Algorithm uses buckets similar to those of DGIM, with allowable bucket sizes forming a sequence in which each size is twice the previous size.

✦ Representation of Buckets in BDMO: The size of a bucket is the number of points it represents.
The bucket itself holds only a representation of the clusters of these points, not the points themselves. A cluster representation includes a count of the number of points, the centroid or clustroid, and other information that is needed for merging clusters according to some selected strategy.

✦ Merging Buckets in BDMO: When buckets must be merged, we find the best matching of clusters, one from each of the buckets, and merge them in pairs. If the stream evolves slowly, then we expect consecutive buckets to have almost the same cluster centroids, so this matching makes sense.

✦ Answering Queries in BDMO: A query is a length of a suffix of the sliding window. We take all the clusters in all the buckets that are at least partially within that suffix and merge them using some strategy. The resulting clusters are the answer to the query.

✦ Clustering Using Map-Reduce: We can divide the data into chunks and cluster each chunk in parallel, using a Map task. The clusters from each Map task can be further clustered in a single Reduce task.

7.8 References for Chapter 7

The ancestral study of clustering for large-scale data is the BIRCH Algorithm of [6]. The BFR Algorithm is from [2]. The CURE Algorithm is found in [5]. The paper on the GRGPF Algorithm is [3]. The necessary background regarding B-trees and R-trees can be found in [4]. The study of clustering on streams is taken from [1].

1. B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," Proc. ACM Symp. on Principles of Database Systems, pp. 234–243, 2003.

2. P.S. Bradley, U.M. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases," Proc. Knowledge Discovery and Data Mining, pp. 9–15, 1998.

3. V. Ganti, R. Ramakrishnan, J. Gehrke, A.L. Powell, and J.C. French, "Clustering large datasets in arbitrary metric spaces," Proc. Intl. Conf. on Data Engineering, pp. 502–511, 1999.

4. H. Garcia-Molina, J.D. Ullman, and J.
Widom, Database Systems: The Complete Book, Second Edition, Prentice-Hall, Upper Saddle River, NJ, 2009.

5. S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 73–84, 1998.

6. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 103–114, 1996.

Chapter 8 Advertising on the Web

One of the big surprises of the 21st century has been the ability of all sorts of interesting Web applications to support themselves through advertising, rather than subscription. While radio and television have managed to use advertising as their primary revenue source, most media (newspapers and magazines, for example) have had to use a hybrid approach, combining revenue from advertising and subscriptions.

By far the most lucrative venue for on-line advertising has been search, and much of the effectiveness of search advertising comes from the "adwords" model of matching search queries to advertisements. We shall therefore devote much of this chapter to algorithms for optimizing the way this assignment is done. The algorithms used are of an unusual type; they are greedy and they are "on-line" in a particular technical sense to be discussed. We shall therefore digress to discuss these two algorithmic issues (greediness and on-line algorithms) in general, before tackling the adwords problem.

A second interesting on-line advertising problem involves selecting items to advertise at an on-line store. This problem involves "collaborative filtering," where we try to find customers with similar behavior in order to suggest they buy things that similar customers have bought. This subject will be treated in Section 9.3.

8.1 Issues in On-Line Advertising

In this section, we summarize the technical problems that are presented by the opportunities for on-line advertising.
We begin by surveying the types of ads found on the Web.

8.1.1 Advertising Opportunities

The Web offers many ways for an advertiser to show their ads to potential customers. Here are the principal venues.

1. Some sites, such as eBay, Craig's List, or auto trading sites, allow advertisers to post their ads directly, either for free, for a fee, or for a commission.

2. Display ads are placed on many Web sites. Advertisers pay for the display at a fixed rate per impression (one display of the ad with the download of the page by some user). Normally, a second download of the page, even by the same user, will result in the display of a different ad and is a second impression.

3. On-line stores such as Amazon show ads in many contexts. The ads are not paid for by the manufacturers of the products advertised, but are selected by the store to maximize the probability that the customer will be interested in the product. We consider this kind of advertising in Chapter 9.

4. Search ads are placed among the results of a search query. Advertisers bid for the right to have their ad shown in response to certain queries, but they pay only if the ad is clicked on. The particular ads to be shown are selected by a complex process, to be discussed in this chapter, involving the search terms that the advertiser has bid for, the amount of their bid, the observed probability that the ad will be clicked on, and the total budget that the advertiser has offered for the service.

8.1.2 Direct Placement of Ads

When advertisers can place ads directly, such as a free ad on Craig's List or the "buy it now" feature at eBay, there are several problems that the site must deal with. Ads are displayed in response to query terms, e.g., "apartment Palo Alto." The Web site can use an inverted index of words, just as a search engine does (see Section 5.1.1), and return those ads that contain all the words in the query.
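A minimal sketch of such an inverted index for ads (the data and names below are invented for illustration): each word maps to the set of ads containing it, and a query is answered by intersecting the posting sets of its words.

```python
from collections import defaultdict

def build_index(ads):
    # ads: dict mapping ad id -> set of words appearing in the ad.
    index = defaultdict(set)
    for ad_id, words in ads.items():
        for word in words:
            index[word].add(ad_id)
    return index

def matching_ads(index, query):
    # Return the ads that contain ALL the words of the query.
    postings = [index.get(word, set()) for word in query]
    return set.intersection(*postings) if postings else set()

ads = {"ad1": {"apartment", "palo", "alto"},
       "ad2": {"apartment", "berkeley"}}
index = build_index(ads)
print(matching_ads(index, ["apartment", "palo", "alto"]))  # {'ad1'}
```

A production index would also store positions and handle normalization, but the intersection of posting sets is the essential operation.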
Alternatively, one can ask the advertiser to specify parameters of the ad, which are stored in a database. For instance, an ad for a used car could specify the manufacturer, model, color, and year from pull-down menus, so only clearly understood terms can be used. Queryers can use the same menus of terms in their queries.

Ranking ads is a bit more problematic, since there is nothing like the links on the Web to tell us which ads are more "important." One strategy used is "most-recent first." That strategy, while equitable, is subject to abuse, where advertisers post small variations of their ads at frequent intervals. The technology for discovering ads that are too similar has already been covered, in Section 3.4.

An alternative approach is to try to measure the attractiveness of an ad. Each time it is displayed, record whether or not the queryer clicked on it. Presumably, attractive ads will be clicked on more frequently than those that are not. However, there are several factors that must be considered in evaluating ads:

1. The position of the ad in a list has great influence on whether or not it is clicked. The first ad on the list has by far the highest probability of being clicked, and the probability drops off exponentially as the position increases.

2. The ad may have attractiveness that depends on the query terms. For example, an ad for a used convertible would be more attractive if the search query includes the term "convertible," even though it might be a valid response to queries that look for that make of car, without specifying whether or not a convertible is wanted.

3. All ads deserve the opportunity to be shown until their click probability can be approximated closely. If we start all ads out with a click probability of 0, we shall never show them and thus never learn whether or not they are attractive ads.

8.1.3 Issues for Display Ads

This form of advertising on the Web most resembles advertising in traditional media.
An ad for a Chevrolet run in the pages of the New York Times is a display ad, and its effectiveness is limited. It may be seen by many people, but most of them are not interested in buying a car, just bought a car, don't drive, or have another good reason to ignore the ad. Yet the cost of printing the ad was still borne by the newspaper and hence by the advertiser. An impression of a similar ad on the Yahoo! home page is going to be relatively ineffective for essentially the same reason. The fee for placing such an ad is typically a fraction of a cent per impression.

The response of traditional media to this lack of focus was to create newspapers or magazines for special interests. If you are a manufacturer of golf clubs, running your ad in Golf Digest would give you an order-of-magnitude increase in the probability that the person seeing your ad would be interested in it. This phenomenon explains the existence of many specialized, low-circulation magazines. They are able to charge much more per impression for an ad than is a general-purpose outlet such as a daily newspaper. The same phenomenon appears on the Web. An ad for golf clubs on sports.yahoo.com/golf has much more value per impression than does the same ad on the Yahoo! home page or an ad for Chevrolets on the Yahoo! golf page.

However, the Web offers an opportunity to tailor display ads in a way that hardcopy media cannot: it is possible to use information about the user to determine which ad they should be shown, regardless of what page they are looking at. If it is known that Sally likes golf, then it makes sense to show her an ad for golf clubs, regardless of what page she is looking at. We could determine Sally's love for golf in various ways:

1. She may belong to a golf-related group on Facebook.

2. She may mention "golf" frequently in emails posted on her gmail account.

3. She may spend a lot of time on the Yahoo! golf page.

4.
She may issue search queries with golf-related terms frequently.

5. She may bookmark the Web sites of one or more golf courses.

Each of these methods, and many others like them, raises enormous privacy issues. It is not the purpose of this book to try to resolve those issues, which in practice probably have no solution that will satisfy all concerns. On the one hand, people like the free services that have recently become advertising-supported, and these services depend on advertising being much more effective than conventional ads. There is general agreement that, if there must be ads, it is better to see things you might actually use than to have the pages you view cluttered with irrelevancies. On the other hand, there is great potential for misuse if the information leaves the realm of the machines that execute advertising algorithms and gets into the hands of real people.

8.2 On-Line Algorithms

Before addressing the question of matching advertisements to search queries, we shall digress slightly by examining the general class to which such algorithms belong. This class is referred to as "on-line," and these algorithms generally involve an approach called "greedy." We also give, in the next section, a preliminary example of an on-line greedy algorithm for a simpler problem: maximal matching.

8.2.1 On-Line and Off-Line Algorithms

Typical algorithms work as follows. All the data needed by the algorithm is presented initially. The algorithm can access the data in any order. At the end, the algorithm produces its answer. Such an algorithm is called off-line.

However, there are times when we cannot see all the data before our algorithm must make some decisions. Chapter 4 covered stream mining, where we could store only a limited amount of the stream and had to answer queries about the entire stream when called upon to do so. There is an extreme form of stream processing, where we must respond with an output after each stream element arrives.
We thus must decide about each stream element knowing nothing at all of the future. Algorithms of this class are called on-line algorithms.¹

¹Unfortunately, we are faced with another case of dual meanings, like the coincidence involving the term "cluster" that we noted in Section 7.6.6, where we needed to interpret properly phrases such as "algorithms for computing clusters on computer clusters." Here, the term "on-line" refers to the nature of the algorithm, and should not be confused with "on-line" meaning "on the Internet" in phrases such as "on-line algorithms for on-line advertising."

As the case in point, selecting ads to show with search queries would be relatively simple if we could do it off-line. We would see a month's worth of search queries, look at the bids advertisers made on search terms, as well as their advertising budgets for the month, and we could then assign ads to the queries in a way that maximized both the revenue to the search engine and the number of impressions that each advertiser got.

The problem with off-line algorithms is that most queryers don't want to wait a month to get their search results. Thus, we must use an on-line algorithm to assign ads to search queries. That is, when a search query arrives, we must select the ads to show with that query immediately. We can use information about the past, e.g., we do not have to show an ad if the advertiser's budget has already been spent, and we can examine the click-through rate (the fraction of the time the ad is clicked on when it is displayed) that an ad has obtained so far. However, we cannot use anything about future search queries. For instance, we cannot know whether there will be lots of queries arriving later that use search terms on which this advertiser has made higher bids.

Example 8.1: Let us take a very simple example of why knowing the future could help.
A manufacturer A of replica antique furniture has bid 10 cents on the search term "chesterfield."² A more conventional manufacturer B has bid 20 cents on both the terms "chesterfield" and "sofa." Both have monthly budgets of $100, and there are no other bidders on either of these terms. It is the beginning of the month, and a search query "chesterfield" has just arrived. We are allowed to display only one ad with the query.

The obvious thing to do is to display B's ad, because they bid more. However, suppose there will be lots of search queries this month for "sofa," but very few for "chesterfield." Then A will never spend its $100 budget, while B will spend its full budget even if we give the query to A. Specifically, if there will be at least 500 more queries for either "sofa" or "chesterfield," then there is no harm, and potentially a benefit, in giving the query to A. It will still be possible for B to spend its entire budget, while we are increasing the amount of A's budget that will be spent. Note that this argument makes sense both from the point of view of the search engine, which wants to maximize total revenue, and from the point of view of both A and B, who presumably want to get all the impressions that their budgets allow.

If we could know the future, then we would know how many more "sofa" queries and how many more "chesterfield" queries were going to arrive this month. If that number is below 500, then we want to give the query to B to maximize revenue, but if it is 500 or more, then we want to give it to A. Since we don't know the future, an on-line algorithm cannot always do as well as an off-line algorithm. □

²A chesterfield is a type of sofa. See, for example, www.chesterfields.info.

8.2.2 Greedy Algorithms

Many on-line algorithms are of the greedy algorithm type. These algorithms make their decision in response to each input element by maximizing some function of the input element and the past.
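For the ad scenario of Example 8.1, such a greedy policy can be sketched as follows; money is kept in cents to avoid rounding, the names are ours, and, following the simplification of the example, each assigned query earns the full bid:

```python
def greedy_assign(queries, bids, budgets):
    # bids: term -> list of (advertiser, bid in cents).
    # budgets: advertiser -> remaining budget in cents (mutated in place).
    revenue = 0
    for term in queries:
        # Highest bidder on this term who can still pay their bid.
        candidates = [(cents, adv) for adv, cents in bids.get(term, [])
                      if budgets[adv] >= cents]
        if candidates:
            cents, adv = max(candidates)
            budgets[adv] -= cents
            revenue += cents
    return revenue

bids = {"chesterfield": [("A", 10), ("B", 20)], "sofa": [("B", 20)]}
budgets = {"A": 10000, "B": 10000}  # $100 each
# 500 "chesterfield" queries followed by 500 "sofa" queries.
queries = ["chesterfield"] * 500 + ["sofa"] * 500
print(greedy_assign(queries, bids, budgets))  # 10000 cents, i.e., $100
```

On this query sequence the greedy policy gives all 500 "chesterfield" queries to B, exhausting B's budget, and then has no bidder left for "sofa."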
Example 8.2: The obvious greedy algorithm for the situation described in Example 8.1 is to assign a query to the highest bidder who still has budget left. For the data of that example, what will happen is that the first 500 "sofa" or "chesterfield" queries will be assigned to B. At that time, B runs out of budget and is assigned no more queries. After that, the next 1000 "chesterfield" queries are assigned to A, while "sofa" queries get no ad and therefore earn the search engine no money.

The worst thing that can happen is that 500 "chesterfield" queries arrive, followed by 500 "sofa" queries. An off-line algorithm could optimally assign the first 500 to A, earning $50, and the next 500 to B, earning $100, for a total of $150. However, the greedy algorithm will assign the first 500 to B, earning $100, and then has no ad for the next 500, earning nothing. □

8.2.3 The Competitive Ratio

As we see from Example 8.2, an on-line algorithm need not give as good a result as the best off-line algorithm for the same problem. The most we can expect is that there will be some constant c less than 1 such that, on any input, the result of a particular on-line algorithm is at least c times the result of the optimum off-line algorithm. The constant c, if it exists, is called the competitive ratio for the on-line algorithm.

Example 8.3: The greedy algorithm, on the particular data of Example 8.2, gives a result that is 2/3 as good as that of the optimum algorithm: $100 versus $150. That proves that the competitive ratio is no greater than 2/3. But it could be less. The competitive ratio for an algorithm may depend on what kind of data is allowed to be input to the algorithm. Even if we restrict inputs to the situation described in Example 8.2, but with the bids allowed to vary, then we can show the greedy algorithm has a competitive ratio no greater than 1/2. Just raise the bid by A to ε less than 20 cents.
As ǫ approaches 0, the greedy algorithm still produces only $100, but the return from the optimum algorithm approaches $200. We can show that it is impossible to do worse than half the optimum in this simple case, so the competitive ratio is indeed 1/2. However, we’ll leave this sort of proof for later sections. 2

8.2.4 Exercises for Section 8.2

! Exercise 8.2.1: A popular example of the design of an on-line algorithm to minimize the competitive ratio is the ski-buying problem.3 Suppose you can buy skis for $100, or you can rent skis for $10 per day. You decide to take up skiing, but you don’t know if you will like it. You may try skiing for any number of days and then give it up. The merit of an algorithm is the cost per day of skis, and we must try to minimize this cost.

3Thanks to Anna Karlin for this example.

One on-line algorithm for making the rent/buy decision is “buy skis immediately.” If you try skiing once, fall down and give it up, then this on-line algorithm costs you $100 per day, while the optimum off-line algorithm would have you rent skis for $10 for the one day you used them. Thus, the competitive ratio of the algorithm “buy skis immediately” is at most 1/10th, and that is in fact the exact competitive ratio, since using the skis one day is the worst possible outcome for this algorithm. On the other hand, the on-line algorithm “always rent skis” has an arbitrarily small competitive ratio. If you turn out to really like skiing and go regularly, then after n days, you will have paid $10n or $10/day, while the optimum off-line algorithm would have bought skis at once, and paid only $100, or $100/n per day.

Your question: design an on-line algorithm for the ski-buying problem that has the best possible competitive ratio. What is that competitive ratio?
Hint: Since you could, at any time, have a fall and decide to give up skiing, the only thing the on-line algorithm can use in making its decision is how many times previously you have gone skiing.

8.3 The Matching Problem

We shall now take up a problem that is a simplified version of the problem of matching ads to search queries. This problem, called “maximal matching,” is an abstract problem involving bipartite graphs (graphs with two sets of nodes – left and right – with all edges connecting a node in the left set to a node in the right set). Figure 8.1 is an example of a bipartite graph. Nodes 1, 2, 3, and 4 form the left set, while nodes a, b, c, and d form the right set.

8.3.1 Matches and Perfect Matches

Suppose we are given a bipartite graph. A matching is a subset of the edges such that no node is an end of two or more edges. A matching is said to be perfect if every node appears in the matching. Note that a matching can only be perfect if the left and right sets are of the same size. A matching that is as large as any other matching for the graph in question is said to be maximal.

Example 8.4: The set of edges {(1, a), (2, b), (3, d)} is a matching for the bipartite graph of Fig. 8.1. Each member of the set is an edge of the bipartite graph, and no node appears more than once. The set of edges {(1, c), (2, b), (3, d), (4, a)} is a perfect matching, represented by heavy lines in Fig. 8.2. Every node appears exactly once. It is, in fact, the sole perfect matching for this graph, although some bipartite graphs have more than one perfect matching. The matching of Fig. 8.2 is also maximal, since every perfect matching is maximal. 2

Figure 8.1: A bipartite graph

8.3.2 The Greedy Algorithm for Maximal Matching

Off-line algorithms for finding a maximal matching have been studied for decades, and one can get very close to O(n^2) for an n-node graph.
On-line algorithms for the problem have also been studied, and it is this class of algorithms we shall consider here. In particular, the greedy algorithm for maximal matching works as follows. We consider the edges in whatever order they are given. When we consider (x, y), add this edge to the matching if neither x nor y are ends of any edge selected for the matching so far. Otherwise, skip (x, y).

Example 8.5: Let us consider a greedy match for the graph of Fig. 8.1. Suppose we order the edges lexicographically, that is, by order of their left node, breaking ties by the right node. Then we consider the edges in the order (1, a), (1, c), (2, b), (3, b), (3, d), (4, a). The first edge, (1, a), surely becomes part of the matching. The second edge, (1, c), cannot be chosen, because node 1 already appears in the matching. The third edge, (2, b), is selected, because neither node 2 nor node b appears in the matching so far. Edge (3, b) is rejected for the match because b is already matched, but then (3, d) is added to the match because neither 3 nor d has been matched so far. Finally, (4, a) is rejected because a appears in the match. Thus, the matching produced by the greedy algorithm for this ordering of the edges is {(1, a), (2, b), (3, d)}. As we saw, this matching is not maximal. 2

Example 8.6: A greedy match can be even worse than that of Example 8.5. On the graph of Fig. 8.1, any ordering that begins with the two edges (1, a) and (3, b), in either order, will match those two pairs but then will be unable to match nodes 2 or 4. Thus, the size of the resulting match is only 2. 2

Figure 8.2: The only perfect matching for the graph of Fig. 8.1

8.3.3 Competitive Ratio for Greedy Matching

We can show a competitive ratio of 1/2 for the greedy matching algorithm of Section 8.3.2. First, the ratio cannot be more than 1/2. We already saw that for the graph of Fig. 8.1, there is a perfect matching of size 4.
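The greedy rule of Section 8.3.2 takes only a few lines of code. The sketch below (the function name is my own) reproduces the size-3 match of Example 8.5 and the size-2 match of Example 8.6 on the graph of Fig. 8.1.

```python
def greedy_matching(edges):
    """Take each (left, right) edge in the order given, if both ends are free."""
    matched_left, matched_right = set(), set()
    matching = []
    for x, y in edges:
        if x not in matched_left and y not in matched_right:
            matching.append((x, y))
            matched_left.add(x)
            matched_right.add(y)
    return matching

# Edges of Fig. 8.1 in the lexicographic order of Example 8.5.
edges = [(1, 'a'), (1, 'c'), (2, 'b'), (3, 'b'), (3, 'd'), (4, 'a')]
print(greedy_matching(edges))
# [(1, 'a'), (2, 'b'), (3, 'd')] -- size 3, not maximal

# An ordering beginning with (1,a) and (3,b), as in Example 8.6, strands 2 and 4.
bad_order = [(1, 'a'), (3, 'b'), (1, 'c'), (2, 'b'), (3, 'd'), (4, 'a')]
print(len(greedy_matching(bad_order)))   # 2
```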
However, if the edges are presented in any of the orders discussed in Example 8.6, the size of the match is only 2, or half the optimum. Since the competitive ratio for an algorithm is the minimum over all possible inputs of the ratio of what that algorithm achieves to the optimum result, we see that 1/2 is an upper bound on the competitive ratio.

Suppose Mo is a maximal matching, and Mg is the matching that the greedy algorithm produces. Let L be the set of left nodes that are matched in Mo but not in Mg. Let R be the set of right nodes that are connected by edges to any node in L. We claim that every node in R is matched in Mg. Suppose not; in particular, suppose node r in R is not matched in Mg. Then the greedy algorithm will eventually consider some edge (ℓ, r), where ℓ is in L. At that time, neither end of this edge is matched, because we have supposed that neither ℓ nor r is ever matched by the greedy algorithm. That observation contradicts the definition of how the greedy algorithm works; that is, the greedy algorithm would indeed match (ℓ, r). We conclude that every node in R is matched in Mg.

Now, we know several things about the sizes of sets and matchings.

1. |Mo| ≤ |Mg| + |L|, since among the nodes on the left, only nodes in L can be matched in Mo but not Mg.

2. |L| ≤ |R|, because in Mo, all the nodes in L were matched.

3. |R| ≤ |Mg|, because every node in R is matched in Mg.

Now, (2) and (3) give us |L| ≤ |Mg|. That, together with (1), gives us |Mo| ≤ 2|Mg|, or |Mg| ≥ |Mo|/2. The latter inequality says that the competitive ratio is at least 1/2. Since we already observed that the competitive ratio is no more than 1/2, we now conclude the ratio is exactly 1/2.

8.3.4 Exercises for Section 8.3

Exercise 8.3.1: Define the graph Gn to have the 2n nodes a0, a1, . . . , an−1, b0, b1, . . . , bn−1 and the following edges.
Each node ai, for i = 0, 1, . . . , n − 1, is connected to the nodes bj and bk, where

j = 2i mod n and k = (2i + 1) mod n

For instance, the graph G4 has the following edges: (a0, b0), (a0, b1), (a1, b2), (a1, b3), (a2, b0), (a2, b1), (a3, b2), and (a3, b3).

(a) Find a perfect matching for G4.

(b) Find a perfect matching for G5.

!!(c) Prove that for every n, Gn has a perfect matching.

! Exercise 8.3.2: How many perfect matchings do the graphs G4 and G5 of Exercise 8.3.1 have?

! Exercise 8.3.3: Whether or not the greedy algorithm gives us a perfect matching for the graph of Fig. 8.1 depends on the order in which we consider the edges. Of the 6! possible orders of the six edges, how many give us a perfect matching? Give a simple test for distinguishing those orders that do give the perfect matching from those that do not.

8.4 The Adwords Problem

We now consider the fundamental problem of search advertising, which we term the “adwords problem,” because it was first encountered in the Google Adwords system. We then discuss a greedy algorithm called “Balance” that offers a good competitive ratio. We analyze this algorithm for a simplified case of the adwords problem.

8.4.1 History of Search Advertising

Around the year 2000, a company called Overture (later bought by Yahoo!) introduced a new kind of search. Advertisers bid on keywords (words in a search query), and when a user searched for that keyword, the links to all the advertisers who bid on that keyword were displayed in the order highest-bid-first. If the advertiser’s link was clicked on, they paid what they had bid. That sort of search was very useful for the case where the search queryer really was looking for advertisements, but it was rather useless for the queryer who was just looking for information.
Recall our discussion in Section 5.1.1 about the point that unless a search engine can provide reliable responses to queries that are for general information, no one will want to use the search engine when they are looking to buy something. Several years later, Google adapted the idea in a system called Adwords. By that time, the reliability of Google was well established, so people were willing to trust the ads they were shown. Google kept the list of responses based on PageRank and other objective criteria separate from the list of ads, so the same system was useful for the queryer who just wanted information as well as the queryer looking to buy something.

The Adwords system went beyond the earlier system in several ways that made the selection of ads more complex.

1. Google would show only a limited number of ads with each query. Thus, while Overture simply ordered all ads for a given keyword, Google had to decide which ads to show, as well as the order in which to show them.

2. Users of the Adwords system specified a budget: the amount they were willing to pay for all clicks on their ads in a month. These constraints make the problem of assigning ads to search queries more complex, as we hinted at in Example 8.1.

3. Google did not simply order ads by the amount of the bid, but by the amount they expected to receive for display of each ad. That is, the click-through rate was observed for each ad, based on the history of displays of that ad. The value of an ad was taken to be the product of the bid and the click-through rate.

8.4.2 Definition of the Adwords Problem

Of course, the decision regarding which ads to show must be made on-line. Thus, we are only going to consider on-line algorithms for solving the adwords problem, which is as follows.

• Given:

1. A set of bids by advertisers for search queries.

2. A click-through rate for each advertiser-query pair.

3. A budget for each advertiser.
We shall assume budgets are for a month, although any unit of time could be used.

4. A limit on the number of ads to be displayed with each search query.

• Respond to each search query with a set of advertisers such that:

1. The size of the set is no larger than the limit on the number of ads per query.

2. Each advertiser has bid on the search query.

3. Each advertiser has enough budget left to pay for the ad if it is clicked upon.

The revenue of a selection of ads is the total value of the ads selected, where the value of an ad is the product of the bid and the click-through rate for the ad and query. The merit of an on-line algorithm is the total revenue obtained over a month (the time unit over which budgets are assumed to apply). We shall try to measure the competitive ratio for algorithms, that is, the minimum total revenue for that algorithm, on any sequence of search queries, divided by the revenue of the optimum off-line algorithm for the same sequence of search queries.

8.4.3 The Greedy Approach to the Adwords Problem

Since only an on-line algorithm is suitable for the adwords problem, we should first examine the performance of the obvious greedy algorithm. We shall make a number of simplifications to the environment; our purpose is to show eventually that there is a better algorithm than the obvious greedy algorithm. The simplifications:

(a) There is one ad shown for each query.

(b) All advertisers have the same budget.

(c) All click-through rates are the same.

(d) All bids are either 0 or 1. Alternatively, we may assume that the value of each ad (product of bid and click-through rate) is the same.

The greedy algorithm picks, for each search query, any advertiser who has bid 1 for that query. The competitive ratio for this algorithm is 1/2, as the following example shows.

Example 8.7: Suppose there are two advertisers A and B, and only two possible queries, x and y. Advertiser A bids only on x, while B bids on both x and y.
The budget for each advertiser is 2. Notice the similarity to the situation in Example 8.1; the only differences are the fact that the bids by each advertiser are the same and the budgets are smaller here.

Adwords Aspects not in Our Model

There are several ways in which the real AdWords system differs from the simplified model of this section.

Matching Bids and Search Queries: In our simplified model, advertisers bid on sets of words, and an advertiser’s bid is eligible to be shown for search queries that have exactly the same set of words as the advertiser’s bid. In reality, Google, Yahoo!, and Microsoft all offer advertisers a feature known as broad matching, where an ad is eligible to be shown for search queries that are inexact matches of the bid keywords. Examples include queries that include a subset or superset of keywords, and also queries that use words with very similar meanings to the words the advertiser bid on. For such broad matches, search engines charge the advertiser based on complicated formulas taking into account how closely related the search query is to the advertiser’s bid. These formulas vary across search engines and are not made public.

Charging Advertisers for Clicks: In our simplified model, when a user clicks on an advertiser’s ad, the advertiser is charged the amount they bid. This policy is known as a first-price auction. In reality, search engines use a more complicated system known as a second-price auction, where each advertiser pays approximately the bid of the advertiser who placed immediately behind them in the auction. For example, the first-place advertiser for a search might pay the bid of the advertiser in second place, plus one cent. It has been shown that second-price auctions are less susceptible to being gamed by advertisers than first-price auctions and lead to higher revenues for the search engine.

Let the sequence of queries be xxyy.
The greedy algorithm is able to allocate the first two x’s to B, whereupon there is no one with an unexpended budget to pay for the two y’s. The revenue for the greedy algorithm in this case is thus 2. However, the optimum off-line algorithm will allocate the x’s to A and the y’s to B, achieving a revenue of 4. The competitive ratio for the greedy algorithm is thus no more than 1/2. We can argue that on any sequence of queries the ratio of the revenues for the greedy and optimal algorithms is at least 1/2, using essentially the same idea as in Section 8.3.3. 2

8.4.4 The Balance Algorithm

There is a simple improvement to the greedy algorithm that gives a competitive ratio of 3/4 for the simple case of Section 8.4.3. This algorithm, called the Balance Algorithm, assigns a query to the advertiser who bids on the query and has the largest remaining budget. Ties may be broken arbitrarily.

Example 8.8: Consider the same situation as in Example 8.7. The Balance Algorithm can assign the first query x to either A or B, because they both bid on x and their remaining budgets are the same. However, the second x must be assigned to the other of A and B, because they then have the larger remaining budget. The first y is assigned to B, since it has budget remaining and is the only bidder on y. The last y cannot be assigned, since B is out of budget, and A did not bid. Thus, the total revenue for the Balance Algorithm on this data is 3. In comparison, the total revenue for the optimum off-line algorithm is 4, since it can assign the x’s to A and the y’s to B. Our conclusion is that, for the simplified adwords problem of Section 8.4.3, the competitive ratio of the Balance Algorithm is no more than 3/4. We shall see next that with only two advertisers, 3/4 is exactly the competitive ratio, although as the number of advertisers grows, the competitive ratio lowers to 0.63 (actually 1 − 1/e) but no lower.
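Both policies are easy to simulate on the data of Examples 8.7 and 8.8 (unit bids, budget 2 each). In this sketch the function names are my own, and the particular choice rules (greedy picking the last live bidder, Balance picking the largest remaining budget) are illustrative assumptions that reproduce the worst cases described in the text.

```python
def greedy(queries, bidders, budget):
    """Assign each query to an arbitrary live bidder (here: the last one)."""
    remaining = dict(budget)
    revenue = 0
    for q in queries:
        live = [a for a in bidders if q in bidders[a] and remaining[a] > 0]
        if live:
            a = live[-1]          # an arbitrary choice; here it happens to pick B
            remaining[a] -= 1
            revenue += 1
    return revenue

def balance(queries, bidders, budget):
    """Assign each query to the live bidder with the largest remaining budget."""
    remaining = dict(budget)
    revenue = 0
    for q in queries:
        live = [a for a in bidders if q in bidders[a] and remaining[a] > 0]
        if live:
            a = max(live, key=lambda b: remaining[b])   # ties broken arbitrarily
            remaining[a] -= 1
            revenue += 1
    return revenue

bidders = {"A": {"x"}, "B": {"x", "y"}}
budget = {"A": 2, "B": 2}
print(greedy("xxyy", bidders, budget))    # 2: both x's go to B, the y's are lost
print(balance("xxyy", bidders, budget))   # 3: the optimum off-line revenue is 4
```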
2

8.4.5 A Lower Bound on Competitive Ratio for Balance

In this section we shall prove that in the simple case of the Balance Algorithm that we are considering, the competitive ratio is 3/4. Given Example 8.8, we have only to prove that the total revenue obtained by the Balance Algorithm is at least 3/4 of the revenue for the optimum off-line algorithm. Thus, consider a situation in which there are two advertisers, A1 and A2, each with a budget of B. We shall assume that each query is assigned to an advertiser by the optimum algorithm. If not, we can delete those queries without affecting the revenue of the optimum algorithm and possibly reducing the revenue of Balance. Thus, the lowest possible competitive ratio is achieved when the query sequence consists only of queries assigned by the optimum algorithm. We shall also assume that both advertisers’ budgets are consumed by the optimum algorithm. If not, we can reduce the budgets, and again argue that the revenue of the optimum algorithm is not reduced while that of Balance can only shrink. That change may force us to use different budgets for the two advertisers, but we shall continue to assume the budgets are both B. We leave as an exercise the extension of the proof to the case where the budgets of the two advertisers are different.

Figure 8.3 suggests how the 2B queries are assigned to advertisers by the two algorithms. In (a) we see that B queries are assigned to each of A1 and A2 by the optimum algorithm. Now, consider how these same queries are assigned by Balance. First, observe that Balance must exhaust the budget of at least one of the advertisers, say A2. If not, then there would be some query assigned to neither advertiser, even though both had budget. We know at least one of the advertisers bids on each query, because that query is assigned in the optimum algorithm. That situation contradicts how Balance is defined to operate; it always assigns a query if it can. Thus, we see in Fig.
8.3(b) that A2 is assigned B queries. These queries could have been assigned to either A1 or A2 by the optimum algorithm.

Figure 8.3: Illustration of the assignments of queries to advertisers in the optimum and Balance algorithms

We also see in Fig. 8.3(b) that we use y as the number of queries assigned to A1 and x as B − y. It is our goal to show y ≥ x. That inequality will show the revenue of Balance is at least 3B/2, or 3/4th the revenue of the optimum algorithm. We note that x is also the number of unassigned queries for the Balance Algorithm, and that all the unassigned queries must have been assigned to A2 by the optimum algorithm. The reason is that A1 never runs out of budget, so any query assigned by the optimum algorithm to A1 is surely bid on by A1. Since A1 always has budget during the running of the Balance Algorithm, that algorithm will surely assign this query, either to A1 or to A2.

There are two cases, depending on whether more of the queries that are assigned to A1 by the optimum algorithm are assigned to A1 or A2 by Balance.

1. Suppose at least half of these queries are assigned by Balance to A1. Then y ≥ B/2, so surely y ≥ x.

2. Suppose more than half of these queries are assigned by Balance to A2. Consider the last of these queries q that is assigned to A2 by the Balance Algorithm. At that time, A2 must have had at least as great a budget available as A1, or else Balance would have assigned query q to A1, just as the optimum algorithm did. Since more than half of the B queries that the optimum algorithm assigns to A1 are assigned to A2 by Balance, we know that when q was assigned, the remaining budget of A2 was less than B/2. Therefore, at that time, the remaining budget of A1 was also less than B/2. Since budgets only decrease, we know that x < B/2. It follows that y > x, since x + y = B.
We conclude that y ≥ x in either case, so the competitive ratio of the Balance Algorithm is 3/4.

8.4.6 The Balance Algorithm with Many Bidders

When there are many advertisers, the competitive ratio for the Balance Algorithm can be under 3/4, but not too far below that fraction. The worst case for Balance is as follows.

1. There are N advertisers, A1, A2, . . . , AN.

2. Each advertiser has a budget B = N!.

3. There are N queries q1, q2, . . . , qN.

4. Advertiser Ai bids on queries q1, q2, . . . , qi and no other queries.

5. The query sequence consists of N rounds. The ith round consists of B occurrences of query qi and nothing else.

The optimum off-line algorithm assigns the B queries qi in the ith round to Ai for all i. Thus, all queries are assigned to a bidder, and the total revenue of the optimum algorithm is NB. However, the Balance Algorithm assigns each of the queries in round 1 to the N advertisers equally, because all bid on q1, and the Balance Algorithm prefers the bidder with the greatest remaining budget. Thus, each advertiser gets B/N of the queries q1. Now consider the queries q2 in round 2. All but A1 bid on these queries, so they are divided equally among A2 through AN, with each of these N − 1 bidders getting B/(N − 1) queries. The pattern, suggested by Fig. 8.4, repeats for each round i = 3, 4, . . . , with Ai through AN getting B/(N − i + 1) queries. However, eventually, the budgets of the higher-numbered advertisers will be exhausted. That will happen at the lowest round j such that

B(1/N + 1/(N − 1) + · · · + 1/(N − j + 1)) ≥ B

that is,

1/N + 1/(N − 1) + · · · + 1/(N − j + 1) ≥ 1

Figure 8.4: Apportioning queries to N advertisers in the worst case

Euler showed that as k gets large, the sum 1 + 1/2 + · · · + 1/k approaches loge k. Using this observation, we can approximate the above sum as loge N − loge(N − j). We are thus looking for the j such that loge N − loge(N − j) = 1, approximately.
If we replace loge N − loge(N − j) by the equivalent loge N/(N − j) and exponentiate both sides of the equation loge N/(N − j) = 1, we get N/(N − j) = e. Solving this equation for j, we get

j = N(1 − 1/e)

as the approximate value of j for which all advertisers are either out of budget or do not bid on any of the remaining queries. Thus, the approximate revenue obtained by the Balance Algorithm is BN(1 − 1/e), that is, the queries of the first j rounds. Therefore, the competitive ratio is 1 − 1/e, or approximately 0.63.

8.4.7 The Generalized Balance Algorithm

The Balance Algorithm works well when all bids are 0 or 1. However, in practice, bids can be arbitrary, and with arbitrary bids and budgets Balance fails to weight the sizes of the bids properly. The following example illustrates the point.

Example 8.9: Suppose there are two advertisers A1 and A2, and one query q. The bids on q and budgets are:

Bidder  Bid  Budget
A1       1     110
A2      10     100

If there are 10 occurrences of q, the optimum off-line algorithm will assign them all to A2 and gain revenue 100. However, because A1’s budget is larger, Balance will assign all ten queries to A1 for a revenue of 10. In fact, one can extend this idea easily to show that for situations like this one, there is no competitive ratio higher than 0 that holds for the Balance Algorithm. 2

In order to make Balance work in more general situations, we need to make two modifications. First, we need to bias the choice of ad in favor of higher bids. Second, we need to be less absolute about the remaining budget. Rather, we consider the fraction of the budgets remaining, so we are biased toward using some of each advertiser’s budget. The latter change will make the Balance Algorithm more “risk averse”; it will not leave too much of any advertiser’s budget unused. It can be shown (see the chapter references) that the following generalization of the Balance Algorithm has a competitive ratio of 1 − 1/e ≈ 0.63.
• Suppose that a query q arrives, advertiser Ai has bid xi for this query (note that xi could be 0). Also, suppose that fraction fi of the budget of Ai is currently unspent. Let Ψi = xi(1 − e^−fi). Then assign q to the advertiser Ai such that Ψi is a maximum. Break ties arbitrarily.

Example 8.10: Consider how the generalized Balance Algorithm would work on the data of Example 8.9. For the first occurrence of query q,

Ψ1 = 1 × (1 − e^−1)

since A1 has bid 1, and fraction 1 of A1’s budget remains. That is, Ψ1 = 1 − 1/e = 0.63. On the other hand, Ψ2 = 10 × (1 − e^−1) = 6.3. Thus, the first q is awarded to A2. The same thing happens for each of the q’s. That is, Ψ1 stays at 0.63, while Ψ2 decreases. However, it never goes below 0.63. Even for the 10th q, when 90% of A2’s budget has already been used, Ψ2 = 10 × (1 − e^−1/10). Recall (Section 1.3.5) the Taylor expansion for e^x = 1 + x + x^2/2! + x^3/3! + · · · . Thus,

e^−1/10 = 1 − 1/10 + 1/200 − 1/6000 + · · ·

or approximately, e^−1/10 = 0.905. Thus, Ψ2 = 10 × 0.095 = 0.95. 2

We leave unproved the assertion that the competitive ratio for this algorithm is 1 − 1/e. We also leave unproved an additional surprising fact: no on-line algorithm for the adwords problem as described in this section can have a competitive ratio above 1 − 1/e.

8.4.8 Final Observations About the Adwords Problem

The Balance Algorithm, as described, does not take into account the possibility that the click-through rate differs for different ads. It is simple to multiply the bid by the click-through rate when computing the Ψi’s, and doing so will maximize the expected revenue. We can even incorporate information about the click-through rate for each ad on each query for which a nonzero amount has been bid. When faced with the problem of allocating a particular query q, we incorporate a factor that is the click-through rate for that ad on query q, when computing each of the Ψ’s.
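The Ψi rule can be sketched in a few lines on the data of Examples 8.9 and 8.10. The function and variable names below are my own, and budget-feasibility checks are omitted for brevity (in this example A2 always has enough budget left to pay its bid).

```python
from math import exp

def assign(bids, remaining, budget):
    """Pick the advertiser maximizing psi_i = x_i * (1 - e^(-f_i))."""
    def psi(a):
        f = remaining[a] / budget[a]      # fraction f_i of budget unspent
        return bids.get(a, 0) * (1 - exp(-f))
    return max(bids, key=psi)             # ties broken arbitrarily

budget = {"A1": 110, "A2": 100}
remaining = dict(budget)
bids = {"A1": 1, "A2": 10}                # both advertisers bid on the one query q

winners = []
for _ in range(10):                       # ten occurrences of q
    a = assign(bids, remaining, budget)
    remaining[a] -= bids[a]
    winners.append(a)
print(winners)    # all ten occurrences go to A2, as Example 8.10 argues
```

Even at the tenth occurrence, Ψ2 = 10 × (1 − e^−1/10) ≈ 0.95 still exceeds Ψ1 ≈ 0.63, so A2 keeps winning until its budget is exhausted.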
Another issue we must consider in practice is the historical frequency of queries. If, for example, we know that advertiser Ai has a budget sufficiently small that there are sure to be enough queries coming later in the month to satisfy Ai’s demand, then there is no point in boosting Ψi if some of Ai’s budget has already been expended. That is, maintain Ψi = xi(1 − e^−1) as long as we can expect that there will be enough queries remaining in the month to give Ai its full budget of ads. This change can cause Balance to perform worse if the sequence of queries is governed by an adversary who can control the sequence of queries. Such an adversary can cause the queries Ai bid on suddenly to disappear. However, search engines get so many queries, and their generation is so random, that it is not necessary in practice to imagine significant deviation from the norm.

8.4.9 Exercises for Section 8.4

Exercise 8.4.1: Using the simplifying assumptions of Example 8.7, suppose that there are three advertisers, A, B, and C. There are three queries, x, y, and z. Each advertiser has a budget of 2. Advertiser A bids only on x; B bids on x and y, while C bids on x, y, and z. Note that on the query sequence xxyyzz, the optimum off-line algorithm would yield a revenue of 6, since all queries can be assigned.

!(a) Show that the greedy algorithm will assign at least 4 of these 6 queries.

!!(b) Find another sequence of queries such that the greedy algorithm can assign as few as half the queries that the optimum off-line algorithm assigns on that sequence.

!! Exercise 8.4.2: Extend the proof of Section 8.4.5 to the case where the two advertisers have unequal budgets.

! Exercise 8.4.3: Show how to modify Example 8.9 by changing the bids and/or budgets to make the competitive ratio come out as close to 0 as you like.
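Returning to Section 8.4.6, the worst-case construction there can be checked numerically. The sketch below is not from the text; it follows the stated round structure (Ai bids on q1, . . . , qi; budget B = N!; round i is B copies of qi) and measures the fraction of the optimum revenue NB that Balance obtains.

```python
from math import factorial

def balance_worst_case(N):
    """Fraction of the optimum revenue N*B that Balance earns on the worst case."""
    B = factorial(N)                  # budget N! makes every equal split exact
    remaining = [B] * N               # remaining[i] is the budget of A_{i+1}
    revenue = 0
    for i in range(N):                # round i: B copies of query q_{i+1}
        for _ in range(B):
            # Bidders on q_{i+1} are A_{i+1}, ..., A_N; Balance takes the
            # one with the largest remaining budget.
            live = [j for j in range(i, N) if remaining[j] > 0]
            if not live:
                break                 # every remaining bidder is out of budget
            j = max(live, key=lambda b: remaining[b])
            remaining[j] -= 1
            revenue += 1
    return revenue / (N * B)          # the optimum assigns all N*B queries

print(balance_worst_case(8))   # above 1 - 1/e ~ 0.632, and tends to it as N grows
```

For small N the ratio sits a little above 1 − 1/e, approaching that limit from above as N increases, consistent with the analysis.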
8.5 Adwords Implementation

While we should now have an idea of how ads are selected to go with the answer to a search query, we have not addressed the problem of finding the bids that have been made on a given query. As long as bids are for the exact set of words in a query, the solution is relatively easy. However, there are a number of extensions to the query/bid matching process that are not as simple. We shall explain the details in this section.

8.5.1 Matching Bids and Search Queries

As we have described the adwords problem, and as it normally appears in practice, advertisers bid on sets of words. If a search query occurs having exactly that set of words in some order, then the bid is said to match the query, and it becomes a candidate for selection. We can avoid having to deal with word order by storing all sets of words representing a bid in lexicographic (alphabetic) order. The list of words in sorted order forms the hash-key for the bid, and these bids may be stored in a hash table used as an index, as discussed in Section 1.3.2. Search queries also have their words sorted prior to lookup. When we hash the sorted list, we find in the hash table all the bids for exactly that set of words. They can be retrieved quickly, since we have only to look at the contents of that bucket.

Moreover, there is a good chance that we can keep the entire hash table in main memory. If there are a million advertisers, each bidding on 100 queries, and the record of the bid requires 100 bytes, then we require ten gigabytes of main memory, which is well within the limits of what is feasible for a single machine. If more space is required, we can split the buckets of the hash table among as many machines as we need. Search queries can be hashed and sent to the appropriate machine.
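The sorted-words index just described can be sketched in a few lines. The helper names and sample bids below are illustrative, not from the text; the essential point is that the sorted list of a bid's words serves as the hash key, so a query matches exactly the bids on the same set of words regardless of word order.

```python
from collections import defaultdict

index = defaultdict(list)         # hash table: sorted word tuple -> bid records

def add_bid(words, record):
    key = tuple(sorted(words))    # canonical order removes word-order issues
    index[key].append(record)

def matching_bids(query_words):
    """Return all bids on exactly this set of words."""
    return index[tuple(sorted(query_words))]

add_bid({"chesterfield", "sofa"}, ("B", 20))
add_bid({"chesterfield"}, ("A", 10))
print(matching_bids({"sofa", "chesterfield"}))   # [('B', 20)]
```

The lookup touches a single bucket, which is what makes it practical to keep the whole index in main memory or to shard its buckets across machines.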
In practice, search queries may be arriving far too rapidly for a single machine, or group of machines that collaborate on a single query at a time, to handle them all. In that case, the stream of queries is split into as many pieces as necessary, and each piece is handled by one group of machines. In fact, answering the search query, independent of ads, will require a group of machines working in parallel anyway, in order that the entire processing of a query can be done in main memory.

8.5.2 More Complex Matching Problems

However, the potential for matching bids to objects is not limited to the case where the objects are search queries and the match criterion is same-set-of-words. For example, Google also matches adwords bids to emails. There, the match criterion is not based on the equality of sets. Rather, a bid on a set of words S matches an email if all the words in S appear anywhere in the email. This matching problem is much harder. We can still maintain a hash-table index for the bids, but the number of subsets of words in a hundred-word email is much too large to look up all the sets, or even all the small sets of (say) three or fewer words.

There are a number of other potential applications of this sort of matching that, at the time of this writing, are not implemented but could be. They all involve standing queries – queries that users post to a site, expecting the site to notify them whenever something matching the query becomes available at the site. For example:

1. Twitter allows one to follow all the “tweets” of a given person. However, it is feasible to allow users to specify a set of words, such as

ipod free music

and see all the tweets where all these words appear, not necessarily in order, and not necessarily adjacent.

2.
On-line news sites often allow users to select from among certain keywords or phrases, e.g., "healthcare" or "Barack Obama," and receive alerts whenever a new news article contains that word or consecutive sequence of words. This problem is simpler than the email/adwords problem for several reasons. Matching single words or consecutive sequences of words, even in a long article, is not as time-consuming as matching small sets of words. Further, the set of terms that one can search for is limited, so there aren't too many "bids." Even if many people want alerts about the same term, only one index entry, with the list of all those people associated, is required. However, a more advanced system could allow users to specify alerts for sets of words in a news article, just as the Adwords system allows anyone to bid on a set of words in an email.

8.5.3 A Matching Algorithm for Documents and Bids

We shall offer an algorithm that will match many "bids" against many "documents." As before, a bid is a (typically small) set of words. A document is a larger set of words, such as an email, tweet, or news article. We assume there may be hundreds of documents per second arriving, although if there are that many, the document stream may be split among many machines or groups of machines. We assume there are many bids, perhaps on the order of a hundred million or a billion. As always, we want to do as much in main memory as we can.

We shall, as before, represent a bid by its words listed in some order. There are two new elements in the representation. First, we shall include a status with each list of words. The status is an integer indicating how many of the first words on the list have been matched by the current document. When a bid is stored in the index, its status is always 0. Second, while the order of words could be lexicographic, we can lower the amount of work by ordering words rarest-first.
However, since the number of different words that can appear in emails is essentially unlimited, it is not feasible to order all words in this manner. As a compromise, we might identify the n most common words on the Web or in a sample of the stream of documents we are processing. Here, n might be a hundred thousand or a million. These n words are sorted by frequency, and they occupy the end of the list, with the most frequent words at the very end. All words not among the n most frequent can be assumed equally infrequent and ordered lexicographically. Then, the words of any document can be ordered. If a word does not appear on the list of n frequent words, place it at the front of the order, lexicographically. Those words in the document that do appear on the list of most frequent words appear after the infrequent words, in the reverse order of frequency (i.e., with the most frequent words of the documents ordered last).

Example 8.11: Suppose our document is

'Twas brillig, and the slithy toves

"The" is the most frequent word in English, and "and" is only slightly less frequent. Let us suppose that "twas" makes the list of frequent words, although its frequency is surely lower than that of "the" or "and." The other words do not make the list of frequent words. Then the end of the list consists of "twas," "and," and "the," in that order, since that is the inverse order of frequency. The other three words are placed at the front of the list in lexicographic order. Thus,

brillig slithy toves twas and the

is the sequence of words in the document, properly ordered. □

The bids are stored in a hash-table, whose hash key is the first word of the bid, in the order explained above. The record for the bid will also include information about what to do when the bid is matched. The status is 0 and need not be stored explicitly. There is another hash table, whose job is to contain copies of those bids that have been partially matched.
These bids have a status that is at least 1, but less than the number of words in the set. If the status is i, then the hash-key for this hash table is the (i + 1)st word. The arrangement of hash tables is suggested by Fig. 8.5. To process a document, do the following.

1. Sort the words of the document in the order discussed above. Eliminate duplicate words.

2. For each word w, in the sorted order:

(i) Using w as the hash-key for the table of partially matched bids, find those bids having w as key.

(ii) For each such bid b, if w is the last word of b, move b to the table of matched bids.

(iii) If w is not the last word of b, add 1 to b's status, and rehash b using the word whose position is one more than the new status, as the hash-key.

(iv) Using w as the hash key for the table of all bids, find those bids for which w is their first word in the sorted order.

(v) For each such bid b, if there is only one word on its list, copy it to the table of matched bids.

(vi) If b consists of more than one word, add it, with status 1, to the table of partially matched bids, using the second word of b as the hash-key.

3. Produce the list of matched bids as the output.

Figure 8.5: Managing large numbers of bids and large numbers of documents. For each word w of the document: (a) look up bids with hash-key w in the table of partially matched bids; (b) move a bid to the output if w is its last word; (c) rehash it with status one higher if w is not its last word; (d) look up bids with hash-key w in the hash-table index of all bids; (e) copy a bid to the output if w is its only word; (f) otherwise copy it, with status 1, to the table of partially matched bids, hashed on its second word.

The benefit of the rarest-first order should now be visible. A bid is only copied to the second hash table if its rarest word appears in the document. In comparison, if lexicographic order were used, more bids would be copied to the second hash table.
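The two-table procedure above can be sketched in Python. This is an illustrative reading of the algorithm, not production code: each bid is assumed to be a tuple of its words already in rarest-first order, `all_bids` stands in for the first hash table (keyed by each bid's rarest word), and `freq` for the frequency list of common words.

```python
# Illustrative sketch of the bid-matching algorithm of Section 8.5.3.
# Names (order_words, match_document, freq, all_bids) are ours.

def order_words(doc_words, freq):
    """Step 1: deduplicate and sort rarest-first.  Words absent from
    the frequency list come first, lexicographically; listed words
    follow in order of increasing frequency (most frequent last)."""
    words = set(doc_words)
    rare = sorted(w for w in words if w not in freq)
    common = sorted((w for w in words if w in freq), key=lambda w: freq[w])
    return rare + common

def match_document(doc_words, all_bids, freq):
    partial = {}   # next word needed -> list of (bid, status)
    matched = []
    for w in order_words(doc_words, freq):          # step 2
        for bid, status in partial.pop(w, []):      # (i) bids keyed by w
            if status + 1 == len(bid):
                matched.append(bid)                 # (ii) fully matched
            else:                                   # (iii) rehash on the word
                partial.setdefault(bid[status + 1], []).append(
                    (bid, status + 1))              #   after the new status
        for bid in all_bids.get(w, []):             # (iv) bids starting with w
            if len(bid) == 1:
                matched.append(bid)                 # (v) one-word bid
            else:                                   # (vi) status becomes 1
                partial.setdefault(bid[1], []).append((bid, 1))
    return matched                                  # step 3

freq = {"the": 1000, "and": 900, "twas": 50}
doc = "twas brillig and the slithy toves".split()
order_words(doc, freq)   # ['brillig', 'slithy', 'toves', 'twas', 'and', 'the']
```

Note that a bid such as `("brillig", "toves")` enters the `partial` table only when its rare first word actually occurs in the document, which is the point of the rarest-first order.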
By minimizing the size of that table, we not only reduce the amount of work in steps 2(i)–2(iii), but we make it more likely that this entire table can be kept in main memory.

8.6 Summary of Chapter 8

✦ Targeted Advertising: The big advantage that Web-based advertising has over advertising in conventional media such as newspapers is that Web advertising can be selected according to the interests of each individual user. This advantage has enabled many Web services to be supported entirely by advertising revenue.

✦ On- and Off-Line Algorithms: Conventional algorithms that are allowed to see all their data before producing an answer are called off-line. An on-line algorithm is required to make a response to each element in a stream immediately, with knowledge of only the past, not the future elements in the stream.

✦ Greedy Algorithms: Many on-line algorithms are greedy, in the sense that they select their action at every step by minimizing some objective function.

✦ Competitive Ratio: We can measure the quality of an on-line algorithm by minimizing, over all possible inputs, the value of the result of the on-line algorithm compared with the value of the result of the best possible off-line algorithm.

✦ Bipartite Matching: This problem involves two sets of nodes and a set of edges between members of the two sets. The goal is to find a maximal matching – as large a set of edges as possible that includes no node more than once.

✦ On-Line Solution to the Matching Problem: One greedy algorithm for finding a match in a bipartite graph (or any graph, for that matter) is to order the edges in some way, and for each edge in turn, add it to the match if neither of its ends are yet part of an edge previously selected for the match. This algorithm can be proved to have a competitive ratio of 1/2; that is, it never fails to match at least half as many nodes as the best off-line algorithm matches.
✦ Search Ad Management: A search engine receives bids from advertisers on certain search queries. Some ads are displayed with each search query, and the search engine is paid the amount of the bid only if the queryer clicks on the ad. Each advertiser can give a budget, the total amount they are willing to pay for clicks in a month.

✦ The Adwords Problem: The data for the adwords problem is a set of bids by advertisers on certain search queries, together with a total budget for each advertiser and information about the historical click-through rate for each ad for each query. Another part of the data is the stream of search queries received by the search engine. The objective is to select on-line a fixed-size set of ads in response to each query that will maximize the revenue to the search engine.

✦ Simplified Adwords Problem: To see some of the nuances of ad selection, we considered a simplified version in which all bids are either 0 or 1, only one ad is shown with each query, and all advertisers have the same budget. Under this model the obvious greedy algorithm of giving the ad placement to anyone who has bid on the query and has budget remaining can be shown to have a competitive ratio of 1/2.

✦ The Balance Algorithm: This algorithm improves on the simple greedy algorithm. A query's ad is given to the advertiser who has bid on the query and has the largest remaining budget. Ties can be broken arbitrarily.

✦ Competitive Ratio of the Balance Algorithm: For the simplified adwords model, the competitive ratio of the Balance Algorithm is 3/4 for the case of two advertisers and 1 − 1/e, or about 63%, for any number of advertisers.

✦ The Balance Algorithm for the Generalized Adwords Problem: When bidders can make differing bids, have different budgets, and have different click-through rates for different queries, the Balance Algorithm awards an ad to the advertiser with the highest value of the function Ψ = x(1 − e^(−f)).
Here, x is the product of the bid and the click-through rate for that advertiser and query, and f is the fraction of the advertiser's budget that remains unspent.

✦ Implementing an Adwords Algorithm: The simplest version of the implementation serves in situations where the bids are on exactly the set of words in the search query. We can represent a query by the list of its words, in sorted order. Bids are stored in a hash table or similar structure, with a hash key equal to the sorted list of words. A search query can then be matched against bids by a straightforward lookup in the table.

✦ Matching Word Sets Against Documents: A harder version of the adwords-implementation problem allows bids, which are still small sets of words as in a search query, to be matched against larger documents, such as emails or tweets. A bid set matches the document if all the words appear in the document, in any order and not necessarily adjacent.

✦ Hash Storage of Word Sets: A useful data structure stores the words of each bid set in the order rarest-first. Documents have their words sorted in the same order. Word sets are stored in a hash table with the first word, in the rarest-first order, as the key.

✦ Processing Documents for Bid Matches: We process the words of the document rarest-first. Word sets whose first word is the current word are copied to a temporary hash table, with the second word as the key. Sets already in the temporary hash table are examined to see if the word that is their key matches the current word, and, if so, they are rehashed using their next word. Sets whose last word is matched are copied to the output.

8.7 References for Chapter 8

[1] is an investigation of the way ad position influences the click-through rate. The Balance Algorithm was developed in [2] and its application to the adwords problem is from [3].

1. N. Craswell, O. Zoeter, M. Taylor, and W.
Ramsey, "An experimental comparison of click-position bias models," Proc. Intl. Conf. on Web Search and Web Data Mining, pp. 87–94, 2008.

2. B. Kalyanasundaram and K.R. Pruhs, "An optimal deterministic algorithm for b-matching," Theoretical Computer Science 233:1–2, pp. 319–325, 2000.

3. A. Mehta, A. Saberi, U. Vazirani, and V. Vazirani, "Adwords and generalized on-line matching," IEEE Symp. on Foundations of Computer Science, pp. 264–273, 2005.

Chapter 9
Recommendation Systems

There is an extensive class of Web applications that involve predicting user responses to options. Such a facility is called a recommendation system. We shall begin this chapter with a survey of the most important examples of these systems. However, to bring the problem into focus, two good examples of recommendation systems are:

1. Offering news articles to on-line newspaper readers, based on a prediction of reader interests.

2. Offering customers of an on-line retailer suggestions about what they might like to buy, based on their past history of purchases and/or product searches.

Recommendation systems use a number of different technologies. We can classify these systems into two broad groups.

• Content-based systems examine properties of the items recommended. For instance, if a Netflix user has watched many cowboy movies, then recommend a movie classified in the database as having the "cowboy" genre.

• Collaborative filtering systems recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users. This sort of recommendation system can use the groundwork laid in Chapter 3 on similarity search and Chapter 7 on clustering. However, these technologies by themselves are not sufficient, and there are some new algorithms that have proven effective for recommendation systems.
9.1 A Model for Recommendation Systems

In this section we introduce a model for recommendation systems, based on a utility matrix of preferences. We introduce the concept of a "long tail," which explains the advantage of on-line vendors over conventional, brick-and-mortar vendors. We then briefly survey the sorts of applications in which recommendation systems have proved useful.

9.1.1 The Utility Matrix

In a recommendation-system application there are two classes of entities, which we shall refer to as users and items. Users have preferences for certain items, and these preferences must be teased out of the data. The data itself is represented as a utility matrix, giving for each user-item pair a value that represents what is known about the degree of preference of that user for that item. Values come from an ordered set, e.g., integers 1–5 representing the number of stars that the user gave as a rating for that item. We assume that the matrix is sparse, meaning that most entries are "unknown." An unknown rating implies that we have no explicit information about the user's preference for the item.

Example 9.1: In Fig. 9.1 we see an example utility matrix, representing users' ratings of movies on a 1–5 scale, with 5 the highest rating. Blanks represent the situation where the user has not rated the movie. The movie names are HP1, HP2, and HP3 for Harry Potter I, II, and III, TW for Twilight, and SW1, SW2, and SW3 for Star Wars episodes 1, 2, and 3. The users are represented by capital letters A through D.

       HP1  HP2  HP3  TW   SW1  SW2  SW3
   A    4              5    1
   B    5    5    4
   C                   2    4    5
   D    3                             3

Figure 9.1: A utility matrix representing ratings of movies on a 1–5 scale

Notice that most user-movie pairs have blanks, meaning the user has not rated the movie. In practice, the matrix would be even sparser, with the typical user rating only a tiny fraction of all available movies.
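In code, such a sparse matrix is naturally a dictionary keyed by (user, item) pairs. The data below is the matrix of Fig. 9.1; the helper names are ours, not standard.

```python
# The utility matrix of Fig. 9.1 stored sparsely: absent keys are the
# blanks (unknown ratings), not zeros.  Illustrative sketch only.

ratings = {
    ("A", "HP1"): 4, ("A", "TW"): 5, ("A", "SW1"): 1,
    ("B", "HP1"): 5, ("B", "HP2"): 5, ("B", "HP3"): 4,
    ("C", "TW"): 2, ("C", "SW1"): 4, ("C", "SW2"): 5,
    ("D", "HP1"): 3, ("D", "SW3"): 3,
}

def rating(user, item):
    """None means "not rated" -- a blank the system must predict."""
    return ratings.get((user, item))

def items_rated_by(user):
    return {i: r for (u, i), r in ratings.items() if u == user}

rating("A", "SW2")      # None: a blank to be predicted
items_rated_by("A")     # {'HP1': 4, 'TW': 5, 'SW1': 1}
```

The distinction between a blank and a 0 matters later (Section 9.1.4): a missing key is no information, not a low rating.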
□

The goal of a recommendation system is to predict the blanks in the utility matrix. For example, would user A like SW2? There is little evidence from the tiny matrix in Fig. 9.1. We might design our recommendation system to take into account properties of movies, such as their producer, director, stars, or even the similarity of their names. If so, we might then note the similarity between SW1 and SW2, and then conclude that since A did not like SW1, they were unlikely to enjoy SW2 either. Alternatively, with much more data, we might observe that the people who rated both SW1 and SW2 tended to give them similar ratings. Thus, we could conclude that A would also give SW2 a low rating, similar to A's rating of SW1.

We should also be aware of a slightly different goal that makes sense in many applications. It is not necessary to predict every blank entry in a utility matrix. Rather, it is only necessary to discover some entries in each row that are likely to be high. In most applications, the recommendation system does not offer users a ranking of all items, but rather suggests a few that the user should value highly. It may not even be necessary to find those items with the highest expected ratings, but only to find a large subset of those with the highest ratings.

9.1.2 The Long Tail

Before discussing the principal applications of recommendation systems, let us ponder the long-tail phenomenon that makes recommendation systems necessary. Physical delivery systems are characterized by a scarcity of resources. Brick-and-mortar stores have limited shelf space, and can show the customer only a small fraction of all the choices that exist. On the other hand, on-line stores can make anything that exists available to the customer. Thus, a physical bookstore may have several thousand books on its shelves, but Amazon offers millions of books.
A physical newspaper can print several dozen articles per day, while on-line news services offer thousands per day. Recommendation in the physical world is fairly simple. First, it is not possible to tailor the store to each individual customer. Thus, the choice of what is made available is governed only by the aggregate numbers. Typically, a bookstore will display only the books that are most popular, and a newspaper will print only the articles it believes the most people will be interested in. In the first case, sales figures govern the choices; in the second case, editorial judgement serves.

The distinction between the physical and on-line worlds has been called the long tail phenomenon, and it is suggested in Fig. 9.2. The vertical axis represents popularity (the number of times an item is chosen). The items are ordered on the horizontal axis according to their popularity. Physical institutions provide only the most popular items to the left of the vertical line, while the corresponding on-line institutions provide the entire range of items: the tail as well as the popular items.

The long-tail phenomenon forces on-line institutions to recommend items to individual users. It is not possible to present all available items to the user, the way physical institutions can, and it is not reasonable to expect users to have heard of all the many items they might like.

9.1.3 Applications of Recommendation Systems

We have mentioned several important applications of recommendation systems, but here we shall consolidate the list in a single place.

1. Product Recommendations: Perhaps the most important use of recommendation systems is at on-line retailers. We have noted how Amazon
Figure 9.2: The long tail: physical institutions can only provide what is popular, while on-line institutions can make everything available

or similar on-line vendors strive to present each returning user with some suggestions of products that they might like to buy. These suggestions are not random, but are based on the purchasing decisions made by similar customers or on other techniques we shall discuss in this chapter.

2. Movie Recommendations: Netflix offers its customers recommendations of movies they might like. These recommendations are based on ratings provided by users, much like the ratings suggested in the example utility matrix of Fig. 9.1. The importance of predicting ratings accurately is so high that Netflix offered a prize of one million dollars for the first algorithm that could beat its own recommendation system by 10%.¹ The prize was finally won in 2009, by a team of researchers called "Bellkor's Pragmatic Chaos," after over three years of competition.

¹ To be exact, the algorithm had to have a root-mean-square error (RMSE) that was 10% less than the RMSE of the Netflix algorithm on a test set taken from actual ratings of Netflix users. To develop an algorithm, contestants were given a training set of data, also taken from actual Netflix data.

3. News Articles: News services have attempted to identify articles of interest to readers, based on the articles that they have read in the past. The similarity might be based on the similarity of important words in the documents, or on the articles that are read by people with similar reading tastes. The same principles apply to recommending blogs from among the millions of blogs available, videos on YouTube, or other sites where
content is provided regularly.

Into Thin Air and Touching the Void

An extreme example of how the long tail, together with a well designed recommendation system, can influence events is the story told by Chris Anderson about a book called Touching the Void. This mountain-climbing book was not a big seller in its day, but many years after it was published, another book on the same topic, called Into Thin Air, was published. Amazon's recommendation system noticed a few people who bought both books, and started recommending Touching the Void to people who bought, or were considering, Into Thin Air. Had there been no on-line bookseller, Touching the Void might never have been seen by potential buyers, but in the on-line world, Touching the Void eventually became very popular in its own right, in fact, more so than Into Thin Air.

9.1.4 Populating the Utility Matrix

Without a utility matrix, it is almost impossible to recommend items. However, acquiring data from which to build a utility matrix is often difficult. There are two general approaches to discovering the value users place on items.

1. We can ask users to rate items. Movie ratings are generally obtained this way, and some on-line stores try to obtain ratings from their purchasers. Sites providing content, such as some news sites or YouTube, also ask users to rate items. This approach is limited in its effectiveness, since generally users are unwilling to provide responses, and the information from those who do may be biased by the very fact that it comes from people willing to provide ratings.

2. We can make inferences from users' behavior. Most obviously, if a user buys a product at Amazon, watches a movie on YouTube, or reads a news article, then the user can be said to "like" this item. Note that this sort of rating system really has only one value: 1 means that the user likes the item.
Often, we find a utility matrix with this kind of data shown with 0's rather than blanks where the user has not purchased or viewed the item. However, in this case 0 is not a lower rating than 1; it is no rating at all. More generally, one can infer interest from behavior other than purchasing. For example, if an Amazon customer views information about an item, we can infer that they are interested in the item, even if they don't buy it.

9.2 Content-Based Recommendations

As we mentioned at the beginning of the chapter, there are two basic architectures for a recommendation system:

1. Content-Based systems focus on properties of items. Similarity of items is determined by measuring the similarity in their properties.

2. Collaborative-Filtering systems focus on the relationship between users and items. Similarity of items is determined by the similarity of the ratings of those items by the users who have rated both items.

In this section, we focus on content-based recommendation systems. The next section will cover collaborative filtering.

9.2.1 Item Profiles

In a content-based system, we must construct for each item a profile, which is a record or collection of records representing important characteristics of that item. In simple cases, the profile consists of some characteristics of the item that are easily discovered. For example, consider the features of a movie that might be relevant to a recommendation system.

1. The set of stars of the movie. Some viewers prefer movies with their favorite stars.

2. The director. Some viewers have a preference for the work of certain directors.

3. The year in which the movie was made. Some viewers prefer old movies; others watch only the latest releases.

4. The genre or general type of movie. Some viewers like only comedies, others dramas or romances.

There are many other features of movies that could be used as well.
Except for the last, genre, the information is readily available from descriptions of movies. Genre is a vaguer concept. However, movie reviews generally assign a genre from a set of commonly used terms. For example, the Internet Movie Database (IMDB) assigns a genre or genres to every movie. We shall discuss mechanical construction of genres in Section 9.3.3.

Many other classes of items also allow us to obtain features from available data, even if that data must at some point be entered by hand. For instance, products often have descriptions written by the manufacturer, giving features relevant to that class of product (e.g., the screen size and cabinet color for a TV). Books have descriptions similar to those for movies, so we can obtain features such as author, year of publication, and genre. Music products such as CD's and MP3 downloads have available features such as artist, composer, and genre.

9.2.2 Discovering Features of Documents

There are other classes of items where it is not immediately apparent what the values of features should be. We shall consider two of them: document collections and images. Documents present special problems, but yield to a known technology. Images will be discussed in Section 9.2.3 as an important example where user-supplied features have some hope of success.

There are many kinds of documents for which a recommendation system can be useful. For example, there are many news articles published each day, and we cannot read all of them. A recommendation system can suggest articles on topics a user is interested in, but how can we distinguish among topics? Web pages are also a collection of documents. Can we suggest pages a user might want to see? Likewise, blogs could be recommended to interested users, if we could classify blogs by topics.

Unfortunately, these classes of documents do not tend to have readily available information giving features.
A substitute that has been useful in practice is the identification of words that characterize the topic of a document. How we do the identification was outlined in Section 1.3.1. First, eliminate stop words – the several hundred most common words, which tend to say little about the topic of a document. For the remaining words, compute the TF.IDF score for each word in the document. The ones with the highest scores are the words that characterize the document.

We may then take as the features of a document the n words with the highest TF.IDF scores. It is possible to pick n to be the same for all documents, or to let n be a fixed percentage of the words in the document. We could also choose to make all words whose TF.IDF scores are above a given threshold a part of the feature set.

Now, documents are represented by sets of words. Intuitively, we expect these words to express the subjects or main ideas of the document. For example, in a news article, we would expect the words with the highest TF.IDF score to include the names of people discussed in the article, unusual properties of the event described, and the location of the event. To measure the similarity of two documents, there are several natural distance measures we can use:

1. We could use the Jaccard distance between the sets of words (recall Section 3.5.3).

2. We could use the cosine distance (recall Section 3.5.4) between the sets, treated as vectors.

To compute the cosine distance in option (2), think of the sets of high-TF.IDF words as a vector, with one component for each possible word. The vector has 1 if the word is in the set and 0 if not. Since between two documents there are only a finite number of words among their two sets, the infinite dimensionality of the vectors is unimportant. Almost all components are 0 in both, and 0's do not impact the value of the dot product. To be precise, the dot
product is the size of the intersection of the two sets of words, and the lengths of the vectors are the square roots of the numbers of words in each set. That calculation lets us compute the cosine of the angle between the vectors as the dot product divided by the product of the vector lengths.

Two Kinds of Document Similarity

Recall that in Section 3.4 we gave a method for finding documents that were "similar," using shingling, minhashing, and LSH. There, the notion of similarity was lexical – documents are similar if they contain large, identical sequences of characters. For recommendation systems, the notion of similarity is different. We are interested only in the occurrences of many important words in both documents, even if there is little lexical similarity between the documents. However, the methodology for finding similar documents remains almost the same. Once we have a distance measure, either Jaccard or cosine, we can use minhashing (for Jaccard) or random hyperplanes (for cosine distance; see Section 3.7.2), feeding data to an LSH algorithm to find the pairs of documents that are similar in the sense of sharing many common keywords.

9.2.3 Obtaining Item Features From Tags

Let us consider a database of images as an example of a way that features have been obtained for items. The problem with images is that their data, typically an array of pixels, does not tell us anything useful about their features. We can calculate simple properties of pixels, such as the average amount of red in the picture, but few users are looking for red pictures or especially like red pictures.

There have been a number of attempts to obtain information about features of items by inviting users to tag the items by entering words or phrases that describe the item.
Thus, one picture with a lot of red might be tagged "Tiananmen Square," while another is tagged "sunset at Malibu." The distinction is not something that could be discovered by existing image-analysis programs.

Almost any kind of data can have its features described by tags. One of the earliest attempts to tag massive amounts of data was the site del.icio.us, later bought by Yahoo!, which invited users to tag Web pages. The goal of this tagging was to make a new method of search available, where users entered a set of tags as their search query, and the system retrieved the Web pages that had been tagged that way. However, it is also possible to use the tags as a recommendation system. If it is observed that a user retrieves or bookmarks many pages with a certain set of tags, then we can recommend other pages with the same tags.

The problem with tagging as an approach to feature discovery is that the process only works if users are willing to take the trouble to create the tags, and there are enough tags that occasional erroneous ones will not bias the system too much.

Tags from Computer Games

An interesting direction for encouraging tagging is the "games" approach pioneered by Luis von Ahn. He enabled two players to collaborate on the tag for an image. In rounds, they would suggest a tag, and the tags would be exchanged. If they agreed, then they "won," and if not, they would play another round with the same image, trying to agree simultaneously on a tag. While an innovative direction to try, it is questionable whether sufficient public interest can be generated to produce enough free work to satisfy the needs for tagged data.

9.2.4 Representing Item Profiles

Our ultimate goal for content-based recommendation is to create both an item profile consisting of feature-value pairs and a user profile summarizing the preferences of the user, based on their row of the utility matrix.
In Section 9.2.2 we suggested how an item profile could be constructed. We imagined a vector of 0's and 1's, where a 1 represented the occurrence of a high-TF.IDF word in the document. Since the features for documents were all words, it was easy to represent profiles this way. We shall try to generalize this vector approach to all sorts of features. It is easy to do so for features that are sets of discrete values. For example, if one feature of movies is the set of stars, then imagine that there is a component for each star, with 1 if the star is in the movie, and 0 if not. Likewise, we can have a component for each possible director, and each possible genre. All these features can be represented using only 0's and 1's.

There is another class of feature that is not readily represented by boolean vectors: those features that are numerical. For instance, we might take the average rating for movies to be a feature (the rating is not a very reliable feature, but it will serve as an example), and this average is a real number. It does not make sense to have one component for each of the possible average ratings, and doing so would cause us to lose the structure implicit in numbers. That is, two ratings that are close but not identical should be considered more similar than widely differing ratings. Likewise, numerical features of products, such as screen size or disk capacity for PC's, should be considered similar if their values do not differ greatly.

Numerical features should be represented by single components of vectors representing items. These components hold the exact value of that feature. There is no harm if some components of the vectors are boolean and others are real-valued or integer-valued. We can still compute the cosine distance between vectors, although if we do so, we should give some thought to the appropriate scaling of the nonboolean components, so that they neither dominate the calculation nor become irrelevant.
Example 9.2: Suppose the only features of movies are the set of stars and the average rating. Consider two movies with five stars each. Two of the stars are in both movies. Also, one movie has an average rating of 3 and the other an average of 4. The vectors look something like

[0, 1, 1, 0, 1, 1, 0, 1, 3α]
[1, 1, 0, 1, 0, 1, 1, 0, 4α]

However, there are in principle an infinite number of additional components, each with 0's in both vectors, representing all the possible stars that neither movie has. Since the cosine distance of vectors is not affected by components in which both vectors have 0, we need not worry about the effect of stars that are in neither movie.

The last component shown represents the average rating. We have shown it as having an unknown scaling factor α. In terms of α, we can compute the cosine of the angle between the vectors. The dot product is 2 + 12α², and the lengths of the vectors are √(5 + 9α²) and √(5 + 16α²). Thus, the cosine of the angle between the vectors is

(2 + 12α²) / √(25 + 125α² + 144α⁴)

If we choose α = 1, that is, we take the average ratings as they are, then the value of the above expression is 0.816. If we use α = 2, that is, we double the ratings, then the cosine is 0.940; that is, the vectors appear much closer in direction than with α = 1. Likewise, if we use α = 1/2, then the cosine is 0.619, making the vectors look quite different. We cannot tell which value of α is "right," but we see that the choice of scaling factor for numerical features affects our decision about how similar items are. □

9.2.5 User Profiles

We not only need to create vectors describing items; we need to create vectors with the same components that describe the user's preferences. We have the utility matrix representing the connection between users and items. Recall that the nonblank matrix entries could be just 1's representing user purchases or a similar connection, or they could be arbitrary numbers representing a rating or degree of affection that the user has for the item.
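The computation of Example 9.2 can be checked with a short sketch; the function names here are illustrative assumptions.

```python
import math

def cosine(u, v):
    # cosine of the angle between two vectors: dot product over product of lengths
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def movie_vectors(alpha):
    # star components as in Example 9.2, plus the scaled average rating
    m1 = [0, 1, 1, 0, 1, 1, 0, 1, 3 * alpha]
    m2 = [1, 1, 0, 1, 0, 1, 1, 0, 4 * alpha]
    return m1, m2

for alpha in (1.0, 2.0, 0.5):
    print(alpha, round(cosine(*movie_vectors(alpha)), 3))
```

Running the loop reproduces the three cosines discussed in the example: roughly 0.816, 0.940, and 0.619.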
With this information, the best estimate we can make regarding which items the user likes is some aggregation of the profiles of those items. If the utility matrix has only 1's, then the natural aggregate is the average of the components of the vectors representing the item profiles for the items in which the utility matrix has 1 for that user.

Example 9.3: Suppose items are movies, represented by boolean profiles with components corresponding to stars. Also, the utility matrix has a 1 if the user has seen the movie and is blank otherwise. If 20% of the movies that user U likes have Julia Roberts as one of the stars, then the user profile for U will have 0.2 in the component for Julia Roberts. □

If the utility matrix is not boolean, e.g., ratings 1–5, then we can weight the vectors representing the profiles of items by the utility value. It makes sense to normalize the utilities by subtracting the average value for a user. That way, we get negative weights for items with a below-average rating, and positive weights for items with above-average ratings. That effect will prove useful when we discuss in Section 9.2.6 how to find items that a user should like.

Example 9.4: Consider the same movie information as in Example 9.3, but now suppose the utility matrix has nonblank entries that are ratings in the 1–5 range. Suppose user U gives an average rating of 3. There are three movies with Julia Roberts as a star, and those movies got ratings of 3, 4, and 5. Then in the user profile of U, the component for Julia Roberts will have a value that is the average of 3 − 3, 4 − 3, and 5 − 3, that is, a value of 1.

On the other hand, user V gives an average rating of 4, and has also rated three movies starring Julia Roberts (it doesn't matter whether or not they are the same three movies U rated). User V gives these three movies ratings of 2, 3, and 5.
The user profile for V has, in the component for Julia Roberts, the average of 2 − 4, 3 − 4, and 5 − 4, that is, the value −2/3. □

9.2.6 Recommending Items to Users Based on Content

With profile vectors for both users and items, we can estimate the degree to which a user would prefer an item by computing the cosine distance between the user's and item's vectors. As in Example 9.2, we may wish to scale various components whose values are not boolean. The random-hyperplane and locality-sensitive-hashing techniques can be used to place (just) item profiles in buckets. In that way, given a user to whom we want to recommend some items, we can apply the same two techniques – random hyperplanes and LSH – to determine in which buckets we must look for items that might have a small cosine distance from the user.

Example 9.5: Consider first the data of Example 9.3. The user's profile will have components for stars proportional to the likelihood that the star will appear in a movie the user likes. Thus, the highest recommendations (lowest cosine distance) belong to the movies with lots of stars that appear in many of the movies the user likes. As long as stars are the only information we have about features of movies, that is probably the best we can do. (Note that the fact that all user-vector components will be small fractions does not affect the recommendation, since the cosine calculation involves dividing by the length of each vector. That is, user vectors will tend to be much shorter than movie vectors, but only the direction of vectors matters.)

Now, consider Example 9.4. There, we observed that the vector for a user will have positive numbers for stars that tend to appear in movies the user likes and negative numbers for stars that tend to appear in movies the user doesn't like. Consider a movie with many stars the user likes, and only a few or none that the user doesn't like. The cosine of the angle between the user's and movie's vectors will be a large positive fraction. That implies an angle close to 0, and therefore a small cosine distance between the vectors.

Next, consider a movie with about as many stars the user likes as doesn't like.
In this situation, the cosine of the angle between the user and movie vectors is around 0, and therefore the angle between the two vectors is around 90 degrees. Finally, consider a movie with mostly stars the user doesn't like. In that case, the cosine will be a large negative fraction, and the angle between the two vectors will be close to 180 degrees – the maximum possible cosine distance. □

9.2.7 Classification Algorithms

A completely different approach to a recommendation system using item profiles and utility matrices is to treat the problem as one of machine learning. Regard the given data as a training set, and for each user, build a classifier that predicts the rating of all items. There are a great number of different classifiers, and it is not our purpose to teach this subject here. However, you should be aware of the option of developing a classifier for recommendation, so we shall discuss one common classifier – decision trees – briefly.

A decision tree is a collection of nodes, arranged as a binary tree. The leaves render decisions; in our case, the decision would be "likes" or "doesn't like." Each interior node is a condition on the objects being classified; in our case the condition would be a predicate involving one or more features of an item.

To classify an item, we start at the root, and apply the predicate at the root to the item. If the predicate is true, go to the left child, and if it is false, go to the right child. Then repeat the same process at the node visited, until a leaf is reached. That leaf classifies the item as liked or not.

Construction of a decision tree requires selection of a predicate for each interior node. There are many ways of picking the best predicate, but they all try to arrange that one of the children gets all or most of the positive examples in the training set (i.e., the items that the given user likes, in our case) and the other child gets all or most of the negative examples (the items this user does not like).
Once we have selected a predicate for a node N, we divide the items into two groups: those that satisfy the predicate and those that do not. For each group, we again find the predicate that best separates the positive and negative examples in that group. These predicates are assigned to the children of N. This process of dividing the examples and building children can proceed to any number of levels. We can stop, and create a leaf, if the group of items for a node is homogeneous; i.e., they are all positive or all negative examples.

However, we may wish to stop and create a leaf with the majority decision for a group, even if the group contains both positive and negative examples. The reason is that the statistical significance of a small group may not be high enough to rely on. For that reason a variant strategy is to create an ensemble of decision trees, each using different predicates, but allow the trees to be deeper than what the available data justifies. Such trees are called overfitted. To classify an item, apply all the trees in the ensemble, and let them vote on the outcome. We shall not consider this option here, but give a simple hypothetical example of a decision tree.

Example 9.6: Suppose our items are news articles, and features are the high-TF.IDF words (keywords) in those documents. Further suppose there is a user U who likes articles about baseball, except articles about the New York Yankees. The row of the utility matrix for U has 1 if U has read the article and is blank if not. We shall take the 1's as "like" and the blanks as "doesn't like." Predicates will be boolean expressions of keywords.
Since U generally likes baseball, we might find that the best predicate for the root is "homerun" OR ("batter" AND "pitcher"). Items that satisfy the predicate will tend to be positive examples (articles with 1 in the row for U in the utility matrix), and items that fail to satisfy the predicate will tend to be negative examples (blanks in the utility-matrix row for U). Figure 9.3 shows the root as well as the rest of the decision tree.

[Figure 9.3: A decision tree. The root tests "homerun" OR ("batter" AND "pitcher"); its right child is the leaf "Doesn't Like." The root's left child tests "Yankees" OR "Jeter" OR "Teixeira"; that node's left child is the leaf "Doesn't Like" and its right child is the leaf "Likes."]

Suppose that the group of articles that do not satisfy the root predicate includes sufficiently few positive examples that we can conclude all of these items are in the "don't-like" class. We may then put a leaf with decision "don't like" as the right child of the root. However, the group of articles that satisfy the predicate includes a number of articles that user U doesn't like; these are the articles that mention the Yankees. Thus, at the left child of the root, we build another predicate. We might find that the predicate "Yankees" OR "Jeter" OR "Teixeira" is the best possible indicator of an article about baseball and about the Yankees. Thus, we see in Fig. 9.3 the left child of the root, which applies this predicate. Both children of this node are leaves, since we may suppose that the items satisfying this predicate are predominantly negative and those not satisfying it are predominantly positive. □

Unfortunately, classifiers of all types tend to take a long time to construct. For instance, if we wish to use decision trees, we need one tree per user. Constructing a tree not only requires that we look at all the item profiles, but we also have to consider many different predicates, which could involve complex combinations of features. Thus, this approach tends to be used only for relatively small problem sizes.
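The tree of Fig. 9.3 can be rendered as a short sketch; the predicate helpers and the keyword-set representation of an article are assumptions for illustration, not part of the book's construction.

```python
def baseball_predicate(keywords):
    # root predicate: "homerun" OR ("batter" AND "pitcher")
    return "homerun" in keywords or ("batter" in keywords and "pitcher" in keywords)

def yankees_predicate(keywords):
    # predicate at the left child of the root
    return any(k in keywords for k in ("Yankees", "Jeter", "Teixeira"))

def classify(keywords):
    if not baseball_predicate(keywords):   # root: false branch is a leaf
        return "doesn't like"
    if yankees_predicate(keywords):        # left child of the root
        return "doesn't like"
    return "likes"

print(classify({"homerun", "pennant"}))    # a baseball article with no Yankees mention
print(classify({"homerun", "Yankees"}))    # a Yankees article
print(classify({"election", "economy"}))   # not about baseball at all
```

An article is represented here simply as the set of its high-TF.IDF keywords, and classification is a walk from the root to a leaf.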
9.2.8 Exercises for Section 9.2

Exercise 9.2.1: Three computers, A, B, and C, have the numerical features listed below:

Feature             A      B      C
Processor Speed     3.06   2.68   2.92
Disk Size           500    320    640
Main-Memory Size    6      4      6

We may imagine these values as defining a vector for each computer; for instance, A's vector is [3.06, 500, 6]. We can compute the cosine distance between any two of the vectors, but if we do not scale the components, then the disk size will dominate and make differences in the other components essentially invisible. Let us use 1 as the scale factor for processor speed, α for the disk size, and β for the main-memory size.

(a) In terms of α and β, compute the cosines of the angles between the vectors for each pair of the three computers.
(b) What are the angles between the vectors if α = β = 1?
(c) What are the angles between the vectors if α = 0.01 and β = 0.5?
!(d) One fair way of selecting scale factors is to make each inversely proportional to the average value in its component. What would be the values of α and β, and what would be the angles between the vectors?

Exercise 9.2.2: An alternative way of scaling components of a vector is to begin by normalizing the vectors. That is, compute the average for each component and subtract it from that component's value in each of the vectors.

(a) Normalize the vectors for the three computers described in Exercise 9.2.1.
!!(b) This question does not require difficult calculation, but it requires some serious thought about what angles between vectors mean. When all components are nonnegative, as they are in the data of Exercise 9.2.1, no two vectors can have an angle greater than 90 degrees. However, when we normalize vectors, we can (and must) get some negative components, so the angles can now be anything; that is, 0 to 180 degrees.
Moreover, averages are now 0 in every component, so the suggestion in part (d) of Exercise 9.2.1 that we should scale in inverse proportion to the average makes no sense. Suggest a way of finding an appropriate scale for each component of normalized vectors. How would you interpret a large or small angle between normalized vectors? What would the angles be for the normalized vectors derived from the data in Exercise 9.2.1?

Exercise 9.2.3: A certain user has rated the three computers of Exercise 9.2.1 as follows: A: 4 stars, B: 2 stars, C: 5 stars.

(a) Normalize the ratings for this user.
(b) Compute a user profile for the user, with components for processor speed, disk size, and main-memory size, based on the data of Exercise 9.2.1.

9.3 Collaborative Filtering

We shall now take up a significantly different approach to recommendation. Instead of using features of items to determine their similarity, we focus on the similarity of the user ratings for two items. That is, in place of the item-profile vector for an item, we use its column in the utility matrix. Further, instead of contriving a profile vector for users, we represent them by their rows in the utility matrix. Users are similar if their vectors are close according to some distance measure such as Jaccard or cosine distance. Recommendation for a user U is then made by looking at the users that are most similar to U in this sense, and recommending items that these users like. The process of identifying similar users and recommending what similar users like is called collaborative filtering.

9.3.1 Measuring Similarity

The first question we must deal with is how to measure similarity of users or items from their rows or columns in the utility matrix. We have reproduced Fig. 9.1 here as Fig. 9.4. This data is too small to draw any reliable conclusions, but its small size will make clear some of the pitfalls in picking a distance measure. Observe specifically the users A and C.
They rated two movies in common, but they appear to have almost diametrically opposite opinions of these movies. We would expect that a good distance measure would make them rather far apart. Here are some alternative measures to consider.

      HP1  HP2  HP3  TW   SW1  SW2  SW3
A      4              5    1
B      5    5    4
C                     2    4    5
D                3              3

Figure 9.4: The utility matrix introduced in Fig. 9.1

Jaccard Distance

We could ignore values in the matrix and focus only on the sets of items rated. If the utility matrix only reflected purchases, this measure would be a good one to choose. However, when utilities are more detailed ratings, the Jaccard distance loses important information.

Example 9.7: A and B have an intersection of size 1 and a union of size 5. Thus, their Jaccard similarity is 1/5, and their Jaccard distance is 4/5; i.e., they are very far apart. In comparison, A and C have a Jaccard similarity of 2/4, so their Jaccard distance is the same, 1/2. Thus, A appears closer to C than to B. Yet that conclusion seems intuitively wrong. A and C disagree on the two movies they both watched, while A and B seem both to have liked the one movie they watched in common. □

Cosine Distance

We can treat blanks as a 0 value. This choice is questionable, since it has the effect of treating the lack of a rating as more similar to disliking the movie than liking it.

Example 9.8: The cosine of the angle between A and B is

4 × 5 / (√(4² + 5² + 1²) √(5² + 5² + 4²)) = 0.380

The cosine of the angle between A and C is

(5 × 2 + 1 × 4) / (√(4² + 5² + 1²) √(2² + 4² + 5²)) = 0.322

Since a larger (positive) cosine implies a smaller angle and therefore a smaller distance, this measure says A is slightly closer to B than to C, but the two cosines are close, even though intuition says A and C should be far apart. □

Rounding the Data

We could try to eliminate the apparent similarity between movies a user rates highly and those with low scores by rounding the ratings. For instance, we could consider ratings of 3, 4, and 5 as a "1" and consider ratings 1 and 2 as unrated. The utility matrix would then look as in Fig. 9.5.
Now, the Jaccard distance between A and B is 3/4, while between A and C it is 1; i.e., C appears further from A than B does, which is intuitively correct. Applying cosine distance to Fig. 9.5 allows us to draw the same conclusion.

      HP1  HP2  HP3  TW   SW1  SW2  SW3
A      1              1
B      1    1    1
C                          1    1
D                1              1

Figure 9.5: Utilities of 3, 4, and 5 have been replaced by 1, while ratings of 1 and 2 are omitted

Normalizing Ratings

If we normalize ratings, by subtracting from each rating the average rating of that user, we turn low ratings into negative numbers and high ratings into positive numbers. If we then take the cosine distance, we find that users with opposite views of the movies they viewed in common will have vectors in almost opposite directions, and can be considered as far apart as possible. However, users with similar opinions about the movies rated in common will have a relatively small angle between them.

Example 9.9: Figure 9.6 shows the matrix of Fig. 9.4 with all ratings normalized. An interesting effect is that D's ratings have effectively disappeared, because a 0 is the same as a blank when cosine distance is computed. Note that D gave only 3's and did not differentiate among movies, so it is quite possible that D's opinions are not worth taking seriously.

      HP1   HP2   HP3   TW    SW1   SW2   SW3
A      2/3               5/3  −7/3
B      1/3   1/3  −2/3
C                       −5/3   1/3   4/3
D                  0                 0

Figure 9.6: The utility matrix of Fig. 9.4 with all ratings normalized

Let us compute the cosine of the angle between A and B:

(2/3)(1/3) / (√((2/3)² + (5/3)² + (−7/3)²) √((1/3)² + (1/3)² + (−2/3)²)) = 0.092

The cosine of the angle between A and C is

((5/3)(−5/3) + (−7/3)(1/3)) / (√((2/3)² + (5/3)² + (−7/3)²) √((−5/3)² + (1/3)² + (4/3)²)) = −0.559

Notice that under this measure, A and C are much further apart than A and B, and neither pair is very close.
Both these observations make intuitive sense, given that A and C disagree on the two movies they rated in common, while A and B give similar scores to the one movie they rated in common. □

9.3.2 The Duality of Similarity

The utility matrix can be viewed as telling us about users, about items, or both. It is important to realize that any of the techniques we suggested in Section 9.3.1 for finding similar users can be used on columns of the utility matrix to find similar items. There are two ways in which the symmetry is broken in practice.

1. We can use information about users to recommend items. That is, given a user, we can find some number of the most similar users, perhaps using the techniques of Chapter 3. We can base our recommendation on the decisions made by these similar users, e.g., recommend the items that the greatest number of them have purchased or rated highly. However, there is no symmetry. Even if we find pairs of similar items, we need to take an additional step in order to recommend items to users. This point is explored further at the end of this subsection.

2. There is a difference in the typical behavior of users and items, as it pertains to similarity. Intuitively, items tend to be classifiable in simple terms. For example, music tends to belong to a single genre. It is impossible, e.g., for a piece of music to be both 60's rock and 1700's baroque. On the other hand, there are individuals who like both 60's rock and 1700's baroque, and who buy examples of both types of music. The consequence is that it is easier to discover items that are similar because they belong to the same genre than it is to detect that two users are similar because they prefer one genre in common, while each also likes some genres that the other doesn't care for.
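The measures of Section 9.3.1, applied to the utility matrix of Fig. 9.4, can be sketched as follows; the dictionary representation and function names are assumptions for illustration.

```python
import math

# Utility matrix of Fig. 9.4: user -> {movie: rating}; missing keys are blanks
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
    "D": {"HP3": 3, "SW2": 3},
}

def jaccard_distance(u, v):
    # ignore rating values; compare only the sets of items rated
    a, b = set(ratings[u]), set(ratings[v])
    return 1 - len(a & b) / len(a | b)

def cosine(u, v, normalize=False):
    ru, rv = dict(ratings[u]), dict(ratings[v])
    if normalize:  # subtract each user's average rating from their ratings
        for r in (ru, rv):
            avg = sum(r.values()) / len(r)
            for k in r:
                r[k] -= avg
    dot = sum(ru[k] * rv.get(k, 0) for k in ru)  # blanks act as 0
    lu = math.sqrt(sum(x * x for x in ru.values()))
    lv = math.sqrt(sum(x * x for x in rv.values()))
    return dot / (lu * lv)

print(jaccard_distance("A", "B"))                  # 0.8
print(jaccard_distance("A", "C"))                  # 0.5
print(round(cosine("A", "B"), 3))                  # 0.38
print(round(cosine("A", "B", normalize=True), 3))  # 0.092
print(round(cosine("A", "C", normalize=True), 3))  # -0.559
```

These reproduce the values of Examples 9.7–9.9. (User D's normalized vector is all 0's, so a normalized cosine involving D is undefined, matching the observation that D's ratings effectively disappear.)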
As we suggested in (1) above, one way of predicting the value of the utility-matrix entry for user U and item I is to find the n users (for some predetermined n) most similar to U and average their ratings for item I, counting only those among the n similar users who have rated I. It is generally better to normalize the matrix first. That is, for each of the n users, subtract their average rating for items from their rating for I. Average these differences for those users who have rated I, and then add this average to the average rating that U gives for all items. This normalization adjusts the estimate in the case that U tends to give very high or very low ratings, or a large fraction of the similar users who rated I (of which there may be only a few) are users who tend to rate very high or very low.

Dually, we can use item similarity to estimate the entry for user U and item I. Find the m items most similar to I, for some m, and take the average rating, among the m items, of the ratings that U has given. As for user-user similarity, we consider only those items among the m that U has rated, and it is probably wise to normalize item ratings first.

Note that whichever approach to estimating entries in the utility matrix we use, it is not sufficient to find only one entry. In order to recommend items to a user U, we need to estimate every entry in the row of the utility matrix for U, or at least find all or most of the entries in that row that are blank but have a high estimated value. There is a tradeoff regarding whether we should work from similar users or similar items.

• If we find similar users, then we only have to do the process once for user U. From the set of similar users we can estimate all the blanks in the utility matrix for U. If we work from similar items, we have to compute similar items for almost all items before we can estimate the row for U.
• On the other hand, item-item similarity often provides more reliable information, because of the phenomenon observed above, namely that it is easier to find items of the same genre than it is to find users that like only items of a single genre.

Whichever method we choose, we should precompute preferred items for each user, rather than waiting until we need to make a decision. Since the utility matrix evolves slowly, it is generally sufficient to compute it infrequently and assume that it remains fixed between recomputations.

9.3.3 Clustering Users and Items

It is hard to detect similarity among either items or users, because we have little information about user-item pairs in the sparse utility matrix. In the perspective of Section 9.3.2, even if two items belong to the same genre, there are likely to be very few users who bought or rated both. Likewise, even if two users both like a genre or genres, they may not have bought any items in common.

One way of dealing with this pitfall is to cluster items and/or users. Select any of the distance measures suggested in Section 9.3.1, or any other distance measure, and use it to perform a clustering of, say, items. Any of the methods suggested in Chapter 7 can be used. However, we shall see that there may be little reason to try to cluster into a small number of clusters immediately. Rather, a hierarchical approach, where we leave many clusters unmerged, may suffice as a first step. For example, we might leave half as many clusters as there are items.

      HP    TW   SW
A      4     5    1
B     4.67
C            2   4.5
D      3          3

Figure 9.7: Utility matrix for users and clusters of items

Example 9.10: Figure 9.7 shows what happens to the utility matrix of Fig. 9.4 if we manage to cluster the three Harry Potter movies into one cluster, denoted HP, and also cluster the three Star Wars movies into one cluster SW.
□

Having clustered items to an extent, we can revise the utility matrix so the columns represent clusters of items, and the entry for user U and cluster C is the average rating that U gave to those members of cluster C that U did rate. Note that U may have rated none of the cluster members, in which case the entry for U and C is still blank.

We can use this revised utility matrix to cluster users, again using the distance measure we consider most appropriate. Use a clustering algorithm that again leaves many clusters, e.g., half as many clusters as there are users. Revise the utility matrix, so the rows correspond to clusters of users, just as the columns correspond to clusters of items. As for item clusters, compute the entry for a user cluster by averaging the ratings of the users in the cluster.

This process can be repeated several times if we like. That is, we can cluster the item clusters and again merge the columns of the utility matrix that belong to one cluster. We can then turn to the users again, and cluster the user clusters. The process can repeat until we have an intuitively reasonable number of clusters of each kind.

Once we have clustered the users and/or items to the desired extent and computed the cluster-cluster utility matrix, we can estimate entries in the original utility matrix as follows. Suppose we want to predict the entry for user U and item I:

(a) Find the clusters to which U and I belong, say clusters C and D, respectively.
(b) If the entry in the cluster-cluster utility matrix for C and D is something other than blank, use this value as the estimated value for the U–I entry in the original utility matrix.
(c) If the entry for C–D is blank, then use the method outlined in Section 9.3.2 to estimate that entry by considering clusters similar to C or D. Use the resulting estimate as the estimate for the U–I entry.
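The construction of the cluster-level matrix (as in Fig. 9.7) can be sketched as follows; the dictionary representation and the helper name are assumptions for illustration.

```python
# Collapse the utility matrix of Fig. 9.4 into the cluster columns of Fig. 9.7:
# the entry for a user and a cluster is the average of the user's nonblank
# ratings for the cluster's members; it stays blank if the user rated none.

ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
    "D": {"HP3": 3, "SW2": 3},
}
clusters = {"HP": ["HP1", "HP2", "HP3"], "TW": ["TW"], "SW": ["SW1", "SW2", "SW3"]}

def cluster_matrix(ratings, clusters):
    result = {}
    for user, r in ratings.items():
        row = {}
        for cname, items in clusters.items():
            rated = [r[i] for i in items if i in r]
            if rated:                      # leave the entry blank otherwise
                row[cname] = sum(rated) / len(rated)
        result[user] = row
    return result

m = cluster_matrix(ratings, clusters)
print(m["B"]["HP"])   # 4.666..., as in Fig. 9.7
print(m["C"]["SW"])   # 4.5
```

Note that B's entry for TW is absent, matching the blank in Fig. 9.7.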
9.3.4 Exercises for Section 9.3

      a   b   c   d   e   f   g   h
A     4   5       5   1       3   2
B         3   4   3   1   2   1
C     2       1   3       4   5   3

Figure 9.8: A utility matrix for exercises

Exercise 9.3.1: Figure 9.8 is a utility matrix, representing the ratings, on a 1–5 star scale, of eight items, a through h, by three users A, B, and C. Compute the following from the data of this matrix.

(a) Treating the utility matrix as boolean, compute the Jaccard distance between each pair of users.
(b) Repeat Part (a), but use the cosine distance.
(c) Treat ratings of 3, 4, and 5 as 1 and 1, 2, and blank as 0. Compute the Jaccard distance between each pair of users.
(d) Repeat Part (c), but use the cosine distance.
(e) Normalize the matrix by subtracting from each nonblank entry the average value for its user.
(f) Using the normalized matrix from Part (e), compute the cosine distance between each pair of users.

Exercise 9.3.2: In this exercise, we cluster items in the matrix of Fig. 9.8. Do the following steps.

(a) Cluster the eight items hierarchically into four clusters. The following method should be used to cluster. Replace all 3's, 4's, and 5's by 1 and replace 1's, 2's, and blanks by 0. Use the Jaccard distance to measure the distance between the resulting column vectors. For clusters of more than one element, take the distance between clusters to be the minimum distance between pairs of elements, one from each cluster.
(b) Then, construct from the original matrix of Fig. 9.8 a new matrix whose rows correspond to users, as before, and whose columns correspond to clusters. Compute the entry for a user and a cluster of items by averaging the nonblank entries for that user and all the items in the cluster.
(c) Compute the cosine distance between each pair of users, according to your matrix from Part (b).

9.4 Dimensionality Reduction

An entirely different approach to estimating the blank entries in the utility matrix is to conjecture that the utility matrix is actually the product of two long, thin matrices.
This view makes sense if there is a relatively small set of features of items and users that determines the reaction of most users to most items. In this section, we sketch one approach to discovering two such matrices; the approach is called "UV-decomposition," and it is an instance of a more general theory called SVD (singular-value decomposition).

9.4.1 UV-Decomposition

Consider movies as a case in point. Most users respond to a small number of features; they like certain genres, they may have certain famous actors or actresses that they like, and perhaps there are a few directors with a significant following. If we start with the utility matrix M, with n rows and m columns (i.e., there are n users and m items), then we might be able to find a matrix U with n rows and d columns and a matrix V with d rows and m columns, such that UV closely approximates M in those entries where M is nonblank. If so, then we have established that there are d dimensions that allow us to characterize both users and items closely. We can then use the entry in the product UV to estimate the corresponding blank entry in utility matrix M. This process is called UV-decomposition of M.

5  2  4  4  3       u11 u12
3  1  2  4  1       u21 u22
2     3  1  4   =   u31 u32   ×   v11 v12 v13 v14 v15
2  5  4  3  5       u41 u42       v21 v22 v23 v24 v25
4  4  5  4          u51 u52

Figure 9.9: UV-decomposition of matrix M

Example 9.11: We shall use as a running example a 5-by-5 matrix M with all but two of its entries known. We wish to decompose M into a 5-by-2 matrix U and a 2-by-5 matrix V. The matrices M, U, and V are shown in Fig. 9.9, with the known entries of M indicated and the entries of U and V shown as variables to be determined. This example is essentially the smallest nontrivial case where there are more known entries than there are entries in U and V combined, and we therefore can expect that the best decomposition will not yield a product that agrees exactly in the nonblank entries of M. □
9.4.2 Root-Mean-Square Error

While we can pick among several measures of how close the product UV is to M, the typical choice is the root-mean-square error (RMSE), where we:

1. Sum, over all nonblank entries in M, the square of the difference between that entry and the corresponding entry in the product UV.

2. Take the mean (average) of these squares by dividing by the number of terms in the sum (i.e., the number of nonblank entries in M).

3. Take the square root of the mean.

Minimizing the sum of the squares is the same as minimizing the square root of the average square, so we generally omit the last two steps in our running example.

    1 1                          2 2 2 2 2
    1 1      1 1 1 1 1           2 2 2 2 2
    1 1  ×   1 1 1 1 1    =      2 2 2 2 2
    1 1                          2 2 2 2 2
    1 1                          2 2 2 2 2

Figure 9.10: Matrices U and V with all entries 1

Example 9.12: Suppose we guess that U and V should each have entries that are all 1's, as shown in Fig. 9.10. This is a poor guess, since the product, consisting of all 2's, has entries that are much below the average of the entries in M. Nonetheless, we can compute the RMSE for this U and V; in fact the regularity in the entries makes the calculation especially easy to follow. Consider the first rows of M and UV. We subtract 2 (each entry in UV) from the entries in the first row of M, to get 3, 0, 2, 2, 1. We square and sum these to get 18. In the second row, we do the same to get 1, −1, 0, 2, −1; squaring and summing gives 7. In the third row, the second column is blank, so that entry is ignored when computing the RMSE. The differences are 0, 1, −1, 2 and the sum of squares is 6. For the fourth row, the differences are 0, 3, 2, 1, 3 and the sum of squares is 23. The fifth row has a blank entry in the last column, so the differences are 2, 2, 3, 2 and the sum of squares is 21. When we sum the sums from each of the five rows, we get 18 + 7 + 6 + 23 + 21 = 75.
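The row-by-row arithmetic is easy to mechanize. A short sketch (the matrix layout and helper names are ours) that sums squared errors over the nonblank entries of the running-example M:

```python
from math import sqrt

# Running-example matrix M of Fig. 9.9; None marks a blank entry.
M = [
    [5, 2, 4, 4, 3],
    [3, 1, 2, 4, 1],
    [2, None, 3, 1, 4],
    [2, 5, 4, 3, 5],
    [4, 4, 5, 4, None],
]

def sum_of_squares(M, P):
    """Sum of squared differences over the nonblank entries of M."""
    return sum((m - p) ** 2
               for row_m, row_p in zip(M, P)
               for m, p in zip(row_m, row_p) if m is not None)

def rmse(M, P):
    """True RMSE: mean the squares over the nonblank count, then root."""
    n = sum(1 for row in M for m in row if m is not None)
    return sqrt(sum_of_squares(M, P) / n)

P = [[2] * 5 for _ in range(5)]   # the product UV when U and V are all 1's
print(sum_of_squares(M, P))       # 75, agreeing with Example 9.12
print(round(rmse(M, P), 3))       # 1.806
```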
Generally, we shall stop at this point, but if we want to compute the true RMSE, we divide by 23 (the number of nonblank entries in M) and take the square root. In this case, √(75/23) = 1.806 is the RMSE. □

9.4.3 Incremental Computation of a UV-Decomposition

Finding the UV-decomposition with the least RMSE involves starting with some arbitrarily chosen U and V, and repeatedly adjusting U and V to make the RMSE smaller. We shall consider only adjustments to a single element of U or V, although in principle, one could make more complex adjustments. Whatever adjustments we allow, in a typical example there will be many local minima – matrices U and V such that no allowable adjustment reduces the RMSE. Unfortunately, only one of these local minima will be the global minimum – the matrices U and V that produce the least possible RMSE. To increase our chances of finding the global minimum, we need to pick many different starting points, that is, different choices of the initial matrices U and V. However, there is never a guarantee that our best local minimum will be the global minimum.

We shall start with the U and V of Fig. 9.10, where all entries are 1, and make a few adjustments to some of the entries, finding the values of those entries that give the largest possible improvement to the RMSE. From these examples, the general calculation should become obvious, but we shall follow the examples with the formula for minimizing the RMSE by changing a single entry. In what follows, we shall refer to entries of U and V by their variable names u11, and so on, as given in Fig. 9.9.

Example 9.13: Suppose we start with U and V as in Fig. 9.10, and we decide to alter u11 to reduce the RMSE as much as possible. Let the value of u11 be x. Then the new U and V can be expressed as in Fig. 9.11.
    x 1                          x+1  x+1  x+1  x+1  x+1
    1 1      1 1 1 1 1            2    2    2    2    2
    1 1  ×   1 1 1 1 1    =       2    2    2    2    2
    1 1                           2    2    2    2    2
    1 1                           2    2    2    2    2

Figure 9.11: Making u11 a variable

Notice that the only entries of the product that have changed are those in the first row. Thus, when we compare UV with M, the only change to the RMSE comes from the first row. The contribution to the sum of squares from the first row is

    (5 − (x+1))² + (2 − (x+1))² + (4 − (x+1))² + (4 − (x+1))² + (3 − (x+1))²

This sum simplifies to

    (4 − x)² + (1 − x)² + (3 − x)² + (3 − x)² + (2 − x)²

We want the value of x that minimizes the sum, so we take the derivative and set it equal to 0:

    −2 × ((4 − x) + (1 − x) + (3 − x) + (3 − x) + (2 − x)) = 0

or −2 × (13 − 5x) = 0, from which it follows that x = 2.6.

    2.6 1                        3.6  3.6  3.6  3.6  3.6
    1   1     1 1 1 1 1           2    2    2    2    2
    1   1  ×  1 1 1 1 1    =      2    2    2    2    2
    1   1                         2    2    2    2    2
    1   1                         2    2    2    2    2

Figure 9.12: The best value for u11 is found to be 2.6

Figure 9.12 shows U and V after u11 has been set to 2.6. Note that the sum of the squares of the errors in the first row has been reduced from 18 to 5.2, so the total RMSE (ignoring the averaging and square root) has been reduced from 75 to 62.2.

    2.6 1                        2.6y+1  3.6  3.6  3.6  3.6
    1   1     y 1 1 1 1           y+1     2    2    2    2
    1   1  ×  1 1 1 1 1    =      y+1     2    2    2    2
    1   1                         y+1     2    2    2    2
    1   1                         y+1     2    2    2    2

Figure 9.13: v11 becomes a variable y

Suppose our next entry to vary is v11. Let the value of this entry be y, as suggested in Fig. 9.13. Only the first column of the product is affected by y, so we need only compute the sum of the squares of the differences between the entries in the first columns of M and UV. This sum is

    (5 − (2.6y+1))² + (3 − (y+1))² + (2 − (y+1))² + (2 − (y+1))² + (4 − (y+1))²

This expression simplifies to

    (4 − 2.6y)² + (2 − y)² + (1 − y)² + (1 − y)² + (3 − y)²

As before, we find the minimum value of this expression by differentiating and equating to 0:

    −2 × (2.6(4 − 2.6y) + (2 − y) + (1 − y) + (1 − y) + (3 − y)) = 0

The solution for y is y = 17.4/10.76 = 1.617.
The improved estimates of U and V are shown in Fig. 9.14.

    2.6 1                            5.204  3.6  3.6  3.6  3.6
    1   1     1.617 1 1 1 1          2.617   2    2    2    2
    1   1  ×  1     1 1 1 1    =     2.617   2    2    2    2
    1   1                            2.617   2    2    2    2
    1   1                            2.617   2    2    2    2

Figure 9.14: Replace y by 1.617

We shall make one more change, to illustrate what happens when entries of M are blank. We shall vary u31, calling it z temporarily. The new U and V are shown in Fig. 9.15.

    2.6 1                            5.204     3.6  3.6  3.6  3.6
    1   1     1.617 1 1 1 1          2.617      2    2    2    2
    z   1  ×  1     1 1 1 1    =     1.617z+1  z+1  z+1  z+1  z+1
    1   1                            2.617      2    2    2    2
    1   1                            2.617      2    2    2    2

Figure 9.15: u31 becomes a variable z

The value of z affects only the entries in the third row. We can express the sum of the squares of the errors as

    (2 − (1.617z+1))² + (3 − (z+1))² + (1 − (z+1))² + (4 − (z+1))²

Note that there is no contribution from the element in the second column of the third row, since this element is blank in M. The expression simplifies to

    (1 − 1.617z)² + (2 − z)² + (−z)² + (3 − z)²

The usual process of setting the derivative to 0 gives us

    −2 × (1.617(1 − 1.617z) + (2 − z) + (−z) + (3 − z)) = 0

whose solution is z = 6.617/5.615 = 1.178. The next estimate of the decomposition UV is shown in Fig. 9.16. □

    2.6   1                          5.204  3.6    3.6    3.6    3.6
    1     1     1.617 1 1 1 1        2.617   2      2      2      2
    1.178 1  ×  1     1 1 1 1   =    2.905  2.178  2.178  2.178  2.178
    1     1                          2.617   2      2      2      2
    1     1                          2.617   2      2      2      2

Figure 9.16: Replace z by 1.178

9.4.4 Optimizing an Arbitrary Element

Having seen some examples of picking the optimum value for a single element in the matrix U or V, let us now develop the general formula. As before, assume that M is an n-by-m utility matrix with some entries blank, while U and V are matrices of dimensions n-by-d and d-by-m, for some d. We shall use mij, uij, and vij for the entries in row i and column j of M, U, and V, respectively. Also, let P = UV, and use pij for the element in row i and column j of the product matrix P.
Suppose we want to vary urs and find the value of this element that minimizes the RMSE between M and UV. Note that urs affects only the elements in row r of the product P = UV. Thus, we need only concern ourselves with the elements

    prj = Σ_{k=1..d} urk vkj = Σ_{k≠s} urk vkj + x vsj

for all values of j such that mrj is nonblank. In the expression above, we have replaced urs, the element we wish to vary, by a variable x, and we use the convention

• Σ_{k≠s} is shorthand for the sum for k = 1, 2, ..., d, except for k = s.

If mrj is a nonblank entry of the matrix M, then the contribution of this element to the sum of the squares of the errors is

    (mrj − prj)² = (mrj − Σ_{k≠s} urk vkj − x vsj)²

We shall use another convention:

• Σ_j is shorthand for the sum over all j such that mrj is nonblank.

Then we can write the sum of the squares of the errors that are affected by the value of x = urs as

    Σ_j (mrj − Σ_{k≠s} urk vkj − x vsj)²

Take the derivative of the above with respect to x, and set it equal to 0, in order to find the value of x that minimizes the RMSE. That is,

    Σ_j −2 vsj (mrj − Σ_{k≠s} urk vkj − x vsj) = 0

As in the previous examples, the common factor −2 can be dropped. We solve the above equation for x, and get

    x = [ Σ_j vsj (mrj − Σ_{k≠s} urk vkj) ] / Σ_j vsj²

There is an analogous formula for the optimum value of an element of V. If we want to vary vrs = y, then the value of y that minimizes the RMSE is

    y = [ Σ_i uir (mis − Σ_{k≠r} uik vks) ] / Σ_i uir²

Here, Σ_i is shorthand for the sum over all i such that mis is nonblank, and Σ_{k≠r} is the sum over all values of k between 1 and d, except for k = r.

9.4.5 Building a Complete UV-Decomposition Algorithm

Now we have the tools to search for the global optimum decomposition of a utility matrix M. There are four areas where we shall discuss the options.

1. Preprocessing of the matrix M.

2. Initializing U and V.

3. Ordering the optimization of the elements of U and V.

4. Ending the attempt at optimization.
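Before taking up these options, the single-element update formulas of Section 9.4.4 can be expressed in code. A sketch (the helper names are ours) that reproduces the values x = 2.6, y = 1.617, and z = 1.178 from Example 9.13:

```python
# Single-element updates for UV-decomposition, following Section 9.4.4.
# None marks a blank entry of M.
M = [
    [5, 2, 4, 4, 3],
    [3, 1, 2, 4, 1],
    [2, None, 3, 1, 4],
    [2, 5, 4, 3, 5],
    [4, 4, 5, 4, None],
]

def optimize_u(M, U, V, r, s):
    """Set U[r][s] = x minimizing the sum of squared errors in row r."""
    d = len(V)
    num = den = 0.0
    for j, mrj in enumerate(M[r]):
        if mrj is None:
            continue                       # blanks contribute nothing
        rest = sum(U[r][k] * V[k][j] for k in range(d) if k != s)
        num += V[s][j] * (mrj - rest)
        den += V[s][j] ** 2
    U[r][s] = num / den

def optimize_v(M, U, V, r, s):
    """Set V[r][s] = y minimizing the sum of squared errors in column s."""
    d = len(V)
    num = den = 0.0
    for i in range(len(M)):
        mis = M[i][s]
        if mis is None:
            continue
        rest = sum(U[i][k] * V[k][s] for k in range(d) if k != r)
        num += U[i][r] * (mis - rest)
        den += U[i][r] ** 2
    V[r][s] = num / den

U = [[1.0] * 2 for _ in range(5)]
V = [[1.0] * 5 for _ in range(2)]
optimize_u(M, U, V, 0, 0)   # u11 becomes 2.6, as in Example 9.13
optimize_v(M, U, V, 0, 0)   # v11 becomes 17.4/10.76 = 1.617...
optimize_u(M, U, V, 2, 0)   # u31 becomes approximately 1.178
print(U[0][0], V[0][0], U[2][0])
```

Note that, as in the hand calculation, the blank entry m32 is simply skipped when u31 is optimized.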
Preprocessing

Because the differences in the quality of items and the rating scales of users are such important factors in determining the missing elements of the matrix M, it is often useful to remove these influences before doing anything else. The idea was introduced in Section 9.3.1. We can subtract from each nonblank element mij the average rating of user i. Then, the resulting matrix can be modified by subtracting the average rating (in the modified matrix) of item j. It is also possible to first subtract the average rating of item j and then subtract the average rating of user i in the modified matrix. The results one obtains from doing things in these two different orders need not be the same, but will tend to be close. A third option is to normalize by subtracting from mij the average of the average ratings of user i and item j, that is, subtracting one half the sum of the user average and the item average.

If we choose to normalize M, then when we make predictions, we need to undo the normalization. That is, if whatever prediction method we use results in estimate e for an element mij of the normalized matrix, then the value we predict for mij in the true utility matrix is e plus whatever amount was subtracted from row i and from column j during the normalization process.

Initialization

As we mentioned, it is essential that there be some randomness in the way we seek an optimum solution, because the existence of many local minima justifies our running many different optimizations in the hope of reaching the global minimum on at least one run. We can vary the initial values of U and V, or we can vary the way we seek the optimum (to be discussed next), or both. A simple starting point for U and V is to give each element the same value, and a good choice for this value is one that gives the elements of the product UV the average of the nonblank elements of M. Note that if we have normalized M, then this value will necessarily be 0.
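Returning for a moment to the Preprocessing step, the first option above (subtract user averages, then item averages of the modified matrix) can be sketched as follows; the function and the idea of returning the offsets needed to undo the normalization are our own packaging, not from the text.

```python
# Running-example matrix; None marks a blank entry.
M = [
    [5, 2, 4, 4, 3],
    [3, 1, 2, 4, 1],
    [2, None, 3, 1, 4],
    [2, 5, 4, 3, 5],
    [4, 4, 5, 4, None],
]

def normalize(M):
    """Subtract each user's (row) average from the nonblank entries,
    then each item's (column) average of the modified matrix.
    Returns the normalized matrix plus the row and column offsets
    needed to undo the normalization when predicting."""
    M = [row[:] for row in M]              # work on a copy
    row_avg = []
    for row in M:
        vals = [v for v in row if v is not None]
        avg = sum(vals) / len(vals)
        row_avg.append(avg)
        for j, v in enumerate(row):
            if v is not None:
                row[j] = v - avg
    col_avg = []
    for j in range(len(M[0])):
        vals = [row[j] for row in M if row[j] is not None]
        avg = sum(vals) / len(vals)
        col_avg.append(avg)
        for row in M:
            if row[j] is not None:
                row[j] -= avg
    return M, row_avg, col_avg

N, row_avg, col_avg = normalize(M)
# To undo: an estimate e for entry (i, j) of the normalized matrix
# becomes e + row_avg[i] + col_avg[j] in the true utility matrix.
```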
If we have chosen d as the length of the short sides of U and V, and a is the average nonblank element of M, then the elements of U and V should be √(a/d). If we want many starting points for U and V, then we can perturb the value √(a/d) randomly and independently for each of the elements. There are many options for how we do the perturbation. We have a choice regarding the distribution of the difference. For example, we could add to each element a normally distributed value with mean 0 and some chosen standard deviation. Or we could add a value uniformly chosen from the range −c to +c for some c.

Performing the Optimization

In order to reach a local minimum from a given starting value of U and V, we need to pick an order in which we visit the elements of U and V. The simplest thing to do is pick an order, e.g., row-by-row, for the elements of U and V, and visit them in round-robin fashion. Note that just because we optimized an element once does not mean we cannot find a better value for that element after other elements have been adjusted. Thus, we need to visit elements repeatedly, until we have reason to believe that no further improvements are possible. Alternatively, we can follow many different optimization paths from a single starting value by randomly picking the element to optimize. To make sure that every element is considered in each round, we could instead choose a permutation of the elements and follow that order for every round.

Converging to a Minimum

Ideally, at some point the RMSE becomes 0, and we know we cannot do better. In practice, since there are normally many more nonblank elements in M than there are elements in U and V together, we have no right to expect that we can reduce the RMSE to 0. Thus, we have to detect when there is little benefit to be had in revisiting elements of U and/or V.
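Putting these pieces together, a minimal sketch of one complete search: initialization at √(a/d), row-by-row round-robin visits using the update formula of Section 9.4.4, and a stopping test on the per-round improvement. The overall packaging (function signature, tolerance parameter) is ours; for brevity the sketch omits the random perturbation and uses a single starting point.

```python
from math import sqrt

M = [
    [5, 2, 4, 4, 3],
    [3, 1, 2, 4, 1],
    [2, None, 3, 1, 4],
    [2, 5, 4, 3, 5],
    [4, 4, 5, 4, None],
]

def rmse(M, U, V):
    total, count = 0.0, 0
    for i, row in enumerate(M):
        for j, m in enumerate(row):
            if m is None:
                continue
            p = sum(U[i][k] * V[k][j] for k in range(len(V)))
            total += (m - p) ** 2
            count += 1
    return sqrt(total / count)

def uv_decompose(M, d=2, tol=1e-6):
    nonblank = [v for row in M for v in row if v is not None]
    a = sum(nonblank) / len(nonblank)     # average nonblank element
    init = sqrt(a / d)                    # makes UV's entries average a
    n, m = len(M), len(M[0])
    U = [[init] * d for _ in range(n)]
    V = [[init] * m for _ in range(d)]
    err = rmse(M, U, V)
    while True:
        for r in range(n):                # round-robin over U's elements
            for s in range(d):
                num = den = 0.0
                for j in range(m):
                    if M[r][j] is None:
                        continue
                    rest = sum(U[r][k] * V[k][j] for k in range(d) if k != s)
                    num += V[s][j] * (M[r][j] - rest)
                    den += V[s][j] ** 2
                U[r][s] = num / den
        for r in range(d):                # then over V's elements
            for s in range(m):
                num = den = 0.0
                for i in range(n):
                    if M[i][s] is None:
                        continue
                    rest = sum(U[i][k] * V[k][s] for k in range(d) if k != r)
                    num += U[i][r] * (M[i][s] - rest)
                    den += U[i][r] ** 2
                V[r][s] = num / den
        new_err = rmse(M, U, V)
        if err - new_err < tol:           # a round helped little: stop
            return U, V, new_err
        err = new_err

U, V, err = uv_decompose(M)
print(round(err, 3))
```

Since each single-element update can only lower the sum of squared errors, the RMSE decreases monotonically and the loop is guaranteed to terminate.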
We can track the amount of improvement in the RMSE obtained in one round of the optimization, and stop when that improvement falls below a threshold. A small variation is to observe the improvements resulting from the optimization of individual elements, and stop when the maximum improvement during a round is below a threshold.

Avoiding Overfitting

One problem that often arises when performing a UV-decomposition is that we arrive at one of the many local minima that conform well to the given data but pick up values in the data that don't reflect well the underlying process that gives rise to the data. That is, although the RMSE may be small on the given data, it doesn't do well predicting future data. There are several things that can be done to cope with this problem, which statisticians call overfitting.

1. Avoid favoring the first components to be optimized by moving the value of a component only a fraction of the way, say halfway, from its current value toward its optimized value.

2. Stop revisiting elements of U and V well before the process has converged.

3. Take several different UV-decompositions, and when predicting a new entry in the matrix M, take the average of the results of using each decomposition.

9.4.6 Exercises for Section 9.4

Exercise 9.4.1: Starting with the decomposition of Fig. 9.10, we may choose any of the 20 entries in U or V to optimize first. Perform this first optimization step assuming we choose: (a) u32 (b) v41.

Exercise 9.4.2: If we wish to start out, as in Fig. 9.10, with all U and V entries set to the same value, what value minimizes the RMSE for the matrix M of our running example?

Exercise 9.4.3: Starting with the U and V matrices in Fig. 9.16, do the following in order:

(a) Reconsider the value of u11. Find its new best value, given the changes that have been made so far.

(b) Then choose the best value for u52.

(c) Then choose the best value for v22.
Exercise 9.4.4: Derive the formula for y (the optimum value of element vrs) given at the end of Section 9.4.4.

Exercise 9.4.5: Normalize the matrix M of our running example by:

(a) First subtracting from each element the average of its row, and then subtracting from each element the average of its (modified) column.

(b) First subtracting from each element the average of its column, and then subtracting from each element the average of its (modified) row.

Are there any differences in the results of (a) and (b)?

9.5 The NetFlix Challenge

A significant boost to research into recommendation systems was given when NetFlix offered a prize of $1,000,000 to the first person or team to beat their own recommendation algorithm, called CineMatch, by 10%. After over three years of work, the prize was awarded in September, 2009.

The NetFlix challenge consisted of a published dataset, giving the ratings by approximately half a million users on (typically small subsets of) approximately 17,000 movies. This data was selected from a larger dataset, and proposed algorithms were tested on their ability to predict the ratings in a secret remainder of the larger dataset. The information for each (user, movie) pair in the published dataset included a rating (1–5 stars) and the date on which the rating was made. The RMSE was used to measure the performance of algorithms. CineMatch has an RMSE of approximately 0.95; i.e., the typical rating would be off by almost one full star. To win the prize, it was necessary that your algorithm have an RMSE that was at most 90% of the RMSE of CineMatch. The bibliographic notes for this chapter include references to descriptions of the winning algorithms. Here, we mention some interesting and perhaps unintuitive facts about the challenge.

• CineMatch was not a very good algorithm. In fact, it was discovered early that the obvious algorithm of predicting, for the rating by user u on movie m, the average of:

1.
The average rating given by u on all rated movies, and

2. The average of the ratings for movie m by all users who rated that movie,

was only 3% worse than CineMatch.

• The UV-decomposition algorithm described in Section 9.4 was found by three students (Michael Harris, Jeffrey Wang, and David Kamm) to give a 7% improvement over CineMatch, when coupled with normalization and a few other tricks.

• The winning entry was actually a combination of several different algorithms that had been developed independently. A second team, which submitted an entry that would have won had it been submitted a few minutes earlier, also used a blend of independent algorithms. This strategy – combining different algorithms – has been used before in a number of hard problems and is something worth remembering.

• Several attempts have been made to use the data contained in IMDB, the Internet Movie Database, to match the names of movies from the NetFlix challenge with their names in IMDB, and thus to extract useful information not contained in the NetFlix data itself. IMDB has information about actors and directors, and classifies movies into one or more of 28 genres. It was found that genre and other information was not useful. One possible reason is that the machine-learning algorithms were able to discover the relevant information anyway, and a second is that the entity-resolution problem of matching movie names as given in the NetFlix and IMDB data is not that easy to solve exactly.

• Time of rating turned out to be useful. It appears there are movies that are more likely to be appreciated by people who rate them immediately after viewing than by those who wait a while and then rate them. "Patch Adams" was given as an example of such a movie. Conversely, there are other movies that were not liked by those who rated them immediately, but were better appreciated after a while; "Memento" was cited as an example.
While one cannot tease out of the data information about how long the delay was between viewing and rating, it is generally safe to assume that most people see a movie shortly after it comes out. Thus, one can examine the ratings of any movie to see whether its ratings have an upward or downward slope with time.

9.6 Summary of Chapter 9

✦ Utility Matrices: Recommendation systems deal with users and items. A utility matrix offers known information about the degree to which a user likes an item. Normally, most entries are unknown, and the essential problem of recommending items to users is predicting the values of the unknown entries based on the values of the known entries.

✦ Two Classes of Recommendation Systems: These systems attempt to predict a user's response to an item by discovering similar items and the response of the user to those. One class of recommendation system is content-based; it measures similarity by looking for common features of the items. A second class of recommendation system uses collaborative filtering; these measure similarity of users by their item preferences and/or measure similarity of items by the users who like them.

✦ Item Profiles: These consist of features of items. Different kinds of items have different features on which content-based similarity can be based. Features of documents are typically important or unusual words. Products have attributes such as screen size for a television. Media such as movies have a genre and details such as actor or performer. Tags can also be used as features if they can be acquired from interested users.

✦ User Profiles: A content-based collaborative filtering system can construct profiles for users by measuring the frequency with which features appear in the items the user likes. We can then estimate the degree to which a user will like an item by the closeness of the item's profile to the user's profile.
✦ Classification of Items: An alternative to constructing a user profile is to build a classifier for each user, e.g., a decision tree. The row of the utility matrix for that user becomes the training data, and the classifier must predict the response of the user to all items, whether or not the row had an entry for that item.

✦ Similarity of Rows and Columns of the Utility Matrix: Collaborative filtering algorithms must measure the similarity of rows and/or columns of the utility matrix. Jaccard distance is appropriate when the matrix consists only of 1's and blanks (for "not rated"). Cosine distance works for more general values in the utility matrix. It is often useful to normalize the utility matrix by subtracting the average value (either by row, by column, or both) before measuring the cosine distance.

✦ Clustering Users and Items: Since the utility matrix tends to be mostly blanks, distance measures such as Jaccard or cosine often have too little data with which to compare two rows or two columns. A preliminary step or steps, in which similarity is used to cluster users and/or items into small groups with strong similarity, can help provide more common components with which to compare rows or columns.

✦ UV-Decomposition: One way of predicting the blank values in a utility matrix is to find two long, thin matrices U and V whose product is an approximation to the given utility matrix. Since the matrix product UV gives values for all user-item pairs, that value can be used to predict the value of a blank in the utility matrix. The intuitive reason this method makes sense is that often there are a relatively small number of issues (that number is the "thin" dimension of U and V) that determine whether or not a user likes an item.

✦ Root-Mean-Square Error: A good measure of how close the product UV is to the given utility matrix is the RMSE (root-mean-square error).
The RMSE is computed by averaging the square of the differences between UV and the utility matrix, in those elements where the utility matrix is nonblank. The square root of this average is the RMSE.

✦ Computing U and V: One way of finding a good choice for U and V in a UV-decomposition is to start with arbitrary matrices U and V, and repeatedly adjust one of the elements of U or V to minimize the RMSE between the product UV and the given utility matrix. The process converges to a local optimum, although to have a good chance of obtaining a global optimum we must either repeat the process from many starting matrices, or search from the starting point in many different ways.

✦ The NetFlix Challenge: An important driver of research into recommendation systems was the NetFlix challenge. A prize of $1,000,000 was offered to a contestant who could produce an algorithm that was 10% better than NetFlix's own algorithm at predicting movie ratings by users. The prize was awarded in September, 2009.

9.7 References for Chapter 9

[1] is a survey of recommendation systems as of 2005. The argument regarding the importance of the long tail in on-line systems is from [2], which was expanded into a book [3]. [8] discusses the use of computer games to extract tags for items. See [5] for a discussion of item-item similarity and how Amazon designed its collaborative-filtering algorithm for product recommendations. There are three papers describing the three algorithms that, in combination, won the NetFlix challenge. They are [4], [6], and [7].

1. G. Adomavicius and A. Tuzhilin, "Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions," IEEE Trans. on Knowledge and Data Engineering 17:6, pp. 734–749, 2005.

2. C. Anderson, http://www.wired.com/wired/archive/12.10/tail.html, 2004.

3. C. Anderson, The Long Tail: Why the Future of Business is Selling Less of More, Hyperion Books, New York, 2006.

4. Y. Koren, "The BellKor solution to the Netflix grand prize," www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf, 2009.

5. G. Linden, B. Smith, and J. York, "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing 7:1, pp. 76–80, 2003.

6. M. Piotte and M. Chabbert, "The Pragmatic Theory solution to the Netflix grand prize," www.netflixprize.com/assets/GrandPrize2009_BPC_PragmaticTheory.pdf, 2009.

7. A. Toscher, M. Jahrer, and R. Bell, "The BigChaos solution to the Netflix grand prize," www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf, 2009.

8. L. von Ahn, "Games with a purpose," IEEE Computer Magazine, pp. 96–98, June 2006.
Status, 281
Steinbach, M., 18
Stochastic matrix, 149
Stop clustering, 227, 231, 233
Stop words, 8, 61, 97, 187, 293
Stream, see Data stream
String, 100
Striping, 28, 159, 161
Strongly connected component, 151
Strongly connected graph, 149
Substochastic matrix, 152
Suffix length, 104
Summarization, 3
Summation, 138
Superimposed code, see Bloom filter, 143
Supermarket, 185, 208
Superstep, 43
Support, 184, 209, 210, 212, 214
Supporting page, 170
Surprise number, 128
SVD, see Singular-value decomposition
Swami, A., 220
Szegedy, M., 144
Tag, 294
Tail length, 125
Tan, P.-N., 18
Target page, 170
Task, 21
Taxation, 152, 155, 170, 175
Taylor expansion, 12
Taylor, M., 285
Teleport set, 166, 167, 172
Teleportation, 156
Tendril, 151
Term, 146
Term frequency, see TF.IDF, 8
Term spam, 146, 169
TF, see Term frequency
TF.IDF, 7, 8, 293
Theobald, M., 111
Thrashing, 161, 200
Threshold, 71, 141, 184, 210, 214
TIA, see Total Information Awareness
Timestamp, 133, 252
Toivonen's Algorithm, 211
Toivonen, H., 220
Tomkins, A., 18, 54, 182
Topic-sensitive PageRank, 165, 172
Toscher, A., 320
Total Information Awareness, 5
Touching the Void, 291
Transaction, see Basket
Transition matrix of the Web, 148, 159, 160, 162
Transitive closure, 40
Transpose, 175
Transposition, 80
Tree, 228, 246, 247, see Decision tree
Triangle inequality, 74
Triangular matrix, 192, 202
Triples method, 193, 202
TrustRank, 172
Trustworthy page, 172
Tube, 152
Tuple, 29
Tuzhilin, A., 320
Twitter, 281
Ullman, J.D., 18, 53, 54, 220, 260
Union, 30, 33, 37
Universal set, 100
User, 288, 304, 305
User profile, 296
Utility matrix, 288, 291, 308
UV-decomposition, 308, 317
Variable, 128
Vazirani, U., 286
Vazirani, V., 286
Vector, 27, 74, 78, 149, 159, 174, 175, 222
Vitter, J., 144
von Ahn, L., 295, 320
Wallach, D.A., 53
Wang, J., 317
Wang, W., 111
Weaver, D., 53
Web structure, 151
Weiner, J., 18, 182
Whizbang Labs, 2
Widom, J., 18, 54, 144, 260
Window, see Sliding window, see Decaying window
Windows, 12
Word, 187, 222, 293
Word count, 23
Worker process, 25
Workflow, 38, 40, 44
Working store, 114
Xiao, C., 111
Yahoo, 271, 294
Yerneni, R., 53
York, J., 320
Yu, J.X., 111
Yu, P.S., 220
Yu, Y., 54
Zhang, T., 260
Zipf's law, 15, see Power law
Zoeter, O., 285