Privacy-Preserving Data Mining: Models and Algorithms

ADVANCES IN DATABASE SYSTEMS, Volume 34

Series Editors:
Ahmed K. Elmagarmid, Purdue University, West Lafayette, IN 47907
Amit P. Sheth, Wright State University, Dayton, Ohio 45435

Other books in the Series:
SEQUENCE DATA MINING, Guozhu Dong, Jian Pei; ISBN: 978-0-387-69936-3
DATA STREAMS: Models and Algorithms, edited by Charu C. Aggarwal; ISBN: 978-0-387-28759-1
SIMILARITY SEARCH: The Metric Space Approach, P. Zezula, G. Amato, V. Dohnal, M. Batko; ISBN: 0-387-29146-6
STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi Abdelguerfi; ISBN: 0-387-24393-3
FUZZY DATABASE MODELING WITH XML, Zongmin Ma; ISBN: 0-387-24248-1
MINING SEQUENTIAL PATTERNS FROM LARGE DATA SETS, Wei Wang and Jiong Yang; ISBN: 0-387-24246-5
ADVANCED SIGNATURE INDEXING FOR MULTIMEDIA AND WEB APPLICATIONS, Yannis Manolopoulos, Alexandros Nanopoulos, Eleni Tousidou; ISBN: 1-4020-7425-5
ADVANCES IN DIGITAL GOVERNMENT: Technology, Human Factors, and Policy, edited by William J. McIver, Jr. and Ahmed K. Elmagarmid; ISBN: 1-4020-7067-5
INFORMATION AND DATABASE QUALITY, Mario Piattini, Coral Calero and Marcela Genero; ISBN: 0-7923-7599-8
DATA QUALITY, Richard Y. Wang, Mostapha Ziad, Yang W. Lee; ISBN: 0-7923-7215-8
THE FRACTAL STRUCTURE OF DATA REFERENCE: Applications to the Memory Hierarchy, Bruce McNutt; ISBN: 0-7923-7945-4
SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND BROWSING, Shu-Ching Chen, R.L. Kashyap, and Arif Ghafoor; ISBN: 0-7923-7888-1
INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA: A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0
DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS, Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0
MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dado Vrsalovic; ISBN: 0-7923-7840-7
ADVANCED DATABASE INDEXING, Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras; ISBN: 0-7923-7716-8
MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil Jajodia, Binto George; ISBN: 0-7923-7702-8
FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6
PRIVACY-PRESERVING DATA MINING: Models and Algorithms, edited by Charu C. Aggarwal and Philip S. Yu; ISBN: 0-387-70991-8

Privacy-Preserving Data Mining: Models and Algorithms

Edited by

Charu C. Aggarwal
IBM Thomas J. Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532
charu@us.ibm.com

Philip S. Yu
Department of Computer Science
University of Illinois at Chicago
854 South Morgan Street, Chicago, IL 60607-7053
psyu@cs.uic.edu

Series Editors:
Ahmed K. Elmagarmid, Purdue University, West Lafayette, IN 47907
Amit P. Sheth, Wright State University, Dayton, Ohio 45435

ISBN 978-0-387-70991-8
e-ISBN 978-0-387-70992-5
DOI 10.1007/978-0-387-70992-5
Library of Congress Control Number: 2007943463

© 2008 Springer Science+Business Media, LLC. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Preface

In recent years, advances in hardware technology have led to an increase in the capability to store and record personal data about consumers and individuals. This has led to concerns that the personal data may be misused for a variety of purposes. In order to alleviate these concerns, a number of techniques have recently been proposed in order to perform the data mining tasks in a privacy-preserving way. These techniques for performing privacy-preserving data mining are drawn from a wide array of related topics such as data mining, cryptography, and information hiding. The material in this book is drawn from these different topics so as to provide a good overview of the important results in the field.

While a large number of research papers are now available in this field, many of the topics have been studied by different communities with different styles. At this stage, it becomes important to organize the topics in such a way that the relative importance of different research areas is recognized. Furthermore, the field of privacy-preserving data mining has been explored independently by the cryptography, database, and statistical disclosure control communities. In some cases, the parallel lines of work are quite similar, but the communities are not sufficiently integrated for the provision of a broader perspective. This book contains chapters from researchers of all three communities, and therefore tries to provide a balanced perspective of the work done in this field.

This book is structured as an edited book with chapters from prominent researchers in the field. Each chapter contains a survey of the key research content on its topic, along with the future directions of research in the field. Emphasis has been placed on making each chapter self-sufficient. While the chapters are written by different researchers, the topics and content are organized so as to present the most important models, algorithms, and applications in the privacy field in a structured and concise way. In addition, attention has been paid to drawing chapters from researchers working in different areas in order to provide different points of view. Given the lack of structurally organized information on the topic of privacy, the book provides insights which are not easily accessible otherwise. A few chapters in the book are not surveys, since the corresponding topics fall in the emerging category, and enough material is not yet available to create a survey. In such cases, the individual results have been included to give a flavor of the emerging research in the field. It is expected that the book will be a great help to researchers and graduate students interested in the topic. While the privacy field clearly falls in the emerging category because of its recency, it is now beginning to reach a point of maturity and popularity where the development of an overview book on the topic becomes both possible and necessary. It is hoped that this book will provide a reference to students, researchers, and practitioners in both introducing the topic of privacy-preserving data mining and understanding the practical and algorithmic aspects of the area.
Contents

Preface
List of Figures
List of Tables

1 An Introduction to Privacy-Preserving Data Mining
Charu C. Aggarwal, Philip S. Yu
1.1. Introduction
1.2. Privacy-Preserving Data Mining Algorithms
1.3. Conclusions and Summary
References

2 A General Survey of Privacy-Preserving Data Mining Models and Algorithms
Charu C. Aggarwal, Philip S. Yu
2.1. Introduction
2.2. The Randomization Method
2.2.1 Privacy Quantification
2.2.2 Adversarial Attacks on Randomization
2.2.3 Randomization Methods for Data Streams
2.2.4 Multiplicative Perturbations
2.2.5 Data Swapping
2.3. Group Based Anonymization
2.3.1 The k-Anonymity Framework
2.3.2 Personalized Privacy-Preservation
2.3.3 Utility Based Privacy Preservation
2.3.4 Sequential Releases
2.3.5 The l-diversity Method
2.3.6 The t-closeness Model
2.3.7 Models for Text, Binary and String Data
2.4. Distributed Privacy-Preserving Data Mining
2.4.1 Distributed Algorithms over Horizontally Partitioned Data Sets
2.4.2 Distributed Algorithms over Vertically Partitioned Data
2.4.3 Distributed Algorithms for k-Anonymity
2.5. Privacy-Preservation of Application Results
2.5.1 Association Rule Hiding
2.5.2 Downgrading Classifier Effectiveness
2.5.3 Query Auditing and Inference Control
2.6. Limitations of Privacy: The Curse of Dimensionality
2.7. Applications of Privacy-Preserving Data Mining
2.7.1 Medical Databases: The Scrub and Datafly Systems
2.7.2 Bioterrorism Applications
2.7.3 Homeland Security Applications
2.7.4 Genomic Privacy
2.8. Summary
References

3 A Survey of Inference Control Methods for Privacy-Preserving Data Mining
Josep Domingo-Ferrer
3.1. Introduction
3.2. A Classification of Microdata Protection Methods
3.3. Perturbative Masking Methods
3.3.1 Additive Noise
3.3.2 Microaggregation
3.3.3 Data Swapping and Rank Swapping
3.3.4 Rounding
3.3.5 Resampling
3.3.6 PRAM
3.3.7 MASSC
3.4. Non-perturbative Masking Methods
3.4.1 Sampling
3.4.2 Global Recoding
3.4.3 Top and Bottom Coding
3.4.4 Local Suppression
3.5. Synthetic Microdata Generation
3.5.1 Synthetic Data by Multiple Imputation
3.5.2 Synthetic Data by Bootstrap
3.5.3 Synthetic Data by Latin Hypercube Sampling
3.5.4 Partially Synthetic Data by Cholesky Decomposition
3.5.5 Other Partially Synthetic and Hybrid Microdata Approaches
3.5.6 Pros and Cons of Synthetic Microdata
3.6. Trading off Information Loss and Disclosure Risk
3.6.1 Score Construction
3.6.2 R-U Maps
3.6.3 k-anonymity
3.7. Conclusions and Research Directions
References

4 Measures of Anonymity
Suresh Venkatasubramanian
4.1. Introduction
4.1.1 What is Privacy?
4.1.2 Data Anonymization Methods
4.1.3 A Classification of Methods
4.2. Statistical Measures of Anonymity
4.2.1 Query Restriction
4.2.2 Anonymity via Variance
4.2.3 Anonymity via Multiplicity
4.3. Probabilistic Measures of Anonymity
4.3.1 Measures Based on Random Perturbation
4.3.2 Measures Based on Generalization
4.3.3 Utility vs Privacy
4.4. Computational Measures of Anonymity
4.4.1 Anonymity via Isolation
4.5. Conclusions and New Directions
4.5.1 New Directions
References

5 k-Anonymous Data Mining: A Survey
V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati
5.1. Introduction
5.2. k-Anonymity
5.3. Algorithms for Enforcing k-Anonymity
5.4. k-Anonymity Threats from Data Mining
5.4.1 Association Rules
5.4.2 Classification Mining
5.5. k-Anonymity in Data Mining
5.6. Anonymize-and-Mine
5.7. Mine-and-Anonymize
5.7.1 Enforcing k-Anonymity on Association Rules
5.7.2 Enforcing k-Anonymity on Decision Trees
5.8. Conclusions
Acknowledgments
References

6 A Survey of Randomization Methods for Privacy-Preserving Data Mining
Charu C. Aggarwal, Philip S. Yu
6.1. Introduction
6.2. Reconstruction Methods for Randomization
6.2.1 The Bayes Reconstruction Method
6.2.2 The EM Reconstruction Method
6.2.3 Utility and Optimality of Randomization Models
6.3. Applications of Randomization
6.3.1 Privacy-Preserving Classification with Randomization
6.3.2 Privacy-Preserving OLAP
6.3.3 Collaborative Filtering
6.4. The Privacy-Information Loss Tradeoff
6.5. Vulnerabilities of the Randomization Method
6.6. Randomization of Time Series Data Streams
6.7. Multiplicative Noise for Randomization
6.7.1 Vulnerabilities of Multiplicative Randomization
6.7.2 Sketch Based Randomization
6.8. Conclusions and Summary
References

7 A Survey of Multiplicative Perturbation for Privacy-Preserving Data Mining
Keke Chen and Ling Liu
7.1. Introduction
7.1.1 Data Privacy vs. Data Utility
7.1.2 Outline
7.2. Definition of Multiplicative Perturbation
7.2.1 Notations
7.2.2 Rotation Perturbation
7.2.3 Projection Perturbation
7.2.4 Sketch-based Approach
7.2.5 Geometric Perturbation
7.3. Transformation Invariant Data Mining Models
7.3.1 Definition of Transformation Invariant Models
7.3.2 Transformation-Invariant Classification Models
7.3.3 Transformation-Invariant Clustering Models
7.4. Privacy Evaluation for Multiplicative Perturbation
7.4.1 A Conceptual Multidimensional Privacy Evaluation Model
7.4.2 Variance of Difference as Column Privacy Metric
7.4.3 Incorporating Attack Evaluation
7.4.4 Other Metrics
7.5. Attack Resilient Multiplicative Perturbations
7.5.1 Naive Estimation to Rotation Perturbation
7.5.2 ICA-Based Attacks
7.5.3 Distance-Inference Attacks
7.5.4 Attacks with More Prior Knowledge
7.5.5 Finding Attack-Resilient Perturbations
7.6. Conclusion
Acknowledgment
References

8 A Survey of Quantification of Privacy Preserving Data Mining Algorithms
Elisa Bertino, Dan Lin and Wei Jiang
8.1. Introduction
8.2. Metrics for Quantifying Privacy Level
8.2.1 Data Privacy
8.2.2 Result Privacy
8.3. Metrics for Quantifying Hiding Failure
8.4. Metrics for Quantifying Data Quality
8.4.1 Quality of the Data Resulting from the PPDM Process
8.4.2 Quality of the Data Mining Results
8.5. Complexity Metrics
8.6. How to Select a Proper Metric
8.7. Conclusion and Research Directions
References

9 A Survey of Utility-based Privacy-Preserving Data Transformation Methods
Ming Hua and Jian Pei
9.1. Introduction
9.1.1 What is Utility-based Privacy Preservation?
9.2. Types of Utility-based Privacy Preservation Methods
9.2.1 Privacy Models
9.2.2 Utility Measures
9.2.3 Summary of the Utility-Based Privacy Preserving Methods
9.3. Utility-Based Anonymization Using Local Recoding
9.3.1 Global Recoding and Local Recoding
9.3.2 Utility Measure
9.3.3 Anonymization Methods
9.3.4 Summary and Discussion
9.4. The Utility-based Privacy Preserving Methods in Classification Problems
9.4.1 The Top-Down Specialization Method
9.4.2 The Progressive Disclosure Algorithm
9.4.3 Summary and Discussion
9.5. Anonymized Marginal: Injecting Utility into Anonymized Data Sets
9.5.1 Anonymized Marginal
9.5.2 Utility Measure
9.5.3 Injecting Utility Using Anonymized Marginals
9.5.4 Summary and Discussion
9.6. Summary
Acknowledgments
References

10 Mining Association Rules under Privacy Constraints
Jayant R. Haritsa
10.1. Introduction
10.2. Problem Framework
10.2.1 Database Model
10.2.2 Mining Objective
10.2.3 Privacy Mechanisms
10.2.4 Privacy Metric
10.2.5 Accuracy Metric
10.3. Evolution of the Literature
10.4. The FRAPP Framework
10.4.1 Reconstruction Model
10.4.2 Estimation Error
10.4.3 Randomizing the Perturbation Matrix
10.4.4 Efficient Perturbation
10.4.5 Integration with Association Rule Mining
10.5. Sample Results
10.6. Closing Remarks
Acknowledgments
References

11 A Survey of Association Rule Hiding Methods for Privacy
Vassilios S. Verykios and Aris Gkoulalas-Divanis
11.1. Introduction
11.2. Terminology and Preliminaries
11.3. Taxonomy of Association Rule Hiding Algorithms
11.4. Classes of Association Rule Hiding Algorithms
11.4.1 Heuristic Approaches
11.4.2 Border-based Approaches
11.4.3 Exact Approaches
11.5. Other Hiding Approaches
11.6. Metrics and Performance Analysis
11.7. Discussion and Future Trends
11.8. Conclusions
References

12 A Survey of Statistical Approaches to Preserving Confidentiality of Contingency Table Entries
Stephen E. Fienberg and Aleksandra B. Slavkovic
12.1. Introduction
12.2. The Statistical Approach to Privacy Protection
12.3. Data Mining Algorithms, Association Rules, and Disclosure Limitation
12.4. Estimation and Disclosure Limitation for Multi-way Contingency Tables
12.5. Two Illustrative Examples
12.5.1 Example 1: Data from a Randomized Clinical Trial
12.5.2 Example 2: Data from the 1993 U.S. Current Population Survey
12.6. Conclusions
Acknowledgments
References

13 A Survey of Privacy-Preserving Methods Across Horizontally Partitioned Data
Murat Kantarcioglu
13.1. Introduction
13.2. Basic Cryptographic Techniques for Privacy-Preserving Distributed Data Mining
13.3. Common Secure Sub-protocols Used in Privacy-Preserving Distributed Data Mining
13.4. Privacy-preserving Distributed Data Mining on Horizontally Partitioned Data
13.5. Comparison to Vertically Partitioned Data Model
13.6. Extension to Malicious Parties
13.7. Limitations of the Cryptographic Techniques Used in Privacy-Preserving Distributed Data Mining
13.8. Privacy Issues Related to Data Mining Results
13.9. Conclusion
References

14 A Survey of Privacy-Preserving Methods Across Vertically Partitioned Data
Jaideep Vaidya
14.1. Introduction
14.2. Classification
14.2.1 Naïve Bayes Classification
14.2.2 Bayesian Network Structure Learning
14.2.3 Decision Tree Classification
14.3. Clustering
14.4. Association Rule Mining
14.5. Outlier Detection
14.5.1 Algorithm
14.5.2 Security Analysis
14.5.3 Computation and Communication Analysis
14.6. Challenges and Research Directions
References

15 A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods
Kun Liu, Chris Giannella, and Hillol Kargupta
15.1. Introduction
15.2. Definitions and Notation
15.3. Attacking Additive Data Perturbation
15.3.1 Eigen-Analysis and PCA Preliminaries
15.3.2 Spectral Filtering
15.3.3 SVD Filtering
15.3.4 PCA Filtering
15.3.5 MAP Estimation Attack
15.3.6 Distribution Analysis Attack
15.3.7 Summary
15.4. Attacking Matrix Multiplicative Data Perturbation
15.4.1 Known I/O Attacks
15.4.2 Known Sample Attack
15.4.3 Other Attacks Based on ICA
15.4.4 Summary
15.5. Attacking k-Anonymization
15.6. Conclusion
Acknowledgments
References

16 Private Data Analysis via Output Perturbation
Kobbi Nissim
16.1. Introduction
16.2. The Abstract Model – Statistical Databases, Queries, and Sanitizers
16.3. Privacy
16.3.1 Interpreting the Privacy Definition
16.4. The Basic Technique: Calibrating Noise to Sensitivity
16.4.1 Applications: Functions with Low Global Sensitivity
16.5. Constructing Sanitizers for Complex Functionalities
16.5.1 k-Means Clustering
16.5.2 SVD and PCA
16.5.3 Learning in the Statistical Queries Model
16.6. Beyond the Basics
16.6.1 Instance Based Noise and Smooth Sensitivity
16.6.2 The Sample-Aggregate Framework
16.6.3 A General Sanitization Mechanism
16.7. Related Work and Bibliographic Notes
Acknowledgments
References

17 A Survey of Query Auditing Techniques for Data Privacy
Shubha U. Nabar, Krishnaram Kenthapadi, Nina Mishra and Rajeev Motwani
17.1. Introduction
17.2. Auditing Aggregate Queries
17.2.1 Offline Auditing
17.2.2 Online Auditing
17.3. Auditing Select-Project-Join Queries
17.4. Challenges in Auditing
17.5. Reading
References

18 Privacy and the Dimensionality Curse
Charu C. Aggarwal
18.1. Introduction
18.2. The Dimensionality Curse and the k-anonymity Method
18.3. The Dimensionality Curse and Condensation
18.4. The Dimensionality Curse and the Randomization Method
18.4.1 Effects of Public Information
18.4.2 Effects of High Dimensionality
18.4.3 Gaussian Perturbing Distribution
18.4.4 Uniform Perturbing Distribution
18.5. The Dimensionality Curse and l-diversity
18.6. Conclusions and Research Directions
References

19 Personalized Privacy Preservation
Yufei Tao and Xiaokui Xiao
19.1. Introduction
19.2. Formalization of Personalized Anonymity
19.2.1 Personal Privacy Requirements
19.2.2 Generalization
19.3. Combinatorial Process of Privacy Attack
19.3.1 Primary Case
19.3.2 Non-primary Case
19.4. Theoretical Foundation
19.4.1 Notations and Basic Properties
19.4.2 Derivation of the Breach Probability
19.5. Generalization Algorithm
19.5.1 The Greedy Framework
19.5.2 Optimal SA-generalization
19.6. Alternative Forms of Personalized Privacy Preservation
19.6.1 Extension of k-anonymity
19.6.2 Personalization in Location Privacy Protection
19.7. Summary and Future Work
References

20 Privacy-Preserving Data Stream Classification
Yabo Xu, Ke Wang, Ada Wai-Chee Fu, Rong She, and Jian Pei
20.1. Introduction
20.1.1 Motivating Example
20.1.2 Contributions and Paper Outline
20.2. Related Works
20.3. Problem Statement
20.3.1 Secure Join Stream Classification
20.3.2 Naive Bayesian Classifiers
20.4. Our Approach
20.4.1 Initialization
20.4.2 Bottom-Up Propagation
20.4.3 Top-Down Propagation
20.4.4 Using NBC
20.4.5 Algorithm Analysis
20.5. Empirical Studies
20.5.1 Real-life Datasets
20.5.2 Synthetic Datasets
20.5.3 Discussion
20.6. Conclusions
References

Index

List of Figures

5.1 Simplified representation of a private table
5.2 An example of domain and value generalization hierarchies
5.3 Classification of k-anonymity techniques [11]
5.4 Generalization hierarchy for QI={Marital status, Sex}
5.5 Index assignment to attributes Marital status and Sex
5.6 An example of set enumeration tree over set I = {1, 2, 3} of indexes
5.7 Sub-hierarchies computed by Incognito for the table in Figure 5.1
5.8 Spatial representation (a) and possible partitioning (b)-(d) of the table in Figure 5.1
5.9 An example of decision tree
5.10 Different approaches for combining k-anonymity and data mining
5.11 An example of top-down anonymization for the private table in Figure 5.1
5.12 Frequent itemsets extracted from the table in Figure 5.1
5.13 An example of binary table
5.14 Itemsets extracted from the table in Figure 5.13(b)
5.15 Itemsets with support at least equal to 40 (a) and corresponding anonymized itemsets (b)
5.16 3-anonymous version of the tree of Figure 5.9
5.17 Suppression of occurrences in non-leaf nodes in the tree in Figure 5.9
5.18 Table inferred from the decision tree in Figure 5.17
5.19 11-anonymous version of the tree in Figure 5.17
5.20 Table inferred from the decision tree in Figure 5.19
6.1 Illustration of the Information Loss Metric
7.1 Using known points and distance relationship to infer the rotation matrix
9.1 A taxonomy tree on categorical attribute Education
9.2 A taxonomy tree on continuous attribute Age
9.3 Interactive graph
9.4 A decomposition
10.1 CENSUS (γ = 19)
10.2 Perturbation Matrix Condition Numbers (γ = 19)
13.1 Relationship between Secure Sub-protocols and Privacy Preserving Distributed Data Mining on Horizontally Partitioned Data
14.1 Two dimensional problem that cannot be decomposed into two one-dimensional problems
15.1 Wigner's semi-circle law: a histogram of the eigenvalues of $(A + A^{\top})/(2\sqrt{2p})$ for a large, randomly generated A
17.1 Skeleton of a simulatable private randomized auditor
18.1 Some Examples of Generalization for 2-Anonymity
18.2 Upper Bound of 2-anonymity Probability in a Non-Empty Grid Cell
18.3 Fraction of Data Points Preserving 2-Anonymity with Data Dimensionality (Gaussian Clusters)
18.4 Minimum Information Loss for 2-Anonymity (Gaussian Clusters)
18.5 Randomization Level with Increasing Dimensionality, Perturbation level = 8·σo (UniDis)
19.1 Microdata and generalization
19.2 The taxonomy of attribute Disease
19.3 A possible result of our generalization scheme
19.4 The voter registration list
19.5 Algorithm for computing personalized generalization
19.6 Algorithm for finding the optimal SA-generalization
19.7 Personalized k-anonymous generalization
20.1 Related streams / tables
20.2 The join stream
20.3 Example with 3 streams at initialization
20.4 After bottom-up propagations
20.5 After top-down propagations
20.6 UK road accident data (2001)
20.7 Classifier accuracy
20.8 Time per input tuple
20.9 Classifier accuracy vs. window size
20.10 Classifier accuracy vs. concept drifting interval
20.11 Time per input tuple vs. window size
20.12 Time per input tuple vs. blow-up ratio
20.13 Time per input tuple vs. number of streams

List of Tables

3.1 Perturbative methods vs data types. "X" denotes applicable and "(X)" denotes applicable with some adaptation
3.2 Example of rank swapping. Left, original file; right, rank-swapped file
3.3 Non-perturbative methods vs data types
9.1a The original table
9.2b A 2-anonymized table with better utility
9.3c A 2-anonymized table with poorer utility
9.4 Summary of utility-based privacy preserving methods
9.5a 3-anonymous table by global recoding
9.6b 3-anonymous table by local recoding
9.7a The original table
9.8b The anonymized table
9.9a The original table
9.10b The suppressed table
9.11a The original table
9.12b The anonymized table
9.13a Age Marginal
9.14b (Education, AnnualIncome) Marginal
10.1 CENSUS Dataset
10.2 Frequent Itemsets for supmin = 0.02
12.1 Results of clinical trial for the effectiveness of an analgesic drug
12.2 Second panel has LP relaxation bounds, and third panel has sharp IP bounds for cell entries in Table 12.1 given [R|CST] conditional probability values
12.3 Sharp upper and lower bounds for cell entries in Table 12.1 given the [CSR] margin, and LP relaxation bounds given [R|CS] conditional probability values
12.4 Description of variables in CPS data extract
12.5 Marginal table [ACDGH] from 8-way CPS table
12.6 Summary of difference between upper and lower bounds for small cell counts in the full 8-way CPS table under Model 1 and under Model 2
14.1 The Weather Dataset
14.2 Arbitrary partitioning of data between 2 sites
14.3 Vertical partitioning of data between 2 sites
15.1 Summarization of Attacks on Additive Perturbation
15.2 Summarization of Attacks on Matrix Multiplicative Perturbation
18.1 Notations and Definitions

Chapter 1
An Introduction to Privacy-Preserving Data Mining

Charu C. Aggarwal
IBM T. J. Watson Research Center, Hawthorne, NY 10532
charu@us.ibm.com

Philip S. Yu
University of Illinois at Chicago, Chicago, IL 60607
psyu@cs.uic.edu

Abstract: The field of privacy has seen rapid advances in recent years because of the increases in the ability to store data. In particular, recent advances in the data mining field have led to increased concerns about privacy. While the topic of privacy has been traditionally studied in the context of cryptography and information-hiding, recent emphasis on data mining has led to renewed interest in the field. In this chapter, we will introduce the topic of privacy-preserving data mining and provide an overview of the different topics covered in this book.

Keywords: Privacy-preserving data mining, privacy, randomization, k-anonymity.

1.1 Introduction

The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users, and the increasing sophistication of data mining algorithms which leverage this information. A number of techniques such as randomization and k-anonymity [1, 4, 16] have been suggested in recent years in order to perform privacy-preserving data mining.
Furthermore, the problem has been discussed in multiple communities such as the database community, the statistical disclosure control community, and the cryptography community. In some cases, the different communities have explored parallel lines of work which are quite similar. This book will try to explore different topics from the perspective of different communities, and will try to give a unified view of the work across them.

The key directions in the field of privacy-preserving data mining are as follows:

Privacy-Preserving Data Publishing: These techniques tend to study different transformation methods associated with privacy. These techniques include methods such as randomization [1], k-anonymity [16, 7], and l-diversity [11]. Another related issue is how the perturbed data can be used in conjunction with classical data mining methods such as association rule mining [15]. Other related problems include that of determining privacy-preserving methods to keep the underlying data useful (utility-based methods), and the problem of studying the different definitions of privacy, and how they compare in terms of effectiveness in different scenarios.

Changing the Results of Data Mining Applications to Preserve Privacy: In many cases, the results of data mining applications such as association rule or classification rule mining can compromise the privacy of the data. This has spawned a field of privacy in which the results of data mining algorithms such as association rule mining are modified in order to preserve the privacy of the data. A classic example of such techniques is association rule hiding, in which some of the association rules are suppressed in order to preserve privacy.

Query Auditing: Such methods are akin to the previous case of modifying the results of data mining algorithms. Here, we either modify or restrict the results of queries. Methods for perturbing the output of queries are discussed in [8], whereas techniques for restricting queries are discussed in [9, 13].

Cryptographic Methods for Distributed Privacy: In many cases, the data may be distributed across multiple sites, and the owners of the data across these different sites may wish to compute a common function. In such cases, a variety of cryptographic protocols may be used in order to communicate among the different sites, so that secure function computation is possible without revealing sensitive information. A survey of such methods may be found in [14].

Theoretical Challenges in High Dimensionality: Real data sets are usually extremely high dimensional, and this makes the process of privacy-preservation extremely difficult both from a computational and an effectiveness point of view. In [12], it has been shown that optimal k-anonymization is NP-hard. Furthermore, the technique is not even effective with increasing dimensionality, since the data can typically be combined with either public or background information to reveal the identity of the underlying record owners. A variety of methods for adversarial attacks in the high dimensional case are discussed in [5, 6].

This book will attempt to cover the different topics from the point of view of different communities in the field. This chapter will provide an overview of the different privacy-preserving algorithms covered in this book.
We will discuss the challenges associated with each kind of problem, and give an overview of the material in the corresponding chapter.

1.2 Privacy-Preserving Data Mining Algorithms

In this section, we will discuss the key privacy-preserving data mining problems and the challenges associated with each problem. We will also give an overview of the material covered in each chapter of this book. The broad topics covered in this book are as follows:

General Survey. In chapter 2, we provide a broad survey of privacy-preserving data mining methods. We provide an overview of the different techniques and how they relate to one another. The individual topics are covered in sufficient detail to provide the reader with a good reference point. The idea is to provide an overview of the field for a new reader from the perspective of the data mining community. However, more detailed discussions are deferred to later chapters, which contain descriptions of the different data mining algorithms.

Statistical Methods for Disclosure Control. The topic of privacy-preserving data mining has often been studied extensively by the data mining community without sufficient attention to the work done by the conventional statistical disclosure control community. In chapter 3, detailed methods for statistical disclosure control are presented, along with some of the relationships to the parallel work done in the database and data mining communities. This includes methods such as k-anonymity, swapping, randomization, micro-aggregation, and synthetic data generation. The idea is to give the readers an overview of the common themes in privacy-preserving data mining across different communities.

Measures of Anonymity. There are a very large number of definitions of anonymity in the privacy-preserving data mining field. This is partially because of the varying goals of different privacy-preserving data mining algorithms. For example, methods such as k-anonymity, l-diversity and t-closeness are all designed to prevent identification, though the final goal is to preserve the underlying sensitive information. Each of these methods is designed to prevent disclosure of sensitive information in a different way. Chapter 4 is a survey of different measures of anonymity. The chapter tries to define privacy from the perspective of anonymity measures and classifies such measures. The chapter also compares and contrasts different measures, and discusses their relative advantages. This chapter thus provides an overview and perspective of the different ways in which privacy could be defined, and what the relative advantages of each method might be.

The k-anonymity Method. An important method for privacy de-identification is the method of k-anonymity [16]. The motivating factor behind the k-anonymity technique is that many attributes in the data can often be considered pseudo-identifiers, which can be used in conjunction with public records in order to uniquely identify the records. For example, if the explicit identifiers are removed from the records, attributes such as the birth date and zip code can be used in order to uniquely identify the underlying records. The idea in k-anonymity is to reduce the granularity of representation of the data in such a way that a given record cannot be distinguished from at least (k − 1) other records. In chapter 5, the k-anonymity method is discussed in detail; a small illustrative sketch of the basic idea follows below.
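To make the idea concrete, here is a minimal sketch (ours, not an algorithm from chapter 5; the toy table, attribute names, and generalization rules are hypothetical choices for illustration). It generalizes the quasi-identifiers by bucketing ages and truncating zip codes, and then checks whether every record maps onto a group of at least k records:

```python
from collections import Counter

# Toy records: (age, zipcode, condition). The quasi-identifiers are
# age and zipcode; condition is the sensitive attribute.
records = [
    (23, "10532", "flu"), (27, "10538", "cold"),
    (24, "10533", "flu"), (45, "60607", "diabetes"),
    (49, "60609", "cold"), (41, "60601", "flu"),
]

def generalize(record, age_width=10, zip_prefix=3):
    """Reduce granularity: bucket age into ranges, truncate the zip code."""
    age, zipcode, _ = record
    lo = (age // age_width) * age_width
    return (f"{lo}-{lo + age_width - 1}", zipcode[:zip_prefix] + "**")

def is_k_anonymous(records, k):
    """Every generalized quasi-identifier combination must occur >= k times."""
    counts = Counter(generalize(r) for r in records)
    return all(c >= k for c in counts.values())

generalized = [generalize(r) + (r[2],) for r in records]
print(generalized)                 # each record now maps onto a group
print(is_k_anonymous(records, 3))  # True: both groups contain 3 records
```

Practical algorithms do not fix the generalization rules in advance as this sketch does; they search over candidate generalizations or partitionings to meet the k-anonymity constraint with minimum information loss.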
A number of important algorithms for k-anonymity are also discussed in chapter 5.

The Randomization Method. The randomization technique uses data distortion methods in order to create private representations of the records [1, 4]. In most cases, the individual records cannot be recovered, but only aggregate distributions can be recovered. These aggregate distributions can be used for data mining purposes. Two kinds of perturbation are possible with the randomization method:

Additive Perturbation: In this case, randomized noise is added to the data records. The overall data distributions can be recovered from the randomized records, and data mining and management algorithms are designed to work with these data distributions. A detailed discussion of these methods is provided in chapter 6.

Multiplicative Perturbation: In this case, random projection or random rotation techniques are used in order to perturb the records. A detailed discussion of these methods is provided in chapter 7.

In addition, these chapters deal with the issue of adversarial attacks and vulnerabilities of these methods.

Quantification of Privacy. A key issue in measuring the security of different privacy-preservation methods is the way in which the underlying privacy is quantified. The idea in privacy quantification is to measure the risk of disclosure for a given level of perturbation. In chapter 8, the issue of quantification of privacy is closely examined. The chapter also examines the issue of utility, and its natural tradeoff with privacy quantification. A discussion of the relative advantages of different kinds of methods is presented.

Utility Based Privacy-Preserving Data Mining. Most privacy-preserving data mining methods apply a transformation which reduces the effectiveness of the underlying data for data mining methods or algorithms. In fact, there is a natural tradeoff between privacy and accuracy, though this tradeoff is affected by the particular algorithm which is used for privacy-preservation. A key issue is to maintain maximum utility of the data without compromising the underlying privacy constraints. In chapter 9, a broad overview of the different utility based methods for privacy-preserving data mining is presented. The issue of designing utility based algorithms to work effectively with certain kinds of data mining problems is addressed.

Mining Association Rules under Privacy Constraints. Since association rule mining is one of the important problems in data mining, we have devoted a number of chapters to this problem. There are two aspects to the privacy-preserving association rule mining problem:

When the input data is perturbed, it is a challenging problem to accurately determine the association rules on the perturbed data. Chapter 10 discusses the problem of association rule mining on perturbed data; a minimal sketch of this style of input perturbation is given below, after this discussion.

A different issue is that of output association rule privacy. In this case, we try to ensure that none of the association rules in the output result in leakage of sensitive data. This problem is referred to as association rule hiding [17] by the database community, and as contingency table privacy-preservation by the statistical community. The problem of output association rule privacy is briefly discussed in chapter 10.
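As a flavor of the input-perturbation aspect, the following minimal sketch is our own illustration in the spirit of the item-randomization schemes surveyed in chapter 10; the flip probability p and the unbiased estimator are standard randomized-response algebra, not code from any cited paper. Each item's presence bit is flipped independently with probability p, and the true support of an item is then estimated from the perturbed supports:

```python
import random

def perturb(transactions, p, rng=random.Random(0)):
    """Flip each item's presence bit independently with probability p."""
    items = sorted({i for t in transactions for i in t})
    out = []
    for t in transactions:
        kept = {i for i in items if (i in t) != (rng.random() < p)}
        out.append(kept)
    return out, items

def estimate_support(perturbed, item, p, n):
    """Invert the randomization: E[s'] = s(1-p) + (1-s)p => s = (s'-p)/(1-2p)."""
    s_obs = sum(item in t for t in perturbed) / n
    return (s_obs - p) / (1 - 2 * p)

# Example: item "a" appears in 60% of 10,000 hypothetical transactions.
n = 10_000
data = [({"a"} if i < 6000 else set()) | {"b"} for i in range(n)]
pert, _ = perturb(data, p=0.1)
print(round(estimate_support(pert, "a", 0.1, n), 3))  # close to 0.6
```

Note that only aggregate supports are recovered in this way; the individual transactions that the miner sees remain randomized.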
A detailed survey of association rule hiding from the perspective of the database community is given in chapter 11, and a discussion from the perspective of the statistical community is given in chapter 12.

Cryptographic Methods for Information Sharing and Privacy. In many cases, multiple parties may wish to share aggregate private data, without leaking any sensitive information at their end [14]. For example, different superstores with sensitive sales data may wish to coordinate among themselves in knowing aggregate trends without leaking the trends of their individual stores. This requires secure and cryptographic protocols for sharing the information across the different parties. The data may be distributed in two ways across different sites:

Horizontal Partitioning: In this case, the different sites may have different sets of records containing the same attributes.

Vertical Partitioning: In this case, the different sites may have different attributes of the same sets of records.

Clearly, the challenges for the horizontal and vertical partitioning cases are quite different. In chapters 13 and 14, a variety of cryptographic protocols for horizontally and vertically partitioned data are discussed. The different kinds of cryptographic methods are introduced in chapter 13. Methods for horizontally partitioned data are discussed in chapter 13, whereas methods for vertically partitioned data are discussed in chapter 14.

Privacy Attacks. It is useful to examine the different ways in which one can make adversarial attacks on privacy-transformed data. This helps in designing more effective privacy-transformation methods. Some examples of methods which can be used in order to attack the privacy of the underlying data include SVD-based methods, spectral filtering methods, and background knowledge attacks. In chapter 15, a detailed description of different kinds of attacks on data perturbation methods is provided.

Query Auditing and Inference Control. Many private databases are open to querying. This can compromise the security of the results, when the adversary can use different kinds of queries in order to undermine the security of the data. For example, a combination of range queries can be used in order to narrow down the possible values of a target record. Therefore, the results over multiple queries can be combined in order to uniquely identify a record, or at least reduce the uncertainty in identifying it. There are two primary methods for preventing this kind of attack:

Query Output Perturbation: In this case, we add noise to the output of the query result in order to preserve privacy [8]. A detailed description of such methods is provided in chapter 16.

Query Auditing: In this case, we choose to deny a subset of the queries, so that the particular combination of queries cannot be used in order to violate the privacy [9, 13]. A detailed survey of query auditing methods is provided in chapter 17.

Privacy and the Dimensionality Curse. In recent years, it has been observed that many privacy-preservation methods such as k-anonymity and randomization are not very effective in the high dimensional case [5, 6]. In chapter 18, we have provided a detailed description of the effects of the dimensionality curse on different kinds of privacy-preserving data mining algorithms; the small simulation below illustrates the underlying difficulty.
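To see why high dimensionality hurts, consider the following small simulation (our illustration, not taken from chapter 18; the record count and the attribute cardinality are arbitrary choices). As the number of independent quasi-identifier attributes grows, the fraction of records that share their full attribute combination with at least one other record, and hence could even in principle be 2-anonymized without generalization, collapses toward zero:

```python
import random
from collections import Counter

def fraction_non_unique(n_records, n_dims, n_values, rng):
    """Fraction of records whose full quasi-identifier combination
    is shared by at least one other record."""
    rows = [tuple(rng.randrange(n_values) for _ in range(n_dims))
            for _ in range(n_records)]
    counts = Counter(rows)
    return sum(c for c in counts.values() if c >= 2) / n_records

rng = random.Random(42)
for d in (2, 4, 8, 16):
    print(d, round(fraction_non_unique(10_000, d, 4, rng), 3))
# With 10,000 records and 4 values per attribute, essentially every
# record shares its combination at d = 2 or d = 4, while at d = 16
# (4**16, roughly 4.3e9 cells) almost every record is unique.
```

Generalization can of course merge such unique records into groups, but in high dimensions the amount of generalization required destroys most of the information in the data, which is the phenomenon chapter 18 analyzes.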
It is clear from the discussion in the chapter that most privacy methods are not very effective in the high dimensional case.

Personalized Privacy Preservation. In many applications, different subjects have different requirements for privacy. For example, a brokerage customer with a very large account would likely require a much higher level of privacy protection than a customer with a small account. In such cases, it is necessary to personalize the privacy-protection algorithm. In personalized privacy-preservation, we construct anonymizations of the data such that different records have a different level of privacy. Two examples of personalized privacy-preservation methods are discussed in [3, 18]. The method in [3] uses a condensation approach for personalized anonymization, while the method in [18] uses a more conventional generalization approach for anonymization. In chapter 19, a number of algorithms for personalized anonymity are examined.

Privacy-Preservation of Data Streams. A new topic in the area of privacy-preserving data mining is that of data streams, in which data grows rapidly at an unlimited rate. In such cases, the problem of privacy-preservation is quite challenging, since the data is being released incrementally. In addition, the fast nature of data streams precludes repeated passes over the past history of the data. We note that both the topics of data streams and privacy-preserving data mining are relatively new, and there has not been much work on combining the two. Some work has been done on randomization of data streams [10], and other work deals with condensation-based anonymization of data streams [2]. Both of these methods are discussed in chapters 2 and 6, which are surveys on privacy and randomization respectively. Nevertheless, the literature on the stream topic remains sparse. Therefore, we have added chapter 20, which specifically deals with the issue of privacy-preserving classification of data streams. While this chapter is unlike other chapters in the sense that it is not a survey, we have included it in order to provide a flavor of the emerging techniques in this important area of research.

1.3 Conclusions and Summary

In this chapter, we introduced the problem of privacy-preserving data mining and discussed the broad areas of research in the field. The broad areas of privacy are as follows:

Privacy-preserving data publishing: This corresponds to sanitizing the data, so that its privacy remains preserved.

Privacy-preserving applications: This corresponds to designing data management and mining algorithms in such a way that privacy remains preserved. Some examples include association rule mining, classification, and query processing.

Utility issues: Since the perturbed data may often be used for mining and management purposes, its utility needs to be preserved. Therefore, the data mining and privacy transformation techniques need to be designed effectively, so as to preserve the utility of the results.

Distributed privacy, cryptography and adversarial collaboration: This corresponds to secure communication protocols between trusted parties, so that information can be shared effectively without revealing sensitive information about particular parties.

We also gave a broad overview of the different topics discussed in this book.
In the remaining chapters, the surveys will provide a comprehensive treatment of the topics in each category.

References

[1] Agrawal R., Srikant R.: Privacy-Preserving Data Mining. ACM SIGMOD Conference, 2000.
[2] Aggarwal C. C., Yu P. S.: A Condensation Approach to Privacy Preserving Data Mining. EDBT Conference, 2004.
[3] Aggarwal C. C., Yu P. S.: On Variable Constraints in Privacy Preserving Data Mining. SIAM Data Mining Conference, 2005.
[4] Agrawal D., Aggarwal C. C.: On the Design and Quantification of Privacy Preserving Data Mining Algorithms. ACM PODS Conference, 2002.
[5] Aggarwal C. C.: On k-Anonymity and the Curse of Dimensionality. VLDB Conference, 2005.
[6] Aggarwal C. C.: On Randomization, Public Information, and the Curse of Dimensionality. ICDE Conference, 2007.
[7] Bayardo R. J., Agrawal R.: Data Privacy through Optimal k-Anonymization. ICDE Conference, 2005.
[8] Blum A., Dwork C., McSherry F., Nissim K.: Practical Privacy: The SuLQ Framework. ACM PODS Conference, 2005.
[9] Kenthapadi K., Mishra N., Nissim K.: Simulatable Auditing. ACM PODS Conference, 2005.
[10] Li F., Sun J., Papadimitriou S., Mihaila G., Stanoi I.: Hiding in the Crowd: Privacy Preservation on Evolving Streams through Correlation Tracking. ICDE Conference, 2007.
[11] Machanavajjhala A., Gehrke J., Kifer D.: l-diversity: Privacy beyond k-anonymity. IEEE ICDE Conference, 2006.
[12] Meyerson A., Williams R.: On the Complexity of Optimal k-Anonymity. ACM PODS Conference, 2004.
[13] Nabar S., Marthi B., Kenthapadi K., Mishra N., Motwani R.: Towards Robustness in Query Auditing. VLDB Conference, 2006.
[14] Pinkas B.: Cryptographic Techniques for Privacy-Preserving Data Mining. ACM SIGKDD Explorations, 4(2), 2002.
[15] Rizvi S., Haritsa J.: Maintaining Data Privacy in Association Rule Mining. VLDB Conference, 2002.
[16] Samarati P., Sweeney L.: Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement Through Generalization and Suppression. IEEE Symp. on Security and Privacy, 1998.
[17] Verykios V. S., Elmagarmid A., Bertino E., Saygin Y., Dasseni E.: Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 2004.
[18] Xiao X., Tao Y.: Personalized Privacy Preservation. ACM SIGMOD Conference, 2006.

Chapter 2
A General Survey of Privacy-Preserving Data Mining Models and Algorithms

Charu C. Aggarwal
IBM T. J. Watson Research Center, Hawthorne, NY 10532
charu@us.ibm.com

Philip S. Yu
University of Illinois at Chicago, Chicago, IL 60607
psyu@cs.uic.edu

Abstract: In recent years, privacy-preserving data mining has been studied extensively, because of the wide proliferation of sensitive information on the internet. A number of algorithmic techniques have been designed for privacy-preserving data mining. In this paper, we provide a review of the state-of-the-art methods for privacy. We discuss methods for randomization, k-anonymization, and distributed privacy-preserving data mining. We also discuss cases in which the output of data mining applications needs to be sanitized for privacy-preservation purposes. We discuss the computational and theoretical limits associated with privacy-preservation over high dimensional data sets.

Keywords: Privacy-preserving data mining, randomization, k-anonymity.

2.1 Introduction

In recent years, data mining has been viewed as a threat to privacy because of the widespread proliferation of electronic data maintained by corporations.
This has led to increased concerns about the privacy of the underlying data. In recent years, a number of techniques have been proposed for modifying or transforming the data so as to preserve privacy. A survey of some of the techniques used for privacy-preserving data mining may be found in [123]. In this chapter, we will provide an overview of the state-of-the-art in privacy-preserving data mining.

Privacy-preserving data mining finds numerous applications in surveillance, which are naturally supposed to be "privacy-violating" applications. The key is to design methods [113] which continue to be effective, without compromising security. In [113], a number of techniques have been discussed for bio-surveillance, facial de-identification, and identity theft. More detailed discussions on some of these issues may be found in [96, 114–116].

Most methods for privacy computations use some form of transformation on the data in order to perform the privacy preservation. Typically, such methods reduce the granularity of representation in order to preserve privacy. This reduction in granularity results in some loss of effectiveness of data management or mining algorithms. This is the natural trade-off between information loss and privacy. Some examples of such techniques are as follows:

The randomization method: The randomization method is a technique for privacy-preserving data mining in which noise is added to the data in order to mask the attribute values of records [2, 5]. The noise added is sufficiently large so that individual record values cannot be recovered. Therefore, techniques are designed to derive aggregate distributions from the perturbed records. Subsequently, data mining techniques can be developed in order to work with these aggregate distributions. We will describe the randomization technique in greater detail in a later section.

The k-anonymity model and l-diversity: The k-anonymity model was developed because of the possibility of indirect identification of records from public databases. This is because combinations of record attributes can be used to exactly identify individual records. In the k-anonymity method, we reduce the granularity of data representation with the use of techniques such as generalization and suppression. This granularity is reduced sufficiently that any given record maps onto at least k other records in the data. The l-diversity model was designed to handle some weaknesses in the k-anonymity model, since protecting identities to the level of k individuals is not the same as protecting the corresponding sensitive values, especially when there is homogeneity of sensitive values within a group. To address this, the concept of intra-group diversity of sensitive values is promoted within the anonymization scheme [83].

Distributed privacy preservation: In many cases, individual entities may wish to derive aggregate results from data sets which are partitioned across these entities. Such partitioning may be horizontal (when the records are distributed across multiple entities) or vertical (when the attributes are distributed across multiple entities). While the individual entities may not desire to share their entire data sets, they may consent to limited information sharing with the use of a variety of protocols.
The overall effect of such methods is to maintain privacy for each individual entity, while deriving aggregate results over the entire data.

Downgrading application effectiveness: In many cases, even though the data may not be available, the output of applications such as association rule mining, classification or query processing may result in violations of privacy. This has led to research in downgrading the effectiveness of applications by either data or application modifications. Some examples of such techniques include association rule hiding [124], classifier downgrading [92], and query auditing [1].

In this paper, we will provide a broad overview of the different techniques for privacy-preserving data mining. We will provide a review of the major algorithms available for each method, and the variations on the different techniques. We will also discuss a number of combinations of different concepts such as k-anonymous mining over vertically- or horizontally-partitioned data. We will also discuss a number of unique challenges associated with privacy-preserving data mining in the high dimensional case.

This paper is organized as follows. In section 2, we will introduce the randomization method for privacy-preserving data mining. In section 3, we will discuss the k-anonymization method along with its different variations. In section 4, we will discuss issues in distributed privacy-preserving data mining. In section 5, we will discuss a number of techniques for privacy which arise in the context of sensitive output of a variety of data mining and data management applications. In section 6, we will discuss some unique challenges associated with privacy in the high dimensional case. A number of applications of privacy-preserving models and algorithms are discussed in section 7. Section 8 contains the conclusions and discussions.

2.2 The Randomization Method

In this section, we will discuss the randomization method for privacy-preserving data mining. The randomization method has traditionally been used in the context of distorting data by probability distribution for methods such as surveys which have an evasive answer bias because of privacy concerns [74, 129]. This technique has also been extended to the problem of privacy-preserving data mining [2].

The method of randomization can be described as follows. Consider a set of data records denoted by $X = \{x_1 \ldots x_N\}$. For record $x_i \in X$, we add a noise component which is drawn from the probability distribution $f_Y(y)$. These noise components are drawn independently, and are denoted $y_1 \ldots y_N$. Thus, the new set of distorted records is denoted by $x_1 + y_1 \ldots x_N + y_N$. We denote this new set of records by $z_1 \ldots z_N$. In general, it is assumed that the variance of the added noise is large enough, so that the original record values cannot be easily guessed from the distorted data. Thus, the original records cannot be recovered, but the distribution of the original records can be recovered. Thus, if $X$ is the random variable denoting the data distribution for the original record, $Y$ is the random variable describing the noise distribution, and $Z$ is the random variable denoting the final record, we have:

$$Z = X + Y$$
$$X = Z - Y$$

Now, we note that $N$ instantiations of the probability distribution $Z$ are known, whereas the distribution $Y$ is known publicly.
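Before turning to reconstruction, here is a minimal sketch of the data-collection side of this scheme (our illustration; the uniform data distribution and Gaussian noise are arbitrary choices for the example). Each published value is simply the true value plus an independent draw from the public noise distribution, and a histogram of the published values already gives a crude estimate of the density of $Z$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original sensitive values X (never published) and public noise model Y.
N = 100_000
x = rng.uniform(0.0, 1.0, N)           # hypothetical true records
sigma = 0.5
y = rng.normal(0.0, sigma, N)          # noise drawn i.i.d. from f_Y

z = x + y                              # the only values that are released

# The analyst sees z_1..z_N and knows f_Y. A histogram of z approximates
# the density of Z = X + Y; recovering f_X then amounts to deconvolving
# the known noise density from this estimate.
hist, edges = np.histogram(z, bins=50, density=True)
print(hist[:5])
```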
For a large enough number of values of $N$, the distribution $Z$ can be approximated closely by using a variety of methods such as kernel density estimation. By subtracting $Y$ from the approximated distribution of $Z$, it is possible to approximate the original probability distribution $X$. In practice, one can combine the process of approximation of $Z$ with subtraction of the distribution $Y$ from $Z$ by using a variety of iterative methods such as those discussed in [2, 5]. Such iterative methods typically have a higher accuracy than the sequential solution of first approximating $Z$ and then subtracting $Y$ from it. In particular, the EM method proposed in [5] shows a number of optimal properties in approximating the distribution of $X$.

We note that at the end of the process, we only have a distribution describing the behavior of $X$. Individual records are not available. Furthermore, the distributions are available only along individual dimensions. Therefore, new data mining algorithms need to be designed to work with the uni-variate distributions rather than the individual records. This can sometimes be a challenge, since many data mining algorithms are inherently dependent on statistics which can only be extracted from either the individual records or the multi-variate probability distributions associated with the records. While the approach can certainly be extended to multi-variate distributions, density estimation becomes inherently more challenging [112] with increasing dimensionalities. For even modest dimensionalities such as 7 to 10, the process of density estimation becomes increasingly inaccurate, and falls prey to the curse of dimensionality.

One key advantage of the randomization method is that it is relatively simple, and does not require knowledge of the distribution of other records in the data. This is not true of other methods such as k-anonymity which require the knowledge of other records in the data. Therefore, the randomization method can be implemented at data collection time, and does not require the use of a trusted server containing all the original records in order to perform the anonymization process. While this is a strength of the randomization method, it also leads to some weaknesses, since it treats all records equally irrespective of their local density. Therefore, outlier records are more susceptible to adversarial attacks as compared to records in more dense regions in the data [10]. In order to guard against this, one may need to be needlessly aggressive in adding noise to all the records in the data. This reduces the utility of the data for mining purposes.

The randomization method has been extended to a variety of data mining problems. In [2], it was discussed how to use the approach for classification. A number of other techniques [143, 145] have also been proposed which seem to work well over a variety of different classifiers. Techniques have also been proposed for improving the effectiveness of classifiers in a privacy-preserving way; for example, the work in [51] proposes methods for privacy-preserving boosting of classifiers. Methods for privacy-preserving mining of association rules have been proposed in [47, 107]. The problem of association rules is especially challenging because of the discrete nature of the attributes corresponding to the presence or absence of items.
In order to deal with this issue, the randomization technique needs to be modified slightly. Instead of adding quantitative noise, random items are dropped or included with a certain probability. The perturbed transactions are then used for aggregate association rule mining. This technique has been shown to be extremely effective in [47]. The randomization approach has also been extended to other applications such as OLAP [3], and SVD-based collaborative filtering [103].

2.2.1 Privacy Quantification

The quantity used to measure privacy should indicate how closely the original value of an attribute can be estimated. The work in [2] uses a measure that defines privacy as follows: If the original value can be estimated with c% confidence to lie in the interval [α_1, α_2], then the interval width (α_2 − α_1) defines the amount of privacy at the c% confidence level. For example, if the perturbing additive is uniformly distributed in an interval of width 2α, then α is the amount of privacy at confidence level 50% and 2α is the amount of privacy at confidence level 100%. However, this simple method of determining privacy can be subtly incomplete in some situations. This can be best explained by the following example.

Example 2.1 Consider an attribute X with the density function f_X(x) given by:

f_X(x) = 0.5   for 0 ≤ x ≤ 1
f_X(x) = 0.5   for 4 ≤ x ≤ 5
f_X(x) = 0     otherwise

Assume that the perturbing additive Y is distributed uniformly between [−1, 1]. Then, according to the measure proposed in [2], the amount of privacy is 2 at confidence level 100%.

However, after performing the perturbation and subsequent reconstruction, the density function f_X(x) will be approximately revealed. Let us assume for a moment that a large amount of data is available, so that the distribution function is revealed to a high degree of accuracy. Since the (distribution of the) perturbing additive is publicly known, the two pieces of information can be combined to determine that if Z ∈ [−1, 2], then X ∈ [0, 1]; whereas if Z ∈ [3, 6], then X ∈ [4, 5]. Thus, in each case, the value of X can be localized to an interval of length 1. This means that the actual amount of privacy offered by the perturbing additive Y is at most 1 at confidence level 100%. We use the qualifier 'at most' since X can often be localized to an interval of length less than one. For example, if the value of Z happens to be −0.5, then the value of X can be localized to an even smaller interval of [0, 0.5].

This example illustrates that the method suggested in [2] does not take into account the distribution of the original data. In other words, the (aggregate) reconstruction of the attribute value also provides a certain level of knowledge which can be used to guess a data value to a higher level of accuracy. To accurately quantify privacy, we need a method which takes such side-information into account.

A key privacy measure [5] is based on the differential entropy of a random variable. The differential entropy h(A) of a random variable A is defined as follows:

h(A) = −∫_{Ω_A} f_A(a) log_2 f_A(a) da    (2.1)

where Ω_A is the domain of A. It is well known that h(A) is a measure of the uncertainty inherent in the value of A [111]. It can be easily seen that for a random variable U distributed uniformly between 0 and a, h(U) = log_2(a). For a = 1, h(U) = 0. In [5], it was proposed that 2^{h(A)} is a measure of privacy inherent in the random variable A. This value is denoted by Π(A).
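As a quick numerical sanity check of this measure, the following sketch approximates h(A) on a grid and confirms that a uniform variable on [0, a] has h(U) ≈ log_2(a), so that 2^{h(U)} ≈ a. The helper name and the grid resolution are illustrative choices of our own, not constructions from [5]:

    import numpy as np

    def differential_entropy_bits(f, lo, hi, n=200_001):
        """Approximate h(A) = -integral of f(a) log2 f(a) da over [lo, hi]."""
        a = np.linspace(lo, hi, n)
        fa = f(a)
        mask = fa > 0                          # treat 0 * log(0) as 0
        da = (hi - lo) / (n - 1)
        return -np.sum(fa[mask] * np.log2(fa[mask])) * da

    for a_max in (1.0, 2.0, 8.0):
        f = lambda t: np.where((t >= 0) & (t <= a_max), 1.0 / a_max, 0.0)
        h = differential_entropy_bits(f, 0.0, a_max)
        print(a_max, round(h, 4), round(2 ** h, 4))   # h ~ log2(a_max), 2^h ~ a_max

The same numerical integration can be applied to the two-piece density of Example 2.1 to verify the entropy values derived below.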
Thus, a random variable U distributed uniformly between 0 and a has privacy Π(U) = 2^{log_2(a)} = a. For a general random variable A, Π(A) denotes the length of the interval over which a uniformly distributed random variable would have the same uncertainty as A.

Given a random variable B, the conditional differential entropy of A is defined as follows:

h(A|B) = −∫_{Ω_{A,B}} f_{A,B}(a, b) log_2 f_{A|B=b}(a) da db    (2.2)

Thus, the average conditional privacy of A given B is Π(A|B) = 2^{h(A|B)}. This motivates the following metric P(A|B) for the conditional privacy loss of A, given B:

P(A|B) = 1 − Π(A|B)/Π(A) = 1 − 2^{h(A|B)}/2^{h(A)} = 1 − 2^{−I(A;B)}

where I(A;B) = h(A) − h(A|B) = h(B) − h(B|A). I(A;B) is also known as the mutual information between the random variables A and B. Clearly, P(A|B) is the fraction of privacy of A which is lost by revealing B.

As an illustration, let us reconsider Example 2.1 given above. In this case, the differential entropy of X is given by:

h(X) = −∫_{Ω_X} f_X(x) log_2 f_X(x) dx
     = −∫_0^1 0.5 log_2(0.5) dx − ∫_4^5 0.5 log_2(0.5) dx = 1

Thus the privacy of X is Π(X) = 2^1 = 2. In other words, X has as much privacy as a random variable distributed uniformly in an interval of length 2. The density function of the perturbed value Z is given by f_Z(z) = ∫_{−∞}^{∞} f_X(ν) f_Y(z − ν) dν. Using f_Z(z), we can compute the differential entropy h(Z) of Z. It turns out that h(Z) = 9/4. Therefore, we have:

I(X;Z) = h(Z) − h(Z|X) = 9/4 − h(Y) = 9/4 − 1 = 5/4

Here, the second equality h(Z|X) = h(Y) follows from the fact that X and Y are independent and Z = X + Y. Thus, the fraction of privacy loss in this case is P(X|Z) = 1 − 2^{−5/4} = 0.5796. Therefore, after revealing Z, X has privacy Π(X|Z) = Π(X) × (1 − P(X|Z)) = 2 × (1.0 − 0.5796) = 0.8408. This value is less than 1, since X can be localized to an interval of length less than one for many values of Z.

The problem of privacy quantification has been studied quite extensively in the literature, and a variety of metrics have been proposed to quantify privacy. A number of quantification issues in the measurement of privacy breaches have been discussed in [46, 48]. In [19], the problem of privacy-preservation has been studied from the broader context of the tradeoff between privacy and information loss. We note that the quantification of privacy alone is not sufficient without quantifying the utility of the data created by the randomization process. A framework has been proposed to explore this tradeoff for a variety of different privacy transformation algorithms.

2.2.2 Adversarial Attacks on Randomization

In the earlier section on privacy quantification, we illustrated an example in which the reconstructed distribution on the data can be used in order to reduce the privacy of the underlying data record. In general, a systematic approach can be used to do this in multi-dimensional data sets with the use of spectral filtering or PCA-based techniques [54, 66]. The broad idea in techniques such as PCA [54] is that the correlation structure in the original data can be estimated fairly accurately (in larger data sets) even after noise addition. Once the broad correlation structure in the data has been determined, one can then try to remove the noise in the data in such a way that it fits the aggregate correlation structure of the data.
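The following sketch conveys the flavor of such a filtering attack under simplifying assumptions of our own (low-rank correlated data, Gaussian noise of known variance, and a crude eigenvalue cutoff); it is not the actual procedure of [54] or [66]:

    import numpy as np

    rng = np.random.default_rng(1)

    # Correlated original data: d dimensions with strong rank-2 structure.
    n, d, noise_std = 5000, 10, 1.0
    base = rng.normal(size=(n, 2))
    x = base @ rng.normal(size=(2, d))                 # "true" records
    z = x + rng.normal(scale=noise_std, size=(n, d))   # perturbed release

    # Estimate the principal subspace from the perturbed data and keep only
    # the components whose variance clearly exceeds the noise variance.
    zc = z - z.mean(axis=0)
    cov = np.cov(zc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    keep = evals > 2.0 * noise_std ** 2                # crude signal/noise cutoff
    x_hat = zc @ evecs[:, keep] @ evecs[:, keep].T + z.mean(axis=0)

    # Filtered estimates are much closer to the originals than the raw release.
    print(np.mean((x_hat - x) ** 2), np.mean((z - x) ** 2))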
It has been shown that such techniques can reduce the privacy of the perturbation process significantly, since the noise removal results in values which are fairly close to their original values [54, 66]. Some other discussions on limiting breaches of privacy in the randomization method may be found in [46].

A second kind of adversarial attack is with the use of public information. Consider a record X = (x_1 ... x_d), which is perturbed to Z = (z_1 ... z_d). Then, since the distribution of the perturbations is known, we can try to use a maximum likelihood fit of the potential perturbation of Z to a public record. Consider the public record W = (w_1 ... w_d). Then, the potential perturbation of Z with respect to W is given by (Z − W) = (z_1 − w_1 ... z_d − w_d). Each of these values (z_i − w_i) should fit the distribution f_Y(y). The corresponding negative log-likelihood is given by −∑_{i=1}^{d} log(f_Y(z_i − w_i)); the lower this value, i.e. the better the likelihood fit, the greater the probability that the record W corresponds to X. If it is known that the public data set always includes X, then the maximum likelihood fit can provide a high degree of certainty in identifying the correct record, especially in cases where d is large. We will discuss this issue in greater detail in a later section.

2.2.3 Randomization Methods for Data Streams

The randomization approach is particularly well suited to privacy-preserving data mining of streams, since the noise added to a given record is independent of the rest of the data. However, streams provide a particularly vulnerable target for adversarial attacks with the use of PCA-based techniques [54] because of the large volume of the data available for analysis. In [78], an interesting technique for randomization has been proposed which uses the auto-correlations in different time series while deciding the noise to be added to any particular value. It has been shown in [78] that such an approach is more robust, since the noise correlates with the stream behavior, and it is more difficult to create effective adversarial attacks with the use of correlation analysis techniques.

2.2.4 Multiplicative Perturbations

The most common method of randomization is that of additive perturbations. However, multiplicative perturbations can also be used to good effect for privacy-preserving data mining. Many of these techniques derive their roots in the work of [61], which shows how to use multi-dimensional projections in order to reduce the dimensionality of the data. This technique preserves the inter-record distances approximately, and therefore the transformed records can be used in conjunction with a variety of data mining applications. In particular, the approach is discussed in detail in [97, 98], in which it is shown how to use the method for privacy-preserving clustering. The technique can also be applied to the problem of classification as discussed in [28]. Multiplicative perturbations can also be used for distributed privacy-preserving data mining. Details can be found in [81]. A number of techniques for multiplicative perturbation in the context of masking census data may be found in [70]. A variation on this theme may be implemented with the use of distance-preserving Fourier transforms, which work effectively for a variety of cases [91].

As in the case of additive perturbations, multiplicative perturbations are not entirely safe from adversarial attacks.
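Before describing these attacks, a minimal sketch of the projection-style perturbation itself may help; the Gaussian projection matrix and the dimensions below are illustrative choices (a common construction in this line of work, though the cited papers differ in details):

    import numpy as np

    rng = np.random.default_rng(2)

    def project(x, k):
        """Map d-dimensional records to k dimensions via a random matrix R.
        The 1/sqrt(k) scaling makes inter-record distances approximately
        preserved in expectation."""
        d = x.shape[1]
        r = rng.normal(size=(d, k)) / np.sqrt(k)
        return x @ r

    x = rng.normal(size=(1000, 50))
    z = project(x, k=20)

    # Pairwise distances are roughly preserved, so distance-based mining
    # (e.g. clustering) can be run on z instead of x.
    i, j = 0, 1
    print(np.linalg.norm(x[i] - x[j]), np.linalg.norm(z[i] - z[j]))

With this construction in mind, the two kinds of attacks enumerated next are easier to see.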
In general, if the attacker has no prior knowledge of the data, then it is relatively difficult to attack the privacy of the transformation. However, with some prior knowledge, two kinds of attacks are possible [82]:

Known Input-Output Attack: In this case, the attacker knows some linearly independent collection of records, and their corresponding perturbed versions. In such cases, linear algebra techniques can be used to reverse-engineer the nature of the privacy-preserving transformation.

Known Sample Attack: In this case, the attacker has a collection of independent data samples from the same distribution from which the original data was drawn. In such cases, principal component analysis techniques can be used in order to reconstruct the behavior of the original data.

2.2.5 Data Swapping

We note that noise addition or multiplication is not the only technique which can be used to perturb the data. A related method is that of data swapping, in which the values across different records are swapped in order to perform the privacy-preservation [49]. One advantage of this technique is that the lower order marginal totals of the data are completely preserved and are not perturbed at all. Therefore, certain kinds of aggregate computations can be exactly performed without violating the privacy of the data. We note that this technique does not follow the general principle in randomization which allows the value of a record to be perturbed independently of the other records. Therefore, this technique can be used in combination with other frameworks such as k-anonymity, as long as the swapping process is designed to preserve the definitions of privacy for that model.

2.3 Group Based Anonymization

The randomization method is a simple technique which can be easily implemented at data collection time, because the noise added to a given record is independent of the behavior of other data records. This is also a weakness, because outlier records can often be difficult to mask. Clearly, in cases in which the privacy-preservation does not need to be performed at data-collection time, it is desirable to have a technique in which the level of inaccuracy depends upon the behavior of the locality of that given record. Another key weakness of the randomization framework is that it does not consider the possibility that publicly available records can be used to identify the owners of that record. In [10], it has been shown that the use of publicly available records can lead to privacy being heavily compromised in high-dimensional cases. This is especially true of outlier records, which can be easily distinguished from other records in their locality. Therefore, a broad approach in many privacy transformations is to construct groups of anonymous records which are transformed in a group-specific way.

2.3.1 The k-Anonymity Framework

In many applications, the data records are made available by simply removing key identifiers such as names and social-security numbers from personal records. However, other kinds of attributes (known as pseudo-identifiers) can be used in order to accurately identify the records. For example, attributes such as age, zip-code and sex are available in public records such as census rolls. When these attributes are also available in a given data set, they can be used to infer the identity of the corresponding individual.
A combination of these attributes can be very powerful, since they can be used to narrow down the possibilities to a small number of individuals.

In k-anonymity techniques [110], we reduce the granularity of representation of these pseudo-identifiers with the use of techniques such as generalization and suppression. In the method of generalization, the attribute values are generalized to a range in order to reduce the granularity of representation. For example, the date of birth could be generalized to a range such as the year of birth, so as to reduce the risk of identification. In the method of suppression, the value of the attribute is removed completely. It is clear that such methods reduce the risk of identification with the use of public records, while reducing the accuracy of applications on the transformed data.

In order to reduce the risk of identification, the k-anonymity approach requires that every tuple in the table be indistinguishably related to no fewer than k respondents. This can be formalized as follows:

Definition 2.2 Each release of the data must be such that every combination of values of quasi-identifiers can be indistinguishably matched to at least k respondents.

The first algorithm for k-anonymity was proposed in [110]. The approach uses domain generalization hierarchies of the quasi-identifiers in order to build k-anonymous tables. The concept of k-minimal generalization has been proposed in [110] in order to limit the level of generalization for maintaining as much data precision as possible for a given level of anonymity. Subsequently, the topic of k-anonymity has been widely researched. A good overview and survey of the corresponding algorithms may be found in [31].

We note that the problem of optimal anonymization is inherently a difficult one. In [89], it has been shown that the problem of optimal k-anonymization is NP-hard. Nevertheless, the problem can be solved quite effectively by the use of a number of heuristic methods. A method proposed by Bayardo and Agrawal [18] is the k-Optimize algorithm, which can often obtain effective solutions. The approach assumes an ordering among the quasi-identifier attributes. The values of the attributes are discretized into intervals (quantitative attributes) or grouped into different sets of values (categorical attributes). Each such grouping is an item. For a given attribute, the corresponding items are also ordered. An index is created using these attribute-interval pairs (or items), and a set enumeration tree is constructed on these attribute-interval pairs. This set enumeration tree is a systematic enumeration of all possible generalizations with the use of these groupings. The root of the tree is the null node, and every successive level of the tree is constructed by appending one item which is lexicographically larger than all the items at that node of the tree. We note that the number of possible nodes in the tree increases exponentially with the data dimensionality. Therefore, it is not possible to build the entire tree even for modest values of n. However, the k-Optimize algorithm can use a number of pruning strategies to good effect. In particular, a node of the tree can be pruned when it is determined that no descendant of it could be optimal. This can be done by computing a bound on the quality of all descendants of that node, and comparing it to the quality of the current best solution obtained during the traversal process.
A branch and bound technique can be used to successively improve the quality of the solution during the traversal process. Eventually, it is possible to terminate the algorithm at a maximum computational time, and use the current solution at that point, which is often quite good, but may not be optimal.

In [75], the Incognito method has been proposed for computing a k-minimal generalization with the use of bottom-up aggregation along domain generalization hierarchies. The Incognito method uses a bottom-up breadth-first search of the domain generalization hierarchy, in which it generates all the possible minimal k-anonymous tables for a given private table. First, it checks k-anonymity for each single attribute, and removes all those generalizations which do not satisfy k-anonymity. Then, it computes generalizations in pairs, again pruning those pairs which do not satisfy the k-anonymity constraints. In general, the Incognito algorithm computes (i+1)-dimensional generalization candidates from the i-dimensional generalizations, and removes all those generalizations which do not satisfy the k-anonymity constraint. This approach is continued until no further candidates can be constructed, or all possible dimensions have been exhausted. We note that the methods in [76, 75] use a more general model for k-anonymity than that in [110]. This is because the method in [110] assumes that the value generalization hierarchy is a tree, whereas that in [76, 75] assumes that it is a graph.

Two interesting methods for top-down specialization and bottom-up generalization for k-anonymity have been proposed in [50, 125]. In [50], a top-down heuristic is designed, which starts with a general solution, and then specializes some attributes of the current solution so as to increase the information, but reduce the anonymity. The reduction in anonymity is always controlled, so that k-anonymity is never violated. At the same time, each step of the specialization is controlled by a goodness metric which takes into account both the gain in information and the loss in anonymity. A complementary method to top-down specialization is that of bottom-up generalization, for which an interesting method is proposed in [125].

We note that generalization and suppression are not the only transformation techniques for implementing k-anonymity. For example, in [38] it is discussed how to use micro-aggregation, in which clusters of records are constructed. For each cluster, its representative value is the average value along each dimension in the cluster. A similar method for achieving anonymity via clustering is proposed in [15]. The work in [15] also provides constant factor approximation algorithms to design the clustering. In [8], a related method has been independently proposed for condensation-based privacy-preserving data mining. This technique generates pseudo-data from clustered groups of k records. The process of pseudo-data generation uses principal component analysis of the behavior of the records within a group. It has been shown in [8] that the approach can be effectively used for the problem of classification. We note that the use of pseudo-data provides an additional layer of protection, since it is difficult to perform adversarial attacks on synthetic data. At the same time, the aggregate behavior of the data is preserved, and this can be useful for a variety of data mining problems.
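The following is a minimal sketch of this condensation idea under simplifying assumptions of our own (groups formed by sorting along the first principal direction, and Gaussian sampling from each group's mean and covariance); the actual construction in [8] differs in its details:

    import numpy as np

    rng = np.random.default_rng(3)

    def condense(x, k):
        """Form groups of k records (contiguous in sort order along the first
        principal direction) and emit synthetic pseudo-data per group; any
        remainder records are dropped in this toy version."""
        v = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)[2][0]
        order = np.argsort(x @ v)
        synthetic = []
        for start in range(0, len(x) - k + 1, k):
            group = x[order[start:start + k]]
            mu, cov = group.mean(axis=0), np.cov(group, rowvar=False)
            synthetic.append(rng.multivariate_normal(mu, cov, size=len(group)))
        return np.vstack(synthetic)

    x = rng.normal(size=(200, 3))
    pseudo = condense(x, k=10)

Since each group contains at least k records and only synthetic records are released, the aggregate structure of the data is retained while the individual records are withheld.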
Since the problem of k-anonymization is essentially a search over a space of possible multi-dimensional solutions, standard heuristic search techniques such as genetic algorithms or simulated annealing can be effectively used. Such a technique has been proposed in [130], in which a simulated annealing algorithm is used in order to generate k-anonymous representations of the data. Another technique proposed in [59] uses genetic algorithms in order to construct k-anonymous representations of the data. Both of these techniques require high computational times, and provide no guarantees on the quality of the solutions found.

The only known techniques which provide guarantees on the quality of the solution are approximation algorithms [13, 14, 89], in which the solution found is guaranteed to be within a certain factor of the cost of the optimal solution. An approximation algorithm for k-anonymity was proposed in [89], and it provides a solution within an O(k · log k) factor of the optimal one. A number of techniques have also been proposed in [13, 14], which provide O(k)-approximations to the optimal cost k-anonymous solutions. In [100], a large improvement was proposed over these different methods. The technique in [100] proposes an O(log(k))-approximation algorithm. This is significantly better than competing algorithms. Furthermore, the work in [100] also proposes an O(β · log(k))-approximation algorithm, where the parameter β can be gracefully adjusted based on running time constraints. Thus, this approach not only provides an approximation algorithm, but also gracefully explores the tradeoff between accuracy and running time.

In many cases, associations between pseudo-identifiers and sensitive attributes can be protected by using multiple views, such that the pseudo-identifiers and sensitive attributes occur in different views of the table. Thus, only a small subset of the selected views may be made available. It may be possible to achieve k-anonymity because of the lossy nature of the join across the two views. In the event that the join is not lossy enough, it may result in a violation of k-anonymity. In [140], the problem of violation of k-anonymity using multiple views has been studied. It has been shown that the problem is NP-hard in general. It has also been shown in [140] that a polynomial time algorithm is possible if functional dependencies exist between the different views.

An interesting analysis of the safety of k-anonymization methods has been discussed in [73]. It tries to model the effectiveness of a k-anonymous representation, given that the attacker has some prior knowledge about the data, such as a sample of the original data. Clearly, the more similar the sample data is to the true data, the greater the risk. The technique in [73] uses this fact to construct a model in which it calculates the expected number of items identified. This kind of technique can be useful in situations where it is desirable to determine whether or not anonymization should be used as the technique of choice for a particular situation.

2.3.2 Personalized Privacy-Preservation

Not all individuals or entities are equally concerned about their privacy. For example, a corporation may have very different constraints on the privacy of its records as compared to an individual.
This leads to the natural problem that we may wish to treat the records in a given data set very differently for anonymization purposes. From a technical point of view, this means that the value of k for anonymization is not fixed, but may vary with the record. A condensation-based approach [9] has been proposed for privacy-preserving data mining in the presence of variable constraints on the privacy of the data records. This technique constructs groups of non-homogeneous size from the data, such that it is guaranteed that each record lies in a group whose size is at least equal to its anonymity level. Subsequently, pseudo-data is generated from each group so as to create a synthetic data set with the same aggregate distribution as the original data.

Another interesting model of personalized anonymity is discussed in [132], in which a person can specify the level of privacy for his or her sensitive values. This technique assumes that an individual can specify a node of the domain generalization hierarchy in order to decide the level of anonymity that he can work with. This approach has the advantage that it allows for more direct protection of the sensitive values of individuals than a vanilla k-anonymity method, which is susceptible to different kinds of attacks.

2.3.3 Utility Based Privacy Preservation

The process of privacy-preservation leads to loss of information for data mining purposes. This loss of information can also be considered a loss of utility for data mining purposes. Since some negative results [7] on the curse of dimensionality suggest that a lot of attributes may need to be suppressed in order to preserve anonymity, it is extremely important to do this carefully in order to preserve utility. We note that many anonymization methods [18, 50, 83, 126] use cost measures in order to measure the information loss from the anonymization process. Examples of such utility measures include generalization height [18], size of the anonymized group [83], discernibility measures of attribute values [18], and the privacy information loss ratio [126]. In addition, a number of metrics such as the classification metric [59] explicitly try to perform the privacy-preservation in such a way as to tailor the results for use with specific applications such as classification.

The problem of utility-based privacy-preserving data mining was first studied formally in [69]. The broad idea in [69] is to ameliorate the curse of dimensionality by separately publishing marginal tables containing attributes which have utility, but are also problematic for privacy-preservation purposes. The generalizations performed on the marginal tables and the original tables in fact do not need to be the same. It has been shown that this broad approach can preserve considerable utility of the data set without violating privacy.

A method for utility-based data mining using local recoding was proposed in [135]. The approach is based on the fact that different attributes have different utility from an application point of view. Most anonymization methods are global, in which a particular tuple value is mapped to the same generalized value globally. In local recoding, the data space is partitioned into a number of regions, and the mapping of the tuple to the generalized value is local to that region. Clearly, this kind of approach has greater flexibility, since it can tailor the generalization process to a particular region of the data set.
In [135], it has been shown that this method can perform quite effectively because of its local recoding strategy.

Another indirect approach to utility-based anonymization is to make the privacy-preservation algorithms more aware of the workload [77]. Typically, data recipients may request only a subset of the data, and the union of these different requested parts of the data set is referred to as the workload. Clearly, a workload in which some records are used more frequently than others tends to suggest a different anonymization than one which is based on the entire data set. In [77], an effective and efficient algorithm has been proposed for workload-aware anonymization.

Another direction for utility-based privacy-preserving data mining is to anonymize the data in such a way that it remains useful for particular kinds of data mining or database applications. In such cases, the utility measure is often affected by the underlying application at hand. For example, in [50], a method has been proposed for k-anonymization using an information-loss metric as the utility measure. Such an approach is useful for the problem of classification. In [72], a method has been proposed for anonymization, so that the accuracy of the underlying queries is preserved.

2.3.4 Sequential Releases

Privacy-preserving data mining poses unique problems for dynamic applications such as data streams, because in such cases the data is released sequentially. In other cases, different views of the table may be released sequentially. Once a data block is released, it is no longer possible to go back and increase the level of generalization. On the other hand, new releases may sharpen an attacker's view of the data and may make the overall data set more susceptible to attack. For example, when different views of the data are released sequentially, one may use a join on the two releases [127] in order to sharpen the ability to distinguish particular records in the data. A technique discussed in [127] relies on lossy joins in order to cripple an attack based on global quasi-identifiers. The intuition behind this approach is that if the join is lossy enough, it will reduce the confidence of the attacker in relating the release from previous views to the current release. Thus, the inability to link successive releases is key in preventing further discovery of the identity of records.

While the work in [127] explores the issue of sequential releases from the point of view of adding additional attributes, the work in [134] discusses the same issue when records are added to or deleted from the original data. A new generalization principle called m-invariance is proposed, which effectively limits the risk of privacy-disclosure in re-publication. Another method for handling sequential updates to the data set is discussed in [101]. The broad idea in this approach is to progressively and consistently increase the generalization granularity, so that the released data satisfies the k-anonymity requirement both with respect to the current table, as well as with respect to the previous releases.

2.3.5 The l-diversity Method

The k-anonymity technique is attractive because of the simplicity of its definition and the numerous algorithms available to perform the anonymization. Nevertheless, the technique is susceptible to many kinds of attacks, especially when background knowledge is available to the attacker.
Some kinds of such attacks are as follows:

Homogeneity Attack: In this attack, all the values for a sensitive attribute within a group of k records are the same. Therefore, even though the data is k-anonymized, the value of the sensitive attribute for that group of k records can be predicted exactly.

Background Knowledge Attack: In this attack, the adversary can use an association between one or more quasi-identifier attributes and the sensitive attribute in order to narrow down the possible values of the sensitive field further. An example given in [83] is one in which background knowledge of the low incidence of heart attacks among the Japanese could be used to narrow down the information for the sensitive field of what disease a patient might have. A detailed discussion of the effects of background knowledge on privacy may be found in [88].

Clearly, while k-anonymity is effective in preventing identification of a record, it may not always be effective in preventing inference of the sensitive values of the attributes of that record. Therefore, the technique of l-diversity was proposed, which not only maintains the minimum group size of k, but also focuses on maintaining the diversity of the sensitive attributes. The l-diversity model [83] for privacy is defined as follows:

Definition 2.3 Let a q*-block be a set of tuples such that its non-sensitive values generalize to q*. A q*-block is l-diverse if it contains l "well represented" values for the sensitive attribute S. A table is l-diverse if every q*-block in it is l-diverse.

A number of different instantiations for the l-diversity definition are discussed in [83]. We note that when there are multiple sensitive attributes, the l-diversity problem becomes especially challenging because of the curse of dimensionality. Methods have been proposed in [83] for constructing l-diverse tables from the data set, though the technique remains susceptible to the curse of dimensionality [7]. Other methods for creating l-diverse tables are discussed in [133], in which a simple and efficient method for constructing the l-diverse representation is proposed.

2.3.6 The t-closeness Model

The t-closeness model is a further enhancement of the concept of l-diversity. One characteristic of the l-diversity model is that it treats all values of a given attribute in a similar way, irrespective of its distribution in the data. This is rarely the case for real data sets, since the attribute values may be very skewed. This may make it more difficult to create feasible l-diverse representations. Often, an adversary may use background knowledge of the global distribution in order to make inferences about sensitive values in the data. Furthermore, not all values of an attribute are equally sensitive. For example, an attribute corresponding to a disease may be more sensitive when the value is positive rather than when it is negative. In [79], a t-closeness model was proposed, which uses the property that the distance between the distribution of the sensitive attribute within an anonymized group and its global distribution should not be larger than a threshold t. The Earth Mover distance metric is used in order to quantify the distance between the two distributions. Furthermore, the t-closeness approach tends to be more effective than many other privacy-preserving data mining methods for the case of numeric attributes.
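Returning to Definition 2.3, the following sketch checks the simplest "distinct values" instantiation of l-diversity, in which a q*-block is required to contain at least l distinct sensitive values; the function name and the toy table are our own illustrations, not material from [83]:

    from collections import defaultdict

    def is_l_diverse(rows, quasi_cols, sensitive_col, l):
        """rows: list of dicts. A q*-block is the set of tuples sharing the same
        (generalized) quasi-identifier values; require >= l distinct sensitive
        values in every block."""
        blocks = defaultdict(set)
        for r in rows:
            key = tuple(r[c] for c in quasi_cols)
            blocks[key].add(r[sensitive_col])
        return all(len(values) >= l for values in blocks.values())

    table = [
        {"age": "30-39", "zip": "021**", "disease": "flu"},
        {"age": "30-39", "zip": "021**", "disease": "cancer"},
        {"age": "30-39", "zip": "021**", "disease": "asthma"},
        {"age": "40-49", "zip": "022**", "disease": "flu"},
        {"age": "40-49", "zip": "022**", "disease": "flu"},
    ]
    print(is_l_diverse(table, ["age", "zip"], "disease", l=2))   # False

The check fails here because the second block contains only one distinct disease value, which is exactly the homogeneity attack described above; stronger instantiations in [83] replace the distinct-count test with, for example, an entropy condition.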
2.3.7 Models for Text, Binary and String Data

Most of the work on privacy-preserving data mining is focused on numerical or categorical data. However, specific data domains such as strings, text, or market basket data may share specific properties with some of these general data domains, but may be different enough to require their own set of techniques for privacy-preservation. Some examples are as follows:

Text and Market Basket Data: While these can be considered a special case of binary data, they are typically too high dimensional to work effectively with standard k-anonymization techniques. However, these kinds of data sets have the special property that they are extremely sparse. The sparsity property implies that only a few of the attributes are non-zero, and most of the attributes take on zero values. In [11], techniques have been proposed to construct anonymization methods which take advantage of this sparsity. In particular, sketch-based methods have been used to construct anonymized representations of the data. Variations are proposed to construct anonymizations which may be used at data collection time.

String Data: String data is considered challenging because of the variations in the lengths of strings across different records. Typically, methods for k-anonymity are attribute-specific, and therefore constructions of anonymizations for variable-length records are quite difficult. In [12], a condensation-based method has been proposed for the anonymization of string data. This technique creates clusters from the different strings, and then generates synthetic data which has the same aggregate properties as the individual clusters. Since each cluster contains at least k records, the anonymized data is guaranteed to at least satisfy the definitions of k-anonymity.

2.4 Distributed Privacy-Preserving Data Mining

The key goal in most distributed methods for privacy-preserving data mining is to allow the computation of useful aggregate statistics over the entire data set without compromising the privacy of the individual data sets within the different participants. Thus, the participants may wish to collaborate in obtaining aggregate results, but may not fully trust each other in terms of the distribution of their own data sets. For this purpose, the data sets may either be horizontally partitioned or vertically partitioned. In horizontally partitioned data sets, the individual records are spread out across multiple entities, each of which has the same set of attributes. In vertical partitioning, the individual entities may have different attributes (or views) of the same set of records. Both kinds of partitioning pose different challenges to the problem of distributed privacy-preserving data mining.

The problem of distributed privacy-preserving data mining overlaps closely with a field in cryptography for determining secure multi-party computations. A broad overview of the intersection between the fields of cryptography and privacy-preserving data mining may be found in [102]. The broad approach of cryptographic methods is to compute functions over inputs provided by multiple recipients without actually sharing the inputs with one another. For example, in a 2-party setting, Alice and Bob may have two inputs x and y respectively, and may wish to both compute the function f(x, y) without revealing x or y to each other.
This problem can also be generalized across k parties by designing the k-argument function h(x_1 ... x_k). Many data mining algorithms may be viewed in the context of repetitive computations of many such primitive functions such as the scalar dot product, secure sum, etc. In order to compute the function f(x, y) or h(x_1 ... x_k), a protocol will have to be designed for exchanging information in such a way that the function is computed without compromising privacy. We note that the robustness of the protocol depends upon the level of trust one is willing to place on the two participants Alice and Bob. This is because the protocol may be subjected to various kinds of adversarial behavior:

Semi-honest Adversaries: In this case, the participants Alice and Bob are curious and attempt to learn from the information received by them during the protocol, but do not deviate from the protocol themselves. In many situations, this may be considered a realistic model of adversarial behavior.

Malicious Adversaries: In this case, Alice and Bob may deviate from the protocol, and may send sophisticated inputs to one another in order to learn from the information received from each other.

A key building block for many kinds of secure function evaluations is the 1 out of 2 oblivious-transfer protocol. This protocol was proposed in [45, 105] and involves two parties: a sender, and a receiver. The sender's input is a pair (x_0, x_1), and the receiver's input is a bit value σ ∈ {0, 1}. At the end of the process, the receiver learns x_σ only, and the sender learns nothing. A number of simple solutions can be designed for this task. In one solution [45, 53], the receiver generates two random public keys, K_0 and K_1, but the receiver knows only the decryption key for K_σ. The receiver sends these keys to the sender, who encrypts x_0 with K_0 and x_1 with K_1, and sends the encrypted data back to the receiver. At this point, the receiver can only decrypt x_σ, since this is the only input for which they have the decryption key. We note that this is a semi-honest solution, since the intermediate steps require an assumption of trust. For example, it is assumed that when the receiver sends two keys to the sender, they indeed know the decryption key to only one of them. In order to deal with the case of malicious adversaries, one must ensure that the sender chooses the public keys according to the protocol. An efficient method for doing so is described in [94]. In [94], generalizations of the 1 out of 2 oblivious transfer protocol to the 1 out of N case and the k out of N case are described.

Since the oblivious transfer protocol is used as a building block for secure multi-party computation, it may be repeated many times over a given function evaluation. Therefore, the computational effectiveness of the approach is important. Efficient methods for both semi-honest and malicious adversaries are discussed in [94]. More complex problems in this domain include the computation of probabilistic functions over a number of multi-party inputs [137]. Such powerful techniques can be used in order to abstract out the primitives from a number of computationally intensive data mining problems. Many of the above techniques have been described for the 2-party case, though generic solutions also exist for the multiparty case. Some important solutions for the multiparty case may be found in [25].
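To give a flavor of such primitives, here is a toy sketch of the secure-sum idea based on additive secret sharing modulo a large M, in the honest-but-curious setting. This is a textbook construction rather than the specific protocol of any reference above, and the helper names are our own:

    import random

    M = 2 ** 61 - 1   # modulus chosen larger than any possible total

    def share(value, k):
        """Split one private integer into k additive shares modulo M;
        any k - 1 shares are individually uniform and reveal nothing."""
        shares = [random.randrange(M) for _ in range(k - 1)]
        shares.append((value - sum(shares)) % M)
        return shares

    def secure_sum(private_values):
        k = len(private_values)
        all_shares = [share(v, k) for v in private_values]
        # Conceptually, party j receives the j-th share from every
        # participant and publishes only the partial sum of those shares.
        partials = [sum(all_shares[i][j] for i in range(k)) % M for j in range(k)]
        return sum(partials) % M

    print(secure_sum([10, 20, 30]))   # 60

Each party sees only shares that are individually uniform, so no single party's input is revealed, yet the final sum is exact as long as it is smaller than M.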
The oblivious transfer protocol can be used in order to compute several data mining primitives related to vector distances in multi-dimensional space. A classic problem which is often used as a primitive for many other problems is that of computing the scalar dot-product in a distributed environment [58]. A fairly general set of methods in this direction is described in [39]. Many of these techniques work by sending changed or encrypted versions of the inputs to one another in order to compute the function with the different alternative versions, followed by an oblivious transfer protocol to retrieve the correct value of the final output. A systematic framework is described in [39] to transform normal data mining problems into secure multi-party computation problems. The problems discussed in [39] include those of clustering, classification, association rule mining, data summarization, and generalization. A second set of methods for distributed privacy-preserving data mining is discussed in [32], in which the secure multi-party computation of a number of important data mining primitives is discussed. These methods include the secure sum, the secure set union, the secure size of set intersection, and the scalar product. These techniques can be used as data mining primitives for secure multi-party computation over a variety of horizontally and vertically partitioned data sets. Next, we will discuss algorithms for secure multi-party computation over horizontally partitioned data sets.

2.4.1 Distributed Algorithms over Horizontally Partitioned Data Sets

In horizontally partitioned data sets, different sites contain different sets of records with the same (or highly overlapping) set of attributes which are used for mining purposes. Many of these techniques use specialized versions of the general methods discussed in [32, 39] for various problems. The work in [80] discusses the construction of a popular decision tree induction method called ID3 with the use of approximations of the best splitting attributes. Subsequently, a variety of classifiers have been generalized to the problem of horizontally-partitioned privacy-preserving mining, including the Naive Bayes Classifier [65], and the SVM Classifier with nonlinear kernels [141]. An extreme solution for the horizontally partitioned case is discussed in [139], in which privacy-preserving classification is performed in a fully distributed setting, where each customer has private access to only their own record. A host of other data mining applications have been generalized to the problem of horizontally partitioned data sets. These include the applications of association rule mining [64], clustering [57, 62, 63] and collaborative filtering [104]. Methods for cooperative statistical analysis using secure multi-party computation methods are discussed in [40, 41].

A related problem is that of information retrieval and document indexing in a network of content providers. This problem arises in the context of multiple providers which may need to cooperate with one another in sharing their content, but may essentially be business competitors. In [17], it has been discussed how an adversary may use the output of search engines and content providers in order to reconstruct the documents. Therefore, the level of trust required grows with the number of content providers.
A solution to this problem [17] constructs a centralized privacy-preserving index in conjunction with a distributed access control mechanism. The privacy-preserving index maintains strong privacy guarantees even in the face of colluding adversaries, and even if the entire index is made public.

2.4.2 Distributed Algorithms over Vertically Partitioned Data

For the vertically partitioned case, many primitive operations such as computing the scalar product or the secure set size intersection can be useful in computing the results of data mining algorithms. For example, the methods in [58] discuss how to use the scalar dot product computation for frequent itemset counting. The process of counting can also be achieved by using the secure size of set intersection as described in [32]. Another method for association rule mining discussed in [119] uses the secure scalar product over the vertical bit representation of itemset inclusion in transactions, in order to compute the frequency of the corresponding itemsets. This key step is applied repeatedly within the framework of a roll-up procedure of itemset counting. It has been shown in [119] that this approach is quite effective in practice. The approach of vertically partitioned mining has been extended to a variety of data mining applications such as decision trees [122], SVM classification [142], the Naive Bayes Classifier [121], and k-means clustering [120]. A number of theoretical results on the ability to learn different kinds of functions in vertically partitioned databases with the use of cryptographic approaches are discussed in [42].

2.4.3 Distributed Algorithms for k-Anonymity

In many cases, it is important to maintain k-anonymity across different distributed parties. In [60], a k-anonymous protocol for data which is vertically partitioned across two parties is described. The broad idea is for the two parties to agree on the quasi-identifier to generalize to the same value before release. A similar approach is discussed in [128], in which the two parties agree on how the generalization is to be performed before release.

In [144], an approach has been discussed for the case of horizontally partitioned data. The work in [144] discusses an extreme case in which each site is a customer which owns exactly one tuple from the data. It is assumed that the data record has both sensitive attributes and quasi-identifier attributes. The solution uses encryption on the sensitive attributes. The sensitive values can be decrypted only if there are at least k records with the same values on the quasi-identifiers. Thus, k-anonymity is maintained.

The issue of k-anonymity is also important in the context of hiding identification within distributed location-based services [20, 52]. In this case, k-anonymity of the user-identity is maintained even when the location information is released. Such location information is often released when a user may send a message at any point from a given location. A similar issue arises in the context of communication protocols, in which the anonymity of senders (or receivers) may need to be protected. A message is said to be sender k-anonymous if it is guaranteed that an attacker can at most narrow down the identity of the sender to k individuals. Similarly, a message is said to be receiver k-anonymous if it is guaranteed that an attacker can at most narrow down the identity of the receiver to k individuals.
A number of such techniques have been discussed in [56, 135, 138].

2.5 Privacy-Preservation of Application Results

In many cases, the output of applications can be used by an adversary in order to make significant inferences about the behavior of the underlying data. In this section, we will discuss a number of miscellaneous methods for privacy-preserving data mining which tend to preserve the privacy of the end results of applications such as association rule mining and query processing. This problem is related to that of disclosure control [1] in statistical databases, though advances in data mining methods provide increasingly sophisticated methods for adversaries to make inferences about the behavior of the underlying data. In cases where commercial data needs to be shared, the association rules may represent sensitive information for target-marketing purposes, which needs to be protected from inference.

In this section, we will discuss the issue of disclosure control for a number of applications such as association rule mining, classification, and query processing. The key goal here is to prevent adversaries from making inferences from the end results of data mining and management applications. A broad discussion of the security and privacy implications of data mining is presented in [33]. We will discuss each of the applications below:

2.5.1 Association Rule Hiding

Recent years have seen tremendous advances in the ability to perform association rule mining effectively. Such rules often encode important target marketing information about a business. Some of the earliest work on the challenges of association rule mining for database security may be found in [16]. Two broad approaches are used for association rule hiding:

Distortion: In distortion [99], the entry for a given transaction is modified to a different value. Since we are typically dealing with binary transactional data sets, the entry value is flipped.

Blocking: In blocking [108], the entry is not modified, but is left incomplete. Thus, unknown entry values are used to prevent discovery of association rules.

We note that both the distortion and blocking processes have a number of side effects on the non-sensitive rules in the data. Some of the non-sensitive rules may be lost along with the sensitive rules, and new ghost rules may be created because of the distortion or blocking process. Such side effects are undesirable, since they reduce the utility of the data for mining purposes.

A formal proof of the NP-hardness of the distortion method for hiding association rule mining may be found in [16]. In [16], techniques are proposed for changing some of the 1-values to 0-values so that the support of the corresponding sensitive rules is appropriately lowered. The utility of the approach was defined by the number of non-sensitive rules whose support was also lowered by using such an approach. This approach was extended in [34], in which both the support and confidence of the appropriate rules could be lowered. In this case, 0-values in the transactional database could also change to 1-values. In many cases, this resulted in spurious association rules (or ghost rules), which was an undesirable side effect of the process. A complete description of the various methods for data distortion for association rule hiding may be found in [124].
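A minimal sketch of the distortion idea (flipping 1-values to 0 in supporting transactions until a sensitive itemset falls below the mining threshold) follows; the function and the greedy victim selection are illustrative simplifications of our own, not the algorithms of [16] or [34]:

    def hide_itemset(transactions, sensitive, min_support_count):
        """transactions: list of sets of items, modified in place.
        Remove one sensitive item from enough supporting transactions so
        that the sensitive itemset no longer reaches the support threshold."""
        supporting = [t for t in transactions if sensitive <= t]
        excess = len(supporting) - (min_support_count - 1)
        for t in supporting[:max(excess, 0)]:
            t.discard(next(iter(sensitive)))   # flip one 1-value to 0
        return transactions

    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"b", "c"}]
    hide_itemset(db, sensitive={"a", "b"}, min_support_count=2)
    # {"a", "b"} now occurs in at most one transaction, so rules built on
    # it fall below the support threshold and are no longer mined.

A practical algorithm would also choose which transactions and items to modify so as to minimize the number of non-sensitive rules lost, which is exactly the NP-hard optimization problem discussed above.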
Another interesting piece of work, which balances the privacy and disclosure concerns of sanitized rules, may be found in [99].

The broad idea of blocking was proposed in [23]. The attractiveness of the blocking approach is that it maintains the truthfulness of the underlying data, since it replaces a value with an unknown (often represented by '?') rather than a false value. Some interesting algorithms for using blocking for association rule hiding are presented in [109]. The work has been further extended in [108] with a discussion of the effectiveness of reconstructing the hidden rules. Another interesting set of techniques for association rule hiding with limited side effects is discussed in [131]. The objective of this method is to reduce the loss of non-sensitive rules, or the creation of ghost rules, during the rule hiding process.

In [6], it has been discussed how blocking techniques for hiding association rules can be used to prevent discovery of sensitive entries in the data set by an adversary. In this case, certain entries in the data are classified as sensitive, and only rules which disclose such entries are hidden. An efficient depth-first association mining algorithm is proposed for this task [6]. It has been shown that the methods can effectively reduce the disclosure of sensitive entries with the use of such a hiding process.

2.5.2 Downgrading Classifier Effectiveness

An important privacy-sensitive application is that of classification, in which the results of a classification application may be sensitive information for the owner of a data set. Therefore, the issue is to modify the data in such a way that the accuracy of the classification process is reduced, while retaining the utility of the data for other kinds of applications. A number of techniques have been discussed in [24, 92] for reducing classifier effectiveness in the context of classification rule and decision tree applications. The notion of parsimonious downgrading is proposed in [24] in the context of blocking out inference channels for classification purposes, while minimizing the effect on the overall utility. A system called Rational Downgrader [92] was designed with the use of these principles.

The methods for association rule hiding can also be generalized to rule-based classifiers. This is because rule-based classifiers often use association rule mining methods as subroutines, so that the rules with the class labels in their consequent are used for classification purposes. For a classifier downgrading approach, such rules are sensitive rules, whereas all other rules (with non-class attributes in the consequent) are non-sensitive rules. An example of a method for rule-based classifier downgrading is discussed in [95], in which it has been shown how to effectively hide classification rules for a data set.

2.5.3 Query Auditing and Inference Control

Many sensitive databases are not available for public access, but may have a public interface through which aggregate querying is allowed. This leads to the natural danger that a smart adversary may pose a sequence of queries through which he or she may infer sensitive facts about the data. The nature of this inference may correspond to full disclosure, in which an adversary may determine the exact values of the data attributes.
A second notion is that of partial disclosure, in which the adversary may be able to narrow down the values to a range, but may not be able to guess the exact value. Most work on query auditing generally concentrates on the full disclosure setting.

Two broad approaches are designed in order to reduce the likelihood of sensitive data discovery:

Query Auditing: In query auditing, we deny one or more queries from a sequence of queries. The queries to be denied are chosen such that the sensitivity of the underlying data is preserved. Some examples of query auditing methods include [37, 68, 93, 106].

Query Inference Control: In this case, we perturb the underlying data or the query result itself. The perturbation is engineered in such a way as to preserve the privacy of the underlying data. Examples of methods which use perturbation of the underlying data include [3, 26, 90]. Examples of methods which perturb the query result include [22, 36, 42–44].

An overview of classical methods for query auditing may be found in [1]. The query auditing problem has an online version, in which we do not know the sequence of queries in advance, and an offline version, in which we do know this sequence in advance. Clearly, the offline version is open to better optimization from an auditing point of view.

The problem of query auditing was first studied in [37, 106]. This approach works for the online version of the query auditing problem. In these works, the sum query is studied, and privacy is protected by using restrictions on the sizes and pairwise overlaps of the allowable queries. Let us assume that the query size is restricted to be at most k, and the number of common elements in pairwise query sets is at most m. Then, if q is the number of elements that the attacker already knows from background knowledge, it was shown in [37, 106] that the maximum number of queries allowed is (2·k − (q + 1))/m. We note that if N is the total number of data elements, the above expression is always bounded above by 2·N. If for some constant c we choose k = N/c and m = 1, the approach can only support a constant number of queries, after which all queries would have to be denied by the auditor. Clearly, this is undesirable from an application point of view. Therefore, a considerable amount of research has been devoted to increasing the number of queries which can be answered by the auditor without compromising privacy.

In [67], the problem of sum auditing on sub-cubes of the data cube is studied, where a query expression is constructed using a string of 0, 1, and *. The elements to be summed up are determined by using matches to the query string pattern. In [71], the problem of auditing a database of boolean values is studied for the case of sum and max queries. In [21], an approach for query auditing is discussed which is actually a combination of the approach of denying some queries and modifying queries in order to achieve privacy.

In [68], the authors show that denials to queries depending upon the answer to the current query can leak information. The authors introduce the notion of simulatable auditing for auditing sum and max queries. In [93], the authors devise methods for auditing max queries and bags of max and min queries under the partial and full disclosure settings.
In [67], the problem of sum auditing on sub-cubes of the data cube is studied, where a query expression is constructed using a string of 0, 1, and *. The elements to be summed up are determined by using matches to the query string pattern. In [71], the problem of auditing a database of boolean values is studied for the case of sum and max queries. In [21], an approach for query auditing is discussed which is actually a combination of the approach of denying some queries and modifying queries in order to achieve privacy.

In [68], the authors show that denials to queries which depend upon the answer to the current query can leak information. The authors introduce the notion of simulatable auditing for auditing sum and max queries. In [93], the authors devise methods for auditing max queries and bags of max and min queries under the partial and full disclosure settings. The authors also examine the notion of utility in the context of auditing, and obtain results for sum queries in the full disclosure setting.

A number of techniques have also been proposed for the offline version of the auditing problem. In [29], a number of variations of the offline auditing problem have been studied. In the offline auditing problem, we are given a sequence of queries which have been truthfully answered, and we need to determine if privacy has been breached. In [29], effective algorithms were proposed for the sum, max, and max-and-min versions of the problem. On the other hand, the combined sum-and-max version of the problem was shown to be NP-hard. In [4], an offline auditing framework was proposed for determining whether a database adheres to its disclosure properties. The key idea is to create an audit expression which specifies sensitive table entries.

A number of techniques have also been proposed for sanitizing or randomizing the data for query auditing purposes. These are fairly general models of privacy, since they preserve the privacy of the data even when the entire database is available. The standard methods for perturbation [2, 5] or k-anonymity [110] can always be used, and it is always guaranteed that an adversary may not derive anything more from the queries than they can from the base data. Thus, since a k-anonymity model guarantees a certain level of privacy even when the entire database is made available, it will continue to do so under any sequence of queries. In [26], a number of interesting methods are discussed for measuring the effectiveness of sanitization schemes in terms of balancing privacy and utility.

Instead of sanitizing the base data, it is possible to use summary constructs on the data, and respond to queries using only the information encoded in the summary constructs. Such an approach preserves privacy, as long as the summary constructs do not reveal sensitive information about the underlying records. A histogram based approach to data sanitization has been discussed in [26, 27]. In this technique the data is recursively partitioned into multi-dimensional cells. The final output is the exact description of the cuts along with the population of each cell. Clearly, this kind of description can be used for approximate query answering with the use of standard histogram query processing methods. In [55], a method has been proposed for privacy-preserving indexing of multi-dimensional data by using bucketization of the underlying attribute values in conjunction with encryption of identification keys. We note that a choice of larger bucket sizes provides greater privacy but less accuracy. Similarly, optimizing the bucket sizes for accuracy can lead to reductions in privacy. This tradeoff has been studied in [55], and it has been shown that reasonable query precision can be maintained at the expense of partial disclosure.

In the class of methods which use summarization structures for inference control, an interesting method was proposed by Mishra and Sandler in [90], which uses pseudo-random sketches for privacy-preservation. In this technique sketches are constructed from the data, and the sketch representations are used to respond to user queries. In [90], it has been shown that the scheme preserves privacy effectively, while continuing to be useful from a utility point of view.
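Returning to the histogram-based approach of [26, 27], the recursive partitioning can be sketched as follows (a simplification of ours, not the published algorithm: the midpoint split rule, the min_count stopping threshold, and the fixed initial bounds are illustrative assumptions). Only the cuts and cell populations are released, never the records themselves:

```python
import numpy as np

def partition_histogram(points, bounds, min_count=10, depth=0, max_depth=8):
    """Recursively split d-dimensional points at the midpoint of the widest
    dimension; a cell is released as (bounds, population) once it holds
    fewer than 2 * min_count points or the depth limit is reached."""
    if len(points) < 2 * min_count or depth >= max_depth:
        return [(bounds, len(points))]
    widths = [hi - lo for lo, hi in bounds]
    dim = int(np.argmax(widths))
    lo, hi = bounds[dim]
    mid = (lo + hi) / 2.0
    left = points[points[:, dim] <= mid]
    right = points[points[:, dim] > mid]
    lb = list(bounds); lb[dim] = (lo, mid)
    rb = list(bounds); rb[dim] = (mid, hi)
    return (partition_histogram(left, lb, min_count, depth + 1, max_depth) +
            partition_histogram(right, rb, min_count, depth + 1, max_depth))

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
cells = partition_histogram(data, [(-4.0, 4.0), (-4.0, 4.0)])
print(len(cells), cells[0])
```

Queries are then answered approximately from the cell counts alone, exactly as with standard histogram query processing.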
Finally, an important class of query inference control methods changes the results of queries in order to preserve privacy. A classical method for aggregate queries such as the sum or relative frequency is that of random sampling [35]. In this technique, a random sample of the data is used to compute such aggregate functions. The random sampling approach makes it impossible for the questioner to precisely control the formation of query sets. The advantage of using a random sample is that the results of large queries are quite robust (in terms of relative error), but the privacy of individual records is preserved because of high absolute error.

Another method for query inference control is to add noise to the results of queries. Clearly, the noise should be sufficient that an adversary cannot use small changes in the query arguments in order to infer facts about the base data. In [44], an interesting technique has been presented in which the result of a query is perturbed by an amount which depends upon the underlying sensitivity of the query function. This sensitivity of the query function is defined approximately by the change in the response to the query caused by changing one argument to the function. An important theoretical result [22, 36, 42, 43] shows that a surprisingly small amount of noise needs to be added to the result of a query, provided that the number of queries is sublinear in the number of database rows. With the increasing sizes of databases today, this result provides fairly strong guarantees on privacy. Such queries together with their slightly noisy responses are referred to as the SuLQ primitive.
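The following sketch (ours) conveys the flavor of such sensitivity-calibrated output perturbation; the bounded-value sum query, the epsilon parameter, and the Laplace noise distribution are the standard choices in this line of work [44], but the code is an illustration rather than the SuLQ implementation:

```python
import math
import random

def noisy_sum(values, value_bound, epsilon, seed=None):
    """Answer a sum query over values in [0, value_bound], adding Laplace
    noise with scale value_bound / epsilon; value_bound is the sensitivity,
    i.e. the most that any single record can change the true answer."""
    rnd = random.Random(seed)
    b = value_bound / epsilon
    # Sample Laplace(0, b) via inverse CDF: -b * sign(u) * ln(1 - 2|u|)
    u = rnd.random() - 0.5
    noise = -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return sum(values) + noise

print(noisy_sum([1, 0, 1, 1, 0, 1], value_bound=1, epsilon=0.5, seed=7))
```

The key point is that the noise scale depends only on the query's sensitivity, not on the size of the database, which is why large aggregate answers remain accurate in relative terms while individual contributions are masked.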
2.6 Limitations of Privacy: The Curse of Dimensionality

Many privacy-preserving data-mining methods are inherently limited by the curse of dimensionality in the presence of public information. For example, the technique in [7] analyzes the k-anonymity method in the presence of increasing dimensionality. The curse of dimensionality becomes especially important when adversaries may have considerable background information, as a result of which the boundary between pseudo-identifiers and sensitive attributes may become blurred. This is generally true, since adversaries may be familiar with the subject of interest and may have greater information about them than what is publicly available. This is also the motivation for techniques such as l-diversity [83], in which background knowledge can be used to make further privacy attacks.

The work in [7] concludes that, in order to maintain privacy, a large number of the attributes may need to be suppressed. Thus, the data loses its utility for the purpose of data mining algorithms. The broad intuition behind the result in [7] is that when attributes are generalized into wide ranges, the combination of a large number of generalized attributes is so sparsely populated that even 2-anonymity becomes increasingly unlikely. While the method of l-diversity has not been formally analyzed, some observations made in [83] seem to suggest that the method becomes increasingly infeasible to implement effectively with increasing dimensionality.

The method of randomization has also been analyzed in [10]. This paper makes a first analysis of the ability to re-identify data records with the use of maximum likelihood estimates. Consider a d-dimensional record $X = (x_1 \ldots x_d)$, which is perturbed to $Z = (z_1 \ldots z_d)$. For a given public record $W = (w_1 \ldots w_d)$, we would like to find the probability that it could have been perturbed to Z using the perturbing distribution $f_Y(y)$. If this were true, then the set of values given by $(Z - W) = (z_1 - w_1 \ldots z_d - w_d)$ should all be drawn from the distribution $f_Y(y)$. The corresponding log-likelihood fit is given by $-\sum_{i=1}^{d} \log(f_Y(z_i - w_i))$. The higher the log-likelihood fit, the greater the probability that the record W corresponds to X. In order to achieve greater anonymity, we would like the perturbations to be large enough that some of the spurious records in the data have a greater log-likelihood fit to Z than the true record X. It has been shown in [10] that this probability reduces rapidly with increasing dimensionality for different kinds of perturbing distributions. Thus, the randomization technique also seems to be susceptible to the curse of high dimensionality.

We note that the problem of high dimensionality seems to be a fundamental one for privacy preservation, and it is unlikely that more effective methods can be found in order to preserve privacy when background information about a large number of features is available to even a subset of selected individuals. Indirect examples of such violations occur with the use of trail identifications [84, 85], where information from multiple sources can be compiled to create a high dimensional feature representation which violates privacy.

2.7 Applications of Privacy-Preserving Data Mining

The problem of privacy-preserving data mining has numerous applications in homeland security, medical database mining, and customer transaction analysis. Some of these applications, such as those involving bio-terrorism and medical database mining, may intersect in scope. In this section, we will discuss a number of different applications of privacy-preserving data mining methods.

2.7.1 Medical Databases: The Scrub and Datafly Systems

The Scrub system [118] was designed for de-identification of clinical notes and letters, which typically occur in the form of textual data. Clinical notes and letters are typically in the form of text which contains references to patients, family members, addresses, phone numbers or providers. Traditional techniques simply use a global search and replace procedure in order to provide privacy. However, clinical notes often contain cryptic references in the form of abbreviations which may only be understood either by other providers or by members of the same institution. Therefore, traditional methods can identify no more than 30–60% of the identifying information in the data [118]. The Scrub system uses numerous detection algorithms which compete in parallel to determine when a block of text corresponds to a name, an address or a phone number. The Scrub system uses local knowledge sources which compete with one another based on the certainty of their findings. It has been shown in [118] that such a system is able to remove more than 99% of the identifying information from the data.

The Datafly system [117] was one of the earliest practical applications of privacy-preserving transformations. This system was designed to prevent identification of the subjects of medical records which may be stored in multi-dimensional format. The multi-dimensional information may include directly identifying information such as the social security number, or indirectly identifying information such as age, sex or zip-code.
The system was designed in response to the concern that the process of removing only directly identifying attributes such as social security numbers was not sufficient to guarantee privacy. While the work has a motive similar to that of the k-anonymity approach of preventing record identification, it does not formally use a k-anonymity model in order to prevent identification through linkage attacks. The approach works by setting a minimum bin size for each field. The anonymity level is defined in Datafly with respect to this bin size. The values in the records are thus generalized to the ambiguity level of a bin size, as opposed to exact values. Directly identifying attributes such as the social security number, name, or zip-code are removed from the data. Furthermore, outlier values are suppressed from the data in order to prevent identification. Typically, the user of Datafly will set the anonymity level depending upon the profile of the data recipient in question. The overall anonymity level is defined between 0 and 1, and determines the minimum bin size for each field. An anonymity level of 0 results in Datafly providing the original data, whereas an anonymity level of 1 results in the maximum level of generalization of the underlying data. Thus, these two values represent the extremes of trust and distrust, and are set depending upon the recipient of the data. When the records are released to the public, it is desirable to set a higher level of anonymity in order to ensure the maximum amount of protection. The generalizations in the Datafly system are typically done independently at the individual attribute level, since the bins are defined independently for different attributes. The Datafly system is one of the earliest systems for anonymization, and is quite simple in its approach to anonymization. A lot of work in the anonymity field has been done since the creation of the Datafly system, and there is considerable scope for enhancement of the Datafly system with the use of these models.

2.7.2 Bioterrorism Applications

In typical bioterrorism applications, we would like to analyze medical data for privacy-preserving data mining purposes. Often a biological agent such as anthrax produces symptoms which are similar to those of other common respiratory diseases such as the cough, cold and the flu. In the absence of prior knowledge of such an attack, health care providers may diagnose a patient affected by an anthrax attack as having symptoms of one of the more common respiratory diseases. The key is to quickly distinguish a true anthrax attack from a normal outbreak of a common respiratory disease. In many cases, an unusual number of such cases in a given locality may indicate a bio-terrorism attack. Therefore, in order to identify such attacks, it is necessary to track incidences of these common diseases as well, and the corresponding data would need to be reported to public health agencies. However, the common respiratory diseases are not reportable diseases by law. The solution proposed in [114] is that of "selective revelation", which initially allows only limited access to the data. However, in the event of suspicious activity, it allows a "drill-down" into the underlying data. This provides more identifiable information, in accordance with public health law.

2.7.3 Homeland Security Applications

A number of applications for homeland security are inherently intrusive because of the very nature of surveillance.
In [113], a broad overview is provided of how privacy-preserving techniques may be used in order to deploy these applications effectively without violating user privacy. Some examples of such applications are as follows:

Credential Validation Problem: In this problem, we are trying to match the subject of the credential to the person presenting the credential. For example, the theft of social security numbers presents a serious threat to homeland security. In the credential validation approach [113], an attempt is made to exploit the semantics associated with the social security number to determine whether the person presenting the SSN credential truly owns it.

Identity Theft: A related technology [115] uses a more active approach to avoid identity theft. The identity angel system [115] crawls through cyberspace and determines people who are at risk from identity theft. This information can be used to notify appropriate parties. We note that both of the above approaches to prevention of identity theft are relatively non-invasive and therefore do not violate privacy.

Web Camera Surveillance: One possible method for surveillance is the use of publicly available webcams [113, 116], which can be used to detect unusual activity. We note that this is a much more invasive approach than the previously discussed techniques, because person-specific information is captured in the webcams. The approach can be made more privacy-sensitive by extracting only facial count information from the images and using it to detect unusual activity. It has been hypothesized in [116] that unusual activity can be detected in terms of facial count alone, rather than using more specific information about particular individuals. In effect, this kind of approach uses a domain-specific downgrading of the information available in the webcams in order to make the approach privacy-sensitive.

Video Surveillance: In the context of sharing video-surveillance data, a major threat is the use of facial recognition software, which can match the facial images in videos to the facial images in a driver license database. While a straightforward solution is to completely black out each face, the result is of limited use, since all facial information has been wiped out. A more balanced approach [96] is to use selective downgrading of the facial information, so that it significantly limits the ability of facial recognition software to reliably identify faces, while maintaining facial details in images. The algorithm is referred to as k-Same, and the key is to identify faces which are somewhat similar, and then construct new faces which combine features from these similar faces. Thus, the identity of the underlying individual is anonymized to a certain extent, but the video continues to remain useful. Thus, this approach has the flavor of a k-anonymity approach, except that it creates new synthesized data for the application at hand.

The Watch List Problem: The motivation behind this problem [113] is that the government typically has a list of known terrorists or suspected entities which it wishes to track within the population. The aim is to view transactional data such as store purchases, hospital admissions, airplane manifests, hotel registrations or school attendance records in order to identify or track these entities.
This is a difficult problem, because the transactional data is private, and the privacy of subjects who do not appear in the watch list needs to be protected. Therefore, the transactional behavior of non-suspicious subjects may not be identified or revealed. Furthermore, the problem is even more difficult if we assume that the watch list cannot be revealed to the data holders. The second assumption is a result of the fact that members on the watch list may only be suspected entities, and should have some level of protection from identification as suspected terrorists to the general public. The watch list problem is currently an open problem [113].

2.7.4 Genomic Privacy

Recent years have seen tremendous advances in the science of DNA sequencing and forensic analysis with the use of DNA. As a result, the databases of collected DNA are growing very fast in both the medical and law enforcement communities. DNA data is considered extremely sensitive, since it contains almost uniquely identifying information about an individual.

As in the case of multi-dimensional data, simple removal of directly identifying data such as the social security number is not sufficient to prevent re-identification. In [86], it has been shown that software called CleanGene can determine the identifiability of DNA entries independent of any other demographic or otherwise identifiable information. The software relies on publicly available medical data and knowledge of particular diseases in order to assign identifications to DNA entries. It was shown in [86] that 98–100% of the individuals are identifiable using this approach. The identification is done by taking the DNA sequence of an individual and then constructing a genetic profile corresponding to the sex, genetic diseases, the location where the DNA was collected, and so on. This genetic profile has been shown in [86] to be quite effective in narrowing the identity of the individual down to a much smaller group. One way to protect the anonymity of such sequences is with the use of generalization lattices [87], which are constructed in such a way that an entry in the modified database cannot be distinguished from at least (k − 1) other entries. Another approach, discussed in [11], constructs synthetic data which preserves the aggregate characteristics of the original data, while preserving the privacy of the original records.

Another method for compromising the privacy of genomic data is that of trail re-identification, in which the uniqueness of patient visit patterns [84, 85] is exploited in order to make identifications. The premise of this work is that patients often visit and leave behind genomic data at various distributed locations and hospitals. The hospitals usually separate out the clinical data from the genomic data and make the genomic data available for research purposes. While the data is seemingly anonymous, the visit location pattern of the patients is encoded in the site from which the data is released. It has been shown in [84, 85] that this information may be combined with publicly available data in order to perform unique re-identifications. Some broad ideas for protecting privacy in such scenarios are discussed in [85].

2.8 Summary

In this chapter, we presented a survey of the broad areas of privacy-preserving data mining and the underlying algorithms. We discussed a variety of data modification techniques such as randomization and k-anonymity based techniques.
We discussed methods for distributed privacy-preserving mining, and the methods for handling horizontally and vertically partitioned data. We discussed the issue of downgrading the effectiveness of data mining and data management applications such as association rule mining, classification, and query processing. We discussed some fundamental limitations of the problem of privacy-preservation in the presence of increased amounts of public information and background knowledge. Finally, we discussed a number of diverse application domains for which privacy-preserving data mining methods are useful.

References

[1] Adam N., Wortmann J. C.: Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys, 21(4), 1989.
[2] Agrawal R., Srikant R.: Privacy-Preserving Data Mining. Proceedings of the ACM SIGMOD Conference, 2000.
[3] Agrawal R., Srikant R., Thomas D.: Privacy-Preserving OLAP. Proceedings of the ACM SIGMOD Conference, 2005.
[4] Agrawal R., Bayardo R., Faloutsos C., Kiernan J., Rantzau R., Srikant R.: Auditing Compliance with a Hippocratic Database. VLDB Conference, 2004.
[5] Agrawal D., Aggarwal C. C.: On the Design and Quantification of Privacy-Preserving Data Mining Algorithms. ACM PODS Conference, 2002.
[6] Aggarwal C., Pei J., Zhang B.: A Framework for Privacy Preservation against Adversarial Data Mining. ACM KDD Conference, 2006.
[7] Aggarwal C. C.: On k-anonymity and the Curse of Dimensionality. VLDB Conference, 2005.
[8] Aggarwal C. C., Yu P. S.: A Condensation Approach to Privacy Preserving Data Mining. EDBT Conference, 2004.
[9] Aggarwal C. C., Yu P. S.: On Variable Constraints in Privacy-Preserving Data Mining. SIAM Conference, 2005.
[10] Aggarwal C. C.: On Randomization, Public Information and the Curse of Dimensionality. ICDE Conference, 2007.
[11] Aggarwal C. C., Yu P. S.: On Privacy-Preservation of Text and Sparse Binary Data with Sketches. SIAM Conference on Data Mining, 2007.
[12] Aggarwal C. C., Yu P. S.: On Anonymization of String Data. SIAM Conference on Data Mining, 2007.
[13] Aggarwal G., Feder T., Kenthapadi K., Motwani R., Panigrahy R., Thomas D., Zhu A.: Anonymizing Tables. ICDT Conference, 2005.
[14] Aggarwal G., Feder T., Kenthapadi K., Motwani R., Panigrahy R., Thomas D., Zhu A.: Approximation Algorithms for k-anonymity. Journal of Privacy Technology, paper 20051120001, 2005.
[15] Aggarwal G., Feder T., Kenthapadi K., Khuller S., Motwani R., Panigrahy R., Thomas D., Zhu A.: Achieving Anonymity via Clustering. ACM PODS Conference, 2006.
[16] Atallah M., Elmagarmid A., Ibrahim M., Bertino E., Verykios V.: Disclosure Limitation of Sensitive Rules. Workshop on Knowledge and Data Engineering Exchange, 1999.
[17] Bawa M., Bayardo R. J., Agrawal R.: Privacy-Preserving Indexing of Documents on the Network. VLDB Conference, 2003.
[18] Bayardo R. J., Agrawal R.: Data Privacy through Optimal k-Anonymization. Proceedings of the ICDE Conference, pp. 217–228, 2005.
[19] Bertino E., Fovino I., Provenza L.: A Framework for Evaluating Privacy-Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 11(2), 2005.
[20] Bettini C., Wang X. S., Jajodia S.: Protecting Privacy against Location Based Personal Identification. Proc. of Secure Data Management Workshop, Trondheim, Norway, 2005.
[21] Biskup J., Bonatti P.: Controlled Query Evaluation for Known Policies by Combining Lying and Refusal. Annals of Mathematics and Artificial Intelligence, 40(1-2), 2004.
[22] Blum A., Dwork C., McSherry F., Nissim K.: Practical Privacy: The SuLQ Framework. ACM PODS Conference, 2005.
[23] Chang L., Moskowitz I.: An Integrated Framework for Database Inference and Privacy Protection. Data and Applications Security, Kluwer, 2000.
[24] Chang L., Moskowitz I.: Parsimonious Downgrading and Decision Trees Applied to the Inference Problem. New Security Paradigms Workshop, 1998.
[25] Chaum D., Crepeau C., Damgard I.: Multiparty Unconditionally Secure Protocols. ACM STOC Conference, 1988.
[26] Chawla S., Dwork C., McSherry F., Smith A., Wee H.: Toward Privacy in Public Databases. TCC, 2005.
[27] Chawla S., Dwork C., McSherry F., Talwar K.: On the Utility of Privacy-Preserving Histograms. UAI, 2005.
[28] Chen K., Liu L.: Privacy-Preserving Data Classification with Rotation Perturbation. ICDM Conference, 2005.
[29] Chin F.: Security Problems on Inference Control for SUM, MAX, and MIN Queries. Journal of the ACM, 33(3), 1986.
[30] Chin F., Ozsoyoglu G.: Auditing for Secure Statistical Databases. Proceedings of the ACM '81 Conference, 1981.
[31] Ciriani V., De Capitani di Vimercati S., Foresti S., Samarati P.: k-Anonymity. Security in Decentralized Data Management, ed. Jajodia S., Yu T., Springer, 2006.
[32] Clifton C., Kantarcioglu M., Lin X., Zhu M.: Tools for Privacy-Preserving Distributed Data Mining. ACM SIGKDD Explorations, 4(2), 2002.
[33] Clifton C., Marks D.: Security and Privacy Implications of Data Mining. Workshop on Data Mining and Knowledge Discovery, 1996.
[34] Dasseni E., Verykios V., Elmagarmid A., Bertino E.: Hiding Association Rules using Confidence and Support. 4th Information Hiding Workshop, 2001.
[35] Denning D.: Secure Statistical Databases with Random Sample Queries. ACM TODS Journal, 5(3), 1980.
[36] Dinur I., Nissim K.: Revealing Information while Preserving Privacy. ACM PODS Conference, 2003.
[37] Dobkin D., Jones A., Lipton R.: Secure Databases: Protection against User Influence. ACM Transactions on Database Systems, 4(1), 1979.
[38] Domingo-Ferrer J., Mateo-Sanz J.: Practical Data-Oriented Microaggregation for Statistical Disclosure Control. IEEE TKDE, 14(1), 2002.
[39] Du W., Atallah M.: Secure Multi-party Computation: A Review and Open Problems. CERIAS Tech. Report 2001-51, Purdue University, 2001.
[40] Du W., Han Y. S., Chen S.: Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification. Proc. SIAM Conf. Data Mining, 2004.
[41] Du W., Atallah M.: Privacy-Preserving Cooperative Statistical Analysis. 17th Annual Computer Security Applications Conference, 2001.
[42] Dwork C., Nissim K.: Privacy-Preserving Data Mining on Vertically Partitioned Databases. CRYPTO, 2004.
[43] Dwork C., Kenthapadi K., McSherry F., Mironov I., Naor M.: Our Data, Ourselves: Privacy via Distributed Noise Generation. EUROCRYPT, 2006.
[44] Dwork C., McSherry F., Nissim K., Smith A.: Calibrating Noise to Sensitivity in Private Data Analysis. TCC, 2006.
[45] Even S., Goldreich O., Lempel A.: A Randomized Protocol for Signing Contracts. Communications of the ACM, vol. 28, 1985.
[46] Evfimievski A., Gehrke J., Srikant R.: Limiting Privacy Breaches in Privacy Preserving Data Mining. ACM PODS Conference, 2003.
[47] Evfimievski A., Srikant R., Agrawal R., Gehrke J.: Privacy-Preserving Mining of Association Rules. ACM KDD Conference, 2002.
[48] Evfimievski A.: Randomization in Privacy-Preserving Data Mining. ACM SIGKDD Explorations, 4, 2003.
[49] Fienberg S., McIntyre J.: Data Swapping: Variations on a Theme by Dalenius and Reiss. Technical Report, National Institute of Statistical Sciences, 2003.
[50] Fung B., Wang K., Yu P.: Top-Down Specialization for Information and Privacy Preservation. ICDE Conference, 2005.
[51] Gambs S., Kegl B., Aimeur E.: Privacy-Preserving Boosting. Knowledge Discovery and Data Mining Journal, to appear.
[52] Gedik B., Liu L.: A Customizable k-anonymity Model for Protecting Location Privacy. ICDCS Conference, 2005.
[53] Goldreich O.: Secure Multi-Party Computation. Unpublished Manuscript, 2002.
[54] Huang Z., Du W., Chen B.: Deriving Private Information from Randomized Data. pp. 37–48, ACM SIGMOD Conference, 2005.
[55] Hore B., Mehrotra S., Tsudik B.: A Privacy-Preserving Index for Range Queries. VLDB Conference, 2004.
[56] Hughes D., Shmatikov V.: Information Hiding, Anonymity, and Privacy: A Modular Approach. Journal of Computer Security, 12(1), 3–36, 2004.
[57] Inan A., Saygin Y., Savas E., Hintoglu A., Levi A.: Privacy-Preserving Clustering on Horizontally Partitioned Data. Data Engineering Workshops, 2006.
[58] Ioannidis I., Grama A., Atallah M.: A Secure Protocol for Computing Dot Products in Clustered and Distributed Environments. International Conference on Parallel Processing, 2002.
[59] Iyengar V. S.: Transforming Data to Satisfy Privacy Constraints. KDD Conference, 2002.
[60] Jiang W., Clifton C.: Privacy-preserving Distributed k-Anonymity. Proceedings of the IFIP 11.3 Working Conference on Data and Applications Security, 2005.
[61] Johnson W., Lindenstrauss J.: Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Math., vol. 26, pp. 189–206, 1984.
[62] Jagannathan G., Wright R.: Privacy-Preserving Distributed k-means Clustering over Arbitrarily Partitioned Data. ACM KDD Conference, 2005.
[63] Jagannathan G., Pillaipakkamnatt K., Wright R.: A New Privacy-Preserving Distributed k-Clustering Algorithm. SIAM Conference on Data Mining, 2006.
[64] Kantarcioglu M., Clifton C.: Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. IEEE TKDE Journal, 16(9), 2004.
[65] Kantarcioglu M., Vaidya J.: Privacy-Preserving Naive Bayes Classifier for Horizontally Partitioned Data. IEEE Workshop on Privacy-Preserving Data Mining, 2003.
[66] Kargupta H., Datta S., Wang Q., Sivakumar K.: On the Privacy Preserving Properties of Random Data Perturbation Techniques. ICDM Conference, pp. 99–106, 2003.
[67] Kam J., Ullman J.: A Model of Statistical Databases and their Security. ACM Transactions on Database Systems, 2(1):1–10, 1977.
[68] Kenthapadi K., Mishra N., Nissim K.: Simulatable Auditing. ACM PODS Conference, 2005.
[69] Kifer D., Gehrke J.: Injecting Utility into Anonymized Datasets. SIGMOD Conference, pp. 217–228, 2006.
[70] Kim J., Winkler W.: Multiplicative Noise for Masking Continuous Data. Technical Report Statistics 2003-01, Statistical Research Division, US Bureau of the Census, Washington D.C., Apr. 2003.
[71] Kleinberg J., Papadimitriou C., Raghavan P.: Auditing Boolean Attributes. Journal of Computer and System Sciences, 6, 2003.
[72] Koudas N., Srivastava D., Yu T., Zhang Q.: Aggregate Query Answering on Anonymized Tables. ICDE Conference, 2007.
[73] Lakshmanan L., Ng R., Ramesh G.: To Do or Not To Do: The Dilemma of Disclosing Anonymized Data. ACM SIGMOD Conference, 2005.
[74] Liew C. K., Choi U. J., Liew C. J.: A Data Distortion by Probability Distribution. ACM TODS, 10(3):395–411, 1985.
[75] LeFevre K., DeWitt D., Ramakrishnan R.: Incognito: Efficient Full-Domain K-Anonymity. ACM SIGMOD Conference, 2005.
[76] LeFevre K., DeWitt D., Ramakrishnan R.: Mondrian Multidimensional K-Anonymity. ICDE Conference, 2006.
[77] LeFevre K., DeWitt D., Ramakrishnan R.: Workload Aware Anonymization. KDD Conference, 2006.
[78] Li F., Sun J., Papadimitriou S., Mihaila G., Stanoi I.: Hiding in the Crowd: Privacy Preservation on Evolving Streams through Correlation Tracking. ICDE Conference, 2007.
[79] Li N., Li T., Venkatasubramanian S.: t-Closeness: Privacy beyond k-anonymity and l-diversity. ICDE Conference, 2007.
[80] Lindell Y., Pinkas B.: Privacy-Preserving Data Mining. CRYPTO, 2000.
[81] Liu K., Kargupta H., Ryan J.: Random Projection Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining. IEEE Transactions on Knowledge and Data Engineering, 18(1), 2006.
[82] Liu K., Giannella C., Kargupta H.: An Attacker's View of Distance Preserving Maps for Privacy-Preserving Data Mining. PKDD Conference, 2006.
[83] Machanavajjhala A., Gehrke J., Kifer D., Venkitasubramaniam M.: l-Diversity: Privacy Beyond k-Anonymity. ICDE, 2006.
[84] Malin B., Sweeney L.: Re-identification of DNA through an Automated Linkage Process. Journal of the American Medical Informatics Association, pp. 423–427, 2001.
[85] Malin B.: Why Methods for Genomic Data Privacy Fail and What We Can Do to Fix It. AAAS Annual Meeting, Seattle, WA, 2004.
[86] Malin B., Sweeney L.: Determining the Identifiability of DNA Database Entries. Journal of the American Medical Informatics Association, pp. 537–541, November 2000.
[87] Malin B.: Protecting DNA Sequence Anonymity with Generalization Lattices. Methods of Information in Medicine, 44(5): 687–692, 2005.
[88] Martin D., Kifer D., Machanavajjhala A., Gehrke J., Halpern J.: Worst-Case Background Knowledge. ICDE Conference, 2007.
[89] Meyerson A., Williams R.: On the Complexity of Optimal k-anonymity. ACM PODS Conference, 2004.
[90] Mishra N., Sandler M.: Privacy via Pseudorandom Sketches. ACM PODS Conference, 2006.
[91] Mukherjee S., Chen Z., Gangopadhyay S.: A Privacy-Preserving Technique for Euclidean Distance-based Mining Algorithms using Fourier-based Transforms. VLDB Journal, 2006.
[92] Moskowitz I., Chang L.: A Decision Theoretic System for Information Downgrading. Joint Conference on Information Sciences, 2000.
[93] Nabar S., Marthi B., Kenthapadi K., Mishra N., Motwani R.: Towards Robustness in Query Auditing. VLDB Conference, 2006.
[94] Naor M., Pinkas B.: Efficient Oblivious Transfer Protocols. SODA Conference, 2001.
[95] Natwichai J., Li X., Orlowska M.: A Reconstruction-based Algorithm for Classification Rules Hiding. Australasian Database Conference, 2006.
[96] Newton E., Sweeney L., Malin B.: Preserving Privacy by De-identifying Facial Images. IEEE Transactions on Knowledge and Data Engineering, February 2005.
[97] Oliveira S. R. M., Zaiane O.: Privacy Preserving Clustering by Data Transformation. Proc. 18th Brazilian Symp. Databases, pp. 304–318, Oct. 2003.
[98] Oliveira S. R. M., Zaiane O.: Data Perturbation by Rotation for Privacy-Preserving Clustering. Technical Report TR04-17, Department of Computing Science, University of Alberta, Edmonton, AB, Canada, August 2004.
[99] Oliveira S. R. M., Zaiane O., Saygin Y.: Secure Association-Rule Sharing. PAKDD Conference, 2004.
[100] Park H., Shim K.: Approximate Algorithms for K-anonymity. ACM SIGMOD Conference, 2007.
[101] Pei J., Xu J., Wang Z., Wang W., Wang K.: Maintaining k-Anonymity against Incremental Updates. Symposium on Scientific and Statistical Database Management, 2007.
[102] Pinkas B.: Cryptographic Techniques for Privacy-Preserving Data Mining. ACM SIGKDD Explorations, 4(2), 2002.
[103] Polat H., Du W.: SVD-based Collaborative Filtering with Privacy. ACM SAC Symposium, 2005.
[104] Polat H., Du W.: Privacy-Preserving Top-N Recommendations on Horizontally Partitioned Data. Web Intelligence, 2005.
[105] Rabin M. O.: How to Exchange Secrets by Oblivious Transfer. Technical Report TR-81, Aiken Computation Laboratory, 1981.
[106] Reiss S.: Security in Databases: A Combinatorial Study. Journal of the ACM, 26(1), 1979.
[107] Rizvi S., Haritsa J.: Maintaining Data Privacy in Association Rule Mining. VLDB Conference, 2002.
[108] Saygin Y., Verykios V., Clifton C.: Using Unknowns to Prevent Discovery of Association Rules. ACM SIGMOD Record, 30(4), 2001.
[109] Saygin Y., Verykios V., Elmagarmid A.: Privacy-Preserving Association Rule Mining. 12th International Workshop on Research Issues in Data Engineering, 2002.
[110] Samarati P.: Protecting Respondents' Identities in Microdata Release. IEEE Trans. Knowl. Data Eng., 13(6): 1010–1027, 2001.
[111] Shannon C. E.: The Mathematical Theory of Communication. University of Illinois Press, 1949.
[112] Silverman B. W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
[113] Sweeney L.: Privacy Technologies for Homeland Security. Testimony before the Privacy and Integrity Advisory Committee of the Department of Homeland Security, Boston, MA, June 15, 2005.
[114] Sweeney L.: Privacy-Preserving Bio-terrorism Surveillance. AAAI Spring Symposium, AI Technologies for Homeland Security, 2005.
[115] Sweeney L.: AI Technologies to Defeat Identity Theft Vulnerabilities. AAAI Spring Symposium, AI Technologies for Homeland Security, 2005.
[116] Sweeney L., Gross R.: Mining Images in Publicly-Available Cameras for Homeland Security. AAAI Spring Symposium, AI Technologies for Homeland Security, 2005.
[117] Sweeney L.: Guaranteeing Anonymity while Sharing Data, the Datafly System. Journal of the American Medical Informatics Association, 1997.
[118] Sweeney L.: Replacing Personally Identifiable Information in Medical Records, the Scrub System. Journal of the American Medical Informatics Association, 1996.
[119] Vaidya J., Clifton C.: Privacy-Preserving Association Rule Mining in Vertically Partitioned Databases. ACM KDD Conference, 2002.
[120] Vaidya J., Clifton C.: Privacy-Preserving k-means Clustering over Vertically Partitioned Data. ACM KDD Conference, 2003.
[121] Vaidya J., Clifton C.: Privacy-Preserving Naive Bayes Classifier over Vertically Partitioned Data. SIAM Conference, 2004.
[122] Vaidya J., Clifton C.: Privacy-Preserving Decision Trees over Vertically Partitioned Data. Lecture Notes in Computer Science, Vol. 3654, 2005.
[123] Verykios V. S., Bertino E., Fovino I. N., Provenza L. P., Saygin Y., Theodoridis Y.: State-of-the-art in Privacy Preserving Data Mining. ACM SIGMOD Record, 33(1), 2004.
[124] Verykios V. S., Elmagarmid A., Bertino E., Saygin Y., Dasseni E.: Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 2004.
[125] Wang K., Yu P., Chakraborty S.: Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. ICDM Conference, 2004.
[126] Wang K., Fung B. C. M., Yu P.: Template-based Privacy Preservation in Classification Problems. ICDM Conference, 2005.
[127] Wang K., Fung B. C. M.: Anonymization for Sequential Releases. ACM KDD Conference, 2006.
[128] Wang K., Fung B. C. M., Dong G.: Integrating Private Databases for Data Analysis. Lecture Notes in Computer Science, 3495, 2005.
[129] Warner S. L.: Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias. Journal of the American Statistical Association, 60(309):63–69, March 1965.
[130] Winkler W.: Using Simulated Annealing for k-anonymity. Technical Report 7, US Census Bureau.
[131] Wu Y.-H., Chiang C.-M., Chen A. L. P.: Hiding Sensitive Association Rules with Limited Side Effects. IEEE Transactions on Knowledge and Data Engineering, 19(1), 2007.
[132] Xiao X., Tao Y.: Personalized Privacy Preservation. ACM SIGMOD Conference, 2006.
[133] Xiao X., Tao Y.: Anatomy: Simple and Effective Privacy Preservation. VLDB Conference, pp. 139–150, 2006.
[134] Xiao X., Tao Y.: m-Invariance: Towards Privacy-preserving Re-publication of Dynamic Data Sets. SIGMOD Conference, 2007.
[135] Xu J., Wang W., Pei J., Wang X., Shi B., Fu A. W. C.: Utility-Based Anonymization using Local Recoding. ACM KDD Conference, 2006.
[136] Xu S., Yung M.: k-Anonymous Secret Handshakes with Reusable Credentials. ACM Conference on Computer and Communications Security, 2004.
[137] Yao A. C.: How to Generate and Exchange Secrets. FOCS Conference, 1986.
[138] Yao G., Feng D.: A New k-anonymous Message Transmission Protocol. International Workshop on Information Security Applications, 2004.
[139] Yang Z., Zhong S., Wright R.: Privacy-Preserving Classification of Customer Data without Loss of Accuracy. SDM Conference, 2006.
[140] Yao C., Wang S., Jajodia S.: Checking for k-Anonymity Violation by Views. ACM Conference on Computer and Communications Security, 2004.
[141] Yu H., Jiang X., Vaidya J.: Privacy-Preserving SVM using Nonlinear Kernels on Horizontally Partitioned Data. SAC Conference, 2006.
[142] Yu H., Vaidya J., Jiang X.: Privacy-Preserving SVM Classification on Vertically Partitioned Data. PAKDD Conference, 2006.
[143] Zhang P., Tong Y., Tang S., Yang D.: Privacy-Preserving Naive Bayes Classifier. Lecture Notes in Computer Science, Vol. 3584, 2005.
[144] Zhong S., Yang Z., Wright R.: Privacy-enhancing k-anonymization of Customer Data. Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Baltimore, MD, 2005.
[145] Zhu Y., Liu L.: Optimal Randomization for Privacy-Preserving Data Mining. ACM KDD Conference, 2004.

Chapter 3

A Survey of Inference Control Methods for Privacy-Preserving Data Mining

Josep Domingo-Ferrer∗
Rovira i Virgili University of Tarragona†
UNESCO Chair in Data Privacy
Dept. of Computer Engineering and Mathematics
Av. Països Catalans 26, E-43007 Tarragona, Catalonia
josep.domingo@urv.cat

Abstract: Inference control in databases, also known as Statistical Disclosure Control (SDC), is about protecting data so they can be published without revealing confidential information that can be linked to specific individuals among those to which the data correspond. This is an important application in several areas, such as official statistics, health statistics, e-commerce (sharing of consumer data), etc.
Since data protection ultimately means data modification, the challenge for SDC is to achieve protection with minimum loss of the accuracy sought by database users. In this chapter, we survey the current state of the art in SDC methods for protecting individual data (microdata). We discuss several information loss and disclosure risk measures and analyze several ways of combining them to assess the performance of the various methods. Last but not least, topics which need more research in the area are identified and possible directions hinted at.

Keywords: Privacy, inference control, statistical disclosure control, statistical disclosure limitation, statistical databases, microdata.

∗This work received partial support from the Spanish Ministry of Science and Education through project SEG2004-04352-C04-01 "PROPRIETAS", the Government of Catalonia under grant 2005 SGR 00446 and Eurostat through the CENEX SDC project. The author is solely responsible for the views expressed in this chapter, which do not necessarily reflect the position of UNESCO nor commit that organization.
†Part of this chapter was written while the author was a Visiting Fellow at Princeton University.

3.1 Introduction

Inference control in statistical databases, also known as Statistical Disclosure Control (SDC) or Statistical Disclosure Limitation (SDL), seeks to protect statistical data in such a way that they can be publicly released and mined without giving away private information that can be linked to specific individuals or entities. The areas of application of SDC techniques include, but are not limited to, the following:

Official statistics. Most countries have legislation which compels national statistical agencies to guarantee statistical confidentiality when they release data collected from citizens or companies. This justifies the research on SDC undertaken by several countries, among them the European Union (e.g. the CASC project [8]) and the United States.

Health information. This is one of the most sensitive areas regarding privacy. For example, in the U.S., the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA [43]) requires the strict regulation of protected health information for use in medical research. In most western countries, the situation is similar.

E-commerce. Electronic commerce results in the automated collection of large amounts of consumer data. This wealth of information is very useful to companies, which are often interested in sharing it with their subsidiaries or partners. Such consumer information transfer should not result in public profiling of individuals and is subject to strict regulation; see [28] for regulations in the European Union and [77] for regulations in the U.S.

The protection provided by SDC techniques normally entails some degree of data modification, which is an intermediate option between no modification (maximum utility, but no disclosure protection) and data encryption (maximum protection but no utility for the user without clearance). The challenge for SDC is to modify data in such a way that sufficient protection is provided while the information loss, i.e. the loss of the accuracy sought by database users, is kept to a minimum.
In the years that have elapsed since the excellent survey [3], the state of the art in SDC has evolved, so that now at least three subdisciplines are clearly differentiated:

Tabular data protection. This is the oldest and best established part of SDC, because tabular data have been the traditional output of national statistical offices. The goal here is to publish static aggregate information, i.e. tables, in such a way that no confidential information on specific individuals among those to which the table refers can be inferred. See [79] for a conceptual survey and [36] for a software survey.

Dynamic databases. The scenario here is a database to which the user can submit statistical queries (sums, averages, etc.). The aggregate information obtained by a user as a result of successive queries should not allow him to infer information on specific individuals. Since the 80s, this has been known to be a difficult problem, subject to the tracker attack [69]. One possible strategy is to perturb the answers to queries; solutions based on perturbation can be found in [26], [54] and [76]. If perturbation is not acceptable and exact answers are needed, it may become necessary to refuse answers to certain queries; solutions based on query restriction can be found in [9] and [38]. Finally, a third strategy is to provide correct (unperturbed) interval answers, as done in [37] and [35].

Microdata protection. This subdiscipline is about protecting static individual data, also called microdata. It is only recently that data collectors (statistical agencies and the like) have been persuaded to publish microdata. Therefore, microdata protection is the youngest subdiscipline and has been experiencing continuous evolution in recent years.

Good general works on SDC are [79, 45]. This survey will cover the current state of the art in SDC methods for microdata, the most common data used for data mining. First, the main existing methods will be described. Then, we will discuss several information loss and disclosure risk measures, and will analyze several approaches to combining them when assessing the performance of the various methods. The comparison metrics presented should be used as a benchmark for future developments in this area. Open research issues and directions will be suggested at the end of this chapter.

Plan of This Chapter. Section 3.2 introduces a classification of microdata protection methods. Section 3.3 reviews perturbative masking methods. Section 3.4 reviews non-perturbative masking methods. Section 3.5 reviews methods for synthetic microdata generation. Section 3.6 discusses approaches to trading off information loss against disclosure risk and analyzes their strengths and limitations. Conclusions and directions for future research are summarized in Section 3.7.

3.2 A Classification of Microdata Protection Methods

A microdata set V can be viewed as a file with n records, where each record contains m attributes on an individual respondent. The attributes can be classified into four categories, which are not necessarily disjoint:

Identifiers. These are attributes that unambiguously identify the respondent. Examples are the passport number, social security number, name-surname, etc.

Quasi-identifiers or key attributes. These are attributes which identify the respondent with some degree of ambiguity. (Nonetheless, a combination of quasi-identifiers may provide unambiguous identification.) Examples are address, gender, age, telephone number, etc.
Confidential outcome attributes. These are attributes which contain sensitive information on the respondent. Examples are salary, religion, political affiliation, health condition, etc.

Non-confidential outcome attributes. These are the attributes which do not fall into any of the categories above.

Since the purpose of SDC is to prevent confidential information from being linked to specific respondents, we will assume in what follows that the original microdata sets to be protected have been pre-processed to remove all identifiers from them. The purpose of microdata SDC mentioned in the previous section can be stated more formally by saying that, given an original microdata set V, the goal is to release a protected microdata set V′ in such a way that:

1. Disclosure risk (i.e. the risk that a user or an intruder can use V′ to determine confidential attributes on a specific individual among those in V) is low.

2. User analyses (regressions, means, etc.) on V′ and on V yield the same or at least similar results.

Microdata protection methods can generate the protected microdata set V′ either by masking original data, i.e. generating a modified version V′ of the original microdata set V, or by generating synthetic data V′ that preserve some statistical properties of the original data V. Masking methods can in turn be divided into two categories depending on their effect on the original data [79]:

Perturbative. The microdata set is distorted before publication. In this way, unique combinations of scores in the original dataset may disappear and new unique combinations may appear in the perturbed dataset; such confusion is beneficial for preserving statistical confidentiality. The perturbation method used should be such that statistics computed on the perturbed dataset do not differ significantly from the statistics that would be obtained on the original dataset.

Non-perturbative. Non-perturbative methods do not alter data; rather, they produce partial suppressions or reductions of detail in the original dataset. Global recoding, local suppression and sampling are examples of non-perturbative masking.

At first glance, synthetic data seem to have the philosophical advantage of circumventing the re-identification problem: since published records are invented and do not derive from any original record, some authors claim that no individual having supplied original data can complain about having been re-identified. At a closer look, other authors (e.g., [80] and [63]) claim that even synthetic data might contain some records that allow for re-identification of confidential information. In short, synthetic data overfitted to original data might lead to disclosure just as original data would. On the other hand, a clear problem of synthetic data is data utility: only the statistical properties explicitly selected by the data protector are preserved, which leads to the question of whether the data protector should not directly publish the statistics he wants preserved rather than a synthetic microdata set. We will return to these issues in Section 3.5.

So far in this section, we have classified microdata protection methods by their operating principle. If we consider the type of data on which they can be used, a different dichotomic classification applies:
Continuous. An attribute is considered continuous if it is numerical and arithmetic operations can be performed with it. Examples are income and age. Note that a continuous attribute does not necessarily have an infinite range, as the case of age shows. When designing methods to protect continuous data, one has the advantage that arithmetic operations are possible, and the drawback that every combination of numerical values in the original dataset is likely to be unique, which leads to disclosure if no action is taken.

Categorical. An attribute is considered categorical when it takes values over a finite set and standard arithmetic operations do not make sense. Ordinal and nominal scales can be distinguished among categorical attributes. In ordinal scales the order between values is relevant, whereas in nominal scales it is not. In the former case, max and min operations are meaningful, while in the latter case only pairwise comparison is possible. The education level is an example of an ordinal attribute, whereas eye color is an example of a nominal attribute. In fact, all quasi-identifiers in a microdata set are normally categorical nominal. When designing methods to protect categorical data, the inability to perform arithmetic operations is certainly inconvenient, but the finiteness of the value range is one property that can be successfully exploited.

3.3 Perturbative Masking Methods

Perturbative methods allow for the release of the entire microdata set, although perturbed values rather than exact values are released. Not all perturbative methods are designed for continuous data; this distinction is addressed further below for each method. Most perturbative methods reviewed below (including additive noise, rank swapping, microaggregation and post-randomization) are special cases of matrix masking. If the original microdata set is X, then the masked microdata set Z is computed as

Z = AXB + C

where A is a record-transforming mask, B is an attribute-transforming mask and C is a displacing mask (noise) [27]. Table 3.1 lists the perturbative methods described below. For each method, the table indicates whether it is suitable for continuous and/or categorical data.

Table 3.1. Perturbative methods vs data types. "X" denotes applicable and "(X)" denotes applicable with some adaptation

Method             Continuous data    Categorical data
Additive noise           X
Microaggregation         X                  (X)
Rank swapping            X                   X
Rounding                 X
Resampling               X
PRAM                                         X
MASSC                                        X

3.3.1 Additive Noise

The noise addition algorithms in the literature are:

Masking by uncorrelated noise addition. The vector of observations $x_j$ for the j-th attribute of the original dataset $X_j$ is replaced by a vector $z_j = x_j + \epsilon_j$, where $\epsilon_j$ is a vector of normally distributed errors drawn from a random variable $\varepsilon_j \sim N(0, \sigma^2_{\varepsilon_j})$, such that $\mathrm{Cov}(\varepsilon_t, \varepsilon_l) = 0$ for all $t \neq l$. This does not preserve variances or correlations.

Masking by correlated noise addition. Correlated noise addition also preserves means and additionally allows preservation of correlation coefficients. The difference with the previous method is that the covariance matrix of the errors is now proportional to the covariance matrix of the original data, i.e. $\varepsilon \sim N(0, \Sigma_\varepsilon)$, where $\Sigma_\varepsilon = \alpha \Sigma$; a small sketch of this variant is given after the list of reasons below.

Masking by noise addition and linear transformation. In [49], a method is proposed that ensures, by additional transformations, that the sample covariance matrix of the masked attributes is an unbiased estimator for the covariance matrix of the original attributes.

Masking by noise addition and nonlinear transformation. An algorithm combining simple additive noise and nonlinear transformation is proposed in [72]. The advantages of this proposal are that it can be applied to discrete attributes and that univariate distributions are preserved. Unfortunately, as justified in [6], the application of this method is very time-consuming and requires expert knowledge on the data set and the algorithm.

For more details on specific algorithms, the reader can check [5]. In practice, only simple noise addition (the first two variants) or noise addition with linear transformation are used. When using linear transformations, a decision has to be made whether to reveal them to the data user, to allow for bias adjustment in the case of subpopulations.

With the exception of the not very practical method of [72], additive noise is not suitable to protect categorical data. On the other hand, it is well suited for continuous data for the following reasons:

It makes no assumptions on the range of possible values for $V_i$ (which may be infinite).

The noise being added is typically continuous and with mean zero, which suits continuous original data well.

No exact matching is possible with external files. Depending on the amount of noise added, approximate (interval) matching might be possible.
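The following minimal sketch (ours; the synthetic data and the value of alpha are arbitrary) illustrates masking by correlated noise addition, with the noise covariance proportional to the data covariance:

```python
import numpy as np

def add_correlated_noise(X, alpha=0.1, seed=0):
    """Mask continuous microdata with Gaussian noise whose covariance is
    alpha times the covariance of the original data (Sigma_eps = alpha * Sigma),
    so that means and correlation coefficients are preserved in expectation."""
    rng = np.random.default_rng(seed)
    cov = np.cov(X, rowvar=False)
    noise = rng.multivariate_normal(np.zeros(X.shape[1]), alpha * cov,
                                    size=X.shape[0])
    return X + noise

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])
Z = add_correlated_noise(X, alpha=0.2)
print(np.corrcoef(X, rowvar=False).round(2))
print(np.corrcoef(Z, rowvar=False).round(2))
```

Since the covariance of the masked data is (1 + alpha) times that of the original data, the correlation matrix is unchanged up to sampling noise, while individual values are displaced.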
3.3.2 Microaggregation

Microaggregation is a family of SDC techniques for continuous microdata. The rationale behind microaggregation is that confidentiality rules in use allow publication of microdata sets if records correspond to groups of k or more individuals, where no individual dominates (i.e. contributes too much to) the group and k is a threshold value. Strict application of such confidentiality rules leads to replacing individual values with values computed on small aggregates (microaggregates) prior to publication. This is the basic principle of microaggregation.

To obtain microaggregates in a microdata set with n records, these are combined to form g groups of size at least k. For each attribute, the average value over each group is computed and is used to replace each of the original averaged values. Groups are formed using a criterion of maximal similarity. Once the procedure has been completed, the resulting (modified) records can be published.

The optimal k-partition (from the information loss point of view) is defined to be the one that maximizes within-group homogeneity; the higher the within-group homogeneity, the lower the information loss, since microaggregation replaces values in a group by the group centroid. The sum of squares criterion is commonly used to measure homogeneity in clustering. The within-groups sum of squares SSE is defined as

$$SSE = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)'(x_{ij} - \bar{x}_i)$$

The lower the SSE, the higher the within-group homogeneity. Thus, in terms of sums of squares, the optimal k-partition is the one that minimizes SSE. For a microdata set consisting of p attributes, these can be microaggregated together or partitioned into several groups of attributes. Also the way to form groups may vary.
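As a small illustration (ours) of the SSE criterion and of the centroid replacement step, the following computes the within-groups sum of squares for a given k-partition and produces the values that would actually be released; heuristics such as MDAV, described next, are what produce the partition itself:

```python
import numpy as np

def within_groups_sse(groups):
    """SSE: sum over all groups of squared distances of records to their
    group centroid; a lower SSE means lower information loss."""
    total = 0.0
    for g in groups:
        g = np.asarray(g, dtype=float)
        total += ((g - g.mean(axis=0)) ** 2).sum()
    return total

def microaggregate(groups):
    """Release each record replaced by its group centroid."""
    return [np.tile(np.asarray(g, dtype=float).mean(axis=0), (len(g), 1))
            for g in groups]

# Two groups of k = 3 two-attribute records.
partition = [[(1.0, 2.0), (2.0, 2.0), (1.0, 3.0)],
             [(8.0, 9.0), (9.0, 9.0), (10.0, 11.0)]]
print(within_groups_sse(partition))   # 6.0 for this partition
print(microaggregate(partition)[0])   # all three records become the centroid
```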
Several taxonomies are possible to classify the microaggregation algorithms in the literature: i) fixed group size [15, 44, 23] vs variable group size [15, 51, 18, 68, 50, 20]; ii) exact optimal (only for the univariate case, [41, 55]) vs heuristic microaggregation; iii) continuous vs categorical microaggregation [75].

To illustrate, we next give a heuristic algorithm called MDAV (Maximum Distance to Average Vector, [23]) for multivariate fixed group size microaggregation on unprojected continuous data. We designed and implemented MDAV for the µ-Argus package [44].

Algorithm 3.1 (MDAV)

1. Compute the average record $\bar{x}$ of all records in the dataset. Consider the most distant record $x_r$ to the average record $\bar{x}$ (using the squared Euclidean distance).
2. Find the most distant record $x_s$ from the record $x_r$ considered in the previous step.
3. Form two groups around $x_r$ and $x_s$, respectively. One group contains $x_r$ and the $k-1$ records closest to $x_r$. The other group contains $x_s$ and the $k-1$ records closest to $x_s$.
4. If there are at least $3k$ records which do not belong to any of the two groups formed in Step 3, go to Step 1, taking as the new dataset the previous dataset minus the groups formed in the last instance of Step 3.
5. If there are between $3k-1$ and $2k$ records which do not belong to any of the two groups formed in Step 3: a) compute the average record $\bar{x}$ of the remaining records; b) find the most distant record $x_r$ from $\bar{x}$; c) form a group containing $x_r$ and the $k-1$ records closest to $x_r$; d) form another group containing the rest of the records. Exit the algorithm.
6. If there are fewer than $2k$ records which do not belong to the groups formed in Step 3, form a new group with those records and exit the algorithm.

The above algorithm can be applied independently to each group of attributes resulting from partitioning the set of attributes in the dataset.
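The following is a compact, non-authoritative sketch of Algorithm 3.1, assuming NumPy and squared Euclidean distances; all names are ours, and the second group of each round is formed after removing the first, which is a common implementation choice:

```python
import numpy as np

def mdav(X, k):
    """MDAV fixed-group-size microaggregation (sketch of Algorithm 3.1).
    Returns a list of groups, each an array of record indices of X."""
    remaining = np.arange(len(X))
    groups = []

    def extreme(idx, ref):
        # index (global) of the record in `idx` most distant from `ref`
        d = ((X[idx] - ref) ** 2).sum(axis=1)
        return idx[np.argmax(d)]

    def closest_k(idx, center_idx):
        # the k records of `idx` closest to record `center_idx`
        d = ((X[idx] - X[center_idx]) ** 2).sum(axis=1)
        return idx[np.argsort(d)[:k]]

    while len(remaining) >= 3 * k:                        # steps 1-4
        centroid = X[remaining].mean(axis=0)
        xr = extreme(remaining, centroid)
        xs = extreme(remaining, X[xr])
        g1 = closest_k(remaining, xr)
        rest = np.setdiff1d(remaining, g1)
        g2 = closest_k(rest, xs)
        groups += [g1, g2]
        remaining = np.setdiff1d(rest, g2)
    if len(remaining) >= 2 * k:                           # step 5
        centroid = X[remaining].mean(axis=0)
        xr = extreme(remaining, centroid)
        g1 = closest_k(remaining, xr)
        groups += [g1, np.setdiff1d(remaining, g1)]
    elif len(remaining) > 0:                              # step 6
        groups.append(remaining)
    return groups
```

Replacing each record by the per-attribute average of its group then completes the microaggregation.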
3.3.3 Data Swapping and Rank Swapping

Data swapping was originally presented as an SDC method for databases containing only categorical attributes [11]. The basic idea behind the method is to transform a database by exchanging values of confidential attributes among individual records. Records are exchanged in such a way that low-order frequency counts or marginals are maintained.

Even though the original procedure was not much used in practice (see [32]), its basic idea had a clear influence on subsequent methods. In [59] and [58], data swapping was introduced to protect continuous and categorical microdata, respectively. Another variant of data swapping for microdata is rank swapping, which we describe next in some detail.

Although originally described only for ordinal attributes [40], rank swapping can also be used for any numerical attribute [53]. First, the values of an attribute $X_i$ are ranked in ascending order; then each ranked value of $X_i$ is swapped with another ranked value randomly chosen within a restricted range (e.g. the ranks of two swapped values cannot differ by more than p% of the total number of records, where p is an input parameter). This algorithm is used independently on each attribute of the original dataset. It is reasonable to expect that multivariate statistics computed from data swapped with this algorithm will be less distorted than those computed after an unconstrained swap.

In earlier empirical work by these authors on continuous microdata protection [21], rank swapping has been identified as a particularly well-performing method in terms of the tradeoff between disclosure risk and information loss (see Example 3.4 below). Consequently, it is one of the techniques implemented in the µ-Argus package [44].

Table 3.2. Example of rank swapping. Left, original file; right, rank-swapped file

   1  K   3.7   4.4     |    1  H   3.0   4.8
   2  L   3.8   3.4     |    2  L   4.5   3.2
   3  N   3.0   4.8     |    3  M   3.7   4.4
   4  M   4.5   5.0     |    4  N   5.0   6.0
   5  L   5.0   6.0     |    5  L   4.5   5.0
   6  H   6.0   7.5     |    6  F   6.7   9.5
   7  H   4.5  10.0     |    7  K   3.8  11.0
   8  F   6.7  11.0     |    8  H   6.0  10.0
   9  D   8.0   9.5     |    9  C  10.0   7.5
  10  C  10.0   3.2     |   10  D   8.0   3.4

Example 3.2 In Table 3.2, we can see an original microdata set on the left and its rank-swapped version on the right. There are four attributes and ten records in the original dataset; the second attribute is alphanumeric, and standard alphabetical order has been used to rank it. A value of p = 10% has been used for all attributes.
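A minimal sketch of rank swapping for one numerical attribute follows (NumPy assumed; names are ours). Each rank is swapped at most once, with a partner chosen at random among the not-yet-swapped ranks at distance at most p% of n:

```python
import numpy as np

def rank_swap(values, p, rng=None):
    """Rank swapping of a single attribute (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(values, dtype=float)
    n = len(v)
    order = np.argsort(v)                 # maps ranks to original positions
    window = max(1, int(n * p / 100))     # maximum allowed rank difference
    swapped = np.zeros(n, dtype=bool)     # marks ranks already swapped
    out = v.copy()
    for r in range(n):
        if swapped[r]:
            continue
        # candidate partner ranks within the restricted range
        cand = [s for s in range(r + 1, min(n, r + window + 1)) if not swapped[s]]
        if not cand:
            continue
        s = int(rng.choice(cand))
        i, j = order[r], order[s]
        out[i], out[j] = out[j], out[i]
        swapped[r] = swapped[s] = True
    return out
```

As described above, the procedure would be applied independently to each attribute of the dataset.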
3.3.4 Rounding

Rounding methods replace the original values of attributes with rounded values. For a given attribute $X_i$, rounded values are chosen among a set of rounding points defining a rounding set (often the multiples of a given base value). In a multivariate original dataset, rounding is usually performed one attribute at a time (univariate rounding); however, multivariate rounding is also possible [79, 10]. The operating principle of rounding makes it suitable for continuous data.

3.3.5 Resampling

Originally proposed for protecting tabular data [42, 17], resampling can also be used for microdata. Take t independent samples $S_1, \ldots, S_t$ of the values of an original attribute $X_i$. Sort all samples using the same ranking criterion. Build the masked attribute $Z_i$ as $\bar{x}_1, \ldots, \bar{x}_n$, where n is the number of records and $\bar{x}_j$ is the average of the j-th ranked values in $S_1, \ldots, S_t$.

3.3.6 PRAM

The Post-RAndomization Method (PRAM, [39]) is a probabilistic, perturbative method for disclosure protection of categorical attributes in microdata files. In the masked file, the scores of some categorical attributes for certain records in the original file are changed to a different score according to a prescribed probability mechanism, namely a Markov matrix. The Markov approach makes PRAM very general, because it encompasses noise addition, data suppression and data recoding.

PRAM information loss and disclosure risk largely depend on the choice of the Markov matrix and are still open research topics [14]. The PRAM matrix contains a row for each possible value of each attribute to be protected. This rules out using the method for continuous data.

3.3.7 MASSC

MASSC [71] is a masking method whose acronym summarizes its four steps: Micro Agglomeration, Substitution, Subsampling and Calibration. We briefly recall the purpose of those four steps:

1. Micro agglomeration is applied to partition the original dataset into risk strata (groups of records which are at a similar risk of disclosure). These strata are formed using the key attributes, i.e. the quasi-identifiers in the records. The idea is that records with rarer combinations of key attributes are at a higher risk.
2. Optimal probabilistic substitution is then used to perturb the original data.
3. Optimal probabilistic subsampling is used to suppress some attributes or even entire records.
4. Optimal sampling weight calibration is used to preserve estimates for outcome attributes in the treated database whose accuracy is critical for the intended data use.

MASSC is interesting in that, to the best of our knowledge, it is the first attempt at designing a perturbative masking method in such a way that disclosure risk can be analytically quantified. Its main shortcoming is that its disclosure model simplifies reality by considering only disclosure resulting from linkage of key attributes with external sources. Since key attributes are typically categorical, the risk of disclosure can be analyzed by looking at the probability that a sample unique is a population unique; however, doing so ignores the fact that continuous outcome attributes can also be used for respondent re-identification via record linkage. As an example, if respondents are companies and turnover is one outcome attribute, everyone in a certain industrial sector knows which company has the largest turnover. Thus, in practice, MASSC is suited only to datasets in which no continuous attributes are present.

3.4 Non-perturbative Masking Methods

Non-perturbative methods do not rely on distortion of the original data but on partial suppressions or reductions of detail. Some of the methods are usable on both categorical and continuous data, but others are not suitable for continuous data. Table 3.3 lists the non-perturbative methods described below. For each method, the table indicates whether it is suitable for continuous and/or categorical data.

Table 3.3. Non-perturbative methods vs. data types

  Method                  Continuous data   Categorical data
  Sampling                                        X
  Global recoding               X                 X
  Top and bottom coding         X                 X
  Local suppression                               X

3.4.1 Sampling

Instead of publishing the original microdata file, what is published is a sample S of the original set of records [79]. Sampling methods are suitable for categorical microdata, but for continuous microdata they should probably be combined with other masking methods. The reason is that sampling alone leaves a continuous attribute $V_i$ unperturbed for all records in S. Thus, if attribute $V_i$ is present in an external administrative public file, unique matches with the published sample are very likely: indeed, given a continuous attribute $V_i$ and two respondents $o_1$ and $o_2$, it is highly unlikely that $V_i$ will take the same value for both $o_1$ and $o_2$ unless $o_1 = o_2$ (this is true even if $V_i$ has been truncated to represent it digitally).

If, for a continuous identifying attribute, the score of a respondent is only approximately known by an attacker (as assumed in [78]), it might still make sense to use sampling methods to protect that attribute. However, assumptions on restricted attacker resources are perilous and may prove far too optimistic if good quality external administrative files are at hand.

3.4.2 Global Recoding

This method is also sometimes known as generalization [67, 66]. For a categorical attribute $V_i$, several categories are combined to form new (less specific) categories, thus resulting in a new $V'_i$ with $|D(V'_i)| < |D(V_i)|$, where $|\cdot|$ is the cardinality operator. For a continuous attribute, global recoding means replacing $V_i$ by another attribute $V'_i$ which is a discretized version of $V_i$. In other words, a potentially infinite range $D(V_i)$ is mapped onto a finite range $D(V'_i)$. This is the technique used in the µ-Argus SDC package [44].
This technique is more appropriate for categorical microdata, where it helps disguise records with strange combinations of categorical attributes. Global recoding is used heavily by statistical offices.

Example 3.3 If there is a record with "Marital status = Widow/er" and "Age = 17", global recoding could be applied to "Marital status" to create a broader category "Widow/er or divorced", so that the probability of the above record being unique would diminish.

Global recoding can also be used on a continuous attribute, but the inherent discretization very often leads to an unaffordable loss of information. Also, arithmetic operations that were straightforward on the original $V_i$ are no longer easy or intuitive on the discretized $V'_i$.

3.4.3 Top and Bottom Coding

Top and bottom coding is a special case of global recoding which can be used on attributes that can be ranked, that is, continuous or categorical ordinal attributes. The idea is that top values (those above a certain threshold) are lumped together to form a new category. The same is done for bottom values (those below a certain threshold). See [44].

3.4.4 Local Suppression

Certain values of individual attributes are suppressed with the aim of increasing the set of records agreeing on a combination of key values. Ways to combine local suppression and global recoding are discussed in [16] and implemented in the µ-Argus SDC package [44].

If a continuous attribute $V_i$ is part of a set of key attributes, then each combination of key values is probably unique. Since it does not make sense to systematically suppress the values of $V_i$, we conclude that local suppression is rather oriented to categorical attributes.
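The two fragments below illustrate global recoding of a hierarchical categorical code and top/bottom coding of a rankable attribute. Both are illustrative sketches under our own naming, not the µ-Argus implementation:

```python
import numpy as np

def top_bottom_code(values, lower, upper):
    """Top and bottom coding (sketch): values above `upper` or below
    `lower` are lumped into the two threshold categories."""
    return np.clip(np.asarray(values, dtype=float), lower, upper)

def recode_code(code, keep_chars=3):
    """Global recoding of a ZIP-like hierarchical code (sketch):
    replace trailing characters with '*' to climb the hierarchy."""
    return code[:keep_chars] + "*" * (len(code) - keep_chars)

# Hypothetical usage: recode_code("84117") returns "841**"
```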
3.5 Synthetic Microdata Generation

Publication of synthetic, i.e. simulated, data was proposed long ago as a way to guard against statistical disclosure. The idea is to randomly generate data with the constraint that certain statistics or internal relationships of the original dataset should be preserved. We next review some approaches in the literature to synthetic data generation and then proceed to discuss the global pros and cons of using synthetic data.

3.5.1 Synthetic Data by Multiple Imputation

More than twenty years ago, it was suggested in [65] to create an entirely synthetic dataset based on the original survey data and multiple imputation. Rubin's proposal was more completely developed in [57]. A simulation study of it was given in [60]. In [64], inference on synthetic data is discussed, and in [63] an application is given.

We next sketch the operation of the original proposal by Rubin. Consider an original microdata set X of n records drawn from a much larger population of N individuals, where there are background attributes A, non-confidential attributes B and confidential attributes C. Background attributes are observed and available for all N individuals in the population, whereas B and C are only available for the n records in the sample X. The first step is to construct from X a multiply-imputed population of N individuals. This population consists of the n records in X and M (the number of multiple imputations, typically between 3 and 10) matrices of (B, C) data for the N − n non-sampled individuals. The variability in the imputed values ensures, theoretically, that valid inferences can be obtained on the multiply-imputed population. A model for predicting (B, C) from A is used to multiply-impute (B, C) in the population. The choice of the model is a nontrivial matter.

Once the multiply-imputed population is available, a sample Z of n records can be drawn from it whose structure looks like that of a sample of n records drawn from the original population. This can be done M times to create M replicates of (B, C) values. The result is M multiply-imputed synthetic datasets. To make sure no original data are in the synthetic datasets, it is wise to draw the samples from the multiply-imputed population excluding the n original records from it.

3.5.2 Synthetic Data by Bootstrap

Long ago, [30] proposed generating synthetic microdata by using bootstrap methods. Later, in [31], this approach was used for categorical data.

The bootstrap approach bears some similarity to data distortion by probability distribution and to the multiple-imputation method described above. Given an original microdata set X with p attributes, the data protector computes its empirical p-variate cumulative distribution function (c.d.f.) F. Now, rather than distorting the original data to obtain masked data (as done by the masking methods in Sections 3.3 and 3.4), the data protector alters (or "smoothes") the c.d.f. F to derive a similar c.d.f. F'. Finally, F' is sampled to obtain a synthetic microdata set Z.

3.5.3 Synthetic Data by Latin Hypercube Sampling

Latin Hypercube Sampling (LHS) appears in the literature as another method for generating multivariate synthetic datasets. In [46], the updated LHS technique of [33] was improved, but the proposed scheme is still time-intensive even for a moderate number of records. In [12], LHS is used along with a rank correlation refinement to reproduce both the univariate (i.e. mean and covariance) and multivariate structure (in the sense of rank correlation) of the original dataset. In a nutshell, LHS-based methods rely on iterative refinement and are time-intensive, and their running time depends not only on the number of values to be reproduced, but on the starting values as well.

3.5.4 Partially Synthetic Data by Cholesky Decomposition

Generating plausible synthetic values for all attributes in a database may be difficult in practice. Thus, several authors have considered mixing actual and synthetic data. In [7], a non-iterative method for generating continuous synthetic microdata is proposed; it consists of three methods, sketched next.

Informally, suppose two sets of attributes X and Y, where the former are the confidential outcome attributes and the latter are quasi-identifier attributes. Then X are taken as independent and Y as dependent attributes. Conditional on the specific confidential attributes $x_i$, the quasi-identifier attributes $Y_i$ are assumed to follow a multivariate normal distribution with covariance matrix $\Sigma = \{\sigma_{jk}\}$ and mean vector $x_i B$, where $B$ is a matrix of regression coefficients.

Method A computes a multiple regression of Y on X and the fitted attributes $Y_A$. Finally, attributes X and $Y_A$ are released in place of X and Y. If a user fits a multiple regression model to $(y_A, x)$, she will get estimates $\hat{B}_A$ and $\hat{\Sigma}_A$ which, in general, are different from the estimates $\hat{B}$ and $\hat{\Sigma}$ obtained when fitting the model to the original data $(y, x)$. IPSO Method B modifies $y_A$ into $y_B$ in such a way that the estimate $\hat{B}_B$ obtained by multiple linear regression from $(y_B, x)$ satisfies $\hat{B}_B = \hat{B}$.
A more ambitious goal is to come up with a data matrix $y_C$ such that, when a multivariate multiple regression model is fitted to $(y_C, x)$, both sufficient statistics $\hat{B}$ and $\hat{\Sigma}$ obtained on the original data $(y, x)$ are preserved. This is achieved by IPSO Method C.

3.5.5 Other Partially Synthetic and Hybrid Microdata Approaches

The multiple imputation approach described in [65] for creating entirely synthetic microdata can be extended to partially synthetic microdata. As a result, multiply-imputed, partially synthetic datasets are obtained that contain a mix of actual and imputed (synthetic) values. The idea is to multiply-impute confidential values and release non-confidential values without perturbation. This approach was first applied to protect the Survey of Consumer Finances [47, 48]. In Abowd and Woodcock [1, 2], this technique was adopted to protect longitudinal linked data, that is, microdata containing observations from two or more related time periods (successive years, etc.). Methods for valid inference on this kind of partially synthetic data were developed in [61], and a non-parametric method was presented in [62] to generate multiply-imputed, partially synthetic data.

Closely related to multiply-imputed, partially synthetic microdata is model-based disclosure protection [34, 56]. In this approach, a set of confidential continuous outcome attributes is regressed on a disjoint set of non-confidential attributes; then the fitted values are released for the confidential attributes instead of the original values.

A different approach called hybrid masking was proposed in [13]. The idea is to compute masked data as a combination of original and synthetic data. Such a combination allows better control than purely synthetic data over the individual characteristics of masked records. For hybrid masking to be feasible, a rule must be used to pair one original data record with one synthetic data record. An option suggested in [13] is to go through all original data records and pair each original record with the nearest synthetic record according to some distance. Once records have been paired, [13] suggests two possible ways of combining one original record $X$ with one synthetic record $X_s$: additive combination, which yields

$$Z = \alpha X + (1 - \alpha) X_s,$$

and multiplicative combination, which yields

$$Z = X^{\alpha} \cdot X_s^{(1-\alpha)},$$

where $\alpha$ is an input parameter in $[0, 1]$ and $Z$ is the hybrid record. [13] presents empirical results comparing the hybrid approach with rank swapping and microaggregation masking (the synthetic component of the hybrid data is generated using Latin Hypercube Sampling [12]).

Another approach to combining original and synthetic microdata is proposed in [70]. The idea here is to first mask an original dataset using a masking method (see Sections 3.3 and 3.4 above). Then a hill-climbing optimization heuristic is run which seeks to modify the masked data to preserve the first and second-order moments of the original dataset as much as possible without increasing the disclosure risk with respect to the initial masked data. The optimization heuristic can be modified to preserve higher-order moments, but this significantly increases computation. Also, the optimization heuristic can take a random dataset instead of a masked dataset as its initial dataset; in this case, the output dataset is purely synthetic.
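Assuming records have already been paired (and, for the multiplicative variant, that all values are strictly positive), the two hybrid combinations above reduce to a one-liner each; the sketch below is illustrative only, with names of our choosing:

```python
import numpy as np

def hybrid_mask(X, X_s, alpha, multiplicative=False):
    """Hybrid masking (sketch): combine paired original records X and
    synthetic records X_s, with alpha in [0, 1] controlling the mix."""
    X, X_s = np.asarray(X, dtype=float), np.asarray(X_s, dtype=float)
    if multiplicative:
        return X ** alpha * X_s ** (1 - alpha)   # Z = X^alpha * X_s^(1-alpha)
    return alpha * X + (1 - alpha) * X_s          # Z = alpha*X + (1-alpha)*X_s
```

With alpha = 1 the original data are released unchanged; with alpha = 0 the output is purely synthetic.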
3.5.6 Pros and Cons of Synthetic Microdata

As pointed out in Section 3.2, synthetic data are appealing in that, at first glance, they seem to circumvent the re-identification problem: since published records are invented and do not derive from any original record, it might be concluded that no individual can complain of having been re-identified. At a closer look this advantage is less clear. If, by chance, a published synthetic record matches a particular citizen's non-confidential attributes (age, marital status, place of residence, etc.) and confidential attributes (salary, mortgage, etc.), re-identification using the non-confidential attributes is easy and that citizen may feel that his confidential attributes have been unduly revealed. In that case, the citizen is unlikely to be happy with, or even understand, the explanation that the record was synthetically generated.

On the other hand, limited data utility is another problem of synthetic data. Only the statistical properties explicitly captured by the model used by the data protector are preserved. A logical question at this point is why not directly publish the statistics one wants to preserve rather than release a synthetic microdata set.

One possible justification for synthetic microdata would be if valid analyses could be obtained on a number of subdomains, i.e. if similar results were obtained on a number of subsets of the original dataset and the corresponding subsets of the synthetic dataset. Partially synthetic or hybrid microdata are more likely to succeed in staying useful for subdomain analysis. However, when using partially synthetic or hybrid microdata, we lose the attractive feature of purely synthetic data that the number of records in the protected (synthetic) dataset is independent of the number of records in the original dataset.

3.6 Trading off Information Loss and Disclosure Risk

Sections 3.2 through 3.5 have presented a plethora of methods to protect microdata. To complicate things further, most such methods are parametric (e.g., in microaggregation, one parameter is the minimum number of records in a cluster), so the user must go through two choices rather than one: a primary choice to select a method and a secondary choice to select parameters for the method to be used. To help reduce the embarras du choix, some guidelines are needed.

3.6.1 Score Construction

The mission of SDC, to modify data in such a way that sufficient protection is provided at minimum information loss, suggests that a good SDC method is one achieving a good tradeoff between disclosure risk and information loss. Following this idea, [21] proposed a score for method performance rating based on the average of information loss and disclosure risk measures. For each method M and parameterization P, the following score is computed:

$$\mathrm{Score}(V, V') = \frac{IL(V, V') + DR(V, V')}{2}$$

where IL is an information loss measure, DR is a disclosure risk measure and V' is the protected dataset obtained after applying method M with parameterization P to an original dataset V.

In [21] and [19], IL and DR were computed using a weighted combination of several information loss and disclosure risk measures. With the resulting score, a ranking of masking methods (and their parameterizations) was obtained. In [81], the line of the above two papers was followed to rank a different set of methods using a slightly different score.
To illustrate how a score can be constructed, we next describe the particular score used in [21].

Example 3.4 Let $X$ and $X'$ be matrices representing the original and the protected datasets, respectively, where all attributes are numerical. Let $V$ and $R$ be the covariance matrix and the correlation matrix of $X$, respectively; let $\bar{X}$ be the vector of attribute averages for $X$ and let $S$ be the diagonal of $V$. Define $V'$, $R'$, $\bar{X}'$, and $S'$ analogously from $X'$. The information loss (IL) is computed by averaging the mean variations of $X - X'$, $\bar{X} - \bar{X}'$, $V - V'$, $S - S'$, and the mean absolute error of $R - R'$, and multiplying the resulting average by 100. Thus, we obtain the following expression for information loss:

$$IL = \frac{100}{5}\left(\frac{\sum_{j=1}^{p}\sum_{i=1}^{n}\frac{|x_{ij}-x'_{ij}|}{|x_{ij}|}}{np} + \frac{\sum_{j=1}^{p}\frac{|\bar{x}_{j}-\bar{x}'_{j}|}{|\bar{x}_{j}|}}{p} + \frac{\sum_{j=1}^{p}\sum_{1\le i\le j}\frac{|v_{ij}-v'_{ij}|}{|v_{ij}|}}{\frac{p(p+1)}{2}} + \frac{\sum_{j=1}^{p}\frac{|v_{jj}-v'_{jj}|}{|v_{jj}|}}{p} + \frac{\sum_{j=1}^{p}\sum_{1\le i<j}|r_{ij}-r'_{ij}|}{\frac{p(p-1)}{2}}\right)$$

The expression of the overall score is obtained by combining information loss and disclosure risk as follows:

$$\mathrm{Score} = \frac{IL + \frac{(0.5\,DLD + 0.5\,PLD) + ID}{2}}{2}$$

Here, DLD (Distance Linkage Disclosure risk) is the percentage of records correctly linked using distance-based record linkage [19], PLD (Probabilistic Linkage Disclosure risk) is the percentage of records correctly linked using probabilistic linkage [29], ID (Interval Disclosure) is the percentage of original records falling in the intervals around their corresponding masked values, and IL is the information loss measure defined above.

Based on the above score, [21] found that, for the benchmark datasets and the intruder's external information they used, two good performers among the set of methods and parameterizations they tried were: i) rank swapping with parameter p around 15 (see the description above); ii) multivariate microaggregation on unprojected data taking groups of three attributes at a time (Algorithm 3.1 with partitioning of the set of attributes).

Using a score permits one to regard the selection of a masking method and its parameters as an optimization problem. This idea was first used in the above-mentioned contribution [70]. In that paper, a masking method was applied to the original data file and then a post-masking optimization procedure was applied to decrease the score obtained.

On the negative side, no specific score weighting can do justice to all methods. Thus, when ranking methods, the values of all measures of information loss and disclosure risk should be supplied along with the overall score.
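A direct transcription of the above measures might look as follows (NumPy assumed; names are ours; the relative-variation terms assume non-zero original values and a non-degenerate covariance structure):

```python
import numpy as np

def information_loss(X, Xp):
    """IL of Example 3.4 (sketch). X and Xp are n x p matrices holding
    the original and the protected numerical data."""
    n, p = X.shape
    V, Vp = np.cov(X, rowvar=False), np.cov(Xp, rowvar=False)
    R, Rp = np.corrcoef(X, rowvar=False), np.corrcoef(Xp, rowvar=False)
    xbar, xbarp = X.mean(axis=0), Xp.mean(axis=0)

    iu = np.triu_indices(p)        # entries with i <= j, covariance term
    io = np.triu_indices(p, k=1)   # entries with i < j, correlation term

    terms = [
        np.mean(np.abs(X - Xp) / np.abs(X)),                              # data
        np.mean(np.abs(xbar - xbarp) / np.abs(xbar)),                     # means
        np.mean(np.abs(V[iu] - Vp[iu]) / np.abs(V[iu])),                  # covariances
        np.mean(np.abs(np.diag(V) - np.diag(Vp)) / np.abs(np.diag(V))),   # variances
        np.mean(np.abs(R[io] - Rp[io])),                                  # correlations
    ]
    return 100 * np.mean(terms)

def score(il, dld, pld, idr):
    """Overall score: average of IL and DR = ((0.5*DLD + 0.5*PLD) + ID)/2."""
    return (il + ((0.5 * dld + 0.5 * pld) + idr) / 2) / 2
```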
3.6.2 R-U Maps

A tool which may be enlightening when trying to construct a score or, more generally, to optimize the tradeoff between information loss and disclosure risk, is a graphical representation of pairs of measures (disclosure risk, information loss) or their equivalents (disclosure risk, data utility). Such maps are called R-U confidentiality maps [24, 25]. Here, R stands for disclosure risk and U for data utility. According to [25], "in its most basic form, an R-U confidentiality map is the set of paired values (R, U), of disclosure risk and data utility that correspond to various strategies for data release" (e.g., variations on a parameter). Such (R, U) pairs are typically plotted in a two-dimensional graph, so that the user can easily grasp the influence of a particular method and/or parameter choice.

3.6.3 k-anonymity

A different approach to facing the conflict between information loss and disclosure risk is suggested by Samarati and Sweeney [67, 66, 73, 74]. A protected dataset is said to satisfy k-anonymity for k > 1 if, for each combination of quasi-identifier values (e.g. address, age, gender, etc.), at least k records exist in the dataset sharing that combination. Now if, for a given k, k-anonymity is assumed to be enough protection, one can concentrate on minimizing information loss with the only constraint that k-anonymity should be satisfied. This is a clean way of resolving the tension between data protection and data utility. Since k-anonymity is usually achieved via generalization (equivalent to global recoding, as said above) and local suppression, minimizing information loss usually translates to reducing the number and/or the magnitude of suppressions.

k-anonymity bears some resemblance to the underlying principle of microaggregation and is a useful concept because quasi-identifiers are usually categorical or can be categorized, i.e. they take values in a finite (and ideally reduced) range. However, re-identification is not necessarily based on categorical quasi-identifiers: sometimes, numerical outcome attributes (which are continuous and often cannot be categorized) give enough clues for re-identification (see the discussion of the MASSC method above). Microaggregation was suggested in [23] as a possible way to achieve k-anonymity for numerical, ordinal and nominal attributes. A similar idea called data condensation had been independently proposed in [4] to achieve k-anonymity for the specific case of numerical attributes.

Another connection between k-anonymity and microaggregation is the NP-hardness of solving them optimally. Satisfying k-anonymity with minimal data modification has been shown to be NP-hard in [52], which parallels the NP-hardness of optimal multivariate microaggregation proven in [55].
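Checking whether a released table satisfies the definition above is straightforward; the helper below (hypothetical names and data layout) counts quasi-identifier combinations and verifies that each occurs at least k times:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """k-anonymity check (sketch): every combination of quasi-identifier
    values must be shared by at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical usage:
# records = [{"zip": "841**", "age": "30-39", "disease": "flu"}, ...]
# is_k_anonymous(records, ["zip", "age"], k=3)
```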
3.7 Conclusions and Research Directions

Inference control methods for privacy-preserving data mining are a hot research topic progressing very fast. There are still many open issues, some of which can hopefully be solved with further research and some of which are likely to stay open due to the inherent nature of SDC. We first list some of the issues that we feel can and should be settled in the near future:

- Identifying a comprehensive listing of data uses (e.g. regression models, association rules, etc.) that would allow the definition of data use-specific information loss measures broadly accepted by the community; those new measures could complement and/or replace the generic measures currently used. Work in this line was started in Europe in 2006 under the CENEX SDC project sponsored by Eurostat.
- Devising disclosure risk assessment procedures which are as universally applicable as record linkage while being less greedy in computational terms.
- Identifying the external data sources that intruders can typically access in order to attempt re-identification in each domain of application. This would help data protectors figure out in more realistic terms which disclosure scenarios they should protect data against.
- Creating one or several benchmarks to assess the performance of SDC methods. Benchmark creation is currently hampered by the confidentiality of the original datasets to be protected. Data protectors should agree on a collection of non-confidential, original-looking datasets (financial datasets, population datasets, etc.) which can be used by anybody to compare the performance of SDC methods. The benchmark should also incorporate state-of-the-art disclosure risk assessment methods, which requires continuous update and maintenance.

There are other issues which, in our view, are less likely to be resolved in the near future, due to the very nature of SDC methods. As pointed out in [22], if an intruder knows the SDC algorithm used to create a protected dataset, he can mount algorithm-specific re-identification attacks which can disclose more confidential information than conventional data mining attacks. Keeping the SDC algorithm secret would seem a solution, but in many cases the protected dataset itself gives some clues about the SDC algorithm used to produce it. Such is the case for a rounded, microaggregated or partially suppressed microdata set. Thus, it is unclear to what extent the SDC algorithm used can be kept secret. Other data security areas where slightly distorted data are sent to a recipient who is legitimate but untrusted share the same concerns about the secrecy of the protection algorithms in use. This is the case of watermarking. Teaming up with areas sharing similar problems is probably one clever line of action for SDC.

References

[1] J. M. Abowd and S. D. Woodcock. Disclosure limitation in longitudinal linked tables. In P. Doyle, J. I. Lane, J. J. Theeuwes, and L. V. Zayatz, editors, Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pages 215–278, Amsterdam, 2001. North-Holland.

[2] J. M. Abowd and S. D. Woodcock. Multiply-imputing confidential characteristics and file links in longitudinal linked data. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 290–297, Berlin Heidelberg, 2004. Springer.

[3] N. R. Adam and J. C. Wortmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys, 21(4):515–556, 1989.

[4] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In E. Bertino, S. Christodoulakis, D. Plexousakis, V. Christophides, M. Koubarakis, K. Böhm, and E. Ferrari, editors, Advances in Database Technology - EDBT 2004, volume 2992 of Lecture Notes in Computer Science, pages 183–199, Berlin Heidelberg, 2004. Springer.

[5] R. Brand. Microdata protection through noise addition. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 97–116, Berlin Heidelberg, 2002. Springer.

[6] R. Brand. Tests of the applicability of Sullivan's algorithm to synthetic data and real business data in official statistics, 2002. European Project IST-2000-25069 CASC, Deliverable 1.1-D1, http://neon.vb.cbs.nl/casc.

[7] J. Burridge. Information preserving statistical obfuscation. Statistics and Computing, 13:321–327, 2003.

[8] CASC. Computational aspects of statistical confidentiality, 2004. European project IST-2000-25069 CASC, 5th FP, 2001-2004, http://neon.vb.cbs.nl/casc.

[9] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Transactions on Software Engineering, SE-8:574–582, 1982.

[10] L. H. Cox and J. J. Kim. Effects of rounding on the quality and confidentiality of statistical data. In J. Domingo-Ferrer and L. Franconi, editors, Privacy in Statistical Databases - PSD 2006, volume 4302 of Lecture Notes in Computer Science, pages 48–56, Berlin Heidelberg, 2006. Springer.
[11] T. Dalenius and S. P. Reiss. Data-swapping: a technique for disclosure control (extended abstract). In Proc. of the ASA Section on Survey Research Methods, pages 191–194, Washington DC, 1978. American Statistical Association.

[12] R. Dandekar, M. Cohen, and N. Kirkendall. Sensitive micro data protection using Latin hypercube sampling technique. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 245–253, Berlin Heidelberg, 2002. Springer.

[13] R. Dandekar, J. Domingo-Ferrer, and F. Sebé. LHS-based hybrid microdata vs rank swapping and microaggregation for numeric microdata protection. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 153–162, Berlin Heidelberg, 2002. Springer.

[14] P.-P. de Wolf. Risk, utility and PRAM. In J. Domingo-Ferrer and L. Franconi, editors, Privacy in Statistical Databases - PSD 2006, volume 4302 of Lecture Notes in Computer Science, pages 189–204, Berlin Heidelberg, 2006. Springer.

[15] D. Defays and P. Nanopoulos. Panels of enterprises and confidentiality: the small aggregates method. In Proc. of 92 Symposium on Design and Analysis of Longitudinal Surveys, pages 195–204, Ottawa, 1993. Statistics Canada.

[16] A. G. DeWaal and L. C. R. J. Willenborg. Global recodings and local suppressions in microdata sets. In Proceedings of Statistics Canada Symposium'95, pages 121–132, Ottawa, 1995. Statistics Canada.

[17] J. Domingo-Ferrer and J. M. Mateo-Sanz. On resampling for statistical confidentiality in contingency tables. Computers & Mathematics with Applications, 38:13–32, 1999.

[18] J. Domingo-Ferrer and J. M. Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1):189–201, 2002.

[19] J. Domingo-Ferrer, J. M. Mateo-Sanz, and V. Torra. Comparing SDC methods for microdata on the basis of information loss and disclosure risk. In Pre-proceedings of ETK-NTTS'2001 (vol. 2), pages 807–826, Luxemburg, 2001. Eurostat.

[20] J. Domingo-Ferrer, F. Sebé, and A. Solanas. A polynomial-time approximation to optimal multivariate microaggregation. Computers & Mathematics with Applications, 2007. (To appear).

[21] J. Domingo-Ferrer and V. Torra. A quantitative comparison of disclosure control methods for microdata. In P. Doyle, J. I. Lane, J. J. M. Theeuwes, and L. Zayatz, editors, Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pages 111–134, Amsterdam, 2001. North-Holland. http://vneumann.etse.urv.es/publications/bcpi.

[22] J. Domingo-Ferrer and V. Torra. Algorithmic data mining against privacy protection methods for statistical databases. Manuscript, 2004.

[23] J. Domingo-Ferrer and V. Torra. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery, 11(2):195–212, 2005.
[24] G. T. Duncan, S. E. Fienberg, R. Krishnan, R. Padman, and S. F. Roehrig. Disclosure limitation methods and information loss for tabular data. In P. Doyle, J. I. Lane, J. J. Theeuwes, and L. V. Zayatz, editors, Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pages 135–166, Amsterdam, 2001. North-Holland.

[25] G. T. Duncan, S. A. Keller-McNulty, and S. L. Stokes. Disclosure risk vs. data utility: the R-U confidentiality map, 2001.

[26] G. T. Duncan and S. Mukherjee. Optimal disclosure limitation strategy in statistical databases: deterring tracker attacks through additive noise. Journal of the American Statistical Association, 95:720–729, 2000.

[27] G. T. Duncan and R. W. Pearson. Enhancing access to microdata while protecting confidentiality: prospects for the future. Statistical Science, 6:219–239, 1991.

[28] E.U.Privacy. European privacy regulations, 2004. http://europa.eu.int/comm/internal_market/privacy/law_en.htm.

[29] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[30] S. E. Fienberg. A radical proposal for the provision of micro-data samples and the preservation of confidentiality. Technical Report 611, Carnegie Mellon University Department of Statistics, 1994.

[31] S. E. Fienberg, U. E. Makov, and R. J. Steele. Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics, 14(4):485–502, 1998.

[32] S. E. Fienberg and J. McIntyre. Data swapping: variations on a theme by Dalenius and Reiss. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 14–29, Berlin Heidelberg, 2004. Springer.

[33] A. Florian. An efficient sampling scheme: updated Latin hypercube sampling. Probabilistic Engineering Mechanics, 7(2):123–130, 1992.

[34] L. Franconi and J. Stander. A model based method for disclosure limitation of business microdata. Journal of the Royal Statistical Society D - Statistician, 51:1–11, 2002.

[35] R. Garfinkel, R. Gopal, and D. Rice. New approaches to disclosure limitation while answering queries to a database: protecting numerical confidential data against insider threat based on data and algorithms, 2004. Manuscript. Available at http://www-eio.upc.es/seminar/04/garfinkel.pdf.

[36] S. Giessing. Survey on methods for tabular data protection in ARGUS. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 1–13, Berlin Heidelberg, 2004. Springer.

[37] R. Gopal, R. Garfinkel, and P. Goes. Confidentiality via camouflage: the CVC approach to disclosure limitation when answering queries to databases. Operations Research, 50:501–516, 2002.

[38] R. Gopal, P. Goes, and R. Garfinkel. Interval protection of confidential information in a database. INFORMS Journal on Computing, 10:309–322, 1998.

[39] J. M. Gouweleeuw, P. Kooiman, L. C. R. J. Willenborg, and P.-P. DeWolf. Post randomisation for statistical disclosure control: theory and implementation, 1997. Research paper no. 9731 (Voorburg: Statistics Netherlands).

[40] B. Greenberg. Rank swapping for ordinal data, 1987. Washington, DC: U.S. Bureau of the Census (unpublished manuscript).

[41] S. L. Hansen and S. Mukherjee. A polynomial algorithm for optimal univariate microaggregation. IEEE Transactions on Knowledge and Data Engineering, 15(4):1043–1044, 2003.
[42] G. R. Heer. A bootstrap procedure to preserve statistical confidentiality in contingency tables. In D. Lievesley, editor, Proc. of the International Seminar on Statistical Confidentiality, pages 261–271, Luxemburg, 1993. Office for Official Publications of the European Communities.

[43] HIPAA. Health Insurance Portability and Accountability Act, 2004. http://www.hhs.gov/ocr/hipaa/.

[44] A. Hundepool, A. Van de Wetering, R. Ramaswamy, L. Franconi, A. Capobianchi, P.-P. DeWolf, J. Domingo-Ferrer, V. Torra, R. Brand, and S. Giessing. µ-ARGUS version 4.0 Software and User's Manual. Statistics Netherlands, Voorburg NL, May 2005. http://neon.vb.cbs.nl/casc.

[45] A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, R. Lenz, J. Longhurst, E. Schulte-Nordholt, G. Seri, and P.-P. DeWolf. Handbook on Statistical Disclosure Control (version 1.0). Eurostat (CENEX SDC Project Deliverable), 2006.

[46] D. E. Huntington and C. S. Lyrintzis. Improvements to and limitations of Latin hypercube sampling. Probabilistic Engineering Mechanics, 13(4):245–253, 1998.

[47] A. B. Kennickell. Multiple imputation and disclosure control: the case of the 1995 Survey of Consumer Finances. In Record Linkage Techniques, pages 248–267, Washington DC, 1999. National Academy Press.

[48] A. B. Kennickell. Multiple imputation and disclosure protection: the case of the 1995 Survey of Consumer Finances. In J. Domingo-Ferrer, editor, Statistical Data Protection, pages 248–267, Luxemburg, 1999. Office for Official Publications of the European Communities.

[49] J. J. Kim. A method for limiting disclosure in microdata based on random noise and transformation. In Proceedings of the Section on Survey Research Methods, pages 303–308, Alexandria VA, 1986. American Statistical Association.

[50] M. Laszlo and S. Mukherjee. Minimum spanning tree partitioning algorithm for microaggregation. IEEE Transactions on Knowledge and Data Engineering, 17(7):902–911, 2005.

[51] J. M. Mateo-Sanz and J. Domingo-Ferrer. A method for data-oriented multivariate microaggregation. In J. Domingo-Ferrer, editor, Statistical Data Protection, pages 89–99, Luxemburg, 1999. Office for Official Publications of the European Communities.

[52] A. Meyerson and R. Williams. General k-anonymization is hard. Technical Report 03-113, Carnegie Mellon School of Computer Science (USA), 2003.

[53] R. Moore. Controlled data swapping techniques for masking public use microdata sets, 1996. U.S. Bureau of the Census, Washington, DC (unpublished manuscript).

[54] K. Muralidhar, D. Batra, and P. J. Kirs. Accessibility, security and accuracy in statistical databases: the case for the multiplicative fixed data perturbation approach. Management Science, 41:1549–1564, 1995.

[55] A. Oganian and J. Domingo-Ferrer. On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe, 18(4):345–354, 2001.

[56] S. Polettini, L. Franconi, and J. Stander. Model based disclosure protection. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 83–96, Berlin Heidelberg, 2002. Springer.

[57] T. J. Raghunathan, J. P. Reiter, and D. Rubin. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1):1–16, 2003.

[58] S. P. Reiss. Practical data-swapping: the first steps. ACM Transactions on Database Systems, 9:20–37, 1984.
[59] S. P. Reiss, M. J. Post, and T. Dalenius. Non-reversible privacy transformations. In Proceedings of the ACM Symposium on Principles of Database Systems, pages 139–146, Los Angeles, CA, 1982. ACM.

[60] J. P. Reiter. Satisfying disclosure restrictions with synthetic data sets. Journal of Official Statistics, 18(4):531–544, 2002.

[61] J. P. Reiter. Inference for partially synthetic, public use microdata sets. Survey Methodology, 29:181–188, 2003.

[62] J. P. Reiter. Using CART to generate partially synthetic public use microdata, 2003. Duke University working paper.

[63] J. P. Reiter. Releasing multiply-imputed, synthetic public use microdata: an illustration and empirical study. Journal of the Royal Statistical Society, Series A, 168:185–205, 2005.

[64] J. P. Reiter. Significance tests for multi-component estimands from multiply-imputed, synthetic microdata. Journal of Statistical Planning and Inference, 131(2):365–377, 2005.

[65] D. B. Rubin. Discussion of statistical disclosure limitation. Journal of Official Statistics, 9(2):461–468, 1993.

[66] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010–1027, 2001.

[67] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, 1998.

[68] G. Sande. Exact and approximate methods for data directed microaggregation in one or more dimensions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):459–476, 2002.

[69] J. Schlörer. Disclosure from statistical databases: quantitative aspects of trackers. ACM Transactions on Database Systems, 5:467–492, 1980.

[70] F. Sebé, J. Domingo-Ferrer, J. M. Mateo-Sanz, and V. Torra. Post-masking optimization of the tradeoff between information loss and disclosure risk in masked microdata sets. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 163–171, Berlin Heidelberg, 2002. Springer.

[71] A. C. Singh, F. Yu, and G. H. Dunteman. MASSC: a new data mask for limiting statistical information loss and disclosure. In H. Linden, J. Riecan, and L. Belsby, editors, Work Session on Statistical Data Confidentiality 2003, Monographs in Official Statistics, pages 373–394, Luxemburg, 2004. Eurostat.

[72] G. R. Sullivan. The Use of Added Error to Avoid Disclosure in Microdata Releases. PhD thesis, Iowa State University, 1989.

[73] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge Based Systems, 10(5):571–588, 2002.

[74] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge Based Systems, 10(5):557–570, 2002.

[75] V. Torra. Microaggregation for categorical variables: a median based approach. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 162–174, Berlin Heidelberg, 2004. Springer.

[76] J. F. Traub, Y. Yemini, and H. Wozniakowski. The statistical security of a statistical database. ACM Transactions on Database Systems, 9:672–679, 1984.

[77] U.S.Privacy. U.S. privacy regulations, 2004. http://www.media-awareness.ca/english/issues/privacy/us_legislation_privacy.cfm.
[78] L. Willenborg and T. DeWaal. Statistical Disclosure Control in Practice. Springer-Verlag, New York, 1996.

[79] L. Willenborg and T. DeWaal. Elements of Statistical Disclosure Control. Springer-Verlag, New York, 2001.

[80] W. E. Winkler. Re-identification methods for masked microdata. In J. Domingo-Ferrer and V. Torra, editors, Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 216–230, Berlin Heidelberg, 2004. Springer.

[81] W. E. Yancey, W. E. Winkler, and R. H. Creecy. Disclosure risk assessment in perturbative microdata protection. In J. Domingo-Ferrer, editor, Inference Control in Statistical Databases, volume 2316 of Lecture Notes in Computer Science, pages 135–152, Berlin Heidelberg, 2002. Springer.

Chapter 4
Measures of Anonymity

Suresh Venkatasubramanian
School of Computing, University of Utah
suresh@cs.utah.edu

Abstract: To design a privacy-preserving data publishing system, we must first quantify the very notion of privacy, or information loss. In the past few years, there has been a proliferation of measures of privacy, some based on statistical considerations, others based on Bayesian or information-theoretic notions of information, and still others designed around the limitations of bounded adversaries. In this chapter, we review the various approaches to capturing privacy. We will find that although one can define privacy from different standpoints, there are many structural similarities in the way different approaches have evolved. It will also become clear that the notions of privacy and utility (the useful information one can extract from published data) are intertwined in ways that are yet to be fully resolved.

Keywords: Measures of privacy, statistics, Bayes inference, information theory, cryptography.

4.1 Introduction

In this chapter, we survey the various approaches that have been proposed to measure privacy (and the loss of privacy). Since most privacy concerns (especially those related to health-care information [44]) are raised in the context of legal concerns, it is instructive to view privacy from a legal perspective, rather than from purely technical considerations. It is beyond the scope of this survey to review the legal interpretations of privacy [11]. However, one essay on privacy that appears directly relevant (and has inspired at least one paper surveyed here) is the view of privacy in terms of the access that others have to us and our information, presented by Ruth Gavison [23]. In her view, a general definition of privacy must be one that is measurable, of value, and actionable. The first property needs no explanation; the second means that the entity being considered private must be valuable; and the third argues that, from a legal perspective, the only interesting losses of privacy are those that can be prosecuted.

This survey, and much of the research on privacy, concerns itself with the measurement of privacy. The second property is implicit in most discussions of measures of privacy: authors propose basic data items that are valuable and must be protected (fields in a record, background knowledge about a distribution, and so on). The third aspect of privacy is of a legal nature and is not directly relevant to our discussion here.

4.1.1 What is Privacy?

To measure privacy, we must define it. This, in essence, is the hardest part of the problem of measuring privacy, and is the reason for the plethora of proposed measures.
Once again, we turn to Gavison for some insight. In her paper, she argues that there are three inter-related kinds of privacy: secrecy, anonymity, and solitude. Secrecy concerns information that others may gather about us. Anonymity addresses how much "in the public gaze" we are, and solitude measures the degree to which others have physical access to us. From the perspective of protecting information, solitude relates to the physical protection of data, and is again beyond the purview of this article. Secrecy and anonymity are useful ways of thinking about privacy, and we will see that measures of privacy preservation can be viewed as falling mostly into one of these two categories.

If we think of privacy as secrecy (of our information), then a loss of privacy is leakage of that information. This can be measured through various means: the probability of a data item being accessed, the change in an adversary's knowledge upon seeing the data, and so on. If we think in terms of anonymity, then privacy leakage is measured in terms of the amount of blurring accompanying the release of data: the more the blurring, the more anonymous the data.

Privacy versus Utility. It would seem that the most effective way to preserve the privacy of information would be to encrypt it. Users wishing to access the data could be given keys, and this would summarily solve all privacy issues. Unfortunately, this approach does not work in a data publishing scenario, which is the primary setting for much work on privacy preservation.

The key notion here is one of utility: the goal of privacy preservation measures is to secure access to confidential information while at the same time releasing aggregate information to the public. One common example used is that of the U.S. Census. The U.S. Census wishes to publish survey data from the census so that demographers and other public policy experts can analyze trends in the general population. On the other hand, it wishes to avoid releasing information that could be used to infer facts about specific individuals; the case of the AOL search query release [34] indicates the dangers of releasing data without adequately anonymizing it.

It is this idea of utility that makes cryptographic approaches to privacy preservation problematic. As Dwork points out in her overview of differential privacy [16], a typical cryptographic scenario involves two communicating parties and an adversary attempting to eavesdrop. In the scenarios we consider, the adversary is the same as the recipient of the message, making security guarantees much harder to prove.

Privacy and utility are fundamentally in tension with each other. We can achieve perfect privacy by not releasing any data, but this solution has no utility. Thus, any discussion of privacy measures is incomplete without a corresponding discussion of utility measures. Traditionally, the two concepts have been measured using different yardsticks, and we are now beginning to see attempts to unify the two notions along a common axis of measurement.

A Note on Terminology. Various terms have been used in the literature to describe privacy and privacy loss. Anonymization is a popular term, often used to describe methods like k-anonymity and its successors. Information loss is used by some of the information-theoretic methods, and privacy leakage is another common expression describing the loss of privacy. We will use these terms interchangeably.
4.1.2 Data Anonymization Methods

The measures of anonymity we discuss here are usually defined with respect to a particular data anonymization method. There are three primary methods in use today: random perturbation, generalization and suppression. In what follows, we discuss these methods.

Perhaps the most natural way of anonymizing numerical data is to perturb it. Rather than reporting a value x for an attribute, we report the value $\tilde{x} = x + r$, where r is a random value drawn from an appropriate (usually bias-free) distribution. One must be careful with this approach, however; if the value r is chosen independently each time x is queried, then simple averaging will eliminate its effect. Since introducing bias would affect any statistical analysis one might wish to perform on the data, a preferred method is to fix the perturbations in advance.

If the attribute x has a domain other than $\mathbb{R}$, then perturbation is more complex. As long as the data lies in a continuous metric space (like $\mathbb{R}^d$ for instance), a perturbation is well defined. If the data is categorical, however, other methods, such as deleting items and inserting other, randomly chosen items, must be employed. We will see more of such methods below.

It is often useful to distinguish between two kinds of perturbation. Input perturbation is the process of perturbing the source data itself, and returning correct answers to queries on this perturbed data. Output perturbation, on the other hand, perturbs the answers sent to a query, rather than modifying the input itself.

The other method for anonymizing data is generalization, which is often used in conjunction with suppression. Suppose the data domain possesses a natural hierarchical structure. For example, ZIP codes can be thought of as the leaves of a hierarchy, where 8411∗ is the parent of 84117, 84∗ is an ancestor of 8411∗, and so on. In the presence of such a hierarchy, attributes can be generalized by replacing their values with that of their (common) parent. Again returning to the ZIP code example, ZIP codes of the form 84117, 84118, 84120 might all be replaced by the generic ZIP 841∗. The degree of perturbation can then be measured in terms of the height of the resulting generalization above the leaf values.

Data suppression, very simply, is the omission of data. For example, a set of database tuples might all have ZIP code fields of the form 84117 or 84118, with the exception of a few tuples that have a ZIP code field value of 90210. In this case, the outlier tuples can be suppressed in order to construct a valid and compact generalization. Another way of performing data suppression is to replace a field with a generic identifier for that field. In the above example, the ZIP code field value 90210 might be replaced by a null value ⊥ZIP.

Another method of data anonymization, proposed by Zhang et al. [50], is to permute the data. Given a table consisting of sensitive and identifying attributes, their approach is to permute the projection of the table consisting of the sensitive attributes; the purpose of doing this is to retain the aggregate properties of the table, while destroying the link between identifying and sensitive attributes that could lead to a privacy leakage.
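The averaging attack mentioned above is easy to demonstrate; the fragment below (illustrative values and names only) contrasts a perturbation fixed in advance with naive fresh noise per query:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 50_000.0                       # a sensitive numeric attribute value

# Fixed input perturbation: r is drawn once and stored with the record,
# so repeated queries always see the same perturbed value.
r = rng.uniform(-1_000, 1_000)
x_tilde = x + r

# Naive output perturbation with fresh noise per query: averaging many
# answers recovers x, defeating the perturbation.
answers = [x + rng.uniform(-1_000, 1_000) for _ in range(10_000)]
print(np.mean(answers))            # converges to x as queries accumulate
```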
The other method for anonymizing data is generalization, which is often used in conjunction with suppression. Suppose the data domain possesses a natural hierarchical structure. For example, ZIP codes can be thought of as the leaves of a hierarchy, where 8411∗ is the parent of 84117, and 84∗ is an ancestor of 8411∗, and so on. In the presence of such a hierarchy, attributes can be generalized by replacing their values with that of their (common) parent. Again returning to the ZIP code example, ZIP codes of the form 84117, 84118, 84120 might all be replaced by the generic ZIP 841∗. The degree of perturbation can then be measured in terms of the height of the resulting generalization above the leaf values.

Data suppression, very simply, is the omission of data. For example, a set of database tuples might all have ZIP code fields of the form 84117 or 84118, with the exception of a few tuples that have a ZIP code field value of 90210. In this case, the outlier tuples can be suppressed in order to construct a valid and compact generalization. Another way of performing data suppression is to replace a field with a generic identifier for that field. In the above example, the ZIP code field value of 90210 might be replaced by a null value ⊥_ZIP.

Another method of data anonymization, proposed by Zhang et al. [50], is to permute the data. Given a table consisting of sensitive and identifying attributes, their approach is to permute the projection of the table consisting of the sensitive attributes; the purpose of doing this is to retain the aggregate properties of the table, while destroying the link between identifying and sensitive attributes that could lead to a privacy leakage.

4.1.3 A Classification of Methods

Broadly speaking, methods for measuring privacy can be divided into three distinct categories. Early work on statistical databases measured privacy in terms of the variance of key perturbed variables: the larger the variance, the better the privacy of the perturbed data. We refer to these approaches as statistical methods.

Much of the more recent work on privacy measures starts with the observation that statistical methods are unable to quantify the idea of background information that an adversary may possess. As a consequence, researchers have employed tools from information theory and Bayesian analysis to quantify more precisely notions of information transfer and loss. We will describe these methods under the general heading of probabilistic methods.

Almost in parallel with the development of probabilistic methods, some researchers have attacked the problem of privacy from a computational angle. In short, rather than relying on statistical or probabilistic estimates for the amount of information leaked, these measures start from the idea of a resource-bounded adversary, and measure privacy in terms of the amount of information accessible by such an adversary. This approach is reminiscent of cryptographic approaches, but for the reasons outlined above is substantially more difficult.

An Important Omission: Secure Multiparty Computation. One important technique for preserving data privacy is the approach from cryptography called secure multi-party computation (SMC). The simplest version of this framework is the so-called 'Millionaires' Problem' [49]: two millionaires wish to know who is richer; however, they do not want to inadvertently find out any additional information about each other's wealth. How can they carry out such a conversation?

In general, an SMC scenario is described by N clients, each of whom owns some private data, and a public function f(x_1, ..., x_N) that needs to be computed from the shared data without any of the clients revealing their private information.

Notice that in an SMC setting, the clients are trusted, and do not trust the central server to preserve their information (otherwise they could merely transmit the required data to the server). In all the privacy-preservation settings we will consider in this article, it is the server that is trusted, and queries to the server emanate from untrusted clients. We will not address SMC-based privacy methods further.

4.2 Statistical Measures of Anonymity

4.2.1 Query Restriction

Query restriction was one of the first methods for preserving anonymity in data [22, 25, 21, 40]. For a database of size N, and a fixed parameter k, all queries that returned either fewer than k or more than N − k records were rejected. Query restriction anticipates k-anonymity, in that the method for preserving anonymity is to return a large set of records for any query. Contrast this with data suppression; rather than deleting records, the procedure deletes queries.

It was pointed out later [13, 12, 41, 10] that query restriction could be subverted by requesting a specific sequence of queries, and then combining them using simple Boolean operators, in a construction referred to as a tracker. Thus, this mechanism is not very effective.
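A minimal sketch of the query restriction rule just described; the table contents and the predicates are illustrative assumptions.

```python
def restricted_query(table, predicate, k):
    """Query restriction: answer a selection query only if its result
    set has at least k and at most N - k records; otherwise refuse."""
    n = len(table)
    result = [row for row in table if predicate(row)]
    if len(result) < k or len(result) > n - k:
        raise PermissionError("query refused: result size is too revealing")
    return result

table = [{"zip": 84117, "salary": s} for s in range(20, 80)]  # 60 records
print(len(restricted_query(table, lambda r: r["salary"] >= 50, k=5)))  # 30: allowed
restricted_query(table, lambda r: r["salary"] == 20, k=5)  # raises: only 1 record
```

A tracker attack sidesteps this check by combining several individually permitted queries, which is exactly the weakness noted above.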
4.2.2 Anonymity via Variance

Here, we start with randomly perturbed data x̃ = x + r, as described in Section 4.1.2. Intuitively, the larger the perturbation, the more blurred, and thus more protected, the value is. Thus, we can measure anonymity by measuring the variance of the perturbed data. The larger the variance, the better the guarantee of anonymity, and thus one proposal by Duncan et al. [15] is to lower bound the variance for estimators of sensitive attributes. An alternative approach, used by Agrawal and Srikant [3], is to fix a confidence level and measure the length of the interval of values of the estimator that yields this confidence bound; the longer the interval, the more successful the anonymization.

Under this model, utility can be measured in a variety of ways. The Duncan et al. paper measures utility by combining the perturbation scheme with a query restriction method, and measuring the fraction of queries that are permitted after perturbation. Obviously, the larger the perturbation (measured by the variance σ²), the larger the fraction of queries that return sets of high cardinality. This presents a natural tradeoff between privacy (increased by increasing σ²) and utility (increased by increasing the fraction of permitted queries).

The paper by Agrawal and Srikant implicitly measures utility in terms of how hard it is to reconstruct the original data distribution. They use many iterations of a Bayesian update procedure to perform this reconstruction; however, the reconstruction itself provides no guarantees (in terms of distance to the true data distribution).
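One way to render the confidence-interval measure empirically is sketched below; the sample size and the confidence levels are illustrative assumptions, not choices made in [3].

```python
import random

def interval_privacy(noise_samples, confidence=0.95):
    """Length of the shortest interval containing a `confidence` fraction
    of the additive noise; a longer interval means a better-protected value."""
    xs = sorted(noise_samples)
    m = max(1, int(confidence * len(xs)))  # samples the window must cover
    return min(xs[i + m - 1] - xs[i] for i in range(len(xs) - m + 1))

noise = [random.uniform(-1, 1) for _ in range(10000)]
print(interval_privacy(noise, confidence=1.0))  # close to 2: the full interval
print(interval_privacy(noise, confidence=0.5))  # close to 1
```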
4.2.3 Anonymity via Multiplicity

Perturbation-based privacy works by changing the values of data items. In generalization-based privacy, the idea is to "blur" the data via generalization. The hope here is that the blurred data set will continue to provide the statistical utility that the original data provided, while preventing access to individual tuples.

The measure of privacy here is a combinatorial variant of the length-of-interval measure used in [3]. A database is said to be k-anonymous [42] if there is no query that can extract fewer than k records from it. This is achieved by aggregating tuples along a generalization hierarchy: for example, by aggregating ZIP codes up to the first three digits, and so on.

k-anonymity was first defined in the context of record linkage: can tuples from multiple databases be joined together to infer private information inaccessible from the individual sources? The k-anonymity requirement means such access cannot happen: since no query returns fewer than k records, no query can be used to isolate a single tuple containing the private information. As a method for blocking record linkage, k-anonymity is effective, and much research has gone into optimizing the computations, investigating the intrinsic hardness of computing it, and generalizing it to multiple dimensions.

4.3 Probabilistic Measures of Anonymity

Up to this point, an information leak has been defined as the revealing of specific data in a tuple. Often, though, information can be leaked even if the adversary does not gain access to a specific data item. Such attacks usually rely on knowing aggregate information about the (perturbed) source database, as well as the method of perturbation used when modifying the data.

Suppose we attempt to anonymize an attribute X by perturbing it with a random value chosen uniformly from the interval [−1, 1] (see note 2). Fixing a confidence level of 100%, and using the measure of privacy from [3], we infer that the privacy achieved by this perturbation is 2 (the length of the interval [−1, 1]). Suppose however that a distribution on the values of X is revealed: namely, X takes a value in the range [0, 1] with probability 1/2, and a value in the range [4, 5] with probability 1/2. In this case, no matter what the actual value of X is, an adversary can infer from the perturbed value X̃ which of the two intervals of length 1 the true value of X really lies in, reducing the effective privacy to at most 1.

Incorporating background information changes the focus of anonymity measurements. Rather than measuring the likelihood of some data being released, we now have to measure a far more nebulous quantity: the "amount of new information learned by an adversary" relative to the background. In order to do this, we need more precise notions of information leakage than the variance of a perturbed value.

This analysis applies irrespective of whether we do anonymization based on random perturbation or generalization. We first consider measures of anonymization that are based on perturbation schemes, following this with an examination of measures based on generalization. In both settings, the measures are probabilistic: they compute functions of distributions defined on the data.

4.3.1 Measures Based on Random Perturbation

Using Mutual Information. The paper by Agrawal and Aggarwal [2] proposes the use of mutual information to measure leaked information. We can use the entropy H(A) to encode the amount of uncertainty (and therefore the degree of privacy) in a random variable A. H(A|B), the conditional entropy of A given B, can be interpreted as the amount of privacy "left" in A after B is revealed. Since entropy is usually expressed in terms of bits of information, we will use the expression 2^{H(A)} to represent the measure of privacy in A. Using this measure, the fraction of privacy leaked by an adversary who knows B can be written as

P(A|B) = 1 − 2^{H(A|B)} / 2^{H(A)} = 1 − 2^{−I(A;B)}

where I(A;B) = H(A) − H(A|B) is the mutual information between the random variables A and B.

They also develop a notion of utility measured by the statistical distance between the source distribution of data and the perturbed distribution. They also demonstrate an EM-based method for reconstructing the maximum likelihood estimate of the source distribution, and show that it converges to the correct answer (they do not address the issue of rate of convergence).
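A small sketch of this measure, computing I(A;B) and then the leaked fraction from a joint distribution given as a dictionary; the toy distributions are illustrative assumptions.

```python
from math import log2

def mutual_information(joint):
    """I(A;B) in bits, from a joint distribution {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def privacy_leaked(joint):
    """Fraction of the privacy of A leaked by revealing B:
    P(A|B) = 1 - 2**(-I(A;B))."""
    return 1.0 - 2 ** (-mutual_information(joint))

# Perfect correlation: I(A;B) = H(A) = 1 bit, so half of the 2**H(A) measure is lost.
print(privacy_leaked({(0, 0): 0.5, (1, 1): 0.5}))  # 0.5
# Independence: I(A;B) = 0, so nothing is leaked.
print(privacy_leaked({(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}))  # 0.0
```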
Handling Categorical Values. The above schemes rely on the source data being numerical. For data mining applications, the relevant source data is usually categorical, consisting of collections of transactions, each transaction defined as a set of items. For example, in the typical market-basket setting, a transaction consists of a set of items purchased by a customer.

Such sets are typically represented by binary characteristic vectors. The elementary datum that requires anonymity is membership: does item i belong to transaction t? The questions requiring utility, on the other hand, are of the form, "which patterns have reasonable support and confidence?" In such a setting, the only possible perturbation is to flip an item's membership in a transaction, but not so often as to change the answers to questions about patterns in any significant way.

There are two ways of measuring privacy in this setting. The approach taken by Evfimievski et al. [20] is to evaluate whether an anonymization scheme leaves clues for an adversary with high probability. Specifically, they define a privacy breach as one in which the probability of some property of the input data is high, conditioned on the output perturbed data having certain properties.

Definition 4.3.1. An itemset A causes a privacy breach of level ρ if for some item a ∈ A and some i ∈ 1 ... N we have P[a ∈ t_i | A ⊆ t′_i] ≥ ρ.

Here, the event "A ⊆ t′_i" is leaking information about the event "a ∈ t_i". Note that this measure is absolute, regardless of what the prior probability of a ∈ t_i might have been. The perturbation method is based on randomly sampling some items of the transaction t_i to keep, and buffering with elements a ∉ t_i chosen at random.

The second approach, taken by Rizvi and Haritsa [38], is to measure privacy in terms of the probability of correctly reconstructing the original bit, given a perturbed bit. This can be calculated using Bayes' theorem, and is parameterized by the probability of flipping a bit (which they set to a constant p). Privacy is then achieved by setting p to a value that minimizes the reconstruction probability; the authors show that a wide range of values for p yields acceptable privacy thresholds.
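A minimal sketch of this reconstruction calculation under the stated bit-flipping model; the prior and the flip probabilities below are illustrative assumptions.

```python
def reconstruction_probability(prior_one, p):
    """Probability that an adversary guessing the more likely value
    reconstructs the original bit, when each bit was flipped with
    probability p and P[bit = 1] = prior_one (Bayes' theorem)."""
    q1 = prior_one * (1 - p) + (1 - prior_one) * p   # P[observe 1]
    q0 = 1 - q1
    post1 = prior_one * (1 - p) / q1                 # P[orig = 1 | obs = 1]
    post0 = (1 - prior_one) * (1 - p) / q0           # P[orig = 0 | obs = 0]
    return q1 * max(post1, 1 - post1) + q0 * max(post0, 1 - post0)

for p in (0.1, 0.3, 0.5):
    print(p, reconstruction_probability(prior_one=0.5, p=p))
# 0.9, 0.7, 0.5: pushing p toward 1/2 lowers the adversary's accuracy
```

Note that for a heavily skewed prior the reconstruction probability stays high regardless of p, which anticipates the criticism discussed next.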
Both papers then frame utility as the problem of reconstructing itemset frequencies accurately. [20] establishes the tradeoff between utility and privacy more precisely, in terms of the probabilities

p[l → l′] = P[#(t′ ∩ A) = l′ | #(t ∩ A) = l],

where t′ denotes the perturbed transaction. For privacy, we have to ensure that (for example) if we fix an element a ∈ t, then the set of tuples t′ that do contain a are not overly represented in the modified itemset. Specifically, in terms of an average over the size of tuple sets returned, we obtain a condition on the p[l → l′]. In essence, the probabilities p[l → l′] encode the tradeoff between utility (or ease of reconstruction) and privacy.

Measuring Transfer of Information. Both the above papers have the same weakness that plagued the original statistics-based anonymization works: they ignore the problem of the background knowledge attack. A related, and yet subtly different, problem is that ignoring the source data distribution may yield meaningless results. For example, suppose the probability of an item occurring in any particular transaction is very high. Then the probability of reconstructing its value correctly is also high, but this would not ordinarily be viewed as a leak of information.

A more informative approach would be to measure the level of "surprise": namely, whether the probability P[a ∈ t_i] increases (or decreases) dramatically, conditioned on seeing the event A ⊆ t′_i. Notice that this idea is the motivation for [2]; in their paper, the mutual information I(A;B) measures the transfer of information between the source and anonymized data.

Evfimievski et al. [19], in a followup to [20], develop a slightly different notion of information transfer, motivated by the idea that mutual information is an "averaged" measure and that for privacy preservation, worst-case bounds are more relevant. Formally, information leakage is measured by estimating the change in probability of a property from source to distorted data. For example, given a property Q(X) of the data, they say that there is a privacy breach after perturbing the data by function R(X) if for some y,

P[Q(X)] ≤ ρ1,  P[Q(X) | R(X) = y] ≥ ρ2

where ρ1 < ρ2. However, ensuring that this property holds is computationally intensive. The authors show that a sufficient condition for guaranteeing no (ρ1, ρ2) privacy breach is to bound the ratio of the probabilities of two different x_i being mapped to a particular y. Formally, they propose perturbation schemes such that

p[x1 → y] / p[x2 → y] ≤ γ.

Intuitively, this means that if we look back from y, there is no easy way of telling whether the source was x1 or x2. The formal relation to (ρ1, ρ2)-privacy is established via this intuition. Formally, we can rewrite

I(X;Y) = Σ_y p(y) · KL(p(X | Y = y) ‖ p(X)).

The function KL(p(X | Y = y) ‖ p(X)) measures the transfer distance; it asks how different the induced distribution p(X | Y = y) is from the source distribution p(X). The greater the difference, the greater the privacy breach. The authors propose replacing the averaging in the above expression by a max, yielding a modified notion

I_w(X;Y) = max_y p(y) · KL(p(X | Y = y) ‖ p(X)).

They then show that a (ρ1, ρ2)-privacy breach yields a lower bound on the worst-case mutual information I_w(X;Y), which is what we would expect.

More General Perturbation Schemes. All of the above described perturbation schemes are local: perturbations are applied independently to data items. Kargupta et al. [27] showed that the lack of correlation between perturbations can be used to attack such a privacy-preserving mechanism. Their key idea is a spectral filtering method based on computing principal components of the data transformation matrix.

Their results suggest that for more effective privacy preservation, one should consider more general perturbation schemes. It is not hard to see that a natural generalization of these perturbation schemes is a Markov-chain based approach, where an item x is perturbed to item y based on a transition probability p(y|x). FRAPP [4] is one such scheme based on this idea. The authors show that they can express the notion of a (ρ1, ρ2)-privacy breach in terms of properties of the Markov transition matrix. Moreover, they can express the utility of this scheme in terms of the condition number of the transition matrix.
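A small sketch of checking the ratio bound above on a Markov transition matrix, using a "gamma-diagonal" matrix in the spirit of FRAPP; the matrix construction is my rendering of the idea rather than the paper's exact scheme, and the parameters are illustrative.

```python
def amplification(transition):
    """Worst-case ratio p[x1 -> y] / p[x2 -> y] over all outputs y, for a
    row-stochastic matrix transition[x][y] = p(y | x); the scheme avoids a
    (rho1, rho2) breach when this value is suitably bounded by gamma."""
    n = len(transition)
    worst = 1.0
    for y in range(len(transition[0])):
        col = [transition[x][y] for x in range(n)]
        worst = max(worst, max(col) / min(col))
    return worst

def gamma_diagonal(n, gamma):
    """A FRAPP-style matrix: keep the true value with probability gamma * x
    and move to any other value with probability x, x = 1 / (gamma + n - 1)."""
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if i == j else x for j in range(n)] for i in range(n)]

print(round(amplification(gamma_diagonal(5, gamma=3.0)), 6))  # 3.0 by construction
```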
4.3.2 Measures Based on Generalization

It is possible to mount a 'background knowledge' attack on k-anonymity. For example, it is possible that all the k records returned from a particular query share the same value of some attribute. Knowing that the desired tuple is one of the k tuples, we have thus extracted a value from this tuple without needing to isolate it.

The first approach to address this problem was the work on ℓ-diversity [32]. Here, the authors start with the now-familiar idea that the privacy measure should capture the change in the adversary's world-view upon seeing the data. However, they execute this idea with an approach that is absolute. They require that the distribution of sensitive values in an aggregate have high entropy (at least log ℓ). This subsumes k-anonymity, since we can think of the probability of leakage of a single tuple in k-anonymity as 1/k, and so the "entropy" of the aggregate is log k. Starting with this idea, they introduce variants of ℓ-diversity that are more relaxed about disclosure, or allow one to distinguish between positive and negative disclosure, or even allow for multi-attribute disclosure measurement.

Concurrently published, the work on p-sensitive k-anonymity [43] attempts to do the same thing, but in a more limited way, by requiring at least p distinct sensitive values in each generalization block, instead of using entropy. A variant of this idea was proposed by Wong et al. [47]; in their scheme, termed (α, k)-anonymity, the additional constraint imposed on the generalization is that the fractional frequency of each value in a generalization is no more than α. Note that this approach automatically lower bounds the entropy of the generalization by log(1/α).

Machanavajjhala et al. [32] make the point that it is difficult to model the adversary's background knowledge; they use this argument to justify the ℓ-diversity measure. One way to address this problem is to assume that the adversary has access to global statistics of the sensitive attribute in question. In this case, the goal is to make the sensitive attribute "blend in": its distribution in the generalization should mimic its distribution in the source data.

This is the approach taken by Li, Li and the author [31]. They define a measure called t-closeness that requires that the "distance" between the distribution of a sensitive attribute in the generalized and original tables is at most t. A natural distance measure to use would be the KL-distance from the generalized to the source distribution. However, for numerical attributes, the notion of closeness must incorporate the notion of a metric on the attribute. For example, suppose that a salary field in a table is generalized to have three distinct values (20000, 21000, 22000). One might reasonably argue that this generalization leaks more information than a generalization that has the three distinct values (20000, 50000, 80000).

Computing the distance between two distributions where the underlying domains inhabit a metric space can be performed using the metric known as the earth-mover distance [39], or the Monge-Kantorovich transportation distance [24]. Formally, suppose we have two distributions p, q defined over the elements of a metric space (X, d). Then the earth-mover distance between p and q is

d_E(p, q) = inf_{P[x′|x]} Σ_{x,x′} d(x, x′) · P[x′|x] · p(x)

subject to the constraint Σ_x P[x′|x] p(x) = q(x′).

Intuitively, this distance is defined as the value that minimizes the transportation cost of transforming one distribution to the other, where transportation cost is measured in terms of the distance in the underlying metric space. Note that since any underlying metric can be used, this approach can integrate numerical and categorical attributes, by imposing any suitable metric (based on domain generalization or other methods) on the categorical attributes.
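For a numerical attribute, the one-dimensional earth-mover distance reduces to accumulating the difference between the two cumulative distributions; a minimal sketch follows, with the salary domain of the example above reused as illustrative data (normalization conventions vary and are omitted here).

```python
def earth_mover_distance(p, q, values):
    """Earth-mover distance between distributions p and q over the same
    ordered numerical domain `values`, with ground distance |v_i - v_j|.
    In one dimension this is the area between the two CDFs."""
    emd, carried = 0.0, 0.0
    for i in range(len(values) - 1):
        carried += p[i] - q[i]                 # mass still to be moved right
        emd += abs(carried) * (values[i + 1] - values[i])
    return emd

salaries = [20000, 50000, 80000]
table_dist = [1 / 3, 1 / 3, 1 / 3]   # sensitive attribute in the whole table
block_dist = [1.0, 0.0, 0.0]         # ... and inside one generalized block
print(earth_mover_distance(block_dist, table_dist, salaries))  # 30000.0
```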
The idea of extending the notion of diversity to numerical attributes was also considered by Zhang et al. [50]. In this paper, the notion of distance for numerical attributes is extended in a different way: the goal for the k-anonymous blocks is that the "diameter" of the range of sensitive attributes is larger than a parameter e. Such a generalization is said to be (k, e)-anonymous. Note that this condition can make utility harder to achieve. If we relate this to the ℓ-diversity condition of having at least ℓ distinct values, this represents a natural generalization of the approach. As stated, however, the approach appears to require defining a total order on the domain of the attribute; this would prevent it from being used for higher-dimensional attribute sets.

Another interesting feature of the Zhang et al. method is that it considers the down-stream problem of answering aggregate queries on an anonymized database, and argues that rather than performing generalization, it might be better to perform a permutation of the data. They show that this permutation-based anonymization can answer aggregate queries more accurately than generalization-based anonymization.

Anonymizing Inferences. In all of the above measures, the data being protected is an attribute of a record, or some distributional characteristic of the data. Another approach to anonymization is to protect the possible inferences that can be made from the data; this is akin to the approach taken by Evfimievski et al. [19, 20] for perturbation-based privacy. Wang et al. [45] investigate this idea in the context of generalization and suppression. A privacy template is an inference on the data, coupled with a confidence bound, and the requirement is that in the anonymized data, this inference not be valid with a confidence larger than the provided bound. In their paper, they present a scheme based on data suppression (equivalent to using a unit-height generalization hierarchy) to ensure that a given set of privacy templates can be preserved.

Clustering as k-anonymity. Viewing attributes as elements of a metric space and defining privacy accordingly has not been studied extensively. However, from the perspective of generalization, many papers ([7, 30, 35]) have pointed out that generalization along a domain generalization hierarchy is only one way of aggregating data. In fact, if we endow the attribute space with a metric, then the process of generalization can be viewed in general as a clustering problem on this metric space, where the appropriate measure of anonymity is applied to each cluster, rather than to each generalized group.

Such an approach has the advantage of placing different kinds of attributes on an equal footing. When anonymizing categorical attributes, generalization proceeds along a generalization hierarchy, which can be interpreted as defining a tree metric. Numerical attributes are generalized along ranges, and t-closeness works with attributes in a general metric space. By lifting all such attributes to a general metric space, generalization can happen in a uniform manner, measured in terms of the diameters of the clusters. Strictly speaking, these methods do not introduce a new notion of privacy; however, they do extend the applicability of generalization-based privacy measures like k-anonymity and its successors.

Measuring utility in generalization-based anonymity. The original k-anonymity work defines the utility of a generalized table as follows. Each cell is the result of generalizing an attribute up a certain number of levels in a generalization hierarchy. In normalized form, the "height" of a generalization ranges from 0 if the original value is used, to 1 if a completely generalized value is used (in the scheme proposed, a value of 1 corresponds to value suppression, since that is the top level of all hierarchies). The precision of a generalization scheme is then 1 minus the average height of a generalization (measured over all cells). The precision is 1 if there is no generalization and 0 if all values are generalized.

Bayardo and Agrawal [5] define a different utility measure for k-anonymity. In their view, a tuple that inhabits a generalized equivalence class E of size |E| = j, j ≥ k, incurs a "cost" of j. A tuple that is suppressed entirely incurs a cost of D, where D is the size of the entire database. Thus, the cost incurred by an anonymization is given by

C = Σ_{E : |E| ≥ k} |E|² + Σ_{E : |E| < k} D · |E|,

where the first sum ranges over the equivalence classes that are released and the second over the classes whose tuples are suppressed.
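A minimal sketch of this cost computation as reconstructed above; the class sizes, database size, and k are illustrative assumptions.

```python
def anonymization_cost(class_sizes, db_size, k):
    """Cost in the style of Bayardo and Agrawal: a tuple in a released
    equivalence class E (|E| >= k) costs |E|, giving |E|**2 per class;
    a tuple in a class smaller than k is suppressed and costs db_size."""
    return sum(s * s if s >= k else db_size * s for s in class_sizes)

# Four equivalence classes in a 42-tuple table; the two classes smaller
# than k = 4 are suppressed outright and dominate the cost.
print(anonymization_cost([3, 12, 25, 2], db_size=42, k=4))  # 979
```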
4.4 Computational Measures of Anonymity

The computational approach, initiated by Dinur and Nissim [14], models the database as a string of n bits; a query is a subset of the bit positions, and the answer to a query is the number of ones among the selected bits, perturbed by at most an additive error E. Privacy fails in this framework if, for a given magnitude of perturbation, there is some adversary running in time t(n) that can succeed with high probability in reconstructing a large fraction of the database.

In this model, adversaries are surprisingly strong. The authors show that even with almost-linear perturbation, an adversary permitted to run in exponential time can break privacy. Restricting the adversary to run in polynomial time helps, but only slightly; any perturbation E = o(√n) is not enough to preserve privacy, and this is tight.

Feasibility results are hard to prove in this model: as the authors point out, an adversary, with one query, can distinguish between the databases 1^n and 0^n if it has background knowledge that these are the only two choices. A perturbation of n/2 would be needed to hide the database contents. One way of circumventing this is to assume that the database itself is generated from some distribution, and that the adversary is required to reveal the value of a specific bit (say, the ith bit) after making an arbitrary number of queries, and after being given all bits of the database except the ith bit.

In this setting, privacy is defined as the condition that the adversary's reconstruction probability is at most 1/2 + δ. In this setting, they show that a T(n)-perturbed database is private against all adversaries that run in time T(n).

Measuring Anonymity via Information Transfer. As before, in the case of probabilistic methods, we can reformulate the anonymity question in terms of information transfer: how much does the probability of a bit being 1 (or 0) change upon anonymization?

Dwork and Nissim [18] explore this idea in the context of computationally bounded adversaries. Starting with a database d represented as a Boolean matrix and drawn from a distribution D, we can define the prior probability p^0_ij = P[d_ij = 1]. Once an adversary asks T queries to the anonymized database as above, and all other values of the database are provided, we can now define the posterior probability p^T_ij of d_ij taking the value 1. The change in belief can be quantified by the expression

Δ = |c(p^T_ij) − c(p^0_ij)|, where c(x) = log(x/(1 − x))

is a monotonically increasing function of x.

This is the simplified version of their formulation. In general, we can replace the event d_ij = 1 by the more general f(d_i1, d_i2, ..., d_ik) = 1, where f is some k-ary Boolean function. All the above definitions translate to this more general setting. We can now define (δ, T(n))-privacy as the condition that for all distributions over databases, all functions f, and all adversaries making T queries, the probability that the maximum change of belief is more than δ is negligibly small.

As with [14], the authors show a natural tradeoff between the degree of perturbation needed and the level of privacy achieved. Specifically, the authors show that a previously proposed algorithm, SuLQ [6], achieves (δ, T(n))-privacy with a perturbation E = O(√(T(n))/δ). They then go on to show that under such conditions, it is possible to perform efficient and accurate data mining on the anonymized database to estimate probabilities of the form P[β|α], where α and β are two attributes.
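A short sketch of the change-of-belief computation above; the prior and posterior values are illustrative (natural logarithm is used, since the base only rescales the measure).

```python
from math import log

def belief_change(prior, posterior):
    """Delta = |c(p_T) - c(p_0)| with c(x) = log(x / (1 - x)),
    quantifying how far the adversary's belief moved."""
    c = lambda x: log(x / (1.0 - x))
    return abs(c(posterior) - c(prior))

print(belief_change(0.5, 0.9))   # the answers moved the adversary's belief a lot
print(belief_change(0.5, 0.51))  # ... or hardly at all
```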
Indistinguishability. Although the above measures of privacy develop precise notions of information transfer with respect to a bounded adversary, they still require some notion of a distribution on the input databases, as well as a specific protocol followed by an adversary. To abstract the ideas underlying privacy further, Dwork et al. [17] formulate a definition of privacy inspired by Dalenius [16]: a database is private if anything learnable from it can be learned in the absence of the database.

In order to do this, they distinguish between non-interactive privacy mechanisms, where the data publisher anonymizes the data and publishes it (input perturbation), and interactive mechanisms, in which the outputs to queries are perturbed (output perturbation). Dwork [16] shows that in a non-interactive setting, it is impossible to achieve privacy under this definition; in other words, it is always possible to design an adversary and an auxiliary-information generator such that the adversary, combining the anonymized data and the auxiliary information, can effect a privacy breach far more often than an adversary lacking access to the database can.

In the interactive setting, we can think of the interaction between the database and the adversary as a transcript. The idea of indistinguishability is that if two databases are very similar, then their transcripts with respect to an adversary should also be similar. Intuitively, this means that if an individual adds their data to a database (causing a small change), the nominal loss in privacy is very small.

The main consequence of this formulation is that it is possible to design perturbation schemes that depend only on the query functions and the error terms, and are independent of the database. Informally, the amount of perturbation required depends on the sensitivity of the query functions: the more the function can change when one input is perturbed slightly, the more perturbation the database must incur. The details of these procedures are quite technical; the reader is referred to [16, 17] for more details.
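The scheme of [17] calibrates the noise added to a query answer to the query's sensitivity. The sketch below is a common rendering of that idea rather than the authors' exact procedure: it perturbs a counting query, whose sensitivity is 1, with Laplace noise of scale 1/ε; the database, predicate, and ε value are illustrative assumptions.

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(database, predicate, epsilon):
    """Output perturbation for a counting query. Changing one record
    changes a count by at most 1, so the query's sensitivity is 1 and
    noise of scale 1/epsilon is added to the true answer."""
    true_count = sum(1 for row in database if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon)

db = [{"age": 34}, {"age": 51}, {"age": 29}]
print(noisy_count(db, lambda r: r["age"] > 30, epsilon=0.5))
```

A more sensitive query (say, a sum over an unbounded attribute) would require proportionally larger noise, which is exactly the calibration described above.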
4.4.1 Anonymity via Isolation

Another approach to anonymization is taken by [8, 9]. The underlying principle here is isolation: a record is private if it cannot be singled out from its neighbors. Formally, they define an adversary as an algorithm that takes an anonymized database and some auxiliary information, and outputs a single point q. The adversary succeeds if a small ball around q does not contain too many points of the database. In this sense, the adversary has isolated some points of the database (see note 3).

Under this definition of a privacy breach, they then develop methods for anonymizing a database. Like the papers above, they use a differential model of privacy: an anonymization is successful if the adversary, combining the anonymization with auxiliary information, can do no better at isolation than a weaker adversary with no access to the anonymized data.

One technical problem with the idea of isolation, which the authors acknowledge, is that it can be attacked in the same way that methods like k-anonymity are attacked. If the anonymization causes many points with similar characteristics to cluster together, then even though the adversary cannot isolate a single point, it can determine some special characteristics of the data from the clustering that might not have otherwise been inferred.

4.5 Conclusions and New Directions

The evolution of measures of privacy, irrespective of the specific method of perturbation or class of measure, has proceeded along a standard path. The earliest measures are absolute in nature, defining an intuitive notion of privacy in terms of a measure of obfuscation. Further development occurs when the notion of background information is brought in, and this culminates in the idea of a change in adversarial information before and after the anonymized data is presented.

From the perspective of theoretical rigor, computational approaches to privacy are the most attractive. They rely on few to no modelling assumptions about adversaries, and their cryptographic flavor reinforces our belief in their overall reliability as measures of privacy. Although the actual privacy preservation methods proposed in this space are fairly simple, they do work from very simple models of the underlying database, and one question that so far remains unanswered is the degree to which these methods can be made practically effective when dealing with the intricacies of actual databases.

The most extensive attention has been paid to the probabilistic approaches to privacy measurements. k-anonymity and its successors have inspired numerous works that study not only variants of the basic measures, but systems for managing privacy, extensions to higher-dimensional spaces, as well as better methods for publishing data tables. The challenge in dealing with methods deriving from k-anonymity is the veritable alphabet soup of approaches that have been proposed, all varying subtly in the nature of the assumptions used. The work by Wong et al. [46] illustrates the subtleties of modelling background information; their m-confidentiality measure attempts to model adversaries who exploit the desire of k-anonymizing schemes to generate a minimal anonymization. This kind of background information is very hard to formalize and argue rigorously about, even when we consider the general framework for analyzing background information proposed by Martin et al. [33].

4.5.1 New Directions

There are two recent directions in the area of privacy preservation measures that are quite interesting and merit further study. The first addresses the problem noted earlier: the imbalance in the study of utility versus privacy. The computational approaches to privacy preservation, starting with the work of Dinur and Nissim [14], provide formal tradeoffs between utility and privacy for bounded adversaries. The work of Kifer et al. [28] on injecting utility into privacy-preservation allows for a more general measure of utility as a distance between distributions, and Rastogi et al. [37] examine the tradeoff between privacy and utility rigorously in the perturbation framework.

With a few exceptions, all of the above measures of privacy are global: they assume a worst-case (or average-case) measure of privacy over the entire input, or prove privacy guarantees that are independent of the specific instance of a database being anonymized. It is therefore natural to consider personalized privacy, where the privacy guarantee need only be accurate with respect to the specific instance being considered, or can be tuned depending on auxiliary inputs. The technique for anonymizing inferences developed in [45] can be viewed as such a scheme: the set of inferences needing protection are supplied as part of the input, and other inferences need not be protected. In the context of k-anonymity, Xiao and Tao [48] propose a technique that takes as input user preferences about the level of generalization they desire for their sensitive attributes, and adapts the k-anonymity method to satisfy these preferences. The work on worst-case background information modelling by Martin et al. [33] assumes that the specific background knowledge possessed by an adversary is an input to the privacy-preservation algorithm. Recent work by Nissim et al.
[36] revisits the indistinguishability measure [17] (which is oblivious of the specific database instance) by designing an instance-based property of the query function that they use to anonymize a given database.

Notes

1. ...and the expertise of the author!
2. This example is taken from [2].
3. This bears a strong resemblance to k-anonymity, but is more general.

References

[1] Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, April 15-20, 2007, The Marmara Hotel, Istanbul, Turkey (2007), IEEE.
[2] AGRAWAL, D., AND AGGARWAL, C. C. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Santa Barbara, CA, 2001), pp. 247-255.
[3] AGRAWAL, R., AND SRIKANT, R. Privacy preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data (Dallas, TX, May 2000), pp. 439-450.
[4] AGRAWAL, S., AND HARITSA, J. R. FRAPP: A framework for high-accuracy privacy-preserving mining. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering (Washington, DC, USA, 2005), IEEE Computer Society, pp. 193-204.
[5] BAYARDO, JR., R. J., AND AGRAWAL, R. Data privacy through optimal k-anonymization. In ICDE (2005), IEEE Computer Society, pp. 217-228.
[6] BLUM, A., DWORK, C., MCSHERRY, F., AND NISSIM, K. Practical privacy: the SuLQ framework. In PODS '05: Proceedings of the Twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (New York, NY, USA, 2005), ACM Press, pp. 128-138.
[7] BYUN, J.-W., KAMRA, A., BERTINO, E., AND LI, N. Efficient k-anonymization using clustering techniques. In DASFAA (2007), K. Ramamohanarao, P. R. Krishna, M. K. Mohania, and E. Nantajeewarawat, Eds., vol. 4443 of Lecture Notes in Computer Science, Springer, pp. 188-200.
[8] CHAWLA, S., DWORK, C., MCSHERRY, F., SMITH, A., AND WEE, H. Toward privacy in public databases. In TCC (2005), J. Kilian, Ed., vol. 3378 of Lecture Notes in Computer Science, Springer, pp. 363-385.
[9] CHAWLA, S., DWORK, C., MCSHERRY, F., AND TALWAR, K. On privacy-preserving histograms. In UAI (2005), AUAI Press.
[10] DE JONGE, W. Compromising statistical databases responding to queries about means. ACM Trans. Database Syst. 8, 1 (1983), 60-80.
[11] DECEW, J. Privacy. In The Stanford Encyclopedia of Philosophy, E. N. Zalta, Ed. Fall 2006.
[12] DENNING, D. E., DENNING, P. J., AND SCHWARTZ, M. D. The tracker: A threat to statistical database security. ACM Trans. Database Syst. 4, 1 (1979), 76-96.
[13] DENNING, D. E., AND SCHLÖRER, J. A fast procedure for finding a tracker in a statistical database. ACM Trans. Database Syst. 5, 1 (1980), 88-102.
[14] DINUR, I., AND NISSIM, K. Revealing information while preserving privacy. In PODS '03: Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (New York, NY, USA, 2003), ACM Press, pp. 202-210.
[15] DUNCAN, G. T., AND MUKHERJEE, S. Optimal disclosure limitation strategy in statistical databases: Deterring tracker attacks through additive noise. Journal of the American Statistical Association 95, 451 (2000), 720.
[16] DWORK, C. Differential privacy. In Proc. 33rd Intl. Conf. Automata, Languages and Programming (ICALP) (2006), pp. 1-12. Invited paper.
[17] DWORK, C., MCSHERRY, F., NISSIM, K., AND SMITH, A. Calibrating noise to sensitivity in private data analysis. In TCC (2006), S. Halevi and T. Rabin, Eds., vol. 3876 of Lecture Notes in Computer Science, Springer, pp. 265-284.
[18] DWORK, C., AND NISSIM, K. Privacy-preserving datamining on vertically partitioned databases. In CRYPTO (2004), M. K. Franklin, Ed., vol. 3152 of Lecture Notes in Computer Science, Springer, pp. 528-544.
[19] EVFIMIEVSKI, A., GEHRKE, J., AND SRIKANT, R. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the ACM SIGMOD/PODS Conference (San Diego, CA, June 2003), pp. 211-222.
[20] EVFIMIEVSKI, A., SRIKANT, R., AGRAWAL, R., AND GEHRKE, J. Privacy preserving mining of association rules. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2002), ACM Press, pp. 217-228.
[21] FELLEGI, I. P. On the question of statistical confidentiality. J. Am. Stat. Assoc. 67, 337 (1972), 7-18.
[22] FRIEDMAN, A. D., AND HOFFMAN, L. J. Towards a fail-safe approach to secure databases. In Proc. IEEE Symp. Security and Privacy (1980).
[23] GAVISON, R. Privacy and the limits of the law. The Yale Law Journal 89, 3 (January 1980), 421-471.
[24] GIVENS, C. R., AND SHORTT, R. M. A class of Wasserstein metrics for probability distributions. Michigan Math. J. 31 (1984), 231-240.
[25] HOFFMAN, L. J., AND MILLER, W. F. Getting a personal dossier from a statistical data bank. Datamation 16, 5 (1970), 74-75.
[26] IYENGAR, V. S. Transforming data to satisfy privacy constraints. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2002), ACM Press, pp. 279-288.
[27] KARGUPTA, H., DATTA, S., WANG, Q., AND SIVAKUMAR, K. On the privacy preserving properties of random data perturbation techniques. In Proceedings of the IEEE International Conference on Data Mining (Melbourne, FL, November 2003), p. 99.
[28] KIFER, D., AND GEHRKE, J. Injecting utility into anonymized datasets. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2006), ACM Press, pp. 217-228.
[29] KOCH, C., GEHRKE, J., GAROFALAKIS, M. N., SRIVASTAVA, D., ABERER, K., DESHPANDE, A., FLORESCU, D., CHAN, C. Y., GANTI, V., KANNE, C.-C., KLAS, W., AND NEUHOLD, E. J., Eds. Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007 (2007), ACM.
[30] LEFEVRE, K., DEWITT, D. J., AND RAMAKRISHNAN, R. Mondrian multidimensional k-anonymity. In ICDE '06: Proceedings of the 22nd International Conference on Data Engineering (Washington, DC, USA, 2006), IEEE Computer Society, p. 25.
[31] LI, N., LI, T., AND VENKATASUBRAMANIAN, S. t-closeness: Privacy beyond k-anonymity and ℓ-diversity. In IEEE International Conference on Data Engineering (ICDE 2007).
[32] MACHANAVAJJHALA, A., GEHRKE, J., KIFER, D., AND VENKITASUBRAMANIAM, M. ℓ-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) (2006), p. 24.
[33] MARTIN, D. J., KIFER, D., MACHANAVAJJHALA, A., GEHRKE, J., AND HALPERN, J. Y. Worst-case background knowledge for privacy-preserving data publishing. In ICDE [1], pp. 126-135.
[34] NAKASHIMA, E. AOL search queries open window onto users' worlds. The Washington Post (August 17, 2006).
[35] NERGIZ, M. E., AND CLIFTON, C. Thoughts on k-anonymization. In ICDE Workshops (2006), R. S. Barga and X. Zhou, Eds., IEEE Computer Society, p. 96.
[36] NISSIM, K., RASKHODNIKOVA, S., AND SMITH, A. Smooth sensitivity and sampling in private data analysis. In STOC '07: Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing (New York, NY, USA, 2007), ACM Press, pp. 75-84.
[37] RASTOGI, V., HONG, S., AND SUCIU, D. The boundary between privacy and utility in data publishing. In Koch et al. [29], pp. 531-542.
[38] RIZVI, S. J., AND HARITSA, J. R. Maintaining data privacy in association rule mining. In VLDB '2002: Proceedings of the 28th International Conference on Very Large Data Bases (2002), VLDB Endowment, pp. 682-693.
[39] RUBNER, Y., TOMASI, C., AND GUIBAS, L. J. The earth mover's distance as a metric for image retrieval. Int. J. Comput. Vision 40, 2 (2000), 99-121.
[40] SCHLÖRER, J. Identification and retrieval of personal records from a statistical data bank. Methods Info. Med. 14, 1 (1975), 7-13.
[41] SCHWARTZ, M. D., DENNING, D. E., AND DENNING, P. J. Linear queries in statistical databases. ACM Trans. Database Syst. 4, 2 (1979), 156-167.
[42] SWEENEY, L. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10, 5 (2002), 571-588.
[43] TRUTA, T. M., AND VINAY, B. Privacy protection: p-sensitive k-anonymity property. In ICDEW '06: Proceedings of the 22nd International Conference on Data Engineering Workshops (Washington, DC, USA, 2006), IEEE Computer Society, p. 94.
[44] U.S. DEPARTMENT OF HEALTH AND HUMAN SERVICES. Office for Civil Rights - HIPAA. http://www.hhs.gov/ocr/hipaa/.
[45] WANG, K., FUNG, B. C. M., AND YU, P. S. Handicapping attacker's confidence: an alternative to k-anonymization. Knowl. Inf. Syst. 11, 3 (2007), 345-368.
[46] WONG, R. C.-W., FU, A. W.-C., WANG, K., AND PEI, J. Minimality attack in privacy preserving data publishing. In Koch et al. [29], pp. 543-554.
[47] WONG, R. C.-W., LI, J., FU, A. W.-C., AND WANG, K. (α, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2006), ACM Press, pp. 754-759.
[48] XIAO, X., AND TAO, Y. Personalized privacy preservation. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2006), ACM Press, pp. 229-240.
[49] YAO, A. C. Protocols for secure computations. In Proc. IEEE Foundations of Computer Science (1982), pp. 160-164.
[50] ZHANG, Q., KOUDAS, N., SRIVASTAVA, D., AND YU, T. Aggregate query answering on anonymized tables. In ICDE [1], pp. 116-125.

Chapter 5

k-Anonymous Data Mining: A Survey

V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati
DTI - Università degli Studi di Milano
26013 Crema - Italy
{ciriani, decapita, foresti, samarati}@dti.unimi.it

Abstract: Data mining technology has attracted significant interest as a means of identifying patterns and trends from large collections of data. It is however evident that the collection and analysis of data that include personal information may violate the privacy of the individuals to whom information refers. Privacy protection in data mining is then becoming a crucial issue that has captured the attention of many researchers. In this chapter, we first describe the concept of k-anonymity and illustrate different approaches for its enforcement.
We then discuss how the privacy requirements characterized by k-anonymity can be violated in data mining and introduce possible approaches to ensure the satisfaction of k-anonymity in data mining.

Keywords: k-anonymity, data mining, privacy.

5.1 Introduction

The amount of data being collected every day by private and public organizations is quickly increasing. In such a scenario, data mining techniques are becoming more and more important for assisting decision-making processes and, more generally, for extracting hidden knowledge from massive data collections in the form of patterns, models, and trends that hold in the data collections. While not explicitly containing the original actual data, data mining results could potentially be exploited to infer information that is contained in the original data but not intended for release, thereby potentially breaching the privacy of the parties to whom the data refer. Effective application of data mining can take place only if proper guarantees are given that the privacy of the underlying data is not compromised.

The concept of privacy preserving data mining has been proposed in response to these privacy concerns [6]. Privacy preserving data mining aims at providing a trade-off between sharing information for data mining analysis, on the one side, and protecting information to preserve the privacy of the involved parties on the other side. Several privacy preserving data mining approaches have been proposed, which usually protect data by modifying them to mask or erase the original sensitive data that should not be revealed [4, 6, 13]. These approaches are typically based on the concepts of: loss of privacy, measuring the capacity to estimate the original data from the modified data, and loss of information, measuring the loss of accuracy in the data. In general, the more the privacy of the respondents to whom the data refer is protected, the less accurate the result obtained by the miner, and vice versa. The main goal of these approaches is therefore to provide a trade-off between privacy and accuracy. Other approaches to privacy preserving data mining exploit cryptographic techniques for preventing information leakage [20, 30]. The main problem of cryptography-based techniques is, however, that they are usually computationally expensive.

Privacy preserving data mining techniques clearly depend on the definition of privacy, which captures what information is sensitive in the original data and should therefore be protected from either direct or indirect (via inference) disclosure. In this chapter, we consider a specific aspect of privacy that has been receiving considerable attention recently, and that is captured by the notion of k-anonymity [11, 26, 27]. k-anonymity is a property that models the protection of released data against possible re-identification of the respondents to whom the data refer. Intuitively, k-anonymity states that each release of data must be such that every combination of values of released attributes that are also externally available, and therefore exploitable for linking, can be indistinctly matched to at least k respondents. k-anonymous data mining has been recently introduced as an approach to ensuring privacy preservation when releasing data mining results. Very few, preliminary, attempts have been presented, looking at different aspects of guaranteeing k-anonymity in data mining.
We discuss possible threats to k-anonymity posed by data mining and sketch possible approaches to counteracting them, also briefly illustrating some preliminary results existing in the current literature. After recalling the concept of k-anonymity (Section 5.2) and some proposals for its enforcement (Section 5.3), we discuss possible threats to k-anonymity to which data mining results are exposed (Section 5.4). We then illustrate (Section 5.5) possible approaches combining k-anonymity and data mining, distinguishing them depending on whether k-anonymity is enforced directly on the private data (before mining) or on the mined data themselves (either as a post-mining sanitization process or by the mining process itself). For each of the two approaches (Sections 5.6 and 5.7, respectively) we discuss possible ways to capture k-anonymity violations with the aim, on the one side, of defining when mined results respect the k-anonymity of the original data and, on the other side, of identifying possible protection techniques for enforcing such a definition of privacy.

5.2 k-Anonymity

k-anonymity [11, 26, 27] is a property that captures the protection of released data against possible re-identification of the respondents to whom the released data refer. Consider a private table PT, where data have been de-identified by removing explicit identifiers (e.g., SSN and Name). However, values of other released attributes, such as ZIP, Date of birth, Marital status, and Sex, can also appear in some external tables jointly with the individual respondents' identities. If some combinations of values for these attributes are such that their occurrence is unique or rare, then parties observing the data can determine the identity of the respondent to which the data refer, or reduce the uncertainty to a limited set of respondents.

k-anonymity demands that every tuple in the private table being released be indistinguishably related to no fewer than k respondents. Since it seems impossible, or highly impractical and limiting, to make assumptions on which data are known to a potential attacker and can be used to (re-)identify respondents, k-anonymity takes a safe approach, requiring that, in the released table itself, the respondents be indistinguishable (within a given set of individuals) with respect to the set of attributes, called quasi-identifier, that can be exploited for linking. In other words, k-anonymity requires that if a combination of values of quasi-identifying attributes appears in the table, then it appears with at least k occurrences.

To illustrate, consider a private table reporting, among other attributes, the marital status, the sex, the working hours of individuals, and whether they suffer from hypertension. Assume attributes Marital status, Sex, and Hours are the attributes jointly constituting the quasi-identifier. Figure 5.1 is a simplified representation of the projection of the private table over the quasi-identifier. The representation has been simplified by collapsing tuples with the same quasi-identifying values into a single tuple. The numbers at the right-hand side of the table report, for each tuple, the number of actual occurrences, also specifying how many of these occurrences have values Y and N, respectively, for attribute Hypertension. For simplicity, in the following we use such a simplified table as our table PT. The private table PT in Figure 5.1 guarantees k-anonymity only for k ≤ 2.
In fact, the table has only two occurrences of divorced (fe)males working 35 hours. If such a situation is satisfied in a particular correlated external table as well, the uncertainty of the identity of such respondents can be reduced to two specific individuals. In other words, a data recipient can infer that any information appearing in the table for such divorced (fe)males working 35 hours actually pertains to one of two specific individuals.

Marital status   Sex   Hours   #tuples (Hyp. values)
divorced         M     35       2 (0Y, 2N)
divorced         M     40      17 (16Y, 1N)
divorced         F     35       2 (0Y, 2N)
married          M     35      10 (8Y, 2N)
married          F     50       9 (2Y, 7N)
single           M     40      26 (6Y, 20N)

Figure 5.1. Simplified representation of a private table

It is worth pointing out a simple but important observation (to which we will come back later in the chapter): if a tuple has k occurrences, then any of its sub-tuples must have at least k occurrences. In other words, the existence of k occurrences of any sub-tuple is a necessary (not sufficient) condition for having k occurrences of a super-tuple. For instance, with reference to our example, k-anonymity over quasi-identifier {Marital status, Sex, Hours} requires that each value of the individual attributes, as well as of any sub-tuple corresponding to a combination of them, appears with at least k occurrences. This observation will be exploited later in the chapter to assess the non-satisfaction of a k-anonymity constraint for a table, based on the fact that a sub-tuple of the quasi-identifier appears with fewer than k occurrences. Again with reference to our example, the observation that there are only two tuples referring to divorced females allows us to assert that the table will certainly not satisfy k-anonymity for k > 2 (since the two occurrences will remain at most two when adding attribute Hours).
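A minimal sketch of checking the k-anonymity of the table in Figure 5.1, including the sub-tuple observation above; the attribute names used as dictionary keys are illustrative.

```python
from collections import Counter

def max_k_anonymity(rows, quasi_identifier):
    """Largest k for which the table is k-anonymous: the size of the
    smallest group of tuples sharing one quasi-identifier value."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return min(counts.values())

# The table of Figure 5.1, expanded from (tuple, occurrence count) pairs:
groups = [
    ({"marital": "divorced", "sex": "M", "hours": 35}, 2),
    ({"marital": "divorced", "sex": "M", "hours": 40}, 17),
    ({"marital": "divorced", "sex": "F", "hours": 35}, 2),
    ({"marital": "married",  "sex": "M", "hours": 35}, 10),
    ({"marital": "married",  "sex": "F", "hours": 50}, 9),
    ({"marital": "single",   "sex": "M", "hours": 40}, 26),
]
rows = [row for row, n in groups for _ in range(n)]
print(max_k_anonymity(rows, ("marital", "sex", "hours")))  # 2
# The sub-tuple bound: {Marital status, Sex} alone already caps k at 2.
print(max_k_anonymity(rows, ("marital", "sex")))           # 2
```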
Two main techniques have been proposed for enforcing k-anonymity on a private table: generalization and suppression, both enjoying the property of preserving the truthfulness of the data.

Generalization consists in replacing attribute values with a generalized version of them. Generalization is based on a domain generalization hierarchy and a corresponding value generalization hierarchy on the values in the domains. Typically, the domain generalization hierarchy is a total order and the corresponding value generalization hierarchy a tree, where the parent/child relationship represents the direct generalization/specialization relationship. Figure 5.2 illustrates an example of possible domain and value generalization hierarchies for the quasi-identifying attributes of our example:

(a) Marital status: M2 = {any marital status}, M1 = {been married, never married}, M0 = {married, divorced, single}; the values married and divorced generalize to been married, and single generalizes to never married.
(b) Sex: S1 = {any sex}, S0 = {F, M}; both F and M generalize to any sex.
(c) Hours: H2 = {[1, 100)}, H1 = {[1, 40), [40, 100)}, H0 = {35, 40, 50}; 35 generalizes to [1, 40), while 40 and 50 generalize to [40, 100).

Figure 5.2. An example of domain and value generalization hierarchies

Generalization can be applied at the level of a single cell (substituting the cell value with a generalized version of it) or at the level of an attribute (generalizing all the cells in the corresponding column). It is easy to see how generalization can enforce k-anonymity: values that were different in the private table can be generalized to a same value, whose number of occurrences would be the sum of the number of occurrences of the values that have been generalized to it. The same reasoning extends to tuples. Figure 5.11(d) reports the result of a generalization over attribute Sex on the table in Figure 5.1, which resulted, in particular, in divorced people working 35 hours being collapsed to the same tuple {divorced, any sex, 35}, with 4 occurrences. The table in Figure 5.11(d) satisfies k-anonymity for any k ≤ 4 (since there are no fewer than 4 respondents for each combination of values of quasi-identifying attributes). Note that 4-anonymity could be guaranteed also by only generalizing (to any sex) the sex value of divorced people (males and females) working 35 hours, while leaving the other tuples unaltered, since for all the other tuples not satisfying this condition there are already at least 4 occurrences in the private table. This cell generalization approach has the advantage of avoiding generalizing all values in a column when generalizing only a subset of them suffices to guarantee k-anonymity. It has, however, the disadvantage of not preserving the homogeneity of the values appearing in the same column.

Suppression consists in protecting sensitive information by removing it. Suppression, which can be applied at the level of a single cell, entire tuple, or entire column, allows reducing the amount of generalization to be enforced to achieve k-anonymity. Intuitively, if a limited number of outliers would force a large amount of generalization to satisfy a k-anonymity constraint, then such outliers can be removed from the table, thus allowing satisfaction of k-anonymity with less generalization (and therefore reducing the loss of information).

Figure 5.3 summarizes the different combinations of generalization and suppression at different granularity levels (including combinations where one of the two techniques is not adopted), which correspond to different approaches and solutions to the k-anonymity problem [11]:

Generalization   Suppression: Tuple     Attribute         Cell           None
Attribute        AG_TS                  AG_AS ≡ AG        AG_CS          AG ≡ AG_AS
Cell             CG_TS (not applicable) CG_AS (not appl.) CG_CS ≡ CG     CG ≡ CG_CS
None             _TS                    _AS               _CS            not interesting

Figure 5.3. Classification of k-anonymity techniques [11]

It is interesting to note that the application of generalization and suppression at the same granularity level is equivalent to the application of generalization only (AG_AS ≡ AG and CG_CS ≡ CG), since suppression can be modeled as a generalization to the top element in the value generalization hierarchy. Combinations CG_TS (cell generalization, tuple suppression) and CG_AS (cell generalization, attribute suppression) are not applicable, since the application of generalization at the cell level implies the application of suppression at that level too.
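Continuing the sketch given after Figure 5.1 (and reusing rows and max_k_anonymity from there), attribute-level generalization of Sex raises the anonymity level to 4, matching the discussion above; the one-level hierarchy encoding is an assumption of the sketch.

```python
SEX_HIERARCHY = {"M": "any sex", "F": "any sex"}  # one-level value hierarchy

def generalize_attribute(rows, attribute, hierarchy):
    """Attribute-level generalization (AG): replace every value in the
    column by its parent in the value generalization hierarchy."""
    return [dict(row, **{attribute: hierarchy[row[attribute]]}) for row in rows]

generalized = generalize_attribute(rows, "sex", SEX_HIERARCHY)
print(max_k_anonymity(generalized, ("marital", "sex", "hours")))  # now 4
```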
Suppression consists in protecting sensitive information by removing it. Suppression, which can be applied at the level of the single cell, the entire tuple, or the entire column, allows reducing the amount of generalization needed to achieve k-anonymity. Intuitively, if a limited number of outliers would force a large amount of generalization to satisfy a k-anonymity constraint, such outliers can be removed from the table, thus allowing satisfaction of k-anonymity with less generalization (and therefore reducing the loss of information).

                   Suppression
Generalization    Tuple              Attribute          Cell               None
Attribute         AG_TS              AG_AS ≡ AG         AG_CS              AG ≡ AG_AS
Cell              CG_TS              CG_AS              CG_CS ≡ CG         CG ≡ CG_CS
                  (not applicable)   (not applicable)
None              TS                 AS                 CS                 not interesting

Figure 5.3. Classification of k-anonymity techniques [11]

Figure 5.3 summarizes the different combinations of generalization and suppression at different granularity levels (including combinations where one of the two techniques is not adopted), which correspond to different approaches and solutions to the k-anonymity problem [11]. It is interesting to note that the application of generalization and suppression at the same granularity level is equivalent to the application of generalization only (AG ≡ AG_AS and CG ≡ CG_CS), since suppression can be modeled as a generalization to the top element in the value generalization hierarchy. Combinations CG_TS (cell generalization, tuple suppression) and CG_AS (cell generalization, attribute suppression) are not applicable since the application of generalization at the cell level implies the application of suppression at that level too.

5.3 Algorithms for Enforcing k-Anonymity

The application of generalization and suppression to a private table PT produces less precise (more general) and less complete (some values are suppressed) tables that provide protection of the respondents' identities. It is important to keep under control, and minimize, the information loss (in terms of loss of precision and completeness) caused by generalization and suppression. Different definitions of minimality have been proposed in the literature, and the problem of finding minimal k-anonymous tables, with attribute generalization and tuple suppression, has been proved to be computationally hard [2, 3, 22].

Within a given definition of minimality, more generalized tables, all ensuring minimal information loss, may exist. While existing approaches typically aim at returning any such solution, different criteria could be devised according to which one solution should be preferred over the others. This aspect is particularly important in data mining, where there is the need to maximize the usefulness of the data with respect to the goal of the data mining process (see Section 5.6). We now describe some algorithms proposed in the literature for producing k-anonymous tables.

Samarati's Algorithms. The first algorithm for AG_TS (i.e., generalization over quasi-identifier attributes and tuple suppression) was proposed in conjunction with the definition of k-anonymity [26]. Since the algorithm operates on a set of attributes, the definition of domain generalization hierarchy is extended to refer to tuples of domains. The domain generalization hierarchy of a domain tuple is a lattice, where each vertex represents a generalized table that is obtained by generalizing the involved attributes according to the corresponding domain tuple and by suppressing a certain number of tuples to fulfill the k-anonymity constraint. Figure 5.4 illustrates an example of a domain generalization hierarchy obtained by considering Marital status and Sex as quasi-identifying attributes, that is, by considering the domain tuple ⟨M0, S0⟩. Each path in the hierarchy corresponds to a generalization strategy according to which the original private table PT can be generalized. The main goal of the algorithm is to find a k-minimal generalization that suppresses fewer tuples. Therefore, given a threshold MaxSup specifying the maximum number of tuples that can be suppressed, the algorithm has to compute a generalization that satisfies k-anonymity within the MaxSup constraint. Since the number of tuples that must be removed to guarantee k-anonymity decreases when going up in the hierarchy, the algorithm performs a binary search on the hierarchy. Let h be the height of the hierarchy. The algorithm first evaluates all the solutions at height h/2. If there is at least one k-anonymous table that satisfies the MaxSup threshold, the algorithm checks solutions at height h/4; otherwise it evaluates solutions at height 3h/4, and so on, until it finds the lowest height at which there is a solution that satisfies the k-anonymity constraint. As an example, consider the private table in Figure 5.1 with QI = {Marital status, Sex}, the domain and value generalization hierarchies in Figure 5.2, and the generalization hierarchy in Figure 5.4. Suppose also that k = 4 and MaxSup = 1. Since the height of the hierarchy is 3, the algorithm first evaluates the solutions at height ⌊3/2⌋, that is, the generalizations corresponding to domain tuples ⟨M1, S0⟩ and ⟨M0, S1⟩, and then proceeds down or up in the lattice according to whether a solution satisfying 4-anonymity within the MaxSup constraint has been found.

Among the other algorithms proposed in the literature are k-Optimize [10] and Incognito [18]. A different strategy is adopted by the Mondrian multidimensional algorithm [19], which maps the tuples of the private table to points in a multidimensional space, with one dimension for each quasi-identifying attribute, and recursively partitions the space into regions. At each step, a region is split along some dimension d at a value x: all points with a value greater than x on d will belong to one of the resulting regions, while all points with d ≤ x will belong to the other region. Note that this splitting operation is allowed only if each of the resulting regions contains at least k points. The algorithm terminates when there are no more splitting operations allowed. The tuples within a given region are then generalized to a unique tuple of summary statistics for the considered region.
For each quasi-identifying attribute, a summary statistic may simply be a single value (e.g., the average value) or the pair of maximum and minimum values for the attribute in the region. As an example, consider the private table PT in Figure 5.1 and suppose that QI = {Marital status, Sex} and k = 10. Figure 5.8(a) illustrates the two-dimensional representation of the table for the Marital status and Sex quasi-identifying attributes, where the number associated with each point corresponds to the number of occurrences of the quasi-identifier value in PT. Suppose a split operation is performed on the Marital status dimension. The resulting two regions, illustrated in Figure 5.8(b), are 10-anonymous. The bottom region can be further partitioned along the Sex dimension, as represented in Figure 5.8(c). Another splitting operation along the Marital status dimension can be performed on the region containing the points that correspond to the quasi-identifying values ⟨married, M⟩ and ⟨divorced, M⟩. Figure 5.8(d) illustrates the final solution. The experimental results [19] show that the Mondrian multidimensional method obtains good solutions for the k-anonymity problem, also compared with k-Optimize and Incognito.

Figure 5.8. Spatial representation (a) and possible partitionings (b)-(d) of the table in Figure 5.1 (occurrence counts: ⟨divorced, M⟩ 19, ⟨divorced, F⟩ 2, ⟨married, M⟩ 10, ⟨married, F⟩ 9, ⟨single, M⟩ 26)
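The splitting idea can be sketched as follows. This is a minimal illustration of a Mondrian-style partitioning with median splits on numerically encoded attributes (our own simplification, not the exact algorithm of [19]):

```python
def mondrian(points, k):
    """Recursively split a list of numeric tuples into regions of >= k points."""
    for dim in range(len(points[0])):
        values = sorted(p[dim] for p in points)
        x = values[len(values) // 2]              # candidate split value
        left = [p for p in points if p[dim] <= x]
        right = [p for p in points if p[dim] > x]
        if len(left) >= k and len(right) >= k:    # split only if both sides keep k points
            return mondrian(left, k) + mondrian(right, k)
    return [points]                               # no allowed split: the region is final

def summarize(region):
    """Generalize a region to a (min, max) summary per dimension."""
    return [(min(p[d] for p in region), max(p[d] for p in region))
            for d in range(len(region[0]))]

# Encode Marital status (0=divorced, 1=married, 2=single) and Sex (0=F, 1=M),
# expanding the counts of Figure 5.1.
data = [(0, 1)] * 19 + [(0, 0)] * 2 + [(1, 1)] * 10 + [(1, 0)] * 9 + [(2, 1)] * 26
for region in mondrian(data, 10):
    print(len(region), summarize(region))   # every region contains at least 10 points
```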
Approximation Algorithms. Since the majority of the exact algorithms proposed in the literature have computational time exponential in the number of attributes composing the quasi-identifier, approximation algorithms have also been proposed. Approximation algorithms for CS and CG have been presented, both for general and for specific values of k (e.g., a 1.5-approximation for 2-anonymity and a 2-approximation for 3-anonymity [3]). In a minimization framework, a p-approximation algorithm guarantees that the cost C of its solution is such that C/C* ≤ p, where C* is the cost of an optimal solution [17]. The first approximation algorithm for CS was proposed by Meyerson and Williams [22] and guarantees an O(k log(k))-approximation. The best-known approximation algorithm for CS is described in [2] and guarantees an O(k)-approximate solution. The algorithm constructs a complete weighted graph from the original private table PT. Each vertex in the graph corresponds to a tuple in PT, and each edge is weighted with the number of attribute values on which the two tuples represented by its extreme vertices differ. The algorithm then constructs, starting from the graph, a forest composed of trees containing at least k vertices, which represents the clustering for k-anonymization. Some cells in the vertices are suppressed so that all the tuples in the same tree have the same quasi-identifier value. The cost of a vertex is evaluated as the number of cells suppressed, and the cost of a tree is the sum of the costs of its vertices. The cost of the final solution is equal to the sum of the costs of its trees. In constructing the forest, the algorithm limits the maximum number of vertices in a tree to 3k − 3. Partitions with more than 3k − 3 elements are decomposed, without increasing the total solution cost. The construction of trees with no more than 3k − 3 vertices guarantees an O(k)-approximate solution.

An approximation algorithm for CG is described in [3] as a direct extension of the approximation algorithm for CS presented in [2]. To take the generalization hierarchies into account, each edge has a weight that is computed as follows. Given two tuples i and j and an attribute a, the generalization cost h_{i,j}(a) associated with a is the lowest level of the value generalization hierarchy of a at which tuples i and j have the same generalized value for a. The weight w(e) of the edge e = (i, j) is therefore w(e) = Σ_a h_{i,j}(a)/l_a, where l_a is the number of levels in the value generalization hierarchy of a. For instance, assuming l_a is the height of the hierarchies in Figure 5.2 (2 for Marital status and Hours, 1 for Sex), tuples ⟨divorced, M, 35⟩ and ⟨married, F, 40⟩ would yield w(e) = 1/2 + 1/1 + 2/2 = 2.5. The solution of this algorithm is guaranteed to be an O(k)-approximation.

Besides algorithms that compute k-anonymized tables for any value of k, ad-hoc algorithms for specific values of k have also been proposed. For instance, to find better results for Boolean attributes in the cases k = 2 and k = 3, an ad-hoc approach has been provided in [3]. The algorithm for k = 2 exploits the minimum-weight [1, 2]-factor built on the graph constructed for 2-anonymity. The [1, 2]-factor of a graph G is a spanning subgraph of G built using only vertices with no more than 2 outgoing edges. Such a subgraph is a vertex-disjoint collection of edges and pairs of adjacent vertices, and can be computed in polynomial time. Each component in the subgraph is treated as a cluster, and a 2-anonymized table is obtained by suppressing each cell for which the vectors in the cluster differ in value. This procedure is a 1.5-approximation algorithm. The approximation algorithm for k = 3 is similar and guarantees a 2-approximate solution.

5.4 k-Anonymity Threats from Data Mining

Data mining techniques allow the extraction of information from large collections of data. Mined information, even if it does not explicitly include the original data, is built on them, and can therefore allow inferences on the original data to be drawn, possibly putting at risk the privacy constraints imposed on the original data. This observation holds also for k-anonymity. The desire to ensure k-anonymity of the data in the collection may therefore require imposing restrictions on the possible output of the data mining process. In this section, we discuss possible threats to k-anonymity that can arise from performing mining on a collection of data maintained in a private table PT subject to k-anonymity constraints. We discuss the problems for the two main classes of data mining techniques, namely association rule mining and classification mining.

5.4.1 Association Rules

Classical association rule mining operates on a set of transactions, each composed of a set of items, and produces association rules of the form X → Y, where X and Y are sets of items. Intuitively, rule X → Y expresses the fact that transactions containing the items in X tend to also contain the items in Y. Each rule has a support and a confidence, in the form of a percentage. The support expresses the percentage of transactions that contain both X and Y, while the confidence expresses the percentage of transactions, among those containing X, that also contain Y. Since the goal is to find common patterns, typically only those rules with support and confidence greater than some predefined thresholds are considered of interest [5, 28, 31].
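For concreteness, a small sketch of support and confidence over the table of Figure 5.1, computed in absolute terms in anticipation of the convention adopted in the next paragraph (the set-based encoding of tuples as transactions is ours):

```python
# Each tuple of Figure 5.1 viewed as a transaction (a set of values).
transactions = ([{"divorced", "M", 35}] * 2 + [{"divorced", "M", 40}] * 17
                + [{"divorced", "F", 35}] * 2 + [{"married", "M", 35}] * 10
                + [{"married", "F", 50}] * 9 + [{"single", "M", 40}] * 26)

def support(itemset):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y):
    """Fraction of transactions containing x that also contain y."""
    return support(x | y) / support(x)

print(support({"divorced", "M"}))        # 19
print(confidence({"divorced"}, {"M"}))   # 19/21, about 0.905
```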
Translating association rule mining over a private table PT on which k-anonymity should be enforced, we consider the values appearing in the table as items, and the tuples reporting respondents' information as transactions. For simplicity, we assume here that the domains of the attributes are disjoint. Also, we assume support and confidence to be expressed as absolute values (in contrast to percentages). The reason for this assumption, which is consistent with the approaches in the literature, is that k-anonymity itself is expressed in terms of absolute numbers. Note, however, that this does not imply that the release itself will be made in terms of absolute values.

Association rule mining over a private table PT then allows the extraction of rules expressing combinations of values common to different respondents. For instance, with reference to the private table in Figure 5.1, rule {divorced} → {M}, with support 19 and confidence 19/21, states that 19 tuples in the table refer to divorced males, and that among the 21 tuples referring to divorced people, 19 refer to males. If the quasi-identifier of table PT contains both attributes Marital status and Sex, it is easy to see that such a rule violates k-anonymity for any k > 19, since it reflects the existence of 19 respondents who are divorced males (Marital status and Sex being included in the quasi-identifier, this implies that no more than 19 indistinguishable tuples can exist for divorced male respondents). Less trivially, the rule above also violates k-anonymity for any k > 2, since it reflects the existence of 2 respondents who are divorced and not male; again, Marital status and Sex being included in the quasi-identifier, this implies that no more than 2 indistinguishable tuples can exist for non-male divorced respondents.
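The inference used in the example above is a one-line computation. The following sketch (illustrative names, ours) derives the support of a pattern with a negated item from two released supports:

```python
# Released supports, as in the example above.
support = {frozenset(): 66,
           frozenset({"divorced"}): 21,
           frozenset({"divorced", "M"}): 19}

def negated_support(x, item):
    """Support of pattern x AND NOT item = supp(x) - supp(x + {item})."""
    return support[frozenset(x)] - support[frozenset(x) | {item}]

k = 3
s = negated_support({"divorced"}, "M")
print(s)        # 2: divorced, non-male respondents
print(s < k)    # True: releasing both supports violates 3-anonymity
```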
5.4.2 Classification Mining

In classification mining, a set of database tuples, acting as a training sample, is analyzed to produce a model of the data that can be used as a predictive method for classifying new data into classes. The goal of the classification process is to build a model that can be used to further classify tuples being inserted and that represents a descriptive understanding of the table content [25].

One of the most popular classification mining techniques is represented by decision trees, defined as follows. Each internal node of a decision tree is associated with an attribute on which the classification is defined (excluding the classifying attribute, which in our example is Hypertension). Each outgoing edge is associated with a split condition representing how the data in the training sample are partitioned at that tree node. The form of a split condition depends on the type of the attribute. For instance, for a numerical attribute A, the split condition may be of the form A ≤ v, where v is a possible value for A. Each node contains information about the number of samples at that node and how they are distributed among the different class values.

As an example, the private table PT in Figure 5.1 can be used as a learning set to build a decision tree for predicting whether people are likely to suffer from hypertension, based on their marital status if they are male, and on their working hours if they are female. A possible decision tree for such a case, performing the classification based on some values appearing in quasi-identifier attributes, is illustrated in Figure 5.9.

Sex (32Y, 34N)
  M: Marital status (30Y, 25N)
    married: 8Y, 2N
    divorced: 16Y, 3N
    single: 6Y, 20N
  F: Hours (2Y, 9N)
    35: 0Y, 2N
    50: 2Y, 7N

Figure 5.9. An example of decision tree

The quasi-identifier attributes correspond to internal (splitting) nodes in the tree, edges are labeled with (a subset of) attribute values instead of reporting the complete split condition, and nodes simply contain the number of respondents classified by the node values, distinguishing between people suffering (Y) and not suffering (N) from hypertension.

While the decision tree does not directly release the data of the private table, it does allow inferences on them. For instance, Figure 5.9 reports the existence of 2 females working 35 hours (node reachable from path ⟨F, 35⟩). Again, since Sex and Hours belong to the quasi-identifier, this information reflects the existence of no more than two respondents for such occurrences of values, thus violating k-anonymity for any k > 2. As for association rules, threats can also arise by combining the classifications given by different nodes along the same path. For instance, considering the decision tree in Figure 5.9, the combined release of the nodes reachable from paths ⟨F⟩ (with 11 occurrences) and ⟨F, 50⟩ (with 9 occurrences) allows one to infer that there are 2 female respondents in PT who do not work 50 hours per week.

5.5 k-Anonymity in Data Mining

Section 5.4 has illustrated how data mining results can compromise the k-anonymity of a private table, even if the table itself is not released. Since proper privacy guarantees are a must for enabling information sharing, it is then important to devise solutions ensuring that data mining does not open the door to privacy violations. With particular reference to k-anonymity, we must ensure that k-anonymity for the original table PT is not violated. There are two possible approaches to guaranteeing k-anonymity in data mining.

Anonymize-and-Mine: anonymize the private table PT and perform mining on its k-anonymous version.

Mine-and-Anonymize: perform mining on the private table PT and anonymize the result. This approach can be performed by executing the two steps independently or in combination.

Figure 5.10 provides a graphical illustration of these approaches, reporting, for the Mine-and-Anonymize approach, the two different cases: one step or two steps. In the figure, boxes represent data, while arcs represent processes producing data from data.

Anonymize-and-Mine:             PT --anonymize--> PTk --mine--> MDk
Mine-and-Anonymize (two steps): PT --mine--> MD --anonymize--> MDk
Mine-and-Anonymize (one step):  PT --anonymized mining--> MDk

Figure 5.10. Different approaches for combining k-anonymity and data mining

The different data boxes are: PT, the private table; PTk, an anonymized version of PT; MD, a result of a data mining process (without any consideration of k-anonymity constraints); and MDk, a result of a data mining process that respects the k-anonymity constraint for the private table PT. Dashed lines for boxes and arcs denote data and processes, respectively, reserved to the data holder, while continuous lines denote data and processes that can be viewed and executed by other parties (since their visibility and execution do not violate k-anonymity for PT).
Let us now discuss the two approaches in more detail, together with their trade-offs between applicability and efficiency of the process on the one side, and utility of the data on the other.

Anonymize-and-Mine (AM). This approach consists in applying a k-anonymity algorithm on the original private table PT and then releasing a table PTk that is a k-anonymized version of PT. Data mining is performed, by the data holder or even by external parties, on PTk. The advantage of such an approach is that it decouples data protection from mining, giving a double benefit. First, it guarantees that data mining is safe: since data mining is executed on PTk (and not on PT), by definition the data mining results cannot violate k-anonymity for PT. Second, it allows data mining to be executed by parties other than the data holder, enabling different data mining processes and different uses of the data. This is convenient, for example, when the data holder does not know a priori how the recipient may analyze and classify the data. Moreover, the recipient may have application-specific data mining algorithms and may want to directly define parameters (e.g., accuracy and interpretability) and decide the mining method only after examining the data. On the other hand, the possible disadvantage of performing mining on anonymized data is that mining operates on less specialized and less complete data, so the usefulness and significance of the mining results can be compromised. Since classical k-anonymity approaches aim at satisfying k-anonymity while minimizing information loss (i.e., minimizing the amount of generalization and suppression adopted), a k-anonymity algorithm may produce a result that is not suited for mining purposes. As a result, classical k-anonymity algorithms may hide information that is highly useful for data mining. Particular care must then be taken in the k-anonymization process to ensure maximal utility of the k-anonymous table PTk with respect to the goals of the data mining process to be executed. In particular, the aim of k-anonymity algorithms operating on data intended for data mining should not be the mere minimization of information loss, but the optimization of a measure suitable for data mining purposes. A further limitation of the Anonymize-and-Mine approach is that it is not applicable when the input data can be accessed only once (e.g., when the data source is a stream). Also, it may be less efficient overall, since the anonymization process may be quite expensive with respect to the mining one, especially in the case of sparse and large databases [1]. Performing k-anonymization before data mining is therefore likely to be more expensive than doing the contrary.

Mine-and-Anonymize (MA). This approach consists in mining the original non-k-anonymous data, performing data mining on the original table PT, and then applying an anonymization process to the data mining result. Data mining can then be performed by the data holder only, and only the sanitized data mining results (MDk) are released to other parties. The definition of k-anonymity must then be adapted to the output of the data mining phase. Intuitively, no inference should be possible on the mined data that would violate k-anonymity for the original table PT. This does not mean that the table PT must be k-anonymous, but that, if it is not, this should not be known, and the effect of its not being k-anonymous should not be visible in the mined results.
In the Mine-and-Anonymize approach, k-anonymity constraints can be taken into consideration after data mining is complete (two-step Mine-and-Anonymize) or within the mining process itself (one-step Mine-and-Anonymize). In two-step Mine-and-Anonymize the result needs to be sanitized, removing from MD all data that would compromise k-anonymity for PT. In one-step Mine-and-Anonymize the data mining algorithm needs to be modified so as to ensure that only results that do not compromise k-anonymity for PT are computed (MDk). The two possible implementations (one step vs two steps) provide different trade-offs between applicability and efficiency: two-step Mine-and-Anonymize does not require any modification to the mining process and therefore can use any data mining tool available (provided that results are then anonymized); one-step Mine-and-Anonymize requires instead redesigning data mining algorithms and tools to directly enforce k-anonymity; combining the two steps can however result in a more efficient process, giving performance advantages. Summarizing, the main drawback of Mine-and-Anonymize is that it requires mining to be executed only by the data holder (or by parties authorized to access the private table PT). This may therefore limit applicability. The main advantages are the efficiency of the mining process and the quality of the results: performing mining before, or together with, anonymization can in fact be more efficient, and allows keeping data distortion under control, with the goal of maximizing the usefulness of the data.

5.6 Anonymize-and-Mine

The main objective of classical k-anonymity techniques is the minimization of information loss. Since a private table may have more than one minimal k-anonymous generalization, different preference criteria can be applied in choosing a minimal generalization, such as minimum absolute distance, minimum relative distance, maximum distribution, or minimum suppression [26]. In fact, the strategies behind heuristics for k-anonymization can typically be based on preference criteria or even user policies (e.g., discouraging the generalization of some given attributes).

In the context of data mining, the main goal is retaining information useful for data mining, while determining a k-anonymization that protects the respondents against linking attacks. It is therefore necessary to define k-anonymity algorithms that guarantee data usefulness for subsequent mining operations. A possible solution to this problem is the use of existing k-anonymizing algorithms, choosing the maximization of the usefulness of the data for classification as a preference criterion.

Recently, two approaches that anonymize data before mining have been presented for classification (e.g., decision trees): a top-down [16] and a bottom-up [29] technique. These two techniques aim at releasing a k-anonymous table T(A1, ..., Am, class) for modeling classification of attribute class considering the quasi-identifier QI = {A1, ..., Am}. k-anonymity is achieved with cell generalization and cell suppression (CG_CS), that is, different cells of the same attribute may have values belonging to different generalized domains. The aim of preserving anonymity for classification is then to satisfy the k-anonymity constraint while preserving the classification structure in the data. The top-down approach starts from a table containing the most general values for all attributes and tries to refine (i.e., specialize) some values.
For instance, the table in Figure 5.11(a) represents a completely generalized version of the table in Figure 5.1. The bottom-up approach starts from the private table and tries to generalize the attributes until the k-anonymity constraint is satisfied.

In the top-down technique a refinement is performed only if it has suitable properties for guaranteeing both anonymity and good classification. For this purpose, a selection criterion is described for guiding the top-down refinement process to heuristically maximize the classification goal. A refinement has two opposite effects: it increases the information of the table for classification and it decreases its anonymity. The algorithm is guided by the functions InfoGain(v) and AnonyLoss(v), measuring the information gain and the anonymity loss, respectively, where v is the attribute value (cell) candidate for refinement. A good candidate v is such that InfoGain(v) is large and AnonyLoss(v) is small. Thus, the selection criterion for choosing the candidate v to be refined maximizes the function Score(v) = InfoGain(v) / (AnonyLoss(v) + 1). Function Score(v) is computed for each value v of the attributes in the table. The value with the highest score is then specialized to its children in the value generalization hierarchy.

An attribute value v, candidate for specialization, is considered useful for obtaining a good classification if the frequencies of the class values are not uniformly distributed over the specialized values of v. The entropy of a value in a table measures the dominance of the majority: the more dominating the majority class value is, the smaller the entropy is. InfoGain(v) then measures the reduction of entropy after refining v (for a formal definition of InfoGain(v) see [16]). A good candidate is a value v that reduces the entropy of the table. For instance, with reference to the private table in Figure 5.1 and its generalized version in Figure 5.11(a), InfoGain(any marital status) is high, since for been married we have 14 N and 26 Y, with a difference of 12, and for never married we have 20 N and 6 Y, with a difference of 14 (see Figure 5.11(b)). On the contrary, InfoGain([1, 100)) is low, since for [1, 40) we have 8 Y and 6 N, with a difference of 2, and for [40, 100) we have 24 Y and 28 N, with a difference of 4. Thus Marital status is more useful for classification than Hours.

Marital status       Sex      Hours    #tuples (Hyp. values)
any marital status   any sex  [1,100)  66 (32Y, 34N)
(a) Step 1: the most general table

Marital status   Sex      Hours    #tuples (Hyp. values)
been married     any sex  [1,100)  40 (26Y, 14N)
never married    any sex  [1,100)  26 (6Y, 20N)
(b) Step 2

Marital status   Sex      Hours    #tuples (Hyp. values)
divorced         any sex  [1,100)  21 (16Y, 5N)
married          any sex  [1,100)  19 (10Y, 9N)
never married    any sex  [1,100)  26 (6Y, 20N)
(c) Step 3

Marital status   Sex      Hours  #tuples (Hyp. values)
divorced         any sex  35     4 (0Y, 4N)
divorced         any sex  40     17 (16Y, 1N)
married          any sex  35     10 (8Y, 2N)
married          any sex  50     9 (2Y, 7N)
single           any sex  40     26 (6Y, 20N)
(d) Final table (after 7 steps)

Figure 5.11. An example of top-down anonymization for the private table in Figure 5.1

Let us define the anonymity degree of a table as the maximum k for which the table is k-anonymous. The loss of anonymity, defined as AnonyLoss(v), is the difference between the degrees of anonymity of the table before and after refining v.
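As an illustration of the selection criterion (our sketch, not the implementation of [16]), the following computes the entropy-based InfoGain of a candidate refinement and the resulting Score; AnonyLoss is obtained as the drop in the anonymity degree just defined.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(groups):
    """Entropy reduction when a value is refined into the given groups."""
    merged = [lbl for g in groups for lbl in g]
    n = len(merged)
    return entropy(merged) - sum(len(g) / n * entropy(g) for g in groups)

def score(groups, anony_loss):
    """Score(v) = InfoGain(v) / (AnonyLoss(v) + 1)."""
    return info_gain(groups) / (anony_loss + 1)

# Refining 'any marital status' into been married / never married,
# with the class counts of Figures 5.11(a)-(b).
been_married = ["Y"] * 26 + ["N"] * 14
never_married = ["Y"] * 6 + ["N"] * 20
print(info_gain([been_married, never_married]))   # ~0.13: a useful split
print(score([been_married, never_married], 40))   # anonymity degree drops 66 -> 26
```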
For instance, the degrees of the tables in Figures 5.11(b) and 5.11(c) are 26 (tuples containing ⟨never married, any sex, [1,100)⟩) and 19 (tuples containing ⟨married, any sex, [1,100)⟩), respectively. Since the table in Figure 5.11(c) is obtained by refining the value been married of the table in Figure 5.11(b), AnonyLoss(been married) is 7. The algorithm terminates when any further refinement would violate the k-anonymity constraint.

Example 5.1 Consider the private table in Figure 5.1 and the value generalization hierarchies in Figure 5.2. Suppose that QI = {Marital status, Sex, Hours} and k = 4. The algorithm starts from the most generalized table in Figure 5.11(a) and computes the scores Score(any marital status), Score(any sex), and Score([1, 100)). Since the maximum score corresponds to the value any marital status, this value is refined, producing the table in Figure 5.11(b). The remaining tables computed by the algorithm are shown in Figures 5.11(c) and 5.11(d). Figure 5.11(d) illustrates the final table, since the only possible refinement (any sex to M and F) would violate 4-anonymity. Note that the final table is 4-anonymous with respect to QI = {Marital status, Sex, Hours}.

The bottom-up approach is the dual of the top-down approach. Starting from the private table, the objective of the bottom-up approach is to generalize the values in the table so as to determine a k-anonymous table preserving good qualities for classification and minimizing information loss. The effect of a generalization is thus measured by a function involving anonymity gain (instead of anonymity loss) and information loss.

Note that, since these methods compute a minimal k-anonymous table suitable for classification with respect to class and QI, the computed table PTk is optimized only if classification is performed using the entire set QI. Otherwise, the obtained table PTk could be too general. For instance, consider the table in Figure 5.1: the table in Figure 5.11(d) is a 4-anonymization for it considering QI = {Marital status, Sex, Hours}. If classification is to be done with respect to a subset QI′ = {Marital status, Sex} of QI, such a table would be too general. As a matter of fact, a 4-anonymization of PT with respect to QI′ can be obtained from PT by simply generalizing divorced and married to been married. This latter generalization would generalize only 40 cells, instead of the 66 cells (M and F to any sex) generalized in the table in Figure 5.11(d).

5.7 Mine-and-Anonymize

The Mine-and-Anonymize approach performs mining on the original table PT. Anonymity constraints must therefore be enforced with respect to the mined results to be returned. Regardless of whether the approach is executed in one or two steps (see Section 5.5), the problem to be solved is translating the k-anonymity constraints for PT over the mined results. Intuitively, the mined results should not allow anybody to infer the existence of sets of quasi-identifier values that have fewer than k occurrences in the private table PT. Let us then discuss what this implies for association rules and for decision trees.
5.7.1 Enforcing k-Anonymity on Association Rules

To discuss k-anonymity for association rules it is useful to distinguish the two different phases of association rule mining:

1. find all combinations of items whose support (i.e., the number of joint occurrences in the records) is greater than a minimum threshold σ (frequent itemset mining);
2. use the frequent itemsets to generate the desired rules.

The consideration of these two phases conveniently allows expressing k-anonymity constraints with respect to observable itemsets instead of association rules. Intuitively, k-anonymity for PT is satisfied if the observable itemsets do not allow inferring (the existence of) sets of quasi-identifier values that have fewer than k occurrences in the private table. It is trivial to see that any itemset X that includes only values of quasi-identifier attributes and has support lower than k is clearly unsafe. In fact, the information given by the itemset corresponds to stating that there are fewer than k respondents with the occurrences of values in X, thus violating k-anonymity. Besides trivial itemsets such as this, combinations of itemsets with support greater than or equal to k can also breach k-anonymity.

As an example, consider the private table in Figure 5.1, where the quasi-identifier is {Marital status, Sex, Hours}, and suppose 3-anonymity must be guaranteed. All itemsets with support lower than 3 clearly violate the constraint. For instance, itemset {divorced, F}, with support 2, which holds in the table, cannot be released. Figure 5.12 illustrates some examples of itemsets with support greater than or equal to 19 (assuming lower supports are not of interest).

Itemset            Support
{∅}                66
{M}                55
{M, 40}            43
{single, M, 40}    26
{divorced}         21
{divorced, M}      19
{married}          19

Figure 5.12. Frequent itemsets extracted from the table in Figure 5.1

While one may think that releasing these itemsets guarantees any k-anonymity for k ≤ 19, it is not so. Indeed, the combination of the two itemsets {divorced, M}, with support 19, and {divorced}, with support 21, clearly violates it. In fact, from their combination we can infer the existence of two tuples in the private table for which the condition 'Marital status = divorced ∧ ¬(Sex = M)' is satisfied. Marital status and Sex being included in the quasi-identifier, this implies that no more than 2 indistinguishable tuples can exist for divorced non-male respondents, thus violating k-anonymity for k > 2. In particular, since Sex can assume only two values, the two itemsets above imply the existence of the (not released) itemset {divorced, F} with support 2. Note that, although the two itemsets ({divorced}, 21) and ({divorced, M}, 19) cannot both be released, there is no reason to suppress both of them, since each of them individually taken is safe.
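Whatever sanitization strategy is adopted, unsafe releases must first be detected. The following minimal sketch (ours; it covers only the low-support case and channels given by a pair X ⊂ Y, not the longer derivations and the algorithm of [9] discussed below) flags such problems in a released collection of itemsets over quasi-identifier values:

```python
def unsafe_channels(supports, k):
    """Flag itemsets with support < k, and pairs X proper-subset-of Y whose
    difference supp(X) - supp(Y) reveals a pattern held by fewer than k
    (but more than zero) respondents."""
    channels = []
    for x, sx in supports.items():
        if sx < k:
            channels.append(("itemset", x, sx))
        for y, sy in supports.items():
            if x < y and 0 < sx - sy < k:   # x proper subset of y
                channels.append(("pattern", x, y, sx - sy))
    return channels

# The released itemsets of Figure 5.12.
released = {frozenset(): 66, frozenset({"M"}): 55, frozenset({"M", "40"}): 43,
            frozenset({"single", "M", "40"}): 26, frozenset({"divorced"}): 21,
            frozenset({"divorced", "M"}): 19, frozenset({"married"}): 19}

for ch in unsafe_channels(released, 3):
    print(ch)   # flags ('pattern', {divorced}, {divorced, M}, 2)
```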
The consideration of such inferences, and of possible solutions for suppressing itemsets so as to block the inferences while maximizing the utility of the released information, bears some resemblance to the primary and secondary suppression operations in statistical data release [12]. It is also important to note that suppression is not the only option that can be applied to sanitize a set of itemsets so that no unsafe inferences violating k-anonymity are possible. Alternative approaches can be investigated, including adapting classical statistical protection strategies [12, 14]. For instance, itemsets can be combined, essentially providing a result that is equivalent to operating on generalized (in contrast to specific) data. Another possible approach consists in introducing noise in the result, for example, modifying the support of itemsets in such a way that their combination never allows inferring itemsets (or patterns of them) with support lower than the specified k.

A first investigation of translating the k-anonymity property of a private table onto itemsets has been carried out in [7-9] with reference to private tables where all attributes are defined on binary domains. The identification of unsafe itemsets is based on the concept of pattern, which is a boolean formula of items, and on the following observation. Let X and X ∪ {A_i} be two itemsets. The support of pattern X ∧ ¬A_i can be obtained by subtracting the support of itemset X ∪ {A_i} from the support of X. By generalizing this observation, we can conclude that, given two itemsets X = {A_{x1}, ..., A_{xn}} and Y = {A_{x1}, ..., A_{xn}, A_{y1}, ..., A_{ym}}, with X ⊂ Y, the support of pattern A_{x1} ∧ ... ∧ A_{xn} ∧ ¬A_{y1} ∧ ... ∧ ¬A_{ym} (i.e., the number of tuples in the table containing X but not Y − X) can be inferred from the supports of X, Y, and all itemsets Z such that X ⊂ Z ⊂ Y. This observation allows stating that a set of itemsets satisfies k-anonymity only if all itemsets, as well as the patterns derivable from them, have support greater than or equal to k.

As an example, consider the private table PT in Figure 5.13(a), where all attributes can assume two distinct values. This table can be transformed into the binary table T in Figure 5.13(b), where A corresponds to 'Marital status = been married', B corresponds to 'Sex = M', and C corresponds to 'Hours = [40,100)'.

Marital status   Sex   Hours      #tuples
been married     M     [1-40)     12
been married     M     [40-100)   17
been married     F     [1-40)     2
been married     F     [40-100)   9
never married    M     [40-100)   26
(a) PT

A  B  C  #tuples
1  1  0  12
1  1  1  17
1  0  0  2
1  0  1  9
0  1  1  26
(b) T

Figure 5.13. An example of binary table

Figure 5.14 reports the lattice of all itemsets derivable from T together with their supports: {A, B, C} 17; {A, B} 29; {A, C} 26; {B, C} 43; {A} 40; {B} 55; {C} 52; {∅} 66.

Figure 5.14. Itemsets extracted from the table in Figure 5.13(b)

Assume that all itemsets with support greater than or equal to the threshold σ = 40, represented in Figure 5.15(a), are of interest, and that k = 10.

(a) {B, C} 43; {A} 40; {B} 55; {C} 52; {∅} 66
(b) {B, C} 43; {A} 40; {B} 55; {C} 62; {∅} 86

Figure 5.15. Itemsets with support at least equal to 40 (a) and corresponding anonymized itemsets (b)

The itemsets in Figure 5.15(a) present two inference channels. The first inference is obtained through itemsets X1 = {C}, with support 52, and Y1 = {B, C}, with support 43. According to the observation previously mentioned, since X1 ⊂ Y1, we can infer that pattern C ∧ ¬B has support 52 − 43 = 9. The second inference channel is obtained through itemsets X2 = {∅}, with support 66, Y2 = {B, C}, with support 43, and all itemsets Z such that X2 ⊂ Z ⊂ Y2, that is, itemsets {B}, with support 55, and {C}, with support 52. The support of pattern ¬B ∧ ¬C can then be obtained by applying again the observation previously mentioned.
Indeed, from {B, C} and {B} we infer pattern B ∧ ¬C with support 55 − 43 = 12, and from {B, C} and {C} we infer pattern ¬B ∧ C with support 52 − 43 = 9. Since the support of itemset {∅} corresponds to the total number of tuples in the binary table, the support of ¬B ∧ ¬C is computed by subtracting the supports of B ∧ ¬C (12), ¬B ∧ C (9), and B ∧ C (43) from the support of {∅}, that is, 66 − 12 − 9 − 43 = 2. The result is that the release of the itemsets in Figure 5.15(a) would not satisfy k-anonymity for any k > 2.

In [9] the authors present an algorithm for detecting inference channels that is based on a classical data mining solution for concisely representing all frequent itemsets (closed itemsets [24]) and on the definition of maximal inference channels. In the same work, the authors propose to block possible inference channels violating k-anonymity by modifying the support of the involved itemsets. In particular, an inference channel due to a pair of itemsets X = {A_{x1}, ..., A_{xn}} and Y = {A_{x1}, ..., A_{xn}, A_{y1}, ..., A_{ym}} is blocked by increasing the support of X by k. In addition, to avoid contradictions among the released itemsets, the support of all subsets of X is also increased by k. For instance, with respect to the previous two inference channels, since k is equal to 10, the support of itemset {C} is increased by 10 and the support of {∅} is increased by 20, because {∅} is involved in both channels. Figure 5.15(b) illustrates the resulting anonymized itemsets. Another possible strategy for blocking channels consists in decreasing the support of the involved itemsets to zero. Note that this basically corresponds to removing some tuples from the original table.

5.7.2 Enforcing k-Anonymity on Decision Trees

As for association rules, a decision tree satisfies k-anonymity for the private table PT from which the tree has been built if no information in the tree allows inferring quasi-identifier values that have fewer than k occurrences in the private table PT. Again, as for association rules, k-anonymity breaches can be caused by individual pieces of information or by combinations of apparently anonymous information. In the following, we briefly discuss the problem, distinguishing two cases depending on whether the decision tree reports frequency information also for the internal nodes or for the leaves only.

Let us first consider the case where the tree reports frequency information for all the nodes in the tree. An example of such a tree is reported in Figure 5.9. With a reasoning similar to that followed for itemsets, given a k, all nodes with a number of occurrences lower than k are unsafe, as they breach k-anonymity. For instance, the fourth leaf (reachable through path ⟨F, 35⟩) is unsafe for any k-anonymity requirement higher than 2. Again, with a reasoning similar to that followed for itemsets, combinations of nodes that allow inferring patterns of tuples containing quasi-identifying attributes with a number of occurrences lower than k also breach k-anonymity for the given k. For instance, the nodes corresponding to paths ⟨F⟩ and ⟨F, 50⟩, which taken individually would appear to satisfy any k-anonymity constraint for k ≤ 9, considered in combination violate any k-anonymity for k > 2, since their combination allows inferring that there are no more than two tuples in the table referring to females working a number of hours different from 50. It is interesting to draw a relationship between decision trees and itemsets.
In particular, any node in the tree corresponds to an itemset dictated by the path to reach the node. For instance, with reference to the tree in Figure 5.9, the nodes correspond to the itemsets {}, {M}, {M, married}, {M, divorced}, {M, single}, {F}, {F, 35}, {F, 50}, where the support of each itemset is the sum of the Ys and Ns in the corresponding node. This observation can be exploited for translating approaches for sanitizing itemsets to the sanitization of decision trees (or vice versa). With respect to blocking inference channels, different approaches can be used to anonymize decision trees, including suppression of unsafe nodes, as well as of other nodes as needed to block combinations breaching anonymity (secondary suppression). To illustrate, suppose that 3-anonymity is to be guaranteed. Figure 5.16 reports a 3-anonymized version of the tree in Figure 5.9. Here, besides suppressing node ⟨F, 35⟩, its sibling ⟨F, 50⟩ has been suppressed to block the inference channel described above.

Sex (32Y, 34N)
  M: Marital status (30Y, 25N)
    married: 8Y, 2N
    divorced: 16Y, 3N
    single: 6Y, 20N
  F: 2Y, 9N

Figure 5.16. 3-anonymous version of the tree of Figure 5.9

Let us now consider the case where the tree reports frequency information only for the leaf nodes. Again, there is an analogy with the itemset problem, with the additional consideration that, in this case, the itemsets are such that none of them is a subset of another one. It is therefore quite interesting to note that the set of patterns of tuples identified by the tree nodes directly corresponds to a generalized version of the private table PT, where some values are suppressed (CG_CS). This property derives from the fact that, in this case, every tuple in PT satisfies exactly one pattern (path to a leaf). To illustrate, consider the decision tree in Figure 5.17, obtained from the tree in Figure 5.9 by suppressing the occurrences in non-leaf nodes. Each leaf in the tree corresponds to a generalized tuple reporting the values given by the path (for the attributes appearing in the path). The number of occurrences of such a generalized tuple is reported in the leaf. If a quasi-identifier attribute does not appear along the path, its value is set to ∗. As a particular case, if every path in the tree contains all the quasi-identifier attributes and puts conditions on specific values, the generalization coincides with the private table PT. For instance, Figure 5.18 reports the table containing the tuple patterns that can be derived from the tree in Figure 5.17, and which corresponds to a generalization of the original private table PT in Figure 5.1. The relationship between trees and generalized tables is very important, as it allows us to express the protection enjoyed by a decision tree in terms of the generalized table corresponding to it, with the advantage of possibly exploiting classical k-anonymization approaches referred to the private table. In particular, this observation allows us to identify as unsafe all and only those nodes corresponding to tuples whose number of occurrences is lower than k. In other words, in this case (unlike the case where the frequencies of internal node values are reported) there is no risk that a combination of nodes, each with a number of occurrences higher than or equal to k, can breach k-anonymity.
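The node-itemset correspondence can be made concrete with a short sketch; the nested-tuple encoding of the tree of Figure 5.9 is ours and purely illustrative:

```python
def paths_to_itemsets(tree, prefix=()):
    """Map every node of a decision tree to the itemset given by the edge
    labels on its path; supports come from the (Y, N) counts in the node."""
    label, (y, n), children = tree
    yield frozenset(prefix), y + n
    for edge, child in children.items():
        yield from paths_to_itemsets(child, prefix + (edge,))

# The tree of Figure 5.9: (splitting attribute or "leaf", (Y, N), children).
tree = ("Sex", (32, 34), {
    "M": ("Marital status", (30, 25), {
        "married": ("leaf", (8, 2), {}),
        "divorced": ("leaf", (16, 3), {}),
        "single": ("leaf", (6, 20), {}),
    }),
    "F": ("Hours", (2, 9), {
        "35": ("leaf", (0, 2), {}),
        "50": ("leaf", (2, 7), {}),
    }),
})

for itemset, support in paths_to_itemsets(tree):
    if support < 3:
        print(set(itemset), support)   # {'F', '35'} 2: unsafe for 3-anonymity
```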
Again, different strategies can be applied to protect decision trees in this case, including exploiting the correspondence just drawn, translating onto the tree the generalization and suppression operations that could be executed on the private table.

Sex
  M: Marital status
    married: 8Y, 2N
    divorced: 16Y, 3N
    single: 6Y, 20N
  F: Hours
    35: 0Y, 2N
    50: 2Y, 7N

Figure 5.17. Suppression of occurrences in non-leaf nodes in the tree in Figure 5.9

Marital status   Sex   Hours   #tuples (Hyp. values)
divorced         M     ∗       19 (16Y, 3N)
∗                F     35      2 (0Y, 2N)
married          M     ∗       10 (8Y, 2N)
∗                F     50      9 (2Y, 7N)
single           M     ∗       26 (6Y, 20N)

Figure 5.18. Table inferred from the decision tree in Figure 5.17

To illustrate, consider the tree in Figure 5.17; the corresponding generalized table is in Figure 5.18, which clearly violates any k-anonymity for k > 2. Figure 5.19 illustrates a sanitized version of the tree guaranteeing 11-anonymity, obtained by suppressing the splitting node Hours and combining nodes ⟨M, married⟩ and ⟨M, divorced⟩ into a single node.

Sex
  M: Marital status
    been married: 24Y, 5N
    single: 6Y, 20N
  F: 2Y, 9N

Figure 5.19. 11-anonymous version of the tree in Figure 5.17

Note how the two operations correspond, with reference to the starting table in Figure 5.18, to an attribute generalization over Hours and a cell generalization over Marital status, respectively. Figure 5.20 illustrates the table corresponding to the tree in Figure 5.19.

Marital status   Sex   Hours   #tuples (Hyp. values)
been married     M     ∗       29 (24Y, 5N)
∗                F     ∗       11 (2Y, 9N)
single           M     ∗       26 (6Y, 20N)

Figure 5.20. Table inferred from the decision tree in Figure 5.19

The problem of sanitizing decision trees has been studied in the literature by Friedman et al. [15], who proposed a method for directly building a k-anonymous decision tree from a private table PT. The proposed algorithm is basically an improvement of the classical decision tree building algorithm, combining mining and anonymization in a single process. At initialization time, the decision tree is composed of a unique root node, representing all the tuples in PT. At each step, the algorithm inserts a new splitting node in the tree, by choosing the attribute in the quasi-identifier that is most useful for classification purposes, and updates the tree accordingly. If the tree thus obtained is not k-anonymous, the node insertion is rolled back. The algorithm stops when no node can be inserted without violating k-anonymity, or when the classification obtained is considered satisfactory.

5.8 Conclusions

A main challenge in data mining is to enable the legitimate usage and sharing of mined information while at the same time guaranteeing proper protection of the original sensitive data. In this chapter, we have discussed how k-anonymity can be combined with data mining for protecting the identity of the respondents to whom the data being mined refer. We have described the possible threats to k-anonymity that can arise from performing mining on a collection of data, and characterized the two main approaches for combining k-anonymity and data mining. We have also discussed different methods that can be used for detecting k-anonymity violations, and consequently eliminating them, in association rule mining and classification mining.
k-anonymous data mining is, however, a recent research area, and many issues are still to be investigated, such as: the combination of k-anonymity with other possible data mining techniques; the investigation of new approaches for detecting and blocking k-anonymity violations; and the extension of current approaches to protect the released data mining results against attribute, in contrast to identity, disclosure [21].

Acknowledgements

This work was supported in part by the European Union under contract IST-2002-507591, by the Italian Ministry of Research Fund for Basic Research (FIRB) under project "RBNE05FKZ2", and by the Italian MIUR under project 2006099978.

References

[1] Charu C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proc. of the 31st VLDB Conference, Trondheim, Norway, September 2005.
[2] Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, and An Zhu. Anonymizing tables. In Proc. of the 10th International Conference on Database Theory (ICDT'05), Edinburgh, Scotland, January 2005.
[3] Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, and An Zhu. Approximation algorithms for k-anonymity. Journal of Privacy Technology, November 2005.
[4] Dakshi Agrawal and Charu C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proc. of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Santa Barbara, California, June 2001.
[5] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, Santiago, Chile, September 1994.
[6] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, Texas, May 2000.
[7] Maurizio Atzori, Francesco Bonchi, Fosca Giannotti, and Dino Pedreschi. Blocking anonymity threats raised by frequent itemset mining. In Proc. of the 5th IEEE International Conference on Data Mining (ICDM 2005), Houston, Texas, November 2005.
[8] Maurizio Atzori, Francesco Bonchi, Fosca Giannotti, and Dino Pedreschi. k-anonymous patterns. In Proc. of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, October 2005.
[9] Maurizio Atzori, Francesco Bonchi, Fosca Giannotti, and Dino Pedreschi. Anonymity preserving pattern discovery. VLDB Journal, November 2006.
[10] Roberto J. Bayardo and Rakesh Agrawal. Data privacy through optimal k-anonymization. In Proc. of the International Conference on Data Engineering (ICDE'05), Tokyo, Japan, April 2005.
[11] Valentina Ciriani, Sabrina De Capitani di Vimercati, Sara Foresti, and Pierangela Samarati. k-anonymity. In T. Yu and S. Jajodia, editors, Security in Decentralized Data Management. Springer, Berlin Heidelberg, 2007.
[12] Valentina Ciriani, Sabrina De Capitani di Vimercati, Sara Foresti, and Pierangela Samarati. Microdata protection. In T. Yu and S. Jajodia, editors, Security in Decentralized Data Management. Springer, Berlin Heidelberg, 2007.
[13] Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and Johannes Gehrke. Privacy preserving mining of association rules. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 2002.
[14] Federal Committee on Statistical Methodology. Statistical policy working paper 22, May 1994. Report on Statistical Disclosure Limitation Methodology.
[15] Arik Friedman, Assaf Schuster, and Ran Wolff. Providing k-anonymity in data mining. VLDB Journal. Forthcoming.
[16] Benjamin C.M. Fung, Ke Wang, and Philip S. Yu. Anonymizing classification data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering, 19(5):711-725, May 2007.
[17] Michael R. Garey and David S. Johnson. Computers and Intractability. W. H. Freeman & Co., New York, NY, USA, 1979.
[18] Kristen LeFevre, David J. DeWitt, and Raghu Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In Proc. of the ACM SIGMOD Conference on Management of Data, Baltimore, Maryland, June 2005.
[19] Kristen LeFevre, David J. DeWitt, and Raghu Ramakrishnan. Mondrian multidimensional k-anonymity. In Proc. of the International Conference on Data Engineering (ICDE'06), Atlanta, Georgia, April 2006.
[20] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3):177-206, June 2002.
[21] Ashwin Machanavajjhala, Johannes Gehrke, and Daniel Kifer. ℓ-diversity: Privacy beyond k-anonymity. In Proc. of the International Conference on Data Engineering (ICDE'06), Atlanta, Georgia, April 2006.
[22] Adam Meyerson and Ryan Williams. On the complexity of optimal k-anonymity. In Proc. of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Paris, France, June 2004.
[23] Hyoungmin Park and Kyuseok Shim. Approximate algorithms for k-anonymity. In Proc. of the ACM SIGMOD Conference on Management of Data, Beijing, China, June 2007.
[24] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. In Proc. of the 7th International Conference on Database Theory (ICDT'99), Jerusalem, Israel, January 1999.
[25] Rajeev Rastogi and Kyuseok Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. of the 24th VLDB Conference, New York, September 1998.
[26] Pierangela Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010-1027, November 2001.
[27] Pierangela Samarati and Latanya Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In Proc. of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, page 188, Seattle, WA, 1998.
[28] Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized association rules. In Proc. of the 21st VLDB Conference, Zurich, Switzerland, September 1995.
[29] Ke Wang, Philip S. Yu, and Sourav Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In Proc. of the 4th IEEE International Conference on Data Mining (ICDM 2004), Brighton, UK, November 2004.
[30] Zhiqiang Yang, Sheng Zhong, and Rebecca N. Wright. Privacy-preserving classification of customer data without loss of accuracy. In Proc. of the 5th SIAM International Conference on Data Mining, Newport Beach, California, April 2005.
[31] Mohammed J. Zaki and Ching-Jui Hsiao. CHARM: An efficient algorithm for closed itemset mining. In Proc. of the 2nd SIAM International Conference on Data Mining, Arlington, Virginia, April 2002.

Chapter 6

A Survey of Randomization Methods for Privacy-Preserving Data Mining

Charu C. Aggarwal
IBM T. J.
Watson Research Center
Hawthorne, NY 10532
charu@us.ibm.com

Philip S. Yu
University of Illinois at Chicago
Chicago, IL 60607
psyu@us.ibm.com

Abstract: A well known method for privacy-preserving data mining is that of randomization. In randomization, we add noise to the data so that the behavior of the individual records is masked. However, the aggregate behavior of the data distribution can be reconstructed by subtracting out the noise from the data. The reconstructed distribution is often sufficient for a variety of data mining tasks such as classification. In this chapter, we will provide a survey of the randomization method for privacy-preserving data mining.

Keywords: Randomization, privacy quantification, perturbation.

6.1 Introduction

In the randomization method, we add noise to the data in order to mask the values of the records. The noise added is sufficiently large that the individual values of the records can no longer be recovered. However, the probability distribution of the aggregate data can be recovered and subsequently used for privacy-preserving data mining purposes. The earliest work on randomization may be found in [16, 12], in which it has been used in order to eliminate evasive answer bias. In [3] it has been shown how the reconstructed distributions may be used for data mining. The specific problem discussed in [3] is that of classification, though the approach can easily be extended to a variety of other problems such as association rule mining [8, 24].

The method of randomization can be described as follows. Consider a set of data records denoted by X = {x_1 ... x_N}. For each record x_i ∈ X, we add a noise component drawn from the probability distribution f_Y(y). These noise components are drawn independently, and are denoted y_1 ... y_N. Thus, the new set of distorted records is x_1 + y_1 ... x_N + y_N. We denote this new set of records by z_1 ... z_N. In general, it is assumed that the variance of the added noise is large enough that the original record values cannot be easily guessed from the distorted data. Thus, the original records cannot be recovered, but the distribution of the original records can be recovered. We note that the addition of X and Y creates a new distribution Z. We know N instantiations of this new distribution, and can therefore estimate it approximately. Furthermore, since the distribution of Y is publicly known, we can estimate the distribution obtained by subtracting Y from Z. In a later section, we will discuss more accurate strategies for distribution estimation. The above-mentioned technique is an additive strategy for randomization. In the multiplicative strategy, it is possible to multiply the records by random vectors in order to provide the final representation of the data. Thus, this approach uses a random-projection kind of approach in order to perform the privacy-preserving transformation. The resulting data can be reconstructed within a certain variance, depending upon the number of components of the multiplicative perturbation. A tiny numeric sketch of the additive strategy is reported below.
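In this sketch (illustrative parameters only, ours), each record is perturbed independently with noise from a publicly known distribution, and only the distorted values z_i = x_i + y_i are collected; individual values are masked, yet simple aggregates of X remain estimable.

```python
import random

random.seed(0)

N = 10000
x = [random.gauss(30, 5) for _ in range(N)]      # original (private) records
y = [random.uniform(-50, 50) for _ in range(N)]  # noise from a public distribution f_Y
z = [xi + yi for xi, yi in zip(x, y)]            # distorted records: all that is released

# The noise has mean zero, so the sample mean of Z estimates the mean of X.
print(sum(z) / N)   # close to 30
```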
This is a key advantage of randomization methods, though it comes at the expense that there are no guarantees against re-identification of the data in the presence of public information. Another key property of the randomization method is that the original records are not used after the transformation. Rather, the data mining algorithms use aggregate distributions of the data in order to perform the mining process.

This chapter is organized as follows. In the next section, we will discuss a number of reconstruction methods for randomization. We will also discuss the issue of optimality and utility of randomization methods. In section 3, we will discuss a number of applications of randomization. We will show how the approach can be used for applications such as classification and association rule mining. In section 4, we will discuss issues surrounding the quantification of privacy-preserving data mining algorithms. In section 5, we will discuss a number of adversarial attacks on the randomization method. In section 6, we discuss applications of the randomization method to the case of time series data. In section 7, we discuss the method of multiplicative perturbations and its applications to a variety of data mining algorithms. The conclusions and summary are presented in section 8.

6.2 Reconstruction Methods for Randomization

In this section, we will discuss reconstruction algorithms for the randomization method. We note that the perturbed data distribution Z can be obtained by adding the distributions of the original data X and that of the perturbation Y. Therefore, we have:

Z = X + Y
X = Z - Y

We note that only the distribution of Y is known explicitly. The distribution of X is unknown, and N instantiations of the probability distribution Z are known. These N instantiations can be used to construct an estimate of the probability distribution Z. When the value of N is large, this estimate can be quite accurate. Once Z is known, we can subtract Y from it in order to obtain the probability distribution of X. For modest values of N, the errors in the estimation of Z can be quite large, and these errors may get magnified on subtraction of Y. Therefore, a more indirect method is desirable in order to estimate the probability distribution of X.

A pair of closely related iterative methods have been discussed in [3, 5] for the approximation of the corresponding probability distributions. The method in [3] uses the Bayes rule for distribution approximation, whereas that in [5] uses the EM method. In this section, we will describe both methods. First, we will discuss the method in [3] for distribution reconstruction.

6.2.1 The Bayes Reconstruction Method

Let \hat{f} and \hat{F} be the estimated density and cumulative distribution functions of the reconstructed distribution. Then, we can use the Bayes formula in order to derive an estimate \hat{F}, using the first observed value z1:

\hat{F}(a) = \int_{-\infty}^{a} f_{X_1}(w \mid X_1 + Y_1 = z_1)\, dw \qquad (6.1)

We can expand the above expression using the Bayes rule (in conjunction with the independence of the random variables Y and X) in order to construct the following expression for \hat{F}(a):
\hat{F}(a) = \frac{\int_{-\infty}^{a} f_Y(z_1 - w) \cdot f_X(w)\, dw}{\int_{-\infty}^{\infty} f_Y(z_1 - w) \cdot f_X(w)\, dw} \qquad (6.2)

We note that the above expression for \hat{F}(a) was derived using a single observation z1. In practice, the average over the observations z1 ... zN can be used in order to construct the estimated cumulative distribution \hat{F}(a). Thus, we can construct the estimated distribution as follows:

\hat{F}(a) = \frac{1}{N} \sum_{i=1}^{N} \frac{\int_{-\infty}^{a} f_Y(z_i - w) \cdot f_X(w)\, dw}{\int_{-\infty}^{\infty} f_Y(z_i - w) \cdot f_X(w)\, dw} \qquad (6.3)

The corresponding density distribution can be obtained by differentiating \hat{F}(a). This differentiation removes the integral sign from the numerator, with the corresponding instantiation of w to a. Therefore, we have:

\hat{f}(a) = \frac{1}{N} \sum_{i=1}^{N} \frac{f_Y(z_i - a) \cdot f_X(a)}{\int_{-\infty}^{\infty} f_Y(z_i - w) \cdot f_X(w)\, dw} \qquad (6.4)

We note that it is tricky to compute \hat{f}(\cdot) using the above equation, since we do not know the distribution f_X on the right-hand side. This suggests an iterative method for computing the distribution \hat{f}. We start off by setting \hat{f} to the uniform distribution, and iteratively update it using the equation above. The algorithm for computing \hat{f}(a) for a particular value of a is as follows:

Set \hat{f} to be the uniform distribution;
repeat
    Update \hat{f}(a) = \frac{1}{N} \sum_{i=1}^{N} \frac{f_Y(z_i - a) \cdot \hat{f}(a)}{\int_{-\infty}^{\infty} f_Y(z_i - w) \cdot \hat{f}(w)\, dw}
until convergence

We note that we cannot compute the value of \hat{f}(a) over all possible (infinitely many) values of a in a continuous domain. Therefore, we partition the domain of X into a number of intervals [l1, u1] ... [ln, un], and assume that the function is uniform over each interval. For each interval [li, ui], the value of a in the above equation is picked to be (li + ui)/2. Thus, in each iteration, we use n different values of a corresponding to the intervals. The density functions on the right-hand side can be computed using the mean values over the corresponding intervals.

The algorithm is terminated when the distribution does not change significantly over successive steps. A χ² test was used to compare the two distributions. The implementation in [3] terminated the algorithm when the difference between successive estimates was within 1% of the threshold of the χ² test. While this algorithm is known to perform effectively in practice, the work in [3] does not prove it to be a provably convergent solution. In [5], an Expectation Maximization (EM) algorithm has been proposed which converges to a provably optimal solution. It is also shown in [5] that the Bayes algorithm of [3] is actually an approximation of the Expectation Maximization algorithm proposed in [5]. This is one of the reasons why the Bayes method proposed in [3] is so robust in practice. A minimal implementation sketch of the Bayes iteration is given below.
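The following sketch implements the discretized Bayes iteration described above (NumPy). The grid bounds, bin count, and the simple maximum-change stopping rule are illustrative assumptions, and the renormalization step is a numerical convenience rather than part of the method in [3].

import numpy as np

def bayes_reconstruct(z, noise_pdf, lo, hi, n_bins=50, n_iters=200, tol=1e-6):
    # iterate the Bayes update over interval midpoints until f stops changing
    edges = np.linspace(lo, hi, n_bins + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    f = np.full(n_bins, 1.0 / (hi - lo))          # start from the uniform density
    for _ in range(n_iters):
        kernel = noise_pdf(z[:, None] - mids[None, :])        # f_Y(z_i - a_j)
        denom = (kernel * f[None, :] * width).sum(axis=1)     # one integral per z_i
        f_new = (kernel * f[None, :] / denom[:, None]).mean(axis=0)
        f_new /= (f_new * width).sum()            # keep it a proper density
        if np.abs(f_new - f).max() < tol:
            return mids, f_new
        f = f_new
    return mids, f

# illustrative run with standard Gaussian noise
gauss_pdf = lambda t: np.exp(-t ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0, 1, 500), rng.uniform(4, 5, 500)])
mids, f_hat = bayes_reconstruct(x + rng.normal(0, 1, x.size), gauss_pdf, -3, 8)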
6.2.2 The EM Reconstruction Method

In this subsection, we will discuss the EM algorithm for distribution reconstruction. Since the function f_X(x) is defined over a continuous domain, we need to parameterize and discretize it for the purpose of any numerical estimation method. We assume that the data domain Ω_X can be discretized into K intervals Ω_1 ... Ω_K, where \bigcup_{i=1}^{K} Ω_i = Ω_X. Let m_i = m(Ω_i) be the length of the interval Ω_i. We assume that f_X(x) is constant over Ω_i, with the corresponding density value equal to θ_i. Such a form restricts f_X(x) to a class parameterized by the finite set of parameters Θ = {θ_1, θ_2, ..., θ_K}. In order to explicitly denote the parametric dependence of the density function on Θ, we will use the notation f_{X;Θ}(x) for the density function of X. Therefore, we have f_{X;Θ}(x) = \sum_{i=1}^{K} θ_i I_{Ω_i}(x), where I_{Ω_i}(x) = 1 if x ∈ Ω_i and 0 otherwise. Since f_{X;Θ}(x) is a density, it follows that \sum_{i=1}^{K} θ_i m(Ω_i) = 1. By choosing K large enough, density functions of this form can approximate any density function with arbitrary precision.

After this parameterization, the algorithm proceeds to estimate Θ, and thereby determine \hat{f}_{X;\hat{Θ}}(x). Let \hat{Θ} = {\hat{θ}_1, \hat{θ}_2, ..., \hat{θ}_K} be the estimate of these parameters produced by the reconstruction algorithm. Given a set of observations Z = z, we would ideally like to find the maximum-likelihood (ML) estimate Θ_ML = argmax_Θ ln f_{Z;Θ}(z). The ML estimate has many attractive properties such as consistency, asymptotic unbiasedness, and asymptotic minimum variance among unbiased estimates. However, it is not always possible to find Θ_ML directly, and this turns out to be the case with the f_{Z;Θ}(z) given above. In order to achieve this goal, we will derive a reconstruction algorithm which fits into the broad framework of Expectation Maximization (EM) algorithms. The algorithm proceeds as if a more comprehensive set of data, say D = d, were observable, and maximizes ln f_{D;Θ}(d) over all values of Θ (M-step). Since d is in fact unavailable, it replaces ln f_{D;Θ}(d) by its conditional expected value given Z = z and the current estimate of Θ (E-step). D is chosen to make the E-step and M-step easy to compute. Here, X = x is used as the more comprehensive set of data; as shown below, this choice results in a computationally efficient algorithm.

More formally, we define a Q function as follows:

Q(Θ, \hat{Θ}) = E[\ln f_{X;Θ}(X) \mid Z = z; \hat{Θ}] \qquad (6.5)

Thus, Q(Θ, \hat{Θ}) is the expected value of ln f_{X;Θ}(X) computed with respect to f_{X|Z=z;\hat{Θ}}, the density of X given Z = z and parameter vector \hat{Θ}. After the initialization of \hat{Θ} to a nominal value Θ^0, the EM algorithm iterates over the following two steps:

1 E-step: Compute Q(Θ, Θ^k).
2 M-step: Update Θ^{k+1} = argmax_Θ Q(Θ, Θ^k).

The above discussion provides the general framework of EM algorithms; the actual details of the E-step and M-step require a derivation which is problem specific. Similarly, the precise convergence properties of an EM algorithm are rather sensitive to the problem and its corresponding derivation. The values of Q(Θ, \hat{Θ}) during the E-step and the M-step of the reconstruction algorithm are derived in [5].

Theorem 6.1 The value of Q(Θ, \hat{Θ}) during the E-step of the reconstruction algorithm is given by:

Q(Θ, \hat{Θ}) = \sum_{i=1}^{K} ψ_i(z; \hat{Θ}) \ln θ_i, \quad \text{where} \quad ψ_i(z; \hat{Θ}) = \hat{θ}_i \sum_{j=1}^{N} \frac{\Pr(Y \in z_j - Ω_i)}{f_{Z;\hat{Θ}}(z_j)}

In the next result, we give the value of Θ that maximizes Q(Θ, \hat{Θ}).

Theorem 6.2 The value of Θ which maximizes Q(Θ, \hat{Θ}) during the M-step of the reconstruction algorithm is given by:

θ_i = \frac{ψ_i(z; \hat{Θ})}{m_i N}

with ψ_i(z; \hat{Θ}) as defined above.

Now, we are in a position to describe the EM algorithm for the reconstruction problem.

1. Initialize θ_i^0 = 1/K, i = 1, 2, ..., K; k = 0;
2. Update Θ as follows: θ_i^{(k+1)} = ψ_i(z; Θ^k) / (m_i N);
3. k = k + 1;
4. If the termination criterion is not met, return to Step 2.

A minimal sketch of these updates appears below.
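A compact sketch of the E- and M-step updates follows (NumPy/SciPy). The Gaussian noise, equal-width intervals, and the stopping tolerance are illustrative assumptions; the update itself is the one stated in Theorems 6.1 and 6.2.

import numpy as np
from scipy.stats import norm

def em_reconstruct(z, noise_cdf, edges, n_iters=500, tol=1e-9):
    # EM updates: theta_i^(k+1) = psi_i(z; Theta^k) / (m_i * N)
    m = np.diff(edges)                      # interval lengths m_i
    N = z.size
    # Pr(Y in z_j - Omega_i) = F_Y(z_j - l_i) - F_Y(z_j - u_i)
    P = noise_cdf(z[:, None] - edges[None, :-1]) - noise_cdf(z[:, None] - edges[None, 1:])
    theta = np.full(m.size, 1.0 / (edges[-1] - edges[0]))    # uniform start
    for _ in range(n_iters):
        fz = P @ theta                      # f_{Z;Theta}(z_j) = sum_i theta_i Pr(...)
        psi = theta * (P / fz[:, None]).sum(axis=0)          # psi_i(z; Theta)
        theta_new = psi / (m * N)
        if np.abs(theta_new - theta).max() < tol:
            return theta_new
        theta = theta_new
    return theta

rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0, 1, 500), rng.uniform(4, 5, 500)])
z = x + rng.normal(0, 1, x.size)
theta = em_reconstruct(z, norm(0, 1).cdf, np.linspace(-3, 8, 45))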
One key observation is that the EM algorithm is actually a refined version of the Bayes method discussed in [3]. The key difference between the two methods lies in how the approximation of the values within an interval is treated. While the Bayes method uses the crude estimate of the midpoint of the interval, the EM algorithm is more refined about it. While the Bayes method has not been shown to provably converge, it has always been observed to converge empirically. On the other hand, the results below show that the EM algorithm converges to a provably optimal solution. The close relationship between the two methods is the reason that the Bayes method is always observed to empirically converge to an approximately optimal solution. The termination criterion for this method is based on how much Θ^k has changed since the last iteration.

It has been shown in [5] that the EM algorithm converges to the true distribution of the random variable X. We summarize the result as follows:

Theorem 6.3 The EM sequence {Θ^{(k)}} for the reconstruction algorithm converges to the unique maximum likelihood estimate Θ_ML.

The above result leads to the following desirable property of the EM algorithm.

Observation 6.2.1 When there is a very large number of data observations, the EM algorithm provides zero information loss.

This is because, as the number of observations increases, Θ_ML → Θ. Therefore, the original and estimated distributions become the same (subject to the discretization needed for any numerical estimation algorithm), resulting in zero information loss.

6.2.3 Utility and Optimality of Randomization Models

We note that the use of different perturbing distributions results in different levels of effectiveness of the randomization scheme. A key issue is how the randomization may be performed in order to optimize the tradeoff between privacy and accuracy. Clearly, the provision of a higher level of accuracy for the same privacy level is desirable from the point of view of maintaining greater utility of the randomized data. In order to achieve this goal, the work in [30] defines a randomization scheme in which the noise added to a given observation depends upon the value of the underlying data record as well as a user-defined parameter. Thus, in this case, the noise is conditional on the value of the record itself. This is a more general and flexible model for the randomization process. We note that this approach still does not depend upon the behavior of the other records, and can therefore be performed at data collection time. Methods are defined in [30] in order to perform reconstruction of the data with the use of this kind of randomization. The reconstruction methods proposed in [30] are designed with the use of kernel estimators or iterative EM methods. In [30], a number of information loss and privacy metrics are used to quantify the tradeoff between privacy and optimality. The approach explores the issue of optimizing the information loss within a privacy constraint, or optimizing the privacy within an information loss constraint. A number of simulations are presented in [30] to illustrate the effectiveness of the approach.

6.3 Applications of Randomization

The randomization method has been extended to a variety of data mining problems. In [3], it was shown how to use the approach for classification. A number of other techniques [29, 30] have also been proposed which work well over a variety of different classifiers.
Techniques have also been proposed for privacy-preserving methods of improving the effectiveness of classifiers. For example, the work in [10] proposes methods for privacy-preserving boosting of classifiers. Methods for privacy-preserving mining of association rules have been proposed in [8, 24]. The problem of association rules is especially challenging because of the discrete nature of the attributes corresponding to the presence or absence of items. In order to deal with this issue, the randomization technique needs to be modified slightly. Instead of adding quantitative noise, random items are dropped or included with a certain probability. The perturbed transactions are then used for aggregate association rule mining. This technique has been shown to be extremely effective in [8]. The randomization approach has also been extended to other applications such as OLAP [4] and SVD-based collaborative filtering [22]. We will discuss details of many of these techniques below.

We note that a variety of other randomization schemes exist for privacy-preserving data mining. The above-mentioned scheme uses a single perturbing distribution in order to perform the randomization over the entire data. The randomization scheme can be tailored much more effectively by using mixture models [30] in order to perform the privacy-preservation. The work in [30] shows that this approach has a number of optimality properties in terms of the quality of the perturbation.

6.3.1 Privacy-Preserving Classification with Randomization

A number of methods have been proposed for privacy-preserving classification with randomization. In [3], a method has been discussed for decision tree classification with the use of the aggregate distributions reconstructed from the randomized data. The key idea is to construct the distributions separately for the different classes. Then, the splitting condition for the decision tree uses the relative presence of the different classes, which is derived from the aggregate distributions. It has been shown in [3] that such an approach can be used in order to design very effective classifiers. Since the probabilistic behavior is encoded in aggregate data distributions, it can also be used to construct a naive Bayes classifier. In such a classifier [29], the approach of randomized response with partial hiding is used in order to perform the classification. It has been shown in [29] that this approach is effective both empirically and analytically.

6.3.2 Privacy-Preserving OLAP

In [4], a randomization algorithm for distributed privacy-preserving OLAP is discussed. In this approach, each client independently perturbs their data before sending it to a centralized server. The technique uses local perturbation techniques in which the perturbation added to an element depends upon its initial value. A variety of reconstruction techniques are discussed in order to respond to different kinds of queries. The key in such queries is to develop effective algorithms for estimating the counts of different subcubes in the data. Such queries are typical in most OLAP applications. The approach has been shown in [4] to satisfy a number of privacy-breach guarantees.

The method in [4] uses an interesting technique called retention replacement perturbation. In retention replacement perturbation, each element from column j is retained with probability pj, or replaced with an element drawn from a selected probability density function; a minimal sketch follows.
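The sketch below illustrates retention replacement for a single column (NumPy); the retention probability and the uniform replacement distribution are illustrative assumptions, not parameters fixed by [4].

import numpy as np

def retention_replace(column, p_retain, replacement_sampler, rng=None):
    # keep each value with probability p_retain, otherwise draw a
    # fresh replacement from the selected distribution
    rng = rng or np.random.default_rng(0)
    keep = rng.random(column.size) < p_retain
    return np.where(keep, column, replacement_sampler(column.size, rng))

salaries = np.array([40.0, 95.0, 120.0, 60.0])     # hypothetical column
perturbed = retention_replace(
    salaries, p_retain=0.2,
    replacement_sampler=lambda n, r: r.uniform(0.0, 200.0, n))
print(perturbed)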
It has been shown in [4] that approximate probabilistic reconstructability is possible when at least a certain number of rows are present in the data. Methods have also been devised in [4] to express the estimated query results on the original table as a function of the query results on the perturbed table, and to reconstruct the original distribution as well as single-column and multiple-column aggregates. Techniques have also been devised in [4] for the perturbation of categorical data sets. In this case, the retention-replacement approach needs to be modified appropriately: the replacement step uses a random element to replace an element which is not retained.

6.3.3 Collaborative Filtering

A variety of collaborative filtering techniques have been discussed in [22, 23]. The collaborative filtering problem arises in the context of electronic commerce, when users choose to leave quantitative feedback (or ratings) about the products which they may like. In the collaborative filtering problem, we wish to predict the ratings of products for a particular user with the use of the ratings of users with similar profiles. Such predictions are useful for making recommendations that the user may like. In [23], a correlation-based collaborative filtering technique with randomization was proposed. In [22], an SVD-based collaborative filtering method was proposed using randomized perturbation techniques. Since the collaborative filtering technique is inherently one in which ratings from multiple users are incorporated, a client-server mechanism is used in order to perform the perturbation. The broad approach of the SVD-based collaborative filtering technique is as follows (a sketch of the client-side step appears after this list):

- The server decides on the nature (e.g., uniform or Gaussian) of the perturbing distribution along with the corresponding parameters. These parameters are transmitted to each user.
- Each user computes the mean and z-number for their ratings. The entries which are not rated are substituted with the mean for the corresponding ratings and a z-number of 0.
- Each user then adds random noise to all the ratings, and sends the disguised ratings to the server.
- The server receives the ratings from the different users and uses SVD on the disguised matrix in order to make predictions.
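A sketch of the client-side disguising step (NumPy); representing unrated items as NaN and using Gaussian noise are illustrative assumptions about details the chapter leaves open.

import numpy as np

def disguise_ratings(ratings, noise_std, rng=None):
    # z-score the rated entries; unrated entries (NaN) receive the mean,
    # i.e., a z-number of 0; then add noise with server-specified parameters
    rng = rng or np.random.default_rng(0)
    rated = ~np.isnan(ratings)
    mu, sigma = ratings[rated].mean(), ratings[rated].std()
    z = np.zeros_like(ratings)
    z[rated] = (ratings[rated] - mu) / sigma
    return z + rng.normal(0.0, noise_std, size=ratings.shape)

user_ratings = np.array([5.0, np.nan, 3.0, 1.0, np.nan])
print(disguise_ratings(user_ratings, noise_std=0.5))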
6.4 The Privacy-Information Loss Tradeoff

The quantity used to measure privacy should indicate how closely the original value of an attribute can be estimated. The work in [3] uses a measure that defines privacy as follows: if the original value can be estimated with c% confidence to lie in the interval [α1, α2], then the interval width (α2 - α1) defines the amount of privacy at the c% confidence level. For example, if the perturbing additive is uniformly distributed in an interval of width 2α, then α is the amount of privacy at confidence level 50% and 2α is the amount of privacy at confidence level 100%. However, this simple method of determining privacy can be subtly incomplete in some situations. This can best be explained by the following example.

Example 6.4 Consider an attribute X with the density function f_X(x) given by:

f_X(x) = 0.5   for 0 ≤ x ≤ 1
f_X(x) = 0.5   for 4 ≤ x ≤ 5
f_X(x) = 0     otherwise

Assume that the perturbing additive Y is distributed uniformly between [-1, 1]. Then, according to the measure proposed in [3], the amount of privacy is 2 at confidence level 100%.

However, after performing the perturbation and the subsequent reconstruction, the density function f_X(x) will be approximately revealed. Let us assume for a moment that a large amount of data is available, so that the distribution function is revealed to a high degree of accuracy. Since the distribution of the perturbing additive is publicly known, the two pieces of information can be combined to determine that if Z ∈ [-1, 2], then X ∈ [0, 1]; whereas if Z ∈ [3, 6], then X ∈ [4, 5]. Thus, in each case, the value of X can be localized to an interval of length 1. This means that the actual amount of privacy offered by the perturbing additive Y is at most 1 at confidence level 100%. We use the qualifier 'at most' since X can often be localized to an interval of length less than one. For example, if the value of Z happens to be -0.5, then the value of X can be localized to the even smaller interval [0, 0.5].

This example illustrates that the method suggested in [3] does not take into account the distribution of the original data. In other words, the (aggregate) reconstruction of the attribute values also provides a certain level of knowledge which can be used to guess a data value to a higher level of accuracy. To accurately quantify privacy, we need a method which takes such side information into account.

A key privacy measure [5] is based on the differential entropy of a random variable. The differential entropy h(A) of a random variable A is defined as follows:

h(A) = -\int_{Ω_A} f_A(a) \log_2 f_A(a)\, da \qquad (6.6)

where Ω_A is the domain of A. It is well known that h(A) is a measure of the uncertainty inherent in the value of A [26]. It can easily be seen that for a random variable U distributed uniformly between 0 and a, h(U) = \log_2(a). For a = 1, h(U) = 0.

In [5], it was proposed that 2^{h(A)} is a measure of the privacy inherent in the random variable A. This value is denoted by Π(A). Thus, a random variable U distributed uniformly between 0 and a has privacy Π(U) = 2^{\log_2(a)} = a. For a general random variable A, Π(A) denotes the length of the interval over which a uniformly distributed random variable has the same uncertainty as A.

Given a random variable B, the conditional differential entropy of A is defined as follows:

h(A|B) = -\int_{Ω_{A,B}} f_{A,B}(a, b) \log_2 f_{A|B=b}(a)\, da\, db \qquad (6.7)

Thus, the average conditional privacy of A given B is Π(A|B) = 2^{h(A|B)}. This motivates the following metric P(A|B) for the conditional privacy loss of A, given B:

P(A|B) = 1 - Π(A|B)/Π(A) = 1 - 2^{h(A|B)}/2^{h(A)} = 1 - 2^{-I(A;B)}

where I(A;B) = h(A) - h(A|B) = h(B) - h(B|A). I(A;B) is also known as the mutual information between the random variables A and B. Clearly, P(A|B) is the fraction of privacy of A which is lost by revealing B.

As an illustration, let us reconsider Example 6.4 given above. In this case, the differential entropy of X is given by:

h(X) = -\int_{Ω_X} f_X(x) \log_2 f_X(x)\, dx = -\int_{0}^{1} 0.5 \log_2 0.5\, dx - \int_{4}^{5} 0.5 \log_2 0.5\, dx = 1

Thus the privacy of X is Π(X) = 2^1 = 2. In other words, X has as much privacy as a random variable distributed uniformly in an interval of length 2. The density function of the perturbed value Z is given by f_Z(z) = \int_{-\infty}^{\infty} f_X(ν) f_Y(z - ν)\, dν. Using f_Z(z), we can compute the differential entropy h(Z) of Z. It turns out that h(Z) = 9/4. Therefore, we have:

I(X;Z) = h(Z) - h(Z|X) = 9/4 - h(Y) = 9/4 - 1 = 5/4

Here, the second equality h(Z|X) = h(Y) follows from the fact that X and Y are independent and Z = X + Y. Thus, the fraction of privacy loss in this case is P(X|Z) = 1 - 2^{-5/4} = 0.5796. Therefore, after revealing Z, X has privacy Π(X|Z) = Π(X) × (1 - P(X|Z)) = 2 × (1.0 - 0.5796) = 0.8408. This value is less than 1, since X can be localized to an interval of length less than one for many values of Z.
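The entropy-based quantities of Example 6.4 can be checked numerically. The short sketch below (NumPy) evaluates h(X) and Π(X) on a grid, and plugs in the mutual information value I(X;Z) = 5/4 derived above; the grid resolution is an illustrative choice.

import numpy as np

grid = np.linspace(-3.0, 8.0, 11001)
dx = grid[1] - grid[0]
# the density of Example 6.4: 0.5 on [0,1] and [4,5], 0 elsewhere
f_x = np.where(((grid >= 0) & (grid <= 1)) | ((grid >= 4) & (grid <= 5)), 0.5, 0.0)

def diff_entropy_bits(f):
    # h(A) = -integral of f log2 f, taken only where f > 0
    p = f[f > 0]
    return -np.sum(p * np.log2(p)) * dx

h_x = diff_entropy_bits(f_x)          # ~1.0
pi_x = 2.0 ** h_x                     # Pi(X) ~ 2
i_xz = 5.0 / 4.0                      # I(X;Z), from the derivation above
p_loss = 1.0 - 2.0 ** (-i_xz)         # P(X|Z) ~ 0.5796
print(pi_x, p_loss, pi_x * (1.0 - p_loss))   # ~2, ~0.5796, ~0.8408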
Given the perturbed values z1, z2, ..., zN, it is (in general) not possible to reconstruct the original density function f_X(x) with arbitrary precision. The greater the variance of the perturbation, the lower the precision in estimating f_X(x). This constitutes the classic tradeoff between privacy and information loss. We refer to the lack of precision in estimating f_X(x) as information loss. Clearly, the lack of precision in estimating the true distribution will degrade the accuracy of any application in which the distribution is used. The work in [3] uses an application-dependent approach to measure the information loss. For example, for a classification problem, the inaccuracy in distribution reconstruction is measured by examining the effects on the misclassification rate. The work in [5] uses a more direct approach to measure the information loss. Let \hat{f}_X(x) denote the density function of X as estimated by a reconstruction algorithm. The metric I(f_X, \hat{f}_X) measures the information loss incurred by a reconstruction algorithm in estimating f_X(x):

I(f_X, \hat{f}_X) = \frac{1}{2} E\left[\int_{Ω_X} \left| f_X(x) - \hat{f}_X(x) \right| dx\right] \qquad (6.8)

Thus the proposed metric equals half the expected value of the L1-norm between the original distribution f_X(x) and its estimate \hat{f}_X(x). The information loss I(f_X, \hat{f}_X) lies between 0 and 1; I(f_X, \hat{f}_X) = 0 implies perfect reconstruction of f_X(x), and I(f_X, \hat{f}_X) = 1 implies that there is no overlap between f_X(x) and its estimate \hat{f}_X(x) (see Figure 6.1).

[Figure 6.1. Illustration of the Information Loss Metric: the estimated distribution is somewhat shifted from the original distribution, and the information loss is the amount of mismatch between the two curves in terms of area, i.e., half the total area between the curves, which also equals 1 minus the area shared by both curves.]

The proposed metric is universal in the sense that it can be applied to any reconstruction algorithm, since it depends only on the original density f_X(x) and its estimate \hat{f}_X(x). A universal metric is advocated since it is independent of the particular data mining task at hand, and therefore facilitates absolute comparisons between disparate reconstruction algorithms.
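A sketch of the universal metric, for densities sampled on a common grid (NumPy); taking a single reconstruction rather than the expectation in (6.8) is a simplifying assumption.

import numpy as np

def information_loss(f_true, f_est, dx):
    # half the L1 distance between the two densities: 0 = perfect
    # reconstruction, 1 = disjoint supports
    return 0.5 * np.sum(np.abs(f_true - f_est)) * dx

grid = np.linspace(-3.0, 8.0, 1101)
dx = grid[1] - grid[0]
f1 = np.where((grid >= 0) & (grid <= 2), 0.5, 0.0)
f2 = np.where((grid >= 1) & (grid <= 3), 0.5, 0.0)    # shifted estimate
print(information_loss(f1, f2, dx))                   # ~0.5: half overlap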
6.5 Vulnerabilities of the Randomization Method

In the earlier section on privacy quantification, we illustrated an example in which the reconstructed distribution of the data can be used in order to reduce the privacy of the underlying data records. In general, a systematic approach can be used to do this in multi-dimensional data sets with the use of spectral filtering or PCA-based techniques [11, 14]. The broad idea in techniques such as PCA [11] is that the correlation structure in the original data can be estimated fairly accurately (in larger data sets) even after noise addition. This is because the noise is added to each dimension independently, so it does not affect the expected covariance between different pairs of attributes. Only the variances of the attributes are affected, and the change in variance can be estimated accurately from the public information about the perturbing distribution. To understand this point, consider the case when the noise variable Y1 is added to the first column X1, and the noise variable Y2 is added to the second column X2. Then, we have:

covariance(X1 + Y1, X2 + Y2) = covariance(X1, X2)
variance(X1 + Y1) = variance(X1) + variance(Y1)

Both results can be derived by expanding the expressions and using the fact that the covariance of either of {X1, X2} with either of {Y1, Y2} is zero, and that covariance(Y1, Y2) = 0. This is because the noise is assumed to be added independently to each dimension. Therefore, the covariance of Y1 and Y2 with each other, or with the original data columns, is zero. Furthermore, the variances of Y1 and Y2 are known, since the corresponding distributions are publicly known. This means that the covariance matrix of the perturbed data can be used to derive the covariance matrix of the original data by simply modifying the diagonal entries.

Once the covariance matrix of the original data has been estimated, one can then try to remove the noise from the data in such a way that it fits the aggregate correlation structure of the data. For example, the data is expected to be distributed along the eigenvectors of this covariance matrix, with the variance along each eigenvector given by the corresponding eigenvalue. Since real data usually shows considerable skew in the eigenvalue structure, it is often the case that an entire data set of a few hundred dimensions can be captured in a subspace containing fewer than 20 to 30 eigenvectors. In such cases, data points which deviate significantly from this much lower-dimensional subspace can be projected back onto it in order to approximate the original data. It has been shown in [11] that such an approach can reconstruct the data quite accurately. Furthermore, we note that the accuracy of this kind of approach increases with the size of the data set, and with the gap between the intrinsic dimensionality and the full dimensionality of the data set. A related method in [14] uses spectral filtering in order to reconstruct the data accurately. It has been shown that such techniques can reduce the privacy of the perturbation process significantly, since the noise removal results in values which are fairly close to their original values [11, 14]. The approach is particularly effective in cases where the data is embedded in a much lower intrinsic dimensionality as compared to its full dimensionality. It has been shown in [11] that the addition of noise along the eigenvectors of the data is safer from the point of view of privacy-preservation. This is because the discrepancy between the behavior of individual randomized points and the correlation structure of the data can no longer be used for reconstruction. Some other discussions on limiting breaches of privacy in the randomization method may be found in [7]. A rough sketch of the covariance-correction idea is given below.
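The sketch below illustrates the covariance correction and a naive eigenvector projection (NumPy). It is a rough illustration of the idea, not the exact spectral filtering or PCA algorithms of [11, 14], and the presumed intrinsic dimensionality k is an assumption made by the attacker.

import numpy as np

def estimate_original_covariance(Z, noise_vars):
    # noise is added independently per column, so only the diagonal of
    # cov(Z) is inflated, by the publicly known noise variances
    return np.cov(Z, rowvar=False) - np.diag(noise_vars)

def project_to_principal_subspace(Z, noise_vars, k):
    # project perturbed records onto the top-k eigenvectors of the
    # estimated original covariance, as a crude noise-removal step
    cov_x = estimate_original_covariance(Z, noise_vars)
    vals, vecs = np.linalg.eigh(cov_x)
    top = vecs[:, np.argsort(vals)[::-1][:k]]
    mu = Z.mean(axis=0)
    return mu + (Z - mu) @ top @ top.T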
A second kind of adversarial attack is based on the use of public information [1]. While the PCA approach is good for value reconstruction, it does not say much about the identification of the subject of a record; both value reconstruction and subject identification are required in adversarial attacks. For this purpose, it is possible to use public data in order to try to determine the identity of the subject. Consider a record X = (x1 ... xd), which is perturbed to Z = (z1 ... zd). Then, since the distribution of the perturbations is known, we can try to use a maximum-likelihood fit of the potential perturbation of Z to a public record. Consider the public record W = (w1 ... wd). Then, the potential perturbation of Z with respect to W is given by (Z - W) = (z1 - w1 ... zd - wd). Each of these values (zi - wi) should fit the distribution fY(y). The corresponding log-likelihood fit is given by \sum_{i=1}^{d} \log f_Y(z_i - w_i). The higher the log-likelihood fit, the greater the probability that the record W corresponds to X. If it is known that the public data set always includes X, then the maximum-likelihood fit can provide a high degree of certainty in identifying the correct record, especially in cases where d is large.

Another result in [1] suggests that the choice of perturbing distribution can have significant effects on the privacy of the underlying data. For example, the use of uniform perturbations is experimentally shown to be more effective in the low-dimensional case, whereas for the high-dimensional case, Gaussian perturbations are more effective. The work in [1] characterizes the amount of perturbation required for a particular dimensionality with each kind of perturbing distribution. For the case of Gaussian distributions, the standard deviation of the perturbation needs to increase with the square root of the implicit dimensionality, whereas for the case of uniform distributions, the standard deviation of the perturbation increases at least linearly with the implicit dimensionality. In either case, both kinds of perturbations tend to become ineffective with increasing dimensionality. A minimal sketch of the maximum-likelihood matching attack follows.
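A sketch of this linkage attack (NumPy); the Gaussian noise model and the synthetic public data are illustrative assumptions.

import numpy as np

def best_matching_record(z, public_records, noise_log_pdf):
    # score each public record w by the log-likelihood that z - w was
    # drawn from the known noise distribution; return the best candidate
    scores = np.array([noise_log_pdf(z - w).sum() for w in public_records])
    return int(np.argmax(scores)), scores

sigma = 1.0
gauss_log_pdf = lambda t: -0.5 * (t / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma ** 2)

rng = np.random.default_rng(0)
public = rng.normal(size=(100, 20))          # hypothetical public data, d = 20
z = public[42] + rng.normal(0, sigma, 20)    # perturbed version of record 42
print(best_matching_record(z, public, gauss_log_pdf)[0])   # usually 42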
6.6 Randomization of Time Series Data Streams

The randomization approach is particularly well suited to privacy-preserving data mining of streams, since the noise added to a given record is independent of the rest of the data. However, streams provide a particularly vulnerable target for adversarial attacks with the use of PCA-based techniques [11], because of the large volume of data available for analysis. In addition, there are typically auto-correlations among the different components of a series. Such auto-correlations can also be used for reconstruction purposes. In [28], an interesting technique for randomization has been proposed which uses the correlations and auto-correlations in the different time series while deciding the noise to be added to any particular value. The key idea for the case of correlated noise is similar to that of [11]: principal component analysis is used to determine the directions in which the second-order correlations are zero. These principal components are the eigenvectors of the covariance matrix of the data. Then, the noise is added along these principal components (or eigenvectors) rather than in the original space. This ensures that it is extremely difficult to reconstruct the data using correlation analysis.

This approach is effective for the case of correlations across multiple streams, but not for auto-correlations within a single stream. In the case of dynamic auto-correlations, we are dealing with correlations within a single stream at different local time instants. Such correlations can also be removed by treating a window of the stream at a time, and performing the principal component analysis on all the components of the window. Thus, we use essentially the same idea, except that multiple time instants of the same stream are used to construct the covariance matrix. The ideas can in fact be combined when there are both correlations and auto-correlations, by using multiple time instants from all streams in order to create one covariance matrix. This will also capture correlations between different streams at slightly displaced time instants. Such situations are referred to as lag correlations, and are quite common in data streams, when slight changes in one stream precede changes in another because of the same cause.

In many cases, the directions of correlation may change over time. If a static approach is used for randomization, then changes in the correlation structure will put the data at risk of becoming exposed over time, once the principal components have changed sufficiently. Therefore, the technique in [28] is designed to dynamically adjust the directions of correlation as more and more points from the data stream are received. It has been shown in [28] that such an approach is more robust, since the noise correlates with the stream behavior, and it is more difficult to create effective adversarial attacks with the use of correlation analysis techniques.

6.7 Multiplicative Noise for Randomization

The most common method of randomization is that of additive perturbations. However, multiplicative perturbations can also be used to good effect for privacy-preserving data mining. Many of these techniques derive their roots from the work of [13], which shows how to use multi-dimensional projections in order to reduce the dimensionality of the data. This technique approximately preserves the inter-record distances, and therefore the transformed records can be used in conjunction with a variety of distance-intensive data mining applications. In particular, the approach is discussed in detail in [20, 21], in which it is shown how to use the method for privacy-preserving clustering. The technique can also be applied to the problem of classification, as discussed in [6]. We note that both clustering and classification are locality-specific problems, and are therefore particularly well suited to the multiplicative perturbation technique. One key difference between the use of additive and multiplicative perturbations is that in the former case, we can reconstruct only aggregate distributions, whereas in the latter case more record-specific information (e.g., distances) is preserved. Therefore, the latter technique is often more friendly to different kinds of data mining techniques.

Multiplicative perturbations can also be used for distributed privacy-preserving data mining; details can be found in [17]. In [17], a number of key assumptions have also been discussed which ensure that privacy is preserved. These assumptions concern the level of privacy when the attacker knows partial characteristics of the algorithm used to perform the transformation, or other statistics associated with the transformation. The effects of using special kinds of data (e.g., boolean data) are also discussed. A number of techniques for multiplicative perturbation in the context of masking census data may be found in [15]. A variation on this theme may be implemented with the use of distance-preserving Fourier transforms, which work effectively for a variety of cases [19]. A minimal sketch of a random-projection perturbation appears below.
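The sketch below shows a simple random-projection perturbation (NumPy); the Gaussian entries and the 1/sqrt(k) scaling follow common practice for such projections and are assumptions here, not details specified by the chapter.

import numpy as np

def random_projection(X, k, rng=None):
    # multiply each record by a random matrix, mapping d dimensions to k;
    # pairwise distances are approximately preserved in the projected space
    rng = rng or np.random.default_rng(0)
    P = rng.normal(size=(X.shape[1], k))
    return X @ P / np.sqrt(k)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))
Y = random_projection(X, k=20)
i, j = 3, 17
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))  # close values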
6.7.1 Vulnerabilities of Multiplicative Randomization

As in the case of additive perturbations, multiplicative perturbations are not entirely safe from adversarial attacks. In general, if the attacker has no prior knowledge of the data, then it is relatively difficult to attack the privacy of the transformation. However, with some prior knowledge, two kinds of attacks are possible [18]:

Known Input-Output Attack: In this case, the attacker knows some linearly independent collection of records and their corresponding perturbed versions. In such cases, linear algebra techniques can be used to reverse-engineer the nature of the privacy-preserving transformation. The number of records required depends upon the dimensionality of the data and the available records. The probability of a privacy breach with a given sample size is characterized in [18].

Known Sample Attack: In this case, the attacker has a collection of independent data samples from the same distribution from which the original data was drawn. In such cases, principal component analysis techniques can be used in order to reconstruct the behavior of the original data. Then, one can try to determine how the current random projection of the data relates to this principal component analysis. This can provide an approximate idea of the corresponding geometric transformation.

One observation is that both of the above-mentioned techniques require many more samples (or much more background knowledge) to work effectively in the high-dimensional case. Thus, random projection techniques should generally be used for the case of high-dimensional data, and only a smaller number of projections should be retained in order to preserve privacy. Thus, as with the additive perturbation technique, the multiplicative technique is not completely secure from attacks. A key research direction is to use a combination of additive and multiplicative perturbation techniques in order to construct more robust privacy-preservation techniques.

6.7.2 Sketch Based Randomization

A closely related case to the use of multiplicative perturbations is the use of sketch-based randomization. In sketch-based randomization [2], sketches are used in order to construct the randomization from the data set. We note that sketches are a special case of multiplicative perturbation techniques, in the sense that the individual components of the multiplicative vector are drawn from {-1, +1}. Sketches are particularly useful for the case of sparse data such as text or binary data, in which most components are zero and only a few components are non-zero. Furthermore, sketches are designed in such a way that many aggregate properties, such as the dot product, can be estimated very accurately from a small number of sketch components. Since text and market basket data are both high-dimensional, the use of random projections is particularly effective from the point of view of resisting adversarial attacks. In [2], it has been shown how the method of sketches can be used in order to perform effective privacy-preserving data mining of text and market basket data.

It is possible to use sketches to create a scheme which is similar to randomization, in the sense that the transformation of a given record can be performed at data-collection time. It is possible to control the anonymization in such a way that the absolute variance of the randomization scheme is preserved. If desired, it is also possible to use sketches to add noise so that records cannot be distinguished easily from their k-nearest neighbors. This is a model similar to the k-anonymity model, but it comes at the expense of using a trusted server for anonymization. A minimal sketch-construction example is given below.
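A toy sketch construction (NumPy). This simplified version draws fully independent +/-1 values and fixes them across two records so that their dot product can be estimated, whereas [2] uses 4-wise independent pseudo-random values and varies the number of components per record.

import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 64
R = rng.choice([-1.0, 1.0], size=(d, r))      # r_ij in {-1, +1}

x = np.array([1.0, 0, 0, 2.0, 0, 0, 0, 1.0])  # sparse records
y = np.array([0.0, 0, 1.0, 2.0, 0, 0, 0, 1.0])
s, t = x @ R, y @ R                           # sketches: s_j = sum_i x_i r_ij

# E[s_j * t_j] = x . y, so averaging the componentwise products of the
# sketches estimates the dot product of the original sparse records
print((s * t).mean(), x @ y)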
6.8 Conclusions and Summary

In this chapter, we discussed the randomization method for privacy-preserving data mining. We discussed a number of different reconstruction algorithms for randomization, such as the Bayes method and the EM technique. The EM reconstruction algorithm exhibits a number of optimality properties with respect to its convergence to the maximum likelihood estimate of the data distribution. We also discussed a number of variants of the perturbation technique, such as the method of multiplicative perturbations. A number of applications of the randomization method were discussed over a variety of data mining problems.

References

[1] Aggarwal C. C.: On Randomization, Public Information and the Curse of Dimensionality. ICDE Conference, 2007.
[2] Aggarwal C. C., Yu P. S.: On Privacy-Preservation of Text and Sparse Binary Data with Sketches. SIAM Conference on Data Mining, 2007.
[3] Agrawal R., Srikant R.: Privacy-Preserving Data Mining. ACM SIGMOD Conference, 2000.
[4] Agrawal R., Srikant R., Thomas D.: Privacy-Preserving OLAP. ACM SIGMOD Conference, 2005.
[5] Agrawal D., Aggarwal C. C.: On the Design and Quantification of Privacy-Preserving Data Mining Algorithms. ACM PODS Conference, 2002.
[6] Chen K., Liu L.: Privacy-Preserving Data Classification with Rotation Perturbation. ICDM Conference, 2005.
[7] Evfimievski A., Gehrke J., Srikant R.: Limiting Privacy Breaches in Privacy Preserving Data Mining. ACM PODS Conference, 2003.
[8] Evfimievski A., Srikant R., Agrawal R., Gehrke J.: Privacy-Preserving Mining of Association Rules. ACM KDD Conference, 2002.
[9] Fienberg S., McIntyre J.: Data Swapping: Variations on a Theme by Dalenius and Reiss. Technical Report, National Institute of Statistical Sciences, 2003.
[10] Gambs S., Kegl B., Aimeur E.: Privacy-Preserving Boosting. Knowledge Discovery and Data Mining Journal, to appear.
[11] Huang Z., Du W., Chen B.: Deriving Private Information from Randomized Data. ACM SIGMOD Conference, pp. 37-48, 2005.
[12] Warner S. L.: Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias. Journal of the American Statistical Association, 60(309):63-69, March 1965.
[13] Johnson W., Lindenstrauss J.: Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics, vol. 26, pp. 189-206, 1984.
[14] Kargupta H., Datta S., Wang Q., Sivakumar K.: On the Privacy Preserving Properties of Random Data Perturbation Techniques. ICDM Conference, pp. 99-106, 2003.
[15] Kim J., Winkler W.: Multiplicative Noise for Masking Continuous Data. Technical Report Statistics 2003-01, Statistical Research Division, US Bureau of the Census, Washington D.C., April 2003.
[16] Liew C. K., Choi U. J., Liew C. J.: A Data Distortion by Probability Distribution. ACM TODS, 10(3):395-411, 1985.
[17] Liu K., Kargupta H., Ryan J.: Random Projection Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining. IEEE Transactions on Knowledge and Data Engineering, 18(1), 2006.
[18] Liu K., Giannella C., Kargupta H.: An Attacker's View of Distance Preserving Maps for Privacy-Preserving Data Mining. PKDD Conference, 2006.
[19] Mukherjee S., Chen Z., Gangopadhyay S.: A Privacy-Preserving Technique for Euclidean Distance-Based Mining Algorithms Using Fourier-Based Transforms. VLDB Journal, 2006.
[20] Oliveira S. R. M., Zaïane O.: Privacy Preserving Clustering by Data Transformation. Proc. 18th Brazilian Symposium on Databases, pp. 304-318, October 2003.
[21] Oliveira S. R. M., Zaïane O.: Data Perturbation by Rotation for Privacy-Preserving Clustering. Technical Report TR04-17, Department of Computing Science, University of Alberta, Edmonton, AB, Canada, August 2004.
[22] Polat H., Du W.: SVD-Based Collaborative Filtering with Privacy. ACM SAC Symposium, 2005.
[23] Polat H., Du W.: Privacy-Preserving Collaborative Filtering with Randomized Perturbation Techniques. ICDM Conference, 2003.
[24] Rizvi S., Haritsa J.: Maintaining Data Privacy in Association Rule Mining. VLDB Conference, 2002.
[25] Samarati P.: Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010-1027, 2001.
[26] Shannon C. E.: The Mathematical Theory of Communication. University of Illinois Press, 1949.
[27] Silverman B. W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
[28] Li F., Sun J., Papadimitriou S., Mihaila G., Stanoi I.: Hiding in the Crowd: Privacy Preservation on Evolving Streams through Correlation Tracking. ICDE Conference, 2007.
[29] Zhang P., Tong Y., Tang S., Yang D.: Privacy-Preserving Naive Bayes Classifier. Lecture Notes in Computer Science, Vol. 3584, 2005.
[30] Zhu Y., Liu L.: Optimal Randomization for Privacy-Preserving Data Mining. ACM KDD Conference, 2004.

Chapter 7

A Survey of Multiplicative Perturbation for Privacy-Preserving Data Mining

Keke Chen
College of Computing, Georgia Institute of Technology
kekechen@cc.gatech.edu

Ling Liu
College of Computing, Georgia Institute of Technology
lingliu@cc.gatech.edu

Abstract    The major challenge of data perturbation is to achieve the desired balance between the level of privacy guarantee and the level of data utility. Data privacy and data utility are commonly considered as a pair of conflicting requirements in privacy-preserving data mining systems and applications. Multiplicative perturbation algorithms aim at improving data privacy while maintaining the desired level of data utility by selectively preserving the mining task and model specific information during the data perturbation process. By preserving the task and model specific information, a set of "transformation-invariant data mining models" can be applied to the perturbed data directly, achieving the required model accuracy. Often a multiplicative perturbation algorithm may find multiple data transformations that preserve the required data utility. Thus the next major challenge is to find a good transformation that provides a satisfactory level of privacy guarantee. In this chapter, we review three representative multiplicative perturbation methods: rotation perturbation, projection perturbation, and geometric perturbation, and discuss the technical issues and research challenges. We first describe the mining task and model specific information for a class of data mining models, and the transformations that can (approximately) preserve the information.
Then we discuss the design of appropriate privacy evaluation models for multiplicative perturbations, and give an overview of how the privacy evaluation model is used to measure the level of privacy guarantee in the context of different types of attacks.

Keywords:    Multiplicative perturbation, random projection, sketches.

7.1 Introduction

Data perturbation refers to a data transformation process typically performed by the data owners before publishing their data. The goal of performing such data transformation is two-fold. On one hand, the data owners want to change the data in a certain way in order to disguise the sensitive information contained in the published datasets; on the other hand, the data owners want the transformation to best preserve those domain-specific data properties that are critical for building meaningful data mining models, thus maintaining mining-task-specific data utility of the published datasets.

Data perturbation techniques are among the most popular models for privacy-preserving data mining. They are especially useful for applications where data owners want to participate in cooperative mining but at the same time want to prevent the leakage of privacy-sensitive information in their published datasets. Typical examples include publishing micro data for research purposes or outsourcing the data to third-party data mining service providers. Several perturbation techniques have been proposed to date [4, 1, 8, 3, 13, 14, 26, 35], among which the most popular one is the randomization approach, which focuses on single-dimensional perturbation and assumes independence between data columns [4, 13]. Only recently has the data management community seen development of multi-dimensional data perturbation techniques, such as the condensation approach using the k-nearest neighbor (kNN) method [1], multi-dimensional k-anonymization using the kd-tree [24], and the multiplicative data perturbation techniques [31, 8, 28, 9]. Compared to single-column-based data perturbation techniques, which assume data columns to be independent and focus on developing single-dimensional perturbation techniques, multi-dimensional data perturbation aims at perturbing the data while preserving the multi-dimensional information with respect to inter-column dependency and distribution.

In this chapter, we will discuss multiplicative data perturbations. This category includes three particular perturbation techniques: rotation perturbation, projection perturbation, and geometric perturbation. Compared to other multi-dimensional data perturbation methods, these perturbations exhibit unique properties for privacy-preserving data classification and data clustering. They all preserve (or approximately preserve) distances or inner products, which are important to many classification and clustering models. As a result, classification and clustering models built on data perturbed through multiplicative data perturbation show accuracy similar to those built on the original data. The main challenge for multiplicative data perturbations is thus how to maximize the desired data privacy. In contrast, many other data perturbation techniques focus on seeking a better trade-off between the level of data utility and accuracy preserved and the level of data privacy guaranteed.

7.1.1 Data Privacy vs. Data Utility
Perturbation techniques are often evaluated with two basic metrics: the level of privacy guarantee and the level of model-specific data utility preserved, which is often measured by the loss of accuracy for data classification and data clustering. An ultimate goal for all data perturbation algorithms is to optimize the data transformation process by maximizing both the data privacy and the data utility achieved. However, the two metrics typically represent conflicting goals in many existing perturbation techniques [4, 3, 12, 1].

Data privacy is commonly measured by the difficulty of estimating the original data from the perturbed data. Given a data perturbation technique, the higher the level of difficulty in estimating the original values from the perturbed data, the higher the level of data privacy the technique supports. In [4], the variance of the added random noise is used as the measure of this difficulty, as traditionally done in statistical data distortion [23]. However, recent research [12, 3] reveals that the variance of the noise is not an effective indicator for random noise addition. In addition, [22] shows that the level of data privacy guaranteed also depends on the types of attacks that can reconstruct the original data from the perturbed data and the noise distribution. k-Anonymization is another popular way of measuring the level of privacy, originally proposed for relational databases [34]; it limits the adversary to estimating the original data record only up to a k-record group, assuming that each record in the k-record group is equally protected. However, a recent study [29] shows that the privacy evaluation of k-anonymized records is far more complicated than this simple k-anonymization assumption.

Data utility typically refers to the amount of mining-task/model-specific critical information preserved about the dataset after perturbation. Different data mining tasks, such as classification vs. association rule mining, or different models for the same task, such as the decision tree model vs. the k-nearest-neighbor (kNN) classifier for classification, typically utilize different sets of data properties. For example, the task of building decision trees primarily concerns the column distribution. Hence, the quality of preserving the column distribution should be the key data utility to be maintained in perturbation techniques for the decision tree model, as shown in the randomization approach [4]. In comparison, the kNN model relies heavily on the distance relationship, which is quite different from the column distribution. Furthermore, such task/model-specific information is often multidimensional: many classification models concern multidimensional information rather than single-column distributions. Multi-dimensional perturbation techniques that focus on preserving the model-specific multidimensional information will be more effective for these models.

It is also interesting to note that the data privacy metric and the data utility metric are often contradictory rather than complementary in many existing data perturbation techniques [4, 3, 12, 1]. Typically, data perturbation algorithms that aim at maximizing the level of data privacy often have to bear higher information loss.
The intrinsic correlation between data privacy and data utility raises a number of important issues regarding how to find the right balance between the two measures. In summary, we identify three important design principles for multiplicative data perturbations. First, preserving the mining-task- and model-specific data properties is critical for providing better quality guarantees on both privacy and model accuracy. Second, it is beneficial if data perturbation can effectively preserve the task/model-specific data utility and avoid the need to develop special mining algorithms that operate on the perturbed data, as random noise addition requires. Third, and most importantly, if one can develop a data perturbation technique that does not induce any loss of mining-task/model-specific data utility, this enables us to focus on optimizing perturbation algorithms by maximizing the level of data privacy against attacks, which ultimately leads to better overall quality of both data privacy and data utility.

7.1.2 Outline

In the remainder of the chapter, we will first give the definition of multiplicative perturbation in Section 7.2. Specifically, we categorize multiplicative perturbations into three categories: rotation perturbation, projection perturbation, and geometric perturbation. Rotation perturbation is often criticized as not being resilient to attacks, while geometric perturbation is a direct enhancement of rotation perturbation that adds more components, such as translation perturbation and noise addition, to the original rotation perturbation. Both rotation perturbation and geometric perturbation keep the dimensionality of the dataset unchanged, while projection perturbation reduces the dimensionality, and thus incurs more error in distance or inner product calculations.

One of the unique features that distinguishes multiplicative perturbations from other perturbations is the high guarantee they provide on data utility in terms of data classification and clustering. Since many data mining models utilize distances or inner products, as long as such information is preserved, models trained on perturbed data will have accuracy similar to those trained on the original data. In Section 7.3, we define transformation-invariant classifiers and clustering models, the representative models to which multiplicative perturbations are applied.

Evaluation of the privacy guarantee of perturbations is an important component in the analysis of multiplicative perturbation. In Section 7.4, we review a set of privacy metrics specifically designed for multiplicative perturbations. We argue that in multidimensional perturbation, the values of multiple columns should be perturbed together and the evaluation metrics should be unified over all columns. We also describe a general framework for privacy evaluation of multiplicative data perturbation which incorporates attack analysis.

We argue that attack analysis is a necessary step in order to accurately evaluate the privacy guarantee of any particular perturbation. In Section 7.5, we review a selection of known attacks on multiplicative perturbations, based on different levels of the attacker's knowledge about the original dataset. By incorporating attack analysis under the general framework of privacy evaluation, a randomized perturbation optimization is developed, as described in Section 7.5.5.
7.2 Definition of Multiplicative Perturbation
We first describe the notation used in this chapter, and then describe three categories of multiplicative perturbations and their basic characteristics.

7.2.1 Notations
In privacy-preserving data mining, either a portion of or the entire dataset is perturbed and then exported. For example, in classification, the training data is exported and the testing data might be exported, too, while in clustering, the entire dataset for clustering is exported. Suppose that X is the exported dataset consisting of N data rows (records) and d columns (attributes, or dimensions). For presentation convenience, we use $X_{d\times N}$, $X = [\mathbf{x}_1 \ldots \mathbf{x}_N]$, to denote the dataset, where a column $\mathbf{x}_i$ ($1 \le i \le N$) is a data tuple representing a vector in the real space $\mathbb{R}^d$. In classification, each data tuple $\mathbf{x}_i$ also belongs to a predefined class, which is indicated by the class label attribute $y_i$. The class label can be nominal (or continuous for regression) and is public, i.e., privacy-insensitive. We can also consider X a sample dataset from the d-dimensional random vector $\mathbf{X} = [X_1, X_2, \ldots, X_d]^T$. As a convention, we use bold lower case to represent vectors, bold upper case to represent random variables, and upper case to represent matrices or datasets.

7.2.2 Rotation Perturbation
This category does not cover traditional "rotations" only; literally, it includes all orthonormal perturbations. A rotation perturbation is defined as the following function G(X):

G(X) = RX

The matrix $R_{d\times d}$ is an orthonormal matrix [32], which has the following properties. Let $R^T$ represent the transpose of R, $r_{ij}$ represent the (i, j) element of R, and I be the identity matrix. The rows and columns of R are orthonormal, i.e., for any column j, $\sum_{i=1}^{d} r_{ij}^2 = 1$, and for any two columns j and k, $j \ne k$, $\sum_{i=1}^{d} r_{ij} r_{ik} = 0$. A similar property holds for the rows. This definition implies that

$R^T R = R R^T = I$

It also implies that changing the order of the rows or columns of an orthogonal matrix yields a matrix that is still orthogonal. A random orthonormal matrix can be efficiently generated following the Haar distribution [33].

A key feature of the rotation transformation is that it preserves the Euclidean distance between multidimensional points. Let $\mathbf{x}^T$ represent the transpose of a vector $\mathbf{x}$, and $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}$ represent the length of $\mathbf{x}$. By the definition of the rotation matrix, we have

$\|R\mathbf{x}\| = \|\mathbf{x}\|$

Similarly, the inner product is also invariant to rotation. Let $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T\mathbf{y}$ represent the inner product of $\mathbf{x}$ and $\mathbf{y}$. We have

$\langle R\mathbf{x}, R\mathbf{y} \rangle = \mathbf{x}^T R^T R \mathbf{y} = \langle \mathbf{x}, \mathbf{y} \rangle$

In general, rotation also preserves geometric shapes such as hyperplanes and hyper-curved surfaces in the multidimensional space [7]. Since many classifiers look for a geometric decision boundary, such as a hyperplane or hyper-surface, the rotation transformation preserves the most critical information for many classification models. There are two ways to apply rotation perturbation: we can either apply it to the whole dataset X [8], or group columns into pairs and apply different rotation perturbations to different pairs of columns [31].
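To make the rotation component concrete, the following sketch (our illustration, not code from the chapter) generates a random orthonormal matrix via the QR decomposition of a Gaussian matrix, a standard way to sample from the Haar distribution [33], and verifies numerically that pairwise distances and inner products are preserved. The helper name `random_rotation` is our own.

```python
import numpy as np

def random_rotation(d, rng):
    """Sample a d x d orthonormal matrix from the Haar distribution
    via QR decomposition of a Gaussian matrix (cf. Stewart [33])."""
    A = rng.standard_normal((d, d))
    Q, Rtri = np.linalg.qr(A)
    # Fix the signs of the columns so the distribution is exactly Haar.
    Q = Q * np.sign(np.diag(Rtri))
    return Q

rng = np.random.default_rng(0)
d, N = 5, 100
X = rng.standard_normal((d, N))        # dataset: columns are records
R = random_rotation(d, rng)
GX = R @ X                             # rotation perturbation G(X) = RX

# Orthonormality: R^T R = I
assert np.allclose(R.T @ R, np.eye(d))
# Distance and inner product invariance for two sample records
x, y = X[:, 0], X[:, 1]
assert np.isclose(np.linalg.norm(R @ x - R @ y), np.linalg.norm(x - y))
assert np.isclose((R @ x) @ (R @ y), x @ y)
```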
7.2.3 Projection Perturbation
Projection perturbation refers to the technique of projecting a set of data points from a high-dimensional space to a randomly chosen lower-dimensional subspace. Let $P_{k\times d}$ be a projection matrix:

G(X) = PX

Why can it also be used for perturbation? The rationale is based on the Johnson-Lindenstrauss Lemma [21].

Theorem 1 For any $0 < \epsilon < 1$ and any integer n, let k be a positive integer such that
$k \ge \frac{4 \ln n}{\epsilon^2/2 - \epsilon^3/3}$
Then, for any set S of n data points in the d-dimensional space $\mathbb{R}^d$, there is a map $f: \mathbb{R}^d \to \mathbb{R}^k$ such that, for all $\mathbf{x}, \mathbf{y} \in S$,
$(1-\epsilon)\|\mathbf{x}-\mathbf{y}\|^2 \le \|f(\mathbf{x}) - f(\mathbf{y})\|^2 \le (1+\epsilon)\|\mathbf{x}-\mathbf{y}\|^2$
where $\|\cdot\|$ denotes the vector 2-norm.

This lemma shows that any set of n points in d-dimensional Euclidean space can be embedded into an $O(\log n/\epsilon^2)$-dimensional space such that the pairwise distance between any two points is maintained with small error. With large n (a large dataset) and small $\epsilon$ (high accuracy in distance preservation), the required dimensionality might be large and may not be practical for the perturbation purpose. Furthermore, although this lemma implies that we can always find one good projection that approximately preserves distances for a particular dataset, the geometric decision boundary might still be distorted, and thus the model accuracy reduced. Due to the varying distributions of datasets and the particular properties of data mining models, it is challenging to develop an algorithm that finds random projections preserving model accuracy well for any given dataset.

The paper [28] gives a method to generate the random projection matrix, briefly described as follows. Let P be the projection matrix. Each entry $r_{i,j}$ of P is independently and identically chosen from some distribution with mean zero and variance $\sigma^2$. A row-wise projection is defined as
$G(X) = \frac{1}{\sqrt{k}\,\sigma} P X$
Let $\mathbf{x}$ and $\mathbf{y}$ be two points in the original space, and $\mathbf{u}$ and $\mathbf{v}$ be their projections. The statistical properties of the inner product under projection perturbation are as follows:
$E[\mathbf{u}^T\mathbf{v} - \mathbf{x}^T\mathbf{y}] = 0$
and
$Var[\mathbf{u}^T\mathbf{v} - \mathbf{x}^T\mathbf{y}] = \frac{1}{k}\Big(\sum_i x_i^2 \sum_i y_i^2 + \big(\sum_i x_i y_i\big)^2\Big)$
Since $\mathbf{x}$ and $\mathbf{y}$ are in practice normalized by columns rather than by rows, with large dimensionality d and relatively small k the variance is substantial. The same conclusion extends to the distance relationship. Therefore, projection perturbation does not strictly guarantee the preservation of distances and inner products as rotation or geometric perturbation does, which may significantly degrade the model accuracy.
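A minimal sketch of the row-wise projection of [28], as we read it: the entries of P are drawn i.i.d. from N(0, σ²), so the scaling 1/(√k σ) makes the projected inner product an unbiased estimate of the original one. The Monte Carlo check of the variance below is our illustration, not an experiment from the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, sigma = 200, 20, 1.0
x = rng.standard_normal(d)
y = rng.standard_normal(d)

# Repeat the projection many times to observe E[u'v - x'y] ~ 0
errs = []
for _ in range(2000):
    P = rng.normal(0.0, sigma, size=(k, d))   # random projection matrix
    u = (P @ x) / (np.sqrt(k) * sigma)        # G(x) = (1/(sqrt(k) sigma)) P x
    v = (P @ y) / (np.sqrt(k) * sigma)
    errs.append(u @ v - x @ y)

errs = np.array(errs)
# Theoretical variance: (sum x_i^2 * sum y_i^2 + (x'y)^2) / k
var_theory = ((x @ x) * (y @ y) + (x @ y) ** 2) / k
print(f"mean error    {errs.mean():+.3f}  (expected ~0)")
print(f"empirical var {errs.var():.1f} vs theory {var_theory:.1f}")
```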
7.2.4 Sketch-based Approach
The sketch-based approach is primarily proposed to perturb high-dimensional sparse data [2], such as the datasets in text mining and market basket mining. A sketch of the original record $\mathbf{x} = (x_1, \ldots, x_d)$ is defined by an r-dimensional vector $\mathbf{s} = (s_1, \ldots, s_r)$, $r \ll d$, where
$s_j = \sum_{i=1}^{d} x_i r_{ij}$
The random variable $r_{ij}$ is drawn from \{-1, +1\} with a mean of 0 and is generated from a pseudo-random number generator [5], which produces 4-wise independent values for $r_{ij}$.

Note that the sketch-based approach differs from projection perturbation in two respects. First, the number of components of each sketch, i.e., r, can vary across different records and is carefully controlled so as to provide a uniform measure of privacy guarantee across records. Second, the $r_{ij}$ differ from record to record; there is no fixed projection matrix across records.

The sketch-based approach has a few statistical properties that enable approximate calculation of the dot product of the original data records from their sketches. Let $\mathbf{s}$ and $\mathbf{t}$, with the same number of components r, be the sketches of the original records $\mathbf{x}$ and $\mathbf{y}$, respectively. The expected dot product of $\mathbf{x}$ and $\mathbf{y}$ is given by
$E[\langle \mathbf{x}, \mathbf{y} \rangle] = \langle \mathbf{s}, \mathbf{t} \rangle / r$
and the variance of the above estimate is determined by the few non-zero entries in the sparse original vectors:
$Var(\langle \mathbf{s}, \mathbf{t} \rangle / r) = \Big(\sum_{i=1}^{d}\sum_{l=1}^{d} x_i^2 y_l^2 - \big(\sum_{i=1}^{d} x_i y_i\big)^2\Big) / r \qquad (7.1)$
On the other side, the original value $x_k$ in the vector $\mathbf{x}$ can also be estimated by privacy attackers, with a precision determined by its variance $(\sum_{i=1}^{d} x_i^2 - x_k^2)/r$, $k = 1 \ldots d$. The larger this variance, the better the original value is protected. Therefore, decreasing r possibly increases the level of privacy guarantee; however, the precision of the dot-product estimate (Eq. 7.1) decreases. This typical tradeoff has to be carefully controlled in practice [2].

7.2.5 Geometric Perturbation
Geometric perturbation is an enhancement of rotation perturbation obtained by incorporating additional components, such as random translation perturbation and noise addition, into the basic form of multiplicative perturbation $Y = R \times X$. By adding random translation and noise, geometric perturbation exhibits more robustness in countering attacks than simple rotation-based perturbation [9]. Let $\mathbf{t}_{d\times 1}$ represent a random vector. We define a translation matrix as follows.

Definition 1 $\Psi$ is a translation matrix if $\Psi = [\mathbf{t}, \mathbf{t}, \ldots, \mathbf{t}]_{d\times N}$, i.e., $\Psi_{d\times N} = \mathbf{t}_{d\times 1}\mathbf{1}^T_{N\times 1}$, where $\mathbf{1}_{N\times 1}$ is the vector of N 1's.

Let $\Delta_{d\times N}$ be a random noise matrix, where each element is an independently and identically distributed (i.i.d.) variable $\varepsilon_{ij}$, e.g., a Gaussian noise $N(0, \sigma^2)$. The definition of geometric perturbation is given by the function G(X):
$G(X) = RX + \Psi + \Delta$
Clearly, translation perturbation does not change distances, since for any pair of points $\mathbf{x}$ and $\mathbf{y}$, $\|(\mathbf{x}+\mathbf{t}) - (\mathbf{y}+\mathbf{t})\| = \|\mathbf{x}-\mathbf{y}\|$. Compared with rotation perturbation alone, it protects the rotation center from attacks and adds additional difficulty to ICA-based attacks. However, translation perturbation does not preserve the inner product.

It is shown in [9] that by adding an appropriate level of noise $\Delta$, one can effectively prevent knowledgeable attackers from performing distance-based data reconstruction: noise addition perturbs distances, which protects the perturbation from distance-inference attacks. For example, the experiments in [9] show that a Gaussian noise $N(0, \sigma^2)$ effectively counters distance-inference attacks. Although noise addition prevents distances from being fully preserved, a low-intensity noise will not change the class boundary or cluster membership much. In addition, the noise component is optional: if the data owner makes sure that the original data records are secure and nobody except the data owner knows any record in the original dataset, the noise component can be removed from geometric perturbation.
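The full pipeline G(X) = RX + Ψ + Δ can be sketched in a few lines. In the sketch below (ours), `random_rotation` is the hypothetical helper from the earlier rotation example, and the noise level σ is a free parameter the data owner tunes against the attacks of Section 7.5.

```python
import numpy as np

def geometric_perturbation(X, sigma, rng):
    """G(X) = R X + Psi + Delta, following the definition above."""
    d, N = X.shape
    R = random_rotation(d, rng)               # orthonormal component
    t = rng.standard_normal((d, 1))           # random translation vector
    Psi = t @ np.ones((1, N))                 # Psi = t 1^T, one copy per record
    Delta = rng.normal(0.0, sigma, (d, N))    # optional i.i.d. Gaussian noise
    return R @ X + Psi + Delta

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 50))
Y = geometric_perturbation(X, sigma=0.1, rng=rng)
# With sigma = 0, pairwise distances are preserved exactly;
# with small sigma they are only slightly distorted.
```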
7.3 Transformation Invariant Data Mining Models
By using multiplicative perturbation algorithms, we can mine the perturbed data directly with a set of existing "transformation-invariant data mining models", instead of developing new data mining algorithms to mine the perturbed data [4]. In this section, we define the concept of transformation-invariant mining models using the example of "transformation-invariant classifiers", and then extend the discussion to transformation-invariant models in data classification and data clustering.

7.3.1 Definition of Transformation Invariant Models
Generally speaking, a transformation-invariant model, if trained or mined on the transformed data, performs as well as the model based on the original data. We take the classification problem as an example. A classification problem is also a function approximation problem: classifiers are the functions learned from the training data [16]. In the following discussion, we use functions to represent classifiers. Let $\hat{f}_X$ represent a classifier $\hat{f}$ trained with dataset X, and $\hat{f}_X(Y)$ be the classification result on the dataset Y. Let T(X) be any transformation function, which transforms the dataset X to another dataset $X^T$. We use $Err(\hat{f}_X(Y))$ to denote the error rate of classifier $\hat{f}_X$ on testing data Y, and let $\varepsilon$ be some small real number, $|\varepsilon| < 1$.

Definition 2 A classifier $\hat{f}$ is invariant to a transformation T if and only if $Err(\hat{f}_X(Y)) = Err(\hat{f}_{T(X)}(T(Y))) + \varepsilon$ for any training dataset X and testing dataset Y.

With the strict condition $\hat{f}_X(Y) \equiv \hat{f}_{T(X)}(T(Y))$, we get Proposition 2.

Proposition 2 In particular, if $\hat{f}_X(Y) \equiv \hat{f}_{T(X)}(T(Y))$ is satisfied for any training dataset X and testing dataset Y, the classifier is invariant to the transformation T(X).

For instance, if a classifier $\hat{f}$ is invariant to rotation transformation, we call it a rotation-invariant classifier. A similar definition applies to a translation-invariant classifier. In the subsequent sections, we list some examples of transformation-invariant models for classification and clustering. Some detailed proofs can be found in [7].

7.3.2 Transformation-Invariant Classification Models
kNN Classifiers and Kernel Methods
A k-Nearest-Neighbor (kNN) classifier determines the class label of a point by looking at the labels of its k nearest neighbors in the training dataset and classifies the point to the class that most of its neighbors belong to. Since the distances between points are not changed by rotation and translation transformations, the k nearest neighbors are not changed, and thus the classification result is not changed either.

Since the kNN classifier is a special case of kernel methods, we can extend this conclusion to kernel methods. Here, we use kernel methods to refer to the traditional local methods [16]. In general, since the kernels depend on the local points, whose locality is evaluated by distance, transformations that preserve distance make kernel methods invariant.

Support Vector Machines
A Support Vector Machine (SVM) classifier also utilizes kernel functions in training and classification. However, it has an explicit training procedure, which differentiates it from the traditional kernel methods just discussed. We can use a two-step procedure to prove that an SVM classifier is invariant to a transformation: 1) training with the transformed dataset generates the same set of model parameters; 2) the classification function with these model parameters is also invariant to the transformation. The detailed proof involves the quadratic optimization procedure for SVM. It has been demonstrated that SVM classifiers with typical kernels are invariant to rotation transformation [7]. It turns out that if a transformation leaves the kernel invariant, then the SVM classifier is also invariant to the transformation. There are three popular choices of kernels in the SVM literature [10, 16]:

d-th degree polynomial: $K(\mathbf{x}, \mathbf{x}') = (1 + \langle \mathbf{x}, \mathbf{x}' \rangle)^d$,
radial basis: $K(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|/c)$,
neural network: $K(\mathbf{x}, \mathbf{x}') = \tanh(\kappa_1 \langle \mathbf{x}, \mathbf{x}' \rangle + \kappa_2)$

Apparently, all three are invariant to rotation transformation.
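As a quick numerical illustration of rotation invariance (our example, under the simplifying assumption that classification is by exact nearest-neighbor lookup), the sketch below checks that a 1-NN classifier returns identical labels before and after a random rotation, reusing the hypothetical `random_rotation` helper from Section 7.2.2.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 3, 60
X_train = rng.standard_normal((d, N))          # columns are records
labels = rng.integers(0, 2, N)
X_test = rng.standard_normal((d, 10))

def nn_predict(train, labels, test):
    """1-NN classification based on Euclidean distance."""
    # dists[i, j] = distance between test column i and train column j
    dists = np.linalg.norm(test[:, :, None] - train[:, None, :], axis=0)
    return labels[np.argmin(dists, axis=1)]

R = random_rotation(d, rng)
# Predictions agree (up to floating-point ties): distances are
# invariant under rotation, so the nearest neighbors are unchanged.
assert np.array_equal(
    nn_predict(X_train, labels, X_test),
    nn_predict(R @ X_train, labels, R @ X_test),
)
```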
Since translation does not preserve the inner product, it is not straightforward to prove that SVMs with polynomial and neural network kernels are invariant to translation perturbation. However, experiments [9] showed that these classifiers are also invariant to translation perturbation.

Linear Classifiers
Linear classification models are popular methods due to their simplicity. In linear classification models, the classification boundary is modeled as a hyperplane, which is clearly a geometric concept. It is easy to see that distance-preserving transformations, such as rotation and translation, keep the classes separated if they are originally separated. There is also a detailed proof showing that a typical linear classifier, the perceptron, is invariant to rotation transformation [7].

7.3.3 Transformation-Invariant Clustering Models
Most clustering models are based on Euclidean distance, such as the popular k-means algorithm [16]. Many others are based on the density property, which is derived from Euclidean distance, such as DBSCAN [11], DENCLUE [17] and OPTICS [6]. All of these clustering models are invariant to Euclidean-distance-preserving transformations, such as rotation and translation. There are other clustering models that employ different distance metrics [19], such as linkage-based clustering and cosine-distance-based clustering. As long as we can find a transformation preserving the particular distance metric, the corresponding clustering model will be invariant to this transformation.

7.4 Privacy Evaluation for Multiplicative Perturbation
The goal of data perturbation is twofold: preserving the accuracy of specific data mining models (data utility), and preserving the privacy of the original data (data privacy). The discussion of transformation-invariant data mining models has shown that multiplicative perturbations can theoretically guarantee zero loss of accuracy for a number of data mining models. The challenge is to find one that maximizes the privacy guarantee with respect to potential attacks. We dedicate this section to discussing how good a multiplicative perturbation is at preserving privacy under a set of privacy attacks. We first define a multi-column (or multidimensional) privacy measure for evaluating the privacy quality of a multiplicative perturbation over a given dataset. Then, we introduce a framework of privacy evaluation, which can incorporate different attack analyses into the evaluation of the privacy guarantee. We show that using this framework we can employ certain optimization methods (Section 7.5.5) to find a good perturbation among a set of randomly generated perturbations, one that is locally optimal for the given dataset.

7.4.1 A Conceptual Multidimensional Privacy Evaluation Model
In practice, different columns (or dimensions, or attributes) may have different privacy concerns. Therefore, we advocate that the general-purpose privacy metric $\Phi$ defined for an entire dataset should be based on column privacy metrics, rather than point-based privacy metrics such as distance-based metrics. A conceptual privacy model is defined as $\Phi = \Phi(\mathbf{p}, \mathbf{w})$, where $\mathbf{p}$ denotes the column privacy metric vector $\mathbf{p} = [p_1, p_2, \ldots, p_d]$ of a given dataset X, and $\mathbf{w} = (w_1, w_2, \ldots, w_d)$ denotes the privacy weights associated with the d columns, respectively. The column privacy $p_i$ itself is defined by a function, which we will discuss later.
In summary, the model suggests that the column-wise privacy metrics should be calculated first, and then $\Phi$ is used to generate a composite metric. We first describe some basic designs for the components of the function $\Phi$; a separate subsection is then dedicated to the concrete design of the function for generating $\mathbf{p}$.

The first design idea is to take column importance into account when unifying the privacy of different columns. Intuitively, the more important the column, the higher the level of privacy guarantee required for the perturbed data column. Since $\mathbf{w}$ denotes the importance of columns in terms of preserving privacy, we use $p_i/w_i$ to represent the weighted column privacy of column i.

The second concept is the minimum privacy guarantee and the average privacy guarantee among all columns. Normally, when we measure the privacy guarantee of a multidimensional perturbation, we need to pay particular attention to the column that has the lowest weighted column privacy, because such a column could become the weakest link of privacy protection. Hence, the first composition function is the minimum privacy guarantee:
$\Phi_1 = \min_{i=1}^{d} \{p_i / w_i\}$
Similarly, the average privacy guarantee of the multi-column perturbation, which could be another interesting measure, is defined by
$\Phi_2 = \frac{1}{d} \sum_{i=1}^{d} p_i / w_i$
Note that these two functions assume that the $p_i$ are comparable across columns, which is one of the important requirements in the following discussion.
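These composition functions are straightforward to compute. The sketch below (our illustration) takes per-column privacy values and weights and returns the minimum and average guarantees Φ₁ and Φ₂.

```python
import numpy as np

def privacy_guarantees(p, w):
    """Composite metrics over the weighted column privacy p_i / w_i."""
    weighted = np.asarray(p) / np.asarray(w)
    return weighted.min(), weighted.mean()   # (Phi_1, Phi_2)

# Example: column 3 carries a high privacy weight, so it drives Phi_1 down.
p = [0.8, 0.5, 0.6]        # per-column privacy (e.g., VoD-based, see below)
w = [1.0, 1.0, 2.0]        # privacy importance weights
phi1, phi2 = privacy_guarantees(p, w)
```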
7.4.2 Variance of Difference as Column Privacy Metric
Having defined the conceptual privacy model, we move to the design of the column-wise privacy metric. Intuitively, for a data perturbation approach, the quality of preserved privacy can be understood as the difficulty of estimating the original data from the perturbed data. Therefore, how statistically different the estimated data is from the original data is an intuitive measure. We use a variance-of-difference (VoD) based approach, which has a similar form to the naive variance-based evaluation [4], but with very different semantics.

Let the difference between the original column data and the estimated data be a random variable $D_i$. Without any knowledge about the original data, the mean and variance of the difference characterize the quality of the estimation; a perfect estimation has zero mean and zero variance. Since the mean of the difference, i.e., the bias of the estimation, can easily be removed if the attacker knows the original distribution of the column, we use only the variance of the difference (VoD) as the primary metric for the level of difficulty in estimating the original data.

VoD is formally defined as follows. Let $X_i$ be a random variable representing column i, $X'_i$ be the estimated result¹ of $X_i$, and $D_i = X'_i - X_i$. Let $E[D_i]$ and $Var(D_i)$ denote the mean and the variance of $D_i$, respectively. Then the VoD for column i is $Var(D_i)$. Let $x'_i$ be an estimate of a certain value $x_i$, let $\sigma = \sqrt{Var(D_i)}$, and let c denote a confidence parameter depending on both the distribution of $D_i$ and the desired confidence level. The corresponding original value $x_i$ in $X_i$ is located in the range defined below:
$[\,x'_i - E[D_i] - c\sigma,\; x'_i - E[D_i] + c\sigma\,]$
After removing the effect of $E[D_i]$, the width of the estimation range, $2c\sigma$, characterizes the quality of estimating the original value, which proportionally reflects the level of privacy guarantee: a smaller range means a better estimation, i.e., a lower level of privacy guarantee. For simplicity, we often use $\sigma$ to represent the privacy level.

¹It would not be appropriate to use only the perturbed data for privacy estimation, if we consider the potential attacks.

VoD only defines the privacy guarantee for a single column. However, we usually need to evaluate the privacy level of all perturbed columns together when a multiplicative perturbation is applied. The single-column VoD does not work across columns, since different column value ranges may result in very different VoDs. For example, the VoD of age may be much smaller than the VoD of salary, so the same amount of VoD is not equally effective for columns with different value ranges. One straightforward method to unify the different value ranges is normalization over the original dataset and the perturbed dataset. Normalization can be done in various ways, such as max/min normalization or standardized normalization [30]. After normalization, the levels of privacy guarantee for the columns should be approximately comparable. Note that normalization after the VoD calculation, such as the relative variance $VoD_i/Var(X_i)$, is not appropriate, since a small $Var(X_i)$ would inappropriately inflate the value.

7.4.3 Incorporating Attack Evaluation
Privacy evaluation has to consider the resilience to attacks as well. The VoD evaluation has a unique advantage in incorporating attack analysis into privacy evaluation. In general, let X be the normalized original dataset, P be the perturbed dataset, and O be the estimated/observed dataset obtained through "attack simulation". We can then calculate $VoD(X_i, O_i)$ for column i with respect to different attacks. For example, attacks on rotation perturbation can be evaluated by the following steps; details will be discussed shortly, and a small simulation sketch is given at the end of this section.
1 Naive Estimation: $O \equiv P$;
2 ICA-based Reconstruction: Independent Component Analysis (ICA) is used to estimate R. Let $\hat{R}$ be the estimate of R; the estimated data $\hat{R}^{-1}P$ is aligned with the known column statistics to obtain the dataset O;
3 Distance-based Inference: knowing a set of special points in X that can be mapped to a certain set of points in P, the mapping helps to obtain an estimated rotation $\hat{R}$, and then $O = \hat{R}^{-1}P$.

7.4.4 Other Metrics
Other metrics include the distance-based risk of privacy breach, which was used to evaluate the level of privacy breach when a few pairs of original data points and their maps in the perturbed data are known [27]. Assume $\hat{\mathbf{x}}$ is the estimate of an original point $\mathbf{x}$. An $\epsilon$-privacy breach occurs if
$\|\hat{\mathbf{x}} - \mathbf{x}\| \le \epsilon \|\mathbf{x}\|$
This roughly says that if the estimate falls within an arbitrarily small local area around the original point, then the risk of privacy breach is high. However, even when the estimated point is distant from the original point, the estimation can still be effective: a large distance may be driven by the difference in only a few columns, while the other columns may be very similar. That is the reason why we should consider column-wise privacy metrics.
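The following sketch (our construction) ties Sections 7.4.2 and 7.4.3 together: columns are first standardized, the perturbed data itself plays the role of the observed dataset O under naive estimation, and the per-column VoD feeds the minimum privacy guarantee Φ₁. `random_rotation` is again the hypothetical helper from Section 7.2.2.

```python
import numpy as np

def column_vod(X, O):
    """Per-column variance of the difference D_i = O_i - X_i (row i = column i)."""
    return (O - X).var(axis=1)

def min_privacy_guarantee(X, O, w):
    """Phi_1 under a simulated attack producing the observed dataset O."""
    p = np.sqrt(column_vod(X, O))          # sigma_i as the column privacy level
    return (p / np.asarray(w)).min()

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 500))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)  # normalize
R = random_rotation(3, rng)
P = R @ X                                  # rotation-perturbed data
# Naive estimation: the attacker simply takes O = P.
phi1 = min_privacy_guarantee(X, P, w=[1.0, 1.0, 1.0])
```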
7.5 Attack Resilient Multiplicative Perturbations
Attack analysis is the essential component in the privacy evaluation of multiplicative perturbations. The previous section set up an evaluation model that can conveniently incorporate attack analysis through "attack simulation". Privacy attacks on multiplicative perturbations are methods for estimating the original points (or the values of particular columns) from the perturbed data, given a certain level of additional knowledge about the original data. Since the perturbed data is public, the effectiveness of an attack is determined solely by the additional knowledge the attacker has.

In the following sections, we describe some potential inference attacks on multiplicative perturbations, focusing primarily on rotation perturbation. These attacks are organized according to the different levels of knowledge that an attacker may have. We hope that from this section interested readers will gain a better understanding of the attacks on general multiplicative perturbations and be able to apply appropriate tools to counter them. Most of the content of this section can be found in [9]; we present only the basic ideas here.

7.5.1 Naive Estimation to Rotation Perturbation
When the attacker has no additional information, we call the attack a naive estimation, which simply estimates the original data from the perturbed data. In this case, an appropriate rotation perturbation is enough to achieve a high level of privacy guarantee. With the VoD metric over the normalized data, we can formally analyze the privacy guarantee provided by the rotation-perturbed data. Let X be the normalized dataset, $X'$ be the rotation of X, and $I_d$ be the d-dimensional identity matrix. The VoD of column i can be evaluated by
$Cov(X' - X)_{(i,i)} = Cov(RX - X)_{(i,i)} = ((R - I_d)\,Cov(X)\,(R - I_d)^T)_{(i,i)} \qquad (7.2)$
Let $r_{ij}$ represent the element (i, j) of the matrix R, and $c_{ij}$ be the element (i, j) of the covariance matrix of X. The VoD for the i-th column is computed as
$Cov(X' - X)_{(i,i)} = \sum_{j=1}^{d}\sum_{k=1}^{d} r_{ij} r_{ik} c_{kj} - 2\sum_{j=1}^{d} r_{ij} c_{ij} + c_{ii} \qquad (7.3)$
When the random rotation matrix is generated following the Haar distribution, a considerable number of matrix entries approximately follow the independent normal distribution N(0, 1/d) [20]. For simplicity and ease of understanding, we assume that all entries of the random rotation matrix approximately follow independent N(0, 1/d) distributions. Then random rotations make $VoD_i$ vary around the mean value $c_{ii}$, as shown in the following equation:
$E[VoD_i] \sim \sum_{j=1}^{d}\sum_{k=1}^{d} E[r_{ij}] E[r_{ik}] c_{kj} - 2\sum_{j=1}^{d} E[r_{ij}] c_{ij} + c_{ii} = c_{ii}$
This means that the original column variance can substantially influence the result of a random rotation. However, the expectation of the VoDs is not the only factor determining the final privacy guarantee; we should also look at the variance of the VoDs. If the variance of the VoDs is considerably large, we still have a good chance of finding a rotation with high VoDs among a set of sample random rotations, and the larger $Var(VoD_i)$ is, the more likely the randomly generated rotation matrices can provide a high privacy level. With the approximate independence assumption, we have
$Var(VoD_i) \sim \sum_{j=1}^{d}\sum_{k=1}^{d} Var(r_{ij}) Var(r_{ik})\, c_{jk}^2 + 4\sum_{j=1}^{d} Var(r_{ij})\, c_{ij}^2 \sim O\Big(\frac{1}{d^2}\sum_{j=1}^{d}\sum_{k=1}^{d} c_{jk}^2 + \frac{4}{d}\sum_{j=1}^{d} c_{ij}^2\Big)$
This result shows that $Var(VoD_i)$ is approximately related to the average of the squared covariance entries, with more influence from row i of the covariance matrix. Therefore, by looking at the covariance matrix of the original dataset and estimating $Var(VoD_i)$, we can estimate the chance of finding a random rotation that gives a high privacy guarantee.
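A short experiment (ours) makes this analysis tangible: sampling many Haar rotations with the hypothetical `random_rotation` helper from Section 7.2.2 and computing the per-column VoD via Eq. (7.2) shows how the VoDs spread for a given covariance structure.

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 4, 1000
A = rng.standard_normal((d, d))
X = A @ rng.standard_normal((d, N))        # correlated columns
C = np.cov(X)

vods = []
for _ in range(500):
    R = random_rotation(d, rng)
    M = R - np.eye(d)
    vods.append(np.diag(M @ C @ M.T))      # Eq. (7.2): Cov(RX - X)_(i,i)
vods = np.array(vods)

# The spread of VoD across random rotations is what the randomized
# optimization of Section 7.5.5 exploits: a wide spread means some
# rotations give a much higher privacy guarantee than others.
print("VoD mean per column:", vods.mean(axis=0).round(2))
print("VoD std  per column:", vods.std(axis=0).round(2))
```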
Rotation Center. The basic rotation perturbation uses the origin as the rotation center. Therefore, points around the origin remain close to the origin after the perturbation, which leads to weaker privacy protection for those points. The attack on the rotation center can be regarded as another kind of naive estimation. This problem is addressed by random translation perturbation, which hides the rotation center. More sophisticated attacks on the combination of rotation and translation would have to utilize the ICA technique with sufficient additional knowledge, which will be described shortly.

7.5.2 ICA-Based Attacks
In this section, we introduce a high-level attack based on data reconstruction. The basic method for reconstructing X from the perturbed data RX would be the Independent Component Analysis (ICA) technique, derived from signal processing research [18]. The ICA technique can be applied to estimate the independent components (the row vectors in our definition) of the original dataset X from the perturbed data, if the following conditions are satisfied:
1 The source row vectors are independent;
2 All source row vectors should be non-Gaussian, with the possible exception of one row;
3 The number of observed row vectors must be at least as large as the number of independent source row vectors;
4 The transformation matrix R must be of full column rank.
For rotation matrices, the 3rd and 4th conditions are always satisfied. However, the first two conditions, although realistic for signal processing, are often not satisfied in data classification or clustering. Furthermore, there are a few more difficulties in applying a direct ICA-based attack. First of all, even if ICA can be done successfully, the order of the original independent components cannot be preserved or determined through ICA alone [18]. Formally, any permutation matrix P and its inverse $P^{-1}$ can be inserted into the model, giving $RX = (RP^{-1})(PX)$, so ICA may give an estimate of some permuted source PX. Thus, we cannot identify a particular column without more knowledge about the original data. Second, even if the ordering of the columns can be identified, ICA reconstruction does not guarantee to preserve the variance of the original signal: the estimated signal is often scaled, and we do not know by how much unless we know the original value range of the column. Therefore, without knowing the basic statistics of the original columns, the ICA attack is not effective.

However, such basic column statistics are not impossible to get in some cases. Now, we assume that attackers know the basic statistics, including the column max/min values and the probability density function (PDF), or empirical PDF, of each column. An enhanced ICA-based attack can be described as follows (a sketch of the first step follows this list).
1 Run an ICA algorithm to get a reconstructed dataset;
2 For each pair $(O_i, X_j)$, where $O_i$ is a reconstructed column and $X_j$ is an original column, scale $O_i$ with the max/min values of $X_j$;
3 Compare the PDFs of the scaled $O_i$ and $X_j$ to find the closest match among all possible combinations. Note that the PDFs should be aligned before comparison; [9] gives one method to align them.
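As an illustration of the reconstruction step, and only under the assumption that the ICA conditions roughly hold, the sketch below (ours) runs FastICA from scikit-learn against rotation-perturbed data; the matching and rescaling of the recovered components against known column statistics (steps 2 and 3) is left out.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
d, N = 3, 2000
# Non-Gaussian, independent sources: conditions 1 and 2 of Section 7.5.2.
X = rng.laplace(size=(d, N))
R = random_rotation(d, rng)                # hypothetical helper, Section 7.2.2
P_data = R @ X                             # the published perturbed dataset

ica = FastICA(n_components=d, random_state=0)
# FastICA expects samples in rows, hence the transposes; it recovers the
# sources only up to permutation and scaling, as discussed above.
S = ica.fit_transform(P_data.T).T
# S approximates some permuted, rescaled version of X; the attacker still
# needs column statistics (max/min, PDFs) to identify and rescale columns.
```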
The above procedure describes how to use ICA and additional knowledge about the original dataset to reconstruct the original dataset more precisely. Note that if the four conditions for effective ICA are exactly satisfied, and the basic statistics and PDFs are all known and distinct from each other, the basic rotation perturbation is totally broken by the enhanced ICA-based attack. In practice, when the column distributional information is released, we can test whether the first two conditions for effective ICA are satisfied in order to decide whether rotation perturbation can be used safely. If ICA-based attacks can be performed effectively, it is also trivial to reveal an additional translation perturbation, which is used to protect the rotation center. If the first and second conditions are not satisfied, as is the case for most datasets in data classification and clustering, precise ICA reconstruction cannot be achieved. Under this circumstance, different rotation perturbations may result in different levels of privacy guarantee, and the goal is to find one perturbation that is resilient to the enhanced ICA-based attacks.

For projection perturbation [28], the third condition of effective ICA is not satisfied either. Although overcomplete ICA is available for this particular case [25], it is generally ineffective to break projection perturbation with ICA-based attacks. The major concern with projection perturbation is instead to find a projection that preserves the utility of the perturbed data.

7.5.3 Distance-Inference Attacks
In the previous sections, we discussed naive estimation and ICA-based attacks. In the following discussion, we assume that, besides the information necessary to perform the attacks already discussed, the attacker manages to get more knowledge about the original dataset. We consider two scenarios: 1) the attacker knows at least d + 1 linearly independent original data records, $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{d+1}\}$; or 2) the attacker can only get fewer than d linearly independent points. The attacker then tries to find the mapping between these points and their images in the perturbed dataset, denoted by $O = \{\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_{d+1}\}$, in order to break the rotation perturbation and possibly also the translation perturbation.

In both scenarios, it is possible to find the images of the known points in the perturbed data. In particular, if a few original points are highly distinguishable, such as "outliers", their images in the perturbed data can be correctly identified with high probability for low-dimensional, small datasets (< 4 dimensions). With considerable cost, this is not impossible for higher-dimensional and larger datasets by simple exhaustive search, although the probability of identifying the exact images is relatively low. For scenario 1), with the known mapping, the rotation R and translation $\mathbf{t}$ can be calculated precisely if the incomplete geometric perturbation $G(X) = RX + \Psi$ is applied. The threat to every other data point in the original dataset is then substantial.

Figure 7.1. Using known points and distance relationships to infer the rotation matrix
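For scenario 1 the attack reduces to linear algebra. The sketch below is our reconstruction of the idea, not the exact procedure of [27] or [9]: it recovers R and t from d + 1 known (point, image) pairs under the incomplete perturbation G(X) = RX + Ψ, using the hypothetical `random_rotation` helper from Section 7.2.2.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 4
R = random_rotation(d, rng)                 # the owner's secret rotation
t = rng.standard_normal((d, 1))             # the owner's secret translation

X_known = rng.standard_normal((d, d + 1))   # d+1 known original records
O = R @ X_known + t                         # their identified images

# Subtracting one known pair eliminates the translation; the remaining
# d difference equations determine R (least squares used for robustness).
dX = X_known[:, 1:] - X_known[:, :1]
dO = O[:, 1:] - O[:, :1]
R_hat = dO @ np.linalg.pinv(dX)
t_hat = O[:, :1] - R_hat @ X_known[:, :1]

assert np.allclose(R_hat, R) and np.allclose(t_hat, t)
```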
For scenario 2), if we assume the exact images of the known original points are identified, there is a comprehensive discussion of the potential privacy breach of rotation perturbation in [27]. For rotation perturbation, i.e., O = RX between the known points X and their images O, if X consists of fewer than d points, there are numerous estimates $\hat{R}$ of R satisfying the relationship between X and O. The weakest points, besides the known points X themselves, are those around X. The paper [27] gives an estimate of the risk of privacy breach for a point $\mathbf{x}$ when a set of points X and their images O are known. The definition is based on the $\epsilon$-privacy breach (Section 7.4.4). The probability of an $\epsilon$-privacy breach, $\rho(\mathbf{x}, \epsilon)$, for any $\mathbf{x}$ in the original dataset can be estimated as follows. Let $d(\mathbf{x}, X)$ be the distance between $\mathbf{x}$ and X:
$\rho(\mathbf{x}, \epsilon) = \frac{2}{\pi}\arcsin\Big(\frac{\epsilon\|\mathbf{x}\|}{2\,d(\mathbf{x}, X)}\Big)$, if $\epsilon\|\mathbf{x}\| < 2\,d(\mathbf{x}, X)$; 1 otherwise.
Note that the $\epsilon$-privacy breach is not sufficient for column-wise privacy evaluation; thus, the above definition may not be sufficient either.

In order to protect against distance-inference attacks in both scenarios, an additional noise component $\Delta$ is introduced to form the complete version of geometric perturbation, $G(X) = RX + \Psi + \Delta$, where $\Delta = [\boldsymbol{\delta}_1, \boldsymbol{\delta}_2, \ldots, \boldsymbol{\delta}_N]$ and $\boldsymbol{\delta}_i$ is a d-dimensional Gaussian random vector. The $\Delta$ component reduces the probability of identifying exact images and the precision of the estimates of R and $\Psi$, which significantly increases the resilience to distance-inference attacks. Assume the attacker still knows enough pairs of independent (point, image). With the additional noise component, the most effective way to estimate the rotation/translation components is linear regression. The steps are: 1) filter out the translation component; 2) apply linear regression to estimate R; 3) plug the estimate $\hat{R}$ back in to estimate the translation component; 4) estimate the original data with $\hat{R}$ and $\hat{\Psi}$. A detailed procedure is given in [9]. We can simulate this procedure to estimate the resilience of a perturbation.

Note that the additional noise component also implies that we have to sacrifice some model accuracy to gain the stronger privacy protection. An empirical study over a number of datasets evaluates the relationship between noise intensity, resilience to attacks, and model accuracy [9]. In general, a low-intensity noise component is enough to reduce the risk of being attacked while still preserving model accuracy. Moreover, the noise component is required only when the data owner believes that a small part of the original data may be released.

7.5.4 Attacks with More Prior Knowledge
There are also extreme cases, which may not happen in practice, where the attacker knows a considerable number of original data points, and these points form a sample set from which the higher-order statistical properties of the original dataset, such as the covariance matrix, can be approximately estimated. Using the sample statistics and the sample points, the attacker can mount more effective attacks. Note that, in general, if the attacker already knows so much information about the original data, its privacy may already be breached, and it is inadvisable to publish more of the original data; further discussion of perturbations then makes little sense. However, the techniques developed in these attacks, such as the PCA-based attack [27] and the AK-ICA attack [15], might eventually be utilized to enhance multiplicative perturbations in the future. We do not give a detailed description of these attacks here due to space limitations; they are covered in another dedicated chapter.

7.5.5 Finding Attack-Resilient Perturbations
We have discussed the unified privacy metric for evaluating the quality of a random geometric perturbation.
Some known inference attacks have been analyzed under the framework of multi-column privacy evaluation, which allows us to design an algorithm that chooses a good geometric perturbation with respect to these attacks (if the attacker knows a considerable amount of the original data, however, it is advisable not to release the perturbed dataset at all). A deterministic perturbation-optimization algorithm might also provide extra clues to privacy attackers; therefore, a certain level of randomization in the perturbation optimization is also desirable.

A randomized perturbation-optimization algorithm for geometric perturbation was proposed in [9]; we briefly describe it as follows. Algorithm 1 is a hill-climbing method, which runs for a given number of iterations to find a geometric perturbation that maximizes the minimum privacy guarantee as far as possible. Initially, a random translation is selected, which needs no optimization at all. In each iteration, the algorithm randomly generates a rotation matrix. Local maximization of VoD [9] is applied to find a better rotation matrix with respect to naive estimation, which is then tested against ICA reconstruction with the algorithm described in Section 7.5.2. The rotation matrix is accepted as the currently best perturbation if it provides a higher minimum privacy guarantee than the previous perturbations. After the iterations, if necessary, a noise component is appended to the perturbation so that the distance-inference attack cannot reduce the privacy guarantee below a safety level φ, e.g., φ = 0.2. Algorithm 1 outputs the rotation matrix $R_t$, the random translation matrix $\Psi$, the noise level $\sigma^2$, and the corresponding privacy guarantee (the minimum privacy guarantee is used below) with respect to the known attacks. If the final privacy guarantee is lower than the expected threshold, the data owner can choose not to release the data. This algorithm provides a framework in which any newly discovered attack can be simulated and evaluated.

Algorithm 1 Finding a resilient perturbation ($X_{d\times N}$, w, m)
Input: $X_{d\times N}$: the original dataset; w: weights for attributes in privacy evaluation; m: the number of iterations.
Output: $R_t$: the selected rotation matrix; $\Psi$: the random translation; $\sigma^2$: the noise level; p: the privacy quality.
  calculate the covariance matrix C of X;
  p = 0, and randomly generate the translation $\Psi$;
  for each iteration do
    randomly generate a rotation matrix R;
    swap the rows of R to get R', which maximizes $\min_{1\le i\le d}\{\frac{1}{w_i} Cov(R'X - X)_{(i,i)}\}$;
    $p_0$ = the privacy guarantee of R' under naive estimation, $p_1 = 0$;
    if $p_0 > p$ then
      generate $\hat{X}$ with ICA;
      $\{(1), (2), \ldots, (d)\} = \arg\min_{\{(1),(2),\ldots,(d)\}} \sum_{i=1}^{d} \Delta PDF(X_i, O_{(i)})$;
      $p_1 = \min_{1\le k\le d} \frac{1}{w_k} VoD(X_k, O_{(k)})$;
    end if
    if $p_1 > p$ then $p = p_1$, $R_t = R'$ end if
  end for
  attach a noise component with level $\sigma^2$ so that the privacy guarantee under the distance-inference attack is at least φ, if p < φ.
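A compressed sketch of the hill-climbing loop (ours; the privacy-evaluation subroutines are stubs standing in for the VoD and ICA analyses of Sections 7.4 and 7.5, and `random_rotation` is the hypothetical helper from Section 7.2.2):

```python
import numpy as np

def find_resilient_perturbation(X, w, m, evaluate_naive, evaluate_ica, rng):
    """Hill-climbing over random rotations, in the spirit of Algorithm 1.

    evaluate_naive / evaluate_ica: callables returning the minimum weighted
    privacy guarantee of a candidate rotation under the respective attack.
    """
    d = X.shape[0]
    t = rng.standard_normal((d, 1))          # random translation, fixed up front
    best_p, best_R = 0.0, None
    for _ in range(m):
        R = random_rotation(d, rng)
        p0 = evaluate_naive(R, X, w)         # resilience to naive estimation
        p1 = evaluate_ica(R, X, w)           # resilience to ICA reconstruction
        p = min(p0, p1)
        if p > best_p:                       # keep the best perturbation so far
            best_p, best_R = p, R
    return best_R, t, best_p                 # noise level sigma^2 added afterwards

# Usage: plug in VoD-based evaluators, then release best_R @ X + t (+ noise)
# only if best_p exceeds the owner's safety threshold phi.
```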
7.6 Conclusion
We have reviewed the multiplicative perturbation method as an alternative approach to privacy-preserving data mining. The design of this category of perturbation algorithms is based on an important principle: by developing perturbation algorithms that always preserve the mining-task- and model-specific data utility, one can focus on finding a perturbation that provides a higher level of privacy guarantee. We described three representative multiplicative perturbation methods: rotation perturbation, projection perturbation, and geometric perturbation. All aim at preserving the distance relationships in the original data, thus achieving good data utility for a set of classification and clustering models. Another important advantage of these multiplicative perturbation methods is that we are not required to re-design the existing data mining algorithms in order to perform data mining over the perturbed data.

Privacy evaluation and attack analysis are the major challenging issues for multiplicative perturbations. We reviewed the multi-column variance-of-difference (VoD) based evaluation method and the distance-based method. Since column distribution information has a high probability of being released publicly, it is in principle necessary to evaluate the privacy guarantee on a per-column basis. Although this chapter does not attempt to enumerate all possible attacks, as attack analysis for multiplicative perturbation is still a very active area, we described several types of attacks, organized the discussion according to the level of knowledge that the attacker may have about the original data, and outlined some techniques developed to date for addressing these attacks. Based on attack analysis and the VoD-based evaluation method, we showed how to find perturbations that locally optimize the level of privacy guarantee with respect to various attacks.

Acknowledgment
This work is partially supported by grants from the NSF CISE CyberTrust program, an IBM faculty award (2006), and an AFOSR grant.

References
[1] Aggarwal, C. C., and Yu, P. S. A condensation approach to privacy preserving data mining. Proc. of Intl. Conf. on Extending Database Technology (EDBT) 2992 (2004), 183-199.
[2] Aggarwal, C. C., and Yu, P. S. On privacy-preservation of text and sparse binary data with sketches. SIAM Data Mining Conference (2007).
[3] Agrawal, D., and Aggarwal, C. C. On the design and quantification of privacy preserving data mining algorithms. Proc. of ACM PODS Conference (2002).
[4] Agrawal, R., and Srikant, R. Privacy-preserving data mining. Proc. of ACM SIGMOD Conference (2000).
[5] Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. Proc. of ACM PODS Conference (1996).
[6] Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. OPTICS: Ordering points to identify the clustering structure. Proc. of ACM SIGMOD Conference (1999), 49-60.
[7] Chen, K., and Liu, L. A random geometric perturbation approach to privacy-preserving data classification. Technical Report, College of Computing, Georgia Tech (2005).
[8] Chen, K., and Liu, L. A random rotation perturbation approach to privacy preserving data classification. Proc. of Intl. Conf. on Data Mining (ICDM) (2005).
[9] Chen, K., and Liu, L. Towards attack-resilient geometric data perturbation. SIAM Data Mining Conference (2007).
[10] Cristianini, N., and Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[11] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. Second International Conference on Knowledge Discovery and Data Mining (1996), 226-231.
[12] Evfimievski, A., Gehrke, J., and Srikant, R. Limiting privacy breaches in privacy preserving data mining. Proc. of ACM PODS Conference (2003).
[13] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. Privacy preserving mining of association rules. Proc. of ACM SIGKDD Conference (2002).
[14] Feigenbaum, J., Ishai, Y., Malkin, T., Nissim, K., Strauss, M., and Wright, R. N. Secure multiparty computation of approximations.
In ICALP '01: Proceedings of the 28th International Colloquium on Automata, Languages and Programming (2001), Springer-Verlag, pp. 927-938.
[15] Guo, S., and Wu, X. Deriving private information from arbitrarily projected data. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'07) (Warsaw, Poland, Sept 2007).
[16] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer-Verlag, 2001.
[17] Hinneburg, A., and Keim, D. A. An efficient approach to clustering in large multimedia databases with noise. Proc. of ACM SIGKDD Conference (1998), 58-65.
[18] Hyvarinen, A., Karhunen, J., and Oja, E. Independent Component Analysis. Wiley-Interscience, 2001.
[19] Jain, A. K., and Dubes, R. C. Data clustering: A review. ACM Computing Surveys 31 (1999), 264-323.
[20] Jiang, T. How many entries of a typical orthogonal matrix can be approximated by independent normals? To appear in The Annals of Probability (2005).
[21] Johnson, W. B., and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26 (1984).
[22] Kargupta, H., Datta, S., Wang, Q., and Sivakumar, K. On the privacy preserving properties of random data perturbation techniques. Proc. of Intl. Conf. on Data Mining (ICDM) (2003).
[23] Kim, J. J., and Winkler, W. E. Multiplicative noise for masking continuous data. Tech. Rep. Statistics #2003-01, Statistical Research Division, U.S. Bureau of the Census, Washington D.C., April 2003.
[24] LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. Mondrian multidimensional k-anonymity. Proc. of IEEE Intl. Conf. on Data Eng. (ICDE) (2006).
[25] Lewicki, M. S., and Sejnowski, T. J. Learning overcomplete representations. Neural Computation 12, 2 (2000).
[26] Lindell, Y., and Pinkas, B. Privacy preserving data mining. Journal of Cryptology 15, 3 (2002), 177-206.
[27] Liu, K., Giannella, C., and Kargupta, H. An attacker's view of distance preserving maps for privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06) (Berlin, Germany, September 2006).
[28] Liu, K., Kargupta, H., and Ryan, J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE) 18, 1 (January 2006), 92-106.
[29] Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasubramaniam, M. l-diversity: Privacy beyond k-anonymity. Proc. of IEEE Intl. Conf. on Data Eng. (ICDE) (2006).
[30] Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. Applied Linear Statistical Models. WCB/McGraw-Hill, 1996.
[31] Oliveira, S. R. M., and Zaïane, O. R. Privacy preservation when sharing data for clustering. In Proceedings of the International Workshop on Secure Data Management in a Connected World (Toronto, Canada, August 2004), pp. 67-82.
[32] Sadun, L. Applied Linear Algebra: The Decoupling Principle. Prentice Hall, 2001.
[33] Stewart, G. The efficient generation of random orthogonal matrices with an application to condition estimation. SIAM Journal on Numerical Analysis 17 (1980).
[34] Sweeney, L. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10, 5 (2002).
[35] Vaidya, J., and Clifton, C. Privacy preserving k-means clustering over vertically partitioned data. Proc. of ACM SIGKDD Conference (2003).
Chapter 8
A Survey of Quantification of Privacy Preserving Data Mining Algorithms

Elisa Bertino
Department of Computer Science, Purdue University
bertino@cs.purdue.edu

Dan Lin
Department of Computer Science, Purdue University
lindan@cs.purdue.edu

Wei Jiang
Department of Computer Science, Purdue University
wjiang@cs.purdue.edu

Abstract The aim of privacy preserving data mining (PPDM) algorithms is to extract relevant knowledge from large amounts of data while protecting at the same time sensitive information. An important aspect in the design of such algorithms is the identification of suitable evaluation criteria and the development of related benchmarks. Recent research in the area has devoted much effort to determining a trade-off between the right to privacy and the need for knowledge discovery. It is often the case that no privacy preserving algorithm exists that outperforms all the others on all possible criteria. Therefore, it is crucial to provide a comprehensive view of a set of metrics related to existing privacy preserving algorithms, so that we can gain insights on how to design more effective measures and PPDM algorithms. In this chapter, we review and summarize existing criteria and metrics for evaluating privacy preserving techniques.

Keywords: Privacy metrics.

8.1 Introduction
Privacy is one of the most important properties that an information system must satisfy. For this reason, several efforts have been devoted to incorporating privacy preserving techniques into data mining algorithms, in order to prevent the disclosure of sensitive information during the knowledge discovery process. Existing privacy preserving data mining techniques can be classified according to the following five dimensions [32]: (i) the data distribution (centralized or distributed); (ii) the modification applied to the data (encryption, perturbation, generalization, and so on) in order to sanitize them; (iii) the data mining algorithm which the privacy preservation technique is designed for; (iv) the data type (single data items or complex data correlations) that needs to be protected from disclosure; and (v) the approach adopted for preserving privacy (heuristic or cryptography-based). While heuristic-based techniques are mainly conceived for centralized datasets, cryptography-based algorithms are designed for protecting privacy in a distributed scenario through encryption techniques. Recently proposed heuristic-based algorithms aim at hiding sensitive raw data by applying perturbation techniques based on probability distributions. Moreover, several heuristic-based approaches for hiding both raw and aggregated data through a hiding technique (k-anonymization, adding noise, data swapping, generalization and sampling) have been developed, first in the context of association rule mining and classification and, more recently, for clustering techniques.

Given the number of different privacy preserving data mining (PPDM) techniques that have been developed in recent years, there is an emerging need to move toward standardization in this new research area, as discussed by Oliveira and Zaiane [23]. One step toward this essential process is to provide a quantification approach for PPDM algorithms that makes it possible to evaluate and compare such algorithms.
However, due to the variety of characteristics of PPDM algorithms, it is often the case that no privacy preserving algorithm exists that outperforms all the others on all possible criteria. Rather, an algorithm may perform better than another on specific criteria, such as privacy level or data quality. Therefore, it is important to provide users with a comprehensive set of privacy preserving related metrics which will enable them to select the most appropriate privacy preserving technique for the data at hand, with respect to the specific parameters they are interested in optimizing [6].

For a better understanding of PPDM-related metrics, we next identify a proper set of criteria and related benchmarks for evaluating PPDM algorithms, and then adopt these criteria to categorize the metrics. First, we need to be clear about the concept of "privacy" and the general goals of a PPDM algorithm. In our society the term privacy is overloaded and can, in general, assume a wide range of different meanings. For example, in the context of the HIPAA¹ Privacy Rule, privacy means the individual's ability to control who has access to personal health care information. From the organization's point of view, privacy involves the definition of policies stating which information is collected, how it is used, and how customers are informed and involved in this process. Moreover, there are many other definitions of privacy that are generally related to the particular environment in which privacy has to be guaranteed. What we need is a more generic definition that can be instantiated to different environments and situations. From a philosophical point of view, Schoeman [26] and Walters [33] identify three possible definitions of privacy:

Privacy as the right of a person to determine which personal information about himself/herself may be communicated to others.
Privacy as the control over access to information about oneself.
Privacy as limited access to a person and to all the features related to the person.

In these three definitions, what is interesting from our point of view is the concept of "Controlled Information Release". From this idea, we argue that a definition of privacy more related to our target could be the following: "The right of an individual to be secure from unauthorized disclosure of information about oneself that is contained in an electronic repository". Performing a final tuning of the definition, we consider privacy as "The right of an entity to be secure from unauthorized disclosure of sensitive information that is contained in an electronic repository or that can be derived as aggregate and complex information from data stored in an electronic repository". The last generalization is due to the fact that privacy is not a concept that applies only to individuals. As in [23], we consider two main scenarios.

The first is the case of a medical database, where there is the need to provide information about diseases while preserving the patient identity. Another scenario is the classical "market basket" database, where the transactions related to different client purchases are stored, and from which it is possible to extract information in the form of association rules such as "If a client buys a product X, he/she will also purchase Z with y% probability".

¹Health Insurance Portability and Accountability Act
The first is an example where individual privacy has to be ensured by protecting, from unauthorized disclosure, sensitive information in the form of specific data items related to specific individuals. The second one, instead, emphasizes that not only must the raw data contained in a database be protected, but also, in some cases, the high-level information that can be derived from non-sensitive raw data. Such a scenario justifies the final generalization of our privacy definition.

In the light of these considerations, it is now easy to define the main goals that a PPDM algorithm should enforce:

1 A PPDM algorithm should prevent the discovery of sensitive information.
2 It should be resistant to the various data mining techniques.
3 It should not compromise the access and the use of non-sensitive data.
4 It should not have an exponential computational complexity.

Correspondingly, we identify the following set of criteria based on which a PPDM algorithm can be evaluated.

- Privacy level offered by a privacy preserving technique, which indicates how closely the hidden sensitive information can still be estimated;
- Hiding failure, that is, the portion of sensitive information that is not hidden by the application of a privacy preservation technique;
- Data quality after the application of a privacy preserving technique, considered both as the quality of the data themselves and the quality of the data mining results after the hiding strategy is applied;
- Complexity, that is, the ability of a privacy preserving algorithm to execute with good performance in terms of all the resources used by the algorithm.

In the rest of the chapter, we first present the details of each criterion by analyzing existing PPDM techniques. Then we discuss how to select a proper metric under specified conditions. Finally, we summarize the chapter and outline future research directions.

8.2 Metrics for Quantifying Privacy Level
Before presenting the different metrics related to privacy level, we need to take into account two aspects: (i) sensitive or private information can be contained in the original dataset; and (ii) private information can be discovered from the data mining results. We refer to the first as data privacy and to the latter as result privacy.

8.2.1 Data Privacy
In general, the quantification used to measure data privacy is the degree of uncertainty with which the original private data can be inferred. The higher the degree of uncertainty achieved by a PPDM algorithm, the better the data privacy is protected by that algorithm. For different types of PPDM algorithms, the degree of uncertainty is estimated in different ways. According to the adopted techniques, PPDM algorithms can be classified into two main categories: heuristic-based approaches and cryptography-based approaches. Heuristic-based approaches mainly include four sub-categories: additive noise, multiplicative noise, k-anonymization, and statistical disclosure control based approaches. In what follows, we survey representative works of each category of PPDM algorithms and review the metrics they use.

Additive-Noise-based Perturbation Techniques. The basic idea of the additive-noise-based perturbation technique is to add random noise to the actual data.
In [2], Agrawal and Srikant use an additive-noise-based technique to perturb data. They then estimate the probability distribution of the original numeric data values in order to build a decision tree classifier from the perturbed training data. They introduce a quantitative measure to evaluate the amount of privacy offered by a method and evaluate the proposed method against this measure. The privacy is measured by evaluating how closely the original values of a modified attribute can be determined. In particular, if the perturbed value of an attribute can be estimated, with a confidence c, to belong to an interval [a, b], then the privacy is estimated by (b − a) with confidence c. However, this metric does not work well because it does not take into account the distribution of the original data along with the perturbed data. Therefore, a metric that considers all the informative content of the data available to the user is needed. Agrawal and Aggarwal [1] address this problem by introducing a new privacy metric based on the concept of information entropy. More specifically, they propose an Expectation Maximization (EM) based algorithm for distribution reconstruction, which converges to the maximum likelihood estimate of the original distribution computed on the perturbed data. The measurement of privacy given by them considers the fact that both the perturbed individual records and the reconstructed distribution are available to the user, as well as the perturbing distribution, as specified in [10]. This metric defines the average conditional privacy of an attribute A, given other information modeled by a random variable B, as 2^{h(A|B)}, where h(A|B) is the conditional differential entropy of A given B, a measure of the uncertainty inherent in the value of A given the value of B.

Another additive-noise-based perturbation technique is by Rizvi and Haritsa [24]. They propose a distortion method to pre-process the data before executing the mining process. Their privacy measure deals with the probability with which the user's distorted entries can be reconstructed. Their goal is to ensure privacy at the level of individual entries in each customer tuple. In other words, the authors estimate the probability that a given 1 or 0 in the true matrix representing the transactional database can be reconstructed, even though, for many applications, the 1's and 0's do not require the same level of privacy.

Evfimievski et al. [11] propose a framework for mining association rules from transactions consisting of categorical items, where the data has been randomized to preserve the privacy of individual transactions, while ensuring at the same time that only true associations are mined. They also provide a formal definition of privacy breaches and a class of randomization operators that are much more effective in limiting breaches than uniform randomization. According to Definition 4 from [11], an itemset A results in a privacy breach of level ρ if the probability that an item in A belongs to a non-randomized transaction, given that A is included in a randomized transaction, is greater than or equal to ρ. In some scenarios, being confident that an item was not present in the original transaction may also be considered a privacy breach. In order to evaluate the privacy breaches, the approach taken by Evfimievski et al. is to count the occurrences of an itemset in a randomized transaction and of its sub-items in the corresponding non-randomized transaction.
Out of all sub-items of an itemset, the item causing the worst privacy breach is chosen. Then, for each combination of transaction size and itemset size, the worst and the average value of this breach level are computed over all frequent itemsets. Finally, the itemset size giving the worst value for each of these two statistics is selected.

Finally, we introduce a universal measure of data privacy level, proposed by Bertino et al. in [6]. The measure is developed based on [1]. The basic concept used by this measure is information entropy, as defined by Shannon [27]: let X be a random variable which takes on a finite set of values according to a probability distribution p(x). Then, the entropy of this probability distribution is defined as follows:

h(X) = -\sum_{x} p(x) \log_2 p(x)    (8.1)

or, in the continuous case:

h(X) = -\int f(x) \log_2 f(x) \, dx    (8.2)

where f(x) denotes the density function of the continuous random variable x. Information entropy is a measure of how much "choice" is involved in the selection of an event, or of how uncertain we are of its outcome. It can be used to quantify the amount of information associated with a set of data. The concept of "information associated with data" can be useful in evaluating the privacy achieved by a PPDM algorithm. Because the entropy represents the information content of a datum, the entropy after data sanitization should be higher than the entropy before the sanitization. Moreover, the entropy can be regarded as an evaluation of how uncertain the forecast of an event is, which in our context is the forecast of the correct value of a datum. Consequently, the level of privacy inherent in an attribute X, given some information modeled by Y, is defined as follows:

\Pi(X|Y) = 2^{-\int f_{X,Y}(x,y) \log_2 f_{X|Y=y}(x) \, dx \, dy}    (8.3)

The privacy level defined in equation 8.3 is very general. In order to use it in the different PPDM contexts, it needs to be refined in relation to characteristics like the type of transactions, the type of aggregation, and the PPDM method. In [6], an example of instantiating the entropy concept to evaluate the privacy level in the context of association rules is presented. However, it is worth noting that the value of the privacy level depends not only on the PPDM algorithm used, but also on the knowledge that an attacker has about the data before the use of data mining techniques, and on the relevance of this knowledge in the data reconstruction operation. This problem is underlined, for example, in [29, 30]. In [6], this aspect is not considered, but it is possible to introduce assumptions on attacker knowledge by properly modeling Y.

Multiplicative-Noise-based Perturbation Techniques. According to [16], additive random noise can be filtered out using certain signal processing techniques with very high accuracy. To avoid this problem, random projection-based multiplicative perturbation techniques have been proposed in [19]. Instead of adding some random values to the actual data, random matrices are used to project the set of original data points to a randomly chosen lower-dimensional space.
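The following is a minimal sketch of such a random projection, assuming a Gaussian random matrix (one of the choices analyzed in [19]); the dimensions and the scaling convention are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_projection(X, k):
    """Project the columns of a d x n data matrix X onto k dimensions,
    with k < d.

    With R having i.i.d. zero-mean entries of variance sigma^2, the
    scaling 1 / (sqrt(k) * sigma) makes inner products between columns
    unbiased: E[(R x1 / s).T @ (R x2 / s)] = x1.T @ x2.
    """
    d = X.shape[0]
    sigma = 1.0
    R = rng.normal(0.0, sigma, size=(k, d))
    return (R @ X) / (np.sqrt(k) * sigma)

X = rng.normal(size=(50, 8))    # 8 data points in 50 dimensions
Y = random_projection(X, k=20)  # the published, perturbed view
```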
However, the transformed data still preserves many statistical aggregates of the original dataset, so that certain data mining tasks (e.g., computing the inner product matrix, linear classification, k-means clustering, and computing Euclidean distances) can be performed on the transformed data in a distributed environment (with data either vertically or horizontally partitioned) with small errors. In addition, this approach provides a high degree of privacy for the original data. As analyzed in the paper, even if the random matrix (i.e., the multiplicative noise) is disclosed, it is impossible to find the exact values of the original dataset; finding an approximation of the original data is, however, possible. The variance of the approximated data is used as the privacy measure.

Oliveira and Zaiane [22] also adopt a multiplicative-noise-based perturbation technique to perform clustering analysis while ensuring privacy preservation at the same time. They introduce a family of geometric data transformation methods in which a noise vector is applied to distort confidential numerical attributes. The privacy ensured by such techniques is measured as the variance difference between the actual and the perturbed values. This measure is given by Var(X − Y), where X represents a single original attribute and Y the distorted attribute. This measure can be made scale invariant with respect to the variance of X by expressing security as Sec = Var(X − Y)/Var(X).

k-Anonymization Techniques. The concept of k-anonymization is introduced by Samarati and Sweeney in [25, 28]. A database is k-anonymous with respect to quasi-identifier attributes (a set of attributes that can be used with certain external information to identify a specific individual) if there exist at least k transactions in the database having the same values on the quasi-identifier attributes. In practice, in order to protect a sensitive dataset T, before releasing T to the public, T is converted into a new dataset T* that guarantees the k-anonymity property for a sensitive attribute by performing some value generalizations on the quasi-identifier attributes. Therefore, the degree of uncertainty about the sensitive attribute is at least 1/k.

Statistical-Disclosure-Control-based Techniques. In the context of statistical disclosure control, a large number of methods have been developed to preserve individual privacy when releasing aggregated statistics on data. To anonymize the released statistics against data items, such as person, household, and business, which can be used to identify an individual, not only the features described by the statistics but also the related information publicly available need to be considered [35]. In [7] a description of the most relevant perturbation methods proposed so far is presented. Among the methods specifically designed for continuous data, the following masking techniques are described: additive noise, data distortion by probability distribution, resampling, microaggregation, rank swapping, etc. For categorical data, both perturbative and non-perturbative methods are presented. The top-coding and bottom-coding techniques are both applied to ordinal categorical variables; they recode, respectively, the highest and the lowest p values of a variable into a new category. The global-recoding technique, instead, recodes the p lowest frequency categories into a single one.
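A small sketch of top- and bottom-coding, assuming an ordinal variable encoded as numbers and threshold-style cut points (the exact parameterization varies across the statistical disclosure control literature):

```python
def top_bottom_code(values, low_cut, high_cut):
    """Bottom-code every value at or below low_cut and top-code every
    value at or above high_cut into two new boundary categories."""
    recoded = []
    for v in values:
        if v <= low_cut:
            recoded.append(f"<={low_cut}")
        elif v >= high_cut:
            recoded.append(f">={high_cut}")
        else:
            recoded.append(v)
    return recoded

incomes = [10, 25, 40, 55, 80, 120, 300]
masked = top_bottom_code(incomes, low_cut=25, high_cut=120)
# ['<=25', '<=25', 40, 55, 80, '>=120', '>=120']
```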
The privacy level of such methods is assessed by using the disclosure risk, that is, the risk that a piece of information can be linked to a specific individual. There are several approaches to measuring the disclosure risk. One approach is based on the computation of distance-based record linkage. An intruder is assumed to try to link the masked dataset with an external dataset using the key variables. The distance between records in the original and the masked datasets is computed. A record in the masked dataset is labelled as "linked" or "linked to 2nd nearest" if the nearest or 2nd nearest record in the original dataset turns out to be the corresponding original record. The disclosure risk is then computed as the percentage of records "linked" and "linked to 2nd nearest". A second approach is based on the computation of probabilistic record linkage. The linear sum assignment model is used to "pair" records in the original file and the masked file, and the percentage of correctly paired records is a measure of disclosure risk. Yet another approach computes rank intervals for the records in the masked dataset: the proportion of original values that fall into the interval centered around their corresponding masked value is a measure of disclosure risk.

Cryptography-based Techniques. Cryptography-based techniques usually guarantee a very high level of data privacy. In [14], Kantarcioglu and Clifton address the problem of secure mining of association rules over horizontally partitioned data, using cryptographic techniques to minimize the information shared. Their solution is based on the assumption that each party first encrypts its own itemsets using commutative encryption, and then the already encrypted itemsets of every other party. Later on, an initiating party transmits its frequency count, plus a random value, to its neighbor, which adds its frequency count and passes it on to the other parties. Finally, a secure comparison takes place between the final and initiating parties to determine whether the final result is greater than the threshold plus the random value.

Another cryptography-based approach is described in [31]. This approach addresses the problem of association rule mining in vertically partitioned data. In other words, its aim is to determine the item frequencies when transactions are split across different sites, without revealing the contents of individual transactions. The security of the protocol for computing the scalar product is analyzed. Though cryptography-based techniques can protect data privacy well, they may not be considered good with respect to other metrics, like the efficiency metrics that will be discussed in later sections.

8.2.2 Result Privacy

So far, we have seen privacy metrics related to the data mining process. Many data mining tasks produce aggregate results, such as Bayesian classifiers. Although it is possible to protect sensitive data while a classifier is constructed, can the resulting classifier be used to infer sensitive data values? In other words, do data mining results violate privacy? This issue has been analyzed, and a framework is proposed in [15] to test whether a classifier C creates an inference channel that could be exploited to infer sensitive data values.
The framework considers three types of data: public data (P), accessible to everyone including the adversary; private/sensitive data (S), which must be protected and is unknown to the adversary; and unknown data (U), which is not known to the adversary but whose release might cause a privacy violation. The framework assumes that S depends only on P and U, and that the adversary has at most t data samples of the form (p_i, s_i). The approach to determine whether an inference channel exists comprises two steps. First, a classifier C1 is built on the t data samples. To evaluate the impact of C, another classifier C2 is built based on the same t data samples plus the classifier C. If the accuracy of C2 is significantly better than that of C1, we can say that C provides an inference channel for S. Classifier accuracy is measured based on the Bayesian classification error. Suppose we have a dataset {x_1,...,x_n} and we want to classify each x_i into one of m classes labelled {1,...,m}, using a classifier C:

C: x_i \to C(x_i) \in \{1,...,m\}, \quad i = 1,...,n

The classification error for C is defined as:

\sum_{j=1}^{m} Pr(C(x_i) \neq j \mid z = j) \, Pr(z = j)

where z is the actual class label of x_i. Since cryptography-based PPDM techniques usually produce the same results as those mined from the original dataset, analyzing the privacy implications of the mining results is particularly important for this class of techniques.

8.3 Metrics for Quantifying Hiding Failure

The percentage of sensitive information that is still discovered after the data has been sanitized gives an estimate of the hiding failure parameter. Most of the developed privacy preserving algorithms are designed with the goal of obtaining zero hiding failure; thus, they hide all the patterns considered sensitive. However, it is well known that the more sensitive information we hide, the more non-sensitive information we miss. Therefore, some PPDM algorithms have recently been developed which allow one to choose the amount of sensitive data that should be hidden, in order to find a balance between privacy and knowledge discovery. For example, in [21], Oliveira and Zaiane define the hiding failure (HF) as the percentage of restrictive patterns that are still discovered from the sanitized database. It is measured as follows:

HF = \frac{\#RP(D')}{\#RP(D)}    (8.4)

where #RP(D) and #RP(D') denote the number of restrictive patterns discovered from the original database D and the sanitized database D', respectively. Ideally, HF should be 0. In their framework, they allow the specification of a disclosure threshold φ, representing the percentage of sensitive transactions that are not sanitized, which makes it possible to find a balance between the hiding failure and the number of misses. Note that φ does not control the hiding failure directly, but indirectly, by controlling the proportion of sensitive transactions to be sanitized for each restrictive pattern.

Moreover, as pointed out in [32], it is important not to forget that intruders and data terrorists will try to compromise information by using various data mining algorithms. Therefore, a PPDM algorithm developed against a particular data mining technique, and assuring privacy of information against it, may not attain similar protection against all possible data mining algorithms.
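A minimal sketch of the hiding-failure computation of equation 8.4, assuming the restrictive patterns mined from each database are available as Python sets (how they are mined is up to the chosen mining algorithm):

```python
def hiding_failure(rp_original, rp_sanitized):
    """HF = #RP(D') / #RP(D): the fraction of restrictive patterns mined
    from the original database D that an adversary still discovers in
    the sanitized database D'. Ideally HF = 0."""
    if not rp_original:
        return 0.0
    return len(rp_sanitized) / len(rp_original)

hf = hiding_failure(
    rp_original={("beer", "diapers"), ("milk", "bread")},
    rp_sanitized={("milk", "bread")},
)   # -> 0.5: one of the two sensitive patterns survived sanitization
```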
In order to provide a complete evaluation of a PPDM algorithm, we therefore need to measure its hiding failure against data mining techniques different from the one the PPDM algorithm has been designed for. Such an evaluation requires identifying a class of data mining algorithms that is significant for the test. Alternatively, a formal framework could be developed such that, by testing a PPDM algorithm against pre-selected data sets, privacy assurance can be transitively proved for a whole class of data mining algorithms.

8.4 Metrics for Quantifying Data Quality

The main feature of most PPDM algorithms is that they usually modify the database through the insertion of false information or through the blocking of data values in order to hide sensitive information. Such perturbation techniques cause a decrease in data quality. It is obvious that the more changes are made to the database, the less the database reflects the domain of interest. Therefore, data quality metrics are very important in the evaluation of PPDM techniques. Since data is often sold for profit, or shared with others in the hope of leading to innovation, data quality should remain at an acceptable level with respect to the intended data usage. If data quality is too degraded, the released database is useless for the purpose of knowledge extraction.

In existing work, several data quality metrics have been proposed that are either generic or data-use-specific. However, currently there is no metric that is widely accepted by the research community. Here we try to identify a set of possible measures that can be used to evaluate different aspects of data quality. In evaluating the data quality after the privacy preserving process, it can be useful to assess both the quality of the data resulting from the PPDM process and the quality of the data mining results. The quality of the data themselves can be considered as a general measure evaluating the state of the individual items contained in the database after the enforcement of a privacy preserving technique. The quality of the data mining results evaluates the alteration in the information that is extracted from the database after the privacy preservation process, on the basis of the intended data use.

8.4.1 Quality of the Data Resulting from the PPDM Process

The main problem with data quality is that its evaluation is relative [18], in that it usually depends on the context in which the data are used. In particular, some aspects of data quality evaluation depend not only on the PPDM algorithm, but also on the structure of the database, and on the meaning and relevance of the information stored in the database with respect to a well defined context. In the scientific literature, data quality is generally considered a multi-dimensional concept that in certain contexts involves both objective and subjective parameters [3, 34]. Among the various possible parameters, the following ones are usually considered the most relevant:

- Accuracy: it measures the proximity of a sanitized value to the original value.
- Completeness: it evaluates the degree of missing data in the sanitized database.
- Consistency: it is related to the internal constraints, that is, the relationships that must hold among different fields of a data item or among data items in a database.
Accuracy. The accuracy is closely related to the information loss resulting from the hiding strategy: the lower the information loss, the better the data quality. This measure largely depends on the specific class of PPDM algorithms. In what follows, we discuss how different approaches measure accuracy.

As for heuristic-based techniques, we distinguish the following cases, based on the modification technique that is performed for the hiding process. If the algorithm adopts a perturbation or a blocking technique to hide both raw and aggregated data, the information loss can be measured in terms of the dissimilarity between the original dataset D and the sanitized one D'. In [21], Oliveira and Zaiane propose three different methods to measure the dissimilarity between the original and sanitized databases. The first method is based on the difference between the frequency histograms of the original and the sanitized databases. The second method is based on computing the difference between the sizes of the sanitized database and the original one. The third method is based on a comparison between the contents of the two databases. A more detailed analysis of the definition of dissimilarity is presented by Bertino et al. in [6]. They suggest using the following formula in the case of transactional dataset perturbation:

Diss(D, D') = \frac{\sum_{i=1}^{n} |f_D(i) - f_{D'}(i)|}{\sum_{i=1}^{n} f_D(i)}    (8.5)

where i is a data item in the original database D and f_D(i) is its frequency within the database, whereas f_{D'}(i) is the frequency of the same item, after the application of the privacy preservation technique, within the transformed database D'. As we can see, the information loss is defined as the ratio between the sum of the absolute errors made in computing the frequencies of the items in the sanitized database and the sum of all the frequencies of the items in the original database.

Formula 8.5 can also be used for the PPDM algorithms which adopt a blocking technique for inserting into the dataset uncertainty about some sensitive data items or their correlations. The frequency of an item i belonging to the sanitized dataset D' is then given by the mean value between the minimum frequency of the data item i, computed by considering all the blocking values (question marks) associated with it equal to zero, and the maximum frequency, obtained by considering all the question marks equal to one.

In the case of data swapping, the information loss caused by a heuristic-based algorithm can be evaluated by a parameter measuring the data confusion introduced by the value swappings. If there is no correlation among the different database records, the data confusion can be estimated by the percentage of value replacements executed in order to hide specific information.

For the multiplicative-noise-based approaches [19], the quality of the perturbed data depends on the size of the random projection matrix. In general, the error of the inner product matrix produced by this perturbation technique is 0 on average, and its variance is bounded by the inverse of the dimensionality of the reduced space. In other words, when the dimensionality of the random projection matrix is close to that of the original data, the result of computing the inner product matrix based on the transformed or projected data is also close to the actual value.
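A quick numerical check of this behavior, reusing the Gaussian projection sketched earlier (dimensions and trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def mean_inner_product_error(d, k, trials=500):
    """Average |estimated - true| inner product when d-dimensional
    vectors are replaced by scaled k-dimensional random projections;
    the error shrinks as k grows toward d, matching the variance bound
    discussed above."""
    errors = []
    for _ in range(trials):
        x, y = rng.normal(size=d), rng.normal(size=d)
        R = rng.normal(size=(k, d))
        xp, yp = (R @ x) / np.sqrt(k), (R @ y) / np.sqrt(k)
        errors.append(abs(xp @ yp - x @ y))
    return float(np.mean(errors))

print(mean_inner_product_error(d=100, k=10))   # large average error
print(mean_inner_product_error(d=100, k=90))   # much smaller error
```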
Since the inner product is closely related to many distance-based metrics (e.g., Euclidean distance, the cosine angle of two vectors, the correlation coefficient of two vectors, etc.), this error-bound analysis has a direct impact on the mining results whenever the data mining tasks adopt such distance-based metrics.

If the data modification consists of aggregating some data values, the information loss is given by the loss of detail in the data. Intuitively, in this case, in order to perform the hiding operation, the PPDM algorithms use some type of "generalization or aggregation scheme" that can be ideally modeled as a tree. Each cell modification applied during the sanitization phase using the generalization tree introduces a data perturbation that reduces the overall accuracy of the database. As in the case of the k-anonymity algorithm presented in [28], we can use the following formula. Given a database T with NA fields and N transactions, if we identify as generalization scheme a domain generalization hierarchy GT with a depth h, it is possible to measure the information loss (IL) of a sanitized database T* as:

IL(T^*) = \frac{\sum_{i=1}^{NA} \sum_{j=1}^{N} \frac{h}{|GT_{A_i}|}}{|T| \cdot |NA|}    (8.6)

where h / |GT_{A_i}| represents the detail loss for each sanitized cell. For hiding techniques based on a sampling approach, the quality is obviously related to the size of the considered sample and, more generally, to its features.

There are some other precision metrics specifically designed for k-anonymization approaches. One of the earliest data quality metrics is based on the height of generalization hierarchies [25]. The height is the number of times the original data value has been generalized. This metric assumes that a generalization on the data represents an information loss on the original data value. Therefore, data should be generalized in as few steps as possible to preserve maximum utility. However, this metric does not take into account that not all generalization steps are equal in terms of information loss.

Later, Iyengar [13] proposes a general loss metric (LM). Suppose T is a data table with n attributes. The LM metric is intended as the average information loss over all data cells of a given dataset, defined as follows:

LM(T^*) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{|T|} \frac{f(T^*[i][j]) - 1}{g(A_i) - 1}}{|T| \cdot n}    (8.7)

In equation 8.7, T* is the anonymized table of T; f is a function that, given a data cell value T*[i][j], returns the number of distinct values that can be generalized to T*[i][j]; and g is a function that, given an attribute A_i, returns the number of distinct values of A_i.

The next metric, the classification metric (CM), is introduced by Iyengar [13] to optimize a k-anonymous dataset for training a classifier. It is defined as the sum of the individual penalties for each row in the table, normalized by the total number of rows N:

CM(T^*) = \frac{\sum_{\text{all rows}} penalty(\text{row } r)}{N}    (8.8)

The penalty value of row r is 1, i.e., row r is penalized, if it is suppressed or if its class label is not the majority class label of its group. Otherwise, the penalty value of row r is 0. This metric is particularly useful when we want to build a classifier over anonymous data.

Another interesting metric is the discernibility metric (DM) proposed by Bayardo and Agrawal [4]. The discernibility metric assigns a penalty to each tuple based on how many tuples in the transformed dataset are indistinguishable from it.
Let t be a tuple from the original table T, and let G_{T*}(t) be the set of tuples in the anonymized table T* that are indistinguishable from t, that is, the set of tuples in T* equivalent to the anonymized value of t. Then DM is defined as follows:

DM(T^*) = \sum_{t \in T} |G_{T^*}(t)|    (8.9)

Note that if a tuple t has been suppressed, the size of G_{T*}(t) is the same as the size of T*. In many situations, suppression is considered the most expensive operation in terms of information loss. Thus, to maximize data utility, tuple suppression should be avoided whenever possible.

For any given metric M, if M(T) > M(T'), we say T has a higher information loss, or is less precise, than T'. In other words, the data quality of T is worse than that of T'. Is this true for all metrics? What is a good metric? It is not easy to answer these kinds of questions. As shown in [20], CM works better than LM in classification applications, while LM is better for association rule mining. It is apparent that, to judge how good a particular metric is, we need to associate our judgement with specific applications (e.g., classification, mining association rules). The CM metric and the information gain / privacy loss ratio [5, 28] are more interesting measures of utility because they consider the possible application for the data. Nevertheless, it is unclear what to do if we want to build classifiers on various attributes. In addition, these two metrics only work well if the data are intended to be used for building classifiers. Is there a utility metric that works well for various applications? With this in mind, Kifer [17] proposes a utility measure related to the Kullback-Leibler divergence. In theory, using this measure, better anonymous datasets (for different applications) can be produced. Researchers have measured the utility of the resulting anonymous datasets, and preliminary results show that this metric works well in practical applications.

For the statistical-based perturbation techniques which aim to hide the values of a confidential attribute, the information loss is basically the lack of precision in estimating the original distribution function of the given attribute. As defined in [1], the information loss incurred in estimating the density function f_X(x) of the attribute X is measured by computing:

I(f_X, \hat{f}_X) = \frac{1}{2} E\left[\int_{\Omega_X} |f_X(x) - \hat{f}_X(x)| \, dx\right]    (8.10)

that is, half of the expected value of the L_1-norm between f_X(x) and \hat{f}_X(x), which are the density distributions before and after the application of the privacy preserving technique, respectively.

When considering the cryptography-based techniques typically employed in distributed environments, we can observe that they do not use any kind of perturbation technique for the purpose of privacy preservation. Instead, they use cryptographic techniques to assure data privacy at each site by limiting the information shared by all the sites. Therefore, the quality of the data stored at each site is not compromised at all.
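To close the discussion of accuracy metrics, here is a minimal sketch of the discernibility metric of equation 8.9, assuming each anonymized tuple is represented by its quasi-identifier values and suppressed tuples are flagged separately:

```python
from collections import Counter

def discernibility(anonymized_qi, suppressed):
    """DM (equation 8.9): charge each tuple the size of its equivalence
    class; a suppressed tuple is charged the size of the whole table."""
    n = len(anonymized_qi)
    class_sizes = Counter(anonymized_qi)
    return sum(n if supp else class_sizes[qi]
               for qi, supp in zip(anonymized_qi, suppressed))

# One equivalence class of three tuples, one of two, one suppressed tuple:
qi = [("g1",), ("g1",), ("g1",), ("g2",), ("g2",), ("*",)]
print(discernibility(qi, [False] * 5 + [True]))  # 3+3+3+2+2+6 = 19
```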
Completeness and Consistency. While accuracy is a relatively general parameter, in that it can be measured without strong assumptions on the dataset analyzed, completeness is not so general: in some PPDM strategies, e.g., blocking, the completeness evaluation is not significant. Consistency, on the other hand, requires determining all the relationships that are relevant for a given dataset.

In [5], Bertino et al. propose a set of evaluation parameters including a completeness and a consistency evaluation. Unlike other techniques, their approach takes into account two further important aspects: the relevance of the data and the structure of the database. They provide a formal description that can be used to specify the aggregate information of interest for a target database, together with the relevance of the data quality properties for each piece of aggregate information and for each attribute involved in it. Specifically, the completeness lack (denoted CML) is measured as follows:

CML = \sum_{i=0}^{n} (DMG.N_i.CV \times DMG.N_i.CW)    (8.11)

In equation 8.11, DMG is an oriented graph in which each node N_i is an attribute class, CV is the completeness value, and CW is the associated weight. The consistency lack (denoted CSL) is given by the number of constraint violations occurring in all the sanitized transactions, multiplied by the weight associated with each constraint:

CSL = \sum_{i=0}^{n} (DMG.SC_i.csv \times DMG.SC_i.cw) + \sum_{j=0}^{m} (DMG.CC_j.csv \times DMG.CC_j.cw)    (8.12)

In equation 8.12, csv indicates the number of violations, cw is the weight of the constraint, SC_i describes a simple constraint class, and CC_j describes a complex constraint class.

8.4.2 Quality of the Data Mining Results

In some situations, it can be useful, and also more relevant, to evaluate the quality of the data mining results after the sanitization process. This kind of metric is strictly related to the use the data are intended for. Data can be analyzed in order to mine information in terms of associations among single data items, or to classify existing data with the goal of finding an accurate classification of new data items, and so on. Based on the intended data use, the information loss is measured with a specific metric, depending each time on the particular type of knowledge model one aims to extract.

If the intended data usage is data clustering, the information loss can be measured by the percentage of legitimate data points that are not well-classified after the sanitization process. As in [22], a misclassification error ME is defined to measure this information loss:

ME = \frac{1}{N} \sum_{i=1}^{k} (|Cluster_i(D)| - |Cluster_i(D')|)    (8.13)

where N represents the number of points in the original dataset, k is the number of clusters under analysis, and |Cluster_i(D)| and |Cluster_i(D')| represent the number of legitimate data points of the i-th cluster in the original dataset D and the sanitized dataset D', respectively. Since a privacy preserving technique usually modifies the data for the sanitization purpose, the parameters involved in the clustering analysis are almost inevitably affected. In order to achieve high clustering quality, it is very important to keep the clustering results as consistent as possible before and after the application of a data hiding technique.
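A sketch of ME under a set-based reading of equation 8.13, assuming each cluster is given as the set of identifiers of its legitimate points:

```python
def misclassification_error(clusters_original, clusters_sanitized):
    """ME: the fraction of legitimate points that are no longer assigned
    to their original cluster after sanitization. Each argument maps a
    cluster id to the set of point ids it contains."""
    n = sum(len(points) for points in clusters_original.values())
    lost = 0
    for cid, points in clusters_original.items():
        kept = points & clusters_sanitized.get(cid, set())
        lost += len(points) - len(kept)
    return lost / n

me = misclassification_error(
    {0: {1, 2, 3}, 1: {4, 5}},
    {0: {1, 2}, 1: {3, 4, 5}},
)   # point 3 changed cluster -> ME = 0.2
```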
When quantifying information loss in the context of the other data usages, it is useful to distinguish between the lost information, representing the percentage of non-sensitive patterns (e.g., association or classification rules) that are hidden as a side effect of the hiding process, and the artifactual information, representing the percentage of artifactual patterns created by the adopted privacy preserving technique. For example, in [21], Oliveira and Zaiane define two metrics, misses cost and artifactual pattern, which correspond to lost information and artifactual information, respectively. In particular, misses cost measures the percentage of non-restrictive patterns that are hidden after the sanitization process. This happens when some non-restrictive patterns lose support in the database due to the sanitization process. The misses cost (MC) is computed as follows:

MC = \frac{\#{\sim}RP(D) - \#{\sim}RP(D')}{\#{\sim}RP(D)}    (8.14)

where #∼RP(D) and #∼RP(D') denote the number of non-restrictive patterns discovered from the original database D and the sanitized database D', respectively. In the best case, MC should be 0%. Notice that in their approach there is a compromise between the misses cost and the hiding failure: the more restrictive patterns they hide, the more legitimate patterns they miss. The other metric, artifactual pattern (AP), is measured as the percentage of the discovered patterns that are artifacts:

AP = \frac{|P'| - |P \cap P'|}{|P'|}    (8.15)

where |X| denotes the cardinality of X, and P and P' are the sets of patterns discovered from the original and the sanitized database, respectively. According to their experiments, their approach does not produce any artifactual patterns, i.e., AP is always 0.

In the case of association rules, the lost information can be modeled as the set of non-sensitive rules that are accidentally hidden by the privacy preservation technique, referred to as lost rules; the artifactual information, instead, represents the set of new rules, also known as ghost rules, that can be extracted from the database after the application of a sanitization technique. Similarly, if the aim of the mining task is data classification, e.g., by means of decision tree induction, both the lost and the artifactual information can be quantified by means of the corresponding lost and ghost association rules derived from the classification tree. These measures allow one to evaluate the high level information that is extracted from a database, in the form of the widely used inference rules, before and after the application of a PPDM algorithm. It is worth noting that for most cryptography-based PPDM algorithms, the data mining results are the same as those produced from the unsanitized data.

8.5 Complexity Metrics

The complexity metric measures the efficiency and scalability of a PPDM algorithm. Efficiency indicates whether the algorithm can be executed with good performance, and it is generally assessed in terms of space and time. Space requirements are assessed according to the amount of memory that must be allocated in order to implement the given algorithm. For the evaluation of time requirements, there are several approaches. The first approach is to evaluate the CPU time. For example, in [21], the CPU time taken by the algorithm is measured while the size of the input data is increased, with the set of restrictive patterns kept constant. An alternative approach is to evaluate the time requirements in terms of the computational cost; in this case, it is obvious that an algorithm having polynomial complexity is more efficient than one with exponential complexity. Sometimes, the time requirements can even be evaluated by counting the average number of operations executed by a PPDM algorithm. As in [14], the performance is then measured in terms of the number of encryption and decryption operations required by the specific algorithm. The last two measures, i.e., the computational cost and the average number of operations, do not provide an absolute measure, but they can be used to perform a fast comparison among different algorithms.
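For such comparisons, the two simplest measurements, CPU time and an operation count, can be collected as in the following sketch; `toy_sanitize` is a stand-in for whatever PPDM routine is under test:

```python
import time

def measure(sanitize, database):
    """Run a sanitization routine, returning its result, the CPU time
    consumed, and the number of elementary operations (e.g., encryptions
    or value replacements) the routine reports having performed."""
    ops = {"count": 0}
    start = time.process_time()
    result = sanitize(database, ops)
    elapsed = time.process_time() - start
    return result, elapsed, ops["count"]

def toy_sanitize(db, ops):
    # Stand-in algorithm: hide every value, counting one op per value.
    ops["count"] = len(db)
    return ["*"] * len(db)

_, cpu_seconds, op_count = measure(toy_sanitize, list(range(100000)))
```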
In the case of distributed algorithms, especially the cryptography-based algorithms (e.g., [14, 31]), the time requirements can also be evaluated in terms of the communication cost incurred during the exchange of information among the sites involved in the secure computation. Specifically, in [14], the communication cost is expressed as the number of messages exchanged among the sites that are required by the protocol for securely counting the frequency of each rule.

Scalability is another important aspect in assessing the performance of a PPDM algorithm. In particular, scalability describes the efficiency trend as data sizes increase. This parameter concerns the growth of both the performance and storage requirements, as well as of the communication costs required by a distributed technique, as data sizes increase. Due to the continuous advances in hardware technology, large amounts of data can now be easily stored: databases and data warehouses today store and manage increasingly large amounts of data. For this reason, a PPDM algorithm has to be designed and implemented with the capability of handling huge datasets that may still keep growing. The slower the efficiency of a PPDM algorithm decreases as data dimensions grow, the better its scalability. Therefore, the scalability measure is very important in identifying practical PPDM techniques.

8.6 How to Select a Proper Metric

In the previous sections, we have discussed various types of metrics. An important question here is: which one among the presented metrics is the most relevant for a given privacy preserving technique? Dwork and Nissim [9] make some interesting observations about this question. In particular, according to them, in the case of statistical databases privacy is paramount, whereas in the case of distributed databases for which privacy is ensured by a secure multiparty computation technique, functionality is of primary importance. Since a real database usually contains a large number of records, the performance guaranteed by a PPDM algorithm, in terms of time and communication requirements, is a non-negligible factor, as is its trend as the database size increases. The data quality guaranteed by a PPDM algorithm is, on the other hand, very important when privacy protection must be ensured without damaging data usability for authorized users.

From the above observations, we can see that a trade-off metric may help us to state a unique value measuring the effectiveness of a PPDM algorithm. In [7], the score of a masking method provides a measure of the trade-off between disclosure risk and information loss. It is defined as an average between the ranks of the disclosure risk and information loss measures, giving the same importance to both metrics. In [8], an R-U confidentiality map is described, which traces the impact on disclosure risk R and data utility U of changes in the parameters of a disclosure limitation method adopting an additive noise technique. We believe that an index assigning the same importance to both the data quality and the degree of privacy ensured by a PPDM algorithm is quite restrictive, because in some contexts one of these parameters can be more relevant than the other. Moreover, in our opinion, the other parameters, even the less relevant ones, should also be taken into account. The efficiency and scalability
measures, for instance, could be discriminating factors in choosing among a set of PPDM algorithms that ensure similar degrees of privacy and data utility. A weighted mean could thus be a good measure for evaluating, by means of a unique value, the quality of a PPDM algorithm.

8.7 Conclusion and Research Directions

In this chapter, we have surveyed different approaches used in evaluating the effectiveness of privacy preserving data mining algorithms. A set of criteria has been identified: privacy level, hiding failure, data quality, and complexity. As none of the existing PPDM algorithms can outperform all the others with respect to all the criteria, we discussed the importance of certain metrics for each specific type of PPDM algorithm, and also pointed out the goal of a good metric. There are several future research directions in quantifying a PPDM algorithm and its underlying application or data mining task. One is to develop a comprehensive framework according to which various PPDM algorithms can be evaluated and compared. It is also important to design good metrics that can better reflect the properties of a PPDM algorithm, and to develop benchmark databases for testing all types of PPDM algorithms.

References

[1] Agrawal, D., Aggarwal, C.C.: On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 247–255. ACM (2001)
[2] Agrawal, R., Srikant, R.: Privacy preserving data mining. In: Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439–450. ACM (2000)
[3] Ballou, D., Pazer, H.: Modelling data and process quality in multi input, multi output information systems. Management Science 31(2), 150–162 (1985)
[4] Bayardo, R., Agrawal, R.: Data privacy through optimal k-anonymization. In: Proceedings of the 21st International Conference on Data Engineering (2005)
[5] Bertino, E., Fovino, I.N.: Information driven evaluation of data hiding algorithms. In: 7th International Conference on Data Warehousing and Knowledge Discovery, pp. 418–427 (2005)
[6] Bertino, E., Fovino, I.N., Provenza, L.P.: A framework for evaluating privacy preserving data mining algorithms. Data Mining and Knowledge Discovery 11(2), 121–154 (2005)
[7] Domingo-Ferrer, J., Torra, V.: A quantitative comparison of disclosure control methods for microdata. In: L. Zayatz, P. Doyle, J. Theeuwes, J. Lane (eds.) Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113–134. North-Holland (2002)
[8] Duncan, G.T., Keller-McNulty, S.A., Stokes, S.L.: Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. 121, National Institute of Statistical Sciences (2001)
[9] Dwork, C., Nissim, K.: Privacy preserving data mining in vertically partitioned databases. In: CRYPTO 2004, vol. 3152, pp. 528–544 (2004)
[10] Evfimievski, A.: Randomization in privacy preserving data mining. SIGKDD Explorations Newsletter 4(2), 43–48 (2002)
[11] Evfimievski, A., Srikant, R., Agrawal, R., Gehrke, J.: Privacy preserving mining of association rules. In: 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–228. ACM Press (2002)
[12] Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and privacy preservation.
In: Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005). Tokyo, Japan (2005)
[13] Iyengar, V.: Transforming data to satisfy privacy constraints. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 279–288 (2002)
[14] Kantarcioglu, M., Clifton, C.: Privacy preserving distributed mining of association rules on horizontally partitioned data. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24–31 (2002)
[15] Kantarcıoğlu, M., Jin, J., Clifton, C.: When do data mining results violate privacy? In: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 599–604. Seattle, WA (2004)
[16] Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM'03). Melbourne, Florida (2003)
[17] Kifer, D., Gehrke, J.: Injecting utility into anonymized datasets. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 217–228. ACM Press, Chicago, IL, USA (2006)
[18] Kumar Tayi, G., Ballou, D.P.: Examining data quality. Communications of the ACM 41(2), 54–57 (1998)
[19] Liu, K., Kargupta, H., Ryan, J.: Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering 18(1), 92–106 (2006)
[20] Nergiz, M.E., Clifton, C.: Thoughts on k-anonymization. In: The Second International Workshop on Privacy Data Management, held in conjunction with the 22nd International Conference on Data Engineering. Atlanta, Georgia (2006)
[21] Oliveira, S.R.M., Zaiane, O.R.: Privacy preserving frequent itemset mining. In: IEEE ICDM Workshop on Privacy, Security and Data Mining, vol. 14, pp. 43–54 (2002)
[22] Oliveira, S.R.M., Zaiane, O.R.: Privacy preserving clustering by data transformation. In: 18th Brazilian Symposium on Databases (SBBD 2003), pp. 304–318 (2003)
[23] Oliveira, S.R.M., Zaiane, O.R.: Toward standardization in privacy preserving data mining. In: ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7–17 (2004)
[24] Rizvi, S., Haritsa, J.: Maintaining data privacy in association rule mining. In: 28th International Conference on Very Large Databases, pp. 682–693 (2002)
[25] Samarati, P.: Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering (TKDE) 13(6), 1010–1027 (2001)
[26] Schoeman, F.D.: Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press (1984)
[27] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
[28] Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5), 571–588 (2002)
[29] Trottini, M.: A decision-theoretic approach to data disclosure problems. Research in Official Statistics 4, 7–22 (2001)
[30] Trottini, M.: Decision models for data disclosure limitation. Ph.D. thesis, Carnegie Mellon University (2003)
[31] Vaidya, J., Clifton, C.: Privacy preserving association rule mining in vertically partitioned data. In: 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639–644.
ACM Press (2002)
[32] Verykios, V.S., Bertino, E., Nai Fovino, I., Parasiliti, L., Saygin, Y., Theodoridis, Y.: State-of-the-art in privacy preserving data mining. SIGMOD Record 33(1), 50–57 (2004)
[33] Walters, G.J.: Human Rights in an Information Age: A Philosophical Analysis, chap. 5. University of Toronto Press (2001)
[34] Wang, R.Y., Strong, D.M.: Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12(4), 5–34 (1996)
[35] Willenborg, L., De Waal, T.: Elements of Statistical Disclosure Control, Lecture Notes in Statistics, vol. 155. Springer (2001)

Chapter 9

A Survey of Utility-based Privacy-Preserving Data Transformation Methods

Ming Hua
Simon Fraser University, School of Computing Science
8888 University Drive, Burnaby, BC, Canada V5A 1S6
mhua@cs.sfu.ca

Jian Pei
Simon Fraser University, School of Computing Science
8888 University Drive, Burnaby, BC, Canada V5A 1S6
jpei@cs.sfu.ca

Abstract: As a serious concern in data publishing and analysis, privacy preserving data processing has received a lot of attention. Privacy preservation often leads to information loss; consequently, we want to minimize the utility loss as long as the privacy is preserved. In this chapter, we survey the utility-based privacy preservation methods systematically. We first briefly discuss the privacy models and utility measures, and then review four recently proposed methods for utility-based privacy preservation. We first introduce the utility-based anonymization method for maximizing the quality of the anonymized data in query answering and discernibility. Then we introduce the top-down specialization (TDS) method and the progressive disclosure algorithm (PDA) for privacy preservation in classification problems. Last, we introduce the anonymized marginal method, which publishes the anonymized projection of a table to increase the utility while satisfying the privacy requirement.

Keywords: Privacy preservation, data utility, utility-based privacy preservation, k-anonymity, sensitive inference, l-diversity.

9.1 Introduction

Advanced analysis of data sets containing information about individuals poses a serious threat to individual privacy. Various methods have been proposed to tackle the privacy preservation problem in data analysis, such as anonymization and perturbation. The major goal is to protect sensitive individual information (privacy) from being identified through the published data. For example, in k-anonymization, certain individual information is generalized or suppressed so that any individual in a released data set is indistinguishable from at least k − 1 other individuals.

A natural consequence of privacy preservation is information loss. For example, after k-anonymization, the information describing an individual is the same as that of at least k − 1 other individuals. The loss of specific information about certain individuals may affect the data quality; in the extreme case, the data may become totally useless.

Example 9.1 (Utility loss in privacy preservation) Table 9.1a is a data set used for customer analysis. Among the listed attributes, {Age, Education, Zip Code} can be used to uniquely identify an individual. Such a set of attributes is called a quasi-identifier. Annual Income is a sensitive attribute, and Target Customer is the class label of customers.
In order to protect the annual income information of individuals, suppose 2-anonymity is required, so that any individual is indistinguishable from at least one other individual on the quasi-identifier. Tables 9.2b and 9.3c are both valid 2-anonymizations of Table 9.1a; the tuples sharing the same quasi-identifier have the same gId.

Table 9.1a. The original table

tId  Age  Education  Zip Code  Annual Income  Target Customer
t1   24   Bachelor   53711     40k            Y
t2   25   Bachelor   53712     50k            Y
t3   30   Master     53713     50k            N
t4   30   Master     53714     80k            N
t5   32   Master     53715     50k            N
t6   32   Doctorate  53716     100k           N

Table 9.2b. A 2-anonymized table with better utility

gId  tId  Age      Education   Zip Code       Annual Income  Target Customer
g1   t1   [24-25]  Bachelor    [53711-53712]  40k            Y
g1   t2   [24-25]  Bachelor    [53711-53712]  50k            Y
g2   t3   30       Master      [53713-53714]  50k            N
g2   t4   30       Master      [53713-53714]  80k            N
g3   t5   32       GradSchool  [53715-53716]  50k            N
g3   t6   32       GradSchool  [53715-53716]  100k           N

Table 9.3c. A 2-anonymized table with poorer utility

gId  tId  Age      Education  Zip Code       Annual Income  Target Customer
g1   t1   [24-30]  ANY        [53711-53714]  40k            Y
g2   t2   [25-32]  ANY        [53712-53716]  50k            Y
g3   t3   [30-32]  Master     [53713-53715]  50k            N
g1   t4   [24-30]  ANY        [53711-53714]  80k            N
g3   t5   [30-32]  Master     [53713-53715]  50k            N
g2   t6   [25-32]  ANY        [53712-53716]  100k           N

However, Table 9.2b provides more accurate results than Table 9.3c in answering the following two queries:

Q1: "How many customers under age 29 are there in the data set?"
Q2: "Is an individual with Age = 25, Education = Bachelor, Zip Code = 53712 a target customer?"

According to Table 9.2b, the answers to Q1 and Q2 are "2" and "Y", respectively. But according to Table 9.3c, the answer to Q1 is an interval [0, 4], because ages under 29 fall in the age ranges of tuples t1, t2, t4, and t6, and the answer to Q2 is Y or N with 50% probability each.

From this example, we make two observations. First, different anonymizations may lead to different information loss: Tables 9.2b and 9.3c satisfy the same anonymity requirement, but Table 9.2b provides more accurate answers to the queries. Therefore, it is crucial to minimize the information loss in privacy preservation. Second, the data utility depends on the applications using the data. In the above example, Q1 is an aggregate query, so the data is more useful if the attribute values are more accurate. Q2 is a classification query, so the utility of the data depends on how well the classification model is preserved in the anonymized data. In a word, utility is the quality of the data for the intended use.

9.1.1 What is Utility-based Privacy Preservation?

Utility-based privacy preservation has two goals: protecting the private information and preserving the data utility as much as possible. Privacy preservation is a hard requirement, that is, it must be satisfied, while utility is the measure to be optimized. While privacy preservation has been extensively studied, the research on utility-based privacy preservation has just started. The challenges include:

Utility measure. One key issue in utility-based privacy preservation is how to model the data utility in different applications. A good utility measure should capture the intrinsic factors that affect the quality of the data for the specific application.

Balance between utility and privacy. In some situations, preserving utility and privacy do not conflict. But more often than not, hiding the private information requires sacrificing some utility.
How do we trade off between the two goals?

Efficiency and scalability. Traditional privacy preservation is already computationally challenging. For example, even a simple restriction of optimal k-anonymity is NP-hard [3]. How do we develop efficient algorithms if utility is involved as well? Moreover, since real data sets often contain millions of high-dimensional tuples, highly scalable algorithms are needed.

Ability to deal with different types of attributes. Real life data often involve different types of attributes, such as numerical, categorical, binary, or mixtures of these data types. Utility-based privacy preserving methods should be able to deal with attributes of different types.

9.2 Types of Utility-based Privacy Preservation Methods

In this section, we introduce some common privacy models and recently proposed data utility measures.

9.2.1 Privacy Models

Various privacy models have been proposed in the literature. This section introduces some of the privacy models that are often used, as well as the corresponding privacy preserving methods.

K-Anonymity. K-anonymity is a privacy model developed against the linking attack [18]. Given a table T with attributes (A1,...,An), a quasi-identifier is a minimal set of attributes (A_{i1},...,A_{il}) (1 ≤ i1 < ... < il ≤ n) in T that can be joined with external information to re-identify individual records. Note that there may be more than one quasi-identifier in a table.

A table T is said to be k-anonymous, given a parameter k and a quasi-identifier QI = (A_{i1},...,A_{il}), if for each tuple t ∈ T there exist at least (k − 1) other tuples t1,...,t_{k−1} such that those k tuples have the same projection on the quasi-identifier. Tuple t and all the other tuples indistinguishable from t on the quasi-identifier form an equivalence class.

Given a table T with a quasi-identifier and a parameter k, the problem of k-anonymization is to compute a view T' that has the same attributes as T, such that T' is k-anonymous and as close to T as possible according to some quality metric. Data suppression and value generalization are often used for anonymization. Suppression masks the attribute value with a special value in the domain. Generalization replaces a specific value with a more generalized one; for example, the actual age of an individual can be replaced by an interval, or the city of an individual can be replaced by the corresponding province. Certain quality measures are often used in the anonymization, such as the average equivalence class size. Theoretical analysis shows that the problem of optimal anonymization under many quality models is NP-hard [1, 14, 3]. Various k-anonymization methods have been proposed [19, 20, 29, 12, 11]. One of the most important advantages of k-anonymity is that no additional noise or artificial perturbation is added to the original data: all the tuples in an anonymized table remain truthful.

l-Diversity. l-diversity [13] is based on the observation that, if the sensitive values in one equivalence class lack diversity, then no matter how large the equivalence class is, an attacker may still guess the sensitive value of an individual with high probability. For example, Table 9.3c is a 2-anonymous table in which t3 and t5 are generalized into the same equivalence class. However, since their annual income is the same, an attacker can easily conclude that the annual income of t3 is 50k, although 2-anonymity is preserved.
l-Diversity. l-diversity [13] is based on the observation that if the sensitive values in one equivalence class lack diversity, then no matter how large the equivalence class is, an attacker may still guess the sensitive value of an individual with high probability. For example, Table 9.3c is a 2-anonymous table in which t3 and t5 are generalized into the same equivalence class. However, since their annual income is the same, an attacker can easily conclude that the annual income of t3 is 50k, even though 2-anonymity is preserved. Table 9.2b has better diversity in the sensitive attribute: t3 and t4 are in the same equivalence class and their annual incomes are different, so the attacker has only a 50% chance of learning the real annual income of t3.

The l-diversity model addresses this problem. Intuitively, a table is l-diverse if each equivalence class contains at least l "well represented" sensitive values, that is, the l most frequent values have similar frequencies. Consider a table T = (A1, ..., An, S) and constants c and l, where (A1, ..., An) is a quasi-identifier and S is a sensitive attribute. Suppose an equivalence class EC contains values s1, ..., sm with frequencies f(s1), ..., f(sm) (listed in non-ascending order of frequency) on the sensitive attribute S. EC satisfies (c, l)-diversity with respect to S if

f(s1) < c · ( f(sl) + f(sl+1) + ... + f(sm) )
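Once the frequencies of the sensitive values in an equivalence class are known, the (c, l)-diversity condition above reduces to a one-line test. A small sketch, assuming the frequency lists as input (the function name and the examples are ours; the example groups are taken from Tables 9.3c and 9.2b):

def satisfies_cl_diversity(frequencies, c, l):
    # Recursive (c, l)-diversity for one equivalence class: with the
    # frequencies sorted so that f[0] >= f[1] >= ..., require
    # f[0] < c * (f[l-1] + f[l] + ... + f[m-1]).
    f = sorted(frequencies, reverse=True)
    if len(f) < l:
        return False
    return f[0] < c * sum(f[l - 1:])

# Group g3 of Table 9.3c holds a single income value (50k twice), so it
# can never be 2-diverse:
print(satisfies_cl_diversity([2], c=2, l=2))      # False
# Group g2 of Table 9.2b holds incomes 50k and 80k once each:
print(satisfies_cl_diversity([1, 1], c=2, l=2))   # True, since 1 < 2 * 1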
Consider Table 9.9a, and suppose that the sensitive annual income values "≤ 50k" and "> 100k" should not be disclosed. There are two inference rules, {[20−30], Bachelor} → "≤ 50k" and {Doctorate, Lawyer} → "> 100k", with high confidence. Table 9.10b is a suppressed table in which the confidence of each inference rule is reduced to 50% or below. But the table remains useful for classification: given a tuple t′ with the same values on attributes Age, Education, and Job as some tuple t in the original table, t′ receives the same class label as t with high probability according to Table 9.10b. This is because the class label in Table 9.9a is highly correlated with attribute Job; as long as the values on Job are disclosed, the classification accuracy is preserved. To give more details about the method, we first introduce how to define the privacy requirement using privacy templates, then discuss the utility measure, and finally use an example to illustrate the algorithm.

Table 9.9a. The original table

tId  Age      Education  Job       Annual Income  Target Customer
1    [20-30]  Bachelor   Engineer  ≤ 50k          Y
2    [20-30]  Bachelor   Artist    ≤ 50k          N
3    [20-30]  Bachelor   Lawyer    ≤ 50k          Y
4    [20-30]  Bachelor   Artist    [50k-100k]     N
5    [20-30]  Master     Artist    [50k-100k]     N
6    [31-40]  Master     Engineer  [50k-100k]     Y
7    [20-30]  Doctorate  Lawyer    > 100k         N
8    [31-40]  Doctorate  Lawyer    > 100k         Y
9    [31-40]  Doctorate  Lawyer    [50k-100k]     Y
10   [20-30]  Doctorate  Engineer  [50k-100k]     N

Table 9.10b. The suppressed table

tId  Age      Education  Job       Annual Income  Target Customer
1    [20-30]  ⊥Edu       Engineer  ≤ 50k          Y
2    [20-30]  ⊥Edu       Artist    ≤ 50k          N
3    [20-30]  ⊥Edu       Lawyer    ≤ 50k          Y
4    [20-30]  ⊥Edu       Artist    [50k-100k]     N
5    [20-30]  Master     Artist    [50k-100k]     N
6    ⊥Age     Master     Engineer  [50k-100k]     Y
7    [20-30]  ⊥Edu       Lawyer    > 100k         N
8    ⊥Age     ⊥Edu       Lawyer    > 100k         Y
9    ⊥Age     ⊥Edu       Lawyer    [50k-100k]     Y
10   [20-30]  ⊥Edu       Engineer  [50k-100k]     N

Privacy Template. To make a table free from sensitive inferences, the confidence of each inference rule must be low. Templates can be used to specify such a requirement. Consider a table T = (M1, ..., Mm, Π1, ..., Πn, Θ), where each Mj (1 ≤ j ≤ m) is a non-sensitive attribute, each Πi (1 ≤ i ≤ n) is a sensitive attribute, and Θ is a class label attribute. A template is defined as ⟨IC → πi, h⟩, where πi is a value from sensitive attribute Πi, IC is a set of attributes not containing Πi, called the inference channel, and h is a confidence threshold. An inference is an instance of ⟨IC → πi, h⟩, of the form ic → πi, where ic contains values from the attributes in IC.

The confidence of an inference ic → πi, denoted conf(ic → πi), is the percentage of tuples containing both ic and πi among the tuples containing ic. That is,

conf(ic → πi) = |R_{ic,πi}| / |R_{ic}|

where R_v denotes the set of tuples containing value v. The confidence of a template is defined as the maximum confidence over all inferences of the template, that is,

Conf(IC → πi) = max_{ic} conf(ic → πi)

Table T satisfies template ⟨IC → πi, h⟩ if Conf(IC → πi) ≤ h. T satisfies a set of templates if T satisfies each template in the set.

Progressive Disclosure and Utility Measure. As discussed in Section 9.2, suppression is an efficient method for eliminating sensitive inferences. Consider table T = (M1, ..., Mm, Π1, ..., Πn, Θ), where Mj (1 ≤ j ≤ m) is a non-sensitive attribute, Πi (1 ≤ i ≤ n) is a sensitive attribute, and Θ is a class label. The suppression of a value on attribute Mj replaces all occurrences of that value by a special value ⊥j. For each template ⟨IC → πi, h⟩ not satisfied in T, some values in the inference channel IC should be suppressed so that Conf(IC → πi) is reduced to at most h.

Disclosure is the opposite operation of suppression. Given a suppressed table T, let Sup_j denote all the values suppressed on attribute Mj. A disclosure of a value v ∈ Sup_j replaces the special value ⊥j with v in all the tuples that currently contain ⊥j but originally contained v. A disclosure is valid if it does not lead to a template violation. Moreover, a disclosure on attribute Mj is beneficial, that is, it increases the information utility for classification, if more than one class is involved in the tuples containing ⊥j. The following utility score measures the benefit of a disclosure quantitatively. For each suppressed attribute value v, Score(v) is defined as

Score(v) = InfoGain(v) / (PrivLoss(v) + 1)

where InfoGain(v) is the information gain of disclosing v and PrivLoss(v) is the privacy loss of disclosing v, defined as follows.

InfoGain(v). Given a set of tuples S and the set of class labels cls involved in S, the entropy of S is

H(S) = − Σ_{c ∈ cls} ( freq(S, c) / |S| ) × log2 ( freq(S, c) / |S| )

where freq(S, c) is the number of tuples in S with class label c. Given a value v on attribute Mj, the set of tuples containing v is denoted R_v. If R_{⊥j} is the set of tuples whose value on Mj is suppressed before disclosing v, the information gain of disclosing v is

InfoGain(v) = H(R_{⊥j}) − ( (|R_v| / |R_{⊥j}|) · H(R_v) + (|R_{⊥j} − R_v| / |R_{⊥j}|) · H(R_{⊥j} − R_v) )

PrivLoss(v). Given a value v on attribute Mj, the privacy loss PrivLoss(v) is defined as the average confidence increase over the templates whose inference channel contains Mj:

PrivLoss(v) = avg_{Mj ∈ IC} { Conf′(IC → πi) − Conf(IC → πi) }

where Conf(IC → πi) and Conf′(IC → πi) are the confidences before and after disclosing v, respectively.

Example 9.8 (Utility score) Consider Table 9.9a, and suppose the privacy templates are

⟨{Age, Education} → "≤ 50k", 50%⟩
⟨{Education, Job} → "> 100k", 50%⟩

Suppose that at first all the values on attribute Job are suppressed to ⊥Job. The score of disclosing the value Engineer on Job is calculated as follows.

H(R_{⊥Job}) = −(5/10) log2 (5/10) − (5/10) log2 (5/10) = 1
H(R_Engineer) = −(2/3) log2 (2/3) − (1/3) log2 (1/3) ≈ 0.9183
H(R_{⊥Job} − R_Engineer) = −(3/7) log2 (3/7) − (4/7) log2 (4/7) ≈ 0.9852
InfoGain(Engineer) = H(R_{⊥Job}) − ( (3/10) · H(R_Engineer) + (7/10) · H(R_{⊥Job} − R_Engineer) ) ≈ 0.0349

Before disclosing Engineer:
Conf({Education, Job} → "> 100k") = conf({⊥Edu, ⊥Job} → "> 100k") = 0.2

After disclosing Engineer:
conf({⊥Edu, Engineer} → "> 100k") = 0
conf({⊥Edu, ⊥Job} → "> 100k") = 0.286
Conf′({Education, Job} → "> 100k") = max{0, 0.286} = 0.286

PrivLoss(Engineer) = 0.286 − 0.2 = 0.086
Score(Engineer) = 0.0349 / (0.086 + 1) ≈ 0.032
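The arithmetic of Example 9.8 can be reproduced directly from Table 9.9a. A minimal sketch, with our own list encoding of the class labels and the Job column:

from math import log2

def entropy(labels):
    # H(S) over the class labels of a set of tuples.
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

# Target Customer labels and Job values of Table 9.9a, in order t1..t10:
labels = ["Y", "N", "Y", "N", "N", "Y", "N", "Y", "Y", "N"]
jobs = ["Engineer", "Artist", "Lawyer", "Artist", "Artist",
        "Engineer", "Lawyer", "Lawyer", "Lawyer", "Engineer"]

# All Job values are suppressed at first, so R_bot is the whole table.
r_eng = [i for i in range(10) if jobs[i] == "Engineer"]
r_rest = [i for i in range(10) if jobs[i] != "Engineer"]

h_bot = entropy(labels)                          # 1.0
h_eng = entropy([labels[i] for i in r_eng])      # ~0.9183
h_rest = entropy([labels[i] for i in r_rest])    # ~0.9852
info_gain = h_bot - (0.3 * h_eng + 0.7 * h_rest)

priv_loss = 2 / 7 - 2 / 10   # confidence after minus before the disclosure
print(round(info_gain, 4), round(info_gain / (priv_loss + 1), 3))
# prints 0.0349 0.032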
The Algorithm. The Progressive Disclosure Algorithm (PDA) first suppresses all non-sensitive attribute values in a table and then iteratively discloses the attribute values that are helpful for classification without violating the privacy templates. In each iteration, the score of each suppressed value is calculated, and the one with the maximum score is disclosed. The iteration terminates when no valid and beneficial disclosure remains. The algorithm is illustrated by the following example.

Example 9.9 (The progressive disclosure algorithm) Consider the following templates on Table 9.9a:

(1) ⟨{Age, Education} → "≤ 50k", 50%⟩
(2) ⟨{Education, Job} → "> 100k", 50%⟩

At first, the values on attributes Age, Education, and Job are suppressed to ⊥Age, ⊥Edu, and ⊥Job, respectively. The candidate values for disclosure are [20−30], [31−40], Bachelor, Master, Doctorate, Engineer, Artist, and Lawyer. In order to find the most beneficial disclosure, the score of each value is calculated. Since Artist has the maximum score, it is disclosed in this iteration. In the next iteration, the scores of the remaining candidates are updated, and the one with the maximum score, [20−30], is disclosed. All the valid and beneficial disclosures are executed similarly in the remaining iterations. The finally published table is shown in Table 9.10b. Note that in the published table, Bachelor and Doctorate remain suppressed because disclosing them would violate the privacy templates, while [31−40] remains suppressed because disclosing it is not beneficial.
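In outline, PDA is a greedy loop over candidate disclosures. The skeleton below is a sketch of that control flow only; the validity test, the benefit test, the scoring function, and the disclosure operation are passed in as callables, since their details depend on the table and the templates at hand.

def progressive_disclosure(state, candidates, valid, beneficial, score, disclose):
    # Starting from a fully suppressed table (state), repeatedly perform
    # the valid and beneficial disclosure with the highest Score(v).
    candidates = set(candidates)
    while True:
        usable = [v for v in candidates
                  if valid(v, state) and beneficial(v, state)]
        if not usable:                 # no valid, beneficial disclosure left
            return state
        best = max(usable, key=lambda v: score(v, state))
        state = disclose(best, state)  # replace matching bottom-values by best
        candidates.discard(best)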
9.4.3 Summary and Discussion

The top-down specialization (TDS) method and the progressive disclosure algorithm (PDA) are based on the observation that the goals of privacy preservation and classification modeling are not always conflicting. Privacy preservation hides sensitive individual (specific) information, while classification modeling draws the general structure of the data. TDS and PDA try to achieve the "win-win" goal that the specific information hidden for privacy preservation is exactly the information that is misleading or not useful for classification. Consequently, the quality of the classification model built on a table produced by TDS or PDA may even be better than that built on the original table.

9.5 Anonymized Marginal: Injecting Utility into Anonymized Data Sets

One drawback of the anonymization method is that, after the generalization of the quasi-identifiers, the distribution of the more specific data is lost. For example, consider Table 9.11a and the corresponding 2-anonymous Table 9.12b. After the anonymization, all the values on attribute Age are generalized to the full range of the domain, without any specific distribution information. However, if we publish Table 9.13a in addition to Table 9.12b, more information about Age is released while 2-anonymity is still guaranteed. Table 9.13a is called a marginal on Age.

On the other hand, not all marginals preserve privacy. For example, Table 9.14b satisfies 2-anonymity by itself, but if an attacker knows that an individual living in zip code 53715 with a Doctorate degree is in the original table, he/she may link the information from Tables 9.12b and 9.14b together and conclude that the annual income of that individual is 80k.

Based on the above observation, [8] models the utility of anonymized tables as how well they preserve the distribution of the original table, and proposes to publish more than one anonymized table to better approximate the original distribution.

Table 9.11a. The original table

tId  Age  Education  Zip Code  Annual Income
1    27   Bachelor   53711     40k
2    28   Bachelor   53713     50k
3    27   Master     53715     50k
4    28   Doctorate  53716     80k
5    30   Master     53714     50k
6    30   Doctorate  53712     100k

Table 9.12b. The anonymized table

gId  tId  Age      Education   Zip Code       Annual Income
1    1    [27-30]  Bachelor    [53711-53713]  40k
1    2    [27-30]  Bachelor    [53711-53713]  50k
2    3    [27-30]  GradSchool  [53715-53716]  50k
2    4    [27-30]  GradSchool  [53715-53716]  80k
3    5    [27-30]  GradSchool  [53712-53714]  50k
3    6    [27-30]  GradSchool  [53712-53714]  100k

Table 9.13a. Age marginal

Age  Count
27   2
28   2
30   2

Table 9.14b. (Education, Annual Income) marginal

Education  Annual Income  Count
Bachelor   40k            1
Bachelor   50k            1
Master     50k            2
Doctorate  80k            1
Doctorate  100k           1

Now the problem becomes: which additional anonymized tables should be published, and how can privacy be checked when more than one anonymized table is released? First of all, we introduce the concept of an anonymized marginal and the utility measure used to evaluate the quality of a set of anonymized marginals.

9.5.1 Anonymized Marginal

Consider a table T = (A1, ..., An) and a set of attributes {Ai1, ..., Aim} (1 ≤ i1 < ... < im ≤ n) in T. A marginal table T_{Ai1,...,Aim} can be created by the following SQL statement, where the attribute Count records the number of tuples in T_{Ai1,...,Aim} sharing the same values on Ai1, ..., Aim:

CREATE TABLE T_{Ai1,...,Aim} AS
(SELECT Ai1, ..., Aim, COUNT(*) AS Count
 FROM T
 GROUP BY Ai1, ..., Aim)

The marginal table describes the distribution of the tuples of T in the domain D(Ai1) × ... × D(Aim), where D(Ai) is the domain of attribute Ai. A marginal is anonymized if some of its attribute values are generalized.

9.5.2 Utility Measure

Distribution is an intrinsic characteristic of a data set. Many data analysis tasks discover patterns from the data distribution; classification, for example, discovers the class distribution in a data set. Therefore, whether the distribution of a data set is preserved after anonymization is crucial for the utility of the data. In this spirit, a utility measure is defined as the difference between the distribution of the original data and that of the anonymized data.

Empirical distribution of the original table. Consider a table T = (A1, ..., Am). In the probabilistic view, the tuples in T can be considered an i.i.d. (independent and identically distributed) sample generated from an underlying distribution F. Conversely, F can be estimated using the empirical distribution F_T of T. Given any instance x = (x1, ..., xm) in the domain of T, the empirical probability p_T(x) is the proportion of tuples in T having the same attribute values as x, that is,

p_T(x) = |{t | t ∈ T, t.Ai = xi, 1 ≤ i ≤ m}| / |T|
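Computing the empirical distribution is plain normalized counting. A minimal sketch, shown on the Age column of Table 9.11a (the encoding is ours):

from collections import Counter

def empirical_distribution(table):
    # p_T(x): the fraction of tuples in T whose attribute values equal x.
    counts = Counter(map(tuple, table))
    n = len(table)
    return {x: c / n for x, c in counts.items()}

ages = [(27,), (28,), (27,), (28,), (30,), (30,)]
print(empirical_distribution(ages))
# each of 27, 28, 30 occurs with empirical probability 1/3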
Maximum entropy probability distribution of anonymized marginals. Similarly, the anonymized marginals of T can be viewed as a set of constraints on the underlying distribution. For example, the Age marginal in Table 9.13a indicates that 33.3% of the tuples in Table 9.11a have age 27, 28, and 30, respectively. Given a set of constraints, the maximum entropy probability distribution is the distribution that maximizes the entropy among all probability distributions satisfying the constraints. It is often used to estimate an underlying distribution subject to known constraints; the intuition is that, by maximizing the entropy, the prior knowledge assumed about the distribution is minimized.

Consider a table T = (A1, ..., Am) and a set of marginals M = {M1, ..., Mn}, where each marginal Mi = (Ai1, ..., Aik, Count) (1 ≤ i1 < ... < ik ≤ m) contains the projection of T on the attributes {Ai1, ..., Aik} together with the tuple counts. A distribution F satisfies Mi if, for any instance t in Mi, the probabilities in F satisfy

Σ_{x : Π_{Ai1,...,Aik}(x) = t} p(x) = t.Count / |T|

where x ranges over the instances of the domain of T and Π_{Ai1,...,Aik}(x) = t means that the projection of x on Ai1, ..., Aik equals t. In other words, the projection of distribution F on Ai1, ..., Aik must equal the empirical distribution of Mi. F satisfies a set of marginals M if F satisfies each marginal Mi in M. The maximum entropy probability distribution F_M is the distribution with the maximum entropy among all distributions satisfying M.

Kullback-Leibler divergence (KL-divergence). Suppose the empirical distribution of a table T is F1 and the maximum entropy probability distribution of the anonymized marginals M is F2. The Kullback-Leibler divergence (KL-divergence) [9] is used to measure the difference between the two distributions (note that KL-divergence is not a metric):

D_KL(F1 ‖ F2) = Σ_i p_i^(1) log ( p_i^(1) / p_i^(2) ) = H(F1, F2) − H(F1)

where p_i^(1) and p_i^(2) are the probabilities of instance i under distributions F1 and F2, respectively. H(F1) is the entropy of F1, which measures the effort needed to identify an instance drawn from distribution F1, and H(F1, F2) is the cross-entropy of F1 and F2, which measures the effort needed to identify an instance drawn from F1 using a description based on F2. A smaller KL-divergence indicates that the two distributions are more similar; KL-divergence is non-negative and is minimized exactly when F1 = F2. Given a table T, the entropy H(F1) is constant, so minimizing D_KL(F1, F2) is mathematically equivalent to minimizing H(F1, F2).

Therefore, the utility of a set of anonymized marginals M = {M1, ..., Mn} can be measured by the KL-divergence between F_M and F_T. A smaller KL-divergence value indicates better utility of M.
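Given the two distributions over a common domain, the KL-divergence is a one-line computation. A small sketch, using base-2 logarithms to match the entropy definitions used earlier in this chapter (the example distributions are invented):

from math import log2

def kl_divergence(p, q):
    # D_KL(F1 || F2); p and q map instances to probabilities.
    # Instances with p(x) = 0 contribute nothing to the sum.
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

uniform = {i: 1 / 4 for i in range(4)}
skewed = {0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1}
print(kl_divergence(skewed, uniform))   # positive; 0 iff the two agree
print(kl_divergence(uniform, uniform))  # 0.0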
9.5.3 Injecting Utility Using Anonymized Marginals

Based on the above utility measure, ideally we would search over all possible sets of anonymized marginals and find the one with the minimum KL-divergence. There are two challenges.

Calculating the KL-divergence is computationally challenging. First, generating all possible sets of marginals requires an exhaustive search. Second, finding the optimal k-anonymization for even a single marginal is already NP-hard [3]. Third, given a set of constraints, calculating the maximum entropy probability distribution requires iterative algorithms [4, 16], which may be time-consuming.

Since there is a closed-form method [10] to compute the maximum entropy probability distribution on decomposable tables, the anonymized marginal approach restricts the search to decomposable marginals only. The concept of decomposable marginals is derived from decomposable graphical models [10]. If a set of marginals is decomposable, then it encodes a set of conditional independence relations. Instead of giving the formal definition, we use the following examples to illustrate decomposable marginals and how to calculate the maximum entropy probability on them.

[Figure 9.3. Interactive graph on vertices A, B, C, D]
[Figure 9.4. A decomposition into components ABC and BCD]

Example 9.10 (Decomposable marginals) Consider a set of marginals M1 = (A, B, C, Count) and M2 = (B, C, D, Count). We create an interactive graph (Figure 9.3) for them by generating a vertex for each attribute; an edge between two vertices is created if the corresponding attributes appear in the same marginal. M1 and M2 are decomposable because they satisfy the following two conditions: (1) in the corresponding interactive graph, the clique BC separates A and D (the two components after the decomposition are shown in Figure 9.4); and (2) each maximal clique in the interactive graph is covered by a marginal.

An example of non-decomposable marginals is M1 = (A, B, C, Count), M2 = (B, D, Count), and M3 = (C, D, Count). They have the same interactive graph as shown in Figure 9.3, but the maximal clique BCD is not covered by any marginal. Therefore, they are not decomposable.

A set of decomposable marginals can be viewed as a set of conditional independence relations. For example, attributes A and D in marginals M1 = (A, B, C, Count) and M2 = (B, C, D, Count) are independent given attributes B and C. The calculation of the maximum entropy probability distribution for decomposable marginals is illustrated in the following example.

Example 9.11 (Maximum entropy probability) Consider marginals M1 = (A, B, Count) and M2 = (B, C, Count) of a table T = (A, B, C). M1 and M2 are decomposable and B separates A and C; therefore, attributes A and C are independent given B.

If M1 and M2 are ordinary marginals, the attribute values in the marginals are not generalized. For any instance x = (a, b, c) in the domain of T, the maximum entropy probability of x is

p(x) = p(a, b, c) = p(a, c | b) · p(b) = p(a | b) · p(c | b) · p(b) = p(a, b) · p(b, c) / p(b)

where p(a, b) is the proportion of tuples in M1 having values a and b on attributes A and B, respectively.

If M1 and M2 are anonymized marginals, some of the attribute values are generalized. For any instance x = (a, b, c) in the domain of T, let a′, b′, c′ be the corresponding generalized values in M1 and M2. The maximum entropy probability of x is

p(x) = p(a, b, c) = ( p(a′, b′) · p(b′, c′) / p(b′) ) · 1 / ( |R_{a′}| · |R_{b′}| · |R_{c′}| )

where p(a′, b′) is the fraction of tuples having values a′ and b′ on attributes A and B in M1, and R_{a′} is the set of tuples having value a′ on A in M1.
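For the ordinary-marginal case of Example 9.11, the maximum entropy estimate p(a, b, c) = p(a, b) · p(b, c) / p(b) can be computed directly from the two marginals. The sketch below builds the marginals from a toy table of our own and returns the estimator:

from collections import Counter

def max_entropy_prob(table_abc):
    # Marginals M1 = (A, B, Count) and M2 = (B, C, Count) of T = (A, B, C);
    # B separates A and C, so A and C are independent given B.
    n = len(table_abc)
    ab = Counter((a, b) for a, b, _ in table_abc)
    bc = Counter((b, c) for _, b, c in table_abc)
    b_only = Counter(b for _, b, _ in table_abc)

    def p(a, b, c):
        if b_only[b] == 0:
            return 0.0
        return (ab[(a, b)] / n) * (bc[(b, c)] / n) / (b_only[b] / n)
    return p

table = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
p = max_entropy_prob(table)
print(p(0, 0, 0))   # 0.25 = (0.5 * 0.25) / 0.5, matching the true joint 1/4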
Since finding all possible decomposable marginals still requires an exhaustive search, a search heuristic such as a genetic algorithm or a random walk is needed.

Guaranteeing privacy. The other challenge is: given a set of marginals {M1, ..., Mn}, how do we check whether the information obtained by combining {M1, ..., Mn} satisfies k-anonymity and l-diversity?

The theoretical results in [8] show that in order to check the k-anonymity of a set of decomposable marginals {M1, ..., Mn}, we only need to check whether each individual marginal Mi satisfies k-anonymity. Checking whether {M1, ..., Mn} satisfies l-diversity is more difficult: we have to join all the marginals together and test whether the joined table satisfies l-diversity. Several propositions help reduce the computation. First, if any single marginal violates l-diversity, then the whole set of marginals violates l-diversity. Second, only the marginals containing sensitive attributes need to be joined together to check l-diversity. Third, if a subset of the marginals does not satisfy l-diversity, then the whole set does not satisfy l-diversity.

9.5.4 Summary and Discussion

Anonymized marginals are very effective in improving the utility of anonymized data. However, searching all possible decomposable marginals for the optimal solution requires a great deal of computation. A simpler yet effective method is, given a table T, to first compute a traditional k-anonymous table T′, and then create a set of anonymized single-attribute marginals M from T. Experimental results [8] show that publishing T′ together with M still dramatically decreases the KL-divergence.

9.6 Summary

Utility-based privacy preserving methods are attracting more and more attention. The concept of utility, however, is not new in privacy preservation problems: utility is often used as one of the criteria for privacy preserving methods [21], measuring the information loss incurred by applying a privacy preservation technique to a data set.

What, then, makes utility-based privacy preservation methods special? Traditional privacy preserving methods often make no explicit assumptions about the applications in which the data will be used; consequently, the utility measure is often very general and thus not very effective. For example, in the sensitive inference privacy model, utility has traditionally been considered maximal when the number of suppressed entries is minimized, which is true only for certain applications. In comparison, utility-based privacy preservation methods target a class of applications that share the same notion of data utility. The resulting methods are therefore effective in reducing the information loss for the intended applications while preserving privacy as well.

In addition to the four methods discussed in this chapter, there are many applications that exploit special functions of the data, and extending utility-based privacy preserving methods to such applications is highly interesting. For example, in a data set against which ranking queries are usually issued, the utility of the data should be measured by how well the dominance relationships among tuples are preserved; none of the existing models can handle this problem. Moreover, utility-based privacy preserving methods can also be extended to other types of data, such as stream data, where temporal characteristics are considered more important in analysis.

Acknowledgements

This work is supported in part by NSERC Grants 312194-05 and 614067, and an IBM Faculty Award. All opinions, findings, conclusions, and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

References

[1] Charu C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 901–909, August 2005.
[2] Charu C. Aggarwal, Jian Pei, and Bo Zhang. On privacy preservation against adversarial data mining. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 510–516. ACM Press, 2006.
[3] Roberto J. Bayardo and Rakesh Agrawal. Data privacy through optimal k-anonymization. In Proceedings of the 21st International Conference on Data Engineering (ICDE'05), pages 217–228. IEEE Computer Society, 2005.
[4] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
[5] Benjamin C. M. Fung, Ke Wang, and Philip S. Yu. Top-down specialization for information and privacy preservation. In Proceedings of the 21st International Conference on Data Engineering (ICDE'05), pages 205–216. IEEE Computer Society, 2005.
[6] Benjamin C. M. Fung, Ke Wang, and Philip S. Yu. Anonymizing classification data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering, 19(5):711–725, May 2007.
[7] Vijay S. Iyengar. Transforming data to satisfy privacy constraints. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 279–288. ACM Press, 2002.
[8] Daniel Kifer and Johannes Gehrke. Injecting utility into anonymized datasets. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 217–228. ACM Press, 2006.
[9] S. Kullback and R. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–87, 1951.
[10] Steffen L. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
[11] M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi. Blocking anonymity threats raised by frequent itemset mining. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), November 2005.
[12] M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi. k-anonymous patterns. In Proceedings of the Ninth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'05), volume 3721 of Lecture Notes in Computer Science, Springer, Porto, Portugal, October 2005.
[13] Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakrishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), page 24, 2006.
[14] Adam Meyerson and Ryan Williams. On the complexity of optimal k-anonymity. In Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 223–228, June 2004.
[15] Stanley R. M. Oliveira and Osmar R. Zaïane. Privacy preserving frequent itemset mining. In CRPITS'14: Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pages 43–54, Darlinghurst, Australia, 2002. Australian Computer Society, Inc.
[16] Adwait Ratnaparkhi. A maximum entropy part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133–142, University of Pennsylvania, May 1996. ACL.
[17] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010–1027, November 2001.
[18] Pierangela Samarati and Latanya Sweeney. Generalizing data to provide anonymity when disclosing information. Technical report, March 1998.
[19] Latanya Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):571–588, 2002.
[20] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557–570, 2002.
[21] Vassilios S. Verykios, Elisa Bertino, Igor Nai Fovino, Loredana Parasiliti Provenza, Yucel Saygin, and Yannis Theodoridis. State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, 33(1):50–57, 2004.
[22] Vassilios S. Verykios, Ahmed K. Elmagarmid, Elisa Bertino, Yucel Saygin, and Elena Dasseni. Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4):434–447, 2004.
[23] Ke Wang, Benjamin C. M. Fung, and Philip S. Yu. Template-based privacy preservation in classification problems. In Proceedings of the Fifth IEEE International Conference on Data Mining, pages 466–473. IEEE Computer Society, 2005.
[24] Ke Wang, Philip S. Yu, and Sourav Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), pages 249–256. IEEE Computer Society, 2004.
[25] Xiaokui Xiao and Yufei Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In ACM Conference on Management of Data (SIGMOD), 2007.
[26] Xiaokui Xiao and Yufei Tao. Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 139–150. VLDB Endowment, 2006.
[27] Jian Xu, Wei Wang, Jian Pei, Xiaoyuan Wang, Baile Shi, and Ada Wai-Chee Fu. Utility-based anonymization for privacy preservation with less information loss. ACM SIGKDD Explorations Newsletter, 8(2):21–30, December 2006.
[28] Jian Xu, Wei Wang, Jian Pei, Xiaoyuan Wang, Baile Shi, and Ada Wai-Chee Fu. Utility-based anonymization using local recoding. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–790. ACM Press, 2006.
[29] Sheng Zhong, Zhiqiang Yang, and Rebecca N. Wright. Privacy-enhancing k-anonymization of customer data. In Proceedings of the Twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'05), pages 139–147, New York, NY, USA, 2005. ACM Press.

Chapter 10

Mining Association Rules under Privacy Constraints

Jayant R. Haritsa
Database Systems Lab
Indian Institute of Science, Bangalore 560012, INDIA
haritsa@dsl.serc.iisc.ernet.in

Abstract: Data mining services require accurate input data for their results to be meaningful, but privacy concerns may impel users to provide spurious information. In this chapter, we study whether users can be encouraged to provide correct information by ensuring that the mining process cannot, with any reasonable degree of certainty, violate their privacy. Our analysis is in the context of extracting association rules from large historical databases, a popular mining process that identifies interesting correlations between database attributes. We analyze the various schemes that have been proposed for this purpose with regard to a variety of parameters, including the degree of trust, the privacy metric, the model accuracy, and the mining efficiency.
Keywords: Privacy, data mining, association rules.

10.1 Introduction

The knowledge models produced through data mining techniques are only as good as the accuracy of their input data. One source of data inaccuracy is users deliberately providing wrong information. This is especially common with regard to customers who are asked to provide personal information on Web forms to e-commerce service providers. The compulsion for doing so may be the (perhaps well-founded) worry that the requested information may be misused by the service provider to harass the customer. As a case in point, consider a pharmaceutical company that asks clients to disclose the diseases they have suffered from in order to investigate the correlations in their occurrences (for example, "Adult females with malarial infections are also prone to contract tuberculosis"). While the company may be acquiring the data solely for genuine data mining purposes that would eventually reflect themselves in better service to the client, the client might at the same time worry that if her medical records are either inadvertently or deliberately disclosed, her future employment opportunities may be adversely affected.

In this chapter, we study whether customers can be encouraged to provide correct information by ensuring that the mining process cannot, with any reasonable degree of certainty, violate their privacy, while at the same time producing sufficiently accurate mining results. The difficulty in achieving these goals is that privacy and accuracy are typically contradictory in nature, with the consequence that improving one usually incurs a cost in the other [3]. A related issue is the degree of trust that needs to be placed by the users in third-party intermediaries. And finally, from a practical viability perspective, there are the time and resource overheads imposed on the data mining process by the support for the privacy requirements.

Our study is carried out in the context of extracting association rules from large historical databases [7], an extremely popular mining process that identifies interesting correlations between database attributes, such as the one described in the pharmaceutical example. By the end of the chapter, we will attempt to show that the state of the art is such that it is indeed possible to simultaneously achieve all the desirable objectives (i.e., privacy, accuracy, and efficiency) for association rule mining.

In the above discussion, and for the most part in this chapter, the focus is on maintaining the confidentiality of the input user data. However, it is also conceivable to consider the complementary aspect of maintaining output secrecy, that is, the privacy of sensitive association rules that are an outcome of the mining process; a summary discussion of these techniques is included in our coverage of the literature.

10.2 Problem Framework

In this section, we describe the framework of the privacy mining problem in the context of association rules.

10.2.1 Database Model

We assume that the original (true) database U consists of N records, with each record having M categorical attributes.
Note that boolean data is a special case of this class, and further, that continuous-valued attributes can be converted into categorical attributes by partitioning the domain of the attribute into fixed-length intervals. The domain of attribute j is denoted by S_U^j, so the domain S_U of a record in U is given by the cross product S_U = S_U^1 × ... × S_U^M. We map the domain S_U to the index set I_U = {1, ..., |S_U|}, thereby modeling the database as a set of N values from I_U. If we denote the ith record of U as U_i, then U = {U_i}, i = 1, ..., N, with U_i ∈ I_U.

To make this concrete, consider a database U with three categorical attributes Age, Sex, and Education having the following category values:

Age        Child, Adult, Senior
Sex        Male, Female
Education  Elementary, Graduate

For this schema, M = 3, S_U^1 = {Child, Adult, Senior}, S_U^2 = {Male, Female}, S_U^3 = {Elementary, Graduate}, S_U = S_U^1 × S_U^2 × S_U^3, and |S_U| = 12. The domain S_U is indexed by the index set I_U = {1, ..., 12}, and hence the set of records

U                           maps to
Child   Male    Elementary  1
Child   Male    Graduate    2
Child   Female  Graduate    4
Senior  Male    Elementary  9
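This indexing is just an enumeration of the cross product of the attribute domains. The following sketch is an assumed encoding in which the last attribute varies fastest; it reproduces the indices of the example above:

from itertools import product

domains = [("Child", "Adult", "Senior"),
           ("Male", "Female"),
           ("Elementary", "Graduate")]

# Enumerate S_U = S_U^1 x S_U^2 x S_U^3 and assign each record its
# position in I_U = {1, ..., |S_U|}.
index_of = {rec: i + 1 for i, rec in enumerate(product(*domains))}

print(index_of[("Child", "Male", "Elementary")])   # 1
print(index_of[("Child", "Male", "Graduate")])     # 2
print(index_of[("Child", "Female", "Graduate")])   # 4
print(index_of[("Senior", "Male", "Elementary")])  # 9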
10.2.2 Mining Objective

The goal of the data miner is to compute association rules on the above database. Denoting the set of attributes in the U database by C, an association rule is a (statistical) implication of the form Cx ⇒ Cy, where Cx, Cy ⊂ C and Cx ∩ Cy = ∅. A rule Cx ⇒ Cy is said to have a support (or frequency) factor s iff at least s% of the transactions in U satisfy Cx ∪ Cy. A rule Cx ⇒ Cy is satisfied in U with a confidence factor c iff at least c% of the transactions in U that satisfy Cx also satisfy Cy. Both support and confidence are fractions in the interval [0, 1]. The support is a measure of statistical significance, whereas the confidence is a measure of the strength of the rule.

A rule is said to be "interesting" if its support and confidence are greater than user-defined thresholds sup_min and con_min, respectively, and the objective of the mining process is to find all such interesting rules. It has been shown in [7] that achieving this goal is effectively equivalent to generating all subsets of C that have support greater than sup_min; these subsets are called frequent itemsets. Therefore, the mining objective is, in essence, to efficiently discover all frequent itemsets that are present in the database.
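Support and confidence are simple frequency computations over the transaction database. A minimal sketch on a toy database of our own, loosely echoing the pharmaceutical example:

def support(db, itemset):
    # Fraction of transactions containing every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(db, cx, cy):
    # conf(Cx => Cy) = support(Cx union Cy) / support(Cx).
    return support(db, set(cx) | set(cy)) / support(db, cx)

db = [{"malaria", "tb", "female"}, {"malaria", "female"},
      {"malaria", "tb", "female"}, {"flu"}]
print(support(db, {"malaria", "tb"}))        # 0.5
print(confidence(db, {"malaria"}, {"tb"}))   # 2/3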
10.2.3 Privacy Mechanisms

We now move on to considering the various mechanisms through which privacy of the user data could be provided. One approach is for the service providers to assure the users that the databases obtained from their information would be anonymized, before being supplied to the data miners, through the variety of techniques proposed in the statistical database literature [1, 38], for example, by swapping values between different customer records, as proposed in [17]. Depending on the service provider to guarantee privacy can be referred to as a "B2B (business-to-business)" privacy environment. However, in today's world, most users are (perhaps justifiably) cynical about such assurances, and it is therefore imperative to demonstrably provide privacy at the point of data collection itself, that is, at the user site. This is referred to as the "B2C (business-to-customer)" privacy environment [47]. Note that in this environment, any technique that requires knowledge of other user records becomes infeasible, and therefore the B2B approaches cannot be applied.

The bulk of the work in privacy-preserving data mining of association rules has addressed the B2C environment (e.g., [2, 9, 19, 34]), where the user's true data has to be anonymized at the source itself. Note that the anonymization process has to be implemented by a program which could be supplied either by the service provider or, more likely, by an independent trusted third-party vendor. Further, this program has to be verifiably secure, and must therefore be simple in construction, eliminating the possibility of the true data being surreptitiously supplied to the service provider. In a nutshell, the goal of these techniques is to ensure the privacy of the raw local data at the source while, at the same time, supporting accurate reconstruction of the global data mining models at the destination.

Within the above framework, the general approach has been to adopt a data perturbation strategy, wherein each individual user's true data is altered in some manner before being forwarded to the service provider. Here, there are two possibilities: statistical distortion, which has been the predominant technique, and algebraic distortion, proposed in [47]. In the statistical approach, a common randomizing algorithm is employed at all user sites, and this algorithm is disclosed to the eventual data miner. For example, in the MASK technique [34], targeted towards "market-basket" type sparse boolean databases, each bit in the true user transaction vector is independently flipped with a parametrized probability.

While there is only one-way communication from the users to the service provider in the statistical approach, the algebraic scheme, in marked contrast, requires two-way communication between the data miner and the user. Here, the data miner supplies a user-specific perturbation vector, and the user then returns the perturbed data after applying this vector to the true data, discretizing the output, and adding some noise. The vector depends on the current contents of the perturbed database available with the miner, and, for large enterprises, the data collection process itself could become a bottleneck in the efficient running of the system.

Within the statistical approach, there are two further possibilities: (a) simple independent attribute perturbation, wherein the value of each attribute in the user record is perturbed independently of the rest; or (b) a more generalized dependent attribute perturbation, where the perturbation of each attribute may be affected by the perturbations of the other attributes in the record. Most of the statistical perturbation techniques in the literature, including [18, 19, 34], fall into the independent attribute perturbation category. Notice, however, that this is in a sense antithetical to the original goal of association rule mining, which is to identify correlations across attributes. This limitation is addressed in [10], which employs a dependent attribute perturbation model, with each attribute in the user's data vector being perturbed based on its own value as well as the perturbed values of the earlier attributes.

Another model of privacy-preserving data mining is the k-anonymity model [35, 2], where each record value is replaced with a corresponding generalized value; specifically, each perturbed record cannot be distinguished from at least k other records in the data. However, this falls into the B2B model, since the intermediate database-forming server can learn or recover precise records.

10.2.4 Privacy Metric

Independent of the specific scheme used to achieve privacy, the end result is that the miner receives as input the perturbed database V and the perturbation technique T used to produce this database. From these inputs, the miner attempts to reconstruct the original distribution of the true database U, and to mine this reconstructed database to obtain the association rules. Given this framework, the general notion of privacy in the association rule mining literature is the level of certainty with which the data miner can reconstruct the true data values of the users. The certainty can be evaluated at various levels:

Average Privacy. This metric measures the reconstruction probability of a random value in the database.

Worst-case Privacy. This metric measures the maximum reconstruction probability across all the values in the database.

Re-interrogated Privacy. A common system environment is one where the miner does not have access to the perturbed database after the completion of the mining process. But it is also possible to have situations wherein the miner can use the mining output (i.e., the association rules) to subsequently re-interrogate the perturbed database, possibly resulting in reduced privacy.

Amplification Privacy. A particularly strong notion of privacy, called "amplification", was presented in [18]; it guarantees strict limits on privacy breaches of individual user information, independent of the distribution of the true data. Here, a property of a data record Ui is denoted by Q(Ui). For example, consider the following record from the example dataset U discussed earlier:

Age    Sex   Education
Child  Male  Elementary

Sample properties of this record include Q1(Ui) ≡ "Age = Child and Sex = Male" and Q2(Ui) ≡ "Age = Child or Adult".

In this context, the prior probability of a property of a customer's private information is the likelihood of the property in the absence of any knowledge about the customer's private information. The posterior probability, on the other hand, is the likelihood of the property given the perturbed information from the customer and the knowledge of the prior probabilities through reconstruction from the perturbed database. In order to preserve the privacy of some property of a customer's private information, the posterior probability of that property should not be unduly different from its prior probability. This notion of privacy is quantified in [18] through the following results, where ρ1 and ρ2 denote the prior and posterior probabilities, respectively:

Privacy Breach: An upward ρ1-to-ρ2 privacy breach exists with respect to property Q if ∃v ∈ S_V such that P[Q(Ui)] ≤ ρ1 and P[Q(Ui) | R(Ui) = v] ≥ ρ2. Conversely, a downward ρ2-to-ρ1 privacy breach exists with respect to property Q if ∃v ∈ S_V such that P[Q(Ui)] ≥ ρ2 and P[Q(Ui) | R(Ui) = v] ≤ ρ1.
Amplification: Let the perturbed database be V = {V1, ..., VN}, with domain S_V and corresponding index set I_V. For example, given the sample database U discussed above, and assuming that each attribute is distorted to produce a value within its original domain, the distortion may result in

V   maps to
5   Adult   Male    Elementary
7   Adult   Female  Elementary
2   Child   Male    Graduate
12  Senior  Female  Graduate

Let the probability of an original customer record Ui = u, u ∈ I_U being perturbed to a record Vi = v, v ∈ I_V be p(u → v), and let A denote the matrix of these transition probabilities, with A_vu = p(u → v).

With the above notation, a randomization operator R(u) is at most γ-amplifying for v ∈ S_V if

p[u1 → v] / p[u2 → v] ≤ γ   ∀u1, u2 ∈ S_U

where γ ≥ 1 and ∃u : p[u → v] > 0. Operator R(u) is at most γ-amplifying if it is at most γ-amplifying for all qualifying v ∈ S_V.

Breach Prevention: Let R be a randomization operator, v ∈ S_V be a randomized value such that ∃u : p[u → v] > 0, and ρ1, ρ2 (0 < ρ1 < ρ2 < 1) be two probabilities as per the above privacy breach definition. Then, if R is at most γ-amplifying for v, revealing "R(u) = v" will cause neither an upward (ρ1-to-ρ2) nor a downward (ρ2-to-ρ1) privacy breach with respect to any property if the following condition is satisfied:

ρ2 (1 − ρ1) / ( ρ1 (1 − ρ2) ) > γ

If this condition holds, R is said to support (ρ1, ρ2) privacy guarantees.
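The amplification condition and the breach-prevention bound can be checked mechanically for any concrete transition matrix. The sketch below, with a toy matrix of our own, assumes for simplicity that every transition probability is positive, as required for a finite γ:

def max_amplification(col_probs):
    # gamma for one randomized value v: the largest ratio
    # p[u1 -> v] / p[u2 -> v] over all true values u1, u2.
    return max(col_probs) / min(col_probs)

def supports_privacy(A, rho1, rho2):
    # R supports (rho1, rho2) guarantees if every randomized value v is at
    # most gamma-amplifying with gamma < rho2*(1 - rho1) / (rho1*(1 - rho2)).
    bound = rho2 * (1 - rho1) / (rho1 * (1 - rho2))
    return all(max_amplification(row) < bound for row in A)

A = [[0.7, 0.3],   # row v = 1: p(u -> 1) for u = 1, 2
     [0.3, 0.7]]   # row v = 2: p(u -> 2) for u = 1, 2
print(supports_privacy(A, rho1=0.1, rho2=0.5))   # gamma = 7/3 < 9, so True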
10.2.5 Accuracy Metric

For association rule mining on a perturbed database, two kinds of errors can occur. Firstly, there may be support errors, where a correctly identified frequent itemset is associated with an incorrect support value. Secondly, there may be identity errors, wherein either a genuine frequent itemset is mistakenly classified as rare, or conversely, a rare itemset is claimed to be frequent.

The Support Error (µ) metric reflects the average relative error (in percent) of the reconstructed support values for those itemsets that are correctly identified to be frequent. Denoting the set of frequent itemsets by F, the reconstructed support of an itemset f by sûp_f, and its actual support by sup_f, the support error is computed over all frequent itemsets as

µ = (1 / |F|) Σ_{f ∈ F} ( |sûp_f − sup_f| / sup_f ) × 100

The Identity Error (σ) metric, on the other hand, reflects the percentage error in identifying frequent itemsets, and has two components: σ+, indicating the percentage of false positives, and σ−, indicating the percentage of false negatives. Denoting the reconstructed set of frequent itemsets by R and the correct set of frequent itemsets by F, these metrics are computed as

σ+ = ( |R − F| / |F| ) × 100        σ− = ( |F − R| / |F| ) × 100

Note that in some papers (e.g., [47]), the accuracy metrics are taken to be the worst-case, rather than average-case, versions of the above errors.
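Both metrics are easy to evaluate given the actual and reconstructed frequent itemsets along with their supports. A small sketch with invented numbers:

def support_error(actual, recon):
    # mu: average relative error (in percent) of the reconstructed supports
    # over the correctly identified frequent itemsets.
    common = actual.keys() & recon.keys()
    return 100 * sum(abs(recon[f] - actual[f]) / actual[f]
                     for f in common) / len(common)

def identity_errors(actual, recon):
    # (sigma+, sigma-): percentages of false positives and false negatives.
    F, R = set(actual), set(recon)
    return 100 * len(R - F) / len(F), 100 * len(F - R) / len(F)

actual = {("a",): 0.40, ("b",): 0.30, ("a", "b"): 0.20}
recon = {("a",): 0.38, ("b",): 0.33, ("a", "c"): 0.21}
print(support_error(actual, recon))     # mean of 5% and 10%, i.e. 7.5
print(identity_errors(actual, recon))   # one false positive, one false negative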
10.3 Evolution of the Literature

From the database perspective, the field of privacy-preserving data mining was catalyzed by the pioneering work of [9], which proposed and analyzed the development of privacy-preserving data classifiers based on adding noise to the record values. This approach was extended in [3] and [26] to address a variety of subtle privacy loopholes.

Concurrently, the research community also began to look into extending privacy-preserving techniques to alternative mining patterns such as association rules, clustering, etc. For association rules, two streams of literature emerged, as mentioned earlier: one looking at providing input data privacy, and the other considering the protection of sensitive output rules.

An important point to note here is that, unlike the privacy-preserving classifier approaches that were based on adding a noise component to continuous-valued data, the privacy-preserving techniques in association rule mining are based on probabilistic mappings from the domain space to the range space, over categorical attributes.

With regard to input data privacy, the early papers include [34, 19], which proposed the MASK algorithm and the Cut-and-Paste operators, respectively.

MASK. In MASK [34], a simple probabilistic distortion of user data, employing random numbers generated from a pre-defined distribution function, was proposed and evaluated in the context of sparse boolean databases, such as those found in "market-baskets". The distortion technique was simply to flip each 0 or 1 bit with a parametrized probability p, or to retain it as is with the complementary probability 1 − p; the privacy metric used was average privacy. Through a theoretical and empirical analysis, it was shown that the parameter p could be carefully tuned to simultaneously achieve acceptable average privacy and good accuracy.

However, it was also found that mining the distorted database could be orders of magnitude more time-consuming than mining the original database. This issue was addressed in a followup work [12], which showed that by generalizing the distortion process to perform symbol-specific distortion (i.e., different flipping probabilities for different values), appropriately choosing these distortion parameters, and applying a variety of set-theoretic optimizations in the reconstruction process, runtime efficiencies well within an order of magnitude of undistorted mining can be achieved.
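The MASK distortion itself is a one-line operation per transaction. A sketch of the flipping step on a toy bit vector (the flipping probability is arbitrary):

import random

def mask_distort(bits, p):
    # Flip each bit of the boolean transaction vector independently with
    # probability p; retain it with probability 1 - p.
    return [b ^ (random.random() < p) for b in bits]

random.seed(0)
print(mask_distort([1, 0, 0, 1, 0, 0, 0, 1], p=0.1))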
Cut-and-Paste Operator. The notion of a privacy breach was introduced in [19] as follows: the presence of an itemset I in the randomized transaction causes a privacy breach of level ρ if it is possible to infer, for some transaction in the true database, that the probability of some item i occurring in it exceeds ρ. With regard to this worst-case privacy metric, a set of randomizing privacy operators was presented and analyzed in [19]. The starting point was Uniform Randomization, where each existing item in the true transaction is, with probability p, replaced with a new item not present in the original transaction. (Note that this means the number of items in the randomized transaction is always equal to the number in the original transaction, which differs from MASK, where the number of items in the randomized transaction is usually significantly larger than in its source, since the flipping is done on both the 1's and the 0's of the transaction bit vector.) It was then pointed out that a basic deficiency of the uniform randomization approach is that while it might, with a suitable choice of p, be capable of providing acceptable average privacy, its worst-case privacy could be significantly weaker.

To address this issue, an alternative select-a-size (SaS) randomization operator was proposed, which is composed of the following steps, employed on a per-transaction basis:

Step 1: For a customer transaction ti of length m, a random integer j from [1, m] is first chosen with probability pm[j].
Step 2: Then, j items are uniformly and randomly selected from the true transaction and inserted into the randomized transaction.
Step 3: Finally, a uniformly and randomly chosen fraction ρm of the items in the database that are not present in the true transaction (i.e., the items of C not in ti) is inserted into the randomized transaction.

In short, the final randomized transaction is composed of a subset of true items from the original transaction and additional false items from the complementary set of items in the database.

A variant of the SaS operator studied in detail in [19] is the cut-and-paste (C&P) operator. Here, an additional parameter is a cutoff integer Km, with the integer j being chosen from [1, Km] rather than from [1, m]. If it turns out that j > m, then j is set to m (which means that the entire original transaction is copied to the randomized transaction). Apart from the cutoff threshold, another difference between C&P and SaS is that the subsequent ρm randomized insertion (Step 3 above) is carried out on (a) the items that are not present in the true transaction (as in SaS), and (b) additionally, the remaining items of the true transaction that were not selected for inclusion in Step 2.

An issue in the C&P operator is the optimal selection of the ρm and Km parameters, and combinatorial formulae for determining their values are given in [19]. Through a detailed set of experiments on real-life datasets, it was shown that even with a challenging privacy requirement of not permitting any breaches with ρ > 50%, mining a C&P-randomized database was able to correctly identify around 80 to 90% of the "short" frequent itemsets, that is, frequent itemsets of lengths up to 3. The issue of how to safely randomize and mine long transactions was left as an open problem, since directly using C&P in such environments could result in unacceptably poor accuracy.

The above work was significantly extended in [18] through, as discussed in Section 10.2.4, the formulation of strict amplification-based privacy metrics and the delineation of a methodology for limiting the associated privacy breaches.

Distributed Databases. Maintaining input data privacy was also considered in [41, 25] in the context of databases that are distributed across a number of sites, with each site willing to share only data mining results, but not the source data. While [41] considered data that is vertically partitioned (i.e., each site hosts a disjoint subset of the matrix columns), the complementary situation where the data is horizontally partitioned (i.e., each site hosts a disjoint subset of the matrix rows) is addressed in [25]. The solution technique in [41] requires generating and computing a large set of independent linear equations; in fact, the number of equations and the number of terms in each equation are proportional to the cardinality of the database. It may therefore prove to be expensive for market-basket databases, which typically contain millions of customer transactions. In [25], on the other hand, the problem is modeled as a secure multi-party computation [23], and an algorithm that minimizes the information shared without incurring much overhead on the mining process is presented. Note that these formulations assume a pre-existing true database at each site, i.e., a B2B model.

Algebraic Distortion. Then, in [47], an algebraic-distortion mechanism was presented that, unlike the statistical approach of the prior literature, requires two-way communication between the miner and the users.
Output Rule Privacy. We now turn our attention to the issue of maintaining the privacy of output rules. That is, we would like to alter the original database in a manner such that the association rules deemed to be sensitive by the owner of the data source cannot be identified through the mining process. The proposed solutions involve either falsifying some of the entries in the true database or replacing them with null values. Note that, by definition, these techniques require a completely materialized true database as the starting point, in contrast to the B2C techniques for input data privacy.

In [13], the process of transforming the database to hide sensitive rules is termed "sanitization", and in practical terms, this requires reducing either the support or the confidence of the sensitive rules to below the supmin or conmin thresholds. Specifically, using R to refer to the set of all rules and S to refer to the set of sensitive rules, the goal is to hide all the S rules by reducing their supports or confidences, while simultaneously minimizing the number of rules in R − S that may also become hidden as a side-effect of the sanitization process. (Note that the objective is only to maintain the visibility of the rules in R − S; the specific supports or confidences obtained by the miner for the R − S rules may be altered if required. That is, it would be perfectly acceptable for the database to be sanitized such that a rule with high support or confidence in R − S became a rule that was just above the threshold in the sanitized database.)

The sanitization can be achieved in different ways: 1) by changing the values of individual entries in the database; or 2) by removing entire transactions from the database. It was shown in the initial work of [13], which only considered the lowering of support values, that, irrespective of the sanitization approach, finding the optimal (w.r.t. minimizing the impact on R − S) sanitization is an NP-Hard problem (through reduction from the Hitting Set problem [21]). A greedy heuristic technique was therefore suggested, in which the S set is ordered in decreasing order of support, and each element of the ordered set is then hidden in an iterative fashion. The hiding is done by performing a greedy search through the ancestors of the itemset, selecting at each level the parent with the maximum support and setting the selected parent as the new itemset that needs to be hidden. At the end of the process, a frequent item has been selected. The algorithm searches through the common list of transactions that support both the selected item and the initial frequent itemset to be hidden in order to identify the transaction that affects the minimum number of 2-itemsets. After this transaction is identified, the selected frequent item is removed from the identified transaction. The effects of this database alteration are propagated to the other itemset elements, and the process repeats until the itemset is hidden.

The above work was extended in [15] to achieve hiding by also using the confidence criterion. Unlike the purely support-based hiding approach, where only 1's are converted to 0's, hiding through the confidence criterion can be achieved by converting 0's into 1's. However, an associated danger is that there can now be false positives, that is, infrequent rules may be incorrectly promoted into the frequent category. A detailed treatment of this issue is presented in [44].

An alternative approach for output rule privacy, proposed in [37, 36], is to use the concept of "data blocking", wherein some values in the database are replaced with NULLs signifying unknowns. In this framework, the notions of itemset support and confidence are converted into intervals, with the actual support and confidence lying within these intervals. For example, the minimum support of itemset Cx is the percentage of transactions that have 1's for this itemset, while the maximum possible support is the percentage of transactions that contain either 1 or NULL for this itemset (a small sketch of this interval computation follows below). Greedy algorithms for implementing the hiding are presented, and a discussion of their effectiveness is provided in [36]. More recently, decision-theoretic approaches based on data blocking have been presented in [30, 22], which also utilize the "border theory" of frequent itemsets [40] – however, these approaches can be computationally demanding.
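Here is a minimal sketch of the support interval induced by data blocking, exactly as defined above. The encoding of a transaction as a dict mapping items to 1, 0, or None (for NULL) is an assumption made for illustration.

```python
def support_interval(db, itemset):
    """db: list of dicts mapping item -> 1, 0, or None (NULL/unknown).
    Returns (min_support, max_support) as fractions of transactions:
    the minimum counts only transactions with all 1's for the itemset,
    the maximum also counts transactions where missing entries are NULL."""
    n = len(db)
    definite = sum(all(t.get(i) == 1 for i in itemset) for t in db)
    possible = sum(all(t.get(i) in (1, None) for i in itemset) for t in db)
    return definite / n, possible / n

db = [{'a': 1, 'b': 1}, {'a': 1, 'b': None}, {'a': 0, 'b': 1}, {'a': None, 'b': None}]
print(support_interval(db, ['a', 'b']))   # (0.25, 0.75)
```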
The rule-hiding techniques have limitations in that (a) they crucially depend on the data miner processing the database only with the specified support and confidence levels – this may be hard to ensure in practice; (b) they may introduce significant false positives and false negatives in the non-sensitive set of rules; (c) they may introduce significant changes in the supports and confidences of the non-sensitive set of rules; and (d) in the case of data blocking, it may sometimes be possible to infer the hidden rules by assigning values to the null attributes.

Frameworks. A common trend in the input data privacy literature was to propose specific perturbation techniques, which are then analyzed for their privacy and accuracy properties. Recently, in [10], the problem was approached from a different perspective, wherein a generalized matrix-theoretic framework (FRAPP) that facilitates a systematic approach to the design of random perturbation schemes for privacy-preserving mining was proposed. This framework supports amplification-based privacy, and its execution and memory overheads are comparable to those of classical mining on the true database. The distinguishing feature of FRAPP is its quantitative characterization of the sources of error in the random data perturbation and model reconstruction processes. In fact, although it uses dependent attribute perturbation, it is fully decomposable into the perturbation of individual attributes, and hence has the same run-time complexity as any independent perturbation method. Through the framework, many of the earlier techniques can be cast as special instances of the FRAPP perturbation matrix. More importantly, it was shown that through appropriate choices of matrix elements, new perturbation techniques can be constructed that provide highly accurate mining results even under strict amplification-based [18] privacy guarantees. In fact, a perturbation matrix with provably minimal condition number¹ was identified, substantially improving the accuracy under the given constraints. Finally, an efficient integration of this optimal matrix with the association mining process was outlined.

¹In the class of symmetric positive-definite matrices (refer Section 10.4.2.1).

10.4 The FRAPP Framework

In the remainder of this chapter, we present, as a representative example, the salient details of FRAPP and discuss how it simultaneously provides strong privacy, high accuracy and good efficiency in a B2C privacy-preserving environment of mining association rules.

As mentioned earlier, let the probability of an original customer record $U_i = u$, $u \in I_U$, being perturbed to a record $V_i = v$, $v \in I_V$, be $p(u \rightarrow v)$, and let $A$ denote the matrix of these transition probabilities, with $A_{vu} = p(u \rightarrow v)$. This random process maps to a Markov process, and the perturbation matrix $A$ should therefore satisfy the following properties [39]:

$$A_{vu} \geq 0 \quad \text{and} \quad \sum_{v \in I_V} A_{vu} = 1 \qquad \forall u \in I_U,\ v \in I_V \qquad (10.1)$$

Due to the constraints imposed by Equation 10.1, the domain of $A$ is a subset of $\mathbb{R}^{|S_V| \times |S_U|}$. This domain is further restricted by the choice of perturbation method. For example, for the MASK technique [34], all the entries of matrix $A$ are determined by the choice of a single parameter, namely, the flipping probability.

We now explore the preferred choices of $A$ to simultaneously achieve privacy guarantees and high accuracy, without restricting ab initio to a particular perturbation method. From the previously-mentioned results of [18], the following condition on the perturbation matrix $A$ in order to support $(\rho_1, \rho_2)$ privacy can be derived:

$$\frac{A_{vu_1}}{A_{vu_2}} \leq \gamma < \frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)} \qquad \forall u_1, u_2 \in I_U,\ \forall v \in I_V \qquad (10.2)$$

That is, the choice of perturbation matrix $A$ should follow the restriction that the ratio of any two matrix entries (in a row) should not be more than $\gamma$, as checked in the sketch below.
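The following sketch builds one matrix satisfying both Equation 10.1 and Equation 10.2 – a "gamma-diagonal" matrix with $\gamma x$ on the diagonal and $x$ elsewhere, scaled so that every column sums to 1. Presenting this particular construction as FRAPP's minimal-condition-number matrix is an assumption based on the description above; the function names are illustrative.

```python
import numpy as np

def gamma_diagonal_matrix(n, gamma):
    """Perturbation matrix with A_vv = gamma * x and A_vu = x (u != v),
    where x = 1 / (gamma + n - 1) so that every column sums to 1 (Eq. 10.1).
    The ratio of any two entries in a row is then exactly gamma (Eq. 10.2)."""
    x = 1.0 / (gamma + n - 1)
    return x * ((gamma - 1) * np.eye(n) + np.ones((n, n)))

def satisfies_amplification(A, gamma, tol=1e-9):
    """Check A[v, u1] / A[v, u2] <= gamma for all rows v and column pairs."""
    return all(A[v].max() <= gamma * A[v].min() + tol for v in range(A.shape[0]))

A = gamma_diagonal_matrix(n=8, gamma=19.0)
assert np.allclose(A.sum(axis=0), 1.0)        # columns are probability vectors
assert satisfies_amplification(A, 19.0)
```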
10.4.1 Reconstruction Model

We now analyze how the distribution of the original database is reconstructed from the perturbed database. As per the perturbation model, a client $C_i$ with data record $U_i = u$, $u \in I_U$, generates record $V_i = v$, $v \in I_V$, with probability $p[u \rightarrow v]$. This event of generation of $v$ can be viewed as a Bernoulli trial with success probability $p[u \rightarrow v]$. If the outcome of the $i$th Bernoulli trial is denoted by the random variable $Y_v^i$, the total number of successes $Y_v$ in $N$ trials is given by the sum of the $N$ Bernoulli random variables:

$$Y_v = \sum_{i=1}^{N} Y_v^i \qquad (10.3)$$

That is, the total number of records with value $v$ in the perturbed database is given by $Y_v$. Note that $Y_v$ is the sum of $N$ independent but non-identical Bernoulli trials. The trials are non-identical because the probability of success varies from trial $i$ to trial $j$, depending on the values of $U_i$ and $U_j$, respectively. The distribution of such a random variable $Y_v$ is known as the Poisson-Binomial distribution [45].

From Equation 10.3, the expectation of $Y_v$ is given by

$$E(Y_v) = \sum_{i=1}^{N} E(Y_v^i) = \sum_{i=1}^{N} P(Y_v^i = 1) \qquad (10.4)$$

Using $X_u$ to denote the number of records with value $u$ in the original database, and noting that $P(Y_v^i = 1) = p[u \rightarrow v] = A_{vu}$ for $U_i = u$, results in

$$E(Y_v) = \sum_{u \in I_U} A_{vu} X_u \qquad (10.5)$$

Let $X = [X_1 X_2 \cdots X_{|S_U|}]^T$ and $Y = [Y_1 Y_2 \cdots Y_{|S_V|}]^T$. Then the following expression is obtained from Equation 10.5:

$$E(Y) = AX \qquad (10.6)$$

At first glance, it may appear that $X$, the distribution of records in the original database (and the objective of the reconstruction exercise), can be directly obtained from the above equation. However, an immediate difficulty is that the data miner does not possess $E(Y)$, but only a specific instance of $Y$, with which she has to approximate $E(Y)$.² Therefore, the following approximation to Equation 10.6 is resorted to:

$$Y = A\widehat{X} \qquad (10.7)$$

where $\widehat{X}$ is the resulting estimate of $X$. This is a system of $|S_V|$ equations in $|S_U|$ unknowns, and for the system to be uniquely solvable, a necessary condition is that the space of the perturbed database is a superset of the original database (i.e., $|S_V| \geq |S_U|$). Further, if the inverse of matrix $A$ exists, the solution of this system of equations is given by

$$\widehat{X} = A^{-1} Y \qquad (10.8)$$

providing the desired estimate of the distribution of records in the original database. Note that this estimation is unbiased because $E(\widehat{X}) = A^{-1} E(Y) = X$.

²If multiple distorted versions are provided, then $E(Y)$ is approximated by the observed average of these versions.

10.4.2 Estimation Error

To analyze the error in the above estimation process, the following well-known theorem from linear algebra applies [39]:

Theorem 10.1 Given an equation of the form $Ax = b$ where the measurement of $b$ is inexact, the relative error in the solution $x = A^{-1}b$ satisfies

$$\frac{\|\delta x\|}{\|x\|} \leq c \, \frac{\|\delta b\|}{\|b\|}$$

where $c$ is the condition number of matrix $A$.

For a positive-definite matrix, $c = \lambda_{max}/\lambda_{min}$, where $\lambda_{max}$ and $\lambda_{min}$ are the maximum and minimum eigenvalues of matrix $A$, respectively. Informally, the condition number is a measure of the sensitivity of a matrix to numerical operations. Matrices with condition numbers near one are said to be well-conditioned, i.e., stable, whereas those with condition numbers much greater than one (e.g., $10^5$ for a $5 \times 5$ Hilbert matrix [39]) are said to be ill-conditioned, i.e., highly sensitive.

Equations 10.6 and 10.8, coupled with Theorem 10.1, result in

$$\frac{\|\widehat{X} - X\|}{\|X\|} \leq c \, \frac{\|Y - E(Y)\|}{\|E(Y)\|} \qquad (10.9)$$

which means that the error in estimation arises from two sources: first, the sensitivity of the problem, indicated by the condition number of matrix $A$; and second, the deviation of $Y$ from its mean, i.e., the deviation of the perturbed database counts from their expected values, indicated by the variance of $Y$. In the remainder of this sub-section, we determine how to reduce this error by (a) an appropriate choice of perturbation matrix to minimize the condition number, and (b) identifying the minimum size of the database required to (probabilistically) bound the deviation within a desired threshold. The simulation sketch below illustrates the reconstruction step and the role of the condition number.
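The following self-contained simulation of the reconstruction model uses the gamma-diagonal matrix from the earlier sketch: it perturbs N records through A, then recovers the original distribution via Equation 10.8. The distribution of X and all parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, N, gamma = 8, 200_000, 19.0

# Gamma-diagonal perturbation matrix (columns sum to 1; Eq. 10.2 holds).
A = (1.0 / (gamma + n - 1)) * ((gamma - 1) * np.eye(n) + np.ones((n, n)))

# True distribution X, then perturbed counts Y: each record with value u
# is mapped to value v with probability A[v, u].
X = rng.multinomial(N, np.arange(1, n + 1) / np.arange(1, n + 1).sum())
Y = np.zeros(n)
for u in range(n):
    Y += rng.multinomial(X[u], A[:, u])

X_hat = np.linalg.solve(A, Y)                 # X-hat = A^{-1} Y  (Equation 10.8)
print("condition number c:", np.linalg.cond(A))
print("relative reconstruction error:", np.abs(X_hat - X).sum() / N)
```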
10.4.2.1 Minimizing the Condition Number. The perturbation techniques proposed in the literature primarily differ in their choices for perturbation matrix $A$. For example:

MASK [34] uses a matrix $A$ with

$$A_{vu} = p^{k} (1-p)^{M_b - k} \qquad (10.10)$$

where $M_b$ is the number of boolean attributes when each categorical attribute $j$ is converted into $|S_U^j|$ boolean attributes, $(1-p)$ is the bit flipping probability for each boolean attribute, and $k$ is the number of attributes with matching bits between the perturbed value $v$ and the original value $u$.

The cut-and-paste (C&P) randomization operator [19] employs a matrix $A$ with

$$A_{vu} = \sum_{z=0}^{M} p_M[z] \sum_{q=\max\{0,\, z+l_u-M,\, l_u+l_v-M_b\}}^{\min\{z,\, l_u,\, l_v\}} \frac{\binom{l_u}{q} \binom{M-l_u}{z-q}}{\binom{M}{z}} \binom{M_b-l_u}{l_v-q} \rho^{(l_v-q)} (1-\rho)^{(M_b-l_u-l_v+q)} \qquad (10.11)$$

where

$$p_M[z] = \sum_{w=0}^{\min\{K,z\}} \binom{M-w}{z-w} \rho^{(z-w)} (1-\rho)^{(M-z)} \cdot \begin{cases} 1 - M/(K+1) & \text{if } w = M \\ 1/(K+1) & \text{otherwise} \end{cases}$$

[Source discontinuity: the remainder of Chapter 10 and the opening of Chapter 12, "Preserving Privacy for Contingency Tables," are missing here. A fragment of Chapter 12's table describing the eight CPS variables A–H survives, listing: B, Employer Type (Employment): Gov, Pvt, SE, Other; C, Education; H, Annual Salary (Salary): <$50K, $50K+.]

From a disclosure risk perspective, we are interested in protecting cells with small counts such as "1" and "2". There are 361 cells with a count of 1 and 186 with a count of 2. Our task is to reduce the potential disclosure risk for at least 19% of our sample, while still providing sufficient information for a "valid" statistical analysis. To alleviate estimation problems, we recoded variables B and G from 5 and 2 categories respectively to 2 categories each, yielding a reduced 8-way table with 768 cells. This table is still sparse. There are 193 zero-count cells, or about 25% of the cells. About 16% of the cells have high potential disclosure risk; there are 73 cells with counts of 1 and 53 with counts of 2. For this table we find two reasonable log-linear models:

Model 1: [ABCFG][ACDFG][ACDGH][ADEFG],
Model 2: [ACDGH][ABFG][ABCG][ADFG][BEFG][DEFG],

with goodness-of-fit statistics G² = 1870.64 with 600 degrees of freedom and G² = 2058.91 with 634 degrees of freedom, respectively.

Model 1 is a decomposable graphical log-linear model whose minimal sufficient statistics are the released margins. We first evaluate whether these five-way marginal tables are safe to release by analyzing the number of cells with small counts. Most of the cell counts are large and do not seem to present an immediate disclosure risk. Two of the margins are potentially problematic. Marginal table [ABCFG] has 1 cell with a count of "5", in cell (1,4,2,1,2), while the margin [ACDGH] has a low count of "4" and two cells with counts of "8"; e.g., see Table 12.5. Even without any further analysis, most agencies would not release such margins. Because we are fitting a decomposable model, this initial exploratory analysis reveals that there will be at least one cell with a tight sharp upper bound of size "4". Below we investigate whether these margins are indeed safe to release, accounting for the log-linear models we can fit and the estimates they provide for the reduced and full eight-way tables.

Table 12.5. Marginal table [ACDGH] from the 8-way CPS table. [The counts of this margin, laid out with columns indexed by A (1–3) crossed with C (1–2) and rows indexed by combinations of D, G, and H, are not reliably recoverable from the source.]

Model 1 is easy to fit and evaluate: it is decomposable, and there are closed-form solutions for the bounds given the margins. Almost all lower bounds are 0. As expected from the analysis above, the smallest upper bound is 4 counts. There are 16 such cells, of which 4 contain counts of "1" and the rest contain "0". The next smallest upper bound is 5, for 7 "0" cell counts and for 1 cell with a count of "5". The 5 cells with counts of "1" have the highest risk of disclosure.
The next set of cells with a considerably high disclosure risk consists of cells with an upper bound of size 8. There are 32 such cells (23 contain counts of "0", 4 contain counts of "1", 3 contain counts of "2", and 2 contain counts of "3"). If we focus on count cells of "1" and "2", with the release of this model we have directly identified 12 out of 126 sensitive cells.

Table 12.6. Summary of differences between upper and lower bounds for small cell counts in the full 8-way CPS table under Model 1 and under Model 2

|            | Model 1 |     |    |    |    |    | Model 2 |    |    |    |    |    |
| Bound diff.| 0   | 1   | 2  | 3  | 4  | 5  | 0   | 1  | 2  | 3  | 4  | 5  |
| Cell count 0 | 226 | 112 | 66 | 52 | 69 | 62 | 192 | 94 | 58 | 40 | 36 | 26 |
| Cell count 1 | –   | 12  | 15 | 14 | 13 | 20 | –   | 10 | 8  | 6  | 2  | 10 |
| Cell count 2 | –   | –   | 1  | 3  | 8  | 4  | –   | –  | 2  | 2  | 4  | 4  |
| Cell count 3 | –   | –   | –  | 1  | 4  | 2  | –   | –  | –  | 0  | 0  | 0  |

If we fit the same model to the full 8-way table with 2,880 cells, there are 660 cells with a difference in bounds less than or equal to 5, with all lower bounds being 0. Most of these are "0" cell counts; however, a high disclosure risk exists for 74 cells with a count of "1", 16 cells with a count of "2", and 7 cells with a count of "3"; see the summary in Table 12.6. Thus releasing the margins corresponding to Model 1 poses a substantial risk of disclosure.

Model 2 is a non-decomposable log-linear model, and it requires an iterative algorithm for parameter estimation and extensive calculation for the bounds. This model has 5 marginals as sufficient statistics. The 5-way margin [ACDGH] is still problematic; however, the four 4-way margins all appear to be safe to release, with the smallest count, of size "46", appearing in cell (1,4,1,1) of the margin [ABFG]. We focus our discussion only on cells with small counts, as we did for Model 1. Since Model 2 is non-decomposable, no closed-form solutions exist for the cell bounds, and we must rely on linear programming (LP) and integer programming (IP), which sometimes may not produce sharp bounds (a toy LP computation of such bounds is sketched at the end of this discussion); in this case that was not an issue. For the reduced 8-way table, all lower bounds are 0 and the minimum upper bound again is 4. There are 16 cells with an upper bound of 4, of which four cells have count "1" and the rest are "0". The next smallest upper bound is 8, and there are 5 such cells with counts of "1", 4 cells with counts of "2", and 3 cells with counts of "3". With these margins, in comparison to the released margins under Model 1, we have eliminated the effect of the margin [ABCFG] and reduced the disclosure risk for a subset of the small cell counts; however, we have not reduced the disclosure risk for the small cell counts with the highest disclosure risk. For the full 8-way table, we compare the distribution of small cell bounds for the small cell counts under the two models; see Table 12.6. There are no cells with counts of "3" that have very tight bounds. For the cells with counts of "2", the number of tight bounds has not substantially decreased (e.g., 16 under Model 1 vs. 12 under Model 2), but there has been a significant decrease in the number of tight bounds for the cells with a count of "1" (e.g., from 74 under Model 1 to 36 under Model 2).

In theory we could enumerate the number of possible tables utilizing algebraic techniques and software such as LattE [5], MCMC, or SIS. Due to the large dimension of the solution polytope for this example, however, LattE is currently unable to execute the computation because the space of possible tables is extremely large. We have also been unable to fine-tune the SIS procedure to obtain any reasonable estimate other than "infinity".
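To illustrate the LP relaxation mentioned above on the smallest possible case, the following sketch bounds one cell of a 2 × 2 table given its row and column margins. The toy margins and the use of scipy are assumptions; the CPS example involves far larger systems and, for sharp integer bounds, IP rather than LP.

```python
import numpy as np
from scipy.optimize import linprog

# Cells (n11, n12, n21, n22) of a 2x2 table; fix row totals (10, 20)
# and column totals (15, 15), then bound cell n11 by LP.
A_eq = np.array([
    [1, 1, 0, 0],   # row 1 total
    [0, 0, 1, 1],   # row 2 total
    [1, 0, 1, 0],   # column 1 total
    [0, 1, 0, 1],   # column 2 total
])
b_eq = np.array([10, 20, 15, 15])
c = np.array([1, 0, 0, 0])                   # objective: the n11 cell

lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
print(lo, hi)   # 0.0 10.0 -- here the LP bounds coincide with the Frechet bounds
```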
While it is possible to find a Markov basis corresponding to the second log-linear model, utilizing it for calculating bounds and/or sampling from the space of tables is also currently computationally infeasible. But the practicality of such calculations is likely to change with increased computing power and memory.

Based on Model 1, variables B and H are conditionally independent given the remaining 6 variables. Thus we can collapse the 8-way table to a 6-way table and carry out a disclosure risk analysis on it. The collapsed table has only 96 cells, and there is only one small cell count, of size "2", that would raise an immediate privacy concern. Furthermore, we have collapsed over the two "most" sensitive and most interesting variables for statistical analysis: Type of Employer and Income. We do not pursue this analysis here but, if other variables are of interest, we could again focus on the search for the best decomposable model. With various search algorithms and criteria, out of 32,768 possible decomposable models all searches converge to [ACFG][ADEFG], a model with a likelihood ratio chi-square of G² = 144.036 and 36 degrees of freedom.

In this case, we could simply provide the margins of the above model to the user to construct association rules, provided that they do not provide precise information on the three sensitive cells. Numerous association rules can be derived from the given margins. Some interesting rules, for example, could be AFG ⇒ C and AFG ⇒ DE. As we did in the clinical trial example, we can evaluate how safe the release of these rules is by determining the bounds on the cells given the marginal and conditional constraints, that is, the rules' support and confidence.

12.6 Conclusions

The literature on data mining for association rules has focused on extracting rules with high predictive utility, measured by criteria such as support and confidence. For categorical databases, coming in the form of multi-way contingency tables, these rules and criteria are essentially extracting marginal tables and linked conditionals. Some authors have recognized the relevance of log-linear and related models for this type of data mining activity, e.g., see [11] and [37], but few have addressed the issue of preserving the privacy of individuals represented in the database being mined, with no links to date to ideas from log-linear and related models. In this chapter we have provided an overview of the largely separate statistical literature focused on disclosure limitation in contingency tables, while providing marginal and conditional tables for analysis and reporting.

From the perspective of privacy preservation, the methods described in this chapter for bounds on cell counts provide an alternative approach to that found in most of the machine learning literature. These methods stress the link between the ensemble of data to be released, i.e., margins and conditionals, and their ability to characterize the database through the use of log-linear and related statistical models and assessments of goodness-of-fit. Measures of privacy preservation based on bounds and other statistically related quantities may suggest that "the best association rules" may not be releasable without possibly compromising confidentiality.

New to this enterprise, and especially new to data mining, are the tools from computational algebraic geometry. We have attempted to illustrate their applicability here largely through the examples.
For more details we refer the interested reader to [6], [17], [31], and the papers in a special 2006 issue of the Journal of Symbolic Computation devoted to problems at the interface of statistics and algebraic geometry.

Machine learning has made major progress in the efficient extraction of association rules from large databases. The statistical literature has focused more heavily on understanding the utility of the extracted information and on related methodologies for assessing disclosure limitation or privacy preservation. Our goal in reviewing the points of convergence in these two literatures has been to stimulate a fusion of the different methodologies and computational tools. Barak et al. [3] add the element of perturbation to our toolkit, and we hope to compare their methods with those described in this paper in the near future.

Acknowledgements

We owe special thanks to Alan Karr for pointing out the close correspondence between contingency tables and association rule mining, and to Cynthia Dwork for getting us to explain the sense in which our approach addresses confidentiality protection. This research was supported in part by NSF Grants EIA-98-76619 and IIS-01-31884 to the National Institute of Statistical Sciences, by Army Contract DAAD19-02-1-3-0389 to CyLab and by NSF Grant DMS-0631589 to the Department of Statistics, both at Carnegie Mellon University, by NSF Grant SES-0532407 to the Department of Statistics, Pennsylvania State University, and by NSF Grant DMS-0439734 to the Institute for Mathematics and Its Applications, University of Minnesota.

References

[1] Agresti, A. (2002). Categorical Data Analysis. 2nd Edition. New York: Wiley.
[2] Anderson, B. and Moore, A. (1998). AD-trees for Fast Counting and for Fast Learning of Association Rules, Knowledge Discovery from Databases Conference.
[3] Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., and Talwar, K. (2007). Privacy, Accuracy, and Consistency Too: A Holistic Solution to Contingency Table Release, PODS '07: Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, New York: ACM Press, 273–282.
[4] Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
[5] De Loera, J., Haws, D., Hemmecke, R., Huggins, P., Tauzer, J., and Yoshida, R. (2003). A User's Guide for LattE v1.1. University of California, Davis.
[6] Diaconis, P. and Sturmfels, B. (1998). Algebraic Algorithms for Sampling From Conditional Distributions, Annals of Statistics, 26, 363–397.
[7] Dobra, A. and Fienberg, S. E. (2000). Bounds for Cell Entries in Contingency Tables Given Marginal Totals and Decomposable Graphs, Proceedings of the National Academy of Sciences, 97, 11885–11892.
[8] Dobra, A. and Fienberg, S. E. (2001). Bounds for Cell Entries in Contingency Tables Induced by Fixed Marginal Totals, Statistical Journal of the United Nations ECE, 18, 363–371.
[9] Dobra, A., Fienberg, S. E., and Trottini, M. (2003). Assessing the Risk of Disclosure of Confidential Categorical Data (with discussion), In J. Bernardo et al., eds., Bayesian Statistics 7, Clarendon: Oxford University Press, 125–144.
[10] Domingo-Ferrer, J. and Torra, V. (eds.) (2004). Privacy in Statistical Databases, Lecture Notes in Computer Science No. 3050, New York: Springer-Verlag.
[11] DuMouchel, W. and Pregibon, D. (2001).
Empirical Bayes Screening for Multi-Item Associations, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery in Databases & Data Mining (KDD01), ACM Press, 67–76.
[12] Duncan, G. T., Fienberg, S. E., Krishnan, R., Padman, R., and Roehrig, S. F. (2001). Disclosure Limitation Methods and Information Loss for Tabular Data, In P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds.), Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, Amsterdam: Elsevier, 135–166.
[13] Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating Noise to Sensitivity of Functions in Private Data Analysis, 3rd Theory of Cryptography Conference (TCC) 2006, 265–284.
[14] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. (2002). Privacy Preserving Mining of Association Rules, Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, Edmonton, Canada, July 2002.
[15] Fienberg, S. E. (1980). The Analysis of Cross-Classified Categorical Data. 2nd edition. Cambridge, MA: MIT Press.
[16] Fienberg, S. E. (2004). Datamining and Disclosure Limitation for Categorical Statistical Databases, Proceedings of the Workshop on Privacy and Security Aspects of Data Mining, Fourth IEEE International Conference on Data Mining (ICDM 2004), Brighton, UK, November 2004.
[17] Fienberg, S. E., Makov, U. E., Meyer, M. M., and Steele, R. J. (2001). Computing the Exact Distribution for a Multi-way Contingency Table Conditional on its Marginal Totals, In A. K. M. E. Saleh, ed., Data Analysis from Statistical Foundations: Papers in Honor of D. A. S. Fraser, Huntington, NY: Nova Science Publishing, 145–165.
[18] Fienberg, S. E. and Makov, U. E. (1998). Confidentiality, Uniqueness, and Disclosure Limitation for Categorical Data, Journal of Official Statistics, 14, 385–397.
[19] Fienberg, S. E. and Slavkovic, A. B. (2004). Making the Release of Confidential Data from Multi-Way Tables Count, Chance, 17(3), 5–10.
[20] Fienberg, S. E. and Slavkovic, A. B. (2005). Preserving the Confidentiality of Categorical Statistical Data Bases When Releasing Information for Association Rules, Data Mining and Knowledge Discovery, 11, 155–180.
[21] Hemmecke, R. and Hemmecke, R. (2003). 4ti2 Version 1.1 — Computation of Hilbert Bases, Graver Bases, Toric Gröbner Bases, and More. http://www.4ti2.de.
[22] Jordan, M. I. (ed.) (1998). Learning in Graphical Models. Cambridge, MA: MIT Press.
[23] Kargupta, H., Datta, S., Wang, Q., and Sivakumar, K. (2003). Random Data Perturbation Techniques and Privacy Preserving Data Mining, Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), Melbourne, Florida, USA, December 2003.
[24] Koch, G., Amara, J., Atkinson, S., and Stanish, W. (1983). Overview of Categorical Analysis Methods, SAS-SUGI, 8, 785–795.
[25] Lauritzen, S. L. (1996). Graphical Models. Oxford: Oxford University Press.
[26] Madigan, D. and Raftery, A. E. (1994). Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window, Journal of the American Statistical Association, 89, 1535–1546.
[27] Moore, A. and Schneider, J. (2002). Real-valued All-Dimensions Search: Low-overhead Rapid Searching Over Subsets of Attributes, Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, July 2002, San Francisco: Morgan Kaufmann Publishers, 360–369.
[28] Rizvi, S. and Haritsa, J. (2002). Maintaining Data Privacy in Association Rule Mining, Proceedings of the 28th Conference on Very Large Data Bases (VLDB'02).
[29] Silverstein, C., Brin, S., and Motwani, R. (1998). Beyond Market Baskets: Generalizing Association Rules to Dependence Rules, Data Mining and Knowledge Discovery, 2, 39–68.
[30] Silverstein, C., Brin, S., Motwani, R., and Ullman, J. (2000). Scalable Techniques for Mining Causal Structures, Data Mining and Knowledge Discovery, 4, 163–192.
[31] Slavkovic, A. B. (2004). Statistical Disclosure Limitation Beyond the Margins. Ph.D. Thesis, Department of Statistics, Carnegie Mellon University.
[32] Slavkovic, A. B. and Smucker, B. (2007). Calculating Cell Bounds in Contingency Tables Based on Conditional Frequencies. Technical Report, Department of Statistics, Penn State University.
[33] Slavkovic, A. B. and Fienberg, S. E. (2004). Bounds for Cell Entries in Two-way Tables Given Conditional Relative Frequencies, In Domingo-Ferrer, J. and Torra, V. (eds.), Privacy in Statistical Databases, Lecture Notes in Computer Science No. 3050, 30–43. New York: Springer-Verlag.
[34] Sturmfels, B. (2003). Algebra and Geometry of Statistical Models. John von Neumann Lectures at Munich University.
[35] Trottini, M. and Fienberg, S. E. (2002). Modelling User Uncertainty for Disclosure Risk and Data Utility, International Journal of Uncertainty, Fuzziness and Knowledge Based Systems, 10, 511–528.
[36] Willenborg, L. C. R. J. and de Waal, T. (2000). Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Volume 155, New York: Springer-Verlag.
[37] Wu, X., Barbará, D., and Ye, Y. (2003). Screening and Interpreting Multi-item Associations Based on Log-linear Modeling, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery in Databases & Data Mining (KDD03), ACM Press, 276–285.
[38] Zaki, M. J. (2004). Mining Non-Redundant Association Rules, Data Mining and Knowledge Discovery, 9, 223–248.
[39] Verykios, V. S. and Gkoulalas-Divanis, A. (2007). A Survey of Association Rule Hiding Methods for Privacy, in this volume.

Chapter 13

A Survey of Privacy-Preserving Methods Across Horizontally Partitioned Data

Murat Kantarcioglu
Computer Science Department, University of Texas at Dallas
muratk@utdallas.edu

Abstract: Data mining can extract important knowledge from large data collections, but sometimes these collections are split among various parties. Data warehousing, which brings data from multiple sources under a single authority, increases the risk of privacy violations. Furthermore, privacy concerns may prevent the parties from directly sharing even some meta-data. Distributed data mining and processing provide a means to address this issue, particularly if queries are processed in a way that avoids the disclosure of any information beyond the final result. This chapter describes methods to mine horizontally partitioned data without violating privacy and discusses how to use the data mining results in a privacy-preserving way. The methods described here incorporate cryptographic techniques to minimize the information shared, while adding as little overhead as possible to the mining and processing task.

Keywords: Privacy, distributed data mining, horizontally partitioned data, homomorphic encryption.

13.1 Introduction

Data mining technology has emerged as a means of identifying patterns and trends from large quantities of data. Recently, there has been growing concern over the privacy implications of data mining.
Some of this is public perception: the "Data Mining Moratorium Act of 2003" introduced in the U.S. Senate [8] was based on a fear of government searches of private data for individual information, rather than on what the technical community views as data mining. However, concerns remain. While data mining is generally aimed at producing general models rather than learning about specific individuals, the process of data mining creates integrated data warehouses that pose real privacy issues. Data that is of limited sensitivity by itself becomes highly sensitive when integrated, and gathering the data under a single roof greatly increases the opportunity for misuse. Even though some distributed data mining tasks protect individual data privacy, they still require that each site reveal some partial information about its local data. What if even this information is sensitive?

For example, suppose the Centers for Disease Control (CDC), a public agency, would like to mine health records to try to find ways to reduce the proliferation of antibiotic-resistant bacteria. Insurance companies have data on patient diseases and prescriptions. The CDC may try to mine association rules of the form X ⇒ Y such that Pr(X & Y) and Pr(Y | X) are above certain thresholds. Mining this data for association rules would allow the discovery of rules such as Augmentin & Summer ⇒ Infection & Fall, i.e., people taking Augmentin in the summer seem to have recurring infections in the fall.

The problem is that insurance companies will be concerned about sharing this data. Not only must the privacy of patient records be maintained, but insurers will be unwilling to release rules pertaining only to them. Imagine a rule indicating a high rate of complications with a particular medical procedure. If this rule doesn't hold globally, the insurer would like to know this; they can then try to pinpoint the problem with their policies and improve patient care. If the fact that the insurer's data supports this rule is revealed (say, under a Freedom of Information Act request to the CDC), the insurer could be exposed to significant public relations or liability problems. This potential risk could exceed their own perception of the benefit of participating in the CDC study.

One solution to this problem is to avoid disclosing data beyond its source, while still constructing data mining models equivalent to those that would have been learned on an integrated data set. If it can be proven that data is not disclosed beyond its original source, then the opportunity for misuse is not increased by the process of data mining. The definition of privacy followed in this line of research is conceptually simple: no site should learn anything new from the process of data mining. Specifically, anything learned during the data mining process must be derivable given one's own data and the final result. In other words, nothing is learned about any other site's data that isn't inherently obvious from the data mining result.

The approach followed in this research has been to select a type of data mining model to be learned and to develop a protocol that learns the model while meeting this definition of privacy. In addition to the type of data mining model to be learned, the different types of data distribution result in a need for different protocols.
For example, the first paper in this area proposed a solution for learning decision trees on horizontally partitioned data: each site has complete information on a distinct set of entities, and an integrated dataset consists of the union of these datasets. In contrast, vertically partitioned data has different types of information at each site; each site has partial information on the same set of entities. In this case an integrated dataset would be produced by joining the data from the sites. While [25] showed how to generate ID3 decision trees on horizontally partitioned data, a completely new method was needed for vertically partitioned data [6]. (We will not further discuss the vertically partitioned data case in this chapter; please see Vaidya's chapter in this book for a discussion of that case.)

This chapter presents solutions such that the parties learn (almost) nothing beyond the global results. We assume homogeneous databases and horizontally partitioned data: all sites have the same schema, but each site has information on different entities. The given solutions are relatively efficient and are proven to preserve privacy under some reasonable assumptions. Specifically, in Section 13.2, we briefly discuss the necessary cryptographic definitions and tools. In Section 13.3, we summarize how the basic cryptographic tools can be used to create privacy-preserving sub-protocols. Later on, in Section 13.4, we outline how privacy-preserving distributed data mining protocols are created using these few sub-protocols. In Section 13.6, we discuss how to extend current algorithms to withstand different adversarial models. In Section 13.8, we give an overview of other privacy issues related to data mining results. Finally, in Section 13.9, we conclude with possible future research directions.

13.2 Basic Cryptographic Techniques for Privacy-Preserving Distributed Data Mining

Privacy-preserving distributed data mining algorithms require collaboration between parties to compute the results, while provably preventing the disclosure of any information except the data mining results. To achieve this goal, we will use tools from the secure multiparty computation (SMC) domain. The concept of privacy in this approach is based on a solid body of theoretical work. First, we briefly discuss the basic ideas from the SMC domain. Then, we describe a useful variant of public-key cryptography called homomorphic encryption.

Privacy Definitions and Proof Techniques

Secure Multiparty Computation (SMC) originated with Yao's Millionaires' problem [33]. The basic problem is that two millionaires would like to know who is richer, with neither revealing their net worth. Abstractly, the problem is to simply compare two numbers, each held by one party, without either party revealing its number to the other. Yao [33] presented a generic circuit-evaluation-based solution for this problem, as well as a generalization to any efficiently computable two-party function.

The SMC literature defines two basic adversarial models:

Semi-Honest: Semi-honest (or honest-but-curious) adversaries follow the protocol faithfully, but may try to infer the secret information of the other parties from the data they see during the execution of the protocol.

Malicious: Malicious adversaries may do anything to infer secret information.
They can abort the protocol at any time, send spurious messages, spoof messages, collude with other (malicious) parties, etc.

While the semi-honest model may seem questionable for privacy (if a party can be trusted to follow the protocol, why don't we trust them with the data?), we believe that it meets several practical needs for early adoption of the technology. Consider the case where credit card companies jointly build data mining models for credit card fraud detection. In many cases the parties involved already have authorization to see the data (e.g., the theft of credit card information from CardSystems [30] involved data that CardSystems was expected to see during processing). The problem is that storing the data brings with it a responsibility (and cost) of protecting that data; CardSystems was supposed to delete the information once the processing was complete. If parties could develop the desired models without seeing the data, then they are saved the responsibility (and cost) of protecting it. Also, the simplicity and efficiency possible with semi-honest protocols will help speed adoption, so that trusted parties are saved the expense of protecting data other than their own. As the technology gains acceptance, malicious protocols will become viable for uses where the parties are not mutually trusted. (Please see Section 13.6 for the discussion of malicious parties.)

In either adversarial model, there exist formal definitions of privacy [13]. Informally, the definition of privacy is based on equivalence to having a trusted third party perform the computation. This is the gold standard of secure multiparty computation. Imagine that each of the data sources gives their input to a (hypothetical) trusted third party. This party, acting in complete isolation, computes the results and reveals them. After revealing the results, the trusted party forgets everything it has seen. A secure multiparty computation approximates this standard: no party learns more than it would in the trusted-third-party approach.

One fact is immediately obvious: no matter how secure the computation, some information about the inputs may be revealed. This is a result of the computed function itself. For example, if one party's net worth is $100,000, and the other party is richer, one has a lower bound on the richer party's net worth. This is captured in the formal SMC definitions: any information that can be inferred from one's own data and the result may be revealed by the protocol. Thus, there are two kinds of information leaks: the information leak from the function computed, irrespective of the process used to compute the function, and the information leak from the specific process of computing the function. Whatever is leaked from the function itself is unavoidable as long as the function has to be computed (we discuss the privacy issues related to data mining results in Section 13.8). In secure computation, the second kind of leak is provably prevented: there is no information leak whatsoever due to the process. Some algorithms improve efficiency by trading off some security (leaking a small amount of information). Even when this is allowed, the SMC style of proof provides a tight bound on the information leaked, allowing one to determine whether the algorithm satisfies a given privacy policy.

This leads to the primary proof technique used to demonstrate the security of privacy-preserving distributed data mining: a simulation argument.
Given only its own input and the result, a party must be able to simulate what it sees during the execution of the protocol. One key point is the restriction of the simulator to polynomial-time algorithms, and that the views only need to be computationally indistinguishable. Algorithms meeting this definition need not be proven secure against an adversary capable of trying an exponential number of possibilities in a reasonable time frame. While some protocols do not require this restriction, most make use of cryptographic techniques that are only secure against polynomial-time adversaries. This is adequate in practice (as with cryptography): security parameters can be set to ensure that the computing resources needed to break the protocol in any reasonable time do not exist.

While Yao's generic circuit evaluation method has been proven secure under the above definition, it poses significant computational problems. Given the size and computational cost of data mining problems, representing algorithms as boolean circuits results in unrealistically large circuits. The challenge of privacy-preserving distributed data mining is to develop algorithms that have reasonable computation and communication costs on real-world problems, and to prove their security with respect to the above definition.

The composition theorem [13] is another very useful theorem from the SMC literature.

Theorem 13.2.1 (Composition Theorem for the semi-honest model) Suppose that g is privately reducible to f and that there exists a protocol for privately computing f. Then there exists a protocol for privately computing g.

Informally, the theorem states that if a protocol is shown to be secure except for several invocations of sub-protocols, and if the sub-protocols themselves are proven to be secure, then the entire protocol is secure. The immediate consequence is that, with care, we can combine secure sub-protocols to produce new secure protocols. Also, if many algorithms depend on a few common sub-protocols, efficient implementation of these sub-protocols significantly improves the overall efficiency. The following sections show that many privacy-preserving data mining algorithms can be developed using a few such sub-protocols.

Homomorphic Encryption

As mentioned above, most of the protocols devised for privacy-preserving distributed data mining can be implemented using a few sub-protocols. For ease of exposition, we describe those sub-protocols using homomorphic encryption techniques.

In a nutshell, we can describe homomorphic encryption as follows. Let $E_{pk}(\cdot)$ denote the encryption function with public key $pk$ and $D_{pr}(\cdot)$ denote the decryption function with private key $pr$. A secure public-key cryptosystem is called homomorphic if it satisfies the following requirements:

(1) Given the encryptions of $m_1$ and $m_2$, $E_{pk}(m_1)$ and $E_{pk}(m_2)$, there exists an efficient algorithm to compute the public-key encryption of $m_1 + m_2$, denoted $E_{pk}(m_1 + m_2) := E_{pk}(m_1) +_h E_{pk}(m_2)$.

(2) Given a constant $k$ and the encryption of $m_1$, $E_{pk}(m_1)$, there exists an efficient algorithm to compute the public-key encryption of $k \cdot m_1$, denoted $E_{pk}(k \cdot m_1) := k \times_h E_{pk}(m_1)$.

Please refer to [29] for more details. A toy sketch of one such cryptosystem follows.
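The following sketch is a toy Paillier-style cryptosystem exhibiting exactly the two homomorphic properties defined above: ciphertext multiplication realizes $+_h$ and ciphertext exponentiation realizes $\times_h$. The tiny fixed primes are for readability only; this is in no way a secure implementation.

```python
import math, random

# Toy Paillier cryptosystem (tiny primes; for illustration only, not secure).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # mu = L(g^lam mod n^2)^(-1) mod n

def encrypt(m):
    r = random.randrange(1, n)                # assume gcd(r, n) == 1 (tiny demo)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

c1, c2 = encrypt(20), encrypt(22)
assert decrypt(c1 * c2 % n2) == 42            # E(m1) +_h E(m2): multiply ciphertexts
assert decrypt(pow(c1, 3, n2)) == 60          # k x_h E(m1): raise ciphertext to k
```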
13.3 Common Secure Sub-protocols Used in Privacy-Preserving Distributed Data Mining

We will briefly describe the common secure sub-protocols used in privacy-preserving distributed data mining. For each sub-protocol, if possible, we describe a version that only uses homomorphic encryption. Unless otherwise stated, all the sub-protocols are secure in the semi-honest model with no collusion, and all the arithmetic operations are defined in some large enough finite field. In later sections, we will show how different algorithms can be implemented using these secure sub-protocols. Since these common building blocks are quite general, using Theorem 13.2.1, they can be combined to create new privacy-preserving algorithms in the future.

Secure Sum

Secure Sum securely calculates the sum of values from individual sites. Assume that each site $i$ has some value $v_i$ and all sites want to securely compute $v = \sum_{l=1}^{m} v_l$, where $v$ is known to be in the range $[0..n]$. Homomorphic encryption can be used to calculate the secure sum as follows:

1: Site 1 creates a homomorphic encryption public and private key pair, and sends the public key to all sites.
2: Site 1 sets $s_1 = E_{pk}(v_1)$.
3: Each site $i$, $m \geq i > 1$, gets $s_{i-1}$ from site $i-1$ and computes $s_i = s_{i-1} +_h E_{pk}(v_i)$ using the additive property of the homomorphic encryption.
4: Site $m$ sends $s_m$ to site 1.
5: Site 1 sends $D_{pr}(s_m)$ to all parties.

The above protocol is secure because no party other than site 1 can decrypt the $s_i$ values. It also correctly calculates the summation, because $s_m = s_{m-1} +_h E_{pk}(v_m) = E_{pk}(\sum_{l=1}^{m} v_l)$ and $D_{pr}(s_m) = v$. Assuming three or more parties and no collusion, a more efficient method can be found in [19].

Secure Comparison / Yao's Millionaires' Problem

Assume that two sites, each having one value, want to compare the two values without revealing anything else other than the comparison result. Secure comparison methods can be used to solve this problem. To the best of our knowledge, secure circuit evaluation based approaches still provide the best performance [33].

Dot Product Protocol

Securely computing the dot product of two vectors is another important sub-protocol required in many privacy-preserving data mining tasks. Many secure dot product protocols have been proposed in the past [5, 31, 14, 12]. Among the proposed techniques, the method of Goethals et al. [12] is quite simple and provably secure; we briefly describe it here. The problem is defined as follows: Alice has an $n$-dimensional vector $X = (x_1, \ldots, x_n)$ while Bob has an $n$-dimensional vector $Y = (y_1, \ldots, y_n)$. At the end of the protocol, Alice should get $r_A = X \cdot Y - r_B$, where $r_B$ is a random number chosen from a uniform distribution that is known only to Bob, and $X \cdot Y = \sum_{i=1}^{n} x_i \cdot y_i$. The key idea behind the protocol is to use a homomorphic encryption system as described in Section 13.2. Using such a system, it is quite simple to build a dot product protocol: if Alice encrypts her vector and sends it in encrypted form to Bob, then, using the additive homomorphic property, Bob can compute the dot product. The specific details are given below:

Require: Alice has input vector $X = (x_1, \ldots, x_n)$; Bob has input vector $Y = (y_1, \ldots, y_n)$; Alice and Bob get outputs $r_A, r_B$ respectively such that $r_A + r_B = X \cdot Y$.
1: Alice generates a homomorphic private and public key pair.
2: Alice sends the public key to Bob.
3: for $i = 1 \ldots n$ do
4: Alice sends to Bob $c_i = E_{pk}(x_i)$.
5: end for
6: Bob computes $w_i = c_i \times_h y_i$ for each $i$.
7: Bob computes $w = w_1 +_h w_2 +_h \cdots +_h w_n$.
8: Bob generates a random plaintext $r_B$.
9: Bob sends to Alice $w' = w +_h E_{pk}(-r_B)$.
10: Alice computes $r_A = D_{pr}(w') = X \cdot Y - r_B$.

A toy simulation of this protocol follows.
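Assuming the third-party python-paillier package (`phe`) is installed, the protocol above can be simulated directly: the package's encrypted numbers support addition and scalar multiplication, which play the roles of $+_h$ and $\times_h$. This is a local simulation of both parties, not a networked implementation.

```python
import random
from phe import paillier   # third-party python-paillier package (assumed installed)

pub, priv = paillier.generate_paillier_keypair(n_length=1024)

# -- Alice --
X = [3, 1, 4, 1, 5]
cs = [pub.encrypt(x) for x in X]          # steps 3-5: c_i = E_pk(x_i)

# -- Bob --
Y = [2, 7, 1, 8, 2]
w = sum(c * y for c, y in zip(cs, Y))     # steps 6-7: homomorphic dot product
rB = random.randrange(1, 10**6)           # step 8: Bob's random share
w_prime = w + (-rB)                       # step 9: w' = w +_h E_pk(-rB)

# -- Alice --
rA = priv.decrypt(w_prime)                # step 10: rA = X.Y - rB
assert rA + rB == sum(x * y for x, y in zip(X, Y))
```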
Oblivious Evaluation of Polynomials

Another important sub-protocol required in privacy-preserving data mining is the secure polynomial evaluation protocol. Consider the case where Alice has a polynomial $P$ of degree $k$ over some finite field $\mathcal{F}$. Bob has an element $x \in \mathcal{F}$ and also knows $k$. Alice would like to let Bob compute the value $P(x)$ in such a way that Alice does not learn $x$ and Bob does not gain any additional information about $P$ (except $P(x)$). This problem was first investigated in [28]. Subsequently, there have been further protocols improving the communication and computation efficiency [2], as well as extending the problem to floating point numbers [1].

We now briefly describe a protocol for oblivious polynomial evaluation that uses the secure dot product above. Given a dot product protocol, we can easily create a protocol for polynomial evaluation as follows. Let $P(y) = \sum_{i=0}^{k} a_i y^i$ be Alice's input and $x$ be Bob's input. Alice forms the vector $U = (a_0, a_1, \ldots, a_k)^T$ and Bob forms the vector $V = (1, x, \ldots, x^k)^T$. Alice and Bob then engage in the secure dot product protocol so that (only) Bob gets $r = U \cdot V$. Clearly $r = \sum_{i=0}^{k} a_i x^i = P(x)$. Using Theorem 13.2.1, it can be shown that if the dot product protocol is secure, then the above protocol is also secure. A small sketch of this reduction follows.
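The sketch below shows the reduction itself: polynomial evaluation as a dot product of the coefficient vector with the vector of powers of $x$. A plain modular computation with random shares stands in for the secure dot product sub-protocol; in the real protocol neither party sees the other's vector.

```python
import random

F = 2**31 - 1   # a prime; all arithmetic is done in GF(F)

def share_dot_product(U, V):
    """Stand-in for the secure dot product above: returns random shares
    rA, rB with rA + rB = U.V (mod F). Here we just compute it in the
    clear to exhibit the share structure."""
    d = sum(u * v for u, v in zip(U, V)) % F
    rB = random.randrange(F)
    return (d - rB) % F, rB

a = [5, 0, 3, 7]                            # Alice: P(y) = 5 + 3y^2 + 7y^3
x = 12345                                   # Bob's point
U = a                                       # Alice's coefficient vector
V = [pow(x, i, F) for i in range(len(a))]   # Bob's vector (1, x, ..., x^k)
rA, rB = share_dot_product(U, V)
assert (rA + rB) % F == (5 + 3 * x**2 + 7 * x**3) % F   # shares reassemble to P(x)
```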
Privately Computing ln x

For the entropy measures used in data mining, we need to be able to privately compute $\ln x$, where $x = x_1 + x_2$ with $x_1$ known to Alice and $x_2$ known to Bob. Thus, Alice should get $y_1$ and Bob should get $y_2$ such that $y_1 + y_2 = \ln x = \ln(x_1 + x_2)$. One of the key results presented in [26] was a cryptographic protocol for this computation, which we now describe in brief. Note that $\ln x$ is real-valued, while general cryptographic tools work over finite fields; we therefore multiply $\ln x$ by a known constant to make it integral.

The basic idea behind computing random shares of $\ln(x_1 + x_2)$ is to use the Taylor approximation for $\ln x$. Remember that the Taylor approximation gives us

$$\ln(1 + \varepsilon) = \sum_{i=1}^{\infty} \frac{(-1)^{i-1} \varepsilon^i}{i} = \varepsilon - \frac{\varepsilon^2}{2} + \frac{\varepsilon^3}{3} - \frac{\varepsilon^4}{4} + \cdots \qquad \text{for } -1 < \varepsilon < 1$$

For an input $x$, let $n = \lfloor \log_2 x \rfloor$, so that $2^n$ represents the closest power of 2 below $x$. Therefore, $x = x_1 + x_2 = 2^n (1 + \varepsilon)$ where $-1/2 \leq \varepsilon \leq 1/2$. Consequently,

$$\ln(x) = \ln(2^n (1 + \varepsilon)) = \ln 2^n + \ln(1 + \varepsilon) \approx \ln 2^n + \sum_{i=1}^{k} \frac{(-1)^{i-1} \varepsilon^i}{i} = \ln 2^n + T(\varepsilon)$$

where $T(\varepsilon)$ is a polynomial of degree $k$. The approximation error is exponentially small in $k$.

There are two phases to the protocol. Phase 1 finds an appropriate $n$ and $\varepsilon$. Let $N$ be a predetermined (public) upper bound on the value of $n$. First, Yao's circuit evaluation is applied to a small circuit which takes $x_1$ and $x_2$ as input and outputs random shares of $2^N \varepsilon$ and $2^N n \ln 2$. Note that $2^n \varepsilon = x - 2^n$, where $n$ can be determined by simply looking at the two most significant bits of $x$, and $2^N \varepsilon$ is obtained simply by shifting the result by $N - n$ bits to the left. Thus, the circuit outputs random $\alpha_1$ and $\alpha_2$ such that $\alpha_1 + \alpha_2 = 2^N \varepsilon$, and also outputs random $\beta_1$ and $\beta_2$ such that $\beta_1 + \beta_2 = 2^N n \ln 2$. This circuit can be easily constructed. Random shares are obtained by having one of the parties input random values $\alpha_1, \beta_1 \in \mathcal{F}$ into the circuit and having the circuit output $\alpha_2 = 2^N \varepsilon - \alpha_1$ and $\beta_2 = 2^N n \ln 2 - \beta_1$ to the other party.

Phase 2 of the protocol involves computing shares of the Taylor series approximation $T(\varepsilon)$. This is done as follows: Alice chooses a random $w_1 \in \mathcal{F}$ and defines a polynomial $Q(\cdot)$ such that $w_1 + Q(\alpha_2) = T(\varepsilon)$. Thus $Q(\cdot)$ is defined as

$$Q(z) = \mathrm{lcm}(2, \ldots, k) \sum_{i=1}^{k} \frac{(-1)^{i-1}}{2^{N(i-1)}} \cdot \frac{(\alpha_1 + z)^i}{i} \; - \; w_1$$

Alice and Bob then execute the secure polynomial evaluation defined above, with Alice inputting $Q(\cdot)$ and Bob inputting $\alpha_2$, through which Bob obtains $w_2 = Q(\alpha_2)$. Alice and Bob define $u_1 = \mathrm{lcm}(2, \ldots, k)\,\beta_1 + w_1$ and $u_2 = \mathrm{lcm}(2, \ldots, k)\,\beta_2 + w_2$. We then have $u_1 + u_2 \approx 2^N \mathrm{lcm}(2, \ldots, k) \ln x$. Further details on the protocol, as well as the proof of security, can be found in [26].

Secure Intersection

Secure intersection methods are useful in data mining to find common rules, frequent itemsets, etc., without revealing the owner of an item. Many algorithms have been developed for calculating the secure set intersection; for example, [32] provides an efficient solution. Here we describe a secure set intersection protocol [23, 9] that uses secure polynomial evaluation. Let us assume that Alice has set $X = \{x_1, \ldots, x_n\}$ and Bob has set $Y = \{y_1, \ldots, y_n\}$. Our goal is to securely calculate $X \cap Y$. By representing the set $X$ as a polynomial and using polynomial evaluation, Alice and Bob can calculate $X \cap Y$ securely. The specific details are given below:

Require: Alice has input set $X = \{x_1, \ldots, x_n\}$; Bob has input set $Y = \{y_1, \ldots, y_n\}$; Alice and Bob learn $X \cap Y$.
1: Alice generates a homomorphic private and public key pair.
2: Alice sends the public key to Bob.
3: Alice creates a polynomial $P(z) = \sum_{i=0}^{n} a_i z^i$ such that $P(x_i) = 0$ for all $x_i$ (this is possible using interpolation).
4: for $i = 1 \ldots n$ do
5: Alice sends to Bob $c_i = E_{pk}(a_i)$.
6: end for
7: for $i = 1 \ldots n$ do
8: Using the $c_i$ values and a random non-zero $r_i$, Bob computes $w_i = E_{pk}(r_i \cdot P(y_i) + y_i)$ (this is possible due to the homomorphic encryption).
9: end for
10: Bob permutes the $w_i$ values and sends them to Alice.
11: Alice decrypts all $w_i$ values and outputs $D_{pr}(w_i)$ as an element of $X \cap Y$ if $D_{pr}(w_i) \in X$.

Note that the above protocol works because if $y_i \in X \cap Y$ then $P(y_i) = 0$, and if $P(y_i) = 0$ then $D_{pr}(w_i) = y_i$. On the other hand, if $y_i \notin X \cap Y$ then $P(y_i) \neq 0$, and then $D_{pr}(w_i)$ will be some random number determined by $r_i$. See [9] for further details.

Secure Set Union

Secure union methods are useful in data mining to allow each party to contribute its rules, decision trees, etc., without revealing the owner of each item. The union of items can be easily evaluated using SMC methods if the domain of the items is small. Each party creates a binary vector (where the $i$th entry is 1 if the $i$th item is present locally). At this point, a simple circuit that ORs the corresponding vectors can be built and securely evaluated using general secure multi-party circuit evaluation protocols. However, in data mining the domain of the items is usually very large, potentially infinite. This problem can be overcome using approaches based on commutative encryption [20]; a toy sketch of the commutativity property and the resulting union computation follows.
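The sketch below uses a toy Pohlig-Hellman-style commutative cipher, $E_k(m) = m^k \bmod p$: because exponents commute, an item encrypted by every party's key yields the same value regardless of encryption order, so duplicates collapse and the union size is revealed without revealing ownership. The parameter choices are illustrative and not secure; presenting this particular cipher is an assumption, as [20] may use a different commutative scheme.

```python
import math, random

# Toy Pohlig-Hellman-style commutative cipher: E_k(m) = m^k mod p.
p = 2**127 - 1          # a Mersenne prime (toy parameter choice)

def keygen():
    while True:
        k = random.randrange(3, p - 1)
        if math.gcd(k, p - 1) == 1:    # exponent must be invertible mod p-1
            return k

def enc(k, m):
    return pow(m, k, p)

kA, kB = keygen(), keygen()
item = 123456789
# Commutativity: encrypting with A's key then B's equals B's then A's.
assert enc(kB, enc(kA, item)) == enc(kA, enc(kB, item))

# Toy union of two item sets: fully encrypted values can be compared
# without revealing which site contributed which item.
SA, SB = {11, 22, 33}, {22, 44}
EA = {enc(kB, enc(kA, m)) for m in SA}
EB = {enc(kA, enc(kB, m)) for m in SB}
print(len(EA | EB))     # |SA union SB| = 4; the shared item 22 collapses
```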
13.4 Privacy-Preserving Distributed Data Mining on Horizontally Partitioned Data

In this section, we give an overview of how the different sub-protocols described in Section 13.3 can be used to create various privacy-preserving distributed data mining (PPDDM) algorithms on horizontally partitioned data. In each of the discussed PPDDM algorithms, the general data mining functionality is reduced to a computation of secure sub-protocols. Figure 13.1 shows the correspondence between algorithms and their constituent secure sub-protocols.

In all the algorithms described below, we assume that the data is horizontally partitioned. This assumption implies that different sites collect the same set of information about different entities. For example, different credit card companies may collect credit card transactions of different individuals. In relational terms, with horizontal partitioning, the relation to be mined is the union of the relations at the sites. At the end of this section, we also briefly discuss the relationship between the privacy-preserving algorithms developed for horizontally and vertically partitioned data.

[Figure 13.1. Relationship between Secure Sub-protocols and Privacy-Preserving Distributed Data Mining on Horizontally Partitioned Data. The figure links the specific secure tools (Secure Sum, Secure Comparison, Secure Dot Product, Secure Union, Secure Logarithm, Secure Polynomial Evaluation) to the mining tasks on horizontally partitioned data (Association Rule Mining, Decision Trees, EM Clustering, k-NN Classification, Naïve Bayes Classifier, Support Vector Machine).]

ID3 Decision Tree Mining

In the first work on privacy-preserving distributed data mining on horizontally partitioned data [25], the goal is to securely build an ID3 decision tree where the training set is horizontally distributed between two parties. The basic idea is that finding the attribute that maximizes information gain is equivalent to finding the attribute that minimizes conditional entropy. The conditional entropy of an attribute for two parties can be written as a sum of expressions of the form $(v_1 + v_2) \times \log(v_1 + v_2)$. The authors use the secure logarithm, secure polynomial evaluation, and secure comparison sub-protocols to securely calculate the expression $(v_1 + v_2) \times \log(v_1 + v_2)$, and show how to use this function for building the ID3 tree securely.

Association Rule Mining

The goal of privacy-preserving association rule mining is to compute rules of the form X ⇒ Y (e.g., Diaper ⇒ Beer) that have global support and confidence above certain thresholds. It is proven in [20] that this can be achieved using the secure set union, secure summation and secure comparison sub-protocols. The algorithm described in [20] has two phases. The first phase uses secure set union to get the union of the candidate association rules. In the second phase, secure summation and secure comparison are used to filter out the candidate itemsets that are not supported globally.

Naive Bayes Classification

The Naive Bayes classifier is a highly practical Bayesian learning method that applies to learning tasks where each instance $x$ is described by a conjunction of attribute values and the target function $f(x)$ can take on any value from some finite set $C$ [27]. In Naive Bayes classification, in order to classify an instance represented as a tuple of attribute values, we need to estimate the conditional probabilities $P(a_i | c_j)$ for all $c_j \in C$ using the training set. The prior probabilities $P(c_j)$ for all $c_j \in C$ also need to be fixed in some fashion (typically by simply counting the frequencies from the training set). The probabilities for differing hypotheses (classes) can also be computed by normalizing the values received for each hypothesis (class). It is shown in [18] that computing $P(a_i | c_j)$ can be securely reduced to computing a function of the form $\frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} y_i}$, where the $x_i, y_i$ values are known by site $i$. In turn, this ratio can be securely calculated using the secure summation and secure $\ln(x)$ protocols, as the sketch below illustrates.
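The following sketch shows the structure of that reduction on toy counts: the conditional probability estimate is a ratio of two sums of per-site counts, and taking logarithms turns the ratio into a difference of two secure-sub-protocol outputs. The plain `sum` and `math.log` calls are explicitly insecure stand-ins for the sub-protocols of Section 13.3.

```python
import math

# Per-site counts for one attribute value a and one class c:
# x[i] = #records at site i with (a, c); y[i] = #records at site i with class c.
x = [12, 30, 5]        # known only to sites 1..3 in the real protocol
y = [100, 240, 60]

def secure_sum(vals):
    """Stand-in for the secure sum sub-protocol of Section 13.3."""
    return sum(vals)

def secure_ln(v):
    """Stand-in for the secure ln(x) sub-protocol of Section 13.3."""
    return math.log(v)

# ln P(a|c) = ln(sum x_i) - ln(sum y_i): two secure sums and two secure
# logarithms, so no site ever reveals its local counts.
ln_p = secure_ln(secure_sum(x)) - secure_ln(secure_sum(y))
print(math.exp(ln_p))   # 0.1175 = 47/400, the global estimate of P(a|c)
```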
k-NN Classification
k-NN classification predicts the class value of an instance using the k nearest examples in the training data; various distance metrics can be used to determine the k nearest examples [11].

In [21], a privacy-preserving k-NN algorithm is suggested under the assumption that the instance to be classified is public. The approach given in [21] makes use of an untrusted, non-colluding party that is not allowed to learn anything about any of the data, but is trusted not to collude with other parties to reveal private information. The basic idea is that each site finds its own k nearest neighbors (this is possible since the instance to be classified is public and the data is horizontally partitioned) and encrypts their classes with the public key of the site that sent the instance for classification (the querying site). The parties securely compare their k nearest neighbors with those of all other sites – except that the comparison gives each site a random share of the result, so no party learns the outcome of the comparison. The results from all sites are combined, scrambled, and given to the untrusted, non-colluding site. This site combines the random shares to get the comparison result for each pair, enabling it to sort and select the global k nearest neighbors (but without learning the source or values of the items). The querying site and the untrusted, non-colluding site then engage in a protocol to find the class value.

Support Vector Machine Classification
Support Vector Machine (SVM) classification is another important classification technique. In [34], a privacy-preserving solution for the horizontally partitioned case is given using the secure dot product sub-protocol. The solution in [34] uses the observation that to build the SVM, only the kernel matrix K is needed. To calculate the kernel matrix K, the Gram matrix G, where $G_{ij} = x_i \cdot x_j$, needs to be computed securely for all training instance pairs $x_i, x_j$. Clearly, $G_{ij}$ can be calculated using the secure dot product protocol.

k-means and EM Clustering
Clustering is a well-studied data mining technique that tries to group similar instances in a given data set into clusters so as to minimize some objective function. In k-means clustering, the goal is to partition the data into k clusters. Usually, k initial cluster centers are chosen, and the cluster centroids are then updated using an iterative method. In [15], it is shown that k-means clustering can be achieved on arbitrarily partitioned data using the secure dot product, secure summation and secure comparison sub-protocols. Similarly, in [24], secure clustering using the expectation maximization method is given for horizontally partitioned data using the secure summation protocol.
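Both the SVM and the clustering constructions above lean on the secure dot product sub-protocol, which typically returns random additive shares of the result rather than the result itself. The sketch below simulates such a shared dot product in the clear and uses it to assemble the Gram matrix needed for the SVM; `share_dot`, the field size, and the toy data are all illustrative assumptions.

```python
# Building G_ij = x_i . x_j from a (simulated) secure dot-product sub-protocol.
import random

FIELD = 2**31

def share_dot(u, v):
    """Return (share_a, share_b) whose sum mod FIELD equals u . v."""
    d = sum(a * b for a, b in zip(u, v)) % FIELD
    s = random.randrange(FIELD)
    return s, (d - s) % FIELD              # neither share alone reveals d

# Each row is one training instance; in the vertical/horizontal settings the
# coordinates would actually be split across the participating parties.
X = [[1, 2], [3, 1], [0, 4]]
n = len(X)
G = [[sum(share_dot(X[i], X[j])) % FIELD for j in range(n)] for i in range(n)]
print(G)   # [[5, 5, 8], [5, 10, 4], [8, 4, 16]]
```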
13.5 Comparison to Vertically Partitioned Data Model
The privacy-preserving algorithms developed for vertically partitioned data also use the common sub-protocols discussed above. To illustrate the difference between the vertically partitioned and horizontally partitioned data models, let us revisit association rule mining in both. In both data models, to mine association rules, we need to check whether the global support of an itemset X (e.g., the global support of an itemset that contains beer and diaper) exceeds a certain threshold.

In the horizontally partitioned data model, a transaction database DB is assumed to be partitioned among n sites (namely $S_1, S_2, \dots, S_n$), where $DB = DB_1 \cup DB_2 \cup \cdots \cup DB_n$ and $DB_i$ resides at site $S_i$ ($1 \le i \le n$). The itemset X has local support count $X.sup_i$ at site $S_i$ if $X.sup_i$ of the transactions at $S_i$ contain X. The global support count of X is given as $X.sup = \sum_{i=1}^{n} X.sup_i$. An itemset X is globally supported if $X.sup \ge s \times (\sum_{i=1}^{n} |DB_i|)$, where s is the support threshold. To check whether an itemset X is globally supported, we can check the following equivalent conditions:

$$X.sup \ge s \times \sum_{i=1}^{n} |DB_i| \;\Longleftrightarrow\; \sum_{i=1}^{n} X.sup_i \ge s \times \sum_{i=1}^{n} |DB_i| \;\Longleftrightarrow\; \sum_{i=1}^{n} \left(X.sup_i - s \cdot |DB_i|\right) \ge 0$$

Clearly, in the horizontally partitioned case, we can check whether an itemset X is globally supported by using a secure sum protocol that involves at most n values (i.e., the sum of the values $X.sup_i - s \cdot |DB_i|$ for $1 \le i \le n$, where n is the number of sites) together with a secure comparison protocol.

In the case of vertically partitioned data [31], a transaction database DB is assumed to be partitioned among n sites where $DB = DB_1 \bowtie DB_2 \bowtie \cdots \bowtie DB_n$. In other words, information about each transaction is distributed among multiple sites. In [31], it is shown that to compute whether an itemset X is globally supported, we need to compute a dot product that involves all the transactions. This means that if the original DB has m transactions, we need to run a secure dot product algorithm with vectors of size m to compute a single global support count. In practice, the total number of transactions m is much larger than the number of possible sites n. For these reasons, privately mining association rules over vertically partitioned data is much more expensive than privately mining association rules over horizontally partitioned data. A similar phenomenon emerges in other types of privacy-preserving distributed data mining algorithms. Usually, privacy-preserving algorithms running on vertically partitioned data require secure dot product executions over large vectors, whereas for horizontally partitioned data it may be possible to aggregate local information (e.g., the local support count of an itemset in association rule mining) for efficient distributed processing. This observation implies that for horizontally partitioned data, we may only need to restrict the number of sites participating in the protocol execution for efficiency purposes. For vertically partitioned data, by contrast, we may need to control both the number of sites participating in the protocol and the total size of the data.
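The horizontally partitioned check just derived is easy to state in code. The sketch below evaluates the condition $\sum_i (X.sup_i - s\cdot|DB_i|) \ge 0$ in the clear; in the actual protocol the sum would run through the secure sum sub-protocol and the final test through secure comparison, and the numbers here are made up for illustration.

```python
# In-the-clear check of global support for an itemset X, horizontal case.
local_supports = [40, 15, 30]        # X.sup_i at each of the n sites
local_db_sizes = [400, 200, 300]     # |DB_i| at each site
s = 0.08                             # global support threshold

# Each site would feed (X.sup_i - s * |DB_i|) into a secure sum; only the
# sign of the blinded total is revealed by the secure comparison.
excess = sum(sup - s * size for sup, size in zip(local_supports, local_db_sizes))
print("globally supported:", excess >= 0)   # 85 - 72 = 13 >= 0 -> True
```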
13.6 Extension to Malicious Parties
Most of the work described in the previous sections deals only with semi-honest adversaries, which are assumed to follow the prescribed protocol but may try to infer private information from the messages they receive during the protocol. Although the semi-honest model is reasonable in some cases, it is unrealistic to assume that adversaries will always follow the protocols exactly. In particular, malicious adversaries could deviate arbitrarily from their prescribed protocols. Secure protocols developed against malicious adversaries require the use of expensive techniques. Clearly, protocols that can withstand malicious adversaries provide more security. However, there is an obvious trade-off: protocols that are secure against malicious adversaries are generally more expensive than those secure only against semi-honest adversaries. In this section, we give a brief overview of how to make the commonly used sub-protocols secure against malicious adversaries. Again, our exposition is based on homomorphic encryption. First, we discuss a few additional cryptographic tools needed to devise protocols secure against malicious parties. Later, we discuss how these tools can be used to strengthen the secure dot product protocol.

Threshold Homomorphic Encryption
From the SMC literature, we know that any semi-honest protocol can be transformed into a protocol that is secure against malicious adversaries [13]. Zero-knowledge proofs are the key ingredient in such transformations. Using zero-knowledge proofs, each party can prove that it follows the prescribed protocol without revealing any information. For the sake of completeness, we describe here the zero-knowledge proofs needed to extend the homomorphic-encryption-based semi-honest sub-protocols. The implementation details of these protocols for Paillier encryption can be found in [4, 3].

Threshold decryption (two-party case): Given the common public key pk, the private key pr corresponding to pk is divided into two pieces $pr_0$ and $pr_1$. There exists an efficient, secure protocol $D_{pr_i}(E_{pk}(a))$ that outputs party i's random share $s_i$ of the decryption result, along with a non-interactive zero-knowledge proof $POD(pr_i, E_{pk}(a), s_i)$ showing that $pr_i$ was used correctly. The shares can be combined to calculate the decryption result, while any single share $pr_i$ of the private key cannot be used to decrypt the ciphertext alone. In other words, $s_i$ by itself reveals nothing about the final decryption result. We also need a special version of threshold decryption in which only one party learns the decryption result. Such a protocol can be easily implemented by exploiting the fact that, for any given $E_{pk}(a)$, the party that needs to learn the decryption result can generate $E_{pk}(r_1)$, after which both parties jointly decrypt $E_{pk}(a) +_h E_{pk}(r_1)$. Since only one party knows $r_1$, only that party can recover the correct decryption result.

Proving that you know a plaintext: A party $P_i$ can compute the zero-knowledge proof $POK(e_a)$ if it knows an element a in the domain of valid plaintexts such that $D_{pr}(e_a) = a$.

Proving that a multiplication is correct: Assume that party $P_i$ is given an encryption $E_{pk}(a)$, chooses a constant c, and calculates $E_{pk}(a \cdot c)$. $P_i$ can then give a zero-knowledge proof $POMC(e_a, e_c, e_{a \cdot c})$ showing that $D_{pr}(e_{a \cdot c}) = D_{pr}(e_c) \cdot D_{pr}(e_a)$.

Converting the Secure Dot Product Protocol from the Semi-honest Model to the Malicious Model. If we look at the dot product protocol in the semi-honest model carefully, we need to make sure that Bob performs the multiplications correctly and that all the encryptions sent to Alice are valid. This can easily be achieved using the zero-knowledge proofs described above. Alice sends the encrypted values, along with the associated proofs of correct encryption, to Bob. For each multiplication, Bob generates the zero-knowledge proof of correct multiplication and sends these proofs to Alice. Alice can later check the proofs to make sure that the dot product was calculated correctly. Such a generic transformation (i.e., using zero-knowledge proofs) can be applied to other sub-protocols as well.
In some cases, the generic transformation can be further improved in terms of efficiency by specializing the protocol for the malicious model. As an example, in [17], the authors provide a more efficient algorithm for secure dot product in the malicious model.

13.7 Limitations of the Cryptographic Techniques Used in Privacy-Preserving Distributed Data Mining
Privacy is not free. In the case of privacy-preserving distributed data mining in particular, we need to use expensive cryptographic operations, and protocols that are secure against malicious parties are even more expensive. These results indicate that we need to set the parameters used in privacy-preserving distributed data mining protocols carefully.¹ For example, if we set the support threshold for association rules too low, this may cause an explosion in the number of locally supported itemsets, which in turn requires many expensive cryptographic operations during the secure set union phase. Similarly, for building Naive Bayes models, we need to calculate the occurrence probability of each attribute value given the class attribute; therefore, using attributes with a large number of discrete values may require much higher computation times.

Although privacy-preserving distributed data mining algorithms are designed to reveal nothing other than the final result, revealing nothing at all can be overkill in some situations. For example, in the privacy-preserving association rule mining protocol, we need to run one secure summation and one secure comparison to check securely whether an itemset is globally supported. If revealing the total support count of an itemset is not a privacy threat, then we may not need to execute the secure comparison protocol. Therefore, the privacy requirements should be considered carefully before executing privacy-preserving distributed data mining protocols.

Compared to the noise addition methods used in privacy-preserving data mining, cryptographic techniques for privacy-preserving distributed data mining do not allow an easy trade-off between privacy and accuracy. For instance, in the noise addition techniques, the variance of the noise can be adjusted to increase privacy while potentially lowering result accuracy. In contrast, by adjusting the key sizes used in the cryptographic protocols, we can only trade off between privacy and efficiency. As a result, new approaches are needed in privacy-preserving distributed data mining to trade off between privacy and accuracy systematically. One way to satisfy this goal is to introduce new "approximate privacy-preserving distributed data mining" protocols that can cheaply approximate the required data mining result and allow a trade-off between the accuracy of the approximation and efficiency. We believe that the work of Feigenbaum et al. [7] can provide a good starting point in that direction.

Another limitation of current privacy-preserving data mining protocols is that each party is assumed to be either honest, semi-honest or malicious. We believe that there are many real-world scenarios where the parties participating in the protocols are "rational".

¹As discussed in Section 13.5, for vertically partitioned data we also need to carefully choose the total amount of data used for privacy-preserving data mining.
In other words, the parties are willing to share their data to achieve some gain, and they will cheat only if cheating increases their gain. Such a rational adversary assumption could substantially affect the resulting privacy-preserving distributed data mining protocols. For example, in [16], it is shown that if the participating parties are rational, significant cost reductions can be achieved in the malicious model. Clearly, further research is needed to explore the effect of rational behavior in privacy-preserving distributed data mining.

Finally, all the tools and techniques discussed up to this point do not consider the privacy impact of the data mining results themselves. In the next section, we explore this issue in more detail.

13.8 Privacy Issues Related to Data Mining Results
In the previous sections, we discussed provably secure distributed data mining protocols that reveal nothing but the resulting data mining model. This work still leaves a privacy question open: do the resulting data mining models inherently violate privacy? This question is important because the full impact of privacy-preserving data mining will only be realized when we can guarantee that the resulting models do not violate privacy as well. In this section, we give an overview of the model developed in [22], which presents a start on methods and metrics for evaluating the privacy impact of data mining models. Although the methods discussed in [22] provide results only for classification, they give a good cross-section of what needs to be done, and a demonstration of techniques for analyzing the privacy impact.

To make the privacy implications of data mining results clear, consider the following "medical diagnosis" scenario. Suppose we want to create a medical diagnosis model for public use: a classifier that predicts the likelihood of an individual getting a terminal illness. Most individuals would consider the classifier output to be sensitive – for example, when applying for life insurance. The classifier takes some public information (age, address, cause of death of ancestors), together with some private information (eating habits, lifestyle), and gives a probability that the individual will contract the disease at a young age. Since the classifier requires some information that the insurer is presumed not to know, can we state that the classifier does not violate privacy?

The answer is not as simple as it seems. Since the classifier uses some public information as input, it would appear that the insurer could improve an estimate of the disease probability by repeatedly probing the classifier with the known public information and guesses for the unknown information. At first glance, this appears to be a privacy violation. Surprisingly, given reasonable assumptions on the external knowledge available to an adversary, it can be proven that the adversary learns nothing new [22]. To analyze similar cases, the authors of [22] categorize data into three classes:

Public data (P): This data is accessible to everyone, including the adversary.

Private/sensitive data (S): It is assumed that this kind of data must be protected: the values should remain unknown to the adversary.

Unknown data (U): This is data that is not known to the adversary and is not inherently sensitive.
However, before disclosing this data to an adversary (or enabling an adversary to estimate it, such as by publishing a data mining model), we must show that it does not help the adversary discover sensitive data.

The authors then analyze the cases in which giving a classifier to an adversary could violate privacy. The most obvious way a classifier can compromise privacy is by taking public data and predicting sensitive values. However, it turns out that there are many other ways a classifier can be misused to violate privacy. In [22], the authors analyze the following cases:

1 P → S: A classifier that produces sensitive data given public data.
2 PU → S: A classifier taking public and unknown data into sensitive data.
3 PS → P: A classifier taking public and sensitive data into public data.
4 Assuming that the adversary has access to sensitive data for some individuals, what is the effect on privacy of giving the following classifiers to an adversary?
(a) P → S: Can the adversary do better with such a classifier because of his/her background knowledge?
(b) P → U: Can giving the adversary a predictor for unknown data improve its ability to build a classifier for sensitive data?

The long list of possible privacy violations due to data mining results given above indicates that we need to be very careful in revealing data mining results. Recently, in [10], the authors gave a new decision tree learning algorithm which guarantees that the data mining result does not violate the k-anonymity of the individuals represented in the training data. Although current work in this area has produced some interesting results, we believe that more research is needed to understand the privacy implications of data mining results.

13.9 Conclusion
This chapter presents a survey of efficient solutions for many privacy-preserving data mining tasks on horizontally partitioned data. We show that many privacy-preserving distributed data mining protocols on horizontally partitioned data can be efficiently implemented by securely reducing them to a few basic secure building blocks. We also give an overview of some of the initial solutions on how to use data mining results without violating privacy.

We believe that the need for mining data where access is restricted due to privacy concerns will increase. Examples include knowledge discovery among intelligence services of different countries and collaboration among corporations without revealing trade secrets. Even within a single multi-national company, privacy laws in different jurisdictions may prevent the sharing of individual data. This increasing need for privacy-preserving data mining techniques will require flexible and efficient solutions that can be tailored to individual privacy needs for different distributed data mining tasks. Current solutions do not allow users to trade off easily between efficiency, accuracy, and privacy. We believe that more flexible and more efficient solutions are needed for future wide-scale adoption of privacy-preserving data mining techniques.

References
[1] Chang, Yan-Cheng and Lu, Chi-Jen (2001). Oblivious polynomial evaluation and oblivious neural learning. Lecture Notes in Computer Science, 2248:369+.
[2] Cramer, R., Gilboa, Niv, Naor, Moni, Pinkas, Benny, and Poupard, G. (2000). Oblivious polynomial evaluation. Can be found in the Privacy Preserving Data Mining paper by Naor and Pinkas.
[3] Cramer, Ronald, Damgård, Ivan, and Nielsen, Jesper B. (2001).
Multiparty computation from threshold homomorphic encryption. Lecture Notes in Computer Science, 2045:280+.
[4] Damgård, I., Jurik, M., and Nielsen, J. (2003). A generalization of Paillier's public-key system with applications to electronic voting.
[5] Du, Wenliang and Atallah, Mikhail J. (2001). Privacy-preserving statistical analysis. In Proceedings of the 17th Annual Computer Security Applications Conference, New Orleans, Louisiana, USA.
[6] Du, Wenliang and Zhan, Zhijun (2002). Building decision tree classifier on private data. In Clifton, Chris and Estivill-Castro, Vladimir, editors, IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, volume 14, pages 1–8, Maebashi City, Japan. Australian Computer Society.
[7] Feigenbaum, Joan, Ishai, Yuval, Malkin, Tal, Nissim, Kobbi, Strauss, Martin J., and Wright, Rebecca N. (2006). Secure multiparty computation of approximations. ACM Trans. Algorithms, 2(3):435–472.
[8] Feingold, Mr., Corzine, Mr., Wyden, Mr., and Nelson, Mr. (2003). Data Mining Moratorium Act of 2003. U.S. Senate Bill (proposed).
[9] Freedman, Michael J., Nissim, Kobbi, and Pinkas, Benny (2004). Efficient private matching and set intersection. In Eurocrypt 2004, Interlaken, Switzerland. International Association for Cryptologic Research (IACR).
[10] Friedman, Arik, Wolff, Ran, and Schuster, Assaf (to appear). Providing k-anonymity in data mining. VLDB Journal.
[11] Fukunaga, Keinosuke (1990). Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA.
[12] Goethals, Bart, Laur, Sven, Lipmaa, Helger, and Mielikäinen, Taneli (2004). On secure scalar product computation for privacy-preserving data mining. In Park, Choonsik and Chee, Seongtaek, editors, The 7th Annual International Conference in Information Security and Cryptology (ICISC 2004), volume 3506, pages 104–120.
[13] Goldreich, Oded (2004). The Foundations of Cryptography, volume 2, chapter General Cryptographic Protocols. Cambridge University Press.
[14] Ioannidis, Ioannis, Grama, Ananth, and Atallah, Mikhail (2002). A secure protocol for computing dot-products in clustered and distributed environments. In The 2002 International Conference on Parallel Processing, Vancouver, British Columbia.
[15] Jagannathan, Geetha and Wright, Rebecca N. (2005). Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In Proceedings of the 2005 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 593–599, Chicago, IL.
[16] Jiang, Wei, Clifton, Chris, and Kantarcioglu, Murat (to appear). Transforming semi-honest protocols to ensure accountability. Data and Knowledge Engineering.
[17] Kantarcioglu, Murat and Kardes, Onur (2006). Privacy-preserving data mining in malicious model. Technical Report CS-2006-06, Stevens Institute of Technology.
[18] Kantarcioglu, Murat and Vaidya, Jaideep (2003). Privacy preserving naive Bayes classifier for horizontally partitioned data. In the Workshop on Privacy Preserving Data Mining held in association with The Third IEEE International Conference on Data Mining, Melbourne, FL.
[19] Kantarcıoğlu, Murat and Clifton, Chris (2002). Privacy-preserving distributed mining of association rules on horizontally partitioned data.
In The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'02), pages 24–31, Madison, Wisconsin.
[20] Kantarcıoğlu, Murat and Clifton, Chris (2004a). Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE TKDE, 16(9):1026–1037.
[21] Kantarcıoğlu, Murat and Clifton, Chris (2004b). Privately computing a distributed k-nn classifier. In Boulicaut, Jean-François, Esposito, Floriana, Giannotti, Fosca, and Pedreschi, Dino, editors, PKDD 2004: 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 279–290, Pisa, Italy.
[22] Kantarcıoğlu, Murat, Jin, Jiashun, and Clifton, Chris (2004). When do data mining results violate privacy? In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 599–604, Seattle, WA.
[23] Kissner, L. and Song, D. (2005). Privacy-preserving set operations. In Advances in Cryptology – CRYPTO 2005.
[24] Lin, Xiaodong, Clifton, Chris, and Zhu, Michael (2005). Privacy preserving clustering with distributed EM mixture modeling. Knowledge and Information Systems, 8(1):68–81.
[25] Lindell, Yehuda and Pinkas, Benny (2000). Privacy preserving data mining. In Advances in Cryptology – CRYPTO 2000, pages 36–54. Springer-Verlag.
[26] Lindell, Yehuda and Pinkas, Benny (2002). Privacy preserving data mining. Journal of Cryptology, 15(3):177–206.
[27] Mitchell, Tom (1997). Machine Learning. McGraw-Hill Science/Engineering/Math, 1st edition.
[28] Naor, Moni and Pinkas, Benny (1999). Oblivious transfer and polynomial evaluation. In Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing, pages 245–254, Atlanta, Georgia, United States. ACM Press.
[29] Paillier, P. (1999). Public key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology – Eurocrypt '99 Proceedings, LNCS 1592, pages 223–238. Springer-Verlag.
[30] Perry, John M. (2005). Statement of John M. Perry, President and CEO, CardSystems Solutions, Inc., before the United States House of Representatives Subcommittee on Oversight and Investigations of the Committee on Financial Services. http://financialservices.house.gov/hearings.asp?formmode=detail&hearing=407&comm=4.
[31] Vaidya, Jaideep and Clifton, Chris (2002). Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 639–644, Edmonton, Alberta, Canada.
[32] Vaidya, Jaideep and Clifton, Chris (2005). Secure set intersection cardinality with application to association rule mining. Journal of Computer Security, 13(4).
[33] Yao, Andrew C. (1986). How to generate and exchange secrets. In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162–167. IEEE.
[34] Yu, Hwanjo, Jiang, Xiaoqian, and Vaidya, Jaideep (2006). Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data. In SAC '06: Proceedings of the 2006 ACM Symposium on Applied Computing, pages 603–610, New York, NY, USA. ACM Press.

Chapter 14
A Survey of Privacy-Preserving Methods Across Vertically Partitioned Data

Jaideep Vaidya
MSIS Department and CIMIC, Rutgers University
jsvaidya@rbs.rutgers.edu

Abstract: The goal of data mining is to extract or "mine" knowledge from large amounts of data.
However, data is often collected by several different sites. Privacy, legal and commercial concerns restrict centralized access to this data, thus derailing data mining projects. Recently, there has been growing focus on finding solutions to this problem. Several algorithms have been proposed that perform distributed knowledge discovery while providing guarantees on the non-disclosure of data. Vertical partitioning of data is an important data distribution model often found in real life. Vertical partitioning, or heterogeneous distribution, implies that different features of the same set of data are collected by different sites. In this chapter we survey some of the methods developed in the literature to mine vertically partitioned data without violating privacy, and discuss challenges and complexities specific to vertical partitioning.

Keywords: Vertically partitioned data, privacy-preserving data mining.

14.1 Introduction
Today, the collection of data is ubiquitous. With the rapid increase in computing, storage and networking resources, data is not only collected and stored but also analyzed. Indeed, data is often anonymized and released for public use. However, this brings the problem of privacy into sharp focus. Our personal data is supposed to be private but, as several high-profile infractions have shown, this is not really the case. This creates a serious problem, since it means that data really cannot be shared without appropriate security. One possibility is to use only local data and not worry about integrating or using global data. While this would be perfect from the security standpoint, it would not be very useful. The key challenge, then, is: how can we use data without having complete access to it? While this may sound counter-intuitive, advances in cryptography show that it is possible. The challenge is to do it in an efficient manner.

Table 14.1. The Weather Dataset

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

In general, data can be distributed in an arbitrary fashion. This means that different parties may own partial information about different sets of entities. Table 14.1 shows the famous weather dataset consisting of 14 items and 5 features. Tables 14.2(a)–14.2(b) show an arbitrary partitioning of the dataset between 2 parties. While this is possible in general, in practice such arbitrary partitioning rarely happens. Two special cases of arbitrary partitioning – horizontal partitioning of data and vertical partitioning of data – are far more likely. Horizontal partitioning of data means that different sites collect the same features of information for different entities. We have already seen in [1] how privacy-preserving data mining is done over horizontally partitioned data. Vertically partitioned data means that different sites collect different features of data for the same set of entities; integrating the local datasets gives the global dataset. Tables 14.3(a) and 14.3(b) show a vertical partitioning of the dataset between 2 parties. This happens in many real-life situations.
For example, consider a medical research study which wants to compare the medical outcomes of different treatment methods for a particular disease (e.g., to answer the question "will this treatment for this patient be successful or not?"). The insurance companies must not disclose individual patient data without permission [13], and details of patient treatment plans are similarly protected data held by hospitals. Similar constraints arise in many applications; European Community legal restrictions apply to disclosure of any individual data [9].

Table 14.2. Arbitrary partitioning of data between 2 sites

(a) Site 1
outlook   temperature  humidity  windy  play
sunny     -            -         false  no
-         hot          -         true   no
overcast  hot          -         -      -
-         mild         high      false  -
rainy     -            normal    -      yes
rainy     -            -         true   -
-         -            normal    -      yes
-         mild         -         -      no
sunny     cool         -         -      -
rainy     -            -         false  -
-         -            normal    true   -
overcast  -            -         true   yes
-         hot          normal    -      yes
rainy     -            high      true   no

(b) Site 2
outlook   temperature  humidity  windy  play
-         hot          high      -      -
sunny     -            high      -      -
-         hot          -         false  yes
rainy     -            -         -      yes
-         cool         -         false  -
-         cool         normal    -      no
overcast  cool         -         true   -
sunny     -            high      false  -
-         -            normal    false  yes
-         mild         normal    -      yes
sunny     mild         -         -      yes
-         mild         high      -      -
overcast  -            -         false  -
-         mild         -         -      -

In general, with vertically partitioned data, more data significantly improves the quality of the models built from the dataset, and the data analysis results are significantly more realistic and useful. While more data also helps with horizontally partitioned data, it has a more critical impact with vertically partitioned data, because data from different parties gives significantly different additional information about the entities. For example, consider Figure 14.1, which shows points plotted in a two-dimensional space along with their projections onto the X and Y axes. Assume that the data is vertically partitioned between two parties (one having the X-coordinate of each point, the other the Y-coordinate).

Table 14.3. Vertical partitioning of data between 2 sites

(a) Site 1
outlook   temperature
sunny     hot
sunny     hot
overcast  hot
rainy     mild
rainy     cool
rainy     cool
overcast  cool
sunny     mild
sunny     cool
rainy     mild
sunny     mild
overcast  mild
overcast  hot
rainy     mild

(b) Site 2
humidity  windy  play
high      false  no
high      true   no
high      false  yes
high      false  yes
normal    false  yes
normal    true   no
normal    true   yes
high      false  no
normal    false  yes
normal    false  yes
normal    true   yes
high      true   yes
normal    false  yes
high      true   no

[Figure 14.1. Two-dimensional problem that cannot be decomposed into two one-dimensional problems]

Suppose we wanted to cluster the points. From the two-dimensional plot it is obvious that there are at least three distinct clusters, approximately centered around (2,4), (7,2.5), and (5.5,10.2). However, neither site can figure this out on its own. From the Y-axis, it looks like two clusters centered at approximately 3.8 and 10.5. From the X-axis, the two clusters would be centered around 2 and 6. In fact, it is unclear whether there should be only two clusters or several. The situation is equally bad if we want to identify outliers or anomalies. Again, looking at the two-dimensional plot, it is obvious that the points at (1.5,10) and (4.5,4) are outliers.
However, on the basis of the one-dimensional projections, neither point is identified as an outlier on either the X-axis or the Y-axis. Thus, we clearly get incorrect results with partial data, and the situation only worsens with higher-dimensional data.

The complexity of privacy-preserving data mining is significantly increased by the vertical partitioning of data. In contrast to horizontal partitioning, vertical partitioning raises several unique questions with respect to the way data is processed and the way results are obtained and shared. We now survey different types of privacy-preserving data mining algorithms, following the main data mining tasks of association rule mining, classification, clustering and outlier detection. We briefly survey the first three while going into more detail on the fourth; thus, outlier detection serves as the expository example of privacy-preserving data mining over vertically partitioned data. In each case, we also examine some of the complications specific to vertical partitioning of data and some of the inherent challenges.

14.2 Classification
Classification refers to the problem of categorizing observations into classes. Predictive modeling uses samples of data for which the class is known to generate a model for classifying new observations. One issue with classification over vertically partitioned data is whether the class attribute is shared by all of the parties or is local to only one of them. Having the class attribute known to all of the parties simplifies the problem. However, that may not always be the case. If the class attribute is known to only one party, any process that needs to count the number of entities having a particular value for an attribute and a particular class will have to be secure. This means that computing the information gain, etc., needs to be completely secure.

Another issue with classification is how the classification model is shared between the parties. One possibility is to let all of the parties know the developed model – but this may often reveal too much information. The completely secure alternative is to keep the created model completely split between the parties. However, this may have a significant impact on the classification time. Other alternatives are also possible, with differing trade-offs between security and cost. We will now see how these issues have affected the proposed solutions for classification.

14.2.1 Naïve Bayes Classification
Naïve Bayes is a simple but highly effective classifier. This combination of simplicity and effectiveness has led to its use as a baseline standard by which other classifiers are measured. Vaidya and Clifton [27] present a privacy-preserving solution for vertically partitioned data. The Naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and the target function f(x) can take on any value from some finite set C. The Bayesian approach to classifying a new instance is to assign the most probable target value, $c_{MAP}$, given the attribute values that describe the instance:

$$c_{MAP} = \operatorname*{argmax}_{c_j \in C} P(c_j \mid a_1, a_2, \dots, a_n)$$

The Naïve Bayes classifier makes the simplifying assumption that all attributes are independent. Therefore,

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \left( P(c_j) \prod_i P(a_i \mid c_j) \right) \qquad (14.1)$$

where $c_{NB}$ denotes the target value output by the Naïve Bayes classifier. The key problem, therefore, is to compute these conditional probabilities.
When considering a secure solution, an important question is the location of the class attribute. There are two possibilities: the class may be known to all parties, or it may be private to one party. This impacts the way the model is built and the way evaluation of a new instance is done. Both cases are realistic and model different situations. In the first case, each party can easily estimate all the required counts for nominal attributes, and means and variances for numeric attributes, locally, causing no privacy breaches. Prediction is also simple – each party can independently estimate the probabilities; all parties then securely multiply the probabilities and compare to obtain the predicted class. As such, we do not discuss this case further. The other case is more challenging and is discussed below.

The method in [27] is fully secure in the sense that even the model built is split between the participants. Thus, none of the participants knows the actual model parameters. The only information revealed is the class of a new instance when it is classified. The downside to this, of course, is the performance drop: a secure protocol has to be run for every classification. If this performance penalty must be avoided, the global model has to be made available to all of the parties.

The way the model parameters are computed differs somewhat for nominal and numeric attributes. For a nominal attribute, the conditional probability is given simply by the ratio of the number of instances having that attribute value and that class to the total number of instances having that class. If we encode the presence of the attribute value (respectively, the class value) in an instance as 1, and its absence as 0, we create boolean vectors such that the scalar product of the vectors gives the correct count. The scalar product protocol of [11] also randomly splits the result between the parties. Thus, the numerator and denominator of the ratio are split between the parties, and a secure division protocol must then be run to compute splits of the conditional probability. More details can be found in [27].
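The following sketch illustrates the nominal-attribute reduction just described, with the secure scalar product simulated in the clear: the attribute site encodes "instance has value $a_i$" as a 0/1 vector, the class site encodes "instance is in class $c_j$", and the shared scalar product yields the counts needed for $P(a_i \mid c_j)$. The function name, field size and vectors are illustrative assumptions, not the actual protocol of [27].

```python
# Shares of the counts behind P(a_i | c_j), via a simulated secure scalar product.
import random

FIELD = 2**31

def shared_scalar_product(u, v):
    d = sum(a * b for a, b in zip(u, v)) % FIELD
    s = random.randrange(FIELD)
    return s, (d - s) % FIELD              # random additive shares of u . v

has_value = [1, 0, 1, 1, 0, 1]   # attribute site: instance has value a_i
in_class  = [1, 1, 0, 1, 0, 1]   # class site: instance belongs to class c_j

num_a, num_b = shared_scalar_product(has_value, in_class)   # count(a_i and c_j)
den_a, den_b = shared_scalar_product(in_class, in_class)    # count(c_j)
# A secure division protocol would combine the shares; here we do it openly:
print((num_a + num_b) % FIELD / ((den_a + den_b) % FIELD))  # 3/4 = 0.75
```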
For a numeric attribute, the process is more complicated. The necessary parameters are the mean µ and variance σ² for each class. Again, the necessary information is split between all of the parties. To compute the mean, each party needs to sum the attribute values over all instances having the same class value. These local sums are added together, and the global sum is divided by the total number of instances having that class to get the mean for that class value. This can be done, once again, by carefully constructing the vectors for each class and using the secure scalar product protocol. The party owning the class attribute builds a vector with entries $1/n_j$ or 0, depending on whether the training entity is in class j or not. The mean for the class is the scalar product of this vector with the projection of the data onto the attribute, and the scalar product protocol gives shares of the mean.

Computing the variance is more complicated, as it requires summing the squares of the distances between the values and the mean – without revealing the values to the party owning the class attribute, the classes to the party owning the data attribute, or the means to either. Thus, to compute the variance $\sigma^2_y$, it is necessary to subtract the appropriate mean from each value, square the difference, and sum all such values together. Finally, the global sum needs to be divided by the global number of instances having the same class y to give the required variance $\sigma^2_y$. Homomorphic encryption is used to get the differences, a secure square computation protocol is used to get shares of the squares, and the scalar product is used as before to get the variance. To evaluate a new instance, the secure ln protocol of [20] is used to get shares of the conditional probability for each attribute; finally, a secure addition and comparison circuit is used to determine the class label of the maximal class. More details can be found in [27].

14.2.2 Bayesian Network Structure Learning
Bayesian networks relax the attribute independence assumption of the Naïve Bayes classifier, capturing situations where dependencies between attributes affect the class. A Bayesian network is a graphical model; the vertices correspond to attributes, and the edges to probabilistic relationships between the attributes (Naïve Bayes is thus a Bayesian network with no edges). The probability of a given class is similar to Equation 14.1, except that the probabilities associated with an attribute are conditional on the parents of that attribute in the network.

Wright and Yang [31] propose a privacy-preserving protocol for learning the Bayesian network structure for vertically partitioned data. This protocol is limited to two parties. The basic approach is to emulate the K2 algorithm [6], which starts with a graph with no edges, then chooses a node and greedily adds the "parent" edge to that node that most improves a score for the network, stopping when a threshold on the number of parents is reached. Since the structure of the final network is presumed to be part of the outcome (and thus not a privacy concern), the only issue is to determine which attribute most improves the score. This is similar to the decision tree induction protocol presented below; the difference is in the score function. Instead of information gain, the K2 algorithm uses:

$$f(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(d_i - 1)!}{(\alpha_{ij} + d_i - 1)!} \prod_{k=1}^{d_i} \alpha_{ijk}! \qquad (14.2)$$

(For full details, including the notation, please see [31]; our purpose here is not to give the full algorithm but to show the novel ideas with respect to privacy-preserving data mining.)

The privacy-preserving solution works by first modifying the scoring function, taking the natural log of $f(i, \pi_i)$. While this changes the output, it does not affect the order; since all that matters is determining which attribute gives the highest score, the actual value is unimportant and the resulting network is unchanged. This technique – transforming scoring functions in ways that do not alter the final result – has proven beneficial in designing other privacy-preserving data mining algorithms. Note that by pushing the logarithm into Equation 14.2, the products turn into summations. Moreover, taking a page from [21], they approximate a difficult-to-compute value (in this case, using Stirling's approximation for the factorial). Ignoring small factors in the approximation, the formula reduces to a sum of factors, where each factor is of the form ln x or x ln x (except for a final factor based on the number of possible values for each attribute, which they consider public knowledge). This now reduces to secure summation and the ln x and x ln x protocols of [21].
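The log transform of Equation 14.2 is easy to check numerically. The sketch below computes $\ln f(i,\pi_i)$ directly (using the identity $\operatorname{lgamma}(n+1) = \ln n!$), which is the quantity the secure ln x and x ln x sub-protocols approximate; the toy count table is invented for illustration.

```python
# Log-transformed K2 score: products in Eq. 14.2 become sums of ln-factorials.
from math import lgamma

def log_k2_score(counts, d):
    """counts[j][k] = alpha_ijk over the parent configurations j; d = d_i."""
    score = 0.0
    for row in counts:                      # one parent configuration j
        alpha_ij = sum(row)                 # alpha_ij = sum_k alpha_ijk
        score += lgamma(d) - lgamma(alpha_ij + d)   # ln (d-1)! - ln (a_ij+d-1)!
        score += sum(lgamma(a + 1) for a in row)    # sum_k ln alpha_ijk!
    return score

# two parent configurations, a binary attribute (d_i = 2)
print(log_k2_score([[3, 1], [0, 4]], d=2))  # about -4.605 = ln(1/100)
```

Because ln is monotone, the attribute maximizing this score is exactly the one maximizing Equation 14.2, which is why the transformation does not change the learned network.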
14.2.3 Decision Tree Classification
A solution for constructing ID3 on vertically partitioned data was proposed by Du and Zhan [8]. Their work assumes that the data is vertically partitioned between two parties. The class of the training data is assumed to be shared, but the attributes are private. Thus, most of the steps of the ID3 algorithm can be evaluated locally. The main problem is computing which site has the best attribute to split on – each can compute the gain of its own attributes without reference to the other site.

[29] propose a solution to a more general problem – constructing an ID3 decision tree when the training data is vertically partitioned between many parties (≥ 2) and the class attribute is known to only a single party. Since each party has knowledge of only some of the attributes, knowing the structure of the tree (especially, knowledge of an unknown attribute and its breakpoints for testing) constitutes a violation of the privacy of the individual parties. Ideally, to ensure zero leakage of extra information, even the structure of the tree should be hidden, with an oblivious protocol for classifying a new instance. However, the cost associated with this is typically unacceptable. A compromise is to hide the attribute tests used in the tree while still revealing the basic structure of the tree. A distributed protocol can then be run to evaluate a new instance. As with Naïve Bayes, the drawback is that all parties have to be online in order to classify any new instance.

Before going ahead, we briefly review the ID3 algorithm. ID3 is a recursive partitioning algorithm. At the start, all the training examples are at the root. Examples are then partitioned recursively based on selected attributes. ID3 is a greedy algorithm – in each case the attribute with the highest information gain is selected as the partitioning attribute. Partitioning stops either when all samples at a given node belong to the same class, when there are no remaining attributes, or when there are no samples left. In order to construct a cloaked decision tree, the parties must together figure out how to solve each of these problems in a privacy-preserving way.

Determination of the majority class for a node requires a secure protocol, since only one site knows the class. First, each site determines which of its transactions might reach that node of the tree. The intersection of these sets with the transactions in a particular class gives the number of transactions that reach that point in the tree having that particular class. Once this is done for all classes, the class site can determine the distribution and majority class, and return a (leaf) node identifier. The identifier is used to map to the distribution at classification time. The intersection process itself needs to be secure – this can be done using a protocol for securely determining the cardinality of set intersection; many protocols for doing so are known [30, 10, 2]. To formalize the whole process, the notion of a constraint set is introduced. As the tree is being built, each party i keeps track of the values of its attributes used to reach that point in the tree in a filter $Constraints_i$. Initially, this is composed of all don't-care values ('?'). However, when an attribute $A_{ij}$ at site i is used to partition, entry j in $Constraints_i$ is set to the appropriate value before recursing to build the subtree.
Now, the majority class (and the class distributions) are determined by computing, for each class, $|\bigcap_{i=1}^{k} Y_i|$, where $Y_i$ is the set of transactions at party i consistent with $Constraints_i$, and $Y_k$ additionally includes a constraint on the class value.

Determining whether all transactions have the same class can use the same distribution-count idea described above: obtain the distribution counts and then check whether all transactions at that node belong to the same class. Figuring out whether all attributes are used up, or whether all transactions are done, is easily handled using the secure sum protocol. The main challenge lies in finding the partitioning attribute – i.e., the attribute with the largest information gain. [29] show that finding the information gain of an attribute can be done simply by counting transactions. If the number of transactions reaching a node can be determined, along with the number in each class c, and the same two counts after partitioning on each possible attribute value a ∈ A, then the gain due to A can be computed. The constraint set is once again used to apply the appropriate filters to get the correct transaction counts. Finding the best attribute is then a simple matter of computing the information gain due to each attribute and selecting the best one. A naïve efficient implementation would leak the information gain due to each attribute; if even this minimal information should not be leaked, the information gain can be split between the parties, and a sequence of secure comparisons carried out to determine the best attribute. Thus, the entire ID3 tree can be built in a secure manner using these sub-protocols.

Classifying a new instance again requires a distributed protocol. Given that the structure of the tree is known, the root site first makes a decision based on its data. It then looks at the node this decision leads to and tells the site responsible for that node both the node and the instance to be classified. This continues until a leaf is reached, at which point the site that originally held the class value knows the predicted class of the new instance. While this does lead to some disclosure of information (knowing the path followed, a site can tell whether instances have the same values for data not known to that site), specific values need not be disclosed.
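A toy, in-the-clear illustration of the counting step follows. Each party filters its transactions through its local constraint set, and only the cardinality of the intersection with each class set reaches the class site; in the real protocol that cardinality would come from a secure set-intersection-cardinality sub-protocol [30, 10, 2], and the helper names and data here are purely hypothetical.

```python
# Distribution counts for one tree node from per-party constraint filters.
def satisfying(transactions, constraints):
    """IDs of transactions matching all non-'?' constraints held by one party."""
    return {tid for tid, row in transactions.items()
            if all(v == '?' or row[a] == v for a, v in constraints.items())}

site1 = {1: {'outlook': 'sunny'}, 2: {'outlook': 'rainy'},
         3: {'outlook': 'sunny'}, 4: {'outlook': 'overcast'}}
class_site = {1: 'no', 2: 'yes', 3: 'yes', 4: 'yes'}

candidates = satisfying(site1, {'outlook': 'sunny'})   # this node's filter
for c in ('yes', 'no'):
    in_class = {tid for tid, cls in class_site.items() if cls == c}
    # In the protocol only len(...) is learned, not the IDs themselves.
    print(c, len(candidates & in_class))               # yes 1 / no 1
```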
14.3 Clustering
One question with clustering is how the clusters are shared. Specifically, is only cluster membership shared, or is more information about the clusters shared, and if so, how? Based on cluster membership, each party can locally compute its share of the cluster means. However, are the complete cluster means shared with all of the parties? If so, the other parties could easily learn a lot about the other attributes.

[26] proposed the first method for clustering over vertically partitioned data – a privacy-preserving protocol to perform k-means clustering. Though all parties know the final assignment of data points to clusters, they retain only partial information about each cluster. The cluster centers $\mu_i$ are assumed to be semiprivate information, i.e., each site can learn only the components of µ that correspond to the attributes it holds. Thus, all information about a site's attributes (not just individual values) is kept private; if sharing the µ is desired, an evaluation of privacy/secrecy concerns can be performed after the values are known. The basic protocol closely follows the original k-means protocol. There are two major challenges – figuring out how to assign points to clusters in each iteration, and figuring out when to stop. Since the means at each iteration are not considered private information, figuring out when to stop is quite simple: each party can locally compute the difference between its shares of the means, and finally check whether the total difference is less than the threshold. Since all arithmetic takes place in a field, the threshold evaluation at the end is somewhat non-obvious: intervals are compared rather than the actual numbers. Further details can be found in [26]. The assignment of points to clusters in each iteration is carried out through a secure protocol utilizing three key ideas:

1 Disguise the site components of the distance with random values that cancel out when combined.
2 Compare distances so that only the comparison result is learned; no party knows the distances being compared.
3 Permute the order of the clusters so the real meaning of the comparison results is unknown.

One drawback of the Vaidya and Clifton protocol is that it is not completely secure, since intermediate results are revealed: the intermediate cluster assignment of the data points is known to every party at each iteration, even though the final result only specifies the final clusters. This compromise is required for efficiency. [15] propose a completely secure protocol for arbitrarily partitioned data. Their protocol is very similar to the Vaidya and Clifton protocol, with the added complexity of splitting the intermediate cluster centers. Thus, no information whatsoever is leaked.

14.4 Association Rule Mining
[25] first showed how secure association rule mining can be done for vertically partitioned data by extending the apriori algorithm. Vertical partitioning implies that an itemset could be split between multiple sites. Most steps of the apriori algorithm can be done locally at each of the sites. The crucial step involves finding the support count of an itemset. If the support count of an itemset can be securely computed, one can check whether the support is greater than the threshold and decide whether the itemset is frequent. Using this, association rules can easily be mined securely.

The key insight of [25] is that computing the support of an itemset is exactly the scalar product of the vectors representing the sub-itemsets held by the different parties. Thus, the entire secure association rule mining problem can be reduced to computing the scalar product of two vectors in a privacy-preserving way. [25] also proposed an algebraic method to compute the scalar product. While this method is not provably secure, it is quite efficient. A strong point of the secure association rule mining protocol is that it is not tied to any specific scalar product protocol. Indeed, a number of secure scalar product protocols have been proposed [7, 14, 11, 33, 24], of which at least two are provably secure. All of them have differing trade-offs of security, efficiency, and utility (some are limited to scalar products over boolean data); any of these could be used. [1] shows one possible secure protocol to compute the scalar product using homomorphic encryption.

While there are now several solutions using scalar product computation, one alternative solution needs to be mentioned. [30] provide an innovative alternative solution to the association rule mining problem. There are two key insights in this solution.
First, if we encode the vectors as sets (with position numbers as elements), the scalar product is the same as the size of the intersection set. For example, assume we have vectors X = (1, 0, 0, 1, 1) and Y = (0, 1, 0, 1, 0). The scalar product is $X \cdot Y = \sum_{i=1}^{5} x_i \cdot y_i = 1$. Now, the corresponding set encodings are $X_S = \{1, 4, 5\}$ and $Y_S = \{2, 4\}$. One can see that the size of the intersection set, $|X_S \cap Y_S| = 1$, is exactly the scalar product. This idea is used to compute the scalar product.

The basic idea is to use commutative encryption to encrypt all of the items in each party's set. Commutative encryption is an important tool used in many cryptographic protocols. An encryption algorithm is commutative if the order of encryption does not matter: for any two encryption keys E1 and E2, and any message m, E1(E2(m)) = E2(E1(m)). The same property applies to decryption as well – to decrypt a message encrypted by two keys, it is sufficient to decrypt with one key at a time. Each source encrypts its data set with its keys and passes the encrypted data set to the next source. This source again encrypts the received data using its own encryption keys and passes the encrypted data onward, until all sources have encrypted the data. Since the encryption is commutative, the encrypted values across the different data sets will be equal if and only if their original values are equal. Thus, the intersection of the encrypted values gives the logical AND of the vectors, and counting the size of the intersection set gives the total number of 1s (i.e., the scalar product). The encryption prevents any party from knowing the actual value of any local item. This scalar product method only works for boolean vectors, but that suffices for the association rule mining problem.
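The commutativity property is easy to demonstrate with a toy Pohlig–Hellman-style cipher, $E_k(m) = m^k \bmod p$ with $\gcd(k, p-1) = 1$, so that $E_{k_1}(E_{k_2}(m)) = m^{k_1 k_2} = E_{k_2}(E_{k_1}(m))$. The sketch below reuses the example sets $X_S$ and $Y_S$ above; the parameters are far too small for real use, and this is an illustration of the idea rather than the specific scheme used in [30].

```python
# Toy commutative encryption demonstrating the set-based scalar product.
from math import gcd
import random

P = 2**61 - 1                                  # a Mersenne prime (toy-sized)

def keygen():
    while True:
        k = random.randrange(3, P - 1)
        if gcd(k, P - 1) == 1:                 # ensures m -> m^k is invertible
            return k

def enc(k, m):
    return pow(m, k, P)

k1, k2 = keygen(), keygen()
x_enc = {enc(k1, v) for v in (1, 4, 5)}        # party 1 encrypts its positions
y_enc = {enc(k2, v) for v in (2, 4)}           # party 2 encrypts its positions
x_both = {enc(k2, c) for c in x_enc}           # each re-encrypts the other's set
y_both = {enc(k1, c) for c in y_enc}
print(len(x_both & y_both))                    # 1 = the scalar product X . Y
```

Doubly encrypted values match exactly when the underlying positions match, so the size of the intersection of the doubly encrypted sets equals the scalar product without either party seeing the other's plaintext positions.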
Most of the protocols developed assume a semi-honest model, in which the parties honestly follow the protocol but may later try to infer additional information from whatever data they receive during its execution. One consequence is that parties must not be allowed to give spurious input to the protocol: a party that can choose its input freely can probe to determine the value of a specific item at another party. For example, if a party gives the input (0, ..., 0, 1, 0, ..., 0), the result of the scalar product (1 or 0) tells the malicious party whether the other party holds the transaction corresponding to the 1. Attacks of this type can be termed probing attacks and need to be protected against. The protocol in [30] can partially protect against such attacks.

14.5 Outlier Detection

Outlier (anomaly) detection is one of the most common data mining tasks carried out in practice. Hawkins [12] defines an outlier as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Outlier detection has been used to find uncommon sequences in gene data, to find fraudulent transactions in credit card records, to discover fraud in mobile phone usage, and to find intrusions in network traffic data [3, 19]. Indeed, even the search for terrorism involves outlier detection: detecting previously unknown suspicious behavior is a clear outlier detection problem. Many of these applications also have privacy concerns, and organizations must be careful to avoid overstepping the bounds of privacy legislation [9].

So what does it mean to protect privacy in this context? By definition, outlier detection means finding outliers, so the output is a list of detected outliers. This is highly specific information: anomalous entities or transactions are highlighted, and no summarization is carried out. Implicitly, then, no information about a true outlier need be protected or concealed; however, no information about the other entities should be revealed, and the process of finding outliers itself should not reveal any extra information. Privacy-preserving outlier detection balances these concerns, allowing the benefits of outlier detection without legal or privacy worries. But what about false positives, i.e., entities identified as outliers without really being outliers? While this seems problematic, a couple of caveats apply. First, no detection technique is fool-proof and false positives always exist; the privacy-preserving version merely reduces the privacy leakage. Second, technical solutions exist: all identifiers can be eliminated to begin with, the detected outliers hand-examined, and, if sufficient cause exists, the anonymization removed and the real identity revealed (just as occurs in real life with a court order).

While there are numerous definitions of outliers as well as techniques to find them, the first privacy-preserving outlier detection technique developed was for distance-based outliers. The method developed by Vaidya and Clifton [28] finds distance-based outliers without any party gaining knowledge beyond which items are outliers. Ensuring that the data itself is not disclosed maintains privacy: no privacy is lost beyond that inherently revealed by knowing the outliers, which is the absolute minimum information that must be revealed for privacy-preserving outlier detection over vertically partitioned data.

Before going into specifics, we briefly review the notion of distance-based outliers. Knorr and Ng [17] define a distance-based outlier as follows: an object O in a dataset T is a DB(p, dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O. Other distance-based outlier techniques also exist [18, 22].
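To make the definition concrete, the following non-private baseline computes exactly the predicate that the secure protocol of [28] evaluates without revealing distances; the data and parameters are made up for illustration.

from math import dist   # Euclidean distance (Python 3.8+)

def db_outliers(points, p, dt):
    """Return the indices of the DB(p, dt)-outliers of `points`
    (Knorr-Ng definition): objects for which at least fraction p of
    the other objects lie at distance greater than dt."""
    outliers = []
    for i, o in enumerate(points):
        others = [x for j, x in enumerate(points) if j != i]
        far = sum(1 for x in others if dist(o, x) > dt)
        if far >= p * len(others):
            outliers.append(i)
    return outliers

# A toy dataset: a tight cluster plus one distant point.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(db_outliers(pts, p=0.9, dt=3.0))   # -> [4]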
The advantages of distance-based outliers are that no explicit distribution needs to be defined to determine unusualness, and that the approach applies to any feature space for which a distance measure can be defined. Euclidean distance is the standard, although the algorithms easily extend to general Minkowski distances. There are other, non-distance-based techniques for finding outliers, as well as significant work in statistics [4], but there is little work on finding such outliers in a privacy-preserving fashion; this is a rich area for future work.

For Euclidean distance over vertically partitioned data, the distance threshold dt is fixed by the local parties deciding on local thresholds dt_i (i.e., dt = Σ_{i=1}^{k} dt_i), since no site globally knows all of the attributes. An object X is an outlier if at least p% of the other objects lie at a distance greater than dt. The approach of [28] duplicates the results of the outlier detection algorithm of [17]: an object O is an outlier if more than a fraction p of the objects in the data set are farther than distance dt from O. The basic idea is that the parties compute the portion of the answer they know, then engage in a secure sum to compute the total distance. The key is that this total is (randomly) split between sites, so nobody knows the actual distance. A secure protocol is then used to determine whether the actual distance between two points exceeds the threshold; again the comparison results are randomly split such that summing the splits (over a closed field) yields 1 if the distance exceeds the threshold and 0 otherwise. For a given object O, each site sums all of its shares of comparison results (again over the closed field); when added to the sums of shares from the other sites, the result is the correct count, and all that remains is to compare it with the percentage threshold p. This addition/comparison is also done with a secure protocol, revealing only the final result: whether O is an outlier. The pairwise comparison of all points may seem excessive, but early termination could disclose information about the relative positions of points; the asymptotic complexity still equals that of [17].

Note that a secure solution requires that all operations be carried out modulo some field. In the algorithms, the field D is used for distances, and F is used for counts of the number of entities. The field F must be more than twice the number of objects; limits on D are based on maximum distances, and details on the sizes are given with the algorithm.

We now present the actual algorithm, followed by the complete proof of security. This is especially instructive for readers wishing to develop their own algorithms, since the proof of security is a significantly important component of trust in the overall solution. A discussion of the computational and communication complexity of the algorithm rounds off this section and affords the opportunity to discuss avenues for future work in this area.

14.5.1 Algorithm

For each object i, the protocol iterates over every other object j. Since each party owns some of the attributes, each party can compute the distance between two objects over those attributes. Thus, each party can compute a share of the pairwise distance locally; the sum of these shares is the total distance.
However, revealing the distance itself still reveals too much information, so a secure protocol is used to obtain shares of the pairwise comparison of distance and threshold. The key to this protocol is that the 1 or 0 is actually two shares, r_q and r_s, returned to the two parties, such that r_q + r_s = 1 (or 0) (mod F). Looking at only one share, neither party can learn anything. Once all points have been compared, the parties individually sum their shares. Since the shares add to 1 for distances exceeding the threshold and to 0 otherwise, the total sum (mod F) gives the number of points whose distance exceeds the threshold. Explicitly computing this sum would still reveal the actual number of distant points, so the parties do not compute it; instead, all parties pass their (random) shares to a designee, and the designated party and the party holding the point engage in a secure protocol that reveals only whether the sum of the shares exceeds p%. Thus, the only output of the protocol is whether the point is an outlier.

An interesting side effect of this algorithm is that the parties need not reveal any information about the attributes they hold, or even the number of attributes. Each party locally determines the distance threshold for its attributes (more precisely, its share of the overall threshold). Instead of computing the local pairwise distance, each party computes the difference between the local pairwise distance and the local threshold; if the sum of these differences is greater than 0, the pairwise distance exceeds the threshold.

Protocol 2 gives the full details. In steps 6-10, the sites sum their local distances (actually the differences between the local distances and the local thresholds). The random x added by P1 masks the distance from each party. In steps 11-13, parties P1 and Pk obtain shares of the pairwise comparison result; the comparison tests whether the sum is greater than 0 (since the threshold has already been subtracted). These two parties keep a running sum of their shares. At the end, in step 15, the shares are added and compared with the percentage threshold. At several stages, the algorithm requires a protocol that securely compares the sum of two numbers, with the output split between the parties holding those numbers; this can be accomplished using the generic circuit evaluation technique first proposed by Yao [32].

Protocol 2 Finding DB(p, dt)-outliers
Require: k parties P1, ..., Pk, each holding a subset of the attributes for all objects O.
Require: dt_r: local distance threshold for P_r (the local thresholds sum to the global threshold dt).
Require: a field D larger than twice the maximum distance value (for Euclidean distance this is actually the squared distance), and a field F larger than twice |O|.
1: for all objects o_i ∈ O do
2:   m1 ← mk ← 0 (mod F)
3:   for all objects o_j ∈ O, o_j ≠ o_i do
4:     P1: randomly choose a number x from a uniform distribution over the field D
5:     P1: x̄ ← x {P1 retains the random mask for the comparison in step 11}
6:     for r ← 1, ..., k − 1 do
7:       at P_r: x ← x + Distance_r(o_i, o_j) − dt_r (mod D) {Distance_r is the local distance at P_r}
8:       P_r sends x to P_{r+1}
9:     end for
10:    at P_k: x ← x + Distance_k(o_i, o_j) − dt_k (mod D)
11:    P1 and P_k engage in the secure comparison protocol to obtain shares m′1 and m′k such that m′1 + m′k = 1 (mod F) if the masked sum encodes a value greater than 0 (the pairwise distance exceeds dt), and m′1 + m′k = 0 (mod F) otherwise
12:    P1: m1 ← m1 + m′1 (mod F)
13:    P_k: mk ← mk + m′k (mod F)
14:  end for
15:  P1 and P_k engage in a secure protocol such that if m1 + mk (mod F) > |O| · p% then temp1 + tempk ← 1 (o_i is an outlier), otherwise temp1 + tempk ← 0
16:  P1 and P_k send temp1 and tempk to the party authorized to learn the result; if temp1 + tempk = 1 then o_i is an outlier
17: end for
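The masking in steps 6-10 can be illustrated with a small in-the-clear simulation (arithmetic only, not a network protocol); the field size, party count, and local values below are arbitrary.

import random

D = 10**6                 # public field for (squared) distances; toy size
k = 3                     # number of parties

# Local (distance - local threshold) contributions for one pair (oi, oj);
# made-up values: party r holds some attributes of both objects.
local_diff = [17, -4, 30]     # sums to 43 > 0, so the distance exceeds dt

# Steps 4-5: P1 draws a random mask x and remembers it.
mask = random.randrange(D)
x = mask

# Steps 6-10: each party adds its contribution mod D and forwards x.
# Every intermediate value a party sees is uniform over D in isolation.
for r in range(k):
    x = (x + local_diff[r]) % D

# Step 11 (simulated in the clear): the secure comparison effectively tests
# whether (x - mask) mod D encodes a positive value, i.e. the pair is "far".
total = (x - mask) % D
exceeds = 0 < total <= D // 2     # with D > 2*max distance, positive values
print(exceeds)                    # lie in (0, D/2]; prints True here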
14.5.2 Security Analysis

The protocol described above can be proven secure using the proof techniques of Secure Multiparty Computation. The idea is that since what a party sees during the protocol (its shares) is randomly chosen from a uniform distribution over a field, it learns nothing in isolation. (Collusion with other parties could of course reveal information, since the joint distribution of the shares is not random.) The proof is based on a simulation argument: if we can define a simulator that uses the algorithm output and a party's own data to simulate the messages that party sees during a real execution of the protocol, then the real execution gives away no new information (as long as the simulator runs in polynomial time). Since all parties know the number (and identity) of the objects in O, they can set up the loops; the simulator simply runs the algorithm to generate most of the simulation. The only communication occurs at lines 8, 11, 15, and 16.

Step 8: Each party P_s sees x = x̄ + Σ_{r=1}^{s−1} (Distance_r(o_i, o_j) − dt_r), where x̄ is the random value chosen by P1. Then Pr(x = y) = Pr(x̄ + Σ_{r=1}^{s−1} (Distance_r(o_i, o_j) − dt_r) = y) = Pr(x̄ = y − Σ_{r=1}^{s−1} (Distance_r(o_i, o_j) − dt_r)) = 1/|D|. Thus the value received can be simulated by choosing a random value from a uniform distribution over D.

Steps 11 and 15: Each step is a secure comparison. Assuming the comparison protocol is secure, the messages in these steps are easily simulated.

Step 16: This is the final result and is easily simulated: temp1 is simulated by choosing a random value, and tempk = result − temp1. By the same argument on random shares used above, the distribution of simulated values is indistinguishable from the distribution of the real shares.

The simulator clearly runs in polynomial time (the same as the algorithm). Since each party can simulate the view of its execution (i.e., the probability of any particular value is the same as in a real execution with the same inputs and results) in polynomial time, the algorithm is secure with respect to the semi-honest SMC definitions. Without collusion, and assuming a malicious-model secure comparison, a malicious party cannot learn anything it could not learn by altering its input. Step 8 is particularly sensitive to collusion, but can be strengthened (at a cost) by splitting the sum into shares and performing several such sums (see [16] for more discussion of collusion-resistant secure sum).

14.5.3 Computation and Communication Analysis

In general we do not discuss the computational and communication complexity of the algorithms in detail; in this case, however, the algorithmic complexity raises interesting issues vis-a-vis security, so we discuss it below.

Algorithm 2 suffers the drawback of quadratic computation complexity due to the nested iteration over all objects, and for the same reason requires O(n²) secure comparisons (step 11), where n is the total number of objects. While operation parallelism can reduce the round complexity of communication, the key practical issue is the computational cost of the encryption required for the secure comparison and scalar product protocols. This quadratic complexity is troubling, since the major focus of new outlier detection algorithms has been to reduce complexity precisely because n² is assumed to be inordinately large.
However, achieving lower than quadratic complexity is challenging, at least with the basic algorithm. Failing to compare all pairs of points is likely to reveal information about the relative distances of the points that are compared. Developing protocols for which such revelation can be proven not to disclose anything beyond what is revealed by knowing the outliers is a challenge; otherwise, completely novel techniques must be developed that do not require pairwise comparison at all. When there are three or more parties, assuming no collusion, much more efficient solutions that reveal some information can be developed: a much more efficient secure comparison can be used [5] that still reveals nothing to the third party. While not completely secure, the privacy-versus-cost tradeoff may be acceptable in some situations. An alternative (and another direction for future work) is demonstrating lower bounds on the complexity of fully secure outlier detection. Significant work is required to make any of this happen, which opens a rich area for future work.

[23] use very similar techniques to perform privacy-preserving nearest neighbor search. They further show how this can be used to perform privacy-preserving LOF outlier detection, SNN clustering, and kNN classification.

14.6 Challenges and Research Directions

This chapter presented a survey of efficient solutions for many privacy-preserving data mining tasks on vertically partitioned data. As with horizontally partitioned data, many privacy-preserving algorithms for vertically partitioned data can be implemented efficiently by combining specific basic secure building blocks. Inherently, however, the main challenge for techniques dealing with vertically partitioned data is efficiency: unlike horizontally partitioned data, it is very difficult to carry out much local aggregation beforehand. For example, in many of the protocols above, the secure scalar product is a critical component. Using the protocol of [11], a single scalar product of two vectors of length n requires n encryptions, n modular exponentiations, n modular multiplications, and 1 decryption; the cost of the encryptions and exponentiations dominates. At current encryption and exponentiation speeds, a single scalar product still takes a significant amount of time: the scalar product of two vectors of length 1000 takes approximately 40s with 512-bit encryption and 270s with 1024-bit encryption. Since data mining is typically done over millions of transactions, this cost balloons significantly, so clearly more efficient protocols are needed. Indeed, very few of the protocols have actually been implemented; this must change to ensure deployment of these algorithms in real life. The other technical challenge lies with the adversarial model of the protocols. Almost all of the protocols above assume semi-honest participants, i.e., participants that follow the protocol exactly but may later try to find additional information. While this is a good starting model, we eventually need protocols that work in the presence of malicious adversaries. Overall, we believe that the trend towards usage of privacy-preserving algorithms is on the rise.
Due to increasing privacy and security concerns, as well as the need to leverage commercial assets, there is a clear need for flexible and efficient privacy-preserving solutions that can be tailored to individual privacy needs. The development of such flexible and efficient solutions will be instrumental in the wide-scale adoption of this technology.

References

[1] Murat Kantarcioglu. A survey of privacy-preserving methods across horizontally partitioned data. In Charu C. Aggarwal and Philip S. Yu, editors, Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.
[2] Rakesh Agrawal, Alexandre Evfimievski, and Ramakrishnan Srikant. Information sharing across private databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, California, June 9-12 2003.
[3] Daniel Barbará, Ningning Wu, and Sushil Jajodia. Detecting novel network intrusions using bayes estimators. In First SIAM International Conference on Data Mining, Chicago, Illinois, April 5-7 2001.
[4] Vic Barnett and Toby Lewis. Outliers in Statistical Data. John Wiley and Sons, 3rd edition, 1994.
[5] Christian Cachin. Efficient private bidding and auctions with an oblivious third party. In Proceedings of the 6th ACM Conference on Computer and Communications Security, pages 120-127. ACM Press, 1999.
[6] Gregory F. Cooper and Edward Herskovits. A bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309-347, 1992.
[7] Wenliang Du and Mikhail J. Atallah. Privacy-preserving statistical analysis. In Proceedings of the 17th Annual Computer Security Applications Conference, New Orleans, Louisiana, USA, December 10-14 2001.
[8] Wenliang Du and Zhijun Zhan. Building decision tree classifier on private data. In Chris Clifton and Vladimir Estivill-Castro, editors, IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, volume 14, pages 1-8, Maebashi City, Japan, December 9 2002. Australian Computer Society.
[9] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal of the European Communities, No I.(281):31-50, October 24 1995.
[10] Michael J. Freedman, Kobbi Nissim, and Benny Pinkas. Efficient private matching and set intersection. In Eurocrypt 2004, Interlaken, Switzerland, May 2-6 2004. International Association for Cryptologic Research (IACR).
[11] Bart Goethals, Sven Laur, Helger Lipmaa, and Taneli Mielikäinen. On secure scalar product computation for privacy-preserving data mining. In Choonsik Park and Seongtaek Chee, editors, The 7th Annual International Conference in Information Security and Cryptology (ICISC 2004), volume 3506, pages 104-120, December 2-3 2004.
[12] D. M. Hawkins. Identification of Outliers. Chapman and Hall, 1st edition, 1980.
[13] Standard for privacy of individually identifiable health information. Federal Register, 66(40), February 28 2001.
[14] Ioannis Ioannidis, Ananth Grama, and Mikhail Atallah. A secure protocol for computing dot-products in clustered and distributed environments. In The 2002 International Conference on Parallel Processing, Vancouver, British Columbia, August 18-21 2002.
[15] Geetha Jagannathan and Rebecca N. Wright.
Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In Proceedings of the 2005 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 593-599, Chicago, IL, August 21-24 2005.
[16] Murat Kantarcıoğlu and Chris Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data Engineering, 16(9):1026-1037, September 2004.
[17] Edwin M. Knorr and Raymond T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB 1998), pages 392-403, New York City, NY, USA, August 24-27 1998.
[18] Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. Distance-based outliers: algorithms and applications. The VLDB Journal, 8(3-4):237-253, 2000.
[19] Aleksandar Lazarevic, Aysel Ozgur, Levent Ertoz, Jaideep Srivastava, and Vipin Kumar. A comparative study of anomaly detection schemes in network intrusion detection. In SIAM International Conference on Data Mining (2003), San Francisco, California, May 1-3 2003.
[20] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In Advances in Cryptology - CRYPTO 2000, pages 36-54. Springer-Verlag, August 20-24 2000.
[21] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3):177-206, 2002.
[22] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438. ACM Press, 2000.
[23] Mark Shaneck, Yongdae Kim, and Vipin Kumar. Privacy preserving nearest neighbor search. In ICDM Workshops, pages 541-545. IEEE Computer Society, 2006.
[24] Dragos Trinca and Sanguthevar Rajasekaran. Towards a collusion-resistant algebraic multi-party protocol for privacy-preserving association rule mining in vertically partitioned data. In 3rd International Workshop on Information Assurance, April 11-13 2007.
[25] Jaideep Vaidya and Chris Clifton. Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 639-644, Edmonton, Alberta, Canada, July 23-26 2002.
[26] Jaideep Vaidya and Chris Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 206-215, Washington, DC, August 24-27 2003.
[27] Jaideep Vaidya and Chris Clifton. Privacy preserving naïve bayes classifier for vertically partitioned data. In 2004 SIAM International Conference on Data Mining, pages 522-526, Lake Buena Vista, Florida, April 22-24 2004.
[28] Jaideep Vaidya and Chris Clifton. Privacy-preserving outlier detection. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), pages 233-240, Los Alamitos, CA, November 1-4 2004. IEEE Computer Society Press.
[29] Jaideep Vaidya and Chris Clifton. Privacy-preserving decision trees over vertically partitioned data. In The 19th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, Storrs, Connecticut, August 7-10 2005. Springer.
[30] Jaideep Vaidya and Chris Clifton. Secure set intersection cardinality with application to association rule mining. Journal of Computer Security, 13(4):593-622, November 2005.
[31] Rebecca Wright and Zhiqiang Yang. Privacy-preserving bayesian network structure computation on distributed heterogeneous data. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, August 22-25 2004.
[32] Andrew C. Yao. How to generate and exchange secrets. In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162-167. IEEE, 1986.
[33] Sheng Zhong. Privacy-preserving algorithms for distributed mining of frequent itemsets. Information Sciences, 177(2):490-503, 2007.

Chapter 15

A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods

Kun Liu1, Chris Giannella2, and Hillol Kargupta3
1IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, kun@us.ibm.com
2Department of Computer Science, Loyola College in Maryland, 4501 N. Charles Street, Baltimore, MD 21210, cgiannel@acm.org
3Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250; also affiliated with AGNIK, LLC, 8840 Stanford Blvd. Suite 1300, Columbia, MD 21045, hillol@cs.umbc.edu

Abstract
We focus primarily on the use of additive and matrix multiplicative data perturbation techniques in privacy preserving data mining (PPDM). We survey a recent body of research aimed at better understanding the vulnerabilities of these techniques. These researchers assumed the role of an attacker and developed methods for estimating the original data from the perturbed data and any available prior knowledge. Finally, we briefly discuss research aimed at attacking k-anonymization, another data perturbation technique in PPDM.

Keywords: Data perturbation, additive noise, matrix multiplicative noise, attack techniques, k-anonymity.

15.1 Introduction

Data perturbation represents one common approach in privacy preserving data mining (PPDM). It builds on a longer history in the areas of statistical disclosure control and statistical databases [1], where the original (private) dataset is perturbed and the result is released for data analysis. Typically, a "privacy/accuracy" trade-off is faced: on the one hand, the perturbation must not allow the original data records to be adequately recovered; on the other, it must allow "patterns" in the original data to be mined. Data perturbation includes a wide variety of techniques, including (but not limited to): additive noise, multiplicative noise [24], matrix multiplicative noise, k-anonymization [38, 41], micro-aggregation [3, 26], categorical data perturbation [10, 45], data swapping [11], resampling [27], and data shuffling [34] (see [1, 28] for a more complete survey).

In this chapter we mostly focus on two types of data perturbation that apply to continuous data: additive and matrix multiplicative. Additive data perturbation was originally introduced in statistical disclosure control more than twenty years ago and was further studied in the PPDM community over the last eight years. Matrix multiplicative data perturbation was introduced in the PPDM community only five years ago and is in the early stages of study. In order to better understand the privacy offered by these techniques, some PPDM researchers have assumed the role of an attacker and developed techniques for breaching privacy by estimating the original data from the perturbed data and any available additional prior knowledge. Their work offers insight into the vulnerabilities of this type of data perturbation.
We provide a detailed survey of their work in an effort to allow the reader to observe common themes and future directions. Moreover, because its study is growing rapidly, we also provide a brief overview of attacks on k-anonymization.

This chapter is organized as follows. Section 15.2 describes the definitions and notation used throughout. Section 15.3 discusses additive data perturbation, its uses, and several attack techniques in detail. Section 15.4 does the same for matrix multiplicative data perturbation. Section 15.5 discusses k-anonymization and recent literature addressing vulnerabilities of this data perturbation model. Finally, Section 15.6 concludes the chapter with a summary.

15.2 Definitions and Notation

Throughout this chapter, the original dataset is represented as an n × m, real-valued matrix X, with each column a data record. The data owner perturbs X to produce an n × m data matrix Y, which is then released to the public or to another party for analysis. The attacker uses Y and any other available information to produce an estimate of X, denoted X̂. Unless otherwise stated, we assume that each record of the original dataset arose as an independent sample from an n-dimensional random vector X with unknown probability density function (p.d.f.), and that this assumption is public knowledge. Let ΣX denote the covariance matrix of X. We also assume that ΣX has all distinct and non-zero eigenvalues (more details later) since, as argued in [20, pg. 27], this assumption holds in most practical situations. Unless otherwise stated, all vectors are column vectors. Given a matrix A, Aᵀ denotes its transpose and A⁻¹ denotes its inverse (provided one exists). I denotes the identity matrix, with dimensions specified by context. Given a vector x, ||x|| denotes the Euclidean distance of x to the origin, i.e., the Euclidean norm.

15.3 Attacking Additive Data Perturbation

The data owner replaces the original dataset X with

Y = X + R, (15.1)

where R is a noise matrix with each column generated independently from an n-dimensional random vector R with mean vector zero. As is commonly done, we assume throughout that ΣR equals σ²I, i.e., the entries of R were generated independently from some distribution with mean zero and variance σ² (typical choices for this distribution include Gaussian and uniform). In this case, R is sometimes referred to as additive white noise. While having a long history in the statistical disclosure control and statistical database fields (see [6] for a comprehensive survey), additive data perturbation was first revisited to address PPDM problems by Agrawal and Srikant [5]. They assumed the p.d.f. of R is public, developed a technique for estimating the p.d.f. of X from Y, and showed how a decision tree classifier can then be constructed. Their distribution recovery technique is further developed in [4, 9].
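In matrix terms, the perturbation of (15.1) is a single line of code; the following sketch fixes the dimensions and σ arbitrarily for illustration.

import numpy as np

rng = np.random.default_rng(0)

n, m = 3, 1000            # 3 attributes, 1000 records (columns)
sigma = 0.5               # noise standard deviation; public in the attacks below

# Correlated original data X (each column is one record).
A = rng.standard_normal((n, n))
X = A @ rng.standard_normal((n, m))

# Additive perturbation, equation (15.1): Y = X + R with R ~ (0, sigma^2 I).
R = sigma * rng.standard_normal((n, m))
Y = X + R

print(np.cov(Y).round(2))  # approximately Cov(X) + sigma^2 I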
We describe five different attack techniques against additive perturbation. The first three filter off the random noise by analyzing the eigenstructure of the data: spectral filtering [22], singular value decomposition (SVD) filtering [17], and principal component analysis (PCA) filtering [18]; all use eigen-analysis to filter the noise from the protected data. The fourth attack is a Bayes approach based on maximum a posteriori probability (MAP) estimation [18]. The fifth attack shows that if the p.d.f. of X is reconstructed, in some cases it can lead to disclosure; we refer to this attack as distribution analysis. Note that in all five we assume the attacker knows the p.d.f. of R and implicitly knows that the perturbed data records arose as independent samples from the random vector Y = X + R. Next, we describe each of these attacks in detail.

15.3.1 Eigen-Analysis and PCA Preliminaries

Before describing the eigen-analysis based attacks, we first provide a brief background on eigen-analysis and PCA. Let X be an n-dimensional random vector. Generally speaking, the eigenvalues of the covariance ΣX are the n roots (possibly including repeats) of the degree-n polynomial |ΣX − λI|, where |.| denotes the matrix determinant. Since ΣX is positive semi-definite, all its eigenvalues are non-negative and real [13, pg. 295]. If we assume that they are also all distinct and non-zero, they can be denoted λ^1_X > ... > λ^n_X > 0. Associated with λ^j_X is its normalized eigenspace, V^j_X = {v ∈ Rⁿ : ΣX v = λ^j_X v and ||v|| = 1}. These normalized eigenspaces are pairwise orthogonal and have dimension one [13, pg. 295]; hence each can be written as {v^j_X, −v^j_X}, where v^j_X is lexicographically larger than −v^j_X. Let V_X denote the normalized eigenvector matrix [v^1_X ··· v^n_X] (which is orthogonal).

As is standard practice in PCA, we assume that X has mean vector zero (if not, it is replaced by X − E[X]). The jth principal component (PC) of X is (v^j_X)ᵀX (or (−v^j_X)ᵀX). It can be shown that the PCs are pairwise uncorrelated and capture the maximum possible variance, in the following sense: for each 1 ≤ j ≤ n, among all unit vectors v orthogonal to the first j − 1 eigenvectors, the variance Var(vᵀX) is maximized by v = v^j_X. It can further be shown that Var((v^j_X)ᵀX) = λ^j_X. Therefore, the dimensionality of X can be reduced by choosing 1 ≤ k ≤ n and transforming X to X̃ = Ṽ_XᵀX, where Ṽ_X denotes the leftmost k columns of V_X. The amount of "information" preserved is typically quantified by 100·(Σ_{j=1}^{k} λ^j_X)/(Σ_{j=1}^{n} λ^j_X), commonly referred to as the percentage of variance captured by X̃. If this percentage is large, most of the information is preserved in the sense that Ṽ_X X̃ is a good approximation to X; indeed, if the percentage is 100, i.e., k = n, then Ṽ_X X̃ = Ṽ_X Ṽ_Xᵀ X = X. The properties of left multiplication of X by Ṽ_X Ṽ_Xᵀ have special significance in the eigen-analysis based attacks; we call this transformation a projection through the first k PCs.

In practice, one has a collection of data tuples on which dimensionality reduction via PCA is desired. If the tuples can all be regarded as independent samples from X, PCA can be fruitfully carried out on their standard sample covariance matrix (after subtracting from each tuple the row-mean vector of the dataset). The eigen-analysis based attacks make critical use of the projection of the dataset through its first k PCs.
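The preliminaries translate directly into code: eigendecompose the sample covariance, keep the top k eigenvectors, and project. A short sketch on synthetic data (all sizes illustrative):

import numpy as np

rng = np.random.default_rng(1)
n, m, k = 3, 1000, 2
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, m))
X = X - X.mean(axis=1, keepdims=True)        # center each attribute

cov = np.cov(X)
vals, vecs = np.linalg.eigh(cov)             # ascending eigenvalues
order = np.argsort(vals)[::-1]               # sort descending
vals, vecs = vals[order], vecs[:, order]

V_k = vecs[:, :k]                            # leftmost k eigenvectors
X_proj = V_k @ (V_k.T @ X)                   # projection through first k PCs

pct_variance = 100 * vals[:k].sum() / vals.sum()
print(round(pct_variance, 1))                # percentage of variance captured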
15.3.2 Spectral Filtering

This technique, developed by Kargupta et al. [22], utilizes the fact that the eigenvalues of a random matrix are distributed in a fairly predictable manner. For example, Wigner's semi-circle law [47] says that if A is a p × p matrix whose entries were generated independently from a distribution with zero mean and unit variance, then, for large p, the distribution of the eigenvalues of (A + Aᵀ)/(2√(2p)) has the p.d.f. depicted in Figure 15.1; it takes the shape of a semi-circle.

[Figure 15.1. Wigner's semi-circle law: a histogram of the eigenvalues of (A + Aᵀ)/(2√(2p)) for a large, randomly generated A.]

As another example, consider an n × m matrix R whose entries were generated independently from a distribution with mean zero and variance σ². For large m and n, the distribution of the eigenvalues of the sample covariance matrix of R is similar to the semi-circle law, and, key to the spectral filtering technique, this result allows bounds on these eigenvalues to be computed.

Kargupta et al. observe that if the jth eigenvalue arising from Y is "large", it is a good approximation to the jth eigenvalue arising from X. Therefore, the projection of Y through its PCs corresponding to these large eigenvalues (say the first k) is a good approximation to the projection of X through its first k PCs. As such, X̂ is set to the projection of Y through its first k PCs. Results from matrix perturbation theory and spectral analysis of large random matrices provide the basis for this observation.

Lemma 15.1 [40, Corollary 4.9] For any n-dimensional random vectors X and R (R with mean vector zero) and Y = X + R, it is the case that: for 1 ≤ j ≤ n, λ^j_Y ∈ [λ^j_X + λ^n_R, λ^j_X + λ^1_R].

Therefore, if λ^j_Y ∈ [λ^n_R, λ^1_R], this eigenvalue is largely affected by the noise R; hence it is not regarded by Kargupta et al. as large and, therefore, not regarded as a good approximation of λ^j_X. On the other hand, λ^j_Y > λ^1_R is regarded as large and, therefore, as a good approximation of λ^j_X.

So how can the attacker use this threshold criterion given only Y? Let Σ̂Y and Σ̂R be the standard sample covariance matrices computed from Y and R, and let λ̂^1_Y ≥ ... ≥ λ̂^n_Y and λ̂^1_R ≥ ... ≥ λ̂^n_R be the associated eigenvalues, respectively. The above criterion can be modified to consider λ̂^j_Y > λ̂^1_R as large. But how should the attacker estimate an upper bound on λ̂^1_R? This question is answered using a result from large random matrix theory alluded to in the opening paragraph of this subsection. Intuitively, as R grows large, the eigenvalues computed from R can be bounded by the attacker, and when m is large relative to n, these bounds are quite good. Formally stated [21, 39], as m, n → ∞ and m/n → Q ≥ 1,

λ̂max_R = σ²(1 + 1/√Q)² ≥ λ̂^1_R ≥ λ̂^n_R ≥ λ̂min_R = σ²(1 − 1/√Q)².

As such, λ̂max_R serves as the estimate of an upper bound on λ̂^1_R. Moreover, for Q large relative to σ², this bound is quite good, as all eigenvalues of Σ̂R are concentrated in a small band. Since the attacker is assumed to know σ², she can compute λ̂max_R and will deem any λ̂^j_Y > λ̂max_R as large.

The spectral filtering algorithm is given as Protocol 3.

Protocol 3 Spectral Filtering
Require: Y, the perturbed data matrix, and σ², the variance of the random noise.
Ensure: X̂, an estimate of the original data matrix X.
1: Compute the sample mean of Y and subtract it from every column of Y.
2: Compute the standard sample covariance Σ̂Y of Y, its eigenvalues λ̂^1_Y ≥ ... ≥ λ̂^n_Y, and their associated normalized eigenvectors v̂^1_Y, ..., v̂^n_Y.
3: Compute k = max{1 ≤ j ≤ n | λ̂^j_Y > λ̂max_R}. Let Ṽ_Y denote the matrix [v̂^1_Y ··· v̂^k_Y].
4: Set X̂ to Ṽ_Y Ṽ_Yᵀ Y.

The empirical results show that when the variance of the noise is low and the original data does not contain many inherently random components, the recovered data can be reasonably close to the original data. However, two important questions remain to be answered. 1) What are the theoretical bounds on the estimation accuracy? 2) What are the fundamental factors that determine the quality of the data estimation? The first is touched on in Section 15.3.3 and the second in Section 15.3.4.
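A compact rendering of Protocol 3 on synthetic data; σ and the dimensions are illustrative, and k is chosen with the bound λ̂max_R = σ²(1 + 1/√Q)²:

import numpy as np

rng = np.random.default_rng(2)
n, m, sigma = 3, 5000, 0.3
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, m))
X = X - X.mean(axis=1, keepdims=True)
Y = X + sigma * rng.standard_normal((n, m))

# Step 1: center the perturbed data.
Y = Y - Y.mean(axis=1, keepdims=True)

# Step 2: eigen-decompose the sample covariance of Y.
vals, vecs = np.linalg.eigh(np.cov(Y))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 3: keep eigenvalues above the noise bound sigma^2 (1 + 1/sqrt(Q))^2.
Q = m / n
lam_max_R = sigma**2 * (1 + 1 / np.sqrt(Q))**2
k = int(np.sum(vals > lam_max_R))
V_k = vecs[:, :k]

# Step 4: the estimate is the projection of Y through its first k PCs.
X_hat = V_k @ (V_k.T @ Y)
print(k, np.linalg.norm(X_hat - X) / np.linalg.norm(X))  # relative error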
15.3.3 SVD Filtering

Guo et al. [17] revisited spectral filtering to address the issue of an optimal choice of k and to develop bounds on the estimation accuracy. They showed that when k = min{1 ≤ j ≤ n | λ̂^j_Y < 2σ²} − 1, the estimated data is approximately optimal, i.e., the benefit of including the kth eigenvector exceeds the information loss due to the noise projected along it. They further proposed a singular value decomposition based data reconstruction approach and proved its equivalence to spectral filtering. Lower and upper bounds on the estimation error, in terms of the Frobenius matrix norm, were also derived. We refer readers to [14, 17] for more details.

15.3.4 PCA Filtering

Huang et al. [18] observe that a key factor determining the accuracy of spectral filtering is the degree of correlation that exists among the attributes of X relative to σ²: the higher the degree of correlation, the greater the accuracy in estimating the original data. Indeed, for small k, the higher the degree of correlation, the more variance is captured by the first k PCs. The addition of R does not change this property: the attributes of R are uncorrelated, so the amount of variance captured by any direction is the same. Therefore, removing the last n − k PCs of X does not cause much variance loss, but causes 100·(n − k)/n percent of the variance in R to be lost.

Based on this observation, Huang et al. [18] proposed a filtering technique based on PCA. A major difference from spectral filtering is that PCA filtering does not use matrix perturbation theory and spectral analysis to estimate the dominant PCs of X. Instead, PCA filtering takes a more direct approach, based on the fact that

ΣY = ΣX + ΣR = ΣX + σ²I. (15.2)

The first equality is due to the independence of X and R, and the second holds by assumption. Therefore, the attacker can directly estimate ΣX as Σ̂Y − σ²I, then compute the top k PCs of this estimate. The PCA filtering procedure is given as Protocol 4.

Protocol 4 PCA Filtering
Require: Y, the perturbed data matrix; σ², the variance of the random noise; and 1 ≤ k ≤ n, the number of PCs to keep.
Ensure: X̂, an estimate of the original data matrix X.
1: Compute the sample mean of Y and subtract it from every column of Y.
2: Compute the standard sample covariance Σ̂Y of Y, and produce Σ̂X = Σ̂Y − σ²I, an estimate of ΣX.
3: Compute the eigenvalues of Σ̂X, λ̂^1_X ≥ ... ≥ λ̂^n_X, and their associated normalized eigenvectors v̂^1_X, ..., v̂^n_X. Let Ṽ_X denote the matrix [v̂^1_X ··· v̂^k_X].
4: Set X̂ to Ṽ_X Ṽ_Xᵀ Y.
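Protocol 4 differs from Protocol 3 only in how the eigenbasis is obtained: the attacker subtracts σ²I from the sample covariance of Y before the eigendecomposition. A sketch, with k supplied as an input and synthetic data as before:

import numpy as np

def pca_filter(Y, sigma, k):
    """PCA filtering estimate of X from Y = X + R (a sketch of Protocol 4)."""
    Y = Y - Y.mean(axis=1, keepdims=True)                   # step 1
    cov_X_hat = np.cov(Y) - sigma**2 * np.eye(Y.shape[0])   # step 2
    vals, vecs = np.linalg.eigh(cov_X_hat)                  # step 3
    V_k = vecs[:, np.argsort(vals)[::-1][:k]]
    return V_k @ (V_k.T @ Y)                                # step 4

rng = np.random.default_rng(3)
n, m, sigma, k = 3, 5000, 0.3, 2
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, m))
X = X - X.mean(axis=1, keepdims=True)
Y = X + sigma * rng.standard_normal((n, m))
X_hat = pca_filter(Y, sigma, k)
print(np.linalg.norm(X_hat - X) / np.linalg.norm(X))        # relative error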
It can be shown that the mean squared recovery error caused by the noise part is σ2 k n . These results echo the empirical results observed in spectral ﬁltering and suggests an approach for choosing k. 15.3.5 MAP Estimation Attack Different from eigen-analysis, MAP estimation considers both prior and posterior knowledge via Bayes’ theorem to estimate original dataset. For each 1 ≤ i ≤ m, the attacker will produce ˆxi an estimate of xi using2 yi.LetfX and fR denote the p.d.f of X and R, respectively. Given x ∈ Rn and y ∈ Rn , let fX|Y=y and fY|X=x denote the p.d.f of X conditioned on Y = y and the p.d.f of Y conditioned on X = x, respectively. The MAP estimate of xi is3 ˆxi = argsup{fX|Y=yi(x):x ∈ Rn} = argsup{fY|X=x(yi)fX(x):x ∈ Rn} = argsup{fR(yi − x)fX(x):x ∈ Rn}.(15.3) The second equality is due to Bayes’ theorem and the third due to the fact that Y = X + R and R is independent of X. Huang et al. [18] considered the case where both fX and fR are multi- variate normal (and the attacker knows this). The following closed form expression can then be derived with µX denoting the mean vector of X. ˆxi =(Σ−1 X +(1/σ2)I)−1(Σ−1 X µX + yi/σ2). 2Due to independence, the attacker will gain nothing more if using all of Y. 3Here argsup{} is based on supA which denotes the smallest upper bound on a set A(if A is upper- bounded, supA always exists. Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods 367 The assumption that fX is multi-variate normal and known to the attacker is quite strong. Other cases are worth comment (in each, fR is multi-variate nor- mal and known to the attacker). When fX is known but not multivariate normal, it may be difﬁcult to derive a closed-form expression for ˆxi. In this case, the at- tacker can use numerical methods such as Newton’s gradient descent methods. When fX is not known, the MAP estimate reduces to the maximum likelihood estimate (MLE) by assuming fX is uniform over some interval. Therefore, fX can be dropped from (15.3) and ˆxi = yi. However, this estimate may suffer from accuracy problems due to dropping fX. It is worth noting that the MAP approach has been widely studied in statis- tical disclosure control. For example, Trottini et al. [44] used this approach to study the linkage privacy breaches in the scenario where microdata is masked by both additive and multiplicative noise. In their settings, the attacker tries to identify the identity (of a person) linked to a speciﬁc record, which is different from the primary focus of this chapter - data record recovery. 15.3.6 Distribution Analysis Attack Recall that techniques exist for estimating fX from Y. This is quite useful as fX represents a useful data mining pattern. However, in some cases, this reconstructed distribution can be used by the attacker to gain extra knowledge about the private data. For example, assume the each entry of R is uniformly distributed over [−1, 1] and the observed perturbed data Y =1. If there is no additional information, the attacker can determine X∈[0, 2].However,ifa large amount of data is available, the reconstructed distribution will have a high degree of accuracy. Assume the attacker can perfectly recover fX which is: fX(x)= ⎧ ⎨ ⎩ 0.5, 0 ≤ x ≤ 1; 0.5, 5 ≤ x ≤ 6; 0, otherwise. Then, the estimate of X given Y =1is localized to a smaller interval [0, 1] instead of [0, 2]. 
15.3.6 Distribution Analysis Attack

Recall that techniques exist for estimating fX from Y. This is quite useful, as fX represents a useful data mining pattern. However, in some cases this reconstructed distribution can itself be used by the attacker to gain extra knowledge about the private data. For example, assume each entry of R is uniformly distributed over [−1, 1] and the observed perturbed value is Y = 1. With no additional information, the attacker can only determine that X ∈ [0, 2]. However, if a large amount of data is available, the reconstructed distribution will have a high degree of accuracy. Assume the attacker can perfectly recover fX, which is

fX(x) = 0.5 for 0 ≤ x ≤ 1; 0.5 for 5 ≤ x ≤ 6; 0 otherwise.

Then the estimate of X given Y = 1 is localized to the smaller interval [0, 1] instead of [0, 2]. When the data has a multivariate distribution, the attacker can determine intervals I1, I2, ..., In that are narrow in one or more dimensions and into which very few data records fall. Such intervals make outliers and minorities more identifiable than they would appear from the perturbed data alone. This kind of disclosure leads to a bigger open problem: when do data mining results themselves cause a privacy breach? Further discussion can be found in [4, 9, 31, 16, 12].

15.3.7 Summary

This section surveyed recent research investigating the vulnerability of additive data perturbation. The research showed that, in many cases, the private information can be derived reasonably well from the perturbed data. The primary attack techniques presented are summarized in Table 15.1.

Table 15.1. Summary of Attacks on Additive Perturbation

Category              | Related Work     | General Assumptions
Eigen-Analysis        | [14, 17, 18, 22] | the degree of correlation between the original data attributes is high relative to σ²
MAP Estimation        | [18]             | data and noise arose from a multivariate normal distribution
Distribution Analysis | [4, 9, 16]       | the reconstructed distribution describes the original data with sufficient accuracy

One possible improvement on additive perturbation is to use colored noise with a correlation structure similar to that of the original data [23, 43], i.e., R ∼ (0, ΣR), where ΣR = βΣX for some β > 0. With this method, the covariance of the perturbed data is ΣY = ΣX + βΣX = (1 + β)ΣX, and the correlation coefficients of the perturbed attributes are the same as those of the original attributes:

ρ_{Yi,Yj} = (1 + β)·Cov(Xi, Xj) / ((1 + β)·√(Var(Xi)·Var(Xj))) = ρ_{Xi,Xj}.

This kind of perturbation puts noise on the principal components of the original data; therefore, separating the noise from the data using eigen-analysis becomes difficult. However, this approach is not free of problems either. Domingo-Ferrer et al. [9] pointed out that the reconstructed distribution (using their p-dimensional reconstruction algorithm, a multivariate generalization of the approach described in [5] for the univariate case) may still lead to disclosure in some cases; the higher the dimensionality, the more likely the disclosure.

In summary, additive perturbation has its roots in statistical disclosure control. It offers a simple way to mask private data while allowing aggregate statistics to be queried, and it makes more sophisticated privacy-preserving data mining possible. However, recent work from the PPDM community has shown this technique to be vulnerable to attack in many cases (e.g., when many attributes are highly correlated). Therefore, careful attention must be paid when applying this technique in practice. Before closing this section, we note that several researchers have proposed privacy metrics, e.g., interval-based [5], entropy-based [4], and mixture-model-based [49]; however, the relationship between these metrics and the recovery accuracy of the attack techniques is not clear.

15.4 Attacking Matrix Multiplicative Data Perturbation

The data owner replaces the original data X with

Y = MX, (15.4)

where M is an n′ × n matrix chosen to have certain useful properties.
If M is orthogonal (n′ = n and MᵀM = I) [7, 36, 37], the perturbation exactly preserves Euclidean distances: for any columns x1, x2 in X, the corresponding columns y1, y2 in Y satisfy ||x1 − x2|| = ||y1 − y2||. (Conversely, any function T: Rⁿ → Rⁿ that preserves Euclidean distances, i.e., ||x − y|| = ||T(x) − T(y)|| for all x, y ∈ Rⁿ, and fixes the origin is equivalent to left-multiplication by an n × n orthogonal matrix.) If each entry of M is generated independently from the same distribution with mean zero and variance σ² (with n′ not necessarily equal to n) [28, 30], the perturbation approximately preserves Euclidean distances in expectation, up to the constant factor σ²n′. If M is the product of a discrete cosine transformation matrix and a truncated perturbation matrix [33], the perturbation approximately preserves Euclidean distances.

Because matrix multiplicative perturbation preserves Euclidean distance with either small or no error, it allows many important data mining algorithms, e.g., hierarchical clustering and k-means clustering, to be applied to the perturbed data and produce results very similar to, or exactly the same as, those produced by the original algorithm applied to the original data. However, the issue of how well X is hidden is not clear and deserves careful study.
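Exact distance preservation under an orthogonal M is easy to check directly; the QR decomposition of a random Gaussian matrix is a convenient source of a random orthogonal matrix (an illustrative setup, not a construction from the cited papers):

import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 6
X = rng.standard_normal((n, m))

# Random orthogonal M via QR decomposition: M.T @ M = I.
M, _ = np.linalg.qr(rng.standard_normal((n, n)))
Y = M @ X

# Pairwise Euclidean distances are exactly preserved.
for i in range(m):
    for j in range(i + 1, m):
        dX = np.linalg.norm(X[:, i] - X[:, j])
        dY = np.linalg.norm(Y[:, i] - Y[:, j])
        assert abs(dX - dY) < 1e-9
print("all pairwise distances preserved")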
Without any prior knowledge, an attacker can do very little (if anything) to accurately recover X. However, assuming no prior knowledge seems unreasonable in many situations. Motivated by this line of reasoning, several researchers have investigated the vulnerabilities of matrix multiplicative perturbation given various forms of prior knowledge [8, 15, 28-30]. In the bulk of this section (15.4.1 and 15.4.2), we discuss attack techniques based on two types of prior knowledge.

1. Known input-output (I/O): The attacker knows some small collection of original data records, and the attacker knows the mapping between these known original data records and their perturbed counterparts in Y. In other words, the attacker has a set of input-output pairs.

2. Known sample: The attacker has a collection of independent samples (the columns of S) from X (S may or may not overlap with X).

The first two attacks are based on the known I/O assumption: the first [29] assumes an orthogonal perturbation matrix, while the second [28] assumes a randomly generated perturbation matrix. The third attack is based on the known sample assumption and assumes an orthogonal perturbation matrix. It works by examining certain features of the original and perturbed data distributions (i.e., the p.d.f.s of X and Y), namely the eigenvectors of ΣX and ΣY. These features have two important properties: (i) they are related to each other in a natural way, allowing M to be estimated, and (ii) they can be accurately extracted from S and Y. Before moving on, we emphasize that the perturbation technique considered here, matrix multiplicative, is completely different from the multiplicative data perturbation mentioned in the introduction, where each element of X is separately multiplied by a randomly generated number.

15.4.1 Known I/O Attacks

Without loss of generality, the attacker is assumed to know Xp, the first p columns of X (1 ≤ p < m), and Yp, the corresponding columns of Y.

Orthogonal Perturbation Matrix. Liu et al. [29] analyzed the case of orthogonal M. For a given ε > 0, the attacker's success probability ρ(xi, ε) is defined as the probability that the relative Euclidean distance between xi and x̂i is no larger than ε, i.e., Pr(||x̂i − xi|| ≤ ε||xi||). Liu et al. developed the closed-form expression

ρ(xi, ε) = (2/π)·arcsin( ε||xi|| / (2·d(xi, Xp)) ) if ε||xi|| < 2·d(xi, Xp); 1 otherwise, (15.6)

where d(xi, Xp) denotes the Euclidean distance from xi to the space of vectors spanned by the columns of Xp, i.e., inf{||x − xi|| : x is in the column space of Xp}. Equation (15.6) shows that the sensitivity of a tuple xi to breach depends upon its length relative to its distance to the column space of Xp, i.e., on ε||xi||/(2d(xi, Xp)); tuples whose relative length is large are particularly sensitive to breach. In particular, when xi lies in the column space of Xp, the attacker's success probability equals one. Liu et al. also described how the attacker can compute ||xi|| and d(xi, Xp) for any p ≤ i ≤ m, and therefore determine which tuples are most sensitive to breach.

Chen et al. [8] also discussed a known I/O attack technique. They, however, consider a combination of matrix multiplicative and additive perturbation, Y = MX + R, in the case where the number of linearly independent known data tuples (columns of Xp) is no smaller than the data dimensionality n (rows of Xp). They pointed out that M̂, an estimate of M, can be produced using linear regression, and xi then estimated as M̂⁻¹yi.

Random Perturbation Matrix. Liu [28] developed a MAP-based known I/O attack that works under the assumption that M is an n′ × n matrix whose entries were generated independently from a normal distribution with mean zero and variance σ² (n′ may be ≤ n or > n; the original data records are again assumed to have arisen as independent samples from X). The larger n′ is, the more closely preserved are the Euclidean distances between data tuples (up to constant factor σ²n′), but the better the known I/O attack works at breaching privacy; therefore, a trade-off must be balanced in setting n′. For simplicity, we assume that the columns of Yp are linearly independent (this assumption is not essential; it can be eliminated at the cost of a more complicated attack algorithm, but the fundamental idea remains the same).

For any p ≤ i ≤ m, the attacker produces x̂i, an estimate of xi. If xi is linearly dependent on the columns of Xp, the attacker can discover this, as yi will be linearly dependent on the columns of Yp. In this case, the attacker sets x̂i = Xp(YpᵀYp)⁻¹Ypᵀyi, which equals xi, a perfect recovery: there exists zi ∈ Rᵖ such that Xpzi = xi and Ypzi = yi, and since the columns of Yp are linearly independent, the matrix (YpᵀYp)⁻¹Ypᵀ exists [13, pg. 96], so Xp(YpᵀYp)⁻¹Ypᵀyi = Xp(YpᵀYp)⁻¹(YpᵀYp)zi = Xpzi = xi. Henceforth, we assume xi is linearly independent of the columns of Xp. Therefore, the attacker will only consider estimates x̂ ∈ Rⁿ that are also linearly independent of the columns of Xp (for brevity, we write "l.i. x̂" to mean that x̂ is linearly independent of the columns of Xp). Finally, since the columns of Yp are assumed to be linearly independent, it follows that the columns of Xp are too.

Let M be an n′ × n matrix of random variables, each independently and identically distributed as normal with mean zero and variance σ². The columns of Y arose as independent samples from the random vector Y = MX. Using the MAP approach, the attacker will choose l.i. x̂ so as to maximize the likelihood that X equals x̂ given that Y equals yi and MXp equals Yp.
This analysis is based on the following key observation (whose proof follows directly from manipulating moment-generating functions). For any matrix B, let B̄ denote the column vector that results from stacking the columns of B.

Theorem 15.2 For any n × q matrix A with linearly independent columns, the stacked vector of MA is distributed as a (qn′)-variate Gaussian with mean vector zero and block-diagonal covariance matrix

Σ_{MA} = σ² · diag(AᵀA, AᵀA, ..., AᵀA),

i.e., copies of AᵀA down the diagonal and zeros elsewhere.

Let [Xp, x̂] and [Yp, yi] denote the matrices that result from attaching x̂ and yi as an additional right-most column onto Xp and Yp; observe that [Xp, x̂] has linearly independent columns. Let fX|Y=yi,MXp=Yp denote the p.d.f. of X conditioned on Y = yi and MXp = Yp, and let fM[Xp,x̂] denote the p.d.f. of the (stacked) matrix M[Xp, x̂]. Using the MAP approach, the attacker will choose

x̂i = argsup{fX|Y=yi,MXp=Yp(x̂) : l.i. x̂ ∈ Rⁿ}.

Using Bayes' rule, it can be shown that x̂i = argsup{fM[Xp,x̂]([Yp, yi])·fX(x̂) : l.i. x̂ ∈ Rⁿ}; thus, Theorem 15.2 implies

x̂i = argsup{φ([Yp, yi])·fX(x̂) : l.i. x̂ ∈ Rⁿ}, (15.7)

where φ is the ((p + 1)n′)-variate Gaussian distribution with mean vector zero and covariance matrix Σ_{M[Xp,x̂]}. For simplicity, we assume the attacker knows nothing about fX and, following common practice, uses a uniform distribution over some interval in place of fX in (15.7). (A more complicated approach could have the attacker use the fact that the columns of Xp arose as independent samples from X, and use Xp to inform a better substitution for fX.) Thus,

x̂i = argsup{φ([Yp, yi]) : l.i. x̂ ∈ Rⁿ}. (15.8)

Producing a closed-form expression for x̂i in (15.8) is desirable but quite difficult; instead, the attacker can turn to numerical approaches. Experiments were reported in [28] in which the attacker used the Matlab implementation of the Nelder-Mead simplex algorithm [35] (http://www.mathworks.com/access/helpdesk/help/techdoc/ref/fminsearch.html) to solve this optimization problem. The results show that the accuracy of the attack technique increases with n′ and with the number of known input-output pairs.

15.4.2 Known Sample Attack

The attacker is assumed to know a collection of independent samples (the columns of S) from X (S may or may not overlap with X). Furthermore, the attacker assumes M is orthogonal. The approach is based on the observation that the eigenvectors of ΣY equal those of ΣX left-multiplied by M (up to a factor of ±1). Therefore, by estimating ΣY and ΣX and matching their eigenvectors, the attacker can produce M̂, an estimate of M. Using this, data record xi (1 ≤ i ≤ m) is estimated as x̂i = M̂ᵀyi. The following results (proved in [29]) establish the key match between the normalized eigenspaces.

Theorem 15.3 The eigenvalues of ΣX and ΣY are the same, and for all 1 ≤ j ≤ n, M·V^j_X = V^j_Y, where M·V^j_X equals {Mv : v ∈ V^j_X}.

Corollary 15.4 Let In be the space of all n × n matrices with each diagonal entry ±1 and each off-diagonal entry 0 (2ⁿ matrices in total). There exists D0 ∈ In such that M = V_Y D0 V_Xᵀ.

First assume that the attacker knows the covariance matrices ΣX and ΣY and thus computes V_X and V_Y. By Corollary 15.4, the attacker can perfectly recover M if she can choose the right D from In. To do so, the attacker utilizes S and Y, in particular the fact that these arose as independent samples from X and Y = MX. For any D ∈ In, if D = D0, then V_Y D V_Xᵀ S and Y have both arisen as independent samples from Y.
15.4.2 Known Sample Attack

The attacker is assumed to know a collection of independent samples (the columns of S) from X (S may or may not overlap with X). Furthermore, the attacker assumes M is orthogonal. The approach is based on the observation that the eigenvectors of Σ_Y are equal to those of Σ_X left-multiplied by M (up to a factor of ±1). Therefore, by estimating Σ_Y and Σ_X and matching their eigenvectors, the attacker can produce M̂, an estimate of M. Using this, data record x_i (1 ≤ i ≤ m) is estimated as x̂_i = M̂^T y_i. The following results (proved in [29]) establish the key match between the normalized eigenspaces.

Theorem 15.3 The eigenvalues of Σ_X and Σ_Y are the same, and for all 1 ≤ j ≤ n, M V_X^j = V_Y^j, where M V_X^j equals {Mv : v ∈ V_X^j}.

Corollary 15.4 Let I_n be the space of all n × n matrices with each diagonal entry ±1 and each off-diagonal entry 0 (2^n matrices in total). There exists D_0 ∈ I_n such that M = V_Y D_0 V_X^T.

First assume that the attacker knows the covariance matrices Σ_X and Σ_Y and, thus, computes V_X and V_Y. By Corollary 15.4, the attacker can perfectly recover M if she can choose the right D from I_n. To do so, the attacker utilizes S and Y, in particular, the fact that these arose as independent samples from X and Y = MX. For any D ∈ I_n, if D = D_0, then V_Y D V_X^T S and Y have both arisen as independent samples from Y. The attacker will estimate M as M̂ = V_Y D V_X^T, where D is chosen from I_n so as to maximize the likelihood that V_Y D V_X^T S and Y arose from the same random vector. To make this choice, the attacker can use a multivariate two-sample hypothesis test for equal distributions [42]. The smaller the p-value, the more convincingly the null hypothesis (that V_Y D V_X^T S and Y have both arisen as independent samples from Y) can be rejected. Therefore, D ∈ I_n is chosen to maximize the p-value.

Finally, the attacker can eliminate the assumption at the start of the previous paragraph by replacing Σ_X and Σ_Y with estimates computed from S and Y. Using the standard sample covariance matrices, the pseudo-code for the attack technique is shown in Protocol 5. A weakness lies in its computation cost, O(2^n (m + p)^2). For high-dimensional data, the technique is infeasible.

Protocol 5 Eigen-Analysis Attack
Require: Y, the perturbed data matrix, and S, the sample data matrix.
Ensure: X̂, an estimate of the original data matrix X.
1: Compute the standard sample covariance matrices of S and Y, and V̂_X and V̂_Y, their normalized eigenvector matrices.
2: Choose D ∈ I_n so as to maximize the p-value of the two-sample hypothesis test for equal distributions on V̂_Y D V̂_X^T S and Y.
3: Set M̂ to V̂_Y D V̂_X^T and X̂ to M̂^T Y.

It should be noted that the eigen-analysis attack does not work if each entry of M were generated independently from some distribution with mean zero and variance σ². In that case, Σ_Y will equal γI for some constant γ > 0, thereby ruling out any useful matching like that in Theorem 15.3.
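A compact sketch of Protocol 5, assuming Python with numpy. The energy statistic below is a simple stand-in for the two-sample test of [42] (a smaller statistic playing the role of a larger p-value), eigenvalues are assumed distinct so that eigenvectors pair up unambiguously after sorting, and all names are illustrative. The loop over I_n makes the exponential cost explicit.

import itertools
import numpy as np

def energy_stat(U, W):
    # Energy distance between two samples stored column-wise; a stand-in
    # for the equal-distribution test statistic of Szekely and Rizzo [42].
    d = lambda A, B: np.mean(np.linalg.norm(A[:, :, None] - B[:, None, :], axis=0))
    return 2 * d(U, W) - d(U, U) - d(W, W)

def eigen_attack(Y, S):
    n = Y.shape[0]
    # Step 1: sample covariances and their normalized eigenvector matrices.
    _, Vx = np.linalg.eigh(np.cov(S))
    _, Vy = np.linalg.eigh(np.cov(Y))
    # Step 2: search the 2^n diagonal +/-1 matrices D for the best match
    # between V_Y D V_X^T S and Y.
    best_stat, M_hat = np.inf, None
    for signs in itertools.product([1.0, -1.0], repeat=n):
        cand = Vy @ np.diag(signs) @ Vx.T
        stat = energy_stat(cand @ S, Y)
        if stat < best_stat:
            best_stat, M_hat = stat, cand
    # Step 3: M is assumed orthogonal, so invert it by transposition.
    return M_hat.T @ Y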
15.4.3 Other Attacks Based on ICA

Before finishing the section, we briefly describe some attacks based on independent component analysis (ICA) [19].

ICA Overview. Given an n-variate random vector V, one common ICA model posits that this random vector was generated by a linear combination of independent random variables, i.e., V = AS, with S an n′-variate random vector with independent components. Typically, S is further assumed to satisfy the following additional assumptions: (i) at most one component is distributed as a Gaussian; (ii) n ≥ n′; and (iii) A has rank n′. One common scenario in practice: there is a set of unobserved samples (the columns of the n′ × q matrix S) that arose from the random vector S, which satisfies (i)-(iii) and whose components are independent. But observed is the n × q matrix V, whose rows arose as linear combinations of the rows of S. The columns of V can be thought of as samples that arose from a random vector V which satisfies the above generative model. There are ICA algorithms whose goal is to recover S and A up to a row permutation and constant multiple. This ambiguity is inevitable due to the fact that for any diagonal matrix D (with all non-zeros on the diagonal) and permutation matrix P, if (A, S) is a solution, then so is (ADP, P^{-1}D^{-1}S).

Other Attacks. Liu et al. [30] considered matrix multiplicative data perturbation where M is an n′ × n matrix with each entry generated independently from some distribution with mean zero and variance σ². They discussed the application of the above ICA approach to estimate X directly from Y, i.e., taking the sample matrices as S = X and V = Y (and, at the level of random vectors, S = X and V = Y), with A = M. They argued the approach to be problematic because the ICA generative model imposes assumptions not likely to hold in many practical situations: the components of X are independent, with at most one such component being Gaussian distributed. Moreover, they pointed out that the row permutation and constant multiple ambiguity further hampers accurate recovery of X. A similar observation was made later by Chen et al. [8].

Guo and Wu [15] considered matrix multiplicative perturbation assuming only that M is an n′ × n matrix (orthogonal or otherwise). Further, they assumed that a weaker variant of the known I/O holds: the attacker knows X̃, a collection of original data columns from X, but does not know to which of the columns in Y these correspond. They develop an ICA-based attack technique for estimating the remaining columns in X. To avoid the ICA problems described in the previous paragraph, they instead applied ICA separately to X̃ and Y, producing representations (A_X̃, S_X̃) and (A_Y, S_Y). They argued that these representations are related in a natural way, allowing X to be estimated. Their approach is similar in spirit to the known sample attack described earlier, which related S and Y through representations derived through eigen-analysis.
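The permutation and scaling ambiguity discussed above is easy to see in a toy experiment, sketched below in Python with numpy and scikit-learn's FastICA (an implementation of the algorithms surveyed in [19]); the data, dimensions, and names are all illustrative.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.laplace(size=(3, 1000))          # independent, non-Gaussian attributes
M = rng.normal(size=(3, 3))              # random perturbation matrix
Y = M @ X                                # perturbed data; columns are records

ica = FastICA(n_components=3, random_state=0)
X_hat = ica.fit_transform(Y.T).T         # estimated sources (rows)
M_hat = ica.mixing_                      # estimated mixing matrix (M's role)
# Even in this favorable setting, X_hat matches X only up to a row
# permutation and per-row scaling -- exactly the (ADP, P^{-1}D^{-1}S)
# ambiguity that hampers accurate recovery of X.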
15.4.4 Summary

This section discussed the vulnerabilities of matrix multiplicative data perturbation to certain attacks based on prior knowledge. The primary attack techniques discussed are summarized in Table 15.2.¹⁰

Table 15.2. Summarization of Attacks on Matrix Multiplicative Perturbation

Category                        Related Work   General Assumptions
Linear algebra/measure theory   [29]           known I/O; M is orthogonal
MAP estimation                  [28]           known I/O; M is n′ × n with entries generated independently from N(0, σ²)
Eigen-analysis                  [29]           known sample; M is orthogonal
ICA                             [8, 30]        M has rank n′; the data attributes are largely independent and at most one is Gaussian
ICA                             [15]           M is n′ × n; weak known I/O

¹⁰All the attack techniques, except the known I/O attack with orthogonal M, implicitly assume that the original data records arose independently from X.

Chen et al. [8] discussed a modification of matrix multiplicative data perturbation to improve its resilience to attack. They examine the combination of matrix multiplicative and additive data perturbation. They argue that this approach offers additional privacy protection, but the utility of the perturbed data is negatively affected, since additive noise does not preserve Euclidean distance well.

15.5 Attacking k-Anonymization

Before concluding this chapter, we briefly survey a very recent body of research aimed at analyzing the vulnerabilities of the popular k-anonymity model [38, 41]. Here, the private data X is perturbed such that each of the resulting records is identical to at least k − 1 others with respect to a pre-defined set of attributes called quasi-identifiers. All of the other attributes are called sensitive attributes, and these are not modified by the perturbation. This perturbation can be carried out by judicious value generalization (e.g., zip 95120 → 951**) or tuple suppression, and it is aimed at preventing linkage attacks through the quasi-identifiers.

Recently, Machanavajjhala et al. [32] developed a background knowledge attack on k-anonymity, which we call a homogeneity attack. They showed how a lack of diversity among the sensitive attribute values can be used to establish a linkage between individuals and sensitive values. To remedy this problem, they proposed a new privacy definition called l-diversity, such that in each equivalence class there are at least l "well-represented" sensitive values. Along the same line, Wong et al. [48] proposed an (α, k)-anonymization model such that the relative frequency of each sensitive value in every equivalence class is less than or equal to α. Li et al. [25] later developed attacks on l-diversity (the skewness attack and the similarity attack), and argued that l-diversity is neither necessary nor sufficient to prevent attribute disclosure. To cope with these problems, they proposed an improved framework called t-closeness, which requires the distribution of a sensitive attribute in any equivalence class to be close to the distribution of the attribute in the original data set.

Wang et al. [46] considered the privacy breach caused by the attacker's data mining capabilities. They presented an approach (combining association rule hiding and k-anonymity) to limit the confidence of inferring sensitive properties about the existing individuals.

Aggarwal [2] also argued the original k-anonymity model to be problematic. He considered the case of high dimensional data and pointed out that the exponential number of quasi-identifier combinations can allow precise inference attacks unless an unacceptably high amount of information loss is suffered.

15.6 Conclusion

This chapter provides a detailed survey of attack techniques on additive and matrix multiplicative perturbation. It also presents a brief overview of attacks on k-anonymization. These attacks offer insights into the vulnerabilities of data perturbation techniques under certain circumstances. In summary, the following kinds of information could lead to disclosure of private information from the perturbed data.

1. Attribute Correlation: Many real-world data sets have strongly correlated attributes, and this correlation can be used to filter off additive white noise. See, e.g., [14, 17, 18, 22].

2. Known Sample: Sometimes the attacker has certain background knowledge about the data, such as its p.d.f. or a collection of independent samples which may or may not overlap with the original data. See, e.g., [28, 29, 18].

3. Known Inputs/Outputs: Sometimes the attacker knows a small set of private data records and their perturbed counterparts. This correspondence can help the attacker estimate other private data. See, e.g., [28, 15, 29].

4. Data Mining Results: The underlying patterns discovered by data mining also provide a certain level of knowledge which can be used to guess the private data to a higher level of accuracy. See, e.g., [4, 9, 31, 16, 12, 46].

5. Sample Dependency: Most of the attacks (except the known I/O attack developed in [29]) discussed in this chapter assume the data are independent samples from some unknown distribution. This assumption may not hold true for all real applications. For certain types of data, such as time series data, there exists autocorrelation/dependency among the samples. How this dependency can help the attacker to estimate the original data is still an open problem.

Notes
The contributions of C. Giannella and K. Liu were equal.

Acknowledgements
The authors wish to thank the U.S. National Science Foundation for its support through awards IIS-0329143 and IIS-0093353.
The authors also wish to thank Kamalika Das, Souptik Datta, and Ran Wolff for their assistance.

References

[1] N. R. Adam and J. C. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys (CSUR), 21(4):515-556, 1989.
[2] Charu C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proceedings of the 31st VLDB Conference, pages 901-909, Trondheim, Norway, 2005.
[3] Charu C. Aggarwal and Philip S. Yu. A condensation based approach to privacy preserving data mining. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'04), pages 183-199, Heraklion, Crete, Greece, March 2004.
[4] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 247-255, Santa Barbara, CA, 2001.
[5] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 439-450, Dallas, TX, May 2000.
[6] R. Brand. Microdata protection through noise addition. Lecture Notes in Computer Science - Inference Control in Statistical Databases, 2316:97-116, 2002.
[7] K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM'05), pages 589-592, Houston, TX, November 2005.
[8] K. Chen, G. Sun, and L. Liu. Towards attack-resilient geometric data perturbation. In Proceedings of the 2007 SIAM International Conference on Data Mining (SDM'07), Minneapolis, MN, April 2007.
[9] J. Domingo-Ferrer, F. Sebé, and J. Castellà-Roca. On the security of noise addition for privacy in statistical databases. Privacy in Statistical Databases, LNCS 3050:149-161, 2004.
[10] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the ACM SIGMOD/PODS Conference, San Diego, CA, June 2003.
[11] S. E. Fienberg and J. McIntyre. Data swapping: Variations on a theme by Dalenius and Reiss. Technical report, National Institute of Statistical Sciences, Research Triangle Park, NC, 2003.
[12] A. Friedman, R. Wolff, and A. Schuster. Providing k-anonymity in data mining. The VLDB Journal, 2006 (to be published).
[13] G. Strang. Linear Algebra and Its Applications (3rd Ed.). Harcourt Brace Jovanovich College Publishers, New York, 1986.
[14] S. Guo and X. Wu. On the use of spectral filtering for privacy preserving data mining. In Proceedings of the 21st ACM Symposium on Applied Computing, pages 622-626, Dijon, France, April 2006.
[15] S. Guo and X. Wu. Deriving private information from arbitrarily projected data. In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'07), Nanjing, China, May 2007.
[16] S. Guo, X. Wu, and Y. Li. Deriving private information from perturbed data using IQR based approach. In Proceedings of the Second International Workshop on Privacy Data Management (PDM'06), Atlanta, GA, April 2006.
[17] S. Guo, X. Wu, and Y. Li. On the lower bound of reconstruction error for spectral filtering based privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), pages 520-527, Berlin, Germany, September 2006.
[18] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In Proceedings of the 2005 ACM SIGMOD Conference, pages 37-48, Baltimore, MD, June 2005.
[19] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4):411-430, June 2000.
[20] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, second edition, 2002.
[21] D. Jonsson. Some limit theorems for the eigenvalues of a sample covariance matrix. Journal of Multivariate Analysis, 12:1-38, 1982.
[22] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Proceedings of the IEEE International Conference on Data Mining (ICDM'03), pages 99-106, Melbourne, FL, November 2003.
[23] J. Kim. A method for limiting disclosure in microdata based on random noise and transformation. In Proceedings of the American Statistical Association on Survey Research Methods, pages 370-374, Washington, DC, 1986.
[24] J. J. Kim and W. E. Winkler. Multiplicative noise for masking continuous data. Technical Report Statistics #2003-01, Statistical Research Division, U.S. Bureau of the Census, Washington, D.C., April 2003.
[25] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the 23rd International Conference on Data Engineering (ICDE'07), pages 106-115, Istanbul, Turkey, April 2007.
[26] X.-B. Li and S. Sarkar. A tree-based data perturbation approach for privacy-preserving data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(9):1278-1283, 2006.
[27] C. K. Liew, U. J. Choi, and C. J. Liew. A data distortion by probability distribution. ACM Transactions on Database Systems (TODS), 10(3):395-411, 1985.
[28] K. Liu. Multiplicative Data Perturbation for Privacy Preserving Data Mining. PhD thesis, University of Maryland, Baltimore County, Baltimore, MD, January 2007.
[29] K. Liu, C. Giannella, and H. Kargupta. An attacker's view of distance preserving maps for privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), pages 297-308, Berlin, Germany, September 2006.
[30] K. Liu, H. Kargupta, and J. Ryan. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(1):92-106, January 2006.
[31] M. Kantarcıoğlu, J. Jin, and C. Clifton. When do data mining results violate privacy? In Proceedings of the 10th ACM SIGKDD Conference (KDD'04), pages 599-604, Seattle, WA, August 2004.
[32] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), 2006.
[33] S. Mukherjee, Z. Chen, and A. Gangopadhyay. A privacy preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms. The VLDB Journal, 15(4):293-315, 2006.
[34] K. Muralidhar and R. Sarathy. Data shuffling - a new masking approach for numerical data. Management Science, 52(5):658-670, May 2006.
[35] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308-313, 1965.
[36] S. R. M. Oliveira and O. R. Zaïane. Privacy preserving clustering by data transformation.
In Proceedings of the 18th Brazilian Symposium on Databases, pages 304-318, Manaus, Amazonas, Brazil, October 2003.
[37] S. R. M. Oliveira and O. R. Zaïane. Privacy preservation when sharing data for clustering. In Proceedings of the International Workshop on Secure Data Management in a Connected World, pages 67-82, Toronto, Canada, August 2004.
[38] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010-1027, November/December 2001.
[39] J. W. Silverstein and P. L. Combettes. Signal detection via spectral theory of large dimensional random matrices. IEEE Transactions on Signal Processing, 40(8):2100-2105, 1992.
[40] G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Academic Press, 1990.
[41] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.
[42] G. J. Székely and M. L. Rizzo. Testing for equal distributions in high dimensions. InterStat, November(5), 2004.
[43] P. Tendick. Optimal noise addition for preserving confidentiality in multivariate data. Journal of Statistical Planning and Inference, 27(2):341-353, 1991.
[44] M. Trottini, S. E. Fienberg, U. E. Makov, and M. M. Meyer. Additive noise and multiplicative bias as disclosure limitation techniques for continuous microdata: A simulation study. Journal of Computational Methods in Sciences and Engineering, 4:5-16, 2004.
[45] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni. Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, 16:434-447, 2004.
[46] K. Wang, Benjamin C. M. Fung, and Philip S. Yu. Handicapping attacker's confidence: an alternative to k-anonymization. Knowledge and Information Systems, 11(3):345-368, 2007.
[47] E. P. Wigner. On the statistical distribution of the widths and spacings of nuclear resonance levels. Proceedings of the Cambridge Philosophical Society, 47:790-798, 1952.
[48] R. Chi-Wing Wong, J. Li, A. Wai-Chee Fu, and K. Wang. (α, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In Proceedings of the 12th ACM SIGKDD Conference (KDD'06), pages 754-759, Philadelphia, PA, August 2006.
[49] Y. Zhu and L. Liu. Optimal randomization for privacy preserving data mining. In Proceedings of the 10th ACM SIGKDD Conference (KDD'04), pages 761-766, Seattle, WA, August 2004.

Chapter 16

Private Data Analysis via Output Perturbation
A Rigorous Approach to Constructing Sanitizers and Privacy Preserving Algorithms

Kobbi Nissim
Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel.
kobbi@cs.bgu.ac.il

To R., Y., and N.

Abstract
We describe output perturbation techniques that allow for a provable, rigorous sense of individual privacy. Examples where the techniques are effective span from basic statistical computations to sophisticated machine learning algorithms.

Keywords: Private query processing, output perturbation.

16.1 Introduction

Rapidly increasing volumes of sensitive individual information are maintained by governments, statistical agencies and private enterprises, the latter making them increasingly ubiquitous as electronic collection and archiving evolves. The potential social benefits from analyzing these databases are enormous.
A challenge, however, is to compute and release useful information about the data while protecting the privacy of individual data contributors. Our focus is on such analyses.

Appealing to intuition, one may claim that statistical analysis and datamining procedures already answer this challenge. After all, these analyses are aimed at finding large scale phenomena; hence, applying them to data collections should not result in a significant leakage of private individual information. This intuition, however, seems hard to substantiate, as it may so happen that the results of several such "harmless looking" analyses may be combined in a way that would cause privacy breaches.

In this chapter we describe a formal approach, analogous to that taken in the theoretical research of cryptography. We first present a simple and rather intuitive privacy definition that allows us to argue about its implications (so as to hopefully understand what kind of privacy is provided), and then construct analyses that preserve privacy.

Before we continue, we differentiate our goal from another goal pursued in the privacy literature, namely the construction of efficient secure multiparty protocols for datamining tasks.¹ This is the problem of applying the cryptographic tool of secure multiparty computation to collections of sensitive individual information that are distributed among several parties, each of which is not willing to explicitly share its information with the other parties. An extremely rich theory exists, starting from the foundational work of [34, 21, 7, 5], showing that essentially every analysis may be performed such that the parties collaboratively compute it over their joint data without any of them learning more than what is implied by the intended outcome of the analysis. These strong results follow by generic transformations of insecure computations to secure computations, which result in only a polynomial overhead.

When applied to large datasets, the generic techniques of creating secure multiparty protocols are inefficient in practice. Hence, more efficient secure protocols for specific functionalities are sought after, e.g. protocols where the total communication is sublinear in the dataset size. A breakthrough result in this direction is due to Lindell and Pinkas [24]. They showed how to securely compute an ID3 decision tree when the dataset is horizontally split between two parties. The approach taken in [24] and much of the following research is to choose one of the existing algorithms/heuristics for a datamining problem, and implement an efficient secure protocol for it, avoiding the generic techniques when they do not yield efficient protocols. The privacy guarantee is that the participating parties would not learn any information beyond what is implied by the outcome of the chosen algorithm/heuristic. This is a very different notion of privacy from what we seek herein - in general there is no guarantee that the outcome of the datamining analysis procedure itself preserves individual privacy. It may leak some information pertaining to individuals, or small groups, and no matter how secure the implementation is, it would also leak this information.

In the rest of this chapter we consider a simple formal model - statistical databases - that serves as an underlying model for our discussion.
In this model we present a privacy definition, capturing the intuition that an individual's privacy is preserved if the inclusion of her data in the analysis has a minor effect on the outcome. At first sight it may seem that this definition is so restrictive that it would prohibit any useful computation; however, this is not the case - it is possible to construct analyses that yield useful outcome, and yet preserve privacy in a rigorous sense. Our focus is on the basic techniques for constructing such statistical and machine learning analyses. We start in Section 16.4 with the basic technique of adding noise of magnitude proportional to a property of the query function called sensitivity, and show the effectiveness of this idea for simple functions. These simple functions are used in Section 16.5 as the building blocks for more complex functionalities. Section 16.6 includes a brief overview of more recent techniques that have emerged from the basic techniques. We conclude with related work and bibliographic notes.

¹The term privacy preserving datamining is often used in the literature in connection with both goals.

16.2 The Abstract Model - Statistical Databases, Queries, and Sanitizers

As the underlying model for our discussion of privacy we will consider a simple abstract model that we will refer to as a statistical database. Roughly speaking, a statistical database is a centralized database, controlled by a single trusted party, that interacts with users who wish to issue queries to the database. We note, however, that our definitions and results carry over also to many other settings - in particular, settings where the data is distributed among several parties, and settings where the collection of individual data does not physically or formally constitute a database.

Definition 16.1 (Statistical Database) A statistical database x of size n over domain D is an ordered collection of n entries x = (x_1, ..., x_n), where each entry is taken from the domain D.

The definition of statistical databases is very general. In particular, the domain D can be points in R^d, text, images, or any other (arbitrarily complex) set of possible entries. Furthermore, we do not make any assumption regarding how the entries x_i of the database are selected (i.e. whether the database entries are sampled from some underlying distribution, whether the entries are independent of each other, etc.).

As a means of accessing the information stored in a statistical database x, we will assume the existence of an algorithmic mechanism that has access to the statistical database x. We call this mechanism a sanitizer, emphasizing its goal of preserving the privacy of the underlying data by only releasing answers from which the dependency on individual information was "cleared". Users access the statistical database by issuing queries to the sanitizer, where a query to a database x is any function f defined on D^n. For simplicity of exposition, we will only consider real valued functions, i.e., f : D^n → R^d. We note, however, that the techniques we present in the sequel do generalize to other metric spaces.

Example 16.2 (Sum Queries) A family of queries that turns out to be extremely useful is that of sum queries. These are queries of the form

sum_g(x) = \sum_{i=1}^{n} g(i, x_i),   (16.1)

where g : N × D → [0, 1]. Sum queries allow expressing basic statistical functions (such as counts, averages, etc.) as well as more complicated computations.
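Concretely, a sum query is a one-line computation. The sketch below (Python; the function name and toy database are illustrative, not part of the model) also shows how counts and averages arise as special cases.

def sum_query(x, g):
    # x: a statistical database, an ordered collection of entries from D.
    # g: a function N x D -> [0, 1]; note g sees the record 'identity' i.
    return sum(g(i, xi) for i, xi in enumerate(x, start=1))

# Counting entries above a threshold is a sum query with an indicator g:
x = [0.3, 0.9, 0.7, 0.1]
count = sum_query(x, lambda i, xi: 1.0 if xi > 0.5 else 0.0)   # 2.0
average = sum_query(x, lambda i, xi: xi) / len(x)              # 0.5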
Example 16.3 Consider a database held and maintained by a hospital, containing patient information as depicted below.

#   SSN           Sex   Age   Disease   Smoking
1   631-35-1210   M     41    Heart     Yes
2   051-34-1430   F     32    Cancer    No
...
n   615-84-1924   M     37    Viral     No

We can view this database as a statistical database, where the domain D corresponds to the possible values for a record of format (SSN, Sex, Age, Disease, Smoking). Many different queries may be generated by specifically setting the function g in a sum query. Such queries may, e.g., be used for comparing the odds of having cancer for smokers and non-smokers. Let

g_1(x) ≜ (x.Smoking = Yes),
g_2(x) ≜ (x.Smoking = Yes) ∧ (x.Disease = Cancer),
g_3(x) ≜ (x.Smoking = No) ∧ (x.Disease = Cancer).

The odds are sum_{g_2}(x)/sum_{g_1}(x) and sum_{g_3}(x)/(n − sum_{g_1}(x)) for smokers and non-smokers respectively.

Note that unlike a common practice in the privacy literature (see e.g. [31, 32]), we do not assume a classification of record fields into identifying and sensitive. An implication of this choice is that all parts of individual information are treated as sensitive. This may seem an over-conservative choice. However, it saves the need to decide which information is sensitive, and protects against the risk that harmless looking pieces of information, or the relationships between them, would eventually be linked to sensitive information (e.g. using datamining techniques), and hence become sensitive themselves.

Furthermore, we allow queries to directly address individuals in the database. For example, in the definition of sum queries (Equation 16.1) the function g is explicitly given the record 'identity' i, and hence distinct functions g_i(·) = g(i, ·) may be applied to distinct individual records. One implication of this choice is that in our model privacy is not a derivative of anonymity - privacy has to be maintained when the attacker is able to separately address in his query function each of the individual contributors to the database, i.e. even if anonymity is breached.

A large collection of techniques for constructing sanitizers appears in the literature, and we refer the reader to the survey in [1] for a classification of sanitization techniques. Roughly speaking, sanitizers may decide not to answer some queries, and to modify query results. We will restrict our attention to sanitizers that preserve privacy by adding noise to query answers, so as to mask out the effects of individual records, but still leave global trends visible. This intuitively appealing technique is commonly referred to as output perturbation. The answer given by the output perturbation sanitizer on a query f is distributed according to

San(x, f) = f(x) + Y,

where Y - often referred to as noise - is a random variable taken from a probability distribution N. In general, the noise distribution N may depend on the query and on the actual values stored in the statistical database, i.e. N = N(f, x). In most of our discussion, however, we will consider probability distributions that depend on the query type, but do not depend on the actual values stored in the statistical database.²

We will not touch upon questions of how statistical databases and their sanitizers are actually implemented, but rather on their functionality. A typical example where the model of statistical database directly applies is the database of information collected by statistical agencies such as the U.S. Census Bureau.
²Note that when N is a function of x, special care has to be taken, as the noise itself may become an unexpected source of information leakage. See [25] and Section 16.6.1.

Similarly, collections of individual data records collected and maintained by health care organizations, financial organizations, search engines, etc. may be viewed as statistical databases. As noted above, our results also apply to distributed setups, e.g. by reducing a distributed setup to a centralized setup using standard cryptographic techniques of Secure Multiparty Computation [34, 21, 7, 5]. Efficient secure multiparty computation protocols may be designed for specific sanitizers, as in [15].

16.3 Privacy

In an attack on the privacy of a statistical database, an adversarial attacker that has complete knowledge of the sanitizer algorithm and privacy parameters communicates with the sanitizer, issuing queries f_1, f_2, ... and receiving answers a_1, a_2, ..., where a_i is distributed according to San(x, f_i). The issue of attack detection, if at all possible, is beyond the scope of our discussion, and we assume the sanitizer answers all queries as if it were communicating with a legitimate user. The attacker may choose the queries adaptively, i.e. the choice of query f_{i+1} may depend on the answers a_1, ..., a_i to the previous queries. The definition we present in this section captures the requirement that individual privacy is preserved even in the presence of any such attacker.

Definition 16.4 (Hamming Distance, Neighbor Databases) The Hamming distance between two databases of the same size is defined as the number of entries on which they differ:

dist_H(x, x′) = |{i : x_i ≠ x′_i}|.

Two databases that differ on a single individual entry, i.e. x, x′ such that dist_H(x, x′) = 1, are called neighbor databases.

We can now state our privacy definition. It is reminiscent of (and was inspired by) the notion of indistinguishability of ciphertexts introduced by Goldwasser and Micali [20] in the context of probabilistic encryption. Informally, a sanitizer is private if no adversary A gains significant knowledge about an individual entry of the statistical database beyond what A could have learned by interacting with a similar (neighbor) database where that individual entry is arbitrarily modified, or removed. This is formalized as a requirement that for all pairs of neighbor databases x, x′, and all possible sanitizer answers, the probability that an adversary obtains a specific answer when interacting with the sanitizer on the database x is within an e^ε multiplicative factor of the probability that the same answer is obtained on x′, where ε > 0 - the privacy parameter - is chosen by the privacy policy.

Definition 16.5 (ε-privacy (ε-differential privacy) [16]) A sanitizer San is ε-private if for all neighbor statistical databases x, x′ ∈ D^n, and for all subsets T of possible answers (i.e. subsets of the support of San(·)):

Pr[San(x) ∈ T] / Pr[San(x′) ∈ T] ≤ e^ε.   (16.2)

The probability is taken over the coin tosses of the sanitizer.

In similarity to the security parameter of cryptographic primitives, the parameter ε controls the leakage of information about individual entries of the statistical database. When ε is small, e^ε ≈ 1 + ε, and hence the requirement is roughly that for all sets of possible transcripts T the probability of San(x) ∈ T is about the same as that of San(x′) ∈ T.
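To get a feel for Definition 16.5, the following sketch (Python with numpy; the Laplace-noise sanitizer used here is formally introduced only later in the chapter, and all numbers are illustrative) checks the ratio of Equation 16.2 at the density level for a count query on neighbor databases.

import numpy as np

eps = 0.1
count_x, count_x_prime = 42.0, 43.0     # a count query on neighbor databases
scale = 1.0 / eps                       # Laplace noise scaled to sensitivity 1

def density(t, center):                 # h_x(t) for San(x) = count + Lap(scale)
    return np.exp(-np.abs(t - center) / scale) / (2 * scale)

t = np.linspace(0.0, 80.0, 9)
ratios = density(t, count_x) / density(t, count_x_prime)
# |t - 42| - |t - 43| lies in [-1, 1], so every ratio is within e^{+-eps}:
assert np.all(ratios <= np.exp(eps) + 1e-12)
assert np.all(ratios >= np.exp(-eps) - 1e-12)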
An immediate consequence of Definition 16.5 is that the sanitizer San cannot be deterministic (unless it computes a constant function). Otherwise, there would exist neighbor databases x, x′ and an answer t in the support of San(·) such that San(x) = t but San(x′) ≠ t, and hence the ratio Pr[San(x) = t] / Pr[San(x′) = t] is unbounded.

Notation. For simplicity of exposition, we will only consider sanitizers where San(x) is sampled from a continuous distribution. We will use the notation h^San_x(t) for the probability density function of this distribution, and h^San_x(t|A) for this probability density function conditioned on the event A. We will usually abuse notation and write h_x for h^San_x. When San(x) is sampled from a continuous distribution h^San_x, we can state a requirement equivalent to Equation 16.2:

h^San_x(t) / h^San_{x′}(t) ≤ e^ε for all possible sanitizer answers t.   (16.3)

Note 16.6 Readers familiar with the cryptographic notion of indistinguishability of ciphertexts might have expected the requirement in Equation 16.2 to be that the distributions San(x) and San(x′) would be statistically close. In our setting, however, the difference between these distributions should not be negligible, as a negligible difference would disallow any utility.³ When the difference ε is not negligible, the requirement of statistical difference ε is insufficient - it is possible to have two distributions San(x) and San(x′) where with probability Θ(ε) an attacker receives an answer a ∈_R San(x) that is not in the support of San(x′) (or vice versa), and hence is able to tell these cases apart. For small ε the multiplicative requirement of Equation 16.2 is more stringent.

³This follows by a standard hybrid argument, noting that for any two statistical databases x, x′ there exist m ≤ n + 1 statistical databases x = x_1, x_2, ..., x_m = x′ such that x_i, x_{i+1} are neighbor databases for all 1 ≤ i < m.

... Pr[‖A(x) − f(x)‖_1 ≤ σ] > β = (1 + α)/2 and Pr[A reads x_j] ≤ α for all 1 ≤ j ≤ n.

Assume x, x′ differ on the jth entry. Denote by A_{−j}(x) the distribution on A(x) conditioned on A not reading x_j. We get that

Pr[‖A_{−j}(x) − f(x)‖_1 ≤ σ] > (β − α)/(1 − α) ≥ 1/2.

The same argument holds for x′. As A_{−j}(x) and A_{−j}(x′) are equally distributed, we get, using the union bound, that

Pr[‖A_{−j}(x) − f(x)‖_1 > σ or ‖A_{−j}(x) − f(x′)‖_1 > σ] < 1/2 + 1/2 = 1.

Hence, there exists a point p ∈ R^d in the support of A satisfying ‖p − f(x)‖_1 ≤ σ and ‖p − f(x′)‖_1 ≤ σ, implying ‖f(x) − f(x′)‖_1 ≤ 2σ. As the above argument holds for every two databases that differ on a single entry, we get that GS_f ≤ 2σ.

16.5 Constructing Sanitizers for Complex Functionalities

As we have seen above, Theorem 16.10 directly yields output perturbation sanitizers for functions whose global sensitivity can be analyzed, and turns out to be low. For many functions, however, a direct calculation of global sensitivity is complicated (sometimes computationally intractable), or yields high global sensitivity, even when the function is expected to be insensitive for typical inputs. Lemma 16.7 suggests a partial remedy to these problems (we discuss other techniques in Section 16.6). It implies that simple functions that exhibit low global sensitivity may be combined in algorithms computing more complex functions. Suppose algorithm A is constructed so that it behaves as if its input is stored in a statistical database, and accesses it at most q times by simulating ε′-private sanitizers San_1, ..., San_q where ε′ = ε/q; then the outcome of algorithm A is assured to preserve ε-privacy.
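The following sketch (Python with numpy) illustrates both ingredients just described: an output perturbation sanitizer that adds Laplace noise calibrated to a query's global sensitivity, in the spirit of Theorem 16.10, and the ε′ = ε/q composition. The helper names and the two example queries are illustrative, not the chapter's.

import numpy as np

def laplace_sanitizer(x, f, global_sensitivity, eps, rng=np.random.default_rng()):
    # San(x, f) = f(x) + Y with Y ~ Lap(GS_f / eps) in every coordinate.
    answer = np.asarray(f(x), dtype=float)
    return answer + rng.laplace(scale=global_sensitivity / eps, size=answer.shape)

# Composition: q = 2 queries, each answered (eps/q)-privately, jointly
# preserve eps-privacy (Lemma 16.7). Entries lie in [0, 1], so the mean
# and the fraction-above-threshold each have global sensitivity 1/n.
x = np.random.default_rng(1).uniform(size=1000)
eps, q = 0.5, 2
noisy_avg = laplace_sanitizer(x, lambda d: d.mean(), 1.0 / len(x), eps / q)
noisy_frac = laplace_sanitizer(x, lambda d: (d > 0.5).mean(), 1.0 / len(x), eps / q)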
We demonstrate this idea by presenting two types of results. In Sections 16.5.1 and 16.5.2 we modify well known machine learning algorithms - k-means, Singular Value Decomposition and Principal Component Analysis - so that the resulting algorithms preserve ε-privacy. The input to these algorithms is a collection of n points p_1, ..., p_n ∈ R^d, where each point corresponds to an individual's information. While the original algorithms may access their input in a point by point manner, the modified algorithms access their input via a small number q of insensitive queries. The exact answers to these queries are replaced with noisy answers so that each answer preserves ε′-privacy. For that, we view the collection of n points as a statistical database, where each database entry consists of a point. (See [6] for the private version of other machine learning algorithms - the Perceptron Algorithm, and constructing ID3 classification trees.)

The last result of this section is a more general result, translating a large family of algorithms into their ε-private version, while retaining their accuracy. A little more specifically, the result in Section 16.5.3 shows a strong connection between learning and privacy - any learning task that can be performed in the statistical queries learning model of Kearns [22] can also be performed while preserving ε-privacy.

Note. In our analysis, we will need to bound the location of the input points, and will assume they satisfy ‖p_i‖_1 ≤ γ for all 1 ≤ i ≤ n.

16.5.1 k-Means Clustering

Clustering is the task of partitioning n data points p_1, ..., p_n into k disjoint sets of 'similar' points. One approach to solving this problem is known as Lloyd's Algorithm. This algorithm iteratively updates k cluster centers c_1, ..., c_k by moving each center to the mean of the points that are closer to it than to the other centers.

k-Means Iteration:
Input: points p_1, ..., p_n ∈ R^d, and centers c_1, ..., c_k ∈ R^d.
1 [Partition the points into k sets] S_j ← {p_i : c_j is the closest center to p_i}; let s_j ← |S_j|.
2 [Move each center to the mean of its associated points] for 1 ≤ j ≤ k: let m_j ← Σ_{i∈S_j} p_i, and set c′_j ← m_j / s_j.

This rule is repeated either for a fixed number of iterations, or until a convergence criterion is satisfied.

At first sight it may seem that the k-Means Iteration cannot be implemented privately. In particular, unless noise renders it useless, a partitioning of the points according to their nearest centers would breach privacy. However, an equivalent computation can be performed without revealing the partitioning. Using our hist and subsets queries and setting g : R^d → R^d to be the identity function g(p) ≜ p, and q : R^d → [k] to be the function associating points to their centers, i.e.,

q(p) ≜ the j ∈ [k] such that dist(p, c_j) ≤ dist(p, c_i) for all i ∈ [k],

we can rewrite an equivalent algorithm as:

Modified k-Means Iteration:
Input: points p_1, ..., p_n ∈ R^d, and centers c_1, ..., c_k ∈ R^d.
1 [Compute the number of points in each of the sets S_j] (s_1, ..., s_k) ← hist_q(p_1, ..., p_n).
2 [Compute the sum of points in each of the sets S_j] (m_1, ..., m_k) ← subsets_{q,g}(p_1, ..., p_n).
3 [Update each mean] for 1 ≤ j ≤ k: c′_j ← m_j / s_j.

As our last step, we replace hist and subsets with their noisy versions, adding Laplace noise to each coordinate according to our analysis in the previous section:

s̃_j = s_j + ŝ_j, where ŝ_j ∼ Lap(2/ε′), and
m̃_j = m_j + m̂_j, where m̂_j ∼ Lap(4γ/ε′)^d,

where ε′ = ε/2.
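A sketch of one noisy iteration in Python with numpy, following the noise scales above (2/ε′ for the counts, 4γ/ε′ per coordinate for the sums, ε′ = ε/2); the function name and data layout are illustrative. Note that the partition is computed internally but never released - only the noisy counts and sums leave the sanitizer.

import numpy as np

def private_kmeans_iteration(P, C, eps, gamma, rng=np.random.default_rng()):
    # P: n x d points with ||p_i||_1 <= gamma; C: k x d current centers.
    eps_prime = eps / 2.0                # one half of eps per noisy query
    k, d = C.shape
    labels = np.argmin(np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2), axis=1)
    s = np.array([(labels == j).sum() for j in range(k)], dtype=float)
    m = np.array([P[labels == j].sum(axis=0) for j in range(k)])
    s_tilde = s + rng.laplace(scale=2.0 / eps_prime, size=k)               # noisy hist
    m_tilde = m + rng.laplace(scale=4.0 * gamma / eps_prime, size=(k, d))  # noisy subsets
    return m_tilde / s_tilde[:, None]    # new centers c'_j = m_j / s_j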
Appealing to Lemma 16.7, the outcome of the modified algorithm preserves ε-privacy. We get that, as long as the number of points in each cluster is large, s̃_j is a good estimate of s_j, and hence c̃′_j is very close to the c′_j of the non-private computation. A little more formally:

Lemma 16.11 For each 1 ≤ j ≤ k, if s_j ≫ 1/ε then with high probability

‖c̃′_j − c′_j‖_1 = O((‖c′_j‖_1 + γd) / (ε s_j)).

Proof:

‖c̃′_j − c′_j‖_1 = ‖m̃_j/s̃_j − m_j/s_j‖_1 = ‖(m_j + m̂_j)/s̃_j − m_j/s_j‖_1
≤ ‖m_j/s_j‖_1 · |s_j − s̃_j|/s̃_j + ‖m̂_j‖_1 · 1/s̃_j = ‖c′_j‖_1 · |s_j − s̃_j|/s̃_j + ‖m̂_j‖_1 · 1/s̃_j.

From our assumption that s_j ≫ 1/ε, we get that with high probability |(s_j − s̃_j)/s̃_j| = O(1/(ε s_j)) and ‖m̂_j‖_1/s̃_j = O(γd/(ε s_j)). The lemma follows.

16.5.2 SVD and PCA

Many datamining algorithms treat their data points p_1, ..., p_n ∈ R^d as a d × n matrix A (whose columns correspond to the points), and analyze the top k eigenvectors of the matrix AA^T. This analysis can be performed while preserving ε-privacy. Notice that

AA^T = \sum_{i=1}^{n} p_i p_i^T.

Hence, an analysis similar to that of cov leads to the following natural algorithm:

SVD:
Input: the matrix A ∈ R^{d×n} and a parameter 0 < ...

... Pr[|Y| > z] = e^{−z/λ}. Hence, the probability that in the ith iteration |Y| > nτ_i is bounded by e^{−nτ_i ε′} = δ/q. Using the union bound, the probability that in any of the iterations |Y| > nτ_i is bounded by δ. Finally, using Lemma 16.7 we get that the outcome of A preserves ε-privacy.

The importance of Theorem 16.12 is in its generality. Although it would probably not yield the most efficient algorithm for specific learning tasks (e.g. in terms of the number of samples needed), it shows that an important collection of learning problems can be solved while preserving ε-privacy.

16.6 Beyond the Basics

As we have seen in Section 16.4.1, Theorem 16.10 directly yields simple output perturbation sanitizers for a variety of functions - those which exhibit low global sensitivity. However, in some cases Theorem 16.10 cannot be directly used, e.g. when one is interested in a query f that does not exhibit low global sensitivity (when compared with the magnitude of f(x)), or the global sensitivity of f is hard to analyze (or intractable), or when the range of f does not lend itself to a natural metric. In Section 16.5 we have seen one technique to get around these shortcomings, by expressing complex functionalities in terms of simple, insensitive ones that are easy to analyze.

We review some of the more recent techniques for creating algorithms that preserve ε-privacy. The presentation of this section is not self contained, as we only attempt to present the main ideas.

16.6.1 Instance Based Noise and Smooth Sensitivity

The framework of Theorem 16.10 considers the global, i.e. worst-case, sensitivity of the query function f. However, for many interesting functions, the worst-case sensitivity is high due to instances that do not typically occur in practice. As an example, consider the median function:

Example 16.13 (Median) Let x_1, ..., x_n be real numbers taken from a bounded interval [0, 1]. The median of x = x_1, ..., x_n is its middle ranked element. Assuming (for simplicity) that n is odd, and that x_1 ≤ x_2 ≤ ··· ≤ x_n, we can write med(x) = x_{(n+1)/2}. Although med is usually considered insensitive, it exhibits high global sensitivity. To see that, consider the case where x_1 = ··· = x_{(n+1)/2} = 0 and x_{(n+1)/2+1} = ··· = x_n = 1. Note that med(x_1, ..., x_n) = 0, and that by setting x_{(n+1)/2} = 1 we get med(x_1, ..., x_n) = 1. Hence, GS_med = 1.
Applying Theorem 16.10 hence results in noise of magnitude GS_med/ε that, for small ε, completely destroys the information.

A first natural attempt at fixing this problem is to consider a local variant of Equation 16.5, and perturb the query function result with noise proportional to it:

LS_f(x) = \max_{x′ : dist_H(x, x′) = 1} \|f(x) − f(x′)\|_1.

(Observe that GS_f = \max_x LS_f(x).) This attempt fails, as we show now for the median.

Example 16.13 (Median (cont.)) It is easy to see that given an instance x, the maximum change in med(x) occurs when x_1 is set to 1 or when x_n is set to 0. This observation yields an expression for the local sensitivity in terms of the values next to the median:

LS_med(x) = max( x_{(n+1)/2} − x_{(n+1)/2−1}, x_{(n+1)/2+1} − x_{(n+1)/2} ).

For inputs where a constant fraction of the population is uniformly concentrated around the median we get LS_med(x) ∝ (1/n) · GS_med.

Releasing med(x) with noise sampled from Lap(LS_med(x)/ε) fails to satisfy Definition 16.5. For instance, the probability of receiving a non-zero answer when x_1 = ··· = x_{(n+1)/2+1} = 0 and x_{(n+1)/2+2}, ..., x_n > 0 is zero, whereas the probability of a non-zero answer on its neighbor database where x_{(n+1)/2+1} > 0 is one.

This example illustrates that special care has to be taken when adding instance based noise. As the noise is correlated with the instance x, it may itself be the cause of information leakage. To prevent this kind of leakage, a variant of local sensitivity - smooth sensitivity - was defined in [25], such that adding noise proportional to the smooth sensitivity at x is safe. Unlike local sensitivity, smooth sensitivity does not change abruptly as x changes, and hence an adversary cannot distinguish well the noise distributions on neighbor databases, as happened in the example above. We will only present the definition of smooth sensitivity, without getting into the details of constructing output perturbation sanitizers with instance based noise.⁷

Definition 16.14 (Smooth Sensitivity) An ε-smooth upper bound on LS_f is a function S_f satisfying

S_f(x) ≥ LS_f(x) for all databases x; and
S_f(x) ≤ e^ε S_f(x′) for all neighbor databases x, x′.

Clearly, S_f(x) = GS_f is an ε-smooth upper bound on LS_f, but the definition allows for cases where S_f(x) ≪ GS_f, and hence a gain with respect to Theorem 16.10.

It turns out that a minimal ε-smooth upper bound on LS_f exists. This function is called the ε-smooth sensitivity of f and satisfies, for every smooth upper bound S_f on LS_f, S*_f(x) ≤ S_f(x) for all x ∈ D^n. It can be shown that

S*_f(x) = \max_{x′ ∈ D^n} ( LS_f(x′) · e^{−ε · dist_H(x, x′)} ).   (16.7)

Equation 16.7 implies that low noise may be added at the instance x if the local sensitivity in its 'neighborhood' is low (i.e. LS_f(x′) is low for those instances x′ where dist_H(x, x′) is small), as the influence of far instances decays exponentially with dist_H(x, x′). Computing S*_f(x) may prove to be tricky, and if an approximation to S*_f(x) is used for the noise magnitude, it has to be a smooth upper bound on LS_f by itself. We omit these details, and refer the reader to [25], where it is shown how to compute S*_f(x) for queries like median, minimum, and graph problems such as MST cost and the number of triangles in a graph.

⁷The technicalities include (i) a relaxation of Definition 16.5 where breaches may occur with negligible probability, and (ii) conditions on the noise process (in analogy to Equation 16.6).
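For the median, Equation 16.7 admits an efficient evaluation: at Hamming distance k, an adversary can replace k entries around the median, so the relevant local sensitivity is the widest gap among k+2 consecutive order statistics, padded with the interval endpoints 0 and 1. The sketch below (Python with numpy) is a reconstruction along the lines of [25], not the authors' code; it assumes odd n and entries in [0, 1].

import numpy as np

def smooth_sensitivity_median(x, eps):
    # epsilon-smooth sensitivity S*_med(x), per Equation 16.7, x in [0,1]^n.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m = (n - 1) // 2                     # 0-based index of the median (n odd)
    pad = np.concatenate([np.zeros(n), x, np.ones(n)])   # clamp to [0, 1]
    m += n
    s_star = 0.0
    for k in range(n + 1):               # Hamming distance k from x
        # Largest LS_med over databases at distance k: the widest window
        # pad[m+t] - pad[m+t-k-1] over 0 <= t <= k+1.
        width = max(pad[m + t] - pad[m + t - k - 1] for t in range(k + 2))
        s_star = max(s_star, np.exp(-eps * k) * width)   # e^{-eps k} decay
    return s_star

# At k = 0 the inner maximum is exactly LS_med(x) from Example 16.13.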
16.6.2 The Sample-Aggregate Framework

The sample-aggregate framework of [25] is a generic technique for creating a 'smoothed' version f̄ of a query function f. Assume f(x) is a function that can be well approximated on random samples taken from x_1, ..., x_n. We abuse notation and write f(S) for the approximation of f(x) where S ⊂ {x_1, ..., x_n}, although f is formally defined to take an n-tuple as input. The function f is evaluated on several random samples, and the results of these evaluations f(S_1), ..., f(S_t) are combined using an aggregation function g:

f̄ = g(f(S_1), ..., f(S_t)).

The main observation is that to preserve privacy it is sufficient to add noise whose magnitude depends on the smooth sensitivity of the aggregation function g. To illustrate why privacy would be preserved, assume (for simplicity) that each entry from x_1, ..., x_n appears in exactly one of the samples S_1, ..., S_t. As Definition 16.5 is only concerned with neighbor databases, we only need to care about a change in a single entry x_i ∈ S_j. However, a change in x_i may only affect one of the inputs to g (i.e. f(S_j)). Even if the change in f(S_j) is significant, it is enough to mask it by adding noise proportional to the smooth sensitivity of g. (The complete argument is a little more involved. In particular, x_i may appear in several of the subsets.)

The crux of this technique is finding good aggregation functions, i.e. functions g whose outcome (plus the required noise) would faithfully represent f(S_1), ..., f(S_t). In particular, when f(S_1), ..., f(S_t) are well concentrated (or 'clustered'), the aggregation g(f(S_1), ..., f(S_t)) should return a point that is close to the cluster center, and the noise level should be low. Furthermore, we would like g, and its smooth sensitivity, to be efficiently computable. An aggregation function satisfying these requirements - the center of attention - was proposed in [25].

The sample-aggregate technique was applied to Lloyd's algorithm, and to the problem of learning the parameters of a mixture of k spherical Gaussian distributions when the data x consists of polynomially-many (in the dimension and k) i.i.d. samples from the distribution. As with the result of Section 16.5.3, an application of sample-aggregate need not always result in the optimal sanitizer. It serves, however, as a st...
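A toy sketch of the framework in Python with numpy, using the median as the aggregation function g in place of the center of attention, and reusing the smooth-sensitivity routine sketched in Section 16.6.1. Two simplifications are mine, not from [25]: f is assumed to map into [0, 1], and plain Laplace noise (with an illustrative 2/ε factor) stands in for the admissible noise distribution that [25] pairs with smooth sensitivity.

import numpy as np

def sample_aggregate(x, f, t, eps, rng=np.random.default_rng()):
    # Evaluate f on t disjoint random blocks of x, aggregate with the
    # median (standing in for the center-of-attention aggregator), and
    # add noise scaled to the smooth sensitivity of the aggregator.
    x = rng.permutation(np.asarray(x, dtype=float))
    evals = np.array([f(block) for block in np.array_split(x, t)])
    s_star = smooth_sensitivity_median(evals, eps)   # from Section 16.6.1
    return np.median(evals) + rng.laplace(scale=2.0 * s_star / eps)

A change in one database entry perturbs only one block's evaluation, so when the f(S_j) are well concentrated the median barely moves and s_star stays small, which is exactly the intuition of the framework.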