Fundamental Data Structures

Contents

1 Introduction
  1.1 Abstract data type
    1.1.1 Examples
    1.1.2 Introduction
    1.1.3 Defining an abstract data type
    1.1.4 Advantages of abstract data typing
    1.1.5 Typical operations
    1.1.6 Examples
    1.1.7 Implementation
    1.1.8 See also
    1.1.9 Notes
    1.1.10 References
    1.1.11 Further
    1.1.12 External links
  1.2 Data structure
    1.2.1 Overview
    1.2.2 Examples
    1.2.3 Language support
    1.2.4 See also
    1.2.5 References
    1.2.6 Further reading
    1.2.7 External links
  1.3 Analysis of algorithms
    1.3.1 Cost models
    1.3.2 Run-time analysis
    1.3.3 Relevance
    1.3.4 Constant factors
    1.3.5 See also
    1.3.6 Notes
    1.3.7 References
  1.4 Amortized analysis
    1.4.1 History
    1.4.2 Method
    1.4.3 Examples
    1.4.4 Common use
    1.4.5 References
  1.5 Accounting method
    1.5.1 The method
    1.5.2 Examples
    1.5.3 References
  1.6 Potential method
    1.6.1 Definition of amortized time
    1.6.2 Relation between amortized and actual time
    1.6.3 Amortized analysis of worst-case inputs
    1.6.4 Examples
    1.6.5 Applications
    1.6.6 References

2 Sequences
  2.1 Array data type
    2.1.1 History
    2.1.2 Abstract arrays
    2.1.3 Implementations
    2.1.4 Language support
    2.1.5 See also
    2.1.6 References
    2.1.7 External links
  2.2 Array data structure
    2.2.1 History
    2.2.2 Applications
    2.2.3 Element identifier and addressing formulas
    2.2.4 Efficiency
    2.2.5 Dimension
    2.2.6 See also
    2.2.7 References
  2.3 Dynamic array
    2.3.1 Bounded-size dynamic arrays and capacity
    2.3.2 Geometric expansion and amortized cost
    2.3.3 Growth factor
    2.3.4 Performance
    2.3.5 Variants
    2.3.6 Language support
    2.3.7 References
    2.3.8 External links
  2.4 Linked list
    2.4.1 Advantages
    2.4.2 Disadvantages
    2.4.3 History
    2.4.4 Basic concepts and nomenclature
    2.4.5 Tradeoffs
    2.4.6 Linked list operations
    2.4.7 Linked lists using arrays of nodes
    2.4.8 Language support
    2.4.9 Internal and external storage
    2.4.10 Related data structures
    2.4.11 Notes
    2.4.12 Footnotes
    2.4.13 References
    2.4.14 External links
  2.5 Doubly linked list
    2.5.1 Nomenclature and implementation
    2.5.2 Basic algorithms
    2.5.3 Advanced concepts
    2.5.4 See also
    2.5.5 References
  2.6 Stack (abstract data type)
    2.6.1 History
    2.6.2 Non-essential operations
    2.6.3 Software stacks
    2.6.4 Hardware stacks
    2.6.5 Applications
    2.6.6 Security
    2.6.7 See also
    2.6.8 References
    2.6.9 Further reading
    2.6.10 External links
  2.7 Queue (abstract data type)
    2.7.1 Queue implementation
    2.7.2 Purely functional implementation
    2.7.3 See also
    2.7.4 References
    2.7.5 External links
  2.8 Double-ended queue
    2.8.1 Naming conventions
    2.8.2 Distinctions and sub-types
    2.8.3 Operations
    2.8.4 Implementations
    2.8.5 Language support
    2.8.6 Complexity
    2.8.7 Applications
    2.8.8 See also
    2.8.9 References
    2.8.10 External links
  2.9 Circular buffer
    2.9.1 Uses
    2.9.2 How it works
    2.9.3 Circular buffer mechanics
    2.9.4 Optimization
    2.9.5 Fixed-length-element and contiguous-block circular buffer
    2.9.6 External links

3 Dictionaries
  3.1 Associative array
    3.1.1 Operations
    3.1.2 Example
    3.1.3 Implementation
    3.1.4 Language support
    3.1.5 Permanent storage
    3.1.6 See also
    3.1.7 References
    3.1.8 External links
  3.2 Association list
    3.2.1 Operation
    3.2.2 Performance
    3.2.3 Applications and software libraries
    3.2.4 See also
    3.2.5 References
  3.3 Hash table
    3.3.1 Hashing
    3.3.2 Key statistics
    3.3.3 Collision resolution
    3.3.4 Dynamic resizing
    3.3.5 Performance analysis
    3.3.6 Features
    3.3.7 Uses
    3.3.8 Implementations
    3.3.9 History
    3.3.10 See also
    3.3.11 References
    3.3.12 Further reading
    3.3.13 External links
  3.4 Linear probing
    3.4.1 Operations
    3.4.2 Properties
    3.4.3 Analysis
    3.4.4 Choice of hash function
    3.4.5 History
    3.4.6 References
  3.5 Quadratic probing
    3.5.1 Quadratic function
    3.5.2 Quadratic probing insertion
    3.5.3 Quadratic probing search
    3.5.4 Limitations
    3.5.5 See also
    3.5.6 References
    3.5.7 External links
  3.6 Double hashing
    3.6.1 Classical applied data structure
    3.6.2 Implementation details for caching
    3.6.3 See also
    3.6.4 Notes
    3.6.5 External links
  3.7 Cuckoo hashing
    3.7.1 History
    3.7.2 Operation
    3.7.3 Theory
    3.7.4 Example
    3.7.5 Variations
    3.7.6 Comparison with related structures
    3.7.7 See also
    3.7.8 References
    3.7.9 External links
  3.8 Hopscotch hashing
    3.8.1 See also
    3.8.2 References
    3.8.3 External links
  3.9 Hash function
    3.9.1 Uses
    3.9.2 Properties
    3.9.3 Hash function algorithms
    3.9.4 Locality-sensitive hashing
    3.9.5 Origins of the term
    3.9.6 List of hash functions
    3.9.7 See also
    3.9.8 References
    3.9.9 External links
  3.10 Perfect hash function
    3.10.1 Application
    3.10.2 Construction
    3.10.3 Space lower bounds
    3.10.4 Extensions
    3.10.5 Related constructions
    3.10.6 References
    3.10.7 Further reading
    3.10.8 External links
  3.11 Universal hashing
    3.11.1 Introduction
    3.11.2 Mathematical guarantees
    3.11.3 Constructions
    3.11.4 See also
    3.11.5 References
    3.11.6 Further reading
    3.11.7 External links
  3.12 K-independent hashing
    3.12.1 Background
    3.12.2 Definitions
    3.12.3 Techniques
    3.12.4 Independence needed by different hashing methods
    3.12.5 References
    3.12.6 Further reading
  3.13 Tabulation hashing
    3.13.1 Method
    3.13.2 History
    3.13.3 Universality
    3.13.4 Application
    3.13.5 Extensions
    3.13.6 Notes
    3.13.7 References
  3.14 Cryptographic hash function
    3.14.1 Properties
    3.14.2 Illustration
    3.14.3 Applications
    3.14.4 Hash functions based on block ciphers
    3.14.5 Merkle–Damgård construction
    3.14.6 Use in building other cryptographic primitives
    3.14.7 Concatenation
    3.14.8 Cryptographic hash algorithms
    3.14.9 See also
    3.14.10 References
    3.14.11 External links

4 Sets
  4.1 Set (abstract data type)
    4.1.1 Type theory
    4.1.2 Operations
    4.1.3 Implementations
    4.1.4 Language support
    4.1.5 Multiset
    4.1.6 See also
    4.1.7 Notes
    4.1.8 References
  4.2 Bit array
    4.2.1 Definition
    4.2.2 Basic operations
    4.2.3 More complex operations
    4.2.4 Compression
    4.2.5 Advantages and disadvantages
    4.2.6 Applications
    4.2.7 Language support
    4.2.8 See also
    4.2.9 References
    4.2.10 External links
  4.3 Bloom filter
    4.3.1 Algorithm description
    4.3.2 Space and time advantages
    4.3.3 Probability of false positives
    4.3.4 Approximating the number of items in a Bloom filter
    4.3.5 The union and intersection of sets
    4.3.6 Interesting properties
    4.3.7 Examples
    4.3.8 Alternatives
    4.3.9 Extensions and applications
    4.3.10 See also
    4.3.11 Notes
    4.3.12 References
    4.3.13 External links
  4.4 MinHash
    4.4.1 Jaccard similarity and minimum hash values
    4.4.2 Algorithm
    4.4.3 Min-wise independent permutations
    4.4.4 Applications
    4.4.5 Other uses
    4.4.6 Evaluation and benchmarks
    4.4.7 See also
    4.4.8 References
    4.4.9 External links
  4.5 Disjoint-set data structure
    4.5.1 Disjoint-set linked lists
    4.5.2 Disjoint-set forests
    4.5.3 Applications
    4.5.4 History
    4.5.5 See also
    4.5.6 References
    4.5.7 External links
  4.6 Partition refinement
    4.6.1 Data structure
    4.6.2 Applications
    4.6.3 See also
    4.6.4 References

5 Priority queues
  5.1 Priority queue
    5.1.1 Operations
    5.1.2 Similarity to queues
    5.1.3 Implementation
    5.1.4 Equivalence of priority queues and sorting algorithms
    5.1.5 Libraries
    5.1.6 Applications
    5.1.7 See also
    5.1.8 References
    5.1.9 Further reading
    5.1.10 External links
  5.2 Bucket queue
    5.2.1 Basic data structure
    5.2.2 Optimizations
    5.2.3 Applications
    5.2.4 References
  5.3 Heap (data structure)
. . . . . . . . . . . . . 133 5.3.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 5.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 5.3.3 Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 5.3.4 Comparison of theoretic bounds for variants . . . . . . . . . . . . . . . . . . . . . . . . . 135 5.3.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 5.3.6 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 5.3.7 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 5.3.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 5.3.9 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Binary heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 5.4.1 Heap operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 5.4.2 Building a heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 5.4.3 Heap implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.4.4 Derivation of index equations 5.4.5 Related structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 5.4.6 Summary of running times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 5.4.7 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 5.4.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 5.4.9 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 140 ''d''-ary heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.5.1 Data structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.5.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.5.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.5.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.5.5 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Binomial heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.6.1 Binomial heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.6.2 Structure of a binomial heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 5.6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 5.6.4 Summary of running times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.6.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.6.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.6.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.6.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 Fibonacci heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.7.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 x CONTENTS 5.8 5.9 5.7.2 Implementation of operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
147 5.7.3 Proof of degree bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 5.7.4 Worst case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.7.5 Summary of running times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.7.6 Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.7.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.7.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 Pairing heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 5.8.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.8.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.8.3 Summary of running times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.8.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 5.8.5 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 Double-ended priority queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 5.9.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 5.9.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 5.9.3 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 5.9.4 Applications 5.9.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 5.9.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . 155 5.10 Soft heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 5.10.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 5.10.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 6 Successors and neighbors 6.1 157 Binary search algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 6.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 6.1.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.1.3 Binary search versus other schemes 6.1.4 Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 6.1.5 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.1.6 Implementation issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.1.7 Library support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 6.1.8 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 6.1.9 Notes and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.1.10 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 6.2 Binary search tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 6.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 6.2.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 6.2.3 Examples of applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 6.2.4 Types . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 6.2.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 CONTENTS 6.3 6.4 6.5 6.6 6.7 xi 6.2.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 6.2.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 6.2.8 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 6.2.9 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 Random binary tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 6.3.1 Binary trees from random permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 6.3.2 Uniformly random binary trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 6.3.3 Random split trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 6.3.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 6.3.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 6.3.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 Tree rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 6.4.1 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 6.4.2 Detailed illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 6.4.3 Inorder invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 6.4.4 Rotations for rebalancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 6.4.5 Rotation distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 173 6.4.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 6.4.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 6.4.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 Self-balancing binary search tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 6.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 6.5.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6.5.3 Applications 6.5.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6.5.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6.5.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 Treap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6.6.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 6.6.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 6.6.3 Randomized binary search tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 6.6.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 6.6.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 6.6.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 6.6.7 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 AVL tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . 178 6.7.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 6.7.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 6.7.3 Comparison to other structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 6.7.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 6.7.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 xii CONTENTS 6.8 6.7.6 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 6.7.7 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 Red–black tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 6.8.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 6.8.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 6.8.3 Properties 6.8.4 Analogy to B-trees of order 4 6.8.5 Applications and related data structures 6.8.6 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 6.8.7 Proof of asymptotic bounds 6.8.8 Parallel algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 6.8.9 Popular Culture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 6.8.10 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 6.8.11 References . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . 187 6.8.12 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 6.8.13 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 6.9 WAVL tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 6.9.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 6.9.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 6.9.3 Computational complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 6.9.4 Related structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 6.9.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 6.10 Scapegoat tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 6.10.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 6.10.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 6.10.3 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 6.10.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 6.10.5 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 6.11 Splay tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 6.11.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 6.11.2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 6.11.3 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
194 6.11.4 Implementation and variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 6.11.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 6.11.6 Performance theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 6.11.7 Dynamic optimality conjecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 6.11.8 Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 6.11.9 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 6.11.10 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 6.11.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 6.11.12 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 CONTENTS xiii 6.12 Tango tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 6.12.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 6.12.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 6.12.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 6.12.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 6.12.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 6.13 Skip list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 6.13.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 6.13.2 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 6.13.3 Usages . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 6.13.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 6.13.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 6.13.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 6.14 B-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 6.14.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 6.14.2 B-tree usage in databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 6.14.3 Technical description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 6.14.4 Best case and worst case heights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 6.14.5 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 6.14.6 In filesystems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 6.14.7 Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 6.14.8 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 6.14.9 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 6.14.10 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 6.14.11 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 6.15 B+ tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 6.15.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 6.15.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
212 6.15.3 Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 6.15.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 6.15.5 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 6.15.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 6.15.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 6.15.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 7 Integer and string searching 7.1 215 Trie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 7.1.1 History and etymology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 7.1.2 Applications 7.1.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 7.1.4 Implementation strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 7.1.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 xiv CONTENTS 7.2 7.3 7.1.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 7.1.7 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 Radix tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 7.2.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 7.2.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 7.2.3 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
221 7.2.4 Comparison to other data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 7.2.5 Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 7.2.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 7.2.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 7.2.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 Suffix tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 7.3.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 7.3.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 7.3.3 Generalized suffix tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 7.3.4 Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 7.3.5 Applications 7.3.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 7.3.7 Parallel construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 7.3.8 External construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 7.3.9 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 7.3.10 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 7.3.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 7.3.12 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 7.4 7.5 7.6 Suffix array . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . 227 7.4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 7.4.2 Example 7.4.3 Correspondence to suffix trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 7.4.4 Space Efficiency 7.4.5 Construction Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 7.4.6 Applications 7.4.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 7.4.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 7.4.9 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 Suffix automaton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 7.5.1 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 7.5.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 7.5.3 Additional reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 Van Emde Boas tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 7.6.1 Supported operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 7.6.2 How it works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 CONTENTS 7.6.3 7.7 8 xv References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Fusion tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 7.7.1 How it works . . . . . . . . 
7.7.2 Fusion hashing
7.7.3 References
7.7.4 External links
8 Text and image sources, contributors, and licenses
8.1 Text
8.2 Images
8.3 Content license

Chapter 1 Introduction

1.1 Abstract data type

1.1.1 Examples

For example, integers are an ADT, defined as the values …, −2, −1, 0, 1, 2, …, and by the operations of addition, subtraction, multiplication, and division, together with greater than, less than, etc., which behave according to familiar mathematics (with care for integer division), independently of how the integers are represented by the computer.[lower-alpha 1] Explicitly, “behavior” includes obeying various axioms (associativity and commutativity of addition etc.), and preconditions on operations (cannot divide by zero). Typically integers are represented in a data structure as binary numbers, most often as two’s complement, but they might also be stored as binary-coded decimal or in ones’ complement. In either case the user is abstracted from the concrete choice of representation, and can simply use the data as integers.

Not to be confused with Algebraic data type.
In computer science, an abstract data type (ADT) is a mathematical model for data types where a data type is defined by its behavior (semantics) from the point of view of a user of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user.

Formally, an ADT may be defined as a “class of objects whose logical behavior is defined by a set of values and a set of operations”;[1] this is analogous to an algebraic structure in mathematics. What is meant by “behavior” varies by author, with the two main types of formal specifications for behavior being axiomatic (algebraic) specification and an abstract model;[2] these correspond to axiomatic semantics and operational semantics of an abstract machine, respectively. Some authors also include the computational complexity (“cost”), both in terms of time (for computing operations) and space (for representing values).

In practice many common data types are not ADTs, as the abstraction is not perfect, and users must be aware of issues like arithmetic overflow that are due to the representation. For example, integers are often stored as fixed-width values (32-bit or 64-bit binary numbers), and thus experience integer overflow if the maximum value is exceeded.

An ADT consists not only of operations, but also of values of the underlying data and of constraints on the operations. An “interface” typically refers only to the operations, and perhaps some of the constraints on the operations, notably pre-conditions and post-conditions, but not other constraints, such as relations between the operations.
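The gap between an interface and a full ADT can be illustrated with a short Python sketch. It is only an illustration under assumed names: `Container`, `Lifo`, `Fifo`, `put`, `take`, and `returns_most_recent` are hypothetical and do not come from the text. Both classes offer exactly the same operations, so the interface alone cannot tell them apart; a constraint relating `put` and `take` can.

```python
from collections import deque
from typing import Protocol


class Container(Protocol):
    """Operations only: roughly what an 'interface' captures (hypothetical names)."""

    def put(self, item: int) -> None: ...

    def take(self) -> int: ...


class Lifo:
    """Last-in-first-out container backed by a Python list."""

    def __init__(self) -> None:
        self._items = []

    def put(self, item: int) -> None:
        self._items.append(item)

    def take(self) -> int:
        # Precondition: the container must not be empty.
        if not self._items:
            raise IndexError("take from empty container")
        return self._items.pop()


class Fifo:
    """First-in-first-out container backed by a deque."""

    def __init__(self) -> None:
        self._items = deque()

    def put(self, item: int) -> None:
        self._items.append(item)

    def take(self) -> int:
        # Same precondition as Lifo.take.
        if not self._items:
            raise IndexError("take from empty container")
        return self._items.popleft()


def returns_most_recent(container: Container) -> bool:
    """Check one constraint relating the operations: take() yields the latest put()."""
    container.put(1)
    container.put(2)
    return container.take() == 2


print(returns_most_recent(Lifo()))  # True
print(returns_most_recent(Fifo()))  # False
```

The check exercises a single sequence of operations, so it tests the constraint on one case rather than specifying it formally.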
For example, an abstract stack, which is a last-in-first-out structure, could be defined by three operations: push, that inserts a data item onto the stack; pop, that removes a data item from it; and peek or top, that accesses a data item on top of the stack without removal. An abstract queue, which is a first-in-first-out structure, would also have three operations: enqueue, that inserts a data item into the queue; dequeue, that removes the first data item from it; and front, that accesses and serves the first data item in the queue. There would be no way of differentiating these two data types, unless a mathematical constraint is introduced that for a stack specifies that each pop always returns the most recently pushed item that has not been popped yet. When analyzing the efficiency of algorithms that use stacks, one may also specify that all operations take the same time no matter how many data items have been pushed into the stack, and that the stack uses a constant amount of storage for each element.

ADTs are a theoretical concept in computer science, used in the design and analysis of algorithms, data structures, and software systems, and do not correspond to specific features of computer languages; mainstream computer languages do not directly support formally specified ADTs. However, various language features correspond to certain aspects of ADTs, and are easily confused with ADTs proper; these include abstract types, opaque data types, protocols, and design by contract.

ADTs were first proposed by Barbara Liskov and Stephen N. Zilles in 1974, as part of the development of the CLU language.[3]

1.1.2 Introduction

Abstract data types are purely theoretical entities, used (among other things) to simplify the description of abstract algorithms, to classify and evaluate data structures, and to formally describe the type systems of programming languages.
However, an ADT may be implemented by specific data types or data structures, in many ways and in many programming languages; or described in a formal specification language. ADTs are often implemented as modules: the module’s interface declares procedures that correspond to the ADT operations, sometimes with comments that describe the constraints. This information hiding strategy allows the implementation of the module to be changed without disturbing the client programs.

The term abstract data type can also be regarded as a generalised approach of a number of algebraic structures, such as lattices, groups, and rings.[4] The notion of abstract data types is related to the concept of data abstraction, important in object-oriented programming and design by contract methodologies for software development.

1.1.3 Defining an abstract data type

An abstract data type is defined as a mathematical model of the data objects that make up a data type as well as the functions that operate on these objects. There are no standard conventions for defining them. A broad division may be drawn between “imperative” and “functional” definition styles.

Imperative-style definition

In the philosophy of imperative programming languages, an abstract data structure is conceived as an entity that is mutable, meaning that it may be in different states at different times. Some operations may change the state of the ADT; therefore, the order in which operations are evaluated is important, and the same operation on the same entities may have different effects if executed at different times, just like the instructions of a computer, or the commands and procedures of an imperative language. To underscore this view, it is customary to say that the operations are executed or applied, rather than evaluated. The imperative style is often used when describing abstract algorithms. (See The Art of Computer Programming by Donald Knuth for more details.)

Abstract variable

Imperative-style definitions of ADT often depend on the concept of an abstract variable, which may be regarded as the simplest non-trivial ADT. An abstract variable V is a mutable entity that admits two operations:

• store(V, x) where x is a value of unspecified nature;
• fetch(V), that yields a value,

with the constraint that

• fetch(V) always returns the value x used in the most recent store(V, x) operation on the same variable V.

As in so many programming languages, the operation store(V, x) is often written V ← x (or some similar notation), and fetch(V) is implied whenever a variable V is used in a context where a value is required. Thus, for example, V ← V + 1 is commonly understood to be a shorthand for store(V, fetch(V) + 1).

In this definition, it is implicitly assumed that storing a value into a variable U has no effect on the state of a distinct variable V. To make this assumption explicit, one could add the constraint that

• if U and V are distinct variables, the sequence { store(U, x); store(V, y) } is equivalent to { store(V, y); store(U, x) }.

More generally, ADT definitions often assume that any operation that changes the state of one ADT instance has no effect on the state of any other instance (including other instances of the same ADT), unless the ADT axioms imply that the two instances are connected (aliased) in that sense. For example, when extending the definition of abstract variable to include abstract records, the operation that selects a field from a record variable R must yield a variable V that is aliased to that part of R.

The definition of an abstract variable V may also restrict the stored values x to members of a specific set X, called the range or type of V. As in programming languages, such restrictions may simplify the description and analysis of algorithms, and improve their readability.

Note that this definition does not imply anything about the result of evaluating fetch(V) when V is uninitialized, that is, before performing any store operation on V. An algorithm that does so is usually considered invalid, because its effect is not defined. (However, there are some important algorithms whose efficiency strongly depends on the assumption that such a fetch is legal, and returns some arbitrary value in the variable’s range.)

Instance creation

Some algorithms need to create new instances of some ADT (such as new variables, or new stacks). To describe such algorithms, one usually includes in the ADT definition a create() operation that yields an instance of the ADT, usually with axioms equivalent to

• the result of create() is distinct from any instance in use by the algorithm.

This axiom may be strengthened to exclude also partial aliasing with other instances. On the other hand, this axiom still allows implementations of create() to yield a previously created instance that has become inaccessible to the program.

Example: abstract stack (imperative)

As another example, an imperative-style definition of an abstract stack could specify that the state of a stack S can be modified only by the operations

• push(S, x), where x is some value of unspecified nature;
• pop(S), that yields a value as a result,

with the constraint that

• For any value x and any abstract variable V, the sequence of operations { push(S, x); V ← pop(S) } is equivalent to V ← x.

Since the assignment V ← x, by definition, cannot change the state of S, this condition implies that V ← pop(S) restores S to the state it had before the push(S, x). From this condition and from the properties of abstract variables, it follows, for example, that the sequence

{ push(S, x); push(S, y); U ← pop(S); push(S, z); V ← pop(S); W ← pop(S) }

where x, y, and z are any values, and U, V, W are pairwise distinct variables, is equivalent to

{ U ← y; V ← z; W ← x }

Here it is implicitly assumed that operations on a stack instance do not modify the state of any other ADT instance, including other stacks; that is,

• For any values x, y, and any distinct stacks S and T, the sequence { push(S, x); push(T, y) } is equivalent to { push(T, y); push(S, x) }.

An abstract stack definition usually includes also a Boolean-valued function empty(S) and a create() operation that returns a stack instance, with axioms equivalent to

• create() ≠ S for any stack S (a newly created stack is distinct from all previous stacks);
• empty(create()) (a newly created stack is empty);
• not empty(push(S, x)) (pushing something into a stack makes it non-empty).

Single-instance style

Sometimes an ADT is defined as if only one instance of it existed during the execution of the algorithm, and all operations were applied to that instance, which is not explicitly notated. For example, the abstract stack above could have been defined with operations push(x) and pop(), that operate on the only existing stack. ADT definitions in this style can be easily rewritten to admit multiple coexisting instances of the ADT, by adding an explicit instance parameter (like S in the previous example) to every operation that uses or modifies the implicit instance.

On the other hand, some ADTs cannot be meaningfully defined without assuming multiple instances. This is the case when a single operation takes two distinct instances of the ADT as parameters. For an example, consider augmenting the definition of the abstract stack with an operation compare(S, T) that checks whether the stacks S and T contain the same items in the same order.

Functional-style definition

Another way to define an ADT, closer to the spirit of functional programming, is to consider each state of the structure as a separate entity. In this view, any operation that modifies the ADT is modeled as a mathematical function that takes the old state as an argument, and returns the new state as part of the result. Unlike the imperative operations, these functions have no side effects. Therefore, the order in which they are evaluated is immaterial, and the same operation applied to the same arguments (including the same input states) will always return the same results (and output states).

In the functional view, in particular, there is no way (or need) to define an “abstract variable” with the semantics of imperative variables (namely, with fetch and store operations). Instead of storing values into variables, one passes them as arguments to functions.

Example: abstract stack (functional)

For example, a complete functional-style definition of an abstract stack could use the three operations:

• push: takes a stack state and an arbitrary value, returns a stack state;
• top: takes a stack state, returns a value;
• pop: takes a stack state, returns a stack state,

with the following axioms:

• top(push(s, x)) = x (pushing an item onto a stack leaves it at the top);
• pop(push(s, x)) = s (pop undoes the effect of push).

In a functional-style definition there is no need for a create operation. Indeed, there is no notion of “stack instance”. The stack states can be thought of as being potential states of a single stack structure, and two stack states that contain the same values in the same order are considered to be identical states. This view actually mirrors the behavior of some concrete implementations, such as linked lists with hash cons.

Instead of create(), a functional-style definition of an abstract stack may assume the existence of a special stack state, the empty stack, designated by a special symbol like Λ or “()”; or define a bottom() operation that takes no arguments and returns this special stack state. Note that the axioms imply that

• push(Λ, x) ≠ Λ.

In a functional-style definition of a stack one does not need an empty predicate: instead, one can test whether a stack is empty by testing whether it is equal to Λ.

Note that these axioms do not define the effect of top(s) or pop(s), unless s is a stack state returned by a push. Since push leaves the stack non-empty, those two operations are undefined (hence invalid) when s = Λ. On the other hand, the axioms (and the lack of side effects) imply that push(s, x) = push(t, y) if and only if x = y and s = t.

As in some other branches of mathematics, it is customary to assume also that the stack states are only those whose existence can be proved from the axioms in a finite number of steps. In the abstract stack example above, this rule means that every stack is a finite sequence of values, that becomes the empty stack (Λ) after a finite number of pops. By themselves, the axioms above do not exclude the existence of infinite stacks (that can be popped forever, each time yielding a different state) or circular stacks (that return to the same state after a finite number of pops). In particular, they do not exclude states s such that pop(s) = s or push(s, x) = s for some x. However, since one cannot obtain such stack states with the given operations, they are assumed “not to exist”.

Whether to include complexity

Aside from the behavior in terms of axioms, it is also possible to include, in the definition of an ADT operation, their algorithmic complexity. Alexander Stepanov, designer of the C++ Standard Template Library, included complexity guarantees in the STL specification, arguing:

“The reason for introducing the notion of abstract data types was to allow interchangeable software modules. You cannot have interchangeable modules unless these modules share similar complexity behavior. If I replace one module with another module with the same functional behavior but with different complexity tradeoffs, the user of this code will be unpleasantly surprised. I could tell him anything I like about data abstraction, and he still would not want to use the code. Complexity assertions have to be part of the interface.” (Alexander Stepanov[5])

1.1.4 Advantages of abstract data typing

Encapsulation

Abstraction provides a promise that any implementation of the ADT has certain properties and abilities; knowing these is all that is required to make use of an ADT object. The user does not need any technical knowledge of how the implementation works to use the ADT. In this way, the implementation may be complex but will be encapsulated in a simple interface when it is actually used.

Localization of change

Code that uses an ADT object will not need to be edited if the implementation of the ADT is changed. Since any changes to the implementation must still comply with the interface, and since code using an ADT object may only refer to properties and abilities specified in the interface, changes may be made to the implementation without requiring any changes in code where the ADT is used.

Flexibility

Different implementations of the ADT, having all the same properties and abilities, are equivalent and may be used somewhat interchangeably in code that uses the ADT. This gives a great deal of flexibility when using ADT objects in different situations. For example, different implementations of the ADT may be more efficient in different situations; it is possible to use each in the situation where they are preferable, thus increasing overall efficiency.

1.1.5 Typical operations

Some operations that are often specified for ADTs (possibly under other names) are

• compare(s, t), that tests whether two instances’ states are equivalent in some sense;
• hash(s), that computes some standard hash function from the instance’s state;
• print(s) or show(s), that produces a human-readable representation of the instance’s state.

In imperative-style ADT definitions, one often finds also

• create(), that yields a new instance of the ADT;
• initialize(s), that prepares a newly created instance s for further operations, or resets it to some “initial state”;
• copy(s, t), that puts instance s in a state equivalent to that of t;
• clone(t), that performs s ← create(), copy(s, t), and returns s;
• free(s) or destroy(s), that reclaims the memory and other resources used by s.

The free operation is not normally relevant or meaningful, since ADTs are theoretical entities that do not “use memory”. However, it may be necessary when one needs to analyze the storage used by an algorithm that uses the ADT. In that case one needs additional axioms that specify how much memory each ADT instance uses, as a function of its state, and how much of it is returned to the pool by free.

1.1.6 Examples

Some common ADTs, which have proved useful in a great variety of applications, are

• Container
• List
• Set
• Multiset
• Map
• Multimap
• Graph
• Stack
• Queue
• Priority queue
• Double-ended queue
• Double-ended priority queue

Each of these ADTs may be defined in many ways and variants, not necessarily equivalent. For example, an abstract stack may or may not have a count operation that tells how many items have been pushed and not yet popped. This choice makes a difference not only for its clients but also for the implementation.

Abstract graphical data type

An extension of ADT for computer graphics was proposed in 1979:[6] an abstract graphical data type (AGDT). It was introduced by Nadia Magnenat Thalmann and Daniel Thalmann. AGDTs provide the advantages of ADTs with facilities to build graphical objects in a structured way.

1.1.7 Implementation

Further information: Opaque data type

Implementing an ADT means providing one procedure or function for each abstract operation. The ADT instances are represented by some concrete data structure that is manipulated by those procedures, according to the ADT’s specifications.

Usually there are many ways to implement the same ADT, using several different concrete data structures. Thus, for example, an abstract stack can be implemented by a linked list or by an array.

In order to prevent clients from depending on the implementation, an ADT is often packaged as an opaque data type in one or more modules, whose interface contains only the signature (number and types of the parameters and results) of the operations. The implementation of the module (namely, the bodies of the procedures and the concrete data structure used) can then be hidden from most clients of the module. This makes it possible to change the implementation without affecting the clients. If the implementation is exposed, it is known instead as a transparent data type.

When implementing an ADT, each instance (in imperative-style definitions) or each state (in functional-style definitions) is usually represented by a handle of some sort.[7]

Modern object-oriented languages, such as C++ and Java, support a form of abstract data types. When a class is used as a type, it is an abstract type that refers to a hidden representation. In this model an ADT is typically implemented as a class, and each instance of the ADT is usually an object of that class. The module’s interface typically declares the constructors as ordinary procedures, and most of the other ADT operations as methods of that class. However, such an approach does not easily encapsulate multiple representational variants found in an ADT. It also can undermine the extensibility of object-oriented programs. In a pure object-oriented program that uses interfaces as types, types refer to behaviors not representations.

Example: implementation of the abstract stack

As an example, here is an implementation of the abstract stack above in the C programming language.
Imperative-style interface

An imperative-style interface might be:

    #include <stdbool.h>                       // needed for the bool return type below

    typedef struct stack_Rep stack_Rep;        // type: stack instance representation (opaque record)
    typedef stack_Rep* stack_T;                // type: handle to a stack instance (opaque pointer)
    typedef void* stack_Item;                  // type: value stored in stack instance (arbitrary address)

    stack_T stack_create(void);                // creates a new empty stack instance
    void stack_push(stack_T s, stack_Item x);  // adds an item at the top of the stack
    stack_Item stack_pop(stack_T s);           // removes the top item from the stack and returns it
    bool stack_empty(stack_T s);               // checks whether stack is empty

This interface could be used in the following manner:

    #include "stack.h"                         // includes the stack interface

    int main(void) {
        stack_T s = stack_create();            // creates a new empty stack instance
        int x = 17;
        stack_push(s, &x);                     // adds the address of x at the top of the stack
        void* y = stack_pop(s);                // removes the address of x from the stack and returns it
        if (stack_empty(s)) { }                // does something if stack is empty
        (void)y;                               // y is not used further in this fragment
        return 0;
    }

This interface can be implemented in many ways. The implementation may be arbitrarily inefficient, since the formal definition of the ADT, above, does not specify how much space the stack may use, nor how long each operation should take. It also does not specify whether the stack state s continues to exist after a call x ← pop(s).

In practice the formal definition should specify that the space is proportional to the number of items pushed and not yet popped; and that every one of the operations above must finish in a constant amount of time, independently of that number. To comply with these additional specifications, the implementation could use a linked list, or an array (with dynamic resizing) together with two integers (an item count and the array size).

Functional-style interface

Functional-style ADT definitions are more appropriate for functional programming languages, and vice versa. However, one can provide a functional-style interface even in an imperative language like C. For example:

    typedef struct stack_Rep stack_Rep;           // type: stack state representation (opaque record)
    typedef stack_Rep* stack_T;                   // type: handle to a stack state (opaque pointer)
    typedef void* stack_Item;                     // type: value of a stack state (arbitrary address)

    stack_T stack_empty(void);                    // returns the empty stack state
    stack_T stack_push(stack_T s, stack_Item x);  // adds an item at the top of the stack state and returns the resulting stack state
    stack_T stack_pop(stack_T s);                 // removes the top item from the stack state and returns the resulting stack state
    stack_Item stack_top(stack_T s);              // returns the top item of the stack state

The main problem is that C lacks garbage collection, and this makes this style of programming impractical; moreover, memory allocation routines in C are slower than allocation in a typical garbage collector, thus the performance impact of so many allocations is even greater.

ADT libraries

Many modern programming languages, such as C++ and Java, come with standard libraries that implement several common ADTs, such as those listed above.

Built-in abstract data types

The specification of some programming languages is intentionally vague about the representation of certain built-in data types, defining only the operations that can be done on them. Therefore, those types can be viewed as “built-in ADTs”. Examples are the arrays in many scripting languages, such as Awk, Lua, and Perl, which can be regarded as an implementation of the abstract list.

1.1.8 See also

• Concept (generic programming)
• Formal methods
• Functional specification
• Generalized algebraic data type
• Initial algebra
• Liskov substitution principle
• Type theory
• Walls and Mirrors

1.1.9 Notes

[1] Compare to the characterization of integers in abstract algebra.

1.1.10 References

[1] Dale & Walker 1996, p. 3.
[2] Dale & Walker 1996, p. 4.
[3] Liskov & Zilles 1974.
[4] Rudolf Lidl (2004). Abstract Algebra. Springer. ISBN 81-8128-149-7. Chapter 7, section 40.
[5] Stevens, Al (March 1995). “Al Stevens Interviews Alex Stepanov”. Dr. Dobb’s Journal. Retrieved 31 January 2015.
[6] D. Thalmann, N. Magnenat Thalmann (1979). Design and Implementation of Abstract Graphical Data Types (PDF). Proc. 3rd International Computer Software and Applications Conference (COMPSAC'79), IEEE, Chicago, USA, pp. 519–524.
[7] Robert Sedgewick (1998). Algorithms in C. Addison/Wesley. ISBN 0-201-31452-5. Definition 4.4.

• Liskov, Barbara; Zilles, Stephen (1974). “Programming with abstract data types”. Proceedings of the ACM SIGPLAN symposium on Very high level languages. pp. 50–59. doi:10.1145/800233.807045.
• Dale, Nell; Walker, Henry M. (1996). Abstract Data Types: Specifications, Implementations, and Applications. Jones & Bartlett Learning. ISBN 978-0-66940000-7.

1.1.11 Further

• Mitchell, John C.; Plotkin, Gordon (July 1988). “Abstract Types Have Existential Type” (PDF). ACM Transactions on Programming Languages and Systems. 10 (3). doi:10.1145/44501.45065.

1.1.12 External links

• Abstract data type in NIST Dictionary of Algorithms and Data Structures

1.2 Data structure

Not to be confused with data type.

[Figure: A hash table (keys, hash function, buckets).]

In computer science, a data structure is a particular way of organizing data in a computer so that it can be used efficiently.[1][2] Data structures can implement one or more particular abstract data types (ADT), which specify the operations that can be performed on a data structure and the computational complexity of those operations. In comparison, a data structure is a concrete implementation of the specification provided by an ADT.

Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, relational databases commonly use B-tree indexes for data retrieval,[3] while compiler implementations usually use hash tables to look up identifiers.

Data structures provide a means to manage large amounts of data efficiently for uses such as large databases and internet indexing services. Usually, efficient data structures are key to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design. Data structures can be used to organize the storage and retrieval of information stored in both main memory and secondary memory.

1.2.1 Overview

Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by a pointer: a bit string, representing a memory address, that can itself be stored in memory and manipulated by the program. Thus, the array and record data structures are based on computing the addresses of data items with arithmetic operations, while the linked data structures are based on storing addresses of data items within the structure itself. Many data structures use both principles, sometimes combined in non-trivial ways (as in XOR linking).

The implementation of a data structure usually requires writing a set of procedures that create and manipulate instances of that structure. The efficiency of a data structure cannot be analyzed separately from those operations. This observation motivates the theoretical concept of an abstract data type, a data structure that is defined indirectly by the operations that may be performed on it, and the mathematical properties of those operations (including their space and time cost).

1.2.2 Examples

Main article: List of data structures

There are numerous types of data structures, generally built upon simpler primitive data types:

• An array is a number of elements in a specific order, typically all of the same type (depending on the language, individual elements may either all be forced to be the same type, or may be of almost any type). Elements are accessed using an integer index to specify which element is required. Typical implementations allocate contiguous memory words for the elements of arrays (but this is not always a necessity). Arrays may be fixed-length or resizable.
• A linked list (also just called list) is a linear collection of data elements of any type, called nodes, where each node has itself a value, and points to the next node in the linked list. The principal advantage of a linked list over an array is that values can always be efficiently inserted and removed without relocating the rest of the list. Certain other operations, such as random access to a certain element, are however slower on lists than on arrays.
• A record (also called tuple or struct) is an aggregate data structure. A record is a value that contains other values, typically in fixed number and sequence and typically indexed by names. The elements of records are usually called fields or members.
• A union is a data structure that specifies which of a number of permitted primitive types may be stored in its instances, e.g. float or long integer. Contrast with a record, which could be defined to contain a float and an integer; whereas in a union, there is only one value at a time. Enough space is allocated to contain the widest member datatype.
• A tagged union (also called variant, variant record, discriminated union, or disjoint union) contains an additional field indicating its current type, for enhanced type safety.
• A class is a data structure that contains data fields, like a record, as well as various methods which operate on the contents of the record. In the context of object-oriented programming, records are known as plain old data structures to distinguish them from classes.

1.2.3 Language support

Most assembly languages and some low-level languages, such as BCPL (Basic Combined Programming Language), lack built-in support for data structures. On the other hand, many high-level programming languages and some higher-level assembly languages, such as MASM, have special syntax or other built-in support for certain data structures, such as records and arrays. For example, the C and Pascal languages support structs and records, respectively, in addition to vectors (one-dimensional arrays) and multi-dimensional arrays.[4][5]

Most programming languages feature some sort of library mechanism that allows data structure implementations to be reused by different programs. Modern languages usually come with standard libraries that implement the most common data structures. Examples are the C++ Standard Template Library, the Java Collections Framework, and Microsoft's .NET Framework.

Modern languages also generally support modular programming, the separation between the interface of a library module and its implementation. Some provide opaque data types that allow clients to hide implementation details. Object-oriented programming languages, such as C++, Java and Smalltalk, may use classes for this purpose.

Many known data structures have concurrent versions that allow multiple computing threads to access the data structure simultaneously.

1.2.4 See also

• Abstract data type
• Concurrent data structure
• Data model
• Dynamization
• Linked data structure
• List of data structures
• Persistent data structure
• Plain old data structure

1.2.5 References

[1] Paul E. Black (ed.), entry for data structure in Dictionary of Algorithms and Data Structures. U.S. National Institute of Standards and Technology. 15 December 2004. Online version accessed May 21, 2009.
[2] Entry data structure in the Encyclopædia Britannica (2009). Online entry accessed on May 21, 2009.
[3] Gavin Powell (2006). “Chapter 8: Building Fast-Performing Database Models”. Beginning Database Design. Wrox Publishing. ISBN 978-0-7645-7490-0.
[4] “The GNU C Manual”. Free Software Foundation. Retrieved 15 October 2014.
[5] “Free Pascal: Reference Guide”. Free Pascal. Retrieved 15 October 2014.

1.2.6 Further reading

• Peter Brass, Advanced Data Structures, Cambridge University Press, 2008.
• Donald Knuth, The Art of Computer Programming, vol. 1. Addison-Wesley, 3rd edition, 1997.
• Dinesh Mehta and Sartaj Sahni, Handbook of Data Structures and Applications, Chapman and Hall/CRC Press, 2007.
• Niklaus Wirth, Algorithms and Data Structures, Prentice Hall, 1985.

1.2.7 External links

• course on data structures
• Data structures Programs Examples in c,java
• UC Berkeley video course on data structures
• Descriptions from the Dictionary of Algorithms and Data Structures
• Data structures course
• An Examination of Data Structures from .NET perspective
• Schaffer, C. Data Structures and Algorithm Analysis

1.3 Analysis of algorithms

[Figure: Graphs of number of operations, N, versus input size, n, for common complexities (n!, 2ⁿ, n², n log₂ n, n, √n, log₂ n), assuming a coefficient of 1.]

In computer science, the analysis of algorithms is the determination of the amount of resources (such as time and storage) necessary to execute them. Most algorithms are designed to work with inputs of arbitrary length. Usually, the efficiency or running time of an algorithm is stated as a function relating the input length to the number of steps (time complexity) or storage locations (space complexity).

The term “analysis of algorithms” was coined by Donald Knuth.[1] Algorithm analysis is an important part of a broader computational complexity theory, which provides theoretical estimates for the resources needed by any algorithm which solves a given computational problem. These estimates provide an insight into reasonable directions of search for efficient algorithms.

In theoretical analysis of algorithms it is common to estimate their complexity in the asymptotic sense, i.e., to estimate the complexity function for arbitrarily large input. Big O notation, Big-omega notation and Big-theta notation are used to this end. For instance, binary search is said to run in a number of steps proportional to the logarithm of the length of the sorted list being searched, or in O(log(n)), colloquially “in logarithmic time”. Usually asymptotic estimates are used because different implementations of the same algorithm may differ in efficiency. However the efficiencies of any two “reasonable” implementations of a given algorithm are related by a constant multiplicative factor called a hidden constant.

Exact (not asymptotic) measures of efficiency can sometimes be computed but they usually require certain assumptions concerning the particular implementation of the algorithm, called model of computation. A model of computation may be defined in terms of an abstract computer, e.g., Turing machine, and/or by postulating that certain operations are executed in unit time. For example, if the sorted list to which we apply binary search has n elements, and we can guarantee that each lookup of an element in the list can be done in unit time, then at most log₂ n + 1 time units are needed to return an answer.

1.3.1 Cost models

Time efficiency estimates depend on what we define to be a step. For the analysis to correspond usefully to the actual execution time, the time required to perform a step must be guaranteed to be bounded above by a constant. One must be careful here; for instance, some analyses count an addition of two numbers as one step. This assumption may not be warranted in certain contexts. For example, if the numbers involved in a computation may be arbitrarily large, the time required by a single addition can no longer be assumed to be constant.

Two cost models are generally used:[2][3][4][5][6]

• the uniform cost model, also called uniform-cost measurement (and similar variations), assigns a constant cost to every machine operation, regardless of the size of the numbers involved
• the logarithmic cost model, also called logarithmic-cost measurement (and variations thereof), assigns a cost to every machine operation proportional to the number of bits involved

The latter is more cumbersome to use, so it is only employed when necessary, for example in the analysis of arbitrary-precision arithmetic algorithms, like those used in cryptography.

A key point which is often overlooked is that published lower bounds for problems are often given for a model of computation that is more restricted than the set of operations that you could use in practice and therefore there are algorithms that are faster than what would naively be thought possible.[7]

1.3.2 Run-time analysis

Run-time analysis is a theoretical classification that estimates and anticipates the increase in running time (or run-time) of an algorithm as its input size (usually denoted as n) increases. Run-time efficiency is a topic of great interest in computer science: a program can take seconds, hours or even years to finish executing, depending on which algorithm it implements (see also performance analysis, which is the analysis of an algorithm’s run-time in practice).

Informally, an algorithm can be said to exhibit a growth rate on the order of a mathematical function if beyond a certain input size n, the function times a positive constant provides an upper bound or limit for the run-time of that algorithm. In other words, for a given input size n greater than some n₀ and a constant c, the running time of that algorithm will never be larger than c × f(n), where f(n) denotes that function. This concept is frequently expressed using Big O notation. For example, since the run-time of insertion sort grows quadratically as its input size increases, insertion sort can be said to be of order O(n²).

Big O notation is a convenient way to express the worst-case scenario for a given algorithm, although it can also be used to express the average case; for example, the worst-case scenario for quicksort is O(n²), but the average-case run-time is O(n log n).

Empirical orders of growth

Shortcomings of empirical metrics

Since algorithms are platform-independent (i.e. a given algorithm can be implemented in an arbitrary programming language on an arbitrary computer running an arbitrary operating system), there are significant drawbacks to using an empirical approach to gauge the comparative performance of a given set of algorithms.

Take as an example a program that looks up a specific entry in a sorted list of size n. Suppose this program were implemented on Computer A, a state-of-the-art machine, using a linear search algorithm, and on Computer B, a much slower machine, using a binary search algorithm. Benchmark testing on the two computers running their respective programs might look something like the following:

Based on these metrics, it would be easy to jump to the conclusion that Computer A is running an algorithm that is far superior in efficiency to that of Computer B. However, if the size of the input-list is increased to a sufficient number, that conclusion is dramatically demonstrated to be in error:

Computer A, running the linear search program, exhibits a linear growth rate. The program’s run-time is directly proportional to its input size.
Doubling the input size doubles the run time, quadrupling the input size quadruples the run-time, and so forth. On the other hand, Computer B, running the binary search program, exhibits a logarithmic growth rate. Quadrupling the input size only increases the run time by a constant amount (in this example, 50,000 ns). Even though Computer A is ostensibly a faster machine, Computer B will inevitably surpass Computer A in run-time because it’s running an algorithm with a much slower growth rate. Orders of growth Main article: Big O notation Assuming the execution time follows power rule, t ≈ k na , the coefficient a can be found [8] by taking empirical measurements of run time {t1 , t2 } at some problemsize points {n1 , n2 } , and calculating t2 /t1 = (n2 /n1 )a so that a = log(t2 /t1 )/ log(n2 /n1 ) . In other words, this measures the slope of the empirical line on the log– log plot of execution time vs. problem size, at some size point. If the order of growth indeed follows the power rule (and so the line on log–log plot is indeed a straight line), the empirical value of a will stay constant at different ranges, and if not, it will change (and the line is a curved line) - but still could serve for comparison of any two given algorithms as to their empirical local orders of growth behaviour. Applied to the above table: It is clearly seen that the first algorithm exhibits a linear order of growth indeed following the power rule. The empirical values for the second one are diminishing rapidly, suggesting it follows another rule of growth and in any case has much lower local orders of growth (and improving further still), empirically, than the first one. Evaluating run-time complexity The run-time complexity for the worst-case scenario of a given algorithm can sometimes be evaluated by examining the structure of the algorithm and making some simplifying assumptions. 
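One way to sanity-check this kind of structural evaluation is to instrument a small program and count how often each part runs. The sketch below does so for a doubly nested loop of the same shape as the worked example in this section, and compares the counts with the closed forms n + 1, (n² + 3n)/2 and (n² + n)/2 that the worked example arrives at. The function names `count_steps` and `predicted_steps` are hypothetical and chosen only for illustration.

```python
def count_steps(n):
    """Count how often each part of a doubly nested loop runs.

    Mirrors "for i = 1 to n: for j = 1 to i: body" with explicit loop tests,
    so that the final, failing test of each loop is counted as well.
    """
    outer_tests = inner_tests = body_runs = 0
    i = 1
    while True:
        outer_tests += 1          # outer loop test (n passing, 1 failing)
        if i > n:
            break
        j = 1
        while True:
            inner_tests += 1      # inner loop test (i passing, 1 failing)
            if j > i:
                break
            body_runs += 1        # stands in for the loop body
            j += 1
        i += 1
    return outer_tests, inner_tests, body_runs


def predicted_steps(n):
    """Closed forms for the three counters returned by count_steps."""
    return n + 1, (n * n + 3 * n) // 2, (n * n + n) // 2


for n in (0, 1, 5, 100):
    print(n, count_steps(n), predicted_steps(n))
```

Counting executions rather than measuring wall-clock time keeps the check independent of the machine, which is the point of the structural approach.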
Consider the following pseudocode:

    1    get a positive integer n from input
    2    if n > 10
    3        print "This might take a while..."
    4    for i = 1 to n
    5        for j = 1 to i
    6            print i * j
    7    print "Done!"

A given computer will take a discrete amount of time to execute each of the instructions involved with carrying out this algorithm. The specific amount of time to carry out a given instruction will vary depending on which instruction is being executed and which computer is executing it, but on a conventional computer, this amount will be deterministic.[9] Say that the actions carried out in step 1 are considered to consume time $T_1$, step 2 uses time $T_2$, and so forth.

In the algorithm above, steps 1, 2 and 7 will only be run once. For a worst-case evaluation, it should be assumed that step 3 will be run as well. Thus the total amount of time to run steps 1-3 and step 7 is:

$$T_1 + T_2 + T_3 + T_7.$$

The loops in steps 4, 5 and 6 are trickier to evaluate. The outer loop test in step 4 will execute (n + 1) times (note that an extra step is required to terminate the for loop, hence n + 1 and not n executions), which will consume $T_4(n + 1)$ time. The inner loop, on the other hand, is governed by the value of j, which iterates from 1 to i. On the first pass through the outer loop, j iterates from 1 to 1: the inner loop makes one pass, so running the inner loop body (step 6) consumes $T_6$ time, and the inner loop test (step 5) consumes $2T_5$ time. During the next pass through the outer loop, j iterates from 1 to 2: the inner loop makes two passes, so running the inner loop body (step 6) consumes $2T_6$ time, and the inner loop test (step 5) consumes $3T_5$ time.

Altogether, the total time required to run the inner loop body can be expressed as an arithmetic progression:

$$T_6 + 2T_6 + 3T_6 + \cdots + (n-1)T_6 + nT_6$$

which can be factored[10] as

$$T_6\left[1 + 2 + 3 + \cdots + (n-1) + n\right] = T_6\left[\tfrac{1}{2}(n^2 + n)\right].$$

The total time required to run the inner loop test can be evaluated similarly:

$$2T_5 + 3T_5 + 4T_5 + \cdots + (n-1)T_5 + nT_5 + (n+1)T_5 = T_5 + 2T_5 + 3T_5 + 4T_5 + \cdots + (n-1)T_5 + nT_5 + (n+1)T_5 - T_5$$

which can be factored as

$$T_5\left[1 + 2 + 3 + \cdots + (n-1) + n + (n+1)\right] - T_5 = \left[\tfrac{1}{2}(n^2 + n)\right]T_5 + (n+1)T_5 - T_5 = T_5\left[\tfrac{1}{2}(n^2 + n)\right] + nT_5 = \left[\tfrac{1}{2}(n^2 + 3n)\right]T_5.$$

Therefore, the total running time for this algorithm is:

$$f(n) = T_1 + T_2 + T_3 + T_7 + (n+1)T_4 + \left[\tfrac{1}{2}(n^2 + n)\right]T_6 + \left[\tfrac{1}{2}(n^2 + 3n)\right]T_5$$

which reduces to

$$f(n) = \left[\tfrac{1}{2}(n^2 + n)\right]T_6 + \left[\tfrac{1}{2}(n^2 + 3n)\right]T_5 + (n+1)T_4 + T_1 + T_2 + T_3 + T_7.$$

As a rule of thumb, one can assume that the highest-order term in any given function dominates its rate of growth and thus defines its run-time order. In this example, n² is the highest-order term, so one can conclude that $f(n) = O(n^2)$. Formally this can be proven as follows:

Prove that $\left[\tfrac{1}{2}(n^2 + n)\right]T_6 + \left[\tfrac{1}{2}(n^2 + 3n)\right]T_5 + (n+1)T_4 + T_1 + T_2 + T_3 + T_7 \le cn^2$ for $n \ge n_0$.

$$\left[\tfrac{1}{2}(n^2 + n)\right]T_6 + \left[\tfrac{1}{2}(n^2 + 3n)\right]T_5 + (n+1)T_4 + T_1 + T_2 + T_3 + T_7 \le (n^2 + n)T_6 + (n^2 + 3n)T_5 + (n+1)T_4 + T_1 + T_2 + T_3 + T_7 \quad (\text{for } n \ge 0)$$

Let k be a constant greater than or equal to each of $T_1, \ldots, T_7$. Then

$$T_6(n^2 + n) + T_5(n^2 + 3n) + (n+1)T_4 + T_1 + T_2 + T_3 + T_7 \le k(n^2 + n) + k(n^2 + 3n) + kn + 5k = 2kn^2 + 5kn + 5k \le 2kn^2 + 5kn^2 + 5kn^2 \ (\text{for } n \ge 1) = 12kn^2.$$

Therefore $\left[\tfrac{1}{2}(n^2 + n)\right]T_6 + \left[\tfrac{1}{2}(n^2 + 3n)\right]T_5 + (n+1)T_4 + T_1 + T_2 + T_3 + T_7 \le cn^2$ for $n \ge n_0$, with $c = 12k$ and $n_0 = 1$.

A more elegant approach to analyzing this algorithm would be to declare that $T_1, \ldots, T_7$ are all equal to one unit of time, in a system of units chosen so that one unit is greater than or equal to the actual times for these steps. This would mean that the algorithm's running time breaks down as follows:[11]

$$4 + \sum_{i=1}^{n} i \le 4 + \sum_{i=1}^{n} n = 4 + n^2 \le 5n^2 \ (\text{for } n \ge 1) = O(n^2).$$

Growth rate analysis of other resources

The methodology of run-time analysis can also be utilized for predicting other growth rates, such as consumption of memory space. As an example, consider the following pseudocode which manages and reallocates memory usage by a program based on the size of a file which that program manages:

    while (file still open)
        let n = size of file
        for every 100,000 kilobytes of increase in file size
            double the amount of memory reserved

In this instance, as the file size n increases, memory will be consumed at an exponential growth rate, which is order $O(2^n)$. This is an extremely rapid and most likely unmanageable growth rate for consumption of memory resources.

1.3.3 Relevance

Algorithm analysis is important in practice because the accidental or unintentional use of an inefficient algorithm can significantly impact system performance. In time-sensitive applications, an algorithm taking too long to run can render its results outdated or useless. An inefficient algorithm can also end up requiring an uneconomical amount of computing power or storage in order to run, again rendering it practically useless.

1.3.4 Constant factors

Analysis of algorithms typically focuses on the asymptotic performance, particularly at the elementary level, but in practical applications constant factors are important, and real-world data is in practice always limited in size. The limit is typically the size of addressable memory, so on 32-bit machines $2^{32}$ = 4 GiB (greater if segmented memory is used) and on 64-bit machines $2^{64}$ = 16 EiB. Thus given a limited size, an order of growth (time or space) can be replaced by a constant factor, and in this sense all practical algorithms are O(1) for a large enough constant, or for small enough data.

This interpretation is primarily useful for functions that grow extremely slowly: (binary) iterated logarithm (log*) is less than 5 for all practical data ($2^{65536}$ bits); (binary) log-log (log log n) is less than 6 for virtually all practical data ($2^{64}$ bits); and binary log (log n) is less than 64 for virtually all practical data ($2^{64}$ bits). An algorithm with non-constant complexity may nonetheless be more efficient than an algorithm with constant complexity on practical data if the overhead of the constant time algorithm results in a larger constant factor, e.g., one may have $K > k \log \log n$ so long as $K/k > 6$ and $n < 2^{2^6} = 2^{64}$.

For large data linear or quadratic factors cannot be ignored, but for small data an asymptotically inefficient algorithm may be more efficient. This is particularly used in hybrid algorithms, like Timsort, which use an asymptotically efficient algorithm (here merge sort, with time complexity n log n), but switch to an asymptotically inefficient algorithm (here insertion sort, with time complexity n²) for small data, as the simpler algorithm is faster on small data.

1.3.5 See also

• Amortized analysis
• Analysis of parallel algorithms
• Asymptotic computational complexity
• Best, worst and average case
• Big O notation
• Computational complexity theory
• Master theorem
• NP-Complete
• Numerical analysis
• Polynomial time
• Program optimization
• Profiling (computer programming)
• Scalability
• Smoothed analysis
• Termination analysis: the subproblem of checking whether a program will terminate at all
• Time complexity: includes table of orders of growth for common algorithms

1.3.6 Notes

[1] Donald Knuth, Recent News
[2] Alfred V. Aho; John E. Hopcroft; Jeffrey D. Ullman (1974). The design and analysis of computer algorithms. Addison-Wesley Pub. Co., section 1.3
[3] Juraj Hromkovič (2004). Theoretical computer science: introduction to Automata, computability, complexity, algorithmics, randomization, communication, and cryptography. Springer. pp. 177–178. ISBN 978-3-540-14015-3.
[4] Giorgio Ausiello (1999). Complexity and approximation: combinatorial optimization problems and their approximability properties. Springer. pp. 3–8. ISBN 978-3-540-65431-5.
[5] Wegener, Ingo (2005), Complexity theory: exploring the limits of efficient algorithms, Berlin, New York: Springer-Verlag, p. 20, ISBN 978-3-540-21045-0
[6] Robert Endre Tarjan (1983). Data structures and network algorithms. SIAM. pp. 3–7. ISBN 978-0-89871-187-5.
[7] Examples of the price of abstraction?, cstheory.stackexchange.com
[8] How To Avoid O-Abuse and Bribes, at the blog “Gödel's Lost Letter and P=NP” by R. J. Lipton, professor of Computer Science at Georgia Tech, recounting idea by Robert Sedgewick
[9] However, this is not the case with a quantum computer
[10] It can be proven by induction that $1 + 2 + 3 + \cdots + (n-1) + n = \frac{n(n+1)}{2}$
[11] This approach, unlike the above approach, neglects the constant time consumed by the loop tests which terminate their respective loops, but it is trivial to prove that such omission does not affect the final result

1.4 Amortized analysis

In computer science, amortized analysis is a method for analyzing a given algorithm's time complexity, or how much of a resource, especially time or memory in the context of computer programs, it takes to execute. The motivation for amortized analysis is that looking at the worst-case run time per operation can be too pessimistic.[1]

While certain operations for a given algorithm may have a significant cost in resources, other operations may not be as costly. Amortized analysis considers both the costly and less costly operations together over the whole series of operations of the algorithm. This may include accounting for different types of input, length of the input, and other factors that affect its performance.[2]

1.4.1 History

Amortized analysis initially emerged from a method called aggregate analysis, which is now subsumed by amortized analysis. However, the technique was first formally introduced by Robert Tarjan in his 1985 paper Amortized Computational Complexity, which addressed the need for a more useful form of analysis than the common probabilistic methods used. Amortization was initially used for very specific types of algorithms, particularly those involving binary trees and union operations. However, it is now ubiquitous and comes into play when analyzing many other algorithms as well.[2]

1.4.2 Method

The method requires knowledge of which series of operations are possible. This is most commonly the case with data structures, which have state that persists between operations. The basic idea is that a worst case operation can alter the state in such a way that the worst case cannot occur again for a long time, thus “amortizing” its cost.

There are generally three methods for performing amortized analysis: the aggregate method, the accounting method, and the potential method. All of these give the same answers, and their usage difference is primarily circumstantial and due to individual preference.[3]

• Aggregate analysis determines the upper bound T(n) on the total cost of a sequence of n operations, then calculates the amortized cost to be T(n) / n.[3]
• The accounting method determines the individual cost of each operation, combining its immediate execution time and its influence on the running time of future operations. Usually, many short-running operations accumulate a “debt” of unfavorable state in small increments, while rare long-running operations decrease it drastically.[3]
• The potential method is like the accounting method, but overcharges operations early to compensate for undercharges later.[3]

1.4.3 Examples

Dynamic Array

[Figure: amortized analysis of the push operation for a dynamic array.]

Consider a dynamic array that grows in size as more elements are added to it, such as an ArrayList in Java. If we started out with a dynamic array of size 4, it would take constant time to push four elements onto it. Yet pushing a fifth element onto that array would take longer, as the array would have to create a new array of double the current size (8), copy the old elements onto the new array, and then add the new element. The next three push operations would similarly take constant time, and then the subsequent addition would require another slow doubling of the array size.

In general, if we consider an arbitrary number of pushes n to an array of size n, we notice that push operations take constant time except for the last one, which takes O(n) time to perform the size doubling operation. The n operations therefore take O(n) time in total, so averaged over all of them, pushing an element onto the dynamic array takes O(n/n) = O(1), constant time.[3]

Queue

Let's look at a Ruby implementation of a Queue, a FIFO data structure:

    class Queue
      def initialize
        @input = []    # receives newly enqueued elements
        @output = []   # holds elements in reversed order, ready to dequeue
      end

      def enqueue(element)
        @input << element
      end

      def dequeue
        if @output.empty?
          # Move everything across; popping reverses the order, so the
          # oldest element ends up on top of @output. Testing empty?
          # (rather than any?) also handles nil or false elements.
          until @input.empty?
            @output << @input.pop
          end
        end
        @output.pop
      end
    end

The enqueue operation just pushes an element onto the input array; this operation does not depend on the lengths of either input or output and therefore runs in constant time.

However, the dequeue operation is more complicated. If the output array already has some elements in it, then dequeue runs in constant time; otherwise, dequeue takes O(n) time to add all the elements onto the output array from the input array, where n is the current length of the input array. After copying n elements from input, we can perform n dequeue operations, each taking constant time, before the output array is empty again. Thus, we can perform a sequence of n dequeue operations in only O(n) time, which implies that the amortized time of each dequeue operation is O(1).[4]

Alternatively, we can charge the cost of copying any item from the input array to the output array to the earlier enqueue operation for that item. This charging scheme doubles the amortized time for enqueue, but reduces the amortized time for dequeue to O(1).

1.4.4 Common use

• In common usage, an “amortized algorithm” is one that an amortized analysis has shown to perform well.
• Online algorithms commonly use amortized analysis.

1.4.5 References

• Allan Borodin and Ran El-Yaniv (1998). Online Computation and Competitive Analysis. Cambridge University Press. pp. 20, 141.

[1] “Lecture 7: Amortized Analysis” (PDF). https://www.cs.cmu.edu/. Retrieved 14 March 2015.
[2] Rebecca Fiebrink (2007), Amortized Analysis Explained (PDF), retrieved 2011-05-03
[3] “Lecture 20: Amortized Analysis”. http://www.cs.cornell.edu/. Cornell University. Retrieved 14 March 2015.
[4] Grossman, Dan. “CSE332: Data Abstractions” (PDF). cs.washington.edu. Retrieved 14 March 2015.

1.5 Accounting method

In the field of analysis of algorithms in computer science, the accounting method is a method of amortized analysis based on accounting. The accounting method often gives a more intuitive account of the amortized cost of an operation than either aggregate analysis or the potential method. Note, however, that this does not guarantee such analysis will be immediately obvious; often, choosing the correct parameters for the accounting method requires as much knowledge of the problem and the complexity bounds one is attempting to prove as the other two methods.

The accounting method is most naturally suited for proving an O(1) bound on time. The method as explained here is for proving such a bound.

1.5.1 The method

A set of elementary operations which will be used in the algorithm is chosen and their costs are arbitrarily set to 1. The fact that the costs of these operations may differ in reality presents no difficulty in principle. What is important is that each elementary operation has a constant cost.

Each aggregate operation is assigned a “payment”. The payment is intended to cover the cost of elementary operations needed to complete this particular operation, with some of the payment left over, placed in a pool to be used later.
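As a minimal sketch of this bookkeeping, the payment and pool idea can be applied to the two-array queue from the amortized analysis examples. Everything specific here is an assumption made for illustration rather than something the source prescribes: the class name `AccountedQueue`, the `pool` attribute, a cost of 1 for every elementary push or pop, and payments of 3 per enqueue and 1 per dequeue. With these costs, moving one item costs 2 (a pop plus a push), so each enqueue banks 2 units for its own later move.

```python
class AccountedQueue:
    """Two-array FIFO queue that tracks an accounting-method pool.

    Assumed cost model: each elementary list push or pop costs 1.
    Assumed payments: 3 per enqueue, 1 per dequeue.
    """

    ENQUEUE_PAYMENT = 3   # 1 for the push now, 2 banked for the later move
    DEQUEUE_PAYMENT = 1   # covers only the final pop

    def __init__(self):
        self.input = []
        self.output = []
        self.pool = 0     # payments collected minus actual cost so far

    def enqueue(self, item):
        cost = 1                          # one push onto the input array
        self.input.append(item)
        self.pool += self.ENQUEUE_PAYMENT - cost

    def dequeue(self):
        cost = 0
        if not self.output:
            while self.input:
                self.output.append(self.input.pop())
                cost += 2                 # one pop plus one push per moved item
        item = self.output.pop()          # raises IndexError if the queue is empty
        cost += 1
        self.pool += self.DEQUEUE_PAYMENT - cost
        # If the payments were chosen well, the pool never goes negative.
        assert self.pool >= 0, "payments too small to cover actual cost"
        return item


q = AccountedQueue()
for x in range(8):
    q.enqueue(x)
print([q.dequeue() for _ in range(8)], q.pool)
```

Because the pool never goes negative, the total actual cost of any sequence of operations is bounded by the total of the constant payments, which is the O(1) amortized bound the method aims for.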
The difficulty with problems that require amortized analysis is that, in general, some of the operations will require greater than constant cost. This means that no constant payment will be enough to cover the worst case cost of an operation, in and of itself. With proper selection of payment, however, this is no longer a difficulty; the expensive operations will only occur when there is sufficient payment in the pool to cover their costs.

1.5.2 Examples

A few examples will help to illustrate the use of the accounting method.

Table expansion

It is often necessary to create a table before it is known how much space is needed. One possible strategy is to double the size of the table when it is full. Here we will use the accounting method to show that the amortized cost of an insertion operation in such a table is O(1).

Before looking at the procedure in detail, we need some definitions. Let T be a table, E an element to insert, num(T) the number of elements in T, and size(T) the allocated size of T. We assume the existence of operations create_table(n), which creates an empty table of size n, for now assumed to be free, and elementary_insert(T,E), which inserts element E into a table T that already has space allocated, with a cost of 1.

The following pseudocode illustrates the table insertion procedure:

    function table_insert(T,E)
        if num(T) = size(T)
            U := create_table(2 × size(T))
            for each F in T
                elementary_insert(U,F)
            T := U
        elementary_insert(T,E)

Without amortized analysis, the best bound we can show for n insert operations is $O(n^2)$; this is due to the loop at line 4 that performs num(T) elementary insertions.

For analysis using the accounting method, we assign a payment of 3 to each table insertion. Although the reason for this is not clear now, it will become clear during the course of the analysis.

Assume that initially the table is empty with size(T) = m. The first m insertions therefore do not require reallocation and only have cost 1 (for the elementary insert). Therefore, when num(T) = m, the pool has (3 − 1) × m = 2m.

Inserting element m + 1 requires reallocation of the table. Creating the new table on line 3 is free (for now). The loop on line 4 requires m elementary insertions, for a cost of m. Including the insertion on the last line, the total cost for this operation is m + 1. After this operation, the pool therefore has 2m + 3 − (m + 1) = m + 2.

Next, we add another m − 1 elements to the table. At this point the pool has m + 2 + 2 × (m − 1) = 3m. Inserting an additional element (that is, element 2m + 1) can be seen to have cost 2m + 1 and a payment of 3. After this operation, the pool has 3m + 3 − (2m + 1) = m + 2. Note that this is the same amount as after inserting element m + 1. In fact, we can show that this will be the case for any number of reallocations.

It can now be made clear why the payment for an insertion is 3. 1 pays for the first insertion of the element, 1 pays for moving the element the next time the table is expanded, and 1 pays for moving an older element the next time the table is expanded. Intuitively, this explains why an element's contribution never “runs out” regardless of how many times the table is expanded: since the table is always doubled, the newest half always covers the cost of moving the oldest half.

We initially assumed that creating a table was free. In reality, creating a table of size n may be as expensive as O(n). Let us say that the cost of creating a table of size n is n. Does this new cost present a difficulty? Not really; it turns out we use the same method to show the amortized O(1) bounds. All we have to do is change the payment.

When a new table is created, there is an old table with m entries. The new table will be of size 2m. As long as the entries currently in the table have added enough to the pool to pay for creating the new table, we will be all right.

We cannot expect the first m/2 entries to help pay for the new table. Those entries already paid for the current table. We must then rely on the last m/2 entries to pay the cost 2m. This means we must add $\frac{2m}{m/2} = 4$ to the payment for each entry, for a total payment of 3 + 4 = 7.

1.6 Potential method

In computational complexity theory, the potential method is a method used to analyze the amortized time and space complexity of a data structure, a measure of its performance over sequences of operations that smooths out the cost of infrequent but expensive operations.[1][2]

1.6.1 Definition of amortized time

In the potential method, a function Φ is chosen that maps states of the data structure to non-negative numbers. If S is a state of the data structure, Φ(S) may be thought of intuitively as an amount of potential energy stored in that state;[1][2] alternatively, Φ(S) may be thought of as representing the amount of disorder in state S or its distance from an ideal state. The potential value prior to the operation of initializing a data structure is defined to be zero.

Let o be any individual operation within a sequence of operations on some data structure, with $S_{\text{before}}$ denoting the state of the data structure prior to operation o and $S_{\text{after}}$ denoting its state after operation o has completed. Then, once Φ has been chosen, the amortized time for operation o is defined to be

$$T_{\text{amortized}}(o) = T_{\text{actual}}(o) + C \cdot (\Phi(S_{\text{after}}) - \Phi(S_{\text{before}})),$$

where C is a non-negative constant of proportionality (in units of time) that must remain fixed throughout the analysis. That is, the amortized time is defined to be the actual time taken by the operation plus C times the difference in potential caused by the operation.[1][2]

1.6.2 Relation between amortized and actual time

Despite its artificial appearance, the total amortized time of a sequence of operations provides a valid upper bound on the actual time for the same sequence of operations. For any sequence of operations $O = o_1, o_2, \ldots$, define:

• The total amortized time: $T_{\text{amortized}}(O) = \sum_i T_{\text{amortized}}(o_i)$
• The total actual time: $T_{\text{actual}}(O) = \sum_i T_{\text{actual}}(o_i)$

Then, writing $S_i$ for the state before operation $o_i$ and $S_{i+1}$ for the state after it:

$$T_{\text{amortized}}(O) = \sum_i \left(T_{\text{actual}}(o_i) + C \cdot (\Phi(S_{i+1}) - \Phi(S_i))\right) = T_{\text{actual}}(O) + C \cdot (\Phi(S_{\text{final}}) - \Phi(S_{\text{initial}}))$$

where the sequence of potential function values forms a telescoping series in which all terms other than the initial and final potential function values cancel in pairs. Hence:

$$T_{\text{actual}}(O) = T_{\text{amortized}}(O) + C \cdot (\Phi(S_{\text{initial}}) - \Phi(S_{\text{final}}))$$

In case $\Phi(S_{\text{final}}) \ge 0$ and $\Phi(S_{\text{initial}}) = 0$, $T_{\text{actual}}(O) \le T_{\text{amortized}}(O)$, so the amortized time can be used to provide accurate predictions about the actual time of sequences of operations, even though the amortized time for an individual operation may vary widely from its actual time.

1.6.3 Amortized analysis of worst-case inputs

Typically, amortized analysis is used in combination with a worst case assumption about the input sequence. With this assumption, if X is a type of operation that may be performed by the data structure, and n is an integer defining the size of the given data structure (for instance, the number of items that it contains), then the amortized time for operations of type X is defined to be the maximum, among all possible sequences of operations on data structures of size n and all operations $o_i$ of type X within the sequence, of the amortized time for operation $o_i$.

With this definition, the time to perform a sequence of operations may be estimated by multiplying the amortized time for each type of operation in the sequence by the number of operations of that type.

1.6.4 Examples

Dynamic array

A dynamic array is a data structure for maintaining an array of items, allowing both random access to positions within the array and the ability to increase the array size by one. It is available in Java as the “ArrayList” type and in Python as the “list” type.

A dynamic array may be implemented by a data structure consisting of an array A of items, of some length N, together with a number n ≤ N representing the positions within the array that have been used so far. With this structure, random accesses to the dynamic array may be implemented by accessing the same cell of the internal array A, and when n < N an operation that increases the dynamic array size may be implemented simply by incrementing n. However, when n = N, it is necessary to resize A, and a common strategy for doing so is to double its size, replacing A by a new array of length 2n.[3]

This structure may be analyzed using the potential function:

$$\Phi = 2n - N$$

Since the resizing strategy always causes A to be at least half-full, this potential function is always non-negative, as desired.

When an increase-size operation does not lead to a resize operation, Φ increases by 2, a constant. Therefore, the constant actual time of the operation and the constant increase in potential combine to give a constant amortized time for an operation of this type.

However, when an increase-size operation causes a resize, the potential decreases from n (its value when n = N) to zero after the resize. Allocating a new internal array A and copying all of the values from the old internal array to the new one takes O(n) actual time, but (with an appropriate choice of the constant of proportionality C) this is entirely cancelled by the decrease in the potential function, leaving again a constant total amortized time for the operation.

The other operations of the data structure (reading and writing array cells without changing the array size) do not cause the potential function to change and have the same constant amortized time as their actual time.[2]

Therefore, with this choice of resizing strategy and potential function, the potential method shows that all dynamic array operations take constant amortized time. Combining this with the inequality relating amortized time and actual time over sequences of operations, this shows that any sequence of n dynamic array operations takes O(n) actual time in the worst case, despite the fact that some of the individual operations may themselves take a linear amount of time.[2]

Multi-Pop Stack

Consider a stack which supports the following operations:

• Initialize: create an empty stack.
• Push: add a single element on top of the stack.
• Pop(k): remove k elements from the top of the stack.

This structure may be analyzed using the potential function:

Φ = number of elements in the stack

This number is always non-negative, as required. A Push operation takes constant time and increases Φ by 1, so its amortized time is constant. A Pop operation takes time O(k) but also reduces Φ by k, so its amortized time is also constant. This proves that any sequence of m operations takes O(m) actual time in the worst case.

Binary counter

Consider a counter represented as a binary number and supporting the following operations:

• Initialize: create a counter with value 0.
• Inc: add 1 to the counter.
• Read: return the current counter value.

This structure may be analyzed using the potential function:

Φ = number of bits equal to 1

This number is always non-negative and starts with 0, as required. An Inc operation flips the least significant bit. Then, if the least significant bit was flipped from 1 to 0, the next bit is flipped as well. This goes on until finally a bit is flipped from 0 to 1, at which point the flipping stops. If the number of bits flipped from 1 to 0 is k, then the actual time is k + 1 and the potential is reduced by k − 1, so the amortized time is 2. Hence, the actual time for running m Inc operations is O(m).

1.6.5 Applications

The potential function method is commonly used to analyze Fibonacci heaps, a form of priority queue in which removing an item takes logarithmic amortized time, and all other operations take constant amortized time.[4] It may also be used to analyze splay trees, a self-adjusting form of binary search tree with logarithmic amortized time per operation.[5]

1.6.6 References

[1] Goodrich, Michael T.; Tamassia, Roberto (2002), “1.5.1 Amortization Techniques”, Algorithm Design: Foundations, Analysis and Internet Examples, Wiley, pp. 36–38.
[2] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001) [1990]. “17.3 The potential method”. Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 412–416. ISBN 0-262-03293-7.
[3] Goodrich and Tamassia, 1.5.2 Analyzing an Extendable Array Implementation, pp. 139–141; Cormen et al., 17.4 Dynamic tables, pp. 416–424.
[4] Cormen et al., Chapter 20, “Fibonacci Heaps”, pp. 476–497.
[5] Goodrich and Tamassia, Section 3.4, “Splay Trees”, pp. 185–194.

Chapter 2 Sequences

2.1 Array data type

In computer science, an array type is a data type that is meant to describe a collection of elements (values or variables), each selected by one or more indices (identifying keys) that can be computed at run time by the program. Such a collection is usually called an array variable, array value, or simply array.[1] By analogy with the mathematical concepts of vector and matrix, array types with one and two indices are often called vector type and matrix type, respectively.

Language support for array types may include certain built-in array data types, some syntactic constructions (array type constructors) that the programmer may use to define such types and declare array variables, and special notation for indexing array elements.[1] For example, in the Pascal programming language, the declaration type MyTable = array [1..4,1..2] of integer defines a new array data type called MyTable. The declaration var A: MyTable then defines a variable A of that type, which is an aggregate of eight elements, each being an integer variable identified by two indices. In the Pascal program, those elements are denoted A[1,1], A[1,2], A[2,1], …, A[4,2].[2] Special array types are often defined by the language's standard libraries.

Dynamic lists are also more common and easier to implement than dynamic arrays. Array types are distin[…] selected by indices computed at run-time.

Depending on the language, array types may overlap (or be identified with) other data types that describe aggregates of values, such as lists and strings. Array types are often implemented by array data structures, but sometimes by other means, such as hash tables, linked lists, or search trees.

2.1.1 History

Heinz Rutishauser's programming language Superplan (1949–1951) included multi-dimensional arrays. Rutishauser, however, although describing how a compiler for his language should be built, did not implement one.

Assembly languages and low-level languages like BCPL[3] generally have no syntactic support for arrays.

Because of the importance of array structures for efficient computation, the earliest high-level programming languages, including FORTRAN (1957), COBOL (1960), and Algol 60 (1960), provided support for multi-dimensional arrays.

2.1.2 Abstract arrays

An array data structure can be mathematically modeled as an abstract data structure (an abstract array) with two operations

get(A, I): the data stored in the element of the array A whose indices are the integer tuple I.
guished from record types mainly because they allow the element indices to be computed at run time, as in the Passet(A,I,V): the array that results by setting the cal assignment A[I,J] := A[N-I,2*J]. Among other things, value of that element to V. this feature allows a single iterative statement to process arbitrarily many elements of an array variable. These operations are required to satisfy the axioms[4] In more theoretical contexts, especially in type theory and in the description of abstract algorithms, the terms “arget(set(A,I, V), I) = V ray” and “array type” sometimes refer to an abstract data get(set(A,I, V), J) = get(A, J) if I ≠ J type (ADT) also called abstract array or may refer to an associative array, a mathematical model with the basic operations and behavior of a typical array type in most for any array state A, any value V, and any tuples I, J for languages — basically, a collection of elements that are which the operations are defined. 18 2.1. ARRAY DATA TYPE 19 The first axiom means that each element behaves like a variable. The second axiom means that elements with distinct indices behave as disjoint variables, so that storing a value in one element does not affect the value of any other element. These axioms do not place any constraints on the set of valid index tuples I, therefore this abstract model can be used for triangular matrices and other oddly-shaped arrays. 2.1.3 Implementations In order to effectively implement variables of such types as array structures (with indexing done by pointer arithmetic), many languages restrict the indices to integer data types (or other types that can be interpreted as integers, such as bytes and enumerated types), and require that all elements have the same data type and storage size. Most of those languages also restrict each index to a finite interval of integers, that remains fixed throughout the lifetime of the array variable. 
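The restrictions just described (integer indices, a fixed index interval, uniform element storage) can be illustrated with a minimal Python sketch. The class name FixedRangeArray and its methods are hypothetical, chosen only for this illustration, and a Python list stands in for the contiguous block of storage that a compiled language would use.

```python
class FixedRangeArray:
    """One-dimensional array whose index interval [lo, hi] is fixed at creation.

    Hypothetical illustration only: a Python list stands in for the
    contiguous block of equally sized cells.
    """

    def __init__(self, lo, hi, fill=0):
        if hi < lo:
            raise ValueError("empty index interval")
        self.lo, self.hi = lo, hi
        self._cells = [fill] * (hi - lo + 1)

    def _offset(self, i):
        # Only integer indices inside the fixed interval are accepted.
        if not isinstance(i, int) or not (self.lo <= i <= self.hi):
            raise IndexError(f"index {i!r} outside [{self.lo}, {self.hi}]")
        # The position of a cell is found by plain arithmetic on the index.
        return i - self.lo

    def get(self, i):
        return self._cells[self._offset(i)]

    def set(self, i, value):
        self._cells[self._offset(i)] = value


a = FixedRangeArray(1, 4)   # indices 1..4, as in a Pascal-style declaration
a.set(2, 42)
print(a.get(2), a.get(3))
```

Storing into one cell leaves every other cell unchanged, which is the behaviour the get/set axioms of Section 2.1.2 ask for.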
In some compiled languages, in fact, the index ranges may have to be known at compile time.

On the other hand, some programming languages provide more liberal array types that allow indexing by arbitrary values, such as floating-point numbers, strings, objects, references, etc. Such index values cannot be restricted to an interval, much less a fixed interval. So, these languages usually allow arbitrary new elements to be created at any time. This choice precludes the implementation of array types as array data structures. That is, those languages use array-like syntax to implement a more general associative array semantics, and must therefore be implemented by a hash table or some other search data structure.

2.1.4 Language support

Multi-dimensional arrays

The number of indices needed to specify an element is called the dimension, dimensionality, or rank of the array type. (This nomenclature conflicts with the concept of dimension in linear algebra,[5] where it is the number of elements. Thus, an array of numbers with 5 rows and 4 columns, hence 20 elements, is said to have dimension 2 in computing contexts, but represents a matrix with dimension 5-by-4 or 20 in mathematics. Also, the computer science meaning of “rank” is similar to its meaning in tensor algebra but not to the linear algebra concept of rank of a matrix.)

[Figure: A two-dimensional array stored as a one-dimensional array of one-dimensional arrays (rows).]

Many languages support only one-dimensional arrays. In those languages, a multi-dimensional array is typically represented by an Iliffe vector, a one-dimensional array of references to arrays of one dimension less. A two-dimensional array, in particular, would be implemented as a vector of pointers to its rows. Thus an element in row i and column j of an array A would be accessed by double indexing (A[i][j] in typical notation).
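The row-pointer representation described above can be sketched in Python, whose lists hold references to their elements, so a list of lists behaves much like a vector of pointers to rows. The helper name element is an assumption made for this illustration.

```python
# A 3x3 array held as a list of row lists: the outer list plays the role
# of the vector of row pointers, and each inner list is one row.
A = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]


def element(rows, i, j):
    """Double indexing: fetch the row reference, then index into that row."""
    row = rows[i]   # first access: follow the reference to row i
    return row[j]   # second access: pick column j inside that row


# Rows are independent objects, so nothing forces them to have equal length.
triangle = [[1], [2, 3], [4, 5, 6]]

print(element(A, 1, 2))             # row 1, column 2 (zero-based)
print([len(r) for r in triangle])   # row lengths of the triangular example
```

Each lookup costs two memory accesses instead of one multiplication and one access, which is the trade-off this representation makes.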
This way of emulating multi-dimensional arrays allows the creation of jagged arrays, where each row may have a different size or, in general, where the valid range of each index depends on the values of all preceding indices.

This representation for multi-dimensional arrays is quite prevalent in C and C++ software. However, C and C++ will use a linear indexing formula for multi-dimensional arrays that are declared with compile time constant size, e.g. by int A[10][20] or int A[m][n], instead of the traditional int **A.[6]:p.81

Indexing notation

Most programming languages that support arrays support the store and select operations, and have special syntax for indexing. Early languages used parentheses, e.g. A(i,j), as in FORTRAN; others choose square brackets, e.g. A[i,j] or A[i][j], as in Algol 60 and Pascal (to distinguish from the use of parentheses for function calls).

Index types

Array data types are most often implemented as array structures: with the indices restricted to integer (or totally ordered) values, index ranges fixed at array creation time, and multilinear element addressing. This was the case in most “third generation” languages, and is still the case of most systems programming languages such as Ada, C, and C++. In some languages, however, array data types have the semantics of associative arrays, with indices of arbitrary type and dynamic element creation. This is the case in some scripting languages such as Awk and Lua, and of some array types provided by standard C++ libraries.

Bounds checking

Some languages (like Pascal and Modula) perform bounds checking on every access, raising an exception or aborting the program when any index is out of its valid range. Compilers may allow these checks to be turned off to trade safety for speed. Other languages (like FORTRAN and C) trust the programmer and perform no checks. Good compilers may also analyze the program to determine the range of possible values that the index may have, and this analysis may lead to bounds-checking elimination.

Index origin

Some languages, such as C, provide only zero-based array types, for which the minimum valid value for any index is 0. This choice is convenient for array implementation and address computations. With a language such as C, a pointer to the interior of any array can be defined that will symbolically act as a pseudo-array that accommodates negative indices. This works only because C does not check an index against bounds when used.

Other languages provide only one-based array types, where each index starts at 1; this is the traditional convention in mathematics for matrices and mathematical sequences. A few languages, such as Pascal, support n-based array types, whose minimum legal indices are chosen by the programmer. The relative merits of each choice have been the subject of heated debate. Zero-based indexing has a natural advantage over one-based indexing in avoiding off-by-one or fencepost errors.[7]

See comparison of programming languages (array) for the base indices used by various languages.

Highest index

The relation between numbers appearing in an array declaration and the index of that array’s last element also varies by language. In many languages (such as C), one should specify the number of elements contained in the array; whereas in others (such as Pascal and Visual Basic .NET) one should specify the numeric value of the index of the last element. Needless to say, this distinction is immaterial in languages where the indices start at 1.

Array algebra

Some programming languages support array programming, where operations and functions defined for certain data types are implicitly extended to arrays of elements of those types. Thus one can write A+B to add corresponding elements of two arrays A and B. Usually these languages provide both the element-by-element multiplication and the standard matrix product of linear algebra, and which of these is represented by the * operator varies by language.

Languages providing array programming capabilities have proliferated since the innovations in this area of APL. These are core capabilities of domain-specific languages such as GAUSS, IDL, Matlab, and Mathematica. They are a core facility in newer languages, such as Julia and recent versions of Fortran. These capabilities are also provided via standard extension libraries for other general purpose programming languages (such as the widely used NumPy library for Python).

String types and arrays

Many languages provide a built-in string data type, with specialized notation ("string literals") to build values of that type. In some languages (such as C), a string is just an array of characters, or is handled in much the same way. Other languages, like Pascal, may provide vastly different operations for strings and arrays.

Array index range queries

Some programming languages provide operations that return the size (number of elements) of a vector, or, more generally, the range of each index of an array. In C and C++ arrays do not support the size function, so programmers often have to declare a separate variable to hold the size, and pass it to procedures as a separate parameter.

Elements of a newly created array may have undefined values (as in C), or may be defined to have a specific “default” value such as 0 or a null pointer (as in Java).

In C++ a std::vector object supports the store, select, and append operations with the performance characteristics discussed above. Vectors can be queried for their size and can be resized. Slower operations like inserting an element in the middle are also supported.

Slicing

An array slicing operation takes a subset of the elements of an array-typed entity (value or variable) and then assembles them as another array-typed entity, possibly with other indices. If array types are implemented as array structures, many useful slicing operations (such as selecting a sub-array, swapping indices, or reversing the direction of the indices) can be performed very efficiently by manipulating the dope vector of the structure. The possible slicings depend on the implementation details: for example, FORTRAN allows slicing off one column of a matrix variable, but not a row, and treating it as a vector; whereas C allows slicing off a row from a matrix, but not a column.

On the other hand, other slicing operations are possible when array types are implemented in other ways.

Resizing

Some languages allow dynamic arrays (also called resizable, growable, or extensible): array variables whose index ranges may be expanded at any time after creation, without changing the values of their current elements.

For one-dimensional arrays, this facility may be provided as an operation "append(A,x)" that increases the size of the array A by one and then sets the value of the last element to x. Other array types (such as Pascal strings) provide a concatenation operator, which can be used together with slicing to achieve that effect and more. In some languages, assigning a value to an element of an array automatically extends the array, if necessary, to include that element. In other array types, a slice can be replaced by an array of different size, with subsequent elements being renumbered accordingly, as in Python’s list assignment "A[5:5] = [10,20,30]", which inserts three new elements (10, 20, and 30) before element "A[5]". Resizable arrays are conceptually similar to lists, and the two concepts are synonymous in some languages.

An extensible array can be implemented as a fixed-size array, with a counter that records how many elements are actually in use. The append operation merely increments the counter until the whole array is used, at which point the append operation may be defined to fail. This is an implementation of a dynamic array with a fixed capacity, as in the string type of Pascal. Alternatively, the append operation may re-allocate the underlying array with a larger size, and copy the old elements to the new area.

2.1.5 See also

• Array access analysis
• Array programming
• Array slicing
• Bounds checking and index checking
• Bounds checking elimination
• Comparison of programming languages (array)
• Parallel array
• Delimiter-separated values

Related types

• Variable-length array
• Dynamic array
• Sparse array

2.1.6 References

[1] Robert W. Sebesta (2001) Concepts of Programming Languages. Addison-Wesley. 4th edition (1998), 5th edition (2001), ISBN 9780201385960
[2] K. Jensen and Niklaus Wirth, PASCAL User Manual and Report. Springer. Paperback edition (2007) 184 pages, ISBN 978-3540069508
[3] John Mitchell, Concepts of Programming Languages. Cambridge University Press.
[4] Luckham, Suzuki (1979), “Verification of array, record, and pointer operations in Pascal”. ACM Transactions on Programming Languages and Systems 1(2), 226–244.
[5] see the definition of a matrix
[6] Brian W. Kernighan and Dennis M. Ritchie (1988), The C programming Language. Prentice-Hall, 205 pages.
[7] Edsger W. Dijkstra, Why numbering should start at zero

2.1.7 External links

• NIST’s Dictionary of Algorithms and Data Structures: Array

2.2 Array data structure

This article is about the byte-layout-level structure. For the abstract data type, see Array data type. For other uses, see Array.

In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key. An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.[1][2][3] The simplest type of data structure is a linear array, also called one-dimensional array.

For example, an array of 10 32-bit integer variables, with indices 0 through 9, may be stored as 10 words at memory addresses 2000, 2004, 2008, ... 2036, so that the element with index i has the address 2000 + 4 × i.[4]

The memory address of the first element of an array is called first address or foundation address.

Because the mathematical concept of a matrix can be represented as a two-dimensional grid, two-dimensional arrays are also sometimes called matrices. In some cases the term “vector” is used in computing to refer to an array, although tuples rather than vectors are more correctly the mathematical equivalent. Arrays are often used to implement tables, especially lookup tables; the word table is sometimes used as a synonym of array.

Arrays are among the oldest and most important data structures, and are used by almost every program. They are also used to implement many other data structures, such as lists and strings. They effectively exploit the addressing logic of computers. In most modern computers and many external storage devices, the memory is a one-dimensional array of words, whose indices are their addresses. Processors, especially vector processors, are often optimized for array operations.

Arrays are useful mostly because the element indices can be computed at run time. Among other things, this feature allows a single iterative statement to process arbitrarily many elements of an array. For that reason, the elements of an array data structure are required to have the same size and should use the same data representation. The set of valid index tuples and the addresses of the elements (and hence the element addressing formula) are usually,[3][5] but not always,[2] fixed while the array is in use.
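The address arithmetic in the 10-integer example above can be sketched in a few lines of Python. The function name element_address and the constants BASE and WORD are assumptions made for this illustration; the numbers 2000 and 4 come from the example itself.

```python
def element_address(base, element_size, index):
    """Address of the element at `index` in a linear array of equal-sized cells."""
    return base + element_size * index


BASE = 2000   # address of the element with index 0, as in the example
WORD = 4      # a 32-bit integer occupies 4 bytes

addresses = [element_address(BASE, WORD, i) for i in range(10)]
print(addresses)
```

Because the address depends only on the index and two constants, any element can be located without visiting the others.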
The term array is often used to mean array data type, a kind of data type provided by most high-level programming languages that consists of a collection of values or variables that can be selected by one or more indices computed at run-time. Array types are often implemented by array structures; however, in some languages they may be implemented by hash tables, linked lists, search trees, or other data structures.

The term is also used, especially in the description of algorithms, to mean associative array or “abstract array”, a theoretical computer science model (an abstract data type or ADT) intended to capture the essential properties of arrays.

2.2.1 History

The first digital computers used machine-language programming to set up and access array structures for data tables, vector and matrix computations, and for many other purposes. John von Neumann wrote the first array-sorting program (merge sort) in 1945, during the building of the first stored-program computer.[6]p. 159 Array indexing was originally done by self-modifying code, and later using index registers and indirect addressing. Some mainframes designed in the 1960s, such as the Burroughs B5000 and its successors, used memory segmentation to perform index-bounds checking in hardware.[7]

Assembly languages generally have no special support for arrays, other than what the machine itself provides. The earliest high-level programming languages, including FORTRAN (1957), COBOL (1960), and ALGOL 60 (1960), had support for multi-dimensional arrays, and so has C (1972). In C++ (1983), class templates exist for multi-dimensional arrays whose dimension is fixed at compile time[3][5] as well as for runtime-flexible arrays.[2]

2.2.2 Applications

Arrays are used to implement mathematical vectors and matrices, as well as other kinds of rectangular tables. Many databases, small and large, consist of (or include) one-dimensional arrays whose elements are records.

Arrays are used to implement other data structures, such as heaps, hash tables, deques, queues, stacks, strings, and VLists.

One or more large arrays are sometimes used to emulate in-program dynamic memory allocation, particularly memory pool allocation. Historically, this has sometimes been the only way to allocate “dynamic memory” portably.

Arrays can be used to determine partial or complete control flow in programs, as a compact alternative to (otherwise repetitive) multiple IF statements. They are known in this context as control tables and are used in conjunction with a purpose built interpreter whose control flow is altered according to values contained in the array. The array may contain subroutine pointers (or relative subroutine numbers that can be acted upon by SWITCH statements) that direct the path of the execution.

2.2.3 Element identifier and addressing formulas

When data objects are stored in an array, individual objects are selected by an index that is usually a non-negative scalar integer. Indexes are also called subscripts. An index maps the array value to a stored object.

There are three ways in which the elements of an array can be indexed:

• 0 (zero-based indexing): The first element of the array is indexed by subscript of 0.[8]
• 1 (one-based indexing): The first element of the array is indexed by subscript of 1.[9]
• n (n-based indexing): The base index of an array can be freely chosen. Usually programming languages allowing n-based indexing also allow negative index values, and other scalar data types like enumerations or characters may be used as an array index.

Arrays can have multiple dimensions, thus it is not uncommon to access an array using multiple indices. For example, a two-dimensional array A with three rows and four columns might provide access to the element at the 2nd row and 4th column by the expression A[1, 3] (in a row major language) or A[3, 1] (in a column major language) in the case of a zero-based indexing system.
Thus two indices are used for a two-dimensional array, three for a three-dimensional array, and n for an n-dimensional array.

The number of indices needed to specify an element is called the dimension, dimensionality, or rank of the array.

In standard arrays, each index is restricted to a certain range of consecutive integers (or consecutive values of some enumerated type), and the address of an element is computed by a “linear” formula on the indices.

One-dimensional arrays

A one-dimensional array (or single dimension array) is a type of linear array. Accessing its elements involves a single subscript which can either represent a row or column index.

As an example consider the C declaration int anArrayName[10];

Syntax : datatype anArrayname[sizeofArray];

In the given example the array can contain 10 elements of any value available to the int type. In C, the array element indices are 0-9 inclusive in this case. For example, the expressions anArrayName[0] and anArrayName[9] are the first and last elements respectively.

For a vector with linear addressing, the element with index i is located at the address B + c × i, where B is a fixed base address and c a fixed constant, sometimes called the address increment or stride.

If the valid element indices begin at 0, the constant B is simply the address of the first element of the array. For this reason, the C programming language specifies that array indices always begin at 0; and many programmers will call that element "zeroth" rather than “first”.

However, one can choose the index of the first element by an appropriate choice of the base address B. For example, if the array has five elements, indexed 1 through 5, and the base address B is replaced by B − 30c, then the indices of those same elements will be 31 to 35. If the numbering does not start at 0, the constant B may not be the address of any element.

Multidimensional arrays

For a two-dimensional array, the element with indices i,j would have address B + c · i + d · j, where the coefficients c and d are the row and column address increments, respectively.

More generally, in a k-dimensional array, the address of an element with indices i1, i2, ..., ik is

B + c1 · i1 + c2 · i2 + ... + ck · ik.

For example: int a[2][3];

This means that array a has 2 rows and 3 columns, and the array is of integer type. Here we can store 6 elements, which are stored linearly, starting with the first row and continuing with the second row. The above array will be stored as a11, a12, a13, a21, a22, a23.

The addressing formula above requires only k multiplications and k additions, for any array that can fit in memory. Moreover, if any coefficient is a fixed power of 2, the multiplication can be replaced by bit shifting.

The coefficients ck must be chosen so that every valid index tuple maps to the address of a distinct element.

If the minimum legal value for every index is 0, then B is the address of the element whose indices are all zero. As in the one-dimensional case, the element indices may be changed by changing the base address B. Thus, if a two-dimensional array has rows and columns indexed from 1 to 10 and 1 to 20, respectively, then replacing B by B + c1 − 3c2 will cause them to be renumbered from 0 through 9 and 4 through 23, respectively. Taking advantage of this feature, some languages (like FORTRAN 77) specify that array indices begin at 1, as in mathematical tradition, while other languages (like Fortran 90, Pascal and Algol) let the user choose the minimum value for each index.

Dope vectors

The addressing formula is completely defined by the dimension k, the base address B, and the increments c1, c2, ..., ck. It is often useful to pack these parameters into a record called the array’s descriptor or stride vector or dope vector.[2][3] The size of each element, and the minimum and maximum values allowed for each index may also be included in the dope vector. The dope vector is a complete handle for the array, and is a convenient way to pass arrays as arguments to procedures. Many useful array slicing operations (such as selecting a sub-array, swapping indices, or reversing the direction of the indices) can be performed very efficiently by manipulating the dope vector.[2]

Compact layouts

Often the coefficients are chosen so that the elements occupy a contiguous area of memory. However, that is not necessary. Even if arrays are always created with contiguous elements, some array slicing operations may create non-contiguous sub-arrays from them.

There are two systematic compact layouts for a two-dimensional array. For example, consider the matrix

\[ A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}. \]

In the row-major order layout (adopted by C for statically declared arrays), the elements in each row are stored in consecutive positions and all of the elements of a row have a lower address than any of the elements of a consecutive row:

1 2 3 4 5 6 7 8 9

In column-major order (traditionally used by Fortran), the elements in each column are consecutive in memory and all of the elements of a column have a lower address than any of the elements of a consecutive column:

1 4 7 2 5 8 3 6 9

For arrays with three or more indices, “row major order” puts in consecutive positions any two elements whose index tuples differ only by one in the last index. “Column major order” is analogous with respect to the first index.

In systems which use processor cache or virtual memory, scanning an array is much faster if successive elements are stored in consecutive positions in memory, rather than sparsely scattered.
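The two compact layouts can be compared with a short Python sketch that applies the linear addressing formula with B = 0 and addresses counted in elements rather than bytes. The function names row_major_offset, column_major_offset and flatten are assumptions made for this illustration.

```python
def row_major_offset(i, j, n_rows, n_cols):
    # Row increment c = n_cols, column increment d = 1.
    return i * n_cols + j


def column_major_offset(i, j, n_rows, n_cols):
    # Row increment c = 1, column increment d = n_rows.
    return j * n_rows + i


def flatten(matrix, offset):
    """Lay a rectangular list-of-rows matrix out linearly using `offset`."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    flat = [None] * (n_rows * n_cols)
    for i in range(n_rows):
        for j in range(n_cols):
            flat[offset(i, j, n_rows, n_cols)] = matrix[i][j]
    return flat


A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

print(flatten(A, row_major_offset))      # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(flatten(A, column_major_offset))   # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```

Both layouts use every position exactly once; they differ only in which index varies fastest between neighbouring memory cells.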
Many algorithms that use multidimensional arrays will scan them in a predictable order. A programmer (or a sophisticated compiler) may use this information to choose between row- or column-major layout for each array. For example, when computing the product A·B of two matrices, it would be best to have A stored in row-major order, and B in column-major order.

Resizing

Main article: Dynamic array

Static arrays have a size that is fixed when they are created and consequently do not allow elements to be inserted or removed. However, by allocating a new array and copying the contents of the old array to it, it is possible to effectively implement a dynamic version of an array; see dynamic array. If this operation is done infrequently, insertions at the end of the array require only amortized constant time.

Some array data structures do not reallocate storage, but do store a count of the number of elements of the array in use, called the count or size. This effectively makes the array a dynamic array with a fixed maximum size or capacity; Pascal strings are examples of this.

Non-linear formulas

More complicated (non-linear) formulas are occasionally used. For a compact two-dimensional triangular array, for instance, the addressing formula is a polynomial of degree 2.

2.2.4 Efficiency

Both store and select take (deterministic worst case) constant time. Arrays take linear (O(n)) space in the number of elements n that they hold.

In an array with element size k and on a machine with a cache line size of B bytes, iterating through an array of n elements requires a minimum of ceiling(nk/B) cache misses, because its elements occupy contiguous memory locations. This is roughly a factor of B/k better than the number of cache misses needed to access n elements at random memory locations. As a consequence, sequential iteration over an array is noticeably faster in practice than iteration over many other data structures, a property called locality of reference. (This does not mean, however, that using a perfect hash or trivial hash within the same (local) array will not be even faster, and achievable in constant time.) Libraries provide low-level optimized facilities for copying ranges of memory (such as memcpy) which can be used to move contiguous blocks of array elements significantly faster than can be achieved through individual element access. The speedup of such optimized routines varies by array element size, architecture, and implementation.

Memory-wise, arrays are compact data structures with no per-element overhead. There may be a per-array overhead, e.g. to store index bounds, but this is language-dependent. It can also happen that elements stored in an array require less memory than the same elements stored in individual variables, because several array elements can be stored in a single word; such arrays are often called packed arrays. An extreme (but commonly used) case is the bit array, where every bit represents a single element. A single octet can thus hold up to 256 different combinations of up to 8 different conditions, in the most compact form.

Array accesses with statically predictable access patterns are a major source of data parallelism.

Comparison with other data structures

Growable arrays are similar to arrays but add the ability to insert and delete elements; adding and deleting at the end is particularly efficient. However, they reserve linear (Θ(n)) additional storage, whereas arrays do not reserve additional storage.

Associative arrays provide a mechanism for array-like functionality without huge storage overheads when the index values are sparse. For example, an array that contains values only at indexes 1 and 2 billion may benefit from using such a structure. Specialized associative arrays with integer keys include Patricia tries, Judy arrays, and van Emde Boas trees.

Balanced trees require O(log n) time for indexed access, but also permit inserting or deleting elements in O(log n) time,[15] whereas growable arrays require linear (Θ(n)) time to insert or delete elements at an arbitrary position.

Linked lists allow constant time removal and insertion in the middle but take linear time for indexed access. Their memory use is typically worse than arrays, but is still linear.

[Figure: A two-dimensional array stored as a one-dimensional array of one-dimensional arrays (rows).]

An Iliffe vector is an alternative to a multidimensional array structure. It uses a one-dimensional array of references to arrays of one dimension less. For two dimensions, in particular, this alternative structure would be a vector of pointers to vectors, one for each row. Thus an element in row i and column j of an array A would be accessed by double indexing (A[i][j] in typical notation). This alternative structure allows jagged arrays, where each row may have a different size or, in general, where the valid range of each index depends on the values of all preceding indices. It also saves one multiplication (by the column address increment), replacing it by a bit shift (to index the vector of row pointers) and one extra memory access (fetching the row address), which may be worthwhile in some architectures.

2.2.5 Dimension

The dimension of an array is the number of indices needed to select an element. Thus, if the array is seen as a function on a set of possible index combinations, it is the dimension of the space of which its domain is a discrete subset. Thus a one-dimensional array is a list of data, a two-dimensional array a rectangle of data, a three-dimensional array a block of data, etc.

This should not be confused with the dimension of the set of all matrices with a given domain, that is, the number of elements in the array. For example, an array with 5 rows and 4 columns is two-dimensional, but such matrices form a 20-dimensional space. Similarly, a three-dimensional vector can be represented by a one-dimensional array of size three.

2.2.6 See also

• Dynamic array
• Parallel array
• Variable-length array
• Bit array
• Array slicing
• Offset (computer science)
• Row-major order
• Stride of an array

2.2.7 References

[1] Black, Paul E. (13 November 2008). “array”. Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology. Retrieved 22 August 2010.
[2] Bjoern Andres; Ullrich Koethe; Thorben Kroeger; Hamprecht (2010). “Runtime-Flexible Multidimensional Arrays and Views for C++98 and C++0x”. arXiv:1008.2909 [cs.DS].
[3] Garcia, Ronald; Lumsdaine, Andrew (2005). “MultiArray: a C++ library for generic programming with arrays”. Software: Practice and Experience. 35 (2): 159–188. doi:10.1002/spe.630. ISSN 0038-0644.
[4] David R. Richardson (2002), The Book on Data Structures. iUniverse, 112 pages. ISBN 0-595-24039-9, ISBN 978-0-595-24039-5.
[5] Veldhuizen, Todd L. (December 1998). Arrays in Blitz++ (PDF). Computing in Object-Oriented Parallel Environments. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 223–230. doi:10.1007/3-540-49372-7_24. ISBN 978-3-540-65387-5.
[6] Donald Knuth, The Art of Computer Programming, vol. 3. Addison-Wesley
[7] Levy, Henry M. (1984), Capability-based Computer Systems, Digital Press, p. 22, ISBN 9780932376220.
[8] “Array Code Examples - PHP Array Functions - PHP code”. http://www.configure-all.com/: Computer Programming Web programming Tips. Retrieved 8 April 2011. In most computer languages array index (counting) starts from 0, not from 1.
Index of the first element of the array is 0, index of the second element of the array is 1, and so on. In array of names below you can see indexes and values.
[9] “Chapter 6 - Arrays, Types, and Constants”. Modula-2 Tutorial. http://www.modula2.org/tutor/index.php. Retrieved 8 April 2011. The names of the twelve variables are given by Automobiles[1], Automobiles[2], ... Automobiles[12]. The variable name is “Automobiles” and the array subscripts are the numbers 1 through 12. [i.e. in Modula-2, the index starts by one!]
[10] Chris Okasaki (1995). “Purely Functional Random-Access Lists”. Proceedings of the Seventh International Conference on Functional Programming Languages and Computer Architecture: 86–95. doi:10.1145/224164.224187.
[11] Gerald Kruse. CS 240 Lecture Notes: Linked Lists Plus: Complexity Trade-offs. Juniata College. Spring 2008.
[12] Day 1 Keynote - Bjarne Stroustrup: C++11 Style at GoingNative 2012 on channel9.msdn.com from minute 45 or foil 44
[13] Number crunching: Why you should never, ever, EVER use linked-list in your code again at kjellkod.wordpress.com
[14] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (1999), Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (PDF), Department of Computer Science, University of Waterloo
[15] Counted B-Tree

2.3 Dynamic array

Figure: Several values are inserted at the end of a dynamic array using geometric expansion. Grey cells indicate space reserved for expansion. Most insertions are fast (constant time), while some are slow due to the need for reallocation (Θ(n) time, labelled with turtles). The logical size and capacity of the final array are shown.

In computer science, a dynamic array, growable array, resizable array, dynamic table, mutable array, or array list is a random access, variable-size list data structure that allows elements to be added or removed. It is supplied with standard libraries in many modern mainstream programming languages.

A dynamic array is not the same thing as a dynamically allocated array, which is an array whose size is fixed when the array is allocated, although a dynamic array may use such a fixed-size array as a back end.[1]

2.3.1 Bounded-size dynamic arrays and capacity

A simple dynamic array can be constructed by allocating an array of fixed size, typically larger than the number of elements immediately required. The elements of the dynamic array are stored contiguously at the start of the underlying array, and the remaining positions towards the end of the underlying array are reserved, or unused. Elements can be added at the end of a dynamic array in constant time by using the reserved space, until this space is completely consumed. When all space is consumed, and an additional element is to be added, then the underlying fixed-size array needs to be increased in size. Typically resizing is expensive because it involves allocating a new underlying array and copying each element from the original array. Elements can be removed from the end of a dynamic array in constant time, as no resizing is required. The number of elements used by the dynamic array contents is its logical size or size, while the size of the underlying array is called the dynamic array’s capacity or physical size, which is the maximum possible size without relocating data.[2]

A fixed-size array will suffice in applications where the maximum logical size is fixed (e.g. by specification), or can be calculated before the array is allocated. A dynamic array might be preferred if

• the maximum logical size is unknown, or difficult to calculate, before the array is allocated
• it is considered that a maximum logical size given by a specification is likely to change
• the amortized cost of resizing a dynamic array does not significantly affect performance or responsiveness

2.3.2 Geometric expansion and amortized cost

To avoid incurring the cost of resizing many times, dynamic arrays resize by a large amount, such as doubling in size, and use the reserved space for future expansion. The operation of adding an element to the end might work as follows:

function insertEnd(dynarray a, element e)
    if (a.size = a.capacity)
        // resize a to twice its current capacity:
        a.capacity ← a.capacity * 2
        // (copy the contents to the new memory location here)
    a[a.size] ← e
    a.size ← a.size + 1

As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion a ensures that inserting n elements takes O(n) time overall, meaning that each insertion takes amortized constant time. Many dynamic arrays also deallocate some of the underlying storage if its size drops below a certain threshold, such as 30% of the capacity. This threshold must be strictly smaller than 1/a in order to provide hysteresis (a stable band that avoids repeatedly growing and shrinking) and to support mixed sequences of insertions and removals with amortized constant cost.

Dynamic arrays are a common example when teaching amortized analysis.[3][4]

2.3.3 Growth factor

The growth factor for the dynamic array depends on several factors including a space-time trade-off and algorithms used in the memory allocator itself. For growth factor a, the average time per insertion operation is about a/(a−1), while the number of wasted cells is bounded above by (a−1)n.
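The geometric expansion scheme and the cost figures above can be tried out with a small Python sketch. This is an illustration under stated assumptions, not the implementation used by any particular library: the class name DynamicArray, the copies counter, and the helper copies_per_insert are all hypothetical names introduced here, and a plain Python list of fixed length stands in for the underlying fixed-size array.

```python
class DynamicArray:
    """Illustrative growable array backed by a fixed-length Python list."""

    def __init__(self, growth=2.0, capacity=1):
        if growth <= 1.0 or capacity < 1:
            raise ValueError("growth must exceed 1 and capacity must be positive")
        self.growth = growth          # growth factor a (an assumed parameter)
        self.capacity = capacity      # physical size of the backing store
        self.size = 0                 # logical size
        self.copies = 0               # elements moved by reallocations (analysis only)
        self._store = [None] * capacity

    def insert_end(self, element):
        if self.size == self.capacity:
            # Grow geometrically; max() guarantees progress when
            # int(capacity * growth) rounds down to the current capacity.
            new_capacity = max(self.capacity + 1, int(self.capacity * self.growth))
            new_store = [None] * new_capacity
            for i in range(self.size):        # copy contents to the new block
                new_store[i] = self._store[i]
            self.copies += self.size
            self._store = new_store
            self.capacity = new_capacity
        self._store[self.size] = element
        self.size += 1

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return self._store[index]


def copies_per_insert(n, growth):
    """Average number of element copies caused by n appends."""
    arr = DynamicArray(growth=growth)
    for value in range(n):
        arr.insert_end(value)
    return arr.copies / n
```

Counting copies rather than measuring time keeps the experiment deterministic: with a growth factor of 2 the copies per append stay below 1, which together with the write itself matches the a/(a−1) estimate, and the unused cells never exceed (a−1)n.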
If the memory allocator uses a first-fit allocation algorithm, then growth factor values such as a=2 can cause dynamic array expansion to run out of memory even though a significant amount of memory may still be available.[5] There have been various discussions on ideal growth factor values, including proposals for the Golden Ratio as well as the value 1.5.[6] Many textbooks, however, use a = 2 for simplicity and analysis purposes.[3][4]

Below are growth factors used by several popular implementations:

2.3.4 Performance

The dynamic array has performance similar to an array, with the addition of new operations to add and remove elements:

• Getting or setting the value at a particular index (constant time)
• Iterating over the elements in order (linear time, good cache performance)
• Inserting or deleting an element in the middle of the array (linear time)
• Inserting or deleting an element at the end of the array (constant amortized time)

Dynamic arrays benefit from many of the advantages of arrays, including good locality of reference and data cache utilization, compactness (low memory use), and random access. They usually have only a small fixed additional overhead for storing information about the size and capacity. This makes dynamic arrays an attractive tool for building cache-friendly data structures. However, in languages like Python or Java that enforce reference semantics, the dynamic array generally will not store the actual data, but rather it will store references to the data that resides in other areas of memory. In this case, accessing items in the array sequentially will actually involve accessing multiple non-contiguous areas of memory, so the many advantages of the cache-friendliness of this data structure are lost.

Compared to linked lists, dynamic arrays have faster indexing (constant time versus linear time) and typically faster iteration due to improved locality of reference; however, dynamic arrays require linear time to insert or delete at an arbitrary location, since all following elements must be moved, while linked lists can do this in constant time. This disadvantage is mitigated by the gap buffer and tiered vector variants discussed under Variants below. Also, in a highly fragmented memory region, it may be expensive or impossible to find contiguous space for a large dynamic array, whereas linked lists do not require the whole data structure to be stored contiguously.

A balanced tree can store a list while providing all operations of both dynamic arrays and linked lists reasonably efficiently, but both insertion at the end and iteration over the list are slower than for a dynamic array, in theory and in practice, due to non-contiguous storage and tree traversal/manipulation overhead.

2.3.5 Variants

Gap buffers are similar to dynamic arrays but allow efficient insertion and deletion operations clustered near the same arbitrary location. Some deque implementations use array deques, which allow amortized constant time insertion/removal at both ends, instead of just one end.

Goodrich[15] presented a dynamic array algorithm called Tiered Vectors that provided O(n^(1/2)) performance for order preserving insertions or deletions from the middle of the array.

Hashed Array Tree (HAT) is a dynamic array algorithm published by Sitarski in 1996.[16] Hashed Array Tree wastes order n^(1/2) amount of storage space, where n is the number of elements in the array. The algorithm has O(1) amortized performance when appending a series of objects to the end of a Hashed Array Tree.

In a 1999 paper,[14] Brodnik et al. describe a tiered dynamic array data structure, which wastes only n^(1/2) space for n elements at any point in time, and they prove a lower bound showing that any dynamic array must waste this much space if the operations are to remain amortized constant time. Additionally, they present a variant where growing and shrinking the buffer has not only amortized but worst-case constant time.

Bagwell (2002)[17] presented the VList algorithm, which can be adapted to implement a dynamic array.

2.3.6 Language support

C++'s std::vector is an implementation of dynamic arrays, as are the ArrayList[18] class supplied with the Java API and the .NET Framework.[19] The generic List<> class supplied with version 2.0 of the .NET Framework is also implemented with dynamic arrays. Smalltalk's OrderedCollection is a dynamic array with dynamic start and end-index, making the removal of the first element also O(1). Python's list datatype implementation is a dynamic array. Delphi and D implement dynamic arrays at the language’s core. Ada's Ada.Containers.Vectors generic package provides dynamic array implementation for a given subtype. Many scripting languages such as Perl and Ruby offer dynamic arrays as a built-in primitive data type. Several cross-platform frameworks provide dynamic array implementations for C, including CFArray and CFMutableArray in Core Foundation, and GArray and GPtrArray in GLib.

2.3.7 References

[1] See, for example, the source code of java.util.ArrayList class from OpenJDK 6.
[2] Lambert, Kenneth Alfred (2009), “Physical size and logical size”, Fundamentals of Python: From First Programs Through Data Structures, Cengage Learning, p. 510, ISBN 1423902181
[3] Goodrich, Michael T.; Tamassia, Roberto (2002), “1.5.2 Analyzing an Extendable Array Implementation”, Algorithm Design: Foundations, Analysis and Internet Examples, Wiley, pp. 39–41.
[4] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001) [1990]. “17.4 Dynamic tables”. Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 416–424. ISBN 0-262-03293-7.
[5] “C++ STL vector: definition, growth factor, member functions”. Retrieved 2015-08-05.
[6] “vector growth factor of 1.5”. comp.lang.c++.moderated. Google Groups.
[7] List object implementation from python.org, retrieved 2011-09-27.
[8] Brais, Hadi. “Dissecting the C++ STL Vector: Part 3 Capacity & Size”. Micromysteries. Retrieved 2015-08-05.
[9] “facebook/folly”. GitHub. Retrieved 2015-08-05.
[10] Chris Okasaki (1995). “Purely Functional Random-Access Lists”. Proceedings of the Seventh International Conference on Functional Programming Languages and Computer Architecture: 86–95. doi:10.1145/224164.224187.
[11] Gerald Kruse. CS 240 Lecture Notes: Linked Lists Plus: Complexity Trade-offs. Juniata College. Spring 2008.
[12] Day 1 Keynote - Bjarne Stroustrup: C++11 Style at GoingNative 2012 on channel9.msdn.com from minute 45 or foil 44
[13] Number crunching: Why you should never, ever, EVER use linked-list in your code again at kjellkod.wordpress.com
[14] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (1999), Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (PDF), Department of Computer Science, University of Waterloo
[15] Goodrich, Michael T.; Kloss II, John G. (1999), “Tiered Vectors: Efficient Dynamic Arrays for Rank-Based Sequences”, Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, 1663: 205–216, doi:10.1007/3-540-48447-7_21, ISBN 978-3-540-66279-2
[16] Sitarski, Edward (September 1996), “HATs: Hashed array trees”, Algorithm Alley, Dr. Dobb’s Journal, 21 (11)
[17] Bagwell, Phil (2002), Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays, EPFL
[18] Javadoc on ArrayList
[19] ArrayList Class

2.3.8 External links
• NIST Dictionary of Algorithms and Data Structures: Dynamic array
• VPOOL - C language implementation of dynamic array.
• CollectionSpy: a Java profiler with explicit support for debugging ArrayList- and Vector-related issues.
• Open Data Structures - Chapter 2 - Array-Based Lists

2.4 Linked list

In computer science, a linked list is a linear collection of data elements, called nodes, each pointing to the next node by means of a pointer. It is a data structure consisting of a group of nodes which together represent a sequence. Under the simplest form, each node is composed of data and a reference (in other words, a link) to the next node in the sequence. This structure allows for efficient insertion or removal of elements from any position in the sequence during iteration. More complex variants add additional links, allowing efficient insertion or removal from arbitrary element references.

Figure: A linked list whose nodes contain two fields: an integer value and a link to the next node. The last node is linked to a terminator used to signify the end of the list.

Linked lists are among the simplest and most common data structures. They can be used to implement several other common abstract data types, including lists (the abstract data type), stacks, queues, associative arrays, and S-expressions, though it is not uncommon to implement the other data structures directly without using a list as the basis of implementation.

The principal benefit of a linked list over a conventional array is that the list elements can easily be inserted or removed without reallocation or reorganization of the entire structure because the data items need not be stored contiguously in memory or on disk, while an array has to be declared in the source code, before compiling and running the program. Linked lists allow insertion and removal of nodes at any point in the list, and can do so with a constant number of operations if the link previous to the link being added or removed is maintained during list traversal.

On the other hand, simple linked lists by themselves do not allow random access to the data, or any form of efficient indexing. Thus, many basic operations, such as obtaining the last node of the list (assuming that the last node is not maintained as a separate node reference in the list structure), or finding a node that contains a given datum, or locating the place where a new node should be inserted, may require sequential scanning of most or all of the list elements. The advantages and disadvantages of using linked lists are given below.

2.4.1 Advantages

• Linked lists are a dynamic data structure, which can grow and be pruned, allocating and deallocating memory while the program is running.
• Insertion and deletion node operations are easily implemented in a linked list.
• Dynamic data structures such as stacks and queues can be implemented using a linked list.
• There is no need to define an initial size for a linked list.
• Items can be added or removed from the middle of the list.

2.4.2 Disadvantages

• They use more memory than arrays because of the storage used by their pointers.
• Nodes in a linked list must be read in order from the beginning as linked lists are inherently sequential access.
• Nodes are stored noncontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
• Difficulties arise in linked lists when it comes to reverse traversing. For instance, singly linked lists are cumbersome to navigate backwards[1] and while doubly linked lists are somewhat easier to read, memory is wasted in allocating space for a backpointer.

2.4.3 History

Linked lists were developed in 1955–1956 by Allen Newell, Cliff Shaw and Herbert A. Simon at RAND Corporation as the primary data structure for their Information Processing Language. IPL was used by the authors to develop several early artificial intelligence programs, including the Logic Theory Machine, the General Problem Solver, and a computer chess program. Reports on their work appeared in IRE Transactions on Information Theory in 1956, and several conference proceedings from 1957 to 1959, including Proceedings of the Western Joint Computer Conference in 1957 and 1958, and Information Processing (Proceedings of the first UNESCO International Conference on Information Processing) in 1959. The now-classic diagram consisting of blocks representing list nodes with arrows pointing to successive list nodes appears in “Programming the Logic Theory Machine” by Newell and Shaw in Proc. WJCC, February 1957. Newell and Simon were recognized with the ACM Turing Award in 1975 for having “made basic contributions to artificial intelligence, the psychology of human cognition, and list processing”. The problem of machine translation for natural language processing led Victor Yngve at Massachusetts Institute of Technology (MIT) to use linked lists as data structures in his COMIT programming language for computer research in the field of linguistics. A report on this language entitled “A programming language for mechanical translation” appeared in Mechanical Translation in 1958.

LISP, standing for list processor, was created by John McCarthy in 1958 while he was at MIT and in 1960 he published its design in a paper in the Communications of the ACM, entitled “Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I”. One of LISP’s major data structures is the linked list.

By the early 1960s, the utility of both linked lists and languages which use these structures as their primary data representation was well established. Bert Green of the MIT Lincoln Laboratory published a review article entitled “Computer languages for symbol manipulation” in IRE Transactions on Human Factors in Electronics in March 1961 which summarized the advantages of the linked list approach. A later review article, “A Comparison of list-processing computer languages” by Bobrow and Raphael, appeared in Communications of the ACM in April 1964.

Several operating systems developed by Technical Systems Consultants (originally of West Lafayette Indiana, and later of Chapel Hill, North Carolina) used singly linked lists as file structures. A directory entry pointed to the first sector of a file, and succeeding portions of the file were located by traversing pointers. Systems using this technique included Flex (for the Motorola 6800 CPU), mini-Flex (same CPU), and Flex9 (for the Motorola 6809 CPU). A variant developed by TSC for and marketed by Smoke Signal Broadcasting in California used doubly linked lists in the same manner.

The TSS/360 operating system, developed by IBM for the System 360/370 machines, used a doubly linked list for its file system catalog. The directory structure was similar to Unix, where a directory could contain files and other directories and extend to any depth.

2.4.4 Basic concepts and nomenclature

Each record of a linked list is often called an 'element' or 'node'.

The field of each node that contains the address of the next node is usually called the 'next link' or 'next pointer'. The remaining fields are known as the 'data', 'information', 'value', 'cargo', or 'payload' fields.

The 'head' of a list is its first node. The 'tail' of a list may refer either to the rest of the list after the head, or to the last node in the list. In Lisp and some derived languages, the next node may be called the 'cdr' (pronounced could-er) of the list, while the payload of the head node may be called the 'car'.

Singly linked list

Singly linked lists contain nodes which have a data field as well as a 'next' field, which points to the next node in the line of nodes. Operations that can be performed on singly linked lists include insertion, deletion and traversal.

Figure: A singly linked list whose nodes contain two fields: an integer value and a link to the next node.

Doubly linked list

Main article: Doubly linked list

In a 'doubly linked list', each node contains, besides the next-node link, a second link field pointing to the 'previous' node in the sequence. The two links may be called 'forward(s)' and 'backwards', or 'next' and 'prev' ('previous').

Figure: A doubly linked list whose nodes contain three fields: an integer value, the link forward to the next node, and the link backward to the previous node.

A technique known as XOR-linking allows a doubly linked list to be implemented using a single link field in each node. However, this technique requires the ability to do bit operations on addresses, and therefore may not be available in some high-level languages.

Many modern operating systems use doubly linked lists to maintain references to active processes, threads, and other dynamic objects.[2] A common strategy for rootkits to evade detection is to unlink themselves from these lists.[3]

Multiply linked list

In a 'multiply linked list', each node contains two or more link fields, each field being used to connect the same set of data records in a different order (e.g., by name, by department, by date of birth, etc.). While doubly linked lists can be seen as special cases of multiply linked list, the fact that the two orders are opposite to each other leads to simpler and more efficient algorithms, so they are usually treated as a separate case.

Circular linked list

In the last node of a list, the link field often contains a null reference, a special value used to indicate the lack of further nodes. A less common convention is to make it point to the first node of the list; in that case the list is said to be 'circular' or 'circularly linked'; otherwise it is said to be 'open' or 'linear'.

Figure: A circular linked list.

In the case of a circular doubly linked list, the only change that occurs is that the end, or “tail”, of the said list is linked back to the front, or “head”, of the list and vice versa.

Sentinel nodes

Main article: Sentinel node

In some implementations an extra 'sentinel' or 'dummy' node may be added before the first data record or after the last one. This convention simplifies and accelerates some list-handling algorithms, by ensuring that all links can be safely dereferenced and that every list (even one that contains no data elements) always has a “first” and “last” node.

Empty lists

An empty list is a list that contains no data records. This is usually the same as saying that it has zero nodes. If sentinel nodes are being used, the list is usually said to be empty when it has only sentinel nodes.

Hash linking

The link fields need not be physically part of the nodes. If the data records are stored in an array and referenced by their indices, the link field may be stored in a separate array with the same indices as the data records.

List handles

Since a reference to the first node gives access to the whole list, that reference is often called the 'address', 'pointer', or 'handle' of the list. Algorithms that manipulate linked lists usually get such handles to the input lists and return the handles to the resulting lists. In fact, in the context of such algorithms, the word “list” often means “list handle”. In some situations, however, it may be convenient to refer to a list by a handle that consists of two links, pointing to its first and last nodes.

Combining alternatives

The alternatives listed above may be arbitrarily combined in almost every way, so one may have circular doubly linked lists without sentinels, circular singly linked lists with sentinels, etc.

2.4.5 Tradeoffs

As with most choices in computer programming and design, no method is well suited to all circumstances. A linked list data structure might work well in one case, but cause problems in another. This is a list of some of the common tradeoffs involving linked list structures.

Linked lists vs. dynamic arrays

A dynamic array is a data structure that allocates all elements contiguously in memory, and keeps a count of the current number of elements. If the space reserved for the dynamic array is exceeded, it is reallocated and (possibly) copied, which is an expensive operation.

Linked lists have several advantages over dynamic arrays. Insertion or deletion of an element at a specific point of a list, assuming that we already hold a pointer to the node (before the one to be removed, or before the insertion point), is a constant-time operation (otherwise, without this reference, it is O(n)), whereas insertion in a dynamic array at random locations will require moving half of the elements on average, and all the elements in the worst case. While one can “delete” an element from an array in constant time by somehow marking its slot as “vacant”, this causes fragmentation that impedes the performance of iteration.

Moreover, arbitrarily many elements may be inserted into a linked list, limited only by the total memory available; while a dynamic array will eventually fill up its underlying array data structure and will have to reallocate, an expensive operation, one that may not even be possible if memory is fragmented, although the cost of reallocation can be averaged over insertions, and the cost of an insertion due to reallocation would still be amortized O(1). This helps with appending elements at the array’s end, but inserting into (or removing from) middle positions still carries prohibitive costs due to data moving to maintain contiguity. An array from which many elements are removed may also have to be resized in order to avoid wasting too much space.

On the other hand, dynamic arrays (as well as fixed-size array data structures) allow constant-time random access, while linked lists allow only sequential access to elements. Singly linked lists, in fact, can be easily traversed in only one direction. This makes linked lists unsuitable for applications where it’s useful to look up an element by its index quickly, such as heapsort. Sequential access on arrays and dynamic arrays is also faster than on linked lists on many machines, because they have optimal locality of reference and thus make good use of data caching.

Another disadvantage of linked lists is the extra storage needed for references, which often makes them impractical for lists of small data items such as characters or boolean values, because the storage overhead for the links may exceed by a factor of two or more the size of the data. In contrast, a dynamic array requires only the space for the data itself (and a very small amount of control data).[note 1] It can also be slow, and with a naïve allocator, wasteful, to allocate memory separately for each new element, a problem generally solved using memory pools.

Some hybrid solutions try to combine the advantages of the two representations. Unrolled linked lists store several elements in each list node, increasing cache performance while decreasing memory overhead for references. CDR coding does both these as well, by replacing references with the actual data referenced, which extends off the end of the referencing record.

A good example that highlights the pros and cons of using dynamic arrays vs. linked lists is a program that resolves the Josephus problem.
The Josephus problem is an election method that works by having a group of people stand in a circle. Starting at a predetermined person, you count around the circle n times. Once you reach the nth person, take them out of the circle and have the members close the circle. Then count around the circle the same n times and repeat the process, until only one person is left. That person wins the election. This shows the strengths and weaknesses of a linked list vs. a dynamic array: if you view the people as connected nodes in a circular linked list, the linked list can delete nodes easily (it only has to rearrange the links to the neighboring nodes). However, the linked list will be poor at finding the next person to remove and will need to search through the list until it finds that person. A dynamic array, on the other hand, will be poor at deleting nodes (or elements), as it cannot remove one element without individually shifting all the later elements up the list by one. However, it is exceptionally easy to find the nth person in the circle by directly referencing them by their position in the array.

The list ranking problem concerns the efficient conversion of a linked list representation into an array. Although trivial for a conventional computer, solving this problem by a parallel algorithm is complicated and has been the subject of much research.

A balanced tree has similar memory access patterns and space overhead to a linked list while permitting much more efficient indexing, taking O(log n) time instead of O(n) for a random access. However, insertion and deletion operations are more expensive due to the overhead of tree manipulations to maintain balance. Schemes exist for trees to automatically maintain themselves in a balanced state: AVL trees or red-black trees.

Singly linked linear lists vs. other lists

While doubly linked and circular lists have advantages over singly linked linear lists, linear lists offer some advantages that make them preferable in some situations.

A singly linked linear list is a recursive data structure, because it contains a pointer to a smaller object of the same type. For that reason, many operations on singly linked linear lists (such as merging two lists, or enumerating the elements in reverse order) often have very simple recursive algorithms, much simpler than any solution using iterative commands. While those recursive solutions can be adapted for doubly linked and circularly linked lists, the procedures generally need extra arguments and more complicated base cases.

Linear singly linked lists also allow tail-sharing, the use of a common final portion of a sub-list as the terminal portion of two different lists. In particular, if a new node is added at the beginning of a list, the former list remains available as the tail of the new one, a simple example of a persistent data structure. Again, this is not true with the other variants: a node may never belong to two different circular or doubly linked lists.

In particular, end-sentinel nodes can be shared among singly linked non-circular lists. The same end-sentinel node may be used for every such list. In Lisp, for example, every proper list ends with a link to a special node, denoted by nil or (), whose CAR and CDR links point to itself. Thus a Lisp procedure can safely take the CAR or CDR of any list.

The advantages of the fancy variants are often limited to the complexity of the algorithms, not their efficiency. A circular list, in particular, can usually be emulated by a linear list together with two variables that point to the first and last nodes, at no extra cost.

Doubly linked vs. singly linked

Doubly linked lists require more space per node (unless one uses XOR-linking), and their elementary operations are more expensive; but they are often easier to manipulate because they allow fast and easy sequential access to the list in both directions. In a doubly linked list, one can insert or delete a node in a constant number of operations given only that node's address. To do the same in a singly linked list, one must have the address of the pointer to that node, which is either the handle for the whole list (in case of the first node) or the link field in the previous node. Some algorithms require access in both directions. On the other hand, doubly linked lists do not allow tail-sharing and cannot be used as persistent data structures.

Circularly linked vs. linearly linked

A circularly linked list may be a natural option to represent arrays that are naturally circular, e.g. the corners of a polygon, a pool of buffers that are used and released in FIFO ("first in, first out") order, or a set of processes that should be time-shared in round-robin order. In these applications, a pointer to any node serves as a handle to the whole list.

With a circular list, a pointer to the last node gives easy access also to the first node, by following one link. Thus, in applications that require access to both ends of the list (e.g., in the implementation of a queue), a circular structure allows one to handle the structure by a single pointer, instead of two.

A circular list can be split into two circular lists, in constant time, by giving the addresses of the last node of each piece. The operation consists in swapping the contents of the link fields of those two nodes. Applying the same operation to any two nodes in two distinct lists joins the two lists into one. This property greatly simplifies some algorithms and data structures, such as the quad-edge and face-edge.

The simplest representation for an empty circular list (when such a thing makes sense) is a null pointer, indicating that the list has no nodes. With this choice, many algorithms have to test for this special case, and handle it separately.
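The Josephus comparison described earlier in this section, and the empty-list special case for circular lists, can be made concrete with a minimal Python sketch. The names `Node`, `append`, `josephus_linked` and `josephus_array` are illustrative assumptions rather than anything defined in the text, and the sketch assumes at least one person, numbered from 1.

```python
from typing import List, Optional


class Node:
    """One person in the circle; `next` links to the following person."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.next: Optional["Node"] = None


def append(last: Optional[Node], node: Node) -> Node:
    """Append `node` to a circular list given its last node; return the new last node.

    The empty list is represented by None, so it needs its own branch.
    """
    if last is None:            # special case: empty circular list
        node.next = node
    else:
        node.next = last.next   # the new last node points at the first node
        last.next = node
    return node


def josephus_linked(people: int, n: int) -> int:
    """Return the survivor using a circular singly linked list (people >= 1)."""
    last: Optional[Node] = None
    for value in range(1, people + 1):
        last = append(last, Node(value))
    prev = last                       # node just before the counting start
    while prev.next is not prev:      # more than one person remains
        for _ in range(n - 1):        # finding the nth person costs a walk
            prev = prev.next
        prev.next = prev.next.next    # unlinking that person is constant time
    return prev.value


def josephus_array(people: int, n: int) -> int:
    """Return the survivor using a dynamic array (people >= 1)."""
    circle: List[int] = list(range(1, people + 1))
    index = 0
    while len(circle) > 1:
        index = (index + n - 1) % len(circle)  # direct lookup by position
        circle.pop(index)                      # later elements are shifted down
    return circle[0]
```

Both functions return the same survivor; they differ only in where the work goes. The linked version pays for each count with a walk along the links and removes in constant time, while the array version finds the position directly and pays for the removal by shifting elements.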
For a linear list, the use of null to denote the empty list is more natural and often creates fewer special cases.

Using sentinel nodes

Sentinel nodes may simplify certain list operations, by ensuring that the next or previous nodes exist for every element, and that even empty lists have at least one node. One may also use a sentinel node at the end of the list, with an appropriate data field, to eliminate some end-of-list tests. For example, when scanning the list looking for a node with a given value x, setting the sentinel's data field to x makes it unnecessary to test for end-of-list inside the loop. Another example is the merging of two sorted lists: if their sentinels have data fields set to +∞, the choice of the next output node does not need special handling for empty lists.

However, sentinel nodes use up extra space (especially in applications that use many short lists), and they may complicate other operations (such as the creation of a new empty list).

However, if the circular list is used merely to simulate a linear list, one may avoid some of this complexity by adding a single sentinel node to every list, between the last and the first data nodes. With this convention, an empty list consists of the sentinel node alone, pointing to itself via the next-node link. The list handle should then be a pointer to the last data node, before the sentinel, if the list is not empty; or to the sentinel itself, if the list is empty.

The same trick can be used to simplify the handling of a doubly linked linear list, by turning it into a circular doubly linked list with a single sentinel node. However, in this case, the handle should be a single pointer to the dummy node itself.[9]

2.4.6 Linked list operations

When manipulating linked lists in-place, care must be taken to not use values that you have invalidated in previous assignments. This makes algorithms for inserting or deleting linked list nodes somewhat subtle. This section gives pseudocode for adding or removing nodes from singly, doubly, and circularly linked lists in-place. Throughout we will use null to refer to an end-of-list marker or sentinel, which may be implemented in a number of ways.

Linearly linked lists

Singly linked lists

Our node data structure will have two fields. We also keep a variable firstNode which always points to the first node in the list, or is null for an empty list.

    record Node {
        data;      // The data being stored in the node
        Node next  // A reference to the next node, null for last node
    }
    record List {
        Node firstNode  // points to first node of list; null for empty list
    }

Traversal of a singly linked list is simple, beginning at the first node and following each next link until we come to the end:

    node := list.firstNode
    while node not null
        (do something with node.data)
        node := node.next

The following code inserts a node after an existing node in a singly linked list. The diagram shows how it works. Inserting a node before an existing one cannot be done directly; instead, one must keep track of the previous node and insert a node after it.

[Diagram: newNode (37) is inserted between node (12) and node.next (99).]

    function insertAfter(Node node, Node newNode) // insert newNode after node
        newNode.next := node.next
        node.next := newNode

Inserting at the beginning of the list requires a separate function. This requires updating firstNode.

    function insertBeginning(List list, Node newNode) // insert node before current first node
        newNode.next := list.firstNode
        list.firstNode := newNode

Similarly, we have functions for removing the node after a given node, and for removing a node from the beginning of the list. The diagram demonstrates the former. To find and remove a particular node, one must again keep track of the previous element.

[Diagram: node (12), node.next (99), node.next.next (37); node.next is removed.]

    function removeAfter(Node node) // remove node past this one
        obsoleteNode := node.next
        node.next := node.next.next
        destroy obsoleteNode

    function removeBeginning(List list) // remove first node
        obsoleteNode := list.firstNode
        list.firstNode := list.firstNode.next // point past deleted node
        destroy obsoleteNode

Notice that removeBeginning() sets list.firstNode to null when removing the last node in the list.

Since we can't iterate backwards, efficient insertBefore or removeBefore operations are not possible. Inserting into a list before a specific node requires traversing the list, which would have a worst case running time of O(n).

Appending one linked list to another can be inefficient unless a reference to the tail is kept as part of the List structure, because we must traverse the entire first list in order to find the tail, and then append the second list to this. Thus, if two linearly linked lists are each of length n, list appending has asymptotic time complexity of O(n). In the Lisp family of languages, list appending is provided by the append procedure.

Many of the special cases of linked list operations can be eliminated by including a dummy element at the front of the list. This ensures that there are no special cases for the beginning of the list and renders both insertBeginning() and removeBeginning() unnecessary. In this case, the first useful data in the list will be found at list.firstNode.next.

Circularly linked list

In a circularly linked list, all nodes are linked in a continuous circle, without using null. For lists with a front and a back (such as a queue), one stores a reference to the last node in the list. The next node after the last node is the first node. Elements can be added to the back of the list and removed from the front in constant time.

Circularly linked lists can be either singly or doubly linked.

Both types of circularly linked lists benefit from the ability to traverse the full list beginning at any given node. This often allows us to avoid storing firstNode and lastNode, although if the list may be empty we need a special representation for the empty list, such as a lastNode variable which points to some node in the list or is null if it's empty; we use such a lastNode here. This representation significantly simplifies adding and removing nodes with a non-empty list, but empty lists are then a special case.

Algorithms

Assuming that someNode is some node in a non-empty circular singly linked list, this code iterates through that list starting with someNode:

    function iterate(someNode)
        if someNode ≠ null
            node := someNode
            do
                do something with node.value
                node := node.next
            while node ≠ someNode

Notice that the test "while node ≠ someNode" must be at the end of the loop. If the test was moved to the beginning of the loop, the procedure would fail whenever the list had only one node.

This function inserts a node "newNode" into a circular linked list after a given node "node". If "node" is null, it assumes that the list is empty.

    function insertAfter(Node node, Node newNode)
        if node = null
            newNode.next := newNode
        else
            newNode.next := node.next
            node.next := newNode

Suppose that "L" is a variable pointing to the last node of a circular linked list (or null if the list is empty). To append "newNode" to the end of the list, one may do

    insertAfter(L, newNode)
    L := newNode

To insert "newNode" at the beginning of the list, one may do

    insertAfter(L, newNode)
    if L = null
        L := newNode

2.4.7 Linked lists using arrays of nodes

Languages that do not support any type of reference can still create links by replacing pointers with array indices. The approach is to keep an array of records, where each record has integer fields indicating the index of the next (and possibly previous) node in the array. Not all nodes in the array need be used. If records are also not supported, parallel arrays can often be used instead.

As an example, consider the following linked list record that uses arrays instead of pointers:

    record Entry {
        integer next; // index of next entry in array
        integer prev; // previous entry (if double-linked)
        string name;
        real balance;
    }

A linked list can be built by creating an array of these structures, and an integer variable to store the index of the first element.

    integer listHead
    Entry Records[1000]

Links between elements are formed by placing the array index of the next (or previous) cell into the next or prev field within a given element. For example:

In the above example, listHead would be set to 2, the location of the first entry in the list. Notice that entries 3 and 5 through 7 are not part of the list. These cells are available for any additions to the list. By creating a listFree integer variable, a free list could be created to keep track of what cells are available. If all entries are in use, the size of the array would have to be increased or some elements would have to be deleted before new entries could be stored in the list.

The following code would traverse the list and display names and account balances:

    i := listHead
    while i ≥ 0 // loop through the list
        print i, Records[i].name, Records[i].balance // print entry
        i := Records[i].next

When faced with a choice, the advantages of this approach include:

• The linked list is relocatable, meaning it can be moved about in memory at will, and it can also be quickly and directly serialized for storage on disk or transfer over a network.
• Especially for a small list, array indexes can occupy significantly less space than a full pointer on many architectures.
• Locality of reference can be improved by keeping the nodes together in memory and by periodically rearranging them, although this can also be done in a general store.
• Naïve dynamic memory allocators can produce an excessive amount of overhead storage for each node allocated; almost no allocation overhead is incurred per node in this approach.
• Seizing an entry from a pre-allocated array is faster than using dynamic memory allocation for each node, since dynamic memory allocation typically requires a search for a free memory block of the desired size.

This approach has one main disadvantage, however: it creates and manages a private memory space for its nodes. This leads to the following issues:

• It increases complexity of the implementation.
• Growing a large array when it is full may be difficult or impossible, whereas finding space for a new linked list node in a large, general memory pool may be easier.
• Adding elements to a dynamic array will occasionally (when it is full) unexpectedly take linear (O(n)) instead of constant time (although it's still an amortized constant).
• Using a general memory pool leaves more memory for other data if the list is smaller than expected or if many nodes are freed.

For these reasons, this approach is mainly used for languages that do not support dynamic memory allocation. These disadvantages are also mitigated if the maximum size of the list is known at the time the array is created.

2.4.8 Language support

Many programming languages such as Lisp and Scheme have singly linked lists built in. In many functional languages, these lists are constructed from nodes, each called a cons or cons cell. The cons has two fields: the car, a reference to the data for that node, and the cdr, a reference to the next node. Although cons cells can be used to build other data structures, this is their primary purpose.

In languages that support abstract data types or templates, linked list ADTs or templates are available for building linked lists. In other languages, linked lists are typically built using references together with records.

2.4.9 Internal and external storage

When constructing a linked list, one is faced with the choice of whether to store the data of the list directly in the linked list nodes, called internal storage, or merely to store a reference to the data, called external storage. Internal storage has the advantage of making access to the data more efficient, requiring less storage overall, having better locality of reference, and simplifying memory management for the list (its data is allocated and deallocated at the same time as the list nodes).

External storage, on the other hand, has the advantage of being more generic, in that the same data structure and machine code can be used for a linked list no matter what the size of the data is. It also makes it easy to place the same data in multiple linked lists. Although with internal storage the same data can be placed in multiple lists by including multiple next references in the node data structure, it would then be necessary to create separate routines to add or delete cells based on each field. It is possible to create additional linked lists of elements that use internal storage by using external storage, and having the cells of the additional linked lists store references to the nodes of the linked list containing the data.

In general, if a set of data structures needs to be included in several linked lists, external storage is the best approach. If a set of data structures needs to be included in only one linked list, then internal storage is slightly better, unless a generic linked list package using external storage is available. Likewise, if different sets of data that can be stored in the same data structure are to be included in a single linked list, then internal storage would be fine.

Another approach that can be used with some languages involves having different data structures that all share the same initial fields, including the next (and prev, if the list is doubly linked) references, in the same location. After defining separate structures for each type of data, a generic structure can be defined that contains the minimum amount of data shared by all the other structures and contained at the top (beginning) of the structures. Then generic routines can be created that use the minimal structure to perform linked list type operations, but separate routines can then handle the specific data. This approach is often used in message parsing routines, where several types of messages are received, but all start with the same set of fields, usually including a field for message type. The generic routines are used to add new messages to a queue when they are received, and remove them from the queue in order to process the message. The message type field is then used to call the correct routine to process the specific type of message.

Example of internal and external storage

Suppose you wanted to create a linked list of families and their members. Using internal storage, the structure might look like the following:

    record member { // member of a family
        member next;
        string firstName;
        integer age;
    }
    record family { // the family itself
        family next;
        string lastName;
        string address;
        member members // head of list of members of this family
    }

To print a complete list of families and their members using internal storage, we could write:

    aFamily := Families // start at head of families list
    while aFamily ≠ null // loop through list of families
        print information about family
        aMember := aFamily.members // get head of list of this family's members
        while aMember ≠ null // loop through list of members
            print information about member
            aMember := aMember.next
        aFamily := aFamily.next

Using external storage, we would create the following structures:

    record node { // generic link structure
        node next;
        pointer data // generic pointer for data at node
    }
    record member { // structure for family member
        string firstName;
        integer age
    }
    record family { // structure for family
        string lastName;
        string address;
        node members // head of list of members of this family
    }

To print a complete list of families and their members using external storage, we could write:

    famNode := Families // start at head of families list
    while famNode ≠ null // loop through list of families
        aFamily := (family) famNode.data // extract family from node
        print information about family
        memNode := aFamily.members // get list of family members
        while memNode ≠ null // loop through list of members
            aMember := (member) memNode.data // extract member from node
            print information about member
            memNode := memNode.next
        famNode := famNode.next

Notice that when using external storage, an extra step is needed to extract the record from the node and cast it into the proper data type. This is because both the list of families and the list of members within the family are stored in two linked lists using the same data structure (node), and this language does not have parametric types.

As long as the number of families that a member can belong to is known at compile time, internal storage works fine. If, however, a member needed to be included in an arbitrary number of families, with the specific number known only at run time, external storage would be necessary.

Speeding up search

Finding a specific element in a linked list, even if it is sorted, normally requires O(n) time (linear search). This is one of the primary disadvantages of linked lists over other data structures. In addition to the variants discussed above, below are two simple ways to improve search time.

In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches, ensures that the most recently used items are also the quickest to find again.

Another common approach is to "index" a linked list using a more efficient external data structure. For example, one can build a red-black tree or hash table whose elements are references to the linked list nodes. Multiple such indexes can be built on a single list. The disadvantage is that these indexes may need to be updated each time a node is added or removed (or at least, before that index is used again).

Random access lists

A random access list is a list with support for fast random access to read or modify any element in the list.[10] One possible implementation is a skew binary random access list using the skew binary number system, which involves a list of trees with special properties; this allows worst-case constant time head/cons operations, and worst-case logarithmic time random access to an element by index.[10] Random access lists can be implemented as persistent data structures.[10]

Random access lists can be viewed as immutable linked lists in that they likewise support the same O(1) head and tail operations.[10]

A simple extension to random access lists is the min-list, which provides an additional operation that yields the minimum element in the entire list in constant time (without mutation complexities).[10]

2.4.10 Related data structures

Both stacks and queues are often implemented using linked lists, and simply restrict the type of operations which are supported.

The skip list is a linked list augmented with layers of pointers for quickly jumping over large numbers of elements, and then descending to the next layer. This process continues down to the bottom layer, which is the actual list.

A binary tree can be seen as a type of linked list where the elements are themselves linked lists of the same nature. The result is that each node may include a reference to the first node of one or two other linked lists, which, together with their contents, form the subtrees below that node.

An unrolled linked list is a linked list in which each node contains an array of data values. This leads to improved cache performance, since more list elements are contiguous in memory, and reduced memory overhead, because less metadata needs to be stored for each element of the list.

A hash table may use linked lists to store the chains of items that hash to the same position in the hash table.

A heap shares some of the ordering properties of a linked list, but is almost always implemented using an array. Instead of references from node to node, the next and previous data indexes are calculated using the current data's index.

A self-organizing list rearranges its nodes based on some heuristic which reduces search times for data retrieval by keeping commonly accessed nodes at the head of the list.

2.4.11 Notes

[1] The amount of control data required for a dynamic array is usually of the form K + B * n, where K is a per-array constant, B is a per-dimension constant, and n is the number of dimensions. K and B are typically on the order of 10 bytes.

2.4.12 Footnotes

[1] Skiena, Steven S. (2009). The Algorithm Design Manual (2nd ed.). Springer. p. 76. ISBN 9781848000704. "We can do nothing without this list predecessor, and so must spend linear time searching for it on a singly-linked list."
[2] http://www.osronline.com/article.cfm?article=499
[3] http://www.cs.dartmouth.edu/~sergey/me/cs/cs108/rootkits/bh-us-04-butler.pdf
[4] Okasaki, Chris (1995). "Purely Functional Random-Access Lists". Proceedings of the Seventh International Conference on Functional Programming Languages and Computer Architecture: 86–95. doi:10.1145/224164.224187.
[5] Kruse, Gerald. CS 240 Lecture Notes: Linked Lists Plus: Complexity Trade-offs. Juniata College. Spring 2008.
[6] Day 1 Keynote - Bjarne Stroustrup: C++11 Style at GoingNative 2012 on channel9.msdn.com, from minute 45 or foil 44.
[7] Number crunching: Why you should never, ever, EVER use linked-list in your code again, at kjellkod.wordpress.com.
[8] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (1999). Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (PDF). Department of Computer Science, University of Waterloo.
[9] Ford, William; Topp, William (2002). Data Structures with C++ using STL (Second ed.). Prentice-Hall. pp. 466–467. ISBN 0-13-085850-1.
[10] Okasaki, Chris (1995). Purely Functional Random-Access Lists (PS). In Functional Programming Languages and Computer Architecture. ACM Press. pp. 86–95. Retrieved May 7, 2015.

2.4.14 External links

• Description from the Dictionary of Algorithms and Data Structures
• Introduction to Linked Lists, Stanford University Computer Science Library
• Linked List Problems, Stanford University Computer Science Library
• Open Data Structures - Chapter 3 - Linked Lists
• Patent for the idea of having nodes which are in several linked lists simultaneously (note that this technique was widely used for many decades before the patent was granted)

2.5 Doubly linked list

In computer science, a doubly linked list is a linked data structure that consists of a set of sequentially linked records called nodes. Each node contains two fields, called links, that are references to the previous and to the next node in the sequence of nodes. The beginning and ending nodes' previous and next links, respectively, point to some kind of terminator, typically a sentinel node or null, to facilitate traversal of the list. If there is only one sentinel node, then the list is circularly linked via the sentinel node. It can be conceptualized as two singly linked lists formed from the same data items, but in opposite sequential orders.

[Figure: a doubly linked list whose nodes (12, 99, 37) contain three fields: an integer value, the link to the next node, and the link to the previous node.]

The two node links allow traversal of the list in either direction. While adding or removing a node in a doubly linked list requires changing more links than the same operations on a singly linked list, the operations are simpler and potentially more efficient (for nodes other than first nodes) because there is no need to keep track of the previous node during traversal, and no need to traverse the list to find the previous node so that its link can be modified.

The concept is also the basis for the mnemonic link system memorization technique.

2.5.1 Nomenclature and implementation

The first and last nodes of a doubly linked list are immediately accessible (i.e., accessible without traversal, and usually called head and tail) and therefore allow traversal of the list from the beginning or end of the list, respectively: e.g., traversing the list from beginning to end, or from end to beginning, in a search of the list for a node with a specific data value. Any node of a doubly linked list, once obtained, can be used to begin a new traversal of the list, in either direction (towards beginning or end), from the given node.

The link fields of a doubly linked list node are often called next and previous or forward and backward. The references stored in the link fields are usually implemented as pointers, but (as in any linked data structure) they may also be address offsets or indices into an array where the nodes live.

2.5.2 Basic algorithms

Consider the following basic algorithms, written in pseudocode:

Open doubly linked lists

    record DoublyLinkedNode {
        prev // A reference to the previous node
        next // A reference to the next node
        data // Data or a reference to data
    }
    record DoublyLinkedList {
        DoublyLinkedNode firstNode // points to first node of list
        DoublyLinkedNode lastNode  // points to last node of list
    }

Traversing the list

Traversal of a doubly linked list can be in either direction. In fact, the direction of traversal can change many times, if desired. Traversal is often called iteration, but that choice of terminology is unfortunate, for iteration has well-defined semantics (e.g., in mathematics) which are not analogous to traversal.

Forwards

    node := list.firstNode
    while node ≠ null
        <do something with node.data>
        node := node.next

Backwards

    node := list.lastNode
    while node ≠ null
        <do something with node.data>
        node := node.prev

Inserting a node

These symmetric functions insert a node either after or before a given node:

    function insertAfter(List list, Node node, Node newNode)
        newNode.prev := node
        newNode.next := node.next
        if node.next == null
            list.lastNode := newNode
        else
            node.next.prev := newNode
        node.next := newNode

    function insertBefore(List list, Node node, Node newNode)
        newNode.prev := node.prev
        newNode.next := node
        if node.prev == null
            list.firstNode := newNode
        else
            node.prev.next := newNode
        node.prev := newNode

We also need a function to insert a node at the beginning of a possibly empty list:

    function insertBeginning(List list, Node newNode)
        if list.firstNode == null
            list.firstNode := newNode
            list.lastNode := newNode
            newNode.prev := null
            newNode.next := null
        else
            insertBefore(list, list.firstNode, newNode)

A symmetric function inserts at the end:

    function insertEnd(List list, Node newNode)
        if list.lastNode == null
            insertBeginning(list, newNode)
        else
            insertAfter(list, list.lastNode, newNode)

Removing a node

Removal of a node is easier than insertion, but requires special handling if the node to be removed is the firstNode or lastNode:

    function remove(List list, Node node)
        if node.prev == null
            list.firstNode := node.next
        else
            node.prev.next := node.next
        if node.next == null
            list.lastNode := node.prev
        else
            node.next.prev := node.prev

One subtle consequence of the above procedure is that deleting the last node of a list sets both firstNode and lastNode to null, and so it handles removing the last node from a one-element list correctly. Notice that we also don't need separate "removeBefore" or "removeAfter" methods, because in a doubly linked list we can just use "remove(list, node.prev)" or "remove(list, node.next)" where these are valid. This also assumes that the node being removed is guaranteed to exist. If the node does not exist in this list, then some error handling would be required.

… of the loop. This is important for the case where the list contains only the single node someNode.

Inserting a node

This simple function inserts a node into a doubly linked circularly linked list after a given element:

    function insertAfter(Node node, Node newNode)
        newNode.next := node.next
        newNode.prev := node
        node.next.prev := newNode
        node.next := newNode

To do an "insertBefore", we can simply "insertAfter(node.prev, newNode)".

Inserting an element in a possibly empty list requires a special function:

    function insertEnd(List list, Node node)
        if list.lastNode == null
            node.prev := node
            node.next := node
        else
            insertAfter(list.lastNode, node)
        list.lastNode := node

To insert at the beginning we simply "insertAfter(list.lastNode, node)".

Finally, removing a node must deal with the case where the list empties:

    function remove(List list, Node node)
        if node.next == node
            list.lastNode := null
        else
            node.next.prev := node.prev
            node.prev.next := node.next
            if node == list.lastNode
                list.lastNode := node.prev
        destroy node

Deleting a node

As in doubly linked lists, "removeAfter" and "removeBefore" can be implemented with "remove(list, node.next)" and "remove(list, node.prev)".
Circular doubly linked lists Traversing the list Assuming that someNode is some node in a non-empty list, this code traverses through that list starting with someNode (any node will do): Forwards node := someNode do do something with node.value node := node.next while node ≠ someNode Backwards node := someNode do do something with node.value node := node.prev while node ≠ someNode //NODEPA Notice the postponing of the test to the end Double linked list implementation The following program illustrates implementation of double linked list functionality in C programming language. /* Description: Double linked list header file License: GNU GPL v3 */ #ifndef DOUBLELINKEDLIST_H #define DOUBLELINKEDLIST_H /* Codes for various errors */ #define NOERROR 0x0 #define MEMALLOCERROR 0x01 #define LISTEMPTY 0x03 #define NODENOTFOUND 0x4 /* True or false */ #define TRUE 0x1 #define FALSE 0x0 /* Double linked DoubleLinkedList definition */ typedef struct DoubleLinkedList { int number; struct DoubleLinkedList* pPrevious; struct DoubleLinkedList* pNext; }DoubleLinkedList; /* Get data for each node */ extern DoubleLinkedList* GetNodeData(DoubleLinkedList* pNode); /* Add a new node forward */ extern void AddNodeForward(void); /* Add a new node in the reverse direction */ extern void AddNodeReverse(void); /* Display nodes in forward direction */ extern void DisplayNodeForward(void); /*Display nodes in reverse direction */ extern void DisplayNodeReverse(void); /* Delete nodes in the DoubleLinkedList by searching for a node */ extern void DeleteNode(const int number); 40 CHAPTER 2. 
    /* Function to detect cycle in a DoubleLinkedList */
    extern unsigned int DetectCycleinList(void);
    /* Function to reverse nodes */
    extern void ReverseNodes(void);
    /* Function to display diagnostic error messages */
    void ErrorMessage(int Error);
    /* Sort nodes (declared only; this listing gives no definition) */
    extern void SortNodes(void);
    #endif

    /* Double linked List functions */
    /*****************************************************
    Name: DoubledLinked.c
    version: 0.1
    Description: Implementation of a DoubleLinkedList.
    These functions provide functionality of a double linked List.
    Change history: 0.1 Initial version
    License: GNU GPL v3
    ******************************************************/
    #include "DoubleLinkedList.h"
    #include <stdlib.h>
    #include <stdio.h>

    /* Declare pHead */
    DoubleLinkedList* pHead = NULL;
    /* Variable for storing error status */
    unsigned int Error = NOERROR;

    /* Read a number from standard input into the node */
    DoubleLinkedList* GetNodeData(DoubleLinkedList* pNode)
    {
        if (!pNode) {
            Error = MEMALLOCERROR;
            return NULL;
        }
        printf("\nEnter a number: ");
        scanf("%d", &pNode->number);
        return pNode;
    }

    /* Add a node forward (append at the end of the list) */
    void AddNodeForward(void)
    {
        DoubleLinkedList* pNode = malloc(sizeof(DoubleLinkedList));
        pNode = GetNodeData(pNode);
        if (pNode) {
            DoubleLinkedList* pCurrent = pHead;
            if (pHead == NULL) {
                pNode->pNext = NULL;
                pNode->pPrevious = NULL;
                pHead = pNode;
            } else {
                while (pCurrent->pNext != NULL) {
                    pCurrent = pCurrent->pNext;
                }
                pCurrent->pNext = pNode;
                pNode->pNext = NULL;
                pNode->pPrevious = pCurrent;
            }
        } else {
            Error = MEMALLOCERROR;
        }
    }

    /* Function to add nodes in reverse direction (insert at the head).
       Arguments: none. Returns: nothing */
    void AddNodeReverse(void)
    {
        DoubleLinkedList* pNode = malloc(sizeof(DoubleLinkedList));
        pNode = GetNodeData(pNode);
        if (pNode) {
            DoubleLinkedList* pCurrent = pHead;
            if (pHead == NULL) {
                pNode->pPrevious = NULL;
                pNode->pNext = NULL;
                pHead = pNode;
            } else {
                while (pCurrent->pPrevious != NULL) {
                    pCurrent = pCurrent->pPrevious;
                }
                pNode->pPrevious = NULL;
                pNode->pNext = pCurrent;
                pCurrent->pPrevious = pNode;
                pHead = pNode;
            }
        } else {
            Error = MEMALLOCERROR;
        }
    }

    /* Display Double linked list data in forward direction */
    void DisplayNodeForward(void)
    {
        DoubleLinkedList* pCurrent = pHead;
        if (pCurrent) {
            while (pCurrent != NULL) {
                printf("\nNumber in forward direction is %d ", pCurrent->number);
                pCurrent = pCurrent->pNext;
            }
        } else {
            Error = LISTEMPTY;
            ErrorMessage(Error);
        }
    }

    /* Display Double linked list data in Reverse direction */
    void DisplayNodeReverse(void)
    {
        DoubleLinkedList* pCurrent = pHead;
        if (pCurrent) {
            /* Walk to the last node, then follow pPrevious back to the head */
            while (pCurrent->pNext != NULL) {
                pCurrent = pCurrent->pNext;
            }
            while (pCurrent) {
                printf("\nNumber in Reverse direction is %d ", pCurrent->number);
                pCurrent = pCurrent->pPrevious;
            }
        } else {
            Error = LISTEMPTY;
            ErrorMessage(Error);
        }
    }

    /* Delete the first node whose number equals SearchNumber */
    void DeleteNode(const int SearchNumber)
    {
        DoubleLinkedList* pCurrent = pHead;

        if (pCurrent == NULL) {
            Error = LISTEMPTY;
            ErrorMessage(Error);
            return;
        }
        /* Search from the head for the requested number */
        while (pCurrent != NULL && pCurrent->number != SearchNumber) {
            pCurrent = pCurrent->pNext;
        }
        if (pCurrent == NULL) {
            Error = NODENOTFOUND;
            ErrorMessage(Error);
            return;
        }
        /* Unlink the node; the head pointer moves if the head is removed */
        if (pCurrent->pPrevious != NULL) {
            pCurrent->pPrevious->pNext = pCurrent->pNext;
        } else {
            pHead = pCurrent->pNext;
        }
        if (pCurrent->pNext != NULL) {
            pCurrent->pNext->pPrevious = pCurrent->pPrevious;
        }
        free(pCurrent);
    }

    /* Function to detect cycle in double linked List */
    unsigned int DetectCycleinList(void)
    {
        DoubleLinkedList* pCurrent = pHead;
        DoubleLinkedList* pFast = pHead;
        unsigned int cycle = FALSE;

        if (pHead == NULL) {
            Error = LISTEMPTY;
            ErrorMessage(Error);
            return FALSE;
        }
        /* pFast advances two nodes per step, pCurrent one; they meet only in a cycle */
        while (pFast != NULL && pFast->pNext != NULL) {
            pCurrent = pCurrent->pNext;
            pFast = pFast->pNext->pNext;
            if (pFast == pCurrent) {
                cycle = TRUE;
                break;
            }
        }
        if (cycle) {
            printf("\nDouble Linked list is cyclic");
        }
        return cycle;
    }

    /* Function to reverse nodes in a double linked list */
    void ReverseNodes(void)
    {
        DoubleLinkedList *pCurrent = pHead, *pNextNode = NULL;
        if (pCurrent) {
            pHead = NULL;
            while (pCurrent != NULL) {
                pNextNode = pCurrent->pNext;
                pCurrent->pNext = pHead;
                pCurrent->pPrevious = pNextNode;
                pHead = pCurrent;
                pCurrent = pNextNode;
            }
        } else {
            Error = LISTEMPTY;
            ErrorMessage(Error);
        }
    }

    /* Function to display diagnostic errors */
    void ErrorMessage(int Error)
    {
        switch (Error) {
        case LISTEMPTY:
            printf("\nError: Double linked list is empty!");
            break;
        case MEMALLOCERROR:
            printf("\nMemory allocation error ");
            break;
        case NODENOTFOUND:
            printf("\nThe searched node is not found ");
            break;
        default:
            printf("\nError code missing\n");
            break;
        }
    }

    /* main.h header file */
    #ifndef MAIN_H
    #define MAIN_H
    #include "DoubleLinkedList.h"
    /* Error code */
    extern unsigned int Error;
    #endif

    /***************************************************
    Name: main.c
    version: 0.1
    Description: Implementation of a double linked list
    Change history: 0.1 Initial version
    License: GNU GPL v3
    ****************************************************/
    #include <stdio.h>
    #include <stdlib.h>
    #include "main.h"

    int main(void)
    {
        int choice = 0;
        int InputNumber = 0;
        printf("\nThis program creates a double linked list");
        printf("\nYou can add nodes in forward and reverse directions");
        do {
            printf("\n1.Create Node Forward");
            printf("\n2.Create Node Reverse");
            printf("\n3.Delete Node");
            printf("\n4.Display Nodes in forward direction");
            printf("\n5.Display Nodes in reverse direction");
            printf("\n6.Reverse nodes");
            printf("\n7.Exit\n");
            printf("\nEnter your choice: ");
            scanf("%d", &choice);
            switch (choice) {
            case 1: AddNodeForward(); break;
            case 2: AddNodeReverse(); break;
            case 3:
                printf("\nEnter the node you want to delete: ");
                scanf("%d", &InputNumber);
                DeleteNode(InputNumber);
                break;
            case 4:
                printf("\nDisplaying node data in forward direction \n");
                DisplayNodeForward();
                break;
            case 5:
                printf("\nDisplaying node data in reverse direction\n");
                DisplayNodeReverse();
                break;
            case 6: ReverseNodes(); break;
            case 7: printf("Exiting program"); break;
            default: printf("\nIncorrect choice\n");
            }
        } while (choice != 7);
        return 0;
    }

2.5.3 Advanced concepts

Asymmetric doubly linked list

An asymmetric doubly linked list is somewhere between the singly linked list and the regular doubly linked list. It shares some features with the singly linked list (single-direction traversal) and others with the doubly linked list (ease of modification).

It is a list where each node’s previous link points not to the previous node, but to the link to itself. While this makes little difference between nodes (it just points to an offset within the previous node), it changes the head of the list: it allows the first node to modify the firstNode link easily.[1][2]

As long as a node is in a list, its previous link is never null.

Inserting a node

To insert a node before another, we change the link that pointed to the old node, using the prev link; then set the new node’s next link to point to the old node, and change that node’s prev link accordingly.

    function insertBefore(Node node, Node newNode)
        if node.prev == null
            error “The node is not in a list”
        newNode.prev := node.prev
        atAddress(newNode.prev) := newNode
        newNode.next := node
        node.prev = addressOf(newNode.next)

    function insertAfter(Node node, Node newNode)
        newNode.next := node.next
        if newNode.next != null
            newNode.next.prev = addressOf(newNode.next)
        node.next := newNode
        newNode.prev := addressOf(node.next)

Deleting a node

To remove a node, we simply modify the link pointed to by prev, regardless of whether the node was the first one of the list.

    function remove(Node node)
        atAddress(node.prev) := node.next
        if node.next != null
            node.next.prev = node.prev
        destroy node

2.5.4 See also

• XOR linked list
• SLIP (programming language)

2.5.5 References

[1] http://www.codeofhonor.com/blog/avoiding-game-crashes-related-to-linked-lists
[2] https://github.com/webcoyote/coho/blob/master/Base/List.h

2.6 Stack (abstract data type)

For the use of the term LIFO in accounting, see LIFO (accounting).

In computer science, a stack is an abstract data type that serves as a collection of elements, with two principal operations: push, which adds an element to the collection, and pop, which removes the most recently added element that was not yet removed. The order in which elements come off a stack gives rise to its alternative name, LIFO (for last in, first out). Additionally, a peek operation may give access to the top without modifying the stack.

[Figure: Simple representation of a stack runtime with push and pop operations.]

The name “stack” for this type of structure comes from the analogy to a set of physical items stacked on top of each other, which makes it easy to take an item off the top of the stack, while getting to an item deeper in the stack may require taking off multiple other items first.[1]

Considered as a linear data structure, or more abstractly a sequential collection, the push and pop operations occur only at one end of the structure, referred to as the top of the stack. This makes it possible to implement a stack as a singly linked list and a pointer to the top element.

A stack may be implemented to have a bounded capacity. If the stack is full and does not contain enough space to accept an entity to be pushed, the stack is then considered to be in an overflow state. The pop operation removes an item from the top of the stack.

2.6.1 History

Stacks entered the computer science literature in 1946, in the computer design of Alan M. Turing (who used the terms “bury” and “unbury”) as a means of calling and returning from subroutines.[2] Subroutines had already been implemented in Konrad Zuse's Z4 in 1945. Klaus Samelson and Friedrich L. Bauer of Technical University Munich proposed the idea in 1955 and filed a patent in 1957.[3] The same concept was developed, independently, by the Australian Charles Leonard Hamblin in the first half of 1957.[4]

Stacks are often described by analogy to a spring-loaded stack of plates in a cafeteria.[5][1][6] Clean plates are placed on top of the stack, pushing down any already there. When a plate is removed from the stack, the one below it pops up to become the new top.

2.6.2 Non-essential operations

In many implementations, a stack has more operations than “push” and “pop”. An example is “top of stack”, or “peek”, which observes the top-most element without removing it from the stack.[7] Since this can be done with a “pop” and a “push” with the same data, it is not essential. An underflow condition can occur in the “stack top” operation if the stack is empty, the same as “pop”. Also, implementations often have a function which just returns whether the stack is empty.

2.6.3 Software stacks

Implementation

A stack can be easily implemented either through an array or a linked list. What identifies the data structure as a stack in either case is not the implementation but the interface: the user is only allowed to pop or push items onto the array or linked list, with few other helper operations. The following will demonstrate both implementations, using pseudocode.

Array

An array can be used to implement a (bounded) stack, as follows. The first element (usually at the zero offset) is the bottom, resulting in array[0] being the first element pushed onto the stack and the last element popped off. The program must keep track of the size (length) of the stack, using a variable top that records the number of items pushed so far, therefore pointing to the place in the array where the next element is to be inserted (assuming a zero-based index convention). Thus, the stack itself can be effectively implemented as a three-element structure:

    structure stack:
        maxsize : integer
        top : integer
        items : array of item

    procedure initialize(stk : stack, size : integer):
        stk.items ← new array of size items, initially empty
        stk.maxsize ← size
        stk.top ← 0

The push operation adds an element and increments the top index, after checking for overflow:

    procedure push(stk : stack, x : item):
        if stk.top = stk.maxsize:
            report overflow error
        else:
            stk.items[stk.top] ← x
            stk.top ← stk.top + 1

Similarly, pop decrements the top index after checking for underflow, and returns the item that was previously the top one:

    procedure pop(stk : stack):
        if stk.top = 0:
            report underflow error
        else:
            stk.top ← stk.top − 1
            r ← stk.items[stk.top]
            return r

Using a dynamic array, it is possible to implement a stack that can grow or shrink as much as needed. The size of the stack is simply the size of the dynamic array, which is a very efficient implementation of a stack since adding items to or removing items from the end of a dynamic array requires amortized O(1) time.

Linked list

Another option for implementing stacks is to use a singly linked list. A stack is then a pointer to the “head” of the list, with perhaps a counter to keep track of the size of the list:

    structure frame:
        data : item
        next : frame or nil

    structure stack:
        head : frame or nil
        size : integer

    procedure initialize(stk : stack):
        stk.head ← nil
        stk.size ← 0

Pushing and popping items happens at the head of the list; overflow is not possible in this implementation (unless memory is exhausted):

    procedure push(stk : stack, x : item):
        newhead ← new frame
        newhead.data ← x
        newhead.next ← stk.head
        stk.head ← newhead
        stk.size ← stk.size + 1

    procedure pop(stk : stack):
        if stk.head = nil:
            report underflow error
        r ← stk.head.data
        stk.head ← stk.head.next
        stk.size ← stk.size - 1
        return r

Stacks and programming languages

Some languages, such as Perl, LISP and Python, make the stack operations push and pop available on their standard list/array types. Some languages, notably those in the Forth family (including PostScript), are designed around language-defined stacks that are directly visible to and manipulated by the programmer.

The following is an example of manipulating a stack in Common Lisp (“>” is the Lisp interpreter’s prompt; lines not starting with “>” are the interpreter’s responses to expressions):

    > (setf stack (list 'a 'b 'c)) ;; set the variable "stack"
    (A B C)
    > (pop stack) ;; get top (leftmost) element, should modify the stack
    A
    > stack ;; check the value of stack
    (B C)
    > (push 'new stack) ;; push a new top onto the stack
    (NEW B C)

Several of the C++ Standard Library container types have push_back and pop_back operations with LIFO semantics; additionally, the stack template class adapts existing containers to provide a restricted API with only push/pop operations. PHP has an SplStack class. Java’s library contains a Stack class that is a specialization of Vector. Following is an example program in the Java language, using that class.

    import java.util.*;

    class StackDemo {
        public static void main(String[] args) {
            Stack<String> stack = new Stack<String>();
            stack.push("A"); // Insert "A" in the stack
            stack.push("B"); // Insert "B" in the stack
            stack.push("C"); // Insert "C" in the stack
            stack.push("D"); // Insert "D" in the stack
            System.out.println(stack.peek()); // Prints the top of the stack ("D")
            stack.pop(); // removing the top ("D")
            stack.pop(); // removing the next top ("C")
        }
    }

[Figure: A typical stack, storing local data and call information for nested procedure calls (not necessarily nested procedures).]
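As an illustration of the array-based software stack described in section 2.6.3, the following Python sketch mirrors the three-element structure (maxsize, top, items) of the pseudocode. The class name BoundedStack, the peek and is_empty helpers, and the choice of exception types are assumptions made for this example and are not part of the original text.

```python
class BoundedStack:
    """Array-backed stack with a fixed capacity (hypothetical name)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.items = [None] * maxsize  # preallocated storage, initially empty
        self.top = 0                   # number of items pushed so far

    def push(self, x):
        # Check for overflow before writing at index `top`.
        if self.top == self.maxsize:
            raise OverflowError("stack overflow")
        self.items[self.top] = x
        self.top += 1

    def pop(self):
        # Check for underflow, then step back to the last occupied slot.
        if self.top == 0:
            raise IndexError("stack underflow")
        self.top -= 1
        r = self.items[self.top]
        self.items[self.top] = None    # drop the stale reference
        return r

    def peek(self):
        # Observe the top element without removing it.
        if self.top == 0:
            raise IndexError("stack underflow")
        return self.items[self.top - 1]

    def is_empty(self):
        return self.top == 0
```

Raising exceptions is only one way to "report" overflow and underflow; returning a status value would fit the pseudocode equally well.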
In the figure, the stack grows downward from its origin. The stack pointer points to the current topmost datum on the stack. A push operation decrements the pointer and copies the data to the stack; a pop operation copies data from the stack and then increments the pointer. Each procedure called in the program stores procedure return information (in yellow) and local data (in other colors) by pushing them onto the stack. This type of stack implementation is extremely common, but it is vulnerable to buffer overflow attacks (see the text).

2.6.4 Hardware stacks

A common use of stacks at the architecture level is as a means of allocating and accessing memory.

Basic architecture of a stack

A typical stack is an area of computer memory with a fixed origin and a variable size. Initially the size of the stack is zero. A stack pointer, usually in the form of a hardware register, points to the most recently referenced location on the stack; when the stack has a size of zero, the stack pointer points to the origin of the stack.

The two operations applicable to all stacks are:

• a push operation, in which a data item is placed at the location pointed to by the stack pointer, and the address in the stack pointer is adjusted by the size of the data item;
• a pop or pull operation: a data item at the current location pointed to by the stack pointer is removed, and the stack pointer is adjusted by the size of the data item.

There are many variations on the basic principle of stack operations. Every stack has a fixed location in memory at which it begins. As data items are added to the stack, the stack pointer is displaced to indicate the current extent of the stack, which expands away from the origin.

Stack pointers may point to the origin of a stack or to a limited range of addresses either above or below the origin (depending on the direction in which the stack grows); however, the stack pointer cannot cross the origin of the stack. In other words, if the origin of the stack is at address 1000 and the stack grows downwards (towards addresses 999, 998, and so on), the stack pointer must never be incremented beyond 1000 (to 1001, 1002, etc.). If a pop operation on the stack causes the stack pointer to move past the origin of the stack, a stack underflow occurs. If a push operation causes the stack pointer to increment or decrement beyond the maximum extent of the stack, a stack overflow occurs.

Some environments that rely heavily on stacks may provide additional operations, for example:

• Duplicate: the top item is popped, and then pushed again (twice), so that an additional copy of the former top item is now on top, with the original below it.
• Peek: the topmost item is inspected (or returned), but the stack pointer is not changed, and the stack size does not change (meaning that the item remains on the stack). This is also called the top operation in many articles.
• Swap or exchange: the two topmost items on the stack exchange places.
• Rotate (or Roll): the n topmost items are moved on the stack in a rotating fashion. For example, if n=3, items 1, 2, and 3 on the stack are moved to positions 2, 3, and 1 on the stack, respectively. Many variants of this operation are possible, with the most common being called left rotate and right rotate.

Stacks are often visualized growing from the bottom up (like real-world stacks). They may also be visualized growing from left to right, so that “topmost” becomes “rightmost”, or even growing from top to bottom. The important feature is that the bottom of the stack is in a fixed position. The illustration in this section is an example of a top-to-bottom growth visualization: the top (28) is the stack “bottom”, since the stack “top” is where items are pushed or popped from.

A right rotate will move the first element to the third position, the second to the first and the third to the second. Here are two equivalent visualizations of this process:

    apple                          banana
    banana    ===right rotate==>   cucumber
    cucumber                       apple

    cucumber                       apple
    banana    ===left rotate==>    cucumber
    apple                          banana

A stack is usually represented in computers by a block of memory cells, with the “bottom” at a fixed location, and the stack pointer holding the address of the current “top” cell in the stack. The top and bottom terminology are used irrespective of whether the stack actually grows towards lower memory addresses or towards higher memory addresses.

Pushing an item on to the stack adjusts the stack pointer by the size of the item (either decrementing or incrementing, depending on the direction in which the stack grows in memory), pointing it to the next cell, and copies the new top item to the stack area. Depending again on the exact implementation, at the end of a push operation, the stack pointer may point to the next unused location in the stack, or it may point to the topmost item in the stack. If the stack pointer points to the current topmost item, it will be updated before a new item is pushed onto the stack; if it points to the next available location in the stack, it will be updated after the new item is pushed onto the stack.

Popping the stack is simply the inverse of pushing. The topmost item in the stack is removed and the stack pointer is updated, in the opposite order of that used in the push operation.

Hardware support

Stack in main memory

Many CPU families, including the x86, Z80 and 6502, have a dedicated register reserved for use as (call) stack pointers and special push and pop instructions that manipulate this specific register, conserving opcode space. Some processors, like the PDP-11 and the 68000, also have special addressing modes for implementation of stacks, typically with a semi-dedicated stack pointer as well (such as A7 in the 68000).
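The pointer arithmetic described under “Basic architecture of a stack” can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class name HardwareStack, the limit parameter, the use of a dictionary in place of real memory, one cell per item, and the convention that the stack pointer addresses the topmost item (so it is updated before a push and after a pop) are all choices made for the example.

```python
class HardwareStack:
    """Toy model of a downward-growing stack in main memory (hypothetical)."""

    def __init__(self, origin=1000, limit=16):
        self.origin = origin   # fixed address at which the stack begins
        self.limit = limit     # assumed maximum extent, in cells
        self.memory = {}       # address -> value, stands in for RAM
        self.sp = origin       # empty stack: the pointer sits at the origin

    def push(self, value):
        # Growing past the maximum extent is a stack overflow.
        if self.origin - self.sp == self.limit:
            raise OverflowError("stack overflow")
        self.sp -= 1                  # update the pointer first ...
        self.memory[self.sp] = value  # ... then store at the new top

    def pop(self):
        # Moving the pointer past the origin would be a stack underflow.
        if self.sp == self.origin:
            raise IndexError("stack underflow")
        value = self.memory.pop(self.sp)  # read the topmost cell ...
        self.sp += 1                      # ... then move back toward the origin
        return value
```

With an origin of 1000, the first pushed item lands at address 999 and the second at 998, matching the downward-growing example in the text.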
Most processors also allow several different registers to be used as additional stack pointers as needed (whether updated via addressing modes or via add/sub instructions).

Stack in registers or dedicated memory

Main article: Stack machine

The x87 floating point architecture is an example of a set of registers organised as a stack where direct access to individual registers (relative to the current top) is also possible. As with stack-based machines in general, having the top-of-stack as an implicit argument allows for a small machine code footprint with a good usage of bus bandwidth and code caches, but it also prevents some types of optimizations possible on processors permitting random access to the register file for all (two or three) operands. A stack structure also makes superscalar implementations with register renaming (for speculative execution) somewhat more complex to implement, although it is still feasible, as exemplified by modern x87 implementations.

Sun SPARC, AMD Am29000, and Intel i960 are all examples of architectures using register windows within a register-stack as another strategy to avoid the use of slow main memory for function arguments and return values.

There are also a number of small microprocessors that implement a stack directly in hardware, and some microcontrollers have a fixed-depth stack that is not directly accessible. Examples are the PIC microcontrollers, the Computer Cowboys MuP21, the Harris RTX line, and the Novix NC4016. Many stack-based microprocessors were used to implement the programming language Forth at the microcode level. Stacks were also used as a basis of a number of mainframes and mini computers. Such machines were called stack machines, the most famous being the Burroughs B5000.

2.6.5 Applications

Expression evaluation and syntax parsing

Calculators employing reverse Polish notation use a stack structure to hold values. Expressions can be represented in prefix, postfix or infix notations and conversion from one form to another may be accomplished using a stack. Many compilers use a stack for parsing the syntax of expressions, program blocks etc. before translating into low level code. The syntax of most programming languages is largely context-free, allowing it to be parsed with stack-based machines.

Backtracking

Main article: Backtracking

Another important application of stacks is backtracking. Consider a simple example of finding the correct path in a maze. There are a series of points, from the starting point to the destination. We start from one point. To reach the final destination, there are several paths. Suppose we choose a random path. After following a certain path, we realise that the path we have chosen is wrong. So we need to find a way by which we can return to the beginning of that path. This can be done with the use of stacks. With the help of stacks, we remember the point where we have reached. This is done by pushing that point into the stack. In case we end up on the wrong path, we can pop the last point from the stack and thus return to the last point and continue our quest to find the right path. This is called backtracking.

Runtime memory management

Main articles: Stack-based memory allocation and Stack machine

A number of programming languages are stack-oriented, meaning they define most basic operations (adding two numbers, printing a character) as taking their arguments from the stack, and placing any return values back on the stack. For example, PostScript has a return stack and an operand stack, and also has a graphics state stack and a dictionary stack. Many virtual machines are also stack-oriented, including the p-code machine and the Java Virtual Machine.

Almost all calling conventions (the ways in which subroutines receive their parameters and return results) use a special stack (the “call stack”) to hold information about procedure/function calling and nesting in order to switch to the context of the called function and restore to the caller function when the calling finishes. The functions follow a runtime protocol between caller and callee to save arguments and return value on the stack. Stacks are an important way of supporting nested or recursive function calls. This type of stack is used implicitly by the compiler to support CALL and RETURN statements (or their equivalents) and is not manipulated directly by the programmer.

Some programming languages use the stack to store data that is local to a procedure. Space for local data items is allocated from the stack when the procedure is entered, and is deallocated when the procedure exits. The C programming language is typically implemented in this way. Using the same stack for both data and procedure calls has important security implications (see below) of which a programmer must be aware in order to avoid introducing serious security bugs into a program.

2.6.6 Security

Some computing environments use stacks in ways that may make them vulnerable to security breaches and attacks. Programmers working in such environments must take special care to avoid the pitfalls of these implementations.

For example, some programming languages use a common stack to store both data local to a called procedure and the linking information that allows the procedure to return to its caller. This means that the program moves data into and out of the same stack that contains critical return addresses for the procedure calls. If data is moved to the wrong location on the stack, or an oversized data item is moved to a stack location that is not large enough to contain it, return information for procedure calls may be corrupted, causing the program to fail.
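The corruption described in the preceding paragraph can be illustrated with a deliberately simplified model. The sketch below is not how any real compiler lays out a stack frame; the frame layout (a fixed-size local buffer immediately followed by one return-address cell), the constant BUFFER_SIZE and the function names are assumptions made purely for illustration.

```python
BUFFER_SIZE = 4  # cells reserved for local data (assumed for illustration)


def make_frame(return_address):
    # Toy layout: BUFFER_SIZE cells of local data, then the return address.
    return [0] * BUFFER_SIZE + [return_address]


def unchecked_copy(frame, data):
    # Copies every item without comparing len(data) to BUFFER_SIZE, so one
    # item too many lands on the return-address cell.
    for i, item in enumerate(data):
        frame[i] = item


def checked_copy(frame, data):
    # Rejects oversized input before any cell is written.
    if len(data) > BUFFER_SIZE:
        raise ValueError("input larger than buffer")
    for i, item in enumerate(data):
        frame[i] = item
```

In this model, copying five items with unchecked_copy replaces the stored return address with the fifth item, while checked_copy refuses the same input and leaves the frame intact.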
Malicious parties may attempt a stack smashing attack that takes advantage of this type of implementation by providing oversized data input to a program that does not check the length of input. Such a program may copy the data in its entirety to a location on the stack, and in so doing it may change the return addresses for procedures that have called it. An attacker can experiment to find a specific type of data that can be provided to such a program such that the return address of the current procedure is reset to point to an area within the stack itself (and within the data provided by the attacker), which in turn contains instructions that carry out unauthorized operations.

This type of attack is a variation on the buffer overflow attack and is an extremely frequent source of security breaches in software, mainly because some of the most popular compilers use a shared stack for both data and procedure calls, and do not verify the length of data items. Frequently programmers do not write code to verify the size of data items, either, and when an oversized or undersized data item is copied to the stack, a security breach may occur.

2.6.7 See also

• List of data structures
• Queue
• Double-ended queue
• Call stack
• FIFO (computing and electronics)
• Stack-based memory allocation
• Stack overflow
• Stack-oriented programming language

2.6.8 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.
[2] Newton, David E. (2003). Alan Turing: a study in light and shadow. Philadelphia: Xlibris. p. 82. ISBN 9781401090791. Retrieved 28 January 2015.
[3] Dr. Friedrich Ludwig Bauer and Dr. Klaus Samelson (30 March 1957). “Verfahren zur automatischen Verarbeitung von kodierten Daten und Rechenmaschine zur Ausübung des Verfahrens” (in German). Germany, Munich: Deutsches Patentamt. Retrieved 2010-10-01.
[4] C. L. Hamblin, “An Addressless Coding Scheme based on Mathematical Notation”, N.S.W University of Technology, May 1957 (typescript)
[5] Ball, John A. (1978). Algorithms for RPN calculators (1 ed.). Cambridge, Massachusetts, USA: Wiley-Interscience, John Wiley & Sons, Inc. ISBN 0-471-03070-8.
[6] Godse, A. P.; Godse, D. A. (2010-01-01). Computer Architecture. Technical Publications. pp. 1–56. ISBN 9788184315349. Retrieved 2015-01-30.
[7] Horowitz, Ellis: “Fundamentals of Data Structures in Pascal”, page 67. Computer Science Press, 1984

2.6.10 External links

• Stacks and its Applications
• Stack Machines - the new wave
• Bounding stack depth
• Stack Size Analysis for Interrupt-driven Programs (322 KB)
• This article incorporates public domain material from the NIST document: Black, Paul E. “Bounded stack”. Dictionary of Algorithms and Data Structures.

2.7 Queue (abstract data type)

(Figure: representation of a FIFO (first in, first out) queue, with Enqueue at the back and Dequeue at the front.)

In computer science, a queue (/ˈkjuː/ KYEW) is a particular kind of abstract data type or collection in which the entities in the collection are kept in order and the principal (or only) operations on the collection are the addition of entities to the rear terminal position, known as enqueue, and removal of entities from the front terminal position, known as dequeue. This makes the queue a First-In-First-Out (FIFO) data structure. In a FIFO data structure, the first element added to the queue will be the first one to be removed. This is equivalent to the requirement that once a new element is added, all elements that were added before have to be removed before the new element can be removed. Often a peek or front operation is also provided, returning the value of the front element without dequeuing it. A queue is an example of a linear data structure, or more abstractly a sequential collection.
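The three operations named above (enqueue, dequeue and peek) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the class name `FifoQueue` and the job labels are made up for the example, and Python's standard `collections.deque` is assumed as the underlying container.

```python
from collections import deque


class FifoQueue:
    """First-in, first-out queue with enqueue, dequeue and peek."""

    def __init__(self):
        self._items = deque()

    def enqueue(self, item):
        """Add an item at the rear of the queue."""
        self._items.append(item)

    def dequeue(self):
        """Remove and return the item at the front of the queue."""
        if not self._items:
            raise IndexError("dequeue from an empty queue")
        return self._items.popleft()

    def peek(self):
        """Return the front item without removing it."""
        if not self._items:
            raise IndexError("peek at an empty queue")
        return self._items[0]

    def __len__(self):
        return len(self._items)


queue = FifoQueue()
for job in ("first", "second", "third"):
    queue.enqueue(job)

front = queue.peek()                         # "first" stays in the queue
served = [queue.dequeue(), queue.dequeue()]  # removed in arrival order
```

Items leave in the order in which they arrived, so `"third"` cannot be removed until `"first"` and `"second"` have been.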
Queues provide services in computer science, transport, and operations research where various entities such as data, objects, persons, or events are stored and held to be processed later. In these contexts, the queue performs the function of a buffer.

Queues are common in computer programs, where they are implemented as data structures coupled with access routines, as an abstract data structure or in object-oriented languages as classes. Common implementations are circular buffers and linked lists.

2.7.1 Queue implementation

Theoretically, one characteristic of a queue is that it does not have a specific capacity. Regardless of how many elements are already contained, a new element can always be added. It can also be empty, at which point removing an element will be impossible until a new element has been added again.

Fixed length arrays are limited in capacity, but it is not true that items need to be copied towards the head of the queue. The simple trick of turning the array into a closed circle and letting the head and tail drift around endlessly in that circle makes it unnecessary to ever move items stored in the array. If n is the size of the array, then computing indices modulo n will turn the array into a circle. This is still the conceptually simplest way to construct a queue in a high level language, but it does admittedly slow things down a little, because the array indices must be compared to zero and the array size. That cost is comparable to the time taken to check whether an array index is out of bounds, which some languages do anyway. It will certainly be the method of choice for a quick and dirty implementation, or for any high level language that does not have pointer syntax. The array size must be declared ahead of time, but some implementations simply double the declared array size when overflow occurs. Most modern languages with objects or pointers can implement or come with libraries for dynamic lists. Such data structures may have no fixed capacity limit besides memory constraints. Queue overflow results from trying to add an element onto a full queue and queue underflow happens when trying to remove an element from an empty queue.

A bounded queue is a queue limited to a fixed number of items.[1]

There are several efficient implementations of FIFO queues. An efficient implementation is one that can perform the operations (enqueuing and dequeuing) in O(1) time.

• Linked list
    • A doubly linked list has O(1) insertion and deletion at both ends, so is a natural choice for queues.
    • A regular singly linked list only has efficient insertion and deletion at one end. However, a small modification (keeping a pointer to the last node in addition to the first one) will enable it to implement an efficient queue.
• A deque implemented using a modified dynamic array

Queues and programming languages

Queues may be implemented as a separate data type, or may be considered a special case of a double-ended queue (deque) and not implemented separately. For example, Perl and Ruby allow pushing and popping an array from both ends, so one can use push and shift functions to enqueue and dequeue a list (or, in reverse, one can use unshift and pop), although in some cases these operations are not efficient.

C++'s Standard Template Library provides a “queue” templated class which is restricted to only push/pop operations. Since J2SE5.0, Java’s library contains a Queue interface that specifies queue operations; implementing classes include LinkedList and (since J2SE 1.6) ArrayDeque. PHP has an SplQueue class and third party libraries like beanstalk'd and Gearman.

Examples

A simple queue implemented in Ruby:

    class Queue
      def initialize
        @list = Array.new
      end

      def enqueue(element)
        @list << element
      end

      def dequeue
        @list.shift
      end
    end

2.7.2 Purely functional implementation

Queues can also be implemented as a purely functional data structure.[2] Two versions of the implementation exist. The first one, called real-time queue,[3] presented below, allows the queue to be persistent with operations in O(1) worst-case time, but requires lazy lists with memoization. The second one, with no lazy lists nor memoization, is presented at the end of the section. Its amortized time is O(1) if persistence is not used, but its worst-case time complexity is O(n), where n is the number of elements in the queue.

Let us recall that, for l a list, |l| denotes its length, that NIL represents an empty list and CONS(h, t) represents the list whose head is h and whose tail is t.

Real-time queue

The data structure used to implement our queues consists of three linked lists (f, r, s), where f is the front of the queue and r is the rear of the queue in reverse order. The invariant of the structure is that s is f without its first |r| elements, that is, |s| = |f| − |r|. The tail of the queue (CONS(x, f), r, s) is then almost (f, r, s), and inserting an element x into (f, r, s) gives almost (f, CONS(x, r), s). It is said "almost" because in both of those results |s| = |f| − |r| + 1. An auxiliary function aux must then be called for the invariant to be satisfied. Two cases must be considered, depending on whether s is the empty list, in which case |r| = |f| + 1, or not. The formal definition is aux(f, r, CONS(_, s)) = (f, r, s) and aux(f, r, NIL) = (f′, NIL, f′), where f′ is f followed by r reversed.

Let us call reverse(f, r) the function which returns f followed by r reversed. Let us furthermore assume that |r| = |f| + 1, since it is the case when this function is called.
More precisely, we define a lazy function rotate(f, r, a) which takes as input three lists such that |r| = |f| + 1, and returns the concatenation of f, of r reversed and of a. Then reverse(f, r) = rotate(f, r, NIL). The inductive definition of rotate is rotate(NIL, CONS(y, NIL), a) = CONS(y, a) and rotate(CONS(x, f), CONS(y, r), a) = CONS(x, rotate(f, r, CONS(y, a))). Its running time is O(|r|), but, since lazy evaluation is used, the computation is delayed until the result is forced.

The list s in the data structure has two purposes. First, it serves as a counter for |f| − |r|: indeed, |f| = |r| if and only if s is the empty list. This counter allows us to ensure that the rear is never longer than the front list. Second, using s, which is a tail of f, forces the computation of a part of the (lazy) list f during each tail and insert operation. Therefore, when |f| = |r|, the list f is totally forced. If this were not the case, the internal representation of f could be some append of append of... of append, and forcing would not be a constant time operation anymore.

Amortized queue

Note that, without the lazy part of the implementation, the real-time queue would be a non-persistent implementation of queue in O(1) amortized time. In this case, the list s can be replaced by the integer |f| − |r|, and the reverse function would be called when this integer is 0.

2.7.3 See also

• Circular buffer
• Deque
• Priority queue
• Queueing theory
• Stack – the “opposite” of a queue: LIFO (Last In First Out)

2.7.4 References

[1] “Queue (Java Platform SE 7)". Docs.oracle.com. 2014-03-26. Retrieved 2014-05-22.
[2] Okasaki, Chris. “Purely Functional Data Structures” (PDF).
[3] Hood, Robert; Melville, Robert (November 1981). “Real-time queue operations in pure Lisp”. Information Processing Letters. 13 (2).

• Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.
• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 10.1: Stacks and queues, pp. 200–204.
• William Ford, William Topp. Data Structures with C++ and STL, Second Edition. Prentice Hall, 2002. ISBN 0-13-085850-1. Chapter 8: Queues and Priority Queues, pp. 386–390.
• Adam Drozdek. Data Structures and Algorithms in C++, Third Edition. Thomson Course Technology, 2005. ISBN 0-534-49182-0. Chapter 4: Stacks and Queues, pp. 137–169.

2.7.5 External links

• Queue Data Structure and Algorithm
• Queues with algo and 'c' programme
• STL Quick Reference
• VBScript implementation of stack, queue, deque, and Red-Black Tree

This article incorporates public domain material from the NIST document: Black, Paul E. “Bounded queue”. Dictionary of Algorithms and Data Structures.

2.8 Double-ended queue

“Deque” redirects here. It is not to be confused with dequeueing, a queue operation.
Not to be confused with Double-ended priority queue.

In computer science, a double-ended queue (dequeue, often abbreviated to deque, pronounced deck) is an abstract data type that generalizes a queue, for which elements can be added to or removed from either the front (head) or back (tail).[1] It is also often called a head-tail linked list, though properly this refers to a specific data structure implementation (see below).

2.8.1 Naming conventions

Deque is sometimes written dequeue, but this use is generally deprecated in technical literature or technical writing because dequeue is also a verb meaning “to remove from a queue”. Nevertheless, several libraries and some writers, such as Aho, Hopcroft, and Ullman in their textbook Data Structures and Algorithms, spell it dequeue.
John Mitchell, author of Concepts in Programming Languages, also uses this terminology.

2.8.2 Distinctions and sub-types

This differs from the queue abstract data type or First-In-First-Out List (FIFO), where elements can only be added to one end and removed from the other. This general data class has some possible sub-types:

• An input-restricted deque is one where deletion can be made from both ends, but insertion can be made at one end only.
• An output-restricted deque is one where insertion can be made at both ends, but deletion can be made from one end only.

Both the basic and most common list types in computing, queues and stacks, can be considered specializations of deques, and can be implemented using deques.

2.8.3 Operations

The basic operations on a deque are enqueue and dequeue on either end. Also generally implemented are peek operations, which return the value at that end without dequeuing it. Names vary between languages.

2.8.4 Implementations

There are at least two common ways to efficiently implement a deque: with a modified dynamic array or with a doubly linked list.

The dynamic array approach uses a variant of a dynamic array that can grow from both ends, sometimes called array deques. These array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends, instead of just one end. Three common implementations include:

• Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings.
• Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.
• Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.

Purely functional implementation

Double-ended queues can also be implemented as a purely functional data structure.[2] Two versions of the implementation exist. The first one, called real-time deque, is presented below. It allows the queue to be persistent with operations in O(1) worst-case time, but requires lazy lists with memoization. The second one, with no lazy lists nor memoization, is presented at the end of the section. Its amortized time is O(1) if persistence is not used, but the worst-case time complexity of an operation is O(n), where n is the number of elements in the double-ended queue.

Let us recall that, for l a list, |l| denotes its length, that NIL represents an empty list and CONS(h, t) represents the list whose head is h and whose tail is t. The functions drop(i, l) and take(i, l) return the list l without its first i elements, and the first i elements of l, respectively. Or, if |l| < i, they return the empty list and l respectively.

A double-ended queue is represented as a sixtuple (lenf, f, sf, lenr, r, sr), where f is a linked list which contains the front of the queue, of length lenf. Similarly, r is a linked list which represents the reverse of the rear of the queue, of length lenr. Furthermore, it is assured that |f| ≤ 2|r| + 1 and |r| ≤ 2|f| + 1. Intuitively, this means that neither the front nor the rear holds more than about two thirds of the elements. Finally, sf and sr are tails of f and of r; they make it possible to schedule the moment at which some lazy operations are forced. Note that, when a double-ended queue contains n elements in the front list and n elements in the rear list, the inequality invariant remains satisfied after i insertions and d deletions when i + d ≤ n/2. That is, at least n/2 operations can happen between two rebalancings.

Intuitively, inserting an element x in front of the double-ended queue (lenf, f, sf, lenr, r, sr) leads almost to the double-ended queue (lenf + 1, CONS(x, f), drop(2, sf), lenr, r, drop(2, sr)); the head and the tail of the double-ended queue (lenf, CONS(x, f), sf, lenr, r, sr) are x and almost (lenf − 1, f, drop(2, sf), lenr, r, drop(2, sr)) respectively; and the head and the tail of (lenf, NIL, NIL, lenr, CONS(x, NIL), drop(2, sr)) are x and (0, NIL, NIL, 0, NIL, NIL) respectively. The functions to insert an element in the rear, or to drop the last element of the double-ended queue, are similar to the above functions which deal with the front of the double-ended queue. It is said “almost” because, after an insertion or after an application of tail, the invariant |r| ≤ 2|f| + 1 may not be satisfied anymore. In this case it is required to rebalance the double-ended queue.

In order to avoid an operation with an O(n) cost, the algorithm uses laziness with memoization, and forces the rebalancing to be partly done during the following (|f| + |r|)/2 operations, that is, before the following rebalancing. In order to create the scheduling, some auxiliary lazy functions are required. The function rotateRev(f, r, a) returns the list f, followed by the list r reversed, followed by the list a. It is required in this function that |r| − 2|f| is 2 or 3. This function is defined by induction as rotateRev(NIL, r, a) = reverse(r) ++ a, where ++ is the concatenation operation, and by rotateRev(CONS(x, f), r, a) = CONS(x, rotateRev(f, drop(2, r), reverse(take(2, r)) ++ a)). It should be noted that rotateRev(f, r, NIL) returns the list f followed by the list r reversed. The function rotateDrop(f, j, r), which returns f followed by (r without its first j elements) reversed, is also required, for j < |f|. It is defined by rotateDrop(f, 0, r) = rotateRev(f, r, NIL), rotateDrop(f, 1, r) = rotateRev(f, drop(1, r), NIL) and rotateDrop(CONS(x, f), j, r) = CONS(x, rotateDrop(f, j − 2, drop(2, r))).

The balancing function can now be defined with

    fun balance (q as (lenf, f, sf, lenr, r, sr)) =
      if lenf > 2*lenr + 1 then
        (* front too long: move its surplus to the rear *)
        let val i = (lenf + lenr) div 2
            val j = lenf + lenr - i
            val f' = take (i, f)
            val r' = rotateDrop (r, i, f)
        in (i, f', f', j, r', r') end
      else if lenr > 2*lenf + 1 then
        (* rear too long: move its surplus to the front *)
        let val j = (lenf + lenr) div 2
            val i = lenf + lenr - j
            val r' = take (j, r)
            val f' = rotateDrop (f, j, r)
        in (i, f', f', j, r', r') end
      else q

Note that, without the lazy part of the implementation, this would be a non-persistent implementation of queue in O(1) amortized time. In this case, the lists sf and sr can be removed from the representation of the double-ended queue.

2.8.5 Language support

Ada's containers provides the generic packages Ada.Containers.Vectors and Ada.Containers.Doubly_Linked_Lists, for the dynamic array and linked list implementations, respectively.

C++'s Standard Template Library provides the class templates std::deque and std::list, for the multiple array and linked list implementations, respectively.

As of Java 6, Java’s Collections Framework provides a new Deque interface that provides the functionality of insertion and removal at both ends. It is implemented by classes such as ArrayDeque (also new in Java 6) and LinkedList, providing the dynamic array and linked list implementations, respectively. However, the ArrayDeque, contrary to its name, does not support random access.

Perl's arrays have native support for both removing (shift and pop) and adding (unshift and push) elements on both ends.

Python 2.4 introduced the collections module with support for deque objects. It is implemented using a doubly linked list of fixed-length subarrays.

As of PHP 5.3, PHP’s SPL extension contains the 'SplDoublyLinkedList' class that can be used to implement Deque datastructures. Previously, to make a Deque structure the array functions array_shift/unshift/pop/push had to be used instead.

GHC's Data.Sequence module implements an efficient, functional deque structure in Haskell. The implementation uses 2–3 finger trees annotated with sizes. There are other (fast) possibilities to implement purely functional (thus also persistent) double queues (most using heavily lazy evaluation).[3][4] Kaplan and Tarjan were the first to implement optimal confluently persistent catenable deques.[5] Their implementation was strictly purely functional in the sense that it did not use lazy evaluation. Okasaki simplified the data structure by using lazy evaluation with a bootstrapped data structure and degrading the performance bounds from worst-case to amortized. Kaplan, Okasaki, and Tarjan produced a simpler, non-bootstrapped, amortized version that can be implemented either using lazy evaluation or more efficiently using mutation in a broader but still restricted fashion. Mihaescu and Tarjan created a simpler (but still highly complex) strictly purely functional implementation of catenable deques, and also a much simpler implementation of strictly purely functional non-catenable deques, both of which have optimal worst-case bounds.

2.8.6 Complexity

• In a doubly-linked list implementation and assuming no allocation/deallocation overhead, the time complexity of all deque operations is O(1). Additionally, the time complexity of insertion or deletion in the middle, given an iterator, is O(1); however, the time complexity of random access by index is O(n).
• In a growing array, the amortized time complexity of all deque operations is O(1). Additionally, the time complexity of random access by index is O(1); but the time complexity of insertion or deletion in the middle is O(n).

2.8.7 Applications

One example where a deque can be used is the A-Steal job scheduling algorithm.[6] This algorithm implements task scheduling for several processors. A separate deque with threads to be executed is maintained for each processor. To execute the next thread, the processor gets the first element from the deque (using the “remove first element” deque operation). If the current thread forks, it is put back to the front of the deque (“insert element at front”) and a new thread is executed. When one of the processors finishes execution of its own threads (i.e. its deque is empty), it can “steal” a thread from another processor: it gets the last element from the deque of another processor (“remove last element”) and executes it. The steal-job scheduling algorithm is used by Intel’s Threading Building Blocks (TBB) library for parallel programming.

2.8.8 See also

• Pipe
• Queue
• Priority queue

2.8.9 References

[1] Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.
[2] Okasaki, Chris. “Purely Functional Data Structures” (PDF).
[3] http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf C. Okasaki, “Purely Functional Data Structures”, September 1996
[4] Adam L. Buchsbaum and Robert E. Tarjan. Confluently persistent deques via data structural bootstrapping. Journal of Algorithms, 18(3):513–547, May 1995. (pp. 58, 101, 125)
[5] Haim Kaplan and Robert E. Tarjan. Purely functional representations of catenable sorted lists. In ACM Symposium on Theory of Computing, pages 202–211, May 1996. (pp. 4, 82, 84, 124)
[6] Eitan Frachtenberg, Uwe Schwiegelshohn (2007). Job Scheduling Strategies for Parallel Processing: 12th International Workshop, JSSPP 2006. Springer. ISBN 3-540-71034-5. See p.22.

2.8.10 External links

• Type-safe open source deque implementation at Comprehensive C Archive Network
• SGI STL Documentation: deque<T, Alloc>
• Code Project: An In-Depth Study of the STL Deque Container
• Deque implementation in C
• VBScript implementation of stack, queue, deque, and Red-Black Tree
• Multiple implementations of non-catenable deques in Haskell

2.9 Circular buffer

(Figure: a ring showing, conceptually, a circular buffer. This visually shows that the buffer has no real end and it can loop around the buffer. However, since memory is never physically created as a ring, a linear representation is generally used as is done below.)

A circular buffer, circular queue, cyclic buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. This structure lends itself easily to buffering data streams.

2.9.1 Uses

The useful property of a circular buffer is that it does not need to have its elements shuffled around when one is consumed. (If a non-circular buffer were used then it would be necessary to shift all elements when one is consumed.) In other words, the circular buffer is well-suited as a FIFO buffer while a standard, non-circular buffer is well suited as a LIFO buffer.

Circular buffering makes a good implementation strategy for a queue that has fixed maximum size. Should a maximum size be adopted for a queue, then a circular buffer is a completely ideal implementation; all queue operations are constant time. However, expanding a circular buffer requires shifting memory, which is comparatively costly.
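A fixed-capacity queue of this kind can be sketched as follows. This is an illustrative sketch under stated assumptions: the class `RingBuffer`, its `overwrite` flag and its method names are invented for the example, and it tracks a start index plus an element count, which is only one of several possible bookkeeping schemes.

```python
class RingBuffer:
    """Fixed-capacity FIFO stored in a list whose indices wrap around."""

    def __init__(self, capacity, overwrite=False):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._start = 0             # index of the oldest element
        self._count = 0             # number of stored elements
        self._overwrite = overwrite

    def put(self, item):
        capacity = len(self._slots)
        if self._count == capacity:
            if not self._overwrite:
                raise OverflowError("ring buffer is full")
            # Full: replace the oldest element and advance the start index.
            self._slots[self._start] = item
            self._start = (self._start + 1) % capacity
            return
        end = (self._start + self._count) % capacity  # wrap around
        self._slots[end] = item
        self._count += 1

    def get(self):
        if self._count == 0:
            raise IndexError("ring buffer is empty")
        item = self._slots[self._start]
        self._slots[self._start] = None
        self._start = (self._start + 1) % len(self._slots)
        self._count -= 1
        return item

    def __len__(self):
        return self._count


buf = RingBuffer(3, overwrite=True)
for value in (1, 2, 3, 4):      # the fourth write replaces the oldest value
    buf.put(value)
drained = [buf.get(), buf.get(), buf.get()]
```

Neither `put` nor `get` ever moves a stored element; both only update one slot and one or two integers, which is why every operation takes constant time.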
For arbitrarily expanding queues, a linked list approach may be preferred instead.

In some situations, an overwriting circular buffer can be used, e.g. in multimedia. If the buffer is used as the bounded buffer in the producer-consumer problem then it is probably desired for the producer (e.g., an audio generator) to overwrite old data if the consumer (e.g., the sound card) is unable to momentarily keep up. Also, the LZ77 family of lossless data compression algorithms operates on the assumption that strings seen more recently in a data stream are more likely to occur soon in the stream. Implementations store the most recent data in a circular buffer.

2.9.2 How it works

(Figure: a 24-byte keyboard circular buffer. When the write pointer is about to reach the read pointer, because the microprocessor is not responding, the buffer will stop recording keystrokes and, in some computers, a beep will be played.)

A circular buffer first starts empty and of some predefined length. For example, this is a 7-element buffer:

    [ ][ ][ ][ ][ ][ ][ ]

Assume that a 1 is written into the middle of the buffer (exact starting location does not matter in a circular buffer):

    [ ][ ][1][ ][ ][ ][ ]

Then assume that two more elements, 2 & 3, are added, which get appended after the 1:

    [ ][ ][1][2][3][ ][ ]

If two elements are then removed from the buffer, the oldest values inside the buffer are removed. The two elements removed, in this case, are 1 & 2, leaving the buffer with just a 3:

    [ ][ ][ ][ ][3][ ][ ]

If the buffer has 7 elements then it is completely full:

    [6][7][8][9][3][4][5]

A consequence of the circular buffer is that when it is full and a subsequent write is performed, then it starts overwriting the oldest data. In this case, two more elements, A & B, are added and they overwrite the 3 & 4:

    [6][7][8][9][A][B][5]

Alternatively, the routines that manage the buffer could prevent overwriting the data and return an error or raise an exception. Whether or not data is overwritten is up to the semantics of the buffer routines or the application using the circular buffer.

Finally, if two elements are now removed then what would be returned is not 3 & 4 but 5 & 6, because A & B overwrote the 3 & the 4, yielding the buffer with:

    [ ][7][8][9][A][B][ ]

2.9.3 Circular buffer mechanics

A circular buffer can be implemented using four pointers, or two pointers and two integers:

• buffer start in memory
• buffer end in memory, or buffer capacity
• start of valid data (index or pointer)
• end of valid data (index or pointer), or amount of data currently in the buffer (integer)

(Figure: a partially full buffer holding 1, 2, 3, with START and END marking the valid data.)

(Figure: a full buffer, 6 7 8 9 A B 5, with four elements (numbers 1 through 4) having been overwritten; END marks B and START marks 5.)

When an element is overwritten, the start pointer is incremented to the next element.

In the pointer-based implementation strategy, the buffer’s full or empty state can be resolved from the start and end indexes. When they are equal, the buffer is empty, and when the start is one greater than the end, the buffer is full.[1] When the buffer is instead designed to track the number of inserted elements n, checking for emptiness means checking n = 0 and checking for fullness means checking whether n equals the capacity.[2]

2.9.4 Optimization

A circular-buffer implementation may be optimized by mapping the underlying buffer to two contiguous regions of virtual memory. (Naturally, the underlying buffer’s length must then equal some multiple of the system’s page size.) Reading from and writing to the circular buffer may then be carried out with greater efficiency by means of direct memory access; those accesses which fall beyond the end of the first virtual-memory region will automatically wrap around to the beginning of the underlying buffer. When the read offset is advanced into the second virtual-memory region, both offsets (read and write) are decremented by the length of the underlying buffer.[1]

2.9.5 Fixed-length-element and contiguous-block circular buffer

Perhaps the most common version of the circular buffer uses 8-bit bytes as elements.

Some implementations of the circular buffer use fixed-length elements that are bigger than 8-bit bytes: 16-bit integers for audio buffers, 53-byte ATM cells for telecom buffers, etc. Each item is contiguous and has the correct data alignment, so software reading and writing these values can be faster than software that handles non-contiguous and non-aligned values.

Ping-pong buffering can be considered a very specialized circular buffer with exactly two large fixed-length elements.

The Bip Buffer (bipartite buffer) is very similar to a circular buffer, except it always returns contiguous blocks which can be variable length. This offers nearly all the efficiency advantages of a circular buffer while maintaining the ability for the buffer to be used in APIs that only accept contiguous blocks.[1]

Fixed-sized compressed circular buffers use an alternative indexing strategy based on elementary number theory to maintain a fixed-sized compressed representation of the entire data sequence.[3]

2.9.6 External links

[1] Simon Cooke (2003), “The Bip Buffer - The Circular Buffer with a Twist”
[2] Morin, Pat. “ArrayQueue: An Array-Based Queue”. Open Data Structures (in pseudocode). Retrieved 7 November 2015.
[3] John C. Gunther. 2014. Algorithm 938: Compressing circular buffers. ACM Trans. Math. Softw. 40, 2, Article 17 (March 2014)

• CircularBuffer at the Portland Pattern Repository
• Boost: Templated Circular Buffer Container
• http://www.dspguide.com/ch28/2.htm

Chapter 3 Dictionaries

3.1 Associative array

“Dictionary (data structure)" redirects here. It is not to be confused with data dictionary.
“Associative container” redirects here. For the implementation of ordered associative arrays in the standard library of the C++ programming language, see associative containers.

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.

Operations associated with this data type allow:[1][2]

• the addition of a pair to the collection
• the removal of a pair from the collection
• the modification of an existing pair
• the lookup of a value associated with a particular key

The dictionary problem is a classic computer science problem: the task of designing a data structure that maintains a set of data during 'search', 'delete', and 'insert' operations.[3] The two major solutions to the dictionary problem are a hash table or a search tree.[1][2][4][5] In some cases it is also possible to solve the problem using directly addressed arrays, binary search trees, or other more specialized structures.

Many programming languages include associative arrays as primitive data types, and they are available in software libraries for many others. Content-addressable memory is a form of direct hardware-level support for associative arrays.

Associative arrays have many applications including such fundamental programming patterns as memoization and the decorator pattern.[6]

3.1.1 Operations

In an associative array, the association between a key and a value is often known as a “binding”, and the same word “binding” may also be used to refer to the process of creating a new association.

The operations that are usually defined for an associative array are:[1][2]

• Add or insert: add a new (key, value) pair to the collection, binding the new key to its new value. The arguments to this operation are the key and the value.
• Reassign: replace the value in one of the (key, value) pairs that are already in the collection, binding an old key to a new value. As with an insertion, the arguments to this operation are the key and the value.
• Remove or delete: remove a (key, value) pair from the collection, unbinding a given key from its value. The argument to this operation is the key.
• Lookup: find the value (if any) that is bound to a given key. The argument to this operation is the key, and the value is returned from the operation. If no value is found, some associative array implementations raise an exception.

Often, instead of add or reassign, there is a single set operation that adds a new (key, value) pair if one does not already exist, and otherwise reassigns it.

In addition, associative arrays may also include other operations such as determining the number of bindings or constructing an iterator to loop over all the bindings. Usually, for such an operation, the order in which the bindings are returned may be arbitrary.

A multimap generalizes an associative array by allowing multiple values to be associated with a single key.[7] A bidirectional map is a related abstract data type in which the bindings operate in both directions: each value must be associated with a unique key, and a second lookup operation takes a value as argument and looks up the key associated with that value.

3.1.2 Example

Suppose that the set of loans made by a library is represented in a data structure. Each book in a library may be checked out only by a single library patron at a time. However, a single patron may be able to check out multiple books. Therefore, the information about which books are checked out to which patrons may be represented by an associative array, in which the books are the keys and the patrons are the values.
Using notation from Python or JSON, the data structure would be:

    { "Pride and Prejudice": "Alice", "Wuthering Heights": "Alice", "Great Expectations": "John" }

A lookup operation on the key “Great Expectations” would return “John”. If John returns his book, that would cause a deletion operation, and if Pat checks out a book, that would cause an insertion operation, leading to a different state:

    { "Pride and Prejudice": "Alice", "The Brothers Karamazov": "Pat", "Wuthering Heights": "Alice" }

3.1.3 Implementation

For dictionaries with very small numbers of bindings, it may make sense to implement the dictionary using an association list, a linked list of bindings. With this implementation, the time to perform the basic dictionary operations is linear in the total number of bindings; however, it is easy to implement and the constant factors in its running time are small.[1][8]

Another very simple implementation technique, usable when the keys are restricted to a narrow range of integers, is direct addressing into an array: the value for a given key k is stored at the array cell A[k], or if there is no binding for k then the cell stores a special sentinel value that indicates the absence of a binding. As well as being simple, this technique is fast: each dictionary operation takes constant time. However, the space requirement for this structure is the size of the entire keyspace, making it impractical unless the keyspace is small.[4]

The two major approaches to implementing dictionaries are a hash table or a search tree.[1][2][4][5]

Hash table implementations

The most frequently used general purpose implementation of an associative array is with a hash table: an array of bindings, together with a hash function that maps each possible key into an array index. The basic idea of a hash table is that the binding for a given key is stored at the position given by applying the hash function to that key, and that lookup operations are performed by looking at that cell of the array and using the binding found there.

The great advantage of a hash over a straight address is that there does not have to be a search for the key to find the address: the hash is the address of the correct key, and the value is immediately available. However, hash table based dictionaries must be prepared to handle collisions that occur when two keys are mapped by the hash function to the same index, and many different collision resolution strategies have been developed for dealing with this situation, often based either on open addressing (looking at a sequence of hash table indices instead of a single index, until finding either the given key or an empty cell) or on hash chaining (storing a small association list instead of a single binding in each hash table cell).[1][2][4][9]

Search tree implementations

Main article: search tree

Another common approach is to implement an associative array with a (self-balancing) red-black tree.[10]

Other implementations

Dictionaries may also be stored in binary search trees or in data structures specialized to a particular type of keys such as radix trees, tries, Judy arrays, or van Emde Boas trees, but these implementation methods are less efficient than hash tables as well as placing greater restrictions on the types of data that they can handle. The advantages of these alternative structures come from their ability to handle operations beyond the basic ones of an associative array, such as finding the binding whose key is the closest to a queried key, when the query is not itself present in the set of bindings.

3.1.4 Language support

Main article: Comparison of programming languages (mapping)

Associative arrays can be implemented in any programming language as a package and many language systems provide them as part of their standard library. In some languages, they are not only built into the standard system, but have special syntax, often using array-like subscripting.

Built-in syntactic support for associative arrays was introduced by SNOBOL4, under the name “table”. MUMPS made multi-dimensional associative arrays, optionally persistent, its key data structure. SETL supported them as one possible implementation of sets and maps. Most modern scripting languages, starting with AWK and including Rexx, Perl, Tcl, JavaScript, Wolfram Language, Python, Ruby, and Lua, support associative arrays as a primary container type. In many more languages, they are available as library functions without special syntax.

In Smalltalk, Objective-C, .NET,[11] Python, REALbasic, Swift, and VBA they are called dictionaries; in Perl, Ruby and Seed7 they are called hashes; in C++, Java, Go, Clojure, Scala, OCaml, Haskell they are called maps (see map (C++), unordered_map (C++), and Map); in Common Lisp and Windows PowerShell, they are called hash tables (since both typically use this implementation). In PHP, all arrays can be associative, except that the keys are limited to integers and strings. In JavaScript (see also JSON), all objects behave as associative arrays with string-valued keys, while the Map and WeakMap types take arbitrary objects as keys. In Lua, they are called tables, and are used as the primitive building block for all data structures. In Visual FoxPro, they are called Collections. The D language also has support for associative arrays.[12]

3.1.5 Permanent storage

Main article: Key-value store

Most programs using associative arrays will at some point need to store that data in a more permanent form, like in a computer file. A common solution to this problem is a generalized concept known as archiving or serialization, which produces a text or binary representation of the original objects that can be written directly to a file. This is most commonly implemented in the underlying object model, like .NET or Cocoa, which include standard functions that convert the internal data into text form. The program can create a complete text representation of any group of objects by calling these methods, which are almost always already implemented in the base associative array class.[13]

For programs that use very large data sets, this sort of individual file storage is not appropriate, and a database management system (DB) is required. Some DB systems natively store associative arrays by serializing the data and then storing that serialized data and the key. Individual arrays can then be loaded or saved from the database using the key to refer to them. These key-value stores have been used for many years and have a history as long as that of the more common relational databases (RDBs), but a lack of standardization, among other reasons, limited their use to certain niche roles. RDBs were used for these roles in most cases, although saving objects to an RDB can be complicated, a problem known as object-relational impedance mismatch.

After c. 2010, the need for high performance databases suitable for cloud computing and more closely matching the internal structure of the programs using them led to a renaissance in the key-value store market. These systems can store and retrieve associative arrays in a native fashion, which can greatly improve performance in common web-related workflows.

3.1.6 See also

• Key-value database
• Tuple
• Function (mathematics)
• JSON

3.1.7 References

[1] Goodrich, Michael T.; Tamassia, Roberto (2006), “9.1 The Map Abstract Data Type”, Data Structures & Algorithms in Java (4th ed.), Wiley, pp. 368–371
[2] Mehlhorn, Kurt; Sanders, Peter (2008), “4 Hash Tables and Associative Arrays”, Algorithms and Data Structures: The Basic Toolbox (PDF), Springer, pp. 81–98
[3] Anderson, Arne (1989). “Optimal Bounds on the Dictionary Problem”. Proc. Symposium on Optimal Algorithms. Springer Verlag: 106–114.
[4] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), “11 Hash Tables”, Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, pp. 221–252, ISBN 0-262-03293-7.
[5] Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. “Dynamic Perfect Hashing: Upper and Lower Bounds”. SIAM J. Comput. 23, 4 (Aug. 1994), 738-761. http://portal.acm.org/citation.cfm?id=182370 doi:10.1137/S0097539791194094
[6] Goodrich & Tamassia (2006), pp. 597–599.
[7] Goodrich & Tamassia (2006), pp. 389–397.
[8] “When should I use a hash table instead of an association list?”. lisp-faq/part2. 1996-02-20.
[9] Klammer, F.; Mazzolini, L. (2006), “Pathfinders for associative maps”, Ext. Abstracts GIS-l 2006, GIS-I, pp. 71–74.
[10] Joel Adams and Larry Nyhoff. “Trees in STL”. Quote: “The Standard Template library ... some of its containers -- the set<T>, map<T1, T2>, multiset<T>, and multimap<T1, T2> templates -- are generally built using a special kind of self-balancing binary search tree called a red-black tree.”
[11] “Dictionary<TKey, TValue> Class”. MSDN.
[12] “Associative Arrays, the D programming language”. Digital Mars.
[13] “Archives and Serializations Programming Guide”, Apple Inc., 2012

3.1.8 External links

• NIST’s Dictionary of Algorithms and Data Structures: Associative Array

3.2 Association list

In computer programming and particularly in Lisp, an association list, often referred to as an alist, is a linked list in which each list element (or node) comprises a key and a value. The association list is said to associate the value with the key. In order to find the value associated with a given key, a sequential search is used: each element of the list is searched in turn, starting at the head, until the key is found.
Association lists provide a simple way of implementing an associative array, but are efficient only when the number of keys is very small.

3.2.1 Operation

An associative array is an abstract data type that can be used to maintain a collection of key–value pairs and look up the value associated with a given key. The association list provides a simple way of implementing this data type.

To test whether a key is associated with a value in a given association list, search the list starting at its first node and continuing either until a node containing the key has been found or until the search reaches the end of the list (in which case the key is not present). To add a new key–value pair to an association list, create a new node for that key–value pair, set the node’s link to be the previous first element of the association list, and replace the first element of the association list with the new node.[1] Although some implementations of association lists disallow having multiple nodes with the same keys as each other, such duplications are not problematic for this search algorithm: duplicate keys that appear later in the list are ignored.[2]

It is also possible to delete a key from an association list, by scanning the list to find each occurrence of the key and splicing the nodes containing the key out of the list.[1] The scan should continue to the end of the list, even when the key is found, in case the same key may have been inserted multiple times.

3.2.2 Performance

The disadvantage of association lists is that the time to search is O(n), where n is the length of the list.[3] For large lists, this may be much slower than the times that can be obtained by representing an associative array as a binary search tree or as a hash table. Additionally, unless the list is regularly pruned to remove elements with duplicate keys, multiple values associated with the same key will increase the size of the list, and thus the time to search, without providing any compensatory advantage.

One advantage of association lists is that a new element can be added in constant time. Additionally, when the number of keys is very small, searching an association list may be more efficient than searching a binary search tree or hash table, because of the greater simplicity of their implementation.[4]

3.2.3 Applications and software libraries

In the early development of Lisp, association lists were used to resolve references to free variables in procedures.[5][6] In this application, it is convenient to augment association lists with an additional operation, that reverses the addition of a key–value pair without scanning the list for other copies of the same key. In this way, the association list can function as a stack, allowing local variables to temporarily shadow other variables with the same names, without destroying the values of those other variables.[7]

Many programming languages, including Lisp,[5] Scheme,[8] OCaml,[9] and Haskell[10] have functions for handling association lists in their standard libraries.

3.2.4 See also

• Self-organizing list, a strategy for re-ordering the keys in an association list to speed up searches for frequently-accessed keys

3.2.5 References

[1] Marriott, Kim; Stuckey, Peter J. (1998). Programming with Constraints: An Introduction. MIT Press. pp. 193–195. ISBN 9780262133418.
[2] Frické, Martin (2012). “2.8.3 Association Lists”. Logic and the Organization of Information. Springer. pp. 44–45. ISBN 9781461430872.
[3] Knuth, Donald. “6.1 Sequential Searching”. The Art of Computer Programming, Vol. 3: Sorting and Searching (2nd ed.). Addison Wesley. pp. 396–405. ISBN 0-201-89685-0.
[4] Janes, Calvin (2011). “Using Association Lists for Associative Arrays”. Developer’s Guide to Collections in Microsoft .NET. Pearson Education. p. 191. ISBN 9780735665279.
[5] McCarthy, John; Abrahams, Paul W.; Edwards, Daniel J.; Hart, Timothy P.; Levin, Michael I. (1985). LISP 1.5 Programmer’s Manual (PDF). MIT Press. ISBN 0-262-13011-4. See in particular p. 12 for functions that search an association list and use it to substitute symbols in another expression, and p. 103 for the application of association lists in maintaining variable bindings.
[6] van de Snepscheut, Jan L. A. (1993). What Computing Is All About. Monographs in Computer Science. Springer. p. 201. ISBN 9781461227106.
[7] Scott, Michael Lee (2000). “3.3.4 Association Lists and Central Reference Tables”. Programming Language Pragmatics. Morgan Kaufmann. p. 137. ISBN 9781558604421.
[8] Pearce, Jon (2012). Programming and Meta-Programming in Scheme. Undergraduate Texts in Computer Science. Springer. p. 214. ISBN 9781461216827.
[9] Minsky, Yaron; Madhavapeddy, Anil; Hickey, Jason (2013). Real World OCaml: Functional Programming for the Masses. O'Reilly Media. p. 253. ISBN 9781449324766.
[10] O'Sullivan, Bryan; Goerzen, John; Stewart, Donald Bruce (2008). Real World Haskell: Code You Can Believe In. O'Reilly Media. p. 299. ISBN 9780596554309.

3.3 Hash table

Not to be confused with Hash list or Hash tree.
“Rehash” redirects here. For the South Park episode, see Rehash (South Park). For the IRC command, see List of Internet Relay Chat commands § REHASH.

In computing, a hash table (hash map) is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

[Figure: A small phone book as a hash table]

Ideally, the hash function will assign each key to a unique bucket, but it is possible that two keys will generate an identical hash causing both keys to point to the same bucket. Instead, most hash table designs assume that hash collisions (different keys that are assigned by the hash function to the same bucket) will occur and must be accommodated in some way.

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at (amortized[2]) constant average cost per operation.[3][4]

In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

3.3.1 Hashing

Main article: Hash function

The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found:

    index = f(key, array_size)

Often this is done in two steps:

    hash = hashfunc(key)
    index = hash % array_size

In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size − 1) using the modulo operator (%).

In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves speed, but can increase problems with a poor hash function.

Choosing a good hash function

A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve.

A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson’s chi-squared test for discrete uniform distributions.[5][6]

The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size s, then the hash function needs to be uniform only when s is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have s be a prime number.[7] The modulus operation may provide some additional mixing; this is especially useful with a poor hash function.

For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[7]

Cryptographic hash functions are believed to provide good hash functions for any table size s, either by modulo reduction or by bit masking. They may also be appropriate if there is a risk of malicious users trying to sabotage a network service by submitting requests designed to generate a large number of collisions in the server’s hash tables. However, the risk of sabotage can also be avoided by cheaper methods (such as applying a secret salt to the data, or using a universal hash function). A drawback of cryptographic hashing functions is that they are often slower to compute, which means that in cases where the uniformity for any s is not necessary, a non-cryptographic hashing function might be preferable.

Perfect hash function

If all keys are known ahead of time, a perfect hash function can be used to create a perfect hash table that has no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.

Perfect hashing allows for constant time lookups in all cases. This is in contrast to most chaining and open addressing methods, where the time for lookup is low on average, but may be very large, O(n), for some sets of keys.

3.3.2 Key statistics

A critical statistic for a hash table is the load factor, defined as

    load factor = n / k

where

• n is the number of entries;
• k is the number of buckets.

As the load factor grows larger, the hash table becomes slower, and it may even fail to work (depending on the method used). The expected constant time property of a hash table assumes that the load factor is kept below some bound. For a fixed number of buckets, the time for a lookup grows with the number of entries and therefore the desired constant time is not achieved.

Second to that, one can examine the variance of the number of entries per bucket. For example, two tables both have 1,000 entries and 1,000 buckets; one has exactly one entry in each bucket, the other has all entries in the same bucket. Clearly the hashing is not working in the second one.

A low load factor is not especially beneficial. As the load factor approaches 0, the proportion of unused areas in the hash table increases, but there is not necessarily any reduction in search cost. This results in wasted memory.

3.3.3 Collision resolution

Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2,450 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is approximately a 95% chance of at least two of the keys being hashed to the same slot.

Therefore, almost all hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.

Separate chaining

[Figure: Hash collision resolved by separate chaining.]

In the method known as separate chaining, each bucket is independent, and has some sort of list of entries with the same index. The time for hash table operations is the time to find the bucket (which is constant) plus the time for the list operation.

In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely more than that. Therefore, structures that are efficient in time and space for these cases are preferred. Structures that are efficient for a fairly large number of entries per bucket are not needed or desirable. If these cases happen often, the hashing function needs to be fixed.

Separate chaining with linked lists

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket; that is, it is roughly proportional to the load factor.

For this reason, chained hash tables remain effective even when the number of table entries n is much higher than the number of slots. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than a plain sequential list.

Separate chaining with other structures

Instead of a list, one can use any other data structure that supports the required operations.
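As a rough illustration of separate chaining, the sketch below keeps one bucket per slot and scans only the selected bucket on each operation. It is a minimal sketch rather than a reference implementation: the class name `ChainedHashTable` and its method names are hypothetical, each bucket is a plain Python list of (key, value) pairs standing in for the per-bucket list of entries, and Python's built-in `hash()` plays the role of `hashfunc` in the two-step index computation.

```python
class ChainedHashTable:
    """Hash table with separate chaining; each bucket is a list of (key, value) pairs."""

    def __init__(self, num_buckets: int = 8) -> None:
        self._buckets = [[] for _ in range(num_buckets)]
        self._count = 0

    def _index(self, key) -> int:
        # Two-step scheme: compute the hash, then reduce it modulo the array size.
        return hash(key) % len(self._buckets)

    def set(self, key, value) -> None:
        """Add a new binding, or reassign the value if the key is already present."""
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self._count += 1

    def get(self, key, default=None):
        """Scan only the selected bucket for the key."""
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

    def delete(self, key) -> bool:
        """Remove the binding for key; report whether anything was removed."""
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                del bucket[i]
                self._count -= 1
                return True
        return False

    def load_factor(self) -> float:
        """Number of entries n divided by number of buckets k."""
        return self._count / len(self._buckets)


phone_book = ChainedHashTable(num_buckets=4)
phone_book.set("John Smith", "521-1234")
phone_book.set("Lisa Smith", "521-8976")
phone_book.set("Sandra Dee", "521-9655")
print(phone_book.get("Lisa Smith"), phone_book.load_factor())
```

Because entries that collide simply share a bucket, the table keeps working when the load factor exceeds 1; lookups then take time roughly proportional to the length of the bucket scanned.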
For example, by using a self-balancing tree, the theoretical worst-case time of common hash table operations (insertion, deletion, lookup) can be brought down to O(log n) rather than O(n). However, this approach is only worth the trouble and extra memory cost if long delays must be avoided at all costs (e.g., in a real-time application), or if one must guard against many entries hashed to the same slot (e.g., if one expects extremely non-uniform distributions, or in the case of web sites or other publicly accessible services, which are vulnerable to malicious key distributions in requests).

The variant called array hash table uses a dynamic array to store all the entries that hash to the same slot.[8][9][10] Each newly inserted entry gets appended to the end of the dynamic array that is assigned to the slot. The dynamic array is resized in an exact-fit manner, meaning it is grown only by as many bytes as needed. Alternative techniques such as growing the array by block sizes or pages were found to improve insertion performance, but at a cost in space. This variation makes more efficient use of CPU caching and the translation lookaside buffer (TLB), because slot entries are stored in sequential memory positions. It also dispenses with the next pointers that are required by linked lists, which saves space. Despite frequent array resizing, space overheads incurred by the operating system such as memory fragmentation were found to be small.

An elaboration on this approach is the so-called dynamic perfect hashing,[11] where a bucket that contains k entries is organized as a perfect hash table with k² slots. While it uses more memory (n² slots for n entries, in the worst case and n × k slots in the average case), this variant has guaranteed constant worst-case lookup time, and low amortized time for insertion. It is also possible to use a fusion tree for each bucket, achieving constant time for all operations with high probability.[12]

For separate chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table.

The bucket chains are often searched sequentially using the order the entries were added to the bucket. If the load factor is large and some keys are more likely to come up than others, then rearranging the chain with a move-to-front heuristic may be effective. More sophisticated data structures, such as balanced search trees, are worth considering only if the load factor is large (about 10 or more), or if the hash distribution is likely to be very non-uniform, or if one must guarantee good performance even in a worst-case scenario. However, using a larger table and/or a better hash function may be even more effective in those cases.

Chained hash tables also inherit the disadvantages of linked lists. When storing small keys and values, the space overhead of the next pointer in each entry record can be significant. An additional disadvantage is that traversing a linked list has poor cache performance, making the processor cache ineffective.

Separate chaining with list head cells

[Figure: Hash collision by separate chaining with head records in the bucket array.]

Some chaining implementations store the first record of each chain in the slot array itself.[4] The number of pointer traversals is decreased by one for most cases. The purpose is to increase cache efficiency of hash table access.

The disadvantage is that an empty bucket takes the same space as a bucket with one entry. To save space, such hash tables often have about as many slots as stored entries, meaning that many slots have two or more entries.

Open addressing

Main article: Open addressing

[Figure: Hash collision resolved by open addressing with linear probing (interval=1). Note that “Ted Baker” has a unique hash, but nevertheless collided with “Sandra Dee”, that had previously collided with “John Smith”.]

In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.[13] The name “open addressing” refers to the fact that the location (“address”) of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with “open hashing” or “closed addressing” that usually mean separate chaining.)

Well-known probe sequences include:

• Linear probing, in which the interval between probes is fixed (usually 1)
• Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation
• Double hashing, in which the interval between probes is computed by a second hash function

A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. In fact, even with good hash functions, their performance dramatically degrades when the load factor grows beyond 0.7 or so. For many applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.

Open addressing schemes also put more stringent requirements on the hash function: besides distributing the keys more uniformly over the buckets, the function must also minimize the clustering of hash values that are consecutive in the probe order. Using separate chaining, the only concern is that too many objects map to the same hash value; whether they are adjacent or nearby is completely irrelevant.

Open addressing only saves memory if the entries are small (less than four times the size of a pointer) and the load factor is not too small. If the load factor is close to zero (that is, there are far more buckets than stored entries), open addressing is wasteful even if each entry is just two words.

[Figure: This graph compares the average number of cache misses required to look up elements in tables with chaining and linear probing. As the table passes the 80%-full mark, linear probing’s performance drastically degrades.]

Open addressing avoids the time overhead of allocating each new entry record, and can be implemented even in the absence of a memory allocator. It also avoids the extra indirection required to access the first entry of each bucket (that is, usually the only one). It also has better locality of reference, particularly with linear probing. With small record sizes, these factors can yield better performance than chaining, particularly for lookups. Hash tables with open addressing are also easier to serialize, because they do not use pointers.

On the other hand, normal open addressing is a poor choice for large elements, because these elements fill entire CPU cache lines (negating the cache advantage), and a large amount of space is wasted on large empty table slots. If the open addressing table only stores references to elements (external storage), it uses space comparable to chaining even for large records but loses its speed advantage.

Generally speaking, open addressing is better used for hash tables with small records that can be stored within the table (internal storage) and fit in a cache line. They are particularly suitable for elements of one word or less. If the table is expected to have a high load factor, the records are large, or the data is variable-sized, chained hash tables often perform as well or better.

Ultimately, used sensibly, any kind of hash table algorithm is usually fast enough; and the percentage of a calculation spent in hash table code is low. Memory usage is rarely considered excessive. Therefore, in most cases the differences between these algorithms are marginal, and other considerations typically come into play.

Coalesced hashing

A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself.[13] Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.

Cuckoo hashing

Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup time in the worst case, and constant amortized time for insertions and deletions. It uses two or more hash functions, which means any key/value pair could be in two or more locations.
For lookup, the first hash function is used; if the key/value is not found, then the second hash function is used, and so on. If a collision happens during insertion, then the key is re-hashed with the second hash function to map it to another bucket. If all hash functions are used and there is still a collision, then the key it collided with is removed to make space for the new key, and the old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that location also results in a collision, then the process repeats until there is no collision or the process traverses all the buckets, at which point the table is resized. By combining multiple hash functions with multiple cells per bucket, very high space utilization can be achieved.

Hopscotch hashing

Another alternative open-addressing solution is hopscotch hashing,[14] which combines the approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations. In particular it works well even when the load factor grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table.

The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original hashed bucket, where a given entry is always found. Thus, search is limited to the number of entries in this neighborhood, which is logarithmic in the worst case, constant on average, and with proper alignment of the neighborhood typically requires one cache miss. When inserting an entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is outside the neighborhood, items are repeatedly displaced in a sequence of hops. (This is similar to cuckoo hashing, but with the difference that in this case the empty slot is being moved into the neighborhood, instead of items being moved out with the hope of eventually finding an empty slot.) Each hop brings the open slot closer to the original neighborhood, without invalidating the neighborhood property of any of the buckets along the way. In the end, the open slot has been moved into the neighborhood, and the entry being inserted can be added to it.

Robin Hood hashing

One interesting variation on double-hashing collision resolution is Robin Hood hashing.[15][16] The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst-case search times in the table. This is similar to ordered hash tables[17] except that the criterion for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variation in the number of probes are reduced dramatically, an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions.[18] External Robin Hood hashing is an extension of this algorithm where the table is stored in an external file and each table position corresponds to a fixed-sized page or bucket with B records.[19]

2-choice hashing

2-choice hashing employs two different hash functions, h1(x) and h2(x), for the hash table. Both hash functions are used to compute two table locations. When an object is inserted in the table, it is placed in the table location that contains fewer objects (with the default being the h1(x) table location if there is equality in bucket size). 2-choice hashing employs the principle of the power of two choices.[20]

3.3.4 Dynamic resizing

The good functioning of a hash table depends on the fact that the table size is proportional to the number of entries. With a fixed size, and the common structures, it is similar to linear search, except with a better constant factor. In some cases, the number of entries may be definitely known in advance, for example keywords in a language. More commonly, this is not known for sure, if only due to later changes in code and data. It is a serious, although common, mistake not to provide any way for the table to resize. A general-purpose hash table “class” will almost always have some way to resize, and it is good practice even for simple “custom” tables. An implementation should check the load factor, and do something if it becomes too large (this needs to be done only on inserts, since that is the only thing that would increase it).

To keep the load factor under a certain limit, e.g., under 3/4, many table implementations expand the table when items are inserted. For example, in Java’s HashMap class the default load factor threshold for table expansion is 3/4, and in Python’s dict the table is resized when the load factor is greater than 2/3.

Since buckets are usually implemented on top of a dynamic array and any constant proportion for resizing greater than 1 will keep the load factor under the desired limit, the exact choice of the constant is determined by the same space-time tradeoff as for dynamic arrays.

Resizing is accompanied by a full or incremental table rehash whereby existing items are mapped to new bucket locations.

To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the table (followed by a rehash) when items are deleted. From the point of view of space-time tradeoffs, this operation is similar to the deallocation in dynamic arrays.
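The load-factor check described above can be sketched as follows. This is a minimal illustration rather than how any particular library does it: the class name `ChainedTable`, the starting capacity of 8, the doubling growth factor, and the use of Python's built-in `hash` are all assumptions made for the example. Only the 3/4 threshold is taken from the text.

```python
class ChainedTable:
    """Toy separate-chaining table that grows when the load factor exceeds MAX_LOAD."""

    MAX_LOAD = 0.75  # assumed threshold, mirroring the 3/4 figure quoted above

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0

    def _bucket(self, key, buckets=None):
        # Pick the bucket for a key in the given bucket array (default: current one).
        buckets = self.buckets if buckets is None else buckets
        return buckets[hash(key) % len(buckets)]

    def load_factor(self):
        return self.count / len(self.buckets)

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite: the entry count is unchanged
                return
        bucket.append((key, value))
        self.count += 1
        # Only an insert of a new key can raise the load factor, so check here.
        if self.load_factor() > self.MAX_LOAD:
            self._resize(2 * len(self.buckets))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def _resize(self, new_capacity):
        # Full rehash: every existing entry is mapped to a bucket in the new array.
        new_buckets = [[] for _ in range(new_capacity)]
        for bucket in self.buckets:
            for k, v in bucket:
                self._bucket(k, new_buckets).append((k, v))
        self.buckets = new_buckets
```

Doubling is only one possible growth factor; as noted above, any constant factor greater than 1 keeps the load factor bounded, and the choice trades memory against the frequency of rehashing.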
Resizing by copying all entries

A common approach is to automatically trigger a complete resizing when the load factor exceeds some threshold r_max. Then a new larger table is allocated, all the entries of the old table are removed and inserted into this new table, and the old table is returned to the free storage pool. Symmetrically, when the load factor falls below a second threshold r_min, all entries are moved to a new smaller table.

For hash tables that shrink and grow frequently, the resizing downward can be skipped entirely. In this case, the table size is proportional to the maximum number of entries that ever were in the hash table at one time, rather than the current number. The disadvantage is that memory usage will be higher, and thus cache behavior may be worse. For best control, a “shrink-to-fit” operation can be provided that does this only on request.

If the table size increases or decreases by a fixed percentage at each expansion, the total cost of these resizings, amortized over all insert and delete operations, is still a constant, independent of the number of entries n and of the number m of operations performed.

For example, consider a table that was created with the minimum possible size and is doubled each time the load ratio exceeds some threshold. If m elements are inserted into that table, the total number of extra re-insertions that occur in all dynamic resizings of the table is at most m − 1. In other words, dynamic resizing roughly doubles the cost of each insert or delete operation.

Incremental resizing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually:

• During the resize, allocate the new hash table, but keep the old table unchanged.

• In each lookup or delete operation, check both tables.

• Perform insertion operations only in the new table.

• At each insertion also move r elements from the old table to the new table.

• When all elements are removed from the old table, deallocate it.

To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.

Disk-based hash tables almost always use some scheme of incremental resizing, since the cost of rebuilding the entire table on disk would be too high.

Monotonic keys

If it is known that key values will always increase (or decrease) monotonically, then a variation of consistent hashing can be achieved by keeping a list of the single most recent key value at each hash table resize operation. Upon lookup, keys that fall in the ranges defined by these list entries are directed to the appropriate hash function (and indeed hash table), both of which can be different for each range. Since it is common to grow the overall number of entries by doubling, there will only be O(log(N)) ranges to check, and binary search time for the redirection would be O(log(log(N))). As with consistent hashing, this approach guarantees that any key’s hash, once issued, will never change, even when the hash table is later grown.

Other solutions

Linear hashing[21] is a hash table algorithm that permits incremental hash table expansion. It is implemented using a single hash table, but with two possible lookup functions.

Another way to decrease the cost of table resizing is to choose a hash function in such a way that the hashes of most values do not change when the table is resized. Such hash functions are prevalent in disk-based and distributed hash tables, where rehashing is prohibitively costly. The problem of designing a hash such that most values do not change when the table is resized is known as the distributed hash table problem. The four most popular approaches are rendezvous hashing, consistent hashing, the content addressable network algorithm, and Kademlia distance.

3.3.5 Performance analysis

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size k with open addressing has no collisions and holds up to k elements, with a single comparison for successful lookup, and a table of size k with chaining and n keys has the minimum max(0, n − k) collisions and O(1 + n/k) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(n) amortized comparisons per insertion and up to n comparisons for a successful lookup.

Adding rehashing to this model is straightforward. As in a dynamic array, geometric resizing by a factor of b implies that only n/bⁱ keys are inserted i or more times, so that the total number of insertions is bounded above by bn/(b − 1), which is O(n). By using rehashing to maintain n < k, tables using both chaining and open addressing can have unlimited elements and perform successful lookup in a single comparison for the best choice of hash function.

In more realistic models, the hash function is a random variable over a probability distribution of hash functions, and performance is computed on average over the choice of hash function. When this distribution is uniform, the assumption is called “simple uniform hashing” and it can be shown that hashing with chaining requires Θ(1 + n/k) comparisons on average for an unsuccessful lookup, and hashing with open addressing requires Θ(1/(1 − n/k)).[22] Both these bounds are constant, if we maintain n/k < c using table resizing, where c is a fixed constant less than 1.

3.3.6 Features

Advantages

The main advantage of hash tables over other table data structures is speed.
This advantage is more apparent when the number of entries is large. Hash tables are particularly efficient when the maximum number of entries can be predicted in advance, so that the bucket array can be allocated once with the optimum size and never resized.

If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect (see below). In this case the keys need not be stored in the table.

Drawbacks

Although operations on a hash table take constant time on average, the cost of a good hash function can be significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree. Thus hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)

For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if there are not too many possible keys to store (that is, if each key can be represented by a small enough number of bits), then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.

The entries stored in a hash table can be enumerated efficiently (at constant cost per entry), but only in some pseudo-random order. Therefore, there is no efficient way to locate an entry whose key is nearest to a given key. Listing all n entries in some specific order generally requires a separate sorting step, whose cost is proportional to log(n) per entry. In comparison, ordered search trees have lookup and insertion cost proportional to log(n), but allow finding the nearest key at about the same cost, and ordered enumeration of all entries at constant cost per entry.

If the keys are not stored (because the hash function is collision-free), there may be no easy way to enumerate the keys that are present in the table at any given moment.

Although the average cost per operation is constant and fairly small, the cost of a single operation may be quite high. In particular, if the hash table uses dynamic resizing, an insertion or deletion operation may occasionally take time proportional to the number of entries. This may be a serious drawback in real-time or interactive applications.

Hash tables in general exhibit poor locality of reference; that is, the data to be accessed is distributed seemingly at random in memory. Because hash tables cause access patterns that jump around, this can trigger microprocessor cache misses that cause long delays. Compact data structures such as arrays searched with linear search may be faster, if the table is relatively small and keys are compact. The optimal performance point varies from system to system.

Hash tables become quite inefficient when there are many collisions. While extremely uneven hash distributions are extremely unlikely to arise by chance, a malicious adversary with knowledge of the hash function may be able to supply information to a hash that creates worst-case behavior by causing excessive collisions, resulting in very poor performance, e.g., a denial of service attack.[23][24][25] In critical applications, a data structure with better worst-case guarantees can be used; however, universal hashing (a randomized algorithm that prevents the attacker from predicting which inputs cause worst-case behavior) may be preferable.[26] The hash function used by the hash table in the Linux routing table cache was changed with Linux version 2.4.2 as a countermeasure against such attacks.[27]

3.3.7 Uses

Associative arrays

Main article: associative array

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like Perl, Ruby, Python, and PHP.

When storing a new item into a multimap and a hash collision occurs, the multimap unconditionally stores both items.

When storing a new item into a typical associative array and a hash collision occurs, but the actual keys themselves are different, the associative array likewise stores both items. However, if the key of the new item exactly matches the key of an old item, the associative array typically erases the old item and overwrites it with the new item, so every item in the table has a unique key.

Database indexing

Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications. In multi-node database systems, hash tables are commonly used to distribute rows amongst nodes, reducing network traffic for hash joins.

Caches

Main article: cache (computing)

Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries, usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.

Sets

Besides recovering the entry that has a given key, many hash table implementations can also tell whether such an entry exists or not.

Those structures can therefore be used to implement a set data structure, which merely records whether a given key belongs to a specified set of keys. In this case, the structure can be simplified by eliminating all parts that have to do with the entry values. Hashing can be used to implement both static and dynamic sets.

Object representation

Several dynamic languages, such as Perl, Python, JavaScript, Lua, and Ruby, use hash tables to implement objects. In this representation, the keys are the names of the members and methods of the object, and the values are pointers to the corresponding member or method.

Unique data representation

Main article: String interning

Hash tables can be used by some programs to avoid creating multiple character strings with the same contents. For that purpose, all strings in use by the program are stored in a single string pool implemented as a hash table, which is checked whenever a new string has to be created. This technique was introduced in Lisp interpreters under the name hash consing, and can be used with many other kinds of data (expression trees in a symbolic algebra system, records in a database, files in a file system, binary decision diagrams, etc.).

Transposition table

Main article: Transposition table

3.3.8 Implementations

In programming languages

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules. In C++11, for example, the unordered_map class provides hash tables for keys and values of arbitrary type.

The Java programming language (including the variant which is used on Android) includes the HashSet, HashMap, LinkedHashSet, and LinkedHashMap generic collections.[28]

In PHP 5, the Zend 2 engine uses one of the hash functions from Daniel J. Bernstein to generate the hash values used in managing the mappings of data pointers stored in a hash table. In the PHP source code, it is labelled as DJBX33A (Daniel J. Bernstein, Times 33 with Addition).

Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are used internally to implement namespaces and therefore need to pay more attention to security, i.e., collision attacks. Python sets also use hashes internally, for fast lookup (though they store only keys, not values).[29]

In the .NET Framework, support for hash tables is provided via the non-generic Hashtable and generic Dictionary classes, which store key-value pairs, and the generic HashSet class, which stores only values.

In Rust's standard library, the generic HashMap and HashSet structs use linear probing with Robin Hood bucket stealing.

Independent packages

• SparseHash (formerly Google SparseHash): An extremely memory-efficient hash_map implementation, with only 2 bits/entry of overhead. The SparseHash library has several C++ hash map implementations with different performance characteristics, including one that optimizes for memory use and another that optimizes for speed.

• SunriseDD: An open source C library for hash table storage of arbitrary data objects with lock-free lookups, built-in reference counting and guaranteed order iteration. The library can participate in external reference counting systems or use its own built-in reference counting. It comes with a variety of hash functions and allows the use of runtime supplied hash functions via callback mechanism. Source code is well documented.

• uthash: This is an easy-to-use hash table for C structures.

3.3.9 History

The idea of hashing arose independently in different places. In January 1953, H. P. Luhn wrote an internal IBM memorandum that used hashing with chaining.[30] Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel implemented a program using hashing at about the same time. Open addressing with linear probing (relatively prime stepping) is credited to Amdahl, but Ershov (in Russia) had the same idea.[30]

3.3.10 See also

• Rabin–Karp string search algorithm

• Stable hashing

• Consistent hashing

• Extendible hashing

• Lazy deletion

• Pearson hashing

• PhotoDNA

• Search data structure

Related data structures

There are several data structures that use hash functions but cannot be considered special cases of hash tables:

• Bloom filter, memory efficient data-structure designed for constant-time approximate lookups; uses hash function(s) and can be seen as an approximate hash table.

• Distributed hash table (DHT), a resilient dynamic table spread over several nodes of a network.

• Hash array mapped trie, a trie structure, similar to the array mapped trie, but where each key is hashed first.

3.3.11 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009). Introduction to Algorithms (3rd ed.). Massachusetts Institute of Technology. pp. 253–280. ISBN 978-0-262-03384-8.

[2] Charles E. Leiserson, Amortized Algorithms, Table Doubling, Potential Method. Lecture 13, course MIT 6.046J/18.410J Introduction to Algorithms, Fall 2005.

[3] Knuth, Donald (1998). The Art of Computer Programming. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.

[4] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). “Chapter 11: Hash Tables”. Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 221–252. ISBN 978-0-262-53196-2.

[5] Pearson, Karl (1900). “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling”. Philosophical Magazine, Series 5. 50 (302). pp. 157–175. doi:10.1080/14786440009463897.

[6] Plackett, Robin (1983). “Karl Pearson and the Chi-Squared Test”. International Statistical Review (International Statistical Institute (ISI)). 51 (1). pp. 59–72. doi:10.2307/1402731.
[7] Wang, Thomas (March 1997). “Prime Double Hash Table”. Archived from the original on 1999-09-03. Retrieved 2015-05-10. [8] Askitis, Nikolas; Zobel, Justin (October 2005). Cacheconscious Collision Resolution in String Hash Tables. Proceedings of the 12th International Conference, String Processing and Information Retrieval (SPIRE 2005). 3772/2005. pp. 91–102. doi:10.1007/11575832_11. ISBN 978-3-540-29740-6. [9] Askitis, Nikolas; Sinha, Ranjan (2010). “Engineering scalable, cache and space efficient tries for strings”. The VLDB Journal. 17 (5): 633–660. doi:10.1007/s00778010-0183-9. ISSN 1066-8888. [10] Askitis, Nikolas (2009). Fast and Compact Hash Tables for Integer Keys (PDF). Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009). 91. pp. 113–122. ISBN 978-1-920682-72-9. 3.3. HASH TABLE 67 [11] Erik Demaine, Jeff Lind. 6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory. Spring 2003. http://courses.csail.mit.edu/6. 897/spring03/scribe_notes/L2/lecture2.pdf [27] Bar-Yosef, Noa; Wool, Avishai (2007). Remote algorithmic complexity attacks against randomized hash tables Proc. International Conference on Security and Cryptography (SECRYPT) (PDF). p. 124. [12] Willard, Dan E. (2000). “Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree”. SIAM Journal on Computing. 29 (3): 1030–1049. doi:10.1137/S0097539797322425. MR 1740562.. [28] https://docs.oracle.com/javase/tutorial/collections/ implementations/index.html [13] Tenenbaum, Aaron M.; Langsam, Yedidyah; Augenstein, Moshe J. (1990). Data Structures Using C. Prentice Hall. pp. 456–461, p. 472. ISBN 0-13-199746-7. [30] Mehta, Dinesh P.; Sahni, Sartaj. Handbook of Datastructures and Applications. p. 9-15. ISBN 1-58488-435-5. [14] Herlihy, Maurice; Shavit, Nir; Tzafrir, Moran (2008). “Hopscotch Hashing”. DISC '08: Proceedings of the 22nd international symposium on Distributed Computing. 
Berlin, Heidelberg: Springer-Verlag. pp. 350–364. 3.3.12 Further reading [15] Celis, Pedro (1986). Robin Hood hashing (PDF) (Technical report). Computer Science Department, University of Waterloo. CS-86-14. [16] Goossaert, Emmanuel (2013). “Robin Hood hashing”. [17] Amble, Ole; Knuth, Don (1974). “Ordered hash tables”. Computer Journal. 17 (2): 135. doi:10.1093/comjnl/17.2.135. [18] Viola, Alfredo (October 2005). “Exact distribution of individual displacements in linear probing hashing”. Transactions on Algorithms (TALG). ACM. 1 (2,): 214–242. doi:10.1145/1103963.1103965. [19] Celis, Pedro (March 1988). External Robin Hood Hashing (Technical report). Computer Science Department, Indiana University. TR246. [20] http://www.eecs.harvard.edu/~{}michaelm/postscripts/ handbook2001.pdf [21] Litwin, Witold (1980). “Linear hashing: A new tool for file and table addressing”. Proc. 6th Conference on Very Large Databases. pp. 212–223. [22] Doug Dunham. CS 4521 Lecture Notes. University of Minnesota Duluth. Theorems 11.2, 11.6. Last modified April 21, 2009. [23] Alexander Klink and Julian Wälde’s Efficient Denial of Service Attacks on Web Application Platforms, December 28, 2011, 28th Chaos Communication Congress. Berlin, Germany. [24] Mike Lennon “Hash Table Vulnerability Enables WideScale DDoS Attacks”. 2011. [25] “Hardening Perl’s Hash Function”. November 6, 2013. [26] Crosby and Wallach. Denial of Service via Algorithmic Complexity Attacks. quote: “modern universal hashing techniques can yield performance comparable to commonplace hash functions while being provably secure against these attacks.” “Universal hash functions ... are ... a solution suitable for adversarial environments. ... in production systems.” [29] https://stackoverflow.com/questions/513882/ python-list-vs-dict-for-look-up-table • Tamassia, Roberto; Goodrich, Michael T. (2006). “Chapter Nine: Maps and Dictionaries”. Data structures and algorithms in Java : [updated for Java 5.0] (4th ed.). 
Hoboken, NJ: Wiley. pp. 369–418. ISBN 0-471-73884-0.
• McKenzie, B. J.; Harries, R.; Bell, T. (Feb 1990). “Selecting a hashing algorithm”. Software Practice & Experience. 20 (2): 209–224. doi:10.1002/spe.4380200207.

3.3.13 External links
• A Hash Function for Hash Table Lookup by Bob Jenkins.
• Hash Tables by SparkNotes: explanation using C
• Hash functions by Paul Hsieh
• Design of Compact and Efficient Hash Tables for Java
• Libhashish hash library
• NIST entry on hash tables
• Open addressing hash table removal algorithm from ICI programming language, ici_set_unassign in set.c (and other occurrences, with permission).
• A basic explanation of how the hash table works by Reliable Software
• Lecture on Hash Tables
• Hash-tables in C: two simple and clear examples of hash tables implementation in C with linear probing and chaining, by Daniel Graziotin
• Open Data Structures – Chapter 5 – Hash Tables
• MIT’s Introduction to Algorithms: Hashing 1 MIT OCW lecture Video
• MIT’s Introduction to Algorithms: Hashing 2 MIT OCW lecture Video
• How to sort a HashMap (Java) and keep the duplicate entries
• How python dictionary works

3.4 Linear probing

[Figure: a hash table with cells indexed 0 to 999 holding the records Lisa Smith +1-555-8976, John Smith +1-555-1234 (cell 873), Sandra Dee +1-555-9655 (cell 874), and Sam Doe +1-555-5030.] The collision between John Smith and Sandra Dee (both hashing to cell 873) is resolved by placing Sandra Dee at the next free location, cell 874.

Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene Amdahl, Elaine M. McGraw, and Arthur Samuel and first analyzed in 1963 by Donald Knuth.

Along with quadratic probing and double hashing, linear probing is a form of open addressing. In these schemes, each cell of a hash table stores a single key–value pair. When the hash function causes a collision by mapping a new key to a cell of the hash table that is already occupied by another key, linear probing searches the table for the closest following free location and inserts the new key there. Lookups are performed in the same way, by searching the table sequentially starting at the position given by the hash function, until finding a cell with a matching key or an empty cell.

As Thorup & Zhang (2012) write, “Hash tables are the most commonly used nontrivial data structures, and the most popular implementation on standard hardware uses linear probing, which is both fast and simple.”[1] Linear probing can provide high performance because of its good locality of reference, but is more sensitive to the quality of its hash function than some other collision resolution schemes. It takes constant expected time per search, insertion, or deletion when implemented using a random hash function, a 5-independent hash function, or tabulation hashing. However, good results can be achieved in practice with other hash functions such as MurmurHash.[2]

3.4.1 Operations

Linear probing is a component of open addressing schemes for using a hash table to solve the dictionary problem. In the dictionary problem, a data structure should maintain a collection of key–value pairs subject to operations that insert or delete pairs from the collection or that search for the value associated with a given key. In open addressing solutions to this problem, the data structure is an array T (the hash table) whose cells T[i] (when nonempty) each store a single key–value pair. A hash function is used to map each key into the cell of T where that key should be stored, typically scrambling the keys so that keys with similar values are not placed near each other in the table. A hash collision occurs when the hash function maps a key into a cell that is already occupied by a different key. Linear probing is a strategy for resolving collisions, by placing the new key into the closest following empty cell.[3][4]

Search

To search for a given key x, the cells of T are examined, beginning with the cell at index h(x) (where h is the hash function) and continuing to the adjacent cells h(x) + 1, h(x) + 2, ..., until finding either an empty cell or a cell whose stored key is x. If a cell containing the key is found, the search returns the value from that cell. Otherwise, if an empty cell is found, the key cannot be in the table, because it would have been placed in that cell in preference to any later cell that has not yet been searched. In this case, the search returns as its result that the key is not present in the dictionary.[3][4]

Insertion

To insert a key–value pair (x,v) into the table (possibly replacing any existing pair with the same key), the insertion algorithm follows the same sequence of cells that would be followed for a search, until finding either an empty cell or a cell whose stored key is x. The new key–value pair is then placed into that cell.[3][4]

If the insertion would cause the load factor of the table (its fraction of occupied cells) to grow above some preset threshold, the whole table may be replaced by a new table, larger by a constant factor, with a new hash function, as in a dynamic array. Setting this threshold close to zero and using a high growth rate for the table size leads to faster hash table operations but greater memory usage than threshold values close to one and low growth rates. A common choice would be to double the table size when the load factor would exceed 1/2, causing the load factor to stay between 1/4 and 1/2.[5]
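The search and insertion procedures described above can be sketched in C as follows. This is a minimal illustration rather than a reference implementation: the names lp_cell, lp_hash, lp_search, lp_insert and the fixed capacity LP_CAPACITY are hypothetical, the hash function is a deliberately trivial placeholder, probing is assumed to wrap around at the end of the array, and table growth at a load-factor threshold is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define LP_CAPACITY 16u   /* hypothetical fixed table size; no resizing here */

typedef struct {
    int used;             /* 0 = empty cell, 1 = occupied cell */
    uint32_t key;
    int value;
} lp_cell;

/* Placeholder hash (assumption): a real table would scramble the key. */
static size_t lp_hash(uint32_t key) {
    return (size_t)(key % LP_CAPACITY);
}

/* Search: probe h(x), h(x)+1, ... until the key or an empty cell is found.
   Returns a pointer to the stored value, or NULL if the key is absent. */
int *lp_search(lp_cell *t, uint32_t key) {
    size_t i = lp_hash(key);
    for (size_t n = 0; n < LP_CAPACITY; n++) {
        if (!t[i].used) return NULL;            /* empty cell: key not present */
        if (t[i].key == key) return &t[i].value;
        i = (i + 1) % LP_CAPACITY;              /* next cell, wrapping around */
    }
    return NULL;                                /* scanned a completely full table */
}

/* Insert: follow the same probe sequence as a search and store the pair in
   the first empty cell or in the cell already holding this key.
   Returns 0 on success, -1 if the table is full. */
int lp_insert(lp_cell *t, uint32_t key, int value) {
    size_t i = lp_hash(key);
    for (size_t n = 0; n < LP_CAPACITY; n++) {
        if (!t[i].used || t[i].key == key) {
            t[i].used = 1;
            t[i].key = key;
            t[i].value = value;
            return 0;
        }
        i = (i + 1) % LP_CAPACITY;
    }
    return -1;
}
```

With this placeholder hash, the keys 1, 17, and 33 all map to cell 1, so inserting them in that order into a zero-initialized table places them in cells 1, 2, and 3, mirroring the collision in the figure.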
[Figure caption: When a key–value pair is deleted, it may be necessary to move another pair backwards into its cell, to prevent searches for the moved key from finding an empty cell.]

Deletion

It is also possible to remove a key–value pair from the dictionary. However, it is not sufficient to do so by simply emptying its cell. This would affect searches for other keys that have a hash value earlier than the emptied cell, but that are stored in a position later than the emptied cell. The emptied cell would cause those searches to incorrectly report that the key is not present.

Instead, when a cell i is emptied, it is necessary to search forward through the following cells of the table until finding either another empty cell or a key that can be moved to cell i (that is, a key whose hash value is equal to or earlier than i). When an empty cell is found, then emptying cell i is safe and the deletion process terminates. But, when the search finds a key that can be moved to cell i, it performs this move. This has the effect of speeding up later searches for the moved key, but it also empties out another cell, later in the same block of occupied cells. The search for a movable key continues for the new emptied cell, in the same way, until it terminates by reaching a cell that was already empty. In this process of moving keys to earlier cells, each key is examined only once. Therefore, the time to complete the whole process is proportional to the length of the block of occupied cells containing the deleted key, matching the running time of the other hash table operations.[3]

Alternatively, it is possible to use a lazy deletion strategy in which a key–value pair is removed by replacing the value by a special flag value indicating a deleted key. However, these flag values will contribute to the load factor of the hash table. With this strategy, it may become necessary to clean the flag values out of the array and rehash all the remaining key–value pairs once too large a fraction of the array becomes occupied by deleted keys.[3][4]

3.4.2 Properties

Linear probing provides good locality of reference, which causes it to require few uncached memory accesses per operation. Because of this, for low to moderate load factors, it can provide very high performance. However, compared to some other open addressing strategies, its performance degrades more quickly at high load factors because of primary clustering, a tendency for one collision to cause more nearby collisions.[3] Additionally, achieving good performance with this method requires a higher-quality hash function than for some other collision resolution schemes.[6] When used with low-quality hash functions that fail to eliminate nonuniformities in the input distribution, linear probing can be slower than other open-addressing strategies such as double hashing, which probes a sequence of cells whose separation is determined by a second hash function, or quadratic probing, where the size of each step varies depending on its position within the probe sequence.[7]

3.4.3 Analysis

Using linear probing, dictionary operations can be implemented in constant expected time. In other words, insert, remove and search operations can be implemented in O(1) expected time, as long as the load factor of the hash table is a constant strictly less than one.[8]

In more detail, the time for any particular operation (a search, insertion, or deletion) is proportional to the length of the contiguous block of occupied cells at which the operation starts. If all starting cells are equally likely, in a hash table with N cells, then a maximal block of k occupied cells will have probability k/N of containing the starting location of a search, and will take time O(k) whenever it is the starting location. Therefore, the expected time for an operation can be calculated as the product of these two terms, $O(k^2/N)$, summed over all of the maximal blocks of contiguous cells in the table. A similar sum of squared block lengths gives the expected time bound for a random hash function (rather than for a random starting location into a specific state of the hash table), by summing over all the blocks that could exist (rather than the ones that actually exist in a given state of the table), and multiplying the term for each potential block by the probability that the block is actually occupied. That is, defining Block(i,k) to be the event that there is a maximal contiguous block of occupied cells of length k beginning at index i, the expected time per operation is

$$E[T] = O(1) + \sum_{i=1}^{N} \sum_{k=1}^{n} O(k^2/N) \Pr[\mathrm{Block}(i,k)].$$

This formula can be simplified by replacing Block(i,k) by a simpler necessary condition Full(k), the event that at least k elements have hash values that lie within a block of cells of length k. After this replacement, the value within the sum no longer depends on i, and the 1/N factor cancels the N terms of the outer summation. These simplifications lead to the bound

$$E[T] \le O(1) + \sum_{k=1}^{n} O(k^2) \Pr[\mathrm{Full}(k)].$$
But by the multiplicative form of the Chernoff bound, when the load factor is bounded away from one, the probability that a block of length k contains at least k hashed values is exponentially small as a function of k, causing this sum to be bounded by a constant independent of n.[3] It is also possible to perform the same analysis using Stirling’s approximation instead of the Chernoff bound to estimate the probability that a block contains exactly k hashed values.[4][9]

In terms of the load factor α, the expected time for a successful search is $O(1 + 1/(1 - \alpha))$, and the expected time for an unsuccessful search (or the insertion of a new key) is $O(1 + 1/(1 - \alpha)^2)$.[10] For constant load factors, with high probability, the longest probe sequence (among the probe sequences for all keys stored in the table) has logarithmic length.[11]

3.4.4 Choice of hash function

Because linear probing is especially sensitive to unevenly distributed hash values,[7] it is important to combine it with a high-quality hash function that does not produce such irregularities.

The analysis above assumes that each key’s hash is a random number independent of the hashes of all the other keys. This assumption is unrealistic for most applications of hashing. However, random or pseudorandom hash values may be used when hashing objects by their identity rather than by their value. For instance, this is done using linear probing by the IdentityHashMap class of the Java collections framework.[12] The hash value that this class associates with each object, its identityHashCode, is guaranteed to remain fixed for the lifetime of an object but is otherwise arbitrary.[13] Because the identityHashCode is constructed only once per object, and is not required to be related to the object’s address or value, its construction may involve slower computations such as the call to a random or pseudorandom number generator. For instance, Java 8 uses an Xorshift pseudorandom number generator to construct these values.[14]

For most applications of hashing, it is necessary to compute the hash function for each value every time that it is hashed, rather than once when its object is created. In such applications, random or pseudorandom numbers cannot be used as hash values, because then different objects with the same value would have different hashes. And cryptographic hash functions (which are designed to be computationally indistinguishable from truly random functions) are usually too slow to be used in hash tables.[15] Instead, other methods for constructing hash functions have been devised. These methods compute the hash function quickly, and can be proven to work well with linear probing. In particular, linear probing has been analyzed from the framework of k-independent hashing, a class of hash functions that are initialized from a small random seed and that are equally likely to map any k-tuple of distinct keys to any k-tuple of indexes. The parameter k can be thought of as a measure of hash function quality: the larger k is, the more time it will take to compute the hash function but it will behave more similarly to completely random functions. For linear probing, 5-independence is enough to guarantee constant expected time per operation,[16] while some 4-independent hash functions perform badly, taking up to logarithmic time per operation.[6]

Another method of constructing hash functions with both high quality and practical speed is tabulation hashing. In this method, the hash value for a key is computed by using each byte of the key as an index into a table of random numbers (with a different table for each byte position). The numbers from those table cells are then combined by a bitwise exclusive or operation. Hash functions constructed this way are only 3-independent. Nevertheless, linear probing using these hash functions takes constant expected time per operation.[4][17] Both tabulation hashing and standard methods for generating 5-independent hash functions are limited to keys that have a fixed number of bits. To handle strings or other types of variable-length keys, it is possible to compose a simpler universal hashing technique that maps the keys to intermediate values and a higher quality (5-independent or tabulation) hash function that maps the intermediate values to hash table indices.[1][18]

In an experimental comparison, Richter et al. found that the Multiply-Shift family of hash functions (defined as $h_z(x) = (x \cdot z \bmod 2^w) \div 2^{w-d}$) was “the fastest hash function when integrated with all hashing schemes, i.e., producing the highest throughputs and also of good quality” whereas tabulation hashing produced “the lowest throughput”.[2] They point out that each table look-up requires several cycles, being more expensive than simple arithmetic operations. They also found MurmurHash to be superior to tabulation hashing: “By studying the results provided by Mult and Murmur, we think that the trade-off for by tabulation (...) is less attractive in practice”.

3.4.5 History

The idea of an associative array that allows data to be accessed by its value rather than by its address dates back to the mid-1940s in the work of Konrad Zuse and Vannevar Bush,[19] but hash tables were not described until 1953, in an IBM memorandum by Hans Peter Luhn. Luhn used a different collision resolution method, chaining, rather than linear probing.[20]

Knuth (1963) summarizes the early history of linear probing. It was the first open addressing method, and was originally synonymous with open addressing. According to Knuth, it was first used by Gene Amdahl, Elaine M.
McGraw (née Boehme), and Arthur Samuel in 1954, in an assembler program for the IBM 701 computer.[8] The first published description of linear probing is by Peterson (1957),[8] who also credits Samuel, Amdahl, and Boehme but adds that “the system is so natural, that it very likely may have been conceived independently by others either before or since that time”.[21] Another early publication of this method was by Soviet researcher Andrey Ershov, in 1958.[22]

The first theoretical analysis of linear probing, showing that it takes constant expected time per operation with random hash functions, was given by Knuth.[8] Sedgewick calls Knuth’s work “a landmark in the analysis of algorithms”.[10] Significant later developments include a more detailed analysis of the probability distribution of the running time,[23][24] and the proof that linear probing runs in constant time per operation with practically usable hash functions rather than with the idealized random functions assumed by earlier analysis.[16][17]

3.4.6 References

[1] Thorup, Mikkel; Zhang, Yin (2012), “Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation”, SIAM Journal on Computing, 41 (2): 293–331, doi:10.1137/100800774, MR 2914329.
[2] Richter, Stefan; Alvarez, Victor; Dittrich, Jens (2015), “A seven-dimensional analysis of hashing methods and its implications on query processing”, Proceedings of the VLDB Endowment, 9 (3): 293–331.
[3] Goodrich, Michael T.; Tamassia, Roberto (2015), “Section 6.3.3: Linear Probing”, Algorithm Design and Applications, Wiley, pp. 200–203.
[4] Morin, Pat (February 22, 2014), “Section 5.2: LinearHashTable: Linear Probing”, Open Data Structures (in pseudocode) (0.1Gβ ed.), pp. 108–116, retrieved 2016-01-15.
[5] Sedgewick, Robert; Wayne, Kevin (2011), Algorithms (4th ed.), Addison-Wesley Professional, p. 471, ISBN 9780321573513. Sedgewick and Wayne also halve the table size when a deletion would cause the load factor to become too low, causing them to use a wider range [1/8,1/2] in the possible values of the load factor.
[6] Pătraşcu, Mihai; Thorup, Mikkel (2010), “On the k-independence required by linear probing and minwise independence” (PDF), Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6–10, 2010, Proceedings, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60.
[7] Heileman, Gregory L.; Luo, Wenbin (2005), “How caching affects hashing” (PDF), Seventh Workshop on Algorithm Engineering and Experiments (ALENEX 2005), pp. 141–154.
[8] Knuth, Donald (1963), Notes on “Open” Addressing.
[9] Eppstein, David (October 13, 2011), “Linear probing made easy”, 0xDE.
[10] Sedgewick, Robert (2003), “Section 14.3: Linear Probing”, Algorithms in Java, Parts 1–4: Fundamentals, Data Structures, Sorting, Searching (3rd ed.), Addison Wesley, pp. 615–620, ISBN 9780321623973.
[11] Pittel, B. (1987), “Linear probing: the probable largest search time grows logarithmically with the number of records”, Journal of Algorithms, 8 (2): 236–249, doi:10.1016/0196-6774(87)90040-X, MR 890874.
[12] “IdentityHashMap”, Java SE 7 Documentation, Oracle, retrieved 2016-01-15.
[13] Friesen, Jeff (2012), Beginning Java 7, Expert’s voice in Java, Apress, p. 376, ISBN 9781430239109.
[14] Kabutz, Heinz M. (September 9, 2014), “Identity Crisis”, The Java Specialists’ Newsletter, 222.
[15] Weiss, Mark Allen (2014), “Chapter 3: Data Structures”, in Gonzalez, Teofilo; Diaz-Herrera, Jorge; Tucker, Allen, Computing Handbook, 1 (3rd ed.), CRC Press, p. 3-11, ISBN 9781439898536.
[16] Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), “Linear probing with constant independence”, SIAM Journal on Computing, 39 (3): 1107–1120, doi:10.1137/070702278, MR 2538852.
[17] Pătraşcu, Mihai; Thorup, Mikkel (2011), “The power of simple tabulation hashing”, Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11), pp. 1–10, arXiv:1011.5200, doi:10.1145/1993636.1993638.
[18] Thorup, Mikkel (2009), “String hashing for linear probing”, Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA: SIAM, pp. 655–664, doi:10.1137/1.9781611973068.72, MR 2809270.
[19] Parhami, Behrooz (2006), Introduction to Parallel Processing: Algorithms and Architectures, Series in Computer Science, Springer, 4.1 Development of early models, p. 67, ISBN 9780306469640.
[20] Morin, Pat (2004), “Hash tables”, in Mehta, Dinesh P.; Sahni, Sartaj, Handbook of Data Structures and Applications, Chapman & Hall / CRC, p. 9-15, ISBN 9781420035179.
[21] Peterson, W. W. (April 1957), “Addressing for random-access storage”, IBM Journal of Research and Development, Riverton, NJ, USA: IBM Corp., 1 (2): 130–146, doi:10.1147/rd.12.0130.
[22] Ershov, A. P. (1958), “On Programming of Arithmetic Operations”, Communications of the ACM, 1 (8): 3–6, doi:10.1145/368892.368907. Translated from Doklady AN USSR 118 (3): 427–430, 1958, by Morris D. Friedman. Linear probing is described as algorithm A2.
[23] Flajolet, P.; Poblete, P.; Viola, A. (1998), “On the analysis of linear probing hashing”, Algorithmica, 22 (4): 490–515, doi:10.1007/PL00009236, MR 1701625.
[24] Knuth, D. E. (1998), “Linear probing and graphs”, Algorithmica, 22 (4): 561–568, doi:10.1007/PL00009240, MR 1701629.

3.5 Quadratic probing

Quadratic probing is an open addressing scheme in computer programming for resolving collisions in hash tables, that is, for handling an incoming item whose hash value indicates it should be stored in an already-occupied slot or bucket. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial until an open slot is found.

For a given hash value, the indices generated by linear probing are as follows:

H + 1, H + 2, H + 3, H + 4, ..., H + k

This method results in primary clustering, and as the cluster grows larger, the search for those items hashing within the cluster becomes less efficient.

An example sequence using quadratic probing is:

$H + 1^2, H + 2^2, H + 3^2, H + 4^2, ..., H + k^2$

Quadratic probing can be a more efficient algorithm in a closed hash table, since it better avoids the clustering problem that can occur with linear probing, although it is not immune. It also provides good memory caching because it preserves some locality of reference; however, linear probing has greater locality and, thus, better cache performance.

Quadratic probing is used in the Berkeley Fast File System to allocate free blocks. The allocation routine chooses a new cylinder-group when the current is nearly full using quadratic probing, because of the speed it shows in finding unused cylinder-groups.

3.5.1 Quadratic function

Let h(k) be a hash function that maps an element k to an integer in [0,m-1], where m is the size of the table. Let the ith probe position for a value k be given by the function

$h(k, i) = (h(k) + c_1 i + c_2 i^2) \pmod{m}$

where c2 ≠ 0. If c2 = 0, then h(k,i) degrades to a linear probe. For a given hash table, the values of c1 and c2 remain constant.

Examples:
• If $h(k, i) = (h(k) + i + i^2) \pmod{m}$, then the probe sequence will be h(k), h(k) + 2, h(k) + 6, ...
• For $m = 2^n$, a good choice for the constants is c1 = c2 = 1/2, as the values of h(k,i) for i in [0,m-1] are all distinct. This leads to a probe sequence of h(k), h(k) + 1, h(k) + 3, h(k) + 6, ... where the values increase by 1, 2, 3, ...
• For prime m > 2, most choices of c1 and c2 will make h(k,i) distinct for i in [0, (m-1)/2]. Such choices include c1 = c2 = 1/2, c1 = c2 = 1, and c1 = 0, c2 = 1. Because there are only about m/2 distinct probes for a given element, it is difficult to guarantee that insertions will succeed when the load factor is > 1/2.

3.5.2 Quadratic probing insertion

The problem, here, is to insert a key at an available key space in a given Hash Table using quadratic probing.[1]

Algorithm to insert key in hash table

1. Get the key k
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If hashtable[h[k]] is empty
   (4.1) Insert key k at hashtable[h[k]]
   (4.2) Stop
   Else
   (4.3) The key space at hashtable[h[k]] is occupied, so we need to find the next available key space
   (4.4) Increment j
   (4.5) Compute new hash function h[k] = ( k + j * j ) % SIZE
   (4.6) Repeat Step 4 till j is equal to the SIZE of hash table
5. The hash table is full
6. Stop

C function for key insertion

    /* SIZE is the number of buckets; the value here is only an example. */
    #define SIZE 11

    int quadratic_probing_insert(int *hashtable, int key, int *empty)
    {
        /* hashtable[] is an integer hash table; empty[] is another array
           which indicates whether the key space is occupied. If an empty
           key space is found, the function returns the index of the bucket
           where the key is inserted; otherwise it returns -1.
           Keys are assumed to be non-negative. */
        int i, index;
        for (i = 0; i < SIZE; i++) {
            index = (key + i * i) % SIZE;
            if (empty[index]) {
                hashtable[index] = key;
                empty[index] = 0;
                return index;
            }
        }
        return -1;
    }

3.5.3 Quadratic probing search

Algorithm to search element in hash table

1. Get the key k to be searched
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If the key space at hashtable[h[k]] is occupied
   (4.1) Compare the element at hashtable[h[k]] with the key k.
   (4.2) If they are equal
      (4.2.1) The key is found at the bucket h[k]
      (4.2.2) Stop
   Else
   (4.3) The element might be placed at the next location given by the quadratic function
   (4.4) Increment j
   (4.5) Set h[k] = ( k + (j * j) ) % SIZE, so that we can probe the bucket at a new slot, h[k].
   (4.6) Repeat Step 4 till j is greater than SIZE of hash table
5. The key was not found in the hash table
6. Stop

C function for key searching

    int quadratic_probing_search(int *hashtable, int key, int *empty)
    {
        /* If the key is found in the hash table, the function returns the
           index of the hashtable where the key is inserted; otherwise it
           returns -1. */
        int i, index;
        for (i = 0; i < SIZE; i++) {
            index = (key + i * i) % SIZE;
            if (!empty[index] && hashtable[index] == key)
                return index;
        }
        return -1;
    }

3.5.4 Limitations

For linear probing it is a bad idea to let the hash table get nearly full, because performance is degraded as the hash table gets filled. In the case of quadratic probing, the situation is even more drastic. With the exception of the triangular number case for a power-of-two-sized hash table, there is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime. This is because at most half of the table can be used as alternative locations to resolve collisions.[2]

If the hash table size is b (a prime greater than 3), it can be proven that the first b/2 alternative locations including the initial location h(k) are all distinct. Suppose two of the alternative locations are given by $h(k) + x^2 \pmod{b}$ and $h(k) + y^2 \pmod{b}$, where 0 ≤ x, y ≤ (b / 2). If these two locations point to the same key space but x ≠ y, then the following would have to be true:

$h(k) + x^2 = h(k) + y^2 \pmod{b}$
$x^2 = y^2 \pmod{b}$
$x^2 - y^2 = 0 \pmod{b}$
$(x - y)(x + y) = 0 \pmod{b}$

As b (table size) is a prime greater than 3, either (x - y) or (x + y) has to be congruent to zero modulo b. Since x and y are distinct and both lie between 0 and b / 2, (x - y) is nonzero and smaller than b in absolute value, so it cannot be a multiple of b. Also, since 0 ≤ x, y ≤ (b / 2) and b is odd, (x + y) lies strictly between 0 and b, so it cannot be a multiple of b either.

Thus, by contradiction, it can be said that the first (b / 2) alternative locations after h(k) are unique. So an empty key space can always be found as long as at most (b / 2) locations are filled, i.e., the hash table is not more than half full.

Alternating sign

If the sign of the offset is alternated (e.g. +1, −4, +9, −16 etc.), and if the number of buckets is a prime number p congruent to 3 modulo 4 (i.e. one of 3, 7, 11, 19, 23, 31 and so on), then the first p offsets will be unique modulo p.

In other words, a permutation of 0 through p-1 is obtained, and, consequently, a free bucket will always be found as long as there exists at least one.

The insertion algorithm only receives a minor modification (but do note that SIZE has to be a suitable prime number as explained above):

1. Get the key k
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If hashtable[h[k]] is empty
   (4.1) Insert key k at hashtable[h[k]]
   (4.2) Stop
   Else
   (4.3) The key space at hashtable[h[k]] is occupied, so we need to find the next available key space
   (4.4) Increment j
   (4.5) Compute new hash function h[k]. If j is odd, then h[k] = ( k + j * j ) % SIZE, else h[k] = ( k - j * j ) % SIZE
   (4.6) Repeat Step 4 till j is equal to the SIZE of hash table
5. The hash table is full
6. Stop

The search algorithm is modified likewise.

3.5.5 See also
• Hash tables
• Hash collision
• Double hashing
• Linear probing
• Hash function

3.5.6 References

[1] Horowitz, Sahni, Anderson-Freed (2011). Fundamentals of Data Structures in C. University Press. ISBN 978-81-7371-605-8.
[2] Weiss, Mark Allen (2009). Data Structures and Algorithm Analysis in C++. Pearson Education. ISBN 978-81-317-1474-4.

3.5.7 External links
• Tutorial/quadratic probing

3.6 Double hashing

Double hashing is a computer programming technique used in hash tables to resolve hash collisions, in cases when two different values to be searched for produce the same hash key. It is a popular collision-resolution technique in open-addressed hash tables. Double hashing is implemented in many popular libraries.

Like linear probing, it uses one hash value as a starting point and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is decided using a second, independent hash function (hence the name double hashing). Unlike linear probing and quadratic probing, the interval depends on the data, so that even values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering.

Given two randomly, uniformly, and independently selected hash functions h1 and h2, the ith location in the bucket sequence for value k in a hash table T is: $h(i, k) = (h_1(k) + i \cdot h_2(k)) \bmod |T|$. Generally, h1 and h2 are selected from a set of universal hash functions.

3.6.1 Classical applied data structure

Double hashing with open addressing is a classical data structure on a table T. Let n be the number of elements stored in T; then T's load factor is $\alpha = n/|T|$.

Double hashing approximates uniform open address hashing. That is, start by randomly, uniformly and independently selecting two universal hash functions h1 and h2 to build a double hashing table T.

All elements are put in T by double hashing using h1 and h2. Given a key k, the (i + 1)-st hash location is computed by:

$h(i, k) = (h_1(k) + i \cdot h_2(k)) \bmod |T|.$

Let T have fixed load factor α : 1 > α > 0. Bradford and Katehakis[1] showed the expected number of probes for an unsuccessful search in T, still using these initially chosen hash functions, is $1/(1-\alpha)$ regardless of the distribution of the inputs. More precisely, these two uniformly, randomly and independently chosen hash functions are chosen from a set of universal hash functions where pairwise independence suffices.

Previous results include: Guibas and Szemerédi[2] showed $1/(1-\alpha)$ holds for unsuccessful search for load factors α < 0.319. Also, Lueker and Molodowitch[3] showed this held assuming ideal randomized functions. Schmidt and Siegel[4] showed this with k-wise independent and uniform functions (for k = c log n, and suitable constant c).

3.6.2 Implementation details for caching

Linear probing and, to a lesser extent, quadratic probing are able to take advantage of the data cache by accessing locations that are close together. Double hashing has, on average, larger intervals and is not able to achieve this advantage.

Like all other forms of open addressing, double hashing degrades toward linear-time operations as the hash table approaches maximum capacity. The only solution to this is to rehash to a larger size, as with all other open addressing schemes.

On top of that, it is possible for the secondary hash function to evaluate to zero. For example, if we choose k=5 with the following function:

h2(k) = 5 − (k mod 7)

The resulting sequence will always remain at the initial hash value. One possible solution is to change the secondary hash function to:

h2(k) = (k mod 7) + 1

This ensures that the secondary hash function will always be non zero.

3.6.3 See also
• Collision resolution in hash tables
• Hash function
• Linear probing
• Cuckoo hashing

3.6.4 Notes

[1] Bradford, Phillip G.; Katehakis, Michael N. (2007), “A probabilistic study on combinatorial expanders and hashing” (PDF), SIAM Journal on Computing, 37 (1): 83–111, doi:10.1137/S009753970444630X, MR 2306284.
[2] L. Guibas and E. Szemerédi: The Analysis of Double Hashing, Journal of Computer and System Sciences, 1978, 16, 226-274.
[3] G. S. Lueker and M. Molodowitch: More Analysis of Double Hashing, Combinatorica, 1993, 13(1), 83-96.
[4] J. P. Schmidt and A. Siegel: Double Hashing is Computable and Randomizable with Universal Hash Functions, manuscript.

3.6.5 External links
• How Caching Affects Hashing by Gregory L. Heileman and Wenbin Luo 2005.
• Hash Table Animation
• klib a C library that includes double hashing functionality.

3.7 Cuckoo hashing

[Figure caption: Cuckoo hashing example. The arrows show the alternative location of each key. A new item would be inserted in the location of A by moving A to its alternative location, currently occupied by B, and moving B to its alternative location which is currently vacant. Insertion of a new item in the location of H would not succeed: Since H is part of a cycle (together with W), the new item would get kicked out again.]

Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

3.7.1 History

Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche Rodler in 2001.[1]

3.7.2 Operation

Cuckoo hashing is a form of open addressing in which each non-empty cell of a hash table contains a key or key–value pair. A hash function is used to determine the location for each key, and its presence in the table (or the value associated with it) can be found by examining that cell of the table. However, open addressing suffers from collisions, which happen when more than one key is mapped to the same cell. The basic idea of cuckoo hashing is to resolve collisions by using two hash functions instead of only one. This provides two possible locations in the hash table for each key. In one of the commonly used variants of the algorithm, the hash table is split into two smaller tables of equal size, and each hash function provides an index into one of these two tables. It is also possible for both hash functions to provide indexes into a single table.

Lookup requires inspection of just two locations in the hash table, which takes constant time in the worst case (see Big O notation). This is in contrast to many other hash table algorithms, which may not have a constant worst-case bound on the time to do a lookup. Deletions, also, may be performed by blanking the cell containing a key, in constant worst case time, more simply than some other schemes such as linear probing.

When a new key is inserted, and one of its two cells is empty, it may be placed in that cell. However, when both cells are already full, it will be necessary to move other keys to their second locations (or back to their first locations) to make room for the new key. A greedy algorithm is used: The new key is inserted in one of its two possible locations, “kicking out”, that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there. The process continues in the same way until an empty position is found, completing the algorithm. However, it is possible for this insertion process to fail, by entering an infinite loop or by finding a very long chain (longer than a preset threshold that is logarithmic in the table size). In this case, the hash table is rebuilt in-place using new hash functions:

“There is no need to allocate new tables for the rehashing: We may simply run through the tables to delete and perform the usual insertion procedure on all keys found not to be at their intended position in the table.”
Pagh & Rodler, “Cuckoo Hashing”[1]

3.7.3 Theory

Insertions succeed in expected constant time,[1] even considering the possibility of having to rebuild the table, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%.

One method of proving this uses the theory of random graphs: one may form an undirected graph called the “cuckoo graph” that has a vertex for each hash table location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the cuckoo graph for this set of values is a pseudoforest, a graph with at most one cycle in each of its connected components. Any vertex-induced subgraph with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the hash table. When the hash function is chosen randomly, the cuckoo graph is a random graph in the Erdős–Rényi model. With high probability, for a random graph in which the ratio of the number of edges to the number of vertices is bounded below 1/2, the graph is a pseudoforest and the cuckoo hashing algorithm succeeds in placing all keys. Moreover, the same theory also proves that the expected size of a connected component of the cuckoo graph is small, ensuring that each insertion takes constant expected time.[2]

arbitrarily large by increasing the stash size. However, larger stashes also mean slower searches for keys that are not present or are in the stash.
A stash can be used in combination with more than two hash functions or with blocked cuckoo hashing to achieve both high load factors and small failure rates.[5] The analysis of cuckoo hashing with a stash extends to practical hash functions, not just to the random hash function model commonly used in theoretical analysis of hashing.[6] Some people recommend a simplified generalization of cuckoo hashing called skewed-associative cache in some CPU caches.[7] 3.7.6 Comparison with related structures 3.7.4 Example Other algorithms that use multiple hash functions include the Bloom filter, a memory-efficient data structure for inThe following hash functions are given: exact sets. An alternative data structure for the same in⌊k⌋ ′ exact set problem, based on cuckoo hashing and called the h (k) = k mod 11 h (k) = 11 mod 11 cuckoo filter, uses even less memory and (unlike classical Columns in the following two tables show the state of the Bloom flters) allows element deletions as well as inserhash tables over time as the elements are inserted. tions and membership tests; however, its theoretical analysis is much less developed than the analysis of Bloom filters.[8] Cycle A study by Zukowski et al.[9] has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross[10] has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high. The performance of the bucketized cuckoo hash table was inves3.7.5 Variations tigated further by Askitis,[11] with its performance comSeveral variations of cuckoo hashing have been studied, pared against alternative hashing schemes. primarily with the aim of improving its space usage by A survey by Mitzenmacher[3] presents open problems reincreasing the load factor that it can tolerate to a num- lated to cuckoo hashing as of 2009. 
ber greater than the 50% threshold of the basic algorithm. Some of these methods can also be used to reduce the failure rate of cuckoo hashing, causing rebuilds of the data 3.7.7 See also structure to be much less frequent. • Perfect hashing Generalizations of cuckoo hashing that use more than two If you now wish to insert the element 6, then you get into a cycle. In the last row of the table we find the same initial situation as at the beginning again. ⌊6⌋ h (6) = 6 mod 11 = 6 h′ (6) = 11 mod 11 = 0 alternative hash functions can be expected to utilize a larger part of the capacity of the hash table efficiently while sacrificing some lookup and insertion speed. Using just three hash functions increases the load to 91%.[3] Another generalization of cuckoo hashing, called blocked cuckoo hashing consists in using more than one key per bucket. Using just 2 keys per bucket permits a load factor above 80%.[4] • Linear probing • Double hashing • Hash collision • Hash function • Quadratic probing Another variation of cuckoo hashing that has been stud• Hopscotch hashing ied is cuckoo hashing with a stash. The stash, in this data structure, is an array of a constant number of keys, used to store keys that cannot successfully be inserted into the 3.7.8 References main hash table of the structure. This modification reduces the failure rate of cuckoo hashing to an inverse- [1] Pagh, Rasmus; Rodler, Flemming Friche (2001). “Cuckoo Hashing”. Algorithms — ESA 2001. Lecpolynomial function with an exponent that can be made 3.8. HOPSCOTCH HASHING ture Notes in Computer Science. 2161. pp. 121– 133. doi:10.1007/3-540-44676-1_10. ISBN 978-3-54042493-2. [2] Kutzelnigg, Reinhard (2006). Bipartite random graphs and cuckoo hashing. Fourth Colloquium on Mathematics and Computer Science. Discrete Mathematics and Theoretical Computer Science. pp. 403–406 [3] Mitzenmacher, Michael (2009-09-09). “Some Open Questions Related to Cuckoo Hashing | Proceedings of ESA 2009” (PDF). 
Retrieved 2010-11-10.
[4] Dietzfelbinger, Martin; Weidling, Christoph (2007), “Balanced allocation and dictionaries with tightly packed constant size bins”, Theoret. Comput. Sci., 380 (1-2): 47–68, doi:10.1016/j.tcs.2007.02.054, MR 2330641.
[5] Kirsch, Adam; Mitzenmacher, Michael D.; Wieder, Udi (2010), “More robust hashing: cuckoo hashing with a stash”, SIAM J. Comput., 39 (4): 1543–1561, doi:10.1137/080728743, MR 2580539.
[6] Aumüller, Martin; Dietzfelbinger, Martin; Woelfel, Philipp (2014), “Explicit and efficient hash families suffice for cuckoo hashing with a stash”, Algorithmica, 70 (3): 428–456, doi:10.1007/s00453-013-9840-x, MR 3247374.
[7] “Micro-Architecture”.
[8] Fan, Bin; Kaminsky, Michael; Andersen, David (August 2013). “Cuckoo Filter: Better Than Bloom” (PDF). ;login:. USENIX. 38 (4): 36–40. Retrieved 12 June 2014.
[9] Zukowski, Marcin; Heman, Sandor; Boncz, Peter (June 2006). “Architecture-Conscious Hashing” (PDF). Proceedings of the International Workshop on Data Management on New Hardware (DaMoN). Retrieved 2008-10-16.
[10] Ross, Kenneth (2006-11-08). “Efficient Hash Probes on Modern Processors” (PDF). IBM Research Report RC24100. Retrieved 2008-10-16.
[11] Askitis, Nikolas (2009). Fast and Compact Hash Tables for Integer Keys (PDF). Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009). 91. pp. 113–122. ISBN 978-1-920682-72-9.

3.7.9 External links

• A cool and practical alternative to traditional hash tables, U. Erlingsson, M. Manasse, F. Mcsherry, 2006.
• Cuckoo Hashing for Undergraduates, R. Pagh, 2006.
• Cuckoo Hashing, Theory and Practice (Part 1, Part 2 and Part 3), Michael Mitzenmacher, 2007.
• Naor, Moni; Segev, Gil; Wieder, Udi (2008). “History-Independent Cuckoo Hashing”. International Colloquium on Automata, Languages and Programming (ICALP). Reykjavik, Iceland. Retrieved 2008-07-21.
• Algorithmic Improvements for Fast Concurrent Cuckoo Hashing, X. Li, D. Andersen, M. Kaminsky, M. Freedman. EuroSys 2014.

Examples

• Concurrent high-performance Cuckoo hashtable written in C++
• Cuckoo hash map written in C++
• Static cuckoo hashtable generator for C/C++
• Cuckoo hashtable written in Java
• Generic Cuckoo hashmap in Java
• Cuckoo hash table written in Haskell
• Cuckoo hashing for Go

3.8 Hopscotch hashing

Hopscotch hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table using open addressing. It is also well suited for implementing a concurrent hash table. Hopscotch hashing was introduced by Maurice Herlihy, Nir Shavit and Moran Tzafrir in 2008.[1] The name is derived from the sequence of hops that characterize the table’s insertion algorithm.

Figure: Hopscotch hashing. Here, H is 4. Gray entries are occupied. In part (a), the item x is added with a hash value of 6. A linear probe finds that entry 13 is empty. Because 13 is more than 4 entries away from 6, the algorithm looks for an earlier entry to swap with 13. The first place to look in is H-1 = 3 entries before, at entry 10. That entry’s hop information bit-map indicates that d, the item at entry 11, can be displaced to 13. After displacing d, entry 11 is still too far from entry 6, so the algorithm examines entry 8. The hop information bit-map indicates that item c at entry 9 can be moved to entry 11. Finally, a is moved to entry 9. Part (b) shows the table state just before adding x.

The algorithm uses a single array of n buckets. For each bucket, its neighborhood is a small collection of nearby consecutive buckets (i.e. ones with close indices to the original hashed bucket). The desired property of the neighborhood is that the cost of finding an item in the buckets of the neighborhood is close to the cost of finding it in the bucket itself (for example, by having buckets in the neighborhood fall within the same cache line).
The size of the neighborhood must be sufficient to accommodate a logarithmic number of items in the worst case (i.e. it must accommodate log(n) items), but only a constant number on average. If some bucket’s neighborhood is filled, the table is resized.

In hopscotch hashing, as in cuckoo hashing, and unlike in linear probing, a given item will always be inserted into and found in the neighborhood of its hashed bucket. In other words, it will always be found either in its original hashed array entry, or in one of the next H-1 neighboring entries. H could, for example, be 32, a common machine word size. The neighborhood is thus a “virtual” bucket that has fixed size and overlaps with the next H-1 buckets. To speed the search, each bucket (array entry) includes a “hop-information” word, an H-bit bitmap that indicates which of the next H-1 entries contain items that hashed to the current entry’s virtual bucket. In this way, an item can be found quickly by looking at the word to see which entries belong to the bucket, and then scanning through the constant number of entries (most modern processors support special bit manipulation operations that make the lookup in the “hop-information” bitmap very fast).

Here is how to add item x which was hashed to bucket i:

1. If the entry i is empty, add x to i and return.
2. Starting at entry i, use a linear probe to find an empty entry at index j.
3. If the empty entry’s index j is within H-1 of entry i, place x there and return. Otherwise, entry j is too far from i. To create an empty entry closer to i, find an item y whose hash value lies between i and j, but within H-1 of j. Displacing y to j creates a new empty slot closer to i. Repeat until the empty entry is within H-1 of entry i, place x there and return. If no such item y exists, or if the bucket i already contains H items, resize and rehash the table.

The idea is that hopscotch hashing “moves the empty slot towards the desired bucket”. This distinguishes it from linear probing which leaves the empty slot where it was found, possibly far away from the original bucket, or from cuckoo hashing that, in order to create a free bucket, moves an item out of one of the desired buckets in the target arrays, and only then tries to find the displaced item a new place.

To remove an item from the table, one simply removes it from the table entry. If the neighborhood buckets are cache aligned, then one could apply a reorganization operation in which items are moved into the now vacant location in order to improve alignment.

One advantage of hopscotch hashing is that it provides good performance at very high table load factors, even ones exceeding 0.9. Part of this efficiency is due to using a linear probe only to find an empty slot during insertion, not for every lookup as in the original linear probing hash table algorithm. Another advantage is that one can use any hash function, in particular simple ones that are close-to-universal.

3.8.1 See also

• Cuckoo hashing
• Hash collision
• Hash function
• Linear probing
• Open addressing
• Perfect hashing
• Quadratic probing

3.8.2 References

[1] Herlihy, Maurice and Shavit, Nir and Tzafrir, Moran (2008). “Hopscotch Hashing” (PDF). DISC '08: Proceedings of the 22nd international symposium on Distributed Computing. Arcachon, France: Springer-Verlag. pp. 350–364.

3.8.3 External links

• libhhash - a C hopscotch hashing implementation
• hopscotch-map - a C++ implementation of a hash map using hopscotch hashing

3.9 Hash function

This article is about a programming concept. For other meanings of “hash” and “hashing”, see Hash (disambiguation).

A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes. One use is a data structure called a hash table, widely used in computer software for rapid data lookup.
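As a rough illustration of this definition, the sketch below maps strings of any length to one of 16 table indices. The FNV-1a mixing step and the names `fnv1a_32` and `bucket_index` are choices made for this example only; the text does not prescribe a particular function, and any function with a fixed-size output would serve.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Map a byte string of arbitrary length to a 32-bit hash value (FNV-1a)."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                         # mix in the next input byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # multiply, keeping only 32 bits
    return h


def bucket_index(key: str, n_buckets: int = 16) -> int:
    """Reduce the 32-bit hash value to a table index in 0 .. n_buckets - 1."""
    return fnv1a_32(key.encode("utf-8")) % n_buckets


if __name__ == "__main__":
    for name in ["John Smith", "Lisa Smith", "Sam Doe", "Sandra Dee"]:
        print(name, "->", bucket_index(name))
```

With only 16 possible indices, two different names may well land on the same index; how such collisions are handled is the job of the hash table, not of the hash function.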
Hash functions accelerate table or database lookup by detecting duplicated records in a large file. An example is finding similar stretches in DNA sequences. They are also useful in cryptography. A cryptographic hash function allows one to easily verify that some input data maps to a given hash value, but if the input data is unknown, it is deliberately difficult to reconstruct it (or equivalent alternatives) by knowing the stored hash value. This is used for assuring integrity of transmitted data, and is the building block for HMACs, which provide message authentication.

Figure: A hash function that maps names to integers from 0 to 15. There is a collision between keys “John Smith” and “Sandra Dee”.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization functions, error-correcting codes, and ciphers. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently. The Hash Keeper database maintained by the American National Drug Intelligence Center, for instance, is more aptly described as a catalogue of file fingerprints than of hash values.

3.9.1 Uses

Hash tables

Hash functions are primarily used in hash tables,[1] to quickly locate a data record (e.g., a dictionary definition) given its search key (the headword). Specifically, the hash function is used to map the search key to an index; the index gives the place in the hash table where the corresponding record should be stored. Hash tables, in turn, are used to implement associative arrays and dynamic sets.

Typically, the domain of a hash function (the set of possible keys) is larger than its range (the number of different table indices), and so it will map several different keys to the same index. Therefore, each slot of a hash table is associated with (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices.

Thus, the hash function only hints at the record’s location: it tells where one should start looking for it. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.

Caches

Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two colliding items. This is also used in file comparison.

Bloom filters

Main article: Bloom filter

Hash functions are an essential ingredient of the Bloom filter, a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.

Finding duplicate records

Main article: Hash table

When storing records in a large unsorted file, one may use a hash function to map each record to an index into a table T, and to collect in each bucket T[i] a list of the numbers of all records with the same hash value i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket T[i] which contains two or more members, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs).

Protecting data

Main article: Security of cryptographic hash functions

A hash value can be used to uniquely identify secret information. This requires that the hash function is collision-resistant, which means that it is very hard to find data that will generate the same hash value. These functions are categorized into cryptographic hash functions and provably secure hash functions. Functions in the second category are the most secure but also too slow for most practical purposes. Collision resistance is accomplished in part by generating very large hash values. For example, SHA-1, one of the most widely used cryptographic hash functions, generates 160-bit values.

Finding similar records

Main article: Locality sensitive hashing

Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in nearby buckets. Then one need only check the records in each bucket T[i] against those in buckets T[i+k] where k ranges between −m and m.

This class includes the so-called acoustic fingerprint algorithms, that are used to locate similar-sounding entries in large collections of audio files. For this application, the hash function must be as insensitive as possible to data capture or transmission errors, and to trivial changes such as timing and volume changes, compression, etc.[2]

Finding similar substrings

The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a document repository or a genomic database. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above.

The Rabin–Karp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.

Geometric hashing

This principle is widely used in computer graphics, computational geometry and many other disciplines, to solve many proximity problems in the plane or in three-dimensional space, such as finding closest pairs in a set of points, similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file, grid index, bucket grid, and similar names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or the grid method. Geometric hashing is also used in telecommunications (usually under the name vector quantization) to encode and compress multi-dimensional signals.

Standard uses of hashing in cryptography

Main article: Cryptographic hash function

Some standard applications that employ hash functions include authentication, message integrity (using an HMAC (Hashed MAC)), message fingerprinting, data corruption detection, and digital signature efficiency.

3.9.2 Properties

Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below. The exact requirements are dependent on the application; for example, a hash function well suited to indexing data will probably be a poor choice for a cryptographic hash function.

Determinism

A hash procedure must be deterministic, meaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the data to be hashed, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day. It also excludes functions that depend on the memory address of the object being hashed in cases that the address may change during execution (as may happen on systems that use certain methods of garbage collection), although sometimes rehashing of the item is possible.

The determinism is in the context of the reuse of the function. For example, Python adds the feature that hash functions make use of a randomized seed that is generated once when the Python process starts in addition to the input to be hashed. The Python hash is still a valid hash function when used within a single run. But if the values are persisted (for example, written to disk) they can no longer be treated as valid hash values, since in the next run the random value might differ.

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions (pairs of inputs that are mapped to the same hash value) increases. If some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.

Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.

In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (In an ideal "perfect hash function", no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m; see the birthday paradox.)

When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-squared test.

Defined range

It is often desirable that the output of a hash function have fixed size (but see below). If, for example, the output is constrained to 32-bit integer values, the hash values can be used to index into an array. Such hashing is commonly used to accelerate data searches.[3] On the other hand, cryptographic hash functions produce much larger hash values, in order to ensure the computational complexity of brute-force inversion.[4] For example, SHA-1, one of the most widely used cryptographic hash functions, produces a 160-bit value.

Producing fixed-length output from variable-length input can be accomplished by breaking the input data into chunks of specific size. Hash functions used for data searches use some arithmetic expression which iteratively processes chunks of the input (such as the characters in a string) to produce the hash value.[3] In cryptographic hash functions, these chunks are processed by a one-way compression function, with the last chunk being padded if necessary. In this case, their size, which is called block size, is much bigger than the size of the hash value.[4] For example, in SHA-1, the hash value is 160 bits and the block size 512 bits.

Variable range

In many applications, the range of hash values may be different for each run of the program, or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters: the input data z, and the number n of allowed hash values.

A common solution is to compute a fixed hash function with a very large range (say, 0 to 2^32 − 1), divide the result by n, and use the division’s remainder. If n is itself a power of 2, this can be done by bit masking and bit shifting. When this approach is used, the hash function must be chosen so that the result has fairly uniform distribution between 0 and n − 1, for any value of n that may occur in the application. Depending on the function, the remainder may be uniform only for certain values of n, e.g. odd or prime numbers.

We can allow the table size n to not be a power of 2 and still not have to perform any remainder or division operation, as these computations are sometimes costly. For example, let n be significantly less than 2^b. Consider a pseudorandom number generator (PRNG) function P(key) that is uniform on the interval [0, 2^b − 1]. A hash function uniform on the interval [0, n − 1] is n P(key) / 2^b. We can replace the division by a (possibly faster) right bit shift: n P(key) >> b.

Variable range with minimal movement (dynamic hash function)

When the hash function is used to store values in a hash table that outlives the run of the program, and the hash table needs to be expanded or shrunk, the hash table is referred to as a dynamic hash table.

A hash function that will relocate the minimum number of records when the table is resized is desirable. What is needed is a hash function H(z,n), where z is the key being hashed and n is the number of allowed hash values, such that H(z,n + 1) = H(z,n) with probability close to n/(n + 1).

Linear hashing and spiral storage are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property.

Extendible hashing uses a dynamic hash function that requires space proportional to n to compute the hash function, and it becomes a function of the previous keys that have been inserted.

Several algorithms that preserve the uniformity property but require time proportional to n to compute the value of H(z,n) have been invented.

Data normalization

In some applications, the input data may contain features that are irrelevant for comparison purposes. For example, when looking up a personal name, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, as by upper-casing all letters.

Continuity

“A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.”[5]

Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables used in nearest neighbor search.

Non-invertible

In cryptographic applications, hash functions are typically expected to be practically non-invertible, meaning that it is not realistic to reconstruct the input datum x from its hash value h(x) alone without spending great amounts of computing time (see also One-way function).

3.9.3 Hash function algorithms

For most types of hashing functions, the choice of the function depends strongly on the nature of the input data, and their probability distribution in the intended application.

Trivial hash function

If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this “trivial” (identity) hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value.

The meaning of “small enough” depends on the size of the type that is used as the hashed value. For example, in Java, the hash code is a 32-bit integer. Thus the 32-bit integer Integer and 32-bit floating-point Float objects can simply use the value directly; whereas the 64-bit integer Long and 64-bit floating-point Double cannot use this method.

Other types of data can also use this perfect hashing scheme. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character (“A” for “a”, “8” for “8”, etc.). If each character is stored in 8 bits (as in extended ASCII[6] or ISO Latin 1), the table has only 2^8 = 256 entries; in the case of Unicode characters, the table would have 17 × 2^16 = 1114112 entries.

The same technique can be used to map two-letter country codes like “us” or “za” to country names (26^2 = 676 table entries), 5-digit zip codes like 13083 to city names (100000 entries), etc. Invalid data values (such as the country code “xx” or the zip code 00000) may be left undefined in the table or mapped to some appropriate “null” value.

Perfect hashing

Main article: Perfect hash function

Figure: A perfect hash function for the four names shown (John Smith, Lisa Smith, Sam Doe, Sandra Dee).

A hash function that is injective (that is, maps each valid input to a different hash value) is said to be perfect. With such a function one can directly locate the desired entry in a hash table, without any additional searching.

Minimal perfect hashing

Figure: A minimal perfect hash function for the four names shown, with hash values 0 to 3.

A perfect hash function for n keys is said to be minimal if its range consists of n consecutive integers, usually from 0 to n−1. Besides providing single-step lookup, a minimal perfect hash function also yields a compact hash table, without any vacant slots. Minimal perfect hash functions are much harder to find than perfect ones with a wider range.

Hashing uniformly distributed data

If the inputs are bounded-length strings and each input may independently occur with uniform probability (such as telephone numbers, car license plates, invoice numbers, etc.), then a hash function needs to map roughly the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0 to N−1, and the output must be an integer h in the range 0 to n−1, where N is much larger than n. Then the hash function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷ N (the value z scaled down by n/N and truncated to an integer), or many other formulas.

Hashing data with other distributions

These simple formulas will not do if the input values are not equally likely, or are not independent. For instance, most patrons of a supermarket will live in the same geographic area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that case, if n is 10000 or so, the division formula (z × n) ÷ N, which depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod n, which is quite sensitive to the trailing digits, may still yield a fairly even distribution.

Hashing variable-length data

When the data values are long (or variable-length) character strings, such as personal names, web page addresses, or mail messages, their distribution is usually very uneven, with complicated dependencies. For example, text in any natural language has highly non-uniform distributions of characters, and character pairs, very characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string, and depends on each character in a different way.

In cryptographic hash functions, a Merkle–Damgård construction is usually used. In general, the scheme for hashing such data is to break the input into a sequence of small units (bits, bytes, words, etc.) and combine all the units b[1], b[2], …, b[m] sequentially, as follows:

    S ← S0;                      // Initialize the state.
    for k in 1, 2, ..., m do     // Scan the input data units:
        S ← F(S, b[k]);          // Combine data unit k into the state.

The choice of F is a complex issue and depends on the nature of the data. If the units b[k] are single bits, then F(S,b) could be, for instance:

    if highbit(S) = 0 then
        return 2 * S + b
    else
        return (2 * S + b) ^ P

Here highbit(S) denotes the most significant bit of S; the '*' operator denotes unsigned integer multiplication with lost overflow; '^' is the bitwise exclusive or operation applied to words; and P is a suitable fixed word.[7]

Special-purpose hash functions

In many cases, one can design a special-purpose (heuristic) hash function that yields many fewer collisions than a good general-purpose hash function. For example, suppose that the input data are file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part k of the file name and returns k mod n would be nearly optimal. Needless to say, a function that is exceptionally good for a specific kind of data may have dismal performance on data with a different distribution.

Rolling hash

Main article: Rolling hash
return G(S, n) // Extract the hash value from the state. This schema is also used in many text checksum and fingerprint algorithms. The state variable S may be a 32or 64-bit unsigned integer; in that case, S0 can be 0, and G(S,n) can be just S mod n. The best choice of F is a In some applications, such as substring search, one must compute a hash function h for every k-character substring of a given n-character string t; where k is a fixed integer, and n is k. The straightforward solution, which is to extract every such substring s of t and compute h(s) separately, requires a number of operations proportional to k·n. However, with the proper choice of h, one can use the technique of rolling hash to compute all those hashes with an effort proportional to k + n. Universal hashing A universal hashing scheme is a randomized algorithm that selects a hashing function h among a family of such functions, in such a way that the probability of a collision of any two distinct keys is 1/n, where n is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function. See also unique permutation hashing.[8] 84 CHAPTER 3. DICTIONARIES Hashing with checksum functions Hashing by nonlinear table lookup One can adapt certain checksum or fingerprinting algorithms for use as hash functions. Some of those algorithms will map arbitrary long string data z, with any typical real-world distribution—no matter how non-uniform and dependent—to a 32-bit or 64-bit string, from which one can extract a hash value in 0 through n − 1. 
Main article: Tabulation hashing Tables of random numbers (such as 256 random 32-bit integers) can provide high-quality nonlinear functions to be used as hash functions or for other purposes such as cryptography. The key to be hashed is split into 8-bit (one-byte) parts, and each part is used as an index for the nonlinear table. The table values are then added by arithmetic or XOR addition to the hash output value. Because the table is just 1024 bytes in size, it fits into the cache of modern microprocessors and allow very fast execution of the hashing algorithm. As the table value is on average much longer than 8 bits, one bit of input affects nearly all output bits. This method may produce a sufficiently uniform distribution of hash values, as long as the hash range size n is small compared to the range of the checksum or fingerprint function. However, some checksums fare poorly in the avalanche test, which may be a concern in some applications. In particular, the popular CRC32 checksum provides only 16 bits (the higher half of the result) that are usable for hashing. Moreover, each bit of the input has a deterministic effect on each bit of the CRC32, that This algorithm has proven to be very fast and of high qualis one can tell without looking at the rest of the input, ity for hashing purposes (especially hashing of integerwhich bits of the output will flip if the input bit is flipped; number keys). so care must be taken to use all 32 bits when computing the hash from the checksum.[9] Efficient hashing of strings See also: Universal hashing § Hashing strings Multiplicative hashing Multiplicative hashing is a simple type of hash function often used by teachers introducing students to hash tables.[10] Multiplicative hash functions are simple and fast, but have higher collision rates in hash tables than more sophisticated hash functions.[11] In many applications, such as hash tables, collisions make the system a little slower but are otherwise harmless. 
In such systems, it is often better to use hash functions based on multiplication -- such as MurmurHash and the SBoxHash -- or even simpler hash functions such as CRC32 -and tolerate more collisions; rather than use a more complex hash function that avoids many of those collisions but takes longer to compute.[11] Multiplicative hashing is susceptible to a “common mistake” that leads to poor diffusion—higher-value input bits do not affect lowervalue output bits.[12] Modern microprocessors will allow for much faster processing, if 8-bit character strings are not hashed by processing one character at a time, but by interpreting the string as an array of 32 bit or 64 bit integers and hashing/accumulating these “wide word” integer values by means of arithmetic operations (e.g. multiplication by constant and bit-shifting). The remaining characters of the string which are smaller than the word length of the CPU must be handled differently (e.g. being processed one character at a time). This approach has proven to speed up hash code generation by a factor of five or more on modern microprocessors of a word size of 64 bit. Another approach[14] is to convert strings to a 32 or 64 bit numeric value and then apply a hash function. One method that avoids the problem of strings having great similarity (“Aaaaaaaaaa” and “Aaaaaaaaab”) is to use a Cyclic redundancy check (CRC) of the string to compute a 32- or 64-bit value. While it is possible that two different strings will have the same CRC, the likelihood is very small and only requires that one check the acHashing with cryptographic hash functions tual string found to determine whether one has an exSome cryptographic hash functions, such as SHA-1, have act match. CRCs will be different for strings such as Although, CRC codes even stronger uniformity guarantees than checksums or “Aaaaaaaaaa” and “Aaaaaaaaab”. 
[15] they are not cryptographican be used as hash values fingerprints, and thus can provide very good general[16] cally secure since they are not collision-resistant. purpose hashing functions. In ordinary applications, this advantage may be too small to offset their much higher cost.[13] However, this method 3.9.4 Locality-sensitive hashing can provide uniformly distributed hashes even when the keys are chosen by a malicious agent. This feature may Locality-sensitive hashing (LSH) is a method of perhelp to protect services against denial of service attacks. forming probabilistic dimension reduction of high- 3.9. HASH FUNCTION 85 dimensional data. The basic idea is to hash the input 3.9.7 See also items so that similar items are mapped to the same buck• Bloom filter ets with high probability (the number of buckets being much smaller than the universe of possible input items). • Coalesced hashing This is different from the conventional hash functions, such as those used in cryptography, as in this case the • Cuckoo hashing goal is to maximize the probability of “collision” of sim• Hopscotch hashing ilar items rather than to avoid collisions.[17] One example of LSH is MinHash algorithm used for finding similar documents (such as web-pages): Let h be a hash function that maps the members of A and B to distinct integers, and for any set S define h ᵢ (S) to be the member x of S with the minimum value of h(x). Then h ᵢ (A) = h ᵢ (B) exactly when the minimum hash value of the union A ∪ B lies in the intersection A ∩ B. Therefore, Pr[h ᵢ (A) = h ᵢ (B)] = J(A,B). where J is Jaccard index. In other words, if r is a random variable that is one when h ᵢ (A) = h ᵢ (B) and zero otherwise, then r is an unbiased estimator of J(A,B), although it has too high a variance to be useful on its own. The idea of the MinHash scheme is to reduce the variance by averaging together several variables constructed in the same way. 
• Cryptographic hash function • Distributed hash table • Geometric hashing • Hash Code cracker • Hash table • HMAC • Identicon • Linear hash • List of hash functions • Locality sensitive hashing • MD5 • Perfect hash function 3.9.5 Origins of the term • PhotoDNA The term “hash” comes by way of analogy with its nontechnical meaning, to “chop and mix”, see hash (food). Indeed, typical hash functions, like the mod operation, “chop” the input domain into many sub-domains that get “mixed” into the output range to improve the uniformity of the key distribution. • Rabin–Karp string search algorithm Donald Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953, and that Robert Morris used the term in a survey paper in CACM which elevated the term from technical jargon to formal terminology.[18] • MinHash 3.9.6 List of hash functions Main article: List of hash functions • NIST hash function competition • Bernstein hash[19] • Fowler-Noll-Vo hash function (32, 64, 128, 256, 512, or 1024 bits) • Jenkins hash function (32 bits) • Pearson hashing (64 bits) • Zobrist hashing • Rolling hash • Transposition table • Universal hashing • Low-discrepancy sequence 3.9.8 References [1] Konheim, Alan (2010). “7. HASHING FOR STORAGE: DATA MANAGEMENT”. Hashing in Computer Science: Fifty Years of Slicing and Dicing. Wiley-Interscience. ISBN 9780470344736. [2] “Robust Audio Hashing for Content Identification by Jaap Haitsma, Ton Kalker and Job Oostveen” [3] Sedgewick, Robert (2002). “14. Hashing”. Algorithms in Java (3 ed.). Addison Wesley. ISBN 978-0201361209. [4] Menezes, Alfred J.; van Oorschot, Paul C.; Vanstone, Scott A (1996). Handbook of Applied Cryptography. CRC Press. ISBN 0849385237. [5] “Fundamental Data Structures – Josiang p.132”. Retrieved May 19, 2014. 86 CHAPTER 3. DICTIONARIES [6] Plain ASCII is a 7-bit character encoding, although it is often stored in 8-bit bytes with the highest-order bit always clear (zero). 
Therefore, for plain ASCII, the bytes have only $2^7 = 128$ valid values, and the character translation table has only this many entries.

[7] Broder, A. Z. (1993). “Some applications of Rabin’s fingerprinting method”. Sequences II: Methods in Communications, Security, and Computer Science. Springer-Verlag. pp. 143–152.

[8] Shlomi Dolev, Limor Lahiani, Yinnon Haviv, “Unique permutation hashing”, Theoretical Computer Science, Volume 475, 4 March 2013, pages 59–65.

[9] Bret Mulvey, Evaluation of CRC32 for Hash Tables, in Hash Functions. Accessed April 10, 2009.

[10] Knuth. “The Art of Computer Programming”. Volume 3: “Sorting and Searching”. Section “6.4. Hashing”.

[11] Peter Kankowski. “Hash functions: An empirical comparison”.

[12] “CS 3110 Lecture 21: Hash functions”. Section “Multiplicative hashing”.

[13] Bret Mulvey, Evaluation of SHA-1 for Hash Tables, in Hash Functions. Accessed April 10, 2009.

[14] http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.7520 Performance in Practice of String Hashing Functions

[15] Peter Kankowski. “Hash functions: An empirical comparison”.

[16] Cam-Winget, Nancy; Housley, Russ; Wagner, David; Walker, Jesse (May 2003). “Security Flaws in 802.11 Data Link Protocols”. Communications of the ACM. 46 (5): 35–39. doi:10.1145/769800.769823.

[17] A. Rajaraman and J. Ullman (2010). “Mining of Massive Datasets, Ch. 3.”.

[18] Knuth, Donald (1973). The Art of Computer Programming, volume 3, Sorting and Searching. pp. 506–542.

[19] “Hash Functions”. cse.yorku.ca. September 22, 2003. Retrieved November 1, 2012. “the djb2 algorithm (k=33) was first reported by dan bernstein many years ago in comp.lang.c.”

3.9.9 External links

• Calculate hash of a given value by Timo Denk
• The Goulburn Hashing Function (PDF) by Mayur Patel
• Hash Function Construction for Textual and Geometrical Data Retrieval, Latest Trends on Computers, Vol. 2, pp. 483–489, CSCC conference, Corfu, 2010
• Hash Functions and Block Ciphers by Bob Jenkins

3.10 Perfect hash function

In computer science, a perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. In mathematical terms, it is a total injective function.

Perfect hash functions may be used to implement a lookup table with constant worst-case access time. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented.

3.10.1 Application

A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from S (or other associated values) in a lookup table indexed by the output of the function. One can then test whether a key is present in S, or look up a value associated with that key, by looking for it at its cell of the table. Each such lookup takes constant time in the worst case.[1]

3.10.2 Construction

A perfect hash function for a specific set S that can be evaluated in constant time, and with values in a small range, can be found by a randomized algorithm in a number of operations that is proportional to the size of S. The original construction of Fredman, Komlós & Szemerédi (1984) uses a two-level scheme to map a set S of n elements to a range of O(n) indices, and then map each index to a range of hash values. The first level of their construction chooses a large prime p (larger than the size of the universe from which S is drawn), and a parameter k, and maps each element x of S to the index

$$g(x) = (kx \bmod p) \bmod n.$$

If k is chosen randomly, this step is likely to have collisions, but the number of elements $n_i$ that are simultaneously mapped to the same index i is likely to be small. The second level of their construction assigns disjoint ranges of $O(n_i^2)$ integers to each index i. It uses a second set of linear modular functions, one for each index i, to map each member x of S into the range associated with g(x).[1]

As Fredman, Komlós & Szemerédi (1984) show, there exists a choice of the parameter k such that the sum of the lengths of the ranges for the n different values of g(x) is O(n). Additionally, for each value of g(x), there exists a linear modular function that maps the corresponding subset of S into the range associated with that value. Both k, and the second-level functions for each value of g(x), can be found in polynomial time by choosing values randomly until finding one that works.[1]

The hash function itself requires storage space O(n) to store k, p, and all of the second-level linear modular functions. Computing the hash value of a given key x may be performed in constant time by computing g(x), looking up the second-level function associated with g(x), and applying this function to x. A modified version of this two-level scheme with a larger number of values at the top level can be used to construct a perfect hash function that maps S into a smaller range of length n + o(n).[1]

3.10.3 Space lower bounds

The use of O(n) words of information to store the function of Fredman, Komlós & Szemerédi (1984) is near-optimal: any perfect hash function that can be calculated in constant time requires at least a number of bits that is proportional to the size of S.[2]

3.10.4 Extensions

Dynamic perfect hashing

Main article: Dynamic perfect hashing

Using a perfect hash function is best in situations where there is a frequently queried large set, S, which is seldom updated. This is because any modification of the set S may cause the hash function to no longer be perfect for the modified set. Solutions which update the hash function any time the set is modified are known as dynamic perfect hashing,[3] but these methods are relatively complicated to implement.

Minimal perfect hash function

A minimal perfect hash function is a perfect hash function that maps n keys to n consecutive integers, usually the numbers from 0 to n − 1 or from 1 to n. A more formal way of expressing this is: Let j and k be elements of some finite set S. F is a minimal perfect hash function if and only if F(j) = F(k) implies j = k (injectivity) and there exists an integer a such that the range of F is a..a + |S| − 1. It has been proven that a general purpose minimal perfect hash scheme requires at least 1.44 bits/key.[4] The best currently known minimal perfect hashing schemes can be represented using approximately 2.6 bits per key.[5]

Order preservation

A minimal perfect hash function F is order preserving if keys are given in some order $a_1, a_2, \dots, a_n$ and for any keys $a_j$ and $a_k$, j < k implies $F(a_j) < F(a_k)$.[6] In this case, the function value is just the position of each key in the sorted ordering of all of the keys. A simple implementation of order-preserving minimal perfect hash functions with constant access time is to use an (ordinary) perfect hash function or cuckoo hashing to store a lookup table of the positions of each key. If the keys to be hashed are themselves stored in a sorted array, it is possible to store a small number of additional bits per key in a data structure that can be used to compute hash values quickly.[7] Order-preserving minimal perfect hash functions necessarily require Ω(n log n) bits to be represented.[8]

3.10.5 Related constructions

A simple alternative to perfect hashing, which also allows dynamic updates, is cuckoo hashing. This scheme maps keys to two or more locations within a range (unlike perfect hashing, which maps each key to a single location) but does so in such a way that the keys can be assigned one-to-one to locations to which they have been mapped. Lookups with this scheme are slower, because multiple locations must be checked, but nevertheless take constant worst-case time.[9]

3.10.6 References

[1] Fredman, Michael L.; Komlós, János; Szemerédi, Endre (1984), “Storing a Sparse Table with O(1) Worst Case Access Time”, Journal of the ACM, 31 (3): 538, doi:10.1145/828.1884, MR 0819156

[2] Fredman, Michael L.; Komlós, János (1984), “On the size of separating systems and families of perfect hash functions”, SIAM Journal on Algebraic and Discrete Methods, 5 (1): 61–68, doi:10.1137/0605009, MR 731857.

[3] Dietzfelbinger, Martin; Karlin, Anna; Mehlhorn, Kurt; Meyer auf der Heide, Friedhelm; Rohnert, Hans; Tarjan, Robert E. (1994), “Dynamic perfect hashing: upper and lower bounds”, SIAM Journal on Computing, 23 (4): 738–761, doi:10.1137/S0097539791194094, MR 1283572.

[4] Belazzougui, Djamal; Botelho, Fabiano C.; Dietzfelbinger, Martin (2009), “Hash, displace, and compress” (PDF), Algorithms—ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7-9, 2009, Proceedings, Lecture Notes in Computer Science, 5757, Berlin: Springer, pp. 682–693, doi:10.1007/978-3-642-04128-0_61, MR 2557794.

[5] Baeza-Yates, Ricardo; Poblete, Patricio V. (2010), “Searching”, in Atallah, Mikhail J.; Blanton, Marina, Algorithms and Theory of Computation Handbook: General Concepts and Techniques (2nd ed.), CRC Press, ISBN 9781584888239. See in particular p. 2-10.
[6] Jenkins, Bob (14 April 2009), “order-preserving minimal perfect hashing”, in Black, Paul E., Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology, retrieved 2013-03-05 [7] Belazzougui, Djamal; Boldi, Paolo; Pagh, Rasmus; Vigna, Sebastiano (November 2008), “Theory and practice of monotone minimal perfect hashing”, Journal of 88 CHAPTER 3. DICTIONARIES Experimental Algorithmics, 16, Art. doi:10.1145/1963190.2025378. no. 3.2, 26pp, [8] Fox, Edward A.; Chen, Qi Fan; Daoud, Amjad M.; Heath, Lenwood S. (July 1991), “Order-preserving minimal perfect hash functions and information retrieval”, ACM Transactions on Information Systems, New York, NY, USA: ACM, 9 (3): 281–308, doi:10.1145/125187.125200. [9] Pagh, Rasmus; Rodler, Flemming Friche (2004), “Cuckoo hashing”, Journal of Algorithms, 51 (2): 122– 144, doi:10.1016/j.jalgor.2003.12.002, MR 2050140. 3.10.7 Further reading 3.11 Universal hashing In mathematics and computing universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography. • Richard J. Cichelli. Minimal Perfect Hash Functions 3.11.1 Introduction Made Simple, Communications of the ACM, Vol. 23, Number 1, January 1980. See also: Hash function • Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGrawHill, 2001. ISBN 0-262-03293-7. Section 11.5: Perfect hashing, pp. 245–249. • Fabiano C. Botelho, Rasmus Pagh and Nivio Ziviani. 
“Perfect Hashing for Data Management Applications”. • Fabiano C. Botelho and Nivio Ziviani. “External perfect hashing for very large key sets”. 16th ACM Conference on Information and Knowledge Management (CIKM07), Lisbon, Portugal, November 2007. • Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. “Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses”. In Proceedings of the 20th Annual ACM-SIAM Symposium On Discrete Mathematics (SODA), New York, 2009. ACM Press. Assume we want to map keys from some universe U into m bins (labelled [m] = {0, . . . , m − 1} ). The algorithm will have to handle some data set S ⊆ U of |S| = n keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from S that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if the size of U is greater than m · n , since the adversary may choose S to be precisely the preimage of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function. The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions H = {h : U → [m]} is called a universal family 1 if, ∀x, y ∈ U, x ̸= y : Prh∈H [h(x) = h(y)] ≤ m . In other words, any two keys of the universe collide with • Douglas C. Schmidt, GPERF: A Perfect Hash Funcprobability at most 1/m when the hash function h is tion Generator, C++ Report, SIGS, Vol. 10, No. 10, drawn randomly from H . This is exactly the probaNovember/December, 1998. bility of collision we would expect if the hash function assigned truly random hash codes to every key. 
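The definition of a universal family can be checked by brute force on a tiny instance. The Python sketch below enumerates every function of one small family and reports the worst collision probability over all pairs of distinct keys. The family used, h(x) = ((a·x + b) mod p) mod m, is the Carter and Wegman construction that this chapter describes under "Hashing integers"; the values p = 17 and m = 5 and the helper name `worst_collision_probability` are illustrative assumptions, not part of the source.

```python
from fractions import Fraction
from itertools import combinations


def worst_collision_probability(p: int, m: int) -> Fraction:
    """Exhaustively measure the collision probability of a small family.

    Assumed family: h(x) = ((a*x + b) % p) % m with a in 1..p-1 and
    b in 0..p-1, over the universe U = {0, ..., p-1}.
    The 1/m bound relies on p being prime; this is not verified here.
    """
    family = [(a, b) for a in range(1, p) for b in range(p)]
    worst = Fraction(0)
    for x, y in combinations(range(p), 2):  # every pair of distinct keys
        collisions = sum(
            1 for a, b in family
            if ((a * x + b) % p) % m == ((a * y + b) % p) % m
        )
        worst = max(worst, Fraction(collisions, len(family)))
    return worst


if __name__ == "__main__":
    p, m = 17, 5  # illustrative values only
    worst = worst_collision_probability(p, m)
    print(f"worst-case collision probability {worst}, bound 1/m = {Fraction(1, m)}")
```

Exact rational arithmetic is used so that the comparison with 1/m is not affected by floating-point rounding. The enumeration is only feasible for very small p.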
Sometimes, the definition is relaxed to allow collision proba3.10.8 External links bility O(1/m) . This concept was introduced by Carter and Wegman[1] in 1977, and has found numerous appli• Minimal Perfect Hashing by Bob Jenkins cations in computer science (see, for example [2] ). If we • gperf is an Open Source C and C++ perfect hash have an upper bound of ϵ < 1 on the collision probability, generator we say that we have ϵ -almost universality. • cmph is Open Source implementing many perfect Many, but not all, universal families have the following hashing methods stronger uniform difference property: • Sux4J is Open Source implementing perfect hashing, including monotone minimal perfect hashing in Java ∀x, y ∈ U, x ̸= y , when h is drawn randomly from the family H , the difference h(x) − h(y) mod m is uniformly distributed in [m] . • MPHSharp is Open Source implementing many perfect hashing methods in C# Note that the definition of universality is only concerned 3.11. UNIVERSAL HASHING 89 with whether h(x) − h(y) = 0 , which counts collisions. 3.11.2 Mathematical guarantees The uniform difference property is stronger. (Similarly, a universal family can be XOR universal if For any fixed set S of n keys, using a universal family ∀x, y ∈ U, x ̸= y , the value h(x) ⊕ h(y) mod m is guarantees the following properties. uniformly distributed in [m] where ⊕ is the bitwise exclusive or operation. This is only possible if m is a power of two.) An even stronger condition is pairwise independence: we have this property when ∀x, y ∈ U, x ̸= y we have the probability that x, y will hash to any pair of hash values z1 , z2 is as if they were perfectly random: P (h(x) = z1 ∧ h(y) = z2 ) = 1/m2 . Pairwise independence is sometimes called strong universality. Another property is uniformity. We say that a family is uniform if all hash values are equally likely: P (h(x) = z) = 1/m for any hash value z . Universality does not imply uniformity. 
However, strong universality does imply uniformity. 1. For any fixed x in S , the expected number of keys in the bin h(x) is n/m . When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the key x (for example a query, insertion or deletion). 2. The expected number of pairs of keys x, y in S with x ̸= y that collide ( h(x) = h(y) ) is bounded above by n(n − 1)/2m , which is of order O(n2 /m) . When the number of bins, m , is O(n) , the expected number of collisions is O(n) . When hashing into n2 bins, there are no collisions at all with probability at least a half. 3. The expected number of keys in bins with at least t Given a family with the uniform distance property, one keys in them is bounded above by 2n/(t−2(n/m)+ can produce a pairwise independent or strongly univer1) .[6] Thus, if the capacity of each bin is capped sal hash family by adding a uniformly distributed random to three times the average size ( t = 3n/m ), the constant with values in [m] to the hash functions. (Simtotal number of keys in overflowing bins is at most ilarly, if m is a power of two, we can achieve pairwise O(m) . This only holds with a hash family whose independence from an XOR universal hash family by docollision probability is bounded above by 1/m . If a ing an exclusive or with a uniformly distributed random weaker definition is used, bounding it by O(1/m) , constant.) Since a shift by a constant is sometimes irrelthis result is no longer true.[6] evant in applications (e.g. hash tables), a careful distinction between the uniform distance property and pairwise As the above guarantees hold for any fixed set S , they independent is sometimes not made.[3] hold if the data set is chosen by an adversary. 
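The claim in the second guarantee that hashing into $n^2$ bins avoids all collisions with probability at least a half follows from Markov's inequality. The short derivation below spells out that step; the symbol $C$, denoting the number of colliding pairs, is introduced here only for illustration.

```latex
\begin{align*}
\mathbb{E}[C] &\le \binom{n}{2}\cdot\frac{1}{m} = \frac{n(n-1)}{2m}, \\
m = n^2 \;\Longrightarrow\; \mathbb{E}[C] &\le \frac{n(n-1)}{2n^2} < \frac{1}{2}, \\
\Pr[C \ge 1] &\le \mathbb{E}[C] < \frac{1}{2}
\quad\Longrightarrow\quad \Pr[C = 0] > \frac{1}{2}.
\end{align*}
```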
However, For some applications (such as hash tables), it is impor- the adversary has to make this choice before (or indepentant for the least significant bits of the hash values to be dent of) the algorithm’s random choice of a hash function. also universal. When a family is strongly universal, this If the adversary can observe the random choice of the alis guaranteed: if H is a strongly universal family with gorithm, randomness serves no purpose, and the situation ′ m = 2L , then the family made of the functions hmod2L is the same as deterministic hashing. for all h ∈ H is also strongly universal for L′ ≤ L . Unfortunately, the same is not true of (merely) universal The second and third guarantee are typically used in confamilies. For example, the family made of the identity junction with rehashing. For instance, a randomized alfunction h(x) = x is clearly universal, but the family gorithm may be prepared to handle some O(n) num′ made of the function h(x) = xmod2L fails to be uni- ber of collisions. If it observes too many collisions, it chooses another random h from the family and repeats. versal. Universality guarantees that the number of repetitions is UMAC and Poly1305-AES and several other message a geometric random variable. authentication code algorithms are based on universal hashing.[4][5] In such applications, the software chooses a new hash function for every message, based on a unique 3.11.3 Constructions nonce for that message. Several hash table implementations are based on univer- Since any computer data can be represented as one or sal hashing. In such applications, typically the software more machine words, one generally needs hash functions chooses a new hash function only after it notices that “too for three types of domains: machine words (“integers”); many” keys have collided; until then, the same hash func- fixed-length vectors of machine words; and variabletion continues to be used over and over. (Some colli- length vectors (“strings”). 
sion resolution schemes, such as dynamic perfect hashing, pick a new hash function every time there is a collision. Other collision resolution schemes, such as cuckoo hash- Hashing integers ing and 2-choice hashing, allow a number of collisions This section refers to the case of hashing integers that fit before picking a new hash function). in machines words; thus, operations like multiplication, 90 CHAPTER 3. DICTIONARIES addition, division, etc. are cheap machine-level instruc- Avoiding modular arithmetic The state of the art for tions. Let the universe to be hashed be U = {0, . . . , m − hashing integers is the multiply-shift scheme described 1} . by Dietzfelbinger et al. in 1997.[7] By avoiding modular The original proposal of Carter and Wegman[1] was to arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least pick a prime p ≥ m and define a factor of four[8] ). The scheme assumes the number of bins is a power of two, m = 2M . Let w be the number of bits in a machine word. Then the hash functions are ha,b (x) = ((ax + b) mod p) mod m parametrised over odd positive integers a < 2w (that fit in a word of w bits). To evaluate ha (x) , multiply x by where a, b are randomly chosen integers modulo p with a modulo 2w and then keep the high order M bits as the a ̸= 0 . (This is a single iteration of a linear congruential hash code. In mathematical notation, this is generator.) To see that H = {ha,b } is a universal family, note that h(x) = h(y) only holds when ha (x) = (a · x mod 2w ) div 2w−M ax + b ≡ ay + b + i · m (mod p) and it can be implemented in C-like programming languages by ha (x) = (unsigned) (a*x) >> (w-M) for some integer i between 0 and (p − 1)/m . If x ̸= y , their difference, x − y is nonzero and has an inverse This scheme does not satisfy the uniform difference propmodulo p . Solving for a yields erty and is only 2/m -almost-universal; for any x ̸= y , Pr{ha (x) = ha (y)} ≤ 2/m . 
a ≡ i · m · (x − y)−1 (mod p) There are p − 1 possible choices for a (since a = 0 is excluded) and, varying i in the allowed range, ⌊(p − 1)/m⌋ possible non-zero values for the right hand side. Thus the collision probability is ⌊(p − 1)/m⌋/(p − 1) ≤ ((p − 1)/m)/(p − 1) = 1/m Another way to see H is a universal family is via the notion of statistical distance. Write the difference h(x) − h(y) as h(x) − h(y) ≡ (a(x − y) mod p) (mod m) Since x − y is nonzero and a is uniformly distributed in {1, . . . , p} , it follows that a(x − y) modulo p is also uniformly distributed in {1, . . . , p} . The distribution of (h(x) − h(y)) mod m is thus almost uniform, up to a difference in probability of ±1/p between the samples. As a result, the statistical distance to a uniform family is O(m/p) , which becomes negligible when p ≫ m . The family of simpler hash functions To understand the behavior of the hash function, notice that, if axmod2w and aymod2w have the same highestorder 'M' bits, then a(x − y)mod2w has either all 1’s or all 0’s as its highest order M bits (depending on whether axmod2w or aymod2w is larger. Assume that the least significant set bit of x−y appears on position w−c . Since a is a random odd integer and odd integers have inverses in the ring Z2w , it follows that a(x − y)mod2w will be uniformly distributed among w -bit integers with the least significant set bit on position w − c . The probability that these bits are all 0’s or all 1’s is therefore at most 2/2M = 2/m . On the other hand, if c < M , then higher-order M bits of a(x − y)mod2w contain both 0’s and 1’s, so it is certain that h(x) ̸= h(y) . Finally, if c = M then bit w − M of a(x − y)mod2w is 1 and ha (x) = ha (y) if and only if bits w − 1, . . . , w − M + 1 are also 1, which happens with probability 1/2M −1 = 2/m . This analysis is tight, as can be shown with the example x = 2w−M −2 and y = 3x . 
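The two integer schemes discussed here, the Carter and Wegman family $((ax + b) \bmod p) \bmod m$ and the multiply-shift family, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, Python integers are masked to emulate $w$-bit machine words, and the small parameters ($w = 8$, $M = 3$, $p = 17$, $m = 4$) are assumptions chosen so that every hash function in the family can be enumerated.

```python
import random


def make_carter_wegman(p, m, rng):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod m with a in [1, p-1], b in [0, p-1]."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m


def carter_wegman_collisions(x, y, p, m):
    """Count the pairs (a, b), a != 0, for which x and y land in the same bin."""
    return sum(
        ((a * x + b) % p) % m == ((a * y + b) % p) % m
        for a in range(1, p)
        for b in range(p)
    )


def multiply_shift(a, x, w, M):
    """h_a(x) = (a*x mod 2^w) div 2^(w-M); a is assumed odd and smaller than 2^w."""
    return ((a * x) & ((1 << w) - 1)) >> (w - M)


def multiply_shift_collisions(x, y, w, M):
    """Count the odd multipliers a < 2^w for which x and y collide."""
    return sum(
        multiply_shift(a, x, w, M) == multiply_shift(a, y, w, M)
        for a in range(1, 1 << w, 2)
    )


if __name__ == "__main__":
    w, M = 8, 3                      # toy word size, m = 2^M = 8 bins
    x = 1 << (w - M - 2)             # the tight example: x = 2^(w-M-2), y = 3x
    y = 3 * x
    odd_multipliers = 1 << (w - 1)
    print(multiply_shift_collisions(x, y, w, M), "of", odd_multipliers, "multipliers collide")
    h = make_carter_wegman(17, 4, random.Random(0))
    print([h(k) for k in range(8)])
```

Dividing a collision count by the size of the family gives the collision probability for that pair of keys, which can be compared against the $1/m$ and $2/m$ bounds stated in the text.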
To obtain a truly 'universal' hash function, one can use the multiply-add-shift scheme

$$h_{a,b}(x) = ((ax + b) \bmod 2^w) \;\mathrm{div}\; 2^{w-M}$$

which can be implemented in C-like programming languages by

    h_{a,b}(x) = (unsigned) (a*x+b) >> (w-M)

where $a$ is a random odd positive integer with $a < 2^w$ and $b$ is a random non-negative integer with $b < 2^{w-M}$. With these choices of $a$ and $b$, $\Pr\{h_{a,b}(x) = h_{a,b}(y)\} \le 1/m$ for all $x \not\equiv y \pmod{2^w}$.[9] This differs slightly but importantly from the mistranslation in the English paper.[10]

For the Carter and Wegman construction, the family of simpler hash functions

$$h_a(x) = (ax \bmod p) \bmod m$$

is only approximately universal: $\Pr\{h_a(x) = h_a(y)\} \le 2/m$ for all $x \ne y$.[1] Moreover, this analysis is nearly tight; Carter and Wegman[1] show that $\Pr\{h_a(1) = h_a(m+1)\} \ge 2/(m-1)$ whenever $(p-1) \bmod m = 1$.

Hashing vectors

This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector $\bar{x} = (x_0, \dots, x_{k-1})$ of $k$ machine words (integers of $w$ bits each). If $H$ is a universal family with the uniform difference property, the following family (dating back to Carter and Wegman[1]) also has the uniform difference property (and hence is universal):

$$h(\bar{x}) = \left( \sum_{i=0}^{k-1} h_i(x_i) \right) \bmod m,$$

where each $h_i \in H$ is chosen independently at random.

If $m$ is a power of two, one may replace summation by exclusive or.[15]

In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family of [12]. Initialize the hash function with a vector $\bar{a} = (a_0, \dots, a_{k-1})$ of random odd integers on $2w$ bits each. Then if the number of bins is $m = 2^M$ for $M \le w$:

$$h_{\bar{a}}(\bar{x}) = \left( \left( \sum_{i=0}^{k-1} x_i \cdot a_i \right) \bmod 2^{2w} \right) \;\mathrm{div}\; 2^{2w-M}.$$

It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.[11] Initialize the hash function with a vector $\bar{a} = (a_0, \dots, a_{k-1})$ of random odd integers on $2w$ bits each. The following hash family is universal:[13]

$$h_{\bar{a}}(\bar{x}) = \left( \left( \sum_{i=0}^{\lceil k/2 \rceil} (x_{2i} + a_{2i}) \cdot (x_{2i+1} + a_{2i+1}) \right) \bmod 2^{2w} \right) \;\mathrm{div}\; 2^{2w-M}.$$

If double-precision operations are not available, one can interpret the input as a vector of half-words ($w/2$-bit integers). The algorithm will then use $\lceil k/2 \rceil$ multiplications, where $k$ was the number of half-words in the vector. Thus, the algorithm runs at a "rate" of one multiplication per word of input.

The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.[14]

Strong universality at high speed is also possible.[15] Initialize the hash function with a vector $\bar{a} = (a_0, \dots, a_k)$ of random integers on $2w$ bits. Compute

$$h_{\bar{a}}(\bar{x})^{\mathrm{strong}} = \left( a_0 + \sum_{i=0}^{k-1} a_{i+1} x_i \bmod 2^{2w} \right) \;\mathrm{div}\; 2^{w}.$$

The result is strongly universal on $w$ bits. Experimentally, it was found to run at 0.2 CPU cycle per byte on recent Intel processors for $w = 32$.

Hashing strings

This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate $h(s)$ is just the length of $s$. As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality.[11] Note that if zeroes are allowed in the string, then it might be best to append a fictitious non-zero (e.g., 1) character to all strings prior to padding: this will ensure that universality is not affected.[11]

Now assume we want to hash $\bar{x} = (x_0, \dots, x_\ell)$, where a good bound on $\ell$ is not known a priori. A universal family proposed by [12] treats the string $x$ as the coefficients of a polynomial modulo a large prime. If $x_i \in [u]$, let $p \ge \max\{u, m\}$ be a prime and define:

$$h_a(\bar{x}) = h_{\mathrm{int}}\left( \left( \sum_{i=0}^{\ell} x_i \cdot a^i \right) \bmod p \right),$$

where $a \in [p]$ is uniformly random and $h_{\mathrm{int}}$ is chosen randomly from a universal family mapping integer domain $[p] \mapsto [m]$.

Using properties of modular arithmetic, the above can be computed without producing large numbers for large strings as follows:[16]

    uint hash(String x, int a, int p)
        uint h = INITIAL_VALUE
        for (uint i = 0; i < x.length; ++i)
            h = ((h * a) + x[i]) mod p    // Horner's rule, reduced at every step
        return h

This Rabin-Karp rolling hash is based on a linear congruential generator.[17] The above algorithm is also known as the multiplicative hash function.[18] In practice, the mod operator and the parameter p can be avoided altogether by simply allowing the integer to overflow, because this is equivalent to mod (Max-Int-Value + 1) in many programming languages. The table below shows the values chosen to initialize h and a for some of the popular implementations.

Consider two strings $\bar{x}, \bar{y}$ and let $\ell$ be the length of the longer one; for the analysis, the shorter string is conceptually padded with zeros up to length $\ell$. A collision before applying $h_{\mathrm{int}}$ implies that $a$ is a root of the polynomial with coefficients $\bar{x} - \bar{y}$. This polynomial has at most $\ell$ roots modulo $p$, so the collision probability is at most $\ell/p$. The probability of collision through the random $h_{\mathrm{int}}$ brings the total collision probability to $\frac{1}{m} + \frac{\ell}{p}$. Thus, if the prime $p$ is sufficiently large compared to the length of strings hashed, the family is very close to universal (in statistical distance).

Other universal families of hash functions used to hash unknown-length strings to fixed-length hash values include the Rabin fingerprint and the Buzhash.

Avoiding modular arithmetic

To mitigate the computational penalty of modular arithmetic, three tricks are used in practice:[11]

1.
One chooses the prime $p$ to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulo $p$ to be implemented without division (using faster operations like addition and shifts). For instance, on modern architectures one can work with $p = 2^{61} - 1$, while the $x_i$'s are 32-bit values.

2. One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the $\lceil k/16 \rceil$ results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing.

3. One chooses a power-of-two as the divisor, allowing arithmetic modulo $2^w$ to be implemented without division (using faster operations of bit masking). The NH hash-function family takes this approach.

3.11.4 See also

• K-independent hashing
• Rolling hashing
• Tabulation hashing
• Min-wise independence
• Universal one-way hash function
• Low-discrepancy sequence
• Perfect hashing

3.11.5 References

[1] Carter, Larry; Wegman, Mark N. (1979). "Universal Classes of Hash Functions". Journal of Computer and System Sciences. 18 (2): 143–154. doi:10.1016/0022-0000(79)90044-8. Conference version in STOC'77.

[2] Miltersen, Peter Bro. "Universal Hashing". Archived from the original (PDF) on 24 June 2009.

[3] Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.

[4] David Wagner, ed. "Advances in Cryptology - CRYPTO 2008". p. 145.

[5] Jean-Philippe Aumasson, Willi Meier, Raphael Phan, Luca Henzen. "The Hash Function BLAKE". 2014. p. 10.

[6] Baran, Ilya; Demaine, Erik D.; Pătraşcu, Mihai (2008). "Subquadratic Algorithms for 3SUM" (PDF). Algorithmica. 50 (4): 584–596. doi:10.1007/s00453-007-9036-3.

[7] Dietzfelbinger, Martin; Hagerup, Torben; Katajainen, Jyrki; Penttonen, Martti (1997). "A Reliable Randomized Algorithm for the Closest-Pair Problem" (Postscript). Journal of Algorithms. 25 (1): 19–51. doi:10.1006/jagm.1997.0873. Retrieved 10 February 2011.

[8] Thorup, Mikkel. "Text-book algorithms at SODA".

[9] Woelfel, Philipp (2003). Über die Komplexität der Multiplikation in eingeschränkten Branchingprogrammmodellen (PDF) (Ph.D.). Universität Dortmund. Retrieved 18 September 2012.

[10] Woelfel, Philipp (1999). Efficient Strongly Universal and Optimally Universal Hashing (PDF). Mathematical Foundations of Computer Science 1999. LNCS. pp. 262–272. doi:10.1007/3-540-48340-3_24. Retrieved 17 May 2011.

[11] Thorup, Mikkel (2009). String hashing for linear probing. Proc. 20th ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 655–664. doi:10.1137/1.9781611973068.72. Archived (PDF) from the original on 2013-10-12., section 5.3

[12] Dietzfelbinger, Martin; Gil, Joseph; Matias, Yossi; Pippenger, Nicholas (1992). Polynomial Hash Functions Are Reliable (Extended Abstract). Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP). pp. 235–246.

[13] Black, J.; Halevi, S.; Krawczyk, H.; Krovetz, T. (1999). UMAC: Fast and Secure Message Authentication (PDF). Advances in Cryptology (CRYPTO '99)., Equation 1

[14] Pătraşcu, Mihai; Thorup, Mikkel (2011). The power of simple tabulation hashing. Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11). pp. 1–10. arXiv:1011.5200. doi:10.1145/1993636.1993638.

[15] Kaser, Owen; Lemire, Daniel (2013). "Strongly universal string hashing is fast". Computer Journal. Oxford University Press. arXiv:1202.4961. doi:10.1093/comjnl/bxt070.

[16] "Hebrew University Course Slides" (PDF).

[17] Robert Uzgalis. "Library Hash Functions". 1996.

[18] Kankowski, Peter. "Hash functions: An empirical comparison".

[19] Yigit, Ozan. "String hash functions".

[20] Kernighan; Ritchie (1988). "6". The C Programming Language (2nd ed.). p. 118. ISBN 0-13-110362-8.

[21] "String (Java Platform SE 6)". docs.oracle.com. Retrieved 2015-06-10.

3.11.6 Further reading

• Knuth, Donald Ervin (1998).
The Art of Computer Programming, Vol. III: Sorting and Searching (3rd ed.). Reading, Mass; London: Addison-Wesley. ISBN 0-201-89685-0.

3.11.7 External links

• Open Data Structures - Section 5.1.1 - Multiplicative Hashing

3.12 K-independent hashing

A family of hash functions is said to be $k$-independent or $k$-universal[1] if selecting a hash function at random from the family guarantees that the hash codes of any designated $k$ keys are independent random variables (see precise mathematical definitions below). Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many $k$-independent families have been proposed.

3.12.1 Background

See also: Hash function

The goal of hashing is usually to map keys from some large domain (universe) $U$ into a smaller range, such as $m$ bins (labelled $[m] = \{0, \dots, m-1\}$). In the analysis of randomized algorithms and data structures, it is often desirable for the hash codes of various keys to "behave randomly". For instance, if the hash code of each key were an independent random choice in $[m]$, the number of keys per bin could be analyzed using the Chernoff bound. A deterministic hash function cannot offer any such guarantee in an adversarial setting, as the adversary may choose the keys to be precisely the preimage of a bin. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a large family of hash functions. The randomness in choosing the hash function can be used to guarantee some desired random behavior of the hash codes of any keys of interest. The first definition along these lines was universal hashing, which guarantees a low collision probability for any two designated keys. The concept of $k$-independent hashing, introduced by Wegman and Carter in 1981,[2] strengthens the guarantees of random behavior to families of $k$ designated keys, and adds a guarantee on the uniform distribution of hash codes.

3.12.2 Definitions

The strictest definition, introduced by Wegman and Carter[2] under the name "strongly universal$_k$ hash family", is the following. A family of hash functions $H = \{h : U \to [m]\}$ is $k$-independent if for any $k$ distinct keys $(x_1, \dots, x_k) \in U^k$ and any $k$ hash codes (not necessarily distinct) $(y_1, \dots, y_k) \in [m]^k$, we have:

$$\Pr_{h \in H}\left[ h(x_1) = y_1 \wedge \cdots \wedge h(x_k) = y_k \right] = m^{-k}$$

This definition is equivalent to the following two conditions:

1. for any fixed $x \in U$, as $h$ is drawn randomly from $H$, $h(x)$ is uniformly distributed in $[m]$.

2. for any fixed, distinct keys $x_1, \dots, x_k \in U$, as $h$ is drawn randomly from $H$, $h(x_1), \dots, h(x_k)$ are independent random variables.

Often it is inconvenient to achieve the perfect joint probability of $m^{-k}$ due to rounding issues. Following [3], one may define a $(\mu, k)$-independent family to satisfy:

$$\forall \text{ distinct } (x_1, \dots, x_k) \in U^k \text{ and } \forall (y_1, \dots, y_k) \in [m]^k, \quad \Pr_{h \in H}\left[ h(x_1) = y_1 \wedge \cdots \wedge h(x_k) = y_k \right] \le \mu / m^k$$

Observe that, even if $\mu$ is close to 1, the $h(x_i)$ are no longer independent random variables, which is often a problem in the analysis of randomized algorithms. Therefore, a more common alternative to dealing with rounding issues is to prove that the hash family is close in statistical distance to a $k$-independent family, which allows black-box use of the independence properties.

3.12.3 Techniques

Polynomials with random coefficients

The original technique for constructing k-independent hash functions, given by Carter and Wegman, was to select a large prime number p, choose k random numbers modulo p, and use these numbers as the coefficients of a polynomial of degree k − 1 whose values modulo p are used as the value of the hash function. All polynomials of the given degree modulo p are equally likely, and any polynomial is uniquely determined by any k-tuple of argument-value pairs with distinct arguments, from which it follows that any k-tuple of distinct arguments is equally likely to be mapped to any k-tuple of hash values.[2]

Tabulation hashing

Main article: Tabulation hashing

Tabulation hashing is a technique for mapping keys to hash values by partitioning each key into bytes, using each byte as the index into a table of random numbers (with a different table for each byte position), and combining the results of these table lookups by a bitwise exclusive or operation. Thus, it requires more randomness in its initialization than the polynomial method, but avoids possibly-slow multiplication operations. It is 3-independent but not 4-independent.[4] Variations of tabulation hashing can achieve higher degrees of independence by performing table lookups based on overlapping combinations of bits from the input key, or by applying simple tabulation hashing iteratively.[5][6]

3.12.4 Independence needed by different hashing methods

The notion of k-independence can be used to differentiate between different hashing methods, according to the level of independence required to guarantee constant expected time per operation.
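To make the "level of independence" concrete, the polynomial construction from the Techniques section can be sketched as follows, with k as an explicit parameter. This is a minimal sketch under stated assumptions: the function names are hypothetical, the hash range is taken to be the integers modulo p (no further reduction to m bins), and the tiny prime p = 5 is chosen only so that the whole family can be enumerated and the joint distribution of hash values counted directly.

```python
import random
from itertools import product


def poly_hash(coeffs, x, p):
    """Evaluate coeffs[0] + coeffs[1]*x + ... + coeffs[k-1]*x^(k-1) modulo p (Horner's rule)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


def random_poly_hash(k, p, rng):
    """Draw one member of the family: k independent uniform coefficients modulo p."""
    coeffs = [rng.randrange(p) for _ in range(k)]
    return lambda x: poly_hash(coeffs, x, p)


def joint_counts(keys, k, p):
    """For every member of the family, record the tuple of hash values of `keys`."""
    counts = {}
    for coeffs in product(range(p), repeat=k):
        values = tuple(poly_hash(coeffs, x, p) for x in keys)
        counts[values] = counts.get(values, 0) + 1
    return counts


if __name__ == "__main__":
    k, p = 3, 5
    counts = joint_counts((0, 1, 2), k, p)
    # With k distinct keys, every k-tuple of values should be hit equally often.
    print(len(counts), "distinct value tuples, multiplicities:", set(counts.values()))
    h = random_poly_hash(k, p, random.Random(42))
    print([h(x) for x in range(p)])
```

Calling `joint_counts` with more than k keys shows where the guarantee stops: the first k hash values already determine the polynomial, so the remaining values are no longer free.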
Hash chaining, for instance, takes constant expected time even with a 2-independent hash function, because the expected time to perform a search for a given key is bounded by the expected number of collisions that key is involved in. By linearity of expectation, this expected number equals the sum, over all other keys in the hash table, of the probability that the given key and the other key collide. Because the terms of this sum only involve probabilistic events involving two keys, 2-independence is sufficient to ensure that this sum has the same value that it would for a truly random hash function.[2]

Double hashing is another method of hashing that requires a low degree of independence. It is a form of open addressing that uses two hash functions: one to determine the start of a probe sequence, and the other to determine the step size between positions in the probe sequence. As long as both of these are 2-independent, this method gives constant expected time per operation.[7]

On the other hand, linear probing, a simpler form of open addressing where the step size is always one, requires 5-independence. It can be guaranteed to work in constant expected time per operation with a 5-independent hash function,[8] and there exist 4-independent hash functions for which it takes logarithmic time per operation.[9]

3.12.5 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.

[2] Wegman, Mark N.; Carter, J. Lawrence (1981). "New hash functions and their use in authentication and set equality" (PDF). Journal of Computer and System Sciences. 22 (3): 265–279. doi:10.1016/0022-0000(81)90033-7. Conference version in FOCS'79. Retrieved 9 February 2011.

[3] Siegel, Alan (2004). "On universal classes of extremely random constant-time hash functions and their time-space tradeoff" (PDF). SIAM Journal on Computing. 33 (3): 505–543. doi:10.1137/S0097539701386216. Conference version in FOCS'89.

[4] Pătraşcu, Mihai; Thorup, Mikkel (2012), "The power of simple tabulation hashing", Journal of the ACM, 59 (3): Art. 14, arXiv:1011.5200, doi:10.1145/2220357.2220361, MR 2946218.

[5] Siegel, Alan (2004), "On universal classes of extremely random constant-time hash functions", SIAM Journal on Computing, 33 (3): 505–543, doi:10.1137/S0097539701386216, MR 2066640.

[6] Thorup, M. (2013), "Simple tabulation, fast expanders, double tabulation, and high independence", Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2013), pp. 90–99, doi:10.1109/FOCS.2013.18, MR 3246210.

[7] Bradford, Phillip G.; Katehakis, Michael N. (2007), "A probabilistic study on combinatorial expanders and hashing" (PDF), SIAM Journal on Computing, 37 (1): 83–111, doi:10.1137/S009753970444630X, MR 2306284.

[8] Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), "Linear probing with constant independence", SIAM Journal on Computing, 39 (3): 1107–1120, doi:10.1137/070702278, MR 2538852

[9] Pătraşcu, Mihai; Thorup, Mikkel (2010), "On the k-independence required by linear probing and minwise independence" (PDF), Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60

3.12.6 Further reading

• Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.

3.13 Tabulation hashing

In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; later work by Carter and Wegman extended this method to arbitrary fixed-length keys. Generalizations of tabulation hashing have also been developed that can handle variable-length keys such as text strings.

Despite its simplicity, tabulation hashing has strong theoretical properties that distinguish it from some other hash functions. In particular, it is 3-independent: every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values. However, it is not 4-independent. More sophisticated but slower variants of tabulation hashing extend the method to higher degrees of independence.

Because of its high degree of independence, tabulation hashing is usable with hashing methods that require a high-quality hash function, including linear probing, cuckoo hashing, and the MinHash technique for estimating the size of set intersections.

3.13.1 Method

Let p denote the number of bits in a key to be hashed, and q denote the number of bits desired in an output hash function. Choose another number r, less than or equal to p; this choice is arbitrary, and controls the tradeoff between time and memory usage of the hashing method: smaller values of r use less memory but cause the hash function to be slower. Compute t by rounding p/r up to the next larger integer; this gives the number of r-bit blocks needed to represent a key. For instance, if r = 8, then an r-bit number is a byte, and t is the number of bytes per key. The key idea of tabulation hashing is to view a key as a vector of t r-bit numbers, use a lookup table filled with random values to compute a hash value for each of the r-bit numbers representing a given key, and combine these values with the bitwise binary exclusive or operation.[1] The choice of r should be made in such a way that this table is not too large; e.g., so that it fits into the computer's cache memory.[2]

The initialization phase of the algorithm creates a two-dimensional array T of dimensions 2^r by t, and fills the array with random q-bit numbers. Once the array T is initialized, it can be used to compute the hash value h(x) of any given key x. To do so, partition x into r-bit values, where x_0 consists of the low order r bits of x, x_1 consists of the next r bits, etc. For example, with the choice r = 8, x_i is just the ith byte of x. Then, use these values as indices into T and combine them with the exclusive or operation:[1]

h(x) = T[0][x_0] ⊕ T[1][x_1] ⊕ T[2][x_2] ⊕ ...

3.13.2 History

The first instance of tabulation hashing is Zobrist hashing, a method for hashing positions in abstract board games such as chess named after Albert Lindsey Zobrist, who published it in 1970.[3] In this method, a random bitstring is generated for each game feature such as a combination of a chess piece and a square of the chessboard. Then, to hash any game position, the bitstrings for the features of that position are combined by a bitwise exclusive or. The resulting hash value can then be used as an index into a transposition table. Because each move typically changes only a small number of game features, the Zobrist value of the position after a move can be updated quickly from the value of the position before the move, without needing to loop over all of the features of the position.[4]

Tabulation hashing in greater generality, for arbitrary binary values, was later rediscovered by Carter & Wegman (1979) and studied in more detail by Pătraşcu & Thorup (2012).

3.13.3 Universality

Carter & Wegman (1979) define a randomized scheme for generating hash functions to be universal if, for any two keys, the probability that they collide (that is, they are mapped to the same value as each other) is 1/m, where m is the number of values that the keys can take on. They defined a stronger property in the subsequent paper Wegman & Carter (1981): a randomized scheme for generating hash functions is k-independent if, for every k-tuple of keys, and each possible k-tuple of values, the probability that those keys are mapped to those values is 1/m^k. 2-independent hashing schemes are automatically universal, and any universal hashing scheme can be converted into a 2-independent scheme by storing a random number x as part of the initialization phase of the algorithm and adding x to each hash value. Thus, universality is essentially the same as 2-independence. However, k-independence for larger values of k is a stronger property, held by fewer hashing algorithms.

As Pătraşcu & Thorup (2012) observe, tabulation hashing is 3-independent but not 4-independent. For any single key x, T[x_0,0] is equally likely to take on any hash value, and the exclusive or of T[x_0,0] with the remaining table values does not change this property. For any two keys x and y, x is equally likely to be mapped to any hash value as before, and there is at least one position i where x_i ≠ y_i; the table value T[y_i,i] is used in the calculation of h(y) but not in the calculation of h(x), so even after the value of h(x) has been determined, h(y) is equally likely to be any valid hash value.
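The table lookups that this argument reasons about can be made concrete with a short sketch of simple tabulation hashing as described in the Method section. It is an illustration under stated assumptions rather than a reference implementation: the function name is hypothetical, the table is stored as t rows of 2^r entries (matching the T[i][x_i] indexing of the formula), and Python's `random.Random` stands in for whatever source of random q-bit numbers an implementation would actually use.

```python
import random


def make_tabulation_hash(p_bits, q_bits, r_bits, rng):
    """Build h(x) = T[0][x_0] xor T[1][x_1] xor ... for p-bit keys and q-bit hash values."""
    t = -(-p_bits // r_bits)              # number of r-bit blocks, p/r rounded up
    block_mask = (1 << r_bits) - 1
    # One table of 2^r random q-bit numbers per block position.
    table = [
        [rng.getrandbits(q_bits) for _ in range(1 << r_bits)]
        for _ in range(t)
    ]

    def h(x):
        value = 0
        for i in range(t):
            block = (x >> (i * r_bits)) & block_mask   # x_i: the i-th r-bit block, low order first
            value ^= table[i][block]
        return value

    return h


if __name__ == "__main__":
    # 32-bit keys split into four bytes (r = 8), hashed to 16-bit values.
    h = make_tabulation_hash(p_bits=32, q_bits=16, r_bits=8, rng=random.Random(2024))
    for key in (0, 1, 256, 0xDEADBEEF):
        print(f"{key:#010x} -> {h(key):#06x}")
```

Two keys that differ in block i read different entries of row i, which is the table value the two-key argument above relies on.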
The same reasoning extends to any three keys x, y, and z: at least one of the three keys has a position i where its value z_i differs from the other two, so that even after the values of h(x) and h(y) are determined, h(z) is equally likely to be any valid hash value.[5]

However, this reasoning breaks down for four keys because there are sets of keys w, x, y, and z where none of the four has a byte value that it does not share with at least one of the other keys. For instance, if the keys have two bytes each, and w, x, y, and z are the four keys that have either zero or one as their byte values, then each byte value in each position is shared by exactly two of the four keys. For these four keys, the hash values computed by tabulation hashing will always satisfy the equation h(w) ⊕ h(x) ⊕ h(y) ⊕ h(z) = 0, whereas for a 4-independent hashing scheme the same equation would only be satisfied with probability 1/m. Therefore, tabulation hashing is not 4-independent.[5]

3.13.4 Application

Because tabulation hashing is a universal hashing scheme, it can be used in any hashing-based algorithm in which universality is sufficient. For instance, in hash chaining, the expected time per operation is proportional to the sum of collision probabilities, which is the same for any universal scheme as it would be for truly random hash functions, and is constant whenever the load factor of the hash table is constant. Therefore, tabulation hashing can be used to compute hash functions for hash chaining with a theoretical guarantee of constant expected time per operation.[6]

However, universal hashing is not strong enough to guarantee the performance of some other hashing algorithms. For instance, for linear probing, 5-independent hash functions are strong enough to guarantee constant time operation, but there are 4-independent hash functions that fail.[7] Nevertheless, despite only being 3-independent, tabulation hashing provides the same constant-time guarantee for linear probing.[8]

Cuckoo hashing, another technique for implementing hash tables, guarantees constant time per lookup (regardless of the hash function). Insertions into a cuckoo hash table may fail, causing the entire table to be rebuilt, but such failures are sufficiently unlikely that the expected time per insertion (using either a truly random hash function or a hash function with logarithmic independence) is constant. With tabulation hashing, on the other hand, the best bound known on the failure probability is higher, high enough that insertions cannot be guaranteed to take constant expected time. Nevertheless, tabulation hashing is adequate to ensure the linear-expected-time construction of a cuckoo hash table for a static set of keys that does not change as the table is used.[8]

3.13.5 Extensions

Although tabulation hashing as described above ("simple tabulation hashing") is only 3-independent, variations of this method can be used to obtain hash functions with much higher degrees of independence. Siegel (2004) uses the same idea of using exclusive or operations to combine random values from a table, with a more complicated algorithm based on expander graphs for transforming the key bits into table indices, to define hashing schemes that are k-independent for any constant or even logarithmic value of k. However, the number of table lookups needed to compute each hash value using Siegel's variation of tabulation hashing, while constant, is still too large to be practical, and the use of expanders in Siegel's technique also makes it not fully constructive. Thorup (2013) provides a scheme based on tabulation hashing that reaches high degrees of independence more quickly, in a more constructive way. He observes that using one round of simple tabulation hashing to expand the input keys to six times their original length, and then a second round of simple tabulation hashing on the expanded keys, results in a hashing scheme whose independence number is exponential in the parameter r, the number of bits per block in the partition of the keys into blocks.

Simple tabulation is limited to keys of a fixed length, because a different table of random values needs to be initialized for each position of a block in the keys. Lemire (2012) studies variations of tabulation hashing suitable for variable-length keys such as character strings. The general type of hashing scheme studied by Lemire uses a single table T indexed by the value of a block, regardless of its position within the key. However, the values from this table may be combined by a more complicated function than bitwise exclusive or. Lemire shows that no scheme of this type can be 3-independent. Nevertheless, he shows that it is still possible to achieve 2-independence. In particular, a tabulation scheme that interprets the values T[x_i] (where x_i is, as before, the ith block of the input) as the coefficients of a polynomial over a finite field and then takes the remainder of the resulting polynomial modulo another polynomial, gives a 2-independent hash function.

3.13.6 Notes

[1] Morin (2014); Mitzenmacher & Upfal (2014).

[2] Mitzenmacher & Upfal (2014).

[3] Thorup (2013).

[4] Zobrist (1970).

[5] Pătraşcu & Thorup (2012); Mitzenmacher & Upfal (2014).

[6] Carter & Wegman (1979).

[7] For the sufficiency of 5-independent hashing for linear probing, see Pagh, Pagh & Ružić (2009). For examples of weaker hashing schemes that fail, see Pătraşcu & Thorup (2010).

[8] Pătraşcu & Thorup (2012).

3.13.7 References

Secondary sources

• Morin, Pat (February 22, 2014), "Section 5.2.3: Tabulation hashing", Open Data Structures (in pseudocode) (0.1Gβ ed.), pp. 115–116, retrieved 2016-01-08.

• Mitzenmacher, Michael; Upfal, Eli (2014), "Some practical randomized algorithms and data structures", in Tucker, Allen; Gonzalez, Teofilo; Diaz-Herrera, Jorge, Computing Handbook: Computer Science and Software Engineering (3rd ed.), CRC Press, pp. 11-1 – 11-23, ISBN 9781439898529. See in particular Section 11.1.1: Tabulation hashing, pp. 11-3 – 11-4.

Primary sources

• Carter, J. Lawrence; Wegman, Mark N. (1979), "Universal classes of hash functions", Journal of Computer and System Sciences, 18 (2): 143–154, doi:10.1016/0022-0000(79)90044-8, MR 532173.

• Lemire, Daniel (2012), "The universality of iterated hashing over variable-length strings", Discrete Applied Mathematics, 160: 604–617, arXiv:1008.1715, doi:10.1016/j.dam.2011.11.009, MR 2876344.

• Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), "Linear probing with constant independence", SIAM Journal on Computing, 39 (3): 1107–1120, doi:10.1137/070702278, MR 2538852.

• Pătraşcu, Mihai; Thorup, Mikkel (2010), "On the k-independence required by linear probing and minwise independence" (PDF), Proceedings of the 37th International Colloquium on Automata, Languages and Programming (ICALP 2010), Bordeaux, France, July 6-10, 2010, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60, MR 2734626.

• Pătraşcu, Mihai; Thorup, Mikkel (2012), "The power of simple tabulation hashing", Journal of the ACM, 59 (3): Art. 14, arXiv:1011.5200, doi:10.1145/2220357.2220361, MR 2946218.

• Siegel, Alan (2004), "On universal classes of extremely random constant-time hash functions", SIAM Journal on Computing, 33 (3): 505–543, doi:10.1137/S0097539701386216, MR 2066640.

• Thorup, M. (2013), "Simple tabulation, fast expanders, double tabulation, and high independence", Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2013), pp. 90–99, doi:10.1109/FOCS.2013.18, MR 3246210.

3.14 Cryptographic hash function

[Figure: A cryptographic hash function (specifically SHA-1) at work. A small change in the input (in the word "over") drastically changes the output (digest). This is the so-called avalanche effect.]

A cryptographic hash function is a special class of hash function that has certain properties which make it suitable for use in cryptography. It is a mathematical algorithm that maps data of arbitrary size to a bit string of a fixed size (a hash function) which is designed to also be a one-way function, that is, a function which is infeasible to invert. The only way to recreate the input data from an ideal cryptographic hash function's output is to attempt a brute-force search of possible inputs to see if they produce a match. Bruce Schneier has called one-way hash functions "the workhorses of modern cryptography".[1] The input data is often called the message, and the output (the hash value or hash) is often called the message digest or simply the digest.

The ideal cryptographic hash function has four main properties:

• it is quick to compute the hash value for any given message

• a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value

• Wegman, Mark N.; Carter, J.
Lawrence (1981), “New hash functions and their use in authentication and set equality”, Journal of Computer and System Sciences, 22 (3): 265–279, doi:10.1016/00220000(81)90033-7, MR 633535. • it is infeasible to generate a message from its hash value except by trying all possible messages • it is infeasible to find two different messages with the same hash value Cryptographic hash functions have many informationsecurity applications, notably in digital signatures, message authentication codes (MACs), and other forms of authentication. They can also be used as ordinary hash functions, to index data in hash tables, for fingerprinting, • Zobrist, Albert L. (April 1970), A New Hashing to detect duplicate data or uniquely identify files, and as Method with Application for Game Playing (PDF), checksums to detect accidental data corruption. Indeed, Tech. Rep. 88, Madison, Wisconsin: Computer in information-security contexts, cryptographic hash valSciences Department, University of Wisconsin. ues are sometimes called (digital) fingerprints, checksums, 98 CHAPTER 3. DICTIONARIES or just hash values, even though all these terms stand for functions are vulnerable to length-extension attacks: given more general functions with rather different properties hash(m) and len(m) but not m, by choosing a suitable m' and purposes. an attacker can calculate hash(m || m') where || denotes concatenation.[4] This property can be used to break naive authentication schemes based on hash functions. The 3.14.1 Properties HMAC construction works around these problems. Most cryptographic hash functions are designed to take a In practice, collision resistance is insufficient for many string of any length as input and produce a fixed-length practical uses. In additional to collision resistance, it should be impossible for an adversary to find two meshash value. 
sages with substantially similar digests; or to infer any A cryptographic hash function must be able to withstand useful information about the data, given only its digest. all known types of cryptanalytic attack. In theoretical In particular, should behave as much as possible like a cryptography, the security level of a cryptographic hash random function (often called a random oracle in proofs function has been defined using the following properties: of security) while still being deterministic and efficiently computable. This rules out functions like the SWIFFT • Pre-image resistance function, which can be rigorously proven to be collision resistant assuming that certain problems on ideal lattices Given a hash value h it should be diffiare computationally difficult, but as a linear function, cult to find any message m such that h = does not satisfy these additional properties.[5] hash(m). This concept is related to that Checksum algorithms, such as CRC32 and other cyclic of one-way function. Functions that lack redundancy checks, are designed to meet much weaker this property are vulnerable to preimage requirements, and are generally unsuitable as cryptoattacks. graphic hash functions. For example, a CRC was used • Second pre-image resistance for message integrity in the WEP encryption standard, but an attack was readily discovered which exploited the Given an input m1 it should be diflinearity of the checksum. ficult to find different input m2 such that hash(m1 ) = hash(m2 ). Functions that lack this property are vulnerable to Degree of difficulty second-preimage attacks. • Collision resistance It should be difficult to find two different messages m1 and m2 such that hash(m1 ) = hash(m2 ). Such a pair is called a cryptographic hash collision. This property is sometimes referred to as strong collision resistance. 
It requires a hash value at least twice as long as that required for preimage-resistance; otherwise collisions may be found by a birthday attack.[2] In cryptographic practice, “difficult” generally means “almost certainly beyond the reach of any adversary who must be prevented from breaking the system for as long as the security of the system is deemed important”. The meaning of the term is therefore somewhat dependent on the application, since the effort that a malicious agent may put into the task is usually proportional to his expected gain. However, since the needed effort usually grows very quickly with the digest length, even a thousand-fold advantage in processing power can be neutralized by adding a few dozen bits to the latter. For messages selected from a limited set of messages, for example passwords or other short messages, it can be feasible to invert a hash by trying all possible messages in the set. Because cryptographic hash functions are typically designed to be computed quickly, special key derivation functions that require greater computing resources have been developed that make such brute force attacks more difficult. These properties form a hierarchy, in that collision resistance implies second pre-image resistance, which in turns implies pre-image resistance, while the converse is not true in general. [3] The weaker assumption is always preferred in theoretical cryptography, but in practice, a hash-functions which is only second pre-image resistant is considered insecure and is therefore not recommended In some theoretical analyses “difficult” has a spefor real applications. cific mathematical meaning, such as “not solvable in Informally, these properties mean that a malicious ad- asymptotic polynomial time". Such interpretations of versary cannot replace or modify the input data without difficulty are important in the study of provably secure changing its digest. 
Thus, if two strings have the same cryptographic hash functions but do not usually have a digest, one can be very confident that they are identical. strong connection to practical security. For example, an A function meeting these criteria may still have unde- exponential time algorithm can sometimes still be fast sirable properties. Currently popular cryptographic hash enough to make a feasible attack. Conversely, a polyno- 3.14. CRYPTOGRAPHIC HASH FUNCTION 99 mial time algorithm (e.g., one that requires n20 steps for ing retrieved if forgotten or lost, and they have to be ren-digit keys) may be too slow for any practical use. placed with new ones.) The password is often concatenated with a random, non-secret salt value before the hash function is applied. The salt is stored with the password 3.14.2 Illustration hash. Because users have different salts, it is not feasible to store tables of precomputed hash values for common An illustration of the potential use of a cryptographic passwords. Key stretching functions, such as PBKDF2, hash is as follows: Alice poses a tough math problem to Bcrypt or Scrypt, typically use repeated invocations of a Bob and claims she has solved it. Bob would like to try cryptographic hash to increase the time required to perit himself, but would yet like to be sure that Alice is not form brute force attacks on stored password digests. bluffing. Therefore, Alice writes down her solution, com- In 2013 a long-term Password Hashing Competition was putes its hash and tells Bob the hash value (whilst keep- announced to choose a new, standard algorithm for passing the solution secret). Then, when Bob comes up with word hashing.[7] the solution himself a few days later, Alice can prove that she had the solution earlier by revealing it and having Bob hash it and check that it matches the hash value given to Proof-of-work him before. 
(This is an example of a simple commitment scheme; in actual practice, Alice and Bob will often be Main article: Proof-of-work system computer programs, and the secret would be something less easily spoofed than a claimed puzzle solution). A proof-of-work system (or protocol, or function) is an economic measure to deter denial of service attacks and other service abuses such as spam on a network by requir3.14.3 Applications ing some work from the service requester, usually meaning processing time by a computer. A key feature of these Verifying the integrity of files or messages schemes is their asymmetry: the work must be moderately hard (but feasible) on the requester side but easy to Main article: File verification check for the service provider. One popular system — used in Bitcoin mining and Hashcash — uses partial hash An important application of secure hashes is verification inversions to prove that work was done, as a good-will toof message integrity. Determining whether any changes ken to send an e-mail. The sender is required to find a have been made to a message (or a file), for example, can message whose hash value begins with a number of zero be accomplished by comparing message digests calcu- bits. The average work that sender needs to perform in lated before, and after, transmission (or any other event). order to find a valid message is exponential in the number For this reason, most digital signature algorithms only of zero bits required in the hash value, while the recipiconfirm the authenticity of a hashed digest of the message ent can verify the validity of the message by executing a to be “signed”. Verifying the authenticity of a hashed di- single hash function. For instance, in Hashcash, a sender gest of the message is considered proof that the message is asked to generate a header whose 160 bit SHA-1 hash value has the first 20 bits as zeros. The sender will on itself is authentic. average have to try 219 times to find a valid header. 
MD5, SHA1, or SHA2 hashes are sometimes posted along with files on websites or forums to allow verification of integrity.[6] This practice establishes a chain of trust so File or data identifier long as the hashes are posted on a site authenticated by A message digest can also serve as a means of reliably HTTPS. identifying a file; several source code management systems, including Git, Mercurial and Monotone, use the Password verification sha1sum of various types of content (file content, directory trees, ancestry information, etc.) to uniquely identify Main article: password hashing them. Hashes are used to identify files on peer-to-peer filesharing networks. For example, in an ed2k link, an A related application is password verification (first in- MD4-variant hash is combined with the file size, providvented by Roger Needham). Storing all user passwords ing sufficient information for locating file sources, downas cleartext can result in a massive security breach if the loading the file and verifying its contents. Magnet links password file is compromised. One way to reduce this are another example. Such file hashes are often the top danger is to only store the hash digest of each password. hash of a hash list or a hash tree which allows for addiTo authenticate a user, the password presented by the user tional benefits. is hashed and compared with the stored hash. (Note that One of the main applications of a hash function is to althis approach prevents the original passwords from below the fast look-up of a data in a hash table. Being hash 100 CHAPTER 3. DICTIONARIES functions of a particular kind, cryptographic hash functions lend themselves well to this application too. However, compared with standard hash functions, cryptographic hash functions tend to be much more expensive computationally. 
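The partial hash inversion described under Proof-of-work above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Hashcash header format: the stamp layout (message, a colon, then a decimal counter) and the helper names `leading_zero_bits`, `mint` and `verify` are assumptions made for this example.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits a digest starts with."""
    as_int = int.from_bytes(digest, "big")
    return len(digest) * 8 - as_int.bit_length()


def _stamp(message: bytes, counter: int) -> bytes:
    # Hypothetical stamp layout used only in this sketch.
    return message + b":" + str(counter).encode("ascii")


def mint(message: bytes, zero_bits: int) -> int:
    """Requester side: try counters until the SHA-1 digest has enough zero bits."""
    for counter in count():
        digest = hashlib.sha1(_stamp(message, counter)).digest()
        if leading_zero_bits(digest) >= zero_bits:
            return counter


def verify(message: bytes, counter: int, zero_bits: int) -> bool:
    """Provider side: a single hash evaluation checks the claimed work."""
    digest = hashlib.sha1(_stamp(message, counter)).digest()
    return leading_zero_bits(digest) >= zero_bits


if __name__ == "__main__":
    # 12 zero bits keeps the search short; each extra bit doubles the expected work.
    counter = mint(b"bob@example.org", 12)
    print(counter, verify(b"bob@example.org", counter, 12))
```

The asymmetry is visible in the code: `mint` loops over many candidates, while `verify` hashes exactly once.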
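As a rough illustration of content-derived identifiers, the following Python sketch computes a Git-style blob identifier: the SHA-1 digest of a short header followed by the file content. The function name `blob_id` is a hypothetical helper introduced here, the header layout (the word `blob`, the byte length, and a NUL byte) is the one Git uses for blob objects, and nothing else about Git is modelled.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return a Git-style identifier for a blob of file content.

    Assumption: the identifier is SHA-1 over b"blob <size>\\0" + content.
    blob_id is an illustrative name, not part of any library.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    original = b"The red fox jumps over the blue dog\n"
    altered = b"The red fox jumps ouer the blue dog\n"

    # Identical content always maps to the same identifier.
    assert blob_id(original) == blob_id(bytes(original))

    # A one-letter change yields an unrelated-looking identifier.
    print(blob_id(original))
    print(blob_id(altered))
```

Because the identifier depends only on the content, two parties can check that they hold the same file by comparing 40 hexadecimal characters instead of the files themselves.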
Because of this extra cost, cryptographic hash functions tend to be used for look-up and identification mainly in contexts where it is necessary for users to protect themselves against the possibility of forgery (the creation of data with the same digest as the expected data) by potentially malicious participants.

Pseudorandom generation and key derivation

Hash functions can also be used in the generation of pseudorandom bits, or to derive new keys or passwords from a single secure key or password.

3.14.4 Hash functions based on block ciphers

There are several methods to use a block cipher to build a cryptographic hash function, specifically a one-way compression function.

The methods resemble the block cipher modes of operation usually used for encryption. Many well-known hash functions, including MD4, MD5, SHA-1 and SHA-2, are built from block-cipher-like components designed for the purpose, with feedback to ensure that the resulting function is not invertible. SHA-3 finalists included functions with block-cipher-like components (e.g., Skein, BLAKE), though the function finally selected, Keccak, was built on a cryptographic sponge instead.

A standard block cipher such as AES can be used in place of these custom block ciphers; that might be useful when an embedded system needs to implement both encryption and hashing with minimal code size or hardware area. However, that approach can have costs in efficiency and security. The ciphers in hash functions are built for hashing: they use large keys and blocks, can efficiently change keys every block, and have been designed and vetted for resistance to related-key attacks. General-purpose ciphers tend to have different design goals. In particular, AES has key and block sizes that make it nontrivial to use to generate long hash values; AES encryption becomes less efficient when the key changes each block; and related-key attacks make it potentially less secure for use in a hash function than for encryption.

3.14.5 Merkle–Damgård construction

Main article: Merkle–Damgård construction

A hash function must be able to process an arbitrary-length message into a fixed-length output. This can be achieved by breaking the input up into a series of equal-sized blocks, and operating on them in sequence using a one-way compression function. The compression function can either be specially designed for hashing or be built from a block cipher. A hash function built with the Merkle–Damgård construction is as resistant to collisions as is its compression function; any collision for the full hash function can be traced back to a collision in the compression function.

[Figure: The Merkle–Damgård hash construction. Starting from the IV, message blocks 1 to n and a final length-padding block are fed in sequence through the compression function f; a finalisation step then produces the hash.]

The last block processed should also be unambiguously length padded; this is crucial to the security of this construction. This construction is called the Merkle–Damgård construction. Most widely used hash functions, including SHA-1 and MD5, take this form.

The construction has certain inherent flaws, including length-extension and generate-and-paste attacks, and cannot be parallelized. As a result, many entrants in the recent NIST hash function competition were built on different, sometimes novel, constructions.

3.14.6 Use in building other cryptographic primitives

Hash functions can be used to build other cryptographic primitives. For these other primitives to be cryptographically secure, care must be taken to build them correctly.

Message authentication codes (MACs) (also called keyed hash functions) are often built from hash functions. HMAC is such a MAC.

Just as block ciphers can be used to build hash functions, hash functions can be used to build block ciphers. Luby-Rackoff constructions using hash functions can be provably secure if the underlying hash function is secure. Also, many hash functions (including SHA-1 and SHA-2) are built by using a special-purpose block cipher in a Davies-Meyer or other construction. That cipher can also be used in a conventional mode of operation, without the same security guarantees. See SHACAL, BEAR and LION.

Pseudorandom number generators (PRNGs) can be built using hash functions. This is done by combining a (secret) random seed with a counter and hashing it.

Some hash functions, such as Skein, Keccak, and RadioGatún, output an arbitrarily long stream and can be used as a stream cipher, and stream ciphers can also be built from fixed-length digest hash functions. Often this is done by first building a cryptographically secure pseudorandom number generator and then using its stream of random bytes as keystream. SEAL is a stream cipher that uses SHA-1 to generate internal tables, which are then used in a keystream generator more or less unrelated to the hash algorithm. SEAL is not guaranteed to be as strong (or weak) as SHA-1. Similarly, the key expansion of the HC-128 and HC-256 stream ciphers makes heavy use of the SHA-256 hash function.

3.14.7 Concatenation

Concatenating outputs from multiple hash functions provides collision resistance as good as the strongest of the algorithms included in the concatenated result. For example, older versions of Transport Layer Security (TLS) and Secure Sockets Layer (SSL) use concatenated MD5 and SHA-1 sums. This ensures that a method to find collisions in one of the hash functions does not defeat data protected by both hash functions.

For Merkle–Damgård construction hash functions, the concatenated function is as collision-resistant as its strongest component, but not more collision-resistant. Antoine Joux observed that 2-collisions lead to n-collisions: if it is feasible for an attacker to find two messages with the same MD5 hash, the attacker can find as many messages as the attacker desires with identical MD5 hashes with no greater difficulty.[8] Among the n messages with the same MD5 hash, there is likely to be a collision in SHA-1. The additional work needed to find the SHA-1 collision (beyond the exponential birthday search) requires only polynomial time.[9][10]

3.14.8 Cryptographic hash algorithms

There is a long list of cryptographic hash functions, although many have been found to be vulnerable and should not be used. Even if a hash function has never been broken, a successful attack against a weakened variant may undermine the experts’ confidence and lead to its abandonment. For instance, in August 2004 weaknesses were found in several then-popular hash functions, including SHA-0, RIPEMD, and MD5. These weaknesses called into question the security of stronger algorithms derived from the weak hash functions, in particular SHA-1 (a strengthened version of SHA-0), RIPEMD-128, and RIPEMD-160 (both strengthened versions of RIPEMD). Neither SHA-0 nor RIPEMD is widely used, since they were replaced by their strengthened versions.

As of 2009, the two most commonly used cryptographic hash functions were MD5 and SHA-1. However, a successful attack on MD5 broke Transport Layer Security in 2008.[11]

The United States National Security Agency (NSA) developed SHA-0 and SHA-1.

On 12 August 2004, Joux, Carribault, Lemuet, and Jalby announced a collision for the full SHA-0 algorithm. Joux et al. accomplished this using a generalization of the Chabaud and Joux attack. They found that the collision had complexity $2^{51}$ and took about 80,000 CPU hours on a supercomputer with 256 Itanium 2 processors, equivalent to 13 days of full-time use of the supercomputer.

In February 2005, an attack on SHA-1 was reported that would find a collision in about $2^{69}$ hashing operations, rather than the $2^{80}$ expected for a 160-bit hash function. In August 2005, another attack on SHA-1 was reported that would find collisions in $2^{63}$ operations. Though theoretical weaknesses of SHA-1 exist,[12][13] no collision (or near-collision) had been found at the time of writing. Nonetheless, it is often suggested that it may be practical to break within years, and that new applications can avoid these problems by using later members of the SHA family, such as SHA-2, or using techniques such as randomized hashing[14][15] that do not require collision resistance.

However, to ensure the long-term robustness of applications that use hash functions, there was a competition to design a replacement for SHA-2. On October 2, 2012, Keccak was selected as the winner of the NIST hash function competition. A version of this algorithm became a FIPS standard on August 5, 2015 under the name SHA-3.[16]

Another finalist from the NIST hash function competition, BLAKE, was optimized to produce BLAKE2, which is notable for being faster than SHA-3, SHA-2, SHA-1, or MD5, and is used in numerous applications and libraries.

3.14.9 See also

3.14.10 References

[1] Schneier, Bruce. “Cryptanalysis of MD5 and SHA: Time for a New Standard”. Computerworld. Retrieved 2016-04-20. Much more than encryption algorithms, one-way hash functions are the workhorses of modern cryptography.
[2] Katz, Jonathan; Lindell, Yehuda (2008). Introduction to Modern Cryptography. Chapman & Hall/CRC.
[3] Rogaway & Shrimpton 2004, in Sec. 5. Implications.
[4] “Flickr’s API Signature Forgery Vulnerability”. Thai Duong and Juliano Rizzo.
[5] Lyubashevsky, Vadim; Micciancio, Daniele; Peikert, Chris; Rosen, Alon. “SWIFFT: A Modest Proposal for FFT Hashing”. Springer. Retrieved 29 August 2016.
[6] Perrin, Chad (December 5, 2007). “Use MD5 hashes to verify software downloads”. TechRepublic. Retrieved March 2, 2013.
[7] “Password Hashing Competition”. Retrieved March 3, 2013.
[8] Antoine Joux. Multicollisions in Iterated Hash Functions. Application to Cascaded Constructions. LNCS 3152/2004, pages 306–316.
[9] Finney, Hal (August 20, 2004).
“More Problems with Hash Functions”. The Cryptography Mailing List. Retrieved May 25, 2016.
[10] Hoch, Jonathan J.; Shamir, Adi (2008). “On the Strength of the Concatenated Hash Combiner when All the Hash Functions Are Weak” (PDF). Retrieved May 25, 2016.
[11] Alexander Sotirov, Marc Stevens, Jacob Appelbaum, Arjen Lenstra, David Molnar, Dag Arne Osvik, Benne de Weger, MD5 considered harmful today: Creating a rogue CA certificate, accessed March 29, 2009.
[12] Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu, Finding Collisions in the Full SHA-1
[13] Bruce Schneier, Cryptanalysis of SHA-1 (summarizes Wang et al. results and their implications)
[14] Shai Halevi, Hugo Krawczyk, Update on Randomized Hashing
[15] Shai Halevi and Hugo Krawczyk, Randomized Hashing and Digital Signatures
[16] NIST.gov – Computer Security Division – Computer Security Resource Center

3.14.11 External links

• Paar, Christof; Pelzl, Jan (2009). “11: Hash Functions”. Understanding Cryptography, A Textbook for Students and Practitioners. Springer. (companion web site contains online cryptography course that covers hash functions)
• “The ECRYPT Hash Function Website”.
• Buldas, A. (2011). “Series of mini-lectures about cryptographic hash functions”.
• Rogaway, P.; Shrimpton, T. (2004). “Cryptographic Hash-Function Basics: Definitions, Implications, and Separations for Preimage Resistance, Second-Preimage Resistance, and Collision Resistance”. CiteSeerX: 10.1.1.3.6200.

Chapter 4 Sets

4.1 Set (abstract data type)

In computer science, a set is an abstract data type that can store certain values, without any particular order, and no repeated values. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.

Some set data structures are designed for static or frozen sets that do not change after they are constructed. Static sets allow only query operations on their elements, such as checking whether a given value is in the set, or enumerating the values in some arbitrary order. Other variants, called dynamic or mutable sets, also allow the insertion and deletion of elements from the set.

An abstract data structure is a collection, or aggregate, of data. The data may be booleans, numbers, characters, or other data structures. If one considers the structure yielded by packaging[lower-alpha 1] or indexing,[lower-alpha 2] there are four basic data structures:[1][2]

1. unpackaged, unindexed: bunch
2. packaged, unindexed: set
3. unpackaged, indexed: string (sequence)
4. packaged, indexed: list (array)

In this view, the contents of a set are a bunch, and isolated data items are elementary bunches (elements). Whereas sets contain elements, bunches consist of elements.

Further structuring may be achieved by considering the multiplicity of elements (sets become multisets, bunches become hyperbunches)[3] or their homogeneity (a record is a set of fields, not necessarily all of the same type).

4.1.1 Type theory

In type theory, sets are generally identified with their indicator function (characteristic function): accordingly, a set of values of type A may be denoted by $2^{A}$ or $P(A)$. (Subtypes and subsets may be modeled by refinement types, and quotient sets may be replaced by setoids.) The characteristic function F of a set S is defined as:

$$F(x) = \begin{cases} 1, & \text{if } x \in S \\ 0, & \text{if } x \notin S \end{cases}$$

In theory, many other abstract data structures can be viewed as set structures with additional operations and/or additional axioms imposed on the standard operations. For example, an abstract heap can be viewed as a set structure with a min(S) operation that returns the element of smallest value.

4.1.2 Operations

Core set-theoretical operations

One may define the operations of the algebra of sets:

• union(S,T): returns the union of sets S and T.
• intersection(S,T): returns the intersection of sets S and T.
• difference(S,T): returns the difference of sets S and T.
• subset(S,T): a predicate that tests whether the set S is a subset of set T.

Static sets

Typical operations that may be provided by a static set structure S are:

• is_element_of(x,S): checks whether the value x is in the set S.
• is_empty(S): checks whether the set S is empty.
• size(S) or cardinality(S): returns the number of elements in S.
• iterate(S): returns a function that returns one more value of S at each call, in some arbitrary order.
• enumerate(S): returns a list containing the elements of S in some arbitrary order.
• build($x_1$, $x_2$, …, $x_n$): creates a set structure with values $x_1$, $x_2$, …, $x_n$.
• create_from(collection): creates a new set structure containing all the elements of the given collection or all the elements returned by the given iterator.

Dynamic sets

Dynamic set structures typically add:

• create(): creates a new, initially empty set structure.
• create_with_capacity(n): creates a new set structure, initially empty but capable of holding up to n elements.
• add(S,x): adds the element x to S, if it is not present already.
• remove(S, x): removes the element x from S, if it is present.
• capacity(S): returns the maximum number of values that S can hold.

Some set structures may allow only some of these operations. The cost of each operation will depend on the implementation, and possibly also on the particular values stored in the set, and the order in which they are inserted.

Additional operations

There are many other operations that can (in principle) be defined in terms of the above, such as:

• pop(S): returns an arbitrary element of S, deleting it from S.[4]
• pick(S): returns an arbitrary element of S.[5][6][7] Functionally, the mutator pop can be interpreted as the pair of selectors (pick, rest), where rest returns the set consisting of all elements except for the arbitrary element.[8] Can be interpreted in terms of iterate.[lower-alpha 3]
• map(F,S): returns the set of distinct values resulting from applying function F to each element of S.
• filter(P,S): returns the subset containing all elements of S that satisfy a given predicate P.
• fold($A_0$,F,S): returns the value $A_{|S|}$ after applying $A_{i+1} := F(A_i, e)$ for each element e of S, for some binary operation F. F must be associative and commutative for this to be well-defined.
• clear(S): deletes all elements of S.
• equal($S_1$, $S_2$): checks whether the two given sets are equal (i.e. contain all and only the same elements).
• hash(S): returns a hash value for the static set S such that if equal($S_1$, $S_2$) then hash($S_1$) = hash($S_2$).

Other operations can be defined for sets with elements of a special type:

• sum(S): returns the sum of all elements of S for some definition of “sum”. For example, over integers or reals, it may be defined as fold(0, add, S).
• collapse(S): given a set of sets, returns the union.[9] For example, collapse({{1}, {2, 3}}) == {1, 2, 3}. May be considered a kind of sum.
• flatten(S): given a set consisting of sets and atomic elements (elements that are not sets), returns a set whose elements are the atomic elements of the original top-level set or elements of the sets it contains. In other words, it removes a level of nesting, like collapse, but allows atoms. This can be done a single time, or by recursively flattening to obtain a set of only atomic elements.[10] For example, flatten({1, {2, 3}}) == {1, 2, 3}.
• nearest(S,x): returns the element of S that is closest in value to x (by some metric).
• min(S), max(S): returns the minimum/maximum element of S.

4.1.3 Implementations

Sets can be implemented using various data structures, which provide different time and space trade-offs for various operations. Some implementations are designed to improve the efficiency of very specialized operations, such as nearest or union. Implementations described as “general use” typically strive to optimize the element_of, add, and delete operations. A simple implementation is to use a list, ignoring the order of the elements and taking care to avoid repeated values. This is simple but inefficient, as operations like set membership or element deletion are O(n), since they require scanning the entire list.[lower-alpha 4] Sets are often instead implemented using more efficient data structures, particularly various flavors of trees, tries, or hash tables.

As sets can be interpreted as a kind of map (by the indicator function), sets are commonly implemented in the same way as (partial) maps (associative arrays), with the value of each key-value pair having the unit type or a sentinel value (like 1). The usual choices are a self-balancing binary search tree for sorted sets (which has O(log n) for most operations), or a hash table for unsorted sets (which has O(1) average-case, but O(n) worst-case, for most operations). A sorted linear hash table[11] may be used to provide deterministically ordered sets.

Further, in languages that support maps but not sets, sets can be implemented in terms of maps. For example, a common programming idiom in Perl that converts an array to a hash whose values are the sentinel value 1, for use as a set, is:

    my %elements = map { $_ => 1 } @elements;

Other popular methods include arrays. In particular, a subset of the integers 1..n can be implemented efficiently as an n-bit bit array, which also supports very efficient union and intersection operations. A Bloom map implements a set probabilistically, using a very compact representation but risking a small chance of false positives on queries.

The Boolean set operations can be implemented in terms of more elementary operations (pop, clear, and add), but specialized algorithms may yield lower asymptotic time bounds. If sets are implemented as sorted lists, for example, the naive algorithm for union(S,T) will take time proportional to the length m of S times the length n of T, whereas a variant of the list merging algorithm will do the job in time proportional to m+n. Moreover, there are specialized set data structures (such as the union-find data structure) that are optimized for one or more of these operations, at the expense of others.

4.1.4 Language support

One of the earliest languages to support sets was Pascal; many languages now include them, whether in the core language or in a standard library.

• In C++, the Standard Template Library (STL) provides the set template class, which is typically implemented using a binary search tree (e.g. red-black tree); SGI's STL also provides the hash_set template class, which implements a set using a hash table. C++11 has support for the unordered_set template class, which is implemented using a hash table.
• Python has built-in set and frozenset types since 2.4, and since Python 3.0 and 2.7, supports non-empty set literals using a curly-bracket syntax, e.g.: {x, y, z}.
• The .NET Framework provides the generic HashSet and SortedSet classes that implement the generic ISet interface.
• Smalltalk's class library includes Set and IdentitySet, using equality and identity for inclusion test respectively. Many dialects provide variations for compressed storage (NumberSet, CharacterSet), for ordering (OrderedSet, SortedSet, etc.) or for weak references (WeakIdentitySet).
In these C++ set containers, the elements themselves are the keys, in contrast to sequenced containers, where elements are accessed using their (relative or absolute) position. Set elements must have a strict weak ordering.
• Java offers the Set interface to support sets (with the HashSet class implementing it using a hash table), and the SortedSet sub-interface to support sorted sets (with the TreeSet class implementing it using a binary search tree).
• Apple's Foundation framework (part of Cocoa) provides the Objective-C classes NSSet, NSMutableSet, NSCountedSet, NSOrderedSet, and NSMutableOrderedSet. The CoreFoundation APIs provide the CFSet and CFMutableSet types for use in C.
• Ruby's standard library includes a set module which contains Set and SortedSet classes that implement sets using hash tables, the latter allowing iteration in sorted order.
• OCaml's standard library contains a Set module, which implements a functional set data structure using binary search trees.
• The GHC implementation of Haskell provides a Data.Set module, which implements immutable sets using binary search trees.[12]
• The Tcl Tcllib package provides a set module which implements a set data structure based upon Tcl lists.
• The Swift standard library contains a Set type, since Swift 1.2.

As noted in the previous section, in languages which do not directly support sets but do support associative arrays, sets can be emulated using associative arrays, by using the elements as keys, and using a dummy value as the values, which are ignored.

4.1.5 Multiset

A generalization of the notion of a set is that of a multiset or bag, which is similar to a set but allows repeated (“equal”) values (duplicates). This is used in two distinct senses: either equal values are considered identical, and are simply counted, or equal values are considered equivalent, and are stored as distinct items. For example, given a list of people (by name) and ages (in years), one could construct a multiset of ages, which simply counts the number of people of a given age. Alternatively, one can construct a multiset of people, where two people are considered equivalent if their ages are the same (but may be different people and have different names), in which case each pair (name, age) must be stored, and selecting on a given age gives all the people of a given age.

Formally, it is possible for objects in computer science to be considered “equal” under some equivalence relation but still distinct under another relation. Some types of multiset implementations will store distinct equal objects as separate items in the data structure, while others will collapse them down to one version (the first one encountered) and keep a positive integer count of the multiplicity of the element.

As with sets, multisets can naturally be implemented using hash tables or trees, which yield different performance characteristics.

The set of all bags over type T is given by the expression bag T. If by multiset one considers equal items identical and simply counts them, then a multiset can be interpreted as a function from the input domain to the non-negative integers (natural numbers), generalizing the identification of a set with its indicator function. In some cases a multiset in this counting sense may be generalized to allow negative values, as in Python.

• C++'s Standard Template Library implements both sorted and unsorted multisets. It provides the multiset class for the sorted multiset, as a kind of associative container, which implements this multiset using a self-balancing binary search tree. It provides the unordered_multiset class for the unsorted multiset, as a kind of unordered associative container, which implements this multiset using a hash table. The unsorted multiset is standard as of C++11; previously SGI's STL provided the hash_multiset class, which was copied and eventually standardized.
• For Java, third-party libraries provide multiset functionality:
    • Apache Commons Collections provides the Bag and SortedBag interfaces, with implementing classes like HashBag and TreeBag.
    • Google Guava provides the Multiset interface, with implementing classes like HashMultiset and TreeMultiset.
• Apple provides the NSCountedSet class as part of Cocoa, and the CFBag and CFMutableBag types as part of CoreFoundation.
• Python's standard library includes collections.Counter, which is similar to a multiset.
• Smalltalk includes the Bag class, which can be instantiated to use either identity or equality as predicate for inclusion test.

Where a multiset data structure is not available, a workaround is to use a regular set, but override the equality predicate of its items to always return “not equal” on distinct objects (however, such a set will still not be able to store multiple occurrences of the same object) or use an associative array mapping the values to their integer multiplicities (this will not be able to distinguish between equal elements at all).

Typical operations on bags:

• contains(B, x): checks whether the element x is present (at least once) in the bag B.
• is_sub_bag(B1, B2): checks whether each element in the bag B1 occurs in B1 no more often than it occurs in the bag B2; sometimes denoted as B1 ⊑ B2.
• count(B, x): returns the number of times that the element x occurs in the bag B; sometimes denoted as B # x.
• scaled_by(B, n): given a natural number n, returns a bag which contains the same elements as the bag B, except that every element that occurs m times in B occurs n * m times in the resulting bag; sometimes denoted as n ⊗ B.
• union(B1, B2): returns a bag containing just those values that occur in either the bag B1 or the bag B2, except that the number of times a value x occurs in the resulting bag is equal to (B1 # x) + (B2 # x); sometimes denoted as B1 ⊎ B2.

Multisets in SQL

In relational databases, a table can be a (mathematical) set or a multiset, depending on the presence of uniqueness constraints on some columns (which turn them into a candidate key).

SQL allows the selection of rows from a relational table: this operation will in general yield a multiset, unless the keyword DISTINCT is used to force the rows to be all different, or the selection includes the primary (or a candidate) key.

In ANSI SQL the MULTISET keyword can be used to transform a subquery into a collection expression:

    SELECT expression1, expression2... FROM table_name...

is a general select that can be used as subquery expression of another more general query, while

    MULTISET(SELECT expression1, expression2... FROM table_name...)

transforms the subquery into a collection expression that can be used in another query, or in assignment to a column of appropriate collection type.

4.1.6 See also

• Bloom filter
• Disjoint set

4.1.7 Notes

[1] “Packaging” consists in supplying a container for an aggregation of objects in order to turn them into a single object. Consider a function call: without packaging, a function can be called to act upon a bunch only by passing each bunch element as a separate argument, which complicates the function's signature considerably (and is just not possible in some programming languages). By packaging the bunch's elements into a set, the function may now be called upon a single, elementary argument: the set object (the bunch's package).
[2] Indexing is possible when the elements being considered are totally ordered. Being without order, the elements of a multiset (for example) do not have lesser/greater or preceding/succeeding relationships: they can only be compared in absolute terms (same/different).
[3] For example, in Python pick can be implemented on a derived class of the built-in set as follows: class Set(set): def pick(self): return next(iter(self)) [4] Element insertion can be done in O(1) time by simply inserting at an end, but if one avoids duplicates this takes O(n) time. 107 [11] Wang, Thomas (1997), Sorted Linear Hash Table [12] Stephen Adams, "Efficient sets: a balancing act", Journal of Functional Programming 3(4):553-562, October 1993. Retrieved on 2015-03-11. 4.2 Bit array A bit array (also known as bitmap, bitset, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to be stored, some space is wasted due to internal fragmentation. 4.2.1 Definition A bit array is a mapping from some domain (almost always a range of integers) to values in the set {0, 1}. The values can be interpreted as dark/light, absent/present, locked/unlocked, valid/invalid, et cetera. The point is that 4.1.8 References there are only two possible values, so they can be stored [1] Hehner, Eric C. R. (1981), “Bunch Theory: A Simple Set in one bit. As with other arrays, the access to a single Theory for Computer Science”, Information Processing bit can be managed by applying an index to the array. Letters, 12 (1): 26, doi:10.1016/0020-0190(81)90071-5 Assuming its size (or length) to be n bits, the array can [2] Hehner, Eric C. R. (2004), A Practical Theory of Pro- be used to specify a subset of the domain (e.g. {0, 1, 2, ..., n−1}), where a 1-bit indicates the presence and a 0-bit gramming, second edition the absence of a number in the set. This set data structure [3] Hehner, Eric C. R. 
(2012), A Practical Theory of Pro- uses about n/w words of space, where w is the number of gramming, 2012-3-30 edition bits in each machine word. Whether the least significant bit (of the word) or the most significant bit indicates the [4] Python: pop() smallest-index number is largely irrelevant, but the for[5] Management and Processing of Complex Data Structures: mer tends to be preferred (on little-endian machines). Third Workshop on Information Systems and Artificial Intelligence, Hamburg, Germany, February 28 - March 2, 1994. Proceedings, ed. Kai v. Luck, Heinz Marburger, p. 76 [6] Python Issue7212: Retrieve an arbitrary element from a set without removing it; see msg106593 regarding standard name [7] Ruby Feature #4553: Add Set#pick and Set#pop [8] Inductive Synthesis of Functional Programs: Universal Planning, Folding of Finite Programs, and Schema Abstraction by Analogical Reasoning, Ute Schmid, Springer, Aug 21, 2003, p. 240 [9] Recent Trends in Data Type Specification: 10th Workshop on Specification of Abstract Data Types Joint with the 5th COMPASS Workshop, S. Margherita, Italy, May 30 - June 3, 1994. Selected Papers, Volume 10, ed. Egidio Astesiano, Gianna Reggio, Andrzej Tarlecki, p. 38 [10] Ruby: flatten() 4.2.2 Basic operations Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated using bitwise operations. In particular: • OR can be used to set a bit to one: 11101010 OR 00000100 = 11101110 • AND can be used to set a bit to zero: 11101010 AND 11111101 = 11101000 • AND together with zero-testing can be used to determine if a bit is set: 11101010 AND 00000000 = 0 00000001 = 108 CHAPTER 4. SETS 11101010 AND 00000010 ≠ 0 00000010 = word and keep a running total. Counting zeros is similar. See the Hamming weight article for examples of an efficient implementation. 
• XOR can be used to invert or toggle a bit: 11101010 11101110 11101110 11101010 XOR 00000100 = XOR 00000100 = • NOT can be used to invert all bits. Inversion Vertical flipping of a one-bit-per-pixel image, or some FFT algorithms, requires flipping the bits of individual words (so b31 b30 ... b0 becomes b0 ... b30 b31). When this operation is not available on the processor, it’s still possible to proceed by successive passes, in this example on 32 bits: NOT 10110010 = 01001101 exchange two 16bit halfwords exchange bytes by pairs (0xddccbbaa -> 0xccddaabb) ... swap bits by pairs To obtain the bit mask needed for these operations, we swap bits (b31 b30 ... b1 b0 -> b30 b31 ... b0 b1) The can use a bit shift operator to shift the number 1 to the last operation can be written ((x&0x55555555)<<1) | left by the appropriate number of places, as well as bitwise (x&0xaaaaaaaa)>>1)). negation if necessary. Given two bit arrays of the same size representing sets, we can compute their union, intersection, and set-theoretic Find first one difference using n/w simple bit operations each (2n/w for difference), as well as the complement of either: The find first set or find first one operation identifies the for i from 0 to n/w-1 complement_a[i] := not a[i] union[i] index or position of the 1-bit with the smallest index in := a[i] or b[i] intersection[i] := a[i] and b[i] difference[i] an array, and has widespread hardware support (for arrays not larger than a word) and efficient algorithms for := a[i] and (not b[i]) its computation. When a priority queue is stored in a bit array, find first one can be used to identify the highest priIf we wish to iterate through the bits of a bit array, we can ority element in the queue. To expand a word-size find do this efficiently using a doubly nested loop that loops first one to longer arrays, one can find the first nonzero through each word, one at a time. Only n/w memory acword and then run find first one on that word. 
The recesses are required: lated operations find first zero, count leading zeros, count for i from 0 to n/w-1 index := 0 // if needed word := a[i] leading ones, count trailing zeros, count trailing ones, and for b from 0 to w-1 value := word and 1 ≠ 0 word := log base 2 (see find first set) can also be extended to a bit word shift right 1 // do something with value index := array in a straightforward manner. index + 1 // if needed 4.2.4 Compression Both of these code samples exhibit ideal locality of reference, which will subsequently receive large performance A bit array is the densest storage for “random” bits, that boost from a data cache. If a cache line is k words, only is, where each bit is equally likely to be 0 or 1, and each about n/wk cache misses will occur. one is independent. But most data is not random, so it may be possible to store it more compactly. For example, the data of a typical fax image is not random and can 4.2.3 More complex operations be compressed. Run-length encoding is commonly used As with character strings it is straightforward to define to compress these long streams. However, most comlength, substring, lexicographical compare, concatenation, pressed data formats are not so easy to access randomly; reverse operations. The implementation of some of these also by compressing bit arrays too aggressively we run the risk of losing the benefits due to bit-level parallelism operations is sensitive to endianness. (vectorization). Thus, instead of compressing bit arrays as streams of bits, we might compress them as streams of Population / Hamming weight bytes or words (see Bitmap index (compression)). If we wish to find the number of 1 bits in a bit array, sometimes called the population count or Hamming weight, 4.2.5 Advantages and disadvantages there are efficient branch-free algorithms that can compute the number of bits in a word using a series of simple Bit arrays, despite their simplicity, have a number of bit operations. 
We simply run such an algorithm on each marked advantages over other data structures for the same 4.2. BIT ARRAY problems: 109 based on bit arrays that accept either false positives or false negatives. • They are extremely compact; few other data struc- Bit arrays and the operations on them are also important tures can store n independent pieces of data in n/w for constructing succinct data structures, which use close words. to the minimum possible space. In this context, operations like finding the nth 1 bit or counting the number of • They allow small arrays of bits to be stored and ma- 1 bits up to a certain position become important. nipulated in the register set for long periods of time Bit arrays are also a useful abstraction for examining with no memory accesses. streams of compressed data, which often contain elements that occupy portions of bytes or are not byte• Because of their ability to exploit bit-level paralaligned. For example, the compressed Huffman coding lelism, limit memory access, and maximally use the representation of a single 8-bit character can be anywhere data cache, they often outperform many other data from 1 to 255 bits long. structures on practical data sets, even those that are more asymptotically efficient. In information retrieval, bit arrays are a good representation for the posting lists of very frequent terms. If we However, bit arrays aren't the solution to everything. In compute the gaps between adjacent values in a list of particular: strictly increasing integers and encode them using unary coding, the result is a bit array with a 1 bit in the nth in the list. The implied proba• Without compression, they are wasteful set data position if and only if n is n structures for sparse sets (those with few elements bility of a gap of n is 1/2 . This is also the special case of compared to their range) in both time and space. 
Golomb coding where the parameter M is 1; this parameFor such applications, compressed bit arrays, Judy ter is only normally selected when -log(2-p)/log(1-p) ≤ 1, arrays, tries, or even Bloom filters should be consid- or roughly the term occurs in at least 38% of documents. ered instead. • Accessing individual elements can be expensive and difficult to express in some languages. If random access is more common than sequential and the array is relatively small, a byte array may be preferable on a machine with byte addressing. A word array, however, is probably not justified due to the huge space overhead and additional cache misses it causes, unless the machine only has word addressing. 4.2.6 Applications Because of their compactness, bit arrays have a number of applications in areas where space or efficiency is at a premium. Most commonly, they are used to represent a simple group of boolean flags or an ordered sequence of boolean values. Bit arrays are used for priority queues, where the bit at index k is set if and only if k is in the queue; this data structure is used, for example, by the Linux kernel, and benefits strongly from a find-first-zero operation in hardware. 4.2.7 Language support The APL programming language fully supports bit arrays of arbitrary shape and size as a Boolean datatype distinct from integers. All major implementations (Dyalog APL, APL2, APL Next, NARS2000, Gnu APL, etc.) pack the bits densely into whatever size the machine word is. Bits may be accessed individually via the usual indexing notation (A[3]) as well as through all of the usual primitive functions and operators where they are often operated on using a special case algorithm such as summing the bits via a table lookup of bytes. The C programming language's bitfields, pseudo-objects found in structs with size equal to some number of bits, are in fact small bit arrays; they are limited in that they cannot span words. 
Although they give a convenient syntax, the bits are still accessed using bitwise operators on most machines, and they can only be defined statically (like C’s static arrays, their sizes are fixed at compiletime). It is also a common idiom for C programmers to use words as small bit arrays and access bits of them using bit operators. A widely available header file included in the X11 system, xtrapbits.h, is “a portable way for systems to define bit field manipulation of arrays of bits.” A more explanatory description of aforementioned approach can be found in the comp.lang.c faq. Bit arrays can be used for the allocation of memory pages, inodes, disk sectors, etc. In such cases, the term bitmap may be used. However, this term is frequently used to refer to raster images, which may use multiple bits per pixel. In C++, although individual bools typically occupy the Another application of bit arrays is the Bloom filter, a same space as a byte or an integer, the STL type vecprobabilistic set data structure that can store large sets tor<bool> is a partial template specialization in which in a small space in exchange for a small probability of bits are packed as a space efficiency optimization. Since error. It is also possible to build probabilistic hash tables bytes (and not bits) are the smallest addressable unit in 110 C++, the [] operator does not return a reference to an element, but instead returns a proxy reference. This might seem a minor point, but it means that vector<bool> is not a standard STL container, which is why the use of vector<bool> is generally discouraged. Another unique STL class, bitset,[1] creates a vector of bits fixed at a particular size at compile-time, and in its interface and syntax more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as the ability to efficiently count the number of bits that are set. 
The Boost C++ Libraries provide a dynamic_bitset class[2] whose size is specified at run-time. CHAPTER 4. SETS or word boundary— or unaligned— elements immediately follow each other with no padding. Hardware description languages such as VHDL, Verilog, and SystemVerilog natively support bit vectors as these are used to model storage elements like flip-flops, hardware busses and hardware signals in general. In hardware verification languages such as OpenVera, e and SystemVerilog, bit vectors are used to sample values from the hardware models, and to represent data that is transferred to hardware during simulations. The D programming language provides bit arrays in its 4.2.8 See also standard library, Phobos, in std.bitmanip. As in C++, the • Bit field [] operator does not return a reference, since individual bits are not directly addressable on most hardware, but • Arithmetic logic unit instead returns a bool. • Bitboard Chess and similar games. In Java, the class BitSet creates a bit array that is then manipulated with functions named after bitwise operators familiar to C programmers. Unlike the bitset in C++, the Java BitSet does not have a “size” state (it has an effectively infinite size, initialized with 0 bits); a bit can be set or tested at any index. In addition, there is a class EnumSet, which represents a Set of values of an enumerated type internally as a bit vector, as a safer alternative to bitfields. • Bitmap index • Binary numeral system • Bitstream • Judy array The .NET Framework supplies a BitArray collection 4.2.9 References class. It stores boolean values, supports random access [1] std::bitset and bitwise operators, can be iterated over, and its Length property can be changed to grow or truncate it. [2] boost::dynamic_bitset Although Standard ML has no support for bit arrays, Standard ML of New Jersey has an extension, the BitArray structure, in its SML/NJ Library. 
It is not fixed in size and supports set operations and bit operations, including, unusually, shift operations. [3] http://perldoc.perl.org/perlop.html# Bitwise-String-Operators [4] http://perldoc.perl.org/functions/vec.html Haskell likewise currently lacks standard support for bit- 4.2.10 External links wise operations, but both GHC and Hugs provide a Data.Bits module with assorted bitwise functions and op• mathematical bases by Pr. D.E.Knuth erators, including shift and rotate operations and an “un• vector<bool> Is Nonconforming, and Forces Optiboxed” array over boolean values may be used to model mization Choice a Bit array, although this lacks support from the former module. • vector<bool>: More Problems, Better Solutions In Perl, strings can be used as expandable bit arrays. They can be manipulated using the usual bitwise operators (~ | & ^),[3] and individual bits can be tested and set using the 4.3 Bloom filter vec function.[4] In Ruby, you can access (but not set) a bit of an integer Not to be confused with Bloom shader effect. (Fixnum or Bignum) using the bracket operator ([]), as if it were an array of bits. A Bloom filter is a space-efficient probabilistic data Apple’s Core Foundation library contains CFBitVector structure, conceived by Burton Howard Bloom in 1970, and CFMutableBitVector structures. that is used to test whether an element is a member of a PL/I supports arrays of bit strings of arbitrary length, set. False positive matches are possible, but false negawhich may be either fixed-length or varying. The array el- tives are not, thus a Bloom filter has a 100% recall rate. ements may be aligned— each element begins on a byte In other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the 4.3. BLOOM FILTER 111 set, but not removed (though this can be addressed with either the element is in the set, or the bits have by chance a “counting” filter). 
The more elements that are added to been set to 1 during the insertion of other elements, rethe set, the larger the probability of false positives. sulting in a false positive. In a simple Bloom filter, there Bloom proposed the technique for applications where is no way to distinguish between the two cases, but more the amount of source data would require an impracti- advanced techniques can address this problem. cally large amount of memory if “conventional” errorfree hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom’s technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses, an 85–15 form of the Pareto principle.[1] The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple “different” hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. 
For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate.[3] Specifically, Dillinger & Manolios (2004b) show the effectiveness of deriving the k indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple More generally, fewer than 10 bits per element are rerandom number generators seeded with the two or three quired for a 1% false positive probability, independent of hash values. [2] the size or number of elements in the set. 4.3.1 Algorithm description {x, y, z} 0 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0 w An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3. An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. Typically, k is a constant, much smaller than m, which is proportional to the number of elements to be added; the precise choice of k and the constant of proportionality of m are determined by the intended false positive rate of the filter. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then Removing an element from this simple Bloom filter is impossible because false negatives are not permitted. 
An element maps to k bits, and although setting any one of those k bits to zero suffices to remove the element, it also results in removing any other elements that happen to map onto that bit. Since there is no way of determining whether any other elements have been added that affect the bits for an element to be removed, clearing any of the bits would introduce the possibility for false negatives.

One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the “removed” filter.

It is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.

4.3.2 Space and time advantages

Figure: Bloom filter used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times. Bloom filter decisions are much faster. However some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without the Bloom filter. Use of a Bloom filter for this purpose, however, does increase memory usage.

While risking false positives, Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). However, Bloom filters do not store the data items at all, and a separate solution must be provided for the actual storage. Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element.

However, if the number of potential values is small and many of them can be in the set, the Bloom filter is easily surpassed by the deterministic bit array, which requires only one bit for each potential element. Note also that hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with k = 1.[4]

Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse hash tables can make them faster in practice than some Bloom filters. In a hardware implementation, however, the Bloom filter shines because its k lookups are independent and can be parallelized.

To understand its space efficiency, it is instructive to compare the general Bloom filter with its special case when k = 1. If k = 1, then in order to keep the false positive rate sufficiently low, a small fraction of bits should be set, which means the array must be very large and contain long runs of zeros. The information content of the array relative to its size is low. The generalized Bloom filter (k greater than 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters (k and m) are chosen well, about half of the bits will be set,[5] and these will be apparently random, minimizing redundancy and maximizing information content.

4.3.3 Probability of false positives

Figure (plot of p against n, one curve per filter size from log_2(m) = 8 to log_2(m) = 36): The false positive probability p as a function of number of elements n in the filter and the filter size m. An optimal number of hash functions k = (m/n) ln 2 has been assumed.

Assume that a hash function selects each array position with equal probability. If m is the number of bits in the array, the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is

$$1 - \frac{1}{m}.$$

If k is the number of hash functions, the probability that the bit is not set to 1 by any of the hash functions is

$$\left(1 - \frac{1}{m}\right)^{k}.$$

If we have inserted n elements, the probability that a certain bit is still 0 is

$$\left(1 - \frac{1}{m}\right)^{kn};$$

the probability that it is 1 is therefore

$$1 - \left(1 - \frac{1}{m}\right)^{kn}.$$

Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with a probability as above.
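The per-bit probabilities derived above are easy to check numerically. The sketch below evaluates the exact expression for a bit remaining 0 and compares it with the standard large-m approximation $e^{-kn/m}$. The function names and the sample values of m, k and n are assumptions chosen only for illustration.

```python
import math


def prob_bit_zero(m: int, k: int, n: int) -> float:
    """Exact probability that a given bit is still 0 after n insertions."""
    return (1.0 - 1.0 / m) ** (k * n)


def prob_bit_zero_approx(m: int, k: int, n: int) -> float:
    """Large-m approximation e^(-kn/m) of the same probability."""
    return math.exp(-k * n / m)


if __name__ == "__main__":
    m, k, n = 10_000, 7, 1_000  # illustrative values, not taken from the text
    exact = prob_bit_zero(m, k, n)
    approx = prob_bit_zero_approx(m, k, n)
    print(f"P(bit is 0): exact {exact:.6f}, approximation {approx:.6f}")
    print(f"P(bit is 1): exact {1.0 - exact:.6f}")
```

For filters of realistic size the two values agree to several decimal places, which is why the exponential form is commonly used in the analysis.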
For an element that is not in the set, the probability that all k of its array positions are 1, which would cause the algorithm to erroneously claim that the element is in the set, is often given as

$$\left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}.$$

This is not strictly correct as it assumes independence for the probabilities of each bit being set. However, assuming it is a close approximation, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases.

An alternative analysis arriving at the same approximation without the assumption of independence is given by Mitzenmacher and Upfal.[6] After all n items have been added to the Bloom filter, let q be the fraction of the m bits that are set to 0. (That is, the number of bits still set to 0 is qm.) Then, when testing membership of an element not in the set, for the array position given by any of the k hash functions, the probability that the bit is found set to 1 is $1 - q$. So the probability that all k hash functions find their bit set to 1 is $(1 - q)^k$. Further, the expected value of q is the probability that a given array position is left untouched by each of the k hash functions for each of the n items, which is (as above)

$$E[q] = \left(1 - \frac{1}{m}\right)^{kn}.$$

It is possible to prove, without the independence assumption, that q is very strongly concentrated around its expected value. In particular, from the Azuma–Hoeffding inequality, they prove that[7]

$$\Pr\left(\left|q - E[q]\right| \ge \frac{\lambda}{m}\right) \le 2\exp\left(-2\lambda^{2}/m\right).$$

Because of this, we can say that the exact probability of false positives is

$$\sum_{t} \Pr(q = t)\,(1 - t)^{k} \approx (1 - E[q])^{k} = \left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$

as before.

Optimal number of hash functions

For a given m and n, the value of k (the number of hash functions) that minimizes the false positive probability is

$$k = \frac{m}{n}\ln 2,$$

which gives

$$2^{-k} \approx 0.6185^{m/n}.$$

The required number of bits m, given n (the number of inserted elements) and a desired false positive probability p (and assuming the optimal value of k is used) can be computed by substituting the optimal value of k in the probability expression above:

$$p = \left(1 - e^{-\left(\frac{m}{n}\ln 2\right)\frac{n}{m}}\right)^{\frac{m}{n}\ln 2},$$

which can be simplified to:

$$\ln p = -\frac{m}{n}\left(\ln 2\right)^{2}.$$

This results in:

$$m = -\frac{n \ln p}{(\ln 2)^{2}}.$$

This means that for a given false positive probability p, the length of a Bloom filter m is proportional to the number of elements being filtered n.[8] While the above formula is asymptotic (i.e. applicable as m, n → ∞), the agreement with finite values of m and n is also quite good; the false positive probability for a finite Bloom filter with m bits, n elements, and k hash functions is at most

$$\left(1 - e^{-k(n+0.5)/(m-1)}\right)^{k}.$$

So we can use the asymptotic formula if we pay a penalty for at most half an extra element and at most one fewer bit.[9]

4.3.4 Approximating the number of items in a Bloom filter

Swamidass & Baldi (2007) showed that the number of items in a Bloom filter can be approximated with the following formula,

$$n^{*} = -\frac{m}{k}\ln\left[1 - \frac{X}{m}\right],$$

where $n^{*}$ is an estimate of the number of items in the filter, m is the length (size) of the filter, k is the number of hash functions, and X is the number of bits set to one.

4.3.5 The union and intersection of sets

Bloom filters are a way of compactly representing a set of items. It is common to try to compute the size of the intersection or union between two sets. Bloom filters can be used to approximate the size of the intersection and union of two sets. Swamidass & Baldi (2007) showed that for two Bloom filters of length m, the sizes of the sets they represent can be estimated as

$$n(A^{*}) = -\frac{m}{k}\ln\left[1 - \frac{n(A)}{m}\right]$$

and

$$n(B^{*}) = -\frac{m}{k}\ln\left[1 - \frac{n(B)}{m}\right],$$

where n(A) and n(B) are the numbers of bits set to one in the respective filters. The size of their union can be estimated as

$$n(A^{*} \cup B^{*}) = -\frac{m}{k}\ln\left[1 - \frac{n(A \cup B)}{m}\right],$$

where n(A ∪ B) is the number of bits set to one in either of the two Bloom filters. Finally, the intersection can be estimated as

$$n(A^{*} \cap B^{*}) = n(A^{*}) + n(B^{*}) - n(A^{*} \cup B^{*}),$$

using the three formulas together.

4.3.6 Interesting properties

• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements; adding an element never fails due to the data structure “filling up”. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.

• Union and intersection of Bloom filters with the same size and set of hash functions can be implemented with bitwise OR and AND operations respectively. The union operation on Bloom filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom filter created from scratch using the union of the two sets. The intersect operation satisfies a weaker property: the false positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false positive probability in the Bloom filter created from scratch using the intersection of the two sets.

• Some kinds of superimposed code can be seen as a Bloom filter implemented with physical edge-notched cards. An example is Zatocoding, invented by Calvin Mooers in 1947, in which the set of categories associated with a piece of information is represented by notches on a card, with a random pattern of four notches for each category.

4.3.7 Examples

• Akamai's web servers use Bloom filters to prevent “one-hit-wonders” from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure.
Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit-wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.[10]

• Google BigTable, Apache HBase, Apache Cassandra, and PostgreSQL[11] use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.[12]

• The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).[13][14]

• The Squid Web Proxy Cache uses Bloom filters for cache digests.[15]

• Bitcoin uses Bloom filters to speed up wallet synchronization.[16][17]

• The Venti archival storage system uses Bloom filters to detect previously stored data.[18]

• The SPIN model checker uses Bloom filters to track the reachable state space for large verification problems.[19]

• The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called Bloom join in the database literature).[20]

• The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature.[21]

• Medium uses Bloom filters to avoid recommending articles a user has previously read.[22]

4.3.8 Alternatives

Classic Bloom filters use $1.44 \log_2(1/\epsilon)$ bits of space per inserted key, where $\epsilon$ is the false positive rate of the Bloom filter. However, the space that is strictly necessary for any data structure playing the same role as a Bloom filter is only $\log_2(1/\epsilon)$ per key.[23] Hence Bloom filters use 44% more space than an equivalent optimal data structure. Pagh et al. provide an optimal-space data structure. Moreover, their data structure has constant locality of reference independent of the false positive rate, unlike Bloom filters, where a smaller false positive rate $\epsilon$ leads to a greater number of memory accesses per query, $\log(1/\epsilon)$. Also, it allows elements to be deleted without a space penalty, unlike Bloom filters. The same improved properties of optimal space usage, constant locality of reference, and the ability to delete elements are also provided by the cuckoo filter of Fan et al. (2014), an open source implementation of which is available.

Stern & Dill (1996) describe a probabilistic structure based on hash tables, hash compaction, which Dillinger & Manolios (2004b) identify as significantly more accurate than a Bloom filter when each is configured optimally. Dillinger and Manolios, however, point out that the reasonable accuracy of any given Bloom filter over a wide range of numbers of additions makes it attractive for probabilistic enumeration of state spaces of unknown size. Hash compaction is, therefore, attractive when the number of additions can be predicted accurately; however, despite being very fast in software, hash compaction is poorly suited for hardware because of worst-case linear access time.

Putze, Sanders & Singler (2007) have studied some variants of Bloom filters that are either faster or use less space than classic Bloom filters. The basic idea of the fast variant is to place the k hash values associated with each key in one or two blocks having the same size as the processor’s memory cache blocks (usually 64 bytes). This will presumably improve performance by reducing the number of potential memory cache misses. The proposed variants, however, have the drawback of using about 32% more space than classic Bloom filters.

The space efficient variant relies on using a single hash function that generates for each key a value in the range $[0, n/\epsilon]$, where $\epsilon$ is the requested false positive rate. The sequence of values is then sorted and compressed using Golomb coding (or some other compression technique) to occupy a space close to $n \log_2(1/\epsilon)$ bits. To query the Bloom filter for a given key, it will suffice to check if its corresponding value is stored in the Bloom filter. Decompressing the whole Bloom filter for each query would make this variant totally unusable. To overcome this problem the sequence of values is divided into small blocks of equal size that are compressed separately. At query time only half a block will need to be decompressed on average. Because of decompression overhead, this variant may be slower than classic Bloom filters, but this may be compensated by the fact that only a single hash function needs to be computed.

Another alternative to the classic Bloom filter is the one based on space efficient variants of cuckoo hashing. In this case, once the hash table is constructed, the keys stored in the hash table are replaced with short signatures of the keys. Those signatures are strings of bits computed using a hash function applied on the keys.

4.3.9 Extensions and applications

Cache filtering

Figure: Using a Bloom filter to prevent one-hit-wonders from being stored in a web cache decreased the rate of disk writes by nearly one half, reducing the load on the disks and potentially increasing disk performance.[10]

Content delivery networks deploy web caches around the world to cache and serve web content to users with greater performance and reliability. A key application of Bloom filters is their use in efficiently determining which web objects to store in these web caches. Nearly three-quarters of the URLs accessed from a typical web cache are “one-hit-wonders” that are accessed by users only once and never again. It is clearly wasteful of disk resources to store one-hit-wonders in a web cache, since they will never be accessed again.
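The admission policy mentioned in the Akamai example, caching an object only on its second request, can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any production system. SeenFilter, handle_request, the dictionary standing in for the disk cache, and the fetch callback are all hypothetical names.

```python
import hashlib


class SeenFilter:
    """Tiny Bloom filter used only to remember which URLs were requested."""

    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key: str):
        # Derive k positions from two 64-bit slices of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(key))


def handle_request(url, seen, disk_cache, fetch):
    """Serve url, admitting it to disk_cache only on a repeat request."""
    if url in disk_cache:
        return disk_cache[url]      # cache hit: no origin fetch needed
    body = fetch(url)
    if url in seen:                 # probably requested before: worth caching
        disk_cache[url] = body
    else:                           # first sighting: remember it, do not cache
        seen.add(url)
    return body
```

A false positive in the filter only causes an object to be cached one request early, so the error is harmless for correctness and merely costs some cache space.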
To prevent caching one-hit-wonders, a Bloom filter is used to keep track of all URLs that are accessed by users. A web object is cached only when it has been accessed at least once before, i.e., the object is cached on its second request. The use of a Bloom filter in this fashion significantly reduces the disk write workload, since one-hit-wonders are never written to the disk cache. Further, filtering out the one-hit-wonders also saves cache space on disk, increasing the cache hit rates.[10]

Counting filters

Counting filters provide a way to implement a delete operation on a Bloom filter without recreating the filter afresh. In a counting filter the array positions (buckets) are extended from being a single bit to being an n-bit counter. In fact, regular Bloom filters can be considered as counting filters with a bucket size of one bit. Counting filters were introduced by Fan et al. (2000).

The insert operation is extended to increment the value of the buckets, and the lookup operation checks that each of the required buckets is non-zero. The delete operation then consists of decrementing the value of each of the respective buckets.

Arithmetic overflow of the buckets is a problem and the buckets should be sufficiently large to make this case rare. If it does occur then the increment and decrement operations must leave the bucket set to the maximum possible value in order to retain the properties of a Bloom filter.

The size of counters is usually 3 or 4 bits. Hence counting Bloom filters use 3 to 4 times more space than static Bloom filters. In contrast, the data structures of Pagh, Pagh & Rao (2005) and Fan et al. (2014) also allow deletions but use less space than a static Bloom filter.

Another issue with counting filters is limited scalability. Because the counting Bloom filter table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false positive rate will grow rapidly as more keys are inserted.

Bonomi et al. (2006) introduced a data structure based on d-left hashing that is functionally equivalent but uses approximately half as much space as counting Bloom filters. The scalability issue does not occur in this data structure. Once the designed capacity is exceeded, the keys could be reinserted in a new hash table of double size.

The space efficient variant by Putze, Sanders & Singler (2007) could also be used to implement counting filters by supporting insertions and deletions.

Rottenstreich, Kanizo & Keslassy (2012) introduced a new general method based on variable increments that significantly improves the false positive probability of counting Bloom filters and their variants, while still supporting deletions. Unlike counting Bloom filters, at each element insertion, the hashed counters are incremented by a hashed variable increment instead of a unit increment. To query an element, the exact values of the counters are considered and not just their positiveness. If a sum represented by a counter value cannot be composed of the corresponding variable increment for the queried element, a negative answer can be returned to the query.

Decentralized aggregation

Bloom filters can be organized in distributed data structures to perform fully decentralized computations of aggregate functions. Decentralized aggregation makes collective measurements locally available in every node of a distributed network without involving a centralized computational entity for this purpose.[24]

Data synchronization

Bloom filters can be used for approximate data synchronization as in Byers et al. (2004). Counting Bloom filters can be used to approximate the number of differences between two sets and this approach is described in Agarwal & Trachtenberg (2006).

Bloomier filters

Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of “Bloomier filters”, a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.

Compact approximators

Boldi & Vigna (2005) proposed a lattice-based generalization of Bloom filters. A compact approximator associates to each key an element of a lattice (the standard Bloom filters being the case of the Boolean two-element lattice). Instead of a bit array, they have an array of lattice elements. When adding a new association between a key and an element of the lattice, they compute the maximum of the current contents of the k array locations associated to the key with the lattice element. When reading the value associated to a key, they compute the minimum of the values found in the k locations associated to the key. The resulting value approximates from above the original value.

Stable Bloom filters

Deng & Rafiei (2006) proposed Stable Bloom filters as a variant of Bloom filters for streaming data. The idea is that since there is no way to store the entire history of a stream (which can be infinite), Stable Bloom filters continuously evict stale information to make room for more recent elements. Since stale information is evicted, the Stable Bloom filter introduces false negatives, which do not appear in traditional Bloom filters. The authors show that a tight upper bound of false positive rates is guaranteed, and the method is superior to standard Bloom filters in terms of false positive rates and time efficiency when a small space and an acceptable false positive rate are given.

Scalable Bloom filters

Almeida et al. (2007) proposed a variant of Bloom filters that can adapt dynamically to the number of elements stored, while assuring a minimum false positive probability. The technique is based on sequences of standard Bloom filters with increasing capacity and tighter false positive probabilities, so as to ensure that a maximum false positive probability can be set beforehand, regardless of the number of elements to be inserted.

Layered Bloom filters

A layered Bloom filter consists of multiple Bloom filter layers. Layered Bloom filters allow keeping track of how many times an item was added to the Bloom filter by checking how many layers contain the item. With a layered Bloom filter a check operation will normally return the deepest layer number the item was found in.[25]

Attenuated Bloom filters

Figure: Attenuated Bloom Filter Example: Search for pattern 11010, starting from node n1.

An attenuated Bloom filter of depth D can be viewed as an array of D normal Bloom filters. In the context of service discovery in a network, each node stores regular and attenuated Bloom filters locally. The regular or local Bloom filter indicates which services are offered by the node itself. The attenuated filter of level i indicates which services can be found on nodes that are i hops away from the current node. The i-th value is constructed by taking a union of local Bloom filters for nodes i hops away from the node.[26]

Take the small network shown in the figure as an example. Say we are searching for a service A whose id hashes to bits 0, 1, and 3 (pattern 11010). Let node n1 be the starting point. First, we check whether service A is offered by n1 by checking its local filter. Since the patterns don't match, we check the attenuated Bloom filter in order to determine which node should be the next hop. We see that n2 doesn't offer service A but lies on the path to nodes that do. Hence, we move to n2 and repeat the same procedure. We quickly find that n3 offers the service, and hence the destination is located.[27]

By using attenuated Bloom filters consisting of multiple layers, services at more than one hop distance can be discovered while avoiding saturation of the Bloom filter by attenuating (shifting out) bits set by sources further away.[26]

Chemical structure searching

Bloom filters are often used to search large chemical structure databases (see chemical similarity). In the simplest case, the elements added to the filter (called a fingerprint in this field) are just the atomic numbers present in the molecule, or a hash based on the atomic number of each atom and the number and type of its bonds. This case is too simple to be useful. More advanced filters also encode atom counts, larger substructure features like carboxyl groups, and graph properties like the number of rings. In hash-based fingerprints, a hash function based on atom and bond properties is used to turn a subgraph into a PRNG seed, and the first output values are used to set bits in the Bloom filter.

Molecular fingerprints started in the late 1940s as a way to search for chemical structures on punched cards. However, it wasn't until around 1990 that Daylight introduced a hash-based method to generate the bits, rather than use a precomputed table. Unlike the dictionary approach, the hash method can assign bits for substructures which hadn't previously been seen. In the early 1990s, the term “fingerprint” was considered different from “structural keys”, but the term has since grown to encompass most molecular characteristics which can be used for a similarity comparison, including structural keys, sparse count fingerprints, and 3D fingerprints. Unlike Bloom filters, the Daylight hash method allows the number of bits assigned per feature to be a function of the feature size, but most implementations of Daylight-like fingerprints use a fixed number of bits per feature, which makes them a Bloom filter. The original Daylight fingerprints could be used for both similarity and screening purposes. Many other fingerprint types, like the popular ECFP2, can be used for similarity but not for screening because they include local environmental characteristics that introduce false negatives when used as a screen. Even if these are constructed with the same mechanism, these are not Bloom filters because they cannot be used to filter.

4.3.10 See also

• Count–min sketch
• Feature hashing
• MinHash
• Quotient filter
• Skip list

4.3.11 Notes

[1] Bloom (1970).
[2] Bonomi et al. (2006).
[3] Dillinger & Manolios (2004a); Kirsch & Mitzenmacher (2006).
[4] Mitzenmacher & Upfal (2005).
[5] Blustein & El-Maazawi (2002), pp. 21–22.
[6] Mitzenmacher & Upfal (2005), pp. 109–111, 308.
[7] Mitzenmacher & Upfal (2005), p. 308.
[8] Starobinski, Trachtenberg & Agarwal (2003).
[9] Goel & Gupta (2010).
[10] Maggs & Sitaraman (2015).
[11] “Bloom index contrib module”. Postgresql.org. 2016-04-01. Retrieved 2016-06-18.
[12] Chang et al. (2006); Apache Software Foundation (2012).
[13] Yakunin, Alex (2010-03-25). “Alex Yakunin’s blog: Nice Bloom filter application”. Blog.alexyakunin.com. Retrieved 2014-05-31.
[14] “Issue 10896048: Transition safe browsing from bloom filter to prefix set. - Code Review”. Chromiumcodereview.appspot.com. Retrieved 2014-07-03.
[15] Wessels (2004).
[16] Bitcoin 0.8.0
[17] “The Bitcoin Foundation - Supporting the development of Bitcoin”. bitcoinfoundation.org.
[18] “Plan 9 /sys/man/8/venti”. Plan9.bell-labs.com. Retrieved 2014-05-31.
[19] http://spinroot.com/
[20] Mullin (1990).
[21] “Exim source code”. github. Retrieved 2014-03-03.
[22] “What are Bloom filters?”. Medium. Retrieved 2015-11-01.
[23] Pagh, Pagh & Rao (2005).
[24] Pournaras, Warnier & Brazier (2013).
[25] Zhiwang, Jungang & Jian (2010).
[26] Koucheryavy et al. (2009).
[27] Kubiatowicz et al. (2000).

4.3.12 References

• Agarwal, Sachin; Trachtenberg, Ari (2006), “Approximating the number of differences between remote sets” (PDF), IEEE Information Theory Workshop, Punta del Este, Uruguay: 217, doi:10.1109/ITW.2006.1633815, ISBN 1-4244-0035-X

• Ahmadi, Mahmood; Wong, Stephan (2007), “A Cache Architecture for Counting Bloom Filters”, 15th international Conference on Networks (ICON2007), p. 218, doi:10.1109/ICON.2007.4444089, ISBN 978-1-4244-1229-7

• Almeida, Paulo; Baquero, Carlos; Preguica, Nuno; Hutchison, David (2007), “Scalable Bloom Filters” (PDF), Information Processing Letters, 101 (6): 255–261, doi:10.1016/j.ipl.2006.10.007

• Apache Software Foundation (2012), “11.6. Schema Design”, The Apache HBase Reference Guide, Revision 0.94.27

• Bloom, Burton H. (1970), “Space/Time Tradeoffs in Hash Coding with Allowable Errors”, Communications of the ACM, 13 (7): 422–426, doi:10.1145/362686.362692

• Blustein, James; El-Maazawi, Amal (2002), “optimal case for general Bloom filters”, Bloom Filters - A Tutorial, Analysis, and Survey, Dalhousie University Faculty of Computer Science, pp. 1–31

• Boldi, Paolo; Vigna, Sebastiano (2005), “Mutable strings in Java: design, implementation and lightweight text-search algorithms”, Science of Computer Programming, 54 (1): 3–23, doi:10.1016/j.scico.2004.05.003

• Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), “An Improved Construction for Counting Bloom Filters”, Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science, 4168, pp. 684–695, doi:10.1007/11841036_61, ISBN 978-3-540-38875-3

• Broder, Andrei; Mitzenmacher, Michael (2005), “Network Applications of Bloom Filters: A Survey” (PDF), Internet Mathematics, 1 (4): 485–509, doi:10.1080/15427951.2004.10129096

• Byers, John W.; Considine, Jeffrey; Mitzenmacher, Michael; Rost, Stanislav (2004), “Informed content delivery across adaptive overlay networks”, IEEE/ACM Transactions on Networking, 12 (5): 767, doi:10.1109/TNET.2004.836103

• Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson; Wallach, Deborah; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; Gruber, Robert (2006), “Bigtable: A Distributed Storage System for Structured Data”, Seventh Symposium on Operating System Design and Implementation

• Charles, Denis; Chellapilla, Kumar (2008), “Bloomier Filters: A second look”, The Computing Research Repository (CoRR), arXiv:0807.0928

• Chazelle, Bernard; Kilian, Joe; Rubinfeld, Ronitt; Tal, Ayellet (2004), “The Bloomier filter: an efficient data structure for static support lookup tables”, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 30–39

• Cohen, Saar; Matias, Yossi (2003), “Spectral Bloom Filters”, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (PDF), pp. 241–252, doi:10.1145/872757.872787, ISBN 158113634X

• Deng, Fan; Rafiei, Davood (2006), “Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters”, Proceedings of the ACM SIGMOD Conference (PDF), pp. 25–36

• Dharmapurikar, Sarang; Song, Haoyu; Turner, Jonathan; Lockwood, John (2006), “Fast packet classification using Bloom filters”, Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems (PDF), pp. 61–70, doi:10.1145/1185347.1185356, ISBN 1595935800

• Dietzfelbinger, Martin; Pagh, Rasmus (2008), “Succinct Data Structures for Retrieval and Approximate Membership”, The Computing Research Repository (CoRR), arXiv:0803.3693

• Dillinger, Peter C.; Manolios, Panagiotis (2004a), “Fast and Accurate Bitstate Verification for SPIN”, Proceedings of the 11th International Spin Workshop on Model Checking Software, Springer-Verlag, Lecture Notes in Computer Science 2989

• Dillinger, Peter C.; Manolios, Panagiotis (2004b), “Bloom Filters in Probabilistic Verification”, Proceedings of the 5th International Conference on Formal Methods in Computer-Aided Design, Springer-Verlag, Lecture Notes in Computer Science 3312

• Eppstein, David; Goodrich, Michael T. (2007), “Space-efficient straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters”, Algorithms and Data Structures, 10th International Workshop, WADS 2007, Springer-Verlag, Lecture Notes in Computer Science 4619, pp. 637–648, arXiv:0704.3313

• Fan, Bin; Andersen, Dave G.; Kaminsky, Michael; Mitzenmacher, Michael D. (2014), “Cuckoo filter: Practically better than Bloom”, Proc. 10th ACM Int. Conf. Emerging Networking Experiments and Technologies (CoNEXT '14), pp. 75–88, doi:10.1145/2674005.2674994. Open source implementation available on github.

• Fan, Li; Cao, Pei; Almeida, Jussara; Broder, Andrei (2000), “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol”, IEEE/ACM Transactions on Networking, 8 (3): 281–293, doi:10.1109/90.851975. A preliminary version appeared at SIGCOMM '98.

• Goel, Ashish; Gupta, Pankaj (2010), “Small subset queries and bloom filters using ternary associative memories, with applications”, ACM Sigmetrics 2010, 38: 143, doi:10.1145/1811099.1811056

• Haghighat, Mohammad Hashem; Tavakoli, Mehdi; Kharrazi, Mehdi (2013), “Payload Attribution via Character Dependent Multi-Bloom Filters”, Transaction on Information Forensics and Security, IEEE, 99 (5): 705, doi:10.1109/TIFS.2013.2252341

• Donnet, Benoit; Baynat, Bruno; Friedman, Timur (2006), “Retouched Bloom Filters: Allowing Networked Applications to Flexibly Trade
Off False Positives Against False Negatives”, CoNEXT 06 – 2nd Conference on Future Networking Technologies • Kirsch, Adam; Mitzenmacher, Michael (2006), “Less Hashing, Same Performance: Building a Better Bloom Filter”, in Azar, Yossi; Erlebach, Thomas, Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science, 4168, Springer-Verlag, Lecture Notes in Computer Science 4168, pp. 456–467, doi:10.1007/11841036, ISBN 978-3-540-38875-3 • Koucheryavy, Y.; Giambene, G.; Staehle, D.; Barcelo-Arroyo, F.; Braun, T.; Siris, V. (2009), “Traffic and QoS Management in Wireless Multimedia Networks”, COST 290 Final Report, USA: 111 • Kubiatowicz, J.; Bindel, D.; Czerwinski, Y.; Geels, S.; Eaton, D.; Gummadi, R.; Rhea, S.; Weatherspoon, H.; et al. (2000), “Oceanstore: An architecture for global-scale persistent storage” (PDF), ACM SIGPLAN Notices, USA: 190–201 • Maggs, Bruce M.; Sitaraman, Ramesh K. (July 2015), “Algorithmic nuggets in content delivery”, SIGCOMM Computer Communication Review, New York, NY, USA: ACM, 45 (3): 52–66, doi:10.1145/2805789.2805800 120 • Mitzenmacher, Michael; Upfal, Eli (2005), Probability and computing: Randomized algorithms and probabilistic analysis, Cambridge University Press, pp. 107–112, ISBN 9780521835404 • Mortensen, Christian Worm; Pagh, Rasmus; Pătraşcu, Mihai (2005), “On dynamic range reporting in one dimension”, Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing, pp. 104–111, doi:10.1145/1060590.1060606, ISBN 1581139608 • Mullin, James K. (1990), “Optimal semijoins for distributed database systems”, Software Engineering, IEEE Transactions on, 16 (5): 558–560, doi:10.1109/32.52778 • Pagh, Anna; Pagh, Rasmus; Rao, S. Srinivasa (2005), “An optimal Bloom filter replacement”, Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 
823– 829 • Porat, Ely (2008), “An Optimal Bloom Filter Replacement Based on Matrix Solving”, The Computing Research Repository (CoRR), arXiv:0804.1845 • Pournaras, E.; Warnier, M.; Brazier, F.M.T.. (2013), “A generic and adaptive aggregation service for large-scale decentralized networks”, Complex Adaptive Systems Modeling, 1:19, doi:10.1186/2194-3206-1-19. Prototype implementation available on github. • Putze, F.; Sanders, P.; Singler, J. (2007), “Cache-, Hash- and Space-Efficient Bloom Filters”, in Demetrescu, Camil, Experimental Algorithms, 6th International Workshop, WEA 2007 (PDF), Lecture Notes in Computer Science, 4525, Springer-Verlag, Lecture Notes in Computer Science 4525, pp. 108– 121, doi:10.1007/978-3-540-72845-0, ISBN 9783-540-72844-3 CHAPTER 4. SETS • Shanmugasundaram, Kulesh; Brönnimann, Hervé; Memon, Nasir (2004), “Payload attribution via hierarchical Bloom filters”, Proceedings of the 11th ACM Conference on Computer and Communications Security, pp. 31–41, doi:10.1145/1030083.1030089, ISBN 1581139616 • Starobinski, David; Trachtenberg, Ari; Agarwal, Sachin (2003), “Efficient PDA Synchronization”, IEEE Transactions on Mobile Computing, 2 (1): 40, doi:10.1109/TMC.2003.1195150 • Stern, Ulrich; Dill, David L. (1996), “A New Scheme for Memory-Efficient Probabilistic Verification”, Proceedings of Formal Description Techniques for Distributed Systems and Communication Protocols, and Protocol Specification, Testing, and Verification: IFIP TC6/WG6.1 Joint International Conference, Chapman & Hall, IFIP Conference Proceedings, pp. 333–348, CiteSeerX: 10.1.1.47.4101 • Swamidass, S. Joshua; Baldi, Pierre (2007), “Mathematical correction for fingerprint similarity measures to improve chemical retrieval”, Journal of chemical information and modeling, ACS Publications, 47 (3): 952–964, doi:10.1021/ci600526a, PMID 17444629 • Wessels, Duane (January 2004), “10.7 Cache Digests”, Squid: The Definitive Guide (1st ed.), O'Reilly Media, p. 
172, ISBN 0-596-00162-2, Cache Digests are based on a technique first published by Pei Cao, called Summary Cache. The fundamental idea is to use a Bloom filter to represent the cache contents.
• Zhiwang, Cen; Jungang, Xu; Jian, Sun (2010), “A multi-layer Bloom filter for duplicated URL detection”, Proc. 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE 2010), 1, pp. V1–586–V1–591, doi:10.1109/ICACTE.2010.5578947
• Rottenstreich, Ori; Kanizo, Yossi; Keslassy, Isaac (2012), “The Variable-Increment Counting Bloom Filter”, 31st Annual IEEE International Conference on Computer Communications, 2012, Infocom 2012 (PDF), pp. 1880–1888, doi:10.1109/INFCOM.2012.6195563, ISBN 978-1-4673-0773-4
• Sethumadhavan, Simha; Desikan, Rajagopalan; Burger, Doug; Moore, Charles R.; Keckler, Stephen W. (2003), “Scalable hardware memory disambiguation for high ILP processors”, 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, MICRO-36 (PDF), pp. 399–410, doi:10.1109/MICRO.2003.1253244, ISBN 0-7695-2043-X

4.3.13 External links

• Why Bloom filters work the way they do (Michael Nielsen, 2012)
• Bloom Filters — A Tutorial, Analysis, and Survey (Blustein & El-Maazawi, 2002) at Dalhousie University
• Table of false-positive rates for different configurations from a University of Wisconsin–Madison website
• Interactive Processing demonstration from ashcan.org
• “More Optimal Bloom Filters,” Ely Porat (Nov/2007) Google TechTalk video on YouTube
• “Using Bloom Filters” Detailed Bloom Filter explanation using Perl
• “A Garden Variety of Bloom Filters” - Explanation and Analysis of Bloom filter variants
• “Bloom filters, fast and simple” - Explanation and example implementation in Python

4.4 MinHash

In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997),[1] and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results.[2] It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.[1]

4.4.1 Jaccard similarity and minimum hash values

The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. For sets A and B it is defined to be the ratio of the number of elements of their intersection and the number of elements of their union:

J(A, B) = |A ∩ B| / |A ∪ B|.

This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the intersection and union.

Let h be a hash function that maps the members of A and B to distinct integers, and for any set S define h_min(S) to be the minimal member of S with respect to h, that is, the member x of S with the minimum value of h(x). Now, if we apply h_min to both A and B, we will get the same value exactly when the element of the union A ∪ B with minimum hash value lies in the intersection A ∩ B. The probability of this being true is the ratio above, and therefore:

Pr[ h_min(A) = h_min(B) ] = J(A,B).

That is, the probability that h_min(A) = h_min(B) is equal to the similarity J(A,B), assuming h is chosen at random, so that every element of A ∪ B is equally likely to receive the minimum hash value. In other words, if r is the random variable that is one when h_min(A) = h_min(B) and zero otherwise, then r is an unbiased estimator of J(A,B). r has too high a variance to be a useful estimator for the Jaccard similarity on its own, because it is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.

4.4.2 Algorithm

Variant with many hash functions

The simplest version of the minhash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of h_min(S) for these k functions.

To estimate J(A,B) using this version of the scheme, let y be the number of hash functions for which h_min(A) = h_min(B), and use y/k as the estimate. This estimate is the average of k different 0-1 random variables, each of which is one when h_min(A) = h_min(B) and zero otherwise, and each of which is an unbiased estimator of J(A,B). Therefore, their average is also an unbiased estimator, and by standard Chernoff bounds for sums of 0-1 random variables, its expected error is O(1/√k).[3]

Therefore, for any constant ε > 0 there is a constant k = O(1/ε²) such that the expected error of the estimate is at most ε. For example, 400 hashes would be required to estimate J(A,B) with an expected error less than or equal to 0.05.

Variant with a single hash function

It may be computationally expensive to compute multiple hash functions, but a related version of the MinHash scheme avoids this penalty by using only a single hash function and using it to select multiple values from each set rather than selecting only a single minimum value per hash function. Let h be a hash function, and let k be a fixed integer. If S is any set of k or more values in the domain of h, define h_(k)(S) to be the subset of the k members of S that have the smallest values of h. This subset h_(k)(S) is used as a signature for the set S, and the similarity of any two sets is estimated by comparing their signatures.

Specifically, let A and B be any two sets. Then X = h_(k)(h_(k)(A) ∪ h_(k)(B)) = h_(k)(A ∪ B) is a set of k elements of A ∪ B, and if h is a random function then any subset of k elements is equally likely to be chosen; that is, X is a simple random sample of A ∪ B. The subset Y = X ∩ h_(k)(A) ∩ h_(k)(B) is the set of members of X that belong to the intersection A ∩ B. Therefore, |Y|/k is an unbiased estimator of J(A,B). The difference between this estimator and the estimator produced by multiple hash functions is that X always has exactly k members, whereas the multiple hash functions may lead to a smaller number of sampled elements due to the possibility that two different hash functions may have the same minima. However, when k is small relative to the sizes of the sets, this difference is negligible.

By standard Chernoff bounds for sampling without replacement, this estimator has expected error O(1/√k), matching the performance of the multiple-hash-function scheme.

Time analysis

The estimator |Y|/k can be computed in time O(k) from the two signatures of the given sets, in either variant of the scheme. Therefore, when ε and k are constants, the time to compute the estimated similarity from the signatures is also constant. The signature of each set can be computed in time linear in the size of the set, so when many pairwise similarities need to be estimated this method can lead to a substantial savings in running time compared to doing a full comparison of the members of each set. Specifically, for set size n the many hash variant takes O(n k) time. The single hash variant is generally faster, requiring O(n) time to maintain the queue of minimum hash values assuming n >> k.[1]

4.4.3 Min-wise independent permutations

In order to implement the MinHash scheme as described above, one needs the hash function h to define a random permutation on n elements, where n is the total number of distinct elements in the union of all of the sets to be compared. But because there are n! different permutations, it would require Ω(n log n) bits just to specify a truly random permutation, an infeasibly large number for even moderate values of n. Because of this fact, by analogy to the theory of universal hashing, there has been significant work on finding a family of permutations that is “min-wise independent”, meaning that for any subset of the domain, any element is equally likely to be the minimum. It has been established that a min-wise independent family of permutations must include at least

lcm(1, 2, ..., n) ≥ e^(n − o(n))

different permutations, and therefore that it needs Ω(n) bits to specify a single permutation, still infeasibly large.[2]

Because of this impracticality, two variant notions of min-wise independence have been introduced: restricted min-wise independent permutations families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k.[4] Approximate min-wise independence has at most a fixed probability ε of varying from full independence.[5]

4.4.4 Applications

The original applications for MinHash involved clustering and eliminating near-duplicates among web documents, represented as sets of the words occurring in those documents.[1][2] Similar techniques have also been used for clustering and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.[6]

In data mining, Cohen et al. (2001) use MinHash as a tool for association rule learning. Given a database in which each entry has multiple attributes (viewed as a 0–1 matrix with a row per database entry and a column per attribute) they use MinHash-based approximations to the Jaccard index to identify candidate pairs of attributes that frequently co-occur, and then compute the exact value of the index for only those pairs to determine the ones whose frequencies of co-occurrence are below a given strict threshold.[7]

4.4.5 Other uses

The MinHash scheme may be seen as an instance of locality sensitive hashing, a collection of techniques for using hash functions to map large sets of objects down to smaller hash values in such a way that, when two objects have a small distance from each other, their hash values are likely to be the same. In this instance, the signature of a set may be seen as its hash value. Other locality sensitive hashing techniques exist for Hamming distance between sets and cosine distance between vectors; locality sensitive hashing has important applications in nearest neighbor search algorithms.[8] For large distributed systems, and in particular MapReduce, there exist modified versions of MinHash to help compute similarities with no dependence on the point dimension.[9]

4.4.6 Evaluation and benchmarks

A large scale evaluation was conducted by Google in 2006[10] to compare the performance of Minhash and Simhash[11] algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling[12] and using Minhash and LSH for Google News personalization.[13]

4.4.7 See also

• w-shingling
• Count-min sketch

4.4.8 References

[1] Broder, Andrei Z. (1997), “On the resemblance and containment of documents”, Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997 (PDF), IEEE, pp. 21–29, doi:10.1109/SEQUEN.1997.666900.
[2] Broder, Andrei Z.; Charikar, Moses; Frieze, Alan M.; Mitzenmacher, Michael (1998), “Min-wise independent permutations”, Proc. 30th ACM Symposium on Theory of Computing (STOC '98), New York, NY, USA: Association for Computing Machinery, pp. 327–336, doi:10.1145/276698.276781.
[3] Vassilvitskii, Sergey (2011), COMS 6998-12: Dealing with Massive Data (lecture notes, Columbia university) (PDF).
[4] Matoušek, Jiří; Stojaković, Miloš (2003), “On restricted min-wise independence of permutations”, Random Structures and Algorithms, 23 (4): 397–408, doi:10.1002/rsa.10101.
[5] Saks, M.; Srinivasan, A.; Zhou, S.; Zuckerman, D. (2000), “Low discrepancy sets yield approximate min-wise independent permutation families”, Information Processing Letters, 73 (1–2): 29–32, doi:10.1016/S0020-0190(99)00163-5.
[6] Chum, Ondřej; Philbin, James; Isard, Michael; Zisserman, Andrew (2007), “Scalable near identical image and shot detection”, Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR'07), doi:10.1145/1282280.1282359; Chum, Ondřej; Philbin, James; Zisserman, Andrew (2008), “Near duplicate image detection: min-hash and tf-idf weighting”, Proceedings of the British Machine Vision Conference (PDF), 3, p. 4.
[7] Cohen, E.; Datar, M.; Fujiwara, S.; Gionis, A.; Indyk, P.; Motwani, R.; Ullman, J. D.; Yang, C. (2001), “Finding interesting associations without support pruning”, IEEE Transactions on Knowledge and Data Engineering, 13 (1): 64–78, doi:10.1109/69.908981.
[8] Andoni, Alexandr; Indyk, Piotr (2008), “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions”, Communications of the ACM, 51 (1): 117–122, doi:10.1145/1327452.1327494.
[9] Zadeh, Reza; Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082.
[10] Henzinger, Monika (2006), “Finding near-duplicate web pages: a large-scale evaluation of algorithms”, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (PDF), doi:10.1145/1148170.1148222.
[11] Charikar, Moses S. (2002), “Similarity estimation techniques from rounding algorithms”, Proceedings of the 34th Annual ACM Symposium on Theory of Computing, doi:10.1145/509907.509965.
[12] Manku, Gurmeet Singh; Jain, Arvind; Das Sarma, Anish (2007), “Detecting near-duplicates for web crawling”, Proceedings of the 16th International Conference on World Wide Web (PDF), doi:10.1145/1242572.1242592.
[13] Das, Abhinandan S.; Datar, Mayur; Garg, Ashutosh; Rajaram, Shyam; et al. (2007), “Google news personalization: scalable online collaborative filtering”, Proceedings of the 16th International Conference on World Wide Web, doi:10.1145/1242572.1242610.

4.4.9 External links

• Mining of Massive Datasets, Ch. 3. Finding similar Items
• Simple Simhashing
• Set Similarity & MinHash - C# implementation
• Minhash with LSH for all-pair search (C# implementation)
• MinHash – Java implementation
• MinHash – Scala implementation and a duplicate detection tool
• All pairs similarity search (Google Research)
• Distance and Similarity Measures (Wolfram Alpha)
• Nilsimsa hash (Python implementation)
• Simhash

4.5 Disjoint-set data structure

[Figure: MakeSet creates 8 singletons.]
[Figure: After some operations of Union, some sets are grouped together.]

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that keeps track of a set of elements partitioned into a number of disjoint (nonoverlapping) subsets. It supports two useful operations:

• Find: Determine which subset a particular element is in. Find typically returns an item from this set that serves as its “representative”; by comparing the result of two Find operations, one can determine whether two elements are in the same subset.
• Union: Join two subsets into a single subset.

The other important operation, MakeSet, which makes a set containing only a given element (a singleton), is generally trivial. With these three operations, many practical partitioning problems can be solved (see the Applications section).
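To make the behaviour of MakeSet, Find and Union concrete, the following is a minimal Python sketch of their semantics only. The class name NaiveDisjointSets and the dictionary that maps every element to its representative are assumptions chosen for illustration; they are not taken from the text, and this naive Union takes time linear in the number of elements.

```python
class NaiveDisjointSets:
    """Illustrative only: each element stores the representative of its set."""

    def __init__(self):
        # Hypothetical representation: element -> representative of its set.
        self._rep = {}

    def make_set(self, x):
        """MakeSet: create the singleton {x}, with x as its own representative."""
        if x not in self._rep:
            self._rep[x] = x

    def find(self, x):
        """Find: return the representative of the set that contains x."""
        return self._rep[x]

    def union(self, x, y):
        """Union: merge the sets containing x and y into a single set."""
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return  # already in the same set
        # Relabel every member of y's set so that it points at x's representative.
        for element, rep in list(self._rep.items()):
            if rep == ry:
                self._rep[element] = rx


# Start from 8 singletons, then group some of them with Union.
sets = NaiveDisjointSets()
for i in range(1, 9):
    sets.make_set(i)
for a, b in [(1, 2), (2, 5), (5, 6), (6, 8), (3, 4)]:
    sets.union(a, b)

# Two elements are in the same subset exactly when Find agrees on them.
same_subset = sets.find(1) == sets.find(8)
different_subsets = sets.find(3) != sets.find(7)
```

Comparing the results of two find calls is the membership test described above; the particular sequence of unions is arbitrary and only serves as an example.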
In order to define these operations more precisely, some way of representing the sets is needed. One common approach is to select a fixed element of each set, called its representative, to represent the set as a whole. Then, Find(x) returns the representative of the set that x belongs to, and Union takes two set representatives as its arguments.

4.5.1 Disjoint-set linked lists

A simple disjoint-set data structure uses a linked list for each set. The element at the head of each list is chosen as its representative.

MakeSet creates a list of one element. Union appends the two lists, a constant-time operation if the list carries a pointer to its tail. The drawback of this implementation is that Find requires O(n) or linear time to traverse the list backwards from a given element to the head of the list.

This can be avoided by including in each linked list node a pointer to the head of the list; then Find takes constant time, since this pointer refers directly to the set representative. However, Union now has to update each element of the list being appended to make it point to the head of the new combined list, requiring O(n) time.

When the length of each list is tracked, the required time can be improved by always appending the smaller list to the longer. Using this weighted-union heuristic, a sequence of m MakeSet, Union, and Find operations on n elements requires O(m + n log n) time.[1] For asymptotically faster operations, a different data structure is needed.

Analysis of the naive approach

We now explain the bound O(n log(n)) above. Suppose you have a collection of lists and each node of each list contains an object, the name of the list to which it belongs, and the number of elements in that list. Also assume that the total number of elements in all lists is n (i.e. there are n elements overall). We wish to be able to merge any two of these lists, and update all of their nodes so that they still contain the name of the list to which they belong. The rule for merging the lists A and B is that if A is larger than B then merge the elements of B into A and update the elements that used to belong to B, and vice versa.

Choose an arbitrary element of list L, say x. We wish to count how many times in the worst case x will need to have the name of the list to which it belongs updated. The element x will only have its name updated when the list it belongs to is merged with another list of the same size or of greater size. Each time that happens, the size of the list to which x belongs at least doubles. So finally, the question is “how many times can a number double before it is the size of n?” (then the list containing x will contain all n elements). The answer is at most log2(n). So any given element of any given list in the structure described will need to be updated at most log2(n) times in the worst case. Therefore, updating a list of n elements stored in this way takes O(n log(n)) time in the worst case. A find operation can be done in O(1) for this structure because each node contains the name of the list to which it belongs.

A similar argument holds for merging the trees in the data structures discussed below. Additionally, it helps explain the time analysis of some operations in the binomial heap and Fibonacci heap data structures.

4.5.2 Disjoint-set forests

Disjoint-set forests are data structures where each set is represented by a tree data structure, in which each node holds a reference to its parent node (see Parent pointer tree). They were first described by Bernard A. Galler and Michael J. Fischer in 1964,[2] although their precise analysis took years.

In a disjoint-set forest, the representative of each set is the root of that set’s tree. Find follows parent nodes until it reaches the root. Union combines two trees into one by attaching the root of one to the root of the other. One way of implementing these might be:

function MakeSet(x)
    x.parent := x

function Find(x)
    if x.parent == x
        return x
    else
        return Find(x.parent)

function Union(x, y)
    xRoot := Find(x)
    yRoot := Find(y)
    xRoot.parent := yRoot

In this naive form, this approach is no better than the linked-list approach, because the tree it creates can be highly unbalanced; however, it can be enhanced in two ways.

The first way, called union by rank, is to always attach the smaller tree to the root of the larger tree. Since it is the depth of the tree that affects the running time, the tree with smaller depth gets added under the root of the deeper tree, which only increases the depth if the depths were equal. In the context of this algorithm, the term rank is used instead of depth since it stops being equal to the depth if path compression (described below) is also used. One-element trees are defined to have a rank of zero, and whenever two trees of the same rank r are united, the rank of the result is r+1. Just applying this technique alone yields a worst-case running-time of O(log n) for the Union or Find operation. Pseudocode for the improved MakeSet and Union:

function MakeSet(x)
    x.parent := x
    x.rank := 0

function Union(x, y)
    xRoot := Find(x)
    yRoot := Find(y)
    if xRoot == yRoot
        return
    // x and y are not already in same set. Merge them.
    if xRoot.rank < yRoot.rank
        xRoot.parent := yRoot
    else if xRoot.rank > yRoot.rank
        yRoot.parent := xRoot
    else
        yRoot.parent := xRoot
        xRoot.rank := xRoot.rank + 1

The second improvement, called path compression, is a way of flattening the structure of the tree whenever Find is used on it. The idea is that each node visited on the way to a root node may as well be attached directly to the root node; they all share the same representative. To effect this, as Find recursively traverses up the tree, it changes each node’s parent reference to point to the root that it found. The resulting tree is much flatter, speeding up future operations not only on these elements but on those referencing them, directly or indirectly. Here is the improved Find:

function Find(x)
    if x.parent != x
        x.parent := Find(x.parent)
    return x.parent

These two techniques complement each other; applied together, the amortized time per operation is only O(α(n)), where α(n) is the inverse of the function n = f(x) = A(x, x), and A is the extremely fast-growing Ackermann function. Since α(n) is the inverse of this function, α(n) is less than 5 for all remotely practical values of n. Thus, the amortized running time per operation is effectively a small constant.

In fact, this is asymptotically optimal: Fredman and Saks showed in 1989 that Ω(α(n)) words must be accessed by any disjoint-set data structure per operation on average.[3]

4.5.3 Applications

Disjoint-set data structures model the partitioning of a set, for example to keep track of the connected components of an undirected graph. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle. The Union–Find algorithm is used in high-performance implementations of unification.[4]

This data structure is used by the Boost Graph Library to implement its Incremental Connected Components functionality. It is also used for implementing Kruskal’s algorithm to find the minimum spanning tree of a graph.

Note that the implementation as disjoint-set forests doesn't allow deletion of edges, even without path compression or the rank heuristic.

4.5.4 History

While the ideas used in disjoint-set forests have long been familiar, Robert Tarjan was the first to prove the upper bound (and a restricted version of the lower bound) in terms of the inverse Ackermann function, in 1975.[5] Until this time the best bound on the time per operation, proven by Hopcroft and Ullman,[6] was O(log* n), the iterated logarithm of n, another slowly growing function (but not quite as slow as the inverse Ackermann function).

Tarjan and Van Leeuwen also developed one-pass Find algorithms that are more efficient in practice while retaining the same worst-case complexity.[7]

In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a persistent version of the disjoint-set forest data structure, allowing previous versions of the structure to be efficiently retained, and formalized its correctness using the proof assistant Coq.[8] However, the implementation retains its asymptotic efficiency only if it is used ephemerally or if the same version of the structure is repeatedly used with limited backtracking.

4.5.5 See also

• Partition refinement, a different data structure for maintaining disjoint sets, with updates that split sets apart rather than merging them together
• Dynamic connectivity

4.5.6 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), “Chapter 21: Data structures for Disjoint Sets”, Introduction to Algorithms (Second ed.), MIT Press, pp. 498–524, ISBN 0-262-03293-7
[2] Galler, Bernard A.; Fischer, Michael J. (May 1964), “An improved equivalence algorithm”, Communications of the ACM, 7 (5): 301–303, doi:10.1145/364099.364331. The paper originating disjoint-set forests.
[3] Fredman, M.; Saks, M. (May 1989), “The cell probe complexity of dynamic data structures”, Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing: 345–354, Theorem 5: Any CPROBE(log n) implementation of the set union problem requires Ω(m α(m, n)) time to execute m Find’s and n−1 Union’s, beginning with n singleton sets.
[4] Knight, Kevin (1989). “Unification: A multidisciplinary survey”. ACM Computing Surveys. 21: 93–124. doi:10.1145/62029.62030.
[5] Tarjan, Robert Endre (1975). “Efficiency of a Good But Not Linear Set Union Algorithm”. Journal of the ACM. 22 (2): 215–225. doi:10.1145/321879.321884.
[6] Hopcroft, J. E.; Ullman, J. D. (1973). “Set Merging Algorithms”.
SIAM Journal on Computing. 2 (4): 294–303. doi:10.1137/0202024.
[7] Tarjan, Robert E.; van Leeuwen, Jan (1984), “Worst-case analysis of set union algorithms”, Journal of the ACM, 31 (2): 245–281, doi:10.1145/62.2160
[8] Conchon, Sylvain; Filliâtre, Jean-Christophe (October 2007), “A Persistent Union-Find Data Structure”, ACM SIGPLAN Workshop on ML, Freiburg, Germany

4.5.7 External links

• C++ implementation, part of the Boost C++ libraries
• A Java implementation with an application to color image segmentation, Statistical Region Merging (SRM), IEEE Trans. Pattern Anal. Mach. Intell. 26(11): 1452–1458 (2004)
• Java applet: A Graphical Union–Find Implementation, by Rory L. P. McGuire
• Wait-free Parallel Algorithms for the Union–Find Problem, a 1994 paper by Richard J. Anderson and Heather Woll describing a parallelized version of Union–Find that never needs to block
• Python implementation
• Visual explanation and C# code

4.6 Partition refinement

In the design of algorithms, partition refinement is a technique for representing a partition of a set as a data structure that allows the partition to be refined by splitting its sets into a larger number of smaller sets. In that sense it is dual to the union-find data structure, which also maintains a partition into disjoint sets but in which the operations merge pairs of sets together.

Partition refinement forms a key component of several efficient algorithms on graphs and finite automata, including DFA minimization, the Coffman–Graham algorithm for parallel scheduling, and lexicographic breadth-first search of graphs.[1][2][3]

4.6.1 Data structure

A partition refinement algorithm maintains a family of disjoint sets Si. At the start of the algorithm, this family contains a single set of all the elements in the data structure. At each step of the algorithm, a set X is presented to the algorithm, and each set Si in the family that contains members of X is split into two sets, the intersection Si ∩ X and the difference Si \ X. Such an algorithm may be implemented efficiently by maintaining data structures representing the following information:

• The ordered sequence of the sets Si in the family, in a form such as a doubly linked list that allows new sets to be inserted into the middle of the sequence.
• Associated with each set Si, a collection of the elements of Si, in a form such as a doubly linked list or array data structure that allows for rapid deletion of individual elements from the collection. Alternatively, this component of the data structure may be represented by storing all of the elements of all of the sets in a single array, sorted by the identity of the set they belong to, and by representing the collection of elements in any set Si by its starting and ending positions in this array.
• Associated with each element, the set it belongs to.

To perform a refinement operation, the algorithm loops through the elements of the given set X. For each such element x, it finds the set Si that contains x, and checks whether a second set for Si ∩ X has already been started. If not, it creates the second set and adds Si to a list L of the sets that are split by the operation. Then, regardless of whether a new set was formed, the algorithm removes x from Si and adds it to Si ∩ X. In the representation in which all elements are stored in a single array, moving x from one set to another may be performed by swapping x with the final element of Si and then decrementing the end index of Si and the start index of the new set. Finally, after all elements of X have been processed in this way, the algorithm loops through L, separating each current set Si from the second set that has been split from it, and reports both of these sets as being newly formed by the refinement operation.

The time to perform a single refinement operation in this way is O(|X|), independent of the number of elements in the family of sets and also independent of the total number of sets in the data structure. Thus, the time for a sequence of refinements is proportional to the total size of the sets given to the algorithm in each refinement step.

... array, ordered by the sets they belong to, and sets may be represented by start and end indices into this array.[4][5]

4.6.2 Applications

An early application of partition refinement was in an algorithm by Hopcroft (1971) for DFA minimization. In this problem, one is given as input a deterministic finite automaton, and must find an equivalent automaton with as few states as possible. Hopcroft’s algorithm maintains a partition of the states of the input automaton into subsets, with the property that any two states in different subsets must be mapped to different states of the output automaton. Initially, there are two subsets, one containing all the accepting states of the automaton and one containing the remaining states. At each step one of the subsets Si and one of the input symbols x of the automaton are chosen, and the subsets of states are refined into states for which a transition labeled x would lead to Si, and states for which an x-transition would lead somewhere else. When a set Si that has already been chosen is split by a refinement, only one of the two resulting sets (the smaller of the two) needs to be chosen again; in this way, each state participates in the sets X for O(s log n) refinement steps and the overall algorithm takes time O(ns log n), where n is the number of initial states and s is the size of the alphabet.[6]

Partition refinement was applied by Sethi (1976) in an efficient implementation of the Coffman–Graham algorithm for parallel scheduling. 
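The refinement operation described under "Data structure" can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from any of the cited papers: the class name `PartitionRefinement` and its methods are hypothetical, hash sets stand in for the linked lists or array ranges, and the ordering of the sets in the family is not maintained (new sets are simply appended).

```python
class PartitionRefinement:
    """Hypothetical sketch: a family of disjoint sets that can only be split."""

    def __init__(self, elements):
        first = set(elements)
        self._sets = {0: first}                  # set id -> members of S_i
        self._set_of = dict.fromkeys(first, 0)   # element -> id of its set
        self._next_id = 1

    def refine(self, X):
        """Split each S_i meeting X into S_i ∩ X and S_i \\ X.

        Returns a list of (S_i ∩ X, S_i \\ X) pairs for the sets that were
        really split. The work is proportional to |X|.
        """
        split = {}  # id of S_i -> id of the set started for S_i ∩ X (the list L)
        for x in set(X):                         # ignore duplicates in X
            i = self._set_of.get(x)
            if i is None:                        # assumption: skip unknown elements
                continue
            if i not in split:                   # start the second set for S_i ∩ X
                split[i] = self._next_id
                self._sets[self._next_id] = set()
                self._next_id += 1
            j = split[i]
            self._sets[i].remove(x)              # move x from S_i to S_i ∩ X
            self._sets[j].add(x)
            self._set_of[x] = j
        newly_formed = []
        for i, j in split.items():
            if not self._sets[i]:                # S_i was a subset of X: no real split
                del self._sets[i]
            else:
                newly_formed.append((self._sets[j], self._sets[i]))
        return newly_formed

    def sets(self):
        return list(self._sets.values())
```

For example, starting from the single set {0, ..., 5}, refining by X = {1, 3} leaves the family {0, 2, 4, 5} and {1, 3}. An array-based variant with start and end indices would avoid hashing but needs more bookkeeping than fits in a short sketch.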
Sethi showed that it could be used to construct a lexicographically ordered topological sort of a given directed acyclic graph in linear time; this lexicographic topological ordering is one of the key steps of the Coffman–Graham algorithm. In this application, the elements of the disjoint sets are vertices of the input graph and the sets X used to refine the partition are sets of neighbors of vertices. Since the total number of neighbors of all vertices is just the number of edges in the graph, the algorithm takes time linear in the number of edges, its input size.[7]

Partition refinement also forms a key step in lexicographic breadth-first search, a graph search algorithm with applications in the recognition of chordal graphs and several other important classes of graphs. Again, the disjoint set elements are vertices and the sets X represent sets of neighbors, so the algorithm takes linear time.[8][9]

4.6.3 See also

• Refinement (sigma algebra)

4.6.4 References

[1] Paige, Robert; Tarjan, Robert E. (1987), “Three partition refinement algorithms”, SIAM Journal on Computing, 16 (6): 973–989, doi:10.1137/0216062, MR 917035.
[2] Habib, Michel; Paul, Christophe; Viennot, Laurent (1999), “Partition refinement techniques: an interesting algorithmic tool kit”, International Journal of Foundations of Computer Science, 10 (2): 147–170, doi:10.1142/S0129054199000125, MR 1759929.
[3] Habib, Michel; Paul, Christophe; Viennot, Laurent (1998), “A synthesis on partition refinement: a useful routine for strings, graphs, Boolean matrices and automata”, STACS 98 (Paris, 1998), Lecture Notes in Computer Science, 1373, Springer-Verlag, pp. 25–38, doi:10.1007/BFb0028546, MR 1650757.
[4] Valmari, Antti; Lehtinen, Petri (2008). “Efficient minimization of DFAs with partial transition functions”. In Albers, Susanne; Weil, Pascal. 25th International Symposium on Theoretical Aspects of Computer Science (STACS 2008). Leibniz International Proceedings in Informatics (LIPIcs). 
Dagstuhl, Germany: Schloss Dagstuhl: Leibniz-Zentrum fuer Informatik. pp. 645–656. doi:10.4230/LIPIcs.STACS.2008.1328. ISBN 978-3-939897-06-4. ISSN 1868-8969.
[5] Knuutila, Timo (2001). “Re-describing an algorithm by Hopcroft”. Theoretical Computer Science. 250 (1-2): 333–363. doi:10.1016/S0304-3975(99)00150-4. ISSN 0304-3975.
[6] Hopcroft, John (1971), “An n log n algorithm for minimizing states in a finite automaton”, Theory of machines and computations (Proc. Internat. Sympos., Technion, Haifa, 1971), New York: Academic Press, pp. 189–196, MR 0403320.
[7] Sethi, Ravi (1976), “Scheduling graphs on two processors”, SIAM Journal on Computing, 5 (1): 73–82, doi:10.1137/0205005, MR 0398156.
[8] Rose, D. J.; Tarjan, R. E.; Lueker, G. S. (1976), “Algorithmic aspects of vertex elimination on graphs”, SIAM Journal on Computing, 5 (2): 266–283, doi:10.1137/0205021.
[9] Corneil, Derek G. (2004), “Lexicographic breadth first search – a survey”, Graph-Theoretic Methods in Computer Science, Lecture Notes in Computer Science, 3353, Springer-Verlag, pp. 1–19.

Chapter 5

Priority queues

5.1 Priority queue

In computer science, a priority queue is an abstract data type which is like a regular queue or stack data structure, but where additionally each element has a “priority” associated with it. In a priority queue, an element with high priority is served before an element with low priority. If two elements have the same priority, they are served according to their order in the queue.

While priority queues are often implemented with heaps, they are conceptually distinct from heaps. A priority queue is an abstract concept like “a list” or “a map”; just as a list can be implemented with a linked list or an array, a priority queue can be implemented with a heap or a variety of other methods such as an unordered array.

5.1.1 Operations

A priority queue must at least support the following operations:

• insert_with_priority: add an element to the queue with an associated priority.
• pull_highest_priority_element: remove the element from the queue that has the highest priority, and return it. This is also known as "pop_element(Off)", "get_maximum_element" or "get_front(most)_element". Some conventions reverse the order of priorities, considering lower values to be higher priority, so this may also be known as "get_minimum_element", and is often referred to as "get-min" in the literature. This may instead be specified as separate "peek_at_highest_priority_element" and "delete_element" functions, which can be combined to produce "pull_highest_priority_element".

In addition, peek (in this context often called find-max or find-min), which returns the highest-priority element but does not modify the queue, is very frequently implemented, and nearly always executes in O(1) time. This operation and its O(1) performance is crucial to many applications of priority queues.

More advanced implementations may support more complicated operations, such as pull_lowest_priority_element, inspecting the first few highest- or lowest-priority elements, clearing the queue, clearing subsets of the queue, performing a batch insert, merging two or more queues into one, incrementing priority of any element, etc.

5.1.2 Similarity to queues

One can imagine a priority queue as a modified queue, but when one would get the next element off the queue, the highest-priority element is retrieved first.

Stacks and queues may be modeled as particular kinds of priority queues. As a reminder, here is how stacks and queues behave:

• stack – elements are pulled in last-in first-out-order (e.g., a stack of papers)
• queue – elements are pulled in first-in first-out-order (e.g., a line in a cafeteria)

In a stack, the priority of each inserted element is monotonically increasing; thus, the last element inserted is always the first retrieved. 
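As a rough sketch of the basic operations, and of how stacks and queues can be modeled as priority queues, the following Python example builds a max-priority queue on top of the standard heapq module. The class `PriorityQueue` and the helpers `pull_all_as_stack` and `pull_all_as_queue` are hypothetical names chosen for illustration. Because heapq is a min-heap, priorities are negated, and an insertion counter is used so that equal priorities are served in insertion order.

```python
import heapq
import itertools


class PriorityQueue:
    """Hypothetical max-priority queue built on heapq (a binary min-heap)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: earlier insertions first

    def insert_with_priority(self, item, priority):
        # Negate the priority so the largest priority sits at the heap root.
        heapq.heappush(self._heap, (-priority, next(self._order), item))

    def peek(self):
        # The root of the heap is the highest-priority element: O(1).
        return self._heap[0][2]

    def pull_highest_priority_element(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def pull_all_as_stack(items):
    """Monotonically increasing priorities give last-in first-out order."""
    pq = PriorityQueue()
    for t, item in enumerate(items):
        pq.insert_with_priority(item, t)
    return [pq.pull_highest_priority_element() for _ in range(len(pq))]


def pull_all_as_queue(items):
    """Monotonically decreasing priorities give first-in first-out order."""
    pq = PriorityQueue()
    for t, item in enumerate(items):
        pq.insert_with_priority(item, -t)
    return [pq.pull_highest_priority_element() for _ in range(len(pq))]
```

The tie-breaking counter is a design choice of this sketch; without it, heapq would fall back to comparing the items themselves when two priorities are equal.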
In a queue, the priority of each inserted element is monotonically decreasing; thus, the first element inserted is always the first retrieved.

5.1.3 Implementation

Naive implementations

There is a variety of simple, usually inefficient, ways to implement a priority queue. They provide an analogy to help one understand what a priority queue is. For instance, one can keep all the elements in an unsorted list. Whenever the highest-priority element is requested, search through all elements for the one with the highest priority. (In big O notation: O(1) insertion time, O(n) pull time due to search.)

Usual implementation

To improve performance, priority queues typically use a heap as their backbone, giving O(log n) performance for inserts and removals, and O(n) to build initially. Variants of the basic heap data structure such as pairing heaps or Fibonacci heaps can provide better bounds for some operations.[1]

Alternatively, when a self-balancing binary search tree is used, insertion and removal also take O(log n) time, although building trees from existing sequences of elements takes O(n log n) time; this is typical where one might already have access to these data structures, such as with third-party or standard libraries.

From a computational-complexity standpoint, priority queues are congruent to sorting algorithms. See the next section for how efficient sorting algorithms can create efficient priority queues.

Specialized heaps

There are several specialized heap data structures that either supply additional operations or outperform heap-based implementations for specific types of keys, specifically integer keys.

• When the set of keys is {1, 2, ..., C}, and only insert, find-min and extract-min are needed, a bucket queue can be constructed as an array of C linked lists plus a pointer top, initially C. Inserting an item with key k appends the item to the k'th list, and updates top ← min(top, k), both in constant time. Extract-min deletes and returns one item from the list with index top, then increments top if needed until it again points to a non-empty list; this takes O(C) time in the worst case. These queues are useful for sorting the vertices of a graph by their degree.[2]:374
• For the set of keys {1, 2, ..., C}, a van Emde Boas tree would support the minimum, maximum, insert, delete, search, extract-min, extract-max, predecessor and successor operations in O(log log C) time, but has a space cost for small queues of about O(2^(m/2)), where m is the number of bits in the priority value.[3]
• The Fusion tree algorithm by Fredman and Willard implements the minimum operation in O(1) time and insert and extract-min operations in O(√(log n)) time; however, the authors state: “Our algorithms have theoretical interest only; The constant factors involved in the execution times preclude practicality.”[4]

For applications that do many "peek" operations for every “extract-min” operation, the time complexity for peek actions can be reduced to O(1) in all tree and heap implementations by caching the highest priority element after every insertion and removal. For insertion, this adds at most a constant cost, since the newly inserted element is compared only to the previously cached minimum element. For deletion, this at most adds an additional “peek” cost, which is typically cheaper than the deletion cost, so overall time complexity is not significantly impacted.

Monotone priority queues are specialized queues that are optimized for the case where no item is ever inserted that has a lower priority (in the case of min-heap) than any item previously extracted. This restriction is met by several practical applications of priority queues.

Summary of running times

In the following time complexities[5] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[8]
[2] Amortized time.
[3] Bounded by Ω(log log n), O(2^(2√(log log n))).[11][12]
[4] n is the size of the larger heap.

5.1.4 Equivalence of priority queues and sorting algorithms

Using a priority queue to sort

The semantics of priority queues naturally suggest a sorting method: insert all the elements to be sorted into a priority queue, and sequentially remove them; they will come out in sorted order. This is actually the procedure used by several sorting algorithms, once the layer of abstraction provided by the priority queue is removed. This sorting method is equivalent to the following sorting algorithms:

Using a sorting algorithm to make a priority queue

A sorting algorithm can also be used to implement a priority queue. Specifically, Thorup says:[13]

We present a general deterministic linear space reduction from priority queues to sorting implying that if we can sort up to n keys in S(n) time per key, then there is a priority queue supporting delete and insert in O(S(n)) time and find-min in constant time.

That is, if there is a sorting algorithm which can sort in O(S) time per key, where S is some function of n and word size,[14] then one can use the given procedure to create a priority queue where pulling the highest-priority element is O(1) time, and inserting new elements (and deleting elements) is O(S) time. For example, if one has an O(n log log n) sort algorithm, one can create a priority queue with O(1) pulling and O(log log n) insertion.

5.1.5 Libraries

A priority queue is often considered to be a "container data structure".

The Standard Template Library (STL), and the C++ 1998 standard, specifies priority_queue as one of the STL container adaptor class templates. However, it does not specify how two elements with same priority should be served, and indeed, common implementations will not return them according to their order in the queue. It implements a max-priority-queue, and has three parameters: a comparison object for sorting such as a function object (defaults to less<T> if unspecified), the underlying container for storing the data structures (defaults to std::vector<T>), and two iterators to the beginning and end of a sequence. Unlike actual STL containers, it does not allow iteration of its elements (it strictly adheres to its abstract data type definition). STL also has utility functions for manipulating another random-access container as a binary max-heap. The Boost (C++ libraries) also have an implementation in the library heap.

Python’s heapq module implements a binary min-heap on top of a list.

Java's library contains a PriorityQueue class, which implements a min-priority-queue.

Go's library contains a container/heap module, which implements a min-heap on top of any compatible data structure.

The Standard PHP Library extension contains the class SplPriorityQueue.

Apple’s Core Foundation framework contains a CFBinaryHeap structure, which implements a min-heap.

5.1.6 Applications

Bandwidth management

Priority queuing can be used to manage limited resources such as bandwidth on a transmission line from a network router. In the event of outgoing traffic queuing due to insufficient bandwidth, all other queues can be halted to send the traffic from the highest priority queue upon arrival. This ensures that the prioritized traffic (such as real-time traffic, e.g. an RTP stream of a VoIP connection) is forwarded with the least delay and the least likelihood of being rejected due to a queue reaching its maximum capacity. All other traffic can be handled when the highest priority queue is empty. Another approach used is to send disproportionately more traffic from higher priority queues.

Many modern protocols for local area networks also include the concept of priority queues at the media access control (MAC) sub-layer to ensure that high-priority applications (such as VoIP or IPTV) experience lower latency than other applications which can be served with best effort service. Examples include IEEE 802.11e (an amendment to IEEE 802.11 which provides quality of service) and ITU-T G.hn (a standard for high-speed local area network using existing home wiring (power lines, phone lines and coaxial cables)).

Usually a limitation (policer) is set to limit the bandwidth that traffic from the highest priority queue can take, in order to prevent high priority packets from choking off all other traffic. This limit is usually never reached due to high level control instances such as the Cisco Callmanager, which can be programmed to inhibit calls which would exceed the programmed bandwidth limit.

Discrete event simulation

Another use of a priority queue is to manage the events in a discrete event simulation. The events are added to the queue with their simulation time used as the priority. The execution of the simulation proceeds by repeatedly pulling the top of the queue and executing the event thereon.

See also: Scheduling (computing), queueing theory

Dijkstra’s algorithm

When the graph is stored in the form of an adjacency list or matrix, a priority queue can be used to extract the minimum efficiently when implementing Dijkstra’s algorithm, although one also needs the ability to alter the priority of a particular vertex in the priority queue efficiently.

Huffman coding

Huffman coding requires one to repeatedly obtain the two lowest-frequency trees. A priority queue is one method of doing this.

Best-first search algorithms

Best-first search algorithms, like the A* search algorithm, find the shortest path between two vertices or nodes of a weighted graph, trying out the most promising routes first. A priority queue (also known as the fringe) is used to keep track of unexplored routes; the one for which the estimate (a lower bound in the case of A*) of the total path length is smallest is given highest priority. If memory limitations make best-first search impractical, variants like the SMA* algorithm can be used instead, with a double-ended priority queue to allow removal of low-priority items.

ROAM triangulation algorithm

The Real-time Optimally Adapting Meshes (ROAM) algorithm computes a dynamically changing triangulation of a terrain. It works by splitting triangles where more detail is needed and merging them where less detail is needed. The algorithm assigns each triangle in the terrain a priority, usually related to the error decrease if that triangle would be split. The algorithm uses two priority queues, one for triangles that can be split and another for triangles that can be merged. In each step the triangle from the split queue with the highest priority is split, or the triangle from the merge queue with the lowest priority is merged with its neighbours.

Prim’s algorithm for minimum spanning tree

Using a min heap priority queue in Prim’s algorithm to find the minimum spanning tree of a connected and undirected graph, one can achieve a good running time. This min heap priority queue uses the min heap data structure, which supports operations such as insert, minimum, extract-min, decrease-key.[15] In this implementation, the weight of the edges is used to decide the priority of the vertices: the lower the weight, the higher the priority, and the higher the weight, the lower the priority.[16]

5.1.7 See also

• Batch queue
• Command queue
• Job scheduler

5.1.8 References

[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 20: Fibonacci Heaps, pp. 476–497. Third edition p518.
[3] P. van Emde Boas. Preserving order in a forest in less than logarithmic time. In Proceedings of the 16th Annual Symposium on Foundations of Computer Science, pages 75-84. IEEE Computer Society, 1975.
[4] Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. Journal of Computer and System Sciences, 48(3):533-551, 1994
[5] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.
[6] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5
[7] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58
[8] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.
[9] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.
[10] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[11] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.
[2] Skiena, Steven (2010). 
The Algorithm Design Manual (2nd ed.). Springer Science+Business Media. ISBN 1-849-96720-2.
[12] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.
[13] Thorup, Mikkel (2007). “Equivalence between priority queues and sorting”. Journal of the ACM. 54 (6). doi:10.1145/1314690.1314692.
[14] http://courses.csail.mit.edu/6.851/spring07/scribe/lec17.pdf
[15] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein (2009). INTRODUCTION TO ALGORITHMS. 3. MIT Press. p. 634. ISBN 978-81-203-4007-7. In order to implement Prim’s algorithm efficiently, we need a fast way to select a new edge to add to the tree formed by the edges in A. In the pseudo-code
[16] “Prim’s Algorithm”. Geek for Geeks. Retrieved 12 September 2014.

5.1.9 Further reading

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 6.5: Priority queues, pp. 138–142.

5.1.10 External links

• C++ reference for std::priority_queue
• Descriptions by Lee Killough
• PQlib - Open source Priority Queue library for C
• libpqueue is a generic priority queue (heap) implementation (in C) used by the Apache HTTP Server project.
• Survey of known priority queue structures by Stefan Xenos
• UC Berkeley - Computer Science 61B - Lecture 24: Priority Queues (video) - introduction to priority queues using binary heap

5.2 Bucket queue

In the design and analysis of data structures, a bucket queue[1] (also called a bucket priority queue[2] or bounded-height priority queue[3]) is a priority queue for prioritizing elements whose priorities are small integers. It has the form of an array of buckets: an array data structure, indexed by the priorities, whose cells contain buckets of items with the same priority as each other.

The bucket queue is the priority-queue analogue of pigeonhole sort (also called bucket sort), a sorting algorithm that places elements into buckets indexed by their priorities and then concatenates the buckets. Using a bucket queue as the priority queue in a selection sort gives a form of the pigeonhole sort algorithm.

Applications of the bucket queue include computation of the degeneracy of a graph as well as fast algorithms for shortest paths and widest paths for graphs with weights that are small integers or are already sorted. Its first use[2] was in a shortest path algorithm by Dial (1969).[4]

5.2.1 Basic data structure

This structure can handle the insertions and deletions of elements with integer priorities in the range from 0 to some known bound C, as well as operations that find the element with minimum (or maximum) priority. It consists of an array A of container data structures, where array cell A[p] stores the collection of elements with priority p. It can handle the following operations:

• To insert an element x with priority p, add x to the container at A[p].
• To remove an element x with priority p, remove x from the container at A[p].
• To find an element with the minimum priority, perform a sequential search to find the first non-empty container, and then choose an arbitrary element from this container.

In this way, insertions and deletions take constant time, while finding the minimum priority element takes time O(C).[1][3]

5.2.2 Optimizations

As an optimization, the data structure can also maintain an index L that lower-bounds the minimum priority of an element. When inserting a new element, L should be updated to the minimum of its old value and the new element’s priority. When searching for the minimum priority element, the search can start at L instead of at zero, and after the search L should be left equal to the priority that was found in the search.[3] In this way the time for a search is reduced to the difference between the previous lower bound and its next value; this difference could be significantly smaller than C. For applications of monotone priority queues such as Dijkstra’s algorithm in which the minimum priorities form a monotonic sequence, the sum of these differences is at most C, so the total time for a sequence of n operations is O(n + C), rather than the slower O(nC) time bound that would result without this optimization.

Another optimization (already given by Dial 1969) can be used to save space when the priorities are monotonic and, at any point in time, fall within a range of r values rather than extending over the whole range from 0 to C. In this case, one can index the array by the priorities modulo r rather than by their actual values. The search for the minimum priority element should always begin at the previous minimum, to avoid priorities that are higher than the minimum but have lower moduli.[1]

5.2.3 Applications

A bucket queue can be used to maintain the vertices of an undirected graph, prioritized by their degrees, and repeatedly find and remove the vertex of minimum degree.[3] This greedy algorithm can be used to calculate the degeneracy of a given graph. It takes linear time, with or without the optimization that maintains a lower bound on the minimum priority, because each vertex is found in time proportional to its degree and the sum of all vertex degrees is linear in the number of edges of the graph.[5]

In Dijkstra’s algorithm for shortest paths in positively-weighted directed graphs, a bucket queue can be used to
obtain a time bound of O(n + m + dc), where n is the number of vertices, m is the number of edges, d is the diameter of the network, and c is the maximum (integer) link cost.[6] In this algorithm, the priorities will only span a range of width c + 1, so the modular optimization can be used to reduce the space to O(n + c).[1] A variant of the same algorithm can be used for the widest path problem, and (in combination with methods for quickly partitioning non-integer edge weights) leads to near-linear-time solutions to the single-source single-destination version of this problem.[7]

5.2.4 References

[1] Mehlhorn, Kurt; Sanders, Peter (2008), “10.5.1 Bucket Queues”, Algorithms and Data Structures: The Basic Toolbox, Springer, p. 201, ISBN 9783540779773.
[2] Edelkamp, Stefan; Schroedl, Stefan (2011), “3.1.1 Bucket Data Structures”, Heuristic Search: Theory and Applications, Elsevier, pp. 90–92, ISBN 9780080919737. See also p. 157 for the history and naming of this structure.
[3] Skiena, Steven S. (1998), The Algorithm Design Manual, Springer, p. 181, ISBN 9780387948607.
[4] Dial, Robert B. (1969), “Algorithm 360: Shortest-path forest with topological ordering [H]”, Communications of the ACM, 12 (11): 632–633, doi:10.1145/363269.363610.
[5] Matula, D. W.; Beck, L. L. (1983), “Smallest-last ordering and clustering and graph coloring algorithms”, Journal of the ACM, 30 (3): 417–427, doi:10.1145/2402.322385, MR 0709826.
[6] Varghese, George (2005), Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices, Morgan Kaufmann, ISBN 9780120884773.
[7] Gabow, Harold N.; Tarjan, Robert E. (1988), “Algorithms for two bottleneck optimization problems”, Journal of Algorithms, 9 (3): 411–417, doi:10.1016/0196-6774(88)90031-4, MR 955149

5.3 Heap (data structure)

This article is about the programming data structure. For the dynamic memory area, see Dynamic memory allocation.

In computer science, a heap is a specialized tree-based data structure that satisfies the heap property: If A is a parent node of B then the key (the value) of node A is ordered with respect to the key of node B with the same ordering applying across the heap. A heap can be classified further as either a "max heap" or a "min heap". In a max heap, the keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node. In a min heap, the keys of parent nodes are less than or equal to those of the children and the lowest key is in the root node. Heaps are crucial in several efficient graph algorithms such as Dijkstra’s algorithm, and in the sorting algorithm heapsort. A common implementation of a heap is the binary heap, in which the tree is a complete binary tree (see figure).

Figure: Example of a complete binary max-heap with node keys being integers from 1 to 100.

In a heap, the highest (or lowest) priority element is always stored at the root. A heap is not a sorted structure and can be regarded as partially ordered. As visible from the heap-diagram, there is no particular relationship among nodes on any given level, even among the siblings. When a heap is a complete binary tree, it has a smallest possible height: a heap with N nodes always has log N height. A heap is a useful data structure when you need to remove the object with the highest (or lowest) priority.

Note that, as shown in the graphic, there is no implied ordering between siblings or cousins and no implied sequence for an in-order traversal (as there would be in, e.g., a binary search tree). The heap relation mentioned above applies only between nodes and their parents, grandparents, etc. The maximum number of children each node can have depends on the type of heap, but in many types it is at most two, which is known as a binary heap.

The heap is one maximally efficient implementation of an abstract data type called a priority queue, and in fact priority queues are often referred to as “heaps”, regardless of how they may be implemented.

A heap data structure should not be confused with the heap which is a common name for the pool of memory from which dynamically allocated memory is allocated. The term was originally used only for the data structure.

5.3.1 Operations

The common operations involving heaps are:

Basic

• find-max or find-min: find the maximum item of a max-heap or a minimum item of a min-heap (a.k.a. peek)
• insert: adding a new key to the heap (a.k.a., push[1])
• extract-min [or extract-max]: returns the node of minimum value from a min heap [or maximum value from a max heap] after removing it from the heap (a.k.a., pop[2])
• delete-max or delete-min: removing the root node of a max- or min-heap, respectively
• replace: pop root and push a new key. More efficient than pop followed by push, since it only needs to balance once, not twice, and appropriate for fixed-size heaps.[3]

Creation

• create-heap: create an empty heap
• heapify: create a heap out of given array of elements
• merge (union): joining two heaps to form a valid new heap containing all the elements of both, preserving the original heaps.
• meld: joining two heaps to form a valid new heap containing all the elements of both, destroying the original heaps.

Inspection

• size: return the number of items in the heap.
• is-empty: return true if the heap is empty, false otherwise.

5.3.2 Implementation

Heaps are usually implemented in an array (fixed size or dynamic array), and do not require pointers between elements. After an element is inserted into or deleted from a heap, the heap property may be violated and the heap must be balanced by internal operations.

Full and almost full binary heaps may be represented in a very space-efficient way (as an implicit data structure) using an array alone. The first (or last) element will contain the root. The next two elements of the array contain its children. The next four contain the four children of the two child nodes, etc. Thus the children of the node at position n would be at positions 2n and 2n + 1 in a one-based array, or 2n + 1 and 2n + 2 in a zero-based array. This allows moving up or down the tree by doing simple index computations. Balancing a heap is done by shift-up or shift-down operations (swapping elements which are out of order). As we can build a heap from an array without requiring extra memory (for the nodes, for example), heapsort can be used to sort an array in-place.

Different types of heaps implement the operations in different ways, but notably, insertion is often done by adding the new element at the end of the heap in the first available free space. This will generally violate the heap property, and so the elements are then shifted up until the heap property has been reestablished. Similarly, deleting the root is done by removing the root and then putting the last element in the root and shifting down to rebalance. Thus replacing is done by deleting the root and putting the new element in the root and shifting down, avoiding a shifting up step compared to pop (shift down of last element) followed by push (shift up of new element).

Construction of a binary (or d-ary) heap out of a given array of elements may be performed in linear time using the classic Floyd algorithm, with the worst-case number of comparisons equal to 2N − 2s₂(N) − e₂(N) (for a binary heap), where s₂(N) is the sum of all digits of the binary representation of N and e₂(N) is the exponent of 2 
in the prime factorization of N.[4] This is faster than a sequence of consecutive insertions into an originally empty Internal heap, which is log-linear (or linearithmic).[lower-alpha 1] • increase-key or decrease-key: updating a key within 5.3.3 a max- or min-heap, respectively Variants • delete: delete an arbitrary node (followed by moving last node and sifting to maintain heap) • 2–3 heap • shift-up: move a node up in the tree, as long as needed; used to restore heap condition after insertion. Called “sift” because node moves up the tree until it reaches the correct level, as in a sieve. • Beap • shift-down: move a node down in the tree, similar to shift-up; used to restore heap condition after deletion or replacement. • Brodal queue • B-heap • Binary heap • Binomial heap • d-ary heap 5.3. HEAP (DATA STRUCTURE) • Fibonacci heap • Leftist heap • Pairing heap • Skew heap • Soft heap • Weak heap • Leaf heap • Radix heap • Randomized meldable heap • Ternary heap • Treap 5.3.4 Comparison of theoretic bounds for variants In the following time complexities[5] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap. [1] Each insertion ∑ takes O(log(k)) in the existing size of the heap, thus n k=1 O(log k) . Since log n/2 = (log n) − 1 , a constant factor (half) of these insertions are within a constant factor of the maximum, so asymptotically we can assume k = n ; formally the time is nO(log n)−O(n) = O(n log n) . This can also be readily seen from Stirling’s approximation. [2] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[8] [3] Amortized time. √ [4] Bounded by Ω(log log n), O(22 log log n ) [11][12] [5] n is the size of the larger heap. 5.3.5 Applications The heap data structure has many applications. 
• Heapsort: One of the best sorting methods, being in-place and with no quadratic worst-case scenarios.
• Selection algorithms: A heap allows access to the min or max element in constant time, and other selections (such as median or kth-element) can be done in sub-linear time on data that is in a heap.[13]
• Graph algorithms: By using heaps as internal traversal data structures, run time will be reduced by polynomial order. Examples of such problems are Prim’s minimal-spanning-tree algorithm and Dijkstra’s shortest-path algorithm.
• Priority Queue: A priority queue is an abstract concept like “a list” or “a map”; just as a list can be implemented with a linked list or an array, a priority queue can be implemented with a heap or a variety of other methods.
• Order statistics: The heap data structure can be used to efficiently find the kth smallest (or largest) element in an array.

5.3.6 Implementations

• The C++ Standard Library provides the make_heap, push_heap and pop_heap algorithms for heaps (usually implemented as binary heaps), which operate on arbitrary random access iterators. It treats the iterators as a reference to an array, and uses the array-to-heap conversion. It also provides the container adaptor priority_queue, which wraps these facilities in a container-like class. However, there is no standard support for the decrease/increase-key operation.
• The Boost C++ libraries include a heaps library. Unlike the STL it supports decrease and increase operations, and supports additional types of heap: specifically, it supports d-ary, binomial, Fibonacci, pairing and skew heaps.
• There is a generic heap implementation for C and C++ with d-ary heap and B-heap support. It provides an STL-like API.
• The Java platform (since version 1.5) provides the binary heap implementation with class java.util.PriorityQueue<E> in the Java Collections Framework. This class implements a min-heap by default; to implement a max-heap, the programmer should write a custom comparator. There is no support for the decrease/increase-key operation.
• Python has a heapq module that implements a priority queue using a binary heap.
• PHP has both max-heap (SplMaxHeap) and min-heap (SplMinHeap) as of version 5.3 in the Standard PHP Library.
• Perl has implementations of binary, binomial, and Fibonacci heaps in the Heap distribution available on CPAN.
• The Go language contains a heap package with heap algorithms that operate on an arbitrary type that satisfies a given interface.
• Apple’s Core Foundation library contains a CFBinaryHeap structure.
• Pharo has an implementation in the Collections-Sequenceable package along with a set of test cases. A heap is used in the implementation of the timer event loop.
• The Rust programming language has a binary max-heap implementation, BinaryHeap, in the collections module of its standard library.

5.3.7 See also

• Sorting algorithm
• Search data structure
• Stack (abstract data type)
• Queue (abstract data type)
• Tree (data structure)
• Treap, a form of binary search tree based on heap-ordered trees

5.3.8 References

[1] The Python Standard Library, 8.4. heapq - Heap queue algorithm, heapq.heappush
[2] The Python Standard Library, 8.4. heapq - Heap queue algorithm, heapq.heappop
[3] The Python Standard Library, 8.4. heapq - Heap queue algorithm, heapq.heapreplace
[4] Suchenek, Marek A. (2012), “Elementary Yet Precise Worst-Case Analysis of Floyd’s Heap-Construction Program”, Fundamenta Informaticae, IOS Press, 120 (1): 75–92, doi:10.3233/FI-2012-751.
[5] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.
[6] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5
[7] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58
[8] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.
[9] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.
[10] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[11] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.
[12] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.
[13] Frederickson, Greg N. (1993), “An Optimal Algorithm for Selection in a Min-Heap”, Information and Computation (PDF), 104 (2), Academic Press, pp. 197–214, doi:10.1006/inco.1993.1030

5.3.9 External links

• Heap at Wolfram MathWorld
• Explanation of how the basic heap algorithms work

5.4 Binary heap

[Figure: Example of a complete binary max heap.]
[Figure: Example of a complete binary min heap.]

A binary heap is a heap data structure that takes the form of a binary tree. Binary heaps are a common way of implementing priority queues.[1]:162–163

A binary heap is defined as a binary tree with two additional constraints:[2]

• Shape property: a binary heap is a complete binary tree; that is, all levels of the tree, except possibly the last one (deepest), are fully filled, and, if the last level of the tree is not complete, the nodes of that level are filled from left to right.
• Heap property: the key stored in each node is either greater than or equal to, or less than or equal to, the keys in the node’s children, according to some total order.

Heaps where the parent key is greater than or equal to (≥) the child keys are called max-heaps; those where it is less than or equal to (≤) are called min-heaps. Efficient (logarithmic time) algorithms are known for the two operations needed to implement a priority queue on a binary heap: inserting an element, and removing the smallest (largest) element from a min-heap (max-heap). Binary heaps are also commonly employed in the heapsort sorting algorithm, which is an in-place algorithm owing to the fact that binary heaps can be implemented as an implicit data structure, storing keys in an array and using their relative positions within that array to represent child-parent relationships.

5.4.1 Heap operations

Both the insert and remove operations modify the heap to conform to the shape property first, by adding or removing from the end of the heap. Then the heap property is restored by traversing up or down the heap. Both operations take O(log n) time.

Insert

To add an element to a heap we must perform an up-heap operation (also known as bubble-up, percolate-up, sift-up, trickle-up, heapify-up, or cascade-up), by following this algorithm:

1. Add the element to the bottom level of the heap.
2. Compare the added element with its parent; if they are in the correct order, stop.
3. If not, swap the element with its parent and return to the previous step.

The number of operations required depends only on the number of levels the new element must rise to satisfy the heap property, thus the insertion operation has a worst-case time complexity of O(log n) but an average-case complexity of O(1).[3]

As an example of binary heap insertion, say we have a max-heap

        11
      /    \
     5      8
    / \    /
   3   4  X

and we want to add the number 15 to the heap. We first place the 15 in the position marked by the X. However, the heap property is violated since 15 > 8, so we need to swap the 15 and the 8. So, we have the heap looking as follows after the first swap:

        11
      /    \
     5      15
    / \    /
   3   4  8

However the heap property is still violated since 15 > 11, so we need to swap again:

        15
      /    \
     5      11
    / \    /
   3   4  8

which is a valid max-heap. There is no need to check the left child after this final step: at the start, the max-heap was valid, meaning 11 > 5; if 15 > 11, and 11 > 5, then 15 > 5, because of the transitive relation.

Extract

The procedure for deleting the root from the heap (effectively extracting the maximum element in a max-heap or the minimum element in a min-heap) and restoring the properties is called down-heap (also known as bubble-down, percolate-down, sift-down, trickle down, heapify-down, cascade-down, and extract-min/max).

1. Replace the root of the heap with the last element on the last level.
2. Compare the new root with its children; if they are in the correct order, stop.
3. If not, swap the element with one of its children and return to the previous step. (Swap with its smaller child in a min-heap and its larger child in a max-heap.)

So, if we have the same max-heap as before

        11
      /    \
     5      8
    / \
   3   4

We remove the 11 and replace it with the 4.

        4
      /   \
     5     8
    /
   3

Now the heap property is violated since 8 is greater than 4. In this case, swapping the two elements, 4 and 8, is enough to restore the heap property and we need not swap elements further:

        8
      /   \
     5     4
    /
   3

The downward-moving node is swapped with the larger of its children in a max-heap (in a min-heap it would be swapped with its smaller child), until it satisfies the heap property in its new position. This functionality is achieved by the Max-Heapify function as defined below in pseudocode for an array-backed heap A of length heap_length[A]. Note that A is indexed starting at 1, not 0 as is common in many real programming languages.

Max-Heapify (A, i):
    left ← 2*i            // ← means “assignment”
    right ← 2*i + 1
    largest ← i
    if left ≤ heap_length[A] and A[left] > A[largest] then:
        largest ← left
    if right ≤ heap_length[A] and A[right] > A[largest] then:
        largest ← right
    if largest ≠ i then:
        swap A[i] and A[largest]
        Max-Heapify(A, largest)

For the above algorithm to correctly re-heapify the array, the subtrees rooted at the children of node i must already be max-heaps; only the node at index i may violate the heap property with respect to its two direct children. If node i does not violate it, the algorithm falls through with no change to the array. The down-heap operation (without the preceding swap) can also be used to modify the value of the root, even when an element is not being deleted. In the pseudocode above, text starting with // is a comment.

In the worst case, the new root has to be swapped with its child on each level until it reaches the bottom level of the heap, meaning that the delete operation has a time complexity relative to the height of the tree, or O(log n).

5.4.2 Building a heap

Building a heap from an array of n input elements can be done by starting with an empty heap, then successively inserting each element. This approach, called Williams’ method after the inventor of binary heaps, is easily seen to run in O(n log n) time: it performs n insertions at O(log n) cost each.[note 1] However, Williams’ method is suboptimal.
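Williams’ method as described above amounts to repeated up-heap insertion. The following is a minimal Python sketch of that idea for a max-heap, not a reference implementation: it assumes a zero-based list (unlike the one-based pseudocode in the text), and the function names sift_up, insert, build_by_insertion and is_max_heap are hypothetical names chosen for illustration.

```python
def sift_up(heap, i):
    """Move heap[i] up until its parent is not smaller (max-heap, 0-based)."""
    while i > 0:
        parent = (i - 1) // 2          # zero-based parent index
        if heap[parent] >= heap[i]:
            break                      # correct order: stop
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent                     # continue from the parent's position


def insert(heap, key):
    """Append key at the bottom level, then restore the heap property."""
    heap.append(key)
    sift_up(heap, len(heap) - 1)


def build_by_insertion(items):
    """Williams' method: n insertions at O(log n) worst-case cost each."""
    heap = []
    for key in items:
        insert(heap, key)
    return heap


def is_max_heap(heap):
    """Check that every non-root element is no larger than its parent."""
    return all(heap[(i - 1) // 2] >= heap[i] for i in range(1, len(heap)))


# The insertion example from the text: adding 15 to the max-heap 11; 5 8; 3 4.
example = [11, 5, 8, 3, 4]
insert(example, 15)
print(example)  # [15, 5, 11, 3, 4, 8]
```

The printed list is the level-order form of the final tree in the insertion example (15 at the root, then 5 and 11, then 3, 4 and 8).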
A faster method (due to Floyd[4]) starts by arbitrarily putting the elements on a binary tree, respecting the shape property (the tree could be represented by an array, see below). Then starting from the lowest level and moving upwards, sift the root of each subtree downward as in the deletion algorithm until the heap property is restored. More specifically, if all the subtrees starting at some height h (measured from the bottom) have already been “heapified”, the trees at height h + 1 can be heapified by sending their root down along the path of maximum valued children when building a max-heap, or minimum valued children when building a min-heap. This process takes O(h) operations (swaps) per node. In this method most of the heapification takes place in the lower levels. Since the height of the heap is ⌊log(n)⌋, the number of nodes at height h is

\[ \le \left\lceil 2^{(\log n) - h - 1} \right\rceil = \left\lceil \frac{2^{\log n}}{2^{h+1}} \right\rceil = \left\lceil \frac{n}{2^{h+1}} \right\rceil . \]

Therefore, the cost of heapifying all subtrees is:

\[ \sum_{h=0}^{\lceil \log n \rceil} \frac{n}{2^{h+1}}\, O(h) = O\!\left( n \sum_{h=0}^{\lceil \log n \rceil} \frac{h}{2^{h+1}} \right) = O\!\left( \frac{n}{2} \sum_{h=0}^{\lceil \log n \rceil} \frac{h}{2^{h}} \right) = O\!\left( n \sum_{h=0}^{\infty} \frac{h}{2^{h}} \right) = O(2n) = O(n) \]

This uses the fact that the given infinite series h / 2^h converges to 2.

The exact value of the above (the worst-case number of comparisons during the heap construction) is known to be equal to:

\[ 2n - 2 s_2(n) - e_2(n), \] [5]

where s2(n) is the sum of all digits of the binary representation of n and e2(n) is the exponent of 2 in the prime factorization of n.

The Build-Max-Heap function that follows converts an array A which stores a complete binary tree with n nodes to a max-heap by repeatedly using Max-Heapify in a bottom-up manner. It is based on the observation that the array elements indexed by floor(n/2) + 1, floor(n/2) + 2, ..., n are all leaves for the tree (assuming, as before, that indices start at 1), thus each is a one-element heap. Build-Max-Heap runs Max-Heapify on each of the remaining tree nodes.

Build-Max-Heap (A):
    heap_length[A] ← length[A]
    for each index i from floor(length[A]/2) downto 1 do:
        Max-Heapify(A, i)

5.4.3 Heap implementation

[Figure: A small complete binary tree stored in an array (indices 0 to 6).]
[Figure: Comparison between a binary heap and an array implementation.]

Heaps are commonly implemented with an array. Any binary tree can be stored in an array, but because a binary heap is always a complete binary tree, it can be stored compactly. No space is required for pointers; instead, the parent and children of each node can be found by arithmetic on array indices. These properties make this heap implementation a simple example of an implicit data structure or Ahnentafel list. Details depend on the root position, which in turn may depend on constraints of a programming language used for implementation, or programmer preference. Specifically, sometimes the root is placed at index 1, sacrificing space in order to simplify arithmetic.

Let n be the number of elements in the heap and i be an arbitrary valid index of the array storing the heap. If the tree root is at index 0, with valid indices 0 through n − 1, then each element a at index i has

• children at indices 2i + 1 and 2i + 2
• its parent at index floor((i − 1) ∕ 2).

Alternatively, if the tree root is at index 1, with valid indices 1 through n, then each element a at index i has

• children at indices 2i and 2i + 1
• its parent at index floor(i ∕ 2).

This implementation is used in the heapsort algorithm, where it allows the space in the input array to be reused to store the heap (i.e. the algorithm is done in-place). The implementation is also useful for use as a priority queue where use of a dynamic array allows insertion of an unbounded number of items.

The upheap/downheap operations can then be stated in terms of an array as follows: suppose that the heap property holds for the indices b, b+1, ..., e. The sift-down function extends the heap property to b−1, b, b+1, ..., e. Only index i = b−1 can violate the heap property. Let j be the index of the largest child of a[i] (for a max-heap, or the smallest child for a min-heap) within the range b, ..., e. (If no such index exists because 2i > e then the heap property holds for the newly extended range and nothing needs to be done.) By swapping the values a[i] and a[j] the heap property for position i is established. At this point, the only problem is that the heap property might not hold for index j. The sift-down function is applied tail-recursively to index j until the heap property is established for all elements.

The sift-down function is fast. In each step it only needs two comparisons and one swap. The index value where it is working doubles in each iteration, so that at most log2 e steps are required.

For big heaps and using virtual memory, storing elements in an array according to the above scheme is inefficient: (almost) every level is in a different page. B-heaps are binary heaps that keep subtrees in a single page, reducing the number of pages accessed by up to a factor of ten.[6]

The operation of merging two binary heaps takes Θ(n) for equal-sized heaps. The best you can do is (in case of array implementation) simply concatenating the two heap arrays and building a heap of the result.[7] A heap on n elements can be merged with a heap on k elements using O(log n log k) key comparisons, or, in case of a pointer-based implementation, in O(log n log k) time.[8] An algorithm for splitting a heap on n elements into two heaps on k and n−k elements, respectively, based on a new view of heaps as an ordered collection of subheaps was presented in [9]. The algorithm requires O(log n * log n) comparisons. The view also presents a new and conceptually simple algorithm for merging heaps. When merging is a common task, a different heap implementation is recommended, such as binomial heaps, which can be merged in O(log n).

Additionally, a binary heap can be implemented with a traditional binary tree data structure, but there is an issue with finding the adjacent element on the last level of the binary heap when adding an element. This element can be determined algorithmically or by adding extra data to the nodes, called “threading” the tree: instead of merely storing references to the children, we store the inorder successor of the node as well.

It is possible to modify the heap structure to allow extraction of both the smallest and largest element in O(log n) time.[10] To do this, the rows alternate between min heap and max heap. The algorithms are roughly the same, but, in each step, one must consider the alternating rows with alternating comparisons. The performance is roughly the same as a normal single direction heap. This idea can be generalised to a min-max-median heap.

5.4.4 Derivation of index equations

In an array-based heap, the children and parent of a node can be located via simple arithmetic on the node’s index. This section derives the relevant equations for heaps with their root at index 0, with additional notes on heaps with their root at index 1.

To avoid confusion, we'll define the level of a node as its distance from the root, such that the root itself occupies level 0.

Child nodes

For a general node located at index i (beginning from 0), we will first derive the index of its right child, right = 2i + 2.

Let node i be located in level L, and note that any level l contains exactly 2^l nodes. Furthermore, there are exactly 2^{l+1} − 1 nodes contained in the layers up to and including layer l (think of binary arithmetic; 0111...111 = 1000...000 − 1). Because the root is stored at 0, the k-th node will be stored at index (k − 1). Putting these observations together yields the following expression for the index of the last node in layer l.

\[ \text{last}(l) = (2^{l+1} - 1) - 1 = 2^{l+1} - 2 \]

Let there be j nodes after node i in layer L, such that

\[ i = \text{last}(L) - j = (2^{L+1} - 2) - j \]

Each of these j nodes must have exactly 2 children, so there must be 2j nodes separating i's right child from the end of its layer (L + 1).

\[ \text{right} = \text{last}(L + 1) - 2j = (2^{L+2} - 2) - 2j = 2(2^{L+1} - 2 - j) + 2 = 2i + 2 \]

As required. Noting that the left child of any node is always 1 place before its right child, we get left = 2i + 1.

If the root is located at index 1 instead of 0, the last node in each level is instead at index 2^{l+1} − 1. Using this throughout yields left = 2i and right = 2i + 1 for heaps with their root at 1.

Parent node

Every node is either the left or right child of its parent, so we know that either of the following is true.

1. i = 2 × (parent) + 1
2. i = 2 × (parent) + 2

Hence,

\[ \text{parent} = \frac{i - 1}{2} \quad \text{or} \quad \frac{i - 2}{2} \]

Now consider the expression \( \left\lfloor \frac{i - 1}{2} \right\rfloor \).

If node i is a left child, this gives the result immediately, however, it also gives the correct result if node i is a right child. In this case, (i − 2) must be even, and hence (i − 1) must be odd.

\[ \left\lfloor \frac{i - 1}{2} \right\rfloor = \left\lfloor \frac{i - 2}{2} + \frac{1}{2} \right\rfloor = \frac{i - 2}{2} = \text{parent} \]

Therefore, irrespective of whether a node is a left or right child, its parent can be found by the expression:

\[ \text{parent} = \left\lfloor \frac{i - 1}{2} \right\rfloor \]

5.4.5 Related structures

Since the ordering of siblings in a heap is not specified by the heap property, a single node’s two children can be freely interchanged unless doing so violates the shape property (compare with treap). Note, however, that in the common array-based heap, simply swapping the children might also necessitate moving the children’s sub-tree nodes to retain the heap property.

The binary heap is a special case of the d-ary heap in which d = 2.
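The zero-based index equations derived in this section (left = 2i + 1, right = 2i + 2, parent = ⌊(i − 1)/2⌋) are enough to express sift-down, bottom-up (Floyd-style) construction and root extraction on a plain array. The Python sketch below is illustrative only: it assumes a max-heap stored in a zero-based list, and the helper names (left, right, parent, sift_down, build_max_heap, extract_max) are hypothetical rather than taken from the text, whose pseudocode is one-based.

```python
def left(i):
    return 2 * i + 1


def right(i):
    return 2 * i + 2


def parent(i):
    return (i - 1) // 2


def sift_down(a, i, n):
    """Restore the max-heap property at index i within a[:n].

    Assumes the subtrees rooted at the children of i are already max-heaps.
    """
    while True:
        largest = i
        l, r = left(i), right(i)
        if l < n and a[l] > a[largest]:
            largest = l
        if r < n and a[r] > a[largest]:
            largest = r
        if largest == i:
            return                      # heap property holds here
        a[i], a[largest] = a[largest], a[i]
        i = largest                     # continue in the affected subtree


def build_max_heap(a):
    """Bottom-up construction: sift down every non-leaf, last one first."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):
        sift_down(a, i, n)


def extract_max(a):
    """Remove and return the root, moving the last element into its place."""
    top = a[0]
    last = a.pop()
    if a:
        a[0] = last
        sift_down(a, 0, len(a))
    return top


# The extraction example from section 5.4.1: max-heap 11; 5 8; 3 4.
heap = [11, 5, 8, 3, 4]
print(extract_max(heap), heap)  # 11 [8, 5, 4, 3]
```

In this layout the last non-leaf sits at index n // 2 − 1, which corresponds to floor(n/2) in the one-based pseudocode. Writing sift_down as a loop instead of a recursive call is a design choice of this sketch, equivalent to the tail recursion described in the text.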
5.4.6 Summary of running times

In the following time complexities[11] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[note 1] In fact, this procedure can be shown to take Θ(n log n) time in the worst case, meaning that n log n is also an asymptotic lower bound on the complexity.[1]:167 In the average case (averaging over all permutations of n inputs), though, the method takes linear time.[4]
[note 2] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[14]
[note 3] Amortized time.
[note 4] Bounded by $\Omega(\log\log n)$ and $O(2^{2\sqrt{\log\log n}})$.[17][18]
[note 5] n is the size of the larger heap.

5.4.7 See also

• Heap
• Heapsort

5.4.8 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.
[2] eEL,CSA_Dept,IISc,Bangalore, “Binary Heaps”, Data Structures and Algorithms
[3] http://wcipeg.com/wiki/Binary_heap
[4] Hayward, Ryan; McDiarmid, Colin (1991). “Average Case Analysis of Heap Building by Repeated Insertion” (PDF). J. Algorithms. 12: 126–153.
[5] Suchenek, Marek A. (2012), “Elementary Yet Precise Worst-Case Analysis of Floyd’s Heap-Construction Program”, Fundamenta Informaticae, IOS Press, 120 (1): 75–92, doi:10.3233/FI-2012-751.
[6] Poul-Henning Kamp. “You're Doing It Wrong”. ACM Queue. June 11, 2010.
[7] Chris L. Kuszmaul. “binary heap”. Dictionary of Algorithms and Data Structures, Paul E. Black, ed., U.S. National Institute of Standards and Technology. 16 November 2009.
[8] J.-R. Sack and T. Strothotte, “An Algorithm for Merging Heaps”, Acta Informatica 22, 171–186 (1985).
[9] J.-R. Sack and T. Strothotte, “A characterization of heaps and its applications”, Information and Computation, Volume 86, Issue 1, May 1990, pages 69–86.
[10] Atkinson, M.D.; J.-R. Sack; N. Santoro; T. Strothotte (1 October 1986). “Min-max heaps and generalized priority queues” (PDF). Programming techniques and Data structures. Comm. ACM, 29 (10): 996–1000.
[11] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.
[12] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5
[13] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58
[14] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.
[15] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.
[16] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[17] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.
[18] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.

5.4.9 External links

• Binary Heap Applet by Kubo Kovac
• Open Data Structures - Section 10.1 - BinaryHeap: An Implicit Binary Tree
• Implementation of binary max heap in C by Robin Thomas
• Implementation of binary min heap in C by Robin Thomas
5.5 d-ary heap

The d-ary heap or d-heap is a priority queue data structure, a generalization of the binary heap in which the nodes have d children instead of 2.[1][2][3] Thus, a binary heap is a 2-heap, and a ternary heap is a 3-heap. According to Tarjan[2] and Jensen et al.,[4] d-ary heaps were invented by Donald B. Johnson in 1975.[1]

This data structure allows decrease priority operations to be performed more quickly than binary heaps, at the expense of slower delete minimum operations. This tradeoff leads to better running times for algorithms such as Dijkstra’s algorithm in which decrease priority operations are more common than delete min operations.[1][5] Additionally, d-ary heaps have better memory cache behavior than a binary heap, allowing them to run more quickly in practice despite having a theoretically larger worst-case running time.[6][7] Like binary heaps, d-ary heaps are an in-place data structure that uses no additional storage beyond that needed to store the array of items in the heap.[2][8]

5.5.1 Data structure

The d-ary heap consists of an array of n items, each of which has a priority associated with it. These items may be viewed as the nodes in a complete d-ary tree, listed in breadth first traversal order: the item at position 0 of the array forms the root of the tree, the items at positions 1 through d are its children, the next $d^2$ items are its grandchildren, etc. Thus, the parent of the item at position i (for any i > 0) is the item at position floor((i − 1)/d) and its children are the items at positions di + 1 through di + d. According to the heap property, in a min-heap, each item has a priority that is at least as large as its parent; in a max-heap, each item has a priority that is no larger than its parent.[2][3]

The minimum priority item in a min-heap (or the maximum priority item in a max-heap) may always be found at position 0 of the array. To remove this item from the priority queue, the last item x in the array is moved into its place, and the length of the array is decreased by one. Then, while item x and its children do not satisfy the heap property, item x is swapped with one of its children (the one with the smallest priority in a min-heap, or the one with the largest priority in a max-heap), moving it downward in the tree and later in the array, until eventually the heap property is satisfied. The same downward-swapping procedure may be used to increase the priority of an item in a min-heap, or to decrease the priority of an item in a max-heap.[2][3]

To insert a new item into the heap, the item is appended to the end of the array, and then while the heap property is violated it is swapped with its parent, moving it upward in the tree and earlier in the array, until eventually the heap property is satisfied. The same upward-swapping procedure may be used to decrease the priority of an item in a min-heap, or to increase the priority of an item in a max-heap.[2][3]

To create a new heap from an array of n items, one may loop over the items in reverse order, starting from the item at position n − 1 and ending at the item at position 0, applying the downward-swapping procedure for each item.[2][3]

5.5.2 Analysis

In a d-ary heap with n items in it, both the upward-swapping procedure and the downward-swapping procedure may perform as many as $\log_d n = \log n / \log d$ swaps. In the upward-swapping procedure, each swap involves a single comparison of an item with its parent, and takes constant time. Therefore, the time to insert a new item into the heap, to decrease the priority of an item in a min-heap, or to increase the priority of an item in a max-heap, is O(log n / log d). In the downward-swapping procedure, each swap involves d comparisons and takes O(d) time: it takes d − 1 comparisons to determine the minimum or maximum of the children and then one more comparison against the parent to determine whether a swap is needed. Therefore, the time to delete the root item, to increase the priority of an item in a min-heap, or to decrease the priority of an item in a max-heap, is O(d log n / log d).[2][3]

When creating a d-ary heap from a set of n items, most of the items are in positions that will eventually hold leaves of the d-ary tree, and no downward swapping is performed for those items. At most n/d + 1 items are non-leaves, and may be swapped downwards at least once, at a cost of O(d) time to find the child to swap them with. At most $n/d^2 + 1$ nodes may be swapped downward two times, incurring an additional O(d) cost for the second swap beyond the cost already counted in the first term, etc. Therefore, the total amount of time to create a heap in this way is

$$\sum_{i=1}^{\log_d n} \left( \frac{n}{d^i} + 1 \right) O(d) = O(n).$$[2][3]

The exact value of the above (the worst-case number of comparisons during the construction of a d-ary heap) is known to be equal to:

$$\frac{d}{d-1}\bigl(n - s_d(n)\bigr) - \bigl(d - 1 - (n \bmod d)\bigr)\Bigl(e_d\bigl(\lfloor n/d \rfloor\bigr) + 1\Bigr),$$[9]

where $s_d(n)$ is the sum of all digits of the standard base-d representation of n and $e_d(n)$ is the exponent of d in the factorization of n. This reduces to

$$2n - 2s_2(n) - e_2(n),$$[9]

for d = 2, and to

$$\frac{3}{2}\bigl(n - s_3(n)\bigr) - 2e_3(n) - e_3(n - 1),$$[9]

for d = 3.

The space usage of the d-ary heap, with insert and delete-min operations, is linear, as it uses no extra storage other than an array containing a list of the items in the heap.[2][8] If changes to the priorities of existing items need to be supported, then one must also maintain pointers from the items to their positions in the heap, which again uses only linear storage.[2]

5.5.3 Applications

Dijkstra’s algorithm for shortest paths in graphs and Prim’s algorithm for minimum spanning trees both use a min-heap in which there are n delete-min operations and as many as m decrease-priority operations, where n is the number of vertices in the graph and m is the number of edges. By using a d-ary heap with d = m/n, the total times for these two types of operations may be balanced against each other, leading to a total time of $O(m \log_{m/n} n)$ for the algorithm, an improvement over the O(m log n) running time of binary heap versions of these algorithms whenever the number of edges is significantly larger than the number of vertices.[1][5] An alternative priority queue data structure, the Fibonacci heap, gives an even better theoretical running time of O(m + n log n), but in practice d-ary heaps are generally at least as fast, and often faster, than Fibonacci heaps for this application.[10]

4-heaps may perform better than binary heaps in practice, even for delete-min operations.[2][3] Additionally, a d-ary heap typically runs much faster than a binary heap for heap sizes that exceed the size of the computer’s cache memory: a binary heap typically requires more cache misses and virtual memory page faults than a d-ary heap, each one taking far more time than the extra work incurred by the additional comparisons a d-ary heap makes compared to a binary heap.[6][7]

5.5.4 References

[1] Johnson, D. B. (1975), “Priority queues with update and finding minimum spanning trees”, Information Processing Letters, 4 (3): 53–57, doi:10.1016/0020-0190(75)90001-0.

[2] Tarjan, R. E. (1983), “3.2. d-heaps”, Data Structures and Network Algorithms, CBMS-NSF Regional Conference Series in Applied Mathematics, 44, Society for Industrial and Applied Mathematics, pp. 34–38.

[3] Weiss, M. A. (2007), “d-heaps”, Data Structures and Algorithm Analysis (2nd ed.), Addison-Wesley, p. 216, ISBN 0-321-37013-9.

[4] Jensen, C.; Katajainen, J.; Vitale, F. (2004), An extended truth about heaps (PDF).

[5] Tarjan (1983), pp. 77 and 91.

[6] Naor, D.; Martel, C. U.; Matloff, N. S. (1991), “Performance of priority queue structures in a virtual memory environment”, Computer Journal, 34 (5): 428–437, doi:10.1093/comjnl/34.5.428.

[7] Kamp, Poul-Henning (2010), “You're doing it wrong”, ACM Queue, 8 (6).

[8] Mortensen, C. W.; Pettie, S. (2005), “The complexity of implicit and space efficient priority queues”, Algorithms and Data Structures: 9th International Workshop, WADS 2005, Waterloo, Canada, August 15–17, 2005, Proceedings, Lecture Notes in Computer Science, 3608, Springer-Verlag, pp. 49–60, doi:10.1007/11534273_6, ISBN 978-3-540-28101-6.

[9] Suchenek, Marek A. (2012), “Elementary Yet Precise Worst-Case Analysis of Floyd’s Heap-Construction Program”, Fundamenta Informaticae, IOS Press, 120 (1): 75–92, doi:10.3233/FI-2012-751.

[10] Cherkassky, B. V.; Goldberg, A. V.; Radzik, T. (1996), “Shortest paths algorithms: Theory and experimental evaluation”, Mathematical Programming, 73 (2): 129–174, doi:10.1007/BF02592101.

5.5.5 External links

• C++ implementation of generalized heap with D-Heap support

5.6 Binomial heap

For binomial price trees, see Binomial options pricing model.

In computer science, a binomial heap is a heap similar to a binary heap but also supports quick merging of two heaps. This is achieved by using a special tree structure. It is important as an implementation of the mergeable heap abstract data type (also called meldable heap), which is a priority queue supporting a merge operation.

5.6.1 Binomial heap

A binomial heap is implemented as a collection of binomial trees (compare with a binary heap, which has the shape of a single binary tree), which are defined recursively as follows:

• A binomial tree of order 0 is a single node
• A binomial tree of order k has a root node whose children are roots of binomial trees of orders k−1, k−2, ..., 2, 1, 0 (in this order).

A binomial tree of order k has $2^k$ nodes and height k.

Because of its unique structure, a binomial tree of order k can be constructed from two trees of order k−1 trivially by attaching one of them as the leftmost child of the root of the other tree. This feature is central to the merge operation of a binomial heap, which is its major advantage over other conventional heaps.

The name comes from the shape: a binomial tree of order n has $\binom{n}{d}$ nodes at depth d. (See Binomial coefficient.)

Figure: Binomial trees of order 0 to 3. Each tree has a root node with subtrees of all lower ordered binomial trees, which have been highlighted. For example, the order 3 binomial tree is connected to an order 2, 1, and 0 (highlighted as blue, green and red respectively) binomial tree.

Figure: Example of a binomial heap containing 13 nodes with distinct keys. The heap consists of three binomial trees with orders 0, 2, and 3.

5.6.3 Implementation

Because no operation requires random access to the root nodes of the binomial trees, the roots of the binomial trees can be stored in a linked list, ordered by increasing order of the tree.
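The construction described above, in which two binomial trees of order k−1 are joined into one tree of order k by making one root the leftmost child of the other, can be sketched as follows. This is a minimal illustration under stated assumptions, not a full binomial heap: the names BinomialTree and link are hypothetical and chosen for this example, and a Python list stands in for the child list.

```python
class BinomialTree:
    """One node of a min-ordered binomial tree (hypothetical helper type)."""

    def __init__(self, key):
        self.key = key
        self.order = 0       # a single node is a binomial tree of order 0
        self.children = []   # child roots, highest order first

    def size(self):
        """Number of nodes in the tree rooted here."""
        return 1 + sum(child.size() for child in self.children)


def link(a, b):
    """Join two trees of equal order k-1 into a single tree of order k."""
    if a.order != b.order:
        raise ValueError("only trees of equal order can be linked")
    if b.key < a.key:
        a, b = b, a              # the smaller key stays at the root
    a.children.insert(0, b)      # the other tree becomes the leftmost child
    a.order += 1
    return a
```

Linking eight single-node trees pairwise in three rounds yields one tree of order 3 with 2^3 = 8 nodes whose children have orders 2, 1 and 0, matching the recursive definition. Keeping the smaller key at the root is what preserves the minimum-heap property across a link.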
5.6.2 Structure of a binomial heap

A binomial heap is implemented as a set of binomial trees that satisfy the binomial heap properties:

• Each binomial tree in a heap obeys the minimum-heap property: the key of a node is greater than or equal to the key of its parent.
• There can only be either one or zero binomial trees for each order, including zero order.

The first property ensures that the root of each binomial tree contains the smallest key in the tree, which applies to the entire heap.

The second property implies that a binomial heap with n nodes consists of at most log n + 1 binomial trees. In fact, the number and orders of these trees are uniquely determined by the number of nodes n: each binomial tree corresponds to one 1 bit in the binary representation of the number n. For example, the number 13 is 1101 in binary, $2^3 + 2^2 + 2^0$, and thus a binomial heap with 13 nodes will consist of three binomial trees of orders 3, 2, and 0 (see figure below).

Merge

As mentioned above, the simplest and most important operation is the merging of two binomial trees of the same order within a binomial heap. Due to the structure of binomial trees, they can be merged trivially. As their root node is the smallest element within the tree, by comparing the two keys, the smaller of them is the minimum key, and becomes the new root node. Then the other tree becomes a subtree of the combined tree. This operation is basic to the complete merging of two binomial heaps.

    function mergeTree(p, q)
        if p.root.key <= q.root.key
            return p.addSubTree(q)
        else
            return q.addSubTree(p)

Figure: To merge two binomial trees of the same order, first compare the root keys. Since 7 > 3, the black tree on the left (with root node 7) is attached to the grey tree on the right (with root node 3) as a subtree. The result is a tree of order 3.

The operation of merging two heaps is perhaps the most interesting and can be used as a subroutine in most other operations. The lists of roots of both heaps are traversed simultaneously in a manner similar to that of the merge algorithm.

If only one of the heaps contains a tree of order j, this tree is moved to the merged heap. If both heaps contain a tree of order j, the two trees are merged to one tree of order j+1 so that the minimum-heap property is satisfied. Note that it may later be necessary to merge this tree with some other tree of order j+1 present in one of the heaps. In the course of the algorithm, we need to examine at most three trees of any order (two from the two heaps we merge and one composed of two smaller trees).

Because each binomial tree in a binomial heap corresponds to a bit in the binary representation of its size, there is an analogy between the merging of two heaps and the binary addition of the sizes of the two heaps, from right to left. Whenever a carry occurs during addition, this corresponds to a merging of two binomial trees during the merge.

Each tree has order at most log n and therefore the running time is O(log n).

    function merge(p, q)
        while not (p.end() and q.end())
            tree = mergeTree(p.currentTree(), q.currentTree())
            if not heap.currentTree().empty()
                tree = mergeTree(tree, heap.currentTree())
            heap.addTree(tree)
            heap.next(); p.next(); q.next()

Figure: This shows the merger of two binomial heaps. This is accomplished by merging two binomial trees of the same order one by one. If the resulting merged tree has the same order as one binomial tree in one of the two heaps, then those two are merged again.

Insert

Inserting a new element into a heap can be done by simply creating a new heap containing only this element and then merging it with the original heap. Due to the merge, insert takes O(log n) time. However, across a series of n consecutive insertions, insert has an amortized time of O(1) (i.e. constant).

Find minimum

To find the minimum element of the heap, find the minimum among the roots of the binomial trees. This can again be done easily in O(log n) time, as there are just O(log n) trees and hence roots to examine.

By using a pointer to the binomial tree that contains the minimum element, the time for this operation can be reduced to O(1). The pointer must be updated when performing any operation other than Find minimum. This can be done in O(log n) without raising the running time of any operation.

Delete minimum

To delete the minimum element from the heap, first find this element, remove it from its binomial tree, and obtain a list of its subtrees. Then transform this list of subtrees into a separate binomial heap by reordering them from smallest to largest order. Then merge this heap with the original heap. Since each root has at most log n children, creating this new heap is O(log n). Merging heaps is O(log n), so the entire delete minimum operation is O(log n).

    function deleteMin(heap)
        min = heap.trees().first()
        for each current in heap.trees()
            if current.root < min.root then min = current
        tmp = new empty heap
        for each tree in min.subTrees()
            tmp.addTree(tree)
        heap.removeTree(min)
        merge(heap, tmp)

Decrease key

After decreasing the key of an element, it may become smaller than the key of its parent, violating the minimum-heap property. If this is the case, exchange the element with its parent, and possibly also with its grandparent, and so on, until the minimum-heap property is no longer violated. Each binomial tree has height at most log n, so this takes O(log n) time.

Delete

To delete an element from the heap, decrease its key to negative infinity (that is, some value lower than any element in the heap) and then delete the minimum in the heap.

5.6.4 Summary of running times

In the following time complexities[1] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.

[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[4]

[2] Amortized time.

[3] Bounded by Ω(log log n), $O(2^{2\sqrt{\log \log n}})$.[7][8]

[4] n is the size of the larger heap.

5.6.5 Applications

• Discrete event simulation
• Priority queues

5.6.6 See also

• Fibonacci heap
• Soft heap
• Skew binomial heap

5.6.7 References

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 19: Binomial Heaps, pp. 455–475.
• Vuillemin, J. (1978). A data structure for manipulating priority queues. Communications of the ACM 21, 309–314.

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.

[2] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5

[3] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58

[4] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.

[5] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.

[6] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.

[7] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.

[8] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.

5.6.8 External links

• Java applet simulation of binomial heap
• Python implementation of binomial heap
• Two C implementations of binomial heap (a generic one and one optimized for integer keys)
• Haskell implementation of binomial heap
• Common Lisp implementation of binomial heap

5.7 Fibonacci heap

In computer science, a Fibonacci heap is a data structure for priority queue operations, consisting of a collection of heap-ordered trees. It has a better amortized running time than many other priority queue data structures including the binary heap and binomial heap. Michael L. Fredman and Robert E. Tarjan developed Fibonacci heaps in 1984 and published them in a scientific journal in 1987. They named Fibonacci heaps after the Fibonacci numbers, which are used in their running time analysis.

For the Fibonacci heap, the find-minimum operation takes constant (O(1)) amortized time.[1] The insert and decrease key operations also work in constant amortized time.[2] Deleting an element (most often used in the special case of deleting the minimum element) works in O(log n) amortized time, where n is the size of the heap.[2] This means that starting from an empty data structure, any sequence of a insert and decrease key operations and b delete operations would take O(a + b log n) worst case time, where n is the maximum heap size. In a binary or binomial heap such a sequence of operations would take O((a + b) log n) time. A Fibonacci heap is thus better than a binary or binomial heap when b is smaller than a by a non-constant factor.
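As a rough illustration of how the two bounds differ, consider the following worked numbers. The operation counts are hypothetical, chosen only to make the arithmetic easy, and the constant factors hidden by the O notation are ignored, so the result indicates orders of magnitude and not measured running times.

```latex
\[
a = 10^{6}, \qquad b = 10^{3}, \qquad n = 10^{6}, \qquad \log_2 n \approx 20
\]
\[
a + b \log_2 n \;\approx\; 10^{6} + 10^{3} \cdot 20 \;=\; 1.02 \times 10^{6}
\]
\[
(a + b) \log_2 n \;\approx\; 1.001 \times 10^{6} \cdot 20 \;\approx\; 2.0 \times 10^{7}
\]
```

In this setting, where deletions are rarer than insertions and key decreases by a factor of a thousand, the first bound is about twenty times smaller than the second.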
It is also possible to merge two Fibonacci heaps in constant amortized time, improving on the logarithmic merge time of a binomial heap, and improving on binary heaps which cannot handle merges efficiently.

Using Fibonacci heaps for priority queues improves the asymptotic running time of important algorithms, such as Dijkstra’s algorithm for computing the shortest path between two nodes in a graph, compared to the same algorithm using other slower priority queue data structures.

5.7.1 Structure

Figure 1. Example of a Fibonacci heap. It has three trees of degrees 0, 1 and 3. Three vertices are marked (shown in blue). Therefore, the potential of the heap is 9 (3 trees + 2 × (3 marked vertices)).

A Fibonacci heap is a collection of trees satisfying the minimum-heap property, that is, the key of a child is always greater than or equal to the key of the parent. This implies that the minimum key is always at the root of one of the trees. Compared with binomial heaps, the structure of a Fibonacci heap is more flexible. The trees do not have a prescribed shape and in the extreme case the heap can have every element in a separate tree. This flexibility allows some operations to be executed in a lazy manner, postponing the work for later operations. For example, merging heaps is done simply by concatenating the two lists of trees, and operation decrease key sometimes cuts a node from its parent and forms a new tree.

However at some point some order needs to be introduced to the heap to achieve the desired running time. In particular, degrees of nodes (here degree means the number of children) are kept quite low: every node has degree at most O(log n) and the size of a subtree rooted in a node of degree k is at least $F_{k+2}$, where $F_k$ is the kth Fibonacci number. This is achieved by the rule that we can cut at most one child of each non-root node. When a second child is cut, the node itself needs to be cut from its parent and becomes the root of a new tree (see Proof of degree bounds, below). The number of trees is decreased in the operation delete minimum, where trees are linked together.

As a result of a relaxed structure, some operations can take a long time while others are done very quickly. For the amortized running time analysis we use the potential method, in that we pretend that very fast operations take a little bit longer than they actually do. This additional time is then later combined and subtracted from the actual running time of slow operations. The amount of time saved for later use is measured at any given moment by a potential function. The potential of a Fibonacci heap is given by

Potential = t + 2m

where t is the number of trees in the Fibonacci heap, and m is the number of marked nodes. A node is marked if at least one of its children was cut since this node was made a child of another node (all roots are unmarked). The amortized time for an operation is given by the sum of the actual time and c times the difference in potential, where c is a constant (chosen to match the constant factors in the O notation for the actual time).

Thus, the root of each tree in a heap has one unit of time stored. This unit of time can be used later to link this tree with another tree at amortized time 0. Also, each marked node has two units of time stored. One can be used to cut the node from its parent. If this happens, the node becomes a root and the second unit of time will remain stored in it as in any other root.

5.7.2 Implementation of operations

To allow fast deletion and concatenation, the roots of all trees are linked using a circular, doubly linked list. The children of each node are also linked using such a list. For each node, we maintain its number of children and whether the node is marked. Moreover, we maintain a pointer to the root containing the minimum key.

Operation find minimum is now trivial because we keep the pointer to the node containing it. It does not change the potential of the heap, therefore both actual and amortized cost are constant.

As mentioned above, merge is implemented simply by concatenating the lists of tree roots of the two heaps. This can be done in constant time and the potential does not change, leading again to constant amortized time.

Operation insert works by creating a new heap with one element and doing merge. This takes constant time, and the potential increases by one, because the number of trees increases. The amortized cost is thus still constant.

Operation extract minimum (same as delete minimum) operates in three phases. First we take the root containing the minimum element and remove it. Its children will become roots of new trees. If the number of children was d, it takes time O(d) to process all new roots and the potential increases by d−1. Therefore, the amortized running time of this phase is O(d) = O(log n).

Figure: Fibonacci heap from Figure 1 after first phase of extract minimum. Node with key 1 (the minimum) was deleted and its children were added as separate trees.

However to complete the extract minimum operation, we need to update the pointer to the root with minimum key. Unfortunately there may be up to n roots we need to check. In the second phase we therefore decrease the number of roots by successively linking together roots of the same degree. When two roots u and v have the same degree, we make one of them a child of the other so that the one with the smaller key remains the root. Its degree will increase by one. This is repeated until every root has a different degree. To find trees of the same degree efficiently we use an array of length O(log n) in which we keep a pointer to one root of each degree. When a second root is found of the same degree, the two are linked and the array is updated. The actual running time is O(log n + m) where m is the number of roots at the beginning of the second phase. At the end we will have at most O(log n) roots (because each has a different degree). Therefore, the difference in the potential function from before this phase to after it is: O(log n) − m, and the amortized running time is then at most O(log n + m) + c(O(log n) − m). With a sufficiently large choice of c, this simplifies to O(log n).

Figure: Fibonacci heap from Figure 1 after extract minimum is completed. First, nodes 3 and 6 are linked together. Then the result is linked with tree rooted at node 2. Finally, the new minimum is found.

In the third phase we check each of the remaining roots and find the minimum. This takes O(log n) time and the potential does not change. The overall amortized running time of extract minimum is therefore O(log n).

Operation decrease key will take the node, decrease the key and if the heap property becomes violated (the new key is smaller than the key of the parent), the node is cut from its parent. If the parent is not a root, it is marked. If it has been marked already, it is cut as well and its parent is marked. We continue upwards until we reach either the root or an unmarked node. Now we set the minimum pointer to the decreased value if it is the new minimum. In the process we create some number, say k, of new trees. Each of these new trees except possibly the first one was marked originally but as a root it will become unmarked. One node can become marked. Therefore, the number of marked nodes changes by −(k − 1) + 1 = −k + 2. Combining these 2 changes, the potential changes by 2(−k + 2) + k = −k + 4. The actual time to perform the cutting was O(k), therefore (again with a sufficiently large choice of c) the amortized running time is constant.

Figure: Fibonacci heap from Figure 1 after decreasing key of node 9 to 0. This node as well as its two marked ancestors are cut from the tree rooted at 1 and placed as new roots.

Finally, operation delete can be implemented simply by decreasing the key of the element to be deleted to minus infinity, thus turning it into the minimum of the whole heap. Then we call extract minimum to remove it. The amortized running time of this operation is O(log n).

5.7.3 Proof of degree bounds

The amortized performance of a Fibonacci heap depends on the degree (number of children) of any tree root being O(log n), where n is the size of the heap. Here we show that the size of the (sub)tree rooted at any node x of degree d in the heap must have size at least $F_{d+2}$, where $F_k$ is the kth Fibonacci number. The degree bound follows from this and the fact (easily proved by induction) that $F_{d+2} \ge \varphi^d$ for all integers d ≥ 0, where $\varphi = (1 + \sqrt{5})/2 \approx 1.618$. (We then have $n \ge F_{d+2} \ge \varphi^d$, and taking the log to base φ of both sides gives $d \le \log_\varphi n$ as required.)

Consider any node x somewhere in the heap (x need not be the root of one of the main trees). Define size(x) to be the size of the tree rooted at x (the number of descendants of x, including x itself). We prove by induction on the height of x (the length of a longest simple path from x to a descendant leaf), that size(x) ≥ $F_{d+2}$, where d is the degree of x.

Base case: If x has height 0, then d = 0, and size(x) = 1 = $F_2$.

Inductive case: Suppose x has positive height and degree d > 0. Let $y_1, y_2, \ldots, y_d$ be the children of x, indexed in order of the times they were most recently made children of x ($y_1$ being the earliest and $y_d$ the latest), and let $c_1, c_2, \ldots, c_d$ be their respective degrees. We claim that $c_i \ge i-2$ for each i with 2 ≤ i ≤ d: Just before $y_i$ was made a child of x, $y_1, \ldots, y_{i-1}$ were already children of x, and so x had degree at least i−1 at that time. Since trees are combined only when the degrees of their roots are equal, it must have been that $y_i$ also had degree at least i−1 at the time it became a child of x.
From that time to the present, yi can only have lost at most one child (as guaranteed by the marking process), and so its current degree ci is at least i−2. This proves the claim. 149 Although the total running time of a sequence of operations starting with an empty structure is bounded by the bounds given above, some (very few) operations in the sequence can take very long to complete (in particular delete and delete minimum have linear running time in the worst case). For this reason Fibonacci heaps and other amortized data structures may not be appropriate for realtime systems. It is possible to create a data structure which has the same worst-case performance as the Fibonacci heap has amortized performance.[4][5] One such structure, the Brodal queue, is, in the words of the creator, “quite complicated” and "[not] applicable in practice.” Created in 2012, the strict Fibonacci heap is a simpler (compared to Brodal’s) structure with the same worstcase bounds. It is unknown whether the strict Fibonacci heap is efficient in practice. The run-relaxed heaps of Driscoll et al. give good worst-case performance for all Fibonacci heap operations except merge. 5.7.5 Summary of running times In the following time complexities[6] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap. [1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[9] Since the heights of all the yi are strictly less than that of [2] Amortized time. √ x, we can apply the inductive hypothesis to them to get [3] Bounded by Ω(log log n), O(22 log log n ) [2][12] size(yi) ≥ Fci₊₂ ≥ F₍i₋₂₎₊₂ = Fi. The nodes x and y1 each contribute at least 1 to size(x), and so we have [4] n is the size of the larger heap. ∑d ∑d size(x) ≥ 2 + i=2 size(yi ) ≥ 2 + i=2 Fi = 1 + ∑d i=0 Fi . 
5.7.6 Practical considerations ∑d A routine induction proves that 1 + i=0 Fi = Fd+2 for any d ≥ 0 , which gives the desired lower bound on Fibonacci heaps have a reputation for being slow in practice[13] due to large memory consumption per node size(x). and high constant factors on all operations.[14] Recent experimental results suggest that Fibonacci heaps are more efficient in practice than most of its later derivatives, 5.7.4 Worst case including quake heaps, violation heaps, strict Fibonacci heaps, rank pairing heaps, but less efficient than either Although Fibonacci heaps look very efficient, they have pairing heaps or array-based heaps.[15] the following two drawbacks (as mentioned in the paper “The Pairing Heap: A new form of Self Adjusting Heap”): “They are complicated when it comes to cod- 5.7.7 References ing them. Also they are not as efficient in practice when compared with the theoretically less efficient forms of [1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001) [1990]. “Chapter 20: heaps, since in their simplest version they require storFibonacci Heaps”. Introduction to Algorithms (2nd ed.). age and manipulation of four pointers per node, comMIT Press and McGraw-Hill. pp. 476–497. ISBN 0pared to the two or three pointers per node needed for 262-03293-7. Third edition p. 518. [3] other structures ". These other structures are referred to Binary heap, Binomial heap, Pairing Heap, Brodal Heap [2] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). and Rank Pairing Heap. “Fibonacci heaps and their uses in improved network 150 CHAPTER 5. PRIORITY QUEUES optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874. [3] Fredman, Michael L.; Sedgewick, Robert; Sleator, Daniel D.; Tarjan, Robert E. (1986). “The pairing heap: a new form of self-adjusting heap” (PDF). Algorithmica. 1 (1): 111–129. doi:10.1007/BF01840439. 
[4] Gerth Stølting Brodal (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics: 52–58, doi:10.1145/313852.313883, ISBN 0-89871-366-8, CiteSeerX: 10.1.1.43.8133
[5] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[6] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.
[7] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5
[8] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58
[9] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.
[10] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.
[11] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[12] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.
[13] http://www.cs.princeton.edu/~wayne/kleinberg-tardos/pdf/FibonacciHeaps.pdf, p. 79
[14] http://web.stanford.edu/class/cs166/lectures/07/Small07.pdf, p. 72
[15] Larkin, Daniel; Sen, Siddhartha; Tarjan, Robert (2014). “A Back-to-Basics Empirical Study of Priority Queues”. Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments: 61–72. arXiv:1403.0252. doi:10.1137/1.9781611973198.7.

5.7.8 External links

• Java applet simulation of a Fibonacci heap
• MATLAB implementation of Fibonacci heap
• De-recursived and memory efficient C implementation of Fibonacci heap (free/libre software, CeCILL-B license)
• Ruby implementation of the Fibonacci heap (with tests)
• Pseudocode of the Fibonacci heap algorithm
• Various Java Implementations for Fibonacci heap

5.8 Pairing heap

A pairing heap is a type of heap data structure with relatively simple implementation and excellent practical amortized performance, introduced by Michael Fredman, Robert Sedgewick, Daniel Sleator, and Robert Tarjan in 1986.[1] Pairing heaps are heap-ordered multiway tree structures, and can be considered simplified Fibonacci heaps. They are considered a “robust choice” for implementing such algorithms as Prim’s MST algorithm,[2] and support the following operations (assuming a min-heap):

• find-min: simply return the top element of the heap.
• merge: compare the two root elements, the smaller remains the root of the result, the larger element and its subtree is appended as a child of this root.
• insert: create a new heap for the inserted element and merge into the original heap.
• decrease-key (optional): remove the subtree rooted at the key to be decreased, replace the key with a smaller key, then merge the result back into the heap.
• delete-min: remove the root and merge its subtrees. Various strategies are employed.

The analysis of pairing heaps’ time complexity was initially inspired by that of splay trees.[1] The amortized time per delete-min is O(log n), and the operations find-min, merge, and insert run in O(1) amortized time.[3]

Determining the precise asymptotic running time of pairing heaps when a decrease-key operation is needed has turned out to be difficult.
Initially, the time complexity of this operation was conjectured on empirical grounds to be O(1),[4] but Fredman proved that the amortized time per decrease-key is at least Ω(log log n) for some sequences of operations.[5] Using a different amortization argument, Pettie then proved that insert, meld, and decrease-key all run in $O(2^{2\sqrt{\log\log n}})$ amortized time, which is o(log n).[6] Elmasry later introduced a variant of pairing heaps for which decrease-key runs in O(log log n) amortized time and with all other operations matching Fibonacci heaps,[7] but no tight Θ(log log n) bound is known for the original data structure.[6][3] Moreover, it is an open question whether a o(log n) amortized time bound for decrease-key and a O(1) amortized time bound for insert can be achieved simultaneously.[8]

Although this is worse than other priority queue algorithms such as Fibonacci heaps, which perform decrease-key in O(1) amortized time, the performance in practice is excellent. Stasko and Vitter,[4] Moret and Shapiro,[9] and Larkin, Sen, and Tarjan[8] conducted experiments on pairing heaps and other heap data structures. They concluded that pairing heaps are often faster in practice than array-based binary heaps and d-ary heaps, and almost always faster in practice than other pointer-based heaps, including data structures like Fibonacci heaps that are theoretically more efficient.
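The operations listed above (find-min, merge, insert and delete-min with two-pass pairing) can be sketched in Python. This is a minimal, purely functional illustration rather than a reference implementation, so decrease-key is left out. The names `Heap`, `merge_pairs` and `heap_sort`, the use of `None` for the empty heap, and the cons-list representation of the subheaps as nested `(first, rest)` tuples are all assumptions of this sketch.

```python
from typing import Any, NamedTuple, Optional


class Heap(NamedTuple):
    """Non-empty pairing heap: a root element and a cons-list of subheaps."""
    elem: Any
    # Either None (no subheaps) or a pair (first_subheap, rest_of_list).
    subheaps: Optional[tuple] = None


def find_min(heap):
    """Return the root element; the empty heap (None) has no minimum."""
    if heap is None:
        raise IndexError("find_min on an empty heap")
    return heap.elem


def merge(h1, h2):
    """Keep the smaller root; the other heap becomes its first subheap."""
    if h1 is None:
        return h2
    if h2 is None:
        return h1
    if h1.elem < h2.elem:
        return Heap(h1.elem, (h2, h1.subheaps))
    return Heap(h2.elem, (h1, h2.subheaps))


def insert(elem, heap):
    """Insert by merging with a one-element heap."""
    return merge(Heap(elem), heap)


def merge_pairs(subheaps):
    """Two-pass pairing: merge in pairs left to right, then fold right to left."""
    paired = []
    while subheaps is not None:
        first, subheaps = subheaps
        if subheaps is None:
            paired.append(first)          # odd heap left over at the end
        else:
            second, subheaps = subheaps
            paired.append(merge(first, second))
    result = None
    for h in reversed(paired):
        result = merge(h, result)
    return result


def delete_min(heap):
    """Drop the root and recombine its subheaps."""
    if heap is None:
        raise IndexError("delete_min on an empty heap")
    return merge_pairs(heap.subheaps)


def heap_sort(items):
    """Usage example: sort by repeated insert, find_min and delete_min."""
    heap = None
    for x in items:
        heap = insert(x, heap)
    out = []
    while heap is not None:
        out.append(find_min(heap))
        heap = delete_min(heap)
    return out
```

Because every operation returns a new heap and never mutates an old one, earlier versions of the heap remain valid. A pointer-based variant would be needed to support decrease-key.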
5.8.1 Structure

A pairing heap is either an empty heap, or a pair consisting of a root element and a possibly empty list of pairing heaps. The heap ordering property requires that all the root elements of the subheaps in the list are not smaller than the root element of the heap. The following description assumes a purely functional heap that does not support the decrease-key operation.

    type PairingHeap[Elem] = Empty | Heap(elem: Elem, subheaps: List[PairingHeap[Elem]])

A pointer-based implementation for RAM machines, supporting decrease-key, can be achieved using three pointers per node, by representing the children of a node by a singly-linked list: a pointer to the node’s first child, one to its next sibling, and one to the parent. Alternatively, the parent pointer can be omitted by letting the last child point back to the parent, if a single boolean flag is added to indicate “end of list”. This achieves a more compact structure at the expense of a constant overhead factor per operation.[1]

5.8.2 Operations

find-min

The function find-min simply returns the root element of the heap:

    function find-min(heap)
        if heap == Empty
            error
        else
            return heap.elem

merge

Merging with an empty heap returns the other heap, otherwise a new heap is returned that has the minimum of the two root elements as its root element and just adds the heap with the larger root to the list of subheaps:

    function merge(heap1, heap2)
        if heap1 == Empty
            return heap2
        elsif heap2 == Empty
            return heap1
        elsif heap1.elem < heap2.elem
            return Heap(heap1.elem, heap2 :: heap1.subheaps)
        else
            return Heap(heap2.elem, heap1 :: heap2.subheaps)

insert

The easiest way to insert an element into a heap is to merge the heap with a new heap containing just this element and an empty list of subheaps:

    function insert(elem, heap)
        return merge(Heap(elem, []), heap)

delete-min

The only non-trivial fundamental operation is the deletion of the minimum element from the heap. The standard strategy first merges the subheaps in pairs (this is the step that gave this data structure its name) from left to right and then merges the resulting list of heaps from right to left:

    function delete-min(heap)
        if heap == Empty
            error
        else
            return merge-pairs(heap.subheaps)

This uses the auxiliary function merge-pairs:

    function merge-pairs(l)
        if length(l) == 0
            return Empty
        elsif length(l) == 1
            return l[0]
        else
            return merge(merge(l[0], l[1]), merge-pairs(l[2.. ]))

That this does indeed implement the described two-pass left-to-right then right-to-left merging strategy can be seen from this reduction:

       merge-pairs([H1, H2, H3, H4, H5, H6, H7])
    => merge(merge(H1, H2), merge-pairs([H3, H4, H5, H6, H7]))
         # merge H1 and H2 to H12, then the rest of the list
    => merge(H12, merge(merge(H3, H4), merge-pairs([H5, H6, H7])))
         # merge H3 and H4 to H34, then the rest of the list
    => merge(H12, merge(H34, merge(merge(H5, H6), merge-pairs([H7]))))
         # merge H5 and H6 to H56, then the rest of the list
    => merge(H12, merge(H34, merge(H56, H7)))
         # switch direction, merge the last two resulting heaps, giving H567
    => merge(H12, merge(H34, H567))
         # merge the last two resulting heaps, giving H34567
    => merge(H12, H34567)
         # finally, merge the first merged pair with the result of merging the rest
    => H1234567

5.8.3 Summary of running times

In the following time complexities[10] O(f) is an asymptotic upper bound and Θ(f) is an asymptotically tight bound (see Big O notation). Function names assume a min-heap.
[1] Brodal and Okasaki later describe a persistent variant with the same bounds except for decrease-key, which is not supported. Heaps with n elements can be constructed bottom-up in O(n).[13]
[2] Amortized time.
[3] Bounded by $\Omega(\log\log n)$, $O(2^{2\sqrt{\log\log n}})$.[16][17]
[4] n is the size of the larger heap.

5.8.4 References

[1] Fredman, Michael L.; Sedgewick, Robert; Sleator, Daniel D.; Tarjan, Robert E. (1986). “The pairing heap: a new form of self-adjusting heap” (PDF). Algorithmica. 1 (1): 111–129. doi:10.1007/BF01840439.
[2] Mehlhorn, Kurt; Sanders, Peter (2008). Algorithms and Data Structures: The Basic Toolbox (PDF). Springer. p. 231.
[3] Iacono, John (2000). Improved upper bounds for pairing heaps (PDF). Proc. 7th Scandinavian Workshop on Algorithm Theory. Lecture Notes in Computer Science. Springer-Verlag. pp. 63–77. arXiv:1110.4428. doi:10.1007/3-540-44985-X_5. ISBN 978-3-540-67690-4.
[4] Stasko, John T.; Vitter, Jeffrey S. (1987), “Pairing heaps: experiments and analysis”, Communications of the ACM, 30 (3): 234–249, doi:10.1145/214748.214759, CiteSeerX: 10.1.1.106.2988
[5] Fredman, Michael L. (1999). “On the efficiency of pairing heaps and related data structures” (PDF). Journal of the ACM. 46 (4): 473–501. doi:10.1145/320211.320214.
[6] Pettie, Seth (2005), “Towards a final analysis of pairing heaps” (PDF), Proc. 46th Annual IEEE Symposium on Foundations of Computer Science, pp. 174–183, doi:10.1109/SFCS.2005.75, ISBN 0-7695-2468-0
[7] Elmasry, Amr (2009), “Pairing heaps with O(log log n) decrease cost” (PDF), Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 471–476, doi:10.1137/1.9781611973068.52
[8] Larkin, Daniel H.; Sen, Siddhartha; Tarjan, Robert E. (2014), “A back-to-basics empirical study of priority queues”, Proceedings of the 16th Workshop on Algorithm Engineering and Experiments, pp. 61–72, arXiv:1403.0252, doi:10.1137/1.9781611973198.7
[9] Moret, Bernard M. E.; Shapiro, Henry D. (1991), “An empirical analysis of algorithms for constructing a minimum spanning tree”, Proc. 2nd Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, 519, Springer-Verlag, pp. 400–411, doi:10.1007/BFb0028279, ISBN 3-540-54343-0, CiteSeerX: 10.1.1.53.5960
[10] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (1st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8.
[11] Iacono, John (2000), “Improved upper bounds for pairing heaps”, Proc. 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, 1851, Springer-Verlag, pp. 63–77, doi:10.1007/3-540-44985-X_5
[12] Brodal, Gerth S. (1996), “Worst-Case Efficient Priority Queues”, Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 52–58
[13] Goodrich, Michael T.; Tamassia, Roberto (2004). “7.3.6. Bottom-Up Heap Construction”. Data Structures and Algorithms in Java (3rd ed.). pp. 338–341.
[14] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2009). “Rank-pairing heaps” (PDF). SIAM J. Computing: 1463–1485.
[15] Brodal, G. S. L.; Lagogiannis, G.; Tarjan, R. E. (2012). Strict Fibonacci heaps (PDF). Proceedings of the 44th symposium on Theory of Computing - STOC '12. p. 1177. doi:10.1145/2213977.2214082. ISBN 9781450312455.
[16] Fredman, Michael Lawrence; Tarjan, Robert E. (1987). “Fibonacci heaps and their uses in improved network optimization algorithms” (PDF). Journal of the Association for Computing Machinery. 34 (3): 596–615. doi:10.1145/28869.28874.
[17] Pettie, Seth (2005). “Towards a Final Analysis of Pairing Heaps” (PDF). Max Planck Institut für Informatik.

5.8.5 External links

• Louis Wasserman discusses pairing heaps and their implementation in Haskell in The Monad Reader, Issue 16 (pp. 37–52).
• pairing heaps, Sartaj Sahni

5.9 Double-ended priority queue

Not to be confused with Double-ended queue.

In computer science, a double-ended priority queue (DEPQ)[1] or double-ended heap[2] is a data structure similar to a priority queue or heap, but allows for efficient removal of both the maximum and minimum, according to some ordering on the keys (items) stored in the structure. Every element in a DEPQ has a priority or value. In a DEPQ, it is possible to remove the elements in both ascending as well as descending order.[3]

5.9.1 Operations

A double-ended priority queue features the following operations:

isEmpty() Checks if the DEPQ is empty and returns true if empty.
size() Returns the total number of elements present in the DEPQ.
getMin() Returns the element having least priority.
getMax() Returns the element having highest priority.
put(x) Inserts the element x in the DEPQ.
removeMin() Removes an element with minimum priority and returns this element.
removeMax() Removes an element with maximum priority and returns this element.

If an operation is to be performed on two elements having the same priority, then the element inserted first is chosen. Also, the priority of any element can be changed once it has been inserted in the DEPQ.[4]

5.9.2 Implementation

Double-ended priority queues can be built from balanced binary search trees (where the minimum and maximum elements are the leftmost and rightmost leaves, respectively), or using specialized data structures like min-max heap and pairing heap.

Generic methods of arriving at double-ended priority queues from normal priority queues are:[5]

Dual structure method

Figure: A dual structure with 14, 12, 4, 10, 8 as the members of the DEPQ.[1]

In this method two different priority queues for min and max are maintained. The same elements in both the PQs are shown with the help of correspondence pointers. Here, the minimum and maximum elements are values contained in the root nodes of the min heap and max heap respectively.

• Removing the min element: Perform removemin() on the min heap and remove(node value) on the max heap, where node value is the value in the corresponding node in the max heap.
• Removing the max element: Perform removemax() on the max heap and remove(node value) on the min heap, where node value is the value in the corresponding node in the min heap.

Total correspondence

Figure: A total correspondence heap for the elements 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11 with element 11 as buffer.[1]

Half the elements are in the min PQ and the other half in the max PQ. Each element in the min PQ has a one to one correspondence with an element in the max PQ. If the number of elements in the DEPQ is odd, one of the elements is retained in a buffer.[1] The priority of every element in the min PQ will be less than or equal to that of the corresponding element in the max PQ.

Leaf correspondence

Figure: A leaf correspondence heap for the same elements as above.[1]

In this method only the leaf elements of the min and max PQ form corresponding one to one pairs. It is not necessary for non-leaf elements to be in a one to one correspondence pair.[1]

Interval heaps

Figure: Implementing a DEPQ using interval heap.

Apart from the above mentioned correspondence methods, DEPQs can be obtained efficiently using interval heaps.[6] An interval heap is like an embedded min-max heap in which each node contains two elements. It is a complete binary tree in which:[6]

• The left element is less than or equal to the right element.
• Both the elements define a closed interval.
• The interval represented by any node except the root is a sub-interval of the parent node.
• Elements on the left hand side define a min heap.
• Elements on the right hand side define a max heap.

Depending on the number of elements, two cases are possible:[6]

1. Even number of elements: In this case, each node contains two elements, say p and q, with p ≤ q. Every node is then represented by the interval [p, q].
2. Odd number of elements: In this case, each node except the last contains two elements represented by the interval [p, q] whereas the last node will contain a single element and is represented by the interval [p, p].

Inserting an element

Depending on the number of elements already present in the interval heap, the following cases are possible:

• Odd number of elements: If the number of elements in the interval heap is odd, the new element is first inserted in the last node. Then, it is successively compared with the previous node elements and tested to satisfy the criteria essential for an interval heap as stated above. If the element violates any of the criteria, it is moved from the last node toward the root until all the conditions are satisfied.[6]
• Even number of elements: If the number of elements is even, then for the insertion of a new element an additional node is created. If the element falls to the left of the parent interval, it is considered to be in the min heap and if the element falls to the right of the parent interval, it is considered to be in the max heap. Further, it is compared successively and moved from the last node toward the root until all the conditions for an interval heap are satisfied. If the element lies within the interval of the parent node itself, the process stops immediately and no elements are moved.[6]

The time required for inserting an element depends on the number of movements required to meet all the conditions and is O(log n).

Deleting an element

• Min element: In an interval heap, the minimum element is the element on the left hand side of the root node. This element is removed and returned. To fill in the vacancy created on the left hand side of the root node, an element from the last node is removed and reinserted into the root node. This element is then compared successively with all the left hand elements of the descending nodes and the process stops when all the conditions for an interval heap are satisfied. If the left hand side element in a node becomes greater than the right side element at any stage, the two elements are swapped[6] and then further comparisons are done. Finally, the root node will again contain the minimum element on the left hand side.
• Max element: In an interval heap, the maximum element is the element on the right hand side of the root node. This element is removed and returned. To fill in the vacancy created on the right hand side of the root node, an element from the last node is removed and reinserted into the root node. Further comparisons are carried out on a similar basis as discussed above. Finally, the root node will again contain the max element on the right hand side.

Thus, with interval heaps, both the minimum and maximum elements can be removed efficiently by traversing from root to leaf. Thus, a DEPQ can be obtained[6] from an interval heap where the elements of the interval heap are the priorities of elements in the DEPQ.

5.9.3 Time Complexity

Interval Heaps

When DEPQs are implemented using interval heaps consisting of n elements, the time complexities for the various functions are formulated in the table below.[1]

Pairing heaps

When DEPQs are implemented using heaps or pairing heaps consisting of n elements, the time complexities for the various functions are formulated in the table below.[1] For pairing heaps, it is an amortized complexity.

5.9.4 Applications

External sorting

One example application of the double-ended priority queue is external sorting. In an external sort, there are more elements than can be held in the computer’s memory. The elements to be sorted are initially on a disk and the sorted sequence is to be left on the disk. The external quick sort is implemented using the DEPQ as follows:

1. Read in as many elements as will fit into an internal DEPQ. The elements in the DEPQ will eventually be the middle group (pivot) of elements.
2. Read in the remaining elements. If the next element is ≤ the smallest element in the DEPQ, output this next element as part of the left group. If the next element is ≥ the largest element in the DEPQ, output this next element as part of the right group. Otherwise, remove either the max or min element from the DEPQ (the choice may be made randomly or alternately); if the max element is removed, output it as part of the right group; otherwise, output the removed element as part of the left group; insert the newly input element into the DEPQ.
3. Output the elements in the DEPQ, in sorted order, as the middle group.
4. Sort the left and right groups recursively.

5.9.5 See also

• Queue (abstract data type)
• Priority queue
• Double-ended queue

5.9.6 References

[1] Data Structures, Algorithms, & Applications in Java: Double-Ended Priority Queues, Sartaj Sahni, 1999.
[2] Brass, Peter (2008). Advanced Data Structures. Cambridge University Press. p. 211. ISBN 9780521880374.
[3] “Depq - Double-Ended Priority Queue”.
[4] “depq”.
[5] Fundamentals of Data Structures in C++ - Ellis Horowitz, Sartaj Sahni and Dinesh Mehta
[6] http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf

5.10 Soft heap

For the Canterbury scene band, see Soft Heap.

In computer science, a soft heap is a variant on the simple heap data structure that has constant amortized time for 5 types of operations. This is achieved by carefully “corrupting” (increasing) the keys of at most a certain number of values in the heap. The constant time operations are:

• create(S): Create a new soft heap
• insert(S, x): Insert an element into a soft heap
• meld(S, S'): Combine the contents of two soft heaps into one, destroying both
• delete(S, x): Delete an element from a soft heap
• findmin(S): Get the element with minimum key in the soft heap

Other heaps such as Fibonacci heaps achieve most of these bounds without any corruption, but cannot provide a constant-time bound on the critical delete operation. The amount of corruption can be controlled by the choice of a parameter ε, but the lower this is set, the more time insertions require (O(log 1/ε) for an error rate of ε).

More precisely, the guarantee offered by the soft heap is the following: for a fixed value ε between 0 and 1/2, at any point in time there will be at most ε·n corrupted keys in the heap, where n is the number of elements inserted so far. Note that this does not guarantee that only a fixed percentage of the keys currently in the heap are corrupted: in an unlucky sequence of insertions and deletions, it can happen that all elements in the heap will have corrupted keys. Similarly, we have no guarantee that in a sequence of elements extracted from the heap with findmin and delete, only a fixed percentage will have corrupted keys: in an unlucky scenario only corrupted elements are extracted from the heap.

The soft heap was designed by Bernard Chazelle in 2000. The term “corruption” in the structure is the result of what Chazelle called “carpooling” in a soft heap. Each node in the soft heap contains a linked list of keys and one common key. The common key is an upper bound on the values of the keys in the linked list. Once a key is added to the linked list, it is considered corrupted because its value is never again relevant in any of the soft heap operations: only the common keys are compared. This is what makes soft heaps “soft”; you can't be sure whether or not any particular value you put into it will be corrupted. The purpose of these corruptions is effectively to lower the information entropy of the data, enabling the data structure to break through information-theoretic barriers regarding heaps.

5.10.1 Applications

Despite their limitations and unpredictable nature, soft heaps are useful in the design of deterministic algorithms. They were used to achieve the best complexity to date for finding a minimum spanning tree. They can also be used to easily build an optimal selection algorithm, as well as near-sorting algorithms, which are algorithms that place every element near its final position, a situation in which insertion sort is fast.

One of the simplest examples is the selection algorithm. Say we want to find the kth smallest of a group of n numbers. First, we choose an error rate of 1/3; that is, at most about 33% of the keys we insert will be corrupted. Now, we insert all n elements into the heap; we call the original values the “correct” keys, and the values stored in the heap the “stored” keys. At this point, at most n/3 keys are corrupted, that is, for at most n/3 keys the “stored” key is larger than the “correct” key; for all the others the stored key equals the correct key.

Next, we delete the minimum element from the heap n/3 times (this is done according to the “stored” key). As the total number of insertions we have made so far is still n, there are still at most n/3 corrupted keys in the heap. Accordingly, at least 2n/3 − n/3 = n/3 of the keys remaining in the heap are not corrupted.

Let L be the element with the largest correct key among the elements we removed. The stored key of L is possibly larger than its correct key (if L was corrupted), and even this larger value is smaller than all the stored keys of the remaining elements in the heap (as we were removing minimums). Therefore, the correct key of L is smaller than the remaining n/3 uncorrupted elements in the soft heap. Thus, L divides the elements somewhere between 33%/66% and 66%/33%. We then partition the set about L using the partition algorithm from quicksort and apply the same algorithm again to either the set of numbers less than L or the set of numbers greater than L, neither of which can exceed 2n/3 elements. Since each insertion and deletion requires O(1) amortized time, the total deterministic time is T(n) = T(2n/3) + O(n). Using case 3 of the master theorem (with ε=1 and c=2/3), we know that T(n) = Θ(n).

The final algorithm looks like this:

    function softHeapSelect(a[1..n], k)
        if k = 1 then return minimum(a[1..n])
        create(S)
        for i from 1 to n
            insert(S, a[i])
        L := −∞
        for i from 1 to n/3
            x := findmin(S)
            delete(S, x)
            if x > L then L := x   // L: removed element with the largest correct key
        xIndex := partition(a, L)  // Returns new index of pivot L
        if k < xIndex
            return softHeapSelect(a[1..xIndex-1], k)
        else
            return softHeapSelect(a[xIndex..n], k-xIndex+1)

5.10.2 References

• Chazelle, B. 2000. The soft heap: an approximate priority queue with optimal error rate. J. ACM 47, 6 (Nov. 2000), 1012-1027.
• Kaplan, H. and Zwick, U. 2009. A simpler implementation and analysis of Chazelle’s soft heaps. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (New York, New York, January 4–6, 2009). Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, 477-485.

Chapter 6 Successors and neighbors

6.1 Binary search algorithm

This article is about searching a finite sorted array. For searching continuous function values, see bisection method.

In computer science, binary search, also known as half-interval search[1] or logarithmic search,[2] is a search algorithm that finds the position of a target value within a sorted array.[3][4] It compares the target value to the middle element of the array; if they are unequal, the half in which the target cannot lie is eliminated and the search continues on the remaining half until it is successful.

Binary search runs in at worst logarithmic time, making O(log n) comparisons, where n is the number of elements in the array and log is the binary logarithm; and using only constant (O(1)) space.[5] Although specialized data structures designed for fast searching, such as hash tables, can be searched more efficiently, binary search applies to a wider range of search problems.

Although the idea is simple, implementing binary search correctly requires attention to some subtleties about its exit conditions and midpoint calculation.

There exist numerous variations of binary search. One variation in particular (fractional cascading) speeds up binary searches for the same value in multiple arrays.

6.1.1 Algorithm

Binary search works on sorted arrays. A binary search begins by comparing the middle element of the array with the target value. If the target value matches the middle element, its position in the array is returned. If the target value is less than or greater than the middle element, the search continues in the lower or upper half of the array, respectively, eliminating the other half from consideration.[6]

Procedure

Given an array A of n elements with values or records $A_0, \ldots, A_{n-1}$, sorted such that $A_0 \le \ldots \le A_{n-1}$, and target value T, the following subroutine uses binary search to find the index of T in A.[6]

1. Set L to 0 and R to n − 1.
2. If L > R, the search terminates as unsuccessful.
3. Set m (the position of the middle element) to the floor of (L + R) / 2.
4. If $A_m < T$, set L to m + 1 and go to step 2.
5. If $A_m > T$, set R to m − 1 and go to step 2.
6. Now $A_m = T$, the search is done; return m.

This iterative procedure keeps track of the search boundaries via two variables. Some implementations may place the comparison for equality at the end of the algorithm, resulting in a faster comparison loop but costing one more iteration on average.[7]

Approximate matches

The above procedure only performs exact matches, finding the position of a target value. However, due to the ordered nature of sorted arrays, it is trivial to extend binary search to perform approximate matches. Particularly, binary search can be used to compute, relative to a value, its rank (the number of smaller elements), predecessor (next-smallest element), successor (next-largest element), and nearest neighbor. Range queries seeking the number of elements between two values can be performed with two rank queries.[8]

• Rank queries can be performed using a modified version of binary search. By returning m on a successful search, and L on an unsuccessful search, the number of elements less than the target value is returned instead (assuming the array contains no duplicates of the target).[8]
• Predecessor and successor queries can be performed with rank queries. Once the rank of the target value is known, its predecessor is the element at the position given by its rank, counting positions from one (as it is the largest element that is smaller than the target value). Its successor is the element after it (if it is present in the array) or at the next position after the predecessor (otherwise).[9] The nearest neighbor of the target value is either its predecessor or successor, whichever is closer.
• Range queries are also straightforward. Once the ranks of the two values are known, the number of elements greater than or equal to the first value and less than the second is the difference of the two ranks. This count can be adjusted up or down by one according to whether the endpoints of the range should be considered to be part of the range and whether the array contains keys matching those endpoints.[10]

6.1.2 Performance

Figure: A tree representing binary search. The array being searched here is [20, 30, 40, 50, 90, 100], and the target value is 40.

The performance of binary search can be analyzed by reducing the procedure to a binary comparison tree, where the root node is the middle element of the array; the middle element of the lower half is left of the root and the middle element of the upper half is right of the root. The rest of the tree is built in a similar fashion. This model represents binary search; starting from the root node, the left or right subtrees are traversed depending on whether the target value is less or more than the node under consideration, representing the successive elimination of elements.[5][11]

The worst case is ⌊log n + 1⌋ iterations (of the comparison loop), where the ⌊⌋ notation denotes the floor function that rounds its argument down to an integer. This is reached when the search reaches the deepest level of the tree, equivalent to a binary search that has reduced to one element and, in each iteration, always eliminates the smaller subarray out of the two if they are not of equal size.[lower-alpha 1][11]

On average, assuming that each element is equally likely to be searched, by the time the search completes, the target value will most likely be found at the second-deepest level of the tree. This is equivalent to a binary search that completes one iteration before the worst case, reached after log n − 1 iterations. However, the tree may be unbalanced, with the deepest level partially filled, and equivalently, the array may not be divided perfectly by the search in some iterations, half of the time resulting in the smaller subarray being eliminated. The actual number of average iterations is slightly higher, at $\log n - \frac{n - \log n - 1}{n}$ iterations.[5] In the best case, where the first middle element selected is equal to the target value, its position is returned after one iteration.[12] In terms of iterations, no search algorithm that is based solely on comparisons can exhibit better average and worst-case performance than binary search.[11]

Each iteration of the binary search algorithm defined above makes one or two comparisons, checking if the middle element is equal to the target value in each iteration. Again assuming that each element is equally likely to be searched, each iteration makes 1.5 comparisons on average. A variation of the algorithm instead checks for equality at the very end of the search, eliminating on average half a comparison from each iteration. This decreases the time taken per iteration very slightly on most computers, while guaranteeing that the search takes the maximum number of iterations, on average adding one iteration to the search. Because the comparison loop is performed only ⌊log n + 1⌋ times in the worst case, for all but enormous n, the slight increase in comparison loop efficiency does not compensate for the extra iteration. Knuth 1998 gives a value of $2^{66}$ (more than 73 quintillion)[13] elements for this variation to be faster.[lower-alpha 2][14][15]

Fractional cascading can be used to speed up searches of the same value in multiple arrays. Where k is the number of arrays, searching each array for the target value takes O(k log n) time; fractional cascading reduces this to O(k + log n).[16]

6.1.3 Binary search versus other schemes

Sorted arrays with binary search are a very inefficient solution when insertion and deletion operations are interleaved with retrieval, taking O(n) time for each such operation, and complicating memory use.

Other algorithms support much more efficient insertion and deletion, and also fast exact matching.

Hashing

For implementing associative arrays, hash tables, a data structure that maps keys to records using a hash function, are generally faster than binary search on a sorted array of records;[17] most implementations require only amortized constant time on average.[lower-alpha 3][19] However, hashing is not useful for approximate matches, such as computing the next-smallest, next-largest, and nearest key, as the only information given on a failed search is that the target is not present in any record.[20] Binary search is ideal for such matches, performing them in logarithmic time. In addition, all operations possible on a sorted array can be performed, such as finding the smallest and largest key and performing range searches.[21]

Trees

A binary search tree is a binary tree data structure that works based on the principle of binary search: the records of the tree are arranged in sorted order, and traversal of the tree is performed using a logarithmic time binary search-like algorithm. Insertion and deletion also require logarithmic time in binary search trees. This is faster than the linear time insertion and deletion of sorted arrays, and binary trees retain the ability to perform all the operations possible on a sorted array, including range and approximate queries.[22]

used for set membership. But there are other algorithms that are more specifically suited for set membership. A bit array is the simplest; it is useful when the range of keys is limited and is very fast: O(1). The Judy1 type of Judy array handles 64-bit keys efficiently.

For approximate results, Bloom filters, another probabilistic data structure based on hashing, store a set of keys by encoding the keys using a bit array and multiple hash functions. Bloom filters are much more space-efficient than bit arrays in most cases and not much slower: with k hash functions, membership queries require only O(k) time.
However, Bloom filters suffer from false posHowever, binary search is usually more efficient for itives.[lower-alpha 6][lower-alpha 7][31] searching as binary search trees will most likely be imperfectly balanced, resulting in slightly worse performance than binary search. This applies even to balanced bi- Other data structures nary search trees, binary search trees that balance their own nodes—as they rarely produce optimally-balanced There exist data structures that may improve on binary trees—but to a lesser extent. Although unlikely, the search in some cases for both searching and other operatree may be severely imbalanced with few internal nodes tions available for sorted arrays. For example, searches, with two children, resulting in the average and worst-case approximate matches, and the operations available to search time approaching n comparisons.[lower-alpha 4] Bi- sorted arrays can be performed more efficiently than binary search trees take more space than sorted arrays.[24] nary search on specialized data structures such as van Emde Boas trees, fusion trees, tries, and bit arrays. HowBinary search trees lend themselves to fast searching in ever, while these operations can always be done at least external memory stored in hard disks, where data needs to efficiently on a sorted array regardless of the keys, such be sought and paged into main memory. By splitting the data structures are usually only faster because they exploit tree into pages of some number of elements, each storing the properties of keys with a certain attribute (usually keys in turn a section of the tree, searching in a binary search that are small integers), and thus will be time or space tree costs fewer disk seeks, improving its overall perforconsuming for keys that do not have that attribute.[21] mance. Notice that this effectively creates a multiway tree, as each page is connected to each other. 
The Btree generalizes this method of tree organization; B-trees 6.1.4 Variations are frequently used to organize long-term storage such as databases and filesystems.[25][26] Uniform binary search Linear search Uniform binary search stores, instead of the lower and upper bounds, the index of the middle element and the number of elements around the middle element that were not eliminated yet. Each step reduces the width by about half. This variation is uniform because the difference between the indices of middle elements and the preceding middle elements chosen remains constant between searches of arrays of the same length.[32] Linear search is a simple search algorithm that checks every record until it finds the target value. Linear search can be done on a linked list, which allows for faster insertion and deletion than an array. Binary search is faster than linear search for sorted arrays except if the array is short.[lower-alpha 5][28] If the array must first be sorted, that cost must be amortized (spread) over any searches. Sorting the array also enables efficient approximate matches Fibonacci search and other operations.[29] Main article: Fibonacci search technique Mixed approaches Fibonacci search is a method similar to binary search that The Judy array uses a combination of approaches to prosuccessively shortens the interval in which the maximum vide a highly efficient solution. of a unimodal function lies. Given a finite interval, a unimodal function, and the maximum length of the resulting interval, Fibonacci search finds a Fibonacci number such Set membership algorithms that if the interval is divided equally into that many subinA related problem to search is set membership. Any al- tervals, the subintervals would be shorter than the maxigorithm that does lookup, like binary search, can also be mum length. After dividing the interval, it eliminates the 160 CHAPTER 6. 
SUCCESSORS AND NEIGHBORS complexity compensates for this only for large arrays.[36] Fractional cascading Main article: Fractional cascading Fractional cascading is a technique that speeds up binary searches for the same element for both exact and approximate matching in “catalogs” (arrays of sorted elements) associated with vertices in graphs. Searching each cat1 Fibonacci search on the function f (x) = sin ((x+ 10 )π) on the alog separately requires O(k log n) time, where k is the unit interval [0, 1] . The algorithm finds an interval containing number of catalogs. Fractional cascading reduces this to 1 the maximum of f with a length less than or equal to 10 in the O(k + log n) by storing specific information in each cat5 6 above example. In three iterations, it returns the interval [ 13 , 13 ] alog about other catalogs.[16] , which is of length 1 . 13 Fractional cascading was originally developed to efficiently solve various computational geometry problems, subintervals in which the maximum cannot lie until one but it also has been applied elsewhere, in domains such or more contiguous subintervals remain.[33][34] as data mining and Internet Protocol routing.[16] Exponential search 6.1.5 History Main article: Exponential search In 1946, John Mauchly made the first mention of binary search as part of the Moore School Lectures, the first ever set of lectures regarding any computer-related topic.[39] Every published binary search algorithm worked only for arrays whose length is one less than a power of two[lower-alpha 8] until 1960, when Derrick Henry Lehmer published a binary search algorithm that worked on all arrays.[41] In 1962, Hermann Bottenbruch presented an ALGOL 60 implementation of binary search that placed the comparison for equality at the end, increasing the average number of iterations by one, but reducing to one the number of comparisons per iteration.[7] The uniform binary search was presented to Donald Knuth in 1971 by A. K. 
Chandra of Stanford University and published in Knuth’s The Art of Computer Programming.[39] In 1986, Bernard Chazelle and Leonidas J. Guibas introduced fractional cascading, a technique used to speed up binary searches in multiple arrays.[16][42][43] Exponential search is an algorithm for searching primarily in infinite lists, but it can be applied to select the upper bound for binary search. It starts by finding the first element with an index that is both a power of two and greater than the target value. Afterwards, it sets that index as the upper bound, and switches to binary search. A search takes ⌊log x + 1⌋ iterations of the exponential search and at most ⌊log n⌋ iterations of the binary search, where x is the position of the target value. Only if the target value is near the beginning of the array is this variation more efficient than selecting the highest element as the upper bound.[35] Interpolation search Main article: Interpolation search Instead of merely calculating the midpoint, interpolation search attempts to calculate the position of the target value, taking into account the lowest and highest elements in the array and the length of the array. This is only possible if the array elements are numbers. It works on the basis that the midpoint is not the best guess in many cases; for example, if the target value is close to the highest element in the array, it is likely to be located near the end of the array.[36] When the distribution of the array elements is uniform or near uniform, it makes O(log log n) comparisons.[36][37][38] 6.1.6 Implementation issues Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky ... 
— Donald Knuth[2] When Jon Bentley assigned binary search as a problem in a course for professional programmers, he found that ninety percent failed to provide a correct solution after several hours of working on it,[44] and another study published in 1988 shows that accurate code for it is only found in five out of twenty textbooks.[45] Furthermore, BentIn practice, interpolation search is slower than binary ley’s own implementation of binary search, published in search for small arrays, as interpolation search requires his 1986 book Programming Pearls, contained an overextra computation, and the slower growth rate of its time flow error that remained undetected for over twenty years. 6.1. BINARY SEARCH ALGORITHM 161 The Java programming language library implementation of binary search had the same overflow bug for more than nine years.[46] In a practical implementation, the variables used to represent the indices will often be of fixed size, and this can result in an arithmetic overflow for very large arrays. If the midpoint of the span is calculated as (L + R) / 2, then the value of L + R may exceed the range of integers of the data type used to store the midpoint, even if L and R are within the range. 
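To make the overflow concrete, here is a minimal Python sketch. Python's own integers are arbitrary-precision and never overflow, so the sketch emulates a signed 32-bit variable; the helper names `wrap_int32` and `naive_midpoint` and the sample bounds are assumptions made purely for illustration and do not come from any library.

```python
def wrap_int32(x):
    """Reduce x to the range of a signed 32-bit integer (two's complement)."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def naive_midpoint(lo, hi):
    """Compute (L + R) / 2 as if the sum were stored in a signed 32-bit variable."""
    return wrap_int32(lo + hi) // 2


INT32_MAX = 2**31 - 1

# Hypothetical bounds: each one fits in 32 bits, but their sum does not.
lo, hi = 1_500_000_000, 2_000_000_000
assert lo <= INT32_MAX and hi <= INT32_MAX

mid = naive_midpoint(lo, hi)
print("emulated 32-bit midpoint:", mid)         # a negative, unusable index
print("true midpoint:           ", (lo + hi) // 2)
```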
If L and R are nonnegative, this can be avoided by calculating the midpoint as L + (R − L) / 2.[47]

If the target value is greater than the greatest value in the array, and the last index of the array is the maximum representable value of L, the value of L will eventually become too large and overflow. A similar problem will occur if the target value is smaller than the least value in the array and the first index of the array is the smallest representable value of R. In particular, this means that R must not be an unsigned type if the array starts with index 0.

An infinite loop may occur if the exit conditions for the loop are not defined correctly. Once L exceeds R, the search has failed and must convey the failure of the search. In addition, the loop must be exited when the target element is found, or in the case of an implementation where this check is moved to the end, checks for whether the search was successful or failed at the end must be in place. Bentley found that, in his assignment of binary search, this error was made by most of the programmers who failed to implement a binary search correctly.[7][48]

6.1.7 Library support

Many languages’ standard libraries include binary search routines:

• C provides the function bsearch() in its standard library, which is typically implemented via binary search (although the official standard does not require it).[49]

• C++'s STL provides the functions binary_search(), lower_bound(), upper_bound() and equal_range().[50]

• Java offers a set of overloaded binarySearch() static methods in the classes Arrays and Collections in the standard java.util package for performing binary searches on Java arrays and on Lists, respectively.[51][52]

• Microsoft's .NET Framework 2.0 offers static generic versions of the binary search algorithm in its collection base classes. An example would be System.Array’s method BinarySearch<T>(T[] array, T value).[53]

• Python provides the bisect module.[54]

• Ruby's Array class includes a bsearch method with built-in approximate matching.[55]

• Go's sort standard library package contains the functions Search, SearchInts, SearchFloat64s, and SearchStrings, which implement general binary search, as well as specific implementations for searching slices of integers, floating-point numbers, and strings, respectively.[56]

• For Objective-C, the Cocoa framework provides the NSArray -indexOfObject:inSortedRange:options:usingComparator: method in Mac OS X 10.6+.[57] Apple’s Core Foundation C framework also contains a CFArrayBSearchValues() function.[58]

6.1.8 See also

• Bisection method – the same idea used to solve equations in the real numbers

6.1.9 Notes and references

Notes

[1] This happens as binary search will not always divide the array perfectly. Take for example the array [1, 2 ... 16]. The first iteration will select the midpoint of 8. On the left subarray are seven elements, but on the right are eight. If the search takes the right path, there is a higher chance that the search will make the maximum number of comparisons.[11]

[2] A formal time performance analysis by Knuth showed that the average running time of this variation for a successful search is 17.5 log n + 17 units of time compared to 18 log n − 16 units for regular binary search. The time complexity for this variation grows slightly more slowly, but at the cost of higher initial complexity.[14]

[3] It is possible to perform hashing in guaranteed constant time.[18]

[4] The worst binary search tree for searching can be produced by inserting the values in sorted or near-sorted order or in an alternating lowest-highest record pattern.[23]

[5] Knuth 1998 performed a formal time performance analysis of both of these search algorithms.
On Knuth’s hypothetical MIX computer, intended to represent an ordinary computer, binary search takes on average 18 log n − 16 units of time for a successful search, while linear search with a sentinel node at the end of the list takes $1.75n + 8.5 - \frac{n \bmod 2}{4n}$ units. Linear search has lower initial complexity because it requires minimal computation, but it quickly outgrows binary search in complexity. On the MIX computer, binary search only outperforms linear search with a sentinel if n > 44.[11][27]

[6] This is because simply setting all of the bits which the hash functions point to for a specific key can affect queries for other keys which have a common hash location for one or more of the functions.[30]

[7] There exist improvements of the Bloom filter which improve on its complexity or support deletion; for example, the cuckoo filter exploits cuckoo hashing to gain these advantages.[30]

[8] That is, arrays of length 1, 3, 7, 15, 31 ...[40]

Citations

[1] Williams, Jr., Louis F. (1975). A modification to the half-interval search (binary search) method. Proceedings of the 14th ACM Southeast Conference. pp. 95–101. doi:10.1145/503561.503582.
[2] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Binary search”.
[3] Cormen et al. 2009, p. 39.
[4] Weisstein, Eric W. “Binary Search”. MathWorld.
[5] Flores, Ivan; Madpis, George (1971). “Average binary search length for dense ordered lists”. CACM. 14 (9): 602–603. doi:10.1145/362663.362752.
[6] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Algorithm B”.
[7] Bottenbruch, Hermann (1962). “Structure and Use of ALGOL 60”. Journal of the ACM. 9 (2): 161–211. Procedure is described at p. 214 (§43), titled “Program for Binary Search”.
[8] Sedgewick & Wayne 2011, §3.1, subsection “Rank and selection”.
[9] Goldman & Goldman 2008, pp. 461–463.
[10] Sedgewick & Wayne 2011, §3.1, subsection “Range queries”.
[11] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Further analysis of binary search”.
[12] Chang 2003, p. 169.
[13] Sloane, Neil. Table of n, 2^n for n = 0..1000. Part of OEIS A000079. Retrieved 30 April 2016.
[14] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Exercise 23”.
[15] Rolfe, Timothy J. (1997). “Analytic derivation of comparisons in binary search”. ACM SIGNUM Newsletter. 32 (4): 15–19. doi:10.1145/289251.289255.
[16] Chazelle, Bernard; Liu, Ding (2001). Lower bounds for intersection searching and fractional cascading in higher dimension. 33rd ACM Symposium on Theory of Computing. pp. 322–329. doi:10.1145/380752.380818.
[17] Knuth 1998, §6.4 (“Hashing”).
[18] Knuth 1998, §6.4 (“Hashing”), subsection “History”.
[19] Dietzfelbinger, Martin; Karlin, Anna; Mehlhorn, Kurt; Meyer auf der Heide, Friedhelm; Rohnert, Hans; Tarjan, Robert E. (August 1994). “Dynamic Perfect Hashing: Upper and Lower Bounds”. SIAM Journal on Computing. 23 (4): 738–761. doi:10.1137/S0097539791194094.
[20] Morin, Pat. “Hash Tables” (PDF). p. 1. Retrieved 28 March 2016.
[21] Beame, Paul; Fich, Faith E. (2001). “Optimal Bounds for the Predecessor Problem and Related Problems”. Journal of Computer and System Sciences. 65 (1): 38–72. doi:10.1006/jcss.2002.1822.
[22] Sedgewick & Wayne 2011, §3.2 (“Binary Search Trees”), subsection “Order-based methods and deletion”.
[23] Knuth 1998, §6.2.2 (“Binary tree searching”), subsection “But what about the worst case?”.
[24] Sedgewick & Wayne 2011, §3.5 (“Applications”), “Which symbol-table implementation should I use?”.
[25] Knuth 1998, §5.4.9 (“Disks and Drums”).
[26] Knuth 1998, §6.2.4 (“Multiway trees”).
[27] Knuth 1998, Answers to Exercises (§6.2.1) for “Exercise 5”.
[28] Knuth 1998, §6.2.1 (“Searching an ordered table”).
[29] Sedgewick & Wayne 2011, §3.2 (“Ordered symbol tables”).
[30] Fan, Bin; Andersen, Dave G.; Kaminsky, Michael; Mitzenmacher, Michael D. (2014). Cuckoo Filter: Practically Better Than Bloom. Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies. pp. 75–88. doi:10.1145/2674005.2674994.
[31] Bloom, Burton H. (1970). “Space/time Trade-offs in Hash Coding with Allowable Errors”. CACM. 13 (7): 422–426. doi:10.1145/362686.362692.
[32] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “An important variation”.
[33] Kiefer, J. (1953). “Sequential Minimax Search for a Maximum”. Proceedings of the American Mathematical Society. 4 (3): 502–506. doi:10.2307/2032161. JSTOR 2032161.
[34] Hassin, Refael (1981). “On Maximizing Functions by Fibonacci Search”. Fibonacci Quarterly. 19: 347–351.
[35] Moffat & Turpin 2002, p. 33.
[36] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Interpolation search”.
[37] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “Exercise 22”.
[38] Perl, Yehoshua; Itai, Alon; Avni, Haim (1978). “Interpolation search—a log log n search”. CACM. 21 (7): 550–553. doi:10.1145/359545.359557.
[39] Knuth 1998, §6.2.1 (“Searching an ordered table”), subsection “History and bibliography”.
[40] “2^n − 1”. OEIS A000225. Retrieved 7 May 2016.
[41] Lehmer, Derrick (1960). Teaching combinatorial tricks to a computer. Proceedings of Symposia in Applied Mathematics. 10. pp. 180–181. doi:10.1090/psapm/010.
[42] Chazelle, Bernard; Guibas, Leonidas J. (1986). “Fractional cascading: I. A data structuring technique” (PDF). Algorithmica. 1 (1): 133–162. doi:10.1007/BF01840440.
[43] Chazelle, Bernard; Guibas, Leonidas J. (1986). “Fractional cascading: II. Applications” (PDF). Algorithmica. 1 (1): 163–191. doi:10.1007/BF01840441.
[44] Bentley 2000, §4.1 (“The Challenge of Binary Search”).
[45] Pattis, Richard E. (1988). “Textbook errors in binary searching”. SIGCSE Bulletin. 20: 190–194. doi:10.1145/52965.53012.
[46] Bloch, Joshua (2 June 2006). “Extra, Extra – Read All About It: Nearly All Binary Searches and Mergesorts are Broken”. Google Research Blog. Retrieved 21 April 2016.
[47] Ruggieri, Salvatore (2003). “On computing the semi-sum of two integers” (PDF). Information Processing Letters. 87 (2): 67–71. doi:10.1016/S0020-0190(03)00263-1.
[48] Bentley 2000, §4.4 (“Principles”).
[49] “bsearch – binary search a sorted table”. The Open Group Base Specifications (7th ed.). The Open Group. 2013. Retrieved 28 March 2016.
[50] Stroustrup 2013, §32.6.1 (“Binary Search”).
[51] “java.util.Arrays”. Java Platform Standard Edition 8 Documentation. Oracle Corporation. Retrieved 1 May 2016.
[52] “java.util.Collections”. Java Platform Standard Edition 8 Documentation. Oracle Corporation. Retrieved 1 May 2016.
[53] “List<T>.BinarySearch Method (T)”. Microsoft Developer Network. Retrieved 10 April 2016.
[54] “8.5. bisect — Array bisection algorithm”. The Python Standard Library. Python Software Foundation. Retrieved 10 April 2016.
[55] Fitzgerald 2007, p. 152.
[56] “Package sort”. The Go Programming Language. Retrieved 28 April 2016.
[57] “NSArray”. Mac Developer Library. Apple Inc. Retrieved 1 May 2016.
[58] “CFArray”. Mac Developer Library. Apple Inc. Retrieved 1 May 2016.

Works

• Alexandrescu, Andrei (2010). The D Programming Language. Upper Saddle River, NJ: Addison-Wesley Professional. ISBN 0-321-63536-1.
• Bentley, Jon (2000) [1986]. Programming Pearls (2nd ed.). Addison-Wesley. ISBN 0-201-65788-0.
• Chang, Shi-Kuo (2003). Data Structures and Algorithms. Software Engineering and Knowledge Engineering. 13. Singapore: World Scientific. ISBN 978-981-238-348-8.
• Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.
• Fitzgerald, Michael (2007). Ruby Pocket Reference. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-2601-7.
• Goldman, Sally A.; Goldman, Kenneth J. (2008). A Practical Guide to Data Structures and Algorithms using Java. Boca Raton: CRC Press. ISBN 978-1-58488-455-2.
• Knuth, Donald (1998). Sorting and Searching.
The Art of Computer Programming. 3 (2nd ed.). Reading, MA: Addison-Wesley Professional. • Leiss, Ernst (2007). A Programmer’s Companion to Algorithm Analysis. Boca Raton, FL: CRC Press. ISBN 1-58488-673-0. • Moffat, Alistair; Turpin, Andrew (2002). Compression and Coding Algorithms. Hamburg, Germany: Kluwer Academic Publishers. doi:10.1007/978-14615-0935-6. ISBN 978-0-7923-7668-2. • Sedgewick, Robert; Wayne, Kevin (2011). Algorithms (4th ed.). Upper Saddle River, NJ: Addison-Wesley Professional. ISBN 978-0-32157351-3. Condensed web version: ; book version . • Stroustrup, Bjarne (2013). The C++ Programming Language (4th ed.). Upper Saddle River, NJ: Addison-Wesley Professional. ISBN 978-0-32156384-2. 6.1.10 External links • NIST Dictionary of Algorithms and Data Structures: binary search 164 CHAPTER 6. SUCCESSORS AND NEIGHBORS the left sub-tree, and smaller than all keys in the right subtree.[1]:287 (The leaves (final nodes) of the tree contain no key and have no structure to distinguish them from one another. Leaves are commonly represented by a special leaf or nil symbol, a NULL pointer, etc.) 8 3 1 10 6 4 14 7 13 A binary search tree of size 9 and depth 3, with 8 at the root. The leaves are not drawn. 6.2 Binary search tree In computer science, binary search trees (BST), sometimes called ordered or sorted binary trees, are a particular type of containers: data structures that store “items” (such as numbers, names etc.) in memory. They allow fast lookup, addition and removal of items, and can be used to implement either dynamic sets of items, or lookup tables that allow finding an item by its key (e.g., finding the phone number of a person by name). 
Binary search trees keep their keys in sorted order, so that lookup and other operations can use the principle of binary search: when looking for a key in a tree (or a place to insert a new key), they traverse the tree from root to leaf, making comparisons to keys stored in the nodes of the tree and deciding, based on the comparison, to continue searching in the left or right subtrees. On average, this means that each comparison allows the operations to skip about half of the tree, so that each lookup, insertion or deletion takes time proportional to the logarithm of the number of items stored in the tree. This is much better than the linear time required to find items by key in an (unsorted) array, but slower than the corresponding operations on hash tables. Generally, the information represented by each node is a record rather than a single data element. However, for sequencing purposes, nodes are compared according to their keys rather than any part of their associated records. The major advantage of binary search trees over other data structures is that the related sorting algorithms and search algorithms such as in-order traversal can be very efficient; they are also easy to code. Binary search trees are a fundamental data structure used to construct more abstract data structures such as sets, multisets, and associative arrays. Some of their disadvantages are as follows: • The shape of the binary search tree depends entirely on the order of insertions and deletions, and can become degenerate. • When inserting or searching for an element in a binary search tree, the key of each visited node has to be compared with the key of the element to be inserted or found. • The keys in the binary search tree may be long and the run time may increase. • After a long intermixed sequence of random insertion and deletion, the expected height of the tree approaches square root of the number of keys, √n, which grows much faster than log n. 
Order relation Binary search requires an order relation by which every element (item) can be compared with every other element in the sense of a total preorder. The part of the element which effectively takes place in the comparison is called Several variants of the binary search tree have been stud- its key. Whether duplicates, i.e. different elements with ied in computer science; this article deals primarily with same key, shall be allowed in the tree or not, does not the basic type, making references to more advanced types depend on the order relation, but on the application only. when appropriate. In the context of binary search trees a total preorder is realized most flexibly by means of a three-way comparison subroutine. 6.2.1 Definition A binary search tree is a rooted binary tree, whose internal nodes each store a key (and optionally, an associated value) and each have two distinguished sub-trees, commonly denoted left and right. The tree additionally satisfies the binary search tree property, which states that the key in each node must be greater than all keys stored in 6.2.2 Operations Binary search trees support three main operations: insertion of elements, deletion of elements, and lookup (checking whether a key is present). 6.2. BINARY SEARCH TREE Searching 165 Here’s how a typical binary search tree insertion might be performed in a binary tree in C++: Searching a binary search tree for a specific key can be a void insert(Node*& root, int key, int value) { if (!root) programmed recursively or iteratively. root = new Node(key, value); else if (key < root->key) We begin by examining the root node. If the tree is null, insert(root->left, key, value); else // key >= root->key the key we are searching for does not exist in the tree. insert(root->right, key, value); } Otherwise, if the key equals that of the root, the search is successful and we return the node. 
If the key is less than that of the root, we search the left subtree. Similarly, if the key is greater than that of the root, we search the right subtree. This process is repeated until the key is found or the remaining subtree is null. If the searched key is not found after a null subtree is reached, then the key is not present in the tree. This is easily expressed as a recursive algorithm (implemented in Python):

    def search_recursively(key, node):
        if node is None or node.key == key:
            return node
        elif key < node.key:
            return search_recursively(key, node.left)
        else:  # key > node.key
            return search_recursively(key, node.right)

The same algorithm can be implemented iteratively:

    def search_iteratively(key, node):
        current_node = node
        while current_node is not None:
            if key == current_node.key:
                return current_node
            elif key < current_node.key:
                current_node = current_node.left
            else:  # key > current_node.key
                current_node = current_node.right
        return None

These two examples rely on the order relation being a total order. If the order relation is only a total preorder, a reasonable extension of the functionality is the following: in case of equality, also search down to the leaves in a direction specifiable by the user. A binary tree sort equipped with such a comparison function becomes stable.

Because in the worst case this algorithm must search from the root of the tree to the leaf farthest from the root, the search operation takes time proportional to the tree’s height (see tree terminology). On average, binary search trees with n nodes have O(log n) height.[note 1] However, in the worst case, binary search trees can have O(n) height, when the unbalanced tree resembles a linked list (degenerate tree).

Insertion

Insertion begins as a search would begin; if the key is not equal to that of the root, we search the left or right subtrees as before. Eventually, we will reach an external node and add the new key-value pair (here encoded as a record 'newNode') as its right or left child, depending on the node’s key. In other words, we examine the root and recursively insert the new node to the left subtree if its key is less than that of the root, or the right subtree if its key is greater than or equal to the root.

The destructive procedural variant of insertion modifies the tree in place. It uses only constant heap space (and the iterative version uses constant stack space as well), but the prior version of the tree is lost. Alternatively, as in the following Python example, we can reconstruct all ancestors of the inserted node; any reference to the original tree root remains valid, making the tree a persistent data structure:

    def binary_tree_insert(node, key, value):
        if node is None:
            return NodeTree(None, key, value, None)
        if key == node.key:
            return NodeTree(node.left, key, value, node.right)
        if key < node.key:
            return NodeTree(binary_tree_insert(node.left, key, value),
                            node.key, node.value, node.right)
        else:
            return NodeTree(node.left, node.key, node.value,
                            binary_tree_insert(node.right, key, value))

The part that is rebuilt uses O(log n) space in the average case and O(n) in the worst case.

In either version, this operation requires time proportional to the height of the tree in the worst case, which is O(log n) time in the average case over all trees, but O(n) time in the worst case.

Another way to explain insertion is that in order to insert a new node in the tree, its key is first compared with that of the root. If its key is less than the root’s, it is then compared with the key of the root’s left child. If its key is greater, it is compared with the root’s right child. This process continues until the new node is compared with a leaf node, and then it is added as this node’s right or left child, depending on its key: if the key is less than the leaf’s key, then it is inserted as the leaf’s left child, otherwise as the leaf’s right child.

There are other ways of inserting nodes into a binary tree, but this is the only way of inserting nodes at the leaves and at the same time preserving the BST structure.

Deletion

There are three possible cases to consider:

• Deleting a node with no children: simply remove the node from the tree.

• Deleting a node with one child: remove the node and replace it with its child.

• Deleting a node with two children: call the node to be deleted N. Do not delete N. Instead, choose either its in-order successor node or its in-order predecessor node, R. Copy the value of R to N, then recursively call delete on R until reaching one of the first two cases. If the in-order successor is chosen, then, because the right subtree of N is not empty (N has two children), the successor is the node with the least value in that right subtree. That node has at most one subtree, so deleting it falls into one of the first two cases.

Broadly speaking, nodes with children are harder to delete. For a node with two children, the in-order successor is the left-most node of its right subtree, and the in-order predecessor is the right-most node of its left subtree. In either case, this node will have zero or one children. Delete it according to one of the two simpler cases above.

[Figure: Deleting a node with two children from a binary search tree. First the rightmost node in the left subtree, the in-order predecessor (6), is identified. Its value is copied into the node being deleted. The in-order predecessor can then be easily deleted because it has at most one child. The same method works symmetrically using the in-order successor (9).]
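The three deletion cases can also be sketched in the same persistent style as binary_tree_insert. The following is only an illustration under stated assumptions: NodeTree is assumed to be a named tuple with fields (left, key, value, right), matching the argument order used above, and persistent_delete, find_min, leaf and in_order_keys are hypothetical helper names that do not come from the text.

```python
from collections import namedtuple

# Assumed node layout; the field order mirrors the NodeTree(...) calls above.
NodeTree = namedtuple("NodeTree", ["left", "key", "value", "right"])


def find_min(node):
    """Return the left-most node of a non-empty subtree."""
    while node.left is not None:
        node = node.left
    return node


def persistent_delete(node, key):
    """Return the root of a tree without `key`; the input tree is not modified."""
    if node is None:
        return None  # key not present: nothing to delete
    if key < node.key:
        return NodeTree(persistent_delete(node.left, key),
                        node.key, node.value, node.right)
    if key > node.key:
        return NodeTree(node.left, node.key, node.value,
                        persistent_delete(node.right, key))
    # This node holds the key.
    if node.left is None:
        return node.right  # no children, or only a right child
    if node.right is None:
        return node.left   # only a left child
    # Two children: copy the in-order successor here, then delete it below.
    successor = find_min(node.right)
    return NodeTree(node.left, successor.key, successor.value,
                    persistent_delete(node.right, successor.key))


def in_order_keys(node):
    """Collect keys in sorted order (in-order traversal)."""
    if node is None:
        return []
    return in_order_keys(node.left) + [node.key] + in_order_keys(node.right)


def leaf(key):
    return NodeTree(None, key, None, None)


# A small tree whose root (7) has two children.
old_root = NodeTree(NodeTree(leaf(3), 5, None, leaf(6)), 7, None,
                    NodeTree(None, 9, None, leaf(10)))
new_root = persistent_delete(old_root, 7)
```

Here new_root takes the successor key 9 at the top, while old_root still describes the original six-key tree, because only the nodes on the path to the deleted key are rebuilt.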
Consistently using the in-order successor or the in-order predecessor for every instance of the two-child case can lead to an unbalanced tree, so some implementations select one or the other at different times.

Runtime analysis: Although this operation does not always traverse the tree down to a leaf, this is always a possibility; thus in the worst case it requires time proportional to the height of the tree. It does not require more even when the node has two children, since it still follows a single path and does not visit any node twice.

    def find_min(self):
        # Gets the minimum node in a subtree
        current_node = self
        while current_node.left_child:
            current_node = current_node.left_child
        return current_node

    def replace_node_in_parent(self, new_value=None):
        # If this node is the tree root (no parent), the caller must
        # update its own reference to the root.
        if self.parent:
            if self is self.parent.left_child:
                self.parent.left_child = new_value
            else:
                self.parent.right_child = new_value
        if new_value:
            new_value.parent = self.parent

    def binary_tree_delete(self, key):
        if key < self.key:
            if self.left_child:  # otherwise the key is not in the tree
                self.left_child.binary_tree_delete(key)
        elif key > self.key:
            if self.right_child:  # otherwise the key is not in the tree
                self.right_child.binary_tree_delete(key)
        else:  # delete the key here
            if self.left_child and self.right_child:  # both children are present
                successor = self.right_child.find_min()
                self.key = successor.key
                successor.binary_tree_delete(successor.key)
            elif self.left_child:  # the node has only a *left* child
                self.replace_node_in_parent(self.left_child)
            elif self.right_child:  # the node has only a *right* child
                self.replace_node_in_parent(self.right_child)
            else:  # this node has no children
                self.replace_node_in_parent(None)

Traversal

Main article: Tree traversal

Once the binary search tree has been created, its elements can be retrieved in-order by recursively traversing the left subtree of the root node, accessing the node itself, then recursively traversing the right subtree of the node, continuing this pattern with each node in the tree as it is recursively accessed. As with all binary trees, one may conduct a pre-order traversal or a post-order traversal, but neither is likely to be useful for binary search trees. An in-order traversal of a binary search tree will always result in a sorted list of node items (numbers, strings or other comparable items).

The code for in-order traversal in Python is given below. It will call callback (some function the programmer wishes to call on the node’s value, such as printing to the screen) for every node in the tree.

    def traverse_binary_tree(node, callback):
        if node is None:
            return
        traverse_binary_tree(node.leftChild, callback)
        callback(node.value)
        traverse_binary_tree(node.rightChild, callback)

Traversal requires O(n) time, since it must visit every node. This algorithm is also O(n), so it is asymptotically optimal.

Traversal can also be implemented iteratively. For certain applications, e.g. greater-or-equal search or approximate search, an operation for single-step (iterative) traversal can be very useful. This is, of course, implemented without the callback construct. A single step takes O(1) time on average and, in the worst case, time proportional to the height of the tree (O(log n) if the tree is balanced).

Verification

Sometimes we already have a binary tree, and we need to determine whether it is a BST. This problem has a simple recursive solution.

The BST property is the key to deciding whether a tree is a BST: every node in the right subtree has to be larger than the current node, and every node in the left subtree has to be smaller than the current node. (Allowing "or equal" should not be necessary, since only unique values should be in the tree; it would also raise the question of whether equal nodes belong to the left or the right of their parent.) The greedy algorithm, which simply traverses the tree and checks at every node whether the node contains a value larger than the value at the left child and smaller than the value at the right child, does not work for all cases.
Consider the following tree:

        20
       /  \
     10    30
          /  \
         5    40

In the tree above, each node meets the condition that it contains a value larger than its left child and smaller than its right child, and yet it is not a BST: the value 5 is in the right subtree of the node containing 20, a violation of the BST property.

Instead of making a decision based solely on the values of a node and its children, we also need information flowing down from the parent. In the case of the tree above, if we could remember the node containing the value 20, we would see that the node with value 5 violates the BST property.

So the condition we need to check at each node is:

• if the node is the left child of its parent, then it must be smaller than (or equal to) the parent, and it must pass down the value from its parent to its right subtree to make sure none of the nodes in that subtree is greater than the parent

• if the node is the right child of its parent, then it must be larger than the parent, and it must pass down the value from its parent to its left subtree to make sure none of the nodes in that subtree is less than the parent.

A recursive solution in C++ can explain this further:

    #include <climits>
    #include <cstdio>

    struct TreeNode {
        int key;
        int value;
        struct TreeNode *left;
        struct TreeNode *right;
    };

    // Bounds are inclusive, so duplicate keys are accepted on either side.
    bool isBST(struct TreeNode *node, int minKey, int maxKey) {
        if (node == NULL) return true;
        if (node->key < minKey || node->key > maxKey) return false;
        return isBST(node->left, minKey, node->key) &&
               isBST(node->right, node->key, maxKey);
    }

The initial call to this function can be something like this:

    if (isBST(root, INT_MIN, INT_MAX)) {
        puts("This is a BST.");
    } else {
        puts("This is NOT a BST!");
    }

Essentially we keep creating a valid range (starting from [MIN_VALUE, MAX_VALUE]) and keep shrinking it for each node as we go down recursively.

As pointed out in the Traversal section, an in-order traversal of a binary search tree returns the nodes sorted. Thus we only need to keep the last visited node while traversing the tree and check whether its key is smaller (or smaller or equal, if duplicates are to be allowed in the tree) compared to the current key.

6.2.3 Examples of applications

Some examples shall illustrate the use of the above basic building blocks.

Sort

Main article: Tree sort

A binary search tree can be used to implement a simple sorting algorithm. Similar to heapsort, we insert all the values we wish to sort into a new ordered data structure (in this case a binary search tree) and then traverse it in order.

The worst-case time of build_binary_tree is O(n^2): if you feed it a sorted list of values, it chains them into a linked list with no left subtrees. For example, build_binary_tree([1, 2, 3, 4, 5]) yields the tree (1 (2 (3 (4 (5))))).

There are several schemes for overcoming this flaw with simple binary trees; the most common is the self-balancing binary search tree. If this same procedure is done using such a tree, the overall worst-case time is O(n log n), which is asymptotically optimal for a comparison sort. In practice, the added overhead in time and space for a tree-based sort (particularly for node allocation) makes it inferior to other asymptotically optimal sorts such as heapsort for static list sorting. On the other hand, it is one of the most efficient methods of incremental sorting, adding items to a list over time while keeping the list sorted at all times.

Priority queue operations

Binary search trees can serve as priority queues: structures that allow insertion of an arbitrary key as well as lookup and deletion of the minimum (or maximum) key. Insertion works as previously explained. Find-min walks the tree, following left pointers as far as it can without hitting a leaf:

    // Precondition: T is not a leaf
    function find-min(T):
        while hasLeft(T):
            T ← left(T)
        return key(T)

Find-max is analogous: follow right pointers as far as possible. Delete-min (max) can simply look up the minimum (maximum), then delete it. This way, insertion and deletion both take logarithmic time in a balanced tree, just as they do in a binary heap, but unlike a binary heap and most other priority queue implementations, a single tree can support all of find-min, find-max, delete-min and delete-max at the same time, making binary search trees suitable as double-ended priority queues.[2]:156

6.2.4 Types

There are many types of binary search trees. AVL trees and red-black trees are both forms of self-balancing binary search trees. A splay tree is a binary search tree that automatically moves frequently accessed elements nearer to the root. In a treap (tree heap), each node also holds a (randomly chosen) priority and the parent node has higher priority than its children. Tango trees are trees optimized for fast searches. T-trees are binary search trees optimized to reduce storage space overhead, widely used for in-memory databases.

A degenerate tree is a tree where for each parent node, there is only one associated child node. It is unbalanced and, in the worst case, performance degrades to that of a linked list. If the insertion function does not handle re-balancing, then a degenerate tree can easily be constructed by feeding it data that is already sorted. In a performance measurement, such a tree will essentially behave like a linked list.
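The effect of sorted input can be checked with a small experiment. The text refers to build_binary_tree without showing it, so the version below is only a guess at its shape: Node, insert, height and tree_sort are hypothetical names, and insertion is the plain unbalanced leaf insertion described earlier.

```python
class Node:
    """Plain BST node without balancing information (illustrative only)."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key at a leaf position; equal keys go to the right subtree."""
    if root is None:
        return Node(key)
    current = root
    while True:
        if key < current.key:
            if current.left is None:
                current.left = Node(key)
                return root
            current = current.left
        else:
            if current.right is None:
                current.right = Node(key)
                return root
            current = current.right


def build_binary_tree(values):
    """Assumed shape of the build_binary_tree mentioned in the text."""
    root = None
    for value in values:
        root = insert(root, value)
    return root


def height(node):
    """Number of edges on the longest root-to-leaf path (-1 for an empty tree)."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))


def tree_sort(values):
    """Sort by building a BST and reading it back with an in-order traversal."""
    result = []

    def visit(node):
        if node is not None:
            visit(node.left)
            result.append(node.key)
            visit(node.right)

    visit(build_binary_tree(values))
    return result


degenerate = build_binary_tree([1, 2, 3, 4, 5])  # sorted input
mixed = build_binary_tree([3, 1, 4, 2, 5])       # same keys, different order
```

With sorted input every node has only a right child, so height(degenerate) equals the number of nodes minus one, whereas the second insertion order gives a much shallower tree for the same keys.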
[Figure: Tree rotations are very common internal operations in binary trees to keep perfect, or near-to-perfect, internal balance in the tree.]

Performance comparisons

D. A. Heger (2004)[3] presented a performance comparison of binary search trees. Treap was found to have the best average performance, while red-black tree was found to have the smallest amount of performance variations.

Optimal binary search trees

Main article: Optimal binary search tree

If we do not plan on modifying a search tree, and we know exactly how often each item will be accessed, we can construct[4] an optimal binary search tree, which is a search tree where the average cost of looking up an item (the expected search cost) is minimized.

Even if we only have estimates of the search costs, such a system can considerably speed up lookups on average. For example, if you have a BST of English words used in a spell checker, you might balance the tree based on word frequency in text corpora, placing words like "the" near the root and words like "agerasia" near the leaves. Such a tree might be compared with Huffman trees, which similarly seek to place frequently used items near the root in order to produce a dense information encoding; however, Huffman trees store data elements only in leaves, and these elements need not be ordered.

If we do not know the sequence in which the elements in the tree will be accessed in advance, we can use splay trees, which are asymptotically as good as any static search tree we can construct for any particular sequence of lookup operations.

Alphabetic trees are Huffman trees with the additional constraint on order, or, equivalently, search trees with the modification that all elements are stored in the leaves. Faster algorithms exist for optimal alphabetic binary trees (OABTs).

6.2.5 See also

• Search tree
• Binary search algorithm
• Randomized binary search tree
• Tango trees
• Self-balancing binary search tree
• Geometry of binary search trees
• Red-black tree
• AVL trees
• Day–Stout–Warren algorithm

6.2.6 Notes

[1] The notion of an average BST is made precise as follows. Let a random BST be one built using only insertions out of a sequence of unique elements in random order (all permutations equally likely); then the expected height of the tree is O(log n). If deletions are allowed as well as insertions, “little is known about the average height of a binary search tree”.[1]:300

6.2.7 References

[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.

[2] Mehlhorn, Kurt; Sanders, Peter (2008). Algorithms and Data Structures: The Basic Toolbox (PDF). Springer.

[3] Heger, Dominique A. (2004), “A Disquisition on The Performance Behavior of Binary Search Tree Data Structures” (PDF), European Journal for the Informatics Professional, 5 (5): 67–75.

[4] Gonnet, Gaston. “Optimal Binary Search Trees”. Scientific Computation. ETH Zürich. Retrieved 1 December 2013.

6.2.8 Further reading

• This article incorporates public domain material from the NIST document: Black, Paul E. “Binary Search Tree”. Dictionary of Algorithms and Data Structures.

• Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). “12: Binary search trees, 15.5: Optimal binary search trees”. Introduction to Algorithms (2nd ed.). MIT Press & McGraw-Hill. pp. 253–272, 356–363. ISBN 0-262-03293-7.

• Jarc, Duane J. (3 December 2005). “Binary Tree Traversals”. Interactive Data Structure Visualizations. University of Maryland.

• Knuth, Donald (1997). “6.2.2: Binary Tree Searching”. The Art of Computer Programming. 3: “Sorting and Searching” (3rd ed.). Addison-Wesley. pp. 426–458. ISBN 0-201-89685-0.

• Long, Sean. “Binary Search Tree” (PPT). Data Structures and Algorithms Visualization-A PowerPoint Slides Based Approach. SUNY Oneonta.

• Parlante, Nick (2001). “Binary Trees”. CS Education Library. Stanford University.

6.2.9 External links

• Literate implementations of binary search trees in various languages on LiteratePrograms

• Binary Tree Visualizer (JavaScript animation of various BT-based data structures)

• Kovac, Kubo. “Binary Search Trees” (Java applet). Korešpondenčný seminár z programovania.

• Madru, Justin (18 August 2009). “Binary Search Tree”. JDServer. C++ implementation.

• Binary Search Tree Example in Python

• “References to Pointers (C++)”. MSDN. Microsoft. 2005. Gives an example binary tree implementation.

6.3 Random binary tree

In computer science and probability theory, a random binary tree refers to a binary tree selected at random from some probability distribution on binary trees. Two different distributions are commonly used: binary trees formed by inserting nodes one at a time according to a random permutation, and binary trees chosen from a uniform discrete distribution in which all distinct trees are equally likely. It is also possible to form other distributions, for instance by repeated splitting. Adding and removing nodes directly in a random binary tree will in general disrupt its random structure, but the treap and related randomized binary search tree data structures use the principle of binary trees formed from a random permutation in order to maintain a balanced binary search tree dynamically as nodes are inserted and deleted.

For random trees that are not necessarily binary, see random tree.

6.3.1 Binary trees from random permutations

For any set of numbers (or, more generally, values from some total order), one may form a binary search tree in which each number is inserted in sequence as a leaf of the tree, without changing the structure of the previously inserted numbers. The position into which each number should be inserted is uniquely determined by a binary search in the tree formed by the previous numbers. For instance, if the three numbers (1,3,2) are inserted into a tree in that sequence, the number 1 will sit at the root of the tree, the number 3 will be placed as its right child, and the number 2 as the left child of the number 3. There are six different permutations of the numbers (1,2,3), but only five trees may be constructed from them. That is because the permutations (2,1,3) and (2,3,1) form the same tree.

Expected depth of a node

For any fixed choice of a value x in a given set of n numbers, if one randomly permutes the numbers and forms a binary tree from them as described above, the expected value of the length of the path from the root of the tree to x is at most 2 log n + O(1), where “log” denotes the natural logarithm function and the O introduces big O notation. This holds because the expected number of ancestors of x is, by linearity of expectation, equal to the sum, over all other values y in the set, of the probability that y is an ancestor of x. A value y is an ancestor of x exactly when y is the first element to be inserted from the elements in the interval [x,y]. Thus, the values that are adjacent to x in the sorted sequence of values have probability 1/2 of being an ancestor of x, the values one step away have probability 1/3, etc. Adding these probabilities for all positions in the sorted sequence gives twice a harmonic number, leading to the bound above. A bound of this form holds also for the expected search length of a path to a fixed value x that is not part of the given set.[1]

The longest path

Although not as easy to analyze as the average path length, there has also been much research on determining the expectation (or high probability bounds) of the length of the longest path in a binary search tree generated from a random insertion order.
It is now known that this length, for a tree with n nodes, is almost surely

$$\frac{1}{\beta} \log n \approx 4.311 \log n,$$

where β is the unique number in the range 0 < β < 1 satisfying the equation

$$2\beta e^{1-\beta} = 1.$$ [2]

Expected number of leaves

In the random permutation model, each of the numbers from the set of numbers used to form the tree, except for the smallest and largest of the numbers, has probability 1/3 of being a leaf in the tree, for it is a leaf when it is inserted after its two neighbors, and any of the six permutations of these two neighbors and it are equally likely. By similar reasoning, the smallest and largest of the numbers have probability 1/2 of being a leaf. Therefore, the expected number of leaves is the sum of these probabilities, which for n ≥ 2 is exactly (n + 1)/3.

Treaps and randomized binary search trees

In applications of binary search tree data structures, it is rare for the values in the tree to be inserted without deletion in a random order, limiting the direct applications of random binary trees. However, algorithm designers have devised data structures that allow insertions and deletions to be performed in a binary search tree, at each step maintaining as an invariant the property that the shape of the tree is a random variable with the same distribution as a random binary search tree.

If a given set of ordered numbers is assigned numeric priorities (distinct numbers unrelated to their values), these priorities may be used to construct a Cartesian tree for the numbers, a binary tree that has as its inorder traversal sequence the sorted sequence of the numbers and that is heap-ordered by priorities. Although more efficient construction algorithms are known, it is helpful to think of a Cartesian tree as being constructed by inserting the given numbers into a binary search tree in priority order. Thus, by choosing the priorities either to be a set of independent random real numbers in the unit interval, or by choosing them to be a random permutation of the numbers from 1 to n (where n is the number of nodes in the tree), and by maintaining the heap ordering property using tree rotations after any insertion or deletion of a node, it is possible to maintain a data structure that behaves like a random binary search tree. Such a data structure is known as a treap or a randomized binary search tree.[3]

6.3.2 Uniformly random binary trees

The number of binary trees with n nodes is a Catalan number: for n = 1, 2, 3, ... these numbers of trees are

1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, … (sequence A000108 in the OEIS).

Thus, if one of these trees is selected uniformly at random, its probability is the reciprocal of a Catalan number. Trees in this model have expected depth proportional to the square root of n, rather than to the logarithm;[4] however, the Strahler number of a uniformly random binary tree, a more sensitive measure of the distance from a leaf in which a node has Strahler number i whenever it has either a child with that number or two children with number i − 1, is with high probability logarithmic.[5]

Due to their large heights, this model of equiprobable random trees is not generally used for binary search trees, but it has been applied to problems of modeling the parse trees of algebraic expressions in compiler design[6] (where the above-mentioned bound on Strahler number translates into the number of registers needed to evaluate an expression[7]) and for modeling evolutionary trees.[8] In some cases the analysis of random binary trees under the random permutation model can be automatically transferred to the uniform model.[9]

6.3.3 Random split trees

Devroye & Kruszewski (1996) generate random binary trees with n nodes by generating a real-valued random variable x in the unit interval (0,1), assigning the first xn nodes (rounded down to an integer number of nodes) to the left subtree, the next node to the root, and the remaining nodes to the right subtree, and continuing recursively in each subtree. If x is chosen uniformly at random in the interval, the result is the same as the random binary search tree generated by a random permutation of the nodes, as any node is equally likely to be chosen as root; however, this formulation allows other distributions to be used instead. For instance, in the uniformly random binary tree model, once a root is fixed each of its two subtrees must also be uniformly random, so the uniformly random model may also be generated by a different choice of distribution for x. As Devroye and Kruszewski show, by choosing a beta distribution on x and by using an appropriate choice of shape to draw each of the branches, the mathematical trees generated by this process can be used to create realistic-looking botanical trees.

6.3.4 Notes

[1] Hibbard (1962); Knuth (1973); Mahmoud (1992), p. 75.

[2] Robson (1979); Pittel (1985); Devroye (1986); Mahmoud (1992), pp. 91–99; Reed (2003).

[3] Martinez & Roura (1998); Seidel & Aragon (1996).

[4] Knuth (2005), p. 15.

[5] Devroye & Kruszewski (1995). That it is at most logarithmic is trivial, because the Strahler number of every tree is bounded by the logarithm of the number of its nodes.

[6] Mahmoud (1992), p. 63.

[7] Flajolet, Raoult & Vuillemin (1979).

[8] Aldous (1996).

[9] Mahmoud (1992), p. 70.

6.3.5 References

• Aldous, David (1996), “Probability distributions on cladograms”, in Aldous, David; Pemantle, Robin, Random Discrete Structures, The IMA Volumes in Mathematics and its Applications, 76, Springer-Verlag, pp. 1–18.

• Devroye, Luc (1986), “A note on the height of binary search trees”, Journal of the ACM, 33 (3): 489–498, doi:10.1145/5925.5930.

• Devroye, Luc; Kruszewski, Paul (1995), “A note on the Horton-Strahler number for random trees”, Information Processing Letters, 56 (2): 95–99, doi:10.1016/0020-0190(95)00114-R.

• Devroye, Luc; Kruszewski, Paul (1996), “The botanical beauty of random binary trees”, in Brandenburg, Franz J., Graph Drawing: 3rd Int. Symp., GD'95, Passau, Germany, September 20-22, 1995, Lecture Notes in Computer Science, 1027, Springer-Verlag, pp. 166–177, doi:10.1007/BFb0021801, ISBN 3-540-60723-4.

• Drmota, Michael (2009), Random Trees: An Interplay between Combinatorics and Probability, Springer-Verlag, ISBN 978-3-211-75355-2.

• Flajolet, P.; Raoult, J. C.; Vuillemin, J. (1979), “The number of registers required for evaluating arithmetic expressions”, Theoretical Computer Science, 9 (1): 99–125, doi:10.1016/0304-3975(79)90009-4.

• Hibbard, Thomas N. (1962), “Some combinatorial properties of certain trees with applications to searching and sorting”, Journal of the ACM, 9 (1): 13–28, doi:10.1145/321105.321108.

• Knuth, Donald E. (1973), “6.2.2 Binary Tree Searching”, The Art of Computer Programming, III, Addison-Wesley, pp. 422–451.

• Knuth, Donald E. (2005), “Draft of Section 7.2.1.6: Generating All Trees”, The Art of Computer Programming, IV.

• Mahmoud, Hosam M. (1992), Evolution of Random Search Trees, John Wiley & Sons.

• Martinez, Conrado; Roura, Salvador (1998), “Randomized binary search trees”, Journal of the ACM, ACM Press, 45 (2): 288–323, doi:10.1145/274787.274812.

• Pittel, B. (1985), “Asymptotical growth of a class of random trees”, Annals of Probability, 13 (2): 414–427, doi:10.1214/aop/1176993000.

• Reed, Bruce (2003), “The height of a random binary search tree”, Journal of the ACM, 50 (3): 306–332, doi:10.1145/765568.765571.

• Robson, J. M. (1979), “The height of binary search trees”, Australian Computer Journal, 11: 151–153.

• Seidel, Raimund; Aragon, Cecilia R. (1996), “Randomized Search Trees”, Algorithmica, 16 (4/5): 464–497, doi:10.1007/s004539900061.

6.3.6 External links

• Open Data Structures - Chapter 7 - Random Binary Search Trees

6.4 Tree rotation

[Figure: Generic tree rotations.]

In discrete mathematics, tree rotation is an operation on a binary tree that changes the structure without interfering with the order of the elements. A tree rotation moves one node up in the tree and one node down. It is used to change the shape of the tree, and in particular to decrease its height by moving smaller subtrees down and larger subtrees up, resulting in improved performance of many tree operations.

There exists an inconsistency in different descriptions as to the definition of the direction of rotations. Some say that the direction of rotation reflects the direction that a node is moving upon rotation (a left child rotating into its parent’s location is a right rotation) while others say that the direction of rotation reflects which subtree is rotating (a left subtree rotating into its parent’s location is a left rotation, the opposite of the former). This article takes the approach of the directional movement of the rotating node.

6.4.1 Illustration

As the diagram shows, a rotation does not change the order of the leaves. The opposite rotation also preserves the order and is the second kind of rotation.

Assuming this is a binary search tree, as stated above, the elements must be interpreted as variables that can be compared to each other. The alphabetic characters in the figure are used as placeholders for these variables. In the animation, capital alphabetic characters are used as variable placeholders while lowercase Greek letters are placeholders for an entire set of variables. The circles represent individual nodes and the triangles represent subtrees. Each subtree could be empty, consist of a single node, or consist of any number of nodes.
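The claim that a rotation and its opposite both preserve the left-to-right order can be checked on a small linked tree. The sketch below is illustrative only: Node, rotate_right, rotate_left and in_order are hypothetical names, the letters are placeholders as in the figure, and parent pointers are deliberately left out.

```python
class Node:
    """Minimal binary tree node without parent pointers (illustrative only)."""

    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def rotate_right(q):
    """Right rotation rooted at q: q's left child moves up and is returned."""
    p = q.left
    q.left = p.right   # P's right subtree becomes Q's left subtree
    p.right = q        # Q becomes P's right child
    return p


def rotate_left(p):
    """Left rotation rooted at p: p's right child moves up and is returned."""
    q = p.right
    p.right = q.left   # Q's left subtree becomes P's right subtree
    q.left = p         # P becomes Q's left child
    return q


def in_order(node):
    """Keys in in-order (left, node, right) sequence."""
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)


# Q with left child P (subtrees A and B) and right subtree C.
root = Node("Q", Node("P", Node("A"), Node("B")), Node("C"))
before = in_order(root)

root = rotate_right(root)        # P is now the subtree root
after_right = in_order(root)
top_after_right = root.key

root = rotate_left(root)         # undoes the right rotation
after_left = in_order(root)
top_after_left = root.key
```

Both traversals match the one taken before the rotation, and the left rotation restores Q as the subtree root, which is what makes the two operations inverses of each other.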
[Figure: Animation of tree rotations taking place.]

The right rotation operation as shown in the image is performed with Q as the root and hence is a right rotation on, or rooted at, Q. This operation results in a rotation of the tree in the clockwise direction. The inverse operation is the left rotation, which results in a movement in a counter-clockwise direction (the left rotation shown above is rooted at P).

The key to understanding how a rotation functions is to understand its constraints. In particular, the order of the leaves of the tree (when read left to right, for example) cannot change (another way to think of it is that the order in which the leaves would be visited in an in-order traversal must be the same after the operation as before). Another constraint is the main property of a binary search tree, namely that the right child is greater than the parent and the left child is less than the parent. Notice that the right child of a left child of the root of a sub-tree (for example node B in the diagram for the tree rooted at Q) can become the left child of the root, which itself becomes the right child of the “new” root in the rotated sub-tree, without violating either of those constraints.

6.4.2 Detailed illustration

[Figure: Pictorial description of how rotations are made.]

When a subtree is rotated, the subtree side upon which it is rotated increases its height by one node while the other subtree decreases its height. This makes tree rotations useful for rebalancing a tree.

The following terminology is used: Root is the parent node of the subtrees to rotate, Pivot is the node which will become the new parent node, RS is the rotation side, and OS is the opposite side of rotation. In the above diagram for the root Q, the RS is C and the OS is P. The pseudo code for the rotation is:

    Pivot = Root.OS
    Root.OS = Pivot.RS
    Pivot.RS = Root
    Root = Pivot

This is a constant time operation.

The programmer must also make sure that the root’s parent points to the pivot after the rotation. Also, the programmer should note that this operation may result in a new root for the entire tree and take care to update pointers accordingly.

6.4.3 Inorder invariance

The tree rotation renders the inorder traversal of the binary tree invariant. This implies the order of the elements is not affected when a rotation is performed in any part of the tree. Here are the inorder traversals of the trees shown above:

    Left tree: ((A, P, B), Q, C)
    Right tree: (A, P, (B, Q, C))

Computing one from the other is very simple. The following is example Python code that performs that computation:

    def right_rotation(treenode):
        left, Q, C = treenode
        A, P, B = left
        return (A, P, (B, Q, C))

Another way of looking at it is:

Right rotation of node Q:

Let P be Q’s left child. Set Q’s left child to be P’s right child. [Set P’s right-child’s parent to Q] Set P’s right child to be Q. [Set Q’s parent to P]

Left rotation of node P:

Let Q be P’s right child. Set P’s right child to be Q’s left child. [Set Q’s left-child’s parent to P] Set Q’s left child to be P. [Set P’s parent to Q]

All other connections are left as-is.

There are also double rotations, which are combinations of left and right rotations. A double left rotation at X can be defined to be a right rotation at the right child of X followed by a left rotation at X; similarly, a double right rotation at X can be defined to be a left rotation at the left child of X followed by a right rotation at X.

Tree rotations are used in a number of tree data structures such as AVL trees, red-black trees, splay trees, and treaps. They require only constant time because they are local transformations: they only operate on 5 nodes, and need not examine the rest of the tree.

6.4.4 Rotations for rebalancing

[Figure: Pictorial description of how rotations cause rebalancing in an AVL tree.]

A tree can be rebalanced using rotations. After a rotation, the side of the rotation increases its height by 1 whilst the side opposite the rotation decreases its height similarly. Therefore, one can strategically apply rotations to nodes whose left child and right child differ in height by more than 1. Self-balancing binary search trees apply this operation automatically. A type of tree which uses this rebalancing technique is the AVL tree.

6.4.5 Rotation distance

The rotation distance between any two binary trees with the same number of nodes is the minimum number of rotations needed to transform one into the other. With this distance, the set of n-node binary trees becomes a metric space: the distance is symmetric, positive when given two different trees, and satisfies the triangle inequality.

It is an open problem whether there exists a polynomial time algorithm for calculating rotation distance.

Daniel Sleator, Robert Tarjan and William Thurston showed that the rotation distance between any two n-node trees (for n ≥ 11) is at most 2n − 6, and that some pairs of trees are this far apart as soon as n is sufficiently large.[1] Lionel Pournin showed that, in fact, such pairs exist whenever n ≥ 11.[2]

6.4.6 See also

• AVL tree, red-black tree, and splay tree, kinds of binary search tree data structures that use rotations to maintain balance.

• Associativity of a binary operation means that performing a tree rotation on it does not change the final result.

• The Day–Stout–Warren algorithm balances an unbalanced BST.

• Tamari lattice, a partially ordered set in which the elements can be defined as binary trees and the ordering between elements is defined by tree rotation.

6.4.7 References

[1] Sleator, Daniel D.; Tarjan, Robert E.; Thurston, William P. (1988), “Rotation distance, triangulations, and hyperbolic geometry”, Journal of the American Mathematical Society, 1 (3): 647–681, doi:10.2307/1990951, JSTOR 1990951, MR 928904.
[2] Pournin, Lionel (2014), "The diameter of associahedra", Advances in Mathematics, 259: 13–42, arXiv:1207.6296, doi:10.1016/j.aim.2014.02.035, MR 3197650.

6.4.8 External links

• Java applets demonstrating tree rotations
• The AVL Tree Rotations Tutorial (RTF) by John Hargrove

6.5 Self-balancing binary search tree

[Figure: An example of an unbalanced tree; following the path from the root to a node takes an average of 3.27 node accesses.]

[Figure: The same tree after being height-balanced; the average path effort decreased to 3.00 node accesses.]

In computer science, a self-balancing (or height-balanced) binary search tree is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.[1]

These structures provide efficient implementations for mutable ordered lists, and can be used for other abstract data structures such as associative arrays, priority queues and sets.

The red–black tree, which is a type of self-balancing binary search tree, was called symmetric binary B-tree[2] and was renamed, but can still be confused with the generic concept of self-balancing binary search tree because of the initials.

6.5.1 Overview

[Figure: Tree rotations are very common internal operations on self-balancing binary trees to keep perfect or near-to-perfect balance.]

Most operations on a binary search tree (BST) take time directly proportional to the height of the tree, so it is desirable to keep the height small. A binary tree with height h can contain at most $2^0 + 2^1 + \cdots + 2^h = 2^{h+1} - 1$ nodes. It follows that for a tree with n nodes and height h:

$n \le 2^{h+1} - 1$

And that implies:

$h \ge \lceil \log_2(n+1) - 1 \rceil \ge \lfloor \log_2 n \rfloor$

In other words, the minimum height of a tree with n nodes is $\log_2 n$, rounded down; that is, $\lfloor \log_2 n \rfloor$.[1]

However, the simplest algorithms for BST item insertion may yield a tree of height n − 1 in rather common situations.
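The gap between the lower bound and what naive insertion produces can be checked directly. The following is a minimal sketch, not part of the original text: `Node`, `insert`, `height` and `build` are hypothetical helper names, and the insertion is the plain non-balancing algorithm. It builds one tree from keys in sorted order and one from the same keys in a shuffled order, then compares both heights with $\lfloor \log_2 n \rfloor$.

```python
import math
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Plain (non-balancing) BST insertion; returns the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right

def height(root):
    """Number of levels below the root (a single node has height 0)."""
    h = -1
    level = [root] if root is not None else []
    while level:                      # walk the tree level by level
        h += 1
        level = [c for n in level for c in (n.left, n.right) if c is not None]
    return h

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

n = 500
sorted_height = height(build(range(n)))    # degenerate, list-like tree
keys = list(range(n))
random.Random(0).shuffle(keys)             # fixed seed, for repeatability only
shuffled_height = height(build(keys))
lower_bound = math.floor(math.log2(n))
print(sorted_height, shuffled_height, lower_bound)
```

With sorted input every key goes to the right of the previous one, so the height is n − 1. The shuffled tree is much shallower, though it still typically sits above the lower bound.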
For example, when the items are inserted in sorted key order, the tree degenerates into a linked list with n nodes. The difference in performance between the two situations may be enormous: for n = 1,000,000, for example, the minimum height is $\lfloor \log_2 1{,}000{,}000 \rfloor = 19$.

If the data items are known ahead of time, the height can be kept small, in the average sense, by adding values in a random order, resulting in a random binary search tree. However, there are many situations (such as online algorithms) where this randomization is not viable.

Self-balancing binary trees solve this problem by performing transformations on the tree (such as tree rotations) at key insertion times, in order to keep the height proportional to $\log_2 n$. Although a certain overhead is involved, it may be justified in the long run by ensuring fast execution of later operations.

Maintaining the height always at its minimum value $\lfloor \log_2 n \rfloor$ is not always viable; it can be proven that any insertion algorithm which did so would have an excessive overhead. Therefore, most self-balanced BST algorithms keep the height within a constant factor of this lower bound.

In the asymptotic ("Big-O") sense, a self-balancing BST structure containing n items allows the lookup, insertion, and removal of an item in O(log n) worst-case time, and ordered enumeration of all items in O(n) time. For some implementations these are per-operation time bounds, while for others they are amortized bounds over a sequence of operations. These times are asymptotically optimal among all data structures that manipulate the key only through comparisons.

6.5.2 Implementations

Popular data structures implementing this type of tree include:

• 2-3 tree
• AA tree
• AVL tree
• Red-black tree
• Scapegoat tree
• Splay tree
• Treap

6.5.3 Applications

Self-balancing binary search trees can be used in a natural way to construct and maintain ordered lists, such as priority queues. They can also be used for associative arrays; key-value pairs are simply inserted with an ordering based on the key alone. In this capacity, self-balancing BSTs have a number of advantages and disadvantages over their main competitor, hash tables. One advantage of self-balancing BSTs is that they allow fast (indeed, asymptotically optimal) enumeration of the items in key order, which hash tables do not provide. One disadvantage is that their lookup algorithms get more complicated when there may be multiple items with the same key. Self-balancing BSTs have better worst-case lookup performance than hash tables (O(log n) compared to O(n)), but have worse average-case performance (O(log n) compared to O(1)).

Self-balancing BSTs can be used to implement any algorithm that requires mutable ordered lists, to achieve optimal worst-case asymptotic performance. For example, if binary tree sort is implemented with a self-balanced BST, we have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment intersection problem and the point location problem efficiently. (For average-case performance, however, self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower than merge sort, quicksort, or heapsort, because of the tree-balancing overhead as well as cache access patterns.)

Self-balancing BSTs are flexible data structures, in that it is easy to extend them to efficiently record additional information or perform new operations. For example, one can record the number of nodes in each subtree having a certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n) time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.

6.5.4 See also

• Search data structure
• Day–Stout–Warren algorithm
• Fusion tree
• Skip list
• Sorting

6.5.5 References

[1] Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998. ISBN 0-201-89685-0. Section 6.2.3: Balanced Trees, pp. 458–481.

[2] Paul E. Black, "red-black tree", in Dictionary of Algorithms and Data Structures [online], Vreda Pieterse and Paul E. Black, eds. 13 April 2015. (accessed 03 October 2016) Available from: http://www.nist.gov/dads/HTML/redblack.html

6.5.6 External links

• Dictionary of Algorithms and Data Structures: Height-balanced binary search tree
• GNU libavl, a LGPL-licensed library of binary tree implementations in C, with documentation

6.6 Treap

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.
6.6.1 Description

[Figure: A treap with alphabetic key and numeric max heap order.]

The treap was first described by Cecilia R. Aragon and Raimund Seidel in 1989;[1][2] its name is a portmanteau of tree and heap. It is a Cartesian tree in which each key is given a (randomly chosen) numeric priority. As with any binary search tree, the inorder traversal order of the nodes is the same as the sorted order of the keys. The structure of the tree is determined by the requirement that it be heap-ordered: that is, the priority number for any nonleaf node must be greater than or equal to the priority of its children. Thus, as with Cartesian trees more generally, the root node is the maximum-priority node, and its left and right subtrees are formed in the same manner from the subsequences of the sorted order to the left and right of that node.

An equivalent way of describing the treap is that it could be formed by inserting the nodes highest-priority-first into a binary search tree without doing any rebalancing. Therefore, if the priorities are independent random numbers (from a distribution over a large enough space of possible priorities to ensure that two nodes are very unlikely to have the same priority) then the shape of a treap has the same probability distribution as the shape of a random binary search tree, a search tree formed by inserting the nodes without rebalancing in a randomly chosen insertion order. Because random binary search trees are known to have logarithmic height with high probability, the same is true for treaps.

Aragon and Seidel also suggest assigning higher priorities to frequently accessed nodes, for instance by a process that, on each access, chooses a random number and replaces the priority of the node with that number if it is higher than the previous priority. This modification would cause the tree to lose its random shape; instead, frequently accessed nodes would be more likely to be near the root of the tree, causing searches for them to be faster.

Naor and Nissim[3] describe an application in maintaining authorization certificates in public-key cryptosystems.

6.6.2 Operations

Treaps support the following basic operations:

• To search for a given key value, apply a standard binary search algorithm in a binary search tree, ignoring the priorities.
• To insert a new key x into the treap, generate a random priority y for x. Binary search for x in the tree, and create a new node at the leaf position where the binary search determines a node for x should exist. Then, as long as x is not the root of the tree and has a larger priority number than its parent z, perform a tree rotation that reverses the parent-child relation between x and z.
• To delete a node x from the treap, if x is a leaf of the tree, simply remove it. If x has a single child z, remove x from the tree and make z be the child of the parent of x (or make z the root of the tree if x had no parent). Finally, if x has two children, swap its position in the tree with the position of its immediate successor z in the sorted order, resulting in one of the previous cases. In this final case, the swap may violate the heap-ordering property for z, so additional rotations may need to be performed to restore this property.

Bulk operations

In addition to the single-element insert, delete and lookup operations, several fast "bulk" operations have been defined on treaps: union, intersection and set difference. These rely on two helper operations, split and merge.

• To split a treap into two smaller treaps, those smaller than key x and those larger than key x, insert x into the treap with maximum priority, that is, a priority larger than that of any node in the treap. After this insertion, x will be the root node of the treap, all values less than x will be found in the left subtreap, and all values greater than x will be found in the right subtreap. This costs as much as a single insertion into the treap.
• When merging two treaps that are the product of a former split, one can safely assume that the greatest value in the first treap is less than the smallest value in the second treap. Create a new node with value x, such that x is larger than this max-value in the first treap and smaller than the min-value in the second treap, assign it the minimum priority, then set its left child to the first treap and its right child to the second treap. Rotate as necessary to fix the heap order. After that it will be a leaf node, and can easily be deleted. The result is one treap merged from the two original treaps. This is effectively "undoing" a split, and costs the same.

The union of two treaps t1 and t2, representing sets A and B, is a treap t that represents A ∪ B. The following recursive algorithm computes the union:

    function union(t1, t2):
        if t1 = nil: return t2
        if t2 = nil: return t1
        if priority(t1) < priority(t2): swap t1 and t2
        t<, t> ← split t2 on key(t1)
        return new node(key(t1),
                        union(left(t1), t<),
                        union(right(t1), t>))

Here, split is presumed to return two trees: one holding the keys less than its input key, one holding the greater keys. (The algorithm is non-destructive, but an in-place destructive version exists as well.)

The algorithm for intersection is similar, but requires the join helper routine. The complexity of each of union, intersection and difference is O(m log(n/m)) for treaps of sizes m and n, with m ≤ n. Moreover, since the recursive calls to union are independent of each other, they can be executed in parallel.[4]
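The treap insertion and union procedures of section 6.6.2 can be tried out with a short sketch. This is an illustration under stated assumptions rather than a reference implementation: the names `Node`, `insert`, `split`, `union`, `inorder` and `is_heap_ordered` are hypothetical, priorities are floats drawn from Python's `random` module, and keys follow set semantics (a duplicate insert is ignored). Two substitutions are made for brevity. First, `split` is written as a direct recursion that drops a key equal to the split key, instead of inserting a maximum-priority node. Second, `union` is the in-place (destructive) variant of the pseudocode.

```python
import random

class Node:
    def __init__(self, key, priority):
        self.key = key
        self.priority = priority
        self.left = None
        self.right = None

def rotate_right(q):
    """Right rotation rooted at q: its left child becomes the subtree root."""
    p = q.left
    q.left, p.right = p.right, q
    return p

def rotate_left(p):
    """Left rotation rooted at p: its right child becomes the subtree root."""
    q = p.right
    p.right, q.left = q.left, p
    return q

def insert(root, key, rng):
    """Insert key at a leaf, then rotate it up while it outranks its parent."""
    if root is None:
        return Node(key, rng.random())
    if key < root.key:
        root.left = insert(root.left, key, rng)
        if root.left.priority > root.priority:
            root = rotate_right(root)
    elif key > root.key:
        root.right = insert(root.right, key, rng)
        if root.right.priority > root.priority:
            root = rotate_left(root)
    return root  # an equal key is already present; nothing to do

def split(t, key):
    """Return (treap of keys < key, treap of keys > key); destroys t."""
    if t is None:
        return None, None
    if key < t.key:
        less, t.left = split(t.left, key)
        return less, t
    if key > t.key:
        t.right, greater = split(t.right, key)
        return t, greater
    return t.left, t.right  # the equal key itself is dropped

def union(t1, t2):
    """In-place union following the pseudocode; destroys both inputs."""
    if t1 is None:
        return t2
    if t2 is None:
        return t1
    if t1.priority < t2.priority:
        t1, t2 = t2, t1
    less, greater = split(t2, t1.key)
    t1.left = union(t1.left, less)
    t1.right = union(t1.right, greater)
    return t1

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.key] + inorder(t.right)

def is_heap_ordered(t):
    if t is None:
        return True
    for child in (t.left, t.right):
        if child is not None and child.priority > t.priority:
            return False
    return is_heap_ordered(t.left) and is_heap_ordered(t.right)

rng = random.Random(1)  # fixed seed, for repeatability only
a = b = None
for k in [5, 1, 9, 3, 7]:
    a = insert(a, k, rng)
for k in [2, 3, 8, 9, 10]:
    b = insert(b, k, rng)
u = union(a, b)
print(inorder(u))  # [1, 2, 3, 5, 7, 8, 9, 10]
```

Whatever priorities are drawn, the inorder sequence of the result is the sorted set union, because the priorities affect only the shape of the tree.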
6.6.3 Randomized binary search tree

The randomized binary search tree, introduced by Martínez and Roura subsequently to the work of Aragon and Seidel on treaps,[5] stores the same nodes with the same random distribution of tree shape, but maintains different information within the nodes of the tree in order to maintain its randomized structure.

Rather than storing random priorities on each node, the randomized binary search tree stores a small integer at each node, the number of its descendants (counting itself as one); these numbers may be maintained during tree rotation operations at only a constant additional amount of time per rotation. When a key x is to be inserted into a tree that already has n nodes, the insertion algorithm chooses with probability 1/(n + 1) to place x as the new root of the tree, and otherwise it calls the insertion procedure recursively to insert x within the left or right subtree (depending on whether its key is less than or greater than the root). The numbers of descendants are used by the algorithm to calculate the necessary probabilities for the random choices at each step. Placing x at the root of a subtree may be performed either as in the treap by inserting it at a leaf and then rotating it upwards, or by an alternative algorithm described by Martínez and Roura that splits the subtree into two pieces to be used as the left and right children of the new node.

The deletion procedure for a randomized binary search tree uses the same information per node as the insertion procedure, and like the insertion procedure it makes a sequence of O(log n) random decisions in order to join the two subtrees descending from the left and right children of the deleted node into a single tree. If the left or right subtree of the node to be deleted is empty, the join operation is trivial; otherwise, the left or right child of the deleted node is selected as the new subtree root with probability proportional to its number of descendants, and the join proceeds recursively.

6.6.4 Comparison

The information stored per node in the randomized binary tree is simpler than in a treap (a small integer rather than a high-precision random number), but it makes a greater number of calls to the random number generator (O(log n) calls per insertion or deletion rather than one call per insertion) and the insertion procedure is slightly more complicated due to the need to update the numbers of descendants per node. A minor technical difference is that, in a treap, there is a small probability of a collision (two keys getting the same priority), and in both cases there will be statistical differences between a true random number generator and the pseudo-random number generator typically used on digital computers. However, in any case the differences between the theoretical model of perfect random choices used to design the algorithm and the capabilities of actual random number generators are vanishingly small.

Although the treap and the randomized binary search tree both have the same random distribution of tree shapes after each update, the history of modifications to the trees performed by these two data structures over a sequence of insertion and deletion operations may be different. For instance, in a treap, if the three numbers 1, 2, and 3 are inserted in the order 1, 3, 2, and then the number 2 is deleted, the remaining two nodes will have the same parent-child relationship that they did prior to the insertion of the middle number. In a randomized binary search tree, the tree after the deletion is equally likely to be either of the two possible trees on its two nodes, independently of what the tree looked like prior to the insertion of the middle number.

6.6.5 See also

• Finger search

6.6.6 References

[1] Aragon, Cecilia R.; Seidel, Raimund (1989), "Randomized Search Trees" (PDF), Proc. 30th Symp. Foundations of Computer Science (FOCS 1989), Washington, D.C.: IEEE Computer Society Press, pp. 540–545, doi:10.1109/SFCS.1989.63531, ISBN 0-8186-1982-1.

[2] Seidel, Raimund; Aragon, Cecilia R. (1996), "Randomized Search Trees", Algorithmica, 16 (4/5): 464–497, doi:10.1007/s004539900061.

[3] Naor, M.; Nissim, K. (April 2000), "Certificate revocation and certificate update" (PDF), IEEE Journal on Selected Areas in Communications, 18 (4): 561–570, doi:10.1109/49.839932.

[4] Blelloch, Guy E.; Reid-Miller, Margaret (1998), "Fast set operations using treaps", Proc. 10th ACM Symp. Parallel Algorithms and Architectures (SPAA 1998), New York, NY, USA: ACM, pp. 16–26, doi:10.1145/277651.277660, ISBN 0-89791-989-0.

[5] Martínez, Conrado; Roura, Salvador (1997), "Randomized binary search trees", Journal of the ACM, 45 (2): 288–323, doi:10.1145/274787.274812.

6.6.7 External links

• Collection of treap references and info by Cecilia Aragon
• Open Data Structures - Section 7.2 - Treap: A Randomized Binary Search Tree
• Treap Applet by Kubo Kovac
• Animated treap
• Randomized binary search trees. Lecture notes from a course by Jeff Erickson at UIUC. Despite the title, this is primarily about treaps and skip lists; randomized binary search trees are mentioned only briefly.
• A high performance key-value store based on treap by Junyi Sun
• VB6 implementation of treaps. Visual Basic 6 implementation of treaps as a COM object.
• ActionScript3 implementation of a treap
• Pure Python and Cython in-memory treap and duptreap
• Treaps in C#. By Roy Clemmons
• Pure Go in-memory, immutable treaps
• Pure Go persistent treap key-value storage library

6.7 AVL tree

[Fig. 1: AVL tree with balance factors (green).]

In computer science, an AVL tree is a self-balancing binary search tree. It was the first such data structure to be invented.[2] In an AVL tree, the heights of the two child subtrees of any node differ by at most one; if at any time they differ by more than one, rebalancing is done to restore this property. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where n is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.

The AVL tree is named after its two Soviet inventors, Georgy Adelson-Velsky and Evgenii Landis, who published it in their 1962 paper "An algorithm for the organization of information".[3]

AVL trees are often compared with red–black trees because both support the same set of operations and take O(log n) time for the basic operations. For lookup-intensive applications, AVL trees are faster than red–black trees because they are more strictly balanced.[4] Similar to red–black trees, AVL trees are height-balanced. Both are in general not weight-balanced nor μ-balanced for any μ ≤ 1/2;[5] that is, sibling nodes can have hugely differing numbers of descendants.

6.7.1 Definition

Balance factor

In a binary tree the balance factor of a node N is defined to be the height difference

    BalanceFactor(N) := –Height(LeftSubtree(N)) + Height(RightSubtree(N)) [6]

of its two child subtrees. A binary tree is called an AVL tree if the invariant

    BalanceFactor(N) ∈ {–1, 0, +1}

holds for every node N in the tree. A node N with BalanceFactor(N) < 0 is called "left-heavy", one with BalanceFactor(N) > 0 is called "right-heavy", and one with BalanceFactor(N) = 0 is sometimes simply called "balanced".

Remark

In the sequel, because there is a one-to-one correspondence between nodes and the subtrees rooted by them, we sometimes leave it to the context whether the name of an object stands for the node or the subtree.
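The balance factor definition can be tried on small trees. The sketch below is illustrative only: `Node`, `height`, `balance_factor` and `is_avl` are hypothetical names, and the height of an empty subtree is assumed to be 0 (any consistent convention yields the same balance factors). Note that `is_avl` checks only the height invariant, not the search-tree ordering of the keys.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def height(node):
    """Assumed convention: empty subtree has height 0, a single node height 1."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    """Height(RightSubtree) - Height(LeftSubtree), as in the definition."""
    return height(node.right) - height(node.left)

def is_avl(node):
    """True if every node's balance factor lies in {-1, 0, +1}."""
    if node is None:
        return True
    return (balance_factor(node) in (-1, 0, 1)
            and is_avl(node.left)
            and is_avl(node.right))

# A right-heavy but still valid shape: the root has balance factor +1.
balanced = Node(2, Node(1), Node(4, Node(3), Node(5)))
# A chain 1 -> 2 -> 3: the root has balance factor +2, violating the invariant.
chain = Node(1, None, Node(2, None, Node(3)))

print(balance_factor(balanced), is_avl(balanced))  # 1 True
print(balance_factor(chain), is_avl(chain))        # 2 False
```

Recomputing heights from scratch, as done here, costs more than constant time per node. It is meant for checking the definition, not as the way a real AVL implementation tracks balance.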
Properties

Balance factors can be kept up-to-date by knowing the previous balance factors and the change in height; it is not necessary to know the absolute height. For holding the AVL balance information, two bits per node are sufficient.[7]

The height h of an AVL tree with n nodes lies in the interval:[8]

$\log_2(n+1) \le h < c \log_2(n+2) + b$

with the golden ratio $\varphi := (1+\sqrt{5})/2 \approx 1.618$, $c := 1/\log_2 \varphi \approx 1.44$, and $b := \tfrac{c}{2} \log_2 5 - 2 \approx -0.328$. This is because an AVL tree of height h contains at least $F_{h+2} - 1$ nodes, where $\{F_h\}$ is the Fibonacci sequence with the seed values $F_1 = 1$, $F_2 = 1$.

Data structure

According to the original paper "An algorithm for the organization of information", AVL trees were invented as binary search trees. In that sense they are a data structure together with its major associated operations, namely search, insert and delete, which rely on and maintain the AVL property. In that sense, the AVL tree is a "self-balancing binary search tree".

6.7.2 Operations

Read-only operations of an AVL tree involve carrying out the same actions as would be carried out on an unbalanced binary search tree, but modifications have to observe and restore the height balance of the subtrees.

Searching

Searching for a specific key in an AVL tree can be done the same way as that of a normal unbalanced binary search tree. In order for search to work effectively it has to employ a comparison function which establishes a total order (or at least a total preorder) on the set of keys. The number of comparisons required for successful search is limited by the height h and for unsuccessful search is very close to h, so both are in O(log n).

Traversal

Once a node has been found in an AVL tree, the next or previous node can be accessed in amortized constant time. Some instances of exploring these "nearby" nodes require traversing up to h ∝ log(n) links (particularly when navigating from the rightmost leaf of the root's left subtree to the root or from the root to the leftmost leaf of the root's right subtree; in the AVL tree of figure 1, moving from node P to the next but one node Q takes 3 steps). However, exploring all n nodes of the tree in this manner would visit each link exactly twice: one downward visit to enter the subtree rooted by that node, another visit upward to leave that node's subtree after having explored it. And since there are n − 1 links in any tree, the amortized cost is found to be 2×(n − 1)/n, or approximately 2.

6.7.3 Comparison to other structures

Both AVL trees and red–black trees are self-balancing binary search trees and they are related mathematically. Indeed, every AVL tree can be colored red–black. The operations to balance the trees are different: a red–black update and an AVL insertion require only O(1) rotations in the worst case, whereas an AVL deletion may require up to O(log n) rotations; both structures also require O(log n) other updates (to colors or heights) in the worst case (though only O(1) amortized). AVL trees require storing 2 bits (or one trit) of information in each node, while red–black trees require just one bit per node. The bigger difference between the two data structures is their height limit.

For a tree of size n ≥ 1:

• an AVL tree's height is at most

  $h \le c \log_2(n + d) + b < c \log_2(n + 2) + b$

  where $\varphi := (1+\sqrt{5})/2 \approx 1.618$ is the golden ratio, $c := 1/\log_2 \varphi \approx 1.44$, $b := \tfrac{c}{2} \log_2 5 - 2 \approx -0.328$, and $d := 1 + 1/(\varphi^4 \sqrt{5}) \approx 1.07$.

• a red–black tree's height is at most

  $h \le 2 \log_2(n + 1)$ [9]

AVL trees are more rigidly balanced than red–black trees, leading to faster retrieval but slower insertion and deletion.

6.7.4 See also

• Trees
• Tree rotation
• Red–black tree
• Splay tree
• Scapegoat tree
• B-tree
• T-tree
• List of data structures

6.7.5 References

[1] Eric Alexander. "AVL Trees".

[2] Robert Sedgewick, Algorithms, Addison-Wesley, 1983, ISBN 0-201-06672-6, page 199, chapter 15: Balanced Trees.

[3] Georgy Adelson-Velsky, G.; Evgenii Landis (1962). "An algorithm for the organization of information". Proceedings of the USSR Academy of Sciences (in Russian). 146: 263–266. English translation by Myron J. Ricci in Soviet Math. Doklady, 3:1259–1263, 1962.

[4] Pfaff, Ben (June 2004). "Performance Analysis of BSTs in System Software" (PDF). Stanford University.

[5] AVL trees are not weight-balanced? (meaning: AVL trees are not μ-balanced?) Thereby: a binary tree is called μ-balanced, with 0 ≤ μ ≤ 1/2, if for every node N the inequality $\tfrac{1}{2} - \mu \le \frac{|N_l|}{|N| + 1} \le \tfrac{1}{2} + \mu$ holds and μ is minimal with this property. |N| is the number of nodes below the tree with N as root (including the root) and $N_l$ is the left child node of N.

[6] Knuth, Donald E. (2000). Sorting and searching (2. ed., 6. printing, newly updated and rev. ed.). Boston [u.a.]: Addison-Wesley. p. 459. ISBN 0-201-89685-0.

[7] More precisely: if the AVL balance information is kept in the child nodes, with the meaning "when going upward there is an additional increment in height", this can be done with one bit. Nevertheless, the modifying operations can be programmed more efficiently if the balance information can be checked with one test.

[8] Knuth, Donald E. (2000). Sorting and searching (2. ed., 6. printing, newly updated and rev. ed.). Boston [u.a.]: Addison-Wesley. p. 460. ISBN 0-201-89685-0.

[9] Red–black tree#Proof of asymptotic bounds

6.7.6 Further reading

• Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998. ISBN 0-201-89685-0. Pages 458–475 of section 6.2.3: Balanced Trees.

6.7.7 External links

• This article incorporates public domain material from the NIST document: Black, Paul E. "AVL Tree". Dictionary of Algorithms and Data Structures.
• AVL tree demonstration (HTML5/Canvas)
• AVL tree demonstration (requires Flash)
• AVL tree demonstration (requires Java)

6.8 Red–black tree

A red–black tree is a kind of self-balancing binary search tree. Each node of the binary tree has an extra bit, and that bit is often interpreted as the color (red or black) of the node. These color bits are used to ensure the tree remains approximately balanced during insertions and deletions.[2]

Balance is preserved by painting each node of the tree with one of two colors (typically called 'red' and 'black') in a way that satisfies certain properties, which collectively constrain how unbalanced the tree can become in the worst case. When the tree is modified, the new tree is subsequently rearranged and repainted to restore the coloring properties. The properties are designed in such a way that this rearranging and recoloring can be performed efficiently.

The balancing of the tree is not perfect, but it is good enough to allow it to guarantee searching in O(log n) time, where n is the total number of elements in the tree. The insertion and deletion operations, along with the tree rearrangement and recoloring, are also performed in O(log n) time.[3]

Tracking the color of each node requires only 1 bit of information per node because there are only two colors. The tree does not contain any other data specific to its being a red–black tree, so its memory footprint is almost identical to a classic (uncolored) binary search tree. In many cases the additional bit of information can be stored at no additional memory cost.

6.8.1 History

In 1972 Rudolf Bayer[4] invented a data structure that was a special order-4 case of a B-tree. These trees maintained all paths from root to leaf with the same number of nodes, creating perfectly balanced trees. However, they were not binary search trees. Bayer called them a "symmetric binary B-tree" in his paper and later they became popular as 2-3-4 trees or just 2-4 trees.[5]
In a 1978 paper, "A Dichromatic Framework for Balanced Trees",[6] Leonidas J. Guibas and Robert Sedgewick derived the red-black tree from the symmetric binary B-tree.[7] The color "red" was chosen because it was the best-looking color produced by the color laser printer available to the authors while working at Xerox PARC.[8] Another response from professor Guibas states that it was because of the red and black pens available to them to draw the trees.[9]

In 1993, Arne Andersson introduced the idea of a right-leaning tree to simplify insert and delete operations.[10]

In 1999, Chris Okasaki showed how to make the insert operation purely functional. Its balance function needed to take care of only 4 unbalanced cases and one default balanced case.[11]

The original algorithm used 8 unbalanced cases, but Cormen et al. (2001) reduced that to 6 unbalanced cases.[2] Sedgewick showed that the insert operation can be implemented in just 46 lines of Java code.[12][13] In 2008, Sedgewick proposed the left-leaning red–black tree, leveraging Andersson's idea that simplified algorithms. Sedgewick originally allowed nodes whose two children are red, making his trees more like 2-3-4 trees, but later a restriction against this was added, making the new trees more like 2-3 trees. Sedgewick implemented the insert algorithm in just 33 lines, significantly shortening his original 46 lines of code.[14][15]

6.8.2 Terminology

A red–black tree is a special type of binary tree, used in computer science to organize pieces of comparable data, such as text fragments or numbers.

The leaf nodes of red–black trees do not contain data. These leaves need not be explicit in computer memory (a null child pointer can encode the fact that this child is a leaf), but it simplifies some algorithms for operating on red–black trees if the leaves really are explicit nodes. To save memory, sometimes a single sentinel node performs the role of all leaf nodes; all references from internal nodes to leaf nodes then point to the sentinel node.

Red–black trees, like all binary search trees, allow efficient in-order traversal (that is: in the order Left–Root–Right) of their elements. The search time results from the traversal from root to leaf, and therefore a balanced tree of n nodes, having the least possible tree height, results in O(log n) search time.

6.8.3 Properties

[Figure: An example of a red–black tree, with keys 13; 8, 17; 1, 11, 15, 25; 6, 22, 27 and NIL leaves.]

In addition to the requirements imposed on a binary search tree, the following must be satisfied by a red–black tree:[16]

1. A node is either red or black.
2. The root is black. This rule is sometimes omitted. Since the root can always be changed from red to black, but not necessarily vice versa, this rule has little effect on analysis.
3. All leaves (NIL) are black.
4. If a node is red, then both its children are black.
5. Every path from a given node to any of its descendant NIL nodes contains the same number of black nodes.

Some definitions: the number of black nodes from the root to a node is the node's black depth; the uniform number of black nodes in all paths from root to the leaves is called the black-height of the red–black tree.[17]

These constraints enforce a critical property of red–black trees: the path from the root to the farthest leaf is no more than twice as long as the path from the root to the nearest leaf. The result is that the tree is roughly height-balanced. Since operations such as inserting, deleting, and finding values require worst-case time proportional to the height of the tree, this theoretical upper bound on the height allows red–black trees to be efficient in the worst case, unlike ordinary binary search trees.

To see why this is guaranteed, it suffices to consider the effect of properties 4 and 5 together. For a red–black tree T, let B be the number of black nodes in property 5. Let the shortest possible path from the root of T to any leaf consist of B black nodes. Longer possible paths may be constructed by inserting red nodes.
However, property 4 makes it impossible to insert more than one consecutive red node. Therefore, ignoring any black NIL leaves, the longest possible path consists of 2*B nodes, alternating black and red (this is the worst case). Counting the black NIL leaves, the longest possible path consists of 2*B-1 nodes.

The shortest possible path has all black nodes, and the longest possible path alternates between red and black nodes. Since all maximal paths have the same number of black nodes, by property 5, this shows that no path is more than twice as long as any other path.

6.8.4 Analogy to B-trees of order 4

[Figure: The same red–black tree as in the example above, seen as a B-tree]

A red–black tree is similar in structure to a B-tree of order[note 1] 4, where each node can contain between 1 and 3 values and (accordingly) between 2 and 4 child pointers. In such a B-tree, each node will contain only one value matching the value in a black node of the red–black tree, with an optional value before and/or after it in the same node, both matching an equivalent red node of the red–black tree.

One way to see this equivalence is to “move up” the red nodes in a graphical representation of the red–black tree, so that they align horizontally with their parent black node, forming a horizontal cluster together. In the B-tree, or in the modified graphical representation of the red–black tree, all leaf nodes are at the same depth.

The red–black tree is then structurally equivalent to a B-tree of order 4, with a minimum fill factor of 33% of values per cluster with a maximum capacity of 3 values.

This B-tree type is still more general than a red–black tree though, as it allows ambiguity in a red–black tree conversion: multiple red–black trees can be produced from an equivalent B-tree of order 4. If a B-tree cluster contains only 1 value, it is the minimum, black, and has two child pointers. If a cluster contains 3 values, then the central value will be black and each value stored on its sides will be red. If the cluster contains two values, however, either one can become the black node in the red–black tree (and the other one will be red).

So the order-4 B-tree does not record which of the values contained in each cluster is the black root node of the whole cluster and the parent of the other values in the same cluster. Despite this, the operations on red–black trees are more economical in time because you don't have to maintain the vector of values. It may be costly if values are stored directly in each node rather than being stored by reference. B-tree nodes, however, are more economical in space because you don't need to store the color attribute for each node. Instead, you have to know which slot in the cluster vector is used. If values are stored by reference, e.g. objects, null references can be used and so the cluster can be represented by a vector containing 3 slots for value pointers plus 4 slots for child references in the tree. In that case, the B-tree can be more compact in memory, improving data locality.

The same analogy can be made with B-trees with larger orders that can be structurally equivalent to a colored binary tree: you just need more colors. Suppose that you add blue; then the blue–red–black tree, defined like red–black trees but with the additional constraint that no two successive nodes in the hierarchy will be blue and all blue nodes will be children of a red node, becomes equivalent to a B-tree whose clusters will have at most 7 values in the following colors: blue, red, blue, black, blue, red, blue (for each cluster, there will be at most 1 black node, 2 red nodes, and 4 blue nodes).

For moderate volumes of values, insertions and deletions in a colored binary tree are faster compared to B-trees because colored trees don't attempt to maximize the fill factor of each horizontal cluster of nodes (only the minimum fill factor is guaranteed in colored binary trees, limiting the number of splits or junctions of clusters). B-trees will be faster for performing rotations (because rotations will frequently occur within the same cluster rather than with multiple separate nodes in a colored binary tree). For storing large volumes, however, B-trees will be much faster as they will be more compact by grouping several children in the same cluster where they can be accessed locally.

All optimizations possible in B-trees to increase the average fill factors of clusters are possible in the equivalent multicolored binary tree. Notably, maximizing the average fill factor in a structurally equivalent B-tree is the same as reducing the total height of the multicolored tree, by increasing the number of non-black nodes. The worst case occurs when all nodes in a colored binary tree are black; the best case occurs when only a third of them are black (and the other two thirds are red nodes).

Notes

[1] Using Knuth’s definition of order: the maximum number of children

6.8.5 Applications and related data structures

Red–black trees offer worst-case guarantees for insertion time, deletion time, and search time. Not only does this make them valuable in time-sensitive applications such as real-time applications, but it makes them valuable building blocks in other data structures which provide worst-case guarantees; for example, many data structures used in computational geometry can be based on red–black trees, and the Completely Fair Scheduler used in current Linux kernels uses red–black trees.
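These worst-case guarantees hold only while the red–black properties are intact after every update. As a rough illustration (not taken from the original text), the following C sketch checks properties 2, 4 and 5 on a tree. The node layout, the use of NULL for black NIL leaves, and the names `rb_black_height`, `rb_is_valid` and `rb_assert_valid` are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, BLACK };

/* Hypothetical node layout; a NULL child stands for a black NIL leaf. */
struct node {
    enum color color;
    int key;
    struct node *left, *right, *parent;
};

static int is_red(const struct node *n)
{
    return n != NULL && n->color == RED;   /* NIL leaves count as black */
}

/*
 * Returns the number of black nodes on every path from n down to a NIL
 * leaf (counting n itself if it is black, and the NIL leaf), or -1 if the
 * subtree rooted at n violates property 4 or property 5.
 */
int rb_black_height(const struct node *n)
{
    if (n == NULL)
        return 1;                          /* a black NIL leaf */

    if (is_red(n) && (is_red(n->left) || is_red(n->right)))
        return -1;                         /* property 4: red node with red child */

    int lh = rb_black_height(n->left);
    int rh = rb_black_height(n->right);
    if (lh < 0 || rh < 0 || lh != rh)
        return -1;                         /* property 5: unequal black counts */

    return lh + (n->color == BLACK ? 1 : 0);
}

/* Nonzero if the whole tree satisfies properties 2, 4 and 5. */
int rb_is_valid(const struct node *root)
{
    if (is_red(root))
        return 0;                          /* property 2: root must be black */
    return rb_black_height(root) >= 0;
}

/* Convenience wrapper for debug builds. */
void rb_assert_valid(const struct node *root)
{
    assert(rb_is_valid(root));
}
```

A checker like this is mainly useful in tests: calling it after each insertion or removal catches a broken rebalancing step immediately. It does not verify the binary-search-tree ordering of the keys, which would need a separate in-order check.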
The AVL tree is another structure supporting O(log n) search, insertion, and removal. It is more rigidly balanced than red–black trees, leading to slower insertion and removal but faster retrieval. This makes it attractive for data structures that may be built once and loaded without reconstruction, such as language dictionaries (or program dictionaries, such as the opcodes of an assembler or interpreter).

Red–black trees are also particularly valuable in functional programming, where they are one of the most common persistent data structures, used to construct associative arrays and sets which can retain previous versions after mutations. The persistent version of red–black trees requires O(log n) space for each insertion or deletion, in addition to time.

For every 2-4 tree, there are corresponding red–black trees with data elements in the same order. The insertion and deletion operations on 2-4 trees are also equivalent to color-flipping and rotations in red–black trees. This makes 2-4 trees an important tool for understanding the logic behind red–black trees, and this is why many introductory algorithm texts introduce 2-4 trees just before red–black trees, even though 2-4 trees are not often used in practice.

In 2008, Sedgewick introduced a simpler version of the red–black tree called the left-leaning red–black tree[18] by eliminating a previously unspecified degree of freedom in the implementation. The LLRB maintains an additional invariant that all red links must lean left except during inserts and deletes. Red–black trees can be made isometric to either 2-3 trees,[19] or 2-4 trees,[18] for any sequence of operations. The 2-4 tree isometry was described in 1978 by Sedgewick. With 2-4 trees, the isometry is resolved by a “color flip,” corresponding to a split, in which the red color of two children nodes leaves the children and moves to the parent node.

The tango tree, a type of tree optimized for fast searches, usually uses red–black trees as part of its data structure.

In version 8 of Java, the HashMap collection was modified so that, instead of a linked list, a red–black tree is used to store different elements with identical hash codes. This improves the time complexity of searching for such an element from O(n) to O(log n).[20]

6.8.6 Operations

Read-only operations on a red–black tree require no modification from those used for binary search trees, because every red–black tree is a special case of a simple binary search tree. However, the immediate result of an insertion or removal may violate the properties of a red–black tree. Restoring the red–black properties requires a small number (O(log n) or amortized O(1)) of color changes (which are very quick in practice) and no more than three tree rotations (two for insertion). Although insert and delete operations are complicated, their times remain O(log n).

Insertion

Insertion begins by adding the node as any binary search tree insertion does and by coloring it red. Whereas in the binary search tree we always add a leaf, in the red–black tree leaves contain no information, so instead we add a red interior node, with two black leaves, in place of an existing black leaf.

What happens next depends on the color of other nearby nodes. The term uncle node will be used to refer to the sibling of a node’s parent, as in human family trees. Note that:

• property 3 (all leaves are black) always holds.
• property 4 (both children of every red node are black) is threatened only by adding a red node, repainting a black node red, or a rotation.
• property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is threatened only by adding a black node, repainting a red node black (or vice versa), or a rotation.

Notes

1. The label N will be used to denote the current node (colored red). In the diagrams N carries a blue contour. At the beginning, this is the new node being inserted, but the entire procedure may also be applied recursively to other nodes (see case 3). P will denote N's parent node, G will denote N's grandparent, and U will denote N's uncle. In between some cases, the roles and labels of the nodes are exchanged, but in each case, every label continues to represent the same node it represented at the beginning of the case.
2. If a node in the right (target) half of a diagram carries a blue contour it will become the current node in the next iteration and there the other nodes will be newly assigned relative to it. Any color shown in the diagram is either assumed in its case or implied by those assumptions.
3. A numbered triangle represents a subtree of unspecified depth. A black circle atop a triangle means that the black-height of that subtree is greater by one compared to a subtree without this circle.

There are several cases of red–black tree insertion to handle:

• N is the root node, i.e., first node of red–black tree
• N's parent (P) is black
• N's parent (P) and uncle (U) are red (P and U are repainted black, G is repainted red, and the repair restarts from G)
• N is added to right of left child of grandparent, or N is added to left of right child of grandparent (P is red and U is black)
• N is added to left of left child of grandparent, or N is added to right of right child of grandparent (P is red and U is black)

Each case will be demonstrated with example C code.
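The code samples in this section call `rotate_left` and `rotate_right` and use a `struct node` with `parent`, `left`, `right` and `color` fields, but the text never defines them. The following is a minimal sketch of what they might look like, assuming NULL pointers for the leaves and an `int` key; the exact field layout and the `find_root` helper are assumptions made here, not part of the original samples.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, BLACK };

/* Assumed node layout; the key type is a placeholder. */
struct node {
    struct node *parent, *left, *right;
    enum color color;
    int key;
};

/*
 * Left rotation around x: x's right child y moves up into x's place and
 * x becomes y's left child. The in-order sequence of keys is unchanged.
 */
void rotate_left(struct node *x)
{
    struct node *y = x->right;
    assert(y != NULL);               /* cannot pivot on a missing child */

    x->right = y->left;              /* y's left subtree becomes x's right */
    if (y->left != NULL)
        y->left->parent = x;

    y->parent = x->parent;           /* attach y to x's former parent */
    if (x->parent != NULL) {
        if (x == x->parent->left)
            x->parent->left = y;
        else
            x->parent->right = y;
    }

    y->left = x;
    x->parent = y;
}

/* Mirror image of rotate_left: x's left child y moves up. */
void rotate_right(struct node *x)
{
    struct node *y = x->left;
    assert(y != NULL);

    x->left = y->right;
    if (y->right != NULL)
        y->right->parent = x;

    y->parent = x->parent;
    if (x->parent != NULL) {
        if (x == x->parent->left)
            x->parent->left = y;
        else
            x->parent->right = y;
    }

    y->right = x;
    x->parent = y;
}

/* A rotation at the top of the tree changes the root, so callers that
 * keep a root pointer can recompute it by walking up the parent links. */
struct node *find_root(struct node *n)
{
    while (n != NULL && n->parent != NULL)
        n = n->parent;
    return n;
}
```

Because the rotation functions take only a node and not a pointer to the root, a caller that stores the root separately would need to refresh it (for example with `find_root`) after an insertion or removal; passing the root by reference would be an equally reasonable design.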
The uncle and grandparent nodes can be found by these functions:

    struct node *grandparent(struct node *n)
    {
        if ((n != NULL) && (n->parent != NULL))
            return n->parent->parent;
        else
            return NULL;
    }

    struct node *uncle(struct node *n)
    {
        struct node *g = grandparent(n);

        if (g == NULL)
            return NULL; // No grandparent means no uncle
        if (n->parent == g->left)
            return g->right;
        else
            return g->left;
    }

Case 1: The current node N is at the root of the tree. In this case, it is repainted black to satisfy property 2 (the root is black). Since this adds one black node to every path at once, property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is not violated.

    void insert_case1(struct node *n)
    {
        if (n->parent == NULL)
            n->color = BLACK;
        else
            insert_case2(n);
    }

Case 2: The current node’s parent P is black, so property 4 (both children of every red node are black) is not invalidated. In this case, the tree is still valid. Property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is not threatened, because the current node N has two black leaf children, but because N is red, the paths through each of its children have the same number of black nodes as the path through the leaf it replaced, which was black, and so this property remains satisfied.

    void insert_case2(struct node *n)
    {
        if (n->parent->color == BLACK)
            return; /* Tree is still valid */
        else
            insert_case3(n);
    }

Note: In the following cases it can be assumed that N has a grandparent node G, because its parent P is red, and if it were the root, it would be black. Thus, N also has an uncle node U, although it may be a leaf in cases 4 and 5.

    /* Case 3: parent P and uncle U are both red. */
    void insert_case3(struct node *n)
    {
        struct node *u = uncle(n), *g;

        if ((u != NULL) && (u->color == RED)) {
            n->parent->color = BLACK;
            u->color = BLACK;
            g = grandparent(n);
            g->color = RED;
            insert_case1(g);
        } else {
            insert_case4(n);
        }
    }

Note: In the remaining cases, it is assumed that the parent node P is the left child of its parent. If it is the right child, left and right should be reversed throughout cases 4 and 5. The code samples take care of this.

    /* Case 4: P is red, U is black, and N is an "inner" grandchild
     * (right child of a left child, or left child of a right child). */
    void insert_case4(struct node *n)
    {
        struct node *g = grandparent(n);

        if ((n == n->parent->right) && (n->parent == g->left)) {
            rotate_left(n->parent);
            /*
             * rotate_left can be the below because of already having
             * *g = grandparent(n)
             *
             * struct node *saved_p=g->left, *saved_left_n=n->left;
             * g->left=n;
             * n->left=saved_p;
             * saved_p->right=saved_left_n;
             *
             * and modify the parent's nodes properly
             */
            n = n->left;
        } else if ((n == n->parent->left) && (n->parent == g->right)) {
            rotate_right(n->parent);
            /*
             * rotate_right can be the below to take advantage of already
             * having *g = grandparent(n)
             *
             * struct node *saved_p=g->right, *saved_right_n=n->right;
             * g->right=n;
             * n->right=saved_p;
             * saved_p->left=saved_right_n;
             */
            n = n->right;
        }
        insert_case5(n);
    }

    /* Case 5: P is red, U is black, and N is an "outer" grandchild
     * (left child of a left child, or right child of a right child). */
    void insert_case5(struct node *n)
    {
        struct node *g = grandparent(n);

        n->parent->color = BLACK;
        g->color = RED;
        if (n == n->parent->left)
            rotate_right(g);
        else
            rotate_left(g);
    }

Note that inserting is actually in-place, since all the calls above use tail recursion.

In the algorithm above, all cases are chained in order, except in insert case 3 where it can recurse to case 1 back to the grandparent node: this is the only case where an iterative implementation will effectively loop. Because the problem of repair is escalated to the next higher level but one, it takes at most h/2 iterations to repair the tree (where h is the height of the tree). Because the probability for escalation decreases exponentially with each iteration, the average insertion cost is constant.

Mehlhorn & Sanders (2008) point out: “AVL trees do not support constant amortized update costs”, but red–black trees do.[21]

Removal

In a regular binary search tree when deleting a node with two non-leaf children, we find either the maximum element in its left subtree (which is the in-order predecessor) or the minimum element in its right subtree (which is the in-order successor) and move its value into the node being deleted. We then delete the node we copied the value from, which must have fewer than two non-leaf children. (Non-leaf children, rather than all children, are specified here because unlike normal binary search trees, red–black trees can have leaf nodes anywhere, so that all nodes are either internal nodes with two children or leaf nodes with, by definition, zero children. In effect, internal nodes having two leaf children in a red–black tree are like the leaf nodes in a regular binary search tree.) Because merely copying a value does not violate any red–black properties, this reduces to the problem of deleting a node with at most one non-leaf child. Once we have solved that problem, the solution applies equally to the case where the node we originally want to delete has at most one non-leaf child as to the case just considered where it has two non-leaf children.

Therefore, for the remainder of this discussion we address the deletion of a node with at most one non-leaf child. We use the label M to denote the node to be deleted; C will denote a selected child of M, which we will also call “its child”. If M does have a non-leaf child, call that its child, C; otherwise, choose either leaf as its child, C.

If M is a red node, we simply replace it with its child C, which must be black by property 4. (This can only occur when M has two leaf children, because if the red node M had a black non-leaf child on one side but just a leaf child on the other side, then the count of black nodes on both sides would be different, thus the tree would violate property 5.) All paths through the deleted node will simply pass through one fewer red node, and both the deleted node’s parent and child must be black, so property 3 (all leaves are black) and property 4 (both children of every red node are black) still hold.

Another simple case is when M is black and C is red.
Simply removing a black node could break properties 4 (“Both children of every red node are black”) and 5 (“All paths from any given node to its leaf nodes contain the same number of black nodes”), but if we repaint C black, both of these properties are preserved.

The complex case is when both M and C are black. (This can only occur when deleting a black node which has two leaf children, because if the black node M had a black non-leaf child on one side but just a leaf child on the other side, then the count of black nodes on both sides would be different, thus the tree would have been an invalid red–black tree by violation of property 5.) We begin by replacing M with its child C. We will relabel this child C (in its new position) N, and its sibling (its new parent’s other child) S. (S was previously the sibling of M.) In the diagrams below, we will also use P for N's new parent (M's old parent), SL for S's left child, and SR for S's right child (S cannot be a leaf because if M and C were black, then P's one subtree which included M counted two black-height and thus P's other subtree which includes S must also count two black-height, which cannot be the case if S is a leaf node).

Notes

1. The label N will be used to denote the current node (colored black). In the diagrams N carries a blue contour. At the beginning, this is the replacement node and a leaf, but the entire procedure may also be applied recursively to other nodes (see case 3). In between some cases, the roles and labels of the nodes are exchanged, but in each case, every label continues to represent the same node it represented at the beginning of the case.
2. If a node in the right (target) half of a diagram carries a blue contour it will become the current node in the next iteration and there the other nodes will be newly assigned relative to it. Any color shown in the diagram is either assumed in its case or implied by those assumptions. White represents an arbitrary color (either red or black), but the same in both halves of the diagram.
3. A numbered triangle represents a subtree of unspecified depth. A black circle atop a triangle means that the black-height of that subtree is greater by one compared to a subtree without this circle.

We will find the sibling using this function:

    struct node *sibling(struct node *n)
    {
        if ((n == NULL) || (n->parent == NULL))
            return NULL; // no parent means no sibling
        if (n == n->parent->left)
            return n->parent->right;
        else
            return n->parent->left;
    }

Note: In order that the tree remains well-defined, we need that every null leaf remains a leaf after all transformations (that it will not have any children). If the node we are deleting has a non-leaf (non-null) child N, it is easy to see that the property is satisfied. If, on the other hand, N would be a null leaf, it can be verified from the diagrams (or code) for all the cases that the property is satisfied as well.

We can perform the steps outlined above with the following code, where the function replace_node substitutes child into n’s place in the tree. For convenience, code in this section will assume that null leaves are represented by actual node objects rather than NULL (the code in the Insertion section works with either representation).

    void delete_one_child(struct node *n)
    {
        /*
         * Precondition: n has at most one non-leaf child.
         */
        struct node *child = is_leaf(n->right) ? n->left : n->right;

        replace_node(n, child);
        if (n->color == BLACK) {
            if (child->color == RED)
                child->color = BLACK;
            else
                delete_case1(child);
        }
        free(n);
    }

Note: If N is a null leaf and we do not want to represent null leaves as actual node objects, we can modify the algorithm by first calling delete_case1() on its parent (the node that we delete, n in the code above) and deleting it afterwards.
We do this if the parent is black (red is trivial), so it behaves in the same way as a null leaf (and is sometimes called a 'phantom' leaf). And we can safely delete it at the end as n will remain a leaf after all operations, as shown above. In addition, the sibling tests in cases 2 and 3 require updating as it is no longer true that the sibling will have children represented as objects. If both N and its original parent are black, then deleting this original parent causes paths which proceed through N to have one fewer black node than paths that do not. As this violates property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes), the tree must be rebalanced. There are several cases to consider: (s->left->color == BLACK) && (s->right->color == RED)) {/* this last test is trivial too due to cases 2-4. */ s->color = RED; s->right->color = BLACK; rotate_left(s); } } delete_case6(n); } void delete_case6(struct node *n) { struct node *s = sibling(n); s->color = n->parent->color; n->parent->color = BLACK; if (n == n->parent->left) { s->right->color = BLACK; rotate_left(n->parent); } else { s->left->color = BLACK; rotate_right(n->parent); } } Again, the function calls all use tail recursion, so the algorithm is in-place. In the algorithm above, all cases are chained in order, except in delete case 3 where it can recurse to case 1 back to the parent node: this is the only case where an iterative implementation will effectively loop. No more than h loops back to case 1 will occur (where h is the height of the tree). And because the probability for escalation Case 1: N is the new root. In this case, we are done. We decreases exponentially with each iteration the average removed one black node from every path, and the new removal cost is constant. root is black, so the properties are preserved. 
Additionally, no tail recursion ever occurs on a child node, void delete_case1(struct node *n) { if (n->parent != so the tail recursion loop can only move from a child back NULL) delete_case2(n); } to its successive ancestors. If a rotation occurs in case 2 (which is the only possibility of rotation within the loop of cases 1–3), then the parent of the node N becomes red Note: In cases 2, 5, and 6, we assume N is after the rotation and we will exit the loop. Therefore, the left child of its parent P. If it is the right at most one rotation will occur within this loop. Since no child, left and right should be reversed throughmore than two additional rotations will occur after exiting out these three cases. Again, the code examthe loop, at most three rotations occur in total. ples take both cases into account. void delete_case2(struct node *n) { struct node *s = sibling(n); if (s->color == RED) { n->parent->color = RED; s->color = BLACK; if (n == n->parent->left) rotate_left(n->parent); else rotate_right(n->parent); } delete_case3(n); } void delete_case3(struct node *n) { struct node *s = sibling(n); if ((n->parent->color == BLACK) && (s->color == BLACK) && (s->left->color == BLACK) && (s->right->color == BLACK)) { s->color = RED; delete_case1(n->parent); } else delete_case4(n); } void delete_case4(struct node *n) { struct node *s = sibling(n); if ((n->parent->color == RED) && (s->color == BLACK) && (s->left->color == BLACK) && (s->right->color == BLACK)) { s->color = RED; n->parent->color = BLACK; } else delete_case5(n); } void delete_case5(struct node *n) { struct node *s = sibling(n); if (s->color == BLACK) { /* this if statement is trivial, due to case 2 (even though case 2 changed the sibling to a sibling’s child, the sibling’s child can't be red, since no red parent can have a red child). */ /* the following statements just force the red to be on the left of the left of the parent, or right of the right, so case six will rotate correctly. 
*/ if ((n == n->parent->left) && (s->right->color == BLACK) && (s->left->color == RED)) { /* this last test is trivial too due to cases 2-4. */ s->color = RED; s->left->color = BLACK; rotate_right(s); } else if ((n == n->parent->right) && 6.8.7 Proof of asymptotic bounds A red black tree which contains n internal nodes has a height of O(log n). Definitions: • h(v) = height of subtree rooted at node v • bh(v) = the number of black nodes from v to any leaf in the subtree, not counting v if it is black - called the black-height Lemma: A subtree rooted at node v has at least 2bh(v) −1 internal nodes. Proof of Lemma (by induction height): Basis: h(v) = 0 If v has a height of zero then it must be null, therefore bh(v) = 0. So: 2bh(v) − 1 = 20 − 1 = 1 − 1 = 0 Inductive Step: v such that h(v) = k, has at least 2bh(v) −1 internal nodes implies that v ′ such that h( v ′ ) = k+1 has ′ at least 2bh(v ) − 1 internal nodes. 6.8. RED–BLACK TREE 187 Since v ′ has h( v ′ ) > 0 it is an internal node. As such it has two children each of which have a black-height of either bh( v ′ ) or bh( v ′ )−1 (depending on whether the child is red or black, respectively). By the inductive hypothesis ′ each child has at least 2bh(v )−1 − 1 internal nodes, so v ′ has at least: ′ ′ ′ 2bh(v )−1 − 1 + 2bh(v )−1 − 1 + 1 = 2bh(v ) − 1 • AA tree, a variation of the red-black tree • AVL tree • B-tree (2-3 tree, 2-3-4 tree, B+ tree, B*-tree, UBtree) • Scapegoat tree • Splay tree • T-tree internal nodes. • WAVL tree Using this lemma we can now show that the height of the tree is logarithmic. Since at least half of the nodes on any path from the root to a leaf are black (property 4 of 6.8.11 References a red–black tree), the black-height of the root is at least [1] James Paton. “Red-Black Trees”. h(root)/2. By the lemma we get: [2] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). “Red–Black Trees”. h(root) h(root) Introduction to Algorithms (second ed.). MIT Press. pp. 
$$n \geq 2^{h(\text{root})/2} - 1 \leftrightarrow \log_2(n+1) \geq \frac{h(\text{root})}{2} \leftrightarrow h(\text{root}) \leq 2\log_2(n+1).$$

Therefore, the height of the root is O(log n).

6.8.8 Parallel algorithms

Parallel algorithms for constructing red–black trees from sorted lists of items can run in constant time or O(log log n) time, depending on the computer model, if the number of processors available is asymptotically proportional to the number n of items where n→∞. Fast search, insertion, and deletion parallel algorithms are also known.[22]

6.8.9 Popular Culture

A red–black tree was referenced correctly in an episode of Missing (Canadian TV series),[23] as noted by Robert Sedgewick in one of his lectures:[24]

Jess: "It was the red door again."
Pollock: "I thought the red door was the storage container."
Jess: "But it wasn't red anymore, it was black."
Antonio: "So red turning to black means what?"
Pollock: "Budget deficits, red ink, black ink."
Antonio: "It could be from a binary search tree. The red-black tree tracks every simple path from a node to a descendant leaf that has the same number of black nodes."
Jess: "Does that help you with the ladies?"

6.8.10 See also

• List of data structures
• Tree data structure
• Tree rotation

273–301. ISBN 0-262-03293-7.

[3] John Morris. "Red–Black Trees".

[4] Rudolf Bayer (1972). "Symmetric binary B-Trees: Data structure and maintenance algorithms". Acta Informatica. 1 (4): 290–306. doi:10.1007/BF00289509.

[5] Drozdek, Adam. Data Structures and Algorithms in Java (2 ed.). Sams Publishing. p. 323. ISBN 0534376681.

[6] Leonidas J. Guibas and Robert Sedgewick (1978). "A Dichromatic Framework for Balanced Trees". Proceedings of the 19th Annual Symposium on Foundations of Computer Science. pp. 8–21. doi:10.1109/SFCS.1978.3.

[7] "Red Black Trees". eternallyconfuzzled.com. Retrieved 2015-09-02.

[8] Robert Sedgewick (2012). Red-Black BSTs. Coursera. A lot of people ask why did we use the name red–black. Well, we invented this data structure, this way of looking at balanced trees, at Xerox PARC which was the home of the personal computer and many other innovations that we live with today entering[sic] graphic user interfaces, ethernet and object-oriented programmings[sic] and many other things. But one of the things that was invented there was laser printing and we were very excited to have nearby color laser printer that could print things out in color and out of the colors the red looked the best. So, that's why we picked the color red to distinguish red links, the types of links, in three nodes. So, that's an answer to the question for people that have been asking.

[9] "Where does the term "Red/Black Tree" come from?". programmers.stackexchange.com. Retrieved 2015-09-02.

[10] Andersson, Arne (1993-08-11). Dehne, Frank; Sack, Jörg-Rüdiger; Santoro, Nicola; Whitesides, Sue, eds. "Balanced search trees made simple" (PDF). Algorithms and Data Structures (Proceedings). Lecture Notes in Computer Science. Springer-Verlag Berlin Heidelberg. 709: 60–71. doi:10.1007/3-540-57155-8_236. ISBN 978-3-540-57155-1. Archived from the original on 2000-03-17.

[11] Okasaki, Chris (1999-01-01). "Red-black trees in a functional setting" (PS). Journal of Functional Programming. 9 (4): 471–477. doi:10.1017/S0956796899003494. ISSN 1469-7653.

[12] Sedgewick, Robert (1983). Algorithms (1st ed.). Addison-Wesley. ISBN 0-201-06672-6.

[13] RedBlackBST code in Java

[14] Sedgewick, Robert (2008). "Left-leaning Red-Black Trees" (PDF).

[15]

[16] Cormen, Thomas; Leiserson, Charles; Rivest, Ronald; Stein, Clifford (2009). "13". Introduction to Algorithms (3rd ed.). MIT Press. pp. 308–309. ISBN 978-0-262-03384-8.

[17] Mehlhorn, Kurt; Sanders, Peter (2008). Algorithms and Data Structures: The Basic Toolbox (PDF). Springer, Berlin/Heidelberg. pp. 154–165. doi:10.1007/978-3-540-77978-0. ISBN 978-3-540-77977-3. p. 155.

[18] http://www.cs.princeton.edu/~rs/talks/LLRB/RedBlack.pdf

[19] http://www.cs.princeton.edu/courses/archive/fall08/cos226/lectures/10BalancedTrees-2x2.pdf

[20] "How does a HashMap work in JAVA". coding-geek.com.

[21] Mehlhorn & Sanders 2008, pp. 165, 158

[22] Park, Heejin; Park, Kunsoo (2001). "Parallel algorithms for red–black trees". Theoretical computer science. Elsevier. 262 (1–2): 415–435. doi:10.1016/S0304-3975(00)00287-5. Our parallel algorithm for constructing a red–black tree from a sorted list of n items runs in O(1) time with n processors on the CRCW PRAM and runs in O(log log n) time with n / log log n processors on the EREW PRAM.

[23] Missing (Canadian TV series). A, W Network (Canada); Lifetime (United States).

[24] Robert Sedgewick (2012). B-Trees. Coursera. 10:07 minutes in. So not only is there some excitement in that dialogue but it's also technically correct which you don't often find with math in popular culture of computer science. A red black tree tracks every simple path from a node to a descendant leaf with the same number of black nodes they got that right.

6.8.12 Further reading

• Mathworld: Red–Black Tree
• San Diego State University: CS 660: Red–Black tree notes, by Roger Whitney
• Pfaff, Ben (June 2004). "Performance Analysis of BSTs in System Software" (PDF). Stanford University.
• Sedgewick, Robert; Wayne, Kevin (2011). Algorithms (4th ed.). Addison-Wesley Professional. ISBN 978-0-321-57351-3.

6.8.13 External links

• A complete and working implementation in C
• Red–Black Tree Demonstration
• OCW MIT Lecture by Prof. Erik Demaine on Red Black Trees
• Binary Search Tree Insertion Visualization on YouTube: visualization of random and pre-sorted data insertions, in elementary binary search trees, and left-leaning red–black trees
• An intrusive red-black tree written in C++
• Red-black BSTs in 3.3 Balanced Search Trees
• Red–black BST Demo

6.9 WAVL tree

In computer science, a WAVL tree or weak AVL tree is a self-balancing binary search tree. WAVL trees are named after AVL trees, another type of balanced search tree, and are closely related both to AVL trees and red–black trees, which all fall into a common framework of rank balanced trees. Like other balanced binary search trees, WAVL trees can handle insertion, deletion, and search operations in time O(log n) per operation.[1][2]

WAVL trees are designed to combine some of the best properties of both AVL trees and red–black trees. One advantage of AVL trees over red–black trees is that they are more balanced: they have height at most log_φ n ≈ 1.44 log2 n (for a tree with n data items, where φ is the golden ratio), while red–black trees have larger maximum height, 2 log2 n. If a WAVL tree is created using only insertions, without deletions, then it has the same small height bound that an AVL tree has. On the other hand, red–black trees have the advantage over AVL trees that they perform less restructuring of their trees. In AVL trees, each deletion may require a logarithmic number of tree rotation operations, while red–black trees have simpler deletion operations that use only a constant number of tree rotations. WAVL trees, like red–black trees, use only a constant number of tree rotations, and the constant is even better than for red–black trees.[1][2]

WAVL trees were introduced by Haeupler, Sen & Tarjan (2015). The same authors also provided a common view of AVL trees, WAVL trees, and red–black trees as all being a type of rank-balanced tree.[2]

6.9.1 Definition

As with binary search trees more generally, a WAVL tree consists of a collection of nodes, of two types: internal nodes and external nodes. An internal node stores a data item, and is linked to its parent (except for a designated root node that has no parent) and to exactly two children in the tree, the left child and the right child. An external node carries no data, and has a link only to its parent in the tree.
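As a rough illustration of the two node types just described, the layout might be sketched in C++ as follows. This is only a sketch under stated assumptions: the names `Node`, `make_external`, and `make_internal` are hypothetical and do not come from the sources cited here, and external nodes are modelled as real objects with an empty item, whereas practical implementations often represent them with null pointers instead.

```cpp
#include <cassert>
#include <memory>
#include <optional>

// Hypothetical node type (not from the cited sources).
// Internal node: holds an item and exactly two children.
// External node: holds no item and no children, only a parent link.
template <typename T>
struct Node {
    std::optional<T> item;        // empty for external nodes
    Node* parent = nullptr;       // null only for the root
    std::unique_ptr<Node> left;   // set for internal nodes only
    std::unique_ptr<Node> right;  // set for internal nodes only

    bool is_external() const { return !item.has_value(); }
};

// Create an external node whose only link is to its parent.
template <typename T>
std::unique_ptr<Node<T>> make_external(Node<T>* parent) {
    auto n = std::make_unique<Node<T>>();
    n->parent = parent;
    return n;
}

// Create an internal node holding `item`, with two external children.
template <typename T>
std::unique_ptr<Node<T>> make_internal(const T& item, Node<T>* parent = nullptr) {
    auto n = std::make_unique<Node<T>>();
    n->item = item;
    n->parent = parent;
    n->left = make_external<T>(n.get());
    n->right = make_external<T>(n.get());
    return n;
}
```

Calling `make_internal<int>(42)` under these assumptions yields a one-item tree: a root with no parent whose two children are external nodes pointing back to it.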
These nodes are arranged to form a binary tree, so that for any internal node x the parents of the left and right children of x are x itself. The external nodes form the leaves of the tree.[3] The data items are arranged in the tree in such a way that an inorder traversal of the tree lists the data items in sorted order.[4]

What distinguishes WAVL trees from other types of binary search tree is their use of ranks. These are numbers, stored with each node, that provide an approximation to the distance from the node to its farthest leaf descendant. The ranks are required to obey the following properties:[1][2]

• Every external node has rank 0.[5]
• If a non-root node has rank r, then the rank of its parent must be either r + 1 or r + 2.
• An internal node with two external children must have rank exactly 1.

6.9.2 Operations

Searching

Searching for a key k in a WAVL tree is much the same as in any balanced binary search tree data structure. One begins at the root of the tree, and then repeatedly compares k with the data item stored at each node on a path from the root, following the path to the left child of a node when k is smaller than the value at the node or instead following the path to the right child when k is larger than the value at the node. When a node with value equal to k is reached, or an external node is reached, the search stops.[6]

If the search stops at an internal node, the key k has been found. If instead, the search stops at an external node, then the position where k would be inserted (if it were inserted) has been found.[6]

Insertion

Insertion of a key k into a WAVL tree is performed by performing a search for the external node where the key should be added, replacing that node by an internal node with data item k and two external-node children, and then rebalancing the tree. The rebalancing step can be performed either top-down or bottom-up,[2] but the bottom-up version of rebalancing is the one that most closely matches AVL trees.[1][2]

In this rebalancing step, one assigns rank 1 to the newly created internal node, and then follows a path upward from each node to its parent, incrementing the rank of each parent node if necessary to make it greater than the new rank of its child, until one of three stopping conditions is reached.

• If the path of incremented ranks reaches the root of the tree, then the rebalancing procedure stops, without changing the structure of the tree.
• If the path of incremented ranks reaches a node whose parent's rank previously differed by two, and (after incrementing the rank of the node) now differs by one, then again the rebalancing procedure stops without changing the structure of the tree.
• If the procedure increases the rank of a node x, so that it becomes equal to the rank of the parent y of x, but the other child of y has a rank that is smaller by two (so that the rank of y cannot be increased) then again the rebalancing procedure stops. In this case, by performing at most two tree rotations, it is always possible to rearrange the tree nodes near x and y in such a way that the ranks obey the constraints of a WAVL tree, leaving the rank of the root of the rotated subtree unchanged.

Thus, overall, the insertion procedure consists of a search, the creation of a constant number of new nodes, a logarithmic number of rank changes, and a constant number of tree rotations.[1][2]

Deletion

As with binary search trees more broadly, deletion operations on an internal node x that has at least one external-node child may be performed directly, by removing x from the tree and reconnecting the other child of x to the parent of x. If, however, both children of a node x are internal nodes, then we may follow a path downward in the tree from x to the leftmost descendant of its right child, a node y that immediately follows x in the sorted ordering of the tree nodes. Then y has an external-node child (its left child). We may delete x by performing the same reconnection procedure at node y (effectively, deleting y instead of x) and then replacing the data item stored at x with the one that had been stored at y.[7]

In either case, after making this change to the tree structure, it is necessary to rebalance the tree and update its ranks. As in the case of an insertion, this may be done by following a path upwards in the tree and changing the ranks of the nodes along this path until one of three things happens: the root is reached and the tree is balanced, a node is reached whose rank does not need to be changed, and again the tree is balanced, or a node is reached whose rank cannot be changed. In this last case a constant number of tree rotations completes the rebalancing stage of the deletion process.[1][2]

Overall, as with the insertion procedure, a deletion consists of a search downward through the tree (to find the node to be deleted), a continuation of the search farther downward (to find a node with an external child), the removal of a constant number of nodes, a logarithmic number of rank changes, and a constant number of tree rotations.[1][2]

6.9.3 Computational complexity

Each search, insertion, or deletion in a WAVL tree involves following a single path in the tree and performing a constant number of steps for each node in the path. In a WAVL tree with n items that has only undergone insertions, the maximum path length is log_φ n ≈ 1.44 log2 n. If both insertions and deletions may have happened, the maximum path length is 2 log2 n. Therefore, in either case, the worst-case time for each search, insertion, or deletion in a WAVL tree with n data items is O(log n).

6.9.4 Related structures

WAVL trees are closely related to both AVL trees and red–black trees. Every AVL tree can have ranks assigned to its nodes in a way that makes it into a WAVL tree. And every WAVL tree can have its nodes colored red and black (and its ranks reassigned) in a way that makes it into a red–black tree. However, some WAVL trees do not come from AVL trees in this way and some red–black trees do not come from WAVL trees in this way.

AVL trees

An AVL tree is a kind of balanced binary search tree in which the two children of each internal node must have heights that differ by at most one.[8] The height of an external node is zero, and the height of any internal node is always one plus the maximum of the heights of its two children. Thus, the height function of an AVL tree obeys the constraints of a WAVL tree, and we may convert any AVL tree into a WAVL tree by using the height of each node as its rank.[1][2]

The key difference between an AVL tree and a WAVL tree arises when a node has two children with the same rank or height. In an AVL tree, if a node x has two children of the same height h as each other, then the height of x must be exactly h + 1. In contrast, in a WAVL tree, if a node x has two children of the same rank r as each other, then the rank of x can be either r + 1 or r + 2. This greater flexibility in ranks also leads to a greater flexibility in structures: some WAVL trees cannot be made into AVL trees even by modifying their ranks, because they include nodes whose children's heights differ by more than one.[2]

If a WAVL tree is created only using insertion operations, then its structure will be the same as the structure of an AVL tree created by the same insertion sequence, and its ranks will be the same as the ranks of the corresponding AVL tree. It is only through deletion operations that a WAVL tree can become different from an AVL tree. In particular this implies that a WAVL tree created only through insertions has height at most log_φ n ≈ 1.44 log2 n.[2]

Red–black trees

A red–black tree is a balanced binary search tree in which each node has a color (red or black), satisfying the following properties:

• External nodes are black.
• If an internal node is red, its two children are both black.
• All paths from the root to an external node have equal numbers of black nodes.

Red–black trees can equivalently be defined in terms of a system of ranks, stored at the nodes, satisfying the following requirements (different from the requirements for ranks in WAVL trees):

• The rank of an external node is always 0 and its parent's rank is always 1.
• The rank of any non-root node equals either its parent's rank or its parent's rank minus 1.
• No two consecutive edges on any root-leaf path have rank difference 0.

The equivalence between the color-based and rank-based definitions can be seen, in one direction, by coloring a node black if its parent has greater rank and red if its parent has equal rank. In the other direction, colors can be converted to ranks by making the rank of a black node equal to the number of black nodes on any path to an external node, and by making the rank of a red node equal to that of its parent.[9]

The ranks of the nodes in a WAVL tree can be converted to a system of ranks of nodes, obeying the requirements for red–black trees, by dividing each rank by two and rounding up to the nearest integer.[10] Because of this conversion, for every WAVL tree there exists a valid red–black tree with the same structure. Because red–black trees have maximum height 2 log2 n, the same is true for WAVL trees.[1][2] However, there exist red–black trees that cannot be given a valid WAVL tree rank function.[2]

Despite the fact that, in terms of their tree structures, WAVL trees are special cases of red–black trees, their update operations are different. The tree rotations used in WAVL tree update operations may make changes that would not be permitted in a red–black tree, because they would in effect cause the recoloring of large subtrees of the red–black tree rather than making color changes only on a single path in the tree.[2] This allows WAVL trees to perform fewer tree rotations per deletion, in the worst case, than red–black trees.[1][2]

6.9.5 References

[1] Goodrich, Michael T.; Tamassia, Roberto (2015), "4.4 Weak AVL Trees", Algorithm Design and Applications, Wiley, pp. 130–138.

[2] Haeupler, Bernhard; Sen, Siddhartha; Tarjan, Robert E. (2015), "Rank-balanced trees" (PDF), ACM Transactions on Algorithms, 11 (4): Art. 30, 26, doi:10.1145/2689412, MR 3361215.

[3] Goodrich & Tamassia (2015), Section 2.3 Trees, pp. 68–83.

[4] Goodrich & Tamassia (2015), Chapter 3 Binary Search Trees, pp. 89–114.

[5] In this we follow Goodrich & Tamassia (2015). In the version described by Haeupler, Sen & Tarjan (2015), the external nodes have rank −1. This variation makes very little difference in the operations of WAVL trees, but it causes some minor changes to the formula for converting WAVL trees to red–black trees.

[6] Goodrich & Tamassia (2015), Section 3.1.2 Searching in a Binary Search Tree, pp. 95–96.

[7] Goodrich & Tamassia (2015), Section 3.1.4 Deletion in a Binary Search Tree, pp. 98–99.

[8] Goodrich & Tamassia (2015), Section 4.2 AVL Trees, pp. 120–125.

[9] Goodrich & Tamassia (2015), Section 4.3 Red–black Trees, pp. 126–129.

[10] In Haeupler, Sen & Tarjan (2015) the conversion is done by rounding down, because the ranks of external nodes are −1 rather than 0. Goodrich & Tamassia (2015) give a formula that also rounds down, but because they use rank 0 for external nodes their formula incorrectly assigns red–black rank 0 to internal nodes with WAVL rank 1.

6.10 Scapegoat tree

In computer science, a scapegoat tree is a self-balancing binary search tree, invented by Arne Andersson[1] and again by Igal Galperin and Ronald L. Rivest.[2] It provides worst-case O(log n) lookup time, and O(log n) amortized insertion and deletion time.

Unlike most other self-balancing binary search trees that provide worst case O(log n) lookup time, scapegoat trees have no additional per-node memory overhead compared to a regular binary search tree: a node stores only a key and two pointers to the child nodes. This makes scapegoat trees easier to implement and, due to data structure alignment, can reduce node overhead by up to one-third.

6.10.1 Theory

A binary search tree is said to be weight-balanced if half the nodes are on the left of the root, and half on the right. An α-weight-balanced node is defined as meeting a relaxed weight balance criterion:

    size(left) <= α*size(node)
    size(right) <= α*size(node)

Where size can be defined recursively as:

    function size(node)
        if node = nil
            return 0
        else
            return size(node->left) + size(node->right) + 1
    end

An α of 1 therefore would describe a linked list as balanced, whereas an α of 0.5 would only match almost complete binary trees.

A binary search tree that is α-weight-balanced must also be α-height-balanced, that is

    height(tree) <= log_{1/α}(NodeCount) + 1

Scapegoat trees are not guaranteed to keep α-weight-balance at all times, but are always loosely α-height-balanced in that

    height(scapegoat tree) <= log_{1/α}(NodeCount) + 1

This makes scapegoat trees similar to red–black trees in that they both have restrictions on their height. They differ greatly though in their implementations of determining where the rotations (or in the case of scapegoat trees, rebalances) take place. Whereas red–black trees store additional 'color' information in each node to determine the location, scapegoat trees find a scapegoat which isn't α-weight-balanced to perform the rebalance operation on. This is loosely similar to AVL trees, in that the actual rotations depend on 'balances' of nodes, but the means of determining the balance differs greatly. Since AVL trees check the balance value on every insertion/deletion, it is typically stored in each node; scapegoat trees are able to calculate it only as needed, which is only when a scapegoat needs to be found.

Unlike most other self-balancing search trees, scapegoat trees are entirely flexible as to their balancing. They support any α such that 0.5 < α < 1. A high α value results in fewer balances, making insertion quicker but lookups and deletions slower, and vice versa for a low α. Therefore in practical applications, an α can be chosen depending on how frequently these actions should be performed.

6.10.2 Operations

Insertion

Insertion is implemented with the same basic ideas as an unbalanced binary search tree, however with a few significant changes.

When finding the insertion point, the depth of the new node must also be recorded. This is implemented via a simple counter that gets incremented during each iteration of the lookup, effectively counting the number of edges between the root and the inserted node. If this node violates the α-height-balance property (defined above), a rebalance is required.

To rebalance, an entire subtree rooted at a scapegoat undergoes a balancing operation. The scapegoat is defined as being an ancestor of the inserted node which isn't α-weight-balanced. There will always be at least one such ancestor. Rebalancing any of them will restore the α-height-balanced property.

One way of finding a scapegoat is to climb from the new node back up to the root and select the first node that isn't α-weight-balanced.

Climbing back up to the root requires O(log n) storage space, usually allocated on the stack, or parent pointers. This can actually be avoided by pointing each child at its parent as you go down, and repairing on the walk back up.

To determine whether a potential node is a viable scapegoat, we need to check its α-weight-balanced property. To do this we can go back to the definition:

    size(left) <= α*size(node)
    size(right) <= α*size(node)

However a large optimisation can be made by realising that we already know two of the three sizes, leaving only the third having to be calculated.

Consider the following example to demonstrate this. Assuming that we're climbing back up to the root:

    size(parent) = size(node) + size(sibling) + 1

But as:

    size(inserted node) = 1.

The case is trivialized down to:

    size[x+1] = size[x] + size(sibling) + 1

Where x = this node, x + 1 = parent and size(sibling) is the only function call actually required.

Once the scapegoat is found, the subtree rooted at the scapegoat is completely rebuilt to be perfectly balanced.[2] This can be done in O(n) time by traversing the nodes of the subtree to find their values in sorted order and recursively choosing the median as the root of the subtree.

As rebalance operations take O(n) time (dependent on the number of nodes of the subtree), insertion has a worst-case performance of O(n) time. However, because these worst-case scenarios are spread out, insertion takes O(log n) amortized time.

Sketch of proof for cost of insertion

Define the Imbalance of a node v to be the absolute value of the difference in size between its left node and right node minus 1, or 0, whichever is greater. In other words:

    I(v) = max(|left(v) − right(v)| − 1, 0)

Immediately after rebuilding a subtree rooted at v, I(v) = 0.

Lemma: Immediately before rebuilding the subtree rooted at v,

    I(v) = Ω(|v|)

(Ω is Big Omega notation.)

Proof of lemma:

Let v0 be the root of a subtree immediately after rebuilding. h(v0) = log(|v0| + 1). If there are Ω(|v0|) degenerate insertions (that is, where each inserted node increases the height by 1), then

    I(v) = Ω(|v0|),
    h(v) = h(v0) + Ω(|v0|) and
    log(|v|) ≤ log(|v0| + 1) + 1.

Since I(v) = Ω(|v|) before rebuilding, there were Ω(|v|) insertions into the subtree rooted at v that did not result in rebuilding. Each of these insertions can be performed in O(log n) time. The final insertion that causes rebuilding costs O(|v|). Using aggregate analysis it becomes clear that the amortized cost of an insertion is O(log n):

$$\frac{\Omega(|v|)\,O(\log n) + O(|v|)}{\Omega(|v|)} = O(\log n)$$

Deletion

Scapegoat trees are unusual in that deletion is easier than insertion. To enable deletion, scapegoat trees need to store an additional value with the tree data structure. This property, which we will call MaxNodeCount, simply represents the highest achieved NodeCount. It is set to NodeCount whenever the entire tree is rebalanced, and after insertion is set to max(MaxNodeCount, NodeCount).

To perform a deletion, we simply remove the node as you would in a simple binary search tree, but if

    NodeCount <= α*MaxNodeCount

then we rebalance the entire tree about the root, remembering to set MaxNodeCount to NodeCount.

This gives deletion its worst-case performance of O(n) time; however, it is amortized to O(log n) average time.

Sketch of proof for cost of deletion

Suppose the scapegoat tree has n elements and has just been rebuilt (in other words, it is a complete binary tree). At most n/2 − 1 deletions can be performed before the tree must be rebuilt. Each of these deletions takes O(log n) time (the amount of time to search for the element and flag it as deleted). The (n/2)-th deletion causes the tree to be rebuilt and takes O(log n) + O(n) (or just O(n)) time. Using aggregate analysis it becomes clear that the amortized cost of a deletion is O(log n):

$$\frac{\sum_{1}^{n/2} O(\log n) + O(n)}{n/2} = \frac{\frac{n}{2}\,O(\log n) + O(n)}{n/2} = O(\log n)$$

Lookup

Lookup is not modified from a standard binary search tree, and has a worst-case time of O(log n). This is in contrast to splay trees which have a worst-case time of O(n). The reduced node memory overhead compared to other self-balancing binary search trees can further improve locality of reference and caching.

6.10.3 See also

• Splay tree
• Trees
• Tree rotation
• AVL tree
• B-tree
• T-tree
• List of data structures

6.10.4 References

[1] Andersson, Arne (1989). Improving partial rebuilding by using simple balance criteria. Proc. Workshop on Algorithms and Data Structures. Journal of Algorithms. Springer-Verlag. pp. 393–402. doi:10.1007/3-540-51542-9_33.

[2] Galperin, Igal; Rivest, Ronald L. (1993). "Scapegoat trees". Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms: 165–174.

6.10.5 External links

• Scapegoat Tree Applet by Kubo Kovac
• Scapegoat Trees: Galperin and Rivest's paper describing scapegoat trees
• On Consulting a Set of Experts and Searching (full version paper)
• Open Data Structures - Chapter 8 - Scapegoat Trees

6.11 Splay tree

A splay tree is a self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again. It performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For many sequences of non-random operations, splay trees perform better than other search trees, even when the specific pattern of the sequence is unknown. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.[1]

All normal operations on a binary search tree are combined with one basic operation, called splaying. Splaying the tree for a certain element rearranges the tree so that the element is placed at the root of the tree. One way to do this is to first perform a standard binary tree search for the element in question, and then use tree rotations in a specific fashion to bring the element to the top. Alternatively, a top-down algorithm can combine the search and the tree reorganization into a single phase.

6.11.1 Advantages

Good performance for a splay tree depends on the fact that it is self-optimizing, in that frequently accessed nodes will move nearer to the root where they can be accessed more quickly. The worst-case height, though unlikely, is O(n), with the average being O(log n). Having frequently used nodes near the root is an advantage for many practical applications (also see Locality of reference), and is particularly useful for implementing caches and garbage collection algorithms.

Advantages include:

• Comparable performance: Average-case performance is as efficient as other trees.[2]
• Small memory footprint: Splay trees do not need to store any bookkeeping data.

6.11.2 Disadvantages

The most significant disadvantage of splay trees is that the height of a splay tree can be linear. For example, this will be the case after accessing all n elements in non-decreasing order. Since the height of a tree corresponds to the worst-case access time, this means that the actual cost of an operation can be high. However the amortized access cost of this worst case is logarithmic, O(log n). Also, the expected access cost can be reduced to O(log n) by using a randomized variant.[3]

The representation of splay trees can change even when they are accessed in a 'read-only' manner (i.e. by find operations). This complicates the use of such splay trees in a multi-threaded environment. Specifically, extra management is needed if multiple threads are allowed to perform find operations concurrently. This also makes them unsuitable for general use in purely functional programming, although they can be used in limited ways to implement priority queues even there.

6.11.3 Operations

Splaying

When a node x is accessed, a splay operation is performed on x to move it to the root. To perform a splay operation we carry out a sequence of splay steps, each of which moves x closer to the root. By performing a splay operation on the node of interest after every access, the recently accessed nodes are kept near the root and the tree remains roughly balanced, so that we achieve the desired amortized time bounds.

Each particular step depends on three factors:

• Whether x is the left or right child of its parent node, p,
• whether p is the root or not, and if not
• whether p is the left or right child of its parent, g (the grandparent of x).

It is important to remember to set gg (the great-grandparent of x) to now point to x after any splay operation. If gg is null, then x obviously is now the root and must be updated as such.

There are three types of splay steps, each of which has a left- and right-handed case. For the sake of brevity, only one of these two is shown for each type. These three types are:

Zig step: this step is done when p is the root. The tree is rotated on the edge between x and p. Zig steps exist to deal with the parity issue and will be done only as the last step in a splay operation and only when x has odd depth at the beginning of the operation.

[Figure: zig step on nodes p and x, with subtrees A, B and C shown before and after the rotation.]

Zig-zig step: this step is done when p is not the root and x and p are either both right children or are both left children. The picture below shows the case where x and p are both left children. The tree is rotated on the edge joining p with its parent g, then rotated on the edge joining x with p. Note that zig-zig steps are the only thing that differentiate splay trees from the rotate to root method introduced by Allen and Munro[4] prior to the introduction of splay trees.

Zig-zag step: this step is done when p is not the root and x is a right child and p is a left child or vice versa. The tree is rotated on the edge between p and x, and then rotated on the resulting edge between x and g.

Join

Given two trees S and T such that all elements of S are smaller than the elements of T, the following steps can be used to join them to a single tree:

• Splay the largest item in S. Now this item is in the root of S and has a null right child.
• Set the right child of the new root to T.

Split

Given a tree and an element x, return two new trees: one containing all elements less than or equal to x and the other containing all elements greater than x. This can be done in the following way:

• Splay x. Now it is in the root so the tree to its left contains all elements smaller than x and the tree to its right contains all elements larger than x.
• Split the right subtree from the rest of the tree.

Insertion

To insert a value x into a splay tree:

• Insert x as with a normal binary search tree.
• When an item is inserted, a splay is performed.
• As a result, the newly inserted node x becomes the root of the tree.

ALTERNATIVE:

• Use the split operation to split the tree at the value of x to two sub-trees: S and T.
• Create a new tree in which x is the root, S is its left sub-tree and T its right sub-tree.

Deletion

To delete a node x, use the same method as with a binary search tree: if x has two children, swap its value with that of either the rightmost node of its left subtree (its in-order predecessor) or the leftmost node of its right subtree (its in-order successor). Then remove that node instead. In this way, deletion is reduced to the problem of removing a node with 0 or 1 children. Unlike a binary search tree, in a splay tree after deletion, we splay the parent of the removed node to the top of the tree.

ALTERNATIVE:

• The node to be deleted is first splayed, i.e. brought to the root of the tree and then deleted. This leaves the tree with two sub-trees.
• The two sub-trees are then joined using a "join" operation.

6.11.4 Implementation and variants

Splaying, as mentioned above, is performed during a second, bottom-up pass over the access path of a node. It is possible to record the access path during the first pass for use during the second, but that requires extra space during the access operation. Another alternative is to keep a parent pointer in every node, which avoids the need for extra space during access operations but may reduce overall time efficiency because of the need to update those pointers.[1]

Another method which can be used is based on the argument that we can restructure the tree on our way down the access path instead of making a second pass. This top-down splaying routine uses three sets of nodes: left tree, right tree and middle tree. The first two contain all items of the original tree known to be less than or greater than the current item respectively. The middle tree consists of the sub-tree rooted at the current node. These three sets are updated down the access path while keeping the splay operations in check. Another method, semisplaying, modifies the zig-zig case to reduce the amount of restructuring done in all operations.[1][5]

Below there is an implementation of splay trees in C++, which uses pointers to represent each node on the tree. This implementation is based on the bottom-up splaying version and uses the second method of deletion on a splay tree. Also, unlike the above definition, this C++ version does not splay the tree on finds; it only splays on insertions and deletions.

    #include <functional>

    #ifndef SPLAY_TREE
    #define SPLAY_TREE

    template< typename T, typename Comp = std::less< T > >
    class splay_tree {
    private:
      Comp comp;
      unsigned long p_size;

      struct node {
        node *left, *right;
        node *parent;
        T key;
        node( const T& init = T( ) ) : left( 0 ), right( 0 ), parent( 0 ), key( init ) { }
        // A node does not own its neighbours: erase( ) deletes exactly one
        // node, so the destructor must not delete left, right or parent.
        ~node( ) { }
      } *root;

      // Rotate the edge between x and its right child y, moving y up.
      void left_rotate( node *x ) {
        node *y = x->right;
        if( y ) {
          x->right = y->left;
          if( y->left ) y->left->parent = x;
          y->parent = x->parent;
        }
        if( !x->parent ) root = y;
        else if( x == x->parent->left ) x->parent->left = y;
        else x->parent->right = y;
        if( y ) y->left = x;
        x->parent = y;
      }

      // Rotate the edge between x and its left child y, moving y up.
      void right_rotate( node *x ) {
        node *y = x->left;
        if( y ) {
          x->left = y->right;
          if( y->right ) y->right->parent = x;
          y->parent = x->parent;
        }
        if( !x->parent ) root = y;
        else if( x == x->parent->left ) x->parent->left = y;
        else x->parent->right = y;
        if( y ) y->right = x;
        x->parent = y;
      }

      // Bottom-up splay: move x to the root with zig, zig-zig and zig-zag steps.
      void splay( node *x ) {
        while( x->parent ) {
          if( !x->parent->parent ) {
            // Zig step: the parent of x is the root.
            if( x->parent->left == x ) right_rotate( x->parent );
            else left_rotate( x->parent );
          } else if( x->parent->left == x && x->parent->parent->left == x->parent ) {
            // Zig-zig step: x and its parent are both left children.
            right_rotate( x->parent->parent );
            right_rotate( x->parent );
          } else if( x->parent->right == x && x->parent->parent->right == x->parent ) {
            // Zig-zig step: x and its parent are both right children.
            left_rotate( x->parent->parent );
            left_rotate( x->parent );
          } else if( x->parent->left == x && x->parent->parent->right == x->parent ) {
            // Zig-zag step: x is a left child, its parent a right child.
            right_rotate( x->parent );
            left_rotate( x->parent );
          } else {
            // Zig-zag step: x is a right child, its parent a left child.
            left_rotate( x->parent );
            right_rotate( x->parent );
          }
        }
      }

      // Put v (possibly null) where u currently hangs in the tree.
      void replace( node *u, node *v ) {
        if( !u->parent ) root = v;
        else if( u == u->parent->left ) u->parent->left = v;
        else u->parent->right = v;
        if( v ) v->parent = u->parent;
      }

      node* subtree_minimum( node *u ) {
        while( u->left ) u = u->left;
        return u;
      }

      node* subtree_maximum( node *u ) {
        while( u->right ) u = u->right;
        return u;
      }

    public:
      splay_tree( ) : p_size( 0 ), root( 0 ) { }

      void insert( const T &key ) {
        node *z = root;
        node *p = 0;
        while( z ) {
          p = z;
          if( comp( z->key, key ) ) z = z->right;
          else z = z->left;
        }
        z = new node( key );
        z->parent = p;
        if( !p ) root = z;
        else if( comp( p->key, z->key ) ) p->right = z;
        else p->left = z;
        splay( z );
        p_size++;
      }

      // Plain binary search; this version does not splay on finds.
      node* find( const T &key ) {
        node *z = root;
        while( z ) {
          if( comp( z->key, key ) ) z = z->right;
          else if( comp( key, z->key ) ) z = z->left;
          else return z;
        }
        return 0;
      }

      void erase( const T &key ) {
        node *z = find( key );
        if( !z ) return;
        splay( z );
        if( !z->left ) replace( z, z->right );
        else if( !z->right ) replace( z, z->left );
        else {
          node *y = subtree_minimum( z->right );
          if( y->parent != z ) {
            replace( y, y->right );
            y->right = z->right;
            y->right->parent = y;
          }
          replace( z, y );
          y->left = z->left;
          y->left->parent = y;
        }
        delete z;
        p_size--;
      }

      // minimum( ) and maximum( ) require a non-empty tree.
      const T& minimum( ) { return subtree_minimum( root )->key; }
      const T& maximum( ) { return subtree_maximum( root )->key; }
      bool empty( ) const { return root == 0; }
      unsigned long size( ) const { return p_size; }
    };

    #endif // SPLAY_TREE

6.11.5 Analysis

A simple amortized analysis of static splay trees can be carried out using the potential method. Define:

• size(r): the number of nodes in the sub-tree rooted at node r (including r).
• rank(r) = log2(size(r)).
• Φ = the sum of the ranks of all the nodes in the tree.

Φ will tend to be high for poorly balanced trees and low for well-balanced trees.

To apply the potential method, we first calculate ΔΦ, the change in the potential caused by a splay operation. We check each case separately. Denote by rank' the rank function after the operation. x, p and g are the nodes affected by the rotation operation (see figures above).

Zig step:

amortized-cost = cost + ΔΦ ≤ 3(rank'(x) − rank(x))

When summed over the entire splay operation, this telescopes to 3(rank(root) − rank(x)), which is O(log n). The Zig operation adds an amortized cost of 1, but there's at most one such operation.

So now we know that the total amortized time for a sequence of m operations is:

T_amortized(m) = O(m log n)

To go from the amortized time to the actual time, we must add the decrease in potential from the initial state before any operation is done (Φi) to the final state after all operations are completed (Φf).

$$\Phi_i - \Phi_f = \sum_x \left( \mathrm{rank}_i(x) - \mathrm{rank}_f(x) \right) = O(n \log n)$$

where the last bound comes from the fact that for every node x, the minimum rank is 0 and the maximum rank is log(n).

Now we can finally bound the actual time:
ΔΦ = rank'(p) - rank(p) + rank'(x) - rank(x) [since only p and x change ranks] = rank'(p) - rank(x) [since rank'(x)=rank(p)] ≤ rank'(x) - rank(x) [since rank'(p)<rank'(x)] Zig-Zig step: ΔΦ = rank'(g) - rank(g) + rank'(p) - rank(p) + rank'(x) - rank(x) = rank'(g) + rank'(p) - rank(p) - rank(x) [since rank'(x)=rank(g)] ≤ rank'(g) + rank'(x) - 2 rank(x) [since rank(x)<rank(p) and rank'(x)>rank'(p)] ≤ 3(rank'(x)-rank(x)) - 2 [due to the concavity of the log function] Zig-Zag step: Tactual (m) = O(m log n + n log n) Weighted analysis The above analysis can be generalized in the following way. • Assign to each node r a weight w(r). • Define size(r) = the sum of weights of nodes in the sub-tree rooted at node r (including r). • Define rank(r) and Φ exactly as above. The same analysis applies and the amortized cost of a splaying operation is again: ΔΦ = rank'(g) - rank(g) + rank'(p) - rank(p) + rank'(x) - rank(x) rank(root)−rank(x) = O(log W −log w(x)) = O(log ≤ rank'(g) + rank'(p) - 2 rank(x) [since rank'(x)=rank(g) and rank(x)<rank(p)] where W is the sum of all weights. ≤ 2(rank'(x)-rank(x)) - 2 [due to the concavity of the log function] The decrease from the initial to the final potential is bounded by: The amortized cost of any operation is ΔΦ plus the actual ∑ W cost. The actual cost of any zig-zig or zig-zag operation Φ − Φ ≤ log i f w(x) is 2 since there are two rotations to make. Hence: x∈tree W ) w(x) 6.11. SPLAY TREE since the maximum size of any single node is W and the minimum is w(x). Hence the actual time is bounded by: O( ∑ (log x∈sequence 6.11.6 ∑ W W (log )+ )) w(x) w(x) x∈tree Performance theorems 197 net potential drop is O (n log n).). This theorem is equivalent to splay trees having key-independent optimality.[1] Scanning Theorem Also known as the Sequential Access Theorem or the Queue theorem. 
Accessing the n elements of a splay tree in symmetric order takes O(n) time, regardless of the initial structure of the splay tree.[8] The tightest upper bound proven so far is 4.5n .[9] There are several theorems and conjectures regarding the 6.11.7 Dynamic optimality conjecture worst-case runtime for performing a sequence S of m acMain article: Optimal binary search tree cesses in a splay tree containing n elements. Balance Theorem The cost of performing the sequence S is O [m log n + n log n] (Proof: take a constant weight, e.g. w(x)=1 for every node x. Then W=n). This theorem implies that splay trees perform as well as static balanced binary search trees on sequences of at least n accesses.[1] Static Optimality Theorem Let qx be the number of times element x is accessed in S. If every element is accessed at once, then the cost [ least∑ ] of performm ing S is O m + x∈tree qx log qx (Proof: let w(x) = qx . Then W = m ). This theorem implies that splay trees perform as well as an optimum static binary search tree on sequences of at least n accesses. They spend less time on the more frequent items.[1] In addition to the proven performance guarantees for splay trees there is an unproven conjecture of great interest from the original Sleator and Tarjan paper. This conjecture is known as the dynamic optimality conjecture and it basically claims that splay trees perform as well as any other binary search tree algorithm up to a constant factor. Dynamic Optimality Conjecture:[1] Let A be any binary search tree algorithm that accesses an element x by traversing the path from the root to x at a cost of d(x) + 1 , and that between accesses can make any rotations in the tree at a cost of 1 per rotation. Let A(S) be the cost for A to perform the sequence S of accesses. Then the cost for a splay tree to perform the same accesses is O[n + A(S)] . 
Static Finger Theorem Assume that the items are numbered from 1 through n in ascend- There are several corollaries of the dynamic optimality ing order. Let f be any fixed element (the conjecture that remain unproven: 'finger'). Then the cost of performing S is] [ ∑ Traversal Conjecture:[1] Let T1 and T2 be O m + n log n + x∈sequence log(|x − f | + 1) two splay trees containing the same elements. (Proof: let w(x) = 1/(|x − f | + 1)2 . Then Let S be the sequence obtained by visiting W=O(1). The net potential drop is O (n log n) since the elements in T2 in preorder (i.e., depth first [1] the weight of any item is at least 1/n^2). search order). The total cost of performing the sequence S of accesses on T1 is O(n) . Dynamic Finger Theorem Assume that the 'finger' for each step accessing an element y is the element accessed in the previDeque Conjecture:[8][10][11] Let S be a seous[ step, x. The cost of performing ]S is quence of m double-ended queue operations ∑m (push, pop, inject, eject). Then the cost of perO m + n + x,y∈sequence log(|y − x| + 1) forming S on a splay tree is O(m + n) . .[6][7] Working Set Theorem At any time during the sequence, let t(x) be the number of distinct elements accessed before the previous time element [x was accessed. The cost of performing ]S ∑ is O m + n log n + x∈sequence log(t(x) + 1) Split Conjecture:[5] Let S be any permutation of the elements of the splay tree. Then the cost of deleting the elements in the order S is O(n) . (Proof: let w(x) = 1/(t(x) + 1)2 . Note that 6.11.8 Variants here the weights change during the sequence. However, the sequence of weights is still a permutation In order to reduce the number of restructuring operations, of 1, 1/4, 1/9, ..., 1/n^2. So as before W=O(1). The it is possible to replace the splaying with semi-splaying, 198 CHAPTER 6. SUCCESSORS AND NEIGHBORS in which an element is splayed only halfway towards the 6.11.11 References root.[1] • Albers, Susanne; Karpinski, Marek (2002). 
Another way to reduce restructuring is to do full splaying, “Randomized Splay Trees: Theoretical and Exbut only in some of the access operations - only when the perimental Results” (PDF). Information Processing access path is longer than a threshold, or only in the first Letters. 81: 213–221. doi:10.1016/s0020[1] m access operations. 0190(01)00230-7. 6.11.9 See also • Finger tree • Link/cut tree • Scapegoat tree • Zipper (data structure) • Trees • Tree rotation • AVL tree • B-tree • T-tree • List of data structures • Iacono’s working set structure • Geometry of binary search trees • Splaysort, a sorting algorithm using splay trees 6.11.10 Notes [1] Sleator & Tarjan 1985. [2] Goodrich, Tamassia & Goldwasser 2014. [3] Albers & Karpinski 2002. [4] Allen & Munro 1978. [5] Lucas 1991. [6] Cole et al. 2000. [7] Cole 2000. [8] Tarjan 1985. [9] Elmasry 2004. [10] Pettie 2008. [11] Sundar 1992. • Allen, Brian; Munro, Ian (1978). “Self-organizing search trees”. Journal of the ACM. 25 (4): 526–535. doi:10.1145/322092.322094. • Cole, Richard; Mishra, Bud; Schmidt, Jeanette; Siegel, Alan (2000). “On the Dynamic Finger Conjecture for Splay Trees. Part I: Splay Sorting log n-Block Sequences”. SIAM Journal on Computing. 30: 1–43. doi:10.1137/s0097539797326988. • Cole, Richard (2000). “On the Dynamic Finger Conjecture for Splay Trees. Part II: The Proof”. SIAM Journal on Computing. 30: 44–85. doi:10.1137/S009753979732699X. • Elmasry, Amr (2004), “On the sequential access theorem and Deque conjecture for splay trees”, Theoretical Computer Science, 314 (3): 459–466, doi:10.1016/j.tcs.2004.01.019 • Goodrich, Michael; Tamassia, Roberto; Goldwasser, Michael (2014). Data Structures and Algorithms in Java (6 ed.). Wiley. p. 506. ISBN 978-1118-77133-4. • Knuth, Donald (1997). The Art of Computer Programming. 3: Sorting and Searching (3rd ed.). Addison-Wesley. p. 478. ISBN 0-201-89685-0. • Lucas, Joan M. (1991). “On the Competitiveness of Splay Trees: Relations to the Union-Find Problem”. 
Online Algorithms. Series in Discrete Mathematics and Theoretical Computer Science. Center for Discrete Mathematics and Theoretical Computer Science. 7: 95–124.
• Pettie, Seth (2008), “Splay Trees, Davenport-Schinzel Sequences, and the Deque Conjecture”, Proc. 19th ACM-SIAM Symposium on Discrete Algorithms, 0707: 1115–1124, arXiv:0707.2160, Bibcode:2007arXiv0707.2160P
• Sleator, Daniel D.; Tarjan, Robert E. (1985). “Self-Adjusting Binary Search Trees” (PDF). Journal of the ACM. 32 (3): 652–686. doi:10.1145/3828.3835.
• Sundar, Rajamani (1992). “On the Deque conjecture for the splay algorithm”. Combinatorica. 12 (1): 95–124. doi:10.1007/BF01191208.
• Tarjan, Robert E. (1985). “Sequential access in splay trees takes linear time”. Combinatorica. 5 (4): 367–378. doi:10.1007/BF02579253.

6.11.12 External links

• NIST’s Dictionary of Algorithms and Data Structures: Splay Tree
• Implementations in C and Java (by Daniel Sleator)
• Pointers to splay tree visualizations
• Fast and efficient implementation of Splay trees
• Top-Down Splay Tree Java implementation
• Zipper Trees
• splay tree video

6.12 Tango tree

A tango tree is a type of binary search tree proposed by Erik D. Demaine, Dion Harmon, John Iacono, and Mihai Patrascu in 2004.[1] It is an online binary search tree that achieves an O(log log n) competitive ratio relative to the optimal offline binary search tree, while only using O(log log n) additional bits of memory per node. This improved upon the previous best known competitive ratio, which was O(log n).

6.12.1 Structure

Tango trees work by partitioning a binary search tree into a set of preferred paths, which are themselves stored in auxiliary trees (so the tango tree is represented as a tree of trees).

Reference Tree

To construct a tango tree, we simulate a complete binary search tree called the reference tree, which is simply a traditional binary search tree containing all the elements. This tree never shows up in the actual implementation, but is the conceptual basis behind the following pieces of a tango tree.

Preferred Paths

First, we define for each node its preferred child, which informally is the most-recently touched child by a traditional binary search tree lookup. More formally, consider a subtree T, rooted at p, with children l (left) and r (right). We set r as the preferred child of p if the most recently accessed node in T is in the subtree rooted at r, and l as the preferred child otherwise. Note that if the most recently accessed node of T is p itself, then l is the preferred child by definition.

A preferred path is defined by starting at the root and following the preferred children until reaching a leaf node. Removing the nodes on this path partitions the remainder of the tree into a number of subtrees, and we recurse on each subtree (forming a preferred path from its root, which partitions the subtree into more subtrees).

Auxiliary Trees

To represent a preferred path, we store its nodes in a balanced binary search tree, specifically a red-black tree. For each non-leaf node n in a preferred path P, it has a non-preferred child c, which is the root of a new auxiliary tree. We attach this other auxiliary tree’s root (c) to n in P, thus linking the auxiliary trees together. We also augment the auxiliary tree by storing at each node the minimum and maximum depth (depth in the reference tree, that is) of nodes in the subtree under that node.

6.12.2 Algorithm

Searching

To search for an element in the tango tree, we simply simulate searching the reference tree. We start by searching the preferred path connected to the root, which is simulated by searching the auxiliary tree corresponding to that preferred path. If the auxiliary tree doesn't contain the desired element, the search terminates on the parent of the root of the subtree containing the desired element (the beginning of another preferred path), so we simply proceed by searching the auxiliary tree for that preferred path, and so forth.

Updating

In order to maintain the structure of the tango tree (auxiliary trees correspond to preferred paths), we must do some updating work whenever preferred children change as a result of searches. When a preferred child changes, the top part of a preferred path becomes detached from the bottom part (which becomes its own preferred path) and reattached to another preferred path (which becomes the new bottom part). In order to do this efficiently, we'll define cut and join operations on our auxiliary trees.

Join Our join operation will combine two auxiliary trees as long as they have the property that the top node of one (in the reference tree) is a child of the bottom node of the other (essentially, that the corresponding preferred paths can be concatenated). This will work based on the concatenate operation of red-black trees, which combines two trees as long as they have the property that all elements of one are less than all elements of the other, and split, which does the reverse. In the reference tree, note that there exist two nodes in the top path such that a node is in the bottom path if and only if its key-value is between them. Now, to join the bottom path to the top path, we simply split the top path between those two nodes, then concatenate the two resulting auxiliary trees on either side of the bottom path’s auxiliary tree, and we have our final, joined auxiliary tree.

Cut Our cut operation will break a preferred path into two parts at a given node, a top part and a bottom part. More formally, it'll partition an auxiliary tree into two auxiliary trees, such that one contains all nodes at or above a certain depth in the reference tree, and the other contains all nodes below that depth. As in join, note that the top part has two nodes that bracket the bottom part. Thus, we can simply split on each of these two nodes to divide the path into three parts, then concatenate the two outer ones so we end up with two parts, the top and bottom, as desired.

6.12.3 Analysis

In order to bound the competitive ratio for tango trees, we must find a lower bound on the performance of the optimal offline tree that we use as a benchmark. Once we find an upper bound on the performance of the tango tree, we can divide them to bound the competitive ratio.

Interleave Bound

Main article: Interleave lower bound

To find a lower bound on the work done by the optimal offline binary search tree, we again use the notion of preferred children. When considering an access sequence (a sequence of searches), we keep track of how many times a reference tree node’s preferred child switches. The total number of switches (summed over all nodes) gives an asymptotic lower bound on the work done by any binary search tree algorithm on the given access sequence. This is called the interleave lower bound.[1]

Tango Tree

In order to connect this to tango trees, we will find an upper bound on the work done by the tango tree for a given access sequence. Our upper bound will be (k + 1)O(log log n), where k is the number of interleaves.

The total cost is divided into two parts, searching for the element, and updating the structure of the tango tree to maintain the proper invariants (switching preferred children and re-arranging preferred paths).

Searching To see that the searching (not updating) fits in this bound, simply note that every time an auxiliary tree search is unsuccessful and we have to move to the next auxiliary tree, that results in a preferred child switch (since the parent preferred path now switches directions to join the child preferred path). Since all auxiliary tree searches are unsuccessful except the last one (we stop once a search is successful, naturally), we search k + 1 auxiliary trees. Each search takes O(log log n), because an auxiliary tree’s size is bounded by log n, the height of the reference tree.

Updating The update cost fits within this bound as well, because we only have to perform one cut and one join for every visited auxiliary tree. A single cut or join operation takes only a constant number of searches, splits, and concatenates, each of which takes logarithmic time in the size of the auxiliary tree, so our update cost is (k + 1)O(log log n).

Competitive Ratio

Tango trees are O(log log n)-competitive, because the work done by the optimal offline binary search tree is at least linear in k (the total number of preferred child switches), and the work done by the tango tree is at most (k + 1)O(log log n).

6.12.4 See also

• Splay tree
• Optimal binary search tree
• Red-black tree
• Tree (data structure)

6.12.5 References

[1] Demaine, E. D.; Harmon, D.; Iacono, J.; Pătraşcu, M. (2007). “Dynamic Optimality—Almost”. SIAM Journal on Computing. 37 (1): 240. doi:10.1137/S0097539705447347.

6.13 Skip list

Figure: Inserting elements to skip list.

In computer science, a skip list is a data structure that allows fast search within an ordered sequence of elements. Fast search is made possible by maintaining a linked hierarchy of subsequences, with each successive subsequence skipping over fewer elements than the previous one. Searching starts in the sparsest subsequence until two consecutive elements have been found, one smaller and one larger than or equal to the element searched for. Via the linked hierarchy, these two elements link to elements of the next sparsest subsequence, where searching is continued until finally we are searching in the full sequence. The elements that are skipped over may be chosen probabilistically [2] or deterministically,[3] with the former being more common.
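As a rough illustration of the search just described, here is a minimal C++ sketch of a probabilistic skip list over int keys. The class name SkipList, the cap kMaxLevel = 16, the fixed random seed and the promotion probability of 1/2 are all assumptions made for this example; none of them come from the text, and this is a sketch rather than a reference implementation.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical minimal skip list of ints; all names here are illustrative.
class SkipList {
    static constexpr int kMaxLevel = 16;      // assumed cap on the number of layers

    struct Node {
        int key;
        std::vector<Node*> next;              // next[i] = successor in layer i
        Node(int k, int height)
            : key(k), next(static_cast<std::size_t>(height), nullptr) {}
    };

    Node head_{INT_MIN, kMaxLevel};           // sentinel present in every layer
    int levels_ = 1;                          // layers currently in use
    std::mt19937 rng_{12345};                 // fixed seed: repeatable runs

    // Each successful fair coin flip promotes the element one layer up.
    int randomHeight() {
        int h = 1;
        while (h < kMaxLevel && (rng_() & 1u)) ++h;
        return h;
    }

public:
    SkipList() = default;
    SkipList(const SkipList&) = delete;
    SkipList& operator=(const SkipList&) = delete;

    ~SkipList() {
        Node* n = head_.next[0];              // the bottom layer holds every node
        while (n) { Node* nx = n->next[0]; delete n; n = nx; }
    }

    // Start in the sparsest layer; move right while the next key is smaller
    // than the target, then drop one layer and repeat.
    bool contains(int key) const {
        const Node* cur = &head_;
        for (int lvl = levels_ - 1; lvl >= 0; --lvl)
            while (cur->next[lvl] && cur->next[lvl]->key < key)
                cur = cur->next[lvl];
        const Node* cand = cur->next[0];
        return cand && cand->key == key;
    }

    void insert(int key) {
        Node* update[kMaxLevel];              // last node before `key` per layer
        Node* cur = &head_;
        for (int lvl = levels_ - 1; lvl >= 0; --lvl) {
            while (cur->next[lvl] && cur->next[lvl]->key < key)
                cur = cur->next[lvl];
            update[lvl] = cur;
        }
        if (cur->next[0] && cur->next[0]->key == key) return;  // ignore duplicates

        int h = randomHeight();
        assert(h >= 1 && h <= kMaxLevel);
        for (int lvl = levels_; lvl < h; ++lvl) update[lvl] = &head_;
        if (h > levels_) levels_ = h;

        Node* n = new Node(key, h);
        for (int lvl = 0; lvl < h; ++lvl) {   // splice into each layer it joins
            n->next[lvl] = update[lvl]->next[lvl];
            update[lvl]->next[lvl] = n;
        }
    }
};
```

A caller would create a SkipList, call insert for each key, and query with contains. The search loop visits the layers from sparsest to densest and only ever moves right or down, which is the bracketing behaviour described in the paragraph above.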
Figure: A schematic picture of the skip list data structure. Each box with an arrow represents a pointer and a row is a linked list giving a sparse subsequence; the numbered boxes at the bottom represent the ordered data sequence. Searching proceeds downwards from the sparsest subsequence at the top until consecutive elements bracketing the search element are found.

6.13.1 Description

A skip list is built in layers. The bottom layer is an ordinary ordered linked list. Each higher layer acts as an “express lane” for the lists below, where an element in layer i appears in layer i+1 with some fixed probability p (two commonly used values for p are 1/2 or 1/4). On average, each element appears in 1/(1-p) lists, and the tallest element (usually a special head element at the front of the skip list) appears in all the lists, log_{1/p} n of them.

A search for a target element begins at the head element in the top list, and proceeds horizontally until the current element is greater than or equal to the target. If the current element is equal to the target, it has been found. If the current element is greater than the target, or the search reaches the end of the linked list, the procedure is repeated after returning to the previous element and dropping down vertically to the next lower list. The expected number of steps in each linked list is at most 1/p, which can be seen by tracing the search path backwards from the target until reaching an element that appears in the next higher list or reaching the beginning of the current list. Therefore, the total expected cost of a search is (log_{1/p} n)/p, which is O(log n) when p is a constant. By choosing different values of p, it is possible to trade search costs against storage costs.

Implementation details

The elements used for a skip list can contain more than one pointer since they can participate in more than one list.

Insertions and deletions are implemented much like the corresponding linked-list operations, except that “tall” elements must be inserted into or deleted from more than one linked list.

O(n) operations, which force us to visit every node in ascending order (such as printing the entire list), provide the opportunity to perform a behind-the-scenes derandomization of the level structure of the skip list in an optimal way, bringing the skip list to O(log n) search time. (Choose the level of the i'th finite node to be 1 plus the number of times we can repeatedly divide i by 2 before it becomes odd. Also, i=0 for the negative infinity header as we have the usual special case of choosing the highest possible level for negative and/or positive infinite nodes.) However this also allows someone to know where all of the higher-than-level 1 nodes are and delete them.

Alternatively, we could make the level structure quasi-random in the following way:

    make all nodes level 1
    j ← 1
    while the number of nodes at level j > 1 do
        for each i'th node at level j do
            if i is odd
                if i is not the last node at level j
                    randomly choose whether to promote it to level j+1
                else
                    do not promote
                end if
            else if i is even and node i-1 was not promoted
                promote it to level j+1
            end if
        repeat
        j ← j + 1
    repeat

Like the derandomized version, quasi-randomization is only done when there is some other reason to be running an O(n) operation (which visits every node).

The advantage of this quasi-randomness is that it doesn't give away nearly as much level-structure related information to an adversarial user as the de-randomized one. This is desirable because an adversarial user who is able to tell which nodes are not at the lowest level can pessimize performance by simply deleting higher-level nodes. (Bethea and Reiter however argue that nonetheless an adversary can use probabilistic and timing methods to force performance degradation.[4]) The search performance is still guaranteed to be logarithmic.

It would be tempting to make the following “optimization”: in the step “for each i'th node at level j”, forget about doing a coin-flip for each even-odd pair. Just flip a coin once to decide whether to promote only the even ones or only the odd ones. Instead of O(n log n) coin flips, there would only be O(log n) of them. Unfortunately, this gives the adversarial user a 50/50 chance of being correct upon guessing that all of the even numbered nodes (among the ones at level 1 or higher) are higher than level one. This is despite the property that they have a very low probability of guessing that a particular node is at level N for some integer N.

A skip list does not provide the same absolute worst-case performance guarantees as more traditional balanced tree data structures, because it is always possible (though with very low probability) that the coin-flips used to build the skip list will produce a badly balanced structure. However, they work well in practice, and the randomized balancing scheme has been argued to be easier to implement than the deterministic balancing schemes used in balanced binary search trees. Skip lists are also useful in parallel computing, where insertions can be done in different parts of the skip list in parallel without any global rebalancing of the data structure. Such parallelism can be especially advantageous for resource discovery in an ad-hoc wireless network because a randomized skip list can be made robust to the loss of any single node.[5]

Indexable skiplist

As described above, a skiplist is capable of fast O(log n) insertion and removal of values from a sorted sequence, but it has only slow O(n) lookups of values at a given position in the sequence (i.e. return the 500th value); however, with a minor modification the speed of random access indexed lookups can be improved to O(log n).

For every link, also store the width of the link. The width is defined as the number of bottom layer links being traversed by each of the higher layer “express lane” links.

For example, here are the widths of the links in the example at the top of the page:

      1                                10
    o---> o---------------------------------------------------------> o    Top level
      1           3              2                    5
    o---> o---------------> o---------> o---------------------------> o    Level 3
      1        2        1        2              3              2
    o---> o---------> o---> o---------> o---------------> o---------> o    Level 2
      1     1     1     1     1     1     1     1     1     1     1
    o---> o---> o---> o---> o---> o---> o---> o---> o---> o---> o---> o    Bottom level
    Head  1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th  NIL
          Node  Node  Node  Node  Node  Node  Node  Node  Node  Node

Notice that the width of a higher level link is the sum of the component links below it (i.e. the width 10 link spans the links of widths 3, 2 and 5 immediately below it). Consequently, the sum of all widths is the same on every level (10 + 1 = 1 + 3 + 2 + 5 = 1 + 2 + 1 + 2 + 3 + 2).

To index the skiplist and find the i'th value, traverse the skiplist while counting down the widths of each traversed link. Descend a level whenever the upcoming width would be too large.

For example, to find the node in the fifth position (Node 5), traverse a link of width 1 at the top level. Now four more steps are needed but the next width on this level is ten which is too large, so drop one level. Traverse one link of width 3. Since another step of width 2 would be too far, drop down to the bottom level. Now traverse the final link of width 1 to reach the target running total of 5 (1+3+1).

    function lookupByPositionIndex(i)
        node ← head
        i ← i + 1                            # don't count the head as a step
        for level from top to bottom do
            while i ≥ node.width[level] do   # if next step is not too far
                i ← i - node.width[level]    # subtract the current width
                node ← node.next[level]      # traverse forward at the current level
            repeat
        repeat
        return node.value
    end function

This method of implementing indexing is detailed in Section 3.4 Linear List Operations in “A skip list cookbook” by William Pugh.

6.13.2 History

Skip lists were first described in 1989 by William Pugh.[6]

To quote the author:

    Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.

6.13.3 Usages

List of applications and frameworks that use skip lists:

• MemSQL uses skiplists as its prime indexing structure for its database technology.
• Cyrus IMAP server offers a “skiplist” backend DB implementation (source file)
• Lucene uses skip lists to search delta-encoded posting lists in logarithmic time.
• QMap (up to Qt 4) template class of Qt that provides a dictionary.
• Redis, an ANSI-C open-source persistent key/value store for Posix systems, uses skip lists in its implementation of ordered sets.[7]
• nessDB, a very fast key-value embedded Database Storage Engine (using log-structured-merge (LSM) trees), uses skip lists for its memtable.
• skipdb is an open-source database format using ordered key/value pairs.
• ConcurrentSkipListSet and ConcurrentSkipListMap in the Java 1.6 API.
• leveldb, a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values

Skip lists are used for efficient statistical computations of running medians (also known as moving medians). Skip lists are also used in distributed applications (where the nodes represent physical computers, and pointers represent network connections) and for implementing highly scalable concurrent priority queues with less lock contention,[8] or even without locking,[9][10][11] as well as lockless concurrent dictionaries.[12] There are also several US patents for using skip lists to implement (lockless) priority queues and concurrent dictionaries.[13]

6.13.4 See also

• Bloom filter
• Skip graph

6.13.5 References

[1] http://www.cs.uwaterloo.ca/research/tr/1993/28/root2side.pdf
[2] Pugh, W. (1990). “Skip lists: A probabilistic alternative to balanced trees” (PDF). Communications of the ACM. 33 (6): 668. doi:10.1145/78973.78977.
[3] Munro, J. Ian; Papadakis, Thomas; Sedgewick, Robert (1992). “Deterministic skip lists”. Proceedings of the third annual ACM-SIAM symposium on Discrete algorithms (SODA '92). Orlando, Florida, USA: Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 367–375.
[4] Darrell Bethea and Michael K. Reiter, Data Structures with Unpredictable Timing, https://www.cs.unc.edu/~djb/papers/2009-ESORICS.pdf, section 4 “Skip Lists”
[5] Shah, Gauri (2003). Distributed Data Structures for Peer-to-Peer Systems (PDF) (Ph.D. thesis). Yale University.
[6] William Pugh (April 1989). “Concurrent Maintenance of Skip Lists”, Tech. Report CS-TR-2222, Dept. of Computer Science, U. Maryland.
[7] “Redis ordered set implementation”.
[8] Shavit, N.; Lotan, I. (2000). “Skiplist-based concurrent priority queues”. Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000 (PDF). p. 263. doi:10.1109/IPDPS.2000.845994. ISBN 0-7695-0574-0.
[9] Sundell, H.; Tsigas, P. (2003). “Fast and lock-free concurrent priority queues for multi-thread systems”. Proceedings International Parallel and Distributed Processing Symposium. p. 11. doi:10.1109/IPDPS.2003.1213189. ISBN 0-7695-1926-1.
[10] Fomitchev, Mikhail; Ruppert, Eric (2004). Lock-free linked lists and skip lists (PDF). Proc. Annual ACM Symp. on Principles of Distributed Computing (PODC). pp. 50–59. doi:10.1145/1011767.1011776. ISBN 1581138024.
[11] Bajpai, R.; Dhara, K. K.; Krishnaswamy, V. (2008). “QPID: A Distributed Priority Queue with Item Locality”. 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications. p. 215. doi:10.1109/ISPA.2008.90. ISBN 978-0-7695-3471-8.
[12] Sundell, H. K.; Tsigas, P. (2004). “Scalable and lock-free concurrent dictionaries”. Proceedings of the 2004 ACM symposium on Applied computing - SAC '04 (PDF). p. 1438. doi:10.1145/967900.968188. ISBN 1581138121.
[13] US patent 7937378

6.13.6 External links

• “Skip list” entry in the Dictionary of Algorithms and Data Structures
• Skip Lists: A Linked List with Self-Balancing BST-Like Properties on MSDN in C# 2.0
• Skip Lists lecture (MIT OpenCourseWare: Introduction to Algorithms)
• Open Data Structures - Chapter 4 - Skiplists
• Skip trees, an alternative data structure to skip lists in a concurrent approach
• Skip tree graphs, a distributed version of skip trees
• More on skip tree graphs, a distributed version of skip trees

Demo applets

• Skip List Applet by Kubo Kovac
• Thomas Wenger’s demo applet on skiplists

Implementations

• Algorithm::SkipList, implementation in Perl on CPAN
• Raymond Hettinger’s implementation in Python
• ConcurrentSkipListSet documentation for Java 6 (and sourcecode)

6.14 B-tree

Not to be confused with Binary tree.

In computer science, a B-tree is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children (Comer 1979, p. 123).
Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. B-trees are a good example of a data structure for external memory. They are commonly used in databases and filesystems.

6.14.1 Overview

[Figure: A B-tree (Bayer & McCreight 1972) of order 5 (Knuth 1998).]

A B-tree is kept balanced by requiring that all leaf nodes be at the same depth. This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more node farther away from the root.

B-trees have substantial advantages over alternative implementations when the time to access the data of a node greatly exceeds the time spent processing that data, because then the cost of accessing the node may be amortized over multiple operations within the node. This usually occurs when the node data are in secondary storage such as disk drives. By maximizing the number of keys within each internal node, the height of the tree decreases and the number of expensive node accesses is reduced. In addition, rebalancing of the tree occurs less often. The maximum number of child nodes depends on the information that must be stored for each child node and the size of a full disk block or an analogous size in secondary storage. While 2-3 B-trees are easier to explain, practical B-trees using secondary storage need a large number of child nodes to improve performance.

In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation. For example, in a 2-3 B-tree (often simply referred to as a 2-3 tree), each internal node may have only 2 or 3 child nodes.

Each internal node of a B-tree will contain a number of keys. The keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a1 and a2. All values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than a2.

Usually, the number of keys is chosen to vary between d and 2d, where d is the minimum number of keys, and d + 1 is the minimum degree or branching factor of the tree. In practice, the keys take up the most space in a node. The factor of 2 will guarantee that nodes can be split or combined. If an internal node has 2d keys, then adding a key to that node can be accomplished by splitting the hypothetical 2d + 1 key node into two d key nodes and moving the key that would have been in the middle to the parent node. Each split node has the required minimum number of keys. Similarly, if an internal node and its neighbor each have d keys, then a key may be deleted from the internal node by combining it with its neighbor. Deleting the key would make the internal node have d − 1 keys; joining the neighbor would add d keys plus one more key brought down from the neighbor’s parent. The result is an entirely full node of 2d keys.

In a 2-3 B-tree, the internal nodes will store either one key (with two child nodes) or two keys (with three child nodes). A B-tree is sometimes described with the parameters (d + 1)–(2d + 1) or simply with the highest branching order, (2d + 1).
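The split and merge arithmetic described above, where a non-root node holds between d and 2d keys, can be illustrated with a minimal Python sketch. The function names `split_overfull` and `merge_with_neighbor` are hypothetical, and nodes are modeled as plain sorted lists of keys rather than real tree nodes with child pointers.

```python
def split_overfull(keys, d):
    """Split a hypothetical node holding 2d + 1 sorted keys.

    Returns (left, median, right): two nodes of d keys each, plus the
    middle key that would move up into the parent node.
    """
    if len(keys) != 2 * d + 1:
        raise ValueError("expected exactly 2d + 1 keys")
    return keys[:d], keys[d], keys[d + 1:]


def merge_with_neighbor(left, separator, right, d):
    """Join two neighbors that together hold 2d - 1 keys.

    The separator is the key brought down from the parent, so the
    merged node is completely full with 2d keys.
    """
    merged = left + [separator] + right
    if len(merged) != 2 * d:
        raise ValueError("expected 2d - 1 keys plus one separator")
    return merged


d = 2
print(split_overfull([10, 20, 30, 40, 50], d))       # ([10, 20], 30, [40, 50])
print(merge_with_neighbor([10], 20, [30, 40], d))    # [10, 20, 30, 40]
```

Both halves of a split end up with exactly d keys, and a merge of a d − 1 key node with a d key neighbor yields exactly 2d keys, which is why the factor of 2 between the bounds is sufficient.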
The number of branches (or child nodes) from a node will be one more than the number of keys stored in the node.

Variants

The term B-tree may refer to a specific design or it may refer to a general class of designs. In the narrow sense, a B-tree stores keys in its internal nodes but need not store those keys in the records at the leaves. The general class includes variations such as the B+ tree and the B* tree.

• In the B+ tree, copies of the keys are stored in the internal nodes; the keys and records are stored in leaves; in addition, a leaf node may include a pointer to the next leaf node to speed sequential access (Comer 1979, p. 129).
• The B* tree balances more neighboring internal nodes to keep the internal nodes more densely packed (Comer 1979, p. 129). This variant requires non-root nodes to be at least 2/3 full instead of 1/2 (Knuth 1998, p. 488). To maintain this, instead of immediately splitting up a node when it gets full, its keys are shared with a node next to it. When both nodes are full, then the two nodes are split into three. Deleting nodes is somewhat more complex than inserting, however.
• B-trees can be turned into order statistic trees to allow rapid searches for the Nth record in key order, or counting the number of records between any two records, and various other related operations.[1]

Etymology

Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs in 1971 (Bayer & McCreight 1972), but they did not explain what, if anything, the B stands for. Douglas Comer explains:

The origin of “B-tree” has never been explained by the authors. As we shall see, “balanced,” “broad,” or “bushy” might apply. Others suggest that the “B” stands for Boeing. Because of his contributions, however, it seems appropriate to think of B-trees as “Bayer”-trees. (Comer 1979, p. 123 footnote 1)

Donald Knuth speculates on the etymology of B-trees in his May, 1980 lecture on the topic “CS144C classroom lecture about disk storage and B-trees”, suggesting the “B” may have originated from Boeing or from Bayer’s name.[2]

After a talk at CPM 2013 (24th Annual Symposium on Combinatorial Pattern Matching, Bad Herrenalb, Germany, June 17–19, 2013), Ed McCreight answered a question on B-tree’s name by Martin Farach-Colton saying: “Bayer and I were in a lunch time where we get to think a name. And we were, so, B, we were thinking… B is, you know… We were working for Boeing at the time, we couldn't use the name without talking to lawyers. So, there is a B. It has to do with balance, another B. Bayer was the senior author, who did have several years older than I am and had many more publications than I did. So there is another B. And so, at the lunch table we never did resolve whether there was one of those that made more sense than the rest. What really lives to say is: the more you think about what the B in B-trees means, the better you understand B-trees.”[3]

6.14.2 B-tree usage in databases

Time to search a sorted file

Usually, sorting and searching algorithms have been characterized by the number of comparison operations that must be performed using order notation. A binary search of a sorted table with N records, for example, can be done in roughly $\lceil \log_2 N \rceil$ comparisons. If the table had 1,000,000 records, then a specific record could be located with at most 20 comparisons: $\lceil \log_2 (1{,}000{,}000) \rceil = 20$.

Large databases have historically been kept on disk drives. The time to read a record on a disk drive far exceeds the time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds.[4] For simplicity, assume reading from disk takes about 10 milliseconds.

Naively, then, the time to locate one record out of a million would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 seconds.

The time won't be that bad because individual records are grouped together in a disk block. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons don't need to do any disk reads: the comparisons are all within the last disk block read.

To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.

An index speeds the search

A significant improvement can be made with an index. In the example above, initial disk reads narrowed the search range by a factor of two. That can be improved substantially by creating an auxiliary index that contains the first record in each disk block (sometimes called a sparse index). This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly. Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database, at a cost of one more disk read. The index would hold 10,000 entries, so it would take at most 14 comparisons. Like the main database, the last 6 or so comparisons in the aux index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.

The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an aux-aux index that would need only 100 entries and would fit in one disk block.

Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the aux-aux index identifies the relevant block in the aux-index. Reading and searching that aux-index block identifies the relevant block in the main database. Instead of 150 milliseconds, we need only 30 milliseconds to get the record.

The auxiliary indices have turned the search problem from a binary search requiring roughly $\log_2 N$ disk reads to one requiring only $\log_b N$ disk reads, where b is the blocking factor (the number of entries per block: b = 100 entries per block; $\log_b 1{,}000{,}000 = 3$ reads).

In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.

Insertions and deletions

If the database does not change, then compiling the index is simple to do, and the index need never be changed. If there are changes, then managing the database and its index becomes more complicated.

Deleting records from a database is relatively easy. The index can stay the same, and the record can just be marked as deleted. The database stays in sorted order. If there is a large number of deletions, then the searching and storage become less efficient.

Insertions can be very slow in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record in the file requires shifting all of the records down one. Such an operation is just too expensive to be practical. One solution is to leave some space available to be used for insertions. Instead of densely storing all the records in a block, the block can have some free space to allow for subsequent insertions. Those records would be marked as if they were “deleted” records.

Both insertions and deletions are fast as long as space is available on a block. If an insertion won't fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The hope is that enough space is nearby such that a lot of blocks do not need to be reorganized. Alternatively, some out-of-sequence disk blocks may be used.

Advantages of B-tree usage for databases

The B-tree uses all of the ideas described above. In particular, a B-tree:

• keeps keys in sorted order for sequential traversing
• uses a hierarchical index to minimize the number of disk reads
• uses partially full blocks to speed insertions and deletions
• keeps the index balanced with an elegant recursive algorithm

In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an arbitrary number of insertions and deletions.

Disadvantages of B-trees

• maximum key length cannot be changed without completely rebuilding the database. This led to many database systems truncating full human names to 70 characters.

(Other implementations of associative array, such as a ternary search tree or a separate-chaining hash table, dynamically adapt to arbitrarily long key lengths).

6.14.3 Technical description

Terminology

The literature on B-trees is not uniform in its terminology (Folk & Zoellick 1992, p. 362).

Bayer & McCreight (1972), Comer (1979), and others define the order of a B-tree as the minimum number of keys in a non-root node. Folk & Zoellick (1992) points out that terminology is ambiguous because the maximum number of keys is not clear. An order 3 B-tree might hold a maximum of 6 keys or a maximum of 7 keys. Knuth (1998, p. 483) avoids the problem by defining the order to be the maximum number of children (which is one more than the maximum number of keys).

The term leaf is also inconsistent. Bayer & McCreight (1972) considered the leaf level to be the lowest level of keys, but Knuth considered the leaf level to be one level below the lowest keys (Folk & Zoellick 1992, p. 363). There are many possible implementation choices. In some designs, the leaves may hold the entire data record; in other designs, the leaves may only hold pointers to the data record. Those choices are not fundamental to the idea of a B-tree.[5]

There are also unfortunate choices like using the variable k to represent the number of children when k could be confused with the number of keys.

For simplicity, most authors assume there are a fixed number of keys that fit in a node. The basic assumption is the key size is fixed and the node size is fixed. In practice, variable length keys may be employed (Folk & Zoellick 1992, p. 379).

Definition

According to Knuth’s definition, a B-tree of order m is a tree which satisfies the following properties:

1. Every node has at most m children.
2. Every non-leaf node (except root) has at least ⌈m/2⌉ children.
3. The root has at least two children if it is not a leaf node.
4. A non-leaf node with k children contains k−1 keys.
5. All leaves appear in the same level.

Each internal node’s keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a1 and a2. All values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than a2.

Internal nodes: Internal nodes are all nodes except for leaf nodes and the root node. They are usually represented as an ordered set of elements and child pointers. Every internal node contains a maximum of U children and a minimum of L children. Thus, the number of elements is always 1 less than the number of child pointers (the number of elements is between L−1 and U−1). U must be either 2L or 2L−1; therefore each internal node is at least half full. The relationship between U and L implies that two half-full nodes can be joined to make a legal node, and one full node can be split into two legal nodes (if there’s room to push one element up into the parent). These properties make it possible to delete and insert new values into a B-tree and adjust the tree to preserve the B-tree properties.

The root node: The root node’s number of children has the same upper limit as internal nodes, but has no lower limit. For example, when there are fewer than L−1 elements in the entire tree, the root will be the only node in the tree with no children at all.

Leaf nodes: Leaf nodes have the same restriction on the number of elements, but have no children, and no child pointers.

A B-tree of depth n+1 can hold about U times as many items as a B-tree of depth n, but the cost of search, insert, and delete operations grows with the depth of the tree. As with any balanced tree, the cost grows much more slowly than the number of elements.

Some balanced trees store values only at leaf nodes, and use different kinds of nodes for leaf nodes and internal nodes. B-trees keep values in every node in the tree, and may use the same structure for all nodes. However, since leaf nodes never have children, the B-trees benefit from improved performance if they use a specialized structure.

6.14.4 Best case and worst case heights

Let h be the height of the classic B-tree. Let n > 0 be the number of entries in the tree.[6] Let m be the maximum number of children a node can have. Each node can have at most m−1 keys.

It can be shown (by induction for example) that a B-tree of height h with all its nodes completely filled has $n = m^{h+1} - 1$ entries. Hence, the best case height of a B-tree is:

$\lceil \log_m (n + 1) \rceil - 1$

Let d be the minimum number of children an internal (non-root) node can have. For an ordinary B-tree, $d = \lceil m/2 \rceil$.

Comer (1979, p. 127) and Cormen et al. (2001, pp. 383–384) give the worst case height of a B-tree (where the root node is considered to have height 0) as

$h \le \left\lfloor \log_d \left( \frac{n+1}{2} \right) \right\rfloor$.

6.14.5 Algorithms

Search

Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to bottom. At each level, the search reduces its field of view to the child pointer (subtree) whose range includes the search value. A subtree’s range is defined by the values, or keys, contained in its parent node. These limiting values are also known as separation values.

Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.

Insertion

[Figure: A B Tree insertion example with each iteration. The nodes of this B tree have at most 3 children (Knuth order 3).]

All insertions start at a leaf node. To insert a new element, search the tree to find the leaf node where the new element should be added. Insert the new element into that node with the following steps:

1. If the node contains fewer than the maximum legal number of elements, then there is room for the new element. Insert the new element in the node, keeping the node’s elements ordered.
2. Otherwise the node is full; evenly split it into two nodes as follows:
(a) A single median is chosen from among the leaf’s elements and the new element.
(b) Values less than the median are put in the new left node and values greater than the median are put in the new right node, with the median acting as a separation value.
(c) The separation value is inserted in the node’s parent, which may cause it to be split, and so on. If the node has no parent (i.e., the node was the root), create a new root above this node (increasing the height of the tree).

If the splitting goes all the way up to the root, it creates a new root with a single separator value and two children, which is why the lower bound on the size of internal nodes does not apply to the root. The maximum number of elements per node is U−1. When a node is split, one element moves to the parent, but one element is added. So, it must be possible to divide the maximum number U−1 of elements into two legal nodes. If this number is odd, then U=2L and one of the new nodes contains (U−2)/2 = L−1 elements, and hence is a legal node, and the other contains one more element, and hence it is legal too. If U−1 is even, then U=2L−1, so there are 2L−2 elements in the node. Half of this number is L−1, which is the minimum number of elements allowed per node.

An improved algorithm supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this improved algorithm, we must be able to send one element to the parent and split the remaining U−2 elements into two legal nodes, without adding a new element. This requires U = 2L rather than U = 2L−1, which accounts for why some textbooks impose this requirement in defining B-trees.

Deletion

There are two popular strategies for deletion from a B-tree.

1. Locate and delete the item, then restructure the tree to retain its invariants, OR
2. Do a single pass down the tree, but before entering (visiting) a node, restructure the tree so that once the key to be deleted is encountered, it can be deleted without triggering the need for any further restructuring

The algorithm below uses the former strategy.

There are two special cases to consider when deleting an element:

1. The element in an internal node is a separator for its child nodes
2. Deleting an element may put its node under the minimum number of elements and children

The procedures for these cases are in order below.

Deletion from a leaf node

1. Search for the value to delete.
2. If the value is in a leaf node, simply delete it from the node.
3. If underflow happens, rebalance the tree as described in section “Rebalancing after deletion” below.

Deletion from an internal node

Each element in an internal node acts as a separation value for two subtrees, therefore we need to find a replacement for separation. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest element in the right subtree is still greater than the separator. Both of those elements are in leaf nodes, and either one can be the new separator for the two subtrees. Algorithmically described below:

1. Choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree), remove it from the leaf node it is in, and replace the element to be deleted with the new separator.
2. The previous step deleted an element (the new separator) from a leaf node. If that leaf node is now deficient (has fewer than the required number of elements), then rebalance the tree starting from the leaf node.

Rebalancing after deletion

Rebalancing starts from a leaf and proceeds toward the root until the tree is balanced. If deleting an element from a node has brought it under the minimum size, then some elements must be redistributed to bring all nodes up to the minimum. Usually, the redistribution involves moving an element from a sibling node that has more than the minimum number of elements. That redistribution operation is called a rotation. If no sibling can spare an element, then the deficient node must be merged with a sibling.
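The rotation and merge operations just described can be sketched in Python for the simplest case, a deficient leaf. This is only an illustration under stated assumptions: `rebalance_child` is a hypothetical name, the parent is modeled as a list of separator keys plus a list of leaf children (each a sorted list), and propagating a deficiency further up the tree is left out.

```python
def rebalance_child(parent_keys, children, i, min_keys):
    """Repair the deficient leaf children[i] in place.

    parent_keys[j] is the separator between children[j] and children[j + 1].
    Returns the name of the operation that was applied.
    """
    if len(children) < 2:
        raise ValueError("a node with separators has at least two children")
    node = children[i]
    # Rotation from the right sibling, if it has more than the minimum.
    if i + 1 < len(children) and len(children[i + 1]) > min_keys:
        node.append(parent_keys[i])                  # separator moves down
        parent_keys[i] = children[i + 1].pop(0)      # sibling's first key moves up
        return "rotate left"
    # Rotation from the left sibling, if it has more than the minimum.
    if i > 0 and len(children[i - 1]) > min_keys:
        node.insert(0, parent_keys[i - 1])           # separator moves down
        parent_keys[i - 1] = children[i - 1].pop()   # sibling's last key moves up
        return "rotate right"
    # No sibling can spare a key: merge two neighbors around their separator.
    left = i if i + 1 < len(children) else i - 1
    children[left] = children[left] + [parent_keys.pop(left)] + children[left + 1]
    del children[left + 1]
    return "merge"


keys = [20, 40]
kids = [[5, 10], [25], [45, 50, 55]]        # middle leaf is below min_keys = 2
print(rebalance_child(keys, kids, 1, 2))    # rotate left
print(keys, kids)                           # [20, 45] [[5, 10], [25, 40], [50, 55]]
```

In the merge case the parent loses one separator, so a full implementation would then check whether the parent itself has become deficient.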
The merge causes the parent to lose a separator element, so the parent may become deficient and need rebalancing. The merging and rebalancing may continue all the way to the root. Since the minimum element count doesn't apply to the root, making the root be the only deficient node is not a problem. The algorithm to rebalance the tree is as follows:

• If the deficient node’s right sibling exists and has more than the minimum number of elements, then rotate left
  1. Copy the separator from the parent to the end of the deficient node (the separator moves down; the deficient node now has the minimum number of elements)
  2. Replace the separator in the parent with the first element of the right sibling (the right sibling loses one element but still has at least the minimum number of elements)
  3. The tree is now balanced
• Otherwise, if the deficient node’s left sibling exists and has more than the minimum number of elements, then rotate right
  1. Copy the separator from the parent to the start of the deficient node (the separator moves down; the deficient node now has the minimum number of elements)
  2. Replace the separator in the parent with the last element of the left sibling (the left sibling loses one element but still has at least the minimum number of elements)
  3. The tree is now balanced
• Otherwise, if both immediate siblings have only the minimum number of elements, then merge with a sibling, sandwiching their separator taken off from their parent
  1. Copy the separator to the end of the left node (the left node may be the deficient node or it may be the sibling with the minimum number of elements)
  2. Move all elements from the right node to the left node (the left node now has the maximum number of elements, and the right node is empty)
  3. Remove the separator from the parent along with its empty right child (the parent loses an element)
    • If the parent is the root and now has no elements, then free it and make the merged node the new root (tree becomes shallower)
    • Otherwise, if the parent has fewer than the required number of elements, then rebalance the parent

Note: The rebalancing operations are different for B+ trees (e.g., rotation is different because the parent has a copy of the key) and B*-trees (e.g., three siblings are merged into two siblings).

Sequential access

While freshly loaded databases tend to have good sequential behavior, this behavior becomes increasingly difficult to maintain as a database grows, resulting in more random I/O and performance challenges.[7]

Initial construction

In applications, it is frequently useful to build a B-tree to represent a large existing collection of data and then update it incrementally using standard B-tree operations. In this case, the most efficient way to construct the initial B-tree is not to insert every element in the initial collection successively, but instead to construct the initial set of leaf nodes directly from the input, then build the internal nodes from these. This approach to B-tree construction is called bulkloading. Initially, every leaf but the last one has one extra element, which will be used to build the internal nodes.

For example, if the leaf nodes have maximum size 4 and the initial collection is the integers 1 through 24, we would initially construct 4 leaf nodes containing 5 values each and 1 which contains 4 values:

We build the next level up from the leaves by taking the last element from each leaf node except the last one. Again, each node except the last will contain one extra value. In the example, suppose the internal nodes contain at most 2 values (3 child pointers). Then the next level up of internal nodes would be:

This process is continued until we reach a level with only one node and it is not overfilled. In the example only the root level remains:

6.14.6 In filesystems

Most modern filesystems use B-trees (or § Variants); alternatives such as extendible hashing are less common.[8]

In addition to its use in databases, the B-tree is also used in filesystems to allow quick random access to an arbitrary block in a particular file. The basic problem is turning the file block i address into a disk block (or perhaps to a cylinder-head-sector) address.

Some operating systems require the user to allocate the maximum size of the file when the file is created. The file can then be allocated as contiguous disk blocks. When converting to a disk block the operating system just adds the file block address to the starting disk block of the file. The scheme is simple, but the file cannot exceed its created size.

Other operating systems allow a file to grow. The resulting disk blocks may not be contiguous, so mapping logical blocks to physical blocks is more involved.

MS-DOS, for example, used a simple File Allocation Table (FAT). The FAT has an entry for each disk block,[note 1] and that entry identifies whether its block is used by a file and if so, which block (if any) is the next disk block of the same file. So, the allocation of each file is represented as a linked list in the table. In order to find the disk address of file block i, the operating system (or disk utility) must sequentially follow the file’s linked list in the FAT. Worse, to find a free disk block, it must sequentially scan the FAT. For MS-DOS, that was not a huge penalty because the disks and files were small and the FAT had few entries and relatively short file chains. In the FAT12 filesystem (used on floppy disks and early hard disks), there were no more than 4,080 [note 2] entries, and the FAT would usually be resident in memory. As disks got bigger, the FAT architecture began to confront penalties. On a large disk using FAT, it may be necessary to perform disk reads to learn the disk location of a file block to be read or written.

TOPS-20 (and possibly TENEX) used a 0 to 2 level tree that has similarities to a B-tree. A disk block was 512 36-bit words. If the file fit in a 512 ($2^9$) word block, then the file directory would point to that physical disk block. If the file fit in $2^{18}$ words, then the directory would point to an aux index; the 512 words of that index would either be NULL (the block isn't allocated) or point to the physical address of the block. If the file fit in $2^{27}$ words, then the directory would point to a block holding an aux-aux index; each entry would either be NULL or point to an aux index. Consequently, the physical disk block for a $2^{27}$ word file could be located in two disk reads and read on the third.

Apple’s filesystem HFS+, Microsoft’s NTFS,[9] AIX (jfs2) and some Linux filesystems, such as btrfs and Ext4, use B-trees.

B*-trees are used in the HFS and Reiser4 file systems.

6.14.7 Variations

Access concurrency

Lehman and Yao[10] showed that all the read locks could be avoided (and thus concurrent access greatly improved) by linking the tree blocks at each level together with a “next” pointer. This results in a tree structure where both insertion and search operations descend from the root to the leaf. Write locks are only required as a tree block is modified. This maximizes access concurrency by multiple users, an important consideration for databases and/or other B-tree based ISAM storage methods. The cost associated with this improvement is that empty pages cannot be removed from the btree during normal operations. (However, see [11] for various strategies to implement node merging, and source code at.[12])

United States Patent 5283894, granted in 1994, appears to show a way to use a 'Meta Access Method'[13] to allow concurrent B+ tree access and modification without locks. The technique accesses the tree 'upwards' for both searches and updates by means of additional in-memory indexes that point at the blocks in each level in the block cache. No reorganization for deletes is needed and there are no 'next' pointers in each block as in Lehman and Yao.

6.14.8 See also

• B+tree
• R-tree
• 2–3 tree
• 2–3–4 tree

6.14.9 Notes

[1] For FAT, what is called a “disk block” here is what the FAT documentation calls a “cluster”, which is a fixed-size group of one or more contiguous whole physical disk sectors. For the purposes of this discussion, a cluster has no significant difference from a physical sector.
[2] Two of these were reserved for special purposes, so only 4078 could actually represent disk blocks (clusters).

6.14.10 References

[1] Counted B-Trees, retrieved 2010-01-25
[2] Knuth’s video lectures from Stanford
[3] Talk’s video, retrieved 2014-01-17
[4] Seagate Technology LLC, Product Manual: Barracuda ES.2 Serial ATA, Rev. F., publication 100468393, 2008, page 6
[5] Bayer & McCreight (1972) avoided the issue by saying an index element is a (physically adjacent) pair of (x, a) where x is the key, and a is some associated information. The associated information might be a pointer to a record or records in a random access, but what it was didn't really matter. Bayer & McCreight (1972) states, “For this paper the associated information is of no further interest.”
[6] If n is zero, then no root node is needed, so the height of an empty tree is not well defined.
[7] “Cache Oblivious B-trees”. State University of New York (SUNY) at Stony Brook. Retrieved 2011-01-17.
[8] Mikuláš Patocka. “Design and Implementation of the Spad Filesystem”. “Table 4.1: Directory organization in filesystems”. 2006.
[9] Mark Russinovich. "Inside Win2K NTFS, Part 1". Microsoft Developer Network. Archived from the original on 13 April 2008. Retrieved 2008-04-18.

[10] "Efficient locking for concurrent operations on B-trees". Portal.acm.org. doi:10.1145/319628.319663. Retrieved 2012-06-28.

[11] http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf

[12] "Downloads - high-concurrency-btree - High Concurrency B-Tree code in C - GitHub Project Hosting". Retrieved 2014-01-27.

[13] Lockless Concurrent B+Tree

General

• Bayer, R.; McCreight, E. (1972), "Organization and Maintenance of Large Ordered Indexes" (PDF), Acta Informatica, 1 (3): 173–189, doi:10.1007/bf00288683
• Comer, Douglas (June 1979), "The Ubiquitous B-Tree", Computing Surveys, 11 (2): 123–137, doi:10.1145/356770.356776, ISSN 0360-0300.
• Cormen, Thomas; Leiserson, Charles; Rivest, Ronald; Stein, Clifford (2001), Introduction to Algorithms (Second ed.), MIT Press and McGraw-Hill, pp. 434–454, ISBN 0-262-03293-7. Chapter 18: B-Trees.
• Folk, Michael J.; Zoellick, Bill (1992), File Structures (2nd ed.), Addison-Wesley, ISBN 0-201-55713-4
• Knuth, Donald (1998), Sorting and Searching, The Art of Computer Programming, Volume 3 (Second ed.), Addison-Wesley, ISBN 0-201-89685-0. Section 6.2.4: Multiway Trees, pp. 481–491. Also, pp. 476–477 of section 6.2.3 (Balanced Trees) discusses 2-3 trees.

Original papers

• Bayer, Rudolf; McCreight, E. (July 1970), Organization and Maintenance of Large Ordered Indices, Mathematical and Information Sciences Report No. 20, Boeing Scientific Research Laboratories.
• Bayer, Rudolf (1971), Binary B-Trees for Virtual Memory, Proceedings of 1971 ACM-SIGFIDET Workshop on Data Description, Access and Control, San Diego, California.
6.14.11 External links

• B-tree lecture by David Scot Taylor, SJSU
• B-Tree animation applet by slady
• B-tree and UB-tree on Scholarpedia, Curator: Dr Rudolf Bayer
• B-Trees: Balanced Tree Data Structures
• NIST's Dictionary of Algorithms and Data Structures: B-tree
• B-Tree Tutorial
• The InfinityDB BTree implementation
• Cache Oblivious B(+)-trees
• Dictionary of Algorithms and Data Structures entry for B*-tree
• Open Data Structures - Section 14.2 - B-Trees
• Counted B-Trees
• B-Tree .Net, a modern, virtualized RAM & Disk implementation

6.15 B+ tree

A simple B+ tree example linking the keys 1–7 to data values d1–d7. The linked list (red) allows rapid in-order traversal. This particular tree's branching factor is b = 4.

A B+ tree is an n-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves.[1] The root may be either a leaf or a node with two or more children.[2]

A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves.

The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context, in particular filesystems. This is primarily because, unlike binary search trees, B+ trees have very high fanout (number of pointers to child nodes in a node,[1] typically on the order of 100 or more), which reduces the number of I/O operations required to find an element in the tree.

The ReiserFS, NSS, XFS, JFS, ReFS, and BFS filesystems all use this type of tree for metadata indexing; BFS also uses B+ trees for storing directories. NTFS uses B+ trees for directory indexing. EXT4 uses extent trees (a modified B+ tree data structure) for file extent indexing.[3] Relational database management systems such as IBM DB2,[4] Informix,[4] Microsoft SQL Server,[4] Oracle 8,[4] Sybase ASE,[4] and SQLite[5] support this type of tree for table indices. Key-value database management systems such as CouchDB[6] and Tokyo Cabinet[7] support this type of tree for data access.

6.15.1 Overview

The order, or branching factor, b of a B+ tree measures the capacity of nodes (i.e., the number of children nodes) for internal nodes in the tree. The actual number of children for a node, referred to here as m, is constrained for internal nodes so that ⌈b/2⌉ ≤ m ≤ b. The root is an exception: it is allowed to have as few as two children.[1] For example, if the order of a B+ tree is 7, each internal node (except for the root) may have between 4 and 7 children; the root may have between 2 and 7. Leaf nodes have no children, but are constrained so that the number of keys must be at least ⌈b/2⌉ − 1 and at most b − 1. In the situation where a B+ tree is nearly empty, it only contains one node, which is a leaf node. (The root is also the single leaf, in this case.) This node is permitted to have as little as one key if necessary, and at most b.

6.15.2 Algorithms

Search

The root of a B+ tree represents the whole range of values in the tree, where every internal node is a subinterval.

We are looking for a value k in the B+ tree. Starting from the root, we are looking for the leaf which may contain the value k. At each node, we figure out which internal pointer we should follow. An internal B+ tree node has at most d ≤ b children, where every one of them represents a different sub-interval. We select the corresponding node by searching on the key values of the node.

    Function: search(k)
        return tree_search(k, root);

    Function: tree_search(k, node)
        if node is a leaf then
            return node;
        switch k do
            case k < k_0
                return tree_search(k, p_0);
            case k_i ≤ k < k_{i+1}
                return tree_search(k, p_{i+1});
            case k_d ≤ k
                return tree_search(k, p_{d+1});

This pseudocode assumes that no duplicates are allowed.

Prefix key compression

• It is important to increase fan-out, as this allows searches to be directed to the leaf level more efficiently.
• Index entries only "direct traffic", so we can compress them.

Insertion

Perform a search to determine what bucket the new record should go into.

• If the bucket is not full (at most b − 1 entries after the insertion), add the record.
• Otherwise, split the bucket.
    • Allocate new leaf and move half the bucket's elements to the new bucket.
    • Insert the new leaf's smallest key and address into the parent.
    • If the parent is full, split it too.
        • Add the middle key to the parent node.
    • Repeat until a parent is found that need not split.
• If the root splits, create a new root which has one key and two pointers. (That is, the value that gets pushed to the new root gets removed from the original node.)

B-trees grow at the root and not at the leaves.[1]

Deletion

• Start at root, find leaf L where entry belongs.
• Remove the entry.
    • If L is at least half-full, done!
    • If L has fewer entries than it should,
        • If sibling (adjacent node with same parent as L) is more than half-full, re-distribute, borrowing an entry from it.
        • Otherwise, sibling is exactly half-full, so we can merge L and sibling.
• If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
• Merge could propagate to root, decreasing height.

Bulk-loading

Given a collection of data records, we want to create a B+ tree index on some key field. One approach is to insert each record into an empty tree. However, it is quite expensive, because each entry requires us to start from the root and go down to the appropriate leaf page. An efficient alternative is to use bulk-loading.

• The first step is to sort the data entries according to a search key in ascending order.
• We allocate an empty page to serve as the root, and insert a pointer to the first page of entries into it.
• When the root is full, we split the root, and create a new root page.
• Keep inserting entries to the right-most index page just above the leaf level, until all entries are indexed.

Note (1) when the right-most index page above the leaf level fills up, it is split; (2) this action may, in turn, cause a split of the right-most index page one step closer to the root; and (3) splits only occur on the right-most path from the root to the leaf level.

6.15.3 Characteristics

For a b-order B+ tree with h levels of index:

• The maximum number of records stored is n_max = b^h − b^(h−1)
• The minimum number of records stored is n_min = 2⌈b/2⌉^(h−1)
• The minimum number of keys is n_kmin = 2⌈b/2⌉^(h−1) − 1
• The maximum number of keys is n_kmax = b^h
• The space required to store the tree is O(n)
• Inserting a record requires O(log_b n) operations
• Finding a record requires O(log_b n) operations
• Removing a (previously located) record requires O(log_b n) operations
• Performing a range query with k elements occurring within the range requires O(log_b n + k) operations

6.15.4 Implementation

The leaves (the bottom-most index blocks) of the B+ tree are often linked to one another in a linked list; this makes range queries or an (ordered) iteration through the blocks simpler and more efficient (though the aforementioned upper bound can be achieved even without this addition). This does not substantially increase space consumption or maintenance on the tree. This illustrates one of the significant advantages of a B+ tree over a B-tree; in a B-tree, since not all keys are present in the leaves, such an ordered linked list cannot be constructed. A B+ tree is thus particularly useful as a database system index, where the data typically resides on disk, as it allows the B+ tree to actually provide an efficient structure for housing the data itself (this is described in [4]:238 as index structure "Alternative 1").

If a storage system has a block size of B bytes, and the keys to be stored have a size of k, arguably the most efficient B+ tree is one where b = (B/k) − 1. Although theoretically the one-off is unnecessary, in practice there is often a little extra space taken up by the index blocks (for example, the linked list references in the leaf blocks). Having an index block which is slightly larger than the storage system's actual block represents a significant performance decrease; therefore erring on the side of caution is preferable.

If nodes of the B+ tree are organized as arrays of elements, then it may take a considerable time to insert or delete an element, as half of the array will need to be shifted on average. To overcome this problem, elements inside a node can be organized in a binary tree or a B+ tree instead of an array.

B+ trees can also be used for data stored in RAM. In this case a reasonable choice for block size would be the size of the processor's cache line.

Space efficiency of B+ trees can be improved by using some compression techniques. One possibility is to use delta encoding to compress keys stored into each block. For internal blocks, space saving can be achieved by compressing either keys or pointers. For string keys, space can be saved by using the following technique: normally the i-th entry of an internal block contains the first key of block i+1. Instead of storing the full key, we could store the shortest prefix of the first key of block i+1 that is strictly greater (in lexicographic order) than the last key of block i.
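The prefix technique for string keys can be sketched as follows. This is an illustrative sketch only: the function name and arguments are hypothetical, and it assumes keys are Python strings compared in the language's default lexicographic (code point) order.

```python
def shortest_separator(last_key_left: str, first_key_right: str) -> str:
    """Return the shortest prefix of first_key_right that is strictly
    greater than last_key_left.

    last_key_left   -- last key of block i (hypothetical argument name)
    first_key_right -- first key of block i+1
    """
    if not last_key_left < first_key_right:
        raise ValueError("keys must be in strictly increasing order")
    # Try prefixes of increasing length; the full key always qualifies,
    # so the loop is guaranteed to return.
    for length in range(1, len(first_key_right) + 1):
        prefix = first_key_right[:length]
        if prefix > last_key_left:
            return prefix
    return first_key_right  # not reached; kept for completeness


print(shortest_separator("apple", "apricot"))   # apr
print(shortest_separator("smith", "smithson"))  # smiths
```

The returned prefix still routes searches correctly: every key in block i is at most last_key_left, which is smaller than the prefix, and every key in block i+1 is at least first_key_right, which is not smaller than any of its own prefixes.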
There is also a simple way to compress pointers: if we suppose that some consecutive blocks i, i+1, ..., i+k are stored contiguously, then it will suffice to store only a pointer to the first block and the count of consecutive blocks.

All the above compression techniques have some drawbacks. First, a full block must be decompressed to extract a single element. One technique to overcome this problem is to divide each block into sub-blocks and compress them separately. In this case searching or inserting an element will only need to decompress or compress a sub-block instead of a full block. Another drawback of compression techniques is that the number of stored elements may vary considerably from one block to another depending on how well the elements are compressed inside each block.

6.15.5 History

The B tree was first described in the paper Organization and Maintenance of Large Ordered Indices. Acta Informatica 1: 173–189 (1972) by Rudolf Bayer and Edward M. McCreight. There is no single paper introducing the B+ tree concept. Instead, the notion of maintaining all data in leaf nodes is repeatedly brought up as an interesting variant. An early survey of B trees also covering B+ trees is Douglas Comer: "The Ubiquitous B-Tree", ACM Computing Surveys 11(2): 121–137 (1979). Comer notes that the B+ tree was used in IBM's VSAM data access software and he refers to an IBM published article from 1973.

6.15.6 See also

• Binary search tree
• B-tree
• Divide and conquer algorithm

6.15.7 References

[1] Elmasri, Ramez; Navathe, Shamkant B. (2010). Fundamentals of database systems (6th ed.). Upper Saddle River, N.J.: Pearson Education. pp. 652–660. ISBN 9780136086208.

[2] http://www.seanster.com/BplusTree/BplusTree.html

[3] Giampaolo, Dominic (1999). Practical File System Design with the Be File System (PDF). Morgan Kaufmann. ISBN 1-55860-497-9.

[4] Ramakrishnan, Raghu; Gehrke, Johannes. Database Management Systems, McGraw-Hill Higher Education (2000), 2nd edition (en), page 267

[5] SQLite Version 3 Overview

[6] CouchDB Guide (see note after 3rd paragraph)

[7] Tokyo Cabinet reference. Archived September 12, 2009, at the Wayback Machine.

6.15.8 External links

• B+ tree in Python, used to implement a list
• Dr. Monge's B+ Tree index notes
• Evaluating the performance of CSB+-trees on Mutithreaded Architectures
• Effect of node size on the performance of cache conscious B+-trees
• Fractal Prefetching B+-trees
• Towards pB+-trees in the field: implementations Choices and performance
• Cache-Conscious Index Structures for Main-Memory Databases
• Cache Oblivious B(+)-trees
• The Power of B-Trees: CouchDB B+ Tree Implementation
• B+ Tree Visualization

Implementations

• Interactive B+ Tree Implementation in C
• Interactive B+ Tree Implementation in C++
• Memory based B+ tree implementation as C++ template library
• Stream based B+ tree implementation as C++ template library
• Open Source JavaScript B+ Tree Implementation
• Perl implementation of B+ trees
• Java/C#/Python implementations of B+ trees
• C# B+Tree implementation, MIT License
• File based B+Tree in C# with threading and MVCC support
• JavaScript B+ Tree, MIT License
• JavaScript B+ Tree, Interactive and Open Source

Chapter 7

Integer and string searching

7.1 Trie

This article is about a tree data structure. For the French commune, see Trie-sur-Baïse.

A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn".

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree: an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are not necessarily associated with every node. Rather, values tend only to be associated with leaves, and with some inner nodes that correspond to keys of interest. For the space-optimized presentation of prefix tree, see compact prefix tree.

In the example shown, keys are listed in the nodes and values below them. Each complete English word has an arbitrary integer value associated with it. A trie can be seen as a tree-shaped deterministic finite automaton. Each finite language is generated by a trie automaton, and each trie can be compressed into a deterministic acyclic finite state automaton.

Though tries are usually keyed by character strings, they need not be. The same algorithms can be adapted to serve similar functions of ordered lists of any construct, e.g. permutations on a list of digits or shapes. In particular, a bitwise trie is keyed on the individual bits making up any fixed-length binary datum, such as an integer or memory address.

7.1.1 History and etymology

Tries were first described by R. de la Briandais in 1959.[1][2]:336 The term trie was coined two years later by Edward Fredkin, who pronounces it /ˈtriː/ (as "tree"), after the middle syllable of retrieval.[3][4] However, other authors pronounce it /ˈtraɪ/ (as "try"), in an attempt to distinguish it verbally from "tree".[3][4][5]

7.1.2 Applications

As a replacement for other data structures

As discussed below, a trie has a number of advantages over binary search trees.[6] A trie can also be used to replace a hash table, over which it has the following advantages:

• Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
• There are no collisions of different keys in a trie.
• Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
• There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
• A trie can provide an alphabetical ordering of the entries by key.

Tries do have some drawbacks as well:

• Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.[7]
• Some keys, such as floating point numbers, can lead to long chains and prefixes that are not particularly meaningful. Nevertheless, a bitwise trie can handle standard IEEE single and double format floating point numbers.
• Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.

Dictionary representation

A common application of a trie is storing a predictive text or autocomplete dictionary, such as found on a mobile telephone. Such applications take advantage of a trie's ability to quickly search for, insert, and delete entries; however, if storing dictionary words is all that is required (i.e., storage of information auxiliary to each word is not required), a minimal deterministic acyclic finite state automaton (DAFSA) would use less space than a trie. This is because a DAFSA can compress identical branches from the trie which correspond to the same suffixes (or parts) of different words being stored.

Tries are also well suited for implementing approximate matching algorithms,[8] including those used in spell checking and hyphenation[4] software.

Term indexing

A discrimination tree term index stores its information in a trie data structure.[9]

7.1.3 Algorithms

Lookup and membership are easily described. The listing below implements a recursive trie node as a Haskell data type. It stores an optional value and a map of children tries, indexed by the next character:

    import Data.Map

    data Trie a = Trie { value    :: Maybe a
                       , children :: Map Char (Trie a) }

We can look up a value in the trie as follows:

    find :: String -> Trie a -> Maybe a
    find []     t = value t
    find (k:ks) t = do
        ct <- Data.Map.lookup k (children t)
        find ks ct

In an imperative style, and assuming an appropriate data type in place, we can describe the same algorithm in Python (here, specifically for testing membership). Note that children is a mapping from characters to a node's children; and we say that a "terminal" node is one which contains a valid word.

    def find(node, key):
        # Follow one child link per character of the key.
        for char in key:
            if char in node.children:
                node = node.children[char]
            else:
                return None
        # The caller checks whether the returned node is terminal.
        return node

Insertion proceeds by walking the trie according to the string to be inserted, then appending new nodes for the suffix of the string that is not contained in the trie. In imperative pseudocode,

    algorithm insert(root : node, s : string, value : any):
        node = root
        i = 0
        n = length(s)

        while i < n:
            if node.child(s[i]) != nil:
                node = node.child(s[i])
                i = i + 1
            else:
                break

        (* append new nodes, if necessary *)
        while i < n:
            node.child(s[i]) = new node
            node = node.child(s[i])
            i = i + 1

        node.value = value

Sorting

Lexicographic sorting of a set of keys can be accomplished with a simple trie-based algorithm as follows:

• Insert all keys in a trie.
• Output all keys in the trie by means of pre-order traversal, which results in output that is in lexicographically increasing order. Pre-order traversal is a kind of depth-first traversal.

This algorithm is a form of radix sort.

A trie forms the fundamental data structure of Burstsort, which (in 2007) was the fastest known string sorting algorithm.[10] However, now there are faster string sorting algorithms.[11]

Full text search

A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.

7.1.4 Implementation strategies

A trie implemented as a doubly chained tree: vertical arrows are child pointers, dashed horizontal arrows are next pointers. The set of strings stored in this trie is {baby, bad, bank, box, dad, dance}. The lists are sorted to allow traversal in lexicographic order.

There are several ways to represent tries, corresponding to different trade-offs between memory use and speed of the operations. The basic form is that of a linked set of nodes, where each node contains an array of child pointers, one for each symbol in the alphabet (so for the English alphabet, one would store 26 child pointers and for the alphabet of bytes, 256 pointers). This is simple but wasteful in terms of memory: using the alphabet of bytes (size 256) and four-byte pointers, each node requires a kilobyte of storage, and when there is little overlap in the strings' prefixes, the number of required nodes is roughly the combined length of the stored strings.[2]:341 Put another way, the nodes near the bottom of the tree tend to have few children and there are many of them, so the structure wastes space storing null pointers.[12]

The storage problem can be alleviated by an implementation technique called alphabet reduction, whereby the original strings are reinterpreted as longer strings over a smaller alphabet. E.g., a string of n bytes can alternatively be regarded as a string of 2n four-bit units and stored in a trie with sixteen pointers per node. Lookups need to visit twice as many nodes in the worst case, but the storage requirements go down by a factor of eight.[2]:347–352

An alternative implementation represents a node as a triple (symbol, child, next) and links the children of a node together as a singly linked list: child points to the node's first child, next to the parent node's next child.[12][13] The set of children can also be represented as a binary search tree; one instance of this idea is the ternary search tree developed by Bentley and Sedgewick.[2]:353

Another alternative, which avoids the array of 256 pointers (ASCII) described above, is to store the alphabet array as a bitmap of 256 bits representing the ASCII alphabet, dramatically reducing the size of the nodes.[14]

Bitwise tries

Bitwise tries are much the same as a normal character-based trie except that individual bits are used to traverse what effectively becomes a form of binary tree. Generally, implementations use a special CPU instruction to very quickly find the first set bit in a fixed length key (e.g., GCC's __builtin_clz() intrinsic). This value is then used to index a 32- or 64-entry table which points to the first item in the bitwise trie with that number of leading zero bits. The search then proceeds by testing each subsequent bit in the key and choosing child[0] or child[1] appropriately until the item is found.
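The lookup just described can be sketched in Python under some stated assumptions. Python has no count-leading-zeros intrinsic, so int.bit_length() stands in for it. Keys are assumed to be unsigned 32-bit integers. The table has one extra slot (33 rather than 32 entries) so that the key 0, for which a hardware count-leading-zeros is typically undefined, can be stored. The class and function names are hypothetical, and this is not how any particular allocator implements the idea.

```python
KEY_BITS = 32  # assumed fixed key width


class Node:
    """Binary trie node: child[0] and child[1] plus an optional value."""
    __slots__ = ("child", "value")

    def __init__(self):
        self.child = [None, None]
        self.value = None


def leading_zeros(key: int) -> int:
    """Stand-in for a count-leading-zeros instruction on a 32-bit key."""
    return KEY_BITS - key.bit_length()


def remaining_bits(key: int):
    """Bits below the first set bit, most significant first."""
    top = key.bit_length() - 1  # position of the first set bit (-1 for key 0)
    return [(key >> i) & 1 for i in range(top - 1, -1, -1)]


class BitwiseTrie:
    def __init__(self):
        # One slot per possible leading-zero count (0..32 inclusive).
        self.table = [None] * (KEY_BITS + 1)

    @staticmethod
    def _check(key: int) -> None:
        if not 0 <= key < (1 << KEY_BITS):
            raise ValueError("key must fit in 32 unsigned bits")

    def insert(self, key: int, value) -> None:
        self._check(key)
        slot = leading_zeros(key)
        if self.table[slot] is None:
            self.table[slot] = Node()
        node = self.table[slot]
        for bit in remaining_bits(key):
            if node.child[bit] is None:
                node.child[bit] = Node()
            node = node.child[bit]
        node.value = value

    def find(self, key: int):
        """Return the stored value, or None if the key is absent."""
        self._check(key)
        node = self.table[leading_zeros(key)]
        for bit in remaining_bits(key):
            if node is None:
                return None
            node = node.child[bit]  # choose child[0] or child[1]
        return None if node is None else node.value


trie = BitwiseTrie()
trie.insert(5, "five")
trie.insert(4, "four")
print(trie.find(5), trie.find(6))  # five None
```

In this sketch a stored value of None cannot be told apart from a missing key; a real implementation would use a separate presence flag.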
Although this process might sound slow, it is very cache-local and highly parallelizable due to the lack of register dependencies and therefore in fact has excellent performance on modern out-of-order execution CPUs. A red-black tree for example performs much better on paper, but is highly cache-unfriendly and causes multiple pipeline and TLB stalls on modern CPUs which makes that algorithm bound by memory latency rather than CPU speed. In comparison, a bitwise trie rarely accesses memory, and when it does, it does so only to read, thus avoiding SMP cache coherency overhead. Hence, it is increasingly becoming the algorithm of choice for code that performs many rapid insertions and deletions, such as memory allocators (e.g., recent versions of the famous Doug Lea's allocator (dlmalloc) and its descendants).

Compressing tries

Compressing the trie and merging the common branches can sometimes yield large performance gains. This works best under the following conditions:

• The trie is mostly static (key insertions to or deletions from a pre-filled trie are disabled).
• Only lookups are needed.
• The trie nodes are not keyed by node-specific data, or the nodes' data are common.[15]
• The total set of stored keys is very sparse within their representation space.

For example, it may be used to represent sparse bitsets, i.e., subsets of a much larger, fixed enumerable set. In such a case, the trie is keyed by the bit element position within the full set. The key is created from the string of bits needed to encode the integral position of each element. Such tries have a very degenerate form with many missing branches. After detecting the repetition of common patterns or filling the unused gaps, the unique leaf nodes (bit strings) can be stored and compressed easily, reducing the overall size of the trie.

Such compression is also used in the implementation of the various fast lookup tables for retrieving Unicode character properties. These could include case-mapping tables (e.g. for the Greek letter pi, from Π to π), or lookup tables normalizing the combination of base and combining characters (like the a-umlaut in German, ä, or the dalet-patah-dagesh-ole in Biblical Hebrew, דַּ֫). For such applications, the representation is similar to transforming a very large, unidimensional, sparse table (e.g. Unicode code points) into a multidimensional matrix of their combinations, and then using the coordinates in the hyper-matrix as the string key of an uncompressed trie to represent the resulting character. The compression will then consist of detecting and merging the common columns within the hyper-matrix to compress the last dimension in the key. For example, to avoid storing the full, multibyte Unicode code point of each element forming a matrix column, the groupings of similar code points can be exploited. Each dimension of the hyper-matrix stores the start position of the next dimension, so that only the offset (typically a single byte) need be stored. The resulting vector is itself compressible when it is also sparse, so each dimension (associated to a layer level in the trie) can be compressed separately.

Some implementations do support such data compression within dynamic sparse tries and allow insertions and deletions in compressed tries. However, this usually has a significant cost when compressed segments need to be split or merged. Some tradeoff has to be made between data compression and update speed. A typical strategy is to limit the range of global lookups for comparing the common branches in the sparse trie.

The result of such compression may look similar to trying to transform the trie into a directed acyclic graph (DAG), because the reverse transform from a DAG to a trie is obvious and always possible. However, the shape of the DAG is determined by the form of the key chosen to index the nodes, in turn constraining the compression possible.

Another compression strategy is to "unravel" the data structure into a single byte array.[16] This approach eliminates the need for node pointers, substantially reducing the memory requirements. This in turn permits memory mapping and the use of virtual memory to efficiently load the data from disk.

One more approach is to "pack" the trie.[4] Liang describes a space-efficient implementation of a sparse packed trie applied to automatic hyphenation, in which the descendants of each node may be interleaved in memory.

External memory tries

Several trie variants are suitable for maintaining sets of strings in external memory, including suffix trees. A trie/B-tree combination called the B-trie has also been suggested for this task; compared to suffix trees, they are limited in the supported operations but also more compact, while performing update operations faster.[17]

7.1.5 See also

• Suffix tree
• Radix tree
• Directed acyclic word graph (aka DAWG)
• Acyclic deterministic finite automata
• Hash trie
• Deterministic finite automata
• Judy array
• Search algorithm
• Extendible hashing
• Hash array mapped trie
• Prefix Hash Tree
• Burstsort
• Luleå algorithm
• Huffman coding
• Ctrie
• HAT-trie

7.1.6 References

[1] de la Briandais, René (1959). File searching using variable length keys. Proc. Western J. Computer Conf. pp. 295–298. Cited by Brass.

[2] Brass, Peter (2008). Advanced Data Structures. Cambridge University Press.

[3] Black, Paul E. (2009-11-16). "trie". Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology. Archived from the original on 2010-05-19.

[4] Franklin Mark Liang (1983). Word Hy-phen-a-tion By Com-put-er (Doctor of Philosophy thesis). Stanford University. Archived from the original (PDF) on 2010-05-19. Retrieved 2010-03-28.

[5] Knuth, Donald (1997). "6.3: Digital Searching". The Art of Computer Programming Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley. p. 492. ISBN 0-201-89685-0.

[6] Bentley, Jon; Sedgewick, Robert (1998-04-01). "Ternary Search Trees". Dr. Dobb's Journal. Dr Dobb's. Archived from the original on 2008-06-23.

[7] Edward Fredkin (1960). "Trie Memory". Communications of the ACM. 3 (9): 490–499. doi:10.1145/367390.367400.

[8] Aho, Alfred V.; Corasick, Margaret J. (Jun 1975). "Efficient String Matching: An Aid to Bibliographic Search" (PDF). Communications of the ACM. 18 (6): 333–340. doi:10.1145/360825.360855.

[9] John W. Wheeler; Guarionex Jordan. "An Empirical Study of Term Indexing in the Darwin Implementation of the Model Evolution Calculus". 2004. p. 5.

[10] "Cache-Efficient String Sorting Using Copying" (PDF). Retrieved 2008-11-15.

[11] "Engineering Radix Sort for Strings." (PDF). Retrieved 2013-03-11.

[12] Allison, Lloyd. "Tries". Retrieved 18 February 2014.

[13] Sahni, Sartaj. "Tries". Data Structures, Algorithms, & Applications in Java. University of Florida. Retrieved 18 February 2014.

[14] Bellekens, Xavier (2014). A Highly-Efficient Memory-Compression Scheme for GPU-Accelerated Intrusion Detection Systems. Glasgow, Scotland, UK: ACM. pp. 302:302–302:309. ISBN 978-1-4503-3033-6. Retrieved 21 October 2015.

[15] Jan Daciuk; Stoyan Mihov; Bruce W. Watson; Richard E. Watson (2000). "Incremental Construction of Minimal Acyclic Finite-State Automata". Computational Linguistics. Association for Computational Linguistics. 26: 3. doi:10.1162/089120100561601. Archived from the original on 2006-03-13. Retrieved 2009-05-28. "This paper presents a method for direct building of minimal acyclic finite states automaton which recognizes a given finite list of words in lexicographical order. Our approach is to construct a minimal automaton in a single phase by adding new strings one by one and minimizing the resulting automaton on-the-fly"

[16] Ulrich Germann; Eric Joanis; Samuel Larkin (2009). "Tightly packed tries: how to fit large models into memory, and make them load fast, too" (PDF). ACL Workshops: Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing. Association for Computational Linguistics. pp. 31–39. "We present Tightly Packed Tries (TPTs), a compact implementation of read-only, compressed trie structures with fast on-demand paging and short load times. We demonstrate the benefits of TPTs for storing n-gram back-off language models and phrase tables for statistical machine translation. Encoded as TPTs, these databases require less space than flat text file representations of the same data compressed with the gzip utility. At the same time, they can be mapped into memory quickly and be searched directly in time linear in the length of the key, without the need to decompress the entire file. The overhead for local decompression during search is marginal."

[17] Askitis, Nikolas; Zobel, Justin (2008). "B-tries for Disk-based String Management" (PDF). VLDB Journal: 1–26. ISSN 1066-8888.

7.1.7 External links

• NIST's Dictionary of Algorithms and Data Structures: Trie
• Tries by Lloyd Allison
• Comparison and Analysis
• Java reference implementation: simple, with prefix compression and deletions.

7.2 Radix tree

An example of a radix tree

In computer science, a radix tree (also radix trie or compact prefix tree) is a data structure that represents a space-optimized trie in which each node that is the only child is merged with its parent. The result is that the number of children of every internal node is at most the radix r of the radix tree, where r is a positive integer and a power of 2, r = 2^x with x ≥ 1. Unlike in regular tries, edges can be labeled with sequences of elements as well as single elements. This makes radix trees much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes.

Unlike regular trees (where whole keys are compared en masse from their beginning up to the point of inequality), the key at each node is compared chunk-of-bits by chunk-of-bits, where the number of bits in that chunk at that node is x = log2 r for a radix trie of radix r. When r is 2, the radix trie is binary (i.e., it compares that node's 1-bit portion of the key), which minimizes sparseness at the expense of maximizing trie depth (that is, maximizing it up to the conflation of nondiverging bit-strings in the key). When r is an integer power of 2 greater than or equal to 4, the radix trie is an r-ary trie, which lessens the depth of the radix trie at the expense of potential sparseness.

As an optimization, edge labels can be stored in constant size by using two pointers to a string (for the first and last elements).[1]

Note that although the examples in this article show strings as sequences of characters, the type of the string elements can be chosen arbitrarily; for example, as a bit or byte of the string representation when using multibyte character encodings or Unicode.

The following pseudo code assumes that these classes exist.

Edge

• Node targetNode
• string label

Node

7.2.1 Applications

Radix trees are useful for constructing associative arrays with keys that can be expressed as strings.
They find particular application in the area of IP routing, where the ability to contain large ranges of values with a few exceptions is particularly suited to the hierarchical organization of IP addresses.[2] They are also used for inverted indexes of text documents in information retrieval. 7.2.2 Operations Radix trees support insertion, deletion, and searching operations. Insertion adds a new string to the trie while trying to minimize the amount of data stored. Deletion removes a string from the trie. Searching operations include (but are not necessarily limited to) exact lookup, find predecessor, find successor, and find all strings with a prefix. All of these operations are O(k) where k is the maximum length of all strings in the set, where length is measured in the quantity of bits equal to the radix of the radix trie. • Array of Edges edges • function isLeaf() function lookup(string x) { // Begin at the root with no elements found Node traverseNode := root; int elementsFound := 0; // Traverse until a leaf is found or it is not possible to continue while (traverseNode != null && !traverseNode.isLeaf() && elementsFound < x.length) { // Get the next edge to explore based on the elements not yet found in x Edge nextEdge := select edge from traverseNode.edges where edge.label is a prefix of x.suffix(elementsFound) // x.suffix(elementsFound) returns the last (x.length - elementsFound) elements of x // Was an edge found? if (nextEdge != null) { // Set the next node to explore traverseNode := nextEdge.targetNode; // Increment elements found based on the label stored at the edge elementsFound += nextEdge.label.length; } else { // Terminate loop traverseNode := null; } } // A match is found if we arrive at a leaf node and have used up exactly x.length elements return (traverseNode != null && traverseNode.isLeaf() && elementsFound == x.length); } Lookup Insertion To insert a string, we search the tree until we can make no further progress. 
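The lookup pseudocode above, together with the edge-splitting insertion that this section describes, can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `Node`, `insert`, `lookup` and `common_prefix_len` are hypothetical, edges are kept in a dictionary keyed by the first element of their label, and a `terminal` flag marks the end of a stored string in place of the empty-string edges mentioned in the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    # Outgoing edges keyed by the first element of each label, so at most
    # one edge can match the next element of a key (assumption of this sketch).
    edges: dict[str, tuple[str, Node]] = field(default_factory=dict)
    # True if a stored string ends exactly at this node; this flag replaces
    # the empty-string edges used in the text.
    terminal: bool = False


def common_prefix_len(a: str, b: str) -> int:
    """Return the length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def insert(root: Node, key: str) -> None:
    node = root
    while key:
        entry = node.edges.get(key[0])
        if entry is None:
            # No edge shares a prefix with the rest of the key:
            # add one new edge labeled with all remaining elements.
            node.edges[key[0]] = (key, Node(terminal=True))
            return
        label, child = entry
        shared = common_prefix_len(label, key)
        if shared < len(label):
            # Split the edge in two; the first part carries the common prefix.
            middle = Node()
            middle.edges[label[shared]] = (label[shared:], child)
            node.edges[key[0]] = (label[:shared], middle)
            child = middle
        node = child
        key = key[shared:]
    node.terminal = True


def lookup(root: Node, key: str) -> bool:
    node = root
    while key:
        entry = node.edges.get(key[0])
        if entry is None:
            return False
        label, child = entry
        if not key.startswith(label):
            # The edge label is not a prefix of the remaining key.
            return False
        key = key[len(label):]
        node = child
    return node.terminal
```

Keying the edges by their first element is one possible design choice: it relies on the property that no two edges leaving a node start with the same element, so selecting the candidate edge takes a single dictionary access.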
Once the search can make no further progress, we either add a new outgoing edge labeled with all remaining elements in the input string, or, if there is already an outgoing edge sharing a prefix with the remaining input string, we split it into two edges (the first labeled with the common prefix) and proceed. This splitting step ensures that no node has more children than there are possible string elements.

Several cases of insertion are shown below, though more may exist. Note that r simply represents the root. It is assumed that edges can be labelled with empty strings to terminate strings where necessary and that the root has no incoming edge. (The lookup algorithm described above will not work when using empty-string edges.)

• Insert 'water' at the root
• Insert 'slower' while keeping 'slow'
• Insert 'test' which is a prefix of 'tester'
• Insert 'team' while splitting 'test' and creating a new edge label 'st'
• Insert 'toast' while splitting 'te' and moving previous strings a level lower

Figure: Finding a string in a Patricia trie. The lookup operation determines if a string exists in a trie. Most operations modify this approach in some way to handle their specific tasks. For instance, the node where a string terminates may be of importance. This operation is similar to tries except that some edges consume multiple elements.

Deletion

To delete a string x from a tree, we first locate the leaf representing x. Then, assuming x exists, we remove the corresponding leaf node. If the parent of our leaf node has only one other child, then that child’s incoming label is appended to the parent’s incoming label and the child is removed.

Additional operations

• Find all strings with common prefix: Returns an array of strings which begin with the same prefix.
• Find predecessor: Locates the largest string less than a given string, by lexicographic order.
• Find successor: Locates the smallest string greater than a given string, by lexicographic order.

7.2.3 History

Donald R. Morrison first described what he called “Patricia trees” in 1968;[3] the name comes from the acronym PATRICIA, which stands for "Practical Algorithm To Retrieve Information Coded In Alphanumeric". Gernot Gwehenberger independently invented and described the data structure at about the same time.[4] PATRICIA tries are radix tries with radix equal to 2, which means that each bit of the key is compared individually and each node is a two-way (i.e., left versus right) branch.

7.2.4 Comparison to other data structures

(In the following comparisons, it is assumed that the keys are of length k and the data structure contains n members.)

Unlike balanced trees, radix trees permit lookup, insertion, and deletion in O(k) time rather than O(log n). This does not seem like an advantage, since normally k ≥ log n, but in a balanced tree every comparison is a string comparison requiring O(k) worst-case time, many of which are slow in practice due to long common prefixes (in the case where comparisons begin at the start of the string). In a trie, all comparisons require constant time, but it takes m comparisons to look up a string of length m. Radix trees can perform these operations with fewer comparisons, and require many fewer nodes.

Radix trees also share the disadvantages of tries, however: as they can only be applied to strings of elements or elements with an efficiently reversible mapping to strings, they lack the full generality of balanced search trees, which apply to any data type with a total ordering. A reversible mapping to strings can be used to produce the required total ordering for balanced search trees, but not the other way around. This can also be problematic if a data type only provides a comparison operation, but not a (de)serialization operation.

Hash tables are commonly said to have expected O(1) insertion and deletion times, but this is only true when considering computation of the hash of the key to be a constant-time operation. When hashing the key is taken into account, hash tables have expected O(k) insertion and deletion times, but may take longer in the worst case depending on how collisions are handled. Radix trees have worst-case O(k) insertion and deletion. The successor/predecessor operations of radix trees are also not implemented by hash tables.

7.2.5 Variants

A common extension of radix trees uses two colors of nodes, 'black' and 'white'. To check if a given string is stored in the tree, the search starts from the top and follows the edges of the input string until no further progress can be made. If the search string is consumed and the final node is a black node, the search has failed; if it is white, the search has succeeded. This enables us to add a large range of strings with a common prefix to the tree, using white nodes, then remove a small set of “exceptions” in a space-efficient manner by inserting them using black nodes.

The HAT-trie is a cache-conscious data structure based on radix trees that offers efficient string storage and retrieval, and ordered iterations. Performance, with respect to both time and space, is comparable to the cache-conscious hashtable.[5][6] See HAT trie implementation notes at

The adaptive radix tree is a radix tree variant that integrates adaptive node sizes into the radix tree. One major drawback of the usual radix tree is its use of space, because it uses a constant node size at every level. The major difference between the radix tree and the adaptive radix tree is its variable size for each node based on the number of child elements, which grows while adding new entries. Hence, the adaptive radix tree leads to a better use of space without reducing its speed.[7][8][9]

7.2.6 See also

• Prefix tree (also known as a Trie)
• Deterministic acyclic finite state automaton (DAFSA)
• Ternary search tries
• Hash trie
• Deterministic finite automata
• Judy array
• Search algorithm
• Extendible hashing
• Acyclic deterministic finite automata
• Hash array mapped trie
• Prefix hash tree
• Burstsort
• Luleå algorithm
• Huffman coding

7.2.7 References

[1] Morin, Patrick. “Data Structures for Strings” (PDF). Retrieved 15 April 2012.
[2] Knizhnik, Konstantin. “Patricia Tries: A Better Index For Prefix Searches”, Dr. Dobb’s Journal, June, 2008.
[3] Morrison, Donald R. Practical Algorithm to Retrieve Information Coded in Alphanumeric
[4] G. Gwehenberger, Anwendung einer binären Verweiskettenmethode beim Aufbau von Listen. Elektronische Rechenanlagen 10 (1968), pp. 223–226
[5] Askitis, Nikolas; Sinha, Ranjan (2007). HAT-trie: A Cache-conscious Trie-based Data Structure for Strings. Proceedings of the 30th Australasian Conference on Computer science. 62. pp. 97–105. ISBN 1-920682-43-0.
[6] Askitis, Nikolas; Sinha, Ranjan (October 2010). “Engineering scalable, cache and space efficient tries for strings”. The VLDB Journal. 19 (5): 633–660. doi:10.1007/s00778-010-0183-9. ISSN 1066-8888. ISSN 0949-877X (Online).
[7] Kemper, Alfons; Eickler, André (2013). Datenbanksysteme, Eine Einführung. 9. pp. 604–605. ISBN 978-3-486-72139-3.
[8] “armon/libart · GitHub”. GitHub. Retrieved 17 September 2014.
[9] http://www-db.in.tum.de/~leis/papers/ART.pdf

7.2.8 External links

• Algorithms and Data Structures Research & Reference Material: PATRICIA, by Lloyd Allison, Monash University
• Patricia Tree, NIST Dictionary of Algorithms and Data Structures
• Crit-bit trees, by Daniel J. Bernstein
• Radix Tree API in the Linux Kernel, by Jonathan Corbet
• Kart (key alteration radix tree), by Paul Jarc

Implementations

• Linux Kernel Implementation, used for the page cache, among other things.
• GNU C++ Standard library has a trie implementation
• Java implementation of Concurrent Radix Tree, by Niall Gallagher
• C# implementation of a Radix Tree
• Practical Algorithm Template Library, a C++ library on PATRICIA tries (VC++ >=2003, GCC G++ 3.x), by Roman S. Klyujkov
• Patricia Trie C++ template class implementation, by Radu Gruian
• Haskell standard library implementation “based on big-endian patricia trees”. Web-browsable source code.
• Patricia Trie implementation in Java, by Roger Kapsi and Sam Berlin
• Crit-bit trees forked from C code by Daniel J. Bernstein
• Patricia Trie implementation in C, in libcprops
• Patricia Trees: efficient sets and maps over integers in OCaml, by Jean-Christophe Filliâtre
• Radix DB (Patricia trie) implementation in C, by G. B. Versiani

7.3 Suffix tree

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

Figure: Suffix tree for the text BANANA. Each substring is terminated with special character $. The six paths from the root to the leaves (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links, drawn dashed, are used during construction.

The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provide one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string’s suffix tree typically requires significantly more space than storing the string itself.

7.3.1 History

The concept was first introduced by Weiner (1973), which Donald Knuth subsequently characterized as “Algorithm of the Year 1973”. The construction was greatly simplified by McCreight (1976), and also by Ukkonen (1995).[1] Ukkonen provided the first online-construction of suffix trees, now known as Ukkonen’s algorithm, with running time that matched the then fastest algorithms. These algorithms are all linear-time for a constant-size alphabet, and have worst-case running time of O(n log n) in general.

Farach (1997) gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach’s algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example, in external memory, compressed, succinct, etc.

7.3.2 Definition

The suffix tree for the string S of length n is defined as a tree such that:[2]

• The tree has exactly n leaves numbered from 1 to n.
• Except for the root, every internal node has at least two children.
• Each edge is labeled with a non-empty substring of S.
• No two edges starting out of a node can have string-labels beginning with the same character.
• The string obtained by concatenating all the string-labels found on the path from the root to leaf i spells out suffix S[i..n], for i from 1 to n.

Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most n − 1 such nodes, and n + (n − 1) + 1 = 2n nodes in total (n leaves, n − 1 internal non-root nodes, 1 root).

Suffix links are a key feature for older linear-time construction algorithms, although most newer algorithms, which are based on Farach’s algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a string (possibly empty), it has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.

7.3.3 Generalized suffix tree

A generalized suffix tree is a suffix tree made for a set of words instead of a single word. It represents all suffixes from this set of words. Each word must be terminated by a different termination symbol or word.

7.3.4 Functionality

A suffix tree for a string S of length n can be built in Θ(n) time, if the letters come from an alphabet of integers in a polynomial range (in particular, this is true for constant-sized alphabets).[3] For larger alphabets, the running time is dominated by first sorting the letters to bring them into a range of size O(n); in general, this takes O(n log n) time. The costs below are given under the assumption that the alphabet is constant.

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S_1, S_2, ..., S_K} of total length n = n_1 + n_2 + ... + n_K. You can:

• Search for strings:
  • Check if a string P of length m is a substring in O(m) time.[4]
  • Find the first occurrence of the patterns P_1, ..., P_q of total length m as substrings in O(m) time.
  • Find all z occurrences of the patterns P_1, ..., P_q of total length m as substrings in O(m + z) time.[5]
  • Search for a regular expression P in time expected sublinear in n.[6]
  • Find, for each suffix of a pattern P, the length of the longest match between a prefix of P[i...m] and a substring in D in Θ(m) time.[7] This is termed the matching statistics for P.
• Find properties of the strings:
  • Find the longest common substrings of the strings S_i and S_j in Θ(n_i + n_j) time.[8]
  • Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time.[9]
  • Find the Lempel–Ziv decomposition in Θ(n) time.[10]
  • Find the longest repeated substrings in Θ(n) time.
  • Find the most frequently occurring substrings of a minimum length in Θ(n) time.
  • Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
  • Find the shortest substrings occurring only once in Θ(n) time.
  • Find, for each i, the shortest substrings of S_i not occurring elsewhere in D in Θ(n) time.

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time.[11] One can then also:

• Find the longest common prefix between the suffixes S_i[p..n_i] and S_j[q..n_j] in Θ(1).[12]
• Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits.[13]
• Find all z maximal palindromes in Θ(n),[14] or Θ(gn) time if gaps of length g are allowed, or Θ(kn) if k mismatches are allowed.[15]
• Find all z tandem repeats in O(n log n + z), and k-mismatch tandem repeats in O(kn log(n/k) + z).[16]
• Find the longest common substrings to at least k strings in D for k = 2, ..., K in Θ(n) time.[17]
• Find the longest palindromic substring of a given string (using the generalized suffix tree of the string and its reverse) in linear time.[18]

7.3.5 Applications

Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology and other application areas.[19] Primary applications include:[19]

• String search, in O(m) complexity, where m is the length of the sub-string (but with initial O(n) time required to build the suffix tree for the string)
• Finding the longest repeated substring
• Finding the longest common substring
• Finding the longest palindrome in a string

Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZW compression schemes use suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines.[20]

7.3.6 Implementation

If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of all the strings on all of the edges in the tree is O(n^2), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a Fibonacci word, giving the full 2n nodes.

An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Other implementations with efficient running time properties use hash maps, sorted or unsorted arrays (with array doubling), or balanced search trees. We are interested in:

• The cost of finding the child on a given character.
• The cost of inserting a child.
• The cost of listing all children of a node (divided by the number of children in the table below).

Let σ be the size of the alphabet. Then you have the following costs:

                                  Lookup      Insertion   Traversal
Sibling lists / unsorted arrays   O(σ)        Θ(1)        Θ(1)
Bitwise sibling trees             O(log σ)    Θ(1)        Θ(1)
Hash maps                         Θ(1)        Θ(1)        O(σ)
Balanced search tree              O(log σ)    O(log σ)    O(1)
Sorted arrays                     O(log σ)    O(σ)        O(1)
Hash maps + sibling lists         O(1)        O(1)        O(1)

Note that the insertion cost is amortised, and that the costs for hashing are given for perfect hashing.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to 20 times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of 8 (for an array including LCP values built within 32-bit address space and 8-bit characters). This factor depends on the properties and may reach 2 with usage of 4-byte wide characters (needed to contain any symbol in some UNIX-like systems, see wchar_t) on 32-bit systems. Researchers have continued to find smaller indexing structures.

7.3.7 Parallel construction

Various parallel algorithms to speed up suffix tree construction have been proposed.[21][22][23][24][25] Recently, a practical parallel algorithm for suffix tree construction with O(n) work (sequential time) and O(log^2 n) span has been developed. The algorithm achieves good parallel scalability on shared-memory multicore machines and can index the 3GB human genome in under 3 minutes using a 40-core machine.[26]

7.3.8 External construction

Though linear, the memory usage of a suffix tree is significantly higher than the actual size of the sequence collection. For a large text, construction may require external memory approaches.

There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach-Colton, Ferragina & Muthukrishnan (2000) is theoretically optimal, with an I/O complexity equal to that of sorting. However, the overall intricacy of this algorithm has so far prevented its practical implementation.[27]

On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (few) GB/hours. The state of the art methods are TDD,[28] TRELLIS,[29] DiGeST,[30] and B²ST.[31]

TDD and TRELLIS scale up to the entire human genome (approximately 3GB), resulting in a disk-based suffix tree of a size in the tens of gigabytes.[28][29] However, these methods cannot efficiently handle collections of sequences exceeding 3GB.[30] DiGeST performs significantly better and is able to handle collections of sequences in the order of 6GB in about 6 hours.[30] All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the input does. The most recent method, B²ST,[31] scales to handle inputs that do not fit in main memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16GB RAM. On a simple Linux cluster with 16 nodes (4GB RAM per node), ERA can index the entire human genome in less than 9 minutes.[32]

7.3.9 See also

• Suffix array
• Generalised suffix tree
• Trie

7.3.10 Notes

[1] Giegerich & Kurtz (1997).
[2] http://www.cs.uoi.gr/~kblekas/courses/bioinformatics/Suffix_Trees1.pdf
[3] Farach (1997).
[4] Gusfield (1999), p.92.
[5] Gusfield (1999), p.123.
[6] Baeza-Yates & Gonnet (1996).
[7] Gusfield (1999), p.132.
[8] Gusfield (1999), p.125.
[9] Gusfield (1999), p.144.
[10] Gusfield (1999), p.166.
[11] Gusfield (1999), Chapter 8.
[12] Gusfield (1999), p.196.
[13] Gusfield (1999), p.200.
[14] Gusfield (1999), p.198.
[15] Gusfield (1999), p.201.
[16] Gusfield (1999), p.204.
[17] Gusfield (1999), p.205.
[18] Gusfield (1999), pp.197–199.
[19] Allison, L. “Suffix Trees”. Retrieved 2008-10-14.
[20] First introduced by Zamir & Etzioni (1998).
[21] Apostolico et al. (Vishkin).
[22] Hariharan (1994).
[23] Sahinalp & Vishkin (1994).
[24] Farach & Muthukrishnan (1996).
[25] Iliopoulos & Rytter (2004).
[26] Shun & Blelloch (2014).
[27] Smyth (2003).
[28] Tata, Hankins & Patel (2003).
[29] Phoophakdee & Zaki (2007).
[30] Barsky et al. (2008).
[31] Barsky et al. (2009).
[32] Mansour et al. (2011).

7.3.11 References

• Apostolico, A.; Iliopoulos, C.; Landau, G. M.; Schieber, B.; Vishkin, U. (1988), “Parallel construction of a suffix tree with applications”, Algorithmica, 3.
• Baeza-Yates, Ricardo A.; Gonnet, Gaston H. (1996), “Fast text searching for regular expressions or automaton searching on tries”, Journal of the ACM, 43 (6): 915–936, doi:10.1145/235809.235810.
• Barsky, Marina; Stege, Ulrike; Thomo, Alex; Upton, Chris (2008), “A new method for indexing genomes using on-disk suffix trees”, CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, New York, NY, USA: ACM, pp. 649–658.
• Barsky, Marina; Stege, Ulrike; Thomo, Alex; Upton, Chris (2009), “Suffix trees for very large genomic sequences”, CIKM '09: Proceedings of the 18th ACM Conference on Information and Knowledge Management, New York, NY, USA: ACM.
• Farach, Martin (1997), “Optimal Suffix Tree Construction with Large Alphabets” (PDF), 38th IEEE Symposium on Foundations of Computer Science (FOCS '97), pp. 137–143.
• Farach, Martin; Muthukrishnan, S. (1996), “Optimal Logarithmic Time Randomized Suffix Tree Construction”, International Colloquium on Automata Languages and Programming.
• Farach-Colton, Martin; Ferragina, Paolo; Muthukrishnan, S. (2000), “On the sorting-complexity of suffix tree construction.”, Journal of the ACM, 47 (6): 987–1011, doi:10.1145/355541.355547.
• Giegerich, R.; Kurtz, S. (1997), “From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction” (PDF), Algorithmica, 19 (3): 331–353, doi:10.1007/PL00009177.
• Gusfield, Dan (1999), Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, ISBN 0-521-58519-8.
• Hariharan, Ramesh (1994), “Optimal Parallel Suffix Tree Construction”, ACM Symposium on Theory of Computing.
• Iliopoulos, Costas; Rytter, Wojciech (2004), “On Parallel Transformations of Suffix Arrays into Suffix Trees”, 15th Australasian Workshop on Combinatorial Algorithms.
• Mansour, Essam; Allam, Amin; Skiadopoulos, Spiros; Kalnis, Panos (2011), “ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings” (PDF), PVLDB, 5 (1): 49–60, doi:10.14778/2047485.2047490.
• McCreight, Edward M. (1976), “A Space-Economical Suffix Tree Construction Algorithm”, Journal of the ACM, 23 (2): 262–272, doi:10.1145/321941.321946, CiteSeerX: 10.1.1.130.8022.
• Phoophakdee, Benjarath; Zaki, Mohammed J. (2007), “Genome-scale disk-based suffix tree indexing”, SIGMOD '07: Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, USA: ACM, pp. 833–844.
• Sahinalp, Cenk; Vishkin, Uzi (1994), “Symmetry breaking for suffix tree construction”, ACM Symposium on Theory of Computing.
• Smyth, William (2003), Computing Patterns in Strings, Addison-Wesley.
• Shun, Julian; Blelloch, Guy E. (2014), “A Simple Parallel Cartesian Tree Algorithm and its Application to Parallel Suffix Tree Construction”, ACM Transactions on Parallel Computing.
• Tata, Sandeep; Hankins, Richard A.; Patel, Jignesh M. (2003), “Practical Suffix Tree Construction”, VLDB '03: Proceedings of the 30th International Conference on Very Large Data Bases, Morgan Kaufmann, pp. 36–47.
• Ukkonen, E.
(1995), “On-line construction of suf- 7.4.2 Example fix trees” (PDF), Algorithmica, 14 (3): 249–260, Consider the text S =banana$ to be indexed: doi:10.1007/BF01206331. The text ends with the special sentinel letter $ that is • Weiner, P. (1973), “Linear pattern matching alunique and lexicographically smaller than any other chargorithms” (PDF), 14th Annual IEEE Symposium acter. The text has the following suffixes: on Switching and Automata Theory, pp. 1–11, These suffixes can be sorted in ascending order: doi:10.1109/SWAT.1973.13. • Zamir, Oren; Etzioni, Oren (1998), “Web document clustering: a feasibility demonstration”, SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA: ACM, pp. 46–54. 7.3.12 External links • Suffix Trees by Sartaj Sahni • NIST’s Dictionary of Algorithms and Data Structures: Suffix Tree • Universal Data Compression Based on the BurrowsWheeler Transformation: Theory and Practice, application of suffix trees in the BWT • Theory and Practice of Succinct Data Structures, C++ implementation of a compressed suffix tree • Ukkonen’s Suffix Tree Implementation in C Part 1 Part 2 Part 3 Part 4 Part 5 Part 6 The suffix array A contains the starting positions of these sorted suffixes: The suffix array with the suffixes written out vertically underneath for clarity: So for example, A[3] contains the value 4, and therefore refers to the suffix starting at position 4 within S , which is the suffix ana$. 7.4.3 Correspondence to suffix trees Suffix arrays are closely related to suffix trees: • Suffix arrays can be constructed by performing a depth-first traversal of a suffix tree. The suffix array corresponds to the leaf-labels given in the order in which these are visited during the traversal, if edges are visited in the lexicographical order of their first character. 
• A suffix tree can be constructed in linear time by using a combination of suffix array and LCP array. For a description of the algorithm, see the corresponding section in the LCP array article. It has been shown that every suffix tree algorithm can be systematically replaced with an algorithm that uses a suf7.4 Suffix array fix array enhanced with additional information (such as the LCP array) and solves the same problem in the same [2] In computer science, a suffix array is a sorted array of time complexity. Advantages of suffix arrays over sufall suffixes of a string. It is a data structure used, among fix trees include improved space requirements, simpler others, in full text indices, data compression algorithms linear time construction algorithms (e.g., compared to Ukkonen’s algorithm) and improved cache locality.[1] and within the field of bioinformatics.[1] Suffix arrays were introduced by Manber & Myers (1990) as a simple, space efficient alternative to suffix trees. They 7.4.4 Space Efficiency have independently been discovered by Gaston Gonnet in 1987 under the name PAT array (Gonnet, Baeza-Yates & Suffix arrays were introduced by Manber & Myers (1990) Snider 1992). in order to improve over the space requirements of suffix trees: Suffix arrays store n integers. Assuming an integer requires 4 bytes, a suffix array requires 4n bytes in 7.4.1 Definition total. This is significantly less than the 20n bytes which are required by a careful suffix tree implementation.[3] Let S = S[1]S[2]...S[n] be a string and let S[i, j] denote However, in certain applications, the space requirements the substring of S ranging from i to j . of suffix arrays may still be prohibitive. Analyzed in The suffix array A of S is now defined to be an array of integers providing the starting positions of suffixes of S in lexicographical order. 
This means, an entry A[i] contains the starting position of the i -th smallest suffix in S and thus for all 1 < i ≤ n : S[A[i − 1], n] < S[A[i], n] . bits, a suffix array requires O(n log n) space, whereas the original text over an
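
As an illustration of the definition in 7.4.1 applied to the example text S = banana$, the following is a minimal Python sketch. It builds the suffix array by directly sorting the suffixes, which is a naive approach and not one of the linear-time constructions mentioned in the text. The function names build_suffix_array and is_valid_suffix_array are hypothetical, chosen only for this example, and positions are 1-based to match the notation S[i, n].

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of all suffixes of text,
    ordered so that the suffixes are in ascending lexicographical order.

    Naive construction (assumption for illustration): sort the positions
    by comparing the suffixes themselves. Python compares strings by code
    point, so the sentinel '$' sorts below every lowercase letter.
    """
    n = len(text)
    # text[i - 1:] is the suffix S[i, n] in the 1-based notation of the text.
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


def is_valid_suffix_array(text, sa):
    """Check that S[A[i - 1], n] < S[A[i], n] holds for all adjacent entries."""
    return all(
        text[sa[k - 1] - 1:] < text[sa[k] - 1:]
        for k in range(1, len(sa))
    )


S = "banana$"
A = build_suffix_array(S)
print(A)  # [7, 6, 4, 2, 1, 5, 3]

# Print rank i, starting position A[i], and the suffix it refers to.
for rank, start in enumerate(A, start=1):
    print(rank, start, S[start - 1:])
```

In this sketch the third entry of A is 4, which refers to the suffix ana$, in agreement with the example. Sorting whole suffixes can take quadratic time or worse on long repetitive texts, so this is suitable only for small inputs.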
