Fundamentals of Python: From First Programs Through Data Structures
Chapter 19
Unordered Collections: Sets and Dictionaries
Objectives
After completing this chapter, you will be able to:
• Implement a set type and a dictionary type using
lists
• Explain how hashing can help a programmer
achieve constant access time to unordered
collections
• Explain strategies for resolving collisions during
hashing, such as linear probing, quadratic probing,
and bucket/chaining
Fundamentals of Python: From First Programs Through Data Structures
Objectives (continued)
After completing this chapter, you will be able to:
• Use a hashing strategy to implement a set type and
a dictionary type
• Use a binary search tree to implement a sorted set
type and a sorted dictionary type
Using Sets
• A set is a collection of items in no particular order
• Most typical operations:
– Return the number of items in the set
– Test for the empty set (a set that contains no items)
– Add an item to the set
– Remove an item from the set
– Test for set membership
– Obtain the union of two sets
– Obtain the intersection of two sets
– Obtain the difference of two sets
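Each of the operations listed above has a direct counterpart in Python's built-in set type. A brief illustrative session (the variable names are invented for this sketch):

```python
# Two small sets; element order is not significant.
a = {1, 2, 3}
b = {3, 4}

size = len(a)                # number of items in the set
is_empty = len(set()) == 0   # test for the empty set

a.add(5)                     # add an item
a.remove(5)                  # remove an item (raises KeyError if absent)
has_two = 2 in a             # test for set membership

union = a | b                # items in either set
intersection = a & b         # items in both sets
difference = a - b           # items in a but not in b
```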
The Python set Class
A Sample Session with Sets
Applications of Sets
• Sets have many applications in the area of data
processing
– Example: In database management, the answer to a
query that contains a conjunction of two keys can be
constructed from the intersection of the sets of items
associated with those keys
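The database example can be sketched with a small in-memory index. The index contents, the record IDs, and the function name query_and are all hypothetical and serve only to illustrate the idea:

```python
# Hypothetical index: each key maps to the set of record IDs that carry it.
index = {
    "python": {101, 102, 105},
    "hashing": {102, 103, 105},
}

def query_and(index, key1, key2):
    """Return the IDs of records associated with both keys."""
    return index.get(key1, set()) & index.get(key2, set())

matches = query_and(index, "python", "hashing")
print(matches == {102, 105})   # True
```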
Implementations of Sets
• Arrays and lists may be used to contain the data
items of a set
– A linked list has the advantage of supporting
constant-time removals of items
• Once they are located in the structure
• Hashing attempts to approximate random access
into an array for insertions, removals, and searches
Relationship Between Sets and
Dictionaries
• A dictionary is an unordered collection of elements
called entries
– Each entry consists of a key and an associated
value
– A dictionary’s keys must be unique, but its values
may be duplicated
• One can think of a dictionary as having a set of
keys
List Implementations of Sets and
Dictionaries
• The simplest implementations of sets and
dictionaries use lists
• This section presents these implementations and
assesses their run-time performance
Sets
• List implementation of a set
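A minimal sketch of what a list-based set might look like. The class name ListSet and its method names are assumptions made for this illustration and are not necessarily those of the book's listing:

```python
class ListSet:
    """A set backed by a Python list (illustrative sketch)."""

    def __init__(self, items=()):
        self._items = []
        for item in items:
            self.add(item)

    def __len__(self):
        return len(self._items)

    def __contains__(self, item):
        # Linear search of the underlying list
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def add(self, item):
        """Add item only if it is not already present."""
        if item not in self._items:
            self._items.append(item)

    def remove(self, item):
        """Remove item; raise KeyError if it is absent."""
        if item not in self._items:
            raise KeyError(item)
        self._items.remove(item)

    def union(self, other):
        return ListSet(list(self) + list(other))

    def intersection(self, other):
        return ListSet(x for x in self if x in other)

    def difference(self, other):
        return ListSet(x for x in self if x not in other)
```

Every membership test here walks the underlying list, so add, remove, and the in operator all take time proportional to the number of items.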
Dictionaries
• Our list-based implementation of a dictionary is
called ListDict
– The entries in a dictionary consist of two parts, a key
and a value
• A list implementation of a dictionary behaves in
many ways like a list implementation of a set
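A minimal sketch of a list-based dictionary along these lines. Only the name ListDict comes from the slides; the internal layout (a list of [key, value] pairs) and the helper _find are assumptions made for this illustration:

```python
class ListDict:
    """A dictionary backed by a list of [key, value] entries (sketch)."""

    def __init__(self):
        self._entries = []

    def _find(self, key):
        """Return the index of the entry with this key, or -1."""
        for i, entry in enumerate(self._entries):
            if entry[0] == key:
                return i
        return -1

    def __len__(self):
        return len(self._entries)

    def __contains__(self, key):
        return self._find(key) != -1

    def __getitem__(self, key):
        i = self._find(key)
        if i == -1:
            raise KeyError(key)
        return self._entries[i][1]

    def __setitem__(self, key, value):
        i = self._find(key)
        if i == -1:
            self._entries.append([key, value])   # new key
        else:
            self._entries[i][1] = value          # replace existing value

    def pop(self, key):
        """Remove the entry for key and return its value."""
        i = self._find(key)
        if i == -1:
            raise KeyError(key)
        return self._entries.pop(i)[1]

    def keys(self):
        return [entry[0] for entry in self._entries]
```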
Complexity Analysis of the List
Implementations of Sets and
Dictionaries
• The list implementations of sets and dictionaries
require little programmer effort
– Unfortunately, they do not perform well
• Basic accessing methods must perform a linear
search of the underlying list
– Each basic accessing method is O(n)
Hashing Strategies
• Key-to-address transformation or a hashing
function
– Acts on a given key by returning its relative position in
an array
• Hash table
– An array used with a hashing strategy
• Collision
– Placement of different keys at the same array index
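These three terms can be illustrated with integer keys and the remainder operator. The function name home_index and the sample keys are chosen only for this sketch:

```python
def home_index(key, capacity):
    """Map an integer key to a position in an array of the given length."""
    return key % capacity

capacity = 10
table = [None] * capacity          # the hash table

for key in (23, 47, 13):
    index = home_index(key, capacity)
    if table[index] is None:
        table[index] = key
    else:
        # 13 maps to index 3, which 23 already occupies: a collision
        print(f"collision: {key} and {table[index]} both map to {index}")
```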
The Relationship of Collisions to
Density
• Density (load factor)
– The number of keys relative to the length of an array
• As the density decreases, so does the probability
of collisions
• Keeping a very low load factor (say, below 0.2)
seems like a good way to avoid collisions
– Cost of memory incurred by load factors below 0.5 is
probably prohibitive for data sets of millions of items
– Even load factors below 0.5 cannot prevent many
collisions from occurring for some data sets
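A small experiment makes both points concrete. The helper names and key lists below are invented for illustration, and the hashing function is assumed to be the simple remainder key % capacity:

```python
def load_factor(num_keys, capacity):
    """Number of keys relative to the length of the array."""
    return num_keys / capacity

def count_collisions(keys, capacity):
    """Count keys whose home index (key % capacity) is already taken."""
    used = set()
    collisions = 0
    for key in keys:
        index = key % capacity
        if index in used:
            collisions += 1
        else:
            used.add(index)
    return collisions

keys = [3, 7, 13, 17]
print(load_factor(len(keys), 10), count_collisions(keys, 10))   # 0.4 2
print(load_factor(len(keys), 20), count_collisions(keys, 20))   # 0.2 0

# A low load factor does not help when the keys share a pattern:
unlucky = [0, 20, 40, 60]
print(load_factor(len(unlucky), 20), count_collisions(unlucky, 20))  # 0.2 3
```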
Hashing with Non-Numeric Keys
• Try returning the sum of the ASCII values in the
string
• This method has the effect of producing the same hash value for
anagrams
– Strings that contain same characters, but in different
order
• First letters of many words in English are unevenly
distributed
– This might have the effect of weighting or biasing the
sums generated
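The anagram problem is easy to reproduce. The function name ascii_sum_hash is an assumption made for this sketch:

```python
def ascii_sum_hash(text):
    """Sum the character codes of a string (ASCII values for ASCII text)."""
    return sum(ord(ch) for ch in text)

# Anagrams contain the same characters, so their sums are identical.
print(ascii_sum_hash("listen") == ascii_sum_hash("silent"))   # True
```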
Hashing with Non-Numeric Keys
(continued)
• One solution:
– If length of string is greater than a certain threshold
• Drop first character from string before computing sum
• Can also subtract the ASCII value of the last character
• Python also includes a standard hash function for
use in hashing applications
– Function can receive any hashable Python object as an
argument and returns an integer; equal objects get
equal values, but the integer is not guaranteed unique
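A sketch of both ideas on this slide. The function names and the threshold of 4 are assumptions chosen for illustration; hash itself is Python's built-in function:

```python
def string_hash(text, threshold=4):
    """Sum character codes, damping the effect of first and last characters.

    The threshold of 4 is an arbitrary choice for this sketch.
    """
    if len(text) > threshold:
        text = text[1:]                 # drop the first character
    total = sum(ord(ch) for ch in text)
    if text:
        total -= ord(text[-1])          # subtract the last character's value
    return total

def home_index(key, capacity):
    """Use Python's built-in hash to pick an array position."""
    return abs(hash(key)) % capacity

# The two anagrams no longer produce the same sum.
print(string_hash("listen"), string_hash("silent"))   # 437 424
```

Note that the built-in hash accepts only hashable objects (calling it on a list raises TypeError), and that its results for strings normally vary from one interpreter run to the next, so indexes computed from them should not be stored between runs.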
Linear Probing
• Linear probing
– Simplest way to resolve a collision
– Search array, starting from collision spot, for the first
available position
• At the start of an insertion, the hashing function is
run to compute the home index of the item
– If cell at home index is not available, move index to
the right to probe for an available cell
– When search reaches last position of array, probing
wraps around to continue from the first position
Linear Probing (continued)
• For retrievals, stop probing process when current
array cell is empty or it contains the target item
• Removals probe in the same way
– If target item is found, its cell is set to DELETED
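Insertion, retrieval, and removal with linear probing might be sketched as follows. The function names and the choice of sentinels are assumptions for this illustration; the sketch also assumes that None is never stored as an item and does not check for duplicates on insertion:

```python
EMPTY = None
DELETED = object()     # sentinel marking a cell whose item was removed

def probe_insert(table, item):
    """Insert item using linear probing; return the index used."""
    index = abs(hash(item)) % len(table)        # home index
    for _ in range(len(table)):
        if table[index] is EMPTY or table[index] is DELETED:
            table[index] = item
            return index
        index = (index + 1) % len(table)        # move right, wrapping around
    raise RuntimeError("table is full")

def probe_find(table, item):
    """Return the index of item, or -1 if it is not in the table."""
    index = abs(hash(item)) % len(table)
    for _ in range(len(table)):
        cell = table[index]
        if cell is EMPTY:
            return -1                           # an empty cell ends the search
        if cell is not DELETED and cell == item:
            return index
        index = (index + 1) % len(table)
    return -1

def probe_remove(table, item):
    """Mark item's cell as DELETED; return True if item was found."""
    index = probe_find(table, item)
    if index == -1:
        return False
    table[index] = DELETED
    return True
```

Marking a cell DELETED rather than EMPTY matters: a search for an item that was placed beyond the removed one must keep probing past that cell instead of stopping there.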
Linear Probing (continued)
• Problem: After several insertions/removals, an item may be
farther away from its home index than it needs to be
– Increasing the average overall access time
• Two ways to deal with this problem:
– After a removal, shift items on the cell’s right over to
the cell’s left until an empty cell, a currently occupied
cell, or the home indexes for each item are reached
– Regularly rehash the table (e.g., when load factor reaches 0.5)
• Clustering: Occurs when items causing a collision
are relocated to the same region within the array
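The second remedy, rehashing, can be sketched as copying the live items into a larger array. The function names, the sentinels, and the growth factor of 2 are assumptions for this illustration; the factor must be at least 2 so that the new array always has room:

```python
EMPTY = None
DELETED = object()     # sentinel marking a cell whose item was removed

def rehash(table, factor=2):
    """Return a larger table with the live items reinserted by linear probing."""
    new_table = [EMPTY] * (len(table) * factor)
    for item in table:
        if item is EMPTY or item is DELETED:
            continue                      # skip unused and removed cells
        index = abs(hash(item)) % len(new_table)
        while new_table[index] is not EMPTY:
            index = (index + 1) % len(new_table)
        new_table[index] = item
    return new_table

def load_factor(table):
    """Fraction of cells that hold a live item."""
    live = sum(1 for cell in table if cell is not EMPTY and cell is not DELETED)
    return live / len(table)

# 7 had been displaced from home index 3 to index 0 in the small table.
old = [7, DELETED, EMPTY, 3]
new = rehash(old)
print(load_factor(old), load_factor(new))   # 0.5 0.25
print(new.index(3), new.index(7))           # 3 7
```

After rehashing, the DELETED markers are gone and each item in this example sits at its home index again.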
Quadratic Probing
• To avoid clustering associated with linear probing,
we can advance the search for an empty position a
considerable distance from the collision point
– Quadratic probing: Increments the home index by
the square of a distance on each attempt
• Problem: By jumping over some cells, one or more
of them might be missed
– Can lead to some wasted space
Quadratic Probing (continued)
• Here is the code for insertions, updated to use
quadratic probing:
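A minimal sketch of such an insertion, not necessarily identical to the book's listing. The function name, the sentinels, and the bounded number of attempts are assumptions made here:

```python
EMPTY = None
DELETED = object()     # sentinel marking a cell whose item was removed

def quadratic_insert(table, item):
    """Insert item using quadratic probing; return the index used."""
    home = abs(hash(item)) % len(table)
    index = home
    distance = 1
    # Bound the number of attempts: quadratic probing can skip cells,
    # so an unbounded loop might never reach the remaining free cells.
    for _ in range(len(table)):
        if table[index] is EMPTY or table[index] is DELETED:
            table[index] = item
            return index
        # Offsets from the home index are 1, 4, 9, ..., wrapping around
        index = (home + distance ** 2) % len(table)
        distance += 1
    raise RuntimeError("no free cell found along the probe sequence")

table = [EMPTY] * 10
print([quadratic_insert(table, key) for key in (3, 13, 23, 33)])   # [3, 4, 7, 2]
```

All four keys share home index 3, yet they end up at offsets 0, 1, 4, and 9 from it rather than in adjacent cells.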
Chaining
• Items are stored in an array of linked lists (chains)
– Each item’s key locates the bucket (index) of the
chain in which the item resides or is to be inserted
• Retrieval and removal each perform these steps:
– Compute the item’s home index in the array
– Search the linked list at that index for the item
• To insert an item:
– Compute the item’s home index in the array
– If cell is empty, create a node with item and assign
the node to cell; else (collision), insert item in chain
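The steps above can be sketched with a small singly linked node class. The names Node, chain_insert, chain_contains, and chain_remove are assumptions for this illustration, and new items are placed at the head of the chain, which is one of several possible choices:

```python
class Node:
    """A singly linked node holding one item."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def chain_insert(table, item):
    """Insert item at the head of the chain at its home index."""
    index = abs(hash(item)) % len(table)
    # If the cell is empty, table[index] is None and the new node becomes
    # a chain of length 1; otherwise it links to the old head (collision).
    table[index] = Node(item, table[index])

def chain_contains(table, item):
    """Search the chain at the item's home index."""
    node = table[abs(hash(item)) % len(table)]
    while node is not None:
        if node.data == item:
            return True
        node = node.next
    return False

def chain_remove(table, item):
    """Unlink item from its chain; return True if it was found."""
    index = abs(hash(item)) % len(table)
    prev, node = None, table[index]
    while node is not None:
        if node.data == item:
            if prev is None:
                table[index] = node.next
            else:
                prev.next = node.next
            return True
        prev, node = node, node.next
    return False
```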
Complexity Analysis
• Linear probing: Complexity depends on load factor
(D) and tendency of items to cluster
– Worst case (method traverses entire array before
locating item’s position): behavior is linear
– Average behavior in searching for an item that
cannot be found is $\frac{1}{2}\left[1 + \frac{1}{(1 - D)^2}\right]$
• Quadratic probing: Tends to mitigate clustering
– Average search complexity is $1 - \log_e(1 - D) - \frac{D}{2}$
for the successful case and $\frac{1}{1 - D} - D - \log_e(1 - D)$
for the unsuccessful case
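To get a feel for these formulas, they can be evaluated for a few load factors. The function names are invented for this sketch, and the results are expected numbers of probes:

```python
import math

def linear_unsuccessful(d):
    """Linear probing, item not found: (1/2)[1 + 1/(1 - D)^2]."""
    return 0.5 * (1 + 1 / (1 - d) ** 2)

def quadratic_successful(d):
    """Quadratic probing, item found: 1 - ln(1 - D) - D/2."""
    return 1 - math.log(1 - d) - d / 2

def quadratic_unsuccessful(d):
    """Quadratic probing, item not found: 1/(1 - D) - D - ln(1 - D)."""
    return 1 / (1 - d) - d - math.log(1 - d)

for d in (0.25, 0.5, 0.75, 0.9):
    print(f"D={d:<5} linear miss={linear_unsuccessful(d):6.2f}  "
          f"quadratic hit={quadratic_successful(d):5.2f}  "
          f"quadratic miss={quadratic_unsuccessful(d):5.2f}")
```

At D = 0.5 an unsuccessful linear-probing search costs 2.5 probes on average; at D = 0.9 the same formula gives about 50.5, which shows how quickly performance degrades as the table fills.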
Complexity Analysis (continued)
• Chaining:
– Locating an item consists of two parts:
• Computing home index → constant time behavior
• Searching linked list upon a collision → linear
– Worst case (all items that have collided with each
other are in one chain, which is a linked list): O(n)
– If lists are evenly distributed in array and array is
fairly large, the second part can be close to constant
– Best case (a chain of length 1 occupies each array
cell): O(1)
Case Study: Profiling Hashing
Strategies
• Request:
– Write a program that allows a programmer to profile
different hashing strategies
• Analysis:
– Should allow programmer to gather statistics on number of
collisions caused by the hashing strategies
– Other useful information:
• Hash table’s load factor
• Number of probes needed to resolve collisions during
linear or quadratic probing
Case Study: Profiling Hashing
Strategies (continued)
• Analysis (continued):
– Here are the profiler’s results:
• Design:
– Profiler class requires instance variables to track
a table, number of collisions, and number of probes
Case Study: Profiling Hashing
Strategies (continued)
• Implementation:
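A minimal sketch of a profiler with the three instance variables named in the design. Only the class name Profiler comes from the slides; the method names, the quadratic flag, and the convention of counting every examined cell (including the home cell) as one probe are assumptions for this illustration:

```python
class Profiler:
    """Counts collisions and probes while filling a hash table (sketch)."""

    def __init__(self):
        self._table = []
        self._collisions = 0
        self._probes = 0

    def test(self, data, capacity, quadratic=False):
        """Insert each item of data into a fresh table of the given capacity."""
        self._table = [None] * capacity
        self._collisions = 0
        self._probes = 0
        for item in data:
            self._insert(item, quadratic)

    def _insert(self, item, quadratic):
        size = len(self._table)
        home = abs(hash(item)) % size
        index = home
        distance = 1
        if self._table[index] is not None:
            self._collisions += 1          # home cell already taken
        for _ in range(size):
            self._probes += 1              # one probe per cell examined
            if self._table[index] is None:
                self._table[index] = item
                return
            step = distance ** 2 if quadratic else distance
            index = (home + step) % size
            distance += 1
        raise RuntimeError("no free cell found")

    def load_factor(self):
        used = sum(1 for cell in self._table if cell is not None)
        return used / len(self._table)

    def collisions(self):
        return self._collisions

    def probes(self):
        return self._probes

p = Profiler()
p.test([3, 13, 23, 4], 10)
print(p.load_factor(), p.collisions(), p.probes())   # 0.4 3 9
p.test([3, 13, 23, 4], 10, quadratic=True)
print(p.load_factor(), p.collisions(), p.probes())   # 0.4 3 8
```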
Hashing Implementation of
Dictionaries
• HashDict uses the bucket/chaining strategy
– To manage the array, declare three instance
variables: _table, _size, and _capacity
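A minimal sketch of a dictionary built on these three instance variables. The names HashDict, _table, _size, and _capacity come from the slide; everything else is an assumption, including the use of Python lists of [key, value] entries as buckets in place of linked chains and the default capacity of 8:

```python
class HashDict:
    """A dictionary using bucket/chaining (illustrative sketch)."""

    def __init__(self, capacity=8):
        self._capacity = capacity
        self._size = 0
        # Each bucket is a Python list of [key, value] entries,
        # standing in for a linked chain.
        self._table = [[] for _ in range(capacity)]

    def _bucket(self, key):
        return self._table[abs(hash(key)) % self._capacity]

    def __len__(self):
        return self._size

    def __contains__(self, key):
        return any(entry[0] == key for entry in self._bucket(key))

    def __getitem__(self, key):
        for entry in self._bucket(key):
            if entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for entry in bucket:
            if entry[0] == key:
                entry[1] = value           # replace existing value
                return
        bucket.append([key, value])        # new key
        self._size += 1

    def pop(self, key):
        """Remove the entry for key and return its value."""
        bucket = self._bucket(key)
        for i, entry in enumerate(bucket):
            if entry[0] == key:
                self._size -= 1
                return bucket.pop(i)[1]
        raise KeyError(key)

    def keys(self):
        return [entry[0] for bucket in self._table for entry in bucket]
```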
Hashing Implementation of Sets
• The design of the methods for HashSet is the
same as that of the methods in HashDict, except:
– __contains__ searches for an item (not key)
– add inserts item only if it is not already in the set
– A single iterator method is included instead of
separate methods that return keys and values
Sorted Sets and Dictionaries
• Each item added to a sorted set must be
comparable with its other items
– Same applies for keys added to a sorted dictionary
• The iterator for each type of collection guarantees
its users access to items or keys in sorted order
• Implementation alternatives:
– List-based: must maintain a sorted list of the items
– Hashing implementation: not feasible
– Binary search tree implementation: generally provides
logarithmic access to data items
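The binary search tree alternative might be sketched as follows. The class names are assumptions for this illustration, and the tree is not rebalanced, so access is logarithmic only while the tree stays reasonably bushy; inserting items in already sorted order degrades it to linear:

```python
class _Node:
    def __init__(self, item):
        self.item = item
        self.left = None
        self.right = None

class TreeSortedSet:
    """A sorted set backed by an unbalanced binary search tree (sketch)."""

    def __init__(self):
        self._root = None
        self._size = 0

    def __len__(self):
        return self._size

    def add(self, item):
        """Insert item unless an equal item is already present."""
        if self._root is None:
            self._root = _Node(item)
            self._size = 1
            return
        node = self._root
        while True:
            if item == node.item:
                return                      # duplicates are ignored
            branch = "left" if item < node.item else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, _Node(item))
                self._size += 1
                return
            node = child

    def __contains__(self, item):
        node = self._root
        while node is not None:
            if item == node.item:
                return True
            node = node.left if item < node.item else node.right
        return False

    def __iter__(self):
        """Yield items in sorted order (in-order traversal)."""
        stack, node = [], self._root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node.item
            node = node.right

s = TreeSortedSet()
for x in [5, 2, 8, 2, 1, 9]:
    s.add(x)
print(list(s))   # [1, 2, 5, 8, 9]
```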
Summary
• A set is an unordered collection of items
– Each item is unique
– List-based implementation → linear-time access
– Hashing implementation → constant-time access
• Items in a sorted set can be visited in sorted order
– A tree-based implementation of a sorted set
supports logarithmic-time access
• A dictionary is an unordered collection of entries,
where each entry consists of a key and a value
– Each key is unique, but values may be duplicated
Summary (continued)
• A sorted dictionary imposes an ordering by
comparison on its keys
• Implementations of both types of dictionaries are
similar to those of sets
• Hashing: Technique for locating an item in constant
time
– Techniques to resolve collisions: linear collision
processing, quadratic collision processing, chaining
– The run-time and memory aspects involve the load
factor of the array