Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Redundant Bit Vectors for the Audio Fingerprinting Server John Platt Jonathan Goldstein Chris Burges Structure of Talk 1. 2. 3. 4. 5. 6. Problem Statement Problems with Existing Techniques Bit Vectors Partitioning the Query Space Results Future Extensions Problem Statement Find all hyperspheres that overlap a query point S1 S2 S3 Query is song 2 Query S4 S5 Data sphere Centers of spheres = undistorted fingerprint of song Radius of sphere = acceptable distortion of fingerprint Sphere overlaps query = query song is same as dB song Existing Techniques Linear scan For each sphere, check if query is inside Linear effort in size of dB Previous best known technique Data partitioning (R-trees, SS-trees, etc) Store data in a tree of shapes Shapes chop up space Descend and backtrack in tree, finding all leaf nodes that could match query Performs worse than linear scan!!! Key New Ideas New Algorithm: Redundant Bit Vectors (RBV) Partition the queries, not the data 1. Avoids hopeless task Store each data point redundantly 2. Combats high dimensionality Use bit vectors to index database 3. Small & fast representation Data Structure: Bit Vectors Bit vectors represent every data point as one bit 1 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 1 1 1 1 1 & 1 & 0 & 1 → 0 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 1 Point N 1 Point 1 Most examples are excluded from a final linear scan 0 means exclude this example from linear search 1 means linear search must still look at example Why Bit Vectors are Good Small memory footprint 1 bit per example per bit vector Fast on modern CPUs 1 CPU cycle operates on 32 examples per clock cycle Compare to Euclidean distance 3 operations/example/dimension Potential speed up: 96x !!! we use 1 bit vector per dimension in lookup Partition the Queries, not the Data Query 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 bin 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 For each query dimension, • dimension indexed into bins • bit vector associated with each bin • when query falls into bin … use the corresponding bit vector AND together bit vectors for some or all dimensions Each dimension trims examples Perform linear scan on survivors How We Compute the RBVs At index building time, construct the bit vectors: S1 S2 S3 1 1 0 0 0 0 1 0 1 0 0 0 S4 S5 S6 0 0 1 1 0 1 0 0 0 1 1 1 Project spheres into each dimension i th vector, j th bit = does sphere j overlap bin i ? How We Decide on the Bin Edges Two equivalent heuristics each bin should have ~ same number of spheres adjacent bit vectors should be ~ constant Hamming distance apart Place bin edges equal num. of sphere edges apart 6 3 S2 S1 S3 9 S4 12 S5 S6 Improve Selectivity by Shrinking Boxes Fingerprinting with hyperspheres (L2 norm) Low, but non-zero false negative rate Bit vectors implement hyperrectangles (L∞ norm) bit vectors guaranteed to never introduce false positives bit vectors empirically found to introduce no extra false positives We shrink the hyperrectangles to speed up final linear search Speed Comparisons Ran 1000 queries against fingerprint database Database size = 240K 64 dimensional points 14 bit vector dimensions used 32 bins per dimension chosen to optimize bit vector + linear scan speed more dim: bit vectors slow down, linear scan speeds up chosen to optimize memory/speed tradeoff Pentium 4, 2.2 GHz Results Method Queries/sec 14-D RBVs + 64-D limited L2 scan 1490 Factor Slower Than RBVs 1 64-D full L2 scan 31 48 14-D Hilbert R-trees 10 149 • all linear scans used early bailing • measured code by itself, not in context of SQL or IIS Code Details ~600 lines of C++ (pretty simple) Not integrated with SQL Server or IIS, etc. Running as part of audio fingerprinting demo How Fast Is It Really? You tell us: it depends on several factors Linear time in size of database Linear time in amount of resistance to cropping Sorting by popularity may help substantially Resistance to Cropping How much “crop slop” do you need? location of fingerprint true start of song largest acceptable crop to recognize current system = 5.4 traces / second of slop need to search this portion of song possible to reduce by 2x server load linear in number of traces Popularity Sorting May Help Order database by approximate popularity of music Split search into different sections 5000 most popular 50000 next most popular first search here if not found, search here if not found, search here the rest May yield substantial speed gain Memory Performance Bit Vectors can stay resident in memory For 240K songs All fingerprints live in 128M Bit vector indices only require 13M We can store fingerprints as 2-byte short, save 2x mem. Bit vector search blows out of cache speed depends on memory bandwidth of server Summary Bit Vectors are Simple, Small, and Fast Must be used to get good server-side performance