SAXually Explicit Images: Finding Unusual Shapes

Li Wei, Eamonn Keogh, Xiaopeng Xi
Computer Science & Engineering Dept., University of California – Riverside
Time Series Data Mining Group

Appears as a Google tech talk; Google "Keogh SAXually":
http://video.google.com/videoplay?docid=6642985254445857159

Outline
• Shape Representations and Distance Measures
• Shape Discords (i.e. unusual shapes)
• Algorithm
  – Shape Discords Discovery Framework
  – Approximating the Optimal Ordering
• Empirical Evaluation
• Conclusion

Shape Datasets
[Figure: example shape datasets: fruit fly wings, skulls, sea animals, leaves, petroglyphs, butterflies, nematodes, lizards, arrowheads]

Shape Representations
[Figure: a shape and its time series representation; horizontal axis 0 to 1200]
• We can convert shapes into a 1D signal. By doing this we remove information about scale and offset.
• But we must deal with rotation in our algorithms.
• There are three ways to be rotation invariant: Landmarking, Rotation Invariant Features, and Brute Force Rotation Alignment.

Landmarking*: Find the One "True" Rotation
• Domain Specific Landmarking
  Find some fixed point in your domain, e.g. the nose on a face, the stem of a leaf, the tail of a fish.
• Generic Landmarking
  Find the major axis of the shape and use that as the canonical alignment.
• Problem
  It does not work in many cases.
[Figure: skull outlines compared under Best Rotation Alignment and Generic Landmark Alignment. Labels: Owl Monkey (species unknown), Owl Monkey Northern Gray-Necked, Orangutan; panels A, B, C]
* Xie, J. and Heng, P. Shape Modeling Using Automatic Landmarking. MICCAI 2005.
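As a minimal sketch of the shape-to-signal conversion mentioned under Shape Representations, the code below assumes the 1D signal is the distance from the shape's centroid to each point of its outline, resampled to a fixed length and z-normalized. The function name `centroid_distance_series`, the parameter `n_samples`, and the resampling scheme are illustrative assumptions, not details taken from the talk.

```python
import math

def centroid_distance_series(boundary, n_samples=64):
    """Turn an ordered list of (x, y) outline points into a 1D signal.

    Assumption: the outline points are ordered along the perimeter and
    roughly evenly spaced.
    """
    # Centroid of the outline points.
    cx = sum(x for x, _ in boundary) / len(boundary)
    cy = sum(y for _, y in boundary) / len(boundary)
    # Distance from the centroid to every outline point.
    dists = [math.hypot(x - cx, y - cy) for x, y in boundary]
    # Resample to a fixed length by taking evenly spaced points.
    m = len(dists)
    sampled = [dists[(i * m) // n_samples] for i in range(n_samples)]
    # z-normalize: subtracting the mean removes offset, dividing by the
    # standard deviation removes scale.
    mean = sum(sampled) / n_samples
    std = math.sqrt(sum((d - mean) ** 2 for d in sampled) / n_samples)
    if std == 0:
        return [0.0] * n_samples  # e.g. a perfect circle
    return [(d - mean) / std for d in sampled]

# Usage: an ellipse and a scaled, translated copy give the same signal.
ellipse = [(3 * math.cos(2 * math.pi * i / 100), math.sin(2 * math.pi * i / 100))
           for i in range(100)]
moved = [(2 * x + 5, 2 * y - 7) for x, y in ellipse]
series_a = centroid_distance_series(ellipse, 50)
series_b = centroid_distance_series(moved, 50)
```

Rotating the shape only shifts the starting point of this signal circularly, which is why rotation still has to be handled separately.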
Rotation Invariant Features*
• Possible features include: ratio of perimeter to area, fractal measures, elongatedness, circularity, min/max/mean curvature, entropy, perimeter of convex hull, and histograms.
• Problem
  When rotation information is thrown away, some useful information is inevitably thrown away with it.
[Figure: skulls compared using a histogram of the distances between two randomly chosen points on the perimeter of the shape, and using the 1D centroid representation. Labels: Orangutan, Orangutan (juvenile), Borneo Orangutan, Red Howler Monkey, Mantled Howler Monkey]
* Cardone, A., Gupta, S. K., and Karnik, M. A Survey of Shape Similarity Assessment Algorithms for Product Design and Manufacturing Applications. ASME Journal, 2003.

Brute Force Rotation Alignment
• Idea
  Achieve true rotation invariance by exhaustive brute force search over all possible rotations.
• Rotation Matrix
  Given a time series C of length n, its possible rotations constitute a rotation matrix $\mathbf{C}$ of size n by n:

  $$\mathbf{C} = \begin{pmatrix} C_1 \\ C_2 \\ \vdots \\ C_n \end{pmatrix} = \begin{pmatrix} c_1, & c_2, & \ldots, & c_{n-1}, & c_n \\ c_2, & \ldots, & c_{n-1}, & c_n, & c_1 \\ \vdots & & & & \vdots \\ c_n, & c_1, & c_2, & \ldots, & c_{n-1} \end{pmatrix}$$

• Rotation Invariant Euclidean Distance (RED)

  $$RED(Q, C) = \min_{1 \le j \le n} ED(Q, C_j)$$

• Problem
  High computational cost.
[Figure: panels labeled C1 through C13]
We have forcefully shown this is the right representation; see our VLDB 2006 paper.

Shape Discord
• The shape that is least similar to other shapes in a dataset (or has the largest distance to its nearest match).
[Figure: discords found in example datasets. Labels: SQUID Dataset (subset), 1st Discord; 1st Discord (Castroville Cornertang); 2nd Discord (Martindale point); Specimen 20773, 1st Discord]

Brute Force Shape Discord Discovery
• For each shape in the dataset (row)
• Find the distance to its nearest neighbor (column)
• Check whether it is a better candidate for discord

Algorithm [dist, index] = BruteForce_Search(S)
    best_so_far_dist = 0
    best_so_far_index = NaN
    For p = 1 to |S|
        nearest_neighbor_dist = infinity
        For q = 1 to |S|
            If p != q
                If Dist(Cp, Cq) < nearest_neighbor_dist
                    nearest_neighbor_dist = Dist(Cp, Cq)
                End
            End
        End
        If nearest_neighbor_dist > best_so_far_dist
            best_so_far_dist = nearest_neighbor_dist
            best_so_far_index = p
        End
    End
    Return [best_so_far_dist, best_so_far_index]

Observations from Brute Force Algorithm
[Figure: the same distance matrix for six shapes shown three times, under Brute Force, Early Abandon, and Magic]

          1      2      3      4      5      6    nn_dist   comments
    1     -    19.1    5.9   29.3   19.5   18.4     5.9     bsf_dist = 5.9
    2   19.1     -    10.1   29.0    2.4    3.0     2.4     2.4 < 5.9
    3    5.9   10.1     -    28.1    4.1    8.4     4.1     4.1 < 5.9
    4   29.3   29.0   28.1     -    26.7   28.8    26.7     bsf_dist = 26.7
    5   19.5    2.4    4.1   26.7     -     3.4     2.4     2.4 < 26.7
    6   18.4    3.0    8.4   28.8    3.4     -      3.0     3.0 < 26.7

Order Matters!
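The RED definition and the BruteForce_Search pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: shapes are plain lists of numbers of equal length, and the function names `euclidean`, `red`, and `brute_force_discord` are chosen here for readability.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def red(q, c):
    """Rotation-invariant Euclidean distance: the minimum Euclidean
    distance between q and every circular shift of c (n shifts)."""
    return min(euclidean(q, c[j:] + c[:j]) for j in range(len(c)))

def brute_force_discord(shapes, dist=red):
    """Return (distance, index) of the shape whose nearest neighbor is
    farthest away, checking every pair of shapes."""
    best_so_far_dist, best_so_far_index = 0.0, None
    for p, cp in enumerate(shapes):
        nearest_neighbor_dist = math.inf
        for q, cq in enumerate(shapes):
            if p != q:
                nearest_neighbor_dist = min(nearest_neighbor_dist, dist(cp, cq))
        if nearest_neighbor_dist > best_so_far_dist:
            best_so_far_dist, best_so_far_index = nearest_neighbor_dist, p
    return best_so_far_dist, best_so_far_index

# Usage: three rotations of one shape plus one shape unlike the rest.
shapes = [[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [5, 5, 0, 0]]
discord_dist, discord_index = brute_force_discord(shapes)
```

Each call to `red` costs n Euclidean distance computations and the search makes on the order of |S| squared such calls, which is the high computational cost the slides refer to.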
Heuristic Shape Discord Discovery
• Consider discord candidates in Outer order
• Visit other shapes in Inner order
• Apply early abandoning

Algorithm [dist, index] = Heuristic_Search(S, Outer, Inner)
    best_so_far_dist = 0
    best_so_far_index = NaN
    For each index p given by heuristic Outer
        nearest_neighbor_dist = infinity
        For each index q given by heuristic Inner
            If p != q
                If Dist(Cp, Cq) < best_so_far_dist
                    break
                End
                If Dist(Cp, Cq) < nearest_neighbor_dist
                    nearest_neighbor_dist = Dist(Cp, Cq)
                End
            End
        End
        If nearest_neighbor_dist > best_so_far_dist
            best_so_far_dist = nearest_neighbor_dist
            best_so_far_index = p
        End
    End
    Return [best_so_far_dist, best_so_far_index]

Observations from Heuristic Algorithm
Observation 1
• We do not need a perfect outer ordering.
• It is enough that, among the first few shapes examined, there is at least one that has a large distance to its nearest neighbor.
Observation 2
• We do not need a perfect inner ordering.
• It is enough that, among the first few shapes examined, there is at least one whose distance to the candidate is less than the current value of the best_so_far_dist variable.
• We want the conditional test "Dist(Cp, Cq) < best_so_far_dist" to be true as often as possible!
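A minimal Python sketch of Heuristic_Search follows, written to show how the two orderings change the number of distance calls. Some choices here are assumptions made for illustration: `outer` is a sequence of indices, `inner` is a function mapping a candidate index to a sequence of indices, the extra `calls` counter is added only for measurement, and Python's for/else is used so that a candidate abandoned by the break is never recorded as the discord, which appears to be the intent of the break.

```python
import math

def heuristic_search(shapes, outer, inner, dist):
    """Discord search with early abandoning.

    Returns (best_so_far_dist, best_so_far_index, calls), where calls
    counts how many times dist was evaluated.
    """
    best_so_far_dist, best_so_far_index = 0.0, None
    calls = 0
    for p in outer:
        nearest_neighbor_dist = math.inf
        for q in inner(p):
            if p == q:
                continue
            calls += 1
            d = dist(shapes[p], shapes[q])
            if d < best_so_far_dist:
                break  # early abandon: p cannot be the discord
            nearest_neighbor_dist = min(nearest_neighbor_dist, d)
        else:
            # Reached only when the inner loop was not abandoned.
            if nearest_neighbor_dist > best_so_far_dist:
                best_so_far_dist = nearest_neighbor_dist
                best_so_far_index = p
    return best_so_far_dist, best_so_far_index, calls

# Usage with toy one-dimensional "shapes" and absolute difference as Dist.
toy = [0, 1, 2, 100, 3]
toy_dist = lambda a, b: abs(a - b)
natural = heuristic_search(toy, range(5), lambda p: range(5), toy_dist)
# An outer order that happens to visit the true discord (index 3) first.
lucky = heuristic_search(toy, [3, 0, 1, 2, 4], lambda p: range(5), toy_dist)
```

Both runs return the same discord, but the run that examines the true discord first raises best_so_far_dist immediately, so every later candidate is abandoned after very few distance calls.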
Approximating the Optimal Ordering
• Step 1: symbolize the time series
• Step 2: use locality-sensitive hashing to estimate similarity between shapes
• Step 3: generate heuristics for outer and inner loops
• Keep in mind:
  – Outer heuristic (invoked only once) can take at most O(m) to calculate.
  – Inner heuristic (invoked m times) can take at most O(1) to calculate.

SAX: Symbolic Aggregate approXimation
[Figure: a time series discretized into the SAX word "baabccbc"]
• Lower bounds Euclidean distance
• Achieves dimensionality reduction
• There are now well over 100 SAX papers, see www.cs.ucr.edu/~eamonn/SAX.htm

Locality-sensitive Hash Function*
• Consider a string s of length w over an alphabet Σ and k indices i1, … , ik chosen uniformly at random from the set {1, … , w}. The locality-sensitive hash function f is defined as

  $$f(s) = \langle s[i_1], s[i_2], \ldots, s[i_k] \rangle$$

• For example, f(adad) = aa and f(dada) = dd.
• Property
  – Strings similar to each other are more likely to be hashed to the same value.
* Indyk, P., Motwani, R., Raghavan, P., and Vempala, S. Locality-Preserving Hashing in Multidimensional Spaces. STOC 1997.

Because of rotations, similar shapes may not be hashed to the same value.
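The rotation problem can be reproduced with a small sketch of Steps 1 and 2. It assumes the usual SAX recipe (z-normalize, average over equal-width segments, then discretize with breakpoints that cut the standard normal distribution into equiprobable regions) and a series length divisible by the word length. The function names `sax_word` and `lsh`, and the fixed 0-based indices `(0, 2)` used in place of randomly chosen ones, are illustrative assumptions.

```python
from statistics import NormalDist, mean, pstdev

def sax_word(series, word_len=4, alphabet="abcd"):
    """Convert a numeric series into a SAX word of word_len symbols."""
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma else 0.0 for x in series]
    # Piecewise aggregate approximation: mean of each equal-width segment.
    seg = len(z) // word_len
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_len)]
    # Breakpoints splitting N(0, 1) into len(alphabet) equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    # The symbol index is the number of breakpoints the value exceeds.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)

def lsh(s, indices):
    """Locality-sensitive hash f: keep only the symbols at the chosen
    positions (0-based here)."""
    return "".join(s[i] for i in indices)

# Usage: the same signal, started at two different points on the outline.
word_a = sax_word([0, 0, 9, 9, 0, 0, 9, 9])
word_b = sax_word([9, 9, 0, 0, 9, 9, 0, 0])
hash_a = lsh(word_a, (0, 2))
hash_b = lsh(word_b, (0, 2))
```

The two series describe the same shape up to rotation, yet their SAX words are circular shifts of each other, so picking fixed positions yields different hash values.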
[Figure: two similar images with their time series representations (horizontal axis 0 to 1200) and SAX words: A) adad, B) daca]

Rotation Invariant Locality-sensitive Hash Function
• Consider a string s of length w over an alphabet Σ and k indices i1, … , ik chosen uniformly at random from the set {1, … , w}. The rotation invariant locality-sensitive hash function f ' is defined as

  $$f'(s) = \{ \langle p[i_1], p[i_2], \ldots, p[i_k] \rangle \mid p \in LSHIFTS(s) \}$$

  where LSHIFTS(s) is the set of all possible left shifts of string s.

    Image   SAX Word   Shifts                    LSH Values
    A)      adad       adad, dada                aa, dd
    B)      daca       daca, acad, cada, adac    dc, aa, cd

Generating Heuristics
[Figure: pipeline from images 1 to m, to time series, to SAX words (e.g. bada, daca, adad), to their shifts, to hash buckets keyed by LSH value (aa, ab, ac, ba, bd, ca, cd, db, dc, dd), to an m by m collision matrix]
• Outer order: examine shapes in ascending order of the largest number of collisions each shape has with any other shape.
• Inner order: when candidate shape i is considered in the outer loop, the inner loop examines the other shapes in descending order of the number of collisions they have with shape i.
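The rotation invariant hash and the two orderings can be sketched as below. This is a simplified illustration under stated assumptions: a single set of 0-based indices is used, a collision between two shapes is counted once per hash value they share, and the orderings are produced by sorting rather than within the O(m) and O(1) budgets. The names `lshifts`, `rotation_invariant_lsh`, `collision_counts`, `outer_order`, and `inner_order` are hypothetical.

```python
from collections import defaultdict

def lshifts(s):
    """All left (circular) shifts of the string s."""
    return [s[j:] + s[:j] for j in range(len(s))]

def rotation_invariant_lsh(s, indices):
    """f'(s): the set of hash values obtained from every left shift of s."""
    return {"".join(p[i] for i in indices) for p in lshifts(s)}

def collision_counts(words, indices):
    """counts[i][j] = number of hash values shared by shapes i and j."""
    buckets = defaultdict(set)
    for shape_id, word in enumerate(words):
        for key in rotation_invariant_lsh(word, indices):
            buckets[key].add(shape_id)
    m = len(words)
    counts = [[0] * m for _ in range(m)]
    for members in buckets.values():
        for i in members:
            for j in members:
                if i != j:
                    counts[i][j] += 1
    return counts

def outer_order(counts):
    """Shapes sorted by ascending largest collision count with any other."""
    return sorted(range(len(counts)), key=lambda i: max(counts[i]))

def inner_order(counts, i):
    """Other shapes sorted by descending collision count with shape i."""
    others = (j for j in range(len(counts)) if j != i)
    return sorted(others, key=lambda j: -counts[i][j])

# Usage: the two SAX words from the example table plus an unrelated word.
words = ["adad", "daca", "bbbb"]
counts = collision_counts(words, (0, 2))
outer = outer_order(counts)
inner_for_first = inner_order(counts, 0)
```

In this toy run the word that collides with nothing is placed first in the outer order, making it the first discord candidate, while the inner order for shape A starts with shape B, its likely near neighbor. These orders can then be supplied as the Outer and Inner heuristics of Heuristic_Search.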
The Utility of Shape Discords
[Figure: 1st discords found in several datasets. Recoverable labels: Heliconius melpomene (The Postman), Heliconius erato (Red Passion Flower Butterfly), Dacrocyte; panels A to G with groupings "A, D, E" and "B, C, F"; time series axis 0 to 900]

The Utility of Heuristic Ordered Search
• Datasets
  – Homogeneous: 10,000 projectile points
  – Heterogeneous: 5,844 objects
• Measurement
  – Number of distance function calls by each approach / number of distance function calls by brute force
[Figure: two plots, "Projectile Points" and "Heterogeneous Dataset"; vertical axis 0 to 1]
• Just using early abandoning (which is an original idea in this context) is 3 or 4 orders of magnitude faster than brute force; the Magic heuristic is a further order of magnitude faster.

Conclusion & Future Work
• We define shape discords.
• We introduce a heuristic-based algorithm to efficiently find discords and demonstrate its utility in various domains.
• Future Work
  – Investigate image discords using not only shape but also texture and color
  – Conduct field studies of shape discord discovery in anthropology and archeology

Thank you! Questions?