ADVENTURES IN DATA MINING
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841.
Some slides used by permission from Dr. Eamonn Keogh, University of California, Riverside, [email protected]
2/25/09 - GCSU

The 2000 ozone hole over the Antarctic seen by EPTOMS
http://jwocky.gsfc.nasa.gov/multi/multi.html#hole

Data Mining Outline
Introduction
Techniques
  Classification
  Clustering
  Association Rules
Examples
  Explore some interesting data mining applications

Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING

But it isn’t Magic
You must know what you are looking for
You must know how to look for it
Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
Description: Classification (Profiling)
Behavior: Clustering (Similarity)
Associations: Link Analysis

CLASSIFICATION
Assign data into predefined groups or classes.

Classification Ex: Grading
Decision tree on the score x:
x >= 90: A
x < 90:
  x >= 80: B
  x < 80:
    x >= 70: C
    x < 70:
      x >= 60: D
      x < 50: F

Katydids
Given a collection of annotated data (in this case, five instances of katydids and five of grasshoppers), decide what type of insect the unlabeled example is.
Grasshoppers
(c) Eamonn Keogh, [email protected]

Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????   (previously unseen instance)

The classification problem can now be expressed as:
• Given a training database, predict the class label of a previously unseen instance
(c) Eamonn Keogh, [email protected]

[Figure: scatter plot of Antenna Length (vertical axis, 1 to 10) against Abdomen Length (horizontal axis, 1 to 10) for Grasshoppers and Katydids]
(c) Eamonn Keogh, [email protected]

Facial Recognition
(c) Eamonn Keogh, [email protected]

Handwriting Recognition
[Figure: George Washington Manuscript]
(c) Eamonn Keogh, [email protected]

Rare Event Detection

Dallas Morning News, October 7, 2005

CLUSTERING
Partition data into previously undefined groups.

http://149.170.199.144/multivar/ca.htm

What is Similarity?
(c) Eamonn Keogh, [email protected]

Two Types of Clustering
Hierarchical
Partitional
(c) Eamonn Keogh, [email protected]

Hierarchical Clustering Example
Iris Data Set
Versicolor, Setosa, Virginica
• The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.
• Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster
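
Returning to the insect table earlier in this section: the slides do not say which classifier is used, so the following is only a minimal sketch that assumes a k-nearest-neighbor vote with Euclidean distance. The function name `classify` and the choice k=3 are illustrative assumptions, not part of the original talk.

```python
from collections import Counter
from math import dist

# Training rows copied from the insect table:
# (abdomen length, antennae length, insect class)
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def classify(abdomen, antennae, k=3):
    """Return the majority class among the k nearest training insects.

    Assumption: Euclidean distance on the two raw measurements.
    """
    query = (abdomen, antennae)
    neighbors = sorted(TRAINING, key=lambda row: dist((row[0], row[1]), query))
    votes = Counter(label for _, _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Insect 11 from the table (abdomen 5.1, antennae 7.0)
print(classify(5.1, 7.0))
```

Under these assumptions the sketch labels insect 11 a Katydid, because two of its three nearest neighbors (insects 7 and 5) are katydids. A different distance measure or value of k could change the answer for other points.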
ASSOCIATION RULES / LINK ANALYSIS
Find relationships between data

ASSOCIATION RULES EXAMPLES
People who buy diapers also buy beer
If gene A is highly expressed in this disease then gene B is also expressed
Relationships between people
Book Stores
Department Stores
Advertising
Product Placement
http://www.amazon.com/Data-Mining-Introductory-AdvancedTopics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1

Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.

Data Mining Outline
Introduction
Techniques
Examples
  Vision Mining
  Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
  Bioinformatics

Vision Mining
License Plate Recognition
  Red Light Cameras
  Toll Booths
  http://www.licenseplaterecognition.com/
Computer Vision
  http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/

How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facialrecognition1.htm

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
[Figure: No/Little Cheating] (Benton and Hacker, 2007)
[Figure: Rampant Cheating] (Benton and Hacker, 2007)

Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495, 2005, p. 287.
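
For the association rule examples listed earlier in this section (such as "people who buy diapers also buy beer"), the strength of a rule is commonly summarized by its support and confidence. The slides do not define these measures, so the sketch below is an illustration only; the baskets are invented and the function names `support` and `confidence` are assumptions.

```python
# Hypothetical market baskets, invented purely for illustration.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets=BASKETS):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent).

    Returns 0.0 when the antecedent never occurs (an arbitrary convention).
    """
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / base


# Rule: diapers -> beer
print(support({"diapers", "beer"}))
print(confidence({"diapers"}, {"beer"}))
```

With these invented baskets, diapers and beer appear together in 2 of 5 baskets (support 0.4), and 2 of the 3 baskets containing diapers also contain beer (confidence of about 0.67).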
http://www.time.com/time/magazine/article/0,9171,1541283,00.html

DNA
Basic building blocks of organisms
Located in nucleus of cells
Composed of 4 nucleotides
Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63

Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: Amino Acid
www.bioalgorithms.info; chapter 6; Gene Prediction

Human Genome
Scientists originally thought there would be about 100,000 genes
Appear to be about 20,000
WHY?
Almost identical to that of Chimps. What makes the difference?
Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

RNAi – Nobel Prize in Medicine 2006
siRNA may be artificially added to cell!
Double stranded RNA
Short Interfering RNA (~20-25 nt)
RNA-Induced Silencing Complex
Binds to mRNA
Cuts RNA
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3

miRNA
Short (20-25 nt) sequence of noncoding RNA
Known since 1993 but significance not widely appreciated until 2001
Impact / Prevent translation of mRNA
Generally reduce protein levels without impacting mRNA levels (animal cells)
Functions
  Causes some cancers
  Guide embryo development
  Regulate cell Differentiation
  Associated with HIV
  …

TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure panels: C Elegans, Homo Sapiens, Mus Musculus, All Mature; patterns: ACG, CGC, GCG, UCG]

TCGRs for Xue Training Data
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
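
The transcription step shown on the Central Dogma slide earlier in this section can be reproduced with a few lines of code. This is a minimal sketch of the slide's simplified view, in which the RNA is the DNA string with every T replaced by U; the function names `transcribe` and `codons` are illustrative assumptions, and real transcription copies from the complementary template strand.

```python
def transcribe(dna):
    """Return the RNA string for a DNA string by replacing T with U.

    Simplified view from the slide; raises ValueError on non-ACGT input.
    """
    dna = dna.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return dna.replace("T", "U")


def codons(rna):
    """Split an RNA string into consecutive three-letter codons.

    Trailing bases that do not fill a complete codon are dropped.
    """
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


# The DNA sequence from the slide
rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)
print(codons(rna))
```

Applied to the slide's DNA sequence, the sketch reproduces the slide's RNA sequence, CCUGAGCCAACUAUUGAUGAA, which splits into seven codons. Translation into amino acids would additionally need a codon table, which is omitted here.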
[Figure: TCGRs for Xue training data; panels: POSITIVE, NEGATIVE]

Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx

Microarray Data Analysis
Each probe location associated with gene
Measure the amount of mRNA
Color indicates degree of gene expression
Compare different samples (normal/disease)
Track same sample over time
Questions
  Which genes are related to this disease?
  Which genes behave in a similar manner?
  What is the function of a gene?
Clustering
  Hierarchical
  K-means

Microarray Data - Clustering
"Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004

BIG BROTHER ?
Total Information Awareness
  http://infowar.net/tia/www.darpa.mil/iao/index.htm
  http://www.govtech.net/magazine/story.php?id=45918
  http://en.wikipedia.org/wiki/Information_Awareness_Office
Terror Watch List
  http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html
CAPPS
  http://www.theregister.co.uk/2004/04/26/airport_security_failures/
  http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
  http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
  http://en.wikipedia.org/wiki/CAPPS

http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
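
The Microarray Data Analysis slide earlier in this section lists K-means as one way to find genes that behave in a similar manner. The sketch below is a plain k-means loop on made-up expression profiles. The gene names, the values, the fixed starting centroids, and the function names `assign` and `kmeans` are all assumptions for illustration, not data or code from the talk.

```python
from math import dist

# Hypothetical expression levels of four genes at three time points.
PROFILES = {
    "geneA": (1.0, 1.2, 0.9),
    "geneB": (1.1, 1.0, 1.0),
    "geneC": (5.0, 5.2, 4.8),
    "geneD": (5.1, 4.9, 5.0),
}


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
        clusters[nearest].append(point)
    return clusters


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # Mean of each coordinate; an empty cluster keeps its old centroid.
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
    return centroids, assign(points, centroids)


points = list(PROFILES.values())
# Assumption: start from the first and third profiles as initial centroids.
centroids, clusters = kmeans(points, [points[0], points[2]])
print(clusters)
```

On this toy data the two low-expression profiles end up in one cluster and the two high-expression profiles in the other. Real microarray data would involve thousands of genes, and the result of k-means generally depends on the number of clusters and on the starting centroids.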