ACM Distinguished Speakers Program
ADVENTURES IN DATA MINING
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]
2/25/13 - Union University

This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841 and NIH Grant No. 1R21HG005912-01A1.
Some slides used by permission from Dr. Eamonn Keogh, University of California Riverside, [email protected]

The 2000 ozone hole over the Antarctic, seen by EPTOMS
http://jwocky.gsfc.nasa.gov/multi/multi.html#hole

Data Mining Outline
• Introduction
• Techniques
  • Classification
  • Clustering
  • Association Rules
• Examples
  • Explore some interesting data mining applications

Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
  UNCOVER HIDDEN INFORMATION
  DATA MINING

But it isn’t Magic
• You must know what you are looking for
• You must know how to look for it
Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner

CLASSIFICATION
Assign data into predefined groups or classes.

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

Description       Behavior        Associations
Classification    Clustering      Link Analysis
(Profiling)       (Similarity)

Classification Ex: Grading
x >= 90: A
x < 90:
  x >= 80: B
  x < 80:
    x >= 70: C
    x < 70:
      x >= 60: D
      x < 60: F

[Image: Katydids]
Given a collection of annotated data (in this case five instances of katydids and five of grasshoppers), decide what type of insect the unlabeled example is.
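One simple way to make this kind of decision is a nearest-neighbor vote. The sketch below is only an illustration, not the method used in the talk: it assumes Euclidean distance and k = 3, and the names `TRAINING` and `classify` are hypothetical. The measurements are the ten labeled insects from this deck's insect table, and the query point is the deck's unseen instance (abdomen length 5.1, antennae length 7.0).

```python
import math
from collections import Counter

# Labeled insects from the deck's table:
# (abdomen length, antennae length, insect class).
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def classify(abdomen, antennae, k=3):
    """Return the majority class among the k nearest labeled insects.

    Assumption: plain Euclidean distance on the two raw measurements.
    """
    by_distance = sorted(
        TRAINING,
        key=lambda row: math.hypot(row[0] - abdomen, row[1] - antennae),
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# The deck's previously unseen instance (ID 11).
print(classify(5.1, 7.0))  # prints "Katydid" under these assumptions
```

The deck itself leaves the class of the unseen instance open; "Katydid" is only what this particular sketch returns, and a different distance measure or value of k could in principle give a different answer on other inputs.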
[Image: Grasshoppers]
(c) Eamonn Keogh

Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid

previously unseen instance =
11          5.1              7.0               ???????

The classification problem can now be expressed as:
• Given a training database, predict the class label of a previously unseen instance
(c) Eamonn Keogh

[Scatter plot: Antenna Length (1 to 10) against Abdomen Length (1 to 10), with points labeled Grasshoppers and Katydids]
(c) Eamonn Keogh

How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facialrecognition1.htm

Facial Recognition
(c) Eamonn Keogh

Handwriting Recognition
[Plot: George Washington Manuscript; vertical axis 0 to 1, horizontal axis 0 to 450]
(c) Eamonn Keogh

Rare Event Detection

Dallas Morning News, October 7, 2005

Classification Performance
True Positive     False Negative
False Positive    True Negative
© Prentice Hall

Behavior Based Classification/Prediction
• Credit Card Fraud Detection
• Credit Score
• Home Mortgage Approval

CLUSTERING
Partition data into previously undefined groups.

http://149.170.199.144/multivar/ca.htm

What is Similarity?
(c) Eamonn Keogh

Two Types of Clustering
• Hierarchical
• Partitional
(c) Eamonn Keogh

Hierarchical Clustering Example
Iris Data Set
Versicolor    Setosa    Virginica
• The data originally appeared in Fisher, R. A.
(1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.
• Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster

ASSOCIATION RULES / LINK ANALYSIS
Find relationships between data

ASSOCIATION RULES EXAMPLES
• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then gene B is also expressed
• Relationships between people
• Book Stores
• Department Stores
• Advertising
• Product Placement
http://www.amazon.com/Data-Mining-Introductory-AdvancedTopics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1

Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.

Data Mining Outline
• Introduction
• Techniques
• Examples
  • Vision Mining
  • Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
  • Bioinformatics

Vision Mining
• License Plate Recognition
  • Red Light Cameras
  • Toll Booths
  • http://www.licenseplaterecognition.com/
• Computer Vision
  • http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

No/Little Cheating

Rampant Cheating

Jialun Qin, Jennifer J.
Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.

Arnet Miner
http://arnetminer.org/

DNA
• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63

Central Dogma: DNA -> RNA -> Protein
DNA       CCTGAGCCAACTATTGATGAA
  (transcription)
RNA       CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein   Amino Acid
www.bioalgorithms.info; chapter 6; Gene Prediction

Human Genome
• Scientists originally thought there would be about 100,000 genes
• Appear to be about 20,000
• WHY?
• Almost identical to that of Chimps. What makes the difference?
• Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

RNAi – Nobel Prize in Medicine 2006
• Double stranded RNA
• Short Interfering RNA (~20-25 nt)
• RNA-Induced Silencing Complex
• Binds to mRNA
• Cuts RNA
siRNA may be artificially added to cell!
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3

miRNA
• Short (20-25 nt) sequence of noncoding RNA
• Known since 1993 but significance not widely appreciated until 2001
• Impact / Prevent translation of mRNA
• Generally reduce protein levels without impacting mRNA levels (animal cells)
• Functions
  • Causes some cancers
  • Guide embryo development
  • Regulate cell Differentiation
  • Associated with HIV
  • …

TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure panels: C Elegans, Homo Sapiens, Mus Musculus, All Mature; patterns ACG, CGC, GCG, UCG]

TCGRs for Xue Training Data
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X.
Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
[Figure panels: POSITIVE, NEGATIVE]

Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx

BIG BROTHER ?
• Total Information Awareness
  • http://en.wikipedia.org/wiki/Information_Awareness_Office
• Terror Watch List
  • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  • http://blog.wired.com/27bstroke6/2008/02/us-terrorwatch.html
• CAPPS
  • http://en.wikipedia.org/wiki/CAPPS

http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236

My DM Toolbelt
• C, C++
• Perl, Ruby
• Weka
• R, SAS
• Excel, XLMiner
• Vi, word, …
• Grep, sed, …
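
As a small illustration of the transcription step on the Central Dogma slide, the sketch below turns the slide's DNA string into its RNA string. It assumes the simplified view shown there, where the RNA is written by replacing T with U in the DNA sequence as given; the function name `transcribe` is hypothetical and not from the talk.

```python
def transcribe(dna: str) -> str:
    """Return the RNA string for a DNA sequence by replacing T with U.

    Assumption: the input is the strand as written on the slide, so no
    complementing or reversing is done here.
    """
    dna = dna.upper()
    unknown = set(dna) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected characters in DNA sequence: {sorted(unknown)}")
    return dna.replace("T", "U")


# The DNA sequence from the Central Dogma slide.
print(transcribe("CCTGAGCCAACTATTGATGAA"))  # CCUGAGCCAACUAUUGAUGAA
```

The printed string matches the RNA sequence shown on the slide. Translation from RNA to amino acids would need a codon table and is left out of this sketch.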