DATA MINING APPLICATIONS
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841.
Some slides used by permission from Dr. Eamonn Keogh, University of California Riverside, [email protected]
7/10/07, SEDE'07

[Figure: The 2000 ozone hole over the Antarctic, seen by EPTOMS. Source: http://jwocky.gsfc.nasa.gov/multi/multi.html#hole]

OBJECTIVE
Explore some of the applications of data mining techniques.

Data Mining Applications Outline
• Introduction: Data Mining Overview
  • Classification (Prediction, Forecasting)
  • Clustering
  • Association Rules (Link Analysis)
• Applications
  • Fraud Detection & Illegal Activities
  • Facial Recognition
  • Cheating & Plagiarism
  • Bioinformatics
• Conclusions

Data Mining Overview
• Finding hidden information in a database
• Fit data to a model
• You must know what you are looking for
• You must know how to look for it

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
• Description: Classification (Profiling)
• Behavior: Clustering (Similarity)
• Associations: Link Analysis

Classification Applications
• Teachers classify students’ grades as A, B, C, D, or F.
• Letter Recognition
• Handwriting Recognition
• Phishing: http://computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=cybercrime_hacking&articleId=9002996&taxonomyId=82
• Pluto: http://www.npr.org/templates/story/story.php?storyId=5705254

Classification Example
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
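One way to make this task concrete is a nearest-neighbor vote, sketched below. The slides do not name a classifier, so k-nearest neighbors is swapped in here purely as an illustration. It assumes each insect is described by two measurements, abdomen length and antenna length (the axes of the scatter plot on the next slide). All measurement values, the function name `classify`, and the choice of k = 3 are hypothetical and are not taken from the slides.

```python
import math
from collections import Counter

# Hypothetical (abdomen_length, antenna_length) measurements.
# These numbers are invented for illustration; they are not the slide's data.
TRAINING = [
    ((2.5, 7.0), "Katydid"),
    ((4.0, 8.5), "Katydid"),
    ((5.5, 9.0), "Katydid"),
    ((7.0, 8.0), "Katydid"),
    ((8.5, 9.5), "Katydid"),
    ((1.5, 2.0), "Grasshopper"),
    ((3.0, 3.5), "Grasshopper"),
    ((4.5, 1.5), "Grasshopper"),
    ((6.0, 3.0), "Grasshopper"),
    ((7.5, 4.0), "Grasshopper"),
]


def classify(sample, labeled=TRAINING, k=3):
    """Label `sample` by majority vote among its k nearest labeled points."""
    # Sort the labeled examples by Euclidean distance to the sample.
    nearest = sorted(labeled, key=lambda item: math.dist(sample, item[0]))[:k]
    # Count the labels of the k closest examples and return the most common.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(classify((5.1, 7.0)))  # an unlabeled insect with long antennae
    print(classify((4.0, 2.0)))  # an unlabeled insect with short antennae
```

With real data, the choice of k and of the distance measure would need to be validated rather than fixed in advance.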
[Figure: labeled specimens of Katydids and Grasshoppers. (c) Eamonn Keogh]

[Figure: scatter plot of Antenna Length versus Abdomen Length (both axes 1 to 10) for Grasshoppers and Katydids. (c) Eamonn Keogh]

Clustering Applications
• Targeted Marketing
• Determining Gene Functionality
• Identifying Species
• Clustering vs. Classification
  • No prior knowledge
    • Number of clusters
    • Meaning of clusters
  • Unsupervised learning

[Figure; source: http://149.170.199.144/multivar/ca.htm]

What is Similarity?
[Figure. (c) Eamonn Keogh]

Association Rules Applications
• People who buy diapers also buy beer
• If gene A is highly expressed in this disease, then gene B is also expressed
• Relationships between people
• www.amazon.com
• Book Stores
• Department Stores
• Advertising
• Product Placement

[Figure: DILBERT cartoon, from Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003. DILBERT reprinted by permission of United Feature Syndicate, Inc.]

Fraud Detection
• Identify fraudulent behavior
• Used extensively in the financial, law enforcement, health care, and other sectors
• http://www.aaai.org/AITopics/html/fraud.html
• SPSS: http://www.spss.com/predictiveclaims/fraud_detection.htm
• Neural Technologies: http://www.neuralt.com/fraud_management.html

Law Enforcement
• Identify suspect behavior and relationships
• I2 Inc.
  • Investigative analytic/visualization software
  • http://www.i2inc.com
• Social Network Analysis: analyze patterns of relationships
• Relationships: personal, religious, operational, etc.

[Figure; source: Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.]

[Figure; source: How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facialrecognition1.htm]

Facial Recognition
• Based upon features in the face
• Convert face to a feature vector
• Less invasive than other biometric techniques
• http://www.face-rec.org
• http://computer.howstuffworks.com/facialrecognition.htm
• SIMS: http://www.casinoincidentreporting.com/Products.aspx

[Figure. (c) Eamonn Keogh]

Cheating on Multiple Choice Tests
• Similarity between tests is based on the number of common wrong answers. (George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol 27, no 7, 2000, pp 909-923.)
• The number of common correct answers is often ignored.
• H-H Index (D.N. Harpp, J.J. Hogan, and J.S.
Jennings, 1996, “Crime in the Classroom: Part II, an update,” Journal of Chemical Education, vol 73, no 4, pp 349-351):
  H-H = (Number of exact answers in common) / (Number of different answers)

[Figure; source: Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.]

No/Little Cheating
[Figure; source: Benton and Hacker, Dallas Morning News, June 4, 2007.]

Rampant Cheating
[Figure; source: Benton and Hacker, Dallas Morning News, June 4, 2007.]

DNA
• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together
Source: http://www.visionlearning.com/library/module_viewer.php?mid=63

Central Dogma: DNA -> RNA -> Protein
  DNA:     CCTGAGCCAACTATTGATGAA
           (transcription)
  RNA:     CCUGAGCCAACUAUUGAUGAA
           (translation)
  Protein: PEPTIDE
Source: www.bioalgorithms.info; chapter 6; Gene Prediction

miRNA
• Short (20-25 nt) sequence of noncoding RNA
• Known since 1993, but significance not widely appreciated until 2001
• Impact / prevent translation of mRNA
• Generally reduce protein levels without impacting mRNA levels (animal cells)
• Functions
  • Causes some cancers
  • Guide embryo development
  • Regulate cell differentiation
  • Associated with HIV
  • …

Questions
• If each cell in an organism contains the same DNA:
  • How does each cell behave differently?
  • Why do cells behave differently during childhood?
  • What causes some cells to act differently, such as during disease?
• DNA contains many genes, but only a few are being transcribed. Why?
• One answer: miRNA

[Figure; source: http://www.time.com/time/magazine/article/0,9171,1541283,00.html]

Human Genome
• Scientists originally thought there would be about 100,000 genes
• There appear to be about 20,000
• WHY?
• Almost identical to that of chimps. What makes the difference?
• Visualization from UCR: dnaQT.mov
• Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

RNAi: Nobel Prize in Medicine 2006
• siRNA may be artificially added to a cell!
• Double stranded RNA
• Short Interfering RNA (~20-25 nt)
• RNA-Induced Silencing Complex
• Binds to mRNA
• Cuts RNA
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3

Computer Science & Bioinformatics
• Algorithms
• Data Structures
• Improving efficiency
• Data Mining
• Biologists don’t usually understand or even appreciate what Computer Science can do
• Issues:
  • Scalability
  • Fuzzy
• We will look at:
  • Microarray
  • Clustering
  • TCGR

Affymetrix GeneChip® Array
[Figure; source: http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx]

Microarray Data Analysis
• Each probe location is associated with a gene
• Measure the amount of mRNA
• Color indicates degree of gene expression
• Compare different samples (normal/disease)
• Track same sample over time
• Questions
  • Which genes are related to this disease?
  • Which genes behave in a similar manner?
  • What is the function of a gene?
• Clustering
  • Hierarchical
  • K-means

Microarray Data - Clustering
[Figure; source: "Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol.
101, Issue 3, 811-816, January 20, 2004]

miRNA Research Issues
• Predict / find miRNA in genomic sequence
• Predict miRNA targets
• Identify miRNA functions

Temporal CGR (TCGR)
• 2D array
• Each row represents counts for a particular window in the sequence
  • First row: first window
  • Last row: last window
  • Successive windows start at the next character location
• Each column represents the counts for the associated pattern in that window
  • Initially, the order of patterns is assumed to be alphabetic
• Size of the TCGR depends on sequence length and subpattern length

TCGR Example (cont’d)
[Figure: TCGRs for sub-patterns of length 1, 2, and 3]

TCGR: Mature miRNA (Window=5; Pattern=3)
[Figure: TCGR panels for C. elegans, Homo sapiens, Mus musculus, and All Mature; labeled patterns: ACG, CGC, GCG, UCG]

TCGRs for Xue Training Data
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
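The Temporal CGR description above (one row per window, one column per pattern) can be sketched as a small counting routine. This is only an illustration under stated assumptions, not the author's implementation: the function name `tcgr`, the RNA alphabet ACGU, counting overlapping pattern occurrences, and using only windows that fit entirely inside the sequence are all assumptions. The input is the RNA string from the Central Dogma slide, with Window=5 and Pattern=3 as on the mature miRNA slide.

```python
from itertools import product


def tcgr(sequence, window, pattern_len, alphabet="ACGU"):
    """Build a TCGR-style count matrix (hypothetical helper).

    Returns (patterns, rows): `patterns` lists the column labels in
    alphabetic order, and rows[i][j] is the number of times patterns[j]
    occurs in the window that starts at position i of `sequence`.
    """
    # Columns: every pattern of the given length, in alphabetic order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {p: j for j, p in enumerate(patterns)}

    rows = []
    # Successive windows start at the next character location.
    for start in range(len(sequence) - window + 1):
        win = sequence[start:start + window]
        counts = [0] * len(patterns)
        # Count overlapping occurrences of each pattern inside the window.
        for i in range(window - pattern_len + 1):
            sub = win[i:i + pattern_len]
            if sub in column:  # skip characters outside the alphabet
                counts[column[sub]] += 1
        rows.append(counts)
    return patterns, rows


if __name__ == "__main__":
    rna = "CCUGAGCCAACUAUUGAUGAA"  # RNA string from the Central Dogma slide
    patterns, rows = tcgr(rna, window=5, pattern_len=3)
    print(len(rows), "windows x", len(patterns), "patterns")
    first = {p: c for p, c in zip(patterns, rows[0]) if c}
    print("first window counts:", first)
```

Under these assumptions the matrix has (sequence length - window + 1) rows and 4^pattern_len columns, which matches the slide's remark that the size depends on sequence length and subpattern length.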
[Figure: TCGRs for the POSITIVE and NEGATIVE training examples]

TCGRs for Xue Test Data
[Figure: TCGRs for the POSITIVE and NEGATIVE test examples]

Conclusions
• Not magic
• Doesn’t work for all applications
  • Stock Market Prediction
• Issues
  • Privacy
  • Data
• Here are some infamous examples of failed data mining applications

[Figure; source: Dallas Morning News, October 7, 2005]

http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236

BIG BROTHER?
• Total Information Awareness
  • http://infowar.net/tia/www.darpa.mil/iao/index.htm
  • http://www.govtech.net/magazine/story.php?id=45918
  • http://en.wikipedia.org/wiki/Information_Awareness_Office
• Terror Watch List
  • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  • http://blogs.abcnews.com/theblotter/2007/06/fbi_terror_watc.html
  • http://www.thedenverchannel.com/news/9559707/detail.html
• CAPPS
  • http://www.theregister.co.uk/2004/04/26/airport_security_failures/
  • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
  • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
  • http://en.wikipedia.org/wiki/CAPPS