
Scientific Applications of Data Mining

Bioinformatics Seminar
August 28, 2002
Gary Lindstrom
School of Computing
University of Utah

Outline

What is data mining?
Where has it been successfully applied?
How can it be applied to scientific applications?
Research Opportunities

What Is Data Mining?

One definition (Robert Grossman)
  • Data mining is the semi-automatic discovery of patterns, associations, anomalies, structures, and changes in large data sets

Data Mining Characteristics

  • Large data, vs. small data
  • Discovery, not validation
  • Data driven, not hypothesis driven
  • Automated, not manual application
Supported by
  • Statistics, machine learning, databases, high performance computing

The Data Gap

Exponential growth of data
  • More automation, greater throughput, more models, e.g. simulated
But: linear increase in number of researchers
  • Sift the sand, rather than searching a sensor

Classical Data Mining Applications

Retail
  • Market basket analysis
Political science
  • Targeting campaign resources
Financial
  • Exploiting market trends & imbalances

Decision Support Systems

Generic term for analytic and historic uses of DBs
  • Contrast with: operational uses
  • Commonly known as On-Line Transaction Processing (OLTP)
Data warehouses
  • Data culled from operational DBs, with history and derived summary data

Data Warehouses vs. Databases

  • Replicate data from distributed sources
  • Do not require strict currency of data
  • Oriented toward complex, often statistical queries
  • Often based on materialized views of operational data
      - Views which have been expanded into real tables

Tools for DSS

Ad hoc SQL-style queries
  • Optimized for large, complex data
On-Line Analytic Processing (OLAP)
  • Queries optimized for aggregation operations
  • Data is viewed as multidimensional array
  • Influenced by end-user tools such as spreadsheets
Data mining
  • Exploratory data analysis
  • Looking for interesting unanticipated patterns in the data

Data Warehousing

[Figure labels: External Data Source; EXTRACT, TRANSFORM, LOAD, REFRESH; Metadata Repository; Data Warehouse; SERVES; OLAP, Data Mining, Visualization]

Creating And Maintaining A Warehouse

Challenges
  • Schema design for integrated information
  • Operations
      - Cleaning (curation): filling gaps, correcting errors
      - Transforming: making consistent with new schema
      - Loading: also sorting and summarizing
      - Refreshing: incorporate updates to operation data
      - Purging: aging out old data
Role of metadata
  • Sources of data, schema conversion information, refresh history, etc.
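
The OLAP idea from the Tools for DSS slide, data viewed as a multidimensional array and queried by aggregation, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fact rows, the dimension names (region, product, quarter), and the roll_up helper are all hypothetical and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales amount).
facts = [
    ("West", "pen", "Q1", 120.0),
    ("West", "ink", "Q1", 80.0),
    ("West", "pen", "Q2", 150.0),
    ("East", "pen", "Q1", 200.0),
    ("East", "ink", "Q2", 60.0),
]

# Hypothetical dimension names, in the same order as the row fields.
DIMENSIONS = ("region", "product", "quarter")

def roll_up(facts, keep):
    """Sum the sales measure over every dimension not named in keep."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in idx)  # coordinates along the kept dimensions
        totals[key] += row[-1]            # the measure is the last field
    return dict(totals)

by_region = roll_up(facts, ("region",))
by_region_product = roll_up(facts, ("region", "product"))
grand_total = roll_up(facts, ())
```

Each call keeps some dimensions and aggregates away the rest, which is the kind of query an OLAP tool is optimized for. A real warehouse would precompute or index such summaries rather than rescan the rows on every query.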
OLAP Naturally Leads to Data Mining

Seeks interesting trends or patterns in large datasets
  • An example of exploratory data analysis
  • Related to knowledge discovery and machine learning
Mining for rules
  • Association rules: motivated by retail market basket analysis

Market Basket Analysis

Market basket
  • A collection of items purchased by a customer in one transaction
  • Retailers want to learn of items often purchased together
      - For promotional and display grouping purposes
  • Simple tabular representation
      - Purchases(transid, custid, date, item, price, quantity)

Association Rules

Seek rules of the form: { pen } => { ink }
  • Meaning:
      - If a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction

Important Measures for Association Rules

Support
  • % of transactions containing all items mentioned in rule
  • Low support reduces interest in the rule
Confidence
  • % of transactions containing the LHS that also contain RHS
  • Indicates degree of correlation

Using Association Rules For Prediction

Always somewhat risky
  • Because ultimate goal is understanding causality
  • Which is not directly reflected in transaction data

There Can Be High Support and Confidence

… but no causality
Example: pencils and pens are often bought together
  • And pens and ink are often bought together
  • Hence pencils and ink are often bought together
But there is no causal link between pencils and ink
  • Hence sale promotions on pencils and ink probably won’t be effective

Finding Association Rules

Seek rules with:
  • Support greater than minsup
  • Confidence greater than minconf
Steps
  • Find frequent item sets
      - Sets of items with support >= minsup
  • Break each frequent item set into LHS and RHS of candidate rules
      - Keep those with confidence >= minconf

Testing Candidate Rules

Confidence calculation for each candidate rule
  • Maintain two counters: lhscount, rhscount
  • Scan entire customer transaction table
  • Count in lhscount occurrences of all items in LHS
  • If LHS is present, tally in rhscount if all items in RHS are present

Identifying Frequent Item Sets

The a priori property:
  • Every subset of a frequent item set is also a frequent item set
This leads to an iterative algorithm
  • Identify frequent item sets of one item
  • Iteratively, seek to extend frequent item sets by adding an item

Finding Frequent Itemsets

foreach item, check if it is a frequent itemset
repeat
    foreach new frequent itemset Ik with k items
        generate all itemsets Ik+1 with k+1 items, Ik ⊂ Ik+1
    scan all transactions once and check if the generated (k+1)-itemsets are frequent
until no new frequent itemsets are found

Example: Mining Simulated Combustion Data

Joint work with
  • Brijesh Garabadu, School of Computing
  • Zoran Djurisic, Chem. & Fuels Engg.
The problem
  • Combustion model for powdered coal furnaces
  • Which conditions control NOx pollution?

The Data

Multidimensional space
  • Pressure, fuel mix, oxygen concentration
  • Can explore (simulate) any combination
      - But which to look at?
Need to:
  • Locate relevant subspaces
  • Characterize important events
  • Develop causal hypotheses

Techniques Applied

Cluster analysis
  • Which datasets are similar?
Neural networks
  • Which datasets are interesting?
Decision trees
  • Which features best explain similarities?

Cluster Analysis: Unsupervised Learning

At outset, category structure of the data is unknown
  • All that is known is a collection of observations
Objective: To discover a category structure which fits the observation
  • i.e. finding natural groups in data

Combustion Application

Cluster analysis was used to detect relationships among various species
  • Are the behaviors of any two species related?
  • Is the concentration of one species dependent on that of one or more other species?
One confirmed hypothesis:
  • CH reaches its peak concentration either before or at the same time as H reaches its peak concentration
  • An important engineering observation

Artificial Neural Networks

A general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples
Combustion application
  • Finding out different kinds of patterns (increasing / decreasing, etc.) in the lifetime of a species during the combustion process
  • This can be used to prove various hypotheses as well as to detect patterns of specific species in previously unseen data

Neural Networks: Supervised Learning Application Technique

Training set data are labeled by the user
  • These labeled data are used to train the ANN
The ANN is then used to classify previously unseen data
  • e.g., species in a particular combustion
  • Into a particular pattern class
For example, NO shows two different trends under differing conditions
A trained ANN can be used to classify the datasets according to the trend of NO

Decision Trees

Characterize data by features
  • e.g., species concentration at an instant
Categorize data sets
  • Manually, or use ANN
  • e.g., according to the trend of NO
Use decision tree algorithm to discover clustering criteria

Sample Output

=== Classifier model (full training set) ===

J48 pruned tree
--------------------

CO <= 0.002945
|   OH <= 0.000016
|   |   CO <= 0.000166: yes (17.0/1.0)
|   |   CO > 0.000166: no (3.0)
|   OH > 0.000016: yes (30.0)
CO > 0.002945: no (60.0/1.0)

Research Opportunities

Try it!
  • In your area, on your data, for new results
Features
  • Definition, efficient extraction
Community building
  • Sharing data mining results

PMML

Predictive Model Markup Language
XML-based representation of association rules
Developed by Data Mining Group
  • Industrial and university research collaboration

An Excellent Tutorial

Used for material in this talk
  • Data Mining Scientific and Engineering Applications
      - Tutorial at SC2001, November 12, 2001, by R. Grossman, C. Kamath and V. Kumar
        http://www-users.cs.umn.edu/~kumar/Presentation/sc2001.html
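
As a supplement to the association rule slides earlier in the talk, the support and confidence measures and the level-wise search for frequent itemsets can be sketched in Python. This is a minimal illustration, not the speaker's implementation: the transactions, the function names, and the thresholds are hypothetical, and the code rescans all transactions for every candidate rather than once per level.

```python
from itertools import combinations

# Hypothetical transactions; item names echo the pen/ink example in the slides.
transactions = [
    {"pen", "ink", "pencil"},
    {"pen", "ink"},
    {"pen", "pencil"},
    {"pen", "ink", "notebook"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    lhscount = sum(lhs <= t for t in transactions)
    rhscount = sum((lhs | rhs) <= t for t in transactions)
    return rhscount / lhscount if lhscount else 0.0

def frequent_itemsets(transactions, minsup):
    """Level-wise search: extend each frequent k-itemset by one item."""
    items = sorted(set().union(*transactions))
    current = {frozenset([i]) for i in items
               if support(frozenset([i]), transactions) >= minsup}
    frequent = set(current)
    while current:
        # Candidates are built only from itemsets already known to be frequent.
        candidates = {s | {i} for s in current for i in items if i not in s}
        current = {c for c in candidates if support(c, transactions) >= minsup}
        frequent |= current
    return frequent

def rules(transactions, minsup, minconf):
    """Split each frequent itemset into LHS => RHS and keep confident rules."""
    found = []
    for itemset in frequent_itemsets(transactions, minsup):
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                if confidence(lhs, rhs, transactions) >= minconf:
                    found.append((set(lhs), set(rhs)))
    return found

for lhs, rhs in rules(transactions, minsup=0.5, minconf=0.9):
    print(lhs, "=>", rhs)
```

Extending only frequent itemsets relies on the a priori property: an itemset with an infrequent subset cannot itself be frequent, so it never needs to be generated.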
