Scientific Applications of Data Mining
Bioinformatics Seminar, August 28, 2002
Gary Lindstrom
School of Computing, University of Utah

Outline
  • What is data mining?
  • Where has it been successfully applied?
  • How can it be applied to scientific applications?
  • Research opportunities

What Is Data Mining?
  One definition (Robert Grossman)
    • Data mining is the semi-automatic discovery of patterns, associations, anomalies, structures, and changes in large data sets
  Data Mining Characteristics
    • Large data, vs. small data
    • Discovery, not validation
    • Data driven, not hypothesis driven
    • Automated, not manual application
  Supported by
    • Statistics, machine learning, databases, high performance computing

The Data Gap
  Exponential growth of data
    • More automation, greater throughput, more models, e.g. simulated
  But: linear increase in number of researchers
    • Sift the sand, rather than searching a sensor

Classical Data Mining Applications
  Retail
    • Market basket analysis
  Political science
    • Targeting campaign resources
  Financial
    • Exploiting market trends & imbalances

Decision Support Systems
  Generic term for analytic and historic uses of DBs
    • Contrast with: operational uses
    • Commonly known as On-Line Transaction Processing (OLTP)
  Data warehouses
    • Data culled from operational DBs, with history and derived summary data

Data Warehouses vs. Databases
    • Replicate data from distributed sources
    • Do not require strict currency of data
    • Oriented toward complex, often statistical queries
    • Often based on materialized views of operational data
        - Views which have been expanded into real tables

Tools for DSS
  Ad hoc SQL-style queries
    • Optimized for large, complex data
  On-Line Analytic Processing (OLAP)
    • Queries optimized for aggregation operations
    • Data is viewed as multidimensional array
    • Influenced by end-user tools such as spreadsheets
  Data mining
    • Exploratory data analysis
    • Looking for interesting unanticipated patterns in the data

Data Warehousing
  [Figure: data warehousing architecture. Labels: External Data Sources; Extract, Transform, Load, Refresh; Metadata Repository; Data Warehouse; Serves; OLAP; Data Mining; Visualization]

Creating And Maintaining A Warehouse
  Challenges
    • Schema design for integrated information
    • Operations
        - Cleaning (curation): filling gaps, correcting errors
        - Transforming: making consistent with new schema
        - Loading: also sorting and summarizing
        - Refreshing: incorporate updates to operational data
        - Purging: aging out old data
  Role of metadata
    • Sources of data, schema conversion information, refresh history, etc.
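To make the OLAP view of data as a multidimensional array more concrete, here is a minimal Python sketch of an aggregation ("roll-up") query. The dimensions (product, region, quarter), the sample figures, and the name roll_up are all hypothetical and chosen only for illustration; a real OLAP engine would precompute and index such aggregates rather than rescan the facts.

```python
from collections import defaultdict

# Hypothetical fact records: (product, region, quarter, sales amount).
FACTS = [
    ("pen", "west", "Q1", 10.0),
    ("pen", "east", "Q1", 5.0),
    ("ink", "west", "Q1", 7.0),
    ("pen", "west", "Q2", 12.0),
    ("ink", "east", "Q2", 3.0),
]

# Names of the dimensions, in the order they appear in each fact record.
DIMENSIONS = ("product", "region", "quarter")


def roll_up(facts, keep):
    """Sum the sales measure over every dimension not named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in idx)   # coordinates in the kept dimensions
        totals[key] += row[-1]             # last field is the measure
    return dict(totals)


by_product = roll_up(FACTS, ["product"])
by_region_quarter = roll_up(FACTS, ["region", "quarter"])
grand_total = roll_up(FACTS, [])           # aggregate over all dimensions
```

Keeping fewer dimensions corresponds to a coarser summary of the same cube; keeping none collapses it to a single grand total.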

OLAP Naturally Leads to Data Mining
  Seeks interesting trends or patterns in large datasets
    • An example of exploratory data analysis
    • Related to knowledge discovery and machine learning
  Mining for rules
    • Association rules: motivated by retail market basket analysis

Market Basket Analysis
  Market basket
    • A collection of items purchased by a customer in one transaction
    • Retailers want to learn of items often purchased together
        - For promotional and display grouping purposes
    • Simple tabular representation
        - Purchases(transid, custid, date, item, price, quantity)

Association Rules
  Seek rules of the form: { pen } => { ink }
    • Meaning: If a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction

Important Measures for Association Rules
  Support
    • % of transactions containing all items mentioned in rule
    • Low support reduces interest in the rule
  Confidence
    • % of transactions containing the LHS that also contain RHS
    • Indicates degree of correlation

Using Association Rules For Prediction
  Always somewhat risky
    • Because ultimate goal is understanding causality
    • Which is not directly reflected in transaction data

There Can Be High Support and Confidence … but no causality
  Example: pencils and pens are often bought together
    • And pens and ink are often bought together
    • Hence pencils and ink are often bought together
  But there is no causal link between pencils and ink
    • Hence sale promotions on pencils and ink probably won’t be effective

Finding Association Rules
  Seek rules with:
    • Support greater than minsup
    • Confidence greater than minconf
  Steps
    • Find frequent item sets
        - Sets of items with support >= minsup
    • Break each frequent item set into LHS and RHS of candidate rules
        - Keep those with confidence >= minconf

Testing Candidate Rules
  Confidence calculation for each candidate rule
    • Maintain two counters: lhscount, rhscount
    • Scan entire customer transaction table
    • Count in lhscount occurrences of all items in LHS
    • If LHS is present, tally in rhscount if all items in RHS are present

Identifying Frequent Item Sets
  The a priori property:
    • Every subset of a frequent item set is also a frequent item set
  This leads to an iterative algorithm
    • Identify frequent item sets of one item
    • Iteratively, seek to extend frequent item sets by adding an item

Finding Frequent Itemsets
    foreach item, check if it is a frequent itemset
    repeat
        foreach new frequent itemset Ik with k items
            generate all itemsets Ik+1 with k+1 items, Ik ⊂ Ik+1
        scan all transactions once and check if the generated (k+1)-itemsets are frequent
    until no new frequent itemsets are found

Example: Mining Simulated Combustion Data
  Joint work with
    • Brijesh Garabadu, School of Computing
    • Zoran Djurisic, Chem. & Fuels Engg.
  The problem
    • Combustion model for powdered coal furnaces
    • Which conditions control NOx pollution?

The Data
  Multidimensional space
    • Pressure, fuel mix, oxygen concentration
    • Can explore (simulate) any combination
  But which to look at? Need to:
    • Locate relevant subspaces
    • Characterize important events
    • Develop causal hypotheses

Techniques Applied
  Cluster analysis
    • Which datasets are similar?
  Neural networks
    • Which datasets are interesting?
  Decision trees
    • Which features best explain similarities?

Cluster Analysis: Unsupervised Learning
  At outset, category structure of the data is unknown
    • All that is known is a collection of observations
  Objective: To discover a category structure which fits the observations
    • i.e. finding natural groups in data

Combustion Application
  Cluster analysis was used to detect relationships among various species
    • Are the behaviors of any two species related?
    • Is the concentration of one species dependent on that of one or more other species?
  One confirmed hypothesis:
    • CH reaches its peak concentration either before or at the same time as H reaches its peak concentration
    • An important engineering observation

Artificial Neural Networks
  A general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples
  Combustion application
    • Finding out different kinds of patterns (increasing / decreasing, etc.) in the lifetime of a species during the combustion process
    • This can be used to prove various hypotheses as well as to detect patterns of specific species in previously unseen data

Neural Networks: Supervised Learning
  Application Technique
  Training set data are labeled by the user
    • These labeled data are used to train the ANN
  The ANN is then used to classify previously unseen data
    • e.g., species in a particular combustion
    • Into a particular pattern class
  For example, NO shows two different trends under differing conditions
  A trained ANN can be used to classify the datasets according to the trend of NO

Decision Trees
  Characterize data by features
    • e.g., species concentration at an instant
  Categorize data sets
    • Manually, or use ANN
    • e.g., according to the trend of NO
  Use decision tree algorithm to discover clustering criteria

Sample Output
    === Classifier model (full training set) ===

    J48 pruned tree
    --------------------
    CO <= 0.002945
    |   OH <= 0.000016
    |   |   CO <= 0.000166: yes (17.0/1.0)
    |   |   CO > 0.000166: no (3.0)
    |   OH > 0.000016: yes (30.0)
    CO > 0.002945: no (60.0/1.0)

Research Opportunities
  Try it!
    • In your area, on your data, for new results
  Features
    • Definition, efficient extraction
  Community building
    • Sharing data mining results

PMML
  Predictive Model Markup Language
  XML based representation of association rules
  Developed by Data Mining Group
    • Industrial and university research collaboration

An Excellent Tutorial
  Used for material in this talk
    • Data Mining Scientific and Engineering Applications
      Tutorial at SC2001, November 12, 2001, by R. Grossman, C. Kamath and V. Kumar
      http://www-users.cs.umn.edu/~kumar/Presentation/sc2001.html
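
Backup sketch: returning to the Finding Association Rules and Identifying Frequent Item Sets slides, the steps there can be illustrated with a short Python sketch. This is not the talk's implementation: the sample transactions and the function names (support, frequent_itemsets, rules) are hypothetical, and for brevity support is recomputed by rescanning the transactions for each candidate rather than with one scan per level or with the lhscount/rhscount counters.

```python
from itertools import combinations

# Hypothetical transactions; each is the set of items in one market basket.
TRANSACTIONS = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, minsup):
    """Level-wise search: extend frequent k-itemsets by one item at a time."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= minsup]
    singles = [i for s in level for i in s]   # frequent single items
    frequent = list(level)
    while level:
        # By the a priori property, only frequent single items can extend
        # a frequent itemset into another frequent itemset.
        candidates = {s | {i} for s in level for i in singles if i not in s}
        level = [c for c in candidates if support(c, transactions) >= minsup]
        frequent.extend(level)
    return frequent


def rules(transactions, minsup, minconf):
    """Split each frequent itemset into LHS => RHS; keep confident rules."""
    found = []
    for itemset in frequent_itemsets(transactions, minsup):
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), k)):
                rhs = itemset - lhs
                conf = (support(itemset, transactions)
                        / support(lhs, transactions))
                if conf >= minconf:
                    found.append((set(lhs), set(rhs), conf))
    return found


if __name__ == "__main__":
    for lhs, rhs, conf in rules(TRANSACTIONS, minsup=0.7, minconf=0.9):
        print(sorted(lhs), "=>", sorted(rhs), "confidence", conf)
```

In this made-up data a pen appears in every basket, so any rule with { pen } on the right-hand side has confidence 1.0 regardless of the left-hand side, which echoes the caution that high support and confidence do not establish causality.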