Lessons for the Computational Discovery of Scientific Knowledge

Pat Langley
Institute for the Study of Learning and Expertise, Palo Alto, California
and
Center for the Study of Language and Information, Stanford University, Stanford, California
http://www.isle.org/~langley
[email protected]

Thanks to S. Bay, V. Brooks, S. Klooster, A. Pohorille, C. Potter, K. Saito, J. Shrager, M. Schwabacher, and A. Torregrosa.

Outline of the Talk

1. History of machine learning applications
2. Traditional lessons from applied machine learning
3. History of computational scientific discovery
4. Two application efforts in scientific discovery
5. Lessons from these application efforts
6. Directions for future research

History of Machine Learning Applications

Early 1980s: D. Michie et al. champion use of decision-tree induction on industrial problems.
During 1980s: Parallel application developments in neural networks and case-based learning.
Early 1990s: Initial reviews of machine learning applications.
Mid 1993: First workshops on applications of machine learning.
Mid 1995: CACM paper analyzes factors underlying success.
Mid 1995: KDD conference becomes the default meeting for papers on machine learning applications.
Early 1998: Special issue of Machine Learning, with editorial, on applications.

Steps in the Application of Machine Learning

• Formulating the Problem
• Engineering the Representation
• Collecting and Preparing Data
• Induction Process
• Evaluating the Learned Knowledge
• Gaining User Acceptance

Areas of Machine Learning Applications

There exist a number of application movements within the field of machine learning:

• data mining for classification/regression tasks
• empirical natural language processing
• applied reinforcement learning
• adaptive interfaces for personalized services
• computational scientific discovery

These types of applications differ in the demands they make and in the issues they raise.

Data Mining vs.
Scientific Discovery

There exist two computational paradigms for discovering explicit knowledge from data:

• Data mining generates knowledge cast as decision trees, logical rules, or other notations invented by AI researchers;
• Computational scientific discovery instead uses equations, structural models, reaction pathways, or other formalisms invented by scientists and engineers.

Both approaches draw on heuristic search to find regularities in data, but they differ considerably in their emphases.

History of Research on Computational Scientific Discovery

[Figure: timeline, 1979 to 2000, of computational discovery systems grouped by the kind of knowledge discovered (legend: numeric laws, qualitative laws, structural models, process models). Systems shown: Abacus, Coper; Bacon.1–Bacon.5; AM; Glauber; Dendral; Dalton, Stahl; Hume, ARC; DST, GPN; LaGrange; IDSQ, Live; NGlauber; Stahlp, Revolver; IE; Fahrenheit, E*, Tetrad, IDSN; Gell-Mann; BR-3, Mendel; RL, Progol; Pauli; Coast, Phineas, AbE, Kekada; SDS; HR; BR-4; Mechem, CDP; SSF, RF5, LaGramge; Astra, GPM.]

Successes of Computational Scientific Discovery

Over the past decade, systems of this type have helped discover new knowledge in many scientific fields:

• stellar taxonomies from infrared spectra (Cheeseman et al., 1989)
• qualitative chemical factors in mutagenesis (King et al., 1996)
• quantitative laws of metallic behavior (Sleeman et al., 1997)
• qualitative conjectures in number theory (Colton et al., 2000)
• temporal laws of ecological behavior (Todorovski et al., 2000)
• reaction pathways in catalytic chemistry (Valdes-Perez, 1994, 1997)

Each of these has led to publications in the refereed literature of the relevant scientific field (see Langley, 2000).
Steps in Applying Computational Scientific Discovery

• problem formulation
• algorithm manipulation
• algorithm invocation
• representation engineering
• data collection/manipulation
• filtering and interpretation

Two Applications for Scientific Discovery

Given: Data on climate variables and carbon production over space and time.
Find: A model of the Earth’s ecosystem that fits and explains these data.

Given: Gene expression levels, over time, for wild and mutant organisms.
Find: A model of gene regulation that fits and explains these data.

Lesson 1

Traditional notations from machine learning are not communicated easily to domain scientists.

Ecosystem model

\[
\begin{aligned}
\mathit{NPPc} &= \sum_{\text{month}} \max(E \cdot \mathit{IPAR},\ 0) \\
E &= 0.56 \cdot \mathit{T1} \cdot \mathit{T2} \cdot W \\
\mathit{T1} &= 0.8 + 0.02 \cdot \mathit{Topt} - 0.0005 \cdot \mathit{Topt}^{2} \\
\mathit{T2} &= 1.18 \,/\, \bigl[ (1 + e^{0.2 \cdot (\mathit{Topt} - \mathit{Tempc} - 10)}) \cdot (1 + e^{0.3 \cdot (\mathit{Tempc} - \mathit{Topt} - 10)}) \bigr] \\
W &= 0.5 + 0.5 \cdot \mathit{EET} / \mathit{PET} \\
\mathit{PET} &= \begin{cases} 1.6 \cdot (10 \cdot \mathit{Tempc} / \mathit{AHI})^{A} \cdot \text{PET-TW-M} & \text{if } \mathit{Tempc} > 0 \\ 0 & \text{if } \mathit{Tempc} < 0 \end{cases} \\
A &= 0.00000068 \cdot \mathit{AHI}^{3} - 0.000077 \cdot \mathit{AHI}^{2} + 0.018 \cdot \mathit{AHI} + 0.49 \\
\mathit{IPAR} &= 0.5 \cdot \text{FPAR-FAS} \cdot \text{Monthly-Solar} \cdot \text{Sol-Conver} \\
\text{FPAR-FAS} &= \min \bigl[ (\text{SR-FAS} - 1.08) \,/\, \mathit{SR}(\text{UMD-VEG}),\ 0.95 \bigr] \\
\text{SR-FAS} &= (\text{Mon-FAS-NDVI} + 1000) \,/\, (\text{Mon-FAS-NDVI} - 1000)
\end{aligned}
\]

Gene regulation model

[Figure: signed causal graph (+/− links) over NBLR, NBLA, psbA1, psbA2, RR, Health, Light, PBS, DFR, cpcB, and Photo.]

Lesson 2

Scientists often have initial models that should influence the discovery process.

[Figure: an initial model and observations feed into Discovery, which produces a revised model. Each model is a signed causal graph over NBLR, NBLA, psbA1, psbA2, RR, Health, Light, PBS, DFR, cpcB, and Photo; some links are marked ×.]

Lesson 3

Scientific data are often rare and difficult to obtain rather than being plentiful.

Ecosystem model                   Gene regulation model
Number of variables       8       Number of variables         9
Number of equations      11       Number of initial links    11
Number of parameters     20       Number of possible links   70
Number of samples       303       Number of samples          20

Lesson 4

Scientists want models that move beyond description to provide explanations of their data.
[Figure: the two models drawn as graphs. Ecosystem model: dependency graph over NPPc, E, e_max, W, T1, T2, IPAR, SOLAR, FPAR, A, PET, EET, Topt, SR, PETTWM, Tempc, NDVI, VEG, and AHI. Gene regulation model: signed causal graph (+/− links) over NBLR, NBLA, psbA1, psbA2, RR, Health, Light, PBS, DFR, cpcB, and Photo.]

Lesson 5

Scientists want computational assistance rather than automated discovery systems.

[Figure: an initial model and observations feed into Discovery, which produces a revised model. Each model is a signed causal graph over NBLR, NBLA, psbA1, psbA2, RR, Health, Light, PBS, DFR, cpcB, and Photo; some links are marked ×.]

An Environment for Interactive Modeling

In response, we are developing an environment that lets users:

• specify process models of static and dynamic systems;
• display and edit a model’s structure and details graphically;
• utilize a model to simulate a system’s behavior over time;
• incorporate background knowledge cast as generic processes;
• indicate which processes to consider during model revision;
• invoke a revision module that improves a model’s fit to data.

The current environment focuses on quantitative processes, but future versions will also support qualitative models.
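As a rough illustration of the simplest thing "a revision module that improves a model’s fit to data" could do, the sketch below re-estimates one multiplicative parameter by ordinary least squares. The slides do not describe the environment's actual revision method, so this is an assumed stand-in. The function names and the toy numbers are hypothetical; only the starting value 0.56 comes from the ecosystem model's efficiency equation.

```python
def refit_scale(drivers, observed):
    """Least-squares estimate of c in the one-parameter model y = c * x.

    Hypothetical helper: `drivers` are the model's predictions with the
    parameter set to 1, `observed` are the measured values.
    """
    sxx = sum(x * x for x in drivers)
    if sxx == 0:
        raise ValueError("all drivers are zero; the parameter cannot be estimated")
    sxy = sum(x * y for x, y in zip(drivers, observed))
    return sxy / sxx


def sum_squared_error(scale, drivers, observed):
    """Sum of squared residuals of y = scale * x against the observations."""
    return sum((y - scale * x) ** 2 for x, y in zip(drivers, observed))


if __name__ == "__main__":
    # Made-up illustrative data, not taken from the talk.
    drivers = [1.0, 2.0, 3.0, 4.0]
    observed = [0.5, 0.9, 1.6, 2.0]

    initial = 0.56  # starting value, as in E = 0.56 * T1 * T2 * W
    revised = refit_scale(drivers, observed)

    print("initial error:", sum_squared_error(initial, drivers, observed))
    print("revised value:", revised)
    print("revised error:", sum_squared_error(revised, drivers, observed))
```

A closed-form estimate is possible here only because the parameter enters the equation linearly; revising exponents, or choosing which processes to keep, would need a search procedure instead.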
A Process Model for Carbon Production

model npp;

variables NPPc, E, IPAR, T1, T2, W, Topt, tempc, eet, PET, PETTWM, ahi, A,
          FPARFAS, monthlySolar, SolConver, MONFASNDVI, umd_veg;

observable ahi, eet, tempc, Topt, MONFASNDVI, monthlySolar, PETTWM, umd_veg;

process CarbonProd;
  equations NPPc = E * IPAR;

process PhotoEfficiency;
  equations E = (0.389 * (T1 * (T2 * W)));

process TempStress1;
  equations T1 = (0.8 + ((0.02 * Topt) - (0.0005 * (Topt ^ 2))));

process TempStress2;
  equations T2 = ((1.1814 / (1 + (2.718281828 ^ (0.2 * (Topt - 10 - tempc)))))
                  / (1 + (2.718281828 ^ (0.3 * (tempc - 10 - Topt)))));

process WaterStress;
  conditions PET != 0;
  equations W = (0.5 + (0.5 * (eet / PET)));

process WSNoEvapoTrans;
  conditions PET == 0;
  equations W = 0.5;

process EvapoTrans;
  conditions tempc > 0;
  equations PET = 1.6 * (10 * tempc / ahi) ^ A * PETTWM;

• • •

Viewing and Editing a Process Model

Directions for Future Research

These lessons suggest the field needs increased research on:

• methods for discovering knowledge in scientific formalisms
• techniques for revising existing scientific models
• approaches to dealing with small data sets
• algorithms for discovering explanatory models
• interactive environments for scientific knowledge discovery

Taken together, these emphases should address the needs of domain scientists and produce interesting new methods.

In Memoriam

Early last year, computational scientific discovery lost two of its founding fathers:

Herbert A. Simon (1916 – 2001)
Jan M. Zytkow (1945 – 2001)

Both contributed to the field in many ways: posing new problems, inventing methods, training students, and organizing meetings. Moreover, both were interdisciplinary researchers who contributed to computer science, psychology, philosophy, and statistics.

Herb Simon and Jan Zytkow were excellent role models whom we should all aim to emulate.
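The NPPc Portion of CASA

\[
\begin{aligned}
\mathit{NPPc} &= \sum_{\text{month}} \max(E \cdot \mathit{IPAR},\ 0) \\
E &= 0.56 \cdot \mathit{T1} \cdot \mathit{T2} \cdot W \\
\mathit{T1} &= 0.8 + 0.02 \cdot \mathit{Topt} - 0.0005 \cdot \mathit{Topt}^{2} \\
\mathit{T2} &= 1.18 \,/\, \bigl[ (1 + e^{0.2 \cdot (\mathit{Topt} - \mathit{Tempc} - 10)}) \cdot (1 + e^{0.3 \cdot (\mathit{Tempc} - \mathit{Topt} - 10)}) \bigr] \\
W &= 0.5 + 0.5 \cdot \mathit{EET} / \mathit{PET} \\
\mathit{PET} &= \begin{cases} 1.6 \cdot (10 \cdot \mathit{Tempc} / \mathit{AHI})^{A} \cdot \text{PET-TW-M} & \text{if } \mathit{Tempc} > 0 \\ 0 & \text{if } \mathit{Tempc} < 0 \end{cases} \\
A &= 0.00000068 \cdot \mathit{AHI}^{3} - 0.000077 \cdot \mathit{AHI}^{2} + 0.018 \cdot \mathit{AHI} + 0.49 \\
\mathit{IPAR} &= 0.5 \cdot \text{FPAR-FAS} \cdot \text{Monthly-Solar} \cdot \text{Sol-Conver} \\
\text{FPAR-FAS} &= \min \bigl[ (\text{SR-FAS} - 1.08) \,/\, \mathit{SR}(\text{UMD-VEG}),\ 0.95 \bigr] \\
\text{SR-FAS} &= (\text{Mon-FAS-NDVI} + 1000) \,/\, (\text{Mon-FAS-NDVI} - 1000)
\end{aligned}
\]

The NPPc Portion of CASA

[Figure: dependency graph over NPPc, E, e_max, W, A, PET, AHI, PETTWM, IPAR, T2, EET, Tempc, T1, SOLAR, Topt, SR, NDVI, FPAR, and VEG.]

A Model of Photosynthesis Regulation

How do plants modify their photosynthetic apparatus in high light?

[Figure: signed causal graph (+/− links) over NBLR, NBLA, PBS, DFR, psbA1, psbA2, Light, RR, Health, cpcB, and Photo.]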
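To show how the NPPc equations from CASA compose, here is a minimal Python sketch that evaluates them for given monthly inputs. It is an illustration under assumptions, not the CASA implementation. All function and parameter names are hypothetical. IPAR is taken as an input rather than derived from NDVI. Tempc = 0 is treated like Tempc < 0, a case the slide leaves unspecified. W = 0.5 when PET = 0 follows the process model's WSNoEvapoTrans rule.

```python
import math

E_MAX = 0.56  # constant in E = 0.56 * T1 * T2 * W


def temp_stress_1(topt):
    """T1 = 0.8 + 0.02 * Topt - 0.0005 * Topt^2."""
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2


def temp_stress_2(topt, tempc):
    """T2 = 1.18 / [(1 + e^(0.2 (Topt - Tempc - 10))) (1 + e^(0.3 (Tempc - Topt - 10)))]."""
    low = 1 + math.exp(0.2 * (topt - tempc - 10))
    high = 1 + math.exp(0.3 * (tempc - topt - 10))
    return 1.18 / (low * high)


def potential_evapotranspiration(tempc, ahi, pet_tw_m):
    """PET, which is zero when the temperature is not above freezing.

    Assumption: Tempc == 0 is grouped with Tempc < 0. For Tempc > 0 the
    annual heat index `ahi` must be positive.
    """
    if tempc <= 0:
        return 0.0
    a = 0.00000068 * ahi ** 3 - 0.000077 * ahi ** 2 + 0.018 * ahi + 0.49
    return 1.6 * (10 * tempc / ahi) ** a * pet_tw_m


def water_stress(eet, pet):
    """W = 0.5 + 0.5 * EET / PET, with W = 0.5 when PET is zero."""
    if pet == 0:
        return 0.5
    return 0.5 + 0.5 * eet / pet


def monthly_npp(topt, tempc, eet, ahi, pet_tw_m, ipar):
    """One month's contribution, max(E * IPAR, 0)."""
    pet = potential_evapotranspiration(tempc, ahi, pet_tw_m)
    e = E_MAX * temp_stress_1(topt) * temp_stress_2(topt, tempc) * water_stress(eet, pet)
    return max(e * ipar, 0.0)


def annual_nppc(months):
    """NPPc as the sum of monthly contributions; `months` is a list of keyword dicts."""
    return sum(monthly_npp(**month) for month in months)
```

Keeping each equation in its own function mirrors the one-process-per-equation layout of the process model, so a revised constant or an alternative form for one term can be swapped in without touching the rest.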