CSC 599: Computational Scientific Discovery
Lecture 4: Machine Learning and Model Search

Outline
- Computational reasoning in science, cont'd
- Brief introduction to Artificial Intelligence
  - Search spaces and search operators
  - Newell's model of intelligence
- Brief introduction to Machine Learning
- Computer algebra
- Bayesian nets
- Error, precision and accuracy; overfitting
- Computational scientific discovery (CSD) vs. Machine Learning (ML)
  - Importance of sticking to the paradigm
  - CSD vs. ML: the take-home message

Computer Algebra
Forget numbers!
- Q: Have an ungodly amount of algebra to do (physics, engineering)?
- A: Try a Computer Algebra System (CAS) for algebraic symbol manipulation.
  Examples: Mathematica, Maple
- Compare: numerical-methods and stats packages, which do "number crunching".
  Examples: Matlab, Mathematica, SAS, SPSS

Bayesian Networks
Idea: get simplicity out of complexity.
- Lots of variables in a non-deterministic environment
- Patterns of influence between the variables
- A Bayesian net encodes those influence patterns
Example variables:
  a) Does the prof assign homework? (true or false)
  b) Does the TA assign homework? (true or false)
  c) Will your weekend be busy? (true or false)

Bayesian Networks (2)
Example (pr = prof assigns, ta = TA assigns, b = busy weekend):
  p(pr) = 0.6           p(-pr) = 0.4
  p(ta|pr) = 0.1        p(-ta|pr) = 0.9
  p(ta|-pr) = 0.9       p(-ta|-pr) = 0.1
  p(b|ta,pr) = 0.99     p(-b|ta,pr) = 0.01
  p(b|-ta,pr) = 0.8     p(-b|-ta,pr) = 0.2
  p(b|ta,-pr) = 0.9     p(-b|ta,-pr) = 0.1
  p(b|-ta,-pr) = 0.1    p(-b|-ta,-pr) = 0.9

Bayesian Networks (3)
P(pr=T | b=T)
  = P(b=T, pr=T) / P(b=T)
  = sum over ta of P(b=T, ta, pr=T) / sum over ta, pr of P(b=T, ta, pr)
  = [(0.99*0.1*0.6 = 0.0594, TTT) + (0.8*0.9*0.6 = 0.432, TFT)]
    / [0.0594 (TTT) + 0.432 (TFT) + 0.324 (TTF) + 0.004 (TFF)]
  ≈ 0.5997

Bayesian Networks (4)
Q: That's a lot of work! Can't we get the network to simplify things?
A: Yes: d-separation. Two sets of nodes X, Y are d-separated given Z if, on every path between them, one of the following blocks the path:
1. The middle node M is in Z (chain): i --> M --> j
   Intuition: if I know M, knowing i doesn't tell me any more about j.
2.
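As a sanity check on the arithmetic in Bayesian Networks (3), the same posterior can be computed by brute-force enumeration of the joint distribution. This is a minimal Python sketch of my own (the dictionary layout and names are assumptions, not course code):

```python
# Sketch (mine, not the lecture's): verify the Bayesian Networks (3)
# posterior by enumerating the joint distribution of the network.
p_pr = {True: 0.6, False: 0.4}                   # p(pr)
p_ta = {True: {True: 0.1, False: 0.9},           # p(ta | pr=T)
        False: {True: 0.9, False: 0.1}}          # p(ta | pr=F)
p_b = {(True, True): 0.99, (False, True): 0.8,   # p(b=T | ta, pr)
       (True, False): 0.9, (False, False): 0.1}

def joint(pr, ta, b):
    """P(pr, ta, b) via the chain rule along the network's edges."""
    pb = p_b[(ta, pr)]
    return p_pr[pr] * p_ta[pr][ta] * (pb if b else 1 - pb)

num = sum(joint(True, ta, True) for ta in (True, False))    # P(b=T, pr=T)
den = sum(joint(pr, ta, True) for pr in (True, False)
                              for ta in (True, False))      # P(b=T)
posterior = num / den
print(round(posterior, 6))  # 0.599707, matching the hand calculation
```

Enumeration is exponential in the number of variables, which is exactly why the d-separation shortcut below matters for larger networks.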
The middle node M is in Z (fork): i <-- M --> j
   Intuition: if I know the common cause M, knowing the first effect i doesn't tell me any more about the second effect j.
3. The middle node M is NOT in Z, and neither is any of its descendants (collider): i --> M <-- j
   Intuition: as long as the common effect M is unobserved, i and j are independent; once M is observed, learning i changes my belief in j, because i alone can "explain away" M.

An A.I. Researcher's Worldview
Problems are divided into:
1. Those solvable by "algorithms"
   - Algorithm = do these steps and you are guaranteed to get the answer in a "reasonable" time
   - Classic examples: searching and sorting
2. Those that aren't
   - No way to guarantee you will get an answer (in polynomial time)
   - Q: What do you do?  A: Search for one!

A.I. Worldview (2)
Example of an "A.I." problem: chess.
- Can you guarantee that you will always win at chess?
- Can you guarantee that you will (at least) never lose?
- No? Well, that makes it interesting!
- Compare with Tic-Tac-Toe: you can guarantee that you will never lose (that's why only children play it).

A.I. Worldview (3)
The A.I. paradigm for searching for a solution (remember: there is no "algorithm" for obtaining the answer, so we need to search for one):
- States: configurations of the world
- Operators: define legal transitions from one state to another
  Example: white knight g1 -> f3; white pawn c2 -> c4

A.I. Worldview (4)
State space (or search space): the space of states reachable by operators from the initial state.

A.I. Worldview (5)
Goal state: one or more states that have the configuration you want. In chess: checkmate!

A.I. Worldview (6)
A.I. pioneer Allen Newell's view of intelligence: a given level of "intelligence" is achievable with
a) Lots of knowledge and little search (chess grandmaster)
b) Little knowledge and lots of search ("stupid" program)
c) Some knowledge and some search ("smart" program)

A.I. Worldview (7)
Idea:
1. Start at the initial state
2. Apply operators to traverse the search space
3. Hope to arrive at a goal state
Issues:
- How quickly can you find the answer? (time!)
- How much memory do you need? (space!)
- How good is your goal state?
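The three-step idea above (start at the initial state, apply operators, hope to reach a goal) can be sketched as an uninformed breadth-first search. This is my own minimal illustration; the toy state space (integers, with +1 and *2 as the operators) is invented for the example:

```python
from collections import deque

def breadth_first_search(start, is_goal, successors):
    """Uninformed BFS: explores states in order of operator count,
    so the first goal found uses the fewest operator applications."""
    frontier = deque([[start]])   # queue of paths, each ending in a state
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if is_goal(path[-1]):
            return path
        for nxt in successors(path[-1]):
            if nxt not in seen:   # don't revisit states
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None                   # search space exhausted, no goal

# Toy state space: reach 10 from 1 using operators +1 and *2.
path = breadth_first_search(1, lambda s: s == 10,
                            lambda s: [s + 1, s * 2])
print(path)  # [1, 2, 4, 5, 10]
```

Swapping the queue for a stack gives depth-first search; swapping it for a priority queue ordered by cost (or cost + heuristic) gives uniform-cost search or A*.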
- Optimal = shortest path? Optimal = least total arc cost?

A.I. Worldview (8)
Tools:
- Uninformed search
  - Depth-first
  - Breadth-first
  - Uniform cost (best-first where best = least cost so far)
  - Iterative-deepening depth-first
- Informed search
  - A heuristic function tells the "desirability" of each node
  - Greedy (best-first where best = least estimated cost to goal)
  - A* (best-first where best = uniform cost + greedy)
- Search from:
  - Initial state to goal state(s)
  - Goal state to initial state(s)
  - Both directions

Machine Learning and A.I.
ML goal: find some data structure that permits better performance on some set of problems
- Prediction
- Conciseness
- Some combination thereof
What about coefficient-finding numerical methods? They're "algorithms" (in the A.I. sense)!
1. Stuff in the data
2. Turn the crank
3. O(n^3) steps later, out comes the answer

ML Example: Decision Tree Learning
Task: build a decision tree that predicts a class
- Leaves = guessed class
- Non-leaf nodes = tests on attribute variables
- Each edge to a child represents one or more attribute values

ML Example: Decision Tree Learning (2)
Approach: greedy search
1. Use information theory to find the best attribute to split the data on
2. Split the data on that attribute
3. Recurse; continue until either:
   a) No more attributes to split on (label the leaf with the majority class)
   b) All instances are in the same class (label the leaf with that class)

ML Example: Decision Tree Learning (3)
A bit of information theory:
- Ci = some class value to guess
- S = some set of examples
- freq(Ci,S) = how many Ci's are in S
- size(S) = size of S
Intuition: with k choices C1..Ck, how much information is needed to specify one Ci from S?
- Not many Ci's (freq ≈ 0)? Few bits on average: each occurrence costs more than 1 bit, but there are few occurrences.
- Lots of Ci's (freq ≈ size(S))? Not many bits: each occurrence costs less than 1 bit (Ci is a good default guess).
- Some Ci's (freq ≈ size(S)/2)?
About 1 bit each, occurring about half the time.

ML Example: Decision Tree Learning (4)
- Probability of choosing one class value from the set: freq(Ci,S)/size(S)
- Information to specify one Ci in S: -lg( freq(Ci,S)/size(S) ) bits
- For the expected information, weight by the class proportions:
  info(S) = - sum(i=1 to k): freq(Ci,S)/size(S) * lg( freq(Ci,S)/size(S) )

ML Example: Decision Tree Learning (5)
Let's get an intuition.
Case 1: every member of S is a C1, none a C2.
  size(S) = 10, freq(C1,S) = 10, freq(C2,S) = 0, therefore:
  info(S) = - [ (10/10) * lg(10/10) ] - [ (0/10) * lg(0/10) ]
          = -0 - 0 = 0    (taking 0 * lg 0 = 0)
Intuition: "If we know that we're dealing with S, then we know that all of its members are in C1. No need to specify which is C1 and which is C2."

ML Example: Decision Tree Learning (6)
Let's get an intuition (cont'd).
Case 2: half the members of S are C1, half C2.
  size(S) = 10, freq(C1,S) = 5, freq(C2,S) = 5, therefore:
  info(S) = - [ (5/10) * lg(5/10) ] - [ (5/10) * lg(5/10) ]
          = -2 * (0.5 * -1) = 1
Intuition: "If we know that we're dealing with S, it's a 50-50 guess which members belong to C1 and which to C2. We need to specify which; no compression is possible."

ML Example: Decision Tree Learning (7)
Recall the plan: select the "best" attribute to partition on, where "best" = best separates the classes.
Information gain for some attribute:
  gain(attr) = (avg. info needed to specify a class)
             - (avg. info needed to specify a class after partitioning by attr)
             = info(T) - info_attr(T)
where:
  n = number of attribute values
  Ti = the subset whose members all have attribute value vi
  info_attr(T) = sum(i=1 to n): size(Ti)/size(T) * info(Ti)
When info_attr(T) is small, the classes are well separated (big gain!).

ML Example: Decision Tree Learning (8)
Example data (should we play tennis?):
  Outlook    Temp  Humidity  Windy  PlayTennis?
  sunny       75      70     true      yes
  sunny       80      90     true      no
  sunny       85      85     false     no
  sunny       72      95     false     no
  sunny       69      70     true      yes
  overcast    72      90     true      yes
  overcast    83      78     false     yes
  overcast    64      65     true      yes
  overcast    81      75     false     yes
  rain        71      80     true      no
  rain        65      70     true      no
  rain        75      80     false     yes
  rain        68      80     false     yes
  rain        70      96     false     yes

ML Example: Decision Tree Learning (9)
info(PlayTennis) = -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits
info_outlook(PlayTennis)
  = 5/14 * (-2/5 * lg(2/5) - 3/5 * lg(3/5))    [sunny]
  + 4/14 * (-4/4 * lg(4/4) - 0/4 * lg(0/4))    [overcast]
  + 5/14 * (-3/5 * lg(3/5) - 2/5 * lg(2/5))    [rain]
  = 0.694 bits
gain(outlook) = 0.940 - 0.694 = 0.246 bits

ML Example: Decision Tree Learning (10)
info(PlayTennis) = 0.940 bits (as before)
info_windy(PlayTennis)
  = 6/14 * (-3/6 * lg(3/6) - 3/6 * lg(3/6))    [windy]
  + 8/14 * (-6/8 * lg(6/8) - 2/8 * lg(2/8))    [not windy]
  = 0.892 bits
gain(windy) = 0.940 - 0.892 = 0.048 bits
gain(outlook) > gain(windy), so test on outlook!

ML Example: Decision Tree Learning (11)
Guarding against overfitting: cross-validation.
We want to use all the data, but using test data to train is cheating. So split the data into k sets and rotate:
  for i in range(k):
      model = train_with_everything_but(i)
      test_with(model, i)

Tenets of Machine Learning
Choose an appropriate:
- Training experience
  Ex: it is good to have about equal numbers of cases of each class, even if some classes are more probable in real data. Think about how you'll test, too!
- Target function: decision tree? neural net?
- Representation
  Ex: how much detail? Windy in {true, false} vs. wind_speed in mph
- Learning algorithm
  Ex: greedy search? genetic algorithm? backpropagation?

Our Tenets of Scientific Discovery
1. Play to computers' strengths:
   - Speed
   - Accuracy (fingers crossed)
   - They don't get bored
   So do exhaustive search!
   Q: Hey, doesn't that ignore all that A.I. heuristic-function research?
2. Use background knowledge
   - Predictive accuracy is not everything!
   - Normal science ==> dominant paradigm
   - Revolutionary science ==> ?
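Returning to the decision-tree slides: the gain(outlook) and gain(windy) numbers can be reproduced from the class counts in each partition. A Python sketch of my own (the helpers info and gain mirror the slides' notation; they are not course code):

```python
from math import log2

def info(counts):
    """info(S) = -sum freq_i/size * lg(freq_i/size), taking 0*lg(0) = 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(class_counts, partitions):
    """gain(attr) = info(T) - sum size(Ti)/size(T) * info(Ti)."""
    total = sum(class_counts)
    return info(class_counts) - sum(sum(p) / total * info(p)
                                    for p in partitions)

# (yes, no) class counts from the slides' partitions of the 14 examples.
g_outlook = gain([9, 5], [[2, 3], [4, 0], [3, 2]])  # sunny/overcast/rain
g_windy   = gain([9, 5], [[3, 3], [6, 2]])          # windy / not windy
# ≈ 0.247 and 0.048 bits; the slides get 0.246 by rounding intermediates.
print(round(g_outlook, 3), round(g_windy, 3))
```

The same info helper reproduces the two intuition cases earlier: info([10, 0]) is 0 bits and info([5, 5]) is 1 bit.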
What Are the Differences?
1. Background knowledge
   - CSD values background knowledge
   - ML merely considers background knowledge

What Are the Differences? (cont'd)
2. The process of knowledge discovery
   - The ML process is iterative
   - The CSD process is also iterative, but it starts all over again (it never ends)

1. Exhaustive Search
Tell computers to consider everything! Search the space systematically, from the simplest model to increasingly complex ones.
Issues:
1. How do you search systematically?
   - States: models (initial state = simplest model; goal state = solution model)
   - Operators: go from one model to a marginally more complex one
2. What is "everything"?
   Q: With floating-point values, every different coefficient could be a new model (x, x+dx, x+2dx, etc.)
   A: Generate the next qualitative state, then use numerical methods to find the best coefficients in that state.

2. Background Knowledge as Inductive Bias (1)
Inductive bias is necessary:
- There are N training cases, but the (N+1)th test case could be anything
- We want to assume something about the target function
- Inductive bias = what you've assumed
Common inductive biases in ML:
- Minimal cross-validation error (e.g., decision tree learning)
- Maximal conditional independence (Bayes nets)
- Maximal boundary size between classes (support vector machines)
- Minimal description length (Occam's razor)
- Minimal feature usage (ignore extraneous data)
- Same class as nearest neighbor (locality)

2. Background Knowledge as Inductive Bias (2)
Biases we can add/refine in CSD:
1. Is it expressible in the same language as the paradigm? Re-use paradigm elements instead of inventing something "brand new":
   - Penalty for new objects
   - Penalty for new attributes
   - Penalty for new processes
   - Penalty for new relations/operations (?)
   - Penalty for new types of assertions (?)
2. Does it use the same reasoning as done in the paradigm?
   - Penalty for new types of reasoning, even with old assertions
Q: Does this mean we can never introduce a new thing?
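The answer above (generate a qualitative model form, then let numerical methods fit its coefficients) can be sketched in a few lines. This is my own toy illustration, not course code; the candidate forms and the data are invented. It enumerates forms from simplest up and fits each form's single coefficient by closed-form least squares:

```python
import math

# Candidate qualitative models y = c * f(x), simplest first (invented).
forms = [("y = c*x",       lambda x: x),
         ("y = c*x^2",     lambda x: x * x),
         ("y = c*x^3",     lambda x: x ** 3),
         ("y = c*sqrt(x)", math.sqrt)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 16.0, 54.0, 128.0]   # generated from y = 2*x^3

def fit(f):
    """Closed-form least squares for the single coefficient c of y = c*f(x)."""
    c = sum(f(x) * y for x, y in zip(xs, ys)) / sum(f(x) ** 2 for x in xs)
    sse = sum((y - c * f(x)) ** 2 for x, y in zip(xs, ys))
    return c, sse

# Exhaustive search over qualitative states; numerics pick the coefficient.
name, c, sse = min(((n,) + fit(f) for n, f in forms), key=lambda t: t[2])
print(name, round(c, 3))  # y = c*x^3 2.0
```

The qualitative search (which form?) and the quantitative fit (which c?) are cleanly separated, which is exactly the division of labor the slide proposes.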
Penalty for New Objects: Polywater
- Polymer: a long molecule in a repetitive chain
- Nikolai Fedyakin (1962, USSR): condensed H2O in, and forced it through, narrow quartz capillary tubes
  - Measured boiling point, freezing point, and viscosity: similar to syrup
- Boris Derjaguin popularized the results (Moscow, then UK, 1966)
- In the West, some could replicate the findings; some could not

Penalty for New Objects: Polywater (2)
- People were concerned about contamination of the H2O, but precautions were taken against this
- Denis Rousseau (Bell Labs) ran the same tests on his own sweat: it had the same properties as "polywater"
- It is easier to believe in an old thing (water + organic pollutants) than a new thing ("polywater")

Penalty for New Things: Piltdown Man
- Circa 1900: the hunt for early human fossils
  - Neanderthals in Germany (1863)
  - Cro-Magnon in France (1868)
  - What about England?
- Charles Dawson (1912): "I was given a skull by workmen at the Piltdown gravel pit"
- Later, he obtained more skull fragments and a lower jaw
- [Photo: excavating the Piltdown gravels; Dawson (right), Smith Woodward (center)]

Penalty for New Things: Piltdown Man (2)
- Royal College of Surgeons (soon after the discovery): "The brain looks like modern man's"
- French paleontologist Marcellin Boule (1915): "The jaw is from an ape"
- American zoologist Gerrit Smith Miller (1915): "The jaw is from a fossil ape"
- German anatomist Franz Weidenreich (1923): "A modern human cranium plus an orangutan jaw with filed teeth"
- Oxford anthropologist Kenneth Page Oakley (1953): "The skull is medieval human, the lower jaw is a Sarawak orangutan's, and the teeth are fossil chimpanzee's"

Penalty for New Attributes: Inertial vs. Gravitational Mass
- Inertial mass: resistance to acceleration; the m in F = ma
- Active gravitational mass: the ability to attract other masses; the M in F = GMm/r^2
- Passive gravitational mass: the ability to be attracted by other masses; the m in F = GMm/r^2

Penalty for New Attributes (2)
- Conceptually, these are three different types of mass
- Yet no experiment has ever distinguished between them, and people from Newton on have tried
- So assume they are all the same!
Penalty for New Processes: Cold Fusion
Cold fusion is a novel combination of two old processes: catalysis + fusion.
Catalysis:
- Hard: A + B -> D
- Easier (C = catalyst):
    A + C  -> AC     (activated catalyst)
    B + AC -> ABC    (ready to go)
    ABC    -> CD     (easier reaction)
    CD     -> C + D  (catalyst ready for another reaction)

Penalty for New Processes: Cold Fusion (2)
Fusion: how it works
- You get lots of energy by fusing light, neutron-rich atoms
- But you need a lot of energy in to get more out

Penalty for New Processes: Cold Fusion (3)
- Overcoming the electrostatic force is hard: with current technology, you need a fission bomb to do it
- [Photo: the result]

Penalty for New Processes: Cold Fusion (4)
- Martin Fleischmann & Stanley Pons (1989): "We can do fusion at room temperature!" (no initiating nuclear bomb needed)
- Electrolysis of heavy water (D2O); "excess heat" observed
- Proposed mechanism (palladium as the catalyst):
    Pd + D   -> Pd-D
    Pd-D + D -> D-Pd-D
    D-Pd-D   -> He-Pd + energy!
    He-Pd    -> He + Pd

Penalty for New Processes: Cold Fusion (5)
- Reported in the New York Times; instantly a worldwide story among scientists
- Replication: some could, others could not
- Results:
  - Energy: some got excess energy; others claimed the experiments didn't calibrate for or account for everything
  - Helium: not enough was observed for the energy said to be produced (and there is background helium in the air)

Ramifications
1. Science is conservative: use the current paradigm to guide thinking
2. Accuracy is not everything: an assertion has to "fit in" with the current model
   - Be explainable by the model
   - Use the same terms as the model

ML and CSD?
From ML we can take the idea of learning as model search:
- Training experience
- Target function
- Representation
- Learning algorithm
Extra considerations for CSD:
- Use computers' strengths: speed + accuracy + they don't get bored ==> simulation + exhaustive search
- Use background knowledge: be downright conservative about introducing new terms
- Not just iterative: the process never ends