Download Genetic Algorithm Optimization of Membership Functions for

Introduction to Artificial Intelligence: Applications in Computational Biology Susan M. Bridges [email protected] Intelligent Systems Laboratory Outline • • • • • • What is AI? Search Expert systems Uncertainty Machine learning Data mining Intelligent Systems Laboratory Intelligent Systems and Computational Biology • • First applications (DNA) in which great progress was made were digital • • Signal processing algorithms Text processing techniques Many of the most interesting and difficult problems to be tackled are analog • • • Protein structure Gene expression Metabolic networks Intelligent Systems Laboratory Definitions of AI (What is AI?) • Rich, E. and K. Knight . 1991. Artificial Intelligence. New York: McGraw-Hill. “Artificial intelligence (AI) is the study of how to make computers do things which at the moment, people do better.” Intelligent Systems Laboratory Another definition of AI • Winston, Patrick Henry. 1984. Artificial Intelligence. 1984. Addison-Wesley, Reading, MA. “Artificial Intelligence is the study of ideas that enable computers to be intelligent. Intelligence includes: ability to reason, ability to acquire and apply knowledge, ability to perceive and manipulate things in the physical world, and others.” Intelligent Systems Laboratory Why Study AI? • • Understand human human intelligence Develop “intelligent” machines • • Robotics Programs with intelligent properties Intelligent Systems Laboratory Acting Rationally: Turing Test Approach Interrogator Intelligent Systems Laboratory AI Tasks • Mundane tasks • Perception • » Vision » Speech » Geometry » Logic » Integral calculus • Natural Language » Understanding » Generation » Translation • Common sense reasoning • Robot control Formal tasks • Games • Mathematics • Expert tasks • Engineering • Scientific analysis • Medical diagnosis • Financial analysis Intelligent Systems Laboratory Intelligent Agents • Agent • • • Perceives its environment using sensors • Acts on environment using effectors Rational agent • An agent that does the right thing • Basis for action » A measure of degree of success. » Knowledge of what has been perceived so far. » The actions that the agent can perform Autonomous Agent • Learns from experience • Makes independent decisions Intelligent Systems Laboratory Major Topics • • • Search Knowledge Representation Machine Learning Intelligent Systems Laboratory Problem-solving agent • • • • A type of goal-based agent Find sequence of actions that lead to a desirable state Intelligent agents should make a set of changes in the state of the environment that maximizes the performance measure Life is simpler if we can set a goal and aim to satisfy it. Intelligent Systems Laboratory Components of a problem • • Initial state Set of possible actions • actions can be described as operators » an operator describes an action by specifying the state that can be reached by carrying out an action in a particular state State x Operator a State y • actions can be described in terms of a successor function S. Given a particular state x, S(x) returns the set of states reachable from x by any single action. Intelligent Systems Laboratory State Space • • • • The set of all states reachable from the initial state by any sequence of actions A path in the state space is a sequence of actions leading from one state to another The agent can apply a goal test to any single state to determine if it is a goal state. If one path is preferable to another, then we may need to compute path cost (g). Intelligent Systems Laboratory 5 4 1 6 1 8 8 7 3 2 7 Initial State 2 3 4 6 Goal State States Goal Test Operators Path Cost Intelligent Systems Laboratory 5 Problem: Find route from Louisville to West Point West Point Pheba Mathiston Mayhew Maben Columbus Starkville Sturgis Artesia Ackerman Crawford Louisville Intelligent Systems Laboratory Brooksville A. The initial state Louisville Louisville B. After expanding Louisville Ackerman Starkville Brooksville Louisville C. After expanding Ackerman Maben Ackerman Sturgis Starkville Louisville Intelligent Systems Laboratory Brooksville Some terms • • • • • New states are generated from old states by operators. This is called expanding the state. The choice of which state to expand first is called the search strategy Result is called a search tree The set of nodes waiting to be expanded is called the fringe or frontier Intelligent Systems Laboratory Search Strategies • Requirements for a good search strategy • • • • causes motion is systematic State space can usually be represented as a tree or a graph Two important parameters of a tree • • branching factor (b) depth (d) Intelligent Systems Laboratory Two Types of Searches • Uninformed or blind search • • • systematically generate states test states to see if they are goal states Informed or heuristic search • • • use knowledge about the problem domain explore search space more efficiently may sacrifice accuracy for speed Intelligent Systems Laboratory Breadth-first search • All nodes at each depth d are expanded before any nodes at depth d+1 Intelligent Systems Laboratory Depth-first search l l Always expands one of the nodes at the deepest level of the tree Parameter m is the maximum depth Intelligent Systems Laboratory What is a heuristic? (rule of thumb) • • A heuristic is a formalized rule for choosing those branches in a state space that are most likely to lead to an acceptable solution (Luger and Stubblefield, 1998). Used two ways • • some problems do not have exact solutions, so we just do the best we can (medical diagnosis) there may be an exact solution, but it may be very expensive to find Intelligent Systems Laboratory Hill Climbing • • • • Use an heuristic function (or objective or evaluation function) to decide which direction to move in the search space. Always move toward the state that appears to be best (basing all decisions on local information). Assume that we want to maximize the value of the function. Can also be used for minimization (called gradient descent) Intelligent Systems Laboratory 1 2 8 7 6 3 1 2 3 Steepest Ascent Hill Climbing 4 7 8 4 Using Manhattan Distance 5 6 5 Heuristic Goal 1 h= 2 7 6 h= 8 3 1 2 3 4 7 8 4 5 6 6 5 h= 1 2 3 1 2 3 7 8 4 7 8 4 7 6 5 6 6 5 1 2 3 7 8 4 6 5 h= Intelligent Systems Laboratory h= A* Search • • • Minimizing the total path cost Combines uniform-cost search and greedy search. Evaluation function: f(n) = g(n) + h(n) g(n): cost of path from start to node n h(n): estimate of cost of path from n to goal f(n): estimated cost of the cheapest solution through n Intelligent Systems Laboratory Goal: Minimum length path. Is h(n) an admissible heuristic? f(n) = g(n) + h(n) A(22) 5 3 K (18) 3 S (18) 11 F(7) 4 L ( 3) 4 C (21) 2 M(2) 14 d=1 D (8) 4 7 G (9) H(6) 12 E(12) 3 10 B (18) 6 d=0 7 1 N(9) O(5) 8 11 I (13) 5 12 P(2) 6 d=2 J(14) 3 Q(10) 4 R(12) d=3 5 T(0) U (0) Numbers in parentheses are h(n) Numbers on edges are operator costs Intelligent Systems Laboratory d=4 Multiple Sequence Alignment • • DNA and protein sequences Alignment of multiple sequences created by inserting gaps to shift characters to matching positions ATCG TGA GAT • -ATCG--T-GA GAT--- Optimal alignment maximizes the number of matching positions Intelligent Systems Laboratory Multiple Sequence Alignment As State-Space Search (Eric Hansen, Rong Zhou) Space Complexity: O (LN) Time Complexity: O (2NLN) Where L is the average length of sequences and N is the number of sequences start ATCG TGA GAT -ATCG--T-GA GAT--- goal An Illustration of Anytime A* 09 18 38 19 37 16 28 18 19 19 29 20 gh f 47 18 47 18 56 17 65 16 Nodes pruned by Anytime A* gh = expanded node f g h = stored but not f expanded node 74 15 83 14 f = g + 2h 92 13 Goal 11 0 11 Total number of nodes stored = 8 Genetic Algorithms • • • Search procedure based on a simple model of evolution Uses a “random” process to explore search space Has been applied in many domains Intelligent Systems Laboratory Terminology • Begin with a population of individuals. Each • • • • individual represents a solution to the problem we are trying to solve. A data structure describes the genetic structure of the individual. (Assume for initial discussion that this is a string of 0’s and 1’s). In genetics, the strings are called chromosomes and the bits are called genes. The string associated with each individual is its genotype Selection is based on fitness of individuals Intelligent Systems Laboratory The Genetic Algorithm • • Each evolving population of individuals is called a generation. Given a population of individuals corresponding to one generation, the algorithm simulates natural selection and reproduction in order to obtain the next generation. Intelligent Systems Laboratory Three basic operations • Reproduction: • • Crossover: • • Individuals from one generation are selected for the next generation Genetic material from one individual is exchanged with genetic material from another individual Mutation: • Genetic material is altered Intelligent Systems Laboratory General GA Procedure Selection, crossover, and mutation operations Initial population Evaluate fitness Parent candidate pool Father and Mother Select parents no Crossover and mutate Evaluate fitness and replace yes Converge? Offspring Next generation population Intelligent Systems Laboratory Example of General GA Procedure Selection, crossover, and mutation operations 11 13 2 9 1101 1011 0100 1001 1101 1011 1011 1001 Generation n Reproduction Crossover 1 011 1 101 10 01 1 0 0 11 1 1 011 0 101 1 0 1010 1 0 0 11 1 Mutation Generation n+ 1 15 8 1 Intelligent Systems Laboratory 13 Two keys to the success of a GA • • Data structures for » Genes » Chromosomes » Population Fitness evaluation function Intelligent Systems Laboratory Knowledge Representation • • • • • Semantic networks Frame based systems Rule based expert systems Ontologies Neural networks Intelligent Systems Laboratory Anything AbstractObjects Sets Numbers Events Representational Objects Places Intervals Sentences Processes Physical Objects Measurements Moments Categories Times Weights Things Animals Stuff Agents Humans Intelligent Systems Laboratory Expert Systems • Rule based systems • • • • • Garnered a great deal of attention in the 1980’s Most famous examples are in medical domains Stimulated interest in “logic programming” Encode knowledge of people as sets of rules Still widely used • Knowledge acquisition bottleneck Intelligent Systems Laboratory Representing Uncertainty • • Fuzzy logic Bayesian reasoning Intelligent Systems Laboratory Uncertainty versus Vagueness • Certainty–degree of belief • • • there is a 50% probability of rain today I am 30 % sure the patient is suffering from pneumonia Vagueness–the degree to which an item belongs to a category • the man is tall • move the wheel slightly to the left • the patient’s lungs are highly congested Intelligent Systems Laboratory Fuzzy Sets Represent Vagueness • • • • Lotfi Zadeh popularized the idea in the 60’s Popular concept in Eastern philosophy Reasoning with fuzzy sets is called fuzzy logic Fuzzy logic is also called • • approximate reasoning continuous logic Intelligent Systems Laboratory Fuzzy Set Definitions • • Set membership can be expressed using a characteristic (or descrimination) function Classic (or crisp) sets If objects x are chosen from some universe X 1 if x is an element of set A 0 if x is not an element of set A  A (x)   • Fuzzy sets - an element can be a partial member of a set (grade of membership) 0   A (x)  1 Intelligent Systems Laboratory Examples of Fuzzy Concepts from Natural Language • • • • • • • John is tall The weather is rainy Turn the volume up a little Dr. Bridges’ tests are long Add water until the dough is the right consistency There was very little change in the cost The water bill was somewhat high Intelligent Systems Laboratory Representing Fuzzy Sets • Enumeration of membership values of all elements with non-zero membership TALL = {.125/5.5, .5/6, .875/6.5, 1/7, 1/7.5, 1/8} • Represent membership with a function Intelligent Systems Laboratory Functional Representations Fuzzy Set Tall Membership 1 Tall 0 4 5 6 7 Intelligent Systems Laboratory Height in feet Linguistic (or Fuzzy)Variable • • • Usually corresponds to a noun The values of a linguistic variable are fuzzy sets (which correspond to adjectives) Examples: Linguistic variable Height Weight Temperature Speed Fuzzy sets short medium tall light average heavy cold cool typical warm hot slow medium fast Intelligent Systems Laboratory Linguistic Variable Temperature Cold Normal Hot 1 0 30 40 50 60 70 80 90 100 Intelligent Systems Laboratory Some Fuzzy Set Operations • Set union A  B A  B(x) max(A(x),B(x)) for all x X alternate syntax (join operator) A  B(x) A(x)B(x)) for all x X • Set intersection A B AB(x) min(A(x),B(x)) for all x X alternate syntax (meet operator) A  B(x) A(x) B(x)) for all x X Intelligent Systems Laboratory Fuzzy Reasoning • A fuzzy proposition is a statement that asserts a value for a linguistic (or fuzzy) variable • Example: Joe’s height is medium » Linguistic variable (noun) Joe’s height » Fuzzy set (adjective) medium » The fuzzy set “medium” is a value of the linguistic variable “Joe’s height” • • A fuzzy rule relates two or more fuzzy propositions Fuzzy inference techniques are used to draw conclusions using fuzzy rules Intelligent Systems Laboratory Example Fuzzy Rule If speed is normal then braking.force is medium Speed Normal = (0/0, .1/20, .8/40, 1/60, .1/80, 0/100) braking.force Medium = (0/0, .5/1, 1/2, 1/3, .2/4, 0/5) Intelligent Systems Laboratory J. Dickerson, D. Bedeant, Z.Cox, W. Qi, D. Ashlock, and E. Wurtele, Atlantic Symposium on Computational Biology and Genome Information Systems & Technology (CBGIST 2000, 26-30. Intelligent Systems Laboratory Bayesian Reasoning • • Bayesian networks: Represent knowledge as a network of random variables Many names and many variations • Belief networks • Probabilistic networks • Causal networks • Knowledge Maps • Influence Diagram (extension) • Decision Network (extension) Intelligent Systems Laboratory Belief Network P(B) 0.001 P(E) 0.002 Burglary Earthquake B T T F F Alarm JohnCalls A T F P(J|A) 0.90 0.05 MaryCalls Intelligent Systems Laboratory E T F T F P(A|B,E) 0.95 0.94 0.29 0.01 A T F P(M|A) 0.70 0.01 Intelligent Systems Laboratory Classification of Learning Systems • Supervised learning • • • • Give the system a set of examples and an “answer” for each example. Train the system until it can give the correct response to these examples (or most of them). Unsupervised learning • Give the system a set of examples and let it discover interesting patterns in the examples. Reinforcement learning • Learn from rewards and penalties Intelligent Systems Laboratory Feature Vectors • Simple representation • used by most learning systems. Represents each example as a vector or numbers • Quantities • Nominal data • Ordinal data #53 5.3 cold 3.2 Intelligent Systems Laboratory blue 1 Neural Networks • Computational models “loosely” based on the • structure of the brain Characteristics of the brain • Large number of simple processing units (neurons) • Highly connected • No central control • Neurons are slow devices compared to digital computers • Can perform complex tasks in a short period of time • Neurons are failure-prone devices • Handles fuzzy situations very well. • Information accessed on the basis of content • Learns from experience Intelligent Systems Laboratory Neural Networks • • • • • Based on model of nervous system Large numbers of simple processing units Units are highly connected and connections are weighted. Highly parallel distributed control Emphasis on learning internal representations automatically Intelligent Systems Laboratory Neural Network Concepts • Cell or unit or neuron or node • Autonomous processing unit that models a neuron • Purpose • » Receives information from other cells » Performs simple processing » Sends results on to one or more cells Layers • A collection of cells that perform a common function • Types: » Input layer » Hidden layer » Output layer Intelligent Systems Laboratory Layers of Neurons I1 H1 O1 I2 H2 I3 Input Layer Hidden Layer Output Layer Intelligent Systems Laboratory Properties • • • In general, there is no interconnection between cells in the same layer Connections are one or two way communications links between two cells Weights are the strength of the connections. A weight wij is a real number than indicates the influence that cell ui has on cell uj Intelligent Systems Laboratory More about weights • • • • Positive weights indicate reinforcement Negative weights indicate inhibition Weight of 0 indicates no influence or connection Weights may be initialized to one of these: • • • • 0 predefined values random values Weights are altered by experience Intelligent Systems Laboratory Multilayer Feed-Forward Networks • • Networks that are connected acyclic graphs Backpropagation • • • • Most popular training method for feed forward layered networks. Invented in 1969 by Bryson and Ho Ignored until 80’s Supervised learning technique Intelligent Systems Laboratory Back Propagation • • • • • Initialize the network with random weights Show it an input instance Compute the output Determine how much the output differs from the goal. Feed small adjustments to the weights back through the network based on the error. Intelligent Systems Laboratory General Algorithm for Training Network Initialize NN for epoch = 1 to MAXEPOCHS for each input-output pair in the training set present an example and compute error adjust weights to reduce the error compute mean-square-error for training set for each input-output pair in the test set compute error compute mean-square-error for test set Intelligent Systems Laboratory Generalization Training set % correct Test set Training time (# of epochs) Intelligent Systems Laboratory Computational Biology Applications • • • • • • Protein classification Sequence analysis DNA fragment assembly Prediction of transmembrane regions Phylogenetic classification of ribosome sequences And many more Intelligent Systems Laboratory Self-Organizing Maps • • • Also called Kohonen maps Used for unsupervised learning Widely applied for comparison of gene expression data Intelligent Systems Laboratory Principle of Self-Organizing Maps Intelligent Systems Laboratory Self-Organizing Map from Yeast Gene Expression Data (German Florez) Intelligent Systems Laboratory Knowledge Discovery • • Definition: Non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Applications in biology • • Text mining Association rule mining Intelligent Systems Laboratory KDD Process Interpretation/ Evaluation Data Mining Knowledge Transformation Preprocessing Selection --- --- --- ----- --- --- --Preprocessed Data Patterns Transformed Data Target Data Data Intelligent Systems Laboratory Iterative Clustering Procedure ( Wan, Bridges, J.Boyle, A.Boyle) Download data from Genbank Represent data using an appropriate method Construct feature vector for each gene from POSITIONAL WEIGHT MATRICES Cluster with K-clustering Select clustering without clear patterns for further clustering Visualize results Conclusions Intelligent Systems Laboratory Positional Weight Matrix Representation Clustering Results S. solfataricus with A, G box (2976) Clustering window (-48 to -1) With A box (1656) Nearby with A, G box (286) Cluster window (-24 to -1) With A and G box (142) With G box (1320) Distant with A box (1370) Nearby with G box (571) With A box (146) Intelligent Systems Laboratory Distant with G box (749)

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Genetic Algorithm Optimization of Membership Functions for