Advanced Information Retrieval
Chapter 02: Modeling – Neural Network Model

Neural Network Model
A neural network is an oversimplified representation of the neuron interconnections in the human brain:
– Nodes are processing units
– Edges are synaptic connections
– The strength of a propagating signal is modelled by a weight assigned to each edge
– The state of a node is defined by its activation level
– Depending on its activation level, a node might issue an output signal

Neural Networks
– Complex learning systems recognized in animal brains
– A single neuron has a simple structure
– Interconnected sets of neurons perform complex learning tasks
– The human brain has about 10^15 synaptic connections
– Artificial Neural Networks attempt to replicate the non-linear learning found in nature
[Figure: biological neuron showing dendrites, cell body, and axon]

Neural Networks (cont'd)
– Dendrites gather inputs from other neurons and combine the information
– A non-linear response is generated when a threshold is reached
– The signal is sent to other neurons via the axon
[Figure: artificial neuron with inputs x1, x2, ..., xn and output y]
– The artificial neuron model is similar
– Data inputs (x_i) are collected from upstream neurons and fed into a combination function (sigma)

Neural Networks (cont'd)
– The activation function reads the combined input and produces a non-linear response (y)
– The response is channeled downstream to other neurons
• What problems are Neural Networks suited to?
– Quite robust with respect to noisy data
– Can learn and work around erroneous data
– Results are opaque to human interpretation
– Often require long training times

Input and Output Encoding
– Neural Networks require attribute values encoded to [0, 1]
• Numeric
– Apply min-max normalization to continuous variables (a sketch follows below):

  $X^* = \frac{X - \min(X)}{\mathrm{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}$

– Works well when the min and max are known
– Also assumes new data values occur within the min-max range
– Values outside the range may be rejected or mapped to the min or max

Input and Output Encoding (cont'd)
• Output
– Neural Networks always return continuous values in [0, 1]
– Many classification problems have two outcomes
– The solution uses a threshold, established a priori, in a single output node to separate the classes
– For example, suppose the target variable is "leave" or "stay", with the threshold "leave if output >= 0.67"
– A single output node value of 0.72 then classifies the record as "leave"
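To make the encoding step concrete, here is a minimal Python sketch of min-max normalization. The function name and the clamping policy are our own choices; clamping to the nearest bound is one of the two options mentioned above (the other being rejection).

```python
def min_max_normalize(x, x_min, x_max):
    """Map x into [0, 1] using the training-set min and max.

    Out-of-range values are clamped to the nearest bound, one of the
    two policies mentioned above (the other option is rejection).
    """
    x = max(x_min, min(x, x_max))          # clamp into [min, max]
    return (x - x_min) / (x_max - x_min)   # range(X) = max - min

# Example: a value of 25 on an attribute observed between 0 and 100
print(min_max_normalize(25.0, 0.0, 100.0))   # 0.25
```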
Simple Example of a Neural Network
[Figure: fully connected feedforward network with an input layer (Nodes 1, 2, 3), a hidden layer (Nodes A and B), and an output layer (Node Z); connection weights W1A, W2A, W3A, W1B, W2B, W3B, WAZ, WBZ and constant-input weights W0A, W0B, W0Z]
– A Neural Network consists of a layered, feedforward, completely connected network of nodes
– Feedforward restricts network flow to a single direction
– Flow does not loop or cycle
– The network is composed of two or more layers

Simple Example of a Neural Network (cont'd)
– Most networks have Input, Hidden, and Output layers
– A network may contain more than one hidden layer
– The network is completely connected
– Each node in a given layer is connected to every node in the next layer
– Every connection has a weight (W_ij) associated with it
– Weight values are randomly assigned between 0 and 1 by the algorithm
– The number of input nodes depends on the number of predictors
– The number of hidden and output nodes is configurable

Simple Example of a Neural Network (cont'd)
– The combination function produces a linear combination of node inputs and connection weights as a single scalar value:

  $net_j = \sum_i W_{ij} x_{ij} = W_{0j} x_{0j} + W_{1j} x_{1j} + \dots + W_{Ij} x_{Ij}$

– For node j, x_ij is its ith input
– W_ij is the weight associated with the ith input to node j
– There are I + 1 inputs to node j
– x_1, x_2, ..., x_I are inputs from upstream nodes
– x_0 is a constant input with value 1.0
– Each node therefore has an extra input term $W_{0j} x_{0j} = W_{0j}$

Simple Example of a Neural Network (cont'd)
– Suppose the inputs and weights take the following values:

  x0 = 1.0    W0A = 0.5    W0B = 0.7    W0Z = 0.5
  x1 = 0.4    W1A = 0.6    W1B = 0.9    WAZ = 0.9
  x2 = 0.2    W2A = 0.8    W2B = 0.8    WBZ = 0.9
  x3 = 0.7    W3A = 0.6    W3B = 0.4

– The scalar value computed for hidden layer Node A equals

  $net_A = \sum_i W_{iA} x_{iA} = W_{0A}(1.0) + W_{1A} x_{1A} + W_{2A} x_{2A} + W_{3A} x_{3A} = 0.5 + 0.6(0.4) + 0.8(0.2) + 0.6(0.7) = 1.32$

– For Node A, net_A = 1.32 is the input to the activation function
– Neurons "fire" in biological organisms
– Signals are sent between neurons when the combination of inputs crosses a threshold

Simple Example of a Neural Network (cont'd)
– The firing response is not necessarily linearly related to the increase in input stimulation
– Neural Networks model this behavior using a non-linear activation function
– The sigmoid function is most commonly used:

  $y = \frac{1}{1 + e^{-x}}$

– In Node A, the sigmoid function takes net_A = 1.32 as input and produces the output

  $y = \frac{1}{1 + e^{-1.32}} = 0.7892$

Simple Example of a Neural Network (cont'd)
– Node A outputs 0.7892 along its connection to Node Z, where it becomes a component of net_Z
– Before net_Z is computed, the contribution from Node B is required:

  $net_B = \sum_i W_{iB} x_{iB} = W_{0B}(1.0) + W_{1B} x_{1B} + W_{2B} x_{2B} + W_{3B} x_{3B} = 0.7 + 0.9(0.4) + 0.8(0.2) + 0.4(0.7) = 1.5$

  and

  $f(net_B) = \frac{1}{1 + e^{-1.5}} = 0.8176$

– Node Z combines the outputs from Node A and Node B through net_Z

Simple Example of a Neural Network (cont'd)
– The inputs to Node Z are not data attribute values
– Rather, they are the outputs from the sigmoid functions in the upstream nodes:

  $net_Z = \sum_i W_{iZ} x_{iZ} = W_{0Z}(1.0) + W_{AZ} x_{AZ} + W_{BZ} x_{BZ} = 0.5 + 0.9(0.7892) + 0.9(0.8176) = 1.9461$

  and finally

  $f(net_Z) = \frac{1}{1 + e^{-1.9461}} = 0.8750$

– The value 0.8750 is the output from the Neural Network on the first pass
– It represents the predicted value for the target variable, given the first observation
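The whole first pass can be verified with a short Python sketch. None of this code comes from the original source; the variable names simply mirror the slide notation, and the script replays the arithmetic above, reproducing net_A = 1.32, net_B = 1.5, and a final output of approximately 0.8750.

```python
import math

def sigmoid(x):
    """Sigmoid activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Inputs (x0 is the constant 1.0 input) and weights from the slide
x   = [1.0, 0.4, 0.2, 0.7]
w_A = [0.5, 0.6, 0.8, 0.6]   # W0A, W1A, W2A, W3A
w_B = [0.7, 0.9, 0.8, 0.4]   # W0B, W1B, W2B, W3B
w_Z = [0.5, 0.9, 0.9]        # W0Z, WAZ, WBZ

net_A = sum(w * xi for w, xi in zip(w_A, x))      # 1.32
net_B = sum(w * xi for w, xi in zip(w_B, x))      # 1.50
out_A, out_B = sigmoid(net_A), sigmoid(net_B)     # 0.7892, 0.8176

net_Z = w_Z[0] + w_Z[1] * out_A + w_Z[2] * out_B  # ~1.9461
print(sigmoid(net_Z))                             # ~0.8750
```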
Sigmoid Activation Function
– The sigmoid function combines nearly linear, curvilinear, and nearly constant behavior, depending on the input value
– The function is nearly linear for domain values -1 < x < 1
– It becomes curvilinear as values move away from the center
– At extreme values, f(x) is nearly constant
– Moderate increments in x produce a variable increase in f(x), depending on the location of x
– Sometimes called a "squashing function"
– Takes real-valued input and returns values in [0, 1]

Back-Propagation
– Neural Networks are a supervised learning method
– They require a target variable
– Each observation passed through the network results in an output value
– The output value is compared to the actual value of the target variable
– (Actual - Output) = Error
– Prediction errors are analogous to residuals in regression models
– Most networks use the Sum of Squared Errors (SSE) to measure how well predictions fit the target values:

  $SSE = \sum_{records} \sum_{output\ nodes} (actual - output)^2$

Back-Propagation (cont'd)
– Squared prediction errors are summed over all output nodes and all records in the data set
– Model weights are constructed that minimize SSE
– The weight values that minimize SSE are unknown
– The weights are estimated, given the data set

Back-Propagation Rules
– Back-propagation percolates the prediction error for a record back through the network
– Partitioned responsibility for the prediction error is assigned to the various connections
– The back-propagation rules are defined (Mitchell) as:

  $w_{ij,NEW} = w_{ij,CURRENT} + \Delta w_{ij}$, where $\Delta w_{ij} = \eta \, \delta_j \, x_{ij}$

  – $\eta$ is the learning rate
  – $x_{ij}$ signifies the ith input to node j
  – $\delta_j$ represents the responsibility for a particular error belonging to node j

Back-Propagation Rules (cont'd)
– Error responsibility is computed using the partial derivative of the sigmoid function with respect to net_j
– Its values take one of two forms:

  $\delta_j = output_j (1 - output_j)(actual_j - output_j)$ for output layer nodes

  $\delta_j = output_j (1 - output_j) \sum_{k \in DOWNSTREAM} W_{jk} \delta_k$ for hidden layer nodes

  where $\sum_{k \in DOWNSTREAM} W_{jk} \delta_k$ is the weighted sum of the error responsibilities of the nodes downstream of node j
– These rules show why input values require normalization
– Large input values x_ij would dominate the weight adjustment
– Error propagation would be overwhelmed, and learning stifled

Example of Back-Propagation
[Figure: the same three-layer network as above, with input Nodes 1–3, hidden Nodes A and B, and output Node Z]
– Recall that the first pass through the network yielded output = 0.8750
– Assume the actual target value = 0.8 and the learning rate η = 0.1
– Prediction error = 0.8 - 0.8750 = -0.075
– Neural Networks use stochastic back-propagation
– Weights are updated after each record is processed by the network
– Adjusting the weights using back-propagation is shown next
– The error responsibility for Node Z, an output node, is found first:

  $\delta_Z = output_Z (1 - output_Z)(actual_Z - output_Z) = 0.875(1 - 0.875)(0.8 - 0.875) = -0.0082$

Example of Back-Propagation (cont'd)
– Now adjust the "constant" weight w_0Z using the rules:

  $\Delta w_{0Z} = \eta \, \delta_Z (1) = 0.1(-0.0082)(1) = -0.00082$
  $w_{0Z,NEW} = w_{0Z,CURRENT} + \Delta w_{0Z} = 0.5 - 0.00082 = 0.49918$

– Move upstream to Node A, a hidden layer node
– The only node downstream from Node A is Node Z:

  $\delta_A = output_A (1 - output_A) \sum_{DOWNSTREAM} W_{jk} \delta_k = 0.7892(1 - 0.7892)(0.9)(-0.0082) = -0.00123$

Example of Back-Propagation (cont'd)
– Adjust weight w_AZ using the back-propagation rules:

  $\Delta w_{AZ} = \eta \, \delta_Z \, output_A = 0.1(-0.0082)(0.7892) = -0.000647$
  $w_{AZ,NEW} = w_{AZ,CURRENT} + \Delta w_{AZ} = 0.9 - 0.000647 = 0.899353$

– The connection weight between Node A and Node Z is adjusted from 0.9 to 0.899353
– Next, Node B is a hidden layer node
– The only node downstream from Node B is Node Z:

  $\delta_B = output_B (1 - output_B) \sum_{DOWNSTREAM} W_{jk} \delta_k = 0.8176(1 - 0.8176)(0.9)(-0.0082) = -0.0011$

Example of Back-Propagation (cont'd)
– Adjust weight w_BZ using the back-propagation rules:

  $\Delta w_{BZ} = \eta \, \delta_Z \, output_B = 0.1(-0.0082)(0.8176) = -0.00067$
  $w_{BZ,NEW} = w_{BZ,CURRENT} + \Delta w_{BZ} = 0.9 - 0.00067 = 0.89933$

– The connection weight between Node B and Node Z is adjusted from 0.9 to 0.89933
– Similarly, application of the back-propagation rules continues upstream to the input layer nodes
– The weights {w1A, w2A, w3A, w0A} and {w1B, w2B, w3B, w0B} are updated by the same process

Example of Back-Propagation (cont'd)
– Now all the network weights in the model have been updated
– Each iteration is based on a single record from the data set
• Summary
– The network calculates a predicted value for the target variable
– The prediction error is derived
– The prediction error is percolated back through the network
– The weights are adjusted to generate a smaller prediction error
– The process repeats record by record (a sketch of one such update follows below)
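The updates above can be replayed in a few lines of Python. This is a sketch of a single stochastic update, assuming the values from the worked example (learning rate 0.1, actual target 0.8); small differences in the final decimals come from the rounding used in the slides.

```python
eta = 0.1          # learning rate
actual = 0.8       # assumed target value for this record
out_A, out_B, out_Z = 0.7892, 0.8176, 0.8750

# Error responsibility for output node Z
delta_Z = out_Z * (1 - out_Z) * (actual - out_Z)     # ~-0.0082

# Hidden nodes: Z is the only downstream node (WAZ = WBZ = 0.9)
delta_A = out_A * (1 - out_A) * 0.9 * delta_Z        # ~-0.00123
delta_B = out_B * (1 - out_B) * 0.9 * delta_Z        # ~-0.0011

# Stochastic updates: w_new = w_current + eta * delta_j * (input to j)
w0Z = 0.5 + eta * delta_Z * 1.0      # 0.49918
wAZ = 0.9 + eta * delta_Z * out_A    # 0.899353
wBZ = 0.9 + eta * delta_Z * out_B    # 0.89933

# delta_A and delta_B would drive the updates of the input-layer
# weights {w0A..w3A} and {w0B..w3B} in the same way
print(round(w0Z, 5), round(wAZ, 6), round(wBZ, 5))
```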
Termination Criteria
– Many passes through the data set are performed
– Weights are constantly adjusted to reduce the prediction error
– When to terminate?
– Should the stopping criterion be computational "clock" time?
– Short training times likely result in a poor model
– Terminate when SSE reaches a threshold level?
– Neural Networks are prone to overfitting: memorizing patterns rather than generalizing
– And ...

Learning Rate
– Recall that the learning rate $\eta$ (Greek "eta") is a constant with $0 < \eta < 1$
– It helps adjust the weights toward the global minimum of SSE
• Small Learning Rate
– With a small learning rate, weight adjustments are small
– The network takes an unacceptably long time to converge to a solution
• Large Learning Rate
– Suppose the algorithm is close to the optimal solution
– With a large learning rate, the network is likely to "overshoot" the optimal solution

Neural Network for IR
From the work by Wilkinson & Hingston, SIGIR '91
[Figure: three-layer network with query term nodes (k1, ..., ka, kb, kc, ..., kt), document term nodes, and document nodes (d1, ..., dj, dj+1, ..., dN)]

Neural Network for IR
– A three-layer network
– Signals propagate across the network
– First level of propagation:
  – Query terms issue the first signals
  – These signals propagate across the network to reach the document nodes
– Second level of propagation:
  – Document nodes might themselves generate new signals which affect the document term nodes
  – Document term nodes might respond with new signals of their own

Quantifying Signal Propagation
– Normalize signal strength (MAX = 1)
– Query terms emit an initial signal equal to 1
– The weight associated with an edge from a query term node k_i to a document term node k_i is

  $W_{iq} = \frac{w_{iq}}{\sqrt{\sum_i w_{iq}^2}}$

– The weight associated with an edge from a document term node k_i to a document node d_j is

  $W_{ij} = \frac{w_{ij}}{\sqrt{\sum_i w_{ij}^2}}$

Quantifying Signal Propagation (cont'd)
– After the first level of signal propagation, the activation level of a document node d_j is given by

  $\sum_i W_{iq} W_{ij} = \frac{\sum_i w_{iq} w_{ij}}{\sqrt{\sum_i w_{iq}^2} \, \sqrt{\sum_i w_{ij}^2}}$

  which is exactly the ranking of the Vector model (see the sketch at the end of this section)
– New signals might be exchanged among document term nodes and document nodes in a process analogous to a feedback cycle
– A minimum threshold should be enforced to avoid spurious signal generation

Conclusions
– The model provides an interesting formulation of the IR problem
– The model has not been tested extensively
– It is not clear what improvements the model might provide
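As a closing illustration of the signal-propagation formulas above, the sketch below computes the first-level activation of each document node for a toy term-document weight matrix. The helper name and the toy weights are invented for illustration; because of the cosine-style normalization, the resulting ranking matches the Vector model, as the slides note.

```python
import math

def normalize_columns(w):
    """Divide each column (one document, or the query) by its L2 norm,
    as in W_ij = w_ij / sqrt(sum_i w_ij^2)."""
    norms = [math.sqrt(sum(w[i][j] ** 2 for i in range(len(w))))
             for j in range(len(w[0]))]
    return [[w[i][j] / norms[j] if norms[j] else 0.0
             for j in range(len(w[0]))]
            for i in range(len(w))]

# Toy index: rows = terms k1..k3, columns = documents d1..d2
doc_weights = [[1.0, 0.0],
               [2.0, 1.0],
               [0.0, 3.0]]
query_weights = [[1.0], [1.0], [0.0]]   # the query uses k1 and k2

W_ij = normalize_columns(doc_weights)
W_iq = normalize_columns(query_weights)

# First-level propagation: activation of d_j = sum_i W_iq * W_ij
for j in range(2):
    act = sum(W_iq[i][0] * W_ij[i][j] for i in range(3))
    print(f"activation(d{j + 1}) = {act:.4f}")  # d1 ranks above d2
```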