Recursive Neural Networks and Connectionist Systems for NLP
The First-Pass Attachment Disambiguation Problem
Fabrizio Costa, University of Florence

Overview
• What is this talk about: a brief review of connectionist architectures for NLP; the introduction of a connectionist recursive system for syntactic parsing
• Hybrid model: a dynamic grammar (strong incrementality hypothesis) and a recursive connectionist predictor
• Investigations: linguistic preferences expressed by the system
• Advances: enhancing the performance of the proposed model by information reduction and domain partitioning

Connectionism and NLP
• Even if strongly criticized (Fodor, Pylyshyn. Connectionism and Cognitive Architecture: A Critical Analysis. Cognition, 1988), connectionist systems are applied to a variety of NLP tasks: grammaticality prediction, case role assignment, syntactic parsing, multi-feature systems (lexical, syntactic, semantic)

Advantages
• Parallel satisfaction of multiple and diverse constraints; in contrast, rule-based sequential processing needs to compute solutions that will later be discarded if they do not meet the requirements
• Development and processing of distributed representations for complex structures

Connectionist Architectures
• Brief review of: Neural Networks, Recurrent Networks, Recursive Auto-Associative Memory
• Introduction to Recursive Neural Networks

Neural Networks
• What is an Artificial Neural Network (ANN)?
• A simple model: the perceptron
[Figure: a perceptron with inputs x1 ... xn, weights w1 ... wn and output y = Σ_{i=1..n} wi xi − θ]

What can it learn?
• A perceptron can learn linear decision boundaries from examples
• Ex: two-dimensional input
[Figure: examples of two classes C1 and C2 in the plane, separated by the line w1 x1 + w2 x2 − θ = 0]

How can it learn?
• Iterative algorithm: at every presentation of the input the procedure reduces the error (a minimal code sketch follows at the end of this review)
[Figure: the decision boundary after iterations 1, 2 and 3]

Getting more complex
• Multilayer feed-forward network: hidden layers introduce non-linearity
[Figure: input layer x1 ... xn, hidden layers 1 and 2, output layer]

Signal Flow
• Learning takes place in 2 phases: a forward phase (compute the prediction) and a backward phase (propagate the error signal)
[Figure: forward and backward signal flow through the network]

Decision Boundaries
• Decision boundaries can now be made arbitrarily complex by increasing the number of neurons in the hidden layer.

ANN and NLP
• Since ANNs can potentially learn any mapping from input values to any desired output...
• ...why not apply them to NLP problems?
• How should linguistic information be coded to make it processable by an ANN?
• What are the properties of linguistic information?

Neural Networks and NLP
• Problems: in the easiest formulation, NLP input consists of a variable-length sequence of tokens; the output at any time depends on input received an arbitrary number of time steps in the past; standard neural networks can directly process only sequences of fixed size
• Idea: introduce an explicit state representation and recycle this state as input for the next time step

Recurrent Networks
[Figure: Input Units and Holding Units feed the Hidden Units; Hidden Units and Holding Units feed the Output Units]
Rumelhart, Hinton, Williams. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986
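Looking back at the perceptron slides above, a minimal Python sketch of the iterative error-driven update (the data, names and learning rate are illustrative and not part of the original slides):

```python
import numpy as np

# Minimal perceptron sketch: y = 1 if sum_i w_i x_i - theta > 0 else 0.
# The threshold theta is folded into the weights via a constant bias input.

def train_perceptron(X, t, lr=0.1, epochs=20):
    X = np.hstack([X, np.ones((len(X), 1))])   # append bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            y = float(w @ x > 0)               # threshold activation
            w += lr * (target - y) * x         # error-driven update
    return w

# Toy linearly separable problem (class C1 vs class C2)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([0, 0, 0, 1])                      # logical AND
print(train_perceptron(X, t))
```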
Working Principle
• A complete forward propagation (from input to output) is one time step
• The backward connections copy activations to the Holding Units for the next time step
• At the first time step all Holding Units have a fixed conventional value (generally all 0s)
• At a generic time step: the Hidden Units receive input from both the Input Units and the appropriate Holding Units; the Output Units receive input from both the Hidden Units and the appropriate Holding Units

Simple Recurrent Networks
• J. L. Elman proposed and studied the properties of Simple Recurrent Networks applied to linguistic problems
Elman. Finding Structure in Time. Cognitive Science, 1990
Elman. Distributed Representations, Simple Recurrent Networks and Grammatical Structure. Machine Learning, 1991
• Simple Recurrent Networks are recurrent networks in which: there is a single set of holding units for the hidden layer, called the context layer; the backpropagation of the error signal is truncated at the context layer

Simple Recurrent Networks
[Figure: Input Units and Context Layer feed the Hidden Units, which feed the Output Units]

Elman Experiment
• Task: predict the next token in a sequence representing a sentence
• Claim: in order to exhibit good prediction performance the system has to: learn syntactic constraints; learn to preserve relevant information about previous inputs in the Context Layer; learn to discard irrelevant information

Elman Setting
• A simplified grammar capable of generating embedded relative clauses: 23 words plus the end-of-sentence marker '.'; number agreement; verb argument structure; interaction with relative clauses (the agent or subject in an RC is omitted); center embedding; viable sentences (the end-of-sentence marker cannot occur at all positions in the sentence)

Network and Training
• Local (one-hot) representations for input and output
• Hidden layer with 70 units
• Training set: 4 sets of 10,000 sentences, increasingly "difficult", with a higher percentage of sentences containing RCs; 5 epochs

Results
• On sentences from the training set the network has learned to predict verb agreement in number with the noun, and all the other constraints present in the grammar
• Good generalization to other sentences not present in the training set

Criticisms
• Possibility of rote learning: there are 23 vocabulary symbols but only 10 different classes (i.e. 'boy', 'dog', 'girl' are equivalent); the number of distinct sentence patterns is small (there are 400 different sentences of 3 words but only 18 different patterns if we consider the equivalences); very few sentence patterns in the test set are not also in the training set

Criticisms
• The simplified error backpropagation algorithm doesn't allow learning of long-term dependencies
• The backpropagation of the error signal is truncated at the context layer.
• This makes computation simpler (no need to store the history of activations) and local in time
• The calculated gradient forces the network to transfer information about the input into the hidden layer only if that information is useful for the current output; but if it is useful more than one time step in the future, there is no guarantee that it will somehow be preserved
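As an illustration of the working principle above, a minimal numpy sketch of the SRN forward step (layer sizes are loosely inspired by the Elman setting, weights are random; the truncated learning rule is only noted in a comment):

```python
import numpy as np

# Simple Recurrent Network forward step: the context layer holds a copy of
# the previous hidden activations and is fed back together with the input.

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 24, 70, 24            # illustrative sizes
W_ih = rng.normal(0, 0.1, (n_hid, n_in))   # input   -> hidden
W_ch = rng.normal(0, 0.1, (n_hid, n_hid))  # context -> hidden
W_ho = rng.normal(0, 0.1, (n_out, n_hid))  # hidden  -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def srn_forward(sequence):
    context = np.zeros(n_hid)              # conventional all-zero start state
    outputs = []
    for x in sequence:                     # x: one-hot word vector
        hidden = sigmoid(W_ih @ x + W_ch @ context)
        outputs.append(sigmoid(W_ho @ hidden))
        context = hidden                   # copy activations for next step
        # Training would backpropagate the prediction error here, but the
        # error signal is truncated at the context layer (no full BPTT).
    return outputs

sentence = [np.eye(n_in)[i] for i in (3, 7, 11)]   # toy 3-word sentence
print(srn_forward(sentence)[-1].round(2))
```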
Processing Linguistic Data
• Linguistic data such as syntactic information is "naturally" represented in a structured way
[Figure: "The servant of the actress" as flat information (a word sequence) and as syntactic information (an NP made of a determiner, a noun and a PP)]

Processing Structured Data
• It is possible to "flatten" the information and transform it into a vector representation
• BUT
• flattening pushes dependencies "further" apart and makes them more difficult to process
Ex: the tree with root A dominating B and C, where B dominates D E F G H, flattens to (A (B (D E F G H) C))
• We need to process structured information directly!

How to process structured information with a connectionist architecture?
• Idea: recursively compress subtrees into distributed vector representations
• Pollack. Recursive Distributed Representations. Artificial Intelligence, 1990
• An autoencoder net learns how to compress the fields of a node into a label and to uncompress the label back into the fields

Recursive Auto-Associative Memory (RAAM)
[Figure: an autoencoder compresses the fields (left, right) into a code for the whole and decompresses the code back into (left, right)]

Example
[Figure: step-by-step RAAM encoding of the tree A ((B C) D) through intermediate codes r, q and p]

Training
• It is a moving-target training: when learning (B C) → r, the representation r changes; this changes the target for another example such as (r D) → q
• This can cause instability, so the learning rate has to be kept small, and hence we have very long training times

Coding and Decoding
• A RAAM can be viewed as 2 nets trained simultaneously: an encoding network that learns the reduced representation r for (B C); a decoding network that takes the reduced representation r as input and decodes it to (B C)
• Both encoding and decoding are recursive operations
• The encoding net knows, from the structure of the tree, how many times it must recursively compress representations
• The decoding network has to decide whether a decoded field is a terminal node or an internal node which should be further decoded

Decoding
• Pollack's solution: use binary codes for terminal nodes
• Internal representations have codes with values in [0,1] but not close to 0 or 1
• Decision rule: if a decoded field has all its values close to 0 or 1 then it is a terminal node and it is not decoded further

Experiment
• Task: coding and decoding of compositional propositions
• Ex: "Pat thought that John loved Mary" → (thought Pat (loved John (Mary)))

Experimental Setting
• The network has: 48 input units, 16 hidden units, 48 output units
• Training set: 13 complex propositions

Results
• After training, the network is able to code and decode some (not all) novel propositions
• Cluster analysis shows that similar trees are encoded with similar codes. Ex: the codes for (loved John Pat) and (loved Pat Mary) are more similar to each other than to any of the codes for other trees

Further Results
• Pollack tried to test whether a network could be trained to manipulate representations without decoding them
• Ex: train a network to transform the reduced representation for (is liked X Y) into (likes Y X)
• Results: if trained on 12 of the 16 possible propositions, the network generalizes correctly to the other 4 propositions
• Chalmers experimented with a network that transforms reduced representations from passive to active structures

Problems
• Generalization problems for new structures
• Very long training times
• Information storage limits: all configurations have to be stored in fixed-size vector representations; as the height grows the number of possible trees grows exponentially; the numerical precision limits are exceeded even by small trees
• Unknown (but believed to be poor) scaling properties
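A minimal numpy sketch of the RAAM compress/decompress mechanics described above (untrained random weights, illustrative field width; a sketch of the idea, not Pollack's actual implementation):

```python
import numpy as np

# RAAM sketch: one autoencoder compresses two child fields into a single
# code of the same width, and decompresses it back. Encoding a tree means
# applying the encoder bottom-up; weights here are random (untrained).

rng = np.random.default_rng(1)
width = 16                                   # width of every field / code
W_enc = rng.normal(0, 0.3, (width, 2 * width))
W_dec = rng.normal(0, 0.3, (2 * width, width))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(left, right):
    return sigmoid(W_enc @ np.concatenate([left, right]))

def decode(code):
    fields = sigmoid(W_dec @ code)
    return fields[:width], fields[width:]

def is_terminal(field, eps=0.2):
    # Pollack's rule: terminals use (near-)binary codes, internal codes don't
    return bool(np.all((field < eps) | (field > 1 - eps)))

# terminals as binary codes, e.g. one-hot vectors
B, C, D = np.eye(width)[0], np.eye(width)[1], np.eye(width)[2]

r = encode(B, C)          # (B C) -> r
q = encode(r, D)          # (r D) -> q
left, right = decode(q)   # after training this should reconstruct r and D
print(is_terminal(D), is_terminal(q))
```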
Recursive Neural Networks
• We want to: directly process complex tree structures; work in a supervised framework
• We use Recursive Neural Networks (RNNs) specialized for tree structures
Frasconi, Gori, Sperduti. A General Framework for Adaptive Processing of Data Structures. IEEE Transactions on Neural Networks, 1998

What is an RNN?
• Recursive Neural Networks for trees are composed of several replicas of a Recursive Network and one Output Network
[Figure: the Recursive Network maps the label encoding and the child states (1st child state ... last child state) to the node state; the Output Network maps the root state to the output]

How does an RNN process tree data structures?
• General processing step: structure unfolding
• Prediction phase: recursive state update
• Learning phase: backpropagation through structure

Structure unfolding
[Figure: the network is unfolded bottom-up over the parse tree of "It has no bearing on ..." (PRP "It", VBZ "has", DT "no", NN "bearing", IN "on" under NP, VP, PP and S nodes), one replica per node, collapsing the tree step by step]

Prediction phase
[Figure: information flows bottom-up from the leaves towards the root]

Error correction
[Figure: the error signal flows top-down from the root back through the structure]

Encoding POS tags and non-terminals
• Local (one-hot) encoding representation
[Figure: each non-terminal or POS tag (NP, VP, PP, ..., VBZ, CC, ...) is assigned an index and the corresponding one-hot code, e.g. ...0001, ...0010, ...0100]

Updating equations
• Let: x(v) be the state of node v; I(v) be the (symbolic) label of node v; ch[v] be the children of node v; o be the output
• Then: x(v) = f(x(ch[v]), I(v)) and o = g(x(r))
• where: f() is the state-updating function, g() is the output function, r is the root node

Model Overview
• Hybrid system: a dynamic grammar constrains the structure; a connectionist system provides the predictor
• Linguistic framework: strong incrementality assumption
• Statistical framework: relevant structural features are adaptively encoded by the network; the network computes maximum-likelihood estimates of the correct disambiguation decisions

Incremental processing
• Given some syntactic left context and a new word, choose the tree that corresponds to the new preferred syntactic tree
[Figure: the left context for "The servant of the actress" and the new word "who"]

Dynamic grammar of connection paths (similar to TAG/SuperTAGs)
[Figure: connection paths anchored on the words of "It has no bearing ..."]

Connection paths combined to form trees
[Figure: the connection paths combine into the incremental tree for "It has no bearing"]

The dynamic grammar generates many legal next incremental trees
• Variability in the attachment point
• Variability in the structure of the connection path

Attachment ambiguity
[Figure: the relative pronoun "who" after "The servant of the actress" can attach at different sites in the left context]

Connection path ambiguity
[Figure: for "The athlete realized his ...", different connection paths can link the new word to the same attachment site]

Alternatives
• Given a single correct left context and a next word, the many legal trees that can be formed according to the dynamic grammar are called THE FOREST OF ALTERNATIVES
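A minimal sketch of the recursive state update x(v) = f(x(ch[v]), I(v)) and output o = g(x(r)) described in the updating-equations slide (numpy; label set, state size, maximum outdegree and weights are illustrative assumptions):

```python
import numpy as np

# Recursive state update sketch: x(v) = f(x(ch[v]), I(v)), o = g(x(root)).
# A tree node is (label_index, [children]); sizes and weights are toy values.

rng = np.random.default_rng(2)
n_labels, n_state, max_children = 80, 20, 15
W_label = rng.normal(0, 0.1, (n_state, n_labels))
W_child = rng.normal(0, 0.1, (n_state, max_children * n_state))
W_out = rng.normal(0, 0.1, (1, n_state))

def f(child_states, label_index):
    # pad the (variable-length) child state list up to max_children slots
    padded = np.zeros(max_children * n_state)
    for i, s in enumerate(child_states):
        padded[i * n_state:(i + 1) * n_state] = s
    label = np.eye(n_labels)[label_index]
    return np.tanh(W_label @ label + W_child @ padded)

def encode(node):
    label_index, children = node
    return f([encode(c) for c in children], label_index)

def g(root_state):
    return float(W_out @ root_state)   # a raw score for this tree

# toy incremental tree: S -> (NP -> PRP) (VP -> VBZ)
tree = (0, [(1, [(2, [])]), (3, [(4, [])])])
print(g(encode(tree)))
```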
Learning preferences
• Task: learn P(O|I), where O = {0,1}^m and I = {T1, T2, ..., Tm}
• O indicates the only correct element in the forest of legal alternatives, e.g. o = (0,0,0,1,0,0)
• This is a dynamic classification problem: m varies from example to example; typically m is around 120 (but it can go up to 2000)

The learning task: forward phase
[Figure: the learning target assigns 1 to the correct attachment in the training set and 0 to every other (negative) alternative Ti; the network proposes an output over the alternatives via softmax, e.g. .02, .95, .01]

The learning task: backward phase
[Figure: the error between the target and the proposed output is propagated back into each alternative tree by backpropagation through structure]

Running in test mode
[Figure: given a left context and all the possible attachments for the current input word, the "best" attachment is the alternative with the highest softmax output]

Experimental setup
• Training on the WSJ section of the Penn Treebank: a realistic corpus, representative of natural language; large size (40,000 sentences, 1 million words); uniform language (articles on economic subjects)

Investigations
• Interesting questions: How are alternatives prioritized? What information is the system really using?
• Analyse the error correlation with structural features
• We correlate the system error with "interesting" structural features: number of nodes, height and frequency of the whole incremental tree or of the connection path; or characteristics such as outdegree or depth of the average node, the anchor, the root

Error correlation (Spearman rank correlation)

  Description        Feature         Corr
  Average node       Max outdegree    0.18
  Average node       Avg outdegree    0.02
  Root               Avg outdegree    0.21
  Anchor             Depth            0.17
  Incremental tree   Num nodes        0.20
  Incremental tree   Height           0.19
  Connection path    Num nodes        0.33
  Connection path    Height           0.34
  Connection path    Frequency       -0.39
  Forest             Size             0.31
  Word               Index            0.19

Refine analyses
• Analyze and characterize the incremental trees preferred by the network

Discriminating features
• AIM: find features of incremental trees that discriminate between correctly classified trees and mistakenly preferred trees
• We consider: true positives (correct incremental trees that are preferred by the network) and false positives (trees preferred by the network that are not the correct ones)

What leads to mistakes?
• Discover and characterize feature differences in incremental trees when comparing the net's false-positive choice against the true element
• Features compared: tree max outdegree, tree height, tree num nodes, connection path height, connection path num nodes, anchor outdegree, anchor depth

  Description        True elem   False pos   Diff/sd
  Tree height        7.38        7.20        0.05
  Tree num nodes     30.91       30.55       0.02
  Cpath height       2.12        1.67        0.64
  Cpath num nodes    3.35        2.77        0.58

Result
• The trees preferred by the network are simpler than the correct ones: less high, with fewer nodes, and with connection paths that are shorter and have fewer nodes

Further analyses
• Analyze separately the 2 tasks in structural ambiguity resolution: choosing the correct attachment site; choosing the correct connection path
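A minimal sketch of the forest classification step sketched above: every alternative tree gets a scalar score (here random numbers stand in for the recursive network's output), a softmax over the variable-size forest yields P(O|I), and the loss is the cross-entropy against the single correct alternative (all names and sizes are illustrative):

```python
import numpy as np

# Forest classification sketch: each of the m alternative incremental trees
# receives a scalar score s_i; a softmax over the forest gives P(O|I).

rng = np.random.default_rng(3)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

m = 120                          # typical forest size (varies per example)
scores = rng.normal(size=m)      # one score per alternative tree T_1..T_m
correct = 17                     # index of the correct attachment (toy value)

p = softmax(scores)              # proposed output over the forest
loss = -np.log(p[correct])       # cross-entropy with the one-hot target
grad = p.copy()
grad[correct] -= 1.0             # d loss / d scores, sent back through each tree
print(p.argmax(), round(loss, 3))
```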
Anchor attachment
• AIM: what is the predictive performance of the net if we consider only the anchor attachment problem?
• EXPERIMENT: group all the connection paths with the same anchor point into classes; rank the groups according to the highest score given by the net to any member of the group; consider a group correctly predicted iff the correct connection path belongs to the group

Experiment: choose the correct attachment site
• The network's performance on attachment-site prediction has an accuracy of 91.54%
[Figure: percentage correctly classified by position in the ranking (1 to 3); y-axis from 86% to 100%]

Baseline
• AIM: how good is the previous result? Is the network more precise than a simple frequency heuristic?
• Experiment: rank the connection path groups according to the group's most frequent element; compare against the network's choice

Is frequency all that is being used?
• Net preference against the frequency heuristic: there is more to it
[Figure: percentage correct by position (1 to 3) for the network vs. the frequency heuristic; y-axis from 60% to 100%]

What kind of statistics is the network using to predict the attachment site?
• Claim: the network is approximating human heuristics
• Consequence: simulate the heuristics and match them with the network's preferences
• Order alternative attachments by: Heuristic 1: Late Closure first, then Minimal Attachment, then frequency of the CP; Heuristic 2: Minimal Attachment first, then Late Closure, then frequency of the CP

Results
• How many times do the choices of the system coincide with those of the heuristics?

  Position   Net = LC-MA   Net = MA-LC
  1          43.50%        44.50%
  1 & 2      78.30%        61.50%

How is the correct CP chosen?
• The trained network was given a test set with correct attachment sites only, and was required to choose the correct connection path

Results
• Given the attachment site, the correct connection path is chosen 89% of the time
• Almost all correctly chosen CPs are the most frequent ones (only 3.5% of the correctly chosen CPs are not the most frequent)

Summary of results
• Attachment-site preferences are prioritized over choosing the most frequent connection path
• Attachment-site preferences model human heuristics
• Connection paths are then chosen on a frequency basis

Enhancing results
• Structured data can be naturally manipulated and made "simpler"
• The domain can be naturally partitioned into non-overlapping classes
• Idea: we could improve accuracy by means of: structural reduction; a modular network

How much information in the tree does the network consider when making attachment decisions?
• Claim: the network is not affected by structure deeply embedded beyond the right frontier
• Consequence: that information is not relevant (= noise?) and could (should!) be deleted from the tree

Reduced incremental trees
[Figure: example left context for "a friend of Jim saw the thief" with the connection path for "with"; successive slides keep only the right frontier, then the right frontier plus the c-commanding nodes, then also the connection path, yielding a reduced tree containing only the nodes S, NP, VP, V, D, N, PP, P]

Experimental Setup
• Training sets: generated from the WSJ Penn Treebank (40k sentences); a "Full" training set with complete trees; a "Reduced" training set with reduced trees
• Testing: compare the performance of the networks trained on the full and the reduced datasets on the disambiguation task (2k sentences)
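A sketch of the structural reduction idea, assuming a simple nested-tuple tree representation (node label plus children); the c-commanding nodes are approximated here as the left sisters of right-frontier nodes, kept without their descendants, which is a simplification of the reduction used in the slides:

```python
# Sketch of the tree reduction described above: keep the right frontier of
# the left context plus (approximately) the nodes that c-command it.
# Trees are (label, [children]) tuples; an assumed toy representation.

def reduce_tree(node):
    label, children = node
    if not children:
        return (label, [])
    reduced_children = []
    for child in children[:-1]:
        # c-commanding left sisters survive only as bare labels
        reduced_children.append((child[0], []))
    # the rightmost child is on the right frontier: recurse into it
    reduced_children.append(reduce_tree(children[-1]))
    return (label, reduced_children)

# "a friend of Jim saw the ..." style left context (hypothetical shape)
tree = ("S",
        [("NP", [("D", []), ("N", []), ("PP", [("P", []), ("N", [])])]),
         ("VP", [("V", []), ("NP", [("D", []), ("N", [])])])])

print(reduce_tree(tree))
# -> ('S', [('NP', []), ('VP', [('V', []), ('NP', [('D', []), ('N', [])])])])
```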
Results
[Figure: percentage correctly predicted by position (1 to 4) for the reduced-tree network vs. the baseline; y-axis from 80% to 100%]

Summary of results
• Reduction of the trees did not hinder performance: accuracy increased with reduced information; deep information constitutes noise for the system
• The network makes little or no use of information beyond the right frontier
• Linguistic evidence that deeply embedded information is less accessible: the Right Roof Constraint on extraposition (Ross, 1967)

Domain partitioning
• Claim: the information needed to learn how to disambiguate the attachment site depends on the word that is being processed
• Ex: how to disambiguate the attachment of punctuation can be learned separately from how to attach verbs
• Conclusion: different networks would benefit from specializing in, and learning preferences for, each type of word

Word classes
• The information on the leaves is represented by Part-Of-Speech tags
• We group them so as to obtain the following classes: nouns, verbs, prepositions, adjectives, adverbs, articles, conjunctions, punctuation, other

Experimental setup
• The WSJ 40K-sentence dataset is partitioned
• A modular network is trained on the partitioned dataset
• The test is run on the 2K partitioned test dataset

Results
[Figure: percentage correct and error reduction per word class (noun, verb, article, adjective, punctuation, conjunction, other, preposition, adverb)]

Summary of results
• The modular network outperforms the single network: single network 84.82%, modular network 87.14%, total error reduction 15.3%
• Explanatory hypothesis: the partition of the task has led to a better allocation of resources in the state representation space; states coding for less frequent structures no longer have to compete with the more frequent ones of other categories

Conclusions
• We can use connectionist systems to model complex preference tasks in a structured domain
• On a structured domain it is easier to investigate the learned properties
• Due to the highly structured nature of the data and of the task, we could improve accuracy by means of: structural reduction of the data; natural partitioning of the domain

Future work
• Introduce lexical information (note: the WSJ has 20,000 lexical items!)
• Use the preference model as an informer for an incremental parser

Thank you