Sentence Processing using
a Simple Recurrent Network
EE 645 Final Project
Spring 2003
Dong-Wan Kang
5/14/2003
Contents
1. Introduction - Motivations
2. Previous & Related Works
a) McClelland & Kawamoto (1986)
b) Elman (1990, 1993, & 1999)
c) Miikkulainen (1996)
3. Algorithms (Williams and Zipser, 1989)
- Real Time Recurrent Learning
4. Simulations
5. Data & Encoding schemes
6. Results
7. Discussion & Future work
Motivations
• Can the neural network recognize the lexical
classes from the sentence and learn the various
types of sentences?
• From a cognitive science perspective:
- comparison between human language learning
and neural network learning patterns
- (e.g.) learning the English past tense (Rumelhart &
McClelland, 1986), grammaticality judgment (Allen &
Seidenberg, 1999), embedded sentences (Elman, 1993;
Miikkulainen, 1996, etc.)
Related Works
• McClelland & Kawamoto (1986)
- Sentences with case role assignments and semantic features,
learned using the backpropagation algorithm
- output: 2500 case role units for each sentence
- (e.g.)
  input:  the boy hit the wall with the ball.
  output: [ Agent | Verb | Patient | Instrument ] + [other features]
- Limitation: places a hard limit on the size of the input.
- Alternative: instead of detecting input patterns displaced in
space, detect patterns displaced in time (sequential inputs).
Related Works (continued)
• Elman (1990,1993, & 1999)
- Simple Recurrent Network: Partially
Recurrent Network using Context units
- Network with a dynamic memory
- Context units at time t hold a copy of the hidden unit
activations from the previous time step (t - 1).
- The network can recognize sequences.
- (e.g.)
  input:  Many  | years | ago | boy | and | girl | …
  output: years | ago   | boy | and | girl | …
Related Works (continued)
• Miikkulainen (1996)
- SPEC architecture
(Subsymbolic Parser for Embedded Clauses; a recurrent network)
- Parser, Segmenter, and Stack:
processes center- and tail-embedded sentences: 98,100
sentences of 49 different sentence types, using case role
assignments
- (e.g.) sequential inputs
  input:  …, the girl, who, liked, the dog, saw, the boy, …
  output: …, [the girl, saw, the boy] [the girl, liked, the dog]
  case role: (agent, act, patient)   (agent, act, patient)
Algorithms
• Recurrent Network
- Unlike feedforward networks, they allow connections both ways
between a pair of units and even from a unit to itself.
- Backpropagation through time (BPTT) – unfolds the temporal
operation of the network into a layered feedforward network at
every time step. (Rumelhart, et al., 1986)
- Real Time Recurrent Learning (RTRL) – two versions (Williams and
Zipser, 1989)
1) update the weights after the processing of a sequence is completed.
2) on-line: update the weights while the sequence is being presented.
- Simple Recurrent Network (SRN) – partially recurrent network in
terms of time and space. It has context units which store the
outputs of the hidden units (Elman, 1990). (It can be obtained as a
modification of the RTRL algorithm.)
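Below is a minimal NumPy sketch of the SRN copy-back mechanism described above (an illustration only, not the tlearn implementation; the weight initialization and logistic activation are assumptions):

```python
import numpy as np

# Minimal SRN (Elman network) forward pass: context units hold a copy of
# the previous hidden activations, giving the network a dynamic memory.
rng = np.random.default_rng(0)

n_in, n_hid, n_out = 31, 150, 31           # sizes used later in this project
W_ih = rng.normal(0, 0.1, (n_hid, n_in))   # input   -> hidden
W_ch = rng.normal(0, 0.1, (n_hid, n_hid))  # context -> hidden
W_ho = rng.normal(0, 0.1, (n_out, n_hid))  # hidden  -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srn_forward(sequence):
    """Run one word sequence; each element is a 31-bit input vector."""
    context = np.zeros(n_hid)              # context units start at zero
    outputs = []
    for x in sequence:
        hidden = sigmoid(W_ih @ x + W_ch @ context)
        outputs.append(sigmoid(W_ho @ hidden))
        context = hidden.copy()            # copy-back: h(t-1) becomes context at t
    return outputs
```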
Real Time Recurrent Learning
• Williams and Zipser (1989)
- This algorithm computes the derivatives of states and
outputs with respect to all weights as the network
processes the sequence.
• Summary of Algorithm:
In a recurrent network, for any unit $V_i$ connected to any other unit,
with external input $\xi_i(t)$ at node $i$ at time $t$, the dynamic
update rule is:

$$V_i(t) = g(h_i(t-1)) = g\Big(\sum_j w_{ij}\,V_j(t-1) + \xi_i(t-1)\Big)$$
RTRL (continued)
• Error measure: with target outputs $\zeta_k(t)$ defined for some k's and t's,

$$E_k(t) = \begin{cases} \zeta_k(t) - V_k(t) & \text{if } \zeta_k(t) \text{ is defined at time } t \\ 0 & \text{otherwise} \end{cases}$$

• Total cost function

$$E = \sum_{t=0}^{T} E(t), \quad t = 0, 1, \dots, T, \qquad \text{where} \quad E(t) = \frac{1}{2}\sum_k \big[E_k(t)\big]^2$$
RTRL (continued)
• The gradient of E separates in time; to do gradient descent, we define:

$$\Delta w_{pq}(t) = -\alpha\,\frac{\partial E(t)}{\partial w_{pq}} = \alpha \sum_k E_k(t)\,\frac{\partial V_k(t)}{\partial w_{pq}}$$

• The derivative of the update rule:

$$\frac{\partial V_i(t)}{\partial w_{pq}} = g'(h_i(t-1))\left[\delta_{ip}\,V_q(t-1) + \sum_j w_{ij}\,\frac{\partial V_j(t-1)}{\partial w_{pq}}\right]$$

with initial condition at t = 0:

$$\frac{\partial V_i(0)}{\partial w_{pq}} = 0$$
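A minimal NumPy sketch of the on-line RTRL update defined by these equations (toy dimensions, learning rate, and logistic activation are assumptions; the actual simulations here use tlearn):

```python
import numpy as np

# Minimal on-line RTRL sketch for a fully recurrent net of logistic units,
# following the update and sensitivity equations above.
rng = np.random.default_rng(0)
n = 5                                   # number of units (toy size)
alpha = 0.1                             # learning rate
W = rng.normal(0, 0.1, (n, n))          # recurrent weights w_ij
V = np.zeros(n)                         # unit activations V_i(t)
P = np.zeros((n, n, n))                 # P[i, p, q] = dV_i(t)/dw_pq, zero at t = 0

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(h):
    y = g(h)
    return y * (1.0 - y)

def rtrl_step(xi, target, mask):
    """One on-line step: xi is the external input, target the desired output,
    mask marks the k's for which a target is defined at this time step."""
    global V, P, W
    h = W @ V + xi                              # h_i(t-1)
    V_new = g(h)                                # V_i(t)
    # dV_i/dw_pq = g'(h_i) [ delta_ip V_q(t-1) + sum_j w_ij dV_j(t-1)/dw_pq ]
    P_new = np.einsum('ij,jpq->ipq', W, P)      # sum_j w_ij dV_j/dw_pq
    for p in range(n):
        P_new[p, p, :] += V                     # delta_ip V_q(t-1) term
    P_new *= g_prime(h)[:, None, None]
    # error and on-line weight update: dw_pq = alpha * sum_k E_k dV_k/dw_pq
    E = np.where(mask, target - V_new, 0.0)
    W += alpha * np.einsum('k,kpq->pq', E, P_new)
    V, P = V_new, P_new
```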
RTRL (continued)
• Depending on when the weights are updated, there are two versions of RTRL:
1) update the weights after the sequence is completed (at t = T).
2) update the weights after each time step: on-line.
• Elman's "tlearn" simulator program for the Simple Recurrent
Network (which is used for this project) is implemented based on the
classical backpropagation algorithm and a modification of this RTRL
algorithm.
Simulation
• Based on Elman's data and Simple Recurrent Network (1990, 1993,
& 1999), simple sentences and embedded sentences are simulated
using the "tlearn" neural network program (BP + modified version of
the RTRL algorithm), available at http://crl.ucsd.edu/innate/index.shtml.
• Questions:
1. Can the network discover the lexical classes from word
order?
2. Can the network recognize the relative pronouns and
predict them?
Network Architecture
• 31 input nodes
• 31 output nodes
• 150 hidden nodes
• 150 context nodes
* black arrow: distributed and learnable connections
* dotted blue arrow: linear function and one-to-one connection
  with the hidden nodes
Training Data
• Lexicon (31 words)
NOUN-HUM:     man woman boy girl
NOUN-ANIM:    cat mouse dog lion
NOUN-INANIM:  book rock car
NOUN-AGRESS:  dragon monster
NOUN-FRAG:    glass plate
NOUN-FOOD:    cookie bread sandwich
VERB-INTRAN:  think sleep exist
VERB-TRAN:    see chase like
VERB-AGPAT:   move break
VERB-PERCEPT: smell see
VERB-DESTROY: break smash
VERB-EAT:     eat
RELAT-HUM:    who
RELAT-INHUM:  which

• Grammar (16 templates)
NOUN-HUM VERB-EAT NOUN-FOOD
NOUN-HUM VERB-PERCEPT NOUN-INANIM
NOUN-HUM VERB-DESTROY NOUN-FRAG
NOUN-HUM VERB-INTRAN
NOUN-HUM VERB-TRAN NOUN-HUM
NOUN-HUM VERB-AGPAT NOUN-INANIM
NOUN-HUM VERB-AGPAT
NOUN-ANIM VERB-EAT NOUN-FOOD
NOUN-ANIM VERB-TRAN NOUN-ANIM
NOUN-ANIM VERB-AGPAT NOUN-INANIM
NOUN-ANIM VERB-AGPAT
NOUN-INANIM VERB-AGPAT
NOUN-AGRESS VERB-DESTROY NOUN-FRAG
NOUN-AGRESS VERB-EAT NOUN-HUM
NOUN-AGRESS VERB-EAT NOUN-ANIM
NOUN-AGRESS VERB-EAT NOUN-FOOD
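A sketch of how the training sentences could be generated from this lexicon and these templates (a hypothetical generator, not the script actually used; only 4 of the 16 templates are written out):

```python
import itertools
import random

# Expand grammar templates over the lexicon to produce training sentences.
LEXICON = {
    "NOUN-HUM": ["man", "woman", "boy", "girl"],
    "NOUN-ANIM": ["cat", "mouse", "dog", "lion"],
    "NOUN-INANIM": ["book", "rock", "car"],
    "NOUN-AGRESS": ["dragon", "monster"],
    "NOUN-FRAG": ["glass", "plate"],
    "NOUN-FOOD": ["cookie", "bread", "sandwich"],
    "VERB-INTRAN": ["think", "sleep", "exist"],
    "VERB-TRAN": ["see", "chase", "like"],
    "VERB-AGPAT": ["move", "break"],
    "VERB-PERCEPT": ["smell", "see"],
    "VERB-DESTROY": ["break", "smash"],
    "VERB-EAT": ["eat"],
}
TEMPLATES = [
    ["NOUN-HUM", "VERB-EAT", "NOUN-FOOD"],
    ["NOUN-HUM", "VERB-PERCEPT", "NOUN-INANIM"],
    ["NOUN-HUM", "VERB-DESTROY", "NOUN-FRAG"],
    ["NOUN-HUM", "VERB-INTRAN"],
    # ... the remaining 12 templates follow the same pattern
]

def generate_sentences(templates, lexicon):
    """Expand every template into all word combinations it licenses."""
    for template in templates:
        for words in itertools.product(*(lexicon[cat] for cat in template)):
            yield list(words)

sentences = list(generate_sentences(TEMPLATES, LEXICON))
random.shuffle(sentences)   # the corpus is presented as one long word stream
```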
Sample Sentences & Mapping
• Simple sentences – 2 types
  - man think           (2 words)
  - girl see dog        (3 words)
  - man break glass     (3 words)
• Embedded sentences – 3 types (*RP – Relative Pronoun)
  1. monster eat man who sleep           (RP-sub, VERB-INTRAN)
  2. dog see man who eat sandwich        (RP-sub, VERB-TRAN)
  3. woman eat cookie which cat chase    (RP-obj, VERB-TRAN)
• Input-Output Mapping: (predict the next input – sequential input)
  INPUT:  girl | see | dog | man   | break | glass | cat   | …
  OUTPUT: see  | dog | man | break | glass | cat   | …
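The input-output mapping can be sketched as follows (a hypothetical helper illustrating the predict-the-next-word task; the function name is an assumption):

```python
# The target for each word is simply the next word in the concatenated
# stream of sentences.
def make_prediction_pairs(sentences):
    """Flatten the sentences into one stream and pair each word with its successor."""
    stream = [word for sentence in sentences for word in sentence]
    return list(zip(stream[:-1], stream[1:]))

pairs = make_prediction_pairs([["girl", "see", "dog"],
                               ["man", "break", "glass"]])
# [('girl', 'see'), ('see', 'dog'), ('dog', 'man'),
#  ('man', 'break'), ('break', 'glass')]
```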
Encoding scheme
• Random word representation
  sleep → 0000000000000000000000000000001
  dog   → 0000100000000000000000000000000
  woman → 0000000000000000000000000010000
  …
- 31-bit vector for each lexical item; each lexical item is
  represented by a single, randomly assigned bit.
- not a semantic feature encoding
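A sketch of this random one-bit-per-word encoding (the word list and random seed are assumptions; the real lexicon has 31 words):

```python
import numpy as np

# Assign each word its own randomly chosen bit in a vector of length
# len(words); no semantic features are encoded.
def build_encoding(words, seed=0):
    rng = np.random.default_rng(seed)
    bits = rng.permutation(len(words))          # random bit position per word
    table = {}
    for word, bit in zip(words, bits):
        vec = np.zeros(len(words), dtype=int)
        vec[bit] = 1
        table[word] = vec
    return table

encoding = build_encoding(["man", "woman", "boy", "girl", "sleep", "dog"])
print(encoding["sleep"])
```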
Training a network
• Incremental input (Elman, 1993): the "starting small" strategy
• Phase I: simple sentences (Elman, 1990, used 10,000 sentences)
- 1,564 sentences generated (4,636 31-bit vectors)
- train on all patterns: learning rate = 0.1, 23 epochs
• Phase II: embedded sentences (Elman, 1993, used 7,500 sentences)
- 5,976 sentences generated (35,688 31-bit vectors)
- loaded with the weights from Phase I
- train on the (1,564 + 5,976) sentences together:
learning rate = 0.1, 4 epochs
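A sketch of this two-phase "starting small" schedule (the epoch counts and learning rate come from the slides; `train_epoch` is a stand-in for one backpropagation/RTRL pass over the pattern file and is an assumption):

```python
# Phase I trains on simple sentences only; Phase II continues from the
# Phase I weights with the embedded sentences added.
def starting_small(network, simple_data, embedded_data, train_epoch, lr=0.1):
    for _ in range(23):                               # Phase I: 23 epochs
        train_epoch(network, simple_data, lr)
    for _ in range(4):                                # Phase II: 4 epochs
        train_epoch(network, simple_data + embedded_data, lr)
```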
Performance
• Network performance was measured by the Root Mean Squared (RMS) error:

$$\text{RMS error} = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\,\lVert t_k - y_k \rVert^2}$$

where N is the number of input patterns, $t_k$ the target output vector, and
$y_k$ the actual output vector for pattern k.
• Phase I: After 23 epochs, RMS ≈ 0.91
• Phase II: After 4 epochs, RMS ≈ 0.84
• Why can the RMS error not be lowered further? The prediction task is
nondeterministic, so the network cannot produce a unique output
for a given input. For this simulation, RMS error is NOT
the best measure of performance.
• Elman's simulations: RMS = 0.88 (1990),
Mean Cosine = 0.852 (1993)
<Training error plots: Phase I, RMS ≈ 0.91 after 23 epochs; Phase II, RMS ≈ 0.84 after 4 epochs>
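A minimal sketch of the RMS error measure defined above (array names and shapes are assumptions):

```python
import numpy as np

# RMS error over all input patterns, matching the formula on this slide.
def rms_error(targets, outputs):
    """targets, outputs: arrays of shape (num_patterns, 31)."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.sqrt(np.sum((targets - outputs) ** 2) / len(targets))
```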
Results and Analysis
<Test results after Phase II>

Output      Target
…
which       (target: which)
?           (target: lion)
?           (target: see)
?           (target: boy)   ← start of sentence
?           (target: move)
?           (target: sandwich)
which       (target: which)
?           (target: cat)
?           (target: see)
…
?           (target: book)
which       (target: which)
?           (target: man)
see         (target: see)
…
?           (target: dog)
?           (target: chase)
?           (target: man)
?           (target: who)
?           (target: smash)
?           (target: glass)
…
• The arrow (←) indicates the start of a sentence.
• In all positions, the word "which" is predicted correctly! But most of the
words are not predicted, including the word "who". Why? → Training data.
• Since the prediction task is nondeterministic, predicting the exact next
word cannot be the best performance measurement.
• We need to look at the hidden unit activations for each input, since they
reflect what the network has learned about the classes of inputs with regard
to what they predict. → Cluster analysis, PCA
Cluster Analysis
• The network successfully recognizes VERB, NOUN, and some of their
subcategories.
• WHO and WHICH have different distances.
• VERB-INTRAN failed to fit within VERB.
<Hierarchical cluster diagram of hidden unit activation vectors>
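A sketch of the cluster analysis step using SciPy (hypothetical; `hidden_activations` is assumed to map each word to the 150-dimensional hidden vectors recorded whenever that word was presented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Average each word's hidden unit activation vectors over all its
# occurrences, then build a hierarchical clustering of the averages.
def cluster_words(hidden_activations):
    words = sorted(hidden_activations)
    means = np.array([np.mean(hidden_activations[w], axis=0) for w in words])
    Z = linkage(means, method="average", metric="euclidean")
    return Z, words   # pass to scipy.cluster.hierarchy.dendrogram to draw the diagram
```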
Discussion & Conclusion
1. The network can discover the lexical classes from word order. NOUN
and VERB form different classes, except for VERB-INTRAN. Also, the
subclasses of NOUN are classified correctly, but some subclasses of
VERB are mixed. This is related to the input examples.
2. The network can recognize and predict the relative pronoun "which",
but not "who". Why? Because the sentences containing "who" are not of
the RP-obj type, so "who" is treated as just another subject, as in
simple sentences.
3. The organization of the input data is important, and the recurrent
network is sensitive to it, since it processes the input sequentially
and on-line.
4. In general, recurrent networks using RTRL recognized the sequential
input, but they require more training time and computational resources.
Future Studies
• Recurrent Least Squares Support Vector
Machines (Suykens, J.A.K., & Vandewalle, J., 2000)
- provides new perspectives for time-series
prediction and nonlinear modeling
- seems more efficient than BPTT, RTRL & SRN
References
Allen, J., & Seidenberg, M. (1999). The emergence of grammaticality in connectionist networks. In
Brian MacWhinney (Ed.), The emergence of language (pp.115-151). Hillsdale, NJ: Lawrence
Erlbaum.
Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Elman, J. (1993). Learning and development in neural networks: the importance of starting small.
Cognition, 48, 71-99.
Elman, J.L. (1999). The emergence of language: A conspiracy theory. In B. MacWhinney (Ed.),
The emergence of language. Hillsdale, NJ: Lawrence Erlbaum.
Hertz, J., Krogh, A., & Palmer, R.G. (1991). Introduction to the Theory of Neural Computation.
Redwood City, CA: Addison-Wesley.
McClelland, J.L., & Kawamoto, A.H. (1986). Mechanisms of sentence processing: Assigning roles to
constituents of sentences (pp. 273-325). In J.L. McClelland & D.E. Rumelhart (Eds.), Parallel
distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.
Miikkulainen, R. (1996). Subsymbolic case-role analysis of sentences with embedded clauses.
Cognitive Science, 20, 47-73.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error
propagation. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing: Explorations in
the microstructure of cognition, Vol. 1 (pp. 318-362). Cambridge, MA: MIT Press.
Rumelhart, D.E., & McClelland, J.L. (1986). On learning the past tense of English verbs. In J.L.
McClelland & D.E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the
microstructure of cognition. Cambridge, MA: MIT Press.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural
networks. Neural Computation, 1, 270-280.