Simple Word Vector representations: word2vec, GloVe

Transcript
Simple Word Vector representations: word2vec, GloVe
Dr. Kira Radinsky
CTO, SalesPredict
Visiting Professor/Scientist, Technion
Slides were adapted from lectures by Richard Socher
How do we represent the meaning of a word?
Definition: Meaning (Webster dictionary)
•  the idea that is represented by a word, phrase, etc.
•  the idea that a person wants to express by using words, signs, etc.
•  the idea that is expressed in a work of writing, art, etc.
How to represent meaning in a computer?
Common answer: Use a taxonomy like WordNet that has hypernyms (is-a) relationships and synonym sets (good):
[Synset('procyonid.n.01'),
Synset('carnivore.n.01'),
Synset('placental.n.01'),
Synset('mammal.n.01'),
Synset('vertebrate.n.01'),
Synset('chordate.n.01'),
Synset('animal.n.01'),
Synset('organism.n.01'),
Synset('living_thing.n.01'),
Synset('whole.n.02'),
Synset('object.n.01'),
Synset('physical_entity.n.01'),
Synset('entity.n.01')]
S: (adj) full, good
S: (adj) estimable, good, honorable, respectable
S: (adj) beneficial, good
S: (adj) good, just, upright
S: (adj) adept, expert, good, practiced, proficient, skillful
S: (adj) dear, good, near
S: (adj) good, right, ripe
…
S: (adv) well, good
S: (adv) thoroughly, soundly, good
S: (n) good, goodness
S: (n) commodity, trade good, good
Problems with this discrete representation
•  Great as a resource but missing nuances, e.g. synonyms: adept, expert, good, practiced, proficient, skillful?
•  Missing new words (impossible to keep up to date): wicked, badass, nifty, crack, ace, wizard, genius, ninja
•  Subjective
•  Requires human labor to create and adapt
→ Hard to compute accurate word similarity!
Problems with this discrete representation
The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk
In vector space terms, this is a vector with one 1 and a lot of zeroes:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
We call this a "one-hot" representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
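To make the problem concrete, here is a minimal Python sketch (the vocabulary size and word positions are invented for illustration) showing that any two distinct one-hot vectors have a dot product of zero, so they encode no notion of similarity:

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a one-hot vector with a single 1 at position `index`."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Hypothetical vocabulary positions for two obviously related words
vocab_size = 15
motel = one_hot(10, vocab_size)
hotel = one_hot(7, vocab_size)

print(motel @ hotel)  # 0.0 -- one-hot vectors carry no similarity information
```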
Distributional similarity based representations
You can get a lot of value by representing a word by means of its neighbors
"You shall know a word by the company it keeps"
(J.R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP
government debt problems turning into banking crises as has happened in
saying that Europe needs unified banking regulation to replace the hodgepodge
" Thesewordswillrepresentbanking!
How to make neighbors represent words?
Answer: With a cooccurrence matrix X
•  2 options: full document vs. windows
•  Word-document cooccurrence matrix will give general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis"
•  Window allows us to capture both syntactic (POS) and semantic information!
Window based cooccurrence matrix
•  Window length 1 (more common: 5-10)
•  Symmetric (irrelevant whether left or right context)
•  Example corpus:
•  I like deep learning.
•  I like NLP.
•  I enjoy flying.
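A short sketch of how such a matrix could be built for this toy corpus (window length 1, symmetric); the tokenization and variable names are illustrative, not from the slides:

```python
from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1

vocab = []
counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)
    for i, center in enumerate(tokens):
        # symmetric window: `window` tokens to the left and to the right
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(center, tokens[j])] += 1

# Print the co-occurrence matrix row by row, in vocabulary order
for w in vocab:
    print(f"{w:>10}", [counts[(w, c)] for c in vocab])
```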
Window based cooccurrence matrix
•  Example corpus:
•  I like deep learning.
•  I like NLP.
•  I enjoy flying.

counts    | I | like | enjoy | deep | learning | NLP | flying | .
----------|---|------|-------|------|----------|-----|--------|---
I         | 0 |  2   |   1   |  0   |    0     |  0  |   0    | 0
like      | 2 |  0   |   0   |  1   |    0     |  1  |   0    | 0
enjoy     | 1 |  0   |   0   |  0   |    0     |  0  |   1    | 0
deep      | 0 |  1   |   0   |  0   |    1     |  0  |   0    | 0
learning  | 0 |  0   |   0   |  1   |    0     |  0  |   0    | 1
NLP       | 0 |  1   |   0   |  0   |    0     |  0  |   0    | 1
flying    | 0 |  0   |   1   |  0   |    0     |  0  |   0    | 1
.         | 0 |  0   |   0   |  0   |    1     |  1  |   1    | 0
Problems with simple cooccurrence vectors
Increase in size with vocabulary
Very high dimensional: requires a lot of storage
Subsequent classification models have sparsity issues
→ Models are less robust
Solution: Low dimensional vectors
•  Idea: store "most" of the important information in a fixed, small number of dimensions: a dense vector
•  Usually around 25–1000 dimensions
•  How to reduce the dimensionality?
Method 1: Dimensionality Reduction on X
Singular Value Decomposition of cooccurrence matrix X.
[Diagram: X (n×m) = U (n×r) · S (r×r diagonal matrix with singular values S1, S2, …, Sr) · Vᵀ (r×m); in the reduced version only the top k singular values are kept, giving X̂ = U (n×k) · S (k×k) · Vᵀ (k×m)]
X̂ is the best rank-k approximation to X, in terms of least squares.
[Figure: rank-k SVD approximations of a photo of Prof. Richard Feynman]
k = 1 (take the first singular vector)
k = 2
k = 10
k = 50
Simple SVD word vectors in Python
Corpus: I like deep learning. I like NLP. I enjoy flying.
Printing first two columns of U corresponding to the 2 biggest singular values
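The Python code on these slides did not survive the transcript; a minimal reconstruction of what it presumably did (NumPy SVD of the toy co-occurrence matrix above, then printing the first two columns of U) might look like this:

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
])

# SVD of the co-occurrence matrix; singular values come back in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# First two columns of U correspond to the 2 biggest singular values
for word, row in zip(words, U[:, :2]):
    print(f"{word:>10}: {row}")
```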
Word meaning is defined in terms of vectors
•  In all subsequent models, including deep learning models, a word is represented as a dense vector
linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
Hacks to X
•  Problem: function words (the, he, has) are too frequent - syntax has too much impact. Some fixes:
•  min(X, t), with t ≈ 100
•  Ignore them all
•  Ramped windows that count closer words more
•  Use Pearson correlations instead of counts, then set negative values to 0
•  +++
Interesting semantic patterns emerge in the vectors
[Figure: 2D projection of word vectors in which body parts (WRIST, ANKLE, SHOULDER, ARM, LEG, HAND, FOOT, HEAD, NOSE, FINGER, TOE, FACE, EAR, EYE, TOOTH), animals (DOG, CAT, PUPPY, KITTEN, COW, MOUSE, TURTLE, OYSTER, LION, BULL), and places (CHICAGO, ATLANTA, MONTREAL, NASHVILLE, TOKYO, CHINA, RUSSIA, AFRICA, ASIA, EUROPE, AMERICA, BRAZIL, MOSCOW, FRANCE, HAWAII) form separate clusters]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. 2005
Interesting semantic patterns emerge in the vectors
[Figure: 2D projection in which morphological forms of verbs cluster together: CHOOSE/CHOSE/CHOSEN/CHOOSING, STEAL/STOLE/STOLEN/STEALING, SPEAK/SPOKE/SPOKEN/SPEAKING, TAKE/TOOK/TAKEN/TAKING, THROW/THREW/THROWN/THROWING, SHOW/SHOWED/SHOWN/SHOWING, EAT/ATE/EATEN/EATING, GROW/GREW/GROWN/GROWING]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. 2005
Interesting semantic patterns emerge in the vectors
[Figure: 2D projection in which related nouns and verbs appear near each other: DRIVER-DRIVE, SWIMMER-SWIM, TEACHER-TEACH, STUDENT-LEARN, JANITOR-CLEAN, DOCTOR-TREAT, PRIEST-PRAY, BRIDE-MARRY]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. 2005
Problems with SVD
Computational cost scales quadratically for an n×m matrix: O(mn²) flops (when n < m)
→ Bad for millions of words or documents
Hard to incorporate new words or documents
Different learning regime than other DL models
More next lesson…
Quick Reminder
Language Models (Unigrams, Bigrams, etc.)
Create a model that will assign a probability to a sequence of tokens. For example, what is the probability of each of the following sentences:
"The cat jumped over the puddle."
vs.
"stock boil fish is toy."
i.e. the probability of any given sequence of n words
Language Models (Unigrams, Bigrams, etc.)
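The formulas on this slide were lost in the transcript; the standard unigram and bigram factorizations it presumably showed are:

```latex
\text{Unigram: } P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i)
\qquad
\text{Bigram: } P(w_1, w_2, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
```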
Window based cooccurrence matrix
•  Example corpus: I like deep learning. I like NLP. I enjoy flying.
(Same counts table as shown earlier.)
Method 1: Dimensionality Reduction on X
Singular Value Decomposition of cooccurrence matrix X (same diagram as shown earlier); X̂ is the best rank-k approximation to X, in terms of least squares.
Problems with SVD
Computational cost scales quadratically for an n×m matrix: O(mn²) flops (when n < m)
Bad for millions of words or documents
Hard to incorporate new words or documents
Different learning regime than other DL models
Idea: Directly learn low-dimensional word vectors
•  Old idea. Relevant for this lecture & deep learning:
•  Learning representations by back-propagating errors (Rumelhart et al., 1986)
•  A neural probabilistic language model (Bengio et al., 2003)
•  NLP from Scratch (Collobert & Weston, 2008)
•  A recent and even simpler model: word2vec (Mikolov et al. 2013) - intro now
Main Idea of word2vec
•  Instead of capturing cooccurrence counts directly: predict surrounding words of every word
•  Both are quite similar; see "GloVe: Global Vectors for Word Representation" by Pennington et al. (2014)
•  Faster and can easily incorporate a new sentence/document or add a word to the vocabulary
Skip-Gram Model
Goal: Predicting surrounding context words given a center word
"jumped" (center word) → predict → {"The", "cat", "over", "the", "puddle"} (context)
How will it work?
How to learn W1 and W2?
Cross entropy for one-hot vectors: H(ŷ, y) = −Σⱼ yⱼ log(ŷⱼ) = −log(ŷᵢ), where i is the index where the correct word's one-hot vector is 1.
Intuition:
Perfect prediction: ŷᵢ = 1  ⇒  H(ŷ, y) = −1·log(1) = 0
Bad prediction: ŷᵢ = 0.01  ⇒  H(ŷ, y) = −1·log(0.01) ≈ 4.605
How to learn W1 and W2?
Objective function: Maximize the log probability of any context word given the current center word
(Naive Bayes assumption: the context words are treated as conditionally independent given the center word)
Softmax reminder:
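The formulas here were images on the slides; in the notation usually used in these lectures (center vectors v, taken from W1, and outside/context vectors u, taken from W2), the skip-gram objective and its softmax probability are commonly written as follows. This is a reconstruction from the standard formulation, not copied from the slide:

```latex
J(\theta) = \frac{1}{T}\sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p\!\left(w_{t+j} \mid w_t\right)
\quad \text{(maximize, with window size } c\text{)}

p(o \mid c) = \frac{\exp\!\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{|V|} \exp\!\left(u_w^{\top} v_c\right)}
```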
Softmax Reminder
Linear regression: the model parameters θ were trained to minimize a cost function J
Softmax: linear regression generalization to multi-class
Output: a K-dimensional vector (whose elements sum to 1)
How to optimize the function J?
Gradient descent is an algorithm that minimizes a function.
Given: a function defined by a set of parameters
Start with: an initial set of parameter values
Iteratively: move toward a set of parameter values that minimize the function
Gradient Descent - example
Goal of linear regression: fit a line (y = mx + b) to a set of points by minimizing the squared error between the line and the points.
Gradient Descent - example
Calculate gradients of the function to minimize:
Gradient Descent
2000 iterations
starting point: m = -1, b = 0
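A small, self-contained Python sketch of this example (fitting y = mx + b by gradient descent on the mean squared error; the data and learning rate are invented for illustration):

```python
import numpy as np

# Toy data roughly on the line y = 2x + 1, with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2 * x + 1 + rng.normal(scale=0.3, size=x.shape)

m, b = -1.0, 0.0          # starting point used on the slide
lr = 0.01                 # step size (learning rate)

for _ in range(2000):     # 2000 iterations, as on the slide
    y_hat = m * x + b
    error = y_hat - y
    # Gradients of the mean squared error J = mean((m*x + b - y)^2)
    grad_m = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    m -= lr * grad_m
    b -= lr * grad_b

print(f"m ≈ {m:.2f}, b ≈ {b:.2f}")   # should approach m = 2, b = 1
```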
How to calculate the gradient
On the board
Calculating all gradients!
•  We went through gradients for each center vector v in a window
•  We also need gradients for external vectors v' (u on the board)
•  Derive!
•  Generally, in each window we will compute updates for all parameters that are being used in that window.
•  For example, window size c = 1, sentence: "I like learning."
•  First window computes gradients for: internal vector v_like and external vectors v'_I and v'_learning
•  Next window in that sentence?
Compute all vector gradients!
•  We often define the set of ALL parameters in a model in terms of one long vector θ
•  In our case, with d-dimensional vectors and V many words, θ stacks the center and context vector of every word, so θ ∈ ℝ^(2dV)
Gradient Descent
•  To minimize J(θ) over the full batch (the entire training data) would require us to compute gradients for all windows
•  Updates would be for each element of θ: θⱼ(new) = θⱼ(old) − α · ∂J(θ)/∂θⱼ
•  With step size α
•  In matrix notation for all parameters: θ(new) = θ(old) − α · ∇θ J(θ)
Vanilla Gradient Descent Code
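The code on this slide is missing from the transcript; the vanilla full-batch update it showed is essentially a loop of "compute the full gradient, step against it". A runnable sketch with a toy quadratic objective standing in for the real corpus-level J (the target vector and step size are invented):

```python
import numpy as np

# Toy stand-in for the real objective: J(theta) = ||theta - target||^2,
# whose full-batch gradient is 2 * (theta - target).
target = np.array([1.0, -2.0, 0.5])

def evaluate_gradient(theta):
    return 2.0 * (theta - target)

theta = np.zeros(3)   # initial parameters
alpha = 0.1           # step size

# Vanilla (full-batch) gradient descent loop
for _ in range(100):
    theta_grad = evaluate_gradient(theta)
    theta = theta - alpha * theta_grad

print(theta)   # approaches `target`, the minimizer of the toy J
```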
Intuition
•  For a simple convex function over two parameters.
•  Contour lines show levels of the objective function
Stochastic Gradient Descent
•  But the corpus may have 40B tokens and windows
•  You would wait a very long time before making a single update!
•  Very bad idea for pretty much all neural nets!
•  Instead: we will update parameters after each window t
→ Stochastic gradient descent (SGD)
•  But in each window, we only have at most 2c + 1 words, so the gradient ∇θ Jt(θ) is very sparse!
Stochastic gradients with word vectors!
•  We may as well only update the word vectors that actually appear!
•  Solution: either keep around a hash for word vectors, or only update certain columns of the full embedding matrices L and L' (each of size d × |V|)
•  Important if you have millions of word vectors and do distributed computing, to not have to send gigantic updates around.
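A hedged sketch of this sparse-update idea: after an SGD step on one window, only the embedding columns of the words that actually appeared need to change. The matrix names L and Lprime and the gradient dictionaries are illustrative, not from the slides:

```python
import numpy as np

def sparse_sgd_update(L, Lprime, grads_center, grads_context, alpha=0.05):
    """Update only the columns of L (center vectors) and Lprime (context
    vectors) for word indices that appeared in the current window.

    grads_center / grads_context: dicts mapping word index -> gradient vector.
    """
    for idx, grad in grads_center.items():
        L[:, idx] -= alpha * grad
    for idx, grad in grads_context.items():
        Lprime[:, idx] -= alpha * grad
    return L, Lprime
```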
Negative Sampling
∇J(v_c) = u_o − Σ_{k=1..|V|} p(k|c) · u_k
Note that the summation over |V| is computationally huge!
Solution: Focus on mostly positive correlations.
For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples from a noise distribution Pₙ(w) whose probabilities match the ordering of the frequency of the vocabulary.
What needs to change in the skip-gram model? The objective function, the gradients, and the update rules.
Negative Sampling
Consider a pair (w, c) of word and context. Did this pair come from the training data?
Negative Sampling
W(1) and W(2)
A "false" or "negative" corpus has unnatural sentences that should get a low probability of ever occurring, e.g.: "stock boil fish is toy".
Intuition: train binary logistic regressions for a true pair (center word and word in its context window) and a couple of random pairs (the center word with a random word)
Negative Sampling
We can generate negatives on the fly by randomly sampling from the word bank. Our new objective function would then be as sketched below.
What is a good noise distribution Pₙ(w)?
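The objective the slide showed is most likely the skip-gram negative sampling objective of Mikolov et al. (2013); written out for one observed center/context pair (c, o), with K negatives drawn from the noise distribution Pₙ(w) (this is a reconstruction in the u/v notation above, not copied from the slide):

```latex
J = \log \sigma\!\left(u_o^{\top} v_c\right)
  + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \left[\log \sigma\!\left(-u_{w_k}^{\top} v_c\right)\right],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```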
Negative Sampling
What is a good Pₙ(w)? Empirically: the unigram distribution raised to the power of 3/4.
The power makes less frequent words be sampled more often:

Word          | Probability to be sampled for the "negative" set D
is            | 0.9^(3/4) ≈ 0.92
constitution  | 0.09^(3/4) ≈ 0.16
bombastic     | 0.01^(3/4) ≈ 0.032
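A short Python sketch of this unigram-to-the-3/4 noise distribution (the unigram counts are the toy numbers from the table; any word-frequency table would do; note the slide's numbers are the unnormalized weights):

```python
import numpy as np

# Toy unigram probabilities from the slide (a real table covers the whole vocabulary)
unigram = {"is": 0.9, "constitution": 0.09, "bombastic": 0.01}

words = list(unigram)
weights = np.array([unigram[w] ** 0.75 for w in words])  # the slide's unnormalized values
p_neg = weights / weights.sum()                          # renormalize into a distribution

for w, weight, p in zip(words, weights, p_neg):
    print(f"{w:>12}: weight {weight:.3f}, sampling prob {p:.3f}")

# Draw 5 negative samples from the noise distribution
print(np.random.choice(words, size=5, p=p_neg))
```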
Approximations Summary + PSet 1
•  With large vocabularies this objective function is not scalable and would train too slowly → Why?
•  Idea: approximate the normalization, or
•  Define a negative prediction that only samples a few words that do not appear in the context
•  Similar to focusing on mostly positive correlations
•  You will derive and implement this in PSet 1 (gradients and update rules)
What to do with the two sets of vectors?
•  We end up with L and L' from all the vectors v and v'
•  Both capture similar co-occurrence information. It turns out, the best solution is to simply sum them up: L_final = L + L'
•  One of many hyperparameters explored in GloVe: Global Vectors for Word Representation (Pennington et al. 2014)
Continuous Bag of Words Model (CBOW)
Goal: Predicting a center word from the surrounding context
{"The", "cat", "over", "the", "puddle"} (context) → predict → "jumped" (center word)
How will it work?
How to learn W1 and W2?
One-hot vectors, cross entropy: i is the index where the correct word's one-hot vector is 1.
Linear Relationships in word2vec
These representations are very good at encoding dimensions of similarity!
•  Analogies testing dimensions of similarity can be solved quite well just by doing vector subtraction in the embedding space
Syntactically:
•  x_apple − x_apples ≈ x_car − x_cars ≈ x_family − x_families
•  Similarly for verb and adjective morphological forms
Semantically (SemEval 2012 Task 2):
•  x_shirt − x_clothing ≈ x_chair − x_furniture
•  x_king − x_man ≈ x_queen − x_woman
Count based vs. direct prediction

Count based: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
•  Fast training
•  Efficient usage of statistics
•  Primarily used to capture word similarity
•  Disproportionate importance given to small counts

Direct prediction: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
•  Scales with corpus size
•  Inefficient usage of statistics
•  Generates improved performance on other tasks
•  Can capture complex patterns beyond word similarity
Combining the best of both worlds: GloVe
Main insight: Ratios of co-occurrence probabilities can encode meaning
How can we capture this behavior in the word vector space?
Log-bilinear model:
Vector differences:
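The log-bilinear model and the vector-difference property are not legible in the transcript; in the formulation of Pennington et al. (2014) they read as follows (a reconstruction from the GloVe paper, with word vectors w, context vectors w̃, biases b, b̃, and a weighting function f):

```latex
w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}
\qquad
J = \sum_{i,j=1}^{|V|} f\!\left(X_{ij}\right)\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^{2}

\text{Vector differences: } \quad \left(w_a - w_b\right)^{\top} \tilde{w}_x \;\approx\; \log \frac{P(x \mid a)}{P(x \mid b)}
```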
Combining the best of both worlds: GloVe
The log-bilinear model: fast training, scalable to huge corpora, and good performance even with a small corpus and small vectors.
GloVe results
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
[Figure: photos of litoria, rana, leptodactylidae, and eleutherodactylus]
Word Analogies
Test for linear relationships, examined by Mikolov et al. (2014)
a : b :: c : ?    man : woman :: king : ?
king [0.30 0.70] − man [0.20 0.20] + woman [0.60 0.30] = queen [0.70 0.80]
[Figure: 2D plot of king, queen, man, and woman showing the parallel man→woman and king→queen offsets]
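Using the toy 2-D vectors from the slide, the analogy can be answered by vector arithmetic plus a nearest-neighbor lookup. A minimal Python sketch (cosine similarity over a tiny made-up vocabulary):

```python
import numpy as np

# Toy 2-D embeddings from the slide
vecs = {
    "king":  np.array([0.30, 0.70]),
    "man":   np.array([0.20, 0.20]),
    "woman": np.array([0.60, 0.30]),
    "queen": np.array([0.70, 0.80]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# man : woman :: king : ?
target = vecs["king"] - vecs["man"] + vecs["woman"]

# Nearest neighbor among words not in the query
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)   # queen
```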
GloVe Visualizations
GloVe Visualizations: Company - CEO
GloVe Visualizations: Superlatives
Advantages of low dimensional word vectors
•  Word vectors will form the basis for all subsequent lectures.
•  All our semantic representations will be vectors!
•  We can compute compositional representations for longer phrases or sentences with them and solve lots of different tasks. → Next lecture!