Simple Word Vector representations: word2vec, GloVe

Transcript
Simple Word Vector representations: word2vec, GloVe
Dr. Kira Radinsky
CTO, SalesPredict
Visiting Professor/Scientist, Technion
Slides were adapted from lectures by Richard Socher
How do we represent the meaning of a word?
Definition: Meaning (Webster dictionary)
•  the idea that is represented by a word, phrase, etc.
•  the idea that a person wants to express by using words, signs, etc.
•  the idea that is expressed in a work of writing, art, etc.
How to represent meaning in a computer?
Common answer: Use a taxonomy like WordNet that has hypernyms (is-a) relationships and synonym sets (good):
[Synset('procyonid.n.01'),
Synset('carnivore.n.01'),
Synset('placental.n.01'),
Synset('mammal.n.01'),
Synset('vertebrate.n.01'),
Synset('chordate.n.01'),
Synset('animal.n.01'),
Synset('organism.n.01'),
Synset('living_thing.n.01'),
Synset('whole.n.02'),
Synset('object.n.01'),
Synset('physical_entity.n.01'),
Synset('entity.n.01')]
S: (adj) full, good
S: (adj) estimable, good, honorable, respectable
S: (adj) beneficial, good
S: (adj) good, just, upright
S: (adj) adept, expert, good, practiced, proficient, skillful
S: (adj) dear, good, near
S: (adj) good, right, ripe
…
S: (adv) well, good
S: (adv) thoroughly, soundly, good
S: (n) good, goodness
S: (n) commodity, trade good, good
Problems with this discrete representation
•  Great as a resource but missing nuances, e.g. synonyms: adept, expert, good, practiced, proficient, skillful?
•  Missing new words (impossible to keep up to date): wicked, badass, nifty, crack, ace, wizard, genius, ninja
•  Subjective
•  Requires human labor to create and adapt
→ Hard to compute accurate word similarity!
Problems with this discrete representation
The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk
In vector space terms, this is a vector with one 1 and a lot of zeroes:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
We call this a "one-hot" representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
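To make the problem concrete, here is a minimal Python sketch (the vocabulary size and word positions are invented for illustration) showing that any two distinct one-hot vectors have a dot product of zero, so they encode no notion of similarity:

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a one-hot vector with a single 1 at position `index`."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Hypothetical vocabulary positions for two obviously related words
vocab_size = 15
motel = one_hot(10, vocab_size)
hotel = one_hot(7, vocab_size)

print(motel @ hotel)  # 0.0 -- one-hot vectors carry no similarity information
```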
Distributional similarity based representations
You can get a lot of value by representing a word by means of its neighbors
"You shall know a word by the company it keeps"
(J.R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP
government debt problems turning into banking crises as has happened in
saying that Europe needs unified banking regulation to replace the hodgepodge
" Thesewordswillrepresentbanking!
How to make neighbors represent words?
Answer: With a cooccurrence matrix X
•  2 options: full document vs. windows
•  Word-document cooccurrence matrix will give general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis"
•  Window allows us to capture both syntactic (POS) and semantic information!
Window based cooccurrence matrix
•  Window length 1 (more common: 5-10)
•  Symmetric (irrelevant whether left or right context)
•  Example corpus:
•  I like deep learning.
•  I like NLP.
•  I enjoy flying.
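A short sketch of how such a matrix could be built for this toy corpus (window length 1, symmetric); the tokenization and variable names are illustrative, not from the slides:

```python
from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1

vocab = []
counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)
    for i, center in enumerate(tokens):
        # symmetric window: `window` tokens to the left and to the right
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(center, tokens[j])] += 1

# Print the co-occurrence matrix row by row, in vocabulary order
for w in vocab:
    print(f"{w:>10}", [counts[(w, c)] for c in vocab])
```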
Window based cooccurrence matrix
•  Example corpus:
•  I like deep learning.
•  I like NLP.
•  I enjoy flying.

counts    | I | like | enjoy | deep | learning | NLP | flying | .
----------|---|------|-------|------|----------|-----|--------|---
I         | 0 |  2   |   1   |  0   |    0     |  0  |   0    | 0
like      | 2 |  0   |   0   |  1   |    0     |  1  |   0    | 0
enjoy     | 1 |  0   |   0   |  0   |    0     |  0  |   1    | 0
deep      | 0 |  1   |   0   |  0   |    1     |  0  |   0    | 0
learning  | 0 |  0   |   0   |  1   |    0     |  0  |   0    | 1
NLP       | 0 |  1   |   0   |  0   |    0     |  0  |   0    | 1
flying    | 0 |  0   |   1   |  0   |    0     |  0  |   0    | 1
.         | 0 |  0   |   0   |  0   |    1     |  1  |   1    | 0
Problems with simple cooccurrence vectors
Increase in size with vocabulary
Very high dimensional: requires a lot of storage
Subsequent classification models have sparsity issues
→ Models are less robust
Solution: Low dimensional vectors
•  Idea: store "most" of the important information in a fixed, small number of dimensions: a dense vector
•  Usually around 25–1000 dimensions
•  How to reduce the dimensionality?
Method 1: Dimensionality Reduction on X
Singular Value Decomposition of cooccurrence matrix X.
[Diagram: X (n×m) = U (n×r) · S (r×r diagonal matrix with singular values S1, S2, …, Sr) · Vᵀ (r×m); in the reduced version only the top k singular values are kept, giving X̂ = U (n×k) · S (k×k) · Vᵀ (k×m)]
X̂ is the best rank-k approximation to X, in terms of least squares.
[Figure: rank-k SVD approximations of a photo of Prof. Richard Feynman]
k = 1 (take the first singular vector)
k = 2
k = 10
k = 50
Simple SVD word vectors in Python
Corpus: I like deep learning. I like NLP. I enjoy flying.
Printing first two columns of U corresponding to the 2 biggest singular values
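The Python code on these slides did not survive the transcript; a minimal reconstruction of what it presumably did (NumPy SVD of the toy co-occurrence matrix above, then printing the first two columns of U) might look like this:

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
])

# SVD of the co-occurrence matrix; singular values come back in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# First two columns of U correspond to the 2 biggest singular values
for word, row in zip(words, U[:, :2]):
    print(f"{word:>10}: {row}")
```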
Word meaning is defined in terms of vectors
•  In all subsequent models, including deep learning models, a word is represented as a dense vector
linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
Hacks to X
•  Problem: function words (the, he, has) are too frequent - syntax has too much impact. Some fixes:
•  min(X, t), with t ≈ 100
•  Ignore them all
•  Ramped windows that count closer words more
•  Use Pearson correlations instead of counts, then set negative values to 0
•  +++
Interesting semantic patterns emerge in the vectors
[Figure: 2D projection of word vectors in which body parts (WRIST, ANKLE, SHOULDER, ARM, LEG, HAND, FOOT, HEAD, NOSE, FINGER, TOE, FACE, EAR, EYE, TOOTH), animals (DOG, CAT, PUPPY, KITTEN, COW, MOUSE, TURTLE, OYSTER, LION, BULL), and places (CHICAGO, ATLANTA, MONTREAL, NASHVILLE, TOKYO, CHINA, RUSSIA, AFRICA, ASIA, EUROPE, AMERICA, BRAZIL, MOSCOW, FRANCE, HAWAII) form separate clusters]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. 2005
Interesting semantic patterns emerge in the vectors
[Figure: 2D projection in which morphological forms of verbs cluster together: CHOOSE/CHOSE/CHOSEN/CHOOSING, STEAL/STOLE/STOLEN/STEALING, SPEAK/SPOKE/SPOKEN/SPEAKING, TAKE/TOOK/TAKEN/TAKING, THROW/THREW/THROWN/THROWING, SHOW/SHOWED/SHOWN/SHOWING, EAT/ATE/EATEN/EATING, GROW/GREW/GROWN/GROWING]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. 2005
Interesting semantic patterns emerge in the vectors
[Figure: 2D projection in which related nouns and verbs appear near each other: DRIVER-DRIVE, SWIMMER-SWIM, TEACHER-TEACH, STUDENT-LEARN, JANITOR-CLEAN, DOCTOR-TREAT, PRIEST-PRAY, BRIDE-MARRY]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. 2005
Problems with SVD
Computational cost scales quadratically for an n×m matrix: O(mn²) flops (when n < m)
→ Bad for millions of words or documents
Hard to incorporate new words or documents
Different learning regime than other DL models
More next lesson…
Quick Reminder
Language Models (Unigrams, Bigrams, etc.)
Create a model that will assign a probability to a sequence of tokens. For example, what is the probability of each of the following sentences:
"The cat jumped over the puddle."
vs.
"stock boil fish is toy."
i.e. the probability of any given sequence of n words
Language Models (Unigrams, Bigrams, etc.)
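The formulas on this slide were lost in the transcript; the standard unigram and bigram factorizations it presumably showed are:

```latex
\text{Unigram: } P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i)
\qquad
\text{Bigram: } P(w_1, w_2, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
```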
Window based cooccurrence matrix
•  Example corpus: I like deep learning. I like NLP. I enjoy flying.
(Same counts table as shown earlier.)
Method 1: Dimensionality Reduction on X
Singular Value Decomposition of cooccurrence matrix X (same diagram as shown earlier); X̂ is the best rank-k approximation to X, in terms of least squares.
Problems with SVD
Computational cost scales quadratically for an n×m matrix: O(mn²) flops (when n < m)
Bad for millions of words or documents
Hard to incorporate new words or documents
Different learning regime than other DL models
Idea: Directly learn low-dimensional word vectors
•  Old idea. Relevant for this lecture & deep learning:
•  Learning representations by back-propagating errors (Rumelhart et al., 1986)
•  A neural probabilistic language model (Bengio et al., 2003)
•  NLP from Scratch (Collobert & Weston, 2008)
•  A recent and even simpler model: word2vec (Mikolov et al. 2013) - intro now
Main Idea of word2vec
•  Instead of capturing cooccurrence counts directly: predict surrounding words of every word
•  Both are quite similar; see "GloVe: Global Vectors for Word Representation" by Pennington et al. (2014)
•  Faster and can easily incorporate a new sentence/document or add a word to the vocabulary
Skip-Gram Model
Goal: Predicting surrounding context words given a center word
"jumped" (center word) → predict → {"The", "cat", "over", "the", "puddle"} (context)
How will it work?
How to learn W1 and W2?
Cross entropy for one-hot vectors: H(ŷ, y) = −Σⱼ yⱼ log(ŷⱼ) = −log(ŷᵢ), where i is the index where the correct word's one-hot vector is 1.
Intuition:
Perfect prediction: ŷᵢ = 1  ⇒  H(ŷ, y) = −1·log(1) = 0
Bad prediction: ŷᵢ = 0.01  ⇒  H(ŷ, y) = −1·log(0.01) ≈ 4.605
How to learn W1 and W2?
Objective function: Maximize the log probability of any context word given the current center word
(Naive Bayes assumption: the context words are treated as conditionally independent given the center word)
Softmax reminder:
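The formulas here were images on the slides; in the notation usually used in these lectures (center vectors v, taken from W1, and outside/context vectors u, taken from W2), the skip-gram objective and its softmax probability are commonly written as follows. This is a reconstruction from the standard formulation, not copied from the slide:

```latex
J(\theta) = \frac{1}{T}\sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p\!\left(w_{t+j} \mid w_t\right)
\quad \text{(maximize, with window size } c\text{)}

p(o \mid c) = \frac{\exp\!\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{|V|} \exp\!\left(u_w^{\top} v_c\right)}
```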
Softmax Reminder
Linear regression: the model parameters θ were trained to minimize a cost function J
Softmax: linear regression generalization to multi-class
Output: a K-dimensional vector (whose elements sum to 1)
How to optimize the function J?
Gradient descent is an algorithm that minimizes a function.
Given: a function defined by a set of parameters
Start with: an initial set of parameter values
Iteratively: move toward a set of parameter values that minimize the function
Gradient Descent - example
Goal of linear regression: fit a line (y = mx + b) to a set of points by minimizing the squared error between the line and the points.
Gradient Descent - example
Calculate gradients of the function to minimize:
Gradient Descent
2000 iterations
starting point: m = -1, b = 0
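A small, self-contained Python sketch of this example (fitting y = mx + b by gradient descent on the mean squared error; the data and learning rate are invented for illustration):

```python
import numpy as np

# Toy data roughly on the line y = 2x + 1, with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2 * x + 1 + rng.normal(scale=0.3, size=x.shape)

m, b = -1.0, 0.0          # starting point used on the slide
lr = 0.01                 # step size (learning rate)

for _ in range(2000):     # 2000 iterations, as on the slide
    y_hat = m * x + b
    error = y_hat - y
    # Gradients of the mean squared error J = mean((m*x + b - y)^2)
    grad_m = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    m -= lr * grad_m
    b -= lr * grad_b

print(f"m ≈ {m:.2f}, b ≈ {b:.2f}")   # should approach m = 2, b = 1
```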
How to calculate the gradient
On the board
Calculating all gradients!
•  We went through gradients for each center vector v in a window
•  We also need gradients for external vectors v' (u on the board)
•  Derive!
•  Generally, in each window we will compute updates for all parameters that are being used in that window.
•  For example, window size c = 1, sentence: "I like learning."
•  First window computes gradients for: internal vector v_like and external vectors v'_I and v'_learning
•  Next window in that sentence?
Compute all vector gradients!
•  We often define the set of ALL parameters in a model in terms of one long vector θ
•  In our case, with d-dimensional vectors and V many words, θ stacks the center and context vector of every word, so θ ∈ ℝ^(2dV)
Gradient Descent
•  To minimize J(θ) over the full batch (the entire training data) would require us to compute gradients for all windows
•  Updates would be for each element of θ: θⱼ(new) = θⱼ(old) − α · ∂J(θ)/∂θⱼ
•  With step size α
•  In matrix notation for all parameters: θ(new) = θ(old) − α · ∇θ J(θ)
Vanilla Gradient Descent Code
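The code on this slide is missing from the transcript; the vanilla full-batch update it showed is essentially a loop of "compute the full gradient, step against it". A runnable sketch with a toy quadratic objective standing in for the real corpus-level J (the target vector and step size are invented):

```python
import numpy as np

# Toy stand-in for the real objective: J(theta) = ||theta - target||^2,
# whose full-batch gradient is 2 * (theta - target).
target = np.array([1.0, -2.0, 0.5])

def evaluate_gradient(theta):
    return 2.0 * (theta - target)

theta = np.zeros(3)   # initial parameters
alpha = 0.1           # step size

# Vanilla (full-batch) gradient descent loop
for _ in range(100):
    theta_grad = evaluate_gradient(theta)
    theta = theta - alpha * theta_grad

print(theta)   # approaches `target`, the minimizer of the toy J
```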
Intuition
•  For a simple convex function over two parameters.
•  Contour lines show levels of the objective function
Stochastic Gradient Descent
•  But the corpus may have 40B tokens and windows
•  You would wait a very long time before making a single update!
•  Very bad idea for pretty much all neural nets!
•  Instead: we will update parameters after each window t
→ Stochastic gradient descent (SGD)
•  But in each window, we only have at most 2c + 1 words, so the gradient ∇θ Jt(θ) is very sparse!
Stochastic gradients with word vectors!
•  We may as well only update the word vectors that actually appear!
•  Solution: either keep around a hash for word vectors, or only update certain columns of the full embedding matrices L and L' (each of size d × |V|)
•  Important if you have millions of word vectors and do distributed computing, to not have to send gigantic updates around.
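A hedged sketch of this sparse-update idea: after an SGD step on one window, only the embedding columns of the words that actually appeared need to change. The matrix names L and Lprime and the gradient dictionaries are illustrative, not from the slides:

```python
import numpy as np

def sparse_sgd_update(L, Lprime, grads_center, grads_context, alpha=0.05):
    """Update only the columns of L (center vectors) and Lprime (context
    vectors) for word indices that appeared in the current window.

    grads_center / grads_context: dicts mapping word index -> gradient vector.
    """
    for idx, grad in grads_center.items():
        L[:, idx] -= alpha * grad
    for idx, grad in grads_context.items():
        Lprime[:, idx] -= alpha * grad
    return L, Lprime
```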
Negative Sampling
∇J(v_c) = u_o − Σ_{k=1..|V|} p(k|c) · u_k
Note that the summation over |V| is computationally huge!
Solution: Focus on mostly positive correlations.
For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples from a noise distribution Pₙ(w) whose probabilities match the ordering of the frequency of the vocabulary.
What needs to change in the skip-gram model? The objective function, the gradients, and the update rules.
Negative Sampling
Consider a pair (w, c) of word and context. Did this pair come from the training data?
Negative Sampling
W(1) and W(2)
A "false" or "negative" corpus has unnatural sentences that should get a low probability of ever occurring, e.g.: "stock boil fish is toy".
Intuition: train binary logistic regressions for a true pair (center word and word in its context window) and a couple of random pairs (the center word with a random word)
Negative Sampling
We can generate negatives on the fly by randomly sampling from the word bank. Our new objective function would then be as sketched below.
What is a good noise distribution Pₙ(w)?
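The objective the slide showed is most likely the skip-gram negative sampling objective of Mikolov et al. (2013); written out for one observed center/context pair (c, o), with K negatives drawn from the noise distribution Pₙ(w) (this is a reconstruction in the u/v notation above, not copied from the slide):

```latex
J = \log \sigma\!\left(u_o^{\top} v_c\right)
  + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \left[\log \sigma\!\left(-u_{w_k}^{\top} v_c\right)\right],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```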
Negative Sampling
What is a good Pₙ(w)? Empirically: the unigram distribution raised to the power of 3/4.
The power makes less frequent words be sampled more often:

Word          | Probability to be sampled for the "negative" set D
is            | 0.9^(3/4) ≈ 0.92
constitution  | 0.09^(3/4) ≈ 0.16
bombastic     | 0.01^(3/4) ≈ 0.032
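A short Python sketch of this unigram-to-the-3/4 noise distribution (the unigram counts are the toy numbers from the table; any word-frequency table would do; note the slide's numbers are the unnormalized weights):

```python
import numpy as np

# Toy unigram probabilities from the slide (a real table covers the whole vocabulary)
unigram = {"is": 0.9, "constitution": 0.09, "bombastic": 0.01}

words = list(unigram)
weights = np.array([unigram[w] ** 0.75 for w in words])  # the slide's unnormalized values
p_neg = weights / weights.sum()                          # renormalize into a distribution

for w, weight, p in zip(words, weights, p_neg):
    print(f"{w:>12}: weight {weight:.3f}, sampling prob {p:.3f}")

# Draw 5 negative samples from the noise distribution
print(np.random.choice(words, size=5, p=p_neg))
```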
Approximations Summary + PSet 1
•  With large vocabularies this objective function is not scalable and would train too slowly → Why?
•  Idea: approximate the normalization, or
•  Define a negative prediction that only samples a few words that do not appear in the context
•  Similar to focusing on mostly positive correlations
•  You will derive and implement this in PSet 1 (gradients and update rules)
What to do with the two sets of vectors?
•  We end up with L and L' from all the vectors v and v'
•  Both capture similar co-occurrence information. It turns out, the best solution is to simply sum them up: L_final = L + L'
•  One of many hyperparameters explored in GloVe: Global Vectors for Word Representation (Pennington et al. 2014)
Continuous Bag of Words Model (CBOW)
Goal: Predicting a center word from the surrounding context
{"The", "cat", "over", "the", "puddle"} (context) → predict → "jumped" (center word)
How will it work?
How to learn W1 and W2?
One-hot vectors, cross entropy: i is the index where the correct word's one-hot vector is 1.
Linear Relationships in word2vec
These representations are very good at encoding dimensions of similarity!
•  Analogies testing dimensions of similarity can be solved quite well just by doing vector subtraction in the embedding space
Syntactically:
•  x_apple − x_apples ≈ x_car − x_cars ≈ x_family − x_families
•  Similarly for verb and adjective morphological forms
Semantically (SemEval 2012 Task 2):
•  x_shirt − x_clothing ≈ x_chair − x_furniture
•  x_king − x_man ≈ x_queen − x_woman
Count based vs. direct prediction

Count based: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
•  Fast training
•  Efficient usage of statistics
•  Primarily used to capture word similarity
•  Disproportionate importance given to small counts

Direct prediction: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
•  Scales with corpus size
•  Inefficient usage of statistics
•  Generates improved performance on other tasks
•  Can capture complex patterns beyond word similarity
Combining the best of both worlds: GloVe
Main insight: Ratios of co-occurrence probabilities can encode meaning
How can we capture this behavior in the word vector space?
Log-bilinear model:
Vector differences:
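The log-bilinear model and the vector-difference property are not legible in the transcript; in the formulation of Pennington et al. (2014) they read as follows (a reconstruction from the GloVe paper, with word vectors w, context vectors w̃, biases b, b̃, and a weighting function f):

```latex
w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}
\qquad
J = \sum_{i,j=1}^{|V|} f\!\left(X_{ij}\right)\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^{2}

\text{Vector differences: } \quad \left(w_a - w_b\right)^{\top} \tilde{w}_x \;\approx\; \log \frac{P(x \mid a)}{P(x \mid b)}
```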
Combining the best of both worlds: GloVe
The log-bilinear model: fast training, scalable to huge corpora, and good performance even with a small corpus and small vectors.
GloVe results
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
[Figure: photos of litoria, rana, leptodactylidae, and eleutherodactylus]
Word Analogies
Test for linear relationships, examined by Mikolov et al. (2014)
a : b :: c : ?    man : woman :: king : ?
king [0.30 0.70] − man [0.20 0.20] + woman [0.60 0.30] = queen [0.70 0.80]
[Figure: 2D plot of king, queen, man, and woman showing the parallel man→woman and king→queen offsets]
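Using the toy 2-D vectors from the slide, the analogy can be answered by vector arithmetic plus a nearest-neighbor lookup. A minimal Python sketch (cosine similarity over a tiny made-up vocabulary):

```python
import numpy as np

# Toy 2-D embeddings from the slide
vecs = {
    "king":  np.array([0.30, 0.70]),
    "man":   np.array([0.20, 0.20]),
    "woman": np.array([0.60, 0.30]),
    "queen": np.array([0.70, 0.80]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# man : woman :: king : ?
target = vecs["king"] - vecs["man"] + vecs["woman"]

# Nearest neighbor among words not in the query
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)   # queen
```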
GloVe Visualizations
GloVe Visualizations: Company - CEO
GloVe Visualizations: Superlatives
Advantages of low dimensional word vectors
•  Word vectors will form the basis for all subsequent lectures.
•  All our semantic representations will be vectors!
•  We can compute compositional representations for longer phrases or sentences with them and solve lots of different tasks. → Next lecture!