CS 5614: (Big) Data Management Systems
B. Aditya Prakash
Lecture #1: Introduction
Prakash 2017

Data contains value and knowledge

Data and Business
§ Recommended links: +79% clicks vs. randomly selected
§ Personalized News Interests: +250% clicks vs. editorial one-size-fits-all
§ Top Searches: +43% clicks vs. editor selected
Source: A. Machhanavajjhala

Data and Science
§ Detecting influenza epidemics using search engine query data
http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
– Red: official numbers from Center for Disease Control and Prevention; weekly
– Black: based on Google search logs; daily (potentially instantaneously)

Data and Government
http://www.washingtonpost.com/opinions/obama-the-big-datapresident/2013/06/14/1d71fe2e-d391-11e2b05f-3ea3f0e7bb5a_story.html
http://www.washingtonpost.com/business/economy/democratspush-to-redeploy-obamas-voterdatabase/2012/11/20/d14793a4-2e83-11e2-89d4-040c9330702a_story.html
http://www.whitehouse.gov/blog/Democratizing-Data
http://www.theguardian.com/world/2013/jun/23/edward-snowden-nsa-files-timeline
Source: A. Machhanavajjhala

Data and Culture
• Word frequencies in English-language books in Google's database
http://blogs.plos.org/everyone/2013/03/20/what-are-you-inthe-mood-for-emotionaltrends-in-20th-century-books/
Source: A. Machhanavajjhala

Data and ____ ☜ your favorite subject
§ Sports
§ Journalism

Good news: Demand for Data Mining

How to extract value from data?
§ Manipulate Data
– CS, Domain expertise
§ Analyze Data
– Math, CS, Stat…
§ Communicate your results
– CS, Domain Expertise

Communication is important!
“The British government spends £13 billion a year on universities.”
– So?
– Try instead http://wheredoesmymoneygo.org/bubbletree-map.html#/~/total/education/university
“On average, 1 in every 15 Europeans is totally illiterate.”
– True
– But about 1 in every 14 is under 7 years old!
http://datajournalismhandbook.org/1.0/en/understanding_data_0.html

What is Data Mining?
§ Given lots of data
§ Discover patterns and models that are:
– Valid: hold on new data with some certainty
– Useful: should be possible to act on the item
– Unexpected: non-obvious to the system
– Understandable: humans should be able to interpret the pattern

Data Mining Tasks
§ Descriptive methods
– Find human-interpretable patterns that describe the data
• Example: Clustering
§ Predictive methods
– Use some variables to predict unknown or future values of other variables
• Example: Recommender systems

[Figure: big data in relation to Theory & Algo., Biology, Physics, Comp. Systems, ML & Stats., Social Science, and Econ.]

COURSE LOGISTICS

Course Information
§ Instructor: B. Aditya Prakash, Torgersen Hall 3160F, [email protected]
– Office Hours: TBD
– Include string CS 5614 in subject
§ Teaching Assistant: TBD
– Office Hours: TBD
§ Class Meeting Time: Tuesdays, Thursdays, 9:30-10:45am, McBryde Hall 226
§ Syllabus: Relational Database Systems, Big data Technologies (MR and new software stack), Streams, Recommendation Systems, Large Scale Machine Learning, and Graph Mining

Course Information
§ Keeping in Touch
– Course website http://www.cs.vt.edu/~badityap/classes/cs5614-Spr17/ (updated regularly through the semester)
– Piazza link on the website

Textbook
§ Required
Jure Leskovec, Anand Rajaraman and Jeffrey Ullman: Mining Massive Datasets (2nd). Cambridge University Press. 2010
Web page for the book (with FREE PDF!): www.mmds.org

Textbook
§ Recommended (for database internals)
Raghu Ramakrishnan and Johannes Gehrke: Database Management Systems (3rd Ed.). McGraw Hill.

Pre-reqs
(A) Should enjoy the course :)
(B) Background in
1. Algorithms
2. Probability and Stats
3. Undergraduate level Databases
4. Linear Algebra (helps)
5. Graph theory (helps)
(C) Graduate-level Programming Skills (i.e. ability to use unfamiliar software, picking up new languages, comfortable with at least one of Python/C/C++/Ruby/Java etc. (Matlab/R a plus))

Force-add
§ Talk to me once after class AND
§ Fill in this survey by 6pm EST today: https://goo.gl/forms/APfoI5CymKqKg0Pk1

Course Grading
§ Details coming soon (next lecture)
§ Broadly
– Some hws
– No midterm
– Take-home Final
– Project
– Class participation

Course Project
§ 2, or 3 (max) persons per project.
§ Major work for this class.
§ Pick your own topic
– You have to justify why the topic is interesting, and relevant to the course, and of suitable difficulty
§ Harder way:
– Joint projects with other courses are also negotiable. In that case, you will need the approval of the instructor, and you also need to clarify exactly what steps will be done for this course, as well as for the other course.
– Projects related to your dissertation/master-project are also possible, as long as there is no 'double-dipping', i.e., you clearly specify what the project will do, in addition to what you were planning to do for your thesis anyway.
§ Ask me if you need help and ideas (I may release a list of suitable topics later)

Course Project
§ Proposal
§ Milestone
§ Final Report
§ Poster Presentation (or in-class presentation TBD)

WARM-UP AND BASICS

Relational Databases: What we will cover (next 1 month)
§ Implementation
– What is under-the-hood of a DB like Oracle/MySQL?
§ Design
– How do you model your data and structure your information in a database?
§ Programming
– How do you use the capabilities of a DBMS?
§ Achieves a balance between
– a firm theoretical foundation to designing moderate-sized databases
– creating, querying, and implementing realistic databases and connecting them to applications

CS 4604: Course Outline
§ Weeks 1–4: Query/Manipulation Languages and Data Modeling
– Relational Algebra
– Data definition
– Programming with SQL
– Entity-Relationship (E/R) approach
– Specifying Constraints
– Good E/R design
§ Weeks 5–8: Indexes, Processing and Optimization
– Storing
– Hashing/Sorting
– Query Optimization
– NoSQL and Hadoop
§ Weeks 9–10: Relational Design
– Functional Dependencies
– Normalization to avoid redundancy
§ Weeks 11–12: Concurrency Control
– Transactions
– Logging and Recovery
§ Weeks 13–14: Students' choice
– Practice Problems
– XML
– Data mining and warehousing
We will go over all of this quickly! :)

What is the goal of a DBMS?
§ Electronic record-keeping
Fast and convenient access to information
§ DBMS == database management system
– 'Relational' in this class
– data + set of instructions to access/manipulate data

What is a DBMS?
§ Features of a DBMS
– Support massive amounts of data
– Persistent storage
– Efficient and convenient access
– Secure, concurrent, and atomic access
§ Examples?
– Search engines, banking systems, airline reservations, corporate records, payrolls, sales inventories.
– New applications: Wikis, social/biological/multimedia/scientific/geographic data, heterogeneous data.

Features of a DBMS
• Support massive amounts of data
– Giga/tera/petabytes
– Far too big for main memory
• Persistent storage
– Programs update, query, manipulate data.
– Data continues to live long after program finishes.
• Efficient and convenient access
– Efficient: do not search entire database to answer a query.
– Convenient: allow users to query the data as easily as possible.
• Secure, concurrent, and atomic access
– Allow multiple users to access database simultaneously.
– Allow a user access only to authorized data.
– Provide some guarantee of reliability against system failures.

Example Scenario
§ Students, taking classes, obtaining grades
– Find my GPA
– <and other ad-hoc queries>

Obvious solution 1: Folders
§ Advantages?
– Cheap; Easy-to-use
§ Disadvantages?
– No ad-hoc queries
– No sharing
– Large physical foot-print

Obvious Solution++
§ Flat files and C (C++, Java…) programs
– E.g. one (or more) UNIX/DOS files, with student records and their courses

Obvious Solution++
§ Layout for student records?
– CSV ('comma-separated-values')
Hermione Grainger,123,Potions,A
Draco Malfoy,111,Potions,B
Harry Potter,234,Potions,A
Ron Weasley,345,Potions,C

Obvious Solution++
§ Layout for student records?
– Other possibilities like two files, one for students and one for grades:
Hermione Grainger,123
Draco Malfoy,111
Harry Potter,234
Ron Weasley,345

123,Potions,A
111,Potions,B
234,Potions,A
345,Potions,C

Problems?
§ inconvenient access to data (need 'C++' expertise, plus knowledge of file-layout)
– data isolation
§ data redundancy (and inconsistencies)
§ integrity problems
§ atomicity problems
§ concurrent-access problems
§ security problems
§ …

Problems - Why?
§ Two main reasons:
– file-layout description is buried within the C programs and
– there is no support for transactions (concurrency and recovery)
DBMSs handle exactly these two problems

Example Scenario
§ RDBMS = "Relational" DBMS
§ The relational model uses relations or tables to structure data
§ ClassList relation:
Student            Course   Grade
Hermione Grainger  Potions  A
Draco Malfoy       Potions  B
Harry Potter       Potions  A
Ron Weasley        Potions  C
§ Relation separates the logical view (externals) from the physical view (internals)
§ Simple query languages (SQL) for accessing/modifying data
– Find all students whose grades are better than B.
– SELECT Student FROM ClassList WHERE Grade < 'B'
(letter grades compare alphabetically, so "better than B" means Grade < 'B')

DBMS Architecture
[Figure: DBMS architecture]

Transaction Processing
§ One or more database operations are grouped into a "transaction"
§ Transactions should meet the "ACID test"
– Atomicity: All-or-nothing execution of transactions.
– Consistency: Databases have consistency rules (e.g. what data is valid). A transaction should NOT violate the database's consistency. If it does, it needs to be rolled back.
– Isolation: Each transaction must appear to be executed as if no other transaction is executing at the same time.
– Durability: Any change a transaction makes to the database should persist and not be lost.
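
The ClassList query and the all-or-nothing behaviour of a transaction can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a full DBMS, which is an assumption of this example and not something the slides prescribe. The helper names (better_than_b, grade_of, regrade_all_or_nothing) and the student "Neville Longbottom" are hypothetical. Because letter grades compare alphabetically, "better than B" is written here as Grade < 'B'.

```python
import sqlite3

# In-memory database holding the ClassList relation from the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ClassList (Student TEXT, Course TEXT, Grade TEXT)")
conn.executemany(
    "INSERT INTO ClassList VALUES (?, ?, ?)",
    [
        ("Hermione Grainger", "Potions", "A"),
        ("Draco Malfoy", "Potions", "B"),
        ("Harry Potter", "Potions", "A"),
        ("Ron Weasley", "Potions", "C"),
    ],
)
conn.commit()


def better_than_b(conn):
    """Students whose grade is better than B ('A' sorts before 'B')."""
    cur = conn.execute(
        "SELECT Student FROM ClassList WHERE Grade < 'B' ORDER BY Student"
    )
    return [row[0] for row in cur]


def grade_of(conn, student):
    """Current grade of one student (hypothetical helper for this sketch)."""
    row = conn.execute(
        "SELECT Grade FROM ClassList WHERE Student = ?", (student,)
    ).fetchone()
    return row[0] if row else None


def regrade_all_or_nothing(conn, changes):
    """Apply every (student, grade) change in one transaction, or none."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for student, grade in changes:
                cur = conn.execute(
                    "UPDATE ClassList SET Grade = ? WHERE Student = ?",
                    (grade, student),
                )
                if cur.rowcount != 1:
                    raise ValueError(f"unknown student: {student}")
        return True
    except ValueError:
        return False
```

A call such as regrade_all_or_nothing(conn, [("Ron Weasley", "A"), ("Neville Longbottom", "A")]) fails on the second, unknown student, so the first update is rolled back as well and Ron's grade is left unchanged. A flat-file program would have to implement that rollback by hand.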

Disadvantages over (flat) files
§ Price
§ additional expertise (SQL/DBA)
(hence: over-kill for small, single-user data sets. But: mobile phones (e.g., Android) use SQLite)

A Brief History of DBMS
§ The earliest databases (1960s) evolved from file systems
– File systems
• Allow storage of large amounts of data over a long period of time
• File systems do not support:
– Efficient access of data items whose location in a particular file is not known
– Logical structure of data is limited to creation of directory structures
– Concurrent access: Multiple users modifying a single file generate non-uniform results
• Navigational and hierarchical
• User programmed the queries by walking from node to node in the DBMS.
§ Relational DBMS (1970s to now)
– View database in terms of relations or tables
– High-level query and definition languages such as SQL
– Allow user to specify what (s)he wants, not how to get what (s)he wants
§ Object-oriented DBMS (1980s)
– Inspired by object-oriented languages
– Object-relational DBMS

The DBMS Industry
§ A DBMS is a software system.
§ Major DBMS vendors: Oracle, Microsoft, IBM, Sybase
§ Free/Open-source DBMS: MySQL, PostgreSQL, Firebird.
– Used by companies such as Google, Yahoo, Lycos, BASF….
§ All are "relational" (or "object-relational") DBMS.
§ A multi-billion dollar industry

Fundamental concepts
§ 3-level architecture
§ logical data independence
§ physical data independence

3-level architecture
§ view level (e.g., views v1, v2, v3)
§ logical level
§ physical level

3-level architecture
§ view level
§ logical level: e.g., tables
– STUDENT(ssn, name)
– TAKES(ssn, cid, grade)
§ physical level:
– how are these tables stored, how many bytes/attribute etc.

3-level architecture
§ view level, e.g.:
– v1: select ssn from student
– v2: select ssn, cid from takes
§ logical level
§ physical level

3-level architecture
§ -> hence, physical and logical data independence:
§ logical D.I.:
– can add (drop) column; add/drop table
§ physical D.I.:
– can add index; change record order

Database users
§ 'naive' users
§ casual users
§ application programmers
§ [DBA (Database administrator)]

Casual users
[Figure: a casual user's query, select * from student, goes to the DBMS, which sits on top of the data and meta-data (= catalog)]

"Naive" users
Pictorially:
[Figure: an app. (e.g., report generator) talks to the DBMS, which sits on top of the data and meta-data (= catalog)]

App. programmers
§ those who write the applications (like the 'report generator')
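
The three levels and the two kinds of data independence from the 3-level architecture slides can be illustrated with a short sketch. It again assumes Python's built-in sqlite3 module as the DBMS. The sample rows, the email column, and the index name takes_by_ssn are hypothetical, and the view v2 uses the column name cid from the TAKES schema. The point is only that the views return the same answers before and after a logical change (adding a column) and a physical change (adding an index).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Logical level: the two tables from the slides, plus illustrative rows.
# View level: v1 and v2 as defined on the 3-level architecture slide.
conn.executescript(
    """
    CREATE TABLE student (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE takes (ssn TEXT, cid TEXT, grade TEXT);
    INSERT INTO student VALUES ('123', 'Hermione Grainger');
    INSERT INTO student VALUES ('345', 'Ron Weasley');
    INSERT INTO takes VALUES ('123', 'Potions', 'A');
    INSERT INTO takes VALUES ('345', 'Potions', 'C');
    CREATE VIEW v1 AS SELECT ssn FROM student;
    CREATE VIEW v2 AS SELECT ssn, cid FROM takes;
    """
)


def read_views(conn):
    """What a user of the view level sees."""
    v1 = conn.execute("SELECT ssn FROM v1 ORDER BY ssn").fetchall()
    v2 = conn.execute("SELECT ssn, cid FROM v2 ORDER BY ssn").fetchall()
    return v1, v2


before = read_views(conn)

# Logical data independence: add a column (hypothetical 'email') to a table.
conn.execute("ALTER TABLE student ADD COLUMN email TEXT")

# Physical data independence: add an index (hypothetical name) on takes.
conn.execute("CREATE INDEX takes_by_ssn ON takes (ssn)")

after = read_views(conn)
```

Programs written against v1 and v2 need no changes after either alteration, which is what the slides mean by logical and physical data independence.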

DB Administrator (DBA)
§ schema definition ('logical' level)
§ physical schema (storage structure, access methods)
§ schema modifications
§ granting authorizations
§ integrity constraint specification

Overall system architecture
§ [Users]
§ DBMS
– query processor
– storage manager
– transaction manager
§ [Files]

[Figure: overall system architecture. Users: naive, app. pgmr, casual, DBA. Query proc.: emb. DML, DML proc., DDL int., app. pgm (o), query eval. Storage mgr.: trans. mgr, buff. mgr, file mgr. Files: data, meta-data]

Overall system architecture
§ query processor
– DML compiler
– embedded DML pre-compiler
– DDL interpreter
– Query evaluation engine

Overall system architecture (cont'd)
§ storage manager
– authorization and integrity manager
– transaction manager
– buffer manager
– file manager

Overall system architecture (cont'd)
§ Files
– data files
– data dictionary = catalog (= meta-data)
– indices
– statistical data

Some examples:
§ DBA doing a DDL (data definition language) operation, e.g.,
create table student ...
[Figure: the system architecture diagram, repeated]

Some examples:
§ casual user, asking for an update, e.g.:
update student
set name = 'smith'
where ssn = '345'
[Figure: the system architecture diagram, repeated over several slides]

Some examples:
§ app. programmer, creating a report, e.g.
main(){
  ....
  exec sql "select * from student"
  ...
}
[Figure: the system architecture diagram, with the program source pgm (src) passing through the emb. DML pre-compiler]

Some examples:
§ 'naive' user, running the previous app.
[Figure: the system architecture diagram, repeated]

Conclusions
§ (relational) DBMSs: electronic record keepers
§ customize them with create table commands
§ ask SQL queries to retrieve info

Conclusions cont'd
main advantages over (flat) files & scripts:
§ logical + physical data independence (i.e., flexibility of adding new attributes, new tables and indices)
§ concurrency control and recovery
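
The three conclusions (create tables, then query with SQL, possibly from an application) can be put together in one small sketch of the 'report generator' example. It assumes Python's built-in sqlite3 module in place of the embedded-SQL C program on the slide. The function names open_demo_db and report, the sample rows, and the report layout are hypothetical. The casual user's update is written in the standard SQL form SET name = 'smith'.

```python
import sqlite3


def open_demo_db():
    """DBA step: define the schema (DDL) and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (ssn TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO student VALUES (?, ?)",
        [("123", "Hermione Grainger"), ("345", "Ron Weasley")],
    )
    conn.commit()
    return conn


def report(conn):
    """App. programmer step: the report generator built on select * from student."""
    lines = []
    for ssn, name in conn.execute("SELECT * FROM student ORDER BY ssn"):
        lines.append(f"{ssn}  {name}")
    return "\n".join(lines)


def main():
    conn = open_demo_db()
    # Casual user step: an ad-hoc update typed directly as SQL.
    conn.execute("UPDATE student SET name = 'smith' WHERE ssn = '345'")
    conn.commit()
    # 'Naive' user step: just run the application and read its output.
    text = report(conn)
    conn.close()
    return text


if __name__ == "__main__":
    print(main())
```

Each step corresponds to one of the user roles from the slides. None of them needs to know how the rows are laid out on disk, which is the main contrast with the flat-file programs discussed earlier.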