CS 5614: (Big) Data Management Systems
B. Aditya Prakash
Lecture #1: Introduction
Prakash 2017

Data contains value and knowledge

Data and Business
§  Recommended links: +79% clicks
§  Personalized News Interests: +250% clicks
§  Top Searches: +43% clicks
vs. randomly selected; vs. editorial one-size-fits-all; vs. editor selected
Source: A. Machhanavajjhala

Data and Science
Red: official numbers from Centers for Disease Control and Prevention; weekly
Black: based on Google search logs; daily (potentially instantaneously)
Detecting influenza epidemics using search engine query data
http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html

Data and Government
http://www.washingtonpost.com/opinions/obama-the-big-data-president/2013/06/14/1d71fe2e-d391-11e2-b05f-3ea3f0e7bb5a_story.html
http://www.washingtonpost.com/business/economy/democrats-push-to-redeploy-obamas-voter-database/2012/11/20/d14793a4-2e83-11e2-89d4-040c9330702a_story.html
http://www.whitehouse.gov/blog/Democratizing-Data
http://www.theguardian.com/world/2013/jun/23/edward-snowden-nsa-files-timeline
Source: A. Machhanavajjhala

Data and Culture
•  Word frequencies in English-language books in Google’s database
http://blogs.plos.org/everyone/2013/03/20/what-are-you-in-the-mood-for-emotional-trends-in-20th-century-books/
Source: A. Machhanavajjhala

Data and ____ ☜ your favorite subject
Sports
Journalism

Good news: Demand for Data Mining

How to extract value from data?
§  Manipulate Data
–  CS, Domain expertise
§  Analyze Data
–  Math, CS, Stat…
§  Communicate your results
–  CS, Domain Expertise

Communication is important!
“The British government spends £13 billion a year on universities.”
–  So?
–  Try instead http://wheredoesmymoneygo.org/bubbletree-map.html#/~/total/education/university
“On average, 1 in every 15 Europeans is totally illiterate.”
–  True
–  But about 1 in every 14 is under 7 years old!
http://datajournalismhandbook.org/1.0/en/understanding_data_0.html

What is Data Mining?
§  Given lots of data
§  Discover patterns and models that are:
–  Valid: hold on new data with some certainty
–  Useful: should be possible to act on the item
–  Unexpected: non-obvious to the system
–  Understandable: humans should be able to interpret the pattern

Data Mining Tasks
§  Descriptive methods
–  Find human-interpretable patterns that describe the data
•  Example: Clustering
§  Predictive methods
–  Use some variables to predict unknown or future values of other variables
•  Example: Recommender systems

[Figure: Big data, with labels Theory & Algo., Comp. Systems, ML & Stats., Biology, Physics, Social Science, Econ.]

COURSE LOGISTICS

Course Information
§  Instructor
B. Aditya Prakash, Torgersen Hall 3160F, [email protected]
–  Office Hours: TBD
–  Include string CS 5614 in subject
§  Teaching Assistant
TBD
–  Office Hours: TBD
§  Class Meeting Time
Tuesdays, Thursdays, 9:30-10:45am, McBryde Hall 226
§  Syllabus: Relational Database Systems, Big data Technologies (MR and new software stack), Streams, Recommendation Systems, Large Scale Machine Learning, and Graph Mining

Course Information
§  Keeping in Touch
Course website
http://www.cs.vt.edu/~badityap/classes/cs5614-Spr17/
updated regularly through the semester
–  Piazza link on the website

Textbook
§  Required
Jure Leskovec, Anand Rajaraman and Jeffrey Ullman: Mining of Massive Datasets (2nd). Cambridge University Press. 2010
Web page for the book (with FREE PDF!)
www.mmds.org

Textbook
§  Recommended (for database internals)
Raghu Ramakrishnan and Johannes Gehrke: Database Management Systems (3rd Ed.). McGraw Hill.

Pre-reqs
(A) Should enjoy the course :)
(B) Background in
1. Algorithms
2. Probability and Stats
3. Undergraduate level Databases
4. Linear Algebra (helps)
5. Graph theory (helps)
(C) Graduate-level Programming Skills (i.e. ability to use unfamiliar software, picking up new languages, comfortable with at least one of Python/C/C++/Ruby/Java etc. (Matlab/R a plus))

Force-add
§  Talk to me once after class
AND
§  Fill in this survey by 6pm EST today
https://goo.gl/forms/APfoI5CymKqKg0Pk1

Course Grading
§  Details coming soon (next lecture)
§  Broadly
–  Some hws
–  No midterm
–  Take-home Final
–  Project
–  Class participation

Course Project
§  2, or 3 (max) persons per project.
§  Major work for this class.
§  Pick your own topic
–  You have to justify why the topic is interesting, and relevant to the course, and of suitable difficulty
§  Harder way:
–  Joint projects with other courses are also negotiable. In that case, you will need the approval of the instructor, and you also need to clarify exactly what steps will be done for this course, as well as for the other course.
–  Projects related to your dissertation/master-project are also possible, as long as there is no 'double-dipping', i.e., you clearly specify what the project will do, in addition to what you were planning to do for your thesis anyway.
§  Ask me if you need help and ideas (I may release a list of suitable topics later)

Course Project
§  Proposal
§  Milestone
§  Final Report
§  Poster Presentation (or in-class presentation TBD)

WARM-UP AND BASICS

Relational Databases: What we will cover (next 1 month)
§  Implementation
–  What is under-the-hood of a DB like Oracle/MySQL?
§  Design
–  How do you model your data and structure your information in a database?
§  Programming
–  How do you use the capabilities of a DBMS?
§  Achieves a balance between
–  a firm theoretical foundation to designing moderate-sized databases
–  creating, querying, and implementing realistic databases and connecting them to applications

CS 4604: Course Outline
§  Weeks 1–4: Query/Manipulation Languages and Data Modeling
–  Relational Algebra
–  Data definition
–  Programming with SQL
–  Entity-Relationship (E/R) approach
–  Specifying Constraints
–  Good E/R design
§  Weeks 5–8: Indexes, Processing and Optimization
–  Storing
–  Hashing/Sorting
–  Query Optimization
–  NoSQL and Hadoop
§  Weeks 9–10: Relational Design
–  Functional Dependencies
–  Normalization to avoid redundancy
§  Weeks 11–12: Concurrency Control
–  Transactions
–  Logging and Recovery
§  Weeks 13–14: Students’ choice
–  Practice Problems
–  XML
–  Data mining and warehousing
We will go over all of this quickly! :)

What is the goal of a DBMS?
§  Electronic record-keeping
Fast and convenient access to information
§  DBMS == database management system
–  `Relational’ in this class
–  data + set of instructions to access/manipulate data

What is a DBMS?
§  Features of a DBMS
–  Support massive amounts of data
–  Persistent storage
–  Efficient and convenient access
–  Secure, concurrent, and atomic access
§  Examples?
–  Search engines, banking systems, airline reservations, corporate records, payrolls, sales inventories.
–  New applications: Wikis, social/biological/multimedia/scientific/geographic data, heterogeneous data.

Features of a DBMS
•  Support massive amounts of data
–  Giga/tera/petabytes
–  Far too big for main memory
•  Persistent storage
–  Programs update, query, manipulate data.
–  Data continues to live long after program finishes.
•  Efficient and convenient access
–  Efficient: do not search entire database to answer a query.
–  Convenient: allow users to query the data as easily as possible.
•  Secure, concurrent, and atomic access
–  Allow multiple users to access database simultaneously.
–  Allow a user access only to authorized data.
–  Provide some guarantee of reliability against system failures.

Example Scenario
§  Students, taking classes, obtaining grades
–  Find my GPA
–  <and other ad-hoc queries>

Obvious solution 1: Folders
§  Advantages?
–  Cheap; Easy-to-use
§  Disadvantages?
–  No ad-hoc queries
–  No sharing
–  Large physical foot-print

Obvious Solution++
§  Flat files and C (C++, Java…) programs
–  E.g. one (or more) UNIX/DOS files, with student records and their courses

Obvious Solution++
§  Layout for student records?
–  CSV (‘comma-separated-values’)
Hermione Grainger,123,Potions,A
Draco Malfoy,111,Potions,B
Harry Potter,234,Potions,A
Ron Weasley,345,Potions,C

Obvious Solution++
§  Layout for student records?
–  Other possibilities like
Hermione Grainger,123
Draco Malfoy,111
Harry Potter,234
Ron Weasley,345
and, in a second file,
123,Potions,A
111,Potions,B
234,Potions,A
345,Potions,C

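To make the flat-file approach concrete, here is a minimal sketch of a program that answers one ad-hoc question over the two-file layout. It uses Python rather than C/C++/Java purely for brevity. The inline strings stand in for files on disk, and the function name `students_with_grade` is hypothetical. Note how the column positions are hard-coded into the program.

```python
import csv
import io

# Inline strings stand in for the two flat files (assumed layout).
STUDENTS = """Hermione Grainger,123
Draco Malfoy,111
Harry Potter,234
Ron Weasley,345
"""
GRADES = """123,Potions,A
111,Potions,B
234,Potions,A
345,Potions,C
"""


def students_with_grade(students_text, grades_text, course, grade):
    """Return names of students who got `grade` in `course`."""
    # The program must know the layout: column 0 = name, column 1 = id.
    name_by_id = {row[1]: row[0] for row in csv.reader(io.StringIO(students_text))}
    names = []
    # Second file: column 0 = id, column 1 = course, column 2 = grade.
    for sid, c, g in csv.reader(io.StringIO(grades_text)):
        if c == course and g == grade:
            names.append(name_by_id[sid])
    return names


print(students_with_grade(STUDENTS, GRADES, "Potions", "A"))
```

Every new question (a GPA, a class roster, a grade histogram) needs another function like this one, and any change to the file layout breaks all of them.
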
Problems?
§  inconvenient access to data (need ‘C++’ expertise, plus knowledge of file-layout)
–  data isolation
§  data redundancy (and inconsistencies)
§  integrity problems
§  atomicity problems
§  concurrent-access problems
§  security problems
§  …

Problems - Why?
§  Two main reasons:
–  file-layout description is buried within the C programs and
–  there is no support for transactions (concurrency and recovery)
DBMSs handle exactly these two problems

Example Scenario
§  RDBMS = “Relational” DBMS
§  The relational model uses relations or tables to structure data
§  ClassList relation:
Student            Course   Grade
Hermione Grainger  Potions  A
Draco Malfoy       Potions  B
Harry Potter       Potions  A
Ron Weasley        Potions  C
§  Relation separates the logical view (externals) from the physical view (internals)
§  Simple query languages (SQL) for accessing/modifying data
–  Find all students whose grades are better than B.
–  SELECT Student FROM ClassList WHERE Grade < 'B'

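The ClassList query can be tried end to end. This is a minimal sketch using Python's built-in sqlite3 module; the choice of SQLite and the in-memory database are assumptions for illustration, not something the course prescribes. Because the letter grades are stored as text and 'A' sorts before 'B', "better than B" is written as Grade < 'B'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE ClassList (Student TEXT, Course TEXT, Grade TEXT)")
conn.executemany(
    "INSERT INTO ClassList VALUES (?, ?, ?)",
    [
        ("Hermione Grainger", "Potions", "A"),
        ("Draco Malfoy", "Potions", "B"),
        ("Harry Potter", "Potions", "A"),
        ("Ron Weasley", "Potions", "C"),
    ],
)

# Letter grades are text, and 'A' sorts before 'B',
# so "better than B" is Grade < 'B'.
better_than_b = [
    row[0]
    for row in conn.execute(
        "SELECT Student FROM ClassList WHERE Grade < 'B' ORDER BY Student"
    )
]
print(better_than_b)
```

The query says what is wanted and nothing about how the rows are laid out on disk, which is the contrast with the flat-file programs.
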
DBMS Architecture

Transaction Processing
§  One or more database operations are grouped into a “transaction”
§  Transactions should meet the “ACID test”
–  Atomicity: All-or-nothing execution of transactions.
–  Consistency: Databases have consistency rules (e.g. what data is valid). A transaction should NOT violate the database’s consistency. If it does, it needs to be rolled back.
–  Isolation: Each transaction must appear to be executed as if no other transaction is executing at the same time.
–  Durability: Any change a transaction makes to the database should persist and not be lost.

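Atomicity and consistency can be observed in a few lines. The sketch below uses Python's built-in sqlite3 module and a hypothetical `account` table (the banking setting, the table, and the `transfer` helper are all assumptions for illustration). A CHECK constraint plays the role of the consistency rule, and a transfer that would violate it is rolled back as a whole, including the half that had already succeeded. Isolation and durability are not exercised here, since the database is in-memory and single-user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # a consistency rule
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK failed; the credit above was rolled back too


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
print(ok, failed, balances(conn))
```

After the failed transfer, bob's balance does not include the 500 that was briefly credited, which is the all-or-nothing behavior the Atomicity bullet describes.
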
Disadvantages over (flat) files?

Disadvantages over (flat) files
§  Price
§  additional expertise (SQL/DBA)
(hence: overkill for small, single-user data sets. But: mobile phones (e.g., Android) use SQLite)

A Brief History of DBMS
§  The earliest databases (1960s) evolved from file systems
–  File systems
•  Allow storage of large amounts of data over a long period of time
•  File systems do not support:
–  Efficient access of data items whose location in a particular file is not known
–  Logical structure of data is limited to creation of directory structures
–  Concurrent access: Multiple users modifying a single file generate non-uniform results
•  Navigational and hierarchical
•  User programmed the queries by walking from node to node in the DBMS.
§  Relational DBMS (1970s to now)
–  View database in terms of relations or tables
–  High-level query and definition languages such as SQL
–  Allow user to specify what (s)he wants, not how to get what (s)he wants
§  Object-oriented DBMS (1980s)
–  Inspired by object-oriented languages
–  Object-relational DBMS

The DBMS Industry
§  A DBMS is a software system.
§  Major DBMS vendors: Oracle, Microsoft, IBM, Sybase
§  Free/Open-source DBMS: MySQL, PostgreSQL, Firebird.
–  Used by companies such as Google, Yahoo, Lycos, BASF….
§  All are “relational” (or “object-relational”) DBMS.
§  A multi-billion dollar industry

Fundamental concepts
§  3-level architecture
§  logical data independence
§  physical data independence

3-level architecture
§  view level (views v1, v2, v3 in the figure)
§  logical level
§  physical level

3-level architecture
§  view level
§  logical level: e.g., tables
–  STUDENT(ssn, name)
–  TAKES(ssn, cid, grade)
§  physical level:
–  how are these tables stored, how many bytes/attribute etc.

3-level architecture
§  view level, e.g.:
–  v1: select ssn from student
–  v2: select ssn, cid from takes
§  logical level
§  physical level

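The view-level examples can be tried directly. This is a minimal sketch using Python's built-in sqlite3 module (SQLite is an assumption for illustration; any relational DBMS would serve). The sample rows are made up, and the column types are assumptions, since the schema above gives only attribute names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- logical level: the two tables STUDENT and TAKES
    CREATE TABLE student (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE takes (ssn TEXT, cid TEXT, grade TEXT);
    INSERT INTO student VALUES ('123', 'Hermione Grainger'), ('345', 'Ron Weasley');
    INSERT INTO takes VALUES ('123', 'Potions', 'A'), ('345', 'Potions', 'C');

    -- view level: each view exposes only part of the logical schema
    CREATE VIEW v1 AS SELECT ssn FROM student;
    CREATE VIEW v2 AS SELECT ssn, cid FROM takes;
    """
)

v1_rows = conn.execute("SELECT * FROM v1 ORDER BY ssn").fetchall()
v2_rows = conn.execute("SELECT * FROM v2 ORDER BY ssn").fetchall()
print(v1_rows)
print(v2_rows)
```

A user who is given only v2 can see who takes which course but never sees a grade, even though the grades sit in the same underlying table.
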
3-level architecture
§  -> hence, physical and logical data independence:
§  logical D.I.:
–  ???
§  physical D.I.:
–  ???

3-level architecture
§  -> hence, physical and logical data independence:
§  logical D.I.:
–  can add (drop) column; add/drop table
§  physical D.I.:
–  can add index; change record order

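Both kinds of data independence can be checked with a small experiment: run a query, change the schema and the storage structures, and run the same query text again. This is a minimal sketch with Python's built-in sqlite3 module; the `email` column and the index name `idx_student_ssn` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (ssn TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("123", "Hermione Grainger"), ("345", "Ron Weasley")],
)

QUERY = "SELECT name FROM student WHERE ssn = '345'"  # the "application"
before = conn.execute(QUERY).fetchall()

# Logical change: add a column (hypothetical `email` attribute).
conn.execute("ALTER TABLE student ADD COLUMN email TEXT")
# Physical change: add an index (hypothetical name `idx_student_ssn`).
conn.execute("CREATE INDEX idx_student_ssn ON student (ssn)")

after = conn.execute(QUERY).fetchall()
print(before == after, after)
```

The query text never changed. With flat files, adding a column or reordering records would typically mean editing every program that parses the file.
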
Database users
§  ‘naive’ users
§  casual users
§  application programmers
§  [DBA (Database administrator)]

Casual users
select * from student
[Figure: query -> DBMS -> data and meta-data (= catalog)]

‘Naive’ users
Pictorially:
[Figure: app. (e.g., report generator) -> DBMS -> data and meta-data (= catalog)]

App. programmers
§  those who write the applications (like the ‘report generator’)

DB Administrator (DBA)
§  Duties?

DB Administrator (DBA)
§  schema definition (‘logical’ level)
§  physical schema (storage structure, access methods)
§  schema modifications
§  granting authorizations
§  integrity constraint specification

Overall system architecture
§  [Users]
§  DBMS
–  query processor
–  storage manager
–  transaction manager
§  [Files]

[Figure: overall system architecture. Labels: users (naive, app. pgmr, casual, DBA); emb. DML, DML proc., DDL int., app. pgm (o), query eval. (query proc.); trans. mgr, buff. mgr, file mgr (storage mgr.); data, meta-data]

Overall system architecture
§  query processor
–  DML compiler
–  embedded DML pre-compiler
–  DDL interpreter
–  Query evaluation engine

Overall system architecture (cont’d)
§  storage manager
–  authorization and integrity manager
–  transaction manager
–  buffer manager
–  file manager

Overall system architecture (cont’d)
§  Files
–  data files
–  data dictionary = catalog (= meta-data)
–  indices
–  statistical data

Some examples:
§  DBA doing a DDL (data definition language) operation, e.g.,
create table student ...

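A DDL operation changes the catalog rather than the data. The sketch below, using Python's built-in sqlite3 module, creates a table and then reads the catalog back. The column list is an assumption (the example above elides it), and `sqlite_master` is specific to SQLite; other systems expose their catalogs under different names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL operation: define a new table (column list is an assumption;
# the original example elides it with "...").
conn.execute("CREATE TABLE student (ssn TEXT PRIMARY KEY, name TEXT)")

# SQLite keeps its catalog (meta-data) in the built-in sqlite_master table.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE name = 'student'"
).fetchall()
print(catalog)
```

No rows of student data exist yet; the only thing the statement wrote is the description of the table.
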
[Figure: system architecture diagram, repeated]

Some examples:
§  casual user, asking for an update, e.g.:
update student
set name = 'smith'
where ssn = '345'

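The same update can be issued from a program. A minimal sketch with Python's built-in sqlite3 module follows; the starting row is made up so that there is something to update, and the `?` placeholders are sqlite3's parameter syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (ssn TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO student VALUES ('345', 'Ron Weasley')")

# Standard SQL spells the assignment with '=': SET name = ...
# The ? placeholders keep the values separate from the SQL text.
cur = conn.execute("UPDATE student SET name = ? WHERE ssn = ?", ("smith", "345"))
conn.commit()

updated_rows = cur.rowcount
new_name = conn.execute("SELECT name FROM student WHERE ssn = '345'").fetchone()[0]
print(updated_rows, new_name)
```

The statement names the row by a condition on its contents (ssn = '345'), not by its position in a file.
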
[Figure: system architecture diagram, repeated over three slides]

Some examples:
§  app. programmer, creating a report, e.g.
main(){
....
exec sql “select * from student”
...
}

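A present-day counterpart of the embedded-SQL program is a host-language program that sends SQL through a database driver. The sketch below uses Python's built-in sqlite3 module instead of a pre-compiled `exec sql` statement; the function name `student_report`, the sample rows, and the output format are all hypothetical, and the query lists its columns explicitly rather than using `select *`.

```python
import sqlite3


def student_report(conn):
    """Return one formatted line per row of the student table."""
    rows = conn.execute("SELECT ssn, name FROM student ORDER BY ssn")
    return [f"{ssn}: {name}" for ssn, name in rows]


def main():
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE student (ssn TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO student VALUES (?, ?)",
        [("345", "Ron Weasley"), ("123", "Hermione Grainger")],
    )
    for line in student_report(conn):
        print(line)
    return conn


if __name__ == "__main__":
    main()
```

A 'naive' user would simply run this program and read the report, without ever seeing the SQL inside it.
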
[Figure: system architecture diagram, repeated, with the additional label pgm (src)]

Some examples:
§  ‘naive’ user, running the previous app.

[Figure: system architecture diagram, repeated, with the additional label pgm (src)]

Conclusions
§  (relational) DBMSs: electronic record keepers
§  customize them with create table commands
§  ask SQL queries to retrieve info

Conclusions cont’d
main advantages over (flat) files & scripts:
§  logical + physical data independence (i.e., flexibility of adding new attributes, new tables and indices)
§  concurrency control and recovery