The Library (Big) Data Scientist
IFLA/ALA webinar:
“Big Data: new roles and opportunities for new librarians”
June 15th 2016
IFLA Big Data Special Interest Group (SIG)
Wouter Klapwijk, Stellenbosch University, SIG convenor
[email protected]

IFLA Big Data SIG
•  Proposed at WLIC 2014, Lyon
•  Petitioned for at WLIC 2015, Cape Town
•  Endorsed by the IFLA Professional Committee in December 2015
•  SIG sponsor: IT Section
•  Objectives:
1.  Provide a focus point for developing ideas regarding Big Data as it affects libraries
2.  Provide a platform within IFLA to assess and develop the avenues of response from IFLA to this developing area

Deconstructing “Big Data” and “data science”
This talk is based on a number of everyday questions:
1.  What does “data science” mean?
o  Is it only happening in Tech places like Facebook and Google?
2.  What are Data Scientists?
o  Can Librarians also be Data Scientists?
3.  Is data science the science of Big Data?
o  What is the relationship between Big Data and data science?
4.  Exactly what is “Big Data” anyway?
o  Just how big is Big? Or is Big relative? Is library data Big?

Data science
•  A set of fundamental principles that guide the extraction of knowledge from data
•  The “civil engineering of data”: turning data into data products
Goal of data science:
•  to improve decision-making, for the betterment of organizations and society at large

Relation to other “engineering” concepts
“data mining”
•  helps accomplish data science goals via technologies that incorporate data science principles
•  but… its techniques are much more extensive than the set of principles comprising data science
“data warehousing”
•  a facilitating technology for “data mining”
•  but… not always included as part of “data mining”

Relation to other computing concepts
“data processing”
•  is more general than data science
•  there is processing involved in all aspects of computing
“Big Data technologies”
•  are often used for data processing in support of data mining techniques
•  … and other data science activities

Science or Craft?
•  The term data science has existed for over 30 years; it is a field unto itself
•  Its foundation rests on century-old practices of Statistics and Mathematics and, since the mid-20th century, also Computer Science
•  It is not just a rebranding of Statistics and Machine Learning in the context of the Tech industry
•  Much of the field's development is happening in Industry, and not in Academia

Examples of data science products
Domain        Example
Internet      Recommendation systems (Amazon = books; Facebook = friends)
Finance       Credit ratings
Education     Personalized learning and assessment
Government    Policies based on data

Practicing data science
What do Data Scientists do?

The Data Scientist
Two aspects to consider:
1.  understand what they DO in business
2.  understand which SKILLS they must possess

What do they DO?
1.  They ask questions
o  Probing, being curious
2.  They (try to) solve problems
o  Analytical thinking, making new discoveries
3.  They cultivate (new) soft skills
o  Communicating and visualizing data
How much of the above do you as librarians do?

The library professional’s genes?
Is it in our pedigree to continuously ask questions?
Do we have the taste and mindset for analytical thinking?
Do we only do ad hoc analysis, or do we prefer an ongoing conversation with data?
Is there enough of a Business Analyst or Social Scientist in us?

Which SKILLS do they need?
•  Soft skills: Communication
•  Hard skills: Linear algebra, Statistics, Artificial Intelligence, Machine Learning
•  Domain-specific skills: Understand the business, e.g. librarianship
[Figure: skills diagram labelled “Data Scientist”]

Data Scientists are team members
•  Statistician
•  Mathematician
•  Data programmer
•  Social scientist
•  Systems administrator
•  ?
Different skills are embedded across multi-disciplinary team members

The Data Scientist team profile
[Figure: chart of the team profile across Statistics, Mathematics, Computer Science and Domain expertise; axis from 0 to 14]
It is important to understand data science even if you never intend to do it yourself

Practicing data science
What does the craft of data science look like?

3 disciplinary areas
•  SOURCES (Systems Administrator): data
•  ANALYTICS (Data Programmer): collect, clean, integrate, process
•  VISUALIZATION (App designer): communicate
Each disciplinary area requires different skills
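
The analytics steps named on this slide (collect, clean, integrate, process) can be sketched as a small pipeline. This is only an illustration, not something shown in the talk: the records, field names and function names below are all hypothetical, and a real library system would read from its own databases or exports.

```python
from collections import Counter

# Hypothetical raw loan records (an assumption for illustration only).
RAW_LOANS = [
    {"patron_id": " P1 ", "item_id": "B10"},
    {"patron_id": "P2", "item_id": "B11"},
    {"patron_id": "", "item_id": "B12"},   # missing patron: dropped in clean()
    {"patron_id": "P1", "item_id": "B11"},
]
# Hypothetical second source: a catalogue mapping item ids to item types.
ITEM_TYPES = {"B10": "Book", "B11": "DVD"}


def collect():
    """COLLECT: gather raw records from a source (here, an in-memory list)."""
    return list(RAW_LOANS)


def clean(records):
    """CLEAN: trim whitespace and drop records that have no patron id."""
    cleaned = []
    for rec in records:
        patron = rec["patron_id"].strip()
        if patron:
            cleaned.append({"patron_id": patron, "item_id": rec["item_id"]})
    return cleaned


def integrate(records, item_types):
    """INTEGRATE: enrich each loan with the item type from the catalogue."""
    return [
        dict(rec, item_type=item_types.get(rec["item_id"], "Unknown"))
        for rec in records
    ]


def process(records):
    """PROCESS: count loans per item type."""
    return Counter(rec["item_type"] for rec in records)


loans_per_type = process(integrate(clean(collect()), ITEM_TYPES))
print(dict(loans_per_type))
```

Keeping each step as its own function mirrors the idea that different team members may own different stages of the work.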

SOURCES
Eras:
•  Database (1960-)
•  Data Warehouse (1975-)
•  Big Data (2005-)
Milestones:
•  1963: First integrated data store (Bachman)
•  1970: Relational data model (Codd)
•  1970+: SQL (Boyce & Chamberlin)
•  1979: First commercial RDBMS (Oracle)
•  1983: DB2 (IBM)
•  1989: First KDD workshop
•  1995: First KDD data mining conference (Fayyad, Shapiro)
•  2009: NoSQL (Evans)

SOURCES
BIG DATA
•  Mostly unstructured (80%)
•  Various sources
•  Needs to be related and combined
•  Examples: A/V, social, logs
SMALL DATA
•  Databases
•  Data warehouses
•  Quantitative and qualitative
•  Mostly structured and indexical
•  Metadata
•  “Long tail data”
Incomplete data taxonomy: some data are neither just big nor just small

SOURCES
Small data
•  The term denotes the opposite of Big Data
•  Data usually housed in databases and data warehouses
•  Usually structured, qualitative and indexical in nature
•  Examples: Library data, Research Data (RDM)
•  Research data = primary data

SOURCES
Big data
•  Datasets that are too large for traditional data processing and storage systems (3 V’s, 4 V’s, 5 V’s)
•  Classified into 3 classes of “datafication”:
1.  Directed data (e.g. surveillance data)
2.  Automated data (e.g. device-generated data)
3.  Volunteered data (e.g. social networks data)

ANALYTICS
[Figure: the database, data warehouse and Big Data timeline from the SOURCES slide, repeated with the added labels Business Intelligence (BI), Data Deluge and Data Science]

ANALYTICS
There are 4 broad classes of analytics (often used in combination):
1.  Data mining and pattern recognition
o  AI, Machine Learning, Data Mining
2.  Data visualization and visual analytics
o  App development
3.  Statistical analysis
o  Statistical techniques and principles (regression, etc.)
4.  Prediction, simulation, and optimization
o  Algorithms
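
As a rough illustration of the third and fourth classes (statistical analysis feeding a prediction), the sketch below fits an ordinary least-squares line and extrapolates one step ahead. The monthly loan figures are made up for the example and the function name `fit_line` is hypothetical; nothing here comes from the talk itself.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up monthly loan counts for months 1..6 (assumption for illustration).
months = [1, 2, 3, 4, 5, 6]
loans = [120, 135, 150, 160, 178, 190]

slope, intercept = fit_line(months, loans)
forecast_month_7 = slope * 7 + intercept
print(f"trend: {slope:.1f} extra loans per month; "
      f"forecast for month 7: {forecast_month_7:.0f}")
```

A straight-line trend is the simplest possible model; real planning data would normally call for checking seasonality and uncertainty before trusting a forecast.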

Practicing data science
In Libraries
1. Data at Scale
Value and insight can be extracted from small data by scaling them up into larger datasets, for reuse through digital data infrastructures
2. Analyzing Exhaust data
Exhaust data = produced as a by-product of the main function of a device or system
Most exhaust data is transient in nature: it is never examined and simply discarded!
Example: log of a self-checkout unit
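
To make the self-checkout example concrete, here is a minimal sketch that counts checkouts per hour from such a log. The log format, the sample lines and the function name are assumptions invented for this illustration; an actual unit will write its own vendor-specific format.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log excerpt: "<ISO timestamp> <EVENT> <details>".
SAMPLE_LOG = """\
2016-06-15T09:12:03 CHECKOUT item=B10
2016-06-15T09:47:51 CHECKOUT item=B11
2016-06-15T10:05:19 ERROR card_unreadable
2016-06-15T14:30:00 CHECKOUT item=B12
"""


def checkouts_per_hour(log_text):
    """Count CHECKOUT events per hour of day, ignoring all other lines."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[1] != "CHECKOUT":
            continue  # skip blank lines and non-checkout events
        timestamp = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S")
        counts[timestamp.hour] += 1
    return counts


hourly = checkouts_per_hour(SAMPLE_LOG)
busiest_hour, busiest_count = hourly.most_common(1)[0]
print(f"busiest hour: {busiest_hour}:00 with {busiest_count} checkouts")
```

Even a count this simple turns a discarded by-product into input for staffing or opening-hours decisions.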

Example of analyzing Exhaust data
Structured and unstructured data:
•  Visits
•  Patrons
•  Loans
•  Locations
•  Books
•  DVDs
•  Journals
•  Digitized books
Actionable insights:
•  Better forecasts for future library planning
•  Better usage of systems and resources
•  Productivity gain with better decision-making

Examples of Analysis and Visualization
Library analytics toolkit, Harvard University:
https://osc.hul.harvard.edu/liblab/projects/libraryanalytics-toolkit
Text analytics, Google Books Ngram Viewer:
https://books.google.com/ngrams
Open Source implementation, Bookworm:
http://bookworm.culturomics.org

In summary
The fundamental principle of data science is that data, and the capability to extract useful knowledge from it, should be regarded as a key strategic asset.
Libraries must learn to start thinking data-analytically.
Do we only use gut and intuition, or also data and rigor, in our decision-making?

In summary
You can apply the same principles, tools and techniques to small data as you would to big data
“…the tools of data science are as appropriate for gigabyte as they are for petabyte scale datasets…”
(https://datascience.berkeley.edu/about/what-is-data-science/)

In summary
Challenges for librarians:
•  There is a shortage of Big Data talent
•  The Big Data SIG is attempting to understand and frame Big Data problems
Opportunities for librarians:
•  Grow your data analytical skills
•  Attend online courses: Khan Academy, Coursera, Software Carpentry, digital books
•  There are free software tools: R (SQL Server 2016 includes R), Python, app visualization tools

Thank you
[email protected]