The Library (Big) Data Scientist

IFLA/ALA webinar: “Big Data: new roles and opportunities for new librarians”
June 15th 2016
IFLA Big Data Special Interest Group (SIG)
Wouter Klapwijk, Stellenbosch University, SIG convenor
[email protected]

IFLA Big Data SIG
• Proposed at WLIC 2014, Lyon
• Petitioned for at WLIC 2015, Cape Town
• Endorsed by the IFLA Professional Committee in December 2015
• SIG sponsor: IT Section
• Objectives:
  1. Provide a focus point for developing ideas regarding Big Data as it affects libraries
  2. Provide a platform within IFLA to assess and develop the avenues of response from IFLA to this developing area

Deconstructing “Big Data” and “data science”
This talk is based on a number of everyday questions:
1. What does “data science” mean?
   o Is it only happening in Tech places like Facebook and Google?
2. What are Data Scientists?
   o Can Librarians also be Data Scientists?
3. Is data science the science of Big Data?
   o What is the relationship between Big Data and data science?
4. Exactly what is “Big Data” anyway?
   o Just how big is Big? Or is Big relative? Is library data Big?

Data science
• A set of fundamental principles that guide the extraction of knowledge from data
• The “civil engineering of data”: turning data into data products
Goal of data science:
• To improve decision-making, for the betterment of organizations and society at large

Relation to other “engineering” concepts
“data mining”
• helps accomplish data science goals via technologies that incorporate data science principles
• but… its techniques are much more extensive than the set of principles comprising data science
“data warehousing”
• a facilitating technology for “data mining”
• but… not always included as part of “data mining”

Relation to other computing concepts
“data processing”
• is more general than data science
• there is processing involved in all aspects of computing
“Big Data technologies”
• are often used for data processing in support of data mining techniques
• … and other data science activities

Science or Craft?
• The term data science has existed for over 30 years; it is a field unto itself
• Its foundation rests on centuries-old practices of Statistics and Mathematics and, since the mid-20th century, also Computer Science
• It is not just a rebranding of Statistics and Machine Learning in the context of the Tech industry
• Much of the field's development is happening in Industry, and not in Academia

Examples of data science products
Domain       Example
Internet     Recommendation systems (Amazon = books; Facebook = friends)
Finance      Credit ratings
Education    Personalized learning and assessment
Government   Policies based on data

Practicing data science
What do Data Scientists do?

The Data Scientist
Two aspects to consider:
1. understand what they DO in business
2. understand which SKILLS they must possess

What do they DO?
1. They ask questions
   o Probe, being curious
2. They (try to) solve problems
   o Analytical thinking, making new discoveries
3. They cultivate (new) soft skills
   o Communicating and visualizing data
How much of the above do you, as a librarian, do?

The library professional’s genes?
• Is it in our pedigree to continuously ask questions?
• Do we have the taste and mindset for analytical thinking?
• Do we only do ad hoc analysis, or do we prefer an ongoing conversation with data?
• Is there enough of a Business Analyst or Social Scientist in us?

Which SKILLS do they need?
A Data Scientist combines:
• Soft skills: Communication
• Hard skills: Linear algebra, Statistics, Artificial Intelligence, Machine Learning
• Domain-specific skills: Understand the business, e.g. librarianship

Data Scientists are team members
• Statistician
• Mathematician
• Data programmer
• Social scientist
• Systems administrator
• ?
Different skills are embedded across multi-disciplinary team members

The Data Scientist team profile
[Chart: team profile across Statistics, Mathematics, Computer Science and Domain expertise; vertical axis 0 to 14]
It is important to understand data science even if you never intend to do it yourself

Practicing data science
What does the craft of data science look like?

3 disciplinary areas
• SOURCES (Systems Administrator): Data
• ANALYTICS (Data Programmer): Collect, Clean, Integrate, Process
• VISUALIZATION (App designer): Communicate
Each disciplinary area requires different skills

SOURCES
Timeline:
• Database (1960-)
  o First integrated data store (Bachman), 1963
  o Relational data model (Codd), 1970
  o SQL (Boyce & Chamberlin), 1970+
• Data Warehouse (1975-)
  o First commercial RDBMS (Oracle), 1979
  o DB2 (IBM), 1983
  o First KDD workshop, 1989
  o First KDD data mining conference (Fayyad, Piatetsky-Shapiro), 1995
• Big Data (2005-)
  o NoSQL (Evans), 2009

SOURCES
BIG DATA
• Mostly unstructured (80%)
• Various sources
• Needs to be related and combined
SMALL DATA
• Databases
• Data Warehouses
• Quantitative and qualitative
• Mostly structured and indexical
Data types shown in the diagram: Metadata, “Long tail data”, A/V, Social, Logs
Incomplete Data Taxonomy: some data are neither just big nor just small

SOURCES
Small data
• The term denotes the opposite of Big Data
• Data usually housed in databases and data warehouses
• Usually structured, qualitative and indexical in nature
• Examples: Library data, Research Data (RDM)
• Research data = primary data

SOURCES
Big data
• Datasets that are too large for traditional data processing and storage systems (3 V’s, 4 V’s, 5 V’s)
• Classified into 3 classes of “datafication”:
  1. Directed data (e.g. surveillance data)
  2. Automated data (e.g. device-generated data)
  3. Volunteered data (e.g. social networks data)

ANALYTICS
The same timeline as under SOURCES (Database, 1960-; Data Warehouse, 1975-; Big Data, 2005-), with the analytics layer added: Business Intelligence (BI), followed by the DATA DELUGE, followed by Data Science

ANALYTICS
There are 4 broad classes of analytics (often used in combination):
1. Data mining and pattern recognition
   o AI, Machine Learning, Data Mining
2. Data visualization and visual analytics
   o App development
3. Statistical analysis
   o Statistical techniques and principles (regression, etc.)
4. Prediction, simulation, and optimization
   o Algorithms

Practicing data science
In Libraries
1. Data at Scale
   Value and insight can be extracted from small data by scaling them up into larger datasets, for reuse through digital data infrastructures
2. Analyzing Exhaust data
   Exhaust data = data produced as a by-product of the main function of a device or system
   Most exhaust data is transient in nature: it is never examined and simply discarded!
   Example: the log of a self-checkout unit

Example of analyzing Exhaust data
• Structured and unstructured data: Visits, Patrons, Loans, Locations, Books, DVDs, Journals, Digitized books
• Actionable insights:
  o Better forecasts for future library planning
  o Better usage of systems and resources
  o Productivity gain with better decision-making

Examples of Analysis and Visualization
• Library analytics toolkit, Harvard University: https://osc.hul.harvard.edu/liblab/projects/libraryanalytics-toolkit
• Text analytics, Google Books Ngram Viewer: https://books.google.com/ngrams
• Open Source implementation, Bookworm: http://bookworm.culturomics.org

In summary
The fundamental principle of data science is that data, and the capability to extract useful knowledge from it, should be regarded as a key strategic asset.
Libraries must learn to start thinking data-analytically.
Do we only use gut and intuition, or also data and rigor, in our decision-making?
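
To make the exhaust-data idea more concrete, the self-checkout log mentioned earlier could be summarized with a few lines of Python. This is only a minimal sketch: the log format (ISO timestamp, event name, key=value fields), the unit names and the helper functions `parse_line` and `summarize` are all assumptions made for illustration, not the format of any real self-checkout system.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log excerpt; the line format is an assumption for illustration only.
SAMPLE_LOG = """\
2016-06-15T09:12:44 CHECKOUT unit=SC1 item_type=book
2016-06-15T09:47:10 CHECKOUT unit=SC1 item_type=dvd
2016-06-15T10:03:02 ERROR unit=SC1 code=card_unreadable
2016-06-15T10:15:30 CHECKOUT unit=SC2 item_type=book
2016-06-15T10:58:59 CHECKOUT unit=SC1 item_type=book
"""


def parse_line(line):
    """Split one log line into (timestamp, event name, dict of key=value fields)."""
    stamp, event, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return datetime.fromisoformat(stamp), event, fields


def summarize(log_text):
    """Count successful checkouts per hour of day and per item type."""
    per_hour = Counter()
    per_type = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        when, event, fields = parse_line(line)
        if event != "CHECKOUT":
            continue  # ignore errors and other events in this simple summary
        per_hour[when.hour] += 1
        per_type[fields.get("item_type", "unknown")] += 1
    return per_hour, per_type


per_hour, per_type = summarize(SAMPLE_LOG)
for hour in sorted(per_hour):
    print(f"{hour:02d}:00  {per_hour[hour]} checkouts")
print(per_type.most_common())
```

Counts like these, collected over weeks rather than five lines, are the kind of input that could feed the forecasts and usage insights listed above.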
In summary
You can apply the same principles, tools and techniques to small data as you would to big data
“…the tools of data science are as appropriate for gigabyte as they are for petabyte scale datasets…”
(https://datascience.berkeley.edu/about/what-is-data-science/)

In summary
Challenges for librarians:
• There is a shortage of Big Data talent
• The Big Data SIG is attempting to understand and frame Big Data problems
Opportunities for librarians:
• Grow your data analytical skills
• Attend online courses: Khan Academy, Coursera, Software Carpentry, digital books
• There are free software tools: R (SQL Server 2016 includes R), Python, app visualization tools

Thank you
[email protected]