Download Deliverable D3.1 Research Report on TDM Landscape

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
REDUCING BARRIERS AND INCREASING UPTAKE OF TEXT AND DATA MINING FOR RESEARCH
ENVIRONMENTSUSINGACOLLABORATIVEKNOWLEDGEANDOPENINFORMATIONAPPROACH
DeliverableD3.1
ResearchReportonTDMLandscapein
Europe
©2016FutureTDM|Horizon2020|GARRI-3-2014–665940
2
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
Project
Acronym:
FutureTDM
Title:
Reducing Barriers and Increasing Uptake of Text and Data Mining for Research
EnvironmentsusingaCollaborativeKnowledgeandOpenInformationApproach
Coordinator:
SYNYOGmbH
Reference:
665940
Type:
Collaborativeproject
Programme:
HORIZON2020
Theme:
GARRI-3-2014-ScientificInformationintheDigitalAge:TextandDataMining(TDM)
Start:
01September,2015
Duration:
24months
Website:
http://www.futuretdm.eu
E-Mail:
[email protected]
Consortium:
SYNYOGmbH,Research&DevelopmentDepartment,Austria,(SYNYO)
StichtingLIBER,TheNetherlands,(LIBER)
OpenKnowledge,UK,(OK/CM)
RadboudUniversity,CentreforLanguageStudies,TheNetherlands,(RU)
TheBritishLibraryBoard,UK,(BL)
UniversiteitvanAmsterdam,Inst.forInformationLaw,TheNetherlands,(UVA)
Athena Research and Innovation Centre in Information, Communication and
Knowledge Technologies, Inst. for Language and Speech Processing, Greece,
(ARC)
UbiquityPressLimited,UK,(UP)
FundacjaProjekt:Polska,Poland,(FPP)
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
3
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
Deliverable
Number:
D3.1
Title:
ResearchreportonTDMLandscapeinEurope
Leadbeneficiary:
RU
Workpackage:
WP3:ASSESS:Studies,Publications,LegalRegulations,PoliciesandBarriers
Disseminationlevel:
Public(PU)
Nature:
Report(RE)
Duedate:
31.01.2016
Submissiondate:
29.04.2016
Authors:
MariaEskevich,RU
AntalvandenBosch,RU
Contributors:
MarcoCaspers,UVA
LucieGuibault,UVA
AlessioBertone,SYNYO
SusanReilly,LIBER
CarmenMunteanu,SYNYO
PeterLeitner,SYNYO
SteliosPiperidis,ARC
Review:
MarcoCaspers,UVA
LucieGuibault,UVA
AlessioBertone,SYNYO
Acknowledgement: This project has received Disclaimer: The content of this publication is the
funding from the European Union’s Horizon 2020 sole responsibility of the authors, and does not in
Research and Innovation Programme under Grant any way represent the view of the European
AgreementNo665940.
Commissionoritsservices.
ThisreportbyFutureTDMConsortiummemberscan
bereusedundertheCC-BY4.0license
(https://creativecommons.org/licenses/by/4.0/).
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
4
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
TableofContents
1
Introduction....................................................................................................................................6
1.1.Whatistextanddatamining?.....................................................................................................6
1.2.ThegrowingimportanceofTDM.................................................................................................7
1.3.Goalandstructureofthereport..................................................................................................9
2
TDMframework............................................................................................................................11
2.1.Datatobemined........................................................................................................................11
2.2.TechnologyandR&D..................................................................................................................16
3
TDMacrosseconomicsectors.......................................................................................................21
3.1.Primarysector............................................................................................................................22
3.2.Secondarysector........................................................................................................................22
3.3.Tertiarysector............................................................................................................................23
3.4.Quaternarysector......................................................................................................................25
3.5.Quinarysector............................................................................................................................26
4
Summaryoftheliterature.............................................................................................................27
5
Conclusion.....................................................................................................................................40
References.............................................................................................................................................41
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
5
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
LISTOFFIGURES
Figure1.NumberofdifferenttypesofpublicationsintheDBLPComputerSciencebibliography........8
Figure2.NumberofdifferenttypesofpublicationsontheEuropeanLifeSciencesPMCwebsite........9
Figure3.MacroviewofTDMframework.............................................................................................11
Figure4.MicroviewonTDMenvironment..........................................................................................17
Figure5.GeneraleconomicstructureandconnectionbetweenTDMandalleconomicsectors........22
Figure6.Timelineofreports.................................................................................................................29
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
6
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
1.INTRODUCTION
Recentyearshaveseenanexponentialgrowthinthevolumeofdigitaldataavailabletoalllevelsof
society.Afurtherproliferationofdigitalordigitisedcontentisexpectedinthis,theeraofBigData.A
numberofinterrelatedfactorshaveboostedthisdatathrustandtheaccompanyingdevelopmentof
technologiestoexploitthisnewtypeofinformation-richgroundmaterialwithtextanddatamining
(TDM). First, the costs for data storage and cloud computing facilities continue to decrease, while
computing power increases. Second, digitisation has moved from specific processes, such as
archiving and optimizing industrial workflows, to nearly all daily aspects of societal activities. This
processpavesthewayfordataficationofthisinformation,i.e.thecreationofnewinformationand
knowledge, potentially with new value, on the basis of machine-readable content. TDM is the
umbrellatermfortechnologiesthatenablethisgoal.
Thecontinuousgrowthofdata1increasinglyemphasisesthequestionofhowtobestmakeuseofit.
Societyexpects,withgoodreason,thattheimplementationofintelligentmethodstodealwithBIg
Datawillleadtoimprovedinformationaccess,fasterknowledgediscovery,higherproductivity,and
competitiveness.TheseexpectationshaveinspiredTDMresearchersandcompaniesacrosstheworld
to explore novel algorithms for extracting information, discovering knowledge, and improving the
ability to predict from data on a large scale. As data is a manifold concept, so is TDM. This report
aimstochartandstructuretheTDMlandscape.
1.1.WhatisTextandDataMining?
Text and Data Mining was initially defined as “the discovery by computer of new, previously
unknown information, by automatically extracting and relating information from different (…)
resources,torevealotherwisehiddenmeanings”(Hearst,1999),inotherwords,“anexploratorydata
analysisthatleadstothediscoveryofheretoforeunknowninformation,ortoanswersforquestions
for which the answer is not currently known” (Hearst, 1999). Nowadays, this umbrella term
encompassesdiversetechniquesthatallowinterpretationofcontentofanytyperangingfromraw
data, e.g. sensor data, text, images and multimedia, to processed content, e.g. diagrams, charts,
tables,references,maps,formulas,chemicalstructures,andmetadatafromsemi-structuredsources,
onalargescalethroughtheidentificationofpatterns.
TDM algorithms harbour vast potential for nearly all scientific fields and for a wide diversity of
practical industrial and societal applications. However, this broad potential and variety of TDM
applications complicates the landscape view, as even the technology itself is referred to using
different terms within different fields. Within the business and managerial context, TDM can be
referred to as Business Intelligence Solutions or Qualitative Data Analysis. To highlight the central
1
Itisexpectedthattherewillbe16trilliongigabytesofdataby2020,whichcorrespondstoanannualgrowth
rate of 236 % in data generation (Motion for a resolution further to Question for Oral Answer B8-0116/2016
pursuant to Rule 128(5) of the Rules of Procedure on “Towards a thriving data-driven economy”
(2015/2612(RSP),2016):
http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
7
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
positionofvastquantitiesofdata,thetermBigData(Processing)isused.Whenthefocusisonthe
discovery of hitherto undiscovered information, Content Mining appears to be a term of choice,
whileDataAnalyticsandTextAnalyticsemphasisethedata-drivenaspectofperforminganalysesand
afocusonseekinganalyticalsolutionstochallenges.Intheacademicworld,TDMisoftenconsidered
tobecomposedofMachineLearningmethodscoupledwithExploratoryDataAnalysismethodssuch
as visualisation. Machine Learning, in turn, arose from the post-war fields of Artificial Intelligence,
InformationTheory,andPatternRecognition.
1.2.TheGrowingImportanceofTDM
Withtheincreaseofcomputationalpower,theriseofBigData,andthedigitisationofvastamounts
ofdata,theuseofTDMpromisestoyieldvaluablesocietalandeconomicbenefitsintermsofnew
insights and cost savings that would otherwise not be possible: more scientific breakthroughs, a
greater understanding of society, and countless business opportunities. The potential of TDM as a
researchtechniqueandasaneconomicopportunityhasbeenrecognisedatapoliticallevelintheEU
(HargreavesI.etal.,2014).2ThebenefitsofTDMhavealsobeennotedbytheresearchcommunity
andsocietalstakeholderse.g.viatheHagueDeclaration3.
Originallyrootedinacademicresearch,TDMhasleftitsacademicchildhoodyearsandisnowaglobal
informationtechnology(IT)industry.Asanindustry,TDMisfuelledbydirectaccesstolargeamounts
ofdata.CompaniescaneitherdevelopandruntheirownTDMsoftware,oroutsourcethisworkto
external experts who can customise TDM solutions. Industry is now rapidly rolling out the
implementation of TDM for Big Data, as computing power grows and businesses find new ways to
gather and access data. According to the International Data Corporation’s Worldwide Big Data
TechnologyandServicesForecastfor2013-2017,thegrowthrateoftheBigDatamarketuntil2017
willbesixtimesfasterthanthatoftheoverallinformationandcommunication(ICT)market.TheBig
DataValuePublic-PrivatePartnership2predictsthatTDMwillreachatotalofEUR50billionby2017,
andmayresultin3.75millionnewjobs.WhilemanycompaniesseethepotentialofTDMfortheir
business development, only 1.7 % of companies are currently making full use of advanced digital
technologieslikeTDM,despitethebenefitsthatdigitaltoolscanbring.2
TheapplicationofTDMtoscienceholdsthepromiseofacceleratingscientificdiscoveries.Miningthe
content present in research publication databases and the associated research data sets, as far as
they are available for harvesting, will boost scientific research by increasing the productivity of
researchers and leaving part of the scientific discovery process to computers, a trend called Data
Science. The size of scientific publication databases varies across domains, but in each case, the
number is beyond a value that would be feasible for a human to read and assess without digital
support. Figures 1-2 give examples of the volume of content that has to be dealt with by the
2
(Motion for a resolution further to Question for Oral Answer B8-0116/2016 pursuant to Rule 128(5) of the
Rules of Procedure on “Towards a thriving data-driven economy” (2015/2612(RSP), 2016):
http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN;
http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf
3
TheHagueDeclarationonKnowledgeDiscovery:
http://thehaguedeclaration.com/the-hague-declaration-on-knowledge-discovery-in-the-digital-age/
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
8
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
academia and industry researchers in the Computer Science4 and Life Sciences domains5,
respectively. TDM may help researchers in all scientific fields to regain control over the academic
process,toaggregateinformationandtofollowtrendsacrossfieldsbeyondtheirown.Estimationsof
addedeconomicvalueinthiscase,basedonthetimesavingbenefitofTDMalone,areanincreaseof
2percent,orEUR5.3billion,withlong-termimpactofoverallgainintherangeofatleastEUR32.5
billion(Filippov,2014).
Figure1.NumberofdifferenttypesofpublicationsintheDBLPComputerSciencebibliography.6
4
http://dblp2.uni-trier.de
https://europepmc.org
6
http://dblp2.uni-trier.de/statistics/publicationsperyear.html[accessedonMarch2016]
5
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
9
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
Patents
4.2million
NHSguidelines
765
Abstracts
30.9million,
25.9million
fromPubMed
Fulltextarticles
3.6million
Agricolarecords
617244
Figure2.NumberofdifferenttypesofpublicationsontheEuropeanLifeSciencesPMCwebsite.7
1.3.Goalandstructureofthereport
Inthisreport,weoutlinethetextanddatamininglandscape.Westartbyprovidinganoverviewof
types of data and technologies that the TDM experts are working with. We look at the global
economicstructureoftheknowledge-basedsocietyandhighlightusecasesforTDMuptakeacrossall
economic sectors. Following recent discussions on TDM perspectives within the scientific, political,
andeconomiccontext,wehighlighttheimportanceofthetechnologyanditspotential.
TDM algorithms are developed globally. It is therefore our task to list and structure the important
technology definitions, data types, underlying algorithms and leading research communities on a
worldwide scale. However, the actual uptake of TDM and ease of practical implementation vary
considerablydependingonthecountryandtheeconomicsectorinquestion.Inthisreport,wegivea
broad perspective on the general European Union (EU) landscape, and quote exceptional countryleveldirectivesonlywhenthesecountrieshavelegalregulationsspecificallyaddressingTDM,e.g.the
caseofstatutoryexceptiononcopyrightintheUK8.
TDM is becoming increasingly ubiquitous on the global market, as each company that tracks its
activities,products,andcommunicationsinsomedigitalformhasthepotentialtoreusethisdatafor
furtheranalysisandimprovement.Forsomesectors,countriesandapplications,itisalreadypossible
to estimate the potential profit or at least the size of the market. Other fields such as journalism,
7
8
https://europepmc.org/About[accessedonMarch2016]
"Section 29A and Schedule 2(2)1D of the Copyright, Designs and Patents Act 1988":
http://www.legislation.gov.uk/uksi/2014/1372/regulation/3/made
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
10
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
which is in the process of embracing the concept of Data Journalism, have to first agree on the
conceptualchangesrequiredtotheircurrentworkbehaviourinordertomergethetraditionalwork
patternwithTDM(LewisS.C.andWestlundO.,2014).
This report is structured as follows: in Section 2, we introduce the definitions of the overall TDM
framework, classifications of data types, and the technologies that can be used in text and data
miningprocess.InSection3,weshowhowTDMispotentiallyrelatedtoallsectorsofeconomics,and
we illustrate this assumption with examples and potential use cases of TDM implementation and
integration across sectors. In Section 4, we draw a timeline of scientific, industrial and political
documents that emphasise the importance of and interest in TDM and the need for uptake across
sectors,andwesummarisethekeymessages.WeconcludethereportwithSection5,outliningthe
importanceofTDMforthefutureoftheeconomyandscience.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
11
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
2.TDMFRAMEWORK
Thequantityofdigitalcontentcurrentlyproducedbysocietyandthepotentialeaseofaccesstothis
dataatanylocationviaeasy-to-usedevices(withreliableInternetconnection,andmobiledevices)
increasetheneedforcontentminingtechnologiesthatboosthumancapacitiesintermsofspeedof
dataprocessing(Mayer-SchoenbergerandCukier,2013).TDMtechnologiescansupportindividuals
aswellaslargerinstitutionsandcompaniesindecision-makingprocessesandinthedevelopmentof
novel ideas and solutions. In this section we discuss diverse data classifications, technologies that
underlie TDM implementations, and the research communities that bring together the researchers
anddevelopersfrombothacademiaandindustrytoimproveandadvanceTDMalgorithms.
2.1.Datatobemined
TDMisnotdefinedthroughasinglescientificdomainortheory;itisacomplexsetoftechnologies
with different underlying theories and assumptions, consisting of components that are developed
withindiversedisciplines.AsFigure3shows,atamacrolevel,theTDMframeworkiscentredaround
data.AsweentertheBigDataera(Mayer-SchoenbergerandCukier,2013),theefficientstorageand
managementofcontentrepresentasignificantchallengeoftheirown,bothintermsoftheamount
ofinformationandintermsoftheheterogeneousnatureofthedata,whichcanbenumeric,textual,
multimodal,etc.Themorecomplexthedata,themorecreativeandinnovativetheTDMcomponents
have to be. Data can be structured according to its size and legal accessibility, according to its
sources,orbywhoproducesitthecontentthatconstitutesthedatacollections.
InSection2.1.1,weprovideanoverviewofthetypesofsourcesthatcangenerateandholddata,to
illustratethebreadthofthesourcesonwhichTDMmayoperate.InSection2.1.2,wesummarisea
newlyproposedclassificationofdatasetsaccordingtodatasizeandtypesoflegalaccess,inorderto
identifythetypesofdatasetsthatcouldbemoreeasilyaccessedandminedbyfor-profitandnonprofitorganisations,inthecasethatcopyrightrestrictionsareadjusted.
Figure3.MacroviewofTDMframework.9
9
ThisFigure(‘MacroviewonTDMenvironment’)byFutureTDMcanbereusedundertheCC-BY4.0license.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
12
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
2.1.1.Datasourcesanddataholders
Datacomesincountlessforms.Itiseasytothinkofdataasintentionallycollectedtextorscientific
data,butdatamaybealsoproducedbyhumansandbusinessesastheycarryoutdatafiedactivities,
leaving so-called “data exhaust” as a by-product of their actions and movements in the world, e.g.
users’ online interactions, corporate and personal websites, and released digital documents and
archives (Mayer-Schoenberger and Cukier, 2013). Another type of data is represented by
measurements of uncontrolled events happening in nature as recorded by weather sensors,
telescopes,andotherinstrumentsacrosstheglobe.
Belowwelistdifferentdatasources,rangingfrompersonaldata,createdortrackedfromindividuals,
to professionally gathered and created content in forms of cultural knowledge data, scientific and
businessdata,publicsectorinformation.AllthisdatacanbeeithersharedpubliclyontheInternet,
andstoredoffline.Whensharedpublicly,theoriginalcontentisenrichedwithadatadescriptionand
furtherinformationthatareavailableonawebsite.
Personaldata
ThetracesthateachInternetuserleavesformamassiveamountofinformationthathasvaluewhen
gathered and structured on a mass scale. Currently this information is mostly given by individuals
eitherwithorwithouttheirconsent;inthefirstcaseoftenwithoutthemrealisinghowprofitablethis
information is, and in the second case, without them even realizing that their information is being
collectedinanyway.
States also collect information about individuals in the country, and can often compel people to
providethemwithinformation,ratherthanhavingtopersuadethemtodosooroffersomethingin
return, as private companies would have to. This special status allows a state to gather large
amountsofinformationthatcanbepotentiallyminedbyprivateorpublicinstitutions.Thiscanthen
beusedtooptimiseinteractionsbetweenthestateandindividuals,toimprovetheservicesprovided
bythestate,andtocreatebusinessmodelsaroundthisdataortousethisknowledgeforthegreater
goodofthesociety.Atthesametime,alongwiththeaccesstopersonaldata,statesareresponsible
forprotectingtheprivacyofthedataproducerswhenminingthisdataandreleasingittothegeneral
public.
Culturalheritagedata
There has been enormous investment in the digitisation of heritage data, i.e. data that are
consideredworthwhilearchiving,studying,orexhibiting.In2010,itwasestimatedthatitwouldtake
EUR10billiontodigitiseEurope’sculturalheritageholdings.3By2012,itwasestimatedthatatleast
20% of Europe’s cultural heritage collections had been digitised10. At the same time libraries were
investing in preserving digital collections and exploring ways to increase the accessibility of this
content. Libraries use TDM to structure digital content, and to develop and explore further
possibilities for knowledge discovery e.g. via topic modelling. Improvements in the accuracy of
10
http://pro.europeana.eu/enumerate/statistics/results
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
13
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
optical character recognition (OCR), including for handwritten texts11 will increase the impact and
efficiency of applying TDM to digital content in order to enhance metadata or analyse collections.
Google Books12 and Google Art13 operate on a worldwide scale, and digitise cultural (printed)
heritage for their own technology development, as well as to make this data accessible to a wider
audience.TheHathiTrustDigitalLibraryisapartnershipmodelbringingtogetherdigitisedcollections
fromacrossUSinstitutionsandmakingthemavailabletoUSresearchersforanalysis.14AtaEuropean
level, Europeana, which collects metadata about heterogeneous data such as digitised artworks,
artefacts,books,videosandsounds,providesaccesstothismetadataandlimiteddatasetsviaAPIs15.
The DARIAH16 and CLARIN17 infrastructures illustrate the value and potential of access to these
extremely heterogeneous corpora for the research community. Both infrastructures support TDM
acrossawide-rangingcorpora,e.g.CLARINsupportsnaturallanguageprocessingandanalysisoftext
(includingnewspapers,webfora,books),andspeech(newsbroadcasts,signlanguage).
Scientificdata
Scientific data consist of two main types: data that have been collected through experiments and
studies(measurementsofanykind,texts,videos,etc.),andthescientificpublicationsthatdescribe
the work after analysis has been performed (containing data description, methods, experimental
results,etc.).
Data used in experiments may be released to a general audience, along with the publication
describingthestudy.Thisjointreleaseofdata,withcorrespondingpublicationandrelatedmetadata,
is considered to have a beneficial side-effect: it promotes scientific research reproducibility and
scientific integrity. An increasing number of researchers and international research organisations
perceivethistobeapositivestepforwards(ICSU,ISSC,TWAS,IAP,2015).18
Scientific data can be released purely as a data set collected for general purposes. In this way,
researchers in business or academia can access the data and decide themselves on the type of
experimentstorun.AnexampleofthisapproachistheopenaccessreleaseoftheEUsatellitedata
fromCopernicusEarthobservation19.Theinitiativetoaskthequestionsandtoanswerthemliesin
thehandsoftheaudience.OpendataanddatasharinginresearchisontheincreaseinEuropedue
to funder-mandates such as the H2020 Open Data Pilot, and the increasing availability of
infrastructurefordatasharinge.g.genericservicessuchasEUDAT20andZenodo21.Onaworldwide
11
http://transcriptorium.eu
12
13
https://books.google.com
https://www.google.com/culturalinstitute/project/art-project
https://sharc.hathitrust.org
15
http://labs.europeana.eu/api
16
https://de.dariah.eu/tatom
17
http://www.clarin.eu
18
FORCE11 MANIFESTO: Improving Future Research
https://www.force11.org/about/manifesto
19
http://www.eea.europa.eu/highlights/eu-satellite-data-to-be
20
http://www.eudat.eu
14
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
Communication
and
e-Scholarship:
14
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
scale,anexampleistheResearchDataAlliance(RDA)22,aforumwheretheinfrastructuresfordata
sharingacrossdisciplines,institutions,andcountries,arediscussed,andguidelinesareformulated.In
addition,industryisalsobeginningtomaketheirscientificdatasetsavailable,e.g.clinicaltrialdata,
oftenforethicalortransparencyreasons.
Scientific publications can also be considered as a valuable dataset for TDM. These texts represent
potentialsourcesofnewknowledgewhentheinformationtheycontainissummarisedandanalysed
incomparisonwithotherstudiesandlinesofenquiryfromotherdisciplines.Oftencontainingfigures
anddatavisualisations,articlesareofvaluenotjustbecauseofthetexttheycontain,butbecause
they are important sources of raw data and context. TDM analysis of publications can help find
connections between studies in different domains that are implicitly or indirectly connected
(Swanson, 1986). In the field of bioinformatics, the literature has been mined to analyse the
interactions of genes and proteins in various diseases. Often this type of analysis is performed on
Medline23abstracts,astheyarefreelyavailablebutithasbeennotedthatthisisextremelylimiting.
VanHaagenetal.foundthatonly32%ofknownProteinProteinInteractions(PPIs)couldbefoundin
Medline abstracts, and therefore the rest could only be found by accessing the full text literature
(vanHaagenetal.,2009).
Scientific publications that are released via open access are, by definition, accessible for TDM and
TDM-basedapplications.PubMedCentral24hasbecomeawidelyusedsourceforTDM.TheEuropean
open access publication infrastructure OpenAire25 not only supports TDM by encouraging storing
articles under open access licences and in interoperable machine readable formats, it also utilises
TDM to infer links between data and publications, and to provide funders with data regarding
publications arising from research funding and research trends26. Subscription content is less
accessibleas,withtheexceptionoftheUK,Europeanresearcherscanonlyminethiscontentunder
conditions specified by licences. In contrast to this, in the US, researchers may mine content that
theyhavelegalaccessto,withouttheneedforadditionallicencepermissions.
Businessdata
Businessesarecreatingandcollectingdatafromallpossiblesources,asminingdatafromcombined
sources improves insights into internal processes, strategies, and the market. These data can vary
from energy consumption by business agents, such as shops and carriers, to personal information
about customers and their digital and financial transactions. Most of this data constitutes value
withinthecontextofaknowledge-basedeconomy,andissensitivetodataprotectionissues,thus,is
usedinternallybycompaniesandisnotreleasedpublically.
21
https://zenodo.org
https://rd-alliance.org
23
https://www.nlm.nih.gov/bsd/pmresources.html
22
24
http://www.ncbi.nlm.nih.gov/pmc
https://www.openaire.eu
26
https://blogs.openaire.eu/?p=88
25
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
15
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
Publicsectorinformation(PSI)
PSIisdataproducedandmadeavailablebygovernmentalinstitutions.Today,agrowingnumberof
governmentsrealisethatthisdatacanbringmoreprofitandbenefittosocietywhenreleasedtoa
general audience. PSI data release initiatives allow commercial companies that are already
experienced in data mining, to use this additional information source to build better products for
theircustomers,consequentlycreatingmorejobsandboostingcommercialsuccess.
The European Union Open Data Portal27 gives access to a growing range of PSI data from the
institutions and other bodies of the European Union (EU); it is free for both commercial or noncommercial purposes. The EU Open Data Portal is managed by the Publications Office of the
EuropeanUnion.ImplementationoftheEU'sopendatapolicyistheresponsibilityoftheDirectorateGeneralforCommunicationsNetworks,ContentandTechnologyoftheEuropeanCommission.
Open data portals have also been set up by individual EU Member States.28 The datasets of public
documentsthattheseportalsshare,aremostlyunderCC-0license,i.e.theownersdidnotreserve
anyrights,andthedataisfreeforsharinganduseinanycountry,byanyinstitutions.Thesedatasets
representthedocumentsinthenativelanguagesofeachspecificcountry.29,30,31,32,33
Internetdata
Sincetheintroductionofthefirstwebsitein1991,theWorldWideWebhadexpandedtomorethan
45 billion pages with unique URLs in the late 2010s (Bosch et al., 2016). Many webpages contain
valuable information that can be crawled for further mining. More generally, the Internet (the
infrastructureonwhichtheWorldWideWebisbased,butalsoservicessuchasinternet,email,and
filetransfer)allowsaccesstolargeamountsofdataofalltypes,thatcanbecrawledwiththeproper
credentials. Many data remains, of course, hidden behind passwords and firewalls. The so-called
Deep Web, the part of the World Wide Web that is not directly accessible for indexing by search
engines,hasbeenestimatedtoholdatleast500timesasmuchdataastheindexed(‘surface’)web.34
2.1.2.Datatypes
AnytypeofdataconsideredtobeproperlyavailableforTDMmustfirstbeavailableas,orconverted
to,astandardisedandmachine-readableformsothatitcanbeeasilyprocessed.Intheir2014report
to the European Commission entitled 'Standardisation in the area of innovation and technological
development, notably in the field of Text and Data Mining', Hargreaves et al.(2014) identified five
27
28
29
https://open-data.europa.eu/en/data
http://ec.europa.eu/newsroom/dae/document.cfm?action=display&doc_id=8340
https://data.gov.uk
30
31
https://data.overheid.nl
https://www.data.gouv.fr
32
https://www.govdata.de
33
http://data.norge.no
34
https://en.wikipedia.org/wiki/Deep_web
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
typesofdatabasesthatlendthemselvestoTDM,accordingtotheirsizeandaccesspolicies,aswellas
theirpotentialforrevenuestructurechanges:35
1.XXL:alldatabehindfirewalls,i.e.companiesandorganisations’internaldatabases,notaccessible
tothepublic.36
2. XL: all publicly accessible data not behind a firewall or paywall. e.g. freely accessible newspaper
websites.
3. L: all publicly accessible data located behind a paywall. e.g. online newspaper pages that are
accessiblewithasubscriptionfee.
4 M: all publicly accessible data behind a paywall whose clients are mainly researchers and
companiesinneedofreports.e.g.Reuters,Bloomberg.
5.S:scientificpublishers’databehindapaywall.
Thisclassificationcoversdatacollectionsofdifferentsizesthatcontaincontentofadiversenature.
Wecanmakeadistinctionbetweenhigh-leveldatathatislikelytobesubjecttocopyrightprotection,
and low-level data that is highly unlikely to be protected under copyright law (although data
protectionlawmayapplyincertaincases):
-
Protected:text,image,sound,multimedia;
Non-protected,unlessthecontentbecomespartofthedatabasethatthecompanyor
institutiondidputsubstantialinvestmentinto:e.g.clicks,likes,GPScoordinates,timestamps,
andmeasurementsofadifferentkind.
2.2.TechnologyandR&D
Textanddatamininghasgraduallyevolvedintoalargedomainwithcomplexapproachesforarange
ofapplications.Historically,thedevelopmentofminingtechnologieshasitsrootsinthemid-1700s,
withtheinventionofmathematicalmodels,regressionmethods,andstatisticalanalysis.Duetothe
advent of computers and systems for storing larger quantities of data, the possibility to mine very
largedatasetsandfilterrelevantinformationhasgreatlyexpandedwhatusedtobelaboriousexpert
analysis work. Since the beginning of the 1990s, modern data mining tools have appeared on the
market and in labs, allowing more advanced analysis, to discover hidden patterns and to extract
implicit knowledge, as envisaged and tested by Swanson D.R. (Swanson, 1986). With the invention
andadoptionoftheWorldWideWeb,researchersexpandedtheireffortstodeveloptextanddata
miningmethodstosearchandexploretheweb.Thisledtothecreationofsearchengines,starting
from simple interface-based browsers like Mosaic and Netscape, and developing into large-scale
systems like Google and Bing. It represented the opening gambit towards more complex methods,
not only based on keyword search, but also on their correlation, their similarity, their joint
probabilities,andtotheapplicationoftextminingtodifferentfields(suchasthebiomedicaldomain,
35
Theoriginaldocumentusestheterm‘database’;wereplacedthisby‘data’.
36
ThesecollectionsareoutsidetheFutureTDMscope.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 16
17
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
economics, social sciences, or humanities), and types of data and content (such as scientific
publicationsandresearchdata,socialmedia,multimedia,orpublicsectorinformation).
Figure 4 shows a zoom-in view of the TDM environment, outlining the main research fields that
develop,test,andfinallyprovidethealgorithmsforTDMimplementationsforeachspecificdomain
of use. The core scientific areas are machine learning, which is machine learning of tasks and the
discoveryofpatternsandstructureindata;informationretrievalandsearch,whichfocusesonthe
optimisation of search methods that index vast amounts of content and allow the location and
extraction of the most relevant facts for any question and user type; and natural language
processing techniques, which deal with human language in its textual and spoken form, covering
humanity’s past as well as reflecting current trends and topics. This scheme does not include an
exhaustive list of techniques and research directions; its aim is to illustrate the scale and potential
scientific interaction, and the broad knowledge required to create a successful, scientifically sound
TDMapplication.
Below,webrieflyintroducethemainresearchdirectionsthatconstitutethecomponentsoftextand
dataminingapplications.
Information
Extraction
Artificial
Intelligence
Concept
Extraction
Information
Retrieval
and Search
Classification
Clustering
Opinion
Mining
Negation
and Modality
Detection
Predictive
Analytics
Machine
Learning
Natural
Language
Processing
Machine
Translation
Regression
Multimedia
Processing
Association
Rules
Textual
Entailment
Summarisation
Statistics
Figure4.MicroviewonTDMenvironment.37
Machine Learning (ML) studies the automated recognition of patterns in data, and develops
algorithmsabletolearntaskswithoutexplicit(human)instructionorprogramming.Learningisbased
37
ThisFigure(‘MicroviewonTDMenvironment’)byFutureTDMcanbereusedundertheCC-BY4.0license.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
18
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
ondataexplorationandisaimedatgeneralizingknowledgeandpatternsbeyondtheexamplesthat
arefedtothealgorithmattrainingstage.
Informationretrieval(IR)canvaryfromsimpleretrievalofalltheitems(e.g.fulldocuments)withina
collection that contain user-requested words in metadata fields or structured databases, to more
sophisticatedsearchandrankingoftheresultsbasedonthepresenceofcombinationsofimportant
wordsandnumbersindifferentsectionsoftextsordatafields(Baeza-YatesandRibeiro-Neto,1999).
It is important to note that, although TDM uses information retrieval algorithms as one of the
componentswithintheTDMframework,itgoesbeyondthesimpleretrievalofwholedocumentsor
their parts. Within the IR framework, the task of knowledge processing and of new knowledge
creation still resides with the user, while in the case of full TDM technology implementation, the
extractionofnovel,never-beforeencounteredinformationisthemaingoal(Hearst,1999).
Naturallanguageprocessing(NLP)aimsatdevelopingcomputationalmodelsfortheunderstanding
andgenerationofnaturallanguage.NLPmodelscapturethestructureoflanguagethroughpatterns
that map linguistic form to information and knowledge or vice versa; these patterns are either
inspiredbylinguistictheoryorautomaticallydiscoveredfromdata.TheNLPcommunityissubdivided
intogroupsthatfocusonspokenorwrittencontent,i.e.speechtechnologies(ST)andcomputational
linguistics (CL). ST researchers work with audio signals, synthesising and recognising speech and
paralinguisticphenomena.Oncespeechcontenthasbeentranscribedinatextualrepresentation,it
canbefurthertreatedastextforTDM.CLresearcherstargetsyntacticanddiscoursestructures;their
algorithms help us understand the basic concepts, relations, facts, and events expressed in textual
content.
ThevisualisationoftheoutputoftheML,IR,andNLPalgorithmsisanintrinsicpartofalltheTDM
constituenttechnologies.Visualisationisvaluableforresearchersinfieldslikebioinformaticswhere,
for example, the mining of protein behaviour described in a set of publications is visualised in one
scheme;orintextmining,whenawordcloud,withwordsindifferentsizeandcoloursdependingon
their frequencies, gives a general impression of the most frequent terms used in a collection of
documents.
Statistics, as a field of mathematics in general and machine learning in particular, provides its
practitionerswithawidevarietyoftoolsfordatacollection,analysis,interpretationorexplanation,
andpresentation.
Classification/Categorisation/Clustering of content within a collection in order to split (in a
supervisedorunsupervisedmanner)theseintogroupsofcertaintypesbasedoncontent,representa
TDM preprocessing step at the collection level, as these categories are used for further specific
miningofthecollection’scontent.Machinelearning(ML)algorithmsrepresentadominantsolution
forthistask,e.g.theNaïveBayesapproach(Lewis,1998),decisionrules(Cohen,1995),andsupport
vectormachines(Joachims,1998).Whileclassificationalgorithmsidentifywhetheranobjectbelongs
toacertaingroup,regressionalgorithmsestimateorpredictaresponseasavaluefromacontinuous
set.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
19
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
Association rule-learning algorithms allow for the discovery of interesting relations between
variables within large databases (Agrawal et al., 1993). For example, when a large database of
customer shopping behaviour is mined, association rules help to define what types of products or
services are usually sold together. This information can help to adjust and tailor marketing
campaigns,aswellasmakeproductarrangementsinthestoresmoreefficient.
Informationextraction(IE)isawideconceptthatcoverstheinferenceofinformationfromtextual
data.Examplesoftypesofextractedinformationarestatisticsontheusageofterms,theoccurrence
of named entities in text and their relations or associations according to certain classification
schemes,andthepresenceofspecificfactsorlogicalinferencesexpressedintexts(e.g.interactions
betweengenes,logicalproofs,ortheinferenceoflogicalconsequencesandentailmentexpressedin
thetext).
Term or Concept extraction is a first step for further IE, as it is necessary to define the terms of
importance for the corpus to start its processing. Once these terms, being specific for the topic to
conceptualiseagivenknowledgedomain,areextracted,anontologyofthefieldcanbecreatedfor
furtherIEtaskimplementations.
Negationandmodality(orhedge)detectionisanimportantaspectofTDM,asnegationandhedging
constructionsinthetext(suchas“OurresultscastdoubtonfindingXofAuthorZ”or“Wefoundthat
findingZdoesnotapplytocontextY”)offercrucialevaluationsofreportedfacts.
Sentimentanalysis,oropinionmining,aimstodeterminetheattitudeofaspeakerorawriterwith
respecttosometopicortheoverallcontextualpolarityofadocument.Theattitudemaybehisor
her judgment or evaluation, the affective state of the author, or the intended emotion the author
intendstoconveytothereader.
Summarisationofagiventextorasetoftextsisthetaskofcreatingashortertextthatcapturesthe
most crucial information from the original texts. First, the crucial content to be summarised is
defined,andthen,dependingonthetask,itiscondensedintoanewrepresentationthatmayconsist
of sentences or parts from the original documents (extractive summarisation), or that may be a
newly created text or multimedia document that rephrases the summarised message (abstractive
summarisation).
Predictive analytics covers the methods that analyse previous usage or transactions statistics and,
basedontheseresultsandtrends,offerspredictionsonfurthersystemsdevelopment.
Machine translation targets human quality level of translation of written or spoken content from
onelanguagetoanotherusingcomputers.Althoughitoriginallystartedasrule-basedsystemstaking
linguistic knowledge into account, nowadays the pioneers in the field use Big Data size collections
anddeeplearningalgorithmstotraintheirsystemsandtoachieveperformanceimprovements.
Multimediaprocessingallowsfurtherminingofallaspectsofmultimediacontent,varyingfromthe
audio stream of a given video to visual concepts extraction and annotation, person and object
discovery and localisation on the screen. This direction of research and applications has gained a
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
20
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
great deal of importance, due to the fact that contemporary Internet users share and create large
quantities of multimedia content that, for example, reflect their opinions and spread information
aboutevents.
As depicted in Figure 4, the methods listed here are not insulated domains of research and
application;theyrepresentacomplexinteractivesetoftoolsandmethodsthatcanberecombinedin
multipleways,dependingontheneedsofthespecifictaskandtheavailabilityofdataandexpertise.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
21
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
3.TDMACROSSECONOMICSECTORS
Since the beginning of humanity, the development of novel technologies has defined societal
structure and its employment distribution. Agriculture and natural resource extraction form a
primary sector, as this is an initial type of human joint-effort to produce food for survival. The
inventionofmachinesandelaboratetoolsledtothedevelopmentofthesecondarysector,whichhas
evolved into complex machinery engineering, construction, and space exploration. The third sector
comprises all the services that society members deliver to each other, varying from retail to
transportation, and from medical healthcare support to entertainment. Until recently, this three
componentstructurerepresentedmostofthesocietiesintheworld.However,thegrowthofdatadriven business, services and research, has led to a discussion on the changes in the economic
structurethatshouldreflectanoveltypeofknowledge-basedeconomy,whereknowledgeanddata
become a separate valuable commodity (OCDE, 1996). Thus recently, the research that supports
knowledge sharing and growth, as well as education of the professionals that can carry out these
activities,havebeentermedaquaternaryeconomicsector.Moreover,thehighleveldecisionmakers
in governments, large industry companies, and education, who have the potential to shape the
futureoftheentiresectorswiththeirvisionandfollowingdecisions,havebeenplacedinaseparate,
quinarysector.
Forthepurposeofourreport,weareinterestedinthecurrentandpotentialpresenceofTDMinall
economicsectors.WedepicttheserelationsinFigure5.Theprimary,secondary,andtertiarysectors
followroughlythesametypeofinteractionwithTDMtechnologies.Thesesectorsrepresentsources
of all possible types of data that are available for mining, and at the same time, have tasks and
challenges that require smart solutions. The quaternary sector is particularly central to TDM
development and implementation, as it produces the TDM experts and advances the technology
itself.Expertsatthedecision-makerleveluseTDMtoolstotestandprovetheirvisionofeconomic
development, as well as to hypothesise where a specific decision will lead to, and to measure and
assessitsprogress.
In the sections below, we list examples per sector and subfield on how TDM is used, in order to
illustratethebroadnessofTDMuptakeanditspotentialfordevelopment.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
22
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
SECONDARY SECTOR
Aerospace
Agro-Industry
Bio-Technology
Construction
Technology
Engineering
Manufacturing
Microelectronics
Data &
Tasks
TERTIARY SECTOR
PRIMARY SECTOR
Data &
Tasks
Agriculture
Coal Mining
Technology
Natural Resources Industries
Wholesale Trade
Technology
TEXT AND
DATA MINING
Data &
Tasks
Across Economic
Sectors
Industry or
Gov. Experts
QUINARY SECTOR
High Level Decision Makers
in Government, Industry,
Commerce, and Education
Entertainment
Finance, Banking, Insurance
Health and Social Care
IT Services
Public Administration
Retail/Trade
Telecommunications and Journalism
Tourism
Transportation
Technology
TDM Experts
Tools for
Analysis
QUARTENARY SECTOR
Education
Research and Development (e.g. Tech)
Figure5.GeneraleconomicstructureandconnectionbetweenTDMandalleconomicsectors.38
3.1.Primarysector
Making the best use of limited natural resources with a constantly growing population is one of
humanity’s ultimate challenges. Taking on the challenge requires the understanding of changing
patternsinclimateandenvironment;TDMhasthepotentialtodiscoversuchpatternsautomatically
from data. The primary data that can be used for TDM are collected in the form of satellite
measurements,weathersensorsonthegroundandbelow,intheair,andinthewater.Otherdata
arederivedfromeconomicindicatorsfromtheprimarysectors:productionnumbers,profits,stock
marketvaluepatternsofminedgoodssuchasoil,coal,andgold,andcropssuchaswheat39.
3.2.Secondarysector
Anylarge-scalemanufacturingplantordistributednetworkofpipelinesispackedwithvarioustypes
of sensors that produce a continuous stream of measurements that are crucial for maintenance
analysis,suchastemperature,pressurelevels,leakdetections,etc.Whenanalysedonalargescale,
this information offers a basis for discovering and understanding strategies for cost savings and
safety measures. For example, calculating optimal routes for product delivery can boost fuel
efficiency,leadingtoasmallerecologicalfootprintaswellastocostssavings;knowledgeofelectrical
gridusageorgaspipelinesintegrityhelpstoschedulemaintenanceofthemostsensitivepartsofthe
38
This Figure (‘General economic structure and connection between TDM and all economic sectors’) by
FutureTDMcanbereusedundertheCC-BY4.0license.
39
http://www.agroknow.com/agroknow
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
23
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
system (Mayer-Schoenberger and Cukier, 2013), (Auclair D., 2016); data-driven smart grid
applicationscansignificantlylowerCO2emissions(ICSU,ISSC,TWAS,IAP,2015).
3.3.Tertiarysector
ThesectorofprovidedservicesembracesTDMatadifferentpacedependingontheTDMexpertise
andawarenessinthefield,BIgDataavailability,andthesensibilityofthisdataintermsofprivacy,as
muchmorepersonaldataisminedwithinthissector.
Companies allocate great effort to interacting with their existing customers, as this allows them to
monitorcustomerfeedbackontheproducts,andatthesametimetoprofilecustomersforfurther
targetedproductpromotion.Ontheotherhand,informationaboutusers’Internetsurfingbehaviour
isgatheredintheformofclicklogs,websiteviewsandsearchqueries.Oncethisdataiscategorised
and profiled, the advertising network industry can match the right advertisement from the right
company to the suitable client profile. Sentiment analysis of customer comments in global social
networks,suchasTwitterandinternetforums,allowscompaniestotrackcustomerinterestintheir
products,aswellastousethesechannelsfortargetedadvertisingofservicesandproducts.
3.3.1.Finance,Banking,Insurance
Inthefinancedomain,predictiveanalyticsplayacrucialrole.TDMbasedanalyseshelptomanage
risks and to make informed decisions. Banks and insurance companies use TDM to build consumer
profiles and develop automatic classifiers to determine eligibility for financial products, and to set
insurancepremiums.Customerparameterscanincludediversetypesofinformationgatheredfrom
different sources, ranging from consumers’ zip codes, employment background and educational
history,totheirshoppinghistory,socialmediausageandfriendshipnetwork(RamirezE.etal.,2016).
Financial institutions typically purchase this information from consumer reporting agencies (CRA)
thatcrawldatafromtheInternetandfromspecialgovernmentandprivatedatabases.
3.3.2.Healthandsocialcare
Both health care providers and consumers benefit when treatment plans and medical advice are
tailoredtoindividualpatients.TDMhasgraduallybecomepartofthehealthcaresystem:inrouting
and treatment tailoring, treatment monitoring, as well as in claims processing and insurance
payments(AuclairD.,2016).Inaddition,BigDataisusedbydifferentmedicalinstitutionstopredict
the likelihood of hospital readmission for different categories of patients and types of disease
combinations.
MedicaltreatmentsandpredictionssuggestedbyartificialintelligenceandTDMarebeingintegrated
in decision support systems for doctors (e.g. Babylon40 in the UK, IBM Watson41 in the USA). Such
systems,poweredbyAI,canhelptolowerthecostsinregionswithashortageofspecialtyproviders.
40
41
http://www.wired.co.uk/news/archive/2014-04/28/babylon-ali-parsa
http://www-03.ibm.com/press/us/en/presskit/27793.wss
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
24
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
Moreover, medical TDM systems can flag prescription anomalies based on patient records, thus
helpingtopreventmedicationerrors(AuclairD.,2016).
3.3.3.Publicadministration
Publicadministrationcollectdataaboutdifferentaspectsofthelivesofthemembersofsociety,such
asownershipofmovableandimmovableassets,useofthehealthcaresystem,employmentstatus,
and family situation. This information is used to keep track of societal developments and changes,
and to come up with adjustments to policies and reallocation of services. A substantial amount of
thisinformationstaysofflineandisnotinterconnected.Ifthiscontentwouldbemadeopentothe
publicthroughopengovernmentinitiatives,manytypesofbusinessesandnon-profitorganisations
would have a chance to put this data to use, e.g. by connecting the PSI data to their own data, or
building new products and services on top of the released PSI (Mayer-Schoenberger and Cukier,
2013).
Likethefinanceindustry,publicadministrationcanbenefitfrominternalTDMimplementations,as
intelligent mining and automated auditing of the data can help to detect and to prevent fraud,
currently leading to financial losses in tax collection; TDM may generally help to optimise the
workflow.
3.3.4.Retail
Inretail,trackingandminingofpurchasetransactionsbycustomersallowscompaniestoadjustthe
range of products on display according to the preferences of the customers in a particular area.In
additiontobetterfittedproductselection,foodretailerscanadjusttheirfacilitiesarrangements,for
example,refrigeratortemperatures,basedonlarge-scalestatisticsofalltheshopsinachain,which
inturnhelpstosaveontheenergybills.
Somemarketplacesaresetupassalesplatformsforindividualsandcompaniesincompletelyvirtual
environments,likeinternationalgiantseBay42andAmazon43ortheFrenchsmallersizeequivalent,Le
BonCoin44.Inthecontextofonlineshopping,theminingofuserbehaviour(clicks,productsselected
orbought,timespentonthepage,browsersused,timeofthedaywhenthewebsiteisaccessed)is
increasingly being used for product recommendations and reminders, which eventually increases
sales.
3.3.5.Telecommunicationsandjournalism
Thetelecommunicationsindustrygeneratessomeofthelargestmultimediadatasets,andTDMhelps
to filter and to play already available content that is of interest to clients depending on their
preferences. Mining customer preferences, in turn, influences decisions on allocating resources to
thecreationofnewcontentthatwillpotentiallybesuccessful.
42
http://www.ebay.com
43
https://www.amazon.com
44
http://www.leboncoin.fr
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
25
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
Journalists always needed diverse and reliable information sources to report news events and to
bring an analytical perspective to a general audience. Nowadays, journalists have to deal with an
avalancheofdatasources,thusminingisseenasavaluabletoolforinvestigativejournalismtofind,
confirm,andeventuallytoprovestoriesinamoretransparentway(DiakopoulosN.,2014).Miningof
blogs,websitesandtwitterfeedsmakesjournalistsawareoftheiraudience’sinterestsandlevelof
knowledge, which helps them to shape their social interaction accordingly and to select relevant
news pieces (Flaounas I. et al., 2013). At the same time, the journalists have to be careful when
redefiningtheiragendausingTDM.TheyneedtocumulatetheknowledgegainedfromTDMwiththe
actualstateofsociety,asnotallsocietymembersarerepresentedinthesocialmediaorwebdata
availableforcrawling.Intheirworkanduseofdata,journalistshavetofindabalancebetweendata
personalisationandtheecologyofcommonknowledge(LewisS.C.andWestlundO.,2014).
3.4.Quaternarysector
This information and knowledge-rich sector focuses on knowledge gathering, processing, and
creation, via research practices and teaching society members. In this sector, TDM algorithms are
boththesubjectofresearchandasetoftoolsthatenableresearch.
3.4.1.Research
The speed of scientific progress depends to a large extent on the framework that supports the
sharing of novel ideas and key findings within and across communities, the reproducibility and
comparabilityofresults,andtheresearchimpactmeasurements.TDMtechniqueshavethepotential
toaddvalueandincreaseefficiencyforallthesecomponents.
AsmentionedinSection1andillustratedwithexamplesinFigures1-2,researchersnowadayshave
tokeeptrackofanenormousnumberofpublicationstoretainanup-to-dateunderstandingofthe
research in their field. Moreover, a growing amount of research happens at the intersection of
severaldomains,ormethodsarebeingadoptedacrossdomains.Thus,thecurrentvolumesofcrossdomain content render them infeasible to be read by humans. TDM can help researchers discover
newconnectionsandtrends,allowingthemtomoreoptimallyusetheirtimeandotherresources.
Thequalityofscientificresultssharedwithinthescientificcommunityreliesandheavilydependson
thesystematicreviewingofsubmittedarticlesforpublication.Thisrigorousprocessguaranteesthat
theresearchoutcome,whenaccepted,bringsnovelinformationtothefieldofknowledge,andthe
results have been achieved following the standard procedures that safeguard reproducibility and
reliability.Ideally,highqualityreviewingreliesonfullawarenessofallthepublicationsinthefield.
However,withtheincreasingnumberofpublishedstudiesatagloballevel,itbecomesunfeasiblefor
anyhumanresearcher/reviewertokeeptrackofallstudiesthathavebeenreported.Therefore,TDM
canprovidevaluablesupportforreviewers,asitcanreducetheworkloadandminimiseanypotential
bias.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
26
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
Furthermore,thequalityofpublishedresearchcanbemeasuredwithusingTDM.Forexample,smart
mining of publication references that takes into account the discourse structure in case of each
quotation,yieldsamoresubtleandinformativeoverviewofthewholecitationsnetwork.
Clinicalstudiesrequireameaningfulsamplesizeofpatientswithcertainconditionsinordertotest
theeffectivenessofnewdrugsormedicalprocedure.Pharmaceuticalcompaniesthatdevelopthese
drugsandtreatmentsrelyonintermediarycompaniesthatcarryouttheactualscreeningofmedical
recordsforsuitablepotentialcandidates.TheseintermediariesworkonalargescaleandrelyonTDM
technologiestoestablishpatient-clinicalmatches,aswellastosupportpost-trialpharmacovigilance
throughearlydetectionofadversedrugevents.
3.4.2.Education
Traditional educational institutions can use BIg Data techniques to identify students for advanced
classes, while at the same time determine students who are at risk of dropping out or might be in
needofearlyinterventionstrategies.TheGatesFoundationhasinvestedinanumberofprogrammes
to improve the quality and availability of educational data with the aim of developing policy and
practicetoimprovetheimpactofeducation,andtobooststudentachievement45.
Massive open online courses (MOOC), introduced in the late 2000s, are currently reshaping the
wholeconceptofdistanceeducation,andhaveapotentialtoimpactthewholeeducationalsystem,
as they make it possible to monitor and assess individual class performance based on large-scale
miningofstudentprofiles,attendance,taskcompletion,andforumdiscussions.
3.5.Quinarysector
Funders and high-level decision makers rely on their general visions of technology and societal
development paths, as well as the impact-metrics that assess potential economic, political, and
societal value. These experts can use TDM to mine the current status of affairs, and to predict the
outcomes of diverse types of interventions. Moreover, sentiment analysis of social networks
activitiescanprovidethemwithfastfeedbackfromthepopulationonactionstaken.
45
http://www.gatesfoundation.org/Media-Center/Press-Releases/2009/01/Foundation-Invests-in-Researchand-Data-Systems-to-Improve-Student-Achievement
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
27
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
4.SUMMARYOFTHELITERATURE
Astextanddataminingharbouralargepotentialacrosspracticallyallsectorsoftheeconomy,their
legal,economic,andpoliticalimplicationshavebeendiscussedinseveralexpertreportsandfromthe
pointofviewofvariousstakeholders.Thisreportisbasedonseveralofthesereports;inthissection
we summarise the most prominent reports published in recent years for further reading and
reference.Wehighlightthepointswhichcallforurgentlegalandeconomicchangesthatwouldallow
TDMtounleashitsfullpotential.Wealsohighlightcaseswhereuptakesuccessismentioned,andwe
outlinethedirectioninwhichdiscussionsdevelop.Figure6depictsthegeneraltimelineofreports,
their research questions and focus (Legal aspects, Economic factors, Examples, Technology). It is
worthnotingthatreportscreatedonbehalfofgovernmentsandacademiaaremoreoftenreleased
via open access, even when produced by private companies. Therefore, they are available for
analysis, while private sector reports, usually created as a product for sale, stay behind a paywall,
restrictingouraccesstotheircontent.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
28
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
29
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
46
Figure6.Timelineofreports. Year
Title
Authors
Researchquestions/
Focus
2011
DigitalOpportunity.A
ReviewofIntellectual
47
PropertyandGrowth. I.Hargreaves
Thisreportaimsto
Basedonanalysisofprevious
findeconomic
publications,reports,reportedlegal
evidencetosupport
andeconomicpractices.
changesinthe
IntellectualProperty
lawsatthelevelofthe
UK,andfurtherata
unifiedEUlevel.
TwoimportantareasofutilityofTDM:(1)costsavings,(2)
widereconomicimpactgeneration
2011
BIgData:Thenext
frontierforinnovation,
competition,and
48
productivity. McKinseyGlobal
Institute(MGI)
ToexploreBigData
potentialwhenusedin
everyeconomysector,
itsinnovation
potential,andthe
addedvalue.
Basedonanalysisofprevious
publications,reports,reportedlegal
andeconomicpractices,aswellas
internalMGIresearch.
ThisreportgivesageneralperspectiveontheBigDataTDM
beingincorporatedintoalleconomicsectors,asdatafication
reachesallsectorsoftheglobaleconomy.Theexamplesof
potentialprofits,aswellasraisingissues,arelistedforbothEU
andUSAmarkets.
-Issues:datapoliciesinthecontextofBigData,privacy
protection;storageandaccessfacilitiesandnoveltechniques
thatneedtobeimplemented.
-BigDataanditsprocessingcreateadditionalvaluefor
businessesinanumberofways:improvedandmoreuser
targetedservicequality,overalltransparencyofperformance,
andthesupportofhumandecisionswithreliableanalytics.
TheValueandBenefits
D.McDonald,U.
Toexplorecosts,
Consultationswithstakeholdersand -Theavailabilityofmaterialfortextminingislimited.Evenin
2012
J.Manyika,M.
Chui,B.Brown,
J.Bughin,R.
Dobbs,C.
Roxburgh,A.
HungByers
Dataandmethods
Keyfindings
46
ThisFigure(‘Timelineofreports’)byFutureTDMcanbereusedundertheCC-BY4.0license.
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/32563/ipreview-finalreport.pdf
48
http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
47
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
30
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
ofTextMiningtoUK
FurtherandHigher
Education.Digital
49
Infrastructure.JISC. Kelly
benefits,barriersand
risksassociatedwith
textminingwithinUK
furtherandhigher
education
casestudies(literaturereviewin
systemsbiology,textminingto
expediteresearch,increaseof
accessibilityandrelevanceof
scholarlycontent.
caseofearlyadoptersinthefieldsofbiomedicalchemistry
research,theminingislimitedtotheOpenAccessdocuments.
-Mainbenefitsareincreasedresearchers’efficiency,improved
researchprocessandquality.Thereportdemonstrateshow
muchmoneyintermsofpersontimecanbesavedorusedfor
properresearchtaskswhenthescientificcontentismined
automatically,leavingtheresearcherswithanalysischallenges
andpureresearchquestions.
-Barriers:legaluncertainty,inaccessibleinformation.
2012
Businessintelligenceand MISQuarterly
Analytics:FromBigData 50
toBigImpact. Themainfocusofthis -Bibliometricstudyofcritical
revueistodescribe
BusinessintelligenceandAnalytics
anddiscussBusiness
publications
intelligenceand
Analytics(BI&A)
implementedfor
diversemarket
applicationsthatuse
manydatatypes(from
user-generated
contenttoFederal
ReserveWireNetwork
information)
-BI&Ainitscurrentstate(2.0)processesweb-based
unstructureddata,andthenextstepforthetechnology(3.0)
willbetousemobileandsensor-basedcontent.
2012
ResearchTrends.Special
51
IssueonBigData. Thisspecialissuehasa --
broadperspective
varyingfrom
applicationsscenarios
-Bibliometricsanalysisconfirmsthat‘ComputerScience’and
‘Engineering’aretheleadingtopicsinBigDatapublications,
andUSleadsthepublicationsraceinthefield.
49
https://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining
http://hmchen.shidler.hawaii.edu/Chen_big_data_MISQ_2012.pdf
51
http://www.researchtrends.com/wp-content/uploads/2012/09/Research_Trends_Issue30.pdf
50
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
31
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
toinfrastructureissues
forTDMinBigData
context
2012
META-NET.Strategic
ResearchAgendafor
MultilingualEurope
2020.52
2013
2014
META
Technology
Council
Thisreportfocuseson
thecurrentstateof
andthefurther
potentialfor
developmentof
languagetechnologies
onaglobalscale,and
morespecifically,the
Europeanlandscape.
e-IRGWhitePaper. e-IRG
Thisreportfocuseson theinfrastructure
initiativesthathave
alreadybeentaken
andespeciallyon
thosethatareurgently
neededinorderto
supportTDMona
largescaleacross
communities.
-InfrastructureneedssupportatnationalandEUlevel,and
requirescollaborationwithbusiness.
-Openscienceshouldbesupportedthroughandviashared
infrastructures.
Bridgingthegap.How
digitalinnovationcan
drivegrowthandcreate
P.Hofheinz,M.
Mandel
-Thisdocument
outlinesthegrowthof
dataataglobalscale,
-Largerrevolutioninalleconomicsectorsandaspectsof
societallifearepredictedtobeduetotheBigDataera,but
mostimportantlythesewillbeshapedbydataanalyticsforthis
53
-MultilingualismisaanimportantfeatureofEurope,asit
meansthatthereisneedforthediversificationandlocalisation
ofthetoolsthatwillbeabletoprovideservicesatthesame
levelforthewholevarietyofEuropeanlanguagesand
customers.
Basedonanalysisofprevious
publications,reports,reportedlegal
andeconomicpractices.
52
53
http://www.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf
http://e-irg.eu/documents/10920/11274/annex_5.2_e-irg_white_paper_2013_-_final_version.pdf
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
32
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
2014
jobs.LisbonCouncil
54
SpecialBriefing. andaspectsofdatadriveneconomylike
betterconditionsfor
creatinganinclusive
economyforall
societymembers,
betterpotentialfor
cheaperindividual
education,and
trainingforadvanced
manufacturing.
-Theauthorsfocuson
the‘datagap’
betweenEuropeand
NorthAmerica.
-Strongfocusisgiven
tothediscussionon
policyregulationsthat
needtobeadjusted
acrosstheAtlantic,
bothintheEUandthe
USA.
MappingTextandData S.Filippov
MininginAcademicand
ResearchCommunitiesin
Europe.LisbonCouncil
55
SpecialBriefing. -Thisreportcontinues
theabove-mentioned
workongathering
informationonthe
importanceofTDMfor
theglobaleconomy
data.
-EuropeisbehindcountrieslikeSouthKorea,US,Canada,in
termsofdatause.
-Therearesuccessfulexamplesofopenaccesspolicy
introductionforpublicdata.InEstonia,thestatereleased
publiclyowneddataincludingutilitiesandcellphonenetworks
afterhavingstrippedprivateinformationandanonymization,
whichhadpositiveimpactonthedevelopmentofdata-driven
businesses.
-TheauthorsarguethatiftheEuropeandataprotection
statutesarenotlightened,thecostofdatausagewillonlyrise,
whichconsequentlywillwidenthegapbetweenEuropeand
theUSA,aswellastherestoftheworld.
Basedonanalysisofprevious
publications,reports,reportedlegal
andeconomicpractices,aswellas
interviews,andbibliographyand
patentbasedanalysison
ScienceDirectdatabase.
-Thedifferenceincopyrightaccessacrossthecountriesis
reflectedbythenumberofpublicationsontextanddata
miningsubjects.TheUSisleadingthefield,whiletheonly
EuropeancountryatthetopofthelististheUK,whichhasan
exceptionforcontentminingforacademicpurposes.
-ThenumberofpublicationsonTDMinEnglishisseveral
54
55
http://www.progressivepolicy.org/wp-content/uploads/2014/04/LISBON_COUNCIL_PPI_Bridging_the_Data_Gap.pdf
http://www.lisboncouncil.net/publication/publication/109-mapping-text-and-data-mining-in-academic-and-research-communities-in-europe.html
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
33
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
development,and
focusesonthereasons
whyEUlagsbehind
theUSandsomeAsian
countriesinthis
domain,and
hypothesiseshowthis
canchangeeither
positivelyor
negatively,depending
onthemeasures
taken.
2014
2014
Standardisationinthe
areaofinnovationand
technological
development,notablyin
thefieldofTextand
DataMining.Report
fromtheExpertGroup,
56
EuropeanCommission. ExpertGroup
Assessingtheeconomic
impactsofadapting
certainlimitationsand
exceptionstocopyright
andrelatedrightsinthe
EU.Analysisofspecific
57
policyoptions. CharlesRiver
Associates(CRA)
ordershigherthaninanyotherlanguages,covering97%ofthe
publications.
-TDMispatentedasalgorithmsintheUSandother
jurisdictions,whileinEuropeTDMispatentedasbeing
embeddedinthesoftware.Chinaisrisingstronglyinthis
marketintermsofpatents.
-Thisreportliststhestakeholdertypesandthelegaland
economicissuesthattheyencounterbothintheEUandinthe
restoftheworld.
-Theexpertshypothesiseonhowthevaluechainmightbe
adjustedwhenTDMisfullyincorporatedandlegallyallowed,
arguingthatitshouldhaveamorepositiveimpactoverall.The
dataiscategorisedaccordingtobothsizeandlegalaccess
pattern.
-IntroductionofIPexceptionsforthecaseofTDMforscience
shouldnotsignificantlyadverselyaffectrightholders’incentives
forcontentcreation,contentqualityandTDM-specific
investments.
I.Hargreaves,L.
Guibault,C.
Handke,P.
Valcke,B.
Martens,R.
Lynch,S.Filippov
Thisreportfocuseson
economicimpactsof
specificpolicyoptions
J.Boulanger,A. inseveraltopicsof
Carbonnel,R.De interest(digital
preservationofand
Coninck,G.
remoteaccessto
Langus.
56
57
http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf
http://ec.europa.eu/internal_market/copyright/docs/studies/140623-limitations-economic-impacts-study_en.pdf
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
34
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
culturalheritage,elendingforlibraries,
TDMforscientific
purposes),inviewof
providingpolicy
guidanceonthese
topics.
2014
Studyonthelegal
frameworkoftextand
58
datamining(TDM). DeWolf&
Partnersforthe
European
Commission
Thefocusisonthe
legalaspectsofTDM.
Dataanalysisisthemostencompassingtermandismore
preferableforuseinalegalcontext.
Classificationof4differentlevelsofaccess:“alltoall”(webdate)/“manytomany”(socialnetworks)/“onetomany”
(contractualclauses)/“onetoone”(confidentialagreements)
2014
TheDataHarvest:How
sharingresearchdata
canyieldknowledge,
59
jobsandgrowth. RDAEurope
Thereportfocuseson
themeasuresthat
shouldbetakenin
ordertosupport
researchusingBig
Data.
-Mainincentivesofthereportinclude:
--dataplanmanagement;
--promotionofliteracyintermsofalltheaspectsofBigData
useacrosssociety;
--developtoolsandpoliciesthatsupportdatasharing.
-Thisreportaimsto
providepolicymakers
andanalystswiththe
meanstocompare
2015
F.Genova,H.
Hanahoe,L.
Laaskonen,C.
Morais-Pires,P.
Wittenburg,J.
Wood
OECDScience,
TechnologyandIndustry
Scoreboard2015.
INNOVATIONFOR
Organisationfor
EconomicCooperationand
Development,
-Thisisabroadlyscopedreportthathassectionsthatdiscuss
trendsandfeaturesofknowledgeeconomies.
-ItgivesperspectivesontheoverallscientificexcellenceofEU
countriescomparedtotherestoftheworldacrossallfieldsof
58
59
http://ec.europa.eu/internal_market/copyright/docs/studies/1403_study2_en.pdf
https://rd-alliance.org/sites/default/files/attachment/TheDataHarvestFinal.pdf
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
35
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
2015
GROWTHAND
60
SOCIETY. Europe
Skillsofthedatavores.
TalentandtheData
61
revolution. Nesta,UK
J.MateosGarcia,H.
Bakhshi,G.
Windsor
economieswithothers
ofasimilarsizeorwith
asimilarstructure,
andtomonitor
progresstowards
desirednationalor
supranationalpolicy
goals.
-Themainfocusof
thisreportisonthe
knowledge-based
economies’
developmenttrends.
Introducesitsown
classificationofthe
companiesthatwork
withBigData,and
thenanalysesthe
typesofskillsdata
scientistsand
engineershaveto
learnandmasterto
boosttheeconomy
andscientific
development.
Theanalysisfocuses
ontheUKmarket
whilethecompany
science,andtheamountofinvestmentindifferentdomains,
withspecialhighlightsonthenewgenerationofICT-related,
BigDatabasedtechnologiesandpatents.
-Acrossmanymetrics,Europeancountrieswhencombinedare
comparabletotheresultsofUSA,Japan,China,SouthKorea.
InternalNESTAanalysisofthedata
analyticsmarketandhiring
strategiesandchallengesinthefield
thatarebasedonmorethan50indepthinterviews(inthefieldsof
manufacturing,retail,ICT,creative
media,financialservices,
pharmaceuticals),collaboration
withCreativeSkillset.
60
61
http://www.oecd.org/science/oecd-science-technology-and-industry-scoreboard-20725345.htm
https://www.nesta.org.uk/sites/default/files/skills_of_the_datavores.pdf
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
-Classificationofcompaniesconnectedtodataprocessing:
--‘Data-active’companiesthatareeitherdata-driven
(Datavores),workwithlargevolumesofdata(DataBuilders),or
combinedatafromdifferentsources(DataMixers),
--Dataphobes:othercompaniesthatrestrictthemselvesto
smallsizedatasets.
-Thesedifferenttypesofcompanieshavedifferentbehaviours
whilesettinguptheirrequirements,andthenhiringthedata
experts.Forexample,datavoresanddatabuildersrecruitmost
oftheircandidatesfromComputerSciences,Engineering,and
Mathematics,whileBusinessdisciplinesarethesourcefor
managementanalyticspositions.
36
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
classificationcanbe
usedglobally.
2015
BriefingPaper.Textand
DataMiningandthe
NeedforaSciencefriendlyEUCopyright
62
Reform. ScienceEurope
Summarisesthe
Basedonanalysisofprevious
argumentsinthe
publications,reports,reportedlegal
debateforcopyright
andeconomicpractices.
lawreformthatshould
ensurethatlegallyaccessedcontent
couldbefreelymined
withoutadditional
permissionandcost
withintheEU.
-EUregulationsfordatabaseprotection(Directive1996/9/EC)
seriouslyimpederesearcherswhoaimtoaccessthecontentto
beminedviaadatabase,asdownloadingasubstantialpartof
thedatabasecontentneedstobeauthorisedbythedatabase
owner,evenifnoneofthecontentisundercopyright.
-Scientificpublishersstarttointroducelicencesregulatingthe
miningofsubscribedcontent,whichfurtherlimitsthe
researchersintheirimmediateinteractionwiththecontent.
-ExceptionsincopyrightlawincountrieslikeUK,Japan,USA,
whichallowtheirrepresentativestoavoiddifficultieswiththe
legalityofcontentmining,whichputsEUresearchersata
disadvantage,andcomplicatespotentialforinternational
collaborations.
-Theauthorssuggestthefollowingstepsforresearch
organisations:
--AvoidrequestingthattheirresearchersabstainfromTDM;
--RefusetosignTDM-licensesaspartofsubscriptioncontracts,
ornegotiatemorefavourablecondition;
--Empowertheiremployeesinternallythroughtrainingand
TDMsupport,andexternallybyadvocatingcopyrightchanges
atnationalandEuropeanlevel.
2015
IsEuropefallingbehind
inDataMining?
Copyright’sImpacton
C.Handke,L.
Guibault,J.-J.
Vallbe
Thispaperfocuseson
thecopyright
arrangements
DemonstratesthatresearchersfromtheEU/EEAcountrieshave
weakerperformanceintermsofpublicationsinthegrowing
fieldofdatamining.Thisispartiallyduetothecomparatively
Theauthorsmeasureresearch
outputinthenumberof
publicationsinacademicjournal
62
http://www.scienceeurope.org/uploads/PublicDocumentsAndSpeeches/WGs_docs/SE_Briefing_Paper_textand_Data_web.pdf
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
37
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
DataMininginAcademic
63
research. 2015
ScienceInternational:
OpenDatainaBigData
64
World. impactingdatamining
byacademic
researchers.
International
Councilfor
Science(ISCU),
International
ScienceCouncil
(ISSC),The
WorldAcademy
ofSciences
(TWAS),
InterAcademy
Partnership(IAP)
articlesinThomsonReuters’sWeb
ofScience(WoS)database.
-Thisreportfollowsan Basedonanalysisofprevious
accordofInternational publications,reports,reportedlegal
Scientificorganisations andeconomicpractices.
andfocusesonthe
opportunitiesthatthe
BigDatareality
represents.
-Itdiscusseshowtext
anddataminingofthis
scientificcontentona
massscalecanhelp
whenthescience
addressesglobal
challengessuchas
infectiousdisease,
energydepletion,
migration,
sustainabilityandthe
operationoftheglobal
economyingeneral.
63
64
http://elpub.architexturez.net/system/files/15_HANDKE_Elpub2015_Paper_23.pdf
http://www.icsu.org/science-international/accord/open-data-in-a-big-data-world-short
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
strongcopyrightprotectionlawsinthesecountries.
-Unprecedentedopportunitiesforscientificdevelopmentin
currentdigitalagerelyon2pillars:BigData(largevolumesof
complexanddiversedatastreaminginrealtime)andLinked
Data(separatedatasetslogicallyrelatedtoaparticular
phenomenonallowcomputerstoinferdeeperrelationships).
-Forthescientists,theopenaccesstothescientificdataand
publicationsisamuch-neededimperative.Therefore,authors
insistonderegulatingopendataforscientificdatausage.Data
shouldbe“intelligentlyopen”,i.e.discoverableontheWeb,
accessibleandintelligibleforfurtheruse,andassessablein
termsofdataproducers’quality.Thisopennesswillalsohelp
replicabilityteststhattestandscrutinisethereportedresults.
-Theauthorsnotetheongoingdiscussionwhetherpublicdata
shouldbefreelyavailabletoeveryone,orjusttothenon-profit
sector.Theyarguethatitisprimarilynotappropriateor
productivetointroducethisdiscriminationindataaccess,and
moreimportantly,robustevidenceisaccumulatingtosupport
theclaimthatthisaccesstodataforeveryonebringsdiverse
benefits,andbroadereconomicandsocietalvalue.
-Theauthorsgiveexamplesofcross-countrycollaborationsin
openaccessinitiativesinSouthAmericaandAfrica,while
notingthatsharingdataallowsmoreuniformdevelopmentof
technologiesandtheirimplementationonaglobalscale.
38
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
2016
TextandDataMining:
TechnologiesUnder
65
Construction. Outsellmarket
performance
report,USA
D.Auclair
2016
BigData.AToolfor
InclusionorExclusion?
Understandingthe
66
Issues. FederalTrade
Commission
(FTC)Report,
USA
E.Ramirez,J.
Brill,M.K.
Ohlhaussen,T.
McSweeny
-Focusofthisreportis
onTDMaspectsfor
scientificresearch,
healthcare,and
engineering.
Theauthorsbasethereporton
approximately20in-depth
interviewswitharangeof
stakeholderslikecontentvendors
andtechnologyandtoolproviders,
aswellasonpreviouslypublished
information.
-Theauthorshighlightthefactthatdatamininghasbeenused
longerformarketresearchandcommercialactivities,as
correspondingcommercialwebsitescontainlargequantitiesof
structureddata,whileacademiaandindustryhavetodealwith
highvolumeofunstructuredtextsanddata,whichmadetheir
progressinuptakeslower.
-ThereportprovidesalistofprovidersofTDM-related
functionsandexamplesoftypesofbusinessthattheyare
serving.
-Theauthorslistanumberofactionsforcontentprovidersand
TDMtoolvendorstofollow,thatcouldhelpthemtobenefit
fromTDMuptakeinthenearfuture:
--TDMsolutionsshouldbescalable
--Storeddatashouldfollowthestandardsatcollectionand
storagestages.
--Privacyandsecurityregulationsaretobeensured.
-Thisreportargues
thatitisnota
questionofwhether
companiesshoulduse
BigDataanalytics,but
howtoitthat
correctly,minimising
legalandethicalrisks.
-Highlightsneedto
identifyandavoid
potentialbiasesand
inaccuraciesinTDM
FTCheldapublicone-dayworkshop
‘BigData:AToolforInclusionor
Exclusion?’Inordertogeta
feedbackfrominvolved
stakeholdersonthepotentialofBig
Dataintermsofcreated
opportunitiesforconsumersand
policyissues,namelysecurity
breaches,andbiasesand
inaccuraciesinproper
population/eventsrepresentationin
thedata.
-Thereportincludesalistofquestionsthataretobe
consideredandinvestigatedwhencompaniesareusingBig
Dataanalyticsintheirbusinessmodels.Thisprocedurecould
helpthemtoavoidviolatinglawsintermsofdiscrimination,
andtopreventerrorsthatmightoccurduetotrainingdataset
inconsistenciesthatcouldleadtopotentialbiasesinoffersand
services.
Listofquestions:
-Howrepresentativeisyourdataset?
-Doesyourdatamodelaccountforbiases?
-HowaccurateareyourpredictionsbasedonBigData?
-DoesyourrelianceonBigDataraiseethicalorfairness
65
66
http://www.copyright.com/wp-content/uploads/2016/01/Outsell-Market-Performance-22jan2016-Data-and-Text-Mining-Licensed.pdf
https://www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
39
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
concerns?
applications,sothat
theBigDataisusedto
createnovel
opportunitiesfor
societal
improvements,and
especiallytosupport
low-incomeand
underserved
communities.
2016
EuropeanParliament
resolutionon‘Towardsa
thrivingdata-driven
67
economy’. EUParliament
J.Buzek
onbehalfofthe
Committeeon
Industry,
Researchand
Energy
Thisdocumentaimsto
structurethe
knowledge-based
economylandscape,
withafocusonthe
EU,andtooutline
whatareurgent
measuresneedtobe
takeninordertobe
theleadersinthefield
ataglobalscale.
Basedonanalysisofprevious
publications,reports,andinternally
producedfinancialestimationsof
themarketpotential.
67
http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
-ThisresolutionoutlinesBigDataanddata-driveneconomy
potentialtill2020,withafocusontheEuropeanissuesand
needs.
-AnnouncementoftheEuropean‘FreeFlowofData’initiative
-StressestheneedforamodernICTinfrastructureata
Europeanscale.
40
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
5.CONCLUSION
Theuptakeoftextanddataminingisgrowingacrossallsectorsofknowledge-basedeconomies,as
thestrengthsofTDManditseconomicadvantages,suchastimesavingandscaling,areincreasingly
wellrecognised(ICSU,ISSC,TWAS,IAP,2015),(AuclairD.,2016).WhilethedevelopmentofTDMby
the larger IT companies is directed at global markets, there is a long tail of TDM research and
development in industry aimed at specific markets and niches. Most TDM industries makes use of
provenTDMtechnologies,suchassupervisedmachine-learningalgorithmsforclassificationorrisks
assessment(Ittooetal.,2016),(FleurenandAlkema,2015),howeverthereisahealthyrelationship
between industry and academic research on the development and uptake of new methods from
fieldslikeArtificialIntelligence(e.g.deeplearningalgorithms).
Ingeneral,industrieshavenotroublefindingaccesstodata,andmanagetouseTDMforeconomic
profit. However, different types of barriers still occur. If barriersto the availability of full texts and
datasetsfromacademicdomainsandpublicsectorinformationwerelifted,thefirstapplicationsthat
couldbemadepossiblewouldfinallyreviveoldvisionsfromthesecondhalfofthe20thcenturyon
accomplishing automatic unconstrained scientific discovery from observational data (Simon et al.,
1981),(Simonetal.,2007).
Large-scaleglobalcompanieslikeAlphabet(Google),IBM,andMicrosoft,currentlyinvestheavilyin
TDM algorithms and infrastructure development, as much of the value they accrue nowadays is
leveragedfromthedataprovidedtothembytheircustomers,incombinationwithallthecrawlable
content on the web. For companies and research institutes based in the European Union, a
competitivestrategynecessarilyinvolvesawideruptakeofTDM,basedonabetterawarenessofthe
potentialofTDMacrossalleconomicsectors.
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
41
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
REFERENCES
Agrawal, R., Imielinski, T., Swami, A., 1993. Mining association rules between sets of items in large
databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on
ManagementofData.ACM,Washington,D.C.,USA,pp.207–216.
Auclair D., 2016. Text and Data Mining: Technologies Under Construction. (Outsell market
performancereport).
Baeza-Yates, R.A., Ribeiro-Neto, B., 1999. Modern Information Retrieval. Addison-Wesley Longman
PublishingCo.,Inc.,Boston,MA,USA.
Cohen, W.W., 1995. Learning to classify English text with ILP methods. Advances in inductive logic
programming32,124–143.
DiakopoulosN.,2014.Algorithmicaccountability.Journalisticinvestigationofcomputatiionalpower
structures.DigitalJournalism3,398–415.
Filippov,S.,2014.MappingTextandDataMininginAcademicandResearchCommunitiesinEurope.
FlaounasI.,AliO.,Landsdall-WelfareT.,DeBieT.,MosdellN.,LewisJ.,CristianiniN.,2013.Research
methodsintheageofdigitaljournalism.Massive-scaleautomatedanalysisofnews-contenttopics,style,gender.DigitalJournalism1,102–116.
Fleuren,W.W.M.,Alkema,W.,2015.Applicationoftextmininginthebiomedicaldomain.Methods
74,97–106.doi:10.1016/j.ymeth.2015.01.015
Hargreaves I., Guibault L., Handke C., Valcke P., Martens B., Lynch R., Filippov S., 2014.
Standardisationintheareaofinnovationandtechnologicaldevelopment,notablyinthefield
ofTextandDataMining.ReportfromtheExpertGroup.
Hearst, M.A., 1999. Untangling Text Data Mining. College Park, Maryland, Proceedings of the 37th
Annual Meeting of the Association for Computational Linguistics on Computational
Linguistics(ACL’99)3–10.
ICSU, ISSC, TWAS, IAP, 2015. Science International: Open Data in a Big Data World. International
Council for Science, International Science Council, The World Academy of Sciences,
InterAcademyPartnership.
Ittoo, A., Nguyen, L.M., van den Bosch, A., 2016. Text analytics in industry: Challenges, desiderata
andtrends.ComputersinIndustry.doi:10.1016/j.compind.2015.12.001
Joachims,T.,1998.TextCategorizationwithSuportVectorMachines:LearningwithManyRelevant
Features,in:Proceedingsofthe10thEuropeanConferenceonMachineLearning.SpringerVerlag,pp.137–142.
Lewis,D.D.,1998.Naive(Bayes)atForty:TheIndependenceAssumptioninInformationRetrieval,in:
Proceedingsofthe10thEuropeanConferenceonMachineLearning.Springer-Verlag,pp.4–
15.
Lewis S.C., Westlund O., 2014. Big data and journalism: Epistemology, expertise, economics, and
ethics.DigitalJournalism3,447–466.
Mayer-Schoenberger,V.,Cukier,K.,2013.BigData:ARevolutionThatWillTransformHowWeLive,
WorkandThink.JohnMurrayPublishers.
MotionforaresolutionfurthertoQuestionforOralAnswerB8-0116/2016pursuanttoRule128(5)of
theRulesofProcedureon“Towardsathrivingdata-driveneconomy”(2015/2612(RSP),2016.
OCDE, 1996. The knowledge-based economy. Organisation for Economic Co-operation and
Development,Paris.
RamirezE.,BrillJ.,OhlhaussenM.K.,McSweenyT.,2016.BigData.AToolforInclusionorExclusion?
UnderstandingtheIssues.(FederalTradeCommission(FTC)Report).
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940
42
D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE
Simon, H.A., Langley, P.W., Bradshaw, G.L., 1981. Scientific discovery as problem solving. Synthese
47,1–27.doi:10.1007/BF01064262
Simon,Uren,V.,Li,G.,Sereno,B.,Mancini,C.,2007.Modelingnaturalisticargumentationinresearch
literatures:Representationandinteractiondesignissues:ResearchArticles.Int.J.Intell.Syst.
22,17–47.doi:10.1002/int.v22:1
Swanson, D.R., 1986. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect
BiolMed30.
vanHaagen,H.H.H.B.M.,’tHoen,P.,Bovo,A.B.,deMorrée,A.,vanMulligen,E.,Chichester,C.,Kors,
J.,denDunnen,J.,vanOmmen,G.J.B.,vanderMaarel,S.,Kern,V.M.,Mons,B.,Schuemie,
M., 2009. Novel protein-protein interactions inferred from literature context. PLoS ONE 4.
doi:10.1371/journal.pone.0007894
©2016FutureTDM|Horizon2020|GARRI-3-2014|665940