Annex 4.4.2
Board meeting 2/12-13, March 13th, 2013
MEMORANDUM
To: IFRRO Board
From: IFRRO General Counsel
Re: Text and Data Mining
Date: 12 February 2013
A. BACKGROUND
New application areas for text and data mining (TDM) are emerging outside the traditional life
sciences and drug discovery, in the social sciences, humanities, business and marketing. TDM
appears to occupy an important niche within the text analytics field, and the licensing of “data”
has apparently become an increasingly important issue in science.[1]
So far, IFRRO members CCC and STM, and the UK publishers’ organisations PLS and PA, are
active in the area of TDM. These activities are outlined in more detail below.
a. Definitions
“Text and Data Mining” is used mostly as a collective term to describe both text mining and data
mining. However, there is no universally agreed definition, partly because it is being used by
different communities for different purposes. At the outset, it seems to be helpful to distinguish
between text mining as the extraction of semantic logic from text, and data mining as the discovery
of new insights.
(i) Data Mining
It appears that data mining is an analytical process that looks for trends and patterns in datasets that
reveal new insights, which are implicit, previously unknown and potentially useful pieces of
information. It is the extraction of trends and patterns from data.[2]
(ii) Text Mining
On the other hand, it appears that text mining is the extraction of meaning from a body of text.
Generally, text mining is seen as the indexing of content.[3] It has also been defined as “analysis of
data contained in natural language text”[4], or described as: “Text mining, roughly equivalent to text
analytics, refers to the process of deriving high-quality information from text. High-quality
information is typically derived through the devising of patterns and trends through means such as
statistical pattern learning.”[5]
(iii) Text and Data Mining
The difference between text mining and data mining is somewhat blurred when statistical analysis is
used to extract meaning from the text. One could argue that, from a computer’s point of view, text
mining and data mining are very similar. In the agreement between STM, PDR and ALPSP, the
definition includes both: “Text and Data Mining (TDM): download, extract and index information
from the Publisher’s Content to which the Subscriber has access (…).”[6]
[1] http://pantonprinciples.org/ and http://www.isitopendata.org/
[2] Jonathan Clark, Text Mining and Scholarly Publishing, Publishing Research Consortium 2012, page 19.
[3] Jonathan Clark, Text Mining and Scholarly Publishing, Publishing Research Consortium 2012, page 5.
[4] Definition provided by Roy Kaufmann (CCC) during the CCC/ALPSP TDM webinar on 11 December 2012.
[5] http://en.wikipedia.org/wiki/Text_mining
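To make the “extract and index” wording of the definition more concrete, the following is a minimal sketch of the indexing step only. It assumes the texts are already lawfully available as plain strings; the sample texts, the function name build_index and the simple tokenisation rule are illustrative assumptions and do not describe any publisher’s or licensee’s actual system.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Very simple tokenisation: runs of ASCII letters only (an assumption).
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    return index


# Hypothetical sample texts standing in for content the user may access.
documents = {
    "doc1": "Aspirin reduces fever and inflammation.",
    "doc2": "Fever is a common symptom of infection.",
}

index = build_index(documents)
print(sorted(index["fever"]))    # documents mentioning "fever"
print(sorted(index["aspirin"]))  # documents mentioning "aspirin"
```

An index of this kind records where terms occur; the statistical analysis that blurs the line between text mining and data mining would be run on top of it.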
b. Examples of Text Mining
Some examples of text mining users are the following websites:
• CiteXplore – EBI/UKPMC[7]
• ChemSpider[8]
• SureChem (http://www.surechem.com)[9]
• BrainMap.org[10]
• Relay Technology Management Inc. (http://relaytm.com)[11]
Text mining of scholarly content:
See figure 2, from:
http://www.jisc.ac.uk/media/documents/publications/reports/2012/value-text-mining.pdf
B. LEGAL LANDSCAPE
There are legal uncertainties around text mining, and there is no consensus on how to best deal with
them. Some perspectives from the UK, US and EU are outlined below.
a. UK
Hargreaves Report[12]
• Recommended TDM exception to copyright
• Is it copyright?
• Technology, access, security, privacy
UK Parliamentary Business Innovation and Skills Committee Report June 2012[13]
• Encourages licenses
• Encourages publishers to develop business models
[6] http://www.stm-assoc.org/2012_09_12_PDR_ALPSP_STM_Text_Mining_Press_Release.pdf
[7] http://www.ebi.ac.uk/literature/trainees/citexplore.html
[8] http://www.chemspider.com/
[9] SureChem is a search engine for patents that allows chemists to search by chemical structure, chemical name, keyword
or patent field. It is looking to add other sources of data, for instance journal articles, and to extend into biology, and
perhaps further (“Take My Content Please!”, Nicko Goncharoff, http://river-valley.tv/take-my-content-please-the-service-based-business-model-of-surechem/).
[10] BrainMap is a database of published functional and structural neuroimaging experiments. The database can be
analysed to study human brain function and structure.
[11] Relay Technology Management Inc. is a company that uses text mining to create information products for
pharmaceutical and biotech companies.
[12] http://www.ipo.gov.uk/ipreview-finalreport.pdf
[13] http://www.publications.parliament.uk/pa/cm201213/cmselect/cmbis/367/367.pdf and
http://www.publications.parliament.uk/pa/cm201213/cmselect/cmbis/367/367vw.pdf
JISC report[14]:
• Limited uptake of TDM within UK universities
• A lack of skilled staff
• High transaction and entry costs
• Recommended working with publishers, technology service providers and other key stakeholders
• Explore the technical requirements for optimal provision of text mining infrastructure services
• Focus on interoperability and metadata standards
The UK Hargreaves report[15] recommended that text and data mining be exempted from UK
copyright. However, it is questionable whether an exception would indeed remove the legal
uncertainties, as claimed in the Hargreaves report.[16]
The UK Government’s White Paper, Modernising Copyright, published on 20 December 2012,
states that the Government will amend the law “(…) so that it is not an infringement of copyright
for a person who already has a right to access a work (whether under a licence or otherwise) to
copy the work as part of a technological process of analysis and synthesis of the content of the
work for the sole purpose of non-commercial research. This will enable key research without
undermining publishers’ control over IT systems or commercial exploitation.
A licence governing access to a work will not be able to prevent or restrict use of the work in
accordance with this exception, but it may impose conditions of access to the licensor’s computer
system or to third party systems on which the work is accessed. Therefore the exception will not
prevent a publisher from applying technological measures on networks required in order to
maintain security or stability, or from licensing higher volumes of access to research outputs at an
additional cost. To the extent that technological measures prevent a researcher benefiting from this
exception, they will be able to appeal to the Secretary of State.
This measure will not provide a “right to data mine” works to which the researcher does not
already have a right of access, and will not cover data mining for commercial purposes. This is
consistent with the principles of the Finch Review of Open Access to publicly funded research,
which concluded earlier this year”.[17]
OVERVIEW: UK Copyright and Text Mining
Hargreaves report, May 2011
“According to the Wellcome Trust, 87 per cent of the material housed in UK’s main medical
research database (UK PubMed Central) is unavailable for legal text and data mining.”
http://www.ipo.gov.uk/ipreview-finalreport.pdf
[14] http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx
[15] Digital Opportunity, A Review of Intellectual Property and Growth, An Independent Report, Prof. Ian Hargreaves,
May 2011, http://www.ipo.gov.uk/ipreview-finalreport.pdf.
[16] The Value and Benefits of Text Mining, JISC, http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx.
[17] http://www.ipo.gov.uk/response-2011-copyright-final.pdf
The Guardian, 23 May 2012: “What do publishers have against this hi-tech research tool?”
“[c]ountless ... academics are prevented from using the most modern research techniques because
the big publishing companies such as Macmillan, Wiley and Elsevier, which control the distribution
of most of the world's academic literature, by default do not allow text mining of the content that sits
behind their expensive paywalls.”
http://www.guardian.co.uk/science/2012/may/23/text-mining-research-tool-forbidden
JISC text mining report, March 2012: “Legal uncertainty, inaccessible information silos, lack of
information and lack of a critical mass are barriers to text mining within UKFHE.”
“The UKFHE sector collaborates with content publishers and service providers to explore potential
new business models and innovative text mining services that meet the sector’s requirement”.
[Recommendation 2, p.5]
http://www.jisc.ac.uk/media/documents/publications/reports/2012/value-text-mining.pdf
The Business Innovation and Skills Committee in the report of its inquiry The Hargreaves Review
of Intellectual Property: Where next? recommends: “We believe that publishers should seek rapidly
to offer models in which licences are readily available at realistic rates to all bona fide licensees
and we encourage the Department to promote early development of such models.” (#65)
The UK IPO in its Consultation on Copyright stated that: (#7.96) “The Government proposes to
make it possible for whole works to be copied for the purpose of data mining for non-commercial
research.” However the BIS Committee concluded that “we believe that content mining should be
opened up by way of managed but nevertheless accessible licensing processes.” (#64)
Dame Janet Finch in her Report on how to expand access to research publications makes this
appeal to publishers: “Subject to any legislative changes following the Hargreaves review, all
publishers will have to consider what arrangements they will put in place to make their content
available for text and data mining.” [#9.26, p.106]
b. US
HathiTrust Litigation[18]
• “The search capabilities have already given rise to new methods of academic inquiry such as text mining”
• “Plaintiffs also argue that non-consumptive research such as text mining causes harm (…) because authors [sic] might one day pay for licences.”
• Argument deemed speculative
• Court concludes “no CCC licence”
c. EU
• European Commission launched a stakeholder dialogue on TDM in early 2013[19]
• CFC (Sandra Chastanet) and PLS (Sarah Faulder) as participants, and the IFRRO Secretariat (Olav Stokkmo, replaced by James Boyd at the first meeting) as observer, are represented in Working Group 4 (Text and data mining for scientific and research purposes), launched in Brussels on 4 February 2013
[18] http://www.publishersweekly.com/pw/by-topic/digital/copyright/article/54321-in-hathitrust-ruling-judge-says-google-scanning-is-fair-use.html
[19] http://europa.eu/rapid/press-release_MEMO-12-950_en.htm#PR_metaPressRelease_bottom
C. THE CCC PILOT: ADVANTAGES, DISADVANTAGES AND SOME OTHER CONSIDERATIONS
Following CCC’s pilot project, advantages, disadvantages and other considerations with respect to
TDM were outlined in the webinar “Content Data and Text Mining: From Containers to Enhanced
Research Tools” (11 December 2012).
Some aspects of the presentations by CCC and ALPSP (Association of Learned and Professional Society
Publishers), and of the related discussions at the webinar in December 2012, are set out below.
a. Opinion of publishers
Scholarly publishers have been aware for some time of the rising market demand for text mining of
their publications. The industry is working to streamline the means of meeting that demand. In her
report for the Publishing Research Consortium, Journal Article Mining, Eefke Smit
summarised practices, policies and plans at the time of publication in May 2011. Some of her
findings are highlighted below:
• “Publishers are relatively liberal in granting permission: over 90% grant research-focused mining requests, 60% in most or all cases, 33% for some cases. 32% allow any kind of mining without permissions needed. 68% of publishers consider mining requests on a case by case basis. More than 80% require information on intent and purpose.”[20]
• A total of 32% of publishing respondents allow any and all kinds of mining without permissions needed, including the 28% who have an Open Access policy for this.
• 69% of publisher respondents consider mining requests on a case by case basis, 14% have a formal policy that is publicly stated, 28% have no general policy, 21% are formulating a policy.
• When permission is requested, 35% of publisher respondents generally allow mining in all or the majority of cases, another 53% in some cases. More than 80% require information on intent and purpose for all or most cases.
• 53% of publisher respondents will decline mining requests if the results can replace or compete with their own products and services.[21]
According to Jonathan Clark,[22] a great challenge for publishers also seems to be the creation of an
infrastructure that makes their content more machine-accessible and that also supports everything
text-miners or computational linguists might want to do with the content.
b. Obstacles
According to CCC, the main obstacles holding back TDM can be grouped into three categories:
• Technical issues: lack of common formats and interoperability, and also lack of agreement on access/authentication arrangements.
• Licensing arrangements: current requirements for users to negotiate separately with multiple publishers, together with some legal uncertainties; lack of cross-publisher cooperation, incl. on technical formats.
• Business models and market development: no clear view of the value of TDM or how to measure it, the lack of (established) business models or pricing models for TDM, uncertainty and fear (primarily on the part of publishers), limited awareness and use of TDM outside pharma.
[20] http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf
[21] http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf
[22] Jonathan Clark, Text Mining and Scholarly Publishing, Publishing Research Consortium 2012.
c. Prospects of success
In CCC’s view, the main items that would accelerate the development of TDM are:
A need for a central broker, with two distinct roles:
• A rights clearance or licensing role;
• A technical function that might include normalising the data or (more ambitiously) creating a central TDM marketplace and database, hosting standardized XML content and providing access for mining.
Fixed-term licenses (allowing unlimited access for mining while in force) might simplify things for
the user, while allowing publishers to revert to alternative arrangements if necessary.
• Agreement on standard content formats is important;
• Maintaining momentum and urgency.
d. Further considerations
• For licensed content, pharma and academic users argue that the right to mine content to which they have paid-for licensed access (or which is freely available on the web, e.g. in repositories) should be included as standard within the license agreement.
• Publishers generally want individual discussions because business models have not yet been formalised, and the technical framework needs to be established.
• New services, new costs.
• Unlicensed content opportunity for both ends.
e. Solution offered by CCC
In CCC’s opinion, an intermediary organisation between users and publishers could provide a viable
platform for addressing the aforementioned TDM issues and could facilitate a conversation between
both parties to seek a middle ground.
Against this background, Reed Elsevier is setting up a pilot automated licensing system: researchers
at institutions involved in the pilot will have access to a self-service process that gives access to their
institution’s subscribed Elsevier content through APIs (Application Programming Interfaces).[23] It will not
be necessary to consider requests on a case-by-case basis. The bulk of requests will be considered
pre-approved, an automatic licence generated, and access provided through the automated system.
[23] The Elsevier Article API facilitates search and access to scientific journals and scientific articles. The API provides
web services for searching for journals, journal volumes, specific issues, articles, and article images. The Article and
Article Image specific API interactions provide access to the full-text article XML (and the associated images) and
enable a mash-up developer to render the returned article in customizable formats.
The CCC Pilot includes both commercial and non-commercial uses for the following users and
publishers:
• Users
  - Bio Medical
  - Chemical
  - Marketing
• Publisher
  - Social Sciences
  - Bio Medical
  - Physical Sciences
  - Plant Sciences
f. What users want
In CCC’s view, users want:
• Flexibility:
  - Obtain a license to access the database
  - Secure that license without interrupting their workflow
• Confidentiality:
  - Users’ queries for licensing content should be kept confidential
  - The minimum amount of information should be obtained for licensing purposes
• Control:
  - Define the TDM algorithms and services involved
  - Define the objective of the mining
  - Retain access to the outcomes of the TDM activities
g. Main advantages for users
CCC assumes that the main advantages for users include:
• A single centralised point of licensing across rightsholders
• A check of existing license coverage and the ability to purchase new licenses when needed
• A common format for all related content regardless of the content’s origin
• A set of discovery tools and metadata descriptors for all content across publishers
• An API and seamless access to all content to be mined
h. Main advantages for publishers
From CCC’s perspective, the main advantages for publishers are as follows:
• Elimination of the need for rightholders to standardise their format
• Flexible licensing of content for text and data mining
• Flexible pricing of content
• Visibility into the aggregated data related to data mining
• Ability to avoid individual negotiations
CCC PILOT – User benefits:
• Flexible licensing – timely access
• Confidential access (essential for pharma)
• Single point of content access and delivery
• Standardised content format across publishers
CCC PILOT – Outcomes:
• No need for bilateral licensing and individual negotiations
• Develop new business models for content access (e.g. unsubscribed content)
• Potential for extension into the academic research space – solves further significant access issues
D. TDM SERVICES OFFERED (JOINTLY) BY UK PA, PLS, CCC, CROSSREF AND STM
Other IFRRO members are also (jointly) developing model licences for publishers to meet the
text mining needs of researchers. For instance, STM has prepared a sample STM-PDR model
licence for pharma,[24] PLS is offering its services as a clearing house for requests (single point of
contact per project; commonly agreed terms), and the UK PA is co-ordinating a cross-industry effort
to provide users with a click-through licence.
On 19 December 2012, EMMA, ENPA, EPC, FEP and STM co-hosted a “Mini-Seminar on Text
and Data Mining” at the European Commission’s premises. The seminar was well attended, inter alia
by Commission representatives from DG MARKT, DG CULTURE and DG CONNECT.
Presentations were given by, inter alia, Jonathan Clark (author of a guide on TDM), Maximilian
Haeussler (researcher at UC Santa Cruz), Eefke Smit (STM), Andrew Hughes (NLA/PDLN), Sarah
Faulder (PLS), and representatives from Springer, Elsevier and Wiley-Blackwell.
a. PLS
Work has begun to establish a Clearing House for TDM permissions at the PLS, based on an
enhancement of its existing rights database, PLSe. According to PLS, this could act as an entry point
for researchers wanting to mine journal content. Appropriate rightholders would be identified on
behalf of the researcher and the necessary permissions facilitated. Once content to be mined has
been specified and rightholders to that content have been identified, then, subject to licences,
protocols are needed to verify the permissions that enable mining tools to be applied to full text
articles on the publishers’ platforms.
PLS plans to develop licences that would support smaller publishers not in a position to negotiate
their own licences directly. PLS recommends that the first step for a publisher who wishes to make
content available for text mining is to decide the terms and conditions under which they will do so.
This will be governed by whether the purpose is commercial or non-commercial.
[24] http://www.stm-assoc.org/text-and-data-mining-stm-statement-sample-licence/
It is not always clear, however, who the rightholder is, nor how to contact them to seek permission.
Several organisations, including PLS, CCC and CrossRef, are working to enable services in this area.
As a rightholder, the publisher must give permission for text mining. This can be done in a number
of ways. Permission can be included in an access licence agreement with, for instance, an institution.
STM has produced a model clause for this purpose.[25] Some publishers have established a process for
individual researchers to obtain permission to text mine with some restrictions,[26] while others do not
support text mining yet. Some organisations, such as PubMed, allow unrestricted text mining without
permission. The Pharma-Documentation-Ring (P-D-R) recently updated their sample licence to
grant text and data-mining rights for the content to which each of the P-D-R members subscribe.[27]
Researchers may need to track down and contact ‘potentially hundreds’ of publishers for permission
to mine their text (permissions are not required to mine data per se). Connecting researchers to
rightholders could be a task for RROs. The solution envisaged by PLS, to be fully functional by
mid-2013, is a single discovery portal (in order to find the appropriate publishers and route
permission requests to the relevant person in the publishing house). With the PLS database, PLS is
developing a clearing house for researchers and a licensing service for the long tail of publishers.
b. UK PA
The UK PA is aiming to convene a cross-sector working group comprising researchers, funders,
technology providers, and publishers to develop a set of principles for a standard ‘click-through’
licence that meets the needs of both researchers wanting to use mining tools and publishers willing
to grant user rights to their content. It follows that, in order to develop such a licence, a mutual
understanding of needs and an active dialogue are needed between the two communities, researchers
and publishers.
To streamline permissions transactions even further, especially across multiple smaller publishers, a
collective licence might be developed, potentially in collaboration with PLS and CLA. A collective
licence could also be of value for publishers with less text to license, who may find it a more cost
effective solution than managing their own permissions bilaterally.
c. CrossRef, CCC and STM
Having set up a Clearing House permissions service via PLS, and a group to develop a ‘click-through’
licence for the application of mining tools, publishers are currently exploring the means to
enable text mining itself by using enhancements to existing technology. Several publishers and
organisations are looking at this or planning working pilots, including CrossRef, an independent
membership organisation aiming to promote the development and cooperative use of new and
innovative technologies to speed and facilitate scholarly research, and CCC.
[25] STM Statement on Text and Data Mining and Sample Licence, http://www.stm-assoc.org/text-and-data-mining-stm-statement-sample-licence/
[26] See, for example, Elsevier, http://www.elsevier.com/editors/open-access/open-access-policies/content-mining-policies; Springer: http://www.springeropen.com/about/datamining/.
[27] http://www.p-d-r.com/content/press_releases/archive/2012/
CrossRef is potentially well-positioned to provide solutions to most of the logistical and technical
problems that have been identified by both publishers and researchers. By leveraging existing
CrossRef and publisher infrastructure, with modest development efforts, it should be possible to
establish an automated, centralised and efficient mechanism to allow researchers and publishers to
agree to the terms of a standard text mining licence and to enable a standard cross-publisher
mechanism for identifying and retrieving the full text of journal articles for text mining purposes.
CCC, in cooperation with STM, brought together users, publishers, and technology companies to
explore the state of text and data mining for scientific publications and journals. The participants
explored the key drivers and hindrances for TDM. Following that event, CCC convened a group of
publishers and users from the US, UK, and Europe, in order to create a working pilot.
CCC’s TDM system is being built for the purpose of facilitating proper discovery of, and efficient
access to, high quality articles while respecting the rights of publishers who create and manage
content and databases. The key goals are:
i. to eliminate the burden of multiple formats for users and relieve publishers of the
responsibility of normalization of content and data,
ii. to provide one-stop clearing of rights and/or access to content for each TDM project, by
providing appropriate licenses on behalf of many different publishers, and
iii. to generate royalties for rightholders whose content will be used for the purposes of TDM.
Providing users with access to both subscribed and unsubscribed content for mining purposes is a
key deliverable of the CCC project, and one which has been broadly accepted by both publishers
and users.
To this end, the Publishing Research Consortium, a collaboration of publisher associations that
supports research into scholarly communication in order to enable evidence-based discussion, has
commissioned a Guide to Text and Data Mining (apparently not yet published) in order to provide
practical guidance on the aims, methods, outputs, and rationale for text mining and also some insight
into the technical implications and surrounding issues affecting publishers and their readership.
The Pharma-Documentation-Ring (P-D-R) sample license has been updated to grant text and
data-mining rights to use the content to which each of the P-D-R members subscribes. The P-D-R sample
license serves as a benchmark used by P-D-R’s members to negotiate individual subscription
agreements with publishers and other content suppliers.[28] The text of the clause reads:
“Text and Data Mining (TDM): download, extract and index information from the Publisher’s
Content to which the Subscriber has access under this Subscription Agreement. Where required,
mount, load and integrate the results on a server used for the Subscriber’s text mining system and
evaluate and interpret the TDM Output for access and use by Authorised Users. The Subscriber
shall ensure compliance with Publisher's Usage policies, including security and technical access
requirements. Text and data mining may be undertaken on either locally loaded Publisher Content
or as mutually agreed.”
[28] The agreement was reached between P-D-R, an association of twenty-one pharmaceutical companies, ALPSP and
STM. See also: http://www.stm-assoc.org/2012_12_11_STM_Report_2012.pdf
E. AUTHOR INVOLVEMENT
So far, to our knowledge, the rightholder category involved in (project-related) work with respect to
text and data mining is (mainly) publishers. Authors do not seem to have been involved yet, but
we assume that text mining also concerns authors, for instance as regards unpublished
material/manuscripts. The character of the use, as large-scale subsidiary usages of multiple works by
multiple rightholders, combined with the potential involvement of both authors and publishers, makes it
appropriate to consider collective management of rights. It is therefore relevant for RROs to
contemplate whether to offer their services to the rightholders in relation to TDM.
F. OPPORTUNITIES FOR RRO INVOLVEMENT
a. Understanding the needs of users/researchers
Broadly speaking, there are (so far) four main reasons for users to embark on text mining: to
enrich the content in some way; to enable systematic review of literature; for discovery; or
for computational linguistics research.[29]
Against this background, RROs could consider offering services to authors and publishers in
relation to TDM. Their involvement could contribute to the removal of potential friction between
TDM users and rightholders by handling payments and offering single licences, in particular given
the demand for a central broker and fixed-term licenses.
Managed licensed access can deliver benefits for researchers irrespective of any legislation, which
will not in itself resolve the significant technical issues involved. Therefore, streamlining the means
to enable text mining will be essential. To achieve this, a deep understanding of the needs of
researchers and content miners will be required, and collaboration will be needed between
stakeholders across the sector.[30]
b. Making text mining work on commercial platforms
• Text/data mining applications, including previous examples, are often research project- or research-specific and not always attractive to commercial publishing platforms and their customers
• Value to the non-expert can be limited
• “Articles of the future”[31] and “Adventures in semantic publishing”[32] not widely implemented yet
• A solution for medical case reports in journals?[33]
c. The need for a standardised ‘click-through’ licence
Once the content to be mined has been sufficiently specified so that rightholders of that content can
be identified and approached, and once the necessary permissions have been sought and granted, those
permissions still need to be consolidated into some form of licence. Work to develop model clauses
that multiple publishers can use and adapt for their own purposes in individual licences has been in
hand for some time, but so far this work has been principally applied to negotiations between the
major STM publishers and the commercial pharmaceutical industry. The licence used there may not
be appropriate for non-commercial transactions between academic researchers and the broader range
of journal publishers. Ideally, some form of standardised ‘click-through’ licence is needed to link up
the process of verifying and granting permissions and the process of enabling access for the
application of mining tools to full text published articles on the publishers’ (or authors’) platforms.
[29] Jonathan Clark, Text Mining and Scholarly Publishing, Publishing Research Consortium 2012, page 7.
[30] http://text.soe.ucsc.edu/progress.html
[31] http://www.articleofthefuture.com/
[32] http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000361
[33] See also: www.casesdatabase.com
One way would be a clearing house for permissions that provides a single point of contact for
researchers and rightholders. Ideally, it would operate with a standard ‘click-through’ licence. A
machine-readable licence would enable every article with a defined identifier, for instance a DOI, to
have the licence associated with it, which would greatly simplify the whole process. A researcher
would accept the licence and receive a certificate that would work across all content.
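As an illustration of what associating a machine-readable licence with an article identifier could look like, the sketch below keeps a small registry that maps DOIs to licence records and answers the question “may this article be mined today, for this purpose?”. Everything in it is hypothetical: the record fields, the licence names, the example DOIs and the function may_mine are assumptions made for illustration and do not describe any existing PLS, CCC or CrossRef service.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Licence:
    licence_id: str        # hypothetical licence name
    tdm_allowed: bool      # does the licence permit text mining at all?
    commercial_use: bool   # does it also cover commercial mining?
    valid_until: date      # end of the fixed licence term


# Hypothetical registry: article DOI -> licence attached to that article.
REGISTRY = {
    "10.5555/example.001": Licence("click-through-v1", True, False, date(2013, 12, 31)),
    "10.5555/example.002": Licence("no-tdm", False, False, date(2013, 12, 31)),
}


def may_mine(doi, on_date, commercial=False):
    """Return True if the registry records a licence covering this use."""
    licence = REGISTRY.get(doi)
    if licence is None:
        return False  # unknown DOI: no permission on record
    if not licence.tdm_allowed or on_date > licence.valid_until:
        return False  # mining not covered, or the term has expired
    # Non-commercial use is covered; commercial use needs an explicit grant.
    return licence.commercial_use or not commercial


print(may_mine("10.5555/example.001", date(2013, 6, 1)))
print(may_mine("10.5555/example.001", date(2013, 6, 1), commercial=True))
```

A mining tool could run such a check before each retrieval, so that no permission has to be cleared case by case for content already covered by the standard licence.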
Permission would be granted under defined terms and conditions of use that are usually detailed in
the licence. This could be a standard licence or one designed specifically for a particular purpose.
The period of time that a licence could cover would depend on the text mining needs. For
computational linguistics research, one-time access will often be sufficient. For systematic literature
reviews and data mining, however, access will be needed over an extended period as new content is
added all the time. Content may be delivered as a single delivery (“data dump”) or online access
may be granted. Rightholders may choose to allow robot crawling of their digital content, possibly
with restrictions.
The use of a name identifier, such as ISNI (the ISO approved International Standard Name
Identifier), would be useful, to uniquely identify researchers and other contributors.
G. SOME LEGAL ISSUES
Text mining may frequently result in the creation of databases of facts or raw data extracted from
the sources mined. From a legal perspective, it is not clear whether any resultant database is
protected separately as a derivative work. This would need to be assessed on a case-by-case basis.
(On the other hand, some licences cover derivative works, but require attribution of the source,
which might be challenging from a practical perspective.)
If data are a numerical representation of facts, they are generally not copyrightable, but there are:
• Many levels of data/derived digital data[34]
• Jurisdictional differences (e.g. US vs. Australian law; EU database rights)
The result is ambiguity about the legal status of content.
[34] Public consultation on implementing CC0 for data published in open access journals, Sept-Nov 2012,
http://blogs.biomedcentral.com/bmcblog/2012/09/10/put-the-open-in-open-data/; see also: Hrynaszkiewicz I, Cockerill
MJ: Open by default: a proposed copyright license and waiver agreement for open access research and data in
peer-reviewed journals. BMC Research Notes 2012, 5:494, http://www.biomedcentral.com/1756-0500/5/494
H. ELEMENTS IN A STANDARD TDM LICENCE
The Terms and Conditions of a standard TDM Licence could include – very briefly:
1. Definitions
2. Grant of licence
RRO conditions for non-exclusive licence – permitted uses, incl. text (and data) mining, e.g.:
• downloading, extracting and indexing information from licensed website/online sources;
• mounting, loading and integrating results on a server used for text (and data) mining;
• evaluating and interpreting the text (and data) mining output for access and use;
• copying from digital publications and storing of digital copies;
• making/distributing and/or permitting making/distributing of paper copies;
• making available and/or permitting making available of digital copies;
• (scanning material to) produce digital copies.
3. Conditions applying to the creation and use of licensed copies; further conditions applying to
scanning and use of digital material (incl. security and technical access requirements)
4. Commercial uses (if applicable)
5. Duration
6. Payment
7. Notification / Notification to licensee’s staff
8. Data collection
9. Indemnity
10. Breach and termination
11. General: notices, variation of terms, assignments, jurisdiction/disputes/governing law, etc.
I. SOME FURTHER READING
- Witten, I.H. (2005), “Text mining”, in: Practical handbook of internet computing, edited by
M.P. Singh, pp. 14-1 - 14-22. Chapman & Hall/CRC Press, Boca Raton, Florida;
http://www.cs.waikato.ac.nz/~ihw/papers/04-IHW-Textmining.pdf
- National Centre for Text Mining (NaCTeM), http://www.nactem.ac.uk
- The Arrowsmith Project, http://arrowsmith.psych.uic.edu/arrowsmith_uic/index.html
- END of Document