Bibliomining: An Introduction
1
Outline
• Introduction
• Bibliomining Process
• Example Applications
• Placing Bibliomining in Context
• A Research Agenda to Advance Bibliomining
2
Origins and Definition of
Bibliomining
• ‘‘bibliometrics’’ + ‘‘data mining’’
– Bibliometrics focuses on the creation of works
– Data mining (Web usage mining) focuses on the access of works
• The application of data mining and bibliometric tools to data
produced from library services
• Gain a better understanding of library user communities
– Frequencies and aggregate measures hide underlying patterns
• The combination of data mining, bibliometrics, statistics, and
reporting tools used to extract patterns of behavior-based
artifacts from library systems for aiding decision-making or
justifying services
3
Bibliometrics
• Traditional bibliometrics is based on the quantitative
exploration of document-based scholarly communication
• Data for bibliometrics
– Works: authors, collections
– Connections: citations, authorship, common terms, other aspects of
the creation and publication process
• Allows researchers to understand the context in which a
work was created, the long-term citation impact of the work,
and the differences between fields in their scholarly output
patterns
4
Data for Bibliometrics
5
Bibliometrics (Cont.)
• Frequency-based analysis, visualization, and data mining
– Frequency of authorship in a subject, commonality of words used,
and discovery of a core set of frequently cited works
– Integrating the citations between works allows for very rich
exploration of relations between scholars and topics
– Linkages between works are used to aid in automated information
retrieval and visualization of scholarship and the social networks
between those involved with the creation process
– Many newer bibliometric applications involve Web-based resources
and hyperlinks that enhance or replace traditional citation information
6
Social Network
7
User-based Data Mining
• One popular area: the examination of how users explore
Web spaces (Web usage mining)
– Focus on accesses of different Web pages by a particular user (or IP
address)
– Patterns of use are discovered through data mining and used to
personalize the information presented to the user or improve the
information service
• In user-based data mining, the links between works come
from a commonality of use
– If one user accesses two works during the same session, for
example, and another user later views one of those works, the
other may also be of interest
8
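The co-access idea above can be sketched in a few lines of Python. The session logs and work identifiers here are hypothetical, and a real recommender would weight and prune these counts:

```python
from collections import defaultdict

def co_access_counts(sessions):
    """Count how often two works are accessed in the same session."""
    counts = defaultdict(int)
    for works in sessions:
        unique = sorted(set(works))
        for i in range(len(unique)):
            for j in range(i + 1, len(unique)):
                counts[(unique[i], unique[j])] += 1
    return counts

def recommend(work, counts, top_n=3):
    """Rank other works by how often they co-occur with `work`."""
    scores = {}
    for (a, b), n in counts.items():
        if a == work:
            scores[b] = scores.get(b, 0) + n
        elif b == work:
            scores[a] = scores.get(a, 0) + n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical session logs: each list is the works one user viewed in a session.
sessions = [["W1", "W2"], ["W1", "W2", "W3"], ["W2", "W3"], ["W1", "W4"]]
counts = co_access_counts(sessions)
print(recommend("W1", counts))  # works most often co-accessed with W1
```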
Data for User-Based Data Mining
Links between works that result from the users
9
Data for Anonymized Community-Based Web Usage Mining
Demographic Surrogate
10
Bibliomining Process
11
Overview
• Determining areas of focus
• Identifying internal and external data sources
• Collecting, cleaning, and anonymizing the data into a data
warehouse
• Selecting appropriate analysis tools
• Discovery of patterns through data mining and creation of
reports with traditional analytical tools
• Analyzing and implementing the results
12
Determining Areas of Focus
• Might come from a specific problem in the library or may be
a general area requiring exploration and decision-making
• Directed data mining: problem-focused
– Ex. Budget cuts have reduced the staff time for contacting patrons
about delinquent materials. Is there a way to predict the chance
patrons will return material once it is one week late in order to
prioritize our calling lists?
• Undirected data mining: consider general topical area
– Ex. How are different departments and types of patrons using the
electronic journals?
– May produce an overwhelming number of patterns to explore for
validation
– should be considered only when a strong data warehouse is in place
13
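The directed question above can be sketched as a simple scoring step: estimate historical return rates by patron type and call the least-likely-to-return patrons first. The field names and history records are illustrative assumptions, not an ILS schema:

```python
from collections import defaultdict

def return_rates(records):
    """records: (patron_type, returned_after_one_week_late) pairs."""
    totals, returns = defaultdict(int), defaultdict(int)
    for patron_type, returned in records:
        totals[patron_type] += 1
        if returned:
            returns[patron_type] += 1
    return {t: returns[t] / totals[t] for t in totals}

def prioritize_calls(overdue, rates):
    """Call patrons least likely to return on their own first."""
    return sorted(overdue, key=lambda p: rates.get(p[1], 0.0))

# Hypothetical historical records of week-late items.
history = [("student", True), ("student", False), ("faculty", True),
           ("faculty", True), ("community", False), ("community", False)]
rates = return_rates(history)
queue = prioritize_calls([("A", "faculty"), ("B", "community"), ("C", "student")],
                         rates)
print(queue)  # community patron first: lowest historical return rate
```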
Identifying Data Sources
• The bibliomining process requires transactional, non-aggregated, low-level data
• Privacy issue?
• Internal data sources are those already within the library
system
– Patron database, transactional data, Web server logs
• External data sources
– Demographic information related to a specific ID number that is
located in the computer center or personnel management system
– Demographic information for zip codes from census data
14
Data for Bibliomining
15
Conceptual Framework for Data Types
in the Bibliomining Data Mining
16
A Framework for the Data
• Data about a work
– Three kinds of fields
• Fields that were extracted from the work (like title or author)
• Fields that are created about the work (like subject heading)
• Fields that indicate the format and location of the work (like URL
or collection)
– Come from a MARC record, Dublin Core information, or CMS
– Can also connect into bibliometric information, such as citations or
links to other works
• May require extraction from the original source (in the case of
digital reference) or linking into a citation database
– Challenge: no article level usage reports
17
A Framework for the Data (Cont.)
• Data about the user
– Demographic surrogate
– Other fields that come from inferences about the user: zip code,
location/department/lab (inference from IP address)
18
A Framework for the Data (Cont.)
• Data about the service
– Searching, circulation, reference, interlibrary loan and other library
services
– Fields common to most services include time and date, library
personnel involved, location, method, and if the service was used in
conjunction with other services
– Each library service also has its own set of appropriate fields
• Searching: the content of the search and the next steps taken
• Interlibrary loan: cost, vendor, and time of fulfillment
• Circulation: acquisition process of the work and circulation length
19
Creating the Data Warehouse
• A data warehouse is a DB that is separate from the
operational systems and contains a cleaned and
anonymized version of the operational data reformatted for
analysis
• Queries extract the data from the identified sources,
combine those data using common fields, clean the data,
and write the resulting records into either a flat file or a
relational database designed specifically for analysis
• Can be automated to pull data from the operational systems
into the data warehouse on a regular basis
20
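The extract-combine-write step above can be sketched with SQLite standing in for both the operational system and the warehouse. Table and column names are illustrative assumptions, not a standard ILS schema:

```python
import sqlite3

# "Operational" database with patron and circulation tables.
op = sqlite3.connect(":memory:")
op.execute("CREATE TABLE patrons (patron_id TEXT, dept TEXT)")
op.execute("CREATE TABLE circulation (patron_id TEXT, item_id TEXT, when_out TEXT)")
op.executemany("INSERT INTO patrons VALUES (?, ?)",
               [("p1", "Physics"), ("p2", "History")])
op.executemany("INSERT INTO circulation VALUES (?, ?, ?)",
               [("p1", "i100", "2004-01-05"), ("p2", "i200", "2004-01-06")])

# Separate warehouse database, reformatted for analysis.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE checkouts (dept TEXT, item_id TEXT, when_out TEXT)")

# Combine on the common patron_id field; keep the department and drop the ID,
# so each warehouse row describes a community of users, not an individual.
rows = op.execute("""SELECT p.dept, c.item_id, c.when_out
                     FROM circulation c JOIN patrons p USING (patron_id)""").fetchall()
warehouse.executemany("INSERT INTO checkouts VALUES (?, ?, ?)", rows)
print(warehouse.execute("SELECT COUNT(*) FROM checkouts").fetchone()[0])  # 2
```

In practice this pull would be scheduled to run against the live systems on a regular basis, as the slide notes.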
Creating the Data Warehouse –
Protecting Patron Privacy
• Going through the data warehousing process requires the
library to examine its data sources
• By explicitly determining what to keep and what to destroy,
libraries can save the demographic information needed to
evaluate communities of users without keeping records of
the individuals in those communities
• Two examples
21
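One way to sketch the "keep the demographics, destroy the identity" step: strip the patron identifier and keep only a demographic surrogate plus an irreversible session token. The field names are assumptions, and note that a salted hash is pseudonymization rather than full anonymization, so the salt should be discarded or rotated:

```python
import hashlib

def anonymize(record, salt="rotate-or-discard-this-salt"):
    # One-way token: lets records from one visit be linked to each other,
    # but not back to a named patron once the salt is gone.
    token = hashlib.sha256((salt + record["patron_id"]).encode()).hexdigest()[:12]
    return {
        "surrogate": (record["patron_type"], record["zip"]),  # community, not person
        "session": token,
        "item_id": record["item_id"],
    }

raw = {"patron_id": "p12345", "patron_type": "graduate", "zip": "13244",
       "item_id": "i9001"}
clean = anonymize(raw)
assert "patron_id" not in clean  # the identifier never reaches the warehouse
print(clean["surrogate"])
```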
Cleaning Transactional Records
22
Cleaning Web Server
Transactional Records
23
Creating the Data Warehouse –
Building the Data Warehouse
• Building the data warehouse takes much more time than
mining the data
• Start with a narrowly defined bibliomining topic and work
through the entire process
• This iterative process also has the advantage of allowing
those developing the data warehouse to improve their
collection and cleaning algorithms early in the life of the
bibliomining project
24
Selecting Appropriate Analysis
Tools
• Traditional Reporting
• Management information system (MIS)
• Online Analytical Processing (OLAP)
• Visualization
• Data Mining
25
Analysis Tools – Traditional
Reporting
• Library decision-makers examine aggregates and averages
to understand their service use
• The advantage of the data warehouse is that new questions
can be asked not only of the present situation but also of the
past
– This allows those doing evaluation or measurement to ask new
questions and then create a historical view of those reports in order
to understand trends
• Libraries can more easily understand how behavior differs
between demographic groups in the library
26
Analysis Tools – Management
information system (MIS)
• Provide a manager with the ability to ask basic questions of
the data
• ILS packages have some type of basic MIS built in
• An MIS built on top of a data warehouse made for the library
will be more powerful and provide information that the library
needs to see
• Another addition to MIS is a critical factor alert system
– Example: if hourly circulation (factor) is below or above a certain
level, a manager could be immediately notified so staffing changes
could be made
27
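The critical-factor alert idea can be sketched as a threshold check over hourly counts. The thresholds, field names, and the notify step are illustrative assumptions:

```python
def check_alerts(hourly_counts, low=10, high=80):
    """Flag hours whose circulation count falls outside the configured bounds."""
    alerts = []
    for hour, count in hourly_counts.items():
        if count < low:
            alerts.append((hour, count, "below threshold: consider reducing staff"))
        elif count > high:
            alerts.append((hour, count, "above threshold: consider adding staff"))
    return alerts

# Hypothetical hourly checkout counts; a real system would notify a manager.
counts = {"09:00": 4, "12:00": 55, "15:00": 95}
for hour, count, msg in check_alerts(counts):
    print(f"{hour}: {count} checkouts -> {msg}")
```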
Analysis Tools – Online
Analytical Processing (OLAP)
• An interactive view of the data
• Under the surface, the OLAP tool has run thousands of DB
queries to combine all of the selected variables along with
all of the selected measures (aggregation types,
timeframes…)
• All of the fields are defined ahead of time, and the system
runs many queries before anyone uses it
– Response to the manager using the OLAP front-end for reports is
instant, which encourages exploration
• Penn Library Data Farm
(http://metrics.library.upenn.edu/prototype/datafarm/)
28
Analysis Tools – Online Analytical
Processing (OLAP) (Cont.)
• The user will pick one of many variables from a list to
examine
• Example: use of e-journals under dimensions, such as time
and subject
– A high-level view of this data in a tabular report (year and general
classification)
– Expand the report: click on a year → expand the year into quarters,
leaving the subject headings the same and recalculating the data
– The user can then click on another field to drill down into the data
• During exploration, the manager can capture any view of the
data and turn it into a regular report
29
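The drill-down described above can be sketched with plain Python dictionaries: summarize e-journal accesses by (year, subject), then expand one year into quarters while the subjects stay fixed. The records are illustrative; a real OLAP tool precomputes these aggregations so the response is instant:

```python
from collections import defaultdict

# Hypothetical e-journal access log: (date, subject classification).
records = [
    ("2003-02-10", "Physics"), ("2003-07-01", "Physics"),
    ("2003-11-20", "History"), ("2004-03-05", "Physics"),
]

def rollup(records, period):
    """Aggregate access counts by (period(date), subject)."""
    table = defaultdict(int)
    for date, subject in records:
        table[(period(date), subject)] += 1
    return dict(table)

year = lambda d: d[:4]
quarter = lambda d: f"{d[:4]}-Q{(int(d[5:7]) - 1) // 3 + 1}"

print(rollup(records, year))  # high-level view: year x subject
# Drill down: expand 2003 into quarters, recalculating the counts.
drill = rollup([r for r in records if r[0].startswith("2003")], quarter)
print(drill)
```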
Analysis Tools – Visualization
• Present the characteristics of data in a visual form
30
Analysis Tools – Data Mining
• Discovery of valid, novel, and actionable patterns in large
amounts of data using statistical and artificial intelligence
tools
• Two main categories of data mining tasks
– Description: understand the data from the past and the present
• discover patterns for affinity groups of variables common to
different patrons or clusters of demographic groups that exhibit
certain characteristics (association rule mining, clustering)
– Prediction: make a statement about the unknown based upon what is
known
• Classification (place an item into a category)
• Estimation (produce a numeric value for an unknown variable)
31
Analysis Tools – Data Mining
(Cont.)
• Techniques: neural networks, regression, clustering, rule
generation, and classification
• Process:
– Take a cleaned data set
– Generate new variables from existing ones
– Split the data into model-building sets and test sets
– Apply techniques to the model-building sets to discover patterns
– Use the test sets to ensure the patterns are generalizable
– Confirm these patterns with someone who knows the domain
• Web Usage Mining, Text Mining (+ bibliometrics)
32
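The process above can be sketched end to end with a nearest-centroid classifier standing in for any technique: build the model on one split, then check that its patterns hold on the held-out test split. The data is synthetic and the features (checkouts, renewals per patron type) are assumptions:

```python
import random

random.seed(0)
# Synthetic rows: ((checkouts, renewals), patron class).
data = [((random.gauss(2, 0.5), random.gauss(1, 0.5)), "casual") for _ in range(40)]
data += [((random.gauss(8, 0.5), random.gauss(6, 0.5)), "heavy") for _ in range(40)]
random.shuffle(data)
train, test = data[:60], data[60:]  # model-building set and test set

def centroids(rows):
    """Mean feature vector per class: the 'model'."""
    sums = {}
    for (x, y), label in rows:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {lab: (sx / n, sy / n) for lab, (sx, sy, n) in sums.items()}

def classify(point, cents):
    """Assign the class whose centroid is nearest."""
    return min(cents, key=lambda lab: (point[0] - cents[lab][0]) ** 2
                                      + (point[1] - cents[lab][1]) ** 2)

model = centroids(train)
accuracy = sum(classify(p, model) == lab for p, lab in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The held-out accuracy is the "ensure the patterns are generalizable" check; the final confirmation step still belongs to a librarian who knows the domain.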
Analysis Tools – Category &
Cluster Results
33
Analysis Tools – Cluster Detail Information
34
Analysis Tools – Citation Relation
35
DREW Open Effort Project
• Digital Reference Electronic Warehouse (DREW)
– Develop an XML schema to…
• Allow digital reference transactions from different services and in
different communication forms to live together in one space
• Allow researchers to access these archives and explore them
using a variety of methods
– Capture the results of this research into a management information
system, and then allow the reference services to view their own
archives through the tools created by the researchers
• Knowledge base, citations and links to other works
36
Analysis and Implementation
• Once the results have been developed, they must be
validated
– Test and tweak the model with data that were not used during the
development process (training and test)
– The most important validation is to have a librarian who is familiar
with that particular library context examine the models
• Implement the report/model
– Essential to monitor the variables that power the models over time; if
the mean of a variable strays too far because of changes in the
library, the model may have to be reevaluated
37
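The monitoring step above can be sketched as a drift check: if the mean of a model input strays too far from its value when the model was built, flag the model for reevaluation. The two-standard-deviation threshold and the loan-length variable are assumed conventions:

```python
import statistics

def drifted(baseline, current, n_stdevs=2.0):
    """True if the current mean is far from the baseline mean."""
    mu, sigma = statistics.mean(baseline), statistics.stdev(baseline)
    return abs(statistics.mean(current) - mu) > n_stdevs * sigma

# Hypothetical loan lengths (days) when the model was built, and later.
baseline_loan_days = [14, 15, 13, 14, 16, 15, 14]
stable = [14, 15, 14]
shifted = [28, 30, 29]  # e.g. the library doubled its loan period

print(drifted(baseline_loan_days, stable))   # False: model still applies
print(drifted(baseline_loan_days, shifted))  # True: reevaluate the model
```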
Example Applications – See Another
PPT
38
Placing Bibliomining in Context
39
Conceptual Framework for
Decision-Makers
40
Conceptual Framework for Library
and Information Scientists
Inference
Induction
Hypothetico-Deductive-Inductive Method
41
Understanding both Frameworks
• In both frameworks, bibliomining is not the end of the
exploration process
• It is one tool to be used in combination with other methods
of measurement and evaluation, such as LIBQUAL, E-metrics,
cost-benefit analyses, surveys, focus groups, or
other qualitative explorations
• Using only bibliomining to understand a digital library can
result in biased or incomplete results
• While the information provided by bibliomining is useful, it
needs to be supplemented by more user-based approaches
to provide a more complete picture of the library system
42
A Research Agenda to Advance
Bibliomining
43
Data Collection
• Various data sources
–
–
–
–
–
Integrated library system
Web-based front-end to digital libraries (federated search)
A system to support interlibrary loan
A system to support digital reference services
External systems – citation databases, census data
• How to collect data and match it between systems
– Standard for data – Project COUNTER, NISO Z39.7-200x (library
metrics and statistics) → aggregate-level data
– Cooperation between system creators – easily exportable data
warehouse and match between systems through common fields
44
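Matching between systems through a common field can be sketched as a dictionary merge: here the ILS and an interlibrary-loan system both carry a work identifier (an assumed ISBN field, with illustrative values), so usage can be combined into one record per work:

```python
# Hypothetical per-work usage exported from two systems.
ils = [{"isbn": "0-13-101908-1", "checkouts": 12},
       {"isbn": "0-262-03293-7", "checkouts": 3}]
ill = [{"isbn": "0-262-03293-7", "borrow_requests": 9}]

def match_on(key, *tables):
    """Merge rows from several systems into one record per key value."""
    merged = {}
    for table in tables:
        for row in table:
            merged.setdefault(row[key], {}).update(row)
    return merged

combined = match_on("isbn", ils, ill)
print(combined["0-262-03293-7"])  # checkouts and ILL requests in one record
```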
User Privacy
• The bibliomining data warehouse can provide the method
for keeping information about the materials used in the
library without maintaining specific information about the
users of the library
• How about the effect this anonymization has on the power
of the data mining tools to discover patterns?
• Privacy-protecting data mining
• Privacy issues coming from Digital Reference Service
(DRS): personal information in the questions
– Text mining and NLP
45
Variable, Metric, and Model
Generation
• While researchers have developed metrics for library
statistics, they have primarily focused on fields from one
data source
• Once the warehouse has been constructed, the possibilities
grow for the discovery of interesting variables for mining and
metrics for evaluation
• Start in the data mining process, looking for relationships
between individual variables that allow for deeper
understanding
– Through the patterns discovered with data mining, new metrics and
measures can be proposed
• Example: one-time high-demand needs vs. needs that
represent the general user base
46
Integration of Management Information
System and Data Mining tools
• Integrate the found algorithms into the systems that drive
digital libraries
• This combination of a built-in data warehouse, interactive
reporting module, standards for report description, and
modular design will make it much easier for library decision-makers to get involved with bibliomining.
• Toward developing these integrated modules for other
systems that support digital libraries
47
Multi-system Data Warehouses
and Knowledge Bases
• The creation of services that span many digital libraries
– Library consortia
– Joining together digital library sources and services while still maintaining
identity for those participating (like National Science Digital Library)
• Join data warehouses with libraries that have similar user groups and
similar collections
– Agree on demographic surrogates or develop a cross-walk algorithm to map
demographics
– Need to ensure that these patterns apply to their own library before making
decisions based upon them
• Methods for combining utilization and collection metadata between
different systems.
– Standardize a series of metrics (what do “Hit” and “Visit” mean?)
– Create a standard for record-level data (MARC, COUNTER…)
48
Conclusion: moving beyond
evaluation to understanding
• The final and most long-lasting area of bibliomining
research is improving the understanding of digital libraries at
a generalized, and perhaps even conceptual, level
• These data warehouses will combine resources traditionally
unavailable in this combined form to researchers
– What connections can be made between patron demographics and
bibliometric-based social networks of authors?
– How much influence do the works written and cited by faculty at an
institution have on the patterns of student use of library services?
– How do usage patterns differ between departments or demographic
groups, and what can the library do to better personalize and
enhance existing services?
• Qualitative + quantitative
49