Bibliomining: An Introduction
1
Outline
• Introduction
• Bibliomining Process
• Example Applications
• Placing Bibliomining in Context
• A Research Agenda to Advance Bibliomining
2
Origins and Definition of
Bibliomining
• ‘‘bibliometrics’’ + ‘‘data mining’’
– Bibliometrics focuses on the creation of works
– Data mining (Web usage mining) focuses on the access of works
• The application of data mining and bibliometric tools to data
produced from library services
• Gain a better understanding of library user communities
– Frequencies and aggregate measures hide underlying patterns
• The combination of data mining, bibliometrics, statistics, and
reporting tools used to extract patterns of behavior-based
artifacts from library systems for aiding decision-making or
justifying services
3
Bibliometrics
• Traditional bibliometrics is based on the quantitative
exploration of document-based scholarly communication
• Data for bibliometrics
– Works: authors, collections
– Connections: citations, authorship, common terms, other aspects of
the creation and publication process
• Allows researchers to understand the context in which a
work was created, the long-term citation impact of the work,
and the differences between fields in their scholarly output
patterns
4
Data for Bibliometrics
5
Bibliometrics (Cont.)
• Frequency-based analysis, visualization, and data mining
– Frequency of authorship in a subject, commonality of words used,
and discovery of a core set of frequently cited works
– Integrating the citations between works allows for very rich
exploration of relations between scholars and topics
– Linkages between works are used to aid in automated information
retrieval and visualization of scholarship and the social networks
between those involved with the creation process
– Many newer bibliometric applications involve Web-based resources
and hyperlinks that enhance or replace traditional citation information
6
Social Network
7
User-based Data Mining
• One popular area: the examination of how users explore
Web spaces (Web usage mining)
– Focus on accesses of different Web pages by a particular user (or IP
address)
– Patterns of use are discovered through data mining and used to
personalize the information presented to the user or improve the
information service
• In user-based data mining, the links between works come
from a commonality of use
– If one user accesses two works during the same session, for
example, and another user later views one of those works, the
other may also be of interest
8
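The co-access idea above can be sketched in a few lines of Python. The session logs and work identifiers here are hypothetical, and a real recommender would weight and prune these counts:

```python
from collections import defaultdict

def co_access_counts(sessions):
    """Count how often two works are accessed in the same session."""
    counts = defaultdict(int)
    for works in sessions:
        unique = sorted(set(works))
        for i in range(len(unique)):
            for j in range(i + 1, len(unique)):
                counts[(unique[i], unique[j])] += 1
    return counts

def recommend(work, counts, top_n=3):
    """Rank other works by how often they co-occur with `work`."""
    scores = {}
    for (a, b), n in counts.items():
        if a == work:
            scores[b] = scores.get(b, 0) + n
        elif b == work:
            scores[a] = scores.get(a, 0) + n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical session logs: each list is the works one user viewed in a session.
sessions = [["W1", "W2"], ["W1", "W2", "W3"], ["W2", "W3"], ["W1", "W4"]]
counts = co_access_counts(sessions)
print(recommend("W1", counts))  # works most often co-accessed with W1
```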
Data for User-Based Data Mining
Links between works that result from the users
9
Data for Anonymized Community-Based Web Usage Mining
Demographic Surrogate
10
Bibliomining Process
11
Overview
• Determining areas of focus
• Identifying internal and external data sources
• Collecting, cleaning, and anonymizing the data into a data
warehouse
• Selecting appropriate analysis tools
• Discovery of patterns through data mining and creation of
reports with traditional analytical tools
• Analyzing and implementing the results
12
Determining Areas of Focus
• Might come from a specific problem in the library or may be
a general area requiring exploration and decision-making
• Directed data mining: problem-focused
– Ex. Budget cuts have reduced the staff time for contacting patrons
about delinquent materials. Is there a way to predict the chance
patrons will return material once it is one week late in order to
prioritize our calling lists?
• Undirected data mining: consider general topical area
– Ex. How are different departments and types of patrons using the
electronic journals?
– May produce an overwhelming number of patterns to explore for
validation
– should be considered only when a strong data warehouse is in place
13
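The directed question above can be sketched as a simple scoring step: estimate historical return rates by patron type and call the least-likely-to-return patrons first. The field names and history records are illustrative assumptions, not an ILS schema:

```python
from collections import defaultdict

def return_rates(records):
    """records: (patron_type, returned_after_one_week_late) pairs."""
    totals, returns = defaultdict(int), defaultdict(int)
    for patron_type, returned in records:
        totals[patron_type] += 1
        if returned:
            returns[patron_type] += 1
    return {t: returns[t] / totals[t] for t in totals}

def prioritize_calls(overdue, rates):
    """Call patrons least likely to return on their own first."""
    return sorted(overdue, key=lambda p: rates.get(p[1], 0.0))

# Hypothetical historical records of week-late items.
history = [("student", True), ("student", False), ("faculty", True),
           ("faculty", True), ("community", False), ("community", False)]
rates = return_rates(history)
queue = prioritize_calls([("A", "faculty"), ("B", "community"), ("C", "student")],
                         rates)
print(queue)  # community patron first: lowest historical return rate
```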
Identifying Data Sources
• The bibliomining process requires transactional, non-aggregated, low-level data
• Privacy issue?
• Internal data sources are those already within the library
system
– Patron database, transactional data, Web server logs
• External data sources
– Demographic information related to a specific ID number that is
located in the computer center or personnel management system
– Demographic information for zip codes from census data
14
Data for Bibliomining
15
Conceptual Framework for Data Types
in the Bibliomining Data Mining
16
A Framework for the Data
• Data about a work
– Three kinds of fields
• Fields that were extracted from the work (like title or author)
• Fields that are created about the work (like subject heading)
• Fields that indicate the format and location of the work (like URL
or collection)
– Come from a MARC record, Dublin Core information, or CMS
– Can also connect into bibliometric information, such as citations or
links to other works
• May require extraction from the original source (in the case of
digital reference) or linking into a citation database
– Challenge: no article level usage reports
17
A Framework for the Data (Cont.)
• Data about the user
– Demographic surrogate
– Other fields that come from inferences about the user: zip code,
location/department/lab (inference from IP address)
18
A Framework for the Data (Cont.)
• Data about the service
– Searching, circulation, reference, interlibrary loan and other library
services
– Fields common to most services include time and date, library
personnel involved, location, method, and if the service was used in
conjunction with other services
– Each library service also has its own set of appropriate fields
• Searching: the content of the search and the next steps taken
• Interlibrary loan: cost, vendor, and time of fulfillment
• Circulation: acquisition process of the work and circulation length
19
Creating the Data Warehouse
• A data warehouse is a DB that is separate from the
operational systems and contains a cleaned and
anonymized version of the operational data reformatted for
analysis
• Queries extract the data from the identified sources,
combine those data using common fields, clean the data,
and write the resulting records into either a flat file or a
relational database designed specifically for analysis
• Can be automated to pull data from the operational systems
into the data warehouse on a regular basis
20
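The extract-combine-write step above can be sketched with SQLite standing in for both the operational system and the warehouse. Table and column names are illustrative assumptions, not a standard ILS schema:

```python
import sqlite3

# "Operational" database with patron and circulation tables.
op = sqlite3.connect(":memory:")
op.execute("CREATE TABLE patrons (patron_id TEXT, dept TEXT)")
op.execute("CREATE TABLE circulation (patron_id TEXT, item_id TEXT, when_out TEXT)")
op.executemany("INSERT INTO patrons VALUES (?, ?)",
               [("p1", "Physics"), ("p2", "History")])
op.executemany("INSERT INTO circulation VALUES (?, ?, ?)",
               [("p1", "i100", "2004-01-05"), ("p2", "i200", "2004-01-06")])

# Separate warehouse database, reformatted for analysis.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE checkouts (dept TEXT, item_id TEXT, when_out TEXT)")

# Combine on the common patron_id field; keep the department and drop the ID,
# so each warehouse row describes a community of users, not an individual.
rows = op.execute("""SELECT p.dept, c.item_id, c.when_out
                     FROM circulation c JOIN patrons p USING (patron_id)""").fetchall()
warehouse.executemany("INSERT INTO checkouts VALUES (?, ?, ?)", rows)
print(warehouse.execute("SELECT COUNT(*) FROM checkouts").fetchone()[0])  # 2
```

In practice this pull would be scheduled to run against the live systems on a regular basis, as the slide notes.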
Creating the Data Warehouse –
Protecting Patron Privacy
• Going through the data warehousing process requires the
library to examine its data sources
• By explicitly determining what to keep and what to destroy,
libraries can save the demographic information needed to
evaluate communities of users without keeping records of
the individuals in those communities
• Two examples
21
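One way to sketch the "keep the demographics, destroy the identity" step: strip the patron identifier and keep only a demographic surrogate plus an irreversible session token. The field names are assumptions, and note that a salted hash is pseudonymization rather than full anonymization, so the salt should be discarded or rotated:

```python
import hashlib

def anonymize(record, salt="rotate-or-discard-this-salt"):
    # One-way token: lets records from one visit be linked to each other,
    # but not back to a named patron once the salt is gone.
    token = hashlib.sha256((salt + record["patron_id"]).encode()).hexdigest()[:12]
    return {
        "surrogate": (record["patron_type"], record["zip"]),  # community, not person
        "session": token,
        "item_id": record["item_id"],
    }

raw = {"patron_id": "p12345", "patron_type": "graduate", "zip": "13244",
       "item_id": "i9001"}
clean = anonymize(raw)
assert "patron_id" not in clean  # the identifier never reaches the warehouse
print(clean["surrogate"])
```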
Cleaning Transactional Records
22
Cleaning Web Server
Transactional Records
23
Creating the Data Warehouse –
Building the Data Warehouse
• Building the data warehouse takes much more time than
mining the data
• Start with a narrowly defined bibliomining topic and work
through the entire process
• This iterative process also has the advantage of allowing
those developing the data warehouse to improve their
collection and cleaning algorithms early in the life of the
bibliomining project
24
Selecting Appropriate Analysis
Tools
• Traditional Reporting
• Management information system (MIS)
• Online Analytical Processing (OLAP)
• Visualization
• Data Mining
25
Analysis Tools – Traditional
Reporting
• Library decision-makers examine aggregates and averages
to understand their service use
• The advantage of the data warehouse is that new questions
can be asked not only of the present situation but also of the
past
– This allows those doing evaluation or measurement to ask new
questions and then create a historical view of those reports in order
to understand trends
• Libraries can more easily understand how behavior differs
between demographic groups in the library
26
Analysis Tools – Management
information system (MIS)
• Provide a manager with the ability to ask basic questions of
the data
• ILS packages have some type of basic MIS built in
• An MIS built on top of a data warehouse made for the library
will be more powerful and provide information that the library
needs to see
• Another addition to MIS is a critical factor alert system
– Example: if hourly circulation (factor) is below or above a certain
level, a manager could be immediately notified so staffing changes
could be made
27
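The critical-factor alert idea can be sketched as a threshold check over hourly counts. The thresholds, field names, and the notify step are illustrative assumptions:

```python
def check_alerts(hourly_counts, low=10, high=80):
    """Flag hours whose circulation count falls outside the configured bounds."""
    alerts = []
    for hour, count in hourly_counts.items():
        if count < low:
            alerts.append((hour, count, "below threshold: consider reducing staff"))
        elif count > high:
            alerts.append((hour, count, "above threshold: consider adding staff"))
    return alerts

# Hypothetical hourly checkout counts; a real system would notify a manager.
counts = {"09:00": 4, "12:00": 55, "15:00": 95}
for hour, count, msg in check_alerts(counts):
    print(f"{hour}: {count} checkouts -> {msg}")
```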
Analysis Tools – Online
Analytical Processing (OLAP)
• An interactive view of the data
• Under the surface, the OLAP tool has run thousands of DB
queries to combine all of the selected variables along with
all of the selected measures (aggregation types,
timeframes…)
• All of the fields are defined ahead of time, and the system
runs many queries before anyone uses it
– Response to the manager using the OLAP front-end for reports is
instant, which encourages exploration
• Penn Library Data Farm
(http://metrics.library.upenn.edu/prototype/datafarm/)
28
Analysis Tools – Online Analytical
Processing (OLAP) (Cont.)
• The user will pick one of many variables from a list to
examine
• Example: use of e-journals under dimensions, such as time
and subject
– A high-level view of this data in a tabular report (year and general
classification)
– Expand the report: click on a year → expand the year into quarters,
leaving the subject headings the same and recalculating the data
– The user can then click on another field to drill down into the data
• During exploration, the manager can capture any view of the
data and turn it into a regular report
29
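The drill-down described above can be sketched with plain Python dictionaries: summarize e-journal accesses by (year, subject), then expand one year into quarters while the subjects stay fixed. The records are illustrative; a real OLAP tool precomputes these aggregations so the response is instant:

```python
from collections import defaultdict

# Hypothetical e-journal access log: (date, subject classification).
records = [
    ("2003-02-10", "Physics"), ("2003-07-01", "Physics"),
    ("2003-11-20", "History"), ("2004-03-05", "Physics"),
]

def rollup(records, period):
    """Aggregate access counts by (period(date), subject)."""
    table = defaultdict(int)
    for date, subject in records:
        table[(period(date), subject)] += 1
    return dict(table)

year = lambda d: d[:4]
quarter = lambda d: f"{d[:4]}-Q{(int(d[5:7]) - 1) // 3 + 1}"

print(rollup(records, year))  # high-level view: year x subject
# Drill down: expand 2003 into quarters, recalculating the counts.
drill = rollup([r for r in records if r[0].startswith("2003")], quarter)
print(drill)
```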
Analysis Tools – Visualization
• Present the characteristics of data in a visual form
30
Analysis Tools – Data Mining
• Discovery of valid, novel, and actionable patterns in large
amounts of data using statistical and artificial intelligence
tools
• Two main categories of data mining tasks
– Description: understand the data from the past and the present
• discover patterns for affinity groups of variables common to
different patrons or clusters of demographic groups that exhibit
certain characteristics (association rule mining, clustering)
– Prediction: make a statement about the unknown based upon what is
known
• Classification (place an item into a category)
• Estimation (produce a numeric value for an unknown variable)
31
Analysis Tools – Data Mining
(Cont.)
• Techniques: neural networks, regression, clustering, rule
generation, and classification
• Process:
– Take a cleaned data set
– Generate new variables from existing ones
– Split the data into model-building sets and test sets
– Apply techniques to the model-building sets to discover patterns
– Use the test sets to ensure the patterns are generalizable
– Confirm these patterns with someone who knows the domain
• Web Usage Mining, Text Mining (+ bibliometrics)
32
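The process above can be sketched end to end with a nearest-centroid classifier standing in for any technique: build the model on one split, then check that its patterns hold on the held-out test split. The data is synthetic and the features (checkouts, renewals per patron type) are assumptions:

```python
import random

random.seed(0)
# Synthetic rows: ((checkouts, renewals), patron class).
data = [((random.gauss(2, 0.5), random.gauss(1, 0.5)), "casual") for _ in range(40)]
data += [((random.gauss(8, 0.5), random.gauss(6, 0.5)), "heavy") for _ in range(40)]
random.shuffle(data)
train, test = data[:60], data[60:]  # model-building set and test set

def centroids(rows):
    """Mean feature vector per class: the 'model'."""
    sums = {}
    for (x, y), label in rows:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {lab: (sx / n, sy / n) for lab, (sx, sy, n) in sums.items()}

def classify(point, cents):
    """Assign the class whose centroid is nearest."""
    return min(cents, key=lambda lab: (point[0] - cents[lab][0]) ** 2
                                      + (point[1] - cents[lab][1]) ** 2)

model = centroids(train)
accuracy = sum(classify(p, model) == lab for p, lab in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The held-out accuracy is the "ensure the patterns are generalizable" check; the final confirmation step still belongs to a librarian who knows the domain.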
Analysis Tools – Category &
Cluster Results
33
Analysis Tools – Cluster Detail Information
34
Analysis Tools – Citation Relation
35
DREW Open Effort Project
• Digital Reference Electronic Warehouse (DREW)
– Develop an XML schema to…
• Allow digital reference transactions from different services and in
different communication forms to live together in one space
• Allow researchers to access these archives and explore them
using a variety of methods
– Capture the results of this research into a management information
system, and then allow the reference services to view their own
archives through the tools created by the researchers
• Knowledge base, citations and links to other works
36
Analysis and Implementation
• Once the results have been developed, they must be
validated
– Test and tweak the model with data that were not used during the
development process (training and test)
– The most important validation is to have a librarian who is familiar
with that particular library context examine the models
• Implement the report/model
– Essential to monitor the variables that power the models over time; if
the mean of a variable strays too far because of changes in the
library, the model may have to be reevaluated
37
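The monitoring step above can be sketched as a drift check: if the mean of a model input strays too far from its value when the model was built, flag the model for reevaluation. The two-standard-deviation threshold and the loan-length variable are assumed conventions:

```python
import statistics

def drifted(baseline, current, n_stdevs=2.0):
    """True if the current mean is far from the baseline mean."""
    mu, sigma = statistics.mean(baseline), statistics.stdev(baseline)
    return abs(statistics.mean(current) - mu) > n_stdevs * sigma

# Hypothetical loan lengths (days) when the model was built, and later.
baseline_loan_days = [14, 15, 13, 14, 16, 15, 14]
stable = [14, 15, 14]
shifted = [28, 30, 29]  # e.g. the library doubled its loan period

print(drifted(baseline_loan_days, stable))   # False: model still applies
print(drifted(baseline_loan_days, shifted))  # True: reevaluate the model
```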
Example Applications – See Another
PPT
38
Placing Bibliomining in Context
39
Conceptual Framework for
Decision-Makers
40
Conceptual Framework for Library
and Information Scientists
Inference
Induction
Hypothetico-Deductive-Inductive Method
41
Understanding both Frameworks
• In both frameworks, bibliomining is not the end of the
exploration process
• It is one tool to be used in combination with other methods
of measurement and evaluation, such as LIBQUAL, E-metrics,
cost-benefit analyses, surveys, focus groups, or
other qualitative explorations
• Using only bibliomining to understand a digital library can
result in biased or incomplete results
• While the information provided by bibliomining is useful, it
needs to be supplemented by more user-based approaches
to provide a more complete picture of the library system
42
A Research Agenda to Advance
Bibliomining
43
Data Collection
• Various data sources
–
–
–
–
–
Integrated library system
Web-based front-end to digital libraries (federated search)
A system to support interlibrary loan
A system to support digital reference services
External systems – citation databases, census data
• How to collect data and match it between systems
– Standard for data – Project COUNTER, NISO Z39.7-200x (library
metrics and statistics) → aggregate-level data
– Cooperation between system creators – easily exportable data
warehouse and match between systems through common fields
44
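Matching between systems through a common field can be sketched as a dictionary merge: here the ILS and an interlibrary-loan system both carry a work identifier (an assumed ISBN field, with illustrative values), so usage can be combined into one record per work:

```python
# Hypothetical per-work usage exported from two systems.
ils = [{"isbn": "0-13-101908-1", "checkouts": 12},
       {"isbn": "0-262-03293-7", "checkouts": 3}]
ill = [{"isbn": "0-262-03293-7", "borrow_requests": 9}]

def match_on(key, *tables):
    """Merge rows from several systems into one record per key value."""
    merged = {}
    for table in tables:
        for row in table:
            merged.setdefault(row[key], {}).update(row)
    return merged

combined = match_on("isbn", ils, ill)
print(combined["0-262-03293-7"])  # checkouts and ILL requests in one record
```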
User Privacy
• The bibliomining data warehouse can provide the method
for keeping information about the materials used in the
library without maintaining specific information about the
users of the library
• How about the effect this anonymization has on the power
of the data mining tools to discover patterns?
• Privacy-protecting data mining
• Privacy issues coming from Digital Reference Service
(DRS): personal information in the questions
– Text mining and NLP
45
Variable, Metric, and Model
Generation
• While researchers have developed metrics for library
statistics, they have primarily focused on fields from one
data source
• Once the warehouse has been constructed, the possibilities
grow for the discovery of interesting variables for mining and
metrics for evaluation
• Start in the data mining process, looking for relationships
between individual variables that allow for deeper
understanding
– Through the patterns discovered with data mining, new metrics and
measures can be proposed
• Example: one-time high-demand needs vs. needs that
represent the general user base
46
Integration of Management Information
System and Data Mining tools
• Integrate the found algorithms into the systems that drive
digital libraries
• This combination of a built-in data warehouse, interactive
reporting module, standards for report description, and
modular design will make it much easier for library decision-makers to get involved with bibliomining.
• Toward developing these integrated modules for other
systems that support digital libraries
47
Multi-system Data Warehouses
and Knowledge Bases
• The creation of services that span many digital libraries
– Library consortia
– Joining together digital library sources and services while still maintaining
identity for those participating (like National Science Digital Library)
• Join data warehouses with libraries that have similar user groups and
similar collections
– Agree on demographic surrogates or develop a cross-walk algorithm to map
demographics
– Need to ensure that these patterns apply to their own library before making
decisions based upon them
• Methods for combining utilization and collection metadata between
different systems.
– Standardize a series of metrics (what do “Hit” and “Visit” mean?)
– Create a standard for record-level data (MARC, COUNTER…)
48
Conclusion: moving beyond
evaluation to understanding
• The final and most long-lasting area of bibliomining
research is improving the understanding of digital libraries at
a generalized, and perhaps even conceptual, level
• These data warehouses will combine resources traditionally
unavailable in this combined form to researchers
– What connections can be made between patron demographics and
bibliometric-based social networks of authors?
– How much influence do the works written and cited by faculty at an
institution have on the patterns of student use of library services?
– How do usage patterns differ between departments or demographic
groups, and what can the library do to better personalize and
enhance existing services?
• Qualitative + quantitative
49