Striking a Balance:
Bibliomining and Privacy
Scott Nicholson
Assistant Professor
Syracuse University
School of Information Studies
http://bibliomining.org
[email protected]
Please do not reuse these slides without prior permission from [email protected].
Copyright 2004
Scott Nicholson, Syracuse University School of Information Studies

What is Bibliomining?
Bibliomining is the combination of
Bibliometrics and Data Mining
used on the data produced during the
operation of libraries (physical and digital)

What is Bibliomining?
Application of advanced analysis tools to data produced by libraries
May include
  - Data mining
  - Bibliometrics (patterns in scholarship)
  - Online analytical processing (OLAP)
  - Other statistical techniques

Goals of Bibliomining
Improved decision-making through better understanding of
  - Patron Behavior
  - Library Staff Behavior
  - Behavior of outside organizations
Can provide justification for
  - Library management policies and decisions
  - Acquisitions and ILL source selection
  - Collection development decisions
  - Use of library services (funding bodies)

Steps in Bibliomining
Determine areas of focus
  - Prediction vs. Description
Determine data source needs
  - Internal and External
Gather data
Create data warehouse
Select appropriate analysis tools
Create & test models / Create reports
Analyze results

Creating the data warehouse
A data warehouse is a collection of cleaned and anonymized data in a relational database and a point for queries
Outside of the operational systems
Connects disparate data sources into an easily accessible database
Can be built one time or updated on a regular basis

Steps in the Warehousing Process
Identify fields of interest
Determine fields that contain personally identifiable information (PII)
Determine combinations of fields that create PII (dept. + level + gender)

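One way to approach the last step is to count how many patrons share each combination of field values: a combination held by only one patron identifies that patron even though no single field does. The following is a minimal sketch under that assumption. The function name, the sample records, and the `min_group_size` threshold are hypothetical and do not come from the slides.

```python
from collections import Counter

def risky_combinations(records, fields, min_group_size=2):
    """Return value combinations shared by fewer than min_group_size records."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {combo: n for combo, n in counts.items() if n < min_group_size}

# Hypothetical patron records, invented for illustration only.
patrons = [
    {"dept": "Math", "level": "Grad", "gender": "F"},
    {"dept": "Math", "level": "Ugrad", "gender": "F"},
    {"dept": "Math", "level": "Ugrad", "gender": "F"},
    {"dept": "English", "level": "Faculty", "gender": "M"},
    {"dept": "English", "level": "Faculty", "gender": "M"},
]

# No single department is unique, but one dept + level + gender combination is.
risky_by_dept = risky_combinations(patrons, ["dept"])
risky_by_combo = risky_combinations(patrons, ["dept", "level", "gender"])
```

In this sample, checking `dept` alone flags nothing, while the three-field check flags the lone Math graduate student. A real threshold would be set by local policy.
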
Methods for dealing with
Personally Identifiable Information
Use codes, IDs for matching and discard

Original Circulation Records
Book ID       Subject               Patron
QA76.9        Computer Science      392-33
PS159.G8      American Literature   575-49
HF5415.125    Marketing             392-33

Original Patron Database
Patron    Name              Class     Dept.
373-34    Abby Lavender     Grad      Psych
392-33    Kenneth Moore     Ugrad     Math
575-49    Sophie Richards   Faculty   English

Data Warehouse - Combined Cleaned Circulation Records
Book ID       Subject               Patron Class   Patron Dept.
QA76.9        Computer Science      Ugrad          Math
PS159.G8      American Literature   Faculty        English
HF5415.125    Marketing             Ugrad          Math

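The match-and-discard step shown in the tables can be sketched as a small join. This is only an illustration: the function name and dictionary keys are assumptions, and the rows are the sample rows from the slide. The patron ID is used to look up class and department and is then left out of the output, along with the name.

```python
def build_warehouse(circulation, patrons):
    """Join circulation rows to patron demographics, then drop the patron ID."""
    by_id = {p["patron"]: p for p in patrons}
    cleaned = []
    for row in circulation:
        patron = by_id[row["patron"]]
        # Only non-identifying fields are copied; ID and name are discarded.
        cleaned.append({
            "book_id": row["book_id"],
            "subject": row["subject"],
            "patron_class": patron["class"],
            "patron_dept": patron["dept"],
        })
    return cleaned

circulation = [
    {"book_id": "QA76.9", "subject": "Computer Science", "patron": "392-33"},
    {"book_id": "PS159.G8", "subject": "American Literature", "patron": "575-49"},
    {"book_id": "HF5415.125", "subject": "Marketing", "patron": "392-33"},
]
patrons = [
    {"patron": "373-34", "name": "Abby Lavender", "class": "Grad", "dept": "Psych"},
    {"patron": "392-33", "name": "Kenneth Moore", "class": "Ugrad", "dept": "Math"},
    {"patron": "575-49", "name": "Sophie Richards", "class": "Faculty", "dept": "English"},
]

warehouse = build_warehouse(circulation, patrons)
```
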
Codes for PII
One typical suggestion: code the PII fields, and then record the codes in the database
  - Appropriate for other parties
Do not use a reversible encoding procedure to encode variables.
  - This does not protect patron’s information from an investigation.

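To illustrate why a code computed from the original value offers little protection, the sketch below encodes a patron ID with an unsalted SHA-256 hash and then recovers it by trying every possible ID. The slides do not name a specific encoding, so the hash, the assumed `ddd-dd` ID format, and the function names are all assumptions chosen for illustration.

```python
import hashlib

def encode(patron_id):
    """Deterministic code derived from the original ID (the practice to avoid)."""
    return hashlib.sha256(patron_id.encode("utf-8")).hexdigest()

def reverse_by_enumeration(code):
    """Recover the ID by encoding every candidate of the assumed form ddd-dd."""
    for a in range(1000):
        for b in range(100):
            candidate = f"{a:03d}-{b:02d}"
            if encode(candidate) == code:
                return candidate
    return None

stored_code = encode("392-33")
recovered = reverse_by_enumeration(stored_code)
```

Because the ID space has only 100,000 values, anyone who knows the encoding procedure can rebuild the mapping in well under a second, which is why such a code would not hold up in an investigation.
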
Coding and not discarding
Use a code when there is some aspect of the ID that is important
  - Example: IP addresses
Think about the use of the field, and code appropriately
Do not generate code from original; rather, use other methods for code that capture key information

Methods for dealing with
Personally Identifiable Information
Use for matching and discard

Original Web Server Transaction Log
IP Address      Time/Date        Page Retrieved       Referring Page
12.90.201.23    10:32/10-29-02   Index.html           Google.com
98.28.189.49    10:33/10-29-02   Resources/oclc.asp   Index.html
12.90.201.23    10:35/10-29-02   Reference.html       Index.html
12.90.201.23    10:36/10-29-02   Databases.html       Reference.html
98.28.189.49    10:37/10-29-02   Resources/oclc.asp   Firstsearch.html

Data Warehouse - Cleaned Web Transaction Records
IP Identifier    Time/Date        Page Retrieved       Referring Page
102902-1032-A    10:32/10-29-02   Index.html           Google.com
102902-1033-A    10:33/10-29-02   Resources/oclc.asp   Index.html
102902-1032-A    10:35/10-29-02   Reference.html       Index.html
102902-1032-A    10:36/10-29-02   Databases.html       Reference.html
102902-1033-A    10:37/10-29-02   Resources/oclc.asp   Firstsearch.html

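The IP identifiers in this example appear to be built from the date and time at which an address is first seen, plus a letter, rather than from the address itself. That reading is an assumption, as is the role of the letter (taken here to separate addresses first seen in the same minute). Under those assumptions, a minimal sketch of the coding step might look like this:

```python
import string

def assign_identifiers(log):
    """Replace each IP address with a code based on when it was first seen."""
    ids = {}    # IP -> identifier; used for matching, discarded after the load
    used = {}   # date-time prefix -> number of IPs first seen in that minute
    cleaned = []
    for ip, stamp, page, referrer in log:
        if ip not in ids:
            time_part, date_part = stamp.split("/")
            prefix = date_part.replace("-", "") + "-" + time_part.replace(":", "")
            n = used.get(prefix, 0)
            used[prefix] = n + 1
            # This sketch supports at most 26 new addresses per minute.
            ids[ip] = f"{prefix}-{string.ascii_uppercase[n]}"
        cleaned.append((ids[ip], stamp, page, referrer))
    return cleaned

log = [
    ("12.90.201.23", "10:32/10-29-02", "Index.html", "Google.com"),
    ("98.28.189.49", "10:33/10-29-02", "Resources/oclc.asp", "Index.html"),
    ("12.90.201.23", "10:35/10-29-02", "Reference.html", "Index.html"),
    ("12.90.201.23", "10:36/10-29-02", "Databases.html", "Reference.html"),
    ("98.28.189.49", "10:37/10-29-02", "Resources/oclc.asp", "Firstsearch.html"),
]

cleaned_log = assign_identifiers(log)
identifiers = [row[0] for row in cleaned_log]
```

The identifier still groups requests into sessions, which is the key information for analysis, but nothing in it can be computed back to the address once the lookup table is discarded.
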
Dealing with categories
Make sure that combinations of categories don’t identify an individual.

Original Demographics
           Ugrad   Grad   Faculty
English    27      5      8
C. Sci     14      3      7
Math       33      1      6
Psych      24      6      7
Bus.       24      14     5

Cleaned Demographics
              Ugrad   Grad   Faculty
English       27      5      8
C. Sci/Math   47      4      13
Psych         24      6      7
Bus.          24      14     5

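The merge in the demographics example can be sketched as two small helpers: one that finds cells small enough to identify a person, and one that combines two rows. The function names and the `min_count` threshold of 2 are assumptions; the counts are those from the slide.

```python
def small_cells(table, min_count=2):
    """List (dept, class) cells whose count falls below min_count."""
    return [(dept, cls)
            for dept, row in table.items()
            for cls, n in row.items()
            if n < min_count]

def merge_rows(table, a, b):
    """Return a new table with rows a and b summed into one 'a/b' row."""
    merged = {k: v for k, v in table.items() if k not in (a, b)}
    merged[f"{a}/{b}"] = {cls: table[a][cls] + table[b][cls] for cls in table[a]}
    return merged

original = {
    "English": {"Ugrad": 27, "Grad": 5, "Faculty": 8},
    "C. Sci": {"Ugrad": 14, "Grad": 3, "Faculty": 7},
    "Math": {"Ugrad": 33, "Grad": 1, "Faculty": 6},
    "Psych": {"Ugrad": 24, "Grad": 6, "Faculty": 7},
    "Bus.": {"Ugrad": 24, "Grad": 14, "Faculty": 5},
}

flagged = small_cells(original)
cleaned = merge_rows(original, "C. Sci", "Math")
still_flagged = small_cells(cleaned)
```

Here the single Math graduate student is the only flagged cell, and merging Math with C. Sci removes it. Which rows to merge is a judgment call about which departments are related.
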
Dealing with Textual data
Digital Reference transactions
Easy to deal with the metadata
Hard to deal with the text
  - Manual cleaning of PII
  - Natural Language Processing research
Similar problem with deidentification of medical records

People to Involve
Institutional Review Board (IRB)
Legal counsel
  - Ensures you are following state laws for library data
Library administration / Board
Patrons
  - If there are policies, follow them
  - If there are not, create them

Benefits to creating the Data
Warehouse
Cleaned resource, ready for analysis
  - Outside of operational system
Use for regular reports and research
Forces library to examine the life of data
  - Are there backup tapes created?
  - How long are backups kept?

Striking a Balance
A well-designed data warehouse strikes the balance between
Protecting Privacy
and
Maintaining a Data-Based History

For more information
About bibliomining:
  - http://bibliomining.com
About an active data warehouse project:
  - http://metrics.library.upenn.edu/prototype/datafarm/
About this presentation:
  - http://bibliomining.com/nicholson
  - “The Bibliomining process: Data warehousing and data mining for library decision-making”