Striking a Balance: Bibliomining and Privacy

Scott Nicholson, Assistant Professor
Syracuse University School of Information Studies
http://bibliomining.org
[email protected]

Please do not reuse these slides without prior permission from [email protected].
Copyright 2004 Scott Nicholson, Syracuse University School of Information Studies


What is Bibliomining?

Bibliomining is the combination of bibliometrics and data mining, used on the data produced during the operation of libraries (physical and digital).


What is Bibliomining?

- Application of advanced analysis tools to data produced by libraries
- May include
  - Data mining
  - Bibliometrics (patterns in scholarship)
  - Online analytical processing (OLAP)
  - Other statistical techniques


Goals of Bibliomining

- Improved decision-making through better understanding of
  - Patron behavior
  - Library staff behavior
  - Behavior of outside organizations
- Can provide justification for
  - Library management policies and decisions
  - Acquisitions and ILL source selection
  - Collection development decisions
  - Use of library services (funding bodies)


Steps in Bibliomining

- Determine areas of focus
  - Prediction vs. description
- Determine data source needs
  - Internal and external
- Gather data
- Create data warehouse
- Select appropriate analysis tools
- Create and test models / Create reports
- Analyze results


Creating the data warehouse

- A data warehouse is a collection of cleaned and anonymized data in a relational database, and a point for queries
- Sits outside of the operational systems
- Connects disparate data sources into an easily accessible database
- Can be built on a one-time basis or updated on a regular basis


Steps in the Warehousing Process

- Identify fields of interest
- Determine fields that contain personally identifiable information (PII)
- Determine combinations of fields that create PII (dept. + level + gender)


Methods for dealing with Personally Identifiable Information

Use codes and IDs for matching, then discard them.

Original Patron Database

Patron   Name             Class    Dept.
373-34   Abby Lavender    Grad     Psych
392-33   Kenneth Moore    Ugrad    Math
575-49   Sophie Richards  Faculty  English

Original Circulation Records

Book ID     Subject              Patron
QA76.9      Computer Science     392-33
PS159.G8    American Literature  575-49
HF5415.125  Marketing            392-33

Data Warehouse - Combined Cleaned Circulation Records

Book ID     Subject              Patron Class  Patron Dept.
QA76.9      Computer Science     Ugrad         Math
PS159.G8    American Literature  Faculty       English
HF5415.125  Marketing            Ugrad         Math


Codes for PII

- One typical suggestion: code the PII fields, and then record the codes in the database
  - Appropriate for other parties
- Do not use a reversible encoding procedure to encode variables
  - A reversible code does not protect patrons' information from an investigation


Coding and not discarding

- Use a code when there is some aspect of the ID that is important
  - Example: IP addresses
- Think about the use of the field, and code appropriately
- Do not generate the code from the original value; rather, use other methods to build a code that captures the key information


Methods for dealing with Personally Identifiable Information

Use for matching and discard.

Original Web Server Transaction Log

IP Address    Time/Date       Page Retrieved      Referring Page
12.90.201.23  10:32/10-29-02  Index.html          Google.com
98.28.189.49  10:33/10-29-02  Resources/oclc.asp  Index.html
12.90.201.23  10:35/10-29-02  Reference.html      Index.html
12.90.201.23  10:36/10-29-02  Databases.html      Reference.html
98.28.189.49  10:37/10-29-02  Resources/oclc.asp  Firstsearch.html

Data Warehouse - Cleaned Web Transaction Records

IP Identifier  Time/Date       Page Retrieved      Referring Page
102902-1032-A  10:32/10-29-02  Index.html          Google.com
102902-1033-A  10:33/10-29-02  Resources/oclc.asp  Index.html
102902-1032-A  10:35/10-29-02  Reference.html      Index.html
102902-1032-A  10:36/10-29-02  Databases.html      Reference.html
102902-1033-A  10:37/10-29-02  Resources/oclc.asp  Firstsearch.html


Dealing with categories

Make sure that combinations of categories don't identify an individual.

Original Demographics

          Ugrad  Grad  Faculty
English   27     5     8
C. Sci    14     3     7
Math      33     1     6
Psych     24     6     7
Bus.      24     14    5

Cleaned Demographics

             Ugrad  Grad  Faculty
English      27     5     8
C. Sci/Math  47     4     13
Psych        24     6     7
Bus.         24     14    5
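
The cleaning steps on the preceding slides can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's implementation. The sample rows come from the slides. The function names, the `MIN_CELL` threshold, and the rule that the trailing letter of an IP identifier separates addresses first seen in the same minute are all hypothetical choices made for this example.

```python
# Sample rows copied from the slides.
PATRONS = {
    "373-34": {"name": "Abby Lavender", "class": "Grad", "dept": "Psych"},
    "392-33": {"name": "Kenneth Moore", "class": "Ugrad", "dept": "Math"},
    "575-49": {"name": "Sophie Richards", "class": "Faculty", "dept": "English"},
}
CIRCULATION = [
    ("QA76.9", "Computer Science", "392-33"),
    ("PS159.G8", "American Literature", "575-49"),
    ("HF5415.125", "Marketing", "392-33"),
]
WEB_LOG = [
    ("12.90.201.23", "10:32/10-29-02", "Index.html", "Google.com"),
    ("98.28.189.49", "10:33/10-29-02", "Resources/oclc.asp", "Index.html"),
    ("12.90.201.23", "10:35/10-29-02", "Reference.html", "Index.html"),
    ("12.90.201.23", "10:36/10-29-02", "Databases.html", "Reference.html"),
    ("98.28.189.49", "10:37/10-29-02", "Resources/oclc.asp", "Firstsearch.html"),
]
DEMOGRAPHICS = {
    "English": {"Ugrad": 27, "Grad": 5, "Faculty": 8},
    "C. Sci": {"Ugrad": 14, "Grad": 3, "Faculty": 7},
    "Math": {"Ugrad": 33, "Grad": 1, "Faculty": 6},
    "Psych": {"Ugrad": 24, "Grad": 6, "Faculty": 7},
    "Bus.": {"Ugrad": 24, "Grad": 14, "Faculty": 5},
}


def clean_circulation(circulation, patrons):
    """Match on patron ID, keep class and department, discard ID and name."""
    return [
        (book_id, subject, patrons[pid]["class"], patrons[pid]["dept"])
        for book_id, subject, pid in circulation
    ]


def code_ip_addresses(log):
    """Swap each IP for a code built from its first-seen time, not from the IP."""
    codes = {}  # IP -> code; kept in memory only, never written to the warehouse
    cleaned = []
    for ip, time_date, page, referrer in log:
        if ip not in codes:
            hhmm, mmddyy = time_date.split("/")
            stem = mmddyy.replace("-", "") + "-" + hhmm.replace(":", "")
            # Assumed rule: the letter separates IPs first seen in the same minute.
            taken = sum(1 for code in codes.values() if code.startswith(stem))
            codes[ip] = f"{stem}-{chr(ord('A') + taken)}"
        cleaned.append((codes[ip], time_date, page, referrer))
    return cleaned


def merge_departments(counts, merge_map):
    """Sum the rows of departments that merge_map folds into one category."""
    merged = {}
    for dept, by_class in counts.items():
        row = merged.setdefault(merge_map.get(dept, dept), {})
        for patron_class, n in by_class.items():
            row[patron_class] = row.get(patron_class, 0) + n
    return merged


def small_cells(counts, threshold):
    """List the (dept, class, count) cells that fall below the threshold."""
    return [
        (dept, patron_class, n)
        for dept, by_class in counts.items()
        for patron_class, n in by_class.items()
        if n < threshold
    ]


MIN_CELL = 3  # hypothetical threshold; the slides do not give one
merged = merge_departments(
    DEMOGRAPHICS, {"C. Sci": "C. Sci/Math", "Math": "C. Sci/Math"}
)
print(small_cells(DEMOGRAPHICS, MIN_CELL))  # cells that could single out a patron
print(small_cells(merged, MIN_CELL))        # cells still too small after merging
```

Because the IP code is derived from the first-seen timestamp rather than from the address, it cannot be reversed to recover the address once the in-memory lookup is discarded, yet repeat visits by the same address still share one code. What counts as a cell small enough to identify someone is a policy decision for the library; the threshold here only makes the check concrete.
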


Dealing with Textual data

- Digital reference transactions
  - Easy to deal with the metadata
  - Hard to deal with the text
- Manual cleaning of PII
- Natural Language Processing research
- Similar problem with de-identification of medical records


People to Involve

- Institutional Review Board (IRB)
- Legal counsel
  - Ensures you are following state laws for library data
- Library administration / Board
- Patrons
- If there are policies, follow them
- If there are not, create them


Benefits to creating the Data Warehouse

- Cleaned resource, ready for analysis
- Outside of operational system
- Use for regular reports and research
- Forces library to examine the life of data
  - Are there backup tapes created?
  - How long are backups kept?


Striking a Balance

A well-designed data warehouse strikes the balance between protecting privacy and maintaining a data-based history.


For more information

- About bibliomining: http://bibliomining.com
- About an active data warehouse project: http://metrics.library.upenn.edu/prototype/datafarm/
- About this presentation: http://bibliomining.com/nicholson
  - "The Bibliomining process: Data warehousing and data mining for library decision-making"