Text and Data Mining
Linda Pikula, NOAA
[email protected]
OceanTeacher Global Academy, Digital Asset Management and Preservation
30 September – 4 October 2013
KMFRI, Mombasa, Kenya

TDM = Text Mining
“automated processing of large amounts of structured digital textual content for purposes of information retrieval, extraction, interpretation and analysis”
Bernie Reilly, Center for Research Libraries (CRL)

TDM = Data Mining
• Overview
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Another Definition
“automated tools, techniques or technology to process large volumes of digital content that is often not well structured…to identify and select relevant information; to extract information from the content, to identify relationships within/between/across documents and incidents or events for meta-analysis”
Eefke Smit

Another Definition
Text mining discovers themes, patterns, emerging issues and insights buried in document collections. By automatically reading text and delivering algorithms for rigorous, advanced analyses, the solution makes it possible to grasp future trends and act on new opportunities more precisely and with less risk. It can include advanced linguistic capabilities within the core data mining solution.
SAS definition

TDM
Business uses vary from scholarly uses.

Class Discussion
How might business use data mining? Health sciences? Scholarly uses?
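
Illustration
The definitions above describe text mining as finding themes, patterns and relationships across a collection of documents. The following is a minimal Python sketch of that idea, not a production tool. Everything in it is an assumption made for illustration: the three sample texts, the short stopword list, and the function names tokenize, build_index and cooccurrence are hypothetical. Real TDM software applies much richer linguistic processing.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical mini-corpus; a real project would load licensed full text.
DOCS = {
    "doc1": "Spinal cord injury outcomes improve with early rehabilitation.",
    "doc2": "Early rehabilitation after spinal cord injury reduces complications.",
    "doc3": "Ocean temperature data show warming trends.",
}

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"with", "after", "the", "a", "of", "and", "show"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def cooccurrence(docs):
    """Count, for each unordered pair of terms, the documents containing both."""
    pairs = Counter()
    for text in docs.values():
        terms = sorted(set(tokenize(text)))
        pairs.update(combinations(terms, 2))
    return pairs


index = build_index(DOCS)
pairs = cooccurrence(DOCS)

# Show the five term pairs that appear together in the most documents.
for (a, b), n in pairs.most_common(5):
    print(a, b, n)
```

Here index works like a simple back-of-book index (term to documents), and pairs gives a rough view of which terms tend to appear together. With these sample texts, "spinal" and "cord" occur together in two documents, which is the kind of relationship a topical map would surface.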

Reasons for TDM
• To enrich content
• Systematic review of literature
• Discovery
• Computational linguistics research

Steps in TDM
1. Researchers must be able to process large amounts of content: automated
2. Researchers must identify questions to be asked
3. Must be able to find the right sources to be mined
4. Must be able to access these sources
5. Must be able to download the results
6. To analyze and interpret

Hurdles to overcome
1. Software required?
2. Construct proper query
3. Obtain permission to access, if not subscribed by an institution: licensing problems
4. Varying formats: no standard formats for storage

Librarians' Role in Text/Data Mining
1. Advise on license language: develop publishers' licenses that address TDM. See the work of California Digital Library, JISC and CRL.
2. Assist researchers in TDM: inform them of the TDM process and what data mining can do for them, and connect them with the tools to accomplish TDM. Through interviews, develop strategies and “pilot studies”.

User Case
• Since 1982: 90,000 journal articles on spinal cord injury
• There has been an average of 22 journal articles a day on spinal cord injury
• How can all this information be analyzed?

TDM
With the help of automated software, a large amount of data and text is processed to identify entities, instances, actions, relationships and patterns for further analysis.

Typical TDM Content
Text mining output typically consists of a new metadata layer for information:
- Journal article clusters and categorizations, indexes
- Topical maps, to show the occurrence of topics and their interrelationships
- Databases with facts, patterns, relationships, statements, assertions and properties found in the articles
- Visualisations: graphs, mappings, plot-graphs and topical maps

Class: Please View
Smit, Eefke and Maurits van der Graaf.
Content Mining: a short introduction to practices and policies. Presented for Center for Research Libraries, July 17, 2013 (CRL Global Resources Forum).
http://www.crl.edu/sites/default/files/follow_up_material/Smit.pdf

Class: Please Read
https://blogs.libraries.iub.edu/scholcomm/2013/01/07/a-guide-to-text-and-data-mining-at-indiana-universitybloomington/
http://www.libraries.iub.edu/index.php?pageId=530000216

Tools for searching the Deep Web
• Deep Dyve http://www.deepdyve.com
• Deep Web Technologies http://www.deepwebtech.com
• WorldWideScience.org http://worldwidescience.org
• Deep Web Harvester from BrightPlanet http://www.brightplanet.com

Credits
Okerson, Ann. Text & Data Mining: A Librarian Overview. IFLA WLIC, Singapore, August 7, 2013. http://library.ifla.org/252/1/165-okerson-en.pdf
Smit, Eefke and Maurits van der Graaf. Content Mining: a short introduction to practices and policies. Presented for Center for Research Libraries, July 17, 2013 (CRL Global Resources Forum). http://www.crl.edu/sites/default/files/follow_up_material/Smit.pdf
Speirs, Martha A. Data mining for scholarly journals: challenges and solutions for libraries. IFLA WLIC 2013, June 28, 2013.
EMEA regional council meeting connects members to the latest in library data research: Mining insights from 50 million books. NEXTSpace no. 21, May 2013.
YouTube: Text/Data mining, libraries and online publishers, July 17, 2013. CRL. http://www.youtube.com/watch?v=2e1xymY9ePg
Chiang, Katherine. Data mining, data fusion, and libraries. June 21, 2010. 31st Annual IATUL Conference. Paper 4. http://docs.lib.purdue.edu/iatul2010/conf/day1/4