Download Data mining - NYU Computer Science

Data mining

Modern organizations are overwhelmed with data, producing mountains of textual documents, spreadsheets, and web pages. Every activity of the organization yields more documents with each passing day. But a surprisingly large volume of data remains relatively inaccessible. Any organization is likely to have data that remain tantalizingly out of reach, either through a loss of institutional memory (“Fred ran that project, but he took the early retirement package”) or through needle-in-a-haystack search complexity.

The idea that information resources should be counted as an asset, and managed to yield maximum advantage, emerges in cycles every few years. The concept has most recently manifested itself as part of the “knowledge management” approach. However the idea is packaged, there is a growing acknowledgement that information resources are important, and that they should be consciously cultivated.

The topic of information extraction involves more than databases and indexing documents. Often, the greatest institutional memory of any organization is that of its own people. However, people are not so amenable to the usual automated computer-based search and retrieval techniques. The challenge is how to effectively submit an inquiry to the target group of people and get constructive responses in a timely manner.

The “people” solution involves a multi-pronged approach:
• Provide network connectivity to the field offices.
• Enable access to the internal web system, with policies, protocols, and old documents available electronically.
• Provide access to existing database systems.
• Index existing document archives to enable search-engine-like query and response.
Then, through a combination of official and self-interested participation, individual Knowledge Networks benefit from the participation of the most knowledgeable experts within the system, and thus much of the undocumented history of an organization and its prior activities is made accessible to the participants.

The recognition of the importance of people in building computer systems is also inherent in the construction of “expert systems”, wherein the knowledge and experience of an expert in a particular field are captured and stored as a set of rules, such that queries submitted yield answers comparable to those of an expert.

Indexing is a form of information extraction; so, too, is the process of crawling to support more comprehensive indexing. It is, in a way, a form of “mining” for content that might otherwise remain relatively unknown, but the term “data mining” usually implies something more subtle. The concept of “data mining” is based upon the premise that information about information is itself useful information. Given vast amounts of data, it is possible to detect patterns that might provide insights useful in making strategic decisions on issues ranging from the content and structure of a web site to the selection of promotional items and the generation of marketing strategies.

“A class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior. For example, data mining software can help retail companies find customers with common interests. The term is commonly misused to describe software that presents data in new ways. True data mining software doesn't just change the presentation, but actually discovers previously unknown relationships among the data.
Data mining is popular in the science and mathematical fields but also is utilized increasingly by marketers trying to distill useful consumer data from Web sites.” [1]

[1] http://www.webopedia.com/TERM/d/data_mining.html

Note that this definition addresses database applications, but the term is now used much more expansively to cover almost any application that seeks to extract information or patterns from volumes of raw data.

One of the earliest applications of data mining was the problem of reliably extracting data from semi-structured text sources and transforming it into the more structured format of fields in a relational database system. Imagine, for example, that a system was designed to scan through the Wall Street Journal on a daily basis and extract from the newspaper every reference made to the resignations and appointments of CEOs in US corporations [2]. The process involves a form of indexing, recognition of the key words and context, and the transformation of the raw text into a structured form. It is consistent with the idea of “mining” because it creates a substantial information resource that did not otherwise exist, effectively extracting content from the data source.

[2] This idea was, in fact, used as an example in Roman Yangarber’s PhD dissertation at New York University, “Scenario Customization for Information Extraction”, 2000.

Data mining is generally broken into two main applications: mining for content, and mining for usage patterns and structure. Data mining represents another concept that predates the emergence of the web, but the fact that seemingly all digital data is available through the web, and the web is readily accessible, means that most research and development of mining techniques is applied to the web setting. The web provides the scale and range of data necessary to test and challenge the capacity of any mining application.
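
The Wall Street Journal scenario described above, turning free text about CEO appointments and resignations into database-style fields, can be illustrated with a very small sketch. This is not the method used in the cited dissertation; the two regular-expression patterns, the `ExecutiveChange` record, and the sample sentences are all hypothetical, and a real extraction system would need far richer linguistic rules.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ExecutiveChange:
    """One structured record, comparable to a row in a relational table."""
    person: str
    event: str      # "appointed" or "resigned"
    company: str


# Hypothetical, deliberately naive patterns: a person is two or more
# capitalized words, a company is one or more capitalized words.
NAME = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"
COMPANY = r"(?P<company>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
PATTERNS = [
    (re.compile(COMPANY + r" (?:named|appointed) " + NAME
                + r" (?:as )?(?:its )?(?:chief executive|CEO)"), "appointed"),
    (re.compile(NAME + r" resigned as (?:chief executive|CEO) of " + COMPANY),
     "resigned"),
]


def extract_changes(text: str) -> List[ExecutiveChange]:
    """Scan raw text and return one record per recognized key phrase."""
    records = []
    for pattern, event in PATTERNS:
        for match in pattern.finditer(text):
            records.append(
                ExecutiveChange(match.group("person"), event, match.group("company"))
            )
    return records


sample = ("Acme Widgets named Jane Doe as its CEO on Monday. "
          "John Smith resigned as chief executive of Globex Corp, the company said.")
for record in extract_changes(sample):
    print(record)
```

The key words ("named", "resigned as") supply the context, and the named groups supply the fields; each record could then be inserted into a table with person, event, and company columns.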
There once was an Internet startup called EZWays.com, with a dream of gathering all of the transport sector schedule data into one integrated travel routing engine. They collected routes, schedules, and pricing data for all public transportation, such as buses, ferries, trains, and subways, in a specific region of the country. The user could then evaluate all the alternatives to air travel for a given journey, and compare them based upon trip cost and trip duration.

The relevant routing information for public transport agencies is readily available, published on the web sites of the respective transit authorities. It is therefore an attractive candidate for a targeted data mining operation. A web crawler was designed to visit the transit sites and retrieve pages likely to contain schedule data. The pages were then parsed and analyzed to determine what information they contained.

The presentation of schedule data generally falls into a relatively small number of formats. The system was able to recognize such data and apply extraction techniques unique to each format. For example, some of the transport data was stored in PDF format. The system converted tables in a PDF file into HTML counterparts, and used the same strategy for information extraction in PDF settings as it did for web pages. The system was even able to parse schedules given partly in text phrases (such as “and then every twenty minutes until 4:00 am”), and in that limited context determine the likely information content. The system included consistency checks to determine that the schedules as presented were in fact plausible; it was soon apparent that some published schedules were internally inconsistent!

The EZ-Ways system is an example of “mining for content”. It established a narrow range of target sites, and extracted fairly well defined data sets that met a relatively small range of patterns. While some structural evaluation took place in the process, the emphasis was on content extraction.
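
Two of the steps described above, expanding a text phrase such as “and then every twenty minutes until 4:00 am” into explicit departure times and checking a schedule for plausibility, can be sketched as follows. The source does not describe how the original system implemented either step, so everything here is an assumption: the phrase pattern, the small word-to-number table, the rule that an end time earlier than the last listed departure means service past midnight, and the function names.

```python
import re
from datetime import datetime, timedelta
from typing import List

# Assumed vocabulary; a real parser would cover many more phrasings.
WORD_NUMBERS = {"five": 5, "ten": 10, "fifteen": 15,
                "twenty": 20, "thirty": 30, "sixty": 60}
PHRASE = re.compile(
    r"every (?P<step>\w+) minutes until (?P<end>\d{1,2}:\d{2} ?[ap]m)",
    re.IGNORECASE,
)
CLOCK = re.compile(r"(\d{1,2}):(\d{2}) ?([ap])m", re.IGNORECASE)


def parse_clock(text: str, day: datetime) -> datetime:
    """Turn a 12-hour clock string such as '4:00 am' into a time on `day`."""
    hour, minute, half = CLOCK.fullmatch(text.strip()).groups()
    hour = int(hour) % 12 + (12 if half.lower() == "p" else 0)
    return day.replace(hour=hour, minute=int(minute), second=0, microsecond=0)


def expand_phrase(last_listed: datetime, phrase: str) -> List[datetime]:
    """Generate the departures implied by the phrase, after `last_listed`."""
    match = PHRASE.search(phrase)
    if match is None:
        raise ValueError(f"unrecognized schedule phrase: {phrase!r}")
    step_text = match.group("step").lower()
    step = int(step_text) if step_text.isdigit() else WORD_NUMBERS[step_text]
    end = parse_clock(match.group("end"), last_listed)
    if end <= last_listed:
        end += timedelta(days=1)  # assumption: service continues past midnight
    times = []
    current = last_listed + timedelta(minutes=step)
    while current <= end:
        times.append(current)
        current += timedelta(minutes=step)
    return times


def is_plausible(stop_times: List[datetime]) -> bool:
    """Consistency check: times at successive stops must never go backwards."""
    return all(a <= b for a, b in zip(stop_times, stop_times[1:]))


departures = expand_phrase(datetime(2000, 1, 1, 23, 0),
                           "and then every twenty minutes until 4:00 am")
print(len(departures), departures[0], departures[-1])
```

A check like `is_plausible` is only one of many possible tests; it would flag, for example, a published trip that arrives at its third stop before it leaves the second.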
We have seen several examples of web mining for content, but as previously noted, there is much to be learned from the structure of the pages (as in the page rank example) and the data associated with the way that users browse a web site. Information gathered from web server logs, client-side agents, and proxy server caching data can be used to detect usage patterns that might otherwise go unnoticed.

The concept of web usage mining is becoming more popular as a means to determine just how visitors use the web, and which parts of a web site seem to be most effective. The usage mining results are then applied to reformulate web sites so as to increase effectiveness, and formulate strategies for target marketing and personalization. Web usage mining is typically organized into three main stages: preprocessing, pattern discovery, and pattern analysis.

The main benefits associated with web usage analysis are related to the assessment of the structure of the site, and how this might be improved, and the potential to provide web personalization at an individual or group level to enhance the experience of the users. For many eCommerce merchants, web usage analysis is an extremely informative tool that helps identify the demographics of their customer base, improve the presentation and delivery of information on the site, enhance the management of the customer relationship, and potentially increase “yield” of sales per customer through target marketing and cross-promotion.

A potential problem associated with web usage analysis is that while the data is extremely useful for a merchant involved in eCommerce, the amount of data collected and the information it yields may constitute a potential breach of user privacy. If a user visits the site with the belief that the browsing is relatively anonymous, but in fact the site is collecting session data on every page view and duration, the potential for an adverse reaction may outweigh the benefits from collecting the data.
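
The three stages of web usage mining named above (preprocessing, pattern discovery, and pattern analysis) can be made concrete with a minimal sketch over web server log lines. It assumes the logs are in Common Log Format, identifies a visitor by host address alone, splits sessions at an assumed 30-minute gap, and treats page-to-page transition counts as the "pattern"; these are illustrative choices, not techniques prescribed by the text.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold


def preprocess(lines):
    """Stage 1: parse lines, drop failed requests, split each host into sessions."""
    visits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or not match.group("status").startswith("2"):
            continue
        when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        visits[match.group("host")].append((when, match.group("path")))
    sessions = []
    for hits in visits.values():
        hits.sort()
        current = [hits[0][1]]
        for (previous_time, _), (when, path) in zip(hits, hits[1:]):
            if when - previous_time > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(path)
        sessions.append(current)
    return sessions


def discover_patterns(sessions):
    """Stage 2: count how often one page is followed directly by another."""
    transitions = Counter()
    for pages in sessions:
        transitions.update(zip(pages, pages[1:]))
    return transitions


def analyze(transitions, top=3):
    """Stage 3: keep only the most frequent transitions for a human to review."""
    return transitions.most_common(top)


log = [
    '10.0.0.1 - - [10/Oct/2000:13:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [10/Oct/2000:13:01:00 +0000] "GET /products.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:02:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [10/Oct/2000:13:03:00 +0000] "GET /products.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:04:00 +0000] "GET /missing.html HTTP/1.0" 404 0',
    '10.0.0.1 - - [10/Oct/2000:15:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
]
print(analyze(discover_patterns(preprocess(log))))
```

Even this toy version touches the privacy concern raised above: grouping every page view by host and time is exactly the kind of session data a visitor may not expect to be collected.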
• A comprehensive introduction to data mining is available in the “Tutorial on High Performance Data Mining”, at http://wwwusers.cs.umn.edu/~mjoshi/hpdmtut/.
• A good introduction to the general field of data mining as it applies to the web can be found in the paper “Web Mining Research: A Survey”, by Kosala and Blockeel, published in the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), July 2000. It is available through the Citeseer service at http://citeseer.nj.nec.com/kosala00web.html.
• An introduction to the application of data mining techniques to ascertain patterns in web use can be found in “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data”, by Srivastava, Cooley, Deshpande, and Tan, SIGKDD January 2000. http://citeseer.nj.nec.com/srivastava00web.html.
