Information Retrieval
Chapter 2
by Rajendra Akerkar, Pawan Lingras
Presented by: Xxxxxx

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Process: Information Retrieval
Figure 2.1 Transforming a text document to a weighted list of keywords
1. The first step in transforming a document is simply to list all the words in a document.
2. The second step is removal of some of the most commonly occurring words.

Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion.

Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad-hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.
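The first two steps of the transformation process (listing the words, then removing the most commonly occurring ones) might be sketched in Python as follows. This is only an illustration: the function names, the regular expression used to split words, and the tiny stop-word list are assumptions made here, not taken from the chapter.

```python
import re

# Illustrative stop-word list (an assumption; the chapter's own list is not given here).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def list_words(text):
    """Step 1: list all the words in a document, lowercased."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Step 2: drop the most commonly occurring words."""
    return [w for w in words if w not in stop_words]


doc = "The retrieval of documents is the goal of an IR system."
keywords = remove_stop_words(list_words(doc))
print(keywords)  # ['retrieval', 'documents', 'goal', 'ir', 'system']
```

A real system would normally use a much longer stop-word list and a more careful tokenizer (for example, one that handles numbers and hyphenated words).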
A given word may occur in a variety of syntactic forms:
◦ plurals
◦ past tense
◦ gerund forms (a noun derived from a verb)

A stem is what is left of a word after its affixes (prefixes and suffixes) are removed.

The word connect may appear as:
◦ connector, connection, connections, connected, connecting, connects, preconnection, and postconnection
◦ or, ion, ed, ing, and s are suffixes
◦ pre and post are prefixes

Use of stems may arguably improve retrieval performance:
◦ Users rarely specify the exact forms of the word they are looking for
◦ It is reasonable to retrieve documents with similar words

Calculating frequency of each word

Term Document Matrix
• A term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent various documents.
• Columns correspond to various index terms.
• Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row).

Thank You
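Stemming, frequency counting, and the term-document matrix described above could be sketched as follows. This is a deliberately naive illustration: the stemmer only knows the affixes from the "connect" example, whereas a practical stemmer (such as Porter's) applies many more rules. The names `naive_stem` and `term_document_matrix` and the three-letter minimum stem length are assumptions introduced here.

```python
from collections import Counter

# Affixes from the "connect" example only; a real stemmer uses many more rules.
PREFIXES = ("pre", "post")
SUFFIXES = ("ion", "ing", "ed", "or", "s")
MIN_STEM = 3  # assumed guard so short words are not stripped to nothing


def naive_stem(word):
    """Strip at most one prefix, then suffixes repeatedly."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


def term_document_matrix(docs):
    """Rows are documents, columns are index terms, values are frequencies."""
    counts = [Counter(naive_stem(w) for w in doc) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


docs = [["connected", "connections", "user"], ["user", "connecting"]]
terms, matrix = term_document_matrix(docs)
print(terms)   # ['connect', 'user']
print(matrix)  # [[2, 1], [1, 1]]
```

Here the values are raw frequencies; replacing them with weights would only change how each cell is computed, not the row and column layout.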