Automated subject classification of textual web documents

... Another difference within the text categorization approach is in the document pre-processing and indexing part, where documents are represented as vectors of term weights. Computing the term weights can be ...

Semantics-Based Spam Detection by Observance of Outgoing

... Abstract—The existing spam detection systems are mostly keyword-based and find the spam message present in the outgoing message by matching the keyword. The quality of the results provided by traditional keyword-based spam detection is not optimal for finding the spam information present in the message. T ...

Distractor Quality Analyze In Multiple Choice Questions

... may be too cumbersome and inconvenient to use. Therefore, we should use a mathematical model. The simplest and most natural way is to define Boolean model. In construction of Boolean model, we introduce the following interpretation of the logical variables of Boolean functions. Let’s denote attribut ...

On Word Frequency Information and Negative Evidence in Naive

... Why is the performance of the multinomial Naive Bayes classifier improved when the word frequency information is eliminated in the documents? In [6] and [7] the distribution of terms in documents was studied. It was found that terms often exhibit burstiness: the probability that a term appears a sec ...

Manifold Alignment using Procrustes Analysis

Can Word Probabilities from LDA be Simply Added up to Represent

Intelligent Search on the Internet

... present in the query are lost, therefore relevant information is not retrieved. - polysemy occurs when a term has several different meanings; it causes irrelevant documents to appear in the result lists. In order to solve such problems, documents are represented through underlying concepts. The conce ...

Semantic Outlier Detection for Affective Common-Sense Reasoning and Concept-Level Sentiment Analysis Erik Cambria

... Sentic computing (Cambria and Hussain 2015) tackles these crucial issues by exploiting affective common-sense reasoning, i.e., the intrinsically-human capacity to interpret the cognitive and affective information associated with natural language and, hence, to infer new knowledge and make decisions, ...

Matching Ottoman Words: An image retrieval approach to historical

... Chan et al. [4] presented a segmentation based approach that utilizes gHMMs with a bi-gram letter transition model. Their lexicon-free system performs text queries on off-line printed and handwritten Arabic documents. Saykol et al. [20] used the idea of compression for content-based retrieval of Otto ...

Word Sense Disambiguation for Arabic Text Categorization

... others representations. The main difficulty in this approach is that it is not capable of determining the correct senses. For a word that has multiple synonyms, they choose the first concept to determine the nearest concept. The work in [14] is a comparative study with the other usual modes of repre ...

Keyword Extraction from a Single Document

... used relatively impartially with each frequent term, while a term such as “imitation” or “digital computer” shows cooccurrence especially with particular terms. These biases are derived from either semantic, lexical, or other relations of two terms. Thus, a term with co-occurrence biases may have an ...

Ontology construction for information classification

... only pose possible setbacks due to the quality of the dictionary, it will also prove incapable of adapting to the incessantly changing environment ...

Cross-Language Information Retrieval

... Phrasal Translation and Query Expansion Techniques for Cross-language Information Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1995. Resolving Ambiguity for Cross-Language Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in ...

Magnifico: A Platform For Expert Mining Using Metadata

... Afterwards we measure the importance of each word for a given sub-discipline using the term frequency of that word occurring in the specific sub-discipline. For every publication, after collecting all the words appearing in the title and publisher name, stop words are removed from the word collectio ...

Pyndri: a Python Interface to the Indri Search Engine

... There is still, however, a lack of an integrated Python library dedicated to Information Retrieval (IR) research. Researchers often implement their own procedures to parse common file formats, perform tokenization, token normalization that encompass the overall task of corpus indexing. Uysal and Gun ...

INF5820 Distributional Semantics

... Words are in paradigmatic relation if the same neighbors typically occur near them (humans often ‘eat’ both ‘bread’ and ‘butter’). It is also called second order co-occurrence. The words in such a relation may well never actually co-occur with each other. ...

N045038690

... with the characteristics of exploitation and exploration, GAs can efficiently deal with large search spaces, and hence are less prone to get stuck into a local optimum solution when compared to other algorithms. This derives from the GAs ability to handle multiple concurrent solutions (individuals) ...

Figure 4: The Binomial distribution

... obtained per the frequency of terms appearance in the corpus by providing a systematic way to detect which entity classes are most similar to each other and, therefore, which entity classes are the best candidates for establishing the similarity between two terms with respect to the domain ontology. ...

NLDB10-OntoGain - Intelligent Systems Laboratory

... Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms Two methods in OntoGain Agglomerative clustering Formal Concept Analysis (FCA) ...

Conceptual grouping in word co-occurrence networks

... One way to quantify the ideas on conceptual grouping presented above is to build a custom semantic network for a user query. What we do is build a new small semantic network with all concepts that are linked to the user query (e.g. 'bomb', see Figure 1, which shows only some of the links around 'bom ...

CL35491494

... deals with languages. Language refers to a body of words and the systems for their use common to a people who are of the same community or nation, the same geographical area, or the same cultural tradition. It is the primary means of communication used by particular groups of human beings [1]. It is ...

Discriminative Improvements to Distributional Sentence Similarity

... al., 2003; Arora et al., 2012); the difference from SVD is the addition of a non-negativity constraint in the latent representation based on non-orthogonal basis. While W may simply contain counts of distributional features, prior work has demonstrated the utility of reweighting these counts (Turney ...

in the document - XP

... The main idea behind tf-idf is that a term occurring infrequently should be given a higher weight than a term that occurs frequently. • Important definitions in the tf-idf context: t = number of distinct terms in the document collection. tf_ij = number of occurrences of term t_j in document D_i. This is ...

No Slide Title

... – Start with some user-supplied relevance information about a “training set” of documents – The training set is used to compute term weights by estimating P(t in document | document is relevant) and P(t in document | document is irrelevant) ...

2006 Paula Matuszek

... – hyperbolic viewer based on document similarity; browse a field of scientific documents – “map” based techniques showing peaks, valleys, outliers – Faceted search results showing document counts for different categorizations, with browsing ...

Latent semantic analysis

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Words are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two rows. Values close to 1 represent very similar words while values close to 0 represent very dissimilar words. An information retrieval method using latent semantic structure was patented in 1988 (US Patent 4,839,853, now expired) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called Latent Semantic Indexing (LSI).
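The count matrix, SVD and cosine steps described above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration under stated assumptions: the five-word vocabulary, the counts, the choice of k = 2 and the helper names lsa_word_vectors and cosine are all made up for the example, and each word is represented by its row of the truncated SVD scaled by the singular values, which is one common convention rather than the only one.

```python
import numpy as np

# Toy word-by-paragraph count matrix (hypothetical data for illustration).
# Rows are unique words, columns are paragraphs.
words = ["ship", "boat", "ocean", "tree", "wood"]
counts = np.array([
    [1, 0, 0, 0],  # ship  appears in paragraph 1
    [0, 1, 0, 0],  # boat  appears in paragraph 2
    [1, 1, 0, 0],  # ocean appears in paragraphs 1 and 2
    [0, 0, 1, 1],  # tree  appears in paragraphs 3 and 4
    [0, 0, 1, 0],  # wood  appears in paragraph 3
], dtype=float)


def lsa_word_vectors(matrix, k):
    """Return one k-dimensional vector per row (word) via truncated SVD."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values and scale the word coordinates by them.
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vecs = lsa_word_vectors(counts, k=2)

# "ship" and "boat" never share a paragraph, so their raw cosine is 0,
# but both co-occur with "ocean", which the reduced space picks up.
raw_ship_boat = cosine(counts[0], counts[1])
lsa_ship_boat = cosine(vecs[0], vecs[1])
lsa_ship_tree = cosine(vecs[0], vecs[3])

print(f"raw  ship~boat: {raw_ship_boat:.3f}")
print(f"LSA  ship~boat: {lsa_ship_boat:.3f}")
print(f"LSA  ship~tree: {lsa_ship_tree:.3f}")
```

On this toy matrix the reduced-space similarity of "ship" and "boat" comes out close to 1 even though the two words never occur in the same paragraph, while "ship" and "tree" stay close to 0. With real text the counts are usually reweighted first and k is chosen much larger; both choices are left out of this sketch.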
