Representation of hypertext documents based on terms, links and text compressibility
Julian Szymański
Department of Computer Systems Architecture,
Gdańsk University of Technology, Poland
Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
School of Computer Engineering, Nanyang Technological University, Singapore
Outline
- Text representations
  - Words
  - References
  - Compression
- Evaluation of text representations
  - Wikipedia data
  - SVM & PCA
  - Experimental results and conclusions
- Future directions
Text representation
The amount of information on the Internet grows rapidly, so machine support is needed for:
- categorization (supervised or unsupervised),
- searching / retrieval.
Humans understand text; machines do not. To process text, a machine needs it in a computable form.
The results of text processing strongly depend on the methods used for text representation.
There are several approaches to processing natural language:
- logic (ontologies),
- statistical processing of large text corpora,
- geometry, used mainly in machine learning.
Machine learning for NLP uses text features.
The aim of the experiments presented here is to find a hypertext representation suitable for automatic categorization.
Information retrieval for the Wiki project – improvement of the existing Wikipedia category system.
Text representation with features
A convenient form of a text for machine processing is a vector of features:
- k – document number,
- n – feature number,
- c – feature value.
The text set is represented as a matrix in which each of the N features is related to text k by the weight c.
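As a worked sketch of the notation above (the symbols d_k, D, K and the matrix layout are assumptions, not taken from the slides), the text set can be written as a document-feature matrix:

```latex
% Minimal sketch of the implied notation; symbol names (d_k, D, K, N) are assumptions.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Document $k$ is a row of feature weights; the whole text set is a $K \times N$ matrix:
\[
  \mathbf{d}_k = (c_{k1}, c_{k2}, \dots, c_{kN}), \qquad
  \mathbf{D} =
  \begin{pmatrix}
    c_{11} & \cdots & c_{1N} \\
    \vdots & \ddots & \vdots \\
    c_{K1} & \cdots & c_{KN}
  \end{pmatrix}.
\]
\end{document}
```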
Where do the features come from?
Words
The most intuitive approach is to take words as features: the word content should describe the subject of the text well.
The n-th word has the value c in the context of the k-th document, calculated as
  c_kn = tf_kn * idf_n,
where
- tf – term frequency: describes how many times word n appears in document k,
- idf – inverse document frequency: describes how rarely word n appears in the whole text set; the ratio of the number of all documents to the number of documents containing the given word (usually taken with a logarithm).
Problem: high-dimensional sparse vectors.
BOW – Bag of Words, which loses syntax.
Preprocessing: stopwords, stemming. Features -> terms.
Some other possibilities: n-grams, profiles of letter frequencies.
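A minimal sketch of building such term features with scikit-learn; the toy documents, the TfidfVectorizer settings and the lack of stemming are illustrative assumptions, not the setup used in the experiments:

```python
# Sketch: bag-of-words term features weighted by tf-idf (scikit-learn).
# The example documents and parameters are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Ethanol is a chemical compound used as a solvent.",
    "An oak is a tree common in temperate forests.",
    "A volcano erupts when magma reaches the surface.",
]

# Stopword removal roughly corresponds to the preprocessing mentioned above;
# stemming would require an extra step (e.g. a custom tokenizer).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)          # sparse K x N matrix of c_kn weights

print(X.shape)                              # (3, number_of_terms)
print(vectorizer.get_feature_names_out()[:10])
```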
References
Scientific articles contain bibliographies; web documents contain hyperlinks. They can be used as a representation space in which a document is represented by the other documents it references.
Typically a binary vector: 0 – lack of a reference to the given document, 1 – existence of the reference.
Some possible extensions:
- Not all articles are equal. Ranking algorithms such as PageRank and HITS allow measuring the importance of documents and provide, instead of a binary value, a weight describing the importance of one article pointing to another.
- We can use references of higher order, which capture references not only from direct neighbours but also from documents further away.
Similarly to the word representation, the vectors are sparse, but of much lower dimensionality.
Poor at capturing semantics.
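A minimal sketch of such a binary link representation; the article titles and their outgoing links are invented for illustration:

```python
# Sketch: binary reference (link) features; entry is 1 if article k links to article n.
# Article titles and outgoing links below are invented for illustration.
import numpy as np

outlinks = {
    "Ethanol": ["Chemical compound", "Solvent"],
    "Oak":     ["Tree", "Forest"],
    "Volcano": ["Magma", "Lava", "Tree"],
}

# The feature space is the set of all linked documents.
targets = sorted({t for links in outlinks.values() for t in links})
index = {t: j for j, t in enumerate(targets)}

X = np.zeros((len(outlinks), len(targets)), dtype=int)
for i, links in enumerate(outlinks.values()):
    for t in links:
        X[i, index[t]] = 1      # a PageRank/HITS score could replace this 1

print(targets)
print(X)
```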
Compression
Usually we need to show differences and similarities between texts in the repository. They can be calculated using e.g. the cosine distance, which is suitable for high-dimensional, sparse vectors.
This gives a square matrix describing text similarity.
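A minimal sketch of such a pairwise cosine-similarity matrix over sparse term vectors (toy documents, purely illustrative):

```python
# Sketch: square cosine-similarity matrix between sparse document vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "algebraic groups and rings",
    "rings and fields in algebra",
    "basaltic lava flows of a volcano",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse K x N term matrix
S = cosine_similarity(X)                    # dense K x K similarity matrix
print(S.round(2))                           # the two algebra texts score highest
```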
Another possibility is to build the representation space on algorithmic information estimated using standard file compression techniques.
Key idea: if two documents are similar, their concatenation compresses to a file only slightly larger than a single compressed file.
Two similar files will be compressed better than two different ones.
The complexity-based similarity measure is the fraction by which the sum of the separately compressed files exceeds the size of the jointly compressed file:
  s(A, B) = (|A_p| + |B_p| - |(AB)_p|) / |(AB)_p|,
where A and B denote text files, and the suffix p denotes the compression operation.
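A minimal sketch of this measure, assuming zlib as the compressor standing in for the suffix p (the helper names and example strings are illustrative):

```python
# Sketch: compression-based similarity between two texts.
# zlib stands in for the generic compression operation denoted by the suffix p.
import zlib

def compressed_size(text: str) -> int:
    return len(zlib.compress(text.encode("utf-8")))

def compression_similarity(a: str, b: str) -> float:
    """Fraction by which the separately compressed sizes exceed
    the size of the jointly compressed concatenation."""
    joint = compressed_size(a + b)
    return (compressed_size(a) + compressed_size(b) - joint) / joint

similar = compression_similarity("volcanic lava flow", "volcanic lava dome")
different = compression_similarity("volcanic lava flow", "associative algebra")
print(similar, different)   # similar texts yield the larger value
```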
The data
The three ways to generate numerical representations of texts have been compared on a set of articles selected from Wikipedia.
Articles that belong to subcategories of the super-category Science:
- Chemistry → Chemical compounds,
- Biology → Trees,
- Mathematics → Algebra,
- Computer science → MS (Microsoft) operating systems,
- Geology → Volcanology.
Rough view of the class distribution (figure).
PCA projections of the data onto the two principal components with the highest variance (figure).
Projection of the dataset onto the two highest principal components for the text representations based on terms, links and compression (figure).
Number of components needed to cover 90% of the variance, and the cumulative sum of the variance of the principal components, for the successive text representations (figure).
SVM classification
Classification can be used as a method of validating text representations: the better the results the classifier obtains, the better the representation. The information extracted by different text representations can be estimated by comparing classifier errors in the various feature spaces.
Multiclass classification with SVM, performed with the one-versus-other (one-vs-rest) class approach, has been used with two-fold cross-validation repeated 50 times for accurate averaging of the results.
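A minimal sketch of this evaluation protocol with scikit-learn; the random stand-in data and the LinearSVC settings are assumptions for illustration only:

```python
# Sketch: one-vs-rest SVM evaluated with 2-fold CV repeated 50 times.
# X, y and the classifier settings are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 100))            # stand-in for a document-feature matrix
y = rng.integers(0, 5, size=500)      # 5 article categories

clf = OneVsRestClassifier(LinearSVC())
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=50, random_state=0)

scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean(), scores.std())    # accuracy averaged over 100 folds
```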
The raw representation based on complexity gives the best results.
Reducing the dimensionality by removing the features that are related to only one article improves the results.
Introducing a cosine kernel improves the results considerably.
SVM and PCA reduction
Selecting the components that cover 90% of the variance has been used for dimensionality reduction.
It worsens the classification results for terms and links (too strong a reduction?).
PCA does not influence the complexity representation.
As in the previous results, introducing the cosine kernel improves classification; for terms it is even slightly better.
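A minimal sketch of this reduction step (again with assumed stand-in data); scikit-learn's PCA accepts a fractional n_components, keeping just enough components to explain that share of the variance:

```python
# Sketch: PCA keeping the components that explain 90% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((500, 1000))           # stand-in for a document-feature matrix

pca = PCA(n_components=0.90)          # fraction of variance to retain
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])             # number of components covering 90% of variance
print(pca.explained_variance_ratio_.cumsum()[-1])
```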
Summary
The complexity measure allowed a much more compact representation, as seen from the cumulative contribution of the principal components, and achieved the best accuracy in the PCA-reduced space with only 36 dimensions.
After applying the cosine kernel, the term-based representation is slightly more accurate.
Explicit representation of the kernel spaces and the use of a linear SVM classifier allow finding the reference documents that are important for a given category, as well as identifying collocations and phrases that are important for characterizing each category.
Distance-type kernels improve the results and reduce the dimensionality of the term and link representations.
There is an improvement also in the case of the complexity-based representation, where the distance-based similarity is a second-order transformation.
Future directions
Different methods of representation extract different information from texts. They show different aspects of the documents.
In the future we plan to combine the representations and use one joint representation.
We plan to introduce more background knowledge and capture some semantics:
- WordNet can be used as a semantic space onto which the words from an article are mapped.
- WordNet is built as a network of interconnected synsets – elementary atoms that carry meaning.
- The mapping requires the use of word-sense disambiguation techniques.
- This allows using activations of the WordNet semantic network and calculating distances between them, which should give better semantic similarity measures.
A large-scale classifier for the whole of Wikipedia.
Thank you for your attention.