PatchGrid Visualization
PatchGrid is similar to the visualizations known as “HeatMap” and “CCCViz” (ref. Dim Anch
paper and patent pending), and to Bertin’s Permutation Matrix. Essentially, it is a
two-dimensional table (similar to a spreadsheet) in which each cell or field value is
mapped to a color. At the extreme, a cell can be drawn as small as one pixel, hence a
table as large as 1 million columns by 1 million rows can be completely visualized using
this visualization. The power of the visualization, and specifically of AnVil’s
enhancements, comes from the unique sorting and clustering of the rows and columns as
well as the specialized interactions, selections and data probing. AnVil’s implementation
also offers many different color scales, enhanced by the specific statistics of each row
and column, which further increases the power of the PatchGrid visualization.
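The basic idea of mapping each cell to a color, after reordering the rows, can be
sketched in a few lines of Python. This is only an illustration under stated
assumptions, not AnVil’s implementation: the function names are hypothetical, sorting
rows by their mean is a simple stand-in for the sorting and clustering described above,
and the color scale is a plain gray level scaled by each column’s own minimum and maximum.

```python
def sort_rows_by_mean(table):
    """Reorder rows by their mean value.

    Assumption: a simple stand-in for the row sorting/clustering step.
    """
    return sorted(table, key=lambda row: sum(row) / len(row))


def column_scaled_grid(table):
    """Map each cell to a gray level 0..255, scaled by its own column's range."""
    n_cols = len(table[0])
    lows = [min(row[j] for row in table) for j in range(n_cols)]
    highs = [max(row[j] for row in table) for j in range(n_cols)]
    grid = []
    for row in table:
        levels = []
        for j, value in enumerate(row):
            span = highs[j] - lows[j]
            if span == 0:
                # A constant column has no range; draw it as mid-gray.
                levels.append(128)
            else:
                levels.append(round(255 * (value - lows[j]) / span))
        grid.append(levels)
    return grid


# Two columns on very different scales still use the full gray range each.
table = [
    [3.0, 100.0],
    [1.0, 300.0],
    [2.0, 200.0],
]
grid = column_scaled_grid(sort_rows_by_mean(table))
```

Scaling per column rather than over the whole table is one way to let the statistics of
each column shape the color scale; each entry of `grid` would then be drawn as one cell,
down to a single pixel.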
PURS and PUCS Algorithmic Data Reduction
PURS and PUCS stand for principal uncorrelated record selection and principal
uncorrelated column selection. In a generalized data mining setting where one is
presented with many columns and rows (records) of data, it is quite possible that many of
these rows and columns are highly dependent on each other. For supervised machine
learning these correlated columns and rows are redundant and can most often be
eliminated without any loss of classification accuracy. Computing the pairwise cross
correlation of every column with every other column, and of every row with every other
row, can be prohibitively expensive and would still leave one with the problem of
eliminating the highly correlated rows and columns. AnVil’s PURS and PUCS technique
eliminates highly correlated columns and rows in an automatic and incremental fashion.
An operator selects a correlation threshold, an incremental threshold, and the number of
records and columns desired from the dataset. The analysis proceeds by generating a
number of random selections of columns and rows and then eliminating from the dataset
those columns and rows that correlate with the random selections above the threshold
value. If no columns or rows can be eliminated, the correlation threshold is reduced
by the incremental threshold and the procedure is repeated until the desired number of
columns and rows remains. If the incremental threshold is 0, the analysis and
elimination stop when no columns or rows (records) correlating above the initial
threshold can be found. In this case the dataset is usually not reduced to the desired
number of rows and columns. The reduction can be performed on just columns or just rows.
The reduced data set is usually a representative “subset” of the original which
can be used in clustering, classification and other data mining techniques.
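The column half of the procedure described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not AnVil’s algorithm: the names
`pearson`, `reduce_columns`, `n_probes` and `seed` are hypothetical, correlation is taken
as the absolute Pearson coefficient, and each round draws `n_probes` random columns and
drops the other columns that correlate with them above the threshold. Rows could be
handled the same way by transposing the table.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def reduce_columns(columns, target, threshold, step, n_probes=1, seed=0):
    """Drop correlated columns until `target` remain (or no more can be dropped).

    columns: dict mapping a column name to its list of values.
    step: the incremental threshold; 0 means the threshold is never lowered.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    rng = random.Random(seed)
    kept = dict(columns)
    while len(kept) > target:
        names = sorted(kept)
        # Random selection of probe columns; always leave at least one non-probe.
        probes = rng.sample(names, min(n_probes, len(names) - 1))
        removed = False
        for name in names:
            if len(kept) <= target:
                break
            if name in probes:
                continue
            # Assumption: "correlates higher than" means |r| above the threshold.
            if any(abs(pearson(kept[name], kept[p])) > threshold for p in probes):
                del kept[name]
                removed = True
        if not removed:
            if step <= 0:
                break  # nothing above the initial threshold for this selection
            threshold -= step  # relax the threshold and try again
    return kept


data = {
    "a": [1, 2, 3, 4, 5],
    "a2": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
    "c": [1, -1, 0, -1, 1],  # uncorrelated with both
}
reduced = reduce_columns(data, target=2, threshold=0.9, step=0.1)
```

With a nonzero `step` the sketch always reaches the requested number of columns, because
the threshold keeps falling until something can be dropped. With `step=0` it stops as
soon as one random selection finds nothing above the threshold, which mirrors the case
where the dataset is not reduced to the desired size.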