PatchGrid Visualization

PatchGrid is similar to the visualizations called “HeatMap” and “CCCViz” (ref. Dim Anch paper and patent pending), and to Bertin’s Permutation Matrix. Essentially, it is a two-dimensional table (similar to a spreadsheet) in which each cell or field value is mapped to a color. At the extreme, a cell can be as small as one pixel, so a table as large as 1 million columns by 1 million rows can be visualized in its entirety. The power of the visualization, and specifically of AnVil’s enhancements, comes from the unique sorting and clustering of the rows and columns as well as from the specialized interactions, selections and data probing. AnVil’s implementation also uses many different color scales, each informed by the specific statistics of the row or column it is applied to, which further enhances the power of the PatchGrid visualization.

PURS and PUCS Algorithmic Data Reduction

PURS and PUCS stand for principal uncorrelated record selection and principal uncorrelated column selection. In a generalized data mining setting, where one is presented with many columns and rows (records) of data, it is quite possible that many of these rows and columns are highly dependent on each other. For supervised machine learning these correlated columns and rows are redundant and can most often be eliminated without any loss of classification accuracy. Pairwise cross-correlation of every column with every other column, and of every row with every other row, can be computationally prohibitive, and would still leave one with the problem of deciding which of the highly correlated rows and columns to eliminate. AnVil’s PURS and PUCS is a technique that eliminates highly correlated columns and rows in an automatic and incremental fashion. An operator selects a correlation threshold, an incremental threshold, and the number of records and columns desired from the dataset.
The analysis proceeds by generating a number of random selections of columns and rows, and then eliminating from the dataset the columns and rows whose correlation with those random selections is higher than the threshold value. If no columns or rows can be eliminated, the correlation threshold is reduced by the incremental threshold and the procedure is repeated until the desired number of columns and rows remains. If the incremental threshold is 0, the analysis and elimination stop when no columns or rows (records) that correlate higher than the initial threshold can be found. In this case the dataset is usually not reduced to the desired number of rows and columns. The reduction can be performed on just columns or just rows. The reduced dataset is usually a well-represented “subset” of the original, which can be used in clustering, classification and other data mining techniques.
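The procedure described above can be illustrated with a short Python sketch. This is not AnVil's implementation, only one possible reading of the text. The function names, the n_probes parameter, the use of absolute Pearson correlation, and the rule that a single pass never removes more columns than needed to reach the target are all assumptions made for the example.

```python
import numpy as np


def reduce_columns(data, n_keep, threshold=0.9, step=0.05, n_probes=3, seed=0):
    """Return the sorted indices of the columns that survive the reduction.

    Hypothetical sketch: every name and default here is an assumption.
    """
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")
    rng = np.random.default_rng(seed)
    keep = np.arange(data.shape[1])
    while keep.size > n_keep:
        # Random selection of surviving columns to correlate against.
        k = min(n_probes, keep.size - 1)
        probes = rng.choice(keep, size=k, replace=False)
        others = np.setdiff1d(keep, probes)
        # Correlate only probes against the rest, not every pair of columns.
        full = np.corrcoef(data[:, probes], data[:, others], rowvar=False)
        # Strongest absolute correlation of each non-probe column with any
        # probe; NaN (constant column) is treated as 0 by assumption.
        score = np.nan_to_num(np.abs(full[:k, k:])).max(axis=0)
        order = np.argsort(-score)
        # Assumption: drop the most correlated first, never below n_keep.
        drop = order[score[order] > threshold][: keep.size - n_keep]
        if drop.size == 0:
            if step <= 0:
                break  # incremental threshold of 0: stop at first empty pass
            threshold -= step  # relax the threshold and try again
            continue
        keep = np.setdiff1d(keep, others[drop])
    return keep


def reduce_rows(data, n_keep, **kwargs):
    """Record selection: the same procedure applied to the transposed table."""
    return reduce_columns(data.T, n_keep, **kwargs)
```

With a positive step the loop always reaches n_keep columns, because the threshold eventually falls below every correlation. With step=0 it stops at the first pass that removes nothing, so the result may still contain more columns than requested, as the text notes.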