ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 24-27, New York, U.S.A.
Reviews For Paper
Track: Research
Paper ID: 1516
Title: Variance Gradient Optimized Classification on Vertically Structured Data
Masked Reviewer ID: Assigned_Reviewer_1
Review:
Question: How would you rate the novelty of the problem solved in this paper?
Answer: A minor variation of some well studied problems

Question: How would you rate the technical ideas and development in this paper?
Answer: The technical development is incremental without fundamental contributions

Question: How would you rate the empirical study conducted in this paper?
Answer: Acceptable, but there is room for improvement

Question: Repeatability: are the data sets used publicly available (and thus the experiments may be repeated by a third party)?
Answer: Yes

Question: How would you rate the quality of presentation?
Answer: The writing needs significant improvement in terms of organization and clarity

Question: Which topic category do you think this paper belongs to?
Answer: Unsupervised learning

Question: What is your overall recommendation?
Answer: Reject. Clearly below the standards for the conference.
List up to 3 particular strengths of the paper. If none, just say "none".
1. Mathematical analysis on the gradient-based approach.
2. The application of the relatively new data structure pTree.
List up to 3 particular weaknesses of this paper. If none, just say "none".
1. Presentation: needs improvement on this aspect.
2. The comparison is done on some basic, simplistic data.
3. The comparison is only between the proposed ideas.
Detailed comments for the authors; justification of your overall rating:

On the presentation aspect the paper needs improvement, especially in section 2.2. For example, there is no visual distinction between the end of Algorithm 1 and the beginning of the normal text. Another example is the mathematical formulas, which are packed as densely as possible.

Moving on to the content: according to the authors, there is no need to test the speed of the pTree-based algorithms on large data, as their past work has shown them to be efficient. As the algorithms presented here are justified mathematically, the main point of this paper is the quality of the results. This is a defining aspect of the paper, as it is the only subject the reviewer is left to comment on. The usefulness of the pTree has been explored by previous work of the author(s) (references 1 through 9). Thus this paper is, and should be, judged on whether it produces good results.

Unfortunately, the experiments are conducted on simplistic datasets. These datasets are not really good for benchmarking, but rather suit a "proof of concept" approach. They lack complexity, and two of them (IRIS and SEEDS) have at most 200 instances (with 4 and 7 features, respectively). This set of data is "prone" to display good results even with the most simplistic approaches. The authors may avoid comparison with large data because scale does not matter for the efficiency of the approach, but with this argument they also "avoid" the complexity of larger datasets.

Moreover, as this is a clustering procedure, there should be a comparison with an algorithm that is not one of their 3 alternatives. Perhaps they should consider comparing against a state-of-the-art clustering algorithm on quality, so that they may answer the actual question posed at the beginning of their document:
"That is, if we structure our data vertically and process across those vertical structures (horizontally), can those horizontal algorithms compete quality-wise with the time-honored methods that process horizontal (record) data vertically?"
List of typos, grammatical errors, and/or concrete suggestions to improve presentation:
- 2.1: Please fix the first sentence, as it is hard to make sense of.
- 2.2: Mathematical formulas and algorithms should stand out from the surrounding text.
- "Then our problem is to develop an algorithm to (1.2) ...." -> this makes no sense.
- 1.1 and 1.2 are not connected in any logical manner, and their indentation also differs.
- "Thus, 2. is equivalent to" ... which/what "2"?
- Section 4: "fact that the combining of" -> "the combination of".
- "whereas for horizontally structured data is doubles (at least)." -> "is double".
- "pTree compression is design to improve speed more and" -> "is designed".
Masked Reviewer ID: Assigned_Reviewer_2
Review:
Question: How would you rate the novelty of the problem solved in this paper?
Answer: A minor variation of some well studied problems

Question: How would you rate the technical ideas and development in this paper?
Answer: The technical development is incremental without fundamental contributions

Question: How would you rate the empirical study conducted in this paper?
Answer: Not thorough, or even faulty

Question: Repeatability: are the data sets used publicly available (and thus the experiments may be repeated by a third party)?
Answer: Yes

Question: How would you rate the quality of presentation?
Answer: The writing needs significant improvement in terms of organization and clarity

Question: Which topic category do you think this paper belongs to?
Answer: Classification

Question: What is your overall recommendation?
Answer: Reject. Clearly below the standards for the conference.
List up to 3 particular strengths of the paper. If none, just say "none".
none
List up to 3 particular weaknesses of this paper. If none, just say "none".
1. The statements are not clear.
2. There are few related works discussed.
3. The experiments are problematic.
Detailed comments for the authors; justification of your overall rating:

This paper proposes classification methods based on predicate trees (pTrees) for large datasets. There are some problems:
1. The whole paper is not well organized and is unclear. The paper is only 6 pages long, while the limit is 10 pages; there is enough space to state the problem and the algorithm more clearly.
2. Classification for "big data" is a well studied problem. The authors did not discuss other related work in this area.
3. Some proofs of lemmas/theorems/corollaries are missing.
4. The authors claim that the proposed algorithms are very efficient and useful for handling "big data", but the experiments use only very small UCI datasets (a few hundred examples each), and no running times are reported.
5. The experimental results of the proposed algorithms do not seem very strong. On the Wine dataset, for example, it is easy to achieve 95% accuracy with many common classification methods, such as 1NN, but the accuracies of the proposed methods are 62.7%, 66.7%, and 81.3%.
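The 1NN baseline the reviewer invokes is simple to reproduce. A minimal sketch in plain Python of 1-nearest-neighbour classification follows; the toy 2-D points are invented for illustration (an actual comparison would use the UCI Wine data, with proper train/test splitting and feature scaling):

```python
import math

def nn1_predict(train, query):
    """Classify `query` by the label of its nearest training point
    (1-nearest-neighbour, Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    features, label = min(train, key=lambda pair: dist(pair[0], query))
    return label

# Toy labelled training set (invented for illustration, not the Wine data).
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((3.0, 3.0), "B"), ((2.9, 3.2), "B")]

print(nn1_predict(train, (0.2, 0.1)))  # nearest training point is class "A"
print(nn1_predict(train, (3.1, 3.0)))  # nearest training point is class "B"
```

Even this unoptimized baseline gives a concrete reference point against which the 62.7-81.3% accuracies reported in the paper could be judged.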
Masked Reviewer ID: Assigned_Reviewer_3
Review:
Question: How would you rate the novelty of the problem solved in this paper?
Answer: A well established problem

Question: How would you rate the technical ideas and development in this paper?
Answer: The technical development has some flaws

Question: How would you rate the empirical study conducted in this paper?
Answer: Not thorough, or even faulty

Question: Repeatability: are the data sets used publicly available (and thus the experiments may be repeated by a third party)?
Answer: Yes

Question: How would you rate the quality of presentation?
Answer: The writing needs significant improvement in terms of organization and clarity

Question: Which topic category do you think this paper belongs to?
Answer: Novel statistical techniques for big data

Question: What is your overall recommendation?
Answer: Reject. Clearly below the standards for the conference.
List up to 3 particular strengths of the paper. If none, just say "none".
1. Interesting approach to the use of pTree/vertically structured data (apparently similar to column stores) in data mining.
2. Formulation of the gradient of maximal variance (1st PC) using iteration in this context.
3. Addresses a paradigm-shifting basic question: "do we give up quality (accuracy) when horizontally processing vertical bit-slices compared to vertically processing horizontal records?"
List up to 3 particular weaknesses of this paper. If none, just say "none".
1. Paper is not yet complete; needs further development.
2. The performance evaluation section (considering 4 UCI datasets) provides only 3 accuracy results for each dataset.
3. Algorithm 1, on p.3, is incomplete and currently includes some typos.
Detailed comments for the authors; justification of your overall rating:

It seems the method proposed here finds the first principal component using something like the power method for finding eigenvectors. If the method in the paper, or the resulting gradient, is in fact different from this, it would help to clarify that.

In Corollary 3, it is not clear why the complexity is O(n^2 log(1/3)). (By the way, a right parenthesis is missing.)

The relationship between pTrees and vertically structured data, on the one hand, and the column-store systems now being used in Big Data applications, on the other, needs clarification. It could generate a lot of interest in pTrees if the column-store systems differed in important ways.
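The power-method reading the reviewer describes can be sketched as follows. This is a generic power-iteration sketch in plain Python, not the paper's Algorithm 1; the toy 2x2 covariance matrix, iteration cap, and tolerance are all invented for illustration:

```python
import math

def power_iteration(A, iters=200, tol=1e-12):
    """Return the dominant eigenvector and eigenvalue of a symmetric
    matrix A. For a covariance matrix, that eigenvector is the 1st PC."""
    n = len(A)
    v = [1.0] * n                       # arbitrary non-zero start vector
    lam = 0.0
    for _ in range(iters):
        # w = A v, then normalize so the iterate stays unit-length
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        # Rayleigh quotient w^T A w estimates the dominant eigenvalue
        new_lam = sum(w[i] * sum(A[i][j] * w[j] for j in range(n))
                      for i in range(n))
        if abs(new_lam - lam) < tol:    # eigenvalue estimate has converged
            return w, new_lam
        v, lam = w, new_lam
    return v, lam

# Toy symmetric "covariance" matrix (invented for illustration);
# its eigenvalues are 3 and 1, with dominant eigenvector along (1, 1).
C = [[2.0, 1.0],
     [1.0, 2.0]]
vec, val = power_iteration(C)
# val ~ 3.0, vec ~ (0.7071, 0.7071)
```

If the paper's variance-gradient iteration is equivalent to this, the authors could cite standard convergence results for power iteration rather than re-deriving the complexity.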