Appendix: To cluster, or not to cluster
Andreas Adolfsson, Margareta Ackerman and Naomi Brownstein
1 Simulations
Details of the simulations are described in this section. Each scenario is run 1000 times.
The rate of p-values below the selected significance level (0.05) is recorded in Table 1 of
the main paper. For unclusterable datasets, this rate represents Type I error, while for
clusterable datasets, the rate represents statistical power. We now discuss how each type of
data summarized in Table 1 was generated:
1.1 Unclusterable Datasets
• Row (1) of Table 1: Datasets contain a single bivariate Gaussian cluster of 50 points
with mean and standard deviation of 100 and 2, respectively, for each independent
dimension.
• Row (2) of Table 1: Datasets include a single trivariate Gaussian cluster of 50 points
in 3 independent dimensions with mean 100 and standard deviation 2.
• Row (3) of Table 1: Datasets include a single Gaussian cluster with 100 points in 50
independent dimensions, each with mean 2 and standard deviation 2.
• Row (4) of Table 1: Datasets contain a single Gaussian cluster of 50 points in 10
dimensions, each with mean 2 and standard deviation 2.
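The four unclusterable scenarios above share one recipe: a single cluster whose coordinates are independent Gaussians with a common mean and standard deviation. A minimal sketch using only Python's standard library follows. The function name `gaussian_cluster`, the dictionary `unclusterable_specs`, and the fixed seed are illustrative assumptions and are not taken from the paper.

```python
import random

def gaussian_cluster(n_points, n_dims, mean, sd, rng):
    """Return n_points rows, each with n_dims independent N(mean, sd^2) coordinates."""
    return [[rng.gauss(mean, sd) for _ in range(n_dims)] for _ in range(n_points)]

# Fixed seed is an assumption, used here only so the sketch is reproducible.
rng = random.Random(0)

# (n_points, n_dims, mean, sd) as described for rows (1)-(4) of Table 1.
unclusterable_specs = {
    1: (50, 2, 100, 2),
    2: (50, 3, 100, 2),
    3: (100, 50, 2, 2),
    4: (50, 10, 2, 2),
}

# One simulated dataset per row; a full study would repeat this 1000 times.
unclusterable = {
    row: gaussian_cluster(*spec, rng) for row, spec in unclusterable_specs.items()
}
```

Each entry of `unclusterable` is a list of points, so `unclusterable[3]` holds 100 points with 50 coordinates each.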
1.2 Sparse Distance Data
Next, we tested behavior of the tests in the presence of data with a small number of outlying
points to see if the tests considered these points as outliers or small clusters.
• Row (5) of Table 1: Datasets include a single bivariate Gaussian cluster consisting of
50 points with mean 50 and standard deviation 2 in each dimension. We simply added
a single bivariate Gaussian observation with standard deviation 2 and mean randomly
sampled from a uniform(60,65) distribution.
• Row (6) of Table 1: We verify the effect of density on outliers by expanding the
larger cluster size from 50 to 250 points.
• Row (7) of Table 1: Third, we generate a similar single cluster with three outliers.
The outliers are independently generated from bivariate Gaussian distributions with
standard deviations of 2 and means randomly sampled from the ranges (40,55) to (45,60),
(65,65) to (70,70), and (65,45) to (70,50), respectively.
• Row (8) of Table 1: Datasets were generated with a high likelihood of outliers
using 100 points from a two dimensional t-distribution with 5 degrees of freedom and
non-centrality parameters at 100.
• Row (9) of Table 1: Data contains 100 points from a t-distribution in two dimensions
with 10 degrees of freedom each and non-centrality parameter 100.
• Row (10) of Table 1: Data includes 100 points from a t-distribution in two independent dimensions with 15 degrees of freedom and non-centrality parameter 100.
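The outlier and heavy-tailed scenarios can be sketched as follows, again with the standard library only. Two assumptions are made here that the text does not state: each coordinate of the outlier mean in rows (5) and (6) is drawn independently from uniform(60,65), and "non-centrality parameter" follows the usual definition of the noncentral t-distribution, T = (Z + nc) / sqrt(V / df) with Z standard normal and V chi-square with df degrees of freedom. All function names are hypothetical. Row (7) is omitted because the size of its main cluster is not given.

```python
import math
import random

rng = random.Random(1)  # arbitrary seed, for reproducibility of the sketch only

def gaussian_points(n, means, sd, rng):
    """n points whose coordinates are independent N(means[j], sd^2)."""
    return [[rng.gauss(m, sd) for m in means] for _ in range(n)]

def cluster_with_outlier(n_main, rng):
    """Rows (5)/(6): main cluster centered at 50 plus one outlying observation."""
    main = gaussian_points(n_main, [50, 50], 2, rng)
    # Assumption: each coordinate of the outlier mean is drawn independently.
    outlier_mean = [rng.uniform(60, 65), rng.uniform(60, 65)]
    return main + gaussian_points(1, outlier_mean, 2, rng)

def noncentral_t(df, nc, rng):
    """One draw of (Z + nc) / sqrt(V / df), Z ~ N(0,1), V ~ chi-square(df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2.0, 2.0)  # Gamma(shape=df/2, scale=2) is chi-square(df)
    return (z + nc) / math.sqrt(v / df)

def t_dataset(n, df, nc, rng):
    """Rows (8)-(10): n points with two independent noncentral-t coordinates."""
    return [[noncentral_t(df, nc, rng) for _ in range(2)] for _ in range(n)]

row5 = cluster_with_outlier(50, rng)
row6 = cluster_with_outlier(250, rng)
row8, row9, row10 = (t_dataset(100, df, 100, rng) for df in (5, 10, 15))
```

The chi-square draw uses the identity that a Gamma variable with shape df/2 and scale 2 is chi-square with df degrees of freedom, which avoids any dependency beyond `random` and `math`.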
1.3 Clusterable Datasets
• Row (11) of Table 1: We simulated two well-separated clusters using 50-point
bivariate Gaussian distributions, with each dimension centered at a mean of
either 30 or 50 and a standard deviation of 2.
• Row (12) of Table 1: We simulated data with three Gaussian clusters each consisting
of 50 points with a standard deviation of 2, centered around (30,20), (40,20), (35,30).
• Row (13) of Table 1: One notorious test for prior notions of clusterability is noise
robustness, so we generated data consisting of the same three well-separated clusters
discussed earlier (yielding clusterable results). Additionally, we add 80 points of noise
to three bivariate Gaussian clusters centered around (30,40), (70,40) and (50,80)
with a standard deviation of 2, effectively producing 35% noise in the dataset. The
noisy points are generated with mean 50 and standard deviation 20 in both dimensions.
• Row (14) of Table 1: Data was generated from three bivariate Gaussian clusters with
the same means as in row (12) but with standard deviations 1, 3, and 5, respectively.
• Row (15) of Table 1: Next, we explore the effects of varying cluster density and
spread, simulating a dataset with three Gaussian clusters centered around (35,40),
(65,40), and (50,60). All clusters are generated with a standard deviation of 2 and
consist of 100, 66, and 33 points, respectively.
• Row (16) of Table 1: Next we generate three independent Gaussian clusters. The
50-point, 2-dimensional clusters are centered at (35,40), (65,40), and (50,60),
with a standard deviation of 2 in both independent dimensions.
• Row (17) of Table 1: Expanding upon the dimensions of the prior dataset creates
trivariate clusterable structure. Maintaining the Gaussian distributed clusters, we now
center all dimensions around the mean values of 20, 40, 60 for each cluster, respectively.
• Row (18) of Table 1: Data consists of two well separated 10-dimensional clusters.
Using the same approach as simulating a single 10 dimensional cluster, we create
50 points each with 10 independent dimensions for each cluster. Dimensions had a
standard deviation of 2 and centered around 10 for the first cluster and 20 for the
second cluster.
• Row (19) of Table 1: Expanding the previous set (row 18) yields four 10-dimensional
well-separated clusters. The first two clusters were as described in the preceding
paragraph. The third and fourth are 10-dimensional Gaussian clusters, each consisting
of 50 points with a standard deviation of 2, with every dimension centered at 60 for
the third cluster and 80 for the fourth.
• Row (20) of Table 1: Data consists of 2 clusters of 100 points each in 50 independent
dimensions with the same mean. The means for clusters 1 and 2 are 5 and 10.
• Row (21) of Table 1: Similar to the previous row, data consists of 2 fifty-dimensional
clusters each with 100 points. The clusters overlap, with means of 3 and 6 in each dimension
and standard deviation of 2.
• Row (22) of Table 1: Data includes two clusters of 100 points each in two independent
dimensions, each following a t-distribution with 5 degrees of freedom, with
non-centrality parameters 50 and 150.
• Row (23) of Table 1: Data is generated as in row 22 but with 10 degrees of freedom.
• Row (24) of Table 1: Data is generated as in row 22 but with 15 degrees of freedom.
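Most of the clusterable scenarios are mixtures of Gaussian clusters that differ only in their centers, standard deviations, and sizes. The sketch below covers rows (12), (14), and (15) with one hypothetical helper, `gaussian_clusters`, written for illustration. The text does not give the cluster sizes for row (14), so 50 points per cluster is assumed here, matching row (12).

```python
import random

def gaussian_clusters(centers, sds, sizes, rng):
    """Draw one Gaussian cluster per (center, sd, size) triple.

    Coordinates are independent within a point. Returns the points together
    with the index of the cluster each point was drawn from.
    """
    points, labels = [], []
    for label, (center, sd, size) in enumerate(zip(centers, sds, sizes)):
        for _ in range(size):
            points.append([rng.gauss(c, sd) for c in center])
            labels.append(label)
    return points, labels

rng = random.Random(2)  # arbitrary seed, for reproducibility of the sketch only

centers_12 = [(30, 20), (40, 20), (35, 30)]
centers_15 = [(35, 40), (65, 40), (50, 60)]

# Row (12): equal spread and equal size.
row12, labels12 = gaussian_clusters(centers_12, [2, 2, 2], [50, 50, 50], rng)
# Row (14): same means, varying spread. Sizes of 50 are an assumption.
row14, labels14 = gaussian_clusters(centers_12, [1, 3, 5], [50, 50, 50], rng)
# Row (15): equal spread, varying size.
row15, labels15 = gaussian_clusters(centers_15, [2, 2, 2], [100, 66, 33], rng)
```

Returning the generating labels alongside the points is a design choice made for this sketch; it makes it easy to check cluster sizes or to colour a scatter plot, and it plays no role in the clusterability tests themselves.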
1.4 Chaining Data
• Row (25) of Table 1: Data consists of 50 points generated uniformly from a single unit
circle.
• Row (26) of Table 1: Data is generated uniformly from two concentric circles. Each
circle has 50 points; the inner circle has a radius of 1, the outer has radius 2, and
both are centered at the origin.
• Row (27) of Table 1: Data includes points drawn uniformly from each of 3 concentric
circles. As previously, the circles are centered at the origin, each has fifty points, and
the radii are 1, 2, and 3, respectively.
• Row (28) of Table 1: Data consists of five concentric circles, with fifty points each
and the first three generated as in the previous row. The two added circles have radii
4 and 5.
• Row (29) of Table 1: One vertical line with 100 points was generated from a Gaussian
random variable with mean 50, standard deviation 25, and constant x-coordinate at 50.
• Row (30) of Table 1: Two vertical parallel lines consist of 100 points each, with mean
50 and standard deviation of 25. The lines are horizontally located at x-coordinates of
30 and 55 respectively.
• Row (31) of Table 1: Finally, data is generated from both a vertical line and a
circle, each with 100 points. The line has vertical mean of 0, standard deviation 2, and
is located horizontally at x=5. The circle is generated with radius 3, centered about
the origin.
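The chaining scenarios can be sketched with two small helpers. This sketch assumes that "generated uniformly from a circle" means points placed on the circumference at a uniformly distributed angle; if points were instead drawn from the interior of a disc, the radius would also need to be sampled. The helper names `circle_points` and `vertical_line` are hypothetical.

```python
import math
import random

def circle_points(n, radius, rng):
    """n points on a circle of the given radius, centered at the origin.

    Assumption: points lie on the circumference with a uniform angle.
    """
    points = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        points.append((radius * math.cos(theta), radius * math.sin(theta)))
    return points

def vertical_line(n, x, y_mean, y_sd, rng):
    """n points with constant x and Gaussian y-coordinate."""
    return [(x, rng.gauss(y_mean, y_sd)) for _ in range(n)]

rng = random.Random(3)  # arbitrary seed, for reproducibility of the sketch only

# Rows (25)-(28): 1, 2, 3, and 5 concentric circles with 50 points each.
row25 = circle_points(50, 1, rng)
row26 = [p for r in (1, 2) for p in circle_points(50, r, rng)]
row27 = [p for r in (1, 2, 3) for p in circle_points(50, r, rng)]
row28 = [p for r in (1, 2, 3, 4, 5) for p in circle_points(50, r, rng)]

# Rows (29)-(30): one and two vertical lines of 100 points each.
row29 = vertical_line(100, 50, 50, 25, rng)
row30 = vertical_line(100, 30, 50, 25, rng) + vertical_line(100, 55, 50, 25, rng)

# Row (31): a vertical line at x=5 combined with a circle of radius 3.
row31 = vertical_line(100, 5, 0, 2, rng) + circle_points(100, 3, rng)
```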