Appendix: To cluster, or not to cluster
Andreas Adolfsson, Margareta Ackerman and Naomi Brownstein

1 Simulations

Details of the simulations are described in this section. Each scenario is run 1000 times. The rate of p-values below the selected significance level (0.05) is recorded in Table 1 of the main paper. For unclusterable datasets, this rate represents the Type I error, while for clusterable datasets, it represents statistical power. We now discuss how each type of data summarized in Table 1 was generated.

1.1 Unclusterable Datasets

• Row (1) of Table 1: Datasets contain a single bivariate Gaussian cluster of 50 points with mean 100 and standard deviation 2 in each independent dimension.
• Row (2) of Table 1: Datasets include a single trivariate Gaussian cluster of 50 points in 3 independent dimensions with mean 100 and standard deviation 2.
• Row (3) of Table 1: Datasets include a single Gaussian cluster with 100 points in 50 independent dimensions, each with mean 2 and standard deviation 2.
• Row (4) of Table 1: Datasets contain a single Gaussian cluster of 50 points in 10 dimensions, each with mean 2 and standard deviation 2.

1.2 Sparse Distance Data

Next, we tested the behavior of the tests on data with a small number of outlying points, to see whether the tests treated these points as outliers or as small clusters.

• Row (5) of Table 1: Datasets include a single bivariate Gaussian cluster consisting of 50 points with mean 50 and standard deviation 2 in each dimension. We then added a single bivariate Gaussian observation with standard deviation 2 and mean randomly sampled from a uniform(60,65) distribution.
• Row (6) of Table 1: We verify the effect of density on outliers by expanding the larger cluster from 50 to 250 points.
• Row (7) of Table 1: Third, we generate a similar single cluster with three outliers.
The outliers are independently generated from bivariate Gaussian distributions with standard deviation 2 and means randomly sampled from the ranges (40,55) to (45,60) for the first, (65,65) to (70,70) for the second, and (65,45) to (70,50) for the third.
• Row (8) of Table 1: Datasets were generated with a high likelihood of outliers using 100 points from a two-dimensional t-distribution with 5 degrees of freedom and non-centrality parameter 100.
• Row (9) of Table 1: Data contains 100 points from a t-distribution in two dimensions, each with 10 degrees of freedom and non-centrality parameter 100.
• Row (10) of Table 1: Data includes 100 points from a t-distribution in two independent dimensions with 15 degrees of freedom and non-centrality parameter 100.

1.3 Clusterable Datasets

• Row (11) of Table 1: We simulated two well-separated clusters using 50-point bivariate Gaussian distributions, with each dimension centered at a mean of either 30 or 50 and with a standard deviation of 2.
• Row (12) of Table 1: We simulated data with three Gaussian clusters, each consisting of 50 points with a standard deviation of 2, centered around (30,20), (40,20), and (35,30).
• Row (13) of Table 1: One notorious test for prior notions of clusterability is noise robustness. We therefore generated data consisting of three well-separated clusters as discussed earlier (which on their own yield clusterable results) and then added 80 points of noise. The three bivariate Gaussian clusters are centered around (30,40), (70,40), and (50,80) with a standard deviation of 2, so the added points effectively produce 35% noise in the dataset. The noisy points are generated with mean 50 and standard deviation 20 in both dimensions.
• Row (14) of Table 1: Data was generated from three bivariate Gaussian clusters with the same means as in row (12) but with standard deviations 1, 3, and 5, respectively.
• Row (15) of Table 1: Next, we explore the effects of varying cluster density and spread by simulating a dataset with three Gaussian clusters centered around (35,40), (65,40), and (50,60). We generate all clusters with a standard deviation of 2, with the clusters consisting of 100, 66, and 33 points, respectively.
• Row (16) of Table 1: Next, we generate three independent Gaussian clusters. Each 50-point, 2-dimensional cluster has a standard deviation of 2 in both independent dimensions, and the cluster centers are located at (35,40), (65,40), and (50,60).
• Row (17) of Table 1: Expanding the dimensions of the prior dataset creates a trivariate clusterable structure. Keeping the clusters Gaussian, we now center all dimensions at the mean values of 20, 40, and 60 for the three clusters, respectively.
• Row (18) of Table 1: Data consists of two well-separated 10-dimensional clusters. Using the same approach as for simulating a single 10-dimensional cluster, we create 50 points with 10 independent dimensions for each cluster. Each dimension has a standard deviation of 2 and is centered at 10 for the first cluster and 20 for the second cluster.
• Row (19) of Table 1: Expanding the previous set (row 18) yields four well-separated 10-dimensional clusters. The first two clusters are as described in the preceding paragraph. The third and fourth are 10-dimensional Gaussian clusters, each consisting of 50 points with a standard deviation of 2, with every dimension centered at 60 for the third cluster and 80 for the fourth.
• Row (20) of Table 1: Data consists of 2 clusters of 100 points each in 50 independent dimensions, with all dimensions of a cluster sharing the same mean. The means for clusters 1 and 2 are 5 and 10, respectively.
• Row (21) of Table 1: Similar to the previous row, data consists of 2 fifty-dimensional clusters, each with 100 points. The clusters overlap, with means of 3 and 6 in each dimension and a standard deviation of 2.
• Row (22) of Table 1: Data includes two clusters of 100 points each in two independent dimensions, each following a t-distribution with 5 degrees of freedom, with non-centrality parameters 50 and 150.
• Row (23) of Table 1: Data is generated as in row 22 but with 10 degrees of freedom.
• Row (24) of Table 1: Data is generated as in row 22 but with 15 degrees of freedom.

1.4 Chaining Data

• Row (25) of Table 1: Data consists of 50 points generated uniformly from a single unit circle.
• Row (26) of Table 1: Data is generated uniformly from two concentric circles, each with 50 points. The inner circle has a radius of 1, the outer has a radius of 2, and both are centered at the origin.
• Row (27) of Table 1: Data includes points drawn uniformly from each of 3 concentric circles. As before, the circles are centered at the origin, each has fifty points, and the radii are 1, 2, and 3, respectively.
• Row (28) of Table 1: Data consists of five concentric circles with fifty points each, the first three generated as in the previous row. The two added circles have radii 4 and 5.
• Row (29) of Table 1: One vertical line with 100 points was generated from a Gaussian random variable with mean 50 and standard deviation 25, at a constant x-coordinate of 50.
• Row (30) of Table 1: Two parallel vertical lines consist of 100 points each, with mean 50 and standard deviation 25. The lines are located at x-coordinates of 30 and 55, respectively.
• Row (31) of Table 1: Finally, data is generated from both a vertical line and a circle, each with 100 points. The line has a vertical mean of 0 and standard deviation of 2, and is located horizontally at x=5. The circle is generated with radius 3, centered about the origin.
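
As an illustration only, the Gaussian and concentric-circle scenarios above, together with the 1000-run rejection rate at the 0.05 level, could be sketched in Python with NumPy as follows. This is a minimal sketch, not the code used for the paper. The function names (`gaussian_clusters`, `concentric_circles`, `rejection_rate`) are hypothetical, the `p_value` argument stands in for whichever clusterability test is being evaluated, and the sketch assumes that "uniformly from a circle" means uniformly random angles on the circumference.

```python
import numpy as np


def gaussian_clusters(rng, means, sd=2.0, n=50, dim=2):
    """Stack one Gaussian cluster of n points per entry of `means`.

    A scalar mean is repeated across all `dim` independent dimensions;
    a tuple such as (30, 20) gives one mean per dimension.
    """
    blocks = []
    for m in means:
        center = np.broadcast_to(np.asarray(m, dtype=float), (dim,))
        blocks.append(rng.normal(loc=center, scale=sd, size=(n, dim)))
    return np.vstack(blocks)


def concentric_circles(rng, radii, n=50):
    """Place n points at uniformly random angles on each origin-centered circle.

    Assumption: points lie on the circumference, not inside the disk.
    """
    blocks = []
    for r in radii:
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
        blocks.append(np.column_stack((r * np.cos(theta), r * np.sin(theta))))
    return np.vstack(blocks)


def rejection_rate(generate, p_value, runs=1000, alpha=0.05, seed=0):
    """Fraction of simulated datasets whose p-value falls below alpha.

    `generate(rng)` returns one dataset; `p_value(data)` is a placeholder
    for the clusterability test under study (not implemented here).
    """
    rng = np.random.default_rng(seed)
    rejections = sum(p_value(generate(rng)) < alpha for _ in range(runs))
    return rejections / runs
```

Under these assumptions, row (1) would correspond to `gaussian_clusters(rng, means=[100])`, row (18) to `gaussian_clusters(rng, means=[10, 20], dim=10)`, and row (27) to `concentric_circles(rng, radii=[1, 2, 3])`. Passing one of these generators and a test's p-value function to `rejection_rate` gives an estimate of the Type I error for unclusterable data or of the power for clusterable data.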