Epsilon Grid Order: An Algorithm for the Similarity Join on
... facilitate the search by similarity, multidimensional feature vectors are extracted from the original objects and organized in multidimensional access methods. The particular property of this feature transformation is that the Euclidean distance between two feature vectors corresponds to the (dis-) ...
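A similarity join reports every pair of feature vectors whose Euclidean distance lies within a threshold epsilon. As a rough illustration only (this is a naive nested-loop baseline, not the Epsilon Grid Order algorithm from the paper, and the function name is hypothetical):

import math

def naive_similarity_join(vectors, eps):
    # Return all pairs (i, j), i < j, of feature vectors whose
    # Euclidean distance is at most eps.
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j])))
            if dist <= eps:
                pairs.append((i, j))
    return pairs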
Fast mining of frequent tree structures by hashing and indexing
... subsequently be processed. Characteristic examples include XML and HTML documents and biological data. An intrinsic feature possessed by all these data is that they do not have a rigid, well-defined structure. There are several reasons for that. For instance, the data source may not impose a ...
Cloud-based Malware Detection for Evolving Data Streams
... they exhibit significant concept-drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditi ...
Semi-Final Proceedin..
... learning classifiers do best when the number of dimensions is small (less than 100) and the number of data points is large (greater than 1000). Unfortunately, in many biochemical data sets, the number of dimensions is large (e.g. 30,000 genes) and the number of data points is small (e.g. 50 patient samples). The fi ...
Tutorial "Computational intelligence for data mining"
... black-and-white picture may be inappropriate in many applications. Discontinuous cost functions allow only non-gradient optimization. Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex rule sets. Reliable crisp rules may reject some cases as unc ...
Author Proof - Soft Computing and Intelligent Information Systems
... rules. Some of the most well-known decision tree algorithms are C4.5 [36] and CART [37]. The main difference between these two algorithms is the splitting criterion used at the internal nodes: C4.5 uses the information gain ratio, and CART employs the Gini index. Decision trees have been employed su ...
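For concreteness, here is a minimal sketch of the two splitting criteria mentioned above, assuming labels are given as plain Python lists and a split is represented by the lists of labels in each child node (function names are illustrative, not taken from C4.5 or CART implementations):

import math
from collections import Counter

def gini(labels):
    # Gini index used by CART: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, child_label_lists):
    # Information gain ratio used by C4.5: information gain divided by
    # the split information of the partition.
    n = len(parent_labels)
    gain = entropy(parent_labels) - sum(
        len(ch) / n * entropy(ch) for ch in child_label_lists)
    split_info = -sum(
        (len(ch) / n) * math.log2(len(ch) / n) for ch in child_label_lists if ch)
    return gain / split_info if split_info > 0 else 0.0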
A Novel K-Means Based Clustering Algorithm for High Dimensional
... which are for samples. There are missing values in this table because some questions have not been answered, so we replaced them with 0. In addition, we need to calculate the length of each vector based on its dimensions for further processing. All attribute values in this table are ordinal, and we arran ...
Survey on Different Density Based Algorithms on
... same as the original DBSCAN algorithm, and all the data are initialized as UNCLASSIFIED at the beginning. All the border objects are considered in the clustering process, but there is some possibility of missing core objects, which causes some loss of objects. ODBSCAN gives better result tha ...
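The core/border distinction referred to here is the standard DBSCAN condition; a minimal sketch (not the ODBSCAN variant discussed in the survey, and the function name is illustrative) might look like this:

import math

def is_core_object(p, points, eps, min_pts):
    # DBSCAN-style core condition: p is a core object if at least min_pts
    # points (p itself included) lie within distance eps of it;
    # points within eps of a core object but not core themselves are border objects.
    neighbourhood = [q for q in points if math.dist(p, q) <= eps]
    return len(neighbourhood) >= min_pts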
Classification and Clustering - Connected Health Summer School
... • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making ag ...
Learning Classifiers from Only Positive and Unlabeled Data
... The input to an algorithm that learns a binary classifier consists normally of two sets of examples. One set is positive examples x such that the label y = 1, and the other set is negative examples x such that y = 0. However, suppose the available input consists of just an incomplete set of positive ...
Data Preparation and Reduction
... Attention should be paid to data transformation, because relatively simple transformations can sometimes be far more effective for the final performance! ...
A Hybrid K-Mean Clustering Algorithm for Prediction Analysis
... used in a wide array of applications, the k-means algorithm has drawbacks: like many clustering methods, it requires the number of clusters k in the database to be specified beforehand, which is not always realistic in real-world applications13. Moreover, the k-means algorithm is computationally very ex ...
Subspace Scores for Feature Selection in Computer Vision
... It has long been observed that k-means can be accelerated by projecting data points to a small set of top principal data components before clustering, without sacrificing quality. Typically, O(k) principal components are used, which can be a substantial reduction since k, the number of target cluste ...
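A minimal sketch of this accelerate-by-projection idea, assuming scikit-learn is available; the choice of 2*k components is only an illustrative O(k) setting, and the function name is hypothetical:

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pca_then_kmeans(X, k, n_components=None):
    # Project the data onto the top O(k) principal components, then run
    # k-means in the reduced space instead of the original dimensionality.
    if n_components is None:
        n_components = min(2 * k, X.shape[1])
    X_reduced = PCA(n_components=n_components).fit_transform(X)
    return KMeans(n_clusters=k, n_init=10).fit_predict(X_reduced)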
Algorithmic Bias from discrimination discovery to fairness
... [Pre_2] F. Kamiran and T. Calders. “Data preprocessing techniques for classification without discrimination”. In Knowledge and Information Systems (KAIS), 33(1), 2012. [Pre_3] S. Hajian and J. Domingo-Ferrer. “A methodology for direct and indirect discrimination prevention in data mining”. In IEEE T ...
K-nearest neighbors algorithm
In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

Both for classification and regression, it can be useful to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data. The algorithm has nothing to do with, and is not to be confused with, k-means, another popular machine learning technique.
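A minimal sketch of k-NN classification with the optional 1/d weighting described above, assuming the training set is a list of (feature_vector, label) pairs; the function name and signature are illustrative only:

import math
from collections import defaultdict

def knn_classify(training_set, query, k, weighted=False):
    # Find the k training examples nearest to the query under Euclidean distance.
    nearest = sorted(
        ((math.dist(x, query), y) for x, y in training_set),
        key=lambda pair: pair[0],
    )[:k]
    # Vote on the label, optionally weighting each neighbor's vote by 1/d.
    votes = defaultdict(float)
    for d, y in nearest:
        votes[y] += 1.0 / d if weighted and d > 0 else 1.0
    return max(votes, key=votes.get)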