IT 6702 – Data Warehousing and Data Mining
UNIT IV - CLUSTERING AND APPLICATION AND TRENDS IN DATA

PART – A

1. How may outliers be detected by clustering? (May 15)
An outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. In clustering, objects that do not belong to any cluster, or that fall into very small or sparse clusters, can be flagged as outliers.

2. Let x1 = (1, 2) and x2 = (3, 5) represent two points. Calculate the Manhattan distance between these two points. (May 15)
The Manhattan distance is the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|. The Manhattan distance between x1 and x2 is therefore |1 - 3| + |2 - 5| = 2 + 3 = 5.

3. Define divisive hierarchical clustering. (Nov 14)
This variant of hierarchical clustering is also called top-down clustering. We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster.

4. What is STING? (May 14)
STING stands for Statistical Information Grid. It is a grid-based, multiresolution clustering method. In STING, all objects are placed into rectangular cells; the cells are kept at various levels of resolution, and these levels are arranged in a hierarchical structure.

5. Define WaveCluster. (May 14)
WaveCluster is a grid-based, multiresolution clustering method. All objects are represented by a multidimensional grid structure, and a wavelet transformation is applied to find the dense regions. Each grid cell summarizes the information of the group of objects that map into that cell.

6. What are the requirements of clustering?
Scalability; ability to deal with different types of attributes; ability to deal with noisy data; minimal requirements for domain knowledge to determine input parameters; constraint-based clustering; interpretability and usability.

7. Define CLARA and CLARANS.
CLARA stands for Clustering LARge Applications. The efficiency of CLARA depends on the size of the representative data set (the sample). CLARA cannot find a good clustering if the sampled representative data sets do not contain the best k medoids. To overcome this drawback, a new algorithm, Clustering Large Applications based upon RANdomized Search (CLARANS), was introduced. CLARANS works like CLARA; the only difference between them is the clustering process that is carried out after the representative data sets are selected.

8. Define the Chameleon method.
Chameleon is a hierarchical clustering method that uses dynamic modelling. It was introduced to overcome the drawbacks of the CURE method. In this method, two clusters are merged if the interconnectivity and closeness between them are high relative to the internal interconnectivity and closeness of the objects within each cluster.

Dept of CSE/EGSPEC

9. Define a spatial database.
Spatial databases contain spatial-related information. Such databases include geographic (map) databases, VLSI chip design databases, and medical and satellite image databases. Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps.

10. What is DBSCAN?
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density-based clustering method that grows regions of sufficiently high density into clusters of arbitrary shape and size. DBSCAN defines a cluster as a maximal set of density-connected points.

11. Define binary variables. What are the two types of binary variables?
A binary variable has two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. Symmetric binary variables are those whose two states are equally valuable and carry the same weight.
Asymmetric binary variables are those whose two states are not equally important and so do not carry the same weight.

12. What is web usage mining?
Web usage mining performs mining on web usage data, or web logs. A web log is a listing of page reference data. It is also called clickstream data because each entry corresponds to a mouse click. When these logs are examined from a server perspective, mining uncovers information about the sites where the service resides and can help improve the design of those sites.

13. Define BIRCH, ROCK and CURE.
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies): Partitions objects hierarchically using tree structures and then refines the clusters using other clustering methods. It defines a clustering feature and an associated tree structure that summarizes a cluster. The tree is a height-balanced tree that stores cluster information. BIRCH does not perform well when clusters are not spherical and may then produce unintended clusters.
ROCK (RObust Clustering using linKs): Merges clusters based on their interconnectivity. It is well suited to categorical data. It ignores information about the closeness of two clusters while emphasizing interconnectivity.
CURE (Clustering Using REpresentatives): Creates clusters by sampling the database and shrinks the representative points of each cluster toward the cluster center by a specified fraction. Sampling improves running time at some cost in precision.

14. What are the applications of spatial databases?
Spatial databases are applied in geographical (map) databases, VLSI chip design, and medical and satellite image databases.

15. What is a text database?
A text database is a database that contains text documents or other word descriptions in the form of long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.

PART – B

1. Explain the types of data in cluster analysis in detail with example.
(Nov 14)

Data structures
Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables), whose entry xif is the value of variable f for object i.
Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table whose entry in row i and column j is d(i, j), the measured difference or dissimilarity between objects i and j.
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix.

Interval-scaled variables
Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.
“How can the data for a variable be standardized?” To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be performed as follows.
1. Calculate the mean absolute deviation, sf = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|), where x1f, ..., xnf are the n measurements of f and mf is their mean.
2. Calculate the standardized measurement, or z-score: zif = (xif - mf) / sf.
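As a rough illustration of standardizing one variable, the sketch below computes the mean, the mean absolute deviation, and the resulting z-scores. The function name `standardize` and the sample heights are made up for this example, and the choice of the mean absolute deviation as the scale is an assumption taken from the remark that follows.

```python
def standardize(values):
    """Return z-scores of `values`, scaled by the mean absolute deviation.

    Assumes the values are not all identical (otherwise the deviation is 0
    and the division below fails).
    """
    n = len(values)
    mean = sum(values) / n                          # mean of variable f
    mad = sum(abs(x - mean) for x in values) / n    # mean absolute deviation
    return [(x - mean) / mad for x in values]       # unitless z-scores


heights_cm = [150.0, 160.0, 170.0, 180.0]  # hypothetical measurements
z = standardize(heights_cm)
print(z)  # [-1.5, -0.5, 0.5, 1.5]
```

Here the mean is 165 and the mean absolute deviation is 10, so each height is expressed as a number of deviations above or below the mean, independent of the original unit.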
Using the mean absolute deviation is more robust to outliers than using the standard deviation.

Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects. Some popular ones include:
Minkowski distance: d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q), where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer.
If q = 1, d is the Manhattan distance.
If q = 2, d is the Euclidean distance.
Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.
Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:
1. d(i, j) ≥ 0: Distance is a nonnegative number.
2. d(i, i) = 0: The distance of an object to itself is 0.
3. d(i, j) = d(j, i): Distance is a symmetric function.
4. d(i, j) ≤ d(i, h) + d(h, j): Going directly from object i to object j in space is no more than making a detour over any other object h (triangular inequality).

Binary Variables
Let us see how to compute the dissimilarity between objects described by either symmetric or asymmetric binary variables. A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present. Given the variable smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not.
“How can we compute the dissimilarity between two binary variables?” One approach involves computing a dissimilarity matrix from the given binary data. If all binary variables are thought of as having the same weight, we have a 2-by-2 contingency table, where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j.
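The four contingency counts can be tallied directly from two 0/1 vectors. The sketch below is only illustrative: the function names and the two patient records are hypothetical, and the asymmetric variant (which ignores the 0/0 matches counted by t) is the usual textbook form rather than something stated in these notes.

```python
def contingency_counts(obj_i, obj_j):
    """Count q (1,1), r (1,0), s (0,1) and t (0,0) pairs for two binary vectors."""
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if a == 1 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        elif a == 0 and b == 1:
            s += 1
        else:
            t += 1
    return q, r, s, t


def symmetric_dissimilarity(obj_i, obj_j):
    """Mismatches divided by all variables: (r + s) / (q + r + s + t)."""
    q, r, s, t = contingency_counts(obj_i, obj_j)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(obj_i, obj_j):
    """Mismatches divided by variables where at least one object is 1."""
    q, r, s, _ = contingency_counts(obj_i, obj_j)
    if q + r + s == 0:
        return 0.0  # assumed convention when neither object has any 1
    return (r + s) / (q + r + s)


# Hypothetical patients described by six binary test results.
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]
print(contingency_counts(patient_a, patient_b))  # (2, 0, 1, 3)
```

For these two records the symmetric dissimilarity is 1/6 and the asymmetric one is 1/3, which shows how ignoring shared absences makes the same pair look less similar.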
The total number of variables is p, where p = q + r + s + t. A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender having the states male and female. Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity. Its dissimilarity (or distance) measure, defined in Equation (7.9) as d(i, j) = (r + s) / (q + r + s + t), can be used to assess the dissimilarity between objects i and j.

Categorical variables
A categorical variable is a generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.
Method 1: Simple matching, d(i, j) = (p - m) / p, where m is the number of matches and p is the total number of variables.
Method 2: Use a large number of binary variables, creating a new binary variable for each of the M nominal states.

Ordinal Variables
An ordinal variable can be discrete or continuous. Order is important, e.g., rank. It can be treated like an interval-scaled variable:
Replace xif by its rank rif in {1, ..., Mf}.
Map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable by zif = (rif - 1) / (Mf - 1).
Compute the dissimilarity using methods for interval-scaled variables.

Ratio-scaled variables
A ratio-scaled variable is a positive measurement on a nonlinear scale, approximately an exponential scale, such as Ae^(Bt) or Ae^(-Bt), where A and B are positive constants and t typically represents time.
Methods:
Treat them like interval-scaled variables. This is not a good choice, because the scale can be distorted.
Apply a logarithmic transformation, yif = log(xif).
Treat them as continuous ordinal data and treat their rank as interval-scaled.

Variables of Mixed Types
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio. One may use a weighted formula to combine their effects.

Vector Objects
Vector objects: keywords in documents, gene features in micro-arrays, etc.
Broad applications: information retrieval, biologic taxonomy, etc.
Cosine measure: s(x, y) = (x · y) / (||x|| ||y||), where x · y is the dot product of the two vectors and ||x|| is the Euclidean length of x.
A variant is the Tanimoto coefficient: s(x, y) = (x · y) / (x · x + y · y - x · y).

2. Explain in detail about the applications of data mining.

DATA MINING APPLICATIONS

Introduction
Data mining is widely used in diverse areas. A number of commercial data mining systems are available today, yet many challenges remain in this field. This answer covers the applications and trends of data mining.

Data Mining Applications
Data mining is widely used in the following areas:
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection

Financial Data Analysis
Financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Typical cases include:
Design and construction of data warehouses for multidimensional data analysis and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.

Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. The quantity of data collected will naturally continue to expand rapidly because of the increasing ease, availability and popularity of the web. Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and better customer retention and satisfaction. Examples of data mining in the retail industry include:
Design and construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of the effectiveness of sales campaigns.
Customer retention.
Product recommendation and cross-referencing of items.

Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail, and web data transmission. Due to the development of new computer and communication technologies, the telecommunication industry is expanding rapidly. This is why data mining has become very important for understanding the business. Data mining in the telecommunication industry helps to identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service. Examples of how data mining improves telecommunication services include:
Multidimensional analysis of telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential pattern analysis.
Mobile telecommunication services.
Use of visualization tools in telecommunication data analysis.

Biological Data Analysis
Nowadays there is vast growth in fields of biology such as genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. Data mining contributes to biological data analysis in the following ways:
Semantic integration of heterogeneous, distributed genomic and proteomic databases.
Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
Discovery of structural patterns and analysis of genetic networks and protein pathways.
Association and path analysis.
Visualization tools in genetic data analysis.

Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate.
Huge amounts of data have been collected from scientific domains such as geosciences and astronomy. Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modelling, chemical engineering, and fluid dynamics. Applications of data mining in scientific fields include:
Data warehouses and data pre-processing.
Graph-based mining.
Visualization and domain-specific knowledge.

Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. Increased usage of the Internet, together with the availability of tools and tricks for intruding into and attacking networks, has made intrusion detection a critical component of network administration. Data mining technology may be applied to intrusion detection in the following areas:
Development of data mining algorithms for intrusion detection.
Association and correlation analysis, and aggregation to help select and build discriminating attributes.
Analysis of stream data.
Distributed data mining.
Visualization and query tools.

3. Explain outlier analysis with example. (Nov 14)

4. Explain with an example density-based local outlier detection. (May 13)

5. Write notes on DBSCAN. (Nov 13)

6. Explain hierarchical method and density based method of classification with example. (Nov 14)

7. How does agglomerative hierarchical clustering work? Explain with an example. (May 15)

8. How does divisive hierarchical clustering work? Explain with an example. (May 15)

9. Consider five points (x1, x2, x3, x4, x5) with the following coordinates as a two-dimensional sample for clustering: x1 = (0, 2), x2 = (1, 0), x3 = (2, 1), x4 = (4, 1) and x5 = (5, 3). Illustrate the K-means algorithm on the above data set.
The required number of clusters is 2, and initially the clusters are formed from a random distribution of samples: c1 = (x1, x2, x4) and c2 = (x3, x5). (May 15)

10. Consider five points (x1, x2, x3, x4, x5) with the following coordinates as a two-dimensional sample for clustering: x1 = (0, 2.25), x2 = (0, 0.25), x3 = (0.25, 0), x4 = (4.5, 0) and x5 = (4.5, 2.5). Illustrate the K-means partitioning algorithm (clustering algorithm) using the above data set. (May 13)
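For question 9, the K-means iterations can be checked with a short script. This is a minimal sketch, not a model answer: the function name `kmeans`, the use of Euclidean distance, and the stopping rule (stop when no point changes cluster) are assumptions, and the code assumes no cluster ever becomes empty.

```python
import math


def kmeans(points, assignment, k, max_iter=100):
    """K-means starting from an initial assignment of points to clusters 0..k-1."""
    for _ in range(max_iter):
        # Update step: each centroid is the mean of its current members.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        # Assignment step: move each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
    return assignment, centroids


# Data and initial clusters from question 9: c1 = (x1, x2, x4), c2 = (x3, x5).
points = [(0, 2), (1, 0), (2, 1), (4, 1), (5, 3)]
initial = [0, 0, 1, 0, 1]
labels, centers = kmeans(points, initial, k=2)
print(labels)   # [0, 0, 0, 1, 1]
print(centers)  # [(1.0, 1.0), (4.5, 2.0)]
```

The initial centroids are (5/3, 1) and (3.5, 2). After one reassignment x3 moves to the first cluster and x4 to the second, giving clusters (x1, x2, x3) and (x4, x5) with centroids (1, 1) and (4.5, 2), and the next pass changes nothing. The same function could be applied to the points in question 10 once a value of k and an initial assignment are chosen.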