DWDM

PART-A

1. Write down the applications of data warehousing.

2. When is a data mart appropriate?

3. What is a concept hierarchy? Give an example.

4. What are the various forms of data preprocessing?
Data cleaning, data integration, data reduction, and data transformation.
Data cleaning: fill in missing values, smooth noisy data, identify outliers, and correct data inconsistencies.
Data integration: combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict detection, and resolution of semantic heterogeneity contribute towards smooth data integration.
Data transformation: converts the data into forms appropriate for mining, e.g. attribute data may be normalized to fall within a small range such as 0.0 to 1.0 (a Python sketch of this appears after question 12 below).
Data reduction: data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization. Used to obtain a reduced representation of the data while minimizing the loss of information content.

5. Write down the two measures of an association rule.

6. Define conditional pattern base.
A conditional pattern base is a small database of patterns that co-occur with an item.

7. List out the major strengths of the decision tree method.

8. Distinguish between classification and clustering.

9. Define a spatial database.
A spatial database is a database that is optimized to store and query data related to objects in space, including points, lines, and polygons. While typical databases can understand various numeric and character types of data, additional functionality needs to be added for databases to process spatial data types, typically called geometries or features. The Open Geospatial Consortium (OGC) created the Simple Features specification, which sets standards for adding spatial functionality to database systems.
Features of spatial databases: database systems use indexes to quickly look up values, but the way most databases index data is not optimal for spatial queries. Instead, spatial databases use a spatial index to speed up database operations. In addition to typical SQL queries such as SELECT statements, spatial databases can perform a wide variety of spatial operations. The following query types, and many more, are supported by the Open Geospatial Consortium (a short sketch of these operations also appears after question 12 below):
Spatial measurements: find the distance between points, the area of a polygon, etc.
Spatial functions: modify existing features to create new ones, for example by providing a buffer around them, intersecting features, etc.
Spatial predicates: allow true/false queries such as "is there a residence located within a mile of the area where we are planning to build the landfill?"
Constructor functions: create new features with an SQL query specifying the vertices (points or nodes) which can make up lines. If the first and last vertex of a line are identical, the feature can also be of the type polygon (a closed line).
Observer functions: queries which return specific information about a feature, such as the location of the center of a circle.

10. List out any two commercial data mining tools.

PART-B

11. (a) (i) With a neat sketch, explain the architecture of a data warehouse.
(ii) Discuss the typical OLAP operations with an example.
Or
(b) (i) Discuss how computations can be performed efficiently on data cubes.
(ii) Write short notes on data warehouse metadata.

12. (a) (i) Explain various methods of data cleaning in detail.
(ii) Give an account of data mining query languages.
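To make the min-max normalization mentioned under question 4 concrete, here is a minimal Python sketch; the attribute name and values are invented for illustration:

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        """Rescale numeric attribute values into [new_min, new_max]."""
        old_min, old_max = min(values), max(values)
        if old_max == old_min:          # degenerate case: all values are equal
            return [new_min for _ in values]
        scale = (new_max - new_min) / (old_max - old_min)
        return [new_min + (v - old_min) * scale for v in values]

    # Example: normalize a hypothetical 'income' attribute into [0.0, 1.0]
    incomes = [12000, 35000, 47000, 98000]
    print(min_max_normalize(incomes))   # [0.0, 0.267..., 0.406..., 1.0]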
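The spatial query types listed under question 9 can likewise be experimented with in Python using the shapely library (this assumes shapely is installed; the features and coordinates are invented):

    from shapely.geometry import Point, Polygon

    # Constructor function: build features from vertices
    # (a closed ring of vertices yields a polygon)
    landfill = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])
    residence = Point(4.5, 1)

    # Spatial measurements: distance between features, polygon area
    print(residence.distance(landfill))   # 0.5
    print(landfill.area)                  # 12.0

    # Spatial function: derive a new feature, e.g. a buffer around the landfill
    buffer_zone = landfill.buffer(1.0)

    # Spatial predicate: a true/false question about the features
    print(buffer_zone.contains(residence))   # True: the residence lies inside the buffer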
Several data mining query languages have been proposed, among them:
DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
Or
(b) How is Attribute-Oriented Induction implemented? Explain in detail.
Attribute-oriented induction (AOI) uses concept hierarchies to discover hidden patterns in a huge amount of data and presents the concise patterns as a general description of the original data. It is an effective data analysis and data reduction technique. However, the construction of concept hierarchies for numeric attributes is sometimes subjective, and the generalization of border values near the cutting points of discretization can easily lead to misconceptions.

13. (a) Write and explain the algorithm for mining frequent itemsets without candidate generation. Give a relevant example.
Or
(b) Discuss the approaches for mining multilevel association rules from transactional databases. Give a relevant example.
Items often form a hierarchy, and items at the lower levels are expected to have lower support. Rules involving itemsets at the appropriate levels can be quite useful. The transaction database can be encoded based on dimensions and levels, and we can explore shared multilevel mining.
Mining multilevel associations takes a top-down, progressive-deepening approach: first find high-level strong rules, such as milk => bread [support 20%, confidence 60%], then find their lower-level "weaker" rules, such as 2% milk => wheat bread [support 6%, confidence 50%]. Variations on mining multiple-level association rules include level-crossed association rules (2% milk => Wonder wheat bread) and association rules with multiple, alternative hierarchies (2% milk => Wonder bread). A small sketch of the level-wise support computation follows.
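A minimal Python sketch of the level-wise support computation behind this approach; the transactions and the item hierarchy are invented for illustration:

    # Transactions hold leaf-level items; 'parent' encodes the concept hierarchy.
    transactions = [
        {"2% milk", "wheat bread"},
        {"2% milk", "white bread"},
        {"skim milk", "wheat bread"},
        {"2% milk"},
        {"skim milk", "wheat bread", "2% milk"},
    ]
    parent = {"2% milk": "milk", "skim milk": "milk",
              "wheat bread": "bread", "white bread": "bread"}

    def support(itemset, txns):
        """Fraction of transactions that contain every item of the itemset."""
        return sum(itemset <= t for t in txns) / len(txns)

    def generalize(txns):
        """Re-encode each transaction one level higher in the hierarchy."""
        return [{parent.get(i, i) for i in t} for t in txns]

    # The high-level itemset {milk, bread} has high support ...
    print(support({"milk", "bread"}, generalize(transactions)))   # 0.8
    # ... while its low-level specialization has lower support, as expected.
    print(support({"2% milk", "wheat bread"}, transactions))      # 0.4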
14. (a) (i) Explain the algorithm for constructing a decision tree from training samples.
(ii) Explain Bayes' theorem.
In probability theory, Bayes' theorem (often called Bayes' law, after the Rev. Thomas Bayes) relates the conditional and marginal probabilities of two random events. It is often used to compute posterior probabilities given observations. For example, a patient may be observed to have certain symptoms; Bayes' theorem can be used to compute the probability that a proposed diagnosis is correct, given that observation. Bayes' theorem states that judgements should be influenced by two main factors: the base rate and the likelihood ratio.
As a formal theorem, Bayes' theorem is valid in all common interpretations of probability. However, it plays a central role in the debate around the foundations of statistics: frequentist and Bayesian interpretations disagree about the ways in which probabilities should be assigned in applications. Frequentists assign probabilities to random events according to their frequencies of occurrence, or to subsets of populations as proportions of the whole, while Bayesians describe probabilities in terms of beliefs and degrees of uncertainty.
Bayes' theorem relates the conditional and marginal probabilities of events A and B, where B has a non-vanishing probability:

    P(A|B) = P(B|A) P(A) / P(B)

Each term in Bayes' theorem has a conventional name. P(A) is the prior probability or marginal probability of A; it is "prior" in the sense that it does not take into account any information about B. P(A|B) is the conditional probability of A given B; it is also called the posterior probability because it is derived from, or depends upon, the specified value of B. P(B|A) is the conditional probability of B given A. P(B) is the prior or marginal probability of B, and acts as a normalizing constant. Intuitively, Bayes' theorem in this form describes the way in which one's beliefs about observing A are updated by having observed B.
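Continuing the diagnosis example, a short Python sketch of a posterior computation with Bayes' theorem; the base rate and the likelihoods are invented numbers:

    # P(D): prior (base rate) of the disease
    p_d = 0.01
    # P(S|D): probability of the symptom given the disease
    p_s_given_d = 0.9
    # P(S|not D): probability of the symptom without the disease
    p_s_given_not_d = 0.05

    # P(S): marginal probability of the symptom, the normalizing constant
    p_s = p_s_given_d * p_d + p_s_given_not_d * (1 - p_d)

    # Bayes' theorem: P(D|S) = P(S|D) * P(D) / P(S)
    p_d_given_s = p_s_given_d * p_d / p_s
    print(round(p_d_given_s, 3))   # 0.154

Even with a sensitive test, the low base rate keeps the posterior modest, which is exactly the base-rate effect described above.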
Or (b) Explain the following clustering methods in detail:
(i) BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is an incremental, hierarchical clustering algorithm for large databases. The strongest point of the BIRCH algorithm is its support for very large databases (where main memory is smaller than the size of the database). There are two main building components in the BIRCH algorithm: (1) a hierarchical clustering component, and (2) a main-memory structure component. We will visit each component in turn and give a conceptual idea of how BIRCH clustering works.
The idea of hierarchical clustering is illustrated in Figure 1. The algorithm starts with single-point clusters (every point in the database is a cluster, cf. Figure 1(a)). It then groups the closest points into separate clusters (Figure 1(b)) and continues until only one cluster remains (Figure 1(d)). The computation of the clusters is done with the help of a distance matrix, which takes O(n^2) space and O(n^2) time.

[Figure 1: the idea of hierarchical clustering. 1(a): the dataset; 1(b): 1st step; 1(c): 2nd step; 1(d): last step.]

BIRCH uses a main-memory data structure (of limited size) called the CF tree. The tree is organized in such a way that (i) the leaves contain the actual clusters, and (ii) the size of any cluster in a leaf is no larger than R. An example of the CF tree is illustrated in Figure 2. Initially, the data points are in one cluster. As data arrives, a check is made whether the size of the cluster exceeds R (cf. Figures 2(a)-(b)). If the cluster grows too big, it is split into two clusters and the points are redistributed (Figure 2(c)). Points are then continuously inserted into the cluster that is enlarged the least (cf. Figure 2(d)). At each node of the CF tree, BIRCH keeps the mean of the cluster and the mean of the sum of squares, so that the size of a cluster can be computed efficiently (a small sketch of this bookkeeping appears after the CURE pseudocode below). The tree structure also depends on a branching parameter T, which determines the maximum number of children each node can have.

[Figure 2: the idea of the CF tree. 2(a): 1st step; 2(b): 2nd step; 2(c): 3rd step; 2(d): 4th step.]

The BIRCH algorithm starts with a dataset and tries to guess a cluster size R such that the tree fits in main memory. If the tree does not fit into main memory, the algorithm increases R (so that clusters absorb more points and the tree shrinks) and rebuilds the tree. The process is repeated until the tree fits into main memory. The BIRCH algorithm can also include a number of post-processing phases to remove outliers and improve the clustering.
(ii) CURE
CURE is an efficient clustering algorithm for large databases that is more robust to outliers and identifies clusters having non-spherical shapes and wide variances in size. To avoid the problems of non-uniformly sized or shaped clusters, CURE employs a novel hierarchical clustering algorithm that adopts a middle ground between the centroid-based and all-points extremes. In CURE, a constant number c of well-scattered points of a cluster are chosen, and they are shrunk towards the centroid of the cluster by a fraction alpha. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers. The algorithm is given below; its running time is O(n^2 log n) and its space complexity is O(n).
The algorithm cannot be directly applied to large databases, so the following enhancements are made.
Random sampling: to handle large data sets, we draw a random sample, which generally fits in main memory. Random sampling introduces a trade-off between accuracy and efficiency.
Partitioning for speed-up: the basic idea is to partition the sample space into p partitions. In a first pass, each partition is partially clustered until the number of clusters in it reduces to n/(pq) for some constant q >= 1. A second clustering pass is then run over the n/q partial clusters of all partitions. For the second pass we only store the representative points, since the merge procedure only requires the representative points of the previous clusters when computing the new representative points for the merged cluster. The advantage of partitioning the input is reduced execution time.
Labeling data on disk: since we only have representative points for the k clusters, the remaining data points must also be assigned to clusters. A fraction of randomly selected representative points is chosen for each of the k clusters, and each data point is assigned to the cluster containing the representative point closest to it.
Pseudocode:
CURE(S, k)
Input: a set of points S
Output: k clusters
1. For every cluster u (initially, each input point), u.mean stores the mean of the points in the cluster and u.rep stores a set of c representative points of the cluster (initially c = 1, since each cluster contains one data point); u.closest stores the cluster closest to u.
2. Insert all input points into a k-d tree T.
3. Treat each input point as a separate cluster, compute u.closest for each u, and insert each cluster into the heap Q (clusters are arranged in increasing order of the distance between u and u.closest).
4. While size(Q) > k:
5.   Remove the top element of Q (say u) and merge it with its closest cluster u.closest (say v); compute the new representative points for the merged cluster w.
6.   Remove u and v from T and Q.
7.   For every cluster x in Q, update x.closest and relocate x accordingly.
8.   Insert w into Q.
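Looking back at BIRCH, a minimal Python sketch of the per-node bookkeeping described above: from the point count, the linear sum, and the sum of squares, the centroid and the cluster size (radius) follow cheaply. One-dimensional points are used to keep the sketch short, and the values are invented:

    import math

    class ClusteringFeature:
        """Per-cluster summary (N, LS, SS) for one-dimensional points."""
        def __init__(self):
            self.n, self.ls, self.ss = 0, 0.0, 0.0
        def add(self, x):
            self.n += 1
            self.ls += x
            self.ss += x * x
        def centroid(self):
            return self.ls / self.n
        def radius(self):
            # Square root of the mean squared distance to the centroid
            return math.sqrt(max(self.ss / self.n - self.centroid() ** 2, 0.0))

    cf = ClusteringFeature()
    for x in [1.0, 2.0, 3.0]:
        cf.add(x)
    print(cf.centroid(), cf.radius())   # 2.0 0.816...

Inserting a point only updates three numbers, which is what lets BIRCH keep the whole tree in a bounded amount of main memory.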
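And a minimal Python sketch of CURE's representative points: choose c well-scattered points of a cluster and shrink them towards the centroid by a fraction alpha. The scatter selection here is a simplified greedy farthest-point heuristic, and the cluster points and parameters are invented:

    def cure_representatives(cluster, c=4, alpha=0.3):
        """Pick c scattered points of a 2-d cluster, shrunk towards the centroid."""
        cx = sum(x for x, _ in cluster) / len(cluster)
        cy = sum(y for _, y in cluster) / len(cluster)
        dist2 = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
        # Start from the point farthest from the centroid, then repeatedly
        # take the point farthest from the representatives chosen so far.
        chosen = [max(cluster, key=lambda p: dist2(p, (cx, cy)))]
        while len(chosen) < min(c, len(cluster)):
            chosen.append(max(cluster, key=lambda p: min(dist2(p, q) for q in chosen)))
        # Shrink each representative towards the centroid by the fraction alpha.
        return [(x + alpha * (cx - x), y + alpha * (cy - y)) for x, y in chosen]

    points = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1)]
    print(cure_representatives(points, c=2, alpha=0.3))   # [(0.3, 0.3), (1.7, 1.7)]

Shrinking pulls the representatives away from the cluster fringe, which is what dampens the effect of outliers on the merge distances.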
15. (a) (i) What is a multimedia database? Explain the methods of mining multimedia databases.
A multimedia database is a database that hosts one or more primary media file types such as .txt (documents), .jpg (images), .swf (videos), .mp3 (audio), etc. These loosely fall into three main categories:
Static media (time-independent, e.g. images and handwriting)
Dynamic media (time-dependent, e.g. video and sound bytes)
Dimensional media (e.g. 3D games or computer-aided drafting programs, CAD)
All primary media files are stored as binary strings of zeros and ones and are encoded according to file type. The term "data" is typically used from the computer's point of view, whereas the term "multimedia" is used from the user's point of view.
Types of multimedia databases
There are numerous types of multimedia databases, including the following.
The authentication multimedia database (also known as a verification multimedia database, e.g. retina scanning) performs a 1:1 data comparison.
The identification multimedia database performs a one-to-many data comparison (e.g. passwords and personal identification numbers).
A newly emerging type is the biometrics multimedia database, which specializes in automatic human verification based on algorithms over a person's behavioral or physiological profile. This method of identification is superior to traditional methods requiring the input of personal identification numbers and passwords, because the person being identified does not need to be physically present where the identification check takes place, and the person being scanned does not need to remember a PIN or password. Fingerprint identification technology is also based on this type of multimedia database.
Or
(b) (i) Discuss the social impacts of data mining.
(ii) Discuss spatial data mining.