DWDM
PART-A
1. Write down the applications of data warehousing.
2. When is data mart appropriate?
3. What is a concept hierarchy? Give an example.
4. What are the various forms of data preprocessing?
Data cleaning, data integration, data transformation, and data reduction.

Data cleaning
- fill in missing values
- smooth noisy data
- identify outliers
- correct data inconsistencies

Data integration
- combines data from multiple sources to form a coherent data store.
- Metadata, correlation analysis, data conflict detection, and resolution of semantic heterogeneity contribute towards smooth data integration.

Data transformation
- converts the data into forms appropriate for mining.
- E.g. attribute data may be normalized to fall within a small range such as 0.0 to 1.0.

Data reduction
- obtains a reduced representation of the data while minimizing the loss of information content.
- Techniques include data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization.
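The min-max normalization mentioned under data transformation can be sketched as follows (a minimal Python illustration; the function name is ours):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

incomes = [12000, 73600, 98000]
print(min_max_normalize(incomes))  # every value now lies in [0.0, 1.0]
```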
5. Write the two measures of an association rule.
Support and confidence.
6. Define conditional pattern base.
It is a small database of patterns that co-occur with an item.
7. List out the major strength of decision tree method.
8. Distinguish between classification and clustering.
9. Define a spatial database.
A spatial database is a database that is optimized to store and query data related to
objects in space, including points, lines, and polygons. While typical databases can
handle various numeric and character data types, additional functionality needs to
be added for databases to process spatial data types, typically called geometries
or features. The Open Geospatial Consortium (OGC) created the Simple Features
specification and sets standards for adding spatial functionality to database systems.
Features of spatial databases
Database systems use indexes to quickly look up values, but the way most databases
index data is not optimal for spatial queries. Spatial databases therefore use a
spatial index to speed up database operations.
In addition to typical SQL queries such as SELECT statements, spatial databases can
perform a wide variety of spatial operations. The following query types and many more
are supported by the Open Geospatial Consortium:
- Spatial measurements: find the distance between points, the area of a polygon, etc.
- Spatial functions: modify existing features to create new ones, for example by providing a buffer around them, intersecting features, etc.
- Spatial predicates: allow true/false queries such as "is there a residence located within a mile of the area where we are planning to build the landfill?"
- Constructor functions: create new features with an SQL query specifying the vertices (points or nodes) which can make up lines. If the first and last vertex of a line are identical, the feature can also be of the type polygon (a closed line).
- Observer functions: queries which return specific information about a feature, such as the location of the center of a circle.
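The measurement and predicate query types above can be sketched in plain Python; a real spatial database would expose these as SQL functions over indexed geometry columns, but the underlying logic is the same:

```python
import math

def distance(p, q):
    """Spatial measurement: Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def within(p, q, radius):
    """Spatial predicate: is point p within `radius` units of point q?"""
    return distance(p, q) <= radius

# Hypothetical example: a planned landfill site and a nearby residence.
landfill = (0.0, 0.0)
residence = (3.0, 4.0)
print(distance(landfill, residence))     # 5.0
print(within(landfill, residence, 1.0))  # False: no residence within 1 unit
```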
10. List out any two commercial data mining tools.
PART-B
11.(a) (i) With a neat sketch explain the architecture of a data warehouse
(ii) Discuss the typical OLAP operations with an example.
Or
(b) (i) Discuss how computations can be performed efficiently on data cubes.
(ii) Write short notes on data warehouse meta data.
12.(a) (i) Explain various methods of data cleaning in detail.
(ii) Give an account on data mining query languages.
- DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
- Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
- MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
Or
(b) How is Attribute-Oriented Induction implemented? Explain in detail.

Attribute-oriented induction (AOI) uses concept hierarchies to discover hidden
patterns in a huge amount of data and presents the concise patterns as a general
description of the original data. It is an effective data analysis and data reduction
technique. However, the construction of concept hierarchies for numeric attributes is
sometimes subjective, and the generalization of border values near the cut points
of discretization can easily result in misconceptions.
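The core generalization step of AOI can be sketched as follows, assuming a small hypothetical concept hierarchy mapping city values up to their country; tuples that become identical after generalization are merged and their counts accumulated:

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country (illustrative values only).
hierarchy = {"Vancouver": "Canada", "Toronto": "Canada",
             "Seattle": "USA", "Chicago": "USA"}

def generalize(tuples, attr_index, hierarchy):
    """Climb one level of the concept hierarchy on one attribute,
    then merge duplicate generalized tuples, keeping a count."""
    merged = Counter()
    for t in tuples:
        g = list(t)
        g[attr_index] = hierarchy.get(g[attr_index], g[attr_index])
        merged[tuple(g)] += 1
    return merged

data = [("Vancouver", "M.Sc."), ("Toronto", "M.Sc."), ("Seattle", "Ph.D.")]
print(generalize(data, 0, hierarchy))
# ("Canada", "M.Sc.") occurs twice; ("USA", "Ph.D.") once
```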
13. (a) Write and explain the algorithm for mining frequent item sets without
candidate generation. Give a relevant example.
Or
(b) Discuss the approaches for mining multi-level association rules from
transactional databases. Give a relevant example.
- Items often form a hierarchy.
- Items at the lower levels are expected to have lower support.
- Rules regarding itemsets at appropriate levels can be quite useful.
- The transaction database can be encoded based on dimensions and levels.
- We can explore shared multi-level mining.

Mining multi-level associations is a top-down, progressive deepening approach:
- First find high-level strong rules:
  milk -> bread [20%, 60%]
- Then find their lower-level "weaker" rules:
  2% milk -> wheat bread [6%, 50%]

Variations on mining multiple-level association rules:
- Level-crossed association rules:
  2% milk -> Wonder wheat bread
- Association rules with multiple, alternative hierarchies:
  2% milk -> Wonder bread
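The progressive-deepening idea above can be illustrated with a toy transaction list: support is computed once at the high level (milk -> bread) and once at a lower level (2% milk -> wheat bread). The items, hierarchy, and percentages here are our own toy data, not the figures quoted in the text:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """conf(A -> B) = support(A union B) / support(A)."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

# Toy transactions, encoded at the lowest hierarchy level.
txns = [{"2% milk", "wheat bread"},
        {"2% milk", "white bread"},
        {"skim milk", "wheat bread"},
        {"skim milk", "butter"}]

# High-level view: map each item up to its parent concept.
parent = {"2% milk": "milk", "skim milk": "milk",
          "wheat bread": "bread", "white bread": "bread", "butter": "butter"}
high = [{parent[i] for i in t} for t in txns]

print(support(high, {"milk", "bread"}))           # 0.75: high-level support
print(support(txns, {"2% milk", "wheat bread"}))  # 0.25: lower level is weaker
print(confidence(high, {"milk"}, {"bread"}))      # 0.75
```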
14. (a) (i) Explain the algorithm for constructing a decision tree from training samples.
(ii) Explain Bayes theorem.
In probability theory, Bayes' theorem (often called Bayes' law, after the Rev. Thomas
Bayes) relates the conditional and marginal probabilities of two random events. It is
often used to compute posterior probabilities given observations. For example, a patient
may be observed to have certain symptoms; Bayes' theorem can then be used to compute the
probability that a proposed diagnosis is correct, given that observation.
Bayes' theorem states that judgements should be influenced by two main factors: the
base rate and the likelihood ratio.
As a formal theorem, Bayes' theorem is valid in all common interpretations of
probability. However, it plays a central role in the debate around the foundations of
statistics: frequentist and Bayesian interpretations disagree about the ways in which
probabilities should be assigned in applications. Frequentists assign probabilities to
random events according to their frequencies of occurrence or to subsets of populations
as proportions of the whole, while Bayesians describe probabilities in terms of beliefs and
degrees of uncertainty. The articles on Bayesian probability and frequentist probability
discuss these debates in greater detail.
Bayes' theorem relates the conditional and marginal probabilities of events A and B,
where B has a non-vanishing probability:

    P(A|B) = P(B|A) P(A) / P(B)

Each term in Bayes' theorem has a conventional name:
- P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.
- P(A|B) is the conditional probability of A given B. It is also called the posterior probability because it is derived from, or depends upon, the specified value of B.
- P(B|A) is the conditional probability of B given A.
- P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
Intuitively, Bayes' theorem in this form describes the way in which one's beliefs about
observing 'A' are updated by having observed 'B'.
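The diagnosis example can be checked numerically, using made-up figures (1% disease prevalence, 90% sensitivity, 5% false-positive rate; all values illustrative):

```python
def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded by total probability."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# A = patient has the disease, B = test is positive.
posterior = bayes(p_a=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(round(posterior, 3))  # 0.154: a positive test still leaves only ~15%
```

Note how the low base rate P(A) dominates: even a fairly accurate test yields a modest posterior when the condition is rare.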
Or
(b)Explain the following clustering methods in detail:
(i) BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is an incremental
and hierarchical clustering algorithm for large databases. The strongest point of the
BIRCH algorithm is its support for very large databases (where main memory is smaller
than the size of the database).
There are two main building components in the BIRCH algorithm:
1. a hierarchical clustering component, and
2. a main-memory structure component.
We will visit each component in detail and give a conceptual idea of how BIRCH
clustering works.
The idea of hierarchical clustering is illustrated in Figure 1. The algorithm starts
with single-point clusters (every point in the database is a cluster, cf. Figure 1(a)).
Then it groups the closest points into separate clusters (Figure 1(b)), and continues
until only one cluster remains (Figure 1(c)). The computation of the clusters is done
with the help of a distance matrix (O(n^2) in size) and takes O(n^2) time.
Figure 1: The idea of hierarchical clustering: (a) the dataset, (b) first step, (c) second step, (d) last step.
BIRCH uses a main-memory (limited-size) data structure called the CF tree. The tree is
organized in such a way that (i) the leaves contain the actual clusters, and (ii) the
size of any cluster in a leaf is no larger than R. An example of the CF tree is
illustrated in Figure 2. Initially, all data points are in one cluster. As the data
arrives, a check is made whether the size of the cluster exceeds R (cf. Figures
2(a)-(b)). If the cluster grows too big, it is split into two clusters and the points
are redistributed (Figure 2(c)). New points are then inserted into the cluster which
enlarges less (cf. Figure 2(d)). At each node of the tree, the CF tree keeps the mean
of the cluster and the mean of the sum of squares, so that the size of the clusters can
be computed efficiently. The tree structure also depends on a branching parameter T,
which determines the maximum number of children each node can have.
Figure 2: The idea of the CF tree: (a) first step, (b) second step, (c) third step, (d) fourth step.
The BIRCH algorithm starts with a dataset and tries to guess a cluster size R such that
the tree fits in main memory. If the tree does not fit into main memory, it increases R
(allowing larger, and hence fewer, clusters) and rebuilds the tree. The process is
repeated until the tree fits into main memory. The BIRCH algorithm can also include a
number of post-processing phases to remove outliers and improve the clustering.
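The mean and sum-of-squares bookkeeping described above is what makes BIRCH incremental: a cluster can be summarized by the triple (count, linear sum, sum of squares), from which the centroid and a radius-like size measure follow without revisiting the points. A simplified 1-D sketch (the class name CF is ours):

```python
import math

class CF:
    """Clustering feature for a 1-D cluster: count, linear sum, sum of squares."""
    def __init__(self):
        self.n, self.ls, self.ss = 0, 0.0, 0.0

    def add(self, x):
        """Absorb one point incrementally; no stored points are needed."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Root-mean-squared distance of the points from the centroid."""
        return math.sqrt(max(self.ss / self.n - self.centroid() ** 2, 0.0))

cf = CF()
for x in [1.0, 2.0, 3.0]:
    cf.add(x)
print(cf.centroid())  # 2.0
print(cf.radius())    # sqrt(14/3 - 4) = 0.816...
```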
(ii) CURE
CURE is an efficient clustering algorithm for large databases that is more robust to
outliers and identifies clusters having non-spherical shapes and wide variances in size.
To avoid the problems with non-uniformly sized or shaped clusters, CURE employs a
hierarchical clustering algorithm that adopts a middle ground between the centroid-based
and all-points extremes. In CURE, a constant number c of well-scattered points of a
cluster are chosen, and they are shrunk towards the centroid of the cluster by a
fraction α. The scattered points after shrinking are used as representatives of the
cluster. The clusters with the closest pair of representatives are the clusters that
are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE
to correctly identify the clusters and makes it less sensitive to outliers.
The algorithm is given below.
The running time of the algorithm is O(n^2 log n) and its space complexity is O(n).
The algorithm cannot be directly applied to large databases, so the following
enhancements are used.
Random sampling: to handle large data sets, we do random sampling and draw a sample
data set. Generally the random sample fits in main memory. Because of the random
sampling, there is a trade-off between accuracy and efficiency.
Partitioning for speed-up: the basic idea is to partition the sample space into p
partitions. In the first pass, each partition is partially clustered until its number
of clusters reduces to n/(pq) for some constant q >= 1. A second clustering pass is
then run on the n/q partial clusters from all the partitions. For the second pass we
only store the representative points, since the merge procedure only requires the
representative points of previous clusters before computing the new representative
points for the merged cluster. The advantage of partitioning the input is that it
reduces the execution time.
Labeling data on disk: since we only have representative points for the k clusters,
the remaining data points must also be assigned to clusters. For this, a fraction of
randomly selected representative points for each of the k clusters is chosen, and each
data point is assigned to the cluster containing the representative point closest to it.
Pseudocode
CURE(S, k)
Input: a set of points S
Output: k clusters
1. For every cluster u (initially, each input point), u.mean and u.rep store the mean
of the points in the cluster and a set of c representative points of the cluster
(initially c = 1, since each cluster has one data point). u.closest stores the cluster
closest to u.
2. Insert all the input points into a k-d tree T.
3. Treat each input point as a separate cluster, compute u.closest for each u, and
insert each cluster into the heap Q (clusters are arranged in increasing order of the
distance between u and u.closest).
4. While size(Q) > k, repeat steps 5 to 7:
5. Remove the top element of Q (say u), merge it with its closest cluster u.closest
(say v), and compute the new representative points for the merged cluster w. Remove
u and v from T and Q.
6. For all clusters x in Q, update x.closest and relocate x.
7. Insert w into Q.
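The shrinking of representative points towards the centroid (by the fraction α above) can be sketched in Python; the selection of the c well-scattered points is omitted here, only the shrink step is shown:

```python
def centroid(points):
    """Centroid of a set of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def shrink(points, alpha):
    """Shrink each representative towards the cluster centroid by fraction alpha."""
    cx, cy = centroid(points)
    return [(x + alpha * (cx - x), y + alpha * (cy - y)) for x, y in points]

# Toy cluster: four corner points; with alpha = 0.5 the representatives
# move halfway towards the centroid (1.0, 1.0), damping the effect of outliers.
reps = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(shrink(reps, 0.5))  # [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
```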
15.(a) (i) What is a multimedia database? Explain the methods of mining multimedia
databases.
A multimedia database is a database that hosts one or more primary media file types,
such as .txt (documents), .jpg (images), .swf (videos), .mp3 (audio), etc. These
loosely fall into three main categories:
- Static media (time-independent, e.g. images and handwriting)
- Dynamic media (time-dependent, e.g. video and sound bytes)
- Dimensional media (e.g. 3D games or computer-aided drafting (CAD) programs)
All primary media files are stored as binary strings of zeros and ones, and are encoded
according to file type.
The term "data" is typically referenced from the computer's point of view, whereas the
term "multimedia" is referenced from the user's point of view.
Types of Multimedia Databases
There are numerous different types of multimedia databases, including:
- The Authentication Multimedia Database (also known as a Verification Multimedia Database, e.g. retina scanning), which performs a 1:1 data comparison.
- The Identification Multimedia Database, which performs a one-to-many data comparison (e.g. passwords and personal identification numbers).
- A newly emerging type, the Biometrics Multimedia Database, which specializes in automatic human verification based on algorithms over a behavioural or physiological profile.
This method of identification is superior to traditional multimedia database methods
requiring the typical input of personal identification numbers and passwords, since
with those the person being identified does not need to be physically present where the
identification check takes place. It also removes the need for the person being scanned
to remember a PIN or password. Fingerprint identification technology is also based on
this type of multimedia database.
Or
(b) (i) Discuss the social impacts of data mining.
(ii) Discuss spatial data mining.