PROGRAM: MSc in Bioinformatics
SEMESTER: 4
SUBJECT: MBI403 - DATA WAREHOUSING AND DATA MINING
BOOK ID: B1633
SESSION: Winter 2015
Q 1 (Total Marks: 10)
Explain the Top-Down and Bottom-up data warehouse development
methodologies.
( Unit 1 ; Section 1.5 )
A 1
Top-down and Bottom-up data warehouse development methodologies
Although Data Warehouses can be designed in a number of different ways, they
all share several important characteristics.
• Most Data Warehouses are Subject Oriented. This means that the information
in the Data Warehouse is stored in a way that allows it to be connected to
objects or events that occur in reality.
• Another characteristic frequently seen in Data Warehouses is that they are
Time Variant.
• A Time Variant Data Warehouse allows changes in the information to be
monitored and recorded over time.
• Data Warehouses are also Integrated: data from all the programs used by a
particular institution is stored in the Data Warehouse in an integrated form.
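The Time Variant characteristic can be illustrated with a minimal sketch. It assumes an append-only store in which every change is recorded as a new version instead of overwriting the old one; the class names, the record layout, and the sample subject are hypothetical and not taken from the course material.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class FactVersion:
    """One recorded state of a subject (hypothetical record layout)."""
    subject: str       # the real-world object the row is about
    value: float       # the measured attribute
    valid_from: date   # the date this state became current


class TimeVariantStore:
    """Append-only store: new states are added, old ones are never overwritten."""

    def __init__(self) -> None:
        self._versions: List[FactVersion] = []

    def record(self, subject: str, value: float, valid_from: date) -> None:
        self._versions.append(FactVersion(subject, value, valid_from))

    def history(self, subject: str) -> List[FactVersion]:
        """All recorded states of a subject, oldest first."""
        rows = [v for v in self._versions if v.subject == subject]
        return sorted(rows, key=lambda v: v.valid_from)

    def as_of(self, subject: str, when: date) -> Optional[float]:
        """Value that was current on the given date, or None if none existed."""
        current = None
        for version in self.history(subject):
            if version.valid_from <= when:
                current = version.value
        return current


store = TimeVariantStore()
store.record("customer-42 credit limit", 1000.0, date(2014, 1, 1))
store.record("customer-42 credit limit", 1500.0, date(2015, 1, 1))
```

Because both versions are kept, `store.as_of(...)` can answer what the value was on any past date, which an operational system that overwrites in place could not do.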
Q 2 (Total Marks: 10)
Explain the functionalities and advantages of data warehouses.
( Unit 1 ; Section 1.3 )
A 2
Functionalities
Data Warehouses provide the following functionality:
• Roll-up: Data is summarized with increased generalization.
• Drill-down: Increasing levels of detail are revealed.
• Pivot: Cross tabulation (that is, rotation) is performed.
• Slice and Dice: Projection operations are performed on the dimensions.
• Sorting: Data is sorted by ordinal value.
• Selection: Data is available by value or range.
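The operations listed above can be sketched on a tiny in-memory fact table. This is only an illustration under stated assumptions: the rows, the dimension names, and the helper functions are hypothetical, and slicing is shown as fixing one dimension to a single value, which is the common OLAP reading and goes slightly beyond the wording of the answer.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales)
FACTS = [
    (2014, "Q1", "North", "A", 10),
    (2014, "Q1", "South", "A", 20),
    (2014, "Q2", "North", "B", 30),
    (2015, "Q1", "North", "A", 40),
    (2015, "Q1", "South", "B", 50),
]
DIMS = ("year", "quarter", "region", "product")


def aggregate(rows, by):
    """Sum sales grouped by the named dimensions."""
    idx = [DIMS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[4]
    return dict(totals)


def slice_(rows, dim, value):
    """Keep only the rows where one dimension has the given value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]


def select_range(rows, low, high):
    """Selection by range on the sales measure."""
    return [r for r in rows if low <= r[4] <= high]


def pivot(rows, row_dim, col_dim):
    """Cross-tabulate sales: row_dim values down, col_dim values across."""
    table = defaultdict(dict)
    for (r, c), total in aggregate(rows, (row_dim, col_dim)).items():
        table[r][c] = total
    return dict(table)


drill_down = aggregate(FACTS, ("year", "quarter"))   # finer level of detail
roll_up = aggregate(FACTS, ("year",))                # summarized per year
north_only = slice_(FACTS, "region", "North")        # slice on one dimension
cross_tab = pivot(FACTS, "region", "year")           # rotation / cross tabulation
by_sales = sorted(FACTS, key=lambda r: r[4], reverse=True)  # sorting
mid_range = select_range(FACTS, 20, 40)              # selection by range
```

Roll-up and drill-down use the same `aggregate` helper; only the number of grouping dimensions changes, which is the sense in which one summarizes and the other reveals detail.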
Ver : MScBI_1308
Advantages
• A Data Warehouse provides a common data model for data, regardless of the
data source. This makes it easier to report and analyze information than it would
be if multiple data models from disparate sources were used to retrieve
information such as sales invoices, order receipts, general ledger charges, etc.
• Prior to loading data into the Data Warehouse, inconsistencies are identified
and resolved. This greatly simplifies reporting and analysis.
• Information in the Data Warehouse is under the control of Data Warehouse
users so that, even if the source system data is purged over time, the information
in the warehouse can be stored safely for extended periods of time.
• Because they are separate from operational systems, Data Warehouses
provide fast retrieval of data without slowing down operational systems.
• Data Warehouses facilitate Decision Support System applications such as
trend reports.
Q 3 (Total Marks: 10)
Describe Hyper cube and Multicube.
( Unit 6 ; Section 6.6 )
A 3
Hypercube and Multicube
• Multidimensional databases can present their data to an application using two
types of cubes: hypercubes and multicubes.
• A hypercube is a cube with more than three dimensions, for example four.
• In the hypercube model, as shown in the following illustration, all data appears
logically as a single cube.
• This intuitive representation is a hypercube, a representation that
accommodates more than three dimensions.
• At a simpler level, a hypercube can also accommodate just three
dimensions...
• Multicube: In the multicube model, data is segmented into a set of smaller
cubes, each of which is composed of a subset of the available dimensions. This
means the data can be viewed through several cubes, each with its own
combination of dimensions.
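The difference between the two models can be sketched with plain dictionaries. This is an illustrative layout only: the dimension names, the measures, and the two helper functions are hypothetical, and real multidimensional databases store cubes very differently.

```python
# Hypothetical dimensions shared by the examples below.
ALL_DIMS = ("product", "region", "year", "channel")

# Hypercube: one logical cube; every cell is addressed by ALL dimensions.
hypercube = {
    # (product, region, year, channel) -> sales
    ("A", "North", 2014, "web"): 10,
    ("A", "North", 2014, "store"): 5,
    ("B", "South", 2015, "web"): 7,
}

# Multicube: several smaller cubes, each over a subset of the dimensions.
multicube = {
    "sales": {
        "dims": ("product", "region", "year"),
        "cells": {("A", "North", 2014): 15, ("B", "South", 2015): 7},
    },
    "headcount": {
        "dims": ("region", "year"),
        "cells": {("North", 2014): 12, ("South", 2015): 9},
    },
}


def hypercube_total(cube, **fixed):
    """Sum the cells whose coordinates match the fixed dimension values."""
    total = 0
    for coords, value in cube.items():
        cell = dict(zip(ALL_DIMS, coords))
        if all(cell[d] == v for d, v in fixed.items()):
            total += value
    return total


def multicube_lookup(cubes, name, **coords):
    """Read one cell from the named sub-cube using only that cube's dimensions."""
    cube = cubes[name]
    key = tuple(coords[d] for d in cube["dims"])
    return cube["cells"].get(key)
```

In the hypercube every query ranges over the same four-dimensional key, while in the multicube the `headcount` cube never mentions `product` or `channel` at all.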
Q 4 (Total Marks: 10)
List and explain the strategies for data reduction.
( Unit 10 ; Section 10.5 )
A 4
The strategies for data reduction:
1) Data cube aggregation, where aggregation operations are applied to the data
in the construction of a data cube.
2) Dimension reduction, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
3) Data compression, where encoding mechanisms are used to reduce the
data set size.
4) Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models.
5) Discretization and concept hierarchy generation, where raw data values
for attributes are replaced by ranges or higher conceptual levels. Concept
hierarchies allow the mining of data at multiple levels of abstraction and are a
powerful tool for data mining.
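Each of the five strategies can be shown in a few lines. The sketch below is illustrative only and rests on several assumptions: the sample data is invented, dropping attributes that never vary is a crude stand-in for a real relevance test, zlib stands in for "an encoding mechanism", and a least-squares line stands in for "a parametric model".

```python
import zlib
from collections import defaultdict

# 1) Data cube aggregation: quarterly sales rolled up to yearly totals.
quarterly = {(2014, "Q1"): 10, (2014, "Q2"): 20, (2015, "Q1"): 30, (2015, "Q2"): 40}
yearly = defaultdict(int)
for (year, _quarter), sales in quarterly.items():
    yearly[year] += sales

# 2) Dimension reduction: drop attributes that never vary across records.
records = [
    {"age": 23, "country": "IN", "income": 30},
    {"age": 41, "country": "IN", "income": 52},
    {"age": 67, "country": "IN", "income": 48},
]
informative = [k for k in records[0] if len({r[k] for r in records}) > 1]
reduced = [{k: r[k] for k in informative} for r in records]

# 3) Data compression: a lossless encoding of the raw bytes.
raw = ("2014,Q1,10\n" * 200).encode("ascii")
packed = zlib.compress(raw)

# 4) Numerosity reduction: keep two regression parameters instead of n points.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x


# 5) Discretization and concept hierarchy: raw age -> range -> higher concept.
def age_range(age):
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def age_concept(age):
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


ages = [(r["age"], age_range(r["age"]), age_concept(r["age"])) for r in records]
```

The three-level tuple in `ages` is the concept hierarchy in miniature: the same attribute can be mined as a raw value, as a range, or as a named category.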
Q 5 (Total Marks: 10)
Describe K-means method for clustering. List its advantages and drawbacks.
( Unit 12 ; Section 12.4 )
A 5
K-means method for clustering (5 marks)
K-means is one of the simplest unsupervised learning algorithms that solve the
well-known clustering problem.
• The procedure classifies a given data set into a certain number of clusters
(assume k clusters) fixed a priori.
• The main idea is to define k centroids, one for each cluster.
• The basic step of k-means clustering is simple.
• In the beginning, we determine the number of clusters K and assume the
centroid or center of each of these clusters.
• Each object is then assigned to its nearest centroid, the centroids are
recomputed as the mean of their assigned objects, and the two steps are
repeated until the assignments no longer change.
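The procedure can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, takes the first k points as the initial centroids (the answer only says the centroids are "assumed"), and the function name and sample points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Basic k-means on tuples of floats.

    Assumption: the first k points serve as the initial centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
```

Because each centroid is a mean, a single distant point added to `points` would shift the centroid of whichever cluster absorbs it, so results are sensitive to outliers and to the choice of initial centroids.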
Advantages and drawbacks (5 marks)
Advantages:
• With a large number of variables, K-Means may be computationally faster than
hierarchical clustering (if K is small).
• K-Means may produce tighter clusters than hierarchical clustering, especially if
the clusters are globular.
Drawbacks:
• It does not do well with overlapping clusters.
• The clusters are easily pulled off-center by outliers.
• Each record is either inside or outside of a given cluster; there is no partial
membership.
Q 6 (Total Marks: 10)
Describe multilevel databases and web query systems.
( Unit 13 ; Section 13.5 )
A 6
Multilevel Databases (5 marks)
Several researchers have proposed a multilevel database approach to organizing
Web-based information.
• The main idea behind these proposals is that the lowest level of the database
contains primitive semi-structured information stored in various web repositories,
such as hypertext documents.
Web Query Systems (5 marks)
Many web-based query systems and languages have been developed recently
that attempt to utilize standard database query languages such as SQL,
structural information about web documents, and even natural language
processing to accommodate the types of queries that are used in World Wide
Web searches.
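As a rough illustration of the idea, the sketch below uses Python's built-in sqlite3 module to combine a content condition with a structural condition (a hyperlink) in one SQL query. It does not reproduce any particular web query language; the table layout, the example.org URLs, and the page contents are all hypothetical.

```python
import sqlite3

# In-memory database holding a tiny, made-up description of three web pages.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document (url TEXT PRIMARY KEY, title TEXT, text TEXT);
    CREATE TABLE link (src TEXT, dst TEXT, anchor TEXT);
    """
)
conn.executemany(
    "INSERT INTO document VALUES (?, ?, ?)",
    [
        ("http://example.org/a", "Data mining notes", "clustering and k-means"),
        ("http://example.org/b", "Warehouse design", "star schema and OLAP"),
        ("http://example.org/c", "Home", "links to course pages"),
    ],
)
conn.executemany(
    "INSERT INTO link VALUES (?, ?, ?)",
    [
        ("http://example.org/c", "http://example.org/a", "mining"),
        ("http://example.org/c", "http://example.org/b", "warehousing"),
    ],
)

# Content condition (the text mentions "clustering") combined with a
# structural condition (the page is linked from the page titled "Home").
rows = conn.execute(
    """
    SELECT d.url, d.title
    FROM document AS d
    JOIN link AS l ON l.dst = d.url
    JOIN document AS home ON home.url = l.src
    WHERE home.title = 'Home' AND d.text LIKE '%clustering%'
    """
).fetchall()
```

The join on `link` is what distinguishes this from a plain keyword search: the query uses the hyperlink structure between documents as well as their text.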