ANURAG Group Of Institutions
(Formerly CVSR College of Engineering)
VENKATAPUR (V), GHATKESAR (M), R.R Dist,
Course Code:
Year / Semester: III-yr II-SEM
Course Title: Data Warehousing and Data Mining
Course Time: 2014-2015
Time Table:
Periods: 9:00-9:50, 9:50-10:40, 10:40-11:30, 11:30-12:20, 1:10-2:00, 2:00-2:50, 2:50-3:40
Days: MON, TUE, WED, THU, FRI, SAT
DWDM: five periods per week
Required Text Books:
• Data Mining – Concepts and Techniques - Jiawei Han & Micheline Kamber, Harcourt India.
• Introduction to Data Mining - Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Pearson Education.
Course Objectives:
• To familiarize students with the concepts and architectural types of data warehouses.
• To provide efficient design and management of data storage using data warehousing and OLAP.
• To understand the fundamental processes, concepts and techniques of data mining.
• To consistently apply knowledge of current data mining research and how it may contribute to the effective design and implementation of data mining applications.
• To provide advanced research skills through the investigation of data-mining literature.
• To develop an appreciation for the inherent complexity of the data-mining task.
Department of Information Technology
Course Outcomes:
• Understand the concepts and architectural types of data warehouses, and design and manage data storage efficiently using data warehousing and OLAP.
• Understand the fundamental processes, concepts and techniques of data mining.
• Apply knowledge of current data mining research and how it may contribute to the effective design and implementation of data mining applications.
• Identify different research skills through the investigation of data-mining literature.
• Appreciate the inherent complexity of the data-mining task.
Evaluation Methodology:
S.No  Method of Evaluation  Examination Dates  Marks  Remarks
1.    Internal Exam-I                          20
2.    Internal Exam-II                         20
3.    Assignment-I                             5
4.    Assignment-II                            5
5.    External Exam                            75
Note:
H&K: Data Mining – Concepts and Techniques - Jiawei Han & Micheline Kamber, Harcourt
BB: Black Board.
PPT: Power Point Presentation.
DATA WAREHOUSING & DATA MINING SYLLABUS
UNIT I:
DATA WAREHOUSING : Data Warehouse and OLAP Technology for Data Mining: Data Warehouse,
Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From
Data Warehousing to Data Mining, OLAP.
UNIT II:
DATA MINING: Introduction – Data – Types of Data – Data Mining Functionalities – Classification of
Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data
Warehouse – Issues – Data Preprocessing.
UNIT III:
ASSOCIATION RULE MINING AND CLASSIFICATION
Mining Frequent Patterns, Associations and Correlations – Efficient and Scalable Frequent Itemset
Mining Methods – Mining Various Kinds of Association Rules – Correlation Analysis – Constraint Based
Association Mining.
Classification and Prediction - Basic Concepts - Decision Tree Induction - Bayesian Classification – Rule
Based Classification – Classification by Back propagation – Support Vector Machines – Associative
Classification – Lazy Learners – Other Classification Methods – Prediction, Accuracy and Error
measures, Evaluating the accuracy of a Classifier or a Predictor, Ensemble Methods.
UNIT IV:
CLUSTERING IN DATA MINING: Cluster Analysis - Types of Data – Categorization of Major
Clustering Methods - K-means – Partitioning Methods – Hierarchical Methods - Density-Based Methods –
Grid-Based Methods – Model-Based Clustering Methods – Clustering High Dimensional Data – Constraint-Based Cluster Analysis – Outlier Analysis.
UNIT V:
APPLICATIONS AND TRENDS IN DATA MINING: Data Mining Applications, Data Mining
System Products and Research Prototypes, Additional Themes on Data Mining and Social Impacts of
Data Mining.
TEXT BOOKS:
1. Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, Second Edition,
Elsevier, 2007.
2. Alex Berson and Stephen J. Smith, “Data Warehousing, Data Mining & OLAP”, Tata McGraw-Hill
Edition, Tenth Reprint 2007.
UNIT-I: DATA WAREHOUSING :
Syllabus:

Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional
Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data
Warehousing to Data Mining, OLAP.

Objectives:
This unit introduces data warehouses, OLAP and data generalization. It presents the
basic concepts, architectures and general implementations of data warehouses, and the
relationship between data warehousing and data mining. It then studies methods of data
cube computation in detail, including the OLAP methods, and explores further
developments of data warehouse and OLAP technology. Attribute-oriented induction, an
alternative method for data generalization and concept description, is also discussed.

Micro Plan
S.No  Topics                                                References  Teaching Methodology  Number of classes
1.    Data Warehouse                                        H&K         BB/PPT                1
2.    Multidimensional Data Model                           H&K         BB/PPT                1
3.    Data Warehouse Architecture                           H&K         BB/PPT                1
4.    Data Warehouse Implementation                         H&K         BB/PPT                1
5.    Further Development of Data Cube Technology           H&K         BB/PPT                1
6.    From Data Warehousing to Data Mining                  H&K         BB/PPT                1
7.    Efficient Methods for Data Cube Computation           H&K         BB/PPT                1
8.    Further Development of Data Cube and OLAP Technology  H&K         BB/PPT                1
Total number of classes: 8

Assignment Questions
1. Briefly compare the following concepts. You may use an example to explain your point(s).
(a) Snowflake schema, fact constellation, starnet query model
(b) Data cleaning, data transformation, refresh
(c) Enterprise warehouse, data mart, virtual warehouse.
2. A data warehouse can be modeled by either a star schema or a snowflake schema. Briefly
describe the similarities and the differences of the two models, and then analyze their advantages
and disadvantages with regard to one another. Give your opinion of which might be more
empirically useful and state the reasons behind your answer.
3. What are the differences between the three main types of data warehouse usage: information
processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining
(OLAM).
4. Explain the development of data cube and OLAP technology.
Unit-II: DATA MINING

Syllabus:
Introduction – Data – Types of Data – Data Mining Functionalities – Classification of Data
Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a
Data Warehouse – Issues –Data Preprocessing.

Objectives:
The first half of this unit provides an introduction to the multidisciplinary field of data
mining and discusses the evolutionary path of database technology. It examines the
various types of data to be mined. The second half introduces techniques for
preprocessing the data before mining which includes the use of concept hierarchies for
dynamic and static discretization. The automatic generation of concept hierarchies is also
described.

Micro Plan
S.No  Topics                                               References  Teaching Methodology
1.    Fundamentals of data mining                          H&K         BB/PPT
2.    Data Mining Functionalities                          H&K         BB/PPT
3.    Classification of Data Mining systems                H&K         BB/PPT
4.    Data Mining Task Primitives                          H&K         BB/PPT
5.    Integration of Database or a Data Warehouse System   H&K         BB/PPT
6.    Major issues in Data Mining                          H&K         BB/PPT
7.    Needs for Preprocessing the Data                     H&K         BB/PPT
8.    Data Cleaning, Data Integration                      H&K         BB/PPT
9.    Data Reduction, Data Transformation                  H&K         BB/PPT
10.   Discretization and Concept Hierarchy Generation      H&K         BB/PPT
Number of classes (as listed): 1, 1, 1, 2, 1, 1, 1, 1
Total number of classes: 9
Assignment Questions:
1. What is data mining? In your answer, address the following:
(a) Is it another hype?
(b) Is it a simple transformation of technology developed from databases, statistics, and machine
learning?
(c) Explain how the evolution of database technology led to data mining.
(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
2. Present an example where data mining is crucial to the success of a business. What data mining
functions does this business need? Can they be performed alternatively by data query processing
or simple statistical analysis?
3. Based on your observation, describe another possible kind of knowledge that needs to be
discovered by data mining methods but has not been listed in this chapter. Does it require a mining
methodology that is quite different from those outlined in this chapter?
4. What are the major challenges of mining a huge amount of data (such as billions of tuples) in
comparison with mining a small amount of data (such as a few hundred tuple data set)?
5. Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile-quantile plot different from a quantile plot?
6. Discuss issues to consider during data integration.
7. Data quality can be assessed in terms of accuracy, completeness, and consistency. Propose two
other dimensions of data quality.
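The summary statistics asked for in question 5 can be checked with a short script. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative, and the quartiles rely on the module's default "exclusive" method, which is only one of several conventions, so values of Q1 and Q3 worked out by hand may differ slightly.

```python
import statistics

# Age values from question 5, already in increasing order.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
        33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)
median = statistics.median(ages)
modes = statistics.multimode(ages)        # every value tied for the highest count
midrange = (min(ages) + max(ages)) / 2

# Quartile cut points; the default method is "exclusive".
q1, q2, q3 = statistics.quantiles(ages, n=4)

# Minimum, Q1, median, Q3, maximum.
five_number_summary = (min(ages), q1, median, q3, max(ages))

print(f"mean={mean:.2f} median={median} modes={modes} midrange={midrange}")
print("five-number summary:", five_number_summary)
```

Two values tie for the highest count here, so the data is bimodal. A boxplot for part (f) can be drawn directly from the five-number summary.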
Unit-III: ASSOCIATION RULE MINING AND CLASSIFICATION
Syllabus:
PART1: Mining Frequent Patterns, Associations and Correlations – Efficient and Scalable
Frequent Itemset Mining Methods – Mining Various Kinds of Association Rules – Correlation
Analysis – Constraint Based Association Mining.
PART2: Classification and Prediction - Basic Concepts - Decision Tree Induction - Bayesian
Classification – Rule Based Classification – Classification by Back propagation – Support Vector
Machines – Associative Classification – Lazy Learners – Other Classification Methods –
Prediction, Accuracy and Error measures, Evaluating the accuracy of a Classifier or a Predictor,
Ensemble Methods.
Objectives:
PART1: This unit presents methods for mining frequent patterns, associations, and
correlations in transactional and relational databases and data warehouses. It
also presents techniques for mining multilevel association rules, multidimensional
association rules, and quantitative association rules.

Micro Plan
S.No  Topics                                                 References  Teaching Methodology  Number of classes
1.    Basic Concepts                                         H&K         BB/PPT                1
2.    Efficient and Scalable Frequent Itemset Mining Methods  H&K         BB/PPT                2
3.    Mining various kinds of Association Rules              H&K         BB/PPT                2
4.    From Association to Correlation analysis               H&K         BB/PPT                2
5.    Constraint-Based Association Mining                    H&K         BB/PPT                2
Total number of classes: 9
Assignment Questions
1. A database has five transactions. Let min_sup = 60% and min_conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of
the two mining processes.
(b) List all of the strong association rules (with support s and confidence c) matching the following
metarule, where X is a variable representing customers, and item denotes variables representing
items (e.g., “A”, “B”, etc.):
2. Give a short example to show that items in a strong association rule may actually be negatively
correlated.
3. Association rule mining often generates a large number of rules. Discuss effective methods that can
be used to reduce the number of rules generated while still preserving most of the interesting rules.
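The support and confidence measures used in these questions, and the level-wise search behind Apriori, can be sketched in a few lines of Python. The transactions below are hypothetical and invented purely for illustration (they are not the assignment's data), and the function names `count`, `apriori` and `confidence` are likewise assumptions. The sketch covers only Apriori, not FP-growth.

```python
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_sup):
    """Return {frequent itemset: support}, searching level by level."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    level = [s for s in level if count(s, transactions) / n >= min_sup]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = count(s, transactions) / n
        k += 1
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset,
        # then keep only those that meet the support threshold.
        level = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and count(c, transactions) / n >= min_sup
        ]
    return frequent

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = count(A and B together) / count(A)."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

frequent = apriori(transactions, min_sup=0.6)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)

conf_ab = confidence(frozenset("A"), frozenset("B"), transactions)
print("confidence(A => B) =", conf_ab)
```

With these made-up transactions and min_sup = 60%, every single item and every pair is frequent, while {A, B, C} is not. The rule A => B has confidence 0.75, so it would not count as strong under min_conf = 80%.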
Syllabus:
PART2:Classification and Prediction - Basic Concepts - Decision Tree Induction - Bayesian
Classification – Rule Based Classification – Classification by Back propagation – Support Vector
Machines – Associative Classification – Lazy Learners – Other Classification Methods – Prediction,
Accuracy and Error measures, Evaluating the accuracy of a Classifier or a Predictor, Ensemble
Methods.
PART2: Objectives:
This unit describes methods for data classification and prediction, including decision tree
induction, Bayesian classification, rule-based classification, and several others. It also
discusses how to measure and enhance classification and prediction accuracy.

Micro Plan
S.No  Topics                                                   References  Teaching Methodology  Number of classes
1.    Issues Regarding Classification and Prediction           H&K         BB/PPT                1
2.    Classification by Decision Tree Induction                H&K         BB/PPT                1
3.    Rule-Based Classification                                H&K         BB/PPT                2
4.    Classification by Backpropagation                        H&K         BB/PPT                1
5.    Support Vector Machines                                  H&K         BB/PPT                1
6.    Associative Classification                               H&K         BB/PPT                2
7.    Lazy Learners, Other Classification Methods              H&K         BB/PPT                1
8.    Prediction, Accuracy and Error Measures                  H&K         BB/PPT                1
9.    Evaluating the Accuracy of a Classifier or a Predictor   H&K         BB/PPT                2
10.   Ensemble Methods                                         H&K         BB/PPT                2
Total number of classes: 14
Assignment Questions
1. Why is naïve Bayesian classification called “naïve”? Briefly outline the major ideas of naïve Bayesian
classification.
2. Briefly outline the major steps of decision tree classification.
3. Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of
tuples to evaluate pruning?
4. What is associative classification? Why is associative classification able to achieve higher classification
accuracy than a classical decision tree method? Explain how associative classification can be used for
text document classification.
5. The support vector machine (SVM) is a highly accurate classification method. However, SVM
classifiers suffer from slow processing when training with a large set of data tuples. Discuss how
to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification
in large datasets.
6. What is boosting? State why it may improve the accuracy of decision tree induction.
7. It is difficult to assess classification accuracy when individual data objects may belong to more than
one class at a time. In such cases, comment on what criteria you would use to compare different
classifiers modeled after the same data.
UNIT-IV: CLUSTERING IN DATA MINING:
Syllabus: Cluster Analysis - Types of Data – Categorization of Major Clustering Methods - K-means –
Partitioning Methods – Hierarchical Methods - Density-Based Methods – Grid-Based Methods – Model-Based Clustering Methods – Clustering High Dimensional Data – Constraint-Based Cluster Analysis –
Outlier Analysis

Objectives:
Several major data clustering approaches are presented, including clustering of high-dimensional data and constraint-based cluster analysis. Outlier analysis is also
discussed.

Micro Plan
S.No  Topics                                         References  Teaching Methodology  Number of classes
1.    Types of Data in Cluster Analysis              H&K         BB/PPT                1
2.    A Categorization of Major Clustering Methods   H&K         BB/PPT                1
3.    Partitioning Methods, Density-Based Methods    H&K         BB/PPT                2
4.    Grid-Based Methods                             H&K         BB/PPT                1
5.    Model-Based Clustering Methods                 H&K         BB/PPT                1
6.    Clustering High Dimensional Data               H&K         BB/PPT                1
7.    Constraint-Based Cluster Analysis              H&K         BB/PPT                2
8.    Outlier Analysis                               H&K         BB/PPT                1
Total number of classes: 10
Assignment Questions
1. Given the following measurements for the variable age: 18, 22, 25, 42, 28, 43, 33, 35, 56, 28,
standardize the variable by the following:
(a) Compute the mean absolute deviation of age.
(b) Compute the z-score for the first four measurements.
2. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
3. Present conditions under which density-based clustering is more suitable than partitioning-based
clustering and hierarchical clustering. Give some application examples to support your argument.
4. Why is outlier mining important? Briefly describe the different approaches behind statistical-based
outlier detection, distance-based outlier detection, density-based local outlier detection, and
deviation-based outlier detection.
5. Describe each of the following clustering algorithms in terms of the following criteria: (i) shapes of
clusters that can be determined; (ii) input parameters that must be specified; and (iii) limitations.
(a) k-means
(b) k-medoids
(c) CLARA
(d) BIRCH
(e) ROCK
(f) Chameleon
(g) DBSCAN
6. For constraint-based clustering, aside from having the minimum number of customers in each cluster
(for ATM allocation) as a constraint, there could be many other kinds of constraints. For example, a
constraint could be in the form of the maximum number of customers per cluster, average income of
customers per cluster, maximum distance between every two clusters, and so on. Categorize the kinds
of constraints that can be imposed on the clusters produced and discuss how to perform clustering
efficiently under such kinds of constraints.
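The arithmetic behind questions 1 and 2 above can be sketched in Python as follows. This is a minimal illustration, not a model answer: it assumes the z-score is taken relative to the mean absolute deviation, as the wording of question 1 suggests (the standard deviation is the more common divisor elsewhere), and the helper name `minkowski` is an assumption.

```python
# Question 1: standardize age using the mean absolute deviation.
ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]
mean_age = sum(ages) / len(ages)
mad = sum(abs(a - mean_age) for a in ages) / len(ages)

# z-scores of the first four measurements, relative to the mean absolute deviation.
z_scores = [(a - mean_age) / mad for a in ages[:4]]

def minkowski(x, y, q):
    """Minkowski distance of order q; q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

# Question 2: the two objects given as tuples.
x = (22, 1, 42, 10)
y = (20, 0, 36, 8)
manhattan = minkowski(x, y, 1)
euclidean = minkowski(x, y, 2)
minkowski_3 = minkowski(x, y, 3)

print(f"mean={mean_age} mean absolute deviation={mad:.2f}")
print("z-scores:", [round(z, 3) for z in z_scores])
print(f"Manhattan={manhattan:.3f} Euclidean={euclidean:.3f} Minkowski(q=3)={minkowski_3:.3f}")
```

Writing the Manhattan and Euclidean distances as special cases of one function makes the role of q explicit: a larger q gives more weight to the attribute with the largest difference.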
UNIT-V: APPLICATIONS AND TRENDS IN DATA MINING:
Syllabus:
Data Mining Applications, Data Mining Systems Products and Research Prototypes,
Additional Themes on Data Mining and Social Impacts of Data Mining.

Objectives:
This unit presents various applications of and trends in data mining. The social impacts
of data mining, such as privacy and data security issues, are studied in detail as
challenging research issues. Ubiquitous data mining is also discussed.
Micro Plan
S.No  Topics                                                   References  Teaching Methodology  Number of classes
1.    Data Mining Applications                                 H&K         BB/PPT                1
2.    Data Mining Systems Products and Research Prototypes     H&K         BB/PPT                1
3.    Additional Themes on Data Mining                         H&K         BB/PPT                1
4.    Social Impacts of Data Mining                            H&K         BB/PPT                1
Total number of classes: 4
Assignment Questions:
1. Research and describe an application of data mining that was not presented in this unit.
Discuss how different forms of data mining can be used in the application.
2. Study an existing commercial data mining system. Outline the major features of such a
system from a multidimensional point of view, including data types handled, architecture
of the system, data sources, data mining functions, data mining methodologies, coupling
with database or data warehouse systems, scalability, visualization tools, and graphical
user interfaces. Can you propose one improvement to such a system and outline how to
realize it?
3. What are the differences between visual data mining and data visualization? Data visualization
may suffer from the data abundance problem. For example, the interesting properties of a huge,
densely connected social network are hard to discover visually. Propose a data mining method
that may help people see through the network topology to the interesting features of the social network.
4. What are the major challenges faced in bringing data mining research to market? Illustrate
one data mining research issue that, in your view, may have a strong impact on the
market and on society. Discuss how to approach such a research issue.