ANURAG Group Of Institutions (Formerly CVSR College of Engineering)
VENKATAPUR (V), GHATKESAR (M), R.R Dist
Department of Information Technology

Course Code:
Year / Semester: III-yr II-SEM
Course Title: Data Warehousing and Data Mining
Course Time: 2014-2015

Time Table:
Periods: 9:00-9:50, 9:50-10:40, 10:40-11:30, 11:30-12:20, 1:10-2:00, 2:00-2:50, 2:50-3:40
Days: MON, TUE, WED, THU, FRI, SAT
DWDM is assigned five periods in the weekly grid.

Required Text Books:
1. Data Mining – Concepts and Techniques, Jiawei Han & Micheline Kamber, Harcourt India.
2. Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Pearson Education.

Course Objectives:
1. To familiarize students with the concepts and architectural types of data warehouses.
2. To provide efficient design and management of data storage using data warehousing and OLAP.
3. To understand the fundamental processes, concepts and techniques of data mining.
4. To consistently apply knowledge of current data mining research and how it may contribute to the effective design and implementation of data mining applications.
5. To provide advanced research skills through the investigation of data-mining literature.
6. To develop an appreciation for the inherent complexity of the data-mining task.

Course Outcomes:
1. Understand the concepts and architectural types of data warehouses, and provide efficient design and management of data storage using data warehousing and OLAP.
2. Understand the fundamental processes, concepts and techniques of data mining.
3. Apply knowledge of current data mining research and how it may contribute to the effective design and implementation of data mining applications.
4. Identify different research skills through the investigation of data-mining literature.
5. Appreciate the inherent complexity of the data-mining task.

Evaluation Methodology:
S.No | Method of Evaluation | Examination Dates | Marks | Remarks
1 | Internal Exam-I | | 20 |
2 | Internal Exam-II | | 20 |
3 | Assignment-I | | 5 |
4 | Assignment-II | | 5 |
5 | External Exam | | 75 |

Note:
H&K: Data Mining – Concepts and Techniques, Jiawei Han & Micheline Kamber, Harcourt.
BB: Blackboard.
PPT: PowerPoint presentation.

DATA WAREHOUSING & DATA MINING SYLLABUS

UNIT I: DATA WAREHOUSING: Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to Data Mining, OLAP.

UNIT II: DATA MINING: Introduction – Data – Types of Data – Data Mining Functionalities – Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data Warehouse – Issues – Data Preprocessing.

UNIT III: ASSOCIATION RULE MINING AND CLASSIFICATION: Mining Frequent Patterns, Associations and Correlations – Efficient and Scalable Frequent Itemset Mining Methods – Mining Various Kinds of Association Rules – Correlation Analysis – Constraint-Based Association Mining. Classification and Prediction – Basic Concepts – Decision Tree Induction – Bayesian Classification – Rule-Based Classification – Classification by Backpropagation – Support Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods – Prediction, Accuracy and Error Measures, Evaluating the Accuracy of a Classifier or a Predictor, Ensemble Methods.
UNIT IV: CLUSTERING IN DATA MINING: Cluster Analysis – Types of Data – Categorization of Major Clustering Methods – K-means – Partitioning Methods – Hierarchical Methods – Density-Based Methods – Grid-Based Methods – Model-Based Clustering Methods – Clustering High Dimensional Data – Constraint-Based Cluster Analysis – Outlier Analysis.

UNIT V: APPLICATIONS AND TRENDS IN DATA MINING: Data Mining Applications, Data Mining System Products and Research Prototypes, Additional Themes on Data Mining and Social Impacts of Data Mining.

TEXT BOOKS:
1. Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Second Edition, Elsevier, 2007.
2. Alex Berson and Stephen J. Smith, “Data Warehousing, Data Mining & OLAP”, Tata McGraw-Hill Edition, Tenth Reprint 2007.

UNIT-I: DATA WAREHOUSING

Syllabus: Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to Data Mining, OLAP.

Objectives: This unit introduces data warehouses, OLAP and data generalization. It presents the basic concepts, architectures and general implementations of data warehouses, and the relationship between data warehousing and data mining. Further discussion covers a detailed study of methods for data cube computation, including the OLAP methods, and further explorations of data warehouse and OLAP technology. Attribute-oriented induction, an alternative method for data generalization and concept description, is also discussed.

Micro Plan
S.No | Topics | References | Teaching Methodology | Number of classes
1 | Data Warehouse | H&K | BB/PPT | 1
2 | Multidimensional Data Model | H&K | BB/PPT | 1
3 | Data Warehouse Architecture | H&K | BB/PPT | 1
4 | Data Warehouse Implementation | H&K | BB/PPT | 1
5 | Further Development of Data Cube Technology | H&K | BB/PPT | 1
6 | From Data Warehousing to Data Mining | H&K | BB/PPT | 1
7 | Efficient Methods for Data Cube Computation | H&K | BB/PPT | 1
8 | Further Development for Data Cube OLAP Technology | H&K | BB/PPT | 1
Total number of classes: 8

Assignment Questions
1. Briefly compare the following concepts. You may use an example to explain your point(s).
(a) Snowflake schema, fact constellation, starnet query model
(b) Data cleaning, data transformation, refresh
(c) Enterprise warehouse, data mart, virtual warehouse
2. A data warehouse can be modeled by either a star schema or a snowflake schema. Briefly describe the similarities and the differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer.
3. What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining (OLAM).
4. Explain the development of data cube OLAP technology.

Unit-II: DATA MINING

Syllabus: Introduction – Data – Types of Data – Data Mining Functionalities – Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data Warehouse – Issues – Data Preprocessing.

Objectives: The first half of this unit provides an introduction to the multidisciplinary field of data mining and discusses the evolutionary path of database technology. It examines the various types of data to be mined.
The second half introduces techniques for preprocessing the data before mining, including the use of concept hierarchies for dynamic and static discretization. The automatic generation of concept hierarchies is also described.

Micro Plan
S.No | Topics | References | Teaching Methodology
1 | Fundamentals of Data Mining | H&K | BB/PPT
2 | Data Mining Functionalities | H&K | BB/PPT
3 | Classification of Data Mining Systems | H&K | BB/PPT
4 | Data Mining Task Primitives | H&K | BB/PPT
5 | Integration of Database or a Data Warehouse System | H&K | BB/PPT
6 | Major Issues in Data Mining | H&K | BB/PPT
7 | Needs for Preprocessing the Data | H&K | BB/PPT
8 | Data Cleaning, Data Integration | H&K | BB/PPT
9 | Data Reduction, Data Transformation | H&K | BB/PPT
10 | Discretization and Concept Hierarchy Generation | H&K | BB/PPT
Number of classes (in listed order; some entries span more than one topic): 1, 1, 1, 2, 1, 1, 1, 1
Total number of classes: 9

Assignment Questions:
1. What is data mining? In your answer, address the following:
(a) Is it another hype?
(b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?
(c) Explain how the evolution of database technology led to data mining.
(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
2. Present an example where data mining is crucial to the success of a business. What data mining functions does this business need? Can they be performed alternatively by data query processing or simple statistical analysis?
3. Based on your observation, describe another possible kind of knowledge that needs to be discovered by data mining methods but has not been listed in this chapter. Does it require a mining methodology that is quite different from those outlined in this chapter?
4. What are the major challenges of mining a huge amount of data (such as billions of tuples) in comparison with mining a small amount of data (such as a data set of a few hundred tuples)?
5. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile-quantile plot different from a quantile plot?
6. Discuss issues to consider during data integration.
7. Data quality can be assessed in terms of accuracy, completeness, and consistency. Propose two other dimensions of data quality.

Unit-III: ASSOCIATION RULE MINING AND CLASSIFICATION

Syllabus:
PART1: Mining Frequent Patterns, Associations and Correlations – Efficient and Scalable Frequent Itemset Mining Methods – Mining Various Kinds of Association Rules – Correlation Analysis – Constraint-Based Association Mining.
PART2: Classification and Prediction – Basic Concepts – Decision Tree Induction – Bayesian Classification – Rule-Based Classification – Classification by Backpropagation – Support Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods – Prediction, Accuracy and Error Measures, Evaluating the Accuracy of a Classifier or a Predictor, Ensemble Methods.
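As a rough illustration of assignment question 5 in Unit II above, the summary measures for the listed age values can be computed with Python's standard statistics module. This is a sketch rather than a model answer: the variable names are illustrative, and the quartiles use the module's default "exclusive" method, which is only one of several textbook conventions, so quartiles found by hand may differ slightly.

```python
import statistics

# Age values copied from assignment question 5 (already in increasing order).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean_age = statistics.mean(ages)          # arithmetic mean
median_age = statistics.median(ages)      # middle value of the sorted data
modes = statistics.multimode(ages)        # every value tied for the highest count
midrange = (min(ages) + max(ages)) / 2    # average of the smallest and largest value

# Quartiles with the default "exclusive" method (an assumption; other
# conventions interpolate differently and can give slightly different cut points).
q1, q2, q3 = statistics.quantiles(ages, n=4)

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(ages), q1, median_age, q3, max(ages))

print(f"mean={mean_age:.2f} median={median_age} modes={modes} midrange={midrange}")
print("five-number summary:", five_number_summary)
```

If multimode returns more than one value, the data has several modes, which is what part (b) asks you to comment on. The boxplot in part (f) can be drawn by hand from the five-number summary.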
Objectives:
PART1: This unit presents methods for mining frequent patterns, associations, and correlations in transactional and relational databases and data warehouses. It also presents techniques for mining multilevel association rules, multidimensional association rules, and quantitative association rules.

Micro Plan
S.No | Topics | References | Teaching Methodology | Number of classes
1 | Basic Concepts | H&K | BB/PPT | 1
2 | Efficient and Scalable Frequent Itemset Mining Methods | H&K | BB/PPT | 2
3 | Mining Various Kinds of Association Rules | H&K | BB/PPT | 2
4 | From Association to Correlation Analysis | H&K | BB/PPT | 2
5 | Constraint-Based Association Mining | H&K | BB/PPT | 2
Total number of classes: 9

Assignment Questions
1. A database has five transactions. Let min_sup = 60% and min_conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
(b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and item denotes variables representing items (e.g., “A”, “B”, etc.):
2. Give a short example to show that items in a strong association rule may actually be negatively correlated.
3. Association rule mining often generates a large number of rules. Discuss effective methods that can be used to reduce the number of rules generated while still preserving most of the interesting rules.

Syllabus:
PART2: Classification and Prediction – Basic Concepts – Decision Tree Induction – Bayesian Classification – Rule-Based Classification – Classification by Backpropagation – Support Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods – Prediction, Accuracy and Error Measures, Evaluating the Accuracy of a Classifier or a Predictor, Ensemble Methods.
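To make the support and confidence thresholds in assignment question 1 of Part 1 above concrete, here is a minimal level-wise (Apriori-style) sketch. The five transactions are hypothetical, made up purely for illustration, and are not the data set intended by the question. The function names support_count, apriori, and strong_rules are likewise assumptions. FP-growth is not shown.

```python
from itertools import combinations

# Hypothetical toy database of five transactions (NOT the assignment's data).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"A", "B", "C"},
    {"B", "D"},
]
MIN_SUP = 0.6   # minimum support, as a fraction of all transactions
MIN_CONF = 0.8  # minimum confidence


def support_count(itemset, db):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def apriori(db, min_sup):
    """Return {frequent itemset: support}, built level by level."""
    n = len(db)
    items = sorted({item for t in db for item in t})
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i]), db) / n >= min_sup]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support_count(s, db) / n
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent (Apriori property),
        # then the candidate itself must reach the minimum support.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                 and support_count(c, db) / n >= min_sup]
    return frequent


def strong_rules(frequent, db, min_conf):
    """Return (antecedent, consequent, support, confidence) for strong rules."""
    rules = []
    for itemset in frequent:
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = support_count(itemset, db) / support_count(antecedent, db)
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent,
                                  frequent[itemset], conf))
    return rules


frequent = apriori(transactions, MIN_SUP)
rules = strong_rules(frequent, transactions, MIN_CONF)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), f"s={sup:.0%} c={conf:.0%}")
```

The sketch rescans the database for every candidate, which is fine for five transactions but is exactly the cost that part (a) asks you to compare against FP-growth.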
PART2: Objectives: This unit describes methods for data classification and prediction, including decision tree induction, Bayesian classification, rule-based classification, and many more. It also discusses measuring and enhancing classification and prediction accuracy.

Micro Plan
S.No | Topics | References | Teaching Methodology | Number of classes
1 | Issues Regarding Classification and Prediction | H&K | BB/PPT | 1
2 | Classification by Decision Tree Induction | H&K | BB/PPT | 1
3 | Rule-Based Classification | H&K | BB/PPT | 2
4 | Classification by Backpropagation | H&K | BB/PPT | 1
5 | Support Vector Machines | H&K | BB/PPT | 1
6 | Associative Classification | H&K | BB/PPT | 2
7 | Lazy Learners, Other Classification Methods | H&K | BB/PPT | 1
8 | Prediction, Accuracy and Error Measures | H&K | BB/PPT | 1
9 | Evaluating the Accuracy of a Classifier or a Predictor | H&K | BB/PPT | 2
10 | Ensemble Methods | H&K | BB/PPT | 2
Total number of classes: 14

Assignment Questions
1. Why is naïve Bayesian classification called “naïve”? Briefly outline the major ideas of naïve Bayesian classification.
2. Briefly outline the major steps of decision tree classification.
3. Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning?
4. What is associative classification? Why is associative classification able to achieve higher classification accuracy than a classical decision tree method? Explain how associative classification can be used for text document classification.
5. The support vector machine (SVM) is a highly accurate classification method. However, SVM classifiers suffer from slow processing when training with a large set of data tuples. Discuss how to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification in large datasets.
6. What is boosting?
State why it may improve the accuracy of decision tree induction.
7. It is difficult to assess classification accuracy when individual data objects may belong to more than one class at a time. In such cases, comment on what criteria you would use to compare different classifiers modeled after the same data.

UNIT-IV: CLUSTERING IN DATA MINING

Syllabus: Cluster Analysis – Types of Data – Categorization of Major Clustering Methods – K-means – Partitioning Methods – Hierarchical Methods – Density-Based Methods – Grid-Based Methods – Model-Based Clustering Methods – Clustering High Dimensional Data – Constraint-Based Cluster Analysis – Outlier Analysis.

Objectives: Several major data clustering approaches are presented, including clustering high-dimensional data and constraint-based cluster analysis. Outlier analysis is also discussed.

Micro Plan
S.No | Topics | Teaching Methodology | Number of classes
1 | Types of Data in Cluster Analysis | BB/PPT | 1
2 | A Categorization of Major Clustering Methods | BB/PPT | 1
3 | Partitioning Methods, Density-Based Methods | BB/PPT | 2
4 | Grid-Based Methods | BB/PPT | 1
5 | Model-Based Clustering Methods | BB/PPT | 1
6 | Clustering High Dimensional Data | BB/PPT | 1
7 | Constraint-Based Cluster Analysis | BB/PPT | 2
8 | Outlier Analysis | BB/PPT | 1
References: H&K
Total number of classes: 10

Assignment Questions
1. Given the following measurements for the variable age: 18, 22, 25, 42, 28, 43, 33, 35, 56, 28, standardize the variable by the following:
(a) Compute the mean absolute deviation of age.
(b) Compute the z-score for the first four measurements.
2. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
3. Present conditions under which density-based clustering is more suitable than partitioning-based clustering and hierarchical clustering. Give some application examples to support your argument.
4. Why is outlier mining important? Briefly describe the different approaches behind statistical-based outlier detection, distance-based outlier detection, density-based local outlier detection, and deviation-based outlier detection.
5. Describe each of the following clustering algorithms in terms of the following criteria: (i) shapes of clusters that can be determined; (ii) input parameters that must be specified; and (iii) limitations.
(a) k-means
(b) k-medoids
(c) CLARA
(d) BIRCH
(e) ROCK
(f) Chameleon
(g) DBSCAN
6. For constraint-based clustering, aside from having the minimum number of customers in each cluster (for ATM allocation) as a constraint, there could be many other kinds of constraints. For example, a constraint could be in the form of the maximum number of customers per cluster, average income of customers per cluster, maximum distance between every two clusters, and so on. Categorize the kinds of constraints that can be imposed on the clusters produced and discuss how to perform clustering efficiently under such kinds of constraints.

UNIT-V: APPLICATIONS AND TRENDS IN DATA MINING

Syllabus: Data Mining Applications, Data Mining System Products and Research Prototypes, Additional Themes on Data Mining and Social Impacts of Data Mining.

Objectives: This unit presents various applications and trends in data mining. The social impacts of data mining, such as privacy and data security issues, are studied in detail, along with challenging research issues. Ubiquitous data mining is also discussed.
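The three distances asked for in assignment question 2 of Unit IV above are all instances of the Minkowski distance, d(x, y) = (sum over i of |x_i - y_i|^q)^(1/q), with q = 1 (Manhattan), q = 2 (Euclidean), and q = 3. A small sketch follows. The function name minkowski is an assumption chosen for illustration.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length numeric tuples."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


# The two objects from assignment question 2.
x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

manhattan = minkowski(x, y, 1)    # q = 1: sum of absolute differences
euclidean = minkowski(x, y, 2)    # q = 2: square root of summed squared differences
minkowski3 = minkowski(x, y, 3)   # q = 3: cube root of summed cubed absolute differences

print(f"Manhattan={manhattan:.4f} Euclidean={euclidean:.4f} Minkowski(q=3)={minkowski3:.4f}")
```

Using one function for all three parts makes the relationship visible: as q grows, the largest attribute difference increasingly dominates the result.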
Micro Plan
S.No | Topics | References | Teaching Methodology | Number of classes
1 | Data Mining Applications | H&K | BB/PPT | 1
2 | Data Mining System Products and Research Prototypes | H&K | BB/PPT | 1
3 | Additional Themes on Data Mining | H&K | BB/PPT | 1
4 | Social Impacts of Data Mining | H&K | BB/PPT | 1
Total number of classes: 4

Assignment Questions:
1. Research and describe an application of data mining that was not presented in this unit. Discuss how different forms of data mining can be used in the application.
2. Study an existing commercial data mining system. Outline the major features of such a system from a multidimensional point of view, including data types handled, architecture of the system, data sources, data mining functions, data mining methodologies, coupling with database or data warehouse systems, scalability, visualization tools, and graphical user interfaces. Can you propose one improvement to such a system and outline how to realize it?
3. What are the differences between visual data mining and data visualization? Data visualization may suffer from the data abundance problem. Propose a data mining method that may help people see through the network topology to the interesting features of the social network.
4. What are the major challenges faced in bringing data mining research to market? Illustrate one data mining research issue that, in your view, may have a strong impact on the market and on society. Discuss how to approach such a research issue.