Course Approval Form
For approval of new courses and deletions or modifications to an existing course.
registrar.gmu.edu/facultystaff/curriculum

Action Requested:
  x Create new course
    Delete existing course
    Modify existing course (check all that apply): Title, Credits, Repeat Status, Grade Type, Prereq/coreq, Schedule Type, Restrictions, Other:
Course Level:
    Undergraduate
  x Graduate

College/School: Volgenau School of Engineering
Department: Computer Science
Submitted by: Daniel Barbará
Ext: X31627
Email: [email protected]

Subject Code: CS
Number: 776
(Do not list multiple codes or numbers. Each course proposal must have a separate form.)
Effective Term: x Fall, Spring, Summer; Year 2011

Title:
  Current:
  Banner (30 characters max including spaces): Mining Massive Datasets
  New: Mining Massive Datasets

Credits: 3 (check one)
  x Fixed
    Variable: ___ or ___ to ___
Repeat Status (check one):
    Not Repeatable (NR)
    Repeatable within degree (RD)
    Repeatable within term (RT)
  Maximum credits allowed:
Grade Mode (check one):
    Regular (A, B, C, etc.)
    Satisfactory/No Credit
    Special (A, B, C, etc. +IP)

Prerequisite(s): CS750 or equivalent
Corequisite(s):

Schedule Type Code(s) (check all that apply):
  x Lecture (LEC)
    Lab (LAB)
    Recitation (RCT)
    Internship (INT)
    Independent Study (IND)
    Seminar (SEM)
    Studio (STU)
Instructional Mode:
  x 100% face-to-face
    Hybrid: ≤ 50% electronically delivered
    100% electronically delivered

Special Instructions: (list restrictions for major, college, or degree; hard-coding; etc.)
Are there equivalent course(s)? Yes / x No
If yes, please list:

Catalog Copy for NEW Courses Only (Consult University Catalog for models)
Notes (List additional information for the course):
Description (No more than 60 words, use verb phrases and present tense):
Applications with massive amounts of data are becoming commonplace. From social network data to genomics, the need for efficient, scalable means to analyze data is pressing.
This course covers the techniques to mine large datasets, including distributed file systems and Map-Reduce, similarity search, and data stream processing. It covers classic problems in data mining, such as clustering, association rule mining, and others, from the point of view of scalability. The course includes a final project to exercise the concepts covered in class.

Indicate number of contact hours:
  Hours of Lecture or Seminar per week: 2.5
  Hours of Lab or Studio:
When Offered (check all that apply): x Fall, Summer, Spring

Approval Signatures
Department Approval:                Date:
College/School Approval:            Date:

If this course includes subject matter currently dealt with by any other units, the originating department must circulate this proposal for review by those units and obtain the necessary signatures prior to submission. Failure to do so will delay action on this proposal.

Unit Name / Unit Approval Name / Unit Approver's Signature / Date

For Graduate Courses Only
Graduate Council Member / Provost Office / Graduate Council Approval Date

For Registrar Office's Use Only: Banner_____________________________ Catalog________________________________
revised 2/2/10


Course Proposal Submitted to the Curriculum Committee of the College of Science

1. COURSE NUMBER AND TITLE: CS 776: Mining Massive Datasets
Course Prerequisites: CS 750 or equivalent course
Catalog Description: Applications with massive amounts of data are becoming commonplace. From social network data to genomics, the need for efficient, scalable means to analyze data is pressing. This course covers the techniques to mine large datasets, including distributed file systems and Map-Reduce, similarity search, and data stream processing. It covers classic problems in data mining, such as clustering, association rule mining, and others, from the point of view of scalability. The course includes a final project to exercise the concepts covered in class.

2. COURSE JUSTIFICATION:
Course Objectives: To familiarize students with the emerging techniques for analyzing very large datasets. To apply the concepts learned in class in a project that uses massive datasets and a cluster of computers such as the Hydra cluster.
Course Necessity: Massive datasets are becoming commonplace in industry. While GMU has classes on data mining, it lacks a class that focuses on the analysis of large datasets.
Course Relationship to Existing Programs: This course can be used as an elective in the MS-CS and PhD-CS programs.
Course Relationship to Existing Courses: This course is the natural extension of CS 688 and CS 750.

3. APPROVAL HISTORY:

4. SCHEDULING AND PROPOSED INSTRUCTORS:
Semester of Initial Offering: Fall 2011
Proposed Instructors: Dr. Daniel Barbará and Dr. Huzefa Rangwala

5. TENTATIVE SYLLABUS:


COURSE PROPOSAL BY THE DEPARTMENT OF COMPUTER SCIENCE
PROPOSAL DESIGNATION: New Course Proposal

I. CATALOG DESCRIPTION
A. CS 776: Mining Massive Datasets
B. Prerequisite: CS 750 or equivalent
Applications with massive amounts of data are becoming commonplace. From social network data to genomics, the need for efficient, scalable means to analyze data is pressing. This course covers the techniques to mine large datasets, including distributed file systems and Map-Reduce, similarity search, and data stream processing. It covers classic problems in data mining, such as clustering, association rule mining, and others, from the point of view of scalability. The course includes a final project to exercise the concepts covered in class.

II. JUSTIFICATION
A. Desirability of adding this course
Massive datasets are becoming commonplace in industry. While GMU has classes on data mining, it lacks a class that focuses on the analysis of large datasets.
B. Relationship to other courses
This course is the natural follow-up to CS750. It can be used as an elective in the MS and PhD CS programs, and also in the Data Mining certificate.

III. SCHEDULING
A. This course will be offered in the Fall semester of 2011 and every year thereafter.
B. Proposed instructors are Dr. Daniel Barbará and Dr. Huzefa Rangwala.

IV. SAMPLE SYLLABUS

Syllabus: CS 776 Mining Massive Datasets

Course Objectives
This course addresses the techniques needed to analyze and mine very large datasets. Such datasets are becoming ubiquitous in industry, government, and scientific organizations.

Topics Covered
• Distributed File Systems and Map-Reduce
• Scalable similarity search
• Data-stream analysis
• Search engines for large repositories of data
• Algorithms for clustering massive data
• Algorithms to find association rules in very large data

Grading Policy
• Individual assignments 35%
• Final exam 30%
• Final group project and presentation 35%

Sample Schedule
Class  Topic                                         Readings
1      Data mining refresher                         Rajaraman and Ullman (RU) Ch. 1
2      Large Scale File Systems and MapReduce        RU Ch. 2
3      Large Scale File Systems and MapReduce (II)   RU Ch. 2
4      Large-scale similarity searching              RU Ch. 3
5      Data-Stream mining (I)                        RU Ch. 4 and selected papers
6      Data-Stream mining (II)                       RU Ch. 4 and selected papers
7      Link Analysis (I)                             RU Ch. 5 and selected papers
8      Link Analysis (II)                            RU Ch. 5 and readings from Koller and Friedman's book on probabilistic graphical models
9      Large-scale Frequent Itemset finding          RU Ch. 6
10     Large-scale Clustering (I)                    RU Ch. 7
11     Large-scale Clustering (II)                   RU Ch. 7
12     Recommender Systems                           RU Ch. 9
13     Student presentations
14     Student presentations
15     Final exam

Textbooks
Required: Mining of Massive Datasets by Anand Rajaraman and Jeffrey Ullman. Soon to be published; currently available online.
Reference: Probabilistic Graphical Models: Principles and Techniques (Adaptive Computation and Machine Learning) by Daphne Koller and Nir Friedman. The MIT Press (August 31, 2009).
Selected papers.

Course Description
776 Mining Massive Datasets (3:3:0)
Prerequisite: CS750 or equivalent.
The course investigates techniques to mine large datasets, including distributed file systems and Map-Reduce, similarity search, and data stream processing. It covers classic problems in data mining, such as clustering, association rule mining, and others, from the point of view of scalability. The course includes a final project to exercise the concepts covered in class.