Dealing with Data – Especially Big Data
INFO-GB-3322, Spring 2014
Norman White

Background: Most courses spend their time on the concepts and techniques of analyzing data, but virtually no time on how to handle the data and store it in a form suitable for analysis. This course focuses on how one deals with data, from its initial acquisition to its final analysis. Topics include data acquisition, data cleaning and formatting, common data formats, data representation and storage, data transformations, database management systems, “big data” or NoSQL solutions for storing and analyzing data, common analysis tools including Excel, SAS and MATLAB, data mining and data visualization.

The course will be taught in an interactive lab learning environment where, after the first few classes, some of the class time will be spent working in teams on small assignments. Students should have reasonably powerful notebook computers with adequate RAM, disk space and Wi-Fi. Most recent notebooks should be sufficient. In addition to students' personal systems, the class will have access to several servers and a “big data” cluster to use in assignments and projects.

This course should be valuable background for students in information systems, business analytics, market research, operations, finance, marketing and accounting.

Textbook: Optional

Requirements: There will be numerous small homeworks, a midterm and a team project.

Grading: Homeworks will count 20%, the midterm 35%, the final project 35%, class participation 5% and team member ratings 5%.

Lecture Outline

Week Topic
1) Introduction to data, formats and representation. Binary, character and floating point formats.
2) Files, records and fields. Sequential processing, sorting and merging data. Looking ahead: what if we could sort and process very large data sets efficiently? Homework: simple sort/merge reporting problem.
3) Handling unstructured data. Converting text data to common formats such as CSV, tab-delimited, fixed format and XML. Inputting data into Excel. Common problems. Homework: load a text file into Excel and analyze it.
4) Common preprocessing tools: the Unix tools sed, grep, cut and awk, plus Perl, Python, etc. The concept of pipeline processing. Homework: use Unix tools to convert an unstructured text file to a CSV file suitable for loading into Excel, SAS or a database.
5) Relational databases. Overview of features and functions. Using E/R diagrams to generate schemas. Homework: E/R diagram of a business case.
6) Query languages: SQL, including joins and aggregation. Homework: use SQL on a multi-table database to answer questions.
7) Midterm (first half of class); project meetings (second half).
8) ETL: Extraction, Transformation and Loading. How do we prepare data to be loaded into a database?
9) “Big Data”. How do we handle terabytes and petabytes of unstructured data? Why don’t traditional database systems scale? Remember sort/merge? Discussion of the Google File System, MapReduce and Hadoop. Problems of handling web and social network data. Demonstration of sorting a very large file. Word count example. Homework: final project proposal due.
10) “Big Data” analytics. How do we scale database systems, data mining and other analytical techniques to handle massive databases? Overview of massively distributed systems such as Pig, Hive, Mahout and HBase. (Discuss the PageRank problem.) Homework: run a MapReduce job to develop a word count of trigrams in a large textual data set, or run Pegasus to analyze a large social network, or run Mahout to develop a recommender system.
11) Data visualization. A picture is worth a thousand words. Show how large amounts of data can be displayed using graphical techniques. Examples of some standard techniques: treemaps, Tufte … (Possible speaker.)
12) In-class team meetings on projects.
13) Final project presentations. What did we learn?
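
As a rough illustration of the trigram word count mentioned in the week 10 homework, the sketch below imitates the map, shuffle and reduce phases in a single Python process. It is not the course's actual assignment code and does not use Hadoop. The function names (map_trigrams, shuffle, reduce_counts, trigram_counts) are hypothetical, and the choices to lowercase the text, split on whitespace and count trigrams within each line only are assumptions made for brevity.

```python
from itertools import groupby
from operator import itemgetter


def map_trigrams(line):
    """Map step: emit (trigram, 1) for every run of three consecutive words."""
    words = line.lower().split()  # assumption: whitespace tokenization
    for i in range(len(words) - 2):
        yield (" ".join(words[i:i + 3]), 1)


def shuffle(pairs):
    """Shuffle step: sort by key so equal trigrams end up adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts for each distinct trigram."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))


def trigram_counts(lines):
    """Run the three phases in order over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_trigrams(line))
    return dict(reduce_counts(shuffle(pairs)))
```

For example, trigram_counts(["the cat sat on the mat", "the cat sat down"]) counts "the cat sat" twice and every other trigram once. The shuffle phase here is just a sort, which is the link back to the sort/merge processing of week 2. In a real cluster that sort is distributed across machines, and the map and reduce steps run in parallel on separate partitions of the data.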