Dealing with Data – Especially Big Data
INFO-GB-2346.01
Fall 2016

Professor Norman White
[email protected]
normwhite@twitter
TA: TBA
Assistant: Sharon Kim [email protected]

Background:
Most courses spend their time on the concepts and techniques of analyzing data, but virtually no time on how to handle the data and store it in a form that can be analyzed. This course focuses on how one deals with data, from its initial acquisition to its final analysis. The course is broken into several major topics:

1) Acquisition of data: How do we get it, and where do we get it?
2) Cleaning or “munging” data: What do we need to do to the data to get it into a format we can use? What are some of the tools and techniques for different types of data?
3) Storage and retrieval of data: How do we organize, store and retrieve the data?
4) Analyzing data: What are common techniques, tools and approaches for data analysis?

Topics include data acquisition, data cleaning and formatting, common data formats, data representation and storage, data transformations, data base management systems, “big data” or NoSQL solutions for storing and analyzing data, common analysis tools including Excel, Python and Pandas, data mining and data visualization. Python will be used extensively throughout the course.

The course will be taught in an interactive lab learning environment where, after the first few classes, some of the class time will be spent working in teams on small assignments. Students should have notebook computers that are powerful and have adequate RAM, disk space and Wi-Fi. Most recent notebooks should be sufficient.

Each student will have their own Linux computer running in the Amazon cloud (EC2). Each computer will be preloaded with much of the class materials and data sets. In addition to students’ personal systems on EC2, the class will have access to several servers and a “big data” cluster to use in assignments and projects.
IPython/Jupyter will be used for many of the examples and homeworks.

This course should be valuable background for students in information systems, business analytics, market research, operations, finance, marketing and accounting.

Textbook:
None. Most material will be at https://github.com/nhwhite212/Dealing-with-Data (a copy is in the IPython notebook on your virtual machine).

Requirements:
There will be numerous small and large homeworks and a team project.

Grading:
Homeworks will count 45%, the final project 35%, class participation 10% and team member ratings 10%.

Lecture Outline

Week  Topic

Acquiring Data

1) Course introduction. Instructions on retrieving your dedicated Amazon EC2 machine. What is data? Where do we find it? How do we get it? Examples: using curl to pull data from web sites; downloading formatted data as CSV/TSV files.
2) Handling unstructured data, data acquisition, converting text data to common formats like CSV, tab-delimited, fixed format and XML. Inputting data into Excel. Common problems.
   Homework: Use curl to download data, and load it into IPython to analyze.

“Munging” Data

3) Common preprocessing tools: Unix tools such as sed, grep, cut, awk, perl, python, etc. Concept of pipeline processing. Introduction to regular expressions. In-class lab on converting textual data to a CSV file.
   Homework: Use Unix tools or Python to convert an unstructured text file to a CSV-format file suitable for loading into Excel, SAS or a data base.
4) Introduction to Python
   a. Primitive data types in Python
   b. Lists, tuples, arrays, sets, dictionaries
   c. Control flow statements
   d. Reading and writing files
   e. Functions and classes
   f. Libraries, matplotlib
   g. In-class exercise, homework
5) More Python
   a. Regular expressions: how do we extract meaning from text files
   b. Using REs in Python

Storing and Managing Data

6) Introduction to relational data bases. Conceptual background, overview of features and functions.
   Using E/R diagrams to generate “good” data base designs. Simple queries using SQL. In-class exercise: create an E/R diagram of a sample system.
   Homework: E/R diagram of a business case.
7) Query languages, advanced SQL, including joins and aggregation.
   Homework: Use SQL on a multi-table data base to answer questions.
   Final Project Proposal due.

Analyzing Data

8) Midterm. Business analytics tools: Excel, SAS, MATLAB, R, Tableau. Introduction to Pandas.
9) Advanced SQL, introduction to non-relational databases: NoSQL, object-oriented, MongoDB, HBase, Firebase. Python access to databases.
   Homework: Use Pandas to analyze a large data set.
10) Using Python and Pandas to access relational data bases.
11) “Big Data”. How do we handle terabytes and petabytes of unstructured data? Why don’t traditional data base systems scale? Overview of Hadoop, HDFS and MapReduce. Problems of handling web and social network data. Overview of Hive. New processing models: Spark, Tez. Demonstration of sorting a very large file. Wordcount example in Python. Techniques for analyzing textual data: TF-IDF transformation for textual data.
12) Data visualization. A picture is worth a thousand words. Show how large amounts of data can be displayed using graphical techniques. Matplotlib, Pandas, Tableau.
13) Final Project presentations. What did we learn?