Dealing with Data – Especially Big Data
INFO-GB-2346.01
Fall 2016
Professor Norman White
[email protected]
normwhite@twitter
TA: TBA
Assistant: Sharon Kim [email protected]
Background: Most courses spend their time on the concepts and techniques
of analyzing data, but virtually no time on how to handle the data and store it
in a form that can be analyzed. This course focuses on how one deals with data,
from its initial acquisition to its final analysis.
The course will be broken into several major topics.
1) Acquisition of data: where do we find it and how do we get it?
2) Cleaning or “munging” data: what do we need to do to the data to get it
into a format we can use? What are some of the tools and techniques
for different types of data?
3) Storage and retrieval of data: how do we organize, store and retrieve
the data?
4) Analyzing data: what are common techniques, tools and approaches
for data analysis?
Topics include data acquisition, data cleaning and formatting, common data
formats, data representation and storage, data transformations, database
management systems, “big data” or NoSQL solutions for storing and analyzing
data, common analysis tools including Excel, Python, and Pandas, data
mining, and data visualization. Python will be used extensively throughout the
course.
The course will be taught in an interactive lab learning environment; after
the first few classes, some of the class time will be spent working in
teams on small assignments. Students should have notebook computers with
adequate RAM, disk space, and Wi-Fi; most recent notebooks should be
sufficient. Each student will have their own Linux computer running in the
Amazon Cloud (EC2). Each computer will be preloaded with much of the class
material and data sets.
In addition to students’ personal systems on EC2, the class will have access
to several servers and a “big data” cluster to use in assignments and projects.
IPython/Jupyter will be used for many of the examples and homework assignments.
This course should be valuable background for students in information
systems, business analytics, market research, operations, finance, marketing
and accounting.
Textbook: None; most material will be on
https://github.com/nhwhite212/Dealing-with-Data
A copy of which is in the IPython notebooks on your virtual machine.
Requirements: There will be numerous small and large homework assignments and a
team project.
Grading: Homework will count 45%, the final project 35%, class
participation 10%, and team member ratings 10%.
Lecture Outline

Acquiring Data

Week 1: Course introduction. Instructions on retrieving your dedicated
Amazon EC2 machine. What is data? Where do we find it, and how do we
get it? Examples: using curl to pull data from web sites; downloading
formatted data as CSV/TSV files.
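For reference, a minimal Python sketch of the same download step as the curl
example above, using only the standard library; the URL and file name are
placeholders, not actual class data sources.

    # Sketch: fetch a CSV file over HTTP, roughly what "curl -o data.csv <url>" does.
    # The URL below is a placeholder; substitute the data source used in class.
    import urllib.request

    url = "https://example.com/data.csv"   # placeholder URL
    urllib.request.urlretrieve(url, "data.csv")

    # Peek at the first few lines to check the download.
    with open("data.csv") as f:
        for _ in range(5):
            print(f.readline().rstrip())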
Week 2: Handling unstructured data; data acquisition; converting text
data to common formats such as CSV, tab-delimited, fixed format, and
XML. Inputting data into Excel. Common problems.
Homework: use curl to download data, and load it into IPython to
analyze it.
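A small sketch of one common week-2 conversion, turning a tab-delimited file
into CSV with Python's csv module; the file names are placeholders.

    # Sketch: convert a tab-delimited file to comma-separated values.
    # File names are placeholders for whatever data set is used in class.
    import csv

    with open("data.tsv", newline="") as src, open("data.csv", "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)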
“Munging” Data

Week 3: Common preprocessing tools: the Unix tools sed, grep, cut, awk,
perl, python, etc. The concept of pipeline processing. Introduction to
regular expressions. In-class lab on converting textual data to a CSV file.
Homework: use Unix tools or Python to convert an unstructured text
file to a CSV-format file suitable for loading into Excel, SAS, or a
database.
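A hedged sketch of the week-3 homework pattern: a pipeline-style Python filter
that reads raw text on stdin and writes CSV on stdout. The assumed
"name: value" record layout is invented for illustration and will differ from
the actual class data.

    # Pipeline-style filter: read lines on stdin, keep lines that look like
    # "name: value" records, and emit CSV on stdout, e.g.
    #     cat raw.txt | python3 to_csv.py > clean.csv
    import csv
    import sys

    writer = csv.writer(sys.stdout)
    writer.writerow(["name", "value"])
    for line in sys.stdin:
        line = line.strip()
        if ":" in line:
            name, value = line.split(":", 1)
            writer.writerow([name.strip(), value.strip()])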
Week 4: Introduction to Python
a. Primitive data types in Python
b. Lists, tuples, arrays, sets, dictionaries
c. Control flow statements
d. Reading and writing files
e. Functions and classes
f. Libraries, matplotlib
g. In-class exercise, homework
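A tiny sketch touching several week-4 items (lists, dictionaries, control flow,
a function, and writing a file); the words and output file name are arbitrary.

    def count_lengths(words):
        """Return a dict mapping each word to its length."""
        lengths = {}
        for w in words:
            lengths[w] = len(w)
        return lengths

    words = ["data", "munging", "python"]
    lengths = count_lengths(words)

    # Write the results out as simple comma-separated lines.
    with open("lengths.txt", "w") as out:
        for word, n in lengths.items():
            out.write("{},{}\n".format(word, n))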
Week 5: More Python
a. Regular expressions: how do we extract meaning from text files?
b. Using REs in Python
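A minimal regular-expression sketch for week 5: extracting ISO-style dates from
free text with the re module. The sample text and pattern are illustrative only.

    # Pull YYYY-MM-DD dates out of unstructured text.
    import re

    text = "Orders shipped 2016-09-14 and 2016-10-02; one item back-ordered."
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
    print(dates)   # ['2016-09-14', '2016-10-02']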
Storing and Managing Data

Week 6: Introduction to relational databases. Conceptual background;
overview of features and functions. Using E/R diagrams to generate
“good” database designs. Simple queries using SQL. In-class exercise:
create an E/R diagram of a sample system.
Homework: E/R diagram of a business case.
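A small sketch of a relational table and a simple query, using the sqlite3
module that ships with Python; the one-table customer schema is a toy
illustration, not the business case in the homework.

    # Create a tiny table, insert a row, and run a simple SELECT.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("INSERT INTO customer (name) VALUES (?)", ("Acme Corp",))
    conn.commit()

    for row in cur.execute("SELECT id, name FROM customer"):
        print(row)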
Week 7: Query languages; advanced SQL, including joins and aggregation.
Homework: use SQL on a multi-table database to answer questions.
Final Project Proposal due.
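Extending the same toy schema, a hedged sketch of a join with aggregation
(total order amount per customer); all table names and data are invented.

    # Join customers to their orders and sum the order amounts per customer.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customer VALUES (1, 'Acme Corp'), (2, 'Widget Inc');
        INSERT INTO orders VALUES (1, 1, 250.0), (2, 1, 100.0), (3, 2, 75.0);
    """)

    query = """
        SELECT c.name, SUM(o.amount) AS total
        FROM customer c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
    """
    for name, total in cur.execute(query):
        print(name, total)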
Analyzing Data

Week 8: Mid-term. Business analytics tools: Excel, SAS, MATLAB, R,
Tableau. Introduction to Pandas.
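A first-look Pandas sketch for week 8: load a CSV and print summary statistics.
The file name "data.csv" is a placeholder for whatever file the class uses.

    import pandas as pd

    df = pd.read_csv("data.csv")
    print(df.head())        # first few rows
    print(df.describe())    # summary statistics for numeric columns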
Week 9: Advanced SQL; introduction to non-relational databases: NoSQL,
object-oriented, MongoDB, HBase, Firebase. Python access to databases.
Homework: use Pandas to analyze a large data set.
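A hedged sketch of document-store access with the pymongo driver, assuming a
MongoDB server is running locally and pymongo is installed; the database,
collection, and document are invented.

    # Insert one document and query the collection with a simple filter.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["classdb"]                      # hypothetical database name
    db.reviews.insert_one({"user": "norm", "stars": 5, "text": "Great data set"})

    for doc in db.reviews.find({"stars": {"$gte": 4}}):
        print(doc["user"], doc["stars"])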
Week 10: Using Python and Pandas to access relational databases.
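A sketch of week 10's theme: pulling a SQL query result directly into a Pandas
DataFrame with pandas.read_sql_query. SQLite is used here only to keep the
example self-contained; the table and data are invented.

    import sqlite3
    import pandas as pd

    # Build a throwaway in-memory table to query against.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("East", 120.0), ("West", 95.5), ("East", 40.0)])

    # Run a query and get the result back as a DataFrame.
    df = pd.read_sql_query(
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn)
    print(df)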
Week 11: “Big Data”. How do we handle terabytes and petabytes of
unstructured data? Why don’t traditional database systems scale?
Overview of Hadoop, HDFS and Map-Reduce. Problems of handling web
and social network data. Overview of HIVE. New processing models:
Spark, Tez. Demonstration of sorting a very large file. Wordcount
example in Python. Techniques for analyzing textual data; the TF-IDF
transformation for textual data.
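A local wordcount sketch in the spirit of the week-11 example (Hadoop streaming
would split the same logic into a map step that emits (word, 1) pairs and a
reduce step that sums them); the input file name is a placeholder.

    # Count word frequencies in a text file and show the ten most common words.
    from collections import Counter

    counts = Counter()
    with open("book.txt") as f:
        for line in f:
            counts.update(line.lower().split())

    for word, n in counts.most_common(10):
        print(word, n)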
Week 12: Data visualization. A picture is worth a thousand words. Show
how large amounts of data can be displayed using graphical techniques.
Matplotlib, Pandas, Tableau.
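A minimal matplotlib sketch for week 12: a histogram of a numeric column. The
data is generated randomly just so the example runs on its own.

    import numpy as np
    import matplotlib.pyplot as plt

    values = np.random.normal(loc=50, scale=10, size=1000)   # stand-in data
    plt.hist(values, bins=30)
    plt.xlabel("value")
    plt.ylabel("frequency")
    plt.title("Distribution of a sample column")
    plt.show()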
Week 13: Final project presentations. What did we learn?