Dealing with Data – Especially Big Data
INFO-GB-2346.30
Spring 2016
Very Rough Draft
Subject to Change
Professor Norman White
Background: Most courses spend their time on the concepts and techniques
of analyzing data, but virtually no time on how to handle the data and store it
in a form suitable for analysis. This course focuses on how one deals with data,
from its initial acquisition to its final analysis.
Topics include data acquisition, data cleaning and formatting, common data
formats, data representation and storage, data transformations, data base
management systems, “big data” or NoSQL solutions for storing and analyzing
data, common analysis tools including Excel, SAS and MATLAB, data mining
and data visualization.
The course will be taught in an interactive lab learning environment where,
after the first few classes, some of the class time will be spent working in
teams on small assignments. Students should have reasonably powerful notebook
computers with adequate RAM, disk space and Wi-Fi. Most recent
notebooks should be sufficient.
In addition to students' personal systems, the class will have access to several
servers and a “big data” cluster to use in assignments and projects. IPython
will be used for some of the examples and homeworks.
This course should be valuable background for students in information
systems, business analytics, market research, operations, finance, marketing
and accounting.
Textbook: None
Requirements: There will be numerous small homeworks, a midterm and a
team project.
Grading: Homeworks will count 20%, the midterm 35%, the final project
35%, class participation 5% and team member ratings 5%.
Lecture Outline
Week
Topic
1)
Course introduction - Introduction to data, formats,
representation. Binary, character, …, Files, Records
and fields, Sequential processing, sorting and
merging data. Looking ahead: what if we
can sort and process very large data sets efficiently?
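As a small illustration of the sort/merge idea from week 1 (a sketch only, not course material): two record streams that are already sorted on a key can be merged in one sequential pass. The sample records and the helper name merge_sorted are hypothetical; heapq.merge is from the Python standard library.

```python
import heapq

# Two hypothetical inputs, each already sorted by customer id (first field).
file_a = [(101, "alice"), (205, "carol"), (310, "erin")]
file_b = [(150, "bob"), (205, "dave"), (400, "frank")]

def merge_sorted(*streams):
    """Merge sorted record streams by their first field in one sequential pass."""
    # heapq.merge reads lazily, so only one pending record per stream is held.
    return heapq.merge(*streams, key=lambda record: record[0])

merged = list(merge_sorted(file_a, file_b))
for record in merged:
    print(record)
```

Because each input is read strictly in order, the same approach works when the inputs are files too large to fit in memory.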
2)
Handling unstructured data, data acquisition,
converting text data to common formats like CSV, tab
delimited, fixed format, XML. Inputting data into Excel.
Common problems.
Homework. Load a text file into Excel and analyze it.
3)
Common preprocessing tools: Unix tools sed, grep,
cut, awk, Perl, Python, etc. Concept of pipeline
processing. Introduction to regular expressions.
In-class lab on converting textual data to a CSV file.
Homework. Use Unix tools to convert an unstructured
text file to a CSV file suitable for loading into
Excel, SAS or a data base.
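The week 3 conversion task could look roughly like the following in Python (a minimal sketch; the homework itself asks for Unix tools). The input lines, the field names and the regular expression are all hypothetical and stand in for whatever unstructured file is assigned.

```python
import csv
import io
import re

# Hypothetical unstructured input: one free-form line per sale.
raw_text = """\
2016-02-01 store=NYC item="widget, large" qty=3
garbage line that does not match
2016-02-02 store=BOS item="gadget" qty=12
"""

# Assumed line pattern: date, store code, quoted item name, integer quantity.
LINE_RE = re.compile(r'^(\d{4}-\d{2}-\d{2}) store=(\w+) item="([^"]*)" qty=(\d+)$')

def text_to_csv(text):
    """Return CSV text with a header row; lines that do not match are skipped."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "store", "item", "qty"])
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            writer.writerow(match.groups())
    return out.getvalue()

csv_text = text_to_csv(raw_text)
print(csv_text)
```

Using the csv module rather than joining fields with commas means that values containing a comma, such as the first item name, are quoted so that Excel or a data base loader reads them as one field.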
4)
Introduction to relational data bases. Conceptual
background, Overview of features and functions.
Using E/R diagrams to generate “good” data base
designs. Simple queries using SQL.
Homework. E/R diagram of business case
5)
Query languages, advanced SQL, including joins and
aggregation.
Homework. Use SQL on a multi-table data base to
answer questions.
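A join with aggregation, as covered in week 5, might be sketched as follows using Python's built-in sqlite3 module. The customers and orders tables and their contents are hypothetical; the course data base and its schema are not specified here.

```python
import sqlite3

# In-memory data base with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0);
""")

# LEFT JOIN keeps customers with no orders; GROUP BY aggregates per customer.
query = """
SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY total DESC, c.name
"""
rows = conn.execute(query).fetchall()
for name, n_orders, total in rows:
    print(name, n_orders, total)
```

An inner JOIN in place of the LEFT JOIN would drop the customer with no orders from the result, which is a common source of wrong answers in multi-table questions.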
6)
Midterm (first half of class)
Business analytics tools: Excel, SAS, MATLAB, Python,
Tableau, Datameer (second half of class)
7)
Advanced SQL, introduction to non-relational
data bases: NoSQL, object-oriented, MongoDB,
HBase, …
Homework: Final project proposal due
8)
“Big Data”. How do we handle terabytes and
petabytes of unstructured data? Why don’t traditional
data base systems scale? Remember sort/merge?
Overview of Hadoop, HDFS and Map-Reduce.
Problems of handling web and social network data.
Overview of Hive. New processing models: Spark,
Tez.
Demonstration of sorting a very large file. Wordcount
example. Techniques for analyzing textual data,
TF-IDF transformation for textual data.
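The TF-IDF transformation mentioned in week 8 can be illustrated on a toy corpus (a sketch under stated assumptions). TF-IDF has several variants; this one assumes term frequency = count / document length and inverse document frequency = log(N / document frequency). The three documents and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = {
    "d1": "big data needs big storage",
    "d2": "data cleaning precedes data analysis",
    "d3": "storage is cheap",
}

def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()

def tf_idf(documents):
    """Return {doc: {term: weight}} using tf = count/len and idf = log(N/df)."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n_docs = len(documents)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())  # each document counts once per term
    weights = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        weights[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in term_counts.items()
        }
    return weights

weights = tf_idf(docs)
for term, w in sorted(weights["d1"].items(), key=lambda kv: -kv[1]):
    print(term, round(w, 3))
```

Terms that are frequent in one document but rare across the corpus get the highest weights, which is why "big" outranks "data" in the first document.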
9)
“Big Data” analytics. How do we scale data base
systems, data mining and other analytical techniques
to handle massive data bases? Overview of massively
distributed systems like Pig, Mahout, Pegasus, Hive,
HBase. (Discuss the PageRank problem.)
Homework: Run a Map-Reduce job to produce a
count of trigrams in a large textual data set,
or run Mahout to develop a recommender system.
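The logic of a trigram count in the Map-Reduce style can be sketched in plain Python, with no Hadoop involved. This is only a single-process simulation of the map, shuffle and reduce phases; the function names and sample lines are hypothetical, and it assumes trigrams do not cross line boundaries.

```python
from collections import defaultdict

def mapper(line):
    """Emit (trigram, 1) for each run of three consecutive words in a line."""
    words = line.lower().split()
    for i in range(len(words) - 2):
        yield (" ".join(words[i:i + 3]), 1)

def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Sum the counts emitted for one trigram."""
    return key, sum(values)

def trigram_counts(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())

# Hypothetical sample input.
lines = ["the quick brown fox", "the quick brown dog", "a quick brown fox"]
counts = trigram_counts(lines)
for trigram, n in sorted(counts.items()):
    print(trigram, n)
```

On a real cluster the mapper and reducer would run on many machines and the shuffle would be handled by the framework, but the per-record logic stays the same.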
10)
Data Visualization. A picture is worth a thousand
words. Show how large amounts of data can be
displayed using graphical techniques. Give examples
of some standard techniques. Treemap, Tufte …
Topological Data Analysis. Tableau example on Hive.
11)
In-class team meetings on projects.
12)
Final Project presentations. What did we learn?