CS498 – Data Mining Lab 1 - Intro
Submission Requirements
• All answers must be computer generated (including text and diagrams) and submitted in
PDF format.
• All students are required to submit their assignments via Blackboard.
• All uploaded files must include lab number, title, and student name(s).
• Files must be named using the following format: cs498-dm-lab1-urbain.pdf
• Submit lab report separately from project archive.
This lab consists of two parts: Q&A and a short introduction to working with tools
and data.
Q&A
1. Define each of the following data mining functionalities: characterization,
discrimination, association analysis, correlation analysis, classification,
regression, clustering, and outlier analysis. Give examples of each data
mining functionality, using a real-life database that you are familiar with.
2. Recent applications pay special attention to spatiotemporal data streams.
A spatiotemporal data stream contains spatial information that changes
over time, and is in the form of stream data, i.e., the data flow in-and-out
like possibly infinite streams.
(a) Identify three application examples of spatiotemporal data streams.
(b) For each of your application examples from (a), provide a short bulleted list
of the kinds of interesting knowledge that could be mined from such data
streams. Assume limited computation time and resources.
(c) Provide a concise bulleted list of some of the major challenges in
spatiotemporal data mining. For each challenge, provide a brief
description of about 2 to 3 sentences.
(d) Using one application example from (a), outline a method/approach to mine
one kind of knowledge from (b) efficiently from such stream data.
Jay Urbain, PhD
Working with tools and data
Please read through the entire lab before starting. Note: The decisions you make
for accessing the data may be dependent on what tools you would like to
experiment with.
1) You will need to access the Adult data set. You may download a comma-separated-value (csv) file of adult.data from the course outline, or you may
access the data directly through a database I’ve set up.
File snapshot:
You can load this into MySQL with “load data local infile….”. If you’re rusty with
databases, SQL, and JDBC see the cs386 course outline and feel free to ask for
help.
http://jayurbain.com/msoe/cs386/outline.html
If students would like, I can set up a network database. Let me know.
You can read this into Excel directly as a comma-separated-value (csv) file.
Database connection.
R load command.
Java or Python read file, separate each read line by comma.
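The Python option above (read the file, split each line by comma) could look roughly like the sketch below. It uses only the standard library. The column names are taken from the attribute list in this lab; the assumption that the file has no header row, and the two sample records, are hypothetical and only there so the sketch runs without the file.

```python
import csv
import io

# Column order assumed from the attribute list in this lab (income is last).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

def read_adult(lines):
    """Parse an iterable of comma-separated lines into a list of dicts."""
    rows = []
    # skipinitialspace drops the blank that follows each comma
    for fields in csv.reader(lines, skipinitialspace=True):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(COLUMNS, fields)))
    return rows

# Hypothetical records in the same layout, separated by a blank line.
sample = io.StringIO(
    "35, Private, 100000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Female, 0, 0, 40, United-States, <=50K\n"
    "\n"
    "52, Self-emp-inc, 200000, Masters, 14, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K\n"
)
rows = read_adult(sample)
print(len(rows), rows[0]["education"], rows[1]["income"])
```

To read the real file instead, pass an open file object, for example `with open("adult.data", newline="") as f: rows = read_adult(f)`. All values come back as strings, so numeric attributes such as age still need `int(...)`.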
2) Answer the following questions with respect to the Adult data set. Note: the
“income” attribute is the “classification” attribute, i.e., the attribute we are trying to
predict.
a) Review the following attributes of the data set. Identify each attribute as
numerical or categorical (one of a set of categories). If the attribute is
numerical, further classify the attribute as continuous or discrete
(countable). If the attribute is categorical, further classify the attribute as
regular categorical or ordinal (ordered).
Example:
Classification attribute (typically last attribute in most data sets)
• Income: >50K, <=50K.
Feature attributes:
• Age
• workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
• Fnlwgt
• education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate,
5th-6th, Preschool.
• education-num
• marital-status: Married-civ-spouse, Divorced, Never-married, Separated,
Widowed, Married-spouse-absent, Married-AF-spouse.
• occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
• relationship: Wife, Own-child, Husband, Not-in-family, Other-relative,
Unmarried.
• race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
• sex: Female, Male.
• capital-gain
• capital-loss
• hours-per-week
• native-country: United-States, Cambodia, England, Puerto-Rico, Canada,
Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South,
China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam,
Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador,
Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland,
Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong,
Holand-Netherlands.
b) To simplify our analysis, convert the continuous age attribute into a
categorical attribute of 8 values, i.e., 10-19, 20-29, …:
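One way to do this conversion is sketched below in Python. The helper name `age_bin` is hypothetical. The sketch assumes the 8 values are the decades 10-19 through 80-89, and it folds any age outside that range into the nearest bin; both are assumptions you may want to handle differently.

```python
def age_bin(age):
    """Map an age to a decade label such as '30-39'."""
    low = (int(age) // 10) * 10
    # Assumption: 8 bins, 10-19 ... 80-89; out-of-range ages are clamped.
    low = max(10, min(low, 80))
    return f"{low}-{low + 9}"

for age in (17, 39, 64, 90):
    print(age, age_bin(age))
```

The same idea works in SQL with integer division on the age column, or in Excel with a lookup on the tens digit.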
c) Write a utility/script/SQL to determine the association between two
categorical attributes. Compute the distribution (see the sample for education
below) and generate a histogram (using the distribution) for each categorical
attribute with respect to income. Review your results. Which individual attributes
are likely to be good predictors? Rank them in descending order.
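A minimal sketch of such a utility in Python follows, using only the standard library. The function name `distribution` and the row format (a list of dicts keyed by attribute name) are assumptions for illustration. The demo rows are built from three of the education counts in this lab's sample, and the printed bar is only a rough text histogram, not a substitute for a real plot.

```python
from collections import Counter

def distribution(rows, attribute, target="income"):
    """For each value of `attribute`, return {class: (count, fraction)}."""
    counts = Counter((row[attribute], row[target]) for row in rows)
    values = sorted({value for value, _ in counts})
    classes = sorted({cls for _, cls in counts})
    table = {}
    for value in values:
        n = sum(counts[(value, cls)] for cls in classes)
        table[value] = {
            cls: (counts[(value, cls)], counts[(value, cls)] / n)
            for cls in classes
        }
    return table

def make_rows(education, n_low, n_high):
    """Build demo rows for one education level from two class counts."""
    low = [{"education": education, "income": "<=50K"}] * n_low
    high = [{"education": education, "income": ">50K"}] * n_high
    return low + high

# Counts copied from the education sample in this lab.
rows = (make_rows("12th", 400, 33)
        + make_rows("Doctorate", 107, 306)
        + make_rows("Preschool", 51, 0))

table = distribution(rows, "education")
for value, by_class in table.items():
    count, frac = by_class[">50K"]
    bar = "#" * round(frac * 40)  # text bar scaled to 40 characters
    print(f"{value:<12}{count:>6}{frac:>6.2f}  {bar}")
```

Attributes whose values show very different >50K fractions are candidates for good predictors; how to turn that into a ranking is left open here.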
Education      <=50k   >50k      n   <=50k   >50k
10th             871     62    933    0.93   0.07
11th            1115     60   1175    0.95   0.05
12th             400     33    433    0.92   0.08
1st-4th          162      6    168    0.96   0.04
5th-6th          317     16    333    0.95   0.05
7th-8th          606     16    622    0.97   0.03
9th              481     27    508    0.95   0.05
Assoc-acdm       802    265   1067    0.75   0.25
Assoc-voc       1021    361   1382    0.74   0.26
Bachelors       3134   2221   5355    0.59   0.41
Doctorate        107    306    413    0.26   0.74
HS-grad         8826   1675  10501    0.84   0.16
Masters          764    959   1723    0.44   0.56
Preschool         51      0     51    1.00   0.00
Prof-school     5904    423   6327    0.93   0.07
Some-college    5904   1387   7291    0.81   0.19

[Figure: bar chart of the <=50k and >50k fractions for each education level, 10th through Some-college]

You may use Java, SQL, R, Python, Octave/Matlab, or Excel in any combination
to complete this assignment. I encourage you to experiment and try different
approaches. Pick something you know and pick something you would like to
learn. The following page contains links to download pages and tutorials.
Submission:
Submit a lab report as a single PDF file and an archive of your
project/scripts/SQL with a README.txt file following the guidelines listed at the
beginning of this assignment.
Your report should contain responses to parts (1) and (2) of this lab. In addition,
please provide feedback on the lab.
JMathPlot
Plotting package for Java.
https://code.google.com/p/jmathplot/
R:
Install R, then R-Studio. Startup R, select which version of R you want to use.
• http://www.r-project.org/
• http://www.rstudio.com/
Python:
Install Python on your system:
https://developers.google.com/edu/python/set-up
Great Python tutorial course: https://developers.google.com/edu/python/
To do data analysis in Python you basically need to add the NumPy and Pandas
packages. I find it a lot easier to use an integrated package such as Anaconda
or Canopy.
• https://store.continuum.io/cshop/anaconda/
• https://www.enthought.com/downloads/
Octave
A free alternative to Matlab is Octave: http://www.gnu.org/software/octave/
I’ve included a PDF with Octave install instructions on the course outline:
octave_install.pdf.
Introduction to Octave:
http://www.gnu.org/software/octave/doc/interpreter/index.html
Matlab:
You can install the student version of Matlab from the MSOE network.
Java Data Mining Package Weka
• http://www.cs.waikato.ac.nz/ml/weka/