CS498 – Data Mining
Lab 1 - Intro
Jay Urbain, PhD

Submission Requirements

• All answers must be computer generated (including text and diagrams) and submitted in PDF format.
• All students are required to submit their assignments via Blackboard.
• All uploaded files must include the lab number, title, and student name(s).
• Files must be named using the following format: cs498-dm-lab1-urbain.pdf
• Submit the lab report separately from the project archive.

This lab consists of two parts: Q&A and a short introduction to working with tools and data.

Q&A

1. Define each of the following data mining functionalities: characterization, discrimination, association analysis, correlation analysis, classification, regression, clustering, and outlier analysis. Give examples of each data mining functionality, using a real-life database that you are familiar with.

2. Recent applications pay special attention to spatiotemporal data streams. A spatiotemporal data stream contains spatial information that changes over time and is in the form of stream data, i.e., the data flow in and out like possibly infinite streams.
(a) Identify three application examples of spatiotemporal data streams.
(b) For each of your application examples from (a), provide a short bulleted list of the kinds of interesting knowledge that could be mined from such data streams. Assume limited computation time and resources.
(c) Provide a concise bulleted list of some of the major challenges in spatiotemporal data mining. For each challenge, provide a brief description (about 2 to 3 sentences).
(d) Using one application example from (a), outline a method/approach to efficiently mine one kind of knowledge from (b) from such stream data.

Working with tools and data

Please read through the entire lab before starting.

Note: The decisions you make for accessing the data may depend on which tools you would like to experiment with.

1) You will need to access the Adult data set.
You may download a comma-separated-value (csv) file of adult.data from the course outline, or you may access the data directly through a database I’ve set up.

[Figure: file snapshot]

You can load this into MySQL with “load data local infile….”. If you’re rusty with databases, SQL, and JDBC, see the cs386 course outline and feel free to ask for help.
http://jayurbain.com/msoe/cs386/outline.html

If students would like, I can set up a network database. Let me know.

You can read this into Excel directly as a comma-separated file (csv).

Other ways to access the data:
• Database connection.
• R load command.
• Java or Python: read the file and split each line on commas.

2) Answer the following questions with respect to the Adult data set. Note: the “income” attribute is the “classification” attribute, i.e., the attribute we are trying to predict.

a) Review the following attributes of the data set. Identify each attribute as numerical or categorical (one of a set of categories). If the attribute is numerical, further classify it as continuous or discrete (countable). If the attribute is categorical, further classify it as regular categorical or ordinal (ordered).

Example:

Classification attribute (typically the last attribute in most data sets):
• Income: >50K, <=50K.

Feature attributes:
• Age
• workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
• Fnlwgt
• education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
• education-num
• marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
• occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
• relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
• race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
• sex: Female, Male.
• capital-gain
• capital-loss
• hours-per-week
• native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

b) To simplify our analysis, convert the continuous age attribute into a categorical attribute of 8 values, i.e., 10-19, 20-29, …

c) Write a utility/script/SQL to determine the association between two categorical attributes. Compute the distribution (see the sample for education below) and generate a histogram (using the distribution) for each categorical attribute with respect to income. Review your results. Which individual attributes are likely to be good predictors? Rank them in descending order.
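As a starting point for parts (b) and (c), the following is a minimal Python sketch using only the standard library, not a reference solution. The function names (`age_bin`, `distribution`, `print_histogram`), the dictionary-per-record layout, and the toy rows are all assumptions made for illustration. In particular, clamping ages outside 10-89 into the first or last bin is just one possible way to end up with exactly 8 values.

```python
from collections import Counter, defaultdict

CLASSES = ("<=50K", ">50K")  # income labels as listed in the lab


def age_bin(age):
    """Map a numeric age to one of 8 decade labels, '10-19' through '80-89'.

    Assumption: ages below 10 or above 89 are clamped into the first or
    last bin so that exactly 8 labels are produced.
    """
    decade = min(max(int(age) // 10, 1), 8)
    low = decade * 10
    return f"{low}-{low + 9}"


def distribution(rows, attribute, target="income"):
    """Per attribute value: total n, class counts, and class fractions."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    table = {}
    for value in sorted(counts):
        n = sum(counts[value].values())
        table[value] = {
            "n": n,
            "counts": {c: counts[value][c] for c in CLASSES},
            "fractions": {c: counts[value][c] / n for c in CLASSES},
        }
    return table


def print_histogram(table, width=20):
    """Print one text bar per value showing the fraction with income >50K."""
    for value, stats in table.items():
        frac = stats["fractions"][">50K"]
        bar = "#" * round(frac * width)
        print(f"{value:<12} n={stats['n']:<4} {frac:.2f} {bar}")


# Made-up toy rows for illustration only, not taken from the Adult data set.
rows = [
    {"age": 39, "education": "Bachelors", "income": "<=50K"},
    {"age": 52, "education": "Bachelors", "income": ">50K"},
    {"age": 45, "education": "Bachelors", "income": ">50K"},
    {"age": 31, "education": "Masters", "income": ">50K"},
    {"age": 17, "education": "11th", "income": "<=50K"},
    {"age": 90, "education": "HS-grad", "income": "<=50K"},
]

# Part (b): derive a categorical age attribute.
for row in rows:
    row["age_group"] = age_bin(row["age"])

# Part (c): distribution of a categorical attribute with respect to income.
edu = distribution(rows, "education")
print_histogram(edu)
print_histogram(distribution(rows, "age_group"))
```

The same `distribution` call works for any categorical attribute, so it can be looped over all of them to compare how strongly each one separates the two income classes.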
Sample distribution for education with respect to income (counts, total n, and fraction of n):

Education      <=50k    >50k       n   <=50k frac.   >50k frac.
10th             871      62     933          0.93         0.07
11th            1115      60    1175          0.95         0.05
12th             400      33     433          0.92         0.08
1st-4th          162       6     168          0.96         0.04
5th-6th          317      16     333          0.95         0.05
7th-8th          606      16     622          0.97         0.03
9th              481      27     508          0.95         0.05
Assoc-acdm       802     265    1067          0.75         0.25
Assoc-voc       1021     361    1382          0.74         0.26
Bachelors       3134    2221    5355          0.59         0.41
Doctorate        107     306     413          0.26         0.74
HS-grad         8826    1675   10501          0.84         0.16
Masters          764     959    1723          0.44         0.56
Preschool         51       0      51          1.00         0.00
Prof-school     5904     423    6327          0.93         0.07
Some-college    5904    1387    7291          0.81         0.19

[Figure: bar chart of the fraction of records with income <=50k and >50k for each education level, 10th through Some-college]

You may use Java, SQL, R, Python, Octave/Matlab, or Excel in any combination to complete this assignment. I encourage you to experiment and try different approaches. Pick something you know and pick something you would like to learn. The following page contains links to download pages and tutorials.

Submission:

Submit a lab report as a single PDF file and an archive of your project/scripts/SQL with a README.txt file, following the guidelines listed at the beginning of this assignment. Your report should contain responses to parts (1) and (2) of this lab. In addition, please provide feedback on the lab.

JMathPlot

Plotting package for Java.
https://code.google.com/p/jmathplot/

R:

Install R, then R-Studio. Start up R and select which version of R you want to use.
• http://www.r-project.org/
• http://www.rstudio.com/

Python:

Install Python on your system: https://developers.google.com/edu/python/set-up
A great Python tutorial course: https://developers.google.com/edu/python/

To do data analysis in Python you basically need to add the NumPy and Pandas packages. I find it a lot easier to use an integrated package such as Anaconda or Canopy:
• https://store.continuum.io/cshop/anaconda/
• https://www.enthought.com/downloads/

Octave

A free alternative to Matlab is Octave: http://www.gnu.org/software/octave/
I’ve included a PDF with Octave install instructions on the course outline: octave_install.pdf.
Introduction to Octave: http://www.gnu.org/software/octave/doc/interpreter/index.html

Matlab:

You can install the student version of Matlab from the MSOE network.

Java Data Mining Package Weka

• http://www.cs.waikato.ac.nz/ml/weka/
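
Part 1 mentions reading the file in Java or Python and splitting each line on commas. The following is a minimal Python sketch of that option using only the standard library. It assumes the file has no header row and that its columns appear in the order the attributes are listed in part 2(a), with income last. The `COLUMNS` list, the `read_adult` helper, and the two sample lines are illustrative assumptions, so check them against your own copy of adult.data.

```python
import csv
import io

# Assumed column order: the attribute order listed in part 2(a), income last.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def read_adult(lines):
    """Parse comma-separated lines into a list of dicts keyed by COLUMNS.

    `lines` can be an open file or any iterable of strings. All values are
    kept as strings; converting numeric fields is left to the caller.
    """
    records = []
    for fields in csv.reader(lines, skipinitialspace=True):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        records.append(dict(zip(COLUMNS, (f.strip() for f in fields))))
    return records


# Two illustrative lines in the assumed layout, separated by a blank line.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
records = read_adult(sample)
print(len(records), records[0]["education"], records[0]["income"])
```

To read the real file instead of the in-memory sample, pass an open file handle, for example `with open("adult.data", newline="") as f: records = read_adult(f)`.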