STAT 462/862 Unit 2: Overview
Devon Lin
Department of Mathematics and Statistics
Queen's University, Sept 23

Topics
• Types of data, terminology
• Statistical data analysis
• Various problems
Reference: Sec 1, 2.1, 2.2 of Hastie, Tibshirani and Friedman

Auto Dataset
• http://www-bcf.usc.edu/~gareth/ISL/Auto.data
• data(Auto) in the R package ISLR
A data frame with 397 observations on the following 9 variables.
  mpg: miles per gallon
  cylinders: number of cylinders, between 4 and 8
  displacement: engine displacement (cu. inches)
  horsepower: engine horsepower
  weight: vehicle weight (lbs.)
  acceleration: time to accelerate from 0 to 60 mph (sec.)
  year: model year (modulo 100)
  origin: origin of car (1. American, 2. European, 3. Japanese)
  name: vehicle name
• Which variables impact mpg, and in what way? Treat mpg as the response and all other variables except name as the predictors.
• Predict whether a given car gets high or low gas mileage: create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.

Datasets
• http://www.statsci.org/datasets.html
• https://stat.ethz.ch/R-manual/Rdevel/library/datasets/html/00Index.html
• http://mldata.org/

What does data look like?
[Table: columns X1, X2, X3, Y; rows x1, x2, x3, x4, x5]
• Data/points/instances/examples/samples/records: rows
• Input variables/independent variables/features/attributes/covariates/predictors/regressors/factors: columns
• Output variable/outcome/response/label/dependent variable: special column that is observed or to be predicted

Data types
• Continuous: quantitative, a number like weight or length
• Discrete: qualitative, a symbol, like 'cat' or 'dog', '0' or '1', {0,1,2,3}, {small, medium, large}

Output types
• Independent identically distributed (iid)
• Spatial
• Time series
• Spatial-temporal
• Functional
• Image (matrices)
• Variable-size non-vector data (e.g.
graphs, texts)

Statistical data analysis
• Step 1: Define the problem and state the objective.
• Step 2: Collect data.
• Step 3: Analyze data: exploratory methods and confirmatory methods (includes model fitting and assessment of the model assumptions).
• Step 4: Draw conclusions and make recommendations.

Main goals of data analysis
• Estimation: build a model based on observations, or training data (x_i, y_i), i = 1, 2, ..., n, through a process called learning (or estimation).
• Prediction: use that model to predict something about data you haven't seen before that comes from the same distribution as the training data, called test data.

Main learning tasks
1. Regression: predict a continuous output variable
2. Classification: predict a discrete output variable
3. Density estimation: predict the distribution
4. Clustering: predict clusters
5. Dimensionality reduction: find a smaller set of new variables, each being a combination of the input variables
Supervised learning: predicting an output variable for which we get to see examples (regression, classification).
Unsupervised learning: predicting a target variable for which we never get to see examples (density estimation, clustering, dimensionality reduction).

Regression
[Figure: a model is learned from a training dataset with inputs X1, X2, X3 and a numeric output/response Y, then applied to a test dataset whose Y values are unknown (?).]
Classical example of a regression model: least squares regression

An example of regression
• Data: measurements of the girth, height and volume of timber in 31 felled black cherry trees. Note that girth is the diameter of the tree (in inches) measured at 4 ft 6 in above the ground.
> data(trees)
> attach(trees)
> plot(Volume ~ Girth, data = trees, log = "xy")

Classification
[Figure: a model is learned from a training dataset with inputs X1, X2, X3 and an output/target/class Y taking the values A or B, then applied to a test dataset whose Y values are unknown (?).]
An example of classification
• Data: the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
> install.packages("e1071")
> library("e1071")
> data(iris)
> attach(iris)
> ## classification mode
> # default with factor response:
> model <- svm(Species ~ ., data = iris)
> print(model)
> summary(model)

An example of density estimation
• Data: eruption time in minutes for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
> data(faithful)
> attach(faithful)
> eruptions
> d = density(eruptions, bw = "sj")
> plot(d)

Clustering
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.
• Similarity measures:
  – Euclidean distance if variables are continuous.
  – Other problem-specific measures: city-block distance, Mahalanobis distance, etc.

Illustrating clustering
[Figure: Euclidean distance based clustering in 3-D space.]
• Within-cluster distances are minimized
• Between-cluster distances are maximized

Illustrating document clustering
• Clustering points: 3204 articles of the Los Angeles Times.
• Similarity measure: how many words are common in these documents (after some word filtering).
Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278

Anomaly or Outlier Detection
• Detect significant deviations from normal behavior
• Applications:
  – Credit card fraud detection
  – Online fraud detection
  – Network intrusion detection
Typical network traffic at the university level may reach over 100 million connections per day.

Data collection
• Observational study: observe and measure the variables without changing their conditions
• Experimental design: manipulate the conditions of the experiments and control the factors
• Survey sampling: the process of choosing samples from a population
• Simulation: use a mathematical or computer model to represent the physical process

Data cleaning
• Raw data need to be accurately entered for successful evaluation of information
• Check that character variables have valid values
• Check that numeric variables are within range
• Check for missing values
• Check for and eliminate duplicates
• Check for unique values (ID variables)
• Check for invalid dates or observations
• Combine multiple files

Some questions for data analysis
• What model class to fit the data?
• Which notion of error to use? (loss functions)
• How to make sure the error on future data is minimized? (generalization)
• Which model to use?
(model selection)
• How does the model perform when its assumptions do not hold? (robustness)

Challenges of data analysis
• Scalability: a method that works on a small data set may not work on a larger one (is the algorithm efficient and practical for larger data sets?)
• Dimensionality
• Complex and massive data
• Data quality
• Data ownership and distribution
• Privacy preservation

Topics
• Regression and regularized methods
• Classification
  – Linear, quadratic, linear discriminant analysis
  – Logistic regression
  – Support vector machines
• Clustering
  – K-means, hierarchical, self-organizing maps
• Singular Value Decomposition, Principal Component Analysis
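
An example of clustering (sketch)
The K-means method listed above can be illustrated with a minimal base-R sketch, in the same spirit as the regression, classification and density estimation examples. The choice of the iris measurements, the standardization step, the number of clusters (3) and the seed are all assumptions made for illustration; they are not prescribed by the course material.

```r
# Minimal K-means sketch (assumed setup: iris data, 3 clusters).
data(iris)

# Use only the four numeric measurements; the species label is NOT used for fitting.
# Standardizing is an assumed choice so that Euclidean distance weighs variables equally.
x <- scale(iris[, 1:4])

# K-means starts from random centers, so fix a seed (arbitrary) for reproducibility
# and try several random starts, keeping the best one.
set.seed(462)
km <- kmeans(x, centers = 3, nstart = 20)

# Compare the clusters found with the known species, after the fact.
tab <- table(cluster = km$cluster, species = iris$Species)
print(tab)

# Within-cluster sum of squares (minimized by K-means) and between-cluster sum of squares.
print(km$tot.withinss)
print(km$betweenss)
```

The two sums of squares add up to the total sum of squares (km$totss), which is one way to see that making within-cluster distances small is the same as making between-cluster distances large.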