
Introduction

1. What is Data Mining?

   • According to Data Mining by Hand et al.:
     Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

   • According to en.wikipedia.org/wiki/Data_mining:
     Data mining is the principle of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations and financial analysts, but it is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data"[1] and "the science of extracting useful information from large data sets or databases".[2]

     [1]: W. Frawley, G. Piatetsky-Shapiro, and C. Matheus (Fall 1992). "Knowledge Discovery in Databases: An Overview". AI Magazine: pp. 213-228. ISSN 0738-4602.
     [2]: D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press, Cambridge, MA. ISBN 0-262-08290-X.

   • Key components:
     – data exploration and analysis
     – large amount of (observational) data
     – new and useful information

   Implied by large amount of data: Data mining involves computation-intensive methods, many of which are automated or semi-automated (think of algorithms).

2. Data Mining tasks

   (a) Exploratory Data Analysis: The goal is to explore the data without any clear ideas of what we are looking for.

   (b) Descriptive modeling: The goal is to describe all of the data in a global model.
       Cluster analysis (or data segmentation): The aim is to partition the p-dimensional space into several homogeneous groups so that similar cases are put into the same group.
   (c) Predictive modeling: The goal is to build a model that will permit the value of one variable (the response) to be predicted from the known values of other variables (the predictors).
       Classification: if the variable to be predicted is categorical.
       Regression: if the variable to be predicted is quantitative.

   (d) Discovering patterns and rules: The goal is to detect patterns and rules in the data.
       Association rules: The aim is to find joint values of the variables X = (X_1, ..., X_p) that appear most frequently in the database. In a "market basket" analysis, X_j ∈ {0, 1}, and the analysis helps identify items that are frequently purchased together.
       Another example of pattern detection is spotting fraudulent behavior by detecting regions of the space that constitute truly unusual behavior in the context of normal variability.

   (e) Retrieval by content: The user has a pattern of interest, and the goal is to find similar patterns in the data set. This task is most commonly used for text and image data sets; an example is document retrieval using a set of keywords.

3. Supervised and unsupervised learning

   Variable types: There are two variable types: (1) categorical or qualitative variables, and (2) quantitative variables. Within each type, there are two subtypes: a qualitative variable can be nominal or ordinal, and a quantitative variable can be discrete or continuous. Some examples are provided below:

   • Categorical or qualitative variable:
     – Nominal: sex, ethnicity
     – Ordinal: education (below high school, high school, college, and graduate school)

   • Quantitative variable:
     – Discrete: age (in years)
     – Continuous: weight, blood pressure

   Note that the term "categorical variable" is sometimes used to refer specifically to a nominal variable, and that the term "continuous variable" is sometimes used interchangeably with "quantitative variable".

   Now back to learning types:

   (a) Unsupervised learning or learning without a teacher
       Here we observe only features, no response.
       Data: (X_i : i = 1, ..., n), where each X_i is a p-vector
       Both tasks (b) and (d), descriptive modeling and discovering patterns and rules, fall into the unsupervised learning category.

   (b) Supervised learning or learning with a teacher
       Inputs: also called covariates, predictors, features, or independent variables, denoted by X, a p-vector
       Output: also called outcome, response, or dependent variable, denoted by Y
       Data: {(X_i, Y_i) : i = 1, ..., n}, where each X_i is a p-vector

           supervised learning = { classification, if Y is categorical
                                 { regression,     if Y is quantitative

       The data mining task (c), predictive modeling, falls into the supervised learning category.

   In supervised learning, the "student" presents an answer ŷ_i for each x_i in the data, and the supervisor or "teacher" provides either the correct answer or an error (e.g. the squared loss, (y_i − ŷ_i)^2) associated with the student's answer.

   In unsupervised learning, one has a set of n observations (x_1, ..., x_n). The goal is to directly infer the distribution of X without the help of a supervisor or teacher providing correct answers or a degree of error for each observation.

4. What will be covered in this class

   Tasks (b) and (c), descriptive and predictive modeling:
   • unsupervised learning: cluster analysis
   • supervised learning: k-nearest-neighbor classifier, classification and regression trees (CART), as well as extensions of CART such as boosting, bagging, and random forests.

5. References

   Data Mining by Hand et al.: Chapter 1
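
As a rough illustration of the supervised learning setup in Section 3 and the k-nearest-neighbor classifier listed in Section 4, the following Python sketch predicts a categorical Y by majority vote (classification) and a quantitative Y by averaging (regression), and scores a prediction with the squared loss. It is a minimal sketch and not necessarily the version presented in class: the function names, the choice of Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def k_nearest(train_x, query, k):
    """Indices of the k training points closest to query (Euclidean distance, an assumption)."""
    dists = sorted((math.dist(x, query), i) for i, x in enumerate(train_x))
    return [i for _, i in dists[:k]]


def knn_classify(train_x, train_y, query, k=3):
    """Classification: Y is categorical, so predict the most common label among the neighbors."""
    idx = k_nearest(train_x, query, k)
    votes = Counter(train_y[i] for i in idx)
    return votes.most_common(1)[0][0]


def knn_regress(train_x, train_y, query, k=3):
    """Regression: Y is quantitative, so predict the mean response of the neighbors."""
    idx = k_nearest(train_x, query, k)
    return sum(train_y[i] for i in idx) / len(idx)


def squared_loss(y, y_hat):
    """The 'teacher' feedback from Section 3: (y_i - y_hat_i)^2."""
    return (y - y_hat) ** 2


# Hypothetical toy data: n = 6 observations, each X_i a 2-vector (p = 2).
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]   # categorical response
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]   # quantitative response

predicted_label = knn_classify(X, labels, (1.1, 1.0), k=3)   # all 3 neighbors are labeled "a"
predicted_value = knn_regress(X, values, (5.0, 5.0), k=3)
loss = squared_loss(5.0, predicted_value)
print(predicted_label, predicted_value, loss)
```

The same neighbor search serves both tasks; only the way the neighbors' responses are combined changes with the type of Y.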
