Introduction
1. What is Data Mining?
• According to Data Mining by Hand et al., data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that
are both understandable and useful to the data owner.
• According to en.wikipedia.org/wiki/Data_mining, data mining is the principle of sorting through large amounts of data and
picking out relevant information. It is usually used by business intelligence
organizations and financial analysts, but it is increasingly used in the sciences to extract information from the enormous data sets generated by
modern experimental and observational methods. It has been described
as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data”[1] and “the science of extracting
useful information from large data sets or databases”.[2]
[1]: W. Frawley and G. Piatetsky-Shapiro and C. Matheus (Fall 1992).
“Knowledge Discovery in Databases: An Overview”. AI Magazine: pp.
213-228. ISSN 0738-4602.
[2]: D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining.
MIT Press, Cambridge, MA. ISBN 0-262-08290-X.
• Key components:
– data exploration and analysis
– large amount of (observational) data
– new and useful information
Implied by the large amount of data: data mining involves computationally intensive methods, many of which are automated or semi-automated (think
of algorithms).
2. Data Mining tasks
(a) Exploratory Data Analysis: The goal is to explore the data without any
clear idea of what we are looking for.
(b) Descriptive modeling: The goal is to describe all of the data in a global
model.
Cluster analysis (or data segmentation): The aim is to partition the p-dimensional space into several homogeneous groups so that similar cases
are put into the same group.
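The notes do not name a specific clustering method. As one illustration of partitioning cases into homogeneous groups, the sketch below uses k-means (Lloyd's algorithm). The function name `kmeans`, the toy data, and the choice of the first k points as initial centers are all assumptions made for this example.

```python
import math

def kmeans(points, k, n_iter=20):
    """Partition p-dimensional points into k groups (Lloyd's algorithm)."""
    # Assumption: the first k points serve as the initial centers.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a group is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical data: two well-separated groups in 2 dimensions.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
labels, centers = kmeans(points, k=2)
print(labels)
```

In practice the result can depend on the initial centers, so the algorithm is usually run from several starting points.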
(c) Predictive modeling: The goal is to build a model that will permit the
value of one variable (the response) to be predicted from the known values
of other variables (predictors).
Classification: if the variable to be predicted is a categorical variable.
Regression: if the variable to be predicted is quantitative.
(d) Discovering patterns and rules: The goal is to detect patterns and rules
in the data.
Association rules: The aim is to find joint values of the variables X =
(X1, ..., Xp) that appear most frequently in the database. In a "market
basket" analysis, Xj ∈ {0, 1} and the analysis helps identify items that
are frequently purchased together.
Another example of pattern detection is spotting fraudulent behavior by
detecting regions of the space that constitute truly unusual behavior in
the context of normal variability.
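As a rough illustration of the market basket idea described above, the sketch below counts how often each pair of items appears together when every Xj is 0 or 1. The item names, the transactions, and the helper `pair_support` are hypothetical; real association-rule methods also handle larger item sets and prune infrequent ones.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each row is one basket, each Xj is 0 or 1.
items = ["bread", "milk", "eggs", "beer"]
baskets = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
]

def pair_support(baskets, items):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for row in baskets:
        present = [name for name, x in zip(items, row) if x == 1]
        counts.update(combinations(present, 2))
    return {pair: c / len(baskets) for pair, c in counts.items()}

support = pair_support(baskets, items)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
```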
(e) Retrieval by content: The user has a pattern of interest and the goal is
to find similar patterns in the data set. This task is most commonly used
for text and image data sets. For example, document retrieval using a set
of keywords.
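A minimal sketch of keyword-based document retrieval, assuming the simplest possible similarity score: the number of query keywords a document contains. The function `retrieve` and the three sample documents are made up for illustration; practical systems use more refined weighting.

```python
def retrieve(documents, keywords):
    """Rank documents by how many of the query keywords they contain."""
    query = {w.lower() for w in keywords}
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        score = len(query & words)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [doc_id for _, doc_id in scored]

# Hypothetical document collection.
documents = {
    "d1": "clustering partitions data into homogeneous groups",
    "d2": "regression predicts a quantitative response",
    "d3": "classification trees predict a categorical response",
}
ranking = retrieve(documents, ["categorical", "response"])
print(ranking)
```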
3. Supervised and unsupervised learning
Variable types: There are two variable types: (1) categorical or qualitative
variables, and (2) quantitative variables. Within each type, there are two subtypes: a qualitative variable could be nominal or ordinal, and a quantitative
variable could be discrete or continuous. Some examples are provided below:
• Categorical or qualitative variable:
– Nominal: sex, ethnicity
– Ordinal: education (below high school, high school, college, and graduate school)
• Quantitative variable:
– Discrete: age (in years)
– Continuous: weight, blood pressure
Note that the term categorical variable is sometimes used to refer to a nominal
variable, and the term continuous variable is sometimes used interchangeably with
quantitative variable.
Now back to learning types:
(a) Unsupervised learning or learning without a teacher
Here we observe only features, no response.
Data: {Xi : i = 1, ..., n}, where each Xi is a p-vector
Both tasks (b) and (d), descriptive modeling and discovering patterns and
rules, fall into the unsupervised learning category.
(b) Supervised learning or learning with a teacher
Inputs: also called covariates, predictors, features, or independent variables, denoted by X, a p-vector
Output: also called outcome, response, or dependent variable, denoted
by Y
Data: {(Xi, Yi) : i = 1, ..., n}, where each Xi is a p-vector


\[
\text{supervised learning} =
\begin{cases}
\text{classification}, & \text{if } Y \text{ is categorical} \\
\text{regression}, & \text{if } Y \text{ is quantitative}
\end{cases}
\]
The data mining task (c), predictive modeling, falls into the supervised
learning category.
In supervised learning, the "student" presents an answer ŷi for each xi
in the data, and the supervisor or "teacher" provides either the correct
answer or an error (e.g. the squared loss, (yi − ŷi)^2) associated with the
student's answer. In unsupervised learning, one has a set of n observations
(x1, ..., xn). The goal is to directly infer the distribution of X without
the help of a supervisor or teacher providing correct answers or a degree of
error for each observation.
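To make the teacher's feedback concrete, the sketch below computes the squared loss (yi − ŷi)^2 for each observation and averages it over the data. Averaging is an assumption added here, since the notes define only the per-observation loss; the function name and the numbers are illustrative.

```python
def mean_squared_loss(y, y_hat):
    """Average of the squared losses (y_i - y_hat_i)^2 over the data."""
    if len(y) != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / len(y)

# Hypothetical responses and the student's answers.
y = [3.0, 5.0, 2.0]
y_hat = [2.0, 5.0, 4.0]
loss = mean_squared_loss(y, y_hat)
print(loss)
```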
4. What will be covered in this class
tasks (b) and (c)
unsupervised learning: cluster analysis
supervised learning: k-nearest-neighbor classifier, classification and regression
trees (CART), as well as extensions of CART such as boosting, bagging, and
random forests.
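As a preview of the first supervised method listed above, here is a minimal sketch of a k-nearest-neighbor classifier, assuming Euclidean distance and a simple majority vote. The function `knn_predict` and the training data are hypothetical and are not taken from the course material.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Sort training indices by Euclidean distance to the new point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x_new))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: features Xi (2-vectors) and categorical responses Yi.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
train_y = ["a", "a", "a", "b", "b", "b"]
pred_low = knn_predict(train_x, train_y, (1.1, 1.0))
pred_high = knn_predict(train_x, train_y, (4.9, 5.0))
print(pred_low, pred_high)
```

Because Y is categorical here, this is a classification example; averaging the neighbors' responses instead of voting would give a regression analogue.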
5. References:
Data Mining by Hand et al.: Chapter 1