Introduction - University of Essex

Introduction
Slide 1 of 21
CE802
MACHINE LEARNING AND DATA MINING
INSTRUCTOR: PAUL SCOTT
P.D.Scott
University of Essex
Slide 2 of 21
INTRODUCTION
COURSE ORGANISATION
WHAT IS MACHINE LEARNING?
WHAT IS DATA MINING?
Slide 3 of 21
COURSE ORGANISATION
SYLLABUS
ASSESSMENT
RESOURCES
“WHAT DO I NEED TO KNOW?”
LECTURES
CONTACT INFORMATION
Slide 4 of 21
SYLLABUS
 Introduction
 Learning to classify
 Learning to predict numeric values
 Evaluating learning procedures
 Association rules
 Clustering
 Reinforcement learning
 Multiple learners
 (Neural networks)
Slide 5 of 21
ASSESSMENT
EXAMINATION
A two-hour examination early in the summer term.
Worth 60% of module mark.
TWO ASSIGNMENTS
Assignment 1:
Worth 15% of module mark.
Due Week 9
Assignment 2:
Worth 25% of module mark.
Due Week 16 (i.e. after Xmas break)
Slide 6 of 21
RESOURCES
COURSE NOTES
Hard copy distributed in class
Electronic version available on module
web site
WEKA
A freely available, open-source machine learning toolbox
MODULE WEB SITE
Lots of information about the module
including lists of relevant reading
material and web sites.
Slide 7 of 21
“WHAT DO I NEED TO KNOW?”
“How much programming is there in the
course?”
This is not a programming course.
You will not be assessed on your
ability to program.
“How much maths do I need to know?”
Most of the maths will be basic
probability and statistics.
Occasionally I will use vectors and a
little calculus but this will not be
examinable.
Slide 8 of 21
LECTURES
Lectures are scheduled as follows:
Wednesday   11:00 am - 12:50 pm   5S.3.2
Friday      3:00 pm - 3:50 pm     4.336
Normally the Wednesday session will be a
lecture and the Friday session a problem class.
There are no labs for this module. However,
you are expected to spend some time
familiarising yourself with software relevant to
the assignment – notably WEKA.
Slide 9 of 21
CONTACT INFORMATION
My office:
1NW.3.19 (Networks Building, same level
as Square 1)
My email address:
[email protected]
My office phone:
Ext 2015
Appointments:
Email or phone – I can usually find a time
within 24 hours.
Slide 10 of 21
WHAT IS MACHINE LEARNING?
Machine learning is that branch of artificial intelligence
concerned with getting computers to learn from experience.
Very roughly:
Programming is telling a machine what to do.
Machine learning is showing a machine what you want it to
do and expecting it to figure out how to do it.
But actually it can be more complicated than that.
Sometimes we cannot even show the machine what we
want because we do not really know. A smart machine can
even help in this situation.
Slide 11 of 21
VERY BRIEF HISTORY OF MACHINE LEARNING
Machine learning has been part of artificial intelligence since
before the subject began
Late 1940s
Turing’s refutation of Lady Lovelace’s Objection
(“computers cannot be intelligent because they only do
what their programmers tell them to”) was to suggest that
computers could learn for themselves.
Early 1950s
Arthur Samuel built a checkers (called “draughts” in UK)
playing program that improved its performance through
playing and reached state championship level.
Middle 1950s – Middle 60s
First neural network learning systems.
Some work on symbolic learning systems.
Late 1960s
Eclipse of neural networks as a result of wildly overoptimistic claims about their capabilities.
Early 1970s – Middle 1970s
Very little done in machine learning – most researchers
believed they must first solve “the representation problem”.
Late 1970s
Renaissance of machine learning as potential solution to
the “knowledge bottleneck” in expert systems.
Middle 1980s
Renaissance of neural net approaches
Late 1980s – Present
Slide 12 of 21
KEY ELEMENTS OF A MACHINE LEARNER
Learning is often defined as improving performance at some
task so there must be
A task
An associated performance measure
Since learning is defined as deriving from experiences or
examples, there must be
A set of examples
A representation format for examples
Learning must produce some information structure that can be
used to perform the task.
This information structure can be viewed as a model of the
domain in which learning takes place.
This in turn implies the existence of a set of possible
models from which the one produced is selected.
This is determined by the model representation which thus
restricts the set of models that can be considered
The process of learning itself can then be viewed as a search of
the space of possible representations for a model that
maximises the performance measure.
The restrictions of the representation and the strategy used by
the search process lead to inductive bias – favouring some
possible models over others.
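The view of learning as a search through a space of models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy examples, the threshold-style model representation and the candidate list are all invented here and are not part of the course material. The task is to predict pass/fail, the performance measure is accuracy, and the search is a simple exhaustive one.

```python
from typing import Callable, List, Tuple

# Hypothetical examples: (hours studied, passed the exam).
Example = Tuple[float, bool]
EXAMPLES: List[Example] = [
    (1.0, False), (2.0, False), (3.0, False),
    (4.0, True), (5.0, True), (6.0, True),
]

# Model representation: "predict True when x >= threshold".
# Restricting models to this form is one source of inductive bias.
def make_model(threshold: float) -> Callable[[float], bool]:
    return lambda x: x >= threshold

# Performance measure: fraction of examples classified correctly.
def accuracy(model: Callable[[float], bool], examples: List[Example]) -> float:
    correct = sum(1 for x, label in examples if model(x) == label)
    return correct / len(examples)

# Learning as search: try every candidate model and keep the best one.
def learn(examples: List[Example], candidate_thresholds: List[float]) -> float:
    return max(candidate_thresholds,
               key=lambda t: accuracy(make_model(t), examples))

CANDIDATES = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
best = learn(EXAMPLES, CANDIDATES)
print(best, accuracy(make_model(best), EXAMPLES))
```

Real learners rarely search exhaustively; the point is only that the task, the examples, the model representation and the performance measure each appear as a separate, replaceable piece.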
Slide 13 of 21
A TAXONOMY OF LEARNING TASKS
Learning to classify
Given a set of training examples and their associated
classes
Learn to correctly predict the classification of unclassified
examples.
e.g. A parent teaching a child to recognise animals by
showing him/her pictures.
Learning to predict numerical values (regression)
Given a set of training examples and associated numerical
values
Learn to correctly predict the numerical value for other
examples in which it is not known.
e.g. Learning to predict tomorrow’s temperature from
meteorological records.
Classification and regression are together often called “supervised
learning”.
Learning to form groups (clustering)
Given a set of unclassified examples
Develop a “sensible” scheme for classifying them.
e.g. This slide!
Clustering is often called “unsupervised learning”.
Learning what to do next (reinforcement learning)
Given the experience of engaging actively in a task
Learn to improve performance when engaged in similar
tasks in future
e.g. Samuel’s checkers player
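To make the difference between the first three task types concrete, here is a rough Python sketch. The data values are made up, and the two techniques used (nearest neighbour for the supervised tasks, splitting at the largest gap for clustering) are chosen only because they are short; they are not claimed to be the methods taught in this course. Reinforcement learning is omitted because it needs an interactive task.

```python
# Supervised learning: every training example carries a target.
def nearest_neighbour_predict(training, query):
    """Return the target of the training example closest to the query."""
    _, target = min(training, key=lambda pair: abs(pair[0] - query))
    return target

# Classification: targets are classes (hypothetical body weights in kg).
animals = [(4.0, "cat"), (5.0, "cat"), (30.0, "dog"), (45.0, "dog")]
predicted_class = nearest_neighbour_predict(animals, 6.0)

# Regression: targets are numbers (hypothetical today/tomorrow temperatures).
temperatures = [(10.0, 11.0), (15.0, 14.5), (20.0, 21.0)]
predicted_value = nearest_neighbour_predict(temperatures, 16.0)

# Clustering: no targets at all; the learner must invent the groups.
def split_at_largest_gap(values):
    """Form two groups by cutting the sorted values at the widest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

groups = split_at_largest_gap([1.0, 1.2, 1.1, 8.0, 8.3])
print(predicted_class, predicted_value, groups)
```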
Slide 14 of 21
ATTRIBUTE TYPES
Levels of Measurement
Statisticians classify variables by what they term “level of
measurement”. There are three main groups:
Nominal
The values taken by nominal variables simply define a
mutually exclusive set of categories.
No other relationship is assumed between members of the
value set.
For example: nationality, favourite colour, make of car.
In machine learning these are often termed categorical
attributes.
Ordinal
The values of ordinal variables are totally ordered so they
may be ranked, but there is no other numerical
significance to the intervals between them.
For example: degree classification.
In machine learning it is common to treat ordinals as if
they were either nominal or interval attributes.
Interval-Ratio
Interval-ratio variables have the properties of ordinal
variables and, in addition, the interval between values has
an arithmetic meaning.
Consequently, arithmetic operations may be applied to
them.
For example: age, number of siblings, income.
In machine learning these are often called continuous or
numeric attributes.
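The three levels of measurement usually lead to different encodings before a learning procedure sees the data. The following is a minimal sketch, assuming a one-of-n encoding for nominal values and a rank encoding for ordinal values; the category lists and the example record are hypothetical.

```python
# Hypothetical value sets for illustration only.
NATIONALITIES = ["British", "French", "German"]                      # nominal
DEGREE_CLASSES = ["Third", "Lower Second", "Upper Second", "First"]  # ordinal, low to high

def encode_nominal(value, categories):
    """One indicator per category: no order or distance is implied."""
    return [1 if value == category else 0 for category in categories]

def encode_ordinal(value, ordered_categories):
    """Use the rank. This treats the ordinal as if it were interval data."""
    return ordered_categories.index(value)

record = {"nationality": "French", "degree": "Upper Second", "age": 23}

features = (
    encode_nominal(record["nationality"], NATIONALITIES)
    + [encode_ordinal(record["degree"], DEGREE_CLASSES)]
    + [record["age"]]  # interval-ratio values can be used as they are
)
print(features)
```

Encoding the degree class with `encode_nominal` instead would correspond to the other common choice mentioned above, treating an ordinal as nominal.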
Slide 15 of 21
WHAT IS DATA MINING?
It is not easy to provide a concise definition of data mining
because:
the field is new and steadily evolving
practitioners, researchers and authors differ on the precise
boundaries they draw around the area.
It is easier to approach the subject in terms of the types of
problems it is intended to address.
The Problem/Challenge
In the last few decades computing technology has enabled
organizations to accumulate huge quantities of data about
the domains in which they operate.
These huge archives are the result of a major investment of
resources.
The archives may contain much potentially valuable
information about the relationships between objects
represented in the data.
In the past, these archives have been underutilised because
of the difficulty of discovering such relationships.
A Broad Definition
Data mining is the development and application of
computer tools to assist in the discovery of useful or
interesting relationships in large databases.
Slide 16 of 21
Some Examples
Marketing
A businessman who knows his customer well enough to know
what his customer wants will be a successful businessman.
Marketing costs contribute a significant fraction to the
price of many products.
Much of this money is wasted in that it goes to people
unlikely to buy the product.
If you want to sell luxury cars, it is probably stupid to:
Send mailshots to districts where the average
income is very low.
Put TV ads out during children’s Saturday morning
cartoons.
The more you know about the type of people who might
be likely to buy your product, the more effectively you can
use your marketing budget.
The above examples of bad marketing are obvious but
would it be better to:
Send mailshots to affluent suburbs or rural areas?
Put TV ads out during football matches or news
programmes?
Companies have quite a lot of information about the type
of people who buy their products but they would benefit
from even more.
Hence the “loyalty card” in supermarkets and other chain
stores.
Slide 17 of 21
Loyalty Cards
These cards supply vast amounts of information about the
products customers buy.
In particular they tell the company what combinations of
products are frequently bought.
This provides an opportunity for cross selling.
Exploration of transaction data to discover groups of products
purchased together is known as market basket analysis or
affinity grouping.
Of course, many such affinities are obvious. e.g.
People who buy paint often buy paintbrushes
But many are much more surprising:
A Much Quoted Example
A US supermarket applied data mining techniques to loyalty
card transaction data and found
On Thursdays and Fridays customers frequently
purchase diapers (UK nappies) and six-packs of beer.
Of course, once the rule is discovered it is often easy to find
an explanation. e.g.
Couples with babies cannot go out so easily so are
more likely to spend Saturday evenings in drinking
beer in front of the TV.
Nevertheless this relationship was not deduced (cf.
explanation-based learning).
The supermarket concerned arranged that beer should be
displayed near the diapers and hence boosted its beer sales.
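Market basket analysis can be sketched as counting how often products appear together. The transactions below are invented, and the two measures used, support (the fraction of baskets containing both items) and confidence (the fraction of baskets containing one item that also contain the other), are the usual association-rule terms rather than anything defined on these slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical loyalty-card transactions, one set of products per basket.
TRANSACTIONS = [
    {"diapers", "beer", "crisps"},
    {"diapers", "beer"},
    {"paint", "paintbrushes"},
    {"diapers", "milk"},
    {"beer", "crisps"},
]

def pair_counts(transactions):
    """Count how many baskets contain each pair of products."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def support(pair, transactions):
    """Fraction of all baskets that contain both products."""
    return sum(1 for basket in transactions if set(pair) <= basket) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the baskets containing the antecedent, the fraction with the consequent."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(1 for b in with_antecedent if consequent in b) / len(with_antecedent)

counts = pair_counts(TRANSACTIONS)
print(counts.most_common(3))
print(support(("beer", "diapers"), TRANSACTIONS))
print(confidence("diapers", "beer", TRANSACTIONS))
```

Counting every pair like this is only practical for small product ranges; real tools prune the search.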
Slide 18 of 21
Bioinformatics
The Problem
The human genome project and similar undertakings are
generating vast amounts of information about the DNA
sequences of humans and animals.
This information is even less useful than a core dump of a
huge computer program unless it can be interpreted.
It is necessary to locate genes and translate the DNA
sequences into the corresponding sequences of amino acids
– that is, the proteins they encode.
Unfortunately, knowing the sequence of amino acids doesn’t
tell you much about a protein’s shape and hence its function.
The Solution
Suppose you already know the function (and possibly the
structure) of a family of proteins.
You can use data mining or statistical techniques to discover
the general characteristics of the sequences of all known
members of that protein family.
For example, you might construct a Hidden Markov Model.
This provides you with a “template” of the typical member of
the family.
This template can then be matched against a database of
protein sequences to discover other proteins that are likely to
be members of the same family and hence have similar
functions.
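The template idea can be illustrated with something much simpler than a Hidden Markov Model. The sketch below swaps in a position-specific frequency profile: it assumes the family sequences are already aligned and of equal length, and the sequences themselves are made up. It is a stand-in for the matching step only, not a description of how real protein family models are built.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical, pre-aligned members of one protein family.
FAMILY = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]

def build_profile(family, pseudocount=1.0):
    """Estimate, for each position, the probability of each amino acid."""
    total = len(family) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for position in range(len(family[0])):
        counts = Counter(sequence[position] for sequence in family)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return profile

def log_score(sequence, profile):
    """Log-probability of the sequence under the profile (higher is closer)."""
    return sum(math.log(column[aa]) for aa, column in zip(sequence, profile))

profile = build_profile(FAMILY)

# Hypothetical database of sequences to match against the template.
candidates = ["MKVLA", "WWWWW", "MRILG"]
ranked = sorted(candidates, key=lambda s: log_score(s, profile), reverse=True)
print(ranked)
```

The pseudocount keeps unseen amino acids from receiving probability zero, so an unusual sequence scores badly rather than causing an error.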
Slide 19 of 21
So What Is Data Mining?
Statisticians draw a distinction between:
Confirmatory Data Analysis
The objective is to determine whether or not a set of
data supports a given hypothesis.
Example:
Analysing a set of opinion poll data to confirm the
hypothesis that women are more likely than men to
vote Conservative.
Exploratory Data Analysis
The objective is to discover relationships in a set of
data.
Example:
Investigating a set of opinion poll data and
discovering that a greater proportion of women
than men said they would vote Conservative.
The distinction is not rigid: many tools can be used for both.
Traditional statistics has placed a greater emphasis on
confirmatory than exploratory analysis.
An Alternative Definition
Data mining could be defined as the use of computer
tools for exploratory data analysis in large data sets.
As such it draws on three parent areas:
Machine Learning
Statistics
Visualization
Slide 20 of 21
Stages in Data Mining
The application of appropriate statistical, machine learning
and/or visualization techniques is actually only a small part of
most data mining projects.
Typically it takes less than 10% of the effort.
Main Stages
1. Define the problem
2. Identify the relevant sources of data
3. Access the selected data sources
4. Combine the data sources
5. Apply appropriate data mining tools
6. Inspect results and, if necessary, return to an earlier
stage.
In practice stages 2, 3 and 4 dominate the work because:
Data is always in mutually incompatible formats (Murphy’s
Law [1]).
None of these formats are suitable for the data mining
tools you plan to use (Murphy’s Law again).
The data will contain inconsistent, incomplete or
erroneous elements (Murphy’s Law yet again).
All of this means a major effort in data transformation and
cleaning is often needed before the actual data mining can
begin.
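The kind of transformation and cleaning work described above might look like the following sketch. Everything in it is hypothetical: the two sources, their field names, their date formats and the validity rule are invented to show incompatible formats being combined and incomplete or erroneous records being dropped.

```python
from datetime import datetime

# Two hypothetical sources holding the same kind of data in different formats.
SOURCE_A = [
    {"customer_id": "17", "date": "2004-10-08", "spend": "23.50"},
    {"customer_id": "18", "date": "2004-10-09", "spend": ""},      # incomplete
]
SOURCE_B = [
    {"CustID": 19, "Date": "08/10/2004", "Amount": 12.0},
    {"CustID": 20, "Date": "09/10/2004", "Amount": -5.0},          # erroneous
]

def from_source_a(row):
    """Convert a source A row to the common format, or None if incomplete."""
    if not row["spend"]:
        return None
    return {
        "customer_id": int(row["customer_id"]),
        "date": datetime.strptime(row["date"], "%Y-%m-%d").date(),
        "spend": float(row["spend"]),
    }

def from_source_b(row):
    """Convert a source B row (day/month/year dates) to the common format."""
    return {
        "customer_id": int(row["CustID"]),
        "date": datetime.strptime(row["Date"], "%d/%m/%Y").date(),
        "spend": float(row["Amount"]),
    }

def is_valid(record):
    """Assumed rule: a record must exist and cannot have negative spend."""
    return record is not None and record["spend"] >= 0

def combine(source_a, source_b):
    converted = [from_source_a(r) for r in source_a] + \
                [from_source_b(r) for r in source_b]
    return [record for record in converted if is_valid(record)]

clean = combine(SOURCE_A, SOURCE_B)
print(clean)
```

Even in this tiny case most of the code deals with formats and bad records rather than with any analysis.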
[1] Murphy’s Law: “If anything could possibly go wrong, it will.”
Slide 21 of 21
Data Mining Tools
A large proportion of commercial data mining is done using
specialised packages providing a variety of alternative
procedures that can be applied to the data.
Typically they fall into one of two groups:
Visualization Tools
These provide a range of techniques for visual display of
data points.
For example, 3D rotatable scatter plots.
Such tools enable the user to exploit the powerful pattern
recognition capabilities of the eye to detect regularities.
Analytic Tools
These provide a range of machine learning and statistical
techniques.
Typical facilities may include:
Data transformation routines
Simple visualization tools
Naïve Bayes Classifier
Decision tree induction
Back propagation neural nets
Kohonen nets
Multiple regression
A graphical programming language that enables the
user to construct sequences of operations using icons
and a mouse.