Data Mining
A Brief Overview
Copyright © Curt Hill 2003-2016
The Problem
• Huge volumes of data overwhelm
traditional methods of data analysis
such as:
• Spreadsheets
• Ad hoc queries
• Multidimensional analysis tools
• Statistical analysis packages
What is Data Mining?
• Exploratory data analysis based on a data
warehouse
– Knowledge Discovery in Databases (KDD)
• Data Mining extracts previously unknown
and potentially useful information
– Rules, constraints, correlations, patterns,
signatures and irregularities
• The goal is to automate the methods for
finding these in the data
Data Warehouse
• A database usually separated from the
operational database(s)
• Used as a base for decision support
systems
– Upper and middle management
– Not used for day-to-day management but for
spotting trends and making path decisions
• Typically very large and composed of
recent copies from the operational
database(s)
• Data Mining is one of the applications that
could use it
Goals of Data Mining
• Prediction of future behaviors
– Seasonal or non-seasonal trends
– How will consumers respond to
discounts?
– Allows the enterprise to be ready
• Identification of an item, event or
activity
– Intruders may be identified by the files
they access or programs they use
Goals Again
• Classification of categories of users
or products
– Shoppers may be categorized as:
• Discount seeking
• Rush
• Regular
• Attached to certain brand names
– The store may be made more friendly to
each such category
• Optimize the use of time, space,
materials and money
Knowledge Discovery
• There are several types of
discoverable knowledge
– Association Rules
– Classification hierarchies
– Sequential patterns
– Time series patterns
– Clustering
• Each of these needs more
information
Association Rules
• What we are looking for is
knowledge of associations that are
not obvious
• This has gained traction in market
basket research
– Very profitable information
• If an MRI has characteristics a and b
then it often has c
– This is an association rule
Market Basket Model
• Premise: the items in a checkout
transaction are not random
• Thus we analyze customer
transactions for patterns or
association rules
• These patterns may guide decisions
on
– Sale items
– Shelf arrangement or product
placement
Retail Example
• A young father goes to the store to buy
disposable diapers
• On his way through the store he sees a
Sports Illustrated and buys it
• In general, people do not impulse buy
disposable diapers, but while buying
these, they may buy something else on
impulse
• Can we examine retail transaction
records and perceive the connection?
Association Rule
• Is of the form: X => Y
– Where both X and Y could be sets of items
• The support of this rule is the percent of
total transactions that have both X and Y
• The confidence of this rule is the number
of transactions that have both X and Y
divided by the number of transactions
that have X
• High support and high confidence
indicate a rule on which business
decisions may be based
– Put magazine rack on the route to the diapers
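A minimal sketch of how support and confidence could be computed for a rule X => Y, using the standard definitions (support: the fraction of all transactions containing both X and Y; confidence: of the transactions containing X, the fraction that also contain Y). The transaction data and the function names are made up for illustration.

```python
# Hypothetical transactions; each is the set of items on one checkout ticket.
transactions = [
    {"diapers", "magazine", "milk"},
    {"diapers", "magazine"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"magazine"},
]

def support(x, y, transactions):
    """Fraction of all transactions that contain every item in X and in Y."""
    both = x | y
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    has_x = [t for t in transactions if x <= t]
    if not has_x:
        return 0.0  # no transaction contains X, so the rule never applies
    return sum(1 for t in has_x if y <= t) / len(has_x)

x, y = {"diapers"}, {"magazine"}
print(support(x, y, transactions))     # 2 of the 5 transactions hold both
print(confidence(x, y, transactions))  # 2 of the 3 diaper transactions
```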
Agriculture Example
• Landsat satellites are in polar orbits
• They record data on all land every 18
days
• A pixel is approximately 31 yards on a
side
• Seven bands from visible light to
thermal infrared are recorded for each pixel
• Each band produces a 1 byte value
• Can you get this data in a spreadsheet?
Agricultural Rule
• In middle summer a near infrared
value in the range 48 to 255 and a red
value in the range 0 to 31 suggest that
the yield will be 128 to 255 bushels
per acre
• If the support and confidence are
high this suggests that the farmer
should apply nitrogen to the areas
where near infrared was less than
48 and red was greater than 31
Computational Difficulties
• Consider how many tickets a
supermarket or department store
might generate
• In general, most of these tickets
have more than two or three items
• The store carries thousands of items
• Discovering these association rules
becomes computationally taxing
• One good reason to keep this work off
the operational database
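A rough back-of-the-envelope calculation shows why this gets expensive. The catalogue size below is an assumption (the slide only says "thousands of items"); the point is how quickly the number of candidate itemsets grows before a single ticket has been scanned.

```python
import math

n_items = 5000  # assumed catalogue size, for illustration only

pairs = math.comb(n_items, 2)    # candidate 2-item sets
triples = math.comb(n_items, 3)  # candidate 3-item sets

print(f"{pairs:,} candidate pairs")
print(f"{triples:,} candidate triples")
```

With these assumed numbers there are roughly 12.5 million candidate pairs and about 20.8 billion candidate triples, each of which would have to be checked against every ticket in a naive approach.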
Algorithm Properties
• There are a number of algorithms
for finding these rules
• These typically exploit two
properties:
• Downward closure
• Every subset of a large itemset must also
have large support
• Removing a few items does not hurt
• Antimonotonicity
• Every superset of a small itemset must have
small support
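The two properties can be sketched as a level-wise search in the style of the Apriori algorithm: itemsets of size k are built only from large itemsets of size k-1, and a candidate is dropped without counting if any of its subsets is already known to be small. This is an illustrative sketch, not necessarily the algorithm the slides have in mind; the function name, the baskets and the 0.4 threshold are all assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: single items that meet the threshold.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join: union pairs of large (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune: a candidate with any small (k-1)-subset cannot be large.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical checkout tickets.
baskets = [
    {"diapers", "magazine", "milk"},
    {"diapers", "magazine"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"magazine"},
]
result = frequent_itemsets(baskets, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is where the saving comes from: support is only counted for candidates that survive it.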
Classification
• Classifying data into predetermined
groups
• Then we can deal with the groups in
different ways
• AKA supervised learning
– Developed in the field of Artificial Intelligence
• Clustering, by contrast, attempts
to classify data into groups
that are not predetermined
Models
• The two typical models are decision
trees and a set of rules
• We look at the data to build the
model and then use the model for
new data
• Consider in the next slide a decision
tree for granting a credit card to an
applicant
Example: Decision Tree
• Married: Yes
– Salary <25K: Poor
– Salary 25K to 75K: Fair
– Salary >75K: Good
• Married: No
– Balance <5K: Poor
– Balance >5K, Age <25: Fair
– Balance >5K, Age >25: Good
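Once built, a decision tree is just a chain of nested tests. The sketch below encodes one reading of the credit card tree: married applicants are rated by salary, unmarried applicants by balance and then by age. Treating the middle salary band as Fair, and the handling of values that fall exactly on a boundary, are assumptions; the function and parameter names are made up for illustration.

```python
def credit_rating(married, salary, balance, age):
    """Walk the credit card decision tree and return Poor, Fair or Good."""
    if married:
        # Married applicants are judged on salary alone.
        if salary < 25_000:
            return "Poor"
        if salary > 75_000:
            return "Good"
        return "Fair"  # assumed rating for the middle salary band
    # Unmarried applicants: balance first, then age.
    if balance < 5_000:
        return "Poor"
    return "Fair" if age < 25 else "Good"

print(credit_rating(married=True, salary=50_000, balance=0, age=40))
print(credit_rating(married=False, salary=0, balance=8_000, age=22))
```

Note that each path through the tree looks at only a few attributes; a married applicant's balance and age are never consulted.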
Clustering
• AKA unsupervised learning
• Classify the data into groups that
you are not aware of to begin with
• A distance function must be
supplied that describes the distance
between two points
– The points are often not purely numeric
– They are often not in 2 dimensions or
even 3 which makes things interesting
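One possible distance function for records that mix numeric and non-numeric fields is sketched below, in the spirit of Gower's distance: numeric fields contribute a range-scaled difference and every other field contributes 0 when equal and 1 when different. The field names, the ranges and the function name are assumptions for illustration.

```python
def mixed_distance(a, b, numeric_ranges):
    """Average per-attribute distance between two records (dicts).

    Numeric attributes contribute |a - b| scaled by the attribute's range;
    all other attributes contribute 0 if equal and 1 if different.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

ranges = {"age": 60, "salary": 100_000}  # assumed spread of each numeric field
p = {"age": 30, "salary": 40_000, "married": True, "brand": "A"}
q = {"age": 45, "salary": 40_000, "married": True, "brand": "B"}
print(mixed_distance(p, q, ranges))
```

Scaling by the range keeps a large-valued field such as salary from swamping the others; a clustering algorithm would then group records whose pairwise distances are small.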
Applications
• Marketing
– Determine advertising, store
placement, segmentation of customers
• Finance
– Analysis of performance of securities
• Manufacturing
– Optimizing resources, designing the
manufacturing process
• Health Care
– Discovery of items in X-Ray and MRI
images
Example
• Certain diseases switch on genes
characteristic to that disease
• Drugs often switch off a gene
• In 2011 a database of genes and what
affected them was mined
• As a result, mice with
small cell lung cancer were
treated with an antidepressant,
imipramine
– The tumors were reduced
Telco Example
• A local telephone company mines its
connection data for possible
marketing opportunities
• A phone very busy in the 3PM to 6PM
range suggests a teenager
– Pitch a teen phone
• A phone busy in the 9AM to 5PM range
suggests a home business
– Pitch a business line
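The two telco rules could be applied to a line's hourly call counts roughly as follows. The 50% threshold for "very busy", the order in which the rules are tried, and the function name are assumptions, not something the slides specify.

```python
def suggest_pitch(calls_by_hour, busy_share=0.5):
    """Suggest a marketing pitch from a line's calls per hour of day (0-23)."""
    total = sum(calls_by_hour)
    if total == 0:
        return None
    afternoon = sum(calls_by_hour[15:18])  # 3PM up to 6PM
    workday = sum(calls_by_hour[9:17])     # 9AM up to 5PM
    # The afternoon rule is tried first because the two windows overlap.
    if afternoon / total >= busy_share:
        return "teen phone"
    if workday / total >= busy_share:
        return "business line"
    return None

# Hypothetical usage profiles.
teen = [0] * 24
teen[15] = teen[16] = teen[17] = 10
teen[20] = 5

office = [0] * 24
for hour in range(9, 17):
    office[hour] = 5

print(suggest_pitch(teen))
print(suggest_pitch(office))
```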
Social Media
• Publicly viewable social media
presents a very large quantity of
data
• However it is:
– Noisy
– Unstructured
– Dynamic
• It is of great interest in political
campaigns, marketing and health care
– This is where people express things
first
Data Scientists
• Has a nicer ring than knowledge
workers but is a similar position
• A 2016 survey considered how they
spend their time:
– Cleaning and organizing data 60%
– Collecting data sets 19%
– Mining data for patterns 9%
– Refining algorithms 4%
– Building training sets 3%
– Other 5%
• Data janitors
Skills
• According to the same survey, the
skills most in demand are:
• SQL – Structured Query Language
• Hadoop – framework for distributed storage
and processing of big data
• Python – programming language
• Java – programming language
• R – programming language
• Hive – data warehouse system on Hadoop
with an SQL-like query language
• MapReduce – programming model for processing
data in parallel across many machines
• NoSQL – class of non-relational databases
• Pig – system to analyze big data
• SAS – Statistical Analysis System
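To give a feel for the MapReduce idea, here is a single-process sketch that counts how often each item appears across checkout tickets. In a real system the map and reduce calls would run in parallel on many machines and the framework would do the grouping; the function names and the data here are purely illustrative and are not the Hadoop API.

```python
from collections import defaultdict

def map_phase(ticket):
    """Emit one (item, 1) pair per item on a checkout ticket."""
    return [(item, 1) for item in ticket]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

# Hypothetical tickets.
tickets = [["diapers", "magazine"], ["diapers", "bread"], ["milk", "bread"]]
emitted = [pair for t in tickets for pair in map_phase(t)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
print(counts)
```

Because each map call looks at one ticket and each reduce call at one key, the work can be split across machines without the pieces needing to talk to each other.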
Finally
• Much of the analysis done in data
mining has been done for centuries
– What is different now is the amount and
types of captured data
• There are a number of commercial
tools for mining
• Many large companies have made
substantial investments in, and see
returns on, their mining activities