Magister Teknologi Informasi
Universitas Indonesia
2012
Data Mining

More data is generated:
 Bank, telecom, and other business transactions ...
 Scientific data: astronomy, biology, etc.
 Web, text, and e-commerce

More data is captured:
 Storage technology is faster and cheaper
 DBMSs are capable of handling bigger databases
We have large amounts of data stored in one or more databases.
 We are hungry for new information hidden within those data (for research, competitive edge, etc.).
 We want to identify patterns or rules (trends and relationships) in those data.
 We know that certain data exist inside a database, but what are the consequences of that data's existence?




There is often information “hidden” in the data that is not readily evident.
Human analysts may take weeks to discover useful information.
Much of the data is never analyzed at all.
[Figure: “The Data Gap” – total new disk storage (TB) since 1995 versus the number of analysts, 1995–1999. From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”]


Data Mining is a process of extracting
previously unknown, valid and actionable
information from large databases and then
using the information to make crucial
business decisions (Cabena et al. 1998).
Data mining: discovering interesting patterns
from large amounts of data (Han and
Kamber, 2001).

Definition from [Connolly, 2005]:
 The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.

The common thread (the “thin red line”) of data mining: it is all about finding patterns or rules in data extracted from large databases, in order to uncover new information that could lead to new knowledge.
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Data Mining in the Knowledge Discovery Process

[Figure: the knowledge discovery process: raw data is understood and integrated into a data warehouse; target data is selected from the warehouse, transformed into transformed data, and mined for patterns and rules, which are then interpreted and evaluated to produce knowledge.]
 Clustering
 Classification
 Association Rules
 Other methods:
 ▪ Outlier detection
 ▪ Sequential patterns
 ▪ Prediction
 ▪ Trends and analysis of changes
 ▪ Methods for special data types, e.g., spatial data mining, web mining
 ▪ …


Association rules try to find associations between items in a set of transactions.
For example, in the case of associations between items bought by customers in a supermarket:
 90% of transactions that purchase bread and butter also purchase milk
 Antecedent: bread and butter
 Consequent: milk
 Confidence factor: 90%
A transaction is a set of items: T = {i_a, i_b, …, i_t}
 T ⊆ I, where I is the set of all possible items {i_1, i_2, …, i_n}
 D, the task-relevant data, is a set of transactions (a database of transactions).
 Example:
 ▪ Items sold by a supermarket (I, the itemset): {sugar, parsley, onion, tomato, salt, bread, olives, cheese, butter}
 ▪ Transaction by a customer (T): T1 = {sugar, onion, salt}
 ▪ Database (D): {T1 = {salt, bread, olives}, T2 = {sugar, onion, salt}, T3 = {bread}, T4 = {cheese, butter}, T5 = {tomato}, …}
An association rule is of the form:
 P ⇒ Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅

Example:
 {bread} ⇒ {butter, cheese}
 {onion, tomato} ⇒ {salt}

Support of a rule P ⇒ Q = support of (P ∪ Q) in D
 s_D(P ⇒ Q) = s_D(P ∪ Q)
 the percentage of transactions in D containing both P and Q
 (the number of transactions containing P and Q divided by the cardinality of D)

Confidence of a rule P ⇒ Q
 c_D(P ⇒ Q) = s_D(P ∪ Q) / s_D(P)
 the percentage of transactions that contain both P and Q among the transactions that already contain P

Thresholds:
 minimum support: minsup
 minimum confidence: minconf

Frequent itemset P
 the support of P is larger than the minimum support

Strong rule P ⇒ Q (c%)
 (P ∪ Q) is frequent
 c is larger than the minimum confidence
Min. support: 50%, min. confidence: 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule {A} ⇒ {C}:
 support = support({A, C}) = 50%
 confidence = support({A, C}) / support({A}) = 66.6%
For rule {C} ⇒ {A}:
 support = support({A, C}) = 50%
 confidence = support({A, C}) / support({C}) = 100.0%
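As a rough illustration of these definitions (my own sketch, not part of the original slides), the short Python snippet below recomputes the support and confidence values above from the four-transaction toy database; the function and variable names are illustrative only.

```python
# Minimal sketch: computing support and confidence for association rules
# on the four-transaction example above. Names are illustrative only.

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent U consequent) / support(antecedent)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"A", "C"}, transactions))       # 0.5   -> 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> 66.6%
print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> 100%
```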

Input
 A database of transactions
 Each transaction is a list of items (e.g., the items purchased by a customer in one visit)

Find all strong rules that associate the presence of one set of items with that of another set of items.
 Example: 98% of people who purchase tires and auto accessories also get automotive services done
 There are no restrictions on the number of items in the head or body of the rule.

The most famous algorithm is Apriori.

Find the frequent itemsets: the sets of items that have at least minimum support
 Any subset of a frequent itemset must also be a frequent itemset
▪ i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
 Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)

Use the frequent itemsets to generate association rules (a sketch of the frequent-itemset step follows below).
Source: [Sunysb, 2009]
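The following Python sketch (my own illustration, not from the slides) implements the frequent-itemset step just described: start from frequent 1-itemsets, then repeatedly join and prune candidates of size k using the subset property before counting their support.

```python
# Minimal Apriori frequent-itemset sketch; names are illustrative.
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # Count the support of surviving candidates by scanning the database
        counts = {c: sum(1 for t in transactions if c <= set(t))
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# For the nine-transaction example below, apriori_frequent_itemsets(D, 2)
# should reproduce the itemsets in L1, L2, and L3 shown in the walkthrough.
```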
TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

 Consider a database, D, consisting of 9 transactions.
 Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
 Let the minimum confidence required be 70%.
 We first have to find the frequent itemsets using the Apriori algorithm.
 Then, association rules will be generated using min. support and min. confidence.

Step 1: Generating the 1-itemset frequent pattern

Scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

C1 (candidate 1-itemsets)        L1 (frequent 1-itemsets)
Itemset   Sup. Count             Itemset   Sup. Count
{I1}      6                      {I1}      6
{I2}      7                      {I2}      7
{I3}      6                      {I3}      6
{I4}      2                      {I4}      2
{I5}      2                      {I5}      2

• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.

Step 2: Generating the 2-itemset frequent pattern

Generate C2 candidates from L1, scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

C2 (candidate 2-itemsets)        L2 (frequent 2-itemsets)
Itemset     Sup. Count           Itemset     Sup. Count
{I1,I2}     4                    {I1,I2}     4
{I1,I3}     4                    {I1,I3}     4
{I1,I4}     1                    {I1,I5}     2
{I1,I5}     2                    {I2,I3}     4
{I2,I3}     4                    {I2,I4}     2
{I2,I4}     2                    {I2,I5}     2
{I2,I5}     2
{I3,I4}     0
{I3,I5}     1
{I4,I5}     0

 To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2.
 Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the table above).
 The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
 Note: we haven't used the Apriori property yet.


Step 3: Generating the 3-itemset frequent pattern

Generate C3 candidates from L2, scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

C3 (candidate 3-itemsets)        L3 (frequent 3-itemsets)
Itemset       Sup. Count         Itemset       Sup. Count
{I1,I2,I3}    2                  {I1,I2,I3}    2
{I1,I2,I5}    2                  {I1,I2,I5}    2

 The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
 In order to find C3, we compute L2 Join L2.
 L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent.
 For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3}, and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
 Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5}, and {I3, I5}.
 BUT {I3, I5} is not a member of L2, and hence it is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
 Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning.
 Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.





Step 4: Generating the 4-itemset frequent pattern
 The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
 Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm.
 What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).
Step 5: Generating association rules from frequent itemsets
 Procedure:
  For each frequent itemset I, generate all nonempty proper subsets of I.
  For every nonempty subset s of I, output the rule "s ⇒ (I − s)" if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

 In our example:
  We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
 ▪ Let's take I = {I1, I2, I5}.
 ▪ Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.


Let the minimum confidence threshold be, say, 70%.
The resulting association rules are shown below, each listed with its confidence.
 R1: {I1,I2} ⇒ {I5}
 ▪ Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%
 ▪ R1 is rejected.
 R2: {I1,I5} ⇒ {I2}
 ▪ Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%
 ▪ R2 is selected.
 R3: {I2,I5} ⇒ {I1}
 ▪ Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%
 ▪ R3 is selected.
 R4: {I1} ⇒ {I2,I5}
 ▪ Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%
 ▪ R4 is rejected.
 R5: {I2} ⇒ {I1,I5}
 ▪ Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%
 ▪ R5 is rejected.
 R6: {I5} ⇒ {I1,I2}
 ▪ Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%
 ▪ R6 is selected.

In this way, we have found three strong association rules.
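As a rough companion to this procedure (my own sketch, not from the slides), the Python function below enumerates the nonempty proper subsets of each frequent itemset and keeps the rules that meet the confidence threshold; it reuses the support counts returned by the Apriori sketch shown earlier.

```python
# Minimal sketch of rule generation from frequent itemsets; names illustrative.
from itertools import combinations

def generate_rules(frequent_counts, min_conf):
    """frequent_counts: {frozenset(itemset): support_count}.
    Returns a list of (antecedent, consequent, confidence) triples."""
    rules = []
    for itemset, count in frequent_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                # By the Apriori property, every proper subset is also frequent,
                # so its support count is available in frequent_counts.
                conf = count / frequent_counts[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

# With the nine-transaction example and min_conf = 0.7, the itemset
# {I1, I2, I5} should yield exactly the rules R2, R3, and R6 above.
```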
Learn a method for predicting the instance class from pre-labeled (classified) instances.
Many approaches:
 Statistics,
 Decision Trees,
 Neural Networks,
 ...

Prepare a collection of records (training set)
 Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes (decision tree, neural network, etc.).
 Prepare a test set to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
 Once happy with the accuracy, use your model to classify new instances.

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model; the model is then applied to the Test Set.
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
Model: Decision Tree (built from the training data above)

Splitting attributes:

Tree 1:
 Refund?
  Yes → NO
  No  → MarSt?
         Married → NO
         Single, Divorced → TaxInc?
                             < 80K → NO
                             > 80K → YES

Tree 2 (an alternative model):
 MarSt?
  Married → NO
  Single, Divorced → Refund?
                      Yes → NO
                      No  → TaxInc?
                              < 80K → NO
                              > 80K → YES

There could be more than one tree that fits the same data!
Applying the model to test data: start from the root of the tree.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Following Tree 1: Refund = No → MarSt = Married → predict Cheat = NO.
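To make this workflow concrete (an illustration of mine, not from the slides; it uses scikit-learn and pandas, which the lecture does not mention), the sketch below trains a decision tree on the ten training records and predicts the class of the test record walked through above.

```python
# Minimal sketch: training a decision tree on the Refund/Marital/Income data.
# Assumes scikit-learn and pandas are installed; the encoding is my own choice.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Refund": ["Yes","No","No","Yes","No","No","Yes","No","No","No"],
    "MarSt":  ["Single","Married","Single","Married","Divorced",
               "Married","Divorced","Single","Married","Single"],
    "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
    "Cheat":  ["No","No","No","No","Yes","No","No","Yes","No","Yes"],
})

# One-hot encode the nominal attributes; keep the numeric one as-is.
X = pd.get_dummies(train[["Refund", "MarSt", "TaxInc"]])
y = train["Cheat"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Test record from the walkthrough: Refund=No, MarSt=Married, TaxInc=80K
test = pd.get_dummies(pd.DataFrame(
    [{"Refund": "No", "MarSt": "Married", "TaxInc": 80}]
)).reindex(columns=X.columns, fill_value=0)

print(model.predict(test))   # expected: ['No']
```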

Direct Marketing
 Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
 Approach:
 ▪ Use the data for a similar product introduced before.
 ▪ We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
 ▪ Collect various demographic, lifestyle, and company-interaction related information about all such customers.
  ▪ Type of business, where they stay, how much they earn, etc.
 ▪ Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997

Fraud Detection
 Goal: Predict fraudulent cases in credit card transactions.
 Approach:
 ▪ Use credit card transactions and the information on the account holder as attributes.
  ▪ When does a customer buy, what does he buy, how often does he pay on time, etc.
 ▪ Label past transactions as fraudulent or fair. This forms the class attribute.
 ▪ Learn a model for the class of the transactions.
 ▪ Use this model to detect fraud by observing credit card transactions on an account.
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Customer Attrition/Churn:
 Goal: To predict whether a customer is likely to be lost to a competitor.
 Approach:
 ▪ Use detailed records of transactions with each of the past and present customers to find attributes.
  ▪ How often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
 ▪ Label the customers as loyal or disloyal.
 ▪ Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
Helps users understand the natural grouping or structure in a data set.
 Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group.
 Clustering: unsupervised classification; there are no predefined classes.
Find “natural” grouping of
instances given un-labeled data

A good clustering method will produce high-quality clusters in which:
 the intra-class similarity (that is, within a cluster) is high.
 the inter-class similarity (that is, between clusters) is low.

The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
The quality of a clustering result also depends on the definition and representation of the clusters chosen.


Partitioning algorithms: Construct various partitions and then evaluate them by some criterion.
Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion. There are agglomerative and divisive approaches.

Partitioning method: Given a number k, partition a database D of n objects into a set of k clusters so that a chosen objective function is minimized (e.g., the sum of distances to the centers of the clusters).
 Global optimum: exhaustively enumerating all partitions is too expensive!
 Heuristic methods are based on iterative refinement of an initial partition, as in k-means (see the sketch below).
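A minimal k-means sketch (my own illustration; the slides do not prescribe a specific algorithm here), showing the iterative-refinement idea: assign each object to its nearest center, recompute the centers, and repeat until the assignment stops changing.

```python
# Minimal k-means sketch for points in the plane; names are illustrative.
import random

def kmeans(points, k, iterations=100):
    """points: list of (x, y) tuples. Returns (centers, labels)."""
    centers = random.sample(points, k)          # initial centers: k random points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        new_labels = [min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                                + (p[1] - centers[c][1]) ** 2)
                      for p in points]
        if new_labels == labels:                 # converged: assignment unchanged
            break
        labels = new_labels
        # Update step: move each center to the mean of its assigned points
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels

# Example: two obvious groups
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(data, 2))
```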
Hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters.
 The result is represented by a so-called dendrogram.
 Nodes in the dendrogram represent possible clusters.
 The dendrogram can be constructed bottom-up (agglomerative approach) or top-down (divisive approach).

A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Single Link (agglomerative): cluster similarity = similarity of the two most similar members
 + Fast
 - Potentially produces long and skinny clusters

Example: distance matrix for objects 1–5

      1    2    3    4    5
 1    0
 2    2    0
 3    6    3    0
 4   10    9    7    0
 5    9    8    5    4    0

Step 1: merge the closest pair {1, 2} (distance 2) and update the distances with the minimum rule:
 d((1,2),3) = min{d(1,3), d(2,3)} = min{6, 3} = 3
 d((1,2),4) = min{d(1,4), d(2,4)} = min{10, 9} = 9
 d((1,2),5) = min{d(1,5), d(2,5)} = min{9, 8} = 8

        (1,2)   3    4    5
 (1,2)    0
 3        3     0
 4        9     7    0
 5        8     5    4    0

Step 2: merge {(1,2), 3} (distance 3):
 d((1,2,3),4) = min{d((1,2),4), d(3,4)} = min{9, 7} = 7
 d((1,2,3),5) = min{d((1,2),5), d(3,5)} = min{8, 5} = 5

           (1,2,3)   4    5
 (1,2,3)     0
 4           7       0
 5           5       4    0

Step 3: merge {4, 5} (distance 4), and finally
 d((1,2,3),(4,5)) = min{d((1,2,3),4), d((1,2,3),5)} = 5
merges everything into one cluster at distance 5.

[Figure: the corresponding dendrogram of these merges.]

Complete Link (agglomerative): cluster similarity = similarity of the two least similar members
 + Tight clusters
 - Slow

Starting from the same distance matrix for objects 1–5:

      1    2    3    4    5
 1    0
 2    2    0
 3    6    3    0
 4   10    9    7    0
 5    9    8    5    4    0

Step 1: merge the closest pair {1, 2} (distance 2) and update the distances with the maximum rule:
 d((1,2),3) = max{d(1,3), d(2,3)} = max{6, 3} = 6
 d((1,2),4) = max{d(1,4), d(2,4)} = max{10, 9} = 10
 d((1,2),5) = max{d(1,5), d(2,5)} = max{9, 8} = 9

        (1,2)   3    4    5
 (1,2)    0
 3        6     0
 4       10     7    0
 5        9     5    4    0

Step 2: merge {4, 5} (distance 4):
 d((1,2),(4,5)) = max{d((1,2),4), d((1,2),5)} = max{10, 9} = 10
 d(3,(4,5))     = max{d(3,4), d(3,5)} = max{7, 5} = 7

         (1,2)   3   (4,5)
 (1,2)     0
 3         6     0
 (4,5)    10     7    0

Step 3: merge {(1,2), 3} (distance 6), and finally
 d((1,2,3),(4,5)) = max{d((1,2),(4,5)), d(3,(4,5))} = 10
merges everything into one cluster at distance 10.

[Figure: the corresponding dendrogram of these merges.]
Dendrogram: Hierarchical Clustering
A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
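For readers who want to reproduce these merge sequences (an illustration of mine; the slides do not reference any library), SciPy's hierarchical clustering routines accept the same 5×5 distance matrix and support both the single-link and complete-link rules used above.

```python
# Minimal sketch: single-link and complete-link clustering of the 5-object
# example above using SciPy. Assumes scipy and numpy are installed.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# The distance matrix from the example (objects 1..5).
D = np.array([
    [ 0,  2,  6, 10,  9],
    [ 2,  0,  3,  9,  8],
    [ 6,  3,  0,  7,  5],
    [10,  9,  7,  0,  4],
    [ 9,  8,  5,  4,  0],
], dtype=float)

condensed = squareform(D)                 # linkage expects a condensed matrix

single   = linkage(condensed, method="single")    # min rule
complete = linkage(condensed, method="complete")  # max rule

print(single)    # each row: cluster a, cluster b, merge distance, cluster size
print(complete)

# Cutting the single-link dendrogram at distance 4 should give two clusters:
# {1, 2, 3} and {4, 5}.
print(fcluster(single, t=4, criterion="distance"))
```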


 Understanding the Data
 Data Cleaning
 ▪ Missing Values, Noisy Values, Outliers
 ▪ Dates
 ▪ Nominal/Numeric
 ▪ Discretization
 Normalization
 Smoothing
 Transformation
 Attribute selection

One can't be expected to be an expert in all fields, but understanding the data can be extremely useful for data mining.
 What data is available?
 What available data is actually relevant or useful?
 Can the data be enriched from other sources?
 Are there historical datasets available?
 Who is the real expert to ask questions of?
 (Are the results at all sensible? Or are they completely obvious?)

Answers to these questions before embarking on a data mining project are invaluable later on.

Number of instances available:
 5,000 or more for reliable results

Number of attributes:
 Depends on the data set, but any attribute with fewer than 10 instances is typically not worth including

Number of instances per class:
 More than 100 per class
 If very unbalanced, consider stratified sampling

Goal: maximizing data quality
 Assess data quality
 Correct noted deficiencies

Assess data quality
 Analyze the data distribution: is there any strange data distribution?
 Analyze data elements: check for inconsistencies, redundancy, missing values, outliers, etc.
 Conduct a physical audit: ensure data is recorded properly, for example by cross-checking data with the customer.
 Analyze business rules: check whether the data violates business rules.





Handling missing values:
 Exclude the attribute for which data is frequently missing
 Exclude records that have missing data
 Extrapolate missing values from other known values
 Use a predictive model to determine a value
 For quantitative values, use a generic figure, such as the average

We want all dates to be in the same format. YYYYMMDD is an ISO standard, BUT it has some issues for data mining:
 Year 10,000 AD! (we only have 4 digits)
 Dates BC[E], e.g. -0300-02-03, are not valid YYYY-MM-DD dates.
 Most importantly: it does not preserve intervals between dates.

Other representations:
 Posix/Unix system date: the number of seconds since 1970
 etc.
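As a small illustration of the interval point (my own example, not from the slides): a numeric representation such as Unix seconds, or simply days since a reference date, makes date arithmetic behave like ordinary subtraction, whereas subtracting YYYYMMDD integers does not.

```python
# Minimal sketch: date representations for mining; the reference date is arbitrary.
from datetime import date, datetime, timezone

d1, d2 = date(2004, 1, 31), date(2004, 2, 1)

# YYYYMMDD as an integer does NOT preserve intervals:
print(20040201 - 20040131)                 # 70, although the dates are 1 day apart

# Days since a reference date preserves intervals:
ref = date(1970, 1, 1)
print((d2 - ref).days - (d1 - ref).days)   # 1

# Unix seconds (Posix time) for a full timestamp:
ts = datetime(2004, 2, 1, tzinfo=timezone.utc).timestamp()
print(ts)                                  # seconds since 1970-01-01 UTC
```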
Nominal data: data without ordering, e.g. sex, country, etc.
Some algorithms can't deal with nominal (or with numeric) attributes. E.g. decision trees deal best with nominal attributes, but neural networks and many clustering algorithms require numeric attribute values only.
In case the algorithm requires converting nominal to numeric (see the sketch below):
 Binary field: one value is 0, the other value is 1 (e.g. gender)
 Ordered fields: convert to numbers that preserve the order (e.g. A vs C grade becomes 4 and 2, respectively)
 Few values: convert each value into a new binary attribute; for example, if the possible values of attribute AT are A, B, C, D, then create 4 new attributes ATa, ATb, ATc, ATd, each with value either 0 or 1
 Many values: convert into groups of values, each with its own (binary) attribute, e.g. group the states in the US into 5 groups of 10
 Unique values: ignore identifier-like attributes (drop the attribute)
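A minimal sketch of the "few values" conversion (one-hot encoding); this is my own illustration, and the attribute name AT and its values follow the example above.

```python
# Minimal sketch: converting a nominal attribute AT with values A/B/C/D
# into four binary attributes ATa..ATd (one-hot encoding).

values = ["A", "B", "C", "D"]              # possible values of attribute AT
records = [{"AT": "B"}, {"AT": "D"}, {"AT": "A"}]

def one_hot(record, attribute, values):
    """Replace one nominal attribute with len(values) binary attributes."""
    encoded = {k: v for k, v in record.items() if k != attribute}
    for v in values:
        encoded[f"{attribute}{v.lower()}"] = 1 if record[attribute] == v else 0
    return encoded

for r in records:
    print(one_hot(r, "AT", values))
# {'ATa': 0, 'ATb': 1, 'ATc': 0, 'ATd': 0}, and so on.
```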


Some algorithms require nominal or discrete values. How can we turn a numeric value into a nominal value, or into a numeric value with a smaller range of values?
There are several discretization techniques, often called 'binning' (the two simplest are sketched below):
 Equal Width
 Equal Depth
 Class Dependent
 Entropy
 Fuzzy (allow some fuzziness as to the edges of the bin)
 Non-disjoint (allow overlapping intervals)
 ChiMerge (use the Chi-squared test in the same way as Entropy)
 Iterative (use some technique, then minimise the classifier error)
 Lazy (only discretize during classification; VERY lazy!)
 Proportional k-Interval (number of bins = square root of the number of instances)
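A minimal sketch of the two simplest binning strategies listed above, equal width and equal depth; this example is mine and the bin counts are arbitrary.

```python
# Minimal sketch: equal-width and equal-depth (equal-frequency) binning.

values = [4, 8, 15, 16, 23, 42, 7, 10, 11, 30]

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_depth_bins(values, n_bins):
    """Assign each value to a bin so that bins hold (roughly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins

print(equal_width_bins(values, 3))
print(equal_depth_bins(values, 3))
```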

We might want to normalise our data so that two numeric values are comparable, for example to compare age and income.

Decimal scaling: v' = v / 10^k for the smallest k such that max(abs(v')) < 1
 E.g.: for -991 and 99, k is 3, and -991 becomes -0.991

Min/max normalisation: v' = (v - min) / (max - min) * (newMax - newMin) + newMin
 E.g.: 73,600 in [12,000, 98,000] mapped to [0, 1] is 0.716

Zero-mean (z-score) normalization: v' = (v - mean) / stddev
 E.g.: mean 54,000, stddev 16,000, v = 73,600, then v' = 1.225
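A short sketch (mine, not from the slides) implementing the three normalisation formulas above and reproducing the example numbers.

```python
# Minimal sketch of the three normalisation methods described above.
import math

def decimal_scaling(values):
    """v' = v / 10^k for the smallest k such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    k = math.floor(math.log10(max_abs)) + 1 if max_abs > 0 else 0
    return [v / 10 ** k for v in values]

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, stddev):
    return (v - mean) / stddev

print(decimal_scaling([-991, 99]))               # [-0.991, 0.099], k = 3
print(round(min_max(73600, 12000, 98000), 3))    # 0.716
print(round(z_score(73600, 54000, 16000), 3))    # 1.225
```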

In the case of noisy data, we might want to smooth
the data such that it is more uniform.

Some possible techniques:
 Regression: Find the function for the data, and move each
value some amount closer to what the function predicts
(see classification)
 Clustering: Some clustering techniques remove outliers.
We could cluster the data to remove these noisy values.
 Binning: We could use some technique to discretize the
data, and then smooth based on those 'bins'.


Transform data to a more meaningful form.
For example:
 Birth date is transformed to age
 The date of the first transaction is transformed to the number of days since the customer became a member
 Grades of each course are transformed to a cumulative GPA



Before getting to the data mining, we may want to either remove instances or select only a portion of the complete data set to work with.
Why? Perhaps our algorithms don't scale well to the amount of data we have.
Techniques:
 Record selection
 ▪ Partitioning: split the database into sections and work with each in turn. Often not appropriate unless the algorithm is designed to do it.
 ▪ Sampling: select a random subset of the data, which is hopefully representative, and use that.
 Attribute selection
 ▪ Stepwise forward selection: find the best attribute and add it.
 ▪ Stepwise backward elimination: find the worst attribute and remove it.
 ▪ Genetic algorithms: use a 'survival of the fittest' approach along with random cross-breeding.
 ▪ etc.

What technique will you use to solve each of these problems?
 Given a set of applicant attributes (name, salary, age, etc.), you want to decide whether or not to approve a customer's credit card application.
 Given national examination scores, you want to group Kabupatens into three educational levels: Good, Average, Poor.
 You want to suggest suitable pants to your customer, given her/his choice of shirt.
 You want to estimate the economic growth of Indonesia given some data (GNP, GDP, etc.).

Pak Bedu is a doctor who wants to use IT to help him decide whether or not a patient has cancer. To make that decision, Pak Bedu already has data for each patient, covering the results of 5 laboratory tests together with the decision of whether the patient has cancer or not. Here is a sample of the data:

ID   T1   T2   T3   T4   T5   Cancer?
P1   1    3    4    2    3    Yes
P2   0    5    1    1         No
P3   1    2    4    2    2    No
P4   2    2    3    1    2    Yes
P5   2    2    3    1    2    No

What problems can you find in Pak Bedu's data?

When Pak Bedu examines one attribute (say T1), he finds 5 distinct values with the following counts:
 1, with a count of 1234
 2, with a count of 2037
 3, with a count of 1659
 4, with a count of 1901
 11, with a count of 1
What can you conclude by looking at these data?
Pak Bedu has expanded his clinic business to 3 locations. He has fully delegated the management of patient data to each clinic. Pak Bedu wants to know the characteristics of the patients of his business by collecting data from the three clinics. The problem is that Pak Bedu is confused because each clinic has a different data schema. What should Pak Bedu do?
 Clinic 1 schema: Pasien(Nama, TglLahir, Tinggi (meter), Berat (kg), JenisKelamin (L/P), Alamat, Provinsi)
 Clinic 2 schema: Pasien(Nama, TglLahir, Tinggi (centimeter), Berat (kg), JenisKelamin (P/W), Alamat, Kota)
 Clinic 3 schema: Pasien(Nama, Umur, Tinggi (meter), Berat (kg), JenisKelamin (L/P), Kota, Provinsi)

It turns out Pak Bedu also owns a College of Health Sciences. Being a kind-hearted person, Pak Bedu wants to give scholarships to his students who are working on their final theses. However, Pak Bedu wants to find the students who have the potential to finish their thesis within one semester. Pak Bedu has the grade data of students who have already graduated, together with how long they took to complete their studies. The schemas Pak Bedu has include: Tabel Mahasiswa(NPM, Nama, Umur, AsalDaerah, LamaSkripsi, LamaStudi), Tabel MataKuliah(Kode, Nama, Kelompok), Tabel Nilai(NPM, KodeMK, Nilai).
 Discuss what you could do to help Pak Bedu.



Pak Bedu has expanded his clinics to 100 branches. Pak Bedu wants to see patterns of patient visits: which regions have many sick people, when many people get sick, and so on. Pak Bedu has data about the clinics and the total number of patient visits each month.
What can you do?