Software Systems Practice
Machine Learning (1)
Fall Semester, 2016
Basic Learning Process
 Data storage: storing fact data
 utilizes observation, memory, and recall to provide a factual basis
 Abstraction: transforming the data
 Involves the translation of stored data into broader representations and concepts.
 Generalization: learning (generalization)
 uses abstracted data to create knowledge and models
 Evaluation: assessment (aimed at improving performance)
 provides a feedback mechanism to measure the utility of learned knowledge and inform
potential improvements.
Machine Learning
Supervised Learning
• Classification
• Regression
• => produces a prediction model
Unsupervised Learning
• Clustering, Association rule mining
• => produces a description model
Reinforcement Learning
• Agent : (State, Action) -> Reward
Machine Learning in Practice
 Data collection:
 In most cases, the data will need to be combined into a single source like a text file, spreadsheet, or
database.
 Data exploration and preparation:
 Data understanding -> feature selection, model selection
 Data cleansing -> fixing or cleaning so-called "messy" data, eliminating unnecessary data
 Data transformation -> recoding the data to conform to the learner's expected inputs.
 Model training:
 Machine learning algorithm selection -> Model construction
 Model evaluation:
 evaluate the accuracy of the model using a test dataset
 develop measures of performance specific to the intended application.
 Model improvement:
 utilize more advanced strategies to augment the performance of the model
 Augment the current training data
 Use another ML algorithm
Input Data
• Numeric
• Nominal (or categorical)
• Ordinal
Types of Machine Learning
Algorithms
 Supervised Learning: builds a prediction model
 Classification: determines the category that best fits the given data
 Category: one of the values present in the class column
 Regression: predicts a numeric value for the given data (e.g., income, laboratory
values, test scores, or counts of items)
 Unsupervised Learning: builds a description model
 Association rule mining (pattern discovery): basket data analysis
 Clustering: segmentation analysis
 Meta-learning: designing higher-level learning methods
 focus on learning how to learn more effectively
Supervised Learning:
Classification, Regression
 The form of the prediction (classification) model depends on the learning algorithm
k-Nearest Neighbors
Support Vector Machine
Statistics (e.g., Bayesian Network)
Decision Trees
Neural Network
Classification System Architecture
 Basic concept
Classification (prediction) model
Building a Prediction Model: Decision Trees
Credit Analysis
Training data (the label column is the class):

salary   education        label
10000    high school      reject
40000    under graduate   accept
15000    under graduate   reject
75000    graduate         accept
18000    graduate         accept

Learning produces the classification model (decision tree):

salary < 20000?
  no  -> accept
  yes -> education in graduate?
           yes -> accept
           no  -> reject
Unsupervised Learning:
Clustering
(Example clusters: office workers who enjoy traveling; wealthy seniors who enjoy golf)
Unsupervised Learning:
Association Mining
 Given: product purchase records
• Measure the associations between products and express, as rules, how likely they are to be purchased together
• Also known as: market basket analysis
Data Understanding
Exploring the structure of data
Exploring numeric variables
Visualizing numeric variables: box-plot
Boxplot
 Interpreting a boxplot
Visualizing numeric variables: histogram
Measuring the central tendency : mode
Exploring categorical variables
Exploring relationships between
variables
 Visualizing relationships – scatterplots
Examining relationships – two-way
cross-tabulations
Supervised Learning
k-Nearest Neighbors (instance-based learning)
k-Nearest Neighbors
K-NN: example
K-NN: example
Euclidean distance
K-NN: example
Class column
When k = 1, tomato’s neighbors : orange
When k = 3, tomato’s neighbors : orange, grape, nuts
Choosing an appropriate k
• Large k: reduces the impact of noisy data
• but small yet important patterns may be missed
• Small k: the risk of over-fitting increases
• but small yet important patterns can be captured
Weighted voting
The vote of the closer neighbors is considered
more authoritative than the vote of the far
away neighbors.
Rescaling
 Min-max normalization
 z-score standardization
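A minimal R sketch (not on the slides) of the two rescaling methods named above; min-max normalization maps each feature into [0, 1], while z-score standardization centers it at 0 with unit standard deviation:

# min-max normalization
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# z-score standardization (equivalent to base R's scale())
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}
normalize(c(10, 20, 30, 40, 50))   # 0.00 0.25 0.50 0.75 1.00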
Coding
 The Euclidean distance formula is not defined for nominal data.
 To calculate the distance between nominal features,
 we need to convert them into a numeric format.
 => dummy coding, where a value of 1 indicates one category, and 0,
the other.
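A small illustrative R sketch of dummy coding; the 'temperature' feature and its values are hypothetical, not from the slides:

temperature <- c("hot", "cold", "hot", "cold")
# a value of 1 indicates the "hot" category, 0 the other
temperature_hot <- ifelse(temperature == "hot", 1, 0)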
K-NN : Lazy learning
 Strictly speaking, lazy learning does not perform true learning
 Before the prediction step, it only stores the training data
 As a result, the prediction step takes longer than in other algorithms
 Also known as
 Instance-based learning
 Rote learning
Example: diagnosing breast cancer
 Step 1 – collecting data
 Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine
Learning Repository (http://archive.ics.uci.edu/ml)
 measurements from digitized images of fine-needle aspirate of a breast
mass.
 569 examples with 32 features.
Example: diagnosing breast cancer
 Step 2 – exploring and preparing the data
 Importing the CSV file
 Browsing the structure
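A sketch of the import step in R; the file name "wisc_bc_data.csv" and the presence of an id column are assumptions about how the UCI data was saved locally:

# read the CSV file without converting strings to factors automatically
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
str(wbcd)          # browse the structure: 569 examples, 32 features
wbcd <- wbcd[-1]   # if the first column is only an id, drop it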
Example: diagnosing breast cancer
 Step 2 – exploring and preparing the data
The target feature (class column) should be converted to a factor
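A sketch of that conversion; the column name 'diagnosis' and the level codes "B"/"M" are assumptions about the dataset, not given on the slide:

# recode the target feature as a factor with readable labels
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))
table(wbcd$diagnosis)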
Example: diagnosing breast cancer
 Step 2 – exploring and preparing the data
 Transformation – normalizing numeric data
Example: diagnosing breast cancer
 Step 2 – exploring and preparing the data
 Transformation – normalizing numeric data
Use the normalize() function!
Example: diagnosing breast cancer
 Step 2 – exploring and preparing the data
 Transformation – normalizing numeric data
Use the normalize() function
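A sketch of the normalization step, continuing the import above; the column range 2:31 assumes the 30 numeric measurements follow the diagnosis column:

normalize <- function(x) (x - min(x)) / (max(x) - min(x))
# apply normalize() to every numeric column and rebuild a data frame
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n[[1]])   # each column now ranges from 0 to 1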
Example: diagnosing breast cancer
 Step 2 – exploring and preparing the data
 Data preparation – creating training and test datasets
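A sketch of the split; the slides do not give exact row counts, so the 469/100 split below is just one reasonable choice for the 569 examples:

wbcd_train <- wbcd_n[1:469, ]
wbcd_test  <- wbcd_n[470:569, ]
# keep the class labels separately, for training and later evaluation
wbcd_train_labels <- wbcd$diagnosis[1:469]
wbcd_test_labels  <- wbcd$diagnosis[470:569]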
Example: diagnosing breast cancer
 Step 3 – Training a model & evaluating model performance
 The knn() function is in the class package
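A sketch of Step 3 with knn() from the class package mentioned above; the choice k = 21 (an odd number near the square root of 469) is an assumption:

library(class)
wbcd_pred <- knn(train = wbcd_train, test = wbcd_test,
                 cl = wbcd_train_labels, k = 21)
table(wbcd_test_labels, wbcd_pred)   # confusion matrix on the test data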
Example: diagnosing breast cancer
 Step 4 – improving model performance
 One method => Transformation – z-score standardization
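A sketch of the z-score alternative using base R's scale(); column 1 is assumed to be the class column and is excluded:

wbcd_z <- as.data.frame(scale(wbcd[-1]))   # standardize all numeric columns
summary(wbcd_z[[1]])                       # mean is now approximately 0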
Example: diagnosing breast cancer
 Step 4 – improving model performance
 Another method => Testing alternative values of k
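A sketch of trying alternative k values; the candidate list is arbitrary:

for (k in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(wbcd_train, wbcd_test, cl = wbcd_train_labels, k = k)
  cat("k =", k, " test errors =", sum(pred != wbcd_test_labels), "\n")
}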
Supervised Learning
Probabilistic Learning: Naïve Bayes Classification
Understanding probability
Joint probability
• If events A and B are independent:
P(A ∩ B) = P(A) * P(B)
• In general (whether or not A and B are independent):
P(A ∩ B) = P(A|B) * P(B)
P(A ∩ B) = P(B|A) * P(A)
Bayes’ Theorem
Bayes’ Theorem
P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = (4/20) * (20/100) = 0.04
P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = (4/20) * (20/100) / (5/100) = 0.80
Naïve Bayes algorithm
Naïve Bayes algorithm
 Classification with Naive Bayes
If an email contains the words 'Viagra' and 'Unsubscribe' but does not contain 'Money' or 'Groceries',
what is the posterior probability that this email is spam?
Naïve Bayes algorithm
Too complex to compute directly in this form
P(w1, w2) = P(w1) * P(w2|w1)
P(w1, w2|s) = P(w1|s) * P(w2|w1, s)
P(w1, w2, w3|s) = P(w1|s) * P(w2, w3|w1, s)
= P(w1|s) * P(w2|w1, s) * P(w3|w1, w2, s)
P(w1, w2, w3, w4|s) = P(w1|s) * P(w2, w3, w4|w1, s)
= P(w1|s) * P(w2|w1, s) * P(w3, w4|w1, w2, s)
= P(w1|s) * P(w2|w1, s) * P(w3|w1, w2, s) * P(w4|w1, w2, w3, s)
If the words (events) are mutually independent within the same class (e.g., spam)
class-conditional independence
= P(w1|s) * P(w2|s) * P(w3|s) * P(w4|s)
Naïve Bayes algorithm
Since the denominator has the same value regardless of the class, it can be dropped when comparing classes
Naïve Bayes algorithm
P(spam | w1, w2, w3, w4) ∝ P(w1|spam) * P(w2|spam) * P(w3|spam) * P(w4|spam) * P(spam)
P(ham | w1, w2, w3, w4) ∝ P(w1|ham) * P(w2|ham) * P(w3|ham) * P(w4|ham) * P(ham)
In summary, compute this value for each class,
then rescale the results so that they sum to 1 and become valid probabilities
Naïve Bayes algorithm
 There is one problem in the likelihood computation
 For example, given a second email (email2) that contains all of the words 'Viagra', 'Groceries',
'Money', and 'Unsubscribe', what is its likelihood for spam, P('Viagra', 'Groceries', 'Money',
'Unsubscribe' | spam), under the Naïve Bayes algorithm?
 P('Viagra', 'Groceries', 'Money', 'Unsubscribe' | spam) = (4/20) * (0/20) * (10/20) * (12/20) = 0, because 'Groceries' never occurs in the spam messages
 P('Viagra', 'Groceries', 'Money', 'Unsubscribe' | ham) = (1/80) * (8/80) * (14/80) * (23/80) ≈ 0.00006, so the email would always be classified as ham
 Solution: the Laplace estimator
Naïve Bayes algorithm
 Laplace estimator
 Add a small number (e.g., 1) to each cell of the frequency table
Laplace-adjusted likelihood table (1 added to each cell):

                    spam     ham      Total
Viagra        yes   5/24     2/84     7/108
              no    17/24    80/84    97/108
Money         yes   11/24    15/84    26/108
              no    11/24    67/84    78/108
Groceries     yes   1/24     9/84     10/108
              no    21/24    72/84    93/108
Unsubscribe   yes   13/24    24/84    37/108
              no    9/24     58/84    67/108
Total               24       84       108
Naïve Bayes algorithm
 P('Viagra', 'Groceries', 'Money', 'Unsubscribe' | spam) * P(spam) = (5/24) * (1/24) * (11/24) * (13/24) * (20/100) ≈ 0.0004
 P('Viagra', 'Groceries', 'Money', 'Unsubscribe' | ham) * P(ham) = (2/84) * (9/84) * (15/84) * (24/84) * (80/100) ≈ 0.0001
 P(spam | 'Viagra', 'Groceries', 'Money', 'Unsubscribe') = 0.0004 / (0.0004 + 0.0001) = 0.8
 P(ham | 'Viagra', 'Groceries', 'Money', 'Unsubscribe') = 0.0001 / (0.0004 + 0.0001) = 0.2
Naïve Bayes algorithm
 Using numeric features with Naive Bayes
 numeric features => discretization
 i.e., divide the full range of numeric values into bins and turn them into categories
 e.g., add a feature for the time of day an email was received to help distinguish spam from ham
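A small R sketch of discretization with cut(); the 'hour' values and the bin boundaries are hypothetical:

hour <- c(2, 9, 14, 20, 23)   # hour of day each email arrived
time_of_day <- cut(hour, breaks = c(0, 6, 12, 18, 24),
                   labels = c("night", "morning", "afternoon", "evening"))
table(time_of_day)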
Naïve Bayes algorithm
 Example – filtering mobile phone spam
Naïve Bayes algorithm
 exploring and preparing the data
Naïve Bayes algorithm
 Data preparation – cleaning and standardizing text data
 Use the tm package
Create a Source object from the sms_raw$text vector
 First, create a corpus object
 cf.) PCorpus(): creates a permanent corpus backed by storage such as a database
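A sketch of the corpus creation described above, using the tm package and the sms_raw$text column named on the slide:

library(tm)
# VCorpus() builds an in-memory (volatile) corpus from a Source object
sms_corpus <- VCorpus(VectorSource(sms_raw$text))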
Naïve Bayes algorithm
 Data preparation – cleaning and standardizing text data
 To view the actual text content,
Naïve Bayes algorithm
 Data preparation – cleaning and standardizing text data
 To view multiple documents at once, use the lapply() function
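A sketch of both inspection methods, continuing from the corpus above:

as.character(sms_corpus[[1]])            # view a single message
lapply(sms_corpus[1:3], as.character)    # view several messages at once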
Naïve Bayes algorithm
 Data Preparation
Remove numbers
Remove punctuation
Remove extra white space
Naïve Bayes algorithm
 Data Preparation: removing stopwords
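A sketch of the cleaning steps listed above, applied with tm_map(); lower-casing via content_transformer(tolower) is an extra step commonly used alongside them and is not named on the slides:

sms_clean <- tm_map(sms_corpus, content_transformer(tolower))
sms_clean <- tm_map(sms_clean, removeNumbers)               # remove numbers
sms_clean <- tm_map(sms_clean, removeWords, stopwords())    # remove stopwords
sms_clean <- tm_map(sms_clean, removePunctuation)           # remove punctuation
sms_clean <- tm_map(sms_clean, stripWhitespace)             # remove extra white space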
Naïve Bayes algorithm
 Data preparation – cleaning and standardizing text data
Naïve Bayes algorithm
 Data preparation – splitting text documents into words
 Create a Document-Term Matrix (DTM)
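A sketch of the tokenization step: each row of the resulting matrix is a message and each column a word, with cells holding word counts:

sms_dtm <- DocumentTermMatrix(sms_clean)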
Naïve Bayes algorithm
 Data preparation – creating training and test datasets
Save the class column separately for later model evaluation
Split the data into training data and test data
Training data: about 70-80% of the full dataset
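A sketch of the split using roughly the 75% ratio mentioned above; the class column name 'type' is an assumption:

n <- nrow(sms_raw)
train_idx <- 1:round(0.75 * n)
test_idx  <- (round(0.75 * n) + 1):n
sms_dtm_train <- sms_dtm[train_idx, ]
sms_dtm_test  <- sms_dtm[test_idx, ]
sms_train_labels <- sms_raw$type[train_idx]   # class labels saved for evaluation
sms_test_labels  <- sms_raw$type[test_idx]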
Naïve Bayes algorithm
 Visualizing text data – word clouds
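A sketch of the word cloud visualization using the wordcloud package; the min.freq threshold is an arbitrary choice:

library(wordcloud)
wordcloud(sms_clean, min.freq = 50, random.order = FALSE)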
Naïve Bayes algorithm
 Data preparation – feature (word) selection
 Some words do not help with classification
 Build a DTM that contains only the frequent words
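A sketch of this feature-selection step with findFreqTerms() from tm; the threshold of 5 occurrences is an assumption:

sms_freq_words <- findFreqTerms(sms_dtm_train, 5)       # words appearing at least 5 times
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]  # keep only those columns
sms_dtm_freq_test  <- sms_dtm_test[ , sms_freq_words]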
Naïve Bayes algorithm
 Data preparation – data transformation
 The values in the DTM are numeric word counts -> they need to be converted to categorical values
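A sketch of the conversion from counts to a categorical "Yes"/"No" indicator:

convert_counts <- function(x) ifelse(x > 0, "Yes", "No")
# MARGIN = 2 applies the function column by column (word by word)
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test  <- apply(sms_dtm_freq_test,  MARGIN = 2, convert_counts)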
Naïve Bayes algorithm
 Training a model on the data
 Evaluating the model
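A sketch of training and evaluation; the slides do not name the package, so naiveBayes() from e1071 is an assumption:

library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
sms_pred <- predict(sms_classifier, sms_test)
table(sms_pred, sms_test_labels)   # confusion matrix against the saved labels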
Naïve Bayes algorithm
 Improving the model
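A sketch of the improvement step: retraining with a Laplace estimator of 1, as discussed in the earlier slides:

sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_pred2 <- predict(sms_classifier2, sms_test)
table(sms_pred2, sms_test_labels)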
Supervised Learning
Decision Trees
Decision Trees
Decision Trees
 Recursive partitioning (or Divide and Conquer)
 Example: predicting movie box-office success
Decision Trees
 Recursive partitioning (or Divide and Conquer)
Decision Trees
 C5.0 decision tree algorithm: the de facto standard decision tree algorithm
Decision Trees
 Choosing the best split
Ideally, the data in each partition should belong to a single class
During splitting, we therefore need to measure the degree (purity) to which each partition
contains a single class
C5.0 uses entropy to measure purity
Decision Trees
 Computing entropy: Entropy(S) = - Σ (over classes i) p_i * log2(p_i)
P (red) = 0.6
P (blue) = 0.4
Entropy is maximal for a 50-50 split
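A quick R check of the entropy values for this example:

-0.60 * log2(0.60) - 0.40 * log2(0.40)   # ≈ 0.971 for the 60/40 split
-0.50 * log2(0.50) - 0.50 * log2(0.50)   # = 1, the maximum for two classes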
Decision Trees
 Information Gain:
 The change in homogeneity (entropy)
InfoGain(D, Ai) = Entropy(D) − Entropy_Ai(D)

where attribute Ai splits the dataset D into subsets D1, D2, ..., Dv, and

Entropy_Ai(D) = Σ (j = 1..v) (|Dj| / |D|) * Entropy(Dj)

Select the attribute (column) with the largest InfoGain value for the split
Decision Trees (C5.0)
Worked example (loan data, 15 examples: 9 Yes, 6 No):

entropy(D) = -(6/15) * log2(6/15) - (9/15) * log2(9/15) = 0.971

Splitting on Own_house gives D1 (6 examples, all one class, entropy 0) and
D2 (9 examples: 3 Yes, 6 No, entropy 0.918):
entropy_Own_house(D) = (6/15) * entropy(D1) + (9/15) * entropy(D2)
                     = (6/15) * 0 + (9/15) * 0.918 = 0.551

Splitting on Age gives three subsets of 5 examples each:
Age      Yes  No  entropy(Di)
young     2    3    0.971
middle    3    2    0.971
old       4    1    0.722

entropy_Age(D) = (5/15) * entropy(D1) + (5/15) * entropy(D2) + (5/15) * entropy(D3)
               = (5/15) * 0.971 + (5/15) * 0.971 + (5/15) * 0.722 = 0.888

(CS583, Bing Liu, UIC)
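A short R sketch reproducing the calculation above and the resulting information gains:

entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
e_D   <- entropy(c(9, 6) / 15)                         # 0.971
e_own <- (6/15) * 0 + (9/15) * entropy(c(3, 6) / 9)    # 0.551
e_age <- (5/15) * entropy(c(2, 3) / 5) +
         (5/15) * entropy(c(3, 2) / 5) +
         (5/15) * entropy(c(4, 1) / 5)                 # 0.888
e_D - e_own   # InfoGain(D, Own_house) ≈ 0.420, the larger gain
e_D - e_age   # InfoGain(D, Age)       ≈ 0.083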
Decision Trees
 Overfitting: fitting the training data too closely
 Accurate on the training data, but the error grows on the test data
 As a result, the tree develops many branches and becomes deep
 Ways to avoid overfitting
 Pre-pruning (early stopping): stop splitting at an appropriate point
 It is very hard to know when that point has been reached
 Post-pruning: grow the tree as far as possible, then remove branches that do not help
classification
 Set aside a validation set for pruning
Decision Trees
 Another purity measure
 Gini Index
Gini index = 1 − Σ (i = 1..C) Pi²
P (red) = 0.6
P (blue) = 0.4
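A quick R check of the Gini index for the same 60/40 split:

1 - (0.6^2 + 0.4^2)   # = 0.48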
Decision Trees
Overfitting: the decision boundary fits the training data too closely
Decision Trees
 Example: identifying risky bank loans
 Exploring and preparing the data
Decision Trees
 Exploring and preparing the data
Decision Trees
 Data preparation – creating random training and test
datasets
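A sketch of the random split described above; the 90/10 ratio, the seed, and the data frame name 'credit' are assumptions:

set.seed(123)   # make the random split reproducible
train_sample <- sample(nrow(credit), round(0.9 * nrow(credit)))
credit_train <- credit[train_sample, ]
credit_test  <- credit[-train_sample, ]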
Decision Trees
 Training a model on the data
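A sketch of training with C5.0() from the C50 package; the target column name 'default' is an assumption, and it must be a factor:

library(C50)
credit_model <- C5.0(default ~ ., data = credit_train)
summary(credit_model)   # prints the learned tree and the training error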
Decision Trees
Decision Trees
 Evaluating the model
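A sketch of the evaluation step, continuing from the model above:

credit_pred <- predict(credit_model, credit_test)
table(credit_test$default, credit_pred)   # confusion matrix on the test data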
Decision Trees
 Improving the model: C5.0 includes a boosting technique
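A sketch of the boosting improvement: the trials argument of C5.0() sets the number of boosting iterations (10 here is an arbitrary choice):

credit_boost <- C5.0(default ~ ., data = credit_train, trials = 10)
credit_boost_pred <- predict(credit_boost, credit_test)
table(credit_test$default, credit_boost_pred)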