Software Systems Lab: Machine Learning (1)
2016, 2nd semester

Basic Learning Process
• Data storage (storing facts): utilizes observation, memory, and recall to provide a factual basis.
• Abstraction (data transformation): involves the translation of stored data into broader representations and concepts.
• Generalization (learning): uses abstracted data to create knowledge and models.
• Evaluation (aimed at improving performance): provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.

Machine Learning
• Supervised learning
  • Classification
  • Regression
  • => yields a prediction model
• Unsupervised learning
  • Clustering, association rule mining
  • => yields a description model
• Reinforcement learning
  • Agent: (State, Action) -> Reward

Machine Learning in Practice
• Data collection:
  • In most cases, the data will need to be combined into a single source like a text file, spreadsheet, or database.
• Data exploration and preparation:
  • Data understanding -> feature selection, model selection
  • Data cleansing -> fixing or cleaning so-called "messy" data, eliminating unnecessary data
  • Data transformation -> recoding the data to conform to the learner's expected inputs
• Model training:
  • Machine learning algorithm selection -> model construction
• Model evaluation:
  • evaluate the accuracy of the model using a test dataset
  • develop measures of performance specific to the intended application
• Model improvement:
  • utilize more advanced strategies to augment the performance of the model
  • augment the current training data
  • use another ML algorithm

Input Data
• Numeric
• Nominal (or categorical)
• Ordinal

Types of Machine Learning Algorithms
• Supervised learning: produces a prediction model
  • Classification: determines the category that fits the given data
    • Category: one of the values found in the class column
  • Regression: predicts a numeric value for the given data (e.g., income, laboratory values, test scores, or counts of items)
• Unsupervised learning: produces a description model
  • Association rule mining (pattern discovery): basket data analysis
  • Clustering: segmentation analysis
• Meta-learning: designing higher-level learning methods
  • focuses on learning how to learn more effectively

Supervised Learning: Classification, Regression
• The form of the prediction (classification) model depends on the learning algorithm:
  • k-Nearest Neighbors
  • Support Vector Machine
  • Statistics (e.g., Bayesian Network)
  • Decision Trees
  • Neural Network

Classification System Architecture
• Basic concept [figure: classification (prediction) model]

Building a Prediction Model: Decision Tree
Credit Analysis
• Training data:

  salary   education        label (class)
  10000    high school      reject
  40000    under graduate   accept
  15000    under graduate   reject
  75000    graduate         accept
  18000    graduate         accept

• Learned classification model:

  salary < 20000?
    yes -> education = graduate?
             yes -> accept
             no  -> reject
    no  -> accept

Unsupervised Learning: Clustering
• [figure: example clusters such as "office workers who enjoy travel" and "wealthy seniors who enjoy golf"]

Unsupervised Learning: Association Mining
• Given purchase records, measure the association between items and express the likelihood that they are bought together as rules.
• Also known as market basket analysis.

Data Understanding
• Exploring the structure of data
• Exploring numeric variables
• Visualizing numeric variables: box plot
  • Interpreting a box plot
• Visualizing numeric variables: histogram
• Measuring the central tendency: mode
• Exploring categorical variables
• Exploring relationships between variables
  • Visualizing relationships: scatterplots
  • Examining relationships: two-way cross-tabulations

Supervised Learning: k-Nearest Neighbors (instance-based learning)

k-Nearest Neighbors
K-NN: example
• Distance measure: Euclidean distance
• The label to predict comes from the class column.
• When k = 1, tomato's neighbors: orange
• When k = 3,
tomato's neighbors: orange, grape, nuts

Choosing an appropriate k
• Large k:
  • reduces the impact of noisy data
  • may miss small but important patterns
• Small k:
  • increases the risk of over-fitting
  • can capture small but important patterns

Weighted voting
• The vote of the closer neighbors is considered more authoritative than the vote of the far away neighbors.

Rescaling
• Min-max normalization
• z-score standardization

Coding
• The Euclidean distance formula is not defined for nominal data.
• To calculate the distance between nominal features, we need to convert them into a numeric format.
• => dummy coding, where a value of 1 indicates one category, and 0, the other.

K-NN: Lazy learning
• Strictly speaking, lazy learning is not true learning.
• Before the prediction stage, it only stores the training data.
• As a result, the prediction stage takes longer than with other algorithms.
• Also known as:
  • Instance-based learning
  • Rote learning

Example: diagnosing breast cancer
• Step 1: collecting data
  • Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml)
  • measurements from digitized images of fine-needle aspirate of a breast mass
  • 569 examples with 32 features
• Step 2: exploring and preparing the data
  • Importing the CSV file
  • Browsing the structure
  • The target feature (class column) is converted to a factor.
  • Transformation: normalizing numeric data, using the normalize() function
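The k-NN steps above (rescaling, Euclidean distance, majority vote) can be sketched in a few lines. The slides work in R with knn() from the class package and a normalize() helper whose code is not shown here, so the following is only an illustrative Python sketch. The function names and the (sweetness, crunchiness) coordinates are assumptions made for this example; only the food names and the resulting neighbors come from the slides.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def min_max_normalize(values):
    """Rescale values to [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_standardize(values):
    """Center on the mean and divide by the sample standard deviation."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [(v - mean) / sd for v in values]


def knn_predict(train, query, k):
    """Return (majority label, neighbor names) for the k closest rows."""
    neighbors = sorted(train, key=lambda row: dist(row[1], query))[:k]
    votes = Counter(label for _, _, label in neighbors)
    return votes.most_common(1)[0][0], [name for name, _, _ in neighbors]


# Assumed illustrative data: (name, (sweetness, crunchiness), class).
train = [
    ("grape", (8, 5), "fruit"),
    ("green bean", (3, 7), "vegetable"),
    ("nuts", (3, 6), "protein"),
    ("orange", (7, 3), "fruit"),
]
tomato = (6, 4)  # assumed coordinates for the query item

label_k1, neighbors_k1 = knn_predict(train, tomato, k=1)
label_k3, neighbors_k3 = knn_predict(train, tomato, k=3)
print(neighbors_k1, label_k1)  # ['orange'] fruit
print(neighbors_k3, label_k3)  # ['orange', 'grape', 'nuts'] fruit
print(min_max_normalize([1, 2, 3, 4, 5]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Both features here share the same scale, so the sketch skips rescaling before measuring distance. With features on very different ranges, each column would be passed through min_max_normalize or z_score_standardize first.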
Example: diagnosing breast cancer
• Step 2: exploring and preparing the data
  • Transformation: normalizing numeric data with the normalize() function
  • Data preparation: creating training and test datasets
• Step 3: training a model and evaluating model performance
  • The knn() function is in the class package.
• Step 4: improving model performance
  • One method: transformation with z-score standardization
  • Another method: testing alternative values of k

Supervised Learning: Probabilistic Learning (Naïve Bayes Classification)

Understanding probability
• Joint probability
  • If events A and B are independent: P(A ∩ B) = P(A) * P(B)
  • In general, including when A and B are not independent: P(A ∩ B) = P(A|B) * P(B) = P(B|A) * P(A)

Bayes' Theorem
• P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = (4/20) * (20/100) = 0.04
• P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = (4/20) * (20/100) / (5/100) = 0.80

Naïve Bayes algorithm
• Classification with Naive Bayes
  • Given an email that contains the words 'Viagra' and 'Unsubscribe' but not 'Money' or 'Groceries', what is the posterior probability that it is spam?
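The Bayes' theorem arithmetic for the 'Viagra' example can be checked with exact fractions. This is a small Python sketch, not code from the slides; the helper name bayes_posterior is made up for illustration, and the three input probabilities are the ones quoted in the example.

```python
from fractions import Fraction


def bayes_posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(class | word) = P(word | class) * P(class) / P(word)."""
    return likelihood * prior / evidence


p_viagra_given_spam = Fraction(4, 20)  # P(Viagra | spam)
p_spam = Fraction(20, 100)             # P(spam)
p_viagra = Fraction(5, 100)            # P(Viagra)

# Joint probability P(spam ∩ Viagra) = P(Viagra | spam) * P(spam)
joint = p_viagra_given_spam * p_spam
posterior = bayes_posterior(p_viagra_given_spam, p_spam, p_viagra)

print(joint, float(joint))          # 1/25 0.04
print(posterior, float(posterior))  # 4/5 0.8
```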
Naïve Bayes algorithm
• Computing the posterior directly is too complex. By the chain rule:
  • P(w1, w2) = P(w1) * P(w2|w1)
  • P(w1, w2|s) = P(w1|s) * P(w2|w1, s)
  • P(w1, w2, w3|s) = P(w1|s) * P(w2, w3|w1, s) = P(w1|s) * P(w2|w1, s) * P(w3|w1, w2, s)
  • P(w1, w2, w3, w4|s) = P(w1|s) * P(w2, w3, w4|w1, s) = P(w1|s) * P(w2|w1, s) * P(w3, w4|w1, w2, s) = P(w1|s) * P(w2|w1, s) * P(w3|w1, w2, s) * P(w4|w1, w2, w3, s)
• If the words (events) are independent of one another within the same class (e.g., spam), that is, under class-conditional independence:
  • P(w1, w2, w3, w4|s) = P(w1|s) * P(w2|s) * P(w3|s) * P(w4|s)
• The denominator has the same value regardless of the class, so it can be ignored when comparing classes.
• In summary, the score for each class is the class prior multiplied by the per-word conditional probabilities; the scores are then rescaled so that they are probabilities.

Naïve Bayes algorithm: a problem with the likelihood
• There is one problem in the likelihood calculation.
• Example: email2 contains the words 'Viagra', 'Groceries', 'Money', and 'Unsubscribe'. What likelihood does the Naïve Bayes algorithm give for spam and for ham? If any word never occurred with a class in the training data, its conditional probability is 0 and the whole product becomes 0.
• Solution: Laplace estimator

Naïve Bayes algorithm: Laplace estimator
• Add a small number (e.g., 1) to each cell of the frequency table.

            Viagra          Money           Groceries       Unsubscribe
            Yes     No      Yes     No      Yes     No      Yes     No      Total
  spam      5/24    17/24   11/24   11/24   1/24    21/24   13/24   9/24    24
  ham       2/84    80/84   15/84   67/84   9/84    72/84   24/84   58/84   84
  Total     7/108   97/108  26/108  78/108  10/108  93/108  37/108  67/108  108

• With the table above, for an email containing 'Viagra', 'Groceries', 'Money', and 'Unsubscribe':
  • likelihood of spam = (5/24) * (11/24) * (1/24) * (13/24) * (20/100) ≈ 0.0004
  • likelihood of ham = (2/84) * (15/84) * (9/84) * (24/84) * (80/100) ≈ 0.0001
  • P(spam|'Viagra', 'Groceries', 'Money', 'Unsubscribe') = 0.0004/(0.0004+0.0001) = 0.8
  • P(ham|'Viagra', 'Groceries', 'Money', 'Unsubscribe') = 0.0001/(0.0004+0.0001) = 0.2

Naïve Bayes algorithm: using numeric features
• Numeric features => discretization
• That is, the full range of numeric values is divided into bins and turned into categories.
• Example: add the time of day at which an email was received as a feature for telling spam from ham.

Naïve Bayes algorithm: example, filtering mobile phone spam
• Exploring and preparing the data
• Data preparation:
  cleaning and standardizing text data
  • Uses the tm package: create a Source object from the sms_raw$text vector.
  • First, create a corpus object.
  • cf. PCorpus(): creates a permanent corpus in storage such as a database.
  • To view the actual text content of a document, [code not preserved]
  • To view several documents, use the lapply() function.
  • Remove numbers, punctuation, and white space.
  • Remove stop words.
• Data preparation: splitting text documents into words
  • Create a Document-Term Matrix (DTM).
• Data preparation: creating training and test datasets
  • Save the class column information for model evaluation.
  • Training data: roughly 70-80% of the full data; the rest is test data.
• Visualizing text data: word clouds
• Data preparation: feature (word) selection
  • Some words do not help classification.
  • Create a DTM that contains only the frequent words.
• Data preparation: data transformation
  • The values in the DTM are numeric and need to be converted to categorical values.
• Training a model on the data
• Evaluating the model
• Improving the model

Supervised Learning: Decision Trees

Decision Trees
• Recursive partitioning (or divide and conquer)
  • Example: predicting a movie's box-office success
• C5.0 decision tree algorithm: the standard decision tree algorithm
• Choosing the best split
  • Ideally, the data in each partition should have a single class value.
  • During splitting, we need to measure the degree to which each partition contains a single class (purity).
  • C5.0 uses entropy to measure purity.
• Computing entropy
  • Example: P(red) = 0.6, P(blue) = 0.4
  • Entropy is at its maximum for a 50-50 split.
• Information gain: the change in homogeneity (entropy) when D is split on attribute A_i into partitions D_1, D_2, ..., D_v

  $$ InfoGain(D, A_i) = Entropy(D) - Entropy_{A_i}(D) $$

  $$ Entropy_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Entropy(D_j) $$

  • Split on the column with the largest InfoGain value.

Decision Trees (C5.0): worked example (CS583, Bing Liu, UIC)

  $$ entropy(D) = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.971 $$

  $$ entropy_{Own\_house}(D) = \frac{6}{15}\, entropy(D_1) + \frac{9}{15}\, entropy(D_2) = \frac{6}{15} \times 0 + \frac{9}{15} \times 0.918 = 0.551 $$

  $$ entropy_{Age}(D) = \frac{5}{15}\, entropy(D_1) + \frac{5}{15}\, entropy(D_2) + \frac{5}{15}\, entropy(D_3) = \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.722 = 0.888 $$

  Age      Yes   No   entropy(D_i)
  young    2     3    0.971
  middle   3     2    0.971
  old      4     1    0.722

Decision Trees: overfitting
• Overfitting: the model fits the training data too closely.
  • It is accurate on the training data but has a larger error on the test data.
  • As a result, the tree grows many branches and becomes deep.
  • [figure: the decision boundary fits the training data too closely]
• Ways to avoid overfitting
  • Pre-pruning (early stopping): stop splitting at a suitable point.
    • It is very hard to know where that point is.
  • Post-pruning: grow the tree as far as possible, then remove the branches that do not help classification.
    • A validation set is set aside for pruning.

Decision Trees: another purity measure
• Gini index

  $$ Gini = 1 - \sum_{i=1}^{C} P_i^2 $$

  • Example: P(red) = 0.6, P(blue) = 0.4

Decision Trees: example, identifying risky bank loans
• Exploring and preparing the data
• Data preparation: creating random training and test datasets
• Training a model on the data
• Evaluating the model
• Improving the model: C5.0 includes a boosting technique.
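The entropy, information-gain, and Gini calculations above can be reproduced with a short Python sketch. It is not the C5.0 implementation and the function names are made up for illustration. The class counts for Age come from the table in the worked example; the counts for the two Own_house partitions (6 yes / 0 no and 3 yes / 6 no) are not listed in the slides and are inferred here from the stated partition sizes and entropies (0 and 0.918).

```python
from math import log2


def entropy(counts):
    """Entropy of a class-count list, e.g. [yes_count, no_count]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def split_entropy(partitions):
    """Weighted entropy after a split: sum of |D_j|/|D| * entropy(D_j)."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)


def info_gain(parent_counts, partitions):
    """InfoGain(D, A) = Entropy(D) - Entropy_A(D)."""
    return entropy(parent_counts) - split_entropy(partitions)


def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


d = [9, 6]                        # whole dataset: 9 yes, 6 no
own_house = [[6, 0], [3, 6]]      # inferred counts, see the note above
age = [[2, 3], [3, 2], [4, 1]]    # young, middle, old

print(round(entropy(d), 3))                 # 0.971
print(round(split_entropy(own_house), 3))   # 0.551
print(round(split_entropy(age), 3))         # 0.888
print(round(info_gain(d, own_house), 3))    # 0.42
print(round(info_gain(d, age), 3))          # 0.083
print(round(gini([6, 4]), 2))               # 0.48 for P(red)=0.6, P(blue)=0.4
```

Own_house has the larger information gain (0.420 against 0.083 for Age), so it would be chosen as the split column under the rule stated above.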