Introduction to Advanced Analytics in R Language
Timothy Wong
Data Scientist
What is R Language?
• Offers modern & sophisticated statistical algorithms
• Used by over 2 million data scientists, statisticians and analysts
• Has a thriving open-source community
• Big Data analytics via ‘Microsoft R Server’
2
RStudio
3
Packages
• CRAN
https://cran.r-project.org
https://cran.r-project.org/web/packages/available_packages_by_name.html
# Install a new package
install.packages('dplyr')
# Load a package (either one below)
require(dplyr)
library(dplyr)
4
R Basics
• Variable creation
• Subsetting your data
• Missing values
• Vectorised operation
• Writing your own function
• Data frame
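A minimal sketch of these basics (all object names below are illustrative):
# Variable creation
x <- c(2, 5, NA, 9)                # numeric vector containing a missing value
# Subsetting your data
x[1:2]                             # first two elements
# Missing values
x[!is.na(x)]                       # drop the NA
# Vectorised operation
x * 10                             # applies to every element at once
# Writing your own function
addTen <- function(v) v + 10
addTen(x)
# Data frame
myData <- data.frame(id = 1:4, value = x)
myData[myData$id > 2, ]            # subset rows by condition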
5
Easy to Use
• PROC REG   = lm(), glm()
• PROC SQL   = %>%
• PROC SORT  = order()
• PROC MEANS = mean(), sd()
• PROC GPLOT = plot(), ggplot()
• …(goes on)
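As one illustration of the dplyr pipe (%>%) standing in for SQL-style data steps, here is a minimal sketch using the built-in USArrests dataset (the column choices and threshold are illustrative, not from the slides):
library(dplyr)
USArrests %>%
  filter(UrbanPop > 60) %>%                 # keep rows, cf. a WHERE clause
  arrange(desc(Murder)) %>%                 # sort, cf. PROC SORT
  summarise(meanMurder = mean(Murder),      # aggregate, cf. PROC MEANS
            sdMurder   = sd(Murder))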
6
Linear Regression
• Univariate
  Y = β₀ + β₁x + ε   (ε is the residual)
• Bivariate / Multivariate
  Y = β₀ + β₁x₁ + β₂x₂
• Kth order polynomial function
  Y (prediction) = β₀ (intercept) + Σ_{k=1}^{K} βₖ xᵏ (polynomial)
  i.e. Y = β₀ + β₁x + β₂x² + ⋯ + β_K x^K
# Univariate linear model
myModel <- lm(y~x, myData)
summary(myModel)
7
Linear Regression
• Residuals – The unexplained part of the model, defined as observed value minus fitted value (εᵢ = yᵢ − ŷᵢ). If the parametric assumption is correct, the mean and median should be very close to zero.
• Estimate – Coefficient of the corresponding independent variable (i.e. the β values).
• Standard error – Standard deviation of the slope estimate.
• t-value – The number of standard errors the estimate lies away from zero (i.e. the null hypothesis).
• Pr(>|t|) – p-value of the coefficient estimate. In general, you may consider any variable with a p-value above 0.05 to be statistically insignificant.
• Multiple R² – Squared Pearson correlation between the observed and fitted values, which indicates the strength of the relationship.
• Adjusted R² – R² adjusted for the number of predictors in the model.
• F-statistic – Global hypothesis test for the model as a whole.
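These quantities can also be pulled out of the fitted model object directly; a minimal sketch continuing with the myModel object fitted on the previous slide:
# Coefficient table: estimates, standard errors, t-values and p-values
coef(summary(myModel))
# Multiple and adjusted R-squared
summary(myModel)$r.squared
summary(myModel)$adj.r.squared
# Multiple R-squared equals the squared Pearson correlation between
# the observed and fitted values
cor(myData$y, fitted(myModel))^2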
8
Linear Regression in R
# Load internal dataset
data(USArrests)
# Read top 10 rows
head(USArrests, 10)
# Checks dimension of this data frame
dim(USArrests)
# Univariate linear model
arrestModel1 <- lm(Murder ~ UrbanPop, USArrests)
summary(arrestModel1)
# Multivariate linear model
arrestModel2 <- lm(Murder ~ UrbanPop + Assault + Rape, USArrests)
summary(arrestModel2)
# Polynomial term
arrestModel3 <- lm(Murder ~ poly(UrbanPop,2) + poly(Assault,2) + poly(Rape,2), USArrests)
summary(arrestModel3)
9
Regression Diagnostics in R
# Partial regression plot (Check variable influence)
require(car)
avPlots(arrestModel2)
# Standardised regression coefficients (Check variable influence)
require(QuantPsyc)
lm.beta(arrestModel2)
# Quantile-Quantile plot (Check normality assumption)
qqnorm(arrestModel2$residuals)
qqline(arrestModel2$residuals)
# Regression residual plot (Check heteroscedasticity)
plot(arrestModel2$fitted.values, rstandard(arrestModel2))
# Compare all models using Chi-square test
anova(arrestModel1, arrestModel2, arrestModel3, test='Chisq')
10
Regression Diagnostics:
Residual plot (Homoscedasticity vs. Heteroscedasticity)
Source: StackExchange
11
Regression Diagnostics:
Quantile-Quantile Plot
• Checks normality assumption
12
Regression Diagnostics:
Pearson’s Correlation
Source: Wikipedia
13
Regression Diagnostics:
Model Overfitting
Model 1
  Y = β₀ + Σ_{j=1}^{3} β_wt,j x_wt^j + Σ_{k=1}^{2} β_hp,k x_hp^k
Model 2
  Y = β₀ + Σ_{j=1}^{8} β_wt,j x_wt^j + Σ_{k=1}^{5} β_hp,k x_hp^k
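The wt/hp subscripts suggest the built-in mtcars dataset; a sketch of how the two candidate models could be fitted and compared, assuming mpg as the response (that choice is an assumption, not stated on the slide):
# Model 1: low-order polynomial terms
overfitModel1 <- lm(mpg ~ poly(wt, 3) + poly(hp, 2), data = mtcars)
# Model 2: high-order polynomial terms, prone to overfitting
overfitModel2 <- lm(mpg ~ poly(wt, 8) + poly(hp, 5), data = mtcars)
# Compare fit against complexity
summary(overfitModel1)$adj.r.squared
summary(overfitModel2)$adj.r.squared
AIC(overfitModel1, overfitModel2)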
14
Poisson Regression
• Modelling counts of discrete events
• Total number of inbound calls from each customer over a fixed period
• Number of children in each household
• Number of tea refills each employee has during office hours
[Figure: Poisson distributions with rate λ = 1, 2, 3, 4, 5]
# Poisson regression
myModel <- glm(y ~ x1 + x2, family="poisson", data=myData)
summary(myModel)
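Since the Poisson model uses a log link, exponentiating the coefficients gives multiplicative effects on the expected count; a short sketch continuing the example above:
# Multiplicative effect on the expected count per unit increase in each x
exp(coef(myModel))
# Expected counts on the response scale
predict(myModel, type = "response")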
15
Logistic Regression
• Modelling binomial distribution
• Toss a coin: Head / tail
• Examination: pass / fail
• Product: sold / unsold
• Logistic function
  Pr(Y) = 1 / (1 + e^−(β₀ + β₁x₁))
• Odds ratio (e^β₁)
  • The multiplicative change in the odds of Y when x₁ increases by 1 unit
# Logistic regression
myModel <- glm(y ~ x1 + x2, family="binomial", data=myData)
summary(myModel)
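Odds ratios and predicted probabilities follow directly from the fitted model; a short sketch continuing the example above:
# Odds ratios (e^beta) for each independent variable
exp(coef(myModel))
# Predicted probability for each observation
predict(myModel, type = "response")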
16
Recursive Partitioning
• Divide data into regions recursively
[Figure: the (x₁, x₂) feature space is split recursively at cut-points s, producing regions ℛ₁, ℛ₂, ℛ₃, ℛ₄]
17
Decision Tree
• Data gets divided recursively into regions (a.k.a. ‘leaves’)
• Tree pruning
  • Removes weaker leaves, keeping the stronger nodes
  • Hence avoids overfitting
[Figure: tree diagram with stronger nodes near the root and weaker nodes pruned away; terminal leaves correspond to regions]
# Grow a simple tree
require(rpart)
myTree <- rpart(y ~ x1 + x2, myData)
summary(myTree)
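Pruning in rpart is typically driven by the complexity parameter (cp) table; a minimal sketch (the chosen cp value below is illustrative):
# Inspect the complexity parameter table to pick a cutting point
printcp(myTree)
# Prune away the weaker leaves
myPrunedTree <- prune(myTree, cp = 0.05)
# Visualise the pruned tree
plot(myPrunedTree)
text(myPrunedTree)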
18
Random Forest
• Consists of many decision trees
• Randomly selected variables will be used in each tree
• Usually no need to prune them (i.e. all trees are allowed to grow big)
• 𝑀 trees in a forest will produce 𝑀 predictions
• Final prediction is calculated as the mean value for regression problems
• Classification problems use the most common label (i.e. majority voting)
library(randomForest)
# Grow a large forest with 1000 trees
myForest <- randomForest(y ~ x1 + x2, ntree = 1000, data = myData)
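Variable importance and predictions come straight from the fitted forest; a short sketch continuing the example above:
# Variable importance measures across all trees
importance(myForest)
varImpPlot(myForest)
# Aggregated prediction (mean or majority vote over the 1000 trees)
predict(myForest, newdata = myData)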
19
Time Series Analysis:
Correlograms
• Regularly-spaced time series
• Explore variable relationship across temporal space
• The observed data (+ other time series variables) can be examined with:
  • Autocorrelation Function (ACF)
  • Partial Autocorrelation Function (PACF)
  • Cross-correlation Function (CCF)
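In base R these correlograms are produced by acf(), pacf() and ccf(); a minimal sketch assuming two regularly-spaced series myTs and otherTs (the object names are illustrative):
# Autocorrelation and partial autocorrelation of one series
acf(myTs)
pacf(myTs)
# Cross-correlation between two series
ccf(myTs, otherTs)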
20
Time Series Analysis:
Decomposition
• The observed data is decomposed over time t into Trend + Seasonality + Noise
• The components (+ other time series variables) feed into the forecast
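A seasonal-trend decomposition can be produced with stl() (or decompose()); a minimal sketch assuming myTs is a ts object with a defined seasonal frequency:
# Decompose into seasonal, trend and remainder (noise) components
myDecomposition <- stl(myTs, s.window = "periodic")
plot(myDecomposition)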
21
Autoregressive Moving Average (𝐴𝑅𝑀𝐴)
• 𝐴𝑅𝑀𝐴(𝑝, 𝑞)
  X_t = Σ_{i=1}^{p} φᵢ X_{t−i} + Σ_{i=1}^{q} θᵢ ε_{t−i} + ε_t
  (observation = AR(p) terms + MA(q) terms + error)
• ARIMA(p, d, q)
• 𝐴𝑅𝐼𝑀𝐴: Autoregressive Integrative Moving Average
• 𝑑th order integration can be added
• ‘Integration’ simply refers to differencing from the previous time step!
• First order differencing (d = 1): X′_t = X_t − X_{t−1}
• To satisfy stationarity requirement
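First-order differencing is available directly via diff(); a short sketch on a hypothetical ts object myTs:
# First order differencing: X'_t = X_t - X_(t-1)
myTsDiff <- diff(myTs, differences = 1)
plot(myTsDiff)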
22
𝐴𝑅𝐼𝑀𝐴 Forecasting with Seasonality
• ARIMA(p, d, q)(P, D, Q)_m, where m is the seasonal period
• All parameter values can be automatically identified in R language.
• Simple models are preferred
• Therefore we aim to keep p + q + P + Q small
library(forecast)
# Automatically search p,d,q,P,D,Q values
myArima <- auto.arima(myTs, xreg = cbind(x1, x2))
summary(myArima)
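Forecasts then come from forecast(); when external regressors are used, future values of those regressors have to be supplied (x1Future and x2Future below are hypothetical placeholders):
# Forecast the next 12 periods, supplying future regressor values
myForecast <- forecast(myArima, h = 12, xreg = cbind(x1Future, x2Future))
plot(myForecast)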
23
Neural Network:
Multilayer Perceptron
• Fully-interconnected layers (input layer, hidden layers 1 and 2, output layer)
• Non-linear activation function
  • Captures subtle ‘non-linear’ relationships
• Gradient Descent
  • Iterative optimisation algorithm
  • Reduces the error bit by bit
  • Converges at a local minimum
[Figure: multilayer perceptron diagram (input layer, two hidden layers, output layer); loss plotted over parameter space, descending from a random initiation to a local minimum]
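The slides do not name a specific R package; as one possible sketch, a single-hidden-layer perceptron can be fitted with the nnet package (note nnet supports only one hidden layer, unlike the two-layer diagram; size and other settings are illustrative):
library(nnet)
# One hidden layer with 5 units; linout = TRUE for a numeric response
myNet <- nnet(y ~ x1 + x2, data = myData, size = 5, linout = TRUE)
predict(myNet, myData)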
24
𝐾-means Clustering
• Clustering is subjective
• How many clusters are there?
[Figure: the same data clustered with K = 3, K = 4 and K = 5]
25
𝐾-means Clustering
• Iteratively move towards cluster centroids
• Terminates when the clusters stop changing
[Figure: progression from random initiation to convergence]
# Runs K-means clustering algorithm
K <- 3
myCluster <- kmeans(myData, K)
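The returned object carries the assignments and centroids; a short sketch continuing the example above:
# Cluster label assigned to each observation
myCluster$cluster
# Final cluster centroids
myCluster$centers
# Within-cluster sum of squares, useful when comparing choices of K
myCluster$withinss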
26
Hierarchical Clustering
• Agglomerative hierarchical clustering
• Starts from 𝑁 clusters
• Merge clusters one by one according to Euclidean distance
# Calculates Euclidean distance
myDistance <- dist(myData)
# Runs hierarchical clustering algorithm
myDendrogram <- hclust(myDistance)
# Draws dendrogram
plot(myDendrogram)
# Prune the tree
K <- 3
myClusters <- cutree(myDendrogram, K)
27
Hierarchical Clustering
[Figure: clusters merged agglomeratively over iterations 1–8]
28
User Communities (1)
• http://www.londonr.org
• http://www.meetup.com/Manchester-R
• http://www.meetup.com/Cardiff-R-User-Group/
• http://www.meetup.com/SheffieldR-Sheffield-R-Users-Group/
• http://www.edinbr.org
• http://www.cambr.org.uk
• http://www.meetup.com/NottinghamR-Nottingham-R-Users-Group/
• http://www.meetup.com/BirminghamR/
29
User Communities (2)
• The R User Conference (useR!)
• Effective Applications of the R Language (EARL)
• European R Users Meeting (eRum)
30
Learning Resources
• Data Analysis Examples (UCLA)
http://www.ats.ucla.edu/stat/dae/
• Regression Models in R (Harvard)
http://tutorials.iq.harvard.edu/R/Rstatistics/Rstatistics.html
• The R Project (NYU)
https://dev1.ed-projects.nyu.edu/statistics/overview-of-r-r-studio-r-commander/
• Choosing a Statistical Test
http://guides.nyu.edu/quant/choose_test_1DV
• Statistical Computing (Oxford)
http://portal.stats.ox.ac.uk/userdata/ruth/APTS2012/APTS.html
• Forecasting: Principles and Practice (Monash)
https://www.otexts.org/fpp/
• Time Series Analysis and Its Applications (Pittsburgh)
http://www.stat.pitt.edu/stoffer/tsa4/
• R in Action
http://www.statmethods.net
• Quantitative Financial Modelling & Trading Framework for R
http://www.quantmod.com
• Econometrics in R (Northwestern)
https://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf
• Data Analysis with R (Facebook)
https://www.udacity.com/course/data-analysis-with-r--ud651
• Rstatistics
http://www.rstatistics.net
31