Introduction to Advanced Analytics in R Language
Timothy Wong, Data Scientist

What is R Language?
• Offers modern and sophisticated statistical algorithms
• Used by over 2 million data scientists, statisticians and analysts
• Has a thriving open-source community
• Big Data analytics via ‘Microsoft R Server’

RStudio

Packages
• CRAN https://cran.r-project.org
• https://cran.r-project.org/web/packages/available_packages_by_name.html

    # Install a new package
    install.packages('dplyr')
    # Load a package (either one below)
    require(dplyr)
    library(dplyr)

R Basics
• Variable creation
• Subsetting your data
• Missing values
• Vectorised operation
• Writing your own function
• Data frame

Easy to Use
• PROC REG = lm(), glm()
• PROC SQL = %>%
• PROC SORT = order()
• PROC MEANS = mean(), sd()
• PROC GPLOT = plot(), ggplot()
…(goes on)

Linear Regression
• Univariate
  $Y = \beta_0 + \beta_1 x$
• Bivariate / Multivariate
  $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
• $K$th order polynomial function
  $Y = \beta_0 + \sum_{k=1}^{K} \beta_k x^k = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_K x^K$
  ($Y$: prediction; $\beta_0$: intercept; the sum: polynomial terms)
[Figure: fitted line $Y = \beta_0 + \beta_1 x_1$ with a residual marked]

    # Univariate linear model
    myModel <- lm(y ~ x, myData)
    summary(myModel)

Linear Regression
Term: Description
• Residuals: The unexplained part of the model, defined as observed value minus fitted value ($\epsilon_i = y_i - \hat{y}_i$). If the parametric assumption is correct, the mean and median should be very close to zero.
• Estimate: Coefficient of the corresponding independent variable (i.e. the $\beta$ values).
• Standard error: Standard deviation of the coefficient estimate (e.g. of the slope).
• t-value: The number of standard errors the estimate lies away from zero (i.e. the null hypothesis).
• $Pr(>|t|)$: $p$-value of the coefficient estimate. In general, a variable with a $p$-value below 0.05 is considered statistically significant.
• Multiple $R^2$: Pearson’s correlation squared, which indicates the strength of the relationship between original and fitted values.
• Adjusted $R^2$: Adjusted version of $R^2$ that accounts for the number of predictors.
• $F$-statistic: Global hypothesis test for the model as a whole.
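The terms described above can be read straight off a fitted model object. The following is a minimal sketch, assuming the built-in USArrests dataset and a univariate model chosen purely for illustration; the variable names (fit, fitSummary, and so on) are hypothetical. It recomputes a few quantities by hand to show how they relate to the definitions.

```r
# Fit a univariate model on the built-in USArrests data (illustrative choice)
data(USArrests)
fit <- lm(Murder ~ UrbanPop, data = USArrests)
fitSummary <- summary(fit)

# Residuals: observed value minus fitted value
res <- residuals(fit)
manualRes <- USArrests$Murder - fitted(fit)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coefTable <- coef(fitSummary)

# t value is the estimate divided by its standard error
tManual <- coefTable[, "Estimate"] / coefTable[, "Std. Error"]

# Multiple R-squared equals the squared Pearson correlation
# between observed and fitted values
r2 <- fitSummary$r.squared
r2Manual <- cor(USArrests$Murder, fitted(fit))^2

# Adjusted R-squared and the F-statistic (value, numdf, dendf)
adjR2 <- fitSummary$adj.r.squared
fStat <- fitSummary$fstatistic

print(coefTable)
```

Comparing tManual with the "t value" column, or r2 with r2Manual, is a quick way to check that the summary output is being read correctly.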
Linear Regression in R

    # Load internal dataset
    data(USArrests)
    # Read top 10 rows
    head(USArrests, 10)
    # Check the dimensions of this data frame
    dim(USArrests)
    # Univariate linear model
    arrestModel1 <- lm(Murder ~ UrbanPop, USArrests)
    summary(arrestModel1)
    # Multivariate linear model
    arrestModel2 <- lm(Murder ~ UrbanPop + Assault + Rape, USArrests)
    summary(arrestModel2)
    # Polynomial terms
    arrestModel3 <- lm(Murder ~ poly(UrbanPop, 2) + poly(Assault, 2) + poly(Rape, 2), USArrests)
    summary(arrestModel3)

Regression Diagnostics in R

    # Partial regression plot (check variable influence)
    require(car)
    avPlots(arrestModel2)
    # Standardised regression coefficients (check variable influence)
    require(QuantPsyc)
    lm.beta(arrestModel2)
    # Quantile-Quantile plot (check normality assumption)
    qqnorm(arrestModel2$residuals)
    qqline(arrestModel2$residuals)
    # Regression residual plot (check heteroscedasticity)
    plot(arrestModel2$fitted.values, rstandard(arrestModel2))
    # Compare all models using Chi-square test
    anova(arrestModel1, arrestModel2, arrestModel3, test = 'Chisq')

Regression Diagnostics: Residual plot (Homoscedasticity vs. Heteroscedasticity)
[Figure. Source: StackExchange]

Regression Diagnostics: Quantile-Quantile Plot
• Checks normality assumption

Regression Diagnostics: Pearson’s Correlation
[Figure. Source: Wikipedia]

Regression Diagnostics: Model Overfitting
• Model 1
  $Y = \beta_0 + \sum_{j=1}^{3} \beta_{wt_j} x_{wt}^{j} + \sum_{k=1}^{2} \beta_{hp_k} x_{hp}^{k}$
• Model 2
  $Y = \beta_0 + \sum_{j=1}^{8} \beta_{wt_j} x_{wt}^{j} + \sum_{k=1}^{5} \beta_{hp_k} x_{hp}^{k}$

Poisson Regression
• Modelling the number of discrete events
  • Total number of inbound calls of each customer over a fixed period
  • Number of children in each household
  • Number of tea refills each employee has during office hours
[Figure: Poisson distributions for $\lambda = 1, 2, 3, 4, 5$]

    # Poisson regression
    myModel <- glm(y ~ x1 + x2, family = "poisson", data = myData)
    summary(myModel)

Logistic Regression
• Modelling binomial distribution
  • Toss a coin: head / tail
  • Examination: pass / fail
  • Product: sold / unsold
• Logistic function
  $Pr(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}}$
• Odds ratio ($e^{\beta_1}$)
  • The multiplicative change in the odds of $Y = 1$ when $x_1$ increases by 1 unit

    # Logistic regression
    myModel <- glm(y ~ x1 + x2, family = "binomial", data = myData)
    summary(myModel)

Recursive Partitioning
• Divide data into regions recursively
[Figure: the $(x_1, x_2)$ plane partitioned recursively into regions $\mathcal{R}_1$ to $\mathcal{R}_4$, with split point $s$]

Decision Tree
• Data gets divided recursively into regions (a.k.a. ‘leaves’)
• Tree pruning
  • Removes weaker leaves
  • Hence avoids overfitting
[Figure: a tree whose regions sit on stronger and weaker nodes; the weaker nodes are pruned]

    require(rpart)
    # Grow a simple tree
    myTree <- rpart(y ~ x1 + x2, myData)
    summary(myTree)

Random Forest
• Consists of many decision trees
• Randomly selected variables are used in each tree
• Usually no need to prune them (i.e. all trees are allowed to grow big)
• $M$ trees in a forest produce $M$ predictions
  • The final prediction is the mean value for a regression problem
  • A classification problem uses the most common label (i.e. majority voting)

    library(randomForest)
    # Grow a large forest with 1000 trees
    myForest <- randomForest(y ~ x1 + x2, ntree = 1000, data = myData)

Time Series Analysis: Correlograms
• Regularly-spaced time series
• Explore variable relationships across time lags
[Figure: observed data (+ other time series variables) with its Cross-correlation Function (CCF), Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)]

Time Series Analysis: Decomposition
[Figure: observed data over time $t$ decomposed into trend, seasonality and noise; combined with other time series variables to produce a forecast]

Autoregressive Moving Average ($ARMA$)
• $ARMA(p, q)$
  $X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \epsilon_t$
  ($X_t$: observation; first sum: $AR(p)$; second sum: $MA(q)$; $\epsilon_t$: error)
• $ARIMA(p, d, q)$
  • $ARIMA$: Autoregressive Integrated Moving Average
  • $d$th order integration can be added
  • ‘Integration’ simply refers to taking the difference from the previous time step
  • First order differencing ($d = 1$): $X_t' = X_t - X_{t-1}$
  • Done to satisfy the stationarity requirement

$ARIMA$ Forecasting with Seasonality
• $ARIMA(p, d, q)(P, D, Q)_m$
• All parameter values can be automatically identified in R language.
• Simple models are preferred
  • Therefore we intend to keep $p + q + P + Q$ small

    library(forecast)
    # Automatically search p,d,q,P,D,Q values
    myArima <- auto.arima(myTs, xreg = cbind(x1, x2))
    summary(myArima)

Neural Network: Multilayer Perceptron
• Fully-interconnected layers
• Non-linear activation function
  • Captures subtle ‘non-linear’ relationships
• Gradient Descent
  • Iterative optimisation algorithm
  • Reduces error bit by bit
  • Converges at a local minimum
[Figure: network with input layer, hidden layer 1, hidden layer 2 and output layer]
[Figure: loss over the parameter space, from random initialisation down to a local minimum]

$K$-means Clustering
• Clustering is subjective
• How many clusters are there?
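The question of how many clusters to use can be explored by running the algorithm for several values of $K$ and comparing the results. The sketch below is one common heuristic and is not taken from the slides: it compares the total within-cluster sum of squares across candidate values of $K$. The toy data, the candidate range and the names candidateK and withinSS are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical toy data: three well-separated 2-D groups of 50 points each
myData <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)

# Run K-means for each candidate K and record the total
# within-cluster sum of squares (lower means tighter clusters)
candidateK <- 1:6
withinSS <- sapply(candidateK, function(k) {
  kmeans(myData, centers = k, nstart = 20)$tot.withinss
})

# Inspect how the within-cluster sum of squares changes with K
print(data.frame(K = candidateK, withinSS = withinSS))
```

On data like this, the value typically drops sharply up to the true number of groups and flattens afterwards, but the choice of $K$ remains a judgment call, which is the point the slide makes.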
[Figure: the same data clustered with $K = 3$, $K = 4$ and $K = 5$]

$K$-means Clustering
• Iteratively move towards cluster centroid
• Terminates when clusters stop changing
[Figure: from random initialisation to convergence]

    # Runs K-means clustering algorithm
    K <- 3
    myCluster <- kmeans(myData, K)

Hierarchical Clustering
• Agglomerative hierarchical clustering
  • Starts from $N$ clusters
  • Merges clusters one by one according to Euclidean distance

    # Calculates Euclidean distance
    myDistance <- dist(myData)
    # Runs hierarchical clustering algorithm
    myDendrogram <- hclust(myDistance)
    # Draws dendrogram
    plot(myDendrogram)
    # Cut the tree into K clusters
    K <- 3
    myClusters <- cutree(myDendrogram, K)

Hierarchical Clustering
[Figure: clusters merged step by step, iterations 1 to 8]

User Communities (1)
• http://www.londonr.org
• http://www.meetup.com/Manchester-R
• http://www.meetup.com/Cardiff-R-User-Group/
• http://www.meetup.com/SheffieldR-Sheffield-R-Users-Group/
• http://www.edinbr.org
• http://www.cambr.org.uk
• http://www.meetup.com/NottinghamR-Nottingham-R-Users-Group/
• http://www.meetup.com/BirminghamR/

User Communities (2)
• R User Conference (useR!)
• Effective Applications of the R Language (EARL)
• European R Users Meeting (eRum)

Learning Resources
• Data Analysis Examples (UCLA) http://www.ats.ucla.edu/stat/dae/
• Regression Models in R (Harvard) http://tutorials.iq.harvard.edu/R/Rstatistics/Rstatistics.html
• The R Project (NYU) https://dev1.ed-projects.nyu.edu/statistics/overview-of-r-r-studio-r-commander/
• Choosing a Statistical Test http://guides.nyu.edu/quant/choose_test_1DV
• Statistical Computing (Oxford) http://portal.stats.ox.ac.uk/userdata/ruth/APTS2012/APTS.html
• Forecasting: Principles and Practice (Monash) https://www.otexts.org/fpp/
• Time Series Analysis and Its Applications (Pittsburgh) http://www.stat.pitt.edu/stoffer/tsa4/
• R in Action http://www.statmethods.net
• Quantitative Financial Modelling & Trading Framework for R http://www.quantmod.com
• Econometrics in R (Northwestern) https://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf
• Data Analysis with R (Facebook) https://www.udacity.com/course/data-analysis-with-r--ud651
• Rstatistics http://www.rstatistics.net