R + R Tools for Visual Studio = Data Science
Bahrudin Hrnjica, MVP,
daenet
@bhrnjica
Agenda
Quick intro to R Language
Statistics
Graphs
Microsoft R Server
Microsoft R Server and RevoScaleR
What is R
• a programming language, a statistical package, an
interpreter and Open Source
‐ Free
‐ Highly extensible
‐ Focused on statistics and machine learning
‐ Transparent and reproducible
‐ Single-threaded
‐ In-memory data
What R is not
• is not a database, but connects to DBMSs
• has no point-and-click user interface, but connects to Java and Tcl/Tk
• the language interpreter can be very slow, but you can call your own C/C++ code
• has no spreadsheet view of data, but connects to Excel/MS Office
Two distributions of R
CRAN project – http://r-project.org
• The R Project for Statistical Computing
• Original distribution since the beginning.
• Open-source project written in C, C++, and FORTRAN
MRAN project – http://mran.microsoft.com
• Microsoft R Application Network
• Enhanced distribution of R
• Support for BLAS and LAPACK libraries and multi-core processing.
R code is fully compatible with both distributions.
Microsoft R Open
• Free
• Open source, cross-OS project
• Enhanced distribution of R
• Supports a variety of:
‐ big data statistics,
‐ predictive modeling,
‐ machine learning capabilities.
• Single-threaded
• In-memory data manipulation
Microsoft R portfolio
Microsoft R Server – MRS
MRS extends open-source R to allow
• Multi-threading
• Matrix operations, linear algebra, and many other math operations run on all available cores.
• Parallel processing
• ScaleR functions utilize all available resources, local or distributed
• On-disk data storage
• RAM limitation lifted: break through your memory barrier!
• Working with data too big to fit your memory
• Building models that take too long to run
• Working with clusters and distributed systems
MRS’s Native Data Format: XDF File
• Chunk-oriented
• Easy to distribute to nodes
• Fast to append
• Column-oriented
• Fast retrieval of variables
• Pre-computed metadata
Tools to write, edit, debug, and run R code (scripts)
• R Studio – http://rstudio.com – free, cross-OS tool for R
• R Tools for Visual Studio – Visual Studio vNext will include this tool
• Available today as a Preview download
• Suited to developers who want to run R scripts within VS
• Same dev experience for editing, debugging, and running R code.
Input and Output
• inData
‐ CSV, SAS, SPSS, ODBC connection…
• outFile
‐ XDF file; returns a data frame if left blank
Subset of variables
• varsToKeep
• varsToDrop
Subset of rows
• numRows
• rowSelection
Data Sources
• Text files
• SAS, SPSS
• Teradata
• HDFS
• Database in ODBC
• Runs in-database in SQL Server 2016
• Importing from Databases
‐ Set up ODBC first
‐ Each data source (RxOdbcData) is one query (not a database)
‐ SQL Server 2016 can run MRS internally; no ODBC required
• rxDataStep
‐ Subset rows within criteria (rowSelection)
‐ Select columns by name (varsToKeep, varsToDrop)
‐ Create and modify variables (transformations)
‐ Pull data into an in-memory data.frame
Working with Data
• Subsetting Rows
• rowSelection takes a logical vector, just like subset()
• Chain multiple criteria together with & and |
• numRows=N – to get the first N rows of a dataset
• Selecting Columns
• varsToKeep, varsToDrop
• One quirk: can't keep/drop when inData == outFile
• Transformation
• Create new variables
• Modify existing variables
• Change the variable type
• Takes a list of named elements, each a new variable
• Complex transformation
• Simple transformations depend on a single row of data
• Complex transformations depend on multiple rows
• In distributed context, that means moving results between nodes
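The row, column, and transformation options above have close base-R analogues, which may help when reading rxDataStep calls. The sketch below uses only base R (subset(), transform(), head()) on the built-in mtcars data frame; it is not RevoScaleR code, mtcars merely stands in for an XDF file, and the variable name wt_kg is made up for illustration.

```r
# Base-R sketch of the rxDataStep ideas on an in-memory data frame.
# Assumption: mtcars stands in for an XDF file; wt_kg is a hypothetical new variable.
cars <- mtcars

# Subsetting rows: a logical condition, criteria chained with & and |
rows <- subset(cars, mpg > 20 & (cyl == 4 | cyl == 6))

# Selecting columns: keep by name (the varsToKeep idea)
kept <- rows[, c("mpg", "cyl", "wt")]

# Transformation: create a new variable and change the type of an existing one
out <- transform(kept,
                 wt_kg = wt * 1000 * 0.4536,  # wt is recorded in 1000 lbs
                 cyl   = factor(cyl))          # numeric -> factor

# First N rows (the numRows = N idea)
first5 <- head(out, 5)
print(first5)
```

Each new variable here depends only on values in its own row, which is what makes it a simple transformation in the sense used above.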
Managing Factors
• Factors count as complex because levels, level order
and level encoding can vary across chunks
• Use rxFactors to create and modify factors
• The F() shortcut
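A small base-R illustration of why factors count as complex: when a factor is built separately for each chunk, the level sets and integer codes can disagree. The two hand-made chunks below are an assumption standing in for chunks of an XDF file; the fix shown (declaring one level set up front) is a plain base-R sketch, not the rxFactors implementation.

```r
# Why factors are "complex": each chunk sees only some of the levels.
# Assumption: two hand-made chunks stand in for chunks of an XDF file.
chunk1 <- c("Mon", "Tue", "Mon")
chunk2 <- c("Wed", "Mon")

# Building the factor per chunk gives different level sets and codes
f1 <- factor(chunk1)
f2 <- factor(chunk2)
print(levels(f1))      # levels seen in chunk 1 only
print(levels(f2))      # levels seen in chunk 2 only
print(as.integer(f2))  # "Wed" gets code 2 here; code 2 means "Tue" in chunk 1

# Fix: declare one level set up front and apply it to every chunk
day_levels <- c("Mon", "Tue", "Wed")
g1 <- factor(chunk1, levels = day_levels)
g2 <- factor(chunk2, levels = day_levels)
print(as.integer(g2))  # "Wed" now has the same code in every chunk
```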
How Algorithms Work in MS R Server:
• Chunk by chunk – aka Parallel External Memory Algorithms (PEMAs)
• Data just needs to fit on disk
• Chunks of data distributed to all available cores/nodes
• Intermediate results calculated in-memory for each chunk
• Final result assembled in-memory
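The chunk-by-chunk idea can be sketched in base R with ordinary least squares: each chunk contributes small intermediate matrices, and only those are combined at the end. This is an illustrative sketch, not how MRS is implemented; mtcars split into chunks of 8 rows is an assumption standing in for on-disk chunks, and the loop runs sequentially rather than in parallel.

```r
# Chunk-by-chunk least squares in base R (a sketch of the PEMA idea).
# Assumption: mtcars in chunks of 8 rows stands in for on-disk chunks.
chunk_size <- 8
starts <- seq(1, nrow(mtcars), by = chunk_size)

# Intermediate results: X'X and X'y, accumulated chunk by chunk
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)
for (s in starts) {
  rows  <- s:min(s + chunk_size - 1, nrow(mtcars))
  chunk <- mtcars[rows, ]                 # only this chunk is "in memory"
  X <- cbind(1, chunk$hp, chunk$wt)       # intercept, hp, wt
  xtx <- xtx + crossprod(X)               # t(X) %*% X for this chunk
  xty <- xty + crossprod(X, chunk$mpg)    # t(X) %*% y for this chunk
}

# Final result assembled from the accumulated pieces
beta_chunked <- drop(solve(xtx, xty))

# Compare with fitting on the full data at once
beta_lm <- unname(coef(lm(mpg ~ hp + wt, data = mtcars)))
print(all.equal(beta_chunked, beta_lm))
```

The two coefficient vectors should agree up to floating-point error, because the per-chunk sums add up to the same matrices the full-data fit would use. In a distributed setting the per-chunk pieces could be computed on different cores or nodes and summed afterwards.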
Available ML Algorithms
• Linear regression – rxLinMod
• Generalized linear models – rxLogit, rxGlm
• Decision tree – rxDTree
• Gradient boosted decision trees – rxBTrees
• Decision forest – rxDForest
• K-means – rxKmeans
• Naïve Bayes – rxNaiveBayes
PEMAs in Context
• On Laptops
• Chunks pulled from local disk
• All cores process chunks in parallel
• Computer cluster
• Chunks partitioned across nodes
• All cores on nodes process local chunks in parallel
Analyzing Data with MRS
• Pre-computed metadata
• rxGetInfo, rxGetVarInfo
• Summary Statistics
• rxSummary, rxCube
• Predictive modelling
• Regressions: rxLogit, rxGlm
• Decision tree and forest: rxBTrees, rxDTree, rxDForest
• K-means and Naïve Bayes: rxKmeans, rxNaiveBayes
• Metadata Retrieval – numeric
• rxSummary(~ arr_delay, data = flightsXdf) – one variable
• rxSummary(~ arr_delay + dep_delay, data = myXdf) – two variables
• rxSummary(arr_delay ~ day_of_week, data = myXdf) – groupwise
• Metadata Retrieval – categorical
• rxCrossTabs for frequency tables
• rxCube for long tables
• Formula interface: rxCrossTabs(~ day_of_week : dest_F, data = myXdf)
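For readers without MRS installed, the summary patterns above have rough base-R counterparts. The sketch below uses summary(), aggregate(), and xtabs() on the built-in mtcars data; these are stand-ins chosen for illustration, their output formats differ from the rx functions, and base xtabs() crosses variables with + where the slide's formula uses a colon.

```r
# Base-R analogues of the summary calls above, on the built-in mtcars data.
# Assumption: base functions stand in for the rx* summaries; output formats differ.
print(summary(mtcars$mpg))                # one variable
print(summary(mtcars[, c("mpg", "hp")]))  # two variables

# Groupwise summary: mean mpg for each value of cyl
by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(by_cyl)

# Frequency table for two categorical variables
tab <- xtabs(~ cyl + gear, data = mtcars)
print(tab)
```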
Modeling Workflow in MRS
1. Load Data (rxImport)
2. Exploratory analysis (rxGetInfo, rxSummary, rxCube)
3. Clean data (rxDataStep, rxFactors)
4. Build model (rxLinMod, rxGlm, etc.)
5. Evaluate and predict – (rxPredict)
Using Formula Syntax in Models
• One predictor
‐ rxLinMod(y ~ x, data = myXdf)
• Two predictors
‐ rxLinMod(y ~ x + z, data = myXdf)
• Two predictors with interaction term
‐ rxLinMod(y ~ x * z, data = myXdf)
Sample
• rxLinMod(mpg ~ hp + wt, data = mtcars)
• rxLinMod(delayed ~ dep_time + dayofweek, data = flightsXdf)
• rxNaiveBayes(Species ~ Sepal.Length + Sepal.Width, data = iris)
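The same formula forms can be tried without MRS using base R's lm(), which accepts the same y ~ x style. This is a sketch with lm() as a stand-in on the built-in mtcars data, not RevoScaleR code; it shows which coefficients each formula produces.

```r
# Formula syntax with base R's lm() on the built-in mtcars data.
# Assumption: lm() is used as a stand-in; this is not RevoScaleR code.
one  <- lm(mpg ~ hp,      data = mtcars)  # one predictor
two  <- lm(mpg ~ hp + wt, data = mtcars)  # two predictors
both <- lm(mpg ~ hp * wt, data = mtcars)  # two predictors plus interaction

# hp * wt is shorthand for hp + wt + hp:wt
print(names(coef(one)))   # "(Intercept)" "hp"
print(names(coef(two)))   # "(Intercept)" "hp" "wt"
print(names(coef(both)))  # "(Intercept)" "hp" "wt" "hp:wt"
```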
Model Evaluation and Prediction
• MRS models don’t include fitted values or residuals by default
• Generate fitted values, residuals, and predictions with rxPredict:
#Fitted values: data used to fit model
rxPredict(modelObject = delayedMod, data = flightsXdf, outData = flightsXdf)
• Other options
‐ Residuals: computeResiduals = TRUE
‐ Standard errors: computeStdErrors = TRUE
‐ Confidence intervals: interval = "confidence"
‐ Prediction intervals: interval = "prediction"
• For binary classifiers: rxRocCurve
‐ Compares actual values to one or more predictions generated by rxPredict
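The difference between fitted values, residuals, confidence intervals, and prediction intervals can be seen with base R's predict() on an lm() fit. The sketch below is an analogue under stated assumptions, not rxPredict itself: lm() on mtcars stands in for an MRS model and XDF data, and the column names mpg_pred and mpg_resid are made up for illustration.

```r
# Fitted values, residuals, and intervals with base R (not RevoScaleR).
# Assumption: lm() on mtcars stands in for an MRS model and XDF data.
fit <- lm(mpg ~ hp + wt, data = mtcars)

scored <- mtcars[, c("mpg", "hp", "wt")]
scored$mpg_pred  <- predict(fit, newdata = scored)  # fitted values
scored$mpg_resid <- scored$mpg - scored$mpg_pred    # residuals = actual - fitted

# "confidence" bounds the mean response; "prediction" bounds a new observation
conf_int <- predict(fit, newdata = scored, interval = "confidence")
pred_int <- predict(fit, newdata = scored, interval = "prediction")

print(head(cbind(conf_int,
                 pred_lwr = pred_int[, "lwr"],
                 pred_upr = pred_int[, "upr"])))
```

Prediction intervals come out wider than confidence intervals at every row, since they also account for the noise in an individual observation.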