R + R Tool for Visual Studio = Data Science
Bahrudin Hrnjica, MVP, daenet
[email protected]
@bhrnjica

Agenda
• Quick intro to R Language
• Statistics and graphs
• Microsoft R Server
• Microsoft R Server and RevoScaleR

What is R
• A programming language, a statistical package, an interpreter, and Open Source
  • Free
  • Highly extensible
  • Focused on statistics and machine learning
  • Transparent and reproducible
  • Single-threaded
  • In-memory data

What R is not
• It is not a database, but it connects to DBMSs
• It has no point-and-click user interface, but it connects to Java and Tcl/Tk
• The language interpreter can be very slow, but it allows calling your own C/C++ code
• It has no spreadsheet view of data, but it connects to Excel/MS Office

Two distributions of R
CRAN project – http://r-project.org
• The R Project for Statistical Computing
• Original distribution since the beginning
• Open Source project written in C, C++, FORTRAN
MRAN project – http://mran.microsoft.com
• Microsoft R Application Network
• Enhanced distribution of R
• Support for BLAS and LAPACK libraries and multi-core processing
R code is fully compatible with both distributions.

Microsoft R Open
• Free
• Open source, cross-OS project
• Enhanced distribution of R
• Supports a variety of:
  • big data statistics
  • predictive modeling
  • machine learning capabilities
• Single-threaded
• In-memory data manipulation

Microsoft R portfolio

Microsoft R Server – MRS
MRS extends open-source R to allow
• Multi-threading
  • Matrix operations, linear algebra, and many other math operations run on all available cores
• Parallel processing
  • ScaleR functions utilize all available resources, local or distributed
• On-disk data storage
  • RAM limitation lifted

Break Through Your Memory Barrier!
• Working with data too big to fit in memory
• Building models that take too long to run
• Working with clusters and distributed systems

MRS's Native Data Format: XDF File
• Chunk-oriented
  • Easy to distribute to nodes
  • Fast to append
• Column-oriented
  • Fast retrieval of variables
• Pre-computed metadata

Tools to write, edit, debug and run R code (scripts)
• RStudio – http://rstudio.com – free, cross-OS tool for R
• R Tools for Visual Studio – Visual Studio vNext will include this tool
  • Today it can be downloaded as a Preview version
  • Suited for developers who run R scripts within VS
  • Same dev experience for editing, debugging and running R code

Input and Output
• inData
  • CSV, SAS, SPSS, ODBC connection…
• outFile
  • XDF file; returns a data frame if left blank
• Subset of variables
  • varsToKeep
  • varsToDrop
• Subset of rows
  • numRows
  • rowSelection

Data Sources
• Text files
• SAS, SPSS
• Teradata
• HDFS
• Database in ODBC
• Runs in-database in SQL Server 2016

Importing from Databases
• Set up ODBC first
• Each data source (RxOdbcData) is one query (not a database)
• SQL Server 2016 can run MRS internally; no ODBC required

rxDataStep
• Subset rows by criteria (rowSelection)
• Select columns by name (varsToKeep, varsToDrop)
• Create and modify variables (transforms)
• Pull data into an in-memory data.frame

Working with Data
• Subsetting rows
  • rowSelection takes a logical vector, just like subset()
  • Chain multiple criteria together with & and |
  • numRows = N to get the first N rows of a dataset
• Selecting columns
  • varsToKeep, varsToDrop
  • One quirk: you can't keep/drop when inData == outFile
• Transformations
  • Create new variables
  • Modify existing variables
  • Change the variable type
  • Takes a list of named elements, each a new variable
• Complex transformations
  • Simple transformations depend on a single row of data
  • Complex transformations depend on multiple rows
  • In a distributed context, that means moving results between nodes

Managing Factors
• Factors count as complex because levels, level order and level encoding can vary across chunks
• Use rxFactors to create and modify factors
• The F() shortcut

How Algorithms Work in MS R Server
• Chunk by chunk, aka Parallel External Memory Algorithms (PEMAs)
  • Data just needs to fit on disk
  • Chunks of data are distributed to all available cores/nodes
  • Intermediate results are calculated in-memory for each chunk
  • The final result is assembled in-memory

Available ML Algorithms
• Linear regression – rxLinMod
• Generalized linear models – rxLogit, rxGlm
• Decision tree – rxDTree
• Gradient boosted decision trees – rxBTrees
• Decision forest – rxDForest
• K-means – rxKmeans
• Naïve Bayes – rxNaiveBayes

PEMAs in Context
• On laptops
  • Chunks pulled from local disk
  • All cores process chunks in parallel
• Computer cluster
  • Chunks partitioned across nodes
  • All cores on nodes process local chunks in parallel

Analyzing Data with MRS
• Pre-computed metadata
  • rxGetInfo, rxGetVarInfo
• Summary statistics
  • rxSummary, rxCube
• Predictive modelling
  • Regressions: rxLogit, rxGlm
  • Decision tree and forest: rxBTrees, rxDTree, rxDForest
  • K-means and Naïve Bayes: rxKmeans, rxNaiveBayes
• Metadata retrieval – numeric
  • rxSummary(~ arr_delay, data = flightsXdf) – one variable
  • rxSummary(~ arr_delay + dep_delay, data = myXdf) – two variables
  • rxSummary(arr_delay ~ day_of_week, data = myXdf) – groupwise
• Metadata retrieval – categorical
  • rxCrossTabs for frequency tables
  • rxCube for long tables
  • Formula interface: rxCrossTabs(~ day_of_week : dest_F, data = myXdf)

Modeling Workflow in MRS
1. Load data (rxImport)
2. Exploratory analysis (rxGetInfo, rxSummary, rxCube)
3. Clean data (rxDataStep, rxFactors)
4. Build model (rxLinMod, rxGlm, etc.)
5. Evaluate and predict (rxPredict)

Using Formula Syntax in Models
• One predictor
  • rxLinMod(y ~ x, data = myXdf)
• Two predictors
  • rxLinMod(y ~ x + z, data = myXdf)
• Two predictors with interaction term
  • rxLinMod(y ~ x * z, data = myXdf)
• Samples
  • rxLinMod(mpg ~ hp + wt, data = mtcars)
  • rxLinMod(delayed ~ dep_time + dayofweek, data = flightsXdf)
  • rxNaiveBayes(Species ~ Sepal.Length + Sepal.Width, data = iris)

Model Evaluation and Prediction
• MRS models don't include fitted values or residuals by default
• Generate fitted values, residuals and predictions with rxPredict:
  # Fitted values: data used to fit model
  rxPredict(modelObject = delayedMod, data = flightsXdf, outData = flightsXdf)
• Other options
  • Residuals: computeResiduals = TRUE
  • Standard errors: computeStdErrors = TRUE
  • Confidence intervals: interval = "confidence"
  • Prediction intervals: interval = "prediction"
  • For binary classifiers: rxRocCurve
    • Compares actual values to one or more predictions generated by rxPredict
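
Sketch: the chunk-by-chunk idea in base R
The PEMA slides describe computing intermediate results per chunk and assembling the final result in memory, and the prediction slide notes that fitted values and residuals are produced on request. The block below is a minimal base R sketch of that idea for the mpg ~ hp + wt sample, assuming the chunks are simply slices of the in-memory mtcars data frame. It is not how rxLinMod or rxPredict are implemented, and the names chunk_size, chunk_crossprod and chunk_predict are hypothetical helpers introduced only for illustration.

```r
# Base R sketch of a chunk-by-chunk (PEMA-style) linear regression.
# Assumption: this illustrates the idea only; it does not use RevoScaleR.

# Split row indices into chunks, as if each chunk were read from disk in turn.
chunk_size <- 8                        # hypothetical chunk size
row_ids    <- seq_len(nrow(mtcars))
chunks     <- split(row_ids, ceiling(row_ids / chunk_size))

# Intermediate result for one chunk: the cross-products X'X and X'y.
chunk_crossprod <- function(rows) {
  d <- mtcars[rows, ]
  X <- model.matrix(mpg ~ hp + wt, data = d)
  list(xtx = crossprod(X), xty = crossprod(X, d$mpg))
}

# Each chunk is processed independently, so this step could run in parallel.
partials <- lapply(chunks, chunk_crossprod)

# Final result assembled in memory: sum the partial results, then solve once.
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))
chunked_coef <- drop(solve(xtx, xty))

# Reference fit on the full in-memory data frame, for comparison.
full_fit <- lm(mpg ~ hp + wt, data = mtcars)

# Fitted values and residuals are computed on request, again chunk by chunk.
chunk_predict <- function(rows) {
  d <- mtcars[rows, ]
  X <- model.matrix(mpg ~ hp + wt, data = d)
  fitted <- drop(X %*% chunked_coef)
  data.frame(fitted = fitted, resid = d$mpg - fitted)
}
scored <- do.call(rbind, unname(lapply(chunks, chunk_predict)))

print(chunked_coef)
print(all.equal(chunked_coef, coef(full_fit)))
```

Summing cross-products works here because linear regression only needs X'X and X'y, both of which are additive across rows. Models without such additive summaries need a different per-chunk intermediate result, which this sketch does not cover.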