Making Good Use of Data at Hand: Open Source Tools
Mark C. Cooke, Ph.D.
Tax Management Associates, Inc.

Overview
• Open Data concept
  – Data is produced for various purposes but can be used to derive novel insights, i.e. “Business Intelligence (BI)”
• Open Source tools exist for making good use of existing data sets
  – ETL (“Extract, Transform, Load”) + Analytics
• Knime and the R language are two of the most powerful resources for leveraging data

2013 Open Source Tools for Data Analysis Mark C Cooke - Tax Management Associates, Inc.

Open Data
• Open Data concept – governments collect, through existing management systems, enormous quantities of data that can be leveraged in alternative and novel ways to find solutions.
• The goal is often to leverage the broader community to develop solutions that governments may not have previously conceived.
• Open Data and Business Intelligence should be used by internal consumers as well.

Open Data
[image]

“Data Scientist”
[image]

Doing Data the Old Way
• Data is locked inside systems :-(
  – Software systems are designed to wrap a Graphical User Interface (GUI) around data.
  – Historically, the GUI functionality has had to be programmed to produce reports, views, and analysis.
• The GUI is driven by the sole purpose of the software. But the data has many purposes…

Open Data – Way Forward
• Making data talk across platforms: AS400, SQL, XML, Excel, PDFs, text files, image files (.png, .jpeg, etc.), shapefiles (ESRI), email archives, web scraping, APIs from social media, etc.
• Connecting data across multiple platforms
• Using data for novel insight
• Tools now exist for importing, cleaning, standardizing, and analyzing data using complex algorithms built into accessible packages

Open Data
• These systems are known as “data agnostic”:
  Database Agnostic - Database-agnostic is a term describing the capacity of software to function with any vendor’s database management system (DBMS). In information technology (IT), agnostic refers to the ability of something – such as software or hardware – to work with various systems, rather than being customized for a single system.
  – http://searchdatamanagement.techtarget.com/definition/database-agnostic

Data Science
• What is the breadth of the tool base?
  – Reading in data from various resources
  – Transforming data to merge various resources, translate data into a usable format, or add new data elements
  – Analyzing data, from basic logical and statistical functions to higher-level machine learning tools and algorithms
“Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.” http://en.wikipedia.org/wiki/Machine_learning

Data Science
• What is the output?
  – “Business Intelligence”, or actionable information that drives business decisions through insight
  – Creating new insights from existing data
  – Visualizations: representations of that BI in ways that make it consumable to a non-specialist audience
“According to Friedman (2008) the ‘main goal of data visualization is to communicate information clearly and effectively through graphical means.’” http://en.wikipedia.org/wiki/Data_visualization

• Knime is a GUI-based, data agnostic tool for ETL, analytics, and visualization.
• Knime is an open source platform for the desktop, with commercial enterprise server layers including collaboration tools and web services (web portal).
• Knime supports other analytics languages, including the R language for statistical computing
www.Knime.org

• The advantages of Knime:
  – Rapid development environment
  – Powerful processing that handles large datasets on commodity hardware
    • Allows for 100% data samples up to millions of elements row-wise
  – Workflows can be saved, shared, and duplicated
  – Nodes are stepwise, allowing for quick revisions
  – Nodes provide access to complex algorithms

What is Knime?
[image]

The Knime Workbench
[image]
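The idea that a workflow is a chain of stepwise nodes, each handing its output to the next, can be sketched outside Knime. The following is only an illustration in plain Python, not Knime or R; the function names (`read_csv_node`, `column_filter_node`, `run_workflow`) and the sample data are hypothetical and are not part of Knime's API.

```python
import csv
import io
from functools import partial

# Hypothetical sample data standing in for an input file.
SAMPLE_CSV = """account_id,industry,taxable_value
1,Retail,1000
2,Manufacturing,2500
"""

def read_csv_node(text):
    """Reader step: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def column_filter_node(rows, keep):
    """Filter step: keep only the named columns in each row."""
    return [{name: row[name] for name in keep} for row in rows]

def run_workflow(data, nodes):
    """Run each node in order, passing one node's output to the next."""
    for node in nodes:
        data = node(data)
    return data

# A "workflow" here is just an ordered list of steps.
workflow = [
    read_csv_node,
    partial(column_filter_node, keep=["account_id", "taxable_value"]),
]
result = run_workflow(SAMPLE_CSV, workflow)
print(result)
```

Because each step is independent, revising the workflow means swapping or reordering one entry in the list, which is roughly what "nodes are stepwise, allowing for quick revisions" suggests.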
Knime Nodes
• Nodes are the workers inside a workflow
• Every node serves at least one function
• Nodes can also be built as MetaNodes, which are collections of nodes performing common functions
• A collection of nodes is called a “workflow”
• You can develop nodes with Java and the node development support

Knime Nodes
• For example, the File Reader node is an intelligent file reader that can determine the type of file
• However, it also allows the end user to adjust parameters

Knime Nodes
• The Column Filter node allows users to filter columns from a table (conveniently named…)

Knime Nodes (sample)
[image]

Knime Integrates with R
• R integration is key to expanding the data analysis and visualization capabilities of Knime
• R supports data ingestion of complex files (including ESRI)
• R supports complex data manipulation and statistical analysis
• R supports a wide variety of highly customizable visualizations
So, what is R, exactly?

R Project for Statistical Computing
www.r-project.org
• R is an open source scripting language that can be run inside Knime, but also independently in a command-line environment
• Several GUIs for R exist, such as RStudio, from a group that provides software for using R as well as training and extension packages (www.rstudio.com)
• Community contributions make up the bulk of R packages, which now total more than 4,700
R Project for Statistical Computing
www.r-project.org
• The R base package (standard software) provides methods for reading data, ETL, analysis, and visualizations
• The community-provided packages take this base and build on it depending on the interest of the producer
• Packages stretch across all imaginable data uses, including advanced statistical analyses, machine learning and data mining, and advanced graphical visualizations (including sophisticated mapping)

Popular R Packages
A (very) brief overview of popular packages:
• plyr – for advanced data manipulation
• maps – for mapping datasets onto georeferenced outputs
• ggplot2 – for advanced data visualizations
• RCurl – for reading data from webpages and repositories
• tm – for text mining applications
• sna – for social network analysis

R Inside Knime
Basic Data Manipulation:
[image]

R Inside Knime
Basic Visual using Maps:
[image]

Knime + R + TPP
Case examples for working with TPP:
• Look at the distribution of TPP accounts across a county, state, or region
• Map entities or create a heatmap (choropleth) of the distribution of personal property values
• Compare personal property reporting across schedules and across industry sectors (M&E across manufacturing types)
• Compare like-kind entity reporting (franchises, big-box) for consistency in values
• Compare personal property accounts with other data resources (real property accounts, permits, etc.)

Brief Demonstration
Data: Florida
• 67 counties
• More than 1.24 million personal property accounts
Goals:
1. Group all data by industry to illustrate the taxable value and exempted value by type
2. Subset the data to include only a particular industry
3. Map the state-wide exempt value in a choropleth

Questions?
Thank you for your time and attention. I am always happy to discuss data, so please feel free to contact me using any of the contact information below.
Mark C Cooke
[email protected]
704.847.1234 (office)
704.953.6349 (cell)
www.linkedin.com/in/markccooke
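
Goals 1 and 2 of the brief demonstration (grouping by industry, then subsetting to one industry) can be sketched as follows. This is a minimal illustration in plain Python, not the Knime workflow or R code used in the presentation. The records, the field names (`county`, `industry`, `taxable_value`, `exempt_value`), and all amounts are made up for the example; the actual Florida data is not reproduced here.

```python
from collections import defaultdict

# Hypothetical records; values are invented for illustration only.
accounts = [
    {"county": "Alachua", "industry": "Retail",
     "taxable_value": 120000, "exempt_value": 25000},
    {"county": "Duval", "industry": "Manufacturing",
     "taxable_value": 900000, "exempt_value": 25000},
    {"county": "Duval", "industry": "Retail",
     "taxable_value": 80000, "exempt_value": 20000},
]

def totals_by(rows, key):
    """Sum taxable and exempt value for each distinct value of `key`."""
    totals = defaultdict(lambda: {"taxable_value": 0, "exempt_value": 0})
    for row in rows:
        group = totals[row[key]]
        group["taxable_value"] += row["taxable_value"]
        group["exempt_value"] += row["exempt_value"]
    return dict(totals)

def subset(rows, industry):
    """Keep only the accounts that belong to one industry."""
    return [row for row in rows if row["industry"] == industry]

# Goal 1: taxable and exempt value by industry.
by_industry = totals_by(accounts, "industry")

# Goal 2: restrict the data to a single industry.
retail_only = subset(accounts, "Retail")

# Input for goal 3: per-county totals that a choropleth could shade.
exempt_by_county = totals_by(retail_only, "county")

print(by_industry)
print(exempt_by_county)
```

The sketch stops short of goal 3: drawing the choropleth needs county boundary data and a mapping library, so only the per-county totals that such a map would display are computed.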