Where is my data and what does it look like? Getting a handle on organizational data with visual data discovery

In a previous blog post, I talked about the growing volumes of both structured and unstructured data and the opportunities and challenges facing organizations that want to unlock the information in that data. Now let's turn our attention to the tools needed to handle massive data volumes efficiently so that we can fulfill the promise of unlocking that business-critical information.

Before launching a reporting or analytics project, we have to make sure that the right data is available, with sufficient history, quality, and transactional detail to meet our objectives. We also need to determine who is authorized to access, view, and interact with this data. Data resides in dozens of disparate systems, file formats, and geographic locations. Simply cataloging what is available and who owns that data can be a daunting task. A partial sample of systems, leading vendors, and internal owners for a typical company might look like this:

Data Store / System Type                                    Vendors                        Data Owner(s)
Enterprise Resource Planning                                SAP, Sage, Epicor, Infor       Operations, Finance
Spreadsheets                                                Microsoft, Apache OpenOffice   Individuals across the organization
Relational Database Management System                       Oracle, Microsoft, MySQL       IT
Human Resources and Workforce Management                    KRONOS, SuccessFactors         HR
Sales Force Automation & Customer Relationship Management   SalesForce, Microsoft          Sales & Marketing

For data that resides in text files and spreadsheets, discovery is relatively straightforward: open the file and examine the contents. But what about the bulk of organizational data that resides in relational database tables? Business users aren't data jockeys and shouldn't be expected to be proficient in SQL or scripting languages just to explore their data. Yet IT departments have downsized significantly, and business support requests are secondary to keeping critical network and server infrastructure operational.
Motivated by these insights, Dimensional Insight created the Domain Editor, a visual data discovery tool. Domain Editor allows users to quickly get an overview of a database without having to write SQL. Users can see the tables and views, the number of records each contains, and the relationships between them. Data discovery tools make it easy to view sample data from each table in order to see the column formats, their data types, and the unique values for each column. Users can quickly determine whether the data needed for their projects is available, and what level of transactional detail and length of history is captured, perhaps in related transactional tables.

Why is this important? Corporate databases can contain a fair amount of redundant, erroneous, or useless information. It's relatively easy to build database tables for a project that is abandoned months later or left undocumented. Source system feeds and business rules can change. Storage hardware and cloud-based options get cheaper by the month, enabling this tendency toward database bloat. With the rapid growth in data volumes, we want to screen out suspect fields before they clutter the models we build for reporting and analytics. Business decisions based on bad data are not benign; they are dangerous.

Data discovery tools are not a replacement for data validation policies and procedures, which should be implemented before data is ever entered into a database. Rather, the summary statistics these tools generate can provide important insights that help flag suspect fields for further investigation. Take a look at the following statistics generated for a database table.
This data is contrived, but it is drawn from actual examples I've encountered in corporate databases:

Field name            Record count   Unique values   Missing values   Min Value   Max Value
DateOfBirth           1,200,890      5012            3%               1/1/1900    11/30/2012
EmploymentStartDate   1,200,890      368             0%               12/1/2011   11/30/2012
HomeZipCode           1,200,890      566             47%              10012       110989
Education             1,200,890      5               8%               8           12
YearsEd               1,200,890      5               8%               8           12
ActiveStatus          1,200,890      1               0%               0           0

We can go a step further and look at the distribution of values for these fields. For the DateOfBirth field, this might look like:

Value       Record count
1/1/1900    19,200
12/3/1936   1
1/7/1937    1
6/9/1938    2
…           …

How likely is it that 19,200 of our customers (about 1.6% of all records) were born on 1/1/1900? Before data validation policies became widespread, errors like this occurred frequently. Perhaps 1/1/1900 was the default value for the Date of Birth field in an online form. If the user simply skips entering this value, guess what gets written to the database? Legacy databases are littered with similar examples. So to salvage the DateOfBirth field, replacing the 1/1/1900 value with MISSING might be prudent.

EmploymentStartDate illustrates another problem. Since the minimum (12/1/2011) and maximum (11/30/2012) start dates span a single year, it's impossible to have more than 365 unique values (366 when the span includes a leap day), yet our data survey turns up 368. Browsing through the distribution of unique values for this field turns up the erroneous entries:

Value       Record count
4/31/2012   12
6/31/2012   320
…           …

April and June have only 30 days, so these dates cannot exist.

Looking at HomeZipCode, the six-digit maximum value of 110989 jumps out. And what about the 47% missing? While some amount of missing data is tolerable, we may have to discard this field altogether, or perform further discovery to see whether HomeZipCode is captured more completely in another table or database.

What about Education and YearsEd? The two fields have identical summary stats. Could it be that they are actually duplicate fields? Very likely, especially if their distributions are identical.
But without performing a record-by-record comparison of these fields, we can't be 100% certain that they are indeed duplicates. Verifying field duplication is probably the most computationally expensive process in data discovery.

Finally, let's see what's wrong with ActiveStatus. The field contains one and only one value: every record holds 0, so the field carries no usable information that would differentiate one record from another.

These are just a few of the ways that visual data discovery helps place business intelligence initiatives on a solid foundation. By quickly surfacing data errors, outliers, and other inconsistencies, users can determine which errors can be remedied and which fields must be flagged as unusable. In the rush to capture the rapidly growing volumes and varieties of organizational, market, and customer data, it's easy to ignore the old maxim of "garbage in, garbage out". We can only derive business-critical insights from valid data.

Visual data discovery tools such as Dimensional Insight's Domain Editor help business intelligence consumers determine which database tables and fields are available within their organization for reporting and analytics initiatives. More importantly, these tools provide users with an easy, SQL-free approach to assessing the quality and content of that data.
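As a footnote for the technically inclined: the record-by-record duplicate comparison and the constant-field test discussed above each reduce to a few lines of pandas. This is a sketch with made-up rows; the column names (Education, YearsEd, ActiveStatus) follow the example in this post.

```python
import pandas as pd

# Toy rows mirroring the example above: Education and YearsEd share
# identical summary statistics, and ActiveStatus holds a single value.
# (Column names follow the article; the row values are made up.)
df = pd.DataFrame({
    "Education":    [8, 10, 12, 12, 8],
    "YearsEd":      [8, 10, 12, 12, 8],
    "ActiveStatus": [0, 0, 0, 0, 0],
})

# The record-by-record comparison: only an exact element-wise check
# proves that two fields with matching summary stats are true duplicates.
is_duplicate = df["Education"].equals(df["YearsEd"])

# A constant field carries no information: exactly one distinct value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

print("Education == YearsEd on every record:", is_duplicate)
print("Constant fields:", constant_cols)
```

Note that the duplicate check still scans every record, which is why it is the expensive step; summary statistics merely nominate candidate pairs worth comparing.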