Where is my data and what does it look like? Getting a handle on
organizational data with visual data discovery
In a previous blog post, I talked about the growing volumes of both structured and unstructured data
and the opportunities and challenges facing organizations that want to unlock the information in that
data. Now let’s turn our attention to the tools needed to handle massive data volumes efficiently so that
we can fulfill the promise of unlocking that business-critical information.
Before launching a reporting or analytics project, we have to make sure that the right data is available,
with sufficient history, quality and transactional detail to meet our objectives. We also need to
determine who is authorized to access, view, and interact with this data. Data resides in dozens of
disparate systems, file formats and geographic locations. Simply cataloging what is available and who
owns that data can be a daunting task. A partial sample of systems, leading vendors, and internal
owners for a typical company might look like this:
Data Store / System Type                                  | Vendors                      | Data Owner(s)
----------------------------------------------------------|------------------------------|------------------------------------
Enterprise Resource Planning                              | SAP, Sage, Epicor, Infor     | Operations, Finance
Spreadsheets                                              | Microsoft, Apache OpenOffice | Individuals across the organization
Relational Database Management System                     | Oracle, Microsoft, MySQL     | IT
Human Resources and Workforce Management                  | KRONOS, SuccessFactors       | HR
Sales Force Automation & Customer Relationship Management | Salesforce, Microsoft        | Sales & Marketing
For data that resides in text files and spreadsheets, discovery is relatively straightforward: open the file
and examine the contents. But what about the bulk of organizational data that resides in relational
database tables? Business users aren’t data jockeys and shouldn’t be expected to be proficient in SQL or
scripting languages just to explore their data. Yet IT departments have downsized significantly, and
business support requests take a back seat to keeping critical network and server infrastructure operational.
Motivated by these insights, Dimensional Insight created the Domain Editor, a visual data discovery tool.
Domain Editor allows users to quickly get an overview of a database without having to write SQL. Users
can see the tables and views, the number of records contained in the database, and the relationships
between records. Data discovery tools make it easy to view sample data from each table and to see each
column’s format, its data type, and the unique values it contains. Users can quickly determine
whether the data needed for their projects is available, and what level of transactional detail and length
of history is captured, perhaps in related transactional tables.
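For illustration, here is a rough sketch of the kind of metadata queries such a tool can run on the user’s behalf, written in Python with SQLAlchemy and pandas. The connection URL is a placeholder, and this is not Domain Editor’s actual implementation:

```python
# A sketch of the schema overview a discovery tool automates: list the
# tables, count their records, and pull a few sample rows from each.
import pandas as pd
from sqlalchemy import create_engine, inspect, text

engine = create_engine("postgresql://user:pass@host/dbname")  # placeholder URL
inspector = inspect(engine)

for table in inspector.get_table_names():
    with engine.connect() as conn:
        row_count = conn.execute(
            text(f'SELECT COUNT(*) FROM "{table}"')
        ).scalar()
        sample = pd.read_sql_query(f'SELECT * FROM "{table}" LIMIT 5', conn)
    print(f"{table}: {row_count:,} records")
    for col in inspector.get_columns(table):
        print(f"  {col['name']}  {col['type']}")  # column name and data type
    print(sample)  # a peek at actual values
```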
Why is this important? Corporate databases can contain a fair amount of redundant, erroneous, or
useless information. It’s relatively easy to build database tables for a project that is abandoned months
later or left undocumented. Source system feeds and business rules can change. Storage hardware and
cloud-based options get cheaper by the month, enabling this tendency toward database bloat.
With the rapid growth in data volumes, we want to screen out suspect fields before they clutter the
models we build for reporting and analytics. Business decisions based on bad data are not benign; they
are dangerous. Data discovery tools are not a replacement for data validation policies and procedures
that should be implemented before data is ever entered into a database. Rather, the summary statistics
these tools generate can provide important insights that help flag suspect fields for further
investigation. Take a look at the following statistics generated for a database table. The data is
contrived, but it is based on actual examples I’ve encountered in corporate databases:
Field name          | Record count | Unique values | Missing values | Min Value | Max Value
--------------------|--------------|---------------|----------------|-----------|-----------
DateOfBirth         | 1,200,890    | 5012          | 3%             | 1/1/1900  | 11/30/2012
EmploymentStartDate | 1,200,890    | 368           | 0%             | 12/1/2011 | 11/30/2012
HomeZipCode         | 1,200,890    | 566           | 47%            | 10012     | 110989
Education           | 1,200,890    | 5             | 8%             | 8         | 12
YearsEd             | 1,200,890    | 5             | 8%             | 8         | 12
ActiveStatus        | 1,200,890    | 1             | 0%             | 0         | 0
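Statistics like these are straightforward to compute once a table has been pulled into a dataframe. Below is a minimal sketch in Python with pandas, assuming a hypothetical employees table; it produces the same columns as the profile above:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-field summary statistics: record count, unique values,
    missing-value percentage, and min/max."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "Field name": col,
            "Record count": len(s),
            "Unique values": s.nunique(dropna=True),
            "Missing values": f"{s.isna().mean():.0%}",
            "Min Value": s.min(),  # missing values are skipped by default
            "Max Value": s.max(),
        })
    return pd.DataFrame(rows)

# df = pd.read_sql_table("employees", engine)  # table name is hypothetical
# print(profile(df))
```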
We can go a step further and look at the distribution of values for these fields. For the DateOfBirth field,
this might look like:
Value     | Record count
----------|-------------
1/1/1900  | 19,200
12/3/1936 | 1
1/7/1937  | 1
6/9/1938  | 2
…         | …
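In pandas terms, this distribution is a one-liner (again assuming the table has been loaded into a DataFrame named df):

```python
# Frequency of each DateOfBirth value, ordered by date; a spike at a
# single value (like 1/1/1900 here) is an immediate red flag.
birth_counts = df["DateOfBirth"].value_counts().sort_index()
print(birth_counts.head(10))
```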
How likely is it that 19,200 of our customers were born on 1/1/1900? Before data validation policies
became widespread, errors like this occurred frequently. Perhaps 1/1/1900 was the default value for the
Date of Birth field in an online form. If the user simply skips entering this value, guess what gets written
to the database? Legacy databases are littered with similar examples. So to salvage the DateOfBirth
field, replacing the 1/1/1900 values with MISSING might be prudent.
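One way to perform that salvage, assuming DateOfBirth is stored as a datetime column in our df, is to map the sentinel to a proper missing value:

```python
import pandas as pd

# Treat the 1/1/1900 sentinel as missing rather than as a real birth date.
sentinel = pd.Timestamp("1900-01-01")
df["DateOfBirth"] = df["DateOfBirth"].replace(sentinel, pd.NaT)
```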
EmploymentStartDate illustrates another problem. The field spans a single year, from 12/1/2011 to
11/30/2012, so it cannot contain more than 365 unique values (366 in a leap year); yet our data survey
turns up 368. Browsing through the distribution of unique values for this field turns up the erroneous
entries:
Value     | Record count
----------|-------------
4/31/2012 | 12
6/31/2012 | 320
…         | …
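Impossible dates like these usually mean the field is stored as free text. A quick way to surface them, assuming the raw strings live in df["EmploymentStartDate"]:

```python
import pandas as pd

raw = df["EmploymentStartDate"]
# errors="coerce" turns unparseable dates (4/31, 6/31, ...) into NaT,
# so the offending raw strings are easy to isolate and count.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
print(raw[parsed.isna() & raw.notna()].value_counts())
```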
Looking at HomeZipCode, the six-digit value of 110989 jumps out, since U.S. ZIP codes have only five
digits. And what about the 47% missing values? While some amount of missing data is tolerable, we may
have to discard this field altogether or perform further discovery to see whether HomeZipCode is
captured more completely in another table or database.
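A simple pattern check (assuming U.S. five-digit ZIP codes stored as text) flags entries like 110989 and reports the missing rate at the same time:

```python
# Flag ZIP codes that are not exactly five digits.
zips = df["HomeZipCode"].astype("string")
invalid = zips[zips.notna() & ~zips.str.fullmatch(r"\d{5}")]  # e.g. 110989
print(f"invalid: {len(invalid):,}  missing: {zips.isna().mean():.0%}")
```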
What about Education and YearsEd? The two fields have identical summary stats. Could they actually be
duplicate fields? Very likely, especially if their distributions are identical. But without performing a
record-by-record comparison of the two fields, we can’t be 100% certain that they are indeed
duplicates. Verifying field duplication is probably the most computationally expensive process in data
discovery.
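When the summary statistics and distributions match, the record-by-record check itself is simple to express, though it still has to touch every row of both fields:

```python
# True only if Education and YearsEd agree on every record
# (missing values in the same positions count as equal).
duplicated = df["Education"].equals(df["YearsEd"])
print(f"Education and YearsEd identical: {duplicated}")
```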
Finally, let’s see what’s wrong with ActiveStatus. The field contains exactly one value: every record holds
a 0, so the field carries no usable information that would differentiate one record from another.
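Checks like this can be automated: any field whose distinct-value count is one (ignoring missing values) can be flagged and dropped from the model. A small sketch:

```python
# Flag constant fields: a single distinct value differentiates nothing.
constant_fields = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
print(constant_fields)  # e.g. ['ActiveStatus']
```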
These are just a few ways that visual data discovery helps place business intelligence initiatives on a
solid foundation. By quickly surfacing data errors, outliers, and other inconsistencies, users can
determine which errors can be remedied and which fields should be flagged as unusable.
In the rush to capture the rapidly growing volumes and varieties of organizational, market, and
customer data, it’s easy to ignore the old maxim of “garbage in, garbage out”. We can only derive
business-critical insights from valid data. Visual data discovery tools such as Dimensional Insight’s
Domain Editor help business intelligence consumers determine what database tables and fields are
available within their organization for reporting and analytics initiatives. More importantly, these tools
provide users with an easy, SQL-free approach to assessing the quality and content of that data.