CHAPTER SIX
FOUNDATIONS OF BUSINESS INTELLIGENCE:
DATABASE AND INFORMATION
MANAGEMENT
Introduction


Opening Case Review
Review traditional file systems
Opening Case


Toronto Globe
Data in the mainframe was not accessible
Parts of the data were copied to Access / FoxBase / Excel
This practice creates data islands (silos)
Implemented SAP Business Warehouse
Single version of truth
Opening Case

Data mining with SAP HANA


An in-memory, column-oriented database
Challenges

Getting people to adopt the DW
File Management Terms

Field – A unique data item
Last name, first name
Record – Multiple fields connected together
Customer, student
File – Multiple records of the same type
Customer file
Traditional File Systems

Fields contain a data item
Last name / first name
Fields grouped into records
Customer / student
Records group into files
Customer file
I disagree with your book. A
database does not store the files
1960s Data Management

These are legacy systems



Characterized by traditional file
processing
Data processing was sequential


Batch processing
Not possible to directly locate a particular file
record
Data dependent on the programs that
used the data

Program data dependence
1970s Data Management

Batch processing gives way to online transaction processing



Technologies



Files stored on disk rather than tape
Any record can be located in the same amount
of time
Indexed Sequential Access Method (ISAM)
Virtual Storage Access Method (VSAM)
Direct Access files

Use a hashing function on the record key to derive the record's location
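Direct access through hashing can be sketched roughly as follows. This is an illustration only: the slot count, the modulo hash, and the names `NUM_SLOTS`, `slot_for`, `store`, and `fetch` are assumptions made for the example, not part of the slides.

```python
# Hypothetical direct-access "file": a fixed number of slots (buckets).
NUM_SLOTS = 8

def slot_for(key: int) -> int:
    """Toy hashing function: map a record key to a slot number."""
    return key % NUM_SLOTS

# Each slot holds a list so that keys hashing to the same slot can coexist.
slots = [[] for _ in range(NUM_SLOTS)]

def store(key: int, record: dict) -> None:
    slots[slot_for(key)].append((key, record))

def fetch(key: int):
    """Go straight to the computed slot instead of scanning the whole file."""
    for stored_key, record in slots[slot_for(key)]:
        if stored_key == key:
            return record
    return None

store(1001, {"name": "Ada"})
store(1009, {"name": "Grace"})  # 1009 % 8 == 1, same slot as 1001
print(fetch(1009))
```

The point of the sketch is that lookup cost depends on the size of one slot, not on the size of the file.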
1980s Data Management


Databases are becoming
commonplace
Personal computer databases are
evolving


DBase
R-Base
1990s Data Management




Huge data stores and transaction
processing capabilities
Distributed databases
Object-oriented databases
6 Million+ transactions per second
Traditional File Systems
(Problems)

Data redundancy
Multiple unsynchronized copies of the same data item in different places
Data inconsistency
Those unsynchronized copies are not the same
Which copy is correct (authoritative)?
Program / data dependence
Information Granularity

Refers to the level of detail of
information


Detailed (POS transaction)
Coarse (Global sales totals)
Transactional vs. Analytical
Information

Transactional information comes
from a business process



A bank deposit
A credit card charge
Analytical information uses
transactional data for the purposes
of decision making


Account balance trends
Using credit card history to detect fraud
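The difference can be sketched in a few lines: transactional information is one row per business event, and analytical information is derived from those rows. The sample data and the name `monthly_net_change` are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactional records: one row per business event.
transactions = [
    {"account": "A", "month": "2024-01", "amount": 200.0},   # a deposit
    {"account": "A", "month": "2024-01", "amount": -50.0},   # a card charge
    {"account": "A", "month": "2024-02", "amount": 100.0},
]

def monthly_net_change(rows):
    """Analytical view: roll detailed transactions up to one total per month."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["month"]] += row["amount"]
    return dict(totals)

print(monthly_net_change(transactions))
```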
Transactional vs. Analytical
Information
Information Dimensions

Information timeliness
Obsolete information is useless
Today’s information needs to be provided in real time or near real time
Information quality
Wrong information is useless
Redundant information can be the cause of errors
Information must be complete
Data inconsistency and data integrity
Database Management

Characteristics





Complex
Databases often spread across multiple
servers
Databases often spread across multiple
physical disks
Fault tolerance is critical
Databases may be distributed
Database Vendors (1)

The industry has consolidated
IBM: DB2 Universal
Oracle
Microsoft: SQL Server, Access
Sun (MySQL): is now Oracle
Database Performance





Transaction Processing Performance
Council provides standard
benchmarks
TPC-C – Online transaction
processing
TPC-E – Online brokerage
transactions
TPC-H – Ad-hoc decision support
TPC-W – Web / E-commerce
Database Performance (TPC-C)




Multiple transaction types
Independent of software and
hardware
Scalable
Basis is online transaction
processing (OLTP)
Realities of a DBMS






Data centric rather than application centric
Can be a repository for all an organization’s
data
Databases tend to be centralized
Queries get data from a DBMS
SQL is the standard query language
Report generators create printed and Web-based reports
Applications interface with DBMS
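As a small illustration of an application querying a DBMS with SQL, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customer` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("Ada", "Reno"), ("Grace", "Fallon"), ("Linus", "Reno")],
)

# The application states what it wants; the DBMS decides how to find it.
reno_customers = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Reno",)
).fetchall()
print(reno_customers)
conn.close()
```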
Types of Databases

Database models include:




Hierarchical database model – A tree-based structure
Network database model –
Mathematically, a directed graph
Relational database model – stores
information in the form of logically
related two-dimensional tables
Object-oriented databases
Elements of a Database

Logical view and physical view


Users see and work with the logical
view
Physical view is controlled by the
database management system itself
Entities and Attributes

Relational databases store
information in tables (entities)


Customer / order / product
Tables contain fields (attributes)

Customer name, address
Keys

Each table has a primary key that
uniquely identifies each record



Natural keys have some meaning
(stock symbol)
Artificial keys have no intrinsic meaning
(your R number)
Foreign keys are used to link tables
in one-to-many relationships
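A minimal sketch of primary and foreign keys, again using `sqlite3`. The table and column names (`customer`, `orders`, `customer_id`) are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- artificial key, no intrinsic meaning
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),  -- foreign key
    total       REAL NOT NULL
);
INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# One customer, many orders: the foreign key links the two tables.
orders_per_customer = conn.execute("""
    SELECT c.name, COUNT(o.order_id)
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.name
""").fetchall()
print(orders_per_customer)
conn.close()
```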
Database Interaction
Advantages of an RDBMS
(Scalability)

Database can scale to the terabyte
or petabyte range


NSA maintains 1.9 trillion telephone
call records
Large databases can span several
servers and storage devices
Advantages of an RDBMS
(Redundancy)

Databases can be configured to
write duplicate information


Citibank
Journaling and checkpointing are
supported
Advantages of an RDBMS
(Integrity)


Relational integrity constraints are
rules that apply to the relationships
between tables
Business integrity constraints
enforce business rules

Not really a part of the DBMS itself
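The two kinds of constraint can be contrasted in a short sketch. Here the relational rule (an order must reference an existing customer) is enforced by SQLite, while the business rule (an order total must be positive) is a hypothetical rule checked in application code, in line with the note that business rules are not really part of the DBMS. The function name `place_order` is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Ada')")

def place_order(order_id, customer_id, total):
    """Return True if the order was stored, False if a rule rejected it."""
    if total <= 0:  # hypothetical business rule, checked in application code
        return False
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, total))
        return True
    except sqlite3.IntegrityError:  # relational rule: the customer must exist
        return False

print(place_order(10, 1, 25.0))   # valid customer, valid total
print(place_order(11, 99, 25.0))  # there is no customer 99
print(place_order(12, 1, -5.0))   # violates the business rule
```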
Advantages of an RDBMS
(Information Security)

A DBMS supports advanced access
rights




By table and fields
By time of day
By location
By row information
Relational Database (Illustration)
Non-relational Databases
(Introduction)


Required by scalable applications
like Facebook, Google and others
Built upon a couple of principles




BASE – basically available, soft-state,
eventually consistent (non-relational)
ACID – Atomicity, Consistency,
Isolation, Durability (relational)
They are somewhat new and
unproven
Usually in-memory
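The atomicity part of ACID can be illustrated with a small `sqlite3` sketch: a transfer either applies both updates or neither. The `account` table, the non-negative balance rule, and the helper names `transfer` and `balances` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("checking", 100.0), ("savings", 0.0)])
conn.commit()

def transfer(amount):
    """Move money between accounts; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = 'savings'", (amount,))
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = 'checking'", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    return dict(conn.execute("SELECT name, balance FROM account"))

print(transfer(500.0), balances())  # the overdraft fails, so the credit is undone too
print(transfer(40.0), balances())
```

An eventually consistent (BASE) store would relax this guarantee in exchange for availability and scale.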
Non-relational Databases
(Examples)



NoSQL
MongoDB
Hadoop
Hadoop (Characteristics)

Built by Apache, but there are several third-party implementations




We break down huge data sets
Process them in clusters
And put the results back together
Characteristics


A distributed file system
MapReduce for dividing a task into
small parts
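The break-down, process, and recombine idea behind MapReduce can be sketched in plain Python. This is not the Hadoop API; it runs in a single process, and the function names and sample chunks are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key so each key can be reduced independently."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical chunks; on a real cluster each would be processed on a different node.
chunks = ["big data big clusters", "data in clusters", "big results"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```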
Hadoop (Illustration)
Data-driven Web Sites

Nearly all transactional Web sites
rely on a database





Amazon
Your bank
Any shopping cart application
eBay or Craigslist
Facebook and YouTube
Database Integration

Databases often need to be
integrated



Because of mergers and acquisitions
Because of organizational changes
We are referring to connections to
multiple databases
Designing a database
(Normalization)

Normalization is the process of factoring data into different tables to:


Eliminate data redundancy
Support referential integrity
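A rough sketch of what factoring data into different tables means. The flat rows, the table layouts, and the function name `normalize` are hypothetical.

```python
# Hypothetical flat file: the customer's city is repeated on every order row.
flat_rows = [
    {"order_id": 10, "customer": "Ada", "city": "Reno", "total": 25.0},
    {"order_id": 11, "customer": "Ada", "city": "Reno", "total": 40.0},
    {"order_id": 12, "customer": "Grace", "city": "Fallon", "total": 15.0},
]

def normalize(rows):
    """Factor the flat rows into a customer table and an order table."""
    customers = {}   # customer name -> customer row
    orders = []
    for row in rows:
        if row["customer"] not in customers:
            customers[row["customer"]] = {
                "customer_id": len(customers) + 1,  # artificial key
                "name": row["customer"],
                "city": row["city"],
            }
        orders.append({
            "order_id": row["order_id"],
            "customer_id": customers[row["customer"]]["customer_id"],  # foreign key
            "total": row["total"],
        })
    return list(customers.values()), orders

customer_table, order_table = normalize(flat_rows)
print(customer_table)
print(order_table)
```

After the split, each customer's city is stored once, so it can no longer disagree between two order rows.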
Data Warehouses (Introduction)


Central source for clean data
May contain internal or external data




Used to spot hidden patterns in data
May be integrated with operational
database
Parts of a data warehouse are called data
marts
Data warehouses contain an analytical
component
Cleansing Data

Data is often obtained from a
myriad of sources




External lists
Internal databases
Other databases
This data must be cleansed and
sanitized to remove

Redundancy / errors / etc…
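A minimal cleansing pass might look like the sketch below. The specific rules (tidy the formatting, drop rows with no email, treat a repeated email as a duplicate) and the name `cleanse` are assumptions for the example; real cleansing rules depend on the data.

```python
def cleanse(records):
    """Normalize formatting, drop rows missing an email, and remove duplicates."""
    seen = set()
    clean = []
    for rec in records:
        name = " ".join(rec.get("name", "").split()).title()
        email = rec.get("email", "").strip().lower()
        if not email:          # treat a missing email as an error row
            continue
        if email in seen:      # same email seen already: redundant copy
            continue
        seen.add(email)
        clean.append({"name": name, "email": email})
    return clean

raw = [
    {"name": "ada  lovelace", "email": "ADA@example.com "},   # from an external list
    {"name": "Ada Lovelace", "email": "ada@example.com"},     # from an internal database
    {"name": "Grace Hopper", "email": ""},                    # incomplete
]
print(cleanse(raw))
```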
Data Warehouses (Illustration)
Multidimensional Analysis


Data are often analyzed as three-dimensional cubes
Cubes are then ‘sliced and diced’ to
look at various layers
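Slicing and dicing can be sketched with a small dictionary standing in for the cube. The dimensions (product, region, quarter), the figures, and the function names are made up for illustration.

```python
# Hypothetical sales cube: each cell is keyed by (product, region, quarter).
cube = {
    ("widget", "west", "Q1"): 100,
    ("widget", "east", "Q1"): 80,
    ("widget", "west", "Q2"): 120,
    ("gadget", "west", "Q1"): 50,
    ("gadget", "east", "Q2"): 70,
}
DIMENSIONS = ("product", "region", "quarter")

def slice_cube(cells, **fixed):
    """Keep only the cells whose dimensions match the fixed values."""
    return {
        key: value
        for key, value in cells.items()
        if all(key[DIMENSIONS.index(dim)] == want for dim, want in fixed.items())
    }

def total(cells):
    return sum(cells.values())

q1 = slice_cube(cube, quarter="Q1")                               # slice: one layer
west_widgets = slice_cube(cube, product="widget", region="west")  # dice: a sub-cube
print(total(q1), total(west_widgets))
```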
Multidimensional Analysis
(Illustration)
The cost of Perfect Information
Database Design (Introduction)

In the systems process, we design
before we implement




Requirements specification
Conceptual design
Logical design
Physical Design
Database Design Tools

Unified Modeling Language (UML)




Visio
Rational Rose
Entity relationship diagrams describe relationships between data
Normalization eliminates redundant
data
Database Management HR




Database administrators
Data managers
Programmers and systems analysts
Data security
Business Intelligence
(Introduction)




Simply put, it’s internal and
external data used to support better
decision making
It’s challenging to sift through the
mountains of data
It requires cross-functional
collaboration between systems
More in the next chapter but we use
ERP systems to improve business
intelligence
Business Intelligence (Industries)


BI applies to all industries
Retail and sales
Understanding procurement and distribution (SCM) / customers (CRM)
Banking
Understand credit worthiness / fraud behavior
Insurance
Forecast claim risk and understand at-risk customers
Business Intelligence (Industries)

Airlines


Routing planes / minimize turnaround
time (Southwest)
Marketing



Demographics
Sell based on known customer behavior
(Harrah’s)
Amazon
Business Intelligence (Levels)

Operational
Day-to-day operations (building a Dell)
Tactical
Short term (Dell ordering supplies)
Strategic
Long term organizational goals
The systems that provide BI
typically do so at all levels
BI Levels (Illustration)
BI and Latency


From the time of acquisition, how
long does it take to analyze
(analysis latency)
Time to make a decision based on
the analysis

E-transactions significantly reduce
latency
Data Mining (Introduction)



Data gets mined (analyzed) from
data contained in a data warehouse
or data mart
Specialized tools are used to
analyze data for ‘interesting
nuggets’
Ways to mine


Drill down (general to specific)
Drill up (specific to general)
Data Mining (Sequences)

Events are linked over time



I buy a house
Home Depot knows that
They send me a coupon to buy
appliances
Data Mining (Classification)

We classify items (people for
example) into groups and look at
the characteristics of that group


Churned customers
Customers who have stopped gambling
Data Mining (Clustering)



Used to define classification groups
Cluster analysis groups data by trait
or traits
Examples


Don’t drink the water in Fallon
Segment customers by zip codes
Data Mining (Association)

Answers the question “What traits are associated with other traits?”
When I stay at Harrah’s,
I gamble
I eat at the Sage Room
When I stay in Vegas,
I gamble more
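One common way to put numbers on an association is with support and confidence. The sketch below assumes that approach; the visit records and trait names are invented for the example.

```python
# Hypothetical visit records: the set of traits observed on each stay.
visits = [
    {"stays", "gambles", "dines"},
    {"stays", "gambles"},
    {"stays", "dines"},
    {"stays", "gambles", "dines"},
]

def support(records, traits):
    """Fraction of records that contain every trait in `traits`."""
    traits = set(traits)
    return sum(traits <= record for record in records) / len(records)

def confidence(records, if_traits, then_traits):
    """Of the records with `if_traits`, the fraction that also have `then_traits`."""
    both = set(if_traits) | set(then_traits)
    return support(records, both) / support(records, if_traits)

print(support(visits, {"gambles", "dines"}))
print(confidence(visits, {"gambles"}, {"dines"}))
```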
Data Mining (Statistical Analysis)

It’s basic statistics



Analysis of variance
Correlation coefficients
Etc…
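As one example of the basic statistics involved, a Pearson correlation coefficient can be computed by hand as sketched here. The sample figures are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: nights stayed vs. amount spent.
nights = [1, 2, 3, 4]
spend = [100, 200, 300, 400]
print(pearson(nights, spend))
```

A value near 1 means the two measures rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.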
Text Mining

We need a way to mine
unstructured data



All of those Facebook posts
All of those Twitter posts
Techniques


We mine text
We use keywords to mine sentiment
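Keyword-based sentiment mining can be sketched very simply. The keyword lists, the scoring rule (positive hits minus negative hits), and the sample posts are all assumptions made for the example.

```python
# Hypothetical keyword lists; real systems use much larger lexicons.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"hate", "slow", "broken"}

def sentiment_score(post: str) -> int:
    """Count positive keywords minus negative keywords in one post."""
    words = [word.strip(".,!?").lower() for word in post.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = [
    "Love the new app, great update!",
    "Checkout is slow and the cart is broken.",
    "Ordered a widget today.",
]
scores = [sentiment_score(post) for post in posts]
print(scores)
```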
BI Benefits

We can understand what’s
happening inside and outside a
department




Sales knows about product inventory
levels and production schedules
Production knows about sales and sales
forecasts
Finance knows about the sales
forecasts too
This information is provided in near
real time
Quantifying BI

Some benefits can be clearly
quantified






Costs went down
Productivity increased
Inventory levels were optimized 10%
Some are indirectly quantified
Some benefits are intangible
Sometimes, we get unexpected
results
Challenges



We need an information policy
We need to administer all of this
data
We need to ensure data quality