FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 2 DATA WAREHOUSE PROCESS
Chapter 2
Data Warehouse process
2.1 Initiation of Data Warehouse
The concept of data warehousing dates back to the late 1980s when IBM researchers Barry
Devlin and Paul Murphy developed the "business data warehouse". In essence, the data
warehousing concept was intended to provide an architectural model for the flow of data
from operational systems to decision support environments. The concept attempted to
address the various problems associated with this flow, mainly the high costs associated
with it. In the absence of a data warehousing architecture, an enormous amount of
redundancy was required to support multiple decision support environments. In larger
corporations it was typical for multiple decision support environments to operate
independently. Though each environment served different users, they often required much
of the same stored data. The process of gathering, cleaning and integrating data from
various sources, usually from long-term existing operational systems (usually referred to as
legacy systems), was typically in part replicated for each environment. Moreover, the
operational systems were frequently reexamined as new decision support requirements
emerged. Often new requirements necessitated gathering, cleaning and integrating new
data from "data marts" that were tailored for ready access by users. (Source: Basil Soufi)
In a decision support environment, decision support systems (DSS) and executive
information systems (EIS) are very similar in that they present information for
decision making; however, EIS applications typically allow greater flexibility in
slicing and dicing data in the style most acceptable to the user. A data
warehouse is not the same as a DSS. Rather, a data warehouse is a platform with
integrated data of improved quality to support many DSS and EIS applications and
processes within an enterprise.
i) An EIS is a special type of DSS designed to support decision making at
the top level of an organization.
ii) An EIS may help a CEO to get an accurate picture of overall operations,
and a summary of what competitors are doing.
iii) These systems are generally easy to operate and present information in
ways easy to quickly absorb (graphs, charts, etc.).
iv) It is not a substitute for other computer-based systems. The EIS actually
feeds off these systems.
v) It does not turn the executive suite into a haven for computer “techies”.
vi) It should be viewed by senior management as a trusted assistant who
can be called on when and where necessary.
Figure 1. Revolution of data warehouse
Problem with the current EIS: the data-processing department was not able to handle
the huge backlog of requests for data analysis. Application data was hidden in mainframe
files and databases, and it was periodically recorded on tapes for specific information
manipulation.
ENTERPRISE SYSTEM TYPE COMPARISON
Figure 2: System type comparison (Source : Database Systems: Design, Implementation and Management P.Rob and C. Coronel, 2007)
2.2 Data Warehouse Architecture
A data warehouse architecture is a description of the elements and services of the
warehouse, with details showing how the components will fit together and how the
system will grow over time. There is always an architecture, either ad hoc or
planned, but experience shows that planned architectures have a better chance of
succeeding (Laura Hadley).
Figure 3: Data warehouse architecture categories
There are four categories of data warehouse architecture: 1) data architecture, 2)
information architecture, 3) technical architecture, and 4) product architecture.
Architecture / Deliverables

Data
- Define what data is needed to meet business user needs.
- Examine the completeness and correctness of source systems that are needed to obtain data.
- Identify the data facts and dimensions.
- Define the logical data models.
- Establish preliminary aggregation plan.

Information
- Define the framework for the transformation of data into information from the source systems to information used by the business users.
- Recommend the data stages necessary for data transform and information access.
- Develop source-to-target data mapping for each data stage.
- Review data quality procedures and reconciliation techniques.
- Define the physical data models.

Technology
- Define technical functionality used to build a data warehousing and business intelligence environment.
- Identify available technologies and review tradeoffs associated between any overlapping or competing technologies.
- Review current technical environment and company's strategic technical directions.
- Recommend technologies to be used to meet your business requirements and implementation plan.

Product
- List product categories needed to implement the technology architecture.
- Review tradeoffs between overlapping or competing product categories.
- Outline implementation of product architecture in stages.
- Identify short list of products in each of these categories.
- Recommend products and implementation schedule.
Figure 4: Descriptions of data warehouse architecture categories
(source : http://www.athena-solutions.com/services-design-planning.shtml)
Data warehouse technical architecture (By components)
Figure 5: Data warehouse with staging and data marts
(Source - http://docs.oracle.com/cd/B10500_01/server.920/a96520/concept.htm)
Figure 5 illustrates an example where purchasing, sales, and inventories are separated. In
this example, a financial analyst might want to analyze historical data for purchases and
sales. The architecture consists of:
- Data Sources (operational systems and flat files)
- Staging Area (where data sources go before the warehouse)
- Warehouse (metadata, summary data, and raw data)
- Data Marts (purchasing, sales, and inventory)
- Users (analysis, reporting, and mining)
Note: This architecture is also known as the federated data warehouse architecture.
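The flow through the components listed for Figure 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Oracle implementation: the sample rows, the field names (subject, store, amount) and the function names stage, load_warehouse and build_marts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows arriving from operational systems and flat files.
SOURCES = [
    {"subject": "sales", "store": 1023, "amount": "111.21"},
    {"subject": "purchasing", "store": 1023, "amount": " 40.00 "},
    {"subject": "inventory", "store": 2001, "amount": "7"},
    {"subject": "sales", "store": 2001, "amount": "bad-value"},
]


def stage(rows):
    """Staging area: clean rows before they reach the warehouse."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # reject rows whose amount cannot be cleaned
        cleaned.append({**row, "amount": amount})
    return cleaned


def load_warehouse(rows):
    """Warehouse: holds raw data, summary data, and metadata together."""
    summary = defaultdict(float)
    for row in rows:
        summary[row["subject"]] += row["amount"]
    return {
        "raw": rows,
        "summary": dict(summary),
        "metadata": {"columns": ["subject", "store", "amount"]},
    }


def build_marts(warehouse):
    """Data marts: one subject-specific subset per business area."""
    marts = defaultdict(list)
    for row in warehouse["raw"]:
        marts[row["subject"]].append(row)
    return dict(marts)


warehouse = load_warehouse(stage(SOURCES))
marts = build_marts(warehouse)
```

Users (analysis, reporting, and mining tools) would then read from marts["sales"], marts["purchasing"] or marts["inventory"] rather than from the operational sources directly.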
Data warehouse technical architecture (By layer)
Figure 6: Data warehouse internal technical architecture
(Source : Modern Data Warehousing, Mining and Visualization, Marakas, 2002)
The technical architecture consists of various interconnected elements:
- Operational and external database layer – the source data for the DW
- Information access layer – the tools the end user uses to extract and analyze the
data
- Data access layer – the interface between the operational and information access
layers
- Metadata layer – the data directory or repository of metadata information
Additional layers are:
- Process management layer – the scheduler or job controller
- Application messaging layer – the “middleware” that transports information
around the firm
- Physical data warehouse layer – where the actual data used in the DSS are
located
- Data staging layer – all of the processes necessary to select, edit,
summarize and load warehouse data from the operational and external
databases
Data warehouse configuration:
- The virtual data warehouse – the end users have direct access to
the data stores, using tools enabled at the data access layer
- The central data warehouse – a single physical database contains
all of the data for a specific functional area
- The distributed data warehouse – the components are distributed
across several physical databases
Developing an Architecture
When you develop the technical architecture model, draft the architecture requirements
document first. Next to each business requirement write down its architecture
implications. Group these implications according to architecture areas (remote access,
staging, data access tools, etc.). Understand how each area fits in with the other areas.
Capture the definition of each area and its contents. Then refine and document the model.
Thornthwaite recognizes that developing a data warehouse architecture is difficult, and
thus warns against using a “just do it” approach, which he also calls “architecture lite.”
But the Zachman framework is more than what most organizations need for data
warehousing, so he recommends a reasonable compromise consisting of a four-layer
process: business requirements, technical architecture, standards, and products.
Business requirements essentially drive the architecture, so talk to business managers,
analysts, and power users. From your interviews look for major business issues, as well
as indicators of business strategy, direction, frustrations, business processes, timing,
availability, and performance expectations. Document everything well.
From an IT perspective, talk to existing data warehouse/DSS support staff, OLTP
application groups, and DBAs; as well as networking, OS, and desktop support staff.
Also speak with architecture and planning professionals. Here you want to get their
opinions on data warehousing considerations from the IT viewpoint. Learn if there are
existing architecture documents, IT principles, standards statements, organizational
power centers, etc.
2.3 Metadata
While metadata is not new, the role of metadata and its importance in the face of the data
warehouse certainly is new. For years the information technology professional has worked in
the same environment as metadata, but in many ways has paid little attention to metadata.
The information professional has spent a life dedicated to process and functional analysis,
user requirements, maintenance, architectures, and the like. The role of metadata has been
passive at best in this milieu.
But metadata plays a very different role in the data warehouse. Relegating metadata to a
backwater, passive role in the data warehouse environment defeats the purpose of the data
warehouse. Metadata plays a very active and important part in the data warehouse
environment. The reason why metadata plays such an important and active role in the data
warehouse environment is apparent when contrasting the operational environment to the
data warehouse environment insofar as the user community is concerned.
Figure 7: Data flow involving metadata
(Source-http://www.dwreview.com/Articles/Metadata.html)
Simply from the standpoint of who needs help the most in terms of finding one's way
around data and systems, it is assumed the DSS analysis community requires a much
more formal and intensive level of support than the information technology community. For
this reason alone, the formal establishment of and ongoing support of metadata becomes
important in the data warehouse environment.
But there is a secondary, yet important, reason why metadata plays an important role in the
data warehouse environment. In the data warehouse environment, the first thing the DSS
analyst needs to know in order to do his/her job is what data is available and where it is in
the data warehouse. In other words, when the DSS analyst receives an assignment, the first
thing the DSS analyst needs to know is what data there is that might be useful in fulfilling
the assignment. To this end the metadata for the warehouse is vital to the preparatory work
done by the DSS analyst.
Figure 8. Metadata layer throughout data warehouse architecture
(Source : BI 360)
Throughout the entire process of identifying, acquiring, and querying the data,
metadata management takes place. Metadata is defined as "data about data". An
example is a column in a table. The datatype (for instance a string or integer) of the
column is one piece of metadata. The name of the column is another. The actual
value in the column for a particular row is not metadata - it is data. Metadata is
stored in a Metadata Repository and provides extremely useful information to all of
the tools mentioned previously. Metadata management has developed into an
exacting science that can provide huge returns to an organization. It can assist
companies in analyzing the impact of changes to database tables, tracking owners of
individual data elements ("data stewards"), and much more. It is also required to
build the warehouse, since the ETL tool needs to know the metadata attributes of the
sources and targets in order to "map" the data properly. The BI tools need the
metadata for similar reasons.
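The distinction between a column's metadata and its data can be seen in a short Python sketch using the standard sqlite3 module. The customer table, its column names and the sample row are assumptions made for illustration; a real warehouse would hold this information in its metadata repository rather than in an in-memory database.

```python
import sqlite3

# An in-memory database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, age INTEGER)")
conn.execute("INSERT INTO customer VALUES ('Aminah', 34)")

# Metadata: the name and declared datatype of each column.
# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
column_metadata = [
    (row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")
]

# Data: the actual values stored in a particular row.
data_rows = conn.execute("SELECT name, age FROM customer").fetchall()

print("metadata:", column_metadata)
print("data:", data_rows)
```

An ETL or BI tool needs the first list (names and types) to map or display columns correctly; the second list is what the business user ultimately analyzes.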
i) The name suggests some high-level technological concept, but it
really is fairly simple. Metadata is “data about data”.
ii) With the emergence of the data warehouse as a decision support
structure, the metadata are considered as much a resource as the
business data they describe.
iii) Metadata are abstractions -- they are high level data that provide
concise descriptions of lower-level data.
Two basic types of metadata:
1) Technical Metadata
Technical metadata provides the technical descriptions of data and
operations. This information is used by data modellers, application
programmers, system administrators, database administrators and
software tools.
Technical metadata includes information about data definition, data
format, processes, source data, target data, and the rules and
processes that are used to extract, filter, enhance, cleanse, and
transform source data to target data.
2) Business Metadata
Business metadata (data and process) is used by business analysts
and end users, and provides a business description of informational
objects. It assists end users in locating, understanding, and accessing
information in applications, data marts, a data warehouse, or other
informational sources.
2.3.1 The Metadata in Action
The metadata are essential ingredients in the transformation of raw data into
knowledge. They are the “keys” that allow us to handle the raw data.
For example, a line in a sales database may contain:
1023 K596 111.21
This is mostly meaningless until we consult the metadata (in the data directory) that
tells us it was store number 1023, product K596 and sales of $111.21.
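A minimal Python sketch of this lookup follows. The DATA_DIRECTORY structure, the field names and the interpret function are hypothetical; only the sample line and its meaning (store number, product, sales amount) come from the example above.

```python
from decimal import Decimal

# Hypothetical data directory entry for the sales file:
# one (field name, converter) pair per position in the line.
DATA_DIRECTORY = [
    ("store_number", int),
    ("product_code", str),
    ("sales_amount", Decimal),
]


def interpret(line, directory=DATA_DIRECTORY):
    """Attach names and types from the metadata to a raw line of values."""
    values = line.split()
    if len(values) != len(directory):
        raise ValueError("line does not match the data directory")
    return {
        name: convert(value)
        for (name, convert), value in zip(directory, values)
    }


record = interpret("1023 K596 111.21")
print(record)
```

Without DATA_DIRECTORY the three values are just strings; with it, each value gains a name and a type that downstream tools can rely on.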
Metadata can be managed through individual tools:
- Metadata manager / repository
- Metadata extract tools
- Data modeling
- ETL
- BI Reporting
2.3.2 The Need for Consistency in the Metadata
i) The data warehouse is set up for the benefit of business analysts and
executives across all functional areas.
ii) In their individual databases, the different areas may define and store
data according to their own version of the “truth”.
iii) When data are retrieved from these different areas and placed in the
warehouse, the transformation and cleansing process ensures that
there is a single, integrated “truth” at the organizational level.
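As a sketch of how the transformation step can reconcile differing local versions of the "truth", the Python below maps two area-specific encodings of the same attribute onto one warehouse encoding. The areas ("sales", "hr"), the codes and the function name are hypothetical and chosen only for illustration.

```python
# Hypothetical encodings used by two functional areas for the same attribute.
ENCODINGS = {
    "sales": {"M": "male", "F": "female"},
    "hr": {"1": "male", "0": "female"},
}


def to_warehouse_value(area, raw_value):
    """Translate an area-specific code into the single warehouse encoding."""
    key = str(raw_value).strip().upper()
    try:
        return ENCODINGS[area][key]
    except KeyError:
        raise ValueError(
            f"no mapping for {raw_value!r} from area {area!r}"
        ) from None


# Values extracted from the two areas, each in its own local encoding.
extracted = [("sales", "m"), ("hr", 1), ("hr", "0"), ("sales", "F")]
integrated = [to_warehouse_value(area, value) for area, value in extracted]
print(integrated)
```

Raising an error for an unmapped code, rather than passing it through, keeps inconsistent values out of the warehouse until the mapping metadata is updated.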
2.3.3 Interviewing the Data—Metadata Extraction
Regardless of the nature of a query, certain aspects of the metadata are
important to all decision-makers. Some of these are:
- What tables, attributes and keys does the DW contain?
- Where did each set of data come from?
- What transformations were applied with cleansing?
- How have the metadata changed over time?
- How often do the data get reloaded?
- Are there so many data elements that you need to be careful what
you ask for?
Sample metadata extraction
Figure 9 & 10 : Metadata extraction
2.3.4 Components of the Metadata
i) Transformation maps – records that show what transformations were applied
ii) Extraction history – records that show what data was analyzed
iii) Algorithms for summarization – methods available for aggregating and
summarizing
iv) Data ownership – records that show origin
v) Access patterns – records that show what data are accessed and how often
2.3.5 Typical Mapping Metadata
Transformation mapping records include:
- Identification of original source
- Attribute conversions
- Physical characteristic conversions
- Encoding/reference table conversions
- Naming changes
- Key changes
- Values of default attributes
- Logic to choose from multiple sources
- Algorithmic changes
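One way to picture a transformation mapping record is as a small data structure that an ETL step applies to each source row. The sketch below is an assumption-laden illustration: the MappingRecord class, its field names, the "orders_legacy" source and the sample columns are all hypothetical, and it covers only a few of the items listed above (original source, naming changes, attribute conversions, default values).

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class MappingRecord:
    source_system: str                 # identification of original source
    source_field: str                  # name in the source system
    target_field: str                  # name in the warehouse (naming change)
    convert: Callable[[Any], Any] = lambda value: value  # attribute conversion
    default: Optional[Any] = None      # value used when the source has none


def apply_mappings(source_row, mappings):
    """Build a warehouse row from a source row using the mapping records."""
    target = {}
    for m in mappings:
        raw = source_row.get(m.source_field)
        target[m.target_field] = m.default if raw is None else m.convert(raw)
    return target


# Hypothetical mapping metadata for one legacy source.
MAPPINGS = [
    MappingRecord("orders_legacy", "CUST_NO", "customer_id", int),
    MappingRecord("orders_legacy", "AMT_CENTS", "amount", lambda c: int(c) / 100),
    MappingRecord("orders_legacy", "REGION", "region", str.strip, default="UNKNOWN"),
]

row = apply_mappings({"CUST_NO": "0042", "AMT_CENTS": "11121"}, MAPPINGS)
print(row)
```

Because the mappings are data rather than code, they can be stored in the metadata repository and later queried to answer where a warehouse column came from and how it was converted.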
2.4 Data Warehouse execution
Data warehouse development
Figure 11. Data warehouse development steps
(Source - http://www.baseline-consulting.com/images/Service_Data_Warehouse_Art.jpg )
Kozar assembled a list of “seven deadly sins” of data warehouse implementation:
- “If you build it, they will come” – the DW needs to be designed to meet people’s needs
- Underestimating the importance of documenting assumptions – the assumptions and
potential conflicts must be included in the framework
- Failure to use the right tool – a DW project needs different tools than those used to
develop an application
- Life cycle abuse – in a DW, the life cycle really never ends
- Failure to learn from mistakes – since one DW project tends to be the cause of another,
learning from the early mistakes will yield higher quality later.
- Omission of an architectural framework – you need to consider the number of users,
volume of data, update cycle, etc.
- Ignorance about data conflicts – resolving these takes a lot more effort than most
people realize
Exercise:
1. Why does the data warehouse exist?
A. Needs of bigger relational database
B. Needs of DSS
C. Weakness of relational database
D. Weakness of DSS
2. Choose correct development of data warehouse:
A. OLTP, EIS, DSS, DW
B. OLTP, DSS, EIS, DW
C. EIS, DSS, OLTP, DW
D. DSS, EIS, OLTP, DW
3. Which is NOT included in the architecture of data warehouse?
A. Data access
B. Information access
C. Knowledge access
D. Wisdom access
4. Which of the following is a valid data warehouse configuration?
A. Centralized data warehouse
B. Virtual data warehouse
C. Distributed data warehouse
D. All of the above.
5. The process that records how data from operational data stores and external sources
are transformed on the way into the warehouse is referred to as:
A. summarization algorithms.
B. transformation mapping.
C. back propagation.
D. extraction history.
6. Which of these is NOT TRUE about metadata?
A. Important in transforming data into information
B. Key that allows handling of the raw data
C. Data of metadata
D. Information about data warehouse
7. Which of the following would not be a good example of metadata?
A. The directory of where the data is stored.
B. The rules used for summarization and scrubbing.
C. Where the operational data came from.
D. All of the above are examples of metadata.
8. Which layer of the data warehouse architecture does the end user deal directly with?
A. Data access layer
B. Application messaging layer
C. Information access layer
D. None of the above.
9. What are the 7 deadly sins by Kozar?
A. Myths of developing data warehouse
B. Curse of data warehouse
C. Rules of data warehouse
D. Tips of developing data warehouse
10. Which of Kozar’s seven deadly sins of DW implementation explains
the importance of focusing on the users of the data warehouse?
A. Sin 1
B. Sin 3
C. Sin 6
D. Sin 7
11.
a) List FIVE (5) of the “seven deadly sins” in data warehouse implementation
suggested by Kozar.
(5 marks)
b) Explain any THREE (3) of your answers in 11 a).
(6 marks)
c) Illustrate different layers in the data warehouse architecture.
(10 marks)