ROLL NO.:
NAME:
CS 543 – Data Warehousing
Midterm Exam Solution
April 13, 2005
Duration: 75 minutes (11:45 to 13:00)
1. The following is NOT a reason for the failure of many data warehousing projects in
the 1990s:
a. Lack of planning
b. Unclear motivation and goal for the data warehouse
c. Poor project management
d. Underestimating cost of development
e. Insufficient hardware capabilities
f. None of the above
2. A dependent data mart results from a
a. top-down approach to DW design
b. bottom-up approach to DW design
c. hybrid (practical) approach to DW design
3. List at least 5 different sources from which data can be integrated into a data
warehouse.
• Production data / operational system data
• Internal data / private data / personal data / spreadsheets / databases, etc.
• Archived data
• External data / subscription data / industry stats, etc.
Questions 4 to 10 are based on the following description.
The relational schema below defines an operational system of a university (say LUMS).
Underlined attribute(s) = primary key(s)
Asterisked* attribute(s) = foreign key(s)
Course table
level_offered = offering level of the course (freshman, sophomore, junior, senior,
graduate)
pre-requisites = list of course_ids of pre-requisite courses
units = number of units
major = major in which course is offered
Student table
standing = rank of student (freshman, sophomore, junior, senior, graduate)
Faculty table
rank = rank of faculty (assistant, associate, full, adjunct)
Department table
HoD_id = faculty_id of current HoD
majors_offered = list of majors offered by the department
Enrollment table
quarter, year = quarter and year of offering
grade = letter grade obtained
CS 543 (Sp 04/05) – Dr. Asim Karim
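Course (course_id, dept_id*, level_offered, pre-requisites, units, major)
Student (student_id, name, standing, major, units_completed)
Enrollment (course_id, student_id, faculty_id, quarter, year, grade)
Faculty (faculty_id, name, rank, dept_id*)
Department (dept_id, name, HoD_id, majors_offered)
Relationships (1 : m): Department to Course, Department to Faculty, Course to Enrollment,
Student to Enrollment, Faculty to Enrollment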
4. Is the schema in 3NF? If not, identify the normal form(s) and their violations.
Normalize the schema to remove these violation(s).
No, the schema is not in 3NF. The primary violations are:
pre-requisites is a multi-valued attribute, which violates 1NF. To fix this, replace it with
multiple attributes pre-req1, pre-req2, pre-req3, each holding a single atomic value.
course_id, dept_id*, level_offered, pre-req1, pre-req2, pre-req3, units, major_id
majors_offered is also a multi-valued attribute, which violates 1NF. To fix this, the
best approach is to create a new relation called major and link it with the
department and course tables.
major_id, major_name, dept_id*
In the enrollment table, quarter and year are not fully dependent on the primary key.
This is a violation of 2NF. To fix this, make quarter and year part of the primary key.
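As a rough sketch of the normalized schema described above, the following Python snippet creates the department, major, course, and enrollment tables in an in-memory SQLite database. The column types, the underscore spelling of the pre-requisite columns, and the use of SQLite are assumptions made for illustration; they are not part of the exam solution.

```python
import sqlite3

# Illustrative only: column types and the choice of SQLite are assumptions.
DDL = """
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT,
    HoD_id  INTEGER
);
CREATE TABLE major (
    major_id   INTEGER PRIMARY KEY,
    major_name TEXT,
    dept_id    INTEGER REFERENCES department(dept_id)
);
CREATE TABLE course (
    course_id     INTEGER PRIMARY KEY,
    dept_id       INTEGER REFERENCES department(dept_id),
    level_offered TEXT,
    pre_req1      INTEGER,
    pre_req2      INTEGER,
    pre_req3      INTEGER,
    units         INTEGER,
    major_id      INTEGER REFERENCES major(major_id)
);
CREATE TABLE enrollment (
    course_id  INTEGER,
    student_id INTEGER,
    faculty_id INTEGER,
    quarter    INTEGER,
    year       INTEGER,
    grade      TEXT,
    -- quarter and year are now part of the primary key (the 2NF fix)
    PRIMARY KEY (course_id, student_id, faculty_id, quarter, year)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)  # ['course', 'department', 'enrollment', 'major']
```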
5. How many new records will be added to the enrollment table every year (4 quarters)
if a student takes on average 3 courses per quarter? Assume there are 200 active
students, 50 active faculty members, and 200 courses offered in the year.
Each row in the enrollment table represents a student taking a particular course offered by
a given faculty in a given quarter/year.
No. of rows/year = 200 * 3 * 4 = 2,400
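The arithmetic can be checked with a few lines of Python. The variable names are illustrative; note that the numbers of faculty members and courses do not enter the calculation, because each row corresponds to one student taking one course in one quarter.

```python
# Figures taken from the question; variable names are illustrative.
students = 200
courses_per_student_per_quarter = 3
quarters_per_year = 4

rows_per_year = students * courses_per_student_per_quarter * quarters_per_year
print(rows_per_year)  # 2400
```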
6. Create a complete information package diagram for the information given in the
schema.
Subject: University Enrollment Analysis
Dimensions:
Time      Course          Student    Faculty   Dept.   Major
Quarter   Title           Name       Name      Name    Name
Year      Level_offered   Standing   Rank
Metrics or Facts: Units, Grade (GPA equivalent)
7. Create a star schema for the information showing all dimension tables, fact tables,
attributes, and the relationships between the tables. Identify key attributes and
relationship cardinalities.
Notice that enrollment is not an explicit fact or metric; it is implied from the fact table.
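Dimension tables (key attribute listed first):
Time (Time_key, Quarter, Year)
Student (Student_key, Name, Standing)
Course (Course_key, Title, Level_offered)
Faculty (Faculty_key, Name, Rank)
Dept (Dept_key, Name)
Major (Major_key, Major_name)
Fact table (Time_key*, Student_key*, Course_key*, Faculty_key*, Dept_key*, Major_key*, Units, GPA)
Relationships: each dimension table (1) to the fact table (m)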
8. Suppose faculty XYZ gets promoted to the rank of ‘associate professor’. What type of
update is this? How would you update the faculty dimension table? Show the new
dimension table and its attributes.
This is a type 2 update (not type 3, since the faculty member is not likely to be demoted to
‘assistant professor’).
A new row is added to the faculty dimension table with the new rank value of ‘associate
professor’. A new attribute eff_date may also be added to track the time of the change.
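A minimal sketch of such a type 2 change, using a plain Python list as the faculty dimension table. The surrogate key values, the faculty_id natural key, the dates, and the helper name apply_type2_change are all hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical faculty dimension rows; keys and dates are made up for illustration.
faculty_dim = [
    {"faculty_key": 1, "faculty_id": "XYZ", "name": "XYZ",
     "rank": "assistant professor", "eff_date": date(2000, 9, 1)},
]

def apply_type2_change(dim, faculty_id, new_rank, eff_date):
    """Append a new row with a fresh surrogate key; existing rows stay untouched."""
    current = max((r for r in dim if r["faculty_id"] == faculty_id),
                  key=lambda r: r["eff_date"])
    new_row = dict(current, rank=new_rank, eff_date=eff_date,
                   faculty_key=max(r["faculty_key"] for r in dim) + 1)
    dim.append(new_row)
    return new_row

apply_type2_change(faculty_dim, "XYZ", "associate professor", date(2005, 4, 13))
for row in faculty_dim:
    print(row["faculty_key"], row["rank"], row["eff_date"])
```

Because the old row keeps its own surrogate key, fact rows recorded before the promotion continue to point at the ‘assistant professor’ version of the faculty member.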
9. How many joins (based on the star schema) would be needed to answer the query:
what is the average enrollment in courses offered by faculty ABC? Explain briefly.
Two joins only. The faculty dimension table is joined with the fact table to find all students
who have taken a course offered by ABC. To find the average, the number of rows in the
join is divided by the number of distinct course_keys.
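The computation described above (rows in the join divided by the number of distinct course_keys) can be sketched with an in-memory SQLite database. The table names faculty_dim and enrollment_fact, the reduced column set, and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE faculty_dim (faculty_key INTEGER PRIMARY KEY, name TEXT, rank TEXT);
CREATE TABLE enrollment_fact (student_key INTEGER, course_key INTEGER, faculty_key INTEGER);
INSERT INTO faculty_dim VALUES (1, 'ABC', 'full'), (2, 'XYZ', 'associate');
-- Made-up rows: ABC teaches course 10 (3 students) and course 11 (1 student).
INSERT INTO enrollment_fact VALUES
    (100, 10, 1), (101, 10, 1), (102, 10, 1), (100, 11, 1), (103, 12, 2);
""")

# Rows in the join divided by the number of distinct course keys.
avg_enrollment = conn.execute("""
    SELECT COUNT(*) * 1.0 / COUNT(DISTINCT f.course_key)
    FROM enrollment_fact AS f
    JOIN faculty_dim AS d ON d.faculty_key = f.faculty_key
    WHERE d.name = 'ABC'
""").fetchone()[0]
print(avg_enrollment)  # 2.0
```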
10. Suppose a large number of queries will require aggregations for courses offered in a
major. Modify the star schema to cater for efficient processing of such queries.
An aggregate fact table is added to the schema in which each row represents the aggregate
for all courses in a major against time. Notice that the metric GPA is semi-additive, so
its average is computed in the aggregate table.
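Dimension tables (key attribute listed first):
Time (Time_key, Quarter, Year)
Student (Student_key, Name, Standing)
Course (Course_key, Title, Level_offered)
Faculty (Faculty_key, Name, Rank)
Dept (Dept_key, Name)
Major (Major_key, Major_name)
Fact table (Time_key*, Student_key*, Course_key*, Faculty_key*, Dept_key*, Major_key*, Units, GPA)
Aggregate fact table (Time_key, Major_key, Units, Av_GPA)
Relationships: dimension tables (1) to fact tables (m)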
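As a sketch of how the aggregate fact table could be derived from the base fact table, the snippet below groups made-up base rows by (Time_key, Major_key), sums Units, and averages GPA. The sample rows and the dictionary layout are assumptions for illustration only.

```python
from collections import defaultdict

# Made-up base fact rows: (time_key, major_key, units, gpa).
base_fact = [
    (1, 1, 4, 3.0),
    (1, 1, 4, 4.0),
    (1, 2, 3, 2.0),
    (2, 1, 4, 3.5),
]

groups = defaultdict(list)
for time_key, major_key, units, gpa in base_fact:
    groups[(time_key, major_key)].append((units, gpa))

# Units are summed; GPA is averaged rather than summed.
aggregate_fact = {
    key: {"units": sum(u for u, _ in rows),
          "av_gpa": sum(g for _, g in rows) / len(rows)}
    for key, rows in groups.items()
}
print(aggregate_fact[(1, 1)])  # {'units': 8, 'av_gpa': 3.5}
```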
11. What is a factless fact table?
A factless fact table does not have any numeric fact or metric. The metric is implied by
the presence of the row in the fact table.
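For illustration, a factless enrollment table might hold only dimension keys, with enrollment obtained by counting rows. The row layout and sample values below are assumptions, not part of the exam solution.

```python
from collections import Counter

# Made-up factless rows: (time_key, student_key, course_key); no numeric measure column.
enrollment_rows = [
    (1, 100, 10),
    (1, 101, 10),
    (1, 102, 11),
]

# The "metric" is simply the number of rows per course.
enrollment_per_course = Counter(course_key for _, _, course_key in enrollment_rows)
print(enrollment_per_course[10])  # 2
```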
12. How is the metadata in a data warehouse different from the data dictionary of an
operational system database? Explain briefly.
The DW metadata maintains information relevant to the entire data warehousing
environment and caters to the needs of humans as well as software components. It serves
as the glue that ties together all components of the data warehousing environment. The data
dictionary, on the other hand, provides basic information on the data structures and
semantics.
13. The most important decisions in the logical and physical design of a data warehouse
are/is:
a. Level of data granularity and method of data partitioning
b. Sizing of physical storage
c. Type of data structures
d. Type of modeling technique
e. None of the above
14. How many multi-way aggregate fact tables are possible for a 4-dimension star schema
in which each dimension has 5 hierarchical attributes?
5 * 5 * 5 * 5 = 625
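The count can be reproduced by enumerating one choice of hierarchy level per dimension, which is how the answer above counts the combinations. The list name levels_per_dimension is illustrative.

```python
from itertools import product

# One entry per dimension: the number of hierarchical attributes (levels) it has.
levels_per_dimension = [5, 5, 5, 5]

# Each combination picks one level from every dimension.
combinations = list(product(*(range(n) for n in levels_per_dimension)))
print(len(combinations))  # 625
```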
15. The snowflake schema is better than the star schema in the following way:
a. It saves a large percentage of space
b. It is easier to maintain and keep consistent
c. It is easier to understand by end-users
d. It is more efficient for transaction storage
e. None of the above
16. List at least 5 key roles/titles (besides project manager) in a data warehouse project
team.
• Executive sponsor
• User liaison manager
• Lead architect
• Infrastructure specialist
• Business analyst
• Data modeler
• DW administrator
• ETL specialist
• Quality assurance analyst
• Testing coordinator
• End-user applications specialist
• Programmer
• Lead trainer
17. The term ‘information crisis’ refers to
a. Inability to save all generated information
b. Inability to process all generated information
c. Paucity of information
d. Excess data but limited knowledge
e. Lack of good quality information
18. Indicate true or false in front of the following statements.
a. ERP systems may be substituted for data warehouses (FALSE)
b. Metadata standards facilitate deploying a combination of best-of-breed
products (TRUE)
c. A separate data staging platform is necessary for a data warehousing
environment (FALSE/TRUE)
d. Business dimensions can be identified from operational system databases
(TRUE)
e. Transactional fact tables can grow in size very quickly over time (TRUE)
19. (10 points) Match the columns (write the matching statements’ letter in front of the
number).
1. information package diagram (H)      A. determine data extraction
2. need for drill-down (F)              B. provide OLAP
3. data transformations (J)             C. provide data feed
4. data sources (A)                     D. influences load management
5. data aging (I)                       E. query management in DBMS
6. sophisticated analysis (B)           F. low levels of data
7. simple and complex queries (E)       G. larger staging area
8. data volume (D)                      H. influence data design
9. specialized DSS (C)                  I. possible pollution source
10. corporate data warehouse (G)        J. data staging design