Download CO7107 Data warehouse

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
CO7107 DATA
WAREHOUSING AND MULTIDIMENSIONAL DATABASES
Assessment
Andrew Jones
0817804 J01209
Table of Contents
Table of Figures ....................................................................................................................................... 1
Introduction ............................................................................................................................................ 2
The evolution of databases ..................................................................................................................... 2
The information environment ................................................................................................................ 3
The alternatives to SQL ........................................................................................................................... 3
The data warehouse ............................................................................................................................... 4
Star schema......................................................................................................................................... 4
Snowflake schema .............................................................................................................................. 4
Fact constellation ................................................................................................................................ 5
Data modelling ........................................................................................................................................ 5
System design ......................................................................................................................................... 6
Data mining ............................................................................................................................................. 7
Conclusion ............................................................................................................................................... 8
Bibliography ............................................................................................................................................ 9
Table of Figures
Figure 1 the types of noSQL databases ................................................................................................... 3
Figure 2 the database environment as idealised by the noSQL fan brigade .......................................... 4
Figure 3 Traditional process leading towards a Business Intelligence system ....................................... 4
Figure 4 Enterprise Information Systems Architecture .......................................................................... 5
Figure 5 the redundancy of schema less design ..................................................................................... 5
Figure 6 Transitioning from a data model to a database ........................................................................ 6
Figure 7 Data Mining Algorithms ............................................................................................................ 7
Introduction
The evolution of the modern Relational Database Management System has been shaped by the
database itself, but the problem remains the “distance” between the decision maker and the data.
The complexity arises when the decision maker requires information from the data in the form of
answers to queries. Now it is about the toolkit that allows for the searching, indexing and querying of
vast amounts of disparate data sources that exist within the enterprise; enter the data warehouse: a
read only set of data and query results all shaped up and fit for mining. But for today’s management
the time taken to run the warehousing routine and subsequently mine it is too long! They,
management, want answers to unstructured questions and general grammatical queries in
milliseconds utilising all of the available warehoused information and up to the moment transaction
data.
Is it that senior managers, or managers in general, lack the basic information retrieval skills necessary
to seek and utilise the toolkit resources available to them within the enterprise data structure?
The evolution of databases
If you can cast your mind back as far as the Halcion days of the 1960’s then this is where there the
computer database began; 1968 saw the arrival of the file based database in which data is stored and
maintained in a flat file. Things really took off with the next generation Hierarchical data model
databases. They were enormously complex in implementation and often difficult to manage. There
were no standards and the systems databases could not efficiently manage multiple many to many
relationships. On the plus side they were efficient at searching, together with there being less
redundant data. Data was
independent together with the
Hierarchical
implementation of security and
Database
referential integrities. The next
iteration
of
database
development came with the
advent of the schema and was
Relational
noSQL
Network
brought about by the network
Database
Database
Database
database: the advantages of
which were its’ ability to handle
many more relationship types,
combined with the simpler
Object
Orientated
access of data. On the downside
Database
the systems were complex,
difficult to design and maintain.
In 1970 E.F. Codd created the relational database model, it is this 3rd generation model that has been
so massively successful that it forms the vast majority of data stored and processed is utilising these
RDBMs to the present day and into the foreseeable future. The advantages of these RDBMs: data
management, redundancy, consistency, abstraction, integrity, standards, accessibility and
maintenance far outweigh the complexity that comes with size and the higher impact of failure. The
fourth generation databases are the object-orientated database, they can efficiently manage large
numbers of differing data types, although switching to an OODBMS would require stating fresh and
re-doing the entire database. Objects with complex behaviours are managed without difficulty
utilising inheritance and polymorphism; because they are typically tied to an API OODBMS their
flexibility is reduced. Querying is another feature that does not implement well, you are unable to join
two classes as you would be able to join two tables. Principally queries rely on the effective design of
the system. There are also problems with the bulk deletion of data. A hybrid emerges: in the form of
object relational databases, looking to extend over both the relation and object orientated concepts.
They had large storage capacities although the architecture is not web friendly, but it is fast with high
access speeds. Fifth generation databases brings about a divergence in the way that data is managed
and stored with the database. Until the emergence of noSQL database to gain in performance you
scaled up, with noSQL you scale out. This cost effective way of improving database performance is
attractive to modern business enterprises. Although they con not replace the ACID compliant SQL
based OLTP with the generally BASE compliant noSQL databases. Now with noSQL you have four
different thematic approaches to managing you data. (Haadi, 2010)
The information environment
Most commonly identified as “the information environment is the aggregate of individuals,
organizations, and systems that collect, process, disseminate or act on information.” In today’s world
this leads to a more complex information environment. There has been an exponential explosion in
data and data capture, as the internet of things starts to gain a foothold. The growth of “things” has
led to an enormous growth in sparse data in which schema less data structures are become the vogue
allowing the storage of data in the database without prior knowledge of its definition. There is a
downside to schema less data model in that: at some point it is necessary to derive or imply a schema.
Bringing together the disparate data stores and systems management processes within the enterprise
into a homogenous whole is the goal of the information systems manager and is primarily considered
as the enterprise information environment. One of the primary reasons for designing and developing
a database is for its use as an integration tool.
Ask the right questions of your data!
The alternatives to SQL
Figure 1 the types of noSQL databases
In a nutshell if you are serious about your
enterprise information system then there is not a
Key-values
Stores
viable alternative to SQL RDBMs. It’s super-fast,
reliable, scalable, resilient and cost effective.
noSQL is simply muddying the waters for the
Column
media savvy. To start with it is not just one
Family
Stores
alternative technology, but they share a scale out
philosophy and are predominantly open source
which makes choosing one of them a difficult
Document
choice as vendor lock in becomes critical to the
Databases
success of its implementation within the
enterprise. So thematically there are four
database types within the noSQL camp each
Graph
Databases
having a competitive advantage in the data
environment. However, within each theme there
are a plethora of alternates as there are over 250
different offerings: all of them different, all of them proprietary in nature; and, all of them demand
more work, effort, resources, time and money than originally planned for and advertised. Object
database management systems still exist in an extremely although in an extremely thin niche.
Figure 2 the database environment as idealised by the noSQL fan brigade
SQL relational databases do not have scalability
issues and problems, Facebook recently
extended its SQL systems to cover databases in
the petabyte range of size. noSQL is cluster
friendly so scales horizontally efficiently, which
is on a per unit basis more cost attractive than
up scaling an SQL server. A point which truly
expresses the lack of thought within the noSQL
camp is that the very “popular” description
noSQL came from a meeting hashtag on twitter.
noSQL lacks a standard language SQL has been
ANSI almost since the beginning. Because of the
factional nature of the noSQL market there are a
severe and stark lack of skills (Cruz, 2012).


Key Value
Store
Column
Family
Store
Document
Datbase
Scale Size
Graph
Database
SQL
RDBMS == ACID
noSQL == BASE
Scale Complexity
For the foreseeable future SQL is going to
continue to attract the majority of investment, with noSQL utilised and applied in specific project
arenas that aid and abet in the construction and management of the data warehouse as well as being
implemented in the OLAP and BI processes.
The data warehouse
Everything ploughed into the data warehouse is a solid philosophy. A data warehouse is not a
technology but an architecture. The advantage is gained in recognising the strategic value of a data
warehouse. Having identified the need the architectural requirements mean that the organisation
benefits from simply planning and designing for a data warehouse.
OLTP
ETL
DW
BI
Figure 3 Traditional process leading towards a Business Intelligence system
Star schema
This is often considered as the simplest of the data warehouse schemas, as such is utilised pervasively.
A single fact table is linked to multiple dimension tables radially: by use of a primary and foreign key
relationship. The fact table is third normal form, whereas the dimension tables are de-normalised.
Snowflake schema
This schema is again often considered as more complex. As with the star schema the fact table is third
normal form, but this time round the dimension tables are also third normal form. Although the
schema saves on redundant data and space within the database, there is an increase in the number
of joins required to build the dimension tables.
Fact constellation
This schema requires the deconstruction of the star schema into to multiple fact tables with many
shared dimensional links. Fact constellations are difficult to design due to the aggregation of variants.
The language of relationships is changing and evolving with trends.
Enterprise Modelling
Data Modelling
Operation Systems
Relational Databases
Enterprise
Process
Modelling
ETL
Enterprise Intelligence
Star Schema Marts
Design ETL Process
N
Data Models
Data Warehouse


Extract
Reporting
Structured
Ad hoc
S
noSQL Databases
Data Cleaning
Data Profiling
File Systems
Snowflake Schema Marts
Transact
OLAP
Conceptual
Process
Flow
Modelling
Constellation Marts
Web Systems
Aggregation
Filtering
Joining
Sorting
Logical
Data
Flow
Modelling
Data Mining
noSQL Marts
Virtual Data Sytems
Load
Transaction Systems
Physical
Load data from targets
Other Marts



Analysis:
Multi dimensional
Statistical
Analytical
Figure 4 Enterprise Information Systems Architecture
Data modelling
Not believing in schema less ideas at the on-set, because a schema
will always have to be derived, so why not start with one? It is the
constant tick of time and “management” demands for it now!
Without understanding that it takes time to learn the skills, design
the applications and deploy the solutions. Systems analysis has
developed in types to the point of ridicule; systems are deployed as
solutions from the first moment, without thought or planning, then
fixed along the way, meanwhile the project teams have churned so
many times that there is no one left who actually understands the
code, let alone the project. Start with a plan: add to it! Then refine
and tinker along the way. That way the project will understand who
and how, and the: what, where when and why?: are clearly evident
to those involved.
schema
less
explicit
implicit
dirived
Figure 5 the redundancy of schema less design
Feature
Conceptual Logical Physical
Entity Names
✓
✓
Entity Relationships ✓
✓
Attributes
Primary Keys
Foreign Keys
Table Names
Column Names
Column Data Types
✓
✓
✓
✓
✓
✓
✓
✓
Table 1 Data Modelling model types and their features
Methodologies and techniques utilised by the data modelling process provide a: standard, consistent
and predictable: aspect with which to administer data as a viable resource within the enterprise. By
managing data as a resource we are able to ensure close-fitting synergies for the integration of the
enterprises information systems as a complete all-inclusive entity. It is especially advantageous in the
functional design and development of databases and the data warehouse, the data repositories of the
enterprise IS.
System design
There is an enterprise requirement for the organisational structure for storing data. Lists carry
redundant data in multiple themes or concepts. These lists are broken apart by theme and concept to
provide relational tables. Third normal form relational databases are more complex requiring effort,
and as such are hard work; not an attractive feature for the masses, who demand things are easy and
effortless.
SQL is an international standard language, recognised and implemented by the biggest names in the
software industry, the trendy new noSQL brigade have not got a common standard, each of the more
than 250 noSQL databases requires that you learn their own proprietary query language.
There is a great deal of form without function. It seems today that the ethos and very fundamentals
of design are wrong.
Normalisation. When is de-normalisation appropriate? From relation to tables and entities that have
been derived by data modelling.
Table for each entity
Figure 6 Transitioning from a data model to a database
Primary key
Field properties
•Data type
•Null status
•Default values
•Data constraints
Relation is analysed and normalised
Costs of denormalisation
 Modification anomalies
 Redundant data
 More storage required
Table 2 Cost benefit analysis of denormalisation
noSQL is vulnerable to modification anomalies.
There comes a point when the computational
expense of multiple queries with complex joins
means a denormalisation of the data is necessary.
Denormalisation exposes problems that are not in
normal form databases, in the end it is all for
nothing.
Benefits of denormalisation
 Simplicity
 Query performance improved
Data mining
For this section we will be recreating the first chapter of the book Mining the Social Web, 2nd Edition
In chapter 1: Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About,
and More: we will work through a number of exercises. These cover a number of distinct problems
when utilising the programming language Python to interface with and interrogate the Twitter API.
Decision Trees
Regression
Cluster
analysis
Time Series
Neural Nets
Factor anlaysis
Text Mining
Association
Rules
Ensemble
Rules
Support
Vector
Bayesian
Anomoly
Detection
Survival
analysis
Rule Induction
Social
Network
analysis
Genetic
Algorithms
Link analysis
Uplit Modeling
MARS
Figure 7 Data Mining Algorithms
The twitter exercise was tedious and complicated, in addition to being completely uninteresting as a
scientific exercise. From the exercise we learned that the Python programming language is the
province of esoteric code writers: “anyone can hack a piece of code” has been said, and that is what
programming in Python is like. The language construct is minimal by design, it lacks IDE support and
interfaces, and it is predominantly made up of third party libraries and is so incredibly whitespace
unfriendly.
The exercise takes us through the authentication and creation of a Twitter object which is then
integrated in the iPython interface. First just a print to screen and then some formatted output
utilising some imported library functions. It starts to get interesting towards the end of the exercises,
where matplotlib is imported and some of the results are graphed and charted. Python is a powerful
programming language and some of the libraries available make it data analytics system. It would be
and has tremendous value and applications at the Extraction and Transaction phases in the
preparation and implementation of the data warehouse.
The introduction of the Pandas library really enhances the performance of Python, it makes the slicing
and dicing of data sets remarkably straight forwards. The problem is the subject: Twitter:
Understanding that a tweet is 140 characters and the data set for a tweet is approximately 8K bytes,
the amount of augmented data and meta data is enormous, but this is where it apparently falls down
is the analysis model. Yes we can analyse what’s trending, what devices, where the tweets come from,
how many times it’s been retweeted and a plethora of other analytics: but what’s the point? What is
twitter all about? What contribution does it actually make? How is it benefitting anything? Put simply
it makes no valid scientific contribution it is merely for the chattering non-cognisant. Celebrity status
does not make us wise!
Conclusion
Although it is all change, everything still remains the same. It is a question of the big picture; the bigger
the picture the more often the same patterns will remerge. In the beginning of database history there
was no SQL, then SQL got invented, adopted and grew to dominance in the database industry; now
comes no SQL in a resurgent attempt to gain market share, as it claims: a viable alternative to SQL.
It is the 80:20 rule. 80% of the database market will remain SQL and the rest comprising of the
remaining 20% will battle for dominance among the market and data niches. As SQL scales into the
petabyte range, it will not be long before the database sizes supported by SQL database systems
reaches and exceeds the Exabyte range.
The information environment is going to continue to evolve in complexity, challenging the systems
and database designers of now and tomorrow with difficult and multifaceted data sources,
operational systems and reporting mechanisms.
Bibliography
Alexios Delis, N. R. (1993). Performance Comparison of Three Modern DBMS Architectures. IEEE
TRANSACTIONS ON SOFTWARE ENGINEERING, 19(2). Retrieved 07 31, 2014
Ali, A. (2013, 06 24). Comparing Data Warehouse Design Methodologies for Microsoft SQL Server.
Retrieved
08
07,
2014,
from
MSSQLTips.com:
http://www.mssqltips.com/sqlservertip/2976/comparing-data-warehouse-designmethodologies-for-microsoft-sql-server/
Ambler, S. (2014, 07 03). Data Modeling 101. Retrieved 07 03, 2014, from Agile Data:
http://www.agiledata.org/essays/dataModeling101.html
Babcock, C. (2010, 09 10). What's So Great About NoSQL? InformationWeek(1279), 26-31. Retrieved
07
31,
2014,
from
http://search.proquest.com.voyager.chester.ac.uk/docview/1416192825/abstract?accounti
d=14620
Cruz, X. (2012, June 4). The Database Battle: MongoDB vs MySQL. Retrieved 08 30, 2014, from
CloudTimes: http://cloudtimes.org/2012/06/04/the-database-battle-mongodb-vs-mysql/
Delis, A., & Roussopoulos, N. (1991, 09). Modern Client-Server DBMS Architectures. SIGMOD RECORD,
52-61. Retrieved 08 01, 2014
Evans, P. (2011, 01 11). Business Intelligence - SQL or NoSQL - Which is better. Retrieved 07 28, 2014,
from
Database
Journal:
http://www.databasejournal.com/sqletc/article.php/3919371/Business-Intelligence---SQLor-NoSQL---Which-is-better.htm
George, S. (2012, April 01). Inmon vs. Kimball: Which approach is suitable for your data warehouse?
Retrieved
08
07,
2014,
from
TechTarget.in:
http://searchbusinessintelligence.techtarget.in/tip/Inmon-vs-Kimball-Which-approach-issuitable-for-your-data-warehouse
Haadi, M. (2010, October 118). The Evolution of Database. Retrieved September 03, 2014, from All
About Databases: http://mhaadi.wordpress.com/2010/10/18/the-evolution-of-database/
Inmon, W. H. (2002). Building the Data Warehouse (3rd ed.). (R. Elliott, Ed.) New York, USA: Robert
Ipsen. Retrieved September 16, 2014
Kajeepeta, S. (2012, 05 14). NoSQL Everywhere? Nope. InformationWeek(1333), 31-32. Retrieved 07
31, 2014, from http://search.proquest.com.voyager.chester.ac.uk/docview/1021724923?pqorigsite=summon
Kavis Technology Consulting. (2010, 03 01). NoSQL vs. RDBMS: Apples and Oranges? Retrieved 07 28,
2014, from Kavis Technology Consulting: http://www.kavistechnology.com/blog/nosql-vsrdbms-apples-and-oranges/
Loshin, D. (2013, 09 04). An Introduction to NoSQL Data Management for Big Data. Retrieved 08 07,
2014, from DataInformed: http://data-informed.com/introduction-nosql-data-managementbig-data/
M.Thenmozhi, K. (2014, 06). AN ONTOLOGICAL APPROACH TO HANDLE MULTIDIMENSIONAL SCHEMA
EVOLUTION FOR DATA WAREHOUSE. International Journal of Database Management
Systems, 6(3), 33-52. doi:10.5121/ijdms.2014.6303
Microsoft Corporation. (2014, 07 03). Data Warehouse Design Considerations. Retrieved 07 03, 2014,
from SQL Server: http://technet.microsoft.com/en-us/library/aa902672(v=sql.80).aspx
Stonebraker, M. (2010, 04). SQL Databases v. NoSQL Databases. communications of the acm, 53(4).
doi:10.1145/1721654.1721659
Vasco Santos, R. S. (2014, April). TOWARDS A LOW COST ETL SYSTEM. International Journal of
Database Management Systems ( IJDMS ), 6(2), 67-79. doi:10.5121/ijdms.2014.6205
Veronika Abramova, J. B. (2014, June). EXPERIMENTAL EVALUATION OF NOSQL DATABASES.
International
Journal
of
Database
Management
Systems,
6(3),
16.
doi:10.5121/ijdms.2014.6301