Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CO7107 DATA WAREHOUSING AND MULTIDIMENSIONAL DATABASES Assessment Andrew Jones 0817804 J01209 Table of Contents Table of Figures ....................................................................................................................................... 1 Introduction ............................................................................................................................................ 2 The evolution of databases ..................................................................................................................... 2 The information environment ................................................................................................................ 3 The alternatives to SQL ........................................................................................................................... 3 The data warehouse ............................................................................................................................... 4 Star schema......................................................................................................................................... 4 Snowflake schema .............................................................................................................................. 4 Fact constellation ................................................................................................................................ 5 Data modelling ........................................................................................................................................ 5 System design ......................................................................................................................................... 6 Data mining ............................................................................................................................................. 7 Conclusion ............................................................................................................................................... 8 Bibliography ............................................................................................................................................ 9 Table of Figures Figure 1 the types of noSQL databases ................................................................................................... 3 Figure 2 the database environment as idealised by the noSQL fan brigade .......................................... 4 Figure 3 Traditional process leading towards a Business Intelligence system ....................................... 4 Figure 4 Enterprise Information Systems Architecture .......................................................................... 5 Figure 5 the redundancy of schema less design ..................................................................................... 5 Figure 6 Transitioning from a data model to a database ........................................................................ 6 Figure 7 Data Mining Algorithms ............................................................................................................ 7 Introduction The evolution of the modern Relational Database Management System has been shaped by the database itself, but the problem remains the “distance” between the decision maker and the data. The complexity arises when the decision maker requires information from the data in the form of answers to queries. Now it is about the toolkit that allows for the searching, indexing and querying of vast amounts of disparate data sources that exist within the enterprise; enter the data warehouse: a read only set of data and query results all shaped up and fit for mining. But for today’s management the time taken to run the warehousing routine and subsequently mine it is too long! They, management, want answers to unstructured questions and general grammatical queries in milliseconds utilising all of the available warehoused information and up to the moment transaction data. Is it that senior managers, or managers in general, lack the basic information retrieval skills necessary to seek and utilise the toolkit resources available to them within the enterprise data structure? The evolution of databases If you can cast your mind back as far as the Halcion days of the 1960’s then this is where there the computer database began; 1968 saw the arrival of the file based database in which data is stored and maintained in a flat file. Things really took off with the next generation Hierarchical data model databases. They were enormously complex in implementation and often difficult to manage. There were no standards and the systems databases could not efficiently manage multiple many to many relationships. On the plus side they were efficient at searching, together with there being less redundant data. Data was independent together with the Hierarchical implementation of security and Database referential integrities. The next iteration of database development came with the advent of the schema and was Relational noSQL Network brought about by the network Database Database Database database: the advantages of which were its’ ability to handle many more relationship types, combined with the simpler Object Orientated access of data. On the downside Database the systems were complex, difficult to design and maintain. In 1970 E.F. Codd created the relational database model, it is this 3rd generation model that has been so massively successful that it forms the vast majority of data stored and processed is utilising these RDBMs to the present day and into the foreseeable future. The advantages of these RDBMs: data management, redundancy, consistency, abstraction, integrity, standards, accessibility and maintenance far outweigh the complexity that comes with size and the higher impact of failure. The fourth generation databases are the object-orientated database, they can efficiently manage large numbers of differing data types, although switching to an OODBMS would require stating fresh and re-doing the entire database. Objects with complex behaviours are managed without difficulty utilising inheritance and polymorphism; because they are typically tied to an API OODBMS their flexibility is reduced. Querying is another feature that does not implement well, you are unable to join two classes as you would be able to join two tables. Principally queries rely on the effective design of the system. There are also problems with the bulk deletion of data. A hybrid emerges: in the form of object relational databases, looking to extend over both the relation and object orientated concepts. They had large storage capacities although the architecture is not web friendly, but it is fast with high access speeds. Fifth generation databases brings about a divergence in the way that data is managed and stored with the database. Until the emergence of noSQL database to gain in performance you scaled up, with noSQL you scale out. This cost effective way of improving database performance is attractive to modern business enterprises. Although they con not replace the ACID compliant SQL based OLTP with the generally BASE compliant noSQL databases. Now with noSQL you have four different thematic approaches to managing you data. (Haadi, 2010) The information environment Most commonly identified as “the information environment is the aggregate of individuals, organizations, and systems that collect, process, disseminate or act on information.” In today’s world this leads to a more complex information environment. There has been an exponential explosion in data and data capture, as the internet of things starts to gain a foothold. The growth of “things” has led to an enormous growth in sparse data in which schema less data structures are become the vogue allowing the storage of data in the database without prior knowledge of its definition. There is a downside to schema less data model in that: at some point it is necessary to derive or imply a schema. Bringing together the disparate data stores and systems management processes within the enterprise into a homogenous whole is the goal of the information systems manager and is primarily considered as the enterprise information environment. One of the primary reasons for designing and developing a database is for its use as an integration tool. Ask the right questions of your data! The alternatives to SQL Figure 1 the types of noSQL databases In a nutshell if you are serious about your enterprise information system then there is not a Key-values Stores viable alternative to SQL RDBMs. It’s super-fast, reliable, scalable, resilient and cost effective. noSQL is simply muddying the waters for the Column media savvy. To start with it is not just one Family Stores alternative technology, but they share a scale out philosophy and are predominantly open source which makes choosing one of them a difficult Document choice as vendor lock in becomes critical to the Databases success of its implementation within the enterprise. So thematically there are four database types within the noSQL camp each Graph Databases having a competitive advantage in the data environment. However, within each theme there are a plethora of alternates as there are over 250 different offerings: all of them different, all of them proprietary in nature; and, all of them demand more work, effort, resources, time and money than originally planned for and advertised. Object database management systems still exist in an extremely although in an extremely thin niche. Figure 2 the database environment as idealised by the noSQL fan brigade SQL relational databases do not have scalability issues and problems, Facebook recently extended its SQL systems to cover databases in the petabyte range of size. noSQL is cluster friendly so scales horizontally efficiently, which is on a per unit basis more cost attractive than up scaling an SQL server. A point which truly expresses the lack of thought within the noSQL camp is that the very “popular” description noSQL came from a meeting hashtag on twitter. noSQL lacks a standard language SQL has been ANSI almost since the beginning. Because of the factional nature of the noSQL market there are a severe and stark lack of skills (Cruz, 2012). Key Value Store Column Family Store Document Datbase Scale Size Graph Database SQL RDBMS == ACID noSQL == BASE Scale Complexity For the foreseeable future SQL is going to continue to attract the majority of investment, with noSQL utilised and applied in specific project arenas that aid and abet in the construction and management of the data warehouse as well as being implemented in the OLAP and BI processes. The data warehouse Everything ploughed into the data warehouse is a solid philosophy. A data warehouse is not a technology but an architecture. The advantage is gained in recognising the strategic value of a data warehouse. Having identified the need the architectural requirements mean that the organisation benefits from simply planning and designing for a data warehouse. OLTP ETL DW BI Figure 3 Traditional process leading towards a Business Intelligence system Star schema This is often considered as the simplest of the data warehouse schemas, as such is utilised pervasively. A single fact table is linked to multiple dimension tables radially: by use of a primary and foreign key relationship. The fact table is third normal form, whereas the dimension tables are de-normalised. Snowflake schema This schema is again often considered as more complex. As with the star schema the fact table is third normal form, but this time round the dimension tables are also third normal form. Although the schema saves on redundant data and space within the database, there is an increase in the number of joins required to build the dimension tables. Fact constellation This schema requires the deconstruction of the star schema into to multiple fact tables with many shared dimensional links. Fact constellations are difficult to design due to the aggregation of variants. The language of relationships is changing and evolving with trends. Enterprise Modelling Data Modelling Operation Systems Relational Databases Enterprise Process Modelling ETL Enterprise Intelligence Star Schema Marts Design ETL Process N Data Models Data Warehouse Extract Reporting Structured Ad hoc S noSQL Databases Data Cleaning Data Profiling File Systems Snowflake Schema Marts Transact OLAP Conceptual Process Flow Modelling Constellation Marts Web Systems Aggregation Filtering Joining Sorting Logical Data Flow Modelling Data Mining noSQL Marts Virtual Data Sytems Load Transaction Systems Physical Load data from targets Other Marts Analysis: Multi dimensional Statistical Analytical Figure 4 Enterprise Information Systems Architecture Data modelling Not believing in schema less ideas at the on-set, because a schema will always have to be derived, so why not start with one? It is the constant tick of time and “management” demands for it now! Without understanding that it takes time to learn the skills, design the applications and deploy the solutions. Systems analysis has developed in types to the point of ridicule; systems are deployed as solutions from the first moment, without thought or planning, then fixed along the way, meanwhile the project teams have churned so many times that there is no one left who actually understands the code, let alone the project. Start with a plan: add to it! Then refine and tinker along the way. That way the project will understand who and how, and the: what, where when and why?: are clearly evident to those involved. schema less explicit implicit dirived Figure 5 the redundancy of schema less design Feature Conceptual Logical Physical Entity Names ✓ ✓ Entity Relationships ✓ ✓ Attributes Primary Keys Foreign Keys Table Names Column Names Column Data Types ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Table 1 Data Modelling model types and their features Methodologies and techniques utilised by the data modelling process provide a: standard, consistent and predictable: aspect with which to administer data as a viable resource within the enterprise. By managing data as a resource we are able to ensure close-fitting synergies for the integration of the enterprises information systems as a complete all-inclusive entity. It is especially advantageous in the functional design and development of databases and the data warehouse, the data repositories of the enterprise IS. System design There is an enterprise requirement for the organisational structure for storing data. Lists carry redundant data in multiple themes or concepts. These lists are broken apart by theme and concept to provide relational tables. Third normal form relational databases are more complex requiring effort, and as such are hard work; not an attractive feature for the masses, who demand things are easy and effortless. SQL is an international standard language, recognised and implemented by the biggest names in the software industry, the trendy new noSQL brigade have not got a common standard, each of the more than 250 noSQL databases requires that you learn their own proprietary query language. There is a great deal of form without function. It seems today that the ethos and very fundamentals of design are wrong. Normalisation. When is de-normalisation appropriate? From relation to tables and entities that have been derived by data modelling. Table for each entity Figure 6 Transitioning from a data model to a database Primary key Field properties •Data type •Null status •Default values •Data constraints Relation is analysed and normalised Costs of denormalisation Modification anomalies Redundant data More storage required Table 2 Cost benefit analysis of denormalisation noSQL is vulnerable to modification anomalies. There comes a point when the computational expense of multiple queries with complex joins means a denormalisation of the data is necessary. Denormalisation exposes problems that are not in normal form databases, in the end it is all for nothing. Benefits of denormalisation Simplicity Query performance improved Data mining For this section we will be recreating the first chapter of the book Mining the Social Web, 2nd Edition In chapter 1: Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More: we will work through a number of exercises. These cover a number of distinct problems when utilising the programming language Python to interface with and interrogate the Twitter API. Decision Trees Regression Cluster analysis Time Series Neural Nets Factor anlaysis Text Mining Association Rules Ensemble Rules Support Vector Bayesian Anomoly Detection Survival analysis Rule Induction Social Network analysis Genetic Algorithms Link analysis Uplit Modeling MARS Figure 7 Data Mining Algorithms The twitter exercise was tedious and complicated, in addition to being completely uninteresting as a scientific exercise. From the exercise we learned that the Python programming language is the province of esoteric code writers: “anyone can hack a piece of code” has been said, and that is what programming in Python is like. The language construct is minimal by design, it lacks IDE support and interfaces, and it is predominantly made up of third party libraries and is so incredibly whitespace unfriendly. The exercise takes us through the authentication and creation of a Twitter object which is then integrated in the iPython interface. First just a print to screen and then some formatted output utilising some imported library functions. It starts to get interesting towards the end of the exercises, where matplotlib is imported and some of the results are graphed and charted. Python is a powerful programming language and some of the libraries available make it data analytics system. It would be and has tremendous value and applications at the Extraction and Transaction phases in the preparation and implementation of the data warehouse. The introduction of the Pandas library really enhances the performance of Python, it makes the slicing and dicing of data sets remarkably straight forwards. The problem is the subject: Twitter: Understanding that a tweet is 140 characters and the data set for a tweet is approximately 8K bytes, the amount of augmented data and meta data is enormous, but this is where it apparently falls down is the analysis model. Yes we can analyse what’s trending, what devices, where the tweets come from, how many times it’s been retweeted and a plethora of other analytics: but what’s the point? What is twitter all about? What contribution does it actually make? How is it benefitting anything? Put simply it makes no valid scientific contribution it is merely for the chattering non-cognisant. Celebrity status does not make us wise! Conclusion Although it is all change, everything still remains the same. It is a question of the big picture; the bigger the picture the more often the same patterns will remerge. In the beginning of database history there was no SQL, then SQL got invented, adopted and grew to dominance in the database industry; now comes no SQL in a resurgent attempt to gain market share, as it claims: a viable alternative to SQL. It is the 80:20 rule. 80% of the database market will remain SQL and the rest comprising of the remaining 20% will battle for dominance among the market and data niches. As SQL scales into the petabyte range, it will not be long before the database sizes supported by SQL database systems reaches and exceeds the Exabyte range. The information environment is going to continue to evolve in complexity, challenging the systems and database designers of now and tomorrow with difficult and multifaceted data sources, operational systems and reporting mechanisms. Bibliography Alexios Delis, N. R. (1993). Performance Comparison of Three Modern DBMS Architectures. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 19(2). Retrieved 07 31, 2014 Ali, A. (2013, 06 24). Comparing Data Warehouse Design Methodologies for Microsoft SQL Server. Retrieved 08 07, 2014, from MSSQLTips.com: http://www.mssqltips.com/sqlservertip/2976/comparing-data-warehouse-designmethodologies-for-microsoft-sql-server/ Ambler, S. (2014, 07 03). Data Modeling 101. Retrieved 07 03, 2014, from Agile Data: http://www.agiledata.org/essays/dataModeling101.html Babcock, C. (2010, 09 10). What's So Great About NoSQL? InformationWeek(1279), 26-31. Retrieved 07 31, 2014, from http://search.proquest.com.voyager.chester.ac.uk/docview/1416192825/abstract?accounti d=14620 Cruz, X. (2012, June 4). The Database Battle: MongoDB vs MySQL. Retrieved 08 30, 2014, from CloudTimes: http://cloudtimes.org/2012/06/04/the-database-battle-mongodb-vs-mysql/ Delis, A., & Roussopoulos, N. (1991, 09). Modern Client-Server DBMS Architectures. SIGMOD RECORD, 52-61. Retrieved 08 01, 2014 Evans, P. (2011, 01 11). Business Intelligence - SQL or NoSQL - Which is better. Retrieved 07 28, 2014, from Database Journal: http://www.databasejournal.com/sqletc/article.php/3919371/Business-Intelligence---SQLor-NoSQL---Which-is-better.htm George, S. (2012, April 01). Inmon vs. Kimball: Which approach is suitable for your data warehouse? Retrieved 08 07, 2014, from TechTarget.in: http://searchbusinessintelligence.techtarget.in/tip/Inmon-vs-Kimball-Which-approach-issuitable-for-your-data-warehouse Haadi, M. (2010, October 118). The Evolution of Database. Retrieved September 03, 2014, from All About Databases: http://mhaadi.wordpress.com/2010/10/18/the-evolution-of-database/ Inmon, W. H. (2002). Building the Data Warehouse (3rd ed.). (R. Elliott, Ed.) New York, USA: Robert Ipsen. Retrieved September 16, 2014 Kajeepeta, S. (2012, 05 14). NoSQL Everywhere? Nope. InformationWeek(1333), 31-32. Retrieved 07 31, 2014, from http://search.proquest.com.voyager.chester.ac.uk/docview/1021724923?pqorigsite=summon Kavis Technology Consulting. (2010, 03 01). NoSQL vs. RDBMS: Apples and Oranges? Retrieved 07 28, 2014, from Kavis Technology Consulting: http://www.kavistechnology.com/blog/nosql-vsrdbms-apples-and-oranges/ Loshin, D. (2013, 09 04). An Introduction to NoSQL Data Management for Big Data. Retrieved 08 07, 2014, from DataInformed: http://data-informed.com/introduction-nosql-data-managementbig-data/ M.Thenmozhi, K. (2014, 06). AN ONTOLOGICAL APPROACH TO HANDLE MULTIDIMENSIONAL SCHEMA EVOLUTION FOR DATA WAREHOUSE. International Journal of Database Management Systems, 6(3), 33-52. doi:10.5121/ijdms.2014.6303 Microsoft Corporation. (2014, 07 03). Data Warehouse Design Considerations. Retrieved 07 03, 2014, from SQL Server: http://technet.microsoft.com/en-us/library/aa902672(v=sql.80).aspx Stonebraker, M. (2010, 04). SQL Databases v. NoSQL Databases. communications of the acm, 53(4). doi:10.1145/1721654.1721659 Vasco Santos, R. S. (2014, April). TOWARDS A LOW COST ETL SYSTEM. International Journal of Database Management Systems ( IJDMS ), 6(2), 67-79. doi:10.5121/ijdms.2014.6205 Veronika Abramova, J. B. (2014, June). EXPERIMENTAL EVALUATION OF NOSQL DATABASES. International Journal of Database Management Systems, 6(3), 16. doi:10.5121/ijdms.2014.6301