IS6146 Databases for Management Information Systems
Lecture 8: Working with unstructured data
Rob Gleasure
robgleasure.com

Today's session
- Technologies for analysis
- Technologies for storage
- NoSQL
- Distributed map reduce architectures, e.g. Hadoop

Technologies and tools
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present); labelled technologies: Business Intelligence Tools. Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation]

Tools for analysis and presentation
- Massive range of software, depending on needs
- For visualising and sorting data: Excel, Pentaho
- For data mining, regressions, clustering, graphing, etc.: SPSS, R, Gephi, UCINET
- For reporting: Excel, Pentaho
- Let's get our hands data-y! (I know. Sorry.)

Tools for analysis and presentation
[Image from www.etsy.com]

Data warehousing
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present); labelled technologies: Data Warehousing, Business Intelligence Tools]

Data warehousing
[Diagram: on the OLTP side, operational databases (HR and payroll, sales and customers, orders, technical support) and purchased data pass through Extract, Transform, Load into a data warehouse; on the OLAP side, this business intelligence database supports data mining, visualisation, and reporting]

OLTP vs. OLAP
Online transaction processing (OLTP) databases/data stores support ongoing activities in an organisation
Hence, they need to
- Manage accurate real-time transactions
- Handle reads, writes, and updates by large numbers of concurrent users
- Decompose data into joinable, efficient rows (e.g. normalised to third normal form)
These requirements are often summarised as ACID database transactions
- Atomic: Every part of a transaction works or it's all rolled back
- Consistent: The database is never left in an inconsistent state
- Isolated: Transactions do not interfere with one another
- Durable: Completed transactions are not lost if the system crashes

OLTP vs. OLAP
Online analytical processing (OLAP) databases/data stores are used to support predictive analytics
Hence, they need to
- Allow vast quantities of historical data to be accessed quickly
- Be updatable in batches (often daily)
- Aggregate diverse structures with summary data
These requirements are often summarised as BASE database transactions
- Basically Available
- Soft state
- Eventual consistency

NoSQL
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present); labelled technologies: NoSQL, Data Warehousing, Business Intelligence Tools]

What is NoSQL?
- Basically any database that isn't a relational database
- Stands for 'Not only SQL'
- It's NOT anti-SQL or anti-relational databases
[Image from www.improgrammer.net]

What is NoSQL (continued)?
- It's not only rows in tables: NoSQL systems store and retrieve data from many formats, e.g. text, csv, xml, graphml
- It's not only joins: NoSQL systems mean you can extract data using simple interfaces, rather than necessarily relying on joins
- It's not only schemas: NoSQL systems mean you can drag-and-drop data into a folder, without having to organise and query it according to entities, attributes, relationships, etc.

What is NoSQL (continued)?
- It's not only executed on one processor: NoSQL systems mean you can store databases on multiple processors with high-speed performance
- It's not only specialised computers: NoSQL systems mean you can leverage low-cost shared-nothing commodity processors that have separate RAM and disk
- It's not only logarithmically scalable: NoSQL systems mean you can achieve linear scalability as you add more processors
- It's not only anything, really: NoSQL systems emphasise innovation and inclusivity, meaning there are multiple recognised options for how data is stored, retrieved, and manipulated (including standard SQL solutions)

What is NoSQL (continued)?
Four Data Patterns in NoSQL
[Image from http://www.slideshare.net/KrishnakumarSukumaran/to-sql-or-no-sql-that-is-the-question]

Key-value stores
- A simple string (the key) returns a Binary Large OBject (BLOB) of data (the value)
- E.g. the web
- The key can take many formats
  - Logical path names
  - A hash string artificially generated from the value
  - REST web service calls
  - SQL queries
- Three basic functions
  - Put
  - Get
  - Delete

Key-value stores (continued)
- Advantages
  - Scalability, reliability, portability
  - Low operational costs
  - Simplicity
- Disadvantages
  - No real options for advanced search
- Commercial solutions
  - Amazon S3
  - Voldemort

Column-family stores
- Stores BLOBs of data in one big table, with four possible basic identifiers used for look-up
  - Row
  - Column
  - Column-family
  - Time-stamp
- More like a spreadsheet than an RDBMS in many ways (e.g. no indices, triggers, or SQL queries)
- Grew from an idea presented in a Google BigTable paper

Column-family stores (continued)
- Advantages
  - Scales pretty well
  - Decent search ability
  - Easy to add new data
  - Pretty intuitive
- Disadvantages
  - Can't query BLOB content
  - Not as efficient to search as some other options
- Commercial solutions
  - Cassandra
  - HBase

Document stores
- Stores data in nested hierarchies (typically using XML or JSON)
- Keeps logical chunks of data together in one place
  - Flat tables, e.g. csv
  - Hierarchical docs, e.g. JSON
  - Mixed content, e.g. XML

Document stores (continued)
- Advantages
  - Lends itself to efficient within-document search
  - Very suitable for information retrieval
  - Very suitable where data is fed directly into websites/applications
  - Allows for structure without being overly restrictive
- Disadvantages
  - Complicated to implement
  - Search process may require opening and closing files
  - Analysis requires some flattening
- Commercial solutions
  - MarkLogic
  - MongoDB

Graph stores
- Model the interconnectivity of the data by focusing on nodes (sometimes called vertices), relationships (sometimes called edges), and properties
[Image from http://savas.me/2013/03/on-graph-data-model-design-relationships/]

Graph stores (continued)
- Tables are stored for nodes and edges separately, meaning additional types of search become possible

Graph stores (continued)
- Advantages
  - Fast network search
  - Works with many public data sets
- Disadvantages
  - Not very scalable
  - Hard to query systematically unless you use specialised languages based on graph traversals
- Commercial solutions
  - Neo4j
  - AllegroGraph

So, where to get the power needed for these giant data stores?
[Image from chucks-fun.blogspot.com]

[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present); labelled technologies: MapReduce, NoSQL, Data Warehousing, Business Intelligence Tools]

Traditional (Structured) Approach
[Diagram: a single high-power processor works through the big data]

The MapReduce Concept
[Diagram: the big data is spread across data nodes attached to a master processor and several slave (standard) processors]

The MapReduce Concept
Two fundamental steps
1. Map
  - Master node takes a large problem and slices it into sub-problems
  - Master node distributes these sub-problems to worker nodes
  - A worker node may also subdivide and distribute (in which case, a multi-level tree structure results)
  - Each worker processes its sub-problems and hands the results back to the master
2. Reduce
  - Master node reassembles the solutions to the sub-problems in a predefined way to answer the high-level problem

Issues in Distributed Model
- How should we decompose one big task into smaller ones?
- How do we figure out an efficient way to assign tasks to different machines?
- How do we exchange results between machines?
- How do we synchronize distributed tasks?
- What do we do if a task fails?

Apache Hadoop
- Hadoop was created in 2005 by Doug Cutting and Mike Cafarella, building on white papers by Google on their MapReduce process (Cutting joined Yahoo in 2006)
- The name refers to a toy elephant belonging to Doug Cutting's son
- Hadoop became an Apache project in 2006, with Yahoo as a major early contributor
- Hadoop offers a framework of tools for dealing with big data
- Hadoop is open source, distributed under the Apache licence

Hadoop Ecosystem
[Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/]

Hadoop Ecosystem
[Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/]

MapReducing in Hadoop
[Diagram: an application submits batches to a queue. The master processor runs the job tracker and the name node, along with a task tracker and a data node. Each slave processor runs a task tracker and a data node. The job tracker and task trackers are where MapReduce comes in; the name node and data nodes are where HDFS comes in.]

Fault Handling in Hadoop
- Distributing processing means that sooner or later, part of the distributed processing network will fail
  - Practical truth of networks: they are unreliable
- Hadoop's HDFS has fault tolerance built in for data nodes
  - Three copies of each file's data are maintained by default
  - If one copy goes down, data is retrieved from another
  - The faulty node is then updated with new (working) data from backup
- Hadoop's MapReduce layer also tracks failures in task trackers
  - The master node's job tracker watches for errors in slave nodes
  - It allocates tasks to a new slave if the slave responsible fails

Programming in Hadoop
- Programmers using Hadoop don't have to worry about
  - Where files are stored
  - How to manage failures
  - How to distribute computation
  - How to scale up or down activities
- A variety of languages can be used, though Java is the most common and arguably most hassle-free

Implementing a Hadoop System
- Hadoop can be run in traditional onsite data centres using multiple dedicated machines
- Hadoop can also be run via cloud-hosted services, including
  - Microsoft Azure
  - Amazon EC2/S3
  - Amazon Elastic MapReduce
  - Google Compute Engine

Implementing a Hadoop System: Yahoo Servers Running Hadoop
[Image from http://thecloudtutorial.com/hadoop-tutorial.html]

Applications of Hadoop
Areas of application include
- Search engines: e.g. Google, Yahoo
- Social media: e.g. Facebook, Twitter
- Financial services: e.g. Morgan Stanley, BNY Mellon
- eCommerce: e.g. Amazon, American Airlines, eBay, IBM
- Government: e.g. Federal Reserve, Homeland Security

Users of Hadoop
Just like RDBMS, Hadoop systems have different levels of users
- Administrators handle
  - Configuring of the system
  - Updates and installation
  - General firefighting
- Basic users
  - Run tests and gather data for reporting, market research, general exploration, etc.
  - Design applications to use data

Accessibility of NoSQL databases?
[Image from spaaaawk.tumblr.com]

Want to read more?
- Apache Hadoop Documentation: http://hadoop.apache.org/docs/current/
- Data Intensive Text Processing with Map-Reduce: http://lintool.github.io/MapReduceAlgorithms/
- Hadoop Definitive Guide: http://www.amazon.com/Hadoop-Definitive-Guide-TomWhite/dp/1449311520

Want to read more?
- Financial Services using Hadoop: http://hortonworks.com/blog/financial-services-hadoop/ and https://www.mapr.com/solutions/industry/big-data-and-apachehadoop-financial-services
- Hadoop at ND: http://ccl.cse.nd.edu/operations/hadoop/
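
Supplementary sketch: the two steps from "The MapReduce Concept" (the master slices the problem, workers process their slices, the master reassembles the results) can be illustrated with a word count in plain Python. This is only a single-machine illustration, not Hadoop code. The function names (`split_into_chunks`, `map_chunk`, `reduce_pairs`, `word_count`) and the sample sentences are made up for this example, and the "workers" here simply run one after another rather than on separate nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_into_chunks(lines: List[str], n_workers: int) -> List[List[str]]:
    """Master: slice the big problem into one sub-problem per worker."""
    return [lines[i::n_workers] for i in range(n_workers)]


def map_chunk(chunk: Iterable[str]) -> List[Tuple[str, int]]:
    """Worker (Map): emit a (word, 1) pair for every word in its chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_pairs(mapped: Iterable[List[Tuple[str, int]]]) -> Dict[str, int]:
    """Master (Reduce): reassemble the workers' pairs into total counts."""
    totals: Dict[str, int] = defaultdict(int)
    for pairs in mapped:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


def word_count(lines: List[str], n_workers: int = 3) -> Dict[str, int]:
    chunks = split_into_chunks(lines, n_workers)
    # In a real cluster each chunk would be handled on a different node;
    # here the "workers" are ordinary sequential function calls.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_pairs(mapped)


if __name__ == "__main__":
    docs = [
        "big data needs big tools",
        "Hadoop handles big data",
        "data nodes store data",
    ]
    print(word_count(docs))
```

The result does not depend on how many workers the lines are split across, which is the property that lets the work be distributed in the first place. Everything listed under "Issues in Distributed Model" (assigning tasks, exchanging results, handling failures) is deliberately left out of this sketch; that is the part a framework such as Hadoop takes care of.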