
IS6146 Databases for Management Information Systems
Lecture 8: Working with unstructured data
Rob Gleasure
robgleasure.com

Today’s session
• Technologies for analysis
• Technologies for storage
  • NoSQL
  • Distributed map reduce architectures, e.g. Hadoop

Technologies and tools

[Diagram: Data lifecycle: Create → Capture → Store → Analyse → Present. Label: Business Intelligence Tools. Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation]

Tools for analysis and presentation
• Massive range of software, depending on needs
• For visualising and sorting data
  • Excel
  • Pentaho
• For data mining, regressions, clustering, graphing, etc.
  • SPSS
  • R
  • Gephi
  • UCINET
• For reporting
  • Excel
  • Pentaho

Let’s get our hands data-y! (I know. Sorry.)
(Image from www.etsy.com)

Data warehousing

[Diagram: Data lifecycle: Create → Capture → Store → Analyse → Present. Labels: Data Warehousing, Business Intelligence Tools]

Data warehousing
[Diagram: on the OLTP side, operational databases (HR and payroll, sales and customers, orders, technical support) and purchased data; these are extracted, transformed, and loaded into the data warehouse; on the OLAP side, the business intelligence database supports data mining, visualisation, and reporting]

OLTP vs. OLAP
• Online transaction processing (OLTP) databases/data stores support ongoing activities in an organisation
• Hence, they need to
  • Manage accurate real-time transactions
  • Handle reads, writes, and updates by large numbers of concurrent users
  • Decompose data into joinable, efficient rows (e.g. normalised to 3rd normal form)
• These requirements are often labelled ACID database transactions
  • Atomic: Every part of a transaction works or it’s all rolled back
  • Consistent: The database is never left in an inconsistent state
  • Isolated: Transactions do not interfere with one another
  • Durable: Completed transactions are not lost if the system crashes

OLTP vs. OLAP (continued)
• Online analytical processing (OLAP) databases/data stores are used to support predictive analytics
• Hence, they need to
  • Allow vast quantities of historical data to be accessed quickly
  • Be updatable in batches (often daily)
  • Aggregate diverse structures with summary data
• These requirements are often labelled BASE database transactions
  • Basic Availability
  • Soft-state
  • Eventual consistency

NoSQL

[Diagram: Data lifecycle: Create → Capture → Store → Analyse → Present. Labels: NoSQL, Data Warehousing, Business Intelligence Tools]

What is NoSQL?
• Basically any database that isn’t a relational database
• Stands for ‘Not only SQL’
• It’s NOT anti-SQL or anti-relational databases
(Image from www.improgrammer.net)

What is NoSQL (continued)?
• It’s not only rows in tables
  • NoSQL systems store and retrieve data in many formats, e.g. text, csv, xml, graphml
• It’s not only joins
  • NoSQL systems mean you can extract data using simple interfaces, rather than necessarily relying on joins
• It’s not only schemas
  • NoSQL systems mean you can drag-and-drop data into a folder, without having to organise and query it according to entities, attributes, relationships, etc.

What is NoSQL (continued)?
• It’s not only executed on one processor
  • NoSQL systems mean you can store databases on multiple processors with high-speed performance
• It’s not only specialised computers
  • NoSQL systems mean you can leverage low-cost shared-nothing commodity processors that have separate RAM and disk
• It’s not only logarithmically scalable
  • NoSQL systems mean you can achieve linear scalability as you add more processors
• It’s not only anything, really…
  • NoSQL systems emphasise innovation and inclusivity, meaning there are multiple recognised options for how data is stored, retrieved, and manipulated (including standard SQL solutions)
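
The shared-nothing idea above can be illustrated with a minimal sketch, assuming a simple hash-partitioning scheme (one common way to spread data, not the only one). The `Node` and `Cluster` classes and their method names are hypothetical and exist only for this illustration; each node keeps a private dictionary that stands in for its own RAM and disk.

```python
import hashlib


class Node:
    """One shared-nothing node: its storage is private to the node."""

    def __init__(self, name):
        self.name = name
        self.storage = {}


class Cluster:
    """Routes each key to exactly one node by hashing the key."""

    def __init__(self, node_count):
        self.nodes = [Node(f"node-{i}") for i in range(node_count)]

    def node_for(self, key):
        # Use a stable hash; Python's built-in hash() of a string
        # changes between interpreter runs.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self.node_for(key).storage[key] = value

    def get(self, key):
        return self.node_for(key).storage.get(key)


cluster = Cluster(node_count=4)
for i in range(1000):
    cluster.put(f"customer:{i}", {"id": i})

# Each node ends up holding roughly a quarter of the records.
sizes = [len(node.storage) for node in cluster.nodes]
print(sizes)
```

Because no node needs to consult another to serve a key, adding nodes spreads both storage and lookups, which is the intuition behind the linear scalability claim. The modulo routing is chosen here only for brevity: changing the node count remaps most keys, which is why production systems tend to use schemes such as consistent hashing instead.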
Four Data Patterns in NoSQL
(Image from http://www.slideshare.net/KrishnakumarSukumaran/to-sql-or-no-sql-that-is-the-question)

Key-value stores
• A simple string (the key) returns a Binary Large OBject (BLOB) of data (the value)
  • E.g. the web, where a URL is the key and the page returned is the value
• The key can take many formats
  • Logical path names
  • A hash string artificially generated from the value
  • REST web service calls
  • SQL queries
• Three basic functions
  • Put
  • Get
  • Delete

Key-value stores (continued)
• Advantages
  • Scalability, reliability, portability
  • Low operational costs
  • Simplicity
• Disadvantages
  • No real options for advanced search
• Commercial solutions
  • Amazon S3
  • Voldemort

Column-family stores
• Stores BLOBs of data in one big table, with four possible basic identifiers used for look-up
  • Row
  • Column
  • Column-family
  • Time-stamp
• More like a spreadsheet than an RDBMS in many ways (e.g. no indices, triggers, or SQL queries)
• Grew from an idea presented in Google’s BigTable paper

Column-family stores (continued)
• Advantages
  • Scales pretty well
  • Decent search ability
  • Easy to add new data
  • Pretty intuitive
• Disadvantages
  • Can’t query BLOB content
  • Not as efficient to search as some other options
• Commercial solutions
  • Cassandra
  • HBase

Document stores
• Stores data in nested hierarchies (typically using XML or JSON)
• Keeps logical chunks of data together in one place
[Diagram: flat tables, e.g. csv; hierarchical docs, e.g. JSON; mixed content, e.g. XML]

Document stores (continued)
• Advantages
  • Lends itself to efficient within-document search
  • Very suitable for information retrieval
  • Very suitable where data is fed directly into websites/applications
  • Allows for structure without being overly restrictive
• Disadvantages
  • Complicated to implement
  • Search process may require opening and closing files
  • Analysis requires some flattening
• Commercial solutions
  • MarkLogic
  • MongoDB

Graph stores
• Model the interconnectivity of the data by focusing on nodes (sometimes called vertices), relationships (sometimes called edges), and properties
(Image from http://savas.me/2013/03/on-graph-data-model-design-relationships/)

Graph stores (continued)
• Tables are stored for nodes and edges separately, meaning new types of search become possible

Graph stores (continued)
• Advantages
  • Fast network search
  • Works with many public data sets
• Disadvantages
  • Not very scalable
  • Hard to query systematically unless you use specialised languages based on graph traversals
• Commercial solutions
  • Neo4j
  • AllegroGraph

So, where to get the power needed for these giant data stores?
(Image from chucks-fun.blogspot.com)

MapReduce

[Diagram: Data lifecycle: Create → Capture → Store → Analyse → Present. Labels: MapReduce, NoSQL, Data Warehousing, Business Intelligence Tools]

Traditional (Structured) Approach
[Diagram: all of the big data is fed to a single high-power processor]

The MapReduce Concept
[Diagram: the big data is split across many data nodes, each attached to a standard processor; one master processor coordinates several slave processors]

The MapReduce Concept
• Two fundamental steps
1. Map
  • Master node takes a large problem and slices it into sub-problems
  • Master node distributes these sub-problems to worker nodes
  • A worker node may also subdivide and distribute (in which case, a multi-level tree structure results)
  • Each worker processes its sub-problem and hands the result back to the master
2. Reduce
  • Master node reassembles the solutions to the sub-problems in a predefined way to answer the high-level problem

Issues in Distributed Model
• How should we decompose one big task into smaller ones?
• How do we figure out an efficient way to assign tasks to different machines?
• How do we exchange results between machines?
• How do we synchronize distributed tasks?
• What do we do if a task fails?

Apache Hadoop
• Hadoop was created around 2005 by Doug Cutting and Mike Cafarella, building on papers published by Google on its file system and MapReduce process
• The name refers to a toy elephant belonging to Doug Cutting’s son
• Hadoop has been developed under Apache since 2006, the year Cutting joined Yahoo, which became a major contributor
• Hadoop offers a framework of tools for dealing with big data
• Hadoop is open source, distributed under the Apache licence

Hadoop Ecosystem
(Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/)

Hadoop Ecosystem
(Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/)

MapReducing in Hadoop
[Diagram: applications submit batches to a queue. The master processor runs the job tracker, a task tracker, the name node, and a data node; each slave processor runs a task tracker and a data node. The job tracker and task trackers are where MapReduce comes in; the name node and data nodes are where HDFS comes in]

Fault Handling in Hadoop
• Distributing processing means that sooner or later, part of the distributed processing network will fail
  • Practical truth of networks: they are unreliable
• Hadoop’s HDFS has fault tolerance built in for data nodes
  • Three copies of each piece of data are maintained by default
  • If one copy goes down, data is retrieved from another
  • The faulty node is then updated with new (working) data from backup
• Hadoop’s MapReduce layer also tracks failures in task trackers
  • The master node’s job tracker watches for errors in slave nodes
  • It allocates tasks to a new slave if the slave responsible fails

Programming in Hadoop
• Programmers using Hadoop don’t have to worry about
  • Where files are stored
  • How to manage failures
  • How to distribute computation
  • How to scale activities up or down
• A variety of languages can be used, though Java is the most common and arguably most hassle-free

Implementing a Hadoop System
• Hadoop can be run in traditional onsite data centres using multiple dedicated machines
• Hadoop can also be run via cloud-hosted services, including
  • Microsoft Azure
  • Amazon EC2/S3
  • Amazon Elastic MapReduce
  • Google Compute Engine

Implementing a Hadoop System: Yahoo Servers Running Hadoop
(Image from http://thecloudtutorial.com/hadoop-tutorial.html)

Applications of Hadoop
• Areas of application include
  • Search engines, e.g. Google, Yahoo
  • Social media, e.g. Facebook, Twitter
  • Financial services, e.g. Morgan Stanley, BNY Mellon
  • eCommerce, e.g. Amazon, American Airlines, eBay, IBM
  • Government, e.g. Federal Reserve, Homeland Security

Users of Hadoop
• Just like RDBMS, Hadoop systems have different levels of users
• Administrators handle
  • Configuring of the system
  • Updates and installation
  • General firefighting
• Basic users
  • Run tests and gather data for reporting, market research, general exploration, etc.
  • Design applications to use data

Accessibility of NoSQL databases?
(Image from spaaaawk.tumblr.com)

Want to read more?
• Apache Hadoop Documentation: http://hadoop.apache.org/docs/current/
• Data Intensive Text Processing with Map-Reduce: http://lintool.github.io/MapReduceAlgorithms/
• Hadoop Definitive Guide: http://www.amazon.com/Hadoop-Definitive-Guide-TomWhite/dp/1449311520

Want to read more?
• Financial Services using Hadoop
  • http://hortonworks.com/blog/financial-services-hadoop/
  • https://www.mapr.com/solutions/industry/big-data-and-apachehadoop-financial-services
• Hadoop at ND: http://ccl.cse.nd.edu/operations/hadoop/
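
Returning to the Map and Reduce steps from "The MapReduce Concept", the flow can be sketched in plain Python. This is a single-machine illustration under stated assumptions, not Hadoop's API: the function names (`split_into_chunks`, `map_chunk`, `reduce_counts`) and the word-count task are hypothetical choices for this example, and the "workers" are ordinary function calls rather than separate machines.

```python
from collections import Counter


def split_into_chunks(lines, chunk_count):
    """Master: slice the big problem into roughly equal sub-problems."""
    return [lines[i::chunk_count] for i in range(chunk_count)]


def map_chunk(chunk):
    """Worker: count the words in one chunk and hand the result back."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partial_counts):
    """Master: reassemble the partial results into the final answer."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


lines = [
    "big data needs big storage",
    "data nodes store data",
    "the master node assigns tasks",
]

chunks = split_into_chunks(lines, chunk_count=2)
# On a real cluster each chunk would be processed on a different worker.
partials = [map_chunk(chunk) for chunk in chunks]
word_counts = reduce_counts(partials)

print(word_counts["data"])  # 3
print(word_counts["big"])   # 2
```

Hadoop's own MapReduce adds a stage this sketch leaves out: intermediate results are grouped by key between the map and reduce steps, so that reduce work can itself be spread across machines. The fault handling and data placement described in the slides are likewise not modelled here.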
