Introduction to Data Science
Lecture 3: Manipulating Tabular Data
CS 194, Spring 2014
Michael Franklin
Dan Bruckner, Evan Sparks, Shivaram Venkataraman

Outline for this Evening
• Lecture
  – Data Models, Tables, Structure, etc.
  – SQL, NoSQL
  – Schema on Read vs. Schema on Write
  – Non-Tabular Structures
• Exercise
  – Data Manipulation with Pandas (Shivaram: Tutorial and Lab)
• Review of exercise
• About the course
  – Overview of Assignments and rough schedule

Data Science – One Definition

The Big Picture
Extract, Transform, Load

LET’S TALK ABOUT STRUCTURE

The Structure Spectrum
• Structured (schema-first): Relational Database, Formatted Messages
• Semi-Structured (schema-later): Documents, XML, Tagged Text/Media
• Unstructured (schema-never): Plain Text, Media

“When people use the word database, fundamentally what they are saying is that the data should be self-describing and it should have a schema. That’s really all the word database means.”
-- Jim Gray, “The Fourth Paradigm”

Key Concept: Structured Data
A data model is a collection of concepts for describing data.
A schema is a description of a particular collection of data, using a given data model.

The Relational Model*
• The Relational Model is Ubiquitous
  – MySQL, PostgreSQL, Oracle, DB2, SQLServer, …
• Foundational work done at
  – IBM Santa Teresa Labs (now IBM Almaden): “System R”
  – UC Berkeley CS: the “Ingres” System
• Note: some Legacy systems use older models
  – e.g., IBM’s IMS
• Object-oriented concepts have been merged in
  – Early work: POSTGRES research project at Berkeley
  – Informix, IBM DB2, Oracle 8i
• As has support for XML (semi-structured data)
E. F. “Ted” Codd, Turing Award 1981

*Codd, E. F. (1970). "A relational model of data for large shared data banks".
Communications of the ACM 13 (6): 37

Relational Database: Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
  – Schema: specifies name of relation, plus name and type of each column
      Students(sid: string, name: string, login: string, age: integer, gpa: real)
  – Instance: the actual data at a given time
      #rows = cardinality
      #fields = degree / arity

Ex: Instance of Students Relation

    sid    name   login       age  gpa
    53666  Jones  jones@cs    18   3.4
    53688  Smith  smith@eecs  18   3.2
    53650  Smith  smith@math  19   3.8

• Cardinality = 3, arity = 5, all rows distinct
• Do all values in each column of a relation instance have to be unique?

SQL - A language for Relational DBs
• Say: “ess-cue-ell” or “sequel”
  – But spelled “SQL”
• Data Definition Language (DDL)
  – create, modify, delete relations
  – specify constraints
  – administer users, security, etc.
• Data Manipulation Language (DML)
  – Specify queries to find tuples that satisfy criteria
  – add, modify, remove tuples
• The DBMS is responsible for efficient evaluation.

Creating Relations in SQL
• Creates the Students relation.
  – Note: the type (domain) of each field is specified, and enforced by the DBMS whenever tuples are added or modified.

    CREATE TABLE Students
      (sid CHAR(20),
       name CHAR(20),
       login CHAR(10),
       age INTEGER,
       gpa FLOAT)

Table Creation (continued)
• Another example: the Enrolled table holds information about courses students take.

    CREATE TABLE Enrolled
      (sid CHAR(20),
       cid CHAR(20),
       grade CHAR(2))

Adding and Deleting Tuples
• Can insert a single tuple using:

    INSERT INTO Students (sid, name, login, age, gpa)
    VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)

• Can delete all tuples satisfying some condition (e.g., name = Smith):

    DELETE FROM Students S
    WHERE S.name = 'Smith'

Powerful variants of these commands are available; more later!

Queries in SQL
• Single-table queries are straightforward.
• To find all 18-year-old students, we can write:

    SELECT *
    FROM Students S
    WHERE S.age=18

• To find just names and logins, replace the first line:

    SELECT S.name, S.login

Querying Multiple Relations
• Can specify a join over two tables as follows:

    SELECT S.name, E.cid
    FROM Students S, Enrolled E
    WHERE S.sid=E.sid AND E.grade='B'

Enrolled:

    sid    cid          grade
    53831  Carnatic101  C
    53831  Reggae203    B
    53650  Topology112  A
    53666  History105   B

Students:

    sid    name   login     age  gpa
    53666  Jones  jones@cs  18   3.4
    53688  Smith  smith@ee  18   3.2

Note: obviously no referential integrity constraints have been used here.

result =

    S.name  E.cid
    Jones   History105

Basic SQL Query

    SELECT [DISTINCT] target-list
    FROM relation-list
    WHERE qualification

• relation-list: A list of relation names
  – possibly with a range-variable after each name
• target-list: A list of attributes of tables in relation-list
• qualification: Comparisons combined using AND, OR and NOT.
  – Comparisons are Attr op const or Attr1 op Attr2, where op is one of =, ≠, <, >, ≤, ≥
• DISTINCT: optional keyword indicating that the answer should not contain duplicates.
  – In SQL SELECT, the default is that duplicates are not eliminated! (Result is called a “multiset”)

SQL Query Semantics
Semantics of an SQL query are defined in terms of the following conceptual evaluation strategy:
1. do FROM clause: compute cross-product of tables (e.g., Students and Enrolled).
2. do WHERE clause: Check conditions, discard tuples that fail. (i.e., “selection”).
3. do SELECT clause: Delete unwanted fields. (i.e., “projection”).
4. If DISTINCT specified, eliminate duplicate rows.
Probably the least efficient way to compute a query!
  – An optimizer will find more efficient strategies to get the same answer.
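
The conceptual evaluation strategy above can be sketched in plain Python. This is only an illustration of the four steps applied to the join query from the slides, not how a DBMS actually executes it; the function name `conceptual_eval` and the tuple layout are assumptions made for this sketch.

```python
from itertools import product

# Instances from the slides.
# Students(sid, name, login, age, gpa)
students = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
]
# Enrolled(sid, cid, grade)
enrolled = [
    ("53831", "Carnatic101", "C"),
    ("53831", "Reggae203", "B"),
    ("53650", "Topology112", "A"),
    ("53666", "History105", "B"),
]


def conceptual_eval(s_rows, e_rows, distinct=False):
    """Hypothetical helper: evaluate
    SELECT [DISTINCT] S.name, E.cid
    FROM Students S, Enrolled E
    WHERE S.sid = E.sid AND E.grade = 'B'
    by following the conceptual strategy literally."""
    # 1. FROM: cross-product of the two tables.
    crossed = product(s_rows, e_rows)
    # 2. WHERE: keep only the pairs that satisfy the qualification (selection).
    kept = [(s, e) for s, e in crossed if s[0] == e[0] and e[2] == "B"]
    # 3. SELECT: drop the unwanted fields (projection).
    projected = [(s[1], e[1]) for s, e in kept]
    # 4. DISTINCT: remove duplicate rows, preserving first-seen order.
    if distinct:
        projected = list(dict.fromkeys(projected))
    return projected


result = conceptual_eval(students, enrolled)
print(result)  # [('Jones', 'History105')]
```

The cross-product in step 1 builds every Students/Enrolled pair before any filtering happens, which is why the slide calls this the least efficient way to compute the answer.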
Data Model (Tabular)
• SQLite
  – Table: fixed number of named columns of specified type
  – 5 storage classes for columns
      • NULL
      • INTEGER
      • REAL
      • TEXT
      • BLOB
  – Data stored on disk in a single file in row-major order
  – Operations performed via sqlite3 shell

OTHER “TABLE-LIKE” DATA MODELS

Data Model (Tabular)
• Python
  – DataFrame: a dict of Series objects
      • Each Series object represents a column
  – Series: a named, ordered dictionary
      • The keys of the dictionary are the indexes
      • Built on NumPy’s ndarray
      • Values can be any NumPy data type object
  – Data stored in memory
  – Operations performed from Python shell

Operations
• integrate (join), transform, clean, impute
• aggregate: sum, count, average, max, min
• sort
• pivot
• Relational
  – union, intersection, difference, cartesian product (CROSS JOIN)
  – select/filter, project
  – join: natural join (INNER JOIN), theta join, semi-join, etc.
  – rename

Operations
• Summary() (descriptive statistics)
• map()
• Pandas
  – Group By/split-apply-combine (aggregation/transformation)
  – Merge/join
  – Pivot/reshape
  – Sampling
• Add/remove columns (feature enrichment)
• Clone (CTAS)
• Chaining (correlated subqueries)

Data Model (Tabular)
• R
  – data.frame: a list of vector objects
      • Each vector object represents a column
  – Possible vector types
      • logical, integer, double, complex, character, raw
  – Data stored in memory
  – Operations performed from the R shell

What’s Wrong with Tables?
• Too limited in structure?
• Too rigid?
• Too old fashioned?

BEYOND TABLES

Column Family Data Models
• Roots in “Big Table” system at Google
• Used in Cassandra and other Key Value Stores

NoSQL Storage Systems

CouchDB Data Model (JSON)
• “With CouchDB, no schema is enforced, so new document types with new meaning can be safely added alongside the old.”
• A CouchDB document is an object that consists of named fields.
  Field values may be:
  – strings, numbers, dates,
  – ordered lists, associative maps

  Example document fields:

    "Subject": "I like Plankton"
    "Author": "Rusty"
    "PostedDate": "5/23/2006"
    "Tags": ["plankton", "baseball", "decisions"]
    "Body": "I decided today that I don't like baseball. I like plankton."

MongoDB Data Model
“With Mongo, you do less "normalization" than you would perform designing a relational schema because there are no server-side joins. Generally, you will want one database collection for each of your top level objects.”
-- from the MongoDB manual

Dremel Nested Data Model

Schema: Teaching a Pig to Sing?
• “Pig Latin” [Olston et al. SIGMOD 08]
• Why have a schema?
  1) Referential (and other) Consistency
  2) Fast point look ups through indexes
  3) Curation for future (other) users
• But many Big Data Workloads
  – Are Read Mostly/Append Only
  – Scan (not look up) Focused
  – On fairly Transient data sets
  Q: What about Query Optimization?
• Pig (and other NoSQL systems) have a flexible, optional, nested data model
• Data remains in files (no admin)

Pig
• Started at Yahoo! Research
• Runs about 50% of Yahoo!’s jobs
• Features:
  – Expresses sequences of MapReduce jobs
  – Data model: nested “bags” of items
      • Schema is optional
  – Provides relational (SQL) operators (JOIN, GROUP BY, etc)
  – Easy to plug in Java functions

An Example Problem
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
    Load Users
    Load Pages
    Filter by age
    Join on name
    Group on url
    Count clicks
    Order by clicks
    Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduce

In Pig Latin

    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    Top5 = limit Sorted 5;
    store Top5 into 'top5sites';

Translation to MapReduce
Notice how naturally the components of the job translate into Pig Latin.

    Load Users        Users = load …
    Filter by age     Filtered = filter …
    Load Pages        Pages = load …
    Join on name      Joined = join …
    Group on url      Grouped = group …
    Count clicks      Summed = … COUNT()…
    Order by clicks   Sorted = order …
    Take top 5        Top5 = limit …

Translation to MapReduce
Notice how naturally the components of the job translate into Pig Latin.
    Job    Step              Pig Latin
    Job 1  Load Users        Users = load …
    Job 1  Filter by age     Filtered = filter …
    Job 1  Load Pages        Pages = load …
    Job 1  Join on name      Joined = join …
    Job 2  Group on url      Grouped = group …
    Job 2  Count clicks      Summed = … COUNT()…
    Job 3  Order by clicks   Sorted = order …
    Job 3  Take top 5        Top5 = limit …

Hive
• Developed at Facebook
• Relational database built on Hadoop
  – Maintains table schemas
  – SQL-like query language (which can also call Hadoop Streaming scripts)
  – Supports table partitioning, complex data types, sampling, some query optimization
• Used for most Facebook jobs
  – Less than 1% of daily jobs at Facebook use MapReduce directly!!! (SQL, or PIG, wins!)
  – Note: Google also has several SQL-like systems in use.
(Thanks to David DeWitt for the following slides on Hive)

Tables in Hive
• Like a relational DBMS, data stored in tables
• Richer column types than SQL
  – Primitive types: ints, floats, strings, date
  – Complex types: associative arrays, lists, structs

Example:

    CREATE Table Employees (
      Name string,
      Salary integer,
      Children List <Struct <firstName: string, DOB: date>>
    )

(credit: David DeWitt)

Hive Data Storage
• Like a parallel DBMS, Hive tables can be partitioned
• Example data file: Sales(custID, zipCode, date, amount) partitioned by state

Hive DDL:

    Create Table Sales(
      custID INT,
      zipCode STRING,
      date DATE,
      amount FLOAT)
    Partitioned By (state STRING)

HDFS Directory Sales, 1 HDFS file per state:

    Alabama               Alaska                …   Wyoming
    custID  zipCode  …    custID  zipCode  …        custID  zipCode  …
    201     12345    …    13      54321    …        78      99221    …
    105     12345    …    67      54321    …        345     99221    …
    933     12345    …    45      74321    …        821     99221    …

HiveQL Example #1
Sales(custID, zipCode, date, amount) partitioned by state
HDFS Directory Sales: one HDFS file per state (Alabama, Alaska, …, Wyoming)

Query 1: For the last 30 days obtain total sales by zipCode:

    SELECT zipCode, sum(amount)
    FROM Sales
    WHERE getDate()-30 < date < getDate()
    GROUP BY zipCode

Query will be executed against all 50 partitions of Sales

HiveQL Example #2
Sales(custID, zipCode, date, amount) partitioned by state
HDFS Directory Sales: one HDFS file per state (Alabama, Alaska, …, Wyoming)

Query 2: For the last 30 days obtain total sales by zipCode for Alabama:

    SELECT zipCode, sum(amount)
    FROM Sales
    WHERE State = 'Alabama' and getDate()-30 < date < getDate()
    GROUP BY zipCode

Whither Schemas?
DB: Schemas are necessary for correctness, robustness, performance and evolvability
NoSQL: a) Schemas keep me from getting my job done. b) Messy data is reality
Fact: Most of the world’s data is unstructured.

Not “IF” But “WHEN”?
• “Schema on Write”
  – Traditional Approach
• “Schema on Read”
  – Data is simply copied to the file store, no transformation is needed.
  – A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
  – New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.

DATASPACES – WHAT ARE THEY?

What’s in a name?
M. Franklin BNCOD 2009 7 July 2009

Dataspaces
• Inclusive
  – Deal with all the data of interest, in whatever form
• Co-existence not Integration
  – No integrated schema, no single warehouse, no ownership required
• Pay-as-you-go
  – Keyword search is bare minimum.
  – More function and increased consistency as you add work.

Compare to Data Integration
A quintessential schema-first approach.
[Figure: a Mediated Schema linked by Semantic mappings to five wrappers. Courtesy of Alon Halevy]

Whither Structured Data?
• Conventional Wisdom: only 20% of data is structured.
• Decreasing due to:
  – Consumer applications
  – Enterprise search
  – Media applications

But Structure Matters!
Structure enables computers to help users manipulate and maintain the data.
[Figure: Functionality plotted against Time (and cost) for Dataspaces (pay-as-you-go), Structured (schema-first), and Unstructured (schema-less)]

An Alternative View
[Figure: systems placed along two axes, Administrative Control (weak to strong) and Semantic Integration (weak to strong): Virtual Organization, Federated DBMS, DBMS, Web Search, Desktop Search]

A Recent Example
• Hadapt’s “Schemaless SQL”

“Schemaless SQL”
• Schema Evolution
  – adding a column

Next Time
• Data Cleaning and Integration
• But now: time for a Pandas query processing and analytics exercise
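
As a possible warm-up for the exercise, the "top 5 most visited pages by users aged 18-25" dataflow from the Pig Latin example can be sketched in pandas. The column names follow the Pig schemas (name, age) and (user, url); the sample rows are made up for illustration, and this is one reasonable translation rather than the course's reference solution.

```python
import pandas as pd

# Hypothetical sample data standing in for the 'users' and 'pages' files.
users = pd.DataFrame({
    "name": ["amy", "bob", "cat", "dan"],
    "age": [19, 30, 22, 25],
})
pages = pd.DataFrame({
    "user": ["amy", "amy", "bob", "bob", "cat", "cat", "cat", "dan"],
    "url": ["a.com", "b.com", "a.com", "z.com", "a.com", "b.com", "c.com", "a.com"],
})

# Filter by age (Filtered = filter Users by age >= 18 and age <= 25)
filtered = users[(users["age"] >= 18) & (users["age"] <= 25)]

# Join on name (Joined = join Filtered by name, Pages by user)
joined = filtered.merge(pages, left_on="name", right_on="user")

# Group on url and count clicks (Grouped / Summed)
summed = joined.groupby("url").size().reset_index(name="clicks")

# Order by clicks and take the top 5 (Sorted / Top5)
top5 = summed.sort_values("clicks", ascending=False).head(5)

print(top5)
```

With this sample data only three URLs are visited by users in the age range, so `top5` has three rows; bob is 30, so his visits are excluded by the filter before the join.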