Databases on Wrangler
Niall Gaffney, Christopher Jordan, Tomislav Urban & David Walling

Outline
• Quick Introduction to Wrangler (Chris J)
• Database Technology Overview (Chris J)
  – Short break (5-10 minutes)
• Database Options on Wrangler (Chris J)
• Populating and Maintaining Data (David W)
• Wrangler Allocations for Databases (Niall G)

Mechanics
• Will move succinctly through the presentations without pausing for questions
• Leave time for Q&A at the end of each section
• Online chat – TACC staff should be online to either answer questions or hold them for the presenter

Wrangler in 10 Minutes — Niall Gaffney, Christopher Jordan

Acknowledgments
• The Wrangler project is supported by the Division of Advanced Cyberinfrastructure at the National Science Foundation.
  – Award #ACI-1447307, "Wrangler: A Transformational Data Intensive Resource for the Open Science Community"

Project Partners
• Academic partners:
  – TACC: primary system design, deployment, and operations
  – Indiana U.: hosting/operating the replicated system and end-to-end network tuning
  – U. of Chicago: Globus Online integration, high-speed data transfer from user and XSEDE sites
• Vendors: Dell, DSSD (subsidiary of EMC)

Goals for Wrangler
• To address the data problem in multiple dimensions
  – Data at large and small scale, reliable, secure
  – Lots of data types: structured and unstructured
  – Fast, but not just for large files and sequential access; high transaction rates and random access are needed too
• To support a wide range of applications and interfaces
  – Hadoop, but not *just* Hadoop
  – Traditional languages, but also R, GIS, databases, and other, less HPC-style workflows
• To support the full data lifecycle
  – More than scratch
  – Metadata and collection management support

Wrangler Hardware
• Mass Storage Subsystem: 10 PB (replicated)
• IB interconnect: 120 lanes (56 Gb/s), non-blocking
• Access & Analysis System: 96 nodes, 128 GB+ memory, Haswell CPUs, interconnect with 1 TB/s throughput
• High Speed Storage System: 500+ TB, 1 TB/s, 250M+ IOPS
• Three primary subsystems:
  – A 10 PB, replicated disk storage system
  – An embedded analytics capability of several thousand cores
  – A high-speed global file store: 1 TB/s, 250M+ IOPS

Wrangler At Large
• TACC site: Mass Storage Subsystem 10 PB (replicated); IB interconnect 120 lanes (56 Gb/s), non-blocking; Access & Analysis System with 96 nodes, 128 GB+ memory, Haswell CPUs; interconnect with 1 TB/s throughput; High Speed Storage System 500+ TB, 1 TB/s, 250M+ IOPS; 40 Gb/s Ethernet
• Indiana site: Mass Storage Subsystem 10 PB (replicated); Access & Analysis System with 24 nodes, 128 GB+ memory, Haswell CPUs
• 100 Gbps public network, Globus

Analysis Hardware
• The high-speed storage is directly connected to 96 nodes for embedded processing.
  – Each analytics node has 24 Intel Haswell cores, 128 GB of RAM, 40 Gb/s Ethernet, and Mellanox FDR networking.

DSSD Storage
• The flash storage provides the truly "innovative capability" of Wrangler
• Not SSD; a direct-attached PCI interface allows access to the NAND flash
  – Not limited by 40 Gb/s Ethernet or 56 Gb/s IB networking
• Flash storage is not tied to individual nodes
  – Not single PCI storage in a node
• More than half a petabyte of usable storage space once "RAIDed"
• Could handle continuous writes to storage for 5+ years without loss due to memory wear

Wrangler Reservations
• Data motion is too expensive for many data-driven jobs to work within the HPC 48-hour maximum job length and the shared flash storage system
• Many data investigations are interactive data analysis "campaigns", but HPC environments do not allow working on the login nodes
• For this, we introduce the ability to reserve subclusters of Wrangler for data analysis in a cloud-like way
• The allocation is charged, for all nodes reserved, from the start of the reservation until it ends or is canceled
• Jobs run within a reservation are not charged
• Jobs run outside of a reservation are charged for the node hours used

Long Term File Systems
• A standard /home directory for each user
• A mounted global /work file system
  – Your Wrangler work directory is stored in the $WORK environment variable
  – Global work (which is Stampede's $WORK) is stored in $STOCKYARD
• The /data file system is for staging input files for processing, interim result files from computations, and preserving results from computations

Database Technology Overview — Christopher Jordan, Tomislav Urban

Database Options (and more options)
• Many, many database options are now available
• The open source and startup worlds are both producing lots of innovation/development
• We will focus on robust, widely used options
• Expansive internet resources are available for the interested

Data Models
• Relational data is the most common model in current databases
• Other data models (object, graph/network) may be appropriate for specific needs
• We will mostly focus on relational data here

Database Concepts
• Record – a single instance of the data structure; a row in an RDBMS
• Column – all rows of a single data element, e.g. all the last names in an address book
• Table or Relation – a collection of records or rows

Database Concepts 2
• Key – an element that occurs in multiple tables; can be used to link rows across tables
• Index – a secondary data structure used for accelerating access to records, usually on a key
• Query – a command issued to the database
• Constraint – a limitation on a specific data element, e.g. "integers less than 1000"

Structured Query Language
• The SQL standard is a very powerful language for defining and querying structured data
• Appropriate when data is consistent and structured
• Relational Database Management Systems
  – Typically an RDBMS is a SQL engine
  – Postgres and MariaDB/MySQL are supported
  – Others (MonetDB) as requested

What does SQL look like?
• CREATE TABLE tablename (id int, name varchar(80), address text);
• INSERT INTO tablename (id, name, address) VALUES (...);
• SELECT name FROM tablename WHERE id = 6;
• Note that all operations act on one or more "rows"

Transactions and ACID
• Atomicity – each transaction succeeds or fails as a whole
• Consistency – valid state before and after
• Isolation – concurrency doesn't change state
• Durability – changed data stays changed
• In SQL, you can group operations together into a single "transaction" in ACID terms (see the Python sketch below)

Stored Procedures and Triggers
• Complex/multi-part queries may be reused
• One change to data may imply or mandate another change (or something different)
• Stored procedures are like functions; triggers are policies or rules (when X happens, then Y must be done as well)

Relational Data Model

Database Interaction Model
• Almost all SQL engines are client-server systems (SQLite is arguably an exception)
• One or more server processes on one or more nodes, sometimes a "head" node
• Clients run anywhere with network access to the server process
• Interact using standard protocols, SQL, or a database-specific language

Database Access
• Direct clients – command-line or graphical shells allowing client-server interaction
• APIs – libraries for programming languages
• ODBC/JDBC – standard connectors allowing interaction with many different database types
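To make the API access path and the transaction grouping described above concrete, here is a minimal Python sketch using psycopg2 (the same driver that appears later in this presentation). The hostname, credentials, and the accounts table are illustrative placeholders, not a Wrangler-specific recipe:

    import psycopg2

    # Open a client connection to a PostgreSQL server (placeholder credentials).
    conn = psycopg2.connect(host="db1.example.org", dbname="mydb",
                            user="me", password="pw")
    cur = conn.cursor()
    try:
        # Both statements are grouped into one transaction: neither change is
        # visible to other clients until commit() succeeds.
        cur.execute("INSERT INTO accounts (id, balance) VALUES (%s, %s)", (1, 100))
        cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = %s", (1,))
        conn.commit()      # make both changes durable (the D in ACID)
    except Exception:
        conn.rollback()    # atomicity: undo everything if any statement failed
        raise
    finally:
        cur.close()
        conn.close()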
SQL Example Applications
• Web backend – the most common example; backend storage for a web application
  – Can include both application and domain data
• Large tabular dataset storage – SQL used to extract subsets for analysis
• Database as application engine – SQL can be used for many pattern-matching/comparison/data-structure applications

SQL as Application Engine
• OrthoMCL example
  – Grouping of orthologous protein sequences
  – Load the output of the BLAST sequence recognition tool into MySQL
  – SQL is used to compare sequences globally, identify similar sequences, and generate weightings
  – All significant computational effort besides BLAST is done in the database

PostgreSQL
• Mature database with a focus on standards compliance, ACID properties, and reliability
• Single-node, threaded but with limitations
• Supports the widest array of SQL standard operations of the open-source options

MariaDB/MySQL
• MySQL – open source database now "owned" by Oracle and subsequently forked
• MariaDB/Percona/others are all based on MySQL
• Relatively easy to install and use; plug-in architecture for storage/authentication
• Not as ACID/SQL-compliant as Postgres

NoSQL
• Trading one or more SQL characteristics for reliability, performance, simplicity, etc.
• Used when data is semi-structured or has a very particular structure:
  – Columnar data
  – Key-value pairs
  – JSON and XML data

When to Choose a NoSQL Option
• In most cases a traditional RDBMS will be your best choice (maturity, application support)
• If your data/queries don't match the RDBMS model well (e.g. graph/network data)
• If your application depends on a key-value store or another specialized component
• If you need to apply node parallelism

MongoDB
• "NoSQL" – has a Mongo-specific query language
• Document-centered – store and manipulate JSON "documents" rather than rows
• Node-parallel – can "shard" databases across multiple nodes for capacity and performance

MongoDB and JSON
• JSON/BSON as the native data type
• JSON, like XML, is structured but without an inherent schema
• Can define/impose a schema in your application
• Still need to pay attention to data structure
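As a rough illustration of the document model, the sketch below stores and retrieves a JSON-like document with pymongo. It assumes a mongod process is already reachable on the default port, and the database, collection, and field names are made up for the example:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")   # assumes a running mongod
    db = client["demo_db"]

    # Documents are just JSON-like dicts; the server imposes no schema.
    db.specimens.insert_one({"name": "sample-42",
                             "location": {"lat": 40.448, "lon": 11.045},
                             "tags": ["field", "2015"]})

    # Query by matching on (possibly nested) fields.
    doc = db.specimens.find_one({"name": "sample-42"})
    if doc is not None:
        print(doc["location"]["lon"])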
The GIS Software Stack — Tomislav Urban

What is GIS
• GIS stands for Geographic Information System: "A geographic information system (GIS) is a system designed to capture, store, manipulate, analyze, manage, and present all types of spatial or geographical data." (Wikipedia)
• So just as a normal relational database tracks attributes of entities in the form of strings, integers, Booleans, floating point numbers, etc., in columns or fields, a GIS adds the ability to store spatial data in the form of points, lines, or polygons alongside the non-spatial attributes of those entities.

Spatial Data
• Storing spatial data is similar to any other data type. Typically, the GIS will provide a geometry data type that can often be subtyped as a point, line, or polygon. In a database this will be stored in a native way.
• Here is a point: 0101000020E6100000D9A0F226633944401E338C5A0D172640
• The same point can also be shown in WKT (Well-Known Text) format: POINT(40.44833838317 11.0450237556593)

Spatial Data
• Spatial data exists in the real world on Earth, and as such it must also be characterized with an SRID (Spatial Reference System Identifier)
  – SRIDs are numbers that often refer to the codes used by the EPSG (European Petroleum Survey Group)
  – For example, GPS devices typically return readings using EPSG:4326, otherwise known as WGS 84 or World Geodetic System 1984, while Google Maps uses EPSG:3857, referred to as "Web Mercator"
• The SRIDs may refer to non-projected (i.e. geographic coordinate), projected, or local coordinate systems.

Spatial Databases
• Many modern relational database systems provide support for spatial data:
  – PostgreSQL with PostGIS
  – MariaDB/MySQL
  – Microsoft SQL Server
  – Oracle Spatial
  – SpatiaLite
  – H2

Spatial Databases
• The database provides a storage facility for spatial data, but also allows for queries and analysis.
  – Here is a simple query (using PostGIS) that lists the find locations of specimens collected, based on the search locality in which they were found:
    SELECT l.id AS locality_id, o.id AS specimen_id
    FROM drp_occurrence o, drp_locality l
    WHERE ST_Within(o.geom, l.geom)
  – The ST_Within function determines whether one geometry falls within another.

The Stack
• Once your spatial data have been stored in a database, analyzed, and queried, you may wish to make these data available. Here is a typical GIS software stack:
  Desktop Client / Web Client
  OGC-Compliant GIS Server
  Spatially Enabled RDBMS

The Stack
• We've covered a bit about the bottom part of the stack, the database, above.
• Now let's address the server.

The GIS Server
• The OGC (Open Geospatial Consortium) sets standards for web services that allow GIS servers to communicate with clients
• The two most important standards are:
  – WMS – Web Map Service. WMS serves images that are tiled to produce maps. You can think of WMS as the "raster" service, though the underlying data is not typically stored as a raster.
  – WFS – Web Feature Service. WFS serves data, providing the client with the discrete points that make up the geometry. This can be thought of as the "vector" service.

WMS
• Requests come in as parameterized URLs:
  …/wms?LAYERS=stratigraphy%3Ausgs_strata&FORMAT=image%2Fpng&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&SRS=EPSG%3A900913&WIDTH=256&HEIGHT=256
• Responses come back as image tiles (e.g. PNGs)

WMS
• The GIS server will cache pre-computed tiles in order to support performant zoom capabilities for your maps

WFS
• Requests come in as parameterized URLs:
  …/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=stratigraphy:usgs_strata&maxFeatures=50&outputFormat=application%2Fjson
• Responses come back as text (e.g. GeoJSON):
  {"type":"FeatureCollection","totalFeatures":319822,"features":[{"type":"Feature","id":"usgs_strata.309165","geometry":{"type":"MultiPolygon","coordinates":[[[[1.2219492278393215E7,5573973.086145616],[1.2218750755323239E7,5574930.573897161],[1.2218400054957133E7,5574396.582388744],[1.2219140100366896E7,5573980.728155615],[1.2219312228531137E7,5573800.41573675],[1.2219492278393215E7,5573973.086145616]]]]},"geometry_name":"geom","properties":{"url":"http://mrdata.usgs.gov/geology/state/sgmcunit.php...
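A hedged sketch of issuing a WFS GetFeature request like the one above from Python, using the requests library: the query parameters mirror the example URL, but the endpoint is a placeholder that would need to be replaced with a real server.

    import requests

    # Parameters mirror the GetFeature URL shown above.
    params = {
        "service": "WFS",
        "version": "1.0.0",
        "request": "GetFeature",
        "typeName": "stratigraphy:usgs_strata",
        "maxFeatures": 50,
        "outputFormat": "application/json",
    }
    resp = requests.get("https://example.org/geoserver/ows", params=params)  # placeholder endpoint
    features = resp.json()["features"]      # GeoJSON FeatureCollection
    for f in features[:3]:
        print(f["id"], f["geometry"]["type"])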
The GIS Server
• Some OGC-compliant GIS servers:
  – GeoServer (open source, Java)
  – MapServer (open source, C++)
  – ArcGIS Server (commercial, ESRI)

Web Clients
• Web clients that consume OGC services typically take the form of JavaScript libraries such as OpenLayers or Leaflet
• They make consuming and displaying GIS data relatively easy within a web application.
  Desktop Client / Web Client
  OGC-Compliant GIS Server
  Spatially Enabled RDBMS

Web Clients
• This example shows three layers assembled using OpenLayers:
  – A Google Maps base layer showing terrain
  – A partially transparent WMS layer showing stratigraphy
  – A WFS layer showing specific points

Desktop Clients
• Desktop clients provide an interactive interface to spatial data; they allow editing, analysis, and high-quality presentation
• They can connect to databases directly, via OGC services, or with legacy file formats such as shapefiles.
  Desktop Client / Web Client
  OGC-Compliant GIS Server
  Spatially Enabled RDBMS

Desktop Clients
• Some desktop GIS clients:
  – QGIS (open source)
  – ArcMap (commercial, ESRI)

Questions

Databases on Wrangler — Niall Gaffney, Christopher Jordan

Database Usage Modes
• Some "applications" are primarily database queries/processing on temporary datasets
• Persistent databases – data collections/resources, stream processing
• "Transient" databases
  – Database used as a temporary engine for processing
  – SQLite, other node-local options

Databases for Collections
• The database may itself be a resource
• Can be accessed directly or via a web interface
• This is a valid allocation type for Wrangler
• Needs to utilize the "persistent" usage mode
• ECSS may help with creating a collection

Databases for Persistent Services
• The database may be a component of a networked data service (e.g. GIS layers)
• Wrangler persistent databases are appropriate for part or all of such services
• The database component should be a significant factor (e.g. not just user accounts)

Current Persistent DB Support
• MySQL/MariaDB v10
  – Includes native GIS data types
• Postgres v9.4.1
  – PostGIS v2.1.7
• Replication for both coming soon
• Oracle and MonetDB under consideration

Persistent DB Storage
• Persistent databases live on disk
• Flash is volatile, not yet suited for long-term storage
• Flash will be provided as an option
  – Reliability considerations for both users and admins

Persistent Database Provisioning

Persistent Database Options
• Type of DB: Postgres or MariaDB/MySQL
  – This list will grow based on demand
• Database/schema name
• DBA – administrative user
• Note that all persistent databases use TACC authentication and require SSL

Accessing your Persistent DB
• From a Linux system with the Postgres client:
  psql "sslmode=require host=db1.wrangler.tacc.utexas.edu user=<username> dbname=db"
• MySQL/MariaDB:
  mysql --ssl -h db1.wrangler.tacc.utexas.edu -u <username> -p <dbname>

Configuring ODBC/JDBC
• Many tools such as SAS support databases via ODBC/JDBC
• Configuring these is OS- and application-specific, but the options are the same
• Two-step process – configure the data source, then select the data source in the application

Wrangler and NoSQL Databases
• There are many, many NoSQL and SQL database technologies in development
• Too many options to support them all
• Wrangler support will be based on demand
• Express interest in specific technologies by contacting TACC or XSEDE

Transient Databases
• Although most databases require a "server" process, this doesn't have to be long-lived
• All the DB "servers" we have encountered run in user space, just like any other process
• This is also true of many "cluster" databases
• We encourage users to experiment

Running Transient Databases
• DO NOT run on the login node
• You will want a reservation in most cases
• Start an interactive job on one (or more) nodes
• Configure/start the server within the job
• Eventually, we will provide scripts for the most common options (e.g. Postgres, MongoDB)

Using Transient Databases
• Very flexible execution model
• Can run clients directly on a compute node alongside the database server
• Or, can run clients on any XSEDE system, or "in the cloud"
• Details of connection and use will be application-specific

MongoDB Example
• MongoDB doesn't support TACC auth
• Server installed on all compute nodes
• From idev or inside a job script:
  – mongod --config --dbpath /data/<userpath>/mongodb   (or put it in the background with nohup)
  – mongo <dbname>
• Can put the data path on flash as well

Hadoop and Transient Databases
• Some database technologies now run on top of Hadoop/MapReduce
• We have a Hadoop partition!
• Same procedure as for other databases, but create a Hadoop reservation in the user portal
• You should be familiar with Hadoop, though…

Performance Expectations
• Particularly for transient databases, expect at least a 2x-3x improvement over disk-based hosts
• Can be significantly more or less than that, depending on data size and complexity
• The more of the total dataset is being accessed, the better comparative performance you will get

Staging Data In
• You can load your database directly from a desktop/laptop/other cluster
• Performance may be an issue
• Consider staging your data to Wrangler ahead of time
• Consider preparing your data

Staging Data Out
• Important to remember that transient databases are transient
• Data will remain on disk, but flash will be purged, and you need the server process
• You must consider how to retrieve your data before the job/reservation ends
• Use database dump/backup tools? (a sketch follows below)
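One way to do that last step, sketched here under some assumptions, is to call the standard PostgreSQL dump tool from Python (or directly from your job script) before the reservation ends. The hostname, database name, and output path are placeholders, and password-less authentication (e.g. a .pgpass file) is assumed:

    import subprocess

    # Dump a transient PostgreSQL database to a plain-SQL file on the
    # long-term /data file system before the job/reservation ends.
    subprocess.run(
        ["pg_dump", "--host", "localhost", "--username", "me",
         "--file", "/data/myproject/backup.sql", "mydb"],
        check=True,
    )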
Questions?
• Switch to ETL presentation

ETL to Analytics — David Walling ([email protected])

Motivation: Extract -> Transform -> Load
• Extract (SOURCE DATA)
  – External sources: CSV, database, web-scraped data, free text
  – Data is messy
  – Understanding + cleaning it is hard: ~80% of the effort
• Transform (CLEAN)
  – Rename/map fields, e.g. 'TX'::string -> 9::foreign_key
  – Convert text to integer, e.g. 5'10''::string -> 70::int
  – Convert missing values, e.g. 99999 -> NA
• Load (DATA STORE)
  – Often a database; the order of the load is important to comply with integrity constraints
  – CSV, HDFS

ETL Tools
• Manual
  – Just don't!
• Unix tools + bash
  – Glue together all the tried and true Unix tools into a bash script
  – grep, awk, sed, etc.
• Higher-level scripting languages
  – R, Python, Perl, etc.
• GUIs
  – Graphically build a network of sources, transformations, and destinations
  – SSIS: part of the MS SQL Server stack
  – Informatica: I hear it's good, but expensive
  – Pentaho Kettle: open source… yeah!!!

Example
• players.txt
• Information on basketball players, their position, and their grade (1-5) by NBA scouts
• Missing data
• Bad data (score=6 invalid)
• Data format issues (5'11'')

Unix Command Line
• Pipe together simple programs using stdin/stdout to perform powerful manipulation
• Command line vs. bash script
• Often very efficient
• Can be difficult to learn/remember. --help helps

Core Commands
• grep – find files/lines in a file matching *pattern*
• find – find files/dirs with names matching *pattern*
• cat – send all contents to stdout
• less – scrollable file content
• head/tail – show the first/last -n 10 lines of a huge file

Additional Tools
• awk – process delimited files line by line, e.g. extract the 2nd-5th and 37th-52nd columns of an 89-column CSV file
• sed – stream editor, similar to awk, works line by line, e.g. find/replace
• sort – order stdin data by line
• uniq – remove duplicates
• awk/sed – many Q&A online to get what you need

Regular Expressions
• regex – a standard…ish way of expressing complex pattern matching
• Used by MANY other tools, sometimes with slight variation
• Very powerful, but gets complex quickly
• Many resources exist to help "build" your regex, e.g. http://regexr.com/
• At a minimum, get used to using wildcards/escaping, as they are ubiquitous
• Example pattern: /regexr\.com.+foo/g — matches http://regexr.com/foo.html?q=bar, but not http://regexr2.com/foo.html?q=bar, http://espn.com/foo.html, or http://regexr.com/me.html

Scripting Languages
• Higher-level programming languages; do many things
• Support for stdin/stdout; more often data import/export through functions
• More complex cleaning logic; modularize code
• Database lookups
• Write unit tests!!!

Perl
• Pro: fast, can be used almost like a straight Unix stdin/stdout tool
• Con: notoriously hard to read

Python
• Pro: easy to learn, easy to read, access to 3rd-party packages
• Con: a bit more cumbersome to get going
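Following the transform examples from the Motivation slide (height strings to inches, sentinel values to missing), here is a small, hypothetical Python sketch of that kind of row-by-row cleaning. The column layout is invented for illustration and is not the actual players.txt format:

    import csv

    def height_to_inches(text):
        # Convert a height like 5'10'' to total inches (70).
        feet, _, inches = text.partition("'")
        return int(feet) * 12 + int(inches.strip("'") or 0)

    def clean_row(row):
        # Treat the 99999 sentinel and out-of-range grades (>5) as missing.
        grade = None if row["grade"] in ("99999", "6") else int(row["grade"])
        return {"name": row["name"],
                "position": row["position"],
                "height_in": height_to_inches(row["height"]),
                "grade": grade}

    with open("players.txt", newline="") as f:
        cleaned = [clean_row(r) for r in csv.DictReader(f)]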
GUI Based Tools
• Build pipelines via 'task' drag/drop + link
• 'Easy' to learn; many things are done for you
• Easy to understand how data moves through the pipeline
• Often many common tasks are quicker to implement, e.g. form-based setup of database connection information, field mapping, etc.

GUI Based Tools 2
• MS SQL Server Integration Services (SSIS)
  – Part of the great MS stack
  – Requires Windows
  – Not free
• Pentaho Kettle
  – Java-based, easy to install, access to other Java tools
  – Open source: free + 3rd-party contributions
  – Can be kludgey passing params around to 'tasks'

Pentaho Kettle

Cronjobs
• Most ETL scripts need to be run on a periodic basis
• You could execute them manually on demand
• When that is impractical, set up a 'scheduled task' via crontab
• Add notifications to scripts for when things go wrong

Cronjobs and Wrangler
• Cronjobs are supported on Wrangler
• You *cannot* enter cronjobs on the login node
• Talk with us to set up an appropriate pipeline for your data
• http://portal.tacc.utexas.edu or http://portal.xsede.org
• [email protected]

Streaming
• Though rare, it may be the case that your research workflow requires the capture of streaming data that would otherwise be lost
• More common in business/web: aggregating log files from all IT systems to drive quality control and business intelligence
• Open source tools exist to handle these log-heavy workflows
• Examples: Apache Flume, Loggly, LogStash, Splunk
• If needed on Wrangler, talk with us: [email protected]

Database + Analytic Tools
• Databases and SQL are widely supported in analytic tools
• Every package has its own way of connecting to and interacting with the database
• However, they generally follow the same pattern:
  – Manually open a connection
  – Use the connection to send a SQL statement
  – The client packages up the results, usually in an array/list/dataframe
  – Manually close the connection (sometimes done for you)

R Example (RJDBC)
  # Set up the connection
  library(RJDBC)
  drv <- JDBC("com.mysql.jdbc.Driver",
              "/home1/0157/walling/drivers/mysql-connector-java-3.1.14-bin.jar",
              identifier.quote="`")
  conn <- dbConnect(drv,
                    "jdbc:mysql://db1.wrangler.tacc.utexas.edu/schema_name",
                    "user", "pwd")

  # Do stuff
  > dbListTables(conn)
  [1] "columns_priv"  "db"
  [3] "func"          "help_category"
  [5] "help_keyword"  "help_relation"
  [7] "help_topic"    "host"

  data(iris)
  dbWriteTable(conn, "iris", iris, overwrite=TRUE)
  dbGetQuery(conn, "select count(*) from iris")
  d <- dbReadTable(conn, "iris")

Python Example (Postgres)
  import psycopg2

  def connect():
      connection = psycopg2.connect(host='db1.wrangler.tacc.utexas.edu',
                                    dbname='bigDB', user='me', password='pw')
      return connection

  def query(sql):
      conn = connect()
      cursor = conn.cursor()
      cursor.execute(sql)
      return cursor.fetchall()

  sql = """ select distinct user from bigusertable; """
  result = query(sql)

JDBC vs ODBC vs DBI vs …
• Many applications rely on a particular protocol to abstract the type of database being used
• Makes code more portable, allowing you to swap backends if needed
• It is suggested to use one of these instead of a package specific to MariaDB, Postgres, etc. (see the sketch below)
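As an illustration of why such a generic connector layer helps, the sketch below uses SQLAlchemy, where the backend is selected purely by the connection URL, so swapping PostgreSQL for MariaDB/MySQL means changing one string rather than rewriting query code. The URLs and credentials are placeholders, the appropriate driver package (psycopg2 or pymysql) must be installed, and the iris table is the one written in the R example above:

    from sqlalchemy import create_engine, text

    # The same query code works against either backend; only the URL differs.
    pg_engine = create_engine("postgresql+psycopg2://me:pw@db1.example.org/mydb")
    mysql_engine = create_engine("mysql+pymysql://me:pw@db1.example.org/mydb")

    with pg_engine.connect() as conn:
        n = conn.execute(text("SELECT count(*) FROM iris")).scalar()
        print(n)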
WRANGLER ALLOCATIONS

Wrangler Allocation Model
• Projects need to think about two different allocations
  – Compute allocations – time spent computing with Wrangler and using flash storage
  – Storage allocations – storage needed for importing data, storing data between computations, collaborating with data, sharing data, and preserving results

Introducing the Node Hour
• Computations use flash storage, CPU, and core memory
• To simplify allocations and scheduling, we combine these into a "node hour" of allocation
  – 1 node hour = use of 1 node with 2 CPUs (24 cores total) and 128 GB of memory PLUS 4 TB of DSSD flash storage
  – Find the limiting factor for your computations and ask for an allocation accordingly

For Dedicated Databases
• A database has a nominal charge of 1 node hour per day
  – Covers up to a 150 GB database on flash (a larger flash-based database will use more node hours per day)
  – A database not needing high transaction rates can be hosted on the disk system (requires a storage allocation to cover its storage)
  – Any database should be backed up periodically to long-term storage (especially flash-hosted databases, which are not replicated)

For Transient Databases
• Typically one node hour per hour will be sufficient for most transient databases
• Exceptions:
  – Need more than 4 TB of flash storage to stage the input data, extract/transform/load the database, host the database, and extract results
  – Need a multi-node database solution for larger problems (e.g. a sharded Mongo cluster, a Neo4j cluster)

https://portal.xsede.org
• Create an account and log in
• Learn about allocations, in particular startups
• Submit/review requests
• Current opportunities: quarterly allocation periods for full-scale requests, with a guided process

For startups…
• Minimal information is needed: what you are going to be doing and some rationale for why you need Wrangler (e.g. "I need a database")
• Startup requests are typically for 500 to 1000 node hours and ~1 TB of storage
• Turnaround should be within a week (typically a few days, depending on my email backlog)
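As a rough worked example under the charging model above: a dedicated persistent database under 150 GB hosted on flash for a full year would be charged about 1 node hour per day x 365 days ≈ 365 node hours, and a transient database run inside a 10-hour single-node reservation would be charged about 10 node hours, so either fits comfortably within a typical 500-1000 node-hour startup allocation.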