Drilling for Big Data
‘GIS with NoSQL’ Workshop, AGILE 2016, Helsinki, 14 June 2016
Adrian Tear, Department of Geography, University of Portsmouth
[email protected]

Introduction
• Past and Present projects
• Traditional and Modern database systems
  • Tried and Failed…
  • Tried and Worked!
• Demonstrations:
  • Oracle 12c / Tableau 9.3 (local)
  • Oracle 12c / Tableau 9.3 (remote supercomputer / local)
  • MapR 5 / Tableau 9.3 (remote supercomputer / local)
• Summary / Discussion

Past projects
• 1997 UK General Election
• 2000 London Election
• 2001 UK General Election

Present projects
• PhD Researcher, University of Portsmouth
• Honorary Fellow, University of Edinburgh
• GIScience, Big Data, Locative Media

Present projects – PhD Working Title
• Geo-tagging matters! The role and significance of geography in online social network interactions.

Why?
• Past:
  • Long-standing interest in politics
  • Published three major political web sites in the Web 1.0 era
  • Could detect what people were looking at (sort of) from where
  • No User Generated Content or any opinions…
• Present:
  • Still interested in political geography in the Web 2.0 era
  • No longer a publisher, but can consume public social media data
  • Can detect what people are looking at (sometimes) from where
  • Plenty of User Generated Content and many opinions!

What?
• 2012 US Presidential Election campaign and 2014 Scottish Independence Referendum campaign: a lot of data!
  • US2012_GEO – 146,424 records (1:5 sample, geo exists)
  • US2012_NON_GEO – 1,560,967 records (1:50 sample)
  • US2012_NON_GEO_HISP – 11,276 records (1:50 sample, Lang=ES)
  • SCOT2014 – 6,477,713 records
• ~8 million records – 90% Twitter / 10% Facebook
• 0.86–1.44% geo-tagged; 3-month (US) and 12-month (SCOT) data runs
• >20GB raw CSV or JSON; >100GB in an indexed database

How?
• Datasift:
  • Curated Stream Definition Language (CSDL) offers OSN filtering:
    • Twitter Firehose
    • Facebook
    • Others
  • Runs as a task
  • Stores data on Datasift servers (in MongoDB?)
  • Downloadable in CSV/JSON
  • Used to offer a ‘Pay as you go’ pricing model:
    • Unfortunately now only open to Subscribers

Datasift CSDL for the US2012_GEO Stream
  tag "Democratic Party" { interaction.content any "democrat,democrats,democratic,barack obama,obama,joe biden,biden" }
  tag "Republican Party" { interaction.content any "republican,republicans,mitt romney,romney,paul ryan,ryan" }
  tag "Positive" { salience.content.sentiment > 0 }
  tag "Neutral" { salience.content.sentiment == 0 }
  tag "Negative" { salience.content.sentiment < 0 }
  tag "Barack Obama" { interaction.content any "barack obama,barack,obama" }
  tag "Mitt Romney" { interaction.content contains "mitt romney,mitt,romney" }
  return {
    interaction.content any "2012 election,presidential election,US president,president of the united states,republican,republicans,mitt romney,mitt,romney,democrat,democrats,democratic,barack obama,barack,obama"
    AND language.tag == "en"
    AND salience.content.sentiment exists
    AND interaction.geo exists
    AND interaction.sample < 20
  }
• Set up the task and leave it running…

Download – Comma Separated Values (CSV)
• US2012:
  • 3 CSV files, one for each Stream
  • Inconsistent set of fields in each file
• SCOT2014:
  • 3 CSV files for one (bigger) Stream
  • Consistent among themselves but inconsistent with the US2012 files
• Aim:
  • Read into a Relational Database Management System (RDBMS)
  • One table for all records with cross-consistent fields
  • Tagged by Stream (see the sketch below)
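A minimal sketch of that consolidation aim, in Oracle-style SQL; the table and column names here are illustrative and are not the actual Datasift field set:

  CREATE TABLE interactions (
    interaction_id  VARCHAR2(64)  PRIMARY KEY,
    stream_tag      VARCHAR2(30)  NOT NULL,   -- e.g. 'US2012_GEO', 'SCOT2014'
    created_at      TIMESTAMP WITH TIME ZONE,
    content         CLOB,                     -- UTF-8 text, may include emoji
    sentiment       NUMBER,
    geo_lat         NUMBER,
    geo_lon         NUMBER
  );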
Inconsistency in CSV fields
• Tear, A. (2014). SQL or NoSQL? Contrasting Approaches to the Storage, Manipulation and Analysis of Spatio-temporal Online Social Network Data. In B. Murgante, S. Misra, A. M. A. C. Rocha, C. Torre, J. G. Rocha, M. I. Falcão, … O. Gervasi (Eds.), Computational Science and Its Applications – ICCSA 2014: 14th International Conference, Guimarães, Portugal, June 30 – July 3, 2014, Proceedings, Part I (pp. 221–236). Cham: Springer International Publishing. http://doi.org/10.1007/978-3-319-09144-0_16

Download – JavaScript Object Notation (JSON)
• US2012:
  • 3 JSON files, one for each Stream
  • Self-describing, variable numbers of key/value pairs (with nesting); 146 key names
• SCOT2014:
  • 3 JSON files for one (bigger) Stream
  • Self-describing, variable numbers of key/value pairs (with nesting); 411 key names
• Aim:
  • Read into a database system that allows querying, preferably without the Extract/Transform/Load effort required for CSV
  • Two ‘tables’ for the US2012 and SCOT2014 datasets

Traditional/Modern DB – Tried and failed…
• Microsoft SQL Server 2012 (CSV)
  • Could not handle the UTF-8 encoded character set (emoticons etc.)
• PostgreSQL 9.3 beta (JSON)
  • Couldn’t get it to work at the time with my JSON data
• Others:
  • Google BigQuery – loaded CSV OK, but JSON (at the time) required a schema definition for a schema-less dataset; puzzling?
  • IBM InfoSphere BigInsights Quickstart VM – would load smaller JSON files into a Hadoop-under-the-hood ‘sheets’ system, but not the whole dataset
  • Teradata 15 – couldn’t get it to work
  • Apache Drill 0.9 – wouldn’t handle text over 128k characters (Facebook text is wider than Twitter!)

Traditional/Modern DB – Tried and worked!
• Oracle 12c
  • CSV import successful with careful SQLLDR Control File editing
  • JSON successfully imported into a constrained CLOB field (see the sketch below)
  • US2012 dataset (~1.7m records) indexed correctly
  • SCOT2014 dataset (~6.5m records) did not (411 key names, some exceeding the 64-byte name length; fix coming in 12c R2)
• MarkLogic 7
  • JSON import successful (on the US2012 dataset, but had to split 3 LF-delimited files into ~1.7m atomic files at the time)
• Endeca Information Discovery
  • Ingested smaller JSON files but not the whole lot; would ingest large CSV files with processing to alter date formats etc.
• MapR Hadoop ecosystem with Drill
  • Works locally in ‘single Drillbit’ mode on the laptop and recently installed on a 5-node supercomputer; works on the whole dataset, including GZ-compressed JSON
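A minimal sketch of the Oracle 12c route described above, assuming one raw JSON document per row; the table name and JSON paths (borrowed from the CSDL field names) are illustrative rather than the actual schema used:

  -- Store each raw JSON document in a CLOB constrained to hold valid JSON
  CREATE TABLE us2012_json (
    doc CLOB CONSTRAINT ensure_json CHECK (doc IS JSON)
  );

  -- Query nested keys directly with the SQL/JSON functions, with no prior ETL
  SELECT JSON_VALUE(doc, '$.interaction.content')        AS content,
         JSON_VALUE(doc, '$.salience.content.sentiment') AS sentiment
  FROM   us2012_json
  WHERE  JSON_EXISTS(doc, '$.interaction.geo');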
It’s a ‘moveable feast’…
• Software is changing all the time:
  • ‘Big Data’ is a growth area for many vendors
  • New releases bring new/improved functionality
• Some software hasn’t been tried:
  • MongoDB and CouchDB tried briefly; hard to fathom
  • Countless others…
• Inertia:
  • A definite factor; how many systems is it possible/desirable to learn?
• Infrastructure:
  • Largely home-grown on old kit or using Virtual Machines; older physical kit or VMs may not be adequate to run high-end software

SQL wins out?
• Familiar
• Standard(ish)
• Purpose of Drill
• Mainstay of Oracle

Demonstration

Compute environments
• Laptop – Windows 10
  • Dell Latitude 7440
  • Intel i5, 16GB RAM, 128GB (OS) and 512GB (Data) SSDs
  • Running Oracle 12c, Oracle SQL Developer, Tableau 9.3 and Oracle VirtualBox
• Supercomputer – Scientific Linux 6
  • SCIAMA supercomputer; a cluster of 3,702 compute cores (commodity Dell boxes with InfiniBand) used by the Institute of Cosmology and Gravitation, University of Portsmouth
  • MapR Hadoop ecosystem (v5) on 5 nodes (each with 12 cores, 24GB RAM, 2TB SATA disk)
  • Oracle 12c (12.1.0.2) on 1 node (12 cores, 24GB RAM, 2*500GB SSDs)
• Shedputer – Windows Server 2012
  • 2*IBM X3850 M2, quad-core Xeon, 128GB RAM, 4*146GB 10K SAS magnetic and 4*200GB SAS SSD drives (no speed increase; backplane too slow!)
  • 1*Dell PowerEdge 2950, 32GB RAM, 4*73GB 15K SAS magnetic drives – too noisy to use!

Software ‘stack’
• DBMS
  • Oracle 12c and/or
  • MapR/Drill
• Graph analysis
  • Oracle Spatial and Graph and/or
  • Neo4j
• Natural Language Processing
  • GATEcloud
  • AlchemyAPI
• Visualisation
  • Tableau
  • Desktop GIS

Unattainability?
• “Recent decades have witnessed an increased growth in data generated by information, communication, and technological systems, giving birth to the ‘Big Data’ paradigm. Despite the profusion of raw data being captured by social media platforms, Big Data require specialized skills to parse and analyze — and even with the requisite skills, social media data are not readily available to download. Thus, the Big Data paradigm has not produced a coincidental explosion of research opportunities for the typical scholar.”
• Zelenkauskaite, A., & Bucy, E. (2016). A scholarly divide: Social media, Big Data, and unattainable scholarship. First Monday, 21(5). Retrieved from http://ojphi.org/ojs/index.php/fm/article/view/6358

Summary
• Handling text-rich un/semi-structured data is difficult, whatever software vendors would have you believe!
• Fast I/O for large datasets – in exploratory mode – is most useful; use Solid State Disks if you can (Oracle/Tableau: a 4-minute to 40-second speed-up)
• The number of RDBMSs available has narrowed, but there are many (perhaps too many?) commercial/open-source NoSQL databases
• When choosing systems, consider your own experience; e.g. if you think in JavaScript try MongoDB, if you think in SQL try MapR and Drill…
• …Or CSV and Hive… Or CSV and Spark… Or some other combination…

Discussion
• Familiarity/unfamiliarity will inevitably sway your decision
• Once you have a proof of concept, get IT buy-in; you will need support
• The aim is to get results, not to become a prisoner of technology

Resources
• JSON in Oracle Database 12c
• Apache Drill
• MapR/Apache Drill: It’s drilliant to query JSON files from Tableau (see the query sketch below)
• GATEcloud Natural Language Processing
• AlchemyAPI Natural Language Processing
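For reference, a minimal Apache Drill sketch: the downloaded JSON (plain or GZ-compressed) can be queried in place, schema-on-read, with no load step. The dfs path and the key names below are illustrative, not the actual file layout.

  SELECT t.interaction.content        AS content,
         t.interaction.geo.latitude   AS lat,
         t.interaction.geo.longitude  AS lon
  FROM   dfs.`/data/datasift/us2012_geo.json.gz` t
  WHERE  t.interaction.geo.latitude IS NOT NULL
  LIMIT  10;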