Big Data and Official Statistics: Philippine Context
Erniel B. Barrios

Outline
• Concepts and Definitions
• Coverage of Big Data
• Big Data and Official Statistics: Preliminary Framework
• Current Practices (Some Models)
• Possible Big Data in the Philippines
• Next Steps

[Figure: Frequency of Documents Containing Big Data in ProQuest Research Library]

Basis of Definitions
• Stakeholders may define Big Data differently
  • Data storage and data analysis
  • Intertwined technical and socio-technical issues
  • Multiple, ambiguous, and often contradictory definitions
• “Big” => significance, complexity, challenge
• Five V’s
  • Volume (size)
  • Velocity (rate of production)
  • Variety (format, representations)
  • IBM: Veracity (trust and uncertainty)
  • SAS: Variability (complexity)
• Intel: generating a median of 300 TB

Basis of Definitions
• Size: volume of the dataset
• Complexity: structure, behavior, and permutations of the dataset
• Technologies: tools and techniques used to process a sizable or complex dataset

Definitions
• Appropriate description, integration, and sustainability of very large datasets generated by high-throughput experiments
• Large collection of small, disparate, unstructured datasets that, taken together, can be analyzed to find unusual trends
• Emergence of the digital enterprise: the ability of an organization to take full advantage of its digital assets, which collectively form a large amount of data
• Oracle: inclusion of additional data sources to augment current operations
• Microsoft: the process of applying serious computing power (machine learning, AI) to seriously massive and often highly complex sets of information

Definitions
• Big Data describes the storage and analysis of large and/or complex data sets using a series of techniques.
• High-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.
• Describes large volumes of high-velocity, complex, and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of information.
• UNECE: Big Data is data that is difficult to collect, store, or process within the conventional systems of statistical organizations because of their volume, velocity, structure, or variety.

[Figure: Online Survey of 154 Global Executives (April 2012)]

Definition
• Big Data
  • Not only in size (though volume can be part of it)
  • Varying sources, several variables (indicators)
  • Differing data collection methods (compilation)
  • Frequency (possibly irregular)
• Issues
  • Quality
  • Architecture
  • Security/Confidentiality
  • Integrity
  • Standardization
  • Data extraction
  • Data mining

New Data Sources
• Consumer usage database
• Blogs
• Social media
• Sensor networks
• Image data
• May vary in
  • Size
  • Structure
  • Format

Coverage of Big Data
• Basic research data
• Electronic health records
• Consumer usage database
• Proposals submitted
• Administrative data
• Censuses and surveys

Types of Big Data (Classification)
• Social Network: human-sourced information
  • Social networks, blogs, personal documents, pictures, videos, Internet searches, mobile data, user-generated maps, e-mail
• Traditional Business Systems: process-mediated data
  • Produced by public agencies (including medical records)
  • Produced by businesses (commercial transactions, banking/stock records, e-commerce, credit cards)
• Internet: machine-generated data
  • Fixed sensors: home automation, weather/pollution sensors, traffic, scientific, security/surveillance
  • Mobile sensors: mobile phones, cars, satellite images
  • Computer systems: logs, web logs

Current Practices
• Analysis of Traffic Loop Detection Data (Statistics Netherlands)
  • Traffic loop detection data: measurements of traffic intensity
  • Create color-coded maps showing the number of vehicles at each measurement location for each time point.
  • Number of vehicles in various length categories
  • Predictive models need to be developed => estimated aggregates, with variance estimates reflecting the uncertainty of the estimation procedure.
• Analysis of Social Media Messages (Statistics Netherlands)
  • 70% of the Dutch population actively posts messages on social media.
  • Sentiment = Consumer Confidence

Big Data and Official Statistics
• Location data from mobile phones
  • Used for instantaneous daytime population and tourism statistics
  • Proxy indicators for demand
• Social media messages
  • Processed into early indicators of consumer confidence
• Price information on the web, from loyalty cards
  • Inflation level
• Google search
  • Prevalence rate of influenza
• Tweets
  • Stock market prices

Big Data and Official Statistics: Preliminary Framework
[Diagram labels: Businesses; Farms; Surveys; Households; Census; Administrative Reports; Collaboration (PPP); Individuals; Other Big Data; Methodology; Human Resources; Official Statistics, SDG]

Possible Big Data in the Philippines
• From PSA/NGA, LGU
  • Censuses
  • Surveys
  • Administrative reports
    • Regulation, licensing, and compliance
    • Monitoring (e.g., MFO, budgeting, interventions such as 4Ps, RSBSA, etc.)
    • Registers (BIR, COMELEC, UMID, GSIS/SSS, PhilHealth, Pag-IBIG, etc.)
• Private/Commercial
  • Telco
  • Credit cards
  • Loyalty cards
  • POS
  • Images
  • Sensors
  • Social media, Google, etc.

Next Steps
• What is available?
  • Big data sources
  • Data that can be shared, frequency, timeliness
  • Data security, confidentiality issues
  • Big Data and Official Statistics: Is it feasible? Is it worthwhile?
• What is needed for collaboration, data-sharing?

Thank you.