Managing and mining smart meter data – at scale
CSE Project Showcase, 9 July 2013
Twitter: @cse_bristol #SmartMeterData

Contents
- Introduction to the project, the data, and its applications
- Managing SM data at scale
- Getting valuable knowledge out of SM data
- Demo: Smart Meter Analytics, Scaled by Hadoop (SMASH)
- Where next?
- Discussion

Project background
“Generating Value from Smart Electricity Meter Data”
An 18-month TSB-supported collaboration between CSE, the University of Bristol, SSE and Western Power Distribution.
Three themes:
• Managing the data at scale
• Extracting useful knowledge
• Integrating the above in a user-facing application

The data
A half-hourly timeseries for each smart meter / register.
Content: date, time, consumption in the half hour.
For a single register: 17,520 records per year.
[Figure: 18 months of half-hourly data for one register]

The data: EDRP trial vs full rollout
EDRP (18 months):
• 16,250 smart metered households
• 16,250 smart electricity meters
• 9,364 smart gas meters
• 670m half-hourly records (electricity: 420m, gas: 250m)
• 40GB of raw CSV file data
Post-rollout, per year, domestic only:
• 25m smart metered households
• 25m smart electricity meters
• 20m smart gas meters
• 800 billion half-hourly records (electricity: 450bn, gas: 350bn)
• 50TB of raw CSV file data
EDRP is ~0.1% of a year’s domestic data.

What might we use it for?
Improve existing processes:
• Settlement
• Billing, reconciliation, audit
• Demand profiling
• Customer profiling & segmentation
New processes not possible without half-hourly data at scale:
• Localised prediction
• Distribution network planning and modelling
• Automated DSM – prediction and verification
• System state detection
• Individualised consumer energy services

What are the essential processes?
• Ingestion – getting the data into the system
• Storage – keeping it there securely
• Analysis and reporting: ad-hoc queries, transaction reports, descriptives and summaries (e.g. OLAP), mining and modelling, visualisation

Data management & processing
More fundamentally, this means moving data between storage, memory and CPU, and transforming it in the CPU into the desired forms. There are physical constraints on the speed of this, and they are relevant at the scale of smart meter datasets.

Single machine RDBMS
• CPU: ~2.5GHz; memory-to-CPU throughput ~1000 MB/s
• Memory: ~10s of GB per machine; storage-to-memory throughput ~100 MB/s
• Storage: ~1TB per disk
Using SQL Server to sum half-hourly consumption:
• 4bn records: ~1 hour
• 40bn records: ~10 hours
• A year’s worth: ~200 hours
Problem: the throughput of a single machine has not kept up with the growth in the size of datasets.
Solution: harness multiple individual machines (‘horizontal scaling’).
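The scale figures above follow from straightforward arithmetic; a quick back-of-envelope sketch in Python, using only the numbers quoted on the slides:

```python
# Back-of-envelope scale check, using the figures quoted on the slides.
HALF_HOURS_PER_YEAR = 48 * 365          # 17,520 records per register per year

elec_meters = 25_000_000                # post-rollout, domestic only
gas_meters = 20_000_000
records = (elec_meters + gas_meters) * HALF_HOURS_PER_YEAR
print(f"{records / 1e9:.0f}bn half-hourly records per year")  # ~788bn, i.e. the ~800bn quoted

# Time just to *read* a year's raw data once at single-disk speed (~100 MB/s)
raw_bytes = 50e12                       # ~50TB of raw CSV
scan_hours = raw_bytes / 100e6 / 3600
print(f"~{scan_hours:.0f} hours to scan at 100 MB/s")         # ~139 hours
```

Even a single sequential pass over the raw data at disk speed takes days, which is why the ~200-hour SQL Server aggregation figure is about as good as one machine can do.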
Problem: doing this with traditional relational database applications is difficult and expensive.

Solution: move away from traditional databases and use a purpose-designed (‘big data’) framework to get horizontal scaling:

                     Cost     CPU      Memory b/w  Disk b/w  Year of data
  1 machine          ~£10k    2.5GHz   1 GB/s      100 MB/s  ~a week
  10 node cluster    ~£50k    25GHz    10 GB/s     1 GB/s    ~a day
  100 node cluster   ~£300k   250GHz   100 GB/s    10 GB/s   ~an hour

Hadoop
Designed to solve the problem of exponentially growing data volumes (originally, Google’s searchable copy of the web).
Harnesses a large number of commodity machines with low-cost networking and storage.
The software takes a job (query, calculation, whatever) and ‘maps’ it out across the cluster. In parallel, each node locally processes a subset of the problem, before the results are ‘reduced’ back to a single dataset.
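The map/shuffle/reduce pattern just described can be simulated in a few lines. This is a toy in-process sketch of the pattern, not the project’s actual SMASH code, with made-up records for illustration:

```python
from collections import defaultdict

# Toy half-hourly records: (meter_id, date, half_hour_index, kWh).
records = [
    ("m1", "2013-07-09", 36, 0.42),
    ("m2", "2013-07-09", 36, 0.31),
    ("m1", "2013-07-09", 37, 0.55),
]

def map_fn(rec):
    meter, date, hh, kwh = rec
    yield (hh, kwh)                 # key each reading by half-hour of day

def reduce_fn(key, values):
    return key, sum(values)         # total consumption in that half hour

# 'Shuffle': group mapped pairs by key. On a real cluster Hadoop does this
# across machines, with each reducer handling a partition of the keys.
groups = defaultdict(list)
for rec in records:
    for k, v in map_fn(rec):
        groups[k].append(v)

result = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(result)                       # per-half-hour totals across all meters
```

The same map and reduce functions would run unchanged over billions of records; only the framework’s distribution of the work changes.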
(Hence ‘Map/Reduce’.)

Experiments: SQL Server
Single high-performance machine: bottlenecked by the speed of the hard drive.
[Figure: aggregation query performance (rows/second) versus dataset size, up to ~6bn records (~400GB)]

Experiments: Hadoop
11-node physical cluster (~£50k hardware cost).
[Figure: SMASH aggregation query performance (rows/second) versus dataset size, up to ~40bn records (~2,500GB)]

Experiments compared
Not straightforward to get SQL Server to run over ~10bn records.
[Figure: SQL Server and SMASH rows/second versus dataset size, overlaid]

Experiments: growing the cluster
Fixed dataset size of 500m records.
[Figure: SMASH speed in records/second versus cluster size, 1–11 nodes; near-linear scaling, R² = 0.9148]

Hadoop pros
• Open source software – free and customisable
• Adjustable data redundancy (data is replicated over the cluster)
• Incrementally scalable on both performance and cost measures: just add machines, and the system adapts automatically
• Responsive and cooperative developer community

Hadoop cons
• Not the last word in user-friendliness (but this is changing)
• A sledgehammer to crack a nut below a certain scale
• Less mature (but rapidly developing) software ecosystem
• Algorithms must fit the framework

Conclusion: a low-cost option for smart meter data processing.

Data mining and visualisation

Finding value in the data
The same two application areas as before: improving existing processes (settlement; billing, reconciliation and audit; demand profiling; customer profiling & segmentation) and enabling new ones not possible without half-hourly data at scale (localised prediction; distribution network planning and modelling; automated DSM; system state detection; individualised consumer energy services).
A collaborative approach with industry partners to identify business needs, with a focus on:
(1) Datamining for subgroup discovery – classifying end users
(2) Cluster analysis on demand data – finding profiles
(3) Innovative visualisation of consumption data and datamining results

Subgroup discovery
“Pattern features”: 14 variables describing each household
• income, geography, access to gas, size of house, value of house, etc.
“Target features”: describe the behaviour of interest
• profile error: how different is usage from the assigned profile?
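The slides don’t give the exact formula behind the profile-error target feature, so the following is a hypothetical reading, for illustration only: a household’s annual profile error as the total absolute deviation of its half-hourly usage from its assigned standard profile (scaled to the same annual total), expressed as a percentage of annual consumption.

```python
# Hypothetical 'profile error' metric - the slides don't define the exact
# formula, so this is one plausible formulation, not the project's own.

def profile_error_pct(actual, profile):
    """actual: a household's half-hourly kWh readings.
    profile: the assigned standard profile over the same half hours."""
    total = sum(actual)
    scale = total / sum(profile)     # scale profile to the household's total
    deviation = sum(abs(a - p * scale) for a, p in zip(actual, profile))
    return 100.0 * deviation / total

# Toy numbers: a flat consumer measured against an evening-peaking profile.
actual  = [1.0, 1.0, 1.0, 1.0]
profile = [0.5, 0.5, 1.5, 1.5]
print(f"{profile_error_pct(actual, profile):.0f}% annual profile error")  # → 50%
```

Subgroup discovery then looks for household groups (defined by the 14 pattern features) whose mean error under a metric like this differs significantly from the population.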
Outputs:
• groups of households with significantly different profile errors
[Figures: % annual profile error against sociodemographic variables]

Clustering
Can we use demand data to create better profiles?
Define target features: the waveform properties of interest.
Two examples, one using imposed and one using emergent properties, each with 3 clusters:
• E.g. 1: the average weekday represented as 5 pairs of numbers
  [Figure: consumption (not to scale) against time of day, in half hours from midnight]
• E.g. 2: the frequency spectrum of the demand timeseries

Cluster analysis
Project competition results (the University won).
[Figure: average % difference from the cluster centroid, ~0.25–0.35]

Conclusions from datamining
• Subgroup discovery results suggest the approach is useful as long as you have metadata on the households
• Cluster analysis work suggests it is possible to improve on the standard profile classes using SM data
• Further work needs to be carried out on more representative datasets
• There are many other potential applications!

The SMASH application
Installation of Hadoop on UoB and CSE clusters:
• 11-node physical cluster at the University (~£50k)
• 8-node virtual cluster at CSE (~£15k)
Integration of a range of Hadoop-friendly data management components.
Development of a proof-of-concept web application for user interaction, job management, visualisation, etc.
Deployment on both clusters; the web application is currently running on the CSE virtual Hadoop cluster.

Where next?
We have a proof-of-concept system developed with TSB R&D funding support. We have mastered the underlying technologies and established that this approach has the potential to be a low-cost solution to a number of industry data challenges.
On a technical level, the next steps are to:
• further develop the web application
• refine the datamining algorithms (with more data)
• implement selected DM algorithms directly on the cluster
On a policy/programme level, we want to ensure this knowledge is incorporated into SM rollout infrastructure decision-making.

Questions and discussion
@cse_bristol #SmartMeterData
Contacts: Simon Roberts [email protected], Joshua Thumim [email protected]
Web: www.cse.org.uk – sign up to our monthly e-news through our website.
Follow us on Twitter @cse_bristol