Analysis of high-velocity data streams
Saumyadipta Pyne
Professor, Public Health Foundation of India
Remote Associate, Broad Institute of MIT and Harvard University
Convener, CSI SIG-Big Data Analytics

What is Big Data?
"Statisticians are used to developing methodologies for analysis of data collected for a specific purpose in a planned way. Sample surveys and design of experiments are typical examples. Big data refers to massive amounts of very high dimensional and unstructured data which are continuously produced and stored at much cheaper cost than they used to be. High dimensionality combined with large sample size creates issues such as heavy computational cost and algorithmic instability. The massive samples in big data are typically aggregated from multiple sources at different time points using different technologies. This creates issues of heterogeneity, experimental variations, and statistical biases and requires us to develop more adaptive and robust procedures." - Professor C. R. Rao (BiDA 2014)

Big Data Sources

Data Tsunami – does it exist?
IBM estimates 2.7 zettabytes (1 ZB = 10^21 bytes) of data exist in the digital universe today. Ninety percent of the data that exist today were generated in the last two years alone!
Facebook stores, accesses, and analyzes 30+ petabytes (10^15 bytes) of user-generated data; 100 terabytes of data are uploaded to Facebook daily; 30 billion pieces of content are shared on Facebook every month (contrast with 7 billion humans on earth).
Twitter sees roughly 175 million tweets every day.
YouTube users upload 48 hours of new video every minute of the day.
Walmart handles more than 1 million customer transactions every hour, all of which are stored in databases.
Decoding the human genome originally took 10 years; now it can be achieved in one week.
More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
In 2009, Google was processing 20 petabytes a day.
Data production will be 44 times greater in 2020 than it was in 2009.

Unusual characteristics of Big Data
Voluntary data generation – issues of veracity, privacy, and continuous learning of population features (e.g., Google's search-driven training can help to learn subjective concepts and spot early trends)
Unintended usage – data are used for analytics that could be quite different from the original purpose of data generation – ethics?
Cheap to generate – less thought given to the variables measured
Unstructured data – text, pictures, audio, video, click streams
High dimensionality – challenges traditional statistics and data mining (e.g., infinite-dimensional functional data)
Relentless generation – such as satellite or astronomy data streams, socio-economic monitors, medical sensors

Big Data is not too hard to generate – especially if you can stream it out
Design of just one brain EEG experiment: 20 patients and 20 controls, 10 min. of EEG readout, a sampling rate of 200 Hz, and readout from 20 electrodes – 96 million data points!
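A quick back-of-the-envelope check of that count (assuming the 200 Hz sampling rate above), written in the R style used later in these slides:

R> (20 + 20) * (10 * 60) * 200 * 20   # 40 subjects x 600 s x 200 samples/s x 20 electrodes = 96,000,000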
The Problem of Relentless Data Generation
Many organizations today produce an electronic record of essentially every transaction they are involved in. This results in hundreds of millions of records being produced every day. For example, in a single day Walmart records 20 million sales transactions, Google handles 150 million searches, and AT&T produces 270 million call records. Data rates of this level have significant consequences for data mining. A few months' worth of data can easily add up to billions of records, and the entire history of transactions or observations can run to hundreds of billions.

Concerns with mining "as usual"
Current algorithms for mining complex models from data (such as decision trees, sets of association rules, or outlier analysis) cannot mine even a fraction of this data in useful time. Mining a day's worth of data can take more than a day of CPU time, so data accumulate faster than they can be mined. The fraction of the available data that we are able to mine in useful time is rapidly dwindling towards zero. Overcoming this state of affairs requires a shift in our frame of mind from mining databases to mining data streams.

The Problem (contd.)
In the traditional data mining process, data are loaded into a stable, infrequently updated database, and mining it can take weeks or months. There is a tradeoff between accuracy and speed. A stream mining system, in contrast, should be continuously on: it must process records at the speed they arrive and incorporate them into the model it is building, even if it never sees them again.

Big Data Analytics

What is Big Data Analytics?
Analytical abilities that are specifically required to address such characteristics of the data (input) and results (output) as:
Large volume – storage, access
High velocity – data in motion, time-changing processes
Unusual variety – structured, unstructured (text, sound, pictures)
Uncertain veracity – data reliability if not sampled or designed
Issues of security, privacy and ethics – unintended data sources
High dimensionality – curse of dimensionality
Incidental endogeneity – spurious correlations
Finally, the analytics must yield value (e.g., actionable intelligence) in the face of all the above challenges.

The power of Big Data Mining
Narrative Science can not only produce a story from given data by "joining the dots", but can actually conceive a story from data even when no dots are given to it (i.e., when humans are clueless).
PredPol seeks to predict criminal activity patterns before any crime has occurred, which can lead to proactive rather than reactive policing.
Google Hummingbird tries to search based on the context and semantics of an input sentence, and not merely the words in it.
IBM Watson improves on the subjective task of disease diagnosis – but can it do better than human experts, say, on autism spectrum disorders?
Google Flu Trends tries to predict a flu outbreak a week (or more) prior to its detection by aggregation of clinical data (or by syndromic surveillance).

Stream Data Analysis

Background of Stream Data Analysis
Typical statistical and data mining methods (e.g., clustering, regression, classification and frequent pattern mining) work with "static" data sets, meaning that the complete data set is available as a whole to perform all necessary computations on it. Well-known traditional methods like k-means / PAM clustering, linear regression, decision tree based classification and the APRIORI algorithm for finding frequent itemsets generally scan the complete data set repeatedly to produce their results. However, in recent years more and more applications need to work with data which are not static, but are the result of a continuous data generation process ("data in motion") which is likely to evolve (and change) over time.
Some examples are web click-stream data, computer network monitoring data, telecommunication connection data, news RSS feeds, readings from sensor networks (such as in health and environmental monitoring) and stock quotes. These types of data are called data streams, and dealing with data streams has become an increasingly important area of research.

A data stream
A data stream can be formalized as an ordered sequence of data points Y = y1, y2, y3, . . ., where the index reflects the order (either by explicit time stamps or just by an integer reflecting order). The data points themselves can be simple vectors in multidimensional space, but can also contain categorical labels (nominal/ordinal variables), complex information (e.g., graphs) or unstructured information (e.g., text or snapshots). Temporal continuity may be weaker than that of usual time series data (e.g., successive news items may not be related to each other).

Stream characteristics
The characteristic of continually arriving data points introduces an important property of data streams which also poses the greatest challenge: the size of a data stream is potentially unbounded. This leads to the following requirements for data stream processing algorithms:
Bounded storage: the algorithm can only store a very limited amount of data to summarize the data stream.
Single pass: the incoming data points cannot be permanently stored and need to be processed at once in the arriving order.
Real-time: the algorithm has to process data points on average at least as fast as the data are arriving.
Concept drift: the algorithm has to be able to deal with a data generating process which evolves over time (e.g., distributions change or new structure in the data appears).

Design criteria for mining high-speed data streams
A system capable of overcoming this high-velocity problem needs to meet a number of stringent design criteria:
1. It must be able to build a model using at most one scan of the data.
2. It must use only a fixed amount of main memory.
3. It must require small constant time per record.
4. It must make a usable model available at any point in time, as opposed to only when it is done processing the data, since it may never be done processing.
Ideally, it should produce a model that is equivalent to the one that would be obtained by the corresponding ordinary database mining algorithm, operating without the above constraints. When the data-generating phenomenon is changing over time, the model at any time should be up to date.

Stream data resources in R and other platforms

R distributed computing
With the development of Hadoop, distributed computing frameworks for solving large-scale statistical and computational problems have become very popular. HadoopStreaming (Rosenberg 2012) makes it possible to use R map and reduce scripts within the Hadoop framework. HadoopStreaming is used for batch processing; contrary to the word "streaming" in its name, it does not support data streams (as with Hadoop itself, "streaming" here refers only to the internal use of pipelines for streaming the input and output between the Hadoop framework and the R scripts). A distributed framework for real-time computation is Apache Storm, which builds on the idea of constructing a computing topology from spouts (data sources) and bolts (computational units). RStorm (Kaptein 2013) implements a simple, non-distributed version of Storm with multi-language capability.
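As a minimal base-R illustration of the design criteria listed above (single pass, bounded storage, constant time per record, a usable model at any time), here is a sketch of a running mean/variance summary using Welford's updates; the function names are ours and not from any of the packages discussed in these slides.

# Bounded storage: the entire state is three numbers (n, mean, M2),
# regardless of how many points have arrived.
new_summary <- function() list(n = 0, mean = 0, M2 = 0)

# Single pass, constant time per record: each point is seen once and
# folded into the state with O(1) work (Welford's update).
update_summary <- function(s, x) {
  s$n    <- s$n + 1
  delta  <- x - s$mean
  s$mean <- s$mean + delta / s$n
  s$M2   <- s$M2 + delta * (x - s$mean)
  s
}

# A usable model is available at any time: the mean and variance so far.
query_summary <- function(s) {
  c(n = s$n, mean = s$mean,
    var = if (s$n > 1) s$M2 / (s$n - 1) else NA)
}

s <- new_summary()
for (x in rnorm(10000, mean = 5, sd = 2)) s <- update_summary(s, x)
query_summary(s)   # approximately n = 10000, mean = 5, var = 4

Real stream algorithms (clustering, classification) keep a richer state than three numbers, but they operate under exactly the same constraints.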
R: stream data sources
Random numbers are typically created as a stream (see, e.g., rstream (Leydold 2012) and rlecuyer (Sevcikova and Rossini 2012)). Financial data can be obtained via packages like quantmod (Ryan 2013); intra-day prices and trading volumes can be considered a data stream. For Twitter, a popular micro-blogging service, packages like streamR (Barbera 2014) and twitteR (Gentry 2013) provide interfaces to retrieve live Twitter feeds.

The R stream package
The stream framework provides an R-based alternative to MOA which seamlessly integrates with the extensive existing R infrastructure. Since R can interface code written in a whole set of different programming languages (e.g., C/C++, Java, Python), data stream mining algorithms in any of these languages can be easily integrated into stream. The stream framework consists of two main components: Data Stream Data (DSD), which manages or creates a data stream, and Data Stream Task (DST), which performs a data stream mining task.

DSD and DST
We start by creating a DSD object and a DST object. Then the DST object starts receiving data from the DSD object. At any time, we can obtain the current results from the DST object. DSTs can implement any type of data stream mining task (e.g., classification or clustering).

DST (stream data tasks)

Massive Online Analysis (MOA)
"Online" analysis refers to learning that happens upon the entry of each new data point, so it acts in a dynamic "then and there" manner. MOA is a framework implemented in Java for stream classification, regression and clustering (Bifet, Holmes, Kirkby, and Pfahringer 2010). It was the first experimental framework to provide easy access to multiple data stream mining algorithms, as well as tools to generate data streams that can be used to measure and compare the performance of different ML algorithms. Like WEKA (Witten and Frank 2005), a popular collection of machine learning algorithms, MOA is developed by the University of Waikato, and its interface and workflow are similar to those of WEKA. The workflow in MOA consists of three main steps:
1. Select the data stream model (also called data feeds).
2. Select the learning algorithm.
3. Apply the selected evaluation methods to the results.

MOA (cont'd)
Similar to WEKA, MOA uses a very appealing graphical user interface. Classification results are shown as text, while clustering results have a visualization component that shows both the evolution of the clustering (in two dimensions) and various performance metrics over time. MOA is currently the most complete framework for data stream clustering research. MOA's advantages are that it interfaces with WEKA, already provides a set of data stream classification and clustering algorithms, and has a clear Java interface to add new algorithms or use the existing algorithms in other applications. A drawback of MOA and the other frameworks for R users is that, for all but very simple experiments, custom Java code has to be written.

Commercial stream data analytics platforms
IBM InfoSphere Streams
Microsoft StreamInsight (part of MS SQL Server)
SAS DataFlux Event-Stream Processing Engine
Oracle Streams

Apache Storm: streaming meets distributed computing
Apache Storm is a free and open-source distributed realtime computation system. Storm makes it easy to reliably process unbounded sources of stream data (spouts) with processing units (bolts) for realtime processing – as Hadoop did for batch processing. Storm can be used with any programming language.
Storm has many use cases: realtime analytics, online ML, continuous computation, etc. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. Storm is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate. (Ref. https://storm.apache.org/)

Apache Spark (Streaming)
Apache Spark is an in-memory distributed data analysis platform, primarily targeted at speeding up batch analysis jobs, iterative machine learning jobs, interactive queries and graph processing. One of Spark's primary distinctions is its use of RDDs, or Resilient Distributed Datasets. RDDs are great for pipelining parallel operators for computation and are, by definition, immutable, which allows Spark a unique form of fault tolerance based on lineage information. If you are interested in, say, executing a Hadoop MapReduce job much faster, Spark is a great option (although memory requirements must be considered). Since Spark's RDDs are inherently immutable, Spark Streaming implements a method for "batching" incoming updates in user-defined time intervals, which are transformed into their own RDDs; Spark's parallel operators can then perform computations on these RDDs. This is different from Storm, which deals with each event individually.

Storm vs. Spark

Copyright note: this presentation is based on the following papers:
1. P. Domingos and G. Hulten. Mining High-Speed Data Streams. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, ACM Press, pp. 71-80, 2000.
2. G. Hulten. A General Framework for Mining Massive Data Streams (short paper). Journal of Computational and Graphical Statistics, 12, 2003.
3. G. Hulten et al. Mining Time-Changing Data Streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, NY, pp. 97-106, 2001.
4. M. Hahsler et al. Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R. R vignette, http://cran.r-project.org/web/packages/stream

Stream data clustering

Data Stream Clustering
Clustering, the assignment of data points to (typically k) groups such that points within each group are more similar to each other than to points in different groups, is a very basic unsupervised data mining task. For static data sets, methods like k-means, k-medians, PAM, hierarchical clustering and density-based methods, among others, have been developed. Many of these methods are available in tools like R; however, the standard algorithms need access to all data points and typically iterate over the data multiple times. This requirement makes them unsuitable for data streams and led to the development of data stream clustering algorithms. Over the last 10 years many algorithms for clustering data streams have been proposed. Most data stream clustering algorithms that deal with the problem of unbounded stream size and the requirement of real-time processing in a single pass use the two-stage online/offline approach introduced by Aggarwal, Han, Wang, and Yu (2003).

Step 1 (online)
Summarize the data using a set of k′ micro-clusters organized in a space-efficient data structure which also enables fast look-up. Micro-clusters were introduced for CluStream by Aggarwal et al. (2003), based on the idea of cluster features developed for clustering large data sets with the BIRCH algorithm.
Micro-clusters are representatives for sets of similar data points and are created using a single pass over the data (typically in real time as the data stream arrives). Micro-clusters are typically represented by cluster centers and additional statistics such as weight (local density) and dispersion (variance). The individual data points are then discarded – a truly unique situation in statistics. Each new data point is assigned to its closest (in terms of a similarity function) micro-cluster center. Some algorithms use a grid instead, and micro-clusters are represented by non-empty grid cells (e.g., D-Stream by Tu and Chen (2009)). If a new point cannot be assigned to any micro-cluster, then a new micro-cluster is created. The algorithm might also perform some housekeeping (merging or deleting micro-clusters) to keep the number of micro-clusters at a manageable size, or to remove information that has become outdated due to a change in the stream's data generating process.

Step 2 (offline)
When the user or the application requires an overall clustering result, the k′ micro-clusters are re-clustered into k ≪ k′ final clusters, sometimes referred to as macro-clusters. Since the offline part is usually not regarded as time critical, most researchers use a conventional clustering algorithm in which micro-cluster centers are treated as pseudo-points. Typical re-clustering methods involve k-means or the reachability approach introduced by DBSCAN. The algorithms are often modified to also take the weight of the micro-clusters into account.

Generating stream data
R> library("stream")
R> dsd = DSD_Gaussians(k=3, d=4, noise=0.05)
R> dsd
Mixture of Gaussians
Class: DSD_Gaussians, DSD_R, DSD
With 3 clusters in 4 dimensions
R> p <- get_points(dsd, n=100, assignment=TRUE)
R> attr(p, "assignment")
[1] 2 2 2 2 2 2 NA 2 3 2 3 2 3 3 1 1 3 2 3 3 2 1 2
[24] 3 3 3 2 1 3 1 2 NA 3 2 1 1 2 3 3 2 1 2 2 NA 1 2
[47] 3 3 1 1 1 1 2 2 3 3 2 2 1 2 2 1 3 2 NA 3 1 3 3
[70] 3 1 3 3 1 2 3 3 3 1 2 1 3 3 1 1 2 3 1 3 1 1 1
[93] 3 2 3 1 3 1 2 3

Plot a part of the stream
R> plot(dsd, n=500, method="pc")

Clustering stream data in R
R> write_stream(dsd, "data.csv", n=100, sep=",")
R> dsd_file = system.file("examples", "kddcup10000.data.gz", package="stream")
R> # read the file as a stream; in practice the numeric attributes should be
R> # selected with DSD_ReadCSV's take argument, as in the package vignette
R> dsd_scaled = DSD_ScaleStream(DSD_ReadCSV(gzfile(dsd_file)), center=TRUE, scale=TRUE)
R> get_points(dsd_scaled, n=5)
R> dstream = DSC_Kmeans(k=3)
R> cluster(dstream, dsd, 500)
R> dstream
R> plot(dstream, dsd)
R> points(get_centers(dstream), col = "blue")

From micro-clusters to macro-clusters
R> points(kmeans(get_centers(dstream), centers = 3, nstart = 5)$centers, col = "blue")

Concept drift data streams

Google Flu Trends: predicting outbreaks ahead of peaks in clinical visits (CDC records)
Google Flu Trends uses aggregated Google search data to estimate flu activity. A linear model is used to relate the log-odds of an influenza-like illness (ILI) physician visit to the log-odds of an ILI-related search query:
logit(P) = β0 + β1 logit(Q) + ε,
where P is the percentage of ILI physician visits and Q is the ILI-related query fraction computed in previous steps; β0 is the intercept, β1 is the coefficient, and ε is the error term. Over-estimation after 2007?

Concept drift – benchmark datasets
The most popular approach to adapting to concept drift (changes of the data generation process over time) is to use an exponential fading strategy. Micro-cluster weights are faded in every time step by a factor of 2^(-λ), where λ > 0 is a user-specified fading factor.
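To make the micro-cluster bookkeeping and the fading just described concrete, here is a small illustrative sketch written for this document (not taken from the slides or from the internals of the stream package): a single micro-cluster maintained as a BIRCH-style cluster feature (weight, linear sum, squared sum) with exponential fading. All function names and the value of lambda are illustrative.

# One micro-cluster summarized by a cluster feature: weight w,
# linear sum LS and squared sum SS of the points absorbed so far.
new_mc <- function(x) list(w = 1, LS = x, SS = x^2)

# Fade the whole summary by 2^(-lambda) per time step, then absorb new points.
# Centre and dispersion can be recovered from (w, LS, SS) at any time,
# so the raw points themselves never need to be stored.
fade_mc <- function(mc, lambda = 0.01, steps = 1) {
  f <- 2^(-lambda * steps)
  list(w = mc$w * f, LS = mc$LS * f, SS = mc$SS * f)
}
add_point <- function(mc, x) {
  list(w = mc$w + 1, LS = mc$LS + x, SS = mc$SS + x^2)
}
center_mc   <- function(mc) mc$LS / mc$w
variance_mc <- function(mc) mc$SS / mc$w - (mc$LS / mc$w)^2

# Example: a 2-d micro-cluster absorbing a burst of nearby points
mc <- new_mc(c(1.0, 2.0))
for (i in 1:50) {
  mc <- fade_mc(mc, lambda = 0.01)                 # one time step passes
  mc <- add_point(mc, c(1, 2) + rnorm(2, sd = 0.1))
}
center_mc(mc)     # close to (1, 2)
variance_mc(mc)   # small per-dimension dispersion

In a full algorithm, many such summaries are kept, each new point is routed to the nearest one (or spawns a new one), and weak micro-clusters are pruned – exactly the housekeeping described above.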
New data points have more impact on the clustering, and the effect of older points gradually disappears.
R> dsd = DSD_Benchmark(1)

Concept drift in Economics
Macroeconomic forecasts and financial time series are also subjects of data stream mining. The data in these applications drift primarily due to a large number of factors that are not included in the model. The publicly known information about companies forms only a small part of the attributes needed to properly model financial forecasts as a stationary problem. That is why the main source of drift is hidden context.

Concept drift in Finance
Bankruptcy prediction or individual credit scoring is typically considered to be a stationary problem. Again, there is drift due to hidden context: the decisions that need to be made by the system are based on fragmentary information. The need for different models for bankruptcy prediction under different economic conditions has been acknowledged, but the need for models that can deal with nonstationarity has rarely been researched.

Finite mixture models – allow parametric modeling of dynamically evolving concepts (differentiating phenotypes)
In one of the earliest GMM studies, Karl Pearson fit a mixture of two univariate Gaussians to data on Naples crabs using the method of moments to distinguish the species. Although the empirical data histogram is single-peaked, the two constituent Gaussians could be distinguished and their parameters estimated. Finite mixture models are often used for data clustering, also in the multivariate and non-Gaussian cases. (This density plot is due to Peter Macdonald. K. Pearson: Contributions to the Mathematical Theory of Evolution. Philosophical Transactions of the Royal Society of London A, 1894.)

Gaussian Mixture Model – a weighted sum of Gaussians
0.4*N(54.5, 0.6) + 0.6*N(81, 0.7)
Gaussian mixture modeling is a popular parametric approach for probabilistic learning of concepts.

Expectation Maximization (EM)

EM for Gaussian Mixtures

EM model-fitting

Online Incremental Learning
Q1. How to make this EM algorithm "online"? Add the contribution of one point (of the stream) at a time to the model.
Q2. How to capture concept drift? Add the contribution of past points with less weight.
Q3. How to detect deviation in a "concept" distribution? Measure the deviation d(p,q) between the model (Gaussian mixture probability distribution) before (q(x)) and after (p(x)) adding the contribution of the latest point of the stream. If the deviation is more than a certain threshold, be on the lookout for concept drift. (A rough sketch of this scheme follows the skew-distribution slides below.)

Mixture of drifting concepts modeled with a mixture of skew distributions
Machine learning of the population parameters – locations, sizes, covariances, shape, … – via EM-based iterative data modeling. Spatial features of the populations – shape (spherical or oblong?), orientation (vertical or slanted?), volume (tight or diffuse?) – obtained by eigendecomposition of the scale matrix.

The skew normal distribution by Pyne et al. (2009)

The skew t distribution by Pyne et al. (2009)
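The following is a rough illustrative sketch of the online incremental EM idea in Q1-Q3 above for a univariate Gaussian mixture. It was written for this document and is not the presenter's method or the method of Pyne et al.: each arriving point updates exponentially weighted sufficient statistics (so past points count less), and the mixture density before and after the update is compared on a grid to watch for drift. The function names, the forgetting rate gamma and the grid are illustrative choices.

# Mixture state: weights pi, means mu, variances var, plus exponentially
# weighted sufficient statistics s0, s1, s2 (one entry per component).
init_gmm <- function(mu, var, pi) {
  list(pi = pi, mu = mu, var = var,
       s0 = pi, s1 = pi * mu, s2 = pi * (var + mu^2))
}

gmm_density <- function(m, x) {
  rowSums(sapply(seq_along(m$pi),
                 function(k) m$pi[k] * dnorm(x, m$mu[k], sqrt(m$var[k]))))
}

# One online EM step for a single stream point x:
# E-step on x only, then an exponentially forgetful update of the statistics.
online_em_step <- function(m, x, gamma = 0.01) {
  r <- m$pi * dnorm(x, m$mu, sqrt(m$var))
  r <- r / sum(r)                              # responsibilities for x (Q1)
  m$s0 <- (1 - gamma) * m$s0 + gamma * r       # old points fade away (Q2)
  m$s1 <- (1 - gamma) * m$s1 + gamma * r * x
  m$s2 <- (1 - gamma) * m$s2 + gamma * r * x^2
  m$pi  <- m$s0 / sum(m$s0)                    # M-step from the statistics
  m$mu  <- m$s1 / m$s0
  m$var <- pmax(m$s2 / m$s0 - m$mu^2, 1e-6)
  m
}

# Q3: deviation between the mixture density before (q) and after (p)
# absorbing the latest point, evaluated on a fixed grid.
density_deviation <- function(m_old, m_new, grid) {
  mean(abs(gmm_density(m_new, grid) - gmm_density(m_old, grid)))
}

# Example: two components roughly like those on the GMM slide, with the
# second "concept" later drifting towards 85.
m    <- init_gmm(mu = c(54.5, 81), var = c(0.6, 0.7), pi = c(0.4, 0.6))
grid <- seq(40, 100, by = 0.5)
stream_pts <- c(rnorm(300, 54.5, sqrt(0.6)), rnorm(300, 85, 1))
dev <- numeric(0)
for (x in stream_pts) {
  m_old <- m
  m     <- online_em_step(m, x)
  dev   <- c(dev, density_deviation(m_old, m, grid))
}
round(m$mu, 1)   # the second mean has moved from 81 towards 85
# dev spikes around the point where the stream switches concepts -
# the kind of deviation signal that Q3 watches for.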
Stream Data Classification

The classification problem
Classification – learning a model in order to assign labels to new, unlabeled data points – is a well-studied supervised machine learning task. Methods include naive Bayes, k-nearest neighbors, classification trees, support vector machines, rule-based classifiers and many more. However, as with clustering, these algorithms need access to the complete training data several times and thus are not suitable for data streams with constantly arriving new training data.

Stream classification
Several classification methods suitable for data streams have been developed recently. Examples are Very Fast Decision Trees (VFDT) (Domingos and Hulten 2000) using Hoeffding trees, the time window-based Online Information Network (OLIN) (Last 2002), and On-demand Classification (Aggarwal, Han, Wang, and Yu 2004) based on micro-clusters found with the data-stream clustering algorithm CluStream (Aggarwal et al. 2003).

Fast decision trees
Given N training samples (x, y), where y is the label of data point x, we want to learn a model f : X → Y. The goal is to produce a label y' = f(x') for a new test sample x'. Why are new algorithms needed? C4.5, CART, etc. assume the data fit in RAM; SPRINT and SLIQ make multiple disk scans. Hence the goal is to design a decision tree learner for extremely large (potentially infinite) datasets.

Usual decision tree classification

Creating a decision tree

Creating a decision tree for sample classification
A decision tree is constructed by splitting nodes (partitioning the set of samples) on the attribute which leads to "purer" child nodes with the least heterogeneity (impurity / misclassification) of sample labels. We measure the impurity (with entropy or the Gini index) of each child node, take their weighted sum, and subtract it from the impurity of the parent node. We seek the maximum decrease in impurity at each split, so that the leaf nodes are as pure as possible and the classification is clean. We stop when there are too few samples in a node to split, or when a certain tree height is reached.

Can we create a good enough decision tree by training a node with only a small number of samples?
A decision tree selects a data attribute at each node and uses it to split the dataset at that node into two parts. The "goodness" of a split is measured by the resulting decrease in the uncertainty of class membership of the two split subsets while moving down the tree from a parent node to its two child nodes. The Hoeffding bound ensures that it is unlikely that a split at a given node, determined by reading only a sample of a certain size (N points), will be much worse than the split at the same node determined by reading the full dataset, as is usually done when creating a conventional decision tree. The probability that the Hoeffding and conventional tree learners will choose different splits at any given node decreases exponentially with the number of data points.

How does VFDT work?
The VFDT method (Domingos & Hulten, 2000) is based on the principle of the Hoeffding tree. A Hoeffding tree is constructed not with the full (high-volume, high-velocity) stream but with the use of random sampling on static data. The idea is to determine a random sample of sufficient size so that the tree constructed on the sample is approximately the same as that constructed on the entire data set.
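For reference, the bound used by VFDT (Domingos and Hulten 2000) states that for a real-valued quantity with range R, after n independent observations the observed mean differs from the true mean by more than epsilon = sqrt(R^2 * ln(1/delta) / (2n)) with probability at most delta. A small illustrative R sketch of this bound and of the split rule discussed in the next slides (the function and variable names are ours, not from the paper):

# Hoeffding epsilon: with probability 1 - delta, the observed mean of a
# quantity with range R, based on n observations, is within epsilon of
# its true mean (Domingos & Hulten 2000).
hoeffding_epsilon <- function(R, delta, n) sqrt(R^2 * log(1/delta) / (2 * n))

# The bound tightens as more stream points are seen at a node:
hoeffding_epsilon(R = 1, delta = 1e-6, n = c(100, 1000, 10000))
# roughly 0.263, 0.083, 0.026

# Split rule at a node (as in the Hoeffding-tree slides that follow):
# split on the current best attribute once the observed gap between the
# best and second-best heuristic values exceeds epsilon.
# R = 1 here, e.g. information gain with two classes has range log2(2) = 1.
should_split <- function(G_best, G_second, R, delta, n) {
  (G_best - G_second) > hoeffding_epsilon(R, delta, n)
}
should_split(G_best = 0.42, G_second = 0.30, R = 1, delta = 1e-6, n = 1500)
# TRUE: the gap 0.12 exceeds epsilon(n = 1500), which is about 0.068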
The Hoeffding bound is used to ensure that, with high probability, the decision tree built on the subsample makes the same decisions, or "splits", as a usual tree built on the full stream. CVFDT (Hulten et al., 2001) also allows for concept drift.

PAC learning: "probably approximately correct"

PAC learning (cont'd)

Hoeffding bound
That is, given a sample of at least a certain size N, the probability that there will be too much deviation (i.e., beyond a tolerance limit) of the sample mean from its expected value is low (i.e., bounded). The bound gets tighter with larger sample size N (for a fixed tolerance).

Hoeffding bound (re-stated)
Re-stated, the bound yields the sample size needed for decision-making.

Hoeffding tree
Let G(Xi) be the heuristic measure used to choose the best attribute Xi for splitting a node after seeing the full data; for example, G could be entropy-based information gain or the Gini index, as a measure of "goodness of split". The goal is to ensure that, with high probability, the attribute chosen for splitting a node using n examples is the same as the one that would be chosen using all of the (potentially infinite) points of the data stream. Assuming G is to be maximized, let X1 be the attribute with the highest observed value G' and X2 the attribute with the second highest, after seeing n examples, and let ΔG' = G'(X1) – G'(X2) >= 0 be the difference between the observed heuristic values.

Hoeffding tree construction (contd.)
Then, given a desired δ, the Hoeffding bound guarantees that X1 is the correct choice with probability 1 – δ if n examples have been seen at this node and ΔG' > ε. In other words, if the observed ΔG' > ε, then the Hoeffding bound guarantees that the true ΔG >= ΔG' – ε with probability 1 – δ, and therefore that X1 is indeed the best attribute with probability 1 – δ. Thus, a node needs to accumulate examples from the stream only until ε becomes smaller than ΔG'. The node can then be split using the current best attribute, and all succeeding data points in the stream are passed on to train the new leaves.

Stream Anomaly Detection

Anomaly detection / outlier analysis
Clustering-based approaches are commonly adopted for outlier analysis. Applications include:
Fraud detection
Tsunami detection
Breaking news (novelty) detection
System state monitoring to guard against intrusion or other "events"
Early detection of flash crashes to prevent disruption in stock markets

A general area with many open research problems
Anomaly detection in temporal data (Gupta, Gao, Agarwal, 2014)

Stream data applications: Health Analytics and Disease Modeling Lab

Health informatics tools for outbreak detection: crowdsourcing, predictive analytics, SNA, fusion

Various data stream sources incorporated in public health analytics and outbreak detection
Social networks
Census and demographic studies
Environmental monitoring and sensor networks
Electronic health records
Monitoring of lifestyle parameters
Media reports and trends
Environmental and occupational exposures
Spatio-temporal data on civil unrest, riots, war, migration and displacement
This therefore calls for the design of integrative frameworks with hierarchical models that have predictive abilities.

Big Data in Computational Epidemiology

Engaging the BD community
Past few months:
National Big Data Workshop, Hyderabad (Aug. 2014)
ACM BigLS 2014, Newport Beach, CA (Sept. 2014)
IEEE BigData 2014, Washington DC (Oct. 2014)
BigLSW 2014, C-DAC, Bangalore (Dec. 2014)
DST Curriculum Design for Big Data, Hyderabad (Mar. 2015)
Computer Society of India Special Interest Group in Big Data

Thank you for your kind attention
Acknowledgement for funding: MoS&PI, DST, DRDE, DBT
New NIH-funded Health Analytics and Disease Modeling Lab (Sept. 2015), IIPH Hyderabad

Deadline: Sept. 28, 2015