Towards Methods for Systematic Research on Big Data
Manirupa Das, Renhao Cui, David R. Campbell, Gagan Agrawal, Rajiv Ramnath

Outline
• Motivation
• Case Studies
• Characterizing data-driven research
• Methodologies and Processes
• How to make data-driven research more systematic
• Conclusion and Questions

Motivation
• Big Data is characterized by the five V's: Volume, Velocity, Variety, Veracity and Value.
• Research on Big Data, the practice of gaining insights from it, challenges the intellectual, process, and computational limits of an enterprise.
• Analytic demands are exploratory and often ad hoc in nature.
• There is a distinct lack of established processes and methodologies, which makes it difficult for Big Data teams to set expectations or even create valid project plans.

[Chart: Digital business ranks high on the agenda for many companies. Source: McKinsey Global Survey Results 2012]

[Chart: Executives do want companies to focus on generating customer insights. Source: McKinsey Global Survey Results 2012]

Motivation
• "Data science" is the science of extracting "actionable knowledge", usually from "Big Data": large volumes of data generated by systems, sensors or devices, or by the personal and social digital traces of information from people.
• While database querying asks, "What data satisfy this pattern (query)?", data-driven discovery asks, "What patterns satisfy this data?"
• Data-driven analytics pipelines thus often comprise the following activities: (i) Descriptive Analytics (What happened?), (ii) Diagnostic Analytics (Why did it happen?), (iii) Predictive Analytics (What will happen?), and (iv) Prescriptive Analytics (How can we make a desired effect happen?).

Motivation
• Traditional database-focused methods are optimized for fast access and summarization of data, given that the user knows what to ask (the query).
• But what if we do not know exactly what to ask of the data?
• Software applications for data science are very different from traditional database systems: they provide probabilistic answers, and their hardware architectures are designed for exploration at scale. (A minimal sketch of the querying-versus-discovery contrast follows below.)
• We present our perspectives on characterizing data-driven research; tools, methodologies and processes to make data-driven research more systematic; and exemplars of several projects that use large, heterogeneous, complex data sets and provide ad-hoc tools for querying a dataset to answer a larger business or research question.
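To make the querying-versus-discovery contrast concrete, here is a minimal, hypothetical sketch (not from the original slides): the same table answered first with a fixed query and then with an unsupervised pattern search. The column names and data are invented for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical user-activity table; column names and values are invented.
df = pd.DataFrame({
    "user_id":  [1, 2, 3, 4, 5, 6],
    "visits":   [30, 2, 28, 1, 25, 3],
    "avg_mins": [12.0, 1.5, 10.0, 0.5, 11.0, 2.0],
})

# Database-style question: "What data satisfy this pattern (query)?"
engaged = df[(df["visits"] > 20) & (df["avg_mins"] > 5)]
print(engaged["user_id"].tolist())  # rows matching a pattern we knew to ask for

# Discovery-style question: "What patterns satisfy this data?"
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["segment"] = kmeans.fit_predict(df[["visits", "avg_mins"]])
print(df.groupby("segment")[["visits", "avg_mins"]].mean())  # segments we did not specify up front
```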
Case Studies
A. Maximum Entropy Churn Prediction Using Topic Models
B. Mining Emotion-Word Correlations in a Large Blog Corpus
C. Using Latent Semantic Analysis to Identify Successful Bloggers
D. Brand-specific tweet classification with user-provided topics

Project and Characteristics
A – Churn Prediction (Media, Publishing)
B – Mining Emotion-Word Correlations (Data Mining, Computational Linguistics)
C – LSA to identify successful bloggers (Artificial Intelligence)
D – Brand-specific tweet classification (Computational Linguistics, Market Research)

Characteristic                               | A   | B   | C   | D
---------------------------------------------|-----|-----|-----|----
Mostly structured data                       | Yes | No  | No  | No
Mostly unstructured data                     | Yes | Yes | Yes | Yes
Hypothesis testing                           | Yes | No  | Yes | No
Hypothesis generation                        | Yes | Yes | Yes | No
Internet-based                               | Yes | Yes | Yes | Yes
Scale                                        | TB  | GB  | GB  | TB
Distributed elements                         | Yes | Yes | Yes | Yes
Computationally intensive data preparation   | Yes | Yes | Yes | Yes
Computationally intensive execution          | Yes | Yes | Yes | Yes
In-memory execution                          | No  | Yes | No  | No
Parallelizable code                          | Yes | Yes | Yes | Yes
Non-traditional analysis                     | LDA-based topic modeling | Association rule mining | Latent semantic analysis | LLDA-based topic modeling
Ad-hoc data product                          | Yes | Yes | Yes | No

Characterizing data-driven research
• Typical research starts from a pre-determined goal, then collects data and validates and builds models to achieve the goal.
• A data-driven research project, by contrast, starts from the data and tries to reveal the patterns or information stored within it.
• Data-driven research is atypical: there is no clear purpose or outcome at the outset; these evolve in an iterative fashion.
• We highlight certain key considerations that characterize the primary research activities.

Key considerations in primary research activities
A. Clarity about purpose
B. Methods considerations
C. Type and availability of data
D. Type of experiments
E. Type of analytics
F. Infrastructure/system considerations

Methodologies
• Agile: a group of software development methods in which solutions evolve through collaboration between self-organizing, cross-functional teams.
• Agile promotes adaptive planning, evolutionary development, early delivery, and continuous improvement, and encourages rapid and flexible response to change.

Key characteristic criteria for Agile Analytics
1. Iterative, incremental, evolutionary
2. Value-driven development
3. Production quality
4. Barely sufficient processes
5. Automation, automation, automation
6. Collaboration
7. Self-organizing, self-managing teams

Challenges in data-driven research
1. Informal use of language and the creation of new words
2. Noise and redundant information
3. Inadaptable methodology
4. High-frequency data generation

How to make data-driven research more systematic
Agile analytic steps for data-driven planning and execution of the research pipeline (a code skeleton for these steps is sketched below):
1. Information extraction and cleaning
2. Preliminary data analysis
3. Research goal or hypothesis generation
4. Research data design
5. Model and feature selection
6. Output evaluation
7. Visualization
8. Iterate in value-driven chunks (Agile)

• A generalized dataset tries to expand the chance that a single data-extraction step can work for multiple projects and purposes. This does not necessarily mean increasing the volume of the data, but rather enlarging its coverage.
• Standardized data processing focuses on common processing steps that can be abstracted to make them reusable across multiple projects.

[Diagram: A Process for Systematic Data-driven Research]
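As a minimal sketch, not taken from the slides, of how the eight Agile analytic steps above might be organized in code, here is a hypothetical pipeline skeleton; every class and function name is invented for illustration, and each run would be repeated in value-driven iterations.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical skeleton for the eight Agile analytic steps; all names invented.
@dataclass
class ResearchPipeline:
    steps: list = field(default_factory=list)

    def step(self, name: str):
        """Register a pipeline stage under a descriptive name."""
        def register(fn: Callable[[Any], Any]):
            self.steps.append((name, fn))
            return fn
        return register

    def run_iteration(self, data: Any) -> Any:
        """One value-driven Agile iteration: run each stage in order."""
        for name, fn in self.steps:
            print(f"running: {name}")
            data = fn(data)
        return data

pipeline = ResearchPipeline()

@pipeline.step("1. information extraction and cleaning")
def clean(raw):
    return [r.strip().lower() for r in raw if r.strip()]

@pipeline.step("2. preliminary data analysis")
def explore(docs):
    print(f"  {len(docs)} documents after cleaning")
    return docs

# Steps 3-7 (hypothesis generation, data design, model and feature
# selection, evaluation, visualization) would be registered the same way.

result = pipeline.run_iteration([" Some RAW text ", "", "another document"])
```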
Thoughts
We believe that in today's competitive research and business landscape, it is the heterogeneity (variety), the speed at which data is being generated (velocity), and the inconsistency and incompleteness (veracity) of the data that are the most cross-cutting aspects of Big Data, touching organizations of nearly every type and size. We thus provide a fairly comprehensive overview of research methods and key considerations in characterizing data-driven research, drawing from our experiences in conducting data-driven research projects, including the types of available data and experiments. Using these, we recommend a process for performing systematic research on Big Data, akin to the Agile methodology for software development.

Conclusions
• While Big Data technologies developed mostly out of the large web companies as strategies for processing voluminous data, every organization with data, with or without a global user base, needs to run efficiently and provide the best possible service to its customers.
• This can be done by harnessing systematic processes to better channel data, in order to provide data-driven value, the most important of the V's.

Questions
• How can Big Data serve to support studies designed to perform causal inference, given that the two have opposing starting and ending points?
• How does Agile fit into the different stages of data-driven research?
• Can Big Data be part of a new mixed-methods approach, where we try to find the individual stories that support the data?

The Hidden Biases in Big Data
"The business and science worlds are focused on how large [quantitative] datasets can give insight on previously intractable challenges. The hype becomes problematic when it leads to 'data fundamentalism': the notion that correlation always indicates causation, and that massive datasets and predictive analytics always reflect objective truth. Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves." – Kate Crawford (2013, HBR)

Thank you

Key considerations
A. Clarity about purpose
1. Basic research: contributes to fundamental knowledge and theory
2. Applied research: illuminates a societal concern or problem in the search for solutions
3. Summative evaluation: determines whether a solution (policy or program) works
4. Formative evaluation: improves a policy or program as it is being implemented
5. Action research: understands and solves a problem as quickly as possible

Key considerations
B. Methods considerations: qualitative and quantitative approaches and outcomes
"The key to making good forecasts is weighing quantitative and qualitative information appropriately" – Nate Silver
• Qualitative: for when we want to explore a problem in depth.
• Quantitative: well suited to the testing of theories and hypotheses.
C. Type and availability of data
• Heterogeneous and complex data
• Data ownership and distribution

Key considerations
D. Type of experiments
• Field experiments (in natural conditions, e.g., in space) and laboratory experiments (in artificial conditions)
• Qualitative and quantitative experiments; computer simulation experiments
• Retrospection: a review of past events
• Forecasting: a scientific study of the concrete development prospects of an object
E. Type of analytics
a. Predictive tasks: classification, regression, recommendation
b. Descriptive tasks: cluster analysis, anomaly detection, association analysis (used to discover patterns that describe strongly associated features)
F. Infrastructure/system considerations: scalability, high dimensionality

Case Studies
A. Maximum Entropy Churn Prediction Using Topic Models
• 1.2 million subscribers, 1.5 years' worth of news/blog data from 13 websites, and 3.4 TB of server logs.
• Explores structured and unstructured data to build predictive models of customer churn, using features mined from transactional databases and Web-based textual data, to determine which factors most affect user engagement.
• Uses a unique dataset normalization and modeling approach to carve a future timeframe out of the present data for prediction.
• Topic and metadata features reveal engagement patterns. (A sketch of the named technique follows below.)
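As a hedged sketch of the technique Case Study A names (maximum-entropy classification, equivalently multinomial logistic regression, over topic-model features), here is a minimal, hypothetical pipeline using scikit-learn; the toy documents and labels are invented, and the project's actual normalization and future-timeframe carving are not reproduced.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for subscriber documents and churn labels (invented data).
docs = [
    "sports scores playoff highlights game recap",
    "election coverage senate vote policy debate",
    "account cancel subscription refund billing issue",
    "playoff finals championship game tonight",
    "cancel my plan too expensive refund please",
    "policy analysis congress budget negotiations",
]
churned = [0, 0, 1, 0, 1, 0]  # 1 = subscriber later churned

# Step 1: mine topic features from the unstructured text with LDA.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_features = lda.fit_transform(counts)  # per-document topic proportions

# Step 2: maximum-entropy classification over the topic features;
# metadata features from the server logs would be concatenated here.
clf = LogisticRegression().fit(topic_features, churned)
print(clf.predict_proba(topic_features)[:, 1])  # churn probabilities
```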
Case Studies
B. Mining Emotion-Word Correlations in a Large Blog Corpus
• Spinn3r dataset for the ICWSM 2009 data challenge: 44 million blog posts spanning 62 days, covering big news events such as the 2008 Olympics, both 2008 US presidential nominating conventions, and the beginnings of the financial crisis. Total size: 142 GB uncompressed.
• An exploratory study that tries to determine whether the words people choose correlate well with categories drawn from a basic theory of emotion. (A hedged sketch of the association-rule mining step appears after Case Study D below.)
• If successful, this information can be used to better predict how blog entries might cluster based on emotion, leading to improved models of information retrieval for blogs.

Case Studies
C. Using Latent Semantic Analysis to Identify Successful Bloggers
• Spinn3r dataset.
• In this work, we hypothesized that there may exist characteristics of language use by informal writers, such as vocabulary or word choice, that are directly associated with successful communication. Specifically, we hypothesized a relationship between the vocabulary of a blog and its comment density.
• We used latent semantic analysis (LSA) to reduce the dimensionality of a term-document matrix for each blog in a collection (where a blog is a concatenated set of blog entries). (A sketch of this step also appears below.)
• Two experiments were performed: (i) an unsupervised clustering approach, to see whether relationships to comment density naturally emerge, and (ii) a supervised classification method to identify high and low comment-density blogs, using two complementary models built through LSA.

Case Studies
D. Brand-specific tweet classification with user-provided topics
• Many companies take feedback on their products from Twitter, from which a large number of tweets are collected and analyzed for market research.
• Users define rules containing keywords and simple logic to label some tweets into certain bins of interest. (A sketch of this rule-based seeding stage follows below.)
• Starting from the brand-specific data and the simple keyword-based logic rules, we build a system that is able to label many more tweets with a certain confidence level, using an LLDA (Labeled LDA) model.
• We collect the mentioning tweets for 5 brands and build a topic model for each, predicting tweets for each bin with high precision (~85-90%), which is notable considering the size of the data.
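For Case Study B, which names association rule mining, a minimal sketch of mining emotion-word co-occurrence rules might look like the following; the emotion categories and example posts are invented, and mlxtend is one possible library choice, not necessarily the one used in the original project.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Invented toy blog posts, each reduced to an emotion category plus the
# content words it contains (stand-ins for a basic theory of emotion).
transactions = [
    {"joy", "wedding", "celebrate"},
    {"joy", "celebrate", "olympics"},
    {"fear", "crisis", "market"},
    {"fear", "crisis", "bank"},
    {"anger", "election", "debate"},
]
items = sorted(set().union(*transactions))
onehot = pd.DataFrame([[i in t for i in items] for t in transactions], columns=items)

# Mine frequent itemsets, then rules linking words to emotion labels.
frequent = apriori(onehot, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```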
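For Case Study C, a minimal, hypothetical LSA sketch: build a TF-IDF term-document matrix, reduce its dimensionality with truncated SVD, then cluster without supervision as in the first experiment. The toy blogs and the number of dimensions are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Each "blog" is a concatenated set of entries (toy stand-ins).
blogs = [
    "recipes baking flour sugar oven recipes dinner",
    "baking bread yeast oven kitchen dinner recipes",
    "politics senate vote policy election debate",
    "election polling senate policy campaign debate",
]

# LSA: TF-IDF term-document matrix reduced via truncated SVD.
tfidf = TfidfVectorizer().fit_transform(blogs)
lsa = TruncatedSVD(n_components=2, random_state=0)
embedding = lsa.fit_transform(tfidf)

# Experiment (i): unsupervised clustering, to be compared afterwards
# against each blog's comment density.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(clusters)
```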
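For Case Study D, the user-provided keyword-and-simple-logic rules that seed the system could look like the sketch below. Labeled LDA itself is not available in scikit-learn, so this shows only the rule-based seeding stage; the brands, bins, rules, and tweets are all invented for illustration.

```python
import re

# Hypothetical user-provided rules: each bin of interest is defined by
# keywords plus simple logic (here: ANY keyword matches, minus exclusions).
RULES = {
    "complaint": {"any": ["broken", "refund", "terrible"], "not": ["not terrible"]},
    "praise":    {"any": ["love", "great", "awesome"],     "not": []},
}

def seed_label(tweet: str):
    """Return a bin label if a rule fires, else None (left for the LLDA stage)."""
    text = tweet.lower()
    for label, rule in RULES.items():
        if any(re.search(rf"\b{re.escape(k)}\b", text) for k in rule["any"]) \
           and not any(phrase in text for phrase in rule["not"]):
            return label
    return None

tweets = [
    "I love my new AcmePhone, the camera is great",
    "AcmePhone screen arrived broken, want a refund",
    "just got an AcmePhone today",
]
labeled = [(t, seed_label(t)) for t in tweets]
print(labeled)  # unlabeled tweets would be passed on to the Labeled LDA model
```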