Towards Methods for Systematic Research On Big Data
Manirupa Das, Renhao Cui, David R. Campbell, Gagan Agrawal, Rajiv Ramnath
• Motivation
• Case Studies
• Characterizing data-driven research
• Methodologies and Processes
• How to make data-driven research more systematic
• Conclusion and Questions
Motivation
• Big Data is characterized by five V's: Volume, Velocity, Variety, Veracity and Value.
• Research on Big Data, the practice of gaining insights from such data:
  • challenges the intellectual, process, and computational limits of an enterprise
  • involves analytic demands that are exploratory and often ad hoc in nature
• There is a distinct lack of established processes and methodologies, making it difficult for Big Data teams to set expectations or even create valid project plans.
Digital business ranks high on the agenda for many companies [Source: McKinsey Global Survey Results 2012]
Executives do want companies to focus on generating customer insights [Source: McKinsey Global Survey Results 2012]
Motivation
• "Data science" is the science of extracting "actionable knowledge", usually from "Big Data": large volumes of data generated by systems, sensors or devices, or the personal and social digital traces of information from people.
• While database querying asks, "What data satisfy this pattern (query)?", data-driven discovery asks, "What patterns satisfy this data?" (see the sketch below)
• Data-driven analytics pipelines thus often comprise the following activities: (i) Descriptive Analytics (What happened?), (ii) Diagnostic Analytics (Why did it happen?), (iii) Predictive Analytics (What will happen?) and (iv) Prescriptive Analytics (How can we make a desired effect happen?).
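To make the query-versus-discovery contrast concrete, here is a minimal sketch (not from the original slides): a fixed SQL query answers a pattern the analyst already knows to ask, while k-means proposes groupings the analyst did not specify in advance. The purchases table, its columns, and the cluster count are invented for illustration.

```python
# Contrast: "what data satisfy this pattern?" vs. "what patterns satisfy this data?"
# Illustrative only -- the purchases table, its columns, and k=2 are hypothetical.
import sqlite3
import numpy as np
from sklearn.cluster import KMeans

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INT, amount REAL, num_visits INT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, 20.0, 3), (2, 250.0, 1), (3, 15.0, 4), (4, 300.0, 2), (5, 22.0, 5)],
)

# Database querying: the analyst already knows the pattern to ask for.
big_spenders = conn.execute(
    "SELECT customer_id FROM purchases WHERE amount > 100"
).fetchall()

# Data-driven discovery: the algorithm proposes groupings the analyst did not specify.
X = np.array(conn.execute("SELECT amount, num_visits FROM purchases").fetchall())
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Query answer:", big_spenders)
print("Discovered groupings:", clusters)
```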
Motivation
• Traditional database-focused methods are optimized for fast access and summarization of data, given what the user already knows to ask (the query).
• But what if we don't know exactly what to ask of the data?
• Software applications for data science are very different from traditional database systems: they provide probabilistic answers, and their hardware architectures are designed for exploration at scale.
• We present our perspectives on characterizing data-driven research:
  • tools, methodologies and processes to make data-driven research more systematic
  • exemplars of several projects using large, heterogeneous, complex data sets
  • providing ad-hoc tools to query a dataset to answer a larger business or research question
Case Studies
A. Maximum Entropy Churn Prediction Using Topic Models
B. Mining Emotion-Word Correlations in a Large Blog Corpus
C. Using Latent Semantic Analysis to Identify Successful Bloggers
D. Brand-specific tweet classification with user-provided topics
Project and Characteristics
Projects: A – Churn Prediction; B – Mining Emotion-Word Correlations; C – LSA to identify successful bloggers; D – Brand-specific tweet classification

Scientific Discipline / Industrial Domain:
  A – Media, Publishing
  B – Data Mining, Computational Linguistics
  C – Artificial Intelligence, Computational Linguistics
  D – Market Research

Characteristic              A     B     C     D
Mostly structured data      Yes   No    No    No
Mostly unstructured data    Yes   Yes   Yes   Yes
Hypothesis testing          Yes   No    Yes   No
Hypothesis generation       Yes   Yes   Yes   No
Internet-based              Yes   Yes   Yes   Yes
Scale                       TB    GB    GB    TB
Project and Characteristics (contd.)
Projects: A – Churn Prediction; B – Mining Emotion-Word Correlations; C – LSA to identify successful bloggers; D – Brand-specific tweet classification

Characteristic                               A     B     C     D
Distributed elements                         Yes   Yes   Yes   Yes
Computationally intensive data preparation   Yes   Yes   Yes   Yes
Computationally intensive execution          Yes   Yes   Yes   Yes
In-memory execution                          No    Yes   No    No
Parallelizable code                          Yes   Yes   Yes   Yes
Ad-hoc data product                          Yes   Yes   Yes   No

Non-traditional analysis:
  A – LDA-based topic modeling
  B – Association rule mining
  C – Latent semantic analysis
  D – LLDA-based topic modeling
Characterizing data-driven research
• Typical research starts from a pre-determined goal, then collects and validates data and builds models to achieve that goal.
• A data-driven research project, in contrast, starts from the data and tries to reveal the patterns or information stored within it.
• Data-driven research is thus atypical: there is no clear purpose or outcome at the outset; instead, these evolve in an iterative fashion.
• We highlight certain key considerations that characterize the primary research activities.
Key considerations in primary research activities
A. Clarity About Purpose
B. Methods Considerations
C. Type and Availability of Data
D. Type of Experiments
E. Type of Analytics
F. Infrastructure/System Considerations
Methodologies
Agile: a group of software development methods in which solutions evolve through collaboration between self-organizing, cross-functional teams. It promotes adaptive planning, evolutionary development, early delivery and continuous improvement, and encourages rapid and flexible response to change.

Key characteristic criteria for Agile Analytics:
1. Iterative, incremental, evolutionary
2. Value-driven development
3. Production quality
4. Barely sufficient processes
5. Automation, automation, automation
6. Collaboration
7. Self-organizing, self-managing teams

Challenges in data-driven research:
1. Informal use of language and creation of new words
2. Noise and redundant information
3. Inadaptable methodology
4. High-frequency data generation
How to make data-driven research more systematic
Agile analytic steps for data-driven planning and execution of the research pipeline (a skeleton sketch follows below):
1. Information extraction and cleaning
2. Preliminary data analysis
3. Research goal or hypothesis generation
4. Research data design
5. Model and feature selection
6. Output evaluation
7. Visualization
8. Iterate in value-driven chunks (Agile)

A generalized dataset tries to increase the chance that a single data-extraction step can serve multiple projects and purposes. This does not necessarily mean increasing the volume of the data, but rather enlarging its coverage.
Standardized data processing focuses on common processing of the data that can be abstracted to make it reusable across multiple projects.
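The skeleton below is one hedged way to read these eight steps as code; the function names, stub bodies, and stopping condition are our own illustration and are not prescribed by the slides.

```python
# Illustrative skeleton of the agile analytic pipeline; all names and the
# iteration/stopping logic are hypothetical stand-ins, not the authors' code.

def extract_and_clean(raw_source):                 # 1. information extraction and cleaning
    return [record for record in raw_source if record]

def preliminary_analysis(dataset):                 # 2. preliminary data analysis
    return {"n_records": len(dataset)}

def generate_hypothesis(summary):                  # 3. research goal / hypothesis generation
    return "engagement features predict churn"     # placeholder hypothesis

def design_research_data(dataset, hypothesis):     # 4. research data design (sample, label, split)
    return dataset

def select_model_and_features(research_data):      # 5. model and feature selection
    return {"model": "maxent", "features": ["topic_mixture"]}

def evaluate(model_spec, research_data):           # 6. output evaluation (placeholder score)
    return 0.0

def visualize(results):                            # 7. visualization
    print("results:", results)

def run_pipeline(raw_source, max_iterations=3):    # 8. iterate in value-driven chunks
    for _ in range(max_iterations):
        data = extract_and_clean(raw_source)
        summary = preliminary_analysis(data)
        hypothesis = generate_hypothesis(summary)
        research_data = design_research_data(data, hypothesis)
        model_spec = select_model_and_features(research_data)
        score = evaluate(model_spec, research_data)
        visualize({"hypothesis": hypothesis, "score": score})
        if score > 0.9:                            # stop once the chunk delivers enough value
            break

run_pipeline(raw_source=[{"id": 1}, {"id": 2}, None])
```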
A Process for Systematic Data-driven Research
Thoughts
We believe that in today's competitive research and business landscape, it is heterogeneity (variety), the speed at which data is being generated (velocity), and the inconsistency and incompleteness (veracity) of the data that are the most cross-cutting aspects of Big Data, touching organizations of nearly every type and size.
We thus provide a fairly comprehensive overview of research methods and key considerations in characterizing data-driven research, drawing from our experiences in conducting data-driven research projects, including the types of available data and experiments.
Based on these, we recommend a process for performing systematic research on Big Data, akin to the Agile methodology for software development.
Conclusions
While Big Data technologies developed mostly out of large web companies needing strategies to process voluminous data, every organization with data, regardless of whether it has a global user base, needs to run efficiently and provide the best possible service to its customers.
This can be done by harnessing systematic processes to better channel data, in order to provide data-driven value – the most important of the V's.
Questions
How can Big Data serve to support studies designed to perform causal inference, given that the two have opposing starting and ending points?
How does Agile fit into the different stages of data-driven research?
Can Big Data be part of a new mixed-methods approach, where we try to find the individual stories that support the data?
The Hidden Biases in Big Data
The business and science worlds are focused on how large [quantitative] datasets can give insight on previously intractable challenges. The hype becomes problematic when it leads to "data fundamentalism", the notion that correlation always indicates causation and that massive datasets and predictive analytics always reflect objective truth.
Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.
– Kate Crawford (2013, HBR)
Thank you
Key considerations
A. Clarity About Purpose
1. Basic research: contributes to fundamental knowledge and theory
2. Applied research: illuminates a societal concern or problem in the search for solutions
3. Summative evaluation: determines whether a solution (policy or program) works
4. Formative evaluation: improves a policy or program as it is being implemented
5. Action research: understands and solves a problem as quickly as possible
Key considerations
B. Methods Considerations – Qualitative and Quantitative Approaches and Outcomes
"The key to making good forecasts is weighing quantitative and qualitative information appropriately" – Nate Silver
Qualitative: suited to exploring a problem in depth.
Quantitative: well suited to testing theories and hypotheses.
C. Type and Availability of Data
• Heterogeneous and Complex Data
• Data Ownership and Distribution
Key considerations
D. Type of Experiments
• Field experiments (in natural conditions, e.g., in space)
• Laboratory experiments (in artificial conditions)
• Qualitative and quantitative experiments
• Computer simulation experiments
• Retrospection: a review of past events
• Forecasting: a scientific study of the concrete development prospects of an object
E. Type of Analytics
a. Predictive tasks: classification, regression, recommendation
b. Descriptive tasks: cluster analysis, anomaly detection, association analysis (used to discover patterns that describe strongly associated features)
F. Infrastructure/System Considerations: scalability, high dimensionality
Case Studies
A. Maximum Entropy Churn Prediction Using Topic Models
• 1.2 million subscribers, 1.5 years' worth of news/blog data from 13 websites, 3.4 TB of server logs
• Explores structured and unstructured data to come up with predictive models for customer churn
• Uses features mined from transactional databases or Web-based textual data to determine which factors most impact user engagement
• Unique dataset normalization and modeling approach to carve out a future timeframe from the present data for prediction
• Topic and metadata features reveal engagement patterns (a rough code sketch follows below)
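The slides do not include code, but a minimal sketch of the general approach (topic mixtures feeding a maximum entropy, i.e. logistic regression, classifier) might look like the following. The toy documents, labels, and parameter choices are assumptions, and gensim/scikit-learn stand in for whatever tooling the original project used.

```python
# Minimal sketch: LDA topic mixtures as features for a maximum-entropy
# (logistic regression) churn classifier. All data and parameters are toy values.
from gensim import corpora, models
from sklearn.linear_model import LogisticRegression

docs = [["billing", "issue", "cancel"],
        ["great", "article", "share"],
        ["cancel", "subscription", "refund"],
        ["love", "newsletter", "daily"]]
churned = [1, 0, 1, 0]   # hypothetical churn labels per subscriber

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)

# Turn each subscriber's text into a dense topic-mixture feature vector.
def topic_vector(doc_bow, k=2):
    vec = [0.0] * k
    for topic_id, weight in lda.get_document_topics(doc_bow, minimum_probability=0.0):
        vec[topic_id] = weight
    return vec

X = [topic_vector(b) for b in bow]
clf = LogisticRegression(max_iter=1000).fit(X, churned)   # MaxEnt = multinomial logit
print(clf.predict(X))
```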
Case Studies
B. Mining Emotion-Word Correlations in a Large Blog Corpus
• Spinn3r dataset from the ICWSM 2009 data challenge: 44 million blog posts spanning 62 days, covering some big news events such as the 2008 Olympics, both 2008 US presidential nominating conventions, and the beginnings of the financial crisis; total size 142 GB uncompressed
• An exploratory study that tries to determine whether the words people choose correlate well with categories drawn from a basic theory of emotion (a rough code sketch follows below)
• If successful, this information can be used to better predict how blog entries might cluster based on emotion, leading to improved models of information retrieval for blogs
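A hedged sketch of this kind of word-emotion association measurement might use co-occurrence counts and pointwise mutual information (PMI). The tiny corpus, the two emotion categories, and the seed lexicons below are invented for illustration, and PMI is just one possible association measure (the project itself is listed as using association rule mining).

```python
# Illustrative sketch: measure how strongly individual words associate with
# emotion categories via PMI over post-level co-occurrence. The corpus,
# categories, and seed lexicons are hypothetical toy values.
import math
from collections import Counter

posts = [
    "thrilled about the olympics tonight",
    "worried the markets will crash again",
    "so happy with the convention speech",
    "afraid this crisis is getting worse",
]
emotion_seeds = {"joy": {"thrilled", "happy"}, "fear": {"worried", "afraid"}}

tokenized = [set(p.split()) for p in posts]
n_posts = len(tokenized)

word_counts = Counter(w for post in tokenized for w in post)
emotion_counts = {e: sum(1 for post in tokenized if post & seeds)
                  for e, seeds in emotion_seeds.items()}

def pmi(word, emotion):
    """PMI between a word and any seed word of an emotion occurring in the same post."""
    joint = sum(1 for post in tokenized
                if word in post and post & emotion_seeds[emotion])
    if joint == 0:
        return float("-inf")
    p_joint = joint / n_posts
    p_word = word_counts[word] / n_posts
    p_emotion = emotion_counts[emotion] / n_posts
    return math.log2(p_joint / (p_word * p_emotion))

print(pmi("crisis", "fear"), pmi("olympics", "joy"))
```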
Case Studies
C. Using Latent Semantic Analysis to Identify Successful Bloggers
• Spinn3r dataset
• In this work, we hypothesized that there may exist characteristics of language use by informal writers, such as vocabulary or word choice, that are directly associated with successful communication.
• Specifically, we hypothesized a relationship between the vocabulary of a blog and its comment density.
• We used latent semantic analysis (LSA) to reduce the dimensionality of a term-document matrix for each blog in a collection (where a blog is a concatenated set of blog entries).
• Two experiments were performed: (i) an unsupervised clustering approach to see whether relationships to comment density naturally emerge, and (ii) a supervised classification method to identify high and low comment-density blogs, using two complementary models built through LSA (a rough code sketch follows below).
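The setup above maps naturally onto a TF-IDF term-document matrix reduced with truncated SVD, a standard way to implement LSA. The sketch below is only a rough, assumed reconstruction with toy blogs and invented comment-density labels, using scikit-learn rather than whatever tools the authors used.

```python
# Rough sketch of the LSA setup: TF-IDF matrix -> truncated SVD -> clustering and
# a supervised classifier against comment density. Blogs and labels are toy values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

blogs = [  # each "blog" is a concatenation of its entries
    "election convention speech candidate vote debate",
    "recipe kitchen dinner bake flour butter",
    "candidate debate poll vote campaign rally",
    "garden tomato soil compost seedling water",
]
high_comment_density = [1, 0, 1, 0]   # hypothetical high/low labels

X_tfidf = TfidfVectorizer().fit_transform(blogs)
X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_tfidf)

# (i) Unsupervised: do clusters in the LSA space line up with comment density?
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsa)

# (ii) Supervised: classify high vs. low comment density from the LSA space.
clf = LogisticRegression().fit(X_lsa, high_comment_density)

print(clusters, clf.predict(X_lsa))
```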
Case Studies
D. Brand-specific tweet classification with user-provided topics
• Many companies take feedback on their products from Twitter, from which a large number of tweets are collected for data analysis for the purpose of market research.
• We define some rules containing keywords and simple logic to label some tweets into certain bins of interest.
• Starting from the brand-specific data and the simple keyword-based logic rules, we build a system that is able to label many more tweets, with a certain confidence level, using an LLDA model (a rough code sketch follows below).
• We collect the mentioning tweets for 5 brands and build a topic model for each of them to better predict tweets for each bin, with high precision (~85-90%) considering the size of the data.
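A minimal sketch of the seed-labeling stage might look like the following. The brands, keyword rules, and tweets are invented, and a plain Naive Bayes classifier stands in here for the Labeled LDA (LLDA) model the project actually used.

```python
# Sketch: keyword rules assign seed labels to a few tweets, then a simple
# classifier generalizes to the unlabeled tweets with a confidence score.
# Rules and tweets are hypothetical; Naive Bayes stands in for the LLDA model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

rules = {                         # keyword rules -> bins of interest
    "complaint": ["broken", "refund", "worst"],
    "praise": ["love", "amazing", "best"],
}

def seed_label(tweet):
    """Return a bin if any rule keyword matches, else None (tweet stays unlabeled)."""
    text = tweet.lower()
    for bin_name, keywords in rules.items():
        if any(k in text for k in keywords):
            return bin_name
    return None

tweets = [
    "My phone arrived broken, I want a refund",
    "Love this brand, best purchase ever",
    "Amazing customer service today",
    "Worst experience with their support line",
    "Not sure how I feel about the new model",   # no rule fires -> unlabeled
]
labeled = [(t, seed_label(t)) for t in tweets if seed_label(t)]

vec = CountVectorizer()
X = vec.fit_transform([t for t, _ in labeled])
clf = MultinomialNB().fit(X, [lab for _, lab in labeled])

# Label the remaining tweets along with a confidence level, as described above.
unlabeled = [t for t in tweets if seed_label(t) is None]
probs = clf.predict_proba(vec.transform(unlabeled))
print(list(zip(unlabeled, clf.predict(vec.transform(unlabeled)), probs.max(axis=1))))
```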