Big Data
---a statistician’s perspective
Ming Ji, PhD
College of Nursing
USF
Disclaimer
• I am not an expert in big data and cannot cover all the developments in big data.
Big Data is Here
• What is Big Data?
• Data that are too big to handle.
• Data that challenge existing technology and methods to store, process, and analyze.
Examples of Big Data
• Science Data (CERN)
• National Survey Data (NHANES, NHIS, ACS, CPS, NHGIS)
• Genomic Data (microarray, DNA sequencing, GWAS, microbiome)
• Clinical Data (EHR)
• Sensor Data (mHealth)
• Social Media (Facebook, Twitter, LinkedIn, websites, blogs)
• Climate Data (NOAA)
• Financial Data (stock trading, banking, insurance, mortgage, credit cards)
Characteristics of Big Data
• Volume
• Velocity
• Variety
• Veracity
Generation of Big Data
• Employee generated
• User generated
• Machine generated
Volume --- Big Data is Big
• 2.7 zettabytes of data exist in the digital universe today.
• Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
• In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day.
• 100 terabytes of data are uploaded to Facebook daily.
• Data production will be 44 times greater in 2020 than it was in 2009.
• In the last 5 years, more scientific data were generated than the total amount of data generated in all previous human history.
Velocity
• High speed of streaming data.
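To make the velocity point concrete, here is a minimal sketch (my own illustration, not part of the original slides) of keeping a summary statistic up to date over a fast data stream without storing the raw observations, using Welford's online update. The simulated Gaussian stream stands in for a real sensor feed.

```python
# Minimal sketch: update a running mean and variance over a data stream
# (Welford's online algorithm), so high-velocity data never needs to be
# stored in full. The simulated stream below stands in for real sensor data.
import random

n = 0
mean = 0.0
m2 = 0.0  # sum of squared deviations from the current mean

for _ in range(100_000):            # pretend this is an endless sensor feed
    x = random.gauss(10.0, 2.0)     # one new observation arrives
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)

variance = m2 / (n - 1)
print(f"n={n}, running mean={mean:.3f}, running variance={variance:.3f}")
```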
Variety
• Besides traditional structured data, such as numerical data sets stored in relational databases, big data are everywhere and come in many different formats, some of which are unstructured.
• Numerical data; audio; video; text messaging data; websites; blogs; imaging data; genomic data; environmental data; climate data; clinical data; handwriting; etc.
Veracity
• Bias
• Uncertainty
• Abnormality
Challenges of Big Data
• Require new data systems to transfer, store, and process big data (Hadoop, Storm, SAP, BigQuery, Amazon EC2).
• Require new methods to analyze big data (big data analytics using data mining and statistical data analysis).
• Challenge traditional statistical theory (Law of Large Numbers, Central Limit Theorem, settings where n << p, i.e., more variables than observations); a minimal sketch of the n << p problem follows this list.
• Challenge traditional scientific research methods (prediction-based vs. mechanism-based research: can big data replace traditional scientific research?).
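As a concrete illustration of the n << p point above, the hedged sketch below uses simulated data to show that ordinary least squares breaks down when predictors outnumber observations, while a penalized (ridge) fit still gives usable predictions. It assumes numpy and scikit-learn are available; the dimensions, penalty, and data-generating model are illustrative choices, not anything from the slides.

```python
# Minimal sketch of the n << p problem: with more predictors than
# observations, ordinary least squares is not identifiable (it returns a
# minimum-norm interpolator), while a penalized ridge fit is well-posed.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p = 50, 500                        # far fewer observations than predictors
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                        # only 5 predictors truly matter
y = X @ beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)    # interpolates the training data
ridge = Ridge(alpha=10.0).fit(X, y)   # shrinkage stabilizes the estimate

X_new = rng.normal(size=(1000, p))
y_new = X_new @ beta + rng.normal(size=1000)
print("OLS   test R^2:", ols.score(X_new, y_new))    # typically poor
print("Ridge test R^2:", ridge.score(X_new, y_new))  # typically much better
```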
Personal View: Big Data is Still Data
• Big data must follow the same principles of data management:
1. Data collection (streaming data, sensors, GigaScience)
2. Data storage (Oracle, SAP, IBM, EMC, Hadoop, Storm, BigQuery, Amazon EC2 and EMR)
3. Data format conversion (voice2txt, txt2voice, natural language processing from unstructured to structured)
4. Data integration (data linkage, metadata)
5. Data privacy (privacy-preserving data mining, computer security; see the sketch after this list)
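Item 5 mentions privacy-preserving data mining. The sketch below (my illustration, not from the slides) shows one standard idea: release an aggregate count from a sensitive data set only after adding Laplace noise, in the style of differential privacy. The count, epsilon, and sensitivity values are assumptions for illustration.

```python
# Minimal sketch of one privacy-preserving idea (differential privacy):
# publish a count only after adding Laplace noise scaled to sensitivity/epsilon.
import numpy as np

rng = np.random.default_rng(42)
true_count = 1000  # e.g., number of patients with a given condition (illustrative)

def private_count(count, epsilon=0.5, sensitivity=1.0):
    """Return the count plus Laplace noise; smaller epsilon means stronger privacy."""
    return count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

print("True count:    ", true_count)
print("Released count:", round(private_count(true_count)))
```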
Personal View: Big Data May Not Be Big Enough
• GWAS do not identify genetic mutations usable for disease prediction.
• Whole-genome sequencing fails to predict risk of most common diseases (BMJ 2012).
Personal View: Big Data Cannot Escape Statistics Principles
• Collecting and analyzing data from any real-world process must follow the same principles of statistical study design and data analysis.
1. A big sample size does not remove bias (<- biased sampling); see the simulation sketch after this list.
2. Big data may not be big enough (failure of predictive models built from genomic data alone <- unmeasured confounders and underspecified models).
3. Not all big data are useful, and only a small subset is interesting to us --- finding a needle in a haystack (dimension reduction, Google's MapReduce, real-time data analysis).
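A quick simulation makes point 1 concrete. The sketch below is my own illustration (simulated population, invented selection weights): a sample of 100,000 collected with size-biased selection misses the true mean, while a simple random sample of only 500 lands close to it.

```python
# Minimal sketch of point 1: a huge but non-random (biased) sample stays
# biased, while a small random sample is on target on average.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50.0, scale=10.0, size=200_000)

# Biased collection: units with larger values are more likely to be recorded.
weights = np.exp(population / 20.0)
weights /= weights.sum()
huge_biased_sample = rng.choice(population, size=100_000, replace=True, p=weights)
small_random_sample = rng.choice(population, size=500, replace=False)

print(f"True population mean:     {population.mean():.2f}")
print(f"Huge biased sample mean:  {huge_biased_sample.mean():.2f}")   # systematically too high
print(f"Small random sample mean: {small_random_sample.mean():.2f}")  # close to the truth
```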
Personal View: Big Data and Cybernetics
• Big data will advance the further merging of humans and machines, as predicted by Norbert Wiener's work on automation and human society (wearable technology, machine intelligence, hybrid decision making).
• System Sciences and Information Theory may be good theoretical models to guide us in building more big data systems for various applications (feedback, control, adaptation, information).
Personal View: Big data will boost computational sciences
• Big data calls for new hardware and software for computation (GPU, cloud computing, DNA computing, quantum computing).
• Big data calls for the next generation of artificial intelligence to produce "smarter algorithms" to handle big data, because we humans cannot directly process big data (super-Turing machines).
The Future of Big Data: Hope or Hype?
• We are at a crossroads. The true effect of big data on human society is yet to be seen.
• And we cannot use predictive analytics to predict the future of big data.
How do we use big data in our research?
• Think Big: Can you use historically collected and archived big data? (genomic data, large national surveys, NOAA climate data, etc.)
• Think Measurement: Do you have measurement devices that can generate big data? (sensors, images, videos, genomics, climate, etc.)
• Think Multidisciplinary: Do you have experts from other disciplines (informatics, computer science, engineering, biology, mathematics, statistics, etc.) to work on big data?
Case Studies of Big Data: IBM Watson
• Sloan Kettering Cancer Center doctors are training IBM Watson to be an expert in cancer diagnosis and treatment, based on learning from (a toy sketch of document-based learning follows this list):
  • Over 600,000 diagnostic reports
  • Two million pages of medical journal articles
  • One and a half million patient records
  • 14,700 hours of hands-on training
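Watson's actual system is proprietary; purely to illustrate the general idea of a machine learning from labeled documents, here is a toy sketch using a TF-IDF representation and logistic regression from scikit-learn. The report snippets and labels are invented.

```python
# Toy illustration (not Watson's actual method) of learning from labeled
# documents: TF-IDF features plus a simple classifier on invented snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "irregular mass in left lung, biopsy recommended",
    "lesion consistent with malignancy, oncology referral",
    "scan clear, no abnormal findings",
    "routine follow-up, results within normal limits",
]
labels = ["suspicious", "suspicious", "benign", "benign"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reports, labels)

print(model.predict(["no abnormal mass, normal follow-up scan"]))
```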
Case Study: Quantified Self, led by Larry Smarr
• Participants in the Quantified Self movement use different devices to collect physical activity, sleep, diet, and gut microbiome data to monitor their own health, and they use the data analysis results to work with their doctors on interventions.
• Larry Smarr considers this the future of disease prevention.
Case Study: Use big data to fight fraud in Medicare and Medicaid
• CMS estimated that 65 billion dollars was lost to fraud in Medicare and Medicaid in 2011.
• Fraud detection algorithms are implemented in large claims data systems to capture suspicious, potentially fraudulent cases (real-time fraud detection, fraud detection using social network data); a minimal screening sketch follows this list.
• The Health Care Fraud and Abuse Control Program reported recovering 4.2 billion dollars.
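The screening idea in the second bullet can be illustrated with a hedged sketch: simulated claims, an invented provider_id column, and a crude z-score rule that flags providers whose average claim amount is extreme relative to their peers. Real CMS analytics are far more sophisticated; this only shows the general shape of outlier screening on claims data.

```python
# Minimal sketch of one fraud-screening idea: flag providers whose average
# claim amount is an extreme outlier relative to other providers.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
claims = pd.DataFrame({
    "provider_id": rng.integers(0, 200, size=50_000),
    "amount": rng.gamma(shape=2.0, scale=150.0, size=50_000),
})
# Inject one suspicious provider who bills far above typical amounts.
claims.loc[claims["provider_id"] == 13, "amount"] *= 8

per_provider = claims.groupby("provider_id")["amount"].mean()
z = (per_provider - per_provider.median()) / per_provider.std()
suspicious = z[z > 4].index.tolist()
print("Providers flagged for review:", suspicious)
```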