CBDA
International Master's Degree Programme in Computational Big Data Analytics
Make BIG SENSE out of BIG DATA
Data analytics is not just numbers in a table or graphs on paper!
Data analytics is a means for society, industry, and science to control uncertainty and to make discoveries!
Costello et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nature Biotechnology, 2014.
Source: Morningstar Stock Report, morningstar.fi
The spatial patterns of the four leading interannual components extracted from climate data.
A. Ilin, H. Valpola and E. Oja. Exploratory Analysis of Climate Data Using Source Separation Methods. Neural Networks, 19(2):155-167, 2006.
José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma, and Samuel Kaski. Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics, 25:i145–i153, 2009.
Jaakko Peltonen and Samuel Kaski. Generative Modeling for Maximizing Precision and Recall in Information Visualization. In Geoffrey Gordon, David Dunson, and Miroslav Dudik, eds., Proceedings of AISTATS 2011, the 14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP, vol. 15, 2011.
Lawyers Are Turning to Big Data Analysis (The National Law Journal, July 2015)
Big data for big business - analytics are no longer optional (The Globe and Mail, August 2015)
Intel Unveils Analytics Technologies for Big Data, IoT (eWeek, August 2015)
Put big data to work with Cortana Analytics (TechRepublic, July 2015)
How the age of Big Data made statistics the hottest job around (Canadian Business, April 2015)
What can big data do for small startups? (VentureBurn, August 2015)
Why big data isn't always the answer (ComputerWorld, August 2015)
Data Scientist: The Sexiest Job of the 21st Century (Harvard Business Review, October 2012)
Making Sense of Our Big Data World: Statistics for the 99% (Business 2 Community, August 2015)
'Big data' useful but caution is still needed (Daily Record, August 2015)
Growth in big data draws women to statistics (FWC.com, February 2015)
How To Identify A Good/Bad Data Scientist In A Job Interview? (LinkedIn, August 2015)
Why your kids will want to be data scientists (CNBC, June 2014)
Statistics in CBDA
The roots of statistics are in probability theory, which began with the investigation of games of chance.
Statistics in the CBDA programme:
Large data sets include many kinds of variation. Expertise is needed to go from mere measurements to models and understanding.
It is hard to judge by eye alone which of the possible trends are "real" and which are only coincidences.
Computers can search for possible trends among large sets of alternatives, but they need to be told how to evaluate the goodness of the findings.
Statistics studies in CBDA teach:
● what kinds of statistical structure and trends to look for
● how to measure whether they are "real"
● tools and methods to find them and to present the results
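One common way to measure whether a trend is "real" is a permutation test. The sketch below is only an illustration under stated assumptions: the toy data, the function names `pearson_r` and `permutation_p_value`, and the choice of correlation as the trend measure are all hypothetical and are not taken from the programme material.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Share of random shufflings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed ordering itself as one permutation.
    return (hits + 1) / (n_perm + 1)

# Toy data (assumed for illustration): one series with a trend, one without.
rng = random.Random(1)
x = [float(i) for i in range(30)]
y_trend = [0.5 * xi + rng.gauss(0, 2) for xi in x]
y_noise = [rng.gauss(0, 2) for _ in x]

p_trend = permutation_p_value(x, y_trend)
p_noise = permutation_p_value(x, y_noise)
print(f"p-value with a trend:    {p_trend:.4f}")
print(f"p-value with pure noise: {p_noise:.4f}")
```

A small p-value means that shuffled data rarely look as trended as the observed data, so the trend is unlikely to be a coincidence. This is the kind of evaluation rule a computer has to be given before it can search among many candidate trends.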
Statistics in CBDA
Statistics is versatile data analysis: it covers the management of chance and variation, the extraction of information from data, and modeling.
Statistics has a close connection to data mining and machine learning, and in CBDA this connection becomes strongly visible.
An important modern trend is computational statistics, where interesting nonlinear characteristics are sought from data sets, and complicated models are solved e.g. by advanced and distributed optimization and computation methods. CBDA teaching in statistics and computer science enables you to use computational statistics.
Our teaching familiarizes you with the central theory, the most important methods of data acquisition and analysis, and how to apply these in a computer-based fashion.
Distributions, prediction, hypothesis testing, time series analysis, multivariate methods, information visualization, learning from multiple sources...
Statistics in CBDA
Poor use of measurements
and statistics can lead to
false and misleading
conclusions
”The numbers have taken over.
Numbers lie and are misused. They
are used to prove just anything.
People believe in numbers even if
they have been computed
incorrectly.”
”The amount of random chance is
too large” (discussion of
conclusions of a research study)
Oakland A's GM Billy Beane is handicapped with the lowest salary constraint in baseball. If he ever wants to win the World Series, Billy must find a competitive advantage. Billy is about to turn baseball on its ear when he uses statistical data to analyze and place value on the players he picks for the team.
"geek-stats book turned into a movie with a lot of heart"
"persuasively exposed front office tension between ... old school "eye-balling" of players and newer models of data-driven statistical analysis"
Texts from IMDB, Wikipedia
Carl Friedrich Gauss, b. 1777
Blaise Pascal, b. 1623
Thomas Bayes, b. 1702
Pierre-Simon Laplace, b. 1749
Ronald Fisher, b. 1890
Karl Pearson, b. 1857
Prominent statisticians (each shown on the slide with a university or research institute): Stephen L. Portnoy, Alan Agresti, Irene Gijbels, Noel Cressie, Christian P. Robert, Harvey Goldstein, Hirotugu Akaike, Jon A. Wellner, Jerome H. Friedman, Iain M. Johnstone, Peter Hall, Hira Lal Koul, Peter Diggle, Dan-Yu Lin, Gareth O. Roberts, David Donoho, Joseph G. Ibrahim, James Berger, Donald Rubin, James Stephen Marron, Norman R. Draper, Ingram Olkin, Jianqing Fan, Bernard W. Silverman, Michael B. Woodroofe, Peter J. Rousseeuw, Ole Barndorff-Nielsen, Enno Mammen, David B. Dunson, Nancy Reid, Kanti V. Mardia, Alexandre Tsybakov, Paul Rosenbaum, Marc Hallin, Marc Yor, Raymond Carroll, Bruce Lindsay, Bradley Efron, George Box, Hans-Georg Muller, Peter J. Bickel, Erich Leo Lehmann, Alan Gelfand, Murad Taqqu, William E. Strawderman, David O. Siegmund, Wolfgang Karl Härdle, Peter Buhlmann, Ricardo Fraiman, Adrian Raftery, Andrew Gelman, John W. Tukey, Persi Diaconis, David A. Freedman, Luc Devroye, Robert Tibshirani, David Ruppert, Peter M. Robinson, Theodore W. Anderson, Leo Breiman, Holger Dette, George Casella, Richard David Gill, Trevor Hastie
You
University of Tampere
Jobs for Data Analytics Experts (Data Scientists)
A data scientist, combining expertise in statistics and computer science, will work in cooperation with experts from other fields.
Application areas:
● Technology and natural sciences (technometrics, chemometrics)
● Biology (biometrics, see e.g. http://www.uta.fi/hes/tutkimus/tutkimusryhmat/Biometria.html)
● Medicine (epidemiology)
● Economics (econometrics)
● Social and behavioral sciences (demometrics, psychometrics)
Examples of Finnish jobs for data analysts
Jobs for Data Analytics Experts (Data Scientists)
See also (in Finnish)
http://www.uta.fi/opiskelu/selvitykset/matematiikka_tilastotiede_sijoittuminen.pdf
Optional studies can influence which field the student ends up in.
An example of statistics jobs (in Finnish):
http://www.luonnontieteet.fi/tyo/tilastotiede
Examples of statistics jobs and employers (in Finnish)
http://www.uta.fi/rekrytointi/opiskelijalle_ja_tyonhakijalle/uraseuranta/oppiainekoosteet/tilastotiede.html
Jobs for Data Analytics Experts (Data Scientists), information from graduates
The career and recruitment services of the University of Tampere (http://www.uta.fi/rekrytointi) monitor the placement of graduates in working life: http://www.uta.fi/opiskelu/tyoelama/seurannat/index.html
A slightly old report on 2011 master's degree graduates:
http://www.uta.fi/opiskelu/tyoelama/seurannat/maisterit/index/sijoittumisseuranta%202011.pdf
(One year after graduation, all statistics students were in permanent or temporary jobs or working as grant-funded researchers.)
Tales from students of mathematics and statistics about their studies and placement in working life:
http://www.uta.fi/opiskelu/selvitykset/matematiikka_tilastotiede_sijoittuminen.pdf
“researcher in government research institute”, “mathematician in a government agency research unit”, “head of quality control in an industrial company”, “Data Mining analyst”
Structure of CBDA studies: upcoming courses
Master's programme in Computational Big Data Analytics (CBDA)
General Studies in Master's Degree Programmes given in English 2015-18 1–22 ECTS
General studies in the Master's degree programmes given in English are different depending on the student's educational background.
Please choose below only one of the three options A, B or C.
A) General studies for international students 12–22 cr
Compulsory studies 12 cr
● SISYY006 Orientation, 2 cr
● SISYY005 Study Skills and Personal Study Planning, 2 cr
● KKENMP3 Scientific Writing, 5 cr
● KKSU1 Finnish Elementary Course 1, 3 cr
Free-choice studies 0–10 cr
● YKYYKV1 Finnish Society and Culture, 3–5 cr
● YKYYV07 Introduction to Science and Research, 2–5 cr
B) General studies for students with education in Finnish and BSc degree taken outside SIS 9–18 cr
Compulsory studies 9–13 cr
The Swedish course is required only if no Swedish studies were taken in the Bachelor's degree.
● SISYY006 Orientation, 2 cr
● SISYY005 Study Skills and Personal Study Planning, 2 cr
● KKENMP3 Scientific Writing, 5 cr
● KKRULUK Ruotsin kielen kirjallinen ja suullinen viestintä (Written and Oral Communication in Swedish), 4 cr
Free-choice studies 0–5 cr
● YKYYV07 Introduction to Science and Research, 2–5 cr
C) General studies for students who have taken their BSc degree at SIS 1–11 cr
Compulsory studies 1 cr
Basics of Information Literacy 1 cr is not required, only Personal study planning 1 cr from SISYY005.
● SISYY005 Study Skills and Personal Study Planning, 2 cr
Free-choice studies 0–10 cr
Scientific Writing is recommended if the Master's thesis is written in English.
● KKENMP3 Scientific Writing, 5 cr
● YKYYV07 Introduction to Science and Research, 2–5 cr
Master's programme in Computational Big Data Analytics (CBDA)
Advanced Studies in Big Data Analytics 85 cr
Compulsory Advanced Courses in Big Data Analytics 50 cr
● MTTTS11 Master's Seminar and Thesis, 40 cr
● MTTTS12 Introduction to Bayesian Analysis 1, 5 cr
● TIETS01 Algorithms, 5 cr
Advanced Courses in Methods of Computational Data-Analytics 15– cr
● TIETS07 Neurocomputing, 5 cr
● TIETS11 Data Mining, 5 cr
● TIETS31 Knowledge Discovery, 5–10 cr
● TIETS39 Machine Learning Algorithms, 5 cr
● TIETS33 Advanced Course in Computer Science, 1–10 cr
Advanced Courses in Methods of Statistical Data-Analytics 20– cr
● MTTTS13 Introduction to Bayesian Analysis 2, 5 cr
● MTTTS14 Statistical Modeling 1, 5 cr
● MTTTS15 Statistical Modeling 2, 5 cr
● MTTTS16 Learning from Multiple Sources, 5 cr
● MTTTS17 Dimensionality Reduction and Visualization, 5 cr
● MTTTS18 Time Series Analysis 1, 5 cr
● MTTTS19 Advanced Regression Methods, 5 cr
● MTTTS21 Statistical Inference 2, 5 cr
● MTTS1 Other course (advanced)
Master's programme in Computational Big Data Analytics (CBDA)
Other and optional Studies in Big Data Analytics Programme 13–29 cr
Compulsory Introductory Studies 5 cr
● TIETA17 Introduction to Big Data Processing, 5 cr
Complementing Studies
Complementing studies determined based on previous education
Optional Studies
Recommended studies in Applications of Data-Analytics
● TIETS05 Digital Image Processing, 5 cr
● MTTTS20 Basics of Financial Data-Analysis and Risk Theory, 5 cr
● ITIS13 Information retrieval methods, 5 cr
● ITIS16 Information practices literature, 5–20 cr
● MTTA3 Internship, 2–10 cr
CBDA Courses
Fall 2015
I: Introduction to Bayesian Analysis 1
Prior and posterior distributions, Bayes estimators, posterior predictive distribution, interval estimation and hypothesis testing, single-parameter models, simple multiparameter models.
I: Introduction to Big Data Processing
Typical characteristics and common applications of big data; basics of distributed file systems, databases and computing; practical data processing skills with MapReduce / Apache Hadoop
I-II: Learning from Multiple Sources
Data fusion, transfer learning, multitask learning, multiview learning, and learning under covariate shift
I-II: Knowledge Discovery
Phases of the process of knowledge discovery and its nature; basic data preprocessing, data mining and postprocessing tasks and methods; application in practical knowledge discovery tasks; advanced methods in knowledge discovery; data management issues
II: Time Series Analysis 1
Simple time series models, stationary time series models (ARMA), nonstationary and seasonal time series models (SARIMA), time series regression, periodogram.
I-IV: Information practices literature
Literature package on either: Information practices; Information retrieval systems; Interactive information retrieval; task-based information retrieval
(Master's thesis and seminar runs every fall and spring.)
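To make the Bayesian topics listed for Introduction to Bayesian Analysis 1 more concrete (prior and posterior distributions, Bayes estimators, posterior predictive distribution, interval estimation in a single-parameter model), here is a minimal sketch of a Beta-Binomial model. The data (7 successes, 3 failures), the uniform prior, and all function names are assumptions made for illustration, not course material.

```python
import random

def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def credible_interval(alpha, beta, level=0.95, n_draws=10000, seed=0):
    """Equal-tailed credible interval estimated by Monte Carlo sampling."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_draws))
    lower_index = int((1 - level) / 2 * n_draws)
    upper_index = int((1 + level) / 2 * n_draws) - 1
    return draws[lower_index], draws[upper_index]

# Assumed example: uniform prior Beta(1, 1), then 7 successes and 3 failures.
prior = (1.0, 1.0)
posterior = beta_binomial_update(*prior, successes=7, failures=3)

# Posterior mean: the Bayes estimator under squared-error loss.
# In this model it is also the posterior predictive probability
# that the next trial is a success.
post_mean = beta_mean(*posterior)
lower, upper = credible_interval(*posterior)

print("posterior parameters:", posterior)
print(f"posterior mean: {post_mean:.3f}")
print(f"approximate 95% credible interval: ({lower:.2f}, {upper:.2f})")
```

The conjugate prior keeps the posterior in closed form, which is why single-parameter models like this one are a usual starting point before multiparameter models.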
CBDA Courses
Spring 2016
III: Introduction to Bayesian Analysis 2
Markov chains, MCMC methods, model checking and comparison, commonly used statistical models, such as hierarchical and regression models, binomial and count data models.
III: Data Mining
Premises, objectives, relevance, and basic methods of data mining; properties of data and measurements, preprocessing methods, some data mining algorithms and their applications, for instance, for classification and prediction of data.
III-IV: Dimensionality Reduction and Visualization
Properties of high-dim data; Feature Selection; Linear feature extraction; Graphical excellence; Human perception; Nonlinear dimensionality reduction; Neighbor embedding methods; Graph visualization.
I-IV: Information practices literature
Literature package on either: Information practices; Information retrieval systems; Interactive information retrieval; task-based information retrieval
IV: Statistical Inference 2
Roles of Modeling in Statistical Inference, Principles of Data Reduction, Estimation: Risk, Loss of estimators,... Large sample properties, Likelihood-Based Methods, likelihood-based tests and confidence regions
IV: Machine Learning Algorithms
Basic and advanced machine learning methods for data mining, pattern recognition and other problems
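As an illustration of the Markov chain and MCMC topics listed for Introduction to Bayesian Analysis 2, the following is a minimal sketch of a random-walk Metropolis sampler. The standard normal target, the step size, the burn-in length, and the function names are assumptions chosen for the example and do not describe how the course itself is taught.

```python
import math
import random

def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x = start
    logp = log_density(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if logp_new >= logp or rng.random() < math.exp(logp_new - logp):
            x, logp = proposal, logp_new
        samples.append(x)  # a rejected proposal repeats the current state
    return samples

def log_std_normal(x):
    """Log density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x

draws = metropolis(log_std_normal, start=3.0, n_samples=20000, step=2.0)
kept = draws[2000:]  # discard an assumed burn-in period
mean = sum(kept) / len(kept)
var = sum((d - mean) ** 2 for d in kept) / (len(kept) - 1)
print(f"sample mean: {mean:.2f}, sample variance: {var:.2f}")
```

Only ratios of the target density are needed, so the normalizing constant can be left out. That property is what makes the method practical for posterior distributions that are known only up to a constant.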
CBDA Statistics Courses
Fall 2016 (preliminary!)
I: Introduction to Bayesian Analysis 1
Prior and posterior distributions, Bayes estimators, posterior predictive distribution, interval estimation and hypothesis testing, single-parameter models, simple multiparameter models.
I-II: Learning from Multiple Sources
Data fusion, transfer learning, multitask learning, multiview learning, and learning under covariate shift
II: Possibly "Basics of financial data analysis and risk theory 5 cr", or another course
Spring 2017 (preliminary!)
III: Statistical Modeling 1
Multinomial and ordinal regression, nonlinear regression, parametric survival analysis, counting process models, semiparametric hazard models.
III-IV: Dimensionality Reduction and Visualization
Properties of high-dim data; Feature Selection; Linear feature extraction; Graphical excellence; Human perception; Nonlinear dimensionality reduction; Neighbor embedding methods; Graph visualization.
IV: Statistical Modeling 2
Normal mixed model and extensions, growth curve models, models for panel discrete (binary, count, categorical) observations, analysis of missing data, mixture or latent class regression, hierarchical and latent structure models
Data analytics is management
of knowledge and uncertainty.
As long as there is uncertainty
in the world, there is a need
for data analytics.