Web Science Summer School WS3, Southampton, UK, 21st July 2014
Dr. Claudia Wagner
http://claudiawagner.info/
[Slide image: example tweet; source: Twitter]

Statistical computing is central, but data science is more than statistics.

Activities of data scientists:
 collection and generation,
 preparation,
 analysis,
 visualization,
 management and preservation of large collections of data

Jeffrey Stanton, Introduction to Data Science (free e-book)

Ask an interesting question
 Why is it important? Which number answers your question?

Get or generate the data
 Which data will help answer your question? How is the data generated? Are there any sampling biases? Ethical issues?

Analyze the data
 Are there any anomalies or regularities?
 Which hidden process has generated the data?
 Fit a model to the data and validate it

Visualize and communicate results
 What does 75% probability mean?

Preserve and share the data to make results reproducible




Data is a collection of facts. Facts can be numbers, words, measurements, observations or even just descriptions of things.

Qualitative data (e.g., “it was great”)
Quantitative data
 Discrete (e.g., 5)
 Continuous (e.g., 3.723)
Stevens’ levels of measurement:
 Nominal (e.g., ethnic group, sex, nationality): observations are only named
 Ordinal (e.g., status): observations can be ordered
 Interval (e.g., temperature in Celsius): distances between values are meaningful
 Ratio (e.g., weight): has an absolute zero

Stevens, S. S. (1946). "On the Theory of Scales of Measurement". Science 103 (2684): 677–680.

Random sample of Twitter users?
 A random sample of tweets from the public timeline is not a random sample of users: more active users are more likely to be included

Friendship Paradox
 Select a random sample of people and ask them to list the people they know. Contact a sample of the listed friends and repeat the survey.
 Sampling bias: people with more friends are more likely to show up in the friend lists we generate in the first stage
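A minimal R simulation of the activity bias above, under assumed (hypothetical) tweet rates: sampling tweets and looking at their authors overrepresents the most active users.

set.seed(1)
rate <- rexp(10000, rate = 0.1)                    # hypothetical tweets per day for 10,000 users (skewed)
tweets <- rep(1:10000, times = rpois(10000, rate)) # one entry per tweet, labeled with the author's id
authors <- unique(sample(tweets, 1000))            # "random sample of tweets" -> set of sampled authors
mean(rate[authors])                                # clearly larger than ...
mean(rate)                                         # ... the population's average activity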

A study found that the profession with the
lowest average age of death was student.
 Being a student does not cause you to die at an early
age. Being a student means you are young. This is
what makes the average of those that die so low.

Amount of ice cream consumed per day is highly
correlated with number of drownings per day
 Both variables are correlated with the daily
temperature
"Teaching Statistics: A Bag of Tricks," by Gelman and Nolan (2002)

A study found that only 1.5% of drivers in accidents reported that they were using a cell phone, whereas 10.9% reported that they were distracted by another occupant in the car.

Can we conclude that using a cell phone is safer than speaking with another occupant?

 P(cellphone | accident) ≠ P(accident | cellphone)
 Compare P(accident | cellphone) and P(accident | occupant)
 We need to know the prevalence of cell phone use
 It is likely that many more people talk to another occupant in the car while driving than talk on the cell phone

Jessica Utts, What Educated Citizens Should Know about Statistics and Probability, The American Statistician, Vol. 57, No. 2 (May 2003), pp. 74-79
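A hedged sketch of that reasoning in R, with made-up prevalence numbers purely for illustration: once we divide by how common each behavior is, the phone can come out riskier even though it appears in fewer accident reports.

p_cell <- 0.03             # assumed share of driving time spent on the phone (hypothetical)
p_occ  <- 0.50             # assumed share of driving time spent talking to an occupant (hypothetical)
p_cell_given_acc <- 0.015  # reported in the study
p_occ_given_acc  <- 0.109  # reported in the study
# By Bayes' rule, P(accident | behavior) is proportional to P(behavior | accident) / P(behavior)
p_cell_given_acc / p_cell  # 0.50 -> relative accident risk while on the phone
p_occ_given_acc  / p_occ   # 0.22 -> relative accident risk while talking to an occupant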

Ecological Fallacy
 Illiteracy rate in each US state vs. the proportion of immigrants per state: negative correlation of −0.53
 ▪ The greater the proportion of immigrants in a state, the lower its average illiteracy.
 When individuals are considered, the correlation was +0.12: immigrants were on average more illiterate than native citizens.

Robinson, W.S. (1950). "Ecological Correlations and the Behavior of Individuals". American Sociological Review 15 (3): 351–357.
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation

Found data or observational data
 Are observational data enough?
 Are such data available?

Generate data
 Design the data generation process
 ▪ E.g., via surveys, experiments, crowdsourcing
http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
Two general types of traces:
 Accretion: a build-up of physical traces
 Erosion: the wearing away of material

Webb, Eugene J. et al., Unobtrusive Measures: Nonreactive Research in the Social Sciences. Chicago: Rand McNally, 1966

Bulk downloads
 Wikipedia, IMDB, Million Song Database, etc.

API access
 NY Times, Twitter, Facebook, Foursquare, etc.

Web scraping
 Tools, e.g., http://scrapy.org/
 What data is OK to scrape?
 ▪ Public, non-sensitive, anonymized, fully referenced information; check the terms and conditions!

Takes time to accumulate

Conservative estimate
 Only what happened counts! Intentions, motivations or internal states don’t count.

Inferentially weak
 Cannot answer “what-if” questions

Surveys

Simulations
 Model behavior of users/agents on a micro-level
 Simulate what happens under different conditions
 Empirical validation

Experiments
 Keep all variables constant and manipulate only one variable (e.g., emotions)

Simulations
 Study of macro-phenomena
 Difficult to validate empirically

Surveys and/or Experiments
 We only get data from those who are accessible and willing to respond or participate
 Respondents provide answers that are in line with their self-image and the researcher’s expectations
 Hawthorne effect, etc.
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation

Data cleaning
 Fill in missing values
 Smooth noisy data
 Identify or remove outliers
 Resolve inconsistencies

Data integration
 Integration of multiple databases or files

Data transformation
 Normalization: scale values to fall within a small, specified range
 Standardization: express each data point as the number of standard deviations it lies from the mean
 Discretization: divide the range of a continuous attribute into intervals  some algorithms require discrete attributes

Data reduction
 Dimensionality reduction (remove unimportant attributes via feature selection, or group features into factors, e.g., PCA, SVD)
 Aggregation and clustering
 Sampling
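A minimal R sketch of the three transformations above, on an arbitrary numeric vector:

x <- c(2, 5, 9, 14, 20)
x_norm <- (x - min(x)) / (max(x) - min(x))  # normalization: rescale into [0, 1]
x_std  <- (x - mean(x)) / sd(x)             # standardization: z-scores
x_disc <- cut(x, breaks = 3)                # discretization: 3 equal-width intervals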
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation

Data Mining
Statistical Inference
Machine Learning

Problem:
 Given a high-dimensional space (e.g., fb-users described via various attributes such as the locations they visited)
 Find pairs of data points (x, y) that are within some distance threshold: d(x, y) ≤ s

We first need to decide what “distance” means

Distance Measures
 Jaccard similarity between 2 sets of items I1, I2:
   sim(I1, I2) = |I1 ∩ I2| / |I1 ∪ I2|
   dist(I1, I2) = 1 − sim(I1, I2)

Euclidean distance, Hamming distance, cosine similarity, etc.
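A short R sketch of Jaccard similarity and distance on two example sets:

jaccard_sim <- function(a, b) length(intersect(a, b)) / length(union(a, b))
I1 <- c("club", "bar", "cafe")
I2 <- c("bar", "cafe", "gym", "park")
jaccard_sim(I1, I2)      # 2 shared items / 5 distinct items = 0.4
1 - jaccard_sim(I1, I2)  # Jaccard distance = 0.6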

Goal: Given a set of items, group the items into some number of clusters, so that
 Members of a cluster are similar to each other
 Members of different clusters are dissimilar

Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press

Non-hierarchical / point assignment:
 Maintain a set of clusters
 Points belong to the “nearest” cluster

Hierarchical:
 Agglomerative (bottom up):
 ▪ Initially, each point is a cluster
 ▪ Repeatedly combine the two “nearest” clusters into one
 Divisive (top down):
 ▪ Start with one cluster and recursively split it

Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press


Choosing k:
 Try different k, looking at the change in the average distance to the centroid as k increases
 The average falls rapidly until the right k, then changes little

[Figure: average diameter vs. k; the best k is at the elbow of the curve]
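A brief R sketch of this elbow heuristic, using k-means on synthetic 2-D data with three planted clusters:

set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))
wss <- sapply(1:8, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "within-cluster SS")  # elbow near k = 3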


Aim: Find hidden concepts/groups in a matrix
Method: Singular Value Decomposition (SVD)

Leskovec et al., Mining of Massive Datasets, p. 418


Rank denotes the information content of the matrix.
 For instance, a rank-1 matrix can be written as a product of one column vector and one row vector (the example matrix on the slide has rank 2).

[Figure: SVD of a user-movie matrix, M = U Σ Vᵀ — U relates users to concepts, the singular values in Σ give the strength of the concepts, and Vᵀ relates movies to concepts]

Leskovec et al., Mining of Massive Datasets, p. 418
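A toy R sketch of that factorization, on a made-up user-movie ratings matrix with two underlying tastes:

M <- rbind(c(1, 1, 0, 0),   # rows = users, columns = movies
           c(3, 3, 0, 0),
           c(0, 0, 4, 4),
           c(0, 0, 2, 2))
s <- svd(M)
round(s$d, 2)  # singular values: strength of each concept (only two are non-zero)
round(s$u, 2)  # relates users to concepts
round(s$v, 2)  # relates movies to concepts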
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation

Data Mining
Statistical Inference
Machine Learning

Estimate population parameters from sample statistics

Sampling distribution of a statistic:
 Draw a finite set of samples of size n from the population
 Compute the statistic on each sample
 Repeat this process
 The mean of the sampling distribution is the expected value of the statistic in the true population
 The SD of the sampling distribution is the standard error
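A short R simulation of this procedure, assuming an arbitrary skewed population:

set.seed(7)
population <- rexp(100000, rate = 0.2)                  # a skewed "true" population, mean 5
means <- replicate(5000, mean(sample(population, 50)))  # the statistic on many samples of n = 50
mean(means)  # close to the population mean
sd(means)    # the standard error, roughly sd(population)/sqrt(50)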

Some descriptive statistics, such as the mean or median, are unbiased estimators of central tendency
 The expected value of the statistic is the true population parameter

The expected value of the dispersion in a sample is an underestimate of the true population value


True population size is N
Sample size n < N (e.g., n = 100)
 Correction factor: n/(n − 1)
 For n = 100 the correction factor is ~1.01
 For n = 100,000 the correction factor is ~1.00001

Estimate of the population variance:
 (n/(n − 1)) · Σᵢ₌₁ⁿ (xᵢ − x̄)² / n = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
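This correction is built into R: var() and sd() already divide by n − 1. A quick check:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
biased   <- sum((x - mean(x))^2) / n  # divides by n: underestimates the population variance
unbiased <- biased * n / (n - 1)      # apply the correction factor n/(n-1)
unbiased == var(x)                    # TRUE: var() divides by n-1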

Specify the range of values that has a high probability of containing the true population parameter

Confidence level 1 − α: the probability that the confidence interval contains the true population parameter


CI = sample statistic ± MOE
MOE = SE × critical value = (σ/√n) · z_{α/2}
 n … sample size
 σ … standard deviation
 z_{α/2} … confidence coefficient

Critical value: how far away from the mean must a point lie in order to be considered “extreme” or “unexpected”?
[Figure: standard normal curve; the area under the curve between the mean and z is 0.475 — what is the z-score? From the z-table: z ≈ 1.96]
Select 1000 fb-users randomly
 Average number of bar visits per year: x̄ = 78
 Standard deviation: √(Σᵢ₌₁ⁿ (xᵢ − x̄)² / n) = 30
 Confidence level is 95%  divide 0.95 by 2 to get 0.475
 Check the z-table  z ≈ 1.96

MOE = (σ/√n) · z_{α/2} = (30/√1000) · 1.96 ≈ 1.86

78 ± 1.86  CI: [76.14; 79.86]
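The same computation in R, using the values from the example above:

n <- 1000; xbar <- 78; sigma <- 30
z <- qnorm(0.975)           # critical value for a 95% CI, ~1.96
moe <- sigma / sqrt(n) * z  # margin of error, ~1.86
c(xbar - moe, xbar + moe)   # [76.14, 79.86]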

An exact CI can only be computed when the sampling distribution and the SD of the sampling distribution (i.e., the SE) are known

Otherwise we have to estimate the standard error (SE)  Bootstrap

Sampling with replacement
 The population is unknown
 But we observe one sample from the population, of size n = 4: {2, 3, 8, 8}
 We use this sample to generate a large number of bootstrap samples of size n:
 ▪ 8, 8, 8, 3
 ▪ 3, 3, 8, 2
 ▪ …

Compute the statistic (e.g., mean) for each bootstrap sample
Estimate the SE from the bootstrap distribution
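A minimal R sketch of this procedure for the observed sample {2, 3, 8, 8}:

x <- c(2, 3, 8, 8)
boot_means <- replicate(10000, mean(sample(x, length(x), replace = TRUE)))
sd(boot_means)  # bootstrap estimate of the standard error of the mean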
Report: statistic ± MOE, where the MOE for a 95% CI = 2 × SE

[Diagram: population  sample  many bootstrap samples  calculate the statistic for each bootstrap sample  bootstrap distribution. The standard error (SE) is the SD of the bootstrap distribution]

Randomly selected sample of fb-users
 Have they ever checked in at a nightclub?
 Democrats: 100/1000 yes
 Republicans: 90/1000 yes

Do the nightlife preferences differ significantly across political parties?
Give a 95% CI for the difference in proportions





dems = rep( c(0,1), c(1000-100, 100) )
repubs = rep( c(0,1), c(1000-90, 90) )
mean(dems)   # 0.1
mean(repubs) # 0.09
del.p = mean(dems) - mean(repubs)  # 0.01 (point estimate)

reps = replicate( 1000, {
  ds = sample( dems, 1000, replace=TRUE )
  rs = sample( repubs, 1000, replace=TRUE )
  mean( ds ) - mean( rs )
})

SE = sd( reps )  # 0.0131
c( del.p - 2*SE, del.p + 2*SE )  # -0.0162 0.0362 (interval estimate)
Observed frequencies:

        Democrats   Republicans   Total
 yes          100            90     190
 no           900           910    1810

H1: political party affects nightlife preferences
H0: political party does not affect nightlife preferences
 Proportion of users who visited nightclubs, no matter which party they belong to: 190/2000 = 0.095

If political affinities have no effect, we would expect the following frequencies:

        Democrats   Republicans   Total
 yes           95            95     190
 no           905           905    1810
Using the observed and expected frequencies above:
 χ² = Σ (o − e)² / e = 0.5815
 DF = (number of rows − 1) × (number of columns − 1) = 1
 The critical value of χ² at 5% significance and 1 DF is 3.84
 Our χ² does not exceed the critical value
 We cannot reject H0
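The same test in R, with the continuity correction switched off to match the hand computation:

tab <- matrix(c(100, 900, 90, 910), nrow = 2,
              dimnames = list(c("yes", "no"), c("Democrats", "Republicans")))
chisq.test(tab, correct = FALSE)  # X-squared = 0.5815, df = 1, p = 0.446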

If α = 0.05, then 95% of all values fall in this interval

Two-tail test:
 2.5% of values in the upper tail and 2.5% in the lower tail are considered so extreme that we reject H0 if we observe them

Test whether democrats on fb, on average, have more than 60 bar visits per year
 H1: µ > 60
 H0: µ ≤ 60

Random sample of 20 democratic fb-users:
 {65 73 51 67 48 80 69 53 59 62 71 67 64 78 65 49 80 60 51 70}
 Sample mean x̄ = 64.1
 Assume we know the SD in the population = 10

z = (x̄ − µ₀) / SE, with SE = SD/√n

z = (64.1 − 60) / (10/√20) = 1.8336

Would we expect that? How extreme is this observation?
 If H0 is true (mean ≤ 60)  in which area around the mean do 95% of all points lie?

Pick an alpha level α = 0.05  that’s the maximum probability of rejecting the null hypothesis when the null hypothesis is true

Right-tail test: find the critical value for an area of 0.45 using the z-distribution

If the z-score of our observed data exceeds this value, we have to reject H0
 1.8336 > 1.645  reject the null hypothesis
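The same one-sample z-test in R, using the sample above:

x <- c(65, 73, 51, 67, 48, 80, 69, 53, 59, 62, 71, 67, 64, 78, 65, 49, 80, 60, 51, 70)
z <- (mean(x) - 60) / (10 / sqrt(length(x)))  # known population SD = 10
z                             # 1.8336
qnorm(0.95)                   # right-tail critical value, 1.645
pnorm(z, lower.tail = FALSE)  # one-sided p-value, ~0.033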

Large effects, small samples:
 In small samples it is easy to overestimate an effect that might have happened by chance

Small effects, large samples:
 The smaller the effect you want to measure, the larger the sample size you need to show that it is significant!

Example: Assume a coin is biased: 10% heads and 90% tails
 Tossing the coin 10 times should be enough to convince people that the coin is biased.

Example: Assume a coin is biased: 51% heads and 49% tails
 The minimum sample size increases with decreasing effect size that one wants to demonstrate

The more we analyze, the more we find by chance!

If you calculate the correlations between 10 variables (i.e., 45 different correlation coefficients), you should expect at least 2 correlations to be significant with p < 0.05 by chance (one in every 20)

Corrections or adjustments for the total number of comparisons are needed!
Many tests such as the z-test, t-test and ANOVA make a normality assumption.
 If the true population is very skewed (e.g., power law), the sampling distribution of the statistic will not be normal

Nonparametric methods like the sign test use, e.g., the median rather than the mean (see the sketch below)
 Hypothesis about the median of the true population (e.g., H1: median < 100, H0: median = 100)
 Count the number of measurements that favor the null hypothesis
 If H0 is true, half of the measurements should fall on each side.
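A sign test can be run with base R's binom.test: under H0 each measurement falls below the hypothesized median with probability 0.5. A sketch with made-up data:

x <- c(62, 81, 93, 45, 70, 88, 55, 66, 74, 120)  # hypothetical measurements
below <- sum(x < 100)                            # 9 of 10 fall below the H0 median
binom.test(below, length(x), p = 0.5, alternative = "greater")  # p ~ 0.011: reject H0 at the 5% level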
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation

Data Mining
Statistical Inference
Machine Learning

Aim
 Find a function that describes the relation between X (e.g., bar visits) and Y (e.g., new friends)
 Given X, predict Y

Problem
 There is an infinite number of ways X and Y could be related

Idea
 Reduce the space of possible functions and start with the simplest one (a linear relation)
 Y = b0 + b1·X

[Figure: scatter plot of data points along the fitted line Y = 2 + 0.5·X, for X from 0 to 8]

Use gradient descent to minimize the cost function C(b0, b1)
 C(b0, b1) = 1/(2N) · Σᵢ₌₁ᴺ (Yᵢ − Ŷᵢ)²
 C(b0, b1) = 1/(2N) · Σᵢ₌₁ᴺ (Yᵢ − b0 − b1·Xᵢ)²

 Start with some guess for b0, b1
 Keep changing b0, b1 to reduce C(b0, b1) until we hopefully end up at a minimum
b0 := b0 − α · ∂C(b0, b1)/∂b0
b1 := b1 − α · ∂C(b0, b1)/∂b1
 α is the learning rate
 The derivative of the cost function informs us about its slope
 Simultaneous updates of b0 and b1

[Figure: convex cost curve C(b) plotted against b; gradient descent steps move downhill toward the minimum]
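A bare-bones R implementation of these updates on synthetic data (learning rate and iteration count are arbitrary choices):

set.seed(1)
X <- runif(100, 0, 8)
Y <- 2 + 0.5 * X + rnorm(100, sd = 0.5)  # data generated around Y = 2 + 0.5X
b0 <- 0; b1 <- 0; alpha <- 0.05; N <- length(Y)
for (i in 1:2000) {
  err <- Y - b0 - b1 * X      # residuals under the current guess
  grad0 <- -sum(err) / N      # dC/db0 for C = 1/(2N) * sum(err^2)
  grad1 <- -sum(err * X) / N  # dC/db1
  b0 <- b0 - alpha * grad0    # simultaneous updates: both gradients
  b1 <- b1 - alpha * grad1    # were computed before either change
}
c(b0, b1)  # close to (2, 0.5)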

Residuals: deviation between the observed and the predicted values

Residual sum of squares: RSS = Σᵢ (yᵢ − ŷᵢ)²
 Is this a good measure? No, it depends on the number of observations N
 What if we multiply it with 1/N? Then we get the mean squared error, which no longer grows with N
Unexplained variability!
 Residuals: difference between the observed value and the estimated value
 The proportion of the total variability unexplained by the model is RSS/TSS, so R² = 1 − RSS/TSS

 yᵢ … observed value
 ŷᵢ … value predicted by the model
 ȳ … mean of the observed data
 TSS = Σᵢ (yᵢ − ȳ)² … total variability in the outcome that needs to be explained
The dependent variable is binary (e.g., went to a nightclub or not)
 We can group users by the number of new friends per year (20-25, 25-30, 30-35, etc.) and compute the proportion of people with a high “nightclub-probability”

Logistic Regression:
 ln( P(Y=1) / (1 − P(Y=1)) ) = b0 + b1·X + ε

 Maximum likelihood estimation
 Estimate the unknown coefficients by maximizing the log-likelihood function
 A coefficient is interpreted as the rate of change in the “log odds” as X changes
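A minimal R sketch with simulated data (all names and numbers are made up):

set.seed(2)
new_friends <- runif(500, 0, 50)
p <- 1 / (1 + exp(-(-3 + 0.12 * new_friends)))  # true model on the logit scale
nightclub <- rbinom(500, 1, p)                  # binary outcome
fit <- glm(nightclub ~ new_friends, family = binomial)
coef(fit)  # slope ~ 0.12: the change in log odds per additional new friend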
Simple example:
You have a coin that you know is biased towards heads and you want to know what the probability of heads (p) is.
We want to estimate the unknown parameter p!

You flip the coin 10 times and the coin comes up heads 7 times.
What’s your best guess for p?
Find the value of p that makes our data most likely!
The probability of observing 7 heads when tossing a coin 10 times is given by the binomial distribution:

 P(7 heads) = C(10, 7) · p⁷ · (1 − p)³ = (10! / (7!·3!)) · p⁷ · (1 − p)³
web.stanford.edu/~kcobb/hrp261/lecture4.ppt
10
10! 7
Likelihood    p 7 (1  p)3 
p (1  p)3
7!3!
7 
10!
log Likelihood  log
 7 log p  3 log(1  p)
7!3!
Derivative with respect to p.
d
7
3
log Likelihood  0  
dp
p 1 p
*derivative of a constant is 0
*derivative 7f(x)=7f '(x)
*derivative of log x is 1/x
Set the derivative equal to 0 and solve for p.
7
3
7(1  p)  3 p

 0

 0
 7(1  p)  3 p
p 1 p
p(1  p)
7  7 p  3p 
 7  10 p
p
7
10
74
The likelihood of observing 7 heads when tossing a biased coin with p(heads) = 0.7 and p(tails) = 0.3 ten times is:

 Likelihood = C(10, 7) · (0.7)⁷ · (0.3)³ = 120 · (0.7)⁷ · (0.3)³ ≈ 0.267
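The same numbers in R, via dbinom and by maximizing the log-likelihood numerically:

dbinom(7, size = 10, prob = 0.7)                   # 0.267
loglik <- function(p) 7 * log(p) + 3 * log(1 - p)  # constant term dropped
optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE)$maximum  # ~0.7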

Goodness of fit:
 Linear regression  R-squared
 Logistic regression  pseudo R-squared
You can “prove” anything with graphics
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation
[Figures: misleading vs. honest charts of the same global-warming data — http://www.motherjones.com/kevin-drum/2012/01/lying-charts-global-warming-edition]

Be careful when drawing conclusions from graphs

Size of effect shown in a graphic ≠ size of effect in the sample data ≠ size of the effect in the true population
 Scale distortion (e.g., bar charts not starting at zero)
 Snapshots
 …
Data Collection
Data Preprocessing
Data Analysis
Data Visualization
Data Preservation

GESIS Data Archives & Data Centers
 Preserve research data and make them accessible for reuse
 Competencies and infrastructure
 ▪ e.g., https://datorium.gesis.org/xmlui/

CESSDA
 Umbrella organisation for the European national data archives (http://www.cessda.net/)

Re3data
 Browse data archives by topic: http://www.re3data.org/

DPC Digital Preservation Handbook: http://www.dpconline.org/advice/preservationhandbook

Legal and regulatory framework
 Including open access and licenses

Incentives to share data
 Credentials? Citation principles are under development (see, e.g., http://www.datacite.org/)

Long-term preservation strategies
 Software and hardware changes, documentation, metadata and retrieval/access
 Data preservation starts at an individual level
 Reasons for data loss are often on an individual level, e.g., broken hardware or researchers leaving a group.
http://claudiawagner.info/teaching/WebSciSS2014/

Vasant Dhar. Data Science and Prediction. Communications of the ACM, December 2013, Vol. 56, No. 12, pp. 64-73

Anand Rajaraman, Jeffrey Ullman, Jure Leskovec. Mining of Massive Datasets. Cambridge University Press (free download)

Jeffrey Stanton. Introduction to Data Science (free download)

Steffen Staab. Data Science Course, University Koblenz-Landau. https://www.uni-koblenz-landau.de/campuskoblenz/fb4/west/teaching/ss14/data-science/data-science1

Thom Baguley. Serious Stats.