Data Analysis - freshmanclinic

Dr. Hong Zhang








Tables and Graphs
Populations and Samples
Mean, Median, and Standard Deviation
Standard Error & 95% Confidence Interval (CI)
Error Bars
Comparing Means of Two Data Sets
Linear Regression (LR)
Coefficient of Correlation

Statistics is a huge field; I’ve simplified considerably here. For example:
◦ Mean, Median, and Standard Deviation
  ▪ There are alternative formulas
◦ Standard Error and the 95% Confidence Interval
  ▪ There are other ways to calculate CIs (e.g., z statistic instead of t; difference between two means rather than a single mean…)
◦ Error Bars
  ▪ Don’t go beyond the interpretations I give here!
◦ Comparing Means of Two Data Sets
  ▪ We only cover the t test for two means when the variances are unknown but equal; there are other tests
◦ Linear Regression
  ▪ We only look at simple LR and only calculate the intercept, slope, and R². There is much more to LR!

Population: all of the possible outcomes of an experiment or observation
◦ US population
◦ Cars in market

A large population may be impractical and costly
to study. It might be impossible to collect data
from every member of the population.
◦ Weight and height of every US citizen
◦ Quality of every car in market

Sample: the part of the population that we actually measure or observe in order to draw conclusions
◦ 1000 US citizens
◦ 100 cars

We use samples to estimate population
properties
◦ Use 1000 US citizens to estimate the height
of entire US population
◦ Use 100 cars to estimate quality of all Toyota
Corolla cars under 3 years old

A sample should fully represent the entire population.
◦ Good
  ▪ Randomly select 1000 names from a phone book to represent the region
  ▪ Randomly select 100 cars from DMV records
◦ Bad
  ▪ Use a college campus to represent the country
  ▪ Use cars in a dealer's lot to represent cars in the market
  ▪ Reporters randomly stop 3 people on the street for opinions


Mean: the sum of the values divided by the number of samples, also called the average
Example:
◦ Data: 3, 8, 5, 10, 4, 6
◦ Sum = 3+8+5+10+4+6 = 36
◦ Number of samples (data points) = 6
◦ Mean = 36 / 6 = 6
Exercise
◦ Mean of height of the entire class
◦ Average commute time of the students
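The worked example above can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the slides.

```python
# Worked example from the slide: the mean is the sum divided by the count.
data = [3, 8, 5, 10, 4, 6]

total = sum(data)      # 3+8+5+10+4+6
count = len(data)      # number of samples (data points)
mean = total / count   # sum divided by number of samples

print(f"Sum = {total}, n = {count}, mean = {mean}")
```

The same three lines work for the class exercises once the heights or commute times are typed into the list.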




Bill Gates comes to give a presentation to 100
students in Rowan Auditorium.
Suppose the personal wealth of Bill Gates is
$50 billion.
The personal wealth of each student is $0.
What is the mean of the personal wealth for
the entire population in the room?


Median: the value of the middle item when the data are arranged in
increasing or decreasing order of magnitude
Example:
◦ Data: 3, 8, 5, 10, 4, 6
◦ Rearrange: 3, 4, 5, 6, 8, 10
◦ The middle two are 5 & 6; the average of the two is 5.5
◦ The median of the data set is 5.5

Exercise:
◦ Median height of the class
◦ Median commute time of the class
◦ Median personal wealth in the room with Bill Gates
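A short sketch of how the mean and the middle value can disagree, using Python's standard `statistics` module. It assumes the room holds exactly the 100 students plus Bill Gates (101 people); that head count is an assumption made for illustration.

```python
from statistics import mean, median

# Slide example: an even number of points, so the middle value is the
# average of the two central values (5 and 6).
data = [3, 8, 5, 10, 4, 6]
print(sorted(data), median(data))

# Bill Gates example: 100 students with $0 each plus one person with
# $50 billion (assumption: the room holds exactly these 101 people).
wealth = [0] * 100 + [50_000_000_000]
room_mean = mean(wealth)      # pulled far upward by a single value
room_median = median(wealth)  # the middle person in the sorted list
print(f"mean = ${room_mean:,.0f}, median = ${room_median:,.0f}")
```

One extreme value moves the mean to roughly $495 million per person, while the middle value stays at $0.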
[Figure: the data points 3, 8, 5, 10, 4, 6 plotted together with their mean and median]

Standard deviation of the mean (standard error)
◦ Sample size n
◦ Taken from a population with standard deviation s
$$ s_{\bar{X}} = \frac{s}{\sqrt{n}} $$
◦ The estimate of the mean depends on the sample selected
◦ As n increases, the variance of the mean estimate goes down, i.e., the estimate of the population mean improves
◦ As n increases, the distribution of the mean estimate approaches normal, regardless of the population distribution
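As a rough illustration, the standard error and a 95% confidence interval for the small data set used in the earlier examples (3, 8, 5, 10, 4, 6) could be computed as below. The critical t value 2.571 (two-sided 95%, 5 degrees of freedom) is hard-coded from a standard t table; that choice, and all variable names, are assumptions of this sketch.

```python
from math import sqrt
from statistics import mean, stdev

data = [3, 8, 5, 10, 4, 6]
n = len(data)

x_bar = mean(data)         # sample mean
s = stdev(data)            # sample standard deviation (n - 1 form)
std_error = s / sqrt(n)    # standard deviation of the mean

# Two-sided 95% critical t value for n - 1 = 5 degrees of freedom,
# taken from a standard t table (hard-coded assumption for this data).
T_95_DF5 = 2.571

half_width = T_95_DF5 * std_error
ci_low, ci_high = x_bar - half_width, x_bar + half_width
print(f"mean = {x_bar:.2f}, SE = {std_error:.3f}")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
```

A different sample size needs a different t value, so the constant should be replaced whenever n changes.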


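Population standard deviation:
$$ \sigma = \left[ \frac{\sum (x_i - \mu)^2}{n} \right]^{1/2} $$
μ: Mean, n: Sample size, xi: Data point

Sample standard deviation:
$$ s = \left[ \frac{\sum (x_i - \bar{x})^2}{n} \right]^{1/2} \quad \text{for } n > 30 $$
$$ s = \left[ \frac{\sum (x_i - \bar{x})^2}{n - 1} \right]^{1/2} \quad \text{for } n < 30 $$
Variance: $S = s^2$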
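To see how much the choice of divisor matters for a small sample, here is a minimal sketch that evaluates both forms on the data set 3, 8, 5, 10, 4, 6 from the earlier examples. The variable names are illustrative only.

```python
from math import sqrt

data = [3, 8, 5, 10, 4, 6]
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the mean.
ss = sum((x - x_bar) ** 2 for x in data)

sd_n = sqrt(ss / n)                  # divide by n (large-sample form)
sd_n_minus_1 = sqrt(ss / (n - 1))    # divide by n - 1 (small-sample form)
variance = sd_n_minus_1 ** 2         # variance is the square of s

print(f"sum of squares = {ss}")
print(f"s (n) = {sd_n:.4f}, s (n-1) = {sd_n_minus_1:.4f}, variance = {variance:.2f}")
```

With only six points the two results differ by about 10%; for large n the difference becomes negligible.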

Data: 70 69 60 65 72 80 75 64 68 85 66 72
[Figure: histogram of Frequency by bin: <60, 60~65, 65~70, 70~75, 75~80]
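A frequency table like the one behind the histogram can be built by counting how many data points fall in each bin. The sketch below assumes bins 5 units wide that include their lower edge (so 65 counts in 65~70); the slides do not say how edge values were assigned, so that rule is an assumption.

```python
from collections import Counter

data = [70, 69, 60, 65, 72, 80, 75, 64, 68, 85, 66, 72]
BIN_WIDTH = 5  # assumption: bins are 5 units wide, lower edge included

# Map each value to the lower edge of its bin, then count per bin.
counts = Counter((x // BIN_WIDTH) * BIN_WIDTH for x in data)

for low in sorted(counts):
    label = f"{low}~{low + BIN_WIDTH}"
    print(f"{label:>7} {'#' * counts[low]} ({counts[low]})")
```

The printed rows of `#` characters give a quick text version of the histogram.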

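Flip a coin: the chances of landing face up and face down are equal, 50% each.
(The number of "up" results in repeated flips follows a binomial distribution.)
[Figure: bar chart showing up and down at 50% each]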

Normal distribution
◦ Women’s shoe sizes sold by a shoe store
◦ Chemical distribution of a well-mixed compound
$$ Y = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$
where x is a normal random variable, μ is the
mean, σ is the standard deviation, π is
approximately 3.14159, and e is
approximately 2.71828.
Nσ   Confidence Intervals   Error per million
1    0.682689492137         317310.5079
2    0.954499736104         45500.2639
3    0.997300203937         2699.796063
4    0.999936657516         63.342484
5    0.999999426697         0.573303
6    0.999999998027         0.001973
The last row (6σ) is the "6 sigma" level.
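The values in this table can be reproduced from the normal distribution with the error function in Python's `math` module: the fraction within ±Nσ of the mean is erf(N/√2). The function names below are made up for this sketch.

```python
from math import erf, erfc, sqrt

def coverage(n_sigma: float) -> float:
    """Fraction of a normal population within n_sigma standard deviations of the mean."""
    return erf(n_sigma / sqrt(2))

def errors_per_million(n_sigma: float) -> float:
    """Expected count outside +/- n_sigma per million items.

    Uses erfc rather than 1 - erf so precision is kept for large n_sigma.
    """
    return erfc(n_sigma / sqrt(2)) * 1e6

for k in range(1, 7):
    print(f"{k}  {coverage(k):.12f}  {errors_per_million(k):.6f}")
```

Using `erfc` directly avoids subtracting two nearly equal numbers at 5σ and 6σ.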



Rank k has a frequency roughly proportional
to 1/k, or more accurately
$$ P_n = \frac{a}{n^b} $$
Developed by George Kingsley Zipf
Occurs naturally in many situations
◦ City population
◦ Colors in images
◦ Call center
◦ Website traffic
Zipf Distribution
Rank  Word  Freq   % Freq  Theoretical
1     the   69970  6.8872  69970
2     of    36410  3.5839  36470
3     and   28854  2.8401  24912
4     to    26154  2.5744  19009
5     a     23363  2.2996  15412
6     in    21345  2.1010  12985
7     that  10594  1.0428  11233
8     is    10102  0.9943  9908
9     was   9815   0.9661  8870
10    he    9542   0.9392  8033
11    for   9489   0.9340  7345
12    it    8760   0.8623  6768
13    with  7290   0.7176  6277
14    as    7251   0.7137  5855
15    his   6996   0.6886  5487
16    on    6742   0.6636  5164
17    be    6376   0.6276  4878
18    at    5377   0.5293  4623
19    by    5307   0.5224  4394
20    I     5180   0.5099  4187

If a distribution gives us a straight line on a
log-log scale, then we can say that it is a Zipf
Distribution.
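One way to check the straight-line behavior is to fit a least-squares line to log(frequency) versus log(rank) for the 20 word frequencies listed above. The sketch below is an illustration only; the ordinary least-squares fit is an assumption and may not be how the "Theoretical" column in the slides was produced.

```python
from math import log10

# Word frequencies for ranks 1..20, from the table in the slides.
freqs = [69970, 36410, 28854, 26154, 23363, 21345, 10594, 10102, 9815, 9542,
         9489, 8760, 7290, 7251, 6996, 6742, 6376, 5377, 5307, 5180]

xs = [log10(rank) for rank in range(1, len(freqs) + 1)]
ys = [log10(f) for f in freqs]

# Ordinary least-squares line through the log-log points.
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

b = -slope             # exponent in P_n = a / n^b
a = 10 ** intercept    # scale factor in P_n = a / n^b
print(f"b = {b:.2f}, a = {a:.0f}")
```

An exponent b close to 1 means the frequency falls off roughly as 1/rank.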

Count the vehicles in Rowan Parking lots
◦ Distribution of colors
◦ Distribution of cars and trucks
◦ Distribution of last letter (digit) of license number





Select a parking lot
Design a strategy to count
Design a method to record data
Design a method to represent result
Write a one page report per group







White: 2
Black: 1
Red: 2
Blue: 2
Silver: 4
Gold: 1
Beige: 1
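One possible way to record and summarize such a count is sketched below. The list of observations is hypothetical; it was constructed only to match the color tallies shown above.

```python
from collections import Counter

# Hypothetical raw observations, chosen to match the tallies on the slide.
observed = (["White"] * 2 + ["Black"] + ["Red"] * 2 + ["Blue"] * 2
            + ["Silver"] * 4 + ["Gold"] + ["Beige"])

counts = Counter(observed)
total = sum(counts.values())

# Print each color with its count and its share of all vehicles.
for color, count in counts.most_common():
    print(f"{color:<7}{count:>3}{100 * count / total:>7.1f}%")
```

In the field, each vehicle would be appended to the list as it is counted, and the summary printed at the end.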
Result for Pressure Transducer Calibration
Voltage (V)  Height (in)
2.34         8.69
2.56         11.88
2.79         15.19
2.98         17.88
3.13         19.94
3.27         22.06
3.47         25.00
3.62         27.06
[Figure: Pressure Transducer Calibration. Height (in) vs. Output Voltage (V)]
[Figure: Pressure Transducer Calibration with linear fit. Height (in) vs. Output Voltage (V); y = 14.361x - 24.908, R² = 0.9999]
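The fitted line and R² shown on the calibration plot can be reproduced with a simple least-squares calculation on the calibration data. This is a minimal sketch in plain Python; the slides do not say which tool produced the trendline, so the method here is an assumption.

```python
# Pressure transducer calibration data from the slides.
voltage = [2.34, 2.56, 2.79, 2.98, 3.13, 3.27, 3.47, 3.62]        # V
height = [8.69, 11.88, 15.19, 17.88, 19.94, 22.06, 25.00, 27.06]  # in

n = len(voltage)
x_mean, y_mean = sum(voltage) / n, sum(height) / n

# Sums of squares and cross products about the means.
sxx = sum((x - x_mean) ** 2 for x in voltage)
syy = sum((y - y_mean) ** 2 for y in height)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(voltage, height))

slope = sxy / sxx
intercept = y_mean - slope * x_mean
r_squared = sxy ** 2 / (sxx * syy)   # square of the correlation coefficient

print(f"y = {slope:.3f}x + ({intercept:.3f}), R^2 = {r_squared:.4f}")
```

The slope and intercept agree with the trendline y = 14.361x - 24.908 to the precision shown.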
Time (s)  Voltage (V)
0         10
1         6.1
2         3.7
3         2.2
4         1.4
5         0.8
6         0.5
7         0.3
8         0.2
9         0.1
10        0.07
12        0.03
Time (s)  Voltage (V)  log(Voltage)
0         10           1.00
1         6.1          0.79
2         3.7          0.57
3         2.2          0.34
4         1.4          0.15
5         0.8          -0.10
6         0.5          -0.30
7         0.3          -0.52
8         0.2          -0.70
9         0.1          -1.00
10        0.07         -1.15
12        0.03         -1.52
[Figure: Capacitor Discharge Rate: Semilog Coordinates. Voltage (V) on a logarithmic axis vs. Time (s)]
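Because the discharge data fall close to a straight line on semilog coordinates, a line can be fitted to log10(Voltage) versus time. The sketch below recomputes the logarithms from the voltage column and fits the line by least squares; the model V(t) ≈ V0 · 10^(slope·t) is an assumption used for illustration.

```python
from math import log10

# Capacitor discharge data from the slides.
time_s = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
volts = [10, 6.1, 3.7, 2.2, 1.4, 0.8, 0.5, 0.3, 0.2, 0.1, 0.07, 0.03]
log_v = [log10(v) for v in volts]

# Least-squares line: log10(V) = intercept + slope * t.
n = len(time_s)
t_mean, l_mean = sum(time_s) / n, sum(log_v) / n
stt = sum((t - t_mean) ** 2 for t in time_s)
stl = sum((t - t_mean) * (l - l_mean) for t, l in zip(time_s, log_v))
slope = stl / stt                    # decades of voltage per second
intercept = l_mean - slope * t_mean
v0 = 10 ** intercept                 # fitted voltage at t = 0

print(f"log10(V) = {intercept:.2f} + ({slope:.3f}) t, V0 = {v0:.1f} V")
```

A negative slope of about a fifth of a decade per second means the voltage drops by a factor of 10 roughly every 5 seconds.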
Reaction Rate for Polymer Production
Concentration (mol/ft³)  Reaction Rate (mol/s)
100.0                    2.8500
80.0                     2.0000
60.0                     1.2500
40.0                     0.6700
20.0                     0.2200
10.0                     0.0720
5.0                      0.0240
1.0                      0.0018
[Figure: Reaction Rate (mol/s) vs. Concentration (mol/ft³)]
Concentration  Reaction Rate  log (concentration)  log (reaction rate)
100.0          2.8500         2.00                 0.45
80.0           2.0000         1.90                 0.30
60.0           1.2500         1.78                 0.10
40.0           0.6700         1.60                 -0.17
20.0           0.2200         1.30                 -0.66
10.0           0.0720         1.00                 -1.14
5.0            0.0240         0.70                 -1.62
1.0            0.0018         0.00                 -2.74
[Figure: Polymer Reaction Rate: log plot. Reaction Rate (mol/s) vs. Concentration (mol/ft³), both on logarithmic axes]
[Figure: Polymer Reaction Rate: Cartesian Coordinates. log [Reaction Rate (mol/s)] vs. log [Concentration (mol/ft³)]]
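When data fall on a straight line on log-log coordinates, they follow a power law, rate ≈ k · concentration^m, and the slope of the line is the exponent m. The sketch below fits that line to the polymer data by least squares; the symbols k and m are introduced here for illustration and do not appear in the slides.

```python
from math import log10

# Polymer reaction data from the slides.
conc = [100.0, 80.0, 60.0, 40.0, 20.0, 10.0, 5.0, 1.0]                # mol/ft^3
rate = [2.85, 2.0, 1.25, 0.67, 0.22, 0.072, 0.024, 0.0018]            # mol/s

xs = [log10(c) for c in conc]
ys = [log10(r) for r in rate]

# Least-squares line: log10(rate) = log10(k) + m * log10(conc).
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
m = sxy / sxx                      # power-law exponent (slope on log-log axes)
k = 10 ** (y_mean - m * x_mean)    # rate at a concentration of 1 mol/ft^3

print(f"rate = {k:.4f} * conc^{m:.2f}")
```

The fitted k can be compared with the measured rate at a concentration of 1.0 as a quick sanity check.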
Table 1: Average Turbidity and Color of Water Treated by Portable Water Filter
Water       Turbidity (NTU)  True Color (Pt-Co)  Apparent Color (Pt-Co)
(1)         (2)              (3)                 (4)
Pond Water  10               13                  30
Sweetwater  4                4                   55 12 12
Hiker       3                8                   11
MiniWorks   2                3                   5
Standard    5a               15                  15
a Level at which humans can visually detect turbidity
Consistent Format, Title, Units, Big Fonts
Differentiate Headings, Number Columns
Consistent Format, Title, Units
Good Axis Titles, Big Fonts
[Figure: two bar charts of Turbidity (NTU) by Filter: Pond Water, Sweetwater, Miniworks, Hiker, Pioneer, Voyager]
Figure 1: Turbidity of Pond Water, Treated and Untreated