Download Lec1 - Center for Statistical Sciences

PHP2500: Introduction to Biostatistics
Lecture I: Introduction to statistics
What is statistics about?
“Lies, damned lies, and statistics”?
• Most smokers do not develop lung cancer in their lifetime.
(The lifetime risk of developing lung cancer is only 11.6% for
female smokers in the United States.)
• Smoking is responsible for 90% of lung cancer deaths.
• “Those who switch to X auto insurance save $$$ on average.”
What is statistics about?
How should our observations (data) of the world affect our
knowledge (inference) and hence our behavior (decision)?
Statistics: numerical summarization of data (observations) that
assists in drawing conclusions and in assessing the uncertainty in
those conclusions.
Statistical Science:
• The design of experiments and studies
• The collection, summarization and analysis of data
• The inference from the analysis, and the interpretation and
presentation of the results
Biostatistics focuses on methods with applications in biomedical
sciences, including
• public health: epidemiology, health services, environmental
health
• design and analysis of clinical trials
• genetics and molecular biology
Three basic questions:
• What do the data say? – How should the information affect my
view of the world?
• What should I believe (now that I have seen the data)?
• How should I act?
Example: For disease D, the prevalence in the United States is 5%.
There is a blood screening test. The test is good but not perfect:
• for people who do have the disease, the test shows a positive
result 95% of the time
• for people who do not have the disease, the test still gives a
positive result 20% of the time (a false positive).
Now suppose we have a subject randomly chosen from the public.
• Before we do any test, i.e., before we collect any data on
this particular person, do we have any reasonable guess about
the probability of him/her having the disease?
• What is the evidence if the test result is positive? What if it is
negative? Do we reach any definite conclusion?
• How should we update our belief/guess/estimate of the
probability of him/her having the disease after the test is done?
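One way to work through these questions numerically is Bayes' rule. The sketch below uses the numbers given in the example (5% prevalence, 95% positive rate among the diseased, 20% false positive rate); the function and argument names are made up for illustration and are not part of the lecture.

```python
def posterior_disease_probability(prevalence, sensitivity,
                                  false_positive_rate, test_positive):
    """Return P(disease | test result) using Bayes' rule."""
    if test_positive:
        p_result_given_disease = sensitivity
        p_result_given_healthy = false_positive_rate
    else:
        p_result_given_disease = 1.0 - sensitivity
        p_result_given_healthy = 1.0 - false_positive_rate
    # P(result and disease) divided by P(result)
    numerator = prevalence * p_result_given_disease
    denominator = numerator + (1.0 - prevalence) * p_result_given_healthy
    return numerator / denominator


# Numbers from the example: prevalence 5%, sensitivity 95%,
# false positive rate 20%
p_pos = posterior_disease_probability(0.05, 0.95, 0.20, test_positive=True)
p_neg = posterior_disease_probability(0.05, 0.95, 0.20, test_positive=False)
print(f"P(disease | positive test) = {p_pos:.3f}")
print(f"P(disease | negative test) = {p_neg:.4f}")
```

Under these numbers, a positive test raises the probability of disease from the prior guess of 5% to 20%, and a negative test lowers it to about 0.3%. Neither result gives a definite conclusion; the data only update the probability.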
Key elements of a statistical problem:
• Probability model: systematic and random components
• Data
• Context: assumptions, generalizability
Types of data and variables
Types of variables
• Numerical: variables that take on numerical values.
– Examples: weight, age, blood pressure, body temperature,
annual expenditure on health products, ...
– computable and ordered (<, >, =)
• Categorical: variables that take values corresponding to
categories
– Examples: gender, race, eye color, blood type (A, B, AB, O),
final grade (A, B, C, F)
– not computable; unordered or ordered
Types of numerical variables
• Discrete: integer values
– years of age
– number of children
– number of accidents
– number of hospital visits
• Continuous: real numbers; in theory these can be measured to
any number of decimal places
– age calculated from date of birth to study date
– weight, height
– alcohol consumption
– price
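Age appears in both lists above, and a short sketch can show why. The dates below are hypothetical, and the 365.25-day year is an approximation chosen for illustration.

```python
from datetime import date


def age_in_completed_years(date_of_birth, study_date):
    """Discrete age: number of birthdays reached by the study date."""
    years = study_date.year - date_of_birth.year
    # Subtract one year if this year's birthday has not yet occurred
    if (study_date.month, study_date.day) < (date_of_birth.month,
                                             date_of_birth.day):
        years -= 1
    return years


def age_in_fractional_years(date_of_birth, study_date):
    """Continuous age: elapsed days over an average year length
    (365.25 days, an approximation)."""
    return (study_date - date_of_birth).days / 365.25


dob = date(1990, 7, 15)    # hypothetical date of birth
study = date(2010, 3, 1)   # hypothetical study date
discrete_age = age_in_completed_years(dob, study)
continuous_age = age_in_fractional_years(dob, study)
print(discrete_age, round(continuous_age, 2))
```

The first function returns an integer (years of age), while the second returns a real number between 19 and 20 for the same person.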
Types of Categorical variables
• Nominal (unordered):
– gender: male/female
– smoking status: smoker/nonsmoker
– tumor: primary/metastatic
– major: public health, biology, other
• Ordinal (ordered)
– smoking status: nonsmoker/light/heavy
– grades: A/B/C/F
– Olympic medals: gold/silver/bronze
– your steak: rare, medium, well-done
– exercise frequency: often/rarely/never
• Many variables are naturally categorical: for example, types of
occupation
• Often you can obtain categorical variables from numerical
variables by “categorizing” them into intervals,
for example,
– age → children, young adults, middle-aged, senior
– blood pressure → low, normal, high
• In most situations there is no ambiguity about whether a variable
is categorical or numerical. In some special cases it may depend
on the study and/or how the data are collected. For example,
education can be collected as “degrees” or “years of education”.
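Categorizing a numerical variable into intervals can be sketched in a few lines. The age cutoffs below (18, 40, 65) are hypothetical values picked only for illustration; the lecture does not specify any.

```python
from bisect import bisect_right

# Hypothetical age cutoffs (in years), chosen only for illustration
AGE_CUTOFFS = [18, 40, 65]
AGE_LABELS = ["children", "young adults", "middle-aged", "senior"]


def categorize_age(age_years):
    """Map a numerical age to an ordinal category.

    bisect_right counts how many cutoffs are <= age_years, which is
    the index of the interval the age falls into.
    """
    return AGE_LABELS[bisect_right(AGE_CUTOFFS, age_years)]


ages = [7, 18, 39.5, 64, 80]
categories = [categorize_age(a) for a in ages]
print(categories)
```

The resulting variable is ordinal: the categories keep the ordering of the underlying ages, but the exact numerical values are lost.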
Sometimes we may use numbers to label categories. For example,
male=1 and female =2; white=1, black=2, other=3.
This does not change the nature of the variable being categorical:
• they are still not computable
• for unordered variables, they are still not ordered (consider
race)
Changing the labels should not affect our inference from the data.
You should pay special attention to this when using statistical
computing software. A computer does not necessarily know that a
variable is categorical when you use numbers as labels, so you may
need to define the variable type explicitly.
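A small sketch of why numeric labels are not computable: the sample and the two codings below are hypothetical. Averaging the codes gives a number that depends on the arbitrary labels, while the category counts do not.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample of a nominal variable
race = ["white", "black", "other", "white", "white", "black"]

coding_a = {"white": 1, "black": 2, "other": 3}
coding_b = {"white": 3, "black": 1, "other": 2}  # same categories, relabeled

codes_a = [coding_a[r] for r in race]
codes_b = [coding_b[r] for r in race]

# The "mean" of the codes changes with the arbitrary labels,
# so it carries no information about the data
mean_a = mean(codes_a)
mean_b = mean(codes_b)

# Category frequencies are the same under either coding
decode_a = {code: label for label, code in coding_a.items()}
decode_b = {code: label for label, code in coding_b.items()}
counts_a = Counter(decode_a[c] for c in codes_a)
counts_b = Counter(decode_b[c] for c in codes_b)

print(mean_a, mean_b)
print(counts_a == counts_b)
```

Summaries based on counts are unaffected by relabeling, which is the behavior we want from any inference on a categorical variable.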
Prevalence: The proportion of a population that is affected by a
particular disease at a given time.
$$\text{Prevalence} = \frac{\text{No. of people having the disease in the population}}{\text{Size of the population}}$$
Examples:
◦ In 2008, child asthma prevalence was 13.3% in the United States.
◦ In 2007-2008, the prevalence of obesity in adults 20 years and over
was 34% in the United States.
◦ In 2007-2008, the prevalence of obesity in children 6-11 years was 20%.
◦ In 2008, the percentage of noninstitutionalized adults with
diagnosed arthritis was 23%.
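The definition of prevalence translates directly into a one-line computation. The counts below are hypothetical and chosen only to reproduce a figure of the same size as the asthma example; they are not the survey data behind it.

```python
def prevalence(num_cases, population_size):
    """Proportion of the population with the disease at a given time."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 <= num_cases <= population_size:
        raise ValueError("num_cases must be between 0 and population_size")
    return num_cases / population_size


# Hypothetical counts: 133 affected children in a population of 1000
p = prevalence(133, 1000)
print(f"Prevalence = {p:.1%}")
```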