PHP2500: Introduction to Biostatistics
Lecture I: Introduction to statistics

What is statistics about?
"Lies, damned lies, and statistics"?
• Most smokers do not develop lung cancer in their lifetime. (The lifetime risk of developing lung cancer is only 11.6% for female smokers in the United States.)
• Smoking is responsible for 90% of lung cancer deaths.
• "Those who switch to X auto insurance save $$$ on average"

What is statistics about?
How should our observations (data) of the world affect our knowledge (inference) and hence our behavior (decision)?

Statistics: numerical summarization of data (observations) that assists in drawing conclusions and in assessing the uncertainty in those conclusions.
Statistical science:
• the design of experiments and studies
• the collection, summarization, and analysis of data
• the inference from the analysis, and the interpretation and presentation of the results

Biostatistics focuses on methods with applications in biomedical sciences, including
• public health: epidemiology, health services, environmental health
• design and analysis of clinical trials
• genetics and molecular biology

Three basic questions:
• What do the data say?
  – How should the information affect my view of the world?
• What should I believe (now that I have seen the data)?
• How should I act?

Example: For disease D, the prevalence in the United States is 5%. There is a blood screening test. The test is good but not perfect:
• for people who do have the disease, the test shows a positive result 95% of the time
• for people who do not have the disease, the test also gives a positive result 20% of the time (a false positive)

Now suppose we have a subject randomly chosen from the public.
• Before we do any test, i.e., before we collect any data on this particular person, do we have any reasonable guess about the probability of him/her having the disease?
• What is the evidence if the test result is positive? What if it is negative?
  Do we reach any definite conclusion?
• How should we update our belief/guess/estimate of the probability of him/her having the disease after the test is done?

Key elements of a statistical problem:
• Probability model: systematic and random components
• Data
• Context: assumptions, generalizability

Types of data and variables

Types of variables
• Numerical: variables that take on numerical values
  – examples: weight, age, blood pressure, body temperature, annual expenditure on health products, ...
  – computable and ordered (<, >, =)
• Categorical: variables that take values corresponding to categories
  – examples: gender, race, eye color, blood type (A, B, AB, O), final grade (A, B, C, F)
  – not computable; unordered or ordered

Types of numerical variables
• Discrete: integer values
  – years of age
  – number of children
  – number of accidents
  – number of hospital visits
• Continuous: real numbers; in theory these can be measured to any number of decimal places
  – age calculated from date of birth to study date
  – weight, height
  – alcohol consumption
  – price

Types of categorical variables
• Nominal (unordered):
  – gender: male/female
  – smoking status: smoker/nonsmoker
  – tumor: primary/metastatic
  – major: public health, biology, other
• Ordinal (ordered):
  – smoking status: nonsmoker/light/heavy
  – grades: A/B/C/F
  – Olympic medals: gold/silver/bronze
  – your steak: rare, medium, well-done
  – exercise frequency: often/rarely/never

• Many variables are naturally categorical: for example, types of occupation.
• Often you can obtain categorical variables from numerical variables by "categorizing" them into intervals: for example,
  – age → children, young adults, middle-aged, senior
  – blood pressure → low, normal, high
• In most situations there is no ambiguity about whether a variable is categorical or numerical. In some special cases it may depend on the study and/or how the data are collected. For example, education can be collected as "degrees" or "years of education".
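As a rough illustration of "categorizing" a numerical variable into intervals, the sketch below maps age in years to the four age groups named above. The cutoffs (18, 40, 65) and the names `AGE_CUTOFFS`, `AGE_LABELS`, and `categorize_age` are assumptions made up for this example; the lecture does not specify them, and a real study would define its own groups.

```python
from bisect import bisect_right

# Hypothetical cutoffs in years, chosen only for illustration.
AGE_CUTOFFS = [18, 40, 65]
AGE_LABELS = ["children", "young adults", "middle-aged", "senior"]


def categorize_age(age_years: float) -> str:
    """Map a numerical age to an ordinal age group."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many cutoffs are <= age_years,
    # which is the index of the interval the age falls into.
    return AGE_LABELS[bisect_right(AGE_CUTOFFS, age_years)]


ages = [7, 18, 39.5, 40, 64, 65, 90]
groups = [categorize_age(a) for a in ages]
for age, group in zip(ages, groups):
    print(f"{age:>5} -> {group}")
```

The result is an ordinal variable: the groups keep their order, but the exact ages (and any arithmetic on them) are lost once the intervals are applied.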
Sometimes we may use numbers to label categories. For example, male = 1 and female = 2; white = 1, black = 2, other = 3. This does not change the categorical nature of the variable:
• the labels are still not computable
• for unordered variables, they are still not ordered (consider race)
Changing the labels should not affect our inference from the data.
You should pay special attention to this when using statistical computing software. Computers do not necessarily know that a variable is categorical when you use numbers as labels, and you may need to define the variable type yourself.

Prevalence: the proportion of a population that is affected by a particular disease at a given time.

$$\text{Prevalence} = \frac{\text{No. of people having the disease in the population}}{\text{Size of the population}}$$

Examples:
◦ In 2008, the prevalence of childhood asthma was 13.3% in the United States.
◦ In 2007-2008, the prevalence of obesity in adults 20 years and over was 34% in the United States.
◦ In 2007-2008, the prevalence of obesity in children 6-11 years was 20%.
◦ In 2008, the percentage of noninstitutionalized adults with diagnosed arthritis was 23%.
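Prevalence is also the natural starting guess in the screening-test example from earlier in the lecture: for a subject chosen at random from the public, the probability of having disease D before any test is the 5% prevalence. A minimal sketch of how a test result could update that guess using Bayes' rule is given below. The function and parameter names are assumptions for illustration; the numbers (5% prevalence, 95% positive rate among the diseased, 20% false positive rate) come from the example.

```python
def posterior_disease_probability(prevalence, sensitivity,
                                  false_positive_rate, test_positive):
    """Return P(disease | test result) using Bayes' rule.

    prevalence          -- P(disease) before testing
    sensitivity         -- P(positive | disease)
    false_positive_rate -- P(positive | no disease)
    test_positive       -- True for a positive result, False for a negative one
    """
    if test_positive:
        p_result_given_disease = sensitivity
        p_result_given_healthy = false_positive_rate
    else:
        p_result_given_disease = 1 - sensitivity
        p_result_given_healthy = 1 - false_positive_rate

    # Probability of this result arising from a diseased subject,
    # divided by the total probability of this result.
    numerator = prevalence * p_result_given_disease
    denominator = numerator + (1 - prevalence) * p_result_given_healthy
    return numerator / denominator


prior = 0.05  # prevalence of disease D from the example
after_positive = posterior_disease_probability(prior, 0.95, 0.20, True)
after_negative = posterior_disease_probability(prior, 0.95, 0.20, False)

print(f"before the test:       {prior:.3f}")
print(f"after a positive test: {after_positive:.3f}")   # 0.200
print(f"after a negative test: {after_negative:.4f}")   # 0.0033
```

Under these assumptions neither result gives a definite conclusion: a positive test raises the probability from 5% to 20%, and a negative test lowers it to about 0.3%.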