Hygiene and Preventive Medicine Institute
University of Sassari Medical School, Italy

Simple statistics for clinicians on respiratory research
By Giovanni Sotgiu

What are your expectations?
Too difficult to explain medical statistics in 30 min...

What is medical statistics?
• "..Discipline concerned with the treatment of numerical data derived from groups of individuals.." P Armitage
• "..Art of dealing with variation in data through collection, classification and analysis in such a way as to obtain reliable results.." JM Last
• Collection of statistical procedures well-suited to the analysis of healthcare-related data

Why we need to study statistics in the field of medicine...
1) Basic requirement of medical research
2) Update your medical knowledge
3) Data management and treatment

Road map
1) Basic concepts
2) Sample and population
3) Probability
4) Data description
5) Measures of disease

Basic concepts

1. Homogeneity
All individuals have similar values or belong to the same category.
Ex.: all individuals are Chinese, women, middle-aged (30~40 years old) and work in the same factory → homogeneity in nationality, gender, age and occupation

1. Variation
Differences in height, weight, treatment...
• Toss a coin: the marked face may be up or down
• Treat patients suffering from TB with the same antibiotics: some of them recover and others do not
No variation, no statistics

What is the target of our studies? Population

2. Population
The whole collection of individuals that one intends to study
→ economic issues
→ short time

2. Population and sample

2. Sample
A representative part of the population

Sampling
By chance!
Random
• Random event → the event may or may not occur in one experiment; before the experiment, nobody is sure whether the event will occur or not
Please give some examples of random events...

The mathematical procedures whereby we convert information about the sample into intelligent guesses about the population fall under the section of inferential statistics (generalization)

Probability

3. Probability
Measures the possibility of occurrence of a random event
P(A) = (number of ways event A can occur) / (total number of possible outcomes)

Estimation of probability → frequency
Number of observations: n (large enough)
Number of occurrences of random event A: m
P(A) ≈ m/n (relative frequency theory)

3. Probability
A: a random event
P(A): probability of the random event A
P(A) = 1 if an event always occurs
P(A) = 0 if an event never occurs
Please give some examples of the probability of a random event and of the frequency of that random event

Parameters and statistics

4. Parameter
A measurement describing some characteristic of a population, or a measurement of the distribution of a characteristic of a population
• Greek letter (μ, π, etc.)
• Usually unknown
• To know the parameter of a population we need a sample

4. Statistic
A measurement describing some characteristic of a sample, or a measurement of the distribution of a characteristic of a sample
• Latin letter (s, p, etc.)
Please give an example of a parameter and of a statistic
Does a parameter vary? Does a statistic vary?

Sampling error

5. Sampling error
Difference between observed value and true value
1) Systematic error (fixed)
2) Measurement error (random)
3) Sampling error (random)

Sampling error
• The statistic differs from the parameter!
• The statistics of different samples from the same population differ from each other!
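
The relative-frequency estimate P(A) ≈ m/n and the two statements about sampling error can be illustrated with a short simulation. The following is only a minimal sketch in Python and is not part of the original slides: the function names, the fair coin, the hypothetical population of 10,000 heights, the sample size of 50 and the random seeds are all assumptions chosen for illustration.

```python
import random
import statistics


def relative_frequency(n, p_true=0.5, seed=0):
    """Estimate P(A) as m/n from n simulated trials of a random event A.

    p_true is the (normally unknown) probability of A; 0.5 mimics a fair coin.
    """
    rng = random.Random(seed)
    m = sum(1 for _ in range(n) if rng.random() < p_true)  # occurrences of A
    return m / n


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw several simple random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    # m/n is computed for an increasing number of observations n.
    for n in (10, 100, 10_000):
        print(f"n = {n:>6}: m/n = {relative_frequency(n):.3f}")

    # Hypothetical population of heights in cm (assumed values, not real data).
    rng = random.Random(1)
    population = [rng.gauss(170, 10) for _ in range(10_000)]
    mu = statistics.mean(population)  # the parameter
    print(f"parameter (population mean): {mu:.2f}")

    # Each sample gives a different statistic, and none equals the parameter.
    for i, mean in enumerate(sample_means(population, 50, 5), start=1):
        print(f"sample {i}: mean = {mean:.2f}, sampling error = {mean - mu:+.2f}")
```

With a large n the relative frequency is typically close to the assumed probability, and the printed sample means differ both from the population mean and from each other, which is the sampling error described above.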
Sampling error
• Sampling error exists in any research based on samples
• It cannot be avoided but may be estimated

Nature of data

Variables and data
• Variables are labels whose value can literally vary
• Data are the values you get from observing, measuring, counting, assessing, etc.

Data
• Categorical data: nominal data, ordinal data
• Metric data: discrete data, continuous data

Nominal categorical data
• Can be allocated into one of a number of categories
• Blood type, sex, linezolid treatment (y/n)
• Data cannot be arranged in an ordering scheme

Ordinal categorical data
• Can be allocated to one of a number of categories, and the categories have a meaningful order
• Differences between categories cannot be determined or are meaningless
• Very satisfied, satisfied, neutral, unsatisfied, very unsatisfied (new treatment)

Discrete metric data
• Countable variables → number of possible values is a finite number
• Number of days of hospitalization
• Number of men treated with isoniazid

Continuous metric data
• Measurable variables
• Infinitely many possible values → continuous scale covering a range of values without gaps
• kg, m, mmHg, years

Describing data... with tables
1) actual frequency
2) relative and cumulative frequency
3) grouped frequency
4) open-ended groups
5) cross-tabulation

1) Frequency table
Frequency distribution: variables and frequency

TB mortality (%)   Tally                        No. of wards
11.2-15.1          1, 1, 1, 1, 1, 1, 1, 1, 1    9
15.2-20.1          1, 1, 1, 1, 1, 1, 1, 1       8
20.2-25.1          1, 1, 1, 1, 1                5
25.2-30.1          1, 1, 1                      3
30.2-35.1          1, 1                         2

2) Relative frequency, cumulative frequency
Relative frequency → proportion of the total

No. of resistances   No. of patients   Relative frequency (%)   Cumulative frequency (%)
0                    5                 12.5                     12.5
1                    6                 15                       27.5
2                    14                35                       62.5
3                    10                25                       87.5
4                    3                 7.5                      95
7                    1                 2.5                      97.5
8                    1                 2.5                      100

3) Grouped frequency
Grouped frequency → works for continuous metric data
Group width of 300 g; each group has a class lower limit and a class upper limit

Birth weight (g)   No. of infants born from mothers with TB
2700-2999          2
3000-3299          3
3300-3599          9
3600-3899          9
3900-4199          4
4200-4499          3

General rules
• Frequency table → nominal, ordinal and discrete metric data
• Grouped frequency table → continuous metric data

4) Open-ended group
• One or more values, called outliers, lie a long way from the general mass of the data
• Use ≤ or ≥

5) Cross-tabulation
• Two variables within a single group of individuals

                   TB/HIV+
Pulmonary mass     Yes    No    Totals
Benign             21     11    32
Malignant          4      4     8
Totals             25     15    40

Describing data... with charts
1) Charting nominal data
   a) pie chart
   b) simple bar chart
   c) clustered bar chart
   d) stacked bar chart
2) Charting ordinal data
   a) pie chart
   b) bar chart
   c) dotplot
3) Charting discrete metric data
4) Charting continuous metric data → histogram
5) Charting cumulative ordinal or discrete metric data → step chart
6) Charting cumulative continuous metric data → cumulative frequency curve or ogive
7) Charting time-based data → time-series chart

1-a) Pie chart
• 4-5 categories
• One variable
• Start at 0° in the same order as the table
[Pie chart: Adverse events of ethionamide. Cough 55 (55%); Hepatitis 21 (21%); Rash 20 (20%); Neuropathy 4 (4%)]

1-b) Simple bar chart
• Same widths, equal spaces between bars

1-c) Clustered bar chart

1-d) Stacked bar chart

2-3) Dot-plot
Useful with ordinal variables if the number of categories is too large for a bar chart

4) Histogram
[Figure: Percentage age distribution of pregnant women with TB. y-axis: % of TB cases (0-40); x-axis: age groups <19, 20-24, 25-29, 30-34, >35]

6) Cumulative frequency curve
[Figure: Cumulative percentage frequency curves of age for males and females who develop TB. y-axis: 0-100%; x-axis: age groups from 15-24 to >85]

Describing data from its distributional shape

Symmetric mound-shaped distributions

Skewed distributions
[Figure: Age distribution for migrants who develop TB. y-axis: 0-160; x-axis: age groups 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, >85]

Bimodal distributions
A bimodal distribution is one with two distinct humps

Normal-ness
• Symmetric
• Same mean, median, mode

Describing data with numeric summary values
• 1. numbers, proportions (percentages)
• 2. summary measures of location
• 3. summary measures of spread

Numbers and proportions
• Numbers → actual frequencies
• Percentage is a proportion multiplied by 100
1) Prevalence
2) Incidence

Prevalence
- Nature: relative frequency
- Number of existing cases in some population at a given time (t0)

Prevalence = (No. of existing cases of a disease at t0) / (total population)
Range: 0 to 1

Population A (N=6): fa = 1, fr = 0.17
Population B (N=4): fa = 1, fr = 0.25
Absolute frequencies (fa) → no comparison
Relative frequencies (fr) → comparison

[Figure: three populations of diseased and healthy individuals, with prevalence P = 0, P = 0.25 and P = 1]

Prevalence data:
- Highlight the time of the evaluation
Example: P (2010) = 0.17, i.e. P (2010) = 17 per 100 individuals

Incidence
Estimates the risk of developing disease
[Diagram: people at risk (healthy) at t0 → disease or health at t1]

Incidence = (No. of new cases during t0-t1) / (total population at risk)
- Measures the probability or risk of developing disease during a given time period
- Absolute risk → probability of developing an adverse event

Incidence
- Assess the health status at baseline → exclude prevalent cases at t0
- Define a follow-up for the cohort → healthy people followed up for a given time period

Cohort
Closed population: adds no new members over time, and loses members only to disease/death
Open population: may gain members over time, through immigration or birth, or lose members through emigration

Cumulative incidence
- Closed population
- Individual time period at risk → same period for all the members
[Diagram: people A-E followed from t0 (time 0) to t1 (time 3)]

Cumulative incidence = (No. of new cases during t0-t1) / (total population at risk)

Cumulative incidence
Example: population at risk at t0 = 24; new cases = 3; follow-up = 3 years
CI in 3 years = 3/24 = 0.125 new cases per individual at risk enrolled at t0
→ 12.5 new cases per 100 individuals at risk enrolled at t0

Cumulative incidence... critical features
- Closed population → rare
- Short follow-up and enrollment of a few individuals
- Open population

Open population
- Non-cases (drop-outs) and cases during the follow-up
- Enrollment of new individuals during the follow-up
- Length of follow-up not uniform
[Diagram: people A-I followed between t0 and t1, with drop-outs and cases marked along the time axis]

Dynamic cohort
Individual time period at risk not uniform → estimate the population at risk:
- Total person-time
- Estimate of the total person-time

Dynamic cohort
Total person-time → Σ individual time periods at risk
Person-time: person-days, person-months, person-years

Density of incidence = (No. of new cases during t0-t1) / (total person-time)

Density of incidence

N             Individual time period at risk (years)   Calculation              Person-years
1 (A)         5                                        1 person x 5 years       5
3 (B, C, D)   2                                        3 persons x 2 years      6
2 (E, F)      2.5                                      2 persons x 2.5 years    5
2 (G, H)      1.5                                      2 persons x 1.5 years    3
1 (I)         3                                        1 person x 3 years       3
Total person-time                                                               22

Density of incidence = 1 new case / 22 person-years = 0.045 new cases per person-year
→ 45 per 1000 person-years

Open population
Estimate of the total person-time → individual time period at risk not known for all
- Migration → movement of the cohort assumed to occur in the middle of the follow-up

Estimate of the total person-time = (P0 + Pt)/2 x follow-up

Estimate of the total person-time
At t0: 100 people
Follow-up: 3 years
New cases: 3
Drop-outs: 17
Enrollment during the follow-up: 16
→ P0 = 100; Pt = (100 - 3 - 17 + 16) = 96
(P0 + Pt)/2 x follow-up = (100 + 96)/2 x 3 = 294 person-years

Test the estimate (same data):
People followed for the whole period: 80 people x 3 years = 240 person-years
Movement of the cohort: (17 x 1.5) + (3 x 1.5) + (16 x 1.5) = 54 person-years
240 + 54 = 294 person-years

Incidence rate = (No. of new cases during t0-t1) / (estimate of total person-time)
3 new cases / 294 person-years x 1000 = 10.2 per 1000 person-years

Summary measures of location
1) mode: the category or value that occurs most often (typicalness). Categorical and discrete metric data
2) median: the middle value when the values are in ascending order (central-ness). Ordinal and metric data
3) mean (average): the sum of the values divided by the number of values
4) percentiles: the values that divide the ordered data into 100 equal-sized groups

Choosing the most appropriate measure

                    Mode   Median                       Mean
Nominal             yes    no                           no
Ordinal             yes    yes                          no
Metric discrete     yes    yes, when markedly skewed    yes
Metric continuous   yes    yes, when markedly skewed    yes

Summary measures of spread
• Range → distance from the smallest value to the largest
• IQR (interquartile range) → spread of the middle half of the values
• Boxplot → graphical summary of the three quartile values, the minimum and maximum values, and outliers

Standard deviation
• Average distance of all the data values from the mean value
• The smaller the average distance, the narrower the spread, and vice versa
• Used with metric data only
1. Subtract the mean from each of the n values in the sample, to give the differences
2. Square each of these differences
3. Add these squared values together (sum of squares)
4. Divide the sum of squares by 1 less than the sample size (n-1)
5. Take the square root

Standard deviation and the normal distribution

The Basic Steps of Statistical Work

1. Design of study
• Professional design: research aim, subjects, measures, etc.
• Statistical design: sampling or allocation method, sample size, randomization, data processing, etc.

2. Collection of data
• Source of data: government report system, registration system, routine records, ad hoc survey
• Data collection → accurate, complete, in time
  Protocol: place, subjects, timing; training; pilot; questionnaire; instruments; sampling method and sample size; budget
  Procedure: observation, interview, filling in a form, letter, telephone, web

3. Data sorting
• Checking: by hand, by computer software
• Amend
• Missing data?
• Grouping
  According to categorical variables (sex, occupation, disease...)
  According to numerical variables (age, income, blood pressure...)

4. Data analysis
• Descriptive statistics (show the sample): mean, incidence rate...; table and plot
• Inferential statistics (towards the population): estimation, hypothesis test (comparison)

Definition of Selection Bias
Selection bias: distortions that result from procedures used to select subjects and from factors that influence study participation. The common element of such biases is that the association between exposure and disease is different for those who participate and those who should be theoretically eligible for study, including those who do not participate.
It is sometimes (but not always) possible to disentangle the effects of participation from those of disease determinants using standard methods for the control of confounding. One example is the bias introduced by matching in case-control studies.

Definition of Confounding
Confounding: bias in estimating an epidemiologic measure of effect resulting from an imbalance of other causes of disease in the compared groups (mixing of effects)

Characteristics of a Confounder
• associated with disease (in non-exposed)
• associated with exposure (in source population)
• not an intermediate cause
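
As a closing check, the measures of disease frequency and the five standard-deviation steps described earlier in these slides can be reproduced in a few lines. The sketch below is in Python and is not part of the original presentation: the function names are hypothetical, the epidemiological inputs are the numbers quoted in the slide examples, and the list of hospitalization days used for the standard deviation is an invented illustration.

```python
import math


def cumulative_incidence(new_cases, at_risk_t0):
    """New cases during t0-t1 divided by the population at risk at t0."""
    return new_cases / at_risk_t0


def total_person_time(groups):
    """Sum of individual times at risk; groups holds (n_people, years) pairs."""
    return sum(n_people * years for n_people, years in groups)


def estimated_person_time(p0, pt, follow_up):
    """Approximate person-time as (P0 + Pt)/2 x follow-up."""
    return (p0 + pt) / 2 * follow_up


def incidence_density(new_cases, person_time):
    """New cases during t0-t1 divided by the total person-time."""
    return new_cases / person_time


def sample_sd(values):
    """Sample standard deviation, following the five steps on the slide."""
    n = len(values)
    mean = sum(values) / n
    differences = [v - mean for v in values]   # 1. subtract the mean
    squared = [d ** 2 for d in differences]    # 2. square each difference
    sum_of_squares = sum(squared)              # 3. sum of squares
    variance = sum_of_squares / (n - 1)        # 4. divide by n - 1
    return math.sqrt(variance)                 # 5. take the square root


if __name__ == "__main__":
    # Closed cohort example from the slides: 3 new cases among 24 people.
    print(cumulative_incidence(3, 24))

    # Open cohort example from the slides: person-years per group of people.
    groups = [(1, 5), (3, 2), (2, 2.5), (2, 1.5), (1, 3)]
    person_years = total_person_time(groups)
    print(person_years, round(incidence_density(1, person_years) * 1000, 1))

    # Estimated person-time example from the slides: P0 = 100, Pt = 96, 3 years.
    estimate = estimated_person_time(100, 96, 3)
    print(estimate, round(incidence_density(3, estimate) * 1000, 1))

    # Hypothetical days of hospitalization (assumed data, not from the slides).
    print(round(sample_sd([2, 4, 4, 4, 5, 5, 7, 9]), 3))
```

Keeping each measure in its own small function makes the denominator explicit, which is the main difference between cumulative incidence (people at risk) and density of incidence (person-time).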
