An overview of data analysis for researchers

What are Data?
Quantitative Data
o Sets of measurements of objective descriptions of physical and behavioural events;
susceptible to statistical analysis
Qualitative data
o Descriptive, views, actions and activities, non-verbal behaviour and interactions;
susceptible to interpretation
The Research Question (Randomised Controlled Trials (RCTs))
P = Population
Who is the question about?
I = Intervention
What is happening / being done to ‘P’?
C = Comparison
What could be done instead of ‘I’?
O = Outcome (s)
What happens to ‘P’ as a result of ‘I’?
The Research Question (Non RCTs)
P = Population
Who is the question about?
I = Intervention
The group with the disease / characteristic of interest
C = Comparison
The group without the disease / characteristic of interest
O = Outcome (s)
The variable we are measuring for both the ‘I’ and ‘C’ groups
Descriptive Statistics
Data and methods that say something about a complete population
Inferential statistics
Data and methods that use a sample to say something that is probably true about a larger population
What are we measuring
Need to know what we are measuring and how it is being measured. How we measure the variables
will influence the types of analysis we can carry out on our data
2 main types of variable
Categorical – categories e.g. age ranges, gender, cat/dog
Metric – e.g. actual values, not grouped, weight, time
Levels of measurement
4 types of scale for measuring variables: two categorical (nominal, ordinal) and two metric (continuous, discrete)
Nominal: These are categories and lists
e.g. dog, cat, mouse, yes, no
Ordinal: These are ordered or ranked positions, not true numbers
e.g. Educational achievements, income bandings
Continuous: Values can lie anywhere within the possible range, are true numbers
e.g. height, can be any point on a scale
Discrete: Whole numbers, arise from counting things
e.g. number of decayed missing teeth
Identifying data type
Do the data have units? (inc. numbers of things)
o No: the data are categorical. Can the data be put in order?
No – Nominal
Yes – Ordinal
o Yes: the data are metric. Do the data come from measuring or counting things?
Measuring – Continuous
Counting – Discrete
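The decision steps for identifying a data type can be sketched as a small function. This is only one reading of the flowchart; the function name `data_type` and its three yes/no parameters are illustrative assumptions, not part of the original handout.

```python
def data_type(has_units, can_be_ordered=False, from_counting=False):
    """Return the scale type by following the yes/no questions in turn.

    has_units      -- do the data have units (including numbers of things)?
    can_be_ordered -- only relevant when the data have no units
    from_counting  -- only relevant when the data have units
    """
    if not has_units:
        # Categorical branch: ordered categories are ordinal, otherwise nominal
        return "ordinal" if can_be_ordered else "nominal"
    # Metric branch: counting gives discrete data, measuring gives continuous
    return "discrete" if from_counting else "continuous"


# Examples taken from the scale descriptions in the text
print(data_type(has_units=False))                       # dog, cat, mouse
print(data_type(has_units=False, can_be_ordered=True))  # income bandings
print(data_type(has_units=True))                        # height
print(data_type(has_units=True, from_counting=True))    # number of decayed teeth
```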
How do you describe Data? The role of summary statistics
Central tendency
The typical values in a set of scores
Mode – most frequently occurring category or score
1 1 2 2 2 3 4 4 5 5 5 5 6 (mode = 5)
Median – the mid-point in a set of scores
1 1 2 2 2 3 4 4 5 5 5 5 6 (median = 4)
Mean – average score = Sum of X (scores) / N (number of scores) = 45 / 13 ≈ 3.5
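As a quick check, the three measures can be computed for the 13 example scores with Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean, median, mode

# The 13 scores used in the example: 1 1 2 2 2 3 4 4 5 5 5 5 6
scores = [1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 5, 6]

score_mode = mode(scores)      # most frequently occurring score
score_median = median(scores)  # middle score once the values are sorted
score_mean = mean(scores)      # sum of scores divided by number of scores

print(score_mode, score_median, round(score_mean, 1))
```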
Summarising Data
Percentage
The frequency of people with a given characteristic expressed as a number out of 100. E.g. 52
people out of every 100 studied had blue eyes, can also be expressed as 52%. A percentage can also
be defined as a rate per 100.
Rate
The frequency of people with a given characteristic expressed as a number out of a total population
(usually multiples of 100). E.g. the rate of people with blue eyes can be expressed as 52 per 100, 520
per 1000, 5200 per 10,000.
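A minimal sketch of the same arithmetic follows. The study size (130 people with blue eyes out of 250) is a hypothetical figure chosen so that the result matches the 52 per 100 quoted above, and the function name `rate` is illustrative.

```python
def rate(count, total, per=100):
    """Frequency of a characteristic expressed as a number per `per` people."""
    return count * per / total


# Hypothetical study: 130 of 250 people studied had blue eyes
blue_eyed, studied = 130, 250

print(rate(blue_eyed, studied))          # per 100, i.e. the percentage
print(rate(blue_eyed, studied, 1000))    # per 1000
print(rate(blue_eyed, studied, 10_000))  # per 10,000
```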
What can we do with our data?
Prevalence
Defined as the proportion of individuals with a particular disease
P = (total number of cases at a given time) / (total population at that time)
Prevalence is measured at a particular point in time, and as such may be referred to as a point
prevalence.
Incidence
Defined as the proportion of new cases in a population previously without disease in a specified
period.
I = (number of new cases in a period of time) / (population at risk)
N.B. the time period involved must always be specified when presenting incidence rates.
This is also referred to as the cumulative incidence.
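The two definitions can be sketched as follows. All numbers are hypothetical and the function names are illustrative. The example treats people who already have the disease as not being at risk, which is an assumption consistent with "a population previously without disease".

```python
def prevalence(cases, population):
    """Point prevalence: existing cases divided by the total population."""
    return cases / population


def cumulative_incidence(new_cases, population_at_risk):
    """New cases in a stated period divided by the population at risk."""
    return new_cases / population_at_risk


# Hypothetical town of 10,000 people
population = 10_000
existing_cases = 500                    # have the disease at the start of the year
at_risk = population - existing_cases   # 9,500 people previously without disease
new_cases = 190                         # diagnosed during the year

print(prevalence(existing_cases, population))    # proportion with disease now
print(cumulative_incidence(new_cases, at_risk))  # proportion of new cases per year
```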
Which summary statistic should I use?
Nominal – Mode /percentage
Ordinal – Median
Metric – Normal distribution?
Yes – Mean
No – Median
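The choice above can be written as a small lookup. This is a sketch only; the function name and the string labels are illustrative assumptions.

```python
def summary_statistic(scale, normally_distributed=None):
    """Suggest a summary statistic for a variable measured on `scale`.

    scale is "nominal", "ordinal" or "metric"; for metric data the caller
    must also say whether the data look normally distributed.
    """
    if scale == "nominal":
        return "mode / percentage"
    if scale == "ordinal":
        return "median"
    if scale == "metric":
        if normally_distributed is None:
            raise ValueError("metric data: state whether the distribution is normal")
        return "mean" if normally_distributed else "median"
    raise ValueError(f"unknown scale: {scale!r}")


print(summary_statistic("nominal"))
print(summary_statistic("metric", normally_distributed=False))
```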
The Normal Distribution
[Figure: bell-shaped curve of the normal distribution. A lot of biological measures are distributed like this.]
In a distribution of values that looks like this when plotted, the mean, median and mode are the
same.
Negative and Positive skews
Negative = mean is less than median
Positive = mean is greater than median
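A short sketch of this rule of thumb, comparing the mean with the median. The two data sets are made up for illustration, and the function is not a formal skewness coefficient.

```python
from statistics import mean, median


def skew_direction(values):
    """Compare mean and median, following the rule of thumb in the text."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "none"


# Hypothetical data: one large value drags the mean upwards
long_right_tail = [1, 2, 2, 3, 3, 3, 4, 20]       # mean 4.75, median 3
# Hypothetical data: one small value drags the mean downwards
long_left_tail = [1, 17, 18, 18, 19, 19, 19, 20]  # mean 16.375, median 18.5

print(skew_direction(long_right_tail))
print(skew_direction(long_left_tail))
```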
How do you know if the data is normally distributed?
To test for this you can either:
o Plot a frequency diagram
o See if mean = median = mode
o Check whether the standard deviation fits twice into the mean. For data that cannot be negative, if the mean is less than 2 standard deviations then the data are very unlikely to be normally distributed (this is a good tip when looking at research papers).
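The numerical checks in this list can be sketched with the standard `statistics` module. Both data sets are hypothetical, the function name is illustrative, and the "twice into the mean" check is only a rough screen that makes sense for data which cannot be negative; it is not a formal test of normality.

```python
from statistics import mean, median, multimode, stdev


def quick_normality_checks(values):
    """Collect the quick checks: mean, median, mode(s) and the 2 x SD screen."""
    m = mean(values)
    sd = stdev(values)  # sample standard deviation
    return {
        "mean": m,
        "median": median(values),
        "modes": multimode(values),
        "sd": sd,
        "sd_fits_twice_into_mean": m >= 2 * sd,
    }


# Hypothetical marks, roughly symmetric around 55
marks = [48, 50, 52, 55, 55, 55, 58, 60, 62]
# Hypothetical waiting times in days, with one very long wait
waits = [0, 0, 1, 1, 1, 2, 3, 5, 8, 30]

print(quick_normality_checks(marks))
print(quick_normality_checks(waits))
```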
Measures of Dispersion
Inter-quartile range (IQR)
[Figure: observations in ascending order from minimum to maximum, divided by Q1, Q2 and Q3 into four groups, each containing 25% of observations. The IQR runs from Q1 to Q3.]
Examples of median and IQR
2 sets of data:
13, 2, 16, 1, 17
13, 14, 13, 14, 13
First sort them into numerical order
1, 2, 13, 16, 17
13, 13, 13, 14, 14
- The median is the middle value, so for both it is 13.
- We can calculate the position of the lower quartile by (n+1) × 0.25 (n = number of values). The position of the upper quartile is (n+1) × 0.75. With n = 5 these are positions 1.5 and 4.5, i.e. halfway between the 1st and 2nd values and halfway between the 4th and 5th values.
1, 2, 13, 16, 17: LQ = 1.5, UQ = 16.5
13, 13, 13, 14, 14: LQ = 13, UQ = 14
This shows that although they have the same median, they have very different ranges.
- If both the median and IQR are presented we can see that the data are different. The values are more dispersed in the 1st data set.
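The quartile positions used in this example can be reproduced with a short function. This is a sketch assuming linear interpolation between the two neighbouring values when the position is not a whole number; the function name `quartile` is illustrative.

```python
def quartile(sorted_values, fraction):
    """Value at 1-based position (n + 1) * fraction in sorted data."""
    n = len(sorted_values)
    position = (n + 1) * fraction
    whole = int(position)       # position of the value just below
    part = position - whole     # how far to move towards the next value
    if whole < 1:
        return sorted_values[0]
    if whole >= n:
        return sorted_values[-1]
    below = sorted_values[whole - 1]
    above = sorted_values[whole]
    return below + part * (above - below)


first = sorted([13, 2, 16, 1, 17])
second = sorted([13, 14, 13, 14, 13])

for data in (first, second):
    lq, uq = quartile(data, 0.25), quartile(data, 0.75)
    print(data, "LQ =", lq, "UQ =", uq, "IQR =", uq - lq)
```

The two sets share a median of 13, but the inter-quartile range is 15 for the first and 1 for the second.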
Standard Deviation
The standard deviation (SD) is very useful because, in a normally distributed population, approximately 68% of observations lie within 1 SD of the mean, approximately 95% within 2 SD and approximately 99.7% within 3 SD. These proportions hold whatever the mean and SD are, but statistical tables are usually given for the standard normal distribution, which has a mean of 0 and a SD of 1. We rarely collect data with a mean of 0 and a SD of 1; often the data we collect only has positive values. For example, the mean assessment score in a class may be 55 with a SD of 12 and nobody achieving a mark less than 0.
What statistics programmes such as SPSS do is convert this data so that it has a mean of 0 and a SD of 1 and generate Z scores, i.e. standardised scores: Z = (score - mean) / SD.
o Measure of dispersion when mean is used as measure of central tendency
o Based on all the individual scores
o Describes how individual scores typically vary from the mean
o The larger the SD the more spread out the scores are about the mean
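A minimal sketch of the Z score conversion and of the proportions within 1, 2 and 3 SD, using the class example (mean 55, SD 12) and assuming the marks are normally distributed. It relies on `statistics.NormalDist` from the Python standard library; the function names and the individual mark of 79 are illustrative.

```python
from statistics import NormalDist

# Class example from the text: mean mark 55, SD 12 (normality is assumed here)
class_mean, class_sd = 55, 12
marks_dist = NormalDist(mu=class_mean, sigma=class_sd)


def z_score(score, mean, sd):
    """Number of standard deviations a score lies above or below the mean."""
    return (score - mean) / sd


def proportion_within(dist, k):
    """Proportion of a normal distribution lying within k SD of its mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)


print(z_score(79, class_mean, class_sd))  # a mark of 79 is 2 SD above the mean
for k in (1, 2, 3):
    print(k, round(proportion_within(marks_dist, k), 4))
```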