Download Terminology (http://www.stats.gla.ac.uk/steps/glossary) A parameter

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Data mining wikipedia , lookup

Misuse of statistics wikipedia , lookup

Time series wikipedia , lookup

Transcript
Terminology (http://www.stats.gla.ac.uk/steps/glossary)
A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population
characteristic.
An estimator is any quantity calculated from the sample data which is used to give information about an unknown quantity in the
population.
An estimate is an indication of the value of an unknown quantity based on observed data. More formally, an estimate is the
particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.
An outlier is an observation in a data set which is far removed in value from the others in the data set. It is an unusually large or
an unusually small value compared to the others. An outlier might be the result of an error in measurement, in which case it will
distort the interpretation of the data, having undue influence on many summary statistics, for example, the mean.
The sample mean is an estimator available for estimating the population mean. It is a measure of location, commonly called the
average, its value depends equally on all of the data which may include outliers. It may not appear representative of the central
region for skewed data sets.
The median is the value halfway through the ordered data set, below and above which there lies an equal number of data values. It
is generally a good descriptive measure of the location which works well for skewed data, or data with outliers.
The mode is the most frequently occurring value in a set of data. There can be more than one mode if two or more values are
equally common
The (population) variance of a random variable is a non-negative number which gives an idea of how widely spread the values of
the random variable are likely to be; the larger the variance, the more scattered the observations on average
Sample variance is a measure of the spread of or dispersion within a set of sample data. The sample variance is the sum of the
squared deviations from their average divided by one less than the number of observations in the data set.
Standard deviation is a measure of the spread or dispersion of a set of data. It is calculated by taking the square root of the
variance and is symbolised by s.d, or s.
The range of a sample (or a data set) is a measure of the spread or the dispersion of the observations. It is the difference between
the largest and the smallest observed value of some quantitative characteristic and is very easy to calculate
Quantiles are a set of 'cut points' that divide a sample of data into groups containing (as far as possible) equal numbers of
observations. Examples of quantiles include quartile, quintile, percentile. (The median is the 0.5 quantile.)
Quartiles are values that divide a sample of data into four groups containing (as far as possible) equal numbers of observations. A
data set has three quartiles. References to quartiles often relate to just the outer two, the upper and the lower quartiles; the second
quartile being equal to the median. The lower quartile is the data value a quarter way up through the ordered data set; the upper
quartile is the data value a quarter way down through the ordered data set
The inter-quartile range (IQR) is a measure of the spread of or dispersion within a data set. It is calculated by taking the
difference between the upper and the lower quartiles
Quintiles are values that divide a sample of data into five groups containing (as far as possible) equal numbers of observations.
Percentiles are values that divide a sample of data into one hundred groups containing (as far as possible) equal numbers of
observations. For example, 30% of the data values lie below the 30th percentile
The coefficient of variation measures the spread of a set of data as a proportion of its mean. It is often expressed as a percentage.
It is the ratio of the sample standard deviation to the sample mean
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter. The width
of the confidence interval gives us some idea about how uncertain we are about the unknown parameter (see precision). If
independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a
certain percentage (confidence level) of the intervals will include the unknown population parameter. Confidence intervals are
usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the
unknown parameter. We calculate these intervals for different confidence levels, depending on how precise we want to be. We
interpret an interval calculated at a 95% level as, we are 95% confident that the interval contains the true population mean. We
could also say that 95% of all confidence intervals formed in this manner (from different samples of the population) will include
the true population mean.
Confidence limits are the lower and upper boundaries of a confidence interval, that is, the values which define the range of a
confidence interval.
The confidence level is the probability value associated with a confidence interval.
A confidence interval for a mean specifies a range of values within which the unknown population parameter, in this case the
mean, may lie.
A frequency table is a way of summarising a set of data. It is a record of how often each value (or set of values) of the variable in
question occurs. It may be enhanced by the addition of percentages that fall into each category. A frequency table is used to
summarise categorical, nominal, and ordinal data. It may also be used to summarise continuous data once the data set has been
divided up into sensible groups.
A pie chart is a way of summarising a set of categorical data. It is a circle which is divided into segments. Each segment
represents a particular category. The area of each segment is proportional to the number of cases in that category.
A bar chart is a way of summarising a set of categorical data. It is often used in exploratory data analysis to illustrate the major
features of the distribution of the data in a convenient form. It displays the data using a number of rectangles, of the same width,
each of which represents a particular category. The length of each rectangle is proportional to the number of cases in the category
it represents, for example, age group, religious affiliation. Bar charts are used to summarise nominal or ordinal data.
A dot plot is a way of summarising data, often used in exploratory data analysis to illustrate the major features of the distribution
of the data in a convenient form. For nominal or ordinal data, a dot plot is similar to a bar chart, with the bars replaced by a series
of dots. Each dot represents a fixed number of individuals. For continuous data, the dot plot is similar to a histogram, with the
rectangles replaced by dots. A dot plot can also help detect any unusual observations (outliers), or any gaps in the data set.
A histogram is a way of sumarising data that are measured on an interval scale (either discrete or continuous). It is often used in
exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range
of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the
range of values in that specific group, and an area proportional to the number of observations falling into that group. This means
that the rectangles might be drawn of non-uniform height. The histogram is only appropriate for variables whose values are
numerical and measured on an interval scale. A histogram can also help detect any unusual observations (outliers), or any gaps in
the data set.
A box and whisker plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data
analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The plot may
consist of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.
A box plot (as it is often called) is especially helpful for indicating whether a distribution is skewed and whether there are any
unusual observations (outliers) in the data set.
A stem and leaf plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data
analysis to illustrate the major features of the distribution of the data in a convenient and easily drawn form.
A stem and leaf plot is similar to a histogram but is usually a more informative display for relatively small data sets (<100 data
points).
====================================================================================
Data: Student height, Blood type (O A B AB), Favorite discipline (anatomy, biology, chemistry, histology, history, other)
Activity (calculate, find or draw)
Mean
Variance
Standard Deviation
Standard error of mean
The coefficient of variation
Median
Mode
Low quartiles
Upper quartiles
Interquartile range
Quintiles
10-th percentile
90-th percentile
90% confidence limit
95% confidence limit
99% confidence limit
Draw frequency table
Draw pie chart
Draw bar chart
Draw dot plot
Draw histogram
Draw Box-Whisker Plot
(min, Q1, median, Q3, max)
Draw stem-and-leaves plot