H1: The Art and Science of Learning from Data
Data: Information we gather with experiments and with surveys.
Statistics is the art and science of designing studies and analyzing the data that those studies
produce. Its ultimate goal is translating data into knowledge and understanding of the world around
us. In short, statistics is the art and science of learning from data.
Why use statistical methods?
- Design: planning how to obtain data to answer the questions of interest.
- Description: summarizing the data that are obtained.
- Inference: making decisions and predictions based on the data.
Variable = the characteristic being measured, such as number of hours per day that you watch TV.
Subjects = the entities that we measure in a study.
The population is the total set of subjects in which we are interested. A sample is the subset of the
population for whom we have (or plan to have) data.
Descriptive statistics refers to methods for summarizing the data. The summaries usually consist of
graphs and numbers such as averages and percentages.
A descriptive statistical analysis usually combines graphical and numerical summaries, for example a bar graph.
Inferential statistics refers to the methods of making decisions or predictions about a population,
based on data obtained from a sample of that population.
Parameter vs. statistic:
A parameter is a numerical summary of the population. A statistic is a numerical summary of a
sample taken from the population.
Important: random sampling!
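The difference between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the population of TV hours is made up, and the names `population`, `parameter`, `sample` and `statistic` are chosen here for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population: hours of TV per day for 5000 subjects.
population = [random.randint(0, 8) for _ in range(5000)]
parameter = statistics.mean(population)      # numerical summary of the population

# Random sample of 100 subjects, drawn without replacement.
sample = random.sample(population, 100)
statistic = statistics.mean(sample)          # numerical summary of the sample

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

With random sampling the statistic tends to land close to the parameter, which is what makes inference from a sample to the population possible.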
H2: Exploring Data with Graphs and Numerical Summaries
A variable is called categorical if each observation belongs to one of a set of categories.
=> Key feature: the relative number of observations in each category.
A variable is called quantitative if observations on it take numerical values that represent different
magnitudes of the variable.
=> Key features: center & spread.
- Discrete: if its possible values form a set of separate numbers, such as 0, 1, 2, 3.
- Continuous: if its possible values form an interval.
Mode = the category with the highest frequency.
The proportion of the observations that fall in a certain category is the frequency (count) of
observations in that category divided by the total number of observations. The percentage is the
proportion multiplied by 100. Proportions and percentages are also called relative frequencies.
A frequency table is a listing of possible values for a variable, together with the number of
observations for each value.
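As a minimal sketch of these definitions, the following builds a frequency table with counts, proportions and percentages, and reads off the mode. The data and the helper name `frequency_table` are hypothetical.

```python
from collections import Counter

def frequency_table(observations):
    """Return {category: (count, proportion, percentage)} for categorical data."""
    counts = Counter(observations)
    n = len(observations)
    return {cat: (c, c / n, 100 * c / n) for cat, c in counts.items()}

# Hypothetical data: favourite pet of 10 respondents.
pets = ["dog", "cat", "dog", "fish", "dog", "cat", "dog", "cat", "dog", "bird"]
table = frequency_table(pets)

# The mode is the category with the highest frequency.
mode = max(table, key=lambda cat: table[cat][0])

for cat, (count, prop, pct) in sorted(table.items()):
    print(f"{cat:5s} count={count} proportion={prop:.2f} percentage={pct:.0f}%")
print("mode:", mode)
```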
Graphs for categorical variables:
A pie chart is a circle having a "slice of the pie" for each category. The size of a slice corresponds to
the percentage of observations in the category.
A bar graph displays a vertical bar for each category. The height of the bar is the percentage of
observations in the category.
Pareto chart: a bar graph in which the bars are ordered from largest to smallest, based on the percentage in each category.
Graphs for quantitative variables:
A dot plot shows a dot for each observation, placed just above the value on the number line for that
observation.
A stem-and-leaf plot: Each observation is represented by a stem and a leaf. Usually the stem consists
of all the digits except for the final one, which is the leaf. Now sort the data in order from smallest to
largest. Place the stems in a column, starting with the smallest. Place a vertical line to their right. On
the right side of the vertical line, indicate each leaf (= final digit) that has a particular stem. List the
leaves in increasing order.
=> If needed, truncate the data values to make the plot more compact.
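The construction steps for a stem-and-leaf plot can be sketched in a few lines. The sketch assumes non-negative integers, takes the final digit as the leaf, and uses a hypothetical helper name `stem_and_leaf`. Stems without any leaves are kept so that gaps in the data stay visible, which is a choice made here rather than something the notes prescribe.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot for non-negative integers."""
    groups = defaultdict(list)
    for v in sorted(values):              # sorting first keeps leaves in increasing order
        stem, leaf = divmod(v, 10)        # stem = all digits but the last, leaf = last digit
        groups[stem].append(leaf)
    lines = []
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines

# Hypothetical data values.
lines = stem_and_leaf([12, 25, 21, 37, 10, 28, 25, 51])
print("\n".join(lines))
```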
A histogram is a graph that uses bars to portray the frequencies or the relative frequencies of the
possible outcomes for a quantitative variable.
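The counting behind a histogram can be sketched as follows. The interval start, width and number of intervals are choices the analyst makes. The values below and the helper name `histogram_counts` are hypothetical.

```python
def histogram_counts(values, start, width, bins):
    """Count observations in `bins` intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < bins:               # values outside the covered range are ignored
            counts[idx] += 1
    return counts

# Hypothetical data values.
data = [1, 2, 2, 3, 5, 5, 6, 8, 9, 9, 9]
counts = histogram_counts(data, start=0, width=2, bins=5)

# Text rendering: one row per interval, one star per observation.
for i, c in enumerate(counts):
    print(f"[{2 * i:>2}, {2 * i + 2:>2}) {'*' * c}")
```

Dividing each count by the number of observations gives the relative frequencies instead.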
A distribution of data is a frequency table or a graph that shows the values a variable takes and how
often they occur.
=> Look for the overall pattern (clustering together or a gap?).
=> Unimodal vs. bimodal.
Shape:
- Symmetric.
- Skewed (to the left or to the right).
The tails of the distribution = the parts of the curve for the lowest and for the highest values.
A data set collected over time is called a time series. We can display time-series data graphically
using a time plot.
Describing the center:
The mean is the sum of observations divided by the number of observations:
$$\bar{x} = \frac{\sum x}{n}$$
An outlier is an observation that falls well above or well below the overall bulk of the data.
The median is the midpoint of the observations when they are ordered from the smallest to the
largest (or the other way around).
Comparing mean & median:
- Symmetric: mean = median.
- Skewed to the right: mean > median.
- Skewed to the left: mean < median.
A numerical summary of the observations is called resistant if extreme observations have little, if
any, influence on its value => median.
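A small sketch of why the median is resistant and the mean is not: adding one extreme observation to a made-up data set moves the mean a lot and the median hardly at all. The variable names are illustrative only.

```python
import statistics

# Hypothetical observations, then the same data with one outlier added.
values = [20, 22, 25, 27, 30]
with_outlier = values + [300]

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean={mean_before:.1f}, median={median_before}")
print(f"with outlier:    mean={mean_after:.1f}, median={median_after}")
```

The outlier also makes the data skewed to the right, and the mean ends up well above the median, in line with the comparison above.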
Describing the spread:
The range is the difference between the largest and the smallest observation.
The deviation of an observation from the mean is the difference between the observation and the
sample mean.
The standard deviation s of n observations is:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
This is the square root of the variance $s^2$, which is an average of the squares of the deviations from
their mean:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
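The variance and standard deviation can be computed directly from their definitions. The following sketch uses made-up data and hypothetical helper names, and compares the result with Python's `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_sd(xs):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"variance s^2 = {sample_variance(data):.3f}")   # 32 / 7
print(f"std. dev. s  = {sample_sd(data):.3f}")

# Cross-check against the standard library.
print(math.isclose(sample_variance(data), statistics.variance(data)))
print(math.isclose(sample_sd(data), statistics.stdev(data)))
```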
The Empirical Rule (for distributions that are approximately bell-shaped):
- About 68% of the data falls within 1 standard deviation of the mean.
- About 95% of the data falls within 2 standard deviations of the mean.
- All (or nearly all) observations fall within 3 standard deviations of the mean.
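The Empirical Rule can be checked on simulated data. This is a sketch under the assumption that the data are roughly bell-shaped, which is why the values are drawn from a normal distribution. The mean of 50, the standard deviation of 10 and the helper name `share_within` are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Simulated bell-shaped data (assumption: normal with mean 50, sd 10).
data = [random.gauss(50, 10) for _ in range(10_000)]
mean = statistics.mean(data)
sd = statistics.stdev(data)

def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

for k in (1, 2, 3):
    print(f"within {k} sd: {100 * share_within(k):.1f}%")
```

For skewed or otherwise non-bell-shaped data the percentages can differ noticeably from 68% and 95%.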
The pth percentile is a value such that p percent of the observations fall below or at that value.
=> Quartiles.
The interquartile range is the distance between the third and the first quartiles:
$$IQR = Q_3 - Q_1$$
The five-number summary of a dataset is the minimum value, first quartile Q1, median, third quartile
Q3, and the maximum value.
=> Graph = box plot.
- Box: contains 50% of the distribution, from Q1 to Q3.
- Whiskers: the lines extending from the box, they encompass the rest of the data,
except potential outliers.
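A five-number summary and the IQR can be sketched as below. Quartiles have several conventions. This sketch assumes one of them: Q1 is the median of the lower half and Q3 the median of the upper half, leaving out the middle observation when n is odd. Software packages may return slightly different quartiles. The helper name and the data are hypothetical.

```python
import statistics

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half]                              # observations below the median position
    upper = s[half + 1:] if n % 2 else s[half:]   # skip the middle value when n is odd
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

# Hypothetical data with one unusually large value.
data = [1, 3, 4, 6, 7, 8, 10, 12, 30]
summary = five_number_summary(data)
minimum, q1, median, q3, maximum = summary
iqr = q3 - q1

print("five-number summary:", summary)
print("IQR:", iqr)
```

The box of a box plot runs from `q1` to `q3`, with a line at `median`.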
The z-score for an observation is the number of standard deviations that it falls from the mean. For
sample data, the z-score is calculated as:
$$z = \frac{x - \bar{x}}{s}$$
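As a short sketch, z-scores for a made-up sample can be computed from the sample mean and the sample standard deviation (with the n - 1 divisor). The helper name `z_scores` is hypothetical.

```python
import statistics

def z_scores(xs):
    """Number of standard deviations each observation falls from the sample mean."""
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)      # sample standard deviation (divides by n - 1)
    return [(x - mean) / s for x in xs]

# Hypothetical data: mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(data)

for x, score in zip(data, z):
    print(f"x={x}  z={score:+.2f}")
```

A positive z-score means the observation lies above the mean, a negative one means it lies below.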