Download H1: The Art and Science of Learning from Data

H1: The Art and Science of Learning from Data

Data: information we gather with experiments and with surveys.

Statistics is the art and science of designing studies and analyzing the data that those studies produce. Its ultimate goal is translating data into knowledge and understanding of the world around us. In short, statistics is the art and science of learning from data.

Why use statistical methods?
- Design: planning how to obtain data to answer the questions of interest.
- Description: summarizing the data that are obtained.
- Inference: making decisions and predictions based on the data.

Variable = the characteristic being measured, such as the number of hours per day that you watch TV.
Subjects = the entities that we measure in a study.

The population is the total set of subjects in which we are interested. A sample is the subset of the population for whom we have (or plan to have) data.

Descriptive statistics refers to methods for summarizing the data. The summaries usually consist of graphs and numbers such as averages and percentages. A descriptive statistical analysis usually combines graphical and numerical summaries, for example a bar graph.

Inferential statistics refers to the methods of making decisions or predictions about a population, based on data obtained from a sample of that population.

Parameter vs. statistic: a parameter is a numerical summary of the population. A statistic is a numerical summary of a sample taken from the population.
=> Important: random sampling, so that the sample is representative of the population.

H2: Exploring Data with Graphs and Numerical Summaries

A variable is called categorical if each observation belongs to one of a set of categories.
=> Key feature: the relative number of observations in each category.

A variable is called quantitative if observations on it take numerical values that represent different magnitudes of the variable.
=> Key features: center & spread.
- Discrete: if its possible values form a set of separate numbers.
- Continuous: if its possible values form an interval.

Mode = the category with the highest frequency.

The proportion of the observations that fall in a certain category is the frequency (count) of observations in that category divided by the total number of observations. The percentage is the proportion multiplied by 100. Proportions and percentages are also called relative frequencies.

A frequency table is a listing of possible values for a variable, together with the number of observations for each value.

Graphs for categorical variables:
- A pie chart is a circle having a "slice of the pie" for each category. The size of a slice corresponds to the percentage of observations in the category.
- A bar graph displays a vertical bar for each category. The height of the bar is the percentage of observations in the category.
- Pareto chart: a bar graph in which the bars are ordered from the largest to the smallest percentage.

Graphs for quantitative variables:
- A dot plot shows a dot for each observation, placed just above the value on the number line for that observation.
- A stem-and-leaf plot: each observation is represented by a stem and a leaf. Usually the stem consists of all the digits except for the final one, which is the leaf.
- A histogram is a graph that uses bars to portray the frequencies or the relative frequencies of the possible outcomes for a quantitative variable.

Constructing a stem-and-leaf plot:
- Sort the data in order from smallest to largest.
- Place the stems in a column, starting with the smallest.
- Place a vertical line to their right.
- On the right side of the vertical line, indicate each leaf (= final digit) that has a particular stem. List the leaves in increasing order.
=> Truncate the data values if needed to make the plot more compact.

A distribution of data is a frequency table or a graph that shows the values a variable takes and how often they occur.
=> Look for the overall pattern (do the observations cluster together, or is there a gap?).
=> Unimodal vs. bimodal.

Shape:
- Symmetric.
- Skewed (to the left or to the right).
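The frequency table, the mode, and the stem-and-leaf steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the two small data sets are hypothetical, and the stem-and-leaf helper assumes non-negative whole numbers so that the leaf is simply the final digit.

```python
from collections import Counter


def frequency_table(observations):
    """Map each category to (count, proportion, percentage)."""
    counts = Counter(observations)
    n = len(observations)
    return {cat: (c, c / n, 100 * c / n) for cat, c in counts.items()}


def mode(observations):
    """Category with the highest frequency (first one seen if there is a tie)."""
    return Counter(observations).most_common(1)[0][0]


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its leaves in increasing order.

    Assumes non-negative integers (an assumption of this sketch).
    """
    plot = {}
    for v in sorted(values):           # sort from smallest to largest
        stem, leaf = divmod(v, 10)     # leaf = final digit
        plot.setdefault(stem, []).append(leaf)
    return plot


# Hypothetical categorical data: favourite TV genre of 8 subjects.
genres = ["news", "sports", "news", "drama", "news", "sports", "drama", "news"]
table = frequency_table(genres)
for category, (count, proportion, percentage) in table.items():
    print(f"{category:8s} {count:3d} {proportion:6.3f} {percentage:6.1f}%")

# Hypothetical quantitative data: 8 exam scores.
scores = [52, 47, 61, 58, 49, 63, 55, 60]
for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

The proportions in the table always add up to 1, which is a quick check that no observation was lost when counting.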
The tails of the distribution = the parts of the curve for the lowest and for the highest values.

A data set collected over time is called a time series. We can display time-series data graphically using a time plot.

Describing the center:

The mean is the sum of the observations divided by the number of observations:
$\bar{x} = \frac{\sum x}{n}$

An outlier is an observation that falls well above or well below the overall bulk of the data.

The median is the midpoint of the observations when they are ordered from the smallest to the largest (or the other way around).

Comparing mean & median:
- Symmetric: mean = median (approximately, for real data).
- Skewed to the right: mean > median.
- Skewed to the left: mean < median.

A numerical summary of the observations is called resistant if extreme observations have little, if any, influence on its value.
=> The median is resistant, the mean is not.

Describing the spread:

The range is the difference between the largest and the smallest observation.

The deviation of an observation from the mean is the difference between the observation and the sample mean, $x - \bar{x}$.

The standard deviation s of n observations is:
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

This is the square root of the variance $s^2$, which is (roughly) an average of the squared deviations from the mean:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

The Empirical Rule: if the distribution is approximately bell-shaped, then
- about 68% of the data fall within 1 standard deviation of the mean.
- about 95% of the data fall within 2 standard deviations of the mean.
- all (or nearly all) observations fall within 3 standard deviations of the mean.

The pth percentile is a value such that p percent of the observations fall below or at that value.
=> Quartiles: Q1 = 25th percentile, median = 50th percentile, Q3 = 75th percentile.

The interquartile range is the distance between the third and the first quartiles:
$IQR = Q_3 - Q_1$

The five-number summary of a dataset is the minimum value, first quartile Q1, median, third quartile Q3, and the maximum value.
=> Graph = box plot.
- Box: contains the middle 50% of the distribution, from Q1 to Q3.
- Whiskers: the lines extending from the box; they encompass the rest of the data, except potential outliers.
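As a worked illustration of the measures of center and spread above, the following Python sketch computes the mean, variance, and standard deviation directly from their formulas, and the five-number summary with the standard-library `statistics` module. The data set and function names are hypothetical. Note that textbooks and software use slightly different conventions for quartiles; `statistics.quantiles` with its default method is just one of them, so Q1 and Q3 may differ a little from a hand calculation.

```python
import math
import statistics


def sample_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


def sample_std(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


def five_number_summary(xs):
    """(min, Q1, median, Q3, max), using statistics.quantiles' default method."""
    q1, med, q3 = statistics.quantiles(xs, n=4)
    return min(xs), q1, med, q3, max(xs)


# Hypothetical sample of 7 observations.
data = [2, 4, 4, 5, 7, 9, 11]

print(sample_mean(data))       # 6.0
print(sample_variance(data))   # 10.0
print(sample_std(data))        # square root of 10

minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1
print(minimum, q1, med, q3, maximum, iqr)

# Resistance: one extreme observation moves the mean far more than the median.
with_outlier = data + [100]
print(sample_mean(with_outlier), statistics.median(with_outlier))
```

In the last line the mean jumps from 6 to 17.75 while the median only moves from 5 to 6, which is what "resistant" means in practice.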
The z-score for an observation is the number of standard deviations that it falls from the mean. For sample data, the z-score is calculated as:
$z = \frac{x - \bar{x}}{s}$
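A minimal sketch of the z-score calculation, assuming a small hypothetical sample and using the sample mean and sample standard deviation (n - 1 in the denominator) from Python's `statistics` module:

```python
import statistics


def z_scores(xs):
    """z = (x - mean) / s for every observation, with s the sample standard deviation."""
    xbar = statistics.mean(xs)
    s = statistics.stdev(xs)
    return [(x - xbar) / s for x in xs]


# Hypothetical sample of 7 observations (mean 6, s = sqrt(10)).
sample = [2, 4, 4, 5, 7, 9, 11]
for x, z in zip(sample, z_scores(sample)):
    print(f"x = {x:2d}  z = {z:+.2f}")
```

A positive z-score means the observation lies above the mean, a negative one below it; the z-scores of a sample always sum to zero (up to rounding).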
