Data displays and summaries

The first step in data analysis is exploratory data analysis (EDA). In EDA, we display data using graphical tools and summarize data using numerical summaries. We usually do not make any assumptions about the data.

It is important to recognize the different types of variables (data). There are two types: continuous (or quantitative) and discrete (or categorical or qualitative). Different types of variables call for different statistical methods, so in data analysis you should be able to tell which type of data you are handling. In this course we will focus more on continuous (quantitative) data.

Discrete data displays and summaries

Discrete/categorical data can be displayed graphically by bar graphs or pie charts, and summarized numerically by counts and percentages/proportions. The distribution of the data describes how the data are distributed, i.e., the values the data take and how often each value (or group of similar values) occurs in the dataset.

Continuous data displays and summaries

Continuous data can be displayed graphically by a histogram (or stemplot), which shows the distribution of the data (i.e., the data range and frequencies), or by a boxplot. Continuous data can be summarized numerically by the mean and standard deviation (or variance).

The two most important summaries of continuous data are the mean and the standard deviation. The mean is the average and measures the center of the data. The standard deviation (or variance) measures the variation of the data. Both are important for describing continuous data, and both can be greatly influenced by a few outliers in the data.
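The sensitivity of the mean and standard deviation to outliers can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the data values and the helper name `summarize` are made up for illustration, and the quartiles use the module's default ("exclusive") method, which may differ slightly from the quartile convention used in the course or in other software. It compares the mean and standard deviation with the median and inter-quartile range (IQR) before and after one extreme value is added.

```python
import statistics


def summarize(data):
    """Return the mean, sample standard deviation, median, and IQR of a list of numbers."""
    # statistics.quantiles with n=4 returns the three quartile cut points (Q1, median, Q3).
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),      # sample standard deviation (divides by n - 1)
        "median": statistics.median(data),
        "iqr": q3 - q1,
    }


# Hypothetical data, made up for illustration only.
clean = [10, 11, 12, 13, 14, 15, 16]
with_outlier = clean + [100]  # the same data plus a single extreme value

clean_summary = summarize(clean)
outlier_summary = summarize(with_outlier)

# Print each summary before and after the outlier is added.
for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {clean_summary[name]:8.2f} -> {outlier_summary[name]:8.2f}")
```

In this made-up example the single extreme value moves the mean and standard deviation substantially, while the median and IQR change only slightly.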
More robust measures of center and variation are the median and the inter-quartile range (IQR), which are far less affected by outliers. Other robust data summaries include percentiles and quartiles.

Percentiles/quartiles for continuous data: The 5th percentile is a number such that 5% of the data are smaller than it and 95% of the data are larger than it. The 95th percentile is a number such that 95% of the data are smaller than it and 5% of the data are larger than it. The 50th percentile is just the median. The 25th and 75th percentiles are also called the first and third quartiles, denoted by Q1 and Q3; the IQR is Q3 - Q1.

A boxplot is a useful tool for summarizing continuous data: it shows the five summary statistics of the dataset, namely the minimum, Q1, the median, Q3, and the maximum. Note that a histogram shows the distribution of the data, not the summary statistics, so a histogram and a boxplot serve different purposes. Boxplots are useful for comparing different groups of data. The mean and the median are similar if the data distribution is symmetric, but they can be very different if the data distribution is skewed.

There are infinitely many data distributions. The most commonly used one is the normal distribution. The normal distribution, denoted by N(µ, σ), is completely determined by its mean µ and standard deviation σ. When µ = 0 and σ = 1, the normal distribution N(0, 1) is called the standard normal distribution. Percentiles for N(0, 1) are available (tables, software, internet).

A variable X is said to follow a normal distribution if the distribution of all its possible values is normal. The distribution of a continuous variable X can be displayed by its density function f(x). For any two numbers a and b (a < b), the proportion of values of X between a and b is just the area under the density function f(x) between a and b, i.e., $\int_a^b f(x)\,dx$.

If a variable X follows N(µ, σ), then the new variable $Z = (X - \mu)/\sigma$ follows N(0, 1). Given data $x_1, x_2, \ldots, x_n$, the data transformation $z_i = (x_i - \bar{x})/s$, $i = 1, 2, \ldots, n$, where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation, is called standardization. The $z_i$ values are called z-scores.

The 68-95-99.7 rule can be used to quickly obtain three common percentiles for normally distributed data or variables: about 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively. To obtain any other percentile for normally distributed data or variables, we first standardize the data or variable and then use the standard normal distribution N(0, 1) to find the desired percentile.

Chapter 1 focuses on univariate EDA, i.e., exploratory data analysis on data from a single variable.
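The standardization and percentile steps described above can be sketched with Python's standard `statistics.NormalDist` class. This is an illustrative sketch only: the sample `x`, the parameters `mu = 100` and `sigma = 15`, and the interval endpoints `a` and `b` are hypothetical values chosen for the example, not taken from the course.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample, made up for illustration only.
x = [4.2, 5.1, 5.8, 6.0, 6.3, 7.1, 7.9]

# Standardization: z_i = (x_i - x_bar) / s, using the sample mean and sample standard deviation.
x_bar, s = mean(x), stdev(x)
z = [(xi - x_bar) / s for xi in x]

std_normal = NormalDist(mu=0.0, sigma=1.0)  # the standard normal distribution N(0, 1)

# 68-95-99.7 rule: proportion of N(0, 1) within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

# Percentile of N(mu, sigma): look up the N(0, 1) percentile, then undo the standardization.
mu, sigma = 100.0, 15.0            # hypothetical parameters
z95 = std_normal.inv_cdf(0.95)     # 95th percentile of N(0, 1)
p95 = mu + sigma * z95             # 95th percentile of N(mu, sigma)

# Proportion of values between a and b = area under the density between a and b.
a, b = 85.0, 115.0                 # hypothetical endpoints, one sigma on each side of mu
dist = NormalDist(mu, sigma)
area = dist.cdf(b) - dist.cdf(a)

print("z-scores:", [round(zi, 2) for zi in z])
print("within 1, 2, 3 sd:", {k: round(p, 4) for k, p in within.items()})
print("95th percentile of N(100, 15):", round(p95, 2))
print("proportion between a and b:", round(area, 4))
```

By construction the z-scores have sample mean 0 and sample standard deviation 1. Because `a` and `b` were chosen one standard deviation either side of `mu`, the computed area matches the "68" part of the 68-95-99.7 rule.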