Download slides - Bioinformatics Sannio

Machine Learning exploring and understanding data Luigi Cerulo Department of Science and Technology University of Sannio Understand your data The better you understand your data, the better you will be able to match a machine learning model to your learning problem • During this step that you will begin to explore the data's features and examples, and realize the peculiarities that make your data unique • Visualize data under different prospectives let you get new information • http://www.informationisbeautiful.net • https://www.youtube.com/watch?v=usdJgEwMinM The best stats you’ve ever seen • Understand your data Assume the role of a data scientist, who has the task of understanding the used car data. • Although data exploration is a fluid process, the steps can be imagined as a sort of investigation in which questions about the data are answered. • The exact questions may vary across projects, but the types of questions are always similar. • … lets start with an example dataset • (usedcars from the book Machine Learning with R, Brett Lantz) Explore the structure of data Exploring numeric variables • The six summary statistics that the summary() function provides are simple, yet powerful tools for investigating data. The summary statistics can be divided into two types: measures of center and measures of spread. Measuring the central tendency mean and median Measures of central tendency are a class of statistics used to identify a value that falls in the middle of a set of data. • An average item is typical, and not too unlike the others in the group. You might think of it as an exemplar by which all others are judged. • Usedcars observation What does this tell us about our data? Since the average price is relatively low, we might expect that the data includes economyclass cars. • Of course, the data can also include late-model luxury cars with high kilometer age, but the relatively low mean kmeterage statistic doesn't provide evidence to support this hypothesis. On the other hand, it doesn't provide evidence to ignore the possibility either. • Mean and outliers • Mean is highly sensitive to outliers (i.e. values that are atypically high or low relative to the majority of data) and could be shifted higher or lower by a small number of extreme values. Median = 3 Median = 3 Usedcars observation Although the mean and median for price are fairly similar (differs by ~5%), there is a much larger difference between the mean and median for kmeterage (~20% larger). • Since the mean is more sensitive to extreme values than the median, the fact that the mean is much higher than the median might lead us to suspect that there are some used cars in the dataset with extremely high kmeterage values. • Measuring spread quartiles and the five-number summary Mean and median quickly summarize the values, but these measures of center tell us little about whether or not there is diversity in the measurements. • To measure the diversity, another type of summary statistics is needed. It is concerned with the spread of the data, or how tightly or loosely the values are spaced. • Five number summary • • Five statistics that roughly depict the spread of a dataset • Minimum • First quartile (Q1) • Median (Second quartile Q2) • Third quartile (Q3) • Maximum The quartiles divide a dataset into four portions, each with the same number of values Interquartile range • The middle 50 percent of data between Q1 and Q3 is of particular interest because it itself is a simple measure of spread. Usedcars observation For price the difference between the minimum and Q1 is about $7,000, as is the difference between Q3 and the maximum; while, the difference from Q1 to the median to Q3 is roughly $2,000. • This suggests that the lower and upper 25 percent of values are more widely dispersed than the middle 50 percent of values, which seem to be more tightly grouped around the center (this pattern of spread is called a normal distribution of data). • Boxplots Histograms seems that the used car prices tend to be evenly divided on both sides of the middle, while the car mileages stretch further to the right. This characteristic is known as skew, specifically right skew, because the values on the high end (right side) are far more spread out than the values on the low end (left side). As shown in the following diagram, histograms of skewed data look stretched on one of the sides: Distributions • Histograms, boxplots, and statistics describing the center Chapter 2 The ability to quickly diagnoseways such patterns inexamine our data is one of the the strengths of the and spread all provide to distribution of You might also that the shape thewill twobecome histograms somewhat different. histogram as a notice data exploration tool.of This evenismore important as weIt seems thatvalues. the used carpatterns prices tend to be evenly divided on both sides ofdescribes the middle, start examining other of spread in numeric data. a variable's A variable's distribution how while the car mileages stretch further to the right. This characteristic is known as specifically right skew, because the values on the high end (right side) are far likely a skew, value is to fallnumeric within various ranges. Understanding data uniform andin the following more spread out than the values on the low end – (left side). As shown diagram, histograms of skewed data look stretched on one of the sides: normal distributions Histograms, boxplots, and statistics describing the center and spread all provide ways to examine the distribution of a variable's values. A variable's distribution describes how likely a value is to fall within various ranges. If all values are equally likely to occur, say for instance, in a dataset recording the values rolled on a fair six-sided die, the distribution is said to be uniform. A uniform distribution is easy to detect with a histogram because the bars are approximately the ability same height. When visualized a histogram, it may lookofsomething likeof the The to quickly diagnose suchwith patterns in our data is one the strengths the following as diagram: histogram a data exploration tool. This will become even more important as we start examining other patterns of spread in numeric data. Understanding numeric data – uniform and normal distributions Histograms, boxplots, and statistics describing the center and spread all provide ways to examine the distribution of a variable's values. A variable's distribution It's important to notea that not random events are ranges. uniform. For instance, rolling a describes how likely value is all to fall within various weighted six-sided trick die would result in some numbers coming up more often eters. The normal distribution, which describes many types of can be defined with just two: center and spread. The center of the on is defined by its mean value, which we have used before. The ed by a statistic called the standard deviation. Variance and standard deviation ate the standard deviation, we must first obtain the variance, which average of the squared differences between each value and the athematical notation, the variance of a set of n values of x is defined formula. The Greek letter mu (similar in appearance to an m) of the values, and the variance itself is denoted by the Greek letter milar to a b turned sideways): 1 n 2 Var(X) = σ = ∑ ( xi − µ ) n i =1 2 Chapter 2 deviation is the square root of the variance, and is denoted by sigma as following formula: 1 n 2 StdDev(X) = σ = x − µ ( ) ∑ i n i =1 [ 54 ] variance and standard deviation in R, the var() and sd() functions For example, computing the variance and standard deviation on our eage variables, we find: When interpreting the variance, larger numbers indicate that the data are spread more widely around the mean. rs$price) 2 The standard deviation indicates, on average, how much ars$mileage) each value differs from the mean. 54 • ars$price) rs$mileage) eting the variance, larger numbers indicate that the data are spread Standard deviation • • The standard deviation can be used to quickly estimate how extreme a given value is under the assumption that it came from a normal distribution. Managing and Understanding Data 68% of values in a normal distribution fall within one standard deviation of the The standard deviation can be used to quickly estimate how extreme a given value mean is under the assumption that it came from a normal distribution. The 68-95-99.7 rule • 95% states that 68 percent of values in a normaldeviations distribution fall within one standard of values fall within two standard deviation of the mean, while 95 percent and 99.7 percent of values fall within two and • 99.7% valuesdeviations, fall withinrespectively. three standard threeof standard This isdeviations illustrated in the following diagram: Applying this information to the used car data, we know that since the mean price Exploring categorical variables • In contrast to numeric data, categorical data is examined using tables rather than summary statistics. A table that presents a single categorical variable is known as a one-way table. Central tendency of categorical variables the mode In statistics terms, the mode of a feature is the value occurring most often. • A variable may have more than one mode: unimodal, bimodal, and multimodal. • • It would be dangerous to place too much emphasis on the mode since the most common value is not necessarily a majority. Mode for numerical data The mode for a numerical variable is the highest bars on a histogram. • It can be helpful to consider the mode when exploring numeric data, particularly to examine whether or not the data is multimodal. • Exploring relationships between variables So far, we have examined variables one at a time, calculating only univariate statistics. • Bivariate relationships consider the relationship between two variables. • Relationships of more than two variables are called multivariate relationships. • Visualizing relationships between numeric variables scatterplots A scatterplot is a diagram that visualizes a bivariate relationship by drawing on a plane using the values of one feature to provide the horizontal x coordinates, and the values of another feature to provide the vertical y coordinates. • Patterns in the placement of dots reveal underlying associations between the two features. • Usedcars observation • The absence of more points like this provides evidence to support a conclusion that our data is unlikely to include any high mileage luxury cars. Correlation • Pearson (default) • Spearman (non-parametric) Relationship between two nominal variables • To examine a relationship between two nominal variables, a two-way cross-tabulation is used (aka crosstab or contingency table). Usedcars observation Do relationships between the model and color data provide insight into the types of cars we are examining? • If black and silver are common used car colors, we might assume that the data are for luxury cars, which tend to be sold in more conservative colors, or they could also be economy cars, which are sold with fewer color options. • Chi square test • Capture the output with a variable Exercise • Explore all the features of the movies database • explore the histogram of year for different genres • explore di distribution of length (which are the outliers?) • compute the average length of short and long movies • compute the proportion of action movies that are also drama, and viceversa • analyze the distribution of budget for each genre in a decade (remove movies with NA budget) • analyze the proportion of each genre in each decade • analyze the distribution of rating at different ranges of votes

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download slides - Bioinformatics Sannio