STA101 Introduction to Statistics
Assignment 1
Luc Hens
January 11, 2017

1 Due date

Due on Thursday 9 February 2017 at the beginning of class. Start in time. If you run into trouble, ask me for help (that’s what office hours are for).

2 Objectives

The objectives of this assignment are:
– to make you acquainted with some key data sources (on-line or in print);
– to make you acquainted with the standard format to store data on a computer: plain text files with comma-separated values;
– to make you acquainted with statistical software (R and R Commander) and use it to generate a density histogram and descriptive statistics for a data set;
– to make you understand what density means, and how it is computed;
– to let you practice writing a clear and concise report describing your sources, the data manipulations you performed, and the results you obtained.

3 Details

You can do the homework assignments individually or (preferably) in groups of two. If you work in a group, both group members should contribute equally to each of the assignments. Both group members should carefully proofread the paper before it is submitted, and both group members are responsible for the quality of the entire paper. If you have problems co-operating with your partner, let me know immediately.

Start by reading:
– Chapters 3 and 4 from Freedman et al. (2007);
– the section on APA Style from the textbook from your Critical Thinking and Academic Writing courses (Bullock et al., 2014, pp. 158–191). Alternatively, you can use Kirszner and Mandell (2011, chapter 35, pp. 236–263) or Glenn et al. (2004, pp. 652–679);
– the rules for written work (http://homepages.vub.ac.be/~lmahens/written_work.html);
– my document on how to typeset math with a word processor (http://homepages.vub.ac.be/~lmahens/Typesetting_math.pdf);
– McCloskey on writing well (McCloskey, 1999) (http://www.deirdremccloskey.com/docs/pdf/Article_309.pdf);
– Mankiw on writing well (Mankiw, 2006) (http://gregmankiw.blogspot.be/2006/10/how-to-write-well.html).

Download the pdf and the Microsoft Word-formatted (.doc) files with an example of how the paper should be formatted:
http://homepages.vub.ac.be/~lmahens/example2.pdf
http://homepages.vub.ac.be/~lmahens/example2.doc

Use the Microsoft Word-formatted (.doc) file as a template, that is: save the file as STA101-assignment-1.doc, and write your text over the existing text using your favorite word processor. The title of your paper will be:

Assignment 1: The distribution of ... [a short description of your data, e.g., the height of clowns in 122 circuses]

Start the main body of your paper (page 3) by repeating the title from the title page, followed by a short general introduction (in APA Style the introduction has no header). Refer in the paper to at least the data source and Freedman et al. (2007), so the References section should contain at least those two sources. (Note that the author of the data source can be an institution, not a person.) The main body of the paper (the paper excluding the title page, the abstract, and the references) is about 600 words long. When you have finished writing the main text, summarize the paper in an abstract (maximum 100 words, but for this short paper 50 words will do). In APA Style, the abstract is on page 2 (following the title page and before the introduction).

For this assignment, you collect a data set on a subject that interests you, and use statistical software to generate a density histogram and compute statistics describing the data set (mean, median, standard deviation). You write up the results in a polished paper.

Step 1 Collect a data set on a subject that interests you. The data set should have at least 50 cases, and one numerical variable.
Don’t limit your population of interest to 50 cases just because I required a data set of at least 50 cases; if you use countries, use all countries of the world for which the data exist; if you use scores of NBA basketball players, use all NBA players. The data should be cross-sectional (measured in a given period or at a point in time). Don’t use time series data, that is, data where the cases are subsequent fixed periods of time (such as annual GDP, 1950–2015).

Here are some examples of data you can use for this assignment:

Sports data, such as the performance of an NBA player during each game of a recent season (go to nba.com; pick a player; go to game logs; choose a variable, for instance: points per game, or rebounds per game, or minutes played per game); the times of all the riders at the end of a stage race (like the Tour de France); the times of all the runners (or the top 100 runners) in a marathon; the number of aces, number of break points saved, first serve percentage, points won returning first serve, points won returning second serve, etc. for tennis players. You can find such data on the web sites of ATP (men’s tennis) (http://www.atpworldtour.com/Rankings/MatchFacts.aspx), WTA (women’s tennis), Tour de France (http://www.letour.fr/), Vuelta a España (http://www.lavuelta.com/), New York Marathon (http://www.nycmarathon.org/), Rotterdam Marathon (http://www.abnamromarathonrotterdam.com/), and for many other runs on http://www.uitslagen.com/. Many other sports data are available on-line. Note that if times are expressed as hours:minutes:seconds, you have to convert them to a single unit of measurement, say, minutes.
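If your source lists each time as a single hours:minutes:seconds string, the conversion can also be done directly in R. The following is only a minimal sketch under assumptions: the three race times are made up for illustration, and the helper name parts is hypothetical.

```r
# Hypothetical race times stored as "hours:minutes:seconds" strings
times <- c("3:55:22", "2:08:41", "4:12:05")

# Split each string on ":" into its three pieces
parts <- strsplit(times, ":", fixed = TRUE)

# Extract hours, minutes, and seconds as numbers
hours   <- sapply(parts, function(p) as.numeric(p[1]))
minutes <- sapply(parts, function(p) as.numeric(p[2]))
seconds <- sapply(parts, function(p) as.numeric(p[3]))

# Express every time in a single unit of measurement (minutes)
time.in.minutes <- hours * 60 + minutes + seconds / 60
print(round(time.in.minutes, 4))
```

Storing hours, minutes, and seconds as three separate columns in the data file works equally well; the string-splitting step is only needed when the source gives one combined field.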
For example, 3 hours, 55 minutes, and 22 seconds is converted to minutes as follows:

$$(3\ \text{hr} \times 60\ \text{min/hr}) + 55\ \text{min} + \frac{22\ \text{sec}}{60\ \text{sec/min}} = 235.3667\ \text{min}$$

This conversion is easy to do in R if you store the times as three variables (hours, minutes, seconds), with one column for the hours, a second column for the minutes, and a third column for the seconds (columns separated by commas). You then create a new variable time.in.minutes using the following line of script:

    time.in.minutes <- hours*60 + minutes + seconds/60

Social, economic, or demographic data: The performance for a recent year of all countries of the world on some social, economic, or demographic measure (GDP, population, total imports, total exports, child mortality, unemployment rate, inflation rate, etc.). Make sure you understand the meaning of the variable: don’t choose gross domestic product (GDP) if you don’t know what gross domestic product means. Make sure the data are comparable. For example, don’t use GDP in national currencies for all countries of the world, because in that case Afghanistan’s GDP will be measured in Afghani, Albania’s GDP in Lek, etc., and the numbers will be incomparable; use GDP expressed in a common currency (like the US dollar) instead. Some sources are:
– World Bank, World Development Report (www.worldbank.org/wdr/)
– United Nations, Human Development Report (http://hdr.undp.org/en/)
– Feenstra, R. C., Inklaar, R. and Timmer, M. P. (2015). The Next Generation of the Penn World Table (http://www.rug.nl/research/ggdc/data/pwt/)
– International Monetary Fund, International Financial Statistics Yearbook (available in print in the VUB library; check the World Tables)
– United Nations Statistics Division (http://unstats.un.org/unsd/default.htm)
– United Nations Commodity Trade Statistics Database (http://comtrade.un.org/)
– United Nations, International Trade Statistics Yearbook (http://comtrade.un.org/pb/)
– CIA, CIA World Fact Book (https://www.cia.gov/library/publications/the-world-factbook/index.html) (select: Guide to Country Comparisons)

If you are in doubt whether the data set you chose is appropriate for this assignment, check with me before you start your work.

Step 2 Store the data in a plain text file with comma-separated values (csv), where lines represent cases and columns represent variables. Here’s how to do this. Start your text editor (Notepad in Windows, TextEdit set to plain text mode in macOS; see STA101-Getting-started.pdf). Create a new file by selecting File → New. The first line of the file should have the variable names (no blank spaces in variable names; if GDP per capita is the variable, use GDP.per.capita as the variable name). Separate the variable names by a comma. The next lines have the values of the variables, separated by a comma:

    country    , GDP.per.capita
    Afghanistan, 1322.18
    Albania    , 7212.71
    etc.

In the example above, I used blank spaces to nicely align the variables. That is optional, but a good idea because it is easier to spot typos or anomalies. Use blank spaces, not tabs. Make sure that there are no blank lines at the bottom of the file, and no commas other than the separators between the variables. Save the file: File → Save as... and choose an appropriate file name ending with the extension .csv (comma-separated values), e.g., assignment-1-data.csv

Make sure you carefully record the bibliographical information of your data source (to document the source in your paper), the units of measurement of your variable (e.g., millions of dollars), and any other relevant information concerning the variable (definition or description, year, etc.; e.g., real GDP in billions of dollars, 2013).

Alternatively, if you know how to use a spreadsheet program, you can construct a spreadsheet containing the data and export the data to a plain-text comma-separated values (csv) file.
You can use any spreadsheet program (Google Docs, LibreOffice Calc, Numbers, Microsoft Excel), but make sure the spreadsheet uses decimal points, not decimal commas (so two and a half is 2.5, not 2,5). Export the spreadsheet to a comma-separated values (csv) file (with the extension .csv, so the file name will be something like STA101-assignment-1-data.csv). If you don’t know how to do this, consult the Help function of your spreadsheet. Usually, it’s something like File → Export or File → Save as...

Open the data file (STA101-assignment-1-data.csv) with your text editor: File → Open Document... Make sure there are no redundant commas at the end of a line (remove them if there are). Make sure there are no blank lines at the bottom (remove them if there are). Save the data file: File → Save. Close the data file.

Step 3 Start R Commander. This is the best way to start R Commander in OS X:
– go to the Utilities folder inside the Applications folder and drag the Terminal icon to the Dock (this keeps the Terminal icon in the Dock from now on);
– click the Terminal icon in the Dock to open the Terminal.app;
– in the Terminal window, type R and press return;
– in the Terminal window after the prompt (>), type: library(Rcmdr) (that’s Rcmdr, not rmcdr or RCMDR: R is case-sensitive) and press return. The XQuartz (X11) app will start and an R Commander window will open in X11. If you don’t see the R Commander window, click on the XQuartz (X11) icon in the Dock.

To start R Commander in Windows:
– double-click the R icon;
– in the R console window after the prompt (>), type: library(Rcmdr) (that’s Rcmdr, not rmcdr or RCMDR: R is case-sensitive) and press return. The R Commander window will open.

Step 4 Import your data from the data file (STA101-assignment-1-data.csv): in the R Commander window, do: Data → Import data → from text file... A window will open:
– for “Enter name for data set”, keep the default name (Dataset);
– for “Field Separator”, check the box Commas;
– for “Decimal Point Character”, check the box Period [.]
Click OK. A new window (called Open) appears. Select the data file and click Open. In the R Commander window, click the button “View data set”. A new window with the data set appears. Inspect the window to see whether your data set was correctly imported.

Step 5 Use R Commander to compute the descriptive statistics (mean, median, standard deviation): Statistics → Summaries → Active data set, and Statistics → Summaries → Numerical Summaries.

Step 6 Use R Commander to generate a density histogram of your data. R Commander has a Histogram instruction in the Graphs menu: Graphs → Histogram... A window called Histogram opens. The window has two tabs: Data and Options. In the Data tab, select the variable of which you want to generate the histogram. In the Options tab:
– select for Axis Scaling: Densities (to get a density histogram);
– type for Plot labels, in the field for x-axis label, the label you want to appear on the x-axis (depending on your variable, that will be something like: height (cm) or test scores (points));
– leave the other fields (<auto>) unchanged;
and press the OK button. R Commander will generate the histogram. (The histogram may be hidden under the other windows: on a Mac, click the Terminal or XQuartz icons in the Dock to see the histogram; you may have to resize the window a bit to see the x-axis.)

You will notice that the histogram generated by R Commander differs from the histograms in class and in Freedman et al. (2007) (e.g., figure 4 p. 37): in the histogram generated by R Commander, the vertical axis (density) is not scaled as percent per horizontal unit but as a decimal fraction per horizontal unit. To find the density as percent per horizontal unit, you would need to multiply the numbers on the vertical axis by 100%.
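The relationship between the two density scales can be checked in plain R. The sketch below is illustrative only: the variable height holds simulated values (not real data), and the names widths, areas, and percent.per.unit are hypothetical helpers. It uses hist() with plot=FALSE, so nothing is drawn.

```r
# Hypothetical data: 60 simulated heights in cm (replace with your own variable)
set.seed(1)
height <- rnorm(60, mean = 174, sd = 8)

# Compute the histogram without drawing it
h <- hist(height, plot = FALSE)

# Width of each class interval (difference between consecutive breaks)
widths <- diff(h$breaks)

# Area of each block = density x width = relative frequency of that interval
areas <- h$density * widths
print(sum(areas))   # the areas add up to 1, that is, 100% of the cases

# Density rescaled from a decimal fraction to percent per horizontal unit
percent.per.unit <- h$density * 100
print(percent.per.unit)
```

The total area under a density histogram is 1 on R's scale, which corresponds to 100% on the percent-per-unit scale used in class.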
(It is possible to plot a histogram in R Commander with density as percent per horizontal unit by tweaking the script; don’t bother, as long as you know that the vertical axis shows density as a decimal fraction per horizontal unit.) Copy the histogram and paste it into your paper as Figure 1. Density histogram of...

To paste a graph in a word processor document, do the following. In Windows, bring the R window with the graph to the front. Do Edit → Copy. Go to your word processor document and do Edit → Paste. In macOS, bring the R window with the graph to the front. Choose Edit → Copy. Start the Preview application and in Preview do File → New from Clipboard. Save as a .png file. You can now copy and paste the graph from Preview into a word processor document.

To save a graph, bring the window with the graph to the front and choose in the R menu File → Save as... In Windows, save in the .png format. In OS X, save as .pdf (that is the only option, but you can open the .pdf in Preview and save in .png).

Step 7 After R Commander has generated the histogram, go back to the R script window and type

    hist(Dataset$nameofvariable, plot=FALSE)

where Dataset is the name you gave to your data set (Dataset is the default) and nameofvariable is the name of your variable. Press the Submit button. The output window now displays the breaks between the class intervals, the counts (absolute frequencies), the densities (expressed as decimal fractions per horizontal unit; to get percentages per horizontal unit multiply by 100%), and the midpoints of the class intervals. To sort the values of the variable from low to high, type in the script window:

    sort(Dataset$nameofvariable)

and press the Submit button. Pick one class interval from the histogram. Use the sorted list to find the count (absolute frequency) over this class interval and explain in your paper. Then compute the relative frequency and the density (percent per horizontal unit) for this class interval. In the paper, show your work and explain.
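The computation for one class interval can be sketched in R as a check on your hand calculation. Everything here is an assumption made for illustration: the vector x is a small made-up data set standing in for Dataset$nameofvariable, and the second class interval is picked arbitrarily. By default, hist() counts a value equal to the right endpoint of an interval in that interval.

```r
# Hypothetical data standing in for Dataset$nameofvariable
x <- c(12, 15, 17, 21, 22, 24, 25, 28, 31, 33, 34, 38, 41, 47, 52)

h <- hist(x, plot = FALSE)   # breaks, counts, and densities without a plot

# Pick the second class interval (an arbitrary choice for this sketch)
lower <- h$breaks[2]
upper <- h$breaks[3]

# Count: cases above the lower break and at or below the upper break
count <- sum(x > lower & x <= upper)

# Relative frequency: share of all cases that fall in the interval
relative.frequency <- count / length(x)

# Density: relative frequency per horizontal unit
density <- relative.frequency / (upper - lower)

# Compare the hand computation with what hist() reports
print(c(count = count, reported.count = h$counts[2]))
print(c(density = density, reported.density = h$density[2]))
print(density * 100)         # density as percent per horizontal unit
```

In the paper you still need to show the arithmetic yourself; the script only confirms that your count and density agree with the output of hist().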
(You should of course get the same results as in the count and density lists in the output window and in the histogram, but remember that R Commander expresses density as a decimal fraction per horizontal unit, not as a percentage per horizontal unit.) Carefully explain the meaning of the value of the density that you computed, by referring to the histogram.

Step 8 When you are finished, go to the Script window (the top panel of the R Commander window) and type

    date()

Then press the Submit button (at the right, between the Script Window (top panel) and the Output Window (bottom panel)). This will put a date stamp in the Output Window. Save the output by doing File → Save output as... and choose as file name R-output-1.txt. Print this file and include it with your assignment as Appendix 1.

4 Structure and format of the paper

Your paper should be in APA Style and be organized as follows:
– A title page (page 1);
– An abstract that summarizes the paper in maximum 100 words (page 2);
– Repeat the title of page 1, and write an introduction that briefly describes the data (what? who? when?) and documents the data source using the APA-style parenthetical author-date format. Include the corresponding bibliographical reference in the references at the end of the paper. The author can be an institutional author: National Basketball Association (2011).
– A second paragraph refers to figure 1 (the density histogram of your data, included on a separate page) and briefly explains what a density histogram is (refer to Freedman et al. (2007)). The same paragraph explains in detail how you computed the absolute frequency (count), the relative frequency, and the density for one class interval, and shows the work. Explain the meaning of the number you found, and compare it to the corresponding density in figure 1 (the histogram generated with R).
– A third paragraph reports and interprets the mean, median, and standard deviation (don’t forget the units of measurement; for instance, the mean height is 174 cm, not 174). The same paragraph explains the shape of the histogram: is it approximately symmetric or not? If not, is the histogram left-skewed or right-skewed? (Comparing the median and the mean may be helpful to determine skewness.) If the histogram is symmetric, is it approximately bell-shaped or not?
– The references are at the end of the paper and should minimally contain Freedman et al. (2007) and your data source.
– Then follows the sheet with figure 1.
– Finally, a sheet with Appendix 1. Output from R Commander.
Staple all pages together.

5 Rubric

These are the criteria for grading this assignment:

Formatting (2 points) Does the paper respect APA Style (double spaced, paragraphs indented, title repeated when main text starts, pages numbered on top right, equations indented, etc.)? Are tables and figures numbered, do they have a title, and does the author refer to them by number? Is the layout of tables and figures as required by APA Style? Does the physical presentation indicate care in preparation and respect for the reader’s specifications (A4 paper, stapled in the top left corner, general neatness, etc.)?

Grammar, style, and spelling (1 point) Are grammar and spelling correct? Was the paper carefully proofread? Does the narrative flow well and develop logically? Are introductions, transitions, subconclusions, and conclusions convincing?

Structure (1 point) Does the structure do justice to the exigencies of the topic? Is the paper organized into coherent subsections which, when integrated, form a logical sequence of arguments leading directly to the conclusions?

Documentation, Scientific Format (1 point) Is documentation extensive enough for a paper of this level? Does it consistently follow APA Style? Are there anomalies such as unnecessary second-hand quotations or imprecise references? Is the data source carefully documented?

Contents (15 points)
– Does the abstract summarize the main points of the paper?
– Does the author in the introduction explain why the problem is relevant? Justify the method used? Report the data sources? (1 point)
– Does the author present and discuss the histogram? Does the author explain how the density is computed, and illustrate this for one interval of the histogram? Does the author interpret the meaning of the density obtained? Does the author describe the shape of the histogram? (12 points)
– Does the author correctly report the mean, the median, and the standard deviation of the data set in the paper? (3 points)

References

Bullock, R., Brody, M., and Weinberg, F. (2014). The Little Seagull Handbook. W.W. Norton & Company, New York and London, 2nd edition.
Freedman, D., Pisani, R., and Purves, R. (2007). Statistics. Norton, New York and London, 4th edition.
Glenn, C., Miller, R., Webb, S. S., Gray, L., and Hodges, J. (2004). The Hodges Harbrace Handbook. Thomson, Boston, 15th edition.
Kirszner, L. G. and Mandell, S. R. (2011). The Pocket Wadsworth Handbook. Wadsworth Cengage Learning, Boston, 5th edition.
Mankiw, N. G. (2006). How to write well. Greg Mankiw’s Blog, 7 October 2006.
McCloskey, D. N. (1999). Economical Writing: An Executive Summary. Eastern Economic Journal, 25(2):239–242.