STA101 Introduction to Statistics
Assignment 1
Luc Hens
January 11, 2017
1 Due date
Due on Thursday 9 February 2017 at the beginning of class. Start in time. If
you run into trouble, ask me for help (that’s what office hours are for).
2 Objectives
The objectives of this assignment are:
– to make you acquainted with some key data sources (on-line or in print);
– to make you acquainted with the standard format to store data on a
computer: plain text files with comma-separated values;
– to make you acquainted with statistical software (R and R Commander)
and use it to generate a density histogram and descriptive statistics for a
data set;
– to make you understand what density means, and how it is computed;
– to let you practice writing a clear and concise report describing your
sources, the data manipulations you performed, and the results you obtained.
3 Details
You can do the homework assignments individually or (preferably) in groups
of two. If you work in a group, both group members should contribute equally
to each of the assignments. Both group members should carefully proofread
the paper before it is submitted, and both group members are responsible for
the quality of the entire paper. If you have problems co-operating with your
partner, let me know immediately.
Start by reading:
– Chapters 3 and 4 from Freedman et al. (2007);
– the section on APA Style from the textbook from your Critical Thinking and Academic Writing courses (Bullock et al., 2014, pp. 158–191).
Alternatively, you can use Kirszner and Mandell (2011, chapter 35, pp.
236–263) or Glenn et al. (2004, pp. 652–679);
– the rules for written work (http://homepages.vub.ac.be/~lmahens/written_work.html);
– my document on how to typeset math with a word processor (http://homepages.vub.ac.be/~lmahens/Typesetting_math.pdf);
– McCloskey on writing well (McCloskey, 1999) (http://www.deirdremccloskey.com/docs/pdf/Article_309.pdf);
– Mankiw on writing well (Mankiw, 2006) (http://gregmankiw.blogspot.be/2006/10/how-to-write-well.html).
Download the pdf and the Microsoft Word-formatted (.doc) files with an example of how the paper should be formatted:
http://homepages.vub.ac.be/~lmahens/example2.pdf
http://homepages.vub.ac.be/~lmahens/example2.doc
Use the Microsoft Word-formatted (.doc) file as a template, that is: save the file
as STA101-assignment-1.doc, and write your text over the existing text using
your favorite word processor.
The title of your paper will be:
Assignment 1: The distribution of . . . [a short description of your data, e.g.,
the height of clowns in 122 circuses]
Start the main body of your paper (page 3) by repeating the title from the
title page, followed by a short general introduction (in APA Style the introduction
has no header). Refer in the paper to at least the data source and Freedman
et al. (2007), so the References section should contain at least those two sources.
(Note that the author of the data source can be an institution, not a person.)
The main body of the paper (the paper excluding the title page, the abstract,
and the references) is about 600 words long. When you have finished writing the
main text, summarize the paper in an abstract (maximum 100 words, but for
this short paper 50 words will do). In APA Style, the abstract is on page 2
(following the title page and before the introduction).
For this assignment, you collect a data set on a subject that interests you, and
use statistical software to generate a density histogram and compute statistics
describing the data set (mean, median, standard deviation). You write up the
results in a polished paper.
Step 1
Collect a data set on a subject that interests you. The data set should have at
least 50 cases, and one numerical variable. Don’t limit your population of
interest to 50 cases just because I required a data set of at least 50 cases; if you
use countries, use all countries of the world for which the data exist; if you use
scores of NBA basketball players, use all NBA players. The data should be
cross-section (measured in a given period or at a point in time). Don’t use
time series data, that is, data where the cases are subsequent fixed periods of
time (such as annual GDP, 1950–2015). Here are some examples of data you
can use for this assignment:
Sports data, such as the performance of an NBA player during each game
of a recent season (go to nba.com; pick a player; go to game logs; choose a variable, for instance: points per game, or rebounds per game, or minutes played
per game); the times of all the riders at the end of a stage race (like the Tour de
France); the times of all the runners (or the top 100 runners) in a marathon; number of
aces, number of break points saved, first serve percentage, points won returning first
serve, points won returning second serve, etc. for tennis players. You can find
such data on the web sites of ATP (men’s tennis) (http://www.atpworldtour.com/Rankings/MatchFacts.aspx),
WTA (women’s tennis), Tour de France (http://www.letour.fr/),
Vuelta a España (http://www.lavuelta.com/),
New York Marathon (http://www.nycmarathon.org/), Rotterdam Marathon
(http://www.abnamromarathonrotterdam.com/), and for many other runs on
http://www.uitslagen.com/. Many other sports data are available on-line.
Note that if times are expressed as hours:minutes:seconds, you have to convert
them to a single unit of measurement, say, minutes. For example, 3 hours, 55
minutes, and 22 seconds is converted to minutes as follows:
$$(3\ \text{hr} \times 60\ \text{min/hr}) + 55\ \text{min} + \frac{22\ \text{sec}}{60\ \text{sec/min}} = 235.3667\ \text{min}$$
This conversion is easy to do in R if you store the times as three variables
(hours, minutes, seconds), with one column for the hours, a second column
for the minutes, and a third column for the seconds (columns separated by
commas). You then create a new variable time.in.minutes using the following
line of script:
time.in.minutes <- hours*60 + minutes + seconds/60
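The line of script can be tried out on the worked example before you apply it to your own data. The following is a minimal sketch: the data frame name times is an assumption, and only the first row (3:55:22) comes from the text; the other two rows are made-up values for illustration.

```r
# Hypothetical finishing times; the first row is the worked example (3:55:22).
times <- data.frame(
  hours   = c(3, 2, 4),
  minutes = c(55, 8, 0),
  seconds = c(22, 30, 45)
)

# Convert each time to a single unit of measurement (minutes).
times$time.in.minutes <- times$hours * 60 + times$minutes + times$seconds / 60

# Show the converted times, rounded to four decimals.
print(round(times$time.in.minutes, 4))
```

The first value printed should agree with the 235.3667 minutes computed by hand above.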
Social, economic, or demographic data: The performance for a recent
year of all countries of the world on some social, economic, or demographic
measure (GDP, population, total imports, total exports, child mortality, unemployment rate, inflation rate, etc.). Make sure you understand the meaning
of the variable: don’t choose gross domestic product (GDP) if you don’t know
what gross domestic product means. Make sure the data are comparable. For
example, don’t use GDP in national currencies for all countries of the world,
because in that case Afghanistan’s GDP will be measured in Afghani, Albania’s
GDP in Lek, etc., and the numbers will be incomparable; use GDP expressed
in a common currency (like the US dollar) instead. Some sources are:
– World Bank, World Development Report (www.worldbank.org/wdr/)
– United Nations, Human Development Report (http://hdr.undp.org/en/)
– Feenstra, R. C., Inklaar, R. and Timmer, M. P. (2015). The Next Generation of the Penn World Table (http://www.rug.nl/research/ggdc/data/pwt/)
– International Monetary Fund, International Financial Statistics Yearbook
(available in print in the VUB library; check the World Tables)
– United Nations Statistics Division (http://unstats.un.org/unsd/default.htm)
– United Nations Commodity Trade Statistics Database (http://comtrade.un.org/)
– United Nations, International Trade Statistics Yearbook (http://comtrade.un.org/pb/)
– CIA, CIA World Fact Book (https://www.cia.gov/library/publications/the-world-factbook/index.html) (select: Guide to Country Comparisons)
If you are in doubt whether the data set you chose is appropriate for this assignment, check with me before you start your work.
Step 2
Store the data in a plain text file with comma-separated values (csv), where
lines represent cases and columns represent variables. Here’s how to do this.
Start your text editor (Notepad in Windows, TextEdit set to plain text mode in
macOS; see STA101-Getting-started.pdf). Create a new file by selecting File →
New. The first line of the file should have the variable names (no blank spaces
in variable names; if GDP per capita is the variable, use GDP.per.capita as
the variable name). Separate the variable names by a comma. The next lines
have the values of the variables, separated by a comma:
country     , GDP.per.capita
Afghanistan , 1322.18
Albania     , 7212.71
etc.
In the example above, I used blank spaces to nicely align the variables. That is
optional, but a good idea because it is easier to spot typos or anomalies. Use
blank spaces, not tabs. Make sure that there are no blank lines at the bottom
of the file, and no commas other than the separators between the variables.
Save the file: File → Save as. . . and choose an appropriate file name ending
with the extension .csv (comma-separated values), e.g., assignment-1-data.csv
Make sure you carefully record the bibliographical information of your data
source (to document the source in your paper) and the units of measurement
of your variable (e.g., millions of dollars) and any other relevant information
concerning the variable (definition or description, year,. . . , e.g., real GDP in
billions of dollars, 2013).
Alternatively, if you know how to use a spreadsheet program, you can
construct a spreadsheet containing the data and export the data to a plain-text comma-separated values (csv) file. You can use any spreadsheet program
(Google docs, LibreOffice Calc, Numbers, Microsoft Excel), but make sure the
spreadsheet uses decimal points, not decimal commas (so two and a half is 2.5,
not 2,5). Export the spreadsheet to a comma-separated values (csv) file (with
the extension .csv, so the file name will be something like STA101-assignment-1-data.csv). If you don’t know how to do this, consult the Help function of
your spreadsheet. Usually, it’s something like File → Export or File → Save
as. . . Open the data file (STA101-assignment-1-data.csv) with your text editor:
File → Open Document. . . Make sure there are no redundant commas at the
end of a line (remove them if there are). Make sure there are no blank lines at
the bottom (remove them if there are). Save the data file: File → Save. Close
the data file.
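Redundant commas can also be spotted from within R. The sketch below is one possible check, not a required part of the assignment: it writes two small example files to a temporary folder (the file name bad-example.csv and the folder are assumptions for illustration) and counts the comma-separated fields on each line with count.fields.

```r
# Write a small, correctly formatted example file (values from the example above).
csv_file <- file.path(tempdir(), "STA101-assignment-1-data.csv")
writeLines(c(
  "country,GDP.per.capita",
  "Afghanistan,1322.18",
  "Albania,7212.71"
), csv_file)

# Every line should have the same number of fields (here: 2).
fields <- count.fields(csv_file, sep = ",")
print(fields)

# A redundant comma at the end of a line shows up as an extra field.
bad_file <- file.path(tempdir(), "bad-example.csv")
writeLines(c(
  "country,GDP.per.capita",
  "Afghanistan,1322.18,"
), bad_file)
bad_fields <- count.fields(bad_file, sep = ",")
print(bad_fields)
```

If the numbers printed for your own file are not all equal, open the file in your text editor and look at the lines that deviate.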
Step 3
Start R Commander. This is the best way to start R Commander in OS X:
– go to the Utilities folder inside the Applications folder and drag the Terminal icon to the dock (this keeps the Terminal icon in the Dock from now
on);
– click the Terminal icon in the Dock to open the Terminal.app;
– in the Terminal window, type R and press return;
– in the Terminal window after the prompt (>), type:
library(Rcmdr)
(that’s Rcmdr, not rmcdr or RCMDR—R is case-sensitive) and press return.
The XQuartz (X11) app will start and an R Commander window will open
in X11. If you don’t see the R Commander window, click on the XQuartz
(X11) icon in the Dock.
To start R Commander in Windows:
– double-click the R icon;
– In the R console window after the prompt (>), type:
library(Rcmdr)
(that’s Rcmdr, not rmcdr or RCMDR—R is case-sensitive) and press return.
The R Commander window will open.
Step 4
Import your data from the data file (STA101-assignment-1-data.csv): in the R
Commander window, do: Data → Import data → from text file. . . ; A window
will open:
– for “Enter name for data set”, keep the default name (Dataset);
– for “Field Separator” check the box Commas;
– for “Decimal Point Character” check the box Period [.]
Click OK. A new window (called Open) appears. Select the data file and click
Open. In the R Commander window, click the button “View data set”. A new
window with the data set appears. Inspect the window to see whether your data
set was correctly imported.
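The same import can be done in a script, which is handy for checking that the variables have the types you expect. This is a sketch under assumptions: the exact command that R Commander generates may differ, and the example file is created in a temporary folder only so that the sketch runs on its own.

```r
# Create a small example file (values from the example in Step 2;
# blank spaces are used to align the columns).
csv_file <- file.path(tempdir(), "STA101-assignment-1-data.csv")
writeLines(c(
  "country     , GDP.per.capita",
  "Afghanistan , 1322.18",
  "Albania     , 7212.71"
), csv_file)

# Commas as field separator, period as decimal point character;
# strip.white = TRUE removes the blank spaces used for alignment.
Dataset <- read.csv(csv_file, header = TRUE, sep = ",", dec = ".",
                    strip.white = TRUE)

str(Dataset)   # one line per variable: name, type, first values
head(Dataset)  # the first few cases
```

The numerical variable should be listed as num (or int). If it is listed as a character or factor variable, the file probably contains a decimal comma or a stray character.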
Step 5
Use R Commander to compute the descriptive statistics (mean, median, standard deviation):
Statistics → Summaries → Active data set, and
Statistics → Summaries → Numerical Summaries
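The same statistics can be computed in the script window. The sketch below uses a made-up data set of five heights (the variable name height and the values are assumptions for illustration). Note that R's sd() divides by n − 1; if the definition of the standard deviation you use divides by n instead, the last lines show how to compute that version.

```r
# Hypothetical data set: five heights in cm.
Dataset <- data.frame(height = c(160, 165, 170, 175, 180))
x <- Dataset$height

mean_x   <- mean(x)     # arithmetic mean
median_x <- median(x)   # median
sd_x     <- sd(x)       # standard deviation with divisor n - 1

# Standard deviation with divisor n (root-mean-square deviation from the mean).
sd_n <- sqrt(mean((x - mean(x))^2))

print(c(mean = mean_x, median = median_x, sd = sd_x, sd.divisor.n = sd_n))
```

For small data sets the two versions of the standard deviation differ noticeably; with 50 or more cases the difference is small. State in your paper which one you report.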
Step 6
Use R Commander to generate a density histogram of your data. R Commander
has a Histogram instruction in the Graphs menu: Graphs → Histogram. . . A
window called Histogram opens. The window has two tabs: Data and Options.
In the Data tab select the variable of which you want to generate the histogram.
In the Options tab:
– select for Axis Scaling: Densities (to get a density histogram);
– type for Plot labels in the field for x-axis label the label you want to
appear on the x-axis; depending on your variable that will be something
like: height (cm) or test scores (points);
– leave the other fields (<auto>) unchanged.
Then press the OK button. R Commander will generate the histogram. (The
histogram may be hidden under the other windows: on a Mac, click the Terminal
or XQuartz icons in the Dock to see the histogram; you may have to resize the
window a bit to see the x-axis.)
You will notice that the histogram generated by R Commander differs from
the histograms in class and in Freedman et al. (2007) (e.g., figure 4 p. 37): in the
histogram generated by R Commander, the vertical axis (density) is not scaled
as percent per horizontal unit but as a decimal fraction per horizontal unit.
To find the density as percent per horizontal unit, you would need to multiply
the numbers on the vertical axis by 100%. (It is possible to plot a histogram
in R Commander with density as percent per horizontal unit by tweaking the
script; don’t bother, as long as you know that the vertical axis shows density as
a decimal fraction per horizontal unit.)
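The relation between the two scalings can be verified in a script. The sketch below is one possible way to do it and not necessarily the tweak referred to above; the heights are made-up values, and the output file name is a hypothetical choice. It checks that the areas of the blocks add up to 1 (decimal fractions) and, after multiplying by 100, to 100 percent.

```r
# Hypothetical data: 20 heights in cm.
x <- c(152, 158, 161, 163, 165, 166, 168, 169, 170, 171,
       172, 173, 174, 175, 177, 178, 180, 183, 186, 191)

# Compute the histogram without drawing it.
h <- hist(x, plot = FALSE)

# Area of each block = density * width of the class interval.
# With density as a decimal fraction per unit, the total area is 1.
area_fraction <- sum(h$density * diff(h$breaks))

# Rescale to percent per horizontal unit; the total area is now 100 (percent).
h$density <- 100 * h$density
area_percent <- sum(h$density * diff(h$breaks))

# Draw the rescaled density histogram into a pdf file in a temporary folder.
out_file <- file.path(tempdir(), "density-histogram.pdf")
pdf(out_file)
plot(h, freq = FALSE, main = "", xlab = "height (cm)", ylab = "percent per cm")
invisible(dev.off())

print(c(area.as.fraction = area_fraction, area.in.percent = area_percent))
```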
Copy the histogram and paste it into your paper as Figure 1. Density histogram of. . . . To paste a graph in a word processor document, do the following.
In Windows, bring the R window with the graph to the front. Do Edit → Copy.
Go to your word processor document and do Edit → Paste. In macOS, bring
the R window with the graph to the front. Choose Edit → Copy. Start the
Preview application and in Preview do File → New From Clipboard. Save as
a .png file. You can now copy and paste the graph from Preview into a word
processor document. To save a graph, bring the window with the graph to the
front and choose in the R menu File → Save as. . . In Windows, save in the
.png format. In OS X, save as .pdf—that is the only option, but you can open
the .pdf in Preview and save in .png.
Step 7
After R Commander generated the histogram, go back to the R script window
and type
hist(Dataset$nameofvariable, plot=FALSE)
where Dataset is the name you gave to your data set (Dataset is the default)
and nameofvariable is the name of your variable. The output window now
displays the breaks between the class intervals, the counts (absolute frequencies), the densities (expressed as decimal fractions per horizontal unit; to get
percentages per horizontal unit multiply by 100%), and the middle of the class
intervals. To sort the values of the variable from low to high type in the script
window:
sort(Dataset$nameofvariable)
and press the Submit button. Pick one class interval from the histogram. Use
the sorted list to find the count (absolute frequency) over this class interval and
explain in your paper. Then compute the relative frequency and the density
(percent per horizontal unit) for this class interval. In the paper, show your
work and explain. (You should of course get the same results as in the count and
density lists in the output window and in the histogram, but remember that R
Commander expresses density as a decimal fraction per horizontal unit, not as
a percentage per horizontal unit.) Carefully explain what the meaning of the
value for the density that you computed is, by referring to the histogram.
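The computation for one class interval can be sketched as follows. The data are made-up heights, and the choice of the second class interval is arbitrary. By default, hist() in R uses class intervals that include their right endpoint but not their left endpoint (except for the first interval, which includes both); the sketch follows that convention, so check which convention applies before you count by hand.

```r
# Hypothetical data: 20 heights in cm, sorted from low to high.
x <- sort(c(152, 158, 161, 163, 165, 166, 168, 169, 170, 171,
            172, 173, 174, 175, 177, 178, 180, 183, 186, 191))

h <- hist(x, plot = FALSE)

# Pick one class interval (here the second one).
i     <- 2
lower <- h$breaks[i]
upper <- h$breaks[i + 1]

# Count: number of cases in the interval (lower, upper].
count <- sum(x > lower & x <= upper)

# Relative frequency: count divided by the total number of cases.
relative_frequency <- count / length(x)

# Density: relative frequency divided by the width of the class interval.
density_fraction <- relative_frequency / (upper - lower)  # decimal fraction per unit
density_percent  <- 100 * density_fraction                # percent per unit

print(c(count = count, relative.frequency = relative_frequency,
        density.fraction = density_fraction, density.percent = density_percent))

# Compare with the values computed by hist().
print(c(count.from.hist = h$counts[i], density.from.hist = h$density[i]))
```

The density tells you what share of the cases falls in each horizontal unit of the class interval, on the assumption that the cases are spread evenly over the interval.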
Step 8
When you are finished, go to the Script window (the top panel of the R Commander window) and type
date()
Then press the Submit button (at the right, between the Script Window (top
panel) and the Output Window (bottom panel)). This will put a date stamp
in the Output Window. Save the output by doing File → Save output as. . . ;
and choose as file name R-output-1.txt. Print this file and include it with your
assignment as Appendix 1.
4 Structure and format of the paper
Your paper should be in APA Style and be organized as follows:
– A title page (page 1);
– An abstract that summarizes the paper in maximum 100 words (page 2);
– Repeat the title of page 1, and write an introduction that briefly describes
the data (what? who? when?) and documents the data source using the
APA-style parenthetical author-date format (include the corresponding
bibliographical reference in the references at the end of the paper). The
author can be an institutional author: National Basketball Association
(2011).
– A second paragraph refers to figure 1 (the density histogram of your data,
included on a separate page) and briefly explains what a density histogram
is (refer to Freedman et al., 2007). The same paragraph explains in detail
how you computed the absolute frequency (count), the relative frequency,
and the density for one class interval, and shows the work. Explain the
meaning of the number you found, and compare to the corresponding
density in figure 1 (the histogram generated with R).
– A third paragraph reports and interprets the mean, median, and standard
deviation (don’t forget the units of measurement; for instance, the mean
height is 174 cm, not 174). The same paragraph explains the shape of the
histogram: is it approximately symmetric or not? If not, is the histogram
left-skewed or right-skewed? (Comparing the median and the mean may
be helpful to determine skewness.) If the histogram is symmetric, is it
approximately bell-shaped or not?
– The references are at the end of the paper and should minimally contain
Freedman et al. (2007) and your data source.
– Then follows the sheet with figure 1.
– Finally, a sheet with Appendix 1. Output from R Commander.
Staple all pages together.
5 Rubric
These are the criteria for grading this assignment:
Formatting (2 points)
Does the paper respect APA Style (double spaced, paragraphs indented, title repeated when main text starts, pages numbered on top right, equations indented,
etc.)? Are tables and figures numbered, do they have a title, and does the author refer to them by number? Is the layout of tables and figures as required
by APA Style? Does the physical presentation indicate care in preparation and
respect for the reader’s specifications (A4 paper, stapled in the top left corner,
general neatness etc.)?
Grammar, style, and spelling (1 point)
Are grammar and spelling correct? Was the paper carefully proofread? Does
the narrative flow well, develop logically? Are introductions, transitions, subconclusions, conclusions convincing?
Structure (1 point)
Does the structure do justice to the exigencies of the topic? Is the paper organized into coherent subsections which, when integrated, form a logical sequence
of arguments leading directly to the conclusions?
Documentation, Scientific Format (1 point)
Is documentation extensive enough for a paper of this level? Does it consistently follow APA Style? Are there anomalies such as unnecessary second-hand
quotations or imprecise references? Is the data source carefully documented?
Contents (15 points)
– Does the abstract summarize the main points of the paper?
– Does the author in the introduction explain why the problem is relevant?
Justify the method used? Report the data sources? (1 point)
– Does the author present and discuss the histogram? Does the author
explain how the density is computed, and illustrate this for one interval
of the histogram? Does the author interpret the meaning of the density
obtained? Does the author describe the shape of the histogram? (12
points)
– Does the author correctly report the mean, the median, and the standard
deviation of the data set in the paper? (3 points)
References
Bullock, R., Brody, M., and Weinberg, F. (2014). The Little Seagull Handbook.
W.W. Norton & Company, New York and London, 2nd edition.
Freedman, D., Pisani, R., and Purves, R. (2007). Statistics. Norton, New York
and London, 4th edition.
Glenn, C., Miller, R., Webb, S. S., Gray, L., and Hodges, J. (2004). The
Hodges Harbrace Handbook. Thomson, Boston, 15th edition.
Kirszner, L. G. and Mandell, S. R. (2011). The Pocket Wadsworth Handbook.
Wadsworth Cengage Learning, Boston, 5th edition.
Mankiw, N. G. (2006). How to write well. Greg Mankiw’s Blog, 7 October 2006.
McCloskey, D. N. (1999). Economical Writing: An Executive Summary. Eastern
Economic Journal, 25(2):239–242.