EXISTING STATISTICS
QUANTITATIVE DATA ANALYSIS
BUSN 364 – Week 13
Özge Can

Existing Statistics/ Documents
  Many types of data are already available in the form of statistical documents (books, reports, etc.) => Secondary Data
  Researchers find out what data is available there
  Then they decide how it can be used to address their own research question

Existing Statistics/ Documents
  Existing statistics research is most appropriate:
    For testing hypotheses that involve variables in official reports of social, economic, and political conditions
    When you are looking over time and across nations (large-scale, widespread data)

Locating Data
  Main providers of existing statistics:
    Governments
    International agencies
    Industry/ sector bodies
    NGOs, universities and research institutions
    Other private sources
  Most existing documents are “free” (available to the public), but time and effort are required to search for specific information in them

Locating Data: Some Key Sources
  Turkish government statistics (TUIK): www.tuik.gov.tr
  UN statistics: http://data.un.org/
  OECD statistics portal: http://www.oecd.org/statsportal/
  World Bank data: http://data.worldbank.org/
  EU statistics (Eurostat): http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/
  US Census Bureau - Statistical Abstracts: http://www.census.gov/compendia/statab/

Example: Financial Databases
  Compustat (www.compustat.com) => annual income statements, balance sheets, cash flow and other data items from North American companies
  Datastream (www.datastream.com) => one of the largest financial statistical databases
  The Wharton Research Data Service (www.whartonwrds.com) => databases in the field of finance, accounting, banking, economics, management
  The Center for Research in Security Prices (CRSP, www.crsp.com) => security prices, returns, volume data from stock markets

Research Example Using Existing Statistics:
  An androgynous first name is one that can be given to either a girl or a boy without clearly marking the child’s
  gender (= unisex names)
  What is the extent of gender segregation in naming?
  Why do parents name their children in certain ways?

Research Example Using Existing Statistics:
  Lieberson et al. (2000) examined existing statistical data in the form of computerized records from the birth certificates of 11 million births in Illinois, US from 1916 to 1989.
  They found that:
    Androgynous first names are rare (increasing only in very recent years)
    Parents give such names to girls more often than to boys

Limitations of Using Secondary Data
  Existing data may not be appropriate for your research question. You need to consider:
    The units of analysis; the time and place of data collection; the sampling method used
  You must understand the topic the data cover so that you don’t make false assumptions and interpretations

Limitations of Using Secondary Data
  There are also problems regarding variable attributes:
    Validity problems: Your theoretical definitions do not match those of the organization that collected the information
    Reliability problems: Variable definitions or the method of collecting data change over time
    Missing data problems: Government agencies start or stop collecting data for political, budgetary or other reasons.
    The data may not be complete

Analysis of Quantitative Data
  Dealing with data:
    Coding data
    Entering data
    Cleaning data
    Analyzing data

Coding Data
  Coding: Systematically reorganizing raw data into a format that is easy to analyze using statistics
  Coding procedure => a set of rules stating that you will assign certain numbers to variable attributes
  Codebook => a document (one or more pages) describing the coding procedure
  One should prepare both before collecting the data

Codebook Example:

Entering and Cleaning Data
  Most computer programs designed for numerical data analysis require that the data be in a grid format (rows and columns)
  Accuracy is very important: the errors you make when coding and entering data threaten the validity of the measures and results
  Carefully check your coding and how you enter the data

Two Types of Quantitative Analysis:
  1) Descriptive statistics => for summarizing and describing data
  2) Inferential statistics => for drawing conclusions from data

1) Descriptive Statistics
  Frequency Distributions: Summarize the information in terms of the frequencies/ percentages in different categories
  Common types of graphical representations: Bar chart, pie chart, histogram

Graphical Representations of Frequency Distributions:

1) Descriptive Statistics
  Measures of Central Tendency: Statistical measures that summarize the values/scores of a variable into a single number
    Mode => the most frequent or common score
    Median => the middle point: the score at which half of the cases are above it and half below it
    Mean (arithmetic average) => the sum of all scores divided by the total number of them.
    Most widely used measure of central tendency

1) Descriptive Statistics
  Measures of Central Tendency
    If the frequency distribution of the data forms a normal distribution or bell-shaped curve, the three measures of central tendency equal each other
    If it is a skewed distribution (more cases in the upper or lower scores), then the three will not be equal

Measures of Central Tendency

1) Descriptive Statistics
  Measures of Variation (Spread): The dispersion or distribution of the data around the mean
    Range => the distance between the highest and lowest scores
    Percentile => the percentage of cases at or below a score or point
    Standard deviation => roughly the average distance between the scores and the mean

Standard Deviation:

Exercise:
  What are the mode, median, mean and range for the below data?
  13, 13, 13, 13, 14, 14, 16, 18, 21
    Mode: 13
    Median: 14
    Mean: 15
    Range: 8

Statistical Relationships
  Expression of whether there is an association between two variables
  Covariation => whether they tend to appear together (or they are independent)

Statistical Relationships
  To indicate statistical relationships:
    Scattergram => a diagram displaying the relationship between two variables
    Contingency table => a summary format for two or more variables, showing the percentage or number of cases at the intersection of variable categories
    Measure of association => a single number that expresses the strength of a relationship. There are many of them (e.g.
    chi-square, rho, lambda, correlation coefficient)

A Scattergram:

A Contingency Table:

Statistical Control
  For causal relationships, temporal order and association are not enough
  We must eliminate alternative explanations that can make the hypothesized relationship spurious
  In non-experimental research, we can statistically control for alternative explanations by adding control variables

Statistical Control: Example
  “The relationship between height and liking of basketball”
  Control variable: Gender
    If it has no effect => Both tall males and tall females like basketball more than short males and short females (suspected alternative explanation has no effect)
    If it has an effect => Tall males are more likely than short males to like basketball, and tall females are no more likely to like it than short females (gender, not height, is the true explanation)

Multiple Regression
  Very popular statistical technique used with interval or ratio-level data
  Its great advantage is its ability to adjust for several control variables simultaneously
  Its results tell two things:
    Overall predictive power of the set of independent and control variables on the dependent variable (R-squared)
    The direction and size of the effect of each variable on the dependent variable

2) Inferential Statistics
  Build on probability theory to test hypotheses formally; permit inferences from a sample to a population
  Statistical significance => the probability of finding a relationship in the sample when there is none in the population (tells us what is likely)
  Levels of significance => expression of statistical significance in terms of levels (e.g. the results are significant at the .05 level)

2) Inferential Statistics
  There are two objectives:
  I. To find out differences among groups
    Are small start-up businesses more innovative than large bureaucratic ones?
    Are there more men than women on the boards of companies?

2) Inferential Statistics
  There are two objectives:
  II.
  To find out relationships
    Is there a link between sunny weather and how people feel at work?
    Does greater company use of social media (such as Facebook and Twitter) increase their reputation with customers?
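
Sketch: testing for a relationship
  A minimal sketch of the second objective (finding relationships), under stated assumptions. The sunshine_hours and mood_scores values are made up for illustration and come from no study, and the function names pearson_r and permutation_p_value are hypothetical. The code computes a correlation coefficient as the measure of association, then estimates statistical significance with a permutation test, which is one possible approach among several: the second variable is shuffled many times to mimic a population with no relationship, and the p-value is the share of shuffles that give a correlation at least as strong as the observed one.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_permutations=10_000, seed=0):
    """Two-sided permutation test for the correlation between xs and ys.

    Shuffling ys breaks any link with xs, so each shuffle is one draw from
    a world where there is no relationship between the two variables.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_strong = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_strong += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (at_least_as_strong + 1) / (n_permutations + 1)


# Hypothetical data (illustration only): daily sunshine hours and a
# self-reported mood score at work for ten employees.
sunshine_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mood_scores = [3, 2, 4, 4, 5, 5, 7, 6, 8, 9]

r = pearson_r(sunshine_hours, mood_scores)
p = permutation_p_value(sunshine_hours, mood_scores)

print(f"correlation coefficient r = {r:.2f}")
print(f"permutation p-value = {p:.4f}")
if p < 0.05:
    print("The relationship is significant at the .05 level")
else:
    print("The relationship is not significant at the .05 level")
```

  A p-value below .05 would be reported as significant at the .05 level. It indicates that a relationship this strong would be unlikely in the sample if none existed in the population; it does not by itself rule out alternative explanations, which is what control variables are for.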