# Statistics for the Sciences


Charles Peters


Contents

- 1 Background
  - 1.1 Populations, Samples and Variables
  - 1.2 Types of Variables
  - 1.3 Random Experiments and Sample Spaces
  - 1.4 Computing in Statistics
  - 1.5 Exercises
- 2 Descriptive and Graphical Statistics
  - 2.1 Location Measures
    - 2.1.1 The Mean
    - 2.1.2 The Median and Other Quantiles
    - 2.1.3 Trimmed Means
    - 2.1.4 Grouped Data
    - 2.1.5 Histograms
    - 2.1.6 Robustness
    - 2.1.7 The Five Number Summary
    - 2.1.8 The Mode
    - 2.1.9 Exercises
  - 2.2 Measures of Variability or Scale
    - 2.2.1 The Variance and Standard Deviation
    - 2.2.2 The Coefficient of Variation
    - 2.2.3 The Mean and Median Absolute Deviation
    - 2.2.4 The Interquartile Range
    - 2.2.5 Boxplots
    - 2.2.6 Exercises
  - 2.3 Jointly Distributed Variables
    - 2.3.1 Side by Side Boxplots
    - 2.3.2 Scatterplots
    - 2.3.3 Covariance and Correlation
    - 2.3.4 Exercises
- 3 Probability
  - 3.1 Basic Definitions. Equally Likely Outcomes
  - 3.2 Combinations of Events
    - 3.2.1 Exercises
  - 3.3 Rules for Probability Measures
  - 3.4 Counting Outcomes. Sampling with and without Replacement
    - 3.4.1 Exercises
  - 3.5 Conditional Probability
    - 3.5.1 Relating Conditional and Unconditional Probabilities
    - 3.5.2 Bayes' Rule
  - 3.6 Independent Events
    - 3.6.1 Exercises
  - 3.7 Replications of a Random Experiment
- 4 Discrete Distributions
  - 4.1 Random Variables
  - 4.2 Discrete Random Variables
  - 4.3 Expected Values
    - 4.3.1 Exercises
  - 4.4 Bernoulli Random Variables
    - 4.4.1 The Mean and Variance of a Bernoulli Variable
  - 4.5 Binomial Random Variables
    - 4.5.1 The Mean and Variance of a Binomial Distribution
    - 4.5.2 Exercises
  - 4.6 Hypergeometric Distributions
    - 4.6.1 The Mean and Variance of a Hypergeometric Distribution
  - 4.7 Poisson Distributions
    - 4.7.1 The Mean and Variance of a Poisson Distribution
    - 4.7.2 Exercises
  - 4.8 Jointly Distributed Variables
    - 4.8.1 Covariance and Correlation
  - 4.9 Multinomial Distributions
    - 4.9.1 Exercises
- 5 Continuous Distributions
  - 5.1 Density Functions
  - 5.2 Expected Values and Quantiles for Continuous Distributions
    - 5.2.1 Expected Values
    - 5.2.2 Quantiles
    - 5.2.3 Exercises
  - 5.3 Uniform Distributions
  - 5.4 Exponential Distributions and Their Relatives
    - 5.4.1 Exponential Distributions
    - 5.4.2 Gamma Distributions
    - 5.4.3 Weibull Distributions
    - 5.4.4 Exercises
  - 5.5 Normal Distributions
    - 5.5.1 Tables of the Standard Normal Distribution
    - 5.5.2 Other Normal Distributions
    - 5.5.3 The Normal Approximation to the Binomial Distribution
    - 5.5.4 Exercises
- 6 Joint Distributions and Sampling Distributions
  - 6.1 Introduction
  - 6.2 Jointly Distributed Continuous Variables
    - 6.2.1 Mixed Joint Distributions
    - 6.2.2 Covariance and Correlation
    - 6.2.3 Bivariate Normal Distributions
  - 6.3 Independent Random Variables
    - 6.3.1 Exercises
  - 6.4 Sums of Random Variables
    - 6.4.1 Simulating Random Samples
  - 6.5 Sample Sums and the Central Limit Theorem
    - 6.5.1 Exercises
  - 6.6 Other Distributions Associated with Normal Sampling
    - 6.6.1 Chi Square Distributions
    - 6.6.2 Student t Distributions
    - 6.6.3 The Joint Distribution of the Sample Mean and Variance
    - 6.6.4 Exercises
- 7 Statistical Inference for a Single Population
  - 7.1 Introduction
  - 7.2 Estimation of Parameters
    - 7.2.1 Estimators
    - 7.2.2 Desirable Properties of Estimators
  - 7.3 Estimating a Population Mean
    - 7.3.1 Confidence Intervals
    - 7.3.2 Small Sample Confidence Intervals for a Normal Mean
    - 7.3.3 Exercises
  - 7.4 Estimating a Population Proportion
    - 7.4.1 Choosing the Sample Size
    - 7.4.2 Confidence Intervals for p
    - 7.4.3 Exercises
  - 7.5 Estimating Quantiles
    - 7.5.1 Exercises
  - 7.6 Estimating the Variance and Standard Deviation
  - 7.7 Hypothesis Testing
    - 7.7.1 Test Statistics, Type 1 and Type 2 Errors
  - 7.8 Hypotheses About a Population Mean
    - 7.8.1 Tests for the mean when the variance is unknown
  - 7.9 p-values
    - 7.9.1 Exercises
  - 7.10 Hypotheses About a Population Proportion
    - 7.10.1 Exercises
- 8 Regression and Correlation
  - 8.1 Examples of Linear Regression Problems
  - 8.2 Least Squares Estimates
    - 8.2.1 The "lm" Function in R
    - 8.2.2 Exercises
  - 8.3 Distributions of the Least Squares Estimators
    - 8.3.1 Exercises
  - 8.4 Inference for the Regression Parameters
    - 8.4.1 Confidence Intervals for the Parameters
    - 8.4.2 Hypothesis Tests for the Parameters
    - 8.4.3 Exercises
  - 8.5 Correlation
    - 8.5.1 Confidence intervals for ρ
    - 8.5.2 Exercises
- 9 Inference from Multiple Samples
  - 9.1 Comparison of Two Population Means
    - 9.1.1 Large Samples
    - 9.1.2 Comparing Two Population Proportions
    - 9.1.3 Samples from Normal Distributions
    - 9.1.4 Exercises
  - 9.2 Paired Observations
    - 9.2.1 Crossover Studies
    - 9.2.2 Exercises
  - 9.3 More than Two Independent Samples: Single Factor Analysis of Variance
    - 9.3.1 Example Using R
    - 9.3.2 Multiple Comparisons
    - 9.3.3 Exercises
  - 9.4 Two-Way Analysis of Variance
    - 9.4.1 Interactions Between the Factors
    - 9.4.2 Exercises
- 10 Analysis of Categorical Data
  - 10.1 Multinomial Distributions
    - 10.1.1 Estimators and Hypothesis Tests for the Parameters
    - 10.1.2 Multinomial Probabilities That Are Functions of Other Parameters
    - 10.1.3 Exercises
  - 10.2 Testing Equality of Multinomial Probabilities
  - 10.3 Independence of Attributes: Contingency Tables
    - 10.3.1 Exercises
- 11 Miscellaneous Topics
  - 11.1 Multiple Linear Regression
    - 11.1.1 Inferences Based on Normality
    - 11.1.2 Using R's "lm" Function for Multiple Regression
    - 11.1.3 Factor Variables as Predictors
    - 11.1.4 Exercises
  - 11.2 Nonparametric Methods
    - 11.2.1 The Signed Rank Test
    - 11.2.2 The Wilcoxon Rank Sum Test
    - 11.2.3 Exercises
  - 11.3 Bootstrap Confidence Intervals
    - 11.3.1 Exercises


Chapter 1

Background

Statistics is the art of summarizing data, depicting data, and extracting information from it. Statistics and the theory of probability are distinct subjects, although statistics depends on probability to quantify the strength of its inferences. The probability used in this course will be developed in Chapter 3 and throughout the text as needed. We begin by introducing some basic ideas and terminology.

1.1 Populations, Samples and Variables

A population is a set of individual elements whose collective properties are the subject of investigation. Usually, populations are large collections whose individual members cannot all be examined in detail. In statistical inference a manageable subset of the population is selected according to certain sampling procedures and properties of the subset are generalized to the entire population. These generalizations are accompanied by statements quantifying their accuracy and reliability. The selected subset is called a sample from the population.

Examples:

(a) the population of registered voters in a congressional district,
(b) the population of U.S. adult males,
(c) the population of currently enrolled students at a certain large urban university,
(d) the population of all transactions in the U.S. stock market for the past month,
(e) the population of all peak temperatures at points on the Earth’s surface over a given time interval.

Some samples from these populations might be:
(a) the voters contacted in a pre-election telephone poll,
(b) adult males interviewed by a TV reporter,
(c) the dean’s list,
(d) transactions recorded on the books of Smith Barney,
(e) peak temperatures recorded at several weather stations.

Clearly, for these particular samples, some generalizations from sample to population would be highly questionable.




A population variable is an attribute that has a value for each individual in the population. In other words, it is a function from the population to some set of possible values. It may be helpful to imagine a population as a spreadsheet with one row or record for each individual member. Along the ith row, the values of a number of attributes of the ith individual are recorded in different columns. The column headings of the spreadsheet can be thought of as the population variables. For example, if the population is the set of currently enrolled students at the urban university, some of the variables are academic classification, number of hours currently enrolled, total hours taken, grade point average, gender, ethnic classification, major, and so on. Variables, such as these, that are defined for the same population are said to be jointly observed or jointly distributed.

1.2 Types of Variables

Variables are classified according to the kinds of values they have. The three basic types are numeric variables, factor variables, and ordered factor variables. Numeric variables are those for which arithmetic operations such as addition and subtraction make sense. Numeric variables are often related to a scale of measurement and expressed in units, such as meters, seconds, or dollars. Factor variables are those whose values are mere names, to which arithmetic operations do not apply. Factors usually have a small number of possible values. These values might be designated by numbers. If they are, the numbers that represent distinct values are chosen merely for convenience. The values of factors might also be letters, words, or pictorial symbols. Factor variables are sometimes called nominal variables or categorical variables. Ordered factor variables are factors whose values are ordered in some natural and important way. Ordered factors are also called ordinal variables. Some textbooks have a more elaborate classification of variables, with various subtypes. The three types above are enough for our purposes.

Examples: Consider the population of students currently enrolled at a large university. Each student has a residency status, either resident or nonresident. Residency status is an unordered factor variable. Academic classification is an ordered factor with values “freshman”, “sophomore”, “junior”, “senior”, “post-baccalaureate” and “graduate student”. The number of hours enrolled is a numeric variable with integer values. The distance a student travels from home to campus is a numeric variable expressed in miles or kilometers. Home area code is an unordered factor variable whose values are designated by numbers.
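These distinctions map directly onto R's data types. The following sketch uses made-up records for five students (the values are hypothetical, not from any data set used later in the text):

```r
# Hypothetical records for five students
hours <- c(12, 15, 9, 12, 6)                        # numeric variable
residency <- factor(c("resident", "nonresident", "resident",
                      "resident", "nonresident"))   # unordered factor
classif <- factor(c("freshman", "senior", "junior", "freshman", "sophomore"),
                  levels = c("freshman", "sophomore", "junior", "senior",
                             "post-baccalaureate", "graduate student"),
                  ordered = TRUE)                   # ordered factor

mean(hours)              # arithmetic makes sense only for the numeric variable
table(residency)         # factors are summarized by counting their values
classif[2] > classif[3]  # order comparisons work for ordered factors
```

Trying `mean(residency)` produces a warning and `NA`, which is R's way of saying that arithmetic does not apply to a factor.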

1.3 Random Experiments and Sample Spaces

An experiment can be something as simple as flipping a coin or as complex as conducting a public opinion poll. A random experiment is one with the following two characteristics:

(1) The experiment can be replicated an indefinite number of times under essentially the same experimental conditions.

(2) There is a degree of uncertainty in the outcome of the experiment. The outcome may vary from replication to replication even though experimental conditions are the same.



When we say that an experiment can be replicated under the same conditions, we mean that control- lable or observable conditions that we think might affect the outcome are the same. There may be hidden conditions that affect the outcome, but we cannot account for them. Implicit in (1) is the idea that replications of a random experiment are independent, that is, the outcomes of some replications do not affect the outcomes of others. Obviously, a random experiment is an idealization of a real experiment. Some simple experiments, such as tossing a coin, approach this ideal closely while more complicated experiments may not.

The sample space of a random experiment is the set of all its possible outcomes. We use the Greek capital letter Ω (omega) to denote the sample space. There is some degree of arbitrariness in the description of Ω. It depends on how the outcomes of the experiment are represented symbolically.

Examples:

(a) Toss a coin. Ω = {H,T}, where “H” denotes a head and “T” a tail. Another way of representing the outcome is to let the number 1 denote a head and 0 a tail (or vice-versa). If we do this, then Ω = {0, 1}. In the latter representation the outcome of the experiment is just the number of heads.

(b) Toss a coin 5 times, i.e., replicate the experiment in (a) 5 times. An outcome of this experiment is a 5 term sequence of heads and tails. A typical outcome might be indicated by (H,T,T,H,H), or by (1,0,0,1,1). Even for this little experiment it is cumbersome to list all the outcomes, so we use a shorter notation

$$\Omega = \{(x_1, x_2, x_3, x_4, x_5) \mid x_i = 0 \text{ or } x_i = 1 \text{ for each } i\}.$$
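This sample space is small enough to enumerate by machine. A sketch in R, using `expand.grid` to list all 2⁵ = 32 outcomes:

```r
# All outcomes of 5 coin tosses, coded 1 = head, 0 = tail
omega <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1, x4 = 0:1, x5 = 0:1)
nrow(omega)           # 32 outcomes in the sample space
head(rowSums(omega))  # number of heads in the first few outcomes
```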

(c) Select a student randomly from the population of all currently enrolled students. The sample space is the same as the population. The word “randomly” is vague. We will define it later.

(d) Repeat the Michelson-Morley experiment to measure the speed of the Earth relative to the ether (which doesn’t exist, as we now know). The outcome of the experiment could conceivably be any nonnegative number, so we take Ω = [0,∞) = {x | x is a real number and x ≥ 0}. Uncertainty arises from the fact that this is a very delicate experiment with several sources of unpredictable error.

1.4 Computing in Statistics

Even moderately large data sets cannot be managed effectively without a computer and computer software. Furthermore, much of applied statistics is exploratory in nature and cannot be carried out by hand, even with a calculator. Spreadsheet programs, such as Microsoft Excel, are designed to manipulate data in tabular form and have functions for performing the common tasks of statistics. In addition, many add-ins are available, some of them free, for enhancing the graphical and statistical capabilities of spreadsheet programs. Some of the exercises and examples in this text make use of Excel with its built-in data analysis package. Because it is so common in the business world, it is important for students to have some experience with Excel or a similar program.

The disadvantages of spreadsheet programs are their dependence on the spreadsheet data format with cell ranges as input for statistical functions, their lack of flexibility, and their relatively poor graphics. Many highly sophisticated packages for statistics and data analysis are available. Some of



the best known commercial packages are Minitab, SAS, SPSS, Splus, Stata, and Systat. The package used in this text is called R. It is an open source implementation of the same language used in Splus and may be downloaded free at

http://www.r-project.org .

After downloading and installing R we recommend that you download and install another free package called Rstudio. It can be obtained from

http://www.rstudio.com .

Rstudio makes importing data into R much easier and makes it easier to integrate R output with other programs. Detailed instructions on using R and Rstudio for the exercises will be provided.

Data files used in this course are from four sources. Some are local in origin and come from student or course data at the University of Houston. Others are simulated but made to look as realistic as possible. These and others are available at

http://www.math.uh.edu/ charles/data .

Many data sets are included with R in the datasets library and other contributed packages. We will refer to them frequently. The main external sources of data are the data archives maintained by the Journal of Statistics Education.

www.amstat.org/publications/jse

and the Statistical Science Web:

http://www.stasci.org/datasets.html.

1.5 Exercises

1. Go to http://www.math.uh.edu/ charles/data. Examine the data set “Air Pollution Filter Noise”. Identify the variables and give their types.

2. Highlight the data in Air Pollution Filter Noise. Include the column headings but not the language preceding the column headings. Copy and paste the data into a plain text file, for example with Notepad in Windows. Import the text file into Excel or another spreadsheet program. Create a new folder or directory named “math3339” and save both files there.

3. Start R by double clicking on the big blue R icon on your desktop. Click on the file menu at the top of the R Gui window. Select “change dir . . . ” . In the window that opens next, find the name of the directory where you saved the text file and double click on the name of that directory. Suppose that you named your file “apfilternoise”. (Name it anything you like.) Import the file into R with the command



> apfilternoise=read.table("apfilternoise.txt", header=T)

and display it with the command

> apfilternoise

Click on the file menu at the top again and select “Exit”. At the prompt to save your workspace, click “Yes”. If you open the folder where your work was saved you will see another big blue R icon. If you double click on it, R will start again and your previously saved workspace will be restored.

If you use Rstudio for this exercise you can import apfilternoise into R by clicking on the “Import Dataset” tab. This will open a window on your file system and allow you to select the file you saved in Exercise 2. The dialog box allows you to rename the data and make other minor changes before importing the data as a data frame in R.

4. If you are using Rstudio, click on the “Packages” tab and then the word “datasets”. Find the data set “airquality” and click on it. Read about it. If you are using R alone, type

> help(airquality)

at the command prompt > in the Console window.

Then type

> airquality

to view the data. Could “Month” and “Day” be considered ordered factors rather than numeric variables?

5. A random experiment consists of throwing a standard 6-sided die and noting the number of spots on the upper face. Describe the sample space of this experiment.

6. An experiment consists of replicating the experiment in exercise 5 four times. Describe the sample space of this experiment. How many possible outcomes does this experiment have?


Chapter 2

Descriptive and Graphical Statistics

A large part of a statistician’s job consists of summarizing and presenting important features of data. Simply looking at a spreadsheet with 1000 rows and 50 columns conveys very little information. Most likely, the user of the data would rather see numerical and graphical summaries of how the values of different variables are distributed and how the variables are related to each other. This chapter concerns some of the most important ways of summarizing data.

2.1 Location Measures

2.1.1 The Mean

Suppose that x is the name of a numeric variable whose values are recorded either for the entire population or for a sample from that population. Let the n recorded values of x be denoted by x1, x2, . . . , xn. These are not necessarily distinct numbers. The mean or average of these values is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

When the values of x for the entire population are included, it is customary to denote this quantity by µ(x) and call it the population mean. The mean is called a location measure partly because it is taken as a representative or central value of x. More importantly, it behaves in a certain way if we change the scale of measurement for values of x. Imagine that x is temperature recorded in degrees Celsius and we decide to change the unit of measurement to degrees Fahrenheit. If yi denotes the Fahrenheit temperature of the ith individual, then yi = 1.8xi + 32. In effect, we have defined a new variable y by the equation y = 1.8x + 32. The means of the new and old variables have the same relationship as the individual measurements have.

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n} (1.8x_i + 32) = 1.8\bar{x} + 32$$

In general, if a and b > 0 are constants and y = a + bx, then ȳ = a + bx̄. Other location measures introduced below behave in the same way.
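The scale-change rule ȳ = a + bx̄ is easy to confirm numerically. A sketch in R with hypothetical Celsius readings:

```r
# Hypothetical Celsius temperatures
x <- c(18.2, 21.5, 19.9, 25.0, 16.4)
y <- 1.8 * x + 32                        # the same measurements in Fahrenheit

mean(y)                                  # equals 1.8 * mean(x) + 32
all.equal(mean(y), 1.8 * mean(x) + 32)   # TRUE, up to floating-point tolerance
```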




When there are repeated values of x, there is an equivalent formula for the mean. Let the m distinct values of x be denoted by $v_1, \ldots, v_m$. Let $n_i$ be the number of times $v_i$ is repeated and let $f_i = n_i/n$. Note that $\sum_{i=1}^{m} n_i = n$ and $\sum_{i=1}^{m} f_i = 1$. Then the average is given by

$$\bar{x} = \sum_{i=1}^{m} f_i v_i$$

The number $n_i$ is the frequency of the value $v_i$ and $f_i$ is its relative frequency.
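In R, `table` computes the frequencies nᵢ directly, and the weighted form reproduces the ordinary mean. A sketch with illustrative data:

```r
x <- c(2, 3, 3, 5, 5, 5, 7)        # n = 7 values, m = 4 distinct values
tab <- table(x)                    # frequencies n_i of the distinct values v_i
v <- as.numeric(names(tab))        # distinct values v_1, ..., v_m
f <- as.numeric(tab) / length(x)   # relative frequencies f_i
sum(f * v)                         # the same number as mean(x)
```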

2.1.2 The Median and Other Quantiles

Let x be a numeric variable with values $x_1, x_2, \ldots, x_n$. Arrange the values in increasing order $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. The median of x is a number median(x) such that at least half the values of x are ≤ median(x) and at least half the values of x are ≥ median(x). This conveys the essential idea but unfortunately it may define an interval of numbers rather than a single number. The ambiguity is usually resolved by taking the median to be the midpoint of that interval. Thus, if n is odd, n = 2k+1, where k is a positive integer,

$$\mathrm{median}(x) = x_{(k+1)},$$

while if n is even, n = 2k,

$$\mathrm{median}(x) = \frac{x_{(k)} + x_{(k+1)}}{2}.$$

Let p ∈ (0, 1) be a number between 0 and 1. The pth quantile of x is more commonly known as the 100pth percentile; e.g., the 0.8 quantile is the same as the 80th percentile. We define it as a number q(x, p) such that the fraction of values of x that are ≤ q(x, p) is at least p and the fraction of values of x that are ≥ q(x, p) is at least 1−p. For example, at least 80 percent of the values of x are ≤ the 80th percentile of x and at least 20 percent of the values of x are ≥ its 80th percentile. Again, this may not define a unique number q(x, p). Software packages have rules for resolving the ambiguity, but the details are usually not important.

The median is the 50th percentile, i.e., the 0.5 quantile. The 25th and 75th percentiles are called the first and third quartiles. The 10th, 20th, 30th, etc. percentiles are called the deciles. The median is a location measure as defined in the preceding section.
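R's `median` and `quantile` functions implement these ideas; `quantile` resolves the ambiguity mentioned above through its `type` argument (nine rules are available), so different software can report slightly different percentiles. A sketch:

```r
x <- c(0.4, 1.1, 1.7, 2.3, 3.0, 3.8, 4.6, 5.2, 6.9, 8.5)
median(x)                        # n = 10 is even: midpoint of the 5th and 6th order statistics
quantile(x, 0.80)                # the 80th percentile (default rule, type = 7)
quantile(x, c(0.25, 0.50, 0.75)) # first quartile, median, third quartile
```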

2.1.3 Trimmed Means

Trimmed means of a variable x are obtained by finding the mean of the values of x excluding a given percentage of the largest and smallest values. For example, the 5% trimmed mean is the mean of the values of x excluding the largest 5% of the values and the smallest 5% of the values. In other words, it is the mean of all the values between the 5th and 95th percentiles of x. A trimmed mean is a location measure.
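R computes trimmed means through the `trim` argument of `mean`. A sketch showing how trimming discounts an extreme value:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # one extreme value
mean(x)               # 14.5, pulled upward by the outlier
mean(x, trim = 0.1)   # drop the smallest 10% and largest 10%: mean of 2, ..., 9
```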


2.1.4 Grouped Data

Sometimes large data sets are summarized by grouping values. Let x be a numeric variable with values x1, x2, . . . , xn. Let c0 < c1 < . . . < cm be numbers such that all the values of x are between c0 and cm. For each i, let ni be the number of values of x (including repetitions) that are in the interval (ci−1, ci], i.e., the number of indices j such that ci−1 < xj ≤ ci. A frequency table of x is a table showing the class intervals (ci−1, ci] along with the frequencies ni with which the data values fall into each interval. Sometimes additional columns are included showing the relative frequencies fi = ni/n, the cumulative relative frequencies Fi = f1 + f2 + · · · + fi, and the midpoints of the intervals.

Example 2.1. The data below are 50 measured reaction times in response to a sensory stimulus, arranged in increasing order. A frequency table is shown below the data.

0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80 0.88 1.02 1.08 1.12 1.13 1.17 1.21 1.23 1.35 1.41 1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72 1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20 2.29 2.32 2.39 2.47 2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73

Interval  Midpoint   ni    fi     Fi
(0,1]        0.5     11   0.22   0.22
(1,2]        1.5     22   0.44   0.66
(2,3]        2.5     11   0.22   0.88
(3,4]        3.5      4   0.08   0.96
(4,5]        4.5      2   0.04   1.00

If only a frequency table like the one above is given, the mean and median cannot be calculated exactly. However, they can be estimated. If we take the midpoint of an interval as a stand-in for all the values in that interval, then we can use the formula in the preceding section for calculating a mean with repeated values. Thus, in the example above, we would estimate the mean as

0.22(0.5) + 0.44(1.5) + 0.22(2.5) + 0.08(3.5) + 0.04(4.5) = 1.78.

Estimating the median is a bit more difficult. By examining the cumulative frequencies Fi, we see that 22% of the data is less than or equal to 1 and 66% of the data is less than or equal to 2. Therefore, the median lies between 1 and 2. That is, it is 1 + a certain fraction of the distance from 1 to 2. A reasonable guess at that fraction is given by linear interpolation between the cumulative frequencies at 1 and 2. In other words, we estimate the median as

1 + [(0.50 − 0.22)/(0.66 − 0.22)] × (2 − 1) = 1.636.

A cruder estimate of the median is just the midpoint of the interval that contains the median, in this case 1.5. We leave it as an exercise to calculate the mean and median from the data of Example 2.1 and to compare them to these estimates.
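The two estimates above can be reproduced with a short script. This is a Python sketch of the calculation; the interval edges and counts are taken directly from the frequency table of Example 2.1.

```python
edges = [0, 1, 2, 3, 4, 5]       # class interval endpoints c_0, ..., c_m
freqs = [11, 22, 11, 4, 2]       # counts n_i for (c_{i-1}, c_i]
n = sum(freqs)

mids = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
rel = [ni / n for ni in freqs]   # relative frequencies f_i

# Estimated mean: each midpoint stands in for all values in its interval.
mean_est = sum(m * f for m, f in zip(mids, rel))

# Estimated median: linear interpolation within the interval where the
# cumulative relative frequency first reaches 0.5.
cum = 0.0
for lo, hi, f in zip(edges, edges[1:], rel):
    if cum + f >= 0.5:
        median_est = lo + (0.5 - cum) / f * (hi - lo)
        break
    cum += f

print(round(mean_est, 2))     # 1.78
print(round(median_est, 3))   # 1.636
```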


2.1.5 Histograms

The figure below is a histogram of the reaction times.

> reacttimes=read.table("reacttimes.txt",header=T)

> hist(reacttimes$Times,breaks=0:5,xlab="Reaction Times",main=" ")

[Histogram of the reaction times: Frequency (0 to 20) on the vertical axis versus Reaction Times on the horizontal axis, with bars over the class intervals (0,1] through (4,5].]

The histogram is a graphical depiction of the grouped data. The endpoints ci of the class intervals are shown on the horizontal axis. This is an absolute frequency histogram because the heights of the vertical bars above the class intervals are the absolute frequencies ni. A relative frequency histogram would show the relative frequencies fi. A density histogram has bars whose heights are the relative frequencies divided by the lengths of the corresponding class intervals. Thus, in a density histogram the area of a bar is equal to the relative frequency of its class interval. If all class intervals have the same length, these three types of histograms have the same shape and convey the same visual information.
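The relationship between the three kinds of histogram can be made concrete with a small sketch (Python, using the grouped reaction-time data): dividing each relative frequency by its interval width gives density heights whose bars enclose a total area of 1.

```python
import math

edges = [0, 1, 2, 3, 4, 5]
freqs = [11, 22, 11, 4, 2]
n = sum(freqs)

# Density height = relative frequency / interval width,
# so each bar's area equals its relative frequency.
heights = [(ni / n) / (hi - lo)
           for ni, lo, hi in zip(freqs, edges, edges[1:])]
total_area = sum(h * (hi - lo)
                 for h, lo, hi in zip(heights, edges, edges[1:]))

print([round(h, 2) for h in heights])   # [0.22, 0.44, 0.22, 0.08, 0.04]
assert math.isclose(total_area, 1.0)    # the bars' areas sum to 1
```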


2.1.6 Robustness

A robust measure of location is one that is not greatly affected by a few extremely large or extremely small values. Values of a numeric variable that lie a great distance from most of the other values are called outliers. Outliers might be the result of mistakes in measuring or recording data, perhaps from misplacing a decimal point. The mean is not a robust location measure: a single outlier, if extreme enough, can change the mean by an arbitrarily large amount. Thus, if there is any doubt about the quality of the data, the median or a trimmed mean might be preferred to the mean as a reliable location measure. The median is very insensitive to outliers. A 5% trimmed mean is insensitive to outliers that make up no more than 5% of the data values.
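A small numerical experiment (Python, with made-up data) illustrates the difference: replacing one value by an extreme outlier, as a misplaced decimal point would, shifts the mean drastically while leaving the median untouched.

```python
import statistics

clean = [2.1, 2.3, 2.4, 2.6, 2.8, 3.0, 3.1, 3.3, 3.4, 3.5]
bad = clean[:-1] + [350.0]    # 3.50 mis-recorded as 350.0

print(round(statistics.mean(clean), 2), statistics.median(clean))  # 2.85 2.9
print(round(statistics.mean(bad), 2), statistics.median(bad))      # 37.5 2.9
```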

2.1.7 The Five Number Summary

The five number summary is a convenient way of summarizing numeric data. The five numbers are the minimum value, the first quartile (25th percentile), the median, the third quartile (75th percentile), and the maximum value. Sometimes the mean is also included, which makes it a six number summary.

Example 2.2. The natural logarithms y of the data values x in Example 2.1 are, to two places:

-2.12 -1.20 -1.05 -0.99 -0.82 -0.56 -0.49 -0.48 -0.34 -0.22 -0.13 0.02 0.08 0.11 0.12 0.16 0.19 0.21 0.30 0.34 0.35 0.35 0.38 0.40 0.42 0.43 0.47 0.48 0.52 0.54 0.62 0.64 0.65 0.73 0.74 0.77 0.78 0.79 0.83 0.84 0.87 0.90 0.96 1.05 1.23 1.23 1.33 1.38 1.51 1.55

It is sometimes advantageous to transform data in some way, i.e., to define a new variable y as a function of the old variable x. In this case, we have transformed the reaction times x with the natural logarithm transformation. We might want to do this so that we can more easily apply certain statistical inference procedures you will learn about later. The six number summary of the transformed data y is:

> reacttimes=read.table("reacttimes.txt",header=T)

> summary(log(reacttimes$Times))

Min. 1st Qu. Median Mean 3rd Qu. Max.

-2.12000 0.08605 0.42520 0.33710 0.78500 1.55400

2.1.8 The Mode

The mode of a variable is its most frequently occurring value. With numeric variables the mode is less important than the mean and median for descriptive purposes or for statistical inference. For factor variables the mode is the most natural way of choosing a "most representative" value. We hear this frequently in the media, in statements such as "Financial problems are the most common cause of marital strife." For grouped numeric data the modal class interval is the class interval having the highest absolute or relative frequency. In Example 2.1, the modal class interval is the interval (1,2].
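For illustration, a few lines of Python (with hypothetical survey responses) extract the mode of a factor variable as its most frequent value:

```python
from collections import Counter

# Hypothetical responses to a survey question about causes of marital strife.
causes = ["financial", "infidelity", "financial", "communication",
          "financial", "communication"]
mode, count = Counter(causes).most_common(1)[0]
print(mode, count)    # financial 3
```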

2.1.9 Exercises

1. Find the mean and median of the reaction time data in Example 2.1.


2. Find the quartiles of the reaction time data. There is more than one acceptable answer.

3. The 40th value of the reaction time data is x40 = 2.32. Replace it with 232.0. Recalculate the mean and median. Comment.

4. Construct a frequency table like the one in Example 2.1 for the log-transformed reaction times of Example 2.2. Use 5 class intervals of equal length beginning at -3 and ending at 2. Draw an absolute frequency histogram.

5. Estimate the mean and median of the grouped log-transformed reaction times by using the techniques discussed in Example 2.1. Compare your answers to the summary in Example 2.2.

6. Repeat exercises 1, 2, and the histogram of exercise 4 by using R.

7. Let x be a numeric variable with values x1, . . . , xn−1, xn. Let x̄n be the average of all n values and let x̄n−1 be the average of x1, . . . , xn−1. Show that x̄n = (1 − 1/n) x̄n−1 + (1/n) xn. What happens if xn → ∞ while all the other values of x are fixed?

2.2 Measures of Variability or Scale

2.2.1 The Variance and Standard Deviation

Let x be a population variable with values x1, x2, . . . , xn. Some of the values might be repeated. The variance of x is

var(x) = σ² = (1/n) ∑_{i=1}^{n} (xi − µ(x))².

The standard deviation of x is sd(x) = σ = √var(x).

When x1, x2, . . . , xn are values of x from a sample rather than the entire population, we modify the definition of the variance slightly, use a different notation, and call these objects the sample variance and standard deviation.

s² = (1/(n − 1)) ∑_{i=1}^{n} (xi − x̄)²,

s = √s².

The reason for modifying the definition for the sample variance has to do with its properties as an estimate of the population variance.


Alternate algebraically equivalent formulas for the variance and sample variance are

σ² = (1/n) ∑_{i=1}^{n} xi² − µ(x)²,

s² = (1/(n − 1)) (∑_{i=1}^{n} xi² − n x̄²).

These are sometimes easier to use for hand computation.

The standard deviation σ is called a measure of scale because of the way it behaves under linear transformations of the data. If a new variable y is defined by y = a + bx, where a and b are constants, then sd(y) = |b| sd(x). For example, the standard deviation of Fahrenheit temperatures is 1.8 times the standard deviation of Celsius temperatures. The transformation y = a + bx can be thought of as a rescaling operation, or a choice of a different system of measurement units, and the standard deviation takes account of it in a natural way.
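Both definitions and the scale property can be checked numerically. The Python sketch below uses the Celsius-to-Fahrenheit example; the ratio of the two standard deviations comes out to b = 1.8.

```python
import math
import statistics

def pvar(x):
    """Population variance: divide by n."""
    m = sum(x) / len(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)

def svar(x):
    """Sample variance: divide by n - 1."""
    m = sum(x) / len(x)
    return sum((xi - m) ** 2 for xi in x) / (len(x) - 1)

celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
fahrenheit = [32 + 1.8 * c for c in celsius]   # y = a + bx with a = 32, b = 1.8

# The sample variance matches the library's n - 1 convention.
assert math.isclose(svar(celsius), statistics.variance(celsius))

ratio = math.sqrt(pvar(fahrenheit)) / math.sqrt(pvar(celsius))
print(round(ratio, 6))    # 1.8, i.e. sd(y) = |b| sd(x)
```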

2.2.2 The Coefficient of Variation

For a variable that has only positive values, it may be more important to measure the relative variability than the absolute variability. That is, the amount of variation should be compared to the mean value of the variable. The coefficient of variation for a population variable is defined as

cv(x) = sd(x)/µ(x).

For a sample of values of x we substitute the sample standard deviation s and the sample average x̄.

2.2.3 The Mean and Median Absolute Deviation

Suppose that you must choose a single number c to represent all the values of a variable x as accurately as possible. One measure of the overall error with which c represents the values of x is

g(c) = √[(1/n) ∑_{i=1}^{n} (xi − c)²].

In the exercises, you are asked to show that this expression is minimized when c = x̄. In other words, the single number which most accurately represents all the values is, by this criterion, the mean of the variable. Furthermore, the minimum possible overall error, by this criterion, is the standard deviation. However, this is not the only reasonable criterion. Another is

h(c) = (1/n) ∑_{i=1}^{n} |xi − c|.

It can be shown that this criterion is minimized when c = median(x). The minimum value of h(c) is called the mean absolute deviation from the median. It is a scale measure which is somewhat more robust (less affected by outliers) than the standard deviation, but still not very robust. A related, very robust measure of scale is the median absolute deviation from the median, or mad:

mad(x) = median(|x−median(x)|).
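The following Python sketch (made-up data with one outlier) compares these scale measures. One caveat when checking against R: R's mad function multiplies the median absolute deviation by a constant, 1.4826 by default, so that it estimates the standard deviation for normally distributed data; the plain definition above omits that factor.

```python
import statistics

def mean_abs_dev(x):
    """Mean absolute deviation from the median: the minimum value of h(c)."""
    med = statistics.median(x)
    return sum(abs(xi - med) for xi in x) / len(x)

def mad(x):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(x)
    return statistics.median(abs(xi - med) for xi in x)

x = [1, 2, 3, 4, 5, 6, 7, 8, 100]        # one large outlier
print(mad(x))                            # 2: barely notices the outlier
print(round(mean_abs_dev(x), 2))         # 12.33: pulled up somewhat
print(round(statistics.pstdev(x), 2))    # about 30: dominated by the outlier
```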


2.2.4 The Interquartile Range

The interquartile range of a variable x is the difference between its 75th and 25th percentiles.

IQR(x) = q(x, .75)− q(x, .25).

It is a robust measure of scale which is important in the construction and interpretation of boxplots, discussed below.

All of these measures of scale are valid for comparing the "spread" or variability of numeric variables about a central value. In general, the greater their values, the more spread out the values of the variable are. Of course, the standard deviation, median absolute deviation, and interquartile range of a variable are different numbers, and one must be careful to compare like measures.
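For instance (a Python sketch; the "inclusive" quantile convention used below is one of several, so other software may give slightly different quartiles):

```python
import statistics

def iqr(x):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return q3 - q1

x = list(range(1, 12))    # 1, 2, ..., 11
print(iqr(x))             # 8.5 - 3.5 = 5.0
```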

2.2.5 Boxplots

Boxplots are also called box and whisker diagrams. Essentially, a boxplot is a graphical representation of the five number summary. The boxplot below depicts the sensory response data of the preceding section without the log transformation.

> reacttimes=read.table("reacttimes.txt",header=T)

> boxplot(reacttimes$Times,horizontal=T,xlab="Reaction Times")

> summary(reacttimes)

Times

Min. :0.120

1st Qu.:1.090

Median :1.530

Mean :1.742

3rd Qu.:2.192

Max. :4.730


[Horizontal boxplot of the reaction times, with the axis labeled "Reaction Times" and ticks running from 0 to 4.]

The central box in the diagram encloses the middle 50% of the numeric data. Its left and right boundaries mark the first and third quartiles. The boldface middle line in the box marks the median of the data. Thus, the interquartile range is the distance between the left and right boundaries of the central box. For construction of a boxplot, an outlier is defined as a data value whose distance from the nearest quartile is more than 1.5 times the interquartile range. Outliers are indicated by isolated points (tiny circles in this boxplot). The dashed lines extending outward from the quartiles are called the whiskers. They extend from the quartiles to the most extreme values in either direction that are not outliers.
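The outlier rule can be checked directly against the reaction-time data. The Python sketch below uses the "inclusive" quartile convention, which should reproduce the quartiles 1.090 and 2.192 reported by R's summary; it flags exactly the three largest values.

```python
import statistics

# Reaction-time data of Example 2.1, already sorted.
times = [0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
         0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
         1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
         1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
         2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73]

q1, _, q3 = statistics.quantiles(times, n=4, method="inclusive")
iqr = q3 - q1

# Boxplot rule: outliers lie more than 1.5 * IQR beyond the nearest quartile.
lo_fence = q1 - 1.5 * iqr
hi_fence = q3 + 1.5 * iqr
outliers = [t for t in times if t < lo_fence or t > hi_fence]
print(outliers)    # the three largest values: [3.97, 4.54, 4.73]
```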

This boxplot shows a number of interesting things about the response time data.

(a) The median is about 1.5. The interquartile range is slightly more than 1.

(b) The three largest values are outliers. They lie a long way from most of the data. They might call for special investigation or explanation.


(c) The distribution of values is not symmetric about the median. The values in the lower half of the data are more crowded together than those in the upper half. This is shown by comparing the distances from the median to the two quartiles, by the lengths of the whiskers, and by the presence of outliers at the upper end.

The asymmetry of the distribution of values is also evident in the histogram of the preceding section.

2.2.6 Exercises

1. Find the variance and standard deviation of the response time data. Treat it as a sample from a larger population.

2. Find the interquartile range and the median absolute deviation for the response time data.

3. In the response time data, replace the value x40 = 2.32 by 232.0. Recalculate the standard deviation, the interquartile range and the median absolute deviation and compare with the answers from problems 1 and 2.

4. Make a boxplot of the log-transformed reaction time data. Is the transformed data more sym- metrically distributed than the original data?

5. Show that the function g(c) in section 2.2.3 is minimized when c = µ(x). Hint: Minimize g(c)².

6. Find the variance, standard deviation, IQR, mean absolute deviation and median absolute deviation of the variable "Ozone" in the data set "airquality". Use R or RStudio. You can address the variable Ozone directly if you attach the airquality data frame to the search path as follows:

> attach(airquality)

The R functions you will need are "sd" for standard deviation, "var" for variance, "IQR" for the interquartile range, and "mad" for the median absolute deviation. There is no built-in function in R for the mean absolute deviation, but it is easy to obtain it. Note that Ozone contains missing values, so you will need the na.rm=TRUE argument.

> mean(abs(Ozone-median(Ozone,na.rm=TRUE)),na.rm=TRUE)

2.3 Jointly Distributed Variables

When two or more variables are jointly distributed, or jointly observed, it is important to understand how they are related and how closely they are related. We will first consider the case where one variable is numeric and the other is a factor.
