# General assessment information

MATH3823/5824M Practical Coursework, Year 2020/21

- This practical coursework is included in the assessment for MATH3823/5824M. It comprises 20% of your overall mark for the module.
- Questions on this sheet that are only relevant to MATH5824M students are indicated by "[Level 5 only]".
- The written report (which must be typed, not handwritten) must be clearly marked with your name, student ID and module code, and should be no more than 8 pages for MATH3823 students and no more than 10 pages for MATH5824M students.
- Your report should be handed in on, or before, Wednesday 31 March 2021 by 17:00 in Minerva (go to Learning resources → "Assessment Coursework" folder, and then look for the MATH3823 or MATH5824M submission). Late submissions will carry a 5% penalty for each day, up to 14 days.
- We have set up a drop-in session on Monday 22 March, 10:00–12:00, to offer help and advice on this practical. Links for the drop-in session will be available in Minerva (Learning resources → "Assessment Coursework" folder).
- The analyses in the practical should be performed using the statistics program R.

# Background to the problem

The dataset adelaide.txt has been taken from Dobson and Barnett (2008), pp. 145–146. It contains information on the numbers of graduates surviving for 50 years, as a function of the following explanatory variables:

- year: year of graduation, 1938–1947
- faculty: M = Medicine, A = Arts, S = Science, E = Engineering
- sex: M = Male, F = Female

with two response variables:

- survive: the number of graduates surviving for 50 years, and
- total: the total number of graduates with this combination of explanatory variables.

# Tasks you need to perform

1. Read the data into R using the commands

   ```r
   adelaide = read.table("adelaide.txt", header=TRUE)
   attach(adelaide)
   ```

2. Fit a model for which the survival probability depends on year+faculty+sex.

3. To begin with, ignore questions of statistical significance, and explain what each parameter represents (i.e., how it affects the survival probabilities). In particular, what is the fitted probability of survival for each of the following combinations:
   - sex=M, year=1941, faculty=M
   - sex=F, year=1938, faculty=E

4. However, you might have noted that there are no women recorded doing Engineering! So what does the second fitted probability mean?

5. Comment on whether or not the model fitted in point 2 is appropriate. Investigate whether or not a more complicated model (e.g., with interactions) or a simpler model would be appropriate to describe the data. You might use parameter estimates and their p-values, deviances and residuals. If you do include any interactions, you should count the degrees of freedom carefully.

6. Next analyze the male and female data separately. Compare the outputs to the model involving the same explanatory variables but fitted with the original data. Do you get any differences in interpretation? [This question is motivated by the fact that women did not do the full range of subjects.]

7. [Level 5 only] Now compute a new variable, say y, which is the total survival proportion in year t, including both male and female students from all faculties. For example, in 1938 there were 102 students, of whom 68 survived for 50 years, so we have t1 = 1938 and y1 = 68/102 ≈ 0.667. Fit a smoothing spline f(t) to these proportions using smoothing parameter λ = 1, and fit a model which uses the smoothing spline as an explanatory variable in place of the year; also include sex and faculty in your model. Compare this model to the model from task 2 above, and comment on any similarities and/or differences.

# Writing up

You should take some care with the presentation of your results. This includes using a clear structure and layout, careful explanation of your conclusions and how you arrived at them, meaningful plots with appropriate labels, etc.
You should start your report with a short (one-paragraph) summary of your findings, written in a style suitable for a non-statistician. The aim of this practical is to explain the analyses that you have performed, and simply giving R commands does not do this. Any R output should be carefully explained, with extracts of output inserted in the text. It should be clear what R commands have been used to produce the output. In addition to the report, attach an appendix that includes all of the R commands that you have used. The appendix does not count towards the number of pages in your report.

The adelaide.txt data:

```
year faculty sex survive total
1938 M M 18 22
1939 M M 16 23
1940 M M  7 17
1941 M M 12 25
1942 M M 24 50
1943 M M 16 21
1944 M M 22 32
1945 M M 12 14
1946 M M 22 34
1947 M M 28 37
1938 A M 16 30
1939 A M 13 22
1940 A M 11 25
1941 A M 12 14
1942 A M  8 12
1943 A M 11 20
1944 A M  4 10
1945 A M  4 12
1946 A M  0  0
1947 A M 13 23
1938 S M  9 14
1939 S M  9 12
1940 S M 12 19
1941 S M 12 15
1942 S M 20 28
1943 S M 16 21
1944 S M 25 31
1945 S M 32 38
1946 S M  4  5
1947 S M 25 31
1938 E M 10 16
1939 E M  7 11
1940 E M 12 15
1941 E M  8  9
1942 E M  5  7
1943 E M  1  2
1944 E M 16 22
1945 E M 19 25
1946 E M  0  0
1947 E M 25 35
1938 A F 14 19
1939 A F 11 16
1940 A F 15 18
1941 A F 15 21
1942 A F  8  9
1943 A F 13 13
1944 A F 18 22
1945 A F 18 22
1946 A F  1  1
1947 A F 13 16
1938 S F  1  1
1939 S F  4  4
1940 S F  6  7
1941 S F  3  3
1942 S F  4  4
1943 S F  8  9
1944 S F  5  5
1945 S F 16 17
1946 S F  1  1
1947 S F 10 10
```

---

MATH 3823 Lecture notes (January 2021)

# MATH 3823: Generalized linear models (Outline Lecture Notes)

Lecturer: Dr Lanpeng Ji, adapted from earlier notes by A Minter, J T Kent and W R Gilks. Last updated: January 2021.

## Preliminaries

Outline of the course:

1. Revision of normal linear models, NLMs.
2. Introduction to generalized linear models, GLMs.
3. Logistic regression models.
4. Loglinear models, including contingency-table models.
## 1 Introduction

In previous modules you have studied normal linear models (i.e. linear models with a normally distributed error term), such as multiple linear regression and ANOVA for normally distributed observations. In this module we will study generalized linear models. The purpose of a generalized linear model is to model the dependence of a dependent variable y on a set of p explanatory variables $x = (x_1, x_2, \ldots, x_p)$, where, conditionally on x, the observation y has a distribution which is not necessarily normal. Note that in these notes we use y and $y_i$ to denote either observed values or random variables; which is meant should be clear from the context.

### 1.1 Motivation

Table 1.1 shows data on the number of beetles killed by five hours of exposure to n = 8 different concentrations of gaseous carbon disulphide (Dobson and Barnett, 3rd edn, p. 127).

| Dose $x_i$ | No. of beetles $m_i$ | No. killed $y_i$ |
|---|---|---|
| 1.6907 | 59 | 6 |
| 1.7242 | 60 | 13 |
| 1.7552 | 62 | 18 |
| 1.7842 | 56 | 28 |
| 1.8113 | 63 | 52 |
| 1.8369 | 59 | 53 |
| 1.8610 | 62 | 61 |
| 1.8839 | 60 | 60 |

Table 1.1: Numbers of beetles killed by five hours of exposure to n = 8 different concentrations of gaseous carbon disulphide.

Figure 1.1a shows the empirical mortality curve for these data, and Figure 1.1b shows the same data with a linear regression line superimposed. Although this line goes close to the plotted points, we can see some fluctuations around it. More seriously, this is a stupid model: it would predict a mortality rate above 100% at a dose of 1.9 units, and a negative mortality rate at 1.65 units! A more sensible dose–response relationship might be based on the logistic function, plotted in Figure 1.1c. The logistic function is defined as

$$\text{logistic}(x) = \frac{1}{1 + e^{-x}} \qquad (1.1)$$

whose inverse is the logit function:

$$\text{logit}(q) = \log \frac{q}{1 - q}. \qquad (1.2)$$

(In this module, all logarithms are to base e.) The value of the logistic function lies between 0 and 1.
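The two functions in (1.1) and (1.2) are easy to compute directly. Here is a minimal Python sketch (an illustration added for these notes, not part of the original text) that also checks that logit is the inverse of logistic:

```python
import math

def logistic(x: float) -> float:
    """Logistic function (1.1): maps the real line into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(q: float) -> float:
    """Logit function (1.2): inverse of the logistic, natural log."""
    return math.log(q / (1.0 - q))

# logit undoes logistic (and vice versa), up to floating-point error
for x in [-3.0, 0.0, 1.5]:
    assert abs(logit(logistic(x)) - x) < 1e-12

print(logistic(0.0))  # prints 0.5, the midpoint of the curve
```

Note that logistic(x) approaches 0 and 1 asymptotically but never reaches them, which is why it is a natural model for a mortality proportion.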
Fitting the logistic function to the beetle mortality data results in the curve plotted in Figure 1.1d, which gives a closer, more sensible fit. Later in this module we will see how this curve was fitted, using maximum likelihood estimation for an appropriate generalized linear model.

[Figure 1.1: Beetle mortality rates from Table 1.1 with fitted dose–response curves. Panels: (a) empirical beetle mortality; (b) linear regression model; (c) the logistic curve; (d) logistic regression model.]

### 1.2 Revision: Normal Linear Models

In many fields of application, we might assume the dependent variable is Gaussian, i.e. normally distributed; for example: heights, weights, log prices. The data in Table 1.2 record the birth weights of 12 boys and 12 girls and their gestational ages (the time from conception to birth). We can use these data to predict the birth weight of a baby born at a given gestational age. This will require a model; we consider several. Figure 1.2 plots these data, superimposed with fitted regression lines from the following models:

- Model 0: Weight = α (1.3)
- Model 1: Weight = α + β·Age (1.4)
- Model 2: Weight = α + β·Age + γ·Sex (1.5)
- Model 3: Weight = α + β·Age + γ·Sex + δ·Sex\*Age (1.6)

| Boys: gestational age (weeks) | Boys: birth weight (g) | Girls: gestational age (weeks) | Girls: birth weight (g) |
|---|---|---|---|
| 40 | 2968 | 40 | 3317 |
| 38 | 2795 | 36 | 2729 |
| 40 | 3163 | 40 | 2935 |
| 35 | 2925 | 38 | 2754 |
| 36 | 2625 | 42 | 3210 |
| 37 | 2847 | 39 | 2817 |
| 41 | 3292 | 40 | 3126 |
| 40 | 3473 | 37 | 2539 |
| 37 | 2628 | 36 | 2412 |
| 38 | 3176 | 38 | 2991 |
| 40 | 3421 | 39 | 2875 |
| 38 | 2975 | 40 | 3231 |

Table 1.2: Gestational ages (in weeks) and birth weights (in grams) for n = 24 babies (n1 = 12 boys and n2 = 12 girls), from Dobson and Barnett, 3rd edition, Table 2.3.

In these models, Weight is called the dependent variable (sometimes the response variable), and Age and Sex are called the covariates or explanatory variables (sometimes the predictor or independent variables). Here, Age is a continuous variable.
Sex is a dummy variable taking the value 0 for girls and 1 for boys; it is an example of a factor, in this case with just two levels: boy and girl. In these models, α, β, γ and δ are model parameters. Parameter α is called the intercept term; β is called the main effect of Age, and is interpreted as the effect on birthweight per week of gestational age. Similarly, γ is the main effect of Sex, interpreted as the effect on birthweight of being a boy (because girl is the baseline category). Parameter δ is called the interaction effect between Age and Sex.

Take care when interpreting an interaction effect. Here, it does not mean that age somehow affects sex, or vice versa. It means that the effect of gestational age on birthweight depends on whether the baby is a boy or a girl.

These models can be fitted to the data using Ordinary Least Squares. Which model should we use? We can choose between models using F-tests. Let the four models (1.3)–(1.6) be indexed by k = 0, 1, 2, 3. Let $y_i$ denote the value of the dependent variable Weight for individual i = 1, ..., n. Let $r_k$ denote the residual degrees of freedom for Model k (the number of observations minus the number of model parameters). Let $R_k$ denote the residual sum of squares for Model k:

$$R_k = \sum_{i=1}^{n} (y_i - \hat{\mu}_{ki})^2, \qquad (1.7)$$

where $\hat{\mu}_{ki}$ is the fitted value for individual i under Model k.

[Figure 1.2: The data in Table 1.2 superimposed with fitted regression lines from various models. Panels: (a) Model 0, equation (1.3); (b) Model 1, equation (1.4); (c) Model 2, equation (1.5); (d) Model 3, equation (1.6). Open circles: boys; solid circles: girls; broken lines: boys; solid lines: girls.]

Consider the following hypotheses: H0: Model 0 is true; H1: Model 1 is true. Under the null hypothesis H0, the difference between R0 and R1 will be purely random, so the between-model mean square (R0 − R1)/(r0 − r1) should be comparable to the residual mean square R1/r1.
Thus our test statistic for comparing Model 1 to the simpler Model 0 is:

$$F_{01} = \frac{(R_0 - R_1)/(r_0 - r_1)}{R_1/r_1}. \qquad (1.8)$$

It can be shown that, under the null hypothesis H0, the statistic F01 has an F-distribution on r0 − r1 and r1 degrees of freedom, which we write $F_{r_0-r_1,\,r_1}$. Under the alternative hypothesis H1, the difference R0 − R1 will tend to be larger than expected under H0, and so the observed value F01 will probably lie in the upper tail of the $F_{r_0-r_1,\,r_1}$ distribution.

What do we find for the data in Table 1.2? The following R code calculates the F01 statistic for these data:

```r
# read the data from file into a dataframe called 'birthweight'
birthweight = read.table("birthwt.txt", header=TRUE)
# fit Model 1
fit1 = lm(weight ~ age, data=birthweight)
# print the parameter estimates from Model 1
coefficients(fit1)
# perform an analysis of variance of Model 1
anova(fit1)
```

The above commands produce the following results:

```
Coefficients:
(Intercept)         age
 -1484.9846    115.5283

Analysis of Variance Table

Response: weight
          Df  Sum Sq Mean Sq F value   Pr(>F)
age        1 1013799 1013799   27.33 3.04e-05 ***
Residuals 22  816074   37094
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Thus we have parameter estimates α̂ = −1484.98 and β̂ = 115.5. The analysis of variance (ANOVA) gives r1 = 22 and R1 = 816074. We can get values for r0 and R0 by fitting Model 0. Alternatively, we can get what we want from the above results: r0 − r1 = 1; R0 − R1 = 1013799. The F01 statistic (1.8) is then

$$F_{01} = \frac{(R_0 - R_1)/(r_0 - r_1)}{R_1/r_1} = \frac{1013799/1}{816074/22} = 27.33,$$

which can be read directly from the ANOVA table in the column headed 'F value'. Is F01 = 27.33 in the upper tail of the F1,22 distribution? The final column of the ANOVA table tells us that the probability of observing F01 > 27.33 is only 3.04 × 10⁻⁵: this is called a p-value. The *** beside this p-value highlights that its value lies between 0 and 0.001.
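The arithmetic behind (1.8) uses nothing beyond the residual sums of squares and degrees of freedom from the ANOVA table. As a cross-check on the hand calculation above, here is a short Python sketch (an illustration with my own function name, not code from the notes) that reproduces the F value from those quantities:

```python
def f_statistic(rss_small: float, rss_big: float,
                df_small: int, df_big: int) -> float:
    """F statistic (1.8) for comparing a smaller nested model
    against a bigger one: the between-model mean square divided
    by the bigger model's residual mean square."""
    between = (rss_small - rss_big) / (df_small - df_big)
    residual = rss_big / df_big
    return between / residual

# Model 0 vs Model 1, using the ANOVA values above:
# R0 - R1 = 1013799 on r0 - r1 = 1 df, and R1 = 816074 on r1 = 22 df,
# so R0 = 816074 + 1013799 on r0 = 23 df.
f01 = f_statistic(rss_small=816074 + 1013799, rss_big=816074,
                  df_small=23, df_big=22)
print(round(f01, 2))  # prints 27.33, matching the ANOVA table
```

The same function applies to any pair of nested models, e.g. Model 1 against Model 2 later in this section.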
This indicates that we should reject H0 in favour of H1. Thus we would conclude that the effect of gestational age is statistically significant in these data.

Next, consider the following hypotheses: H0: Model 1 is true; H1: Model 2 is true. Under this null hypothesis H0, the difference between R1 and R2 will be purely random, so the between-model mean square (R1 − R2)/(r1 − r2) should be comparable to the residual mean square R2/r2. Thus our test statistic for comparing Model 2 to the simpler Model 1 is:

$$F_{12} = \frac{(R_1 - R_2)/(r_1 - r_2)}{R_2/r_2}. \qquad (1.9)$$

Under this null hypothesis H0, the statistic F12 has an F-distribution on r1 − r2 and r2 degrees of freedom, which we write $F_{r_1-r_2,\,r_2}$. Under the alternative hypothesis H1, the difference R1 − R2 will tend to be larger than expected under H0, and so the observed value F12 will probably lie in the upper tail of the $F_{r_1-r_2,\,r_2}$ distribution.

The following R code calculates the F12 statistic for these data:

```r
# fit Model 2
fit2 = lm(weight ~ age + sex, data=birthweight)
# print the parameter estimates from Model 2
coefficients(fit2)
# perform an analysis of variance on the fitted model
anova(fit2)
```

The above commands produce the following results (where sexM denotes 'boy'):

```
Coefficients:
(Intercept)         age        sexM
 -1773.3218    120.8943    163.0393

Analysis of Variance Table

Response: weight
          Df  Sum Sq Mean Sq F value    Pr(>F)
age        1 1013799 1013799 32.3174 1.213e-05 ***
sex        1  157304  157304  5.0145   0.03609 *
Residuals 21  658771   31370
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Thus we have parameter estimates α̂ = −1773.3, β̂ = 120.9 and γ̂ = 163.0, the latter being the effect of being a boy compared to the baseline category of being a girl. The analysis of variance (ANOVA) gives r2 = 21 and R2 = 658771. We have values for r1 and R1 from fitting Model 1, above.
Alternatively, we can get what we want from the above results: r1 − r2 = 1; R1 − R2 = 157304. The F12 statistic (1.9) is then

$$F_{12} = \frac{(R_1 - R_2)/(r_1 - r_2)}{R_2/r_2} = \frac{157304/1}{658771/21} = 5.0145,$$

which can be read directly from the ANOVA table in the column headed 'F value'. Is F12 = 5.0145 in the upper tail of the F1,21 distribution? The final column of the ANOVA table tells us that the probability of observing F12 > 5.0145 is only 0.03609: this is called a p-value. The * beside this p-value highlights that its value lies between 0.01 and 0.05. This indicates that we should reject H0 in favour of H1. Thus we would conclude that the effect on birthweight of the sex of the baby, controlling for its gestational age, is statistically significant in these data.

### 1.3 Types of variable

The way a variable enters a model depends on its type. The most common types of variable are:

1. Quantitative
   - (a) Continuous: e.g. height; weight; duration. Real-valued; sometimes rounded, but still usually best regarded as continuous.
   - (b) Count: e.g. number of children in a family; accidents at a road junction; number of items sold. Non-negative and integer-valued.
2. Qualitative
   - (a) Unordered categorical (nominal):
     - Dichotomous (2 categories): e.g. sex (M/F); agreement (Yes/No); coin toss (Head/Tail).
     - Polytomous (more than 2 categories): e.g. blood group (A/B/O); eye colour (Brown/Blue/Green).
   - (b) Ordered categorical (ordinal): e.g. severity of illness (Mild/Moderate/Severe); degree classification (first/upper-second/lower-second/third).

### 1.4 Types of normal linear model

We consider how normal linear models can be set up for different types of explanatory variable. The dependent variable y is modelled as a linear combination of the explanatory variables $x = (x_1, \ldots, x_p)$ plus a random error ε ∼ N(0, σ²), where '∼' means 'is distributed as'. Several models are of this kind, depending on the number and type of explanatory variables.
Below, I(·) denotes the indicator function:

$$I(x = j) = \begin{cases} 1, & x = j \\ 0, & \text{otherwise.} \end{cases} \qquad (1.10)$$

Table 1.3 lists some types of normal linear model with their explanatory variable types. For the one-way ANOVA model, observations in Group j have mean α + δⱼ, for j = 1, ..., k, where we impose a 'corner' constraint δ₁ = 0 to avoid an identification problem with k + 1 parameters describing k group means. Then δⱼ represents the difference in means between Group j and Group 1. Alternatively, we may impose a 'sum-to-zero' constraint: $\sum_j \delta_j = 0$.

| p | Explanatory variables | Model |
|---|---|---|
| 1 | Quantitative | Simple linear regression: $y = \alpha + \gamma x + \varepsilon$ |
| >1 | Quantitative | Multiple linear regression: $y = \alpha + \sum_{i=1}^{p} \gamma_i x_i + \varepsilon$ |
| 1 | Dichotomous (x = 1 or 2) | Two-sample t-test: $y = \alpha + \gamma I(x = 2) + \varepsilon$ |
| 1 | Polytomous, k levels (x = 1, ..., k) | One-way ANOVA: $y = \alpha + \sum_{j=1}^{k} \delta_j I(x = j) + \varepsilon$ |
| >1 | Qualitative | p-way ANOVA |

Table 1.3: Types of normal linear model and their explanatory variable types.

### 1.5 The matrix representation of a normal linear model

All of the models in Table 1.3 are fitted by ordinary least squares (OLS). To describe this, a matrix formulation is most convenient:

$$Y = X\beta + \varepsilon \qquad (1.11)$$

where

- Y is the n × 1 vector of observed response values, with n the number of observations;
- X is the n × p design matrix, discussed below;
- β is the p × 1 vector of parameters or coefficients to be estimated;
- ε is the n × 1 vector of iid N(0, σ²) "error" terms.

### 1.6 Construction of the design matrix X

The design matrix can be constructed by the following process.

1. Begin with an X containing only one column: a vector of ones for the overall mean or intercept term (the α's in Table 1.3).
2. For each explanatory variable xⱼ, do the following:
   - (a) If xⱼ is quantitative, add a column to X containing the values of xⱼ.
   - (b) If xⱼ is qualitative with k levels, add k "dummy" columns to X, taking values 0 and 1, where a 1 in the ℓth dummy column identifies that the corresponding observation is…
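The construction above (an intercept column, a raw column per quantitative variable, and dummy columns per factor) can be sketched in a few lines of Python. This is an illustration under my own naming conventions, not code from the notes:

```python
def design_matrix(columns):
    """Build a design matrix row by row from a list of
    (kind, values) pairs, where kind is "quant" for a
    quantitative variable or "factor" for a qualitative
    variable coded 1..k.  Starts from an intercept column
    of ones (step 1); a factor with k levels contributes k
    dummy columns I(x = 1), ..., I(x = k) (step 2b)."""
    n = len(columns[0][1])
    X = [[1] for _ in range(n)]          # step 1: intercept column
    for kind, values in columns:
        if kind == "quant":              # step 2(a): raw values
            for i, v in enumerate(values):
                X[i].append(v)
        elif kind == "factor":           # step 2(b): dummy columns
            k = max(values)
            for i, v in enumerate(values):
                X[i].extend(1 if v == j else 0 for j in range(1, k + 1))
    return X

# one quantitative variable and one 2-level factor, n = 3 observations
X = design_matrix([("quant", [1.5, 2.0, 3.5]),
                   ("factor", [1, 2, 1])])
print(X)  # prints [[1, 1.5, 1, 0], [1, 2.0, 0, 1], [1, 3.5, 1, 0]]
```

Note that including all k dummy columns together with the intercept reproduces the identification problem discussed for the one-way ANOVA above; in practice one dummy column is dropped (the corner constraint) or a sum-to-zero constraint is imposed.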
