Logistic regression: preliminaries
Regression models estimate the relative effects of different variables
on an outcome. In this case we want to know which combination of
characteristics is most useful for predicting whether people will become
ill. Since we are interested in an outcome with only two possible states
(well and ill), the most appropriate model type is logistic regression.
Our outcome variable is coded like this:
The value coded 0 is known as the reference category. It is the one against which change is measured, and might be thought of as the 'normal' state. Each variable in the model has a reference category, unless it is continuous (e.g. age). Unless otherwise instructed, most software that offers regression will take the lowest value code as the reference category for a variable (here 0, because it is lower than 1).
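To make the default concrete, the following is a minimal sketch of how a categorical variable might be turned into indicator columns, with the lowest code treated as the reference category unless another is requested. The function name dummy_code and the example values are hypothetical and invented for illustration; real statistical packages have their own interfaces for this.

```python
def dummy_code(values, reference=None):
    """Return (reference, columns) for a categorical variable.

    columns maps each non-reference level to a 0/1 indicator list.
    If no reference is given, the lowest code is used, mirroring the
    default behaviour described in the text.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    elif reference not in levels:
        raise ValueError(f"reference {reference!r} is not a level of the variable")
    columns = {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }
    return reference, columns


# Hypothetical sex codes (0 = men, 1 = women); not taken from the study data.
sex = [0, 1, 1, 0, 1]
ref, cols = dummy_code(sex)
print(ref)   # 0
print(cols)  # {1: [0, 1, 1, 0, 1]}
```

Passing reference=1 would instead make 'women' the baseline, which changes what the model coefficient is measured against but not the fit of the model itself.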
The first hypothesis is that men and women will not be equally likely to develop a limiting long-term illness in our ten-year period. The sex variable is coded 0 = men, 1 = women, so the reference category is 'men'. Cross-tabulating the two variables produces this table.
The table shows a modest difference by sex. We shall therefore run only one set of models, with sex as an explanatory variable, rather than separate sets for men and women, which would have been appropriate had there been major differences between the sexes.
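The cross-tabulation step described above can be sketched in a few lines. The records below are made up purely for illustration and are not the study's figures. The sketch assumes each record is a (sex, ill) pair, with sex coded 0 = men, 1 = women as in the text, and with ill coded 0 = well, 1 = ill as an assumption about the outcome coding.

```python
from collections import Counter

# Invented data, NOT the study's results: (sex, ill) pairs.
# sex: 0 = men, 1 = women; ill: 0 = well, 1 = ill (assumed coding).
records = [(0, 0)] * 7 + [(0, 1)] * 3 + [(1, 0)] * 6 + [(1, 1)] * 4

# Count how often each (sex, ill) combination occurs.
counts = Counter(records)

# Arrange the counts as a sex-by-outcome table.
table = {s: {i: counts[(s, i)] for i in (0, 1)} for s in (0, 1)}

# Proportion ill within each sex (row proportions).
prop_ill = {s: table[s][1] / sum(table[s].values()) for s in (0, 1)}

print(table)     # {0: {0: 7, 1: 3}, 1: {0: 6, 1: 4}}
print(prop_ill)  # {0: 0.3, 1: 0.4}
```

Comparing the row proportions is one simple way to judge whether the difference between the sexes looks modest or large before deciding how to structure the models.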