Logistic regression: preliminaries

Regression models are used to estimate the relative effects of different variables on an outcome. In this case we want to know which combination of characteristics is most useful for predicting whether people will become ill. Since we are interested in an outcome with only two possible states, well and ill, the most appropriate model type is logistic regression. Our outcome variable is coded like this:

0 = not ill
1 = ill

The value coded 0 is known as the reference category: it is the one against which change is measured, and might be thought of as the 'normal' state. Each variable in the model (unless it is continuous, e.g. age) has a reference category. Most software that offers regression will assume that the reference category for a variable is the lowest value code (here 0, because it is lower than 1) unless otherwise instructed.
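As a sketch of what such a model looks like, the standard logistic regression equation can be written as follows. The symbols are assumptions introduced here for illustration and do not come from the original text: p is the probability that the outcome is 1 (ill), the x terms are the explanatory variables, and the beta terms are the coefficients to be estimated.

```latex
\[
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad p = \Pr(\text{outcome} = 1)
\]
```

For a categorical variable coded 0/1, the exponentiated coefficient, exp(beta), is the odds of the outcome in the category coded 1 divided by the odds in the reference category coded 0, with the other variables held constant. A value above 1 therefore means higher odds of illness than in the reference category.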

The first hypothesis is that men and women will not be equally likely to develop a limiting long-term illness in our ten-year period. The sex variable is coded 0 = men, 1 = women, so the reference category is 'men'. Crosstabulating the two variables produces this table.

Whether reported limiting long-term illness in 2001 by sex, people aged 35+ with no such illness in 1991 (%)

                                  Men     Women
Has limiting long-term illness    24.7    27.4
No limiting long-term illness     75.3    72.6
Total                             100     100

Data extracted from the LS in November 2007.

The table shows a modest difference by sex: 27.4% of women reported a limiting long-term illness, against 24.7% of men. We shall therefore run only one set of models, with sex as an explanatory variable, rather than separate sets for men and women, which would have been appropriate had there been major differences between the sexes.
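To see how the percentages in the table relate to what a logistic regression reports, the odds of illness for each sex and the odds ratio for women relative to men (the reference category) can be worked out by hand. The following is a minimal Python sketch using only the published percentages. The variable and function names are illustrative assumptions, not part of the original analysis.

```python
import math

# Percentage reporting a limiting long-term illness in 2001, taken from the table.
pct_ill = {"men": 24.7, "women": 27.4}


def odds(percent_ill):
    """Convert the percentage with the outcome into odds (ill : not ill)."""
    p = percent_ill / 100.0
    return p / (1.0 - p)


odds_men = odds(pct_ill["men"])      # reference category (sex = 0)
odds_women = odds(pct_ill["women"])  # comparison category (sex = 1)

# Odds ratio for women relative to men, and its natural logarithm.
odds_ratio = odds_women / odds_men
log_odds_ratio = math.log(odds_ratio)

print(f"Odds (men):   {odds_men:.3f}")
print(f"Odds (women): {odds_women:.3f}")
print(f"Odds ratio (women vs men): {odds_ratio:.2f}")
print(f"Log odds ratio: {log_odds_ratio:.3f}")
```

The odds ratio comes out at roughly 1.15, so the odds of reporting an illness are about 15% higher for women than for men. In a model with sex as the only explanatory variable, the coefficient for sex would be expected to equal the log odds ratio computed here. Once other variables are added, the coefficient will generally change.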