廣義線性模型 (HDAS7009-02)

Logistic Regression (1)

Author

Tsai, Dai-Rong

1 Pneumoconiosis dataset

Pneumoconiosis(塵肺病) is an occupational lung disease and a restrictive lung disease caused by the inhalation of dust, often in mines and from agriculture.

year: Number of years of exposure
n: Total Number of miners
case: Number of severe cases
noncase: Number of non-severe cases

pneu <- read.csv("data/pneumoconiosis.csv")
pneu <- transform(pneu,
                  noncase = n - case,
                  yeargrp = ifelse(year < 30, "＜30", "＞30"))
pneu

  year  n case noncase yeargrp
1  5.8 98    0      98    ＜30
2 15.0 54    1      53    ＜30
3 21.5 43    3      40    ＜30
4 27.5 48    8      40    ＜30
5 33.5 51    9      42    ＞30
6 39.5 38    8      30    ＞30
7 46.0 28   10      18    ＞30
8 51.5 11    5       6    ＞30

2 Contingency table analysis

pneu_tab <- xtabs(cbind(noncase, case) ~ yeargrp, data = pneu)
pneu_tab

       
yeargrp noncase case
   ＜30     231   12
   ＞30      96   32

class(pneu_tab)

[1] "xtabs" "table"

dim(pneu_tab)

[1] 2 2

2.1 Hypothesis testing

Statement 1
- $H_0$ : 礦工暴露年數與得到塵肺病與否無關
- $H_1$ : 礦工暴露年數與得到塵肺病與否有關
Statement 2: 令 $p_1$ 與 $p_2$ 分別代表暴露年數<30年與>30年的礦工得到塵肺病的機率
- $H_0$ : $p_1 = p_2$
- $H_1$ : $p_1 \ne p_2$
Statement 3: 令 $\theta$ 為暴露年數>30年的礦工相較於暴露<30年的礦工，得到塵肺病的勝算比(odds ratio)
- $H_0$ : $\theta = 1$
- $H_1$ : $\theta \ne 1$

2.2 Pearson’s chi-squared test

當自由度為 1 (即 $2 \times 2$ 列聯表)，並以 Pearson 近似卡方值進行適合度(goodness-of-fit) 或獨立性(independence)檢定時，由於列聯表的資料是離散型資料，而卡方分布是一個連續機率密度函數，故需要利用葉茲連續校正 (Yate’s continuity correction)，減少卡方分布和實際資料分布之間的差異。

chisq.test(pneu_tab) # default: correct = TRUE


    Pearson's Chi-squared test with Yates' continuity correction

data:  pneu_tab
X-squared = 30.389, df = 1, p-value = 3.536e-08

chisq.test(pneu_tab, correct = FALSE)


    Pearson's Chi-squared test

data:  pneu_tab
X-squared = 32.279, df = 1, p-value = 1.335e-08

Note

當所有細格內的期望次數 $\ge 10$ 時，連續校正對檢定結果的影響很小。

chisq.test(pneu_tab)$expected

       
yeargrp  noncase     case
   ＜30 214.1806 28.81941
   ＞30 112.8194 15.18059

2.3 Fisher’s exact test

對於任意維度的列聯表，當有「80%以上細格的期望次數>5」時，卡方檢定統計量才會較好地近似卡方分布。若是 $2 \times 2$ 列聯表，即 $4 \times 0.8 = 3.2$ 個以上的細格 (也就是所有細格) 的期望次數都要>5。若此原則不成立，卡方分布和實際資料分布之間的差異較大，建議改用 Fisher’s exact test。
Fisher’s exact test 利用超幾何分布 (hypergeometric distribution) 計算列聯表中觀測資料的機率。超幾何分布是一個離散機率分布，適合用於描述小樣本的次數資料。

Arguments of fisher.test()

alternative: alternative hypothesis
- "two.sided" $(H_0: OR = 1;~~~ H_1: OR \ne 1)$
- "greater" $~~~(H_0: OR \le 1;~~~ H_1: OR > 1)$
- "less" $~~~~~~~~(H_0: OR \ge 1;~~~ H_1: OR < 1)$
or = 1: the hypothesized odds ratio
conf.int = TRUE: whether a confidence interval for the odds ratio should be computed and returned
conf.level = 0.95: confidence level

See ?fisher.test for more details.

fisher.test(pneu_tab)


    Fisher's Exact Test for Count Data

data:  pneu_tab
p-value = 4.088e-08
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  3.046001 14.210736
sample estimates:
odds ratio 
  6.380454

fisher.test(pneu_tab, alternative = "greater")


    Fisher's Exact Test for Count Data

data:  pneu_tab
p-value = 3.904e-08
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 3.383273      Inf
sample estimates:
odds ratio 
  6.380454

fisher.test(pneu_tab, alternative = "less")


    Fisher's Exact Test for Count Data

data:  pneu_tab
p-value = 1
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
  0.00000 12.53254
sample estimates:
odds ratio 
  6.380454

Fisher’s exact test 假設在列總和 (row margin) 與行總和 (column margin) 固定的情況下， $2 \times 2$ 列聯表的任一細格發生數服從超幾何分布，其他細格的次數由邊際總和自動決定。

addmargins(pneu_tab)

year	noncase	case	Sum
＜30	231	12	243
＞30	96	32	128
Sum	327	44	371

在列總和 (243, 128) 與行總和 (327, 44) 固定，且虛無假設 ($H_0:OR=1$) 成立的情況下，假設暴露大於 30 年的 case 人數 $X$ (觀測人數 32) 服從超幾何分布，則其機率質量函數為:

\[ f(X = x) = \frac{\binom{44}{x}\binom{327}{128-x}}{\binom{371}{128}}, ~~~ x = 0, 1, ..., 44 \]

其直觀的解釋為：若母體中包含 44 位 case 和 327 位 noncase，從中以「取後不放回」的方式隨機抽出 128 位樣本，則裡面有 $x$ 位 case 的機率。

2.3.1 Exact p-value

Exact p-value for two-sided test ($H_1: OR \ne 1$)

Two-sided tests are based on the probabilities of the tables, and take as “more extreme” all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.

From ?fisher.test

虛無假設 ($H_0:OR=1$) 成立下，得到「與目前的觀測值($X=32$)一樣」或「更極端」的觀測值的機率。

\[ p\text{-}value = \sum_{x:f(x) \le f(32)} f(X=x) \]
```
p_all <- dhyper(0:44, 44, 327, 128)
p_obs <- dhyper(32, 44, 327, 128)
sum(p_all[p_all <= p_obs])
```
```
[1] 4.088232e-08
```
Exact p-value for right-tailed test ($H_1: OR > 1$)

\[ p_{\text{upper}} = \sum_{x=32}^{44} f(X=x) \]
```
(pU <- sum(dhyper(32:44, 44, 327, 128)))
```
```
[1] 3.903773e-08
```
Exact p-value for left-tailed test ($H_1: OR < 1$)

\[ p_{\text{lower}} = \sum_{x=0}^{32} f(X=x) \]
```
(pL <- sum(dhyper(0:32, 44, 327, 128)))
```
```
[1] 1
```

Note

雙尾 p-value 除了 R 函數 fisher.test() 使用的定義之外，其另一個定義是:

\[ p\text{-}value = 2 \times Min\{ p_{\text{lower}},~ p_{\text{upper}}\} \]

2 * min(pL, pU)

[1] 7.807546e-08

2.3.2 Conditional maximum likelihood estimation (CMLE)

Noncentral hypergeometric distribution

\[ f(X = x) = \frac{\binom{44}{x}\binom{327}{128-x}OR^x} {\sum_{y=0}^{44} \binom{44}{y}\binom{327}{128-y}OR^y}, ~~~ x = 0, 1, ..., 44 \]

pneu_pmf <- function(x, or) {
  x_all <- 0:44
  denom <- sum(choose(44, x_all) * choose(327, 128 - x_all) * or^x_all)
  L <- (choose(44, x) * choose(327, 128 - x) * or^x) / denom
  return(L)
}

or_cmle <- optimize(\(or) pneu_pmf(x = 32, or = or),
                    interval = c(1e-5, 1e2), maximum = TRUE)$maximum
or_cmle

[1] 6.380425

plot(Vectorize(\(or) pneu_pmf(x = 32, or = or)), from = 1, to = 10,
     xlab = "Odds Ratio", ylab = "Likelihood")
abline(v = or_cmle, col = 2, lwd = 2)
mtext(round(or_cmle, 2), side = 1, at = or_cmle, col = 2)

2.3.3 Exact confidence interval

Lower bound

uniroot(\(or) sum(pneu_pmf(x = 32:44, or = or)) - 0.025, interval = c(1e-5, 1e2))$root

[1] 3.046096

Upper bound

uniroot(\(or) sum(pneu_pmf(x = 0:32, or = or)) - 0.025, interval = c(1e-5, 1e2))$root

[1] 14.20996

3 Logistic regression

3.1 Binary covariate

pneu 根據解釋變數的不同數值分別計算事件發生數，這種資料型態稱為 grouped data。對 grouped data 配適 logistic regression 有兩種語法：

將模型的 response 設為一個 2-column matrix，分別代表 case 與 non-case 的人數
```
m1 <- glm(cbind(case, noncase) ~ yeargrp, family = binomial, data = pneu)
```

將模型的 response 設為 case 的比例，但注意要用總人數加權

m1 <- glm(case/n ~ yeargrp, family = binomial, data = pneu, weights = n)

兩種語法完全等價。之後可用 summary() 檢視結果：

summary(m1)


Call:
glm(formula = case/n ~ yeargrp, family = binomial, data = pneu, 
    weights = n)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.9575     0.2961  -9.989  < 2e-16 ***
yeargrp＞30   1.8589     0.3596   5.169 2.35e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 56.903  on 7  degrees of freedom
Residual deviance: 26.274  on 6  degrees of freedom
AIC: 53.101

Number of Fisher Scoring iterations: 5

Intercept
解釋: 暴露年數<30年的礦工得到塵肺病的「勝算」(odds) 為 $e^{-2.9575}=0.052$，且達統計顯著 (與 $odds=1$ 有顯著差異):
- p-value < $2 \times 10^{-16} < 0.05$
- 95% 信賴區間未包含 1
\[ (e^{-2.9575 - 1.96 \times 0.2961}, e^{-2.9575 + 1.96 \times 0.2961}) = (0.029, 0.093) \]
Slope
解釋: 暴露年數>30年的礦工得到塵肺病的「勝算」(odds) 是暴露年數<30年的礦工的 $e^{1.8589}=6.417$ 倍，即暴露年數>30年的礦工相較於暴露<30年的礦工，得到塵肺病的「勝算比」(odds ratio) 為 6.417，且達統計顯著 (與 $OR=1$ 有顯著差異):
- p-value = $2.35 \times 10^{-7} < 0.05$
- 95% 信賴區間未包含 1
\[ (e^{1.8589 - 1.96 \times 0.3596}, e^{1.8589 + 1.96 \times 0.3596}) = (3.171, 12.984) \]

exp(coef(m1))

(Intercept) yeargrp＞30 
 0.05194805  6.41666666

exp(confint.default(m1))

                 2.5 %      97.5 %
(Intercept) 0.02907707  0.09280854
yeargrp＞30 3.17103066 12.98429926

Note

單變數 logistic regression 且解釋變數是類別變數時，迴歸係數即代表 $log(OR)$，其估計值與標準誤會與直接從列聯表計算的結果完全一致:

\[ OR = \frac{231 \times 32}{12 \times 96} = 6.417 \] \[ log(OR) = log(6.417) = 1.8589 \] \[ SE(log(OR)) = \sqrt{\frac{1}{231} + \frac{1}{12} + \frac{1}{96} + \frac{1}{32}} = 0.3596 \]

3.2 Continuous covariate

m2 <- glm(case/n ~ year, family = binomial, data = pneu, weights = n)
summary(m2)


Call:
glm(formula = case/n ~ year, family = binomial, data = pneu, 
    weights = n)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.79648    0.56859  -8.436  < 2e-16 ***
year         0.09346    0.01543   6.059 1.37e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 56.9028  on 7  degrees of freedom
Residual deviance:  6.0508  on 6  degrees of freedom
AIC: 32.877

Number of Fisher Scoring iterations: 4

Slope
解釋: 暴露年數每增加一年，得到塵肺病的「勝算」(odds) 是增加前的 $e^{0.093}=1.098$ 倍，且達統計顯著:
- p-value = $1.37 \times 10^{-9} < 0.05$
- 95% 信賴區間未包含 1
\[ (e^{0.0935 - 1.96 \times 0.0154}, e^{0.0935 + 1.96 \times 0.0154}) = (1.065, 1.132) \]

exp(coef(m2))[-1]

   year 
1.09797

exp(confint.default(m2))[-1, ]

   2.5 %   97.5 % 
1.065270 1.131673

Question

一位暴露年數為20年的礦工相較於暴露5年的礦工，得到塵肺病的勝算比是多少?

\[ \begin{aligned} OR &= \frac{odds|_{\text{year}=20}}{odds|_{\text{year}=5}} \\ log(odds|_{\text{year}=20}) &= -4.796 + 0.093 \times 20 ~~~...(1) \\ log(odds|_{\text{year}=5}) &= -4.796 + 0.093 \times 5 ~~~~~...(2) \\ (1)-(2): log(\frac{odds|_{\text{year}=20}}{odds|_{\text{year}=5}}) &= 0.093 \times (20-5) = 1.395 = log(OR) \\ OR &= e^{1.395} = 4.035 \end{aligned} \]

3.3 Prediction

Arguments of predict()

newdata: a data.frame in which to look for variables with which to predict
type:
- "link" log-odds (probabilities on logit scale)
- "response" probabilities
se.fit: whether standard errors are returned
- If FALSE, return a vector or matrix of predictions.
- If TRUE, return a list with components
  - $fit: predictions
  - $se.fit: estimated standard errors

See ?predict.glm for more details.

Create a new dataset for predictions

pneu_new <- data.frame(year = seq(0, 100, by = 1))

Predict probabilities at different levels of year with confidence intervals

pred <- predict(m2, newdata = pneu_new,
                type = "response", se.fit = TRUE)

pneu_ci <- cbind(fit = pred$fit,
                 lower = pred$fit - 1.96 * pred$se.fit,
                 upper = pred$fit + 1.96 * pred$se.fit)

matplot(pneu_new$year, pneu_ci, type = 'l', lty = c(1, 2, 2), ylim = c(0, 1),
        xlab = "Years of Exposure", ylab = "Proportion of Severe Cases",
        main = "Wrong 95% CI")
abline(h = 0:1, lty = 2, col = "gray")
points(case/n ~ year, pneu, cex = sqrt(n)/3)

pred <- predict(m2, newdata = pneu_new,
                type = "link", se.fit = TRUE)

pneu_ci <- cbind(fit = pred$fit,
                 lower = pred$fit - 1.96 * pred$se.fit,
                 upper = pred$fit + 1.96 * pred$se.fit) |> m2$family$linkinv()

matplot(pneu_new$year, pneu_ci, type = 'l', lty = c(1, 2, 2), ylim = c(0, 1),
        xlab = "Years of Exposure", ylab = "Proportion of Severe Cases",
        main = "Correct 95% CI")
abline(h = 0:1, lty = 2, col = "gray")
points(case/n ~ year, pneu, cex = sqrt(n)/3)

Note

GLM 的預測值在 linear predictor 的尺度下會更接近常態分布，因此 logistic regression 應該在 log odds 的尺度下計算信賴區間，之後再利用 link function 的反函數將 log odds 的信賴區間轉換為機率的信賴區間。

3.4 Comparison with linear model

library(ggplot2)

ggplot(pneu, aes(x = year, y = case/n, weight = n)) +
  geom_hline(yintercept = 0:1, alpha = 0.6, linetype = 2) +
  geom_smooth(aes(colour = "GLM", fill = "GLM"),
              method = "glm", fullrange = TRUE,
              method.args = list(family = "binomial")) +
  geom_smooth(aes(colour = "Linear", fill = "Linear"),
              method = "lm", fullrange = TRUE) +
  geom_point(aes(size = n), alpha = 0.6) +
  expand_limits(x = c(0, 100)) +
  scale_colour_discrete(name = "Method") +
  scale_fill_discrete(name = "Method") +
  scale_size_binned() +
  labs(x = "Years of Exposure", y = "Proportion of Severe Cases") +
  theme_bw()