Pneumoconiosis dataset
Pneumoconiosis(塵肺病) is an occupational lung disease and a restrictive lung disease caused by the inhalation of dust, often in mines and from agriculture.
year: Number of years of exposure
n: Total Number of miners
case: Number of severe cases
noncase: Number of non-severe cases
pneu <- read.csv ( "data/pneumoconiosis.csv" )
pneu <- transform ( pneu ,
noncase = n - case ,
yeargrp = ifelse ( year < 30 , "<30" , ">30" ) )
pneu
year n case noncase yeargrp
1 5.8 98 0 98 <30
2 15.0 54 1 53 <30
3 21.5 43 3 40 <30
4 27.5 48 8 40 <30
5 33.5 51 9 42 >30
6 39.5 38 8 30 >30
7 46.0 28 10 18 >30
8 51.5 11 5 6 >30
Contingency table analysis
pneu_tab <- xtabs ( cbind ( noncase , case ) ~ yeargrp , data = pneu )
pneu_tab
yeargrp noncase case
<30 231 12
>30 96 32
Hypothesis testing
Statement 1
\(H_0\) : 礦工暴露年數與得到塵肺病與否無關
\(H_1\) : 礦工暴露年數與得到塵肺病與否有關
Statement 2 : 令 \(p_1\) 與 \(p_2\) 分別代表暴露年數<30年與>30年的礦工得到塵肺病的機率
\(H_0\) : \(p_1 = p_2\)
\(H_1\) : \(p_1 \ne p_2\)
Statement 3 : 令 \(\theta\) 為暴露年數>30年的礦工相較於暴露<30年的礦工,得到塵肺病的勝算比(odds ratio)
\(H_0\) : \(\theta = 1\)
\(H_1\) : \(\theta \ne 1\)
Pearson’s chi-squared test
當自由度為 1 (即 \(2 \times 2\) 列聯表), 並以 Pearson 近似卡方值進行適合度(goodness-of-fit) 或獨立性(independence)檢定時,由於列聯表的資料是離散型資料, 而卡方分布是一個連續機率密度函數, 故需要利用葉茲連續校正 (Yate’s continuity correction), 減少卡方分布和實際資料分布之間的差異。
Pearson's Chi-squared test with Yates' continuity correction
data: pneu_tab
X-squared = 30.389, df = 1, p-value = 3.536e-08
Pearson's Chi-squared test
data: pneu_tab
X-squared = 32.279, df = 1, p-value = 1.335e-08
當所有細格內的期望次數 \(\ge 10\) 時,連續校正對檢定結果的影響很小。
yeargrp noncase case
<30 214.1806 28.81941
>30 112.8194 15.18059
Fisher’s exact test
對於任意維度的列聯表,當有「80%以上細格的期望次數>5」時, 卡方檢定統計量才會較好地近似卡方分布。 若是 \(2 \times 2\) 列聯表,即 \(4 \times 0.8 = 3.2\) 個以上的細格 (也就是所有細格) 的期望次數都要>5。 若此原則不成立,卡方分布和實際資料分布之間的差異較大,建議改用 Fisher’s exact test。
Fisher’s exact test 利用超幾何分布 (hypergeometric distribution) 計算列聯表中觀測資料的機率。 超幾何分布是一個離散機率分布,適合用於描述小樣本的次數資料。
alternative: alternative hypothesis
"two.sided" \((H_0: OR = 1;~~~ H_1: OR \ne 1)\)
"greater" \(~~~(H_0: OR \le 1;~~~ H_1: OR > 1)\)
"less" \(~~~~~~~~(H_0: OR \ge 1;~~~ H_1: OR < 1)\)
or = 1: the hypothesized odds ratio
conf.int = TRUE: whether a confidence interval for the odds ratio should be computed and returned
conf.level = 0.95: confidence level
See ?fisher.test for more details.
Fisher's Exact Test for Count Data
data: pneu_tab
p-value = 4.088e-08
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
3.046001 14.210736
sample estimates:
odds ratio
6.380454
Fisher's Exact Test for Count Data
data: pneu_tab
p-value = 3.904e-08
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
3.383273 Inf
sample estimates:
odds ratio
6.380454
Fisher's Exact Test for Count Data
data: pneu_tab
p-value = 1
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
0.00000 12.53254
sample estimates:
odds ratio
6.380454
Fisher’s exact test 假設在列總和 (row margin) 與行總和 (column margin) 固定的情況下, \(2 \times 2\) 列聯表的任一細格發生數服從超幾何分布,其他細格的次數由邊際總和自動決定。
year
noncase
case
Sum
<30
231
12
243
>30
96
32
128
Sum
327
44
371
在列總和 (243, 128) 與行總和 (327, 44) 固定,且虛無假設 (\(H_0:OR=1\) ) 成立的情況下, 假設暴露大於 30 年的 case 人數 \(X\) (觀測人數 32) 服從超幾何分布, 則其機率質量函數為:
\[
f(X = x) = \frac{\binom{44}{x}\binom{327}{128-x}}{\binom{371}{128}},
~~~ x = 0, 1, ..., 44
\]
其直觀的解釋為:若母體中包含 44 位 case 和 327 位 noncase, 從中以「取後不放回」的方式隨機抽出 128 位樣本,則裡面有 \(x\) 位 case 的機率。
Exact p -value
Exact p -value for two-sided test (\(H_1: OR \ne 1\) )
Two-sided tests are based on the probabilities of the tables, and take as “more extreme” all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.
From ?fisher.test
虛無假設 (\(H_0:OR=1\) ) 成立下,得到「與目前的觀測值(\(X=32\) )一樣」 或「更極端」的觀測值的機率。
\[
p\text{-}value = \sum_{x:f(x) \le f(32)} f(X=x)
\]
p_all <- dhyper ( 0 : 44 , 44 , 327 , 128 )
p_obs <- dhyper ( 32 , 44 , 327 , 128 )
sum ( p_all [ p_all <= p_obs ] )
Exact p -value for right-tailed test (\(H_1: OR > 1\) )
\[
p_{\text{upper}} = \sum_{x=32}^{44} f(X=x)
\]
Exact p -value for left-tailed test (\(H_1: OR < 1\) )
\[
p_{\text{lower}} = \sum_{x=0}^{32} f(X=x)
\]
雙尾 p -value 除了 R 函數 fisher.test() 使用的定義之外,其另一個定義是:
\[
p\text{-}value = 2 \times Min\{ p_{\text{lower}},~ p_{\text{upper}}\}
\]
Conditional maximum likelihood estimation (CMLE)
Noncentral hypergeometric distribution
\[
f(X = x) = \frac{\binom{44}{x}\binom{327}{128-x}OR^x}
{\sum_{y=0}^{44} \binom{44}{y}\binom{327}{128-y}OR^y},
~~~ x = 0, 1, ..., 44
\]
pneu_pmf <- function ( x , or ) {
x_all <- 0 : 44
denom <- sum ( choose ( 44 , x_all ) * choose ( 327 , 128 - x_all ) * or ^ x_all )
L <- ( choose ( 44 , x ) * choose ( 327 , 128 - x ) * or ^ x ) / denom
return ( L )
}
or_cmle <- optimize ( \( or ) pneu_pmf ( x = 32 , or = or ) ,
interval = c ( 1e-5 , 1e2 ) , maximum = TRUE ) $ maximum
or_cmle
plot ( Vectorize ( \( or ) pneu_pmf ( x = 32 , or = or ) ) , from = 1 , to = 10 ,
xlab = "Odds Ratio" , ylab = "Likelihood" )
abline ( v = or_cmle , col = 2 , lwd = 2 )
mtext ( round ( or_cmle , 2 ) , side = 1 , at = or_cmle , col = 2 )
Exact confidence interval
Lower bound
uniroot ( \( or ) sum ( pneu_pmf ( x = 32 : 44 , or = or ) ) - 0.025 , interval = c ( 1e-5 , 1e2 ) ) $ root
Upper bound
uniroot ( \( or ) sum ( pneu_pmf ( x = 0 : 32 , or = or ) ) - 0.025 , interval = c ( 1e-5 , 1e2 ) ) $ root
Logistic regression
Binary covariate
pneu 根據解釋變數的不同數值分別計算事件發生數,這種資料型態稱為 grouped data。 對 grouped data 配適 logistic regression 有兩種語法:
將模型的 response 設為一個 2-column matrix,分別代表 case 與 non-case 的人數
m1 <- glm ( cbind ( case , noncase ) ~ yeargrp , family = binomial , data = pneu )
將模型的 response 設為 case 的比例,但注意要用總人數加權
m1 <- glm ( case / n ~ yeargrp , family = binomial , data = pneu , weights = n )
兩種語法完全等價。之後可用 summary() 檢視結果:
Call:
glm(formula = case/n ~ yeargrp, family = binomial, data = pneu,
weights = n)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.9575 0.2961 -9.989 < 2e-16 ***
yeargrp>30 1.8589 0.3596 5.169 2.35e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 56.903 on 7 degrees of freedom
Residual deviance: 26.274 on 6 degrees of freedom
AIC: 53.101
Number of Fisher Scoring iterations: 5
Intercept
解釋 : 暴露年數<30年的礦工得到塵肺病的「勝算」(odds) 為 \(e^{-2.9575}=0.052\) ,且達統計顯著 (與 \(odds=1\) 有顯著差異):
p -value < \(2 \times 10^{-16} < 0.05\)
95% 信賴區間未包含 1
\[
(e^{-2.9575 - 1.96 \times 0.2961}, e^{-2.9575 + 1.96 \times 0.2961})
= (0.029, 0.093)
\]
Slope
解釋 : 暴露年數>30年的礦工得到塵肺病的「勝算」(odds) 是暴露年數<30年的礦工的 \(e^{1.8589}=6.417\) 倍, 即暴露年數>30年的礦工相較於暴露<30年的礦工, 得到塵肺病的「勝算比」(odds ratio) 為 6.417, 且達統計顯著 (與 \(OR=1\) 有顯著差異):
p -value = \(2.35 \times 10^{-7} < 0.05\)
95% 信賴區間未包含 1
\[
(e^{1.8589 - 1.96 \times 0.3596}, e^{1.8589 + 1.96 \times 0.3596})
= (3.171, 12.984)
\]
(Intercept) yeargrp>30
0.05194805 6.41666666
2.5 % 97.5 %
(Intercept) 0.02907707 0.09280854
yeargrp>30 3.17103066 12.98429926
單變數 logistic regression 且解釋變數是類別變數時,迴歸係數即代表 \(log(OR)\) , 其估計值與標準誤會與直接從列聯表計算的結果完全一致:
\[ OR = \frac{231 \times 32}{12 \times 96} = 6.417 \] \[ log(OR) = log(6.417) = 1.8589 \] \[ SE(log(OR)) = \sqrt{\frac{1}{231} + \frac{1}{12} + \frac{1}{96} + \frac{1}{32}} = 0.3596 \]
Continuous covariate
m2 <- glm ( case / n ~ year , family = binomial , data = pneu , weights = n )
summary ( m2 )
Call:
glm(formula = case/n ~ year, family = binomial, data = pneu,
weights = n)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.79648 0.56859 -8.436 < 2e-16 ***
year 0.09346 0.01543 6.059 1.37e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 56.9028 on 7 degrees of freedom
Residual deviance: 6.0508 on 6 degrees of freedom
AIC: 32.877
Number of Fisher Scoring iterations: 4
Slope
解釋 : 暴露年數每增加一年,得到塵肺病的「勝算」(odds) 是增加前的 \(e^{0.093}=1.098\) 倍,且達統計顯著:
p -value = \(1.37 \times 10^{-9} < 0.05\)
95% 信賴區間未包含 1
\[
(e^{0.0935 - 1.96 \times 0.0154}, e^{0.0935 + 1.96 \times 0.0154})
= (1.065, 1.132)
\]
2.5 % 97.5 %
1.065270 1.131673
一位暴露年數為20年的礦工相較於暴露5年的礦工,得到塵肺病的勝算比是多少?
\[
\begin{aligned}
OR &= \frac{odds|_{\text{year}=20}}{odds|_{\text{year}=5}} \\
log(odds|_{\text{year}=20}) &= -4.796 + 0.093 \times 20 ~~~...(1) \\
log(odds|_{\text{year}=5}) &= -4.796 + 0.093 \times 5 ~~~~~...(2) \\
(1)-(2): log(\frac{odds|_{\text{year}=20}}{odds|_{\text{year}=5}})
&= 0.093 \times (20-5) = 1.395 = log(OR) \\
OR &= e^{1.395} = 4.035
\end{aligned}
\]
Prediction
newdata: a data.frame in which to look for variables with which to predict
type:
"link" log-odds (probabilities on logit scale)
"response" probabilities
se.fit: whether standard errors are returned
If FALSE, return a vector or matrix of predictions.
If TRUE, return a list with components
$fit: predictions
$se.fit: estimated standard errors
See ?predict.glm for more details.
pred <- predict ( m2 , newdata = pneu_new ,
type = "response" , se.fit = TRUE )
pneu_ci <- cbind ( fit = pred $ fit ,
lower = pred $ fit - 1.96 * pred $ se.fit ,
upper = pred $ fit + 1.96 * pred $ se.fit )
matplot ( pneu_new $ year , pneu_ci , type = 'l' , lty = c ( 1 , 2 , 2 ) , ylim = c ( 0 , 1 ) ,
xlab = "Years of Exposure" , ylab = "Proportion of Severe Cases" ,
main = "Wrong 95% CI" )
abline ( h = 0 : 1 , lty = 2 , col = "gray" )
points ( case / n ~ year , pneu , cex = sqrt ( n ) / 3 )
pred <- predict ( m2 , newdata = pneu_new ,
type = "link" , se.fit = TRUE )
pneu_ci <- cbind ( fit = pred $ fit ,
lower = pred $ fit - 1.96 * pred $ se.fit ,
upper = pred $ fit + 1.96 * pred $ se.fit ) |> m2 $ family $ linkinv ( )
matplot ( pneu_new $ year , pneu_ci , type = 'l' , lty = c ( 1 , 2 , 2 ) , ylim = c ( 0 , 1 ) ,
xlab = "Years of Exposure" , ylab = "Proportion of Severe Cases" ,
main = "Correct 95% CI" )
abline ( h = 0 : 1 , lty = 2 , col = "gray" )
points ( case / n ~ year , pneu , cex = sqrt ( n ) / 3 )
GLM 的預測值在 linear predictor 的尺度下會更接近常態分布, 因此 logistic regression 應該在 log odds 的尺度下計算信賴區間, 之後再利用 link function 的反函數將 log odds 的信賴區間轉換為機率的信賴區間。
Comparison with linear model
library ( ggplot2 )
ggplot ( pneu , aes ( x = year , y = case / n , weight = n ) ) +
geom_hline ( yintercept = 0 : 1 , alpha = 0.6 , linetype = 2 ) +
geom_smooth ( aes ( colour = "GLM" , fill = "GLM" ) ,
method = "glm" , fullrange = TRUE ,
method.args = list ( family = "binomial" ) ) +
geom_smooth ( aes ( colour = "Linear" , fill = "Linear" ) ,
method = "lm" , fullrange = TRUE ) +
geom_point ( aes ( size = n ) , alpha = 0.6 ) +
expand_limits ( x = c ( 0 , 100 ) ) +
scale_colour_discrete ( name = "Method" ) +
scale_fill_discrete ( name = "Method" ) +
scale_size_binned ( ) +
labs ( x = "Years of Exposure" , y = "Proportion of Severe Cases" ) +
theme_bw ( )