Cook's distance

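In statistics, Cook's distance or Cook's D is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis.^[1] In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to flag influential data points that are particularly worth checking for validity, or to indicate regions of the design space where it would be good to obtain more data points. It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.^[2]^[3]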

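Definition

Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. Cook's distance measures the effect of deleting a given observation. Points with a large Cook's distance are considered to merit closer examination in the analysis.

For the algebraic expression, first define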

{\underset {n\times 1}{\mathbf {y} }}={\underset {n\times p}{\mathbf {X} }}\quad {\underset {p\times 1}{\boldsymbol {\beta }}}\quad +\quad {\underset {n\times 1}{\boldsymbol {\varepsilon }}}

where ${\boldsymbol {\varepsilon }}\sim {\mathcal {N}}\left(0,\sigma ^{2}\mathbf {I} \right)$ is the error term, ${\boldsymbol {\beta }}=\left[\beta _{0}\,\beta _{1}\dots \beta _{p-1}\right]^{\mathsf {T}}$ is the coefficient vector, $p$ is the number of covariates or predictors for each observation (counting the constant term), and $\mathbf {X}$ is the design matrix including a constant. The least squares estimator then is $\mathbf {b} =\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} \right)^{-1}\mathbf {X} ^{\mathsf {T}}\mathbf {y}$ , and consequently the fitted (predicted) values for the mean of $\mathbf {y}$ are

\mathbf {\widehat {y}} =\mathbf {X} \mathbf {b} =\mathbf {X} \left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} \right)^{-1}\mathbf {X} ^{\mathsf {T}}\mathbf {y} =\mathbf {H} \mathbf {y}

where $\mathbf {H} \equiv \mathbf {X} (\mathbf {X} ^{\mathsf {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\mathsf {T}}$ is the projection matrix (or hat matrix). The $i$ -th diagonal element of $\mathbf {H}$ , given by $h_{ii}\equiv \mathbf {x} _{i}^{\mathsf {T}}(\mathbf {X} ^{\mathsf {T}}\mathbf {X} )^{-1}\mathbf {x} _{i}$ ,^[4] is known as the leverage of the $i$ -th observation. Similarly, the $i$ -th element of the residual vector $\mathbf {e} =\mathbf {y} -\mathbf {\widehat {y\,}} =\left(\mathbf {I} -\mathbf {H} \right)\mathbf {y}$ is denoted by $e_{i}$ .
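As a minimal sketch of these quantities, the following standard-library Python computes the least squares estimate, the leverages $h_{ii}$ and the residuals $e_{i}$ for a straight-line fit, where $\mathbf {X} ^{\mathsf {T}}\mathbf {X}$ is only 2×2 and can be inverted by hand. The data values and variable names are made up for illustration and do not come from the sources cited here.

```python
# Illustrative data (made up for this sketch): one predictor plus an
# intercept, so the design matrix has rows (1, x_i) and p = 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.0, 2.1, 2.9, 4.2, 5.0, 4.0]
n, p = len(x), 2

# Entries of X^T X = [[n, sx], [sx, sxx]] and its determinant.
sx = sum(x)
sxx = sum(xi * xi for xi in x)
det = n * sxx - sx * sx

# b = (X^T X)^{-1} X^T y, written out for the 2x2 case.
sy = sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

# Leverage h_ii = x_i^T (X^T X)^{-1} x_i with x_i = (1, x_i).
leverage = [(sxx - 2.0 * sx * xi + n * xi * xi) / det for xi in x]

# Residuals e_i = y_i - fitted value.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

for i in range(n):
    print(f"i={i + 1}  h_ii={leverage[i]:.3f}  e_i={residuals[i]:+.3f}")
```

Because the trace of the hat matrix equals $p$, the leverages here should sum to 2, and the point at $x=10$, far from the other predictor values, should carry the largest leverage.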

Cook's distance $D_{i}$ of observation $i\;({\text{for }}i=1,\dots ,n)$ is defined as the sum of all the changes in the regression model when observation $i$ is removed from it:^[5]

D_{i}={\frac {\sum _{j=1}^{n}\left({\widehat {y\,}}_{j}-{\widehat {y\,}}_{j(i)}\right)^{2}}{ps^{2}}}

where ${\widehat {y\,}}_{j(i)}$ is the fitted response value obtained when excluding $i$ , and $s^{2}={\frac {\mathbf {e} ^{\top }\mathbf {e} }{n-p}}$ is the mean squared error of the regression model.^[6]
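The deletion-based definition can be sketched directly: refit the model once per observation and accumulate the squared shifts in all $n$ fitted values. The helper names `fit_line` and `cooks_distance_by_deletion`, and the data, are hypothetical choices for this illustration, which assumes a straight-line model ($p=2$).

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def cooks_distance_by_deletion(x, y):
    """Cook's distance from its definition: refit with each point left out."""
    n, p = len(x), 2  # p = 2: intercept and slope
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)

    distances = []
    for i in range(n):
        x_del = x[:i] + x[i + 1:]
        y_del = y[:i] + y[i + 1:]
        c0, c1 = fit_line(x_del, y_del)
        # Fitted values for all n points from the model that never saw point i.
        fitted_del = [c0 + c1 * xj for xj in x]
        shift = sum((fj - gj) ** 2 for fj, gj in zip(fitted, fitted_del))
        distances.append(shift / (p * s2))
    return distances


# Illustrative data (made up): the last point has an unusual x and is off-trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.0, 2.1, 2.9, 4.2, 5.0, 4.0]
d_deletion = cooks_distance_by_deletion(x, y)
print([round(d, 3) for d in d_deletion])
```

This costs $n$ extra fits, which is fine for a small example but is the reason a closed form in terms of the leverage is usually preferred.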

Equivalently, it can be expressed using the leverage^[5] ( $h_{ii}$ ):

D_{i}={\frac {e_{i}^{2}}{ps^{2}}}\left[{\frac {h_{ii}}{(1-h_{ii})^{2}}}\right]
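A sketch of the leverage form, which needs only a single fit. It assumes a straight-line model ($p=2$), for which the leverage has the standard closed form $h_{ii}=1/n+(x_{i}-{\bar {x}})^{2}/\sum _{j}(x_{j}-{\bar {x}})^{2}$; the function name and data are hypothetical.

```python
def cooks_distance_from_leverage(x, y):
    """D_i = e_i^2 / (p s^2) * h_ii / (1 - h_ii)^2 for a straight-line fit."""
    n, p = len(x), 2  # p = 2: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar

    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # mean squared error

    # Closed-form leverage, valid for one predictor plus an intercept.
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    return [
        e * e / (p * s2) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]


# Illustrative data (made up for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.0, 2.1, 2.9, 4.2, 5.0, 4.0]
d_leverage = cooks_distance_from_leverage(x, y)
print([round(d, 3) for d in d_leverage])
```

No observation is ever removed here: the effect of deletion is recovered entirely from $e_{i}$, $h_{ii}$ and $s^{2}$ of the full fit.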

Detecting highly influential observations

Opinions differ on which cut-off value to use for spotting highly influential points. Since Cook's distance is in the metric of an F distribution with $p$ and $n-p$ (as defined for the design matrix $\mathbf {X}$ above) degrees of freedom, the median point (i.e., $F_{0.5}(p,n-p)$ ) can be used as a cut-off.^[7] Since this value is close to 1 for large $n$ , a simple operational guideline of $D_{i}>1$ has been suggested.^[8] Note that the Cook's distance measure does not always correctly identify influential observations.^[9]^[10]
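The operational guideline $D_{i}>1$ amounts to a simple filter. In this sketch the function name `flag_influential` and the input values are hypothetical, and the cut-off is left as a parameter because the choice of threshold is itself debated.

```python
def flag_influential(distances, cutoff=1.0):
    """Return 1-based indices of observations with Cook's distance > cutoff."""
    return [i for i, d in enumerate(distances, start=1) if d > cutoff]


# Hypothetical Cook's distances for six observations (illustrative values only).
example_distances = [0.44, 0.03, 0.00, 0.09, 0.21, 10.12]
flagged = flag_influential(example_distances)
print(flagged)  # prints [6]
```

A flagged observation is a candidate for closer inspection, not automatically an error, which fits the caution that the measure does not always identify influential observations correctly.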

Relationship to other influence measures (and interpretation)

$D_{i}$ can be expressed using the leverage^[5] ( $0\leq h_{ii}\leq 1$ ) and the square of the internally studentized residual ( $0\leq t_{i}^{2}$ ), as follows:

{\begin{aligned}D_{i}&={\frac {e_{i}^{2}}{ps^{2}}}\left[{\frac {h_{ii}}{(1-h_{ii})^{2}}}\right]={\frac {1}{p}}{\frac {e_{i}^{2}}{{1 \over n-p}\sum _{j=1}^{n}e_{j}^{\,2}(1-h_{ii})}}\left[{\frac {h_{ii}}{1-h_{ii}}}\right]\\&=\left[{\frac {1}{p}}\right]t_{i}^{2}{\frac {h_{ii}}{1-h_{ii}}}.\end{aligned}}

The benefit of the last formulation is that it clearly shows how $t_{i}^{2}$ and $h_{ii}$ each contribute to $D_{i}$ (while $p$ and $n$ are the same for all observations). If $t_{i}^{2}$ is large then it (for non-extreme values of $h_{ii}$ ) will increase $D_{i}$ . If $h_{ii}$ is close to 0 then $D_{i}$ will be small, while if $h_{ii}$ is close to 1 then $D_{i}$ will become very large (as long as $t_{i}^{2}>0$ , i.e., as long as observation $i$ is not exactly on the regression line that was fitted without observation $i$ ).
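A small sketch of this behaviour, holding the squared studentized residual fixed and varying only the leverage. The function names are hypothetical; the formulas are the ones given above, with $t_{i}^{2}=e_{i}^{2}/\left(s^{2}(1-h_{ii})\right)$.

```python
def internally_studentized_sq(e_i, h_ii, s2):
    """Squared internally studentized residual: e_i^2 / (s^2 (1 - h_ii))."""
    return e_i ** 2 / (s2 * (1.0 - h_ii))


def cooks_distance(t_sq, h_ii, p):
    """D_i = (1 / p) * t_i^2 * h_ii / (1 - h_ii)."""
    return t_sq * h_ii / (p * (1.0 - h_ii))


# Same studentized residual (t_i^2 = 4) at increasing leverage, with p = 2.
for h in (0.05, 0.5, 0.95):
    print(f"h_ii={h:.2f}  D_i={cooks_distance(4.0, h, p=2):.3f}")
```

The factor $h_{ii}/(1-h_{ii})$ is what drives the growth: it is near 0 for low-leverage points, equals 1 at $h_{ii}=0.5$, and is unbounded as $h_{ii}$ approaches 1.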

$D_{i}$ is related to DFFITS through the following relationship (note that ${{\widehat {\sigma }} \over {\widehat {\sigma }}_{(i)}}t_{i}=t_{i(i)}$ is the externally studentized residual, where ${\widehat {\sigma }}$ and ${\widehat {\sigma }}_{(i)}$ denote the estimates of $\sigma$ computed with and without observation $i$ , respectively):

{\begin{aligned}D_{i}&=\left[{\frac {1}{p}}\right]t_{i}^{2}{\frac {h_{ii}}{1-h_{ii}}}\\&=\left[{\frac {1}{p}}\right]{{\widehat {\sigma }}_{(i)}^{2} \over {\widehat {\sigma }}^{2}}{{\widehat {\sigma }}^{2} \over {\widehat {\sigma }}_{(i)}^{2}}t_{i}^{2}{\frac {h_{ii}}{1-h_{ii}}}=\left[{\frac {1}{p}}\right]{{\widehat {\sigma }}_{(i)}^{2} \over {\widehat {\sigma }}^{2}}\left(t_{i(i)}{\sqrt {\frac {h_{ii}}{1-h_{ii}}}}\right)^{2}\\&=\left[{\frac {1}{p}}\right]{{\widehat {\sigma }}_{(i)}^{2} \over {\widehat {\sigma }}^{2}}{\text{DFFITS}}^{2}\end{aligned}}
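The relationship can be checked numerically. This sketch assumes a straight-line model ($p=2$) and the usual definition ${\text{DFFITS}}_{i}=({\widehat {y\,}}_{i}-{\widehat {y\,}}_{i(i)})/({\widehat {\sigma }}_{(i)}{\sqrt {h_{ii}}})$, which is not spelled out in the text above; the helper names and data are hypothetical. It obtains ${\widehat {\sigma }}_{(i)}$ by actually refitting without observation $i$ and compares the result with the leverage form of $D_{i}$.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def sigma_hat_sq(x, y, b0, b1, p=2):
    """Residual variance estimate: sum of squared residuals / (n - p)."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (len(x) - p)


# Illustrative data (made up for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.0, 2.1, 2.9, 4.2, 5.0, 4.0]
n, p = len(x), 2
b0, b1 = fit_line(x, y)
sigma2 = sigma_hat_sq(x, y, b0, b1)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

max_gap = 0.0
for i in range(n):
    h = 1.0 / n + (x[i] - x_bar) ** 2 / sxx  # leverage for a straight-line fit
    e = y[i] - (b0 + b1 * x[i])
    d_leverage = e * e / (p * sigma2) * h / (1.0 - h) ** 2

    # Refit without observation i to get sigma_hat_(i) and the deleted prediction.
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    c0, c1 = fit_line(x_del, y_del)
    sigma2_del = sigma_hat_sq(x_del, y_del, c0, c1)
    dffits = ((b0 + b1 * x[i]) - (c0 + c1 * x[i])) / math.sqrt(sigma2_del * h)

    # D_i = (1 / p) * (sigma_hat_(i)^2 / sigma_hat^2) * DFFITS^2
    d_dffits = (sigma2_del / sigma2) * dffits ** 2 / p
    max_gap = max(max_gap, abs(d_leverage - d_dffits))
    print(f"i={i + 1}  DFFITS={dffits:+.3f}  D_i={d_dffits:.3f}")

print("largest disagreement between the two forms:", max_gap)
```

The two forms should agree up to floating-point rounding; the variance ratio is what separates $D_{i}$ from a plain rescaling of ${\text{DFFITS}}^{2}$.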

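$D_{i}$ can be interpreted as the distance one's estimates move within the confidence ellipsoid that represents a region of plausible values for the parameters. This is shown by an alternative but equivalent representation of Cook's distance in terms of the change in the regression parameter estimates between the cases where the particular observation is included in, or excluded from, the regression analysis.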
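For reference, this alternative representation is usually written as below, where $\mathbf {b} _{(i)}$ (a symbol not introduced in the text above) denotes the least squares estimate computed with observation $i$ removed:

```latex
D_{i}={\frac {\left(\mathbf {b} -\mathbf {b} _{(i)}\right)^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {X} \left(\mathbf {b} -\mathbf {b} _{(i)}\right)}{ps^{2}}}
```

Since $\mathbf {X} \left(\mathbf {b} -\mathbf {b} _{(i)}\right)$ is the vector of differences ${\widehat {y\,}}_{j}-{\widehat {y\,}}_{j(i)}$ , the numerator equals $\sum _{j=1}^{n}\left({\widehat {y\,}}_{j}-{\widehat {y\,}}_{j(i)}\right)^{2}$ , so this agrees with the definition in terms of fitted values.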

Software implementations

Many programs and statistics packages, such as R and Python, include implementations of Cook's distance.

Language/program	Function	Notes
R	`cooks.distance(model, ...)`	See [1]
Python	`CooksDistance().fit(X, y)`	See [2]

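Extensions

The high-dimensional influence measure (HIM) is an alternative to Cook's distance for the case $p>n$ (i.e., when there are more predictors than observations).^[11] While Cook's distance quantifies an individual observation's influence on the least squares regression coefficient estimates, the HIM measures the influence of an observation on the marginal correlations.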

References

[1] Mendenhall, William; Sincich, Terry (1996). A Second Course in Statistics: Regression Analysis (5th ed.). Upper Saddle River, NJ: Prentice-Hall. p. 422. ISBN 0-13-396821-9. A measure of overall influence an outlying observation has on the estimated $\beta$ coefficients was proposed by R. D. Cook (1979). Cook's distance, D_i, is calculated...
[2] Cook, R. Dennis (February 1977). "Detection of Influential Observations in Linear Regression". Technometrics. American Statistical Association. 19 (1): 15–18. doi:10.2307/1268249. JSTOR 1268249. MR 0436478.
[3] Cook, R. Dennis (March 1979). "Influential Observations in Linear Regression". Journal of the American Statistical Association. American Statistical Association. 74 (365): 169–174. doi:10.2307/2286747. hdl:11299/199280. JSTOR 2286747. MR 0529533.
[4] Hayashi, Fumio (2000). Econometrics. Princeton University Press. pp. 21–23. ISBN 1400823838.
[5] "Cook's Distance". MathWorks.
[6] "Statistics 512: Applied Linear Models" (PDF). Purdue University. Archived from the original (PDF) on 2016-11-30. Retrieved 2016-03-25.
[7] Bollen, Kenneth A.; Jackman, Robert W. (1990). "Regression Diagnostics: An Expository Treatment of Outliers and Influential Cases". In Fox, John; Long, J. Scott (eds.). Modern Methods of Data Analysis. Newbury Park, CA: Sage. p. 266. ISBN 0-8039-3366-5.
[8] Cook, R. Dennis; Weisberg, Sanford (1982). Residuals and Influence in Regression. New York, NY: Chapman & Hall. hdl:11299/37076. ISBN 0-412-24280-X.
[9] Kim, Myung Geun (31 May 2017). "A cautionary note on the use of Cook's distance". Communications for Statistical Applications and Methods. 24 (3): 317–324. doi:10.5351/csam.2017.24.3.317. ISSN 2383-4757.
[10] Deletion diagnostic statistics in regression analysis
[11] High-dimensional influence measure

Further reading

Atkinson, Anthony; Riani, Marco (2000). "Deletion Diagnostics". Robust Diagnostics and Regression Analysis. New York: Springer. pp. 22–25. ISBN 0-387-95017-6.
Heiberger, Richard M.; Holland, Burt (2013). "Case Statistics". Statistical Analysis and Data Display. Springer Science & Business Media. pp. 312–27. ISBN 9781475742848.
Krasker, William S.; Kuh, Edwin; Welsch, Roy E. (1983). "Estimation for dirty data and flawed models". Handbook of Econometrics. Vol. 1. Elsevier. pp. 651–698. doi:10.1016/S1573-4412(83)01015-6. ISBN 9780444861856.
Aguinis, Herman; Gottfredson, Ryan K.; Joo, Harry (2013). "Best-Practice Recommendations for Defining Identifying and Handling Outliers". Organizational Research Methods. Sage. 16 (2): 270–301. doi:10.1177/1094428112470848. S2CID 54916947. Retrieved 4 December 2015.
