Regression dilution

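Illustration of regression dilution (or attenuation bias) by a range of regression estimates in errors-in-variables models. Two regression lines (red) bound the range of linear regression possibilities. The shallow slope is obtained when the independent variable (or predictor) is on the abscissa (x-axis). The steeper slope is obtained when the independent variable is on the ordinate (y-axis). By convention, the independent variable is placed on the x-axis, which yields the shallower slope. Green reference lines are averages within arbitrary bins along each axis. Note that the steeper green and red regression estimates are more consistent with smaller errors in the y-axis variable.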

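Regression dilution, also known as regression attenuation, is the biasing of the linear regression slope towards zero (the underestimation of its absolute value), caused by errors in the independent variable.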

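Consider fitting a straight line for the relationship of an outcome variable y to a predictor variable x, and estimating the slope of the line. Statistical variability, measurement error or random noise in the y variable causes uncertainty in the estimated slope, but not bias: on average, the procedure calculates the right slope. However, variability, measurement error or random noise in the x variable causes bias in the estimated slope (as well as imprecision). The greater the error variance in the measured x, the closer the estimated slope is pulled towards zero rather than the true value.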
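The contrast between noise in y and noise in x can be checked with a small simulation. The following is a minimal sketch, assuming normally distributed data and errors; the helper `ols_slope`, the sample size, and the variances are illustrative choices rather than anything taken from the sources cited here.

```python
import numpy as np

def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(0)
n = 200_000
true_slope = 2.0

x = rng.normal(0.0, 1.0, n)              # true predictor, variance 1
y = true_slope * x                       # noise-free outcome

y_noisy = y + rng.normal(0.0, 1.0, n)    # noise added to y only
x_noisy = x + rng.normal(0.0, 1.0, n)    # noise added to x only (error variance 1)

slope_y_noise = ols_slope(x, y_noisy)    # scatters around the true slope
slope_x_noise = ols_slope(x_noisy, y)    # shrunk towards zero

print(f"noise in y only: {slope_y_noise:.3f}")
print(f"noise in x only: {slope_x_noise:.3f}")
```

With these assumed variances the slope fitted to the noisy x should sit near half the true slope, because the error variance equals the variance of the true predictor.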

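Suppose the green and blue data points capture the same data, but with errors (either +1 or -1 on the x-axis) for the green points. Minimizing error on the y-axis leads to a smaller slope for the green points, even though they are just a noisy version of the same data.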

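It may seem counter-intuitive that noise in the predictor variable x induces a bias, but noise in the outcome variable y does not. Recall that linear regression is not symmetric: the line of best fit for predicting y from x (the usual linear regression) is not the same as the line of best fit for predicting x from y.^[1]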
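The asymmetry can be made concrete by fitting both regressions to the same data. This is an illustrative sketch on simulated data; the variable names and the chosen slope of 0.5 are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(0.0, 1.0, n)
y = 0.5 * x + rng.normal(0.0, 1.0, n)

cov_xy = np.cov(x, y, ddof=1)[0, 1]
slope_y_on_x = cov_xy / np.var(x, ddof=1)   # dy/dx when predicting y from x
slope_x_on_y = cov_xy / np.var(y, ddof=1)   # dx/dy when predicting x from y

# Drawn in the same (x, y) plane, the x-on-y line has slope 1 / slope_x_on_y.
x_on_y_line_in_xy_plane = 1.0 / slope_x_on_y

r = np.corrcoef(x, y)[0, 1]
print(slope_y_on_x, x_on_y_line_in_xy_plane, r ** 2)
```

The product of the two fitted slopes equals the squared correlation, so the two lines coincide only when the points lie exactly on a straight line.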

Slope correction

Regression slope and other regression coefficients can be disattenuated as follows.

The case of a fixed x variable

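The case in which x is fixed, but measured with noise, is known as the functional model or functional relationship.^[2] It can be corrected using total least squares^[3] and errors-in-variables models in general.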
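As a rough illustration of total least squares for a straight line, the sketch below fits an orthogonal regression through the singular value decomposition of the centred data. The function name `tls_slope_intercept` and the simulated design are hypothetical, and this basic form assumes that the errors in x and y have equal variance; it is not presented as the procedure of any cited reference.

```python
import numpy as np

def tls_slope_intercept(x, y):
    """Orthogonal (total least squares) line fit via SVD of the centred data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    centred = np.column_stack([x - x_mean, y - y_mean])
    # The right singular vector with the smallest singular value
    # is normal to the fitted line.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal_x, normal_y = vt[-1]
    slope = -normal_x / normal_y          # undefined for a vertical line
    intercept = y_mean - slope * x_mean
    return slope, intercept

rng = np.random.default_rng(4)
n = 20_000
x_true = np.linspace(0.0, 10.0, n)                      # fixed design points
x_obs = x_true + rng.normal(0.0, 1.0, n)                # x measured with noise
y_obs = 1.0 + 1.5 * x_true + rng.normal(0.0, 1.0, n)    # same error variance in y

ols_slope = np.polyfit(x_obs, y_obs, 1)[0]
tls_slope, tls_intercept = tls_slope_intercept(x_obs, y_obs)
print(f"OLS slope: {ols_slope:.3f}, TLS slope: {tls_slope:.3f}")
```

In this simulation the ordinary least-squares slope should fall below the generating slope of 1.5, while the orthogonal fit should land close to it. When the two error variances differ, the plain orthogonal fit is no longer appropriate and a weighted variant would be needed.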

The case of a randomly distributed x variable

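The case in which the x variable arises randomly is known as the structural model or structural relationship. For example, in a medical study patients are recruited as a sample from a population, and their characteristics such as blood pressure may be viewed as arising from a random sample.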

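Under certain assumptions (typically, normal distribution assumptions) there is a known ratio between the true slope and the expected estimated slope. Frost and Thompson (2000) review several methods for estimating this ratio and hence correcting the estimated slope.^[4] The term regression dilution ratio, although not defined in quite the same way by all authors, is used for this general approach, in which the usual linear regression is fitted and then a correction is applied. The reply to Frost & Thompson by Longford (2001) refers the reader to other methods, which expand the regression model to acknowledge the variability in the x variable, so that no bias arises.^[5] Fuller (1987) is one of the standard references for assessing and correcting for regression dilution.^[6]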
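For simple linear regression under the classical measurement-error assumptions, the ratio can be written out explicitly. The notation here is introduced only for illustration: $w = x + u$ is the observed predictor, $u$ is an error with variance $\sigma_u^2$ that is independent of $x$ and of the error in $y$, $\sigma_x^2$ is the variance of the true predictor, and $\lambda$ is the attenuation factor. In large samples the slope $\hat\beta_w$ from regressing $y$ on $w$ satisfies

$$\hat\beta_{w} \to \lambda\,\beta, \qquad \lambda = \frac{\sigma_x^2}{\sigma_x^2+\sigma_u^2}, \qquad \hat\beta_{\text{corrected}} = \frac{\hat\beta_{w}}{\hat\lambda}.$$

Since $0 < \lambda \le 1$, the uncorrected slope is shrunk towards zero and dividing by an estimate of $\lambda$ undoes the shrinkage. Depending on the author, the name regression dilution ratio may refer to $\lambda$ or to its reciprocal, so the convention should be checked in each source.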

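Hughes (1993) shows that the regression dilution ratio methods apply approximately in survival models.^[7] Rosner (1992) shows that the ratio methods apply approximately to logistic regression models.^[8] Carroll et al. (1995) give more detail on regression dilution in nonlinear models, presenting the regression dilution ratio methods as the simplest case of regression calibration methods, in which additional covariates may also be incorporated.^[9]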

In general, methods for the structural model require some estimate of the variability of the x variable. This will require repeated measurements of the x variable in the same individuals, either in a sub-study of the main data set, or in a separate data set. Without this information it will not be possible to make a correction.
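One way the repeated measurements might be used is sketched below: two replicate readings per individual in a sub-study give an estimate of the error variance, and hence of the ratio of true to observed variance, which is then used to rescale the naive slope. The function `reliability_ratio`, the sub-study layout, and all numeric values are hypothetical; the sketch assumes independent errors of equal variance and is not the specific estimator of any cited paper.

```python
import numpy as np

def reliability_ratio(w1, w2):
    """Estimate var(x) / var(w) from two replicate measurements per individual.

    Assumes w1 = x + u1 and w2 = x + u2 with independent, equal-variance errors.
    """
    w1 = np.asarray(w1, dtype=float)
    w2 = np.asarray(w2, dtype=float)
    error_var = np.var(w1 - w2, ddof=1) / 2.0    # var(u1 - u2) = 2 * var(u)
    total_var = (np.var(w1, ddof=1) + np.var(w2, ddof=1)) / 2.0
    return (total_var - error_var) / total_var

rng = np.random.default_rng(2)
n_main, n_sub = 50_000, 20_000
true_slope = 0.8

# Main data set: a single noisy reading w of the true predictor x.
x = rng.normal(120.0, 10.0, n_main)
w = x + rng.normal(0.0, 8.0, n_main)
y = true_slope * x + rng.normal(0.0, 5.0, n_main)

# Hypothetical reliability sub-study with two readings per person.
x_sub = rng.normal(120.0, 10.0, n_sub)
w1 = x_sub + rng.normal(0.0, 8.0, n_sub)
w2 = x_sub + rng.normal(0.0, 8.0, n_sub)

lam = reliability_ratio(w1, w2)
naive_slope = np.cov(w, y, ddof=1)[0, 1] / np.var(w, ddof=1)
corrected_slope = naive_slope / lam
print(f"ratio: {lam:.3f}, naive: {naive_slope:.3f}, corrected: {corrected_slope:.3f}")
```

With the assumed variances (100 for the true value, 64 for the error) the estimated ratio should be close to 100/164, and the corrected slope should recover roughly the generating value of 0.8.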

Multiple x variables

The case of multiple predictor variables subject to variability (possibly correlated) has been well-studied for linear regression, and for some non-linear regression models.^[6]^[9] Other non-linear models, such as proportional hazards models for survival analysis, have been considered only with a single predictor subject to variability.^[7]

Correlation correction

Charles Spearman developed in 1904 a procedure for correcting correlations for regression dilution,^[10] i.e., to "rid a correlation coefficient from the weakening effect of measurement error".^[11]

In measurement and statistics, the procedure is also called correlation disattenuation or the disattenuation of correlation.^[12] The correction assures that the Pearson correlation coefficient across data units (for example, people) between two sets of variables is estimated in a manner that accounts for error contained within the measurement of those variables.^[13]

Formulation

Let $\beta$ and $\theta$ be the true values of two attributes of some person or statistical unit. These values are variables by virtue of the assumption that they differ for different statistical units in the population. Let ${\hat {\beta }}$ and ${\hat {\theta }}$ be estimates of $\beta$ and $\theta$ derived either directly by observation-with-error or from application of a measurement model, such as the Rasch model. Also, let

$$\hat{\beta} = \beta + \epsilon_{\beta}, \quad\quad \hat{\theta} = \theta + \epsilon_{\theta},$$

where $\epsilon _{\beta }$ and $\epsilon _{\theta }$ are the measurement errors associated with the estimates ${\hat {\beta }}$ and ${\hat {\theta }}$ .

The estimated correlation between two sets of estimates is

$$\begin{aligned}
\operatorname{corr}(\hat{\beta}, \hat{\theta}) &= \frac{\operatorname{cov}(\hat{\beta}, \hat{\theta})}{\sqrt{\operatorname{var}[\hat{\beta}]\operatorname{var}[\hat{\theta}]}} \\
&= \frac{\operatorname{cov}(\beta + \epsilon_{\beta}, \theta + \epsilon_{\theta})}{\sqrt{\operatorname{var}[\beta + \epsilon_{\beta}]\operatorname{var}[\theta + \epsilon_{\theta}]}},
\end{aligned}$$

which, assuming the errors are uncorrelated with each other and with the true attribute values, gives

$$\begin{aligned}
\operatorname{corr}(\hat{\beta}, \hat{\theta}) &= \frac{\operatorname{cov}(\beta, \theta)}{\sqrt{(\operatorname{var}[\beta] + \operatorname{var}[\epsilon_{\beta}])(\operatorname{var}[\theta] + \operatorname{var}[\epsilon_{\theta}])}} \\
&= \frac{\operatorname{cov}(\beta, \theta)}{\sqrt{\operatorname{var}[\beta]\operatorname{var}[\theta]}} \cdot \frac{\sqrt{\operatorname{var}[\beta]\operatorname{var}[\theta]}}{\sqrt{(\operatorname{var}[\beta] + \operatorname{var}[\epsilon_{\beta}])(\operatorname{var}[\theta] + \operatorname{var}[\epsilon_{\theta}])}} \\
&= \rho \sqrt{R_{\beta} R_{\theta}},
\end{aligned}$$

where $R_{\beta }$ is the separation index of the set of estimates of $\beta$ , which is analogous to Cronbach's alpha; that is, in terms of classical test theory, $R_{\beta }$ is analogous to a reliability coefficient. Specifically, the separation index is given as follows:

$$R_{\beta} = \frac{\operatorname{var}[\beta]}{\operatorname{var}[\beta] + \operatorname{var}[\epsilon_{\beta}]} = \frac{\operatorname{var}[\hat{\beta}] - \operatorname{var}[\epsilon_{\beta}]}{\operatorname{var}[\hat{\beta}]},$$

where the mean squared standard error of person estimate gives an estimate of the variance of the errors, $\epsilon _{\beta }$ . The standard errors are normally produced as a by-product of the estimation process (see Rasch model estimation).

The disattenuated estimate of the correlation between the two sets of parameter estimates is therefore

$$\rho = \frac{\operatorname{corr}(\hat{\beta}, \hat{\theta})}{\sqrt{R_{\beta} R_{\theta}}}.$$

That is, the disattenuated correlation estimate is obtained by dividing the correlation between the estimates by the geometric mean of the separation indices of the two sets of estimates. Expressed in terms of classical test theory, the correlation is divided by the geometric mean of the reliability coefficients of two tests.

Given two random variables $X^{\prime }$ and $Y^{\prime }$ measured as $X$ and $Y$ with measured correlation $r_{xy}$ and a known reliability for each variable, $r_{xx}$ and $r_{yy}$ , the estimated correlation between $X^{\prime }$ and $Y^{\prime }$ corrected for attenuation is

$$r_{x'y'} = \frac{r_{xy}}{\sqrt{r_{xx} r_{yy}}}.$$

How well the variables are measured affects the correlation of X and Y. The correction for attenuation tells one what the estimated correlation is expected to be if one could measure X′ and Y′ with perfect reliability.

Thus if $X$ and $Y$ are taken to be imperfect measurements of underlying variables $X'$ and $Y'$ with independent errors, then $r_{x'y'}$ estimates the true correlation between $X'$ and $Y'$ .
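The correction for attenuation is simple enough to state as a one-line function. The sketch below is illustrative: the function name `disattenuate_correlation` and the input values are made up for the example, and the reliabilities are assumed to be known rather than estimated.

```python
import math

def disattenuate_correlation(r_xy, r_xx, r_yy):
    """Divide the observed correlation by the geometric mean of the reliabilities."""
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_yy <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical inputs: observed correlation 0.42, reliabilities 0.70 and 0.80.
corrected = disattenuate_correlation(0.42, 0.70, 0.80)
print(f"{corrected:.3f}")
```

When the reliabilities are themselves sample estimates, the corrected value can exceed 1 in absolute value, which is one reason to treat the result as an estimate rather than an exact quantity.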

Is correction necessary?

In statistical inference based on regression coefficients, yes; in predictive modelling applications, correction is neither necessary nor appropriate. To understand this, consider the measurement error as follows. Let y be the outcome variable, x be the true predictor variable, and w be an approximate observation of x. Frost and Thompson suggest, for example, that x may be the true, long-term blood pressure of a patient, and w may be the blood pressure observed on one particular clinic visit.^[4] Regression dilution arises if we are interested in the relationship between y and x, but estimate the relationship between y and w. Because w is measured with variability, the slope of a regression line of y on w is less than the slope of the regression line of y on x.

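Does this matter? In predictive modelling, no. Standard methods can fit a regression of y on w without bias. There is bias only if we then use the regression of y on w as an approximation to the regression of y on x. In the example, assuming that blood pressure measurements are similarly variable in future patients, our regression line of y on w (observed blood pressure) gives unbiased predictions.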
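The point about prediction can be checked by fitting y on w in one simulated sample and predicting in a second sample whose measurements are equally noisy. This is a sketch under assumed normal distributions; the `simulate` helper and its parameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Draw (w, y) where w is a noisy reading of the true predictor x."""
    x = rng.normal(0.0, 1.0, n)
    w = x + rng.normal(0.0, 1.0, n)           # same error variance in every sample
    y = 2.0 * x + rng.normal(0.0, 0.5, n)     # true slope on x is 2
    return w, y

w_train, y_train = simulate(100_000)
slope, intercept = np.polyfit(w_train, y_train, 1)   # regression of y on w

w_new, y_new = simulate(100_000)
residual = y_new - (intercept + slope * w_new)

mean_residual = residual.mean()
residual_trend = np.corrcoef(w_new, residual)[0, 1]
print(f"slope on w: {slope:.3f}, mean residual: {mean_residual:.4f}")
```

The fitted slope is attenuated relative to the true slope on x, yet the prediction errors in the new sample should average close to zero and show no trend in w, because the new readings carry the same amount of noise as the training readings.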

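An example of a circumstance in which correction is desired is prediction of change. Suppose the change in x is known under some new circumstance: to estimate the likely change in an outcome variable y, the slope of the regression of y on x is needed, not y on w. This arises in epidemiology. To continue the example in which x denotes blood pressure, perhaps a large clinical trial has provided an estimate of the change in blood pressure under a new treatment; then the possible effect on y under the new treatment should be estimated from the slope in the regression of y on x.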

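Another circumstance is predictive modelling in which future observations are also variable, but not (in the phrase used above) "similarly variable": for example, if the current data set includes blood pressure measured with greater precision than is common in clinical practice. One specific example of this arose when developing a regression equation based on a clinical trial, in which blood pressure was the average of six measurements, for use in clinical practice, where blood pressure is usually a single measurement.^[14]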
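To see why the number of measurements matters, note that averaging k independent readings divides the error variance by k. The sketch below compares the resulting attenuation for an average of six readings and for a single reading. The function `attenuation_factor` and the variances are hypothetical, independent equal-variance errors are assumed, and the calculation is not a description of the method used in the cited study.

```python
def attenuation_factor(var_x, var_u, k=1):
    """Ratio var(x) / var(mean of k readings), with error variance var_u per reading."""
    return var_x / (var_x + var_u / k)

var_x, var_u = 100.0, 64.0          # hypothetical true and per-reading error variances

lam_six = attenuation_factor(var_x, var_u, k=6)   # average of six readings
lam_one = attenuation_factor(var_x, var_u, k=1)   # single reading

# Factor by which a slope fitted on six-reading averages would shrink
# if it were re-expressed for single readings.
rescale = lam_one / lam_six
print(f"{lam_six:.3f} {lam_one:.3f} {rescale:.3f}")
```

Under these assumed variances, a slope estimated from six-reading averages is noticeably steeper than the slope appropriate for single readings, so applying it unchanged to single readings would overstate the effect of a given difference in observed blood pressure.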

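Caveats

All of these results can be shown mathematically in the case of simple linear regression, assuming normal distributions throughout (the framework of Frost & Thompson).

It has been argued that a poorly executed correction for regression dilution, in particular one performed without checking the underlying assumptions, may do more damage to an estimate than no correction at all.^[15]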

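Further reading

Regression dilution was first mentioned, under the name attenuation, by Spearman (1904).^[16] Those seeking a readable mathematical treatment might like to start with Frost and Thompson (2000),^[4] or see correction for attenuation.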

See also

Correction for attenuation
Quantization (signal processing) – a common source of error in the explanatory or independent variables

References

[1] Draper, N. R.; Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley. p. 19. ISBN 0-471-17082-8.
[2] Riggs, D. S.; Guarnieri, J. A.; et al. (1978). "Fitting straight lines when both variables are subject to error". Life Sciences. 22 (13–15): 1305–60. doi:10.1016/0024-3205(78)90098-x. PMID 661506.
[3] Golub, Gene H.; van Loan, Charles F. (1980). "An Analysis of the Total Least Squares Problem". SIAM Journal on Numerical Analysis. Society for Industrial & Applied Mathematics (SIAM). 17 (6): 883–893. doi:10.1137/0717073. hdl:1813/6251. ISSN 0036-1429.
[4] Frost, C.; Thompson, S. (2000). "Correcting for regression dilution bias: comparison of methods for a single predictor variable". Journal of the Royal Statistical Society, Series A. 163: 173–190.
[5] Longford, N. T. (2001). "Correspondence". Journal of the Royal Statistical Society, Series A. 164 (3): 565. doi:10.1111/1467-985x.00219.
[6] Fuller, W. A. (1987). Measurement Error Models. New York: Wiley. ISBN 9780470317334.
[7] Hughes, M. D. (1993). "Regression dilution in the proportional hazards model". Biometrics. 49 (4): 1056–1066. doi:10.2307/2532247. JSTOR 2532247. PMID 8117900.
[8] Rosner, B.; Spiegelman, D.; et al. (1992). "Correction of Logistic Regression Relative Risk Estimates and Confidence Intervals for Random Within-Person Measurement Error". American Journal of Epidemiology. 136 (11): 1400–1403. doi:10.1093/oxfordjournals.aje.a116453. PMID 1488967.
[9] Carroll, R. J.; Ruppert, D.; Stefanski, L. A. (1995). Measurement Error in Nonlinear Models. New York: Wiley.
[10] Spearman, C. (1904). "The Proof and Measurement of Association between Two Things". The American Journal of Psychology. University of Illinois Press. 15 (1): 72–101. ISSN 0002-9556. JSTOR 1412159. Retrieved 2021-07-10.
[11] Jensen, A. R. (1998). The g Factor: The Science of Mental Ability. Human evolution, behavior, and intelligence. Praeger. ISBN 978-0-275-96103-9.
[12] Osborne, Jason W. (2003-05-27). "Effect Sizes and the Disattenuation of Correlation and Regression Coefficients: Lessons from Educational Psychology". ScholarWorks@UMass Amherst. Retrieved 2021-07-10.
[13] Franks, Alexander; Airoldi, Edoardo; Slavov, Nikolai (2017-05-08). "Post-transcriptional regulation across human tissues". PLOS Computational Biology. 13 (5): e1005535. doi:10.1371/journal.pcbi.1005535. ISSN 1553-7358. PMC 5440056. PMID 28481885.
[14] Stevens, R. J.; Kothari, V.; Adler, A. I.; Stratton, I. M.; Holman, R. R. (2001). "Appendix to 'The UKPDS Risk Engine: a model for the risk of coronary heart disease in type 2 diabetes (UKPDS 56)'". Clinical Science. 101: 671–679. doi:10.1042/cs20000335.
[15] Davey Smith, G.; Phillips, A. N. (1996). "Inflation in epidemiology: 'The proof and measurement of association between two things' revisited". British Medical Journal. 312 (7047): 1659–1661. doi:10.1136/bmj.312.7047.1659. PMC 2351357. PMID 8664725.
[16] Spearman, C. (1904). "The proof and measurement of association between two things". American Journal of Psychology. 15 (1): 72–101. doi:10.2307/1412159. JSTOR 1412159.
