Uncertainty coefficient

In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil[citation needed] and is based on the concept of information entropy.

Definition

Suppose we have samples of two discrete random variables, X and Y. From these samples we can construct the joint distribution, $P_{X,Y}(x, y)$, and from it the conditional distributions, $P_{X|Y}(x|y) = P_{X,Y}(x, y)/P_Y(y)$ and $P_{Y|X}(y|x) = P_{X,Y}(x, y)/P_X(x)$. Calculating the various entropies then lets us determine the degree of association between the two variables.

The entropy of a single distribution is given as:[1]

$$H(X) = -\sum_{x} P_X(x) \log P_X(x)$$

while the conditional entropy is given as:[1]

$$H(X|Y) = -\sum_{x,y} P_{X,Y}(x, y) \log P_{X|Y}(x|y)$$

The uncertainty coefficient[2] or proficiency[3] is defined as:

$$U(X|Y) = \frac{H(X) - H(X|Y)}{H(X)} = \frac{I(X;Y)}{H(X)}$$

and tells us: given Y, what fraction of the bits of X can we predict? In this case we can think of X as containing the total information, and of Y as allowing one to predict part of such information.
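As a minimal sketch of how this quantity could be estimated from paired samples, the following uses empirical frequencies in place of the true distributions. The function names `entropy` and `uncertainty_coefficient` are illustrative and not taken from any cited source, and the conditional entropy is obtained through the chain rule H(X|Y) = H(X,Y) - H(Y), which is equivalent to the definition of conditional entropy.

```python
from collections import Counter
from math import log2


def entropy(counts, total):
    """Shannon entropy in bits of a distribution given as raw counts."""
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def uncertainty_coefficient(xs, ys):
    """Estimate U(X|Y) = (H(X) - H(X|Y)) / H(X) from paired samples."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    h_x = entropy(Counter(xs).values(), n)
    h_y = entropy(Counter(ys).values(), n)
    h_xy = entropy(Counter(zip(xs, ys)).values(), n)
    if h_x == 0:
        # X is constant, so there is nothing to predict; U is undefined.
        return float("nan")
    h_x_given_y = h_xy - h_y  # chain rule: H(X|Y) = H(X,Y) - H(Y)
    return (h_x - h_x_given_y) / h_x


# Y determines X completely, so all bits of X are predictable.
print(uncertainty_coefficient(["a", "a", "b", "b"], [0, 0, 1, 1]))
# X and Y are independent in this sample, so Y predicts nothing.
print(uncertainty_coefficient(["a", "a", "b", "b"], [0, 1, 0, 1]))
```

Returning NaN when H(X) is zero is a design choice of this sketch; the ratio is simply undefined when X carries no information.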

The above expression makes clear that the uncertainty coefficient is a normalised mutual information I(X;Y). In particular, the uncertainty coefficient lies in the range [0, 1], since I(X;Y) ≤ H(X) and both I(X;Y) and H(X) are non-negative.

Note that the value of U (but not H!) is independent of the base of the logarithm, since changing the base multiplies every entropy by the same constant factor, which cancels in the ratio.
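The base independence can be checked numerically. The sketch below is illustrative only: the sample data are made up, and the helper names and the `base` parameter are assumptions of this example. It evaluates the entropy and the coefficient in base 2, base e and base 10.

```python
from collections import Counter
from math import e, log


def entropy(samples, base):
    """Shannon entropy of the empirical distribution, with logs in `base`."""
    n = len(samples)
    return -sum((c / n) * log(c / n, base) for c in Counter(samples).values())


def uncertainty_coefficient(xs, ys, base=2.0):
    """U(X|Y) computed with logarithms in the given base."""
    h_x = entropy(xs, base)
    h_x_given_y = entropy(list(zip(xs, ys)), base) - entropy(ys, base)
    return (h_x - h_x_given_y) / h_x


# Made-up paired samples, used only to illustrate the point.
xs = ["a", "a", "a", "b", "b", "c"]
ys = [0, 0, 1, 1, 1, 1]

for base in (2.0, e, 10.0):
    # H(X) changes with the base; U(X|Y) agrees up to rounding error.
    print(base, entropy(xs, base), uncertainty_coefficient(xs, ys, base))
```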

The uncertainty coefficient is useful for measuring the validity of a statistical classification algorithm and has the advantage over simpler accuracy measures such as precision and recall in that it is not affected by the relative fractions of the different classes, i.e., P(x).[4] It also has the unique property that it does not penalize an algorithm for predicting the wrong classes, so long as it does so consistently (i.e., it simply rearranges the classes). This is useful in evaluating clustering algorithms, since cluster labels typically have no particular ordering.[3]
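The relabelling property can be illustrated with a small, made-up example. In the sketch below the label lists are hypothetical and the helper names are assumptions of this example. A prediction that consistently swaps two classes scores poorly on plain accuracy, yet it carries all the information about the true classes.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def uncertainty_coefficient(xs, ys):
    """U(X|Y): fraction of the entropy of xs that is explained by ys."""
    h_x = entropy(xs)
    return (h_x + entropy(ys) - entropy(list(zip(xs, ys)))) / h_x


true_labels = ["cat", "cat", "dog", "dog", "bird", "bird"]
# Hypothetical classifier that swaps "cat" and "dog" every single time.
swapped = ["dog", "dog", "cat", "cat", "bird", "bird"]
# Hypothetical clustering output: correct groups, arbitrary cluster ids.
cluster_ids = [2, 2, 0, 0, 1, 1]

accuracy = sum(t == p for t, p in zip(true_labels, swapped)) / len(true_labels)
print("accuracy:", accuracy)
print("U, swapped labels:", uncertainty_coefficient(true_labels, swapped))
print("U, cluster ids:", uncertainty_coefficient(true_labels, cluster_ids))
```

Because the coefficient depends only on the joint frequencies of (true, predicted) pairs, any one-to-one renaming of the predicted labels leaves it unchanged.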

Variations

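The uncertainty coefficient is not symmetric with respect to the roles of X and Y. The roles can be reversed, and a symmetric measure can therefore be defined as a weighted average of the two.[2]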
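One common choice of weights, assumed here because the text does not state them, is the pair of entropies H(X) and H(Y). Under that assumption the weighted average simplifies to $U(X, Y) = 2\,[H(X) + H(Y) - H(X,Y)] / [H(X) + H(Y)]$. The following sketch, with illustrative function names and made-up data, shows the asymmetry of the two directed coefficients and computes the symmetric value.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def uncertainty_coefficient(xs, ys):
    """Directed coefficient U(X|Y) = I(X;Y) / H(X)."""
    h_x = entropy(xs)
    return (h_x + entropy(ys) - entropy(list(zip(xs, ys)))) / h_x


def symmetric_uncertainty(xs, ys):
    """Entropy-weighted average of U(X|Y) and U(Y|X) (assumed weighting)."""
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))
    if h_x + h_y == 0:
        return float("nan")  # both variables constant: undefined
    return 2.0 * (h_x + h_y - h_xy) / (h_x + h_y)


# Made-up data: X has four values, Y only records which half X falls in.
xs = ["a", "b", "c", "d"]
ys = [0, 0, 1, 1]

print(uncertainty_coefficient(xs, ys))  # how much of X is predicted by Y
print(uncertainty_coefficient(ys, xs))  # how much of Y is predicted by X
print(symmetric_uncertainty(xs, ys))
```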

Although normally applied to discrete variables, the uncertainty coefficient can be extended to continuous variables[1] using density estimation.[citation needed]
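The text does not say which density estimator is meant. As one rough possibility, an equal-width histogram can serve as a crude density estimate: the continuous samples are binned and the discrete coefficient is applied to the bin indices. Everything in the sketch below (the function names, the binning scheme and the default of 10 bins) is an assumption made for illustration, and the resulting value depends strongly on the number of bins and the sample size.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def bin_indices(values, n_bins):
    """Map each value to the index of an equal-width bin over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def binned_uncertainty(xs, ys, n_bins=10):
    """Histogram-based estimate of U(X|Y) for continuous samples."""
    bx, by = bin_indices(xs, n_bins), bin_indices(ys, n_bins)
    h_x = entropy(bx)
    if h_x == 0:
        return float("nan")
    return (h_x + entropy(by) - entropy(list(zip(bx, by)))) / h_x


# Made-up deterministic data: y is a smooth function of x.
xs = [i / 100 for i in range(100)]
ys = [x * x for x in xs]
print(binned_uncertainty(xs, ys, n_bins=5))
```

A kernel density estimate would be another option; the histogram is used here only because it needs no extra libraries.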

See also

References

  1. Claude E. Shannon; Warren Weaver (1963). The Mathematical Theory of Communication. University of Illinois Press.
  2. William H. Press; Brian P. Flannery; Saul A. Teukolsky; William T. Vetterling (1992). "14.7.4". Numerical Recipes: the Art of Scientific Computing (3rd ed.). Cambridge University Press. p. 761.
  3. White, Jim; Steingold, Sam; Fournelle, Connie. "Performance Metrics for Group-Detection Algorithms" (PDF). Interface 2004.
  4. Mills, Peter (2011). "Efficient statistical classification of satellite measurements" (PDF). International Journal of Remote Sensing. 32 (21): 6109–6132. arXiv:1202.2194. doi:10.1080/01431161.2010.507795. Archived from the original (PDF) on 2012-04-26.

External links

  • libagf includes software for calculating the uncertainty coefficient.