하이퍼로그 로그

HyperLog는 카운트-간명성 문제에 대한 알고리즘으로, 멀티셋의 고유 요소 수에 근접한 값이다.^[1]멀티셋의 정확한 카디널리티를 계산하려면 카디널리티에 비례하는 메모리 양이 필요하며, 이는 매우 큰 데이터 세트에서는 비현실적이다.HyperLog 알고리즘과 같은 확률론적 카디널리티 추정기는 카디널리티의 근사치만 얻는 비용으로 이보다 훨씬 적은 메모리를 사용한다.HyperLog 알고리즘은 1.5kB의 메모리를 사용하여 2%의 일반적인 정확도(표준 오차)로 10의⁹ 기본값을 추정할 수 있다.^[1]HyperLog는 1984년 Flajolet-Martin 알고리즘에서 파생된 이전 LogLog 알고리즘의 확장이다.[2]^[3]

용어.

Flajolet 등이 작성한 원본 논문과 계수-간결함 문제에 관한 관련 문헌에서 "카르디날리티"라는 용어는 반복적인 요소가 있는 데이터 스트림에서 구별되는 요소의 수를 의미하기 위해 사용된다.^[1]그러나 멀티셋 이론에서 용어는 멀티셋의 각 멤버의 승수의 합을 가리킨다.이 기사는 출처와의 일관성을 위해 Flajolet의 정의를 사용한다.

알고리즘.

HyperLog 알고리즘의 기본은 균일하게 분포된 랜덤 숫자의 다중 집합의 카디널리티를 집합에 있는 각 숫자의 이진 표현에서 선행 0의 최대 수를 계산하여 추정할 수 있다는 관측이다.관측된 최대 선행 0 수가 n인 경우 집합의 고유 원소 수에 대한 추정치는 2이다ⁿ.^[1]

HyperLog 알고리즘에서는 원래 다중값의 각 요소에 해시 함수를 적용하여 원래 다중값과 동일한 카디널리티로 균일하게 분포된 랜덤 숫자의 다중값을 얻는다.이 랜덤하게 분포된 세트의 카디널리티는 위의 알고리즘을 사용하여 추정할 수 있다.

위의 알고리즘을 사용하여 얻은 카디널리티의 단순 추정치는 큰 분산이라는 단점이 있다.HyperLog 알고리즘에서는 멀티셋을 여러 하위 집합으로 분할하고, 이러한 하위 집합의 각 하위 집합의 숫자에 있는 선행 0의 최대 수를 계산하고, 조화 평균을 사용하여 이러한 하위 집합의 카디널리티 추정치를 결합함으로써 분산을 최소화한다.^[4]

운영

HyperLog는 세트에 새 요소를 추가하기 위해 추가, 세트의 카디널리티를 얻기 위해 카운트, 그리고 두 세트의 결합을 얻기 위해 병합하는 세 가지 주요 작업을 가지고 있다.일부 파생 연산은 교차로의 카디널리티 또는 병합 및 카운트 연산을 결합한 두 HyperLogs 간의 차이의 카디널리티와 같은 포함-제외 원리를 사용하여 계산할 수 있다.

HyperLog의 데이터는 초기 상태에서 0으로 설정된 $크기$ m 레지스터라고 하는 카운터의 $어레이$ M에 저장된다.

추가하다

추가 작업은 해시함수 $h$ 로 입력 데이터 $v$ 의 해시를 계산하여 첫 번째 b $비트$ ( ${\textstyle \log _{2}(m)}$ 서 $b$ 는 ${\textstyle \log _{2}(m)}$ ${\textstyle \log _{2}(m)}$ ${\textstyle \log _{2}(m)}$ ( ${\textstyle \log _{2}(m)}$ m ${\textstyle \log _{2}(m)}$ ) ${\textstyle \log_{2}(m)})$ 를 얻은 후 1을 추가하여 수정할 레지스터의 주소를 얻는 것으로 구성된다.나머지 비트로 ${\textstyle \rho (w)}$ ( w ) {\ $textstyle \rho (w)}$ 을(를 ${\textstyle \rho (w)}$ 계산하여 가장 ${\textstyle \rho (w)}$ 왼쪽 1의 위치를 반환하십시오.레지스터의 새 값은 레지스터의 현재 값과 ${\textstyle \rho (w)}$ ${\textstyle \rho (w)}$ ${\textstyle \rho(w)}$ 사이의 최대값이 될 것이다 ${\textstyle \rho (w)}$

{\displaysty}x&:=h(v)\j&:=1+\langle x_{1}x_{2}...x_{b}\rangele _{2}\\w&:=x_{b+1}x_{b+2}...\\M[j]&:=\max(M[j],\rho(w)\\end{aigned}

카운트

카운트 알고리즘은 $m$ 레지스터의 고조파 평균을 계산하고 상수를 사용하여 카운트의 추정 ${\textstyle E}$ ${\textstyle E}$ {\ $textstyle E$ }을(를) 도출하는 방식으로 구성된다.

Z={\bigg (}\sum _{j=1}^{m}{2^{-M[j]}{\Bigg )}^{-1

\m}=\left(m\int _{0}^{\inflt }\left(\log _{2}\leftwarden\fract\{2+u}{1+u}\right)\right)\right)^{m}\,du\오른쪽)^{-1

E=\alpha _{m}m^{2}Z

직관은 $n$ 이 $M$ 의 알 수 없는 카디널리티가 되는 것이며, 각 ${\textstyle M_{j}}$ ${\textstyle M_{j}}$ M ${\textstyle M_{j}}$ j {\ $textstyle M_$ {j $}$ 은 ${\textstyle M_{j}}$ ${\textstyle n/m}$ / ${\textstyle n/m}$ ${\textstyle n/m}$ 요소를 ${\textstyle n/m}$ 갖는다.그런 다음 ${\textstyle \max _{x\in M_{j}}\rho (x)}$ ${\textstyle \max _{x\in M_{j}}\rho (x)}$ ${\textstyle \max _{x\in M_{j}}\rho (x)}$ M ${\textstyle \max _{x\in M_{j}}\rho (x)}$ ${\textstyle \max _{x\in M_{j}}\rho (x)}$ ( ${\textstyle \max _{x\in M_{j}}\rho (x)}$ x ) ${\textstyle \max _{x\in M_{j}\rho$ (x $)}$ 은(는) $로그$ ${\textstyle \log _{2}(n/m)}$ ${\textstyle \log _{2}(n/m)}$ (n ${\textstyle \log _{2}(n/m)}$ / ${\textstyle \log _{2}(n/m)}$ ) ${\textstyle \log _{2}(n/m)}$ 에 가까워야 한다 ${\textstyle \max _{x\in M_{j}}\rho (x)}$ ${\textstyle \log _{2}(n/m)}$ 이 수량에 대한 2의 조화 평균은 ${\textstyle mZ}$ ${\textstyle n/m}$ ${\textstyle mZ}$ ${\$ $n/m}$ 에 가까워야 하는 m ${\textstyle mZ}$ ${\textstyle m^{2}Z}$ {\ ${\textstyle m^{2}Z}$ n/m}이므로 m 2 Z {\textstyle m $^{2}Z}$ 은(는 $)$ 대략 $n$ 이어야 한다 ${\textstyle m^{2}Z}$ ${\textstyle n/m}$

마지막으로 해시 충돌로 인한 ${\textstyle m^{2}Z}$ ${\textstyle m^{2}Z}$ 2 ${\textstyle m^{2}Z}$ ${\$ $textstyle \alpha_{m$ $}}$ 에 존재하는 체계적 승법편향을 교정하기 위해 상수 ${\textstyle \alpha _{m}}$ ${\$ $textstyle$ m $^{2}Z$ 를 도입한다 ${\textstyle \alpha _{m}}$ .

현실적 고려

상수 ${\textstyle \alpha _{m}}$ ${\$ 는 ${\textstyle \alpha _{m}}$ 계산이 간단하지 않으며, 공식으로^[1] 근사치를 구할 수 있다.

\case \{m}\case}m=16&0.673\\\m=32&0.697\\m=64&0.709\\m\geq 128&{0.7213}{1+{1\frac{1.079}}}}}}}}}{m}}}\case{case}}}}}}}}}}}}}}}\case}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}

그러나 HyperLog 기술은 임계값 ${\textstyle {\frac {5}{2}}m}$ ${\textstyle {\frac {5}{2}}m}$ ${\textstyle {\frac {5}{2}}m$ 미만의 작은 추기경에게 치우쳐 있다 ${\textstyle {\frac {5}{2}}m}$ 본 논문은 Linear Counting이라고 알려진 작은 추기경들에 대해 다른 알고리즘을 사용할 것을 제안한다.^[5]위에서 제공한 추정치가 임계값 ${\textstyle E<{\frac {5}{2}}m}$ < ${\textstyle E<{\frac {5}{2}}m}$ ${\textstyle E<{\frac {5}{2}}m}$ ${\textstyle E<{\frac {5}{2}}m}$ 보다 작은 경우 ${\textstyle E<{\frac {5}{2}}m}$ 대체 계산을 사용할 수 있다.

${\textstyle V}$ ${\textstyle V}$ 을(를) 0과 같은 레지스터의 개수로 설정하십시오 ${\textstyle V}$ .
${\textstyle V=0}$ = ${\textstyle V=0}$ ${\textstyle V=0$ 인 경우 위의 ${\textstyle E}$ 표준 HyperLog ${\textstyle E}$ E ${\textstyle E}$ 을(를) 사용하십시오.
그렇지 않으면 선형 카운팅을 ${\textstyle E^{\star }=m\log \left({\frac {m}{V}}\right)}$ 하십시오. E ${\textstyle E^{\star }=m\log \left({\frac {m}{V}}\right)}$ = ${\textstyle E^{\star }=m\log \left({\frac {m}{V}}\right)}$ log ${\textstyle E^{\star }=m\log \left({\frac {m}{V}}\right)}$ ( ${\textstyle E^{\star }=m\log \left({\frac {m}{V}}\right)}$ V ) ${\textstyle E^{\star }=m\log \left({\frac {m}{V}}\right)}$ {\ $textstyle E^{\star }=m\log \left({\frac {m}{V}\right)}$

또한, 레지스터의 크기 한계에 근접하는 매우 큰 추기경(32비트 레지스터의 경우 ${\textstyle E>{\frac {2^{32}}{30}}}$ ${\textstyle E>{\frac {2^{32}}{30}}}$ > 2 ${\textstyle E>{\frac {2^{32}}{30}}}$ ${\textstyle E>{\frac {2^{32}}{30}}}$ ${\$ 에 대해서는 다음과 같이 카디널리티를 추정할 수 있다.

E^{{32}}=-2^{32}\log \left(1-{\frac {E}{2^{32}}}\오른쪽)

하한과 상한에 대해 위의 보정을 통해 오차는 ${\textstyle \sigma =1.04/{\sqrt {m}}}$ = ${\textstyle \sigma =1.04/{\sqrt {m}}}$ / m ${\$ 로 추정할 수 있다 ${\textstyle \sigma =1.04/{\sqrt {m}}}$

병합

두 HLLs ${\textstyle {\mathit {hll}}_{1},{\mathit {hll}}_{2}}$ ${\textstyle {\mathit {hll}}_{1},{\mathit {hll}}_{2}}$ ${\textstyle {\mathit {hll}}_{1},{\mathit {hll}}_{2}}$ , ${\textstyle {\mathit {hll}}_{1},{\mathit {hll}}_{2}}$ ${\textstyle {\mathit {hll}}_{1},{\mathit {hll}}_{2}}$ ${\textstyle {\mathit {hll}}_{1},{\mathit {hll}}_{2}}$ l l ${\textstyle {\mathit {hll}}_{1},{\mathit {hll}}_{2}}$ ${\$ }})에 대한 병합 작업은 레지스터 ${\textstyle j:1..m}$ : ${\textstyle j:1..m}$ ${\textstyle j:1$ 의 각 쌍에 대한 최대값을 얻는 데 있다 $.$ $m}$

{\mathit{hll}_{\text{hll}}[j]=\maxit\mathit {hll}}{\mathit {hll}_{2}[j]}}}}

복잡성

그 복잡함를 분석하여, 데이터(ϵ, δ){\displaystyle(,\delta\epsilon)}model[6]스트리밍, 그 공간은 고정 성공 확률 1−δ{1-\delta\displaystyle}1±ϵ{\displaystyle 1\pm \epsilon}근사를 얻는 데 필요한 분석한다. HLL의 상대적 오류는 1.04/m{\dis 사용된다.pla $ystyle 1.04/{\sqrt{m}}}$ 이 $1.04/{\sqrt {m}}$ ( $O(\epsilon ^{-2}\log \log n+\log n)$ ) 필요하며, O $O(\epsilon ^{-2}\log \log n+\log n)$ $O(\epsilon ^{-2}\log \log n+\log n)$ - $O(\epsilon ^{-2}\log \log n+\log n)$ $O(\epsilon ^{-2}\log \log n+\log n)$ $O(\epsilon ^{-2}\log \log n+\log n)$ $O(\epsilon ^{-2}\log \log n+\log n)$ $O(\epsilon ^{-2}\log \log n+\log n)$ + $O(\epsilon ^{-2}\log \log n+\log n)$ $O(\epsilon ^{-2}\log \log n+\log n)$ ) ${\displaystyle$ O $(\epsilon ^{-2}\log \log n)$ 는 $O(\epsilon ^{-2}\log \log n+\log n)$ $설정$ 된 카디널리티이고 $m$ 은 레지스터 수(일반적으로 1바이트 크기 미만)이다.

추가 연산은 해시함수의 출력 크기에 따라 달라진다.이 크기가 고정되어 있으므로 추가 작업의 실행 시간을 $O(1)$ ( $O(1)$ $){\displaystyle$ O $(1)}$ 로 간주할 수 있다 $O(1)$

카운트 및 병합 연산은 레지스터 $m$ 의 수에 따라 달라지며 $O(m)$ 원가는 O $O(m)$ ( $O(m)$ ) ${\디스플레이 스타일$ O $(m)}$ 이다 $O(m)$ 일부 구현(Redis)^[7]에서는 레지스터 수가 고정되어 있고, 설명서에서는 $O(1)$ 레지스터 수가 $O(1)$ ( $O(1)$ ) ${\디스플레이 스타일$ O(1)로 $O(1)$ 간주된다.

HLL++

HyperLog++ 알고리즘은 메모리 요구사항을 줄이고 일부 범위의 추기경에서 정확도를 높이기 위해 HyperLog 알고리즘의 몇 가지 개선을 제안한다.^[6]

원래 종이에 사용된 32비트 대신 64비트 해시함수가 사용된다.이렇게 하면 큰 범위의 보정을 제거할 수 있는 큰 추기경의 해시 충돌이 감소한다.
선형 계수에서 HLL 계수로 전환할 때 작은 기질에 대해 일부 치우침이 발견된다.문제를 완화하기 위해 경험적 편향 보정을 제안한다.
작은 추기경의 기억 요건을 줄이기 위해 레지스터의 희박한 표현이 제안되며, 카디널리티가 커지면 나중에 밀집된 표현으로 변환될 수 있다.

스트리밍 HLL

데이터가^[8]^[9] 단일 스트림에 도착하면 HLL 스케치의 정확도를 크게 향상시키고 36% 적은 메모리를 사용하여 주어진 오류 수준을 달성한다.이 추정기는 단일 스트림에서 중복된 무감각 근사치 고유 계수 스케치에 대해 최적으로 적합하다.

단일 스트림 시나리오는 또한 HLL 스케치 구조에 변형을 초래한다.HLL-TailCut+는 원래 HLL 스케치보다 메모리 사용량이 45% 적지만 데이터 삽입 순서에 의존하고 스케치를 병합할 수 없는 비용이다.^[10]

추가 읽기

"Probabilistic Data Structures for Web Analytics and Data Mining". highlyscalable.wordpress.com. May 2012. Retrieved 2014-04-19.
"New cardinality estimation algorithms for HyperLogLog sketches" (PDF). Retrieved 2016-10-29.

참조

^ ^a ^b ^c ^d ^e Flajolet, Philippe; Fusy, Éric; Gandouet, Olivier; Meunier, Frédéric (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" (PDF). Discrete Mathematics and Theoretical Computer Science Proceedings. Nancy, France. AH: 127–146. CiteSeerX 10.1.1.76.4286. Retrieved 2016-12-11.
^ Durand, M.; Flajolet, P. (2003). "LogLog counting of large cardinalities." (PDF). In G. Di Battista and U. Zwick (ed.). Lecture Notes in Computer Science. Annual European Symposium on Algorithms (ESA03). Vol. 2832. Springer. pp. 605–617.
^ Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). Journal of Computer and System Sciences. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8.
^ S Heule, M Nunkesser, A Hall (2013). "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" (PDF). sec 4.{{cite web}}: CS1 maint: 작성자 매개변수 사용(링크)
^ Whang, Kyu-Young; Vander-Zanden, Brad T; Taylor, Howard M (1990). "A linear-time probabilistic counting algorithm for database applications". ACM Transactions on Database Systems. 15 (2): 208–229. doi:10.1145/78922.78925.
^ ^a ^b "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm". Retrieved 2014-04-19.
^ "PFCOUNT – Redis".
^ Cohen, E. (March 2015). "All-distances sketches, revisited: HIP estimators for massive graphs analysis". IEEE Transactions on Knowledge and Data Engineering. 27 (9): 2320–2334. arXiv:1306.3284. doi:10.1109/TKDE.2015.2411606.
^ Ting, D. (August 2014). "Streamed approximate counting of distinct elements: beating optimal batch methods". Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14): 442–451. doi:10.1145/2623330.2623669. ISBN 9781450329569.
^ Xiao, Q.; Zhou, Y.; Chen, S. (May 2017). "Better with fewer bits: Improving the performance of cardinality estimation of large data streams". IEEE INFOCOM 2017 - IEEE Conference on Computer Communications: 1–9. doi:10.1109/INFOCOM.2017.8057088. ISBN 978-1-5090-5336-0.

[flajolet07-1] Flajolet, Philippe; Fusy, Éric; Gandouet, Olivier; Meunier, Frédéric (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" (PDF). Discrete Mathematics and Theoretical Computer Science Proceedings. Nancy, France. AH: 127–146. CiteSeerX 10.1.1.76.4286. Retrieved 2016-12-11.

[2] Durand, M.; Flajolet, P. (2003). "LogLog counting of large cardinalities." (PDF). In G. Di Battista and U. Zwick (ed.). Lecture Notes in Computer Science. Annual European Symposium on Algorithms (ESA03). Vol. 2832. Springer. pp. 605–617.

[3] Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). Journal of Computer and System Sciences. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8.

[4] S Heule, M Nunkesser, A Hall (2013). "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" (PDF). sec 4.{{cite web}}: CS1 maint: 작성자 매개변수 사용(링크)

[5] Whang, Kyu-Young; Vander-Zanden, Brad T; Taylor, Howard M (1990). "A linear-time probabilistic counting algorithm for database applications". ACM Transactions on Database Systems. 15 (2): 208–229. doi:10.1145/78922.78925.

[Heule13-6] "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm". Retrieved 2014-04-19.

[7] "PFCOUNT – Redis".

[8] Cohen, E. (March 2015). "All-distances sketches, revisited: HIP estimators for massive graphs analysis". IEEE Transactions on Knowledge and Data Engineering. 27 (9): 2320–2334. arXiv:1306.3284. doi:10.1109/TKDE.2015.2411606.

[9] Ting, D. (August 2014). "Streamed approximate counting of distinct elements: beating optimal batch methods". Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14): 442–451. doi:10.1145/2623330.2623669. ISBN 9781450329569.

[10] Xiao, Q.; Zhou, Y.; Chen, S. (May 2017). "Better with fewer bits: Improving the performance of cardinality estimation of large data streams". IEEE INFOCOM 2017 - IEEE Conference on Computer Communications: 1–9. doi:10.1109/INFOCOM.2017.8057088. ISBN 978-1-5090-5336-0.

[1]

[3]

[4]

[5]

[7]

[6]

[8]

[9]

[10]

Search