t-distributed stochastic neighbor embedding

t-SNE visualization of word embeddings generated using 19th-century literature

t-SNE embeddings of the MNIST dataset

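t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each data point a location in a two- or three-dimensional map. It is based on stochastic neighbor embedding, originally developed by Sam Roweis and Geoffrey Hinton,^[1] for which Laurens van der Maaten proposed the t-distributed variant.^[2] It is a nonlinear dimensionality reduction technique well suited to embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.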

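The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar objects are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the basis of its similarity metric, this can be changed as appropriate.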

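t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research,^[3] natural language processing, music analysis,^[4] cancer research,^[5] bioinformatics,^[6] geological domain interpretation,^[7]^[8]^[9] and biomedical signal processing.^[10]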

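While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization, so a good understanding of the parameters of t-SNE is necessary. Such "clusters" can appear even in non-clustered data,^[11] and thus may be false findings. Interactive exploration may therefore be needed to choose parameters and validate results.^[12]^[13] It has been demonstrated that t-SNE is often able to recover well-separated clusters and that, with special parameter choices, it approximates a simple form of spectral clustering.^[14]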

Details

Given a set of $N$ high-dimensional objects $\mathbf {x} _{1},\dots ,\mathbf {x} _{N}$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of objects $\mathbf {x} _{i}$ and $\mathbf {x} _{j}$, as follows.

For $i\neq j$, define

$$p_{j\mid i}={\frac {\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{j}\rVert ^{2}/2\sigma _{i}^{2})}{\sum _{k\neq i}\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{k}\rVert ^{2}/2\sigma _{i}^{2})}}$$

and set $p_{i\mid i}=0$. Note that $\sum _{j}p_{j\mid i}=1$ for all $i$.
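The conditional probabilities defined above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it uses only the standard library, the bandwidths $\sigma _{i}$ are passed in as given numbers, and the function names and the three toy points are hypothetical.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def conditional_probabilities(points, sigmas):
    """Return P with P[i][j] = p_{j|i} for the given per-point bandwidths."""
    n = len(points)
    P = []
    for i in range(n):
        # Unnormalized Gaussian affinities from point i; p_{i|i} is fixed at 0.
        weights = [
            0.0 if j == i
            else math.exp(-squared_distance(points[i], points[j])
                          / (2.0 * sigmas[i] ** 2))
            for j in range(n)
        ]
        total = sum(weights)  # the sum over k != i in the denominator
        P.append([w / total for w in weights])
    return P

# Hypothetical toy data: two nearby points and one distant point.
points = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
P = conditional_probabilities(points, sigmas=[1.0, 1.0, 1.0])
```

Each row of `P` sums to 1, and point 0 assigns almost all of its probability mass to its close neighbor, point 1, rather than to the distant point 2.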

As Van der Maaten and Hinton explained: "The similarity of datapoint $x_{j}$ to datapoint $x_{i}$ is the conditional probability, $p_{j|i}$, that $x_{i}$ would pick $x_{j}$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_{i}$."^[2]

Now define

$$p_{ij}={\frac {p_{j\mid i}+p_{i\mid j}}{2N}}$$

and note that $p_{ij}=p_{ji}$, $p_{ii}=0$, and $\sum _{i,j}p_{ij}=1$.
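The normalization follows directly from the definition: each of the $N$ conditional distributions sums to 1, so $\sum _{i,j}(p_{j\mid i}+p_{i\mid j})=2N$, and dividing by $2N$ gives a total of 1. The sketch below checks this on a small conditional matrix; the matrix values and the function name `symmetrize` are made up for illustration.

```python
def symmetrize(cond):
    """Joint p_ij = (p_{j|i} + p_{i|j}) / (2N), where cond[i][j] = p_{j|i}."""
    n = len(cond)
    return [
        [(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical conditional matrix for N = 3: rows sum to 1, diagonal is 0.
cond = [
    [0.0, 0.9, 0.1],
    [0.6, 0.0, 0.4],
    [0.2, 0.8, 0.0],
]
joint = symmetrize(cond)
total = sum(sum(row) for row in joint)
```

Although `cond` is not symmetric, `joint` is, and its entries sum to 1 over all pairs.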

The bandwidth of the Gaussian kernels $\sigma _{i}$ is set in such a way that the perplexity of the conditional distribution equals a predefined perplexity, using the bisection method. As a result, the bandwidth is adapted to the density of the data: smaller values of $\sigma _{i}$ are used in denser parts of the data space.
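The bandwidth search can be sketched as follows. The sketch assumes the usual definition of perplexity as $2^{H}$, where $H$ is the Shannon entropy (in bits) of the conditional distribution for one point. The bracket `[lo, hi]`, the tolerance, and the two toy neighborhoods are arbitrary choices for illustration, not values taken from the cited papers.

```python
import math

def row_perplexity(sq_dists, sigma):
    """Perplexity 2**H of one point's conditional distribution.

    sq_dists holds squared distances from the point to every *other* point.
    """
    # Shifting by the smallest distance leaves the normalized probabilities
    # unchanged and avoids all weights underflowing to zero.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, tol=1e-6, max_iter=200):
    """Bisection on sigma; perplexity increases as sigma increases."""
    mid = 0.5 * (lo + hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        perp = row_perplexity(sq_dists, mid)
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = mid  # distribution too flat: shrink the bandwidth
        else:
            lo = mid  # distribution too peaked: widen the bandwidth
    return mid

# Hypothetical neighborhoods: the same layout, once dense and once sparse.
dense = [0.01, 0.04, 0.09, 0.16]
sparse = [1.0, 4.0, 9.0, 16.0]
sigma_dense = find_sigma(dense, target=2.0)
sigma_sparse = find_sigma(sparse, target=2.0)
```

For the same target perplexity, the dense neighborhood receives the smaller bandwidth, which is the density adaptation described above.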

Since the Gaussian kernel uses the Euclidean distance $\lVert x_{i}-x_{j}\rVert$, it is affected by the curse of dimensionality: in high-dimensional data, when distances lose their ability to discriminate, the $p_{ij}$ become too similar (asymptotically, they would converge to a constant). To alleviate this, it has been proposed to adjust the distances with a power transform based on the intrinsic dimension of each point.^[15]
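The loss of discriminating power can be observed numerically. The sketch below assumes points drawn uniformly at random from the unit hypercube, a choice made purely for illustration, and measures the relative contrast $(d_{\max }-d_{\min })/d_{\min }$ of the distances from one query point. It does not implement the power transform mentioned above.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min over distances from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(dim=2)      # nearest and farthest neighbors differ a lot
high = relative_contrast(dim=1000)  # all distances are nearly the same
```

When the contrast is small, the exponents in the Gaussian kernel are nearly equal for all neighbors, so the resulting probabilities are nearly uniform.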

t-SNE aims to learn a $d$-dimensional map $\mathbf {y} _{1},\dots ,\mathbf {y} _{N}$ (with $\mathbf {y} _{i}\in \mathbb {R} ^{d}$) that reflects the similarities $p_{ij}$ as well as possible. To this end, it measures similarities $q_{ij}$ between two points in the map, $\mathbf {y} _{i}$ and $\mathbf {y} _{j}$, using a very similar approach. Specifically, for $i\neq j$, define $q_{ij}$ as

$$q_{ij}={\frac {(1+\lVert \mathbf {y} _{i}-\mathbf {y} _{j}\rVert ^{2})^{-1}}{\sum _{k}\sum _{l\neq k}(1+\lVert \mathbf {y} _{k}-\mathbf {y} _{l}\rVert ^{2})^{-1}}}$$

and set $q_{ii}=0$. Here a heavy-tailed Student t-distribution (with one degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points, so that dissimilar objects can be modeled far apart in the map.
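A direct transcription of this definition is sketched below. Unlike the high-dimensional conditional probabilities, the normalization here runs over all pairs at once, so the whole matrix sums to 1. The function name and the three map points are hypothetical.

```python
def low_dim_affinities(Y):
    """Return Q with Q[i][j] = q_ij (Student t kernel, one degree of freedom)."""
    n = len(Y)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # q_ii stays 0
                sq = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                kernel[i][j] = 1.0 / (1.0 + sq)
    total = sum(sum(row) for row in kernel)  # sum over all pairs k != l
    return [[v / total for v in row] for row in kernel]

# Hypothetical 2-D map with three points.
Y = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
Q = low_dim_affinities(Y)
```

The kernel $(1+r^{2})^{-1}$ decays polynomially rather than exponentially in the distance $r$, which is what lets moderately dissimilar points sit far apart in the map without being assigned a vanishing similarity.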

The locations of the points $\mathbf {y} _{i}$ in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution $P$ from the distribution $Q$, that is:

$$\mathrm {KL} \left(P\parallel Q\right)=\sum _{i\neq j}p_{ij}\log {\frac {p_{ij}}{q_{ij}}}$$

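The minimization of the Kullback–Leibler divergence with respect to the points $\mathbf {y} _{i}$ is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.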
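The optimization can be sketched with plain gradient descent. The gradient used below is the one given by van der Maaten and Hinton,^[2] $\partial C/\partial \mathbf {y} _{i}=4\sum _{j}(p_{ij}-q_{ij})(\mathbf {y} _{i}-\mathbf {y} _{j})(1+\lVert \mathbf {y} _{i}-\mathbf {y} _{j}\rVert ^{2})^{-1}$. Everything else is an assumption made for illustration: the toy matrix `P`, the fixed learning rate, the number of steps, and the small random initialization. Practical implementations typically add refinements such as momentum that are omitted here.

```python
import math
import random

def student_t_affinities(Y):
    """Return (Q, inv) where inv[i][j] = 1 / (1 + ||y_i - y_j||^2)."""
    n = len(Y)
    inv = [[0.0 if i == j
            else 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])))
            for j in range(n)] for i in range(n)]
    total = sum(map(sum, inv))
    return [[v / total for v in row] for row in inv], inv

def kl_divergence(P, Y):
    """KL(P || Q) for the map Y, summed over pairs with p_ij > 0."""
    Q, _ = student_t_affinities(Y)
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0.0)

def gradient(P, Y):
    """Gradient of the KL divergence with respect to each map point."""
    Q, inv = student_t_affinities(Y)
    n, d = len(Y), len(Y[0])
    grad = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for j in range(n):
            coeff = 4.0 * (P[i][j] - Q[i][j]) * inv[i][j]
            for k in range(d):
                grad[i][k] += coeff * (Y[i][k] - Y[j][k])
    return grad

def fit(P, dim=2, steps=500, learning_rate=1.0, seed=0):
    """Plain gradient descent from a small random start (hypothetical settings)."""
    rng = random.Random(seed)
    Y = [[rng.gauss(0.0, 1e-2) for _ in range(dim)] for _ in P]
    for _ in range(steps):
        grad = gradient(P, Y)
        Y = [[y - learning_rate * g for y, g in zip(yi, gi)]
             for yi, gi in zip(Y, grad)]
    return Y

# Hypothetical joint distribution: points 0,1 are similar, and so are 2,3.
hi_p, lo_p = 0.2, 0.025
P = [
    [0.0, hi_p, lo_p, lo_p],
    [hi_p, 0.0, lo_p, lo_p],
    [lo_p, lo_p, 0.0, hi_p],
    [lo_p, lo_p, hi_p, 0.0],
]
Y_init = fit(P, steps=0)  # the starting map, before any update
Y_final = fit(P)
kl_before = kl_divergence(P, Y_init)
kl_after = kl_divergence(P, Y_final)
```

In this toy setting the divergence drops as the two similar pairs are pulled together and the pairs are pushed apart from each other.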

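Software

The R package Rtsne implements t-SNE in R.
ELKI contains tSNE, including the Barnes-Hut approximation.
scikit-learn, a popular machine learning toolkit in Python, implements t-SNE with both the exact solution and the Barnes-Hut approximation.
Tensorboard, the visualization kit associated with TensorFlow, also implements t-SNE (online version).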

References

[1] Roweis, Sam; Hinton, Geoffrey (January 2002). Stochastic neighbor embedding (PDF). Neural Information Processing Systems.
[2] van der Maaten, L.J.P.; Hinton, G.E. (Nov 2008). "Visualizing Data Using t-SNE" (PDF). Journal of Machine Learning Research. 9: 2579–2605.
[3] Gashi, I.; Stankovic, V.; Leita, C.; Thonnard, O. (2009). "An Experimental Study of Diversity with Off-the-shelf AntiVirus Engines". Proceedings of the IEEE International Symposium on Network Computing and Applications: 4–11.
[4] Hamel, P.; Eck, D. (2010). "Learning Features from Music Audio with Deep Belief Networks". Proceedings of the International Society for Music Information Retrieval Conference: 339–344.
[5] Jamieson, A.R.; Giger, M.L.; Drukker, K.; Lui, H.; Yuan, Y.; Bhooshan, N. (2010). "Exploring Nonlinear Feature Space Dimension Reduction and Data Representation in Breast CADx with Laplacian Eigenmaps and t-SNE". Medical Physics. 37 (1): 339–351. doi:10.1118/1.3267037. PMC 2807447. PMID 20175497.
[6] Wallach, I.; Lilien, R. (2009). "The Protein-Small-Molecule Database, A Non-Redundant Structural Resource for the Analysis of Protein-Ligand Binding". Bioinformatics. 25 (5): 615–620. doi:10.1093/bioinformatics/btp035. PMID 19153135.
[7] Balamurali, Mehala; Silversides, Katherine L.; Melkumyan, Arman (2019-04-01). "A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data". Computers & Geosciences. 125: 78–89. doi:10.1016/j.cageo.2019.01.011. ISSN 0098-3004. S2CID 67926902.
[8] Balamurali, Mehala; Melkumyan, Arman (2016). Hirose, Akira; Ozawa, Seiichi; Doya, Kenji; Ikeda, Kazushi; Lee, Minho; Liu, Derong (eds.). "t-SNE Based Visualisation and Clustering of Geological Domain". Neural Information Processing. Lecture Notes in Computer Science. Cham: Springer International Publishing. 9950: 565–572. doi:10.1007/978-3-319-46681-1_67. ISBN 978-3-319-46681-1.
[9] Leung, Raymond; Balamurali, Mehala; Melkumyan, Arman (2021-01-01). "Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering". Mathematical Geosciences. 53 (1): 105–130. doi:10.1007/s11004-019-09839-z. ISSN 1874-8953. S2CID 208329378.
[10] Birjandtalab, J.; Pouyan, M. B.; Nourani, M. (2016-02-01). Nonlinear dimension reduction for EEG-based epileptic seizure detection. 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). pp. 595–598. doi:10.1109/BHI.2016.7455968. ISBN 978-1-5090-2455-1. S2CID 8074617.
[11] "K-means clustering on the output of t-SNE". Cross Validated. Retrieved 2018-04-16.
[12] Pezzotti, Nicola; Lelieveldt, Boudewijn P. F.; Maaten, Laurens van der; Hollt, Thomas; Eisemann, Elmar; Vilanova, Anna (2017-07-01). "Approximated and User Steerable tSNE for Progressive Visual Analytics". IEEE Transactions on Visualization and Computer Graphics. 23 (7): 1739–1752. arXiv:1512.01655. doi:10.1109/tvcg.2016.2570755. ISSN 1077-2626. PMID 28113434. S2CID 353336.
[13] Wattenberg, Martin; Viégas, Fernanda; Johnson, Ian (2016-10-13). "How to Use t-SNE Effectively". Distill. Retrieved 4 December 2017.
[14] Linderman, George C.; Steinerberger, Stefan (2017-06-08). "Clustering with t-SNE, provably". arXiv:1706.02582 [cs.LG].
[15] Schubert, Erich; Gertz, Michael (2017-10-04). Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection. SISAP 2017 – 10th International Conference on Similarity Search and Applications. pp. 188–203. doi:10.1007/978-3-319-68474-1_13.

External links

Visualizing Data Using t-SNE, a Google Tech Talk about t-SNE
Implementations of t-SNE in various languages, a link collection maintained by Laurens van der Maaten
