문자열 커널

머신러닝과 데이터 마이닝에서 문자열 커널은 문자열에서 동작하는 커널 함수, 즉 길이가 같을 필요가 없는 기호의 유한 시퀀스다.문자열 커널은 문자열 쌍의 유사성을 측정하는 함수로 직관적으로 이해할 수 있다: 두 문자열 a와 b가 비슷할수록 문자열 커널 K(a, b)의 값이 높아진다.

지원 벡터 기계와 같은 커널화된 학습 알고리즘과 함께 문자열 커널을 사용하면 이러한 알고리즘을 고정 길이의 실제 값 기능 벡터로 변환할 필요 없이 문자열과 함께 작동할 수 있다.^[1]문자열 커널은 텍스트 마이닝 및 유전자 분석과 같이 시퀀스 데이터를 클러스터링하거나 분류해야 하는 도메인에서 사용된다.^[2]

비공식 소개

어떤 사람이 일부 텍스트 구절을 자동으로 비교하고 그들의 상대적인 유사성을 나타내기를 원한다고 가정해보자.많은 응용 프로그램의 경우, 정확히 일치하는 키워드를 찾는 것으로 충분할 수 있다.정확한 일치가 항상 충분하지 않은 한 예는 스팸 탐지에서 찾을 수 있다.^[3]또 다른 것은 계산 유전자 분석에서 동질 유전자가 변이되어 삭제, 삽입 또는 교체된 기호와 함께 공통적으로 반복되는 결과를 낳았다.

동기

잘 검증된 여러 데이터 클러스터링, 분류 및 정보 검색 방법(예: 지원 벡터 머신)은 벡터(즉, 데이터는 벡터 공간의 요소)에서 작동하도록 설계되었기 때문에 문자열 커널을 사용하면 시퀀스 데이터를 처리하기 위해 이러한 방법의 확장을 가능하게 한다.

문자열 커널 방법은 형상 벡터가 단어의 유무만을 나타내는 텍스트 분류를 위한 이전의 접근법과 대조된다.이러한 접근방식에서 개선될 뿐만 아니라, 21세기 초에 나타나기 시작한 데이터 구조에 적응한 모든 종류의 커널의 예다.그러한 방법에 대한 조사는 게르트너에 의해 취합되었다.^[4]

생물정보학에서는 특히 단백질이나 DNA와 같은 생물학적 시퀀스를 기계 학습 모델에서 추가적으로 사용하기 위한 벡터로 변환하는데 사용된다.그러한 목적으로 사용되는 문자열 커널의 예는 프로파일 커널이다.^[5]

정의

도메인 $D$ $D$ 의 커널은 $K:D\times D\rightarrow \mathbb {R}$ K : $K:D\times D\rightarrow \mathbb {R}$ × $K:D\times D\rightarrow \mathbb {R}$ → $K:D\times D\rightarrow \mathbb {R}$ {\ $displaystyle K:$ $D\time D\rightarrow \mathb {R}}이($ 가) 일부 $K:D\times D\rightarrow \mathbb {R}$ 조건을 만족함(변수에 대칭, 어떤 의미에서는 연속적이고 양적인 세미데마인트가 됨)

Mercer의 정리는 $K$ ${\$ $displaystyle$ $K}$ 을(를) $K(x,y)=\varphi (x)\cdot \varphi (y)$ ( x $K(x,y)=\varphi (x)\cdot \varphi (y)$ , y $K(x,y)=\varphi (x)\cdot \varphi (y)$ ) $K(x,y)=\varphi (x)\cdot \varphi (y)$ = $K(x,y)=\varphi (x)\cdot \varphi (y)$ ( $K(x,y)=\varphi (x)\cdot \varphi (y)$ x $K(x,y)=\varphi (x)\cdot \varphi (y)$ ) $K(x,y)=\varphi (x)\cdot \varphi (y)$ ( $K(x,y)=\varphi (x)\cdot \varphi (y)$ ) ${\displaystyle K(x,y)=\varphi (x$ $)\$ $cdot$ $\varphi$ $(y$ $)}$ 로 $K(x,y)=\varphi (x)\cdot \varphi (y)$ 표현하면 인수를 내부 제품 공간으로 매핑할 $\varphi$ 수 있다고 $K$ 주장한다.

이제 문자열의 문자열 부분 커널의^[1] 정의를 알파벳 $\Sigma$ $\Sigma$ 을(를) 통해 재현할 수 있다 $\Sigma$ 좌표상 매핑은 다음과 같이 정의된다.

\varphi _{u}:\left\{{\begin{array}{l}\Sigma ^{n}\rightarrow \mathbb {R} ^{\Sigma ^{n}}\\s\mapsto \sum _{\mathbf {i} :u=s_{\mathbf {i} }}\lambda ^{l(\mathbf {i} )}\end{array}}\right.

$\mathbf {i}$ ${\$ }은(는) 다중 지수이고 $\mathbf {i}$ $u$ $u$ 은 $u$ (는) 일련의 길이 $n$ ${\displaystyle n$ 반복은 연속되지 않는 방식으로 발생할 수 있지만 간격은 불이익을 받는다.다중 인덱스 i $\mathbf {i}$ {\ $displaystyle$ \ $mathbf {i}$ 이 $\mathbf{i}$ $($ $l(\mathbf {i} )$ ) $s$ $s$ . $l(\mathbf {i} )$ ( $l(\mathbf {i} )$ ) ${\displaystyle$ l $(\mathbf {i} )$ 에서 $u$ $\mathbf {i}$ ${\$ $displaystyle \mathbf {i}$ 과(가 $\mathbf {i}$ 의 첫 번째 항목과 마지막 항목 간의 차이점 $l(\mathbf {i} )$ : ${\disp$ . $laystyle s}$ $u$ $u$ 과 $u$ (와) 일치하는 하위 $s$ 항목.매개 변수 $\lambda$ ${\$ $displaystyle$ $\lambda }$ 은(는) $0$ ${\$ $displaystyle 0^{0}$ 이 $0^{0}$ (가) $0$ 이 아니라 1 {\ $displaystyle$ $1$ 이고1{\ $displaystystyle$ 1 $}($ 이(이)도 외형과 동일한 가중치가 적용되므로 gaps가 허용되지 않는다 $\lambda$ . $0$ $1^{l(\mathbf {i} )}=1$ 인 하위 문자열로서 $,$ $1^{l(\mathbf {i} )}=1$ $1^{l(\mathbf {i} )}=1$ ( i ) = 1 {\ $displaystyle$ 1^{ $l(\$ mathbf { $i} )}=1}.$

몇 가지 관련 알고리즘의 경우, 데이터는 기능 벡터의 내부 생산물을 포함하는 식에서만 알고리즘에 입력되므로, 이름 커널 메서드가 된다.이것의 바람직한 결과는 변환 $\phi (x)$ ( x $\phi (x)$ ) ${\displaystyle \phi (x$ 오직 커널을 통한 내부 제품만을 명시적으로 계산할 필요가 없다는 것이다. 특히 근사치인 경우 훨씬 더 빠를 수 있다.^[1]

참조

^ ^a ^b ^c Lodhi, Huma; Saunders, Craig; Shawe-Taylor, John; Cristianini, Nello; Watkins, Chris (2002). "Text classification using string kernels". Journal of Machine Learning Research: 419–444.
^ Leslie, C.; Eskin, E.; Noble, W.S. (2002), "The spectrum kernel: A string kernel for SVM protein classification", Proceedings of the Pacific Symposium on Biocomputing, vol. 7, pp. 566–575, PMID 11928508
^ Amayri, O. (2009), "Improved Online Support Vector Machines Spam Filtering Using String Kernels", Lecture Notes in Computer Science, 5856: 621, Bibcode:2009LNCS.5856..621A, doi:10.1007/978-3-642-10268-4_73, ISBN 978-3-642-10267-7
^ Gärtner, T. (2003), "A survey of kernels for structured data", ACM SIGKDD Explorations Newsletter, ACM, 5 (1): 58, doi:10.1145/959242.959248, S2CID 4471326
^ Kuang, Rui; Ie, Eugene; Wang, Ke; Wang, Kai; Siddiqi, Mahira; Freund, Yoav; Leslie, Christina (2005-06-01). "Profile-based string kernels for remote homology detection and motif extraction". Journal of Bioinformatics and Computational Biology. 3 (3): 527–550. doi:10.1142/s021972000500120x. ISSN 0219-7200. PMID 16108083.

[Lodhi-1] Lodhi, Huma; Saunders, Craig; Shawe-Taylor, John; Cristianini, Nello; Watkins, Chris (2002). "Text classification using string kernels". Journal of Machine Learning Research: 419–444.

[2] Leslie, C.; Eskin, E.; Noble, W.S. (2002), "The spectrum kernel: A string kernel for SVM protein classification", Proceedings of the Pacific Symposium on Biocomputing, vol. 7, pp. 566–575, PMID 11928508

[3] Amayri, O. (2009), "Improved Online Support Vector Machines Spam Filtering Using String Kernels", Lecture Notes in Computer Science, 5856: 621, Bibcode:2009LNCS.5856..621A, doi:10.1007/978-3-642-10268-4_73, ISBN 978-3-642-10267-7

[4] Gärtner, T. (2003), "A survey of kernels for structured data", ACM SIGKDD Explorations Newsletter, ACM, 5 (1): 58, doi:10.1145/959242.959248, S2CID 4471326

[5] Kuang, Rui; Ie, Eugene; Wang, Ke; Wang, Kai; Siddiqi, Mahira; Freund, Yoav; Leslie, Christina (2005-06-01). "Profile-based string kernels for remote homology detection and motif extraction". Journal of Bioinformatics and Computational Biology. 3 (3): 527–550. doi:10.1142/s021972000500120x. ISSN 0219-7200. PMID 16108083.

[1]

[2]

[3]

[4]

[5]

Search

문자열 커널

네임스페이스

더

목차

비공식 소개

동기

정의

참조