Sequence clustering

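In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are related in some way. The sequences can be of genomic, "transcriptomic" (EST), or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important for grouping sequences that originate from the same gene before the ESTs are assembled to reconstruct the original mRNA.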

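Some clustering algorithms use single-linkage clustering, constructing a transitive closure of sequences whose similarity exceeds a particular threshold. UCLUST^[1] and CD-HIT^[2] use a greedy algorithm that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence matches no existing representative, it becomes the representative sequence of a new cluster. The similarity score is often based on sequence alignment. Sequence clustering is often used to build a non-redundant set of representative sequences.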
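The greedy procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation of UCLUST or CD-HIT: the `similarity` function uses `difflib.SequenceMatcher` from the standard library as a stand-in for an alignment-based score, and the names `greedy_cluster` and `threshold`, as well as the choice to process longer sequences first, are assumptions of this sketch.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; a stand-in for an alignment-based score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.9):
    """Greedy incremental clustering around representative sequences.

    Returns a list of (representative, members) pairs.
    """
    clusters = []
    # Assumption of this sketch: visit longer sequences first so that
    # they become representatives (sorted() keeps ties in input order).
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if similarity(seq, representative) >= threshold:
                # Similar enough to an existing representative: join it.
                members.append(seq)
                break
        else:
            # No representative matched: seq starts a new cluster.
            clusters.append((seq, [seq]))
    return clusters
```

Calling `greedy_cluster(seqs, threshold=0.8)` returns one `(representative, members)` pair per cluster; the representatives alone form a non-redundant set at that threshold under this toy similarity measure.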

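Sequence clusters are often synonymous with, but not identical to, protein families. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.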

Sequence clustering algorithms and packages

CD-HIT^[2]
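UCLUST in USEARCH^[1]
Starcode:^[3] a fast sequence clustering algorithm based on exact all-pairs search^[4]
OrthoFinder:^[5] a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)^[6]^[7]
Linclust:^[8] the first algorithm whose runtime scales linearly with input set size; part of the MMseqs2^[9] software suite for fast and sensitive sequence searching and clustering of large sequence sets
TribeMCL: a method for clustering proteins into related groups^[10]
BAG: a graph-theoretic sequence clustering algorithm^[11]
JESAM:^[12] an open-source, parallel, scalable DNA alignment engine with an optional clustering software component
UICluster:^[13] parallel clustering of EST (gene) sequences
BLASTClust:^[14] single-linkage clustering with BLAST
Clusterer:^[15] an extendable Java application for sequence grouping and cluster analyses
PATDB: a program for rapidly identifying perfect substrings
nrdb:^[16] a program for merging trivially redundant (identical) sequences
CluSTr:^[17] a single-linkage protein sequence clustering database built from Smith-Waterman sequence similarities; covers 7 million sequences, including UniProt and IPI
ICAtools:^[18] the original (classic) DNA clustering package, with several algorithms useful for artifact discovery or EST clustering
Skipredundant:^[19] an EMBOSS tool that removes redundant sequences from a set
CLUSS:^[20] an algorithm that identifies groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences. CLUSS web server^[21]
CLUSS2:^[22] an algorithm for clustering families of hard-to-align protein sequences with multiple biological functions. CLUSS2 web server^[21]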
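Several entries in the list above, such as BLASTClust and CluSTr, rely on single-linkage clustering, in which any chain of sufficiently similar pairs ends up in one cluster. The following is a minimal sketch of that idea using a union-find structure; it does not reproduce any of the listed tools. The `similarity` function (based on `difflib.SequenceMatcher`) is a stand-in for a BLAST or Smith-Waterman score, and the names `single_linkage` and `threshold` are assumptions of this sketch.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; a stand-in for an alignment-based score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def single_linkage(sequences, threshold=0.8):
    """Group sequences by the transitive closure of 'similar enough' pairs."""
    parent = list(range(len(sequences)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare all pairs (quadratic; fine for a small illustration only).
    for i, j in combinations(range(len(sequences)), 2):
        if similarity(sequences[i], sequences[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i, seq in enumerate(sequences):
        groups.setdefault(find(i), []).append(seq)
    return list(groups.values())
```

Because membership is transitive, two sequences can share a cluster even when their direct similarity is below the threshold, as long as intermediate sequences link them. This is the main behavioral difference from representative-based greedy clustering.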

Non-redundant sequence databases

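PISCES: a protein sequence culling server^[23]
RDB90^[24]
UniRef: a non-redundant UniProt sequence database^[25]
Uniclust: UniProtKB sequences clustered at the 90%, 50% and 30% pairwise sequence identity levels^[26]
Virus Orthologous Clusters:^[27] a viral protein sequence clustering database; contains all predicted genes from eleven virus families, organized into ortholog groups by BLASTP similarity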
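Identity levels such as the 90%, 50% and 30% thresholds mentioned above depend on how pairwise identity is defined, and definitions differ between tools. As a rough illustration only, the sketch below computes percent identity over the gap-free columns of two sequences that are assumed to be already aligned. The function name `percent_identity`, the `-` gap character, and this particular definition are assumptions of the sketch, not the definition used by any specific database.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if x == y:
            matches += 1
    # Return 0.0 when no gap-free column exists.
    return 100.0 * matches / compared if compared else 0.0
```

Under this simplified definition, a pair scoring 80% would be grouped together at the 50% and 30% levels but kept apart at the 90% level.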

References

[1] "USEARCH". drive5.com.
[2] "CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data". cd-hit.org.
[3] "Starcode repository". 2018-10-11.
[4] Zorita E, Cuscó P, Filion GJ (June 2015). "Starcode: sequence clustering based on all-pairs search". Bioinformatics. 31 (12): 1913–9. doi:10.1093/bioinformatics/btv053. PMC 4765884. PMID 25638815.
[5] "OrthoFinder". Steve Kelly Lab.
[6] Emms DM, Kelly S (August 2015). "OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy". Genome Biology. 16: 157. doi:10.1186/s13059-015-0721-2. PMC 4531804. PMID 26243257.
[7] Emms DM, Kelly S (November 2019). "OrthoFinder: phylogenetic orthology inference for comparative genomics". Genome Biology. 20 (1): 238. doi:10.1186/s13059-019-1832-y. PMC 6857279. PMID 31727128.
[8] Steinegger M, Söding J (June 2018). "Clustering huge protein sequence sets in linear time". Nature Communications. 9 (1): 2542. Bibcode:2018NatCo...9.2542S. doi:10.1038/s41467-018-04964-5. PMC 6026198. PMID 29959318.
[9] Steinegger M, Söding J (November 2017). "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets". Nature Biotechnology. 35 (11): 1026–1028. doi:10.1038/nbt.3988. hdl:11858/00-001M-0000-002E-1967-3. PMID 29035372. S2CID 402352.
[10] Enright AJ, Van Dongen S, Ouzounis CA (April 2002). "An efficient algorithm for large-scale detection of protein families". Nucleic Acids Research. 30 (7): 1575–84. doi:10.1093/nar/30.7.1575. PMC 101833. PMID 11917018.
[11] "Archived copy". Archived from the original on 2003-12-06. Retrieved 2004-02-19.
[12] "Bioinformatics Paper: JESAM: CORBA software components for EST alignments and clusters". littlest.co.uk.
[13] http://ratest.eng.uiowa.edu/pubsoft/clustering/
[14] "NCBI News: Spring 2004-BLASTLab". nih.gov.
[15] "Clusterer: extendable java application for sequence grouping and cluster analyses". bugaco.com.
[16] "Index of /pub/nrdb". Archived from the original on 2008-01-01.
[17] "Archived copy". Archived from the original on 2006-09-24. Retrieved 2006-11-23.
[18] "Introduction to the ICAtools". littlest.co.uk.
[19] "EMBOSS: skipredundant". pasteur.fr.
[20] Kelil A, Wang S, Brzezinski R, Fleury A (August 2007). "CLUSS: clustering of protein sequences based on a new similarity measure". BMC Bioinformatics. 8: 286. doi:10.1186/1471-2105-8-286. PMC 1976428. PMID 17683581.
[21] "CLUSS Home Page".
[22] Kelil A, Wang S, Brzezinski R (2008). "CLUSS2: an alignment-independent algorithm for clustering protein families with multiple biological functions". International Journal of Computational Biology and Drug Design. 1 (2): 122–40. doi:10.1504/ijcbdd.2008.020190. PMID 20058485.
[23] "Dunbrack Lab". fccc.edu.
[24] Holm L, Sander C (June 1998). "Removing near-neighbour redundancy from large protein sequence collections". Bioinformatics. 14 (5): 423–9. doi:10.1093/bioinformatics/14.5.423. PMID 9682055.
[25] "About UniProt". uniprot.org.
[26] Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M (January 2017). "Uniclust databases of clustered and deeply annotated protein sequences and alignments". Nucleic Acids Research. 45 (D1): D170–D176. doi:10.1093/nar/gkw1081. PMC 5614098. PMID 27899574.
[27] "VOCS - Viral Bioinformatics Resource Center". uvic.ca.
