ProClust

ProClust: improved clustering of protein sequences with an extended graph-based approach. Results: We extend a graph-based clustering algorithm that uses an asymmetric distance measure, scaling similarity values based on the length of the protein sequences compared. Additionally, the significance of alignment scores is taken into account and used for a filtering step in the algorithm. A post-processing step that merges further clusters based on profile HMMs is proposed. SCOP sequences and their super-family level classification are used as a test set for a clustering computed with our method on the joint data set containing both SCOP and SWISS-PROT. Note that the joint data set includes all multi-domain proteins containing the SCOP domains; these proteins are a potential source of incorrect links. At high specificities, our method compares very favorably with PSI-Blast, which is probably the most widely used tool for finding remote homologues. We demonstrate that using transitivity with as many as twelve intermediate sequences is crucial to achieving this level of performance. Moreover, from an analysis of false positives we conclude that our method seems to correctly bound the degree of transitivity used. This analysis also yields explicit guidance in choosing parameters. The heuristics of the asymmetric distance measure neither solve the multi-domain problem from a theoretical point of view, nor do they avoid all types of problems we have observed in real data. Nevertheless, they do provide a substantial improvement over existing approaches. Availability: The complete software source is freely available to all users under the GNU General Public License (GPL) from http://www.bioinformatik.uni-koeln.de/proclust/download/
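
The abstract does not state the scaling formula or the clustering criterion, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes that a raw alignment score is divided by the length of the source sequence, that a directed edge is kept when the scaled value reaches a hypothetical `threshold`, and that clusters are taken as strongly connected components of the resulting directed graph. The function names, the threshold, and the toy scores are illustrative and are not taken from the ProClust source.

```python
from collections import defaultdict

def build_directed_graph(scores, lengths, threshold):
    """Assumed scaling: edge src -> dst if score / len(src) >= threshold."""
    graph = defaultdict(set)
    for (a, b), score in scores.items():
        for src, dst in ((a, b), (b, a)):
            if score / lengths[src] >= threshold:
                graph[src].add(dst)
    return graph

def strongly_connected_components(nodes, graph):
    """Kosaraju's algorithm, written iteratively to avoid deep recursion."""
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(sorted(graph.get(start, ()))))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(sorted(graph.get(nxt, ())))))
                    break
            else:
                # All neighbours explored: record finishing order.
                order.append(node)
                stack.pop()

    # Second pass on the reversed graph, in decreasing finishing order.
    reverse = defaultdict(set)
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].add(src)

    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components

# Toy data (invented): M is a long two-domain protein sharing one domain
# with A and another with B; A2 is a close homologue of A.
scores = {("A", "M"): 90, ("B", "M"): 90, ("A", "A2"): 85}
lengths = {"A": 100, "A2": 100, "B": 100, "M": 300}

graph = build_directed_graph(scores, lengths, threshold=0.5)
clusters = strongly_connected_components(sorted(lengths), graph)
```

In this toy case the edges A -> M and B -> M exist, but the reverse edges do not, because the same score is small relative to the length of M. A and B are therefore not linked transitively through the multi-domain protein, while A and A2 still form a cluster. As the abstract itself notes, heuristics of this kind reduce the multi-domain problem but do not solve it.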


References in zbMATH (referenced in 6 articles)

  1. Malek, Sabrine; Naanaa, Wady: A new approximate cluster deletion algorithm for diamond-free graphs (2020)
  2. Schmidt, Markus; Kutzner, Arne; Heese, Klaus: A novel specialized single-linkage clustering algorithm for taxonomically ordered data (2017)
  3. Dai, Qi; Liu, Xiaoqing; Yao, Yuhua; Zhao, Fukun: Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison (2011)
  4. Nepusz, Tamás; Sasidharan, Rajkumar; Paccanaro, Alberto: SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale (2010)
  5. Kelil, Abdellali; Wang, Shengrui; Brzezinski, Ryszard; Fleury, Alain: CLUSS: Clustering of protein sequences based on a new similarity measure (2007)
  6. Tan, Meng Piao; Broach, James R.; Floudas, Christodoulos A.: A novel clustering approach and prediction of optimal number of clusters: global optimum search with enhanced positioning (2007)