Weighted k-word matches: a sequence comparison tool for proteins


  • J. Jing
  • S. R. Wilson
  • Conrad John Burden




protein sequences, word matching


The use of $k$-word matches was developed as a fast alignment-free comparison method for DNA sequences in cases where long-range contiguity has been compromised, for example by shuffling, duplication, deletion or inversion of extended blocks of sequence. Here we extend the algorithm to amino acid sequences. We define a new statistic, the weighted word match, which reflects the varying degrees of similarity between pairs of amino acids. We compute the mean and variance, and simulate the distribution function, of various forms of this statistic for sequences of independent and identically distributed letters. We present these results and a method for choosing an optimal word size. The efficiency of the method is tested using simulated evolutionary sequences, and the results are compared with those of BLAST.

References
  • R. A. Lippert, H. Huang, and M. S. Waterman. Distributional regimes for the number of $k$-word matches between two random sequences. Proc. Natl. Acad. Sci. USA, 99(22):13980--9, 2002. doi:10.1073/pnas.202468099
  • J. Jing, C. J. Burden, S. Foret, and S. R. Wilson. Statistical considerations underpinning an alignment-free sequence comparison method. J. Korean Stat. Soc., 39:325--335, 2010. doi:10.1016/j.jkss.2010.02.009
  • S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25(17):3389--402, 1997. doi:10.1093/nar/25.17.3389
  • W. J. Ewens and G. R. Grant. Statistical Methods in Bioinformatics: an Introduction. Springer, 2nd edition, 2005.
  • S. Foret, M. R. Kantorovitz, and C. J. Burden. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences. BMC Bioinformatics, 7 Suppl 5:S21, 2006. doi:10.1186/1471-2105-7-S5-S21
  • S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89:10915--10919, 1992. doi:10.1073/pnas.89.22.10915
  • http://bioinfo.lifl.fr/reblosum/ [31 May 2011]
  • G. Reinert, D. Chew, F. Sun, and M. S. Waterman. Alignment-free sequence comparison (i): statistics and power. J. Comput. Biol., 16(12):1615--1634, 2009. doi:10.1089/cmb.2009.0198
  • S. Foret, S. R. Wilson, and C. J. Burden. Empirical distribution of $k$-word matches in biological sequences. Pattern Recogn., 42:539--548, 2009. doi:10.1016/j.patcog.2008.06.026
  • S. Foret, S. R. Wilson, and C. J. Burden. Characterizing the $D2$ statistic: Word matches in biological sequences. Stat. Appl. Genet. Mol. Biol., 8(1):Article 43, 2009. doi:10.2202/1544-6115.1447
  • M. R. Kantorovitz, H. S. Booth, C. J. Burden, and S. R. Wilson. Asymptotic behavior of $k$-word matches between two uniformly distributed sequences. J. Appl. Probab., 44:788--805, 2007. doi:10.1239/jap/1189717545
  • T. J. Wu, Y. H. Huang, and L. A. Li. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics, 21(22):4125--32, 2005. doi:10.1093/bioinformatics/bti658
  • S. Q. Le and O. Gascuel. An improved general amino acid replacement matrix. Mol. Biol. Evol., 25:1307--1320, 2008. doi:10.1093/molbev/msn067
  • E. Gazave, P. Lapébie, G. S. Richards, F. Brunet, A. V. Ereskovsky, B. M. Degnan, C. Borchiellini, M. Vervoort, and E. Renard. Origin and evolution of the Notch signalling pathway: an overview from eukaryotic genomes. BMC Evol. Biol., 9:249, 2009. doi:10.1186/1471-2148-9-249
  • S. Q. Schneider, J. R. Finnerty, and M. Q. Martindale. Protein evolution: structure-function relationships of the oncogene Beta-catenin in the evolution of multicellular animals. J. Exptl. Zool. (Mol. Dev. Evol.), 295B:25--44, 2003. doi:10.1002/jez.b.00006





Proceedings Computational Techniques and Applications Conference