Neighborhood Correlation is a novel homology identification method
based on the observation that gene duplication and domain insertion
result in different topological structures in the sequence similarity
network. For details of Neighborhood Correlation, please refer to the
publication:
In the PLoS paper above, we applied Neighborhood Correlation to all full-length mouse and human amino acid sequences in SwissProt Version 50.9. In an empirical validation of pairwise homology identification performance on twenty manually curated families, we show that Neighborhood Correlation achieves high sensitivity and specificity in both single-domain and complex multidomain families. It outperforms traditional methods that combine sequence similarity with additional criteria based on alignment length.
To examine the performance on individual families and sequences in the mouse and human data set, please see the Neighborhood Correlation Browser. The Browser allows exploratory analysis of the neighborhood structure of the protein sequence similarity network. The user may
We make available an open-source (GPL) implementation of Neighborhood Correlation to demonstrate our algorithms and to facilitate novel analysis of additional data sets.
Version 2.1 further improves the performance of Neighborhood Correlation, in two ways:
BUG FIX: Version 2.0 was released with the LOG_10 transformation of the input inadvertently disabled. Version 2.1 restores the correct functionality by using LOG_10(BIT-score) for all internal calculations.
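The effect of the transformation can be illustrated with a minimal sketch. This is not the code from the distributed implementation: the function name `log_scores`, the dictionary layout, and the sample bit scores are all hypothetical, chosen only to show how a LOG_10 transform compresses the range of BLAST bit scores.

```python
import math

def log_scores(bit_scores):
    """Return log10 of each bit score, keyed by sequence pair.

    Hypothetical helper for illustration; non-positive scores are skipped
    because log10 is undefined for them.
    """
    return {pair: math.log10(score)
            for pair, score in bit_scores.items()
            if score > 0}

# Made-up bit scores spanning two orders of magnitude.
raw = {("a", "b"): 1000.0, ("a", "c"): 100.0, ("b", "c"): 10.0}
transformed = log_scores(raw)

for pair in sorted(transformed):
    print(pair, round(transformed[pair], 3))
```

After the transform, scores that differ by a factor of ten differ by exactly one unit, so a few very high-scoring pairs no longer dominate calculations made on the scores.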
© 2011 Jacob Joseph
Version 2.0 is a complete rewrite of the Neighborhood Correlation implementation. It is meant to replace Version 1.0 (previously referred to as the "reference implementation"). Version 2.0 has been optimized to accommodate large datasets through fast computation and greatly reduced memory usage.
This implementation adds a dependency on the Numpy numerical package. It also requires that a C compiler be available on the system. We believe it to be platform independent, and have tested it on Linux and MacOS.
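Before installing, it may help to confirm that both prerequisites are present. The following is a minimal sketch, assuming a current Python 3 interpreter; the function name `check_prerequisites` and the list of compiler names it searches for are illustrative assumptions, not part of the distributed package.

```python
import importlib.util
import shutil

def check_prerequisites(compilers=("cc", "gcc", "clang")):
    """Report whether Numpy is importable and which C compiler is on PATH.

    The compiler names are common defaults (an assumption); pass a
    different tuple if your system uses another compiler.
    """
    numpy_found = importlib.util.find_spec("numpy") is not None
    # First compiler name that resolves to an executable, or None.
    compiler = next((c for c in compilers if shutil.which(c)), None)
    return {"numpy": numpy_found, "c_compiler": compiler}

status = check_prerequisites()
print(status)
```

If `numpy` is reported as `False` or `c_compiler` as `None`, the corresponding prerequisite is missing or was not found under the names searched.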
Performance is greatly improved over Version 1.0. As a rough guide, the set of mouse and human sequences used in our analysis included 26,197 sequences. From this, all-against-all BLAST yielded approximately 4.8 million pairwise relations. For this dataset, Neighborhood Correlation Version 2.0 can be expected to consume approximately 125 MB of memory. Running time for this dataset is approximately 45 minutes on an Intel Pentium D at 3.2 GHz. By comparison, Version 1.0 required more than 1 GB of memory and 16 hours of running time.
If you are working with a small dataset (1-2 million BLAST scores) and don't care to install Numpy, give Version 1.0 a try. The input and output are equivalent, with one exception: Version 1.0 reported NC scores for pairs that satisfied the condition (NC(x,y) ≥ nc_thresh || BLAST score (x,y) exists). Version 2.0 simplifies this to (NC(x,y) ≥ nc_thresh) only.
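The difference between the two reporting conditions can be sketched as follows. This is an illustration only, not code from either release: the function names `report_v1` and `report_v2`, the data layout, and the sample scores are hypothetical; only the two conditions and the name `nc_thresh` come from the description above.

```python
def report_v1(nc, blast_pairs, nc_thresh):
    """Version 1.0 condition: NC(x,y) >= nc_thresh OR a BLAST score exists."""
    return {pair: score for pair, score in nc.items()
            if score >= nc_thresh or pair in blast_pairs}

def report_v2(nc, nc_thresh):
    """Version 2.0 condition: NC(x,y) >= nc_thresh only."""
    return {pair: score for pair, score in nc.items()
            if score >= nc_thresh}

# Made-up NC scores; only the pair (x, z) has a BLAST score.
nc = {("x", "y"): 0.9, ("x", "z"): 0.2, ("y", "z"): 0.1}
blast_pairs = {("x", "z")}

v1 = report_v1(nc, blast_pairs, nc_thresh=0.5)
v2 = report_v2(nc, nc_thresh=0.5)
print(sorted(v1))
print(sorted(v2))
```

In this example the pair (x, z) falls below the threshold but has a BLAST score, so it appears only under the Version 1.0 condition. Output from Version 1.0 can be made comparable to Version 2.0 by dropping such below-threshold pairs.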
© 2009 Jacob Joseph
This is the original "reference implementation" used to demonstrate the algorithms in the PLoS publication. We have focused upon an intuitive implementation with readable code. This program requires only a basic Python installation, and has no additional dependencies. It has been tested with Python version 2.5 on a Linux computer. It has no OS-specific requirements and should work on any complete Python installation.
© 2008 Jacob Joseph
We predicted mouse and human homologs using Neighborhood Correlation. Those predictions, our manually curated validation set, and a reference implementation are available here.
The PLoS Computational Biology study was carried out on all full-length mouse and human amino acid sequences in SwissProt Version 50.9 (11,553 mouse protein sequences and 14,644 human protein sequences). Data used in the study and predictions made using our method are available here:
For assistance with, or questions about, any of the material on this page, please contact Jacob Joseph or Dannie Durand. We are always pleased to hear about new analyses.
This material is based upon work supported by the National Science Foundation (NSF) under Grant No. DBI-0641313, the National Institutes of Health (NIH) under Grant No. 1 K22 HG 02451-01, and a David and Lucile Packard Foundation fellowship. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of NSF, NIH, or the Packard Foundation.