Neighborhood Correlation

Neighborhood Correlation is a novel homology identification method based on the observation that gene duplication and domain insertion result in different topological structures in the sequence similarity network. For details of Neighborhood Correlation, please refer to the publication:

Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins
Song N, Joseph JM, Davis GB, Durand D

PLoS Computational Biology 4(5): e1000063
doi:10.1371/journal.pcbi.1000063

Mouse & Human Homology Results Browser

In the PLoS paper above, we applied Neighborhood Correlation to all full-length mouse and human amino acid sequences in SwissProt Version 50.9. In an empirical validation of pairwise homology identification performance on twenty manually curated families, we showed that Neighborhood Correlation achieves high sensitivity and specificity in both single-domain and complex multidomain families. It outperforms traditional methods that combine sequence similarity with additional criteria based on alignment length.

To examine the performance on individual families and sequences in the mouse and human data set, please see the Neighborhood Correlation Browser. The Browser allows exploratory analysis of the neighborhood structure of the protein sequence similarity network. The user may

Download Neighborhood Correlation

We make available an open-source (GPL) implementation of Neighborhood Correlation to demonstrate our algorithms and to facilitate novel analysis of additional data sets.

Neighborhood Correlation Version 2.1

Version 2.1 further improves the performance of Neighborhood Correlation in two ways:

  1. To produce Neighborhood Correlation scores for all pairs of sequences, previous versions iterated over all N^2 pairs, for N input sequences. This version progressively iterates through the neighborhood of each query sequence, resulting in N * M pairwise calculations, where M is the number of sequences in the neighborhood of each query sequence, plus the number of sequences in the neighborhoods of those sequences. For large datasets, this optimization is extremely beneficial.
  2. Neighborhood Correlation first makes the input BLAST scores symmetric: BIT-score(x,y) = max(BIT-score(x,y), BIT-score(y,x)). This version improves the efficiency of this calculation through use of a compiled C function.
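
The two steps above can be sketched in a few lines of pure Python. This is an illustrative sketch only, not the distributed implementation (which uses Numpy and a compiled C function). The function names, and the representation of BLAST scores as a dictionary keyed by (query, subject) pairs, are assumptions made for the example.

```python
from collections import defaultdict


def symmetrize(scores):
    """Return scores where score(x, y) == score(y, x) == max of the two.

    `scores` maps (x, y) tuples to bit scores (hypothetical input format).
    A pair reported in only one direction keeps its single score.
    """
    sym = {}
    for (x, y), s in scores.items():
        best = max(s, scores.get((y, x), s))
        sym[(x, y)] = best
        sym[(y, x)] = best
    return sym


def build_neighborhoods(sym_scores):
    """Map each sequence to the set of sequences it shares a score with."""
    nbrs = defaultdict(set)
    for x, y in sym_scores:
        nbrs[x].add(y)
    return dict(nbrs)


def candidate_pairs(nbrs):
    """Yield unordered pairs (x, y), x < y, where y is a neighbor of x
    or a neighbor of one of x's neighbors. Self pairs are skipped."""
    for x in sorted(nbrs):
        reachable = set(nbrs[x])
        for n in nbrs[x]:
            reachable |= nbrs.get(n, set())
        for y in sorted(reachable):
            if x < y:
                yield (x, y)


# Toy input: five sequences, two disconnected groups.
scores = {("a", "b"): 50.0, ("b", "a"): 60.0,
          ("b", "c"): 40.0, ("d", "e"): 30.0}
sym = symmetrize(scores)
pairs = list(candidate_pairs(build_neighborhoods(sym)))
print(pairs)  # [('a', 'b'), ('a', 'c'), ('b', 'c'), ('d', 'e')]
```

In this toy case only 4 of the 10 possible unordered pairs are visited, because no pair spanning the two groups is ever considered.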

BUG FIX: Version 2.0 was released with LOG_10 transformation of the input inadvertently disabled. Version 2.1 restores the correct functionality by using LOG_10(BIT-score) for all internal calculations.
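
As a minimal illustration of the transformation described above, the sketch below applies a base-10 logarithm to every bit score before any further calculation. The function name and the dictionary input format are hypothetical; the released code may organize this step differently.

```python
import math


def log_transform(scores):
    """Replace each bit score with log10(bit score).

    `scores` maps (x, y) tuples to positive bit scores
    (an assumed input format for this example).
    """
    transformed = {}
    for pair, s in scores.items():
        if s <= 0:
            # log10 is undefined for non-positive values.
            raise ValueError(f"non-positive bit score for {pair}: {s}")
        transformed[pair] = math.log10(s)
    return transformed


logged = log_transform({("a", "b"): 100.0, ("a", "c"): 1000.0})
```

With the transformation disabled, as in Version 2.0, raw bit scores would enter the calculation directly, so scores computed by Version 2.0 and Version 2.1 should not be expected to agree.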

© 2011 Jacob Joseph and Carnegie Mellon University. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

Neighborhood Correlation Version 2.0

Version 2.0 is a complete rewrite of the Neighborhood Correlation implementation. It is meant to replace Version 1.0 (previously referred to as the "reference implementation"). Version 2.0 has been optimized to accommodate large datasets through fast computation and greatly reduced memory usage.

This implementation adds a dependency upon the Numpy numerical package. It also requires that a C compiler be available on the system. We believe it to be platform independent, and have tested it on Linux and MacOS.

Performance is greatly improved over Version 1.0. As a rough guide, the set of Mouse and Human sequences used in our analysis included 26,197 sequences. From this, all-against-all BLAST yielded approximately 4.8 million pairwise relations. For this dataset, Neighborhood Correlation Version 2.0 can be expected to consume approximately 125MB of memory. Running time for this dataset is approximately 45 minutes on an Intel Pentium D at 3.2GHz. Version 1.0 required more than 1GB of memory and 16 hours of running time for the same dataset.
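
To put these figures in perspective, the short calculation below derives the density of the similarity network, the approximate memory used per relation, and the speedup over Version 1.0. It uses only the numbers quoted above, which are themselves approximate, and assumes 1MB = 10^6 bytes.

```python
n_sequences = 26_197
n_relations = 4_800_000          # approximate, from all-against-all BLAST
memory_v2_bytes = 125 * 10**6    # approximately 125MB (assuming 10^6 bytes per MB)
minutes_v1 = 16 * 60             # Version 1.0 running time
minutes_v2 = 45                  # Version 2.0 running time

possible_pairs = n_sequences ** 2            # all N^2 ordered pairs
density = n_relations / possible_pairs       # fraction with a BLAST score
bytes_per_relation = memory_v2_bytes / n_relations
speedup = minutes_v1 / minutes_v2

print(possible_pairs)                  # 686282809
print(f"{density:.2%}")                # 0.70%
print(f"{bytes_per_relation:.0f}")     # 26
print(f"{speedup:.1f}x")               # 21.3x
```

Fewer than one in a hundred possible pairs has a BLAST score, which is why a sparse representation of the network keeps memory usage low.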

If you are working with a small dataset (1-2 million BLAST scores) and don't care to install Numpy, give Version 1.0 a try. The input and output are equivalent, with one exception: Version 1.0 reported NC scores for pairs that satisfied the condition (NC(x,y) ≥ nc_thresh || BLAST score(x,y) exists). Version 2.0 simplifies this condition to (NC(x,y) ≥ nc_thresh) only.
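
The difference between the two reporting conditions can be sketched as follows. The function names, the dictionary of NC scores, and the threshold value are hypothetical and chosen only for illustration; nc_thresh is the parameter named above.

```python
def report_v1(nc_scores, blast_pairs, nc_thresh):
    """Version 1.0 rule: keep a pair if its NC score meets the
    threshold OR a BLAST score exists for the pair."""
    return {pair: s for pair, s in nc_scores.items()
            if s >= nc_thresh or pair in blast_pairs}


def report_v2(nc_scores, nc_thresh):
    """Version 2.0 rule: keep a pair only if its NC score meets
    the threshold."""
    return {pair: s for pair, s in nc_scores.items() if s >= nc_thresh}


# Toy NC scores; ("a", "c") has a BLAST score but a low NC score.
nc_scores = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.01}
blast_pairs = {("a", "b"), ("a", "c")}

v1 = report_v1(nc_scores, blast_pairs, nc_thresh=0.5)
v2 = report_v2(nc_scores, nc_thresh=0.5)
print(sorted(v1))  # [('a', 'b'), ('a', 'c')]
print(sorted(v2))  # [('a', 'b')]
```

Output from Version 1.0 may therefore contain low-scoring pairs that Version 2.0 omits. Filtering Version 1.0 output by the threshold should yield the same set of pairs.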

© 2009 Jacob Joseph and Carnegie Mellon University. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

Neighborhood Correlation Version 1.0

This is the original "reference implementation" used to demonstrate the algorithms in the PLoS publication. We have focused upon an intuitive implementation with readable code. This program requires only a basic Python installation, and has no additional dependencies. It has been tested with Python version 2.5 on a Linux computer. It has no OS-specific requirements and should work on any complete Python installation.

© 2008 Jacob Joseph, Nan Song, and Carnegie Mellon University. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

Supplemental Data

We predicted mouse and human homologs using Neighborhood Correlation. Those predictions, our manually curated validation set, and a reference implementation are available here.

The PLoS Computational Biology study was carried out on all full-length mouse and human amino acid sequences in SwissProt Version 50.9 (11,553 mouse protein sequences and 14,644 human protein sequences). Data used in the study and predictions made using our method are available here:

Recent Publications

Jacob M Joseph and Dannie Durand. (2009) Family Classification without Domain Chaining.
Bioinformatics 2009 25(12):i45-i53; doi:10.1093/bioinformatics/btp207

This work extends the identification of homologous pairs to classification of entire protein families. We investigated the structure of the homology network and that inferred by Neighborhood Correlation. Of principal interest is the ability to evaluate a classification in the absence of hand-curated data, by considering intrinsic measures of that network. We demonstrated a strategy that reduces noise in and restores structure to artificial networks with simulated noise, as well as to the yeast genome homology network. We further evaluated this approach on a hand-curated set of multidomain sequences in mouse and human and demonstrated that classification using the rewired network delivers dramatic improvement in precision and recall, compared with current methods.

Contact

For assistance with, or questions about, any of the material on this page, please contact Jacob Joseph or Dannie Durand. We are always pleased to hear about new analyses.

Funding

This material is based upon work supported by the National Science Foundation (NSF) under Grant No. DBI-0641313, the National Institutes of Health (NIH) under Grant No. 1 K22 HG 02451-01, and a David and Lucile Packard Foundation fellowship. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of NSF, NIH, or the Packard Foundation.