Neighborhood Correlation

Neighborhood Correlation is a novel homology identification method based on the observation that gene duplication and domain insertion result in different topological structures in the sequence similarity network. For details of Neighborhood Correlation, please refer to the publication:

Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins
Song N, Joseph JM, Davis GB, Durand D
PLoS Computational Biology 4(5): e1000063
doi:10.1371/journal.pcbi.1000063

Mouse & Human Homology Results Browser

In the PLoS paper above, we applied Neighborhood Correlation to all full-length mouse and human amino acid sequences in SwissProt Version 50.9. In an empirical validation of pairwise homology identification performance on twenty manually curated families, we showed that Neighborhood Correlation achieves high sensitivity and specificity in both single-domain and complex multidomain families. It outperforms traditional methods that combine sequence similarity with additional criteria based on alignment length.

To examine the performance on individual families and sequences in the mouse and human data set, please see the Neighborhood Correlation Browser. The Browser allows exploratory analysis of the neighborhood structure of the protein sequence similarity network. The user may:

select a protein sequence of interest by keyword search,
visit one of our twenty curated families, and
browse the protein sequences in our initial dataset.

Download Neighborhood Correlation

We make available an open-source (GPL) implementation of Neighborhood Correlation to demonstrate our algorithms and to facilitate novel analysis of additional data sets.

Neighborhood Correlation Version 2.1

Version 2.1 further improves the performance of Neighborhood Correlation, in two ways:

To produce Neighborhood Correlation scores for all pairs of sequences, previous versions iterated over all N^2 pairs, for N input sequences. This version progressively iterates through the neighborhood of each query sequence, resulting in N * M pairwise calculations, where M is the number of sequences in the neighborhood of each query sequence, plus the number of sequences in the neighborhoods of those sequences. For large datasets, this optimization is extremely beneficial.
Neighborhood Correlation first makes the input BLAST scores symmetric: BIT-score(x,y) = max(BIT-score(x,y), BIT-score(y,x)). This version improves the efficiency of this calculation by using a compiled C function.
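
The two steps above can be illustrated with a minimal Python sketch. This is not the code shipped in the package (which uses a compiled C function for the symmetrization); the function names `symmetrize` and `candidate_pairs` are hypothetical, and the sketch assumes that the "neighborhood" of a sequence is the set of sequences with which it shares a BLAST score. Self-pairs are skipped for simplicity.

```python
from collections import defaultdict

def symmetrize(scores):
    """Collapse directed BLAST bit scores into one score per unordered pair,
    keeping max(BIT-score(x,y), BIT-score(y,x))."""
    sym = {}
    for (x, y), bits in scores.items():
        key = (x, y) if x <= y else (y, x)
        sym[key] = max(bits, sym.get(key, bits))
    return sym

def candidate_pairs(sym):
    """Enumerate pairs reachable through neighborhoods instead of all N^2 pairs.

    For each query x, the candidates are the neighbors of x plus the
    neighbors of those neighbors (an assumed reading of the text above).
    """
    nbrs = defaultdict(set)
    for (x, y) in sym:
        nbrs[x].add(y)
        nbrs[y].add(x)
    pairs = set()
    for x in list(nbrs):
        reach = set(nbrs[x])
        for n in nbrs[x]:
            reach |= nbrs[n]
        reach.discard(x)  # skip the self-pair in this sketch
        for y in reach:
            pairs.add((x, y) if x <= y else (y, x))
    return pairs

# Toy input: five sequences, with an asymmetric score between A and B.
blast = {("A", "B"): 50.0, ("B", "A"): 62.0, ("B", "C"): 40.0, ("D", "E"): 30.0}
sym = symmetrize(blast)
pairs = candidate_pairs(sym)
print(sym[("A", "B")])
print(sorted(pairs))
```

In this toy network, A and C are paired because they share the neighbor B, while no pair is formed between the {A, B, C} and {D, E} groups, so fewer than all possible pairs are visited.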

BUG FIX: Version 2.0 was released with the LOG_10 transformation of the input inadvertently disabled. Version 2.1 restores the correct behavior by using LOG_10(BIT-score) for all internal calculations.
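
As a rough illustration of the transformation involved, the sketch below applies a base-10 logarithm to a table of bit scores. The function name `log_transform` is hypothetical, and dropping non-positive scores is an assumption made here so that the logarithm is defined; how the package itself treats such scores is not described on this page.

```python
import math

def log_transform(scores):
    """Return LOG_10(BIT-score) for every pair with a positive bit score."""
    return {pair: math.log10(bits) for pair, bits in scores.items() if bits > 0}

bit_scores = {("A", "B"): 100.0, ("B", "C"): 1000.0, ("C", "D"): 0.0}
logged = log_transform(bit_scores)
print(logged)
```

The logarithm compresses the range of the scores: a tenfold difference in bit score becomes a difference of one unit.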

README: Program documentation
neighborhood_correlation-2.1.tar.gz : Installation package and source code.

© 2011 Jacob Joseph and Carnegie Mellon University. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

Neighborhood Correlation Version 2.0

Version 2.0 is a complete rewrite of the Neighborhood Correlation implementation. It is meant to replace Version 1.0 (previously referred to as the "reference implementation"). Version 2.0 has been optimized to accommodate large datasets through fast computation and greatly reduced memory usage.

This implementation adds a dependency on the NumPy numerical package and requires that a C compiler be available on the system. We believe it to be platform independent, and have tested it on Linux and MacOS.

Performance is greatly improved over Version 1.0. As a rough guide, the set of mouse and human sequences used in our analysis included 26,197 sequences. From this, all-against-all BLAST yielded approximately 4.8 million pairwise relations. For this dataset, Neighborhood Correlation Version 2.0 can be expected to consume approximately 125 MB of memory. Running time for this dataset is approximately 45 minutes on a 3.2 GHz Intel Pentium D. Version 1.0 required more than 1 GB of memory and 16 hours of running time.
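
To put these figures in perspective, the short calculation below works through the arithmetic using only the numbers quoted above. The variable names are illustrative, and the average number of BLAST relations per sequence is only a rough proxy for the size of a typical neighborhood, which is not stated here.

```python
# Figures quoted in the text above.
n_sequences = 26197
n_blast_relations = 4800000  # "approximately 4.8 million"

# Number of ordered pairs an all-against-all scan would visit.
all_ordered_pairs = n_sequences ** 2

# Average number of BLAST relations per sequence (rough neighborhood proxy).
mean_relations_per_sequence = n_blast_relations / n_sequences

# Version 1.0 versus Version 2.0: 16 hours versus 45 minutes,
# and more than 1 GB (taken as 1000 MB) versus 125 MB.
speedup = (16 * 60) / 45
memory_ratio = 1000 / 125

print(all_ordered_pairs)
print(round(mean_relations_per_sequence, 1))
print(round(speedup, 1), memory_ratio)
```

The BLAST relations cover well under one percent of the N^2 possible pairs, which is why restricting work to pairs connected through the similarity network pays off on large datasets.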

If you are working with a small dataset (1-2 million BLAST scores) and don't care to install NumPy, give Version 1.0 a try. The input and output are equivalent, with one exception: Version 1.0 reported NC scores for pairs that satisfied the condition (NC(x,y) ≥ nc_thresh || BLAST score (x,y) exists). Version 2.0 simplifies this to only (NC(x,y) ≥ nc_thresh).
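
The difference between the two reporting conditions can be sketched as follows. The function names and the dictionary-based data layout are hypothetical and chosen for illustration; they do not reflect the actual input or output formats of either version.

```python
def reported_v1(nc, blast_pairs, nc_thresh):
    """Version 1.0 condition: NC(x,y) >= nc_thresh OR a BLAST score exists."""
    return {p for p, score in nc.items() if score >= nc_thresh or p in blast_pairs}

def reported_v2(nc, nc_thresh):
    """Version 2.0 condition: NC(x,y) >= nc_thresh only."""
    return {p for p, score in nc.items() if score >= nc_thresh}

# Toy NC scores; only the pair (A, C) has a direct BLAST score.
nc = {("A", "B"): 0.90, ("A", "C"): 0.02, ("B", "C"): 0.30}
blast_pairs = {("A", "C")}

v1 = reported_v1(nc, blast_pairs, nc_thresh=0.05)
v2 = reported_v2(nc, nc_thresh=0.05)
print(sorted(v1))
print(sorted(v2))
print(sorted(v1 - v2))
```

In this toy case the only pair that Version 1.0 reports and Version 2.0 omits is (A, C), which has a BLAST score but an NC score below the threshold.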

README: Program documentation
neighborhood_correlation-2.0.tar.gz : Installation package and source code.

© 2009 Jacob Joseph and Carnegie Mellon University. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

Neighborhood Correlation Version 1.0

This is the original "reference implementation" used to demonstrate the algorithms in the PLoS publication. We have focused upon an intuitive implementation with readable code. This program requires only a basic Python installation, and has no additional dependencies. It has been tested with Python version 2.5 on a Linux computer. It has no OS-specific requirements and should work on any complete Python installation.

readme.txt : Program documentation
NC_standalone-1.0.tar.gz : Executable code.

© 2008 Jacob Joseph, Nan Song, and Carnegie Mellon University. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

Supplemental Data

We predicted mouse and human homologs using Neighborhood Correlation. Those predictions, our manually curated validation set, and a reference implementation are available here.

The PLoS Computational Biology study was carried out on all full-length mouse and human amino acid sequences in SwissProt Version 50.9 (11,553 mouse protein sequences and 14,644 human protein sequences). Data used in the study and predictions made using our method are available here:

FASTA sequences for all 26,197 human and mouse sequences used in the study.
Human and mouse BLAST scores : All against All BLAST scores for the Human and Mouse sequence data set.
Homologous Family Benchmark : SwissProt accession identifiers for all sequences in each family of our manually curated benchmark.
Homologous Family Benchmark : Panther 7.0 identifiers for all sequences in each family of our manually curated benchmark. The Panther dataset is newer than the SwissProt dataset used in the original PLoS paper, and contains family members that were not in SwissProt at the time. This is our most current annotation set. (Updated 23 Aug 2011)
Pfam annotations for each sequence used in our study.
Mouse & Human NC Scores : The complete set of Neighborhood Correlation scores for all sequence pairs in our dataset that satisfy NC ≥ 0.05 or BLAST E < 10.
Novel predictions of mouse and human homologs using our method (NC ≥ 0.6).

Recent Publications

Jacob M Joseph and Dannie Durand (2009). Family Classification without Domain Chaining. Bioinformatics 25(12):i45-i53; doi:10.1093/bioinformatics/btp207

This work extends the identification of homologous pairs to classification of entire protein families. We investigated the structure of the homology network and of the network inferred by Neighborhood Correlation. Of principal interest is the ability to evaluate a classification in the absence of hand-curated data, by considering intrinsic measures of that network. We demonstrated a strategy that reduces noise in and restores structure to artificial networks with simulated noise, as well as to the yeast genome homology network. We further evaluated this approach on a hand-curated set of multidomain sequences in mouse and human, and demonstrated that classification using the rewired network delivers dramatic improvement in precision and recall compared with current methods.

Contact

For assistance with, or questions about, any of the material on this page, please contact Jacob Joseph or Dannie Durand. We are always pleased to hear about new analyses.

Funding

This material is based upon work supported by the National Science Foundation (NSF) under Grant No. DBI-0641313, the National Institutes of Health (NIH) under Grant No. 1 K22 HG 02451-01, and a David and Lucile Packard Foundation fellowship. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of NSF, NIH, or the Packard Foundation.