ParalogMatching.jl documentation

Introduction

This package implements the paralog matching technique presented in the paper "Simultaneous identification of specifically interacting paralogs and interprotein contacts by Direct Coupling Analysis" by Thomas Gueudré, Carlo Baldassi, Marco Zamparo, Martin Weigt and Andrea Pagnani, Proc. Natl. Acad. Sci. U.S.A. 201607570 (2016), doi:10.1073/pnas.1607570113

The main idea of the method is to perform a statistical analysis of two given multiple sequence alignments, each containing one protein family. Each familiy should comprise several species, and each species may have several sequences belonging to the family. The algorithm tries to associate (match) interacting partners from the two families within each species.

The underlying main assumption is that the proper matching is the one maximizing the co-evolution signal. Such maximization is performed over the Bayesian inference of a Gaussian model, by inverting the correlation matrix.

The code is written in Julia, and the functions are called from within Julia. However, a command-line interface is also provided for those unfamiliar with the language.

The current code was tested on Julia versions 0.4 and 0.5.

Installation

To install the module, use this command from within Julia:

julia> Pkg.clone("https://github.com/Mirmu/ParalogMatching.jl")

Dependencies will be installed automatically.

However, you will also need to install at least one linear programming solver supported by MathProgBase. See the list of available solvers at the JuliaOpt page. Note that the solver efficiency is not particularly important for paralog matching, whose computational time is dominated by matrix inversion operations, therefore you don't need a particularly fast solver. If unsure, use Pkg.add("Clp") or Pkg.add("GLPKMathProgInterface"), which are free and open-source solvers.

Usage

The module is loaded as any other Julia module:

julia> using ParalogMatching

The module provides a high-level interface, with one function which performs all operations and writes a file in output, and the low-level interface functions which perform the single steps of the algorithm.

A typical run of the algorithm could look like this:

julia> TrpAB, match = paralog_matching("TrpA.fasta", "TrpB.fasta", "Match_TrpAB.fasta.gz", batch=5);

High-level interface

# ParalogMatching.paralog_matchingFunction.

paralog_matching(infile1::AbstractString, infile2::AbstractString, outfile::AbstractString;
         cutoff = 500,
         batch = 1,
         strategy = "covariation",
         lpsolver = nothing)

This function performs the paralog matching from two given FASTA files containing the alignments for two different protein families. When the matching is done, it writes the result in a new FASTA file, in which each sequence is the concatenation of two mathing sequences in the original file.

The input format is standard FASTA, but the headers are parsed as explained in prepare_alignments.

The output file format is documented in write_fasta_match.

The keyword arguments are documented in prepare_alignments and run_matching.

Besides writing the outfile, this function also returns two values: the harmonized alignment (produced by prepare_alignments) and the resulting match (as returned by run_matching).

source

Low-level interface

The low-level interface functions are called by paralog_matching in the order in which they are documented here, but they can be called individually in order to inspect/manipulate the intermediate results if desired.

# ParalogMatching.read_fasta_alignmentFunction.

read_fasta_alignment(filename, max_gap_fraction=1.0; header_regex=nothing)

Parse a (possibly gzipped) FASTA file, converting it into an alignment which can be used in the Gaussian model used by ParalogMatching. Optionally, discards the sequences which have more than a fraction max_gap_fraction of missing values (gaps) in the alignment.

The function automatically tries to parse the species names and the Uniprot ID, using a standard format specification. The species names are then used in the matching procedure; the Uniprot IDs are only used for writing the output files.

You can parse arbitrary files by providing a custom header_regex (see also the Julia regex documentation). In its simples form, it needs to have at least two capture groups, the first one returning the Uniprot ID and the second the species name (additional capture groups are ignored). For example, if the headers in your FASTA file look like this:

>SOMELOCATION/SOMESPECIES/OTHERINFO

then you can use a regex like this one: read_fasta_alignment(..., header_regex=r"^([^/]+)/([^/]+)/"), i.e. line start, anything except a slash (captured), followed by a slash, then anything except a slash (captured), then a slash — the remainder of the line is then simply ignored. In more complicated cases (e.g. if the UniprotID and the species name are out of order), you can use named capture groups: in this case the header_regex needs to contain both an id group and a species group (all additional groups are ignored). For example, if the headers in your FASTA file look like this:

>SOMESPECIES/OTHERINFO/SOMELOCATION

then you can use a regex like this one: header_regex=r"^(?<species>[^/]+)/([^/]+)/(?<id>.+)", i.e. line start, anything except a slash (capture as the species name), followed by a slash, then anything except a slash (captured but ignored), followed by a slash, then the rest of the line (captured as the ID).

source

# ParalogMatching.prepare_alignmentsFunction.

prepare_alignments(X1::Alignment, X2::Alignment; cutoff::Integer = 500)

Prepares two alignments (as returned by read_fasta_alignment) and returns a HarmonizedAlignments object. This object contains two filtered version of the original alignments, in which only the sequences belonging to species which exist in both alignments are kept, and which is ready to be passed to run_matching.

The cutoff keyword argument can be used to discard all species for which there are more than a certain number of sequences in either alignment. Use 0 to disable this filter entirely.

source

# ParalogMatching.run_matchingFunction.

run_matching(X12::HarmonizedAlignments;
     batch = 1,
     strategy = "covariation",
     lpsolver = nothing)

Returns the matching of the two alignments contained in X12, which need to be obtained by prepare_alignments. The returned matching is a Vector in which the entry i determines which sequence of the second alignment is the partner to sequence i in the first alignment.

The keywords are:

      The default (`nothing`) uses the default solver as detected automatically by `MathProgBase`.
      You can override this by passing e.g. `lpsolver = GurobiSolver(OutputFlag=false)` or similar
      (see the documentation for `MathProgBase`).

Available strategies are:

       same operon (only used for testing, not a valid general strategy)

source

# ParalogMatching.write_fasta_matchFunction.

write_fasta_match(X12::HarmonizedAlignments, match::Vector{Int}, outfile::AbstractString)

Writes a new FASTA file obtained from the two alignments contained in the X12 object and the matching contained in match.

The sequences headers in the output file have the format >ID1::ID2/SPECIES, where ID1 represents the ID read from the first alignment, ID2 that for the second alignment, and SPECIES is the species name.

The new sequences in the output file are the concatenation of the matched sequences.

source

Command-line interface

In the utils directory there is a script which can be called from the command line, called paralog_matching.jl. Try calling it with:

$ julia paralog_matching.jl --help

to see the details. The arguments and options are basically the same as those of the paralog_matching function.