Trail finding

CoMetGeNe models a metabolic pathway as a directed graph whose vertices are reactions. An arc from a vertex ri to a vertex rj means that reaction ri produces a metabolite that serves as a substrate for reaction rj.
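The graph model can be illustrated with a minimal sketch. It is not CoMetGeNe's own code: the function name, the input layout (a dict of substrate and product sets), and the toy pathway are assumptions made for illustration, and self-loops are left out as a simplification.

```python
def build_reaction_graph(reactions):
    """Return a dict mapping each reaction to the set of reactions it feeds.

    `reactions` maps a reaction ID to a (substrates, products) pair of sets.
    An arc (ri, rj) is added when some product of ri is a substrate of rj.
    """
    graph = {r: set() for r in reactions}
    for ri, (_, products) in reactions.items():
        for rj, (substrates, _) in reactions.items():
            # Simplification (assumption): a reaction never points to itself.
            if ri != rj and products & substrates:
                graph[ri].add(rj)
    return graph


# Hypothetical toy pathway: A -> B -> C, with r3 regenerating A from C.
toy_reactions = {
    "r1": ({"A"}, {"B"}),
    "r2": ({"B"}, {"C"}),
    "r3": ({"C"}, {"A"}),
}
toy_graph = build_reaction_graph(toy_reactions)
```

In this toy pathway the three reactions form a cycle: r1 feeds r2, r2 feeds r3, and r3 feeds r1.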

Roughly speaking, a trail is a walk in a graph that may repeat vertices but not arcs. CoMetGeNe identifies trails of metabolic reactions such that the reactions in each trail are catalyzed by enzymes encoded by neighboring genes. More specifically, for every arc (ri, rj) in the graph modeling a metabolic pathway, CoMetGeNe attempts to determine a trail T of reactions passing through the arc (ri, rj) such that:

  • the genes involved in the reactions of T are neighbors, and
  • no other trail T' passing through (ri, rj) whose reactions involve neighboring genes contains more unique reactions than T.

The first property above assumes by default that only genes on the same strand of a given chromosome are considered neighbors. To allow gene neighborhoods to span both strands, pass the -b (or --both-strands) option to CoMetGeNe.py (see user manual).

The second property above relies on the notion of span: the span of a trail T is the number of unique reactions in T.
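The two properties can be made concrete with a brute-force sketch. It enumerates every trail of a small graph and keeps one of maximum span that passes through a given arc. This is an illustration of the definitions only, not the algorithm CoMetGeNe implements. All names are hypothetical, and the gene-neighborhood test is passed in as a caller-supplied predicate because its details are not described here.

```python
def span(trail):
    """Number of unique reactions in a trail (a sequence of reaction IDs)."""
    return len(set(trail))


def trails_from(graph, start):
    """Yield every trail starting at `start`.

    An arc is never used twice within a trail, but vertices may repeat.
    """
    stack = [([start], frozenset())]
    while stack:
        trail, used_arcs = stack.pop()
        yield trail
        last = trail[-1]
        for nxt in graph.get(last, ()):
            if (last, nxt) not in used_arcs:
                stack.append((trail + [nxt], used_arcs | {(last, nxt)}))


def best_trail_through(graph, arc, genes_are_neighbors):
    """Return a maximum-span trail containing `arc` that satisfies the
    caller-supplied predicate, or None if no such trail exists.

    Exhaustive search: only suitable for very small graphs.
    """
    best = None
    for start in graph:
        for trail in trails_from(graph, start):
            arcs = set(zip(trail, trail[1:]))
            if arc in arcs and genes_are_neighbors(trail):
                if best is None or span(trail) > span(best):
                    best = trail
    return best


# Hypothetical toy graph: a cycle r1 -> r2 -> r3 -> r1 plus a branch r2 -> r4.
toy_graph = {"r1": ["r2"], "r2": ["r3", "r4"], "r3": ["r1"], "r4": []}

# Placeholder predicates (assumptions): accept every trail, or reject r4.
unrestricted = best_trail_through(toy_graph, ("r1", "r2"), lambda t: True)
without_r4 = best_trail_through(toy_graph, ("r1", "r2"), lambda t: "r4" not in t)
```

With the permissive predicate, the best trail covers all four reactions. When trails containing r4 are rejected, the best remaining trail has a span of 3.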

Flexibility is allowed in the definition of neighborhood: CoMetGeNe is able to skip a few reactions and/or genes.
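One way to picture a flexible neighborhood is sketched below. It is an assumption about how gene skipping could be expressed, not CoMetGeNe's actual criterion: genes are represented by hypothetical integer ranks along one strand, and `max_skipped_genes` is an illustrative parameter rather than a CoMetGeNe option.

```python
def are_neighbors(positions, max_skipped_genes=0):
    """Return True if consecutive gene ranks are separated by at most
    `max_skipped_genes` intervening genes.

    `positions` holds hypothetical integer ranks of genes on one strand.
    """
    ordered = sorted(set(positions))
    return all(
        later - earlier - 1 <= max_skipped_genes
        for earlier, later in zip(ordered, ordered[1:])
    )
```

For example, genes at ranks 3 and 5 are not neighbors under the strict definition, but they become neighbors once one skipped gene is tolerated.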

CoMetGeNe.py

Trail finding is performed using the CoMetGeNe.py Python script. It is provided with a user manual. You can also check out an example.

CoMetGeNe_launcher.py

If a large number of species needs to be analyzed, a substantial speedup can be attained by running CoMetGeNe.py in parallel. This functionality is provided by the script CoMetGeNe_launcher.py:

  • Metabolic pathway maps are retrieved from KEGG using 3 threads (the maximum permitted as of June 2018).
  • Genomic information is retrieved from KEGG using 2 threads (the maximum permitted as of June 2018).
  • Trail finding is performed on the maximum number of available threads (e.g., 4 or 8 for a quad-core CPU).

These values can be adjusted in CoMetGeNe_launcher.py, where the data directory for metabolic pathways as well as the species to analyze can also be specified.

Potential caveats when running in multithread mode

On machines with a fast internet connection, KEGG may block multiple queries run in parallel. If this is the case, CoMetGeNe displays an "HTTP download error" message. The workaround is to download metabolic pathway maps and genomic information from KEGG on a single thread by setting the variables kegg_max_thr_pw and kegg_max_thr_gen in CoMetGeNe_launcher.py to 1 instead of 3 and 2, respectively.
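The single-thread workaround amounts to the two assignments below. The variable names and their default values come from the description above; where exactly they appear in CoMetGeNe_launcher.py is not shown here, so treat this as a fragment to adapt rather than a patch.

```python
# In CoMetGeNe_launcher.py: query KEGG on a single thread.
kegg_max_thr_pw = 1   # threads for metabolic pathway maps (default: 3)
kegg_max_thr_gen = 1  # threads for genomic information (default: 2)
```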

A Python MemoryError may be encountered on very large data sets (hundreds or thousands of species). This happens because every thread needs to load the pickle file storing genomic information (pickle/kegg_genome_info.pickle), whose size increases with the number of species analyzed. Two workarounds are possible. The first is to decrease the number of threads on which CoMetGeNe is run by changing the value of the n_thr_cometgene variable in CoMetGeNe_launcher.py. The second is to split the data set into several smaller batches.
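The second workaround can be sketched as a small helper that splits a species list into batches to be analyzed one after another. The helper name, the batch size, and the example organism codes are illustrative assumptions. How each batch is then passed to CoMetGeNe_launcher.py depends on how the species are specified in that script.

```python
def split_into_batches(species, batch_size):
    """Split a list of species codes into consecutive batches of at most
    `batch_size` elements, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        species[i:i + batch_size]
        for i in range(0, len(species), batch_size)
    ]


# Illustrative KEGG organism codes; replace with the species to analyze.
all_species = ["eco", "bsu", "hsa", "sce", "mtu"]
batches = split_into_batches(all_species, 2)
```

Here the five species are split into three batches, the last of which holds a single species.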

Created by Alexandra Zaharia. Maintained by Alain Denise.