Metagenomic studies often need to biologically classify millions of DNA sequences, for example so-called short reads. The assignment of those reads to known reference sequences helps to assess the composition and diversity of microbial communities and allows for comparing them.
To obtain these assignments, you can use the Evolutionary Placement Algorithm (EPA) [1] or pplacer [2].
In the following description, we will focus on EPA, although both programs work similarly. Their output is standardized in the jplace file format [3], so genesis works with both of them.
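The EPA takes as input a reference tree, a reference alignment of the sequences at the tips of that tree, and the query sequences, aligned to the reference alignment.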
The algorithm finds the most likely (via maximum likelihood) insertion positions for every query sequence on the reference tree. The resulting assignment of a query sequence to a branch is called a placement. A query sequence can have multiple possible placement positions at different branches of the tree, with different likelihoods.
The likelihoods of placement positions are usually transformed into the likelihood weight ratio. For a given query sequence, those values sum up to 1.0 for all branches of the tree. They can thus be seen as a probability distribution of possible placement positions on the tree.
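To make the transformation concrete, the following is a minimal sketch that assumes the placement program reports one log-likelihood per candidate branch. The function name `likelihood_weight_ratios` is hypothetical and not part of genesis, EPA, or pplacer. Each likelihood is divided by the sum of the likelihoods over all branches, which is why the results sum to 1.0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Convert the per-branch log-likelihoods of one query sequence into
// likelihood weight ratios (LWRs) that sum to 1.0.
// Hypothetical helper for illustration; not part of any library.
std::vector<double> likelihood_weight_ratios( std::vector<double> const& log_likelihoods )
{
    assert( !log_likelihoods.empty() );

    // Log-likelihoods are typically large negative numbers, so we subtract
    // the maximum before exponentiating to avoid numerical underflow.
    // This shift cancels out in the ratio below.
    double const max_ll = *std::max_element(
        log_likelihoods.begin(), log_likelihoods.end()
    );

    std::vector<double> lwrs;
    lwrs.reserve( log_likelihoods.size() );
    double sum = 0.0;
    for( double ll : log_likelihoods ) {
        double const rel = std::exp( ll - max_ll );
        lwrs.push_back( rel );
        sum += rel;
    }

    // Normalize, so that the values form a probability distribution.
    for( double& lwr : lwrs ) {
        lwr /= sum;
    }
    return lwrs;
}
```

For example, calling the function with the log-likelihoods -1000, -1000, and -1002 gives the first two branches equal weight and the third branch a smaller one.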
The set of placements for a query sequence is called a Pquery. It contains a name (usually, that is the name of the original query sequence) and the placements with their features (e.g., an ID of the edge where the placement is located, its likelihood, etc.). See [3] for details.
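As a rough, library-independent picture of this data, a Pquery could be modeled as below. The structs and the function `most_likely_placement` are hypothetical simplifications and not the genesis classes. The member names follow the field names `edge_num`, `likelihood`, and `like_weight_ratio` of the jplace format [3].

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One possible position of a query sequence on the reference tree.
// Hypothetical, simplified type for illustration only.
struct Placement {
    std::size_t edge_num;          // ID of the edge where the placement is located
    double      likelihood;        // likelihood value, usually given as a log-likelihood
    double      like_weight_ratio; // likelihood weight ratio of this position
};

// All placements of a query sequence, together with its name(s).
struct Pquery {
    std::vector<std::string> names;
    std::vector<Placement>   placements;
};

// Return the placement with the highest likelihood weight ratio.
Placement const& most_likely_placement( Pquery const& pquery )
{
    assert( !pquery.placements.empty() );
    return *std::max_element(
        pquery.placements.begin(), pquery.placements.end(),
        []( Placement const& lhs, Placement const& rhs ) {
            return lhs.like_weight_ratio < rhs.like_weight_ratio;
        }
    );
}
```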
Genesis has classes and functions to work with all relevant data of evolutionary placement. This tutorial focuses on the actual placement data, that is, Pqueries with their placement positions. For the related topics of trees and sequences, see the tutorial pages Tree Basics and Sequence.
The most important class for evolutionary placement is the Sample. A Sample is a representation of a whole jplace file: it stores the reference tree and a set of Pqueries.
To read the data from a jplace file into a Sample, use a JplaceReader.
Each Pquery contains the Placements of a query sequence, as well as any Names associated with it.
You can also add Pqueries to a Sample or find existing ones.
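The idea of adding and finding Pqueries can be sketched without the library as follows. All types and functions here (`Sample`, `Pquery`, `add_pquery`, `find_pquery`) are hypothetical stand-ins written for this illustration; they are not the genesis API, so please consult the API reference for the actual calls.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-ins for illustration only.
struct Pquery {
    std::vector<std::string> names;
    std::vector<std::size_t> edge_nums; // placements reduced to their edge IDs
};

struct Sample {
    std::string         reference_tree; // e.g., a Newick string
    std::vector<Pquery> pqueries;
};

// Add a new Pquery with the given name and return a reference to it.
// Note: the reference is invalidated when further Pqueries are added.
Pquery& add_pquery( Sample& sample, std::string const& name )
{
    assert( !name.empty() );
    sample.pqueries.push_back( Pquery{} );
    sample.pqueries.back().names.push_back( name );
    return sample.pqueries.back();
}

// Find the first Pquery that has the given name.
// Returns nullptr if there is no such Pquery.
Pquery* find_pquery( Sample& sample, std::string const& name )
{
    for( Pquery& pquery : sample.pqueries ) {
        for( std::string const& pname : pquery.names ) {
            if( pname == name ) {
                return &pquery;
            }
        }
    }
    return nullptr;
}
```

Returning a pointer from the search function lets the caller distinguish "not found" from a valid result without throwing an exception.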
Placements with certain properties can be removed from a Sample; this is called filtering.
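One common kind of filter removes placements whose likelihood weight ratio is below a threshold. The sketch below shows the principle on a plain vector of placements. The `Placement` struct and the function `filter_min_lwr` are hypothetical and only serve as an illustration; they are not the genesis filter functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified placement type for illustration only.
struct Placement {
    std::size_t edge_num;
    double      like_weight_ratio;
};

// Remove all placements whose likelihood weight ratio is below the threshold.
// The relative order of the remaining placements is kept.
void filter_min_lwr( std::vector<Placement>& placements, double threshold )
{
    assert( threshold >= 0.0 && threshold <= 1.0 );

    // Erase-remove idiom: move the elements to keep to the front,
    // then cut off the rest.
    placements.erase(
        std::remove_if(
            placements.begin(), placements.end(),
            [threshold]( Placement const& p ) {
                return p.like_weight_ratio < threshold;
            }
        ),
        placements.end()
    );
}
```

After such a filter, the remaining likelihood weight ratios of a Pquery no longer sum to 1.0. Whether to renormalize them afterwards depends on the analysis and is left out of this sketch.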
Writing your results back to a new jplace file is done using a JplaceWriter.
You can find the whole code listing of this tutorial at ./doc/tutorials/samples.cpp
See the API reference for details and for more functions and classes related to this topic.
Please refer to the following articles for more information on phylogenetic placement of short reads:
[1] S. Berger, D. Krompass, and A. Stamatakis, Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood, Syst. Biol., vol. 60, no. 3, pp. 291–302, 2011. DOI: 10.1093/sysbio/syr010
[2] F. A. Matsen, R. B. Kodner, and E. V. Armbrust, pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree, BMC Bioinformatics, vol. 11, no. 1, p. 538, 2010. DOI: 10.1186/1471-2105-11-538
[3] F. A. Matsen, N. G. Hoffman, A. Gallagher, and A. Stamatakis, A format for phylogenetic placements, PLoS One, vol. 7, no. 2, pp. 1–4, Jan. 2012. DOI: 10.1371/journal.pone.0031009