A taxonomy is a classification of groups of biological organisms on the basis of shared characteristics. Typically, a taxonomy forms a hierarchy of taxonomic names, where higher level taxa are more general groups of organisms that subsume lower, more specialized taxonomic levels.
In genesis, we model this hierarchy using two classes, Taxonomy and Taxon. See the description of these classes for details.
Furthermore, we call a string of the form
Animalia;Vertebrata;Mammalia;Carnivora
a taxonomic path. Such strings are often used in taxonomic databases, and usually use semicolons to separate their parts. We model a taxonomic path in the Taxopath class, which basically just consists of a std::vector< std::string > to store the individual elements of the path. This class is a helper to read from databases, or to write taxonomies in human-readable formats.
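The relation between a taxonomic path string and its list of elements can be sketched in a few lines of standalone C++. This is only an illustration under stated assumptions: split_taxopath() and usage_example() are hypothetical names, not part of genesis, and the choice to drop a trailing delimiter is an assumption made for this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of genesis: split a taxonomic path string
// into its elements at each occurrence of the delimiter.
std::vector<std::string> split_taxopath( std::string const& path, char delimiter = ';' )
{
    std::vector<std::string> elements;
    std::string current;
    for( char c : path ) {
        if( c == delimiter ) {
            elements.push_back( current );
            current.clear();
        } else {
            current += c;
        }
    }

    // Keep the last element, unless the path ends with a delimiter
    // (an assumption of this sketch: a trailing delimiter adds no element).
    if( !current.empty() ) {
        elements.push_back( current );
    }
    return elements;
}

// Usage on the example path from the text.
void usage_example()
{
    auto const parts = split_taxopath( "Animalia;Vertebrata;Mammalia;Carnivora" );
    assert( parts.size() == 4 );
    assert( parts.front() == "Animalia" );
    assert( parts.back() == "Carnivora" );
}
```

The resulting vector has the same shape as the element list that the text describes for Taxopath; the class itself may handle details such as trimming or empty elements differently.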
In the following, we assume the following headers and namespaces are used:
Taxonomies can be stored in several formats. We find the SILVA format the most convenient to work with, but also support reading the NCBI format.
We use the SILVA Taxonomy as an example database here. Their taxonomy file (here, tax_slv_ssu_123.1.txt) starts like this:
Archaea; 2 domain
Archaea;Aenigmarchaeota; 11084 phylum 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis; 11085 class 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order; 11086 order 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order;Unknown Family; 11087 family 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order;Unknown Family;Candidatus Aenigmarchaeum; 11088 genus 123
...
That is, each line first contains the taxonomic path, followed by some metadata. A Taxon can store the rank (domain, phylum, class, order, family, genus, etc.), which is found in the third field of the database (at position 2 for zero-based counting).
Let's read that file:
The position of the name field is initialized to 0 by default, so setting it explicitly is superfluous and is done here only for reference. Then, we set the position (column of the file) where the rank is found. By default, this is set to -1, meaning that the rank is omitted. In short, with the defaults, any database where the first column of a tab-separated table contains the taxonomic names can be read in one line of code.
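The role of the two field positions can be illustrated with a small standalone sketch. It does not use the genesis reader: TaxonLine, split_fields() and parse_line() are hypothetical names introduced here, and the only behavior taken from the text is that the name position defaults to 0 and that a rank position of -1 means no rank is read.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one line of a taxonomy table; not a genesis class.
struct TaxonLine
{
    std::string path;
    std::string rank; // stays empty if no rank is read
};

// Split a line into its tab-separated fields. Empty fields are kept,
// so that field positions stay stable.
std::vector<std::string> split_fields( std::string const& line )
{
    std::vector<std::string> fields( 1 );
    for( char c : line ) {
        if( c == '\t' ) {
            fields.emplace_back();
        } else {
            fields.back() += c;
        }
    }
    return fields;
}

// Extract the path and, optionally, the rank from the given zero-based columns.
// A rank position of -1 means that no rank is read.
TaxonLine parse_line( std::string const& line, int name_pos = 0, int rank_pos = -1 )
{
    auto const fields = split_fields( line );

    // Precondition of this sketch: the name column has to exist.
    assert( name_pos >= 0 && static_cast<std::size_t>( name_pos ) < fields.size() );

    TaxonLine result;
    result.path = fields[ static_cast<std::size_t>( name_pos ) ];
    if( rank_pos >= 0 && static_cast<std::size_t>( rank_pos ) < fields.size() ) {
        result.rank = fields[ static_cast<std::size_t>( rank_pos ) ];
    }
    return result;
}
```

With these defaults, parse_line( "Archaea;\t2\tdomain" ) yields only the path, while parse_line( "Archaea;\t2\tdomain", 0, 2 ) also fills in the rank "domain".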
The NCBI Database has a more complex format, with a "node" file that defines the hierarchy of the taxonomy, and a "name" file that defines the corresponding names of the nodes. This format can be read by read_ncbi_taxonomy(). Typically, the files are named "nodes.dmp" and "names.dmp", respectively.
Writing can be done with the TaxonomyWriter. It simply takes a Taxonomy and writes its taxonomic paths to a file, potentially with the ranks as well:
It is possible to set the delimiter between taxonomic path elements (; by default), and other details. We currently do not support writing the NCBI format, as it is quite cumbersome to use anyway.
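As a rough illustration of what writing a taxonomic path with a configurable delimiter involves, the following standalone sketch joins path elements and optionally appends a rank. The names join_taxopath(), format_line() and usage_example() are hypothetical, and the output layout (a tab between path and rank, no trailing delimiter) is an assumption of this sketch, not a description of the files that TaxonomyWriter produces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: join the elements of a taxonomic path with a delimiter.
std::string join_taxopath(
    std::vector<std::string> const& elements,
    std::string const& delimiter = ";"
) {
    std::string result;
    for( std::size_t i = 0; i < elements.size(); ++i ) {
        if( i > 0 ) {
            result += delimiter;
        }
        result += elements[i];
    }
    return result;
}

// Hypothetical helper: one output line, consisting of the path and,
// if a rank is given, a tab followed by the rank (layout assumed here).
std::string format_line(
    std::vector<std::string> const& elements,
    std::string const& rank = "",
    std::string const& delimiter = ";"
) {
    std::string line = join_taxopath( elements, delimiter );
    if( !rank.empty() ) {
        line += '\t';
        line += rank;
    }
    return line;
}

// Usage with the default delimiter and with a custom one.
void usage_example()
{
    std::vector<std::string> const path = { "Archaea", "Aenigmarchaeota" };
    assert( format_line( path ) == "Archaea;Aenigmarchaeota" );
    assert( format_line( path, "phylum" ) == "Archaea;Aenigmarchaeota\tphylum" );
    assert( format_line( path, "", "|" ) == "Archaea|Aenigmarchaeota" );
}
```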
See also TaxopathParser and TaxopathGenerator for additional tools to parse and generate taxonomic paths. These are used in the background by the TaxonomyReader and TaxonomyWriter.
Similar to the tree traversal, a Taxonomy can be traversed. As the algorithms that are typically performed on a Taxonomy are less involved than the ones on Trees, we use a simpler approach here, that takes a function to be applied to each Taxon in the Taxonomy:
Furthermore, for the common use case of preorder traversal, we offer an iterator similar to the one used for Trees:
This prints the taxonomy, indented according to the level in the hierarchy of each Taxon, followed by the rank (genus etc. in plain text) of each Taxon, if present.
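Both ideas, applying a function to every taxon and visiting taxa in preorder, can be sketched on a minimal stand-in data structure. Node, preorder(), render() and usage_example() are hypothetical names that do not reflect the genesis classes or iterators, and the output format (two spaces per level, rank in parentheses) is an assumption chosen for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical minimal stand-in for a taxon; not the genesis Taxon class.
struct Node
{
    std::string name;
    std::string rank;           // empty if no rank is known
    std::vector<Node> children; // lower taxonomic levels
};

// Visit each node in preorder (parent before its children),
// passing the node and its depth in the hierarchy to the given function.
void preorder(
    Node const& node,
    std::function<void( Node const&, std::size_t )> const& visit,
    std::size_t depth = 0
) {
    visit( node, depth );
    for( auto const& child : node.children ) {
        preorder( child, visit, depth + 1 );
    }
}

// Render one line per node: indentation by depth, then the name,
// then the rank in parentheses if present.
std::string render( Node const& root )
{
    std::ostringstream out;
    preorder( root, [&out]( Node const& node, std::size_t depth ) {
        out << std::string( 2 * depth, ' ' ) << node.name;
        if( !node.rank.empty() ) {
            out << " (" << node.rank << ")";
        }
        out << "\n";
    });
    return out.str();
}

// Usage on the first two levels of the SILVA excerpt.
void usage_example()
{
    Node const root{ "Archaea", "domain", { Node{ "Aenigmarchaeota", "phylum", {} } } };
    assert( render( root ) == "Archaea (domain)\n  Aenigmarchaeota (phylum)\n" );
}
```

Passing the visitor as a function keeps the traversal logic in one place, which matches the motivation given above for preferring a simple approach over the more involved tree iterators.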
Similar to the Data Model that we use for the Tree class, a Taxon offers a data model to store additional data on top of the name and rank that we have already seen above. Any class that is derived from BaseTaxonData can be used.
Currently, we only use this to store additional data in our PhAT method. The method stores information about all the Sequences that belong to each Taxon, as well as the entropy of these sequences. Hence, we use this as an example here:
For an in-depth example of how to use this, see the PhAT implementation.