#include <genesis/taxonomy/formats/taxonomy_reader.hpp>
Read Taxonomy file formats.
This reader populates a Taxonomy.
Exemplary usage:
    std::string infile = "path/to/taxonomy.txt";
    Taxonomy tax;
    TaxonomyReader()
        .rank_field_position( 2 )
        .expect_strict_order( true )
        .read( utils::from_file( infile ), tax );
It expects one taxon per input line. This line can also contain other information, for example
Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales; 14 order 119
In order to separate the fields of the input, a CsvReader is used. By default, all its properties except for the separator chars are left at their default values. The separator char is set to a tab (\t) instead of a comma, as this is more common for taxonomy files.
Use the getter csv_reader() to access the CsvReader and change its behaviour, for example, to change the field separator char. Also, all other properties of the CsvReader can be adjusted in order to suit any char-separated input format.
Once the fields of a line are split, this reader uses its properties name_field_position(), rank_field_position() and id_field_position() to determine which of the fields represent the taxon name, its rank, and its ID, respectively. For example, given the line from above, those would have to be set to 0, 2, and 1, respectively. That is, the first field of the line is the name of the Taxon, the third ("order") the rank, and the second ("14") its ID. All other fields of the line are ignored, which in the example is the field "119".
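The field selection described above can be illustrated with a small standalone sketch. It does not use the library's CsvReader. The names ParsedLine, split_fields() and select_fields() are hypothetical and exist only for this illustration, and the splitting ignores quoting and escaping, which a real CSV reader would handle.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper type (not part of the library): the values taken from one line.
struct ParsedLine {
    std::string name;
    std::string rank;
    std::string id;
};

// Split a line at every occurrence of the separator char (a tab by default).
// Simplification: no quoting or escaping is handled here.
std::vector<std::string> split_fields( std::string const& line, char sep = '\t' )
{
    std::vector<std::string> fields;
    std::string current;
    for( char c : line ) {
        if( c == sep ) {
            fields.push_back( current );
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back( current );
    return fields;
}

// Pick name, rank, and ID by zero-based position.
// A position of -1 skips the rank or the ID; the name cannot be skipped.
ParsedLine select_fields(
    std::vector<std::string> const& fields, int name_pos, int rank_pos, int id_pos
) {
    if( name_pos < 0 ) {
        throw std::invalid_argument( "The name field cannot be skipped." );
    }
    auto get = [&]( int pos ) -> std::string {
        if( pos < 0 ) {
            return std::string();
        }
        if( static_cast<std::size_t>( pos ) >= fields.size() ) {
            throw std::out_of_range( "Line has too few fields." );
        }
        return fields[ static_cast<std::size_t>( pos ) ];
    };
    return ParsedLine{ get( name_pos ), get( rank_pos ), get( id_pos ) };
}
```

Calling select_fields( split_fields( line ), 0, 2, 1 ) on the tab-separated example line yields the taxonomic path as name, "order" as rank, and "14" as ID, while the field "119" is never touched.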
The taxon name is expected to be a taxonomic path string, that is, a string consisting of the different parts of the taxonomic hierarchy, usually separated by semicolons. See Taxopath for a description of the expected format.
This string is split into its Taxa using a TaxopathParser. In order to change the behaviour of this splitting, access the parser via taxopath_parser().
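As a rough illustration of what splitting a taxonomic path means, the following standalone sketch breaks a path at semicolons and ignores a trailing delimiter. It is not the TaxopathParser itself, which offers its own settings. The function name split_taxopath() is hypothetical.

```cpp
#include <string>
#include <vector>

// Split a taxonomic path such as "Archaea;Crenarchaeota;Thermoprotei;"
// into its taxon names. Empty elements in the middle of the path are kept as-is.
std::vector<std::string> split_taxopath( std::string const& path, char delim = ';' )
{
    std::vector<std::string> parts;
    std::string current;
    for( char c : path ) {
        if( c == delim ) {
            parts.push_back( current );
            current.clear();
        } else {
            current += c;
        }
    }
    // Keep the last element only if it is non-empty, so that a trailing
    // delimiter does not produce an empty taxon at the end.
    if( !current.empty() ) {
        parts.push_back( current );
    }
    return parts;
}
```

With this sketch, "Archaea;Crenarchaeota;Thermoprotei;" and "Archaea;Crenarchaeota;Thermoprotei" both give the same three names.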
In summary, by default, this reader reads tab-separated lines and expects the taxonomy entry to be the first (or only) field in the line and to be a taxonomic path in the format described at Taxopath.
Definition at line 108 of file taxonomy_reader.hpp.
Public Member Functions
TaxonomyReader ()
    Default constructor.
TaxonomyReader (TaxonomyReader &&)=default
TaxonomyReader (TaxonomyReader const &)=default
~TaxonomyReader ()=default
utils::CsvReader & csv_reader ()
    Get the CsvReader used for reading a taxonomy file.
bool expect_strict_order () const
    Return whether currently the reader expects a strict order of taxa.
TaxonomyReader & expect_strict_order (bool value)
    Set whether the reader expects a strict order of taxa.
int id_field_position () const
    Get the currently set position of the field in each line where the ID is located.
TaxonomyReader & id_field_position (int value)
    Set the position of the field in each line where the ID is located.
int name_field_position () const
    Get the currently set position of the field in each line where the taxon name is located.
TaxonomyReader & name_field_position (int value)
    Set the position of the field in each line where the taxon name (Taxopath) is located.
TaxonomyReader & operator= (TaxonomyReader &&)=default
TaxonomyReader & operator= (TaxonomyReader const &)=default
void parse_document (utils::InputStream &it, Taxonomy &tax) const
    Parse all data from an InputStream into a Taxonomy object.
Line parse_line (utils::InputStream &it) const
    Read a single line of a taxonomy file and return the contained name and rank.
int rank_field_position () const
    Get the currently set position of the field in each line where the rank name is located.
TaxonomyReader & rank_field_position (int value)
    Set the position of the field in each line where the rank name is located.
Taxonomy read (std::shared_ptr< utils::BaseInputSource > source) const
    Read a taxonomy from an input source and return the Taxonomy.
void read (std::shared_ptr< utils::BaseInputSource > source, Taxonomy &target) const
    Read taxonomy data from an input source, and add the contents to a Taxonomy.
TaxopathParser & taxopath_parser ()
    Get the TaxopathParser used for parsing taxonomic path strings.
Classes
struct Line
    Internal helper structure that stores the relevant data of one line while reading.
TaxonomyReader ()
Default constructor.
Initializes the CsvReader so that tabs are used as field separators instead of commas.
Definition at line 57 of file taxonomy_reader.cpp.
utils::CsvReader & csv_reader ()
Get the CsvReader used for reading a taxonomy file.
This can be used to modify the reading behaviour, particularly values like the separator chars within the lines of the file. By default, the TaxonomyReader uses a tab (\t) char to separate fields, which is different from the comma (',') that is used as default by the CsvReader.
It is also possible to change other properties of the CsvReader, for example escaping behaviour, if the input data needs special treatment in those regards.
See CsvReader for details about those properties.
Definition at line 154 of file taxonomy_reader.cpp.
bool expect_strict_order () const
Return whether currently the reader expects a strict order of taxa.
See the setter for more information.
Definition at line 210 of file taxonomy_reader.cpp.
TaxonomyReader & expect_strict_order (bool value)
Set whether the reader expects a strict order of taxa.
In a strictly ordered taxonomy file, the super-groups have to be listed before any sub-groups.
For example, the list
    Archaea;
    Archaea;Aenigmarchaeota;
    Archaea;Crenarchaeota;
    Archaea;Crenarchaeota;Thermoprotei;
is in strict order.
If this property is set to true, the reader expects this ordering and throws an exception if there is a violation, that is, if there is a sub-group in the list without a previous entry of its super-group (recursively). This is useful to check a file for consistency, e.g., it might happen that some super-group is misspelled by accident.
If set to false (default), the order is ignored and all super-groups are created if necessary.
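The strict-order rule can be made concrete with a small standalone check. This is only a sketch of the rule as described here, not the reader's actual implementation. It assumes each entry has already been split into its taxon names, and both the function name check_strict_order() and the use of std::runtime_error are choices made for this illustration.

```cpp
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Each entry is one taxonomic path, already split into its taxon names.
// Throws if an entry appears before its direct super-group was listed.
void check_strict_order( std::vector<std::vector<std::string>> const& entries )
{
    std::set<std::vector<std::string>> seen;
    for( auto const& path : entries ) {
        if( path.empty() ) {
            continue;
        }
        if( path.size() > 1 ) {
            // The direct super-group is the path without its last element.
            // Checking only this one is enough: it was itself only accepted
            // after its own super-group had been seen, and so on recursively.
            std::vector<std::string> parent( path.begin(), path.end() - 1 );
            if( seen.count( parent ) == 0 ) {
                throw std::runtime_error(
                    "Taxon '" + path.back() + "' is listed before its super-group."
                );
            }
        }
        seen.insert( path );
    }
}
```

For the four-line list above, the check passes. Removing the line "Archaea;Crenarchaeota;" from it would make the check throw at "Archaea;Crenarchaeota;Thermoprotei;".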
Definition at line 204 of file taxonomy_reader.cpp.
int id_field_position () const
Get the currently set position of the field in each line where the ID is located.
See the setter of this function for details.
Definition at line 199 of file taxonomy_reader.cpp.
TaxonomyReader & id_field_position (int value)
Set the position of the field in each line where the ID is located.
This value determines at which position (zero-based) the field for the ID is located.
For example, in a taxonomy file with entries like
Archaea;Crenarchaeota;Thermoprotei; 7 class 119
this value would have to be set to 1, as this is where the ID "7" is found.
If the file does not contain any IDs, or if this field should be skipped, set it to a value of -1. This is also the default.
Definition at line 193 of file taxonomy_reader.cpp.
int name_field_position () const
Get the currently set position of the field in each line where the taxon name is located.
See the setter of this function for details.
Definition at line 177 of file taxonomy_reader.cpp.
TaxonomyReader & name_field_position (int value)
Set the position of the field in each line where the taxon name (Taxopath) is located.
This value determines at which position (zero-based) the field for the taxon name is located.
For example, in a taxonomy file with entries like
Archaea;Crenarchaeota;Thermoprotei; 7 class 119
this value would have to be set to 0, as this is where the taxon name is found. This reader expects the taxon name to be a Taxopath, that is, a string of taxonomic hierarchy elements, usually separated by semicolons. See Taxopath for details.
By default, this value is set to 0, that is, the first field. As it does not make sense to skip this value, it cannot be set to values below zero, unlike rank_field_position() and id_field_position(). An exception is thrown should this be attempted.
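The setter behaviour described here, together with the chaining used in the usage example at the top, can be sketched with a minimal stand-in class. FieldPositions is a hypothetical name, and std::invalid_argument is an assumption, as the exception type thrown by the reader is not stated here. Only the defaults (0 for the name, -1 for rank and ID) are taken from this documentation.

```cpp
#include <stdexcept>

// Hypothetical stand-in that mimics only the three position properties.
class FieldPositions
{
public:
    int name_field_position() const { return name_pos_; }
    int rank_field_position() const { return rank_pos_; }
    int id_field_position() const   { return id_pos_; }

    // The name field cannot be skipped, so negative values are rejected.
    FieldPositions& name_field_position( int value )
    {
        if( value < 0 ) {
            throw std::invalid_argument( "Name field position must not be negative." );
        }
        name_pos_ = value;
        return *this;
    }

    // A value of -1 means that the rank field is skipped.
    FieldPositions& rank_field_position( int value )
    {
        rank_pos_ = value;
        return *this;
    }

    // A value of -1 means that the ID field is skipped.
    FieldPositions& id_field_position( int value )
    {
        id_pos_ = value;
        return *this;
    }

private:
    int name_pos_ = 0;
    int rank_pos_ = -1;
    int id_pos_   = -1;
};
```

Because every setter returns a reference to the object, calls can be chained, for example positions.name_field_position( 0 ).rank_field_position( 2 ).id_field_position( 1 ). A rejected value leaves the previous setting unchanged, since the check runs before the assignment.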
Definition at line 164 of file taxonomy_reader.cpp.
void parse_document (utils::InputStream &it, Taxonomy &tax) const
Parse all data from an InputStream into a Taxonomy object.
Definition at line 83 of file taxonomy_reader.cpp.
TaxonomyReader::Line parse_line (utils::InputStream &it) const
Read a single line of a taxonomy file and return the contained name and rank.
The name is expected to be a taxonomic path string. See Taxopath for details on that format.
Definition at line 109 of file taxonomy_reader.cpp.
int rank_field_position () const
Get the currently set position of the field in each line where the rank name is located.
See the setter of this function for details.
Definition at line 188 of file taxonomy_reader.cpp.
TaxonomyReader & rank_field_position (int value)
Set the position of the field in each line where the rank name is located.
This value determines at which position (zero-based) the field for the rank name is located.
For example, in a taxonomy file with entries like
Archaea;Crenarchaeota;Thermoprotei; 7 class 119
this value would have to be set to 2, as this is where the rank name "class" is found.
If the file does not contain any rank names, or if this field should be skipped, set it to a value of -1. This is also the default.
Definition at line 182 of file taxonomy_reader.cpp.
Taxonomy read (std::shared_ptr< utils::BaseInputSource > source) const
Read a taxonomy from an input source and return the Taxonomy.
Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.
Definition at line 72 of file taxonomy_reader.cpp.
void read (std::shared_ptr< utils::BaseInputSource > source, Taxonomy &target) const
Read taxonomy data from an input source, and add the contents to a Taxonomy.
Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.
Definition at line 66 of file taxonomy_reader.cpp.
TaxopathParser & taxopath_parser ()
Get the TaxopathParser used for parsing taxonomic path strings.
The name field is expected to be a taxonomic path string. It is turned into a Taxon using the settings of the TaxopathParser; see that class for details. See Taxopath for a description of the expected string format.
Definition at line 159 of file taxonomy_reader.cpp.