#include <genesis/sequence/formats/phylip_reader.hpp>
Read Phylip sequence data.
This class provides simple facilities for reading Phylip data.
Exemplary usage:
std::string infile = "path/to/file.phylip";
SequenceSet sset;
PhylipReader()
    .site_casing( SiteCasing::kUnchanged )
    .valid_chars( nucleic_acid_codes_all() )
    .read( utils::from_file( infile ), sset );
The expected data format roughly follows the original definition. See mode( Mode ) to select between sequential and interleaved mode, which are the two variants of Phylip files. We furthermore support a relaxed version (by default), where the label can be of any length. See label_length( size_t ) for more information.
Using site_casing(), the sequences can automatically be turned into upper or lower case letters. Also, see valid_chars( std::string const& ) for a way of checking that input sequences contain only the expected characters.
Definition at line 86 of file phylip_reader.hpp.
Public Member Functions | |
PhylipReader () | |
Create a default PhylipReader. By default, chars are converted to upper case, but not validated. More... | |
PhylipReader (PhylipReader &&)=default | |
PhylipReader (PhylipReader const &)=default | |
~PhylipReader ()=default | |
size_t | label_length () const |
Return the currently set label length. More... | |
PhylipReader & | label_length (size_t value) |
Set the length of the label in front of the sequences. More... | |
Mode | mode () const |
PhylipReader & | mode (Mode value) |
Set the mode for reading sequences. More... | |
PhylipReader & | operator= (PhylipReader &&)=default |
PhylipReader & | operator= (PhylipReader const &)=default |
Header | parse_phylip_header (utils::InputStream &it) const |
Parse a Phylip header and return the contained sequence count and length. More... | |
void | parse_phylip_interleaved (utils::InputStream &it, SequenceSet &sset) const |
Parse a whole Phylip file using the interleaved variant (Mode::kInterleaved). More... | |
std::string | parse_phylip_label (utils::InputStream &it) const |
Parse and return a Phylip label. More... | |
std::string | parse_phylip_sequence_line (utils::InputStream &it) const |
Parse one sequence line. More... | |
void | parse_phylip_sequential (utils::InputStream &it, SequenceSet &sset) const |
Parse a whole Phylip file using the sequential variant (Mode::kSequential). More... | |
SequenceSet | read (std::shared_ptr< utils::BaseInputSource > source) const |
Read all Sequences from an input source in Phylip format and return them as a SequenceSet. More... | |
void | read (std::shared_ptr< utils::BaseInputSource > source, SequenceSet &target) const |
Read all Sequences from an input source in Phylip format into a SequenceSet. More... | |
bool | remove_digits () const |
Return whether digits are removed from the Sequence. More... | |
PhylipReader & | remove_digits (bool value) |
Set whether digits in the Sequence should be kept (default) or removed. More... | |
SiteCasing | site_casing () const |
Return whether Sequence sites are automatically turned into upper or lower case. More... | |
PhylipReader & | site_casing (SiteCasing value) |
Set whether Sequence sites are automatically turned into upper or lower case. More... | |
utils::CharLookup< bool > & | valid_char_lookup () |
Return the internal CharLookup that is used for validating the Sequence sites. More... | |
std::string | valid_chars () const |
Return the currently set chars used for validating Sequence sites. More... | |
PhylipReader & | valid_chars (std::string const &chars) |
Set the chars that are used for validating Sequence sites when reading them. More... | |
Public Types | |
enum | Mode { kSequential, kInterleaved } |
Enum to distinguish between the different file variants of Phylip. See mode( Mode value ) for more details. More... | |
enum | SiteCasing { kUnchanged, kToUpper, kToLower } |
Enumeration of casing methods to apply to each site of a Sequence. More... | |
Classes | |
struct | Header |
Helper that stores the header information of a Phylip file. More... | |
PhylipReader | ( | ) |
Create a default PhylipReader. By default, chars are converted to upper case, but not validated.
See site_casing() and valid_chars() to change this behaviour.
Definition at line 56 of file phylip_reader.cpp.
size_t label_length | ( | ) | const |
Return the currently set label length.
See the setter label_length( size_t ) for details.
Definition at line 363 of file phylip_reader.cpp.
PhylipReader & label_length | ( | size_t | value | ) |
Set the length of the label in front of the sequences.
Phylip has the weird property that labels are written in front of sequences and do not need to have a delimiter, but instead are simply the first n characters of the string. This value determines after how many chars the label ends and the actual sequence begins.
If set to a value greater than 0, exactly this many characters are read as label. Thus, they can also contain spaces. Spaces at the beginning or end of a label are stripped. The length that is dictated by the Phylip standard is 10, but any other length can also be used.
If set to 0 (default), a relaxed version of Phylip is used instead, where the beginning of the sequence is automatically detected. Labels can then be of arbitrary length, as long as they do not contain whitespace. However, in this case, there has to be at least one space or tab character between the label and the sequence. After the whitespace, the rest of the line is then treated as sequence data.
The function returns the PhylipReader object to allow for fluent interfaces.
Definition at line 357 of file phylip_reader.cpp.
PhylipReader::Mode mode | ( | ) | const |
Return the currently set mode for parsing Phylip.
See the setter mode( Mode ) for details.
Definition at line 352 of file phylip_reader.cpp.
PhylipReader & mode | ( | Mode | value | ) |
Set the mode for reading sequences.
Phylip offers two variants for storing the sequences: sequential and interleaved. As the file itself contains no option or flag that distinguishes between them, there is no way of knowing the variant without trying to parse it. If one fails but not the other, it is probably the latter variant. However, there are instances where both variants are valid at the same time, but yield different sequences. So, in general, detecting the correct variant is undecidable, making Phylip a non-well-defined format. If possible, try to avoid Phylip files.
In order to avoid those problems, this function explicitly sets the variant being used for parsing. By default, it is set to Mode::kSequential. Use Mode::kInterleaved for the other variant.
Definition at line 346 of file phylip_reader.cpp.
PhylipReader::Header parse_phylip_header | ( | utils::InputStream & | it | ) | const |
Parse a Phylip header and return the contained sequence count and length.
This helper function expects to find a Phylip header line in the form x y, which describes the number of sequences x in the Phylip data and their length y. The remainder of the header line is interpreted as Phylip options. See Header struct for more information.
The function then advances the stream and skips potential empty lines after the header. It thus leaves the stream at the beginning of the first sequence line.
Definition at line 96 of file phylip_reader.cpp.
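The header layout described above can be illustrated with a small standalone sketch. It does not use genesis or utils::InputStream; the struct HeaderSketch, its field names, and the function parse_header_line() are hypothetical names chosen for this example, and the sketch works on a single line that has already been read into a string.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the Header struct; the field names are assumptions.
struct HeaderSketch {
    std::size_t num_sequences = 0;
    std::size_t len_sequences = 0;
    std::string options;
};

// Parse a header line of the form "x y [options]".
HeaderSketch parse_header_line( std::string const& line )
{
    std::istringstream in( line );
    HeaderSketch header;
    if( !( in >> header.num_sequences >> header.len_sequences )) {
        throw std::runtime_error( "Malformed Phylip header: " + line );
    }

    // Whatever follows the two numbers is kept as the options string,
    // without its leading whitespace. It stays empty if nothing follows.
    std::getline( in >> std::ws, header.options );
    return header;
}
```

For instance, parse_header_line( "2 10 I" ) yields a sequence count of 2, a length of 10, and the options string "I". Skipping empty lines after the header is left out of the sketch.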
void parse_phylip_interleaved | ( | utils::InputStream & | it, |
SequenceSet & | sset | ||
) | const |
Parse a whole Phylip file using the interleaved variant (Mode::kInterleaved).
Definition at line 272 of file phylip_reader.cpp.
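To illustrate the interleaved layout, here is a minimal standalone sketch under simplifying assumptions: the header has already been parsed, the data lines are given as a vector of strings, and labels follow the relaxed convention (no blanks inside the label). The names LabeledSequence, strip_blanks() and read_interleaved() are hypothetical and not part of genesis.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// (label, sites) pair; a hypothetical stand-in for a Sequence.
using LabeledSequence = std::pair<std::string, std::string>;

// Drop spaces and tabs from a chunk of sequence data.
std::string strip_blanks( std::string const& text )
{
    std::string result;
    for( char c : text ) {
        if( c != ' ' && c != '\t' ) {
            result += c;
        }
    }
    return result;
}

std::vector<LabeledSequence> read_interleaved(
    std::vector<std::string> const& lines, std::size_t count, std::size_t length
) {
    std::vector<LabeledSequence> result;
    if( count == 0 ) {
        return result;
    }

    // First block: one labelled line per sequence.
    std::size_t i = 0;
    for( ; result.size() < count; ++i ) {
        if( i >= lines.size() ) {
            throw std::runtime_error( "Too few labelled lines." );
        }
        auto const split = lines[i].find_first_of( " \t" );
        if( split == std::string::npos || split == 0 ) {
            throw std::runtime_error( "Missing label or sequence data." );
        }
        result.emplace_back( lines[i].substr( 0, split ), strip_blanks( lines[i].substr( split )));
    }

    // Later blocks: unlabelled chunks, appended to the sequences in turn.
    // Empty lines between blocks are skipped.
    std::size_t s = 0;
    for( ; i < lines.size(); ++i ) {
        std::string const chunk = strip_blanks( lines[i] );
        if( chunk.empty() ) {
            continue;
        }
        result[s].second += chunk;
        s = ( s + 1 ) % count;
    }

    // Every sequence has to reach the length announced in the header.
    for( auto const& seq : result ) {
        if( seq.second.size() != length ) {
            throw std::runtime_error( "Sequence length does not match header." );
        }
    }
    return result;
}
```

With the lines "a AC", "b TG", an empty line, "GT" and "CA", a count of 2 and a length of 4, the sketch assembles the sequences a = ACGT and b = TGCA.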
std::string parse_phylip_label | ( | utils::InputStream & | it | ) | const |
Parse and return a Phylip label.
This helper function either takes the first label_length chars as a label or, if label_length == 0, takes all chars until the first blank as label. It returns the trimmed label and leaves the stream at the next char after the label (and after subsequent blanks).
Definition at line 142 of file phylip_reader.cpp.
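The two label conventions can be sketched on a plain string as follows. This is an illustration only, not the genesis implementation: split_label() is a hypothetical function that returns the trimmed label together with the remainder of the line, instead of advancing an input stream.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Split one line into (trimmed label, rest of the line after the blanks).
std::pair<std::string, std::string> split_label(
    std::string const& line, std::size_t label_length
) {
    std::string const blanks = " \t";
    std::string label;
    std::size_t pos = 0;

    if( label_length > 0 ) {
        // Fixed width: the first label_length chars form the label, blanks included.
        if( line.size() < label_length ) {
            throw std::runtime_error( "Line is shorter than the label length." );
        }
        label = line.substr( 0, label_length );
        pos = label_length;
    } else {
        // Relaxed: the label ends at the first blank, which has to exist.
        pos = line.find_first_of( blanks );
        if( pos == std::string::npos || pos == 0 ) {
            throw std::runtime_error( "Expected a label followed by a blank." );
        }
        label = line.substr( 0, pos );
    }

    // Trim blanks at both ends of the label.
    auto const first = label.find_first_not_of( blanks );
    auto const last = label.find_last_not_of( blanks );
    label = ( first == std::string::npos ) ? "" : label.substr( first, last - first + 1 );

    // Skip the blanks between label and sequence data.
    auto const begin = line.find_first_not_of( blanks, pos );
    std::string rest = ( begin == std::string::npos ) ? "" : line.substr( begin );
    return { label, rest };
}
```

With a label length of 10, the line "foo bar   ACGT" gives the label "foo bar", because blanks inside the fixed-width field belong to the label. With a label length of 0, the label simply ends at the first space or tab.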
std::string parse_phylip_sequence_line | ( | utils::InputStream & | it | ) | const |
Parse one sequence line.
The line (which can also start after a label) is parsed until the first '\n' char. While parsing, the options site_casing() and valid_chars() are applied according to their settings. The stream is left at the beginning of the next line.
Definition at line 184 of file phylip_reader.cpp.
void parse_phylip_sequential | ( | utils::InputStream & | it, |
SequenceSet & | sset | ||
) | const |
Parse a whole Phylip file using the sequential variant (Mode::kSequential).
Definition at line 222 of file phylip_reader.cpp.
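For comparison, the sequential layout can be sketched in the same simplified setting: the header is already parsed, the data lines are given as strings, and labels are in relaxed form. LabeledSequence and read_sequential() are hypothetical names for this example and are not taken from genesis.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// (label, sites) pair; a hypothetical stand-in for a Sequence.
using LabeledSequence = std::pair<std::string, std::string>;

std::vector<LabeledSequence> read_sequential(
    std::vector<std::string> const& lines, std::size_t count, std::size_t length
) {
    // Append the non-blank chars of a line to the sites collected so far.
    auto append_sites = []( std::string& sites, std::string const& text ) {
        for( char c : text ) {
            if( c != ' ' && c != '\t' ) {
                sites += c;
            }
        }
    };

    std::vector<LabeledSequence> result;
    std::size_t i = 0;
    while( result.size() < count ) {
        if( i >= lines.size() ) {
            throw std::runtime_error( "Unexpected end of data." );
        }

        // The first line of each sequence carries the label.
        std::string const& first = lines[i++];
        auto const split = first.find_first_of( " \t" );
        if( split == std::string::npos || split == 0 ) {
            throw std::runtime_error( "Missing label or sequence data." );
        }
        std::string sites;
        append_sites( sites, first.substr( split ));

        // Following lines belong to the same sequence until it is complete.
        while( sites.size() < length ) {
            if( i >= lines.size() ) {
                throw std::runtime_error( "Unexpected end of data." );
            }
            append_sites( sites, lines[i++] );
        }
        if( sites.size() != length ) {
            throw std::runtime_error( "Sequence is longer than announced in the header." );
        }
        result.emplace_back( first.substr( 0, split ), sites );
    }
    return result;
}
```

Here the lines "a AC", "GT", "b TG" and "CA" with a count of 2 and a length of 4 give a = ACGT and b = TGCA: each sequence is read to completion before the next label is expected.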
SequenceSet read | ( | std::shared_ptr< utils::BaseInputSource > | source | ) | const |
Read all Sequences from an input source in Phylip format and return them as a SequenceSet.
Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.
Definition at line 65 of file phylip_reader.cpp.
void read | ( | std::shared_ptr< utils::BaseInputSource > | source, |
SequenceSet & | target | ||
) | const |
Read all Sequences from an input source in Phylip format into a SequenceSet.
The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.
Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.
Definition at line 73 of file phylip_reader.cpp.
bool remove_digits | ( | ) | const |
Return whether digits are removed from the Sequence.
Definition at line 385 of file phylip_reader.cpp.
PhylipReader & remove_digits | ( | bool | value | ) |
Set whether digits in the Sequence should be kept (default) or removed.
Usually, sequences do not contain digits. However, some Phylip variants allow annotating sequences with positions in between, for example
2 10
foofoofoo AAGCC
        5 TTGGC
barbarbar AAACC
        5 CTTGC
See http://evolution.genetics.washington.edu/phylip/doc/sequence.html for the definition of the Phylip standard that allows this. By default, we keep all symbols except white space, because some multi-state models might use digits as symbols. However, for files that use this weird variant of the standard, this option can be activated to remove the digits.
Definition at line 379 of file phylip_reader.cpp.
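The effect of this option on a single line of sequence data can be sketched as follows. The function clean_sites() is a hypothetical helper written for this illustration, not a genesis function; it always drops white space and drops digits only on request.

```cpp
#include <cctype>
#include <string>

// Collect the sites of one line: white space is always dropped,
// digits only if remove_digits is set.
std::string clean_sites( std::string const& line, bool remove_digits )
{
    std::string sites;
    for( unsigned char c : line ) {
        if( std::isspace( c )) {
            continue;
        }
        if( remove_digits && std::isdigit( c )) {
            continue;
        }
        sites += static_cast<char>( c );
    }
    return sites;
}
```

With digit removal switched on, a position-annotated line such as "        5 TTGGC" reduces to "TTGGC". With it switched off, a multi-state line such as "0101 1100" keeps its digits and becomes "01011100".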
PhylipReader::SiteCasing site_casing | ( | ) | const |
Return whether Sequence sites are automatically turned into upper or lower case.
Definition at line 374 of file phylip_reader.cpp.
PhylipReader & site_casing | ( | SiteCasing | value | ) |
Set whether Sequence sites are automatically turned into upper or lower case.
Default is SiteCasing::kToUpper, that is, all sites of the read Sequences are turned into upper case letters automatically. The function returns the PhylipReader object to allow for fluent interfaces.
Definition at line 368 of file phylip_reader.cpp.
utils::CharLookup< bool > & valid_char_lookup | ( | ) |
Return the internal CharLookup that is used for validating the Sequence sites.
This function is provided in case direct access to the lookup is needed. Usually, the valid_chars() function should suffice. See there for details.
Definition at line 415 of file phylip_reader.cpp.
std::string valid_chars | ( | ) | const |
Return the currently set chars used for validating Sequence sites.
An empty string means that no validation is done.
Definition at line 404 of file phylip_reader.cpp.
PhylipReader & valid_chars | ( | std::string const & | chars | ) |
Set the chars that are used for validating Sequence sites when reading them.
When this function is called with a string of chars, those chars are used to validate the sites when reading them. That is, only sequences consisting of the given chars are valid.
If set to an empty string, this check is deactivated. This is also the default, meaning that no checking is done.
In case that site_casing() is set to a value other than SiteCasing::kUnchanged: The validation is done after changing the casing, so that only lower or capital letters have to be provided for validation.
In case that site_casing() is set to SiteCasing::kUnchanged: All chars that are to be considered valid have to be provided for validation.
See nucleic_acid_codes...() and amino_acid_codes...() functions for presettings of chars that can be used for validation here.
Definition at line 390 of file phylip_reader.cpp.
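The order of the two steps (casing first, validation second) can be shown with a short standalone sketch. The enum Casing and the function normalize_sites() are hypothetical stand-ins defined here for illustration; a plain std::string takes the place of the internal CharLookup, and the choice of exception type is an assumption of this sketch.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for SiteCasing.
enum class Casing { kUnchanged, kToUpper, kToLower };

// Apply the casing to each site, then validate the result.
// An empty set of valid chars switches the validation off.
std::string normalize_sites( std::string sites, Casing casing, std::string const& valid )
{
    for( char& c : sites ) {
        auto const u = static_cast<unsigned char>( c );
        if( casing == Casing::kToUpper ) {
            c = static_cast<char>( std::toupper( u ));
        } else if( casing == Casing::kToLower ) {
            c = static_cast<char>( std::tolower( u ));
        }

        // The check sees the char after the casing has been applied.
        if( !valid.empty() && valid.find( c ) == std::string::npos ) {
            throw std::invalid_argument( std::string( "Invalid site char: " ) + c );
        }
    }
    return sites;
}
```

In this sketch, the input "acgt" passes validation against "ACGT" when the casing is kToUpper, but is rejected when the casing is kUnchanged, since the lower case letters would then have to be listed as valid chars as well.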
enum Mode | strong |
Enum to distinguish between the different file variants of Phylip. See mode( Mode value ) for more details.
Enumerator | |
---|---|
kSequential | Read the data in Phylip sequential mode. |
kInterleaved | Read the data in Phylip interleaved mode. |
Definition at line 127 of file phylip_reader.hpp.
enum SiteCasing | strong |
Enumeration of casing methods to apply to each site of a Sequence.
Enumerator | |
---|---|
kUnchanged | Do not change the case of the sites. |
kToUpper | Make all sites upper case. |
kToLower | Make all sites lower case. |
Definition at line 143 of file phylip_reader.hpp.