A toolkit for working with phylogenetic data.
v0.19.0
PhylipReader Class Reference

#include <genesis/sequence/formats/phylip_reader.hpp>

Detailed Description

Read Phylip sequence data.

This class provides simple facilities for reading Phylip data. It supports reading from files, streams, and strings; see from_file(), from_stream(), and from_string().

Example usage:

std::string infile = "path/to/file.phylip";
SequenceSet sset;

PhylipReader()
    .site_casing( SiteCasing::kUnchanged )
    .valid_chars( nucleic_acid_codes_all() )
    .from_file( infile, sset );

The expected data format roughly follows the original definition. See mode( Mode ) to select between sequential, interleaved, and automatic mode. Furthermore, a relaxed version of the format is supported (and used by default), in which the label can be of any length. See label_length( size_t ) for more information.

Using site_casing(), the sequences can automatically be turned into upper or lower case letters. Also, see valid_chars( std::string const& ) for a way of validating the input sequences.

Definition at line 88 of file phylip_reader.hpp.

Public Member Functions

 PhylipReader ()
 Create a default PhylipReader. By default, chars are turned to upper case, but not validated. More...
 
 PhylipReader (PhylipReader const &)=default
 
 PhylipReader (PhylipReader &&)=default
 
 ~PhylipReader ()=default
 
void from_file (std::string const &file_name, SequenceSet &sequence_set) const
 Read all Sequences from a file in Phylip format into a SequenceSet. More...
 
SequenceSet from_file (std::string const &file_name) const
 Read all Sequences from a file in Phylip format and return them as a SequenceSet. More...
 
void from_stream (std::istream &input_stream, SequenceSet &sequence_set) const
 Read all Sequences from a std::istream in Phylip format into a SequenceSet. More...
 
SequenceSet from_stream (std::istream &input_stream) const
 Read all Sequences from a std::istream in Phylip format and return them as a SequenceSet. More...
 
void from_string (std::string const &input_string, SequenceSet &sequence_set) const
 Read all Sequences from a std::string in Phylip format into a SequenceSet. More...
 
SequenceSet from_string (std::string const &input_string) const
 Read all Sequences from a std::string in Phylip format and return them as a SequenceSet. More...
 
PhylipReader & label_length (size_t value)
 Set the length of the label in front of the sequences. More...
 
size_t label_length () const
 Return the currently set label length. More...
 
PhylipReader & mode (Mode value)
 Set the mode for reading sequences. More...
 
Mode mode () const
 
PhylipReader & operator= (PhylipReader const &)=default
 
PhylipReader & operator= (PhylipReader &&)=default
 
Header parse_phylip_header (utils::InputStream &it) const
 Parse a Phylip header and return the contained sequence count and length. More...
 
void parse_phylip_interleaved (utils::InputStream &it, SequenceSet &sset) const
 Parse a whole Phylip file using the interleaved variant (Mode::kInterleaved). More...
 
std::string parse_phylip_label (utils::InputStream &it) const
 Parse and return a Phylip label. More...
 
std::string parse_phylip_sequence_line (utils::InputStream &it) const
 Parse one sequence line. More...
 
void parse_phylip_sequential (utils::InputStream &it, SequenceSet &sset) const
 Parse a whole Phylip file using the sequential variant (Mode::kSequential). More...
 
PhylipReader & site_casing (SiteCasing value)
 Set whether Sequence sites are automatically turned into upper or lower case. More...
 
SiteCasing site_casing () const
 Return whether Sequence sites are automatically turned into upper or lower case. More...
 
utils::CharLookup< bool > & valid_char_lookup ()
 Return the internal CharLookup that is used for validating the Sequence sites. More...
 
PhylipReader & valid_chars (std::string const &chars)
 Set the chars that are used for validating Sequence sites when reading them. More...
 
std::string valid_chars () const
 Return the currently set chars used for validating Sequence sites. More...
 

Public Types

enum  Mode { kSequential, kInterleaved, kAutomatic }
 Enum to distinguish between the different file variants of Phylip. See mode( Mode value ) for more details. More...
 
enum  SiteCasing { kUnchanged, kToUpper, kToLower }
 Enumeration of casing methods to apply to each site of a Sequence. More...
 

Classes

struct  Header
 Helper that stores the header information of a Phylip file. More...
 

Constructor & Destructor Documentation

PhylipReader ( )

Create a default PhylipReader. By default, chars are turned to upper case, but not validated.

See site_casing() and valid_chars() to change this behaviour.

Definition at line 54 of file phylip_reader.cpp.

~PhylipReader ( )
default
PhylipReader ( PhylipReader const &  )
default
PhylipReader ( PhylipReader &&  )
default

Member Function Documentation

void from_file ( std::string const &  file_name,
SequenceSet &  sequence_set 
) const

Read all Sequences from a file in Phylip format into a SequenceSet.

The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.

Definition at line 95 of file phylip_reader.cpp.

SequenceSet from_file ( std::string const &  file_name) const

Read all Sequences from a file in Phylip format and return them as a SequenceSet.

Definition at line 146 of file phylip_reader.cpp.

void from_stream ( std::istream &  input_stream,
SequenceSet &  sequence_set 
) const

Read all Sequences from a std::istream in Phylip format into a SequenceSet.

The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.

This function is only allowed for Mode::kSequential and Mode::kInterleaved. Automatic mode does not work, as the stream might need to be reset, which is not possible. See mode(Mode) for details.

Definition at line 63 of file phylip_reader.cpp.

SequenceSet from_stream ( std::istream &  input_stream) const

Read all Sequences from a std::istream in Phylip format and return them as a SequenceSet.

This function is only allowed for Mode::kSequential and Mode::kInterleaved. Automatic mode does not work, as the stream might need to be reset, which is not possible. See mode(Mode) for details.

Definition at line 87 of file phylip_reader.cpp.

void from_string ( std::string const &  input_string,
SequenceSet &  sequence_set 
) const

Read all Sequences from a std::string in Phylip format into a SequenceSet.

The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.

Definition at line 154 of file phylip_reader.cpp.

SequenceSet from_string ( std::string const &  input_string) const

Read all Sequences from a std::string in Phylip format and return them as a SequenceSet.

Definition at line 199 of file phylip_reader.cpp.

PhylipReader & label_length ( size_t  value)

Set the length of the label in front of the sequences.

Phylip has the weird property that labels are written in front of sequences and do not need to have a delimiter, but instead are simply the first n characters of the string. This value determines after how many chars the label ends and the actual sequence begins.

If set to a value greater than 0, exactly this many characters are read as the label. Thus, labels can also contain spaces. Spaces at the beginning or end of a label are stripped. The length dictated by the Phylip standard is 10, but any other length can also be used.

If set to 0 (default), a relaxed version of Phylip is used instead, where the beginning of the sequence is detected automatically. Labels can then be of arbitrary length, as long as they do not contain whitespace. However, in this case, there has to be at least one space or tab character between the label and the sequence. After the whitespace, the rest of the line is treated as sequence data.

The function returns the PhylipReader object to allow for fluent interfaces.

Definition at line 476 of file phylip_reader.cpp.
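
The two label modes described above can be illustrated with a small standalone sketch. It does not use genesis and is not the library's implementation: split_label is a hypothetical helper that works on a single line held in a std::string, and unlike the description above, it does not report an error when the separating whitespace is missing in relaxed mode.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper (not part of genesis): split one line into
// (label, remaining sequence data).
//   label_length > 0: the first label_length chars form the label,
//                     which is then trimmed of leading/trailing blanks.
//   label_length == 0: relaxed mode, the label ends at the first blank.
std::pair<std::string, std::string> split_label(
    std::string const& line, std::size_t label_length
) {
    std::string const blanks = " \t";
    std::string label;
    std::size_t pos = 0;

    if( label_length > 0 ) {
        // substr clamps to the end of the line if the line is shorter.
        label = line.substr( 0, label_length );
        pos = std::min( label_length, line.size() );

        // Trim blanks at both ends of the fixed-width label.
        auto const first = label.find_first_not_of( blanks );
        auto const last  = label.find_last_not_of( blanks );
        label = ( first == std::string::npos )
            ? std::string()
            : label.substr( first, last - first + 1 );
    } else {
        // Relaxed mode: everything up to the first space or tab.
        pos = line.find_first_of( blanks );
        if( pos == std::string::npos ) {
            pos = line.size();
        }
        label = line.substr( 0, pos );
    }

    // Skip the blanks between the label and the sequence data.
    auto const seq_begin = line.find_first_not_of( blanks, pos );
    std::string rest = ( seq_begin == std::string::npos )
        ? std::string()
        : line.substr( seq_begin );
    return { label, rest };
}
```

With a label length of 10, the line "Homo sap  ACGT" yields the label "Homo sap" (which contains a space) and the sequence "ACGT". With a label length of 0, a label such as "Homo_sapiens_long" can exceed 10 characters but must not contain blanks.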

size_t label_length ( ) const

Return the currently set label length.

See the setter label_length( size_t ) for details.

Definition at line 482 of file phylip_reader.cpp.

PhylipReader & mode ( Mode  value)

Set the mode for reading sequences.

Phylip offers two variants for storing the sequences: sequential and interleaved. As there is no option or flag in the file itself, there is no way of knowing the variant without trying to parse it. If parsing fails with one variant but succeeds with the other, the file is probably in the variant that succeeded. However, there are instances where both variants are valid at the same time, but yield different sequences. So, in general, detecting the correct variant is undecidable, making Phylip a non-well-defined format.

In order to avoid those problems, this function explicitly sets the variant being used for parsing. By default, it is set to Mode::kSequential. Use Mode::kInterleaved for the other variant.

We also offer Mode::kAutomatic. It first tries to parse in sequential mode and, if this fails, in interleaved mode. However, as this might involve starting again from the beginning of the data, it is only possible with the from_file() and from_string() readers and does not work when using the from_stream() reader. Also, be aware that automatic mode is slower because of implementation details induced by those limitations. Try to avoid automatic mode. If possible, try to avoid Phylip altogether.

Definition at line 465 of file phylip_reader.cpp.
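
The trial-and-error strategy of automatic mode can be sketched generically as follows. This is an illustration under assumptions, not the genesis implementation: parse_automatic, Parser and Sequences are hypothetical names, and failure of a parser is assumed to be signalled by a std::runtime_error. The sketch also shows why the whole input has to be available for a second pass, which a one-pass std::istream cannot guarantee.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical types for illustration only (not genesis types).
using Sequences = std::vector<std::string>;
using Parser    = std::function<Sequences( std::string const& )>;

// Try the sequential parser first. If it throws, start over from the
// beginning of the data with the interleaved parser. This is possible
// because `data` is a complete string that can be read twice.
Sequences parse_automatic(
    std::string const& data,
    Parser const& sequential,
    Parser const& interleaved
) {
    try {
        return sequential( data );
    } catch( std::runtime_error const& ) {
        // The first attempt may have consumed arbitrary amounts of
        // input before failing, so the second one restarts at offset 0.
        return interleaved( data );
    }
}
```

Note that a successful first attempt says nothing about whether the second parser would also have succeeded with a different result, which is the ambiguity described above.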

PhylipReader::Mode mode ( ) const

Return the currently set mode for parsing Phylip.

See the setter mode( Mode ) for details.

Definition at line 471 of file phylip_reader.cpp.

PhylipReader& operator= ( PhylipReader const &  )
default
PhylipReader& operator= ( PhylipReader &&  )
default
PhylipReader::Header parse_phylip_header ( utils::InputStream &  it) const

Parse a Phylip header and return the contained sequence count and length.

This helper function expects to find a Phylip header line in the form x y, which describes the number of sequences x in the Phylip data and their length y. The remainder of the header line is interpreted as Phylip options. See Header struct for more information.

The function then advances the stream and skips potential empty lines after the header. It thus leaves the stream at the beginning of the first sequence line.

Definition at line 211 of file phylip_reader.cpp.
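
As a rough illustration of the header layout "x y" followed by optional options, the following standalone sketch parses a single header line with the standard library. PhylipHeader and parse_header_line are hypothetical names, and the field names are assumptions rather than the documented members of the Header struct. The sketch also does not skip empty lines after the header, which the actual function does.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the Header struct; field names are assumed.
struct PhylipHeader {
    std::size_t num_sequences = 0; // x: number of sequences
    std::size_t len_sequences = 0; // y: length of each sequence
    std::string options;           // remainder of the header line
};

// Parse a header line of the form "x y [options]".
PhylipHeader parse_header_line( std::string const& line )
{
    std::istringstream in( line );
    PhylipHeader header;

    // Leading blanks are skipped by the formatted extraction.
    if( !( in >> header.num_sequences >> header.len_sequences ) ) {
        throw std::runtime_error(
            "Malformed Phylip header: expected two numbers."
        );
    }

    // Everything after the two numbers (minus leading blanks) is kept
    // as the options string. It stays empty if nothing follows.
    std::getline( in >> std::ws, header.options );
    return header;
}
```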

void parse_phylip_interleaved ( utils::InputStream &  it,
SequenceSet &  sset 
) const

Parse a whole Phylip file using the interleaved variant (Mode::kInterleaved).

Definition at line 391 of file phylip_reader.cpp.
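
For readers unfamiliar with the interleaved layout: the first block of lines carries one label plus the first chunk of sites per sequence, and every following block carries only further chunks, in the same sequence order. The sketch below illustrates this for the relaxed label format. It is a simplified standalone example, not the genesis implementation: Record and read_interleaved are hypothetical names, the header is assumed to be parsed already, and neither sequence lengths nor characters are checked.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for illustration (not a genesis type).
struct Record {
    std::string label;
    std::string sites;
};

// Read the body of an interleaved Phylip file (everything after the
// header), given the number of sequences announced by the header.
std::vector<Record> read_interleaved(
    std::string const& body, std::size_t num_sequences
) {
    std::vector<Record> records( num_sequences );
    if( num_sequences == 0 ) {
        return records;
    }

    std::istringstream in( body );
    std::string line;
    std::size_t row = 0;       // index of the sequence the line belongs to
    bool first_block = true;   // only the first block contains labels

    while( std::getline( in, line ) ) {
        std::istringstream tokens( line );
        std::string token;
        if( !( tokens >> token ) ) {
            continue; // blank line between blocks
        }

        Record& rec = records[ row ];
        if( first_block ) {
            rec.label = token;  // relaxed format: label ends at first blank
        } else {
            rec.sites += token; // later blocks start directly with sites
        }
        // Remaining whitespace-separated chunks are all sequence data.
        while( tokens >> token ) {
            rec.sites += token;
        }

        // After num_sequences lines, the next block starts at row 0.
        if( ++row == num_sequences ) {
            row = 0;
            first_block = false;
        }
    }
    return records;
}
```

In the sequential variant, by contrast, all lines of one sequence come before the label of the next one.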

std::string parse_phylip_label ( utils::InputStream &  it) const

Parse and return a Phylip label.

This helper function either takes the first label_length chars as the label or, if label_length == 0, takes all chars until the first blank as the label. It returns the trimmed label and leaves the stream at the next char after the label (and after any subsequent blanks).

Definition at line 257 of file phylip_reader.cpp.

std::string parse_phylip_sequence_line ( utils::InputStream &  it) const

Parse one sequence line.

The line (which can also start after a label) is parsed until the first '\n' char. While parsing, the options site_casing() and valid_chars() are applied according to their settings. The stream is left at the beginning of the next line.

Definition at line 298 of file phylip_reader.cpp.

void parse_phylip_sequential ( utils::InputStream &  it,
SequenceSet &  sset 
) const

Parse a whole Phylip file using the sequential variant (Mode::kSequential).

Definition at line 341 of file phylip_reader.cpp.

PhylipReader & site_casing ( SiteCasing  value)

Set whether Sequence sites are automatically turned into upper or lower case.

Default is SiteCasing::kToUpper, that is, all sites of the read Sequences are turned into upper case letters automatically. The function returns the PhylipReader object to allow for fluent interfaces.

Definition at line 487 of file phylip_reader.cpp.

PhylipReader::SiteCasing site_casing ( ) const

Return whether Sequence sites are automatically turned into upper or lower case.

Definition at line 493 of file phylip_reader.cpp.

utils::CharLookup< bool > & valid_char_lookup ( )

Return the internal CharLookup that is used for validating the Sequence sites.

This function is provided in case direct access to the lookup is needed. Usually, the valid_chars() function should suffice. See there for details.

Definition at line 523 of file phylip_reader.cpp.

PhylipReader & valid_chars ( std::string const &  chars)

Set the chars that are used for validating Sequence sites when reading them.

When this function is called with a string of chars, those chars are used to validate the sites when reading them. That is, only sequences consisting of the given chars are valid.

If set to an empty string, this check is deactivated. This is also the default, meaning that no checking is done.

If site_casing() is set to a value other than SiteCasing::kUnchanged, the validation is done after changing the casing, so that only lower case or only upper case letters have to be provided for validation. If site_casing() is set to SiteCasing::kUnchanged, all chars that are to be considered valid have to be provided for validation.

See the nucleic_acid_codes...() and amino_acid_codes...() functions for predefined sets of chars that can be used for validation here.

Definition at line 498 of file phylip_reader.cpp.
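
The interplay between casing and validation described above can be sketched without genesis as follows. This is an assumption-laden illustration, not the library's code: Casing and normalize_sites are hypothetical names, a plain 256-entry table stands in for utils::CharLookup< bool >, and the exception type is chosen arbitrarily.

```cpp
#include <array>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical mirror of the SiteCasing enum, for illustration only.
enum class Casing { kUnchanged, kToUpper, kToLower };

// Apply the casing to each site first, then validate the result
// against `valid`. An empty `valid` string disables the check.
std::string normalize_sites(
    std::string sites, Casing casing, std::string const& valid
) {
    // One flag per possible char value; true means "allowed".
    std::array<bool, 256> lookup{};
    for( unsigned char c : valid ) {
        lookup[ c ] = true;
    }

    for( char& site : sites ) {
        auto const u = static_cast<unsigned char>( site );
        if( casing == Casing::kToUpper ) {
            site = static_cast<char>( std::toupper( u ) );
        } else if( casing == Casing::kToLower ) {
            site = static_cast<char>( std::tolower( u ) );
        }

        // Validation sees the char after the casing was applied.
        if( !valid.empty() && !lookup[ static_cast<unsigned char>( site ) ] ) {
            throw std::invalid_argument(
                std::string( "Invalid site char: " ) + site
            );
        }
    }
    return sites;
}
```

With Casing::kToUpper and the valid chars "ACGT", the input "acgt" is accepted and returned as "ACGT". With Casing::kUnchanged and the same valid chars, the same input is rejected, because the lower case letters were not listed.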

std::string valid_chars ( ) const

Return the currently set chars used for validating Sequence sites.

An empty string means that no validation is done.

Definition at line 512 of file phylip_reader.cpp.

Member Enumeration Documentation

enum Mode
strong

Enum to distinguish between the different file variants of Phylip. See mode( Mode value ) for more details.

Enumerator
kSequential 

Read the data in Phylip sequential mode.

kInterleaved 

Read the data in Phylip interleaved mode.

kAutomatic 

Infer the Phylip mode via trial and error.

Definition at line 129 of file phylip_reader.hpp.

enum SiteCasing
strong

Enumeration of casing methods to apply to each site of a Sequence.

Enumerator
kUnchanged 

Do not change the case of the sites.

kToUpper 

Make all sites upper case.

kToLower 

Make all sites lower case.

Definition at line 150 of file phylip_reader.hpp.


The documentation for this class was generated from the following files: