A library for working with phylogenetic and population genetic data.
v0.32.0
PhylipReader Class Reference

#include <genesis/sequence/formats/phylip_reader.hpp>

Detailed Description

Read Phylip sequence data.

This class provides simple facilities for reading Phylip data.

Exemplary usage:

std::string infile = "path/to/file.phylip";
SequenceSet sset;

PhylipReader()
    .site_casing( SiteCasing::kUnchanged )
    .valid_chars( nucleic_acid_codes_all() )
    .read( utils::from_file( infile ), sset );

The expected data format roughly follows the original definition. See mode( Mode ) to select between sequential and interleaved mode, which are the two variants of Phylip files. We furthermore support a relaxed version (by default), where the label can be of any length. See label_length( size_t ) for more information.

Using site_casing(), the sequences can automatically be turned into upper or lower case letters. Also, see valid_chars( std::string const& ) for a way to validate input sequences.

Definition at line 86 of file phylip_reader.hpp.

Public Member Functions

 PhylipReader ()
 Create a default PhylipReader. By default, chars are turned to upper case, but not validated. More...
 
 PhylipReader (PhylipReader &&)=default
 
 PhylipReader (PhylipReader const &)=default
 
 ~PhylipReader ()=default
 
size_t label_length () const
 Return the currently set label length. More...
 
PhylipReader & label_length (size_t value)
 Set the length of the label in front of the sequences. More...
 
Mode mode () const
 
PhylipReader & mode (Mode value)
 Set the mode for reading sequences. More...
 
PhylipReader & operator= (PhylipReader &&)=default
 
PhylipReader & operator= (PhylipReader const &)=default
 
Header parse_phylip_header (utils::InputStream &it) const
 Parse a Phylip header and return the contained sequence count and length. More...
 
void parse_phylip_interleaved (utils::InputStream &it, SequenceSet &sset) const
 Parse a whole Phylip file using the interleaved variant (Mode::kInterleaved). More...
 
std::string parse_phylip_label (utils::InputStream &it) const
 Parse and return a Phylip label. More...
 
std::string parse_phylip_sequence_line (utils::InputStream &it) const
 Parse one sequence line. More...
 
void parse_phylip_sequential (utils::InputStream &it, SequenceSet &sset) const
 Parse a whole Phylip file using the sequential variant (Mode::kSequential). More...
 
SequenceSet read (std::shared_ptr< utils::BaseInputSource > source) const
 Read all Sequences from an input source in Phylip format and return them as a SequenceSet. More...
 
void read (std::shared_ptr< utils::BaseInputSource > source, SequenceSet &target) const
 Read all Sequences from an input source in Phylip format into a SequenceSet. More...
 
bool remove_digits () const
 Return whether digits are removed from the Sequence. More...
 
PhylipReader & remove_digits (bool value)
 Set whether digits in the Sequence should be kept (default) or removed. More...
 
SiteCasing site_casing () const
 Return whether Sequence sites are automatically turned into upper or lower case. More...
 
PhylipReader & site_casing (SiteCasing value)
 Set whether Sequence sites are automatically turned into upper or lower case. More...
 
utils::CharLookup< bool > & valid_char_lookup ()
 Return the internal CharLookup that is used for validating the Sequence sites. More...
 
std::string valid_chars () const
 Return the currently set chars used for validating Sequence sites. More...
 
PhylipReader & valid_chars (std::string const &chars)
 Set the chars that are used for validating Sequence sites when reading them. More...
 

Public Types

enum  Mode { kSequential, kInterleaved }
 Enum to distinguish between the different file variants of Phylip. See mode( Mode value ) for more details. More...
 
enum  SiteCasing { kUnchanged, kToUpper, kToLower }
 Enumeration of casing methods to apply to each site of a Sequence. More...
 

Classes

struct  Header
 Helper that stores the header information of a Phylip file. More...
 

Constructor & Destructor Documentation

◆ PhylipReader() [1/3]

Create a default PhylipReader. By default, chars are turned to upper case, but not validated.

See site_casing() and valid_chars() to change this behaviour.

Definition at line 56 of file phylip_reader.cpp.

◆ ~PhylipReader()

~PhylipReader ( )
default

◆ PhylipReader() [2/3]

PhylipReader ( PhylipReader const &  )
default

◆ PhylipReader() [3/3]

PhylipReader ( PhylipReader &&  )
default

Member Function Documentation

◆ label_length() [1/2]

size_t label_length ( ) const

Return the currently set label length.

See the setter label_length( size_t ) for details.

Definition at line 363 of file phylip_reader.cpp.

◆ label_length() [2/2]

PhylipReader & label_length ( size_t  value)

Set the length of the label in front of the sequences.

Phylip has the weird property that labels are written in front of sequences and do not need to have a delimiter, but instead are simply the first n characters of the string. This value determines after how many chars the label ends and the actual sequence begins.

If set to a value greater than 0, exactly this many characters are read as the label. Thus, labels can also contain spaces. Spaces at the beginning or end of a label are stripped. The length that is dictated by the Phylip standard is 10, but any other length can also be used.

If set to 0 (default), a relaxed version of Phylip is used instead, where the start of the sequence is detected automatically. Labels can then be of arbitrary length, as long as they do not contain whitespace. However, in this case, there has to be at least one space or tab character between the label and the sequence. After the whitespace, the rest of the line is treated as sequence data.

The function returns the PhylipReader object to allow for fluent interfaces.
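The two label rules can be illustrated with a minimal, self-contained sketch. It does not use the genesis classes: split_label is a hypothetical helper invented for this illustration, and it works on a plain std::string line rather than on a utils::InputStream. It only mirrors the behaviour described above and is not the library's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not part of genesis): split one Phylip line into
// the label and the remaining sequence data.
std::pair<std::string, std::string> split_label(
    std::string const& line, std::size_t label_length
) {
    char const* const blanks = " \t";
    std::string label;
    std::size_t pos = 0;

    if( label_length > 0 ) {
        // Fixed length: exactly label_length chars form the label,
        // so the label may contain spaces.
        label = line.substr( 0, label_length );
        pos = std::min( label_length, line.size() );

        // Strip blanks at the beginning and end of the label.
        auto const first = label.find_first_not_of( blanks );
        auto const last  = label.find_last_not_of( blanks );
        label = ( first == std::string::npos )
            ? std::string()
            : label.substr( first, last - first + 1 );
    } else {
        // Relaxed: the label ends at the first blank, which has to exist.
        pos = line.find_first_of( blanks );
        if( pos == std::string::npos ) {
            throw std::runtime_error( "Expected a blank between label and sequence." );
        }
        label = line.substr( 0, pos );
    }

    // Skip the blanks between label and sequence data.
    pos = line.find_first_not_of( blanks, pos );
    std::string rest = ( pos == std::string::npos ) ? std::string() : line.substr( pos );
    return { label, rest };
}
```

With a label length of 10, the line "foo bar   AAGCC" yields the label "foo bar". With a label length of 0, the same line yields the label "foo", and "bar   AAGCC" is treated as sequence data.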

Definition at line 357 of file phylip_reader.cpp.

◆ mode() [1/2]

PhylipReader::Mode mode ( ) const

Return the currently set mode for parsing Phylip.

See the setter mode( Mode ) for details.

Definition at line 352 of file phylip_reader.cpp.

◆ mode() [2/2]

PhylipReader & mode ( Mode  value)

Set the mode for reading sequences.

Phylip offers two variants for storing the sequences: sequential and interleaved. As there is no option or flag in the file itself to distinguish between them, there is no way of knowing the variant without trying to parse it. If one fails but not the other, it is probably the latter variant. However, there are instances where both variants are valid at the same time, but yield different sequences. So, in general, detecting the correct variant is undecidable, making Phylip an ill-defined format. If possible, try to avoid Phylip files.

In order to avoid those problems, this function explicitly sets the variant being used for parsing. By default, it is set to Mode::kSequential. Use Mode::kInterleaved for the other variant.
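The ambiguity between the two variants can be reproduced with a small, self-contained sketch. It deliberately does not use the genesis classes: Record, split_line, parse_sequential and parse_interleaved are hypothetical names for this illustration, labels follow the relaxed rule (label ends at the first blank), and the input is assumed to be the data lines after the header with empty lines already removed.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for this sketch only: (label, sites).
using Record = std::pair<std::string, std::string>;

// Return all chars of a string except spaces and tabs.
std::string strip_blanks( std::string const& s )
{
    std::string out;
    for( char c : s ) {
        if( c != ' ' && c != '\t' ) out += c;
    }
    return out;
}

// Split a labelled line at the first blank into label and sites.
Record split_line( std::string const& line )
{
    auto const pos = line.find_first_of( " \t" );
    if( pos == std::string::npos ) throw std::runtime_error( "No blank after label." );
    return { line.substr( 0, pos ), strip_blanks( line.substr( pos )) };
}

// Sequential: each sequence is completed before the next label is read.
std::vector<Record> parse_sequential(
    std::vector<std::string> const& lines, std::size_t count, std::size_t length
) {
    std::vector<Record> result;
    std::size_t i = 0;
    for( std::size_t s = 0; s < count; ++s ) {
        if( i >= lines.size() ) throw std::runtime_error( "Unexpected end of data." );
        Record rec = split_line( lines[ i++ ] );
        while( rec.second.size() < length ) {
            if( i >= lines.size() ) throw std::runtime_error( "Unexpected end of data." );
            rec.second += strip_blanks( lines[ i++ ] );
        }
        if( rec.second.size() != length ) throw std::runtime_error( "Sequence too long." );
        result.push_back( rec );
    }
    return result;
}

// Interleaved: the first block has one labelled line per sequence,
// all later lines continue the sequences in the same order.
std::vector<Record> parse_interleaved(
    std::vector<std::string> const& lines, std::size_t count, std::size_t length
) {
    if( count == 0 || lines.size() < count ) throw std::runtime_error( "Too few lines." );
    std::vector<Record> result;
    std::size_t i = 0;
    for( ; i < count; ++i ) {
        result.push_back( split_line( lines[ i ] ));
    }
    for( ; i < lines.size(); ++i ) {
        result[ i % count ].second += strip_blanks( lines[ i ] );
    }
    for( auto const& rec : result ) {
        if( rec.second.size() != length ) throw std::runtime_error( "Wrong sequence length." );
    }
    return result;
}
```

For a header of 2 sequences of length 6 and the four data lines "AA CC", "GG TT", "AC GT", "ACGT", both functions succeed. The sequential sketch returns the records AA = CCGGTT and AC = GTACGT, whereas the interleaved sketch returns AA = CCACGT and GG = TTACGT. The example is contrived, but it shows why the variant has to be set explicitly.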

Definition at line 346 of file phylip_reader.cpp.

◆ operator=() [1/2]

PhylipReader& operator= ( PhylipReader &&  )
default

◆ operator=() [2/2]

PhylipReader& operator= ( PhylipReader const &  )
default

◆ parse_phylip_header()

PhylipReader::Header parse_phylip_header ( utils::InputStream & it) const

Parse a Phylip header and return the contained sequence count and length.

This helper function expects to find a Phylip header line in the form x y, which describes the number of sequences x in the Phylip data and their length y. The remainder of the header line is interpreted as Phylip options. See Header struct for more information.

The function then advances the stream and skips potential empty lines after the header. It thus leaves the stream at the beginning of the first sequence line.
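As a rough illustration of the header layout described above, the following self-contained sketch parses a single header line from a std::string. HeaderSketch and parse_header_line are hypothetical names chosen for this example; they are not the genesis Header struct or the member function documented here, and the sketch does not skip empty lines after the header.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in for this sketch only; field names are assumptions.
struct HeaderSketch {
    std::size_t num_sequences = 0;   // x: number of sequences
    std::size_t len_sequences = 0;   // y: length of each sequence
    std::string options;             // remainder of the header line
};

// Hypothetical helper: parse a header line of the form "x y [options]".
HeaderSketch parse_header_line( std::string const& line )
{
    std::istringstream in( line );
    HeaderSketch header;

    // The first two tokens have to be numbers.
    if( !( in >> header.num_sequences >> header.len_sequences )) {
        throw std::runtime_error( "Malformed Phylip header: " + line );
    }

    // Everything after the two numbers is kept as the options string.
    // If nothing follows, the options simply stay empty.
    std::getline( in >> std::ws, header.options );
    return header;
}
```

For example, the line "2 10" gives 2 sequences of length 10 with empty options, and " 5 42 I" gives 5 sequences of length 42 with the options string "I".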

Definition at line 96 of file phylip_reader.cpp.

◆ parse_phylip_interleaved()

void parse_phylip_interleaved ( utils::InputStream & it,
SequenceSet & sset 
) const

Parse a whole Phylip file using the interleaved variant (Mode::kInterleaved).

Definition at line 272 of file phylip_reader.cpp.

◆ parse_phylip_label()

std::string parse_phylip_label ( utils::InputStream & it) const

Parse and return a Phylip label.

This helper function either takes the first label_length chars as a label or, if label_length == 0, takes all chars until the first blank as label. It returns the trimmed label and leaves the stream at the next char after the label (and after subsequent blanks).

Definition at line 142 of file phylip_reader.cpp.

◆ parse_phylip_sequence_line()

std::string parse_phylip_sequence_line ( utils::InputStream & it) const

Parse one sequence line.

The line (which can also start after a label) is parsed until the first '\n' char. While parsing, the options site_casing() and valid_chars() are applied according to their settings. The stream is left at the beginning of the next line.

Definition at line 184 of file phylip_reader.cpp.

◆ parse_phylip_sequential()

void parse_phylip_sequential ( utils::InputStream & it,
SequenceSet & sset 
) const

Parse a whole Phylip file using the sequential variant (Mode::kSequential).

Definition at line 222 of file phylip_reader.cpp.

◆ read() [1/2]

SequenceSet read ( std::shared_ptr< utils::BaseInputSource > source) const

Read all Sequences from an input source in Phylip format and return them as a SequenceSet.

Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.

Definition at line 65 of file phylip_reader.cpp.

◆ read() [2/2]

void read ( std::shared_ptr< utils::BaseInputSource > source,
SequenceSet & target 
) const

Read all Sequences from an input source in Phylip format into a SequenceSet.

The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.

Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.

Definition at line 73 of file phylip_reader.cpp.

◆ remove_digits() [1/2]

bool remove_digits ( ) const

Return whether digits are removed from the Sequence.

Definition at line 385 of file phylip_reader.cpp.

◆ remove_digits() [2/2]

PhylipReader & remove_digits ( bool  value)

Set whether digits in the Sequence should be kept (default) or removed.

Usually, sequences do not contain digits. However, some Phylip variants allow sequences to be annotated with positions in between, for example:

2 10
foofoofoo AAGCC
5 TTGGC
barbarbar AAACC
5 CTTGC

See http://evolution.genetics.washington.edu/phylip/doc/sequence.html for the definition of the Phylip standard that allows this. By default, we keep all symbols except white space, because some multi-state models might use digits as symbols. However, for files that use this weird variant of the standard, this option can be activated to remove the digits.
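The effect of this option on a single line of sequence data can be sketched as follows. This is a self-contained illustration only: clean_sites is a hypothetical helper and not a genesis function, and it works on a plain std::string instead of an input stream.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: collect the sites of one line of sequence data.
// Whitespace is always dropped; digits are dropped only on request.
std::string clean_sites( std::string const& chunk, bool remove_digits )
{
    std::string sites;
    for( unsigned char c : chunk ) {
        if( std::isspace( c )) {
            continue;
        }
        if( remove_digits && std::isdigit( c )) {
            continue;
        }
        sites += static_cast<char>( c );
    }
    return sites;
}
```

Applied to the continuation line "5 TTGGC" from the example above, the sketch keeps "5TTGGC" when digits are kept, and "TTGGC" when they are removed. Only in the second case does the first sequence end up with the 10 sites that the header announces.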

Definition at line 379 of file phylip_reader.cpp.

◆ site_casing() [1/2]

PhylipReader::SiteCasing site_casing ( ) const

Return whether Sequence sites are automatically turned into upper or lower case.

Definition at line 374 of file phylip_reader.cpp.

◆ site_casing() [2/2]

PhylipReader & site_casing ( SiteCasing  value)

Set whether Sequence sites are automatically turned into upper or lower case.

Default is SiteCasing::kToUpper, that is, all sites of the read Sequences are turned into upper case letters automatically. The function returns the PhylipReader object to allow for fluent interfaces.

Definition at line 368 of file phylip_reader.cpp.

◆ valid_char_lookup()

utils::CharLookup< bool > & valid_char_lookup ( )

Return the internal CharLookup that is used for validating the Sequence sites.

This function is provided in case direct access to the lookup is needed. Usually, the valid_chars() function should suffice. See there for details.

Definition at line 415 of file phylip_reader.cpp.

◆ valid_chars() [1/2]

std::string valid_chars ( ) const

Return the currently set chars used for validating Sequence sites.

An empty string means that no validation is done.

Definition at line 404 of file phylip_reader.cpp.

◆ valid_chars() [2/2]

PhylipReader & valid_chars ( std::string const &  chars)

Set the chars that are used for validating Sequence sites when reading them.

When this function is called with a string of chars, those chars are used to validate the sites when reading them. That is, only sequences consisting of the given chars are valid.

If set to an empty string, this check is deactivated. This is also the default, meaning that no checking is done.

If site_casing() is set to a value other than SiteCasing::kUnchanged, the validation is done after changing the casing, so that only lower case or upper case letters have to be provided for validation. If site_casing() is set to SiteCasing::kUnchanged, all chars that are to be considered valid have to be provided for validation.

See the nucleic_acid_codes...() and amino_acid_codes...() functions for predefined sets of chars that can be used for validation here.
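The interplay between casing and validation can be sketched in a few lines of self-contained code. The names Casing and process_sites are hypothetical and exist only in this example, and a plain bool table stands in for the internal CharLookup; the sketch follows the order described above (casing first, validation second) and is not the library's implementation.

```cpp
#include <array>
#include <cctype>
#include <stdexcept>
#include <string>

// Stand-in enum for this sketch; mirrors the three documented casing options.
enum class Casing { kUnchanged, kToUpper, kToLower };

// Hypothetical helper: drop whitespace, change the casing, then validate.
std::string process_sites(
    std::string const& line, Casing casing, std::string const& valid_chars
) {
    // Lookup table of valid chars. An empty valid_chars string
    // means that no validation is done at all.
    std::array<bool, 256> valid{};
    for( unsigned char c : valid_chars ) {
        valid[ c ] = true;
    }

    std::string sites;
    for( unsigned char c : line ) {
        if( std::isspace( c )) {
            continue;
        }

        // Casing is changed before the char is validated.
        if( casing == Casing::kToUpper ) {
            c = static_cast<unsigned char>( std::toupper( c ));
        } else if( casing == Casing::kToLower ) {
            c = static_cast<unsigned char>( std::tolower( c ));
        }

        if( !valid_chars.empty() && !valid[ c ] ) {
            throw std::invalid_argument(
                std::string( "Invalid site char: " ) + static_cast<char>( c )
            );
        }
        sites += static_cast<char>( c );
    }
    return sites;
}
```

With Casing::kToUpper, the valid chars "ACGT" are sufficient to accept the input "ac gt". With Casing::kUnchanged, the same input is rejected unless the lower case letters are listed as well, for example as "ACGTacgt".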

Definition at line 390 of file phylip_reader.cpp.

Member Enumeration Documentation

◆ Mode

enum Mode
strong

Enum to distinguish between the different file variants of Phylip. See mode( Mode value ) for more details.

Enumerator
kSequential 

Read the data in Phylip sequential mode.

kInterleaved 

Read the data in Phylip interleaved mode.

Definition at line 127 of file phylip_reader.hpp.

◆ SiteCasing

enum SiteCasing
strong

Enumeration of casing methods to apply to each site of a Sequence.

Enumerator
kUnchanged 

Do not change the case of the sites.

kToUpper 

Make all sites upper case.

kToLower 

Make all sites lower case.

Definition at line 143 of file phylip_reader.hpp.

