A toolkit for working with phylogenetic data.
v0.20.0
FastaReader Class Reference

#include <genesis/sequence/formats/fasta_reader.hpp>

Detailed Description

Read Fasta sequence data.

This class provides simple facilities for reading Fasta data. It supports reading from a stream, a file, or a string; see from_stream(), from_file(), and from_string().

Example usage:

std::string infile = "path/to/file.fasta";
SequenceSet sequence_set;

FastaReader()
    .site_casing( SiteCasing::kUnchanged )
    .valid_chars( nucleic_acid_codes_all() )
    .from_file( infile, sequence_set );

The expected data format:

  1. Each sequence entry has to start with a '>' character, followed by a label, terminated by a '\n'.
  2. An arbitrary number of comment lines, starting with ';', can follow, but are ignored.
  3. After that, a sequence has to follow, over one or more lines.
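
The three rules above can be sketched as a small stand-alone parser. This is only an illustration of the format, not the genesis implementation: the names FastaRecordSketch and parse_fasta_sketch are hypothetical, blank lines are skipped as an assumption of this sketch, and no casing or validation is applied.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; not the genesis Sequence class.
struct FastaRecordSketch {
    std::string label;
    std::string sites;
};

// Parse all records from a stream and append them to `out`.
// Existing entries of `out` are kept.
inline void parse_fasta_sketch( std::istream& in, std::vector<FastaRecordSketch>& out )
{
    std::string line;
    bool in_record = false;
    while( std::getline( in, line )) {
        // Tolerate Windows line endings and blank lines (assumption of this sketch).
        if( !line.empty() && line.back() == '\r' ) {
            line.pop_back();
        }
        if( line.empty() ) {
            continue;
        }

        if( line[0] == '>' ) {
            // Rule 1: '>' starts a new entry; the rest of the line is the label.
            if( in_record && out.back().sites.empty() ) {
                throw std::runtime_error( "Entry without sequence: " + out.back().label );
            }
            out.push_back({ line.substr( 1 ), "" });
            in_record = true;
            continue;
        }
        if( !in_record ) {
            throw std::runtime_error( "Expected '>' at the start of an entry." );
        }
        assert( !out.empty() ); // in_record implies that an entry exists

        if( line[0] == ';' && out.back().sites.empty() ) {
            // Rule 2: comment lines directly after the label are ignored.
            continue;
        }
        // Rule 3: the sequence may span one or more lines.
        out.back().sites += line;
    }
    if( in_record && out.back().sites.empty() ) {
        throw std::runtime_error( "Entry without sequence: " + out.back().label );
    }
}
```

Because the function appends to the given vector, calling it repeatedly collects the entries of several inputs in one container, similar in spirit to the reading functions described on this page.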

More information on the format can be found at:

Using site_casing(), the sequences can automatically be converted to upper or lower case letters. Also, see valid_chars( std::string const& chars ) for a way of validating the input sequences.

Definition at line 94 of file fasta_reader.hpp.

Public Member Functions

 FastaReader ()
 Create a default FastaReader. By default, chars are converted to upper case, but not validated. More...
 
 FastaReader (FastaReader const &)=default
 
 FastaReader (FastaReader &&)=default
 
 ~FastaReader ()=default
 
void from_file (std::string const &file_name, SequenceSet &sequence_set) const
 Read all Sequences from a file in Fasta format into a SequenceSet. More...
 
SequenceSet from_file (std::string const &file_name) const
 Read all Sequences from a file in Fasta format and return them as a SequenceSet. More...
 
void from_stream (std::istream &input_stream, SequenceSet &sequence_set) const
 Read all Sequences from a std::istream in Fasta format into a SequenceSet. More...
 
SequenceSet from_stream (std::istream &input_stream) const
 Read all Sequences from a std::istream in Fasta format and return them as a SequenceSet. More...
 
void from_string (std::string const &input_string, SequenceSet &sequence_set) const
 Read all Sequences from a std::string in Fasta format into a SequenceSet. More...
 
SequenceSet from_string (std::string const &input_string) const
 Read all Sequences from a std::string in Fasta format and return them as a SequenceSet. More...
 
FastaReader & guess_abundances (bool value)
 Set whether Sequence labels are used to guess/extract Sequence abundances. More...
 
bool guess_abundances () const
 Return whether the label is used to guess/extract Sequence abundances. More...
 
FastaReader & operator= (FastaReader const &)=default
 
FastaReader & operator= (FastaReader &&)=default
 
void parse_document (utils::InputStream &input_stream, SequenceSet &sequence_set) const
 Parse a whole fasta document into a SequenceSet. More...
 
bool parse_sequence (utils::InputStream &input_stream, Sequence &sequence) const
 Parse a Sequence in Fasta format. More...
 
bool parse_sequence_pedantic (utils::InputStream &input_stream, Sequence &sequence) const
 Parse a Sequence in Fasta format. More...
 
FastaReader & parsing_method (ParsingMethod value)
 Set the parsing method. More...
 
ParsingMethod parsing_method () const
 Return the currently set parsing method. More...
 
FastaReader & site_casing (SiteCasing value)
 Set whether Sequence sites are automatically turned into upper or lower case. More...
 
SiteCasing site_casing () const
 Return whether Sequence sites are automatically turned into upper or lower case. More...
 
utils::CharLookup< bool > & valid_char_lookup ()
 Return the internal CharLookup that is used for validating the Sequence sites. More...
 
FastaReader & valid_chars (std::string const &chars)
 Set the chars that are used for validating Sequence sites when reading them. More...
 
std::string valid_chars () const
 Return the currently set chars used for validating Sequence sites. More...
 

Public Types

enum  ParsingMethod { kDefault, kPedantic }
 Enumeration of the available methods for parsing Fasta sequences. More...
 
enum  SiteCasing { kUnchanged, kToUpper, kToLower }
 Enumeration of casing methods to apply to each site of a Sequence. More...
 

Constructor & Destructor Documentation

FastaReader ( )

Create a default FastaReader. By default, chars are converted to upper case, but not validated.

See site_casing() and valid_chars() to change this behaviour.

Definition at line 56 of file fasta_reader.cpp.

~FastaReader ( )
default
FastaReader ( FastaReader const &  )
default
FastaReader ( FastaReader &&  )
default

Member Function Documentation

void from_file ( std::string const &  file_name,
SequenceSet &  sequence_set 
) const

Read all Sequences from a file in Fasta format into a SequenceSet.

The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.

Definition at line 80 of file fasta_reader.cpp.

SequenceSet from_file ( std::string const &  file_name) const

Read all Sequences from a file in Fasta format and return them as a SequenceSet.

Definition at line 87 of file fasta_reader.cpp.

void from_stream ( std::istream &  input_stream,
SequenceSet &  sequence_set 
) const

Read all Sequences from a std::istream in Fasta format into a SequenceSet.

The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.

Definition at line 65 of file fasta_reader.cpp.

SequenceSet from_stream ( std::istream &  input_stream) const

Read all Sequences from a std::istream in Fasta format and return them as a SequenceSet.

Definition at line 72 of file fasta_reader.cpp.

void from_string ( std::string const &  input_string,
SequenceSet &  sequence_set 
) const

Read all Sequences from a std::string in Fasta format into a SequenceSet.

The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.

Definition at line 95 of file fasta_reader.cpp.

SequenceSet from_string ( std::string const &  input_string) const

Read all Sequences from a std::string in Fasta format and return them as a SequenceSet.

Definition at line 102 of file fasta_reader.cpp.

FastaReader & guess_abundances ( bool  value)

Set whether Sequence labels are used to guess/extract Sequence abundances.

Default is false, that is, labels are just taken as they are in the input. If set to true, the label is used to guess an abundance count, which is set in the Sequence. See guess_sequence_abundance( Sequence const& ) for the valid formats of such abundances.

Definition at line 420 of file fasta_reader.cpp.

bool guess_abundances ( ) const

Return whether the label is used to guess/extract Sequence abundances.

Definition at line 426 of file fasta_reader.cpp.

FastaReader& operator= ( FastaReader const &  )
default
FastaReader& operator= ( FastaReader &&  )
default
void parse_document ( utils::InputStream &  input_stream,
SequenceSet &  sequence_set 
) const

Parse a whole fasta document into a SequenceSet.

This function is mainly used internally by the reading functions from_...(). It uses the currently set parsing_method() for parsing the data.

Definition at line 114 of file fasta_reader.cpp.

bool parse_sequence ( utils::InputStream &  input_stream,
Sequence &  sequence 
) const

Parse a Sequence in Fasta format.

This function takes an InputStream and interprets it as a Fasta formatted sequence. It extracts the data and writes it into the given Sequence object. See the class description of FastaReader for the expected data format.

The function stops after parsing one such sequence. It returns true if a sequence was extracted and false if the stream is empty. If the input is not in the correct format, a std::runtime_error exception is thrown, indicating the offending position in the input stream.

This method has a maximum line length of utils::InputStream::BlockLength and reports errors only on the line where the sequence starts. If you have files with longer lines or want error reporting at the exact line and column where the error occurs, use ParsingMethod::kPedantic instead. See FastaReader::ParsingMethod for details.

Definition at line 136 of file fasta_reader.cpp.

bool parse_sequence_pedantic ( utils::InputStream &  input_stream,
Sequence &  sequence 
) const

Parse a Sequence in Fasta format.

This function takes an InputStream and interprets it as a Fasta formatted sequence. It extracts the data and writes it into the given Sequence object. See the class description of FastaReader for the expected data format.

The function stops after parsing one such sequence. It returns true if a sequence was extracted and false if the stream is empty. If the input is not in the correct format, a std::runtime_error exception is thrown, indicating the offending position in the input stream.

Compared to parse_sequence(), this function allows for arbitrarily long lines and reports errors at the exact line and column where they occur. It is however slower. Apart from that, there are no differences. See FastaReader::ParsingMethod for details.
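
What exact line and column reporting involves can be illustrated with a stand-alone sketch that reads the input char by char. It is not the code of parse_sequence_pedantic(): the helper name read_sites_pedantic_sketch, the error message format, and the requirement of a non-empty set of valid chars are assumptions made for this example.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read the sites of one entry char by char, tracking line and column.
// Stops at end of input or at a '>' in the first column (start of the next entry).
inline std::string read_sites_pedantic_sketch(
    std::istream& in, std::string const& valid_chars
) {
    assert( !valid_chars.empty() ); // this sketch always validates

    std::string sites;
    std::size_t line   = 1;
    std::size_t column = 1;
    char c;
    while( in.get( c )) {
        if( c == '>' && column == 1 ) {
            // Leave the start of the next entry in the stream.
            in.unget();
            break;
        }
        if( c == '\n' ) {
            ++line;
            column = 1;
            continue;
        }
        if( valid_chars.find( c ) == std::string::npos ) {
            // Per-char processing makes the exact position available here.
            throw std::runtime_error(
                "Invalid char '" + std::string( 1, c ) + "' at " +
                std::to_string( line ) + ":" + std::to_string( column )
            );
        }
        sites += c;
        ++column;
    }
    return sites;
}
```

Handling every char individually places no limit on the line length, but it costs time compared to processing whole blocks, which matches the trade-off described above.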

Definition at line 264 of file fasta_reader.cpp.

FastaReader & parsing_method ( FastaReader::ParsingMethod  value)

Set the parsing method.

The parsing method is used for all the reader functions and parse_document(). See the ParsingMethod enum for details.

Definition at line 398 of file fasta_reader.cpp.

FastaReader::ParsingMethod parsing_method ( ) const

Return the currently set parsing method.

See the ParsingMethod enum for details.

Definition at line 404 of file fasta_reader.cpp.

FastaReader & site_casing ( SiteCasing  value)

Set whether Sequence sites are automatically turned into upper or lower case.

Default is SiteCasing::kToUpper, that is, all sites of the read Sequences are turned into upper case letters automatically. This is demanded by the Fasta standard. The function returns the FastaReader object to allow for fluent interfaces.
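
The fluent style mentioned here relies on each setter returning a reference to the object it was called on. A minimal stand-alone sketch of the pattern follows; ReaderSettingsSketch, SiteCasingSketch, and make_lowercase_dna_settings_sketch are hypothetical names used only for illustration, not genesis types.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for this sketch; not the genesis types.
enum class SiteCasingSketch { kUnchanged, kToUpper, kToLower };

class ReaderSettingsSketch
{
public:
    // Setters return *this, so that calls can be chained.
    ReaderSettingsSketch& site_casing( SiteCasingSketch value )
    {
        site_casing_ = value;
        return *this;
    }

    ReaderSettingsSketch& valid_chars( std::string const& chars )
    {
        valid_chars_ = chars;
        return *this;
    }

    // Getters share the setter names and differ by their empty parameter list.
    SiteCasingSketch site_casing() const { return site_casing_; }
    std::string const& valid_chars() const { return valid_chars_; }

private:
    // Defaults mirror the documented ones: upper case, no validation.
    SiteCasingSketch site_casing_ = SiteCasingSketch::kToUpper;
    std::string valid_chars_;
};

// Usage: configure a temporary in a single expression, then copy it.
inline ReaderSettingsSketch make_lowercase_dna_settings_sketch()
{
    auto const settings = ReaderSettingsSketch()
        .site_casing( SiteCasingSketch::kToLower )
        .valid_chars( "acgt" );
    assert( settings.site_casing() == SiteCasingSketch::kToLower );
    return settings;
}
```

This is the same shape as the class-level usage example, where a temporary FastaReader is configured and used in one statement.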

Definition at line 409 of file fasta_reader.cpp.

FastaReader::SiteCasing site_casing ( ) const

Return whether Sequence sites are automatically turned into upper or lower case.

Definition at line 415 of file fasta_reader.cpp.

utils::CharLookup< bool > & valid_char_lookup ( )

Return the internal CharLookup that is used for validating the Sequence sites.

This function is provided in case direct access to the lookup is needed. Usually, the valid_chars() function should suffice. See there for details.

Definition at line 456 of file fasta_reader.cpp.

FastaReader & valid_chars ( std::string const &  chars)

Set the chars that are used for validating Sequence sites when reading them.

When this function is called with a string of chars, those chars are used to validate the sites when reading them. That is, only sequences consisting of the given chars are valid.

If set to an empty string, this check is deactivated. This is also the default, meaning that no checking is done.

If site_casing() is set to a value other than SiteCasing::kUnchanged, the validation is done after changing the casing, so that only lower case or only upper case letters have to be provided for validation. If site_casing() is set to SiteCasing::kUnchanged, all chars that are to be considered valid, in both cases where applicable, have to be provided for validation.
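
The order of operations (casing first, validation second) can be sketched in a few lines of stand-alone C++. The names SiteCasingSketch, normalize_and_validate_sketch, and validation_usage_sketch are hypothetical, and the reader itself uses a char lookup table rather than the string search shown here.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-in for this sketch; not the genesis enum.
enum class SiteCasingSketch { kUnchanged, kToUpper, kToLower };

// Apply the casing in place, then check the result against `valid_chars`.
// Returns true if all sites are valid or if validation is disabled.
inline bool normalize_and_validate_sketch(
    std::string& sites, SiteCasingSketch casing, std::string const& valid_chars
) {
    for( char& c : sites ) {
        auto const uc = static_cast<unsigned char>( c );
        if( casing == SiteCasingSketch::kToUpper ) {
            c = static_cast<char>( std::toupper( uc ));
        } else if( casing == SiteCasingSketch::kToLower ) {
            c = static_cast<char>( std::tolower( uc ));
        }
    }

    // An empty set of valid chars means that no checking is done.
    if( valid_chars.empty() ) {
        return true;
    }
    return sites.find_first_not_of( valid_chars ) == std::string::npos;
}

// Usage: with upper-casing enabled, listing only upper case letters suffices.
inline void validation_usage_sketch()
{
    std::string sites = "acgt";
    bool const ok = normalize_and_validate_sketch( sites, SiteCasingSketch::kToUpper, "ACGT" );
    assert( ok && sites == "ACGT" );
    (void) ok; // silence unused-variable warnings when asserts are disabled
}
```

With the casing left unchanged, the same lower case input would fail against "ACGT" and would need "ACGTacgt" to pass.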

See the nucleic_acid_codes...() and amino_acid_codes...() functions for predefined sets of chars that can be used for validation here.

Definition at line 431 of file fasta_reader.cpp.

std::string valid_chars ( ) const

Return the currently set chars used for validating Sequence sites.

An empty string means that no validation is done.

Definition at line 445 of file fasta_reader.cpp.

Member Enumeration Documentation

enum ParsingMethod
strong

Enumeration of the available methods for parsing Fasta sequences.

Enumerator
kDefault 

Fast method, used by default.

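There are two limitations of this method:

  1. It has a maximum line length of utils::InputStream::BlockLength.
  2. It reports errors only on the line where the sequence starts, not at the exact position where they occur.

Those limitations do not affect most applications, as the maximum line length is long enough for most files, and if your data is good, there won't be errors to report. If, however, you have files with longer lines or want error reporting at the exact line and column where the error occurs, use kPedantic instead.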

With this setting, parse_sequence() is used for parsing. In our tests, it achieved ~350 MB/s parsing speed.

kPedantic 

Pedantic method.

Compared to the fast method, this one allows for arbitrarily long lines and reports errors at the exact line and column where they occur. It is however slower (~3.5x the time of the default method). Apart from that, there are no differences.

If you need this method for certain files, it might be useful to use it only once and use a FastaWriter to write out a new Fasta file with fitting line lengths and without errors. This way, for subsequent reading you can then use the faster default method.
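
The rewriting step amounts to emitting each sequence with a fixed number of sites per line. A stand-alone sketch of that wrapping is given below; write_wrapped_record_sketch is a hypothetical helper for illustration and does not show the FastaWriter interface.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Write one entry, wrapping the sites to at most `line_length` chars per line.
inline void write_wrapped_record_sketch(
    std::ostream& out,
    std::string const& label,
    std::string const& sites,
    std::size_t line_length
) {
    assert( line_length > 0 ); // a zero length would never advance the loop

    out << '>' << label << '\n';
    for( std::size_t pos = 0; pos < sites.size(); pos += line_length ) {
        // substr() clamps the count at the end of the string.
        out << sites.substr( pos, line_length ) << '\n';
    }
}
```

For example, writing the sites "ACGTACGTAC" with a line length of 4 yields the lines "ACGT", "ACGT", and "AC" after the label line, so that every line stays within a chosen bound.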

With this setting, parse_sequence_pedantic() is used for parsing. In our tests, it achieved ~100 MB/s parsing speed.

Definition at line 105 of file fasta_reader.hpp.

enum SiteCasing
strong

Enumeration of casing methods to apply to each site of a Sequence.

Enumerator
kUnchanged 

Do not change the case of the sites.

kToUpper 

Make all sites upper case.

kToLower 

Make all sites lower case.

Definition at line 145 of file fasta_reader.hpp.


The documentation for this class was generated from the following files: