A library for working with phylogenetic and population genetic data.
v0.27.0
FastaReader Class Reference

#include <genesis/sequence/formats/fasta_reader.hpp>

Detailed Description

Read Fasta sequence data.

This class provides simple facilities for reading Fasta data.

Example usage:

std::string infile = "path/to/file.fasta";
SequenceSet sequence_set;

FastaReader()
    .site_casing( SiteCasing::kUnchanged )
    .valid_chars( nucleic_acid_codes_all() )
    .read( utils::from_file( infile ), sequence_set );

The expected data format:

  1. Each record has to start with a '>' character, followed by a label, terminated by a '\n'.
  2. An arbitrary number of comment lines, each starting with ';', can follow; they are ignored.
  3. After that, a sequence has to follow, spanning one or more lines.

More information on the format can be found at:

Using site_casing(), the sequences can automatically be turned into upper or lower case letters. See also valid_chars( std::string const& chars ) for a way to check that input sequences contain only the expected characters.

Definition at line 92 of file fasta_reader.hpp.

Public Member Functions

 FastaReader ()
 Create a default FastaReader. By default, chars are turned to upper case, but not validated.

 FastaReader (FastaReader &&)=default

 FastaReader (FastaReader const &)=default

 ~FastaReader ()=default

bool guess_abundances () const
 Return whether the label is used to guess/extract Sequence abundances.

FastaReader & guess_abundances (bool value)
 Set whether Sequence labels are used to guess/extract Sequence abundances.

FastaReader & operator= (FastaReader &&)=default

FastaReader & operator= (FastaReader const &)=default

void parse_document (utils::InputStream &input_stream, SequenceSet &sequence_set) const
 Parse a whole Fasta document into a SequenceSet.

bool parse_sequence (utils::InputStream &input_stream, Sequence &sequence) const
 Parse a Sequence in Fasta format.

bool parse_sequence_pedantic (utils::InputStream &input_stream, Sequence &sequence) const
 Parse a Sequence in Fasta format.

ParsingMethod parsing_method () const
 Return the currently set parsing method.

FastaReader & parsing_method (ParsingMethod value)
 Set the parsing method.

SequenceSet read (std::shared_ptr< utils::BaseInputSource > source) const
 Read all Sequences from an input source in Fasta format and return them as a SequenceSet.

void read (std::shared_ptr< utils::BaseInputSource > source, SequenceSet &sequence_set) const
 Read all Sequences from an input source in Fasta format into a SequenceSet.

SiteCasing site_casing () const
 Return whether Sequence sites are automatically turned into upper or lower case.

FastaReader & site_casing (SiteCasing value)
 Set whether Sequence sites are automatically turned into upper or lower case.

utils::CharLookup< bool > & valid_char_lookup ()
 Return the internal CharLookup that is used for validating the Sequence sites.

std::string valid_chars () const
 Return the currently set chars used for validating Sequence sites.

FastaReader & valid_chars (std::string const &chars)
 Set the chars that are used for validating Sequence sites when reading them.
 

Public Types

enum  ParsingMethod { kDefault, kPedantic }
 Enumeration of the available methods for parsing Fasta sequences.

enum  SiteCasing { kUnchanged, kToUpper, kToLower }
 Enumeration of casing methods to apply to each site of a Sequence.
 

Constructor & Destructor Documentation

◆ FastaReader() [1/3]

Create a default FastaReader. By default, chars are turned to upper case, but not validated.

See site_casing() and valid_chars() to change this behaviour.

Definition at line 56 of file fasta_reader.cpp.

◆ ~FastaReader()

~FastaReader ( ) = default

◆ FastaReader() [2/3]

FastaReader ( FastaReader const &  ) = default

◆ FastaReader() [3/3]

FastaReader ( FastaReader &&  ) = default

Member Function Documentation

◆ guess_abundances() [1/2]

bool guess_abundances ( ) const

Return whether the label is used to guess/extract Sequence abundances.

Definition at line 401 of file fasta_reader.cpp.

◆ guess_abundances() [2/2]

FastaReader & guess_abundances ( bool  value)

Set whether Sequence labels are used to guess/extract Sequence abundances.

Default is false, that is, labels are just taken as they are in the input. If set to true, the label is used to guess an abundance count, which is set in the Sequence. See guess_sequence_abundance( Sequence const& ) for the valid formats of such abundances.
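
As an illustration of what guessing an abundance from a label can look like, the sketch below splits a label of the form "name;size=123;" into the name and the count. The ";size=" annotation is an assumption made here for the example only; the formats that genesis actually accepts are documented in guess_sequence_abundance( Sequence const& ). The function name split_abundance() is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper, not part of genesis.
// Assumes abundances are annotated as ";size=<digits>" inside the label.
// Returns the label without the annotation and the parsed count;
// labels without a usable annotation are returned unchanged with a count of 1.
std::pair<std::string, std::size_t> split_abundance( std::string const& label )
{
    std::string const key = ";size=";
    auto const pos = label.find( key );
    if( pos == std::string::npos ) {
        return { label, 1 };
    }

    // Parse the run of digits that follows the key.
    std::size_t i = pos + key.size();
    std::size_t count = 0;
    bool has_digit = false;
    while( i < label.size() && label[i] >= '0' && label[i] <= '9' ) {
        count = count * 10 + static_cast<std::size_t>( label[i] - '0' );
        has_digit = true;
        ++i;
    }
    if( !has_digit ) {
        return { label, 1 };
    }
    return { label.substr( 0, pos ), count };
}
```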

Definition at line 395 of file fasta_reader.cpp.

◆ operator=() [1/2]

FastaReader& operator= ( FastaReader &&  ) = default

◆ operator=() [2/2]

FastaReader& operator= ( FastaReader const &  ) = default

◆ parse_document()

void parse_document ( utils::InputStream & input_stream,
SequenceSet & sequence_set 
) const

Parse a whole fasta document into a SequenceSet.

This function is mainly used internally by the reading functions read(). It uses the currently set parsing_method() for parsing the data.

Definition at line 85 of file fasta_reader.cpp.

◆ parse_sequence()

bool parse_sequence ( utils::InputStream & input_stream,
Sequence & sequence 
) const

Parse a Sequence in Fasta format.

This function takes an InputStream and interprets it as a Fasta formatted sequence. It extracts the data and writes it into the given Sequence object. See the class description of FastaReader for the expected data format.

The function stops after parsing one such sequence. It returns true if a sequence was extracted and false if the stream is empty. If the input is not in the correct format, a std::runtime_error exception is thrown, indicating the offending position in the input stream.

Definition at line 107 of file fasta_reader.cpp.

◆ parse_sequence_pedantic()

bool parse_sequence_pedantic ( utils::InputStream & input_stream,
Sequence & sequence 
) const

Parse a Sequence in Fasta format.

This function takes an InputStream and interprets it as a Fasta formatted sequence. It extracts the data and writes it into the given Sequence object. See the class description of FastaReader for the expected data format.

The function stops after parsing one such sequence. It returns true if a sequence was extracted and false if the stream is empty. If the input is not in the correct format, a std::runtime_error exception is thrown, indicating the offending position in the input stream.

Compared to parse_sequence(), this function reports errors at the exact line and column where they occur. It is, however, slower. Apart from that, there are no differences. See FastaReader::ParsingMethod for details.

Definition at line 236 of file fasta_reader.cpp.

◆ parsing_method() [1/2]

FastaReader::ParsingMethod parsing_method ( ) const

Return the currently set parsing method.

See the ParsingMethod enum for details.

Definition at line 379 of file fasta_reader.cpp.

◆ parsing_method() [2/2]

FastaReader & parsing_method ( FastaReader::ParsingMethod  value)

Set the parsing method.

The parsing method is used for all the reader functions and parse_document(). See the ParsingMethod enum for details.

Definition at line 373 of file fasta_reader.cpp.

◆ read() [1/2]

SequenceSet read ( std::shared_ptr< utils::BaseInputSource > source) const

Read all Sequences from an input source in Fasta format and return them as a SequenceSet.

Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.

Definition at line 65 of file fasta_reader.cpp.

◆ read() [2/2]

void read ( std::shared_ptr< utils::BaseInputSource > source,
SequenceSet & sequence_set 
) const

Read all Sequences from an input source in Fasta format into a SequenceSet.

The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.
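
The difference between the two read() overloads, one returning a fresh set and one appending to an existing set, can be sketched as follows. LabelSet and read_labels() are hypothetical stand-ins that only collect the labels of a Fasta text; they are not the genesis SequenceSet or FastaReader::read().

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a SequenceSet; it only stores the labels.
using LabelSet = std::vector<std::string>;

// Appending overload: existing entries in `out` are kept,
// and the labels found in `fasta_text` are added after them.
void read_labels( std::string const& fasta_text, LabelSet& out )
{
    std::istringstream in( fasta_text );
    std::string line;
    while( std::getline( in, line ) ) {
        if( !line.empty() && line[0] == '>' ) {
            out.push_back( line.substr( 1 ) );
        }
    }
}

// Returning overload: starts from an empty set and delegates to the appending one.
LabelSet read_labels( std::string const& fasta_text )
{
    LabelSet result;
    read_labels( fasta_text, result );
    return result;
}
```

Calling the appending overload once per input accumulates all inputs in one container, which mirrors how repeated calls fill a single SequenceSet.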

Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.

Definition at line 73 of file fasta_reader.cpp.

◆ site_casing() [1/2]

FastaReader::SiteCasing site_casing ( ) const

Return whether Sequence sites are automatically turned into upper or lower case.

Definition at line 390 of file fasta_reader.cpp.

◆ site_casing() [2/2]

FastaReader & site_casing ( SiteCasing  value)

Set whether Sequence sites are automatically turned into upper or lower case.

Default is SiteCasing::kToUpper, that is, all sites of the read Sequences are turned into upper case letters automatically. This is demanded by the Fasta standard. The function returns the FastaReader object to allow for fluent interfaces.
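
A fluent interface of this kind is typically built from setters that return a reference to the object itself. The class below is a hypothetical miniature written for this explanation, not the FastaReader source; only the pattern is the point.

```cpp
#include <string>

// Hypothetical settings class illustrating setters that return *this.
class ReaderSettings
{
public:
    enum class Casing { kUnchanged, kToUpper, kToLower };

    // Setter: stores the value and returns the object to allow chaining.
    ReaderSettings& casing( Casing value )
    {
        casing_ = value;
        return *this;
    }

    // Getter: same name, no argument, const.
    Casing casing() const
    {
        return casing_;
    }

    ReaderSettings& valid_chars( std::string const& chars )
    {
        valid_chars_ = chars;
        return *this;
    }

    std::string const& valid_chars() const
    {
        return valid_chars_;
    }

private:
    Casing      casing_ = Casing::kToUpper; // upper case by default
    std::string valid_chars_;               // empty: no validation
};
```

With this pattern, several settings can be applied in one expression, for example settings.casing( ReaderSettings::Casing::kToLower ).valid_chars( "acgt" ).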

Definition at line 384 of file fasta_reader.cpp.

◆ valid_char_lookup()

utils::CharLookup< bool > & valid_char_lookup ( )

Return the internal CharLookup that is used for validating the Sequence sites.

This function is provided in case direct access to the lookup is needed. Usually, the valid_chars() function should suffice. See there for details.
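
To give an idea of what a per-character lookup does, here is a minimal sketch of a boolean table indexed by character value. BoolCharTable is a hypothetical class for illustration; it does not reproduce the interface of utils::CharLookup.

```cpp
#include <array>
#include <string>

// Hypothetical lookup table: one bool per possible char value.
class BoolCharTable
{
public:
    // Mark every char in `chars` with the given value.
    void set_chars( std::string const& chars, bool value = true )
    {
        for( char c : chars ) {
            table_[ static_cast<unsigned char>( c ) ] = value;
        }
    }

    // Constant-time query for a single char.
    bool get( char c ) const
    {
        return table_[ static_cast<unsigned char>( c ) ];
    }

private:
    std::array<bool, 256> table_{}; // value-initialized: all entries false
};
```

A table like this makes the validity check for each site a single array access, independent of how many chars are valid.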

Definition at line 431 of file fasta_reader.cpp.

◆ valid_chars() [1/2]

std::string valid_chars ( ) const

Return the currently set chars used for validating Sequence sites.

An empty string means that no validation is done.

Definition at line 420 of file fasta_reader.cpp.

◆ valid_chars() [2/2]

FastaReader & valid_chars ( std::string const &  chars)

Set the chars that are used for validating Sequence sites when reading them.

When this function is called with a string of chars, those chars are used to validate the sites when reading them. That is, only sequences consisting of the given chars are valid.

If set to an empty string, this check is deactivated. This is also the default, meaning that no checking is done.

If site_casing() is set to a value other than SiteCasing::kUnchanged, the validation is done after changing the casing, so that only lower case or only upper case letters have to be provided for validation. If site_casing() is set to SiteCasing::kUnchanged, all chars that are to be considered valid have to be provided for validation.
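
The order described here, casing first and validation second, can be sketched as below. The enum and the function normalize_and_validate() are hypothetical and written only for this explanation; they are not taken from the genesis source.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical mirror of the three casing options.
enum class Casing { kUnchanged, kToUpper, kToLower };

// Apply the casing to each site, then check the result against `valid_chars`.
// An empty `valid_chars` disables the check.
std::string normalize_and_validate(
    std::string sites, Casing casing, std::string const& valid_chars
) {
    for( char& c : sites ) {
        auto const u = static_cast<unsigned char>( c );
        if( casing == Casing::kToUpper ) {
            c = static_cast<char>( std::toupper( u ) );
        } else if( casing == Casing::kToLower ) {
            c = static_cast<char>( std::tolower( u ) );
        }

        // Validation sees the char after the casing step.
        if( !valid_chars.empty() && valid_chars.find( c ) == std::string::npos ) {
            throw std::runtime_error( std::string( "Invalid site char '" ) + c + "'" );
        }
    }
    return sites;
}
```

With Casing::kToUpper, the input "acgt" passes a check against "ACGT". With Casing::kUnchanged, the same input fails that check, so both "acgt" and "ACGT" would have to be listed as valid.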

See the nucleic_acid_codes...() and amino_acid_codes...() functions for predefined sets of chars that can be used for validation here.

Definition at line 406 of file fasta_reader.cpp.

Member Enumeration Documentation

◆ ParsingMethod

enum class ParsingMethod

Enumeration of the available methods for parsing Fasta sequences.

Enumerator
kDefault 

Fast method, used by default.

This is by far the preferred method. It has, however, one slight limitation: it only reports errors using the line where the sequence starts. This does not affect most applications, as good data won't produce errors to report. If, however, you want error reporting at the exact line and column where the error occurs, use kPedantic instead.

With this setting, parse_sequence() is used for parsing. In our tests, it achieved ~350 MB/s parsing speed.

kPedantic 

Pedantic method.

Compared to the fast method, this one reports errors at the exact line and column where they occur. It is however slower (~3.5x the time of the default method). Apart from that, there are no differences.

If you need this method for certain files, it might be useful to use it only once and use a FastaWriter to write out a new Fasta file without errors. This way, for subsequent reading you can then use the faster default method.

With this setting, parse_sequence_pedantic() is used for parsing. In our tests, it achieved ~100 MB/s parsing speed.
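
Reporting the exact line and column requires position bookkeeping for every char that is read. The sketch below shows this bookkeeping in isolation; Position and find_first_invalid() are hypothetical names, and the code is not the pedantic parser of genesis.

```cpp
#include <cstddef>
#include <string>

// 1-based position in a text; {0, 0} is used here to signal "nothing found".
struct Position {
    std::size_t line;
    std::size_t column;
};

// Return the position of the first char in `text` that is not in `valid`.
// Newlines advance the line counter and are not validated themselves.
Position find_first_invalid( std::string const& text, std::string const& valid )
{
    std::size_t line = 1;
    std::size_t column = 1;
    for( char c : text ) {
        if( c == '\n' ) {
            ++line;
            column = 1;
            continue;
        }
        if( valid.find( c ) == std::string::npos ) {
            return { line, column };
        }
        ++column;
    }
    return { 0, 0 };
}
```

For the text "ACGT\nACXT\n" checked against "ACGT", the function reports line 2, column 3. Updating two counters per char is extra work compared to only remembering the line where a sequence starts, which is one plausible source of the speed difference; this is an assumption, not a statement about the genesis code.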

Definition at line 103 of file fasta_reader.hpp.

◆ SiteCasing

enum class SiteCasing

Enumeration of casing methods to apply to each site of a Sequence.

Enumerator
kUnchanged 

Do not change the case of the sites.

kToUpper 

Make all sites upper case.

kToLower 

Make all sites lower case.

Definition at line 139 of file fasta_reader.hpp.


The documentation for this class was generated from the following files: