A library for working with phylogenetic and population genetic data.
v0.32.0
FastqReader Class Reference

#include <genesis/sequence/formats/fastq_reader.hpp>

Detailed Description

Read Fastq sequence data.

This class provides simple facilities for reading Fastq data.

Example usage:

std::string infile = "path/to/file.fastq";
SequenceSet sequence_set;

FastqReader()
    .site_casing( SiteCasing::kUnchanged )
    .valid_chars( nucleic_acid_codes_all() )
    .read( utils::from_file( infile ), sequence_set );

The expected data format is:

  1. Line 1 begins with a '@' character and is followed by a sequence identifier (label) and an optional description (like a FASTA title line, see FastaReader for details).
  2. Line 2 (or more) is the raw sequence letters. In contrast to most other readers, we allow the sequence to use several lines.
  3. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. If this line is not empty, it has to be identical to line 1.
  4. Line 4 (or more) encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as there were letters in the sequence (line 2).

See https://en.wikipedia.org/wiki/FASTQ_format for details.
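The four rules above can be sketched for the simplest case, where every part of a record occupies exactly one line. This is a standalone illustration using only the standard library; `FastqRecord` and `read_four_line_record` are hypothetical names made up for this sketch and are not part of genesis, nor is this how FastqReader is implemented.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record type for this sketch; not a genesis class.
struct FastqRecord
{
    std::string label;    // text after '@' on line 1
    std::string sites;    // sequence letters from line 2
    std::string quality;  // raw, undecoded quality characters from line 4
};

// Read one strict four-line record. Returns false if the input is exhausted.
bool read_four_line_record( std::istream& in, FastqRecord& rec )
{
    std::string line1, line2, line3, line4;
    if( !std::getline( in, line1 )) {
        return false;
    }
    if( !std::getline( in, line2 ) || !std::getline( in, line3 ) || !std::getline( in, line4 )) {
        throw std::runtime_error( "Incomplete Fastq record" );
    }

    // Rule 1: line 1 starts with '@'. Rule 3: line 3 starts with '+'.
    if( line1.empty() || line1[0] != '@' ) {
        throw std::runtime_error( "Line 1 has to start with '@'" );
    }
    if( line3.empty() || line3[0] != '+' ) {
        throw std::runtime_error( "Line 3 has to start with '+'" );
    }

    // Rule 3: if anything follows the '+', it has to repeat line 1.
    if( line3.size() > 1 && line3.substr( 1 ) != line1.substr( 1 )) {
        throw std::runtime_error( "Line 3 is not identical to line 1" );
    }

    // Rule 4: one quality symbol per sequence letter.
    if( line4.size() != line2.size() ) {
        throw std::runtime_error( "Quality and sequence lengths differ" );
    }

    rec.label   = line1.substr( 1 );
    rec.sites   = line2;
    rec.quality = line4;
    assert( rec.sites.size() == rec.quality.size() );
    return true;
}
```

Feeding a `std::istringstream` that holds `"@read1 desc\nACGT\n+\nIIII\n"` to this function fills the record with the label `read1 desc`, the sites `ACGT`, and the quality string `IIII`; a second call returns false.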

As the encoding for the quality values can differ substantially depending on the sequencing technology used, parsing Fastq files is more difficult than parsing Fasta files. Two issues arise:

  • The quality encoding depends on the sequencing technology used. The most prominent difference is the ASCII base used for the phred quality scores. See https://en.wikipedia.org/wiki/FASTQ_format for a thorough discussion, or the article cited below [1]. Solexa even uses a different function to compute scores, which complicates matters further. We tried to make the standard use case as easy as possible, as explained below.
  • Most parsers expect the four lines as above, with no line breaks inside any of them. This is because the quality encoding might use the characters '@' and '+', which are also used as the starting characters for the first and third line, respectively (we here ignore the fact that, in theory, the sequence letters themselves could also be different from 'ACGT' and their degenerates, as this is also not defined in the format...).
    This simple format does work here as well. However, we are nice and also support line breaks.
    There is only one edge case where this breaks. If the sequence sites (line 2) contain a '+' character at the beginning of a wrapped line (i.e., immediately after a line break), we cannot distinguish this from the beginning of line 3. Unfortunately, this is an issue of the format itself that cannot be solved in a parser, as this is simply ill-defined.
    However, standard nucleic acid or amino acid codes do not use the + character, so this should rarely be an issue in practice.
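One way to support wrapped lines despite the ambiguity of '@' and '+' is to count characters: sequence lines are collected until a line starts with '+', and quality lines are then collected until as many quality symbols as sequence letters have been read. The following standalone sketch illustrates this idea. `WrappedRecord` and `read_wrapped_record` are hypothetical names, and we do not claim that FastqReader works exactly this way. For brevity, the sketch does not compare the '+' line against the label.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record type for this sketch; not a genesis class.
struct WrappedRecord
{
    std::string label;
    std::string sites;
    std::string quality;
};

// Read one record whose sequence and quality parts may span several lines.
// Returns false if the input is exhausted.
bool read_wrapped_record( std::istream& in, WrappedRecord& rec )
{
    std::string line;
    if( !std::getline( in, line )) {
        return false;
    }
    if( line.empty() || line[0] != '@' ) {
        throw std::runtime_error( "Record has to start with '@'" );
    }
    rec.label = line.substr( 1 );
    rec.sites.clear();
    rec.quality.clear();

    // Sequence part: everything up to the first line that starts with '+'.
    // This is where the edge case described above occurs: a wrapped sequence
    // line starting with '+' would be mistaken for the separator line.
    bool found_plus = false;
    while( std::getline( in, line )) {
        if( !line.empty() && line[0] == '+' ) {
            found_plus = true;
            break;
        }
        rec.sites += line;
    }
    if( !found_plus ) {
        throw std::runtime_error( "Missing '+' line" );
    }

    // Quality part: read lines until enough symbols are collected. Because we
    // count symbols, a quality line starting with '@' or '+' is harmless.
    do {
        if( !std::getline( in, line )) {
            break;
        }
        rec.quality += line;
    } while( rec.quality.size() < rec.sites.size() );

    if( rec.quality.size() != rec.sites.size() ) {
        throw std::runtime_error( "Quality and sequence lengths differ" );
    }
    assert( rec.quality.size() == rec.sites.size() );
    return true;
}
```

In this sketch, a record such as `@r1`, `ACGT`, `AC`, `+r1`, `@III`, `+I` (one item per line) yields the sites `ACGTAC` and the quality string `@III+I`, even though two of the quality lines begin with '@' and '+'.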

By default, we interpret quality values as phred scores in the Sanger format, that is, use an ASCII offset of 33, where '!' stands for the lowest phred quality score of 0. To change the encoding, use the quality_encoding() function, which accepts Sanger, Solexa, and different Illumina versions.
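To make the offset concrete, the sketch below decodes a quality string by subtracting the ASCII offset, and converts scores to error probabilities. It is a standalone illustration with hypothetical function names (`decode_phred`, `phred_to_error_probability`, `solexa_to_phred`), not the decoding code of genesis. The formulas are the standard definitions: a phred score is Q = -10 log10(p), and a Solexa score is Q = -10 log10(p / (1 - p)), as discussed in [1].

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>
#include <string>
#include <vector>

// Decode phred scores from a quality string by subtracting an ASCII offset.
// The default of 33 corresponds to the Sanger format, where '!' is score 0.
std::vector<int> decode_phred( std::string const& quality, int offset = 33 )
{
    assert( offset >= 0 );
    std::vector<int> scores;
    scores.reserve( quality.size() );
    for( char c : quality ) {
        int const score = static_cast<unsigned char>( c ) - offset;
        if( score < 0 ) {
            throw std::invalid_argument( "Quality character below the ASCII offset" );
        }
        scores.push_back( score );
    }
    return scores;
}

// Phred definition: Q = -10 * log10( p ), hence p = 10^( -Q / 10 ).
double phred_to_error_probability( int phred )
{
    return std::pow( 10.0, -phred / 10.0 );
}

// Solexa scores are defined on the odds p / (1 - p) instead of on p.
// Converting to the phred scale gives Q_phred = 10 * log10( 10^(Q_solexa / 10) + 1 ).
double solexa_to_phred( int solexa )
{
    return 10.0 * std::log10( std::pow( 10.0, solexa / 10.0 ) + 1.0 );
}
```

With the default offset, the characters '!', '5', and 'I' decode to the phred scores 0, 20, and 40, and a phred score of 20 corresponds to an error probability of 0.01.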

For even more advanced use cases, the whole function for parsing the quality string can be replaced as well, by setting the quality_string_plugin() function. This is useful, for example, if the quality scores are not needed at all (simply provide an empty function in this case), or if the file is first parsed once to detect the most probable encoding; see guess_fastq_quality_encoding() for an example.
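As a rough idea of what such a detection pass could look like, the sketch below inspects the lowest ASCII character seen in a set of quality strings. It is a simplified heuristic written for this page, not the algorithm of guess_fastq_quality_encoding(). The enum `EncodingGuess` and the function `guess_encoding` are hypothetical. The thresholds assume the commonly documented character ranges: offset 33 starting at '!' (33), Solexa starting at ';' (59), and offset 64 starting at '@' (64). Files containing only high scores remain ambiguous under this heuristic.

```cpp
#include <string>
#include <vector>

// Hypothetical result type for this sketch; genesis has its own QualityEncoding.
enum class EncodingGuess
{
    kOffset33,  // Sanger style, ASCII offset 33
    kSolexa,    // Solexa scores, ASCII offset 64, may be as low as -5
    kOffset64   // phred scores with ASCII offset 64
};

// Guess the encoding from the lowest character found in the quality strings.
EncodingGuess guess_encoding( std::vector<std::string> const& quality_strings )
{
    int lowest = 256; // larger than any character value
    for( auto const& quality : quality_strings ) {
        for( char c : quality ) {
            int const value = static_cast<unsigned char>( c );
            if( value < lowest ) {
                lowest = value;
            }
        }
    }

    // Without any data, fall back to the default offset of 33.
    if( lowest == 256 ) {
        return EncodingGuess::kOffset33;
    }
    // Characters below ';' (59) can only occur with an offset of 33.
    if( lowest < 59 ) {
        return EncodingGuess::kOffset33;
    }
    // Characters from ';' (59) to '?' (63) are negative Solexa scores
    // when interpreted with an offset of 64.
    if( lowest < 64 ) {
        return EncodingGuess::kSolexa;
    }
    return EncodingGuess::kOffset64;
}
```

In practice, the quality strings could be gathered in a first pass over the file with a collecting quality_string_plugin(), and the guessed encoding then applied in a second pass.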

To set the quality_string_plugin(), use for example the following:

auto reader = FastqReader();
reader.quality_string_plugin(
    [&]( std::string const& quality_string, Sequence& sequence )
    {
        // do something with the quality_string, and potentially store it in the sequence
    }
);
reader.read( utils::from_file( infile ), sequence_set );
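The two use cases mentioned above, ignoring the quality string and storing it, can be sketched as small factory functions that return a suitable callable. To keep the sketch compilable without genesis, it uses a hypothetical stand-in type `MockSequence` and redeclares the function type with it. With the real library, the lambdas would take a `Sequence&` instead, and would be passed to quality_string_plugin().

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the genesis Sequence class, reduced to two fields.
struct MockSequence
{
    std::string label;
    std::string sites;
};

// Same shape as FastqReader::quality_string_function, but for the mock type.
using mock_quality_string_function
    = std::function< void( std::string const& quality_string, MockSequence& sequence ) >;

// Plugin for the case that quality scores are not needed at all: do nothing.
mock_quality_string_function make_discarding_plugin()
{
    return []( std::string const&, MockSequence& ) {};
}

// Plugin that appends each raw quality string to a caller-owned container,
// for example to inspect the encoding after a first pass over the file.
mock_quality_string_function make_collecting_plugin( std::vector<std::string>& store )
{
    return [&store]( std::string const& quality_string, MockSequence& sequence ) {
        // The format requires one quality symbol per sequence letter.
        assert( quality_string.size() == sequence.sites.size() );
        store.push_back( quality_string );
    };
}
```

Because the collecting plugin captures `store` by reference, the container has to outlive every call of the plugin, that is, it has to stay alive until reading is finished.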

More information on the format can be found at:

[1] P. Cock, C. Fields, N. Goto, M. Heuer, P. Rice.
"The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants."
Nucleic Acids Research, 38(6), 1767–1771, 2009.
https://doi.org/10.1093/nar/gkp1137

Using site_casing(), the sequences can automatically be turned into upper or lower case letters. Also, see valid_chars( std::string const& chars ) for a way of checking correct input sequences.

Definition at line 149 of file fastq_reader.hpp.

Public Member Functions

 FastqReader ()
 Create a default FastqReader.

 FastqReader (FastqReader &&)=default

 FastqReader (FastqReader const &)=default

 ~FastqReader ()=default

FastqReader & operator= (FastqReader &&)=default

FastqReader & operator= (FastqReader const &)=default

void parse_document (utils::InputStream &input_stream, SequenceSet &sequence_set) const
 Parse a whole fastq document into a SequenceSet.

bool parse_sequence (utils::InputStream &input_stream, Sequence &sequence) const
 Parse a Sequence in Fastq format.

QualityEncoding quality_encoding ()
 Return the currently set QualityEncoding that is used for decoding the quality score line of the Fastq file.

FastqReader & quality_encoding (QualityEncoding encoding)
 Set the QualityEncoding used for decoding the quality score line of the Fastq file.

FastqReader & quality_string_plugin (quality_string_function const &plugin)
 Functional that can be set to process the quality string found in fastq files.

SequenceSet read (std::shared_ptr< utils::BaseInputSource > source) const
 Read all Sequences from an input source in Fastq format and return them as a SequenceSet.

void read (std::shared_ptr< utils::BaseInputSource > source, SequenceSet &sequence_set) const
 Read all Sequences from an input source in Fastq format into a SequenceSet.

SiteCasing site_casing () const
 Return whether Sequence sites are automatically turned into upper or lower case.

FastqReader & site_casing (SiteCasing value)
 Set whether Sequence sites are automatically turned into upper or lower case.

utils::CharLookup< bool > & valid_char_lookup ()
 Return the internal CharLookup that is used for validating the Sequence sites.

std::string valid_chars () const
 Return the currently set chars used for validating Sequence sites.

FastqReader & valid_chars (std::string const &chars)
 Set the chars that are used for validating Sequence sites when reading them.
 

Public Types

using quality_string_function = std::function< void(std::string const &quality_string, Sequence &sequence) >
 Function type for working with the quality line(s) in fastq files.

enum  SiteCasing { kUnchanged, kToUpper, kToLower }
 Enumeration of casing methods to apply to each site of a Sequence.
 

Protected Member Functions

void parse_label1_ (utils::InputStream &input_stream, Sequence &sequence) const
 Parse the first label line (starting with an @).

void parse_label2_ (utils::InputStream &input_stream, Sequence &sequence) const
 Parse the second label line (starting with a +, and either empty or equal to the first).

void parse_quality_ (utils::InputStream &input_stream, Sequence &sequence) const
 Parse the quality score line(s), which also runs the plugin, if available.

bool parse_sequence_ (utils::InputStream &input_stream, Sequence &sequence) const
 Parse a fastq sequence into the given sequence object.

void parse_sites_ (utils::InputStream &input_stream, Sequence &sequence) const
 Parse the sequence line(s).
 

Constructor & Destructor Documentation

◆ FastqReader() [1/3]

Create a default FastqReader.

By default, chars are turned into upper case, but not validated. See site_casing() and valid_chars() to change this behaviour.

Furthermore, by default, we interpret the quality score string as phred scores in the Sanger format.

Definition at line 57 of file fastq_reader.cpp.

◆ ~FastqReader()

~FastqReader ( )
default

◆ FastqReader() [2/3]

FastqReader ( FastqReader const &  )
default

◆ FastqReader() [3/3]

FastqReader ( FastqReader &&  )
default

Member Function Documentation

◆ operator=() [1/2]

FastqReader& operator= ( FastqReader &&  )
default

◆ operator=() [2/2]

FastqReader& operator= ( FastqReader const &  )
default

◆ parse_document()

void parse_document ( utils::InputStream & input_stream,
SequenceSet & sequence_set 
) const

Parse a whole fastq document into a SequenceSet.

This function is mainly used internally by the reading functions read(). It is however also fine to call it from the outside.

Definition at line 86 of file fastq_reader.cpp.

◆ parse_label1_()

void parse_label1_ ( utils::InputStream & input_stream,
Sequence & sequence 
) const
protected

Parse the first label line (starting with an @).

Definition at line 136 of file fastq_reader.cpp.

◆ parse_label2_()

void parse_label2_ ( utils::InputStream & input_stream,
Sequence & sequence 
) const
protected

Parse the second label line (starting with a +, and either empty or equal to the first).

Definition at line 234 of file fastq_reader.cpp.

◆ parse_quality_()

void parse_quality_ ( utils::InputStream & input_stream,
Sequence & sequence 
) const
protected

Parse the quality score line(s), which also runs the plugin, if available.

Definition at line 263 of file fastq_reader.cpp.

◆ parse_sequence()

bool parse_sequence ( utils::InputStream & input_stream,
Sequence & sequence 
) const

Parse a Sequence in Fastq format.

This function takes a utils::InputStream and interprets it as a Fastq formatted sequence. It extracts the data and writes it into the given Sequence object. See the class description of FastqReader for the expected data format.

The function stops after parsing one such sequence, and leaves the stream at the first character of the next line that follows the quality score string. It returns true if a sequence was extracted and false if the stream is empty. If the input is not in the correct format, a std::runtime_error exception is thrown, indicating the offending position in the input stream.

Definition at line 96 of file fastq_reader.cpp.

◆ parse_sequence_()

bool parse_sequence_ ( utils::InputStream & input_stream,
Sequence & sequence 
) const
protected

Parse a fastq sequence into the given sequence object.

Definition at line 107 of file fastq_reader.cpp.

◆ parse_sites_()

void parse_sites_ ( utils::InputStream & input_stream,
Sequence & sequence 
) const
protected

Parse the sequence line(s).

Definition at line 173 of file fastq_reader.cpp.

◆ quality_encoding() [1/2]

QualityEncoding quality_encoding ( )

Return the currently set QualityEncoding that is used for decoding the quality score line of the Fastq file.

Definition at line 356 of file fastq_reader.cpp.

◆ quality_encoding() [2/2]

FastqReader & quality_encoding ( QualityEncoding  encoding)

Set the QualityEncoding used for decoding the quality score line of the Fastq file.

By default, we use Sanger encoding. This can be changed here.

Definition at line 350 of file fastq_reader.cpp.

◆ quality_string_plugin()

FastqReader & quality_string_plugin ( quality_string_function const &  plugin)

Functional that can be set to process the quality string found in fastq files.

See the class description for details.

Definition at line 361 of file fastq_reader.cpp.

◆ read() [1/2]

SequenceSet read ( std::shared_ptr< utils::BaseInputSource > source) const

Read all Sequences from an input source in Fastq format and return them as a SequenceSet.

Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.

Definition at line 66 of file fastq_reader.cpp.

◆ read() [2/2]

void read ( std::shared_ptr< utils::BaseInputSource > source,
SequenceSet & sequence_set 
) const

Read all Sequences from an input source in Fastq format into a SequenceSet.

The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.

Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.

Definition at line 74 of file fastq_reader.cpp.

◆ site_casing() [1/2]

FastqReader::SiteCasing site_casing ( ) const

Return whether Sequence sites are automatically turned into upper or lower case.

Definition at line 315 of file fastq_reader.cpp.

◆ site_casing() [2/2]

FastqReader & site_casing ( SiteCasing  value)

Set whether Sequence sites are automatically turned into upper or lower case.

Default is SiteCasing::kToUpper, that is, all sites of the read Sequences are turned into upper case letters automatically. This is typical behaviour, although not standardized. The function returns the FastqReader object to allow for fluent interfaces.

Definition at line 309 of file fastq_reader.cpp.

◆ valid_char_lookup()

utils::CharLookup< bool > & valid_char_lookup ( )

Return the internal CharLookup that is used for validating the Sequence sites.

This function is provided in case direct access to the lookup is needed. Usually, the valid_chars() function should suffice. See there for details.

Definition at line 345 of file fastq_reader.cpp.

◆ valid_chars() [1/2]

std::string valid_chars ( ) const

Return the currently set chars used for validating Sequence sites.

An empty string means that no validation is done.

Definition at line 334 of file fastq_reader.cpp.

◆ valid_chars() [2/2]

FastqReader & valid_chars ( std::string const &  chars)

Set the chars that are used for validating Sequence sites when reading them.

When this function is called with a string of chars, those chars are used to validate the sites when reading them. That is, only sequences consisting of the given chars are valid.

If set to an empty string, this check is deactivated. This is also the default, meaning that no checking is done.

If site_casing() is set to a value other than SiteCasing::kUnchanged, the validation is done after changing the casing, so that only lower case or only upper case letters have to be provided for validation. If site_casing() is set to SiteCasing::kUnchanged, all chars that are to be considered valid have to be provided for validation.

See the nucleic_acid_codes...() and amino_acid_codes...() functions for preset char sets that can be used for validation here.
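The order of operations described above, casing first and validation second, can be illustrated with a small standalone sketch. The names `Casing` and `normalize_and_validate` are hypothetical and exist only for this example; the sketch mirrors the documented behaviour, not the internal implementation of FastqReader.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical counterpart of FastqReader::SiteCasing for this sketch.
enum class Casing { kUnchanged, kToUpper, kToLower };

// Apply the casing to the sites, then check them against the valid chars.
// An empty valid_chars string deactivates the check.
std::string normalize_and_validate(
    std::string sites, Casing casing, std::string const& valid_chars
) {
    // Step 1: change the casing, if requested.
    for( char& c : sites ) {
        auto const u = static_cast<unsigned char>( c );
        if( casing == Casing::kToUpper ) {
            c = static_cast<char>( std::toupper( u ));
        } else if( casing == Casing::kToLower ) {
            c = static_cast<char>( std::tolower( u ));
        }
    }

    // Step 2: validate the sites as they are after step 1.
    if( !valid_chars.empty() ) {
        auto const pos = sites.find_first_not_of( valid_chars );
        if( pos != std::string::npos ) {
            throw std::invalid_argument(
                "Invalid char at site " + std::to_string( pos )
            );
        }
    }
    return sites;
}
```

With `Casing::kToUpper` and the valid chars `"ACGT"`, the input `"acgt"` passes and becomes `"ACGT"`. With `Casing::kUnchanged` and the same valid chars, the same input is rejected, because the lower case letters were not listed.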

Definition at line 320 of file fastq_reader.cpp.

Member Typedef Documentation

◆ quality_string_function

using quality_string_function = std::function< void( std::string const& quality_string, Sequence& sequence ) >

Function type for working with the quality line(s) in fastq files.

This reader class is adjustable with respect to the encoding and usage of the quality line(s) in fastq files. Typically, these lines contain some encoding of the phred quality score of the bases found in the sequence string. However, as there are several variants of this encoding, and as the quality score is not always needed at all, we leave the usage of the quality string adjustable.

This function type can hence be used to process the quality_string, for example by storing it, or by processing it to find the correct encoding first. Use quality_string_plugin() to set such a function.

Definition at line 172 of file fastq_reader.hpp.

Member Enumeration Documentation

◆ SiteCasing

enum SiteCasing
strong

Enumeration of casing methods to apply to each site of a Sequence.

Enumerator
kUnchanged 

Do not change the case of the sites.

kToUpper 

Make all sites upper case.

kToLower 

Make all sites lower case.

Definition at line 177 of file fastq_reader.hpp.


The documentation for this class was generated from the following files: