#include <genesis/sequence/formats/fastq_reader.hpp>
Read Fastq sequence data.
This class provides simple facilities for reading Fastq data.
Example usage:

    std::string infile = "path/to/file.fastq";
    SequenceSet sequence_set;

    FastqReader()
        .site_casing( FastqReader::SiteCasing::kUnchanged )
        .valid_chars( nucleic_acid_codes_all() )
        .read( utils::from_file( infile ), sequence_set );
The expected data format is:
See https://en.wikipedia.org/wiki/FASTQ_format for details.
As the encoding for the quality values can differ substantially depending on the sequencing technology used, parsing fastq files is more difficult than parsing fasta files. Two issues arise:
+
character, so this should rarely be an issue in practice.
By default, we interpret quality values as phred scores in the Sanger format, that is, we use an ASCII offset of 33, where '!' stands for the lowest phred quality score of 0. To change the encoding, use the quality_encoding() function, which accepts Sanger, Solexa, and different Illumina versions.
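The offset arithmetic behind the Sanger encoding can be illustrated with a minimal, self-contained sketch. The helper `decode_phred()` is hypothetical and not part of genesis; it only assumes that a phred score is obtained by subtracting a fixed ASCII offset (33 for Sanger) from each quality character.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not a genesis function): decode a quality string into
// phred scores by subtracting a fixed ASCII offset, 33 for the Sanger format.
std::vector<int> decode_phred( std::string const& quality_string, int offset = 33 )
{
    std::vector<int> scores;
    scores.reserve( quality_string.size() );
    for( char c : quality_string ) {
        int const score = static_cast<unsigned char>( c ) - offset;
        if( score < 0 ) {
            // A character below the offset cannot belong to this encoding.
            throw std::invalid_argument(
                "Quality character below the ASCII offset of the chosen encoding."
            );
        }
        scores.push_back( score );
    }
    return scores;
}
```

With the default offset, `decode_phred( "!5I" )` yields the scores 0, 20, and 40. Encodings that use a different score definition, such as Solexa, need more than a change of offset and are not covered by this sketch.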
For even more advanced use cases, the whole function for parsing the quality string can be changed as well, by setting the quality_string_plugin() function. This is for example useful if the quality scores are not needed at all (simply provide an empty function in this case), or if the file is first parsed once to detect the most probable encoding - see guess_fastq_quality_encoding() for an example.
To set the quality_string_plugin(), use for example the following:
    auto reader = FastqReader();
    reader.quality_string_plugin(
        [&]( std::string const& quality_string, Sequence& sequence )
        {
            // do something with the quality_string, and potentially store it in the sequence
        }
    );
    reader.read( utils::from_file( infile ), sequence_set );
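The two use cases mentioned above, ignoring the quality string and scanning it to detect the encoding, can be sketched without genesis itself. In the following, `MockSequence`, `mock_quality_function`, `CharCounts`, and both factory functions are stand-ins invented for illustration; only the shape of the function type mirrors quality_string_function.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <string>

// Stand-in for the genesis Sequence class, only to keep the sketch self-contained.
struct MockSequence
{
    std::string label;
    std::string sites;
};

// Same shape as quality_string_function, but using the stand-in type.
using mock_quality_function = std::function<
    void( std::string const& quality_string, MockSequence& sequence )
>;

// Tally of how often each ASCII character occurs in the quality strings.
using CharCounts = std::array<std::size_t, 128>;

// Return a plugin that counts the characters of every quality string it sees.
// The caller has to keep `counts` alive for as long as the plugin is in use.
mock_quality_function make_counting_plugin( CharCounts& counts )
{
    return [&counts]( std::string const& quality_string, MockSequence& ) {
        for( char c : quality_string ) {
            auto const code = static_cast<unsigned char>( c );
            if( code < counts.size() ) {
                ++counts[ code ];
            }
        }
    };
}

// Return a plugin that discards the quality string entirely.
mock_quality_function make_ignoring_plugin()
{
    return []( std::string const&, MockSequence& ) {};
}
```

After a first pass with the counting plugin, the lowest and highest characters that occur give a hint about which encoding is plausible. How guess_fastq_quality_encoding() actually decides this is not shown here.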
More information on the format can be found at:
[1] P. Cock, C. Fields, N. Goto, M. Heuer, P. Rice.
"The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants."
Nucleic Acids Research, 38(6), 1767–1771, 2009.
https://doi.org/10.1093/nar/gkp1137
Using site_casing(), the sequences can automatically be turned into upper or lower case letters. Also, see valid_chars( std::string const& chars ) for a way of checking correct input sequences.
Definition at line 149 of file fastq_reader.hpp.
Public Member Functions | |
---|---|
FastqReader () | Create a default FastqReader. |
FastqReader (FastqReader &&)=default | |
FastqReader (FastqReader const &)=default | |
~FastqReader ()=default | |
FastqReader & operator= (FastqReader &&)=default | |
FastqReader & operator= (FastqReader const &)=default | |
void parse_document (utils::InputStream &input_stream, SequenceSet &sequence_set) const | Parse a whole fastq document into a SequenceSet. |
bool parse_sequence (utils::InputStream &input_stream, Sequence &sequence) const | Parse a Sequence in Fastq format. |
QualityEncoding quality_encoding () | Return the currently set QualityEncoding that is used for decoding the quality score line of the Fastq file. |
FastqReader & quality_encoding (QualityEncoding encoding) | Set the QualityEncoding used for decoding the quality score line of the Fastq file. |
FastqReader & quality_string_plugin (quality_string_function const &plugin) | Functional that can be set to process the quality string found in fastq files. |
SequenceSet read (std::shared_ptr< utils::BaseInputSource > source) const | Read all Sequences from an input source in Fastq format and return them as a SequenceSet. |
void read (std::shared_ptr< utils::BaseInputSource > source, SequenceSet &sequence_set) const | Read all Sequences from an input source in Fastq format into a SequenceSet. |
SiteCasing site_casing () const | Return whether Sequence sites are automatically turned into upper or lower case. |
FastqReader & site_casing (SiteCasing value) | Set whether Sequence sites are automatically turned into upper or lower case. |
utils::CharLookup< bool > & valid_char_lookup () | Return the internal CharLookup that is used for validating the Sequence sites. |
std::string valid_chars () const | Return the currently set chars used for validating Sequence sites. |
FastqReader & valid_chars (std::string const &chars) | Set the chars that are used for validating Sequence sites when reading them. |
Public Types | |
---|---|
using quality_string_function = std::function< void(std::string const &quality_string, Sequence &sequence) > | Function type that allows to work with the quality line(s) in fastq files. |
enum SiteCasing { kUnchanged, kToUpper, kToLower } | Enumeration of casing methods to apply to each site of a Sequence. |
Protected Member Functions | |
---|---|
void parse_label1_ (utils::InputStream &input_stream, Sequence &sequence) const | Parse the first label line (starting with an @). |
void parse_label2_ (utils::InputStream &input_stream, Sequence &sequence) const | Parse the second label line (starting with a +, and either empty or equal to the first). |
void parse_quality_ (utils::InputStream &input_stream, Sequence &sequence) const | Parse the quality score line(s), which also runs the plugin, if available. |
bool parse_sequence_ (utils::InputStream &input_stream, Sequence &sequence) const | Parse a fastq sequence into the given sequence object. |
void parse_sites_ (utils::InputStream &input_stream, Sequence &sequence) const | Parse the sequence line(s). |
FastqReader ( )
Create a default FastqReader.
By default, chars are turned into upper case, but not validated. See site_casing() and valid_chars() to change this behaviour.
Furthermore, by default, we interpret the quality score string as phred scores in the Sanger format.
Definition at line 57 of file fastq_reader.cpp.
void parse_document ( utils::InputStream & input_stream, SequenceSet & sequence_set ) const
Parse a whole fastq document into a SequenceSet.
This function is mainly used internally by the reading functions read(). It is however also fine to call it from the outside.
Definition at line 86 of file fastq_reader.cpp.
void parse_label1_ ( utils::InputStream & input_stream, Sequence & sequence ) const [protected]
Parse the first label line (starting with an @).
Definition at line 136 of file fastq_reader.cpp.
void parse_label2_ ( utils::InputStream & input_stream, Sequence & sequence ) const [protected]
Parse the second label line (starting with a +, and either empty or equal to the first).
Definition at line 234 of file fastq_reader.cpp.
void parse_quality_ ( utils::InputStream & input_stream, Sequence & sequence ) const [protected]
Parse the quality score line(s), which also runs the plugin, if available.
Definition at line 263 of file fastq_reader.cpp.
bool parse_sequence ( utils::InputStream & input_stream, Sequence & sequence ) const
Parse a Sequence in Fastq format.
This function takes a utils::InputStream and interprets it as a Fastq formatted sequence. It extracts the data and writes it into the given Sequence object. See the class description of FastqReader for the expected data format.
The function stops after parsing one such sequence, and leaves the stream at the first character of the next line that follows the quality score string. It returns true if a sequence was extracted and false if the stream is empty. If the input is not in the correct format, a std::runtime_error exception is thrown, indicating the offending position in the input stream.
Definition at line 96 of file fastq_reader.cpp.
bool parse_sequence_ ( utils::InputStream & input_stream, Sequence & sequence ) const [protected]
Parse a fastq sequence into the given sequence object.
Definition at line 107 of file fastq_reader.cpp.
void parse_sites_ ( utils::InputStream & input_stream, Sequence & sequence ) const [protected]
Parse the sequence line(s).
Definition at line 173 of file fastq_reader.cpp.
QualityEncoding quality_encoding ( )
Return the currently set QualityEncoding that is used for decoding the quality score line of the Fastq file.
Definition at line 356 of file fastq_reader.cpp.
FastqReader & quality_encoding ( QualityEncoding encoding )
Set the QualityEncoding used for decoding the quality score line of the Fastq file.
By default, we use Sanger encoding. This can be changed here.
Definition at line 350 of file fastq_reader.cpp.
FastqReader & quality_string_plugin ( quality_string_function const & plugin )
Functional that can be set to process the quality string found in fastq files.
See the class description for details.
Definition at line 361 of file fastq_reader.cpp.
SequenceSet read ( std::shared_ptr< utils::BaseInputSource > source ) const
Read all Sequences from an input source in Fastq format and return them as a SequenceSet.
Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.
Definition at line 66 of file fastq_reader.cpp.
void read ( std::shared_ptr< utils::BaseInputSource > source, SequenceSet & sequence_set ) const
Read all Sequences from an input source in Fastq format into a SequenceSet.
The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.
Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.
Definition at line 74 of file fastq_reader.cpp.
FastqReader::SiteCasing site_casing ( ) const
Return whether Sequence sites are automatically turned into upper or lower case.
Definition at line 315 of file fastq_reader.cpp.
FastqReader & site_casing ( SiteCasing value )
Set whether Sequence sites are automatically turned into upper or lower case.
Default is SiteCasing::kToUpper, that is, all sites of the read Sequences are turned into upper case letters automatically. This is typical behaviour, although not standardized. The function returns the FastqReader object to allow for fluent interfaces.
Definition at line 309 of file fastq_reader.cpp.
utils::CharLookup< bool > & valid_char_lookup ( )
Return the internal CharLookup that is used for validating the Sequence sites.
This function is provided in case direct access to the lookup is needed. Usually, the valid_chars() function should suffice. See there for details.
Definition at line 345 of file fastq_reader.cpp.
std::string valid_chars ( ) const
Return the currently set chars used for validating Sequence sites.
An empty string means that no validation is done.
Definition at line 334 of file fastq_reader.cpp.
FastqReader & valid_chars ( std::string const & chars )
Set the chars that are used for validating Sequence sites when reading them.
When this function is called with a string of chars, those chars are used to validate the sites when reading them. That is, only sequences consisting of the given chars are valid.
If set to an empty string, this check is deactivated. This is also the default, meaning that no checking is done.
If site_casing() is set to a value other than SiteCasing::kUnchanged, the validation is done after changing the casing, so that only lower case or only upper case letters have to be provided for validation. If site_casing() is set to SiteCasing::kUnchanged, all chars that are to be considered valid have to be provided for validation.
See the nucleic_acid_codes...() and amino_acid_codes...() functions for presets of chars that can be used for validation here.
Definition at line 320 of file fastq_reader.cpp.
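The interplay between casing and validation described above can be sketched as follows. The names `Casing` and `apply_casing_and_validate()` are hypothetical and not part of genesis, and the sketch returns a bool for invalid input, which is an assumption made for illustration; how the real reader reports invalid sites is not stated here.

```cpp
#include <array>
#include <cctype>
#include <string>

// Stand-in for FastqReader::SiteCasing, to keep the sketch self-contained.
enum class Casing { kUnchanged, kToUpper, kToLower };

// Apply the casing to `sites` in place, then check every site against `valid_chars`.
// An empty `valid_chars` deactivates the check, mirroring the documented default.
bool apply_casing_and_validate(
    std::string& sites, Casing casing, std::string const& valid_chars
) {
    // Step 1: change the casing first, so that validation sees the final chars.
    for( char& c : sites ) {
        auto const uc = static_cast<unsigned char>( c );
        if( casing == Casing::kToUpper ) {
            c = static_cast<char>( std::toupper( uc ));
        } else if( casing == Casing::kToLower ) {
            c = static_cast<char>( std::tolower( uc ));
        }
    }
    if( valid_chars.empty() ) {
        return true;
    }

    // Step 2: build a lookup table of valid chars and test each site against it.
    std::array<bool, 256> lookup{};
    for( char c : valid_chars ) {
        lookup[ static_cast<unsigned char>( c ) ] = true;
    }
    for( char c : sites ) {
        if( !lookup[ static_cast<unsigned char>( c ) ] ) {
            return false;
        }
    }
    return true;
}
```

With `Casing::kToUpper`, the input "acgt" passes validation against "ACGT", because the check runs on the converted string. With `Casing::kUnchanged`, the same input fails unless the lower case letters are listed as well.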
using quality_string_function = std::function< void( std::string const& quality_string, Sequence& sequence ) >
Function type that allows to work with the quality line(s) in fastq files.
This reader class is adjustable with respect to the encoding and usage of the quality line(s) in fastq files. Typically, these lines contain some encoding of the phred quality score of the bases found in the sequence string. However, as there are several variants of this encoding, and as the quality score is not always needed, we leave the usage of the quality string adjustable.
This function type can hence be used to process the quality_string, for example by storing it, or by processing it to find the correct encoding first. Use quality_string_plugin() to set such a function.
Definition at line 172 of file fastq_reader.hpp.
enum SiteCasing [strong]
Enumeration of casing methods to apply to each site of a Sequence.
Enumerator | |
---|---|
kUnchanged | Do not change the case of the sites. |
kToUpper | Make all sites upper case. |
kToLower | Make all sites lower case. |
Definition at line 177 of file fastq_reader.hpp.