#include <genesis/sequence/formats/fasta_reader.hpp>
Read Fasta sequence data.
This class provides simple facilities for reading Fasta data.
Exemplary usage:
    std::string infile = "path/to/file.fasta";
    SequenceSet sequence_set;

    FastaReader()
        .site_casing( SiteCasing::kUnchanged )
        .valid_chars( nucleic_acid_codes_all() )
        .read( utils::from_file( infile ), sequence_set );
The expected data format:
More information on the format can be found at:
Using site_casing(), the sequences can automatically be turned into upper or lower case letters. Also, see valid_chars( std::string const& chars ) for a way of checking that input sequences are correct.
Definition at line 91 of file fasta_reader.hpp.
Public Member Functions | |
FastaReader () | |
Create a default FastaReader. By default, chars are turned to upper case, but not validated. More... | |
FastaReader (FastaReader &&)=default | |
FastaReader (FastaReader const &)=default | |
~FastaReader ()=default | |
bool | guess_abundances () const |
Return whether the label is used to guess/extract Sequence abundances. More... | |
FastaReader & | guess_abundances (bool value) |
Set whether Sequence labels are used to guess/extract Sequence abundances. More... | |
FastaReader & | operator= (FastaReader &&)=default |
FastaReader & | operator= (FastaReader const &)=default |
void | parse_document (utils::InputStream &input_stream, SequenceSet &sequence_set) const |
Parse a whole fasta document into a SequenceSet. More... | |
bool | parse_sequence (utils::InputStream &input_stream, Sequence &sequence) const |
Parse a Sequence in Fasta format. More... | |
bool | parse_sequence_pedantic (utils::InputStream &input_stream, Sequence &sequence) const |
Parse a Sequence in Fasta format. More... | |
ParsingMethod | parsing_method () const |
Return the currently set parsing method. More... | |
FastaReader & | parsing_method (ParsingMethod value) |
Set the parsing method. More... | |
SequenceSet | read (std::shared_ptr< utils::BaseInputSource > source) const |
Read all Sequences from an input source in Fasta format and return them as a SequenceSet. More... | |
void | read (std::shared_ptr< utils::BaseInputSource > source, SequenceSet &sequence_set) const |
Read all Sequences from an input source in Fasta format into a SequenceSet. More... | |
SequenceDict | read_dict (std::shared_ptr< utils::BaseInputSource > source) const |
Read all Sequences from an input source in fasta format, but only return their names and lengths as a SequenceDict. More... | |
ReferenceGenome | read_reference_genome (std::shared_ptr< utils::BaseInputSource > source, bool also_look_up_first_word=true) const |
Read all Sequences from an input source in fasta format into a ReferenceGenome. More... | |
SiteCasing | site_casing () const |
Return whether Sequence sites are automatically turned into upper or lower case. More... | |
FastaReader & | site_casing (SiteCasing value) |
Set whether Sequence sites are automatically turned into upper or lower case. More... | |
utils::CharLookup< bool > & | valid_char_lookup () |
Return the internal CharLookup that is used for validating the Sequence sites. More... | |
std::string | valid_chars () const |
Return the currently set chars used for validating Sequence sites. More... | |
FastaReader & | valid_chars (std::string const &chars) |
Set the chars that are used for validating Sequence sites when reading them. More... | |
Public Types | |
enum | ParsingMethod { kDefault, kPedantic } |
Enumeration of the available methods for parsing Fasta sequences. More... | |
enum | SiteCasing { kUnchanged, kToUpper, kToLower } |
Enumeration of casing methods to apply to each site of a Sequence. More... | |
FastaReader | ( | ) |
Create a default FastaReader. By default, chars are turned to upper case, but not validated.
See site_casing() and valid_chars() to change this behaviour.
Definition at line 55 of file fasta_reader.cpp.
bool guess_abundances | ( | ) | const |
Return whether the label is used to guess/extract Sequence abundances.
Definition at line 394 of file fasta_reader.cpp.
FastaReader & guess_abundances | ( | bool | value | ) |
Set whether Sequence labels are used to guess/extract Sequence abundances.
Default is false, that is, labels are just taken as they are in the input. If set to true, the label is used to guess an abundance count, which is set in the Sequence. See guess_sequence_abundance( Sequence const& ) for the valid formats of such abundances.
Definition at line 388 of file fasta_reader.cpp.
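To illustrate what guessing an abundance from a label can look like, here is a minimal standalone sketch. It is not the genesis implementation. It assumes two label conventions, `name;size=123;` and `name_123`, and falls back to an abundance of 1; the formats actually accepted are documented at guess_sequence_abundance(). The function names `is_all_digits` and `split_label_and_abundance` are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Returns true if `text` is non-empty and consists of decimal digits only.
bool is_all_digits(std::string const& text)
{
    if (text.empty()) {
        return false;
    }
    for (char c : text) {
        if (!std::isdigit(static_cast<unsigned char>(c))) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: split a label into a name and an abundance count.
// The two patterns checked here are assumptions made for this sketch.
std::pair<std::string, std::size_t> split_label_and_abundance(std::string const& label)
{
    // Assumed convention 1: "name;size=123;"
    std::string const key = ";size=";
    std::size_t const key_pos = label.find(key);
    if (key_pos != std::string::npos) {
        std::size_t const begin = key_pos + key.size();
        std::size_t end = begin;
        while (end < label.size() && std::isdigit(static_cast<unsigned char>(label[end]))) {
            ++end;
        }
        if (end > begin) {
            auto const count = static_cast<std::size_t>(
                std::stoull(label.substr(begin, end - begin))
            );
            return std::make_pair(label.substr(0, key_pos), count);
        }
    }

    // Assumed convention 2: "name_123"
    std::size_t const underscore = label.rfind('_');
    if (underscore != std::string::npos && is_all_digits(label.substr(underscore + 1))) {
        auto const count = static_cast<std::size_t>(std::stoull(label.substr(underscore + 1)));
        return std::make_pair(label.substr(0, underscore), count);
    }

    // No abundance found: keep the label as it is and use a count of 1.
    return std::make_pair(label, std::size_t{1});
}
```

With guess_abundances( false ), the default, no such splitting takes place and the label is stored unchanged.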
void parse_document | ( | utils::InputStream & | input_stream, |
SequenceSet & | sequence_set | ||
) | const |
Parse a whole fasta document into a SequenceSet.
This function is mainly used internally by the reading functions read(). It uses the currently set parsing_method() for parsing the data.
Definition at line 102 of file fasta_reader.cpp.
bool parse_sequence | ( | utils::InputStream & | input_stream, |
Sequence & | sequence | ||
) | const |
Parse a Sequence in Fasta format.
This function takes an InputStream and interprets it as a Fasta formatted sequence. It extracts the data and writes it into the given Sequence object. See the class description of FastaReader for the expected data format.
The function stops after parsing one such sequence. It returns true if a sequence was extracted and false if the stream is empty. If the input is not in the correct format, a std::runtime_error exception is thrown indicating the offending position in the input stream.
Definition at line 109 of file fasta_reader.cpp.
bool parse_sequence_pedantic | ( | utils::InputStream & | input_stream, |
Sequence & | sequence | ||
) | const |
Parse a Sequence in Fasta format.
This function takes an InputStream and interprets it as a Fasta formatted sequence. It extracts the data and writes it into the given Sequence object. See the class description of FastaReader for the expected data format.
The function stops after parsing one such sequence. It returns true if a sequence was extracted and false if the stream is empty. If the input is not in the correct format, a std::runtime_error exception is thrown indicating the offending position in the input stream.
Compared to parse_sequence(), this function reports errors at the exact line and column where they occur. It is, however, slower. Apart from that, there are no differences. See FastaReader::ParsingMethod for details.
Definition at line 229 of file fasta_reader.cpp.
FastaReader::ParsingMethod parsing_method | ( | ) | const |
Return the currently set parsing method.
See the ParsingMethod enum for details.
Definition at line 372 of file fasta_reader.cpp.
FastaReader & parsing_method | ( | FastaReader::ParsingMethod | value | ) |
Set the parsing method.
The parsing method is used for all the reader functions and parse_document(). See the ParsingMethod enum for details.
Definition at line 366 of file fasta_reader.cpp.
SequenceSet read | ( | std::shared_ptr< utils::BaseInputSource > | source | ) | const |
Read all Sequences from an input source in Fasta format and return them as a SequenceSet.
Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.
Definition at line 64 of file fasta_reader.cpp.
void read | ( | std::shared_ptr< utils::BaseInputSource > | source, |
SequenceSet & | sequence_set | ||
) | const |
Read all Sequences from an input source in Fasta format into a SequenceSet.
The Sequences are added to the SequenceSet, whose existing Sequences are kept. Thus, by repeatedly calling this or similar read functions, multiple input files can easily be read into one SequenceSet.
Use functions such as utils::from_file() and utils::from_string() to conveniently get an input source that can be used here.
Definition at line 72 of file fasta_reader.cpp.
SequenceDict read_dict | ( | std::shared_ptr< utils::BaseInputSource > | source | ) | const |
Read all Sequences from an input source in fasta format, but only return their names and lengths as a SequenceDict.
Definition at line 80 of file fasta_reader.cpp.
ReferenceGenome read_reference_genome | ( | std::shared_ptr< utils::BaseInputSource > | source, |
bool | also_look_up_first_word = true | ||
) | const |
Read all Sequences from an input source in fasta format into a ReferenceGenome.
This allows fast lookup of sequences by their name, while maintaining their order. See ReferenceGenome for details, and for the explanation of also_look_up_first_word.
Definition at line 88 of file fasta_reader.cpp.
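The combination of fast lookup by name and preserved order can be pictured as a vector plus a hash map of indices. The sketch below is a standalone illustration, not the ReferenceGenome class. It assumes that looking up the first word means additionally registering the label up to the first space or tab; see ReferenceGenome for the actual semantics. `NamedSequence` and `OrderedLookup` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical record type for this sketch.
struct NamedSequence {
    std::string label;
    std::string sites;
};

// Keeps sequences in insertion order and allows lookup by name.
class OrderedLookup {
public:
    void add(NamedSequence seq, bool also_look_up_first_word = true)
    {
        std::size_t const index = sequences_.size();
        index_by_name_.emplace(seq.label, index);
        if (also_look_up_first_word) {
            // Assumption: the first word ends at the first space or tab.
            // emplace() leaves an already existing key untouched.
            std::string const first = seq.label.substr(0, seq.label.find_first_of(" \t"));
            index_by_name_.emplace(first, index);
        }
        sequences_.push_back(std::move(seq));
    }

    // Returns a pointer to the sequence, or nullptr if the name is unknown.
    NamedSequence const* find(std::string const& name) const
    {
        auto const it = index_by_name_.find(name);
        return it == index_by_name_.end() ? nullptr : &sequences_[it->second];
    }

    std::vector<NamedSequence> const& in_order() const
    {
        return sequences_;
    }

private:
    std::vector<NamedSequence> sequences_;
    std::unordered_map<std::string, std::size_t> index_by_name_;
};
```

Storing indices instead of pointers in the map keeps the lookup valid when the vector reallocates.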
FastaReader::SiteCasing site_casing | ( | ) | const |
Return whether Sequence sites are automatically turned into upper or lower case.
Definition at line 383 of file fasta_reader.cpp.
FastaReader & site_casing | ( | SiteCasing | value | ) |
Set whether Sequence sites are automatically turned into upper or lower case.
Default is SiteCasing::kToUpper, that is, all sites of the read Sequences are turned into upper case letters automatically. This is demanded by the Fasta standard. The function returns the FastaReader object to allow for fluent interfaces.
Definition at line 377 of file fasta_reader.cpp.
utils::CharLookup< bool > & valid_char_lookup | ( | ) |
Return the internal CharLookup that is used for validating the Sequence sites.
This function is provided in case direct access to the lookup is needed. Usually, the valid_chars() function should suffice. See there for details.
Definition at line 424 of file fasta_reader.cpp.
std::string valid_chars | ( | ) | const |
Return the currently set chars used for validating Sequence sites.
An empty string means that no validation is done.
Definition at line 413 of file fasta_reader.cpp.
FastaReader & valid_chars | ( | std::string const & | chars | ) |
Set the chars that are used for validating Sequence sites when reading them.
When this function is called with a string of chars, those chars are used to validate the sites when reading them. That is, only sequences consisting of the given chars are valid.
If set to an empty string, this check is deactivated. This is also the default, meaning that no checking is done.
If site_casing() is set to a value other than SiteCasing::kUnchanged, the validation is done after changing the casing, so that only lower case or only upper case letters have to be provided for validation. If site_casing() is set to SiteCasing::kUnchanged, all chars that are to be considered valid have to be provided for validation.
See the nucleic_acid_codes...() and amino_acid_codes...() functions for preset sets of chars that can be used for validation here.
Definition at line 399 of file fasta_reader.cpp.
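The interplay of casing and validation described above can be sketched in a few lines of standalone C++. This is an illustration, not the genesis implementation: the `Casing` enum mirrors SiteCasing, the 256-entry table plays the role of a char lookup, and `normalize_and_validate` is a hypothetical name.

```cpp
#include <array>
#include <cctype>
#include <stdexcept>
#include <string>

// Mirrors the three casing options of this class for the purpose of the sketch.
enum class Casing { kUnchanged, kToUpper, kToLower };

// Apply the casing first, then validate each site against `valid_chars`.
// An empty `valid_chars` string deactivates the validation.
std::string normalize_and_validate(
    std::string sites, Casing casing, std::string const& valid_chars
) {
    // One bool per possible char value, all false initially.
    std::array<bool, 256> lookup{};
    for (char c : valid_chars) {
        lookup[static_cast<unsigned char>(c)] = true;
    }

    for (char& c : sites) {
        auto const uc = static_cast<unsigned char>(c);
        if (casing == Casing::kToUpper) {
            c = static_cast<char>(std::toupper(uc));
        } else if (casing == Casing::kToLower) {
            c = static_cast<char>(std::tolower(uc));
        }
        // Validation sees the char after the casing has been applied.
        if (!valid_chars.empty() && !lookup[static_cast<unsigned char>(c)]) {
            throw std::runtime_error("Invalid site char '" + std::string(1, c) + "'.");
        }
    }
    return sites;
}
```

In this sketch, `"acgt"` passes with `Casing::kToUpper` and the valid chars `"ACGT"`, while the same input with `Casing::kUnchanged` is rejected unless the lower case letters are listed as well.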
enum ParsingMethod | strong |
Enumeration of the available methods for parsing Fasta sequences.
Enumerator | |
---|---|
kDefault | Fast method, used by default. This is by far the preferred method, but it has one slight limitation: it reports errors only with the line where the sequence starts. This does not affect most applications, as good data does not produce errors to report. If you want error reporting at the exact line and column where the error occurs, use kPedantic instead. With this setting, parse_sequence() is used for parsing. In our tests, it achieved ~350 MB/s parsing speed. |
kPedantic | Pedantic method. Compared to the fast method, this one reports errors at the exact line and column where they occur. It is, however, slower (~3.5x the time of the default method). Apart from that, there are no differences. If you need this method for certain files, it can be useful to run it only once and use a FastaWriter to write out a new Fasta file without errors, so that subsequent reads can use the faster default method. With this setting, parse_sequence_pedantic() is used for parsing. In our tests, it achieved ~100 MB/s parsing speed. |
Definition at line 102 of file fasta_reader.hpp.
enum SiteCasing | strong |
Enumeration of casing methods to apply to each site of a Sequence.
Enumerator | |
---|---|
kUnchanged | Do not change the case of the sites. |
kToUpper | Make all sites upper case. |
kToLower | Make all sites lower case. |
Definition at line 138 of file fasta_reader.hpp.