A library for working with phylogenetic and population genetic data.
v0.32.0
FrequencyTableInputStream Class Reference

#include <genesis/population/format/frequency_table_input_stream.hpp>

Detailed Description

Iterate an input source and parse it as a table of allele frequencies or counts.

The expected table has to be in what R calls the "wide" format, that is, samples are in separate columns. Otherwise, the fixed columns such as chromosome name and position would have to be repeated for every sample, which would duplicate too much data.

The parser automatically tries to determine which types of data each sample provides (reference and alternative counts, frequencies, read depth), and computes whatever is needed from that.

Some formats do not contain information on the reference and/or alternative base, such as the HAF-pipe frequency tables. For these cases, a reference_genome() can be provided, which will at least set the reference base of the Variant correctly. The alternative base will then be set to the transition base of the reference (A <-> G and C <-> T), which might be wrong, but is the most likely choice that we can make in the absence of further information. We might add support for a reference panel VCF in the future to solve this problem, but as most of our downstream algorithms do not really care about which base is ref and alt, we do not support this as of now.

If there is no ref base column (or if its value is N) and no ref genome is given, we cannot know which bases the counts correspond to. In that case, we assign the ref count to A and the alt count to G. If only the ref base is given, but no alt base, we again use the transition base, as explained above.
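The fallback rules above can be sketched as follows. This is a minimal standalone illustration, not the genesis implementation: the helpers normalize_base(), transition_base(), and resolve_bases() are hypothetical names, and 'N' is used here to mean "base not available". The text does not say what happens when only the alt base is known; the sketch assumes the A/G fallback for that case as well.

```cpp
#include <cassert>
#include <cctype>
#include <utility>

// Hypothetical helper (not part of genesis): upper-case a single base.
inline char normalize_base( char base )
{
    return static_cast<char>( std::toupper( static_cast<unsigned char>( base )));
}

// Hypothetical helper: transition partner of a base (A <-> G, C <-> T),
// or 'N' if the input is not one of A, C, G, T.
inline char transition_base( char base )
{
    switch( normalize_base( base )) {
        case 'A': return 'G';
        case 'G': return 'A';
        case 'C': return 'T';
        case 'T': return 'C';
        default:  return 'N';
    }
}

// Hypothetical helper: pick (ref, alt) following the fallback rules above.
// Pass 'N' for a base that is not available.
inline std::pair<char, char> resolve_bases( char ref, char alt )
{
    ref = normalize_base( ref );
    alt = normalize_base( alt );
    bool const ref_known = ( transition_base( ref ) != 'N' );
    bool const alt_known = ( transition_base( alt ) != 'N' );

    if( !ref_known ) {
        // No usable reference base: ref count goes to A, alt count to G.
        // Assumption: this also applies if only the alt base is known.
        return std::make_pair( 'A', 'G' );
    }
    if( !alt_known ) {
        // Reference base only: use its transition base as the alternative.
        return std::make_pair( ref, transition_base( ref ));
    }
    return std::make_pair( ref, alt );
}
```

For instance, resolve_bases( 'C', 'N' ) yields the pair (C, T), while resolve_bases( 'N', 'N' ) yields (A, G).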

Definition at line 79 of file frequency_table_input_stream.hpp.

Public Member Functions

 FrequencyTableInputStream ()=default
 Create a default instance, with no input.

 FrequencyTableInputStream (self_type &&)=default

 FrequencyTableInputStream (self_type const &)=default

 FrequencyTableInputStream (std::shared_ptr< utils::BaseInputSource > input_source)
 Create an instance that reads from an input_source.

 FrequencyTableInputStream (std::shared_ptr< utils::BaseInputSource > input_source, std::unordered_set< std::string > const &sample_names_filter, bool inverse_sample_names_filter=false)
 Create an instance that reads from an input_source.

 ~FrequencyTableInputStream ()=default
 
double allowed_relative_frequency_error () const

self_type & allowed_relative_frequency_error (double value)
 Allowed error margin for frequencies.

Iterator begin () const

Iterator end () const

bool frequency_is_ref () const

self_type & frequency_is_ref (bool value)
 Set whether frequencies are ref or alt frequencies.
 
std::string const & header_alternative_base_string () const
 Return the currently set string that marks the alternative base column in the header.

self_type & header_alternative_base_string (std::string const &str)
 Specify a string that marks the alternative base column in the header.

std::string const & header_chromosome_string () const
 Return the currently set string that marks the chromosome column in the header.

self_type & header_chromosome_string (std::string const &str)
 Specify a string that marks the chromosome column in the header.

std::string const & header_position_string () const
 Return the currently set string that marks the position column in the header.

self_type & header_position_string (std::string const &str)
 Specify a string that marks the position column in the header.

std::string const & header_reference_base_string () const
 Return the currently set string that marks the reference base column in the header.

self_type & header_reference_base_string (std::string const &str)
 Specify a string that marks the reference base column in the header.
 
std::string const & header_sample_alternative_count_substring () const
 Return the currently set (sub)string that is the prefix or suffix for header columns containing the alternative base count of a sample.

self_type & header_sample_alternative_count_substring (std::string const &str)
 Specify a (sub)string that is the prefix or suffix for header columns containing the alternative base count of a sample.

std::string const & header_sample_frequency_substring () const
 Return the currently set (sub)string that is the prefix or suffix for header columns containing the frequency of a sample.

self_type & header_sample_frequency_substring (std::string const &str)
 Specify a (sub)string that is the prefix or suffix for header columns containing the frequency of a sample.

std::string const & header_sample_read_depth_substring () const
 Return the currently set (sub)string that is the prefix or suffix for header columns containing the read depth of a sample.

self_type & header_sample_read_depth_substring (std::string const &str)
 Specify a (sub)string that is the prefix or suffix for header columns containing the read depth of a sample (that is, the sum of reference and alternative base counts).

std::string const & header_sample_reference_count_substring () const
 Return the currently set (sub)string that is the prefix or suffix for header columns containing the reference base count of a sample.

self_type & header_sample_reference_count_substring (std::string const &str)
 Specify a (sub)string that is the prefix or suffix for header columns containing the reference base count of a sample.
 
std::shared_ptr< utils::BaseInputSource > input_source () const

self_type & input_source (std::shared_ptr< utils::BaseInputSource > value)
 Set the input source.

double int_factor () const

self_type & int_factor (double value)
 Set the factor by which frequencies are multiplied if no read depth information is present for a sample.

bool inverse_sample_names_filter () const

self_type & inverse_sample_names_filter (bool value)
 Set whether to invert the sample names filter.

std::string const & missing_value () const

self_type & missing_value (std::string const &value)
 Set the string that indicates missing data.

self_type & operator= (self_type &&)=default

self_type & operator= (self_type const &)=default

std::shared_ptr<::genesis::sequence::ReferenceGenome > reference_genome () const

self_type & reference_genome (std::shared_ptr<::genesis::sequence::ReferenceGenome > value)
 Reference genome used to phase input data without reference bases.

std::unordered_set< std::string > const & sample_names_filter () const

self_type & sample_names_filter (std::unordered_set< std::string > const &value)
 Set the sample names to filter for.

char separator_char () const

self_type & separator_char (char value)
 Set the separator char used for parsing the tabular input data.
 

Public Types

using difference_type = std::ptrdiff_t
 
using iterator_category = std::input_iterator_tag
 
using pointer = value_type const *
 
using reference = value_type const &
 
using self_type = FrequencyTableInputStream
 
using value_type = Variant
 

Classes

class  Iterator
 Iterator over loci of the input sources. More...
 

Constructor & Destructor Documentation

◆ FrequencyTableInputStream() [1/5]

Create a default instance, with no input.

Use input_source() to assign an input afterwards.

◆ FrequencyTableInputStream() [2/5]

FrequencyTableInputStream ( std::shared_ptr< utils::BaseInputSource >  input_source)
inline explicit

Create an instance that reads from an input_source.

Definition at line 436 of file frequency_table_input_stream.hpp.

◆ FrequencyTableInputStream() [3/5]

FrequencyTableInputStream ( std::shared_ptr< utils::BaseInputSource >  input_source,
std::unordered_set< std::string > const &  sample_names_filter,
bool  inverse_sample_names_filter = false 
)
inline

Create an instance that reads from an input_source.

Additionally, this constructor takes a set of sample names, sample_names_filter, which is used as a filter so that only those samples are evaluated and accessible. If inverse_sample_names_filter is set to true, all samples except those are used instead.

Definition at line 450 of file frequency_table_input_stream.hpp.
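The filter semantics described for this constructor can be sketched as follows. This is a standalone illustration with a hypothetical helper keep_sample(); it is not the library's code. It deliberately does not model what an empty filter set means, which the text above does not specify.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical helper (not part of genesis): decide whether a sample is kept,
// given the set of filter names and the inversion flag.
inline bool keep_sample(
    std::string const& name,
    std::unordered_set<std::string> const& filter,
    bool inverse = false
) {
    bool const listed = ( filter.count( name ) > 0 );

    // Without inversion, only listed samples are kept;
    // with inversion, all samples except the listed ones are kept.
    return inverse ? !listed : listed;
}
```

With the filter { "S1", "S3" }, the sample "S2" is dropped by default, but kept when the inversion flag is set to true.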

◆ ~FrequencyTableInputStream()

~FrequencyTableInputStream ( )
default

◆ FrequencyTableInputStream() [4/5]

FrequencyTableInputStream ( self_type const &  )
default

◆ FrequencyTableInputStream() [5/5]

FrequencyTableInputStream ( self_type &&  )
default

Member Function Documentation

◆ allowed_relative_frequency_error() [1/2]

double allowed_relative_frequency_error ( ) const
inline

Definition at line 853 of file frequency_table_input_stream.hpp.

◆ allowed_relative_frequency_error() [2/2]

self_type& allowed_relative_frequency_error ( double  value)
inline

Allowed error margin for frequencies.

If an input table contains information on both the ref/alt counts (or only one of them, but also the read depth), as well as their frequency, we double check that these values are consistent with each other. This should be the case if the table was computed correctly.

This setting specifies the threshold for what is considered correct. It is a relative measure, defaulting to 0.1%. That is, the default value is 0.001 of allowed relative error between the count-based frequency that we compute, and the frequency given in the table.

Furthermore, we also use this threshold to check that frequencies as given in the input data fall within the range [0.0, 1.0]. Everything outside of that range that is not also within the allowed relative error (as provided here) will lead to an exception. Values that are within that error, but still slightly outside of the range, will be set to be within range, to get proper frequencies.

Definition at line 876 of file frequency_table_input_stream.hpp.
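The two checks described for this setting can be sketched as follows. This is a standalone illustration under stated assumptions, not the genesis implementation: the helpers frequencies_agree() and clamp_frequency() are hypothetical, the relative error is taken with respect to the larger of the two values, and the tolerance around the range [0.0, 1.0] is applied as a simple margin on both ends.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <stdexcept>

// Hypothetical helper (not part of genesis): true if the count-based frequency
// and the frequency given in the table agree within the allowed relative error.
// Assumption: the error is measured relative to the larger of the two values.
inline bool frequencies_agree(
    double computed, double given, double allowed_error = 0.001
) {
    assert( allowed_error >= 0.0 );
    double const scale = std::max( std::abs( computed ), std::abs( given ));
    if( scale == 0.0 ) {
        return true; // both frequencies are exactly zero
    }
    return std::abs( computed - given ) / scale <= allowed_error;
}

// Hypothetical helper: validate a frequency against [0.0, 1.0]. Values that
// overshoot by at most the allowed error are moved into range; anything
// further outside (or not finite) throws.
inline double clamp_frequency( double freq, double allowed_error = 0.001 )
{
    assert( allowed_error >= 0.0 );
    if(
        !std::isfinite( freq ) ||
        freq < 0.0 - allowed_error ||
        freq > 1.0 + allowed_error
    ) {
        throw std::invalid_argument( "Frequency outside of [0.0, 1.0]" );
    }
    return std::min( 1.0, std::max( 0.0, freq ));
}
```

Under these assumptions, a table frequency of 1.0005 would be set to 1.0, while a value of 1.5 would be rejected.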

◆ begin()

Iterator begin ( ) const
inline

Definition at line 472 of file frequency_table_input_stream.hpp.

◆ end()

Iterator end ( ) const
inline

Definition at line 477 of file frequency_table_input_stream.hpp.

◆ frequency_is_ref() [1/2]

bool frequency_is_ref ( ) const
inline

Definition at line 882 of file frequency_table_input_stream.hpp.

◆ frequency_is_ref() [2/2]

self_type& frequency_is_ref ( bool  value)
inline

Set whether frequencies are ref or alt frequencies.

When the data table contains frequencies, it needs to be decided whether this frequency corresponds to the reference base (use true here, default), or to the alternative base (use false here).

Definition at line 894 of file frequency_table_input_stream.hpp.
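To illustrate the effect of this flag, the following standalone sketch converts a frequency and a read depth into ref and alt counts. The helper counts_from_frequency() is hypothetical, and rounding to the nearest integer is an assumption; the library may convert differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical helper (not part of genesis): turn a frequency and a read depth
// into a pair of (ref count, alt count).
inline std::pair<std::uint64_t, std::uint64_t> counts_from_frequency(
    double frequency, std::uint64_t read_depth, bool frequency_is_ref = true
) {
    assert( frequency >= 0.0 && frequency <= 1.0 );

    // Count of the allele that the frequency refers to.
    // Assumption: rounded to the nearest integer.
    auto const primary = static_cast<std::uint64_t>(
        std::llround( frequency * static_cast<double>( read_depth ))
    );
    auto const other = read_depth - primary;

    // The flag decides which allele the frequency belongs to.
    return frequency_is_ref
        ? std::make_pair( primary, other )
        : std::make_pair( other, primary );
}
```

For a frequency of 0.25 at read depth 100, this gives 25 ref and 75 alt counts if the frequency is a ref frequency, and 75 ref and 25 alt counts otherwise.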

◆ header_alternative_base_string() [1/2]

std::string const& header_alternative_base_string ( ) const
inline

Return the currently set string that marks the alternative base column in the header.

See the setter header_alternative_base_string( std::string const& ) for details.

Definition at line 629 of file frequency_table_input_stream.hpp.

◆ header_alternative_base_string() [2/2]

self_type& header_alternative_base_string ( std::string const &  str)
inline

Specify a string that marks the alternative base column in the header.

See the setter header_chromosome_string( std::string const& ) for details; this setter here however specifies the column for the alternative base.

Definition at line 618 of file frequency_table_input_stream.hpp.

◆ header_chromosome_string() [1/2]

std::string const& header_chromosome_string ( ) const
inline

Return the currently set string that marks the chromosome column in the header.

See the setter header_chromosome_string( std::string const& ) for details.

Definition at line 563 of file frequency_table_input_stream.hpp.

◆ header_chromosome_string() [2/2]

self_type& header_chromosome_string ( std::string const &  str)
inline

Specify a string that marks the chromosome column in the header.

By default, this string is empty, and instead we search for the chromosome column in the header by matching with a list of commonly used strings, such as chromosome, chr, or contig.

However, if set to a non-empty string, this string is searched for in the header instead, and the respective column is used for the chromosome information when parsing the table.

Definition at line 552 of file frequency_table_input_stream.hpp.
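The column lookup described above can be sketched as follows. This is a standalone illustration, not the genesis implementation: find_chromosome_column() and to_lower() are hypothetical helpers, the default list contains only the three names mentioned above, and case-insensitive matching is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of genesis): lower-case copy of a string.
inline std::string to_lower( std::string s )
{
    std::transform( s.begin(), s.end(), s.begin(), []( unsigned char c ){
        return static_cast<char>( std::tolower( c ));
    });
    return s;
}

// Hypothetical helper: index of the chromosome column in the header fields,
// or -1 if there is none.
inline int find_chromosome_column(
    std::vector<std::string> const& header,
    std::string const& user_string = ""
) {
    // Empty user string: fall back to a list of common names (illustrative
    // subset). Otherwise, search only for the user-provided string.
    std::vector<std::string> candidates;
    if( user_string.empty() ) {
        candidates = { "chromosome", "chr", "contig" };
    } else {
        candidates = { to_lower( user_string ) };
    }

    for( std::size_t i = 0; i < header.size(); ++i ) {
        std::string const name = to_lower( header[i] );
        if( std::find( candidates.begin(), candidates.end(), name ) != candidates.end() ) {
            return static_cast<int>( i );
        }
    }
    return -1;
}
```

Note that once a user string is given, the default names are no longer considered, which mirrors the "searched for instead" wording above.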

◆ header_position_string() [1/2]

std::string const& header_position_string ( ) const
inline

Return the currently set string that marks the position column in the header.

See the setter header_position_string( std::string const& ) for details.

Definition at line 585 of file frequency_table_input_stream.hpp.

◆ header_position_string() [2/2]

self_type& header_position_string ( std::string const &  str)
inline

Specify a string that marks the position column in the header.

See the setter header_chromosome_string( std::string const& ) for details; this setter here however specifies the column for the position within a given chromosome.

Definition at line 574 of file frequency_table_input_stream.hpp.

◆ header_reference_base_string() [1/2]

std::string const& header_reference_base_string ( ) const
inline

Return the currently set string that marks the reference base column in the header.

See the setter header_reference_base_string( std::string const& ) for details.

Definition at line 607 of file frequency_table_input_stream.hpp.

◆ header_reference_base_string() [2/2]

self_type& header_reference_base_string ( std::string const &  str)
inline

Specify a string that marks the reference base column in the header.

See the setter header_chromosome_string( std::string const& ) for details; this setter here however specifies the column for the reference base.

Definition at line 596 of file frequency_table_input_stream.hpp.

◆ header_sample_alternative_count_substring() [1/2]

std::string const& header_sample_alternative_count_substring ( ) const
inline

Return the currently set (sub)string that is the prefix or suffix for header columns containing the alternative base count of a sample.

See the setter header_sample_alternative_count_substring( std::string const& ) for details.

Definition at line 684 of file frequency_table_input_stream.hpp.

◆ header_sample_alternative_count_substring() [2/2]

self_type& header_sample_alternative_count_substring ( std::string const &  str)
inline

Specify a (sub)string that is the prefix or suffix for header columns containing the alternative base count of a sample.

See the setter header_sample_reference_count_substring( std::string const& ) for details; this setter here however specifies the prefix or suffix for columns containing the alternative base count of samples.

Definition at line 672 of file frequency_table_input_stream.hpp.

◆ header_sample_frequency_substring() [1/2]

std::string const& header_sample_frequency_substring ( ) const
inline

Return the currently set (sub)string that is the prefix or suffix for header columns containing the frequency of a sample.

See the setter header_sample_frequency_substring( std::string const& ) for details.

Definition at line 709 of file frequency_table_input_stream.hpp.

◆ header_sample_frequency_substring() [2/2]

self_type& header_sample_frequency_substring ( std::string const &  str)
inline

Specify a (sub)string that is the prefix or suffix for header columns containing the frequency of a sample.

See the setter header_sample_reference_count_substring( std::string const& ) for details; this setter here however specifies the prefix or suffix for columns containing the frequency of samples.

Definition at line 697 of file frequency_table_input_stream.hpp.

◆ header_sample_read_depth_substring() [1/2]

std::string const& header_sample_read_depth_substring ( ) const
inline

Return the currently set (sub)string that is the prefix or suffix for header columns containing the read depth of a sample.

See the setter header_sample_read_depth_substring( std::string const& ) for details.

Definition at line 735 of file frequency_table_input_stream.hpp.

◆ header_sample_read_depth_substring() [2/2]

self_type& header_sample_read_depth_substring ( std::string const &  str)
inline

Specify a (sub)string that is the prefix or suffix for header columns containing the read depth of a sample (that is, the sum of reference and alternative base counts).

See the setter header_sample_reference_count_substring( std::string const& ) for details; this setter here however specifies the prefix or suffix for columns containing the read depth of samples.

Definition at line 723 of file frequency_table_input_stream.hpp.

◆ header_sample_reference_count_substring() [1/2]

std::string const& header_sample_reference_count_substring ( ) const
inline

Return the currently set (sub)string that is the prefix or suffix for header columns containing the reference base count of a sample.

See the setter header_sample_reference_count_substring( std::string const& ) for details.

Definition at line 659 of file frequency_table_input_stream.hpp.

◆ header_sample_reference_count_substring() [2/2]

self_type& header_sample_reference_count_substring ( std::string const &  str)
inline

Specify a (sub)string that is the prefix or suffix for header columns containing the reference base count of a sample.

By default, this string is empty, and instead we search for the reference base count columns of samples in the header by matching with a list of commonly used prefixes and suffixes, such as ref_cnt or reference-base-count.

However, if set to a non-empty string, this string is searched for in the header instead, as a prefix or suffix of the column names. For every match, the respective column is used as the reference base count information of a sample when parsing the table. The sample name is then the remainder of the column name that is left without the prefix or suffix.

Definition at line 647 of file frequency_table_input_stream.hpp.
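The prefix or suffix matching and the extraction of the sample name can be sketched as follows. The helper match_sample_column() is hypothetical and standalone; in particular, it treats any separator character (such as a dot or underscore) as part of the given (sub)string, which may differ from how the library handles separators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of genesis): if `column` starts or ends with
// `affix`, store the remainder in `sample_name` and return true.
inline bool match_sample_column(
    std::string const& column,
    std::string const& affix,
    std::string& sample_name
) {
    // Need a non-empty affix, and at least one character left for the name.
    if( affix.empty() || column.size() <= affix.size() ) {
        return false;
    }
    std::size_t const rest = column.size() - affix.size();

    // Prefix match: the sample name is what follows the affix.
    if( column.compare( 0, affix.size(), affix ) == 0 ) {
        sample_name = column.substr( affix.size() );
        return true;
    }
    // Suffix match: the sample name is what precedes the affix.
    if( column.compare( rest, affix.size(), affix ) == 0 ) {
        sample_name = column.substr( 0, rest );
        return true;
    }
    return false;
}
```

For example, the column "S1.ref_cnt" with the (sub)string ".ref_cnt" yields the sample name "S1".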

◆ input_source() [1/2]

std::shared_ptr<utils::BaseInputSource> input_source ( ) const
inline

Definition at line 486 of file frequency_table_input_stream.hpp.

◆ input_source() [2/2]

self_type& input_source ( std::shared_ptr< utils::BaseInputSource >  value)
inline

Set the input source.

This overwrites the source if it was already given in the constructor. Shall not be called after iteration has been started.

Definition at line 497 of file frequency_table_input_stream.hpp.

◆ int_factor() [1/2]

double int_factor ( ) const
inline

Definition at line 808 of file frequency_table_input_stream.hpp.

◆ int_factor() [2/2]

self_type& int_factor ( double  value)
inline

Set the factor by which frequencies are multiplied if no read depth information is present for a sample.

We allow parsing information on allele counts (ref and alt counts), or frequencies and read depth. However, there are methods such as HAF-pipe that only output a final frequency, and (by default) do not offer any information on the (effective) read depth that a sample has.

Our internal data representation, however, uses counts instead of frequencies, as we based our equations on existing pool-sequencing population genetic statistics, such as those developed by PoPoolation. Hence, we need to convert from frequencies to counts somehow. In the absence of any read depth information, we use a trick, and multiply the frequency with a large number to obtain counts. In subsequent analyses, using a large number here will basically deactivate the Bessel's correction for read depth (or at least minimize its influence).

By default, we use a factor of 2^53 (i.e., 9007199254740992.0), which is the largest value up to which all integers can be represented exactly in double precision floating point numbers, and which minimizes the above mentioned influence of the Bessel's correction. However, with this setting, a different factor can be used instead, which is useful when actual (effective) read depth information is available.

We currently only allow this to be set for the whole input, instead of on a per-sample basis. If needed, we might re-work this feature in the future to allow per-sample effective read depth.

Definition at line 838 of file frequency_table_input_stream.hpp.
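The conversion from a frequency to counts without read depth information can be sketched as follows. This is a standalone illustration with a hypothetical helper counts_without_depth(); truncation towards zero when converting to integers is an assumption, and the library may round differently.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Default factor as stated in the text: 2^53.
constexpr double default_int_factor = 9007199254740992.0;

// Hypothetical helper (not part of genesis): turn a frequency into a pair of
// (ref count, alt count) by scaling it with a fixed factor that takes the
// role of the read depth.
inline std::pair<std::uint64_t, std::uint64_t> counts_without_depth(
    double frequency,
    double int_factor = default_int_factor,
    bool frequency_is_ref = true
) {
    assert( frequency >= 0.0 && frequency <= 1.0 );
    assert( int_factor >= 1.0 && int_factor <= default_int_factor );

    // Assumption: values are truncated when converted to integers.
    auto const total   = static_cast<std::uint64_t>( int_factor );
    auto const primary = static_cast<std::uint64_t>( frequency * int_factor );
    auto const other   = total - primary;

    return frequency_is_ref
        ? std::make_pair( primary, other )
        : std::make_pair( other, primary );
}
```

With a factor of 1000, a ref frequency of 0.25 becomes 250 ref and 750 alt counts. With the default factor, the counts are so large that a read depth correction of the form n / (n - 1) is practically equal to 1.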

◆ inverse_sample_names_filter() [1/2]

bool inverse_sample_names_filter ( ) const
inline

Definition at line 521 of file frequency_table_input_stream.hpp.

◆ inverse_sample_names_filter() [2/2]

self_type& inverse_sample_names_filter ( bool  value)
inline

Set whether to invert the sample names filter.

This overwrites the inversion setting if it was already given in the constructor. Shall not be called after iteration has been started.

Definition at line 532 of file frequency_table_input_stream.hpp.

◆ missing_value() [1/2]

std::string const& missing_value ( ) const
inline

Definition at line 790 of file frequency_table_input_stream.hpp.

◆ missing_value() [2/2]

self_type& missing_value ( std::string const &  value)
inline

Set the string that indicates missing data.

By default, we use ., na, and nan as indicators of missing data, in which case the SampleCounts will be set to missing when any of these values is involved in the parsing. With this setting, instead, the given value will be used to indicate missing data.

Definition at line 802 of file frequency_table_input_stream.hpp.
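The missing value check can be sketched as follows. This standalone illustration uses a hypothetical helper is_missing(). It assumes that an empty user value means "use the defaults" (by analogy with the header string settings), and that the comparison is case-insensitive; neither is confirmed by the text above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of genesis): lower-case copy of a string.
inline std::string lower_case( std::string s )
{
    std::transform( s.begin(), s.end(), s.begin(), []( unsigned char c ){
        return static_cast<char>( std::tolower( c ));
    });
    return s;
}

// Hypothetical helper: true if a table cell denotes missing data.
inline bool is_missing( std::string const& cell, std::string const& user_value = "" )
{
    std::string const lower = lower_case( cell );

    // A user-provided value replaces the default indicators.
    if( !user_value.empty() ) {
        return lower == lower_case( user_value );
    }
    // Default indicators as listed in the text.
    return lower == "." || lower == "na" || lower == "nan";
}
```

After setting the missing value to "-", a cell containing "na" would no longer count as missing in this sketch, as the given value is used instead of the defaults.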

◆ operator=() [1/2]

self_type& operator= ( self_type &&  )
default

◆ operator=() [2/2]

self_type& operator= ( self_type const &  )
default

◆ reference_genome() [1/2]

std::shared_ptr<::genesis::sequence::ReferenceGenome> reference_genome ( ) const
inline

Definition at line 744 of file frequency_table_input_stream.hpp.

◆ reference_genome() [2/2]

self_type& reference_genome ( std::shared_ptr<::genesis::sequence::ReferenceGenome >  value)
inline

Reference genome used to phase input data without reference bases.

Some frequency table formats, such as the ones coming from HAF-pipe, do not contain information on the reference or alternative bases. In these cases, we could just assign the frequencies to random bases. However, when given the proper reference genome here that was used to infer the frequencies in the first place, we can at least assign the correct reference base.

Note: While supplying the reference genome will correctly set the reference base, we might still not be able to obtain the alternative base that the frequency represents, if that information is simply not present in the input file. For instance, with the HAF-pipe output format, that is unknowable. In the future, we might add reading in a founder panel VCF, or something similar that would give that information. However, most of our downstream algorithms do not really need to know the exact alternative base anyway, so in these cases, we simply assign the transition base of the reference base (A <-> G and C <-> T), to keep it simple. That is the most likely choice we can make without further information.

Definition at line 768 of file frequency_table_input_stream.hpp.
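Looking up the reference base for a given locus can be sketched as follows. This standalone illustration does not use the ReferenceGenome class; instead, a plain map from chromosome names to sequences (GenomeMap) serves as a hypothetical stand-in, and positions are assumed to be 1-based.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a reference genome: chromosome name -> sequence.
using GenomeMap = std::unordered_map<std::string, std::string>;

// Hypothetical helper (not part of genesis): upper-case reference base at a
// 1-based position, or 'N' if the chromosome or position is not available.
inline char reference_base_at(
    GenomeMap const& genome,
    std::string const& chromosome,
    std::size_t position
) {
    auto const it = genome.find( chromosome );
    if( it == genome.end() || position == 0 || position > it->second.size() ) {
        return 'N';
    }
    // Convert from 1-based position to 0-based string index.
    unsigned char const base = static_cast<unsigned char>( it->second[ position - 1 ] );
    return static_cast<char>( std::toupper( base ));
}
```

The base obtained this way would then serve as the reference base of the Variant, with its transition base as the alternative base if the input does not provide one.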

◆ sample_names_filter() [1/2]

std::unordered_set<std::string> const& sample_names_filter ( ) const
inline

Definition at line 503 of file frequency_table_input_stream.hpp.

◆ sample_names_filter() [2/2]

self_type& sample_names_filter ( std::unordered_set< std::string > const &  value)
inline

Set the sample names to filter for.

This overwrites the sample names if they were already given in the constructor. We internally sort them, for faster access. Shall not be called after iteration has been started.

Definition at line 515 of file frequency_table_input_stream.hpp.

◆ separator_char() [1/2]

char separator_char ( ) const
inline

Definition at line 774 of file frequency_table_input_stream.hpp.

◆ separator_char() [2/2]

self_type& separator_char ( char  value)
inline

Set the separator char used for parsing the tabular input data.

By default, we use a tab (\t), but any other character, such as a comma, can be used here.

Definition at line 784 of file frequency_table_input_stream.hpp.
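Splitting a table line at the separator char can be sketched as follows. The helper split_line() is hypothetical and standalone; it does not handle quoted fields, and it makes no claim about how the library parses lines internally.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of genesis): split a line into fields at each
// occurrence of the separator char. Empty fields are kept.
inline std::vector<std::string> split_line(
    std::string const& line, char separator = '\t'
) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while( true ) {
        auto const end = line.find( separator, start );
        if( end == std::string::npos ) {
            // Last field: everything after the final separator.
            fields.push_back( line.substr( start ));
            break;
        }
        fields.push_back( line.substr( start, end - start ));
        start = end + 1;
    }
    return fields;
}
```

A comma-separated line such as "a,b,,d" is split into four fields with split_line( line, ',' ), the third of which is empty.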

Member Typedef Documentation

◆ difference_type

using difference_type = std::ptrdiff_t

Definition at line 91 of file frequency_table_input_stream.hpp.

◆ iterator_category

using iterator_category = std::input_iterator_tag

Definition at line 92 of file frequency_table_input_stream.hpp.

◆ pointer

using pointer = value_type const*

Definition at line 89 of file frequency_table_input_stream.hpp.

◆ reference

using reference = value_type const&

Definition at line 90 of file frequency_table_input_stream.hpp.

◆ self_type

using self_type = FrequencyTableInputStream

◆ value_type

using value_type = Variant

Definition at line 88 of file frequency_table_input_stream.hpp.


The documentation for this class was generated from the following file: