A library for working with phylogenetic data.
v0.25.0
VcfFormatIterator< S, T > Class Template Reference

#include <genesis/population/formats/vcf_format_iterator.hpp>

Detailed Description

template<typename S, typename T>
class genesis::population::VcfFormatIterator< S, T >

Iterate the FORMAT information for the samples in a SNP/variant line in a VCF/BCF file.

One instance is meant to iterate all values of the samples for one FORMAT field key (given by its id tag/key). VCF/BCF supports different data types, for which this class template is instantiated in the begin_format_*() and get_format_*() functions of VcfRecord: std::string, int, and float.

The FORMAT data in VCF is fairly flexible and complex:

  • For a given FORMAT ID field (e.g., "AF"), each sample can contain multiple values for that field, as indicated by the Number value in the VCF header line for that FORMAT field.
  • However, this seems not to be the case for strings (char* in htslib), which only ever seem to return one string value per sample in htslib. To our knowledge, this is not properly documented in htslib, but we implemented this class here as if this were true. So, for string data, one can only obtain a single string per sample, and has to split it afterwards, if necessary.
  • For the other data types (int and float/double), there can be missing data as well, so that not all samples might have the same number of values.
  • The genotype field (GT) is yet another special case that is handled by htslib as either string or int, depending on the context. We here hence wrap this as VcfGenotype for simplicity.

Basically, what that means is that we need an iterator for the values of each sample within this iterator over samples, which again usually is within an iterator over the records/lines of the VCF file. However, we simplify here a bit, and replace the innermost iterator (over values of the current sample) with member functions of this class: Most FORMAT tags only have one value anyway (see below for that use case), and also we don't want too many classes to confuse us. To accommodate this, and to make iterating values as easy as possible (given the complexity), we offer functions to automatically skip missing values.
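The nesting of samples and values described above can be pictured as one flat buffer per FORMAT key. The following is a minimal sketch, assuming the layout that htslib uses for integer FORMAT data (all values of sample 0, then all values of sample 1, and so on, padded with sentinel values). The struct FlatFormatData and its members are hypothetical names for illustration and not part of genesis; the two constants mirror bcf_int32_missing and bcf_int32_vector_end from htslib.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Sentinel values as defined by htslib for int32 data
// (bcf_int32_missing and bcf_int32_vector_end in htslib/vcf.h).
constexpr std::int32_t kMissing   = std::numeric_limits<std::int32_t>::min();
constexpr std::int32_t kVectorEnd = std::numeric_limits<std::int32_t>::min() + 1;

// Hypothetical stand-in for the buffer of one FORMAT key of one record:
// sample_count * values_per_sample entries, stored sample by sample.
struct FlatFormatData {
    std::vector<std::int32_t> values;
    std::size_t values_per_sample;

    std::size_t sample_count() const {
        return values.size() / values_per_sample;
    }

    // Raw entry for a (sample, value) pair, including sentinel values.
    std::int32_t raw_at( std::size_t sample_index, std::size_t value_index ) const {
        assert( sample_index < sample_count() );
        assert( value_index < values_per_sample );
        return values[ sample_index * values_per_sample + value_index ];
    }

    // An entry is valid if it is in bounds and is neither missing
    // nor the end-of-vector marker.
    bool has_value_at( std::size_t sample_index, std::size_t value_index ) const {
        if( sample_index >= sample_count() || value_index >= values_per_sample ) {
            return false;
        }
        std::int32_t const v = raw_at( sample_index, value_index );
        return v != kMissing && v != kVectorEnd;
    }
};
```

With three samples and two values per sample, a buffer such as { 10, 20, 30, kVectorEnd, kMissing, 50 } describes a first sample with two values, a second sample whose data ends after one value, and a third sample whose first value is missing.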

A simple use case for this iterator is hence as follows:

// Load a file and init the data structures.
auto file = HtsFile( "path/to/file.vcf" );
auto header = VcfHeader( file );
auto record = VcfRecord( header );

// Iterate all records/lines of the VCF file.
while( record.read( file )) {
    // Skip if the read depth (DP) FORMAT is not available for the current record.
    if( !record.has_format("DP") ) {
        continue;
    }

    // Iterate the DP data for all samples of the record, loading them as int.
    for( auto& sample_dp : record.get_format_int("DP") ) {
        LOG_INFO << "At sample " << sample_dp.sample_name();

        // Iterate all individual values for that sample that are given in the data.
        while( sample_dp.has_value() ) {
            LOG_INFO << "- " << sample_dp.get_value();
            sample_dp.next_value();
        }
    }
}

See also VcfInputIterator for a wrapper around the basic loop over records/lines.

The above example relies on the implicit notion of a "current" value per sample, as we move between values via the next_value() function. Note that the next_value() function automatically skips missing values. If however the exact indices of the values within a sample are important (that is, if missing values shall not be skipped automatically), an alternative approach is to use the *_at() functions that this iterator class provides:

// (replacement for the innermost while loop of above)
for( size_t i = 0; i < sample_dp.values_per_sample(); ++i ) {
    if( sample_dp.has_value_at(i) ) {
        LOG_INFO << "- " << sample_dp.get_value_at(i);
    }
}
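To make the difference between the two access styles concrete, here is a self-contained sketch that does not use genesis at all. SampleCursor, collect_skipping and collect_indexed are hypothetical names; the cursor only imitates the documented behaviour of has_value(), get_value(), next_value() and the *_at() functions for a single sample, and for brevity it only treats the htslib missing marker (not the end-of-vector marker) as invalid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Same numeric value as bcf_int32_missing in htslib.
constexpr std::int32_t kMissing = std::numeric_limits<std::int32_t>::min();

// Hypothetical cursor over the values of one sample.
class SampleCursor {
public:
    explicit SampleCursor( std::vector<std::int32_t> values )
        : values_( std::move( values ))
    {
        // Start at the first valid value, if there is one.
        skip_missing_();
    }

    bool has_value() const { return index_ < values_.size(); }
    std::size_t value_index() const { return index_; }
    std::size_t values_per_sample() const { return values_.size(); }

    std::int32_t get_value() const {
        assert( has_value() );
        return values_[ index_ ];
    }

    // Advance to the next valid value, skipping missing entries.
    void next_value() {
        ++index_;
        skip_missing_();
    }

    bool has_value_at( std::size_t i ) const {
        return i < values_.size() && values_[ i ] != kMissing;
    }

    std::int32_t get_value_at( std::size_t i ) const {
        assert( has_value_at( i ));
        return values_[ i ];
    }

private:
    void skip_missing_() {
        while( index_ < values_.size() && values_[ index_ ] == kMissing ) {
            ++index_;
        }
    }

    std::vector<std::int32_t> values_;
    std::size_t index_ = 0;
};

// Style 1: implicit "current" value; positions of missing entries are lost.
std::vector<std::int32_t> collect_skipping( SampleCursor cursor ) {
    std::vector<std::int32_t> out;
    while( cursor.has_value() ) {
        out.push_back( cursor.get_value() );
        cursor.next_value();
    }
    return out;
}

// Style 2: explicit indices; the original position of each value is kept.
std::vector<std::pair<std::size_t, std::int32_t>> collect_indexed( SampleCursor const& cursor ) {
    std::vector<std::pair<std::size_t, std::int32_t>> out;
    for( std::size_t i = 0; i < cursor.values_per_sample(); ++i ) {
        if( cursor.has_value_at( i )) {
            out.emplace_back( i, cursor.get_value_at( i ));
        }
    }
    return out;
}
```

For a sample holding 7, a missing entry, and 9, the first style yields the two values 7 and 9, while the second style additionally reports that they sit at indices 0 and 2.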

Alternatively, if only a single value is expected per sample anyway (which is probably the case for most kinds of FORMAT fields), we do not need to loop and can simplify the access:

// (again, replacement for the innermost while loop of above)
if( sample_dp.has_value() ) {
    LOG_INFO << "- " << sample_dp.get_value();
}

Furthermore, in the first example, we used a range-based for loop, which automates the increment and comparison to the end iterator. It is instead of course also possible to do this explicitly:

// (replacement for the range-based for loop of the first example)
for( auto sample_dp = record.begin_format_int("DP"); sample_dp != record.end_format_int(); ++sample_dp ) {
    // ...
}

Here, each call to the pre-increment operator++() moves to the first value of the next sample. Then as before, in order to move between values of the current sample, use has_value(), get_value(), and next_value() within the loop.

Note that we also provide *_at() functions that take the sample index as input. These are hence fully independent of the current iterator position (sample and value position), and can be used to access values at arbitrary sample and value indices.

Furthermore, a vector with all values of a sample can be obtained. This is for example useful for the genotype GT field, which can be used like this:

// (replacement for the range-based for loop of the first example)
for( auto sample_gt : record.get_format_genotype() ) {
    LOG_INFO << vcf_genotype_string( sample_gt.get_values() );
}

This yields a print-out of the genotypes of each sample in VCF style, see vcf_genotype_string() for details.
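For readers unfamiliar with how genotypes are stored, the following sketch shows how a VCF-style genotype string such as 0|1 can be derived from htslib's integer genotype encoding. It is not the implementation of vcf_genotype_string() or VcfGenotype; the function names are hypothetical, the bit operations mirror the bcf_gt_* macros of htslib/vcf.h, and the input is assumed to contain no end-of-vector markers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// htslib stores each allele of a genotype as ((allele + 1) << 1) | phased,
// with an allele part of 0 meaning "missing" (compare the bcf_gt_* macros).
inline bool gt_is_missing( std::int32_t v ) { return ( v >> 1 ) == 0; }
inline bool gt_is_phased( std::int32_t v )  { return ( v & 1 ) != 0; }
inline std::int32_t gt_allele( std::int32_t v ) { return ( v >> 1 ) - 1; }

// Helper to build encoded values for the example; allele must not be negative.
inline std::int32_t gt_encode( std::int32_t allele, bool phased ) {
    assert( allele >= 0 );
    return (( allele + 1 ) << 1 ) | ( phased ? 1 : 0 );
}

// Format the encoded alleles of one sample in VCF style, e.g. "0|1" or "./.".
std::string genotype_to_string( std::vector<std::int32_t> const& gt ) {
    std::string result;
    for( std::size_t i = 0; i < gt.size(); ++i ) {
        // The separator before an allele reflects the phasing bit of that allele.
        if( i > 0 ) {
            result += gt_is_phased( gt[i] ) ? '|' : '/';
        }
        if( gt_is_missing( gt[i] )) {
            result += '.';
        } else {
            result += std::to_string( gt_allele( gt[i] ));
        }
    }
    return result;
}
```

A diploid sample encoded as { gt_encode( 0, false ), gt_encode( 1, true ) } is formatted as 0|1 by this sketch.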

Further implementation details: The class is a template for the source type S as used by htslib (char*, int32_t, float), and for the target type T as used by us (std::string, int32_t, double, VcfGenotype). Most of the implementation is shared, but some htslib-related functions (check for their hard-coded special values, memory management of allocated arrays, ...) need to be specialized for the different data types. See the very end of this file for the respective implementations.

Definition at line 62 of file vcf_format_iterator.hpp.

Public Member Functions

 VcfFormatIterator ()=default
 Create a default (empty) instance, that is used to indicate the end iterator position.

 VcfFormatIterator (::bcf_hdr_t *header, ::bcf1_t *record, std::string const &id, VcfValueType ht_type)
 Create an instance, given the htslib header, record line, and the FORMAT id tag/key (as well as its data type ht_type) that we want to iterate over.

 VcfFormatIterator (VcfFormatIterator &&)=default

 VcfFormatIterator (VcfFormatIterator const &)=default

 ~VcfFormatIterator ()=default

T get_value () const
 Get the value where the iterator currently resides.

T get_value_at (size_t sample_index, size_t value_index) const
 Get the value at a given value_index of a given sample at sample_index.

T get_value_at (size_t value_index) const
 Get the value at a given value_index of the current sample.

std::vector< T > get_values (bool include_missing=false) const
 Get a vector of all values for the current sample.

std::vector< T > get_values_at (size_t sample_index, bool include_missing=false) const
 Get a vector of all values for a given sample.

bool has_value () const
 Return whether the iterator currently resides at a valid value of the current sample.

bool has_value_at (size_t sample_index, size_t value_index) const
 Return whether the value at a given index within the given sample is valid.

bool has_value_at (size_t value_index) const
 Return whether the value at a given index within the current sample is valid.

::bcf_hdr_t * header_data ()
 Get the raw htslib structure pointer for the header.

void next_value ()
 Move to the next value within the current sample.

bool operator!= (self_type const &other) const
 Inequality comparison, needed to detect the end of the iteration.

self_type & operator* ()
 Dereference, which gives the iterator itself instead of the value, as our values should be accessed via the get_value() or get_value_at() functions.

self_type & operator++ ()
 Pre-increment operator to move to the next sample.

self_type operator++ (int)
 Post-increment operator to move to the next sample.

VcfFormatIterator & operator= (VcfFormatIterator &&)=default

VcfFormatIterator & operator= (VcfFormatIterator const &)=default

bool operator== (self_type const &other) const
 Equality comparison, needed to detect the end of the iteration.

::bcf1_t * record_data ()
 Get the raw htslib structure pointer for the record/line.

size_t sample_count () const
 Return the total number of samples that we are iterating over.

size_t sample_index () const
 Return the index of the column of the current sample.

std::string sample_name () const
 Return the name of the current sample, as given in the #CHROM ... header line of the VCF file.

std::string sample_name_at (size_t sample_index) const
 Return the sample name at a given index within 0 and sample_count().

size_t valid_value_count () const
 Return the number of valid values for the current sample.

size_t valid_value_count_at (size_t sample_index) const
 Return the number of valid values for a given sample_index.

size_t value_index () const
 Return the index of the current value within the current sample.

size_t values_per_sample () const
 Return the number of values that each sample has.
 

Public Types

using self_type = VcfFormatIterator< S, T >
 

Constructor & Destructor Documentation

◆ VcfFormatIterator() [1/4]

VcfFormatIterator ( )
default

Create a default (empty) instance, that is used to indicate the end iterator position.

By default, this has is_end_ == true, so that we can easily check for default constructed instances, which are used as the past-the-end iterators in the loop when using this class.
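The end-flag pattern can be sketched in isolation. The classes below are hypothetical stand-ins (SampleIter, SampleRange and visit_samples are illustration names, not genesis API), and the rule used here for comparing against an end iterator is an assumption about how such a flag is typically used, not a description of the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical iterator over sample indices with an explicit end flag.
class SampleIter {
public:
    using self_type = SampleIter;

    // Default construction yields the past-the-end iterator (is_end_ == true).
    SampleIter() = default;

    // Begin iterator; it is already at the end if there are no samples.
    explicit SampleIter( std::size_t sample_count )
        : is_end_( sample_count == 0 )
        , sample_count_( sample_count )
    {}

    // Dereferencing returns the iterator itself, so that its accessor
    // functions can be called on the loop variable.
    self_type& operator*() { return *this; }

    // Move to the next sample, or set the end flag after the last one.
    self_type& operator++() {
        assert( !is_end_ );
        ++sample_index_;
        if( sample_index_ >= sample_count_ ) {
            is_end_ = true;
        }
        return *this;
    }

    // Assumed rule: end iterators compare equal to each other only;
    // otherwise the sample positions are compared.
    bool operator==( self_type const& other ) const {
        if( is_end_ || other.is_end_ ) {
            return is_end_ == other.is_end_;
        }
        return sample_index_ == other.sample_index_;
    }

    bool operator!=( self_type const& other ) const { return !( *this == other ); }

    std::size_t sample_index() const { return sample_index_; }

private:
    bool is_end_ = true;
    std::size_t sample_index_ = 0;
    std::size_t sample_count_ = 0;
};

// Minimal range wrapper so that a range-based for loop can be used.
struct SampleRange {
    std::size_t sample_count;
    SampleIter begin() const { return SampleIter( sample_count ); }
    SampleIter end() const { return SampleIter(); }
};

// Collect the sample indices visited by a range-based for loop.
std::vector<std::size_t> visit_samples( std::size_t sample_count ) {
    std::vector<std::size_t> visited;
    for( auto& sample : SampleRange{ sample_count } ) {
        visited.push_back( sample.sample_index() );
    }
    return visited;
}
```

Because the default-constructed instance carries no htslib pointers, comparing against it is cheap, which is what makes it convenient as the end marker of a loop.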

◆ VcfFormatIterator() [2/4]

VcfFormatIterator ( ::bcf_hdr_t *  header,
::bcf1_t *  record,
std::string const &  id,
VcfValueType  ht_type 
)
inline

Create an instance, given the htslib header, record line, and the FORMAT id tag/key (as well as its data type ht_type) that we want to iterate over.

Usually, this class does not need to be constructed by the user. Instead, it is obtained from the begin_format_*() and end_format_*(), or the get_format_*() iterator functions of VcfRecord. That way, it can easily be used for iterating all samples of a given VCF record line.

Definition at line 307 of file vcf_format_iterator.hpp.

◆ ~VcfFormatIterator()

~VcfFormatIterator ( )
default

◆ VcfFormatIterator() [3/4]

VcfFormatIterator ( VcfFormatIterator< S, T > const &  )
default

◆ VcfFormatIterator() [4/4]

VcfFormatIterator ( VcfFormatIterator< S, T > &&  )
default

Member Function Documentation

◆ get_value()

T get_value ( ) const
inline

Get the value where the iterator currently resides.

That is, get the value at index value_index() for the sample at sample_index(). The function assumes that this is a valid value, that is, that has_value() returned true.

Definition at line 557 of file vcf_format_iterator.hpp.

◆ get_value_at() [1/2]

T get_value_at ( size_t  sample_index,
size_t  value_index 
) const
inline

Get the value at a given value_index of a given sample at sample_index.

Definition at line 655 of file vcf_format_iterator.hpp.

◆ get_value_at() [2/2]

T get_value_at ( size_t  value_index) const
inline

Get the value at a given value_index of the current sample.

Definition at line 647 of file vcf_format_iterator.hpp.

◆ get_values()

std::vector<T> get_values ( bool  include_missing = false) const
inline

Get a vector of all values for the current sample.

If include_missing is true, the resulting vector has the size of values_per_sample(), and also contains any missing or end-of-vector values as provided by htslib, using their raw constants to indicate these values. If include_missing is false (default), instead, these values are skipped, so that the resulting vector might be smaller than values_per_sample().

This function needs to allocate a vector; hence, the other access methods are preferred for speed reasons.
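The effect of the include_missing flag can be illustrated with a small stand-alone sketch. The function get_values_sketch is a hypothetical name, operates on the raw int32 values of a single sample, and uses the numeric values of htslib's bcf_int32_missing and bcf_int32_vector_end constants; it is meant to show the documented behaviour, not the actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Sentinel values as defined by htslib for int32 data.
constexpr std::int32_t kMissing   = std::numeric_limits<std::int32_t>::min();
constexpr std::int32_t kVectorEnd = std::numeric_limits<std::int32_t>::min() + 1;

// Copy the values of one sample, either keeping the raw sentinel values
// (include_missing == true) or dropping them (default).
std::vector<std::int32_t> get_values_sketch(
    std::vector<std::int32_t> const& raw,
    bool include_missing = false
) {
    std::vector<std::int32_t> result;
    result.reserve( raw.size() );
    for( std::int32_t const v : raw ) {
        bool const valid = ( v != kMissing && v != kVectorEnd );
        if( include_missing || valid ) {
            result.push_back( v );
        }
    }
    // The result can never hold more entries than values_per_sample().
    assert( result.size() <= raw.size() );
    return result;
}
```

For the raw values { 5, kMissing, 8, kVectorEnd }, the default call returns the two valid values 5 and 8, whereas passing true returns all four entries including both sentinels.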

Definition at line 682 of file vcf_format_iterator.hpp.

◆ get_values_at()

std::vector<T> get_values_at ( size_t  sample_index,
bool  include_missing = false 
) const
inline

Get a vector of all values for a given sample.

If include_missing is true, the resulting vector has the size of values_per_sample(), and also contains any missing or end-of-vector values as provided by htslib, using their raw constants to indicate these values. If include_missing is false (default), instead, these values are skipped, so that the resulting vector might be smaller than values_per_sample().

This function needs to allocate a vector; hence, the other access methods are preferred for speed reasons.

Definition at line 692 of file vcf_format_iterator.hpp.

◆ has_value()

bool has_value ( ) const
inline

Return whether the iterator currently resides at a valid value of the current sample.

The function tests whether the value at value_index() of the current sample at sample_index() is valid, that is, not missing and not the end of the data for that sample.

This function returns true for at most values_per_sample() values per sample when iterating through them via next_value(). It can be fewer than that if there are missing values in the VCF data.

Definition at line 539 of file vcf_format_iterator.hpp.

◆ has_value_at() [1/2]

bool has_value_at ( size_t  sample_index,
size_t  value_index 
) const
inline

Return whether the value at a given index within the given sample is valid.

Invalid values are either missing or marked as the end of the vector in htslib. If neither is the case (and if the indices are within bounds), the value is considered valid.

Definition at line 634 of file vcf_format_iterator.hpp.

◆ has_value_at() [2/2]

bool has_value_at ( size_t  value_index) const
inline

Return whether the value at a given index within the current sample is valid.

Invalid values are either missing or marked as the end of the vector in htslib. If neither is the case (and if the index is within bounds), the value is considered valid.

Definition at line 623 of file vcf_format_iterator.hpp.

◆ header_data()

::bcf_hdr_t* header_data ( )
inline

Get the raw htslib structure pointer for the header.

Definition at line 379 of file vcf_format_iterator.hpp.

◆ next_value()

void next_value ( )
inline

Move to the next value within the current sample.

This increases the value_index() to the next valid value within the current sample at sample_index(). Invalid values (e.g., missing data) are skipped automatically.

Definition at line 579 of file vcf_format_iterator.hpp.

◆ operator!=()

bool operator!= ( self_type const &  other) const
inline

Inequality comparison, needed to detect the end of the iteration.

Definition at line 446 of file vcf_format_iterator.hpp.

◆ operator*()

self_type& operator* ( )
inline

Dereference, which gives the iterator itself instead of the value, as our values should be accessed via the get_value() or get_value_at() functions.

Definition at line 420 of file vcf_format_iterator.hpp.

◆ operator++() [1/2]

self_type& operator++ ( )
inline

Pre-increment operator to move to the next sample.

In particular, move to the first valid value of the next sample, or, if we reached the end of the samples, set the end flag, so that we know we are done.

Definition at line 457 of file vcf_format_iterator.hpp.

◆ operator++() [2/2]

self_type operator++ ( int  )
inline

Post-increment operator to move to the next sample.

This does the same as the pre-increment, but returns an iterator to the previous sample/value. Note that this creates a copy, which is additional effort. Hence, we strongly recommend using the pre-increment operator ++sample instead whenever possible.

Definition at line 480 of file vcf_format_iterator.hpp.

◆ operator=() [1/2]

VcfFormatIterator& operator= ( VcfFormatIterator< S, T > &&  )
default

◆ operator=() [2/2]

VcfFormatIterator& operator= ( VcfFormatIterator< S, T > const &  )
default

◆ operator==()

bool operator== ( self_type const &  other) const
inline

Equality comparison, needed to detect the end of the iteration.

Definition at line 428 of file vcf_format_iterator.hpp.

◆ record_data()

::bcf1_t* record_data ( )
inline

Get the raw htslib structure pointer for the record/line.

Definition at line 387 of file vcf_format_iterator.hpp.

◆ sample_count()

size_t sample_count ( ) const
inline

Return the total number of samples that we are iterating over.

Definition at line 395 of file vcf_format_iterator.hpp.

◆ sample_index()

size_t sample_index ( ) const
inline

Return the index of the column of the current sample.

See value_index() to get the index of the current value within the current sample.

Definition at line 496 of file vcf_format_iterator.hpp.

◆ sample_name()

std::string sample_name ( ) const
inline

Return the name of the current sample, as given in the #CHROM ... header line of the VCF file.

Definition at line 516 of file vcf_format_iterator.hpp.

◆ sample_name_at()

std::string sample_name_at ( size_t  sample_index) const
inline

Return the sample name at a given index within 0 and sample_count().

Definition at line 603 of file vcf_format_iterator.hpp.

◆ valid_value_count()

size_t valid_value_count ( ) const
inline

Return the number of valid values for the current sample.

That corresponds to how often next_value() will be called when looping over values before has_value() returns false.

Definition at line 591 of file vcf_format_iterator.hpp.

◆ valid_value_count_at()

size_t valid_value_count_at ( size_t  sample_index) const
inline

Return the number of valid values for a given sample_index.

This corresponds to the resulting vector size when calling get_values() or get_values_at() with include_missing == false.

Definition at line 726 of file vcf_format_iterator.hpp.

◆ value_index()

size_t value_index ( ) const
inline

Return the index of the current value within the current sample.

See sample_index() to get the index of the current sample.

Definition at line 506 of file vcf_format_iterator.hpp.

◆ values_per_sample()

size_t values_per_sample ( ) const
inline

Return the number of values that each sample has.

Note that VCF allows for unspecified values (missing data) and early ending data if a particular sample does not have that many values. This function here hence returns the maximum number of values per sample, as specified in the header.

Definition at line 407 of file vcf_format_iterator.hpp.

Member Typedef Documentation

◆ self_type

using self_type = VcfFormatIterator< S, T >

Definition at line 285 of file vcf_format_iterator.hpp.

