#include <genesis/population/format/vcf_format_iterator.hpp>
Iterate the FORMAT information for the samples in a SNP/variant line in a VCF/BCF file.
One instance is meant to iterate all values of the samples for one FORMAT field key (given by its id tag/key). VCF/BCF supports different data types, for which this class template is instantiated in the begin_format_*() and get_format_*() functions of VcfRecord: std::string, int, and float.
The FORMAT data in VCF is fairly flexible and complex: each sample can have several values for a FORMAT field, up to the Number value in the VCF header line for that FORMAT field; individual values can be missing, or the data of a sample can end early; and the values are of type string, int, or float. The genotype (GT) is yet another special case that is handled by htslib as either string or int, depending on the context. We here hence wrap this as VcfGenotype for simplicity.
Basically, what that means is that we need an iterator for the values of each sample within this iterator over samples, which again usually is within an iterator over the records/lines of the VCF file. However, we simplify here a bit, and replace the innermost iterator (over values of the current sample): Most FORMAT tags only have one value anyway (see below for that use case), and also we don't want too many classes to confuse us. To accommodate this, and to make iterating values as easy as possible (given the complexity), we offer functions to automatically skip such missing values.
A simple use case for this iterator is hence as follows:
    // Load a file and init the data structures.
    auto file = HtsFile( "path/to/file.vcf" );
    auto header = VcfHeader( file );
    auto record = VcfRecord( header );

    // Iterate all records/lines of the VCF file.
    while( record.read( file )) {

        // Skip if the read depth (DP) FORMAT is not available for the current record.
        if( !record.has_format("DP") ) {
            continue;
        }

        // Iterate the DP data for all samples of the record, loading them as int.
        for( auto& sample_dp : record.get_format_int("DP") ) {
            LOG_INFO << "At sample " << sample_dp.sample_name();

            // Iterate all individual values for that sample that are given in the data.
            while( sample_dp.has_value() ) {
                LOG_INFO << "- " << sample_dp.get_value();
                sample_dp.next_value();
            }
        }
    }
See also VcfInputStream for a wrapper around the basic loop over records/lines.
The above example relies on the implicit notion of a "current" value per sample, as we move between values via the next_value() function. Note that the next_value() function automatically skips missing values. If, however, the exact indices of the values within a sample are important (that is, if missing values shall not be skipped automatically), an alternative approach is to use the *_at() functions that this iterator class provides:
    // (replacement for the innermost while loop of above)
    for( size_t i = 0; i < sample_dp.values_per_sample(); ++i ) {
        if( sample_dp.has_value_at(i) ) {
            LOG_INFO << "- " << sample_dp.get_value_at(i);
        }
    }
Alternatively, if only a single value is expected per sample anyway (which is probably the case for most kinds of FORMAT fields), we do not need to loop and can simplify the access:
    // (again, replacement for the innermost while loop of above)
    if( sample_dp.has_value() ) {
        LOG_INFO << "- " << sample_dp.get_value();
    }
Furthermore, in the first example, we used a range-based for loop, which automates the increment and comparison to the end iterator. It is of course also possible to do this explicitly:
    // (replacement for the range-based for loop of the first example)
    for(
        auto sample_dp = record.begin_format_int("DP");
        sample_dp != record.end_format_int();
        ++sample_dp
    ) {
        // ...
    }
Here, each call to the pre-increment operator++() moves to the first value of the next sample. Then as before, in order to move between values of the current sample, use has_value(), get_value(), and next_value() within the loop.
Note that we also provide *_at() functions that take the sample index as input. These are hence fully independent of the current iterator position (sample and value position), and can be used to access values at arbitrary sample and value indices.
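Access by sample and value index can be pictured with a flat, sample-major array, which is the layout htslib uses when it hands back the FORMAT values of a record. The following is a minimal sketch under that assumption, not the implementation of this class: `FlatFormat` and its members are hypothetical names, and the two sentinel constants merely carry the same values as `bcf_int32_missing` and `bcf_int32_vector_end` in htslib.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinels with the same values as htslib's bcf_int32_missing and bcf_int32_vector_end.
constexpr int32_t kMissing   = INT32_MIN;
constexpr int32_t kVectorEnd = INT32_MIN + 1;

// Hypothetical stand-in for the FORMAT data of one key in one record.
struct FlatFormat {
    std::size_t values_per_sample;
    std::vector<int32_t> data; // sample-major: all values of sample 0, then sample 1, ...

    std::size_t sample_count() const {
        return values_per_sample == 0 ? 0 : data.size() / values_per_sample;
    }

    // A value is valid if both indices are in bounds and it is neither sentinel.
    bool has_value_at( std::size_t sample_index, std::size_t value_index ) const {
        if( sample_index >= sample_count() || value_index >= values_per_sample ) {
            return false;
        }
        int32_t const v = data[ sample_index * values_per_sample + value_index ];
        return v != kMissing && v != kVectorEnd;
    }

    // Precondition: has_value_at() is true for the same indices.
    int32_t get_value_at( std::size_t sample_index, std::size_t value_index ) const {
        assert( has_value_at( sample_index, value_index ));
        return data[ sample_index * values_per_sample + value_index ];
    }
};
```

Because the position is computed from the two indices alone, such lookups do not depend on where an iterator currently stands, which is the property the *_at() functions are described as having.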
Furthermore, a vector with all values can be obtained, which is for example useful for the genotype GT field, which can be used like this:
    // (replacement for the range-based for loop of the first example)
    for( auto sample_gt : record.get_format_genotype() ) {
        LOG_INFO << vcf_genotype_string( sample_gt.get_values() );
    }
This yields a print-out of the genotypes of each sample in VCF style, see vcf_genotype_string() for details.
Further implementation details: The class is a template for the source type S as used by htslib (char*, int32_t, float), and for the target type T as used by us (std::string, int32_t, double, VcfGenotype). Most of the implementation is shared, but some htslib-related functions (check for their hard-coded special values, memory management of allocated arrays, ...) need to be specialized for the different data types. See the very end of this file for the respective implementations.
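To illustrate why the check for special values differs per source type, here is a minimal sketch of such a specialization. It is not the code of this class: `is_valid_value()` and `float_from_bits()` are hypothetical names. The constants carry the same values as the htslib sentinels (`bcf_int32_missing`, `bcf_int32_vector_end`, `bcf_float_missing`, `bcf_float_vector_end`). Integers can be compared directly, whereas the float sentinels are NaN bit patterns and hence have to be compared bitwise.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert( sizeof( float ) == sizeof( uint32_t ), "sketch assumes 32 bit floats" );

// Bit patterns with the same values as htslib's bcf_float_missing and bcf_float_vector_end.
constexpr uint32_t kFloatMissingBits   = 0x7F800001;
constexpr uint32_t kFloatVectorEndBits = 0x7F800002;

// Primary template is only declared; each source type provides its own check.
template<typename S>
bool is_valid_value( S value );

// Integers: plain comparison against the two sentinel values.
template<>
inline bool is_valid_value<int32_t>( int32_t value ) {
    return value != INT32_MIN && value != INT32_MIN + 1;
}

// Floats: the sentinels are NaNs, so `==` would never match; compare the bits instead.
template<>
inline bool is_valid_value<float>( float value ) {
    uint32_t bits;
    std::memcpy( &bits, &value, sizeof( bits ));
    return bits != kFloatMissingBits && bits != kFloatVectorEndBits;
}

// Helper for the sketch: build a float from a raw bit pattern.
inline float float_from_bits( uint32_t bits ) {
    float result;
    std::memcpy( &result, &bits, sizeof( result ));
    return result;
}
```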
Definition at line 62 of file vcf_format_iterator.hpp.
Public Member Functions
VcfFormatIterator ()=default
    Create a default (empty) instance, that is used to indicate the end iterator position.
VcfFormatIterator (::bcf_hdr_t *header, ::bcf1_t *record, std::string const &id, VcfValueType ht_type)
    Create an instance, given the htslib header, record line, and the FORMAT id tag/key (as well as its data type ht_type) that we want to iterate over.
VcfFormatIterator (VcfFormatIterator &&)=default
VcfFormatIterator (VcfFormatIterator const &)=default
~VcfFormatIterator ()=default
T get_value () const
    Get the value where the iterator currently resides.
T get_value_at (size_t sample_index, size_t value_index) const
    Get the value at a given value_index of a given sample at sample_index.
T get_value_at (size_t value_index) const
    Get the value at a given value_index of the current sample.
std::vector< T > get_values (bool include_missing=false) const
    Get a vector of all values for the current sample.
std::vector< T > get_values_at (size_t sample_index, bool include_missing=false) const
    Get a vector of all values for a given sample.
bool has_value () const
    Return whether the iterator currently resides at a valid value of the current sample.
bool has_value_at (size_t sample_index, size_t value_index) const
    Return whether the value at a given index within the given sample is valid.
bool has_value_at (size_t value_index) const
    Return whether the value at a given index within the current sample is valid.
::bcf_hdr_t * header_data ()
    Get the raw htslib structure pointer for the header.
void next_value ()
    Move to the next value within the current sample.
bool operator!= (self_type const &other) const
    Inequality comparison, needed to detect the end of the iteration.
self_type & operator* ()
    Dereference, which gives the iterator itself instead of the value, as our values should be accessed via the get_value() or get_value_at() functions.
self_type & operator++ ()
    Pre-increment operator to move to the next sample.
self_type operator++ (int)
    Post-increment operator to move to the next sample.
VcfFormatIterator & operator= (VcfFormatIterator &&)=default
VcfFormatIterator & operator= (VcfFormatIterator const &)=default
bool operator== (self_type const &other) const
    Equality comparison, needed to detect the end of the iteration.
::bcf1_t * record_data ()
    Get the raw htslib structure pointer for the record/line.
size_t sample_count () const
    Return the total number of samples that we are iterating over.
size_t sample_index () const
    Return the index of the column of the current sample.
std::string sample_name () const
    Return the name of the current sample, as given in the #CHROM ... header line of the VCF file.
std::string sample_name_at (size_t sample_index) const
    Return the sample name at a given index within 0 and sample_count().
size_t valid_value_count () const
    Return the number of valid values for the current sample.
size_t valid_value_count_at (size_t sample_index) const
    Return the number of valid values for a given sample_index.
size_t value_index () const
    Return the index of the current value within the current sample.
size_t values_per_sample () const
    Return the number of values that each sample has.
Public Types
using self_type = VcfFormatIterator< S, T >
VcfFormatIterator () [default]
Create a default (empty) instance, that is used to indicate the end iterator position.
By default, this has is_end_ == true, so that we can easily check for default constructed instances, which are used as the past-the-end iterators in the loop when using this class.
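The role of the end flag can be sketched with a small stand-alone iterator. This is a hedged illustration of the pattern only: `SampleCursor` and `count_visited()` are hypothetical names, and the comparison rule shown here is an assumption, not necessarily the one used by this class.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical iterator over sample indices, to show the end-flag pattern.
class SampleCursor {
public:
    // Default construction yields the past-the-end position.
    SampleCursor() = default;

    // Start at the first sample; with zero samples we are at the end right away.
    explicit SampleCursor( std::size_t sample_count )
        : is_end_( sample_count == 0 )
        , sample_count_( sample_count )
    {}

    // Move to the next sample, and set the end flag once we run out of samples.
    SampleCursor& operator++() {
        assert( !is_end_ );
        ++sample_index_;
        if( sample_index_ >= sample_count_ ) {
            is_end_ = true;
        }
        return *this;
    }

    // Equal if both are at the end, or if neither is and both are at the same sample.
    bool operator==( SampleCursor const& other ) const {
        if( is_end_ || other.is_end_ ) {
            return is_end_ == other.is_end_;
        }
        return sample_index_ == other.sample_index_;
    }

    bool operator!=( SampleCursor const& other ) const {
        return !( *this == other );
    }

private:
    bool is_end_ = true;
    std::size_t sample_count_ = 0;
    std::size_t sample_index_ = 0;
};

// Loop in the same explicit style as with begin_format_*() and end_format_*().
inline std::size_t count_visited( std::size_t sample_count ) {
    std::size_t visited = 0;
    for( auto it = SampleCursor( sample_count ); it != SampleCursor(); ++it ) {
        ++visited;
    }
    return visited;
}
```

With this design, the end iterator needs no knowledge of the record it belongs to, so a default constructed instance suffices as the loop bound.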
VcfFormatIterator (::bcf_hdr_t *header, ::bcf1_t *record, std::string const &id, VcfValueType ht_type) [inline]
Create an instance, given the htslib header, record line, and the FORMAT id tag/key (as well as its data type ht_type) that we want to iterate over.
Usually, this class does not need to be constructed by the user. Instead, it is obtained from the begin_format_*() and end_format_*(), or the get_format_*() iterator functions of VcfRecord. That way, it can easily be used for iterating all samples of a given VCF record line.
Definition at line 307 of file vcf_format_iterator.hpp.
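VcfFormatIterator (VcfFormatIterator &&) [default]
VcfFormatIterator (VcfFormatIterator const &) [default]
~VcfFormatIterator () [default]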
T get_value () const [inline]
Get the value where the iterator currently resides.
That is, get the value at index value_index() for the sample at sample_index(). The function assumes that this is a valid value, that is, that has_value() returned true.
Definition at line 557 of file vcf_format_iterator.hpp.
T get_value_at (size_t sample_index, size_t value_index) const [inline]
Get the value at a given value_index of a given sample at sample_index.
Definition at line 655 of file vcf_format_iterator.hpp.
T get_value_at (size_t value_index) const [inline]
Get the value at a given value_index of the current sample.
Definition at line 647 of file vcf_format_iterator.hpp.
std::vector< T > get_values (bool include_missing=false) const [inline]
Get a vector of all values for the current sample.
If include_missing is true, the resulting vector has the size of values_per_sample(), and also contains any missing or end-of-vector values as provided by htslib, using their raw constants to indicate these values. If include_missing is false (default), instead, these values are skipped, so that the resulting vector might be smaller than values_per_sample().
This function needs to allocate a vector; hence, the other access methods are preferred for speed reasons.
Definition at line 682 of file vcf_format_iterator.hpp.
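The effect of include_missing can be sketched as follows, for int values only. This is an illustration under stated assumptions rather than the implementation: `collect_values()` is a hypothetical free function that receives the raw entries of one sample, and the sentinels carry the same values as `bcf_int32_missing` and `bcf_int32_vector_end` in htslib.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sentinels with the same values as htslib's bcf_int32_missing and bcf_int32_vector_end.
constexpr int32_t kMissingValue = INT32_MIN;
constexpr int32_t kEndOfVector  = INT32_MIN + 1;

// Hypothetical helper: sample_values holds the raw entries of one sample.
inline std::vector<int32_t> collect_values(
    std::vector<int32_t> const& sample_values,
    bool include_missing = false
) {
    std::vector<int32_t> result;
    result.reserve( sample_values.size() );
    for( int32_t v : sample_values ) {
        bool const valid = ( v != kMissingValue && v != kEndOfVector );

        // Keep valid values always, and the raw sentinels only on request.
        if( valid || include_missing ) {
            result.push_back( v );
        }
    }

    // The result can only be as large as the input, or smaller.
    assert( result.size() <= sample_values.size() );
    return result;
}
```

In this sketch, the size of the result without sentinels plays the role that valid_value_count() has for the current sample.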
std::vector< T > get_values_at (size_t sample_index, bool include_missing=false) const [inline]
Get a vector of all values for a given sample.
If include_missing is true, the resulting vector has the size of values_per_sample(), and also contains any missing or end-of-vector values as provided by htslib, using their raw constants to indicate these values. If include_missing is false (default), instead, these values are skipped, so that the resulting vector might be smaller than values_per_sample().
This function needs to allocate a vector; hence, the other access methods are preferred for speed reasons.
Definition at line 692 of file vcf_format_iterator.hpp.
bool has_value () const [inline]
Return whether the iterator currently resides at a valid value of the current sample.
The function tests whether the value at value_index() of the current sample at sample_index() is valid, that is, not missing and not the end of the data for that sample.
This function is true for at most values_per_sample() values per sample when iterating through them via next_value(). It can be fewer than that if there are missing values in the VCF data.
Definition at line 539 of file vcf_format_iterator.hpp.
bool has_value_at (size_t sample_index, size_t value_index) const [inline]
Return whether the value at a given index within the given sample is valid.
Invalid values are either missing or marked as the end of the vector in htslib. If neither is the case (and if the indices are within bounds), the value is considered valid.
Definition at line 634 of file vcf_format_iterator.hpp.
bool has_value_at (size_t value_index) const [inline]
Return whether the value at a given index within the current sample is valid.
Invalid values are either missing or marked as the end of the vector in htslib. If neither is the case (and if the index is within bounds), the value is considered valid.
Definition at line 623 of file vcf_format_iterator.hpp.
::bcf_hdr_t * header_data () [inline]
Get the raw htslib structure pointer for the header.
Definition at line 379 of file vcf_format_iterator.hpp.
void next_value () [inline]
Move to the next value within the current sample.
This increases the value_index() to the next valid value within the current sample at sample_index(). Invalid values (e.g., missing data) are skipped automatically.
Definition at line 579 of file vcf_format_iterator.hpp.
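The automatic skipping can be sketched for the values of a single sample. This is a minimal illustration, not the implementation of this class: `ValueCursor` and `visit_all()` are hypothetical names, the member functions only mimic the names documented here, and the sentinels carry the same values as `bcf_int32_missing` and `bcf_int32_vector_end` in htslib.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr int32_t kNoValue  = INT32_MIN;     // same value as bcf_int32_missing
constexpr int32_t kEndValue = INT32_MIN + 1; // same value as bcf_int32_vector_end

// Hypothetical cursor over the raw values of one sample.
class ValueCursor {
public:
    explicit ValueCursor( std::vector<int32_t> values )
        : values_( std::move( values ))
    {
        // Start at the first valid value, if there is any.
        skip_invalid_();
    }

    bool has_value() const {
        return index_ < values_.size();
    }

    int32_t get_value() const {
        assert( has_value() );
        return values_[ index_ ];
    }

    // Advance by one, then keep going while the entry is missing or end-of-vector.
    void next_value() {
        ++index_;
        skip_invalid_();
    }

    std::size_t value_index() const {
        return index_;
    }

private:
    void skip_invalid_() {
        while(
            index_ < values_.size() &&
            ( values_[ index_ ] == kNoValue || values_[ index_ ] == kEndValue )
        ) {
            ++index_;
        }
    }

    std::vector<int32_t> values_;
    std::size_t index_ = 0;
};

// The same has_value() / get_value() / next_value() loop as in the usage examples.
inline std::vector<int32_t> visit_all( ValueCursor cursor ) {
    std::vector<int32_t> seen;
    while( cursor.has_value() ) {
        seen.push_back( cursor.get_value() );
        cursor.next_value();
    }
    return seen;
}
```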
bool operator!= (self_type const &other) const [inline]
Inequality comparison, needed to detect the end of the iteration.
Definition at line 446 of file vcf_format_iterator.hpp.
self_type & operator* () [inline]
Dereference, which gives the iterator itself instead of the value, as our values should be accessed via the get_value() or get_value_at() functions.
Definition at line 420 of file vcf_format_iterator.hpp.
self_type & operator++ () [inline]
Pre-increment operator to move to the next sample.
In particular, move to the first valid value of the next sample, or, if we reached the end of the samples, set the end flag, so that we know we are done.
Definition at line 457 of file vcf_format_iterator.hpp.
self_type operator++ (int) [inline]
Post-increment operator to move to the next sample.
This does the same as the pre-increment, but returns an iterator to the previous sample/value. Note that this creates a copy, which is additional effort. Hence, we strongly recommend using the pre-increment operator ++sample instead whenever possible.
Definition at line 480 of file vcf_format_iterator.hpp.
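The extra copy of the post-increment can be made visible with a small stand-alone type. This is a generic sketch of the two operator forms, not code from this class: `CountingCursor` and the global counter `g_cursor_copies` are hypothetical and exist only to count copy constructions.

```cpp
#include <cassert>
#include <cstddef>

// Global counter of copy constructions, for illustration only.
std::size_t g_cursor_copies = 0;

struct CountingCursor {
    std::size_t index = 0;

    CountingCursor() = default;
    CountingCursor( CountingCursor const& other )
        : index( other.index )
    {
        ++g_cursor_copies;
    }
    CountingCursor& operator=( CountingCursor const& ) = default;

    // Pre-increment: advance in place and return a reference, without any copy.
    CountingCursor& operator++() {
        ++index;
        return *this;
    }

    // Post-increment: copy the old state, advance, and return the copy.
    CountingCursor operator++( int ) {
        CountingCursor previous( *this );
        ++( *this );
        assert( previous.index + 1 == index );
        return previous;
    }
};
```

How many copies a post-increment causes beyond the first depends on whether the compiler elides the copy on return, which is why only a lower bound of one copy can be relied upon.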
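VcfFormatIterator & operator= (VcfFormatIterator &&) [default]
VcfFormatIterator & operator= (VcfFormatIterator const &) [default]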
bool operator== (self_type const &other) const [inline]
Equality comparison, needed to detect the end of the iteration.
Definition at line 428 of file vcf_format_iterator.hpp.
::bcf1_t * record_data () [inline]
Get the raw htslib structure pointer for the record/line.
Definition at line 387 of file vcf_format_iterator.hpp.
size_t sample_count () const [inline]
Return the total number of samples that we are iterating over.
Definition at line 395 of file vcf_format_iterator.hpp.
size_t sample_index () const [inline]
Return the index of the column of the current sample.
See value_index() to get the index of the current value within the current sample.
Definition at line 496 of file vcf_format_iterator.hpp.
std::string sample_name () const [inline]
Return the name of the current sample, as given in the #CHROM ... header line of the VCF file.
Definition at line 516 of file vcf_format_iterator.hpp.
std::string sample_name_at (size_t sample_index) const [inline]
Return the sample name at a given index within 0 and sample_count().
Definition at line 603 of file vcf_format_iterator.hpp.
size_t valid_value_count () const [inline]
Return the number of valid values for the current sample.
That corresponds to how often next_value() will be called when looping over values before has_value() returns false.
Definition at line 591 of file vcf_format_iterator.hpp.
size_t valid_value_count_at (size_t sample_index) const [inline]
Return the number of valid values for a given sample_index.
This corresponds to the resulting vector size when calling get_values() or get_values_at() with include_missing == false.
Definition at line 726 of file vcf_format_iterator.hpp.
size_t value_index () const [inline]
Return the index of the current value within the current sample.
See sample_index() to get the index of the current sample.
Definition at line 506 of file vcf_format_iterator.hpp.
size_t values_per_sample () const [inline]
Return the number of values that each sample has.
Note that VCF allows for unspecified values (missing data) and early ending data if a particular sample does not have that many values. This function here hence returns the maximum number of values per sample, as specified in the header.
Definition at line 407 of file vcf_format_iterator.hpp.
using self_type = VcfFormatIterator<S,T>
Definition at line 285 of file vcf_format_iterator.hpp.