A library for working with phylogenetic and population genetic data.
v0.32.0
VariantGaplessInputStream Class Reference

#include <genesis/population/stream/variant_gapless_input_stream.hpp>

Detailed Description

Stream adapter that visits every position in the genome.

This stream takes another VariantInputStream as input. It then iterates over every position of each chromosome in that input, starting at position 1 and ending at the last position that the input contains for that chromosome. Any position for which the input has no data (missing) instead dereferences to a dummy Variant that has the same number of samples as the input, but zero counts.
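
As a rough illustration of the gap filling described above, the following self-contained sketch mimics the behavior with toy types. ToyVariant and fill_gaps are hypothetical names invented for this example and are not part of genesis; the sketch also assumes that the input is sorted by chromosome and position.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a Variant: one locus with per-sample counts.
struct ToyVariant {
    std::string chromosome;
    std::size_t position;           // 1-based
    std::vector<int> sample_counts; // all zeros for a filled-in (missing) position
};

// For each chromosome in the sorted input, emit positions 1..last, using a
// zero-count dummy wherever the input has no record.
std::vector<ToyVariant> fill_gaps(
    std::vector<ToyVariant> const& input, std::size_t sample_count
) {
    std::vector<ToyVariant> result;
    std::string current_chr;
    std::size_t next_pos = 1;
    for (auto const& var : input) {
        // A new chromosome restarts the position counter at 1.
        if (result.empty() || var.chromosome != current_chr) {
            current_chr = var.chromosome;
            next_pos = 1;
        }
        // Assumption of this sketch: positions are sorted and unique.
        assert(var.position >= next_pos);
        for (; next_pos < var.position; ++next_pos) {
            result.push_back({current_chr, next_pos, std::vector<int>(sample_count, 0)});
        }
        result.push_back(var);
        next_pos = var.position + 1;
    }
    return result;
}
```

Without a reference, the sketch stops at the last position of each chromosome that is present in the input, which matches the default behavior described above.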

If additionally a reference genome or sequence dictionary is provided, the chromosomes are further iterated for the full length as specified in these references. This expects that the input data does not contain positions beyond the reference (otherwise, an exception is thrown), and we also check that the reference genome bases are compatible with the bases provided by the input data (the Variant::reference_base).
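
The effect of a known chromosome length can be sketched as follows. ToyLocus and visit_with_reference are hypothetical names for this example only, and the exception type std::runtime_error is an assumption, as the text above does not state which exception is thrown. The check of reference bases is omitted here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical record: a visited position and whether the input had data there.
struct ToyLocus {
    std::size_t position;
    bool has_data;
};

// List all positions 1..reference_length of one chromosome, marking those that
// are present in the sorted, 1-based data_positions of the input.
std::vector<ToyLocus> visit_with_reference(
    std::vector<std::size_t> const& data_positions, std::size_t reference_length
) {
    // Input data beyond the end of the reference is an error.
    // The exception type is an assumption of this sketch.
    if (!data_positions.empty() && data_positions.back() > reference_length) {
        throw std::runtime_error("input position beyond reference length");
    }
    std::vector<ToyLocus> visited;
    std::size_t next_data = 0;
    for (std::size_t pos = 1; pos <= reference_length; ++pos) {
        bool const has_data =
            next_data < data_positions.size() && data_positions[next_data] == pos;
        if (has_data) {
            ++next_data;
        }
        visited.push_back({pos, has_data});
    }
    return visited;
}
```

With data at positions 2 and 3 and a reference length of 5, the sketch visits positions 1 to 5, of which only 2 and 3 carry data.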

Furthermore, if a reference genome or sequence dictionary is provided, and iterate_extra_chromosomes() is set to true (which it is by default), we also iterate any chromosomes that appear in the reference but not in the input data at all (of course, all of them will then only contain missing data). This makes sure that the full reference is iterated over.

In some cases, the Variant stream is intended to be subset to particular genomic regions. For this, use genome_locus_set() to set the regions to subset to. Note that the current implementation is slightly inefficient: it still fills in some of the gaps in the input first, only to discard them again if the region filter removes them. A more efficient implementation that skips those regions in the first place turned out to be quite involved, due to the interactions between the data stream, the reference or sequence dictionary, and the region filter, and has not been attempted for now. The current implementation is nonetheless slightly more efficient than applying the region filter afterwards, as part of the processing can be skipped for the filtered positions.

The iterator is useful in situations where the input is expected to have missing data that its own iterator skips, but where some external algorithm or processing step needs to visit all positions. For instance, when writing a sync file, this can be used to make a "gsync" file that contains all positions, instead of skipping positions with missing data.

Definition at line 94 of file variant_gapless_input_stream.hpp.

Public Member Functions

 VariantGaplessInputStream ()=default
 
 VariantGaplessInputStream (VariantGaplessInputStream &&)=default
 
 VariantGaplessInputStream (VariantGaplessInputStream const &)=default
 
 VariantGaplessInputStream (VariantInputStream &&input)
 
 VariantGaplessInputStream (VariantInputStream const &input)
 
 ~VariantGaplessInputStream ()=default
 
Iterator begin ()
 Begin the iteration.
 
Iterator end ()
 End marker for the iteration.
 
std::shared_ptr< GenomeLocusSet > genome_locus_set () const
 Get the currently set GenomeLocusSet for subsetting the iteration positions.
 
self_type & genome_locus_set (std::shared_ptr< GenomeLocusSet > value)
 Set a genomic locus set for subsetting the iteration positions.
 
VariantInputStream & input ()
 
VariantInputStream const & input () const
 
bool iterate_extra_chromosomes () const
 
self_type & iterate_extra_chromosomes (bool value)
 Determine whether extra chromosomes without any data in the input are iterated.
 
VariantGaplessInputStream & operator= (VariantGaplessInputStream &&)=default
 
VariantGaplessInputStream & operator= (VariantGaplessInputStream const &)=default
 
std::shared_ptr<::genesis::sequence::ReferenceGenome > reference_genome () const
 Get the currently set reference genome to be used for the chromosome lengths and bases.
 
self_type & reference_genome (std::shared_ptr<::genesis::sequence::ReferenceGenome > value)
 Set a reference genome to be used for the chromosome lengths and bases.
 
std::shared_ptr< genesis::sequence::SequenceDict > sequence_dict () const
 Get the currently set sequence dictionary used for the chromosome lengths.
 
self_type & sequence_dict (std::shared_ptr< genesis::sequence::SequenceDict > value)
 Set a sequence dictionary to be used for the chromosome lengths.
 

Public Types

using difference_type = std::ptrdiff_t
 
using iterator_category = std::input_iterator_tag
 
using pointer = value_type const *
 
using reference = value_type const &
 
using self_type = VariantGaplessInputStream
 
using value_type = Variant
 

Public Attributes

friend Iterator
 

Classes

class  Iterator
 Iterator over loci of the input source.
 

Constructor & Destructor Documentation

◆ VariantGaplessInputStream() [1/5]

◆ VariantGaplessInputStream() [2/5]

VariantGaplessInputStream ( VariantInputStream const &  input)
inline

Definition at line 395 of file variant_gapless_input_stream.hpp.

◆ VariantGaplessInputStream() [3/5]

VariantGaplessInputStream ( VariantInputStream &&  input)
inline

Definition at line 401 of file variant_gapless_input_stream.hpp.

◆ ~VariantGaplessInputStream()

◆ VariantGaplessInputStream() [4/5]

◆ VariantGaplessInputStream() [5/5]

Member Function Documentation

◆ begin()

Iterator begin ( )
inline

Begin the iteration.

Definition at line 438 of file variant_gapless_input_stream.hpp.

◆ end()

Iterator end ( )
inline

End marker for the iteration.

Definition at line 453 of file variant_gapless_input_stream.hpp.

◆ genome_locus_set() [1/2]

std::shared_ptr<GenomeLocusSet> genome_locus_set ( ) const
inline

Get the currently set GenomeLocusSet for subsetting the iteration positions.

Definition at line 562 of file variant_gapless_input_stream.hpp.

◆ genome_locus_set() [2/2]

self_type& genome_locus_set ( std::shared_ptr< GenomeLocusSet >  value)
inline

Set a genomic locus set for subsetting the iteration positions.

Only positions listed in the provided set are iterated. This has the same effect as filtering out any positions that are not covered in the provided set after applying this gapless iterator. That means that any gaps of uncovered positions in the given genome locus set will still be gaps in the iteration here; they are not filled in. The main purpose of this is hence to filter for larger regions, and not for individual positions such as SNPs (in that case, doing a gapless iteration on top of SNP filtering would not make sense in the first place anyway).

This is recommended in order to avoid unnecessary computations when subsetting the Variant stream to certain chromosomes or regions. If this was not used, the following would happen: On the one hand, if the input stream was already subset to some regions, then this gapless iterator would re-introduce any positions that were previously removed, but with missing data, which is likely not what we want. On the other hand, if the subsetting to regions was done after using this gapless iterator, a lot of unnecessary positions would first be iterated, only to then be removed again if they are not in the required regions.

So instead, setting the region filter here already makes sure that we only iterate the regions and positions that are actually needed.
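
The combined effect of gapless iteration and a region filter can be sketched with toy types. ToyRegion, is_covered and visit_in_regions are hypothetical names for this example, and closed intervals of 1-based positions are an assumption of the sketch, not a statement about GenomeLocusSet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical closed interval [first, last] of 1-based positions.
struct ToyRegion {
    std::size_t first;
    std::size_t last;
};

// Check whether a position falls into any of the given regions.
bool is_covered(std::vector<ToyRegion> const& regions, std::size_t position)
{
    return std::any_of(
        regions.begin(), regions.end(),
        [position](ToyRegion const& region) {
            assert(region.first <= region.last);
            return region.first <= position && position <= region.last;
        }
    );
}

// Visit all positions 1..chromosome_length, but keep only those covered by
// the regions. Positions between regions stay gaps and are not filled in.
std::vector<std::size_t> visit_in_regions(
    std::size_t chromosome_length, std::vector<ToyRegion> const& regions
) {
    std::vector<std::size_t> visited;
    for (std::size_t pos = 1; pos <= chromosome_length; ++pos) {
        if (is_covered(regions, pos)) {
            visited.push_back(pos);
        }
    }
    return visited;
}
```

For a chromosome of length 10 and the regions [2, 4] and [8, 9], the sketch visits positions 2, 3, 4, 8 and 9, so that positions 5 to 7 between the two regions remain a gap.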

Definition at line 589 of file variant_gapless_input_stream.hpp.

◆ input() [1/2]

VariantInputStream& input ( )
inline

Definition at line 426 of file variant_gapless_input_stream.hpp.

◆ input() [2/2]

VariantInputStream const& input ( ) const
inline

Definition at line 421 of file variant_gapless_input_stream.hpp.

◆ iterate_extra_chromosomes() [1/2]

bool iterate_extra_chromosomes ( ) const
inline

Definition at line 462 of file variant_gapless_input_stream.hpp.

◆ iterate_extra_chromosomes() [2/2]

self_type& iterate_extra_chromosomes ( bool  value)
inline

Determine whether extra chromosomes without any data in the input are iterated.

If a reference_genome() or sequence_dict() is provided, there might be chromosomes in there that do not appear in the input data at all. With this setting, which is true by default, these chromosomes are iterated over, of course solely consisting of missing data then. If set to false, these are skipped instead and the iteration is ended with the end of the data.
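
The effect of this setting on the set of visited chromosomes can be sketched as follows. The function chromosomes_to_iterate is a hypothetical helper for this example, and the order in which extra chromosomes are appended (reference order, after the input chromosomes) is an assumption of the sketch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Determine which chromosomes are visited, given the chromosomes that occur in
// the input and those listed in the reference.
std::vector<std::string> chromosomes_to_iterate(
    std::vector<std::string> const& input_chromosomes,
    std::vector<std::string> const& reference_chromosomes,
    bool iterate_extra
) {
    // Chromosomes of the input are always visited.
    std::vector<std::string> result = input_chromosomes;
    if (!iterate_extra) {
        return result;
    }
    // Append reference chromosomes that the input never mentions.
    // They consist solely of missing data.
    for (auto const& chr : reference_chromosomes) {
        auto const found = std::find(
            input_chromosomes.begin(), input_chromosomes.end(), chr
        );
        if (found == input_chromosomes.end()) {
            result.push_back(chr);
        }
    }
    return result;
}
```

With the flag set to false, the result is just the list of input chromosomes, which corresponds to ending the iteration with the end of the data.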

Definition at line 475 of file variant_gapless_input_stream.hpp.

◆ operator=() [1/2]

VariantGaplessInputStream& operator= ( VariantGaplessInputStream &&  )
default

◆ operator=() [2/2]

VariantGaplessInputStream& operator= ( VariantGaplessInputStream const &  )
default

◆ reference_genome() [1/2]

std::shared_ptr<::genesis::sequence::ReferenceGenome> reference_genome ( ) const
inline

Get the currently set reference genome to be used for the chromosome lengths and bases.

Definition at line 490 of file variant_gapless_input_stream.hpp.

◆ reference_genome() [2/2]

self_type& reference_genome ( std::shared_ptr<::genesis::sequence::ReferenceGenome >  value)
inline

Set a reference genome to be used for the chromosome lengths and bases.

When provided, this is used to determine the length of each chromosome during iteration, as well as the reference base at each position.

If iterate_extra_chromosomes() is set (true by default), this also is used to determine chromosomes that are not in the input at all, and iterate those as well (consisting solely of missing data then, of course).

For simplicity, reference_genome() and sequence_dict() cannot be used at the same time.
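
The setters of this class return self_type&, which allows calls to be chained. The following toy class sketches this pattern together with the mutual exclusion of the two reference types. ToyGaplessSettings, ToyReferenceGenome and ToySequenceDict are hypothetical names, and throwing std::invalid_argument when both are set is an assumption of this sketch, as the text above only states that the two cannot be used at the same time.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for the two reference types.
struct ToyReferenceGenome { std::string bases; };
struct ToySequenceDict { std::size_t length; };

class ToyGaplessSettings {
public:
    using self_type = ToyGaplessSettings;

    std::shared_ptr<ToyReferenceGenome> reference_genome() const { return ref_genome_; }

    // Setter returns *this so that calls can be chained.
    self_type& reference_genome(std::shared_ptr<ToyReferenceGenome> value)
    {
        // Assumption: setting both kinds of reference is rejected with an exception.
        if (value && seq_dict_) {
            throw std::invalid_argument("sequence dict already set");
        }
        ref_genome_ = std::move(value);
        return *this;
    }

    std::shared_ptr<ToySequenceDict> sequence_dict() const { return seq_dict_; }

    self_type& sequence_dict(std::shared_ptr<ToySequenceDict> value)
    {
        if (value && ref_genome_) {
            throw std::invalid_argument("reference genome already set");
        }
        seq_dict_ = std::move(value);
        return *this;
    }

    bool iterate_extra_chromosomes() const { return iterate_extra_; }

    self_type& iterate_extra_chromosomes(bool value)
    {
        iterate_extra_ = value;
        return *this;
    }

private:
    std::shared_ptr<ToyReferenceGenome> ref_genome_;
    std::shared_ptr<ToySequenceDict> seq_dict_;
    bool iterate_extra_ = true; // true by default, as described above
};
```

A chained call on such an object could then read settings.reference_genome(ref).iterate_extra_chromosomes(false), where ref is a shared pointer to the reference.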

Definition at line 507 of file variant_gapless_input_stream.hpp.

◆ sequence_dict() [1/2]

std::shared_ptr<genesis::sequence::SequenceDict> sequence_dict ( ) const
inline

Get the currently set sequence dictionary used for the chromosome lengths.

Definition at line 528 of file variant_gapless_input_stream.hpp.

◆ sequence_dict() [2/2]

self_type& sequence_dict ( std::shared_ptr< genesis::sequence::SequenceDict >  value)
inline

Set a sequence dictionary to be used for the chromosome lengths.

See reference_genome() for details. Using a SequenceDict is similar, but without the ability to infer reference bases at the positions along the genome. Other than that, it behaves the same. For simplicity, sequence_dict() and reference_genome() cannot be used at the same time.

Definition at line 541 of file variant_gapless_input_stream.hpp.

Member Typedef Documentation

◆ difference_type

using difference_type = std::ptrdiff_t

Definition at line 106 of file variant_gapless_input_stream.hpp.

◆ iterator_category

using iterator_category = std::input_iterator_tag

Definition at line 107 of file variant_gapless_input_stream.hpp.

◆ pointer

using pointer = value_type const*

Definition at line 104 of file variant_gapless_input_stream.hpp.

◆ reference

using reference = value_type const&

Definition at line 105 of file variant_gapless_input_stream.hpp.

◆ self_type

using self_type = VariantGaplessInputStream

◆ value_type

using value_type = Variant

Definition at line 103 of file variant_gapless_input_stream.hpp.

Member Data Documentation

◆ Iterator

friend Iterator

Definition at line 415 of file variant_gapless_input_stream.hpp.


The documentation for this class was generated from the following file: