A library for working with phylogenetic and population genetic data.
v0.32.0
VariantParallelInputStream Class Reference

#include <genesis/population/stream/variant_parallel_input_stream.hpp>

Detailed Description

Iterate multiple input sources that yield Variants in parallel.

This stream allows to traverse multiple sources of data in parallel, where each stop of the traversal is a locus in the input sources. Using ContributionType, one can select the contribution of loci of each input, that is, whether all its loci get used, or just the ones that also overlap with other input sources. See also add_carrying_locus() for other ways to specify the loci to iterate over.

At each visited locus, the stream yields the data of the underlying input sources as a vector of Optional Variants, with one Variant per input source. If a source does not have data at the current locus, the Optional is empty. Use the dereference operator*() and operator->() of the iterator or the access functions variants() and variant_at() to get the set of variants at the current locus() of the iteration, or use joined_variant() to get one Variant that has all sample SampleCounts joined into it.

Furthermore, using the inputs() and input_at() functions, which are also available from the iterator itself, one can access additional information about the underlying streams, such as the file name and sample names that are being read. This is particularly useful if input sources are added as in the example below, where we use functions such as make_variant_input_stream_from_pileup_file(), and the related make_variant_input_stream_from_...() set of functions, to get access to the files, which encapsulate and hence would otherwise hide this information from us. See VariantInputStreamData for the data structure that is used to store these additional information, and see VariantInputStream for details on the underlying stream type.

Example:

// Add input sources to a parallel stream, one carrying, so that all its loci are visited,
// and one following, meaning that its loci are only visited if the first one also
// as those loci.
auto parallel = VariantParallelInputStream();
parallel.add_variant_input_stream(
make_variant_input_stream_from_pileup_file( "path/to/file.pileup.gz" ),
);
parallel.add_variant_input_stream(
);
for( auto it = parallel.begin(); it != parallel.end(); ++it ) {
// Work with the stream, which stops at every locus of the first input source.
std::cout << "At: " << it.locus() << "\n";
for( auto const& var : *it ) {
if( var ) {
// The optional has data, and the variant is valid, which means that
// the input has data at the current locus.
// For example, get the number of samples of that variant
// (which has to be constant over the genome for each input -
// we do not allow inputs to change their number of samples during iteration).
auto const s = var->samples.size();
}
}
// Or get all data combined into one Variant.
auto const joined_var = it.joined_variant();
}
// or
for( auto const& opt_variant : parallel ) {
// Work directly with the std::vector<utils::Optional<Variant>> opt_variant that is
// returned by dereferencing the iterator. Note that this does not allow access to other
// members of the iterator, such as its locus() and joined_variant() functions.
}

See the VariantParallelInputStream::Iterator class for details on access to the data during traversal.

By default, we expect the chromosomes of the underlying input sources to be sorted lexicographically. However, this might not always be the case. In order to allow any (fixed) order, a SequenceDict can be provided via sequence_dict(). When provided, the chromosomes of the input sources are expected to follow the order as specificied by that dictionary. See also locus_compare() and related locus comparison functions for more details on locus sorting order.

Note also that we allow input sources to not contain data at all chromosomes. That is, as long as they are in the correct order (either lexicographical, or according to the sequence dictionary), input sources do not need to contain all chromosomes.

Definition at line 145 of file variant_parallel_input_stream.hpp.

Public Member Functions

 VariantParallelInputStream ()=default
 
 VariantParallelInputStream (self_type &&)=default
 
 VariantParallelInputStream (self_type const &)=default
 
 ~VariantParallelInputStream ()=default
 
template<class ForwardIterator >
self_typeadd_carrying_loci (ForwardIterator first, ForwardIterator last)
 Add a set of GenomeLoci that are used as carrying loci in the iteration. More...
 
self_typeadd_carrying_loci (std::vector< GenomeLocus > const &loci)
 Add a set of GenomeLoci that are used as carrying loci in the iteration. More...
 
self_typeadd_carrying_locus (GenomeLocus const &locus)
 Add a set of GenomeLoci that are used as carrying loci in the iteration. More...
 
self_typeadd_variant_input (std::function< bool(Variant &)> input_element_generator, ContributionType selection)
 Add an input to the parallel stream. More...
 
self_typeadd_variant_input_stream (VariantInputStream const &input, ContributionType selection)
 Add an input to the parallel stream. More...
 
Iterator begin ()
 Begin the iteration. More...
 
Iterator end ()
 End marker for the iteration. More...
 
VariantInputStreaminput_at (size_t index)
 Get access to an input stream that has been added to this parallel stream. More...
 
VariantInputStream const & input_at (size_t index) const
 Get access to an input stream that has been added to this parallel stream. More...
 
size_t input_size () const
 Return the number of input sourced added. More...
 
std::vector< VariantInputStream > & inputs ()
 Get access to the input streams that have been added to this parallel stream. More...
 
std::vector< VariantInputStream > const & inputs () const
 Get access to the input streams that have been added to this parallel stream. More...
 
self_typeoperator= (self_type &&)=default
 
self_typeoperator= (self_type const &)=default
 
std::shared_ptr< genesis::sequence::SequenceDictsequence_dict () const
 Get the currently set sequence dictionary used for the chromosome sorting order. More...
 
self_typesequence_dict (std::shared_ptr< genesis::sequence::SequenceDict > value)
 Set a sequence dictionary to be used for the chromosome order. More...
 

Public Types

enum  ContributionType { kCarrying, kFollowing }
 Select which loci of an input are used. More...
 
using self_type = VariantParallelInputStream
 
using value_type = Variant
 

Public Attributes

friend Iterator
 

Classes

class  Iterator
 Iterator over loci of the input sources. More...
 
struct  JoinedVariantParams
 Parameters for VariantParallelInputStream::Iterator::joined_variant() More...
 

Constructor & Destructor Documentation

◆ VariantParallelInputStream() [1/3]

◆ ~VariantParallelInputStream()

◆ VariantParallelInputStream() [2/3]

VariantParallelInputStream ( self_type const &  )
default

◆ VariantParallelInputStream() [3/3]

Member Function Documentation

◆ add_carrying_loci() [1/2]

self_type& add_carrying_loci ( ForwardIterator  first,
ForwardIterator  last 
)
inline

Add a set of GenomeLoci that are used as carrying loci in the iteration.

See also
VariantParallelInputStream::add_carrying_locus( GenomeLocus const& )

Definition at line 744 of file variant_parallel_input_stream.hpp.

◆ add_carrying_loci() [2/2]

self_type& add_carrying_loci ( std::vector< GenomeLocus > const &  loci)
inline

Add a set of GenomeLoci that are used as carrying loci in the iteration.

See also
VariantParallelInputStream::add_carrying_locus( GenomeLocus const& )

Definition at line 734 of file variant_parallel_input_stream.hpp.

◆ add_carrying_locus()

self_type& add_carrying_locus ( GenomeLocus const &  locus)
inline

Add a set of GenomeLoci that are used as carrying loci in the iteration.

This allows to iterate over a pre-defined set of loci. The iterator stops at each of these loci, independently of whether any of the underlying input sources have data at this locus. That means, it acts as an "empty" input that only contributes loci, as if it were added with ContributionType::kCarrying, but without any actual variants. Duplicate loci in these additional carrying loci are ignored.

Using this is particularly useful for more complex subset operations of loci, such as intersections, complements, (symmetrical) differences, and exclusions. These cases cannot be modelled with our simple ContributionType based approach; so instead, one can externally prepare the list of loci that need to be visited, and provide these to this function. In these cases, to use exactly the list of provided loci, all actual input sources can be added as ContributionType::kFollowing, to make sure that none of them adds additional loci to the traversal.

Note that in addition to the loci added via this function, all loci of input sources that are of ContributionType::kCarrying are also visited.

Definition at line 703 of file variant_parallel_input_stream.hpp.

◆ add_variant_input()

self_type& add_variant_input ( std::function< bool(Variant &)>  input_element_generator,
ContributionType  selection 
)
inline

Add an input to the parallel stream.

This version of the function takes the function to obtain elements from the underlying data iterator, same as VariantInputStream. See there and GenericInputStream for details.

Definition at line 622 of file variant_parallel_input_stream.hpp.

◆ add_variant_input_stream()

self_type& add_variant_input_stream ( VariantInputStream const &  input,
ContributionType  selection 
)
inline

Add an input to the parallel stream.

Definition at line 595 of file variant_parallel_input_stream.hpp.

◆ begin()

Iterator begin ( )
inline

Begin the iteration.

Use this to obtain an VariantParallelInputStream::Iterator that starts traversing the input sources.

Definition at line 568 of file variant_parallel_input_stream.hpp.

◆ end()

Iterator end ( )
inline

End marker for the iteration.

Definition at line 583 of file variant_parallel_input_stream.hpp.

◆ input_at() [1/2]

VariantInputStream& input_at ( size_t  index)
inline

Get access to an input stream that has been added to this parallel stream.

Definition at line 664 of file variant_parallel_input_stream.hpp.

◆ input_at() [2/2]

VariantInputStream const& input_at ( size_t  index) const
inline

Get access to an input stream that has been added to this parallel stream.

Definition at line 656 of file variant_parallel_input_stream.hpp.

◆ input_size()

size_t input_size ( ) const
inline

Return the number of input sourced added.

Definition at line 672 of file variant_parallel_input_stream.hpp.

◆ inputs() [1/2]

std::vector<VariantInputStream>& inputs ( )
inline

Get access to the input streams that have been added to this parallel stream.

This non-const version of the function can for exmple be used to bulk-add filters and transformations to the iterators, using their functions add_transform(), add_filter(), and add_transform_filter(); see utils::GenericInputStream for details.

Definition at line 648 of file variant_parallel_input_stream.hpp.

◆ inputs() [2/2]

std::vector<VariantInputStream> const& inputs ( ) const
inline

Get access to the input streams that have been added to this parallel stream.

Definition at line 633 of file variant_parallel_input_stream.hpp.

◆ operator=() [1/2]

self_type& operator= ( self_type &&  )
default

◆ operator=() [2/2]

self_type& operator= ( self_type const &  )
default

◆ sequence_dict() [1/2]

std::shared_ptr<genesis::sequence::SequenceDict> sequence_dict ( ) const
inline

Get the currently set sequence dictionary used for the chromosome sorting order.

Definition at line 773 of file variant_parallel_input_stream.hpp.

◆ sequence_dict() [2/2]

self_type& sequence_dict ( std::shared_ptr< genesis::sequence::SequenceDict value)
inline

Set a sequence dictionary to be used for the chromosome order.

By default, we assume chromosomes to be sorted in lexicographical order. This might not always be the case with input data. When setting a SequenceDict here, the order as given by that dictionary is used instead. Then, chromosomes in the underlying input sources have to be sorted with respect to that.

To un-set the dictionary, simply call this function with a nullptr.

See also
See also locus_compare() and the related locus comparison functions for more details on locus sorting order.

Definition at line 792 of file variant_parallel_input_stream.hpp.

Member Typedef Documentation

◆ self_type

◆ value_type

Definition at line 245 of file variant_parallel_input_stream.hpp.

Member Enumeration Documentation

◆ ContributionType

enum ContributionType
strong

Select which loci of an input are used.

We offer two ways an input can be traversed over: Either take all its loci (carrying), or only those which also appear in other inputs as well (following).

For the most part, the kCarrying type acts as a set union of the input loci; all loci of all sources that are added with that type get visited. The kFollowing type on the other hand does not contribute its unique loci (i.e., the ones that are private to itself / do not appear in any other input source), but also does not change or constrain the ones that are visited by the carrying inputs.

A notable case happens if all inputs are added as type kFollowing: In the absence of a carrying set of loci, only those loci are visited that are in all inputs; in other words, in this case, the kFollowing type acts as an intersection of loci.

NB: We do not call these two types "union" and "intersection", as we feel that this might be confusing. These terms describe operations on two or more sets, and are not properties of any single set. For example, a carrying input and a following input combined do neither yield the union nor the intersection of the two, but instead just all loci from the first one. Only if all inputs are of the same type, either carrying or following, does the result behave as the union or intersection of the loci, respectively.

This model does not allow more complex subset operations of loci, such as per-input intersections, complements, (symmetrical) differences, and exclusions. For these cases, one can use the add_carrying_locus() and add_carrying_loci() functions that allow a pre-defined set of loci to be iterated over; if then all actual data inputs are set to be following, only those pre-defined loci will be visited, making it possible to select an arbitrary set of loci for iteration.

Enumerator
kCarrying 

For a given input, stop at all its positions.

Other input sources that do not have data at these loci will then have the Optional be empty in the iterator at this locus.

kFollowing 

For a given input, only stop at positions where other inputs also want to stop.

In other words, this input does not contribute the loci that are unique to it to the traversal, but contributes its data only at the loci that are visited by others (or has an empty Optional Variant, if it does not have data at a visited Locus).

Definition at line 183 of file variant_parallel_input_stream.hpp.

Member Data Documentation

◆ Iterator

friend Iterator

Definition at line 556 of file variant_parallel_input_stream.hpp.


The documentation for this class was generated from the following file:
genesis::population::make_variant_input_stream_from_sync_file
VariantInputStream make_variant_input_stream_from_sync_file(std::string const &filename)
Create a VariantInputStream to iterate the contents of a PoPoolation2 sync file as Variants.
Definition: variant_input_stream_sources.cpp:381
genesis::population::make_variant_input_stream_from_pileup_file
VariantInputStream make_variant_input_stream_from_pileup_file(std::string const &filename, SimplePileupReader const &reader)
Create a VariantInputStream to iterate the contents of a (m)pileup file as Variants.
Definition: variant_input_stream_sources.cpp:302
genesis::population::VariantParallelInputStream::VariantParallelInputStream
VariantParallelInputStream()=default
genesis::population::VariantParallelInputStream::ContributionType::kFollowing
@ kFollowing
For a given input, only stop at positions where other inputs also want to stop.
genesis::population::VariantParallelInputStream::ContributionType::kCarrying
@ kCarrying
For a given input, stop at all its positions.