#include <genesis/population/stream/variant_parallel_input_stream.hpp>
Iterate multiple input sources that yield Variants in parallel.
This stream allows to traverse multiple sources of data in parallel, where each stop of the traversal is a locus in the input sources. Using ContributionType, one can select the contribution of loci of each input, that is, whether all its loci get used, or just the ones that also overlap with other input sources. See also add_carrying_locus() for other ways to specify the loci to iterate over.
At each visited locus, the stream yields the data of the underlying input sources as a vector of Optional Variants, with one Variant per input source. If a source does not have data at the current locus, the Optional is empty. Use the dereference operator*()
and operator->()
of the iterator or the access functions variants() and variant_at() to get the set of variants at the current locus() of the iteration, or use joined_variant() to get one Variant that has all sample SampleCounts joined into it.
Furthermore, using the inputs() and input_at() functions, which are also available from the iterator itself, one can access additional information about the underlying streams, such as the file name and sample names that are being read. This is particularly useful if input sources are added as in the example below, where we use functions such as make_variant_input_stream_from_pileup_file(), and the related make_variant_input_stream_from_...()
set of functions, to get access to the files, which encapsulate and hence would otherwise hide this information from us. See VariantInputStreamData for the data structure that is used to store these additional information, and see VariantInputStream for details on the underlying stream type.
Example:
See the VariantParallelInputStream::Iterator class for details on access to the data during traversal.
By default, we expect the chromosomes of the underlying input sources to be sorted lexicographically. However, this might not always be the case. In order to allow any (fixed) order, a SequenceDict can be provided via sequence_dict(). When provided, the chromosomes of the input sources are expected to follow the order as specificied by that dictionary. See also locus_compare() and related locus comparison functions for more details on locus sorting order.
Note also that we allow input sources to not contain data at all chromosomes. That is, as long as they are in the correct order (either lexicographical, or according to the sequence dictionary), input sources do not need to contain all chromosomes.
Definition at line 145 of file variant_parallel_input_stream.hpp.
Public Member Functions | |
VariantParallelInputStream ()=default | |
VariantParallelInputStream (self_type &&)=default | |
VariantParallelInputStream (self_type const &)=default | |
~VariantParallelInputStream ()=default | |
template<class ForwardIterator > | |
self_type & | add_carrying_loci (ForwardIterator first, ForwardIterator last) |
Add a set of GenomeLoci that are used as carrying loci in the iteration. More... | |
self_type & | add_carrying_loci (std::vector< GenomeLocus > const &loci) |
Add a set of GenomeLoci that are used as carrying loci in the iteration. More... | |
self_type & | add_carrying_locus (GenomeLocus const &locus) |
Add a set of GenomeLoci that are used as carrying loci in the iteration. More... | |
self_type & | add_variant_input (std::function< bool(Variant &)> input_element_generator, ContributionType selection) |
Add an input to the parallel stream. More... | |
self_type & | add_variant_input_stream (VariantInputStream const &input, ContributionType selection) |
Add an input to the parallel stream. More... | |
Iterator | begin () |
Begin the iteration. More... | |
Iterator | end () |
End marker for the iteration. More... | |
VariantInputStream & | input_at (size_t index) |
Get access to an input stream that has been added to this parallel stream. More... | |
VariantInputStream const & | input_at (size_t index) const |
Get access to an input stream that has been added to this parallel stream. More... | |
size_t | input_size () const |
Return the number of input sourced added. More... | |
std::vector< VariantInputStream > & | inputs () |
Get access to the input streams that have been added to this parallel stream. More... | |
std::vector< VariantInputStream > const & | inputs () const |
Get access to the input streams that have been added to this parallel stream. More... | |
self_type & | operator= (self_type &&)=default |
self_type & | operator= (self_type const &)=default |
std::shared_ptr< genesis::sequence::SequenceDict > | sequence_dict () const |
Get the currently set sequence dictionary used for the chromosome sorting order. More... | |
self_type & | sequence_dict (std::shared_ptr< genesis::sequence::SequenceDict > value) |
Set a sequence dictionary to be used for the chromosome order. More... | |
Public Types | |
enum | ContributionType { kCarrying, kFollowing } |
Select which loci of an input are used. More... | |
using | self_type = VariantParallelInputStream |
using | value_type = Variant |
Public Attributes | |
friend | Iterator |
Classes | |
class | Iterator |
Iterator over loci of the input sources. More... | |
struct | JoinedVariantParams |
Parameters for VariantParallelInputStream::Iterator::joined_variant() More... | |
|
default |
|
default |
|
default |
|
default |
|
inline |
Add a set of GenomeLoci that are used as carrying loci in the iteration.
Definition at line 744 of file variant_parallel_input_stream.hpp.
|
inline |
Add a set of GenomeLoci that are used as carrying loci in the iteration.
Definition at line 734 of file variant_parallel_input_stream.hpp.
|
inline |
Add a set of GenomeLoci that are used as carrying loci in the iteration.
This allows to iterate over a pre-defined set of loci. The iterator stops at each of these loci, independently of whether any of the underlying input sources have data at this locus. That means, it acts as an "empty" input that only contributes loci, as if it were added with ContributionType::kCarrying, but without any actual variants. Duplicate loci in these additional carrying loci are ignored.
Using this is particularly useful for more complex subset operations of loci, such as intersections, complements, (symmetrical) differences, and exclusions. These cases cannot be modelled with our simple ContributionType based approach; so instead, one can externally prepare the list of loci that need to be visited, and provide these to this function. In these cases, to use exactly the list of provided loci, all actual input sources can be added as ContributionType::kFollowing, to make sure that none of them adds additional loci to the traversal.
Note that in addition to the loci added via this function, all loci of input sources that are of ContributionType::kCarrying are also visited.
Definition at line 703 of file variant_parallel_input_stream.hpp.
|
inline |
Add an input to the parallel stream.
This version of the function takes the function to obtain elements from the underlying data iterator, same as VariantInputStream. See there and GenericInputStream for details.
Definition at line 622 of file variant_parallel_input_stream.hpp.
|
inline |
Add an input to the parallel stream.
Definition at line 595 of file variant_parallel_input_stream.hpp.
|
inline |
Begin the iteration.
Use this to obtain an VariantParallelInputStream::Iterator that starts traversing the input sources.
Definition at line 568 of file variant_parallel_input_stream.hpp.
|
inline |
End marker for the iteration.
Definition at line 583 of file variant_parallel_input_stream.hpp.
|
inline |
Get access to an input stream that has been added to this parallel stream.
Definition at line 664 of file variant_parallel_input_stream.hpp.
|
inline |
Get access to an input stream that has been added to this parallel stream.
Definition at line 656 of file variant_parallel_input_stream.hpp.
|
inline |
Return the number of input sourced added.
Definition at line 672 of file variant_parallel_input_stream.hpp.
|
inline |
Get access to the input streams that have been added to this parallel stream.
This non-const version of the function can for exmple be used to bulk-add filters and transformations to the iterators, using their functions add_transform(), add_filter(), and add_transform_filter(); see utils::GenericInputStream for details.
Definition at line 648 of file variant_parallel_input_stream.hpp.
|
inline |
Get access to the input streams that have been added to this parallel stream.
Definition at line 633 of file variant_parallel_input_stream.hpp.
|
inline |
Get the currently set sequence dictionary used for the chromosome sorting order.
Definition at line 773 of file variant_parallel_input_stream.hpp.
|
inline |
Set a sequence dictionary to be used for the chromosome order.
By default, we assume chromosomes to be sorted in lexicographical order. This might not always be the case with input data. When setting a SequenceDict here, the order as given by that dictionary is used instead. Then, chromosomes in the underlying input sources have to be sorted with respect to that.
To un-set the dictionary, simply call this function with a nullptr
.
Definition at line 792 of file variant_parallel_input_stream.hpp.
using self_type = VariantParallelInputStream |
Definition at line 244 of file variant_parallel_input_stream.hpp.
using value_type = Variant |
Definition at line 245 of file variant_parallel_input_stream.hpp.
|
strong |
Select which loci of an input are used.
We offer two ways an input can be traversed over: Either take all its loci (carrying), or only those which also appear in other inputs as well (following).
For the most part, the kCarrying
type acts as a set union of the input loci; all loci of all sources that are added with that type get visited. The kFollowing
type on the other hand does not contribute its unique loci (i.e., the ones that are private to itself / do not appear in any other input source), but also does not change or constrain the ones that are visited by the carrying inputs.
A notable case happens if all inputs are added as type kFollowing
: In the absence of a carrying set of loci, only those loci are visited that are in all inputs; in other words, in this case, the kFollowing
type acts as an intersection of loci.
NB: We do not call these two types "union" and "intersection", as we feel that this might be confusing. These terms describe operations on two or more sets, and are not properties of any single set. For example, a carrying input and a following input combined do neither yield the union nor the intersection of the two, but instead just all loci from the first one. Only if all inputs are of the same type, either carrying or following, does the result behave as the union or intersection of the loci, respectively.
This model does not allow more complex subset operations of loci, such as per-input intersections, complements, (symmetrical) differences, and exclusions. For these cases, one can use the add_carrying_locus() and add_carrying_loci() functions that allow a pre-defined set of loci to be iterated over; if then all actual data inputs are set to be following, only those pre-defined loci will be visited, making it possible to select an arbitrary set of loci for iteration.
Enumerator | |
---|---|
kCarrying | For a given input, stop at all its positions. Other input sources that do not have data at these loci will then have the Optional be empty in the iterator at this locus. |
kFollowing | For a given input, only stop at positions where other inputs also want to stop. In other words, this input does not contribute the loci that are unique to it to the traversal, but contributes its data only at the loci that are visited by others (or has an empty Optional Variant, if it does not have data at a visited Locus). |
Definition at line 183 of file variant_parallel_input_stream.hpp.
friend Iterator |
Definition at line 556 of file variant_parallel_input_stream.hpp.