#include <genesis/population/window/sliding_window_generator.hpp>
Generator for sliding Windows over the chromosomes of a genome.
Hence, as of now, with our now 1-based position counting, this class enqueues entries with an off-by-one error, so that entries at the borders of windows get assigned to the next window instead. This can be fixed by setting current_start_ = 1 in the initial conditions and in the clear() function. But we don't fix that for now, as we discovered that PoPoolation has that bug, and so we keep this class around as a convenient way of comparing our results to PoPoolation...
We'll keep the class around, but not add the features that it is still missing (e.g., it can only do interval windows). Use SlidingIntervalWindowIterator instead.
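The off-by-one effect described above can be illustrated with a small arithmetic sketch. This is not the class's actual code: `toy_window_index` and its parameters are hypothetical names, and the sketch assumes non-overlapping interval windows whose first window starts at `first_start`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of genesis: index of the non-overlapping
// interval window that `position` falls into, when the first window starts
// at `first_start` and every window spans `width` positions.
inline std::size_t toy_window_index(
    std::size_t position, std::size_t first_start, std::size_t width
) {
    assert(width > 0 && position >= first_start);
    return (position - first_start) / width;
}
```

With 1-based positions and a width of 100, position 100 is the last position of the first window. Under the assumptions of this sketch, `toy_window_index(100, 1, 100)` yields window 0, whereas a start of 0 moves the same position into window 1, which mirrors the border behavior described above.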
The class allows accumulating and computing arbitrary data within a sliding window over a genome. The basic setup is to provide a set of plugin functions that do the actual computation, and then feed the data in via the enqueue() functions. The SlidingWindowGenerator class then takes care of calling the respective plugin functions to compute values and emit results once a Window is finished.
To this end, the SlidingWindowGenerator takes care of collecting the data (whose type is given via the template parameter D/Data) in a list of Entry instances per Window. For each finished window, the on_emission plugin functions are called, which typically are set by the user code to compute and store/print/visualize a per-window summary of the Data. Use the add_emission_plugin() function to add such plugin functions.
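The plugin mechanism can be sketched in isolation as follows. This is a simplified stand-in, not the genesis implementation: `ToyWindow` and `ToyEmitter` are hypothetical types, and calling the plugins in registration order is an assumption of this sketch.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a finished window: just the collected values.
struct ToyWindow {
    std::vector<double> entries;
};

// Hypothetical emitter, not the genesis class: stores emission callbacks and
// runs every one of them on each finished window.
class ToyEmitter {
public:
    using on_emission = std::function<void(ToyWindow const&)>;

    // Returning *this allows chained registration calls.
    ToyEmitter& add_emission_plugin(on_emission const& plugin) {
        plugins_.push_back(plugin);
        return *this;
    }

    // Call all registered plugins, in the order in which they were added.
    void emit(ToyWindow const& window) const {
        for (auto const& plugin : plugins_) {
            plugin(window);
        }
    }

private:
    std::vector<on_emission> plugins_;
};
```

User code would typically register lambdas that capture their own result storage by reference, for example one lambda that counts the entries of each window and another that sums them.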
A typical use case for this class is a window over the variants that are present in a set of (pooled) individuals, for example, the records/lines of a VCF file. Each record would then form a Data Entry, and some summary of a window along the positions in the VCF file would be computed per Window. As those files can potentially contain multiple chromosomes, we also support that. In this case, the Window is "restarted" at the beginning of a new chromosome.
This, however, makes it necessary to finish each chromosome properly when sliding over intervals. This is because the Window cannot know how long a chromosome is from just the variants in the VCF file (there might just not be any variants at the end of a chromosome, but we still want our interval to cover these positions). Instead, we need the total chromosome length from somewhere else - typically this is provided in the VCF header. Use the convenience function run_vcf_window() to automatically take care of this - or see there for an example of how to do this in your own code. See also below in this description for some further details.
In some cases (in particular, if a stride is chosen that is less than the window size), it might be advantageous to not compute the summary per window from scratch each time, but instead hold a rolling record while sliding - that is, to add incrementally the values when they are enqueued, and to remove them once the window moves past their position in the genome. To this end, the second template parameter A/Accumulator can be used, which can store the necessary intermediate data. For example, to compute some average of values over a window, the Accumulator would need to store a sum of the values and a count of the number of values. In order to update the Accumulator for each Data Entry that is added or removed from the window while sliding, the plugin functions on_enqueue and on_dequeue need to be set accordingly via add_enqueue_plugin() and add_dequeue_plugin().
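The averaging example above can be sketched as a self-contained rolling mean. The types `MeanAccumulator` and `ToyRollingWindow` are hypothetical and only mimic the roles of the Accumulator and of the on_enqueue and on_dequeue plugins; they are not the genesis classes.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical accumulator for a rolling mean: a running sum and a count.
struct MeanAccumulator {
    double sum = 0.0;
    std::size_t count = 0;

    double mean() const {
        return count == 0 ? 0.0 : sum / static_cast<double>(count);
    }
};

// Hypothetical toy window over plain values, not the genesis Window class.
class ToyRollingWindow {
public:
    // Plays the role of an on_enqueue plugin: incorporate the new value.
    void enqueue(double value) {
        entries_.push_back(value);
        acc_.sum += value;
        ++acc_.count;
    }

    // Plays the role of an on_dequeue plugin: remove the oldest value again.
    void dequeue() {
        assert(!entries_.empty());
        acc_.sum -= entries_.front();
        --acc_.count;
        entries_.pop_front();
    }

    MeanAccumulator const& accumulator() const { return acc_; }

private:
    std::deque<double> entries_;
    MeanAccumulator acc_;
};
```

Each update costs constant time, so the per-window mean is available without revisiting the entries that stay in the window.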
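There are two types of sliding window that this class can be used for:
SlidingWindowType::kInterval: Windows that cover an interval of a fixed length in basepairs along the genome.
SlidingWindowType::kVariants: Windows that contain a fixed number of variants (SNPs or VCF records/lines).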
Both types are possible here, and have to be determined at construction, along with the width of the Window (either in number of basepairs or in number of variants).
Once all data has been processed, finish_chromosome() should be called to emit the last remaining window(s). See the following note for details. Furthermore, in some cases, it might be desirable to emit a window for an incomplete interval or an incomplete number of variants at the end of a chromosome, while in other cases, these incomplete last entries might need to be skipped. See emit_incomplete_windows() for details.
Note: The plugin functions are typically lambdas that might make use of other data from the calling code. However, as this SlidingWindowGenerator class works conceptually like a stream, where new data is enqueued from the outside by the user in some form of loop or iterative process, the class cannot know when the process is finished, that is, when the end of the genome is reached. Hence, either finish_chromosome() has to be called once all data has been processed, or it has to be otherwise ensured that the class instance is destructed before the other data that the plugin lambda functions depend on. This is because the destructor also calls finish_chromosome(), in order to ensure that the last intervals are processed properly. Hence, if any of the functions called from within the plugin functions use data outside of this instance, that data still has to be alive when the instance is destructed (unless finish_chromosome() was called explicitly before, in which case the destructor does not call it again). In other words, the instance has to be destroyed before that data.
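The lifetime requirement from the note can be sketched with a toy class that flushes in its destructor. `ToyFlusher` and `toy_lifetime_demo` are hypothetical names used only for illustration; the sketch relies on the C++ rule that local objects are destroyed in reverse order of their declaration.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy class, not genesis: it flushes pending work in its
// destructor unless finish() was already called explicitly.
class ToyFlusher {
public:
    explicit ToyFlusher(std::function<void()> on_finish)
        : on_finish_(std::move(on_finish)) {}

    ToyFlusher(ToyFlusher const&) = delete;
    ToyFlusher& operator=(ToyFlusher const&) = delete;

    ~ToyFlusher() { finish(); }

    void finish() {
        if (!finished_) {
            finished_ = true;
            on_finish_();
        }
    }

private:
    std::function<void()> on_finish_;
    bool finished_ = false;
};

// `results` is declared before `flusher`, so it is destroyed after it and is
// still alive when the destructor of `flusher` runs the callback.
inline int toy_lifetime_demo(bool finish_explicitly) {
    int flushes = 0;
    {
        std::vector<int> results;   // data that the callback depends on
        ToyFlusher flusher([&] {    // declared after the data it uses
            results.push_back(42);
            ++flushes;
        });
        if (finish_explicitly) {
            flusher.finish();       // the destructor will not flush again
        }
    }
    return flushes;
}
```

In both cases the callback runs exactly once. Declaring the flushing object before the data it uses would reverse the destruction order and leave the callback with dangling references.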
Definition at line 173 of file sliding_window_generator.hpp.
Public Member Functions | |
SlidingWindowGenerator (SlidingWindowGenerator &&)=default | |
SlidingWindowGenerator (SlidingWindowGenerator const &)=default | |
SlidingWindowGenerator (SlidingWindowType type, size_t width, size_t stride=0) | |
Construct a SlidingWindowGenerator, given the SlidingWindowType and width, and potentially stride. More... | |
~SlidingWindowGenerator () | |
Destruct the instance. More... | |
self_type & | add_chromosome_finish_plugin (on_chromosome_finish const &plugin) |
Add an on_chromosome_finish plugin function, typically a lambda. More... | |
self_type & | add_chromosome_start_plugin (on_chromosome_start const &plugin) |
Add an on_chromosome_start plugin function, typically a lambda. More... | |
self_type & | add_dequeue_plugin (on_dequeue const &plugin) |
Add an on_dequeue plugin function, typically a lambda. More... | |
self_type & | add_emission_plugin (on_emission const &plugin) |
Add an on_emission plugin function, typically a lambda. More... | |
self_type & | add_enqueue_plugin (on_enqueue const &plugin) |
Add an on_enqueue plugin function, typically a lambda. More... | |
std::string const & | chromosome () const |
Get the chromosome name that we are currently processing. More... | |
void | clear () |
Clear all data of the Window. More... | |
void | clear_plugins () |
Clear all plugin functions. More... | |
bool | emit_incomplete_windows () const |
Get whether the last (incomplete) window is also emitted. More... | |
void | emit_incomplete_windows (bool value) |
Set whether the last (incomplete) window is also emitted. More... | |
bool | empty () const |
Return whether the instance is empty. More... | |
void | enqueue (size_t position, Data &&data) |
Enqueue a new Data value by moving it, without considering its chromosome. More... | |
void | enqueue (size_t position, Data const &data) |
Enqueue a new Data value, without considering its chromosome. More... | |
void | enqueue (std::string const &chromosome, size_t position, Data &&data) |
Enqueue a new Data value, by moving it. More... | |
void | enqueue (std::string const &chromosome, size_t position, Data const &data) |
Enqueue a new Data value. More... | |
void | finish_chromosome (size_t last_position=0) |
Explicitly finish a chromosome, and emit all remaining Windows. More... | |
SlidingWindowGenerator & | operator= (SlidingWindowGenerator &&)=default |
SlidingWindowGenerator & | operator= (SlidingWindowGenerator const &)=default |
void | start_chromosome (std::string const &chromosome) |
Signal the start of a new chromosome, given its name. More... | |
size_t | stride () const |
Get the non-mutable stride of this SlidingWindowGenerator. More... | |
size_t | width () const |
Get the non-mutable width of this SlidingWindowGenerator. More... | |
SlidingWindowType | window_type () const |
Get the non-mutable SlidingWindowType of this SlidingWindowGenerator. More... | |
Public Types | |
using | Accumulator = A |
using | Data = D |
using | Entry = typename Window::Entry |
using | on_chromosome_finish = std::function< void(std::string const &chromosome, typename Window::Accumulator &accumulator)> |
Plugin functions that are called when finishing a chromosome. More... | |
using | on_chromosome_start = std::function< void(std::string const &chromosome, typename Window::Accumulator &accumulator)> |
Plugin functions that are called on the first enqueue() of a newly started chromosome. More... | |
using | on_dequeue = std::function< void(typename Window::Entry const &entry, typename Window::Accumulator &accumulator)> |
Plugin functions to update the Accumulator when Data is removed due to the window moving away from it. More... | |
using | on_emission = std::function< void(Window const &window)> |
Main plugin functions that are called for every window. More... | |
using | on_enqueue = std::function< void(typename Window::Entry const &entry, typename Window::Accumulator &accumulator)> |
Plugin functions to update the Accumulator when new Data is enqueued. More... | |
using | self_type = SlidingWindowGenerator< Data, Accumulator > |
using | Window = ::genesis::population::Window< D, A > |
SlidingWindowGenerator (SlidingWindowType type, size_t width, size_t stride=0) | inline
Construct a SlidingWindowGenerator, given the SlidingWindowType and width, and potentially stride.
The width has to be > 0, and the stride has to be <= width. If stride is not given (or set to 0), it is set automatically to the width, which means that we create windows that do not overlap.
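The constraints on width and stride can be sketched as a small validation helper. `toy_effective_stride` is a hypothetical function, and throwing `std::invalid_argument` on violated constraints is an assumption of this sketch, not a statement about how the constructor reports errors.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper, not the genesis constructor: validate width and stride,
// and resolve the default where a stride of 0 means "use the width".
inline std::size_t toy_effective_stride(std::size_t width, std::size_t stride = 0) {
    if (width == 0) {
        throw std::invalid_argument("width has to be > 0");
    }
    if (stride == 0) {
        return width; // non-overlapping windows
    }
    if (stride > width) {
        throw std::invalid_argument("stride has to be <= width");
    }
    return stride;
}
```

For example, a width of 100 with no stride given resolves to a stride of 100, while a width of 100 with a stride of 25 produces overlapping windows.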
Definition at line 289 of file sliding_window_generator.hpp.
~SlidingWindowGenerator () | inline
Destruct the instance.
This typically has to be called before other data storage instances on the user side go out of scope. See the SlidingWindowGenerator class description note for details on why that is the case.
Definition at line 312 of file sliding_window_generator.hpp.
self_type & add_chromosome_finish_plugin (on_chromosome_finish const &plugin) | inline
Add an on_chromosome_finish plugin function, typically a lambda.
Definition at line 455 of file sliding_window_generator.hpp.
self_type & add_chromosome_start_plugin (on_chromosome_start const &plugin) | inline
Add an on_chromosome_start plugin function, typically a lambda.
Definition at line 446 of file sliding_window_generator.hpp.
self_type & add_dequeue_plugin (on_dequeue const &plugin) | inline
Add an on_dequeue plugin function, typically a lambda.
Definition at line 473 of file sliding_window_generator.hpp.
self_type & add_emission_plugin (on_emission const &plugin) | inline
Add an on_emission plugin function, typically a lambda.
Definition at line 482 of file sliding_window_generator.hpp.
self_type & add_enqueue_plugin (on_enqueue const &plugin) | inline
Add an on_enqueue plugin function, typically a lambda.
Definition at line 464 of file sliding_window_generator.hpp.
std::string const & chromosome () const | inline
Get the chromosome name that we are currently processing.
Initially, this is empty. After enqueuing data, it contains the chromosome name of the last Data entry that was enqueued.
Definition at line 408 of file sliding_window_generator.hpp.
void clear () | inline
Clear all data of the Window.
This can be used to completely forget about the current chromosome, and start afresh. It just clears the data, while keeping all plugins and other settings as they are.
Definition at line 432 of file sliding_window_generator.hpp.
void clear_plugins () | inline
Clear all plugin functions.
Not sure why this would be needed. But doesn't hurt to have it.
Definition at line 493 of file sliding_window_generator.hpp.
bool emit_incomplete_windows () const | inline
Get whether the last (incomplete) window is also emitted.
See emit_incomplete_windows( bool ) for details.
Definition at line 365 of file sliding_window_generator.hpp.
void emit_incomplete_windows (bool value) | inline
Set whether the last (incomplete) window is also emitted.
For some computations that normalize by window width, this might be desirable, while in other cases where e.g. absolute per-window numbers are computed, it might not be. Hence, we offer this setting.
Definition at line 377 of file sliding_window_generator.hpp.
bool empty () const | inline
Return whether the instance is empty.
The Window and SlidingWindowGenerator are empty if no Data has been enqueued for the current chromosome yet.
Definition at line 421 of file sliding_window_generator.hpp.
void enqueue (size_t position, Data &&data) | inline
Enqueue a new Data value by moving it, without considering its chromosome.
See the non-moving overload of this function for details.
Definition at line 572 of file sliding_window_generator.hpp.
void enqueue (size_t position, Data const &data) | inline
Enqueue a new Data value, without considering its chromosome.
This alternative overload does not use the chromosome, and hence should only be used if we are sure that we are always on the same chromosome (or are not using chromosome information at all), and hence, that position always increases between calls of this function.
This is mostly meant as a simplification in cases where the data does not come with chromosome information. Typically however, when using VCF data, the CHROM column is present and should be used; that is, typically, the other overload of this function should be used.
Definition at line 562 of file sliding_window_generator.hpp.
void enqueue (std::string const &chromosome, size_t position, Data &&data) | inline
Enqueue a new Data value, by moving it.
See the non-move overload of this function for details.
Definition at line 545 of file sliding_window_generator.hpp.
void enqueue (std::string const &chromosome, size_t position, Data const &data) | inline
Enqueue a new Data value.
This is the main function to be called when processing data. It takes care of filling the Window, calling all necessary plugin functions, and in particular, calling the on_emission plugins once a Window is finished.
The function also takes the chromosome that this Data entry belongs to. This allows the class to automatically determine when a new chromosome starts, so that the positions and all other data (and potentially accumulators) can be reset accordingly.
However, we cannot determine when the last chromosome ends automatically. Hence, see also finish_chromosome() for details on wrapping up the input of a chromosome.
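The difference between detecting a chromosome change and detecting the end of the last chromosome can be sketched with a toy counter. `ToyChromosomeCounter` is a hypothetical class that only tracks entry counts per chromosome; it is not how the genesis class is implemented.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy, not the genesis class: counts entries per chromosome and
// emits a (name, count) summary whenever the chromosome changes.
class ToyChromosomeCounter {
public:
    void enqueue(std::string const& chromosome) {
        if (started_ && chromosome != current_) {
            finish_chromosome(); // a new name implies the previous one is done
        }
        current_ = chromosome;
        started_ = true;
        ++count_;
    }

    // Has to be called by the user after the last entry: the stream itself
    // gives no signal that the final chromosome has ended.
    void finish_chromosome() {
        if (started_) {
            emitted_.emplace_back(current_, count_);
        }
        started_ = false;
        count_ = 0;
    }

    std::vector<std::pair<std::string, std::size_t>> const& emitted() const {
        return emitted_;
    }

private:
    std::string current_;
    std::size_t count_ = 0;
    bool started_ = false;
    std::vector<std::pair<std::string, std::size_t>> emitted_;
};
```

After enqueuing two entries for one chromosome and one entry for a second chromosome, only the first summary has been emitted; the second one appears only once `finish_chromosome()` is called.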
Definition at line 534 of file sliding_window_generator.hpp.
void finish_chromosome (size_t last_position=0) | inline
Explicitly finish a chromosome, and emit all remaining Windows.
When sliding along a genome, we can typically use the provided chromosome name in enqueue() to determine the chromosome we are currently on (typically, the input for this is the CHROM information of a VCF file, or the first column of a pileup file), and switch to a new chromosome if needed. In that case, all remaining data in the last window needs to be emitted, so that it is not forgotten. Only after that, we can start a new window for the new chromosome.
However, we cannot automatically tell when the last chromosome of the genome is finished from within this class here (as there will simply be no more enqueue() calls, but how would we know that?!). Hence, there might be windows with data at the end that are not yet emitted. In order to also process their data, we need to explicitly call this function here.
It makes sure that the remaining data is processed. If provided with a last_position, all Windows up to that position are emitted (which is only relevant for interval windows) - that is, if the full genome length is known, there might be (potentially empty) windows at the end that do not contain any data, but which still need to be emitted for a thorough and complete output. In that case, call this function with the respective genome length, and it will take care of emitting all the windows.
Additionally, if emit_incomplete_windows() is set to true, the last window that contains the last_position is also emitted, which might be incomplete (it might be shorter than the window width). For some computations that normalize by window width, this might be desirable, while in other cases where e.g. absolute per-window numbers are computed, it might not be. Hence, we offer this setting.
NB: This function is also called from the destructor, to ensure that all data is processed properly. This also means that any calling code needs to make sure that all data that is needed for emitting window data is still available when the window is destructed without having called this function first. See the SlidingWindowGenerator class description for details.
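The effect of last_position and of the emit_incomplete_windows() setting on interval windows can be sketched by listing window boundaries. `toy_interval_windows` is a hypothetical helper; it assumes 1-based inclusive positions with the first window starting at position 1, and at most one truncated window at the end, which may differ from the exact behavior of the class.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not genesis code: list the 1-based, inclusive
// [first, last] intervals of all windows up to `last_position`.
inline std::vector<std::pair<std::size_t, std::size_t>> toy_interval_windows(
    std::size_t last_position, std::size_t width, std::size_t stride,
    bool emit_incomplete
) {
    assert(width > 0 && stride > 0 && stride <= width);
    std::vector<std::pair<std::size_t, std::size_t>> windows;
    for (std::size_t first = 1; first <= last_position; first += stride) {
        std::size_t const last = first + width - 1;
        if (last <= last_position) {
            windows.emplace_back(first, last); // complete window
        } else {
            if (emit_incomplete) {
                windows.emplace_back(first, last_position); // truncated window
            }
            break; // assumption: at most one incomplete window at the end
        }
    }
    return windows;
}
```

For a chromosome of length 250 with width and stride 100, the sketch lists the windows [1, 100] and [101, 200], plus the truncated window [201, 250] if incomplete windows are requested. Windows are listed whether or not any data falls into them, which corresponds to the potentially empty windows mentioned above.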
Definition at line 611 of file sliding_window_generator.hpp.
void start_chromosome (std::string const &chromosome) | inline
Signal the start of a new chromosome, given its name.
This function typically does not need to be called manually, and is mostly here for symmetry reasons. See finish_chromosome() for details.
Definition at line 512 of file sliding_window_generator.hpp.
size_t stride () const | inline
Get the non-mutable stride of this SlidingWindowGenerator.
With SlidingWindowType::kInterval, this is the shift towards the next interval, determining how the first and last position in each Window change. With SlidingWindowType::kVariants instead, this is the number of variants (SNPs or VCF records/lines) per Window that we dequeue and enqueue.
Definition at line 355 of file sliding_window_generator.hpp.
size_t width () const | inline
Get the non-mutable width of this SlidingWindowGenerator.
With SlidingWindowType::kInterval, this is the length of the interval, determining the first and last position in each Window. With SlidingWindowType::kVariants instead, this is the number of variants (SNPs or VCF records/lines) per Window.
Definition at line 342 of file sliding_window_generator.hpp.
SlidingWindowType window_type () const | inline
Get the non-mutable SlidingWindowType of this SlidingWindowGenerator.
Definition at line 330 of file sliding_window_generator.hpp.
using Accumulator = A |
Definition at line 182 of file sliding_window_generator.hpp.
using Data = D |
Definition at line 181 of file sliding_window_generator.hpp.
using Entry = typename Window::Entry |
Definition at line 184 of file sliding_window_generator.hpp.
using on_chromosome_finish = std::function<void( std::string const& chromosome, typename Window::Accumulator& accumulator )> |
Plugin functions that are called when finishing a chromosome.
Use add_chromosome_finish_plugin() to add plugin functions.
The purpose of this plugin is to allow cleaning up the accumulator as needed. The function is called when enqueue() is called with a new chromosome name, or when calling finish_chromosome() explicitly, or on destruction of the whole class.
Definition at line 218 of file sliding_window_generator.hpp.
using on_chromosome_start = std::function<void( std::string const& chromosome, typename Window::Accumulator& accumulator )> |
Plugin functions that are called on the first enqueue() of a newly started chromosome.
Use add_chromosome_start_plugin() to add plugin functions.
The purpose of this plugin is to allow preparing the Window accumulator as needed. Note that it is not immediately called when calling start_chromosome() (which is mostly a decorative function anyway), but instead is called once actual data is added via enqueue(). This is because calling start_chromosome() is optional. Furthermore, we would then also need to call this plugin function from the constructor, but during construction, we do not know the chromosome name yet - hence, the calling is delayed until actual data is seen.
Definition at line 204 of file sliding_window_generator.hpp.
using on_dequeue = std::function<void( typename Window::Entry const& entry, typename Window::Accumulator& accumulator )> |
Plugin functions to update the Accumulator when Data is removed due to the window moving away from it.
Use add_dequeue_plugin() to add plugin functions.
This is the counterpart for on_enqueue to remove data from the Accumulator once it is no longer part of the current window. See on_enqueue for details.
Definition at line 252 of file sliding_window_generator.hpp.
using on_emission = std::function<void( Window const& window )> |
Main plugin functions that are called for every window.
Use add_emission_plugin() to add plugin functions.
This is the plugin that typically is the most important to set for the user. This is where the data from the Window is processed, that is, where the summary of the window is computed and stored/visualized/plotted. This can either be done by using the Accumulator, or by computing the value based on all the Data Entry values in the window, or a mixture thereof.
Definition at line 267 of file sliding_window_generator.hpp.
using on_enqueue = std::function<void( typename Window::Entry const& entry, typename Window::Accumulator& accumulator )> |
Plugin functions to update the Accumulator when new Data is enqueued.
Use add_enqueue_plugin() to add plugin functions.
The purpose of this plugin function is to avoid re-computation of values in a Window if that computation can be done incrementally instead. This is of course mostly helpful if the stride is significantly smaller than the window width. Otherwise (if the stride is equal to the window width, or only a bit smaller), each window will contain more new/different data than re-used data, so incrementally processing data when it is enqueued AND dequeued again might actually be more computationally expensive.
If used, this function is meant to update the Accumulator in a way that adds/incorporates the new data Entry.
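As a rough way to judge whether incremental updates pay off, one can look at the share of positions that consecutive interval windows have in common. `toy_overlap_fraction` is a hypothetical helper for this back-of-the-envelope estimate; it ignores how many entries actually fall into the shared region.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: for interval windows of `width` positions moved by
// `stride`, the share of positions that two consecutive windows have in common.
inline double toy_overlap_fraction(std::size_t width, std::size_t stride) {
    assert(width > 0 && stride > 0 && stride <= width);
    return static_cast<double>(width - stride) / static_cast<double>(width);
}
```

With a width of 100 and a stride of 25, three quarters of each window are re-used from the previous one, so incremental updates are likely worthwhile. With a stride equal to the width, nothing is re-used, and the enqueue and dequeue bookkeeping only adds work.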
Definition at line 238 of file sliding_window_generator.hpp.
using self_type = SlidingWindowGenerator<Data, Accumulator> |
Definition at line 186 of file sliding_window_generator.hpp.
using Window = ::genesis::population::Window<D, A> |
Definition at line 183 of file sliding_window_generator.hpp.