A library for working with phylogenetic data.
v0.25.0
SlidingWindowGenerator< D, A > Class Template Reference

#include <genesis/population/window/sliding_window_generator.hpp>

Detailed Description

template<class D, class A = EmptyAccumulator>
class genesis::population::SlidingWindowGenerator< D, A >

Generator for sliding Windows over the chromosomes of a genome.

The class accumulates and computes arbitrary data within a sliding window over a genome. The basic setup is to provide a set of plugin functions that do the actual computation, and then feed the data in via the enqueue() functions. The SlidingWindowGenerator class then takes care of calling the respective plugin functions to compute values and emit results once a Window is finished.

To this end, the SlidingWindowGenerator takes care of collecting the data (whose type is given via the template parameter D/Data) in a list of Entry instances per Window. For each finished window, the on_emission plugin functions are called, which typically are set by the user code to compute and store/print/visualize a per-window summary of the Data. Use the add_emission_plugin() function to add such plugin functions.
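The plugin-and-enqueue pattern described above can be illustrated with a minimal, self-contained sketch. It does not use the genesis API: ToyEntry, ToyWindowGenerator, and toy_window_sums() are hypothetical names, the windows are non-overlapping intervals only, and positions are assumed to be 1-based and non-decreasing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy type, not part of genesis: one enqueued value with its position.
struct ToyEntry {
    std::size_t position;
    int data;
};

// Hypothetical toy generator for non-overlapping interval windows of fixed width.
// Assumption: positions are 1-based and never decrease between enqueue() calls.
class ToyWindowGenerator {
public:
    using OnEmission = std::function<void(
        std::size_t first, std::size_t last, std::vector<ToyEntry> const& entries
    )>;

    explicit ToyWindowGenerator(std::size_t width) : width_(width) {
        assert(width > 0);
    }

    void add_emission_plugin(OnEmission plugin) {
        plugins_.push_back(std::move(plugin));
    }

    void enqueue(std::size_t position, int data) {
        assert(position >= first_);
        // Emit every window that ends before the new position.
        while (position >= first_ + width_) {
            emit();
        }
        entries_.push_back({position, data});
    }

    // Emit the window that is currently being filled.
    void finish() { emit(); }

private:
    void emit() {
        for (auto const& plugin : plugins_) {
            plugin(first_, first_ + width_ - 1, entries_);
        }
        entries_.clear();
        first_ += width_;
    }

    std::size_t width_;
    std::size_t first_ = 1;
    std::vector<ToyEntry> entries_;
    std::vector<OnEmission> plugins_;
};

// Usage sketch: store the sum of the data of each window of width 10.
inline std::vector<int> toy_window_sums() {
    std::vector<int> sums;
    ToyWindowGenerator gen(10);
    gen.add_emission_plugin(
        [&sums](std::size_t, std::size_t, std::vector<ToyEntry> const& entries) {
            int sum = 0;
            for (auto const& entry : entries) { sum += entry.data; }
            sums.push_back(sum);
        }
    );
    gen.enqueue(2, 1);
    gen.enqueue(7, 2);
    gen.enqueue(25, 5);
    gen.finish();
    return sums;
}
```

In this sketch, the window covering positions 11 to 20 receives no data but is still emitted, which mirrors the statement that a fixed interval may contain no variants at all.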

A typical use case for this class is a window over the variants that are present in a set of (pooled) individuals, for example, the records/lines of a VCF file. Each record would then form a Data Entry, and some summary of a window along the positions in the VCF file would be computed per Window. As those files can potentially contain multiple chromosomes, we also support that. In this case, the Window is "restarted" at the beginning of a new chromosome.

This, however, makes it necessary to finish each chromosome properly when sliding over intervals. The Window cannot know how long a chromosome is from just the variants in the VCF file: there might not be any variants at the end of a chromosome, but we still want our interval to cover these positions. Instead, we need the total chromosome length from somewhere else; typically, it is provided in the VCF header. Use the convenience function run_vcf_window() to take care of this automatically, or see there for an example of how to do this in your own code. See also below in this description for some further details.

In some cases (in particular, if a stride is chosen that is less than the window size), it might be advantageous to not compute the summary per window from scratch each time, but instead to hold a rolling record while sliding. That is, values are added incrementally when they are enqueued, and removed once the window moves past their position in the genome. To this end, the second template parameter A/Accumulator can be used, which can store the necessary intermediate data. For example, to compute some average of values over a window, the Accumulator would need to store a sum of the values and a count of the number of values. In order to update the Accumulator for each Data Entry that is added to or removed from the window while sliding, the plugin functions on_enqueue and on_dequeue need to be set accordingly via add_enqueue_plugin() and add_dequeue_plugin().
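The rolling average mentioned above can be sketched independently of the library. The following is only an illustration under stated assumptions: ToyValue, MeanAccumulator, accumulate(), unaccumulate(), and rolling_means() are hypothetical names, the first window is assumed to start at position 1, and only windows that fit completely before last_position are reported.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical toy types, not part of genesis.
struct ToyValue {
    std::size_t position; // assumed to be 1-based
    double value;
};

struct MeanAccumulator {
    double sum = 0.0;
    std::size_t count = 0;
};

// These two functions play the roles of the on_enqueue and on_dequeue plugins.
inline void accumulate(ToyValue const& v, MeanAccumulator& acc) {
    acc.sum += v.value;
    ++acc.count;
}

inline void unaccumulate(ToyValue const& v, MeanAccumulator& acc) {
    acc.sum -= v.value;
    --acc.count;
}

// Mean per interval window [first, first + width - 1], for first = 1, 1 + stride, ...
// Values have to be sorted by position. Empty windows report a mean of 0.
inline std::vector<double> rolling_means(
    std::vector<ToyValue> const& values,
    std::size_t width, std::size_t stride, std::size_t last_position
) {
    assert(width > 0 && stride > 0 && stride <= width);
    std::vector<double> means;
    std::deque<ToyValue> window;
    MeanAccumulator acc;
    std::size_t next = 0;

    for (std::size_t first = 1; first + width - 1 <= last_position; first += stride) {
        std::size_t const last = first + width - 1;

        // Remove values that the window has moved past.
        while (!window.empty() && window.front().position < first) {
            unaccumulate(window.front(), acc);
            window.pop_front();
        }
        // Add values that are newly covered by the window.
        while (next < values.size() && values[next].position <= last) {
            window.push_back(values[next]);
            accumulate(values[next], acc);
            ++next;
        }
        means.push_back(
            acc.count > 0 ? acc.sum / static_cast<double>(acc.count) : 0.0
        );
    }
    return means;
}
```

Each value is touched exactly twice here (once when entering and once when leaving the window), no matter how many overlapping windows contain it.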

There are two types of sliding window that this class can be used for:

  1. For windows of a fixed size along the genome, that is, an interval of a certain number of basepairs/nucleotides. There may be a varying number of variants (Data Entries) in each such fixed interval (or none at all).
  2. For a fixed number of variants/polymorphisms. Some statistics are not computed over a fixed size window, but instead for n consecutive variants that can span an interval of varying size along the genome.

Both types are possible here, and have to be determined at construction, along with the width of the Window (either in number of basepairs or in number of variants).
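To make the second window type more concrete, here is a small sketch of how windows over a fixed number of variants span intervals of varying length. It is not the library's implementation; variant_window_spans() is a hypothetical helper that only looks at variant positions.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not part of genesis: for windows of `width` consecutive
// variants, shifted by `stride` variants, return the (first, last) genome
// positions that each complete window spans.
inline std::vector<std::pair<std::size_t, std::size_t>> variant_window_spans(
    std::vector<std::size_t> const& positions, std::size_t width, std::size_t stride
) {
    assert(width > 0 && stride > 0 && stride <= width);
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    for (std::size_t i = 0; i + width <= positions.size(); i += stride) {
        spans.emplace_back(positions[i], positions[i + width - 1]);
    }
    return spans;
}
```

For variant positions 5, 9, 40, 41, 100, 250 with a width and stride of 3 variants, this yields the spans (5, 40) and (41, 250): both windows contain three variants, but cover 36 and 210 basepairs, respectively.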

Once all data has been processed, finish_chromosome() should be called to emit the last remaining window(s). See the following note for details. Furthermore, in some cases, it might be desirable to emit a window for an incomplete interval or an incomplete number of variants at the end of a chromosome, while in other cases, these incomplete last entries might need to be skipped. See emit_incomplete_windows() for details.

Note: The plugin functions are typically lambdas that might make use of other data from the calling code. However, as this SlidingWindowGenerator class works conceptually similar to a stream, where new data is enqueued in some form of loop or iterative process from the outside by the user, the class cannot know when the process is finished, that is, when the end of the genome is reached. Hence, either finish_chromosome() has to be called once all data has been processed, or it has to be otherwise ensured that the class instance is destructed before the other data that the plugin lambda functions depend on. This is because the destructor also calls finish_chromosome(), in order to ensure that the last intervals are processed properly. Hence, if any of the functions called from within the plugin functions use data outside of this instance, that data still has to be alive when the instance is destroyed (unless finish_chromosome() was called explicitly before, in which case the destructor does not call it again). In other words, the instance has to be destroyed before that data; for local variables, this means declaring the instance after the data that its plugins use.
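The lifetime requirement in this note can be demonstrated with a small stand-in class. This is a sketch only: FlushingGenerator and the two demo functions are hypothetical and merely mimic the described behavior that the destructor flushes unless finish was already called explicitly.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a generator: finish() flushes pending output once,
// and the destructor calls finish() unless that was already done explicitly.
class FlushingGenerator {
public:
    explicit FlushingGenerator(std::function<void()> on_flush)
        : on_flush_(std::move(on_flush)) {}

    FlushingGenerator(FlushingGenerator const&) = delete;
    FlushingGenerator& operator=(FlushingGenerator const&) = delete;

    ~FlushingGenerator() { finish(); }

    void finish() {
        if (!finished_) {
            finished_ = true;
            if (on_flush_) { on_flush_(); }
        }
    }

private:
    std::function<void()> on_flush_;
    bool finished_ = false;
};

// The sink is declared before the generator, so it is destroyed after it:
// the flush triggered by the destructor still finds a live vector.
inline std::vector<int> flush_via_destructor() {
    std::vector<int> sink;
    {
        FlushingGenerator gen([&sink]() { sink.push_back(42); });
    } // gen is destroyed here and flushes into sink
    return sink;
}

// Calling finish() explicitly means that the destructor does not flush again.
inline std::vector<int> flush_explicitly() {
    std::vector<int> sink;
    {
        FlushingGenerator gen([&sink]() { sink.push_back(42); });
        gen.finish();
    }
    return sink;
}
```

Both functions end up with exactly one element in the sink. Declaring the sink after the generator in the same scope would instead leave the lambda with a dangling reference at destruction time.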

Definition at line 124 of file sliding_window_generator.hpp.

Public Member Functions

 SlidingWindowGenerator (SlidingWindowGenerator &&)=default

 SlidingWindowGenerator (SlidingWindowGenerator const &)=default

 SlidingWindowGenerator (WindowType type, size_t width, size_t stride=0)
 Construct a SlidingWindowGenerator, given the WindowType and width, and potentially stride.

 ~SlidingWindowGenerator ()
 Destruct the instance.

self_type & add_chromosome_finish_plugin (on_chromosome_finish const &plugin)
 Add an on_chromosome_finish plugin function, typically a lambda.

self_type & add_chromosome_start_plugin (on_chromosome_start const &plugin)
 Add an on_chromosome_start plugin function, typically a lambda.

self_type & add_dequeue_plugin (on_dequeue const &plugin)
 Add an on_dequeue plugin function, typically a lambda.

self_type & add_emission_plugin (on_emission const &plugin)
 Add an on_emission plugin function, typically a lambda.

self_type & add_enqueue_plugin (on_enqueue const &plugin)
 Add an on_enqueue plugin function, typically a lambda.

WindowAnchorType anchor_type () const
 Get the WindowAnchorType that we use for the emitted Windows.

void anchor_type (WindowAnchorType value)
 Set the WindowAnchorType that we use for the emitted Windows.

std::string const & chromosome () const
 Get the chromosome name that we are currently processing.

void clear ()
 Clear all data of the Window.

void clear_plugins ()
 Clear all plugin functions.

bool emit_incomplete_windows () const
 Get whether the last (incomplete) window is also emitted.

void emit_incomplete_windows (bool value)
 Set whether the last (incomplete) window is also emitted.

bool empty () const
 Return whether the instance is empty.

void enqueue (size_t position, Data &&data)
 Enqueue a new Data value by moving it, without considering its chromosome.

void enqueue (size_t position, Data const &data)
 Enqueue a new Data value, without considering its chromosome.

void enqueue (std::string const &chromosome, size_t position, Data &&data)
 Enqueue a new Data value, by moving it.

void enqueue (std::string const &chromosome, size_t position, Data const &data)
 Enqueue a new Data value.

void finish_chromosome (size_t last_position=0)
 Explicitly finish a chromosome, and emit all remaining Windows.

SlidingWindowGenerator & operator= (SlidingWindowGenerator &&)=default

SlidingWindowGenerator & operator= (SlidingWindowGenerator const &)=default

void start_chromosome (std::string const &chromosome)
 Signal the start of a new chromosome, given its name.

size_t stride () const
 Get the non-mutable stride of this SlidingWindowGenerator.

size_t width () const
 Get the non-mutable width of this SlidingWindowGenerator.

WindowType window_type () const
 Get the non-mutable WindowType of this SlidingWindowGenerator.

Public Types

using Accumulator = A
 
using Data = D
 
using Entry = typename Window::Entry
 
using on_chromosome_finish = std::function< void(std::string const &chromosome, typename Window::Accumulator &accumulator)>
 Plugin functions that are called when finishing a chromosome.

using on_chromosome_start = std::function< void(std::string const &chromosome, typename Window::Accumulator &accumulator)>
 Plugin functions that are called on the first enqueue() of a newly started chromosome.

using on_dequeue = std::function< void(typename Window::Entry const &entry, typename Window::Accumulator &accumulator)>
 Plugin functions to update the Accumulator when Data is removed due to the window moving away from it.

using on_emission = std::function< void(Window const &window)>
 Main plugin functions that are called for every window.

using on_enqueue = std::function< void(typename Window::Entry const &entry, typename Window::Accumulator &accumulator)>
 Plugin functions to update the Accumulator when new Data is enqueued.
 
using self_type = SlidingWindowGenerator< Data, Accumulator >
 
using Window = ::genesis::population::Window< D, A >
 

Constructor & Destructor Documentation

◆ SlidingWindowGenerator() [1/3]

SlidingWindowGenerator ( WindowType  type,
size_t  width,
size_t  stride = 0 
)
inline

Construct a SlidingWindowGenerator, given the WindowType and width, and potentially stride.

The width has to be > 0, and the stride has to be <= width. If stride is not given (or set to 0), it is set automatically to the width, which means, we create windows that do not overlap.
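The rule for the stride can be written down as a small helper. This is a sketch only: effective_stride() is a hypothetical function, and signaling invalid arguments via std::invalid_argument is an assumption, as the text above does not state how the class reacts to them.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper, not part of genesis: resolve the stride that is
// actually used, following the rules stated for the constructor.
// Assumption: invalid arguments are reported by throwing std::invalid_argument.
inline std::size_t effective_stride(std::size_t width, std::size_t stride = 0) {
    if (width == 0) {
        throw std::invalid_argument("width has to be > 0");
    }
    if (stride == 0) {
        // No stride given: windows follow each other without overlap.
        stride = width;
    }
    if (stride > width) {
        throw std::invalid_argument("stride has to be <= width");
    }
    return stride;
}
```

For example, a width of 100 without a stride gives a stride of 100 (adjacent, non-overlapping windows), while a width of 100 with a stride of 25 gives windows that each overlap the next one by 75 positions.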

Definition at line 240 of file sliding_window_generator.hpp.

◆ ~SlidingWindowGenerator()

Destruct the instance.

This typically has to be called before other data storage instances on the user side go out of scope. See the SlidingWindowGenerator class description note for details on why that is the case.

Definition at line 263 of file sliding_window_generator.hpp.

◆ SlidingWindowGenerator() [2/3]

SlidingWindowGenerator ( SlidingWindowGenerator< D, A > const &  )
default

◆ SlidingWindowGenerator() [3/3]

SlidingWindowGenerator ( SlidingWindowGenerator< D, A > &&  )
default

Member Function Documentation

◆ add_chromosome_finish_plugin()

self_type& add_chromosome_finish_plugin ( on_chromosome_finish const &  plugin)
inline

Add an on_chromosome_finish plugin function, typically a lambda.

Definition at line 405 of file sliding_window_generator.hpp.

◆ add_chromosome_start_plugin()

self_type& add_chromosome_start_plugin ( on_chromosome_start const &  plugin)
inline

Add an on_chromosome_start plugin function, typically a lambda.

Definition at line 396 of file sliding_window_generator.hpp.

◆ add_dequeue_plugin()

self_type& add_dequeue_plugin ( on_dequeue const &  plugin)
inline

Add an on_dequeue plugin function, typically a lambda.

Definition at line 423 of file sliding_window_generator.hpp.

◆ add_emission_plugin()

self_type& add_emission_plugin ( on_emission const &  plugin)
inline

Add an on_emission plugin function, typically a lambda.

Definition at line 432 of file sliding_window_generator.hpp.

◆ add_enqueue_plugin()

self_type& add_enqueue_plugin ( on_enqueue const &  plugin)
inline

Add an on_enqueue plugin function, typically a lambda.

Definition at line 414 of file sliding_window_generator.hpp.

◆ anchor_type() [1/2]

WindowAnchorType anchor_type ( ) const
inline

Get the WindowAnchorType that we use for the emitted Windows.

Definition at line 335 of file sliding_window_generator.hpp.

◆ anchor_type() [2/2]

void anchor_type ( WindowAnchorType  value)
inline

Set the WindowAnchorType that we use for the emitted Windows.

Definition at line 343 of file sliding_window_generator.hpp.

◆ chromosome()

std::string const& chromosome ( ) const
inline

Get the chromosome name that we are currently processing.

Initially, this is empty. After enqueuing data, it contains the chromosome name of the last Data entry that was enqueued.

Definition at line 358 of file sliding_window_generator.hpp.

◆ clear()

void clear ( )
inline

Clear all data of the Window.

This can be used to completely forget about the current chromosome, and start afresh. It just clears the data, while keeping all plugins and other settings as they are.

Definition at line 382 of file sliding_window_generator.hpp.

◆ clear_plugins()

void clear_plugins ( )
inline

Clear all plugin functions.

Not sure why this would be needed. But doesn't hurt to have it.

Definition at line 443 of file sliding_window_generator.hpp.

◆ emit_incomplete_windows() [1/2]

bool emit_incomplete_windows ( ) const
inline

Get whether the last (incomplete) window is also emitted.

See emit_incomplete_windows( bool ) for details.

Definition at line 315 of file sliding_window_generator.hpp.

◆ emit_incomplete_windows() [2/2]

void emit_incomplete_windows ( bool  value)
inline

Set whether the last (incomplete) window is also emitted.

For some computations that normalize by window width, this might be desirable, while in other cases where e.g. absolute per-window numbers are computed, it might not be. Hence, we offer this setting.

Definition at line 327 of file sliding_window_generator.hpp.

◆ empty()

bool empty ( ) const
inline

Return whether the instance is empty.

The Window and SlidingWindowGenerator are empty if no Data has been enqueued for the current chromosome yet.

Definition at line 371 of file sliding_window_generator.hpp.

◆ enqueue() [1/4]

void enqueue ( size_t  position,
Data &&  data 
)
inline

Enqueue a new Data value by moving it, without considering its chromosome.

See the non-moving overload of this function for details.

Definition at line 522 of file sliding_window_generator.hpp.

◆ enqueue() [2/4]

void enqueue ( size_t  position,
Data const &  data 
)
inline

Enqueue a new Data value, without considering its chromosome.

This alternative overload does not use the chromosome, and hence should only be used if we are sure that we are always on the same chromosome (or are not using chromosome information at all), and hence, that position always increases between calls of this function.

This is mostly meant as a simplification in cases where the data does not come with chromosome information. Typically however, when using VCF data, the CHROM column is present and should be used; that is, typically, the other overload of this function should be used.

Definition at line 512 of file sliding_window_generator.hpp.

◆ enqueue() [3/4]

void enqueue ( std::string const &  chromosome,
size_t  position,
Data &&  data 
)
inline

Enqueue a new Data value, by moving it.

See the non-move overload of this function for details.

Definition at line 495 of file sliding_window_generator.hpp.

◆ enqueue() [4/4]

void enqueue ( std::string const &  chromosome,
size_t  position,
Data const &  data 
)
inline

Enqueue a new Data value.

This is the main function to be called when processing data. It takes care of filling the Window, calling all necessary plugin functions, and in particular, calling the on_emission plugins once a Window is finished.

The function also takes the chromosome that this Data entry belongs to. This makes it possible to determine automatically when a new chromosome starts, so that the positions and all other data (and potentially accumulators) can be reset accordingly.

However, we cannot determine when the last chromosome ends automatically. Hence, see also finish_chromosome() for details on wrapping up the input of a chromosome.
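The chromosome handling described here can be sketched as a small state machine that records which events an enqueue triggers. This is an illustration, not the library's code: ChromosomeTracker and tracker_demo() are hypothetical, and the event strings only stand in for the calls to the respective plugin functions.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch, not part of genesis: detect chromosome changes from the
// names given to enqueue(), and log the events that such a change triggers.
class ChromosomeTracker {
public:
    void enqueue(std::string const& chromosome) {
        // A different name means that the previous chromosome is done.
        if (started_ && chromosome != current_) {
            finish_chromosome();
        }
        // The start event is delayed until the first data of a chromosome is seen.
        if (!started_) {
            current_ = chromosome;
            started_ = true;
            events_.push_back("start:" + current_);
        }
        events_.push_back("enqueue:" + chromosome);
    }

    // The end of the last chromosome cannot be detected from enqueue() calls,
    // so it has to be signaled explicitly.
    void finish_chromosome() {
        if (started_) {
            events_.push_back("finish:" + current_);
            started_ = false;
        }
    }

    std::vector<std::string> const& events() const { return events_; }

private:
    std::string current_;
    bool started_ = false;
    std::vector<std::string> events_;
};

// Usage sketch: two records on chr1, one on chr2, then an explicit finish.
inline std::vector<std::string> tracker_demo() {
    ChromosomeTracker tracker;
    tracker.enqueue("chr1");
    tracker.enqueue("chr1");
    tracker.enqueue("chr2");
    tracker.finish_chromosome();
    return tracker.events();
}
```

Without the final finish_chromosome() call, the log would end after the enqueue on chr2, and the last chromosome would never be wrapped up.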

Definition at line 484 of file sliding_window_generator.hpp.

◆ finish_chromosome()

void finish_chromosome ( size_t  last_position = 0)
inline

Explicitly finish a chromosome, and emit all remaining Windows.

When sliding along a genome, we can typically use the provided chromosome name in enqueue() to determine the chromosome we are currently on (typically, the input for this is the CHROM information of a VCF file, or the first column of a pileup file), and switch to a new chromosome if needed. In that case, all remaining data in the last window needs to be emitted, so that it is not forgotten. Only after that, we can start a new window for the new chromosome.

However, we cannot automatically tell when the last chromosome of the genome is finished from within this class here (as there will simply be no more enqueue() calls, but how would we know that?!). Hence, there might be windows with data at the end that are not yet emitted. In order to also process their data, we need to explicitly call this function here.

It makes sure that the remaining data is processed. If provided with a last_position, all Windows up to that position are emitted (which is only relevant for interval windows) - that is, if the full genome length is known, there might be (potentially empty) windows at the end that do not contain any data, but which still need to be emitted for a thorough and complete output. In that case, call this function with the respective genome length, and it will take care of emitting all the windows.

Additionally, if emit_incomplete_windows() is set to true, the last window that contains the last_position is also emitted, which might be incomplete (it might be shorter than the window width). For some computations that normalize by window width, this might be desirable, while in other cases where e.g. absolute per-window numbers are computed, it might not be. Hence, we offer this setting.
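The effect of last_position and of the incomplete-window setting can be sketched for the simple case of non-overlapping interval windows. The names ToyInterval and intervals_until() are hypothetical, positions are assumed to be 1-based, and reporting the incomplete window as ending at last_position is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy type, not part of genesis: closed interval of positions.
struct ToyInterval {
    std::size_t first;
    std::size_t last;
};

// List the non-overlapping interval windows of the given width that cover
// positions 1 to last_position. A trailing window that is shorter than
// width is only included if emit_incomplete is set.
inline std::vector<ToyInterval> intervals_until(
    std::size_t width, std::size_t last_position, bool emit_incomplete
) {
    assert(width > 0);
    std::vector<ToyInterval> result;
    std::size_t first = 1;
    while (first + width - 1 <= last_position) {
        result.push_back({first, first + width - 1});
        first += width;
    }
    if (emit_incomplete && first <= last_position) {
        // Assumption: the incomplete window is truncated at last_position.
        result.push_back({first, last_position});
    }
    return result;
}
```

With a width of 100 and a last_position of 250, this gives the windows 1 to 100 and 101 to 200, plus the incomplete window 201 to 250 if the setting is enabled. A per-window count normalized by window length can use the shorter window, whereas an absolute per-window count would not be comparable to those of the full windows.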

NB: This function is also called from the destructor, to ensure that all data is processed properly. This also means that any calling code needs to make sure that all data that is needed for emitting window data is still available when the window is destructed without having called this function first. See the SlidingWindowGenerator class description for details.

Definition at line 561 of file sliding_window_generator.hpp.

◆ operator=() [1/2]

SlidingWindowGenerator& operator= ( SlidingWindowGenerator< D, A > &&  )
default

◆ operator=() [2/2]

SlidingWindowGenerator& operator= ( SlidingWindowGenerator< D, A > const &  )
default

◆ start_chromosome()

void start_chromosome ( std::string const &  chromosome)
inline

Signal the start of a new chromosome, given its name.

This function typically does not need to be called manually, and is mostly here for symmetry. See finish_chromosome() for details.

Definition at line 462 of file sliding_window_generator.hpp.

◆ stride()

size_t stride ( ) const
inline

Get the non-mutable stride of this SlidingWindowGenerator.

With WindowType::kInterval, this is the shift towards the next interval, determining how the first and last position in each Window change. With WindowType::kVariants instead, this is the number of variants (SNPs or VCF records/lines) per Window that we dequeue and enqueue.
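For interval windows, the interplay of width and stride determines how many windows cover a given position. The following sketch computes this under the assumption that window k (counted from 0) covers the positions from 1 + k * stride to k * stride + width; windows_containing() is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of genesis: indices of all interval windows
// that contain the given 1-based position, assuming that window k covers
// the positions [1 + k * stride, k * stride + width].
inline std::vector<std::size_t> windows_containing(
    std::size_t position, std::size_t width, std::size_t stride
) {
    assert(position >= 1 && width > 0 && stride > 0 && stride <= width);
    // Last window that starts at or before the position.
    std::size_t const k_max = (position - 1) / stride;
    // First window that ends at or after the position (rounded-up division).
    std::size_t const k_min =
        position > width ? (position - width + stride - 1) / stride : 0;

    std::vector<std::size_t> result;
    for (std::size_t k = k_min; k <= k_max; ++k) {
        result.push_back(k);
    }
    return result;
}
```

With a width of 100 and a stride of 50, position 120 falls into windows 1 (positions 51 to 150) and 2 (positions 101 to 200). With a stride equal to the width, every position falls into exactly one window.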

Definition at line 305 of file sliding_window_generator.hpp.

◆ width()

size_t width ( ) const
inline

Get the non-mutable width of this SlidingWindowGenerator.

With WindowType::kInterval, this is the length of the interval, determining the first and last position in each Window. With WindowType::kVariants instead, this is the number of variants (SNPs or VCF records/lines) per Window.

Definition at line 293 of file sliding_window_generator.hpp.

◆ window_type()

WindowType window_type ( ) const
inline

Get the non-mutable WindowType of this SlidingWindowGenerator.

Definition at line 281 of file sliding_window_generator.hpp.

Member Typedef Documentation

◆ Accumulator

using Accumulator = A

Definition at line 133 of file sliding_window_generator.hpp.

◆ Data

using Data = D

Definition at line 132 of file sliding_window_generator.hpp.

◆ Entry

using Entry = typename Window::Entry

Definition at line 135 of file sliding_window_generator.hpp.

◆ on_chromosome_finish

using on_chromosome_finish = std::function<void( std::string const& chromosome, typename Window::Accumulator& accumulator )>

Plugin functions that are called when finishing a chromosome.

Use add_chromosome_finish_plugin() to add plugin functions.

The purpose of this plugin is to allow cleaning up the accumulator as needed. The function is called when enqueue() is called with a new chromosome name, when finish_chromosome() is called explicitly, or on destruction of the instance.

Definition at line 169 of file sliding_window_generator.hpp.

◆ on_chromosome_start

using on_chromosome_start = std::function<void( std::string const& chromosome, typename Window::Accumulator& accumulator )>

Plugin functions that are called on the first enqueue() of a newly started chromosome.

Use add_chromosome_start_plugin() to add plugin functions.

The purpose of this plugin is to allow preparing the Window accumulator as needed. Note that it is not called immediately when calling start_chromosome() (which is mostly a decorative function anyway), but instead once actual data is added via enqueue(). This is because calling start_chromosome() is optional. Furthermore, we would otherwise also need to call this plugin function from the constructor, but during construction, we do not know the chromosome name yet. Hence, the call is delayed until actual data is seen.

Definition at line 155 of file sliding_window_generator.hpp.

◆ on_dequeue

using on_dequeue = std::function<void( typename Window::Entry const& entry, typename Window::Accumulator& accumulator )>

Plugin functions to update the Accumulator when Data is removed due to the window moving away from it.

Use add_dequeue_plugin() to add plugin functions.

This is the counterpart for on_enqueue to remove data from the Accumulator once it is no longer part of the current window. See on_enqueue for details.

Definition at line 203 of file sliding_window_generator.hpp.

◆ on_emission

using on_emission = std::function<void( Window const& window )>

Main plugin functions that are called for every window.

Use add_emission_plugin() to add plugin functions.

This is the plugin that typically is the most important to set for the user. This is where the data from the Window is processed, that is, where the summary of the window is computed and stored/visualized/plotted. This can either be done by using the Accumulator, or by computing the value based on all the Data Entry values in the window, or a mixture thereof.

Definition at line 218 of file sliding_window_generator.hpp.

◆ on_enqueue

using on_enqueue = std::function<void( typename Window::Entry const& entry, typename Window::Accumulator& accumulator )>

Plugin functions to update the Accumulator when new Data is enqueued.

Use add_enqueue_plugin() to add plugin functions.

The purpose of this plugin function is to avoid re-computation of values in a Window if that computation can be done incrementally instead. This is of course mostly helpful if the stride is significantly smaller than the window width. Otherwise (if the stride is equal to the window width, or only a bit smaller), each window will contain more new/different data than re-used data, so incrementally processing data when it is enqueued AND dequeued again might actually be more computationally expensive.
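A rough cost model makes this trade-off tangible. The sketch below is not derived from the library: it assumes a uniform density of entries along the genome, and that reading an entry during re-computation costs the same as one accumulator update. UpdateCost and per_window_cost() are hypothetical names.

```cpp
#include <cstddef>

// Hypothetical cost model, not part of genesis. Costs are counted in
// "entries touched" per emitted window.
struct UpdateCost {
    double recompute;   // visit every entry of the window once
    double incremental; // one enqueue and one dequeue update per new entry
};

// Assumption: entries are spread uniformly, with entries_per_position of them
// at each position on average, so a window holds about density * width
// entries, of which about density * stride are new compared to the last one.
inline UpdateCost per_window_cost(
    double entries_per_position, std::size_t width, std::size_t stride
) {
    return {
        entries_per_position * static_cast<double>(width),
        2.0 * entries_per_position * static_cast<double>(stride)
    };
}
```

Under these assumptions, the incremental approach only pays off if the stride is less than half the width: with a width of 1000 and a stride of 100, it touches about 200 entries per window instead of 1000, while with a stride of 1000 it touches about 2000 instead of 1000.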

If used, this function is meant to update the Accumulator in a way that adds/incorporates the new data Entry.

Definition at line 189 of file sliding_window_generator.hpp.

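◆ self_type

using self_type = SlidingWindowGenerator< Data, Accumulator >

◆ Window

using Window = ::genesis::population::Window< D, A >

Definition at line 134 of file sliding_window_generator.hpp.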


The documentation for this class was generated from the following file: