Compute Theta Pi, Theta Watterson, and Tajia's D in their pool-sequencing corrected versions according to Kofler et al.
This is an efficient high level helper that is meant to compute these statistics on input iterator ranges. See theta_pi_pool(), theta_watterson_pool(), and tajima_d_pool() for details on the functions it computes.
The provided DiversityPoolSettings take care of most options offered by PoPoolation. In particular, we want to set the min_count
, as well as the min_read_depth
and max_read_depth
. These read depths are called "coverage" in PoPoolation, which seems wrong.
We do expect here that the input samples that are provided to the process() function are already filtered (with the appropriate filter status set for the Variant and the SampleCounts) and transformed as needed. For example, typically, we want to use a SampleCountsFilter with settings that match the DiversityPoolSettings:
filter.min_count = settings.min_count;
filter.min_read_depth = settings.min_read_depth;
filter.max_read_depth = settings.max_read_depth;
filter.only_snps = true;
That is, the settings for the pool statistics should match the settings used for filtering the samples. The function filter_sample_counts() can be used to transform and filter the input coming from a file, in order to filter out base counts and samples that do not match these filters.
There are multiple ways that this filtering can be applied. Typically for example, we want to process a VariantInputStream, which allows us to use input from a variety of input file formats, all converted into Variants at each position in the genome. This internally is a genesis::utils::GenericInputStream, which offers to add add_transform_filter() functions for this purpose. The make_sample_counts_filter_numerical_tagging() is a convenience function that creates such a filter/transform function given a SampleCountsFilter settings instance.
Alternaively, genesis::utils::make_filter_range() can be used to achieve the same effect, but requiring a bit more manual "wiring" of the components first. This however has the advantage that SampleCountsFilterStats can be provided, e.g., per window of the analysis, to capture the number of sites that pass read depth filters etc. These numbers can then be used for get_theta_pi_relative() and get_theta_watterson_relative(), respectively. Otherwise (when instead filtering directly in the VariantInputStream), these numbers are lost, and instead the relative values would need to be computed, e.g., using the full window sizes, instead of taking only sufficiently covered positions into account for the normalization.
With either way of filtering, for all SNP positions of interest, call process() to compute the values for theta pi and theta watterson of this sample. The values are internally accumualted.
Once all samples have been processed, the getter functions get_theta_pi_absolute(), get_theta_pi_relative(), get_theta_watterson_absolute(), and get_theta_watterson_relative() can be used to obtain Theta Pi and Theta Watterson directly. For Tajima's D, more computation is needed, which is why the according function is called compute_tajima_d().
See
R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
for details on the equations. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
Definition at line 122 of file diversity_pool_calculator.hpp.