Classes  
struct  AfsPileupRecord 
Helper to store the data of one pileup line/record needed for the Boitard et al Allele Frequency Estimation computation. More...  
class  AlleleFrequencyWindow 
struct  BaseCounts 
One set of nucleotide base counts, for example for a given sample that represents a pool of sequenced individuals. More...  
struct  BaseCountsStatus 
struct  EmptyAccumulator 
Empty helper data struct to serve as a dummy for Window. More...  
struct  FstAN 
Helper struct for the a_1 , a_2 , n_1 , and n_2 values needed for f_st_asymptotically_unbiased(). More...  
class  GenomeHeatmap 
struct  GenomeRegion 
A region (between two positions) on a chromosome. More...  
class  GenomeRegionList 
A sorted list of genome regions. More...  
class  GffReader 
Reader for GFF (General Feature Format) and GTF (General Transfer Format) files. More...  
class  HeatmapColorization 
class  HeatmapMatrix 
Matrix to capture and accumulate columns of perposition or perwindow values along a chromosome. More...  
class  HtsFile 
Wrap an ::htsFile struct. More...  
struct  PoolDiversityResults 
Data struct to collect all diversity statistics computed by pool_diversity_measures(). More...  
struct  PoolDiversitySettings 
Settings used by different poolsequencing corrected diversity statistics. More...  
class  SimplePileupInputIterator 
Iterate an input source and parse it as a (m)pileup file. More...  
class  SimplePileupReader 
Reader for linebyline assessment of (m)pileup files. More...  
class  SlidingWindowGenerator 
Generator for sliding Windows over the chromosomes of a genome. More...  
class  SlidingWindowIterator 
Iterator for sliding Windows over the chromosomes of a genome. More...  
struct  SlidingWindowIteratorSettings 
Settings for running a sliding window iteration. More...  
class  SyncInputIterator 
Iterate an input source and parse it as a sync file. More...  
class  SyncReader 
Reader for PoPoolation2's "synchronized" files. More...  
struct  Variant 
A single variant at a position in a chromosome, along with BaseCounts for a set of samples. More...  
class  VcfFormatHelper 
Provide htslib helper functions. More...  
class  VcfFormatIterator 
Iterate the FORMAT information for the samples in a SNP/variant line in a VCF/BCF file. More...  
class  VcfGenotype 
Simple wrapper class for one genotype field for a sample. More...  
class  VcfHeader 
Capture the information from a header of a VCF/BCF file. More...  
class  VcfInputIterator 
Iterate an input source and parse it as a VCF/BCF file. More...  
class  VcfParallelInputIterator 
Iterate a set of input VCF/BCF file in parallel, that is, synchronize the iteration to be at the same positions in the genome at all times. More...  
class  VcfRecord 
Capture the information of a single SNP/variant line in a VCF/BCF file. More...  
struct  VcfSpecification 
Collect the four required keys that describe an INFO or FORMAT subfield of VCF/BCF files. More...  
class  Window 
Window over the chromosomes of a genome. More...  
Functions  
double  a_n (size_t n) 
Compute a_n , the sum of reciprocals. More...  
double  alpha_star (double n) 
Compute alpha* according to Achaz 2008 and Kofler et al. 2011. More...  
double  amnm_ (size_t poolsize, size_t nucleotide_count, size_t allele_frequency) 
Local helper function to compute values for the denominator. More...  
double  b_n (size_t n) 
Compute b_n , the sum of squared reciprocals. More...  
double  beta_star (double n) 
Compute beta* according to Achaz 2008 and Kofler et al. 2011. More...  
std::pair< char, double >  consensus (BaseCounts const &sample) 
Consensus character for a BaseCounts, and its confidence. More...  
std::pair< char, double >  consensus (BaseCounts const &sample, BaseCountsStatus const &status) 
Consensus character for a BaseCounts, and its confidence. More...  
AfsPileupRecord  convert_to_afs_pileup_record (SimplePileupReader::Record const &record) 
BaseCounts  convert_to_base_counts (SimplePileupReader::Sample const &sample, unsigned char min_phred_score) 
Variant  convert_to_variant (SimplePileupReader::Record const &record, unsigned char min_phred_score) 
Variant  convert_to_variant (VcfRecord const &record) 
template<class ForwardIterator >  
utils::Matrix< double >  f_st_asymptotically_unbiased (ForwardIterator begin, ForwardIterator end) 
Compute the asymptotically unbiased F_ST estimator of Karlsson et al, for all pairs of ranges of BaseCountss. More...  
template<class ForwardIterator1 , class ForwardIterator2 >  
double  f_st_asymptotically_unbiased (ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end) 
Compute the asymptotically unbiased F_ST estimator of Karlsson et al. More...  
FstAN  f_st_asymptotically_unbiased_a1n1a2n2 (BaseCounts const &p1, BaseCounts const &p2) 
Compute the a and n values needed for the asymptotically unbiased F_ST estimator of Karlsson et al. More...  
std::pair< double, double >  f_st_asymptotically_unbiased_nkdk (FstAN const &fstan) 
Compute the N_k and D_k values needed for the asymptotically unbiased F_ST estimator of Karlsson et al. More...  
template<class ForwardIterator >  
utils::Matrix< double >  f_st_conventional_pool (size_t number_of_samples, size_t poolsizes, ForwardIterator begin, ForwardIterator end) 
Compute the conventional F_ST statistic for poolsequenced data, following Kofler et al, for all pairs of ranges of BaseCountss. More...  
template<class ForwardIterator1 , class ForwardIterator2 >  
double  f_st_conventional_pool (size_t p1_poolsize, size_t p2_poolsize, ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end) 
Compute the conventional F_ST statistic for poolsequenced data, following Kofler et al, for two ranges of BaseCountss. More...  
template<class ForwardIterator >  
utils::Matrix< double >  f_st_conventional_pool (std::vector< size_t > const &poolsizes, ForwardIterator begin, ForwardIterator end) 
Compute the conventional F_ST statistic for poolsequenced data, following Kofler et al, for all pairs of ranges of BaseCountss. More...  
std::tuple< double, double, double >  f_st_conventional_pool_pi_snp (BaseCounts const &p1, BaseCounts const &p2) 
Compute the SNPbased Theta Pi values used in f_st_conventional_pool() More...  
double  f_star (double a_n, double n) 
Compute f* according to Achaz 2008 and Kofler et al. 2011. More...  
void  filter_min_count (BaseCounts &sample, size_t min_count) 
Filter by minimum count that we need for a type of nucleotide (A , C , G , T ) to be considered; set to zero if min_count is not reached. More...  
size_t  get_base_count (BaseCounts const &bc, char base) 
Get the count for a base given as a char. More...  
double  heterozygosity (BaseCounts const &sample) 
Compute classic heterozygosity. More...  
bool  is_covered (GenomeRegion const ®ion, std::string const &chromosome, size_t position) 
Test whether the chromosome/position is within a given genomic region . More...  
template<class T >  
bool  is_covered (GenomeRegion const ®ion, T const &variant) 
Test whether the chromosome/position of a variant is within a given genomic region . More...  
bool  is_covered (GenomeRegion const ®ion, VcfRecord const &variant) 
template<class ForwardIterator , class InputType , class DataType = InputType>  
SlidingWindowIterator< ForwardIterator, InputType, DataType >  make_sliding_window_iterator (SlidingWindowIteratorSettings< InputType, DataType > const &settings, ForwardIterator begin, ForwardIterator end) 
Helper function to instanciate a SlidingWindowIterator without the need to specify all template parameters manually. More...  
BaseCounts  merge (BaseCounts const &p1, BaseCounts const &p2) 
Merge the counts of two BaseCountss. More...  
BaseCounts  merge (std::vector< BaseCounts > const &p) 
Merge the counts of a vector BaseCountss. More...  
void  merge_inplace (BaseCounts &p1, BaseCounts const &p2) 
Merge the counts of two BaseCountss, by adding the counts of the second (p2 ) to the first (p1 ). More...  
double  n_base (size_t coverage, size_t poolsize) 
Compute the n_base term used for Tajima's D in Kofler et al. 2011, using a faster closed form expression. More...  
double  n_base_matrix (size_t coverage, size_t poolsize) 
Compute the n_base term used for Tajima's D in Kofler et al. 2011, following their approach. More...  
size_t  nucleotide_sum (BaseCounts const &sample) 
Count of the pure nucleotide bases at this position, that is, the sum of all A , C , G , and T . More...  
std::ostream &  operator<< (std::ostream &os, BaseCounts const &bs) 
Output stream operator for BaseCounts instances. More...  
std::ostream &  operator<< (std::ostream &os, GenomeRegion const ®ion) 
bool  operator== (FstAN const &f1, FstAN const &f2) 
Comparison operator equals for FstAN structs. More...  
bool  operator== (GenomeRegion const &a, GenomeRegion const &b) 
GenomeRegion  parse_genome_region (std::string const ®ion) 
Parse a genomic region. More...  
GenomeRegionList  parse_genome_regions (std::string const ®ions) 
Parse a set/list of genomic regions. More...  
genesis::utils::Matrix< double >  pij_matrix_ (size_t max_coverage, size_t poolsize) 
genesis::utils::Matrix< double > const &  pij_matrix_resolver_ (size_t max_coverage, size_t poolsize) 
template<class ForwardIterator >  
PoolDiversityResults  pool_diversity_measures (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end) 
Compute Theta Pi, Theta Watterson, and Tajia's D in their poolsequencing corrected versions according to Kofler et al. More...  
std::vector< double >  prob_cond_true_freq (size_t n, std::vector< bool > const &alleles, std::vector< unsigned char > const &phred_scores, bool unfolded) 
std::vector< double >  prob_cond_true_freq_unfolded (size_t n, std::vector< bool > const &alleles, std::vector< unsigned char > const &phred_scores, bool invert_alleles) 
template<class ForwardIterator >  
void  process_conditional_probability (ForwardIterator begin, ForwardIterator end) 
Compute the conditional probabilities of AFs. This reimplements process_probCond from Boitard et al. More...  
template<class Data , class Accumulator = EmptyAccumulator>  
void  run_vcf_window (SlidingWindowGenerator< Data, Accumulator > &generator, std::string const &vcf_file, std::function< Data(VcfRecord const &)> conversion, std::function< bool(VcfRecord const &)> condition={}) 
Convenience function to iterate over a whole VCF file. More...  
std::array< std::pair< char, size_t >, 4 >  sorted_variant_counts (Variant const &variant, bool reference_first) 
BaseCountsStatus  status (BaseCounts const &sample, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false) 
Compute a simple status with useful properties from the counts of a BaseCounts. More...  
template<class ForwardIterator >  
double  tajima_d_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end) 
Compute the poolsequencing corrected version of Tajima's D according to Kofler et al. More...  
template<class ForwardIterator >  
double  tajima_d_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end, double theta_pi, double theta_watterson) 
Compute the poolsequencing corrected version of Tajima's D according to Kofler et al. More...  
double  tajima_d_pool_denominator (PoolDiversitySettings const &settings, size_t snp_count, double theta) 
Compute the denominator for the poolsequencing correction of Tajima's D according to Kofler et al. More...  
template<class ForwardIterator >  
double  theta_pi (ForwardIterator begin, ForwardIterator end) 
Compute classic theta pi, that is, the sum of heterozygosities. More...  
double  theta_pi_pool (PoolDiversitySettings const &settings, BaseCounts const &sample) 
Compute theta pi with poolsequencing correction according to Kofler et al, for a single BaseCounts, that is, its heterozygosity() divided by the correction denominator. More...  
template<class ForwardIterator >  
double  theta_pi_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end) 
Compute theta pi with poolsequencing correction according to Kofler et al, that is, the sum of heterozygosities divided by the correction denominator. More...  
double  theta_pi_pool_denominator (PoolDiversitySettings const &settings, size_t nucleotide_count) 
Compute the denominator for the poolsequencing correction of theta pi according to Kofler et al. More...  
template<class ForwardIterator >  
double  theta_watterson_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end) 
Compute theta watterson with poolsequencing correction according to Kofler et al. More...  
double  theta_watterson_pool_denominator (PoolDiversitySettings const &settings, size_t nucleotide_count) 
Compute the denominator for the poolsequencing correction of theta watterson according to Kofler et al. More...  
std::string  to_string (GenomeRegion const ®ion) 
std::ostream &  to_sync (BaseCounts const &bs, std::ostream &os) 
Output a BaseCounts instance to a stream in the PoPoolation2 sync format. More...  
std::ostream &  to_sync (Variant const &var, std::ostream &os) 
Output a Variant instance to a stream in the PoPoolation2 sync format. More...  
BaseCounts  total_base_counts (Variant const &variant) 
std::string  vcf_genotype_string (std::vector< VcfGenotype > const &genotypes) 
Return the VCFlike string representation of a set of VcfGenotype entries. More...  
size_t  vcf_genotype_sum (std::vector< VcfGenotype > const &genotypes) 
Return the sum of genotypes for a set of VcfGenotype entries, typically used to construct a genotype matrix with entries 0,1,2. More...  
std::string  vcf_hl_type_to_string (int hl_type) 
Internal helper function to convert htslibinternal BCF_HL_* header line type values to their string representation as used in the VCF header ("FILTER", "INFO", "FORMAT", etc). More...  
std::string  vcf_value_special_to_string (int vl_type_num) 
std::string  vcf_value_special_to_string (VcfValueSpecial vl_type_num) 
std::string  vcf_value_type_to_string (int ht_type) 
std::string  vcf_value_type_to_string (VcfValueType ht_type) 
Enumerations  
enum  VcfHeaderLine : int { kFilter = 0, kInfo = 1, kFormat = 2, kContig = 3, kStructured = 4, kGeneric = 5 } 
Specification for the values determining header line types of VCF/BCF files. More...  
enum  VcfValueSpecial : int { kFixed = 0, kVariable = 1, kAllele = 2, kGenotype = 3, kReference = 4 } 
Specification for special markers for the number of values expected for keyvaluepairs of VCF/BCF files. More...  
enum  VcfValueType : int { kFlag = 0, kInteger = 1, kFloat = 2, kString = 3 } 
Specification for the data type of the values expected in keyvaluepairs of VCF/BCF files. More...  
enum  WindowAnchorType { kIntervalBegin, kIntervalEnd, kIntervalMidpoint, kVariantFirst, kVariantLast, kVariantMedian, kVariantMean, kVariantMidpoint } 
Position in the genome that is used for reporting when emitting or using a window. More...  
enum  WindowType { kInterval, kVariants } 
WindowType of a Window, that is, whether we slide along a fixed size interval of the genome, or along a fixed number of variants. More...  
Typedefs  
using  VcfFormatIteratorFloat = VcfFormatIterator< float, double > 
using  VcfFormatIteratorGenotype = VcfFormatIterator< int32_t, VcfGenotype > 
using  VcfFormatIteratorInt = VcfFormatIterator< int32_t, int32_t > 
using  VcfFormatIteratorString = VcfFormatIterator< char *, std::string > 
double a_n  (  size_t  n  ) 
Compute a_n
, the sum of reciprocals.
This is the sum of reciprocals up to n1
: \( a_n = \sum_{i=1}^{n1} \frac{1}{i} \).
See Equation 3.6 in
Hahn, M. W. (2018). Molecular Population Genetics. https://global.oup.com/academic/product/molecularpopulationgenetics9780878939657
for details.
Definition at line 215 of file diversity.cpp.
double alpha_star  (  double  n  ) 
Compute alpha*
according to Achaz 2008 and Kofler et al. 2011.
This is needed for the computation of tajima_d_pool() according to
R. Kofler, P. OrozcoterWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. SchlĂ¶tterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
The equation is based on
G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409â€“1424. https://doi.org/10.1534/genetics.107.082198
See there for details.
Definition at line 247 of file diversity.cpp.
double genesis::population::amnm_  (  size_t  poolsize, 
size_t  nucleotide_count,  
size_t  allele_frequency  
) 
Local helper function to compute values for the denominator.
Definition at line 57 of file diversity.cpp.
double b_n  (  size_t  n  ) 
Compute b_n
, the sum of squared reciprocals.
This is the sum of squared reciprocals up to n1
: \( b_n = \sum_{i=1}^{n1} \frac{1}{i^2} \).
See
R. Kofler, P. OrozcoterWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. SchlĂ¶tterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
Definition at line 228 of file diversity.cpp.
double beta_star  (  double  n  ) 
Compute beta*
according to Achaz 2008 and Kofler et al. 2011.
This is needed for the computation of tajima_d_pool() according to
R. Kofler, P. OrozcoterWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. SchlĂ¶tterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
The equation is based on
G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409â€“1424. https://doi.org/10.1534/genetics.107.082198
See there for details.
Definition at line 275 of file diversity.cpp.
std::pair< char, double > consensus  (  BaseCounts const &  sample  ) 
Consensus character for a BaseCounts, and its confidence.
This is simply the character (out of ACGT
) that appears most often (or, for ties, the lexicographically smallest character), unless all of (A
, C
, G
, T
) are zero, in which case the consensus character is N
. The confidence is the count of the consensus character, divided by the total count of all four nucleotides.
Definition at line 191 of file base_counts.cpp.
std::pair< char, double > consensus  (  BaseCounts const &  sample, 
BaseCountsStatus const &  status  
) 
Consensus character for a BaseCounts, and its confidence.
This is simply the character (out of ACGT
) that appears most often (or, for ties, the lexicographically smallest character). If the BaseCounts is not well covered by reads (that is, if its BaseCountsStatus::is_covered is false
), the consensus character is N
. The confidence is the count of the consensus character, divided by the total count of all four nucleotides.
Definition at line 232 of file base_counts.cpp.
AfsPileupRecord convert_to_afs_pileup_record  (  SimplePileupReader::Record const &  record  ) 
Definition at line 48 of file afs_estimate.cpp.
BaseCounts convert_to_base_counts  (  SimplePileupReader::Sample const &  sample, 
unsigned char  min_phred_score  
) 
Definition at line 245 of file base_counts.cpp.
Variant convert_to_variant  (  SimplePileupReader::Record const &  record, 
unsigned char  min_phred_score  
) 
Definition at line 158 of file variant.cpp.
Definition at line 193 of file variant.cpp.
utils::Matrix<double> genesis::population::f_st_asymptotically_unbiased  (  ForwardIterator  begin, 
ForwardIterator  end  
) 
Compute the asymptotically unbiased F_ST estimator of Karlsson et al, for all pairs of ranges of BaseCountss.
The function is intended to be used for computing pairwise F_ST for a set of BaseCountss along some region (e.g., a genomic Window).
This function expects a range (for example, a Window over the genome) of iterators, where each iterator dereferences to a std::vector<BaseCounts>
. Each entry in the range is used as one position in the genome contributing to F_ST. For all entries, the vector
needs to have the same number of entries.
Then, for each pair (i,j)
of pool samples, the range is iterated, and the respective entries i
and j
of the vectors in the range are used to compute F_ST for this pair of samples, and stored in the resulting matrix at positions (i,j)
and (j,i)
.
Definition at line 328 of file structure.hpp.
double genesis::population::f_st_asymptotically_unbiased  (  ForwardIterator1  p1_begin, 
ForwardIterator1  p1_end,  
ForwardIterator2  p2_begin,  
ForwardIterator2  p2_end  
) 
Compute the asymptotically unbiased F_ST estimator of Karlsson et al.
This follows the implementation in PoPoolation2 by Kofler et al.
Definition at line 277 of file structure.hpp.
FstAN f_st_asymptotically_unbiased_a1n1a2n2  (  BaseCounts const &  p1, 
BaseCounts const &  p2  
) 
Compute the a
and n
values needed for the asymptotically unbiased F_ST estimator of Karlsson et al.
See f_st_asymptotically_unbiased() for details.
Definition at line 102 of file structure.cpp.
std::pair< double, double > f_st_asymptotically_unbiased_nkdk  (  FstAN const &  fstan  ) 
Compute the N_k
and D_k
values needed for the asymptotically unbiased F_ST estimator of Karlsson et al.
See f_st_asymptotically_unbiased() for details.
Definition at line 199 of file structure.cpp.
utils::Matrix<double> genesis::population::f_st_conventional_pool  (  size_t  number_of_samples, 
size_t  poolsizes,  
ForwardIterator  begin,  
ForwardIterator  end  
) 
Compute the conventional F_ST statistic for poolsequenced data, following Kofler et al, for all pairs of ranges of BaseCountss.
This is a shortcut for f_st_conventional_pool, but instead of taking a vector of the sizes of each pool, it uses the number of samples and a fix poolsize that is used for all samples.
Definition at line 216 of file structure.hpp.
double genesis::population::f_st_conventional_pool  (  size_t  p1_poolsize, 
size_t  p2_poolsize,  
ForwardIterator1  p1_begin,  
ForwardIterator1  p1_end,  
ForwardIterator2  p2_begin,  
ForwardIterator2  p2_end  
) 
Compute the conventional F_ST statistic for poolsequenced data, following Kofler et al, for two ranges of BaseCountss.
Definition at line 70 of file structure.hpp.
utils::Matrix<double> genesis::population::f_st_conventional_pool  (  std::vector< size_t > const &  poolsizes, 
ForwardIterator  begin,  
ForwardIterator  end  
) 
Compute the conventional F_ST statistic for poolsequenced data, following Kofler et al, for all pairs of ranges of BaseCountss.
The function is intended to be used for computing pairwise F_ST for a set of BaseCountss along some region (e.g., a genomic Window).
This function expects a range (for example, a Window over the genome) of iterators, where each iterator dereferences to a std::vector<BaseCounts>
. Each entry in the range is used as one position in the genome contributing to F_ST. For all entries, the vector
needs to have the same number of entries, and that number has also to be the same as the size of the given poolsizes
vector.
Then, for each pair (i,j)
of pool samples, the range is iterated, and the respective entries i
and j
of the vectors in the range are used to compute F_ST for this pair of samples, and stored in the resulting matrix at positions (i,j)
and (j,i)
.
Definition at line 156 of file structure.hpp.
std::tuple< double, double, double > f_st_conventional_pool_pi_snp  (  BaseCounts const &  p1, 
BaseCounts const &  p2  
) 
Compute the SNPbased Theta Pi values used in f_st_conventional_pool()
See there for details. The tuple returns Theta Pi for an individual position, which is simply the heterozygosity() at this position, for both samples p1
and p2
, as well as their combined (average frequency) heterozygosity, in that order.
Definition at line 47 of file structure.cpp.
double f_star  (  double  a_n, 
double  n  
) 
Compute f*
according to Achaz 2008 and Kofler et al. 2011.
This is compuated as \( f_{star} = \frac{n  3}{a_n \cdot (n1)  n} \), and needed for the computation of alpha_star() and beta_star(). See there for some more details, and see
G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409â€“1424. https://doi.org/10.1534/genetics.107.082198
for the original equations.
Definition at line 241 of file diversity.cpp.
void filter_min_count  (  BaseCounts &  sample, 
size_t  min_count  
) 
Filter by minimum count that we need for a type of nucleotide (A
, C
, G
, T
) to be considered; set to zero if min_count
is not reached.
This filter is used as a type of quality control filter. All nucleotide counts (that is, BaseCounts::a_count, BaseCounts::c_count, BaseCounts::g_count, and BaseCounts::t_count) that are below the given min_count
are set to zero.
Definition at line 174 of file base_counts.cpp.
size_t get_base_count  (  BaseCounts const &  bc, 
char  base  
) 
Get the count for a base
given as a char.
The given base
has to be one of ACGTDN
(case insensitive), or *#.
for deletions as well.
Definition at line 96 of file base_counts.cpp.
double heterozygosity  (  BaseCounts const &  sample  ) 
Compute classic heterozygosity.
This is computed as \( h = \frac{n}{n1} \left( 1  \sum p^2 \right) \) with n
the total nucleotide_sum() (sum of A
,C
,G
,T
in the sample), and p
their respective nucleotide frequencies.
See Equation 3.1 in
Hahn, M. W.
(2018). Molecular Population Genetics.
https://global.oup.com/academic/product/molecularpopulationgenetics9780878939657
for details.
Definition at line 96 of file diversity.cpp.
bool is_covered  (  GenomeRegion const &  region, 
std::string const &  chromosome,  
size_t  position  
) 
Test whether the chromosome/position is within a given genomic region
.
Definition at line 144 of file functions/genome_region.cpp.
bool genesis::population::is_covered  (  GenomeRegion const &  region, 
T const &  variant  
) 
Test whether the chromosome/position of a variant
is within a given genomic region
.
This is a function template, so that it can accept any data structure that contains public member variables chromosome
(std::string
) and position
(size_t
), such as Variant.
Definition at line 86 of file functions/genome_region.hpp.
bool is_covered  (  GenomeRegion const &  region, 
VcfRecord const &  variant  
) 
Definition at line 165 of file functions/genome_region.cpp.
SlidingWindowIterator<ForwardIterator, InputType, DataType> genesis::population::make_sliding_window_iterator  (  SlidingWindowIteratorSettings< InputType, DataType > const &  settings, 
ForwardIterator  begin,  
ForwardIterator  end  
) 
Helper function to instanciate a SlidingWindowIterator without the need to specify all template parameters manually.
Definition at line 505 of file sliding_window_iterator.hpp.
BaseCounts merge  (  BaseCounts const &  p1, 
BaseCounts const &  p2  
) 
Merge the counts of two BaseCountss.
Definition at line 153 of file base_counts.cpp.
BaseCounts merge  (  std::vector< BaseCounts > const &  p  ) 
Merge the counts of a vector BaseCountss.
Definition at line 160 of file base_counts.cpp.
void merge_inplace  (  BaseCounts &  p1, 
BaseCounts const &  p2  
) 
Merge the counts of two BaseCountss, by adding the counts of the second (p2
) to the first (p1
).
Definition at line 136 of file base_counts.cpp.
double n_base  (  size_t  coverage, 
size_t  poolsize  
) 
Compute the n_base
term used for Tajima's D in Kofler et al. 2011, using a faster closed form expression.
This term is the expected number of distinct individuals sequenced, which is equivalent to finding the expected number of distinct values selected from a set of integers.
The computation in PoPoolation is slowm, see n_base_matrix(). We here instead use a closed form expression following the reasoning of https://math.stackexchange.com/a/72351 See there for the derivation of the equation.
Definition at line 416 of file diversity.cpp.
double n_base_matrix  (  size_t  coverage, 
size_t  poolsize  
) 
Compute the n_base
term used for Tajima's D in Kofler et al. 2011, following their approach.
This term is the expected number of distinct individuals sequenced, which is equivalent to finding the expected number of distinct values selected from a set of integers.
The computation of this term in PoPoolation uses a recursive dynamic programming approach to sum over different possibilities of selecting sets of integers. This gets rather slow for larger inputs, and there is an equivalent closed form that we here use instead. See n_base() for details. We here merely offer the original PoPoolation implementation as a point of reference.
R. Kofler, P. OrozcoterWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. SchlĂ¶tterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
Definition at line 382 of file diversity.cpp.

inline 
Count of the pure nucleotide bases at this position, that is, the sum of all A
, C
, G
, and T
.
This is simply the sum of a_count + c_count + g_count + t_count
.
NB: In PoPoolation, this variable is called eucov
.
Definition at line 177 of file functions/base_counts.hpp.
std::ostream & operator<<  (  std::ostream &  os, 
BaseCounts const &  bs  
) 
Output stream operator for BaseCounts instances.
Definition at line 345 of file base_counts.cpp.
std::ostream & operator<<  (  std::ostream &  os, 
GenomeRegion const &  region  
) 
Definition at line 48 of file functions/genome_region.cpp.
Comparison operator equals for FstAN structs.
Definition at line 250 of file structure.hpp.
bool operator==  (  GenomeRegion const &  a, 
GenomeRegion const &  b  
) 
Definition at line 70 of file functions/genome_region.cpp.
GenomeRegion parse_genome_region  (  std::string const &  region  ) 
Parse a genomic region.
Accepted formats are "chromosome", "chromosome:position", "chromosome:startend", and "chromosome:start..end".
Definition at line 75 of file functions/genome_region.cpp.
GenomeRegionList parse_genome_regions  (  std::string const &  regions  ) 
Parse a set/list of genomic regions.
The individual regions need to be separated by commas (surrounding white space is okay), and each region needs to follow the format as explained in parse_genome_region().
Definition at line 132 of file functions/genome_region.cpp.
genesis::utils::Matrix<double> genesis::population::pij_matrix_  (  size_t  max_coverage, 
size_t  poolsize  
) 
Definition at line 310 of file diversity.cpp.
genesis::utils::Matrix<double> const& genesis::population::pij_matrix_resolver_  (  size_t  max_coverage, 
size_t  poolsize  
) 
Definition at line 344 of file diversity.cpp.
PoolDiversityResults genesis::population::pool_diversity_measures  (  PoolDiversitySettings const &  settings, 
ForwardIterator  begin,  
ForwardIterator  end  
) 
Compute Theta Pi, Theta Watterson, and Tajia's D in their poolsequencing corrected versions according to Kofler et al.
This is a high level function that is meant as a simple example of how to compute these statistics. See theta_pi_pool(), theta_watterson_pool(), and tajima_d_pool() for details. It takes care of most options offered by PoPoolation (as given by settings
here), except for the window width and stride and minimum phred quality score, which have to be applied before filling the window (or whatever other range is used as input here) before calling this function.
Furthermore, results here are not filtered aftwards, so any filtering based on e.g., minimum covered fraction has to be done downstream.
Definition at line 427 of file diversity.hpp.
std::vector< double > prob_cond_true_freq  (  size_t  n, 
std::vector< bool > const &  alleles,  
std::vector< unsigned char > const &  phred_scores,  
bool  unfolded  
) 
Definition at line 121 of file afs_estimate.cpp.
std::vector< double > prob_cond_true_freq_unfolded  (  size_t  n, 
std::vector< bool > const &  alleles,  
std::vector< unsigned char > const &  phred_scores,  
bool  invert_alleles  
) 
Definition at line 145 of file afs_estimate.cpp.
void genesis::population::process_conditional_probability  (  ForwardIterator  begin, 
ForwardIterator  end  
) 
Compute the conditional probabilities of AFs. This reimplements process_probCond from Boitard et al.
Definition at line 100 of file afs_estimate.hpp.
void genesis::population::run_vcf_window  (  SlidingWindowGenerator< Data, Accumulator > &  generator, 
std::string const &  vcf_file,  
std::function< Data(VcfRecord const &)>  conversion,  
std::function< bool(VcfRecord const &)>  condition = {} 

) 
Convenience function to iterate over a whole VCF file.
This function is convenience, and takes care of iterating a VCF file record by record (that is, line by line), using a provided conversion
function to extract the D
/Data
from the VcfRecord. It furthermore takes care of finishing all chromosomes properly, using their lengths as provided in the VCF header.
Before calling the function, of course, all necessary plugin functions have to be set in the SlidingWindowGenerator instance, so that the data is processed as intended. In particular, take care of setting SlidingWindowGenerator::emit_incomplete_windows() to the desired value.
Furthermore, the function offers a condition
function that can be used to skip records that do not fullfil a given condition. That is, if condition
is used, it needs to return true
for records that shall be processed, and false
for those that shall be skipped.
Definition at line 70 of file vcf_window.hpp.
std::array< std::pair< char, size_t >, 4 > sorted_variant_counts  (  Variant const &  variant, 
bool  reference_first  
) 
Definition at line 59 of file variant.cpp.
BaseCountsStatus status  (  BaseCounts const &  sample, 
size_t  min_coverage = 0 , 

size_t  max_coverage = 0 , 

size_t  min_count = 0 , 

bool  tolerate_deletions = false 

) 
Compute a simple status with useful properties from the counts of a BaseCounts.
min_coverage
Minimum coverage expected for a BaseCounts to be considered "covered". If the number of nucleotides (A
, C
, G
, T
) in the reads of a sample is less then the here provided min_coverage
, then the BaseCounts is not considered sufficiently covered, and the BaseCountsStatus::is_covered flag will be set to false
.
max_coverage
Same as min_coverage
, but the upper bound on coverage; maximum coverage expected for a BaseCounts to be considered "covered". If the number of nucleotides exceeds this bound, the BaseCountsStatus::is_covered flag will be set to false
. If provided with a value of 0
(default), max_coverage is not used.
Only if the nucleotide count is in between (or equal to either) these two bounds (min_coverage
and max_coverage
), it is considered to be covered, and BaseCountsStatus::is_covered will be set to true
.
min_count
This value is used to determine whether a BaseCounts has too many deletions, and unless tolerate_deletions() is set to true
, the BaseCountsStatus::is_ignored will be set to true
in that case (too many deletions, as given by BaseCounts::d_count), while the values for BaseCountsStatus::is_covered, BaseCountsStatus::is_snp, and BaseCountsStatus::is_biallelic will be set to false
.
Typically, if this function is used after calling filter_min_count() on the BaseCounts, the min_count
is set to the same value for consistency.
tolerate_deletions
Set whether we tolerate BaseCountss with a high amount of deletions.
If set to false
(default), we do not tolerate deletions. In that case, if the number of deletions in a Sample (given by Sample::d_count) is higher than or equal to min_count(), the Sample will be considered ignored (Sample::is_ignored set to true
), and considered not covered (Sample::is_covered, Sample::is_snp, and Sample::is_biallelic will all be set to false
).
If however set to true
, we tolerate high amounts of deletions, and the values for the above properties will be set as usual by considering the nucleotide counts (Sample::a_count, Sample::c_count, Sample::g_count, and Sample::t_count) instead.
Definition at line 46 of file base_counts.cpp.
double genesis::population::tajima_d_pool  (  PoolDiversitySettings const &  settings, 
ForwardIterator  begin,  
ForwardIterator  end  
) 
Compute the poolsequencing corrected version of Tajima's D according to Kofler et al.
Definition at line 397 of file diversity.hpp.
double genesis::population::tajima_d_pool  (  PoolDiversitySettings const &  settings, 
ForwardIterator  begin,  
ForwardIterator  end,  
double  theta_pi,  
double  theta_watterson  
) 
Compute the poolsequencing corrected version of Tajima's D according to Kofler et al.
Definition at line 373 of file diversity.hpp.
double tajima_d_pool_denominator  (  PoolDiversitySettings const &  settings, 
size_t  snp_count,  
double  theta  
) 
Compute the denominator for the poolsequencing correction of Tajima's D according to Kofler et al.
Definition at line 435 of file diversity.cpp.
double genesis::population::theta_pi  (  ForwardIterator  begin, 
ForwardIterator  end  
) 
Compute classic theta pi, that is, the sum of heterozygosities.
The function simply sums heterozygosity() for all samples in the given range.
Definition at line 154 of file diversity.hpp.

inline 
Compute theta pi with poolsequencing correction according to Kofler et al, for a single BaseCounts, that is, its heterozygosity() divided by the correction denominator.
Definition at line 195 of file diversity.hpp.
double genesis::population::theta_pi_pool  (  PoolDiversitySettings const &  settings, 
ForwardIterator  begin,  
ForwardIterator  end  
) 
Compute theta pi with poolsequencing correction according to Kofler et al, that is, the sum of heterozygosities divided by the correction denominator.
The function sums heterozygosity() for all samples in the given range, and divides each by the respective denominator to correct for error from pool sequencing. See theta_pi_pool_denominator() for details.
Definition at line 173 of file diversity.hpp.
double theta_pi_pool_denominator  (  PoolDiversitySettings const &  settings, 
size_t  nucleotide_count  
) 
Compute the denominator for the poolsequencing correction of theta pi according to Kofler et al.
We here compute the denominator for a given poolsize
, with a fix min_allele_count
, which is identical for each given nucleotide_count
, and henced cached internally for speedup.
See
R. Kofler, P. OrozcoterWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. SchlĂ¶tterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
Definition at line 120 of file diversity.cpp.
double genesis::population::theta_watterson_pool  (  PoolDiversitySettings const &  settings, 
ForwardIterator  begin,  
ForwardIterator  end  
) 
Compute theta watterson with poolsequencing correction according to Kofler et al.
Definition at line 221 of file diversity.hpp.
double theta_watterson_pool_denominator  (  PoolDiversitySettings const &  settings, 
size_t  nucleotide_count  
) 
Compute the denominator for the poolsequencing correction of theta watterson according to Kofler et al.
We here compute the denominator for a given poolsize
, with a fix min_allele_count
, which is identical for each given nucleotide_count
, and henced cached internally for speedup.
See
R. Kofler, P. OrozcoterWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. SchlĂ¶tterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
Definition at line 170 of file diversity.cpp.
std::string to_string  (  GenomeRegion const &  region  ) 
Definition at line 55 of file functions/genome_region.cpp.
std::ostream & to_sync  (  BaseCounts const &  bs, 
std::ostream &  os  
) 
Output a BaseCounts instance to a stream in the PoPoolation2 sync format.
This is one column from that file, outputting the counts separated by colons, in the order A:T:C:G:N:D
, with D
being deletions (*
in pileup).
Definition at line 352 of file base_counts.cpp.
std::ostream & to_sync  (  Variant const &  var, 
std::ostream &  os  
) 
Output a Variant instance to a stream in the PoPoolation2 sync format.
The format is a tabdelimited file with one variant per line:
Each population column outputs counts separated by colons, in the order A:T:C:G:N:D
, with D
being deletions (*
in pileup).
See https://sourceforge.net/p/popoolation2/wiki/Tutorial/ for details.
Definition at line 147 of file variant.cpp.
BaseCounts total_base_counts  (  Variant const &  variant  ) 
Definition at line 54 of file variant.cpp.
std::string vcf_genotype_string  (  std::vector< VcfGenotype > const &  genotypes  ) 
Return the VCFlike string representation of a set of VcfGenotype entries.
The VcfFormatIterator::get_values() function returns all genotype entries for a given sample of a record/line. Here, we return a string representation similar to VCF of these genotypes, for example 00
or ./1
.
Definition at line 216 of file vcf_common.cpp.
size_t vcf_genotype_sum  (  std::vector< VcfGenotype > const &  genotypes  ) 
Return the sum of genotypes for a set of VcfGenotype entries, typically used to construct a genotype matrix with entries 0,1,2.
The function takes the given genotypes
, encodes the reference as 0 and any alternative as 1, and then sums this over the values. For diploid organisms, this yields possible results in the range of 0 (homozygote for the reference), 1 (heterzygote), or 2 (homozygote for the alternative), which is typically used in genotype matrices.
Definition at line 230 of file vcf_common.cpp.
std::string vcf_hl_type_to_string  (  int  hl_type  ) 
Internal helper function to convert htslibinternal BCF_HL_* header line type values to their string representation as used in the VCF header ("FILTER", "INFO", "FORMAT", etc).
Definition at line 199 of file vcf_common.cpp.
std::string vcf_value_special_to_string  (  int  vl_type_num  ) 
Definition at line 171 of file vcf_common.cpp.
std::string vcf_value_special_to_string  (  VcfValueSpecial  vl_type_num  ) 
Definition at line 166 of file vcf_common.cpp.
std::string vcf_value_type_to_string  (  int  ht_type  ) 
Definition at line 141 of file vcf_common.cpp.
std::string vcf_value_type_to_string  (  VcfValueType  ht_type  ) 
Definition at line 136 of file vcf_common.cpp.
using VcfFormatIteratorFloat = VcfFormatIterator<float, double> 
Definition at line 67 of file vcf_format_iterator.hpp.
using VcfFormatIteratorGenotype = VcfFormatIterator<int32_t, VcfGenotype> 
Definition at line 68 of file vcf_format_iterator.hpp.
using VcfFormatIteratorInt = VcfFormatIterator<int32_t, int32_t> 
Definition at line 66 of file vcf_format_iterator.hpp.
using VcfFormatIteratorString = VcfFormatIterator<char*, std::string> 
Definition at line 65 of file vcf_format_iterator.hpp.

strong 
Specification for the values determining header line types of VCF/BCF files.
This list contains the types of header lines that htslib uses for identification, as specified in the VCF header. Corresponds to the BCF_HL_*
macro constants defined by htslib. We statically assert that these have the same values.
Enumerator  

kFilter  
kInfo  
kFormat  
kContig  
kStructured  
kGeneric 
Definition at line 60 of file vcf_common.hpp.

strong 
Specification for special markers for the number of values expected for keyvaluepairs of VCF/BCF files.
This list contains the special markers for the number of values of the INFO
and FORMAT
keyvalue pairs, as specified in the VCF header, and used in the record lines. Corresponds to the BCF_VL_*
macro constants defined by htslib. We statically assert that these have the same values.
Enumerator  

kFixed  Fixed number of values expected. In VCF, this is denoted simply by an integer number. This simply specifies that there is a fixed number of values to be expected; we do not further define how many exaclty are expected here (the integer value). This is taken care of in a separate variable that is provided whenever a fixedsize value is needed, see for example VcfSpecification. 
kVariable  Variable number of possible values, or unknown, or unbounded. In VCF, this is denoted by '.'. 
kAllele  One value per alternate allele. In VCF, this is denoted as 'A'. 
kGenotype  One value for each possible genotype (more relevant to the FORMAT tags). In VCF, this is denoated as 'G'. 
kReference  One value for each possible allele (including the reference). In VCF, this is denoted as 'R'. 
Definition at line 95 of file vcf_common.hpp.

strong 
Specification for the data type of the values expected in keyvaluepairs of VCF/BCF files.
This list contains the types of data in values of the INFO
and FORMAT
keyvalue pairs, as specified in the VCF header, and used in the record lines. Corresponds to the BCF_HT_*
macro constants defined by htslib. We statically assert that these have the same values.
Enumerator  

kFlag  
kInteger  
kFloat  
kString 
Definition at line 78 of file vcf_common.hpp.

strong 
Position in the genome that is used for reporting when emitting or using a window.
When a window is finished, the on_emission() plugin function is called, which reports the position in the genome at which the window is. There are several ways that this position is computed. Typically, just the first position of the window is used (that is, for an interval, the beginning of the interval, and for variants, the position of the first variant).
However, it might be desirable to report a different position, for example when plotting the results. When using WindowType::kVariants for example, one might want to plot the values computed per window at the midpoint genome position of the variants in that window.
Enumerator  

kIntervalBegin  
kIntervalEnd  
kIntervalMidpoint  
kVariantFirst  
kVariantLast  
kVariantMedian  
kVariantMean  
kVariantMidpoint 
Definition at line 69 of file window.hpp.

strong 
WindowType of a Window, that is, whether we slide along a fixed size interval of the genome, or along a fixed number of variants.
Enumerator  

kInterval  
kVariants 
Definition at line 51 of file window.hpp.