A library for working with phylogenetic data.
v0.25.0
genesis::population Namespace Reference

Classes

struct  AfsPileupRecord
 Helper to store the data of one pileup line/record needed for the Boitard et al Allele Frequency Estimation computation. More...
 
class  AlleleFrequencyWindow
 
struct  BaseCounts
 One set of nucleotide base counts, for example for a given sample that represents a pool of sequenced individuals. More...
 
struct  BaseCountsStatus
 
struct  EmptyAccumulator
 Empty helper data struct to serve as a dummy for Window. More...
 
struct  FstAN
 Helper struct for the a_1, a_2, n_1, and n_2 values needed for f_st_asymptotically_unbiased(). More...
 
class  GenomeHeatmap
 
struct  GenomeRegion
 A region (between two positions) on a chromosome. More...
 
class  GenomeRegionList
 A sorted list of genome regions. More...
 
class  GffReader
 Reader for GFF (General Feature Format) and GTF (General Transfer Format) files. More...
 
class  HeatmapColorization
 
class  HeatmapMatrix
 Matrix to capture and accumulate columns of per-position or per-window values along a chromosome. More...
 
class  HtsFile
 Wrap an ::htsFile struct. More...
 
struct  PoolDiversityResults
 Data struct to collect all diversity statistics computed by pool_diversity_measures(). More...
 
struct  PoolDiversitySettings
 Settings used by different pool-sequencing corrected diversity statistics. More...
 
class  SimplePileupInputIterator
 Iterate an input source and parse it as a (m)pileup file. More...
 
class  SimplePileupReader
 Reader for line-by-line assessment of (m)pileup files. More...
 
class  SlidingWindowGenerator
 Generator for sliding Windows over the chromosomes of a genome. More...
 
class  SlidingWindowIterator
 Iterator for sliding Windows over the chromosomes of a genome. More...
 
struct  SlidingWindowIteratorSettings
 Settings for running a sliding window iteration. More...
 
class  SyncInputIterator
 Iterate an input source and parse it as a sync file. More...
 
class  SyncReader
 Reader for PoPoolation2's "synchronized" files. More...
 
struct  Variant
 A single variant at a position in a chromosome, along with BaseCounts for a set of samples. More...
 
class  VcfFormatHelper
 Provide htslib helper functions. More...
 
class  VcfFormatIterator
 Iterate the FORMAT information for the samples in a SNP/variant line in a VCF/BCF file. More...
 
class  VcfGenotype
 Simple wrapper class for one genotype field for a sample. More...
 
class  VcfHeader
 Capture the information from a header of a VCF/BCF file. More...
 
class  VcfInputIterator
 Iterate an input source and parse it as a VCF/BCF file. More...
 
class  VcfParallelInputIterator
 Iterate a set of input VCF/BCF file in parallel, that is, synchronize the iteration to be at the same positions in the genome at all times. More...
 
class  VcfRecord
 Capture the information of a single SNP/variant line in a VCF/BCF file. More...
 
struct  VcfSpecification
 Collect the four required keys that describe an INFO or FORMAT sub-field of VCF/BCF files. More...
 
class  Window
 Window over the chromosomes of a genome. More...
 

Functions

double a_n (size_t n)
 Compute a_n, the sum of reciprocals. More...
 
double alpha_star (double n)
 Compute alpha* according to Achaz 2008 and Kofler et al. 2011. More...
 
double amnm_ (size_t poolsize, size_t nucleotide_count, size_t allele_frequency)
 Local helper function to compute values for the denominator. More...
 
double b_n (size_t n)
 Compute b_n, the sum of squared reciprocals. More...
 
double beta_star (double n)
 Compute beta* according to Achaz 2008 and Kofler et al. 2011. More...
 
std::pair< char, double > consensus (BaseCounts const &sample)
 Consensus character for a BaseCounts, and its confidence. More...
 
std::pair< char, double > consensus (BaseCounts const &sample, BaseCountsStatus const &status)
 Consensus character for a BaseCounts, and its confidence. More...
 
AfsPileupRecord convert_to_afs_pileup_record (SimplePileupReader::Record const &record)
 
BaseCounts convert_to_base_counts (SimplePileupReader::Sample const &sample, unsigned char min_phred_score)
 
Variant convert_to_variant (SimplePileupReader::Record const &record, unsigned char min_phred_score)
 
Variant convert_to_variant (VcfRecord const &record)
 
template<class ForwardIterator >
utils::Matrix< double > f_st_asymptotically_unbiased (ForwardIterator begin, ForwardIterator end)
 Compute the asymptotically unbiased F_ST estimator of Karlsson et al, for all pairs of ranges of BaseCountss. More...
 
template<class ForwardIterator1 , class ForwardIterator2 >
double f_st_asymptotically_unbiased (ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end)
 Compute the asymptotically unbiased F_ST estimator of Karlsson et al. More...
 
FstAN f_st_asymptotically_unbiased_a1n1a2n2 (BaseCounts const &p1, BaseCounts const &p2)
 Compute the a and n values needed for the asymptotically unbiased F_ST estimator of Karlsson et al. More...
 
std::pair< double, double > f_st_asymptotically_unbiased_nkdk (FstAN const &fstan)
 Compute the N_k and D_k values needed for the asymptotically unbiased F_ST estimator of Karlsson et al. More...
 
template<class ForwardIterator >
utils::Matrix< double > f_st_conventional_pool (size_t number_of_samples, size_t poolsizes, ForwardIterator begin, ForwardIterator end)
 Compute the conventional F_ST statistic for pool-sequenced data, following Kofler et al, for all pairs of ranges of BaseCountss. More...
 
template<class ForwardIterator1 , class ForwardIterator2 >
double f_st_conventional_pool (size_t p1_poolsize, size_t p2_poolsize, ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end)
 Compute the conventional F_ST statistic for pool-sequenced data, following Kofler et al, for two ranges of BaseCountss. More...
 
template<class ForwardIterator >
utils::Matrix< double > f_st_conventional_pool (std::vector< size_t > const &poolsizes, ForwardIterator begin, ForwardIterator end)
 Compute the conventional F_ST statistic for pool-sequenced data, following Kofler et al, for all pairs of ranges of BaseCountss. More...
 
std::tuple< double, double, double > f_st_conventional_pool_pi_snp (BaseCounts const &p1, BaseCounts const &p2)
 Compute the SNP-based Theta Pi values used in f_st_conventional_pool() More...
 
double f_star (double a_n, double n)
 Compute f* according to Achaz 2008 and Kofler et al. 2011. More...
 
void filter_min_count (BaseCounts &sample, size_t min_count)
 Filter by minimum count that we need for a type of nucleotide (A, C, G, T) to be considered; set to zero if min_count is not reached. More...
 
size_t get_base_count (BaseCounts const &bc, char base)
 Get the count for a base given as a char. More...
 
double heterozygosity (BaseCounts const &sample)
 Compute classic heterozygosity. More...
 
bool is_covered (GenomeRegion const &region, std::string const &chromosome, size_t position)
 Test whether the chromosome/position is within a given genomic region. More...
 
template<class T >
bool is_covered (GenomeRegion const &region, T const &variant)
 Test whether the chromosome/position of a variant is within a given genomic region. More...
 
bool is_covered (GenomeRegion const &region, VcfRecord const &variant)
 
template<class ForwardIterator , class InputType , class DataType = InputType>
SlidingWindowIterator< ForwardIterator, InputType, DataType > make_sliding_window_iterator (SlidingWindowIteratorSettings< InputType, DataType > const &settings, ForwardIterator begin, ForwardIterator end)
 Helper function to instanciate a SlidingWindowIterator without the need to specify all template parameters manually. More...
 
BaseCounts merge (BaseCounts const &p1, BaseCounts const &p2)
 Merge the counts of two BaseCountss. More...
 
BaseCounts merge (std::vector< BaseCounts > const &p)
 Merge the counts of a vector BaseCountss. More...
 
void merge_inplace (BaseCounts &p1, BaseCounts const &p2)
 Merge the counts of two BaseCountss, by adding the counts of the second (p2) to the first (p1). More...
 
double n_base (size_t coverage, size_t poolsize)
 Compute the n_base term used for Tajima's D in Kofler et al. 2011, using a faster closed form expression. More...
 
double n_base_matrix (size_t coverage, size_t poolsize)
 Compute the n_base term used for Tajima's D in Kofler et al. 2011, following their approach. More...
 
size_t nucleotide_sum (BaseCounts const &sample)
 Count of the pure nucleotide bases at this position, that is, the sum of all A, C, G, and T. More...
 
std::ostream & operator<< (std::ostream &os, BaseCounts const &bs)
 Output stream operator for BaseCounts instances. More...
 
std::ostream & operator<< (std::ostream &os, GenomeRegion const &region)
 
bool operator== (FstAN const &f1, FstAN const &f2)
 Comparison operator equals for FstAN structs. More...
 
bool operator== (GenomeRegion const &a, GenomeRegion const &b)
 
GenomeRegion parse_genome_region (std::string const &region)
 Parse a genomic region. More...
 
GenomeRegionList parse_genome_regions (std::string const &regions)
 Parse a set/list of genomic regions. More...
 
genesis::utils::Matrix< double > pij_matrix_ (size_t max_coverage, size_t poolsize)
 
genesis::utils::Matrix< double > const & pij_matrix_resolver_ (size_t max_coverage, size_t poolsize)
 
template<class ForwardIterator >
PoolDiversityResults pool_diversity_measures (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end)
 Compute Theta Pi, Theta Watterson, and Tajia's D in their pool-sequencing corrected versions according to Kofler et al. More...
 
std::vector< double > prob_cond_true_freq (size_t n, std::vector< bool > const &alleles, std::vector< unsigned char > const &phred_scores, bool unfolded)
 
std::vector< double > prob_cond_true_freq_unfolded (size_t n, std::vector< bool > const &alleles, std::vector< unsigned char > const &phred_scores, bool invert_alleles)
 
template<class ForwardIterator >
void process_conditional_probability (ForwardIterator begin, ForwardIterator end)
 Compute the conditional probabilities of AFs. This reimplements process_probCond from Boitard et al. More...
 
template<class Data , class Accumulator = EmptyAccumulator>
void run_vcf_window (SlidingWindowGenerator< Data, Accumulator > &generator, std::string const &vcf_file, std::function< Data(VcfRecord const &)> conversion, std::function< bool(VcfRecord const &)> condition={})
 Convenience function to iterate over a whole VCF file. More...
 
std::array< std::pair< char, size_t >, 4 > sorted_variant_counts (Variant const &variant, bool reference_first)
 
BaseCountsStatus status (BaseCounts const &sample, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false)
 Compute a simple status with useful properties from the counts of a BaseCounts. More...
 
template<class ForwardIterator >
double tajima_d_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end)
 Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al. More...
 
template<class ForwardIterator >
double tajima_d_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end, double theta_pi, double theta_watterson)
 Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al. More...
 
double tajima_d_pool_denominator (PoolDiversitySettings const &settings, size_t snp_count, double theta)
 Compute the denominator for the pool-sequencing correction of Tajima's D according to Kofler et al. More...
 
template<class ForwardIterator >
double theta_pi (ForwardIterator begin, ForwardIterator end)
 Compute classic theta pi, that is, the sum of heterozygosities. More...
 
double theta_pi_pool (PoolDiversitySettings const &settings, BaseCounts const &sample)
 Compute theta pi with pool-sequencing correction according to Kofler et al, for a single BaseCounts, that is, its heterozygosity() divided by the correction denominator. More...
 
template<class ForwardIterator >
double theta_pi_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end)
 Compute theta pi with pool-sequencing correction according to Kofler et al, that is, the sum of heterozygosities divided by the correction denominator. More...
 
double theta_pi_pool_denominator (PoolDiversitySettings const &settings, size_t nucleotide_count)
 Compute the denominator for the pool-sequencing correction of theta pi according to Kofler et al. More...
 
template<class ForwardIterator >
double theta_watterson_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end)
 Compute theta watterson with pool-sequencing correction according to Kofler et al. More...
 
double theta_watterson_pool_denominator (PoolDiversitySettings const &settings, size_t nucleotide_count)
 Compute the denominator for the pool-sequencing correction of theta watterson according to Kofler et al. More...
 
std::string to_string (GenomeRegion const &region)
 
std::ostream & to_sync (BaseCounts const &bs, std::ostream &os)
 Output a BaseCounts instance to a stream in the PoPoolation2 sync format. More...
 
std::ostream & to_sync (Variant const &var, std::ostream &os)
 Output a Variant instance to a stream in the PoPoolation2 sync format. More...
 
BaseCounts total_base_counts (Variant const &variant)
 
std::string vcf_genotype_string (std::vector< VcfGenotype > const &genotypes)
 Return the VCF-like string representation of a set of VcfGenotype entries. More...
 
size_t vcf_genotype_sum (std::vector< VcfGenotype > const &genotypes)
 Return the sum of genotypes for a set of VcfGenotype entries, typically used to construct a genotype matrix with entries 0,1,2. More...
 
std::string vcf_hl_type_to_string (int hl_type)
 Internal helper function to convert htslib-internal BCF_HL_* header line type values to their string representation as used in the VCF header ("FILTER", "INFO", "FORMAT", etc). More...
 
std::string vcf_value_special_to_string (int vl_type_num)
 
std::string vcf_value_special_to_string (VcfValueSpecial vl_type_num)
 
std::string vcf_value_type_to_string (int ht_type)
 
std::string vcf_value_type_to_string (VcfValueType ht_type)
 

Enumerations

enum  VcfHeaderLine : int {
  kFilter = 0, kInfo = 1, kFormat = 2, kContig = 3,
  kStructured = 4, kGeneric = 5
}
 Specification for the values determining header line types of VCF/BCF files. More...
 
enum  VcfValueSpecial : int {
  kFixed = 0, kVariable = 1, kAllele = 2, kGenotype = 3,
  kReference = 4
}
 Specification for special markers for the number of values expected for key-value-pairs of VCF/BCF files. More...
 
enum  VcfValueType : int { kFlag = 0, kInteger = 1, kFloat = 2, kString = 3 }
 Specification for the data type of the values expected in key-value-pairs of VCF/BCF files. More...
 
enum  WindowAnchorType {
  kIntervalBegin, kIntervalEnd, kIntervalMidpoint, kVariantFirst,
  kVariantLast, kVariantMedian, kVariantMean, kVariantMidpoint
}
 Position in the genome that is used for reporting when emitting or using a window. More...
 
enum  WindowType { kInterval, kVariants }
 WindowType of a Window, that is, whether we slide along a fixed size interval of the genome, or along a fixed number of variants. More...
 

Typedefs

using VcfFormatIteratorFloat = VcfFormatIterator< float, double >
 
using VcfFormatIteratorGenotype = VcfFormatIterator< int32_t, VcfGenotype >
 
using VcfFormatIteratorInt = VcfFormatIterator< int32_t, int32_t >
 
using VcfFormatIteratorString = VcfFormatIterator< char *, std::string >
 

Function Documentation

◆ a_n()

double a_n ( size_t  n)

Compute a_n, the sum of reciprocals.

This is the sum of reciprocals up to n-1: \( a_n = \sum_{i=1}^{n-1} \frac{1}{i} \).

See Equation 3.6 in

Hahn, M. W. (2018). Molecular Population Genetics. https://global.oup.com/academic/product/molecular-population-genetics-9780878939657

for details.

Definition at line 215 of file diversity.cpp.

◆ alpha_star()

double alpha_star ( double  n)

Compute alpha* according to Achaz 2008 and Kofler et al. 2011.

This is needed for the computation of tajima_d_pool() according to

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

The equation is based on

G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409–1424. https://doi.org/10.1534/genetics.107.082198

See there for details.

Definition at line 247 of file diversity.cpp.

◆ amnm_()

double genesis::population::amnm_ ( size_t  poolsize,
size_t  nucleotide_count,
size_t  allele_frequency 
)

Local helper function to compute values for the denominator.

Definition at line 57 of file diversity.cpp.

◆ b_n()

double b_n ( size_t  n)

Compute b_n, the sum of squared reciprocals.

This is the sum of squared reciprocals up to n-1: \( b_n = \sum_{i=1}^{n-1} \frac{1}{i^2} \).

See

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

Definition at line 228 of file diversity.cpp.

◆ beta_star()

double beta_star ( double  n)

Compute beta* according to Achaz 2008 and Kofler et al. 2011.

This is needed for the computation of tajima_d_pool() according to

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

The equation is based on

G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409–1424. https://doi.org/10.1534/genetics.107.082198

See there for details.

Definition at line 275 of file diversity.cpp.

◆ consensus() [1/2]

std::pair< char, double > consensus ( BaseCounts const &  sample)

Consensus character for a BaseCounts, and its confidence.

This is simply the character (out of ACGT) that appears most often (or, for ties, the lexicographically smallest character), unless all of (A, C, G, T) are zero, in which case the consensus character is N. The confidence is the count of the consensus character, divided by the total count of all four nucleotides.

Definition at line 191 of file base_counts.cpp.

◆ consensus() [2/2]

std::pair< char, double > consensus ( BaseCounts const &  sample,
BaseCountsStatus const &  status 
)

Consensus character for a BaseCounts, and its confidence.

This is simply the character (out of ACGT) that appears most often (or, for ties, the lexicographically smallest character). If the BaseCounts is not well covered by reads (that is, if its BaseCountsStatus::is_covered is false), the consensus character is N. The confidence is the count of the consensus character, divided by the total count of all four nucleotides.

Definition at line 232 of file base_counts.cpp.

◆ convert_to_afs_pileup_record()

AfsPileupRecord convert_to_afs_pileup_record ( SimplePileupReader::Record const &  record)

Definition at line 48 of file afs_estimate.cpp.

◆ convert_to_base_counts()

BaseCounts convert_to_base_counts ( SimplePileupReader::Sample const &  sample,
unsigned char  min_phred_score 
)

Definition at line 245 of file base_counts.cpp.

◆ convert_to_variant() [1/2]

Variant convert_to_variant ( SimplePileupReader::Record const &  record,
unsigned char  min_phred_score 
)

Definition at line 158 of file variant.cpp.

◆ convert_to_variant() [2/2]

Variant convert_to_variant ( VcfRecord const &  record)

Definition at line 193 of file variant.cpp.

◆ f_st_asymptotically_unbiased() [1/2]

utils::Matrix<double> genesis::population::f_st_asymptotically_unbiased ( ForwardIterator  begin,
ForwardIterator  end 
)

Compute the asymptotically unbiased F_ST estimator of Karlsson et al, for all pairs of ranges of BaseCountss.

The function is intended to be used for computing pairwise F_ST for a set of BaseCountss along some region (e.g., a genomic Window).

This function expects a range (for example, a Window over the genome) of iterators, where each iterator dereferences to a std::vector<BaseCounts>. Each entry in the range is used as one position in the genome contributing to F_ST. For all entries, the vector needs to have the same number of entries.

Then, for each pair (i,j) of pool samples, the range is iterated, and the respective entries i and j of the vectors in the range are used to compute F_ST for this pair of samples, and stored in the resulting matrix at positions (i,j) and (j,i).

Definition at line 328 of file structure.hpp.

◆ f_st_asymptotically_unbiased() [2/2]

double genesis::population::f_st_asymptotically_unbiased ( ForwardIterator1  p1_begin,
ForwardIterator1  p1_end,
ForwardIterator2  p2_begin,
ForwardIterator2  p2_end 
)

Compute the asymptotically unbiased F_ST estimator of Karlsson et al.

This follows the implementation in PoPoolation2 by Kofler et al.

Definition at line 277 of file structure.hpp.

◆ f_st_asymptotically_unbiased_a1n1a2n2()

FstAN f_st_asymptotically_unbiased_a1n1a2n2 ( BaseCounts const &  p1,
BaseCounts const &  p2 
)

Compute the a and n values needed for the asymptotically unbiased F_ST estimator of Karlsson et al.

See f_st_asymptotically_unbiased() for details.

Definition at line 102 of file structure.cpp.

◆ f_st_asymptotically_unbiased_nkdk()

std::pair< double, double > f_st_asymptotically_unbiased_nkdk ( FstAN const &  fstan)

Compute the N_k and D_k values needed for the asymptotically unbiased F_ST estimator of Karlsson et al.

See f_st_asymptotically_unbiased() for details.

Definition at line 199 of file structure.cpp.

◆ f_st_conventional_pool() [1/3]

utils::Matrix<double> genesis::population::f_st_conventional_pool ( size_t  number_of_samples,
size_t  poolsizes,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute the conventional F_ST statistic for pool-sequenced data, following Kofler et al, for all pairs of ranges of BaseCountss.

This is a shortcut for f_st_conventional_pool, but instead of taking a vector of the sizes of each pool, it uses the number of samples and a fix poolsize that is used for all samples.

Definition at line 216 of file structure.hpp.

◆ f_st_conventional_pool() [2/3]

double genesis::population::f_st_conventional_pool ( size_t  p1_poolsize,
size_t  p2_poolsize,
ForwardIterator1  p1_begin,
ForwardIterator1  p1_end,
ForwardIterator2  p2_begin,
ForwardIterator2  p2_end 
)

Compute the conventional F_ST statistic for pool-sequenced data, following Kofler et al, for two ranges of BaseCountss.

Definition at line 70 of file structure.hpp.

◆ f_st_conventional_pool() [3/3]

utils::Matrix<double> genesis::population::f_st_conventional_pool ( std::vector< size_t > const &  poolsizes,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute the conventional F_ST statistic for pool-sequenced data, following Kofler et al, for all pairs of ranges of BaseCountss.

The function is intended to be used for computing pairwise F_ST for a set of BaseCountss along some region (e.g., a genomic Window).

This function expects a range (for example, a Window over the genome) of iterators, where each iterator dereferences to a std::vector<BaseCounts>. Each entry in the range is used as one position in the genome contributing to F_ST. For all entries, the vector needs to have the same number of entries, and that number has also to be the same as the size of the given poolsizes vector.

Then, for each pair (i,j) of pool samples, the range is iterated, and the respective entries i and j of the vectors in the range are used to compute F_ST for this pair of samples, and stored in the resulting matrix at positions (i,j) and (j,i).

Definition at line 156 of file structure.hpp.

◆ f_st_conventional_pool_pi_snp()

std::tuple< double, double, double > f_st_conventional_pool_pi_snp ( BaseCounts const &  p1,
BaseCounts const &  p2 
)

Compute the SNP-based Theta Pi values used in f_st_conventional_pool()

See there for details. The tuple returns Theta Pi for an individual position, which is simply the heterozygosity() at this position, for both samples p1 and p2, as well as their combined (average frequency) heterozygosity, in that order.

Definition at line 47 of file structure.cpp.

◆ f_star()

double f_star ( double  a_n,
double  n 
)

Compute f* according to Achaz 2008 and Kofler et al. 2011.

This is compuated as \( f_{star} = \frac{n - 3}{a_n \cdot (n-1) - n} \), and needed for the computation of alpha_star() and beta_star(). See there for some more details, and see

G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409–1424. https://doi.org/10.1534/genetics.107.082198

for the original equations.

Definition at line 241 of file diversity.cpp.

◆ filter_min_count()

void filter_min_count ( BaseCounts sample,
size_t  min_count 
)

Filter by minimum count that we need for a type of nucleotide (A, C, G, T) to be considered; set to zero if min_count is not reached.

This filter is used as a type of quality control filter. All nucleotide counts (that is, BaseCounts::a_count, BaseCounts::c_count, BaseCounts::g_count, and BaseCounts::t_count) that are below the given min_count are set to zero.

Definition at line 174 of file base_counts.cpp.

◆ get_base_count()

size_t get_base_count ( BaseCounts const &  bc,
char  base 
)

Get the count for a base given as a char.

The given base has to be one of ACGTDN (case insensitive), or *#. for deletions as well.

Definition at line 96 of file base_counts.cpp.

◆ heterozygosity()

double heterozygosity ( BaseCounts const &  sample)

Compute classic heterozygosity.

This is computed as \( h = \frac{n}{n-1} \left( 1 - \sum p^2 \right) \) with n the total nucleotide_sum() (sum of A,C,G,T in the sample), and p their respective nucleotide frequencies.

See Equation 3.1 in

Hahn, M. W.
(2018). Molecular Population Genetics.
https://global.oup.com/academic/product/molecular-population-genetics-9780878939657

for details.

Definition at line 96 of file diversity.cpp.

◆ is_covered() [1/3]

bool is_covered ( GenomeRegion const &  region,
std::string const &  chromosome,
size_t  position 
)

Test whether the chromosome/position is within a given genomic region.

Definition at line 144 of file functions/genome_region.cpp.

◆ is_covered() [2/3]

bool genesis::population::is_covered ( GenomeRegion const &  region,
T const &  variant 
)

Test whether the chromosome/position of a variant is within a given genomic region.

This is a function template, so that it can accept any data structure that contains public member variables chromosome (std::string) and position (size_t), such as Variant.

Definition at line 86 of file functions/genome_region.hpp.

◆ is_covered() [3/3]

bool is_covered ( GenomeRegion const &  region,
VcfRecord const &  variant 
)

Definition at line 165 of file functions/genome_region.cpp.

◆ make_sliding_window_iterator()

SlidingWindowIterator<ForwardIterator, InputType, DataType> genesis::population::make_sliding_window_iterator ( SlidingWindowIteratorSettings< InputType, DataType > const &  settings,
ForwardIterator  begin,
ForwardIterator  end 
)

Helper function to instanciate a SlidingWindowIterator without the need to specify all template parameters manually.

Definition at line 505 of file sliding_window_iterator.hpp.

◆ merge() [1/2]

BaseCounts merge ( BaseCounts const &  p1,
BaseCounts const &  p2 
)

Merge the counts of two BaseCountss.

Definition at line 153 of file base_counts.cpp.

◆ merge() [2/2]

BaseCounts merge ( std::vector< BaseCounts > const &  p)

Merge the counts of a vector BaseCountss.

Definition at line 160 of file base_counts.cpp.

◆ merge_inplace()

void merge_inplace ( BaseCounts p1,
BaseCounts const &  p2 
)

Merge the counts of two BaseCountss, by adding the counts of the second (p2) to the first (p1).

Definition at line 136 of file base_counts.cpp.

◆ n_base()

double n_base ( size_t  coverage,
size_t  poolsize 
)

Compute the n_base term used for Tajima's D in Kofler et al. 2011, using a faster closed form expression.

This term is the expected number of distinct individuals sequenced, which is equivalent to finding the expected number of distinct values selected from a set of integers.

The computation in PoPoolation is slowm, see n_base_matrix(). We here instead use a closed form expression following the reasoning of https://math.stackexchange.com/a/72351 See there for the derivation of the equation.

Definition at line 416 of file diversity.cpp.

◆ n_base_matrix()

double n_base_matrix ( size_t  coverage,
size_t  poolsize 
)

Compute the n_base term used for Tajima's D in Kofler et al. 2011, following their approach.

This term is the expected number of distinct individuals sequenced, which is equivalent to finding the expected number of distinct values selected from a set of integers.

The computation of this term in PoPoolation uses a recursive dynamic programming approach to sum over different possibilities of selecting sets of integers. This gets rather slow for larger inputs, and there is an equivalent closed form that we here use instead. See n_base() for details. We here merely offer the original PoPoolation implementation as a point of reference.

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

Definition at line 382 of file diversity.cpp.

◆ nucleotide_sum()

size_t genesis::population::nucleotide_sum ( BaseCounts const &  sample)
inline

Count of the pure nucleotide bases at this position, that is, the sum of all A, C, G, and T.

This is simply the sum of a_count + c_count + g_count + t_count.

NB: In PoPoolation, this variable is called eucov.

Definition at line 177 of file functions/base_counts.hpp.

◆ operator<<() [1/2]

std::ostream & operator<< ( std::ostream &  os,
BaseCounts const &  bs 
)

Output stream operator for BaseCounts instances.

Definition at line 345 of file base_counts.cpp.

◆ operator<<() [2/2]

std::ostream & operator<< ( std::ostream &  os,
GenomeRegion const &  region 
)

Definition at line 48 of file functions/genome_region.cpp.

◆ operator==() [1/2]

bool genesis::population::operator== ( FstAN const &  f1,
FstAN const &  f2 
)
inline

Comparison operator equals for FstAN structs.

Definition at line 250 of file structure.hpp.

◆ operator==() [2/2]

bool operator== ( GenomeRegion const &  a,
GenomeRegion const &  b 
)

Definition at line 70 of file functions/genome_region.cpp.

◆ parse_genome_region()

GenomeRegion parse_genome_region ( std::string const &  region)

Parse a genomic region.

Accepted formats are "chromosome", "chromosome:position", "chromosome:start-end", and "chromosome:start..end".

Definition at line 75 of file functions/genome_region.cpp.

◆ parse_genome_regions()

GenomeRegionList parse_genome_regions ( std::string const &  regions)

Parse a set/list of genomic regions.

The individual regions need to be separated by commas (surrounding white space is okay), and each region needs to follow the format as explained in parse_genome_region().

Definition at line 132 of file functions/genome_region.cpp.

◆ pij_matrix_()

genesis::utils::Matrix<double> genesis::population::pij_matrix_ ( size_t  max_coverage,
size_t  poolsize 
)

Definition at line 310 of file diversity.cpp.

◆ pij_matrix_resolver_()

genesis::utils::Matrix<double> const& genesis::population::pij_matrix_resolver_ ( size_t  max_coverage,
size_t  poolsize 
)

Definition at line 344 of file diversity.cpp.

◆ pool_diversity_measures()

PoolDiversityResults genesis::population::pool_diversity_measures ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute Theta Pi, Theta Watterson, and Tajia's D in their pool-sequencing corrected versions according to Kofler et al.

This is a high level function that is meant as a simple example of how to compute these statistics. See theta_pi_pool(), theta_watterson_pool(), and tajima_d_pool() for details. It takes care of most options offered by PoPoolation (as given by settings here), except for the window width and stride and minimum phred quality score, which have to be applied before filling the window (or whatever other range is used as input here) before calling this function.

Furthermore, results here are not filtered aftwards, so any filtering based on e.g., minimum covered fraction has to be done downstream.

Definition at line 427 of file diversity.hpp.

◆ prob_cond_true_freq()

std::vector< double > prob_cond_true_freq ( size_t  n,
std::vector< bool > const &  alleles,
std::vector< unsigned char > const &  phred_scores,
bool  unfolded 
)

Definition at line 121 of file afs_estimate.cpp.

◆ prob_cond_true_freq_unfolded()

std::vector< double > prob_cond_true_freq_unfolded ( size_t  n,
std::vector< bool > const &  alleles,
std::vector< unsigned char > const &  phred_scores,
bool  invert_alleles 
)

Definition at line 145 of file afs_estimate.cpp.

◆ process_conditional_probability()

void genesis::population::process_conditional_probability ( ForwardIterator  begin,
ForwardIterator  end 
)

Compute the conditional probabilities of AFs. This reimplements process_probCond from Boitard et al.

Definition at line 100 of file afs_estimate.hpp.

◆ run_vcf_window()

void genesis::population::run_vcf_window ( SlidingWindowGenerator< Data, Accumulator > &  generator,
std::string const &  vcf_file,
std::function< Data(VcfRecord const &)>  conversion,
std::function< bool(VcfRecord const &)>  condition = {} 
)

Convenience function to iterate over a whole VCF file.

This function is convenience, and takes care of iterating a VCF file record by record (that is, line by line), using a provided conversion function to extract the D/Data from the VcfRecord. It furthermore takes care of finishing all chromosomes properly, using their lengths as provided in the VCF header.

Before calling the function, of course, all necessary plugin functions have to be set in the SlidingWindowGenerator instance, so that the data is processed as intended. In particular, take care of setting SlidingWindowGenerator::emit_incomplete_windows() to the desired value.

Furthermore, the function offers a condition function that can be used to skip records that do not fullfil a given condition. That is, if condition is used, it needs to return true for records that shall be processed, and false for those that shall be skipped.

Definition at line 70 of file vcf_window.hpp.

◆ sorted_variant_counts()

std::array< std::pair< char, size_t >, 4 > sorted_variant_counts ( Variant const &  variant,
bool  reference_first 
)

Definition at line 59 of file variant.cpp.

◆ status()

BaseCountsStatus status ( BaseCounts const &  sample,
size_t  min_coverage = 0,
size_t  max_coverage = 0,
size_t  min_count = 0,
bool  tolerate_deletions = false 
)

Compute a simple status with useful properties from the counts of a BaseCounts.

min_coverage

Minimum coverage expected for a BaseCounts to be considered "covered". If the number of nucleotides (A, C, G, T) in the reads of a sample is less then the here provided min_coverage, then the BaseCounts is not considered sufficiently covered, and the BaseCountsStatus::is_covered flag will be set to false.

max_coverage

Same as min_coverage, but the upper bound on coverage; maximum coverage expected for a BaseCounts to be considered "covered". If the number of nucleotides exceeds this bound, the BaseCountsStatus::is_covered flag will be set to false. If provided with a value of 0 (default), max_coverage is not used.

Only if the nucleotide count is in between (or equal to either) these two bounds (min_coverage and max_coverage), it is considered to be covered, and BaseCountsStatus::is_covered will be set to true.

min_count

This value is used to determine whether a BaseCounts has too many deletions, and unless tolerate_deletions() is set to true, the BaseCountsStatus::is_ignored will be set to true in that case (too many deletions, as given by BaseCounts::d_count), while the values for BaseCountsStatus::is_covered, BaseCountsStatus::is_snp, and BaseCountsStatus::is_biallelic will be set to false.

Typically, if this function is used after calling filter_min_count() on the BaseCounts, the min_count is set to the same value for consistency.

tolerate_deletions

Set whether we tolerate BaseCountss with a high amount of deletions.

If set to false (default), we do not tolerate deletions. In that case, if the number of deletions in a Sample (given by Sample::d_count) is higher than or equal to min_count(), the Sample will be considered ignored (Sample::is_ignored set to true), and considered not covered (Sample::is_covered, Sample::is_snp, and Sample::is_biallelic will all be set to false).

If however set to true, we tolerate high amounts of deletions, and the values for the above properties will be set as usual by considering the nucleotide counts (Sample::a_count, Sample::c_count, Sample::g_count, and Sample::t_count) instead.

Definition at line 46 of file base_counts.cpp.

◆ tajima_d_pool() [1/2]

double genesis::population::tajima_d_pool ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al.

Definition at line 397 of file diversity.hpp.

◆ tajima_d_pool() [2/2]

double genesis::population::tajima_d_pool ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end,
double  theta_pi,
double  theta_watterson 
)

Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al.

Definition at line 373 of file diversity.hpp.

◆ tajima_d_pool_denominator()

double tajima_d_pool_denominator ( PoolDiversitySettings const &  settings,
size_t  snp_count,
double  theta 
)

Compute the denominator for the pool-sequencing correction of Tajima's D according to Kofler et al.

Definition at line 435 of file diversity.cpp.

◆ theta_pi()

double genesis::population::theta_pi ( ForwardIterator  begin,
ForwardIterator  end 
)

Compute classic theta pi, that is, the sum of heterozygosities.

The function simply sums heterozygosity() for all samples in the given range.

Definition at line 154 of file diversity.hpp.

◆ theta_pi_pool() [1/2]

double genesis::population::theta_pi_pool ( PoolDiversitySettings const &  settings,
BaseCounts const &  sample 
)
inline

Compute theta pi with pool-sequencing correction according to Kofler et al, for a single BaseCounts, that is, its heterozygosity() divided by the correction denominator.

Definition at line 195 of file diversity.hpp.

◆ theta_pi_pool() [2/2]

double genesis::population::theta_pi_pool ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute theta pi with pool-sequencing correction according to Kofler et al, that is, the sum of heterozygosities divided by the correction denominator.

The function sums heterozygosity() for all samples in the given range, and divides each by the respective denominator to correct for error from pool sequencing. See theta_pi_pool_denominator() for details.

Definition at line 173 of file diversity.hpp.

◆ theta_pi_pool_denominator()

double theta_pi_pool_denominator ( PoolDiversitySettings const &  settings,
size_t  nucleotide_count 
)

Compute the denominator for the pool-sequencing correction of theta pi according to Kofler et al.

We here compute the denominator for a given poolsize, with a fix min_allele_count, which is identical for each given nucleotide_count, and henced cached internally for speedup.

See

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

Definition at line 120 of file diversity.cpp.

◆ theta_watterson_pool()

double genesis::population::theta_watterson_pool ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute theta watterson with pool-sequencing correction according to Kofler et al.

Definition at line 221 of file diversity.hpp.

◆ theta_watterson_pool_denominator()

double theta_watterson_pool_denominator ( PoolDiversitySettings const &  settings,
size_t  nucleotide_count 
)

Compute the denominator for the pool-sequencing correction of theta watterson according to Kofler et al.

We here compute the denominator for a given poolsize, with a fix min_allele_count, which is identical for each given nucleotide_count, and henced cached internally for speedup.

See

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

Definition at line 170 of file diversity.cpp.

◆ to_string()

std::string to_string ( GenomeRegion const &  region)

Definition at line 55 of file functions/genome_region.cpp.

◆ to_sync() [1/2]

std::ostream & to_sync ( BaseCounts const &  bs,
std::ostream &  os 
)

Output a BaseCounts instance to a stream in the PoPoolation2 sync format.

This is one column from that file, outputting the counts separated by colons, in the order A:T:C:G:N:D, with D being deletions (* in pileup).

Definition at line 352 of file base_counts.cpp.

◆ to_sync() [2/2]

std::ostream & to_sync ( Variant const &  var,
std::ostream &  os 
)

Output a Variant instance to a stream in the PoPoolation2 sync format.

The format is a tab-delimited file with one variant per line:

  • col1: reference contig
  • col2: position within the refernce contig
  • col3: reference character
  • col4: allele frequencies of population number 1
  • col5: allele frequencies of population number 2
  • coln: allele frequencies of population number n

Each population column outputs counts separated by colons, in the order A:T:C:G:N:D, with D being deletions (* in pileup).

See https://sourceforge.net/p/popoolation2/wiki/Tutorial/ for details.

Definition at line 147 of file variant.cpp.

◆ total_base_counts()

BaseCounts total_base_counts ( Variant const &  variant)

Definition at line 54 of file variant.cpp.

◆ vcf_genotype_string()

std::string vcf_genotype_string ( std::vector< VcfGenotype > const &  genotypes)

Return the VCF-like string representation of a set of VcfGenotype entries.

The VcfFormatIterator::get_values() function returns all genotype entries for a given sample of a record/line. Here, we return a string representation similar to VCF of these genotypes, for example 0|0 or ./1.

Definition at line 216 of file vcf_common.cpp.

◆ vcf_genotype_sum()

size_t vcf_genotype_sum ( std::vector< VcfGenotype > const &  genotypes)

Return the sum of genotypes for a set of VcfGenotype entries, typically used to construct a genotype matrix with entries 0,1,2.

The function takes the given genotypes, encodes the reference as 0 and any alternative as 1, and then sums this over the values. For diploid organisms, this yields possible results in the range of 0 (homozygote for the reference), 1 (heterzygote), or 2 (homozygote for the alternative), which is typically used in genotype matrices.

Definition at line 230 of file vcf_common.cpp.

◆ vcf_hl_type_to_string()

std::string vcf_hl_type_to_string ( int  hl_type)

Internal helper function to convert htslib-internal BCF_HL_* header line type values to their string representation as used in the VCF header ("FILTER", "INFO", "FORMAT", etc).

Definition at line 199 of file vcf_common.cpp.

◆ vcf_value_special_to_string() [1/2]

std::string vcf_value_special_to_string ( int  vl_type_num)

Definition at line 171 of file vcf_common.cpp.

◆ vcf_value_special_to_string() [2/2]

std::string vcf_value_special_to_string ( VcfValueSpecial  vl_type_num)

Definition at line 166 of file vcf_common.cpp.

◆ vcf_value_type_to_string() [1/2]

std::string vcf_value_type_to_string ( int  ht_type)

Definition at line 141 of file vcf_common.cpp.

◆ vcf_value_type_to_string() [2/2]

std::string vcf_value_type_to_string ( VcfValueType  ht_type)

Definition at line 136 of file vcf_common.cpp.

Typedef Documentation

◆ VcfFormatIteratorFloat

using VcfFormatIteratorFloat = VcfFormatIterator<float, double>

Definition at line 67 of file vcf_format_iterator.hpp.

◆ VcfFormatIteratorGenotype

Definition at line 68 of file vcf_format_iterator.hpp.

◆ VcfFormatIteratorInt

using VcfFormatIteratorInt = VcfFormatIterator<int32_t, int32_t>

Definition at line 66 of file vcf_format_iterator.hpp.

◆ VcfFormatIteratorString

using VcfFormatIteratorString = VcfFormatIterator<char*, std::string>

Definition at line 65 of file vcf_format_iterator.hpp.

Enumeration Type Documentation

◆ VcfHeaderLine

enum VcfHeaderLine : int
strong

Specification for the values determining header line types of VCF/BCF files.

This list contains the types of header lines that htslib uses for identification, as specified in the VCF header. Corresponds to the BCF_HL_* macro constants defined by htslib. We statically assert that these have the same values.

Enumerator
kFilter 
kInfo 
kFormat 
kContig 
kStructured 
kGeneric 

Definition at line 60 of file vcf_common.hpp.

◆ VcfValueSpecial

enum VcfValueSpecial : int
strong

Specification for special markers for the number of values expected for key-value-pairs of VCF/BCF files.

This list contains the special markers for the number of values of the INFO and FORMAT key-value pairs, as specified in the VCF header, and used in the record lines. Corresponds to the BCF_VL_* macro constants defined by htslib. We statically assert that these have the same values.

Enumerator
kFixed 

Fixed number of values expected. In VCF, this is denoted simply by an integer number.

This simply specifies that there is a fixed number of values to be expected; we do not further define how many exaclty are expected here (the integer value). This is taken care of in a separate variable that is provided whenever a fixed-size value is needed, see for example VcfSpecification.

kVariable 

Variable number of possible values, or unknown, or unbounded. In VCF, this is denoted by '.'.

kAllele 

One value per alternate allele. In VCF, this is denoted as 'A'.

kGenotype 

One value for each possible genotype (more relevant to the FORMAT tags). In VCF, this is denoated as 'G'.

kReference 

One value for each possible allele (including the reference). In VCF, this is denoted as 'R'.

Definition at line 95 of file vcf_common.hpp.

◆ VcfValueType

enum VcfValueType : int
strong

Specification for the data type of the values expected in key-value-pairs of VCF/BCF files.

This list contains the types of data in values of the INFO and FORMAT key-value pairs, as specified in the VCF header, and used in the record lines. Corresponds to the BCF_HT_* macro constants defined by htslib. We statically assert that these have the same values.

Enumerator
kFlag 
kInteger 
kFloat 
kString 

Definition at line 78 of file vcf_common.hpp.

◆ WindowAnchorType

enum WindowAnchorType
strong

Position in the genome that is used for reporting when emitting or using a window.

When a window is finished, the on_emission() plugin function is called, which reports the position in the genome at which the window is. There are several ways that this position is computed. Typically, just the first position of the window is used (that is, for an interval, the beginning of the interval, and for variants, the position of the first variant).

However, it might be desirable to report a different position, for example when plotting the results. When using WindowType::kVariants for example, one might want to plot the values computed per window at the midpoint genome position of the variants in that window.

Enumerator
kIntervalBegin 
kIntervalEnd 
kIntervalMidpoint 
kVariantFirst 
kVariantLast 
kVariantMedian 
kVariantMean 
kVariantMidpoint 

Definition at line 69 of file window.hpp.

◆ WindowType

enum WindowType
strong

WindowType of a Window, that is, whether we slide along a fixed size interval of the genome, or along a fixed number of variants.

Enumerator
kInterval 
kVariants 

Definition at line 51 of file window.hpp.