A library for working with phylogenetic and population genetic data.
v0.27.0
genesis::population Namespace Reference

Classes

struct  AfsPileupRecord
 Helper to store the data of one pileup line/record needed for the Boitard et al Allele Frequency Estimation computation. More...
 
class  AlleleFrequencyWindow
 
struct  BaseCounts
 One set of nucleotide base counts, for example for a given sample that represents a pool of sequenced individuals. More...
 
struct  BaseCountsStatus
 
class  BaseWindowIterator
 Base iterator class for Windows over the chromosomes of a genome. More...
 
class  BedReader
 Reader for BED (Browser Extensible Data) files. More...
 
struct  EmptyAccumulator
 Empty helper data struct to serve as a dummy for Window. More...
 
struct  EmptyGenomeData
 Helper struct to define a default empty data for the classes GenomeLocus, GenomeRegion, and GenomeRegionList. More...
 
class  GenomeHeatmap
 
struct  GenomeLocus
 A single locus, that is, a position (or coordinate) on a chromosome. More...
 
struct  GenomeRegion
 A region (between two positions) on a chromosome. More...
 
class  GenomeRegionList
 List of regions in a genome, for each chromosome. More...
 
class  GffReader
 Reader for GFF2 and GFF3 (General Feature Format) and GTF (General Transfer Format) files. More...
 
class  HeatmapColorization
 
class  HeatmapMatrix
 Matrix to capture and accumulate columns of per-position or per-window values along a chromosome. More...
 
class  HtsFile
 Wrap an ::htsFile struct. More...
 
struct  PoolDiversityResults
 Data struct to collect all diversity statistics computed by pool_diversity_measures(). More...
 
struct  PoolDiversitySettings
 Settings used by different pool-sequencing corrected diversity statistics. More...
 
class  RegionWindowIterator
 Iterator for Windows representing regions of a genome. More...
 
class  SamVariantInputIterator
 Input iterator for SAM/BAM/CRAM files that produces a Variant per genome position. More...
 
class  SimplePileupInputIterator
 Iterate an input source and parse it as a (m)pileup file. More...
 
class  SimplePileupReader
 Reader for line-by-line assessment of (m)pileup files. More...
 
class  SlidingIntervalWindowIterator
 Iterator for sliding Windows of fixed sized intervals over the chromosomes of a genome. More...
 
class  SlidingVariantsWindowIterator
 Iterator for sliding Windows of a fixed number of variants over the chromosomes of a genome. More...
 
class  SlidingWindowGenerator
 Generator for sliding Windows over the chromosomes of a genome. More...
 
struct  SortedBaseCounts
 Ordered array of base counts for the four nucleotides. More...
 
class  SyncInputIterator
 Iterate an input source and parse it as a sync file. More...
 
class  SyncReader
 Reader for PoPoolation2's "synchronized" files. More...
 
struct  Variant
 A single variant at a position in a chromosome, along with BaseCounts for a set of samples. More...
 
struct  VariantInputIteratorData
 Data storage for input-specific information when traversing a variant file. More...
 
class  VariantParallelInputIterator
 Iterate multiple input sources that yield Variants in parallel. More...
 
class  VcfFormatHelper
 Provide htslib helper functions. More...
 
class  VcfFormatIterator
 Iterate the FORMAT information for the samples in a SNP/variant line in a VCF/BCF file. More...
 
class  VcfGenotype
 Simple wrapper class for one genotype field for a sample. More...
 
class  VcfHeader
 Capture the information from a header of a VCF/BCF file. More...
 
class  VcfInputIterator
 Iterate an input source and parse it as a VCF/BCF file. More...
 
class  VcfRecord
 Capture the information of a single SNP/variant line in a VCF/BCF file. More...
 
struct  VcfSpecification
 Collect the four required keys that describe an INFO or FORMAT sub-field of VCF/BCF files. More...
 
class  Window
 Window over the chromosomes of a genome. More...
 

Functions

double a_n (size_t n)
 Compute a_n, the sum of reciprocals. More...
 
double alpha_star (double n)
 Compute alpha* according to Achaz 2008 and Kofler et al. 2011. More...
 
double amnm_ (size_t poolsize, size_t nucleotide_count, size_t allele_frequency)
 Local helper function to compute values for the denominator. More...
 
template<class D , class A = EmptyAccumulator>
size_t anchor_position (Window< D, A > const &window, WindowAnchorType anchor_type=WindowAnchorType::kIntervalBegin)
 Get the position in the chromosome reported according to a specific WindowAnchorType. More...
 
double b_n (size_t n)
 Compute b_n, the sum of squared reciprocals. More...
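The two sums described for a_n and b_n can be sketched as follows. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and the summation bounds (i = 1 to n-1, the usual convention for Tajima's D) are an assumption that the brief descriptions above do not confirm.

```cpp
#include <cstddef>

// Hypothetical sketch: sum of reciprocals 1/i.
// Assumption: the sum runs over i = 1 .. n-1, as is common for Tajima's D.
double harmonic_a_n_sketch(std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 1; i < n; ++i) {
        sum += 1.0 / static_cast<double>(i);
    }
    return sum;
}

// Hypothetical sketch: sum of squared reciprocals 1/i^2, same bounds as above.
double harmonic_b_n_sketch(std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 1; i < n; ++i) {
        double const d = static_cast<double>(i);
        sum += 1.0 / (d * d);
    }
    return sum;
}
```

With these bounds, n = 3 gives 1 + 1/2 for the first sum and 1 + 1/4 for the second.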
 
double beta_star (double n)
 Compute beta* according to Achaz 2008 and Kofler et al. 2011. More...
 
std::pair< char, double > consensus (BaseCounts const &sample)
 Consensus character for a BaseCounts, and its confidence. More...
 
std::pair< char, double > consensus (BaseCounts const &sample, BaseCountsStatus const &status)
 Consensus character for a BaseCounts, and its confidence. More...
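One plausible reading of "consensus character and its confidence" is the most frequent nucleotide together with its share of the total count. The sketch below illustrates that reading only. The type alias, the function name, the tie-breaking rule (first in A, C, G, T order), and the 'N' result for empty input are all assumptions, not taken from the library.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for a base-count record: counts for A, C, G, T in that order.
using ConsensusCountsSketch = std::array<std::size_t, 4>;

// Return the most frequent base and its fraction of the total count.
// Assumptions: ties resolve to the first base in ACGT order; no counts yields ('N', 0.0).
std::pair<char, double> consensus_sketch(ConsensusCountsSketch const& counts)
{
    static char const bases[4] = { 'A', 'C', 'G', 'T' };
    std::size_t total = 0;
    std::size_t best = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        total += counts[i];
        if (counts[i] > counts[best]) {
            best = i;
        }
    }
    if (total == 0) {
        return { 'N', 0.0 };
    }
    return { bases[best], static_cast<double>(counts[best]) / static_cast<double>(total) };
}
```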
 
AfsPileupRecord convert_to_afs_pileup_record (SimplePileupReader::Record const &record)
 
BaseCounts convert_to_base_counts (SimplePileupReader::Sample const &sample, unsigned char min_phred_score)
 
Variant convert_to_variant (SimplePileupReader::Record const &record, unsigned char min_phred_score)
 
Variant convert_to_variant_as_individuals (VcfRecord const &record, bool use_allelic_depth=false)
 Convert a VcfRecord to a Variant, treating each sample as an individual, and combining them all into one BaseCounts sample. More...
 
Variant convert_to_variant_as_pool (VcfRecord const &record)
 Convert a VcfRecord to a Variant, treating each sample column as a pool of individuals. More...
 
template<class ForwardIterator1 , class ForwardIterator2 >
double f_st_pool_karlsson (ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end)
 Compute the F_ST statistic for pool-sequenced data of Karlsson et al as used in PoPoolation2, for two ranges of BaseCounts. More...
 
std::pair< double, double > f_st_pool_karlsson_nkdk (std::pair< SortedBaseCounts, SortedBaseCounts > const &sample_counts)
 Compute the numerator N_k and denominator D_k needed for the asymptotically unbiased F_ST estimator of Karlsson et al (2007). More...
 
template<class ForwardIterator1 , class ForwardIterator2 >
double f_st_pool_kofler (size_t p1_poolsize, size_t p2_poolsize, ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end)
 Compute the F_ST statistic for pool-sequenced data of Kofler et al as used in PoPoolation2, for two ranges of BaseCounts. More...
 
std::tuple< double, double, double > f_st_pool_kofler_pi_snp (BaseCounts const &p1, BaseCounts const &p2)
 Compute the SNP-based Theta Pi values used in f_st_pool_kofler(). More...
 
template<class ForwardIterator1 , class ForwardIterator2 >
std::pair< double, double > f_st_pool_unbiased (size_t p1_poolsize, size_t p2_poolsize, ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end)
 Compute our unbiased F_ST statistic for pool-sequenced data for two ranges of BaseCounts. More...
 
std::tuple< double, double, double > f_st_pool_unbiased_pi_snp (size_t p1_poolsize, size_t p2_poolsize, BaseCounts const &p1, BaseCounts const &p2)
 Compute the SNP-based Theta Pi values used in f_st_pool_unbiased(). More...
 
double f_star (double a_n, double n)
 Compute f* according to Achaz 2008 and Kofler et al. 2011. More...
 
std::function< bool(Variant const &)> filter_by_region (GenomeRegion const &region, bool complement=false)
 Filter function to be used with VariantInputIterator to filter by a genome region. More...
 
std::function< bool(Variant const &)> filter_by_region (GenomeRegionList const &regions, bool complement=false, bool copy_regions=false)
 Filter function to be used with VariantInputIterator to filter by a list of genome regions. More...
 
std::function< bool(Variant const &)> filter_by_region (std::shared_ptr< GenomeRegionList > regions, bool complement=false)
 Filter function to be used with VariantInputIterator to filter by a list of genome regions. More...
 
bool filter_by_status (std::function< bool(BaseCountsStatus const &)> predicate, Variant const &variant, SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false)
 Filter a Variant based on a predicate that is applied to the result of a status() call on the BaseCounts of the variant. More...
 
std::function< bool(Variant const &)> filter_by_status (std::function< bool(BaseCountsStatus const &)> predicate, SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false)
 Filter a Variant based on a predicate that is applied to the result of a status() call on the BaseCounts of the variant. More...
 
std::function< bool(Variant const &)> filter_is_biallelic_snp (SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false)
 Filter a Variant based on whether the sample counts are biallelic SNPs, that is, exactly two base counts in [ACGT] are non-zero. More...
 
bool filter_is_biallelic_snp (Variant const &variant, SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false)
 Filter a Variant based on whether the sample counts are biallelic SNPs, that is, exactly two base counts in [ACGT] are non-zero. More...
 
std::function< bool(Variant const &)> filter_is_snp (SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false)
 Filter a Variant based on whether the sample counts are SNPs, that is, more than one count in [ACGT] is non-zero. More...
 
bool filter_is_snp (Variant const &variant, SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false)
 Filter a Variant based on whether the sample counts are SNPs, that is, more than one count in [ACGT] is non-zero. More...
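The two SNP criteria stated above (more than one non-zero count in [ACGT], and exactly two non-zero counts for the biallelic case) reduce to counting non-zero entries. The sketch below shows only that core test on a plain array. The function names are hypothetical, and the coverage limits, min_count, deletion handling, and SampleFilterType logic of the actual filters are deliberately left out.

```cpp
#include <array>
#include <cstddef>

// Count how many of the four nucleotide counts (A, C, G, T) are non-zero.
std::size_t count_nonzero_bases_sketch(std::array<std::size_t, 4> const& acgt)
{
    std::size_t nonzero = 0;
    for (std::size_t c : acgt) {
        if (c > 0) {
            ++nonzero;
        }
    }
    return nonzero;
}

// SNP: more than one nucleotide count is non-zero.
bool looks_like_snp_sketch(std::array<std::size_t, 4> const& acgt)
{
    return count_nonzero_bases_sketch(acgt) > 1;
}

// Biallelic SNP: exactly two nucleotide counts are non-zero.
bool looks_like_biallelic_snp_sketch(std::array<std::size_t, 4> const& acgt)
{
    return count_nonzero_bases_sketch(acgt) == 2;
}
```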
 
GenomeRegionList genome_region_list_from_vcf_file (std::string const &file)
 Read a VCF file, and use its positions to create a GenomeRegionList. More...
 
void genome_region_list_from_vcf_file (std::string const &file, GenomeRegionList &target)
 Read a VCF file, and add its positions to an existing GenomeRegionList. More...
 
size_t get_base_count (BaseCounts const &bc, char base)
 Get the count for a base given as a char. More...
 
std::pair< std::array< char, 6 >, size_t > get_vcf_record_snp_ref_alt_chars_ (VcfRecord const &record)
 Local helper function that returns the REF and ALT chars of a VcfRecord for SNPs. More...
 
char guess_alternative_base (Variant const &variant, bool force=true)
 Guess the alternative base of a Variant. More...
 
char guess_reference_base (Variant const &variant)
 Guess the reference base of a Variant. More...
 
double heterozygosity (BaseCounts const &sample, bool with_bessel=false)
 Compute classic heterozygosity. More...
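As an illustration of what "classic heterozygosity" usually means, the sketch below computes H = 1 - sum of squared nucleotide frequencies, optionally multiplied by Bessel's correction n / (n - 1). The function name and the handling of empty or single-count input are assumptions. It operates on a plain array rather than on BaseCounts, and it is not the library's implementation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical sketch: H = 1 - sum_i f_i^2 over the A, C, G, T frequencies.
// With with_bessel, the result is scaled by n / (n - 1), where n is the total count.
// Assumption: returns 0.0 when there is too little data to compute a value.
double heterozygosity_sketch(std::array<std::size_t, 4> const& acgt, bool with_bessel = false)
{
    std::size_t n = 0;
    for (std::size_t c : acgt) {
        n += c;
    }
    if (n == 0 || (with_bessel && n < 2)) {
        return 0.0;
    }
    double h = 1.0;
    for (std::size_t c : acgt) {
        double const f = static_cast<double>(c) / static_cast<double>(n);
        h -= f * f;
    }
    if (with_bessel) {
        h *= static_cast<double>(n) / static_cast<double>(n - 1);
    }
    return h;
}
```

For counts of 5 and 5 on two bases, the uncorrected value is 0.5, and the Bessel-corrected value is 0.5 * 10 / 9.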
 
bool is_covered (GenomeRegion const &region, std::string const &chromosome, size_t position)
 Test whether the chromosome/position is within a given genomic region. More...
 
template<class T >
bool is_covered (GenomeRegion const &region, T const &locus)
 Test whether the chromosome/position of a locus is within a given genomic region. More...
 
bool is_covered (GenomeRegion const &region, VcfRecord const &variant)
 
bool is_covered (GenomeRegionList const &regions, std::string const &chromosome, size_t position)
 Test whether the chromosome/position is within a given list of genomic regions. More...
 
template<class T >
bool is_covered (GenomeRegionList const &regions, T const &locus)
 Test whether the chromosome/position of a locus is within a given list of genomic regions. More...
 
bool is_covered (GenomeRegionList const &regions, VcfRecord const &variant)
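The coverage test for a single region can be sketched as a chromosome match plus an interval check. The struct and function names below are hypothetical, and treating both interval ends as inclusive is an assumption (suggested by the end_exclusive=false default of parse_genome_region(), but not confirmed here).

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a genomic region with inclusive start and end positions.
struct RegionSketch {
    std::string chromosome;
    std::size_t start;
    std::size_t end;
};

// True if the position lies on the same chromosome and within [start, end].
bool is_covered_sketch(
    RegionSketch const& region, std::string const& chromosome, std::size_t position
) {
    return region.chromosome == chromosome
        && region.start <= position
        && position <= region.end;
}
```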
 
int locus_compare (GenomeLocus const &l, GenomeLocus const &r)
 Three-way comparison (spaceship operator <=>) for two loci in a genome. More...
 
int locus_compare (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position)
 Three-way comparison (spaceship operator <=>) for two loci in a genome. More...
 
int locus_compare (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r)
 Three-way comparison (spaceship operator <=>) for two loci in a genome. More...
 
int locus_compare (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position)
 Three-way comparison (spaceship operator <=>) for two loci in a genome. More...
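A three-way comparison of loci can be sketched as comparing chromosomes first and positions second, returning -1, 0, or 1. The struct and function names are hypothetical. Ordering chromosome names lexicographically is an assumption made for this sketch (so "chr10" sorts before "chr2"); the library may order chromosomes differently.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a locus: chromosome name and position.
struct LocusSketch {
    std::string chromosome;
    std::size_t position;
};

// Return -1 if l < r, 0 if equal, 1 if l > r.
// Assumption: chromosomes are ordered lexicographically by name.
int locus_compare_sketch(LocusSketch const& l, LocusSketch const& r)
{
    int const chr_cmp = l.chromosome.compare(r.chromosome);
    if (chr_cmp != 0) {
        return chr_cmp < 0 ? -1 : 1;
    }
    if (l.position < r.position) {
        return -1;
    }
    if (l.position > r.position) {
        return 1;
    }
    return 0;
}
```

The remaining comparisons (equal, less, greater, and so on) then follow from the sign of this result, for example `locus_compare_sketch(l, r) < 0` for "less than".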
 
bool locus_equal (GenomeLocus const &l, GenomeLocus const &r)
 Equality comparison (==) for two loci in a genome. More...
 
bool locus_equal (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position)
 Equality comparison (==) for two loci in a genome. More...
 
bool locus_equal (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r)
 Equality comparison (==) for two loci in a genome. More...
 
bool locus_equal (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position)
 Equality comparison (==) for two loci in a genome. More...
 
bool locus_greater (GenomeLocus const &l, GenomeLocus const &r)
 Greater than comparison (>) for two loci in a genome. More...
 
bool locus_greater (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position)
 Greater than comparison (>) for two loci in a genome. More...
 
bool locus_greater (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r)
 Greater than comparison (>) for two loci in a genome. More...
 
bool locus_greater (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position)
 Greater than comparison (>) for two loci in a genome. More...
 
bool locus_greater_or_equal (GenomeLocus const &l, GenomeLocus const &r)
 Greater than or equal comparison (>=) for two loci in a genome. More...
 
bool locus_greater_or_equal (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position)
 Greater than or equal comparison (>=) for two loci in a genome. More...
 
bool locus_greater_or_equal (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r)
 Greater than or equal comparison (>=) for two loci in a genome. More...
 
bool locus_greater_or_equal (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position)
 Greater than or equal comparison (>=) for two loci in a genome. More...
 
bool locus_inequal (GenomeLocus const &l, GenomeLocus const &r)
 Inequality comparison (!=) for two loci in a genome. More...
 
bool locus_inequal (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position)
 Inequality comparison (!=) for two loci in a genome. More...
 
bool locus_inequal (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r)
 Inequality comparison (!=) for two loci in a genome. More...
 
bool locus_inequal (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position)
 Inequality comparison (!=) for two loci in a genome. More...
 
bool locus_less (GenomeLocus const &l, GenomeLocus const &r)
 Less than comparison (<) for two loci in a genome. More...
 
bool locus_less (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position)
 Less than comparison (<) for two loci in a genome. More...
 
bool locus_less (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r)
 Less than comparison (<) for two loci in a genome. More...
 
bool locus_less (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position)
 Less than comparison (<) for two loci in a genome. More...
 
bool locus_less_or_equal (GenomeLocus const &l, GenomeLocus const &r)
 Less than or equal comparison (<=) for two loci in a genome. More...
 
bool locus_less_or_equal (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position)
 Less than or equal comparison (<=) for two loci in a genome. More...
 
bool locus_less_or_equal (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r)
 Less than or equal comparison (<=) for two loci in a genome. More...
 
bool locus_less_or_equal (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position)
 Less than or equal comparison (<=) for two loci in a genome. More...
 
template<class ForwardIterator >
SlidingIntervalWindowIterator< ForwardIterator > make_default_sliding_interval_window_iterator (ForwardIterator begin, ForwardIterator end, size_t width=0, size_t stride=0)
 Helper function to instantiate a SlidingIntervalWindowIterator for a default use case. More...
 
template<class ForwardIterator >
SlidingVariantsWindowIterator< ForwardIterator > make_default_sliding_variants_window_iterator (ForwardIterator begin, ForwardIterator end, size_t width=0, size_t stride=0)
 Helper function to instantiate a SlidingVariantsWindowIterator for a default use case. More...
 
template<class T , class R >
std::shared_ptr< T > make_input_iterator_with_sample_filter_ (std::string const &filename, R const &reader, std::vector< size_t > const &sample_indices, bool inverse_sample_indices, std::vector< bool > const &sample_filter)
 Local helper function template that takes care of initializing an input iterator, and setting the sample filters, for those iterators for which we do not know the number of samples prior to starting the file iteration. More...
 
template<class ForwardIterator , class DataType = typename ForwardIterator::value_type>
SlidingIntervalWindowIterator< ForwardIterator, DataType > make_sliding_interval_window_iterator (ForwardIterator begin, ForwardIterator end, size_t width=0, size_t stride=0)
 Helper function to instantiate a SlidingIntervalWindowIterator without the need to specify the template parameters manually. More...
 
template<class ForwardIterator , class DataType = typename ForwardIterator::value_type>
SlidingVariantsWindowIterator< ForwardIterator, DataType > make_sliding_variants_window_iterator (ForwardIterator begin, ForwardIterator end, size_t width=0, size_t stride=0)
 Helper function to instantiate a SlidingVariantsWindowIterator without the need to specify the template parameters manually. More...
 
VariantInputIterator make_variant_input_iterator_from_individual_vcf_file (std::string const &filename, bool use_allelic_depth=false, bool only_biallelic=true, bool only_filter_pass=true)
 Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as an individual, and combining them all into one BaseCounts sample. More...
 
VariantInputIterator make_variant_input_iterator_from_individual_vcf_file (std::string const &filename, std::vector< std::string > const &sample_names, bool inverse_sample_names=false, bool use_allelic_depth=false, bool only_biallelic=true, bool only_filter_pass=true)
 Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as an individual, and combining them all into one BaseCounts sample. More...
 
VariantInputIterator make_variant_input_iterator_from_pileup_file (std::string const &filename, SimplePileupReader const &reader=SimplePileupReader{})
 Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants. More...
 
VariantInputIterator make_variant_input_iterator_from_pileup_file (std::string const &filename, std::vector< bool > const &sample_filter, SimplePileupReader const &reader=SimplePileupReader{})
 Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants. More...
 
VariantInputIterator make_variant_input_iterator_from_pileup_file (std::string const &filename, std::vector< size_t > const &sample_indices, bool inverse_sample_indices=false, SimplePileupReader const &reader=SimplePileupReader{})
 Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants. More...
 
VariantInputIterator make_variant_input_iterator_from_pileup_file_ (std::string const &filename, SimplePileupReader const &reader, std::vector< size_t > const &sample_indices, bool inverse_sample_indices, std::vector< bool > const &sample_filter)
 Local helper function that takes care of the three functions below. More...
 
VariantInputIterator make_variant_input_iterator_from_pool_vcf_file (std::string const &filename, bool only_biallelic=true, bool only_filter_pass=true)
 Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as a pool of individuals. More...
 
VariantInputIterator make_variant_input_iterator_from_pool_vcf_file (std::string const &filename, std::vector< std::string > const &sample_names, bool inverse_sample_names=false, bool only_biallelic=true, bool only_filter_pass=true)
 Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as a pool of individuals. More...
 
VariantInputIterator make_variant_input_iterator_from_sam_file (std::string const &filename, SamVariantInputIterator const &reader=SamVariantInputIterator{})
 Create a VariantInputIterator to iterate the contents of a SAM/BAM/CRAM file as Variants. More...
 
VariantInputIterator make_variant_input_iterator_from_sync_file (std::string const &filename)
 Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants. More...
 
VariantInputIterator make_variant_input_iterator_from_sync_file (std::string const &filename, std::vector< bool > const &sample_filter)
 Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants. More...
 
VariantInputIterator make_variant_input_iterator_from_sync_file (std::string const &filename, std::vector< size_t > const &sample_indices, bool inverse_sample_indices=false)
 Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants. More...
 
VariantInputIterator make_variant_input_iterator_from_sync_file_ (std::string const &filename, std::vector< size_t > const &sample_indices, bool inverse_sample_indices, std::vector< bool > const &sample_filter)
 
VariantInputIterator make_variant_input_iterator_from_variant_parallel_input_iterator (VariantParallelInputIterator const &parallel_input, bool allow_ref_base_mismatches=false, bool allow_alt_base_mismatches=true, std::string const &source_sample_separator=":")
 Create a VariantInputIterator to iterate multiple input sources at once, using a VariantParallelInputIterator. More...
 
VariantInputIterator make_variant_input_iterator_from_vcf_file_ (std::string const &filename, std::vector< std::string > const &sample_names, bool inverse_sample_names, bool pool_samples, bool use_allelic_depth, bool only_biallelic, bool only_filter_pass)
 Local helper function that takes care of both main functions below. More...
 
BaseCounts merge (BaseCounts const &p1, BaseCounts const &p2)
 Merge the counts of two BaseCounts. More...
 
BaseCounts merge (std::vector< BaseCounts > const &p)
 Merge the counts of a vector of BaseCounts. More...
 
void merge_inplace (BaseCounts &p1, BaseCounts const &p2)
 Merge the counts of two BaseCounts, by adding the counts of the second (p2) to the first (p1). More...
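The relationship between the three merge functions can be sketched with the in-place version as the building block. The struct below is a hypothetical stand-in with only the four nucleotide counts; any further count fields of the real type would presumably be summed the same way, which is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a base-count record (nucleotide counts only).
struct MergeCountsSketch {
    std::size_t a;
    std::size_t c;
    std::size_t g;
    std::size_t t;
};

// Add the counts of p2 to p1.
void merge_inplace_sketch(MergeCountsSketch& p1, MergeCountsSketch const& p2)
{
    p1.a += p2.a;
    p1.c += p2.c;
    p1.g += p2.g;
    p1.t += p2.t;
}

// Return the sum of two records, leaving both inputs untouched.
MergeCountsSketch merge_sketch(MergeCountsSketch const& p1, MergeCountsSketch const& p2)
{
    MergeCountsSketch result = p1;
    merge_inplace_sketch(result, p2);
    return result;
}

// Return the sum of all records in a vector (all zeros for an empty vector).
MergeCountsSketch merge_sketch(std::vector<MergeCountsSketch> const& p)
{
    MergeCountsSketch result{};
    for (auto const& sample : p) {
        merge_inplace_sketch(result, sample);
    }
    return result;
}
```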
 
double n_base (size_t coverage, size_t poolsize)
 Compute the n_base term used for Tajima's D in Kofler et al. 2011, using a faster closed form expression. More...
 
double n_base_matrix (size_t coverage, size_t poolsize)
 Compute the n_base term used for Tajima's D in Kofler et al. 2011, following their approach. More...
 
template<typename T >
std::array< size_t, 4 > nucleotide_sorting_order_ (std::array< T, 4 > const &values)
 Local helper function that runs a sorting network to sort four values, coming from the four nucleotides. More...
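A sorting network for four values needs only five compare-and-swap steps in a fixed order. The sketch below returns the indices of the input values sorted largest first. The function name, the descending order, and the choice of comparator sequence are assumptions made for illustration; the library's helper may differ in any of these.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Return the indices 0..3 of the input values, ordered so that the largest value comes first.
// Uses the standard 5-comparator network for four elements: (0,1) (2,3) (0,2) (1,3) (1,2).
template<typename T>
std::array<std::size_t, 4> sorting_order_sketch(std::array<T, 4> const& values)
{
    std::array<std::size_t, 4> idx = {{ 0, 1, 2, 3 }};

    // Swap two index slots if the value in the first slot is smaller than in the second.
    auto compare_swap = [&](std::size_t i, std::size_t j) {
        if (values[idx[i]] < values[idx[j]]) {
            std::swap(idx[i], idx[j]);
        }
    };

    compare_swap(0, 1);
    compare_swap(2, 3);
    compare_swap(0, 2);
    compare_swap(1, 3);
    compare_swap(1, 2);
    return idx;
}
```

Because the comparator sequence is fixed, the function does the same small amount of work for any input, which suits a hot path that sorts exactly four nucleotide counts.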
 
size_t nucleotide_sum (BaseCounts const &sample)
 Count of the pure nucleotide bases at this position, that is, the sum of all A, C, G, and T. More...
 
bool operator!= (GenomeLocus const &l, GenomeLocus const &r)
 Inequality comparison (!=) for two loci in a genome. More...
 
bool operator!= (GenomeRegion const &a, GenomeRegion const &b)
 Inequality comparison (!=) for two GenomeRegions. More...
 
bool operator< (GenomeLocus const &l, GenomeLocus const &r)
 Less than comparison (<) for two loci in a genome. More...
 
std::ostream & operator<< (std::ostream &os, BaseCounts const &bs)
 Output stream operator for BaseCounts instances. More...
 
std::ostream & operator<< (std::ostream &os, GenomeLocus const &locus)
 
std::ostream & operator<< (std::ostream &os, GenomeRegion const &region)
 
bool operator<= (GenomeLocus const &l, GenomeLocus const &r)
 Less than or equal comparison (<=) for two loci in a genome. More...
 
bool operator== (GenomeLocus const &l, GenomeLocus const &r)
 Equality comparison (==) for two loci in a genome. More...
 
bool operator== (GenomeRegion const &a, GenomeRegion const &b)
 Equality comparison (==) for two GenomeRegions. More...
 
bool operator> (GenomeLocus const &l, GenomeLocus const &r)
 Greater than comparison (>) for two loci in a genome. More...
 
bool operator>= (GenomeLocus const &l, GenomeLocus const &r)
 Greater than or equal comparison (>=) for two loci in a genome. More...
 
GenomeRegion parse_genome_region (std::string const &region, bool zero_based=false, bool end_exclusive=false)
 Parse a genomic region. More...
 
GenomeRegionList parse_genome_regions (std::string const &regions, bool zero_based=false, bool end_exclusive=false)
 Parse a set/list of genomic regions. More...
 
genesis::utils::Matrix< double > pij_matrix_ (size_t max_coverage, size_t poolsize)
 
genesis::utils::Matrix< double > const & pij_matrix_resolver_ (size_t max_coverage, size_t poolsize)
 
template<class ForwardIterator >
PoolDiversityResults pool_diversity_measures (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end)
 Compute Theta Pi, Theta Watterson, and Tajima's D in their pool-sequencing corrected versions according to Kofler et al. More...
 
std::vector< double > prob_cond_true_freq (size_t n, std::vector< bool > const &alleles, std::vector< unsigned char > const &phred_scores, bool unfolded)
 
std::vector< double > prob_cond_true_freq_unfolded (size_t n, std::vector< bool > const &alleles, std::vector< unsigned char > const &phred_scores, bool invert_alleles)
 
template<class ForwardIterator >
void process_conditional_probability (ForwardIterator begin, ForwardIterator end)
 Compute the conditional probabilities of AFs. This reimplements process_probCond from Boitard et al. More...
 
void process_pileup_correct_input_order_check_ (utils::InputStream const &it, std::string &cur_chr, size_t &cur_pos, std::string const &new_chr, size_t new_pos)
 Local helper function to remove code duplication for the correct input order check. More...
 
void process_sync_correct_input_order_ (utils::InputStream const &it, std::string &cur_chr, size_t &cur_pos, Variant const &new_var)
 Local helper function to remove code duplication for the correct input order check. More...
 
template<class Data , class Accumulator = EmptyAccumulator>
void run_vcf_window (SlidingWindowGenerator< Data, Accumulator > &generator, std::string const &vcf_file, std::function< Data(VcfRecord const &)> conversion, std::function< bool(VcfRecord const &)> condition={})
 Convenience function to iterate over a whole VCF file. More...
 
std::string sam_flag_to_string (int flags)
 Turn a set of flags for sam/bam/cram reads into their textual representation. More...
 
template<>
void SimplePileupReader::process_ancestral_base_< SimplePileupReader::Sample > (utils::InputStream &input_stream, SimplePileupReader::Sample &sample) const
 
template<>
void SimplePileupReader::process_quality_string_< SimplePileupReader::Sample > (utils::InputStream &input_stream, SimplePileupReader::Sample &sample) const
 
template<>
void SimplePileupReader::set_sample_read_bases_< SimplePileupReader::Sample > (std::string const &read_bases, SimplePileupReader::Sample &sample) const
 
template<>
void SimplePileupReader::set_sample_read_coverage_< SimplePileupReader::Sample > (size_t read_coverage, SimplePileupReader::Sample &sample) const
 
template<>
void SimplePileupReader::set_target_alternative_base_< SimplePileupReader::Record > (SimplePileupReader::Record &target) const
 
std::pair< SortedBaseCounts, SortedBaseCounts > sorted_average_base_counts (BaseCounts const &sample_a, BaseCounts const &sample_b)
 Return the sorted base counts of both input samples, ordered by the average frequencies of the nucleotide counts in the two samples. More...
 
SortedBaseCounts sorted_base_counts (BaseCounts const &sample)
 Return the order of base counts (nucleotides), largest one first. More...
 
SortedBaseCounts sorted_base_counts (Variant const &variant, bool reference_first)
 Get a list of bases sorted by their counts. More...
 
BaseCountsStatus status (BaseCounts const &sample, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false)
 Compute a simple status with useful properties from the counts of a BaseCounts. More...
 
int string_to_sam_flag (std::string const &value)
 Parse a string as a set of flags for sam/bam/cram reads. More...
 
template<class ForwardIterator >
double tajima_d_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end)
 Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al. More...
 
template<class ForwardIterator >
double tajima_d_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end, double theta_pi, double theta_watterson)
 Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al. More...
 
double tajima_d_pool_denominator (PoolDiversitySettings const &settings, size_t snp_count, double theta)
 Compute the denominator for the pool-sequencing correction of Tajima's D according to Kofler et al. More...
 
template<class ForwardIterator >
double theta_pi (ForwardIterator begin, ForwardIterator end, bool with_bessel=true)
 Compute classic theta pi, that is, the sum of heterozygosities. More...
 
double theta_pi_pool (PoolDiversitySettings const &settings, BaseCounts const &sample)
 Compute theta pi with pool-sequencing correction according to Kofler et al, for a single BaseCounts, that is, its heterozygosity() including Bessel's correction for the total nucleotide count at each position, divided by the correction denominator. More...
 
template<class ForwardIterator >
double theta_pi_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end)
 Compute theta pi with pool-sequencing correction according to Kofler et al, that is, the sum of heterozygosities divided by the correction denominator. More...
 
double theta_pi_pool_denominator (PoolDiversitySettings const &settings, size_t nucleotide_count)
 Compute the denominator for the pool-sequencing correction of theta pi according to Kofler et al. More...
 
template<class ForwardIterator >
double theta_pi_within_pool (ForwardIterator begin, ForwardIterator end, size_t poolsize)
 Compute classic theta pi (within a population), that is, the sum of heterozygosities including Bessel's correction for total nucleotide sum at each position, and Bessel's correction for the pool size. More...
 
template<class ForwardIterator >
double theta_watterson_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end)
 Compute theta watterson with pool-sequencing correction according to Kofler et al. More...
 
double theta_watterson_pool_denominator (PoolDiversitySettings const &settings, size_t nucleotide_count)
 Compute the denominator for the pool-sequencing correction of theta watterson according to Kofler et al. More...
 
std::string to_string (GenomeLocus const &locus)
 
std::string to_string (GenomeRegion const &region)
 
std::ostream & to_sync (BaseCounts const &bs, std::ostream &os)
 Output a BaseCounts instance to a stream in the PoPoolation2 sync format. More...
 
std::ostream & to_sync (Variant const &var, std::ostream &os)
 Output a Variant instance to a stream in the PoPoolation2 sync format. More...
 
BaseCounts total_base_counts (Variant const &variant)
 Get the summed up total base counts of a Variant. More...
 
size_t total_nucleotide_sum (Variant const &variant)
 Count of the pure nucleotide bases at this position, that is, the sum of all A, C, G, and T. More...
 
void transform_zero_out_by_max_count (BaseCounts &sample, size_t max_count)
 Transform a BaseCounts sample by setting any nucleotide count (A, C, G, T) to zero if max_count is exceeded for that nucleotide. More...
 
void transform_zero_out_by_max_count (Variant &variant, size_t max_count)
 Transform a variant by setting any nucleotide count (A, C, G, T) of its samples to zero if max_count is exceeded for that nucleotide. More...
 
void transform_zero_out_by_min_count (BaseCounts &sample, size_t min_count)
 Transform a BaseCounts sample by setting any nucleotide count (A, C, G, T) to zero if min_count is not reached for that nucleotide. More...
 
void transform_zero_out_by_min_count (Variant &variant, size_t min_count)
 Transform a variant by setting any nucleotide count (A, C, G, T) of its samples to zero if min_count is not reached for that nucleotide. More...
 
void transform_zero_out_by_min_max_count (BaseCounts &sample, size_t min_count, size_t max_count)
 Transform a BaseCounts sample by setting any nucleotide count (A, C, G, T) to zero if min_count is not reached or if max_count is exceeded for that nucleotide. More...
 
void transform_zero_out_by_min_max_count (Variant &variant, size_t min_count, size_t max_count)
 Transform a variant by setting any nucleotide count (A, C, G, T) of its samples to zero if min_count is not reached or if max_count is exceeded for that nucleotide. More...
 
std::string vcf_genotype_string (std::vector< VcfGenotype > const &genotypes)
 Return the VCF-like string representation of a set of VcfGenotype entries. More...
 
size_t vcf_genotype_sum (std::vector< VcfGenotype > const &genotypes)
 Return the sum of genotypes for a set of VcfGenotype entries, typically used to construct a genotype matrix with entries 0,1,2. More...
 
std::string vcf_hl_type_to_string (int hl_type)
 Internal helper function to convert htslib-internal BCF_HL_* header line type values to their string representation as used in the VCF header ("FILTER", "INFO", "FORMAT", etc). More...
 
std::string vcf_value_special_to_string (int vl_type_num)
 
std::string vcf_value_special_to_string (VcfValueSpecial vl_type_num)
 
std::string vcf_value_type_to_string (int ht_type)
 
std::string vcf_value_type_to_string (VcfValueType ht_type)
 

Enumerations

enum  SampleFilterType { kConjunction, kDisjunction, kMerge }
 Select how Variant filter functions that evaluate properties of the Variant::samples (BaseCounts) objects behave when the filter is not true or false for all samples. More...
 
enum  SlidingWindowType { kInterval, kVariants, kChromosome }
 SlidingWindowType of a Window, that is, whether we slide along a fixed size interval of the genome, along a fixed number of variants, or whether the window represents a whole chromosome. More...
 
enum  VcfHeaderLine : int {
  kFilter = 0, kInfo = 1, kFormat = 2, kContig = 3,
  kStructured = 4, kGeneric = 5
}
 Specification for the values determining header line types of VCF/BCF files. More...
 
enum  VcfValueSpecial : int {
  kFixed = 0, kVariable = 1, kAllele = 2, kGenotype = 3,
  kReference = 4
}
 Specification for special markers for the number of values expected for key-value-pairs of VCF/BCF files. More...
 
enum  VcfValueType : int { kFlag = 0, kInteger = 1, kFloat = 2, kString = 3 }
 Specification for the data type of the values expected in key-value-pairs of VCF/BCF files. More...
 
enum  WindowAnchorType {
  kIntervalBegin, kIntervalEnd, kIntervalMidpoint, kVariantFirst,
  kVariantLast, kVariantMedian, kVariantMean, kVariantMidpoint
}
 Position in the genome that is used for reporting when emitting or using a window. More...
 

Typedefs

using VariantInputIterator = utils::LambdaIterator< Variant, VariantInputIteratorData >
 Iterate Variants, using a variety of input file formats. More...
 
using VariantWindowIterator = BaseWindowIterator< VariantInputIterator::Iterator >
 
using VcfFormatIteratorFloat = VcfFormatIterator< float, double >
 
using VcfFormatIteratorGenotype = VcfFormatIterator< int32_t, VcfGenotype >
 
using VcfFormatIteratorInt = VcfFormatIterator< int32_t, int32_t >
 
using VcfFormatIteratorString = VcfFormatIterator< char *, std::string >
 

Variables

static const std::unordered_map< std::string, int > sam_flag_name_to_int_
 Map from sam flags to their numerical value, for different types of naming of the flags. More...
 

Function Documentation

◆ a_n()

double a_n ( size_t  n)

Compute a_n, the sum of reciprocals.

This is the sum of reciprocals up to n-1, which is \( a_n = \sum_{i=1}^{n-1} \frac{1}{i} \).

See Equation 3.6 in

Hahn, M. W. (2018). Molecular Population Genetics. https://global.oup.com/academic/product/molecular-population-genetics-9780878939657

for details.

See also
b_n(), the sum of squared reciprocals.

Definition at line 231 of file diversity.cpp.
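
As an illustration of the formula above, the sum can be sketched in a few lines. This is a stand-alone example and not the code in diversity.cpp; the function name harmonic_sum_sketch is hypothetical.

```cpp
#include <cstddef>

// Hypothetical stand-alone helper (not the genesis implementation):
// computes a_n = sum_{i=1}^{n-1} 1/i, that is, the sum of reciprocals up to n-1.
double harmonic_sum_sketch( std::size_t n )
{
    double sum = 0.0;
    for( std::size_t i = 1; i < n; ++i ) {
        sum += 1.0 / static_cast<double>( i );
    }
    return sum;
}
```

For n = 4, this yields 1 + 1/2 + 1/3. For n <= 1, the sum is empty and the sketch returns 0.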

◆ alpha_star()

double alpha_star ( double  n)

Compute alpha* according to Achaz 2008 and Kofler et al. 2011.

This is needed for the computation of tajima_d_pool() according to

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

The equation is based on

G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409–1424. https://doi.org/10.1534/genetics.107.082198

See there for details.

Definition at line 263 of file diversity.cpp.

◆ amnm_()

double genesis::population::amnm_ ( size_t  poolsize,
size_t  nucleotide_count,
size_t  allele_frequency 
)

Local helper function to compute values for the denominator.

This computes the sum over all r poolsizes of 1/r times a binomial:

\( \sum_{m=b}^{C-b} \frac{1}{k} {C \choose m} \left(\frac{k}{n}\right)^m \left(\frac{n-k}{n}\right)^{C-m} \)

This is needed in the pool-seq correction denominators of Theta Pi and Theta Watterson.

Definition at line 63 of file diversity.cpp.

◆ anchor_position()

size_t genesis::population::anchor_position ( Window< D, A > const &  window,
WindowAnchorType  anchor_type = WindowAnchorType::kIntervalBegin 
)

Get the position in the chromosome reported according to a specific WindowAnchorType.

When a window is filled with data, we need to report the position in the genome at which the window is. There are several ways that this position can be computed. Typically, just the first position of the window is used (that is, for an interval, the beginning of the interval, and for variants, the position of the first variant).

However, it might be desirable to report a different position, for example when plotting the results. When using SlidingWindowType::kVariants for example, one might want to plot the values computed per window at the midpoint genome position of the variants in that window.

Definition at line 77 of file population/window/functions.hpp.
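
The following sketch illustrates how different anchor types lead to different reported positions. It is a simplified stand-in and not the genesis implementation: the types AnchorSketch and WindowSketch are hypothetical, only a subset of the anchor types is covered, and computing a midpoint as the integer mean of two positions is an assumption made here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical subset of anchor types, for illustration only.
enum class AnchorSketch {
    kIntervalBegin, kIntervalEnd, kIntervalMidpoint,
    kVariantFirst, kVariantLast, kVariantMidpoint
};

// Hypothetical minimal window: an interval plus the sorted positions of its variants.
struct WindowSketch
{
    std::size_t first_position;
    std::size_t last_position;
    std::vector<std::size_t> variant_positions;
};

std::size_t anchor_position_sketch( WindowSketch const& window, AnchorSketch anchor )
{
    // Interval-based anchors only need the interval boundaries.
    switch( anchor ) {
        case AnchorSketch::kIntervalBegin:
            return window.first_position;
        case AnchorSketch::kIntervalEnd:
            return window.last_position;
        case AnchorSketch::kIntervalMidpoint:
            // Assumption: midpoint as the integer mean of both boundaries.
            return ( window.first_position + window.last_position ) / 2;
        default:
            break;
    }

    // Variant-based anchors need at least one variant in the window.
    auto const& vars = window.variant_positions;
    if( vars.empty() ) {
        throw std::runtime_error( "Window contains no variants." );
    }
    if( anchor == AnchorSketch::kVariantFirst ) {
        return vars.front();
    }
    if( anchor == AnchorSketch::kVariantLast ) {
        return vars.back();
    }
    // kVariantMidpoint: integer mean of the first and last variant position.
    return ( vars.front() + vars.back() ) / 2;
}
```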

◆ b_n()

double b_n ( size_t  n)

Compute b_n, the sum of squared reciprocals.

This is the sum of squared reciprocals up to n-1, which is \( b_n = \sum_{i=1}^{n-1} \frac{1}{i^2} \).

See

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

See also
a_n(), the sum of reciprocals.

Definition at line 244 of file diversity.cpp.
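
The formula above can be sketched as follows. This is a stand-alone illustration and not the code in diversity.cpp; the function name squared_harmonic_sum_sketch is hypothetical.

```cpp
#include <cstddef>

// Hypothetical stand-alone helper (not the genesis implementation):
// computes b_n = sum_{i=1}^{n-1} 1/i^2, the sum of squared reciprocals up to n-1.
double squared_harmonic_sum_sketch( std::size_t n )
{
    double sum = 0.0;
    for( std::size_t i = 1; i < n; ++i ) {
        double const d = static_cast<double>( i );
        sum += 1.0 / ( d * d );
    }
    return sum;
}
```

For n = 4, this yields 1 + 1/4 + 1/9.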

◆ beta_star()

double beta_star ( double  n)

Compute beta* according to Achaz 2008 and Kofler et al. 2011.

This is needed for the computation of tajima_d_pool() according to

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

The equation is based on

G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409–1424. https://doi.org/10.1534/genetics.107.082198

See there for details.

Definition at line 291 of file diversity.cpp.

◆ consensus() [1/2]

std::pair< char, double > consensus ( BaseCounts const &  sample)

Consensus character for a BaseCounts, and its confidence.

This is simply the character (out of ACGT) that appears most often (or, for ties, the lexicographically smallest character), unless all of (A, C, G, T) are zero, in which case the consensus character is N. The confidence is the count of the consensus character, divided by the total count of all four nucleotides.

Definition at line 397 of file population/functions/functions.cpp.
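
The rule described above can be sketched as follows. This is an illustration only, using a hypothetical CountsSketch struct instead of BaseCounts. Returning a confidence of 0.0 when all four counts are zero is an assumption of this sketch, as the description above does not state it.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for the four nucleotide counts of a BaseCounts.
struct CountsSketch
{
    std::size_t a;
    std::size_t c;
    std::size_t g;
    std::size_t t;
};

std::pair<char, double> consensus_sketch( CountsSketch const& sample )
{
    // Bases in lexicographical order, so that ties resolve to the smallest character.
    std::array<std::pair<char, std::size_t>, 4> const counts = {{
        { 'A', sample.a }, { 'C', sample.c }, { 'G', sample.g }, { 'T', sample.t }
    }};

    std::size_t total = 0;
    std::size_t best = 0;
    for( std::size_t i = 0; i < counts.size(); ++i ) {
        total += counts[i].second;
        // Strictly greater: an earlier base keeps its place on ties.
        if( counts[i].second > counts[best].second ) {
            best = i;
        }
    }

    // All counts zero: consensus is N. The confidence of 0.0 is an assumption.
    if( total == 0 ) {
        return { 'N', 0.0 };
    }
    double const confidence
        = static_cast<double>( counts[best].second ) / static_cast<double>( total );
    return { counts[best].first, confidence };
}
```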

◆ consensus() [2/2]

std::pair< char, double > consensus ( BaseCounts const &  sample,
BaseCountsStatus const &  status 
)

Consensus character for a BaseCounts, and its confidence.

This is simply the character (out of ACGT) that appears most often (or, for ties, the lexicographically smallest character). If the BaseCounts is not well covered by reads (that is, if its BaseCountsStatus::is_covered is false), the consensus character is N. The confidence is the count of the consensus character, divided by the total count of all four nucleotides.

Definition at line 438 of file population/functions/functions.cpp.

◆ convert_to_afs_pileup_record()

AfsPileupRecord convert_to_afs_pileup_record ( SimplePileupReader::Record const &  record)

Definition at line 48 of file afs_estimate.cpp.

◆ convert_to_base_counts()

BaseCounts convert_to_base_counts ( SimplePileupReader::Sample const &  sample,
unsigned char  min_phred_score 
)

Definition at line 45 of file simple_pileup_common.cpp.

◆ convert_to_variant()

Variant convert_to_variant ( SimplePileupReader::Record const &  record,
unsigned char  min_phred_score 
)

Definition at line 145 of file simple_pileup_common.cpp.

◆ convert_to_variant_as_individuals()

Variant convert_to_variant_as_individuals ( VcfRecord const &  record,
bool  use_allelic_depth = false 
)

Convert a VcfRecord to a Variant, treating each sample as an individual, and combining them all into one BaseCounts sample.

In this function, we assume that the data that was used to create the VCF file was the typical use case of VCF, where each sample (column) in the file corresponds to an individual. When using this function, all samples (individuals) are combined into one, as our targeted output type Variant is used to describe allele counts of several individuals (e.g., in a pool). As all columns are combined, the resulting Variant only contains a single BaseCounts object. We only consider biallelic SNP positions here.

We offer two ways of combining the samples (columns) of the input VCF record into the BaseCounts:

  1. When use_allelic_depth is false (default), individuals simply contribute to the BaseCounts according to their ploidy. That is, an individual with genotype A/T will contribute one count each for A and T.
  2. When use_allelic_depth is true, we instead use the "AD" FORMAT field to obtain the actual counts for the reference and alternative allele, and use these to sum up the BaseCounts data.
See also
See make_variant_input_iterator_from_individual_vcf_file() for an example where this is used.
See convert_to_variant_as_pool() for the alternative function that instead interprets each sample (column) as a pool of individuals, e.g., from pool sequencing.

Definition at line 381 of file vcf_common.cpp.
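
The first way of combining samples (counting by ploidy) can be sketched as follows. This is a simplified illustration that does not use VcfRecord or BaseCounts: genotypes are given as allele indices (0 for the reference, 1 for the alternative allele), the function name counts_from_genotypes_sketch is hypothetical, and skipping all other indices (such as missing calls) is an assumption of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: sum up allele counts of all individuals at a biallelic SNP.
// Returns the counts in the order A, C, G, T.
std::array<std::size_t, 4> counts_from_genotypes_sketch(
    char ref, char alt, std::vector<std::vector<int>> const& genotypes
) {
    std::string const bases = "ACGT";
    auto const ref_idx = bases.find( ref );
    auto const alt_idx = bases.find( alt );
    if( ref_idx == std::string::npos || alt_idx == std::string::npos ) {
        throw std::invalid_argument( "REF and ALT have to be in ACGT." );
    }

    std::array<std::size_t, 4> counts = {{ 0, 0, 0, 0 }};
    for( auto const& individual : genotypes ) {
        // Each allele call of an individual contributes one count,
        // so a diploid individual contributes two counts in total.
        for( int allele : individual ) {
            if( allele == 0 ) {
                ++counts[ ref_idx ];
            } else if( allele == 1 ) {
                ++counts[ alt_idx ];
            }
            // Assumption: any other index (e.g., a missing call) is skipped.
        }
    }
    return counts;
}
```

With REF A and ALT T, three diploid individuals with genotypes A/T, T/T, and A/A yield three counts each for A and T.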

◆ convert_to_variant_as_pool()

Variant convert_to_variant_as_pool ( VcfRecord const &  record)

Convert a VcfRecord to a Variant, treating each sample column as a pool of individuals.

This assumes that the data that was used to create the VCF file was actually a pool of individuals (e.g., from pool sequencing) for each sample (column) of the VCF file. We do not actually recommend using variant calling software on pool-seq data, as it induces frequency shifts due to the statistical models employed by variant callers that were not built for pool sequencing data. It however seems to be a commonly used approach, and hence we offer this function here. For this type of data, the VCF allelic depth ("AD") information contains the counts of the reference and alternative base, which in this context can be interpreted as describing the allele frequencies of each pool of individuals. This requires the VCF to have the "AD" FORMAT field.

Only SNP data (no indels) are allowed in this function; use VcfRecord::is_snp() to test this.

See also
See make_variant_input_iterator_from_pool_vcf_file() for an example where this is used.
See convert_to_variant_as_individuals() for the function that instead interprets the VCF as usual as a set of individuals.

Definition at line 275 of file vcf_common.cpp.

◆ f_st_pool_karlsson()

double genesis::population::f_st_pool_karlsson ( ForwardIterator1  p1_begin,
ForwardIterator1  p1_end,
ForwardIterator2  p2_begin,
ForwardIterator2  p2_end 
)

Compute the F_ST statistic for pool-sequenced data of Karlsson et al as used in PoPoolation2, for two ranges of BaseCounts objects.

The approach is called the "asymptotically unbiased" estimator in PoPoolation2 [1], and follows Karlsson et al [2].

[1] PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq).
Kofler R, Pandey RV, Schlotterer C.
Bioinformatics, 2011, 27(24), 3435–3436. https://doi.org/10.1093/bioinformatics/btr589

[2] Efficient mapping of mendelian traits in dogs through genome-wide association.
Karlsson EK, Baranowska I, Wade CM, Salmon Hillbertz NHC, Zody MC, Anderson N, Biagi TM, Patterson N, Pielberg GR, Kulbokas EJ, Comstock KE, Keller ET, Mesirov JP, Von Euler H, Kämpe O, Hedhammar Å, Lander ES, Andersson G, Andersson L, Lindblad-Toh K.
Nature Genetics, 2007, 39(11), 1321–1328. https://doi.org/10.1038/ng.2007.10

Definition at line 339 of file structure.hpp.

◆ f_st_pool_karlsson_nkdk()

std::pair< double, double > f_st_pool_karlsson_nkdk ( std::pair< SortedBaseCounts, SortedBaseCounts > const &  sample_counts)

Compute the numerator N_k and denominator D_k needed for the asymptotically unbiased F_ST estimator of Karlsson et al (2007).

See f_st_pool_karlsson() for details. The function expects sorted base counts for the two samples of which we want to compute F_ST, which are produced by sorted_average_base_counts().

Definition at line 101 of file structure.cpp.

◆ f_st_pool_kofler()

double genesis::population::f_st_pool_kofler ( size_t  p1_poolsize,
size_t  p2_poolsize,
ForwardIterator1  p1_begin,
ForwardIterator1  p1_end,
ForwardIterator2  p2_begin,
ForwardIterator2  p2_end 
)

Compute the F_ST statistic for pool-sequenced data of Kofler et al as used in PoPoolation2, for two ranges of BaseCounts objects.

The approach is called the "classical" or "conventional" estimator in PoPoolation2 [1], and follows Hartl and Clark [2].

[1] PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq).
Kofler R, Pandey RV, Schlotterer C.
Bioinformatics, 2011, 27(24), 3435–3436. https://doi.org/10.1093/bioinformatics/btr589

[2] Principles of Population Genetics.
Hartl DL, Clark AG.
Sinauer, 2007.

Definition at line 202 of file structure.hpp.

◆ f_st_pool_kofler_pi_snp()

std::tuple< double, double, double > f_st_pool_kofler_pi_snp ( BaseCounts const &  p1,
BaseCounts const &  p2 
)

Compute the SNP-based Theta Pi values used in f_st_pool_kofler().

See there for details. The tuple returns Theta Pi for an individual position, which is simply the heterozygosity() at this position, for both samples p1 and p2, as well as their combined (average frequency) heterozygosity, in that order.

Definition at line 46 of file structure.cpp.

◆ f_st_pool_unbiased()

std::pair<double, double> genesis::population::f_st_pool_unbiased ( size_t  p1_poolsize,
size_t  p2_poolsize,
ForwardIterator1  p1_begin,
ForwardIterator1  p1_end,
ForwardIterator2  p2_begin,
ForwardIterator2  p2_end 
)

Compute our unbiased F_ST statistic for pool-sequenced data for two ranges of BaseCounts objects.

This is our novel approach for estimating F_ST, using pool-sequencing corrected estimates of Pi within, Pi between, and Pi total, to compute F_ST following the definitions of Nei [1] and Hudson [2], respectively. These are returned here as a pair in that order. See https://github.com/lczech/pool-seq-pop-gen-stats for details.

[1] Analysis of Gene Diversity in Subdivided Populations.
Nei M.
Proceedings of the National Academy of Sciences, 1973, 70(12), 3321–3323. https://doi.org/10.1073/PNAS.70.12.3321

[2] Estimation of levels of gene flow from DNA sequence data.
Hudson RR, Slatkin M, Maddison WP.
Genetics, 1992, 132(2), 583–589. https://doi.org/10.1093/GENETICS/132.2.583

Definition at line 433 of file structure.hpp.

◆ f_st_pool_unbiased_pi_snp()

std::tuple< double, double, double > f_st_pool_unbiased_pi_snp ( size_t  p1_poolsize,
size_t  p2_poolsize,
BaseCounts const &  p1,
BaseCounts const &  p2 
)

Compute the SNP-based Theta Pi values used in f_st_pool_unbiased().

The function returns pi within, between, and total, in that order. See f_st_pool_unbiased() for details.

Definition at line 166 of file structure.cpp.

◆ f_star()

double f_star ( double  a_n,
double  n 
)

Compute f* according to Achaz 2008 and Kofler et al. 2011.

This is computed as \( f_{star} = \frac{n - 3}{a_n \cdot (n-1) - n} \), and needed for the computation of alpha_star() and beta_star(). See there for some more details, and see

G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409–1424. https://doi.org/10.1534/genetics.107.082198

for the original equations.

Definition at line 257 of file diversity.cpp.
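
The formula above translates directly into code. The following is a stand-alone sketch and not the code in diversity.cpp; the names a_n_sketch and f_star_sketch are hypothetical, and the helper for a_n is only included to make the example self-contained.

```cpp
#include <cstddef>

// Hypothetical helper: a_n = sum_{i=1}^{n-1} 1/i.
double a_n_sketch( std::size_t n )
{
    double sum = 0.0;
    for( std::size_t i = 1; i < n; ++i ) {
        sum += 1.0 / static_cast<double>( i );
    }
    return sum;
}

// Hypothetical stand-alone version of f*: (n - 3) / (a_n * (n - 1) - n).
double f_star_sketch( double a_n, double n )
{
    return ( n - 3.0 ) / ( a_n * ( n - 1.0 ) - n );
}
```

For example, f_star_sketch( a_n_sketch( 5 ), 5.0 ) evaluates to 2 / ( 25/12 * 4 - 5 ), which is 0.6 up to floating point rounding.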

◆ filter_by_region() [1/3]

std::function<bool(Variant const&)> genesis::population::filter_by_region ( GenomeRegion const &  region,
bool  complement = false 
)
inline

Filter function to be used with VariantInputIterator to filter by a genome region.

This function can be used as a filter with VariantInputIterator::add_filter(), in order to only iterate over Variants that are in the given region (if complement is false, default), or only over Variants that are outside of the region (if complement is true).

Definition at line 277 of file filter_transform.hpp.

◆ filter_by_region() [2/3]

std::function<bool(Variant const&)> genesis::population::filter_by_region ( GenomeRegionList const &  regions,
bool  complement = false,
bool  copy_regions = false 
)
inline

Filter function to be used with VariantInputIterator to filter by a list of genome regions.

This function can be used as a filter with VariantInputIterator::add_filter(), in order to only iterate over Variants that are in the given regions (if complement is false, default), or only over Variants that are outside of the regions (if complement is true).

This version of the function can be used if the regions object is not given as a std::shared_ptr. The parameter copy_regions is an optimization. By default, the function stores a copy of the regions, in order to make sure that it is available. However, if it is guaranteed that the regions object stays in scope during the VariantInputIterator's lifetime, this copy can be avoided.

Definition at line 316 of file filter_transform.hpp.

◆ filter_by_region() [3/3]

std::function<bool(Variant const&)> genesis::population::filter_by_region ( std::shared_ptr< GenomeRegionList >  regions,
bool  complement = false 
)
inline

Filter function to be used with VariantInputIterator to filter by a list of genome regions.

This function can be used as a filter with VariantInputIterator::add_filter(), in order to only iterate over Variants that are in the given regions (if complement is false, default), or only over Variants that are outside of the regions (if complement is true).

Definition at line 293 of file filter_transform.hpp.

◆ filter_by_status() [1/2]

bool filter_by_status ( std::function< bool(BaseCountsStatus const &)>  predicate,
Variant const &  variant,
SampleFilterType  type,
size_t  min_coverage = 0,
size_t  max_coverage = 0,
size_t  min_count = 0,
bool  tolerate_deletions = false 
)

Filter a Variant based on a predicate that is applied to the result of a status() call on the BaseCounts of the variant.

See status() for details on the data of type BaseCountsStatus that predicate can use. This function applies the predicate to the BaseCounts samples of the variant (or to the merged one, depending on type, see also below), and returns whether the filter predicate passed or not.

Note that different type values have a distinct effect here: It might happen that all samples individually pass the predicate, but their merged counts do not, or vice versa. Hence, this choice needs to be made depending on downstream needs. For example, if we are filtering for Variants that are SNPs (where there exist at least two counts in [ACGT] that are non-zero), individual samples might only have one base count greater than zero, in which case they are not considered to be a SNP. However, if those non-zero counts are not for the same base in all samples, their merged counts will be non-zero for more than one base, and hence considered a SNP.

Definition at line 43 of file filter_transform.cpp.
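
The SNP example from the last paragraph can be sketched as follows. This is an illustration with hypothetical types and names, and not the genesis implementation. In particular, the reading of kConjunction as "all samples pass" and of kDisjunction as "at least one sample passes" is an assumption of this sketch.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sample: counts for A, C, G, T.
using CountsSketch = std::array<std::size_t, 4>;

// Hypothetical mirror of SampleFilterType, with assumed semantics.
enum class FilterTypeSketch { kConjunction, kDisjunction, kMerge };

// A sample is a SNP if more than one count in [ACGT] is non-zero.
bool is_snp_sketch( CountsSketch const& counts )
{
    return std::count_if(
        counts.begin(), counts.end(),
        []( std::size_t v ){ return v > 0; }
    ) > 1;
}

bool filter_is_snp_sketch( std::vector<CountsSketch> const& samples, FilterTypeSketch type )
{
    switch( type ) {
        case FilterTypeSketch::kConjunction:
            // Assumption: every sample has to pass on its own.
            return std::all_of( samples.begin(), samples.end(), is_snp_sketch );
        case FilterTypeSketch::kDisjunction:
            // Assumption: at least one sample has to pass on its own.
            return std::any_of( samples.begin(), samples.end(), is_snp_sketch );
        case FilterTypeSketch::kMerge: {
            // Sum up all samples, and test the merged counts instead.
            CountsSketch merged = {{ 0, 0, 0, 0 }};
            for( auto const& sample : samples ) {
                for( std::size_t i = 0; i < merged.size(); ++i ) {
                    merged[i] += sample[i];
                }
            }
            return is_snp_sketch( merged );
        }
    }
    return false;
}
```

With one sample that only has counts for A, and another that only has counts for G, neither sample is a SNP on its own, but their merged counts are.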

◆ filter_by_status() [2/2]

std::function<bool(Variant const&)> genesis::population::filter_by_status ( std::function< bool(BaseCountsStatus const &)>  predicate,
SampleFilterType  type,
size_t  min_coverage = 0,
size_t  max_coverage = 0,
size_t  min_count = 0,
bool  tolerate_deletions = false 
)
inline

Filter a Variant based on a predicate that is applied to the result of a status() call on the BaseCounts of the variant.

Same as filter_by_status( std::function<...>, Variant const&, ... ) , but returns a callback to be used as a filter, e.g., with VariantInputIterator::add_filter().

Definition at line 123 of file filter_transform.hpp.

◆ filter_is_biallelic_snp() [1/2]

std::function<bool(Variant const&)> genesis::population::filter_is_biallelic_snp ( SampleFilterType  type,
size_t  min_coverage = 0,
size_t  max_coverage = 0,
size_t  min_count = 0,
bool  tolerate_deletions = false 
)
inline

Filter a Variant based on whether the sample counts are biallelic SNPs, that is, exactly two base counts in [ACGT] are non-zero.

Same as filter_is_biallelic_snp( Variant const&, ... ) , but returns a callback to be used as a filter, e.g., with VariantInputIterator::add_filter().

Definition at line 230 of file filter_transform.hpp.

◆ filter_is_biallelic_snp() [2/2]

bool genesis::population::filter_is_biallelic_snp ( Variant const &  variant,
SampleFilterType  type,
size_t  min_coverage = 0,
size_t  max_coverage = 0,
size_t  min_count = 0,
bool  tolerate_deletions = false 
)
inline

Filter a Variant based on whether the sample counts are biallelic SNPs, that is, exactly two base counts in [ACGT] are non-zero.

Same as filter_is_snp( Variant const&, ... ) , but additionally checks that the SNP is biallelic (BaseCountsStatus::is_biallelic). See there for more details.

Definition at line 204 of file filter_transform.hpp.

◆ filter_is_snp() [1/2]

std::function<bool(Variant const&)> genesis::population::filter_is_snp ( SampleFilterType  type,
size_t  min_coverage = 0,
size_t  max_coverage = 0,
size_t  min_count = 0,
bool  tolerate_deletions = false 
)
inline

Filter a Variant based on whether the sample counts are SNPs, that is, more than one count in [ACGT] is non-zero.

Same as filter_is_snp( Variant const&, ... ) , but returns a callback to be used as a filter, e.g., with VariantInputIterator::add_filter().

Definition at line 178 of file filter_transform.hpp.

◆ filter_is_snp() [2/2]

bool genesis::population::filter_is_snp ( Variant const &  variant,
SampleFilterType  type,
size_t  min_coverage = 0,
size_t  max_coverage = 0,
size_t  min_count = 0,
bool  tolerate_deletions = false 
)
inline

Filter a Variant based on whether the sample counts are SNPs, that is, more than one count in [ACGT] is non-zero.

This function checks that the samples are covered (BaseCountsStatus::is_covered) and have more than one non-zero count (BaseCountsStatus::is_snp).

See status() for details, and see filter_by_status() for details on the processing, in particular the type argument.

Definition at line 152 of file filter_transform.hpp.

◆ genome_region_list_from_vcf_file() [1/2]

GenomeRegionList genome_region_list_from_vcf_file ( std::string const &  file)

Read a VCF file, and use its positions to create a GenomeRegionList.

This is for example useful to restrict some analysis to the loci of known variants. Note that the whole file still has to be read; it can hence be better to only do this once and convert to a faster file format.

This ignores all sample information, and simply uses the CHROM and POS data to construct intervals of consecutive positions along the chromosomes, i.e., if the file contains positions 1, 2, and 3, but not 4, an interval spanning 1-3 is inserted into the list.

The VCF file does not have to be sorted for this.

Definition at line 486 of file vcf_common.cpp.
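
The construction of intervals from consecutive positions can be sketched as follows for the positions of a single chromosome. This is an illustration only: it does not read VCF files or use GenomeRegionList, the function name positions_to_intervals_sketch is hypothetical, and sorting the positions first is just one way to handle unsorted input.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: turn positions on one chromosome into inclusive
// intervals (first, last) of consecutive positions.
std::vector<std::pair<std::size_t, std::size_t>> positions_to_intervals_sketch(
    std::vector<std::size_t> positions
) {
    // The input does not need to be sorted; we sort our copy here.
    std::sort( positions.begin(), positions.end() );

    std::vector<std::pair<std::size_t, std::size_t>> intervals;
    for( auto const pos : positions ) {
        if( !intervals.empty() && pos <= intervals.back().second + 1 ) {
            // Duplicate or directly adjacent position: extend the current interval.
            intervals.back().second = std::max( intervals.back().second, pos );
        } else {
            // Gap before this position: start a new interval.
            intervals.emplace_back( pos, pos );
        }
    }
    return intervals;
}
```

For the positions 3, 1, 2, 7, 8, 5, this yields the intervals 1-3, 5-5, and 7-8.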

◆ genome_region_list_from_vcf_file() [2/2]

void genome_region_list_from_vcf_file ( std::string const &  file,
GenomeRegionList &  target 
)

Read a VCF file, and add its positions to an existing GenomeRegionList.

This is for example useful to restrict some analysis to the loci of known variants. Note that the whole file still has to be read; it can hence be better to only do this once and convert to a faster file format.

This ignores all sample information, and simply uses the CHROM and POS data to construct intervals of consecutive positions along the chromosomes, i.e., if the file contains positions 1, 2, and 3, but not 4, an interval spanning 1-3 is inserted into the list.

The VCF file does not have to be sorted for this. The regions are merged into the existing ones, potentially changing existing starts and ends of intervals if they overlap with regions found in the VCF.

Definition at line 493 of file vcf_common.cpp.

◆ get_base_count()

size_t get_base_count ( BaseCounts const &  bc,
char  base 
)

Get the count for a base given as a char.

The given base has to be one of ACGTDN (case insensitive), or one of the characters *#. which are used for deletions as well.

Definition at line 103 of file population/functions/functions.cpp.

◆ get_vcf_record_snp_ref_alt_chars_()

std::pair<std::array<char, 6>, size_t> genesis::population::get_vcf_record_snp_ref_alt_chars_ ( VcfRecord const &  record)

Local helper function that returns the REF and ALT chars of a VcfRecord for SNPs.

This function expects the record to only contain SNP REF and ALT (single nucleotides), and throws when not. It then fills the resulting array with these chars. That is, result[0] is the REF char, result[1] the first ALT char, and so forth.

To keep it speedy, we always return an array that is large enough for all ACGTND, and return the number of used entries as the second value of the pair.

Definition at line 232 of file vcf_common.cpp.

◆ guess_alternative_base()

char guess_alternative_base ( Variant const &  variant,
bool  force = true 
)

Guess the alternative base of a Variant.

If the Variant already has an alternative_base in ACGT and force is not true, this original base is returned (meaning that this function is idempotent; it does not change the alternative base if there already is one). However, if the alternative_base is N or any other char not in ACGT, or if force is true, the base with the highest count that is not the reference base is returned instead. This also means that the reference base has to be set to a value in ACGT, as otherwise the concept of an alternative base is meaningless anyway. If the reference base is not one of ACGT, the returned alternative base is N. Furthermore, if all three non-reference bases have count 0, the returned alternative base is N.

Definition at line 463 of file population/functions/functions.cpp.
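
The decision logic described above can be sketched as follows. This is an illustration with a hypothetical VariantSketch struct that holds already merged counts; it is not the genesis implementation. The sketch assumes upper case bases, and resolves ties between equally frequent non-reference bases in favor of the lexicographically smaller one, which is an assumption made here.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a Variant, with total counts for A, C, G, T.
struct VariantSketch
{
    char reference_base;
    char alternative_base;
    std::array<std::size_t, 4> counts;
};

bool is_acgt_sketch( char c )
{
    return c == 'A' || c == 'C' || c == 'G' || c == 'T';
}

char guess_alternative_base_sketch( VariantSketch const& variant, bool force = true )
{
    // Keep an existing valid alternative base, unless forced to re-compute it.
    if( !force && is_acgt_sketch( variant.alternative_base )) {
        return variant.alternative_base;
    }
    // Without a valid reference base, there is no meaningful alternative base.
    if( !is_acgt_sketch( variant.reference_base )) {
        return 'N';
    }

    // Pick the non-reference base with the highest count.
    std::string const bases = "ACGT";
    char best = 'N';
    std::size_t best_count = 0;
    for( std::size_t i = 0; i < bases.size(); ++i ) {
        if( bases[i] != variant.reference_base && variant.counts[i] > best_count ) {
            best = bases[i];
            best_count = variant.counts[i];
        }
    }
    // If all three non-reference bases have count 0, this is still N.
    return best;
}
```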

◆ guess_reference_base()

char guess_reference_base ( Variant const &  variant)

Guess the reference base of a Variant.

If the Variant already has a reference_base in ACGT, this base is returned (meaning that this function is idempotent; it does not change the reference base if there already is one). However, if the reference_base is N or any other value not in ACGT, the base with the highest count is returned instead, unless all counts are 0, in which case the returned reference base is N.

Definition at line 447 of file population/functions/functions.cpp.

◆ heterozygosity()

double heterozygosity ( BaseCounts const &  sample,
bool  with_bessel = false 
)

Compute classic heterozygosity.

This is computed as \( h = \frac{n}{n-1} \left( 1 - \sum p^2 \right) \) with n the total nucleotide_sum() (sum of A,C,G,T in the sample), and p their respective nucleotide frequencies. The leading factor \( \frac{n}{n-1} \) is Bessel's correction: it is applied when with_bessel is true, and omitted when with_bessel is false (default).

See Equation 3.1 in

Hahn, M. W.
(2018). Molecular Population Genetics.
https://global.oup.com/academic/product/molecular-population-genetics-9780878939657

for details.

Definition at line 110 of file diversity.cpp.
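
The equation above can be sketched as follows. This is a stand-alone illustration that takes the four nucleotide counts as an array instead of a BaseCounts; the function name heterozygosity_sketch is hypothetical. Returning 0 for samples without counts, and for Bessel's correction with n <= 1, is an assumption of this sketch to avoid divisions by zero.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-alone version: counts are given in the order A, C, G, T.
double heterozygosity_sketch(
    std::array<std::size_t, 4> const& counts, bool with_bessel = false
) {
    // n is the sum of A, C, G, T.
    double n = 0.0;
    for( auto const c : counts ) {
        n += static_cast<double>( c );
    }
    // Assumption: a sample without any counts has heterozygosity 0.
    if( n == 0.0 ) {
        return 0.0;
    }

    // h = 1 - sum of squared nucleotide frequencies.
    double sum_sq = 0.0;
    for( auto const c : counts ) {
        double const p = static_cast<double>( c ) / n;
        sum_sq += p * p;
    }
    double h = 1.0 - sum_sq;

    // Optional Bessel's correction n / (n - 1).
    if( with_bessel ) {
        // Assumption: return 0 if the correction is undefined.
        if( n <= 1.0 ) {
            return 0.0;
        }
        h *= n / ( n - 1.0 );
    }
    return h;
}
```

For a sample with two counts each for A and C, the frequencies are 0.5 each, so that h = 0.5 without the correction, and h = 4/3 * 0.5 with the correction.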

◆ is_covered() [1/6]

bool is_covered ( GenomeRegion const &  region,
std::string const &  chromosome,
size_t  position 
)

Test whether the chromosome/position is within a given genomic region.

Definition at line 190 of file genome_region.cpp.

◆ is_covered() [2/6]

bool genesis::population::is_covered ( GenomeRegion const &  region,
T const &  locus 
)

Test whether the chromosome/position of a locus is within a given genomic region.

This is a function template, so that it can accept any data structure that contains public member variables chromosome (std::string) and position (size_t), such as Variant or GenomeLocus.

Definition at line 121 of file functions/genome_region.hpp.
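
The following sketch illustrates why a function template is convenient here: any type with public members chromosome and position can be tested. All type and function names are hypothetical stand-ins, and treating the region boundaries as inclusive is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a region; start and end are assumed to be inclusive.
struct RegionSketch
{
    std::string chromosome;
    std::size_t start;
    std::size_t end;
};

// Two unrelated hypothetical types that both offer chromosome and position.
struct PlainLocusSketch
{
    std::string chromosome;
    std::size_t position;
};

struct AnnotatedLocusSketch
{
    std::string chromosome;
    std::size_t position;
    char reference_base;
};

// Works for any T with public members chromosome and position.
template<class T>
bool is_covered_sketch( RegionSketch const& region, T const& locus )
{
    return locus.chromosome == region.chromosome
        && locus.position >= region.start
        && locus.position <= region.end;
}
```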

◆ is_covered() [3/6]

bool is_covered ( GenomeRegion const &  region,
VcfRecord const &  variant 
)

Definition at line 219 of file genome_region.cpp.

◆ is_covered() [4/6]

bool is_covered ( GenomeRegionList const &  regions,
std::string const &  chromosome,
size_t  position 
)

Test whether the chromosome/position is within a given list of genomic regions.

Definition at line 212 of file genome_region.cpp.

◆ is_covered() [5/6]

bool genesis::population::is_covered ( GenomeRegionList const &  regions,
T const &  locus 
)

Test whether the chromosome/position of a locus is within a given list of genomic regions.

This is a function template, so that it can accept any data structure that contains public member variables chromosome (std::string) and position (size_t), such as Variant or GenomeLocus.

Definition at line 135 of file functions/genome_region.hpp.

◆ is_covered() [6/6]

bool is_covered ( GenomeRegionList const &  regions,
VcfRecord const &  variant 
)

Definition at line 224 of file genome_region.cpp.

◆ locus_compare() [1/4]

int genesis::population::locus_compare ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Three-way comparison (spaceship operator <=>) for two loci in a genome.

The comparison returns -1 if the left locus is before the right locus, +1 for the opposite, and 0 if the two loci are equal.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
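
The described ordering can be sketched as follows. This is a stand-alone illustration under stated assumptions: LocusSketch and locus_compare_sketch are hypothetical names, and the chromosome comparison uses std::string::compare, which is one way to obtain the lexicographical order mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a locus, with only the two members used here.
struct LocusSketch {
    std::string chromosome;
    std::size_t position;
};

// Returns -1, 0, or +1, comparing chromosome names first, then positions.
int locus_compare_sketch( LocusSketch const& l, LocusSketch const& r )
{
    // std::string::compare yields a negative, zero, or positive value
    // according to the lexicographical order of the two names.
    int const chr = l.chromosome.compare( r.chromosome );
    if( chr != 0 ) {
        return chr < 0 ? -1 : +1;
    }

    // Same chromosome: the position decides.
    if( l.position < r.position ) {
        return -1;
    }
    if( l.position > r.position ) {
        return +1;
    }
    return 0;
}
```

A consequence of the lexicographical order is that "chr10" sorts before "chr2": in this sketch, assert( locus_compare_sketch( { "chr10", 1 }, { "chr2", 500 } ) == -1 ) holds, which may differ from the chromosome order of the input file.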

Definition at line 140 of file functions/genome_locus.hpp.

◆ locus_compare() [2/4]

int genesis::population::locus_compare ( GenomeLocus const &  l,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Three-way comparison (spaceship operator <=>) for two loci in a genome.

The comparison returns -1 if the left locus is before the right locus, +1 for the opposite, and 0 if the two loci are equal.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 120 of file functions/genome_locus.hpp.

◆ locus_compare() [3/4]

int genesis::population::locus_compare ( std::string const &  l_chromosome,
size_t  l_position,
GenomeLocus const &  r 
)
inline

Three-way comparison (spaceship operator <=>) for two loci in a genome.

The comparison returns -1 if the left locus is before the right locus, +1 for the opposite, and 0 if the two loci are equal.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 130 of file functions/genome_locus.hpp.

◆ locus_compare() [4/4]

int genesis::population::locus_compare ( std::string const &  l_chromosome,
size_t  l_position,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Three-way comparison (spaceship operator <=>) for two loci in a genome.

The comparison returns -1 if the left locus is before the right locus, +1 for the opposite, and 0 if the two loci are equal.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 95 of file functions/genome_locus.hpp.

◆ locus_equal() [1/4]

bool genesis::population::locus_equal ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Equality comparison (==) for two loci in a genome.

Definition at line 199 of file functions/genome_locus.hpp.

◆ locus_equal() [2/4]

bool genesis::population::locus_equal ( GenomeLocus const &  l,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Equality comparison (==) for two loci in a genome.

Definition at line 179 of file functions/genome_locus.hpp.

◆ locus_equal() [3/4]

bool genesis::population::locus_equal ( std::string const &  l_chromosome,
size_t  l_position,
GenomeLocus const &  r 
)
inline

Equality comparison (==) for two loci in a genome.

Definition at line 189 of file functions/genome_locus.hpp.

◆ locus_equal() [4/4]

bool genesis::population::locus_equal ( std::string const &  l_chromosome,
size_t  l_position,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Equality comparison (==) for two loci in a genome.

Definition at line 169 of file functions/genome_locus.hpp.

◆ locus_greater() [1/4]

bool genesis::population::locus_greater ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Greater than comparison (>) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 361 of file functions/genome_locus.hpp.

◆ locus_greater() [2/4]

bool genesis::population::locus_greater ( GenomeLocus const &  l,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Greater than comparison (>) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 341 of file functions/genome_locus.hpp.

◆ locus_greater() [3/4]

bool genesis::population::locus_greater ( std::string const &  l_chromosome,
size_t  l_position,
GenomeLocus const &  r 
)
inline

Greater than comparison (>) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 351 of file functions/genome_locus.hpp.

◆ locus_greater() [4/4]

bool genesis::population::locus_greater ( std::string const &  l_chromosome,
size_t  l_position,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Greater than comparison (>) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 331 of file functions/genome_locus.hpp.

◆ locus_greater_or_equal() [1/4]

bool genesis::population::locus_greater_or_equal ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Greater than or equal comparison (>=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 471 of file functions/genome_locus.hpp.

◆ locus_greater_or_equal() [2/4]

bool genesis::population::locus_greater_or_equal ( GenomeLocus const &  l,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Greater than or equal comparison (>=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 451 of file functions/genome_locus.hpp.

◆ locus_greater_or_equal() [3/4]

bool genesis::population::locus_greater_or_equal ( std::string const &  l_chromosome,
size_t  l_position,
GenomeLocus const &  r 
)
inline

Greater than or equal comparison (>=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 461 of file functions/genome_locus.hpp.

◆ locus_greater_or_equal() [4/4]

bool genesis::population::locus_greater_or_equal ( std::string const &  l_chromosome,
size_t  l_position,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Greater than or equal comparison (>=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 441 of file functions/genome_locus.hpp.

◆ locus_inequal() [1/4]

bool genesis::population::locus_inequal ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Inequality comparison (!=) for two loci in a genome.

Definition at line 251 of file functions/genome_locus.hpp.

◆ locus_inequal() [2/4]

bool genesis::population::locus_inequal ( GenomeLocus const &  l,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Inequality comparison (!=) for two loci in a genome.

Definition at line 231 of file functions/genome_locus.hpp.

◆ locus_inequal() [3/4]

bool genesis::population::locus_inequal ( std::string const &  l_chromosome,
size_t  l_position,
GenomeLocus const &  r 
)
inline

Inequality comparison (!=) for two loci in a genome.

Definition at line 241 of file functions/genome_locus.hpp.

◆ locus_inequal() [4/4]

bool genesis::population::locus_inequal ( std::string const &  l_chromosome,
size_t  l_position,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Inequality comparison (!=) for two loci in a genome.

Definition at line 221 of file functions/genome_locus.hpp.

◆ locus_less() [1/4]

bool genesis::population::locus_less ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Less than comparison (<) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 307 of file functions/genome_locus.hpp.

◆ locus_less() [2/4]

bool genesis::population::locus_less ( GenomeLocus const &  l,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Less than comparison (<) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 287 of file functions/genome_locus.hpp.

◆ locus_less() [3/4]

bool genesis::population::locus_less ( std::string const &  l_chromosome,
size_t  l_position,
GenomeLocus const &  r 
)
inline

Less than comparison (<) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 297 of file functions/genome_locus.hpp.

◆ locus_less() [4/4]

bool genesis::population::locus_less ( std::string const &  l_chromosome,
size_t  l_position,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Less than comparison (<) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 277 of file functions/genome_locus.hpp.

◆ locus_less_or_equal() [1/4]

bool genesis::population::locus_less_or_equal ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Less than or equal comparison (<=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 417 of file functions/genome_locus.hpp.

◆ locus_less_or_equal() [2/4]

bool genesis::population::locus_less_or_equal ( GenomeLocus const &  l,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Less than or equal comparison (<=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 397 of file functions/genome_locus.hpp.

◆ locus_less_or_equal() [3/4]

bool genesis::population::locus_less_or_equal ( std::string const &  l_chromosome,
size_t  l_position,
GenomeLocus const &  r 
)
inline

Less than or equal comparison (<=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 407 of file functions/genome_locus.hpp.

◆ locus_less_or_equal() [4/4]

bool genesis::population::locus_less_or_equal ( std::string const &  l_chromosome,
size_t  l_position,
std::string const &  r_chromosome,
size_t  r_position 
)
inline

Less than or equal comparison (<=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 385 of file functions/genome_locus.hpp.

◆ make_default_sliding_interval_window_iterator()

SlidingIntervalWindowIterator<ForwardIterator> genesis::population::make_default_sliding_interval_window_iterator ( ForwardIterator  begin,
ForwardIterator  end,
size_t  width = 0,
size_t  stride = 0 
)

Helper function to instantiate a SlidingIntervalWindowIterator for a default use case.

This helper assumes that the underlying type of the input data stream and of the Windows that we are sliding over are of the same type, that is, we do no conversion in the entry_input_function functor of the SlidingIntervalWindowIterator. It further assumes that this data type has public member variables chromosome and position that are accessed by the chromosome_function and position_function functors of the SlidingIntervalWindowIterator. For example, a data type that this works for is Variant data.

Definition at line 495 of file sliding_interval_window_iterator.hpp.

◆ make_default_sliding_variants_window_iterator()

SlidingVariantsWindowIterator<ForwardIterator> genesis::population::make_default_sliding_variants_window_iterator ( ForwardIterator  begin,
ForwardIterator  end,
size_t  width = 0,
size_t  stride = 0 
)

Helper function to instantiate a SlidingVariantsWindowIterator for a default use case.

This helper assumes that the underlying type of the input data stream and of the Windows that we are sliding over are of the same type, that is, we do no conversion in the entry_input_function functor of the SlidingVariantsWindowIterator. It further assumes that this data type has public member variables chromosome and position that are accessed by the chromosome_function and position_function functors of the SlidingVariantsWindowIterator. For example, a data type that this works for is Variant data.

Definition at line 368 of file sliding_variants_window_iterator.hpp.

◆ make_input_iterator_with_sample_filter_()

std::shared_ptr<T> genesis::population::make_input_iterator_with_sample_filter_ ( std::string const &  filename,
R const &  reader,
std::vector< size_t > const &  sample_indices,
bool  inverse_sample_indices,
std::vector< bool > const &  sample_filter 
)

Local helper function template that takes care of initializing an input iterator, and setting the sample filters, for those iterators for which we do not know the number of samples prior to starting the file iteration.

The template arguments are: T the returned type of input iterator, and R the underlying reader type. This is very specific for the use case here, and currently is only meant for how we work with the SimplePileupReader and the SyncReader and their iterators. Both their iterators accept a reader to take settings from.

Definition at line 62 of file variant_input_iterator.cpp.

◆ make_sliding_interval_window_iterator()

SlidingIntervalWindowIterator<ForwardIterator, DataType> genesis::population::make_sliding_interval_window_iterator ( ForwardIterator  begin,
ForwardIterator  end,
size_t  width = 0,
size_t  stride = 0 
)

Helper function to instantiate a SlidingIntervalWindowIterator without the need to specify the template parameters manually.

The three functors entry_input_function, chromosome_function, and position_function of the SlidingIntervalWindowIterator have to be set in the returned iterator before using it. See make_default_sliding_interval_window_iterator() for an alternative make function that sets these three functors to reasonable defaults that work for the Variant data type.

Definition at line 474 of file sliding_interval_window_iterator.hpp.

◆ make_sliding_variants_window_iterator()

SlidingVariantsWindowIterator<ForwardIterator, DataType> genesis::population::make_sliding_variants_window_iterator ( ForwardIterator  begin,
ForwardIterator  end,
size_t  width = 0,
size_t  stride = 0 
)

Helper function to instantiate a SlidingVariantsWindowIterator without the need to specify the template parameters manually.

Definition at line 347 of file sliding_variants_window_iterator.hpp.

◆ make_variant_input_iterator_from_individual_vcf_file() [1/2]

VariantInputIterator make_variant_input_iterator_from_individual_vcf_file ( std::string const &  filename,
bool  use_allelic_depth = false,
bool  only_biallelic = true,
bool  only_filter_pass = true 
)

Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as an individual, and combining them all into one BaseCounts sample.

See convert_to_variant_as_individuals( VcfRecord const&, bool ) for details on the conversion from VcfRecord to Variant. We only consider biallelic SNP positions here.

If only_filter_pass is set to true (default), only those positions are considered that have the FILTER field set to "PASS". That is, all variants that did not pass a filter in the VCF processing are skipped.

See also
See make_variant_input_iterator_from_pool_vcf_file() for the function that instead interprets each sample (column) as a pool of individuals, e.g., from pool sequencing.

Definition at line 456 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_individual_vcf_file() [2/2]

VariantInputIterator make_variant_input_iterator_from_individual_vcf_file ( std::string const &  filename,
std::vector< std::string > const &  sample_names,
bool  inverse_sample_names = false,
bool  use_allelic_depth = false,
bool  only_biallelic = true,
bool  only_filter_pass = true 
)

Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as an individual, and combining them all into one BaseCounts sample.

See convert_to_variant_as_individuals( VcfRecord const&, bool ) for details on the conversion from VcfRecord to Variant. We only consider biallelic SNP positions here.

If only_filter_pass is set to true (default), only those positions are considered that have the FILTER field set to "PASS". That is, all variants that did not pass a filter in the VCF processing are skipped.

See also
See make_variant_input_iterator_from_pool_vcf_file() for the function that instead interprets each sample (column) as a pool of individuals, e.g., from pool sequencing.

Additionally, this version of the function takes a list of sample_names which are used as filter so that only those samples (columns of the VCF records) are evaluated and accessible - or, if inverse_sample_names is set to true, instead all but those samples.

Definition at line 468 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_pileup_file() [1/3]

VariantInputIterator make_variant_input_iterator_from_pileup_file ( std::string const &  filename,
SimplePileupReader const &  reader = SimplePileupReader{} 
)

Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants.

Optionally, this takes a reader with settings to be used.

Definition at line 237 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_pileup_file() [2/3]

VariantInputIterator make_variant_input_iterator_from_pileup_file ( std::string const &  filename,
std::vector< bool > const &  sample_filter,
SimplePileupReader const &  reader = SimplePileupReader{} 
)

Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants.

This uses only the samples at the indices where the sample_filter is true. Optionally, this takes a reader with settings to be used.

Definition at line 257 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_pileup_file() [3/3]

VariantInputIterator make_variant_input_iterator_from_pileup_file ( std::string const &  filename,
std::vector< size_t > const &  sample_indices,
bool  inverse_sample_indices = false,
SimplePileupReader const &  reader = SimplePileupReader{} 
)

Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants.

This uses only the samples at the zero-based indices given in the sample_indices list. If inverse_sample_indices is true, this list is inverted, that is, all sample indices but the ones listed are included in the output.

For example, given a list { 0, 2 } and a file with 4 samples, only the first and the third sample will be in the output. When however inverse_sample_indices is also set, then the output will contain the second and fourth sample.

Optionally, this takes a reader with settings to be used.
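
The relation between an index list, its inversion, and a boolean sample filter can be sketched as follows. This is an illustration only: make_sample_filter_sketch is a hypothetical helper, and passing the number of samples explicitly is an assumption made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: turn a list of zero-based sample indices into a
// per-sample boolean filter, optionally inverted.
std::vector<bool> make_sample_filter_sketch(
    std::vector<std::size_t> const& sample_indices,
    bool inverse_sample_indices,
    std::size_t sample_count
) {
    std::vector<bool> filter( sample_count, false );
    for( auto const index : sample_indices ) {
        // Assumption of this sketch: out-of-range indices are a caller error.
        assert( index < sample_count );
        filter[ index ] = true;
    }

    // Inverting the selection keeps all samples but the listed ones.
    if( inverse_sample_indices ) {
        filter.flip();
    }
    return filter;
}
```

With the list { 0, 2 } and 4 samples, the sketch yields { true, false, true, false }, and { false, true, false, true } when the inversion is set, matching the example above.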

Definition at line 246 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_pileup_file_()

VariantInputIterator genesis::population::make_variant_input_iterator_from_pileup_file_ ( std::string const &  filename,
SimplePileupReader const &  reader,
std::vector< size_t > const &  sample_indices,
bool  inverse_sample_indices,
std::vector< bool > const &  sample_filter 
)

Local helper function that takes care of the three make_variant_input_iterator_from_pileup_file() overloads.

Definition at line 193 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_pool_vcf_file() [1/2]

VariantInputIterator make_variant_input_iterator_from_pool_vcf_file ( std::string const &  filename,
bool  only_biallelic = true,
bool  only_filter_pass = true 
)

Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as a pool of individuals.

See convert_to_variant_as_pool( VcfRecord const& ) for details on the conversion from VcfRecord to Variant.

This function requires the VCF to have the "AD" FORMAT field. It only iterates over those VCF record lines that actually have the "AD" FORMAT provided, as this is the information that we use to convert the samples to Variants. All records without that field are skipped. Only SNP records are processed; that is, all non-SNPs (indels and others) are ignored.

If only_biallelic is set to true (default), this is further restricted to only contain biallelic SNPs, that is, only positions with exactly one alternative allele.

If only_filter_pass is set to true (default), only those positions are considered that have the FILTER field set to "PASS". That is, all variants that did not pass a filter in the VCF processing are skipped.

See also
See make_variant_input_iterator_from_individual_vcf_file() for the function that instead interprets the VCF as usual as a set of individuals.

Definition at line 432 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_pool_vcf_file() [2/2]

VariantInputIterator make_variant_input_iterator_from_pool_vcf_file ( std::string const &  filename,
std::vector< std::string > const &  sample_names,
bool  inverse_sample_names = false,
bool  only_biallelic = true,
bool  only_filter_pass = true 
)

Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as a pool of individuals.

See convert_to_variant_as_pool( VcfRecord const& ) for details on the conversion from VcfRecord to Variant.

This function requires the VCF to have the "AD" FORMAT field. It only iterates over those VCF record lines that actually have the "AD" FORMAT provided, as this is the information that we use to convert the samples to Variants. All records without that field are skipped. Only SNP records are processed; that is, all non-SNPs (indels and others) are ignored.

If only_biallelic is set to true (default), this is further restricted to only contain biallelic SNPs, that is, only positions with exactly one alternative allele.

If only_filter_pass is set to true (default), only those positions are considered that have the FILTER field set to "PASS". That is, all variants that did not pass a filter in the VCF processing are skipped.

See also
See make_variant_input_iterator_from_individual_vcf_file() for the function that instead interprets the VCF as usual as a set of individuals.

Additionally, this version of the function takes a list of sample_names which are used as filter so that only those samples (columns of the VCF records) are evaluated and accessible - or, if inverse_sample_names is set to true, instead all but those samples.

Definition at line 443 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_sam_file()

VariantInputIterator make_variant_input_iterator_from_sam_file ( std::string const &  filename,
SamVariantInputIterator const &  reader = SamVariantInputIterator{} 
)

Create a VariantInputIterator to iterate the contents of a SAM/BAM/CRAM file as Variants.

An instance of SamVariantInputIterator can be provided from which the settings are copied.

Depending on the settings used in the reader, this can either produce a single sample (one BaseCounts object in the resulting Variant at each position in the genome), or split the input file by the read group (RG) tag (potentially also allowing for an "unaccounted" group of reads).

The other make_variant_input_iterator_... functions offer settings to subset (filter) the samples based on their names or indices. This can be achieved here as well, but has to be done directly in the reader, rather than by providing filter arguments to this function.

Definition at line 129 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_sync_file() [1/3]

VariantInputIterator make_variant_input_iterator_from_sync_file ( std::string const &  filename)

Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants.

Definition at line 314 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_sync_file() [2/3]

VariantInputIterator make_variant_input_iterator_from_sync_file ( std::string const &  filename,
std::vector< bool > const &  sample_filter 
)

Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants.

This uses only the samples at the indices where the sample_filter is true.

Definition at line 332 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_sync_file() [3/3]

VariantInputIterator make_variant_input_iterator_from_sync_file ( std::string const &  filename,
std::vector< size_t > const &  sample_indices,
bool  inverse_sample_indices = false 
)

Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants.

This uses only the samples at the zero-based indices given in the sample_indices list. If inverse_sample_indices is true, this list is inverted, that is, all sample indices but the ones listed are included in the output.

For example, given a list { 0, 2 } and a file with 4 samples, only the first and the third sample will be in the output. When however inverse_sample_indices is also set, then the output will contain the second and fourth sample.

Definition at line 322 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_sync_file_()

VariantInputIterator genesis::population::make_variant_input_iterator_from_sync_file_ ( std::string const &  filename,
std::vector< size_t > const &  sample_indices,
bool  inverse_sample_indices,
std::vector< bool > const &  sample_filter 
)

Definition at line 271 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_variant_parallel_input_iterator()

VariantInputIterator make_variant_input_iterator_from_variant_parallel_input_iterator ( VariantParallelInputIterator const &  parallel_input,
bool  allow_ref_base_mismatches = false,
bool  allow_alt_base_mismatches = true,
std::string const &  source_sample_separator = ":" 
)

Create a VariantInputIterator to iterate multiple input sources at once, using a VariantParallelInputIterator.

This wraps multiple input sources into one iterator that traverses all of them in parallel, and is here then yet again turned into a Variant per position, using VariantParallelInputIterator::Iterator::joined_variant() to combine all input sources into one. See there for the meaning of the two bool parameters of this function.

As this is iterating multiple files, we leave the VariantInputIteratorData::file_path and VariantInputIteratorData::source_name empty, and fill the VariantInputIteratorData::sample_names with the sample names of the underlying input sources of the parallel iterator, using their respective source_name as a prefix, separated by source_sample_separator, for example my_bam:S1 for a source file /path/to/my_bam.bam with a RG read group tag S1.
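
The naming scheme can be illustrated with a small stand-alone sketch. The function prefixed_sample_names_sketch is hypothetical and only shows the string concatenation; how the library derives the source_name from a file path is not covered here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: prefix each sample name of one input source with the
// name of that source, joined by the separator.
std::vector<std::string> prefixed_sample_names_sketch(
    std::string const& source_name,
    std::vector<std::string> const& sample_names,
    std::string const& source_sample_separator = ":"
) {
    std::vector<std::string> result;
    result.reserve( sample_names.size() );
    for( auto const& sample_name : sample_names ) {
        result.push_back( source_name + source_sample_separator + sample_name );
    }
    return result;
}
```

For a source named my_bam with the read groups S1 and S2, this yields my_bam:S1 and my_bam:S2.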

Definition at line 488 of file variant_input_iterator.cpp.

◆ make_variant_input_iterator_from_vcf_file_()

VariantInputIterator genesis::population::make_variant_input_iterator_from_vcf_file_ ( std::string const &  filename,
std::vector< std::string > const &  sample_names,
bool  inverse_sample_names,
bool  pool_samples,
bool  use_allelic_depth,
bool  only_biallelic,
bool  only_filter_pass 
)

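Local helper function that takes care of both main functions, make_variant_input_iterator_from_pool_vcf_file() and make_variant_input_iterator_from_individual_vcf_file().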

Definition at line 351 of file variant_input_iterator.cpp.

◆ merge() [1/2]

BaseCounts merge ( BaseCounts const &  p1,
BaseCounts const &  p2 
)

Merge the counts of two BaseCounts objects.

Definition at line 372 of file population/functions/functions.cpp.

◆ merge() [2/2]

BaseCounts merge ( std::vector< BaseCounts > const &  p)

Merge the counts of a vector of BaseCounts objects.

Definition at line 379 of file population/functions/functions.cpp.

◆ merge_inplace()

void merge_inplace ( BaseCounts p1,
BaseCounts const &  p2 
)

Merge the counts of two BaseCounts objects, by adding the counts of the second (p2) to the first (p1).

Definition at line 355 of file population/functions/functions.cpp.

◆ n_base()

double n_base ( size_t  coverage,
size_t  poolsize 
)

Compute the n_base term used for Tajima's D in Kofler et al. 2011, using a faster closed form expression.

This term is the expected number of distinct individuals sequenced, which is equivalent to finding the expected number of distinct values selected from a set of integers.

The computation in PoPoolation is slow, see n_base_matrix(). We here instead use a closed form expression; see https://math.stackexchange.com/a/72351 for the reasoning and the derivation of the equation.
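
As a sketch of such a closed form: if each read is assumed to come from one of the n individuals of the pool, chosen uniformly and independently (that is, with replacement), then the expected number of distinct individuals among c reads is n · (1 − ((n − 1) / n)^c). The function n_base_sketch below is a hypothetical stand-alone illustration of that expression under this sampling assumption, and is not taken from the library source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical sketch: expected number of distinct individuals among
// `coverage` reads drawn uniformly with replacement from `poolsize` individuals.
double n_base_sketch( std::size_t coverage, std::size_t poolsize )
{
    // An empty pool would divide by zero below.
    assert( poolsize > 0 );

    double const c = static_cast<double>( coverage );
    double const n = static_cast<double>( poolsize );

    // Each individual is missed by all c reads with probability ((n-1)/n)^c,
    // so it is seen at least once with one minus that probability.
    return n * ( 1.0 - std::pow( ( n - 1.0 ) / n, c ));
}
```

For example, two reads from a pool of two individuals hit either the same individual (probability 1/2) or both (probability 1/2), so the expectation is 1.5, which is also what the expression gives.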

Definition at line 432 of file diversity.cpp.

◆ n_base_matrix()

double n_base_matrix ( size_t  coverage,
size_t  poolsize 
)

Compute the n_base term used for Tajima's D in Kofler et al. 2011, following their approach.

This term is the expected number of distinct individuals sequenced, which is equivalent to finding the expected number of distinct values selected from a set of integers.

The computation of this term in PoPoolation uses a recursive dynamic programming approach to sum over different possibilities of selecting sets of integers. This gets rather slow for larger inputs, and there is an equivalent closed form that we here use instead. See n_base() for details. We here merely offer the original PoPoolation implementation as a point of reference.

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

Definition at line 398 of file diversity.cpp.

◆ nucleotide_sorting_order_()

std::array<size_t, 4> genesis::population::nucleotide_sorting_order_ ( std::array< T, 4 > const &  values)

Local helper function that runs a sorting network to sort four values, coming from the four nucleotides.

The input is an array of four values, either counts or frequencies. The output is an array of the indices into the input, sorted so that the index of the largest value comes first:

auto const data = std::array<T, 4>{ 15, 10, 20, 5 };
auto const order = nucleotide_sorting_order_( data );

yields { 2, 0, 1, 3 }, so that data[order[0]] = data[2] = 20 is the largest value, data[order[1]] = data[0] = 15 the second largest, and so forth.
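For readers unfamiliar with sorting networks, the following is a minimal sketch of how four values can be ordered with five fixed compare-and-swap steps on an index array. The comparator sequence is the standard one for four inputs and is an assumption here, as is the name sorting_order_sketch; the actual implementation may order its comparisons differently.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical sketch: return the indices of `values`, largest value first,
// using a fixed five-comparator sorting network.
template<typename T>
std::array<std::size_t, 4> sorting_order_sketch( std::array<T, 4> const& values )
{
    std::array<std::size_t, 4> idx = {{ 0, 1, 2, 3 }};

    // Swap two index slots if the value in the later slot is larger.
    auto compare_swap = [&]( std::size_t a, std::size_t b ) {
        if( values[ idx[a] ] < values[ idx[b] ] ) {
            std::swap( idx[a], idx[b] );
        }
    };

    compare_swap( 0, 1 );
    compare_swap( 2, 3 );
    compare_swap( 0, 2 );
    compare_swap( 1, 3 );
    compare_swap( 1, 2 );
    return idx;
}
```

Because the sequence of comparisons is fixed, no loop or branch depends on the data other than the swaps themselves, which is why such networks are attractive for very small arrays.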

Definition at line 162 of file population/functions/functions.cpp.

◆ nucleotide_sum()

size_t genesis::population::nucleotide_sum ( BaseCounts const &  sample)
inline

Count of the pure nucleotide bases at this position, that is, the sum of all A, C, G, and T.

This is simply the sum of a_count + c_count + g_count + t_count, which we often use as the coverage at the given site.

NB: In PoPoolation, this variable is called eucov.

Definition at line 222 of file population/functions/functions.hpp.

◆ operator!=() [1/2]

bool genesis::population::operator!= ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Inequality comparison (!=) for two loci in a genome.

Definition at line 261 of file functions/genome_locus.hpp.

◆ operator!=() [2/2]

bool operator!= ( GenomeRegion const &  a,
GenomeRegion const &  b 
)

Inequality comparison (!=) for two GenomeRegions.

Definition at line 53 of file genome_region.cpp.

◆ operator<()

bool genesis::population::operator< ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Less than comparison (<) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
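A minimal sketch of this ordering, using a hypothetical LocusSketch struct in place of GenomeLocus (the member names are assumptions): chromosome names are compared first, and positions only break ties.

```cpp
#include <cstddef>
#include <string>
#include <tuple>

// Hypothetical stand-in for a locus: chromosome name plus position.
struct LocusSketch {
    std::string chromosome;
    std::size_t position;
};

// Lexicographic comparison: chromosome name first, then position.
inline bool locus_less_sketch( LocusSketch const& l, LocusSketch const& r )
{
    return std::tie( l.chromosome, l.position ) < std::tie( r.chromosome, r.position );
}
```

Note that a purely lexicographic comparison of names places "chr10" before "chr2", which differs from the natural numeric order of chromosomes.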

Definition at line 317 of file functions/genome_locus.hpp.

◆ operator<<() [1/3]

std::ostream & operator<< ( std::ostream &  os,
BaseCounts const &  bs 
)

Output stream operator for BaseCounts instances.

Definition at line 486 of file population/functions/functions.cpp.

◆ operator<<() [2/3]

std::ostream& genesis::population::operator<< ( std::ostream &  os,
GenomeLocus const &  locus 
)
inline

Definition at line 64 of file functions/genome_locus.hpp.

◆ operator<<() [3/3]

std::ostream & operator<< ( std::ostream &  os,
GenomeRegion const &  region 
)

Definition at line 62 of file genome_region.cpp.

◆ operator<=()

bool genesis::population::operator<= ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Less than or equal comparison (<=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 427 of file functions/genome_locus.hpp.

◆ operator==() [1/2]

bool genesis::population::operator== ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Equality comparison (==) for two loci in a genome.

Definition at line 209 of file functions/genome_locus.hpp.

◆ operator==() [2/2]

bool operator== ( GenomeRegion const &  a,
GenomeRegion const &  b 
)

Equality comparison (==) for two GenomeRegions.

Definition at line 48 of file genome_region.cpp.

◆ operator>()

bool genesis::population::operator> ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Greater than comparison (>) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 371 of file functions/genome_locus.hpp.

◆ operator>=()

bool genesis::population::operator>= ( GenomeLocus const &  l,
GenomeLocus const &  r 
)
inline

Greater than or equal comparison (>=) for two loci in a genome.

Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.

Definition at line 481 of file functions/genome_locus.hpp.

◆ parse_genome_region()

GenomeRegion parse_genome_region ( std::string const &  region,
bool  zero_based = false,
bool  end_exclusive = false 
)

Parse a genomic region.

Accepted formats are "chromosome", "chromosome:position", "chromosome:start-end", and "chromosome:start..end".

By default, we expect positions (coordinates) to be 1-based and inclusive (closed interval), but this can be changed with the additional parameters zero_based and end_exclusive.
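The accepted formats and the two coordinate switches can be sketched as follows. This is a simplified illustration, not the library's parser: RegionSketch and parse_region_sketch are hypothetical names, a bare chromosome name is represented here by start = end = 0, and end_exclusive is applied only to ranges. All of these are assumptions of the sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical result type: 1-based, inclusive coordinates.
// In this sketch, start == end == 0 marks "whole chromosome".
struct RegionSketch {
    std::string chromosome;
    std::size_t start = 0;
    std::size_t end   = 0;
};

RegionSketch parse_region_sketch(
    std::string const& region, bool zero_based = false, bool end_exclusive = false
) {
    RegionSketch result;
    auto const colon = region.find( ':' );
    result.chromosome = region.substr( 0, colon );
    if( result.chromosome.empty() ) {
        throw std::invalid_argument( "Empty chromosome name in: " + region );
    }
    if( colon == std::string::npos ) {
        return result; // format "chromosome"
    }

    // Find the range separator, either ".." or "-".
    std::string const pos = region.substr( colon + 1 );
    std::size_t sep = pos.find( ".." );
    std::size_t sep_len = 2;
    if( sep == std::string::npos ) {
        sep = pos.find( '-' );
        sep_len = 1;
    }
    bool const is_range = ( sep != std::string::npos );

    // std::stoull throws on non-numeric input, but tolerates trailing characters.
    result.start = std::stoull( is_range ? pos.substr( 0, sep ) : pos );
    result.end   = is_range ? std::stoull( pos.substr( sep + sep_len )) : result.start;

    // Convert to 1-based, inclusive coordinates.
    if( zero_based ) {
        ++result.start;
        ++result.end;
    }
    if( end_exclusive && is_range ) {
        if( result.end == 0 ) {
            throw std::invalid_argument( "Invalid end position in: " + region );
        }
        --result.end;
    }
    if( result.start == 0 || result.start > result.end ) {
        throw std::invalid_argument( "Invalid interval in: " + region );
    }
    return result;
}
```

With both switches set to true, a BED-style interval such as "chr1:0-100" maps to positions 1 to 100 in the default 1-based inclusive convention.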

Definition at line 104 of file genome_region.cpp.

◆ parse_genome_regions()

GenomeRegionList parse_genome_regions ( std::string const &  regions,
bool  zero_based = false,
bool  end_exclusive = false 
)

Parse a set/list of genomic regions.

The individual regions need to be separated by commas (surrounding white space is okay), and each region needs to follow the format as explained in parse_genome_region(). See there for details.

Definition at line 173 of file genome_region.cpp.

◆ pij_matrix_()

genesis::utils::Matrix<double> genesis::population::pij_matrix_ ( size_t  max_coverage,
size_t  poolsize 
)

Definition at line 326 of file diversity.cpp.

◆ pij_matrix_resolver_()

genesis::utils::Matrix<double> const& genesis::population::pij_matrix_resolver_ ( size_t  max_coverage,
size_t  poolsize 
)

Definition at line 360 of file diversity.cpp.

◆ pool_diversity_measures()

PoolDiversityResults genesis::population::pool_diversity_measures ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute Theta Pi, Theta Watterson, and Tajima's D in their pool-sequencing corrected versions according to Kofler et al.

This is a high-level function that is meant as a simple example of how to compute these statistics. See theta_pi_pool(), theta_watterson_pool(), and tajima_d_pool() for details. It takes care of most options offered by PoPoolation (as given by settings here), except for the window width and stride and the minimum phred quality score, which have to be applied when filling the window (or whatever other range is used as input here), that is, before calling this function.

Furthermore, results here are not filtered afterwards, so any filtering based on, e.g., minimum covered fraction has to be done downstream.

Definition at line 484 of file diversity.hpp.

◆ prob_cond_true_freq()

std::vector< double > prob_cond_true_freq ( size_t  n,
std::vector< bool > const &  alleles,
std::vector< unsigned char > const &  phred_scores,
bool  unfolded 
)

Definition at line 121 of file afs_estimate.cpp.

◆ prob_cond_true_freq_unfolded()

std::vector< double > prob_cond_true_freq_unfolded ( size_t  n,
std::vector< bool > const &  alleles,
std::vector< unsigned char > const &  phred_scores,
bool  invert_alleles 
)

Definition at line 145 of file afs_estimate.cpp.

◆ process_conditional_probability()

void genesis::population::process_conditional_probability ( ForwardIterator  begin,
ForwardIterator  end 
)

Compute the conditional probabilities of AFs. This reimplements process_probCond from Boitard et al.

Definition at line 100 of file afs_estimate.hpp.

◆ process_pileup_correct_input_order_check_()

void genesis::population::process_pileup_correct_input_order_check_ ( utils::InputStream const &  it,
std::string &  cur_chr,
size_t &  cur_pos,
std::string const &  new_chr,
size_t  new_pos 
)

Local helper function to remove code duplication for the correct input order check.

Definition at line 54 of file simple_pileup_reader.cpp.

◆ process_sync_correct_input_order_()

void genesis::population::process_sync_correct_input_order_ ( utils::InputStream const &  it,
std::string &  cur_chr,
size_t &  cur_pos,
Variant const &  new_var 
)

Local helper function to remove code duplication for the correct input order check.

Definition at line 52 of file sync_reader.cpp.

◆ run_vcf_window()

void genesis::population::run_vcf_window ( SlidingWindowGenerator< Data, Accumulator > &  generator,
std::string const &  vcf_file,
std::function< Data(VcfRecord const &)>  conversion,
std::function< bool(VcfRecord const &)>  condition = {} 
)

Convenience function to iterate over a whole VCF file.

Deprecated:
Not in use any more, just kept around in case it might be needed later. Use SlidingIntervalWindowIterator instead.

This is a convenience function that takes care of iterating a VCF file record by record (that is, line by line), using a provided conversion function to extract the D/Data from the VcfRecord. It furthermore takes care of finishing all chromosomes properly, using their lengths as provided in the VCF header.

Before calling the function, of course, all necessary plugin functions have to be set in the SlidingWindowGenerator instance, so that the data is processed as intended. In particular, take care of setting SlidingWindowGenerator::emit_incomplete_windows() to the desired value.

Furthermore, the function offers a condition function that can be used to skip records that do not fulfill a given condition. That is, if condition is used, it needs to return true for records that shall be processed, and false for those that shall be skipped.

Definition at line 73 of file vcf_window.hpp.

◆ sam_flag_to_string()

std::string sam_flag_to_string ( int  flags)

Turn a set of flags for sam/bam/cram reads into their textual representation.

This is useful for user output. We here use the format of names as used by htslib and samtools, where names are upper case and words in flag names are separated by underscores. This ensures compatibility of the output with existing tools.
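A minimal sketch of such a conversion is given below. The flag names and bit values follow the samtools conventions; the function name sam_flags_to_string_sketch and the comma as separator are assumptions of this sketch, not a statement about the actual output format.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: list the names of all bits set in `flags`,
// in ascending bit order, separated by commas.
std::string sam_flags_to_string_sketch( int flags )
{
    static std::vector<std::pair<int, std::string>> const names = {
        { 0x1, "PAIRED" }, { 0x2, "PROPER_PAIR" }, { 0x4, "UNMAP" },
        { 0x8, "MUNMAP" }, { 0x10, "REVERSE" }, { 0x20, "MREVERSE" },
        { 0x40, "READ1" }, { 0x80, "READ2" }, { 0x100, "SECONDARY" },
        { 0x200, "QCFAIL" }, { 0x400, "DUP" }, { 0x800, "SUPPLEMENTARY" }
    };

    std::string result;
    for( auto const& entry : names ) {
        if( flags & entry.first ) {
            if( !result.empty() ) {
                result += ",";
            }
            result += entry.second;
        }
    }
    return result;
}
```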

See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details.

Definition at line 132 of file sam_flags.cpp.

◆ SimplePileupReader::process_ancestral_base_< SimplePileupReader::Sample >()

void genesis::population::SimplePileupReader::process_ancestral_base_< SimplePileupReader::Sample > ( utils::InputStream input_stream,
SimplePileupReader::Sample sample 
) const

Definition at line 715 of file simple_pileup_reader.cpp.

◆ SimplePileupReader::process_quality_string_< SimplePileupReader::Sample >()

void genesis::population::SimplePileupReader::process_quality_string_< SimplePileupReader::Sample > ( utils::InputStream input_stream,
SimplePileupReader::Sample sample 
) const

Definition at line 568 of file simple_pileup_reader.cpp.

◆ SimplePileupReader::set_sample_read_bases_< SimplePileupReader::Sample >()

void genesis::population::SimplePileupReader::set_sample_read_bases_< SimplePileupReader::Sample > ( std::string const &  read_bases,
SimplePileupReader::Sample sample 
) const

Definition at line 546 of file simple_pileup_reader.cpp.

◆ SimplePileupReader::set_sample_read_coverage_< SimplePileupReader::Sample >()

void genesis::population::SimplePileupReader::set_sample_read_coverage_< SimplePileupReader::Sample > ( size_t  read_coverage,
SimplePileupReader::Sample sample 
) const

Definition at line 524 of file simple_pileup_reader.cpp.

◆ SimplePileupReader::set_target_alternative_base_< SimplePileupReader::Record >()

void genesis::population::SimplePileupReader::set_target_alternative_base_< SimplePileupReader::Record > ( SimplePileupReader::Record target) const

Definition at line 503 of file simple_pileup_reader.cpp.

◆ sorted_average_base_counts()

std::pair< SortedBaseCounts, SortedBaseCounts > sorted_average_base_counts ( BaseCounts const &  sample_a,
BaseCounts const &  sample_b 
)

Return the sorted base counts of both input samples, ordered by the average frequencies of the nucleotide counts in the two samples.

Both returned counts will be in the same order, with the nucleotide first that has the highest average count in the two samples, etc.

Definition at line 221 of file population/functions/functions.cpp.

◆ sorted_base_counts() [1/2]

SortedBaseCounts sorted_base_counts ( BaseCounts const &  sample)

Return the order of base counts (nucleotides), largest one first.

Definition at line 191 of file population/functions/functions.cpp.

◆ sorted_base_counts() [2/2]

SortedBaseCounts sorted_base_counts ( Variant const &  variant,
bool  reference_first 
)

Get a list of bases sorted by their counts.

If reference_first is set to true, the first entry in the resulting array is always the reference base of the Variant, while the other three bases are sorted by counts. If reference_first is set to false, all four bases are sorted by their counts.
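The effect of reference_first can be sketched as follows, using plain (base, count) pairs instead of the Variant and SortedBaseCounts types. The function name and the fallback of sorting all four bases when the reference base is not one of A, C, G, T are assumptions of this sketch.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for one sorted entry: nucleotide character and its count.
using BaseCountPair = std::pair<char, std::size_t>;

std::array<BaseCountPair, 4> sorted_bases_sketch(
    std::array<BaseCountPair, 4> entries, char reference_base, bool reference_first
) {
    auto first = entries.begin();
    if( reference_first ) {
        auto ref = std::find_if(
            entries.begin(), entries.end(),
            [&]( BaseCountPair const& e ) { return e.first == reference_base; }
        );
        if( ref != entries.end() ) {
            // Move the reference base to the front, and exclude it from sorting.
            std::rotate( entries.begin(), ref, ref + 1 );
            ++first;
        }
    }

    // Sort the remaining entries by count, largest first.
    std::stable_sort(
        first, entries.end(),
        []( BaseCountPair const& l, BaseCountPair const& r ) {
            return l.second > r.second;
        }
    );
    return entries;
}
```

For counts A=3, C=9, G=1, T=5 and reference base G, this yields G, C, T, A with reference_first set, and C, T, A, G without.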

Definition at line 288 of file population/functions/functions.cpp.

◆ status()

BaseCountsStatus status ( BaseCounts const &  sample,
size_t  min_coverage = 0,
size_t  max_coverage = 0,
size_t  min_count = 0,
bool  tolerate_deletions = false 
)

Compute a simple status with useful properties from the counts of a BaseCounts.

min_coverage

Minimum coverage expected for a BaseCounts to be considered "covered". If the number of nucleotides (A, C, G, T) in the reads of a sample is less than the provided min_coverage, then the BaseCounts is not considered sufficiently covered, and the BaseCountsStatus::is_covered flag will be set to false.

max_coverage

Same as min_coverage, but the upper bound on coverage; maximum coverage expected for a BaseCounts to be considered "covered". If the number of nucleotides exceeds this bound, the BaseCountsStatus::is_covered flag will be set to false. If provided with a value of 0 (default), max_coverage is not used.

Only if the nucleotide count is in between (or equal to either) these two bounds (min_coverage and max_coverage), it is considered to be covered, and BaseCountsStatus::is_covered will be set to true.

min_count

This value is used to determine whether a BaseCounts has too many deletions, and unless tolerate_deletions() is set to true, the BaseCountsStatus::is_ignored will be set to true in that case (too many deletions, as given by BaseCounts::d_count), while the values for BaseCountsStatus::is_covered, BaseCountsStatus::is_snp, and BaseCountsStatus::is_biallelic will be set to false.

Typically, if this function is used after calling filter_min_count() on the BaseCounts, the min_count is set to the same value for consistency.

tolerate_deletions

Set whether we tolerate BaseCounts instances with a high number of deletions.

If set to false (default), we do not tolerate deletions. In that case, if the number of deletions in a BaseCounts (given by BaseCounts::d_count) is higher than or equal to min_count, the BaseCounts will be considered ignored (BaseCountsStatus::is_ignored set to true), and considered not covered (BaseCountsStatus::is_covered, BaseCountsStatus::is_snp, and BaseCountsStatus::is_biallelic will all be set to false).

If however set to true, we tolerate high numbers of deletions, and the values for the above properties will be set as usual by considering the nucleotide counts (BaseCounts::a_count, BaseCounts::c_count, BaseCounts::g_count, and BaseCounts::t_count) instead.
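The interplay of the coverage bounds and the deletion check can be sketched as below. This is a simplified illustration with hypothetical types (CountsSketch, StatusSketch) that only models the is_covered and is_ignored flags. It further assumes that a sample without any nucleotides is never covered, and that a sample without any deletions is never ignored, even for min_count = 0; neither detail is stated above.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the count and status types.
struct CountsSketch {
    std::size_t a_count = 0, c_count = 0, g_count = 0, t_count = 0;
    std::size_t n_count = 0, d_count = 0;
};

struct StatusSketch {
    bool is_covered = false;
    bool is_ignored = false;
};

StatusSketch status_sketch(
    CountsSketch const& s,
    std::size_t min_coverage = 0, std::size_t max_coverage = 0,
    std::size_t min_count = 0, bool tolerate_deletions = false
) {
    StatusSketch result;
    std::size_t const total = s.a_count + s.c_count + s.g_count + s.t_count;

    // Too many deletions: mark as ignored, leave all other flags false.
    // Assumption: zero deletions never trigger this, even if min_count == 0.
    if( !tolerate_deletions && s.d_count > 0 && s.d_count >= min_count ) {
        result.is_ignored = true;
        return result;
    }

    // Coverage bounds are inclusive; max_coverage == 0 disables the upper bound.
    // Assumption: a sample with no nucleotides at all is never covered.
    bool const above_min = ( total > 0 ) && ( total >= min_coverage );
    bool const below_max = ( max_coverage == 0 ) || ( total <= max_coverage );
    result.is_covered = above_min && below_max;
    return result;
}
```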

Definition at line 49 of file population/functions/functions.cpp.

◆ string_to_sam_flag()

int string_to_sam_flag ( std::string const &  value)

Parse a string as a set of flags for sam/bam/cram reads.

The given string can either be the numeric value as specified by the sam standard, or given as a list of flag names or values, which can be separated by comma, space, vertical bar, or plus sign, and where each flag name is treated case-insensitive and without regarding non-alpha-numeric characters. This is a more lenient parsing than what htslib and samtools offer.

For example, it accepts:

1
0x12
PROPER_PAIR,MREVERSE
ProperPair + MateReverse
PROPER_PAIR | 0x20
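The lenient parsing described above can be sketched as follows: split at the separators, drop all non-alphanumeric characters, lower-case the rest, and then either parse the token as a number or look it up by name. The lookup table mirrors the names listed for sam_flag_name_to_int_; the function name and the exact tokenization are assumptions of this sketch, not the library's implementation.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical sketch of a lenient sam flag parser.
int parse_sam_flags_sketch( std::string const& value )
{
    static std::unordered_map<std::string, int> const names = {
        { "paired", 0x1 }, { "properpair", 0x2 }, { "unmap", 0x4 },
        { "unmapped", 0x4 }, { "munmap", 0x8 }, { "mateunmapped", 0x8 },
        { "reverse", 0x10 }, { "mreverse", 0x20 }, { "matereverse", 0x20 },
        { "read1", 0x40 }, { "read2", 0x80 }, { "secondary", 0x100 },
        { "qcfail", 0x200 }, { "dup", 0x400 }, { "duplicate", 0x400 },
        { "supplementary", 0x800 }
    };

    int result = 0;
    std::string token;

    // Turn the current token into a flag value and add it to the result.
    auto flush = [&]() {
        if( token.empty() ) {
            return;
        }
        if( std::isdigit( static_cast<unsigned char>( token[0] ))) {
            // Base 0 lets std::stoi accept decimal as well as 0x-prefixed hex.
            result |= std::stoi( token, nullptr, 0 );
        } else {
            auto const it = names.find( token );
            if( it == names.end() ) {
                throw std::invalid_argument( "Unknown sam flag: " + token );
            }
            result |= it->second;
        }
        token.clear();
    };

    for( char c : value ) {
        if( c == ',' || c == ' ' || c == '|' || c == '+' ) {
            flush();
        } else if( std::isalnum( static_cast<unsigned char>( c ))) {
            token += static_cast<char>( std::tolower( static_cast<unsigned char>( c )));
        }
        // All other characters, such as underscores, are dropped.
    }
    flush();
    return result;
}
```

In this sketch, the last three example inputs above all evaluate to the same value, 0x22.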

See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details.

Definition at line 81 of file sam_flags.cpp.

◆ tajima_d_pool() [1/2]

double genesis::population::tajima_d_pool ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al.

Definition at line 454 of file diversity.hpp.

◆ tajima_d_pool() [2/2]

double genesis::population::tajima_d_pool ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end,
double  theta_pi,
double  theta_watterson 
)

Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al.

Definition at line 430 of file diversity.hpp.

◆ tajima_d_pool_denominator()

double tajima_d_pool_denominator ( PoolDiversitySettings const &  settings,
size_t  snp_count,
double  theta 
)

Compute the denominator for the pool-sequencing correction of Tajima's D according to Kofler et al.

Definition at line 451 of file diversity.cpp.

◆ theta_pi()

double genesis::population::theta_pi ( ForwardIterator  begin,
ForwardIterator  end,
bool  with_bessel = true 
)

Compute classic theta pi, that is, the sum of heterozygosities.

The function simply sums heterozygosity() for all samples in the given range. If with_bessel is set, Bessel's correction for the total nucleotide count is used.

Definition at line 178 of file diversity.hpp.

◆ theta_pi_pool() [1/2]

double genesis::population::theta_pi_pool ( PoolDiversitySettings const &  settings,
BaseCounts const &  sample 
)
inline

Compute theta pi with pool-sequencing correction according to Kofler et al, for a single BaseCounts, that is, its heterozygosity() including Bessel's correction for the total nucleotide count at each position, divided by the correction denominator.

Definition at line 222 of file diversity.hpp.

◆ theta_pi_pool() [2/2]

double genesis::population::theta_pi_pool ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute theta pi with pool-sequencing correction according to Kofler et al, that is, the sum of heterozygosities divided by the correction denominator.

The function sums heterozygosity() for all samples in the given range, including Bessel's correction for the total nucleotide count at each position, and divides each by the respective denominator to correct for error from pool sequencing. See theta_pi_pool_denominator() for details.

Definition at line 199 of file diversity.hpp.

◆ theta_pi_pool_denominator()

double theta_pi_pool_denominator ( PoolDiversitySettings const &  settings,
size_t  nucleotide_count 
)

Compute the denominator for the pool-sequencing correction of theta pi according to Kofler et al.

We here compute the denominator for a given poolsize, with a fixed min_allele_count, which is identical for each given nucleotide_count, and hence cached internally for speedup.

See

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

Definition at line 136 of file diversity.cpp.

◆ theta_pi_within_pool()

double genesis::population::theta_pi_within_pool ( ForwardIterator  begin,
ForwardIterator  end,
size_t  poolsize 
)

Compute classic theta pi (within a population), that is, the sum of heterozygosities including Bessel's correction for total nucleotide sum at each position, and Bessel's correction for the pool size.

This is the same computation used for theta pi within in the FST computation of f_st_pool_unbiased(). It does not use the pool seq correction of Kofler et al.

Definition at line 240 of file diversity.hpp.

◆ theta_watterson_pool()

double genesis::population::theta_watterson_pool ( PoolDiversitySettings const &  settings,
ForwardIterator  begin,
ForwardIterator  end 
)

Compute theta watterson with pool-sequencing correction according to Kofler et al.

Definition at line 272 of file diversity.hpp.

◆ theta_watterson_pool_denominator()

double theta_watterson_pool_denominator ( PoolDiversitySettings const &  settings,
size_t  nucleotide_count 
)

Compute the denominator for the pool-sequencing correction of theta watterson according to Kofler et al.

We here compute the denominator for a given poolsize, with a fixed min_allele_count, which is identical for each given nucleotide_count, and hence cached internally for speedup.

See

R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925

for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf

Definition at line 186 of file diversity.cpp.

◆ to_string() [1/2]

std::string genesis::population::to_string ( GenomeLocus const &  locus)
inline

Definition at line 48 of file functions/genome_locus.hpp.

◆ to_string() [2/2]

std::string to_string ( GenomeRegion const &  region)

Definition at line 69 of file genome_region.cpp.

◆ to_sync() [1/2]

std::ostream & to_sync ( BaseCounts const &  bs,
std::ostream &  os 
)

Output a BaseCounts instance to a stream in the PoPoolation2 sync format.

This is one column from that file, outputting the counts separated by colons, in the order A:T:C:G:N:D, with D being deletions (* in pileup).

Definition at line 43 of file sync_common.cpp.

◆ to_sync() [2/2]

std::ostream & to_sync ( Variant const &  var,
std::ostream &  os 
)

Output a Variant instance to a stream in the PoPoolation2 sync format.

The format is a tab-delimited file with one variant per line:

  • col1: reference contig
  • col2: position within the reference contig
  • col3: reference character
  • col4: allele frequencies of population number 1
  • col5: allele frequencies of population number 2
  • coln: allele frequencies of population number n

Each population column outputs counts separated by colons, in the order A:T:C:G:N:D, with D being deletions (* in pileup).

See https://sourceforge.net/p/popoolation2/wiki/Tutorial/ for details.

Definition at line 50 of file sync_common.cpp.

◆ total_base_counts()

BaseCounts total_base_counts ( Variant const &  variant)

Get the summed up total base counts of a Variant.

This is the same as calling merge() on the samples in the Variant.

Definition at line 139 of file population/functions/functions.cpp.

◆ total_nucleotide_sum()

size_t genesis::population::total_nucleotide_sum ( Variant const &  variant)
inline

Count of the pure nucleotide bases at this position, that is, the sum of all A, C, G, and T.

See nucleotide_sum() for details. This function gives the sum over all samples in the Variant.

Definition at line 232 of file population/functions/functions.hpp.

◆ transform_zero_out_by_max_count() [1/2]

void transform_zero_out_by_max_count ( BaseCounts sample,
size_t  max_count 
)

Transform a BaseCounts sample by setting any nucleotide count (A, C, G, T) to zero if max_count is exceeded for that nucleotide.

This transformation is used as a type of quality control. All nucleotide counts (that is, BaseCounts::a_count, BaseCounts::c_count, BaseCounts::g_count, and BaseCounts::t_count) that are above the given max_count are set to zero.

Definition at line 101 of file filter_transform.cpp.

◆ transform_zero_out_by_max_count() [2/2]

void transform_zero_out_by_max_count ( Variant variant,
size_t  max_count 
)

Transform a variant by setting any nucleotide count (A, C, G, T) of its samples to zero if max_count is exceeded for that nucleotide.

Definition at line 122 of file filter_transform.cpp.

◆ transform_zero_out_by_min_count() [1/2]

void transform_zero_out_by_min_count ( BaseCounts sample,
size_t  min_count 
)

Transform a BaseCounts sample by setting any nucleotide count (A, C, G, T) to zero if min_count is not reached for that nucleotide.

This transformation is used as a type of quality control. All nucleotide counts (that is, BaseCounts::a_count, BaseCounts::c_count, BaseCounts::g_count, and BaseCounts::t_count) that are below the given min_count are set to zero.

Definition at line 77 of file filter_transform.cpp.

◆ transform_zero_out_by_min_count() [2/2]

void transform_zero_out_by_min_count ( Variant variant,
size_t  min_count 
)

Transform a variant by setting any nucleotide count (A, C, G, T) of its samples to zero if min_count is not reached for that nucleotide.

Definition at line 94 of file filter_transform.cpp.

◆ transform_zero_out_by_min_max_count() [1/2]

void transform_zero_out_by_min_max_count ( BaseCounts sample,
size_t  min_count,
size_t  max_count 
)

Transform a BaseCounts sample by setting any nucleotide count (A, C, G, T) to zero if min_count is not reached or if max_count is exceeded for that nucleotide.

This is the same as running transform_zero_out_by_min_count() and transform_zero_out_by_max_count() individually.
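A minimal sketch of the combined transformation, operating on a plain array of the four nucleotide counts instead of a BaseCounts object. The names are hypothetical, and any special meaning of a max_count of zero (such as "no limit") is not modeled here.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the four nucleotide counts (A, C, G, T).
using NucleotideCountsSketch = std::array<std::size_t, 4>;

// Set every count below min_count or above max_count to zero.
// Counts equal to either bound are kept.
void zero_out_by_min_max_sketch(
    NucleotideCountsSketch& counts, std::size_t min_count, std::size_t max_count
) {
    for( auto& c : counts ) {
        if( c < min_count || c > max_count ) {
            c = 0;
        }
    }
}
```

For example, with min_count = 2 and max_count = 100, the counts 1, 5, 50, 200 become 0, 5, 50, 0.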

Definition at line 129 of file filter_transform.cpp.

◆ transform_zero_out_by_min_max_count() [2/2]

void transform_zero_out_by_min_max_count ( Variant variant,
size_t  min_count,
size_t  max_count 
)

Transform a variant by setting any nucleotide count (A, C, G, T) of its samples to zero if min_count is not reached or if max_count is exceeded for that nucleotide.

Definition at line 147 of file filter_transform.cpp.

◆ vcf_genotype_string()

std::string vcf_genotype_string ( std::vector< VcfGenotype > const &  genotypes)

Return the VCF-like string representation of a set of VcfGenotype entries.

The VcfFormatIterator::get_values() function returns all genotype entries for a given sample of a record/line. Here, we return a string representation similar to VCF of these genotypes, for example 0|0 or ./1.

Definition at line 560 of file vcf_common.cpp.

◆ vcf_genotype_sum()

size_t vcf_genotype_sum ( std::vector< VcfGenotype > const &  genotypes)

Return the sum of genotypes for a set of VcfGenotype entries, typically used to construct a genotype matrix with entries 0,1,2.

The function takes the given genotypes, encodes the reference as 0 and any alternative as 1, and then sums this over the values. For diploid organisms, this yields 0 (homozygote for the reference), 1 (heterozygote), or 2 (homozygote for the alternative), which is typically used in genotype matrices.
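The encoding can be sketched as follows, with plain allele indices standing in for VcfGenotype entries (0 for the reference, positive values for alternative alleles, negative values for missing calls). The function name and the treatment of missing calls as contributing zero are assumptions of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: count how many of the given allele indices
// refer to an alternative allele (index > 0).
std::size_t genotype_sum_sketch( std::vector<int> const& allele_indices )
{
    std::size_t sum = 0;
    for( int allele : allele_indices ) {
        if( allele > 0 ) {
            ++sum;
        }
    }
    return sum;
}
```

A genotype such as 2|1 with two different alternative alleles hence also yields 2.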

Definition at line 574 of file vcf_common.cpp.

◆ vcf_hl_type_to_string()

std::string vcf_hl_type_to_string ( int  hl_type)

Internal helper function to convert htslib-internal BCF_HL_* header line type values to their string representation as used in the VCF header ("FILTER", "INFO", "FORMAT", etc).

Definition at line 205 of file vcf_common.cpp.

◆ vcf_value_special_to_string() [1/2]

std::string vcf_value_special_to_string ( int  vl_type_num)

Definition at line 177 of file vcf_common.cpp.

◆ vcf_value_special_to_string() [2/2]

std::string vcf_value_special_to_string ( VcfValueSpecial  vl_type_num)

Definition at line 172 of file vcf_common.cpp.

◆ vcf_value_type_to_string() [1/2]

std::string vcf_value_type_to_string ( int  ht_type)

Definition at line 147 of file vcf_common.cpp.

◆ vcf_value_type_to_string() [2/2]

std::string vcf_value_type_to_string ( VcfValueType  ht_type)

Definition at line 142 of file vcf_common.cpp.

Typedef Documentation

◆ VariantInputIterator

Iterate Variants, using a variety of input file formats.

This generic iterator is an abstraction that is agnostic to the underlying file format, and can be used with anything that can be converted to a Variant per genome position. It offers to iterate a whole input file, and transform and filter the Variant as needed in order to make downstream processing as easy as possible.

This is useful for downstream processing, where we just want to work with the Variants along the genome, but want to allow different file formats for their input. Use this iterator to achieve this. For example, use the make_variant_input_iterator_...() functions to get such an iterator for different input file types.

The iterator furthermore offers a data field of type VariantInputIteratorData, which gets filled with basic data about the input file and sample names (if available in the file format). Use the data() function to access this data while iterating.

See also
LambdaIterator for usage and details.

Definition at line 124 of file variant_input_iterator.hpp.

◆ VariantWindowIterator

◆ VcfFormatIteratorFloat

using VcfFormatIteratorFloat = VcfFormatIterator<float, double>

Definition at line 67 of file vcf_format_iterator.hpp.

◆ VcfFormatIteratorGenotype

Definition at line 68 of file vcf_format_iterator.hpp.

◆ VcfFormatIteratorInt

using VcfFormatIteratorInt = VcfFormatIterator<int32_t, int32_t>

Definition at line 66 of file vcf_format_iterator.hpp.

◆ VcfFormatIteratorString

using VcfFormatIteratorString = VcfFormatIterator<char*, std::string>

Definition at line 65 of file vcf_format_iterator.hpp.

Enumeration Type Documentation

◆ SampleFilterType

enum SampleFilterType
strong

Select how Variant filter functions that evaluate properties of the Variant::samples (BaseCounts) objects behave when the filter is not true or false for all samples.

Enumerator
kConjunction 

The filter returns true only if all of the BaseCounts samples in the Variant return true for a given predicate. This is logical AND.

kDisjunction 

The filter returns true if any of the BaseCounts samples in the Variant return true for a given predicate. This is logical OR.

kMerge 

The filter is applied to the merged BaseCounts of all samples in the Variant.

In this special case, only one BaseCounts object is subjected to the filter function, and hence no logical combination of the outcome is needed.
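The three modes can be sketched as follows. For brevity, each sample is reduced to a single number (for instance its coverage), so that merging samples amounts to summing these numbers; the enum and function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical mirror of the three filter modes.
enum class FilterTypeSketch { kConjunction, kDisjunction, kMerge };

bool apply_filter_sketch(
    std::vector<std::size_t> const& samples,
    FilterTypeSketch type,
    std::function<bool( std::size_t )> const& predicate
) {
    switch( type ) {
        case FilterTypeSketch::kConjunction:
            // Logical AND over all samples.
            return std::all_of( samples.begin(), samples.end(), predicate );
        case FilterTypeSketch::kDisjunction:
            // Logical OR over all samples.
            return std::any_of( samples.begin(), samples.end(), predicate );
        case FilterTypeSketch::kMerge:
            // Merge first (here: sum), then test the single merged value.
            return predicate(
                std::accumulate( samples.begin(), samples.end(), std::size_t{ 0 })
            );
    }
    return false;
}
```

With two samples of coverage 6 each and a predicate requiring a coverage of at least 10, both kConjunction and kDisjunction fail, while kMerge passes, as the merged coverage is 12.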

Definition at line 58 of file filter_transform.hpp.

◆ SlidingWindowType

enum SlidingWindowType
strong

SlidingWindowType of a Window, that is, whether we slide along fixed-size intervals of the genome, along a fixed number of variants, or over a whole chromosome at a time.

Enumerator
kInterval 

Windows of this type are defined by a fixed start and end position on a chromosome.

The amount of data contained in between these two loci can differ, depending on the number of variant positions found in the underlying data iterator.

kVariants 

Windows of this type are defined as containing a fixed number of entries (usually Variants or other data), and hence can span window widths of differing sizes.

kChromosome 

Windows of this type contain positions across a whole chromosome.

The window contains all data from a whole chromosome. Moving to the next window then is equivalent to moving to the next chromosome. Note that this might need a lot of memory to keep all the data at once.

Definition at line 55 of file sliding_window_generator.hpp.

◆ VcfHeaderLine

enum VcfHeaderLine : int
strong

Specification for the values determining header line types of VCF/BCF files.

This list contains the types of header lines that htslib uses for identification, as specified in the VCF header. Corresponds to the BCF_HL_* macro constants defined by htslib. We statically assert that these have the same values.

Enumerator
kFilter 
kInfo 
kFormat 
kContig 
kStructured 
kGeneric 

Definition at line 70 of file vcf_common.hpp.

◆ VcfValueSpecial

enum VcfValueSpecial : int
strong

Specification for special markers for the number of values expected for key-value-pairs of VCF/BCF files.

This list contains the special markers for the number of values of the INFO and FORMAT key-value pairs, as specified in the VCF header, and used in the record lines. Corresponds to the BCF_VL_* macro constants defined by htslib. We statically assert that these have the same values.

Enumerator
kFixed 

Fixed number of values expected. In VCF, this is denoted simply by an integer number.

This simply specifies that there is a fixed number of values to be expected; we do not further define how many exactly are expected here (the integer value). This is taken care of in a separate variable that is provided whenever a fixed-size value is needed, see for example VcfSpecification.

kVariable 

Variable number of possible values, or unknown, or unbounded. In VCF, this is denoted by '.'.

kAllele 

One value per alternate allele. In VCF, this is denoted as 'A'.

kGenotype 

One value for each possible genotype (more relevant to the FORMAT tags). In VCF, this is denoted as 'G'.

kReference 

One value for each possible allele (including the reference). In VCF, this is denoted as 'R'.

Definition at line 105 of file vcf_common.hpp.

◆ VcfValueType

enum VcfValueType : int
strong

Specification for the data type of the values expected in key-value-pairs of VCF/BCF files.

This list contains the types of data in values of the INFO and FORMAT key-value pairs, as specified in the VCF header, and used in the record lines. Corresponds to the BCF_HT_* macro constants defined by htslib. We statically assert that these have the same values.

Enumerator
kFlag 
kInteger 
kFloat 
kString 

Definition at line 88 of file vcf_common.hpp.

◆ WindowAnchorType

enum WindowAnchorType
strong

Position in the genome that is used for reporting when emitting or using a window.

See anchor_position() for details.

Enumerator
kIntervalBegin 
kIntervalEnd 
kIntervalMidpoint 
kVariantFirst 
kVariantLast 
kVariantMedian 
kVariantMean 
kVariantMidpoint 

Definition at line 52 of file population/window/functions.hpp.

Variable Documentation

◆ sam_flag_name_to_int_

const std::unordered_map<std::string, int> sam_flag_name_to_int_
static
Initial value:
= {
{ "paired", 0x1 },
{ "properpair", 0x2 },
{ "unmap", 0x4 },
{ "unmapped", 0x4 },
{ "munmap", 0x8 },
{ "mateunmapped", 0x8 },
{ "reverse", 0x10 },
{ "mreverse", 0x20 },
{ "matereverse", 0x20 },
{ "read1", 0x40 },
{ "read2", 0x80 },
{ "secondary", 0x100 },
{ "qcfail", 0x200 },
{ "dup", 0x400 },
{ "duplicate", 0x400 },
{ "supplementary", 0x800 }
}

Map from sam flags to their numerical value, for different types of naming of the flags.

Definition at line 58 of file sam_flags.cpp.