Classes | |
struct | AfsPileupRecord |
Helper to store the data of one pileup line/record needed for the Boitard et al Allele Frequency Estimation computation. More... | |
class | AlleleFrequencyWindow |
struct | BaseCounts |
One set of nucleotide base counts, for example for a given sample that represents a pool of sequenced individuals. More... | |
struct | BaseCountsStatus |
class | BaseWindowIterator |
Base iterator class for Windows over the chromosomes of a genome. More... | |
class | BedReader |
Reader for BED (Browser Extensible Data) files. More... | |
struct | EmptyAccumulator |
Empty helper data struct to serve as a dummy for Window. More... | |
struct | EmptyGenomeData |
Helper struct to define a default empty data for the classes GenomeLocus, GenomeRegion, and GenomeRegionList. More... | |
class | GenomeHeatmap |
struct | GenomeLocus |
A single locus, that is, a position (or coordinate) on a chromosome. More... | |
struct | GenomeRegion |
A region (between two positions) on a chromosome. More... | |
class | GenomeRegionList |
List of regions in a genome, for each chromosome. More... | |
class | GffReader |
Reader for GFF2 and GFF3 (General Feature Format) and GTF (General Transfer Format) files. More... | |
class | HeatmapColorization |
class | HeatmapMatrix |
Matrix to capture and accumulate columns of per-position or per-window values along a chromosome. More... | |
class | HtsFile |
Wrap an ::htsFile struct. More... | |
struct | PoolDiversityResults |
Data struct to collect all diversity statistics computed by pool_diversity_measures(). More... | |
struct | PoolDiversitySettings |
Settings used by different pool-sequencing corrected diversity statistics. More... | |
class | RegionWindowIterator |
Iterator for Windows representing regions of a genome. More... | |
class | SamVariantInputIterator |
Input iterator for SAM/BAM/CRAM files that produces a Variant per genome position. More... | |
class | SimplePileupInputIterator |
Iterate an input source and parse it as a (m)pileup file. More... | |
class | SimplePileupReader |
Reader for line-by-line assessment of (m)pileup files. More... | |
class | SlidingIntervalWindowIterator |
Iterator for sliding Windows of fixed sized intervals over the chromosomes of a genome. More... | |
class | SlidingVariantsWindowIterator |
Iterator for sliding Windows of fixed sized intervals over the chromosomes of a genome. More... | |
class | SlidingWindowGenerator |
Generator for sliding Windows over the chromosomes of a genome. More... | |
struct | SortedBaseCounts |
Ordered array of base counts for the four nucleotides. More... | |
class | SyncInputIterator |
Iterate an input source and parse it as a sync file. More... | |
class | SyncReader |
Reader for PoPoolation2's "synchronized" files. More... | |
struct | Variant |
A single variant at a position in a chromosome, along with BaseCounts for a set of samples. More... | |
struct | VariantInputIteratorData |
Data storage for input-specific information when traversing a variant file. More... | |
class | VariantParallelInputIterator |
Iterate multiple input sources that yield Variants in parallel. More... | |
class | VcfFormatHelper |
Provide htslib helper functions. More... | |
class | VcfFormatIterator |
Iterate the FORMAT information for the samples in a SNP/variant line in a VCF/BCF file. More... | |
class | VcfGenotype |
Simple wrapper class for one genotype field for a sample. More... | |
class | VcfHeader |
Capture the information from a header of a VCF/BCF file. More... | |
class | VcfInputIterator |
Iterate an input source and parse it as a VCF/BCF file. More... | |
class | VcfRecord |
Capture the information of a single SNP/variant line in a VCF/BCF file. More... | |
struct | VcfSpecification |
Collect the four required keys that describe an INFO or FORMAT sub-field of VCF/BCF files. More... | |
class | Window |
Window over the chromosomes of a genome. More... | |
Functions | |
double | a_n (size_t n) |
Compute a_n , the sum of reciprocals. More... | |
double | alpha_star (double n) |
Compute alpha* according to Achaz 2008 and Kofler et al. 2011. More... | |
double | amnm_ (size_t poolsize, size_t nucleotide_count, size_t allele_frequency) |
Local helper function to compute values for the denominator. More... | |
template<class D , class A = EmptyAccumulator> | |
size_t | anchor_position (Window< D, A > const &window, WindowAnchorType anchor_type=WindowAnchorType::kIntervalBegin) |
Get the position in the chromosome reported according to a specific WindowAnchorType. More... | |
double | b_n (size_t n) |
Compute b_n , the sum of squared reciprocals. More... | |
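The sums a_n and b_n above are commonly defined, following Tajima (1989), as the sum of 1/i and of 1/i^2 for i = 1 to n-1. The following is a minimal standalone sketch under that assumption; the listing itself does not state the summation bounds, and the names `a_n_sketch` and `b_n_sketch` are hypothetical stand-ins, not the library functions.

```cpp
#include <cassert>
#include <cstddef>

// Sum of reciprocals 1/i for i = 1 .. n-1.
// Assumption: upper bound n-1, as in Tajima's classic definition.
double a_n_sketch( std::size_t n )
{
    double sum = 0.0;
    for( std::size_t i = 1; i < n; ++i ) {
        sum += 1.0 / static_cast<double>( i );
    }
    return sum;
}

// Sum of squared reciprocals 1/i^2 for i = 1 .. n-1 (same assumption).
double b_n_sketch( std::size_t n )
{
    double sum = 0.0;
    for( std::size_t i = 1; i < n; ++i ) {
        double const d = static_cast<double>( i );
        sum += 1.0 / ( d * d );
    }
    return sum;
}
```

With these definitions, `a_n_sketch(3)` is 1 + 1/2 and `b_n_sketch(3)` is 1 + 1/4.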
double | beta_star (double n) |
Compute beta* according to Achaz 2008 and Kofler et al. 2011. More... | |
std::pair< char, double > | consensus (BaseCounts const &sample) |
Consensus character for a BaseCounts, and its confidence. More... | |
std::pair< char, double > | consensus (BaseCounts const &sample, BaseCountsStatus const &status) |
Consensus character for a BaseCounts, and its confidence. More... | |
AfsPileupRecord | convert_to_afs_pileup_record (SimplePileupReader::Record const &record) |
BaseCounts | convert_to_base_counts (SimplePileupReader::Sample const &sample, unsigned char min_phred_score) |
Variant | convert_to_variant (SimplePileupReader::Record const &record, unsigned char min_phred_score) |
Variant | convert_to_variant_as_individuals (VcfRecord const &record, bool use_allelic_depth=false) |
Convert a VcfRecord to a Variant, treating each sample as an individual, and combining them all into one BaseCounts sample. More... | |
Variant | convert_to_variant_as_pool (VcfRecord const &record) |
Convert a VcfRecord to a Variant, treating each sample column as a pool of individuals. More... | |
template<class ForwardIterator1 , class ForwardIterator2 > | |
double | f_st_pool_karlsson (ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end) |
Compute the F_ST statistic for pool-sequenced data of Karlsson et al as used in PoPoolation2, for two ranges of BaseCounts objects. More... | |
std::pair< double, double > | f_st_pool_karlsson_nkdk (std::pair< SortedBaseCounts, SortedBaseCounts > const &sample_counts) |
Compute the numerator N_k and denominator D_k needed for the asymptotically unbiased F_ST estimator of Karlsson et al (2007). More... | |
template<class ForwardIterator1 , class ForwardIterator2 > | |
double | f_st_pool_kofler (size_t p1_poolsize, size_t p2_poolsize, ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end) |
Compute the F_ST statistic for pool-sequenced data of Kofler et al as used in PoPoolation2, for two ranges of BaseCounts objects. More... | |
std::tuple< double, double, double > | f_st_pool_kofler_pi_snp (BaseCounts const &p1, BaseCounts const &p2) |
Compute the SNP-based Theta Pi values used in f_st_pool_kofler(). More... | |
template<class ForwardIterator1 , class ForwardIterator2 > | |
std::pair< double, double > | f_st_pool_unbiased (size_t p1_poolsize, size_t p2_poolsize, ForwardIterator1 p1_begin, ForwardIterator1 p1_end, ForwardIterator2 p2_begin, ForwardIterator2 p2_end) |
Compute our unbiased F_ST statistic for pool-sequenced data for two ranges of BaseCounts objects. More... | |
std::tuple< double, double, double > | f_st_pool_unbiased_pi_snp (size_t p1_poolsize, size_t p2_poolsize, BaseCounts const &p1, BaseCounts const &p2) |
Compute the SNP-based Theta Pi values used in f_st_pool_unbiased(). More... | |
double | f_star (double a_n, double n) |
Compute f* according to Achaz 2008 and Kofler et al. 2011. More... | |
std::function< bool(Variant const &)> | filter_by_region (GenomeRegion const &region, bool complement=false) |
Filter function to be used with VariantInputIterator to filter by a genome region. More... | |
std::function< bool(Variant const &)> | filter_by_region (GenomeRegionList const &regions, bool complement=false, bool copy_regions=false) |
Filter function to be used with VariantInputIterator to filter by a list of genome regions. More... | |
std::function< bool(Variant const &)> | filter_by_region (std::shared_ptr< GenomeRegionList > regions, bool complement=false) |
Filter function to be used with VariantInputIterator to filter by a list of genome regions. More... | |
bool | filter_by_status (std::function< bool(BaseCountsStatus const &)> predicate, Variant const &variant, SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false) |
Filter a Variant based on a predicate that is applied to the result of a status() call on the BaseCounts of the variant . More... | |
std::function< bool(Variant const &)> | filter_by_status (std::function< bool(BaseCountsStatus const &)> predicate, SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false) |
Filter a Variant based on a predicate that is applied to the result of a status() call on the BaseCounts of the variant . More... | |
std::function< bool(Variant const &)> | filter_is_biallelic_snp (SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false) |
Filter a Variant based on whether the sample counts are biallelic SNPs, that is, exactly two base counts in [ACGT] are non-zero. More... | |
bool | filter_is_biallelic_snp (Variant const &variant, SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false) |
Filter a Variant based on whether the sample counts are biallelic SNPs, that is, exactly two base counts in [ACGT] are non-zero. More... | |
std::function< bool(Variant const &)> | filter_is_snp (SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false) |
Filter a Variant based on whether the sample counts are SNPs, that is, more than one count in [ACGT] is non-zero. More... | |
bool | filter_is_snp (Variant const &variant, SampleFilterType type, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false) |
Filter a Variant based on whether the sample counts are SNPs, that is, more than one count in [ACGT] is non-zero. More... | |
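The two SNP criteria described above (more than one non-zero count in [ACGT] for a SNP, exactly two for a biallelic SNP) can be sketched on a plain array of four counts. This is a simplified standalone illustration, not the library code: `NucleotideCounts` and the `*_sketch` functions are hypothetical, and the sketch ignores the `SampleFilterType`, coverage, `min_count`, and deletion parameters of the listed functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the A, C, G, T counts of one sample.
using NucleotideCounts = std::array<std::size_t, 4>;

// Number of nucleotides with a non-zero count.
std::size_t count_nonzero_sketch( NucleotideCounts const& counts )
{
    std::size_t nonzero = 0;
    for( auto const c : counts ) {
        if( c > 0 ) {
            ++nonzero;
        }
    }
    return nonzero;
}

// SNP: more than one count in [ACGT] is non-zero.
bool is_snp_sketch( NucleotideCounts const& counts )
{
    return count_nonzero_sketch( counts ) > 1;
}

// Biallelic SNP: exactly two counts in [ACGT] are non-zero.
bool is_biallelic_snp_sketch( NucleotideCounts const& counts )
{
    return count_nonzero_sketch( counts ) == 2;
}
```

A position with counts A=10, C=3, G=1 is a SNP under this criterion, but not a biallelic one.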
GenomeRegionList | genome_region_list_from_vcf_file (std::string const &file) |
Read a VCF file, and use its positions to create a GenomeRegionList. More... | |
void | genome_region_list_from_vcf_file (std::string const &file, GenomeRegionList &target) |
Read a VCF file, and add its positions to an existing GenomeRegionList. More... | |
size_t | get_base_count (BaseCounts const &bc, char base) |
Get the count for a base given as a char. More... | |
std::pair< std::array< char, 6 >, size_t > | get_vcf_record_snp_ref_alt_chars_ (VcfRecord const &record) |
Local helper function that returns the REF and ALT chars of a VcfRecord for SNPs. More... | |
char | guess_alternative_base (Variant const &variant, bool force=true) |
Guess the alternative base of a Variant. More... | |
char | guess_reference_base (Variant const &variant) |
Guess the reference base of a Variant. More... | |
double | heterozygosity (BaseCounts const &sample, bool with_bessel=false) |
Compute classic heterozygosity. More... | |
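Classic heterozygosity is usually computed as one minus the sum of squared allele frequencies, optionally scaled by Bessel's correction n/(n-1), where n is the total nucleotide count. The sketch below assumes that formula; the listing does not spell it out, and `NucleotideCounts` and `heterozygosity_sketch` are hypothetical names. Returning 0 for positions with too few counts is also an assumption made here to avoid a division by zero.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the A, C, G, T counts of one sample.
using NucleotideCounts = std::array<std::size_t, 4>;

// H = 1 - sum_i f_i^2, optionally multiplied by n / (n - 1).
double heterozygosity_sketch( NucleotideCounts const& counts, bool with_bessel = false )
{
    std::size_t total = 0;
    for( auto const c : counts ) {
        total += c;
    }

    // Assumption: report 0 where the formula is undefined.
    if( total == 0 || ( with_bessel && total < 2 )) {
        return 0.0;
    }

    double const n = static_cast<double>( total );
    double sum_sq = 0.0;
    for( auto const c : counts ) {
        double const f = static_cast<double>( c ) / n;
        sum_sq += f * f;
    }

    double h = 1.0 - sum_sq;
    if( with_bessel ) {
        h *= n / ( n - 1.0 );
    }
    return h;
}
```

For counts A=5 and C=5, the uncorrected value is 0.5, and the Bessel-corrected value is 0.5 * 10/9.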
bool | is_covered (GenomeRegion const &region, std::string const &chromosome, size_t position) |
Test whether the chromosome/position is within a given genomic region . More... | |
template<class T > | |
bool | is_covered (GenomeRegion const &region, T const &locus) |
Test whether the chromosome/position of a locus is within a given genomic region . More... | |
bool | is_covered (GenomeRegion const &region, VcfRecord const &variant) |
bool | is_covered (GenomeRegionList const &regions, std::string const &chromosome, size_t position) |
Test whether the chromosome/position is within a given list of genomic regions . More... | |
template<class T > | |
bool | is_covered (GenomeRegionList const &regions, T const &locus) |
Test whether the chromosome/position of a locus is within a given list of genomic regions . More... | |
bool | is_covered (GenomeRegionList const &regions, VcfRecord const &variant) |
int | locus_compare (GenomeLocus const &l, GenomeLocus const &r) |
Three-way comparison (spaceship operator <=> ) for two loci in a genome. More... | |
int | locus_compare (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position) |
Three-way comparison (spaceship operator <=> ) for two loci in a genome. More... | |
int | locus_compare (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r) |
Three-way comparison (spaceship operator <=> ) for two loci in a genome. More... | |
int | locus_compare (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position) |
Three-way comparison (spaceship operator <=> ) for two loci in a genome. More... | |
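A three-way comparison of two loci can be sketched as comparing the chromosome first and the position second. The version below is a standalone illustration under two assumptions that the listing does not confirm: chromosomes are ordered lexicographically by name, and the result is normalized to -1, 0, or +1. The name `locus_compare_sketch` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns -1 if the left locus sorts first, +1 if the right one does, 0 if equal.
// Assumption: chromosome names are compared lexicographically.
int locus_compare_sketch(
    std::string const& l_chromosome, std::size_t l_position,
    std::string const& r_chromosome, std::size_t r_position
) {
    // The chromosome takes precedence over the position.
    int const chr_cmp = l_chromosome.compare( r_chromosome );
    if( chr_cmp != 0 ) {
        return chr_cmp < 0 ? -1 : +1;
    }

    // Same chromosome: the position decides.
    if( l_position < r_position ) {
        return -1;
    }
    if( l_position > r_position ) {
        return +1;
    }
    return 0;
}
```

Note that a purely lexicographic order places "chr10" before "chr2", so this sketch only matches genome order if the chromosome names sort that way.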
bool | locus_equal (GenomeLocus const &l, GenomeLocus const &r) |
Equality comparison (== ) for two loci in a genome. More... | |
bool | locus_equal (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position) |
Equality comparison (== ) for two loci in a genome. More... | |
bool | locus_equal (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r) |
Equality comparison (== ) for two loci in a genome. More... | |
bool | locus_equal (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position) |
Equality comparison (== ) for two loci in a genome. More... | |
bool | locus_greater (GenomeLocus const &l, GenomeLocus const &r) |
Greater than comparison (> ) for two loci in a genome. More... | |
bool | locus_greater (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position) |
Greater than comparison (> ) for two loci in a genome. More... | |
bool | locus_greater (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r) |
Greater than comparison (> ) for two loci in a genome. More... | |
bool | locus_greater (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position) |
Greater than comparison (> ) for two loci in a genome. More... | |
bool | locus_greater_or_equal (GenomeLocus const &l, GenomeLocus const &r) |
Greater than or equal comparison (>= ) for two loci in a genome. More... | |
bool | locus_greater_or_equal (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position) |
Greater than or equal comparison (>= ) for two loci in a genome. More... | |
bool | locus_greater_or_equal (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r) |
Greater than or equal comparison (>= ) for two loci in a genome. More... | |
bool | locus_greater_or_equal (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position) |
Greater than or equal comparison (>= ) for two loci in a genome. More... | |
bool | locus_inequal (GenomeLocus const &l, GenomeLocus const &r) |
Inequality comparison (!= ) for two loci in a genome. More... | |
bool | locus_inequal (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position) |
Inequality comparison (!= ) for two loci in a genome. More... | |
bool | locus_inequal (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r) |
Inequality comparison (!= ) for two loci in a genome. More... | |
bool | locus_inequal (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position) |
Inequality comparison (!= ) for two loci in a genome. More... | |
bool | locus_less (GenomeLocus const &l, GenomeLocus const &r) |
Less than comparison (< ) for two loci in a genome. More... | |
bool | locus_less (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position) |
Less than comparison (< ) for two loci in a genome. More... | |
bool | locus_less (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r) |
Less than comparison (< ) for two loci in a genome. More... | |
bool | locus_less (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position) |
Less than comparison (< ) for two loci in a genome. More... | |
bool | locus_less_or_equal (GenomeLocus const &l, GenomeLocus const &r) |
Less than or equal comparison (<= ) for two loci in a genome. More... | |
bool | locus_less_or_equal (GenomeLocus const &l, std::string const &r_chromosome, size_t r_position) |
Less than or equal comparison (<= ) for two loci in a genome. More... | |
bool | locus_less_or_equal (std::string const &l_chromosome, size_t l_position, GenomeLocus const &r) |
Less than or equal comparison (<= ) for two loci in a genome. More... | |
bool | locus_less_or_equal (std::string const &l_chromosome, size_t l_position, std::string const &r_chromosome, size_t r_position) |
Less than or equal comparison (<= ) for two loci in a genome. More... | |
template<class ForwardIterator > | |
SlidingIntervalWindowIterator< ForwardIterator > | make_default_sliding_interval_window_iterator (ForwardIterator begin, ForwardIterator end, size_t width=0, size_t stride=0) |
Helper function to instantiate a SlidingIntervalWindowIterator for a default use case. More... | |
template<class ForwardIterator > | |
SlidingVariantsWindowIterator< ForwardIterator > | make_default_sliding_variants_window_iterator (ForwardIterator begin, ForwardIterator end, size_t width=0, size_t stride=0) |
Helper function to instantiate a SlidingVariantsWindowIterator for a default use case. More... | |
template<class T , class R > | |
std::shared_ptr< T > | make_input_iterator_with_sample_filter_ (std::string const &filename, R const &reader, std::vector< size_t > const &sample_indices, bool inverse_sample_indices, std::vector< bool > const &sample_filter) |
Local helper function template that takes care of initializing an input iterator, and setting the sample filters, for those iterators for which we do not know the number of samples prior to starting the file iteration. More... | |
template<class ForwardIterator , class DataType = typename ForwardIterator::value_type> | |
SlidingIntervalWindowIterator< ForwardIterator, DataType > | make_sliding_interval_window_iterator (ForwardIterator begin, ForwardIterator end, size_t width=0, size_t stride=0) |
Helper function to instantiate a SlidingIntervalWindowIterator without the need to specify the template parameters manually. More... | |
template<class ForwardIterator , class DataType = typename ForwardIterator::value_type> | |
SlidingVariantsWindowIterator< ForwardIterator, DataType > | make_sliding_variants_window_iterator (ForwardIterator begin, ForwardIterator end, size_t width=0, size_t stride=0) |
Helper function to instantiate a SlidingVariantsWindowIterator without the need to specify the template parameters manually. More... | |
VariantInputIterator | make_variant_input_iterator_from_individual_vcf_file (std::string const &filename, bool use_allelic_depth=false, bool only_biallelic=true, bool only_filter_pass=true) |
Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as an individual, and combining them all into one BaseCounts sample. More... | |
VariantInputIterator | make_variant_input_iterator_from_individual_vcf_file (std::string const &filename, std::vector< std::string > const &sample_names, bool inverse_sample_names=false, bool use_allelic_depth=false, bool only_biallelic=true, bool only_filter_pass=true) |
Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as an individual, and combining them all into one BaseCounts sample. More... | |
VariantInputIterator | make_variant_input_iterator_from_pileup_file (std::string const &filename, SimplePileupReader const &reader=SimplePileupReader{}) |
Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants. More... | |
VariantInputIterator | make_variant_input_iterator_from_pileup_file (std::string const &filename, std::vector< bool > const &sample_filter, SimplePileupReader const &reader=SimplePileupReader{}) |
Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants. More... | |
VariantInputIterator | make_variant_input_iterator_from_pileup_file (std::string const &filename, std::vector< size_t > const &sample_indices, bool inverse_sample_indices=false, SimplePileupReader const &reader=SimplePileupReader{}) |
Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants. More... | |
VariantInputIterator | make_variant_input_iterator_from_pileup_file_ (std::string const &filename, SimplePileupReader const &reader, std::vector< size_t > const &sample_indices, bool inverse_sample_indices, std::vector< bool > const &sample_filter) |
Local helper function that takes care of the three functions below. More... | |
VariantInputIterator | make_variant_input_iterator_from_pool_vcf_file (std::string const &filename, bool only_biallelic=true, bool only_filter_pass=true) |
Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as a pool of individuals. More... | |
VariantInputIterator | make_variant_input_iterator_from_pool_vcf_file (std::string const &filename, std::vector< std::string > const &sample_names, bool inverse_sample_names=false, bool only_biallelic=true, bool only_filter_pass=true) |
Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as a pool of individuals. More... | |
VariantInputIterator | make_variant_input_iterator_from_sam_file (std::string const &filename, SamVariantInputIterator const &reader=SamVariantInputIterator{}) |
Create a VariantInputIterator to iterate the contents of a SAM/BAM/CRAM file as Variants. More... | |
VariantInputIterator | make_variant_input_iterator_from_sync_file (std::string const &filename) |
Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants. More... | |
VariantInputIterator | make_variant_input_iterator_from_sync_file (std::string const &filename, std::vector< bool > const &sample_filter) |
Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants. More... | |
VariantInputIterator | make_variant_input_iterator_from_sync_file (std::string const &filename, std::vector< size_t > const &sample_indices, bool inverse_sample_indices=false) |
Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants. More... | |
VariantInputIterator | make_variant_input_iterator_from_sync_file_ (std::string const &filename, std::vector< size_t > const &sample_indices, bool inverse_sample_indices, std::vector< bool > const &sample_filter) |
VariantInputIterator | make_variant_input_iterator_from_variant_parallel_input_iterator (VariantParallelInputIterator const &parallel_input, bool allow_ref_base_mismatches=false, bool allow_alt_base_mismatches=true, std::string const &source_sample_separator=":") |
Create a VariantInputIterator to iterate multiple input sources at once, using a VariantParallelInputIterator. More... | |
VariantInputIterator | make_variant_input_iterator_from_vcf_file_ (std::string const &filename, std::vector< std::string > const &sample_names, bool inverse_sample_names, bool pool_samples, bool use_allelic_depth, bool only_biallelic, bool only_filter_pass) |
Local helper function that takes care of both main functions below. More... | |
BaseCounts | merge (BaseCounts const &p1, BaseCounts const &p2) |
Merge the counts of two BaseCounts objects. More... | |
BaseCounts | merge (std::vector< BaseCounts > const &p) |
Merge the counts of a vector of BaseCounts objects. More... | |
void | merge_inplace (BaseCounts &p1, BaseCounts const &p2) |
Merge the counts of two BaseCounts objects, by adding the counts of the second (p2 ) to the first (p1 ). More... | |
double | n_base (size_t coverage, size_t poolsize) |
Compute the n_base term used for Tajima's D in Kofler et al. 2011, using a faster closed form expression. More... | |
double | n_base_matrix (size_t coverage, size_t poolsize) |
Compute the n_base term used for Tajima's D in Kofler et al. 2011, following their approach. More... | |
template<typename T > | |
std::array< size_t, 4 > | nucleotide_sorting_order_ (std::array< T, 4 > const &values) |
Local helper function that runs a sorting network to sort four values, coming from the four nucleotides. More... | |
size_t | nucleotide_sum (BaseCounts const &sample) |
Count of the pure nucleotide bases at this position, that is, the sum of all A , C , G , and T . More... | |
bool | operator!= (GenomeLocus const &l, GenomeLocus const &r) |
Inequality comparison (!= ) for two loci in a genome. More... | |
bool | operator!= (GenomeRegion const &a, GenomeRegion const &b) |
Inequality comparison (!= ) for two GenomeRegions. More... | |
bool | operator< (GenomeLocus const &l, GenomeLocus const &r) |
Less than comparison (< ) for two loci in a genome. More... | |
std::ostream & | operator<< (std::ostream &os, BaseCounts const &bs) |
Output stream operator for BaseCounts instances. More... | |
std::ostream & | operator<< (std::ostream &os, GenomeLocus const &locus) |
std::ostream & | operator<< (std::ostream &os, GenomeRegion const &region) |
bool | operator<= (GenomeLocus const &l, GenomeLocus const &r) |
Less than or equal comparison (<= ) for two loci in a genome. More... | |
bool | operator== (GenomeLocus const &l, GenomeLocus const &r) |
Equality comparison (== ) for two loci in a genome. More... | |
bool | operator== (GenomeRegion const &a, GenomeRegion const &b) |
Equality comparison (== ) for two GenomeRegions. More... | |
bool | operator> (GenomeLocus const &l, GenomeLocus const &r) |
Greater than comparison (> ) for two loci in a genome. More... | |
bool | operator>= (GenomeLocus const &l, GenomeLocus const &r) |
Greater than or equal comparison (>= ) for two loci in a genome. More... | |
GenomeRegion | parse_genome_region (std::string const &region, bool zero_based=false, bool end_exclusive=false) |
Parse a genomic region. More... | |
GenomeRegionList | parse_genome_regions (std::string const &regions, bool zero_based=false, bool end_exclusive=false) |
Parse a set/list of genomic regions. More... | |
genesis::utils::Matrix< double > | pij_matrix_ (size_t max_coverage, size_t poolsize) |
genesis::utils::Matrix< double > const & | pij_matrix_resolver_ (size_t max_coverage, size_t poolsize) |
template<class ForwardIterator > | |
PoolDiversityResults | pool_diversity_measures (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end) |
Compute Theta Pi, Theta Watterson, and Tajima's D in their pool-sequencing corrected versions according to Kofler et al. More... | |
std::vector< double > | prob_cond_true_freq (size_t n, std::vector< bool > const &alleles, std::vector< unsigned char > const &phred_scores, bool unfolded) |
std::vector< double > | prob_cond_true_freq_unfolded (size_t n, std::vector< bool > const &alleles, std::vector< unsigned char > const &phred_scores, bool invert_alleles) |
template<class ForwardIterator > | |
void | process_conditional_probability (ForwardIterator begin, ForwardIterator end) |
Compute the conditional probabilities of AFs. This reimplements process_probCond from Boitard et al. More... | |
void | process_pileup_correct_input_order_check_ (utils::InputStream const &it, std::string &cur_chr, size_t &cur_pos, std::string const &new_chr, size_t new_pos) |
Local helper function to remove code duplication for the correct input order check. More... | |
void | process_sync_correct_input_order_ (utils::InputStream const &it, std::string &cur_chr, size_t &cur_pos, Variant const &new_var) |
Local helper function to remove code duplication for the correct input order check. More... | |
template<class Data , class Accumulator = EmptyAccumulator> | |
void | run_vcf_window (SlidingWindowGenerator< Data, Accumulator > &generator, std::string const &vcf_file, std::function< Data(VcfRecord const &)> conversion, std::function< bool(VcfRecord const &)> condition={}) |
Convenience function to iterate over a whole VCF file. More... | |
std::string | sam_flag_to_string (int flags) |
Turn a set of flags for sam/bam/cram reads into their textual representation. More... | |
template<> | |
void | SimplePileupReader::process_ancestral_base_< SimplePileupReader::Sample > (utils::InputStream &input_stream, SimplePileupReader::Sample &sample) const |
template<> | |
void | SimplePileupReader::process_quality_string_< SimplePileupReader::Sample > (utils::InputStream &input_stream, SimplePileupReader::Sample &sample) const |
template<> | |
void | SimplePileupReader::set_sample_read_bases_< SimplePileupReader::Sample > (std::string const &read_bases, SimplePileupReader::Sample &sample) const |
template<> | |
void | SimplePileupReader::set_sample_read_coverage_< SimplePileupReader::Sample > (size_t read_coverage, SimplePileupReader::Sample &sample) const |
template<> | |
void | SimplePileupReader::set_target_alternative_base_< SimplePileupReader::Record > (SimplePileupReader::Record &target) const |
std::pair< SortedBaseCounts, SortedBaseCounts > | sorted_average_base_counts (BaseCounts const &sample_a, BaseCounts const &sample_b) |
Return the sorted base counts of both input samples, ordered by the average frequencies of the nucleotide counts in the two samples. More... | |
SortedBaseCounts | sorted_base_counts (BaseCounts const &sample) |
Return the order of base counts (nucleotides), largest one first. More... | |
SortedBaseCounts | sorted_base_counts (Variant const &variant, bool reference_first) |
Get a list of bases sorted by their counts. More... | |
BaseCountsStatus | status (BaseCounts const &sample, size_t min_coverage=0, size_t max_coverage=0, size_t min_count=0, bool tolerate_deletions=false) |
Compute a simple status with useful properties from the counts of a BaseCounts. More... | |
int | string_to_sam_flag (std::string const &value) |
Parse a string as a set of flags for sam/bam/cram reads. More... | |
template<class ForwardIterator > | |
double | tajima_d_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end) |
Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al. More... | |
template<class ForwardIterator > | |
double | tajima_d_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end, double theta_pi, double theta_watterson) |
Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al. More... | |
double | tajima_d_pool_denominator (PoolDiversitySettings const &settings, size_t snp_count, double theta) |
Compute the denominator for the pool-sequencing correction of Tajima's D according to Kofler et al. More... | |
template<class ForwardIterator > | |
double | theta_pi (ForwardIterator begin, ForwardIterator end, bool with_bessel=true) |
Compute classic theta pi, that is, the sum of heterozygosities. More... | |
double | theta_pi_pool (PoolDiversitySettings const &settings, BaseCounts const &sample) |
Compute theta pi with pool-sequencing correction according to Kofler et al, for a single BaseCounts, that is, its heterozygosity() including Bessel's correction for the total nucleotide count at each position, divided by the correction denominator. More... | |
template<class ForwardIterator > | |
double | theta_pi_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end) |
Compute theta pi with pool-sequencing correction according to Kofler et al, that is, the sum of heterozygosities divided by the correction denominator. More... | |
double | theta_pi_pool_denominator (PoolDiversitySettings const &settings, size_t nucleotide_count) |
Compute the denominator for the pool-sequencing correction of theta pi according to Kofler et al. More... | |
template<class ForwardIterator > | |
double | theta_pi_within_pool (ForwardIterator begin, ForwardIterator end, size_t poolsize) |
Compute classic theta pi (within a population), that is, the sum of heterozygosities including Bessel's correction for total nucleotide sum at each position, and Bessel's correction for the pool size. More... | |
template<class ForwardIterator > | |
double | theta_watterson_pool (PoolDiversitySettings const &settings, ForwardIterator begin, ForwardIterator end) |
Compute theta watterson with pool-sequencing correction according to Kofler et al. More... | |
double | theta_watterson_pool_denominator (PoolDiversitySettings const &settings, size_t nucleotide_count) |
Compute the denominator for the pool-sequencing correction of theta watterson according to Kofler et al. More... | |
std::string | to_string (GenomeLocus const &locus) |
std::string | to_string (GenomeRegion const &region) |
std::ostream & | to_sync (BaseCounts const &bs, std::ostream &os) |
Output a BaseCounts instance to a stream in the PoPoolation2 sync format. More... | |
std::ostream & | to_sync (Variant const &var, std::ostream &os) |
Output a Variant instance to a stream in the PoPoolation2 sync format. More... | |
BaseCounts | total_base_counts (Variant const &variant) |
Get the summed up total base counts of a Variant. More... | |
size_t | total_nucleotide_sum (Variant const &variant) |
Count of the pure nucleotide bases at this position, that is, the sum of all A, C, G, and T. More... | |
void | transform_zero_out_by_max_count (BaseCounts &sample, size_t max_count) |
Transform a BaseCounts sample by setting any nucleotide count (A, C, G, T) to zero if max_count is exceeded for that nucleotide. More... | |
void | transform_zero_out_by_max_count (Variant &variant, size_t max_count) |
Transform a variant by setting any nucleotide count (A, C, G, T) of its samples to zero if max_count is exceeded for that nucleotide. More... | |
void | transform_zero_out_by_min_count (BaseCounts &sample, size_t min_count) |
Transform a BaseCounts sample by setting any nucleotide count (A, C, G, T) to zero if min_count is not reached for that nucleotide. More... | |
void | transform_zero_out_by_min_count (Variant &variant, size_t min_count) |
Transform a variant by setting any nucleotide count (A, C, G, T) of its samples to zero if min_count is not reached for that nucleotide. More... | |
void | transform_zero_out_by_min_max_count (BaseCounts &sample, size_t min_count, size_t max_count) |
Transform a BaseCounts sample by setting any nucleotide count (A, C, G, T) to zero if min_count is not reached or if max_count is exceeded for that nucleotide. More... | |
void | transform_zero_out_by_min_max_count (Variant &variant, size_t min_count, size_t max_count) |
Transform a variant by setting any nucleotide count (A, C, G, T) of its samples to zero if min_count is not reached or if max_count is exceeded for that nucleotide. More... | |
std::string | vcf_genotype_string (std::vector< VcfGenotype > const &genotypes) |
Return the VCF-like string representation of a set of VcfGenotype entries. More... | |
size_t | vcf_genotype_sum (std::vector< VcfGenotype > const &genotypes) |
Return the sum of genotypes for a set of VcfGenotype entries, typically used to construct a genotype matrix with entries 0,1,2. More... | |
std::string | vcf_hl_type_to_string (int hl_type) |
Internal helper function to convert htslib-internal BCF_HL_* header line type values to their string representation as used in the VCF header ("FILTER", "INFO", "FORMAT", etc). More... | |
std::string | vcf_value_special_to_string (int vl_type_num) |
std::string | vcf_value_special_to_string (VcfValueSpecial vl_type_num) |
std::string | vcf_value_type_to_string (int ht_type) |
std::string | vcf_value_type_to_string (VcfValueType ht_type) |
Enumerations | |
enum | SampleFilterType { kConjunction, kDisjunction, kMerge } |
Select how Variant filter functions that evaluate properties of the Variant::samples (BaseCounts) objects behave when the filter is not true or false for all samples. More... | |
enum | SlidingWindowType { kInterval, kVariants, kChromosome } |
SlidingWindowType of a Window, that is, whether we slide along a fixed size interval of the genome, along a fixed number of variants, or represent a whole chromosome. More... | |
enum | VcfHeaderLine : int { kFilter = 0, kInfo = 1, kFormat = 2, kContig = 3, kStructured = 4, kGeneric = 5 } |
Specification for the values determining header line types of VCF/BCF files. More... | |
enum | VcfValueSpecial : int { kFixed = 0, kVariable = 1, kAllele = 2, kGenotype = 3, kReference = 4 } |
Specification for special markers for the number of values expected for key-value-pairs of VCF/BCF files. More... | |
enum | VcfValueType : int { kFlag = 0, kInteger = 1, kFloat = 2, kString = 3 } |
Specification for the data type of the values expected in key-value-pairs of VCF/BCF files. More... | |
enum | WindowAnchorType { kIntervalBegin, kIntervalEnd, kIntervalMidpoint, kVariantFirst, kVariantLast, kVariantMedian, kVariantMean, kVariantMidpoint } |
Position in the genome that is used for reporting when emitting or using a window. More... | |
Typedefs | |
using | VariantInputIterator = utils::LambdaIterator< Variant, VariantInputIteratorData > |
Iterate Variants, using a variety of input file formats. More... | |
using | VariantWindowIterator = BaseWindowIterator< VariantInputIterator::Iterator > |
using | VcfFormatIteratorFloat = VcfFormatIterator< float, double > |
using | VcfFormatIteratorGenotype = VcfFormatIterator< int32_t, VcfGenotype > |
using | VcfFormatIteratorInt = VcfFormatIterator< int32_t, int32_t > |
using | VcfFormatIteratorString = VcfFormatIterator< char *, std::string > |
Variables | |
static const std::unordered_map< std::string, int > | sam_flag_name_to_int_ |
Map from sam flags to their numerical value, for different types of naming of the flags. More... | |
double a_n | ( | size_t | n | ) |
Compute a_n
, the sum of reciprocals.
This is the sum of reciprocals up to n-1
, which is \( a_n = \sum_{i=1}^{n-1} \frac{1}{i} \).
See Equation 3.6 in
Hahn, M. W. (2018). Molecular Population Genetics. https://global.oup.com/academic/product/molecular-population-genetics-9780878939657
for details.
Definition at line 231 of file diversity.cpp.
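The sum described above can be sketched in a few lines. This is a minimal stand-alone illustration, not the code from diversity.cpp; the names `a_n_sketch` and `a_n_sketch_demo` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-alone helper (not the library function):
// returns the sum of 1/i for i = 1 .. n-1.
double a_n_sketch( std::size_t n )
{
    double sum = 0.0;
    for( std::size_t i = 1; i < n; ++i ) {
        sum += 1.0 / static_cast<double>( i );
    }
    return sum;
}

// Small self-check: the sum is empty for n = 1, and a_3 = 1/1 + 1/2.
void a_n_sketch_demo()
{
    assert( a_n_sketch( 1 ) == 0.0 );
    assert( a_n_sketch( 3 ) == 1.5 );
}
```

Note that the upper bound is n-1, not n, so the loop condition is `i < n`.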
double alpha_star | ( | double | n | ) |
Compute alpha*
according to Achaz 2008 and Kofler et al. 2011.
This is needed for the computation of tajima_d_pool() according to
R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
The equation is based on
G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409–1424. https://doi.org/10.1534/genetics.107.082198
See there for details.
Definition at line 263 of file diversity.cpp.
double genesis::population::amnm_ | ( | size_t | poolsize, |
size_t | nucleotide_count, | ||
size_t | allele_frequency | ||
) |
Local helper function to compute values for the denominator.
This computes the sum over all r poolsizes of 1/r times a binomial:
\( \sum_{m=b}^{C-b} \frac{1}{k} {C \choose m} \left(\frac{k}{n}\right)^m \left(\frac{n-k}{n}\right)^{C-m} \)
This is needed in the pool seq correction denominators of Theta Pi and Theta Watterson.
Definition at line 63 of file diversity.cpp.
size_t genesis::population::anchor_position | ( | Window< D, A > const & | window, |
WindowAnchorType | anchor_type = WindowAnchorType::kIntervalBegin |
||
) |
Get the position in the chromosome reported according to a specific WindowAnchorType.
When a window is filled with data, we need to report the position in the genome at which the window is. There are several ways that this position can be computed. Typically, just the first position of the window is used (that is, for an interval, the beginning of the interval, and for variants, the position of the first variant).
However, it might be desirable to report a different position, for example when plotting the results. When using SlidingWindowType::kVariants for example, one might want to plot the values computed per window at the midpoint genome position of the variants in that window.
Definition at line 77 of file population/window/functions.hpp.
double b_n | ( | size_t | n | ) |
Compute b_n
, the sum of squared reciprocals.
This is the sum of squared reciprocals up to n-1
, which is \( b_n = \sum_{i=1}^{n-1} \frac{1}{i^2} \).
See
R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
Definition at line 244 of file diversity.cpp.
double beta_star | ( | double | n | ) |
Compute beta*
according to Achaz 2008 and Kofler et al. 2011.
This is needed for the computation of tajima_d_pool() according to
R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
The equation is based on
G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409–1424. https://doi.org/10.1534/genetics.107.082198
See there for details.
Definition at line 291 of file diversity.cpp.
std::pair< char, double > consensus | ( | BaseCounts const & | sample | ) |
Consensus character for a BaseCounts, and its confidence.
This is simply the character (out of ACGT
) that appears most often (or, for ties, the lexicographically smallest character), unless all of (A
, C
, G
, T
) are zero, in which case the consensus character is N
. The confidence is the count of the consensus character, divided by the total count of all four nucleotides.
Definition at line 397 of file population/functions/functions.cpp.
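The rule above (most frequent base, ties broken towards the lexicographically smallest character, N if all counts are zero) could look roughly as follows. This is a sketch under assumptions: `CountsSketch` and `consensus_sketch` are hypothetical stand-ins for BaseCounts and consensus(), and returning a confidence of 0.0 together with N is an assumption that the description above does not state.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for BaseCounts, holding only the four nucleotide counts.
struct CountsSketch {
    std::size_t a, c, g, t;
};

std::pair<char, double> consensus_sketch( CountsSketch const& s )
{
    // Bases are listed in lexicographical order, so that ties resolve to the smallest.
    std::array<std::pair<char, std::size_t>, 4> const bases = {{
        { 'A', s.a }, { 'C', s.c }, { 'G', s.g }, { 'T', s.t }
    }};
    std::size_t const total = s.a + s.c + s.g + s.t;
    if( total == 0 ) {
        // Assumption: confidence 0.0 for the N case.
        return { 'N', 0.0 };
    }

    auto best = bases[0];
    for( auto const& b : bases ) {
        // Strict comparison keeps the earlier (smaller) base on ties.
        if( b.second > best.second ) {
            best = b;
        }
    }
    assert( best.second > 0 ); // follows from total > 0
    return { best.first, static_cast<double>( best.second ) / static_cast<double>( total ) };
}
```

For counts A=3, C=3, G=1, T=0, the tie between A and C resolves to A, with a confidence of 3/7.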
std::pair< char, double > consensus | ( | BaseCounts const & | sample, |
BaseCountsStatus const & | status | ||
) |
Consensus character for a BaseCounts, and its confidence.
This is simply the character (out of ACGT
) that appears most often (or, for ties, the lexicographically smallest character). If the BaseCounts is not well covered by reads (that is, if its BaseCountsStatus::is_covered is false
), the consensus character is N
. The confidence is the count of the consensus character, divided by the total count of all four nucleotides.
Definition at line 438 of file population/functions/functions.cpp.
AfsPileupRecord convert_to_afs_pileup_record | ( | SimplePileupReader::Record const & | record | ) |
Definition at line 48 of file afs_estimate.cpp.
BaseCounts convert_to_base_counts | ( | SimplePileupReader::Sample const & | sample, |
unsigned char | min_phred_score | ||
) |
Definition at line 45 of file simple_pileup_common.cpp.
Variant convert_to_variant | ( | SimplePileupReader::Record const & | record, |
unsigned char | min_phred_score | ||
) |
Definition at line 145 of file simple_pileup_common.cpp.
Variant convert_to_variant_as_individuals | ( | VcfRecord const & | record, |
bool | use_allelic_depth = false |
||
) |
Convert a VcfRecord to a Variant, treating each sample as an individual, and combining them all into one BaseCounts sample.
In this function, we assume that the data that was used to create the VCF file was the typical use case of VCF, where each sample (column) in the file corresponds to an individual. When using this function, all samples (individuals) are combined into one, as our targeted output type Variant is used to describe allele counts of several individuals (e.g., in a pool). As all columns are combined, the resulting Variant only contains a single BaseCounts object. We only consider biallelic SNP positions here.
We offer two ways of combining the samples (columns) of the input VCF record into the BaseCounts:
If use_allelic_depth is false (default), individuals simply contribute to the BaseCounts according to their ploidy. That is, an individual with genotype A/T will contribute one count each for A and T.
If use_allelic_depth is true, we instead use the "AD" FORMAT field to obtain the actual counts for the reference and alternative allele, and use these to sum up the BaseCounts data.
Definition at line 381 of file vcf_common.cpp.
Convert a VcfRecord to a Variant, treating each sample column as a pool of individuals.
This assumes that the data that was used to create the VCF file was actually a pool of individuals (e.g., from pool sequencing) for each sample (column) of the VCF file. We do not actually recommend using variant calling software on pool-seq data, as it induces frequency shifts due to the statistical models employed by variant callers that were not built for pool sequencing data. It however seems to be a commonly used approach, and hence we offer this function here. For this type of data, the VCF allelic depth ("AD") information contains the counts of the reference and alternative base, which in this context can be interpreted as describing the allele frequencies of each pool of individuals. This requires the VCF to have the "AD" FORMAT field.
Only SNP data (no indels) are allowed in this function; use VcfRecord::is_snp() to test this.
Definition at line 275 of file vcf_common.cpp.
double genesis::population::f_st_pool_karlsson | ( | ForwardIterator1 | p1_begin, |
ForwardIterator1 | p1_end, | ||
ForwardIterator2 | p2_begin, | ||
ForwardIterator2 | p2_end | ||
) |
Compute the F_ST statistic for pool-sequenced data of Karlsson et al as used in PoPoolation2, for two ranges of BaseCounts objects.
The approach is called the "asymptotically unbiased" estimator in PoPoolation2 [1], and follows Karlsson et al [2].
[1] PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq).
Kofler R, Pandey RV, Schlotterer C.
Bioinformatics, 2011, 27(24), 3435–3436. https://doi.org/10.1093/bioinformatics/btr589
[2] Efficient mapping of mendelian traits in dogs through genome-wide association.
Karlsson EK, Baranowska I, Wade CM, Salmon Hillbertz NHC, Zody MC, Anderson N, Biagi TM, Patterson N, Pielberg GR, Kulbokas EJ, Comstock KE, Keller ET, Mesirov JP, Von Euler H, Kämpe O, Hedhammar Å, Lander ES, Andersson G, Andersson L, Lindblad-Toh K.
Nature Genetics, 2007, 39(11), 1321–1328. https://doi.org/10.1038/ng.2007.10
Definition at line 339 of file structure.hpp.
std::pair< double, double > f_st_pool_karlsson_nkdk | ( | std::pair< SortedBaseCounts, SortedBaseCounts > const & | sample_counts | ) |
Compute the numerator N_k
and denominator D_k
needed for the asymptotically unbiased F_ST estimator of Karlsson et al (2007).
See f_st_pool_karlsson() for details. The function expects sorted base counts for the two samples of which we want to compute F_ST, which are produced by sorted_average_base_counts().
Definition at line 101 of file structure.cpp.
double genesis::population::f_st_pool_kofler | ( | size_t | p1_poolsize, |
size_t | p2_poolsize, | ||
ForwardIterator1 | p1_begin, | ||
ForwardIterator1 | p1_end, | ||
ForwardIterator2 | p2_begin, | ||
ForwardIterator2 | p2_end | ||
) |
Compute the F_ST statistic for pool-sequenced data of Kofler et al as used in PoPoolation2, for two ranges of BaseCounts objects.
The approach is called the "classical" or "conventional" estimator in PoPoolation2 [1], and follows Hartl and Clark [2].
[1] PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq).
Kofler R, Pandey RV, Schlotterer C.
Bioinformatics, 2011, 27(24), 3435–3436. https://doi.org/10.1093/bioinformatics/btr589
[2] Principles of Population Genetics.
Hartl DL, Clark AG.
Sinauer, 2007.
Definition at line 202 of file structure.hpp.
std::tuple< double, double, double > f_st_pool_kofler_pi_snp | ( | BaseCounts const & | p1, |
BaseCounts const & | p2 | ||
) |
Compute the SNP-based Theta Pi values used in f_st_pool_kofler().
See there for details. The tuple returns Theta Pi for an individual position, which is simply the heterozygosity() at this position, for both samples p1
and p2
, as well as their combined (average frequency) heterozygosity, in that order.
Definition at line 46 of file structure.cpp.
std::pair<double, double> genesis::population::f_st_pool_unbiased | ( | size_t | p1_poolsize, |
size_t | p2_poolsize, | ||
ForwardIterator1 | p1_begin, | ||
ForwardIterator1 | p1_end, | ||
ForwardIterator2 | p2_begin, | ||
ForwardIterator2 | p2_end | ||
) |
Compute our unbiased F_ST statistic for pool-sequenced data for two ranges of BaseCounts objects.
This is our novel approach for estimating F_ST, using pool-sequencing corrected estimates of Pi within, Pi between, and Pi total, to compute F_ST following the definitions of Nei [1] and Hudson [2], respectively. These are returned here as a pair in that order. See https://github.com/lczech/pool-seq-pop-gen-stats for details.
[1] Analysis of Gene Diversity in Subdivided Populations.
Nei M.
Proceedings of the National Academy of Sciences, 1973, 70(12), 3321–3323. https://doi.org/10.1073/PNAS.70.12.3321
[2] Estimation of levels of gene flow from DNA sequence data.
Hudson RR, Slatkin M, Maddison WP.
Genetics, 1992, 132(2), 583–589. https://doi.org/10.1093/GENETICS/132.2.583
Definition at line 433 of file structure.hpp.
std::tuple< double, double, double > f_st_pool_unbiased_pi_snp | ( | size_t | p1_poolsize, |
size_t | p2_poolsize, | ||
BaseCounts const & | p1, | ||
BaseCounts const & | p2 | ||
) |
Compute the SNP-based Theta Pi values used in f_st_pool_unbiased().
The function returns pi within, between, and total, in that order. See f_st_pool_unbiased() for details.
Definition at line 166 of file structure.cpp.
double f_star | ( | double | a_n, |
double | n | ||
) |
Compute f*
according to Achaz 2008 and Kofler et al. 2011.
This is computed as \( f_{star} = \frac{n - 3}{a_n \cdot (n-1) - n} \), and needed for the computation of alpha_star() and beta_star(). See there for some more details, and see
G. Achaz.
Testing for neutrality in samples with sequencing errors.
(2008) Genetics, 179(3), 1409–1424. https://doi.org/10.1534/genetics.107.082198
for the original equations.
Definition at line 257 of file diversity.cpp.
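The formula for f* translates directly into code. The following is a minimal sketch, assuming the caller passes in a precomputed a_n value; `f_star_sketch` is a hypothetical name, and the assertion on the denominator is an added safeguard, not documented library behavior.

```cpp
#include <cassert>

// f* = (n - 3) / (a_n * (n - 1) - n), as stated above.
// Hypothetical stand-alone version; the caller supplies a_n.
double f_star_sketch( double a_n, double n )
{
    double const denominator = a_n * ( n - 1.0 ) - n;
    // Assumption: the caller avoids the degenerate case of a zero denominator.
    assert( denominator != 0.0 );
    return ( n - 3.0 ) / denominator;
}
```

For n = 4 and a_n = 1 + 1/2 + 1/3, the denominator is 5.5 - 4 = 1.5, so f* is approximately 0.667.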
|
inline |
Filter function to be used with VariantInputIterator to filter by a genome region.
This function can be used as a filter with VariantInputIterator::add_filter(), in order to only iterate over Variants that are in the given region
(if complement
is false
, default), or only over Variants that are outside of the region
(if complement
is true
).
Definition at line 277 of file filter_transform.hpp.
|
inline |
Filter function to be used with VariantInputIterator to filter by a list of genome regions.
This function can be used as a filter with VariantInputIterator::add_filter(), in order to only iterate over Variants that are in the given regions
(if complement
is false
, default), or only over Variants that are outside of the regions
(if complement
is true
).
This version of the function can be used if the regions
is not given as a std::shared_ptr
. The parameter copy_regions
is an optimization. By default, the function stores a copy of the regions
, in order to make sure that it is available. However, if it is guaranteed that the regions
object stays in scope during the VariantInputIterator's lifetime, this copy can be avoided.
Definition at line 316 of file filter_transform.hpp.
|
inline |
Filter function to be used with VariantInputIterator to filter by a list of genome regions.
This function can be used as a filter with VariantInputIterator::add_filter(), in order to only iterate over Variants that are in the given regions
(if complement
is false
, default), or only over Variants that are outside of the regions
(if complement
is true
).
Definition at line 293 of file filter_transform.hpp.
bool filter_by_status | ( | std::function< bool(BaseCountsStatus const &)> | predicate, |
Variant const & | variant, | ||
SampleFilterType | type, | ||
size_t | min_coverage = 0 , |
||
size_t | max_coverage = 0 , |
||
size_t | min_count = 0 , |
||
bool | tolerate_deletions = false |
||
) |
Filter a Variant based on a predicate
that is applied to the result of a status() call on the BaseCounts of the variant
.
See status() for details on the data of type BaseCountsStatus that predicate
can use. This function applies the predicate
to the BaseCounts samples of the variant
(or to the merged one, depending on type
, see also below), and returns whether the filter predicate
passed or not.
Note that different type
values have a distinct effect here: It might happen that all samples individually pass the predicate
, but their merged counts do not, or vice versa. Hence, this choice needs to be made depending on downstream needs. For example, if we are filtering for Variants that are SNPs (where there exist at least two counts in [ACGT]
that are non-zero), individual samples might only have one base count greater than zero, in which case they are not considered to be a SNP. However, if those non-zero counts are not for the same base in all samples, their merged counts will be non-zero for more than one base, and hence considered a SNP.
Definition at line 43 of file filter_transform.cpp.
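The difference between testing samples individually and testing their merged counts can be illustrated with a small stand-alone sketch. All names here (`CountsSketch`, `is_snp_sketch`, `merge_sketch`) are hypothetical, and the check only mimics the "more than one non-zero count" criterion; it does not reproduce the coverage and count thresholds of status().

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for per-sample counts, in the order A, C, G, T.
using CountsSketch = std::array<std::size_t, 4>;

// A sample counts as a SNP here if more than one of A, C, G, T is non-zero.
bool is_snp_sketch( CountsSketch const& counts )
{
    std::size_t non_zero = 0;
    for( auto c : counts ) {
        if( c > 0 ) {
            ++non_zero;
        }
    }
    return non_zero > 1;
}

// Sum the counts of all samples, mimicking the merged view of a variant.
CountsSketch merge_sketch( std::vector<CountsSketch> const& samples )
{
    CountsSketch total = {{ 0, 0, 0, 0 }};
    for( auto const& s : samples ) {
        for( std::size_t i = 0; i < total.size(); ++i ) {
            total[i] += s[i];
        }
    }
    return total;
}

// One sample has only A, the other only T: neither is a SNP, but the merge is.
void merged_vs_individual_demo()
{
    std::vector<CountsSketch> const samples = {
        CountsSketch{{ 8, 0, 0, 0 }}, CountsSketch{{ 0, 0, 0, 5 }}
    };
    assert( !is_snp_sketch( samples[0] ) );
    assert( !is_snp_sketch( samples[1] ) );
    assert( is_snp_sketch( merge_sketch( samples ) ) );
}
```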
|
inline |
Filter a Variant based on a predicate
that is applied to the result of a status() call on the BaseCounts of the variant
.
Same as filter_by_status( std::function<...>, Variant const&, ... ) , but returns a callback to be used as a filter, e.g., with VariantInputIterator::add_filter().
Definition at line 123 of file filter_transform.hpp.
|
inline |
Filter a Variant based on whether the sample counts are biallelic SNPs, that is, exactly two base counts in [ACGT]
are non-zero.
Same as filter_is_biallelic_snp( Variant const&, ... ) , but returns a callback to be used as a filter, e.g., with VariantInputIterator::add_filter().
Definition at line 230 of file filter_transform.hpp.
|
inline |
Filter a Variant based on whether the sample counts are biallelic SNPs, that is, exactly two base counts in [ACGT]
are non-zero.
Same as filter_is_snp( Variant const&, ... ) , but additionally checks that the SNP is biallelic (BaseCountsStatus::is_biallelic). See there for more details.
Definition at line 204 of file filter_transform.hpp.
|
inline |
Filter a Variant based on whether the sample counts are SNPs, that is, more than one count in [ACGT]
is non-zero.
Same as filter_is_snp( Variant const&, ... ) , but returns a callback to be used as a filter, e.g., with VariantInputIterator::add_filter().
Definition at line 178 of file filter_transform.hpp.
|
inline |
Filter a Variant based on whether the sample counts are SNPs, that is, more than one count in [ACGT]
is non-zero.
This function checks that the samples are covered (BaseCountsStatus::is_covered) and have more than one non-zero count (BaseCountsStatus::is_snp).
See status() for details, and see filter_by_status() for details on the processing, in particular the type
argument.
Definition at line 152 of file filter_transform.hpp.
GenomeRegionList genome_region_list_from_vcf_file | ( | std::string const & | file | ) |
Read a VCF file, and use its positions to create a GenomeRegionList.
This is for example useful to restrict some analysis to the loci of known variants. Note that the whole file has to be read still; it can hence be better to only do this once and convert to a faster file format.
This ignores all sample information, and simply uses the CHROM
and POS
data to construct intervals of consecutive positions along the chromosomes, i.e., if the file contains positions 1
, 2
, and 3
, but not 4
, an interval spanning 1-3
is inserted into the list.
The VCF file does not have to be sorted for this.
Definition at line 486 of file vcf_common.cpp.
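The interval construction described above can be sketched for a single chromosome as follows. This is an illustration only: `positions_to_intervals_sketch` is a hypothetical helper, and it sorts the positions up front, which is not necessarily how the library handles unsorted input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: turn positions on one chromosome into closed intervals
// [first, last] of consecutive positions. The input may be unsorted.
std::vector<std::pair<std::size_t, std::size_t>>
positions_to_intervals_sketch( std::vector<std::size_t> positions )
{
    std::sort( positions.begin(), positions.end() );
    std::vector<std::pair<std::size_t, std::size_t>> intervals;
    for( auto pos : positions ) {
        if( !intervals.empty() && pos <= intervals.back().second + 1 ) {
            // Duplicate or directly adjacent position: extend the current interval.
            intervals.back().second = pos;
        } else {
            // Gap to the previous position: start a new interval.
            intervals.emplace_back( pos, pos );
        }
    }
    return intervals;
}

// Positions 1, 2, 3 (given out of order) form one interval; 7 starts another.
void positions_to_intervals_demo()
{
    auto const result = positions_to_intervals_sketch( { 3, 1, 2, 7 } );
    assert( result.size() == 2u );
    assert( result[0].first == 1u && result[0].second == 3u );
    assert( result[1].first == 7u && result[1].second == 7u );
}
```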
void genome_region_list_from_vcf_file | ( | std::string const & | file, |
GenomeRegionList & | target | ||
) |
Read a VCF file, and add its positions to an existing GenomeRegionList.
This is for example useful to restrict some analysis to the loci of known variants. Note that the whole file has to be read still; it can hence be better to only do this once and convert to a faster file format.
This ignores all sample information, and simply uses the CHROM
and POS
data to construct intervals of consecutive positions along the chromosomes, i.e., if the file contains positions 1
, 2
, and 3
, but not 4
, an interval spanning 1-3
is inserted into the list.
The VCF file does not have to be sorted for this. The regions are merged into the existing ones, potentially changing existing starts and ends of intervals if they overlap with regions found in the VCF.
Definition at line 493 of file vcf_common.cpp.
size_t get_base_count | ( | BaseCounts const & | bc, |
char | base | ||
) |
Get the count for a base
given as a char.
The given base
has to be one of ACGTDN
(case insensitive), or *#.
for deletions as well.
Definition at line 103 of file population/functions/functions.cpp.
std::pair<std::array<char, 6>, size_t> genesis::population::get_vcf_record_snp_ref_alt_chars_ | ( | VcfRecord const & | record | ) |
Local helper function that returns the REF and ALT chars of a VcfRecord for SNPs.
This function expects the record
to only contain SNP REF and ALT (single nucleotides), and throws when not. It then fills the resulting array with these chars. That is, result[0] is the REF char, result[1] the first ALT char, and so forth.
To keep it speedy, we always return an array that is large enough for all ACGTND
, and return the number of used entries as the second value of the pair.
Definition at line 232 of file vcf_common.cpp.
char guess_alternative_base | ( | Variant const & | variant, |
bool | force = true |
||
) |
Guess the alternative base of a Variant.
If the Variant already has an alternative_base
in ACGT
and force
is not true
, this original base is returned (meaning that this function is idempotent; it does not change the alternative base if there already is one). However, if the alternative_base
is N
or any other char not in ACGT
, or if force
is true
, the base with the highest count that is not the reference base is returned instead. This also means that the reference base has to be set to a value in ACGT
, as otherwise the concept of an alternative base is meaningless anyway. If the reference base is not one of ACGT
, the returned alternative base is N
. Furthermore, if all three non-reference bases have count 0, the returned alternative base is N
.
Definition at line 463 of file population/functions/functions.cpp.
char guess_reference_base | ( | Variant const & | variant | ) |
Guess the reference base of a Variant.
If the Variant already has a reference_base
in ACGT
, this base is returned (meaning that this function is idempotent; it does not change the reference base if there already is one). However, if the reference_base
is N
or any other value not in ACGT
, the base with the highest count is returned instead, unless all counts are 0, in which case the returned reference base is N
.
Definition at line 447 of file population/functions/functions.cpp.
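The decision logic described above can be sketched as follows. This is a hedged illustration: `guess_reference_base_sketch` is a hypothetical name that takes the current reference base and the A, C, G, T counts directly instead of a Variant. Two behaviors are assumptions not stated above: ties are resolved towards the first base in ACGT order, and the check is case sensitive.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-alone version; counts are given in the order A, C, G, T.
char guess_reference_base_sketch(
    char reference_base, std::array<std::size_t, 4> const& counts
) {
    std::string const bases = "ACGT";

    // An existing reference base in ACGT is returned unchanged.
    if( bases.find( reference_base ) != std::string::npos ) {
        return reference_base;
    }

    // Otherwise, pick the base with the highest count.
    // Assumption: ties go to the first base in ACGT order.
    std::size_t best = 0;
    for( std::size_t i = 1; i < counts.size(); ++i ) {
        if( counts[i] > counts[best] ) {
            best = i;
        }
    }
    assert( best < bases.size() );

    // If all counts are zero, there is nothing to guess from.
    return counts[best] == 0 ? 'N' : bases[best];
}
```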
double heterozygosity | ( | BaseCounts const & | sample, |
bool | with_bessel = false |
||
) |
Compute classic heterozygosity.
This is computed as \( h = \frac{n}{n-1} \left( 1 - \sum p^2 \right) \) with n
the total nucleotide_sum() (sum of A
,C
,G
,T
in the sample), and p
their respective nucleotide frequencies. The leading factor \( \frac{n}{n-1} \) is Bessel's correction: it is applied when with_bessel is true, and omitted when with_bessel is set to false (default).
See Equation 3.1 in
Hahn, M. W.
(2018). Molecular Population Genetics.
https://global.oup.com/academic/product/molecular-population-genetics-9780878939657
for details.
Definition at line 110 of file diversity.cpp.
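As a worked illustration of the formula above, the following sketch computes the heterozygosity from four raw counts. It is not the code from diversity.cpp: `heterozygosity_sketch` is a hypothetical name, and returning 0.0 for an empty sample as well as asserting n > 1 for the Bessel case are assumptions about edge cases that the description does not cover.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-alone version; counts are given in the order A, C, G, T.
double heterozygosity_sketch(
    std::array<std::size_t, 4> const& counts, bool with_bessel = false
) {
    std::size_t n = 0;
    for( auto c : counts ) {
        n += c;
    }
    if( n == 0 ) {
        return 0.0; // assumption: no counts means no heterozygosity
    }

    // Sum of squared nucleotide frequencies.
    double sum_sq = 0.0;
    for( auto c : counts ) {
        double const p = static_cast<double>( c ) / static_cast<double>( n );
        sum_sq += p * p;
    }

    double h = 1.0 - sum_sq;
    if( with_bessel ) {
        assert( n > 1 ); // assumption: n/(n-1) is undefined for a single count
        h *= static_cast<double>( n ) / static_cast<double>( n - 1 );
    }
    return h;
}
```

For counts A=5, C=5, the uncorrected value is 1 - (0.25 + 0.25) = 0.5, and the corrected value is 0.5 times 10/9, roughly 0.556.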
bool is_covered | ( | GenomeRegion const & | region, |
std::string const & | chromosome, | ||
size_t | position | ||
) |
Test whether the chromosome/position is within a given genomic region
.
Definition at line 190 of file genome_region.cpp.
bool genesis::population::is_covered | ( | GenomeRegion const & | region, |
T const & | locus | ||
) |
Test whether the chromosome/position of a locus
is within a given genomic region
.
This is a function template, so that it can accept any data structure that contains public member variables chromosome
(std::string
) and position
(size_t
), such as Variant or GenomeLocus.
Definition at line 121 of file functions/genome_region.hpp.
bool is_covered | ( | GenomeRegion const & | region, |
VcfRecord const & | variant | ||
) |
Definition at line 219 of file genome_region.cpp.
bool is_covered | ( | GenomeRegionList const & | regions, |
std::string const & | chromosome, | ||
size_t | position | ||
) |
Test whether the chromosome/position is within a given list of genomic regions
.
Definition at line 212 of file genome_region.cpp.
bool genesis::population::is_covered | ( | GenomeRegionList const & | regions, |
T const & | locus | ||
) |
Test whether the chromosome/position of a locus
is within a given list of genomic regions
.
This is a function template, so that it can accept any data structure that contains public member variables chromosome
(std::string
) and position
(size_t
), such as Variant or GenomeLocus.
Definition at line 135 of file functions/genome_region.hpp.
bool is_covered | ( | GenomeRegionList const & | regions, |
VcfRecord const & | variant | ||
) |
Definition at line 224 of file genome_region.cpp.
|
inline |
Three-way comparison (spaceship operator <=>
) for two loci in a genome.
The comparison returns -1
if the left locus is before the right locus, +1
for the opposite, and 0
if the two loci are equal.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 140 of file functions/genome_locus.hpp.
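The ordering described above, with chromosome names compared lexicographically first and positions second, can be sketched like this. `LocusSketch` and `locus_compare_sketch` are hypothetical stand-ins for GenomeLocus and the comparison function documented here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for GenomeLocus.
struct LocusSketch {
    std::string chromosome;
    std::size_t position;
};

// Returns -1, 0, or +1, comparing chromosome names first, then positions.
int locus_compare_sketch( LocusSketch const& l, LocusSketch const& r )
{
    int const chr = l.chromosome.compare( r.chromosome );
    if( chr != 0 ) {
        return chr < 0 ? -1 : +1;
    }
    if( l.position < r.position ) {
        return -1;
    }
    if( l.position > r.position ) {
        return +1;
    }
    return 0;
}

void locus_compare_demo()
{
    // Same chromosome: the position decides.
    assert( locus_compare_sketch( { "chr1", 5 }, { "chr1", 7 } ) == -1 );
    // Lexicographical order: "chr10" sorts before "chr2".
    assert( locus_compare_sketch( { "chr2", 1 }, { "chr10", 100 } ) == +1 );
    assert( locus_compare_sketch( { "chr1", 5 }, { "chr1", 5 } ) == 0 );
}
```

The second assertion shows a consequence of the lexicographical ordering: chromosome names are not sorted numerically, so "chr10" comes before "chr2".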
|
inline |
Three-way comparison (spaceship operator <=>
) for two loci in a genome.
The comparison returns -1
if the left locus is before the right locus, +1
for the opposite, and 0
if the two loci are equal.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 120 of file functions/genome_locus.hpp.
|
inline |
Three-way comparison (spaceship operator <=>
) for two loci in a genome.
The comparison returns -1
if the left locus is before the right locus, +1
for the opposite, and 0
if the two loci are equal.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 130 of file functions/genome_locus.hpp.
|
inline |
Three-way comparison (spaceship operator <=>
) for two loci in a genome.
The comparison returns -1
if the left locus is before the right locus, +1
for the opposite, and 0
if the two loci are equal.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 95 of file functions/genome_locus.hpp.
|
inline |
Equality comparison (==
) for two loci in a genome.
Definition at line 199 of file functions/genome_locus.hpp.
|
inline |
Equality comparison (==
) for two loci in a genome.
Definition at line 179 of file functions/genome_locus.hpp.
|
inline |
Equality comparison (==
) for two loci in a genome.
Definition at line 189 of file functions/genome_locus.hpp.
|
inline |
Equality comparison (==
) for two loci in a genome.
Definition at line 169 of file functions/genome_locus.hpp.
|
inline |
Greater than comparison (>
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 361 of file functions/genome_locus.hpp.
|
inline |
Greater than comparison (>
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 341 of file functions/genome_locus.hpp.
|
inline |
Greater than comparison (>
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 351 of file functions/genome_locus.hpp.
|
inline |
Greater than comparison (>
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 331 of file functions/genome_locus.hpp.
|
inline |
Greater than or equal comparison (>=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 471 of file functions/genome_locus.hpp.
|
inline |
Greater than or equal comparison (>=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 451 of file functions/genome_locus.hpp.
|
inline |
Greater than or equal comparison (>=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 461 of file functions/genome_locus.hpp.
|
inline |
Greater than or equal comparison (>=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 441 of file functions/genome_locus.hpp.
|
inline |
Inequality comparison (!=
) for two loci in a genome.
Definition at line 251 of file functions/genome_locus.hpp.
|
inline |
Inequality comparison (!=
) for two loci in a genome.
Definition at line 231 of file functions/genome_locus.hpp.
|
inline |
Inequality comparison (!=
) for two loci in a genome.
Definition at line 241 of file functions/genome_locus.hpp.
|
inline |
Inequality comparison (!=
) for two loci in a genome.
Definition at line 221 of file functions/genome_locus.hpp.
|
inline |
Less than comparison (<
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 307 of file functions/genome_locus.hpp.
|
inline |
Less than comparison (<
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 287 of file functions/genome_locus.hpp.
|
inline |
Less than comparison (<
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 297 of file functions/genome_locus.hpp.
|
inline |
Less than comparison (<
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 277 of file functions/genome_locus.hpp.
|
inline |
Less than or equal comparison (<=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 417 of file functions/genome_locus.hpp.
|
inline |
Less than or equal comparison (<=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 397 of file functions/genome_locus.hpp.
|
inline |
Less than or equal comparison (<=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 407 of file functions/genome_locus.hpp.
|
inline |
Less than or equal comparison (<=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 385 of file functions/genome_locus.hpp.
SlidingIntervalWindowIterator<ForwardIterator> genesis::population::make_default_sliding_interval_window_iterator | ( | ForwardIterator | begin, |
ForwardIterator | end, | ||
size_t | width = 0 , |
||
size_t | stride = 0 |
||
) |
Helper function to instantiate a SlidingIntervalWindowIterator for a default use case.
This helper assumes that the underlying type of the input data stream and of the Windows that we are sliding over are of the same type, that is, we do no conversion in the entry_input_function
functor of the SlidingIntervalWindowIterator. It further assumes that this data type has public member variables chromosome
and position
that are accessed by the chromosome_function
and position_function
functors of the SlidingIntervalWindowIterator. For example, a data type that this works for is Variant data.
Definition at line 495 of file sliding_interval_window_iterator.hpp.
SlidingVariantsWindowIterator<ForwardIterator> genesis::population::make_default_sliding_variants_window_iterator | ( | ForwardIterator | begin, |
ForwardIterator | end, | ||
size_t | width = 0 , |
||
size_t | stride = 0 |
||
) |
Helper function to instantiate a SlidingVariantsWindowIterator for a default use case.
This helper assumes that the underlying type of the input data stream and of the Windows that we are sliding over are of the same type, that is, we do no conversion in the entry_input_function
functor of the SlidingVariantsWindowIterator. It further assumes that this data type has public member variables chromosome
and position
that are accessed by the chromosome_function
and position_function
functors of the SlidingVariantsWindowIterator. For example, a data type that this works for is Variant data.
Definition at line 368 of file sliding_variants_window_iterator.hpp.
std::shared_ptr<T> genesis::population::make_input_iterator_with_sample_filter_ | ( | std::string const & | filename, |
R const & | reader, | ||
std::vector< size_t > const & | sample_indices, | ||
bool | inverse_sample_indices, | ||
std::vector< bool > const & | sample_filter | ||
) |
Local helper function template that takes care of initializing an input iterator and setting the sample filters, for those iterators for which we do not know the number of samples prior to starting the file iteration.
The template arguments are: T
the returned type of input iterator, and R
the underlying reader type. This is very specific for the use case here, and currently is only meant for how we work with the SimplePileupReader and the SyncReader and their iterators. Both their iterators accept a reader to take settings from.
Definition at line 62 of file variant_input_iterator.cpp.
SlidingIntervalWindowIterator<ForwardIterator, DataType> genesis::population::make_sliding_interval_window_iterator | ( | ForwardIterator | begin, |
ForwardIterator | end, | ||
size_t | width = 0 , |
||
size_t | stride = 0 |
||
) |
Helper function to instantiate a SlidingIntervalWindowIterator without the need to specify the template parameters manually.
The three functors entry_input_function
, chromosome_function
, and position_function
of the SlidingIntervalWindowIterator have to be set in the returned iterator before using it. See make_default_sliding_interval_window_iterator() for an alternative make function that sets these three functors to reasonable defaults that work for the Variant data type.
Definition at line 474 of file sliding_interval_window_iterator.hpp.
SlidingVariantsWindowIterator<ForwardIterator, DataType> genesis::population::make_sliding_variants_window_iterator | ( | ForwardIterator | begin, |
ForwardIterator | end, | ||
size_t | width = 0 , |
||
size_t | stride = 0 |
||
) |
Helper function to instantiate a SlidingVariantsWindowIterator without the need to specify the template parameters manually.
Definition at line 347 of file sliding_variants_window_iterator.hpp.
VariantInputIterator make_variant_input_iterator_from_individual_vcf_file | ( | std::string const & | filename, |
bool | use_allelic_depth = false , |
||
bool | only_biallelic = true , |
||
bool | only_filter_pass = true |
||
) |
Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as an individual, and combining them all into one BaseCounts sample.
See convert_to_variant_as_individuals( VcfRecord const&, bool ) for details on the conversion from VcfRecord to Variant. We only consider biallelic SNP positions here.
If only_filter_pass
is set to true
(default), only those positions are considered that have the FILTER field set to "PASS". That is, all variants that did not pass a filter in the VCF processing are skipped.
Definition at line 456 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_individual_vcf_file | ( | std::string const & | filename, |
std::vector< std::string > const & | sample_names, | ||
bool | inverse_sample_names = false , |
||
bool | use_allelic_depth = false , |
||
bool | only_biallelic = true , |
||
bool | only_filter_pass = true |
||
) |
Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as an individual, and combining them all into one BaseCounts sample.
See convert_to_variant_as_individuals( VcfRecord const&, bool ) for details on the conversion from VcfRecord to Variant. We only consider biallelic SNP positions here.
If only_filter_pass
is set to true
(default), only those positions are considered that have the FILTER field set to "PASS". That is, all variants that did not pass a filter in the VCF processing are skipped.
Additionally, this version of the function takes a list of sample_names
which are used as filter so that only those samples (columns of the VCF records) are evaluated and accessible - or, if inverse_sample_names
is set to true
, instead all but those samples.
Definition at line 468 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_pileup_file | ( | std::string const & | filename, |
SimplePileupReader const & | reader = SimplePileupReader{} |
||
) |
Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants.
Optionally, this takes a reader
with settings to be used.
Definition at line 237 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_pileup_file | ( | std::string const & | filename, |
std::vector< bool > const & | sample_filter, | ||
SimplePileupReader const & | reader = SimplePileupReader{} |
||
) |
Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants.
This uses only the samples at the indices where the sample_filter
is true
. Optionally, this takes a reader
with settings to be used.
Definition at line 257 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_pileup_file | ( | std::string const & | filename, |
std::vector< size_t > const & | sample_indices, | ||
bool | inverse_sample_indices = false , |
||
SimplePileupReader const & | reader = SimplePileupReader{} |
||
) |
Create a VariantInputIterator to iterate the contents of a (m)pileup file as Variants.
This uses only the samples at the zero-based indices given in the sample_indices
list. If inverse_sample_indices
is true
, this list is inverted, that is, all sample indices but the ones listed are included in the output.
For example, given a list { 0, 2 }
and a file with 4 samples, only the first and the third sample will be in the output. When however inverse_sample_indices
is also set, then the output will contain the second and fourth sample.
Optionally, this takes a reader
with settings to be used.
Definition at line 246 of file variant_input_iterator.cpp.
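The relation between a list of sample indices, the inversion flag, and a per-sample bool filter can be sketched as below. The helper name `make_sample_filter_sketch` and the explicit `sample_count` parameter are assumptions made for this illustration; the library itself may determine the number of samples only once file iteration starts.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of the library: turn a list of zero-based
// sample indices into a per-sample bool filter, optionally inverted.
inline std::vector<bool> make_sample_filter_sketch(
    std::vector<std::size_t> const& sample_indices,
    bool inverse,
    std::size_t sample_count
) {
    // Start with all samples off (or all on, when inverting),
    // and then flip the listed ones.
    std::vector<bool> filter( sample_count, inverse );
    for( auto const index : sample_indices ) {
        if( index >= sample_count ) {
            throw std::out_of_range( "Sample index exceeds number of samples" );
        }
        filter[ index ] = !inverse;
    }
    return filter;
}
```

For the list { 0, 2 } and 4 samples, this yields a filter that keeps the first and third sample, or the second and fourth sample when `inverse` is set.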
VariantInputIterator genesis::population::make_variant_input_iterator_from_pileup_file_ | ( | std::string const & | filename, |
SimplePileupReader const & | reader, | ||
std::vector< size_t > const & | sample_indices, | ||
bool | inverse_sample_indices, | ||
std::vector< bool > const & | sample_filter | ||
) |
Local helper function that takes care of the three functions below.
Definition at line 193 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_pool_vcf_file | ( | std::string const & | filename, |
bool | only_biallelic = true , |
||
bool | only_filter_pass = true |
||
) |
Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as a pool of individuals.
See convert_to_variant_as_pool( VcfRecord const& ) for details on the conversion from VcfRecord to Variant.
This function requires the VCF to have the "AD" FORMAT field. It only iterates over those VCF record lines that actually have the "AD" FORMAT provided, as this is the information that we use to convert the samples to Variants. All records without that field are skipped. Only SNP records are processed; that is, all non-SNPs (indels and others) are ignored.
If only_biallelic
is set to true
(default), this is further restricted to only contain biallelic SNPs, that is, only positions with exactly one alternative allele.
If only_filter_pass
is set to true
(default), only those positions are considered that have the FILTER field set to "PASS". That is, all variants that did not pass a filter in the VCF processing are skipped.
Definition at line 432 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_pool_vcf_file | ( | std::string const & | filename, |
std::vector< std::string > const & | sample_names, | ||
bool | inverse_sample_names = false , |
||
bool | only_biallelic = true , |
||
bool | only_filter_pass = true |
||
) |
Create a VariantInputIterator to iterate the contents of a VCF file as Variants, treating each sample as a pool of individuals.
See convert_to_variant_as_pool( VcfRecord const& ) for details on the conversion from VcfRecord to Variant.
This function requires the VCF to have the "AD" FORMAT field. It only iterates over those VCF record lines that actually have the "AD" FORMAT provided, as this is the information that we use to convert the samples to Variants. All records without that field are skipped. Only SNP records are processed; that is, all non-SNPs (indels and others) are ignored.
If only_biallelic
is set to true
(default), this is further restricted to only contain biallelic SNPs, that is, only positions with exactly one alternative allele.
If only_filter_pass
is set to true
(default), only those positions are considered that have the FILTER field set to "PASS". That is, all variants that did not pass a filter in the VCF processing are skipped.
Additionally, this version of the function takes a list of sample_names
which are used as filter so that only those samples (columns of the VCF records) are evaluated and accessible - or, if inverse_sample_names
is set to true
, instead all but those samples.
Definition at line 443 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_sam_file | ( | std::string const & | filename, |
SamVariantInputIterator const & | reader = SamVariantInputIterator{} |
||
) |
Create a VariantInputIterator to iterate the contents of a SAM/BAM/CRAM file as Variants.
An instance of SamVariantInputIterator can be provided from which the settings are copied.
Depending on the settings used in the reader
, this can either produce a single sample (one BaseCounts object in the resulting Variant at each position in the genome), or split the input file by the read group (RG) tag (potentially also allowing for an "unaccounted" group of reads).
The other make_variant_input_iterator_...
functions offer settings to sub-set (filter) the samples based on their names or indices. This can be achieved here as well, but has to be done directly in the reader
, rather than by providing filter arguments to this function.
Definition at line 129 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_sync_file | ( | std::string const & | filename | ) |
Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants.
Definition at line 314 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_sync_file | ( | std::string const & | filename, |
std::vector< bool > const & | sample_filter | ||
) |
Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants.
This uses only the samples at the indices where the sample_filter
is true
.
Definition at line 332 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_sync_file | ( | std::string const & | filename, |
std::vector< size_t > const & | sample_indices, | ||
bool | inverse_sample_indices = false |
||
) |
Create a VariantInputIterator to iterate the contents of a PoPoolation2 sync file as Variants.
This uses only the samples at the zero-based indices given in the sample_indices
list. If inverse_sample_indices
is true
, this list is inverted, that is, all sample indices but the ones listed are included in the output.
For example, given a list { 0, 2 }
and a file with 4 samples, only the first and the third sample will be in the output. When however inverse_sample_indices
is also set, then the output will contain the second and fourth sample.
Definition at line 322 of file variant_input_iterator.cpp.
VariantInputIterator genesis::population::make_variant_input_iterator_from_sync_file_ | ( | std::string const & | filename, |
std::vector< size_t > const & | sample_indices, | ||
bool | inverse_sample_indices, | ||
std::vector< bool > const & | sample_filter | ||
) |
Definition at line 271 of file variant_input_iterator.cpp.
VariantInputIterator make_variant_input_iterator_from_variant_parallel_input_iterator | ( | VariantParallelInputIterator const & | parallel_input, |
bool | allow_ref_base_mismatches = false , |
||
bool | allow_alt_base_mismatches = true , |
||
std::string const & | source_sample_separator = ":" |
||
) |
Create a VariantInputIterator to iterate multiple input sources at once, using a VariantParallelInputIterator.
This wraps multiple input sources into one iterator that traverses all of them in parallel, and is here then yet again turned into a Variant per position, using VariantParallelInputIterator::Iterator::joined_variant() to combine all input sources into one. See there for the meaning of the two bool
parameters of this function.
As this is iterating multiple files, we leave the VariantInputIteratorData::file_path and VariantInputIteratorData::source_name empty, and fill the VariantInputIteratorData::sample_names with the sample names of the underlying input sources of the parallel iterator, using their respective source_name
as a prefix, separated by source_sample_separator
, for example my_bam:S1
for a source file /path/to/my_bam.bam
with a RG read group tag S1
.
Definition at line 488 of file variant_input_iterator.cpp.
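The naming scheme for the combined sample names can be sketched as follows. The function name `prefixed_sample_names_sketch` is hypothetical, and the sketch assumes that the source name has already been derived from the file path (for example my_bam from /path/to/my_bam.bam).

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of the library: prefix each sample name
// with the name of its source, joined by the given separator.
inline std::vector<std::string> prefixed_sample_names_sketch(
    std::string const& source_name,
    std::vector<std::string> const& sample_names,
    std::string const& separator = ":"
) {
    std::vector<std::string> result;
    result.reserve( sample_names.size() );
    for( auto const& sample : sample_names ) {
        result.push_back( source_name + separator + sample );
    }
    return result;
}
```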
VariantInputIterator genesis::population::make_variant_input_iterator_from_vcf_file_ | ( | std::string const & | filename, |
std::vector< std::string > const & | sample_names, | ||
bool | inverse_sample_names, | ||
bool | pool_samples, | ||
bool | use_allelic_depth, | ||
bool | only_biallelic, | ||
bool | only_filter_pass | ||
) |
Local helper function that takes care of both main functions below.
Definition at line 351 of file variant_input_iterator.cpp.
BaseCounts merge | ( | BaseCounts const & | p1, |
BaseCounts const & | p2 | ||
) |
Merge the counts of two BaseCounts objects.
Definition at line 372 of file population/functions/functions.cpp.
BaseCounts merge | ( | std::vector< BaseCounts > const & | p | ) |
Merge the counts of a vector of BaseCounts objects.
Definition at line 379 of file population/functions/functions.cpp.
void merge_inplace | ( | BaseCounts & | p1, |
BaseCounts const & | p2 | ||
) |
Merge the counts of two BaseCounts objects, by adding the counts of the second (p2
) to the first (p1
).
Definition at line 355 of file population/functions/functions.cpp.
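A minimal sketch of the two merge flavors, assuming a reduced counts struct with only the four nucleotide counters. The struct `Counts` and both function names are hypothetical; the library's BaseCounts may hold further fields that would need to be merged as well.

```cpp
#include <cstddef>

// Hypothetical reduced counts struct; the library's BaseCounts may differ.
struct Counts
{
    std::size_t a_count;
    std::size_t c_count;
    std::size_t g_count;
    std::size_t t_count;
};

// Add the counts of p2 to p1, modifying p1.
inline void merge_inplace_sketch( Counts& p1, Counts const& p2 )
{
    p1.a_count += p2.a_count;
    p1.c_count += p2.c_count;
    p1.g_count += p2.g_count;
    p1.t_count += p2.t_count;
}

// Return the merged counts, leaving both inputs untouched.
inline Counts merge_sketch( Counts const& p1, Counts const& p2 )
{
    Counts result = p1;
    merge_inplace_sketch( result, p2 );
    return result;
}
```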
double n_base | ( | size_t | coverage, |
size_t | poolsize | ||
) |
Compute the n_base
term used for Tajima's D in Kofler et al. 2011, using a faster closed form expression.
This term is the expected number of distinct individuals sequenced, which is equivalent to finding the expected number of distinct values selected from a set of integers.
The computation in PoPoolation is slow, see n_base_matrix(). We here instead use a closed form expression following the reasoning of https://math.stackexchange.com/a/72351 (see there for the derivation of the equation).
Definition at line 432 of file diversity.cpp.
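As a sketch of such a closed form: when drawing c reads uniformly with replacement from a pool of n individuals, the expected number of distinct individuals is n * (1 - ((n - 1) / n)^c). The function below implements this textbook expression under the assumption that it is the form meant above; the name `n_base_sketch` is hypothetical and this is not the library's code.

```cpp
#include <cmath>
#include <cstddef>

// Expected number of distinct individuals hit when drawing `coverage` reads
// uniformly with replacement from a pool of `poolsize` individuals.
// Precondition (not checked here): poolsize > 0.
inline double n_base_sketch( std::size_t coverage, std::size_t poolsize )
{
    double const n = static_cast<double>( poolsize );
    double const c = static_cast<double>( coverage );
    return n * ( 1.0 - std::pow(( n - 1.0 ) / n, c ));
}
```

For example, with two individuals and two reads, the second read hits a new individual with probability 1/2, so the expectation is 1.5.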
double n_base_matrix | ( | size_t | coverage, |
size_t | poolsize | ||
) |
Compute the n_base
term used for Tajima's D in Kofler et al. 2011, following their approach.
This term is the expected number of distinct individuals sequenced, which is equivalent to finding the expected number of distinct values selected from a set of integers.
The computation of this term in PoPoolation uses a recursive dynamic programming approach to sum over different possibilities of selecting sets of integers. This gets rather slow for larger inputs, and there is an equivalent closed form that we here use instead. See n_base() for details. We here merely offer the original PoPoolation implementation as a point of reference.
R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
Definition at line 398 of file diversity.cpp.
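To illustrate why a dynamic programming formulation gives the same expectation, the following sketch tracks the probability of having seen exactly j distinct individuals after each read. It is a generic formulation written for this page, with the hypothetical name `n_base_dp_sketch`, and is not the PoPoolation or library implementation.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Expected number of distinct individuals after `coverage` draws with
// replacement from `poolsize` individuals, via dynamic programming.
// Precondition (not checked here): poolsize > 0.
inline double n_base_dp_sketch( std::size_t coverage, std::size_t poolsize )
{
    double const n = static_cast<double>( poolsize );

    // prob[j]: probability of having seen exactly j distinct individuals so far.
    std::vector<double> prob( poolsize + 1, 0.0 );
    prob[0] = 1.0;

    for( std::size_t i = 0; i < coverage; ++i ) {
        std::vector<double> next( poolsize + 1, 0.0 );
        for( std::size_t j = 0; j <= poolsize; ++j ) {
            double const dj = static_cast<double>( j );
            // The read hits an already seen individual: the count stays at j.
            next[j] += prob[j] * dj / n;
            // The read hits a new individual: the count grows to j + 1.
            if( j < poolsize ) {
                next[j + 1] += prob[j] * ( n - dj ) / n;
            }
        }
        prob = std::move( next );
    }

    // Expectation over the final distribution.
    double expected = 0.0;
    for( std::size_t j = 0; j <= poolsize; ++j ) {
        expected += static_cast<double>( j ) * prob[j];
    }
    return expected;
}
```

The cost grows with the product of coverage and pool size, which is consistent with the remark above that this route gets slow for larger inputs.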
std::array<size_t, 4> genesis::population::nucleotide_sorting_order_ | ( | std::array< T, 4 > const & | values | ) |
Local helper function that runs a sorting network to sort four values, coming from the four nucleotides.
The input are four values, either counts or frequencies. The output are the indices into this array that are sorted so that the largest one comes first:
auto const data = std::array<T, 4>{ 15, 10, 20, 5 };
auto const order = nucleotide_sorting_order_( data );
yields { 2, 0, 1, 3 }
, so that data[order[0]] = data[2] = 20
is the largest value, data[order[1]] = data[0] = 15
the second largest, and so forth.
Definition at line 162 of file population/functions/functions.cpp.
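A sorting network for four values needs five compare-and-swap steps. The following is a minimal sketch of that idea, sorting indices rather than the values themselves; the name `sorting_order_sketch` and the exact comparator sequence are assumptions for illustration and may differ from the library's helper.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Return the indices into `values`, ordered so that the largest value comes first.
template<typename T>
std::array<std::size_t, 4> sorting_order_sketch( std::array<T, 4> const& values )
{
    std::array<std::size_t, 4> order = {{ 0, 1, 2, 3 }};

    // Compare-and-swap: afterwards, position i refers to the larger of the two values.
    auto cas = [&]( std::size_t i, std::size_t j ) {
        if( values[ order[i] ] < values[ order[j] ] ) {
            std::swap( order[i], order[j] );
        }
    };

    // Standard five-comparator network for four elements.
    cas( 0, 1 );
    cas( 2, 3 );
    cas( 0, 2 );
    cas( 1, 3 );
    cas( 1, 2 );
    return order;
}
```

For the input { 15, 10, 20, 5 } from the example above, this sketch returns { 2, 0, 1, 3 }.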
|
inline |
Count of the pure nucleotide bases at this position, that is, the sum of all A
, C
, G
, and T
.
This is simply the sum of a_count + c_count + g_count + t_count
, which we often use as the coverage at the given site.
NB: In PoPoolation, this variable is called eucov
.
Definition at line 222 of file population/functions/functions.hpp.
|
inline |
Inequality comparison (!=
) for two loci in a genome.
Definition at line 261 of file functions/genome_locus.hpp.
bool operator!= | ( | GenomeRegion const & | a, |
GenomeRegion const & | b | ||
) |
Inequality comparison (!=
) for two GenomeRegions.
Definition at line 53 of file genome_region.cpp.
|
inline |
Less than comparison (<
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 317 of file functions/genome_locus.hpp.
std::ostream & operator<< | ( | std::ostream & | os, |
BaseCounts const & | bs | ||
) |
Output stream operator for BaseCounts instances.
Definition at line 486 of file population/functions/functions.cpp.
|
inline |
Definition at line 64 of file functions/genome_locus.hpp.
std::ostream & operator<< | ( | std::ostream & | os, |
GenomeRegion const & | region | ||
) |
Definition at line 62 of file genome_region.cpp.
|
inline |
Less than or equal comparison (<=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 427 of file functions/genome_locus.hpp.
|
inline |
Equality comparison (==
) for two loci in a genome.
Definition at line 209 of file functions/genome_locus.hpp.
bool operator== | ( | GenomeRegion const & | a, |
GenomeRegion const & | b | ||
) |
Equality comparison (==
) for two GenomeRegions.
Definition at line 48 of file genome_region.cpp.
|
inline |
Greater than comparison (>
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 371 of file functions/genome_locus.hpp.
|
inline |
Greater than or equal comparison (>=
) for two loci in a genome.
Note that for our purposes, chromosome names are also sorted in lexicographical order, hence, two loci on different chromosomes will first compare the ordering of their chromosome names.
Definition at line 481 of file functions/genome_locus.hpp.
GenomeRegion parse_genome_region | ( | std::string const & | region, |
bool | zero_based = false , |
||
bool | end_exclusive = false |
||
) |
Parse a genomic region.
Accepted formats are "chromosome", "chromosome:position", "chromosome:start-end", and "chromosome:start..end".
By default, we expect positions (coordinates) to be 1-based and inclusive (closed interval), but this can be changed with the additional parameters zero_based
and end_exclusive
.
Definition at line 104 of file genome_region.cpp.
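The accepted formats can be illustrated with a small parser sketch. Everything here is hypothetical: the `Region` struct, the function name `parse_region_sketch`, the use of 0 as an "unset" marker for whole chromosomes, and in particular the assumption that input is normalized to 1-based inclusive coordinates. The library's actual behavior and error handling may differ.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical region type; positions are stored 1-based and inclusive.
struct Region
{
    std::string chromosome;
    std::size_t start;
    std::size_t end;
};

inline Region parse_region_sketch(
    std::string const& text, bool zero_based = false, bool end_exclusive = false
) {
    Region result{ text, 0, 0 };
    auto const colon = text.find( ':' );
    if( colon == std::string::npos ) {
        // "chromosome" only: positions stay 0 as an "unset" marker in this sketch.
        return result;
    }
    result.chromosome = text.substr( 0, colon );
    std::string const positions = text.substr( colon + 1 );

    // Look for a start/end separator, either ".." or "-".
    std::size_t sep = positions.find( ".." );
    std::size_t sep_len = 2;
    if( sep == std::string::npos ) {
        sep = positions.find( '-' );
        sep_len = 1;
    }

    if( sep == std::string::npos ) {
        // "chromosome:position": a single position.
        result.start = std::stoull( positions );
        if( zero_based ) {
            ++result.start;
        }
        result.end = result.start;
        return result;
    }

    // "chromosome:start-end" or "chromosome:start..end".
    result.start = std::stoull( positions.substr( 0, sep ));
    result.end = std::stoull( positions.substr( sep + sep_len ));
    if( zero_based ) {
        ++result.start;
        ++result.end;
    }
    if( end_exclusive ) {
        if( result.end == 0 ) {
            throw std::invalid_argument( "Exclusive end cannot be 0" );
        }
        --result.end;
    }
    if( result.start > result.end ) {
        throw std::invalid_argument( "Region start is after region end" );
    }
    return result;
}
```

Under these assumptions, a BED-style interval such as "chr1:0-100" parsed with both flags set becomes the 1-based inclusive interval from 1 to 100.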
GenomeRegionList parse_genome_regions | ( | std::string const & | regions, |
bool | zero_based = false , |
||
bool | end_exclusive = false |
||
) |
Parse a set/list of genomic regions.
The individual regions need to be separated by commas (surrounding white space is okay), and each region needs to follow the format as explained in parse_genome_region(). See there for details.
Definition at line 173 of file genome_region.cpp.
genesis::utils::Matrix<double> genesis::population::pij_matrix_ | ( | size_t | max_coverage, |
size_t | poolsize | ||
) |
Definition at line 326 of file diversity.cpp.
genesis::utils::Matrix<double> const& genesis::population::pij_matrix_resolver_ | ( | size_t | max_coverage, |
size_t | poolsize | ||
) |
Definition at line 360 of file diversity.cpp.
PoolDiversityResults genesis::population::pool_diversity_measures | ( | PoolDiversitySettings const & | settings, |
ForwardIterator | begin, | ||
ForwardIterator | end | ||
) |
Compute Theta Pi, Theta Watterson, and Tajima's D in their pool-sequencing corrected versions according to Kofler et al.
This is a high level function that is meant as a simple example of how to compute these statistics. See theta_pi_pool(), theta_watterson_pool(), and tajima_d_pool() for details. It takes care of most options offered by PoPoolation (as given by settings
here), except for the window width and stride and the minimum phred quality score, which have to be applied when filling the window (or whatever other range is used as input here), before calling this function.
Furthermore, results here are not filtered afterwards, so any filtering based on, e.g., the minimum covered fraction has to be done downstream.
Definition at line 484 of file diversity.hpp.
std::vector< double > prob_cond_true_freq | ( | size_t | n, |
std::vector< bool > const & | alleles, | ||
std::vector< unsigned char > const & | phred_scores, | ||
bool | unfolded | ||
) |
Definition at line 121 of file afs_estimate.cpp.
std::vector< double > prob_cond_true_freq_unfolded | ( | size_t | n, |
std::vector< bool > const & | alleles, | ||
std::vector< unsigned char > const & | phred_scores, | ||
bool | invert_alleles | ||
) |
Definition at line 145 of file afs_estimate.cpp.
void genesis::population::process_conditional_probability | ( | ForwardIterator | begin, |
ForwardIterator | end | ||
) |
Compute the conditional probabilities of AFs. This reimplements process_probCond from Boitard et al.
Definition at line 100 of file afs_estimate.hpp.
void genesis::population::process_pileup_correct_input_order_check_ | ( | utils::InputStream const & | it, |
std::string & | cur_chr, | ||
size_t & | cur_pos, | ||
std::string const & | new_chr, | ||
size_t | new_pos | ||
) |
Local helper function to remove code duplication for the correct input order check.
Definition at line 54 of file simple_pileup_reader.cpp.
void genesis::population::process_sync_correct_input_order_ | ( | utils::InputStream const & | it, |
std::string & | cur_chr, | ||
size_t & | cur_pos, | ||
Variant const & | new_var | ||
) |
Local helper function to remove code duplication for the correct input order check.
Definition at line 52 of file sync_reader.cpp.
void genesis::population::run_vcf_window | ( | SlidingWindowGenerator< Data, Accumulator > & | generator, |
std::string const & | vcf_file, | ||
std::function< Data(VcfRecord const &)> | conversion, | ||
std::function< bool(VcfRecord const &)> | condition = {} |
||
) |
Convenience function to iterate over a whole VCF file.
This is a convenience function that takes care of iterating a VCF file record by record (that is, line by line), using a provided conversion
function to extract the D
/Data
from the VcfRecord. It furthermore takes care of finishing all chromosomes properly, using their lengths as provided in the VCF header.
Before calling the function, of course, all necessary plugin functions have to be set in the SlidingWindowGenerator instance, so that the data is processed as intended. In particular, take care of setting SlidingWindowGenerator::emit_incomplete_windows() to the desired value.
Furthermore, the function offers a condition function that can be used to skip records that do not fulfill a given condition. That is, if condition is used, it needs to return true for records that shall be processed, and false for those that shall be skipped.
Definition at line 73 of file vcf_window.hpp.
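The interplay of conversion and condition described above can be sketched without the library. This is a minimal illustration only: the Record struct and the convert_records() function are hypothetical stand-ins, not the genesis VcfRecord or SlidingWindowGenerator, and the sketch collects the converted data in a vector instead of feeding it into windows.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a VCF record; not the genesis VcfRecord.
struct Record {
    std::string chromosome;
    std::size_t position;
    bool passes_filter;
};

// Visit all records, skip those rejected by `condition` (if one is set),
// and collect the converted data of the remaining ones.
template <class Data>
std::vector<Data> convert_records(
    std::vector<Record> const& records,
    std::function<Data(Record const&)> conversion,
    std::function<bool(Record const&)> condition = {}
) {
    std::vector<Data> result;
    for (auto const& rec : records) {
        // An empty std::function means "no condition": process every record.
        if (condition && !condition(rec)) {
            continue;
        }
        result.push_back(conversion(rec));
    }
    return result;
}
```

As in the documented function, the condition returns true for records that shall be processed; leaving it at its default processes all records.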
std::string sam_flag_to_string | ( | int | flags | ) |
Turn a set of flags for sam/bam/cram reads into their textual representation.
This is useful for user output. We here use the format of names as used by htslib and samtools, where names are upper case and words in flag names are separated by underscores. This ensures compatibility of the output with existing tools.
See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details.
Definition at line 132 of file sam_flags.cpp.
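As an illustration of the idea, the following sketch turns a flag value into samtools-style names. It is not the genesis implementation: the function name flags_to_names() is hypothetical, and the comma separator is an assumption borrowed from samtools. The bit values and names follow the SAM specification.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Render the bits of a SAM flag value as comma-separated names.
// The comma separator is an assumption for this sketch.
std::string flags_to_names(int flags) {
    static const std::vector<std::pair<int, std::string>> names = {
        {0x1, "PAIRED"}, {0x2, "PROPER_PAIR"}, {0x4, "UNMAP"}, {0x8, "MUNMAP"},
        {0x10, "REVERSE"}, {0x20, "MREVERSE"}, {0x40, "READ1"}, {0x80, "READ2"},
        {0x100, "SECONDARY"}, {0x200, "QCFAIL"}, {0x400, "DUP"},
        {0x800, "SUPPLEMENTARY"}
    };
    std::string result;
    for (auto const& entry : names) {
        // Append the name of every bit that is set, lowest bit first.
        if (flags & entry.first) {
            if (!result.empty()) {
                result += ",";
            }
            result += entry.second;
        }
    }
    return result;
}
```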
void genesis::population::SimplePileupReader::process_ancestral_base_< SimplePileupReader::Sample > | ( | utils::InputStream & | input_stream, |
SimplePileupReader::Sample & | sample | ||
) | const |
Definition at line 715 of file simple_pileup_reader.cpp.
void genesis::population::SimplePileupReader::process_quality_string_< SimplePileupReader::Sample > | ( | utils::InputStream & | input_stream, |
SimplePileupReader::Sample & | sample | ||
) | const |
Definition at line 568 of file simple_pileup_reader.cpp.
void genesis::population::SimplePileupReader::set_sample_read_bases_< SimplePileupReader::Sample > | ( | std::string const & | read_bases, |
SimplePileupReader::Sample & | sample | ||
) | const |
Definition at line 546 of file simple_pileup_reader.cpp.
void genesis::population::SimplePileupReader::set_sample_read_coverage_< SimplePileupReader::Sample > | ( | size_t | read_coverage, |
SimplePileupReader::Sample & | sample | ||
) | const |
Definition at line 524 of file simple_pileup_reader.cpp.
void genesis::population::SimplePileupReader::set_target_alternative_base_< SimplePileupReader::Record > | ( | SimplePileupReader::Record & | target | ) | const |
Definition at line 503 of file simple_pileup_reader.cpp.
std::pair< SortedBaseCounts, SortedBaseCounts > sorted_average_base_counts | ( | BaseCounts const & | sample_a, |
BaseCounts const & | sample_b | ||
) |
Return the sorted base counts of both input samples, ordered by the average frequencies of the nucleotide counts in the two samples.
Both returned counts will be in the same order, with the nucleotide first that has the highest average count in the two samples, etc.
Definition at line 221 of file population/functions/functions.cpp.
SortedBaseCounts sorted_base_counts | ( | BaseCounts const & | sample | ) |
Return the order of base counts (nucleotides), largest one first.
Definition at line 191 of file population/functions/functions.cpp.
SortedBaseCounts sorted_base_counts | ( | Variant const & | variant, |
bool | reference_first | ||
) |
Get a list of bases sorted by their counts.
If reference_first
is set to true
, the first entry in the resulting array is always the reference base of the Variant, while the other three bases are sorted by counts. If reference_first
is set to false
, all four bases are sorted by their counts.
Definition at line 288 of file population/functions/functions.cpp.
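The effect of reference_first can be sketched as follows. The names sort_bases() and BaseCount are hypothetical and only serve this example, and the behavior for a reference base other than A, C, G, or T (here: sort all four bases) is an assumption that the text above does not specify.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

using BaseCount = std::pair<char, std::size_t>;

// Sort the four bases by count, largest first. If reference_first is set,
// the reference base is moved to the front and only the rest is sorted.
std::array<BaseCount, 4> sort_bases(
    std::size_t a, std::size_t c, std::size_t g, std::size_t t,
    char reference, bool reference_first
) {
    std::array<BaseCount, 4> bases = {{ {'A', a}, {'C', c}, {'G', g}, {'T', t} }};
    auto first = bases.begin();
    if (reference_first) {
        auto ref = std::find_if(
            bases.begin(), bases.end(),
            [&](BaseCount const& b) { return b.first == reference; }
        );
        if (ref != bases.end()) {
            // Move the reference to the front, keeping the others in order.
            std::rotate(bases.begin(), ref, ref + 1);
            first = bases.begin() + 1;
        }
    }
    std::stable_sort(first, bases.end(), [](BaseCount const& l, BaseCount const& r) {
        return l.second > r.second;
    });
    return bases;
}
```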
BaseCountsStatus status | ( | BaseCounts const & | sample, |
size_t | min_coverage = 0 , |
||
size_t | max_coverage = 0 , |
||
size_t | min_count = 0 , |
||
bool | tolerate_deletions = false |
||
) |
Compute a simple status with useful properties from the counts of a BaseCounts.
min_coverage
Minimum coverage expected for a BaseCounts to be considered "covered". If the number of nucleotides (A
, C
, G
, T
) in the reads of a sample is less than the provided min_coverage
, then the BaseCounts is not considered sufficiently covered, and the BaseCountsStatus::is_covered flag will be set to false
.
max_coverage
Same as min_coverage
, but the upper bound on coverage; maximum coverage expected for a BaseCounts to be considered "covered". If the number of nucleotides exceeds this bound, the BaseCountsStatus::is_covered flag will be set to false
. If provided with a value of 0
(default), max_coverage is not used.
Only if the nucleotide count is in between (or equal to either) these two bounds (min_coverage
and max_coverage
), it is considered to be covered, and BaseCountsStatus::is_covered will be set to true
.
min_count
This value is used to determine whether a BaseCounts has too many deletions, and unless tolerate_deletions() is set to true
, the BaseCountsStatus::is_ignored will be set to true
in that case (too many deletions, as given by BaseCounts::d_count), while the values for BaseCountsStatus::is_covered, BaseCountsStatus::is_snp, and BaseCountsStatus::is_biallelic will be set to false
.
Typically, if this function is used after calling filter_min_count() on the BaseCounts, the min_count
is set to the same value for consistency.
tolerate_deletions
Set whether we tolerate BaseCounts objects with a high number of deletions.
If set to false
(default), we do not tolerate deletions. In that case, if the number of deletions in a Sample (given by Sample::d_count) is higher than or equal to min_count(), the Sample will be considered ignored (Sample::is_ignored set to true
), and considered not covered (Sample::is_covered, Sample::is_snp, and Sample::is_biallelic will all be set to false
).
If however set to true
, we tolerate high amounts of deletions, and the values for the above properties will be set as usual by considering the nucleotide counts (Sample::a_count, Sample::c_count, Sample::g_count, and Sample::t_count) instead.
Definition at line 49 of file population/functions/functions.cpp.
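The coverage and deletion rules described above can be summarized in a small sketch. It covers only the is_covered and is_ignored flags, uses hypothetical Counts and Status structs rather than BaseCounts and BaseCountsStatus, and makes two labeled assumptions: the deletion check is only applied to samples within the coverage bounds, and a sample with zero deletions is never ignored.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for BaseCounts and BaseCountsStatus.
struct Counts { std::size_t a, c, g, t, d; };
struct Status { bool is_covered = false; bool is_ignored = false; };

Status simple_status(
    Counts const& s, std::size_t min_coverage, std::size_t max_coverage,
    std::size_t min_count, bool tolerate_deletions
) {
    Status result;
    std::size_t const nucleotides = s.a + s.c + s.g + s.t;

    // A max_coverage of 0 means that no upper bound is applied.
    bool const in_bounds = nucleotides >= min_coverage
        && (max_coverage == 0 || nucleotides <= max_coverage);
    if (!in_bounds) {
        return result;
    }

    // Assumption: only samples that actually contain deletions can be ignored.
    if (!tolerate_deletions && s.d > 0 && s.d >= min_count) {
        result.is_ignored = true;
    } else {
        result.is_covered = true;
    }
    return result;
}
```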
int string_to_sam_flag | ( | std::string const & | value | ) |
Parse a string as a set of flags for sam/bam/cram reads.
The given string can either be the numeric value as specified by the SAM standard, or a list of flag names or values, which can be separated by comma, space, vertical bar, or plus sign, and where each flag name is treated case-insensitively and without regard to non-alphanumeric characters. This is a more lenient parsing than what htslib and samtools offer.
For example, it accepts:
1
0x12
PROPER_PAIR,MREVERSE
ProperPair + MateReverse
PROPER_PAIR | 0x20
See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details.
Definition at line 81 of file sam_flags.cpp.
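A lenient parser of this kind might look like the following sketch. It is not the genesis implementation: parse_flags() is a hypothetical name, and the lookup table contains only the samtools-style names, so alternative spellings such as MateReverse are rejected here even though the documented function accepts them.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <stdexcept>
#include <string>

int parse_flags(std::string const& value) {
    // Names are stored in normalized form: lower case, alphanumeric only.
    static const std::map<std::string, int> names = {
        {"paired", 0x1}, {"properpair", 0x2}, {"unmap", 0x4}, {"munmap", 0x8},
        {"reverse", 0x10}, {"mreverse", 0x20}, {"read1", 0x40}, {"read2", 0x80},
        {"secondary", 0x100}, {"qcfail", 0x200}, {"dup", 0x400},
        {"supplementary", 0x800}
    };
    int result = 0;
    std::string token;

    // Interpret the current token as a number or a name, then reset it.
    auto flush = [&]() {
        if (token.empty()) {
            return;
        }
        if (std::isdigit(static_cast<unsigned char>(token[0]))) {
            // Base 0 lets std::stoi handle decimal and 0x-prefixed values.
            result |= std::stoi(token, nullptr, 0);
        } else {
            auto it = names.find(token);
            if (it == names.end()) {
                throw std::invalid_argument("Unknown flag name: " + token);
            }
            result |= it->second;
        }
        token.clear();
    };

    for (char ch : value) {
        auto const uc = static_cast<unsigned char>(ch);
        if (ch == ',' || ch == ' ' || ch == '|' || ch == '+') {
            flush();
        } else if (std::isalnum(uc)) {
            token += static_cast<char>(std::tolower(uc));
        }
        // All other characters, such as underscores, are dropped.
    }
    flush();
    return result;
}
```

Normalizing each token before the lookup is what makes PROPER_PAIR, ProperPair, and proper-pair equivalent in this sketch.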
double genesis::population::tajima_d_pool | ( | PoolDiversitySettings const & | settings, |
ForwardIterator | begin, | ||
ForwardIterator | end | ||
) |
Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al.
Definition at line 454 of file diversity.hpp.
double genesis::population::tajima_d_pool | ( | PoolDiversitySettings const & | settings, |
ForwardIterator | begin, | ||
ForwardIterator | end, | ||
double | theta_pi, | ||
double | theta_watterson | ||
) |
Compute the pool-sequencing corrected version of Tajima's D according to Kofler et al.
Definition at line 430 of file diversity.hpp.
double tajima_d_pool_denominator | ( | PoolDiversitySettings const & | settings, |
size_t | snp_count, | ||
double | theta | ||
) |
Compute the denominator for the pool-sequencing correction of Tajima's D according to Kofler et al.
Definition at line 451 of file diversity.cpp.
double genesis::population::theta_pi | ( | ForwardIterator | begin, |
ForwardIterator | end, | ||
bool | with_bessel = true |
||
) |
Compute classic theta pi, that is, the sum of heterozygosities.
The function simply sums heterozygosity() for all samples in the given range. If with_bessel
is set, Bessel's correction for the total nucleotide count is used.
Definition at line 178 of file diversity.hpp.
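The summation can be sketched as below. The Counts struct and both function names are hypothetical, and the sketch assumes that heterozygosity is defined as one minus the sum of squared nucleotide frequencies, with Bessel's correction being the factor n / (n - 1) for a total nucleotide count n; the text above does not spell out either formula.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the nucleotide counts of one position.
struct Counts { std::size_t a, c, g, t; };

// Assumed definition: h = 1 - sum of squared frequencies,
// optionally multiplied by n / (n - 1) (Bessel's correction).
double heterozygosity_sketch(Counts const& s, bool with_bessel) {
    double const n = static_cast<double>(s.a + s.c + s.g + s.t);
    if (n == 0.0) {
        return 0.0;
    }
    double h = 1.0;
    for (std::size_t count : {s.a, s.c, s.g, s.t}) {
        double const freq = static_cast<double>(count) / n;
        h -= freq * freq;
    }
    if (with_bessel && n > 1.0) {
        h *= n / (n - 1.0);
    }
    return h;
}

// Sum the per-position heterozygosities over all given positions.
double theta_pi_sketch(std::vector<Counts> const& positions, bool with_bessel = true) {
    double sum = 0.0;
    for (auto const& s : positions) {
        sum += heterozygosity_sketch(s, with_bessel);
    }
    return sum;
}
```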
|
inline |
Compute theta pi with pool-sequencing correction according to Kofler et al, for a single BaseCounts, that is, its heterozygosity() including Bessel's correction for the total nucleotide count at each position, divided by the correction denominator.
Definition at line 222 of file diversity.hpp.
double genesis::population::theta_pi_pool | ( | PoolDiversitySettings const & | settings, |
ForwardIterator | begin, | ||
ForwardIterator | end | ||
) |
Compute theta pi with pool-sequencing correction according to Kofler et al, that is, the sum of heterozygosities divided by the correction denominator.
The function sums heterozygosity() for all samples in the given range, including Bessel's correction for the total nucleotide count at each position, and divides each by the respective denominator to correct for error from pool sequencing. See theta_pi_pool_denominator() for details.
Definition at line 199 of file diversity.hpp.
double theta_pi_pool_denominator | ( | PoolDiversitySettings const & | settings, |
size_t | nucleotide_count | ||
) |
Compute the denominator for the pool-sequencing correction of theta pi according to Kofler et al.
We here compute the denominator for a given poolsize, with a fixed min_allele_count, which is identical for each given nucleotide_count, and hence cached internally for speedup.
See
R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
Definition at line 136 of file diversity.cpp.
double genesis::population::theta_pi_within_pool | ( | ForwardIterator | begin, |
ForwardIterator | end, | ||
size_t | poolsize | ||
) |
Compute classic theta pi (within a population), that is, the sum of heterozygosities including Bessel's correction for total nucleotide sum at each position, and Bessel's correction for the pool size.
This is the same computation used for theta pi within in the FST computation of f_st_pool_unbiased(). It does not use the pool seq correction of Kofler et al.
Definition at line 240 of file diversity.hpp.
double genesis::population::theta_watterson_pool | ( | PoolDiversitySettings const & | settings, |
ForwardIterator | begin, | ||
ForwardIterator | end | ||
) |
Compute theta watterson with pool-sequencing correction according to Kofler et al.
Definition at line 272 of file diversity.hpp.
double theta_watterson_pool_denominator | ( | PoolDiversitySettings const & | settings, |
size_t | nucleotide_count | ||
) |
Compute the denominator for the pool-sequencing correction of theta watterson according to Kofler et al.
We here compute the denominator for a given poolsize, with a fixed min_allele_count, which is identical for each given nucleotide_count, and hence cached internally for speedup.
See
R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte, A. Futschik, C. Kosiol, C. Schlötterer.
PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.
(2011) PLoS ONE, 6(1), e15925. https://doi.org/10.1371/journal.pone.0015925
for details. The paper unfortunately does not explain their equations, but there is a hidden document in their code repository that illuminates the situation a bit. See https://sourceforge.net/projects/popoolation/files/correction_equations.pdf
Definition at line 186 of file diversity.cpp.
|
inline |
Definition at line 48 of file functions/genome_locus.hpp.
std::string to_string | ( | GenomeRegion const & | region | ) |
Definition at line 69 of file genome_region.cpp.
std::ostream & to_sync | ( | BaseCounts const & | bs, |
std::ostream & | os | ||
) |
Output a BaseCounts instance to a stream in the PoPoolation2 sync format.
This is one column from that file, outputting the counts separated by colons, in the order A:T:C:G:N:D
, with D
being deletions (*
in pileup).
Definition at line 43 of file sync_common.cpp.
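The column layout is easy to get wrong, because the sync order A:T:C:G:N:D differs from the alphabetical A, C, G, T order used elsewhere. A minimal sketch, with a hypothetical Counts struct and function name instead of the genesis types:

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>

// Hypothetical stand-in for BaseCounts, with N and deletion counts.
struct Counts { std::size_t a, c, g, t, n, d; };

// Write one sync column. Note that T comes before C and G in this format.
std::ostream& write_sync_column(Counts const& s, std::ostream& os) {
    os << s.a << ":" << s.t << ":" << s.c << ":" << s.g << ":" << s.n << ":" << s.d;
    return os;
}
```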
std::ostream & to_sync | ( | Variant const & | var, |
std::ostream & | os | ||
) |
Output a Variant instance to a stream in the PoPoolation2 sync format.
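The format is a tab-delimited file with one variant per line, consisting of the reference contig (chromosome), the position within it, the reference character, and then one column per population.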
Each population column outputs counts separated by colons, in the order A:T:C:G:N:D
, with D
being deletions (*
in pileup).
See https://sourceforge.net/p/popoolation2/wiki/Tutorial/ for details.
Definition at line 50 of file sync_common.cpp.
BaseCounts total_base_counts | ( | Variant const & | variant | ) |
Get the summed up total base counts of a Variant.
This is the same as calling merge() on the samples in the Variant.
Definition at line 139 of file population/functions/functions.cpp.
|
inline |
Count of the pure nucleotide bases at this position, that is, the sum of all A
, C
, G
, and T
.
See nucleotide_sum() for details. This function gives the sum over all samples in the Variant.
Definition at line 232 of file population/functions/functions.hpp.
void transform_zero_out_by_max_count | ( | BaseCounts & | sample, |
size_t | max_count | ||
) |
Transform a BaseCounts sample
by setting any nucleotide count (A
, C
, G
, T
) to zero if max_count
is exceeded for that nucleotide.
This transformation is used as a type of quality control. All nucleotide counts (that is, BaseCounts::a_count, BaseCounts::c_count, BaseCounts::g_count, and BaseCounts::t_count) that are above the given max_count
are set to zero.
Definition at line 101 of file filter_transform.cpp.
void transform_zero_out_by_max_count | ( | Variant & | variant, |
size_t | max_count | ||
) |
Transform a variant
by setting any nucleotide count (A
, C
, G
, T
) of its samples to zero if max_count
is exceeded for that nucleotide.
Definition at line 122 of file filter_transform.cpp.
void transform_zero_out_by_min_count | ( | BaseCounts & | sample, |
size_t | min_count | ||
) |
Transform a BaseCounts sample
by setting any nucleotide count (A
, C
, G
, T
) to zero if min_count
is not reached for that nucleotide.
This transformation is used as a type of quality control. All nucleotide counts (that is, BaseCounts::a_count, BaseCounts::c_count, BaseCounts::g_count, and BaseCounts::t_count) that are below the given min_count
are set to zero.
Definition at line 77 of file filter_transform.cpp.
void transform_zero_out_by_min_count | ( | Variant & | variant, |
size_t | min_count | ||
) |
Transform a variant
by setting any nucleotide count (A
, C
, G
, T
) of its samples to zero if min_count
is not reached for that nucleotide.
Definition at line 94 of file filter_transform.cpp.
void transform_zero_out_by_min_max_count | ( | BaseCounts & | sample, |
size_t | min_count, | ||
size_t | max_count | ||
) |
Transform a BaseCounts sample
by setting any nucleotide count (A
, C
, G
, T
) to zero if min_count
is not reached or if max_count
is exceeded for that nucleotide.
This is the same as running transform_zero_out_by_min_count() and transform_zero_out_by_max_count() individually.
Definition at line 129 of file filter_transform.cpp.
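The combined transformation amounts to a bounds check per nucleotide count. The sketch below uses a hypothetical Counts struct and function name, and takes both bounds literally; any special meaning of a bound of 0 in the library is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical stand-in for the nucleotide counts of a BaseCounts sample.
struct Counts { std::size_t a, c, g, t; };

// Set every count below min_count or above max_count to zero.
void zero_out_by_min_max(Counts& s, std::size_t min_count, std::size_t max_count) {
    for (std::size_t* count : {&s.a, &s.c, &s.g, &s.t}) {
        if (*count < min_count || *count > max_count) {
            *count = 0;
        }
    }
}
```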
void transform_zero_out_by_min_max_count | ( | Variant & | variant, |
size_t | min_count, | ||
size_t | max_count | ||
) |
Transform a variant
by setting any nucleotide count (A
, C
, G
, T
) of its samples to zero if min_count
is not reached or if max_count
is exceeded for that nucleotide.
Definition at line 147 of file filter_transform.cpp.
std::string vcf_genotype_string | ( | std::vector< VcfGenotype > const & | genotypes | ) |
Return the VCF-like string representation of a set of VcfGenotype entries.
The VcfFormatIterator::get_values() function returns all genotype entries for a given sample of a record/line. Here, we return a string representation similar to VCF of these genotypes, for example 0|0
or ./1
.
Definition at line 560 of file vcf_common.cpp.
size_t vcf_genotype_sum | ( | std::vector< VcfGenotype > const & | genotypes | ) |
Return the sum of genotypes for a set of VcfGenotype entries, typically used to construct a genotype matrix with entries 0,1,2.
The function takes the given genotypes
, encodes the reference as 0 and any alternative as 1, and then sums this over the values. For diploid organisms, this yields possible results in the range of 0 (homozygote for the reference), 1 (heterozygote), or 2 (homozygote for the alternative), which is typically used in genotype matrices.
Definition at line 574 of file vcf_common.cpp.
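The encoding can be sketched on plain allele indices. Here, genotype_sum() is a hypothetical name, the alleles are given as integers instead of VcfGenotype entries, and treating negative (missing) indices as 0 is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Allele indices: 0 is the reference, values above 0 are alternatives.
// Assumption: negative values stand for missing calls and count as 0.
std::size_t genotype_sum(std::vector<int> const& allele_indices) {
    std::size_t sum = 0;
    for (int allele : allele_indices) {
        if (allele > 0) {
            ++sum;  // any alternative allele is encoded as 1
        }
    }
    return sum;
}
```

For a diploid call, 0|0 thus yields 0, 0|1 yields 1, and 1|2 yields 2.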
std::string vcf_hl_type_to_string | ( | int | hl_type | ) |
Internal helper function to convert htslib-internal BCF_HL_* header line type values to their string representation as used in the VCF header ("FILTER", "INFO", "FORMAT", etc).
Definition at line 205 of file vcf_common.cpp.
std::string vcf_value_special_to_string | ( | int | vl_type_num | ) |
Definition at line 177 of file vcf_common.cpp.
std::string vcf_value_special_to_string | ( | VcfValueSpecial | vl_type_num | ) |
Definition at line 172 of file vcf_common.cpp.
std::string vcf_value_type_to_string | ( | int | ht_type | ) |
Definition at line 147 of file vcf_common.cpp.
std::string vcf_value_type_to_string | ( | VcfValueType | ht_type | ) |
Definition at line 142 of file vcf_common.cpp.
Iterate Variants, using a variety of input file formats.
This generic iterator is an abstraction that is agnostic to the underlying file format, and can be used with anything that can be converted to a Variant per genome position. It offers to iterate a whole input file, and transform and filter the Variant as needed in order to make downstream processing as easy as possible.
This is useful for downstream processing, where we just want to work with the Variants along the genome, but want to allow different file formats for their input. Use this iterator to achieve this. For example, use the make_variant_input_iterator_...()
functions to get such an iterator for different input file types.
The iterator furthermore offers a data field of type VariantInputIteratorData, which gets filled with basic data about the input file and sample names (if available in the file format). Use the data() function to access this data while iterating.
Definition at line 124 of file variant_input_iterator.hpp.
Definition at line 48 of file variant_window_iterator.hpp.
using VcfFormatIteratorFloat = VcfFormatIterator<float, double> |
Definition at line 67 of file vcf_format_iterator.hpp.
using VcfFormatIteratorGenotype = VcfFormatIterator<int32_t, VcfGenotype> |
Definition at line 68 of file vcf_format_iterator.hpp.
using VcfFormatIteratorInt = VcfFormatIterator<int32_t, int32_t> |
Definition at line 66 of file vcf_format_iterator.hpp.
using VcfFormatIteratorString = VcfFormatIterator<char*, std::string> |
Definition at line 65 of file vcf_format_iterator.hpp.
|
strong |
Select how Variant filter functions that evaluate properties of the Variant::samples (BaseCounts) objects behave when the filter is not true
or false
for all samples.
Enumerator | |
---|---|
kConjunction | The filter returns |
kDisjunction | The filter returns |
kMerge | The filter is applied to the merged BaseCounts of all samples in the Variant. In this special case, only one BaseCounts object is subjected to the filter function, and hence no logical combination of the outcome is needed. |
Definition at line 58 of file filter_transform.hpp.
|
strong |
SlidingWindowType of a Window, that is, whether we slide along a fixed size interval of the genome, along a fixed number of variants, or whether a window represents a whole chromosome.
Enumerator | |
---|---|
kInterval | Windows of this type are defined by a fixed start and end position on a chromosome. The amount of data contained in between these two loci can differ, depending on the number of variant positions found in the underlying data iterator. |
kVariants | Windows of this type are defined as containing a fixed number of entries (usually Variants or similar data), and hence can span window widths of differing sizes. |
kChromosome | Windows of this type contain positions across a whole chromosome. The window contains all data from a whole chromosome. Moving to the next window then is equivalent to moving to the next chromosome. Note that this might need a lot of memory to keep all the data at once. |
Definition at line 55 of file sliding_window_generator.hpp.
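The difference between the interval and variants types can be illustrated by grouping a sorted list of positions in both ways. Both functions below are hypothetical helpers for this example and are unrelated to the SlidingWindowGenerator API; they assume 1-based positions on a single chromosome and non-overlapping windows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Positions = std::vector<std::size_t>;

// Interval style: window k covers positions k*width+1 to (k+1)*width,
// so the number of entries per window can differ (and can be zero).
std::vector<Positions> interval_windows(Positions const& sorted_positions, std::size_t width) {
    assert(width > 0);
    std::vector<Positions> windows;
    for (auto pos : sorted_positions) {
        assert(pos > 0);
        std::size_t const index = (pos - 1) / width;
        if (windows.size() <= index) {
            windows.resize(index + 1);
        }
        windows[index].push_back(pos);
    }
    return windows;
}

// Variants style: each window holds `count` entries (the last one may hold
// fewer), so the genomic width spanned by a window can differ.
std::vector<Positions> variant_windows(Positions const& sorted_positions, std::size_t count) {
    assert(count > 0);
    std::vector<Positions> windows;
    for (std::size_t i = 0; i < sorted_positions.size(); ++i) {
        if (i % count == 0) {
            windows.emplace_back();
        }
        windows.back().push_back(sorted_positions[i]);
    }
    return windows;
}
```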
|
strong |
Specification for the values determining header line types of VCF/BCF files.
This list contains the types of header lines that htslib uses for identification, as specified in the VCF header. Corresponds to the BCF_HL_*
macro constants defined by htslib. We statically assert that these have the same values.
Enumerator | |
---|---|
kFilter | |
kInfo | |
kFormat | |
kContig | |
kStructured | |
kGeneric |
Definition at line 70 of file vcf_common.hpp.
|
strong |
Specification for special markers for the number of values expected for key-value-pairs of VCF/BCF files.
This list contains the special markers for the number of values of the INFO
and FORMAT
key-value pairs, as specified in the VCF header, and used in the record lines. Corresponds to the BCF_VL_*
macro constants defined by htslib. We statically assert that these have the same values.
Enumerator | |
---|---|
kFixed | Fixed number of values expected. In VCF, this is denoted simply by an integer number. This simply specifies that there is a fixed number of values to be expected; we do not further define how many exactly are expected here (the integer value). This is taken care of in a separate variable that is provided whenever a fixed-size value is needed, see for example VcfSpecification. |
kVariable | Variable number of possible values, or unknown, or unbounded. In VCF, this is denoted by '.'. |
kAllele | One value per alternate allele. In VCF, this is denoted as 'A'. |
kGenotype | One value for each possible genotype (more relevant to the FORMAT tags). In VCF, this is denoted as 'G'. |
kReference | One value for each possible allele (including the reference). In VCF, this is denoted as 'R'. |
Definition at line 105 of file vcf_common.hpp.
|
strong |
Specification for the data type of the values expected in key-value-pairs of VCF/BCF files.
This list contains the types of data in values of the INFO
and FORMAT
key-value pairs, as specified in the VCF header, and used in the record lines. Corresponds to the BCF_HT_*
macro constants defined by htslib. We statically assert that these have the same values.
Enumerator | |
---|---|
kFlag | |
kInteger | |
kFloat | |
kString |
Definition at line 88 of file vcf_common.hpp.
|
strong |
Position in the genome that is used for reporting when emitting or using a window.
See anchor_position() for details.
Enumerator | |
---|---|
kIntervalBegin | |
kIntervalEnd | |
kIntervalMidpoint | |
kVariantFirst | |
kVariantLast | |
kVariantMedian | |
kVariantMean | |
kVariantMidpoint |
Definition at line 52 of file population/window/functions.hpp.
|
static |
Map from sam flags to their numerical value, for different types of naming of the flags.
Definition at line 58 of file sam_flags.cpp.