A library for working with phylogenetic and population genetic data.
v0.32.0
GenomeRegionList Class Reference

#include <genesis/population/genome_region_list.hpp>

Detailed Description

List of regions in a genome, for each chromosome.

The data structure stores a list of genome regions, such as coming from BED or GFF files. The class allows to iterate through the regions of each chromosome.

It furtheremore allows querying positions, that is, whether a certain position on a chromosome is part of one of the stored regions. However, when many regions are stored in the list, and many queries need to be performed (such as when reading files and needing to check every position on every chromosome), this can become prohibitively slow, despite using a fast data structure. Use GenomeLocusSet as an alternative that is way faster when all that is needed is information on whether a certain position/coordinate is set or not set.

Positions in the interval of each region are 1-based and inclusive, that is, we used closed intervals. We also offer the special case to add a whole chromosome as a region, in which case the is_covered() function will return true for all positions on that chromosome (without checking that the position is in fact part of the chromosome - as we do not use information on the lengths of chromosomes in this class). We use start and end positions equal to 0 to mark these special whole-chromosome cases - be aware of that when adding regions to the list. See also GenomeLocus, GenomeLocusSet, and GenomeRegion for related classes that have the same special cases.

Interally, we use an IntervalTree to represent the regions of each chromosome, stored in a map from chromosome name to IntervalTree. This is so that access and querying of contained positions is as fast as possible, and so that we do not store the chromosome name string with every region.

See also
GenomeLocus
GenomeLocusSet
GenomeRegion

Definition at line 95 of file genome_region_list.hpp.

Public Member Functions

 GenomeRegionList ()=default
 
 GenomeRegionList (GenomeRegionList &&)=default
 
 GenomeRegionList (GenomeRegionList const &)=default
 
 ~GenomeRegionList ()=default
 
void add (GenomeLocus const &locus, bool overlap=false)
 Add a single GenomeLocus, that is, an interval covering one position on a chromosome. More...
 
void add (GenomeLocus const &start, GenomeLocus const &end, bool overlap=false)
 Add an interval between two GenomeLoci on the same chromosome. More...
 
void add (GenomeRegion const &region, bool overlap=false)
 Add a GenomeRegion to the list. More...
 
void add (GenomeRegionList const &other, bool overlap=false)
 Add a complete GenomeRegionList to this list. More...
 
void add (std::string const &chromosome)
 Add a whole chromosome to the list, so that all its positions are considered to be covered. More...
 
void add (std::string const &chromosome, numerical_type start, numerical_type end, bool overlap=false)
 Add a region to the list, given its chromosome, and start and end positions. More...
 
size_t chromosome_count () const
 Return the number of chromosomes for which there are regions stored. More...
 
std::map< std::string, tree_type > & chromosome_map ()
 Access the underlying container directly. More...
 
std::map< std::string, tree_type > const & chromosome_map () const
 Access the underlying container directly. More...
 
std::vector< std::string > chromosome_names () const
 Get a list of all stored chromosome names. More...
 
tree_typechromosome_regions (std::string const &chromosome)
 For a given chromosome, return the IntervalTree that stores its regions. More...
 
tree_type const & chromosome_regions (std::string const &chromosome) const
 For a given chromosome, return the IntervalTree that stores its regions. More...
 
void clear ()
 Remove all stored regions from all chromosomes. More...
 
void clear (std::string const &chromosome)
 Remove the regions of the specified chromosome. More...
 
size_t cover_count (std::string const &chromosome, numerical_type position, bool whole_chromosome=false) const
 Retun the number of regions (intervals) that overlap with a given position on a chromosome. More...
 
bool empty () const
 Return whether there are chromosomes with regions stored. More...
 
bool has_chromosome (std::string const &chromosome) const
 Return whether a chromosome is stored. More...
 
bool is_covered (std::string const &chromosome) const
 Return whether a whole chromosome is covered. More...
 
bool is_covered (std::string const &chromosome, numerical_type position) const
 Return whether a given position on a chromosome is part of any of the regions (intervals) that are stored here. More...
 
GenomeRegionListoperator= (GenomeRegionList &&)=default
 
GenomeRegionListoperator= (GenomeRegionList const &)=default
 
size_t region_count (std::string const &chromosome) const
 Return the number of regions stored for the specified chromosome. More...
 
size_t total_region_count () const
 Return the number of regions stored in total, across all chromosomes. More...
 

Public Types

using const_iterator = typename tree_type::const_iterator
 
using data_type = EmptyGenomeData
 
using iterator = typename tree_type::iterator
 
using numerical_type = size_t
 
using self_type = GenomeRegionList
 
using tree_type = genesis::utils::IntervalTree< data_type, numerical_type, genesis::utils::IntervalClosed >
 

Constructor & Destructor Documentation

◆ GenomeRegionList() [1/3]

GenomeRegionList ( )
default

◆ ~GenomeRegionList()

~GenomeRegionList ( )
default

◆ GenomeRegionList() [2/3]

GenomeRegionList ( GenomeRegionList const &  )
default

◆ GenomeRegionList() [3/3]

Member Function Documentation

◆ add() [1/6]

void add ( GenomeLocus const &  locus,
bool  overlap = false 
)
inline

Add a single GenomeLocus, that is, an interval covering one position on a chromosome.

If overlap is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.

Definition at line 209 of file genome_region_list.hpp.

◆ add() [2/6]

void add ( GenomeLocus const &  start,
GenomeLocus const &  end,
bool  overlap = false 
)
inline

Add an interval between two GenomeLoci on the same chromosome.

If overlap is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.

Definition at line 219 of file genome_region_list.hpp.

◆ add() [3/6]

void add ( GenomeRegion const &  region,
bool  overlap = false 
)
inline

Add a GenomeRegion to the list.

This function ensures that regions are valid (start < end).

If overlap is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.

Definition at line 242 of file genome_region_list.hpp.

◆ add() [4/6]

void add ( GenomeRegionList const &  other,
bool  overlap = false 
)
inline

Add a complete GenomeRegionList to this list.

This function copies all entries of the list.

If overlap is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.

Definition at line 258 of file genome_region_list.hpp.

◆ add() [5/6]

void add ( std::string const &  chromosome)
inline

Add a whole chromosome to the list, so that all its positions are considered to be covered.

Definition at line 139 of file genome_region_list.hpp.

◆ add() [6/6]

void add ( std::string const &  chromosome,
numerical_type  start,
numerical_type  end,
bool  overlap = false 
)
inline

Add a region to the list, given its chromosome, and start and end positions.

The chromosome cannot be empty, and we expect start < end. Both start and end are 1-based, and inclusive, that is, the interval between them is closed. The special case of both start and end equal to 0 means that the whole chromosome is added as an interval.

If overlap is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.

Definition at line 154 of file genome_region_list.hpp.

◆ chromosome_count()

size_t chromosome_count ( ) const
inline

Return the number of chromosomes for which there are regions stored.

Definition at line 412 of file genome_region_list.hpp.

◆ chromosome_map() [1/2]

std::map<std::string, tree_type>& chromosome_map ( )
inline

Access the underlying container directly.

Expose the map from chromosome names to the IntervalTree that stores the regions of each chromosome. This is okay to expose, as this class is merely a thin convenience wrapper around it anyway. If the class ever changes to be more than that, we might remove access to this.

See also
chromosome_regions( std::string const& )

Definition at line 504 of file genome_region_list.hpp.

◆ chromosome_map() [2/2]

std::map<std::string, tree_type> const& chromosome_map ( ) const
inline

Access the underlying container directly.

Expose the map from chromosome names to the IntervalTree that stores the regions of each chromosome. This is okay to expose, as this class is merely a thin convenience wrapper around it anyway. If the class ever changes to be more than that, we might remove access to this.

See also
chromosome_regions( std::string const& )

Definition at line 496 of file genome_region_list.hpp.

◆ chromosome_names()

std::vector<std::string> chromosome_names ( ) const
inline

Get a list of all stored chromosome names.

Definition at line 420 of file genome_region_list.hpp.

◆ chromosome_regions() [1/2]

tree_type& chromosome_regions ( std::string const &  chromosome)
inline

For a given chromosome, return the IntervalTree that stores its regions.

Note that this exposes the underlying container, and hence has to be used with caution. In particular position 0 is considered special in this GenomeRegionList class: any chromosome for which we have stored an interval that covers 0 is considered to be fully covered for all its positions.

Definition at line 455 of file genome_region_list.hpp.

◆ chromosome_regions() [2/2]

tree_type const& chromosome_regions ( std::string const &  chromosome) const
inline

For a given chromosome, return the IntervalTree that stores its regions.

Definition at line 441 of file genome_region_list.hpp.

◆ clear() [1/2]

void clear ( )
inline

Remove all stored regions from all chromosomes.

Definition at line 274 of file genome_region_list.hpp.

◆ clear() [2/2]

void clear ( std::string const &  chromosome)
inline

Remove the regions of the specified chromosome.

Definition at line 282 of file genome_region_list.hpp.

◆ cover_count()

size_t cover_count ( std::string const &  chromosome,
numerical_type  position,
bool  whole_chromosome = false 
) const
inline

Retun the number of regions (intervals) that overlap with a given position on a chromosome.

The is_covered() function only returns whether a position is covered at all, but does not tell by how many regions/intervals it is covered. This function returns that value.

Note that the special "whole chromosome" case (marked by setting an interval that covers position 0) is not considered here by default, as this is likely to be not intended. With whole_chromosome, this behaviour can be changed, meaning that if position 0 is also covered (and hence the whole chromosome is marked as covered), the resulting counter is incremented by 1.

Definition at line 347 of file genome_region_list.hpp.

◆ empty()

bool empty ( ) const
inline

Return whether there are chromosomes with regions stored.

Definition at line 404 of file genome_region_list.hpp.

◆ has_chromosome()

bool has_chromosome ( std::string const &  chromosome) const
inline

Return whether a chromosome is stored.

Definition at line 432 of file genome_region_list.hpp.

◆ is_covered() [1/2]

bool is_covered ( std::string const &  chromosome) const
inline

Return whether a whole chromosome is covered.

That special case is specified by an interval that covers position 0.

Definition at line 324 of file genome_region_list.hpp.

◆ is_covered() [2/2]

bool is_covered ( std::string const &  chromosome,
numerical_type  position 
) const
inline

Return whether a given position on a chromosome is part of any of the regions (intervals) that are stored here.

Definition at line 300 of file genome_region_list.hpp.

◆ operator=() [1/2]

GenomeRegionList& operator= ( GenomeRegionList &&  )
default

◆ operator=() [2/2]

GenomeRegionList& operator= ( GenomeRegionList const &  )
default

◆ region_count()

size_t region_count ( std::string const &  chromosome) const
inline

Return the number of regions stored for the specified chromosome.

Definition at line 463 of file genome_region_list.hpp.

◆ total_region_count()

size_t total_region_count ( ) const
inline

Return the number of regions stored in total, across all chromosomes.

Definition at line 476 of file genome_region_list.hpp.

Member Typedef Documentation

◆ const_iterator

Definition at line 112 of file genome_region_list.hpp.

◆ data_type

Definition at line 104 of file genome_region_list.hpp.

◆ iterator

using iterator = typename tree_type::iterator

Definition at line 111 of file genome_region_list.hpp.

◆ numerical_type

using numerical_type = size_t

Definition at line 105 of file genome_region_list.hpp.

◆ self_type

Definition at line 109 of file genome_region_list.hpp.

◆ tree_type


The documentation for this class was generated from the following file: