#include <genesis/population/genome_region_list.hpp>
List of regions in a genome, for each chromosome.
The data structure stores a list of genome regions, such as coming from BED or GFF files. The class allows to iterate through the regions of each chromosome.
It furtheremore allows querying positions, that is, whether a certain position on a chromosome is part of one of the stored regions. However, when many regions are stored in the list, and many queries need to be performed (such as when reading files and needing to check every position on every chromosome), this can become prohibitively slow, despite using a fast data structure. Use GenomeLocusSet as an alternative that is way faster when all that is needed is information on whether a certain position/coordinate is set or not set.
Positions in the interval of each region are 1-based and inclusive, that is, we used closed intervals. We also offer the special case to add a whole chromosome as a region, in which case the is_covered() function will return true
for all positions on that chromosome (without checking that the position is in fact part of the chromosome - as we do not use information on the lengths of chromosomes in this class). We use start and end positions equal to 0 to mark these special whole-chromosome cases - be aware of that when adding regions to the list. See also GenomeLocus, GenomeLocusSet, and GenomeRegion for related classes that have the same special cases.
Interally, we use an IntervalTree to represent the regions of each chromosome, stored in a map from chromosome name to IntervalTree. This is so that access and querying of contained positions is as fast as possible, and so that we do not store the chromosome name string with every region.
Definition at line 95 of file genome_region_list.hpp.
Public Member Functions | |
GenomeRegionList ()=default | |
GenomeRegionList (GenomeRegionList &&)=default | |
GenomeRegionList (GenomeRegionList const &)=default | |
~GenomeRegionList ()=default | |
void | add (GenomeLocus const &locus, bool overlap=false) |
Add a single GenomeLocus, that is, an interval covering one position on a chromosome. More... | |
void | add (GenomeLocus const &start, GenomeLocus const &end, bool overlap=false) |
Add an interval between two GenomeLoci on the same chromosome. More... | |
void | add (GenomeRegion const ®ion, bool overlap=false) |
Add a GenomeRegion to the list. More... | |
void | add (GenomeRegionList const &other, bool overlap=false) |
Add a complete GenomeRegionList to this list. More... | |
void | add (std::string const &chromosome) |
Add a whole chromosome to the list, so that all its positions are considered to be covered. More... | |
void | add (std::string const &chromosome, numerical_type start, numerical_type end, bool overlap=false) |
Add a region to the list, given its chromosome, and start and end positions. More... | |
size_t | chromosome_count () const |
Return the number of chromosomes for which there are regions stored. More... | |
std::map< std::string, tree_type > & | chromosome_map () |
Access the underlying container directly. More... | |
std::map< std::string, tree_type > const & | chromosome_map () const |
Access the underlying container directly. More... | |
std::vector< std::string > | chromosome_names () const |
Get a list of all stored chromosome names. More... | |
tree_type & | chromosome_regions (std::string const &chromosome) |
For a given chromosome, return the IntervalTree that stores its regions. More... | |
tree_type const & | chromosome_regions (std::string const &chromosome) const |
For a given chromosome, return the IntervalTree that stores its regions. More... | |
void | clear () |
Remove all stored regions from all chromosomes. More... | |
void | clear (std::string const &chromosome) |
Remove the regions of the specified chromosome . More... | |
size_t | cover_count (std::string const &chromosome, numerical_type position, bool whole_chromosome=false) const |
Retun the number of regions (intervals) that overlap with a given position on a chromosome. More... | |
bool | empty () const |
Return whether there are chromosomes with regions stored. More... | |
bool | has_chromosome (std::string const &chromosome) const |
Return whether a chromosome is stored. More... | |
bool | is_covered (std::string const &chromosome) const |
Return whether a whole chromosome is covered. More... | |
bool | is_covered (std::string const &chromosome, numerical_type position) const |
Return whether a given position on a chromosome is part of any of the regions (intervals) that are stored here. More... | |
GenomeRegionList & | operator= (GenomeRegionList &&)=default |
GenomeRegionList & | operator= (GenomeRegionList const &)=default |
size_t | region_count (std::string const &chromosome) const |
Return the number of regions stored for the specified chromosome . More... | |
size_t | total_region_count () const |
Return the number of regions stored in total, across all chromosomes. More... | |
Public Types | |
using | const_iterator = typename tree_type::const_iterator |
using | data_type = EmptyGenomeData |
using | iterator = typename tree_type::iterator |
using | numerical_type = size_t |
using | self_type = GenomeRegionList |
using | tree_type = genesis::utils::IntervalTree< data_type, numerical_type, genesis::utils::IntervalClosed > |
|
default |
|
default |
|
default |
|
default |
|
inline |
Add a single GenomeLocus, that is, an interval covering one position on a chromosome.
If overlap
is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.
Definition at line 209 of file genome_region_list.hpp.
|
inline |
Add an interval between two GenomeLoci on the same chromosome.
If overlap
is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.
Definition at line 219 of file genome_region_list.hpp.
|
inline |
Add a GenomeRegion to the list.
This function ensures that regions are valid (start < end
).
If overlap
is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.
Definition at line 242 of file genome_region_list.hpp.
|
inline |
Add a complete GenomeRegionList to this list.
This function copies all entries of the list
.
If overlap
is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.
Definition at line 258 of file genome_region_list.hpp.
|
inline |
Add a whole chromosome to the list, so that all its positions are considered to be covered.
Definition at line 139 of file genome_region_list.hpp.
|
inline |
Add a region to the list, given its chromosome, and start and end positions.
The chromosome
cannot be empty, and we expect start
< end
. Both start
and end
are 1-based, and inclusive, that is, the interval between them is closed. The special case of both start
and end
equal to 0 means that the whole chromosome is added as an interval.
If overlap
is set, we first check if there is a region already in the list that overlaps with the one that is to be added. If so, the new region is merged with existing one, instead of inserting it. This is more useful of the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes.
Definition at line 154 of file genome_region_list.hpp.
|
inline |
Return the number of chromosomes for which there are regions stored.
Definition at line 412 of file genome_region_list.hpp.
|
inline |
Access the underlying container directly.
Expose the map from chromosome names to the IntervalTree that stores the regions of each chromosome. This is okay to expose, as this class is merely a thin convenience wrapper around it anyway. If the class ever changes to be more than that, we might remove access to this.
Definition at line 504 of file genome_region_list.hpp.
|
inline |
Access the underlying container directly.
Expose the map from chromosome names to the IntervalTree that stores the regions of each chromosome. This is okay to expose, as this class is merely a thin convenience wrapper around it anyway. If the class ever changes to be more than that, we might remove access to this.
Definition at line 496 of file genome_region_list.hpp.
|
inline |
Get a list of all stored chromosome names.
Definition at line 420 of file genome_region_list.hpp.
|
inline |
For a given chromosome, return the IntervalTree that stores its regions.
Note that this exposes the underlying container, and hence has to be used with caution. In particular position 0 is considered special in this GenomeRegionList class: any chromosome for which we have stored an interval that covers 0 is considered to be fully covered for all its positions.
Definition at line 455 of file genome_region_list.hpp.
|
inline |
For a given chromosome, return the IntervalTree that stores its regions.
Definition at line 441 of file genome_region_list.hpp.
|
inline |
Remove all stored regions from all chromosomes.
Definition at line 274 of file genome_region_list.hpp.
|
inline |
Remove the regions of the specified chromosome
.
Definition at line 282 of file genome_region_list.hpp.
|
inline |
Retun the number of regions (intervals) that overlap with a given position on a chromosome.
The is_covered() function only returns whether a position is covered at all, but does not tell by how many regions/intervals it is covered. This function returns that value.
Note that the special "whole chromosome" case (marked by setting an interval that covers position 0) is not considered here by default, as this is likely to be not intended. With whole_chromosome
, this behaviour can be changed, meaning that if position 0 is also covered (and hence the whole chromosome is marked as covered), the resulting counter is incremented by 1.
Definition at line 347 of file genome_region_list.hpp.
|
inline |
Return whether there are chromosomes with regions stored.
Definition at line 404 of file genome_region_list.hpp.
|
inline |
Return whether a chromosome is stored.
Definition at line 432 of file genome_region_list.hpp.
|
inline |
Return whether a whole chromosome is covered.
That special case is specified by an interval that covers position 0.
Definition at line 324 of file genome_region_list.hpp.
|
inline |
Return whether a given position on a chromosome is part of any of the regions (intervals) that are stored here.
Definition at line 300 of file genome_region_list.hpp.
|
default |
|
default |
|
inline |
Return the number of regions stored for the specified chromosome
.
Definition at line 463 of file genome_region_list.hpp.
|
inline |
Return the number of regions stored in total, across all chromosomes.
Definition at line 476 of file genome_region_list.hpp.
using const_iterator = typename tree_type::const_iterator |
Definition at line 112 of file genome_region_list.hpp.
using data_type = EmptyGenomeData |
Definition at line 104 of file genome_region_list.hpp.
using iterator = typename tree_type::iterator |
Definition at line 111 of file genome_region_list.hpp.
using numerical_type = size_t |
Definition at line 105 of file genome_region_list.hpp.
using self_type = GenomeRegionList |
Definition at line 109 of file genome_region_list.hpp.
using tree_type = genesis::utils::IntervalTree< data_type, numerical_type, genesis::utils::IntervalClosed > |
Definition at line 108 of file genome_region_list.hpp.