#include <genesis/population/format/bed_reader.hpp>
Reader for BED (Browser Extensible Data) files.
We follow the definition by https://en.wikipedia.org/wiki/BED_(file_format), which itself is based on the UCSC Genome Browser definition of the BED format:
Column number | Title | Definition |
---|---|---|
1 | chrom | Chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671) name |
2 | chromStart | Start coordinate on the chromosome or scaffold for the sequence considered (the first base on the chromosome is numbered 0 in the file format; this reader, however, stores 1-based coordinates) |
3 | chromEnd | End coordinate on the chromosome or scaffold for the sequence considered. This position is non-inclusive, unlike chromStart. |
4 | name | Name of the line in the BED file |
5 | score | Score between 0 and 1000 |
6 | strand | DNA strand orientation (positive ["+"] or negative ["-"] or "." if no strand) |
7 | thickStart | Starting coordinate from which the annotation is displayed in a thicker way on a graphical representation (e.g.: the start codon of a gene) |
8 | thickEnd | End coordinates from which the annotation is no longer displayed in a thicker way on a graphical representation (e.g.: the stop codon of a gene) |
9 | itemRgb | RGB value in the form R,G,B (e.g. 255,0,0) determining the display color of the annotation contained in the BED file |
10 | blockCount | Number of blocks (e.g. exons) on the line of the BED file |
11 | blockSizes | List of values separated by commas corresponding to the size of the blocks (the number of values must correspond to that of the "blockCount") |
12 | blockStarts | List of values separated by commas corresponding to the starting coordinates of the blocks, coordinates calculated relative to those present in the chromStart column (the number of values must correspond to that of the "blockCount") |
The reader can parse each line or the whole file into Feature structs that contain the above columns (as far as they are present in the file), or read into a GenomeRegionList instead, in which case only the genome coordinates (chromosome, start, and end positions) are used. The input needs to have a consistent number of columns, of which only the first three are mandatory. Columns must appear in the above order, and if a later column (further towards the end of the line) is needed, all preceding ones need to be filled as well. Any additional columns after these 12 are also read by our parser, but simply ignored.
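The column rules above can be illustrated with a small standalone sketch. It does not use the BedReader itself; the helper names split_bed_columns and checked_column_count are hypothetical, and the sketch assumes that columns are separated by whitespace and that no field contains spaces.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split one BED line into its whitespace-separated columns.
std::vector<std::string> split_bed_columns( std::string const& line )
{
    std::vector<std::string> columns;
    std::istringstream stream( line );
    std::string field;
    while( stream >> field ) {
        columns.push_back( field );
    }
    return columns;
}

// Hypothetical helper: verify that all lines have the same number of columns,
// and that at least the three mandatory ones are present.
// Returns the common column count (0 for empty input), and throws otherwise.
std::size_t checked_column_count( std::vector<std::string> const& lines )
{
    std::size_t expected = 0;
    for( auto const& line : lines ) {
        auto const count = split_bed_columns( line ).size();
        if( count < 3 ) {
            throw std::runtime_error( "BED line has fewer than three columns" );
        }
        if( expected == 0 ) {
            expected = count;
        } else if( count != expected ) {
            throw std::runtime_error( "BED lines differ in their number of columns" );
        }
    }
    return expected;
}
```

In this sketch, a file that mixes three-column and four-column lines is rejected, which mirrors the requirement of a consistent number of columns.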
Note that the BED format internally uses 0-based half-open intervals. That is, the start and end coordinates chromStart = 0 and chromEnd = 100 define a region starting at the first base, with a length of 100. We here however use 1-based closed intervals, and hence store the same region as 1 and 100, both in the Feature struct and in the GenomeRegionList.
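As a minimal sketch of this coordinate conversion, independent of the reader's actual code: the struct ClosedInterval and both functions below are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>

// Hypothetical type: a 1-based interval in which both ends are inclusive.
struct ClosedInterval
{
    std::size_t start;
    std::size_t end;
};

// Convert a 0-based half-open BED interval [chrom_start, chrom_end)
// into the equivalent 1-based closed interval [start, end].
ClosedInterval bed_to_one_based_closed( std::size_t chrom_start, std::size_t chrom_end )
{
    // The start shifts by one. The end keeps its value, because the exclusive
    // 0-based end and the inclusive 1-based end denote the same number.
    return ClosedInterval{ chrom_start + 1, chrom_end };
}

// Length of a closed interval: both end points are counted.
std::size_t interval_length( ClosedInterval const& interval )
{
    return interval.end - interval.start + 1;
}
```

For chromStart = 0 and chromEnd = 100, this yields start 1 and end 100, with a length of 100 in both representations.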
Furthermore, any lines starting with browser, track, or # are read, but currently ignored. We are not sure whether the unofficial standard allows such lines in the middle of BED files, so we accept them anywhere in the file. The obvious downside, which is inherent to the BED specification, is that the chromosome names "browser" and "track" cannot be used.
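The following sketch shows one way such a line filter could look. It assumes a plain prefix test; the actual reader may be stricter, for instance by requiring whitespace after the keyword. The function names are hypothetical.

```cpp
#include <string>

// Hypothetical helper: true if `text` begins with `prefix`.
bool starts_with( std::string const& text, std::string const& prefix )
{
    return text.compare( 0, prefix.size(), prefix ) == 0;
}

// Hypothetical helper: true for lines that carry no feature data,
// that is, browser lines, track lines, and comments.
bool is_ignored_bed_line( std::string const& line )
{
    return starts_with( line, "browser" )
        || starts_with( line, "track" )
        || starts_with( line, "#" );
}
```

Under this assumption, a data line whose chromosome name begins with "track" or "browser" would be skipped as well, which is the downside described above.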
Definition at line 91 of file bed_reader.hpp.
Public Member Functions | |
BedReader ()=default | |
BedReader (BedReader &&)=default | |
BedReader (BedReader const &)=default | |
~BedReader ()=default | |
BedReader & | operator= (BedReader &&)=default |
BedReader & | operator= (BedReader const &)=default |
std::vector< Feature > | read (std::shared_ptr< utils::BaseInputSource > source) const |
Read a BED input source, and return its content as a list of Feature structs. | |
GenomeLocusSet | read_as_genome_locus_set (std::shared_ptr< utils::BaseInputSource > source) const |
Read an input source, and return its content as a GenomeLocusSet. | |
GenomeRegionList | read_as_genome_region_list (std::shared_ptr< utils::BaseInputSource > source, bool merge=false) const |
Read a BED input source, and return its content as a GenomeRegionList. | |
void | read_as_genome_region_list (std::shared_ptr< utils::BaseInputSource > source, GenomeRegionList &target, bool merge=false) const |
Read a BED input source, and add its content to an existing GenomeRegionList. | |
Classes | |
struct | Feature |
Store all values that can typically appear in the columns of a BED file. | |
std::vector< BedReader::Feature > read( std::shared_ptr< utils::BaseInputSource > source ) const
Read a BED input source, and return its content as a list of Feature structs.
Definition at line 49 of file bed_reader.cpp.
GenomeLocusSet read_as_genome_locus_set( std::shared_ptr< utils::BaseInputSource > source ) const
Read an input source, and return its content as a GenomeLocusSet.
This only uses the first three columns, chrom, chromStart, and chromEnd, and ignores everything else.
This is the recommended way to read an input for testing whether genome coordinates are covered (filtered / to be considered) for downstream analyses.
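To illustrate what such a coverage test amounts to, here is a standalone sketch. It is not the GenomeLocusSet implementation; the class CoverageSketch and its per-chromosome std::vector<bool> storage are assumptions made for this example, using the 1-based closed coordinates described above.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class: records which 1-based positions are covered per chromosome.
class CoverageSketch
{
public:

    // Mark the 1-based closed interval [start, end] on `chrom` as covered.
    void add( std::string const& chrom, std::size_t start, std::size_t end )
    {
        auto& bits = covered_[ chrom ];
        if( bits.size() < end + 1 ) {
            bits.resize( end + 1, false );
        }
        for( std::size_t pos = start; pos <= end; ++pos ) {
            bits[ pos ] = true;
        }
    }

    // Test whether a single 1-based position is covered.
    // Index 0 is never set, as positions start at 1.
    bool is_covered( std::string const& chrom, std::size_t position ) const
    {
        auto const it = covered_.find( chrom );
        return it != covered_.end()
            && position < it->second.size()
            && it->second[ position ];
    }

private:

    std::unordered_map<std::string, std::vector<bool>> covered_;
};
```

With this layout, each lookup is a hash lookup followed by an index access, which suits per-position filtering in downstream analyses better than scanning a list of regions.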
Definition at line 59 of file bed_reader.cpp.
GenomeRegionList read_as_genome_region_list( std::shared_ptr< utils::BaseInputSource > source, bool merge = false ) const
Read a BED input source, and return its content as a GenomeRegionList.
This only uses the first three columns, chrom, chromStart, and chromEnd, and ignores everything else.
If merge is set, the individual regions of the file are merged if they overlap. This is more useful if the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes. See the overlap flag of GenomeRegionList::add( GenomeLocus const&, bool ) for details.
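The effect of merging can be sketched as follows for the regions of a single chromosome. This is a generic sort-and-sweep approach, not necessarily how GenomeRegionList performs the merge; the struct Region and the function merge_overlapping are hypothetical, and the sketch assumes that only overlapping regions are joined, while directly adjacent ones stay separate.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical type: a 1-based closed region on one chromosome.
struct Region
{
    std::size_t start;
    std::size_t end;
};

// Merge all overlapping regions, returning them sorted by start position.
std::vector<Region> merge_overlapping( std::vector<Region> regions )
{
    std::sort(
        regions.begin(), regions.end(),
        []( Region const& lhs, Region const& rhs ) {
            return lhs.start < rhs.start;
        }
    );

    std::vector<Region> merged;
    for( auto const& region : regions ) {
        if( !merged.empty() && region.start <= merged.back().end ) {
            // Overlap with the previous region: extend it if needed.
            merged.back().end = std::max( merged.back().end, region.end );
        } else {
            merged.push_back( region );
        }
    }
    return merged;
}
```

For coverage purposes the merged list describes the same set of positions with fewer entries, but the identity of the original regions (for example, two overlapping genes) is lost.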
Definition at line 69 of file bed_reader.cpp.
void read_as_genome_region_list( std::shared_ptr< utils::BaseInputSource > source, GenomeRegionList & target, bool merge = false ) const
Read a BED input source, and add its content to an existing GenomeRegionList.
This only uses the first three columns, chrom, chromStart, and chromEnd, and ignores everything else.
If merge is set, the individual regions of the file are merged if they overlap. This is more useful if the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes. See the overlap flag of GenomeRegionList::add( GenomeLocus const&, bool ) for details.
Definition at line 78 of file bed_reader.cpp.