A library for working with phylogenetic and population genetic data.
v0.32.0
BedReader Class Reference

#include <genesis/population/format/bed_reader.hpp>

Detailed Description

Reader for BED (Browser Extensible Data) files.

We follow the definition by https://en.wikipedia.org/wiki/BED_(file_format), which itself is based on the UCSC Genome Browser definition of the BED format:

Column number | Title | Definition
1 | chrom | Chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671) name
2 | chromStart | Start coordinate on the chromosome or scaffold for the sequence considered (the first base on the chromosome is numbered 0 in the file format; here, however, we use 1-based coordinates)
3 | chromEnd | End coordinate on the chromosome or scaffold for the sequence considered. This position is non-inclusive, unlike chromStart.
4 | name | Name of the line in the BED file
5 | score | Score between 0 and 1000
6 | strand | DNA strand orientation (positive ["+"] or negative ["-"] or "." if no strand)
7 | thickStart | Starting coordinate from which the annotation is displayed in a thicker way on a graphical representation (e.g. the start codon of a gene)
8 | thickEnd | End coordinate from which the annotation is no longer displayed in a thicker way on a graphical representation (e.g. the stop codon of a gene)
9 | itemRgb | RGB value in the form R,G,B (e.g. 255,0,0) determining the display color of the annotation contained in the BED file
10 | blockCount | Number of blocks (e.g. exons) on the line of the BED file
11 | blockSizes | List of values separated by commas corresponding to the size of the blocks (the number of values must correspond to that of the "blockCount")
12 | blockStarts | List of values separated by commas corresponding to the starting coordinates of the blocks, coordinates calculated relative to those present in the chromStart column (the number of values must correspond to that of the "blockCount")

The reader can parse every line or the whole file into Feature structs that contain the above columns (as far as present in the file). Alternatively, it can read into a GenomeRegionList structure, in which case only the genome coordinates (chromosome, start position, and end position) are used. The input needs to have a consistent number of columns, of which only the first three are mandatory. Columns must appear in the above order, and if a later column (more towards the end of the line) is needed, all preceding columns need to be filled as well. Any additional columns after these 12 are also read by our parser, but simply ignored.
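
The column rules above can be illustrated with a minimal, self-contained sketch. It is not the BedReader implementation: the names BedRecord and parse_bed_line are hypothetical, and splitting on arbitrary whitespace is an assumption made for brevity. Coordinates are stored exactly as written in the file, without any conversion.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record: the three mandatory columns, plus all later columns kept as raw text.
struct BedRecord
{
    std::string chrom;
    std::size_t chrom_start = 0; // as written in the file (0-based)
    std::size_t chrom_end   = 0; // as written in the file (non-inclusive)
    std::vector<std::string> optional_columns; // columns 4 and later, unparsed
};

// Parse one data line. Assumption: fields are separated by tabs or spaces.
BedRecord parse_bed_line( std::string const& line )
{
    std::istringstream stream( line );
    std::vector<std::string> fields;
    std::string field;
    while( stream >> field ) {
        fields.push_back( field );
    }

    // Only the first three columns are mandatory.
    if( fields.size() < 3 ) {
        throw std::runtime_error( "BED line needs at least chrom, chromStart, chromEnd" );
    }

    BedRecord record;
    record.chrom       = fields[0];
    record.chrom_start = std::stoull( fields[1] );
    record.chrom_end   = std::stoull( fields[2] );
    record.optional_columns.assign( fields.begin() + 3, fields.end() );
    return record;
}
```

A caller that wants to enforce a consistent column count could compare `3 + optional_columns.size()` across all records of a file.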

Note that the BED format internally uses 0-based half-open intervals. That is, the start and end coordinates chromStart = 0 and chromEnd = 100 define a region starting at the first base, with a length of 100. We here however use 1-based closed intervals, and hence store the same region as 1 and 100, both in the Feature struct and in the GenomeRegionList.
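
The coordinate conversion described above amounts to adding one to the start and keeping the end. The following sketch shows this; ClosedInterval and both function names are hypothetical and not part of the library. Rejecting empty intervals (chromEnd <= chromStart) is an assumption of this sketch, as a closed interval cannot represent them.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical 1-based closed interval: both start and end are part of the region.
struct ClosedInterval
{
    std::size_t start;
    std::size_t end;
};

// Convert BED coordinates (0-based, half-open) to a 1-based closed interval.
ClosedInterval bed_to_one_based_closed( std::size_t chrom_start, std::size_t chrom_end )
{
    // Assumption of this sketch: empty or inverted intervals are rejected.
    if( chrom_end <= chrom_start ) {
        throw std::invalid_argument( "chromEnd must be greater than chromStart" );
    }
    // The start shifts by one. The end stays the same, because the non-inclusive
    // 0-based end position equals the inclusive 1-based end position.
    return ClosedInterval{ chrom_start + 1, chrom_end };
}

// Number of positions in a closed interval.
std::size_t interval_length( ClosedInterval const& interval )
{
    return interval.end - interval.start + 1;
}
```

For chromStart = 0 and chromEnd = 100, this yields the interval 1 to 100 with a length of 100, matching the example in the text.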

Furthermore, any lines starting with browser, track, or # are read, but currently ignored. We are not sure whether the unofficial standard allows such lines in the middle of BED files, so we allow them there as well. The obvious downside of this part of the BED specification is that the chromosome names "browser" and "track" cannot be used.
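
A line filter along these lines could look as follows. This is a sketch only: the function names are hypothetical, and plain prefix matching is an assumption, so the actual reader may instead compare the first whole word of the line.

```cpp
#include <string>

// Check whether text begins with the given prefix.
bool starts_with( std::string const& text, std::string const& prefix )
{
    return text.compare( 0, prefix.size(), prefix ) == 0;
}

// Return true for lines that are read but carry no region data.
// Assumption: simple prefix matching, applied to every line of the file.
bool is_ignored_bed_line( std::string const& line )
{
    return starts_with( line, "browser" )
        || starts_with( line, "track" )
        || starts_with( line, "#" );
}
```

With prefix matching, a chromosome named "track1" would be skipped as well, which illustrates why such names are unavailable.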

Definition at line 91 of file bed_reader.hpp.

Public Member Functions

 BedReader ()=default
 
 BedReader (BedReader &&)=default
 
 BedReader (BedReader const &)=default
 
 ~BedReader ()=default
 
BedReader & operator= (BedReader &&)=default
 
BedReader & operator= (BedReader const &)=default
 
std::vector< Feature > read (std::shared_ptr< utils::BaseInputSource > source) const
 Read a BED input source, and return its content as a list of Feature structs. More...
 
GenomeLocusSet read_as_genome_locus_set (std::shared_ptr< utils::BaseInputSource > source) const
 Read an input source, and return its content as a GenomeLocusSet. More...
 
GenomeRegionList read_as_genome_region_list (std::shared_ptr< utils::BaseInputSource > source, bool merge=false) const
 Read a BED input source, and return its content as a GenomeRegionList. More...
 
void read_as_genome_region_list (std::shared_ptr< utils::BaseInputSource > source, GenomeRegionList &target, bool merge=false) const
 Read a BED input source, and add its content to an existing GenomeRegionList. More...
 

Classes

struct  Feature
 Store all values that can typically appear in the columns of a BED file. More...
 

Constructor & Destructor Documentation

◆ BedReader() [1/3]

BedReader ( )
default

◆ ~BedReader()

~BedReader ( )
default

◆ BedReader() [2/3]

BedReader ( BedReader const &  )
default

◆ BedReader() [3/3]

BedReader ( BedReader &&  )
default

Member Function Documentation

◆ operator=() [1/2]

BedReader& operator= ( BedReader &&  )
default

◆ operator=() [2/2]

BedReader& operator= ( BedReader const &  )
default

◆ read()

std::vector< BedReader::Feature > read ( std::shared_ptr< utils::BaseInputSource > source ) const

Read a BED input source, and return its content as a list of Feature structs.

Definition at line 49 of file bed_reader.cpp.

◆ read_as_genome_locus_set()

GenomeLocusSet read_as_genome_locus_set ( std::shared_ptr< utils::BaseInputSource > source ) const

Read an input source, and return its content as a GenomeLocusSet.

This only uses the first three columns, chrom, chromStart, and chromEnd, and ignores everything else.

This is the recommended way to read an input for testing whether genome coordinates are covered (filtered / to be considered) for downstream analyses.
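
To illustrate what such a coverage test involves, here is a minimal sketch of a per-chromosome coverage lookup. It is not the GenomeLocusSet implementation: the class CoverageSet, its member functions, and the storage as one bool vector per chromosome are all assumptions made for this example. Positions are 1-based, as elsewhere in this reader.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical coverage lookup: one flag per position and chromosome.
class CoverageSet
{
public:

    // Mark the 1-based closed interval [start, end] on the chromosome as covered.
    void add( std::string const& chrom, std::size_t start, std::size_t end )
    {
        auto& positions = covered_[ chrom ];
        if( positions.size() < end + 1 ) {
            positions.resize( end + 1, false );
        }
        for( std::size_t pos = start; pos <= end; ++pos ) {
            positions[ pos ] = true;
        }
    }

    // Test whether a 1-based position on the chromosome is covered.
    bool is_covered( std::string const& chrom, std::size_t position ) const
    {
        auto const it = covered_.find( chrom );
        return it != covered_.end()
            && position < it->second.size()
            && it->second[ position ];
    }

private:

    // Index 0 of each vector stays unused, so that indices equal 1-based positions.
    std::unordered_map<std::string, std::vector<bool>> covered_;
};
```

The lookup is constant time per position, at the cost of memory proportional to the highest covered position on each chromosome.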

Definition at line 59 of file bed_reader.cpp.

◆ read_as_genome_region_list() [1/2]

GenomeRegionList read_as_genome_region_list ( std::shared_ptr< utils::BaseInputSource > source,
bool  merge = false 
) const

Read a BED input source, and return its content as a GenomeRegionList.

This only uses the first three columns, chrom, chromStart, and chromEnd, and ignores everything else.

If merge is set, the individual regions of the file are merged if they overlap. This is more useful if the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes. See the overlap flag of GenomeRegionList::add( GenomeLocus const&, bool ) for details.
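
The effect of merging can be sketched for the regions of a single chromosome as follows. This is an illustration, not the GenomeRegionList code: Region and merge_overlapping are hypothetical names, and merging only regions that share at least one position (not merely adjacent ones) is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-based closed region on a single chromosome.
struct Region
{
    std::size_t start;
    std::size_t end;
};

// Merge all regions that overlap, returning them sorted by start position.
std::vector<Region> merge_overlapping( std::vector<Region> regions )
{
    std::sort(
        regions.begin(), regions.end(),
        []( Region const& lhs, Region const& rhs ) { return lhs.start < rhs.start; }
    );

    std::vector<Region> merged;
    for( auto const& region : regions ) {
        // Closed intervals overlap if the next start is not past the current end.
        if( !merged.empty() && region.start <= merged.back().end ) {
            merged.back().end = std::max( merged.back().end, region.end );
        } else {
            merged.push_back( region );
        }
    }

    // Merging can only reduce the number of regions.
    assert( merged.size() <= regions.size() );
    return merged;
}
```

For coverage queries, the merged list answers the same questions with fewer entries. For annotations such as genes, merging would lose the boundaries between individual features, which is why it is off by default.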

Definition at line 69 of file bed_reader.cpp.

◆ read_as_genome_region_list() [2/2]

void read_as_genome_region_list ( std::shared_ptr< utils::BaseInputSource > source,
GenomeRegionList & target,
bool  merge = false 
) const

Read a BED input source, and add its content to an existing GenomeRegionList.

This only uses the first three columns, chrom, chromStart, and chromEnd, and ignores everything else.

If merge is set, the individual regions of the file are merged if they overlap. This is more useful if the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes. See the overlap flag of GenomeRegionList::add( GenomeLocus const&, bool ) for details.

Definition at line 78 of file bed_reader.cpp.


The documentation for this class was generated from the following files: