#include <genesis/population/format/gff_reader.hpp>
Reader for GFF2 and GFF3 (General Feature Format) and GTF (General Transfer Format) files.
See https://uswest.ensembl.org/info/website/upload/gff.html for the format description. Lines starting with track or browser (including a trailing white space) are ignored, as are comment lines starting with # (or, for that matter, ## for directives), and empty lines.
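The skipping rules above can be sketched as a small predicate. This is an illustration only, not the genesis implementation: the helper name is_ignored_line is hypothetical, and the sketch assumes that "white space" after track or browser means a space or a tab.

```cpp
#include <string>

// Hypothetical helper, not part of genesis: returns true for lines that a
// reader following the rules described above would skip.
bool is_ignored_line( std::string const& line )
{
    // Empty lines are skipped.
    if( line.empty() ) {
        return true;
    }

    // Comment lines ("#") and directives ("##") both start with '#'.
    if( line[0] == '#' ) {
        return true;
    }

    // "track" and "browser" lines must be followed by white space.
    // Assumption: white space here means a space or a tab.
    auto starts_with_keyword = []( std::string const& text, std::string const& keyword ) {
        return text.size() > keyword.size()
            && text.compare( 0, keyword.size(), keyword ) == 0
            && ( text[ keyword.size() ] == ' ' || text[ keyword.size() ] == '\t' );
    };
    return starts_with_keyword( line, "track" ) || starts_with_keyword( line, "browser" );
}
```

Requiring the trailing white space means that a sequence name which merely begins with "track" (for example "tracker") is still treated as a regular feature line.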
We currently do not support the underlying ontology features, and simply store the ninth field of the file as a string in Feature::attributes_group. This is also how we support all three formats (GFF2, GFF3, and GTF) in one parser: we simply ignore the parts that differ between them. If needed, this last field has to be parsed by the user.
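As a minimal sketch of what such user-side parsing might look like, the following splits a GFF3-style attributes string (as it would be stored in Feature::attributes_group) into key-value pairs. The function parse_gff3_attributes is hypothetical and not part of genesis. It assumes the GFF3 "key=value;key=value" layout, does not decode URL-escaped characters, and does not handle the GTF layout (key "value";).

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper, not part of genesis: split a GFF3-style attributes
// string of the form "key=value;key=value" into a map.
// Assumptions: no URL-escaped characters need decoding, and GTF-style
// attributes are not handled.
std::map<std::string, std::string> parse_gff3_attributes( std::string const& attributes )
{
    std::map<std::string, std::string> result;
    std::istringstream stream( attributes );
    std::string entry;

    // Entries are separated by semicolons.
    while( std::getline( stream, entry, ';' )) {
        auto const pos = entry.find( '=' );

        // Skip empty entries and entries without a key or without '='.
        if( pos == std::string::npos || pos == 0 ) {
            continue;
        }
        result[ entry.substr( 0, pos ) ] = entry.substr( pos + 1 );
    }
    return result;
}
```

For an input such as "ID=gene1;Name=abc", the resulting map holds the keys ID and Name. A map keeps only one value per key, which is a simplification chosen for brevity here.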
See also http://gmod.org/wiki/GFF2, http://gmod.org/wiki/GFF3, and http://genome.ucsc.edu/FAQ/FAQformat.html#format3 for additional information, and https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for a thorough specification of GFF3.
Definition at line 69 of file gff_reader.hpp.
Public Member Functions | |
GffReader ()=default | |
GffReader (GffReader &&)=default | |
GffReader (GffReader const &)=default | |
~GffReader ()=default | |
GffReader & | operator= (GffReader &&)=default |
GffReader & | operator= (GffReader const &)=default |
bool | parse_line (utils::InputStream &input_stream, Feature &feature) const |
std::vector< Feature > | read (std::shared_ptr< utils::BaseInputSource > source) const |
Read a GFF2/GFF3/GTF input source, and return its content as a list of Feature structs. More... | |
GenomeLocusSet | read_as_genome_locus_set (std::shared_ptr< utils::BaseInputSource > source) const |
Read an input source, and return its content as a GenomeLocusSet. More... | |
GenomeRegionList | read_as_genome_region_list (std::shared_ptr< utils::BaseInputSource > source, bool merge=false) const |
Read a GFF2/GFF3/GTF input source, and return its content as a GenomeRegionList. More... | |
void | read_as_genome_region_list (std::shared_ptr< utils::BaseInputSource > source, GenomeRegionList &target, bool merge=false) const |
Read a GFF2/GFF3/GTF input source, and add its content to an existing GenomeRegionList. More... | |
Classes | |
struct | Feature |
bool parse_line (utils::InputStream &input_stream, GffReader::Feature &feature) const
Definition at line 100 of file gff_reader.cpp.
std::vector< GffReader::Feature > read (std::shared_ptr< utils::BaseInputSource > source) const
Read a GFF2/GFF3/GTF input source, and return its content as a list of Feature structs.
Definition at line 51 of file gff_reader.cpp.
GenomeLocusSet read_as_genome_locus_set (std::shared_ptr< utils::BaseInputSource > source) const
Read an input source, and return its content as a GenomeLocusSet.
This only uses the columns seqname, start, and end, and ignores everything else.
This is the recommended way to read an input for testing whether genome coordinates are covered (filtered / to be considered) for downstream analyses.
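To illustrate which parts of a line are relevant here, the sketch below extracts seqname, start, and end from a single tab-separated GFF line, where they are the first, fourth, and fifth columns. The Locus struct and extract_locus function are hypothetical stand-ins for illustration; they are not the genesis types or its parsing code.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in type for illustration, not a genesis class.
struct Locus
{
    std::string seqname;
    std::size_t start;
    std::size_t end;
};

// Hypothetical helper: take columns 1, 4, and 5 (seqname, start, end) of a
// tab-separated GFF line and ignore all other columns.
Locus extract_locus( std::string const& line )
{
    std::vector<std::string> fields;
    std::istringstream stream( line );
    std::string field;
    while( std::getline( stream, field, '\t' )) {
        fields.push_back( field );
    }
    if( fields.size() < 5 ) {
        throw std::runtime_error( "Expected at least five tab-separated columns." );
    }

    Locus locus;
    locus.seqname = fields[0];
    locus.start   = static_cast<std::size_t>( std::stoull( fields[3] ));
    locus.end     = static_cast<std::size_t>( std::stoull( fields[4] ));
    return locus;
}
```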
Definition at line 63 of file gff_reader.cpp.
GenomeRegionList read_as_genome_region_list (std::shared_ptr< utils::BaseInputSource > source, bool merge=false) const
Read a GFF2/GFF3/GTF input source, and return its content as a GenomeRegionList.
This only uses the columns seqname, start, and end, and ignores everything else.
If merge is set, the individual regions of the file are merged if they overlap. This is more useful if the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes. See the overlap flag of GenomeRegionList::add( GenomeLocus const&, bool ) for details.
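The effect of merging overlapping regions can be illustrated with a generic interval merge. This is a sketch of the general technique, not the GenomeRegionList implementation: the Interval alias and merge_overlapping function are hypothetical, and the sketch assumes closed intervals [start, end] on a single chromosome, merging only intervals that share at least one position.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical type for illustration: a closed interval [start, end] on one
// chromosome. Not a genesis type.
using Interval = std::pair<std::size_t, std::size_t>;

// Merge all intervals that overlap. Takes the input by value so that it can
// be sorted without modifying the caller's data.
std::vector<Interval> merge_overlapping( std::vector<Interval> intervals )
{
    // Sort by start position (and by end position for equal starts).
    std::sort( intervals.begin(), intervals.end() );

    std::vector<Interval> merged;
    for( auto const& interval : intervals ) {
        if( !merged.empty() && interval.first <= merged.back().second ) {
            // Overlaps the previous interval: extend it if necessary.
            merged.back().second = std::max( merged.back().second, interval.second );
        } else {
            // No overlap: start a new interval.
            merged.push_back( interval );
        }
    }
    return merged;
}
```

With merging, the regions [10, 20] and [15, 30] collapse into [10, 30], which is fine for coverage tests but loses the boundaries of the individual features. That is why merging is less suitable when each region stands for a specific entity such as a gene.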
Definition at line 75 of file gff_reader.cpp.
void read_as_genome_region_list (std::shared_ptr< utils::BaseInputSource > source, GenomeRegionList &target, bool merge=false) const
Read a GFF2/GFF3/GTF input source, and add its content to an existing GenomeRegionList.
This only uses the columns seqname, start, and end, and ignores everything else.
If merge is set, the individual regions of the file are merged if they overlap. This is more useful if the region list is used to determine coverage, and less useful if regions are meant to indicate some specific parts of the genome, such as genes. See the overlap flag of GenomeRegionList::add( GenomeLocus const&, bool ) for details.
Definition at line 84 of file gff_reader.cpp.