#include <genesis/population/format/vcf_header.hpp>
Capture the information from a header of a VCF/BCF file.
Unfortunately, the terminology that htslib and the VCF v4.2 specification use for the different parts of a VCF file is inconsistent: the words "ID", "key", and "tag" are overloaded, as are "column", "field", and "sub-field", and "type" is used for almost anything.
Given the excerpt from a VCF file
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
let's try to untangle the terminology. We mark the terms used in our wrapper classes in bold.
tag
Hopefully, this helps to make sense of the terminology.
Definition at line 102 of file vcf_header.hpp.
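To make the excerpt above more concrete, the following minimal sketch pairs the colon-separated keys of the FORMAT column with the colon-separated values of one sample column. The helpers split() and pair_format_with_sample() are hypothetical names used for illustration only; they are not part of genesis or htslib, which parse records differently.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split a text at every occurrence of a delimiter.
std::vector<std::string> split( std::string const& text, char delimiter )
{
    std::vector<std::string> parts;
    std::istringstream stream( text );
    std::string part;
    while( std::getline( stream, part, delimiter )) {
        parts.push_back( part );
    }
    return parts;
}

// Hypothetical helper: pair each key of the FORMAT column (e.g. "GT:GQ:DP:HQ")
// with the value at the same position in a sample column (e.g. "0|0:48:1:51,51").
std::vector<std::pair<std::string, std::string>>
pair_format_with_sample( std::string const& format_column, std::string const& sample_column )
{
    auto const keys   = split( format_column, ':' );
    auto const values = split( sample_column, ':' );

    // A sample column may list fewer values than there are keys,
    // so only the positions present in both are paired.
    std::vector<std::pair<std::string, std::string>> result;
    for( std::size_t i = 0; i < keys.size() && i < values.size(); ++i ) {
        result.emplace_back( keys[i], values[i] );
    }
    return result;
}
```

For the record line of the excerpt, this pairs GT with 0|0, GQ with 48, DP with 1, and HQ with 51,51. The comma-separated values within HQ are left unsplit here.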
Public Member Functions | |
VcfHeader ()=default | |
Create a default (empty) instance. More... | |
VcfHeader (::bcf_hdr_t *bcf_hdr) | |
Create an instance given a pointer to the htslib-internal header struct. More... | |
VcfHeader (HtsFile &hts_file) | |
Create an instance given an HtsFile object. More... | |
VcfHeader (std::string const &mode) | |
Create an instance with a specific mode. More... | |
VcfHeader (VcfHeader &&other) | |
VcfHeader (VcfHeader const &)=delete | |
~VcfHeader () | |
void | assert_filter (std::string const &id) const |
Assert that a FILTER entry with a given ID is defined in the header of the VCF/BCF file. More... | |
void | assert_format (std::string const &id) const |
Assert that a FORMAT entry with a given ID is defined in the header of the VCF/BCF file. More... | |
void | assert_format (std::string const &id, VcfValueType type) const |
Assert that a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, and that its value(s) has/have a specified data type. More... | |
void | assert_format (std::string const &id, VcfValueType type, size_t number) const |
Assert that a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, that its value(s) has/have a specified data type, and a specified number of values. More... | |
void | assert_format (std::string const &id, VcfValueType type, VcfValueSpecial special) const |
Assert that a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, that its value(s) has/have a specified data type, and a specified special type of number of values. More... | |
void | assert_info (std::string const &id) const |
Assert that an INFO entry with a given ID is defined in the header of the VCF/BCF file. More... | |
void | assert_info (std::string const &id, VcfValueType type) const |
Assert that an INFO entry with a given ID is defined in the header of the VCF/BCF file, and that its value(s) has/have a specified data type. More... | |
void | assert_info (std::string const &id, VcfValueType type, size_t number) const |
Assert that an INFO entry with a given ID is defined in the header of the VCF/BCF file, that its value(s) has/have a specified data type, and a specified number of values. More... | |
void | assert_info (std::string const &id, VcfValueType type, VcfValueSpecial special) const |
Assert that an INFO entry with a given ID is defined in the header of the VCF/BCF file, that its value(s) has/have a specified data type, and a specified special type of number of values. More... | |
::bcf_hdr_t * | data () |
Return the internal htslib ::bcf_hdr_t data struct pointer. More... | |
::bcf_hdr_t const * | data () const |
Return the internal htslib ::bcf_hdr_t data struct pointer. More... | |
size_t | get_chromosome_length (std::string const &chrom_name) const |
Get the length of a chromosome/contig/sequence, given its name. More... | |
std::unordered_map< std::string, std::string > | get_chromosome_values (std::string const &chrom_name) const |
Get all key-value-pairs describing a particular chromosome/contig/sequence, given its name. More... | |
std::vector< std::string > | get_chromosomes () const |
Get a list of the chromosome/contig/sequence names used in the file. More... | |
std::vector< std::string > | get_filter_ids () const |
Get a list of the ID names of all FILTER entries in the header. More... | |
std::unordered_map< std::string, std::string > | get_filter_values (std::string const &id) const |
Get all key-value pairs describing a particular filter header line, given its ID. More... | |
std::vector< std::string > | get_format_ids () const |
Get a list of the ID names of all FORMAT fields in the header. More... | |
VcfSpecification | get_format_specification (std::string const &id) const |
Get the required specification key-value-pairs for a given FORMAT entry. More... | |
std::unordered_map< std::string, std::string > | get_format_values (std::string const &id) const |
Get all key-value pairs describing a particular format field, given its ID. More... | |
std::vector< std::string > | get_info_ids () const |
Get a list of the ID names of all INFO fields in the header. More... | |
VcfSpecification | get_info_specification (std::string const &id) const |
Get the required specification key-value-pairs for a given INFO entry. More... | |
std::unordered_map< std::string, std::string > | get_info_values (std::string const &id) const |
Get all key-value pairs describing a particular info header line, given its ID. More... | |
size_t | get_sample_count () const |
Get the number of samples (columns) in the file. More... | |
size_t | get_sample_index (std::string const &name) const |
Get the index of a sample, given its name. More... | |
std::string | get_sample_name (size_t index) const |
Get the name of a sample given its index. More... | |
std::vector< std::string > | get_sample_names () const |
Return a list of the sample names (column headers) of the VCF/BCF file. More... | |
bool | has_filter (std::string const &id) const |
Return whether a FILTER entry with a given ID is defined in the header of the VCF/BCF file. More... | |
bool | has_format (std::string const &id) const |
Return whether a FORMAT entry with a given ID is defined in the header of the VCF/BCF file. More... | |
bool | has_format (std::string const &id, VcfValueType type) const |
Return whether a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, and whether its value(s) has/have a specified data type. More... | |
bool | has_format (std::string const &id, VcfValueType type, size_t number) const |
Return whether a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, whether its value(s) has/have a specified data type, and a specified number of values. More... | |
bool | has_format (std::string const &id, VcfValueType type, VcfValueSpecial special) const |
Return whether a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, whether its value(s) has/have a specified data type, and a specified special type of number of values. More... | |
bool | has_info (std::string const &id) const |
Return whether an INFO entry with a given ID is defined in the header of the VCF/BCF file. More... | |
bool | has_info (std::string const &id, VcfValueType type) const |
Return whether an INFO entry with a given ID is defined in the header of the VCF/BCF file, and whether its value(s) has/have a specified data type. More... | |
bool | has_info (std::string const &id, VcfValueType type, size_t number) const |
Return whether an INFO entry with a given ID is defined in the header of the VCF/BCF file, whether its value(s) has/have a specified data type, and a specified number of values. More... | |
bool | has_info (std::string const &id, VcfValueType type, VcfValueSpecial special) const |
Return whether an INFO entry with a given ID is defined in the header of the VCF/BCF file, whether its value(s) has/have a specified data type, and a specified special type of number of values. More... | |
VcfHeader & | operator= (VcfHeader &&other) |
VcfHeader & | operator= (VcfHeader const &)=delete |
void | set_samples (std::vector< std::string > const &sample_names, bool inverse_sample_names=false) |
Specify a subset of samples to be parsed. More... | |
std::string | version () const |
Return the VCF/BCF version string, e.g. "VCFv4.2". More... | |
Friends | |
template<typename S , typename T > | |
class | VcfFormatIterator |
class | VcfRecord |
VcfHeader | ( | ) | default |
Create a default (empty) instance.
VcfHeader | ( | std::string const & | mode | ) | explicit |
Create an instance with a specific mode.
For the mode param, see the ::hts_open() documentation of htslib.
Definition at line 72 of file vcf_header.cpp.
Create an instance given an HtsFile object.
The HtsFile has to be newly created and must not have been read from before; otherwise, its internal file pointer is already past the header.
Definition at line 80 of file vcf_header.cpp.
VcfHeader | ( | ::bcf_hdr_t * | bcf_hdr | ) | explicit |
Create an instance given a pointer to the htslib-internal header struct.
This copies the header using ::bcf_hdr_dup() from htslib, and then manages the lifetime of the newly created ::bcf_hdr_t instance only.
Definition at line 92 of file vcf_header.cpp.
~VcfHeader | ( | ) |
Definition at line 100 of file vcf_header.cpp.
Definition at line 107 of file vcf_header.cpp.
void assert_filter | ( | std::string const & | id | ) | const |
Assert that a FILTER entry with a given ID is defined in the header of the VCF/BCF file.
Definition at line 212 of file vcf_header.cpp.
void assert_format | ( | std::string const & | id | ) | const |
Assert that a FORMAT entry with a given ID is defined in the header of the VCF/BCF file.
Definition at line 300 of file vcf_header.cpp.
void assert_format | ( | std::string const & | id, |
VcfValueType | type | ||
) | const |
Assert that a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, and that its value(s) has/have a specified data type.
Definition at line 305 of file vcf_header.cpp.
void assert_format | ( | std::string const & | id, |
VcfValueType | type, | ||
size_t | number | ||
) | const |
Assert that a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, that its value(s) has/have a specified data type, and a specified number of values.
This check requires that the number of values is fixed to the given number. That is, here, we do not use any of the special cases (number of values depending on number of alleles, etc.), but require a fixed given number instead.
Definition at line 315 of file vcf_header.cpp.
void assert_format | ( | std::string const & | id, |
VcfValueType | type, | ||
VcfValueSpecial | special | ||
) | const |
Assert that a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, that its value(s) has/have a specified data type, and a specified special type of number of values.
The last check for the kind of number of values is typically used to require one of the special cases (number of values depending on number of alleles, etc).
Definition at line 310 of file vcf_header.cpp.
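The difference between the fixed-number overload and the special-case overload can be sketched with a small stand-alone model. Everything in it is hypothetical: ValueType, ValueSpecial, Specification, has_entry(), and assert_entry() are illustrative stand-ins and not the genesis types or functions. The sketch also assumes that a failed assertion is reported by throwing an exception, which the descriptions above do not state explicitly.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for illustration; the actual VcfValueType,
// VcfValueSpecial, and VcfSpecification of genesis may look different.
enum class ValueType { Integer, Float, Flag, Character, String };
enum class ValueSpecial { Fixed, Variable, Allele, Genotype, Reference };

struct Specification {
    ValueType    type;
    ValueSpecial special;
    std::size_t  number; // only meaningful if special == ValueSpecial::Fixed
};

using SpecificationMap = std::unordered_map<std::string, Specification>;

// Fixed-number check: the entry has to declare exactly `number` values.
bool has_entry(
    SpecificationMap const& specs, std::string const& id, ValueType type, std::size_t number
) {
    auto const it = specs.find( id );
    return it != specs.end()
        && it->second.type == type
        && it->second.special == ValueSpecial::Fixed
        && it->second.number == number;
}

// Special-case check: the entry has to declare the given kind of count,
// for example one value per alternative allele.
bool has_entry(
    SpecificationMap const& specs, std::string const& id, ValueType type, ValueSpecial special
) {
    auto const it = specs.find( id );
    return it != specs.end()
        && it->second.type == type
        && it->second.special == special;
}

// Assumption: the assert variant throws where the has variant returns false.
void assert_entry(
    SpecificationMap const& specs, std::string const& id, ValueType type, std::size_t number
) {
    if( !has_entry( specs, id, type, number )) {
        throw std::runtime_error( "Entry '" + id + "' is missing or has a different specification." );
    }
}
```

In this model, an entry declared with one value per alternative allele fails the fixed-number check for any number, and passes only the special-case check with ValueSpecial::Allele.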
void assert_info | ( | std::string const & | id | ) | const |
Assert that an INFO entry with a given ID is defined in the header of the VCF/BCF file.
Definition at line 241 of file vcf_header.cpp.
void assert_info | ( | std::string const & | id, |
VcfValueType | type | ||
) | const |
Assert that an INFO entry with a given ID is defined in the header of the VCF/BCF file, and that its value(s) has/have a specified data type.
Definition at line 246 of file vcf_header.cpp.
void assert_info | ( | std::string const & | id, |
VcfValueType | type, | ||
size_t | number | ||
) | const |
Assert that an INFO entry with a given ID is defined in the header of the VCF/BCF file, that its value(s) has/have a specified data type, and a specified number of values.
This check requires that the number of values is fixed to the given number. That is, here, we do not use any of the special cases (number of values depending on number of alleles, etc.), but require a fixed given number instead.
Definition at line 256 of file vcf_header.cpp.
void assert_info | ( | std::string const & | id, |
VcfValueType | type, | ||
VcfValueSpecial | special | ||
) | const |
Assert that an INFO entry with a given ID is defined in the header of the VCF/BCF file, that its value(s) has/have a specified data type, and a specified special type of number of values.
The last check for the kind of number of values is typically used to require one of the special cases (number of values depending on number of alleles, etc).
Definition at line 251 of file vcf_header.cpp.
::bcf_hdr_t * data | ( | ) | inline |
Return the internal htslib ::bcf_hdr_t data struct pointer.
Definition at line 177 of file vcf_header.hpp.
::bcf_hdr_t const * data | ( | ) | const inline |
Return the internal htslib ::bcf_hdr_t data struct pointer.
Definition at line 185 of file vcf_header.hpp.
size_t get_chromosome_length | ( | std::string const & | chrom_name | ) | const |
Get the length of a chromosome/contig/sequence, given its name.
This information is potentially stored auto-magically in the htslib struct.
Definition at line 174 of file vcf_header.cpp.
std::unordered_map< std::string, std::string > get_chromosome_values | ( | std::string const & | chrom_name | ) | const |
Get all key-value-pairs describing a particular chromosome/contig/sequence, given its name.
For example, if the header contains a line
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
then the result will contain entries that map from "ID" to "20", from "length" to "62435964", and so forth. (Note that ID is also included in the result, for simplicity.)
Definition at line 193 of file vcf_header.cpp.
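The mapping described above can be illustrated with a small stand-alone parser for one structured header line. This is only a sketch of the resulting key-value structure: parse_header_values() is a hypothetical helper, not part of genesis, and the class itself obtains these values from htslib rather than by parsing text. Escaped quotes inside values are not handled.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper for illustration only; not part of genesis or htslib.
// Splits the "<key=value,...>" body of a structured VCF header line into a map.
std::unordered_map<std::string, std::string>
parse_header_values( std::string const& line )
{
    std::unordered_map<std::string, std::string> result;

    // The key-value list sits between the first '<' and the last '>'.
    auto const open  = line.find( '<' );
    auto const close = line.rfind( '>' );
    if( open == std::string::npos || close == std::string::npos || close < open ) {
        return result;
    }

    std::string key;
    std::string value;
    bool in_key    = true;
    bool in_quotes = false;

    // Store the pair collected so far and start over with the next key.
    auto flush = [&]() {
        if( !key.empty() ) {
            result[ key ] = value;
        }
        key.clear();
        value.clear();
        in_key = true;
    };

    for( auto pos = open + 1; pos < close; ++pos ) {
        char const c = line[ pos ];
        if( c == '"' ) {
            in_quotes = !in_quotes;   // quotes delimit a value and are dropped
        } else if( c == ',' && !in_quotes ) {
            flush();                  // an unquoted comma ends the current pair
        } else if( c == '=' && in_key ) {
            in_key = false;           // the first '=' separates key from value
        } else if( in_key ) {
            key += c;
        } else {
            value += c;
        }
    }
    flush();
    return result;
}
```

Applied to the ##contig line above, the sketch yields six entries, including "ID" mapped to "20" and "species" mapped to the unquoted text Homo sapiens.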
std::vector< std::string > get_chromosomes | ( | ) | const |
Get a list of the chromosome/contig/sequence names used in the file.
These correspond to the field entries in the CHROM column of the records.
Definition at line 140 of file vcf_header.cpp.
std::vector< std::string > get_filter_ids | ( | ) | const |
Get a list of the ID names of all FILTER entries in the header.
For example, if the header contains a line
##FILTER=<ID=q10,Description="Quality below 10">
then the list contains an entry q10.
Definition at line 202 of file vcf_header.cpp.
std::unordered_map< std::string, std::string > get_filter_values | ( | std::string const & | id | ) | const |
Get all key-value pairs describing a particular filter header line, given its ID.
For example, if the header contains a line
##FILTER=<ID=q10,Description="Quality below 10">
then the result will contain entries that map from "ID" to "q10", and from "Description" to "Quality below 10". (Note that ID is also included in the result, for simplicity.)
Definition at line 207 of file vcf_header.cpp.
std::vector< std::string > get_format_ids | ( | ) | const |
Get a list of the ID names of all FORMAT fields in the header.
For example, if the header contains a line
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
then the list contains an entry GT.
Definition at line 285 of file vcf_header.cpp.
VcfSpecification get_format_specification | ( | std::string const & | id | ) | const |
Get the required specification key-value-pairs for a given FORMAT entry.
See also get_format_values() for a function that returns all given key-value pairs of the FORMAT entry.
Definition at line 290 of file vcf_header.cpp.
std::unordered_map< std::string, std::string > get_format_values | ( | std::string const & | id | ) | const |
Get all key-value pairs describing a particular format field, given its ID.
For example, if the header contains a line
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
then the result will contain entries that map from "ID" to "GT", from "Number" to "1", and so forth. (Note that ID is also included in the result, for simplicity.)
See also get_format_specification() for a function that returns only the required key-value-pairs for FORMAT entries.
Definition at line 295 of file vcf_header.cpp.
std::vector< std::string > get_info_ids | ( | ) | const |
Get a list of the ID names of all INFO fields in the header.
For example, if the header contains a line
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
then the list contains an entry DP.
Definition at line 226 of file vcf_header.cpp.
VcfSpecification get_info_specification | ( | std::string const & | id | ) | const |
Get the required specification key-value-pairs for a given INFO entry.
See also get_info_values() for a function that returns all given key-value pairs of the INFO entry.
Definition at line 231 of file vcf_header.cpp.
std::unordered_map< std::string, std::string > get_info_values | ( | std::string const & | id | ) | const |
Get all key-value pairs describing a particular info header line, given its ID.
For example, if the header contains a line
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
then the result will contain entries that map from "ID" to "DP", from "Number" to "1", and so forth. (Note that ID is also included in the result, for simplicity.)
See also get_info_specification() for a function that returns only the required key-value-pairs for INFO entries.
Definition at line 236 of file vcf_header.cpp.
size_t get_sample_count | ( | ) | const |
Get the number of samples (columns) in the file.
Definition at line 344 of file vcf_header.cpp.
size_t get_sample_index | ( | std::string const & | name | ) | const |
Get the index of a sample, given its name.
If the sample name does not exist in the VCF file, an exception is thrown.
Definition at line 361 of file vcf_header.cpp.
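The lookup behavior described here (return the index, or throw if the name is unknown) can be sketched on a plain list of names. The function sample_index() is a hypothetical stand-in, and the exception type std::invalid_argument is an assumption; the text above does not say which exception type the class throws.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration: find the position of a sample name
// in the list of sample names, and throw if the name is not in the list.
std::size_t sample_index( std::vector<std::string> const& names, std::string const& name )
{
    auto const it = std::find( names.begin(), names.end(), name );
    if( it == names.end() ) {
        // Assumption: the exception type used here is chosen for the sketch only.
        throw std::invalid_argument( "Sample name '" + name + "' not found." );
    }
    return static_cast<std::size_t>( std::distance( names.begin(), it ));
}
```

Callers that cannot rule out unknown names would wrap the call in a try block, or check the list of sample names first.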
std::string get_sample_name | ( | size_t | index | ) | const |
Get the name of a sample given its index.
This corresponds to the name given in the #CHROM line of the VCF file, using indices in the range [ 0, get_sample_count() ).
Definition at line 350 of file vcf_header.cpp.
std::vector< std::string > get_sample_names | ( | ) | const |
Return a list of the sample names (column headers) of the VCF/BCF file.
These are the names that correspond to the column headers for samples in the #CHROM line of a VCF file.
Definition at line 373 of file vcf_header.cpp.
bool has_filter | ( | std::string const & | id | ) | const |
Return whether a FILTER entry with a given ID is defined in the header of the VCF/BCF file.
Definition at line 217 of file vcf_header.cpp.
bool has_format | ( | std::string const & | id | ) | const |
Return whether a FORMAT entry with a given ID is defined in the header of the VCF/BCF file.
Definition at line 320 of file vcf_header.cpp.
bool has_format | ( | std::string const & | id, |
VcfValueType | type | ||
) | const |
Return whether a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, and whether its value(s) has/have a specified data type.
Definition at line 325 of file vcf_header.cpp.
bool has_format | ( | std::string const & | id, |
VcfValueType | type, | ||
size_t | number | ||
) | const |
Return whether a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, whether its value(s) has/have a specified data type, and a specified number of values.
This check requires that the number of values is fixed to the given number. That is, here, we do not use any of the special cases (number of values depending on number of alleles, etc.), but require a fixed given number instead.
Definition at line 335 of file vcf_header.cpp.
bool has_format | ( | std::string const & | id, |
VcfValueType | type, | ||
VcfValueSpecial | special | ||
) | const |
Return whether a FORMAT entry with a given ID is defined in the header of the VCF/BCF file, whether its value(s) has/have a specified data type, and a specified special type of number of values.
The last check for the kind of number of values is typically used to require one of the special cases (number of values depending on number of alleles, etc).
Definition at line 330 of file vcf_header.cpp.
bool has_info | ( | std::string const & | id | ) | const |
Return whether an INFO entry with a given ID is defined in the header of the VCF/BCF file.
Definition at line 261 of file vcf_header.cpp.
bool has_info | ( | std::string const & | id, |
VcfValueType | type | ||
) | const |
Return whether an INFO entry with a given ID is defined in the header of the VCF/BCF file, and whether its value(s) has/have a specified data type.
Definition at line 266 of file vcf_header.cpp.
bool has_info | ( | std::string const & | id, |
VcfValueType | type, | ||
size_t | number | ||
) | const |
Return whether an INFO entry with a given ID is defined in the header of the VCF/BCF file, whether its value(s) has/have a specified data type, and a specified number of values.
This check requires that the number of values is fixed to the given number. That is, here, we do not use any of the special cases (number of values depending on number of alleles, etc.), but require a fixed given number instead.
Definition at line 276 of file vcf_header.cpp.
bool has_info | ( | std::string const & | id, |
VcfValueType | type, | ||
VcfValueSpecial | special | ||
) | const |
Return whether an INFO entry with a given ID is defined in the header of the VCF/BCF file, whether its value(s) has/have a specified data type, and a specified special type of number of values.
The last check for the kind of number of values is typically used to require one of the special cases (number of values depending on number of alleles, etc).
Definition at line 271 of file vcf_header.cpp.
Definition at line 116 of file vcf_header.cpp.
void set_samples | ( | std::vector< std::string > const & | sample_names, |
bool | inverse_sample_names = false |
||
) |
Specify a subset of samples to be parsed.
Only the specified set of sample columns is parsed and used to populate the VcfRecords when reading the VCF/BCF file (or, if inverse_sample_names is true, only the specified samples are excluded). This can yield drastic speedups when parsing large files.
This function wraps the bcf_hdr_set_samples() function of htslib; see there for details.
Definition at line 384 of file vcf_header.cpp.
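The effect of the inverse_sample_names flag can be sketched as a plain selection over the sample names of a file. The function select_samples() is a hypothetical model of the described semantics only; it does not call htslib, and details such as the handling of names that do not occur in the file are left to the bcf_hdr_set_samples() documentation.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical model for illustration: return the samples that would be parsed,
// in file order, given a list of names and the inverse flag.
std::vector<std::string> select_samples(
    std::vector<std::string> const& all_names,
    std::vector<std::string> const& sample_names,
    bool inverse_sample_names = false
) {
    std::unordered_set<std::string> const listed( sample_names.begin(), sample_names.end() );

    std::vector<std::string> kept;
    for( auto const& name : all_names ) {
        bool const is_listed = listed.count( name ) > 0;

        // Without the flag, keep the listed samples; with it, keep all others.
        if( is_listed != inverse_sample_names ) {
            kept.push_back( name );
        }
    }
    return kept;
}
```

With a file containing samples A, B, and C, passing the list A, C keeps A and C by default, and keeps only B when the flag is set.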
std::string version | ( | ) | const |
Return the VCF/BCF version string, e.g. "VCFv4.2".
Definition at line 131 of file vcf_header.cpp.
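In a text VCF file, a version string of this form appears in the ##fileformat line at the top of the header. The following sketch extracts it from such a line; version_from_fileformat_line() is a hypothetical helper for illustration, and the class itself is not claimed to work this way.

```cpp
#include <string>

// Hypothetical helper for illustration: return the part after "##fileformat=",
// or an empty string if the line is not a fileformat line.
std::string version_from_fileformat_line( std::string const& line )
{
    std::string const prefix = "##fileformat=";
    if( line.compare( 0, prefix.size(), prefix ) != 0 ) {
        return "";
    }
    return line.substr( prefix.size() );
}
```

For the line ##fileformat=VCFv4.2, the sketch returns VCFv4.2, which matches the example given in the description above.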
|
friend |
Definition at line 119 of file vcf_header.hpp.
|
friend |
Definition at line 116 of file vcf_header.hpp.