#include <genesis/population/format/sam_variant_input_stream.hpp>
Input stream for SAM/BAM/CRAM files that produces a Variant per genome position.
We expect the input file to be sorted by position. Positions with no reads overlapping are skipped.
Example usage:
    auto sam_it = SamVariantInputStream( "/path/to/file.sam" );
    sam_it.min_map_qual( 40 );
    for( auto const& var : sam_it ) {
        std::cout << var.chromosome << "\t" << var.position << "\t";
        for( auto const& bs : var.samples ) {
            std::cout << "\t";
            to_sync( bs, std::cout );
        }
        std::cout << "\n";
    }
By default, as above, all reads are considered to belong to the same sample. In that case, the inner loop over samples above only ever visits the single SampleCounts object stored in the Variant. Reads can however also be split by read group (@RG); see split_by_rg() and with_unaccounted_rg() for details. In that case, the Variant contains one SampleCounts object per read group, and potentially a special one for unaccounted reads with no proper RG. This can be filtered further by setting rg_tag_filter(), so that only certain RG tags are produced as samples.
Definition at line 103 of file sam_variant_input_stream.hpp.
Public Member Functions | |
SamVariantInputStream () | |
Create a default instance, with no input. This is also the past-the-end iterator. | |
SamVariantInputStream (SamVariantInputStream &&)=default | |
SamVariantInputStream (SamVariantInputStream const &)=default | |
SamVariantInputStream (std::string const &input_file) | |
SamVariantInputStream (std::string const &input_file, std::unordered_set< std::string > const &rg_tag_filter, bool inverse_rg_tag_filter=false) | |
~SamVariantInputStream ()=default | |
Iterator | begin () const |
Iterator | end () const |
uint32_t | flags_exclude_all () const |
self_type & | flags_exclude_all (uint32_t value) |
Do not use reads with all bits set in value present in the FLAG field of the read. | |
uint32_t | flags_exclude_any () const |
self_type & | flags_exclude_any (uint32_t value) |
Do not use reads with any bits set in value present in the FLAG field of the read. | |
uint32_t | flags_include_all () const |
self_type & | flags_include_all (uint32_t value) |
Only use reads with all bits set in value present in the FLAG field of the read. | |
uint32_t | flags_include_any () const |
self_type & | flags_include_any (uint32_t value) |
Only use reads with any bits set in value present in the FLAG field of the read. | |
std::string const & | input_file () const |
self_type & | input_file (std::string const &value) |
Set the input file. | |
bool | inverse_rg_tag_filter () const |
self_type & | inverse_rg_tag_filter (bool value) |
Reverse the meaning of the list of sample names given by rg_tag_filter(). | |
int | max_accumulation_depth () const |
self_type & | max_accumulation_depth (int value) |
Set the maximum read depth (coverage) at a given position that is actually processed. | |
int | max_depth () const |
self_type & | max_depth (int value) |
Set the maximum read depth (coverage) at a given position to be considered. | |
int | min_base_qual () const |
self_type & | min_base_qual (int value) |
Set the minimum phred-scaled per-base quality score for a nucleotide to be considered. | |
int | min_depth () const |
self_type & | min_depth (int value) |
Set the minimum read depth (coverage) at a given position to be considered. | |
int | min_map_qual () const |
self_type & | min_map_qual (int value) |
Set the minimum phred-scaled mapping quality score for a read in the input file to be considered. | |
SamVariantInputStream & | operator= (SamVariantInputStream &&)=default |
SamVariantInputStream & | operator= (SamVariantInputStream const &)=default |
self_type & | region_filter (std::shared_ptr< GenomeLocusSet > region_filter) |
Set a region filter, so that only loci that are set in the given region_filter are used, and all others are skipped. | |
std::unordered_set< std::string > const & | rg_tag_filter () const |
self_type & | rg_tag_filter (std::unordered_set< std::string > const &value) |
Set the sample names used for filtering reads by their RG read group tag. | |
bool | split_by_rg () const |
self_type & | split_by_rg (bool value) |
If set to true, instead of reading all mapped reads as a single sample, split them by the @RG read group tag. | |
bool | with_unaccounted_rg () const |
self_type & | with_unaccounted_rg (bool value) |
Decide whether to add a sample for reads without a read group, when splitting by @RG tag. | |
Public Types | |
using | difference_type = std::ptrdiff_t |
using | iterator_category = std::input_iterator_tag |
using | pointer = value_type const * |
using | reference = value_type const & |
using | self_type = SamVariantInputStream |
using | value_type = Variant |
Classes | |
class | Iterator |
Iterator over loci of the input sources. | |
SamVariantInputStream () | inline |
Create a default instance, with no input. This is also the past-the-end iterator.
Definition at line 347 of file sam_variant_input_stream.hpp.
SamVariantInputStream (std::string const &input_file) | inline explicit |
Definition at line 351 of file sam_variant_input_stream.hpp.
SamVariantInputStream (std::string const &input_file, std::unordered_set< std::string > const &rg_tag_filter, bool inverse_rg_tag_filter=false) |
Definition at line 947 of file sam_variant_input_stream.cpp.
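SamVariantInputStream (SamVariantInputStream &&) | default |
SamVariantInputStream (SamVariantInputStream const &) | default |
~SamVariantInputStream () | default |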
Iterator begin () const | inline |
Definition at line 375 of file sam_variant_input_stream.hpp.
Iterator end () const | inline |
Definition at line 390 of file sam_variant_input_stream.hpp.
uint32_t flags_exclude_all () const | inline |
Definition at line 478 of file sam_variant_input_stream.hpp.
self_type & flags_exclude_all (uint32_t value) | inline |
Do not use reads with all bits set in value present in the FLAG field of the read.
This is equivalent to the -G setting in samtools view.
The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, or space-separated list of flag names. We are more lenient in parsing flag names than samtools, and also allow different capitalization and delimiters such as dashes and underscores in the flag names.
See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details on the flag values, and see https://www.htslib.org/doc/samtools-view.html for their usage in samtools.
Definition at line 501 of file sam_variant_input_stream.hpp.
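The numeric forms of the value described above (hex, octal, decimal) can be illustrated with a small stand-alone sketch. The helper parse_numeric_flag_value() is hypothetical and not part of genesis; it relies on std::stoul with base 0, which detects the 0x and 0 prefixes, and it does not handle lists of flag names.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of genesis: parse the numeric forms of a FLAG value.
// With base 0, std::stoul detects a "0x" prefix (hex), a leading "0" (octal),
// or otherwise reads the text as a decimal number.
inline uint32_t parse_numeric_flag_value( std::string const& text )
{
    std::size_t pos = 0;
    unsigned long const value = std::stoul( text, &pos, 0 );
    if( pos != text.size() ) {
        // Trailing characters, for instance from a flag name; not handled in this sketch.
        throw std::invalid_argument( "Not a purely numeric flag value: " + text );
    }
    return static_cast<uint32_t>( value );
}
```

With this sketch, the strings "0x10", "020", and "16" all yield the same value, 16.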
uint32_t flags_exclude_any () const | inline |
Definition at line 508 of file sam_variant_input_stream.hpp.
self_type & flags_exclude_any (uint32_t value) | inline |
Do not use reads with any bits set in value present in the FLAG field of the read.
This is equivalent to the -F / --excl-flags / --exclude-flags setting in samtools view.
The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, or space-separated list of flag names. We are more lenient in parsing flag names than samtools, and also allow different capitalization and delimiters such as dashes and underscores in the flag names.
See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details on the flag values, and see https://www.htslib.org/doc/samtools-view.html for their usage in samtools.
Definition at line 531 of file sam_variant_input_stream.hpp.
uint32_t flags_include_all () const | inline |
Definition at line 420 of file sam_variant_input_stream.hpp.
self_type & flags_include_all (uint32_t value) | inline |
Only use reads with all bits set in value present in the FLAG field of the read.
This is equivalent to the -f / --require-flags setting in samtools view.
The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, or space-separated list of flag names. We are more lenient in parsing flag names than samtools, and also allow different capitalization and delimiters such as dashes and underscores in the flag names.
See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details on the flag values, and see https://www.htslib.org/doc/samtools-view.html for their usage in samtools.
Definition at line 443 of file sam_variant_input_stream.hpp.
uint32_t flags_include_any () const | inline |
Definition at line 449 of file sam_variant_input_stream.hpp.
self_type & flags_include_any (uint32_t value) | inline |
Only use reads with any bits set in value present in the FLAG field of the read.
This is equivalent to the --rf / --incl-flags / --include-flags setting in samtools view.
The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, or space-separated list of flag names. We are more lenient in parsing flag names than samtools, and also allow different capitalization and delimiters such as dashes and underscores in the flag names.
See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details on the flag values, and see https://www.htslib.org/doc/samtools-view.html for their usage in samtools.
Definition at line 472 of file sam_variant_input_stream.hpp.
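The difference between the "all bits" and "any bits" variants of the four flag settings is easiest to see as bit mask tests. The following is a minimal stand-alone sketch, not the genesis implementation: the FlagFilter struct and passes_flag_filter() are hypothetical names, and it assumes that a mask of 0 leaves the respective filter inactive and that all four tests have to pass, as in samtools view.

```cpp
#include <cstdint>

// Hypothetical settings struct mirroring the four setters; not the genesis implementation.
struct FlagFilter
{
    uint32_t include_all = 0; // like samtools view -f
    uint32_t include_any = 0; // like samtools view --rf
    uint32_t exclude_all = 0; // like samtools view -G
    uint32_t exclude_any = 0; // like samtools view -F
};

// Assumption: a mask of 0 means that the respective filter is not applied.
inline bool passes_flag_filter( uint32_t flag, FlagFilter const& f )
{
    // include_all: every bit of the mask has to be set in the FLAG field.
    if( ( flag & f.include_all ) != f.include_all ) return false;
    // include_any: at least one bit of the mask has to be set.
    if( f.include_any != 0 && ( flag & f.include_any ) == 0 ) return false;
    // exclude_all: reject only if every bit of the mask is set.
    if( f.exclude_all != 0 && ( flag & f.exclude_all ) == f.exclude_all ) return false;
    // exclude_any: reject if at least one bit of the mask is set.
    if( ( flag & f.exclude_any ) != 0 ) return false;
    return true;
}
```

For instance, with exclude_all set to 0x3, a read with FLAG 0x1 is kept while one with FLAG 0x3 is dropped; with exclude_any set to 0x3, both are dropped.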
std::string const & input_file () const | inline |
Definition at line 399 of file sam_variant_input_stream.hpp.
self_type & input_file (std::string const &value) | inline |
Set the input file.
This overwrites the file if it was already given in the constructor. Shall not be called after iteration has been started.
Definition at line 410 of file sam_variant_input_stream.hpp.
bool inverse_rg_tag_filter () const | inline |
Definition at line 727 of file sam_variant_input_stream.hpp.
self_type & inverse_rg_tag_filter (bool value) | inline |
Reverse the meaning of the list of sample names given by rg_tag_filter().
See there for details.
Definition at line 737 of file sam_variant_input_stream.hpp.
int max_accumulation_depth () const | inline |
Definition at line 631 of file sam_variant_input_stream.hpp.
self_type & max_accumulation_depth (int value) | inline |
Set the maximum read depth (coverage) at a given position that is actually processed.
The max_depth() setting excludes sites that have depth/coverage above a given value. However, one might want to keep those sites in the iteration, and yet limit the number of bases being tallied up. This setting is mostly meant as a memory saver, in order to avoid piling up too many sites at the same time. When set to a value greater than 0, only that many bases are considered, and any further reads overlapping the site are not taken into account.
Definition at line 646 of file sam_variant_input_stream.hpp.
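The effect of such a cap on the tally at a single position can be sketched as follows. This is a simplified stand-alone model, not the genesis implementation: BaseTally and tally_with_cap() are hypothetical names, the pileup column is given as a plain string of bases, and characters other than A, C, G, T are assumed not to count towards the cap.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Counts of A, C, G, T at one position (a simplified stand-in for a sample's counts).
using BaseTally = std::array<std::size_t, 4>;

// Tally the bases of one pileup column, stopping after max_accumulation_depth bases
// if that value is greater than 0. Hypothetical helper for illustration only.
inline BaseTally tally_with_cap( std::string const& pileup_bases, int max_accumulation_depth )
{
    BaseTally tally{};
    std::size_t used = 0;
    for( char const base : pileup_bases ) {
        if( max_accumulation_depth > 0 &&
            used >= static_cast<std::size_t>( max_accumulation_depth )
        ) {
            break; // further reads overlapping the site are not taken into account
        }
        switch( base ) {
            case 'A': ++tally[0]; break;
            case 'C': ++tally[1]; break;
            case 'G': ++tally[2]; break;
            case 'T': ++tally[3]; break;
            default: continue; // assumption: other characters are not counted at all
        }
        ++used;
    }
    return tally;
}
```

In this model, the position is still reported when the cap is reached, but only the first max_accumulation_depth bases contribute to the counts.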
int max_depth () const | inline |
Definition at line 614 of file sam_variant_input_stream.hpp.
self_type & max_depth (int value) | inline |
Set the maximum read depth (coverage) at a given position to be considered.
Positions in the genome with more than the given maximum depth are skipped. If set to 0 (default), the value is not used as a threshold.
Definition at line 625 of file sam_variant_input_stream.hpp.
int min_base_qual () const | inline |
Definition at line 577 of file sam_variant_input_stream.hpp.
self_type & min_base_qual (int value) | inline |
Set the minimum phred-scaled per-base quality score for a nucleotide to be considered.
Any base that has a quality score below the given value is not taken into account in the per-position tally of counts.
Definition at line 588 of file sam_variant_input_stream.hpp.
int min_depth () const | inline |
Definition at line 598 of file sam_variant_input_stream.hpp.
self_type & min_depth (int value) | inline |
Set the minimum read depth (coverage) at a given position to be considered.
Positions in the genome with fewer than the given minimum depth are skipped.
Definition at line 608 of file sam_variant_input_stream.hpp.
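Taken together, the min_depth() and max_depth() settings describe a simple range test on the depth of a position. The predicate below is a hypothetical stand-alone sketch of that test, not the genesis implementation; it follows the descriptions in that a max_depth of 0 disables the upper threshold.

```cpp
// Hypothetical predicate, not the genesis implementation: decide whether a position
// with the given read depth is kept, based on the two depth thresholds.
inline bool depth_is_accepted( int depth, int min_depth, int max_depth )
{
    // Positions with fewer reads than the minimum depth are skipped.
    if( depth < min_depth ) return false;
    // Positions with more reads than the maximum depth are skipped;
    // a max_depth of 0 means that no upper threshold is used.
    if( max_depth > 0 && depth > max_depth ) return false;
    return true;
}
```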
int min_map_qual () const | inline |
Definition at line 559 of file sam_variant_input_stream.hpp.
self_type & min_map_qual (int value) | inline |
Set the minimum phred-scaled mapping quality score for a read in the input file to be considered.
Any read that is below the given value of mapping quality will be completely discarded, and its bases not taken into account.
Definition at line 571 of file sam_variant_input_stream.hpp.
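To get a feeling for what a phred-scaled threshold means, recall that a phred score Q corresponds to an error probability of 10^(-Q/10). The two helpers below are a hypothetical stand-alone sketch, not part of genesis, showing this conversion and the threshold test described above, where scores below the given value are discarded.

```cpp
#include <cmath>

// Phred scale: Q = -10 * log10( P ), hence P = 10^( -Q / 10 ).
inline double phred_to_error_probability( int quality )
{
    return std::pow( 10.0, -static_cast<double>( quality ) / 10.0 );
}

// Hypothetical helper: a read (or base) is kept if its score is not below the threshold.
inline bool passes_min_quality( int quality, int min_quality )
{
    return quality >= min_quality;
}
```

For example, the value 40 used with min_map_qual() in the usage example of this class corresponds to an error probability of 0.0001 for the mapping position of a read.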
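SamVariantInputStream & operator= (SamVariantInputStream &&) | default |
SamVariantInputStream & operator= (SamVariantInputStream const &) | default |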
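self_type & region_filter (std::shared_ptr< GenomeLocusSet > region_filter) | inline |
Set a region filter, so that only loci that are set in the given region_filter are used, and all others are skipped.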
This still needs some basic processing per position, as we are currently not using the htslib internal filters, but apply it afterwards. Still, this skips the base counting, so it is an advantage over filtering later on.
Definition at line 549 of file sam_variant_input_stream.hpp.
std::unordered_set< std::string > const & rg_tag_filter () const | inline |
Definition at line 699 of file sam_variant_input_stream.hpp.
self_type & rg_tag_filter (std::unordered_set< std::string > const &value) | inline |
Set the sample names used for filtering reads by their RG read group tag.
Only used when split_by_rg() is set to true. Reads that have an RG read group tag that appears in the header of the input file, but is not present in the value list given here (or in the constructor of the class), will be ignored. That is, they will also not appear in the "unaccounted" sample, independently of the setting of with_unaccounted_rg(). The unaccounted sample will only contain data from those reads that do not have an RG tag at all, or one that does not appear in the header.
See also inverse_rg_tag_filter() to invert this setting. That is, instead of only using samples based on the RG tags given in this list here, use all but the given RG tags.
When the given value list is empty, the filtering by RG read group tag is deactivated (which is also the default), independently of the inverse_rg_tag_filter() setting.
Definition at line 721 of file sam_variant_input_stream.hpp.
bool split_by_rg () const | inline |
Definition at line 656 of file sam_variant_input_stream.hpp.
self_type & split_by_rg (bool value) | inline |
If set to true, instead of reading all mapped reads as a single sample, split them by the @RG read group tag.
This way, multiple SampleCounts objects are created in the resulting Variant, one for each read group, and potentially an additional one for the unaccounted reads that do not have a read group, if with_unaccounted_rg() is also set.
Definition at line 669 of file sam_variant_input_stream.hpp.
bool with_unaccounted_rg () const | inline |
Definition at line 675 of file sam_variant_input_stream.hpp.
self_type & with_unaccounted_rg (bool value) | inline |
Decide whether to add a sample for reads without a read group, when splitting by @RG tag.
If split_by_rg() and this option are both set to true, also add a special sample for the reads without a read group, as the last SampleCounts object of the Variant. If this option here is however set to false, all reads without a read group tag or with an invalid read group tag (that does not appear in the header) are ignored. If split_by_rg() is not set to true, this option here is completely ignored.
See also rg_tag_filter() to sub-set the reads by RG, that is, to ignore reads that have a proper RG tag set, but that belong to a sample that shall be ignored.
Definition at line 693 of file sam_variant_input_stream.hpp.
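The interplay of rg_tag_filter(), inverse_rg_tag_filter(), and with_unaccounted_rg() when split_by_rg() is true can be summarized as a decision per read. The following stand-alone sketch models that decision as described in the documentation of these settings; it is not the genesis implementation, and ReadTarget, RgSettings, and classify_read() are hypothetical names introduced for illustration.

```cpp
#include <string>
#include <unordered_set>

// Where the bases of a read end up when splitting by read group.
enum class ReadTarget { kSample, kUnaccounted, kIgnored };

// Hypothetical settings struct mirroring the three setters; assumes split_by_rg() is true.
struct RgSettings
{
    std::unordered_set<std::string> rg_tag_filter;
    bool inverse_rg_tag_filter = false;
    bool with_unaccounted_rg = false;
};

// Classify a read by its RG tag. An empty rg means that the read has no RG tag at all.
inline ReadTarget classify_read(
    std::string const& rg,
    std::unordered_set<std::string> const& header_rgs,
    RgSettings const& settings
) {
    // Reads without an RG tag, or with one that is not in the header, are unaccounted.
    bool const known = !rg.empty() && header_rgs.count( rg ) > 0;
    if( !known ) {
        return settings.with_unaccounted_rg ? ReadTarget::kUnaccounted : ReadTarget::kIgnored;
    }
    // An empty filter list deactivates the filtering, independently of the inverse setting.
    if( settings.rg_tag_filter.empty() ) {
        return ReadTarget::kSample;
    }
    // Reads with a proper RG tag that is filtered out are ignored, and never unaccounted.
    bool const listed = settings.rg_tag_filter.count( rg ) > 0;
    return ( listed != settings.inverse_rg_tag_filter )
        ? ReadTarget::kSample
        : ReadTarget::kIgnored;
}
```

Note that in this model a read with a valid but filtered-out RG tag never ends up in the unaccounted sample, which matches the description of rg_tag_filter().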
using difference_type = std::ptrdiff_t |
Definition at line 115 of file sam_variant_input_stream.hpp.
using iterator_category = std::input_iterator_tag |
Definition at line 116 of file sam_variant_input_stream.hpp.
using pointer = value_type const* |
Definition at line 113 of file sam_variant_input_stream.hpp.
using reference = value_type const& |
Definition at line 114 of file sam_variant_input_stream.hpp.
using self_type = SamVariantInputStream |
Definition at line 111 of file sam_variant_input_stream.hpp.
using value_type = Variant |
Definition at line 112 of file sam_variant_input_stream.hpp.