A library for working with phylogenetic and population genetic data.
v0.32.0
SamVariantInputStream Class Reference

#include <genesis/population/format/sam_variant_input_stream.hpp>

Detailed Description

Input stream for SAM/BAM/CRAM files that produces a Variant per genome position.

We expect the input file to be sorted by position. Positions with no reads overlapping are skipped.

Example usage:

auto sam_it = SamVariantInputStream( "/path/to/file.sam" );
sam_it.min_map_qual( 40 );
for( auto const& var : sam_it ) {
    std::cout << var.chromosome << "\t" << var.position << "\t";
    for( auto const& bs : var.samples ) {
        std::cout << "\t";
        to_sync( bs, std::cout );
    }
    std::cout << "\n";
}

By default, as in the example above, all reads are treated as belonging to a single sample, so the inner loop over samples only ever visits the one SampleCounts object stored in the Variant. Reads can also be split by read group (@RG); see split_by_rg() and with_unaccounted_rg() for details. In that case, the Variant contains one SampleCounts object per read group, plus potentially a special one for unaccounted reads that have no proper RG. The read groups can be filtered further with rg_tag_filter(), so that only certain RG tags are produced as samples.

Definition at line 103 of file sam_variant_input_stream.hpp.

Public Member Functions

 SamVariantInputStream ()
 Create a default instance, with no input. This is also the past-the-end iterator.
 
 SamVariantInputStream (SamVariantInputStream &&)=default
 
 SamVariantInputStream (SamVariantInputStream const &)=default
 
 SamVariantInputStream (std::string const &input_file)
 
 SamVariantInputStream (std::string const &input_file, std::unordered_set< std::string > const &rg_tag_filter, bool inverse_rg_tag_filter=false)
 
 ~SamVariantInputStream ()=default
 
Iterator begin () const
 
Iterator end () const
 
uint32_t flags_exclude_all () const
 
self_type & flags_exclude_all (uint32_t value)
 Do not use reads with all bits set in value present in the FLAG field of the read.
 
uint32_t flags_exclude_any () const
 
self_type & flags_exclude_any (uint32_t value)
 Do not use reads with any bits set in value present in the FLAG field of the read.
 
uint32_t flags_include_all () const
 
self_type & flags_include_all (uint32_t value)
 Only use reads with all bits set in value present in the FLAG field of the read.
 
uint32_t flags_include_any () const
 
self_type & flags_include_any (uint32_t value)
 Only use reads with any bits set in value present in the FLAG field of the read.
 
std::string const & input_file () const
 
self_type & input_file (std::string const &value)
 Set the input file.
 
bool inverse_rg_tag_filter () const
 
self_type & inverse_rg_tag_filter (bool value)
 Reverse the meaning of the list of sample names given by rg_tag_filter().
 
int max_accumulation_depth () const
 
self_type & max_accumulation_depth (int value)
 Set the maximum read depth (coverage) at a given position that is actually processed.
 
int max_depth () const
 
self_type & max_depth (int value)
 Set the maximum read depth (coverage) at a given position to be considered.
 
int min_base_qual () const
 
self_type & min_base_qual (int value)
 Set the minimum phred-scaled per-base quality score for a nucleotide to be considered.
 
int min_depth () const
 
self_type & min_depth (int value)
 Set the minimum read depth (coverage) at a given position to be considered.
 
int min_map_qual () const
 
self_type & min_map_qual (int value)
 Set the minimum phred-scaled mapping quality score for a read in the input file to be considered.
 
SamVariantInputStream & operator= (SamVariantInputStream &&)=default
 
SamVariantInputStream & operator= (SamVariantInputStream const &)=default
 
self_type & region_filter (std::shared_ptr< GenomeLocusSet > region_filter)
 Set a region filter, so that only loci contained in the given set are used, and all others are skipped.
 
std::unordered_set< std::string > const & rg_tag_filter () const
 
self_type & rg_tag_filter (std::unordered_set< std::string > const &value)
 Set the sample names used for filtering reads by their RG read group tag.
 
bool split_by_rg () const
 
self_type & split_by_rg (bool value)
 If set to true, instead of reading all mapped reads as a single sample, split them by the @RG read group tag.
 
bool with_unaccounted_rg () const
 
self_type & with_unaccounted_rg (bool value)
 Decide whether to add a sample for reads without a read group, when splitting by @RG tag.
 

Public Types

using difference_type = std::ptrdiff_t
 
using iterator_category = std::input_iterator_tag
 
using pointer = value_type const *
 
using reference = value_type const &
 
using self_type = SamVariantInputStream
 
using value_type = Variant
 

Classes

class  Iterator
 Iterator over loci of the input sources.
 

Constructor & Destructor Documentation

◆ SamVariantInputStream() [1/5]

Create a default instance, with no input. This is also the past-the-end iterator.

Definition at line 347 of file sam_variant_input_stream.hpp.

◆ SamVariantInputStream() [2/5]

SamVariantInputStream ( std::string const &  input_file)
inline explicit

Definition at line 351 of file sam_variant_input_stream.hpp.

◆ SamVariantInputStream() [3/5]

SamVariantInputStream ( std::string const &  input_file,
std::unordered_set< std::string > const &  rg_tag_filter,
bool  inverse_rg_tag_filter = false 
)

Definition at line 947 of file sam_variant_input_stream.cpp.

◆ ~SamVariantInputStream()

~SamVariantInputStream ( )
default

◆ SamVariantInputStream() [4/5]

◆ SamVariantInputStream() [5/5]

Member Function Documentation

◆ begin()

Iterator begin ( ) const
inline

Definition at line 375 of file sam_variant_input_stream.hpp.

◆ end()

Iterator end ( ) const
inline

Definition at line 390 of file sam_variant_input_stream.hpp.

◆ flags_exclude_all() [1/2]

uint32_t flags_exclude_all ( ) const
inline

Definition at line 478 of file sam_variant_input_stream.hpp.

◆ flags_exclude_all() [2/2]

self_type& flags_exclude_all ( uint32_t  value)
inline

Do not use reads with all bits set in value present in the FLAG field of the read.

This is equivalent to the -G setting in samtools view.

The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, or space-separated list of flag names. We are more lenient in parsing flag names than samtools, and also allow different capitalization and delimiters such as dashes and underscores in the flag names.

See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details on the flag values, and see https://www.htslib.org/doc/samtools-view.html for their usage in samtools.

See also
flags_include_all( uint32_t ), flags_include_any( uint32_t ), flags_exclude_any( uint32_t )

Definition at line 501 of file sam_variant_input_stream.hpp.
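
The numeric part of the value syntax described above can be illustrated with a small standalone sketch. The helper parse_numeric_flag() is hypothetical and not part of genesis; it only covers the hex, octal, and decimal forms, not the lists of flag names. It relies on std::stoul with base 0, which selects the base from the prefix, and which, unlike the pattern given above, also accepts lowercase hex digits.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration only (not a genesis function):
// parse a numeric SAM flag value given as hex ("0x..."), octal ("0..."),
// or decimal text. Flag names are not handled here.
inline uint32_t parse_numeric_flag( std::string const& text )
{
    // Require a leading digit, so that signs and whitespace are rejected.
    if( text.empty() || !std::isdigit( static_cast<unsigned char>( text[0] ))) {
        throw std::invalid_argument( "Not a numeric flag value: " + text );
    }

    // Base 0 lets std::stoul pick hex, octal, or decimal from the prefix.
    std::size_t consumed = 0;
    unsigned long const value = std::stoul( text, &consumed, 0 );
    if( consumed != text.size() ) {
        throw std::invalid_argument( "Trailing characters in flag value: " + text );
    }
    if( value > 0xFFFFFFFFul ) {
        throw std::out_of_range( "Flag value too large: " + text );
    }
    return static_cast<uint32_t>( value );
}
```

With this sketch, the strings "0x10", "020", and "16" all parse to the same value, 16, which is the "read reverse strand" bit of the SAM FLAG field.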

◆ flags_exclude_any() [1/2]

uint32_t flags_exclude_any ( ) const
inline

Definition at line 508 of file sam_variant_input_stream.hpp.

◆ flags_exclude_any() [2/2]

self_type& flags_exclude_any ( uint32_t  value)
inline

Do not use reads with any bits set in value present in the FLAG field of the read.

This is equivalent to the -F / --excl-flags / --exclude-flags setting in samtools view.

The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, or space-separated list of flag names. We are more lenient in parsing flag names than samtools, and also allow different capitalization and delimiters such as dashes and underscores in the flag names.

See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details on the flag values, and see https://www.htslib.org/doc/samtools-view.html for their usage in samtools.

See also
flags_include_all( uint32_t ), flags_include_any( uint32_t ), flags_exclude_all( uint32_t )

Definition at line 531 of file sam_variant_input_stream.hpp.

◆ flags_include_all() [1/2]

uint32_t flags_include_all ( ) const
inline

Definition at line 420 of file sam_variant_input_stream.hpp.

◆ flags_include_all() [2/2]

self_type& flags_include_all ( uint32_t  value)
inline

Only use reads with all bits set in value present in the FLAG field of the read.

This is equivalent to the -f / --require-flags setting in samtools view.

The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, or space-separated list of flag names. We are more lenient in parsing flag names than samtools, and also allow different capitalization and delimiters such as dashes and underscores in the flag names.

See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details on the flag values, and see https://www.htslib.org/doc/samtools-view.html for their usage in samtools.

See also
flags_include_any( uint32_t ), flags_exclude_all( uint32_t ), flags_exclude_any( uint32_t )

Definition at line 443 of file sam_variant_input_stream.hpp.

◆ flags_include_any() [1/2]

uint32_t flags_include_any ( ) const
inline

Definition at line 449 of file sam_variant_input_stream.hpp.

◆ flags_include_any() [2/2]

self_type& flags_include_any ( uint32_t  value)
inline

Only use reads with any bits set in value present in the FLAG field of the read.

This is equivalent to the --rf / --incl-flags / --include-flags setting in samtools view.

The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, or space-separated list of flag names. We are more lenient in parsing flag names than samtools, and also allow different capitalization and delimiters such as dashes and underscores in the flag names.

See http://www.htslib.org/doc/samtools-flags.html and https://broadinstitute.github.io/picard/explain-flags.html for details on the flag values, and see https://www.htslib.org/doc/samtools-view.html for their usage in samtools.

See also
flags_include_all( uint32_t ), flags_exclude_all( uint32_t ), flags_exclude_any( uint32_t )

Definition at line 472 of file sam_variant_input_stream.hpp.
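
How the four flag settings combine can be sketched as plain bit tests. The following standalone code mirrors the semantics of the corresponding samtools view options named above; the struct FlagFilters and the function read_passes() are hypothetical names for illustration, not the genesis implementation. It assumes that a value of 0 leaves a filter inactive, as in samtools.

```cpp
#include <cstdint>

// Hypothetical container for the four flag settings (illustration only).
struct FlagFilters {
    uint32_t include_all = 0; // like samtools view -f
    uint32_t include_any = 0; // like samtools view --rf
    uint32_t exclude_all = 0; // like samtools view -G
    uint32_t exclude_any = 0; // like samtools view -F
};

// Return true if a read with the given FLAG field would be used.
inline bool read_passes( uint32_t flag, FlagFilters const& f )
{
    // All required bits have to be present.
    if(( flag & f.include_all ) != f.include_all ) {
        return false;
    }
    // At least one of the requested bits has to be present (if any are given).
    if( f.include_any != 0 && ( flag & f.include_any ) == 0 ) {
        return false;
    }
    // Discard only if every one of the given bits is present.
    if( f.exclude_all != 0 && ( flag & f.exclude_all ) == f.exclude_all ) {
        return false;
    }
    // Discard if any one of the given bits is present.
    if(( flag & f.exclude_any ) != 0 ) {
        return false;
    }
    return true;
}
```

For example, with exclude_any set to 0x4 | 0x400 (unmapped or duplicate), a read whose FLAG is 0x4 is discarded, whereas with exclude_all set to the same value it is kept, because only one of the two bits is set.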

◆ input_file() [1/2]

std::string const& input_file ( ) const
inline

Definition at line 399 of file sam_variant_input_stream.hpp.

◆ input_file() [2/2]

self_type& input_file ( std::string const &  value)
inline

Set the input file.

This overwrites the file if it was already given in the constructor. It must not be called after iteration has started.

Definition at line 410 of file sam_variant_input_stream.hpp.

◆ inverse_rg_tag_filter() [1/2]

bool inverse_rg_tag_filter ( ) const
inline

Definition at line 727 of file sam_variant_input_stream.hpp.

◆ inverse_rg_tag_filter() [2/2]

self_type& inverse_rg_tag_filter ( bool  value)
inline

Reverse the meaning of the list of sample names given by rg_tag_filter().

See there for details.

Definition at line 737 of file sam_variant_input_stream.hpp.

◆ max_accumulation_depth() [1/2]

int max_accumulation_depth ( ) const
inline

Definition at line 631 of file sam_variant_input_stream.hpp.

◆ max_accumulation_depth() [2/2]

self_type& max_accumulation_depth ( int  value)
inline

Set the maximum read depth (coverage) at a given position that is actually processed.

The max_depth() setting excludes sites that have depth/coverage above a given value. However, one might want to keep those sites in the iteration, and yet limit the number of bases being tallied up. This setting is mostly meant as a memory saver, in order to avoid piling up too many sites at the same time. When set to a value greater than 0, only that many bases are considered, and any further reads overlapping the site are not taken into account.

Definition at line 646 of file sam_variant_input_stream.hpp.
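
The difference to max_depth() can be made concrete with a standalone sketch: the site is kept, but only the first bases up to the cap are tallied. The function tally_with_cap() and the BaseTally type are hypothetical stand-ins, not genesis types, and the sketch assumes that the bases are simply taken in the order in which they are encountered.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

// Counts of A, C, G, T at one position; a stand-in for illustration only.
using BaseTally = std::array<std::size_t, 4>;

// Tally the bases of one pileup column, given as one character per read.
// A cap of 0 or less means that all bases are tallied.
inline BaseTally tally_with_cap( std::string const& column, int max_accumulation_depth )
{
    std::size_t limit = column.size();
    if( max_accumulation_depth > 0 ) {
        limit = std::min( limit, static_cast<std::size_t>( max_accumulation_depth ));
    }

    BaseTally counts{{ 0, 0, 0, 0 }};
    for( std::size_t i = 0; i < limit; ++i ) {
        switch( column[i] ) {
            case 'A': ++counts[0]; break;
            case 'C': ++counts[1]; break;
            case 'G': ++counts[2]; break;
            case 'T': ++counts[3]; break;
            default: break; // other characters are not counted
        }
    }
    return counts;
}
```

For the column "AACGT", a cap of 3 tallies only the first three bases (two A and one C), while a cap of 0 tallies all five.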

◆ max_depth() [1/2]

int max_depth ( ) const
inline

Definition at line 614 of file sam_variant_input_stream.hpp.

◆ max_depth() [2/2]

self_type& max_depth ( int  value)
inline

Set the maximum read depth (coverage) at a given position to be considered.

Positions in the genome with more than the given maximum depth are skipped. If set to 0 (default), the value is not used as a threshold.

Definition at line 625 of file sam_variant_input_stream.hpp.

◆ min_base_qual() [1/2]

int min_base_qual ( ) const
inline

Definition at line 577 of file sam_variant_input_stream.hpp.

◆ min_base_qual() [2/2]

self_type& min_base_qual ( int  value)
inline

Set the minimum phred-scaled per-base quality score for a nucleotide to be considered.

Any base that has a quality score below the given value is not taken into account in the per-position tally of counts.

Definition at line 588 of file sam_variant_input_stream.hpp.

◆ min_depth() [1/2]

int min_depth ( ) const
inline

Definition at line 598 of file sam_variant_input_stream.hpp.

◆ min_depth() [2/2]

self_type& min_depth ( int  value)
inline

Set the minimum read depth (coverage) at a given position to be considered.

Positions in the genome with fewer than the given minimum depth are skipped.

Definition at line 608 of file sam_variant_input_stream.hpp.
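
Taken together, min_depth() and max_depth() describe an accepted depth range per position. A minimal standalone sketch of that check, using the hypothetical function depth_is_accepted() (not a genesis function), and assuming that both thresholds are inclusive:

```cpp
// Hypothetical helper for illustration only: decide whether a position
// with the given read depth is kept, following the descriptions of
// min_depth() and max_depth(). A max_depth of 0 disables the upper bound.
inline bool depth_is_accepted( int depth, int min_depth, int max_depth )
{
    if( depth < min_depth ) {
        return false; // fewer reads than the minimum: position is skipped
    }
    if( max_depth > 0 && depth > max_depth ) {
        return false; // more reads than the maximum: position is skipped
    }
    return true;
}
```

With min_depth = 10 and max_depth = 100, positions with depth 5 or 500 are skipped, while positions with depth 10 or 100 are kept under the inclusive-threshold assumption.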

◆ min_map_qual() [1/2]

int min_map_qual ( ) const
inline

Definition at line 559 of file sam_variant_input_stream.hpp.

◆ min_map_qual() [2/2]

self_type& min_map_qual ( int  value)
inline

Set the minimum phred-scaled mapping quality score for a read in the input file to be considered.

Any read that is below the given value of mapping quality will be completely discarded, and its bases not taken into account.

Definition at line 571 of file sam_variant_input_stream.hpp.
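
The two quality thresholds act at different levels: min_map_qual() drops whole reads, while min_base_qual() drops individual bases. Both scores are phred-scaled, so a value Q corresponds to an error probability of 10^(-Q/10); for instance, 40 means 1 in 10000. The following standalone sketch shows the combined effect on one position; AlignedBase and count_usable_bases() are hypothetical names, not genesis types.

```cpp
#include <cstddef>
#include <vector>

// One base of one read covering a position (illustration only).
struct AlignedBase {
    char base;      // nucleotide at this position
    int  base_qual; // phred-scaled quality of this base
    int  map_qual;  // phred-scaled mapping quality of the whole read
};

// Count the bases at one position that survive both quality thresholds.
inline std::size_t count_usable_bases(
    std::vector<AlignedBase> const& column,
    int min_map_qual,
    int min_base_qual
) {
    std::size_t usable = 0;
    for( auto const& entry : column ) {
        if( entry.map_qual < min_map_qual ) {
            continue; // the whole read is discarded
        }
        if( entry.base_qual < min_base_qual ) {
            continue; // only this base is left out of the tally
        }
        ++usable;
    }
    return usable;
}
```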

◆ operator=() [1/2]

SamVariantInputStream& operator= ( SamVariantInputStream &&  )
default

◆ operator=() [2/2]

SamVariantInputStream& operator= ( SamVariantInputStream const &  )
default

◆ region_filter()

self_type& region_filter ( std::shared_ptr< GenomeLocusSet >  region_filter)
inline

Set a region filter, so that only loci contained in the given set are used, and all others are skipped.

This still needs some basic processing per position, as we are currently not using the htslib internal filters, but apply it afterwards. Still, this skips the base counting, so it is an advantage over filtering later on.

Definition at line 549 of file sam_variant_input_stream.hpp.

◆ rg_tag_filter() [1/2]

std::unordered_set<std::string> const& rg_tag_filter ( ) const
inline

Definition at line 699 of file sam_variant_input_stream.hpp.

◆ rg_tag_filter() [2/2]

self_type& rg_tag_filter ( std::unordered_set< std::string > const &  value)
inline

Set the sample names used for filtering reads by their RG read group tag.

Only used when split_by_rg() is set to true. Reads that have an RG read group tag that appears in the header of the input file, but is not present in the value list given here (or in the constructor of the class), will be ignored. That is, they will also not appear in the "unaccounted" sample, independently of the setting of with_unaccounted_rg(). The unaccounted sample will only contain data from those reads that do not have an RG tag at all, or one that does not appear in the header.

See also inverse_rg_tag_filter() to invert this setting. That is, instead of only using samples based on the RG tags given in this list, use all but the given RG tags.

When the given value list is empty, the filtering by RG read group tag is deactivated (which is also the default), independently of the inverse_rg_tag_filter() setting.

Definition at line 721 of file sam_variant_input_stream.hpp.

◆ split_by_rg() [1/2]

bool split_by_rg ( ) const
inline

Definition at line 656 of file sam_variant_input_stream.hpp.

◆ split_by_rg() [2/2]

self_type& split_by_rg ( bool  value)
inline

If set to true, instead of reading all mapped reads as a single sample, split them by the @RG read group tag.

This way, multiple SampleCounts objects are created in the resulting Variant, one for each read group, and potentially an additional one for the unaccounted reads that do not have a read group, if with_unaccounted_rg() is also set.

Definition at line 669 of file sam_variant_input_stream.hpp.

◆ with_unaccounted_rg() [1/2]

bool with_unaccounted_rg ( ) const
inline

Definition at line 675 of file sam_variant_input_stream.hpp.

◆ with_unaccounted_rg() [2/2]

self_type& with_unaccounted_rg ( bool  value)
inline

Decide whether to add a sample for reads without a read group, when splitting by @RG tag.

If split_by_rg() and this option are both set to true, a special sample for the reads without a read group is added as the last SampleCounts object of the Variant. If this option is set to false, all reads without a read group tag, or with an invalid read group tag (one that does not appear in the header), are ignored. If split_by_rg() is not set to true, this option is ignored entirely.

See also rg_tag_filter() to sub-set the reads by RG, that is, to ignore reads that have a proper RG tag set, but that belong to a sample that shall be ignored.

Definition at line 693 of file sam_variant_input_stream.hpp.
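
The interplay of split_by_rg(), with_unaccounted_rg(), rg_tag_filter(), and inverse_rg_tag_filter() described above can be summarized as a decision per read. The following standalone sketch encodes that decision; RgTarget, RgSettings, and route_read() are hypothetical names chosen for illustration and do not reflect how genesis implements this internally.

```cpp
#include <string>
#include <unordered_set>

// Where the bases of a read end up (illustration only).
enum class RgTarget {
    kSingleSample,      // not splitting: everything goes into one sample
    kReadGroupSample,   // the sample of the read's RG tag
    kUnaccountedSample, // the special last sample for reads without proper RG
    kIgnored            // the read is not used at all
};

// Hypothetical bundle of the four RG-related settings.
struct RgSettings {
    bool split_by_rg = false;
    bool with_unaccounted_rg = false;
    bool inverse_rg_tag_filter = false;
    std::unordered_set<std::string> rg_tag_filter;
};

// read_rg is the RG tag of the read, or empty if the read has none.
// header_rgs are the read groups listed in the header of the input file.
inline RgTarget route_read(
    std::string const& read_rg,
    std::unordered_set<std::string> const& header_rgs,
    RgSettings const& settings
) {
    if( !settings.split_by_rg ) {
        return RgTarget::kSingleSample;
    }

    // Reads without RG tag, or with one not in the header, are unaccounted.
    bool const known = !read_rg.empty() && header_rgs.count( read_rg ) > 0;
    if( !known ) {
        return settings.with_unaccounted_rg
            ? RgTarget::kUnaccountedSample
            : RgTarget::kIgnored;
    }

    // An empty filter list deactivates filtering, regardless of the inverse setting.
    if( settings.rg_tag_filter.empty() ) {
        return RgTarget::kReadGroupSample;
    }
    bool const listed = settings.rg_tag_filter.count( read_rg ) > 0;
    bool const keep = settings.inverse_rg_tag_filter ? !listed : listed;
    return keep ? RgTarget::kReadGroupSample : RgTarget::kIgnored;
}
```

Note that a read with a valid RG tag that is filtered out is ignored, and never moved to the unaccounted sample, which matches the description of rg_tag_filter().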

Member Typedef Documentation

◆ difference_type

using difference_type = std::ptrdiff_t

Definition at line 115 of file sam_variant_input_stream.hpp.

◆ iterator_category

using iterator_category = std::input_iterator_tag

Definition at line 116 of file sam_variant_input_stream.hpp.

◆ pointer

using pointer = value_type const*

Definition at line 113 of file sam_variant_input_stream.hpp.

◆ reference

using reference = value_type const&

Definition at line 114 of file sam_variant_input_stream.hpp.

◆ self_type

using self_type = SamVariantInputStream

Definition at line 111 of file sam_variant_input_stream.hpp.

◆ value_type

using value_type = Variant

Definition at line 112 of file sam_variant_input_stream.hpp.


The documentation for this class was generated from the following files: