A library for working with phylogenetic and population genetic data.
v0.32.0
SyncReader Class Reference

#include <genesis/population/format/sync_reader.hpp>

Detailed Description

Reader for PoPoolation2's "synchronized" files.

These files are a simple tally of the counts at each position and sample in a (m)pileup file. Sync files are structured as follows. Each line represents a position on a chromosome:

2R  2302    T   0:7:0:0:0:0 0:7:0:0:0:0
2R  2303    T   0:8:0:0:0:0 0:8:0:0:0:0
2R  2304    C   0:0:9:0:0:0 0:0:9:0:0:0
2R  2305    C   1:0:9:0:0:0 0:0:9:1:0:0

where:

  • col1: reference contig/chromosome
  • col2: position within the reference contig/chromosome
  • col3: reference character (base)
  • col4: allele frequencies of population number 1
  • col5: allele frequencies of population number 2
  • coln: allele frequencies of population number n

The allele frequencies are in the format A:T:C:G:N:D, i.e., the count of bases A, the count of bases T, and so on, with the deletion count at the end (character '*' in the mpileup).
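As an illustration of the A:T:C:G:N:D layout, the following is a minimal standalone sketch that parses one sample column using only the standard library. The names SyncCounts and parse_sync_column are hypothetical and not part of genesis; the actual reader works on its own input stream and count types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the tally of one sample; not a genesis type.
struct SyncCounts {
    std::size_t a = 0, t = 0, c = 0, g = 0, n = 0, d = 0;
};

// Parse one sample column such as "1:0:9:0:0:0", in the order A:T:C:G:N:D.
SyncCounts parse_sync_column(std::string const& column)
{
    std::array<std::size_t, 6> values{};
    std::istringstream stream(column);
    std::string field;
    std::size_t index = 0;
    while (std::getline(stream, field, ':')) {
        // Reject surplus fields, empty fields, and anything that is not a digit.
        if (index >= values.size() || field.empty() ||
            field.find_first_not_of("0123456789") != std::string::npos) {
            throw std::runtime_error("Malformed sync column: " + column);
        }
        assert(index < values.size());
        values[index++] = std::stoul(field);
    }
    if (index != values.size()) {
        throw std::runtime_error("Expected six counts in sync column: " + column);
    }
    SyncCounts result;
    result.a = values[0];
    result.t = values[1];
    result.c = values[2];
    result.g = values[3];
    result.n = values[4];
    result.d = values[5];
    return result;
}
```

For the column 1:0:9:0:0:0 from the example above, this yields one A, nine C, and zero for all other counts.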

See https://sourceforge.net/p/popoolation2/wiki/Tutorial/ for the original format description. Unfortunately, the file format does not support sample names.

We here support an ad-hoc extension of the sync format that offers a header line to store sample names, which are usually not available in the sync format. We currently expect a fixed format:

#chr    pos ref S1 S2...

starting with a number sign (hashtag) # symbol, optionally followed by a tab character, and then listing the fixed columns chr, pos, and ref, followed by the sample name columns, all tab-delimited.

We furthermore allow a custom extension of the format, where .:.:.:.:.:. represents missing data. See allow_missing() and https://github.com/lczech/grenedalf/issues/4 for details.

Note on our internal data representation: The reader returns a Variant per line, where most of the data is set based on the sync input content. However, the sync format does not store alternative bases. By default, we hence leave the alternative base as 'N'. See the guess_alt_base() setting to instead estimate the alternative base from the data.

Definition at line 92 of file sync_reader.hpp.

Public Member Functions

 SyncReader ()=default
 
 SyncReader (SyncReader &&)=default
 
 SyncReader (SyncReader const &)=default
 
 ~SyncReader ()=default
 
bool allow_missing () const
 
SyncReader & allow_missing (bool value)
 Set whether to allow missing data in the format suggested by Kapun et al. More...
 
bool guess_alt_base () const
 
SyncReader & guess_alt_base (bool value)
 Set to guess the alternative base of the Variant, instead of leaving it at 'N'. More...
 
SyncReader & operator= (SyncReader &&)=default
 
SyncReader & operator= (SyncReader const &)=default
 
bool parse_line (utils::InputStream &input_stream, Variant &sample_set) const
 Read a single line into the provided Variant. More...
 
bool parse_line (utils::InputStream &input_stream, Variant &sample_set, std::vector< bool > const &sample_filter) const
 Read a single line into the provided Variant, using a subset of the sample columns. More...
 
std::vector< Variant > read (std::shared_ptr< utils::BaseInputSource > source) const
 Read the whole input into a vector of Variants. More...
 
std::vector< Variant > read (std::shared_ptr< utils::BaseInputSource > source, std::vector< bool > const &sample_filter) const
 
std::vector< std::string > read_header (utils::InputStream &input_stream) const
 Read the header line, if there is one. Do nothing if there is not. More...
 
std::vector< std::string > read_header (utils::InputStream &input_stream, std::vector< bool > const &sample_filter) const
 Read the header line, if there is one, only reading specific columns. Do nothing if there is not. More...
 

Constructor & Destructor Documentation

◆ SyncReader() [1/3]

SyncReader ( )
default

◆ ~SyncReader()

~SyncReader ( )
default

◆ SyncReader() [2/3]

SyncReader ( SyncReader const &  )
default

◆ SyncReader() [3/3]

SyncReader ( SyncReader &&  )
default

Member Function Documentation

◆ allow_missing() [1/2]

bool allow_missing ( ) const
inline

Definition at line 230 of file sync_reader.hpp.

◆ allow_missing() [2/2]

SyncReader& allow_missing ( bool  value)
inline

Set whether to allow missing data in the format suggested by Kapun et al.

In order to distinguish missing/masked data from true zero-count positions, Kapun suggested using the notation .:.:.:.:.:. for masked sites. When this setting is activated (default), we accept this notation and read such an entry as a zero-count site with SampleCounts::status set to SampleCountsFilterTag::kMissing. If all samples at a position are missing, Variant::status is also set to VariantFilterTag::kMissing. See https://github.com/lczech/grenedalf/issues/4 for details.
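The rule described here can be sketched in standalone form as follows. MissingInfo and flag_missing are hypothetical names used for illustration only; genesis records this information in SampleCounts::status and Variant::status. That the notation is rejected with an exception when the setting is off is an assumption of this sketch, not something stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type for this sketch; not a genesis type.
struct MissingInfo {
    std::vector<bool> sample_missing; // one flag per sample column
    bool variant_missing = false;     // true only if every sample is missing
};

// Flag sample columns that use the ".:.:.:.:.:." notation for missing data.
MissingInfo flag_missing(std::vector<std::string> const& columns, bool allow_missing)
{
    MissingInfo info;
    std::size_t missing_count = 0;
    for (auto const& column : columns) {
        bool const is_missing = (column == ".:.:.:.:.:.");
        if (is_missing && !allow_missing) {
            // Assumption: with the setting off, the notation is an error.
            throw std::runtime_error("Missing data notation is not allowed");
        }
        info.sample_missing.push_back(is_missing);
        if (is_missing) {
            ++missing_count;
        }
    }
    assert(info.sample_missing.size() == columns.size());
    // The whole position counts as missing only if all of its samples are.
    info.variant_missing = !columns.empty() && missing_count == columns.size();
    return info;
}
```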

Definition at line 245 of file sync_reader.hpp.

◆ guess_alt_base() [1/2]

bool guess_alt_base ( ) const
inline

Definition at line 206 of file sync_reader.hpp.

◆ guess_alt_base() [2/2]

SyncReader& guess_alt_base ( bool  value)
inline

Set to guess the alternative base of the Variant, instead of leaving it at 'N'.

Excluding the reference base, we use the base of the remaining three that has the highest total count across all samples, unless all of them are zero, in which case we do not set the alternative base. We also skip cases where the ref is not in ACGT, as then the alternative base is also meaningless. In these cases, the alternative will be N.

Note though that this can lead to conflicts between different files, if the second most abundant nucleotide differs between them, e.g., in non-biallelic positions. Usually we can deal with this, see for example VariantParallelInputStream::Iterator::joined_variant(). Still, it is important to keep this in mind.
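A minimal standalone sketch of the guessing rule described above is given here. The BaseCounts layout (counts in the order A, C, G, T) and the free function guess_alt_base are assumptions made for this illustration and do not mirror the genesis types. The tie-breaking (the first base in ACGT order wins) is also an assumption, as the description above does not specify it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-sample counts in the order A, C, G, T.
using BaseCounts = std::array<std::size_t, 4>;

// Guess the alternative base: the non-reference base with the highest total
// count across all samples, or 'N' if that is not possible.
char guess_alt_base(char ref, std::vector<BaseCounts> const& samples)
{
    static std::string const bases = "ACGT";
    std::size_t const ref_index = bases.find(ref);
    if (ref_index == std::string::npos) {
        return 'N'; // ref not in ACGT, so no meaningful alternative
    }
    assert(ref_index < bases.size());

    // Sum the counts of each base over all samples.
    BaseCounts totals{};
    for (auto const& sample : samples) {
        for (std::size_t i = 0; i < totals.size(); ++i) {
            totals[i] += sample[i];
        }
    }

    // Pick the most abundant of the three remaining bases, if any is non-zero.
    char alt = 'N';
    std::size_t best = 0;
    for (std::size_t i = 0; i < totals.size(); ++i) {
        if (i != ref_index && totals[i] > best) {
            best = totals[i];
            alt = bases[i];
        }
    }
    return alt;
}
```

Because the totals depend on the samples present, two files covering the same position can yield different guesses, which is the conflict mentioned above.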

Definition at line 224 of file sync_reader.hpp.

◆ operator=() [1/2]

SyncReader& operator= ( SyncReader &&  )
default

◆ operator=() [2/2]

SyncReader& operator= ( SyncReader const &  )
default

◆ parse_line() [1/2]

bool parse_line ( utils::InputStream &  input_stream,
Variant &  sample_set 
) const

Read a single line into the provided Variant.

Returns whether the reading was successful, i.e., false if the input is at its end.
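The return value supports a simple read loop that stops at the end of the input. The following standalone sketch mimics that contract on a std::istream; SyncRecord and parse_sync_line are hypothetical names, the sample columns are kept as raw strings, and none of this is the genesis implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record for one line of a sync file; not the genesis Variant.
struct SyncRecord {
    std::string chromosome;
    std::size_t position = 0;
    char reference_base = 'N';
    std::vector<std::string> sample_columns; // raw "A:T:C:G:N:D" strings
};

// Read one line into the record. Returns false if the input is at its end,
// in which case the record is left untouched.
bool parse_sync_line(std::istream& input, SyncRecord& record)
{
    std::string line;
    if (!std::getline(input, line)) {
        return false;
    }
    std::istringstream fields(line);
    std::string ref;
    if (!(fields >> record.chromosome >> record.position >> ref) || ref.size() != 1) {
        throw std::runtime_error("Malformed sync line: " + line);
    }
    assert(ref.size() == 1);
    record.reference_base = ref[0];

    // All remaining whitespace-separated fields are sample columns.
    record.sample_columns.clear();
    std::string column;
    while (fields >> column) {
        record.sample_columns.push_back(column);
    }
    return true;
}
```

A caller would typically use it as the condition of a while loop, processing the record inside the loop body.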

Definition at line 160 of file sync_reader.cpp.

◆ parse_line() [2/2]

bool parse_line ( utils::InputStream &  input_stream,
Variant &  sample_set,
std::vector< bool > const &  sample_filter 
) const

Read a single line into the provided Variant, using a subset of the sample columns.

This overload is the equivalent of the read() overload that takes a sample_filter. See there for details.

Definition at line 167 of file sync_reader.cpp.

◆ read() [1/2]

std::vector< Variant > read ( std::shared_ptr< utils::BaseInputSource >  source) const

Read the whole input into a vector of Variants.

Definition at line 120 of file sync_reader.cpp.

◆ read() [2/2]

std::vector< Variant > read ( std::shared_ptr< utils::BaseInputSource >  source,
std::vector< bool > const &  sample_filter 
) const

Read the whole input into a vector of Variants, using a subset of the sample columns.

The overload expects a vector indicating which columns to read and which to skip. The Variants produced for each line of input only contain as many entries as there are true values in the provided sample_filter. If the size of the sample_filter does not match the number of sample columns, an exception is thrown.
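The filter semantics can be sketched independently of the reader as follows. The helper apply_sample_filter is hypothetical and only illustrates the selection and size check; the exception type used by genesis is not stated here, so std::invalid_argument is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Keep only those entries whose position is marked true in the filter.
// Throws if the filter size does not match the number of entries.
template<typename T>
std::vector<T> apply_sample_filter(
    std::vector<T> const& columns,
    std::vector<bool> const& sample_filter
) {
    if (sample_filter.size() != columns.size()) {
        throw std::invalid_argument("Sample filter size does not match number of samples");
    }
    std::vector<T> kept;
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (sample_filter[i]) {
            kept.push_back(columns[i]);
        }
    }
    assert(kept.size() <= columns.size());
    return kept;
}
```

With three sample columns and the filter { true, false, true }, the result contains the first and third column only.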

Definition at line 138 of file sync_reader.cpp.

◆ read_header() [1/2]

std::vector< std::string > read_header ( utils::InputStream &  input_stream) const

Read the header line, if there is one. Do nothing if there is not.

Has to be called at the start of reading a source file, as otherwise the reading will have already moved on from the header line.

This is support for an ad-hoc extension of the sync format that offers a header line to store sample names, which are usually not available in the sync format. We currently expect a fixed format:

#chr    pos ref S1 S2...

starting with a number sign (hashtag) # symbol, optionally followed by a tab character, and then listing the fixed columns chr, pos, and ref, followed by the sample name columns, all tab-delimited.

The function returns the values of the sample name columns, i.e., the sample names.
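To make the expected header layout concrete, here is a minimal standalone sketch that extracts the sample names from a header line held in a std::string. The function parse_sync_header is a hypothetical name for this illustration. Returning an empty vector for a non-header line and throwing on a header without the fixed columns are assumptions of this sketch, not documented behavior of read_header().

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Extract the sample names from a header line such as "#chr\tpos\tref\tS1\tS2".
// Returns an empty vector if the line does not start with '#'.
std::vector<std::string> parse_sync_header(std::string const& line)
{
    if (line.empty() || line[0] != '#') {
        return {};
    }
    std::size_t start = 1;
    if (start < line.size() && line[start] == '\t') {
        ++start; // optional tab directly after the '#'
    }

    // Split the rest of the line at tab characters.
    std::vector<std::string> fields;
    std::istringstream stream(line.substr(start));
    std::string field;
    while (std::getline(stream, field, '\t')) {
        fields.push_back(field);
    }

    if (fields.size() < 3 || fields[0] != "chr" || fields[1] != "pos" || fields[2] != "ref") {
        throw std::runtime_error("Sync header does not start with chr, pos, ref");
    }
    assert(fields.size() >= 3);
    return std::vector<std::string>(fields.begin() + 3, fields.end());
}
```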

Definition at line 56 of file sync_reader.cpp.

◆ read_header() [2/2]

std::vector< std::string > read_header ( utils::InputStream &  input_stream,
std::vector< bool > const &  sample_filter 
) const

Read the header line, if there is one, only reading specific columns. Do nothing if there is not.

This overload of the function additionally takes a vector indicating which sample names to read and return (where sample_filter is true), and ignores the rest (where sample_filter is false). The size of sample_filter has to match the number of sample name columns; an exception is thrown otherwise.

This function is hence meant to match the read() and parse_line() overloads that also take this type of filter.

Definition at line 85 of file sync_reader.cpp.


The documentation for this class was generated from the following files: