#include <genesis/population/format/sync_reader.hpp>
Reader for PoPoolation2's "synchronized" files.
These files are a simple tally of the counts at each position and sample in a (m)pileup file. Sync files are structured as follows. Each line represents a position on a chromosome:
2R 2302 T 0:7:0:0:0:0 0:7:0:0:0:0
2R 2303 T 0:8:0:0:0:0 0:8:0:0:0:0
2R 2304 C 0:0:9:0:0:0 0:0:9:0:0:0
2R 2305 C 1:0:9:0:0:0 0:0:9:1:0:0
where:
The allele frequencies are in the format A:T:C:G:N:D, i.e., count of bases A, count of bases T, etc., with the deletion count at the end (character '*' in the mpileup).
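As an illustration of the column layout described above, the following is a minimal standalone sketch that splits one sample column into its six counts. It is not the parser used by SyncReader; the function name parse_sync_counts is hypothetical, and the strict digit check is an assumption of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of genesis: split one sample column such as
// "1:0:9:0:0:0" into its six counts, in the order A:T:C:G:N:D.
std::array<std::size_t, 6> parse_sync_counts( std::string const& field )
{
    std::array<std::size_t, 6> counts{};
    std::istringstream stream( field );
    std::string token;
    std::size_t index = 0;

    while( std::getline( stream, token, ':' )) {
        // Assumption of this sketch: each count is a plain non-negative integer.
        bool const is_number = !token.empty()
            && token.find_first_not_of( "0123456789" ) == std::string::npos;
        if( index >= counts.size() || !is_number ) {
            throw std::runtime_error( "Malformed sync field: " + field );
        }
        counts[ index++ ] = std::stoul( token );
    }
    if( index != counts.size() ) {
        throw std::runtime_error( "Expected six counts in sync field: " + field );
    }
    return counts;
}
```

For the second sample of the last example line, parse_sync_counts( "0:0:9:1:0:0" ) yields nine C and one G, with all other counts zero.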
See https://sourceforge.net/p/popoolation2/wiki/Tutorial/ for the original format description. Unfortunately, the file format does not support sample names.
We here support an ad-hoc extension of the sync format that offers a header line to store sample names, which are usually not available in the sync format. We currently expect a fixed format:
#chr pos ref S1 S2...
starting with a number sign (hashtag) # symbol, optionally followed by a tab character, and then listing the fixed columns chr, pos, and ref, followed by the sample name columns, all tab-delimited.
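The header layout above can be sketched as a small standalone function. This is only an illustration of the described format and not the implementation behind read_header(); the name parse_sync_header is hypothetical, and throwing on a malformed header is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of genesis: extract the sample names from a
// header line of the form "#chr<TAB>pos<TAB>ref<TAB>S1<TAB>S2...".
// Returns an empty vector if the line does not start with '#'.
std::vector<std::string> parse_sync_header( std::string const& line )
{
    if( line.empty() || line[0] != '#' ) {
        return {};
    }

    // Skip the '#' and the optional tab character directly after it.
    std::size_t start = 1;
    if( start < line.size() && line[start] == '\t' ) {
        ++start;
    }

    // Split the remainder at tabs.
    std::vector<std::string> columns;
    std::istringstream stream( line.substr( start ));
    std::string column;
    while( std::getline( stream, column, '\t' )) {
        columns.push_back( column );
    }

    // The three fixed columns have to come first (assumption: error otherwise).
    if( columns.size() < 3 || columns[0] != "chr" || columns[1] != "pos" || columns[2] != "ref" ) {
        throw std::runtime_error( "Sync header does not start with chr, pos, ref" );
    }
    return std::vector<std::string>( columns.begin() + 3, columns.end() );
}
```

A data line such as the ones in the example at the top is not treated as a header here, so the caller can fall back to unnamed samples in that case.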
We furthermore allow a custom extension of the format, where .:.:.:.:.:. represents missing data. See allow_missing() and https://github.com/lczech/grenedalf/issues/4 for details.
Note on our internal data representation: The reader returns a Variant per line, where most of the data is set based on the sync input content. However, the sync format does not store alternative bases. By default, we hence leave the alternative base as 'N'. See however the guess_alt_base() setting to instead estimate the alternative base from the data.
Definition at line 92 of file sync_reader.hpp.
Public Member Functions | |
SyncReader ()=default | |
SyncReader (SyncReader &&)=default | |
SyncReader (SyncReader const &)=default | |
~SyncReader ()=default | |
bool | allow_missing () const |
SyncReader & | allow_missing (bool value) |
Set whether to allow missing data in the format suggested by Kapun et al. More... | |
bool | guess_alt_base () const |
SyncReader & | guess_alt_base (bool value) |
Set to guess the alternative base of the Variant, instead of leaving it at 'N'. More... | |
SyncReader & | operator= (SyncReader &&)=default |
SyncReader & | operator= (SyncReader const &)=default |
bool | parse_line (utils::InputStream &input_stream, Variant &sample_set) const |
Read a single line into the provided Variant . More... | |
bool | parse_line (utils::InputStream &input_stream, Variant &sample_set, std::vector< bool > const &sample_filter) const |
Read a single line into the provided Variant , using a subset of the sample columns. More... | |
std::vector< Variant > | read (std::shared_ptr< utils::BaseInputSource > source) const |
Read the whole input into a vector of Variants. More... | |
std::vector< Variant > | read (std::shared_ptr< utils::BaseInputSource > source, std::vector< bool > const &sample_filter) const |
std::vector< std::string > | read_header (utils::InputStream &input_stream) const |
Read the header line, if there is one. Do nothing if there is not. More... | |
std::vector< std::string > | read_header (utils::InputStream &input_stream, std::vector< bool > const &sample_filter) const |
Read the header line, if there is one, only reading specific columns. Do nothing if there is not. More... | |
bool allow_missing () const | inline |
Definition at line 230 of file sync_reader.hpp.
SyncReader & allow_missing (bool value) | inline |
Set whether to allow missing data in the format suggested by Kapun et al.
In order to distinguish missing/masked data from true zero-count positions, Kapun suggested using the notation .:.:.:.:.:. for masked sites. When this is activated (default), we allow reading these, and output each of them as a zero-count site with SampleCounts::status set to SampleCountsFilterTag::kMissing. If all samples at a position are missing, Variant::status is also set to VariantFilterTag::kMissing. See https://github.com/lczech/grenedalf/issues/4 for details.
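The behavior described for missing data can be sketched in standalone form as follows. The types and names here (SketchSample, parse_sample, all_samples_missing) are hypothetical stand-ins and do not reproduce the genesis classes; in particular, a plain bool replaces the status tags, and throwing when the notation is encountered while not allowed is an assumption of this sketch.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the counts of one sample plus a missing flag.
struct SketchSample
{
    std::array<std::size_t, 6> counts{}; // order A:T:C:G:N:D, all zero initially
    bool missing = false;
};

// Parse one sample column, treating ".:.:.:.:.:." as missing data if allowed.
SketchSample parse_sample( std::string const& field, bool allow_missing )
{
    SketchSample sample;
    if( field == ".:.:.:.:.:." ) {
        // Assumption of this sketch: reject the notation when it is not allowed.
        if( !allow_missing ) {
            throw std::runtime_error( "Missing data notation not allowed" );
        }
        sample.missing = true; // counts stay at zero
        return sample;
    }

    std::istringstream stream( field );
    std::string token;
    std::size_t index = 0;
    while( std::getline( stream, token, ':' )) {
        bool const is_number = !token.empty()
            && token.find_first_not_of( "0123456789" ) == std::string::npos;
        if( index >= sample.counts.size() || !is_number ) {
            throw std::runtime_error( "Malformed sync field: " + field );
        }
        sample.counts[ index++ ] = std::stoul( token );
    }
    if( index != sample.counts.size() ) {
        throw std::runtime_error( "Expected six counts in sync field: " + field );
    }
    return sample;
}

// A position counts as missing only if every one of its samples is missing.
bool all_samples_missing( std::vector<SketchSample> const& samples )
{
    return !samples.empty() && std::all_of(
        samples.begin(), samples.end(),
        []( SketchSample const& sample ){ return sample.missing; }
    );
}
```

The point of the flag is that a masked site and a true 0:0:0:0:0:0 site end up with identical counts, and only the flag tells them apart.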
Definition at line 245 of file sync_reader.hpp.
bool guess_alt_base () const | inline |
Definition at line 206 of file sync_reader.hpp.
SyncReader & guess_alt_base (bool value) | inline |
Set to guess the alternative base of the Variant, instead of leaving it at 'N'.
Excluding the reference base, we use the base of the remaining three that has the highest total count across all samples, unless all of them are zero, in which case we do not set the alternative base. We also skip cases where the ref is not in ACGT, as then the alternative base is also meaningless. In these cases, the alternative will be N.
Note though that this can lead to conflicts between different files, if the second most abundant nucleotide differs between them, e.g., in non-biallelic positions. Usually we can deal with this, see for example VariantParallelInputStream::Iterator::joined_variant(). Still, it is important to keep this in mind.
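The guessing rule described above can be sketched as a standalone function. This is an illustration only, not the genesis implementation; the name guess_alt_base_sketch is hypothetical, and the handling of ties (the first base in A, T, C, G order wins) is an assumption of this sketch, as the description does not specify it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of genesis. Each sample holds six counts in
// the sync order A:T:C:G:N:D; only the first four are used here.
char guess_alt_base_sketch(
    char ref,
    std::vector<std::array<std::size_t, 6>> const& samples
) {
    std::string const bases = "ATCG"; // order of the first four sync counts

    // If the reference is not one of ACGT, the alternative stays 'N'.
    if( bases.find( ref ) == std::string::npos ) {
        return 'N';
    }

    // Sum up the counts per base across all samples.
    std::array<std::size_t, 4> totals{};
    for( auto const& sample : samples ) {
        for( std::size_t i = 0; i < totals.size(); ++i ) {
            totals[i] += sample[i];
        }
    }

    // Pick the non-reference base with the highest total count.
    // If all three are zero, the alternative stays 'N'.
    // Assumption: on ties, the first base in A, T, C, G order is kept.
    char alt = 'N';
    std::size_t best = 0;
    for( std::size_t i = 0; i < totals.size(); ++i ) {
        if( bases[i] != ref && totals[i] > best ) {
            best = totals[i];
            alt = bases[i];
        }
    }
    return alt;
}
```

Because the totals are summed per input, two files covering the same position can yield different guesses, which is the conflict mentioned in the note above.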
Definition at line 224 of file sync_reader.hpp.
bool parse_line | ( | utils::InputStream & | input_stream, |
Variant & | sample_set | ||
) | const |
Read a single line into the provided Variant.
Returns whether the reading was successful; a return value of false indicates that the input is at its end.
Definition at line 160 of file sync_reader.cpp.
bool parse_line | ( | utils::InputStream & | input_stream, |
Variant & | sample_set, | ||
std::vector< bool > const & | sample_filter | ||
) | const |
Read a single line into the provided Variant, using a subset of the sample columns.
This overload is equivalent to the corresponding overload of the read() functions. See there for details.
Definition at line 167 of file sync_reader.cpp.
std::vector< Variant > read | ( | std::shared_ptr< utils::BaseInputSource > | source | ) | const |
Read the whole input into a vector of Variants.
Definition at line 120 of file sync_reader.cpp.
std::vector< Variant > read | ( | std::shared_ptr< utils::BaseInputSource > | source, |
std::vector< bool > const & | sample_filter | ||
) | const |
Read the whole input into a vector of Variants, using a subset of the sample columns.
The overload expects a vector indicating which columns to read and which to skip. The Variants produced for each line of input only contain as many entries as there are true values in the provided sample_filter. If the size of the sample_filter does not match the number of sample columns, an exception is thrown.
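The filter semantics described above can be sketched in isolation as follows. This is not the code used by read(); the function name filter_sample_columns is hypothetical, and std::invalid_argument is an assumed exception type, as the description only states that an exception is thrown.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of genesis: keep only those sample columns
// for which the filter is true, in their original order.
std::vector<std::string> filter_sample_columns(
    std::vector<std::string> const& columns,
    std::vector<bool> const& sample_filter
) {
    // The filter needs exactly one entry per sample column.
    if( columns.size() != sample_filter.size() ) {
        throw std::invalid_argument(
            "Sample filter size does not match the number of sample columns"
        );
    }

    std::vector<std::string> result;
    for( std::size_t i = 0; i < columns.size(); ++i ) {
        if( sample_filter[i] ) {
            result.push_back( columns[i] );
        }
    }
    return result;
}
```

The result has as many entries as there are true values in the filter, which mirrors the number of samples per Variant described above.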
Definition at line 138 of file sync_reader.cpp.
std::vector< std::string > read_header | ( | utils::InputStream & | input_stream | ) | const |
Read the header line, if there is one. Do nothing if there is not.
Has to be called at the start of reading a source file, as otherwise the reading will have already moved on from the header line.
This is support for an ad-hoc extension of the sync format that offers a header line to store sample names, which are usually not available in the sync format. We currently expect a fixed format:
#chr pos ref S1 S2...
starting with a number sign (hashtag) # symbol, optionally followed by a tab character, and then listing the fixed columns chr, pos, and ref, followed by the sample name columns, all tab-delimited.
The function returns the values of the sample name columns, i.e., the sample names.
Definition at line 56 of file sync_reader.cpp.
std::vector< std::string > read_header | ( | utils::InputStream & | input_stream, |
std::vector< bool > const & | sample_filter | ||
) | const |
Read the header line, if there is one, only reading specific columns. Do nothing if there is not.
This overload of the function additionally takes a vector indicating which sample names to read and return (where sample_filter is true), and ignores the rest (where sample_filter is false). The size of sample_filter has to match the number of sample name columns; an exception is thrown otherwise.
This function hence is meant to match the read() and parse_line() overloads that also take this type of filter.
Definition at line 85 of file sync_reader.cpp.