Sequences

The basic class to represent a genetic sequence is called Sequence (quelle surprise). It stores a label, i.e., a name for the sequence, and its sites. The Sequence class itself is agnostic of the format/encoding of its content, that is, whether it stores nucleotide or amino acid or any other form of data. This offers flexibility when working with sequence data. There are however some functions that are specialized for e.g., nucleotide sequences; that is, they work with sequences of the characters ACGT-.

A sequence comes rarely alone. A collection of sequences is stored in a SequenceSet. The sequences in such a set do not have to have the same length, i.e., it is not an alignment.

The code examples in this tutorial assume that you use

#include "genesis/genesis.hpp"
using namespace genesis::sequence;
using namespace genesis::utils;

at the beginning of your code.

Reading, Writing and Converting

Reading is done using reader classes like FastaReader and PhylipReader. See there for details. Basic usage:

// Read a fasta file into a SequenceSet object.
SequenceSet sequences_a = FastaReader().read( from_file( "path/to/file_a.fasta" ));
 
// Read a phylip file into a SequenceSet object.
SequenceSet sequences_b = PhylipReader().read( from_file( "path/to/file_b.phylip" ));
 
// Read more sequences into an existing SequenceSet object.
FastaReader().read( from_file( "path/to/file_c.fasta" ), sequences_b );
PhylipReader().read( from_file( "path/to/file_d.phylip" ), sequences_a );

Writing is done the other way round, using e.g., FastaWriter or PhylipWriter:

// Write data from a SequenceSet object to a fasta file.
FastaWriter().write( sequences_b, to_file( "path/to/file_e.fasta" ));
 
// Write data from a SequenceSet object to a phylip file.
PhylipWriter().write( sequences_a, to_file( "path/to/file_f.phylip" ));

All the readers and writers can also be normally stored in a variable, for example in order to change their settings and then use them multiple times:

// Instantiate objects and change some exemplary settings.
auto fasta_reader = FastaReader();
fasta_reader.valid_chars( "ACGT-" );
auto phylip_writer = PhylipWriter();
phylip_writer.line_length( 100 );
 
// ... continue using fasta_reader and phylip_writer

Lastly, conversion between different sequence file formats is of course easily done by reading it in one format, and writing it in another.

Inspecting Sites and Printing

Access to the sites of a Sequence is given via its member function sites().

It is also possible to directly iterate the Sequences in a SequenceSet and the single sites of a Sequence:

// Iterate all Sequences in a SequenceSet and all their sites and print them.
for( Sequence const& sequence : sequences_a ) {
 
    // Print the Sequnce label.
    std::cout << sequence.label() << ": ";
 
    // Iterate and print all sites of the Sequence.
    for( char const& site : sequence ) {
        std::cout << site;
    }
    // Alternatively, instead of the loop, you can use
    std::cout << sequence.sites();
 
    // Finish the print line.
    std::cout << std::endl;
}

As printing a Sequence or a whole SequenceSet is common in order to inspect the sites, we offer a class PrinterSimple that does this more easily and with various settings:

// Print a SequenceSet as characters to the standard output.

std::cout << PrinterSimple().print( sequences_a );

It also offers printing using colors for the different sites (i.e., color each of the nucleotides differently). See the class description of PrinterSimple for details.

Furthermore, when dealing with many sequences, printing each character might be to much. For such large datasets, we offer PrinterBitmap, which prints the sites as pixels in a bitmap, each Sequence on a separate line, thus offering a more dense representation of the data:

// Print a SequenceSet as pixels to a bitmap file.
PrinterBitmap()
    .color_map( nucleic_acid_colors() )
    .write( sequences_a, to_file( "path/to/sits.bmp"));

Consensus Sequences

Often, it is desired to summarize a collection of Sequences into a consensus sequence. For this, Genesis offers a couple of different algorithms:

consensus_sequence_with_threshold(), for nucleotide sequences (ACGT), that uses a threshold for the character frequency to determine the consensus at each site.
consensus_sequence_with_ambiguities(), also for nucleotide sequences (ACGT), that uses a similarity_factor to calculate consensus with ambiguity characters.
consensus_sequence_cavener(), again for nucleotide sequences (ACGT), which uses the method by Cavener, 1987.
consensus_sequence_with_majorities(), for all kinds of site characters, that uses a majority rule to find the consensus at each site.

See the documentation of those functions (and their variants) for details.

Related to the calculation of consensus sequences is the calculation of the entropy of a collection of Sequences. The entropy is a measure of information contained in the sites of such a collection.

We offer two modes of calculating the Sequence entropy:

as well as the single-site functions site_entropy() and site_information().

Instead of a SequenceSet, they take a SiteCounts object as input, which is a summarization of the occurence frequency of the sites in a SequenceSet. See there for details.

Further Functions

Finally, we want to point out some other interesting functions:

filter_min_max_sequence_length() removes Sequences from a SequenceSet that are either too short or too long.
remove_characters() allows to remove certain characters by deleting sites in Sequences.
relabel_with_hash() sets new labels for all Sequences according to the hash of the sequence sites, offering different hash functions.
nucleic_acid_code_containment() checks whether two nucleic acid codes are equal, taking degenerated/ambiguous characters into account.

There are more classes and functions to work with Sequences, see namespace sequence for the full list.