The basic class to represent a genetic sequence is called Sequence (quelle surprise). It stores a label, i.e., a name for the sequence, and its sites. The Sequence class itself is agnostic of the format/encoding of its content, that is, whether it stores nucleotide or amino acid or any other form of data. This offers flexibility when working with sequence data. There are, however, some functions that are specialized for, e.g., nucleotide sequences; that is, they work with sequences of the characters ACGT-.
A sequence rarely comes alone. A collection of sequences is stored in a SequenceSet. The sequences in such a set do not have to have the same length; that is, a SequenceSet is not necessarily an alignment.
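To make the distinction between a plain collection and an alignment concrete, here is a minimal standalone sketch. The `Sequence` struct, the `SequenceSet` alias, and the helpers `is_alignment()` and `alignment_length()` are hypothetical stand-ins written for this illustration; they are not the Genesis classes or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the Genesis classes.
struct Sequence {
    std::string label;  // name of the sequence
    std::string sites;  // characters, e.g. nucleotides or amino acids
};
using SequenceSet = std::vector<Sequence>;

// True if all sequences have the same number of sites.
// An empty set trivially counts as aligned.
bool is_alignment(const SequenceSet& set) {
    for (const auto& seq : set) {
        if (seq.sites.size() != set.front().sites.size()) {
            return false;
        }
    }
    return true;
}

// Number of columns; only meaningful if the set is an alignment.
std::size_t alignment_length(const SequenceSet& set) {
    assert(is_alignment(set));
    return set.empty() ? 0 : set.front().sites.size();
}
```

A set holding "ACGT" and "ACG" is a perfectly valid collection here, but `is_alignment()` reports false for it.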
The code examples in this tutorial assume that you use the sequence namespace of Genesis at the beginning of your code.
Reading is done using reader classes like FastaReader and PhylipReader. See the documentation of these classes for details. Basic usage:
Writing is done the other way round, using e.g., FastaWriter or PhylipWriter:
All the readers and writers can also be stored in a variable as usual, for example in order to change their settings and then use them multiple times:
Lastly, conversion between different sequence file formats is of course easily done by reading the data in one format and writing it in another.
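The read-then-write pattern for format conversion can be sketched without Genesis as follows. This is a simplified illustration under stated assumptions: `read_fasta()`, `write_phylip()`, and `fasta_to_phylip()` are hypothetical helpers, the FASTA parsing ignores comments and validation, and the output is a relaxed sequential Phylip layout in which labels must not contain spaces. The Genesis reader and writer classes handle far more cases.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the Genesis classes.
struct Sequence {
    std::string label;
    std::string sites;
};
using SequenceSet = std::vector<Sequence>;

// Parse simple FASTA: a line starting with '>' is a label,
// all following lines up to the next label are sites.
SequenceSet read_fasta(std::istream& in) {
    SequenceSet result;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();  // tolerate Windows line endings
        }
        if (line.empty()) {
            continue;
        }
        if (line[0] == '>') {
            result.push_back({line.substr(1), ""});
        } else if (!result.empty()) {
            result.back().sites += line;
        }
    }
    return result;
}

// Write relaxed sequential Phylip: a header "count length",
// then one "label sites" line per sequence.
void write_phylip(const SequenceSet& set, std::ostream& out) {
    const std::size_t length = set.empty() ? 0 : set.front().sites.size();
    out << set.size() << " " << length << "\n";
    for (const auto& seq : set) {
        assert(seq.sites.size() == length);  // Phylip requires an alignment
        out << seq.label << " " << seq.sites << "\n";
    }
}

// Convert by reading in one format and writing in the other.
std::string fasta_to_phylip(const std::string& fasta) {
    std::istringstream in(fasta);
    std::ostringstream out;
    write_phylip(read_fasta(in), out);
    return out.str();
}
```

Note that the conversion only succeeds for aligned input, because Phylip stores a single length for all sequences, whereas FASTA has no such restriction.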
Access to the sites of a Sequence is given via its member function sites().
It is also possible to directly iterate the Sequences in a SequenceSet and the single sites of a Sequence:
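As a library-independent illustration of this nested iteration, the sketch below walks over all sequences of a set and all sites of each sequence to compute how often a character occurs. The `Sequence` struct and the function `site_frequency()` are hypothetical and only mirror the idea; with Genesis, the inner loop would run over the result of sites() instead of a plain member.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the Genesis classes.
struct Sequence {
    std::string label;
    std::string sites;
};
using SequenceSet = std::vector<Sequence>;

// Fraction of all sites in the set that equal the given character.
double site_frequency(const SequenceSet& set, char c) {
    std::size_t total = 0;
    std::size_t hits = 0;
    for (const auto& seq : set) {      // iterate the sequences of the set
        for (char site : seq.sites) {  // iterate the sites of each sequence
            ++total;
            if (site == c) {
                ++hits;
            }
        }
    }
    assert(total > 0);  // the fraction is undefined without any sites
    return static_cast<double>(hits) / static_cast<double>(total);
}
```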
As printing a Sequence or a whole SequenceSet is common in order to inspect the sites, we offer a class PrinterSimple that does this more easily and with various settings:
It also offers printing using colors for the different sites (i.e., color each of the nucleotides differently). See the class description of PrinterSimple for details.
Furthermore, when dealing with many sequences, printing each character might be too much. For such large datasets, we offer PrinterBitmap, which prints the sites as pixels in a bitmap, each Sequence on a separate line, thus offering a denser representation of the data:
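To illustrate the idea of one pixel per site and one image row per sequence, here is a minimal standalone sketch. It is not how PrinterBitmap is implemented: it emits the plain-text PPM (P3) image format for simplicity, the function names `site_color()` and `to_ppm()` are hypothetical, and the color assignment is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the Genesis classes.
struct Sequence {
    std::string label;
    std::string sites;
};
using SequenceSet = std::vector<Sequence>;

struct Color {
    int r, g, b;
};

// Arbitrary color per nucleotide (an assumption); everything else is black.
Color site_color(char site) {
    switch (site) {
        case 'A': return {255, 0, 0};
        case 'C': return {0, 255, 0};
        case 'G': return {255, 255, 0};
        case 'T': return {0, 0, 255};
        default:  return {0, 0, 0};
    }
}

// Render the set as a plain-text PPM image:
// one row per sequence, one pixel per site.
std::string to_ppm(const SequenceSet& set) {
    const std::size_t width = set.empty() ? 0 : set.front().sites.size();
    std::ostringstream out;
    out << "P3\n" << width << " " << set.size() << "\n255\n";
    for (const auto& seq : set) {
        assert(seq.sites.size() == width);  // image rows need equal length
        for (char site : seq.sites) {
            const Color c = site_color(site);
            out << c.r << " " << c.g << " " << c.b << "\n";
        }
    }
    return out.str();
}
```

Writing the returned string to a file with the extension .ppm should give an image that common viewers can open, with each alignment column showing up as a vertical stripe pattern.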
Often, it is desired to summarize a collection of Sequences into a consensus sequence. For this, Genesis offers a couple of different algorithms:
A function (for ACGT) that uses a threshold for the character frequency to determine the consensus at each site.
A function (for ACGT) that uses a similarity_factor to calculate the consensus with ambiguity characters.
A function (for ACGT) which uses the method by Cavener, 1987.
See the documentation of those functions (and their variants) for details.
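To give a feeling for what a threshold-based consensus does, here is a deliberately simplified standalone sketch. It is an assumption-laden toy, not the Genesis algorithm: the hypothetical function `simple_threshold_consensus()` picks the most frequent of ACGT per column if its frequency among all sequences reaches the threshold, and otherwise emits N. It does not produce ambiguity codes, and ties go to the character that comes first in ACGT.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the Genesis classes.
struct Sequence {
    std::string label;
    std::string sites;
};
using SequenceSet = std::vector<Sequence>;

// Simplified threshold consensus over an alignment of ACGT sequences.
// Characters other than ACGT (e.g. gaps) are not counted, but still
// contribute to the denominator of the frequency.
std::string simple_threshold_consensus(const SequenceSet& set, double threshold) {
    assert(!set.empty());
    const std::size_t length = set.front().sites.size();
    const std::string alphabet = "ACGT";
    std::string result;
    for (std::size_t col = 0; col < length; ++col) {
        std::array<std::size_t, 4> counts{};  // zero-initialized
        for (const auto& seq : set) {
            assert(seq.sites.size() == length);  // requires an alignment
            const auto pos = alphabet.find(seq.sites[col]);
            if (pos != std::string::npos) {
                ++counts[pos];
            }
        }
        // Find the most frequent character of this column.
        std::size_t best = 0;
        for (std::size_t i = 1; i < counts.size(); ++i) {
            if (counts[i] > counts[best]) {
                best = i;
            }
        }
        const double freq = static_cast<double>(counts[best])
                          / static_cast<double>(set.size());
        result += (freq >= threshold) ? alphabet[best] : 'N';
    }
    return result;
}
```

Raising the threshold makes the consensus more conservative: columns without a sufficiently dominant character fall back to N.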
Related to the calculation of consensus sequences is the calculation of the entropy of a collection of Sequences. The entropy is a measure of information contained in the sites of such a collection.
We offer two modes of calculating the Sequence entropy, as well as the single-site functions site_entropy() and site_information().
Instead of a SequenceSet, they take a SiteCounts object as input, which is a summarization of the occurrence frequency of the sites in a SequenceSet. See there for details.
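The per-site quantities can be sketched as follows, assuming that entropy here means Shannon entropy in bits over the four characters ACGT. The type `BaseCounts` and the functions `site_entropy_bits()` and `site_information_bits()` are hypothetical and merely play the role of a single column of a SiteCounts object; the sketch applies no small-sample correction, and the actual Genesis functions may differ in such details.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical per-site counts of A, C, G, T (in this order).
using BaseCounts = std::array<std::size_t, 4>;

// Shannon entropy of one site in bits: H = - sum_i p_i * log2(p_i),
// where p_i is the relative frequency of character i at that site.
double site_entropy_bits(const BaseCounts& counts) {
    const std::size_t total = counts[0] + counts[1] + counts[2] + counts[3];
    assert(total > 0);  // entropy is undefined for an empty column
    double entropy = 0.0;
    for (std::size_t count : counts) {
        if (count == 0) {
            continue;  // by convention, 0 * log2(0) contributes nothing
        }
        const double p = static_cast<double>(count) / static_cast<double>(total);
        entropy -= p * std::log2(p);
    }
    return entropy;
}

// Information content of one site: the maximum possible entropy
// for four characters, log2(4) = 2 bits, minus the observed entropy.
double site_information_bits(const BaseCounts& counts) {
    return 2.0 - site_entropy_bits(counts);
}
```

A fully conserved column has an entropy of 0 bits and an information content of 2 bits, while a column with equal counts for all four characters has an entropy of 2 bits and carries no information.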
Finally, we want to point out some other interesting functions:
There are more classes and functions to work with Sequences; see namespace sequence for the full list.