Handling bidirectional nuclear sequence data

This exercise will give you more practice handling and editing raw sequence data produced by Sanger sequencing.

The Acrocephalus sequence list contains forward and reverse sequences for a nuclear gene from 3 different Acrocephalus reed warbler species. The sequences are named with a three-letter code to indicate their species (aru = A. arundinaceus, great reed warbler; dum = A. dumetorum, Blyth's reed warbler; ort = A. orientalis, Oriental reed warbler), and are marked with 'F' or 'R' to indicate whether they were sequenced with forward or reverse primers.

Double-click on the Acrocephalus sequences list to open it in a new window. Scroll down to get an overview of the sequences. Note that in a few sequences the sequence quality drops off part way along (e.g. dum2 and dum4 sequences).

Trim the poor-quality sequence off the ends of each read by clicking Annotate and Predict→Trim Ends. This time we will annotate the trimmed regions rather than deleting them, so select "Annotate new trimmed regions". Set the Error Probability Limit to 0.01 and click OK. Once the trimming has finished, save the sequence list and close its window.
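To make the Error Probability Limit more concrete, here is a minimal Python sketch of quality-based end trimming. It assumes a modified-Mott-style rule: each base scores (limit minus its error probability), and the highest-scoring contiguous stretch is kept. This exercise does not state which algorithm Trim Ends uses, so treat the approach, the function names (phred_to_error, trim_ends) and the example quality values as illustrative assumptions rather than a description of the software.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def trim_ends(quals, limit=0.01):
    """Return (start, end) of the region to keep, with end exclusive.

    Assumed rule (modified-Mott style): score each base as
    limit - error probability and keep the contiguous stretch
    with the highest total score. Returns (0, 0) if no base passes.
    """
    best_sum, best = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += limit - phred_to_error(q)
        if run_sum <= 0:
            # The running stretch is no longer worth keeping; restart after it.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


# Hypothetical read: low quality at both ends, good quality in the middle.
quals = [5, 8, 30, 40, 40, 35, 30, 10, 6]
start, end = trim_ends(quals, limit=0.01)
print(f"keep bases {start}..{end - 1}")
```

With a limit of 0.01, bases with a quality below about Q20 count against the retained region. This is why reads whose quality drops off part way along end up with long trimmed regions.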

We will now run the Heterozygote Finder to identify and annotate bases where two different nucleotides have been called at the same position. Because these are nuclear sequences, each one represents two alleles, so there may be heterozygous positions where the two alleles carry different bases and the chromatogram shows a double peak. Select all the files in the Acrocephalus Sequences folder and click Annotate and Predict→Find Heterozygotes. Set the Peak Similarity to 50%, and choose to Annotate the heterozygote bases.


Click OK and Save the sequences when the analysis has finished. We will come back to the bases which are annotated as heterozygotes after we have assembled the forward and reverse sequences.
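The idea behind the Peak Similarity threshold can be sketched in Python as follows. The sketch assumes that a position is flagged as heterozygous when the second-highest chromatogram peak is at least 50% of the height of the highest peak. That interpretation, the function name call_base and the peak heights are assumptions for illustration only; the software's actual calculation may differ.

```python
# IUPAC ambiguity codes for the six possible two-base combinations.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_base(peaks, similarity=0.5):
    """Return (base call, is_heterozygote) for one chromatogram position.

    peaks maps each of A, C, G, T to its peak height (hypothetical values).
    Assumed rule: heterozygous if secondary peak / primary peak >= similarity.
    """
    ranked = sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)
    (base1, height1), (base2, height2) = ranked[0], ranked[1]
    if height1 > 0 and height2 / height1 >= similarity:
        return IUPAC[frozenset((base1, base2))], True
    return base1, False


# Double peak: G is 62% of A, so the position is called as R (A or G).
print(call_base({"A": 1000, "C": 20, "G": 620, "T": 15}))
# Single clear peak: G is only 30% of A, so the call stays A.
print(call_base({"A": 1000, "C": 20, "G": 300, "T": 15}))
```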

We will now assemble the forward and reverse sequences for each individual. Before proceeding with the assembly, we will set the read direction to ensure that the sequences are assembled in the same orientation for each pair. In order to set the read direction the sequence files need to be extracted from the list, as this operation does not work on sequence lists. Select the sequence list and click Sequence→Extract Sequences from List. Choose to save the sequences in a subfolder called Acrocephalus Sequences.

Now select all the forward sequences in the folder (named with an F as the final letter) while holding down the command/control key, and select Sequence→Set Read Direction. Check the Forward box and click OK. There is no need to set the direction of the reverse reads as well.

To proceed with the assembly, select all the sequences in the folder and choose Align/Assemble→De Novo Assemble. Click Assemble by, then select 1st part of name, separated by underscore. This will produce one contig for each pair of forward and reverse sequences. Set the sensitivity to Highest Sensitivity/Slow, and ensure that Save assembly report, Save list of unused reads, Save in sub-folder and Save contigs are checked. Choose Use existing trim regions. With this option the assembler ignores the regions annotated as trimmed, but you will still be able to see these regions on the sequences. Click OK.
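The "Assemble by 1st part of name, separated by underscore" option groups reads by the text before the first underscore in each name. A short Python sketch of that grouping logic is shown here. The read names (for example aru1_F) are hypothetical and only follow the naming pattern described in this exercise; the function group_reads is likewise an illustration, not part of the software.

```python
from collections import defaultdict


def group_reads(names, sep="_", part=0):
    """Group read names by one part of the name, split on a separator.

    part=0 means the first part, i.e. the text before the first separator.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.split(sep)[part]].append(name)
    return dict(groups)


# Hypothetical read names: individual code, underscore, then F or R.
reads = ["aru1_F", "aru1_R", "dum2_F", "dum2_R", "ort3_F", "ort3_R"]
for individual, pair in group_reads(reads).items():
    print(individual, "->", pair)
```

Each group corresponds to one individual, so a forward and a reverse read that share the same first name part are assembled together into a single contig.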



A subfolder called Assembly has now been created which contains the contigs and an Assembly Report. You'll also see a sequence list of unused reads, which contains sequences that could not be assembled. Take a look in this sequence list and you'll see that these sequences are the ones which contained only a short stretch of good quality sequence (dum2 and dum4).


Exercise 2b: Checking assembled sequences and extracting consensus
Exercise 2c: Assembling to reference
Exercise 2d: Analysing consensus sequences