Workshop

In this workshop, we will build a toy read mapper called SeqMap to showcase some of Seq’s domain-specific features and optimizations.

Getting Started

If you don’t have Seq installed already, you can install it with one bash command:

/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"

Be sure to restart the shell for seqc to be added to your PATH (or add it manually).

Now, let’s create a new directory to house our Seq code and test data:

mkdir seqmap
cd seqmap

We’ll test SeqMap on the following data:

  • Chromosome 22 as the reference sequence (chr22.fa, 52MB)

  • A simulated set of 1 million 101bp reads (reads.fastq, 251MB)

Let’s download this data into a new folder called data:

curl -L http://seq.csail.mit.edu/data.tar.gz | tar zxf - -C .

What does this data actually look like? Let’s take a look:

head data/reads.fq
@chr22_16993648_16994131_1:0:0_2:0:0_0/1
CTACCAAACACCTACTTCGTTTCCTAACATCACTTTAATTTTATCTTAGAGGAATTCTTTTCCCTATCCCATTAAGTTATGGGAGATGGGGCCAGGCATGG
+
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
@chr22_28253010_28253558_1:0:0_0:0:0_1/1
AGTGTTTTGCCTGTGGCTAGACTAAAAATAAGGAATGAGGGGGGTATCTTCCACTCTTGCCCTCTCATCACCCTATTCCCTATATCCAGAACTCAGAGTCC
+
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
@chr22_21656192_21656802_0:0:0_2:0:0_2/1
ATAGCGTGGATTCCTATGACATCAAGGAGCTATTTTATTTGGTAAAACGAAAAAGCACAATAATGAACGAACGCAAGCACTGAAACAGTGGAGACACCTAG

Each FASTQ record consists of four lines:

  • Read name, starting with @. Notice that since this data is simulated, the read name includes the location on the genome where the read comes from; this will be useful later!

  • Read sequence. This is a DNA sequence consisting of ACGT bases and possibly N, indicating an ambiguous base.

  • Separator (+)

  • Read quality scores. This is a string of characters as long as the read sequence, where each character indicates the “quality” of the corresponding base in the read sequence. We won’t be using it here.

What about the reference sequence FASTA file?

head data/chr22.fa
>chr22
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

FASTQ records start with the sequence name (prefixed with >) followed by the sequence itself, split across multiple lines. But why are all the bases N? Chromosomal sequences often have their starts and ends masked with N’s to cover repetitive telomeric sequences. Since we usually don’t want to include such regions in our analyses, they are masked in the file. Let’s look a bit further down:

head -n 1000000 data/chr22.fa | tail -n 10
tattaaaggaaaaaactgtatgaaatagtacatttctcataattctcatt
ttgtaaaaataaagtacttatctatggacataatgagaaaatgactcaag
gtaccaagagtttagccattagctataccagtggattataagcaaattct
gttACGTGCATGCACTCACCTACGCATGTTCATGTATTCATACATACGTA
CATAATTTTTTAAATTTTCTTTTATAGACAAGCAATAGCTTTATAATCTC
TATAATCAGTAAAAATAAGTAAGTggctggacgcagtggctcacacctgt
aatctcagcactttgggaggctgaggagggcagattatgaggtcagaaga
tcaagaccatcctggctaacacagtgaaaccccatctctactaaaaatac
aaaaaattagccacgcgtggtggcacgcgcctgtagtcccagctactggg
gaggctgaggcaggaaaatcgcttgaacccgggaggcagaggttgcggtg

Now we can see the usual ACGT bases. The fact that some bases are lowercase indicates that they are a part of some repetitive element or region. Seq will handle these different uppercase and lowercase characters automatically, so we don’t need to worry about them.

You might notice an additional file called chr22.fa.fai: this is a FASTA index file that includes information about each sequence contained in the file for easier parsing. We won’t use it directly, but Seq uses it internally to make FASTA parsing more efficient.

Section 1: Reading sequences from disk

The first step of processing any kind of sequencing data is to read it from disk. Seq has builtin support for many of the standard file formats such as FASTA, FASTQ, SAM, BAM, etc.

Let’s write a program to read our FASTQ file and print each record’s name and sequence on a single line:

from sys import argv
from bio import *
for record in FASTQ(argv[1]):
    print record.name, record.seq

Now we can run this Seq program:

seqc run section1.seq data/reads.fq > out.txt

and view the results:

head out.txt
chr22_16993648_16994131_1:0:0_2:0:0_0/1 CTACCAAACACCTACTTCGTTTCCTAACATCACTTTAATTTTATCTTAGAGGAATTCTTTTCCCTATCCCATTAAGTTATGGGAGATGGGGCCAGGCATGG
chr22_28253010_28253558_1:0:0_0:0:0_1/1 AGTGTTTTGCCTGTGGCTAGACTAAAAATAAGGAATGAGGGGGGTATCTTCCACTCTTGCCCTCTCATCACCCTATTCCCTATATCCAGAACTCAGAGTCC
chr22_21656192_21656802_0:0:0_2:0:0_2/1 ATAGCGTGGATTCCTATGACATCAAGGAGCTATTTTATTTGGTAAAACGAAAAAGCACAATAATGAACGAACGCAAGCACTGAAACAGTGGAGACACCTAG
chr22_44541236_44541725_0:1:0_0:0:0_3/1 CTCTCTGTCTCTCTCTCTCCCCTAGGTCAGGGTGGTCCCTGGGGAGGCCCCTGGGTTACCCCAAGACAGGTGGGAGGTGCTTCCTACCCGACCCTCTTCCT
chr22_39607671_39608139_0:0:0_2:0:0_4/1 ATTGGCTCAGAGTTCAGCAGGCTGTACCAGCATGGCGCCAGTGTCTGCTCCTGGTGAGGCCTTACGGACGTTACAATAACGGCGGAAGGCAAAGGCGGAGC
chr22_35577703_35578255_3:0:0_1:0:0_5/1 TGCCATGGTGGTTAGCTGCACCCATCAACCTGTCATCTACATTAGGTATTTTTCCTAATGCTATCCCTCCCCTAGCACCCTACCCTCTGATAGGCCCTGGT
chr22_46059124_46059578_1:0:0_1:0:0_6/1 AATCAGTACCAAACAATATATGGATATTATTGGCACTTTGTGCTCCCTCTGCCTGAACTGGGAATTCCTCTATTAGTTTTGACATTATCTGGTATTGAACC
chr22_31651867_31652385_2:0:0_2:0:0_7/1 ATCTAGTGACAGTAAGTGGCTGATAAAGTGAGCTGCCATTACATAGTCATCATCTTTAATAGAAGTTAACACATACTGAGTTTCTACTATATTGGGTCTTT
chr22_24816466_24817026_1:0:0_1:0:0_8/1 CACCTCTAGGGCTCAAGGGGCAGTTCCTCCATTCCTCAGCAGTGGCGCCTGTGGAACTGTGTCCTGAGGCCAGGGGGTGGTCAGGCAGGGCCTGGAGTGGC
chr22_27496272_27496752_1:0:0_1:0:0_9/1 CTTAGCCCCATTAAACACTGGCAGGGCTGAATTGTCTGCTGCCATCCATCACACCTTCTCCCCTAGCCTGGTTTCTTACCTACCTGGAAGCCGTCCCTTTT

Pretty straightforward! FASTA files can be read in a very similar way.

Full code listing

click to download

# SeqMap
# Seq workshop -- Section 1
# Reads and prints a FASTQ file.
# Usage: seqc run section1.seq <FASTQ path>
from sys import argv
from bio import *
for record in FASTQ(argv[1]):
    print record.name, record.seq

Section 2: Building an index

Our goal is to find a “mapping” on the genome for each read. Comparing to every position on the reference sequence would take far too long. An alternative is to create an index of the k-mers from the reference sequence and use it to guide the mapping process.

Let’s build a dictionary that maps each k-mer to its position (“locus”) on the reference sequence:

from sys import argv
from bio import *
K = Kmer[32]
index = {}

for record in FASTA(argv[1]):
    for pos,kmer in record.seq.kmers_with_pos[K](step=1):
        index[kmer] = pos

Of course, there will be k-mers that appear multiple times, but let’s ignore this detail for now and just store the latest position we see for each k-mer.

Another important issue is reverse complementation: some of our reads will map in the reverse direction rather than in the forward direction. For this reason, let’s build our index in such a way that a k-mer is considered “equal” to its reverse complement. One easy way to do this is by using “canonical” k-mers, i.e. the minimum of a k-mer and its reverse complement:

from sys import argv
from bio import *
K = Kmer[32]
index = {}

for record in FASTA(argv[1]):
    for pos,kmer in record.seq.kmers_with_pos[K](step=1):
        index[min(kmer, ~kmer)] = pos  # <--

(We’ll have to use canonical k-mers when querying the index too, of course.)

Now we have our index as a dictionary (index), but we don’t want to build it each time we perform read mapping, since it only depends on the (fixed) reference sequence. So, as a last step, let’s dump the index to a file using the pickle module:

import pickle
import gzip

with gzip.open(argv[1] + '.index', 'wb') as jar:
    pickle.dump(index, jar)

Run the program:

seqc run section2.seq data/chr22.fa

Now we should see a new file data/chr22.fa.index which stores our serialized index.

The nice thing is we should only have to build our index once!

Full code listing

click to download

# SeqMap
# Seq workshop -- Section 2
# Reads and constructs a hash table index from an input
# FASTA file.
# Usage: seqc run section2.seq <FASTA path>
from sys import argv
from bio import *
import pickle
import gzip

K = Kmer[32]
index = {}

for record in FASTA(argv[1]):
    for pos,kmer in record.seq.kmers_with_pos[K](step=1):
        index[min(kmer, ~kmer)] = pos

with gzip.open(argv[1] + '.index', 'wb') as jar:
    pickle.dump(index, jar)

Section 3: Finding k-mer matches

At this point, we have an index we can load from disk. Let’s use it to find candidate mappings for our reads.

We’ll split each read into k-mers and report a mapping if at least two k-mers support a particular locus.

The first step is to load the index:

from sys import argv
from bio import *
import pickle
import gzip

K = Kmer[32]
index = None

with gzip.open(argv[1] + '.index', 'rb') as jar:
    index = pickle.load[dict[K,int]](jar)

Now we can iterate over our reads and query k-mers in the index. We need a way to keep track of candidate mapping positions as we process the k-mers of a read: we can do this using a new dictionary, candidates, which maps candidate alignment positions to the number of k-mers supporting the given position.

Then, we just iterate over candidates and output positions supported by 2 or more k-mers. Finally, we clear candidates before processing the next read:

candidates = Dict[int,int]()  # position -> count mapping
for record in FASTQ(argv[2]):
    for pos,kmer in record.read.kmers_with_pos[K](step=1):
        found = index.get(min(kmer, ~kmer), -1)
        if found > 0:
            candidates.increment(found - pos)

    for pos,count in candidates.items():
        if count > 1:
            print record.name, pos + 1

    candidates.clear()

Run the program:

seqc run section3.seq data/chr22.fa data/reads.fq > out.txt

Let’s take a look at the output:

head out.txt
chr22_16993648_16994131_1:0:0_2:0:0_0/1 16993648
chr22_28253010_28253558_1:0:0_0:0:0_1/1 28253010
chr22_44541236_44541725_0:1:0_0:0:0_3/1 44541236
chr22_31651867_31652385_2:0:0_2:0:0_7/1 31651867
chr22_21584577_21585142_1:0:0_1:0:0_a/1 21584577
chr22_46629499_46629977_0:0:0_2:0:0_b/1 47088563
chr22_46629499_46629977_0:0:0_2:0:0_b/1 51103174
chr22_46629499_46629977_0:0:0_2:0:0_b/1 46795988
chr22_16269615_16270134_0:0:0_1:0:0_c/1 50577316
chr22_16269615_16270134_0:0:0_1:0:0_c/1 16269615

Notice that most positions we reported match the position from the read name (the first integer after the _); not bad!

Full code listing

click to download

# SeqMap
# Seq workshop -- Section 3
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings.
# Usage: seqc run section3.seq <FASTA path> <FASTQ path>
from sys import argv
from bio import *
import pickle
import gzip

K = Kmer[32]
index = None

with gzip.open(argv[1] + '.index', 'rb') as jar:
    index = pickle.load[dict[K,int]](jar)

candidates = Dict[int,int]()  # position -> count mapping
for record in FASTQ(argv[2]):
    for pos,kmer in record.read.kmers_with_pos[K](step=1):
        found = index.get(min(kmer, ~kmer), -1)
        if found > 0:
            candidates.increment(found - pos)

    for pos,count in candidates.items():
        if count > 1:
            print record.name, pos + 1

    candidates.clear()

Section 4: Smith-Waterman alignment and CIGAR strings

We now have the ability to report mapping positions for each read, but usually we want alignments, which include information about mismatches, insertions and deletions.

Luckily, Seq makes sequence alignment easy: to align sequence q against sequence t, you can just do:

aln = q @ t

aln is a tuple of alignment score and CIGAR string (a CIGAR string is a way of encoding an alignment result, and consists of operations such as M for match/mismatch, I for insertion and D for deletion, accompanied by the number of associated bases; for example, 3M2I4M indicates 3 (mis)matches followed by a length-2 insertion followed by 4 (mis)matches).

By default, Levenshtein distance is used, meaning mismatch and gap costs are both 1, while match costs are zero. More control over alignment parameters can be achieved using the align method:

aln = q.align(t, a=2, b=4, ambig=0, gapo=4, gape=2)

where a is the match score, b is the mismatch cost, ambig is the ambiguous base (N) match score, gapo is the gap open cost and gape the gap extension cost (i.e. a gap of length k costs gapo + (k * gape)). There are many more parameters as well, controlling factors like alignment bandwidth, Z-drop, global/extension alignment and more; check the standard library reference for further details.

For now, we’ll use a simple query.align(target):

candidates = Dict[int,int]()
for record in FASTQ(argv[2]):
    for pos,kmer in record.read.kmers_with_pos[K](step=1):
        found = index.get(min(kmer, ~kmer), -1)
        if found > 0:
            candidates.increment(found - pos)

    for pos,count in candidates.items():
        if count > 1:
            # get query, target and align:
            query = record.read
            target = reference[pos:pos + len(query)]
            alignment = query.align(target)
            print record.name, pos + 1, alignment.score, alignment.cigar

    candidates.clear()

Run the program:

seqc run section4.seq data/chr22.fa data/reads.fq > out.txt

And let’s take a look at the output once again:

head out.txt
chr22_16993648_16994131_1:0:0_2:0:0_0/1 16993648 196 101M
chr22_28253010_28253558_1:0:0_0:0:0_1/1 28253010 196 101M
chr22_44541236_44541725_0:1:0_0:0:0_3/1 44541236 196 101M
chr22_31651867_31652385_2:0:0_2:0:0_7/1 31651867 190 101M
chr22_21584577_21585142_1:0:0_1:0:0_a/1 21584577 196 101M
chr22_46629499_46629977_0:0:0_2:0:0_b/1 47088563 110 20M1I4M1D76M
chr22_46629499_46629977_0:0:0_2:0:0_b/1 51103174 134 20M1I4M1D76M
chr22_46629499_46629977_0:0:0_2:0:0_b/1 46795988 128 20M1I4M1D76M
chr22_16269615_16270134_0:0:0_1:0:0_c/1 50577316 118 101M
chr22_16269615_16270134_0:0:0_1:0:0_c/1 16269615 202 101M

Most of the alignments contain only matches or mismatches (M), which is to be expected as insertions and deletions are far less common. In fact, the three mappings containing indels appear to be incorrect!

A more thorough mapping scheme would also look at alignment scores before reporting mappings, although for the purposes of this workshop we’ll ignore such improvements.

Full code listing

click to download

# SeqMap
# Seq workshop -- Section 4
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings, then performs alignment.
# Usage: seqc run section4.seq <FASTA path> <FASTQ path>
from sys import argv
from bio import *
import pickle
import gzip

K = Kmer[32]
index = None

reference = s''
for record in FASTA(argv[1]):
    reference = record.seq

with gzip.open(argv[1] + '.index', 'rb') as jar:
    index = pickle.load[dict[K,int]](jar)

candidates = Dict[int,int]()
for record in FASTQ(argv[2]):
    for pos,kmer in record.read.kmers_with_pos[K](step=1):
        found = index.get(min(kmer, ~kmer), -1)
        if found > 0:
            candidates.increment(found - pos)

    for pos,count in candidates.items():
        if count > 1:
            query = record.read
            target = reference[pos:pos + len(query)]
            alignment = query.align(target)
            print record.name, pos + 1, alignment.score, alignment.cigar

    candidates.clear()

Section 5: Pipelines

Pipelines are a very convenient Seq construct for expressing a variety of algorithms and applications. In fact, SeqMap can be thought of as a pipeline with the following stages:

  • read a record from the FASTQ file,

  • find candidate alignments by querying the index,

  • perform alignment for mappings supported by 2+ k-mers and output results.

We can write this as a pipeline in Seq as follows:

def find_candidates(record):
    candidates = Dict[int,int]()
    for pos,kmer in record.read.kmers_with_pos[K](step=1):
        found = index.get(min(kmer, ~kmer), -1)
        if found > 0:
            candidates.increment(found - pos)
    for pos,count in candidates.items():
        if count > 1:
            yield record, pos

def align_and_output(t):
    record, pos = t
    query = record.read
    target = reference[pos:pos + len(query)]
    alignment = query.align(target)
    print record.name, pos + 1, alignment.score, alignment.cigar

Notice that find_candidates yields candidate alignments to align_and_output, which then performs alignment and prints the results. In Seq, all values generated from one stage of a pipeline are passed to the next. The Seq compiler performs many domain-specific optimizations on pipelines, one of which we focus on in the next section.

(Optional) Parallelism

Parallelism can be achieved using the parallel pipe operator, ||>, which tells the compiler that all subsequent stages can be executed in parallel:

FASTQ(argv[2]) |> iter ||> find_candidates |> align_and_output

Since the full program also involves loading the index, let’s time the main pipeline using the timing module:

import timing
with timing('mapping'):
    FASTQ(argv[2]) |> iter ||> find_candidates |> align_and_output

We can try this for different numbers of threads:

export OMP_NUM_THREADS=1
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 48.2858s

export OMP_NUM_THREADS=2
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 35.886s

Often, batching reads into larger blocks and processing those blocks in parallel can yield better performance, especially if each read is quick to process. This is also very easy to do in Seq:

def process_block(block):
    block |> iter |> find_candidates |> align_and_output

with timing('mapping'):
    FASTQ(argv[2]) |> blocks(size=2000) ||> process_block

And now:

export OMP_NUM_THREADS=1
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 48.2858s

export OMP_NUM_THREADS=2
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 25.2648s

Full code listing

click to download

# SeqMap
# Seq workshop -- Section 5
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings, then performs alignment.
# Implemented with Seq pipelines.
# Usage: seqc run section5.seq <FASTA path> <FASTQ path>
from sys import argv
from time import timing
from bio import *
import pickle
import gzip

K = Kmer[32]
index = None

reference = s''
for record in FASTA(argv[1]):
    reference = record.seq

with gzip.open(argv[1] + '.index', 'rb') as jar:
    index = pickle.load[dict[K,int]](jar)

def find_candidates(record):
    candidates = Dict[int,int]()
    for pos,kmer in record.read.kmers_with_pos[K](step=1):
        found = index.get(min(kmer, ~kmer), -1)
        if found > 0:
            candidates.increment(found - pos)
    for pos,count in candidates.items():
        if count > 1:
            yield record, pos

def align_and_output(t):
    record, pos = t
    query = record.read
    target = reference[pos:pos + len(query)]
    alignment = query.align(target)
    print record.name, pos + 1, alignment.score, alignment.cigar

with timing('mapping'):
    FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output

Section 6: Domain-specific optimizations

Seq already performs numerous domain-specific optimizations under the hood. However, we can give the compiler a hint in this case to perform one more: inter-sequence alignment. This optimization entails batching sequences prior to alignment, then aligning multiple pairs using a very fast SIMD optimized alignment kernel.

In Seq, we just need one additional function annotation to tell the compiler to perform this optimization:

@inter_align
def align_and_output(t):
    ...

Let’s run the program with and without this optimization:

# without @inter_align
seqc run section5.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 43.4457s

# with @inter_align
seqc run section6.seq data/chr22.fa data/reads.fq > out.txt
# mapping took 32.3241s

(The timings with inter-sequence alignment will depend on the SIMD instruction sets your CPU supports; these numbers are from using AVX2.)

Full code listing

click to download

# SeqMap
# Seq workshop -- Section 6
# Reads index constructed in Section 2 and looks up k-mers from
# input reads to find candidate mappings, then performs alignment.
# Implemented with Seq pipelines using inter-seq. alignment.
# Usage: seqc run section6.seq <FASTA path> <FASTQ path>
from sys import argv
from time import timing
from bio import *
import pickle
import gzip

K = Kmer[32]
index = None

reference = s''
for record in FASTA(argv[1]):
    reference = record.seq

with gzip.open(argv[1] + '.index', 'rb') as jar:
    index = pickle.load[dict[K,int]](jar)

def find_candidates(record):
    candidates = Dict[int,int]()
    for pos,kmer in record.read.kmers_with_pos[K](step=1):
        found = index.get(min(kmer, ~kmer), -1)
        if found > 0:
            candidates.increment(found - pos)
    for pos,count in candidates.items():
        if count > 1:
            yield record, pos

@inter_align
def align_and_output(t):
    record, pos = t
    query = record.read
    target = reference[pos:pos + len(query)]
    alignment = query.align(target)
    print record.name, pos + 1, alignment.score, alignment.cigar

with timing('mapping'):
    FASTQ(argv[2]) |> iter |> find_candidates |> align_and_output