Split alignment Martin C. Frith April 13, 2012

1

Introduction

This document is about aligning a query sequence to a genome, allowing different parts of the query to match different parts of the genome. Here are some possible applications:

• Detecting genome rearrangements in cancer, by aligning DNA reads from cancer cells to a reference genome, and looking for reads that span rearrangement breakpoints.
• Detecting novel transposon insertions, by looking for DNA reads that span insertion boundaries.
• Detecting trans-splicing, by aligning RNA reads to a genome, and looking for reads whose alignment is split between disjoint genome locations.
• Spliced alignment (alignment of cis-spliced RNA to DNA) is an important special case.

2

Local versus semi-global alignment

Split alignment can be either local or semi-global. Semi-global means that all of the query must participate in the alignment. Local means that a contiguous segment of the query is aligned.

3

Scoring scheme

I use the standard affine-gap scoring scheme, with one additional parameter: a negative “jump score” which is incurred when the alignment jumps from one part of the genome to another. For now, let us assume that the jump score is constant. Later, we will consider using different scores for different jumps.

$S(x, y)$: score for aligning genomic base $x$ to query base $y$.
$a$: gap existence score.
$b$: gap extension score. (A gap of length $k$ scores $a + b \times k$.)
$d$: jump score.

Note that $a$, $b$, and $d$ are negative.

4

Probabilistic interpretation

Alignment scores are often interpreted as scaled log probabilities:

$$\text{score} = t \times \ln(\text{probability}) \tag{1}$$

Here, $t$ is an arbitrary scale factor. (If we multiply all the score parameters by a constant factor, it makes no difference to the alignment.) This may provide guidance for choosing the jump score. If the probability of a rearrangement breakpoint occurring immediately after any base is $\delta$, the jump score is:

$$d = t \times \ln\left(\frac{\delta}{2g}\right) \tag{2}$$

Here, $g$ is the genome size, and $1/(2g)$ is the probability of jumping to any one base on either strand.
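As a small numerical illustration of equation (2), the following Python sketch computes a jump score from a breakpoint probability, a genome size, and a scale factor. The function name `jump_score` and the example values (a breakpoint probability of 0.01, a genome of 3 × 10^9 bases, t = 1) are assumptions chosen for illustration only.

```python
import math


def jump_score(delta, genome_size, t=1.0):
    """Jump score d = t * ln(delta / (2 * g)), following equation (2).

    delta: probability of a breakpoint immediately after any base.
    genome_size: g, the number of bases in the genome (one strand).
    t: the arbitrary scale factor of the scoring scheme.
    """
    if not 0 < delta <= 1:
        raise ValueError("delta must be a probability in (0, 1]")
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return t * math.log(delta / (2 * genome_size))


# Illustrative values only: a rare breakpoint in a large genome.
d = jump_score(delta=1e-2, genome_size=3e9, t=1.0)
print(f"jump score d = {d:.2f}")
```

Because the jump score is proportional to t, rescaling all score parameters by the same factor leaves the relative cost of a jump unchanged.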

5

Minimum matches on each side of a jump

Typically, the match score $S(x, x)$ for DNA is $\approx t \times \ln(4)$. A maximal-score local alignment can include a jump only if there are at least $-d/S(x, x) \approx \log_4(2g/\delta)$ matches on each side of the jump. So we cannot detect jumps that are too near the edge of the query.

Table 1: Minimum matches on each side of a jump

$\delta$       $g$                minimum matches
$10^{-2}$      $3 \times 10^9$    $\approx 20$
$10^{-3}$      $3 \times 10^9$    $\approx 21$
$10^{-6}$      $3 \times 10^9$    $\approx 26$

6

Exact algorithm

Here is an algorithm that is guaranteed to find a maximal-scoring split alignment. For simplicity, let us consider a genome with just one sequence and one strand. (The extension to multiple sequences and strands is straightforward.)

6.1

Definitions

$G_1, \ldots, G_m$: genome sequence of length $m$.
$Q_1, \ldots, Q_n$: query sequence of length $n$.
$i$: ranges from 1 to $m$.
$j$: ranges from 1 to $n$.

6.2

Local alignment

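Initialization

$$W_{0,0} = 0 \tag{3}$$

$$W_{i,0} = 0 \qquad Y_{i,0} = -\infty \qquad Z_{i,0} = -\infty \tag{4}$$

$$W_{0,j} = 0 \qquad Y_{0,j} = -\infty \qquad Z_{0,j} = -\infty \tag{5}$$

Recurrence

$$X_{i,j} = \max(W_{i-1,j-1},\; W_{j-1} + d) + S(G_i, Q_j) \tag{6}$$

$$Y_{i,j} = \max(W_{i-1,j} + a,\; Y_{i-1,j}) + b \tag{7}$$

$$Z_{i,j} = \max(W_{i,j-1} + a,\; Z_{i,j-1}) + b \tag{8}$$

$$W_{i,j} = \max(X_{i,j},\; Y_{i,j},\; Z_{i,j},\; 0) \tag{9}$$

$$W_j = \max_i(W_{i,j}) \tag{10}$$

Termination

$$\text{Optimal alignment score} = \max_{i,j}(W_{i,j}) \tag{11}$$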

This algorithm finds the maximum possible alignment score. In order to find an alignment that has this score, we can perform a traceback, similarly to standard alignment methods. This algorithm is similar to the classic Gotoh algorithm for affine-gap alignment. If we set d = −∞, it becomes identical to the Gotoh algorithm. The time complexity is O(mn), which is practical for small genomes and small query datasets, but not for large genomes and multi-gigabase query data.
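The local recurrence in equations (3) to (11) can be sketched in Python as follows. This is a minimal score-only sketch (no traceback), not a reference implementation: the function name, the simple match/mismatch stand-in for S(x, y), and the default parameter values are all assumptions made for illustration. It keeps only the previous query column in memory.

```python
import math

NEG_INF = -math.inf


def split_local_score(genome, query, match=1, mismatch=-1, a=-2, b=-1, d=-5):
    """Maximum local split-alignment score for one genome strand.

    a: gap existence score, b: gap extension score, d: jump score
    (all negative). match/mismatch stand in for S(x, y) (an assumption).
    """
    m, n = len(genome), len(query)
    W_prev = [0.0] * (m + 1)      # W[i][j-1]; column j = 0 is all zero
    Z_prev = [NEG_INF] * (m + 1)  # Z[i][j-1]
    col_best_prev = 0.0           # W_{j-1} = max over i of W[i][j-1]
    best = 0.0
    for j in range(1, n + 1):
        W_cur = [0.0] * (m + 1)      # W[0][j] = 0
        Y_cur = [NEG_INF] * (m + 1)  # Y[0][j] = -inf
        Z_cur = [NEG_INF] * (m + 1)  # Z[0][j] = -inf
        for i in range(1, m + 1):
            s = match if genome[i - 1] == query[j - 1] else mismatch
            # Aligned pair: continue diagonally, or jump in from anywhere.
            x = max(W_prev[i - 1], col_best_prev + d) + s
            # Genome base skipped (same query column, previous genome row).
            y = max(W_cur[i - 1] + a, Y_cur[i - 1]) + b
            # Query base skipped (previous query column, same genome row).
            z = max(W_prev[i] + a, Z_prev[i]) + b
            Y_cur[i], Z_cur[i] = y, z
            W_cur[i] = max(x, y, z, 0)
        col_best_prev = max(W_cur)
        best = max(best, col_best_prev)
        W_prev, Z_prev = W_cur, Z_cur
    return best


genome = "AAAAAAAATTTTTTTTCCCCCCCCGGGGGGGG"
print(split_local_score(genome, "AAAAAAAACCCCCCCC"))               # with jumps
print(split_local_score(genome, "AAAAAAAACCCCCCCC", d=-math.inf))  # no jumps
```

Passing `d=-math.inf` disables jumps, which corresponds to the reduction to ordinary affine-gap local alignment described above.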

6.3

Semi-global alignment

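Initialization

$$W_{0,0} = 0 \tag{12}$$

$$W_{i,0} = 0 \qquad Y_{i,0} = -\infty \qquad Z_{i,0} = -\infty \tag{13}$$

$$W_{0,j} = -\infty \qquad Y_{0,j} = -\infty \qquad Z_{0,j} = -\infty \tag{14}$$

Recurrence

$$X_{i,j} = \max(W_{i-1,j-1},\; W_{j-1} + d) + S(G_i, Q_j) \tag{15}$$

$$Y_{i,j} = \max(W_{i-1,j} + a,\; Y_{i-1,j}) + b \tag{16}$$

$$Z_{i,j} = \max(W_{i,j-1} + a,\; Z_{i,j-1}) + b \tag{17}$$

$$W_{i,j} = \max(X_{i,j},\; Y_{i,j},\; Z_{i,j}) \tag{18}$$

$$W_j = \max_i(W_{i,j}) \tag{19}$$

Termination

$$\text{Optimal alignment score} = \max_i(W_{i,n}) \tag{20}$$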
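The semi-global variant differs from the local one in three places: the first row is initialized to minus infinity, the recurrence for W has no zero option, and the score is read from the last query column only. A minimal Python sketch under the same kind of assumptions (hypothetical function name, match/mismatch scores standing in for S(x, y), illustrative defaults, score only) might look like this:

```python
import math

NEG_INF = -math.inf


def split_semiglobal_score(genome, query, match=1, mismatch=-1,
                           a=-2, b=-1, d=-5):
    """Maximum split-alignment score in which every query base participates."""
    m, n = len(genome), len(query)
    W_prev = [0.0] * (m + 1)      # column j = 0: W[i][0] = 0 for all i
    Z_prev = [NEG_INF] * (m + 1)
    col_best_prev = 0.0           # W_{j-1}
    for j in range(1, n + 1):
        W_cur = [NEG_INF] * (m + 1)  # W[0][j] = -inf
        Y_cur = [NEG_INF] * (m + 1)
        Z_cur = [NEG_INF] * (m + 1)
        for i in range(1, m + 1):
            s = match if genome[i - 1] == query[j - 1] else mismatch
            x = max(W_prev[i - 1], col_best_prev + d) + s
            y = max(W_cur[i - 1] + a, Y_cur[i - 1]) + b
            z = max(W_prev[i] + a, Z_prev[i]) + b
            Y_cur[i], Z_cur[i] = y, z
            W_cur[i] = max(x, y, z)  # no zero option: the query cannot restart
        col_best_prev = max(W_cur)
        W_prev, Z_prev = W_cur, Z_cur
    return col_best_prev             # max over i of W[i][n]


genome = "AAAAAAAATTTTTTTTCCCCCCCCGGGGGGGG"
print(split_semiglobal_score(genome, "AAAAAAAACCCCCCCC"))
print(split_semiglobal_score("AAAA", "GGGG"))  # negative: all bases must align
```

Unlike the local version, the result can be negative, because a poorly matching query still has to be aligned in full.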

7

Allowed positions for jumps

In the preceding algorithms, jumps cannot occur immediately before alignment gaps (insertions or deletions). They can only occur immediately before aligned bases. This restriction cannot prevent us from finding the optimal alignment score, because a jump followed by a gap has the same score as a gap followed by a jump. Moreover, optimal alignments never have jumps adjacent to deletions (in the query relative to the genome). This is because such a deletion could be absorbed into the jump, improving the alignment score.

8

Fast heuristic algorithm

To cope with huge datasets, I propose a fast but inexact algorithm (for local split alignment). This has two steps:

1. Find local, non-split alignments between the query and the genome using a fast aligner such as LAST. Let us call these “initial alignments”.
2. Find a maximal-scoring split alignment using (parts of) the initial alignments.

Step 2 is described here.

8.1

Definitions

$m$: the number of initial alignments.
$n$: the length of the query sequence.
$Q_1, \ldots, Q_n$: the query sequence.
$I^B_{i,j}$: 1 if initial alignment $i$ begins at query base $j$, 0 otherwise.
$I^E_{i,j}$: 1 if initial alignment $i$ ends at query base $j$, 0 otherwise.

I assume that the initial alignments never begin or end with gaps (insertions or deletions). For the next three definitions, conceptually convert the initial alignments to semi-global alignments, by regarding unaligned query bases as insertions.

$A_{i,j}$: the alignment score for query base $j$ in alignment $i$. If it is aligned to a genomic base $x$, the score is $S(x, Q_j)$. If it is aligned to an initial gap, the score is $a + b$. If it is aligned to a non-initial gap, the score is $b$.
$D_{i,j}$: the deletion score between query bases $j$ and $j + 1$ in alignment $i$. If there are no deleted bases then $D_{i,j} = 0$. Otherwise, if there are $k$ deleted bases, $D_{i,j} = a + b \times k$.
$I^X_{i,j}$: 1 if query base $j$ is aligned to a non-initial gap in alignment $i$, 0 otherwise.

Finally, we will use logs of the indicator variables, so that each is 0 where the indicator is 1 and $-\infty$ where the indicator is 0:

$$J^B_{i,j} = \log I^B_{i,j} \qquad J^E_{i,j} = \log I^E_{i,j} \tag{21}$$

8.2

Algorithm

Initialization

$$W_{i,0} = -\infty \tag{22}$$

$$W_0 = -\infty \tag{23}$$

Recurrence

$$W_{i,j} = \max(W_{i,j-1} + D_{i,j-1},\; J^B_{i,j},\; W_{j-1} + d + a \cdot I^X_{i,j}) + A_{i,j} \tag{24}$$

$$W_j = \max_i(W_{i,j}) \tag{25}$$

Termination

$$\text{Optimal alignment score} = \max_{i,j}(W_{i,j} + J^E_{i,j}) \tag{26}$$

This algorithm only considers split alignments that begin at the beginning of one of the initial alignments. This is because scores $> -\infty$ can arise only when $J^B_{i,j} > -\infty$. Likewise, it only considers split alignments that end at the end of one of the initial alignments (by using $J^E_{i,j}$).

The algorithm does not consider jumps adjacent to deletions (in the query relative to the genome). This restriction is harmless, because jumps adjacent to deletions have sub-optimal score. It does consider jumps adjacent to insertions. The time complexity is $O(mn)$, which is no more than needed just to read the alignments.
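To make the recurrence in equations (22) to (26) concrete, here is a minimal Python sketch. The data layout is an assumption: each initial alignment is a dict holding its first and last query base plus 1-based per-base tables `A`, `D` and `IX`. The helper `ungapped` is hypothetical and only builds these tables for ungapped, all-match initial alignments; it treats the first unaligned base of each run as the initial gap position. Real aligner output would need its own conversion.

```python
import math

NEG_INF = -math.inf


def ungapped(n, q_start, q_end, a, b, match=1.0):
    """Hypothetical helper: tables for one ungapped, all-match initial
    alignment covering query bases q_start..q_end (1-based, inclusive)."""
    A = [0.0] * (n + 1)
    IX = [0] * (n + 1)
    for j in range(1, n + 1):
        if q_start <= j <= q_end:
            A[j] = match          # aligned base
        elif j == 1 or j == q_end + 1:
            A[j] = a + b          # first base of an unaligned run
        else:
            A[j] = b              # later base of an unaligned run
            IX[j] = 1
    return {"begin": q_start, "end": q_end, "A": A,
            "D": [0.0] * (n + 1), "IX": IX}


def heuristic_split_score(n, alns, a, d):
    """Best split-alignment score built from parts of the initial alignments."""
    W_prev = [NEG_INF] * len(alns)  # W[i][0] = -inf
    col_best_prev = NEG_INF         # W_0 = -inf
    best = NEG_INF
    for j in range(1, n + 1):
        W_cur = []
        for i, aln in enumerate(alns):
            extend = W_prev[i] + aln["D"][j - 1]       # stay in alignment i
            begin = 0.0 if aln["begin"] == j else NEG_INF  # J^B
            jump = col_best_prev + d + a * aln["IX"][j]    # jump in from best
            w = max(extend, begin, jump) + aln["A"][j]
            W_cur.append(w)
            if aln["end"] == j:     # J^E: only these cells may terminate
                best = max(best, w)
        col_best_prev = max(W_cur, default=NEG_INF)
        W_prev = W_cur
    return best


a, b, d = -2.0, -1.0, -5.0
alns = [ungapped(16, 1, 8, a, b), ungapped(16, 9, 16, a, b)]
print(heuristic_split_score(16, alns, a, d))
```

Each query position is visited once per initial alignment, which matches the stated O(mn) time.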

9

Variable jump scores

For spliced alignment, we would like to vary the jump score, to model intron length distributions and splice signals (such as GT-AG). Here is a modification of the heuristic algorithm to allow this. Unfortunately, the time complexity increases to $O(m^2 n)$.

$d_{k,i,j}$: the score for jumping from query base $j$ in initial alignment $k$ to query base $j + 1$ in initial alignment $i$.

Initialization

$$W_{i,0} = -\infty \tag{27}$$

$$W_0 = -\infty \tag{28}$$

Recurrence

$$M_{i,j} = \max_k(W_{k,j-1} + d_{k,i,j-1}) \tag{29}$$

$$W_{i,j} = \max(W_{i,j-1} + D_{i,j-1},\; J^B_{i,j},\; M_{i,j} + a \cdot I^X_{i,j}) + A_{i,j} \tag{30}$$

Termination

$$\text{Optimal alignment score} = \max_{i,j}(W_{i,j} + J^E_{i,j}) \tag{31}$$

This algorithm also does not consider jumps adjacent to deletions. Unfortunately, this restriction is no longer harmless. For example, an adjacent deletion may permit a jump to use a consensus splice signal. It might be possible to consider adjacent deletions within the calculation of $d_{k,i,j}$.
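A minimal Python sketch of the variable-jump recurrence in equations (27) to (31) follows. The jump score is supplied as a callable `jump(k, i, j)`, which is an interface chosen here for illustration; how such a function would encode intron lengths or splice signals is not specified and is left out. The alignment layout and the `ungapped` helper are likewise hypothetical, covering only ungapped, all-match initial alignments.

```python
import math

NEG_INF = -math.inf


def ungapped(n, q_start, q_end, a, b, match=1.0):
    """Hypothetical helper: tables for one ungapped, all-match initial
    alignment covering query bases q_start..q_end (1-based, inclusive)."""
    A = [0.0] * (n + 1)
    IX = [0] * (n + 1)
    for j in range(1, n + 1):
        if q_start <= j <= q_end:
            A[j] = match
        elif j == 1 or j == q_end + 1:
            A[j] = a + b          # first base of an unaligned run
        else:
            A[j] = b              # later base of an unaligned run
            IX[j] = 1
    return {"begin": q_start, "end": q_end, "A": A,
            "D": [0.0] * (n + 1), "IX": IX}


def split_score_variable_jumps(n, alns, a, jump):
    """jump(k, i, j): score for jumping from query base j in alignment k
    to query base j + 1 in alignment i (0-based alignment indices)."""
    m = len(alns)
    W_prev = [NEG_INF] * m
    best = NEG_INF
    for j in range(1, n + 1):
        W_cur = []
        for i, aln in enumerate(alns):
            # M[i][j]: best predecessor over all source alignments k.
            M = max((W_prev[k] + jump(k, i, j - 1) for k in range(m)),
                    default=NEG_INF)
            begin = 0.0 if aln["begin"] == j else NEG_INF
            w = max(W_prev[i] + aln["D"][j - 1], begin,
                    M + a * aln["IX"][j]) + aln["A"][j]
            W_cur.append(w)
            if aln["end"] == j:
                best = max(best, w)
        W_prev = W_cur
    return best


a, b = -2.0, -1.0
alns = [ungapped(16, 1, 8, a, b), ungapped(16, 9, 16, a, b)]
# Illustrative jump model: one favoured junction, all other jumps forbidden.
favoured = lambda k, i, j: -3.0 if (k, i, j) == (0, 1, 8) else NEG_INF
print(split_score_variable_jumps(16, alns, a, favoured))
```

The inner maximum over k is what raises the cost from O(mn) to O(m^2 n).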

10

Alignment ambiguity

Often, one query will have several alternative alignments that are almost as good as the best alignment. In other words, the alignment is ambiguous. For many applications, however, we would like to find (parts of) alignments that are highly unambiguous.

To do this, I use the probabilistic interpretation of alignment scores to assign a probability to each alignment. We can then calculate various marginal probabilities for parts of alignments. Alignment parts with high marginal probabilities (e.g. $\geq 0.999$) can be regarded as highly unambiguous.

I assume that each alignment’s probability is proportional to $\exp(s/t)$, where $s$ is its score. The key step is to calculate the normalization factor $z$, the sum of $\exp(s/t)$ over all alignments. The following version of the heuristic algorithm calculates $z$.

$$a' = \exp(a/t) \qquad b' = \exp(b/t) \qquad A'_{i,j} = \exp(A_{i,j}/t) \tag{32}$$

$$d' = \exp(d/t) \qquad D'_{i,j} = \exp(D_{i,j}/t) \tag{33}$$

Initialization

$$W'_{i,0} = 0 \tag{34}$$

$$W'_0 = 0 \tag{35}$$

Recurrence

$$W'_{i,j} = \left( W'_{i,j-1} \cdot D'_{i,j-1} + I^B_{i,j} + W'_{j-1} \cdot d' \cdot (a')^{I^X_{i,j}} \right) \cdot A'_{i,j} \tag{36}$$

$$W'_j = \sum_i W'_{i,j} \tag{37}$$

Termination

$$z = \sum_{i,j} (W'_{i,j} \cdot I^E_{i,j}) \tag{38}$$
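The calculation of z in equations (32) to (38) replaces each maximum of the heuristic recurrence with a sum and each added score with a multiplied exponential. A minimal Python sketch, assuming a hypothetical dict layout for initial alignments (first and last query base plus 1-based tables `A`, `D`, `IX`) and a hypothetical `ungapped` helper for ungapped, all-match alignments, could be:

```python
import math


def ungapped(n, q_start, q_end, a, b, match=1.0):
    """Hypothetical helper: tables for one ungapped, all-match initial
    alignment covering query bases q_start..q_end (1-based, inclusive)."""
    A = [0.0] * (n + 1)
    IX = [0] * (n + 1)
    for j in range(1, n + 1):
        if q_start <= j <= q_end:
            A[j] = match
        elif j == 1 or j == q_end + 1:
            A[j] = a + b          # first base of an unaligned run
        else:
            A[j] = b              # later base of an unaligned run
            IX[j] = 1
    return {"begin": q_start, "end": q_end, "A": A,
            "D": [0.0] * (n + 1), "IX": IX}


def partition_function(n, alns, a, d, t=1.0):
    """Sum of exp(score / t) over all split alignments considered."""
    a_p = math.exp(a / t)
    d_p = math.exp(d / t)
    W_prev = [0.0] * len(alns)  # W'[i][0] = 0
    col_sum_prev = 0.0          # W'_0 = 0
    z = 0.0
    for j in range(1, n + 1):
        W_cur = []
        for i, aln in enumerate(alns):
            D_p = math.exp(aln["D"][j - 1] / t)
            A_p = math.exp(aln["A"][j] / t)
            I_B = 1.0 if aln["begin"] == j else 0.0
            w = (W_prev[i] * D_p + I_B
                 + col_sum_prev * d_p * a_p ** aln["IX"][j]) * A_p
            W_cur.append(w)
            if aln["end"] == j:  # I^E: only these cells contribute to z
                z += w
        col_sum_prev = sum(W_cur)
        W_prev = W_cur
    return z


a, b, d = -2.0, -1.0, -5.0
alns = [ungapped(16, 1, 8, a, b), ungapped(16, 9, 16, a, b)]
print(partition_function(16, alns, a, d))
```

This sketch works with plain floating-point probabilities, which can overflow or underflow for long queries or large scores; rescaling per column or working in log space would be one way to address that, and is not shown.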

11

Limitations

• The heuristic algorithms do not realign bases near jumps. For example, suppose the true alignment has a mismatch adjacent to a jump. The initial alignment will exclude the mismatch, because doing so improves the local alignment score. The final alignment will probably have a single-base insertion instead of the mismatch.
