RNA secondary structure prediction

Given a nucleic acid sequence of RNA, find a maximum matching of \(\{A,U\}\) or \(\{C,G\}\) base pairs without knots or sharp turns.

This is a modern C++ implementation that employs (iterative) dynamic programming on intervals to find the cardinality of the maximum matching of base pairs as well as the base pairs in the matching.

Background

RNA is a nucleic acid similar to DNA, but with only a single, helical strand of bases. It plays a key role in turning DNA instructions into functional proteins. RNA molecules fold into complex secondary structures which govern their behaviour.

Various rules govern secondary structure formation:

Pairs of bases match up. Each base matches with no more than one other base.
Adenine always matches with Uracil, while Cytosine always matches with Guanine and vice versa.
There are no kinks in the folded molecule.
Structures are knot-free.

Given an RNA molecule, the aim is to predict its secondary structure.

Problem formulation

An RNA molecule can be represented by a string \(B = b_1 b_2 \ldots b_n\), where each character \(b_i \in \{ A, C, G, U \}\).

A secondary structure on \(B\) is a set of pairs \(S = \{ (i, j) \ldots \}\), where \(1 \le i, j \le n\), satisfying the following rules.

No sharp turns: The ends of each pair are separated by at least some number of intervening bases i.e. if \((i, j) \in S\), then \(i \lt j - \texttt{MIN\_LEN}\).
Complementary base pairs: The elements in each pair in S consist of either \(\{A,U\}\) or \( \{C,G\}\) (in either order).
\(S\) is a Matching: No base appears in more than one pair.
No knots: If \((i,j) \in S\) and \((k,l) \in S\), then we cannot have \(i \lt k \lt j \lt l\).

Thus the problem boils down to finding a maximum matching of \(\{A,U\}\) or \(\{C,G\}\) base pairs without knots or sharp turns.

Dynamic Programming solution

Let \(dp_{l, r }\) denote the cardinality of the maximum matching in \(b_l b_{l+1} \ldots b_r\) of the sequence \(B\). Then it is easy to notice the following recurrence.

\[ dp_{ l, r } = \max \begin{cases} \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } 0 & r - l \le \texttt{MIN\_LEN} \\ \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } dp_{ l, r-1 } & r - l \gt \texttt{MIN\_LEN} \\ \underset{l \le m \le r}{\max} \text{ } \text{ } dp_{ l, m-1 } + 1 + dp_{ m+1, r-1} & b_m \text{ and } b_r \text{ are complementary bases} \end{cases} \]

The required value is \(dp_{ 1, n }\) — the maximum matching in the sequence \(B\).

Timing Analysis

Each state takes \(\mathcal{O}(r-l)\) time to compute and there are \(\mathcal{O}(n^2)\) states. Hence the time complexity of computing all the dp states is \(\mathcal{O}(n^3)\).

Also, since there are \(\mathcal{O}(n^2)\) dp states which must be stored in memory, the space complexity is \(\mathcal{O}(n^2)\).

Visualization

The following were generated using a python script.