Bioinformatics Past Paper
Algorithm, Software, Examples
[Which] Name
[Input] assumptions/conditions/constraints
foundation of reasoning / fill incomplete info/gap
exclude potential limitations
[Output] requirements
[How] methods, description, main steps
[Why] Motivation / Intention / Purposes
[Extension] comparison (pros and cons), speedup, application, complexity.
Lecture 1: Genetics
Reference online resource: Teaching
Youtube @bioinfalgorithms accompanying the textbook "Bioinformatics Algorithms: An Active Learning Approach".
features: DNA vs RNA
strand structure: DNA is a double helix; RNA is single-stranded.
nitrogenous bases: DNA uses A, T, C, G; RNA uses A, U, C, G.
base pairs: DNA pairs A-T and C-G; RNA pairs A-U and C-G.
sugar: DNA has deoxyribose; RNA has ribose.
function: DNA is the long-term storage of genetic information; RNA carries out protein synthesis and gene regulation.
stability: DNA is more stable (-H at the 2' position); RNA is less stable due to the 2' hydroxyl group (-OH).
nitrogenous bases: Adenine (A), Thymine (T), Cytosine (C), Guanine (G), Uracil (U).
The total number of codons is $4^3 = 64$.
The total number of amino acids is 20.
Codon degeneracy: multiple codons are mapped to the same amino acid, providing redundancy if a codon is mutated.
concept: gene vs genome
definition: a gene is a specific DNA segment encoding a protein or RNA; a genome is the complete set of genetic material.
functionality: genes carry heredity and code for proteins; the genome comprises all genes and non-coding sequences.
size: genes have various sizes; the genome is the entire DNA content of the organism.
region and feature: DNA genes vs programming functions
start marker: promoter region ~ function declaration (e.g., def func()).
coding region: exons (encode protein information) ~ function body.
non-coding region: introns (intervening sequences) ~ comments or placeholder code.
end marker: termination signal (of transcription) ~ closing brace or return statement.
purpose: defines the structure and expression of a gene ~ defines the structure and behavior of a function.
execution: transcription and translation to produce proteins ~ execution of the function when called.
DNA -- transcription $\rightarrow$ messenger RNA (mRNA) -- translation $\rightarrow$ protein; codons, amino acids.
Firstly, transcription happens, where a specific segment of DNA is copied into messenger RNA (mRNA). Secondly, translation occurs, where the mRNA is used to synthesize a protein, by attaching to a ribosome.
The ribosome (composed of ribosomal RNA (rRNA) and proteins) reads the codons (three nucleotides of mRNA specifying a particular amino acid) until the ending signal appears.
Transfer RNA (tRNA) molecules bring amino acids to the ribosome. Each tRNA has an anticodon that is complementary to the mRNA codon.
The ribosome facilitates the formation of peptide bonds between amino acids, linking them together to form a growing polypeptide chain.
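A minimal sketch of these two steps, assuming a toy codon table with only a handful of entries (a real table has all 64 codons); the names used here are illustrative:

```python
# Minimal sketch of transcription and translation (toy, partial codon table).
CODON_TABLE = {  # only a few codons for illustration
    "AUG": "Met",  # start codon
    "UUU": "Phe", "GGC": "Gly", "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def transcribe(dna_template: str) -> str:
    """Transcription: complement the template strand and replace T with U."""
    complement = {"A": "U", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[b] for b in dna_template)

def translate(mrna: str) -> list:
    """Translation: read codons (3 nucleotides each) until a stop codon appears."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "?")  # '?' = codon not in the toy table
        if aa == "STOP":
            break
        protein.append(aa)
    return protein

mrna = transcribe("TACAAACCG")   # template strand -> "AUGUUUGGC"
print(mrna, translate(mrna))     # AUGUUUGGC ['Met', 'Phe', 'Gly']
```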
Gene network [extensions]
Lecture 2: Sequence alignment
Distance metrics
Hamming distance vs edit distance
Dynamic programming
y2012p9q1 (a.i)
very similar DNA sequences: y2020p8q2 (a) , y2007p8q13 (c)
banded dp, for the aligned $x_i, y_j$ with $|i - j| \leq k$.
For $j \in [\max(1, i-k), \min(|w|, i+k)]$.
complexity: $O(|v| \times k) < O(|v| \times |w|)$, where $k$ is the band width.
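A minimal sketch of the banded recurrence above; the match/mismatch/gap scores (+1/-1/-1) are illustrative, not from the notes:

```python
# Banded global alignment: only fill cells with |i - j| <= k.
def banded_align(v: str, w: str, k: int, match=1, mismatch=-1, gap=-1):
    NEG = float("-inf")
    # dp[i][j] = best score aligning v[:i] with w[:j]; cells outside the band stay -inf
    dp = [[NEG] * (len(w) + 1) for _ in range(len(v) + 1)]
    dp[0][0] = 0
    for j in range(1, min(len(w), k) + 1):
        dp[0][j] = j * gap
    for i in range(1, len(v) + 1):
        if i <= k:
            dp[i][0] = i * gap
        for j in range(max(1, i - k), min(len(w), i + k) + 1):
            s = match if v[i - 1] == w[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # (mis)match
                           dp[i - 1][j] + gap,     # gap in w
                           dp[i][j - 1] + gap)     # gap in v
    return dp[len(v)][len(w)]

print(banded_align("GATTACA", "GATTTCA", k=2))  # 5: six matches, one mismatch
```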
Scoring matrices
Gaps
y2012p9q1 (a.ii) , y2006p8q13 (b)
non-linear: dp with time $O(|v|^2 \cdot |w|)$, due to the loop over $k \in [0, i) \lor [0, j)$.
compromise: affine $\gamma(n) = d + e \times (n-1)$ (a worked comparison follows after this block),
where $d$ is the gap opening penalty,
and $e$ is the gap extension penalty.
$F_{middle}, G_{indel} \in \mathbb{Z}^{(|v|+1) \times (|w|+1)}$.
y2022p9q2 (a) , y2020p8q2 (b) , y2013p7q3 (b) , y2021p9q2 (a)
mismatch and gap extension penalty
If the gap penalty is decreased, alignments contain more gaps (and fewer contiguous homologous regions), and vice versa.
If the mismatch penalty or gap extension penalty is decreased, alignments contain fewer gaps (more regions aligned as similar).
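A small worked comparison of the affine penalty $\gamma(n) = d + e \times (n-1)$ against a linear per-symbol penalty; the values d = 5, e = 1 are made up for illustration:

```python
# Affine gap penalty: opening a gap costs d, each extension costs e.
def affine_gap(n, d=5, e=1):          # illustrative d, e values
    return d + e * (n - 1) if n > 0 else 0

def linear_gap(n, d=5):               # same per-symbol cost d for every gapped position
    return d * n

for n in (1, 2, 5, 10):
    print(n, affine_gap(n), linear_gap(n))
# affine: 5, 6, 9, 14  vs  linear: 5, 10, 25, 50
# -> one long gap is penalised much less than many short gaps under the affine scheme.
```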
Algorithms
Longest Common Subsequence (LCS)
edit graph $M \in \mathbb{Z}^{(m+1) \times (n+1)}$, where $v$ and $w$ are two strings with lengths $m = |v|$ and $n = |w|$, filled from M[0, 0]:

M[i, j] ---d--- M[i, j+1]
   |       \
   d        s(i, j)
   |          \
M[i+1, j]     M[i+1, j+1]

where s(i, j) = 5 if match, -3 otherwise, and d = 2 is the gap penalty.
edit distance $= m + n - 2 \times \text{LCS}(m, n)$.
y2015p9q1 (a)
y2007p8q13 (b)
LCS is equivalent to global alignment with a scoring scheme of (match=1, mismatch=0, gap=0).
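A minimal LCS sketch over the same edit graph, together with the indel-only edit distance identity above:

```python
# Longest Common Subsequence via dynamic programming over the (m+1) x (n+1) edit graph.
def lcs_length(v: str, w: str) -> int:
    m, n = len(v), len(w)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if v[i - 1] == w[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1             # diagonal match edge
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # skip a symbol of v or w
    return dp[m][n]

v, w = "ATCTGAT", "TGCATA"
L = lcs_length(v, w)                       # LCS "TCTA", length 4
print(L, len(v) + len(w) - 2 * L)          # indel-only edit distance = 7 + 6 - 8 = 5
```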
General alignment (global vs. local)
y2022p9q2 (a) , y2018p27q3 (a) , y2012p9q1 (b.i)
Input: strings $v$ and $w$, as well as a scoring matrix.
Output: optimal alignment score, with alignment(s) of $v$ and $w$ recovered via traceback.
Complexity: $O(|v| \cdot |w|)$ time, $O(|v| \cdot |w|)$ space.
local alignment = global alignment in a sub-rectangle.
difference: initialization, termination (a sketch follows at the end of this block).
Global alignment (Needleman-Wunsch )
Local alignment (Smith-Waterman )
Speedup extension (Four Russians)
Linear space extension (Hirschberg, divide and conquer)
general : complexity summary
general : long DNA sequences
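A minimal sketch contrasting global (Needleman-Wunsch) and local (Smith-Waterman) alignment: the code differs only in initialization (gap-scored first row/column vs. the extra 0 option) and termination (corner cell vs. maximum cell). The scoring values are illustrative:

```python
# Global (Needleman-Wunsch) vs local (Smith-Waterman) alignment scores.
def align(v, w, match=1, mismatch=-1, gap=-2, local=False):
    m, n = len(v), len(w)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    if not local:                                   # global: gaps along first row/column
        for i in range(1, m + 1): dp[i][0] = i * gap
        for j in range(1, n + 1): dp[0][j] = j * gap
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if v[i - 1] == w[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            if local:
                dp[i][j] = max(dp[i][j], 0)         # local: free restart ("sub-rectangle")
                best = max(best, dp[i][j])          # local: terminate at the best cell
    return best if local else dp[m][n]              # global: terminate at the corner

print(align("ACGTTG", "TACGAT"))                # global score
print(align("ACGTTG", "TACGAT", local=True))    # local score (always >= 0)
```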
RNA secondary structure prediction (Nussinov-Jacobson folding )
y2023p8q2 (b) , y2022p9q2 (c) , y2019p8q2 (a) , y2009p7q5 (a) , y2015p9q1 (b)
dp: $\gamma \in \mathbb{Z}^{n \times n}$.
initialization: $\gamma(i, i-1) = 0$, $\gamma(i, i) = 0$.
$$\gamma(i, j) = \max \begin{cases} \gamma(i+1, j) \\ \gamma(i, j-1) \\ \gamma(i+1, j-1) + \delta(i, j) \\ \max_{i < k < j} \gamma(i, k) + \gamma(k+1, j) \end{cases}$$
complexity: $O(n^3)$ time, $O(n^2)$ space.
limitations: identifying pseudo-knots and branched loops is difficult, because the same segment interacts with different (crossing) segments, which the nested recurrence cannot capture.
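A minimal Nussinov sketch following the recurrence above, assuming $\delta(i, j) = 1$ for Watson-Crick pairs (A-U, C-G) and 0 otherwise; wobble pairs and a minimum loop length are ignored for brevity:

```python
# Nussinov folding: g[i][j] = max number of non-crossing base pairs in x[i..j].
def nussinov(x: str) -> int:
    n = len(x)
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    g = [[0] * n for _ in range(n)]                  # gamma(i, i) = gamma(i, i-1) = 0
    for span in range(1, n):                         # fill by increasing substring length
        for i in range(0, n - span):
            j = i + span
            delta = 1 if (x[i], x[j]) in pairs else 0
            best = max(g[i + 1][j],                  # i unpaired
                       g[i][j - 1],                  # j unpaired
                       g[i + 1][j - 1] + delta)      # i pairs with j
            for k in range(i + 1, j):                # bifurcation: i..k and k+1..j
                best = max(best, g[i][k] + g[k + 1][j])
            g[i][j] = best
    return g[0][n - 1]

print(nussinov("GGGAAAUCC"))  # 3 pairs: G-C, G-C, A-U
```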
Lecture 3: Phylogeny
Phylogenetic trees infer evolutionary relationships among biological species/entities based on similarities and differences in their physical or genetic characteristics (DNA or amino acid sequences).
Reference online resource: Hierarchical/Agglomerative clustering
Motivation
Extension
Phylogeny: distance-based vs parsimony-based
input: pairwise additive distance matrix vs character tables.
assumption: distances are additive vs principle of minimal changes.
good for: additive changes, faster vs detailed, character-based analysis.
weakness: distance-based gives no information at internal nodes and over-simplifies; parsimony can confuse homoplasy with shared traits and is slower at large scale.
examples: UPGMA, neighbor-joining vs small (or large) parsimony.
Distance-based methods
Distance matrix
The additive tree condition means that for any two leaves, the distance between them equals the sum of edge weights along the path between them.
The ultra-metric tree condition: distance from root to any leaf is the same (i.e., age of root).
Branch lengths represent evolutionary change, allowing for direct comparison of divergence among taxa.
Molecular Clock Hypothesis: genetic change accumulates at a constant rate across lineages over time.
Thus, the rate of mutation or evolutionary change is uniform, allowing for the use of branch lengths as measures of time.
Relationship: ultra-metric $\subseteq$ additive.
Four-point condition between four taxa: for any four elements, define $d_{ij} + d_{kl} = T$; then
$$d_{ij} + d_{kl} < d_{il} + d_{jk} = d_{ik} + d_{jl} = T + 2a.$$

i \         / k
   ---a---
j /         \ l
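A small check of the four-point condition on a toy additive matrix; the distances and the internal edge length a = 1 are made up for illustration:

```python
from itertools import permutations

# Four-point condition: among the three pairings of {i,j,k,l}, the two largest sums are equal
# (d_ij + d_kl <= d_ik + d_jl = d_il + d_jk for the correct tree topology).
def four_point_ok(D, i, j, k, l):
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return sums[1] == sums[2]

# Toy additive distances: branches i=2, j=3, k=1, l=2 around an internal edge a=1,
# so T = d_ij + d_kl = 8 and the other two pair-sums equal T + 2a = 10.
D = {"i": {"i": 0, "j": 5, "k": 4, "l": 5},
     "j": {"i": 5, "j": 0, "k": 5, "l": 6},
     "k": {"i": 4, "j": 5, "k": 0, "l": 3},
     "l": {"i": 5, "j": 6, "k": 3, "l": 0}}
print(all(four_point_ok(D, *p) for p in permutations("ijkl")))  # True
```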
distance-based: UPGMA vs Neighbor-Joining (NJ)
input: additive distance matrix (both).
output: UPGMA gives a rooted ultra-metric tree; NJ gives an un-rooted additive tree.
evolution rate assumption: UPGMA assumes a constant rate; NJ allows a varying rate.
complexity: UPGMA $O(n^3)$, optimized $O(n^2 \log n)$ or $O(n^2)$; NJ $O(n^3)$.
Past papers
ultra-metric
UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
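A compact $O(n^3)$ UPGMA sketch (naive closest-pair scan, no heap optimisation); the example distance matrix is made up:

```python
# UPGMA: repeatedly merge the two closest clusters; new distances are size-weighted averages.
def upgma(names, D):
    clusters = {i: [n] for i, n in enumerate(names)}          # cluster id -> leaves in it
    dist = {(i, j): D[i][j] for i in clusters for j in clusters if i < j}
    tree = {i: n for i, n in enumerate(names)}                # nested-tuple tree representation
    nxt = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)                        # closest pair (naive scan)
        h = dist[(a, b)] / 2                                  # ultrametric height (age) of new node
        tree[nxt] = (tree[a], tree[b], h)
        clusters[nxt] = clusters[a] + clusters[b]
        for c in clusters:
            if c in (a, b, nxt):
                continue
            da = dist[(min(a, c), max(a, c))]
            db = dist[(min(b, c), max(b, c))]
            # arithmetic mean of the old distances, weighted by cluster sizes
            dist[(c, nxt)] = (da * len(clusters[a]) + db * len(clusters[b])) / \
                             (len(clusters[a]) + len(clusters[b]))
        del clusters[a], clusters[b]
        dist = {p: v for p, v in dist.items() if a not in p and b not in p}
        nxt += 1
    return tree[nxt - 1]

names = ["A", "B", "C", "D"]
D = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 4],
     [6, 6, 4, 0]]
print(upgma(names, D))   # (('A','B',1.0), ('C','D',2.0), 3.0): A,B join at height 1; C,D at 2; root at 3
```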
Neighbor joining
Parsimony-based methods
vs. distance methods
Small parsimony problem (Sankoff)
given a rooted tree $T$ with each leaf labeled by one of the $N$ sequences,
dp, assigning label $k$ to each internal node $v$ in the subtree $T_v$, for
$\min_k s_k(v)$, where $s_k(v) = \min_{i}\{s_i(\text{LeftNode}) + \delta_{i,k}\} + \min_{j}\{s_j(\text{RightNode}) + \delta_{j,k}\}$ (minima taken over all symbols $i$, $j$).
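A minimal Sankoff sketch on a tiny rooted binary tree, assuming a unit substitution cost $\delta(i, k) = 0$ if $i = k$ and 1 otherwise (the cost matrix is illustrative):

```python
# Sankoff small parsimony: s_k(v) = min_i(s_i(left)+delta(i,k)) + min_j(s_j(right)+delta(j,k)).
ALPHABET = "ACGT"
INF = float("inf")

def sankoff(node, delta=lambda a, b: 0 if a == b else 1):
    """node is either a leaf symbol (str) or a tuple (left, right); returns {symbol: score}."""
    if isinstance(node, str):                                   # leaf: cost 0 only for its own symbol
        return {k: (0 if k == node else INF) for k in ALPHABET}
    left, right = sankoff(node[0], delta), sankoff(node[1], delta)
    return {k: min(left[i] + delta(i, k) for i in ALPHABET) +
               min(right[j] + delta(j, k) for j in ALPHABET)
            for k in ALPHABET}

# Tree ((A, C), (A, A)): leaves labelled by one character of each sequence.
scores = sankoff((("A", "C"), ("A", "A")))
print(scores, min(scores.values()))   # root label 'A' achieves the minimum: 1 substitution
```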
Large parsimony tree problem
NP-complete; use greedy heuristics,
e.g. Nearest Neighbor Interchange (NNI), to find a local minimum.
Past papers
Bootstrap validation
General methods
Multiple alignments
For a $k$-way alignment, where each sequence length is $n$,
there are $n^k$ nodes in the alignment graph,
each node has $2^k - 1$ incoming edges,
and the Hirschberg algorithm can remove one factor of $O(n)$ from the space complexity.
iterative refinement CLUSTAL
algorithm
given $k$ sequences, align each sequence against each other;
use the scores of the pairwise alignments to compute a distance matrix:
$O(k^2 \cdot n^2)$ time, with $O(n + k^2)$ space
($k^2$ pairs, where each pairwise alignment takes $O(n^2)$ time and $O(n)$ space).
build a guide tree, showing the best order of progressive alignment,
e.g. UPGMA: $O(k^3)$, or optimized $O(k^2 \log k)$ or $O(k^2)$ time, with $O(k^2)$ space.
progressive alignment guided by the tree, merging sub-alignments:
$O(k)$ merges, where a single merge takes at most $O(k \cdot n^2)$ time, with $O(k \cdot n)$ space.
Total: $O(k^2 \cdot n^2 + k^3)$ or $O(k^2 \cdot n^2)$ time, with $O(k \cdot n + k^2)$ space.
comparison: dp vs greedy vs progressive (CLUSTAL)
time: dp $O(n^k \cdot 2^k)$; greedy $O(k \cdot n^2)$; progressive $O(k^2 \cdot n^2)$.
space: dp $O(n^k)$; greedy $O(k \cdot n)$; progressive $O(k \cdot n + k^2)$.
pros: dp finds the global optimum; greedy is faster, uses less memory, and scales better with large $k$; progressive balances both and is good for moderate $k$.
cons: dp has high costs for larger $k$; greedy gives a suboptimal solution due to the greedy alignment order; progressive depends on guide tree accuracy and is also suboptimal.
Evaluation
entropy of a multiple alignment is calculated column by column, as the sum over symbols of $-p \log p$, where $p$ is the probability of each symbol in that column;
a completely conserved column scores 0, since $-[1 \log(1) + 3 \times 0 \log(0)] = 0$ (with the convention $0 \log 0 = 0$).
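A small sketch of the column entropy score, using the convention $0 \log 0 = 0$ so that a fully conserved column scores 0; the base-2 logarithm is an arbitrary choice here:

```python
from math import log2
from collections import Counter

# Entropy of one alignment column: -sum_x p(x) * log2 p(x), with 0*log(0) treated as 0.
def column_entropy(column: str) -> float:
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(column_entropy("AAAA"))   # 0.0   completely conserved column
print(column_entropy("AACT"))   # 1.5   partially conserved
print(column_entropy("ACGT"))   # 2.0   maximally diverse (for 4 symbols)
```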
Past papers
Progressive alignment heuristic iterative refinement
y2016p9q2 (c)
Statistic evaluation
Approximate search
Lecture 4: Clustering
gene expression microarray data
K-center, K-means, Hierarchical, Markov clustering
given $n$ data points $\mathbf{x}$, find $k$ clusters.
UPGMA is a hierarchical clustering algo.
Good clustering principle (or evaluation),
distance between elements in the same cluster is $< \Delta$ (intra-cluster distance),
distance between elements in different clusters is $> \Delta$ (inter-cluster distance).
$$\text{Cluster Quality} = \frac{\text{inter-cluster dist.}}{\text{intra-cluster dist.}} = \frac{\text{Distance to Nearest Cluster}}{\text{Cluster Diameter}} > 1.$$
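A small sketch of this quality ratio for two one-dimensional clusters; the data points and the simple diameter/nearest-distance definitions are assumptions for illustration:

```python
# Cluster Quality = distance to nearest cluster / cluster diameter (> 1 means well separated).
def diameter(cluster):
    return max(abs(a - b) for a in cluster for b in cluster)

def nearest_cluster_distance(cluster, other):
    return min(abs(a - b) for a in cluster for b in other)

c1, c2 = [1.0, 1.5, 2.0], [6.0, 6.4, 7.0]
quality = nearest_cluster_distance(c1, c2) / max(diameter(c1), diameter(c2))
print(quality)   # 4.0: inter-cluster distance (4.0) well above the larger diameter (1.0)
```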
clustering comparison (type, input, supervised?, time complexity):
K-center: MaxDist.; input $\mathbf{x}, k$; supervised? ✓; time --.
K-means: VarDist.; input $\mathbf{x}, k$; supervised? ✓; time $O(n \cdot k \cdot t)$.
Hierarchical: tree; input $\mathbf{x}, k, M_{dist.}$; supervised? ✓; time $O(n^3)$.
MCL: graph; input $\mathbf{x}, G / M_{sim.}$; supervised? ✗; time: expansion $M^e$: $O(n^3)$, inflation $m^r_{ij}$: $O(n^2)$.
Soft clustering
each data point is assigned to a cluster with a probability.
input: the joint distribution of unlabeled data $x \in \mathbb{R}^N$, parameterized by $\theta$.
hidden matrix (with labels $z_{ij}$): $z \in \mathbb{R}^{N \times K}$, where $K$ is the total number of clusters (labels),
indicating which cluster $j \in [1, K]$ is assigned to each data point $x_i$.
In the coin-tossing example, $z_{ij} \in \{0, 1\}$ indicates whether the coin that produced $x_i$ (with some probability of heads) is fake (1) or real (0).
output $\theta_j$: the MLE of the parameters $\theta = [\theta_1, \ldots, \theta_j, \ldots, \theta_K]$ that maximizes the joint probability $Pr(x_i, z_{ij} \mid \theta_j)$:
e.g. the center coordinates of each cluster $j$,
or the probability of heads for coin type $j$ (1: fake; 0: real).
Expectation-Maximization (EM) algo.,
Gibbs sampling: iteratively update the parameters $\theta$ and the hidden matrix $z$
by sampling from conditional distributions until convergence, rather than marginalizing by integrating over the joint distribution.
E step: optimize $Pr(x_i, z_{ij} \mid \theta_j)$ w.r.t. the hidden matrix $z_{ij}$ while holding the parameters $\theta_j$ fixed:
given the parameters $\theta^{t}$, derive the hidden matrix $Pr(z_{ij} \mid x_i, \theta_j)$,
e.g. the responsibility of cluster $j$ with center $\theta_j = center_j$ for data point $x_i$:
$$Pr(z_{ij} \mid x_i, \theta_j) = \frac{Pr(x_i, z_{ij} \mid \theta_j)}{Pr(x_i \mid \theta_j)} = \frac{Pr(x_i, z_{ij} \mid \theta_j)}{\sum_{j=1}^K Pr(x_i, z_{ij} \mid \theta_j)} \quad \text{by Bayes' rule}.$$
Note that the denominator is the normalization factor, ensuring the probabilities sum to 1;
the key is to compute the joint probability $Pr(x_i, z_{ij} \mid \theta_j)$:
in center-based soft clustering, $\text{Force}(x_i, z_{ij}) = e^{-\beta \cdot distance(x_i, \, \theta_j = center_j)}$,
with $\beta \to \infty$ for hard clustering,
and $\beta \to 0$ for extreme soft clustering.
In coin tossing, $X \sim \text{Binomial}(n, \theta_j)$, so $Pr(x_i, z_{ij} \mid \theta_j) = (\theta_j)^{i} \cdot (1 - \theta_j)^{n-i}$,
where $i$ is the number of heads and $n$ is the total number of tosses,
and $\theta_j$ is the probability of heads for coin type $j$ (1: fake; 0: real).
Here, pick $z_i = \arg\max_{j} Pr(z_{ij} \mid x_i, \theta_j)$, which also maximizes the joint probability:
in center-based clustering, set the center label; in unknown coin tossing, set whether the coin is fake or not.
M step: optimize $Pr(x_i, z_{ij} \mid \theta_j)$ w.r.t. the parameters $\theta_j$ while holding the hidden matrix $z_{ij}$ fixed:
given the hidden matrix, update the parameters $\theta$ via MLE:
$$\theta^{t+1} = \arg\max_\theta \mathbb{E}_{z \sim Pr(z \mid x, \theta)}[\log Pr(x, z \mid \theta)].$$
In center-based clustering, $\theta_j$ is updated as the coordinate average of the data points assigned to cluster $j$;
in unknown coin tossing, $\theta_j^{t+1} = \frac{\sum_i x_i \mathbf{1}_{z_i = j}}{\sum_i \mathbf{1}_{z_i = j}}$,
i.e. for $X \sim \text{Binomial}(n, \theta_j)$ the MLE of the positive-outcome probability is $\theta = \frac{i}{n}$,
where $i$ is the number of heads and $n$ is the total number of tosses.
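A minimal EM sketch for the two-coin example above: each data point is the number of heads out of n tosses of an unknown coin, the E step computes responsibilities via Bayes' rule with the binomial likelihood, and the M step re-estimates each $\theta_j$ as a responsibility-weighted head fraction. The data and starting values are made up:

```python
# EM for a mixture of two coins (theta_j = probability of heads for coin j).
def em_two_coins(heads, n, theta=(0.6, 0.5), iters=20):
    for _ in range(iters):
        # E step: responsibility r_ij = Pr(z_ij | x_i, theta) via Bayes' rule
        # (the binomial coefficient cancels in the normalisation).
        resp = []
        for h in heads:
            lik = [t ** h * (1 - t) ** (n - h) for t in theta]
            z = sum(lik)
            resp.append([l / z for l in lik])
        # M step: theta_j = weighted heads / weighted tosses (the i/n MLE per coin).
        theta = tuple(
            sum(r[j] * h for r, h in zip(resp, heads)) /
            sum(r[j] * n for r in resp)
            for j in range(2)
        )
    return theta

# Each experiment: number of heads out of n = 10 tosses (toy data).
heads = [5, 9, 8, 4, 7]
print(em_two_coins(heads, n=10))   # converges to one lower-bias and one higher-bias coin
```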
Past papers
Evaluation
Louvain: modularity , Leiden algorithm
Lecture 5: Genome sequencing
Reconstruct the original genome, given a set of overlapping short reads from machines.
Hamiltonian graph -- Hamiltonian path (every node, NP-complete)
Bellman–Held–Karp algorithm
boolean dp[j][S_i], denoting a valid path over node subset $S_i$ ending at node $j$.
for all neighbors $k$ of $j$, extend dp[j][S_i] from dp[k][S_i \ {j}] and the edge $k$-$j$.
$O(n^2 2^n)$; the problem is NP-complete.
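A minimal bitmask-DP sketch of the Bellman-Held-Karp recurrence (existence of a Hamiltonian path from any start vertex); the example graph is made up:

```python
# dp[S][j] = True iff some path visits exactly the vertex subset S and ends at vertex j.
def hamiltonian_path_exists(n, edges):
    adj = [[False] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = True
    dp = [[False] * n for _ in range(1 << n)]
    for j in range(n):
        dp[1 << j][j] = True                        # paths consisting of a single vertex
    for S in range(1 << n):
        for j in range(n):
            if not dp[S][j]:
                continue
            for k in range(n):                      # extend the path by edge j-k
                if adj[j][k] and not (S >> k) & 1:
                    dp[S | (1 << k)][k] = True
    full = (1 << n) - 1
    return any(dp[full][j] for j in range(n))       # O(n^2 * 2^n) time overall

# 0-1-2-3 path graph plus a chord 0-2.
print(hamiltonian_path_exists(4, [(0, 1), (1, 2), (2, 3), (0, 2)]))  # True (0-1-2-3)
```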
De Bruijn graph -- Eulerian path (every edge, easier)
Balanced node: in-degree = out-degree
Semi-balanced node: in-degree = out-degree $\pm 1$ (differs by at most 1)
Connected Graph: each node is reachable from some other node
Strongly connected Graph: each node is reachable from every other node
Eulerian Graph (a Eulerian cycle)
algorithm: Hierholzer's algorithm (the "ant" walk) with $O(|E|)$ time; a sketch follows below.
Euler's theorem
a connected graph is Eulerian if and only if every vertex has even degree.
a directed graph is Eulerian if and only if it is a balanced connected graph (a semi-balanced graph admits an Eulerian path rather than a cycle).
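A minimal Hierholzer sketch for a balanced, connected directed graph (Eulerian cycle), iterative stack version in $O(|E|)$; the example edge set is made up:

```python
from collections import defaultdict

# Hierholzer's algorithm: O(|E|) Eulerian cycle in a balanced, connected directed graph.
def eulerian_cycle(edges, start):
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())     # walk an unused edge (the "ant" keeps walking)
        else:
            cycle.append(stack.pop())      # dead end: back up and record the vertex
    return cycle[::-1]

# Balanced directed graph (every node has in-degree == out-degree).
edges = [("AT", "TG"), ("TG", "GC"), ("GC", "CA"), ("CA", "AT"),
         ("TG", "GT"), ("GT", "TG")]
print(eulerian_cycle(edges, "AT"))   # AT TG GT TG GC CA AT
```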
Eulerian vs. Hamiltonian
De Bruijn graph construction
Paired
K-mers
Bubbles explosion and solutions
Lecture 6: Genome assembly
Use an additional reference genome to augment sequencing or match (read) patterns.
Suffix trees
Compression
Burrows-Wheeler Transform (BWT)
Read / Exact pattern matching
Sequencing De Bruijn graph construction takes a lot of memory and time.
Fitting via alignment: $O(|\text{Patterns}| \times |\text{Genome}|)$.
Joint traversal (match or backtrack) via two trie pointers in parallel:
patterns prefix trie: $O(|\text{LongestPattern}| \times |\text{Genome}|)$.
genome suffix trie: $O(|\text{Patterns}| + |\text{Genome}|)$.
construction: char nodes $T(|G|^2)$, $S(|G|^2)$; substring nodes $T(|G|)$, $S(\sim 20 \times |G|)$.
(invert) BWT + suffix array: $S(\sim 4 \times |G|)$.
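A minimal Burrows-Wheeler transform and inversion sketch (naive construction by sorting all rotations; real tools derive the BWT from a suffix array instead):

```python
# Burrows-Wheeler Transform: last column of the sorted cyclic rotations of text + '$'.
def bwt(text: str) -> str:
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last: str) -> str:
    # Repeatedly prepend the BWT column and re-sort: after |last| rounds the rows
    # are the sorted rotations; the row ending in '$' is the original text.
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return next(row for row in table if row.endswith("$"))[:-1]

b = bwt("GATTACA")
print(b)                 # ACTGA$TA
print(inverse_bwt(b))    # GATTACA
```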
y2018p9q2 (c)
y2017p9q2 (d)
Inexact pattern matching, with at most $d$ mismatches:
potential candidates: at least one of the $d+1$ seeds is error-free; check the entire pattern against the genome.
(invert) BWT with extended mismatch + suffix array.
Lecture 7: Hidden Markov models
Application
Identify parts
Exons, Introns prediction
Protein secondary structure prediction
CG islands
Evaluation : TP, FP, TN, FN, sensitivity, specificity, precision, F1 score
Denote HMM:
transition matrix $P$, emission matrix $Q$,
$k$ states $\{q_1, \ldots, q_k\}$ (a state path is denoted $\pi$), one observation sequence $X = \{x_1, x_2, \dots, x_N\}$,
or $M$ training sequences $X_1, \ldots, X_M$, each with $N$ observations.
Algorithm: inputs; outputs; time; space.
Likelihood of a parse: inputs $HMM, X, \pi$; output $Pr(x, \pi)$; time $O(N)$; space $O(1)$.
Viterbi / decode: inputs $HMM, X$; output $V_k(i) = \max_\pi P(x, \pi \mid \pi_i = k)$; time $O(N \cdot k^2)$; space $O(N \cdot k)$.
Forward / eval: inputs $HMM, X$; output $f_k(i) = Pr(x_1, \ldots, x_i \mid \pi_i = k)$; time $O(N \cdot k^2)$; space $O(N \cdot k)$.
Backward / eval: inputs $HMM, X$; outputs $b_k(i) = Pr(x_{i+1}, \ldots, x_N \mid \pi_i = k)$ and the posterior $Pr(\pi_i = k \mid x) \propto f_k(i) \cdot b_k(i)$; time $O(N \cdot k^2)$; space $O(N \cdot k)$.
Viterbi training / learning: inputs $X_1, \ldots, X_M$; outputs $P, Q$; time $O(M \cdot N \cdot k^2)$; space $O(M \cdot N \cdot k)$.
Baum-Welch / learning: inputs $X_1, \ldots, X_M$; outputs $P, Q$; time $O(M \cdot N \cdot k^2)$; space $O(M \cdot N \cdot k)$.
Viterbi algorithm
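A minimal Viterbi sketch in log space for a tiny two-state, CG-island-style HMM; the transition and emission probabilities are illustrative, not from the notes:

```python
from math import log

# Viterbi decoding: most likely state path for observations x, O(N * k^2) time, O(N * k) space.
def viterbi(x, states, start_p, trans_p, emit_p):
    V = [{s: log(start_p[s]) + log(emit_p[s][x[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(x)):
        V.append({}); back.append({})
        for s in states:
            best_prev = max(states, key=lambda r: V[t - 1][r] + log(trans_p[r][s]))
            V[t][s] = V[t - 1][best_prev] + log(trans_p[best_prev][s]) + log(emit_p[s][x[t]])
            back[t][s] = best_prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(x) - 1, 0, -1):             # traceback of the best path
        path.append(back[t][path[-1]])
    return path[::-1]

states = ("Island", "Background")                   # illustrative CG-island HMM
start_p = {"Island": 0.5, "Background": 0.5}
trans_p = {"Island": {"Island": 0.9, "Background": 0.1},
           "Background": {"Island": 0.1, "Background": 0.9}}
emit_p = {"Island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "Background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
print(viterbi("ATGCGCGCATAT", states, start_p, trans_p, emit_p))
```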
Baum-Welch algorithm
Lecture 8: Computing and storage
DNA for Computing
The Hamiltonian path problem (closely related to the Traveling Salesman Problem) is NP-complete:
find a path in a graph $G = (V, E)$ that visits each vertex exactly once.
Leonard Adleman's DNA computing algorithm (1994) via generate and test.
Generate all candidate paths in the graph $G$:
Step 1: encode the city names and routes as DNA sequences.
Test each path to check if it is Hamiltonian:
correct total length of the path:
Step 2: sort by length on an electrophoresis gel (electric field).
Step 3: filter by length by cutting out the band of interest.
correct start vertex and end vertex:
Step 4: amplify via PCR (Polymerase Chain Reaction) with primers for the start and end vertices.
each vertex visited exactly once:
Step 5: affinity purification (hybridization) test against the complementary strand of each vertex.
Output the Hamiltonian path.
Past papers
y2023p8q2 (d)
y2019p8q2 (d)
vs. computational methods
Advantages: synthesizing short single stranded DNA is now a routine process,
so the initial step is straightforward and cheap.
In a test tube the “algorithm” runs in parallel.
However, the complexity still increases exponentially.
For Adleman’s method, what scales exponentially is not the computing time, but rather the amount of DNA.
Another limitation is the error rate for each operation.
Random access in DNA storage
Organick et al. (2018) stored and retrieved more than 200 megabytes of data.
encoding: ID | Addr | Payload | Error correction code, append distinct primers, synthesis.
attach distinct primers to each set of DNA molecules, to carry the file information.
redundant information for increased robustness.
decoding: sequencing, cluster reads and consensus algorithm, error correction.
retrieve the file by selectively amplifying and sequencing the molecules with the primer marking the desired file.
test their scheme via a primer library that allowed them to uniquely tag data stored in DNA.
encoded 35 digital files into 13M DNA sequences, each 150 nucleotides long.
Past papers
y2021p9q2 (d)
y2020p9q2 (e) , y2024p8q2 (d)
opportunities (or advantages): longevity (durable), power usage and information density.
challenges (or disadvantages): cost and read/write speed (DNA synthesis and sequencing).
Lecture 9: Stochastic Simulation Algorithm (SSA)
Doob-Gillespie algorithm (1976)
to simulate coupled biochemical reactions in a well stirred container, where the mean and variance from multiple runs are reported for statistical stability.
assumption: the time step $\tau$ is so small that only one reaction occurs within it.
algorithm: given a set of $M$ reactions and $N$ species in the system, with $X_i$ molecules of species $i$.
$t = 0$, $\mathbf{X} = [X_1, \dots, X_N]$.
while $t < T$ do:
  compute the total propensity $\alpha_0 = \sum_{i=1}^M \alpha_i$, complexity $O(M)$.
  draw $\tau \sim \text{Exp}(\alpha_0)$, i.e. $\tau = -\frac{1}{\alpha_0} \ln(r_1)$, where $r_1 \sim \text{Uniform}(0, 1)$.
  $t' = t + \tau$
  choose the next reaction with $P(j\text{-th reaction}) = \frac{\alpha_j}{\alpha_0}$, $j = 1, \dots, M$, using $r_2 \sim \text{Uniform}(0, 1)$.
  $\mathbf{X'} = \mathbf{X} + \nu_j$, complexity $O(N)$.
end while
output $\mathbf{X'}$ and $t'$.
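A minimal SSA sketch for a single reversible dimerisation system (2A -> B and B -> 2A); the rate constants, initial counts, and horizon T are made up for illustration:

```python
import math, random

# Gillespie SSA for two reactions: 2A -> B (rate c1) and B -> 2A (rate c2).
def ssa(A=100, B=0, c1=0.005, c2=0.1, T=10.0, seed=0):
    random.seed(seed)
    t, trajectory = 0.0, [(0.0, A, B)]
    while t < T:
        a1 = c1 * A * (A - 1) / 2          # propensity of 2A -> B
        a2 = c2 * B                        # propensity of B -> 2A
        a0 = a1 + a2                       # O(M) to sum the propensities
        if a0 == 0:
            break                          # no reaction can fire any more
        t += -math.log(1.0 - random.random()) / a0    # tau ~ Exp(a0), using r1
        if random.random() < a1 / a0:      # pick reaction j with prob alpha_j / alpha_0 (r2)
            A, B = A - 2, B + 1            # apply stoichiometry nu_j, O(N)
        else:
            A, B = A + 2, B - 1
        trajectory.append((t, A, B))
    return trajectory

print(ssa()[-1])   # final (time, #A, #B) of one stochastic run
```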
Past papers