Bioinformatics Past Paper

Lecture 1: Genetics

Reference online resource: Teaching

YouTube channel @bioinfalgorithms, accompanying the textbook "Bioinformatics Algorithms: An Active Learning Approach".

| feature | DNA | RNA |
| --- | --- | --- |
| strand structure | double helix | single-stranded |
| nitrogenous bases | A, T, C, G | A, U, C, G |
| base pairs | A-T, C-G | A-U, C-G |
| sugar | deoxyribose | ribose |
| function | long-term storage of genetic information | protein synthesis and gene regulation |
| stability | more stable (2'-H) | less stable, due to the reactive 2'-hydroxyl group (-OH) |

nitrogenous bases: Adenine (A), Thymine (T), Cytosine (C), Guanine (G), Uracil (U).

| concept | gene | genome |
| --- | --- | --- |
| definition | specific DNA segment encoding a protein or RNA | complete set of genetic material |
| functionality | heredity; genes code for proteins | all genes and non-coding sequences |
| size | varies from gene to gene | entire DNA content of the organism |
| region / feature | DNA genes | programming functions |
| --- | --- | --- |
| start marker | promoter region | function declaration (e.g., def func():) |
| coding region | exons (encode protein information) | function body |
| non-coding region | introns (intervening sequences) | comments or placeholder code |
| end marker | termination signal (of transcription) | closing brace or return statement |
| purpose | defines the structure and expression of a gene | defines the structure and behavior of a function |
| execution | transcription and translation to produce proteins | execution of the function when called |

DNA --(transcription)--> messenger RNA (mRNA) --(translation)--> protein; each codon (a triplet of bases) specifies one amino acid.

First, transcription: a specific segment of DNA is copied into messenger RNA (mRNA). Second, translation: the mRNA attaches to a ribosome, which reads it codon by codon to synthesize a protein.
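The two steps can be sketched in Python; note the codon table here is a tiny illustrative subset of the real 64-codon genetic code:

```python
# Sketch of the central dogma: transcribe the DNA coding strand to mRNA,
# then translate codons (base triplets) to amino acids.

def transcribe(dna: str) -> str:
    """Copy the DNA coding strand into mRNA (T -> U)."""
    return dna.upper().replace("T", "U")

# Tiny subset of the standard codon table (illustrative assumption:
# only these codons occur in the input).
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}

def translate(mrna: str) -> list[str]:
    """Read the mRNA three bases at a time, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "STOP":
            break
        protein.append(aa)
    return protein
```

For example, `transcribe("ATGTTTGGCTAA")` gives `"AUGUUUGGCUAA"`, which translates to `["Met", "Phe", "Gly"]`.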

Gene network [extensions]

Lecture 2: Sequence alignment

Distance metrics

Hamming distance vs edit distance
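The contrast can be sketched directly, assuming unit costs: Hamming distance counts mismatches between equal-length strings, while edit distance also allows insertions and deletions via dynamic programming.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(s) == len(t), "Hamming distance needs equal lengths"
    return sum(a != b for a, b in zip(s, t))

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: unit-cost insertions, deletions, substitutions."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j              # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[m][n]
```

For instance, `hamming("GATTACA", "GACTATA")` is 2, while `edit_distance("ATCG", "TCGA")` is 2 (delete the leading A, append an A), even though the Hamming distance of that pair is 4.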

Dynamic programming

Scoring matrices

Gaps

Algorithms

Longest Common Subsequence (LCS)
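A standard O(mn) dynamic-programming sketch, with a backtrack to recover one LCS (several distinct LCSs may exist):

```python
def lcs(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t."""
    m, n = len(s), len(t)
    # dp[i][j] = length of an LCS of s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack from dp[m][n] to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```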

General alignment (global vs. local)

Global alignment (Needleman-Wunsch)

Local alignment (Smith-Waterman)

Speedup extension (Four Russians)

Linear space extension (Hirschberg, divide and conquer)

general: complexity summary

general: long DNA sequences

RNA secondary structure prediction (Nussinov-Jacobson folding)
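A sketch of the Nussinov base-pair maximisation recurrence, counting complementary pairs only (no minimum loop length or energy model, which real predictors add):

```python
# dp[i][j] = max number of non-crossing complementary pairs in rna[i..j].
# Watson-Crick pairs plus the G-U wobble pair are allowed.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def nussinov_pairs(rna: str) -> int:
    n = len(rna)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):                 # increasing subsequence length
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])   # i or j unpaired
            if (rna[i], rna[j]) in PAIRS:            # pair i with j
                inner = dp[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            for k in range(i + 1, j):                # bifurcation
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]
```

For example, `nussinov_pairs("GCAU")` is 2 (G-C and A-U), and `nussinov_pairs("GGGAAACCC")` is 3 (three nested G-C pairs).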

Lecture 3: Phylogeny

Phylogenetic trees infer evolutionary relationships among biological species or entities from similarities and differences in their physical or genetic characteristics (e.g., DNA or amino-acid sequences).

Reference online resource: Hierarchical/Agglomerative clustering

Motivation

Extension

| phylogeny | distance-based | parsimony-based |
| --- | --- | --- |
| input | pairwise additive distance matrix | character tables |
| assumption | distances are additive | principle of minimal changes |
| good for | additive changes; faster | detailed, character-based analysis |
| weakness | no information at internal nodes; over-simplification | homoplasy vs. shared traits; slower at large scale |
| examples | UPGMA, neighbor-joining | small (or large) parsimony |

Distance-based methods

Distance matrix

The additive tree condition states that for any two leaves, the distance between them is the sum of the edge weights on the path between them.

The ultra-metric tree condition: distance from root to any leaf is the same (i.e., age of root).

Four-point condition between four taxa: for any four leaves i, j, k, l separated into pairs {i, j} and {k, l} by an internal edge of length a, let T = d_{ij} + d_{kl}; then

d_{ij} + d_{kl} < d_{il} + d_{jk} = d_{ik} + d_{jl} = T + 2a.

i \       / k
    --a--
j /       \ l
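The condition can be checked mechanically on a distance matrix: for every quadruple, the two largest of the three pairwise sums must be equal. A sketch, assuming a symmetric matrix given as nested lists:

```python
from itertools import combinations

def four_point_ok(D, tol=1e-9) -> bool:
    """Check the four-point condition on a symmetric distance matrix D:
    for every quadruple (i, j, k, l), the two largest of the three sums
    d_ij+d_kl, d_ik+d_jl, d_il+d_jk must be equal (within tol)."""
    n = len(D)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```

The example matrix in the test below comes from a quartet tree with leaf edges 2, 1, 3, 4 and internal edge 1, so it satisfies the condition; perturbing one entry breaks it.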
| distance-based | UPGMA | Neighbor-Joining (NJ) |
| --- | --- | --- |
| input | additive distance matrix | additive distance matrix |
| output | rooted ultrametric tree | unrooted additive tree |
| evolution rate | constant | varying |
| complexity | O(n^3); optimized: O(n^2 \log n) or O(n^2) | O(n^3) |

Past papers

ultra-metric

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)

Neighbor joining

Parsimony-based methods

vs. distance methods

Small parsimony problem (Sankoff)

Large parsimony tree problem

Past papers

Bootstrap validation

General methods

Multiple alignments

For a k-way alignment, where each sequence has length n,

iterative refinement CLUSTAL algorithm

| | DP | greedy | progressive (CLUSTAL) |
| --- | --- | --- | --- |
| time | O(n^k \cdot 2^k) | O(k \cdot n^2) | O(k^2 \cdot n^2) |
| space | O(n^k) | O(k \cdot n) | O(k \cdot n + k^2) |
| pros | globally optimal | faster and less memory; scales better with large k | balances both; good for moderate k |
| cons | high costs for larger k | suboptimal solution; greedy alignment order | guide-tree accuracy; suboptimal solution |

Evaluation

Past papers

Progressive alignment heuristic iterative refinement

y2016p9q2 (c)

Statistic evaluation

Approximate search

Lecture 4: Clustering

gene expression microarray data

K-center, K-means, Hierarchical, Markov clustering

\text{Cluster Quality} = \frac{\text{inter-cluster dist.}}{\text{intra-cluster dist.}} = \frac{\text{Distance to Nearest Cluster}}{\text{Cluster Diameter}} > 1.

| clustering | type | input | supervised? | T complexity |
| --- | --- | --- | --- | --- |
| K-center | MaxDist. | x, k | no | -- |
| K-means | VarDist. | x, k | no | O(n \cdot k \cdot t) |
| Hierarchical | tree | x, k, M_{dist.} | no | O(n^3) |
| MCL | graph | x, G/M_{sim.} | no | expand M^e: O(n^3); inflate m^r_{ij}: O(n^2) |
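As one concrete example from the table, K-means (Lloyd's algorithm) on 1-D expression values, matching the O(n \cdot k \cdot t) bound; the sketch and the data in the usage note are illustrative assumptions:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 1-D values: O(n * k * t) for t iterations."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)           # initialise from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            nearest = min(range(k), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        for c in range(k):                    # update step: cluster means
            if clusters[c]:
                centres[c] = sum(clusters[c]) / len(clusters[c])
    return sorted(centres)
```

On made-up data with two obvious groups, e.g. `kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], 2)`, the centres converge to roughly 1.0 and 10.0.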

Soft clustering

Expectation-Maximization (EM) algo.,

Past papers

Evaluation

Louvain: modularity, Leiden algorithm

Lecture 5: Genome sequencing

Reconstruct the original genome, given a set of overlapping short reads from machines.

Hamiltonian graph -- Hamiltonian path (every node, NP-complete)

De Bruijn graph -- Eulerian path (every edge, easier)

Eulerian vs. Hamiltonian

De Bruijn graph construction
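A sketch of the construction plus Hierholzer's algorithm for the Eulerian path; `assemble` glues the (k-1)-mer path back into a genome. It assumes the reads cover the genome and the k-mers are unambiguous (no repeats), which real assemblers cannot assume:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """De Bruijn graph: nodes are (k-1)-mers, one edge per k-mer occurrence."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes a (semi-)Eulerian graph."""
    graph = {u: list(vs) for u, vs in graph.items()}  # consumable copy
    indeg = defaultdict(int)
    for u, vs in graph.items():
        for v in vs:
            indeg[v] += 1
    # Start at a node with out-degree - in-degree == 1, else anywhere.
    start = next((u for u in graph if len(graph[u]) - indeg[u] == 1),
                 next(iter(graph)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph.get(u):
            stack.append(graph[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())       # dead end: emit node
    return path[::-1]

def assemble(reads, k):
    """Spell the genome from the Eulerian path of (k-1)-mers."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])
```

For example, `assemble(["TAATG", "TGCC"], 3)` reconstructs `"TAATGCC"` from the five 3-mers TAA, AAT, ATG, TGC, GCC.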

Paired

K-mers

Bubbles explosion and solutions

Lecture 6: Genome assembly

Use an additional reference genome to augment sequencing, or to match read patterns against it.

Suffix trees

Compression

Burrows-Wheeler Transform (BWT)
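A compact (deliberately naive) sketch of the BWT via sorted rotations, with the textbook column-wise inversion; real read aligners build the transform from a suffix array instead:

```python
def bwt(text: str) -> str:
    """Last column of the sorted rotations of text + '$'.
    '$' is a sentinel that sorts before every letter."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(transform: str) -> str:
    """Invert by rebuilding the rotation table one column at a time:
    repeatedly prepend the transform and re-sort."""
    table = [""] * len(transform)
    for _ in range(len(transform)):
        table = sorted(c + row for c, row in zip(transform, table))
    return next(row for row in table if row.endswith("$"))[:-1]
```

For example, `bwt("BANANA")` gives `"ANNB$AA"`, which groups equal characters into runs and so compresses well; `inverse_bwt` recovers the original exactly.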

Read / Exact pattern matching

Inexact pattern matching, with at most d mismatches.

Lecture 7: Hidden Markov models

Application

Identify parts

Exons, Introns prediction

Protein secondary structure prediction

CG islands

Evaluation: TP, FP, TN, FN, sensitivity, specificity, precision, F1 score

Denote HMM:

| Algorithm | Inputs | Outputs | Time | Space |
| --- | --- | --- | --- | --- |
| Likelihood of a parse | HMM, X, \pi | Pr(x, \pi) | O(N) | O(1) |
| Viterbi / decoding | HMM, X | V_k(i) = \max_\pi P(x, \pi \vert \pi_i = k) | O(N \cdot k^2) | O(N \cdot k) |
| Forward / evaluation | HMM, X | f_k(i) = Pr(x_1, ..., x_i, \pi_i = k) | O(N \cdot k^2) | O(N \cdot k) |
| Backward / evaluation | HMM, X | b_k(i) = Pr(x_{i+1}, ..., x_N \vert \pi_i = k); posterior Pr(\pi_i = k \vert x) = f_k(i) \cdot b_k(i) / Pr(x) | O(N \cdot k^2) | O(N \cdot k) |
| Viterbi training / learning | X_1, ..., X_M | P, Q | O(M \cdot N \cdot k^2) | O(M \cdot N \cdot k) |
| Baum-Welch / learning | X_1, ..., X_M | P, Q | O(M \cdot N \cdot k^2) | O(M \cdot N \cdot k) |

Viterbi algorithm
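A log-space sketch of the recurrence (logs avoid underflow on long sequences). The state set and the start, transition, and emission tables are caller-supplied dictionaries; probabilities are assumed nonzero:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence.
    O(N * k^2) time, O(N * k) space. Assumes nonzero probabilities."""
    # V[t][s] = (best log-prob of a path ending in state s at time t, predecessor)
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
          for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] + math.log(trans_p[p][s])
                 + math.log(emit_p[s][obs[t]]), p)
                for p in states)
            V[t][s] = (prob, prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return path[::-1]
```

With an illustrative (made-up) two-state CG-island model, a GC-rich observation like "GGC" decodes to the island state throughout, while an AT-rich one decodes to the background state.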

Baum-Welch algorithm

Lecture 8: Computing and storage

DNA for Computing

The Hamiltonian path problem (closely related to the Traveling Salesman Problem) is NP-complete.

Leonard Adleman's DNA computing algorithm (1994) via generate and test.

Past papers

Random access in DNA storage

Organick et al. (2018) stored and retrieved more than 200 megabytes of data.

Past papers

Lecture 9: Stochastic Simulation Algorithm (SSA)

Doob-Gillespie algorithm (1976)
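A sketch of the direct (Doob-Gillespie) method on an illustrative birth-death system, X -> X+1 at constant rate and X -> X-1 at a rate proportional to X; the rate values and system are assumptions for the example, not from the lecture:

```python
import random

def gillespie_birth_death(birth=1.0, death=0.1, x0=0, t_max=100.0, seed=1):
    """Direct-method SSA: X -> X+1 at rate `birth`, X -> X-1 at rate death*X.
    Returns the trajectory as a list of (time, count) tuples."""
    rng = random.Random(seed)
    t, x, traj = 0.0, x0, [(0.0, x0)]
    while t < t_max:
        a1, a2 = birth, death * x          # propensity of each reaction
        a0 = a1 + a2                       # total propensity (> 0: a1 > 0)
        t += rng.expovariate(a0)           # exponential waiting time ~ Exp(a0)
        if t >= t_max:
            break
        # Choose the next reaction with probability proportional to propensity.
        x += 1 if rng.random() < a1 / a0 else -1
        traj.append((t, x))
    return traj
```

Each step draws a waiting time from an exponential with rate equal to the total propensity, then picks which reaction fires in proportion to its share; over long runs the count fluctuates around the steady-state mean birth/death.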

Past papers