How to improve a task (memory requirement / speed)?
next three tied together:
* algorithm - appropriate & efficient
* data structure - efficient to use/access/query, space efficient to store
* parallelization - to split up tasks and run in parallel
Most realistic time complexity for algorithms in sequence bioinformatics ?
exponential; polynomial
Define time complexity
examples?
Time complexity is defined as the amount of time taken by an algorithm to run, as a function of the length of the input.
= how does the running time of an algorithm increase
with the size of the input?
= most useful way to measure effectiveness of an algorithm
eg. constant, logarithmic, linear, polynomial, exponential
Algorithms - classifications? + examples
by complexity
* logarithmic, linear, quadratic, polynomial, etc
… by design
* brute force
* heuristic (zB greedy, seed and extend
* recursion (dynamic programming)
… by data structure
* hash table
* graphs (zB trees)
… by (biological) problem
* pattern search
Brute force algorithm?
- definition
- use
- running time
Heuristic algorithm?
- definition
- use
don’t guarantee that the optimal solution is found, because not all solutions are evaluated
are a good compromise between speed and accuracy
many different types / designs exist and are used
Two examples for heuristic algorithms
Greedy logarithm - definition, use ?
Seed-and-extend
definition, use?
Seed-and-extend = seed-and-filter-and-extend
Subsampling of k-mers - name 3 types?
minimizers, syncmers, strobemers
Subsampling - Minimizers?
saves on space /ram
such that the same k-mers will be selected from similar sequences
doi:10.1093/bioinformatics/bth408
Subsampling - syncmers?
within a window, select the k-mer(s) with the smallest s-mer at the start or end
- suited for comparison of seqs with mismatches
https://doi.org/10.7717/peerj.10805
Data structure - definition, common examples?
Suffix tree - definition, use?
Suffix tree - disadvantages / advantages?
disadvantages
* not so efficient to construct
* not so efficient to store
(–> convert to more space-efficient suffix arrays!)
advantages
* very efficient to search
- search is proportional to length of the pattern
Suffix trees - applications?
many applications in sequence bioinformatics
- keywords/patterns in text
- (maximal) repeats
- unique substrings
- longest common substring
- inexact matches
Parallelization - why? requires? types?
(parallel computing as opposed to serial computing)
why
* clock speed maxed out
* performance improvements now only possible through parallelization
requires
* special hardware and programs
* important: frequent communication between subtasks?
types
* embarrassing, coarse-grained, fine-grained
multi-core computers - where? why? requires?
(computing) clusters - definition/ consists of, disadvantage, requires, example in linux
definition / consists of:
* multiple computers (that are multi-core); nodes
* a login / head / master node
* linked by a fast local area network
disadvantage:
* expensive – but cheaper than a single computer of comparable size & speed
requires:
* application programs: extra programming / parameters
* requires a system to distribute jobs
example in linux:
SLURM
Heterogeneous computing
CPUs
* 2-8 cores, general-purpose
GPUs
* hundreds of cores, highly specialized - thanks to gaming!
* programmable for general purpose computing since 2007: CUDA (NVIDIA)
integration of CPUs and GPUs
–> performance boost
but: specialized programs / programming skills required
cloud computing - definition, advantages, example providors, requirments
definition: virtual machines on many networked computers
advantages:
- emulation of a physical computer (operating system, configuration, software, etc)
- high flexibility
- saving & duplicating a VM is very easy
computing power rented on demand - example:
- Amazon, Google, Microsoft
- some biological data sets available;
limiting step: data upload may otherwise be the limiting step
requires:
- specialized programming required
MapReduce / Hadoop; Go programming language
Big data - skills needed?
Algorithm - definition
a list set of instructions, used to solve problems or perform tasks
input –>** algorithm **–> output