SEPTEMBER 2 Flashcards by Miyaki Jan Lim

describes the flow of genetic information within a biological system — how information stored in DNA is used to make proteins, which carry out most cellular functions.

CENTRAL DOGMA IN BIOLOGY
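
A minimal sketch of the flow described on this card (DNA to mRNA to protein), assuming the input is the coding strand of the DNA and using the standard genetic code. The function names `transcribe` and `translate` and the example sequence are made up for illustration; they are not from the lecture.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_dna):
    """DNA (coding strand) -> mRNA: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTTTTAA"            # hypothetical example sequence
mrna = transcribe(dna)          # "AUGGCCUUUUAA"
print(mrna, translate(mrna))    # protein is "MAF"
```

Real transcription and translation involve much more (template strand, splicing, regulation); this only shows the information flow.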

1967: developed an automated protein sequencer called the “Sequenator”

is a book, an encyclopedia of protein sequences; it was reissued in new editions as newly sequenced proteins were reported

The Atlas of Protein Sequence and Structure

Structured way of looking into the information; how the information is structured depends on the biomolecule (DNA, RNA, proteins, whole organisms)

Working with tools that turn the information into something meaningful in an analysis: from information to knowledge

BIOINFORMATICS

1950: developed the “Edman degradation” reaction

Pehr Victor Edman

Developed the Atlas of Protein Sequence and Structure at the National Biomedical Research Foundation where she was an Associate Director

1965

Margaret Dayhoff

Bioinformatics started with _________; at the time, they were actually thought to be the information-carrying material

proteins

a database that brings together collected, translated, and curated protein sequences from:

Swiss-Prot
TrEMBL
PIR
PDB
GenBank
PRF
RefSeq
TPA

NCBI Protein Database

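1958: Nobel Prize in Chemistry for his work on protein structure, which established that a protein has a defined amino acid (polypeptide) sequence

He was looking into polypeptide structures, especially insulin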

Frederick Sanger

By figuring out these kinds of structures, you can work out how vaccines are made and how medicines interact with cells; in the past this general connection was a black box

protein structures

ATLAS OF PROTEIN SEQUENCE AND STRUCTURE (1967-68) eventually became the

Protein Information Resource (PIR) Protein Sequence Database - COLLECTED

AMOS BAIROCH: major contributions to bioinformatics

1986 Swiss-Prot
1996 TrEMBL
2002 UniProtKB

For translation of EMBL nucleotide sequences
Supplement to Swiss-Prot; it initially consisted of computationally annotated sequence entries derived from the translation of all coding sequences (CDSs) found in the INSDC databases - TRANSLATED

1996 TrEMBL

data bank where the 3D structures of proteins are stored

Protein Data Bank (PDB)

is one of the earliest and most important biological databases developed to organize protein sequence information. It played a key role in bioinformatics and molecular biology.

One of the first bioinformatics projects in history.

Helped establish computational biology as a scientific field.

Provided the foundation for today’s sequence databases and bioinformatic tools.

Houses the protein sequences themselves

PIR (PROTEIN INFORMATION RESOURCE)

Swiss-Prot (manual, reviewed) + TrEMBL (automatic, unreviewed)

Together, they provide:

“A complete, reliable, and accessible source of protein knowledge for biological and biomedical research.”

Serves as the central protein database used globally by scientists.

Integrates data from genomics, proteomics, and structural biology.

Enables protein identification, functional prediction, and comparative analysis.

2002 UniProtKB

Universal Protein Resource Knowledgebase

Source database in the NCBI Protein Database:

Curated protein sequences - EMBL Europe

Swiss-Prot

Source database in the NCBI Protein Database:

Translated sequences - EMBL Europe

TrEMBL

A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases – CURATED

1986 Swiss-Prot

Source database in the NCBI Protein Database:

Submitted sequences - USA

PIR

Why did they put all of these together in NCBI?

Synergy that comes with information → better predictions, better functional annotations

We can now understand what big data means

With so many sequences available, it is now easier to work out the 3D structure of a protein from its sequence (molecular aspect)

Source database in the NCBI Protein Database:

Three-dimensional data on proteins and nucleotides - PDB

PDB

Source database in the NCBI Protein Database:

Annotated - NCBI-USA

RefSeq

Source database in the NCBI Protein Database:

Amino acids sequence data and translations - Japan

PRF

One of the early techniques Sanger used to deduce protein sequences, before DNA sequencing existed. It is an early protein sequencing technique based on partial hydrolysis of proteins to generate random overlapping peptide fragments. By analyzing these fragments and their overlaps, the complete amino acid sequence of a protein can be deduced.

SANGER’S RANDOM PERMEATION METHOD
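
The overlap idea on this card can be sketched in a few lines: given peptide fragments, repeatedly merge the pair with the longest suffix-to-prefix overlap. This is only an illustration of the overlap logic, not Sanger's laboratory procedure; the peptide, the fragments, and the function names are hypothetical, and a greedy merge like this can go wrong on sequences with repeats.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments, min_overlap=2):
    """Greedily merge fragments by their largest overlaps."""
    frags = sorted(set(fragments))
    # Drop fragments that sit entirely inside another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k < min_overlap:
            break  # nothing left that overlaps convincingly
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical fragments from partial hydrolysis of a made-up peptide.
print(assemble(["TAYIAK", "MKTAY", "IAKQR"]))  # ['MKTAYIAKQR']
```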

We now have the creation of DNA-side databases, populated by submissions from people doing research in laboratories. Give 2 examples of DNA databases.

1979 Los Alamos DNA Sequence Database
1982 GenBank (now maintained by NCBI, the National Center for Biotechnology Information)
1980 European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Data Library, for nucleotide sequence analysis
1986 DNA Data Bank of Japan (DDBJ)
International Nucleotide Sequence Database Collaboration (INSDC)

NCBI Protein Database in NCBI: And translated sequences - NCBI-USA

GenBank

NCBI Protein Database in NCBI: Annotated Third Party Annotated Sequences in NCBI, USA

TPA

UniProt is the world’s largest and most comprehensive protein database, providing detailed information about protein sequences, structures, and functions.
Has information on proteins in different structures, proteomes, etc.
Mature protein information system

UNIPROT

To look at info from different sources
Store data
Make sense across different databases
No PubMed in EMBL (resources are different)
GenBank

NCBI

is a protein classification database that integrates data from many different protein signature databases to provide a comprehensive view of protein families and their functions.
Like INSDC
Trying to link big databases together

InterPro

These repositories are just part of the bigger picture of trying to disseminate information in lab settings (collaborations, etc.)
Also has a set of tools
ENA (European Nucleotide Archive)

EMBL

in bioinformatics is the ability of systems and tools to efficiently manage and process the continuously increasing volume of biological data.
The methods can also be carried out by others

SCALABILITY

How large is the human genome, in GB?

8 GB

How large is the gene expression data for experiments?

1 TB

Sequence errors estimated at between

0.37 and 35(!) errors per 1000 bases
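
To get a feel for that range, here is a quick back-of-the-envelope calculation. The 1,000,000-base sequence length is an arbitrary assumption chosen for illustration; only the two error rates come from the card.

```python
def expected_errors(n_bases, errors_per_1000):
    """Expected number of erroneous bases at a given error rate."""
    return n_bases * errors_per_1000 / 1000


n_bases = 1_000_000  # assumed sequence length, for illustration only
low = expected_errors(n_bases, 0.37)
high = expected_errors(n_bases, 35)
print(f"{low:.0f} to {high:.0f} expected errors in {n_bases} bases")
```

At the low end that is a few hundred wrong bases per million; at the high end it is tens of thousands.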

Problems encountered in bioinformatics and sequencing

Recombination
Contamination
Annotation errors: misannotations propagate
Errors not always corrected in a timely way
Genes with varying, unrelated functions depending on context
Functional annotation is often unsystematic

refers to a common problem in bioinformatics and molecular biology where a gene or protein’s name does not accurately represent its true function — often because the name was given before its full biological role was understood.

Name-function disconnect

What progression do we want in bioinformatics?

INFORMATION → INTEGRATION → INSIGHT → KNOWLEDGE

INFORMATION SOURCES (DATABASES) Catalogs of genetic variation (SNPs, indels, structural variants)

dbSNP/dbVar

INFORMATION SOURCES (DATABASES)
Repository of raw nucleotide sequences (DNA/RNA) submitted by researchers worldwide
Can be submitted by so many people
Annotated collection of publicly available DNA sequences
Information on the submitter, submission, and context of submission

GenBank

INFORMATION SOURCES (DATABASES) Curated, non-redundant reference sequences for DNA, RNA, and proteins Only one entry Validated

RefSeq

INFORMATION SOURCES (DATABASES) Human genes and genetic disorders with clinical relevance

OMIM

INFORMATION SOURCES (DATABASES) Literature database for biomedical research and background knowledge

PubMed

INFORMATION SOURCES (DATABASES) For chemical studies

PubChem

In GenBank, it is a table of the annotated positions of genes

FEATURES TABLE

SOME COMMON ANALYSIS TOOLS Homology Searching

BLAST

SOME COMMON ANALYSIS TOOLS Sequence alignment

ClustalW

SOME COMMON ANALYSIS TOOLS Phylogenetics

PHYLIP

SOME COMMON ANALYSIS TOOLS Functional Patterns

HMMER

SOME COMMON ANALYSIS TOOLS Gene Prediction

GenScan

SOME COMMON ANALYSIS TOOLS Regulatory region analysis

MatInspector

SOME COMMON ANALYSIS TOOLS RNA structure

UNAFold

SOME COMMON ANALYSIS TOOLS Protein Structure

JPred

TOOLS FOR DATA RETRIEVAL AND EXPLORATION Hub summarizing information on gene structures, function, expression, orthology

Gene Databases

TOOLS FOR DATA RETRIEVAL AND EXPLORATION The central search engine linking across all NCBI databases

Entrez

TOOLS FOR DATA RETRIEVAL AND EXPLORATION Provides amino acid sequences, annotations, and functions

Protein database (RefSeq/GenPept)

CONVENTIONS OR GENERAL SYNTAX Accession

[ACCN]

CONVENTIONS OR GENERAL SYNTAX Affiliation

[AD]

CONVENTIONS OR GENERAL SYNTAX MeSH major topic — One of the major topics discussed in the article

[MAJR]

CONVENTIONS OR GENERAL SYNTAX Author name

[AU]

CONVENTIONS OR GENERAL SYNTAX All fields

[ALL]

CONVENTIONS OR GENERAL SYNTAX Unique author identifier, such as an ORCID ID

[AUID]

CONVENTIONS OR GENERAL SYNTAX Journal title, official abbreviation, or ISSN number — e.g. Journal of Biological Chemistry, J Biol Chem, 0021-9258

[JOUR]

CONVENTIONS OR GENERAL SYNTAX Issue of journal

[ISS]

CONVENTIONS OR GENERAL SYNTAX Gene name

[GENE]

CONVENTIONS OR GENERAL SYNTAX Language

[LA]

CONVENTIONS OR GENERAL SYNTAX Organism

[ORGN]

CONVENTIONS OR GENERAL SYNTAX Publication date — YYYY/MM/DD, YYYY/MM, or YYYY; insert a colon for date range, e.g., 2016:2018

[PDAT]

CONVENTIONS OR GENERAL SYNTAX PubMed ID

[PMID]

CONVENTIONS OR GENERAL SYNTAX Protein name (for sequence records)

[PROT]

CONVENTIONS OR GENERAL SYNTAX Substance name — Name of chemical discussed in article

[SUBS]

CONVENTIONS OR GENERAL SYNTAX Title word

[TITL]

CONVENTIONS OR GENERAL SYNTAX Secondary source ID — Names of secondary source databanks and/or accession numbers of sequences discussed in article

[SI]
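
The field tags on these cards are appended to a search term and combined with Boolean operators. Below is a small sketch of building such a query string. The helper names `tagged` and `build_query` and the BRCA1 example are made up for illustration, and which tags are valid depends on the database being searched, so treat the choice of `db` as an assumption. No network request is made.

```python
from urllib.parse import urlencode


def tagged(term, tag):
    """Attach a field tag to a term, e.g. BRCA1 + GENE -> BRCA1[GENE]."""
    return f"{term}[{tag}]"


def build_query(*clauses, operator="AND"):
    """Join tagged clauses with a Boolean operator."""
    return f" {operator} ".join(clauses)


query = build_query(
    tagged("BRCA1", "GENE"),         # gene name
    tagged("Homo sapiens", "ORGN"),  # organism
    tagged("2016:2018", "PDAT"),     # publication date range
)
print(query)  # BRCA1[GENE] AND Homo sapiens[ORGN] AND 2016:2018[PDAT]

# The same query as an E-utilities ESearch URL (db chosen as an example).
url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
       + urlencode({"db": "protein", "term": query}))
print(url)
```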

ANALYSIS TOOLS Identify conserved domains and motifs in proteins

CDD (conserved domain database)

ANALYSIS TOOLS Finds sequence similarities Essential for gene identification, evolutionary studies, and annotation

BLAST (Basic Local Alignment Search Tool)

ANALYSIS TOOLS BLAST variants: compare DNA/RNA/protein sequences (nucleotide query vs nucleotide database, protein query vs protein database, translated nucleotide query vs protein database)

blastn/blastp/blastx

ANALYSIS TOOLS Design and validate PCR primers

Primer-BLAST

ANALYSIS TOOLS Predict open reading frames in a sequence

ORF Finder (between start and stop codons)
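
A minimal sketch of the idea behind ORF finding: scan each reading frame for a start codon (ATG) and read codons until the first stop codon. It assumes the forward strand only and a fixed set of stop codons; the function name, the `min_codons` parameter, and the example sequence are hypothetical, and the real ORF Finder tool offers more options than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ORFs on the forward strand.

    start and end are 0-based, end exclusive; the sequence includes
    the start codon and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                for j in range(i, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOP_CODONS:
                        # Count codons before the stop (start codon included).
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, dna[i:j + 3]))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


print(find_orfs("CCATGGCCTTTTAAGG"))  # [(2, 14, 'ATGGCCTTTTAA')]
```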

FUNCTIONAL AND PATHWAY TOOLS Repository and analysis of gene expression and functional genomics datasets

Gene Expression Omnibus (GEO)

FUNCTIONAL AND PATHWAY TOOLS Integration of genes and proteins into biological pathways

BioSystems/Pathway Links

FUNCTIONAL AND PATHWAY TOOLS Database linking genetic variation to clinical phenotypes Mutations in disease related genes

ClinVar

SEPTEMBER 2 Flashcards

(83 cards)