Part 1 · 1.4 · 20 min read

In Practice: NCBI and Biological Databases

A hands-on introduction to querying NCBI, GenBank, UniProt, and the PDB — the primary databases behind almost all bioinformatics workflows.

Tags: NCBI, databases, practical tools

Biology generates data at a scale that requires industrial-grade storage and retrieval infrastructure. The sequencing of a single human genome produces ~200 GB of raw reads. NCBI's databases collectively hold petabytes of biological data built up over decades. Before you can analyze biological data, you need to know where it lives and how to get it out.

This chapter is a practical guide to the primary biological databases — what they contain, how to access them programmatically, and how they relate to each other.

The Landscape of Biological Databases

Biological data is not centralized. It's distributed across dozens of specialized databases, most of which cross-reference each other:

Database              Contents                            URL
NCBI / GenBank        DNA/RNA sequences (all organisms)   ncbi.nlm.nih.gov
RefSeq                Curated reference sequences         ncbi.nlm.nih.gov/refseq
UniProt / Swiss-Prot  Protein sequences + annotations     uniprot.org
PDB                   3D protein structures               rcsb.org
Ensembl               Eukaryotic genome annotation        ensembl.org
SRA                   Raw sequencing reads                ncbi.nlm.nih.gov/sra
GEO                   Gene expression datasets            ncbi.nlm.nih.gov/geo
dbSNP                 Known human genetic variants        ncbi.nlm.nih.gov/snp
ClinVar               Variant-disease associations        ncbi.nlm.nih.gov/clinvar
OMIM                  Genetic diseases                    omim.org

The key insight is that these databases are interlinked. A gene in Ensembl has a RefSeq ID. That RefSeq ID maps to UniProt protein records. Those proteins have PDB structure entries. Variants in that gene are in dbSNP; pathogenic variants are in ClinVar. Learning to navigate these cross-references is fundamental to bioinformatics.
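As a concrete illustration, here is that cross-reference chain assembled by hand for BRCA1, using the identifiers that appear throughout this chapter (in a real pipeline each hop would be resolved programmatically, e.g. via an ID-mapping service):

```python
# Cross-reference chain for one gene, BRCA1, written out explicitly.
# Each value lives in a different database, and each hop can be resolved
# programmatically; these are the IDs used in this chapter's examples.
brca1 = {
    "gene_symbol": "BRCA1",           # HGNC gene symbol
    "refseq_mrna": "NM_007294",       # RefSeq transcript accession
    "refseq_protein": "NP_009225.1",  # RefSeq protein (from the CDS feature)
    "uniprot": "P38398",              # Swiss-Prot accession
    "pdb": "1JM7",                    # one of several PDB structures
}

for db, identifier in brca1.items():
    print(f"{db:>15}: {identifier}")
```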

NCBI Entrez: The Programmatic Interface

NCBI's Entrez system is the API gateway for most NCBI databases. The Biopython library provides the Entrez module as a clean Python interface.

Setup

python
from Bio import Entrez, SeqIO

# Required by NCBI — they use this to contact you if your queries cause problems
Entrez.email = "your@email.com"

API rate limits

Without an API key, NCBI allows 3 requests/second. With a free API key (available at NCBI), you get 10 requests/second. Always add delays between requests in loops:

python
import time
time.sleep(0.4)  # stay within 3 req/s limit
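If you have registered a key, Biopython lets you set it once via `Entrez.api_key = "..."`. Either way, the appropriate spacing between requests follows directly from the published limits; a small helper (my own, for illustration) makes that calculation explicit:

```python
# Choose a per-request delay that stays under NCBI's published limits:
# 3 requests/second without an API key, 10 with one. A small safety
# margin avoids drifting over the limit due to timing jitter.
def request_delay(has_api_key: bool) -> float:
    rate = 10 if has_api_key else 3    # allowed requests per second
    return round(1 / rate + 0.05, 3)   # margin on top of minimum spacing

print(request_delay(False))  # 0.383
print(request_delay(True))   # 0.15
```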

Fetching a Sequence by Accession

Every sequence in GenBank/RefSeq has an accession number — a stable identifier. For example, NM_007294 is the RefSeq accession for the human BRCA1 mRNA.

python
from Bio import Entrez, SeqIO

Entrez.email = "your@email.com"

# Fetch BRCA1 mRNA sequence
handle = Entrez.efetch(
    db="nucleotide",
    id="NM_007294",
    rettype="gb",       # GenBank format
    retmode="text"
)

record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id)               # NM_007294.4
print(len(record.seq))         # sequence length in bp
print(record.description)      # human description
print(record.seq[:100])        # first 100 bases

Searching for Records

The esearch function runs a text search and returns a list of IDs:

python
handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1[Gene Name] AND Homo sapiens[Organism] AND mRNA[Filter]",
    retmax=10
)
search_results = Entrez.read(handle)
handle.close()

ids = search_results["IdList"]
print(f"Found {search_results['Count']} records, fetching {len(ids)}")

Entrez search syntax uses field tags in brackets. Common fields:

  • [Gene Name] — gene symbol
  • [Organism] — species
  • [Filter] — record type (mRNA, protein, RefSeq, etc.)
  • [PDAT] — publication date (2020/01/01:2024/12/31[PDAT])
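Field-tagged clauses compose with AND/OR into a single query string. A small helper (illustrative, not part of Biopython) makes that composition explicit:

```python
# Build an Entrez search term from (value, field) pairs, joined with AND.
def build_term(*clauses: tuple) -> str:
    return " AND ".join(f"{value}[{field}]" for value, field in clauses)

term = build_term(
    ("BRCA1", "Gene Name"),
    ("Homo sapiens", "Organism"),
    ("mRNA", "Filter"),
)
print(term)
# BRCA1[Gene Name] AND Homo sapiens[Organism] AND mRNA[Filter]
```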

Batch Fetching

For multiple records, use efetch with a comma-separated ID list:

python
import time

ids = ["NM_007294", "NM_000059", "NM_000546"]  # BRCA1, BRCA2, TP53

handle = Entrez.efetch(
    db="nucleotide",
    id=",".join(ids),
    rettype="fasta",
    retmode="text"
)

records = list(SeqIO.parse(handle, "fasta"))
handle.close()

for rec in records:
    print(f"{rec.id}: {len(rec.seq)} bp")

The GenBank Format

The GenBank flat file format is the richest format for annotated sequences. Each record contains the sequence plus extensive metadata:

LOCUS       NM_007294               7088 bp    mRNA    linear   PRI 01-JAN-2024
DEFINITION  Homo sapiens BRCA1 DNA repair associated (BRCA1), mRNA.
ACCESSION   NM_007294
VERSION     NM_007294.4
KEYWORDS    RefSeq; MANE Select.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; ...
FEATURES             Location/Qualifiers
     source          1..7088
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /chromosome="17"
     gene            1..7088
                     /gene="BRCA1"
     CDS             232..5824
                     /gene="BRCA1"
                     /product="breast cancer type 1 susceptibility protein"
                     /protein_id="NP_009225.1"
ORIGIN
        1 gaattcgatt tctgaataga gatcaagagg ...

Accessing features programmatically:

python
record = SeqIO.read(handle, "genbank")

for feature in record.features:
    if feature.type == "CDS":
        print(f"CDS location: {feature.location}")
        print(f"Gene: {feature.qualifiers.get('gene', ['?'])[0]}")
        print(f"Protein ID: {feature.qualifiers.get('protein_id', ['?'])[0]}")

        # Extract the CDS sequence
        cds_seq = feature.extract(record.seq)
        protein = cds_seq.translate(to_stop=True)
        print(f"Protein length: {len(protein)} aa")

UniProt: The Protein Sequence Database

For proteins, UniProt is the primary reference. It has two tiers:

  • Swiss-Prot — manually curated, high-quality annotations (~570k entries)
  • TrEMBL — computationally annotated, much larger but lower confidence (~250M entries)

You access UniProt via REST API:

python
import requests

def fetch_uniprot(accession: str) -> dict:
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

# Fetch human BRCA1 protein
data = fetch_uniprot("P38398")

print(data["primaryAccession"])    # P38398
print(data["uniProtkbId"])         # BRCA1_HUMAN
print(data["organism"]["scientificName"])  # Homo sapiens

# Protein sequence
sequence = data["sequence"]["value"]
print(f"Length: {len(sequence)} aa")
print(f"Mass: {data['sequence']['molWeight']} Da")

Searching UniProt:

python
def search_uniprot(query: str, fields: str = "accession,id,gene_names,length", max_results: int = 10) -> list:
    url = "https://rest.uniprot.org/uniprotkb/search"
    params = {
        "query": query,
        "fields": fields,
        "size": max_results,
        "format": "json"
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json()["results"]

results = search_uniprot("BRCA1 AND organism_id:9606 AND reviewed:true")
for entry in results:
    print(entry["primaryAccession"], entry.get("uniProtkbId"))

The PDB: Protein Structure Database

The Protein Data Bank (RCSB PDB) stores experimentally determined 3D structures. Each entry has a 4-character PDB ID.

python
import requests

def fetch_pdb_info(pdb_id: str) -> dict:
    url = f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id.lower()}"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

info = fetch_pdb_info("1JM7")  # BRCA1/BARD1 RING-domain heterodimer (NMR)
print(info["struct"]["title"])
print(info["rcsb_entry_info"].get("resolution_combined"))  # None for NMR entries
print(info["rcsb_entry_info"]["experimental_method"])

Downloading structure files:

python
from Bio.PDB import PDBParser, MMCIFParser
import urllib.request

# Download PDB file
pdb_id = "1JM7"
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
urllib.request.urlretrieve(url, f"{pdb_id}.pdb")

# Parse structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")

for model in structure:
    for chain in model:
        residues = list(chain.get_residues())
        print(f"Chain {chain.id}: {len(residues)} residues")

PDB file formats

PDB files come in two formats: the legacy .pdb text format (limited to ~100k atoms by its fixed-width columns) and the newer .cif/mmCIF format (no size limit). For large complexes (ribosomes, viruses), use mmCIF. Biopython provides PDBParser for the legacy format and MMCIFParser for mmCIF.
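Since the two formats need different parsers, one simple pattern is to dispatch on the file extension. A sketch (the helper and its name are mine, not Biopython's):

```python
from pathlib import Path

# Map a structure file's extension to the name of the Biopython parser
# class that reads it: .pdb -> PDBParser, .cif/.mmcif -> MMCIFParser.
def parser_name_for(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix == ".pdb":
        return "PDBParser"
    if suffix in (".cif", ".mmcif"):
        return "MMCIFParser"
    raise ValueError(f"unrecognized structure file: {path}")

print(parser_name_for("1JM7.pdb"))  # PDBParser
print(parser_name_for("7k00.cif"))  # MMCIFParser
```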

Cross-Database ID Mapping

A common bioinformatics task is mapping between database IDs. UniProt provides a service for this:

python
import time

def map_ids(ids: list, from_db: str, to_db: str) -> dict:
    url = "https://rest.uniprot.org/idmapping/run"
    response = requests.post(url, data={
        "ids": ",".join(ids),
        "from": from_db,
        "to": to_db
    })
    job_id = response.json()["jobId"]

    # Poll until the mapping job finishes
    while True:
        status = requests.get(f"https://rest.uniprot.org/idmapping/status/{job_id}").json()
        if status.get("jobStatus") == "FINISHED":
            break
        time.sleep(1)

    results_url = f"https://rest.uniprot.org/idmapping/results/{job_id}"
    results = requests.get(results_url).json()
    return {r["from"]: r["to"] for r in results["results"]}

# Map RefSeq protein IDs to UniProt accessions
mapping = map_ids(["NP_009225.1", "NP_000050.2"], "RefSeq_Protein", "UniProtKB")
print(mapping)

Practical Patterns

Pattern 1: Gene → Sequence → Annotation

python
from Bio import Entrez, SeqIO
import time

Entrez.email = "your@email.com"

def get_gene_info(gene_name: str, organism: str = "Homo sapiens") -> dict:
    # 1. Search for RefSeq mRNA
    handle = Entrez.esearch(
        db="nucleotide",
        term=f"{gene_name}[Gene Name] AND {organism}[Organism] AND RefSeq[Filter] AND mRNA[Filter]"
    )
    ids = Entrez.read(handle)["IdList"]
    handle.close()

    if not ids:
        return {}

    time.sleep(0.4)

    # 2. Fetch the top result
    handle = Entrez.efetch(db="nucleotide", id=ids[0], rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    # 3. Extract relevant info
    cds_features = [f for f in record.features if f.type == "CDS"]
    result = {
        "accession": record.id,
        "length_bp": len(record.seq),
        "cds_count": len(cds_features),
    }

    if cds_features:
        cds = cds_features[0]
        result["protein_id"] = cds.qualifiers.get("protein_id", ["?"])[0]
        result["protein_length"] = len(cds.extract(record.seq).translate(to_stop=True))

    return result

info = get_gene_info("TP53")
print(info)

Pattern 2: Bulk Sequence Download for Analysis

python
def download_sequences(gene_name: str, organism: str, output_file: str, max_seqs: int = 50):
    Entrez.email = "your@email.com"

    handle = Entrez.esearch(
        db="nucleotide",
        term=f"{gene_name}[Gene Name] AND {organism}[Organism] AND RefSeq[Filter]",
        retmax=max_seqs
    )
    ids = Entrez.read(handle)["IdList"]
    handle.close()

    if not ids:
        print(f"No RefSeq records found for {gene_name}")
        return

    time.sleep(0.4)

    handle = Entrez.efetch(
        db="nucleotide",
        id=",".join(ids),
        rettype="fasta",
        retmode="text"
    )

    with open(output_file, "w") as f:
        f.write(handle.read())
    handle.close()

    print(f"Downloaded {len(ids)} sequences to {output_file}")

download_sequences("COX1", "Homo sapiens", "cox1_sequences.fasta")

Summary

The major databases you'll use constantly:

  • NCBI Entrez — the gateway for sequences, genomes, and literature. Biopython wraps it cleanly.
  • UniProt — canonical protein records. Use the REST API directly.
  • RCSB PDB — structural data. Biopython's Bio.PDB module handles parsing.

Cross-referencing between databases is the core skill. A gene has a symbol, a RefSeq ID, a UniProt accession, and possibly a PDB structure — and navigating between them is a daily bioinformatics operation.

Programmatic access checklist

Before writing a database query, check:

  1. Does the database have a stable REST API? (Most do)
  2. Is there a Biopython or dedicated Python library? (Saves weeks of work)
  3. What are the rate limits? (Always add delays in loops)
  4. Do you need an API key for higher throughput?
  5. Is there a bulk download option for large datasets? (FTP/S3 is faster than API for >1000 records)
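Items 3 and 4 of the checklist can be baked into a small rate-limiting wrapper that enforces a minimum spacing between successive calls, so you can't forget the delay. A generic sketch (the names are mine; swap the stand-in `fetch` for a real request):

```python
import time
from functools import wraps

# Enforce a minimum interval between successive calls to the wrapped
# function, sleeping only when the previous call was too recent.
def rate_limited(min_interval: float):
    def decorator(func):
        last_call = [0.0]
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(0.4)  # stay under 3 req/s without an API key
def fetch(url: str) -> str:
    return f"fetched {url}"  # stand-in for a real requests.get(url)

start = time.monotonic()
for i in range(3):
    fetch(f"https://example.org/{i}")
print(f"3 calls took {time.monotonic() - start:.2f} s")  # >= 0.8 s
```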