Biology generates data at a scale that requires industrial-grade storage and retrieval infrastructure. The sequencing of a single human genome produces ~200 GB of raw reads. NCBI's databases collectively hold petabytes of biological data built up over decades. Before you can analyze biological data, you need to know where it lives and how to get it out.
This chapter is a practical guide to the primary biological databases — what they contain, how to access them programmatically, and how they relate to each other.
## The Landscape of Biological Databases
Biological data is not centralized. It's distributed across dozens of specialized databases, most of which cross-reference each other:
| Database | Contents | URL |
|---|---|---|
| NCBI / GenBank | DNA/RNA sequences (all organisms) | ncbi.nlm.nih.gov |
| RefSeq | Curated reference sequences | ncbi.nlm.nih.gov/refseq |
| UniProt / Swiss-Prot | Protein sequences + annotations | uniprot.org |
| PDB | 3D protein structures | rcsb.org |
| Ensembl | Eukaryotic genome annotation | ensembl.org |
| SRA | Raw sequencing reads | ncbi.nlm.nih.gov/sra |
| GEO | Gene expression datasets | ncbi.nlm.nih.gov/geo |
| dbSNP | Known human genetic variants | ncbi.nlm.nih.gov/snp |
| ClinVar | Variant-disease associations | ncbi.nlm.nih.gov/clinvar |
| OMIM | Genetic diseases | omim.org |
The key insight is that these databases are interlinked. A gene in Ensembl has a RefSeq ID. That RefSeq ID maps to UniProt protein records. Those proteins have PDB structure entries. Variants in that gene are in dbSNP; pathogenic variants are in ClinVar. Learning to navigate these cross-references is fundamental to bioinformatics.
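To make that chain concrete, here is the identifier trail for one gene collected as a plain mapping. The values are the BRCA1 identifiers used throughout this chapter:

```python
# BRCA1's identifiers across the major databases
brca1_ids = {
    "gene_symbol": "BRCA1",
    "refseq_mrna": "NM_007294",       # NCBI RefSeq transcript
    "refseq_protein": "NP_009225.1",  # NCBI RefSeq protein
    "uniprot": "P38398",              # UniProt/Swiss-Prot accession
    "pdb": "1JM7",                    # structure of the BRCT domain
}

for db, identifier in brca1_ids.items():
    print(f"{db:>15}: {identifier}")
```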
## NCBI Entrez: The Programmatic Interface
NCBI's Entrez system is the API gateway to most NCBI databases. Biopython's `Bio.Entrez` module wraps it in a clean Python interface.
### Setup
```python
from Bio import Entrez, SeqIO

# Required by NCBI — they use this to contact you if your queries cause problems
Entrez.email = "your@email.com"
```
Without an API key, NCBI allows 3 requests/second. With a free API key (available at NCBI), you get 10 requests/second. Always add delays between requests in loops:
```python
import time

time.sleep(0.4)  # stay within the 3 req/s limit
```
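In longer pipelines it is easy to forget the sleep. One defensive pattern is a small throttle decorator that enforces the spacing for you (a sketch: `throttled` and `fetch_record` are illustrative names, not part of Biopython, and the real function body would call `Entrez.efetch`):

```python
import time

def throttled(min_interval: float):
    """Enforce a minimum interval between calls to the wrapped function."""
    def wrap(fn):
        last_call = [0.0]  # mutable cell so the closure can update it
        def inner(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return fn(*args, **kwargs)
        return inner
    return wrap

@throttled(0.4)  # ~2.5 req/s, safely under the anonymous 3 req/s limit
def fetch_record(accession: str) -> str:
    # In real code this would call Entrez.efetch; stubbed for illustration
    return accession
```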
### Fetching a Sequence by Accession
Every sequence in GenBank/RefSeq has an accession number — a stable identifier. For example, NM_007294 is the RefSeq accession for the human BRCA1 mRNA.
```python
from Bio import Entrez, SeqIO

Entrez.email = "your@email.com"

# Fetch the BRCA1 mRNA sequence
handle = Entrez.efetch(
    db="nucleotide",
    id="NM_007294",
    rettype="gb",   # GenBank format
    retmode="text",
)
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id)           # NM_007294.4
print(len(record.seq))     # sequence length in bp
print(record.description)  # full description line
print(record.seq[:100])    # first 100 bases
```
### Searching for Records
The `esearch` function runs a text search and returns a list of record IDs:

```python
handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1[Gene Name] AND Homo sapiens[Organism] AND mRNA[Filter]",
    retmax=10,
)
search_results = Entrez.read(handle)
handle.close()

ids = search_results["IdList"]
print(f"Found {search_results['Count']} records, fetching {len(ids)}")
```
Entrez search syntax uses field tags in brackets. Common fields:

- `[Gene Name]` — gene symbol
- `[Organism]` — species
- `[Filter]` — record type (mRNA, protein, RefSeq, etc.)
- `[PDAT]` — publication date (`2020/01/01:2024/12/31[PDAT]`)
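These tags compose with `AND`/`OR`, so query strings lend themselves to a small helper. A sketch (the function name and signature are our own, not part of Biopython):

```python
def build_entrez_query(gene: str, organism: str,
                       filters: tuple = (), date_range: tuple = None) -> str:
    """Compose an Entrez term string from field-tagged parts."""
    parts = [f"{gene}[Gene Name]", f"{organism}[Organism]"]
    parts += [f"{f}[Filter]" for f in filters]
    if date_range:
        start, end = date_range
        parts.append(f"{start}:{end}[PDAT]")
    return " AND ".join(parts)

query = build_entrez_query("BRCA1", "Homo sapiens",
                           filters=("RefSeq", "mRNA"),
                           date_range=("2020/01/01", "2024/12/31"))
print(query)  # BRCA1[Gene Name] AND Homo sapiens[Organism] AND RefSeq[Filter] AND ...
```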
### Batch Fetching
For multiple records, pass `efetch` a comma-separated ID list:

```python
ids = ["NM_007294", "NM_000059", "NM_000546"]  # BRCA1, BRCA2, TP53

handle = Entrez.efetch(
    db="nucleotide",
    id=",".join(ids),
    rettype="fasta",
    retmode="text",
)
records = list(SeqIO.parse(handle, "fasta"))
handle.close()

for rec in records:
    print(f"{rec.id}: {len(rec.seq)} bp")
```
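Because the ID list travels in the request itself, very large lists should be split into batches (NCBI's `epost` utility exists for the same reason). A minimal chunking helper, with an illustrative batch size:

```python
def chunked(ids: list, size: int = 200):
    """Yield successive batches of at most `size` IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

all_ids = [f"ID{n}" for n in range(450)]
for batch in chunked(all_ids):
    # one efetch call per batch, with a delay between calls
    print(f"would fetch {len(batch)} records")
```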
## The GenBank Format
The GenBank flat file format is the richest format for annotated sequences. Each record contains the sequence plus extensive metadata:
```
LOCUS       NM_007294               7088 bp    mRNA    linear   PRI 01-JAN-2024
DEFINITION  Homo sapiens BRCA1 DNA repair associated (BRCA1), mRNA.
ACCESSION   NM_007294
VERSION     NM_007294.4
KEYWORDS    RefSeq; MANE Select.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; ...
FEATURES             Location/Qualifiers
     source          1..7088
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /chromosome="17"
     gene            1..7088
                     /gene="BRCA1"
     CDS             232..5824
                     /gene="BRCA1"
                     /product="breast cancer type 1 susceptibility protein"
                     /protein_id="NP_009225.1"
ORIGIN
        1 gaattcgatt tctgaataga gatcaagagg ...
```
Accessing features programmatically:

```python
record = SeqIO.read(handle, "genbank")

for feature in record.features:
    if feature.type == "CDS":
        print(f"CDS location: {feature.location}")
        print(f"Gene: {feature.qualifiers.get('gene', ['?'])[0]}")
        print(f"Protein ID: {feature.qualifiers.get('protein_id', ['?'])[0]}")

        # Extract and translate the CDS sequence
        cds_seq = feature.extract(record.seq)
        protein = cds_seq.translate(to_stop=True)
        print(f"Protein length: {len(protein)} aa")
```
## UniProt: The Protein Sequence Database
For proteins, UniProt is the primary reference. It has two tiers:
- Swiss-Prot — manually curated, high-quality annotations (~570k entries)
- TrEMBL — computationally annotated, much larger but lower confidence (~250M entries)
You access UniProt via its REST API:

```python
import requests

def fetch_uniprot(accession: str) -> dict:
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

# Fetch the human BRCA1 protein
data = fetch_uniprot("P38398")

print(data["primaryAccession"])            # P38398
print(data["uniProtkbId"])                 # BRCA1_HUMAN
print(data["organism"]["scientificName"])  # Homo sapiens

# Protein sequence
sequence = data["sequence"]["value"]
print(f"Length: {len(sequence)} aa")
print(f"Mass: {data['sequence']['molWeight']} Da")
```
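The same JSON carries feature annotations (domains, binding sites, variants). A sketch of pulling out domain positions, run here against a trimmed stand-in record rather than a live response; the field names follow the UniProt JSON schema as we understand it and should be checked against a real entry:

```python
sample = {
    # trimmed stand-in for a UniProt JSON entry
    "features": [
        {"type": "Domain", "description": "BRCT 1",
         "location": {"start": {"value": 1642}, "end": {"value": 1736}}},
        {"type": "Chain", "description": "Breast cancer type 1 susceptibility protein",
         "location": {"start": {"value": 1}, "end": {"value": 1863}}},
    ]
}

def list_domains(entry: dict) -> list:
    """Return (description, start, end) for each Domain feature."""
    domains = []
    for feat in entry.get("features", []):
        if feat["type"] == "Domain":
            loc = feat["location"]
            domains.append((feat["description"],
                            loc["start"]["value"], loc["end"]["value"]))
    return domains

print(list_domains(sample))  # [('BRCT 1', 1642, 1736)]
```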
Searching UniProt:

```python
def search_uniprot(
    query: str,
    fields: str = "accession,id,gene_names,length",
    max_results: int = 10,
) -> list:
    url = "https://rest.uniprot.org/uniprotkb/search"
    params = {
        "query": query,
        "fields": fields,
        "size": max_results,
        "format": "json",
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json()["results"]

results = search_uniprot("BRCA1 AND organism_id:9606 AND reviewed:true")
for entry in results:
    print(entry["primaryAccession"], entry.get("uniProtkbId"))
```
## The PDB: Protein Structure Database
The Protein Data Bank (RCSB PDB) stores experimentally determined 3D structures. Each entry has a 4-character PDB ID.
```python
import requests

def fetch_pdb_info(pdb_id: str) -> dict:
    url = f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id.lower()}"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

info = fetch_pdb_info("1JM7")  # BRCA1 BRCT domain
print(info["struct"]["title"])
print(info["rcsb_entry_info"]["resolution_combined"])
print(info["rcsb_entry_info"]["experimental_method"])
```
Downloading structure files:

```python
import urllib.request
from Bio.PDB import PDBParser

# Download a PDB file
pdb_id = "1JM7"
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
urllib.request.urlretrieve(url, f"{pdb_id}.pdb")

# Parse the structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")

for model in structure:
    for chain in model:
        residues = list(chain.get_residues())
        print(f"Chain {chain.id}: {len(residues)} residues")
```
PDB files come in two formats: the legacy `.pdb` text format (limited to ~100k atoms) and the newer `.cif`/mmCIF format (no size limit). For large complexes (ribosomes, viruses), use mmCIF. Biopython provides `PDBParser` for the legacy format and `MMCIFParser` for mmCIF.
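A small convenience is choosing the parser from the file extension. A sketch (`load_structure` is our own helper, not a Biopython function; the imports are deferred into the branches so the helper stays importable even where Biopython is absent):

```python
from pathlib import Path

def load_structure(path: str):
    """Choose Bio.PDB's parser based on the structure file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdb":
        from Bio.PDB import PDBParser  # legacy text format
        parser = PDBParser(QUIET=True)
    elif suffix in (".cif", ".mmcif"):
        from Bio.PDB import MMCIFParser  # mmCIF, no atom-count limit
        parser = MMCIFParser(QUIET=True)
    else:
        raise ValueError(f"Unrecognized structure format: {suffix}")
    return parser.get_structure(Path(path).stem, path)
```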
## Cross-Database ID Mapping
A common bioinformatics task is mapping between database IDs. UniProt provides a service for this: you submit a job, poll until it finishes, then fetch the results.

```python
import time
import requests

def map_ids(ids: list, from_db: str, to_db: str) -> dict:
    # Submit the mapping job
    url = "https://rest.uniprot.org/idmapping/run"
    response = requests.post(url, data={
        "ids": ",".join(ids),
        "from": from_db,
        "to": to_db,
    })
    response.raise_for_status()
    job_id = response.json()["jobId"]

    # Poll until the job finishes
    while True:
        status = requests.get(
            f"https://rest.uniprot.org/idmapping/status/{job_id}"
        ).json()
        if status.get("jobStatus") == "FINISHED":
            break
        time.sleep(1)

    results_url = f"https://rest.uniprot.org/idmapping/results/{job_id}"
    results = requests.get(results_url).json()
    return {r["from"]: r["to"] for r in results["results"]}

# Map RefSeq protein IDs to UniProt accessions
mapping = map_ids(["NP_009225.1", "NP_000050.2"], "RefSeq_Protein", "UniProtKB")
print(mapping)
```
## Practical Patterns
### Pattern 1: Gene → Sequence → Annotation
```python
import time
from Bio import Entrez, SeqIO

Entrez.email = "your@email.com"

def get_gene_info(gene_name: str, organism: str = "Homo sapiens") -> dict:
    # 1. Search for RefSeq mRNA records
    handle = Entrez.esearch(
        db="nucleotide",
        term=(
            f"{gene_name}[Gene Name] AND {organism}[Organism] "
            f"AND RefSeq[Filter] AND mRNA[Filter]"
        ),
    )
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return {}
    time.sleep(0.4)

    # 2. Fetch the top result
    handle = Entrez.efetch(db="nucleotide", id=ids[0], rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    # 3. Extract the relevant info
    cds_features = [f for f in record.features if f.type == "CDS"]
    result = {
        "accession": record.id,
        "length_bp": len(record.seq),
        "cds_count": len(cds_features),
    }
    if cds_features:
        cds = cds_features[0]
        result["protein_id"] = cds.qualifiers.get("protein_id", ["?"])[0]
        result["protein_length"] = len(
            cds.extract(record.seq).translate(to_stop=True)
        )
    return result

info = get_gene_info("TP53")
print(info)
```
### Pattern 2: Bulk Sequence Download for Analysis
```python
import time
from Bio import Entrez

Entrez.email = "your@email.com"

def download_sequences(gene_name: str, organism: str, output_file: str,
                       max_seqs: int = 50):
    handle = Entrez.esearch(
        db="nucleotide",
        term=f"{gene_name}[Gene Name] AND {organism}[Organism] AND RefSeq[Filter]",
        retmax=max_seqs,
    )
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    time.sleep(0.4)

    handle = Entrez.efetch(
        db="nucleotide",
        id=",".join(ids),
        rettype="fasta",
        retmode="text",
    )
    with open(output_file, "w") as f:
        f.write(handle.read())
    handle.close()
    print(f"Downloaded {len(ids)} sequences to {output_file}")

download_sequences("COX1", "Homo sapiens", "cox1_sequences.fasta")
```
## Summary
The major databases you'll use constantly:
- NCBI Entrez — the gateway for sequences, genomes, and literature. Biopython wraps it cleanly.
- UniProt — canonical protein records. Use the REST API directly.
- RCSB PDB — structural data. Biopython's `Bio.PDB` module handles parsing.
Cross-referencing between databases is the core skill. A gene has a symbol, a RefSeq ID, a UniProt accession, and possibly a PDB structure — and navigating between them is a daily bioinformatics operation.
Before writing a database query, check:
- Does the database have a stable REST API? (Most do)
- Is there a Biopython or dedicated Python library? (Saves weeks of work)
- What are the rate limits? (Always add delays in loops)
- Do you need an API key for higher throughput?
- Is there a bulk download option for large datasets? (FTP/S3 is faster than API for >1000 records)