Biology generates data at a scale that requires industrial-grade storage and retrieval infrastructure. The sequencing of a single human genome produces ~200 GB of raw reads. NCBI's databases collectively hold petabytes of biological data built up over decades. Before you can analyze biological data, you need to know where it lives and how to get it out.
This chapter is a practical guide to the primary biological databases — what they contain, how to access them programmatically, and how they relate to each other.
## The Landscape of Biological Databases
Biological data is not centralized. It's distributed across dozens of specialized databases, most of which cross-reference each other:
| Database | Contents | URL |
|---|---|---|
| NCBI / GenBank | DNA/RNA sequences (all organisms) | ncbi.nlm.nih.gov |
| RefSeq | Curated reference sequences | ncbi.nlm.nih.gov/refseq |
| UniProt / Swiss-Prot | Protein sequences + annotations | uniprot.org |
| PDB | 3D protein structures | rcsb.org |
| Ensembl | Eukaryotic genome annotation | ensembl.org |
| SRA | Raw sequencing reads | ncbi.nlm.nih.gov/sra |
| GEO | Gene expression datasets | ncbi.nlm.nih.gov/geo |
| dbSNP | Known human genetic variants | ncbi.nlm.nih.gov/snp |
| ClinVar | Variant-disease associations | ncbi.nlm.nih.gov/clinvar |
| OMIM | Genetic diseases | omim.org |
The key insight is that these databases are interlinked. A gene in Ensembl has a RefSeq ID. That RefSeq ID maps to UniProt protein records. Those proteins have PDB structure entries. Variants in that gene are in dbSNP; pathogenic variants are in ClinVar. Learning to navigate these cross-references is fundamental to bioinformatics.
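To make that chain concrete, here is the identifier trail for one gene collected as a plain mapping. The values are the BRCA1 identifiers used throughout this chapter:

```python
# BRCA1's identifiers across the major databases
brca1_ids = {
    "gene_symbol": "BRCA1",
    "refseq_mrna": "NM_007294",       # NCBI RefSeq transcript
    "refseq_protein": "NP_009225.1",  # NCBI RefSeq protein
    "uniprot": "P38398",              # UniProt/Swiss-Prot accession
    "pdb": "1JM7",                    # structure of the BRCT domain
}

for db, identifier in brca1_ids.items():
    print(f"{db:>15}: {identifier}")
```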
## NCBI Entrez: The Programmatic Interface
NCBI's Entrez system is the API gateway to most NCBI databases. Biopython's `Bio.Entrez` module wraps it in a clean Python interface.
### Setup
```python
from Bio import Entrez, SeqIO

# Required by NCBI — they use this to contact you if your queries cause problems
Entrez.email = "your@email.com"
```
Without an API key, NCBI allows 3 requests/second. With a free API key (available at NCBI), you get 10 requests/second. Always add delays between requests in loops:
```python
import time

time.sleep(0.4)  # stay within the 3 req/s limit
```
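In longer pipelines it is easy to forget the sleep. One defensive pattern is a small throttle decorator that enforces the spacing for you (a sketch: `throttled` and `fetch_record` are illustrative names, not part of Biopython, and the real function body would call `Entrez.efetch`):

```python
import time

def throttled(min_interval: float):
    """Enforce a minimum interval between calls to the wrapped function."""
    def wrap(fn):
        last_call = [0.0]  # mutable cell so the closure can update it
        def inner(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return fn(*args, **kwargs)
        return inner
    return wrap

@throttled(0.4)  # ~2.5 req/s, safely under the anonymous 3 req/s limit
def fetch_record(accession: str) -> str:
    # In real code this would call Entrez.efetch; stubbed for illustration
    return accession
```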
### Fetching a Sequence by Accession
Every sequence in GenBank/RefSeq has an accession number — a stable identifier. For example, NM_007294 is the RefSeq accession for the human BRCA1 mRNA.
```python
from Bio import Entrez, SeqIO

Entrez.email = "your@email.com"

# Fetch the BRCA1 mRNA sequence
handle = Entrez.efetch(
    db="nucleotide",
    id="NM_007294",
    rettype="gb",   # GenBank format
    retmode="text",
)
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id)           # NM_007294.4
print(len(record.seq))     # sequence length in bp
print(record.description)  # full description line
print(record.seq[:100])    # first 100 bases
```
### Searching for Records
The `esearch` function runs a text search and returns a list of record IDs:

```python
handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1[Gene Name] AND Homo sapiens[Organism] AND mRNA[Filter]",
    retmax=10,
)
search_results = Entrez.read(handle)
handle.close()

ids = search_results["IdList"]
print(f"Found {search_results['Count']} records, fetching {len(ids)}")
```
Entrez search syntax uses field tags in brackets. Common fields:

- `[Gene Name]` — gene symbol
- `[Organism]` — species
- `[Filter]` — record type (mRNA, protein, RefSeq, etc.)
- `[PDAT]` — publication date (`2020/01/01:2024/12/31[PDAT]`)
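These tags compose with `AND`/`OR`, so query strings lend themselves to a small helper. A sketch (the function name and signature are our own, not part of Biopython):

```python
def build_entrez_query(gene: str, organism: str,
                       filters: tuple = (), date_range: tuple = None) -> str:
    """Compose an Entrez term string from field-tagged parts."""
    parts = [f"{gene}[Gene Name]", f"{organism}[Organism]"]
    parts += [f"{f}[Filter]" for f in filters]
    if date_range:
        start, end = date_range
        parts.append(f"{start}:{end}[PDAT]")
    return " AND ".join(parts)

query = build_entrez_query("BRCA1", "Homo sapiens",
                           filters=("RefSeq", "mRNA"),
                           date_range=("2020/01/01", "2024/12/31"))
print(query)  # BRCA1[Gene Name] AND Homo sapiens[Organism] AND RefSeq[Filter] AND ...
```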
### Batch Fetching
For multiple records, pass `efetch` a comma-separated ID list:

```python
ids = ["NM_007294", "NM_000059", "NM_000546"]  # BRCA1, BRCA2, TP53

handle = Entrez.efetch(
    db="nucleotide",
    id=",".join(ids),
    rettype="fasta",
    retmode="text",
)
records = list(SeqIO.parse(handle, "fasta"))
handle.close()

for rec in records:
    print(f"{rec.id}: {len(rec.seq)} bp")
```
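Because the ID list travels in the request itself, very large lists should be split into batches (NCBI's `epost` utility exists for the same reason). A minimal chunking helper, with an illustrative batch size:

```python
def chunked(ids: list, size: int = 200):
    """Yield successive batches of at most `size` IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

all_ids = [f"ID{n}" for n in range(450)]
for batch in chunked(all_ids):
    # one efetch call per batch, with a delay between calls
    print(f"would fetch {len(batch)} records")
```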
## The GenBank Format
The GenBank flat file format is the richest format for annotated sequences. Each record contains the sequence plus extensive metadata:
```
LOCUS       NM_007294               7088 bp    mRNA    linear   PRI 01-JAN-2024
DEFINITION  Homo sapiens BRCA1 DNA repair associated (BRCA1), mRNA.
ACCESSION   NM_007294
VERSION     NM_007294.4
KEYWORDS    RefSeq; MANE Select.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; ...
FEATURES             Location/Qualifiers
     source          1..7088
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /chromosome="17"
     gene            1..7088
                     /gene="BRCA1"
     CDS             232..5824
                     /gene="BRCA1"
                     /product="breast cancer type 1 susceptibility protein"
                     /protein_id="NP_009225.1"
ORIGIN
        1 gaattcgatt tctgaataga gatcaagagg ...
```
Accessing features programmatically:

```python
record = SeqIO.read(handle, "genbank")

for feature in record.features:
    if feature.type == "CDS":
        print(f"CDS location: {feature.location}")
        print(f"Gene: {feature.qualifiers.get('gene', ['?'])[0]}")
        print(f"Protein ID: {feature.qualifiers.get('protein_id', ['?'])[0]}")

        # Extract and translate the CDS sequence
        cds_seq = feature.extract(record.seq)
        protein = cds_seq.translate(to_stop=True)
        print(f"Protein length: {len(protein)} aa")
```
## UniProt: The Protein Sequence Database
For proteins, UniProt is the primary reference. It has two tiers:
- Swiss-Prot — manually curated, high-quality annotations (~570k entries)
- TrEMBL — computationally annotated, much larger but lower confidence (~250M entries)
You access UniProt via its REST API:

```python
import requests

def fetch_uniprot(accession: str) -> dict:
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

# Fetch the human BRCA1 protein
data = fetch_uniprot("P38398")

print(data["primaryAccession"])            # P38398
print(data["uniProtkbId"])                 # BRCA1_HUMAN
print(data["organism"]["scientificName"])  # Homo sapiens

# Protein sequence
sequence = data["sequence"]["value"]
print(f"Length: {len(sequence)} aa")
print(f"Mass: {data['sequence']['molWeight']} Da")
```
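The same JSON carries feature annotations (domains, binding sites, variants). A sketch of pulling out domain positions, run here against a trimmed stand-in record rather than a live response; the field names follow the UniProt JSON schema as we understand it and should be checked against a real entry:

```python
sample = {
    # trimmed stand-in for a UniProt JSON entry
    "features": [
        {"type": "Domain", "description": "BRCT 1",
         "location": {"start": {"value": 1642}, "end": {"value": 1736}}},
        {"type": "Chain", "description": "Breast cancer type 1 susceptibility protein",
         "location": {"start": {"value": 1}, "end": {"value": 1863}}},
    ]
}

def list_domains(entry: dict) -> list:
    """Return (description, start, end) for each Domain feature."""
    domains = []
    for feat in entry.get("features", []):
        if feat["type"] == "Domain":
            loc = feat["location"]
            domains.append((feat["description"],
                            loc["start"]["value"], loc["end"]["value"]))
    return domains

print(list_domains(sample))  # [('BRCT 1', 1642, 1736)]
```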
Searching UniProt:

```python
def search_uniprot(
    query: str,
    fields: str = "accession,id,gene_names,length",
    max_results: int = 10,
) -> list:
    url = "https://rest.uniprot.org/uniprotkb/search"
    params = {
        "query": query,
        "fields": fields,
        "size": max_results,
        "format": "json",
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json()["results"]

results = search_uniprot("BRCA1 AND organism_id:9606 AND reviewed:true")
for entry in results:
    print(entry["primaryAccession"], entry.get("uniProtkbId"))
```
## The PDB: Protein Structure Database
The Protein Data Bank (RCSB PDB) stores experimentally determined 3D structures. Each entry has a 4-character PDB ID.
```python
import requests

def fetch_pdb_info(pdb_id: str) -> dict:
    url = f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id.lower()}"
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

info = fetch_pdb_info("1JM7")  # BRCA1 BRCT domain
print(info["struct"]["title"])
print(info["rcsb_entry_info"]["resolution_combined"])
print(info["rcsb_entry_info"]["experimental_method"])
```
Downloading structure files:

```python
import urllib.request
from Bio.PDB import PDBParser

# Download a PDB file
pdb_id = "1JM7"
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
urllib.request.urlretrieve(url, f"{pdb_id}.pdb")

# Parse the structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")

for model in structure:
    for chain in model:
        residues = list(chain.get_residues())
        print(f"Chain {chain.id}: {len(residues)} residues")
```
PDB files come in two formats: the legacy `.pdb` text format (limited to ~100k atoms) and the newer `.cif`/mmCIF format (no size limit). For large complexes (ribosomes, viruses), use mmCIF. Biopython provides `PDBParser` for the legacy format and `MMCIFParser` for mmCIF.
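A small convenience is choosing the parser from the file extension. A sketch (`load_structure` is our own helper, not a Biopython function; the imports are deferred into the branches so the helper stays importable even where Biopython is absent):

```python
from pathlib import Path

def load_structure(path: str):
    """Choose Bio.PDB's parser based on the structure file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdb":
        from Bio.PDB import PDBParser  # legacy text format
        parser = PDBParser(QUIET=True)
    elif suffix in (".cif", ".mmcif"):
        from Bio.PDB import MMCIFParser  # mmCIF, no atom-count limit
        parser = MMCIFParser(QUIET=True)
    else:
        raise ValueError(f"Unrecognized structure format: {suffix}")
    return parser.get_structure(Path(path).stem, path)
```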
## Cross-Database ID Mapping
A common bioinformatics task is mapping between database IDs. UniProt provides a service for this: you submit a job, poll until it finishes, then fetch the results.

```python
import time
import requests

def map_ids(ids: list, from_db: str, to_db: str) -> dict:
    # Submit the mapping job
    url = "https://rest.uniprot.org/idmapping/run"
    response = requests.post(url, data={
        "ids": ",".join(ids),
        "from": from_db,
        "to": to_db,
    })
    response.raise_for_status()
    job_id = response.json()["jobId"]

    # Poll until the job finishes
    while True:
        status = requests.get(
            f"https://rest.uniprot.org/idmapping/status/{job_id}"
        ).json()
        if status.get("jobStatus") == "FINISHED":
            break
        time.sleep(1)

    results_url = f"https://rest.uniprot.org/idmapping/results/{job_id}"
    results = requests.get(results_url).json()
    return {r["from"]: r["to"] for r in results["results"]}

# Map RefSeq protein IDs to UniProt accessions
mapping = map_ids(["NP_009225.1", "NP_000050.2"], "RefSeq_Protein", "UniProtKB")
print(mapping)
```
## Practical Patterns
### Pattern 1: Gene → Sequence → Annotation
```python
import time
from Bio import Entrez, SeqIO

Entrez.email = "your@email.com"

def get_gene_info(gene_name: str, organism: str = "Homo sapiens") -> dict:
    # 1. Search for RefSeq mRNA records
    handle = Entrez.esearch(
        db="nucleotide",
        term=(
            f"{gene_name}[Gene Name] AND {organism}[Organism] "
            f"AND RefSeq[Filter] AND mRNA[Filter]"
        ),
    )
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return {}
    time.sleep(0.4)

    # 2. Fetch the top result
    handle = Entrez.efetch(db="nucleotide", id=ids[0], rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    # 3. Extract the relevant info
    cds_features = [f for f in record.features if f.type == "CDS"]
    result = {
        "accession": record.id,
        "length_bp": len(record.seq),
        "cds_count": len(cds_features),
    }
    if cds_features:
        cds = cds_features[0]
        result["protein_id"] = cds.qualifiers.get("protein_id", ["?"])[0]
        result["protein_length"] = len(
            cds.extract(record.seq).translate(to_stop=True)
        )
    return result

info = get_gene_info("TP53")
print(info)
```
### Pattern 2: Bulk Sequence Download for Analysis
```python
import time
from Bio import Entrez

Entrez.email = "your@email.com"

def download_sequences(gene_name: str, organism: str, output_file: str,
                       max_seqs: int = 50):
    handle = Entrez.esearch(
        db="nucleotide",
        term=f"{gene_name}[Gene Name] AND {organism}[Organism] AND RefSeq[Filter]",
        retmax=max_seqs,
    )
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    time.sleep(0.4)

    handle = Entrez.efetch(
        db="nucleotide",
        id=",".join(ids),
        rettype="fasta",
        retmode="text",
    )
    with open(output_file, "w") as f:
        f.write(handle.read())
    handle.close()
    print(f"Downloaded {len(ids)} sequences to {output_file}")

download_sequences("COX1", "Homo sapiens", "cox1_sequences.fasta")
```
## Summary
The major databases you'll use constantly:
- NCBI Entrez — the gateway for sequences, genomes, and literature. Biopython wraps it cleanly.
- UniProt — canonical protein records. Use the REST API directly.
- RCSB PDB — structural data. Biopython's `Bio.PDB` module handles parsing.
Cross-referencing between databases is the core skill. A gene has a symbol, a RefSeq ID, a UniProt accession, and possibly a PDB structure — and navigating between them is a daily bioinformatics operation.
Before writing a database query, check:
- Does the database have a stable REST API? (Most do)
- Is there a Biopython or dedicated Python library? (Saves weeks of work)
- What are the rate limits? (Always add delays in loops)
- Do you need an API key for higher throughput?
- Is there a bulk download option for large datasets? (FTP/S3 is faster than API for >1000 records)