Part 2·2.7·25 min read

In Practice: Biopython Basics

Biopython is the standard Python toolkit for biological sequence analysis — parsing GenBank files, computing sequence statistics, running alignments, and working with structures.

biopythonpythonpracticalsequences

Biopython is the closest thing biology has to a standard library for sequence analysis. It wraps NCBI databases, parses dozens of file formats, implements classic algorithms, and provides data structures purpose-built for biological sequences. If you're writing Python code that touches sequences, you're almost certainly using Biopython.

This chapter covers the core workflows: working with sequences, parsing common file formats, computing sequence properties, aligning sequences, and a brief introduction to structure handling.

Installation and Setup

bash
pip install biopython

The main namespace is Bio. Submodules are organized by function:

python
from Bio import SeqIO          # File parsing and sequence I/O
from Bio.Seq import Seq        # The core Seq object
from Bio import Entrez         # NCBI database access
from Bio import Align          # Pairwise and multiple alignment
from Bio import pairwise2      # Legacy pairwise alignment
from Bio.PDB import PDBParser  # Structure parsing

The Seq Object

The foundational abstraction is Bio.Seq.Seq — a string-like object that knows it represents a biological sequence and supports biological operations.

python
from Bio.Seq import Seq

# Create a sequence
dna = Seq("ATGCGATCGATCGATCG")

# String-like operations work as expected
print(len(dna))           # 18
print(dna[0:6])           # ATGCGA
print(dna.count("ATG"))   # 1

# Biology-specific operations
print(dna.complement())           # TACGCTAGCTAGCTAGC
print(dna.reverse_complement())   # CGATCGATCGATCGCAT

# Transcription: DNA → RNA (T → U)
rna = dna.transcribe()
print(rna)  # AUGCGAUCGAUCGAUCG

# Translation: RNA → protein
protein = rna.translate()
print(protein)  # MRSS* (where * is stop codon)

# Direct DNA → protein
protein = dna.translate()
print(protein)       # MRSS*
print(protein[:-1])  # Remove stop codon: MRSS

Sequence Alphabets and Types

python
# By default, Seq is typeless — works for DNA, RNA, or protein
# For RNA sequences, use U instead of T
rna = Seq("AUGCGAUCG")

# MutableSeq allows in-place modification
from Bio.Seq import MutableSeq
mseq = MutableSeq("ATGCGATCG")
mseq[0] = "G"     # point substitution
print(mseq)       # GTGCGATCG

SeqRecord: Sequences with Metadata

The SeqRecord object combines a Seq with its identifier, name, description, and features.

python
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

record = SeqRecord(
    Seq("ATGCGATCGATCG"),
    id="NM_007294",
    name="BRCA1",
    description="Homo sapiens BRCA1 mRNA",
    annotations={"molecule_type": "mRNA"}
)

print(record.id)           # NM_007294
print(record.seq)          # ATGCGATCGATCG
print(record.description)  # Homo sapiens BRCA1 mRNA

File Parsing with SeqIO

Bio.SeqIO reads and writes all common biological sequence formats. The interface is consistent regardless of format:

python
from Bio import SeqIO

# Parse a FASTA file
records = list(SeqIO.parse("sequences.fasta", "fasta"))
for record in records:
    print(f"{record.id}: {len(record.seq)} bp")

# Parse a GenBank file
for record in SeqIO.parse("brca1.gb", "genbank"):
    print(record.id)
    print(record.description)
    print(f"Features: {len(record.features)}")

# Read a single record
record = SeqIO.read("single.fasta", "fasta")  # raises ValueError if >1 record

Supported Formats

Common formats and their format strings:

FormatStringNotes
FASTA"fasta"Most common; no annotations
GenBank"genbank" or "gb"Full annotations, features
FASTQ"fastq"Sequences + quality scores
EMBL"embl"European equivalent of GenBank
Stockholm"stockholm"Multiple sequence alignments
Clustal"clustal"Multiple sequence alignments

Writing Sequences

python
# Write to FASTA
SeqIO.write(records, "output.fasta", "fasta")

# Convert between formats
with open("output.gb", "w") as f:
    SeqIO.write(records, f, "genbank")

# Convert FASTQ → FASTA
records = SeqIO.parse("reads.fastq", "fastq")
SeqIO.write(records, "reads.fasta", "fasta")

Working with GenBank Features

GenBank records contain rich feature annotations. Features are SeqFeature objects with a type, location, and qualifiers:

python
record = SeqIO.read("brca1.gb", "genbank")

for feature in record.features:
    print(f"Type: {feature.type}")
    print(f"Location: {feature.location}")
    print(f"Strand: {feature.location.strand}")  # 1 (forward) or -1 (reverse)

# Filter for CDS features
cds_features = [f for f in record.features if f.type == "CDS"]

for cds in cds_features:
    gene = cds.qualifiers.get("gene", ["unknown"])[0]
    protein_id = cds.qualifiers.get("protein_id", ["?"])[0]

    # Extract the CDS sequence (handles joins and strand)
    cds_seq = cds.extract(record.seq)
    protein = cds_seq.translate(to_stop=True)

    print(f"{gene} ({protein_id}): {len(protein)} aa")

Feature Locations

Feature locations can be:

  • SimpleLocation(start, end, strand) — a simple interval
  • CompoundLocation — a join of intervals (for genes with introns, or wrapping around circular chromosomes)
python
from Bio.SeqFeature import SimpleLocation

feature = record.features[0]
loc = feature.location

print(loc.start)   # 0-based start (ExactPosition)
print(loc.end)     # 0-based end (ExactPosition)
print(loc.strand)  # 1 or -1

# Check if compound (spliced)
from Bio.SeqFeature import CompoundLocation
if isinstance(loc, CompoundLocation):
    for part in loc.parts:
        print(f"  Exon: {part.start}–{part.end}")

Sequence Analysis

GC Content and Composition

python
from Bio.SeqUtils import gc_fraction

seq = Seq("ATGCGATCGATCGATCG")
gc = gc_fraction(seq)
print(f"GC content: {gc:.1%}")  # 47.1%

# Manual composition
from collections import Counter
counts = Counter(str(seq))
print(counts)  # Counter({'A': 3, 'T': 4, 'G': 5, 'C': 5, ...})

Codon Usage

python
from Bio.SeqUtils import CodonAdaptationIndex
from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex

# Count codons in a coding sequence
cds = Seq("ATGAAACGTTTTGGG")  # must be in-frame, no partial codons

codons = [str(cds[i:i+3]) for i in range(0, len(cds)-3, 3)]
from collections import Counter
codon_counts = Counter(codons)
print(codon_counts)

Finding ORFs

python
def find_orfs(seq: Seq, min_length: int = 100) -> list:
    orfs = []
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            i = frame
            while i < len(nuc) - 2:
                codon = nuc[i:i+3]
                if codon == "ATG":
                    # Find the stop codon
                    for j in range(i, len(nuc) - 2, 3):
                        if nuc[j:j+3] in ("TAA", "TAG", "TGA"):
                            orf_len = j + 3 - i
                            if orf_len >= min_length:
                                orfs.append({
                                    "start": i,
                                    "end": j + 3,
                                    "strand": strand,
                                    "frame": frame,
                                    "length": orf_len,
                                    "seq": nuc[i:j+3]
                                })
                            break
                i += 3
    return sorted(orfs, key=lambda x: -x["length"])

orfs = find_orfs(Seq("ATGAAACGTTTTGGGTAG"))
for orf in orfs:
    print(f"Strand {orf['strand']}, frame {orf['frame']}: {orf['length']} bp")

Sequence Alignment

Pairwise Alignment

Biopython's Bio.Align provides pairwise alignment with configurable scoring:

python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"        # Needleman-Wunsch
# aligner.mode = "local"       # Smith-Waterman

# Default scoring: match=1, mismatch=0, gap=-0.5
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

seq1 = Seq("ATGCGATCG")
seq2 = Seq("ATGCAATCG")

alignments = aligner.align(seq1, seq2)
best = alignments[0]

print(f"Score: {best.score}")
print(best)

BLAST Queries via NCBI

python
from Bio.Blast import NCBIWWW, NCBIXML
import time

# Run BLAST (qblast = query BLAST via NCBI web service)
result_handle = NCBIWWW.qblast(
    "blastn",          # program: blastn, blastp, blastx, tblastn, tblastx
    "nt",              # database: nt, nr, refseq_rna, etc.
    "ATGCGATCGATCG",   # query sequence
    hitlist_size=10
)

# Parse results
blast_records = NCBIXML.parse(result_handle)
blast_record = next(blast_records)

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        print(f"Match: {alignment.title[:60]}")
        print(f"Score: {hsp.score}, E-value: {hsp.expect:.2e}")
        print(f"Identity: {hsp.identities}/{hsp.align_length}")
        break  # just first HSP
BLAST rate limits

NCBI's BLAST web service rate-limits heavy use. For production workflows with many sequences, install BLAST+ locally (blast+ package) and run blastn / blastp as command-line tools. Biopython can parse the XML output with NCBIXML:

python
from Bio.Blast import NCBIXML
with open("blast_output.xml") as f:
    records = list(NCBIXML.parse(f))

Working with Multiple Sequence Alignments

After aligning multiple sequences (with ClustalW, MUSCLE, MAFFT, etc.), Biopython can parse and analyze the result:

python
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment

# Parse a Clustal alignment file
alignment = AlignIO.read("sequences.aln", "clustal")

print(f"Alignment length: {alignment.get_alignment_length()}")
print(f"Number of sequences: {len(alignment)}")

for record in alignment:
    print(f"{record.id}: {record.seq}")

# Get a column from the alignment
col_10 = alignment[:, 10]   # column 10, all sequences
print(f"Column 10: {col_10}")

# Compute conservation per position
def column_identity(alignment, position: int) -> float:
    col = alignment[:, position]
    most_common = Counter(col).most_common(1)[0][1]
    return most_common / len(col)

for i in range(alignment.get_alignment_length()):
    identity = column_identity(alignment, i)
    if identity < 0.8:
        print(f"Position {i}: {identity:.0%} conserved")

Structure Basics with Bio.PDB

For 3D protein structures, Biopython provides a hierarchy: Structure → Model → Chain → Residue → Atom.

python
from Bio.PDB import PDBParser, MMCIFParser, PDBIO
import urllib.request

# Download and parse
urllib.request.urlretrieve("https://files.rcsb.org/download/1JM7.pdb", "1JM7.pdb")
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1JM7", "1JM7.pdb")

# Navigate the hierarchy
model = structure[0]              # first MODEL
chain = model["A"]                # chain A
residue = chain[100]              # residue 100
atom = residue["CA"]              # alpha carbon
print(f"CA coordinates: {atom.get_vector()}")

# All residues in chain A
residues = [r for r in chain.get_residues() if r.get_id()[0] == " "]
print(f"Standard residues in chain A: {len(residues)}")

Computing Distances

python
def get_ca_distance(res1, res2) -> float:
    ca1 = res1["CA"].get_vector()
    ca2 = res2["CA"].get_vector()
    return (ca1 - ca2).norm()

# Distance between residues 10 and 50
res10 = chain[10]
res50 = chain[50]
dist = get_ca_distance(res10, res50)
print(f"Cα distance: {dist:.2f} Å")

Practical Pipeline: Sequence Analysis Workflow

Here's a complete workflow connecting the pieces:

python
from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction
from Bio.Seq import Seq
import time

Entrez.email = "your@email.com"

def analyze_gene(gene_name: str, organism: str = "Homo sapiens") -> dict:
    # Fetch from NCBI
    handle = Entrez.esearch(
        db="nucleotide",
        term=f"{gene_name}[Gene Name] AND {organism}[Organism] AND RefSeq[Filter] AND mRNA[Filter]",
        retmax=1
    )
    result = Entrez.read(handle)
    handle.close()

    if not result["IdList"]:
        return {"error": "Gene not found"}

    time.sleep(0.4)
    handle = Entrez.efetch(db="nucleotide", id=result["IdList"][0], rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    # Find CDS
    cds_features = [f for f in record.features if f.type == "CDS"]
    analysis = {
        "accession": record.id,
        "total_length_bp": len(record.seq),
        "gc_content": gc_fraction(record.seq),
        "cds_count": len(cds_features),
    }

    if cds_features:
        cds_seq = cds_features[0].extract(record.seq)
        protein = cds_seq.translate(to_stop=True)
        analysis["protein_length_aa"] = len(protein)
        analysis["protein_gc"] = gc_fraction(cds_seq)

    return analysis

result = analyze_gene("BRCA1")
for key, value in result.items():
    print(f"{key}: {value}")

Summary

Biopython's key modules:

ModuleUse
Bio.SeqCore sequence object with biological operations
Bio.SeqIORead/write all sequence file formats
Bio.SeqRecordSequences with associated metadata
Bio.EntrezProgrammatic NCBI database access
Bio.AlignPairwise and multiple sequence alignment
Bio.BlastBLAST queries and result parsing
Bio.PDB3D structure parsing and analysis

For almost any biological sequence analysis task in Python, start with Biopython. If you find it doesn't cover your use case, look for dedicated libraries: pyvcf/cyvcf2 for VCF files, pysam for SAM/BAM files, scanpy for single-cell analysis, BioPandas for structure-as-dataframe workflows.