Every human genome differs from the reference genome at approximately 4–5 million positions. Some differences cause disease. Most are harmless. A few confer advantages. Understanding the types of variants, how they arise, and how to classify their effects is foundational to clinical genomics, cancer biology, and evolutionary analysis.
This is not abstract taxonomy. When you run a variant caller on a tumor-normal pair, every output line is a variant described by these categories. When you interpret a clinical report, every entry is classified by this framework. Knowing the types and effects of variants determines what questions you can ask and what tools you use to answer them.
Mutation vs. Variant: The Terminology
These terms are often used interchangeably, but in clinical genomics they have distinct contexts:
Mutation implies a pathological change: a variant known to cause disease. It's a clinical judgment.
Variant is the neutral term for any position that differs from the reference. Variants are further classified by evidence:
- Pathogenic: known to cause disease
- Likely pathogenic: strong evidence for pathogenicity
- Variant of uncertain significance (VUS): insufficient evidence
- Likely benign: probably harmless
- Benign: known to have no disease effect
The distinction matters for communication with clinicians and patients. Everything in a VCF file is a variant; very few are mutations in the clinical sense.
Types of Variants by Size and Mechanism
Single Nucleotide Variants (SNVs)
A single nucleotide change. The most common type of genetic variation. When referring to common SNVs found at >1% frequency in the population, they're called SNPs (single nucleotide polymorphisms). Most disease-associated variants discovered in GWAS are SNPs.
SNVs in coding regions are classified by their effect on the protein:
Synonymous (silent): The nucleotide changes but the codon still encodes the same amino acid (due to codon degeneracy). No protein change. Often assumed to be neutral, but can affect splicing, codon usage, or mRNA stability.
Missense: The nucleotide change causes a different amino acid to be incorporated. Effect on protein function depends on the amino acid properties and the position. A conservative substitution (e.g., Leu → Ile, both hydrophobic) is less likely to be damaging than a radical one (e.g., Arg → Glu, charge reversal).
Nonsense: The nucleotide change creates a premature stop codon (UAA, UAG, UGA). Produces a truncated protein, almost always loss-of-function if the stop codon is early in the coding sequence. The mutant mRNA is often degraded by NMD (nonsense-mediated decay) before much truncated protein is made.
Splice site: Occurs at the consensus splice site sequence (GT at 5' splice site, AG at 3' splice site, or nearby sequences). Disrupts splicing → exon skipping, intron retention, or cryptic splice site activation. Often as damaging as nonsense variants.
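The synonymous/missense/nonsense distinction can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code, a codon given on the coding strand in DNA letters (so the stop codons appear as TAA, TAG, TGA rather than UAA, UAG, UGA), and a hypothetical helper name `classify_coding_snv`. It ignores splice effects entirely.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_coding_snv(ref_codon: str, pos_in_codon: int, alt_base: str) -> str:
    """Classify a single-base change inside one codon (hypothetical helper).

    pos_in_codon is 0, 1, or 2; the codon is on the coding strand, in DNA letters.
    """
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:pos_in_codon] + alt_base.upper() + ref_codon[pos_in_codon + 1:]
    if alt_codon == ref_codon:
        raise ValueError("alt base equals reference base")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-lost"
    return "missense"


print(classify_coding_snv("CTG", 2, "A"))  # Leu -> Leu
print(classify_coding_snv("GTG", 1, "A"))  # Val -> Glu
print(classify_coding_snv("CGA", 0, "T"))  # Arg -> stop
```

Annotation tools do the same lookup per transcript, after working out which codon and strand a genomic position falls in; that bookkeeping is what this sketch leaves out.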
Insertions and Deletions (Indels)
In-frame indels: Length divisible by 3 → inserts or deletes amino acids without disrupting the reading frame. Typically less severe than frameshift indels. May delete a critical residue or domain.
Frameshift indels: Length not divisible by 3 → shifts the reading frame of all downstream codons. Produces a completely different amino acid sequence after the indel, usually followed quickly by a premature stop codon. Almost always loss-of-function.
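The divisible-by-3 rule is easy to express directly. The sketch below assumes VCF-style REF and ALT allele strings for an indel that lies entirely within coding sequence; the function name `classify_indel` is hypothetical.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Return 'in-frame' or 'frameshift' from the net length change of a coding indel."""
    delta = len(alt) - len(ref)  # positive = insertion, negative = deletion
    if delta == 0:
        raise ValueError("REF and ALT have the same length; not an indel")
    return "in-frame" if delta % 3 == 0 else "frameshift"


print(classify_indel("A", "ATTT"))  # 3-base insertion
print(classify_indel("ACT", "A"))   # 2-base deletion
```

An indel that spans an exon boundary or a splice site needs transcript-aware handling, which this length check does not attempt.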
Structural Variants (SVs)
Large-scale rearrangements affecting hundreds of base pairs to megabases:
- Copy Number Variants (CNVs): duplications or deletions of chromosomal segments. Gene amplification (extra copies → overexpression) and gene deletion (fewer copies → reduced expression or loss-of-function) are both common in cancer.
- Inversions: a segment is reversed in orientation
- Translocations: a segment moves to a different chromosome (or a different position on the same chromosome). Oncogenic translocations create fusion genes: BCR-ABL in CML (t(9;22)), EML4-ALK in lung cancer, etc.
- Mobile element insertions: retrotransposons or other mobile elements inserting into genes
Tandem Repeats
Short sequence motifs repeated in tandem. Microsatellites (2–6 bp repeats) are highly polymorphic and prone to replication slippage errors. Trinucleotide repeat expansion is the mechanism of Huntington's disease (CAG expansion in HTT), Fragile X (CGG expansion in FMR1), and other neurodegenerative diseases.
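Counting repeat units is a simple string problem. The sketch below finds the longest uninterrupted run of a motif with a regular expression; the function name and the toy sequence are made up for illustration, and real repeat genotyping from short reads needs specialized tools because reads often do not span the whole expansion.

```python
import re


def longest_repeat_run(seq: str, motif: str) -> int:
    """Return the number of motif copies in the longest uninterrupted tandem run."""
    pattern = re.compile(f"(?:{re.escape(motif.upper())})+")
    runs = (m.group(0) for m in pattern.finditer(seq.upper()))
    return max((len(run) // len(motif) for run in runs), default=0)


toy_sequence = "GG" + "CAG" * 5 + "TT" + "CAG" * 2  # invented example, not a real HTT allele
print(longest_repeat_run(toy_sequence, "CAG"))
```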
Mutation Mechanisms
Replication Errors
DNA polymerase occasionally incorporates the wrong nucleotide (proofreading reduces the error rate to roughly 1/10⁷ per base per replication). Mismatch repair then catches most remaining errors, bringing the final rate to roughly 1/10⁹ to 1/10¹⁰. The few that escape become permanent mutations.
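A back-of-envelope calculation shows what a per-base error rate means for a whole genome. The genome size (about 3.1 billion bases per haploid copy) and the two rates tried below are illustrative assumptions, not measurements for any particular cell type.

```python
HAPLOID_GENOME_BASES = 3.1e9  # assumed approximate human haploid genome size


def expected_new_errors(per_base_error_rate: float, ploidy: int = 2) -> float:
    """Expected uncorrected errors per genome replication: rate x bases copied."""
    return per_base_error_rate * HAPLOID_GENOME_BASES * ploidy


for rate in (1e-9, 1e-10):
    print(f"rate {rate:.0e}: ~{expected_new_errors(rate):.1f} new errors per division")
```

Even a very low per-base rate yields a handful of new changes each time a cell divides, which is why somatic mutations accumulate with age.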
Spontaneous Chemical Damage
- Deamination: cytosine spontaneously loses its amino group → uracil (read as thymine). Creates C→T transitions, most commonly at CpG dinucleotides, where methylated cytosine deaminates directly to thymine. This is the most common endogenous mutational mechanism.
- Depurination: purine bases are spontaneously cleaved from the sugar-phosphate backbone, creating abasic sites.
- Oxidation: reactive oxygen species (ROS) generate 8-oxoguanine, which can mispair with adenine → G:C→T:A transversions.
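The terms transition and transversion recur throughout this section. A transition swaps a purine for a purine or a pyrimidine for a pyrimidine; a transversion swaps one class for the other. A minimal classifier, with a hypothetical function name:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(substitution_class("C", "T"))  # deamination product
print(substitution_class("G", "T"))  # 8-oxoguanine mispairing product
```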
Environmental Mutagens
- UV radiation: creates cyclobutane pyrimidine dimers and 6-4 photoproducts at adjacent pyrimidines → C→T and CC→TT transitions. Characteristic signature in skin cancers.
- Cigarette smoke: polycyclic aromatic hydrocarbons and other carcinogens create bulky DNA adducts → G→T transversions. Characteristic signature in lung cancers from smokers.
- Alkylating agents: attach methyl or ethyl groups to bases → errors during replication.
- Ionizing radiation: double-strand breaks → large deletions, translocations.
- APOBEC cytidine deaminases: cellular enzymes normally involved in innate immunity; when dysregulated, cause extensive C→T and C→G mutations at TC contexts. Major mutational process in many cancer types.
Mutational Signatures
The pattern of mutations in a genome reflects the processes that caused them. The COSMIC Mutational Signatures database (v3.4 as of 2024) catalogs 78 validated single base substitution signatures, plus others for small indels and SVs.
Each signature is characterized by the relative rates of all 96 mutation types (6 substitution types × 16 trinucleotide contexts). Signature 4 (smoking) is dominated by C→A substitutions (G→T on the complementary strand). Signature 7a/7b (UV) is dominated by C→T at dipyrimidine contexts. Signature 3 (homologous recombination deficiency, found in BRCA1/2-mutant tumors) has a relatively flat substitution profile and typically co-occurs with deletions flanked by microhomology.
Decomposing a tumor's mutations into mutational signatures reveals the etiology (what caused the mutations) and can have clinical implications (BRCA1/2-like signature → may respond to PARP inhibitors).
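The count of 96 follows from a convention: every substitution is reported from the strand where the reference base is a pyrimidine (C or T), which leaves 6 substitution types, each with 4 × 4 flanking-base combinations. The sketch below enumerates the channels and normalizes a mutation to that convention. The label format `T[C>A]G` is the common way to write these; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# 6 substitution types x 16 flanking contexts = 96 channels
ALL_CHANNELS = [
    f"{left}[{sub}]{right}"
    for sub in SUBSTITUTIONS
    for left, right in product("ACGT", repeat=2)
]


def complement(base: str) -> str:
    return base.translate(COMPLEMENT)


def channel(five_prime: str, ref: str, alt: str, three_prime: str) -> str:
    """Return the pyrimidine-centered trinucleotide label for one substitution."""
    if ref in "AG":
        # Purine reference: describe the same event on the opposite strand.
        # Flanks swap sides and every base is complemented.
        five_prime, three_prime = complement(three_prime), complement(five_prime)
        ref, alt = complement(ref), complement(alt)
    return f"{five_prime}[{ref}>{alt}]{three_prime}"


print(len(ALL_CHANNELS))
print(channel("C", "G", "T", "A"))  # a G>T event, reported as C>A on the other strand
```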
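A full decomposition fits a tumor's mutation counts as a non-negative mixture of reference signatures. As a much simpler stand-in, the sketch below only ranks signatures by cosine similarity to the tumor's profile. The two "signatures" and the tumor counts are invented toy numbers collapsed to 6 substitution types rather than the real 96 channels; they are not COSMIC values.

```python
import math

# Channel order: C>A, C>G, C>T, T>A, T>C, T>G (toy 6-channel profiles, made up)
TOY_SIGNATURES = {
    "smoking-like": [0.60, 0.10, 0.10, 0.10, 0.05, 0.05],
    "UV-like": [0.05, 0.02, 0.85, 0.03, 0.03, 0.02],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two profiles; 1.0 means identical shape."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


tumor_counts = [120, 15, 30, 20, 10, 5]  # invented mutation counts per channel
scores = {name: cosine_similarity(tumor_counts, sig) for name, sig in TOY_SIGNATURES.items()}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 3))
```

Cosine similarity compares shape only, so it works on raw counts without normalizing them first. It cannot separate a tumor shaped by several processes at once, which is the job of the mixture-fitting methods.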
Variant Classification Frameworks
ACMG/AMP Guidelines
The standard for germline variant classification (used in clinical genetics labs) is the ACMG/AMP 2015 guidelines. They use a combination of evidence criteria:
- Population frequency: Is the variant common in the general population? Common = less likely pathogenic.
- Computational predictions: Do tools (SIFT, PolyPhen-2, AlphaMissense) predict it's damaging?
- Functional studies: Does it disrupt protein function in experimental assays?
- Segregation: Does the variant co-segregate with disease in affected families?
- Known pathogenic variants: Is this the same or similar to a previously validated pathogenic variant?
Evidence is combined to reach one of 5 classifications (pathogenic, likely pathogenic, VUS, likely benign, benign).
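The combining step can be sketched as a rule table over counts of evidence at each strength. The strength codes used here (PVS very strong, PS strong, PM moderate, PP supporting for pathogenic evidence; BA stand-alone, BS strong, BP supporting for benign evidence) come from the guideline rather than from the text above, and the rules below are a simplified rendering of its combining table. Treat this as an illustration and check the published guideline before relying on any specific rule.

```python
def combine_acmg(pvs=0, ps=0, pm=0, pp=0, ba=0, bs=0, bp=0) -> str:
    """Combine counts of met criteria into one of five tiers (simplified sketch)."""
    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp == 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs == 1 and bp == 1) or bp >= 2

    path_call = "pathogenic" if pathogenic else "likely pathogenic" if likely_pathogenic else None
    benign_call = "benign" if benign else "likely benign" if likely_benign else None
    if path_call and benign_call:
        return "VUS"  # contradictory evidence stays uncertain
    return path_call or benign_call or "VUS"


print(combine_acmg(pvs=1, ps=1))
print(combine_acmg(pp=1))
```

Note that the default outcome is VUS: a variant lands in an actionable tier only when enough evidence accumulates in one direction.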
OncoKB and Clinical Oncogenomics
For somatic (cancer) variants, separate classification systems apply. OncoKB classifies variants by their clinical actionability: whether there's an approved drug, a clinical trial, or just biological evidence.
A BRAF V600E mutation in melanoma is Level 1 (FDA-approved therapy: vemurafenib, dabrafenib). The same mutation in cholangiocarcinoma might be Level 3A (evidence from clinical trials, not approved). Different cancers, same mutation, different clinical implications.
The VCF File Format
Variant data is stored in VCF (Variant Call Format) files. This is the universal format for genomic variant data.
##fileformat=VCFv4.2
##reference=GRCh38
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic variant">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr17 7674220 rs28934578 G T . PASS AF=0.001 GT:DP 0/1:45
chr7 140753336 . A T 100 PASS SOMATIC GT:DP 0/1:120
Key fields:
- CHROM/POS: chromosome and 1-based position
- REF/ALT: reference and alternate alleles
- QUAL: quality score
- FILTER: PASS or reason for filtering
- INFO: semicolon-delimited annotations (allele frequency, functional effect, etc.)
- FORMAT/SAMPLE: per-sample data
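To make the field layout concrete, here is a minimal standard-library parser for one data line. It is a sketch under simplifying assumptions: columns are tab-separated (as the VCF specification requires, even when a listing shows spaces), values are left as strings apart from POS, and the function names are hypothetical. For real files, established libraries such as pysam or cyvcf2 are the usual choice.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_info(info: str) -> dict:
    """Split 'KEY=VALUE;FLAG' into a dict; flags without '=' map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # tolerate empty entries and the missing-value marker
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_record(line: str, sample_names: list[str]) -> dict:
    """Parse one tab-separated VCF data line into a nested dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # multiple ALT alleles are comma-separated
    record["INFO"] = parse_info(record["INFO"])
    format_keys = record.pop("FORMAT").split(":")
    record["SAMPLES"] = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return record


line = "chr17\t7674220\trs28934578\tG\tT\t.\tPASS\tAF=0.001\tGT:DP\t0/1:45"
record = parse_record(line, ["SAMPLE1"])
print(record["CHROM"], record["POS"], record["INFO"], record["SAMPLES"]["SAMPLE1"])
```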
The GT (genotype) field encodes the genotype: 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous alternate. Somatic variants in tumors are often 0/1 with a variant allele fraction (VAF) far below 50% due to tumor heterogeneity and normal cell contamination.
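A small calculation shows why normal-cell contamination pulls the allele fraction below 50%. The sketch assumes a clonal variant (present in every tumor cell), a known tumor purity, and known copy numbers; the function names and the read counts are hypothetical.

```python
def observed_vaf(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads supporting the alternate allele."""
    return alt_reads / (ref_reads + alt_reads)


def expected_vaf(purity: float, mutated_copies: int = 1,
                 tumor_copy_number: int = 2, normal_copy_number: int = 2) -> float:
    """Expected allele fraction of a clonal somatic variant in a mixed sample.

    purity is the fraction of cells in the sample that are tumor cells.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    total_copies = purity * tumor_copy_number + (1 - purity) * normal_copy_number
    return purity * mutated_copies / total_copies


# Heterozygous variant in a diploid region of a sample that is 40% tumor cells
print(expected_vaf(0.4))
print(observed_vaf(ref_reads=96, alt_reads=24))
```

For the diploid heterozygous case the formula reduces to purity / 2. A subclonal variant, present in only some tumor cells, falls lower still, which is the tumor heterogeneity effect.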
VCF annotation tools (ANNOVAR, VEP, SnpEff) add predicted functional effects to the INFO field.
Key Population Databases
dbSNP: NCBI's database of known variants. Assigns rs numbers to common and clinically observed variants. A variant in dbSNP is not necessarily benign; it just means it's been observed before.
gnomAD (Genome Aggregation Database): ~730,000 exomes and ~76,000 whole genomes from diverse populations (v4). The most important population frequency database. A variant observed in thousands of gnomAD individuals is almost certainly not a high-penetrance disease variant.
ClinVar: NCBI's database of variant-disease associations. Aggregates classifications from clinical labs, researchers, and curated sources. The primary reference for clinical variant interpretation.
COSMIC (Catalogue Of Somatic Mutations In Cancer): Somatic mutation database built from tumor genomes. Contains >8 million unique mutations from >40,000 tumor samples. Essential for identifying oncogenic mutations and mutational signatures.
Understanding these databases — their scope, their limitations, and how to query them — is the foundation of clinical genomics and cancer bioinformatics. In Chapter 6.5 we'll work directly with VCF files in Python to perform this analysis computationally.
Mutations are permanent changes to the DNA sequence — substitutions, insertions, or deletions of one or more bases. Most mutations are neutral or repaired before expression; a small fraction alter protein function. Somatic mutations affect only the individual; germline mutations are heritable.
A mutation is a bit flip in the source code. A synonymous substitution is a no-op (same amino acid, different codon — like renaming a variable in a compiled binary). A missense mutation is a type error: different amino acid, potentially broken function. A frameshift (insertion/deletion) is corruption of the entire downstream sequence — every codon after the edit point is wrong. Nonsense mutations are null pointer dereferences: a premature stop codon truncates the protein.