
Epigenetics

Epigenetics controls gene expression without changing DNA sequence — through heritable chemical marks on DNA and histones that determine chromatin accessibility.


Identical twins share the same DNA sequence. Yet they can develop different diseases, respond differently to drugs, and show measurable physiological differences as they age. They are more concordant for disease than fraternal twins, but less than 100% concordant, which means something beyond sequence shapes phenotype.

That something is epigenetics: chemical modifications to DNA and the proteins around it that change gene expression without altering the underlying sequence. Epigenetic marks are heritable through cell division, reversible in response to environment, and increasingly recognized as key drivers of development, aging, and disease.

The Chromatin Layer

To understand epigenetics, you first need to understand chromatin — the complex of DNA, histone proteins, and associated molecules that makes up chromosomes.

As described in the DNA chapter, DNA is wrapped around histones: small, positively charged proteins that compact the negatively charged DNA. The basic unit is the nucleosome: about 147 bp of DNA wrapped ~1.65 times around an octamer of four histone types (H2A, H2B, H3, H4, two copies each).

The critical insight: whether DNA is accessible to transcription factors and RNA polymerase depends on how tightly it's packaged.

Two states of chromatin:

  • Euchromatin: loosely packed, accessible, actively transcribed regions. Appears lighter in microscopy.
  • Heterochromatin: densely packed, inaccessible, transcriptionally silent regions. Appears darker. Includes constitutive heterochromatin (centromeres, telomeres — permanently silenced) and facultative heterochromatin (genes silenced in a given cell type but active in others).
Chromatin state as runtime configuration

Think of chromatin state as access control. The same file (gene) exists in every cell, but in some cells it's chmod 000 (heterochromatin — no access), in others chmod 644 (euchromatin — readable). The code hasn't changed. The permissions have.

Unlike filesystem permissions, chromatin states can be dynamically changed in response to signals — environmental cues can "unlock" or "lock" genes — and importantly, daughter cells inherit these states through cell division.

DNA Methylation: The Primary Epigenetic Mark

DNA methylation is the addition of a methyl group (–CH₃) to carbon 5 of the cytosine ring (producing 5-methylcytosine), almost exclusively at CpG dinucleotides (cytosine followed by guanine) in mammals.

Distribution

In mammals, most of the genome is heavily methylated: about 70–80% of CpGs are methylated in typical somatic cells. At repetitive elements (transposons, satellite DNA), methylation silences potentially harmful mobile elements. The main exceptions are CpG islands, which are usually unmethylated.

CpG islands (regions of high CpG density, typically 200–3,000 bp long) are found at ~60% of gene promoters and are generally unmethylated when the gene is active. When a promoter CpG island becomes methylated, the associated gene is typically silenced.
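To make "high CpG density" concrete, here is a minimal sketch that scores a sequence for CpG-island-like composition. The thresholds (length of at least 200 bp, GC fraction of at least 0.5, observed/expected CpG ratio of at least 0.6) are a commonly used heuristic and an assumption here, not something defined in this chapter; the function names are illustrative.

```python
def cpg_stats(seq):
    """Return length, GC fraction, and observed/expected CpG ratio for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently along the sequence
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return {"length": n, "gc_fraction": gc_fraction, "obs_exp": obs_exp}


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply assumed heuristic thresholds; real island callers scan sliding windows."""
    s = cpg_stats(seq)
    return (
        s["length"] >= min_len
        and s["gc_fraction"] >= min_gc
        and s["obs_exp"] >= min_obs_exp
    )


if __name__ == "__main__":
    print(cpg_stats("CG" * 100))
    print(looks_like_cpg_island("AT" * 100))
```

The observed/expected ratio matters because a sequence can be GC-rich without being CpG-rich; the ratio asks whether C is followed by G more often than base composition alone would predict.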

Mechanism of Silencing

Methylated CpGs are recognized by methyl-CpG binding proteins (MBDs, including MeCP2), which recruit histone deacetylases (HDACs) and other repressive machinery. This compacts the chromatin and blocks transcription factor (TF) access.

Additionally, methylated cytosine physically impairs TF binding — some TFs require unmethylated CpGs at their binding sites.

Writers, Readers, and Erasers

Every epigenetic mark has enzymes that write, read, and erase it:

  • Writers (DNMTs): DNMT1 maintains methylation during DNA replication (copies the methylation pattern to the newly synthesized strand); DNMT3A/3B establish new methylation patterns (de novo methylation)
  • Readers: MBD proteins, Kaiso
  • Erasers: TET enzymes oxidize 5-methylcytosine to 5-hydroxymethylcytosine and further intermediates that are removed by base excision repair, resulting in demethylation
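The maintenance role of DNMT1 is easiest to see as a copy operation. The sketch below is a toy model under stated assumptions: methylation is tracked as a set of CpG site indices per strand, replication leaves the new strand blank, and maintenance copies marks only at CpG sites that are methylated on the template. The function names are hypothetical, and real maintenance is not perfectly faithful.

```python
def find_cpg_sites(seq):
    """Return the index of the C in every CpG dinucleotide of the sequence."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def replicate(template_marks):
    """Semi-conservative replication: template keeps its marks, new strand has none."""
    return set(template_marks), set()


def dnmt1_maintenance(template_marks, new_marks, cpg_sites):
    """Copy methylation onto the new strand at CpG sites methylated on the template."""
    copied = {site for site in cpg_sites if site in template_marks}
    return set(new_marks) | copied


if __name__ == "__main__":
    seq = "ACGTTCGACGA"
    sites = find_cpg_sites(seq)        # CpG sites in this toy sequence
    parent_marks = {1, 8}              # two of the three CpGs are methylated
    template, daughter = replicate(parent_marks)
    daughter = dnmt1_maintenance(template, daughter, sites)
    print(sites, daughter)
```

De novo methylation (DNMT3A/3B) would correspond to adding sites that are not on the template, and TET-driven erasure to removing them.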

Cancer and DNA Methylation

Cancer shows dramatic methylation changes:

  • Global hypomethylation: repetitive elements become unmethylated, potentially reactivating transposons and destabilizing the genome
  • Promoter hypermethylation: tumor suppressor genes get silenced by CpG island methylation. This is as effective as mutation for inactivating a gene and contributes to the "two-hit" mechanism of tumor suppressor loss.

Epigenetic clocks (like Horvath's clock) use DNA methylation levels at specific CpGs to estimate age, typically to within a few years of chronological age. When the methylation-based estimate runs ahead of chronological age (accelerated epigenetic aging), this is associated with higher disease risk and mortality.
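The core idea of a methylation clock can be sketched as a weighted sum of methylation levels. Everything numeric below is invented for illustration: the CpG names, weights, and intercept are hypothetical, and published clocks use hundreds of CpGs with fitted coefficients and additional transformations that this sketch omits.

```python
# Hypothetical coefficients: NOT taken from any published clock.
HYPOTHETICAL_WEIGHTS = {"cpg_A": 40.0, "cpg_B": -25.0, "cpg_C": 15.0}
HYPOTHETICAL_INTERCEPT = 30.0


def epigenetic_age(betas):
    """Estimate age as intercept + sum(weight * methylation fraction) over clock CpGs."""
    return HYPOTHETICAL_INTERCEPT + sum(
        weight * betas[cpg] for cpg, weight in HYPOTHETICAL_WEIGHTS.items()
    )


def age_acceleration(betas, chronological_age):
    """Positive values mean the methylation estimate is older than the calendar age."""
    return epigenetic_age(betas) - chronological_age


if __name__ == "__main__":
    sample = {"cpg_A": 0.5, "cpg_B": 0.2, "cpg_C": 0.4}  # methylation fractions, 0 to 1
    print(epigenetic_age(sample), age_acceleration(sample, chronological_age=45))
```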

Bisulfite sequencing: measuring DNA methylation

The standard method for measuring DNA methylation genome-wide is bisulfite sequencing. Treating DNA with sodium bisulfite converts unmethylated cytosines to uracil (which reads as thymine after PCR), while methylated cytosines are protected. After sequencing, a C in a read at a reference cytosine indicates a methylated site; a T at a reference cytosine indicates an unmethylated one.
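The conversion logic can be simulated in a few lines. This is a minimal sketch assuming top-strand reads that are already aligned without gaps or mismatches; real pipelines also handle the reverse strand, incomplete conversion, and sequencing errors. The function names are illustrative.

```python
def bisulfite_convert(seq, methylated_positions):
    """Unmethylated C reads as T after conversion and PCR; methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, aligned_reads):
    """Return {position: methylated fraction} for each covered CpG cytosine.

    aligned_reads is a list of (start, read_sequence) pairs on the top strand.
    """
    counts = {}
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if reference[pos:pos + 2] != "CG":
                continue  # only score cytosines in CpG context
            meth, total = counts.get(pos, (0, 0))
            if base == "C":      # protected, so methylated
                counts[pos] = (meth + 1, total + 1)
            elif base == "T":    # converted, so unmethylated
                counts[pos] = (meth, total + 1)
    return {pos: meth / total for pos, (meth, total) in counts.items()}


if __name__ == "__main__":
    reference = "ACGTCGAC"
    read_1 = bisulfite_convert(reference, {1})    # molecule methylated at position 1
    read_2 = bisulfite_convert(reference, set())  # fully unmethylated molecule
    print(read_1, read_2)
    print(call_methylation(reference, [(0, read_1), (0, read_2)]))
```

Because each read comes from one DNA molecule, the per-site fraction across reads estimates the share of cells (or alleles) carrying the mark.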

Whole-genome bisulfite sequencing (WGBS) provides single-CpG resolution. Reduced representation bisulfite sequencing (RRBS) is cheaper but covers mainly CpG-dense regions. Methylation arrays (Illumina 450k and EPIC) measure roughly 450,000 and 850,000 CpG sites, respectively, and are standard for clinical and large-scale studies.

Histone Modifications: The Histone Code

Histone proteins have "tails" — unstructured N-terminal extensions that protrude from the nucleosome core. These tails are extensively modified by enzymes that add or remove chemical groups:

| Modification | Mark | Effect | Typical location |
|---|---|---|---|
| Acetylation | H3K27ac, H3K9ac | Active; neutralizes positive charge, loosens chromatin | Active enhancers, active promoters |
| Methylation (mono-, di-, or tri-methyl) | H3K4me3 | Active promoters | Active transcription start sites (TSS) |
| | H3K4me1 | Enhancers (active or poised) | Enhancers |
| | H3K27me3 | Repressive | Polycomb-silenced regions |
| | H3K9me3 | Repressive | Constitutive heterochromatin |
| | H3K36me3 | Active transcription elongation | Gene bodies |
| Ubiquitination | H2AK119ub1 | Repressive (Polycomb) | Polycomb-silenced genes |
| Phosphorylation | H3S10ph | Active; chromosome condensation | Mitotic chromosomes, active genes |

This combinatorial reading of marks, known as the histone code hypothesis, holds that multiple marks together specify chromatin state more precisely than any single mark alone.
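One way to picture the combinatorial idea is a rule table that maps sets of marks to a state label. The sketch below is a deliberately simplified, hand-written classifier based on the marks listed above; the rule order (repressive marks checked first) is an arbitrary assumption, and in practice chromatin states are usually learned from data rather than hard-coded.

```python
def classify_chromatin_state(marks):
    """Map a collection of histone marks at a region to a coarse state label."""
    marks = set(marks)
    if "H3K9me3" in marks:
        return "constitutive heterochromatin"
    if "H3K27me3" in marks or "H2AK119ub1" in marks:
        return "Polycomb-repressed"
    if "H3K4me3" in marks:
        return "active promoter"
    if "H3K4me1" in marks:
        # H3K4me1 alone does not say whether the enhancer is in use;
        # the extra acetyl mark is what distinguishes the two cases here.
        return "active enhancer" if "H3K27ac" in marks else "poised enhancer"
    if "H3K36me3" in marks:
        return "transcribed gene body"
    return "unclassified"


if __name__ == "__main__":
    for region_marks in (
        {"H3K4me1", "H3K27ac"},
        {"H3K4me1"},
        {"H3K4me3", "H3K27ac"},
        {"H3K27me3"},
    ):
        print(sorted(region_marks), "->", classify_chromatin_state(region_marks))
```

The enhancer case shows why a single mark is not enough: the same H3K4me1 signal gets a different label depending on what accompanies it.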

Writers, Readers, and Erasers (Histones)

  • Histone acetyltransferases (HATs): write acetyl marks. CBP/p300 write H3K27ac at active enhancers.
  • Histone deacetylases (HDACs): erase acetyl marks. HDAC inhibitors (vorinostat, romidepsin) are approved cancer drugs.
  • Histone methyltransferases (HMTs): write methyl marks. EZH2 writes H3K27me3 (repressive); DOT1L writes H3K79me (active).
  • Histone demethylases (KDMs): erase methyl marks. LSD1/KDM1A demethylates H3K4me1/2 and H3K9me1/2.
  • Bromodomains: protein domains that "read" acetyl marks. BET proteins (BRD2, BRD3, BRD4) bind acetylated histones at enhancers and promoters; BET inhibitors, developed from tool compounds such as JQ1 and I-BET, have entered clinical trials for cancer.

Chromatin Accessibility: The Open/Closed Switch

Measuring which regions of the genome are accessible — i.e., not occluded by nucleosomes — is done with ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing). Transposase Tn5 preferentially inserts sequencing adapters into open chromatin; sequencing reveals which regions are accessible.

ATAC-seq identifies:

  • Active promoters
  • Active enhancers
  • Transcription factor binding sites (TF footprinting)
  • Cell-type-specific regulatory elements

Combined with gene expression data, ATAC-seq helps identify which regulatory elements drive gene expression changes in a given condition.

Polycomb and Trithorax: The Two-State System

Two protein complexes maintain gene silencing and activation states through development:

Polycomb Repressive Complexes (PRC1, PRC2) deposit and read repressive marks (H3K27me3, H2AK119ub1). They maintain gene silencing through cell division — essential for stable cell identity. PRC2's catalytic subunit EZH2 is frequently mutated or overexpressed in cancer.

Trithorax/COMPASS complexes maintain active states through H3K4 methylation and H3K36 methylation. They antagonize Polycomb and keep developmental genes active.

The interplay between Polycomb and Trithorax creates a bistable system: genes are either "on" (Trithorax-dominated) or "off" (Polycomb-dominated), with sharp transitions. This bistability contributes to the robustness of cell identity: once a cell type is established, it maintains its gene expression program stably through many cell divisions.
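Bistability from mutual antagonism can be illustrated with a generic two-variable toggle model. This is a toy, not a measured model of Polycomb or Trithorax: the equations, the parameter values (alpha, n), and the variable names are all assumptions chosen only to show that two opposing activities can settle into one of two stable states depending on where they start.

```python
def simulate_toggle(activating0, repressive0, alpha=4.0, n=2, dt=0.01, steps=5000):
    """Integrate a mutual-repression toggle with forward Euler steps.

    d(activating)/dt = alpha / (1 + repressive**n) - activating
    d(repressive)/dt = alpha / (1 + activating**n) - repressive
    """
    a, r = activating0, repressive0
    for _ in range(steps):
        da = alpha / (1.0 + r ** n) - a
        dr = alpha / (1.0 + a ** n) - r
        a, r = a + dt * da, r + dt * dr
    return a, r


if __name__ == "__main__":
    # A small initial bias decides which stable state the system ends in.
    print("start activating-biased:", simulate_toggle(1.5, 1.0))
    print("start repressive-biased:", simulate_toggle(1.0, 1.5))
```

Both runs use the same rules and parameters; only the starting bias differs, yet the end states are opposite and each is self-sustaining. That is the sense in which a cell can "remember" a state.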

Imprinting: Parent-of-Origin Epigenetics

Some genes are expressed exclusively from either the maternal or paternal allele — determined by epigenetic marks established in the germline. This is genomic imprinting, and about 100 human genes are imprinted.

Imprinted genes are regulated by imprinting control regions (ICRs) — differentially methylated CpG regions that carry methylation on only one parental allele. This allele-specific methylation is established during gametogenesis and maintained throughout development.

Disorders of imprinting are medically important:

  • Prader-Willi syndrome: loss of paternal 15q11-q13 (or maternal uniparental disomy of chr15)
  • Angelman syndrome: loss of maternal 15q11-q13 (or paternal uniparental disomy of chr15)

The same chromosomal deletion produces different syndromes depending on which parent it came from, because different genes in the region are imprinted in opposite directions.
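The parent-of-origin logic for 15q11-q13 can be written as a small decision function. This is an illustrative sketch only: it counts intact copies of the region by parent of origin and ignores other causes such as imprinting-center defects or single-gene mutations. The function name is hypothetical.

```python
def chr15_imprinting_outcome(maternal_copies, paternal_copies):
    """Classify 15q11-q13 status from the number of intact copies per parent.

    A deletion removes one parent's copy; uniparental disomy gives two copies
    from one parent and none from the other.
    """
    if maternal_copies == 0 and paternal_copies == 0:
        return "no intact copy"
    if paternal_copies == 0:
        return "Prader-Willi syndrome"  # paternally expressed genes are missing
    if maternal_copies == 0:
        return "Angelman syndrome"      # maternally expressed gene(s) are missing
    return "both parental contributions present"


if __name__ == "__main__":
    print(chr15_imprinting_outcome(1, 0))  # paternal deletion
    print(chr15_imprinting_outcome(2, 0))  # maternal uniparental disomy
    print(chr15_imprinting_outcome(0, 1))  # maternal deletion
    print(chr15_imprinting_outcome(0, 2))  # paternal uniparental disomy
```

Note that total copy number is not what matters: two maternal copies still give Prader-Willi syndrome, because the paternal-only genes have no active allele.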

Epigenetics in Bioinformatics Practice

Common epigenomics data types and tools:

| Data type | What it measures | Analysis tools |
|---|---|---|
| WGBS / RRBS | DNA methylation at CpGs | Bismark, BSMAP, methylKit |
| ChIP-seq | Histone marks, TF binding | MACS2, deepTools, HOMER |
| ATAC-seq | Chromatin accessibility | MACS2, HINT-ATAC, chromVAR |
| Hi-C / 4C | 3D genome organization | HiC-Pro, cooltools, Juicer |
| CUT&RUN | Histone marks (low cell input) | Similar to ChIP-seq |

A typical epigenomics analysis project involves: aligning reads to the genome, calling peaks (enriched regions), annotating peaks relative to genes, and integrating with expression data to understand regulatory relationships.
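The annotation step, assigning each peak to a nearby gene, can be sketched with a sorted lookup. This is a minimal illustration assuming peaks and transcription start sites (TSS) are already in memory as tuples; the gene names and coordinates are made up, strand is ignored, and "nearest TSS" is only one of several assignment rules used in practice.

```python
from bisect import bisect_left
from collections import defaultdict


def annotate_peaks_with_nearest_tss(peaks, tss_list):
    """Assign each peak to the gene whose TSS is closest to the peak midpoint.

    peaks:    list of (chrom, start, end)
    tss_list: list of (chrom, position, gene_name)
    Returns a list of (chrom, start, end, gene_name, tss_minus_midpoint).
    """
    by_chrom = defaultdict(list)
    for chrom, pos, gene in tss_list:
        by_chrom[chrom].append((pos, gene))
    for sites in by_chrom.values():
        sites.sort()

    annotated = []
    for chrom, start, end in peaks:
        sites = by_chrom.get(chrom)
        if not sites:
            annotated.append((chrom, start, end, None, None))
            continue
        mid = (start + end) // 2
        i = bisect_left(sites, (mid,))
        # Only the TSS just before and just after the midpoint can be nearest.
        candidates = sites[max(i - 1, 0):i + 1]
        pos, gene = min(candidates, key=lambda site: abs(site[0] - mid))
        annotated.append((chrom, start, end, gene, pos - mid))
    return annotated


if __name__ == "__main__":
    tss = [("chr1", 1000, "GENE_A"), ("chr1", 5000, "GENE_B"), ("chr2", 200, "GENE_C")]
    peaks = [("chr1", 900, 1100), ("chr1", 4000, 4400), ("chr3", 10, 20)]
    for row in annotate_peaks_with_nearest_tss(peaks, tss):
        print(row)
```

The resulting peak-to-gene table is what gets joined with expression data in the integration step; distal enhancers often regulate genes other than the nearest one, so treat this assignment as a first approximation.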

The key challenge is integration: a single cell type might have ATAC-seq, ChIP-seq (multiple marks), WGBS, and RNA-seq. Making sense of all four simultaneously — the multi-omics integration problem — is one of the central challenges of current bioinformatics.