In the previous chapter, we established that the genome is a ~3 billion character string. But a string alone is not a program. A program needs structure: defined units with names, inputs, outputs, and rules for when they run. In biology, that structure is provided by genes.
A gene is the fundamental unit of biological information — a stretch of DNA with enough regulatory context to be selectively read, converted into RNA, and (usually) translated into protein. Understanding gene structure is essential because every bioinformatics tool that works with genomes — variant callers, RNA-seq pipelines, annotation software — reasons about gene coordinates, exon boundaries, splice sites, and regulatory regions.
What a Gene Actually Is
Here's a definition that will hold up better than the casual version: a gene is a heritable unit of sequence that can be transcribed into RNA, where that transcription is controlled by associated regulatory elements.
Note what's missing from that definition: "encodes a protein." About 1.5% of the human genome encodes protein, but by some estimates as much as ~80% is transcribed into RNA in at least one cell type. Many of the resulting non-coding RNAs have important regulatory functions. A gene that produces only non-coding RNA is still a gene.
The definition of "gene" has been revised multiple times since the term was coined in 1909. Early genetics defined genes by their phenotypic effects. Molecular biology redefined them as DNA sequences encoding proteins. Genomics forced another revision: some genes encode only RNA, some produce multiple proteins via alternative splicing, and some overlap with each other on opposite strands. The operational definition we use here — a transcription unit with regulatory context — reflects the current working consensus.
Gene Structure: The Anatomy of a Function
A protein-coding gene in a eukaryote has several components, each with a distinct role:
The Promoter
The promoter is a regulatory sequence upstream of the gene (typically within ~2000 bp of the transcription start site) where transcription machinery assembles. It contains the core promoter — recognition sequences for RNA polymerase — and often additional sequences that bind regulatory proteins called transcription factors.
Think of the promoter as a function signature combined with its access modifier. It defines: can this gene be called? Under what conditions? With what inputs (transcription factors)?
The classic core promoter elements include:
- TATA box (~−25 to −30 from start site) — binding site for TBP (TATA-binding protein), part of the basal transcription machinery
- Initiator element (at the +1 site) — present in many promoters without a TATA box
Many human promoters also have CpG islands — regions with high GC content and many CpG dinucleotides that resist methylation in active genes. CpG methylation is a key epigenetic silencing mechanism, covered in Chapter 3.2.
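The CpG enrichment described above can be quantified. The sketch below computes GC content and the observed/expected CpG ratio for a sequence. The function name `cpg_stats` is hypothetical. The thresholds in the comment are the commonly cited Gardiner-Garden and Frommer criteria, given here as an assumption about how one might flag a candidate island, not as this chapter's definition.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence.

    Expected CpG count assumes C and G are placed independently:
    expected = (#C * #G) / length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")          # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    expected = c * g / n
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_fraction, obs_exp


# Commonly cited heuristic (assumption): length >= 200 bp,
# GC fraction > 0.5, observed/expected CpG > 0.6.
gc, ratio = cpg_stats("CGCGCGCG")
print(gc, ratio)
```

In practice this would be applied in a sliding window across a promoter region rather than to a single short string.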
Enhancers and Silencers
Enhancers are regulatory sequences that increase transcription when bound by specific transcription factors. They can be located thousands or even hundreds of thousands of base pairs away from the gene they regulate — looping through 3D space to contact the promoter.
Silencers work similarly but decrease transcription.
An enhancer is like an environment variable that gets passed to a build process. The gene's promoter is the build script: it runs, but what it does depends on which environment variables are set. A liver-specific enhancer that binds HNF4α (a transcription factor abundant in liver cells) activates its target gene only in cells that express HNF4α. The same gene in a neuron, which lacks that transcription factor, stays silent.
Exons and Introns
When a protein-coding gene is transcribed, the full RNA copy — called pre-mRNA — includes both the coding sequences and intervening non-coding sequences:
- Exons — sequences that end up in the mature mRNA (the word "exon" = "expressed")
- Introns — sequences spliced out before the mRNA leaves the nucleus ("intron" = "intervening")
After transcription, a process called RNA splicing removes the introns and joins the exons. The result is a mature mRNA with only the coding and regulatory sequences needed for translation.
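As a string operation, splicing amounts to keeping the exon intervals and concatenating them in order. The sketch below uses a made-up toy pre-mRNA and a hypothetical helper `splice`. Coordinates are 0-based and end-exclusive (Python slice convention), which is an assumption of this example.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments in coordinate order.

    exons: (start, end) pairs, 0-based, end-exclusive.
    Everything outside the listed exons (the introns) is dropped.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy transcript: exon 1 + intron (starts GU, ends AG) + exon 2
pre_mrna = "AUGGCC" + "GUAAGUCCCUAG" + "GAAUAA"
exons = [(0, 6), (18, 24)]

mature = splice(pre_mrna, exons)
print(mature)  # AUGGCCGAAUAA
```

Passing a different subset of exons to the same function is a crude model of alternative splicing: same pre-mRNA, different mature product.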
The average human protein-coding gene has ~9 exons and ~8 introns. Exons average ~200 bp; introns average ~3,500 bp. The actual coding sequence (the open reading frame, or ORF) is typically much smaller than the total gene span, which can stretch over 100 kb or more of genomic DNA.
Introns look superficially like comments or dead code — sequences that are present in the DNA but removed before execution. But unlike commented-out code, introns are not inert. Many introns contain regulatory elements: splice site signals, regulatory RNAs, and even entire small genes. The splicing machinery that removes them is also a target for alternative splicing regulation, which can change the protein product entirely.
Splice Sites
The boundaries between exons and introns are defined by splice site consensus sequences. The 5' splice site (exon|intron boundary) typically starts with GT (GU in RNA); the 3' splice site (intron|exon boundary) ends with AG. The phrase "GT-AG rule" is a useful mnemonic.
Within the intron, a branch point sequence (~20–50 bp upstream of the 3' splice site) forms a lariat structure during splicing. The spliceosome — a large RNA-protein complex — catalyzes the reaction.
Mutations in splice sites are a major class of pathogenic variants. A single nucleotide change at the GT or AG can cause exon skipping (the exon is no longer recognized and is spliced out together with its flanking introns), intron retention (the intron ends up in the mRNA), or cryptic splice site activation (a nearby sequence that looks like a splice site gets used instead). Each of these outcomes can alter or destroy the protein product.
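A minimal sketch of the GT-AG check, and of how a single-base change breaks it. The function `is_canonical_intron` and the toy sequences are hypothetical. The check looks only at the two terminal dinucleotides on forward-strand DNA, whereas real splice-site prediction also weighs the surrounding consensus and the branch point.

```python
def is_canonical_intron(genomic: str, intron_start: int, intron_end: int) -> bool:
    """True if the intron begins with GT and ends with AG.

    Coordinates are 0-based, end-exclusive, on the forward strand.
    """
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy gene: exon 1 + intron + exon 2
ref = "ATGGCC" + "GTAAGTCCCTAG" + "GAATAA"

# G>A at the first base of the intron (the donor site)
mutant = ref[:6] + "A" + ref[7:]

print(is_canonical_intron(ref, 6, 18))     # True
print(is_canonical_intron(mutant, 6, 18))  # False
```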
The Coding Sequence (CDS) and Open Reading Frame
The coding sequence (CDS) is the portion of the mature mRNA that gets translated into protein. It begins with a start codon (AUG, encoding methionine) and ends with a stop codon (UAA, UAG, or UGA).
The CDS is embedded in the mRNA between UTRs — untranslated regions:
- 5' UTR — between the cap and the start codon; contains ribosome binding sites and regulatory elements
- 3' UTR — between the stop codon and the poly-A tail; contains regulatory sequences that influence mRNA stability, translation efficiency, and subcellular localization
The 3' UTR is a major hub for post-transcriptional regulation. It contains binding sites for microRNAs — small non-coding RNAs that target mRNAs for degradation or translational silencing. Over 60% of human protein-coding genes are regulated by microRNAs. When analyzing differential gene expression, UTR mutations or variants affecting microRNA binding sites can have major phenotypic effects even though they don't change the amino acid sequence.
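The UTR/CDS layout can be sketched as a partition of the mature mRNA string. The example below assumes translation starts at the first AUG and runs in frame to the first stop codon. That is a simplification (real start-site selection depends on sequence context), and the helper name `split_mrna` is hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_mrna(mrna: str):
    """Return (five_prime_utr, cds, three_prime_utr), or None if no complete ORF.

    Assumption: the CDS starts at the first AUG and ends at the first
    in-frame stop codon (stop codon included in the CDS).
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    return None  # start codon found, but no in-frame stop


parts = split_mrna("GGCACC" + "AUGGCCGAAUAA" + "CUUGAAUAAA")
print(parts)
```

A variant that falls in the first or third element of the returned tuple leaves the amino acid sequence untouched but can still change how much protein is made.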
The Codon Table: A Lookup Table for Translation
The genetic code maps triplets of RNA nucleotides (codons) to amino acids. There are 4³ = 64 possible codons: 61 specify one of the 20 amino acids and 3 are stop codons. Because there are more codons than amino acids, most amino acids are encoded by multiple codons, a property called degeneracy or redundancy.
The code is:
- Universal — almost identical across all life (with minor exceptions in some mitochondria and organisms)
- Degenerate — multiple codons map to the same amino acid (e.g., GCU, GCC, GCA, GCG all encode alanine)
- Non-overlapping — each nucleotide belongs to exactly one codon
- Comma-free — no delimiters between codons; the reading frame is established by the start codon
The degeneracy is not random. Synonymous codons (encoding the same amino acid) often differ only in the third position — the "wobble" position. This makes the code more robust to point mutations: a change in the third codon position often doesn't change the amino acid.
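The lookup-table view can be made literal. The sketch below builds the standard genetic code as a Python dictionary, translates a toy CDS, and measures how often a third-position change is synonymous. Function names are hypothetical. The table is the standard code only and does not cover the mitochondrial or other variant codes mentioned above.

```python
from itertools import product

BASES = "UCAG"
# Amino acids for all 64 codons in UCAG order (first base varies slowest); * = stop
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate an RNA coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def third_position_synonymous_fraction() -> float:
    """Fraction of third-position substitutions in sense codons that keep the amino acid."""
    same = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for base in BASES:
            if base != codon[2]:
                total += 1
                same += CODON_TABLE[codon[:2] + base] == aa
    return same / total


print(translate("AUGGCCGAAUAA"))  # MAE
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "A"))
print(third_position_synonymous_fraction())
```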
Pseudogenes and Gene Families
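Not everything that looks like a gene is functional. Pseudogenes are sequences that resemble genes but have lost function through mutations. They often arise when a gene is duplicated and one copy accumulates inactivating mutations; others (processed pseudogenes) arise when an mRNA is reverse-transcribed and inserted back into the genome.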
More productively, gene duplication is the primary mechanism for evolving new gene functions. The human genome contains many gene families — groups of related genes that arose by duplication and divergence. The hemoglobin genes (HBA1, HBA2, HBB, HBD, etc.) are a classic example: all related, all encoding oxygen-carrying proteins, but with different expression patterns and oxygen affinities tuned to developmental stage and tissue type.
Reading a Gene Annotation File
In practice, genes are described in annotation files: GTF (Gene Transfer Format) or GFF3 files that list genomic coordinates for each feature. A typical RNA-seq analysis maps reads to a reference genome and then counts reads per gene, and the counting step requires a gene annotation file.
A GTF record looks like this:
chr17 HAVANA gene 43044295 43125483 . - . gene_id "ENSG00000012048"; gene_name "BRCA1";
chr17 HAVANA transcript 43044295 43125483 . - . gene_id "ENSG00000012048"; transcript_id "ENST00000357654";
chr17 HAVANA exon 43124017 43125483 . - . gene_id "ENSG00000012048"; exon_number "1";
chr17 HAVANA CDS 43124017 43125364 . - . gene_id "ENSG00000012048"; protein_id "ENSP00000350283";
Fields (tab-separated in an actual file): chromosome, source, feature type, start, end, score, strand, frame, attributes.
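The coordinates are 1-based and fully closed: both start and end are inclusive in GTF, so a feature's length is end - start + 1. (BED files, by contrast, are 0-based and half-open, which is a common source of off-by-one errors.) The strand (+ or -) matters: genes on the minus strand are read right-to-left in genomic coordinates, so position 43125483 is the 5' end of BRCA1.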
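A minimal sketch of reading one GTF line and applying the coordinate and strand rules above. It assumes tab-separated fields and quoted attribute values. Some real files also contain unquoted attribute values, which this simple regex would skip. The function names are hypothetical, and a production pipeline would normally use an established parser.

```python
import re


def parse_gtf_line(line: str) -> dict:
    """Parse one tab-separated GTF record into a dict (quoted attributes only)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    chrom, source, feature, start, end, score, strand, frame, attrs = fields
    return {
        "chrom": chrom,
        "source": source,
        "feature": feature,
        "start": int(start),   # 1-based, inclusive
        "end": int(end),       # 1-based, inclusive
        "strand": strand,
        "attributes": dict(re.findall(r'(\S+) "([^"]*)"', attrs)),
    }


def feature_length(rec: dict) -> int:
    """Length in bp; both ends are inclusive, hence the +1."""
    return rec["end"] - rec["start"] + 1


def five_prime_end(rec: dict) -> int:
    """Genomic position of the 5' end, which depends on strand."""
    return rec["start"] if rec["strand"] == "+" else rec["end"]


# The BRCA1 gene record from the text, rebuilt with real tabs
line = "\t".join([
    "chr17", "HAVANA", "gene", "43044295", "43125483", ".", "-", ".",
    'gene_id "ENSG00000012048"; gene_name "BRCA1";',
])
rec = parse_gtf_line(line)
print(rec["attributes"]["gene_name"], feature_length(rec), five_prime_end(rec))
```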
Understanding GTF/GFF3 files is a prerequisite for: RNA-seq, ChIP-seq, variant annotation, CRISPR guide design, and most genome browser work.
Why Gene Structure Matters for Bioinformatics
Almost every bioinformatics analysis involves gene boundaries at some level:
- Variant annotation: is this SNP in a coding exon? In a splice site? In a UTR? The functional impact depends entirely on where it falls in the gene structure.
- RNA-seq: reads are counted per gene, per transcript, sometimes per exon. Isoform-level analysis requires knowing exon-intron boundaries.
- ChIP-seq: where is a transcription factor binding relative to nearby gene promoters?
- CRISPR design: guides near a splice site can disrupt splicing even if they don't hit the coding sequence directly.
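The variant-annotation question in the list above ("where does this position fall?") reduces to interval lookups against the gene model. The sketch below is a simplified, strand-agnostic illustration using a made-up gene model with 1-based inclusive coordinates. The function `classify_position` and the 2 bp splice window are assumptions for this example, not the behavior of any particular annotation tool.

```python
def classify_position(pos, exons, cds_start, cds_end, splice_window=2):
    """Classify a 1-based genomic position relative to one transcript.

    exons: list of (start, end), 1-based inclusive.
    cds_start, cds_end: genomic bounds of the coding region, inclusive.
    splice_window: intronic bases next to an exon that count as splice site.
    """
    exons = sorted(exons)
    for start, end in exons:
        if start <= pos <= end:
            return "CDS" if cds_start <= pos <= cds_end else "UTR"
    # Introns are the gaps between consecutive exons
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if prev_end < pos < next_start:
            near_exon = (pos - prev_end <= splice_window
                         or next_start - pos <= splice_window)
            return "splice_site" if near_exon else "intron"
    return "intergenic"


# Hypothetical two-exon transcript
exons = [(100, 200), (301, 400)]
for pos in (50, 120, 160, 201, 250, 299, 380):
    print(pos, classify_position(pos, exons, cds_start=150, cds_end=350))
```

Real annotators repeat this for every overlapping transcript and account for strand, which is why one variant can carry several different consequence labels.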
The gene is not just a label or a name. It's a precise, structured unit with regulatory logic, internal organization, and defined outputs. Treating it as a simple position on a chromosome misses most of the biology.