The Biological Impact of Transposable Elements

Henry L. Levin, PhD, Head, Section on Eukaryotic Transposable Elements
Angela Atwood-Moore, BA, Senior Research Assistant
Hyo Won Ahn, PhD, Research Fellow
Abhishek Anand, PhD, Visiting Fellow
SePil Lee, PhD, Visiting Fellow
Saadlee Shehreen, PhD, Visiting Fellow
Taeen Jidaan, BA, Postbaccalaureate Fellow
Rebecca John, BA, Postbaccalaureate Fellow

Transposable elements (TEs) are prevalent in eukaryotes, where they not only change genome structure but also provide regulatory sequences central to coordinating the expression of gene networks. TEs of model organisms, such as yeast, are particularly well suited to address the dynamics and impact of their replication. We have studied long terminal repeat (LTR) retrotransposons of the fission yeast (Schizosaccharomyces pombe) to determine how integration sites are selected and to understand how patterns of integration impact the physiology of the cell. In past work, we found that integration of LTR retrotransposons in S. pombe alters gene expression and adapts cells to environmental stress. It is through selective adaptation that we believe TEs form gene-regulatory networks. In additional studies, we have adapted our methods of mapping large numbers of TE insertions to sequencing HIV-1 integration sites. To date, our HIV-1 integration dataset represents the largest published study of integration positions and allows us to identify important mechanistic aspects of integration that were previously neglected.

In humans, TEs represent 50% of the genome. The dominant families of TEs are Long INterspersed Element-1 (LINE-1 or L1), which constitutes 17% of the genome, and Alu Short Interspersed Elements (SINEs), which are mobilized by L1 proteins and constitute 10% of the genome. Given that TEs make up half of the human genome, it is not surprising that their regulatory features are abundant sources of tissue-specific promoter activity and are critical building blocks of gene-regulatory networks. Although the vast majority of TEs have lost mobility, each genome retains approximately 100 active copies. As a result, genome studies of human populations reveal many thousands of polymorphic TEs. These insertions have the potential to alter gene expression and impact health and disease.

The role of LEDGF in transcription is intertwined with its function in HIV-1 integration.

HIV-1 integration occurs across actively transcribed genes, a specificity that is the result of the interaction of chromatin factor LEDGF (lens epithelium-derived growth factor) with integrase. Our understanding of HIV-1 integration is incomplete, in part because the cellular function of LEDGF is unclear. Although LEDGF was originally isolated as a co-activator that stimulates promoter activity in purified systems, this model is inconsistent with LEDGF–mediated integration across gene bodies and with data suggesting that LEDGF has histone-chaperone activity. To clarify the cellular roles of LEDGF, we conducted RNA-Seq. In the absence of LEDGF, 516 expressed genes were differentially expressed, underscoring a significant role in gene expression. To determine how LEDGF regulates transcription, we measured genome-wide enrichment of RNA Pol II (RNA polymerase II) and H3K4me3 (a methylated histone), a mark of active promoters. Cells lacking LEDGF had similar levels of H3K4me3 but reduced RNA Pol II at the promoters of down-regulated genes, suggesting that LEDGF may recruit RNA Pol II. To evaluate the direct role of LEDGF in the expression of these genes, we contended with a long-standing problem in understanding HIV-1 integration, namely, that there were no accurate maps of chromatin-bound LEDGF. Antibodies specific for LEDGF have not been successful in ChIP-Seq experiments. By CRISPR editing of HEK293T cells, we scarlessly introduced a 3XFLAG tag at the 3′ end of PSIP1, the native LEDGF gene. The resulting ChIP-Seq experiments provided a high-resolution and highly specific map of LEDGF–binding sites across the genome. Surprisingly for a protein that mediates integration across gene bodies, we observed pronounced peaks of LEDGF at the 5′ end of transcription units, which matched the peaks of RNA Pol II and H3K4me3 at the transcription start sites (TSSs) of active promoters (Figure 1A and 1B). Using ChIP-Seq, we also observed that levels of RNA Pol II at promoters were reduced in the absence of LEDGF, demonstrating that LEDGF does recruit RNA Pol II (Figure 1A).

Figure 1. Genome-wide enrichment of LEDGF in HEK293T cells

ChIP-Seq analyses of LEDGF association with chromatin in HEK293T cells.

A. SYK is an example of many active genes found to have strong enrichment of LEDGF at transcription start sites that also associate with RNA Pol II and the histone modification H3K4me3. In the absence of LEDGF, the enrichment of RNA Pol II is significantly reduced.

B. A metagene plot of LEDGF enrichment across genes shows that the highest binding occurs at the transcription start sites.
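As an illustration of how a metagene profile like the one in panel B can be built, the following is a minimal sketch, not the pipeline used for the figure. It assumes per-gene coverage vectors that have already been extracted and oriented 5′ to 3′; the function name `metagene_profile` and the input layout are hypothetical.

```python
from typing import List, Sequence


def metagene_profile(gene_signals: Sequence[Sequence[float]], n_bins: int = 10) -> List[float]:
    """Average ChIP-Seq signal across genes after rescaling each gene to n_bins.

    gene_signals: one coverage vector per gene, ordered 5' to 3'
                  (minus-strand genes are assumed to be reversed beforehand).
    Returns the mean signal in each of n_bins relative positions.
    """
    if not gene_signals:
        raise ValueError("at least one gene is required")
    totals = [0.0] * n_bins
    for signal in gene_signals:
        length = len(signal)
        if length < n_bins:
            raise ValueError("each gene needs at least n_bins positions")
        for b in range(n_bins):
            # Integer bin edges that tile the gene without gaps or overlaps.
            start = b * length // n_bins
            end = (b + 1) * length // n_bins
            window = signal[start:end]
            totals[b] += sum(window) / len(window)
    return [t / len(gene_signals) for t in totals]


# Hypothetical toy data: two genes of different length, signal highest at the 5' end.
profile = metagene_profile([[4, 4, 1, 1], [2, 2, 2, 2, 0, 0, 0, 0]], n_bins=2)
```

Rescaling every gene to the same number of bins is what allows genes of different lengths to be averaged position by position; real analyses usually also add fixed flanking windows around the TSS and normalize for sequencing depth, which this sketch omits.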

Efforts to understand how LEDGF is recruited to promoters tested the function of factors that bind to LEDGF, such as the histone methyltransferase MLL1. ChIP-Seq showed that MLL1 peaks matched the positions of LEDGF at promoters and, importantly, when levels of MLL1 were reduced with shRNA (short hairpin RNA), the peaks of LEDGF at promoters were significantly reduced. Interestingly, the relationship is reciprocal, as cells lacking LEDGF exhibited a reduction of MLL1 at active promoters. These experiments provided insight into the function of LEDGF in transcription. Measures of LEDGF association at individual TSSs showed strong correlation with amounts of integration in the downstream transcribed sequences, indicating that MLL1, by recruiting LEDGF to TSSs, plays an important role in the genome-wide pattern of integration.

LEDGF possesses an N-terminal PWWP domain, which is known to interact with histone H3K36me3, an epigenetic mark found across transcribed sequences. The current model of integration is that the PWWP domain (a 100–150 amino acid structure found in eukaryotic proteins involved in DNA methylation, repair, and transcription) is responsible for directing LEDGF to the bodies of genes being actively transcribed. This tethers integrase to transcribed sequences, where it inserts HIV-1 cDNA. With ChIP-Seq experiments, we mapped the chromatin binding of LEDGF lacking the PWWP domain and found that no changes occurred in binding locations, results that were surprising because crude measures of chromatin association indicated that PWWP was important for binding. In collaboration with Alan Engelman, we found that removal of the PWWP domain did not significantly alter the integration specificity at individual genes, supporting our finding that PWWP is not important for chromatin binding. To determine which domains of LEDGF may direct its chromatin binding to TSSs, we expressed LEDGF lacking the integrase-binding domain (IBD), the region that associates with integrase as well as several cellular factors involved in transcription elongation. Importantly, we found that LEDGF lacking the IBD had far less enrichment at TSSs. MLL1 is one of the factors that directly associates with the IBD and, together with the role of MLL1 in recruiting LEDGF to TSSs, these data indicate that MLL1 plays a key role in the chromatin association of LEDGF at TSSs. Based on these data, we propose a new model of LEDGF–mediated HIV-1 integration: LEDGF is recruited to active promoters by MLL1 and subsequently associates with the RNA Pol II elongation complex, which travels across transcription units to effect HIV-1 integration.

L1 retrotransposition and the mechanism that causes 95% of insertions to have severe 5′ truncations

The human genome contains about 500,000 copies of the non-LTR retrotransposon L1, most of which are inactive. L1 ORF2p (a protein essential for L1 retrotransposition) possesses endonuclease and reverse transcriptase activities that generate a single-stranded nick at the site of insertion and, at this position, perform target-primed reverse transcription. However, in 95% of L1 insertions, 5′ truncations occur that remove the promoter and most of the protein-coding sequences. 5′ truncation has played a significant role in shaping the 3-gigabase genome, which would, in the absence of 5′ truncation, have expanded dramatically to be 8 gigabases. However, little is known about the mechanism of truncation. To identify factors responsible for 5′ truncation, we developed an assay that measures both truncated and full-length insertions. We made a dual reporter system with GFP (green fluorescent protein) in the 3′ UTR and mCherry in the 5′ UTR of L1, where both genes are disrupted by an intron that is in the antisense orientation relative to L1 transcription (Figure 2). The reporters can only be expressed after splicing, reverse transcription, and integration of L1 into the genome. Full-length L1 insertions express red and green reporters, but if 5′ truncated, only the green reporter is active. Comparing, by FACS, the number of HEK293T cells expressing dual fluorescence with the number expressing only green fluorescence reveals that full-length insertions represent approximately 5% of all de novo integration events. RNA-Seq analysis using this system showed that just 35% of the 3′ GFP reporter RNA was spliced, compared with 90% of the 5′ mCherry reporter. We also sequenced full-length and truncated insertions in HEK293T cells produced by the dual reporter to confirm the presence of integrated L1. Previous studies identified two residues in L1 that appear to alter the length of insertion. Using this dual reporter, we confirmed that mutations of these residues lowered the ratio of full-length insertions relative to 5′ truncations two-fold. We also evaluated L1 mutations previously shown to increase transposition frequencies in cultured cells. We found that at least two of these mutations increase the frequency of full-length integration.
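The arithmetic behind the dual-reporter readout can be sketched as follows. This is an illustrative calculation only: the function names and the cell counts are hypothetical, and no correction is applied for the different splicing efficiencies of the two reporters noted above.

```python
def full_length_fraction(dual_positive: int, gfp_only: int) -> float:
    """Fraction of de novo insertions scored as full length.

    dual_positive: cells that are GFP+ and mCherry+ (full-length insertions)
    gfp_only:      cells that are GFP+ only (5'-truncated insertions)
    """
    total = dual_positive + gfp_only
    if total == 0:
        raise ValueError("no GFP-positive cells were counted")
    return dual_positive / total


def fold_change(test_fraction: float, reference_fraction: float) -> float:
    """Ratio of full-length fractions, e.g. mutant or knockout versus control."""
    if reference_fraction <= 0:
        raise ValueError("reference fraction must be positive")
    return test_fraction / reference_fraction


# Hypothetical FACS counts chosen to give a 5% control and a 10% test value.
control = full_length_fraction(dual_positive=50, gfp_only=950)
test = full_length_fraction(dual_positive=100, gfp_only=900)
change = fold_change(test, control)
```

With these made-up counts the control fraction is 0.05 and the test condition shows a two-fold increase, the kind of shift the assay is designed to detect.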

Figure 2. Dual reporter of L1 transposition measures full-length integration.

The dual reporter of L1 retrotransposition includes mCherry in the 5′ UTR and GFP in the 3′ UTR. Both fluorescent genes are disrupted with an intron that prevents their expression. Expression of mCherry and GFP (green fluorescent protein) occurs only after the introns are spliced from the L1 RNA, the RNA is reverse transcribed, and the DNA is integrated. Insertions that are 5′ truncated are GFP⁺, while full-length insertions are both GFP⁺ and mCherry⁺.

In efforts to screen large numbers of candidate genes for a role in 5′ truncation, we made another version of the reporter with an inducible promoter and integrated the entire construct in HEK293T cells, using a PiggyBac transposon delivery system. For genetic uniformity, we made single-cell–derived clonal populations from these cells. In one clonal cell line, we screened approximately 95 candidate genes for contributions to 5′ truncation using siRNA knockdown methods. CRISPR–mediated knockout of one candidate, MOV10, confirmed a two-fold increase in full-length insertions. MOV10 is a 5′ to 3′ RNA helicase shown to associate with L1 RNA mostly in the region where 5′ truncation occurs. We are testing the model that MOV10 causes 5′ truncation by disrupting the progress of ORF2p during reverse transcription.

Retrotransposon insertions associated with risk of neurologic and psychiatric diseases

Mental disorders affected about 970 million people worldwide in 2017. In 2020, 21% of adults in the United States suffered from some form of mental illness. Such diseases thus impose a great social and economic burden. Studies of identical twins show that the heritability of diseases such as attention-deficit hyperactivity disorder (ADHD), autism spectrum disorder (ASD), bipolar disorder (BIP), and schizophrenia is extremely high, ranging from 74% to 81%. Because of the complexity of the mammalian nervous system, the genetic and cellular etiology of such diseases remains largely unclear. Progress in genetic methodology has provided the potential to identify mechanisms that underlie the diseases. One approach that has successfully identified important disease loci is genome-wide association studies (GWAS). However, in the cases of neurologic and major psychiatric disorders, GWAS have identified large numbers of loci, each associated with small increases in risk. Importantly, there is extensive overlap of the loci that contribute to major psychiatric disorders, indicating that related molecular mechanisms may underlie distinct clinical phenotypes.

TASs (trait-associated single-nucleotide polymorphisms [SNPs]) of GWAS are genetic tags identifying a genomic region that contains the causal mutation(s) leading to increased disease risk. Limits on the design of GWAS typically prevent such studies from identifying causal gene alleles. Thus, determining causal variants remains the most challenging and rate-limiting, but also the most important, step in defining the genetic architecture of diseases. The vast majority of GWAS TASs lie in intergenic or intronic regions and therefore do not alter coding sequence. For such SNPs to be causal, they would likely have regulatory effects on transcription. Structural variants, such as rearrangements, copy number variants, and transposable element (TE) insertions, constitute a substantial and disproportionately large fraction of the genetic variants found to alter gene expression.

In humans, the dominant families of TEs are long interspersed element-1 (LINE-1 or L1) and Alu elements, which are short interspersed elements (SINEs) and are mobilized by L1. TEs readily alter gene expression because they have evolved various sequences that act on enhancers. Given that TEs make up approximately 50% of the human genome, it is not surprising that their regulatory features are abundant sources of tissue-specific promoter activity.

Relatively recent TE insertions can proliferate in the population and become common alleles. The 1000 Genomes Project described genetic variation of diverse human populations by sequencing whole genomes of 2,504 individuals. The extensive survey of genetic variation detected 17,000 polymorphic insertions of TEs, which have the potential to alter gene expression. Common TE insertion variants may therefore have functional consequences that affect common disease risk. Some common polymorphic TEs have been implicated at disease loci detected by GWAS. Common polymorphic insertions of Alu elements (short transposable elements) occur disproportionately near disease loci of GWAS, underscoring the fact that Alu insertions are potential causative variants.

Given the difficulty in identifying genetic variants responsible for neurologic and psychiatric disorders and the regulatory capacity of TEs, we tested whether polymorphic TEs are potential causative variants of such diseases. We analyzed 593 GWAS of neurologic and psychiatric diseases, which in total reported 753 TASs. From the 17,000 polymorphic TEs, we found that 76 were in linkage disequilibrium (LD) with TASs, indicating that the TEs were among the variants with the potential to be causative. We extended our analysis by evaluating each candidate TE for a role in altering expression of proximal genes. In one approach, we investigated whether polymorphic TEs could disrupt regulatory sequences, as annotated with the epigenomic data of the NIH Roadmap Epigenomics Consortium. In all, we identified 10 polymorphic TEs to examine further as causal candidates because they were positioned in enhancer, promoter, heterochromatin, or transcribed sequences present in neurologic tissues.
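To make the notion of linkage disequilibrium between a polymorphic TE and a TAS concrete, the sketch below computes the standard r² statistic from phased haplotypes. It is a toy illustration, not the analysis pipeline used in the study; the function name `ld_r2` and the input encoding (1 = TE insertion present or SNP risk allele, 0 = absent or alternative allele) are assumptions, and no r² cutoff is implied.

```python
from typing import Iterable, Tuple


def ld_r2(haplotypes: Iterable[Tuple[int, int]]) -> float:
    """r^2 between a biallelic TE insertion and a biallelic SNP.

    haplotypes: (te_allele, snp_allele) pairs, each coded 0 or 1,
                one pair per phased chromosome.
    """
    haps = list(haplotypes)
    n = len(haps)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_te = sum(te for te, _ in haps) / n        # TE insertion allele frequency
    p_snp = sum(snp for _, snp in haps) / n     # SNP allele frequency
    p_both = sum(1 for te, snp in haps if te == 1 and snp == 1) / n
    d = p_both - p_te * p_snp                   # disequilibrium coefficient D
    denom = p_te * (1 - p_te) * p_snp * (1 - p_snp)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    return d * d / denom


# Hypothetical haplotypes: the TE always travels with the SNP allele.
linked = ld_r2([(1, 1), (1, 1), (0, 0), (0, 0)])
# Hypothetical haplotypes: the two sites are statistically independent.
unlinked = ld_r2([(1, 1), (1, 0), (0, 1), (0, 0)])
```

An r² near 1 means the SNP is an almost perfect proxy for the TE insertion, which is why a TE in strong LD with a TAS cannot be excluded as the causal variant on association evidence alone.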

We hypothesized that polymorphic TEs can have a causal relationship with the risk of psychiatric and neurologic disorders by altering expression of genes in cis. For evidence of altered gene expression, we queried the Genotype-Tissue Expression (GTEx) database, which contains expression data for 948 donors across 54 tissues. GTEx readily identifies changes in tissue-specific gene expression associated with locus-specific genetic variation. SNPs in LD with a query gene are identified as eQTLs (expression quantitative trait loci) if the genetic loci with the variants are significantly associated with altered expression of a gene in a specific tissue. We found that 31 of the TASs linked to TEs were variants that are associated with changes in expression of one or more adjacent genes within regions of the brain.
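The core idea of an eQTL test, that expression of a gene changes with the number of copies of a variant allele, can be sketched as a simple least-squares slope. This is a deliberately reduced illustration: GTEx applies its own normalization, covariate adjustment, and significance testing, none of which is reproduced here, and the function name `eqtl_effect_size` and the toy values are hypothetical.

```python
from typing import Sequence


def eqtl_effect_size(dosages: Sequence[float], expression: Sequence[float]) -> float:
    """Least-squares slope of expression on allele dosage (0, 1, or 2 copies).

    A non-zero slope means expression shifts with each additional copy of
    the variant allele; statistical significance is NOT assessed here.
    """
    if len(dosages) != len(expression) or len(dosages) < 2:
        raise ValueError("need matched dosage and expression values for >= 2 donors")
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("all donors share the same genotype")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    return sxy / sxx


# Hypothetical donors: expression rises by one unit per copy of the variant.
slope = eqtl_effect_size([0, 1, 2, 0, 1, 2], [1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
```

In practice the slope would be estimated separately in each tissue, which is how a variant can be an eQTL in brain regions but not elsewhere.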

Having identified a number of polymorphic Alu elements that are significantly associated with disease risk detected by GWAS and that are correlated with altered gene expression in neurologic tissues by eQTL analysis, we developed a luciferase reporter assay to test whether the inserted sequences, in the context of their flanking sequence, can influence transcriptional activity. We measured the impact of candidate Alu and flanking sequences on the function of a minimal promoter in NCRM-1 (human neural stem cells). Of six candidate Alu insertions evaluated for their impact on promoter activity, we found that five significantly altered the expression of luciferase. Taken together, we identified 10 polymorphic TE insertions that are potential candidates on par with other variants for having a causal role in neurologic and psychiatric disorders.

Additional Funding

FY2024 Office of AIDS Research

Publications

Ahn H, Worman Z, Lechsinska A, Payer L, Wang T, Malik N, Li W, Burns K, Nath A, Levin HL. Retrotransposon insertions associated with risk of neurologic and psychiatric diseases. EMBO Reports 2023 24:1–17
Arkhipova IR, Burns KH, Chiappinelli KB, Chuong EB, Goubert C, Guarné A, Larracuente AM, Lee EA, Levin HL. Meeting report: transposable elements at the crossroads of evolution, health and disease 2023. Mobile DNA 2023 14:307–4

Collaborators

Kathleen Burns, MD, PhD, Dana-Farber Cancer Institute, Boston, MA
Ryan Dale, PhD, Bioinformatics and Scientific Programming Core, NICHD, Bethesda, MD
Alan Engelman, PhD, Dana-Farber Cancer Institute, Boston, MA
Caroline Esnault, PhD, Bioinformatics and Scientific Programming Core, NICHD, Bethesda, MD
Avindra Nath, MD, PhD, Division of Neuroimmunology & Neurovirology, NINDS, Bethesda, MD
Mikel Zaratiegui, PhD, Rutgers, The State University of New Jersey, Piscataway, NJ

Contact

For more information, email henry_levin@nih.gov or visit https://www.nichd.nih.gov/research/atNICHD/Investigators/levin.