National Institutes of Health

Eunice Kennedy Shriver National Institute of Child Health and Human Development

2023 Annual Report of the Division of Intramural Research

The Biological Impact of Transposable Elements

  • Henry L. Levin, PhD, Head, Section on Eukaryotic Transposable Elements
  • Angela Atwood-Moore, BA, Senior Research Assistant
  • Paul Atkins, PhD, Postdoctoral Fellow
  • Hyo Won Ahn, PhD, Visiting Fellow
  • Abhishek Anand, PhD, Visiting Fellow
  • SePil Lee, PhD, Visiting Fellow
  • Feng Li, PhD, Visiting Fellow
  • Rakesh Pathak, PhD, Visiting Fellow
  • Saadlee Shehreen, PhD, Visiting Fellow
  • Abigail Burkhart, BS, Postbaccalaureate Fellow

Long Terminal Repeat (LTR) retrotransposons are highly abundant and have evolved into ubiquitous families of elements that multiply through cycles of transcription, particle formation, reverse transcription, transport to the nucleus, and integration. Many families of LTR retrotransposons evolved envelope proteins, an addition that allows cell entry and transforms the elements into infectious retroviruses. This close relationship makes LTR retrotransposons ideal models for studying the molecular mechanisms responsible for retrovirus replication. The transposable elements (TEs) of model organisms, such as yeast, are particularly well suited to address the dynamics and impact of their replication. We study LTR retrotransposons of the fission yeast (Schizosaccharomyces pombe) to determine how integration sites are selected and to understand how patterns of integration impact the physiology of the cell. In past work, we found that integration of LTR retrotransposons in S. pombe alters gene expression and adapts cells to environmental stress. It is through selective adaptation that we believe TEs form gene-regulatory networks. In additional studies, we have adapted our methods of mapping large numbers of TE insertions to sequencing HIV-1 integration sites. To date our HIV-1 integration dataset represents the largest published study of positions and allows us to identify important mechanistic aspects of integration that previously have been neglected.

In humans, TEs represent 50% of genomic sequences. The dominant families of TEs are Long INterspersed Element-1 (LINE-1 or L1), which constitutes 17% of the genome, and Alu Short Interspersed Elements (SINEs), which are mobilized by L1 and constitute 10% of the genome. Given that TEs make up half of the human genome, it is not surprising that their regulatory features are abundant sources of tissue-specific promoter activity and are critical building blocks of gene-regulatory networks. Although the vast majority of TEs have lost mobility, each genome retains approximately 100 active copies. As a result, genome studies of human populations reveal many thousands of polymorphic TEs. Our goal is to determine the role of these genetic variants in health and disease.

Identification of an integrase-independent pathway of retrotransposition

Despite the central role of integration in the propagation of retroviruses, important questions remain about residual insertions that occur in the absence of integrase (IN) activity. Mutations in the catalytic residues of HIV-1 IN produce residual infectious titers, typically with a 3- to 4-log reduction. However, in continuous cultures of HIV-1 lacking IN activity, insertion efficiency can be as high as 0.2–0.8% of a wild-type (WT) virus. These findings indicate that retroviruses possess a secondary, IN–independent pathway, which incorporates viral DNA into the host genome. Given that IN–independent infections could compromise the treatment of HIV-1 patients with IN inhibitors, it is important to identify the nature of this pathway.

Figure 1. Tf1 insertion takes place in the absence of integrase.

A. The diagram shows the strategy for monitoring Tf1 retrotransposition. A drug-resistance gene, nat, carrying an artificial intron (nat-AI) is introduced into Tf1, and integration of Tf1 into host chromosomes allows cells to grow on plates containing Nat (N-acetyl transferase). The black arrows indicate the frameshift (fs) sites of PR and IN, respectively. LTR: long terminal repeat; PR: protease; RT: reverse transcriptase; IN: integrase; WT: wild-type.

B. Growth phenotypes of Tf1-WT, Tf1-INfs, and Tf1-PRfs on medium containing Nat after inducing Tf1 expression.

C. Quantitative transposition analysis of Tf1-WT, Tf1-INfs, and Tf1-PRfs.

LTR retrotransposons are important models of retroviruses because of their structural and mechanistic similarities. Tf1 and Tf2 are extensively characterized LTR retrotransposons with high integration activity in S. pombe. Studies of Tf1 expressed with genetic markers demonstrate that the Gag protein, protease (PR), reverse transcriptase (RT), and IN all contribute to transposition. Importantly, the resulting integration is directed to specific RNA polymerase II (RNA Pol II) promoters by the DNA–binding factor Sap1. To identify a model system that can be used to study the mechanisms of IN–independent insertion, we measured the insertion of Tf1 lacking IN activity. We performed an insertion assay with Tf1 encoding a frameshift mutation at the start of IN (Tf1-INfs) that blocks expression of IN without altering RT expression or cDNA synthesis. We found that Tf1-INfs retained 4.95% of the insertion activity of Tf1-WT [Reference 1], indicating that, in the absence of IN activity, Tf1 cDNA inserted into the host genome with surprising efficiency. Genome-wide insertion profiles of Tf1-INfs were significantly different from those of Tf1 expressing active IN. DNA logo analysis showed that the sequences downstream of the Tf1-INfs insertion sites had a prominent bias for the ATAAC nucleotide sequence, and upstream flanks showed a preference for CAA. Interestingly, the downstream logo matches the sequence of the primer binding site (PBS), an 11 bp sequence retained after reverse transcription on the 3′ end of the plus-strand cDNA. The CAA matches the last three base pairs of the polypurine tract (PPT), which is retained on the 3′ end of the minus-strand cDNA. The PBS and PPT sequence preferences indicated that these single-stranded sequences contributed to insertion through homologous recombination (HR).
If IN–independent insertions are directed to sites with homology to the PBS and PPT, we would expect that large numbers of insertions would occur at the 13 pre-existing copies of Tf2 that have PBS and PPT sequences identical to those of Tf1. By analyzing the raw downstream sequences, we found that approximately 70% of the IN–independent insertions occurred at homologous sequences within the pre-existing 5′ LTRs of Tf2s. Whole-genome sequencing of these events revealed that the most common outcome of these insertions resulted in tandem copies of Tf1 and Tf2 elements.
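
As an illustration of the kind of calculation behind a DNA logo, the following minimal sketch tabulates per-position base frequencies for a set of aligned flanking sequences and reports the information content of each position. It is not the pipeline used in the study; the function names and the example flanks are hypothetical, and the sketch assumes equal-length sequences containing only A, C, G, and T.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_frequencies(seqs):
    """Per-position base fractions for equal-length, aligned sequences (A/C/G/T only)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all flanks must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({b: counts.get(b, 0) / len(seqs) for b in BASES})
    return freqs

def information_content(column):
    """Bits at one position: 2 minus the Shannon entropy (no small-sample correction)."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

def consensus(freqs):
    """Most frequent base at each position (ties resolved in A, C, G, T order)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in freqs)

# Hypothetical downstream flanks, invented for illustration only
example_flanks = ["ATAAC", "ATAAC", "ATAAC", "ATTAC", "GTAAC", "ATAAG"]
freqs = position_frequencies(example_flanks)
print(consensus(freqs))
print([round(information_content(col), 2) for col in freqs])
```

Positions with information content near 2 bits are nearly invariant across the sites, which is what appears as a tall letter in a logo.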

Our data suggest that IN–independent insertion of Tf1 is likely mediated by a form of homologous recombination. To determine whether homologous recombination factors contribute to IN–independent insertion, we measured insertion frequencies of strains lacking mre11, rad50, nbs1, rad51, or rad52 (genes encoding members of a complex that repairs double-strand DNA breaks). The results revealed that the insertions occurred through Rad52–dependent single-strand annealing (SSA), as Rad51 was dispensable. The rad52–R45A mutation, which specifically abolishes the SSA activity of Rad52, significantly reduced the frequency of Tf1-INfs insertions and resulted in dissociation of Rad52 from Tf1 cDNA. These data indicate that Rad52 plays a critical role in IN–independent insertions by binding to the ends of the cDNA, causing recombination with sequences similar to PBS and PPT.

The efficiency of HR–mediated IN–independent insertion of Tf1 raised questions about whether this pathway has a biological function. Our efforts to determine whether IN–independent events occur naturally showed that cultures with continuing expression of WT Tf1 produced insertions that were predominantly IN–independent [Reference 1]. These data demonstrate that Tf1 possesses two efficient insertion pathways, one relying on IN and the other being IN–independent but requiring Rad52. Significantly, we found in previously published data of HIV-1 IN–independent insertion sequences that five of 69 sites had strong similarity to the HIV-1 PBS. Together, our findings indicate that homology-dependent SSA provides a significant pathway of IN–independent insertion.
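
One simple way to ask whether an insertion-site flank resembles a motif such as a PBS is to slide the motif along the flank and record the best ungapped match. The sketch below shows that idea only; it is not the method used in the work described above, and the function name and the example sequences are hypothetical.

```python
def best_match(flank, motif):
    """Slide motif along flank; return (matching positions, offset) for the best ungapped alignment."""
    if len(motif) > len(flank):
        raise ValueError("motif must not be longer than the flank")
    best_score, best_offset = -1, 0
    for offset in range(len(flank) - len(motif) + 1):
        window = flank[offset:offset + len(motif)]
        score = sum(1 for a, b in zip(window, motif) if a == b)
        if score > best_score:  # keep the first offset reaching the top score
            best_score, best_offset = score, offset
    return best_score, best_offset

# Hypothetical flank and motif, invented for illustration only
score, offset = best_match("GGTATAACTGA", "ATAAC")
print(score, offset)
```

What counts as "strong similarity" (for example, a minimum number of matching positions) is a choice the analyst must make and justify; no threshold is implied here.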

The role of LEDGF in transcription is intertwined with its function in HIV-1 integration

HIV-1 integration occurs across actively transcribed genes, a specificity that is attributable to the interaction of host factor LEDGF (lens epithelium-derived growth factor) with integrase. Our understanding of HIV-1 integration is incomplete, in part because the cellular function of LEDGF is unclear. Although LEDGF was originally isolated as a co-activator that stimulates promoter activity in purified systems, this model is inconsistent with LEDGF–mediated integration across gene bodies and with data suggesting that LEDGF can regulate alternative splicing. To clarify the roles of LEDGF in transcription, we conducted RNA-seq. In the absence of LEDGF, 516 genes were differentially expressed (a greater than 1.6-fold change), underscoring a significant role in gene expression. To examine the role of LEDGF in splicing, we analyzed genes that produce differentially expressed mRNA isoforms in the absence of LEDGF. The majority of these isoforms were expressed from different promoters, suggesting that the dominant function of LEDGF is to regulate promoter activity, not splicing. To determine how LEDGF regulates transcription, we measured H3K4me3 (a methylated histone) enrichment, a mark of active promoters. Cells lacking LEDGF had reduced H3K4me3 at down-regulated genes and elevated levels at up-regulated genes. To evaluate the direct role of LEDGF in the expression of these genes, we contended with a long-standing problem in understanding HIV-1 integration, namely, there were no accurate maps of chromatin bound by LEDGF. Antibodies specific for LEDGF have not been successful in ChIP-seq experiments. By CRISPR editing of HEK293T cells, we scarlessly introduced a 3XFLAG tag at the 3′ end of PSIP1, the native LEDGF gene. The resulting ChIP-seq experiments provided a high-resolution map of LEDGF–binding sites across the genome.
Surprisingly for a protein that mediates integration across gene bodies, we observed pronounced peaks of LEDGF at the 5′ end of transcription units that matched the peaks of H3K4me3 of active promoters. Significant reduction in H3K4me3 enrichment in LEDGF knockout (KO) cells at LEDGF–bound promoters indicated that LEDGF functions to regulate promoter activity. We also observed by ChIP-seq that levels of RNA Pol II at promoters were reduced in the absence of LEDGF. Efforts to understand how LEDGF is recruited to promoters tested the function of factors that bind to LEDGF, such as the histone methyltransferase MLL1. ChIP-seq showed that MLL1 peaks match the positions of LEDGF at promoters and, importantly, when levels of MLL1 are reduced with siRNA (small interfering RNA), the peaks of LEDGF at promoters are significantly reduced. Reduction of MLL1 also resulted in substantially lower levels of RNA Pol II Ser5 phosphorylation, a mark of active polymerase.
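
The fold-change criterion mentioned above (a greater than 1.6-fold change in either direction) can be sketched as a simple filter on log2 ratios. This is only an illustration: the report does not describe the analysis pipeline, a real RNA-seq analysis would also apply statistical testing, and the function name, the pseudocount, and the expression values below are assumptions.

```python
import math

def differentially_expressed(expr_wt, expr_ko, min_fold=1.6, pseudocount=1.0):
    """Return {gene: log2(KO/WT)} for genes changing more than min_fold in either direction."""
    threshold = math.log2(min_fold)
    hits = {}
    for gene in expr_wt.keys() & expr_ko.keys():
        # The pseudocount (an assumption) avoids division by zero for unexpressed genes
        ratio = (expr_ko[gene] + pseudocount) / (expr_wt[gene] + pseudocount)
        log2fc = math.log2(ratio)
        if abs(log2fc) > threshold:
            hits[gene] = log2fc
    return hits

# Hypothetical normalized expression values, invented for illustration only
wt = {"geneA": 100.0, "geneB": 50.0, "geneC": 80.0}
ko = {"geneA": 40.0, "geneB": 55.0, "geneC": 200.0}
hits = differentially_expressed(wt, ko)
print(sorted(hits))
```

Working on the log2 scale treats up- and down-regulation symmetrically, so a 1.6-fold decrease passes the same threshold as a 1.6-fold increase.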

These experiments not only provided insight into the function of LEDGF in transcription but also revealed new aspects of how LEDGF directs integration. LEDGF possesses an N-terminal PWWP domain, which is known to interact with histone H3K36me3, an epigenetic mark of active transcription. As a result, it is thought that the PWWP domain is responsible for directing LEDGF to the bodies of genes being actively transcribed. With ChIP-seq experiments, we mapped the chromatin binding of LEDGF lacking the PWWP domain and found that no changes occurred in binding locations. These results were surprising because crude measures of chromatin association indicated that PWWP was required for binding. In collaboration with Alan Engelman, we found that removal of the PWWP domain did not significantly alter the integration sites, supporting our finding that PWWP is not important for chromatin binding. Together with our studies on MLL1, these data support a model in which LEDGF is recruited to active promoters by MLL1 and subsequently travels across transcription units to effect HIV-1 integration. We are currently testing the role of MLL1 in integration by mapping insertion sites in cell culture.

Retrotransposon insertions associated with risk of neurologic and psychiatric diseases

Mental disorders affected about 970 million people worldwide in 2017. In 2020, 21% of adults in the United States suffered from some form of mental illness. Such diseases thus cause great social and economic burden. Studies of identical twins show that the heritability of diseases such as attention-deficit hyperactivity disorder (ADHD), autism spectrum disorder (ASD), bipolar disorder (BIP), and schizophrenia is extremely high, ranging from 74% to 81%. Because of the complexity of the mammalian nervous system, the genetic and cellular etiology of such diseases remains largely unclear. Progress in genetic methodology has provided the potential to identify mechanisms that underlie the diseases. One approach that has successfully identified important disease loci is genome-wide association studies (GWAS). However, in the cases of neurologic and major psychiatric disorders, GWAS have identified large numbers of loci, each associated with small increases in risk. Importantly, there is extensive overlap of the loci that contribute to major psychiatric disorders, indicating that related molecular mechanisms may underlie distinct clinical phenotypes.

TASs (trait-associated single-nucleotide polymorphisms [SNPs]) of GWAS are genetic tags identifying a genomic region that contains the causal mutation(s), which lead to increased disease risk. Limits on the design of GWAS typically prevent such studies from identifying causal gene alleles. Thus, determining causal variants remains the most challenging and rate-limiting but also the most important step in defining the genetic architecture of diseases. The vast majority of GWAS TASs lie in intergenic or intronic regions and therefore do not alter coding sequence. For such SNPs to be causal they would likely have regulatory effects on transcription. Structural variants, such as rearrangements, copy number variants, and transposable element (TE) insertions, constitute a substantial and disproportionately large fraction of the genetic variants found to alter gene expression.

In humans, the dominant families of TEs are long interspersed element-1 (LINE-1 or L1) and Alu elements, which are short interspersed elements (SINEs) and are mobilized by L1. TEs readily alter gene expression because they have evolved various sequences that act on enhancers. Given that TEs make up approximately 45% of the human genome, it is not surprising that their regulatory features are abundant sources of tissue-specific promoter activity.

Relatively recent TE insertions can proliferate in the population and become common alleles. The 1000 Genomes Project described genetic variation of diverse human populations by sequencing whole genomes of 2,504 individuals. The extensive survey of genetic variation detected 17,000 polymorphic insertions of TEs, which have the potential to alter gene expression and affect common disease risk. Some common polymorphic TEs have been implicated at disease loci detected by GWAS. Common polymorphic Alu insertions occur disproportionately near disease loci of GWAS, underscoring the fact that Alu insertions are potential causative variants.

Given the difficulty in identifying genetic variants responsible for neurologic and psychiatric disorders and the regulatory capacity of TEs, we tested whether polymorphic TEs are potential causative variants of such diseases [Reference 2]. We analyzed 593 GWAS of neurologic and psychiatric diseases, which in total reported 753 TASs. From the 17,000 polymorphic TEs, we found that 76 were in linkage disequilibrium (LD) with TASs, indicating that the TEs were among the variants with the potential to be causative. We extended our analysis by evaluating each candidate TE for a role in altering expression of proximal genes. In one approach, we determined whether polymorphic TEs could disrupt regulatory sequences, as annotated with the epigenomic data of the NIH Roadmap Epigenomics Consortium. In all, we identified 10 polymorphic TEs to examine further as causal candidates because they were positioned in enhancer, promoter, heterochromatin, or transcribed sequences present in neurologic tissues.
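
Linkage disequilibrium between a polymorphic TE and a TAS is commonly summarized by the r² statistic. The sketch below computes r² for two biallelic loci from phased haplotypes; it is a textbook formulation rather than the procedure used in the study, and the function name and the haplotype counts are hypothetical.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic loci from phased haplotypes given as (te_allele, snp_allele) pairs of 0/1."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of the TE insertion allele
    p_b = sum(b for _, b in haplotypes) / n           # frequency of the SNP alternate allele
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    return d * d / denom

# Hypothetical phased haplotypes, invented for illustration only
example = [(1, 1)] * 4 + [(0, 0)] * 5 + [(1, 0)] * 1
print(round(ld_r_squared(example), 3))
```

The r² cutoff above which a TE is treated as being in LD with a TAS is an analysis choice that is not specified here.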

We hypothesized that the polymorphic TEs have a causal relationship with risk of psychiatric and neurologic disorders by altering expression of genes in cis. For evidence of altered gene expression, we queried the Genotype-Tissue Expression (GTEx) database, which contains expression data for 948 donors across 54 tissues. GTEx readily identifies changes in tissue-specific gene expression associated with loci-specific genetic variation. SNPs in LD with a query gene are identified as eQTLs (expression quantitative trait loci) if the genetic loci with the variants are significantly associated with altered expression of a gene in a specific tissue. We found that 31 of the TASs linked to TEs were variants that are associated with changes in expression of one or more adjacent genes within regions of the brain.
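
An eQTL association is, at its simplest, a regression of a gene's expression on the number of copies of a variant allele each donor carries. The following sketch fits that slope by least squares. It is a simplified illustration and not the GTEx procedure, which additionally handles covariates and multiple testing; the function name and the data are hypothetical.

```python
import math

def eqtl_slope(dosages, expression):
    """Least-squares slope and Pearson r of expression on allele dosage (0, 1, or 2 copies)."""
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched dosage and expression values for at least 3 donors")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("dosage and expression must both vary across donors")
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical donors: allele dosage and normalized expression in one tissue
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, r = eqtl_slope(dosages, expression)
print(round(slope, 3), round(r, 3))
```

A nonzero slope indicates that expression changes with each additional copy of the allele; whether it is significant depends on a statistical test that this sketch omits.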

Having identified a number of polymorphic Alu elements that are significantly associated with disease risk detected by GWAS and that are correlated with altered gene expression in neurologic tissues by eQTL analysis, we developed a luciferase reporter assay to test whether the insert sequences in the context of flanking sequence can influence transcriptional activity. We measured the impact of candidate Alu and flanking sequences on the function of a minimal promoter in NCRM-1 human neural stem cells. Of six candidate Alu insertions evaluated for their impact on promoter activity, we found that five significantly altered the expression of luciferase. Taken together, these analyses identified 10 polymorphic TE insertions that are potential candidates on par with other variants for having a causal role in neurologic and psychiatric disorders.

Additional Funding

  • FY2023 Office of AIDS Research

Publications

  1. Li F, Lee M, Esnault C, Wendover K, Guo Y, Atkins P, Zaratiegui M, Levin HL. Identification of an integrase-independent pathway of retrotransposition. Sci Adv 2022 8:1–17.
  2. Ahn H, Worman Z, Lechsinska A, Payer L, Wang T, Malik N, Li W, Burns K, Nath A, Levin HL. Retrotransposon insertions associated with risk of neurologic and psychiatric diseases. EMBO Reports 2023 24:1–17.

Collaborators

  • Kathleen Burns, MD, PhD, Dana-Farber Cancer Institute, Boston, MA
  • Ryan Dale, PhD, Bioinformatics and Scientific Programming Core, NICHD, Bethesda, MD
  • Alan Engelman, PhD, Dana-Farber Cancer Institute, Boston, MA
  • Caroline Esnault, PhD, Bioinformatics and Scientific Programming Core, NICHD, Bethesda, MD
  • Avindra Nath, MD, PhD, Division of Neuroimmunology & Neurovirology, NINDS, Bethesda, MD
  • Mikel Zaratiegui, PhD, Rutgers, The State University of New Jersey, Piscataway, NJ

Contact

For more information, email henry_levin@nih.gov or visit https://www.nichd.nih.gov/research/atNICHD/Investigators/levin.
