Genomics of Gene Regulation in Mammals
Cells in a particular lineage are derived from multipotential progenitor cells by the processes of commitment, differentiation and maturation. These processes are mediated via induction of lineage-specific genes and repression of other genes that are not part of the lineage. Hence understanding the mechanisms of gene regulation is key to understanding the most fundamental events in biology. Furthermore, aberrant or variant gene expression is not only a cause of overt disease (such as inherited anemias) but also determines much of the inherited susceptibility to a wide variety of diseases.
Despite its critical importance, in complex organisms we do not even have a rudimentary idea of a “regulatory code”, i.e. we have not deciphered how information in DNA determines the expression of particular genes at the correct time and place and in the appropriate amount. Of course, we know that regulation of gene expression involves occupancy of cis-regulatory modules (CRMs) by transcription factors (often through recognition of specific binding site motifs), recruitment of co-activators and co-repressors, and modifications of the chromatin structure. However, a full picture of these epigenetic features and how they result in induction and repression of specific genes is not yet available.
We explore these questions in several systems, one of which is a mouse cell line model of erythroid differentiation (collaboration with Drs. Mitch Weiss and Gerd Blobel, Children's Hospital of Philadelphia). G1E cells are derived from mouse ES cells with a knockout of the gene Gata1, which encodes a transcription factor required for erythroid differentiation. We restore GATA1 in an estradiol-inducible manner by expressing a GATA1-ER hybrid protein in the G1E-ER4 subline. After activation of GATA1-ER, the cells differentiate from proliferating progenitor cells into erythroblasts, making abundant hemoglobin and changing morphology dramatically. We are measuring many aspects of both gene expression and relevant epigenetic features throughout the mouse genome during differentiation, using high-throughput methods such as massively parallel sequencing of highly enriched DNA (Illumina and SOLiD platforms) and hybridization to high-density tiling arrays (Affymetrix and NimbleGen). In particular, we comprehensively measure changes in gene expression during this GATA1-dependent differentiation (Affymetrix arrays and RNA-seq). We also measure genome-wide occupancy by the transcription factors GATA1, GATA2, TAL1, and CTCF, as well as chromatin accessibility (DNase hypersensitive sites) and histone modifications in the chromatin (activating marks H3K4me1 and H3K4me3 and the repressing Polycomb mark H3K27me3).
With this wealth of information, one can see that the interplay of chromatin structure and occupancy by transcription factors helps determine the level of gene expression. While a different story can be deciphered for each gene, some general trends are emerging. Most genes are not expressed even in the proliferating progenitors, and many of these are packaged into chromatin that lacks histone modifications (“dead zones”, perhaps constitutive heterochromatin). Levels of gene expression vary over many orders of magnitude, and within this cohort of expressed genes, we find thousands that change their expression level at least two-fold, with more genes repressed than induced. The major differences in chromatin structure distinguish the “off” genes from the expressed genes; this chromatin structure is established by the time of commitment to erythroid differentiation. In contrast, the induction and repression of genes is not associated with large-scale changes in chromatin structure, although in some cases the induced genes show increased amounts of H3K4me3. Rather, induction and repression appear to be determined by the interplay of transcription factors within the already established chromatin landscape. In particular, occupancy of CRMs by both GATA1 and TAL1 is almost invariably associated with induction. CRMs in the neighborhoods of many repressed genes are bound by GATA1 but TAL1 is removed. Perhaps these are hallmarks of protein complexes that recruit co-activators to induced genes and co-repressors to repressed genes.
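The two-fold criterion mentioned above can be illustrated with a minimal sketch. This is not the analysis pipeline used in this work; the function name, the pseudocount, and the placeholder gene names and expression values are all assumptions made for illustration. It labels a gene as induced or repressed when its expression changes at least two-fold between progenitor and differentiated cells.

```python
import math


def classify_by_fold_change(progenitor, differentiated, min_fold=2.0, pseudocount=1.0):
    """Label each gene 'induced', 'repressed', or 'unchanged'.

    progenitor / differentiated: dicts mapping gene name -> expression level
    (hypothetical, non-negative values on a linear scale).
    min_fold: minimum fold change in either direction (two-fold, as in the text).
    pseudocount: assumed small constant that avoids division by zero.
    """
    threshold = math.log2(min_fold)
    labels = {}
    for gene, before in progenitor.items():
        after = differentiated[gene]
        log2_fc = math.log2((after + pseudocount) / (before + pseudocount))
        if log2_fc >= threshold:
            labels[gene] = "induced"
        elif log2_fc <= -threshold:
            labels[gene] = "repressed"
        else:
            labels[gene] = "unchanged"
    return labels


# Placeholder genes and values, invented purely for illustration.
progenitor_levels = {"geneA": 9.0, "geneB": 99.0, "geneC": 50.0}
differentiated_levels = {"geneA": 99.0, "geneB": 9.0, "geneC": 60.0}
labels = classify_by_fold_change(progenitor_levels, differentiated_levels)
```

Working on a log2 scale treats a two-fold increase and a two-fold decrease symmetrically, which is why the same threshold serves for both induction and repression.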
This work on epigenetic features and gene expression is complemented by a project using interspecies sequence alignments (comparative genomics) to find functional regions within noncoding DNA sequences. This long-standing collaboration with Drs. Webb Miller, Francesca Chiaromonte and others has led to the development of software for whole-genome alignments (Miller), use of machine learning to predict regulatory regions from their patterns in multi-species sequence alignments (Chiaromonte and James Taylor), and the testing of these predictions for function as enhancers and promoters by gene transfer into mammalian erythroid cell lines. We also collaborate with large consortia analyzing genome sequences of various vertebrates (mouse, rat, chicken, rhesus macaque, and platypus). We are active in the ENCODE project, which seeks to identify all functional elements in the human genome using high-throughput biochemical and genetic methods. We have a long-standing interest in relating human genotype to phenotype, including maintaining a database of human mutations that lead to changes in hemoglobins or to thalassemias (inherited anemias). Recently, we were part of a group headed by Drs. Stephan Schuster, Vanessa Hayes and Webb Miller that determined the genome sequences of Bushmen and a representative of the southern African Bantu, Archbishop Desmond Tutu. All these projects are part of our interwoven efforts to understand regulatory regions and how they evolved, and how changes in them lead to medically and physiologically important phenotypes.
Figure legend, Hardison lab
Figure Legend: Dynamics of the epigenetic landscape during erythroid induction. The figure shows 82.5 kb around Zfpm1, a gene that is induced immediately after restoration of GATA1. The purple rectangles on the top line mark known CRMs identified in Wang et al. (2007). Underneath the gene structures are indicators of induction (red). This is followed by tracks showing the normalized number of mapped reads for each of the epigenetic features in both differentiating G1E-ER4 cells (with GATA1-ER activated) and progenitor G1E cells (no GATA1). Peaks for GATA1 binding are also marked by red rectangles. At the bottom is a diagram interpreting the dynamic changes in transcription factor binding at the several CRMs in and around Zfpm1.