Regulatory annotation of plant genomes by means of functional chromatin signatures and comparative sequence analysis

01 December 2012 → 31 August 2016
Regional and community funding: IWT/VLAIO
Research disciplines
  • Natural sciences
    • Plant biology
regulatory annotation, plants, transcriptional regulation, gene expression
Project description

Transcriptional regulation is a dynamic process that plays an important role in establishing gene expression profiles during development or in response to (a)biotic stimuli. This project had two aims. The first was to study how transcriptional regulation and gene expression are organized across the genome. The second was to apply the resulting datasets to assign functions to transcription factors (TFs) and to their target genes of previously unknown function.
The research presented in this thesis starts with the development of a phylogenetic footprinting approach for the identification of conserved non-coding sequences (CNSs) in Arabidopsis thaliana using genomic information from 12 dicot plants. In this approach, both alignment-based and non-alignment-based techniques were applied to identify functional motifs in a multi-species context. The method accounts for incomplete motif conservation as well as high sequence divergence between related species. In total, we identified 69,361 footprints associated with 17,895 genes. By integrating known TF binding sites (TFBSs) obtained from literature and experimental studies, we compiled a gene regulatory network containing 40,758 interactions, of which two-thirds act through binding events located in DNase I hypersensitive sites. This network shows significant enrichment for in vivo targets of known regulators, and its overall quality was confirmed using five different biological validation metrics. Finally, a proof-of-concept experiment using detailed expression and function information was performed to demonstrate how static CNSs can be converted into condition-dependent regulatory networks, offering new opportunities for regulatory gene annotation.
In a subsequent analysis, we applied the aforementioned phylogenetic footprinting framework to ten dicot plants for the identification of CNSs. This yielded 1,032,291 CNSs associated with 243,187 genes. To annotate these CNSs with TFBSs, we made use of binding site information for 642 TFs originating from 35 TF families in Arabidopsis. The obtained CNSs were validated using TF chromatin immunoprecipitation sequencing (ChIP-Seq) data from three species, which showed significant overlap for the majority of datasets. We also identified ultra-conserved CNSs by including genomes of additional plant families, and found 715 binding sites for 501 genes conserved in dicots, monocots, mosses and green algae. By applying the obtained CNSs, we found that genes that are part of conserved miniregulons have a higher coherence in their expression profiles than other divergent gene pairs.
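The ChIP-Seq validation described above comes down to asking how many CNSs coincide with experimentally determined binding regions. The following is a minimal sketch of that overlap computation, not the project's actual pipeline. It assumes half-open (chromosome, start, end) intervals; the function names and the toy coordinates are hypothetical.

```python
from collections import defaultdict


def group_by_chrom(intervals):
    """Group (chrom, start, end) intervals by chromosome."""
    grouped = defaultdict(list)
    for chrom, start, end in intervals:
        grouped[chrom].append((start, end))
    return grouped


def fraction_overlapping(cns, peaks):
    """Return the fraction of CNSs that share at least one base with a peak.

    Intervals are half-open: [start, end).
    """
    if not cns:
        return 0.0
    peak_index = group_by_chrom(peaks)
    supported = 0
    for chrom, start, end in cns:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(p_start < end and start < p_end
               for p_start, p_end in peak_index.get(chrom, [])):
            supported += 1
    return supported / len(cns)


# Toy data (hypothetical coordinates, for illustration only).
toy_cns = [("chr1", 100, 120), ("chr1", 500, 510), ("chr2", 40, 60)]
toy_peaks = [("chr1", 90, 110), ("chr2", 60, 80)]
print(fraction_overlapping(toy_cns, toy_peaks))
```

In the toy data only the first CNS overlaps a peak; the chr2 CNS merely touches its neighbouring peak and is not counted. Judging whether an observed fraction is significant would additionally require a background model, for example repeating the computation on randomly repositioned intervals, which this sketch does not include.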
Next, a novel algorithm was developed that supports both alignment-free and alignment-based conserved motif discovery in the promoter sequences of closely related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Because exhaustive enumeration is computationally demanding, the MapReduce programming model was adopted so that a cloud computing infrastructure could handle the resource requirements efficiently. The method was applied to four monocotyledon plant species, and we were able to show that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in Oryza sativa and Zea mays. Furthermore, the method was shown to recover ga2ox1-like KN1 binding sites in Zea mays that were experimentally profiled through ChIP-Seq.
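To make the two core ideas above concrete, the sketch below enumerates words over a reduced IUPAC alphabet and scores the conservation of one motif with a branch length score. It is an illustration under stated assumptions, not the algorithm as implemented in the project: the branch length score is taken here as the total branch length of the smallest subtree connecting the species that contain the motif, divided by the total branch length of the tree; the tree topology, branch lengths, the placeholder species labels sp3 and sp4, and all function names are hypothetical; and the code runs on a single machine rather than under MapReduce.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def enumerate_words(k, alphabet="ACGTN"):
    """Yield every word of length k over the given (possibly degenerate) alphabet."""
    return ("".join(word) for word in product(alphabet, repeat=k))


def occurs(motif, seq):
    """True if the degenerate motif matches seq at any position."""
    k = len(motif)
    return any(
        all(seq[i + j] in IUPAC[code] for j, code in enumerate(motif))
        for i in range(len(seq) - k + 1)
    )


# Hypothetical tree: child -> (parent, branch length). Lengths are made up.
TREE = {
    "Oryza": ("n1", 0.2), "sp3": ("n1", 0.3),
    "Zea": ("n2", 0.25), "sp4": ("n2", 0.15),
    "n1": ("root", 0.1), "n2": ("root", 0.1),
}


def leaves_below(tree):
    """Map each non-root node to the set of leaves in its subtree."""
    parents = {parent for parent, _ in tree.values()}
    leaves = set(tree) - parents
    below = {node: set() for node in tree}
    for leaf in leaves:
        node = leaf
        while node in tree:  # walk up until the root is reached
            below[node].add(leaf)
            node = tree[node][0]
    return below


def branch_length_score(present, tree):
    """Length of the subtree connecting `present`, relative to the whole tree."""
    below = leaves_below(tree)
    total = sum(length for _, length in tree.values())
    # An edge lies on a path between two selected leaves exactly when some,
    # but not all, of the selected leaves sit below it.
    spanned = sum(
        length
        for node, (_, length) in tree.items()
        if 0 < len(below[node] & present) < len(present)
    )
    return spanned / total


def score_motif(motif, promoters, tree):
    """Score one motif across the promoters of one orthologous gene family."""
    present = {species for species, seq in promoters.items() if occurs(motif, seq)}
    return branch_length_score(present, tree)
```

For example, calling `score_motif("TGAY", promoters, TREE)` with a dictionary that maps each of the four leaf labels to a promoter sequence returns a value between 0 and 1, where 1 means the motif is found in every species. Restricting the alphabet to `ACGTN` keeps the enumeration small for demonstration purposes; the full 15-letter alphabet grows as 15^k, which illustrates why distributed computation becomes attractive.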
Finally, a target gene identification analysis was performed for 12 NAM-ATAF1/2-CUC2 (NAC) transcription factors. NAC transcription factors are among the largest transcription factor families in plants, yet limited data exist from unbiased approaches to resolve the DNA-binding preferences of individual members. We used a TF-target gene identification workflow based on the integration of novel protein-binding microarray data with gene expression and multi-species promoter sequence conservation to identify the DNA-binding specificities and the underlying gene regulatory networks. The data offer specific single-base-resolution fingerprints for most TFs studied and indicate that NAC DNA-binding specificities might be predicted from the sequence of their DNA-binding domain. The developed methodology, including the application of complementary functional genomics filters, makes it possible to translate, for each TF, protein-binding microarray data into a set of high-quality target genes. NAC target genes reported from independent in vivo analyses were confirmed to be detected by this approach.
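The idea of translating protein-binding microarray data into target genes through complementary filters can be sketched as a set intersection. The example below is a simplified illustration, not the workflow used in the project. It assumes the microarray data are summarized as a score per k-mer, that a gene passes the binding filter when its promoter contains at least one k-mer at or above a threshold, and that the expression and conservation evidence are available as gene sets. The function names, the threshold value, and the toy identifiers are all hypothetical.

```python
def pbm_hits(promoters, kmer_scores, k, threshold):
    """Genes whose promoter contains at least one k-mer scoring >= threshold."""
    hits = set()
    for gene, seq in promoters.items():
        for i in range(len(seq) - k + 1):
            if kmer_scores.get(seq[i:i + k], 0.0) >= threshold:
                hits.add(gene)
                break  # one qualifying site is enough for this gene
    return hits


def target_genes(promoters, kmer_scores, responsive, conserved,
                 k=8, threshold=0.45):
    """Keep genes supported by binding, expression and conservation evidence.

    `responsive` and `conserved` are sets of gene identifiers that passed the
    expression filter and the promoter-conservation filter, respectively.
    The default threshold is an assumed value, not one taken from the project.
    """
    bound = pbm_hits(promoters, kmer_scores, k, threshold)
    return bound & responsive & conserved


# Toy inputs (hypothetical gene identifiers, sequences and scores).
toy_promoters = {
    "geneA": "AATTACGTAAGG",
    "geneB": "CCTTACGTAACC",
    "geneC": "GGGGGGGGGGGG",
}
toy_scores = {"TTACGTAA": 0.48}
print(sorted(target_genes(toy_promoters, toy_scores,
                          responsive={"geneA", "geneC"},
                          conserved={"geneA", "geneB"})))
```

In the toy data, geneA and geneB both contain the high-scoring 8-mer, but only geneA is also present in the expression and conservation sets, so it is the only predicted target. Requiring all three lines of evidence trades sensitivity for specificity; a less strict variant could instead require support from any two filters.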