Genome-wide binding analysis of 195 DNA-binding proteins reveals reservoir promoters and human-specific SVA family repeat region


A key aspect of defining cell state is the complex choreography of DNA-binding events in a given cell type, which in turn establishes a cell-specific gene-expression program. In the two decades since the sequencing of the human genome, a deluge of genome-wide experiments has measured gene expression and DNA-binding events across numerous cell types and tissues. Here we re-analyze ENCODE data in a highly reproducible manner using standardized analysis pipelines, containerization, and literate programming with Rmarkdown. Our approach validated many findings from previous independent studies, underscoring the importance of ENCODE’s goal of providing reproducible data resources. It also revealed several new findings: (i) 1,362 promoters, termed ‘reservoirs,’ can have up to 111 different DNA-binding proteins localized on a single promoter yet show no steady-state RNA expression; and (ii) the human-specific SVA repeat element may have been co-opted for enhancer regulation. Collectively, this study, performed by the students of a CU Boulder computational biology class (BCHM 5631 – Spring 2020), demonstrates the value of reproducible findings and shows how resources like ENCODE that prioritize data standards can foster new findings from existing data in a didactic environment.