Cellular identity and behavior are controlled by complex gene regulatory networks. Transcription factors (TFs) bind to specific DNA sequences to regulate the transcription of their target genes. Based on the TF motifs found in cis-regulatory elements, we can model the influence of TFs on gene expression. Such models of TF motif activity usually assume a linear relationship between motif activity and gene expression level, and a commonly used method for modeling motif influence is based on Ridge Regression. One important assumption of linear regression is independence between samples. However, if samples are generated from the same cell line, tissue, or other biological source, this assumption may be invalid. The same independence assumption is also applied to different, yet similar, experimental conditions, where it may be equally inappropriate. In theory, wrongly assuming independence between samples could lead to a loss in signal detection. Here we investigate whether a Bayesian model that allows for correlations results in more accurate inference of motif activities.
To this end, we extend Ridge Regression to a Bayesian Linear Mixed Model, which allows us to model dependence between different samples. This dataset provides the code to simulate gene expression data as a linear sum of motif influences, with different degrees of interaction assumed.
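As a rough illustration of the kind of simulation described above, the following minimal sketch generates gene expression as a linear sum of motif influences and recovers the motif activities with plain Ridge Regression. It is not the code from this dataset: the function names, the parameter values, the Gaussian motif scores, and the equicorrelated covariance between conditions (controlled by the hypothetical parameter rho) are all assumptions made for illustration only.

```python
import numpy as np


def simulate_expression(n_genes=500, n_motifs=20, n_conditions=4,
                        rho=0.5, noise_sd=0.5, seed=0):
    """Simulate expression = motif_scores @ activities + noise.

    Assumption: motif activities are correlated across conditions with a
    simple equicorrelation structure (1 on the diagonal, rho elsewhere).
    """
    rng = np.random.default_rng(seed)
    # Hypothetical motif scores per gene (genes x motifs).
    motif_scores = rng.normal(size=(n_genes, n_motifs))
    # Covariance between conditions; rho = 0 gives independent conditions.
    cond_cov = np.full((n_conditions, n_conditions), rho)
    np.fill_diagonal(cond_cov, 1.0)
    # One activity vector per motif, correlated across conditions
    # (motifs x conditions).
    activities = rng.multivariate_normal(
        np.zeros(n_conditions), cond_cov, size=n_motifs)
    # Independent Gaussian noise per gene and condition.
    noise = rng.normal(scale=noise_sd, size=(n_genes, n_conditions))
    expression = motif_scores @ activities + noise
    return motif_scores, activities, expression


def ridge_activities(motif_scores, expression, lam=1.0):
    """Closed-form Ridge estimate: (M^T M + lam * I)^{-1} M^T Y.

    Each condition (column of expression) is fitted independently, which
    is the independence assumption questioned in the text.
    """
    n_motifs = motif_scores.shape[1]
    gram = motif_scores.T @ motif_scores + lam * np.eye(n_motifs)
    return np.linalg.solve(gram, motif_scores.T @ expression)


motif_scores, true_act, expr = simulate_expression()
est_act = ridge_activities(motif_scores, expr)
```

Varying rho between 0 and values close to 1 is one simple way to mimic different degrees of dependence between conditions; the Ridge estimate serves here only as the independent-conditions baseline, not as the Bayesian Linear Mixed Model itself.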
The code was used in a simulation study in "Investigating the Effect of Dependence Between Conditions with Bayesian Linear Mixed Models for Motif Activity Analysis" (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0231824) by Simone Lederer, Tom Heskes, Simon J van Heeringen and Cornelis A Albers.
This research is presented in Chapter 4 of the PhD thesis titled "Drug-Drug Interaction Models - from Gene Expression to Phenotype" by Simone Lederer. The code is written in the Python programming language and is also available on GitHub (https://github.com/Sim19/SimGEXPwMotifs).