When using lasso regression for transcriptomic data, we reduce the number of independent variables based on collinearity effects. However, this method is ignorant of molecular topography in expression networks and I think that, as a consequence of the dimension reduction, we lose statistical significance in follow-up gene ontology pathway analysis.

For example, specific RNA isoforms or downstream transcripts, which may be affected by multiple upstream regulators such as transcription factors or splicing factors, may be kept in a lasso regression model because they change strongly with the dependent variable (age in this case), while the upstream drivers may be removed due to collinearity. So we may lose important information about the drivers of cytokines, signals, metabolic pathways, inflammatory networks, etc.

In addition, other RNA isoform species from the same co-expression network will have been removed from the output because they are collinear with the features that were kept. As such, our model will produce a list of features which are linked as much as possible to the dependent variable, but not to each other. When we then perform gene ontology pathway analysis, are the enrichment outputs muted? Will this ultimately hide potential therapeutic molecular targets for intervention? Please see my crude graphical representation for clarity of my assumptions. Is this correct? Do I misunderstand penalization of collinearity?

The LASSO regularization shrinks the coefficients of the model and forces some of them to be exactly 0, so when two genes are collinear, it tends to keep one and drop the other more or less arbitrarily. If you are interested in the selected features, you should probably analyze collinearity before you start modelling. It may also be more sensible in such cases to use another penalization such as elastic net or even ridge regression (or some variant).
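As a rough illustration of what checking collinearity before modelling could look like, here is a minimal Python sketch that groups transcripts whose pairwise Pearson correlation exceeds a cut-off. The transcript names, the toy expression values, the 0.9 threshold and the helper `collinear_groups` are all made up for the example; with tens of thousands of transcripts an all-pairs loop like this would be far too slow, so treat it only as a statement of the idea.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def collinear_groups(expr, threshold=0.9):
    """Group features linked by |r| >= threshold (hypothetical helper).

    expr maps a feature name to its expression values across samples.
    Features are merged with a small union-find, so a group can contain
    pairs that are only connected through a third feature.
    """
    parent = {name: name for name in expr}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in combinations(expr, 2):
        if abs(pearson(expr[a], expr[b])) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in expr:
        groups.setdefault(find(name), []).append(name)
    # Only report groups with more than one member
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Toy data: one regulator, two co-expressed targets, one unrelated transcript
expr = {
    "TF_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "target_A1": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "target_A2": [0.9, 2.2, 2.9, 4.1, 5.2, 5.8],
    "unrelated": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
groups = collinear_groups(expr)
print(groups)
```

Knowing these groups in advance lets you interpret a selected feature as a representative of its group rather than as the only relevant transcript.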

Thanks for your reply Jean-Karim,

Part of the reason I considered lasso was that I have tens of thousands of transcripts which I want to assess for age-related effects. Ordinary multiple regression isn't an option because N < number of independent variables.

I was then advised to use lasso as it can handle large numbers of variables, but I would still like to be able to assess which transcripts are age-related across the whole transcriptome without losing data. Are the models you have suggested able to inform this for large data sets? Or can you suggest others?

Kind regards, J

Look for regularization methods other than the basic LASSO, such as elastic net, relaxed LASSO (see also this glmnet vignette), methods from the ncvreg package or L0 regularization (in the l0ara package). Both elastic net and L0 regularization should be able to deal with collinear variables, in the sense that they wouldn't arbitrarily drop some but distribute the effect among them.
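To make the difference concrete for the lasso and elastic net case, here is a small pure-Python sketch of coordinate descent on the glmnet-style objective (1/2n)·||y − Xb||² + λ·(α·||b||₁ + (1 − α)/2·||b||²). It is not the glmnet implementation, just a toy under stated assumptions: two centred features with correlation 0.96, a response that is exactly twice the first feature, and arbitrary values λ = 0.5 and α = 0.5. The function name `elastic_net_cd` and the data are invented for the example.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_cd(X, y, lam, alpha, sweeps=500):
    """Plain coordinate descent for the elastic net (alpha=1 gives the lasso).

    X is a list of feature columns and y the response; both are assumed
    to be centred already, so no intercept is fitted.
    """
    n, p = len(y), len(X)
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Residual of y with every feature except j removed
            resid = [
                y[i] - sum(beta[k] * X[k][i] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[j][i] * resid[i] for i in range(n)) / n
            scale = sum(v * v for v in X[j]) / n
            beta[j] = soft_threshold(rho, lam * alpha) / (scale + lam * (1 - alpha))
    return beta


# Two strongly correlated toy "transcripts" (r = 0.96), both centred
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-1.64, -1.52, 0.0, 1.52, 1.64]
y = [2.0 * v for v in x1]  # response driven by x1 only

lasso = elastic_net_cd([x1, x2], y, lam=0.5, alpha=1.0)
enet = elastic_net_cd([x1, x2], y, lam=0.5, alpha=0.5)
print("lasso      :", lasso)
print("elastic net:", enet)
```

On this toy data the lasso fit keeps only the first feature and sets the second coefficient to zero, while the elastic net fit gives both correlated features a non-zero coefficient. Whether that behaviour carries over to a real transcriptome depends on the data and on how λ and α are tuned.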