Sparse canonical methods for biological data integration: application to a cross-platform study

Lê Cao, Kim-Anh, Martin, Pascal G. P., Robert-Granié, Christèle and Besse, Philippe (2009) Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics, 10 Article# 34: 1-17. doi:10.1186/1471-2105-10-34

Author Lê Cao, Kim-Anh
Martin, Pascal G. P.
Robert-Granié, Christèle
Besse, Philippe
Title Sparse canonical methods for biological data integration: application to a cross-platform study
Journal name BMC Bioinformatics
ISSN 1471-2105
Publication date 2009
Sub-type Article (original research)
DOI 10.1186/1471-2105-10-34
Open Access Status DOI
Volume 10
Issue Article# 34
Start page 1
End page 17
Total pages 18
Place of publication London, England, United Kingdom
Publisher BioMed Central
Language eng
Subject 0104 Statistics
06 Biological Sciences
Formatted abstract

In the context of systems biology, few sparse approaches have been proposed so far to integrate several data sets. It is, however, an important and fundamental issue that will be widely encountered in post-genomic studies, when simultaneously analyzing transcriptomics, proteomics and metabolomics data using different platforms, so as to understand the mutual interactions between the different data sets. In this high-dimensional setting, variable selection is crucial to give interpretable results. We focus on a sparse Partial Least Squares approach (sPLS) to handle two-block data sets, where the relationship between the two types of variables is known to be symmetric. Sparse PLS has been developed either for a regression or a canonical correlation framework and includes a built-in procedure to select variables while integrating data. To illustrate the canonical mode approach, we analyzed the NCI60 data sets, where two different platforms (cDNA and Affymetrix chips) were used to study the transcriptome of sixty cancer cell lines.


We compare the results obtained with two other sparse or related canonical correlation approaches: CCA with Elastic Net penalization (CCA-EN) and Co-Inertia Analysis (CIA). The latter does not include a built-in procedure for variable selection and requires a two-step analysis. We stress the lack of statistical criteria to evaluate canonical correlation methods, which makes biological interpretation absolutely necessary to compare the different gene selections. We also propose comprehensive graphical representations of both samples and variables to facilitate the interpretation of the results.


sPLS and CCA-EN selected highly relevant genes and complementary findings from the two data sets, which enabled a detailed understanding of the molecular characteristics of several groups of cell lines. These two approaches were found to bring similar results, although they highlighted the same phenomena with a different priority. They outperformed CIA, which tended to select redundant information.
Q-Index Code C1
Q-Index Status Provisional Code
Institutional Status Non-UQ
Additional Notes Article number: 34

Document type: Journal Article
Sub-type: Article (original research)
Collection: Institute for Molecular Bioscience - Publications
Citation counts: Thomson Reuters Web of Science: cited 48 times
Scopus: cited 84 times
Created: Thu, 16 Sep 2010, 13:56:38 EST by Laura McTaggart on behalf of Institute for Molecular Bioscience