The extraction of features (biomarkers) from untargeted metabolomics or lipidomics studies, and their predictive performance, relies on a sound experimental design. The primary aim of an experimental design is to minimise analytical variation so that induced or inherent biological variation is captured as fully as possible. Following appropriate high-resolution data acquisition (e.g. UPLC-MS using both RP and HILIC separations, in both ESI+ and ESI- modes, ideally on an Orbitrap MS), data pre-processing steps (such as background subtraction, peak finding, RT correction, deisotoping and alignment) play a critical role in comprehensive feature extraction. Each feature is represented by a retention-time-matched accurate mass ion with its peak intensity or area. Various commercial (instrument-specific or generic) and free (R, GUI or web) tools are available for data pre-processing, each with its own advantages and limitations.
Here’s a comparative analysis of two data pre-processing software packages, XCMS and SIEVE 2.1, on a metabolomics data set acquired in ESI+ mode at 70K resolution on a Q-Exactive. XCMS is the most widely used R-based package (open-source and very fast), whereas SIEVE is a commercial GUI-based package from Thermo Fisher Scientific. Both offer various optimisation parameters for LC-MS based data pre-processing. The metabolomics data set was provided by Thermo Fisher Scientific and is part of a study that investigated the effect of fasting on the rat serum metabolome.
For analysis in XCMS, raw data files were converted to mzXML using ProteoWizard. The XCMS script was optimised to process high-resolution Q-Exactive data and is easy to modify or to use with other packages (e.g. CAMERA). Briefly, features were identified with a mass difference of 0.005 Da and an S:N threshold of 5. This was followed by retention time correction (loess & symmetric), feature grouping (bandwidth, 5) across all samples and gap filling. In Excel, mass ions eluting between 0.5 and 14 min were retained for further processing.
Data processing in SIEVE 2.1 was done using the component extraction algorithm, which allowed on-the-fly background subtraction and peak picking based on the ICIS algorithm. Components or features were extracted from 65 to 850 amu over 0.5 to 14 min. The S:N ratio was set to 5, both RT width and isolation were set to 0.1 min, and the maximum number of frames to 20K.
The features extracted by XCMS and SIEVE were further processed in Excel to retain those with an average peak intensity RSD of < 30 % that were increased or decreased by >2-fold in the fasting group with a p-value < 0.05.
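The three filters above were applied in Excel, but they are easy to express in code. The following is only a minimal sketch in Python, not the workflow actually used: the function names are hypothetical, the RSD is assumed to be checked within each group separately, and the p-value is passed in by the caller because the statistical test is not specified here.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation (sample SD / mean) as a percentage."""
    return 100.0 * stdev(values) / mean(values)


def is_differentiating(control, fasted, p_value,
                       max_rsd=30.0, min_fold=2.0, alpha=0.05):
    """Apply the RSD, fold-change and p-value filters to one feature.

    control, fasted: positive peak intensities, one per sample, per group.
    p_value: supplied by the caller; no particular test is assumed.
    """
    # Assumption: RSD must be below the limit in each group separately.
    if rsd_percent(control) >= max_rsd or rsd_percent(fasted) >= max_rsd:
        return False
    fold = mean(fasted) / mean(control)
    # ">2-fold" in either direction: increased (>2) or decreased (<0.5).
    changed = fold > min_fold or fold < 1.0 / min_fold
    return changed and p_value < alpha


# Made-up intensities for illustration only.
print(is_differentiating([100, 110, 90], [250, 260, 240], p_value=0.01))
print(is_differentiating([100, 110, 90], [150, 160, 140], p_value=0.01))
```

The first made-up feature passes all three filters (2.5-fold increase, low RSD, small p-value); the second fails on fold change alone.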
Through XCMS, 794 differentiating features were identified, of which 443 were increased and 351 decreased (>2-fold) upon fasting. Using SIEVE 2.1, 641 differentiating features were found (383 increased and 258 decreased in the fasting rat serum metabolome). However, only 294 features were common between SIEVE 2.1 (46 %) & XCMS (37 %). The poor feature overlap (overall 25.8 %, i.e. 294 of the 1141 features in the combined list) is a serious concern, and neither package seems to be perfect. Most investigators rely on a single package for supposedly comprehensive feature extraction and are likely to miss a significant proportion of important features. I suppose this is one of the major reasons why biomarkers reported to date are mostly from class-specific primary and few, if any, from secondary metabolic pathways! In terms of speed, SIEVE 2.1 takes a marathon processing time of 4 h 38 min against 12 min for XCMS! The processing was done on an i7, 6-core Windows PC with 48 GB memory (2100 MHz).
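The overlap percentages follow directly from the three counts, and the arithmetic can be checked with a few lines of Python. This is just a worked calculation; the function name `overlap_summary` is made up for illustration.

```python
def overlap_summary(n_a, n_b, n_common):
    """Overlap of two feature lists given their sizes and the shared count."""
    union = n_a + n_b - n_common  # features found by at least one tool
    return {
        "union": union,
        "pct_of_a": 100.0 * n_common / n_a,
        "pct_of_b": 100.0 * n_common / n_b,
        "pct_of_union": 100.0 * n_common / union,
    }


# a = XCMS (794 features), b = SIEVE 2.1 (641 features), 294 in common.
summary = overlap_summary(n_a=794, n_b=641, n_common=294)
for key, value in summary.items():
    print(key, round(value, 1))
```

With the counts reported above, the combined list holds 794 + 641 - 294 = 1141 features, and the 294 shared features make up about 37 % of the XCMS list, 46 % of the SIEVE list and 25.8 % of the combined list.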
To be comprehensive, I would use both packages, which gives a cool 1141 differentiating features, or 423 differentially changed features at >10-fold (with few, if any, false positives)! This is the most appropriate strategy reported so far, and I call it ‘systems level feature mining’ for systems level understanding. Only in this way can the true potential of metabolomics and lipidomics studies be revealed and put at the forefront of other omics approaches, whether for making sense of systems biology or for more practical applications.
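Combining two feature lists means deciding when a feature from one tool is "the same" as a feature from the other. One possible way to do this is sketched below in Python. It is an assumption-laden illustration, not the procedure used for the numbers above: the 5 ppm and 0.1 min tolerances, the greedy one-to-one matching and all function names are hypothetical, and the m/z and RT values are made up.

```python
def same_feature(a, b, ppm_tol=5.0, rt_tol_min=0.1):
    """a, b: (mz, rt_min) tuples. Tolerances are illustrative assumptions."""
    ppm = abs(a[0] - b[0]) / a[0] * 1e6
    return ppm <= ppm_tol and abs(a[1] - b[1]) <= rt_tol_min


def merge_feature_lists(list_a, list_b, ppm_tol=5.0, rt_tol_min=0.1):
    """Return (common, only_a, only_b) using greedy one-to-one matching."""
    unmatched_b = list(list_b)
    common, only_a = [], []
    for feat_a in list_a:
        # Take the first still-unmatched feature in b within both tolerances.
        match = next((fb for fb in unmatched_b
                      if same_feature(feat_a, fb, ppm_tol, rt_tol_min)), None)
        if match is None:
            only_a.append(feat_a)
        else:
            common.append((feat_a, match))
            unmatched_b.remove(match)
    return common, only_a, unmatched_b


# Made-up (m/z, RT in min) features for illustration only.
xcms = [(180.0634, 1.20), (255.2319, 9.85), (496.3398, 11.02)]
sieve = [(180.0635, 1.23), (496.3399, 11.05), (132.1019, 0.95)]

common, only_xcms, only_sieve = merge_feature_lists(xcms, sieve)
combined = len(common) + len(only_xcms) + len(only_sieve)
print(len(common), len(only_xcms), len(only_sieve), combined)
```

The combined list is the shared features plus those unique to each tool. Tighter or looser tolerances will change how many features count as common, so the tolerances should be reported alongside any overlap figure.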
Tip of the peak – try at least two different LC-MS data pre-processing software packages before finalising the feature list.