Features by tons – SIEVE 2.1 vs XCMS (rev-3)

Extraction of features (biomarkers) from untargeted metabolomics or lipidomics studies, and their predictive performance, rely on a sound experimental design. The primary aim of an experimental design is to minimise analytical variation so that induced or inherent biological variation is captured to the maximum extent. Following appropriate high-resolution data acquisition (e.g. UPLC-MS using both RP and HILIC separations, in both ESI+ and ESI- modes, ideally on an Orbitrap MS), data pre-processing steps (such as background subtraction, peak finding, RT correction, deisotoping and alignment) play a critical role in comprehensive feature extraction. Each feature is represented by a retention-time-matched accurate mass ion with its peak intensity or area. Various commercial (instrument-specific or generic) and free (R, GUI or web) tools are available for data pre-processing, each with its own advantages and limitations.

Here’s a comparative analysis of two data pre-processing software packages, XCMS and SIEVE 2.1, on a metabolomics data set acquired in ESI+ mode at 70K resolution on a Q-Exactive. XCMS is the most widely used R-based package (open-source & very fast), whereas SIEVE is a commercial GUI-based software from Thermo Fisher Scientific. Both offer various optimisation parameters for LC-MS based data pre-processing. The metabolomics data set was provided by Thermo Fisher Scientific and is part of a study that investigated the effect of fasting on the rat serum metabolome.

For analysis in XCMS, raw data files were converted to mzXML using ProteoWizard. The XCMS script was optimised to process high-resolution Q-Exactive data and is flexible enough to modify or use with other packages (e.g. CAMERA). Briefly, features were identified with a mass difference of 0.005 Da and an S:N threshold of 5. This was followed by retention time correction (loess & symmetric), feature grouping (bandwidth, 5) across all samples and gap filling. In Excel, mass ions eluting between 0.5 and 14 min were retained for further processing.

Data processing in SIEVE 2.1 was done using the component extraction algorithm, which allows on-the-fly background subtraction and peak picking based on the ICIS algorithm. Components or features were extracted from 65 to 850 amu over 0.5 to 14 min. The S:N ratio was set to 5, both RT width & isolation were set to 0.1 min, and the max number of frames to 20K.

The features extracted by XCMS & SIEVE were further processed in Excel to retain those with an average peak intensity RSD of < 30 % that were increased or decreased by >2-fold in the fasting group with a p-value < 0.05.
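The Excel filter described above can also be written as a few lines of code. The following is only a minimal Python sketch, not the spreadsheet actually used: the function names are made up for illustration, the p-value is assumed to have been computed beforehand (e.g. by a t-test), and the RSD criterion is assumed to apply to each group separately, which is one possible reading of "average peak intensity RSD".

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation (sample SD / mean) in percent."""
    return 100.0 * stdev(values) / mean(values)


def classify_feature(control, fasted, p_value,
                     max_rsd=30.0, min_fold=2.0, alpha=0.05):
    """Return 'increased', 'decreased' or None for one feature.

    control, fasted: peak intensities of the feature in each group
    (at least two non-zero values per group).
    p_value: precomputed elsewhere; this sketch does not run the test.
    Assumption: the RSD limit must hold within each group.
    """
    if rsd_percent(control) >= max_rsd or rsd_percent(fasted) >= max_rsd:
        return None
    if p_value >= alpha:
        return None
    fold = mean(fasted) / mean(control)
    if fold > min_fold:
        return "increased"
    if fold < 1.0 / min_fold:
        return "decreased"
    return None


# Illustrative intensities only, not values from the rat serum data set.
control = [100, 110, 90, 105]
fasted = [250, 260, 240, 255]
print(classify_feature(control, fasted, p_value=0.001))
```

Applying the same function to every row of the XCMS and SIEVE feature tables would give the increased and decreased lists that are compared further on.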

Through XCMS, 794 differentiating features were identified, of which 443 were increased and 351 decreased (>2-fold) upon fasting. Using SIEVE 2.1, 641 differentiating features were found (383 increased and 258 decreased in the fasting rat serum metabolome). However, only 294 features were common to both, which is 46 % of the SIEVE 2.1 list and 37 % of the XCMS list. The poor feature overlap (25.8 % of all features found by either package) is a serious concern, and neither package seems to be perfect. Most investigators rely on a single software for supposedly comprehensive feature extraction and are likely to miss a significant proportion of important features. I suppose this is one of the major reasons why biomarkers reported to date are mostly from class-specific primary metabolic pathways and few, if any, from secondary ones! In terms of speed, SIEVE 2.1 takes a marathon processing time of 4:38 hrs against 12 min for XCMS! The processing was done on an i7, 6-core Windows PC with 48 GB memory (2100 MHz).
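For anyone who wants to check the overlap figures, the arithmetic is easy to reproduce. The short Python sketch below uses only the counts and run times quoted above; the function and variable names are mine.

```python
def overlap_summary(n_a, n_b, n_common):
    """Overlap statistics for two feature lists sharing n_common features."""
    union = n_a + n_b - n_common  # features found by at least one tool
    return {
        "union": union,
        "pct_of_a": 100.0 * n_common / n_a,
        "pct_of_b": 100.0 * n_common / n_b,
        "pct_of_union": 100.0 * n_common / union,
    }


# Counts from the post: XCMS 794, SIEVE 2.1 641, common 294.
summary = overlap_summary(n_a=794, n_b=641, n_common=294)
for key, value in summary.items():
    print(key, round(value, 1))

# Processing time: 4 h 38 min for SIEVE 2.1 versus 12 min for XCMS.
speed_ratio = (4 * 60 + 38) / 12
print("SIEVE/XCMS time ratio:", round(speed_ratio, 1))
```

This gives a union of 1141 features, with the 294 shared features making up about 37 % of the XCMS list, 46 % of the SIEVE list and 25.8 % of the union. On run time, SIEVE 2.1 was roughly 23 times slower.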

[Figure: Venn plot of differentiating features from SIEVE 2.1 and XCMS]

To be comprehensive, I would use both software packages, which makes a cool 1141 differentiating features, or 423 differentially changed features at >10-fold (with few, if any, false positives)! This is the most appropriate strategy reported so far and I call it ‘systems level feature mining’ for systems level understanding. Only in this way can the true potential of metabolomics and lipidomics studies be revealed, putting them at the forefront of other omics approaches, whether in making sense for systems biology or in more practical applications.
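Combining two feature lists means deciding when a feature from one tool is "the same" as a feature from the other. The sketch below shows one simple way to do it in Python: a greedy one-to-one match on m/z and retention time. It is not the procedure used for the numbers above. The tolerances (0.005 Da and 0.1 min) are borrowed from the XCMS and SIEVE settings as an assumption, and the feature values are invented for illustration.

```python
def match_features(list_a, list_b, mz_tol=0.005, rt_tol=0.1):
    """Greedy one-to-one matching of (mz, rt) features from two tools.

    Returns (pairs, only_a, only_b), where pairs holds index tuples
    (i, j) and only_a / only_b hold the indices left unmatched.
    mz_tol (Da) and rt_tol (min) are assumed tolerances.
    """
    unmatched_b = list(range(len(list_b)))
    pairs = []
    for i, (mz_a, rt_a) in enumerate(list_a):
        best = None  # (mz difference, index in list_b)
        for j in unmatched_b:
            mz_b, rt_b = list_b[j]
            mz_diff = abs(mz_a - mz_b)
            if mz_diff <= mz_tol and abs(rt_a - rt_b) <= rt_tol:
                if best is None or mz_diff < best[0]:
                    best = (mz_diff, j)
        if best is not None:
            pairs.append((i, best[1]))
            unmatched_b.remove(best[1])
    matched_a = {i for i, _ in pairs}
    only_a = [i for i in range(len(list_a)) if i not in matched_a]
    return pairs, only_a, unmatched_b


# Hypothetical (m/z, RT in min) features, for illustration only.
xcms = [(180.0634, 1.20), (255.2319, 9.85), (496.3398, 11.02)]
sieve = [(180.0650, 1.25), (496.3401, 11.60), (132.1019, 0.80)]

pairs, only_xcms, only_sieve = match_features(xcms, sieve)
combined = len(pairs) + len(only_xcms) + len(only_sieve)
print(pairs, only_xcms, only_sieve, combined)
```

The matched pairs correspond to the overlapping region of the Venn plot, and the combined count to the union. Greedy matching is the simplest choice; in crowded regions of the chromatogram a stricter tolerance or an optimal assignment may be needed.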

Tip of the peak – try at least two different LC-MS data pre-processing software to finalise the feature list.

SIEVE would be back!

3 thoughts on “Features by tons – SIEVE 2.1 vs XCMS (rev-3)”

  1. Victor

    Nice data, though the very same phenomenon, of different results obtained by different software, has been known for quite some time in the field of proteomics. The approach taken there to deal with these worrying data processing facts or artefacts is called consensus scoring: only overlapping features are taken to increase the confidence of IDs, while all features are taken to increase the coverage. I, for one, am more in favour of taking overlapping features, as careful examination of all features found by XCMS and SIEVE finds quite a lot of false positives. For better confidence, I would also try the orthogonal Genedata software, available for free during 2-week trials, and look at the overlap between 3 different software packages. Overlap between all 3 will put those results on very solid ground.

  2. Nate

    MZmine offers another modular tool integrating XCMS algorithms/tools (plus a few others). Do you know anybody with CoMet, the Nonlinear Dynamics metabolomics software? I would love to see the three-way comparison!

    We have similar results from XCMS vs SIEVE. For picking “actual” peaks that we can go back to, do MS^n on, or quantify, SIEVE absolutely has one up over XCMS. A large serum lipidomics experiment run through XCMS gave a host of differentially abundant features. When we tried to actually find the peaks for structural elucidation, a vast majority turned out to be poorly integrated, noisy, choppy, and all sorts of “artifact-y”. Optimization of XCMS parameters did not resolve the issue. In short, it was great for making pretty PCAs, mirror plots, and heatmaps (you can overfit anything), but bad at finding things to actually investigate in a rigorous manner. I suspect those peaks are giving XCMS the larger number of features.

    1. Madhav Mondhe Post author

      I would say no single software is comprehensive in feature extraction and we absolutely need to combine the strengths of different software. We also need to eliminate false positives (see the earlier comment by Victor) or artefacts (as you mentioned), as I see that both XCMS and SIEVE processed data sets are strewn with these! We are working on this and several other related issues with the SIEVE processed data set.

      The Progenesis CoMet software is worth looking at, and a 3- or 4-way comparison would be awesome indeed, if time permitted.

