OpenMS 2.7.0
Detects features in MS1 data based on peptide identifications.
pot. predecessor tools | FeatureFinderIdentification | pot. successor tools
PeakPickerHiRes (optional) | | ProteinQuantifier
IDFilter | |
Reference:
Weisser & Choudhary: Targeted Feature Detection for Data-Dependent Shotgun Proteomics (J. Proteome Res., 2017, PMID: 28673088).
This tool detects quantitative features in MS1 data based on information from peptide identifications (derived from MS2 spectra). It uses algorithms for targeted data analysis from the OpenSWATH pipeline.
The aim is to detect features that enable the quantification of (ideally) all peptides in the identification input. This is based on the following principle: When a high-confidence identification (ID) of a peptide was made based on an MS2 spectrum from a certain (precursor) position in the LC-MS map, this indicates that the particular peptide is present at that position, so a feature for it should be detectable there.
Targeted data analysis on the MS1 level uses OpenSWATH algorithms and follows roughly the steps outlined below.
Use of inferred ("external") IDs
The situation becomes more complicated when several LC-MS/MS runs from related samples of a label-free experiment are considered. In order to quantify a larger fraction of the peptides/proteins in the samples, it is desirable to infer peptide identifications across runs. Ideally, all peptides identified in any of the runs should be quantified in each and every run. However, for feature detection of inferred ("external") IDs, the following problems arise: First, retention times may be shifted between the run being quantified and the run that gave rise to the ID. Such shifts can be corrected (see MapAlignerIdentification), but only to an extent. Thus, the RT location of the inferred ID may not necessarily lie within the RT range of the correct feature. Second, since the peptide in question was not directly identified in the run being quantified, it may not actually be present in detectable amounts in that sample, e.g. due to differential regulation of the corresponding protein. There is thus a risk of introducing false-positive features.
FeatureFinderIdentification deals with these challenges by explicitly distinguishing between internal IDs (derived from the LC-MS/MS run being quantified) and external IDs (inferred from related runs). Features derived from internal IDs give rise to a training dataset for an SVM classifier. The SVM is then used to predict which feature candidates derived from external IDs are most likely to be correct. See steps 4 and 5 below for more details.
1. Assay generation
Feature detection is based on assays for identified peptides, each of which incorporates the retention time (RT), mass-to-charge ratio (m/z), and isotopic distribution (derived from the sequence) of a peptide. Peptides with different modifications are considered different peptides. One assay will be generated for every combination of (modified) peptide sequence, charge state, and RT region that has been identified. The RT regions arise by pooling all identifications of the same peptide, considering a window of size extract:rt_window around every RT location that gave rise to an ID, and then merging overlapping windows.
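The pooling and merging of RT windows can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function name merge_rt_windows is hypothetical, and it assumes that each window is centred on the ID's RT and that windows which merely touch are also merged.

```python
def merge_rt_windows(id_rts, rt_window):
    """Return merged (start, end) RT regions for one peptide and charge state.

    id_rts: RT values (seconds) of all identifications of the peptide.
    rt_window: total window size in seconds (cf. extract:rt_window).
    """
    half = rt_window / 2.0
    # Assumption: one window of size rt_window centred on every ID RT.
    windows = sorted((rt - half, rt + half) for rt in id_rts)
    regions = []
    for start, end in windows:
        if regions and start <= regions[-1][1]:
            # Overlaps (or touches) the previous region: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions
```

With IDs at 100 s, 130 s and 400 s and a 60 s window, the first two windows overlap and collapse into one region, so two assays (per charge state) would result for this peptide.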
2. Ion chromatogram extraction
Ion chromatograms (XICs) are extracted from the LC-MS data (parameter in). One XIC per isotope in an assay is generated, with the corresponding m/z value and RT range (variable, depending on the RT region of the assay).
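As a rough illustration of which m/z values the XICs target, the Python sketch below computes the isotope m/z positions for an assay and the extraction bounds around each. The function names and mass constants are assumptions made for the example. The ppm-versus-Da rule and the defaults (2 isotopes, 10.0) follow the extract:n_isotopes and extract:mz_window parameter descriptions in the tool's help text; treating the window as a total width centred on the target m/z is an assumption.

```python
PROTON_MASS = 1.007276466812  # Da
ISOTOPE_SPACING = 1.0033548378  # Da, 13C - 12C mass difference (approximation)


def isotope_mzs(neutral_mass, charge, n_isotopes=2):
    """m/z of the monoisotopic peak and the following isotope peaks."""
    mono = (neutral_mass + charge * PROTON_MASS) / charge
    return [mono + i * ISOTOPE_SPACING / charge for i in range(n_isotopes)]


def mz_bounds(mz, mz_window=10.0):
    """Lower and upper m/z bound for chromatogram extraction.

    Values of 1 or greater are interpreted as ppm, smaller values as Da/Th.
    Assumption: mz_window is the total width, centred on mz.
    """
    width = mz * mz_window * 1e-6 if mz_window >= 1 else mz_window
    return (mz - width / 2.0, mz + width / 2.0)
```

For a doubly charged peptide the isotope traces are spaced about 0.5 Th apart, which is why the charge state is part of the assay definition.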
3. Feature detection
Next, feature candidates (typically several per assay) are detected in the XICs and scored. A variety of scores for different quality aspects are calculated by OpenSWATH.
4. Feature classification
Feature candidates derived from assays with "internal" IDs are classed as "negative" (candidates without matching internal IDs), "positive" (the single best candidate per assay with matching internal IDs), and "ambiguous" (other candidates with matching internal IDs). If "external" IDs were given as input, features based on them are initially classed as "unknown". In this case, a support vector machine (SVM) is additionally trained on the "positive" and "negative" candidates, to distinguish between the two classes based on the different OpenSWATH quality scores (plus an RT deviation score). After parameter optimization by cross-validation, the resulting SVM is used to predict the probability of "unknown" feature candidates being positives.
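The labelling rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's code: the dictionary keys ("assay", "n_internal_ids", "score") are hypothetical, and "best" is taken to mean the highest value of a single overall quality score, which the text above does not specify.

```python
from collections import defaultdict


def classify_candidates(candidates, external_only_assays=frozenset()):
    """Label each feature candidate; returns labels in input order.

    candidates: list of dicts with hypothetical keys
      'assay' (assay ID), 'n_internal_ids' (matching internal IDs),
      'score' (overall quality score; higher is better, an assumption).
    external_only_assays: assays that are based solely on external IDs.
    """
    by_assay = defaultdict(list)
    for idx, cand in enumerate(candidates):
        by_assay[cand["assay"]].append(idx)

    labels = [None] * len(candidates)
    for assay, indices in by_assay.items():
        if assay in external_only_assays:
            for i in indices:
                labels[i] = "unknown"
            continue
        matched = [i for i in indices if candidates[i]["n_internal_ids"] > 0]
        for i in indices:
            labels[i] = "ambiguous" if i in matched else "negative"
        if matched:
            # The single best candidate with matching internal IDs.
            best = max(matched, key=lambda i: candidates[i]["score"])
            labels[best] = "positive"
    return labels
```

The "positive" and "negative" candidates produced this way would form the SVM training set, while "unknown" candidates are the ones the trained SVM scores.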
5. Feature filtering
Feature candidates are filtered so that at most one feature per peptide and charge state remains. For assays with internal IDs, only candidates previously classed as "positive" are kept. For assays based solely on external IDs, the feature candidate with the highest SVM probability is selected and kept (possibly subject to the svm:min_prob threshold).
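A minimal Python sketch of this filtering step follows. The dictionary keys ("peptide", "charge", "label", "svm_prob") and the default min_prob=0.0 are hypothetical. How the tool chooses among several "positive" candidates for the same peptide and charge state (e.g. from different RT regions) is not stated above, so the sketch assumes the one with the highest SVM probability wins.

```python
def filter_candidates(candidates, min_prob=0.0):
    """Keep at most one candidate per (peptide, charge) pair.

    candidates: list of dicts with hypothetical keys
      'peptide', 'charge', 'label', 'svm_prob'.
    min_prob: probability threshold applied to external-only candidates
      (cf. svm:min_prob; the default used here is an assumption).
    """
    groups = {}
    for cand in candidates:
        groups.setdefault((cand["peptide"], cand["charge"]), []).append(cand)

    kept = []
    for group in groups.values():
        positives = [c for c in group if c["label"] == "positive"]
        if positives:
            # Assumption: break ties between positives by SVM probability.
            kept.append(max(positives, key=lambda c: c["svm_prob"]))
            continue
        unknowns = [c for c in group if c["label"] == "unknown"]
        if unknowns:
            best = max(unknowns, key=lambda c: c["svm_prob"])
            if best["svm_prob"] >= min_prob:
                kept.append(best)
    return kept
```

Note that an "ambiguous" candidate is never kept in this sketch, even if its probability exceeds that of the "positive" candidate for the same peptide.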
6. Elution model fitting
Elution models can be fitted to the features to improve the quantification. For robustness, one model is fitted to all isotopic mass traces of a feature in parallel. A symmetric (Gaussian) and an asymmetric (exponential-Gaussian hybrid) model type are available. The fitted models are checked for plausibility before they are accepted.
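For orientation, the two peak shapes can be written down as follows. This Python sketch uses the textbook forms of a Gaussian and of the exponential-Gaussian hybrid (EGH) function; the exact parameterisation and fitting procedure used by the tool are not described here, so the function and parameter names are assumptions.

```python
import math


def gaussian(t, height, center, sigma):
    """Symmetric elution model: Gaussian peak evaluated at time t."""
    return height * math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))


def egh(t, height, center, sigma, tau):
    """Asymmetric elution model: exponential-Gaussian hybrid at time t.

    tau controls the asymmetry (tau > 0 gives a tailing peak);
    with tau == 0 the function reduces to the Gaussian.
    """
    d = t - center
    denom = 2.0 * sigma ** 2 + tau * d
    if denom <= 0.0:
        # The EGH is defined as zero where the denominator is not positive.
        return 0.0
    return height * math.exp(-(d ** 2) / denom)
```

Fitting one model to all isotopic mass traces at once, as described above, would correspond to sharing the shape parameters (center, sigma, tau) across traces while letting the height vary per trace; that reading is an interpretation, not something the text confirms.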
Finally, the results (feature maps, parameter out) are returned.
The command line parameters of this tool are:
FeatureFinderIdentification -- Detects features in MS1 data based on peptide identifications.
Full documentation: http://www.openms.de/doxygen/release/2.7.0/html/TOPP_FeatureFinderIdentification.html
Version: 2.7.0 Sep 13 2021, 20:58:47, Revision: 9110e58
To cite OpenMS:
  Rost HL, Sachsenberg T, Aiche S, Bielow C et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Meth. 2016; 13, 9: 741-748. doi:10.1038/nmeth.3959.
To cite FeatureFinderIdentification:
  Weisser H, Choudhary JS. Targeted Feature Detection for Data-Dependent Shotgun Proteomics. J. Proteome Res. 2017; 16, 8:2964-2974. doi:10.1021/acs.jproteome.7b00248.

Usage:
  FeatureFinderIdentification <options>

Options (mandatory options marked with '*'):
  -in <file>*                        Input file: LC-MS raw data (valid formats: 'mzML')
  -id <file>*                        Input file: Peptide identifications derived directly from 'in' (valid formats: 'idXML')
  -id_ext <file>                     Input file: 'External' peptide identifications (e.g. from aligned runs) (valid formats: 'idXML')
  -out <file>*                       Output file: Features (valid formats: 'featureXML')
  -lib_out <file>                    Output file: Assay library (valid formats: 'traML')
  -chrom_out <file>                  Output file: Chromatograms (valid formats: 'mzML')
  -candidates_out <file>             Output file: Feature candidates (before filtering and model fitting) (valid formats: 'featureXML')
  -quantify_decoys                   Whether decoy peptides should be quantified (true) or skipped (false).

Parameters for ion chromatogram extraction:
  -extract:batch_size <number>       Nr of peptides used in each batch of chromatogram extraction. Smaller values decrease memory usage but increase runtime. (default: '5000' min: '1')
  -extract:mz_window <value>         M/z window size for chromatogram extraction (unit: ppm if 1 or greater, else Da/Th) (default: '10.0' min: '0.0')
  -extract:n_isotopes <number>       Number of isotopes to include in each peptide assay. (default: '2' min: '2')

Parameters for detecting features in extracted ion chromatograms:
  -detect:peak_width <value>         Expected elution peak width in seconds, for smoothing (Gauss filter). Also determines the RT extraction window, unless set explicitly via 'extract:rt_window'. (default: '60.0' min: '0.0')
  -detect:mapping_tolerance <value>  RT tolerance (plus/minus) for mapping peptide IDs to features. Absolute value in seconds if 1 or greater, else relative to the RT span of the feature. (default: '0.0' min: '0.0')

Parameters for scoring features using a support vector machine (SVM):
  -svm:samples <number>              Number of observations to use for training ('0' for all) (default: '0' min: '0')
  -svm:no_selection                  By default, roughly the same number of positive and negative observations, with the same intensity distribution, are selected for training. This aims to reduce biases, but also reduces the amount of training data. Set this flag to skip this procedure and consider all available observations (subject to 'svm:samples').
  -svm:xval_out <choice>             Output file: SVM cross-validation (parameter optimization) results (valid formats: 'csv')
  -svm:kernel <choice>               SVM kernel (default: 'RBF' valid: 'RBF', 'linear')
  -svm:xval <number>                 Number of partitions for cross-validation (parameter optimization) (default: '5' min: '1')
  -svm:log2_C <values>               Values to try for the SVM parameter 'C' during parameter optimization. A value 'x' is used as 'C = 2^x'. (default: '[-5.0 -3.0 -1.0 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0]')
  -svm:log2_gamma <values>           Values to try for the SVM parameter 'gamma' during parameter optimization (RBF kernel only). A value 'x' is used as 'gamma = 2^x'. (default: '[-15.0 -13.0 -11.0 -9.0 -7.0 -5.0 -3.0 -1.0 1.0 3.0]')

Parameters for fitting elution models to features:
  -model:type <choice>               Type of elution model to fit to features (default: 'symmetric' valid: 'symmetric', 'asymmetric', 'none')

Parameters for fitting exp. mod. Gaussians to mass traces.:
  -EMGScoring:max_iteration <number> Maximum number of iterations for EMG fitting. (default: '100' min: '1')
  -EMGScoring:init_mom               Alternative initial parameters for fitting through method of moments.

Common TOPP options:
  -ini <file>                        Use the given TOPP INI file
  -threads <n>                       Sets the number of threads allowed to be used by the TOPP tool (default: '1')
  -write_ini <file>                  Writes the default configuration file
  --help                             Shows options
  --helphelp                         Shows all options (including advanced)
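As a usage illustration, the following Python sketch assembles a command line from the options listed above. The helper name build_ffid_command and all file names are hypothetical; only flags that appear in the help text are used, and the sketch assumes the FeatureFinderIdentification executable is on the PATH when the command is eventually run.

```python
VALID_MODEL_TYPES = ("symmetric", "asymmetric", "none")


def build_ffid_command(mzml, idxml, out, id_ext=None,
                       model_type="symmetric", threads=1):
    """Build an argument list for FeatureFinderIdentification.

    mzml, idxml, out: mandatory -in, -id and -out files.
    id_ext: optional idXML file with external IDs (-id_ext).
    """
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError("model_type must be one of %r" % (VALID_MODEL_TYPES,))
    cmd = ["FeatureFinderIdentification",
           "-in", mzml, "-id", idxml, "-out", out]
    if id_ext is not None:
        cmd += ["-id_ext", id_ext]
    cmd += ["-model:type", model_type, "-threads", str(threads)]
    return cmd


# Example with hypothetical file names; pass the list to
# subprocess.run(cmd, check=True) to actually execute the tool.
cmd = build_ffid_command("run1.mzML", "run1.idXML", "run1.featureXML",
                         id_ext="other_runs.idXML", threads=4)
```

Building the arguments as a list rather than a single string avoids shell quoting problems with file paths that contain spaces.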
INI file documentation of this tool: