OpenMS 2.5.0
FeatureFinderIdentification

Detects features in MS1 data based on peptide identifications.

Potential predecessor tools → FeatureFinderIdentification → potential successor tools
  Predecessors: PeakPickerHiRes (optional), IDFilter
  Successors: ProteinQuantifier

Reference:
Weisser & Choudhary: Targeted Feature Detection for Data-Dependent Shotgun Proteomics (J. Proteome Res., 2017, PMID: 28673088).

This tool detects quantitative features in MS1 data based on information from peptide identifications (derived from MS2 spectra). It uses algorithms for targeted data analysis from the OpenSWATH pipeline.

The aim is to detect features that enable the quantification of (ideally) all peptides in the identification input. This is based on the following principle: When a high-confidence identification (ID) of a peptide was made based on an MS2 spectrum from a certain (precursor) position in the LC-MS map, this indicates that the particular peptide is present at that position, so a feature for it should be detectable there.

Note
It is important that only high-confidence (i.e. reliable) peptide identifications are used as input!

Targeted data analysis on the MS1 level uses OpenSWATH algorithms and follows roughly the steps outlined below.

Use of inferred ("external") IDs

The situation becomes more complicated when several LC-MS/MS runs from related samples of a label-free experiment are considered. In order to quantify a larger fraction of the peptides/proteins in the samples, it is desirable to infer peptide identifications across runs. Ideally, all peptides identified in any of the runs should be quantified in each and every run. However, for feature detection of inferred ("external") IDs, the following problems arise: First, retention times may be shifted between the run being quantified and the run that gave rise to the ID. Such shifts can be corrected (see MapAlignerIdentification), but only to an extent. Thus, the RT location of the inferred ID may not necessarily lie within the RT range of the correct feature. Second, since the peptide in question was not directly identified in the run being quantified, it may not actually be present in detectable amounts in that sample, e.g. due to differential regulation of the corresponding protein. There is thus a risk of introducing false-positive features.

FeatureFinderIdentification deals with these challenges by explicitly distinguishing between internal IDs (derived from the LC-MS/MS run being quantified) and external IDs (inferred from related runs). Features derived from internal IDs give rise to a training dataset for an SVM classifier. The SVM is then used to predict which feature candidates derived from external IDs are most likely to be correct. See steps 4 and 5 below for more details.

1. Assay generation

Feature detection is based on assays for identified peptides, each of which incorporates the retention time (RT), mass-to-charge ratio (m/z), and isotopic distribution (derived from the sequence) of a peptide. Peptides with different modifications are considered different peptides. One assay will be generated for every combination of (modified) peptide sequence, charge state, and RT region that has been identified. The RT regions arise by pooling all identifications of the same peptide, considering a window of size extract:rt_window around every RT location that gave rise to an ID, and then merging overlapping windows.
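
The pooling and merging of RT windows described above can be sketched as follows. This is a minimal illustration, not the OpenMS implementation: the function name `merge_rt_windows` is hypothetical, and it assumes that `extract:rt_window` is the full window size (in seconds) centered on each ID.

```python
def merge_rt_windows(id_rts, rt_window):
    """Return merged (start, end) RT regions for one peptide.

    id_rts:    RT values (seconds) of all identifications of the peptide.
    rt_window: full window size in seconds, centered on each ID
               (assumption; mirrors 'extract:rt_window').
    """
    half = rt_window / 2.0
    windows = sorted((rt - half, rt + half) for rt in id_rts)
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two nearby IDs collapse into one RT region; the distant ID gets its own.
regions = merge_rt_windows([100.0, 130.0, 400.0], rt_window=60.0)
print(regions)  # [(70.0, 160.0), (370.0, 430.0)]
```

Each resulting region would then give rise to one assay per (modified) peptide sequence and charge state.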

2. Ion chromatogram extraction

Ion chromatograms (XICs) are extracted from the LC-MS data (parameter in). One XIC per isotope in an assay is generated, with the corresponding m/z value and RT range (variable, depending on the RT region of the assay).
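
As a rough illustration of this step, the sketch below computes the isotope m/z values for an assay and extracts one XIC per isotope from a toy list of spectra. All names (`isotope_mzs`, `mz_tolerance`, `extract_xic`) are hypothetical. The sketch assumes that isotopes are spaced by the 13C-12C mass difference divided by the charge, that `extract:mz_window` is a full window width (ppm if 1 or greater, else Da/Th, as stated in the parameter help), and that intensities inside the window are summed. The actual OpenSWATH extraction may differ in these details.

```python
C13_MASS_DIFF = 1.0033548  # approx. mass difference between 13C and 12C (Da)


def isotope_mzs(mono_mz, charge, n_isotopes=2):
    """m/z values of the first n_isotopes isotopic peaks of a peptide ion."""
    return [mono_mz + k * C13_MASS_DIFF / charge for k in range(n_isotopes)]


def mz_tolerance(mz, mz_window):
    """Half-width (in Th) of the extraction window around mz.

    mz_window is read as ppm if >= 1, else as Da/Th; it is treated as the
    full window width here (assumption).
    """
    full = mz * mz_window * 1e-6 if mz_window >= 1.0 else mz_window
    return full / 2.0


def extract_xic(spectra, target_mz, rt_start, rt_end, mz_window=10.0):
    """spectra: list of (rt, [(mz, intensity), ...]); returns [(rt, intensity)]."""
    tol = mz_tolerance(target_mz, mz_window)
    xic = []
    for rt, peaks in spectra:
        if rt_start <= rt <= rt_end:
            # Sum all peak intensities inside the m/z window (assumption).
            total = sum(i for mz, i in peaks if abs(mz - target_mz) <= tol)
            xic.append((rt, total))
    return xic


# Toy MS1 data: three spectra, the last one outside the RT region.
spectra = [
    (10.0, [(500.000, 100.0), (500.5017, 50.0), (600.0, 5.0)]),
    (20.0, [(500.001, 200.0)]),
    (99.0, [(500.000, 1.0)]),
]
for mz in isotope_mzs(500.0, charge=2):
    print(round(mz, 4), extract_xic(spectra, mz, rt_start=0.0, rt_end=50.0))
```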

See also
OpenSwathChromatogramExtractor

3. Feature detection

Next, feature candidates (typically several per assay) are detected in the XICs and scored. A variety of scores for different quality aspects are calculated by OpenSWATH.

See also
OpenSwathAnalyzer

4. Feature classification

Feature candidates derived from assays with "internal" IDs are classed as "negative" (candidates without matching internal IDs), "positive" (the single best candidate per assay with matching internal IDs), or "ambiguous" (other candidates with matching internal IDs). If "external" IDs were given as input, features based on them are initially classed as "unknown". In that case, a support vector machine (SVM) is also trained on the "positive" and "negative" candidates to distinguish between the two classes based on the different OpenSWATH quality scores (plus an RT deviation score). After parameter optimization by cross-validation, the resulting SVM is used to predict the probability that each "unknown" feature candidate is a positive.
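
The labelling rules can be illustrated with a small sketch. The data layout (dictionaries with the keys `assay`, `score`, `n_internal_ids`, and `assay_has_internal_ids`) and the function name are hypothetical. In particular, the text does not say how the "single best" candidate is chosen, so the sketch assumes it is the one with the highest overall score. SVM training itself is not shown.

```python
from collections import defaultdict


def classify_candidates(candidates):
    """Add a 'label' key to each candidate dict and return the list.

    Expected keys (all hypothetical):
      'assay'                  - identifier of the assay
      'score'                  - overall quality score (higher is better)
      'n_internal_ids'         - number of internal IDs matching the candidate
      'assay_has_internal_ids' - False for assays based solely on external IDs
    """
    by_assay = defaultdict(list)
    for cand in candidates:
        by_assay[cand["assay"]].append(cand)

    for group in by_assay.values():
        if not group[0]["assay_has_internal_ids"]:
            # Assay based on external IDs only: to be scored by the SVM later.
            for cand in group:
                cand["label"] = "unknown"
            continue
        matched = [c for c in group if c["n_internal_ids"] > 0]
        # Assumption: "best" means the highest score among matched candidates.
        best = max(matched, key=lambda c: c["score"]) if matched else None
        for cand in group:
            if cand["n_internal_ids"] == 0:
                cand["label"] = "negative"
            elif cand is best:
                cand["label"] = "positive"
            else:
                cand["label"] = "ambiguous"
    return candidates


example = [
    {"assay": "PEPTIDEK/2", "score": 0.9, "n_internal_ids": 2, "assay_has_internal_ids": True},
    {"assay": "PEPTIDEK/2", "score": 0.4, "n_internal_ids": 1, "assay_has_internal_ids": True},
    {"assay": "PEPTIDEK/2", "score": 0.7, "n_internal_ids": 0, "assay_has_internal_ids": True},
    {"assay": "ELVISK/3", "score": 0.8, "n_internal_ids": 0, "assay_has_internal_ids": False},
]
labels = [c["label"] for c in classify_candidates(example)]
print(labels)  # ['positive', 'ambiguous', 'negative', 'unknown']
```

Only the "positive" and "negative" candidates would be used as SVM training data; "ambiguous" candidates are left out of training.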

5. Feature filtering

Feature candidates are filtered so that at most one feature per peptide and charge state remains. For assays with internal IDs, only candidates previously classed as "positive" are kept. For assays based solely on external IDs, the feature candidate with the highest SVM probability is selected and kept (possibly subject to the svm:min_prob threshold).
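
A minimal sketch of this filtering logic is given below. The candidate layout (keys `peptide`, `charge`, `label`, `svm_prob`) and the function name are hypothetical. The sketch assumes there is at most one "positive" candidate per peptide and charge state; how several positives from different RT regions would be resolved is not described here and is not modelled.

```python
def filter_candidates(candidates, min_prob=0.0):
    """Keep at most one candidate per (peptide, charge).

    candidates: dicts with the (hypothetical) keys 'peptide', 'charge',
                'label' and 'svm_prob'.
    min_prob:   mirrors the 'svm:min_prob' threshold.
    """
    groups = {}
    for cand in candidates:
        groups.setdefault((cand["peptide"], cand["charge"]), []).append(cand)

    kept = []
    for group in groups.values():
        positives = [c for c in group if c["label"] == "positive"]
        if positives:
            # Internal IDs available: only the positive candidate is kept.
            kept.append(positives[0])
            continue
        unknowns = [c for c in group if c["label"] == "unknown"]
        if unknowns:
            # External IDs only: pick the highest SVM probability.
            best = max(unknowns, key=lambda c: c["svm_prob"])
            if best["svm_prob"] >= min_prob:
                kept.append(best)
    return kept


example = [
    {"peptide": "PEPTIDEK", "charge": 2, "label": "positive", "svm_prob": 1.0},
    {"peptide": "PEPTIDEK", "charge": 2, "label": "ambiguous", "svm_prob": 0.6},
    {"peptide": "ELVISK", "charge": 3, "label": "unknown", "svm_prob": 0.3},
    {"peptide": "ELVISK", "charge": 3, "label": "unknown", "svm_prob": 0.8},
    {"peptide": "LIVESK", "charge": 2, "label": "unknown", "svm_prob": 0.2},
]
kept = filter_candidates(example, min_prob=0.5)
print([(c["peptide"], c["svm_prob"]) for c in kept])
# [('PEPTIDEK', 1.0), ('ELVISK', 0.8)]
```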

6. Elution model fitting

Elution models can be fitted to the features to improve the quantification. For robustness, one model is fitted to all isotopic mass traces of a feature in parallel. A symmetric (Gaussian) and an asymmetric (exponential-Gaussian hybrid) model type are available. The fitted models are checked for plausibility before they are accepted.
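
For orientation, the two model shapes can be written down as plain functions. The sketch below uses the commonly cited form of the exponential-Gaussian hybrid (EGH), in which a parameter `tau` controls the asymmetry and `tau = 0` reduces the model to a Gaussian. The exact parameterization used inside OpenMS is an assumption here, and the fitting procedure itself is not shown.

```python
import math


def gaussian(t, height, center, sigma):
    """Symmetric elution model: a Gaussian peak."""
    return height * math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))


def egh(t, height, center, sigma, tau):
    """Asymmetric elution model: exponential-Gaussian hybrid (EGH).

    tau > 0 produces a tail towards later RTs; tau = 0 gives a Gaussian.
    """
    dt = t - center
    denom = 2.0 * sigma ** 2 + tau * dt
    if denom <= 0.0:
        # The EGH is defined as zero where the denominator is not positive.
        return 0.0
    return height * math.exp(-(dt ** 2) / denom)


# With tau = 2 the peak tails to the right: the value 10 s after the apex
# is larger than the value 10 s before it.
left = egh(90.0, height=1000.0, center=100.0, sigma=5.0, tau=2.0)
right = egh(110.0, height=1000.0, center=100.0, sigma=5.0, tau=2.0)
print(left < right)  # True
```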

Finally, the results (feature maps, parameter out) are returned.

Note
Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

FeatureFinderIdentification -- Detects features in MS1 data based on peptide identifications.
Full documentation: http://www.openms.de/documentation/TOPP_FeatureFinderIdentification.html
Version: 2.5.0 Feb 20 2020, 20:13:06, Revision: f649042
To cite OpenMS:
  Rost HL, Sachsenberg T, Aiche S, Bielow C et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Meth. 2016; 13, 9: 741-748. doi:10.1038/nmeth.3959.
To cite FeatureFinderIdentification:
  Weisser H, Choudhary JS. Targeted Feature Detection for Data-Dependent Shotgun Proteomics. J. Proteome Res. 2017; 16, 8:2964-2974. doi:10.1021/acs.jproteome.7b00248.

Usage:
  FeatureFinderIdentification <options>

Options (mandatory options marked with '*'):
  -in <file>*                        Input file: LC-MS raw data (valid formats: 'mzML')
  -id <file>*                        Input file: Peptide identifications derived directly from 'in' (valid 
                                     formats: 'idXML')
  -id_ext <file>                     Input file: 'External' peptide identifications (e.g. from aligned runs) 
                                     (valid formats: 'idXML')
  -out <file>*                       Output file: Features (valid formats: 'featureXML')
  -lib_out <file>                    Output file: Assay library (valid formats: 'traML')
  -chrom_out <file>                  Output file: Chromatograms (valid formats: 'mzML')
  -candidates_out <file>             Output file: Feature candidates (before filtering and model fitting) 
                                     (valid formats: 'featureXML')

Parameters for ion chromatogram extraction:
  -extract:batch_size <number>       Nr of peptides used in each batch of chromatogram extraction. Smaller 
                                     values decrease memory usage but increase runtime. (default: '1000' min:
                                     '1')
  -extract:mz_window <value>         M/z window size for chromatogram extraction (unit: ppm if 1 or greater, 
                                     else Da/Th) (default: '10.0' min: '0.0')
  -extract:n_isotopes <number>       Number of isotopes to include in each peptide assay. (default: '2' min: 
                                     '2')

Parameters for detecting features in extracted ion chromatograms:
  -detect:peak_width <value>         Expected elution peak width in seconds, for smoothing (Gauss filter). 
                                     Also determines the RT extraction window, unless set explicitly via
                                     'extract:rt_window'. (default: '60.0' min: '0.0')
  -detect:mapping_tolerance <value>  RT tolerance (plus/minus) for mapping peptide IDs to features. Absolute 
                                     value in seconds if 1 or greater, else relative to the RT span of the
                                     feature. (default: '0.0' min: '0.0')

Parameters for scoring features using a support vector machine (SVM):
  -svm:samples <number>              Number of observations to use for training ('0' for all) (default: '0' 
                                     min: '0')
  -svm:no_selection                  By default, roughly the same number of positive and negative
                                     observations, with the same intensity distribution, are selected for
                                     training. This aims to reduce biases, but also reduces the amount of
                                     training data. Set this flag to skip this procedure and consider all
                                     available observations (subject to 'svm:samples').
  -svm:xval_out <choice>             Output file: SVM cross-validation (parameter optimization) results
                                     (valid formats: 'csv')
  -svm:kernel <choice>               SVM kernel (default: 'RBF' valid: 'RBF', 'linear')
  -svm:xval <number>                 Number of partitions for cross-validation (parameter optimization)
                                     (default: '5' min: '1')
  -svm:log2_C <values>               Values to try for the SVM parameter 'C' during parameter optimization. 
                                     A value 'x' is used as 'C = 2^x'. (default: '[-5.0 -3.0 -1.0 1.0 3.0
                                     5.0 7.0 9.0 11.0 13.0 15.0]')
  -svm:log2_gamma <values>           Values to try for the SVM parameter 'gamma' during parameter
                                     optimization (RBF kernel only). A value 'x' is used as 'gamma = 2^x'.
                                     (default: '[-15.0 -13.0 -11.0 -9.0 -7.0 -5.0 -3.0 -1.0 1.0 3.0]')

Parameters for fitting elution models to features:
  -model:type <choice>               Type of elution model to fit to features (default: 'symmetric' valid: 
                                     'symmetric', 'asymmetric', 'none')

                                     
Common TOPP options:
  -ini <file>                        Use the given TOPP INI file
  -threads <n>                       Sets the number of threads allowed to be used by the TOPP tool (default:
                                     '1')
  -write_ini <file>                  Writes the default configuration file
  --help                             Shows options
  --helphelp                         Shows all options (including advanced)
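
As a usage illustration, the following sketch assembles a command line from the options listed above. The helper name `build_ffid_command` and all file names are hypothetical; only flags that appear in the help output (`-in`, `-id`, `-id_ext`, `-out`, `-model:type`) are used. The command is only built and printed; actually running it requires an OpenMS installation on the PATH.

```python
def build_ffid_command(mzml, idxml, out, id_ext=None, model_type="symmetric"):
    """Build an argument list for FeatureFinderIdentification.

    All file names are supplied by the caller; nothing is executed here.
    """
    if model_type not in ("symmetric", "asymmetric", "none"):
        raise ValueError("model_type must be 'symmetric', 'asymmetric' or 'none'")
    cmd = ["FeatureFinderIdentification", "-in", mzml, "-id", idxml, "-out", out]
    if id_ext is not None:
        # Optional 'external' IDs inferred from related (aligned) runs.
        cmd += ["-id_ext", id_ext]
    cmd += ["-model:type", model_type]
    return cmd


cmd = build_ffid_command(
    "run1.mzML", "run1.idXML", "run1.featureXML", id_ext="other_runs.idXML"
)
print(" ".join(cmd))
# To execute (assumes OpenMS is installed):
#   import subprocess; subprocess.run(cmd, check=True)
```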

INI file documentation of this tool:

Legend: '*' marks a required parameter.
Each entry reads: name (default: 'value'): description [type; restrictions]

+ FeatureFinderIdentification: Detects features in MS1 data based on peptide identifications.
    version (default: '2.5.0'): Version of the tool that generated this parameters file.
++ 1: Instance '1' section for 'FeatureFinderIdentification'
    in*: Input file: LC-MS raw data [input file; *.mzML]
    id*: Input file: Peptide identifications derived directly from 'in' [input file; *.idXML]
    id_ext: Input file: 'External' peptide identifications (e.g. from aligned runs) [input file; *.idXML]
    out*: Output file: Features [output file; *.featureXML]
    lib_out: Output file: Assay library [output file; *.traML]
    chrom_out: Output file: Chromatograms [output file; *.mzML]
    candidates_out: Output file: Feature candidates (before filtering and model fitting) [output file; *.featureXML]
    candidates_in: Input file: Feature candidates from a previous run. If set, only feature classification and elution model fitting are carried out, if enabled. Many parameters are ignored. [input file; *.featureXML]
    debug (default: '0'): Sets the debug level [0:∞]
    log: Name of log file (created only when specified)
    threads (default: '1'): Sets the number of threads allowed to be used by the TOPP tool
    no_progress (default: 'false'): Disables progress logging to command line [true,false]
    force (default: 'false'): Overwrite tool specific checks. [true,false]
    test (default: 'false'): Enables the test mode (needed for internal use only) [true,false]
+++ extract: Parameters for ion chromatogram extraction
    batch_size (default: '1000'): Nr of peptides used in each batch of chromatogram extraction. Smaller values decrease memory usage but increase runtime. [1:∞]
    mz_window (default: '10.0'): m/z window size for chromatogram extraction (unit: ppm if 1 or greater, else Da/Th) [0.0:∞]
    n_isotopes (default: '2'): Number of isotopes to include in each peptide assay. [2:∞]
    isotope_pmin (default: '0.0'): Minimum probability for an isotope to be included in the assay for a peptide. If set, this parameter takes precedence over 'extract:n_isotopes'. [0.0:1.0]
    rt_quantile (default: '0.95'): Quantile of the RT deviations between aligned internal and external IDs to use for scaling the RT extraction window [0.0:1.0]
    rt_window (default: '0.0'): RT window size (in sec.) for chromatogram extraction. If set, this parameter takes precedence over 'extract:rt_quantile'. [0.0:∞]
+++ detect: Parameters for detecting features in extracted ion chromatograms
    peak_width (default: '60.0'): Expected elution peak width in seconds, for smoothing (Gauss filter). Also determines the RT extraction window, unless set explicitly via 'extract:rt_window'. [0.0:∞]
    min_peak_width (default: '0.2'): Minimum elution peak width. Absolute value in seconds if 1 or greater, else relative to 'peak_width'. [0.0:∞]
    signal_to_noise (default: '0.8'): Signal-to-noise threshold for OpenSWATH feature detection [0.1:∞]
    mapping_tolerance (default: '0.0'): RT tolerance (plus/minus) for mapping peptide IDs to features. Absolute value in seconds if 1 or greater, else relative to the RT span of the feature. [0.0:∞]
+++ svm: Parameters for scoring features using a support vector machine (SVM)
    samples (default: '0'): Number of observations to use for training ('0' for all) [0:∞]
    no_selection (default: 'false'): By default, roughly the same number of positive and negative observations, with the same intensity distribution, are selected for training. This aims to reduce biases, but also reduces the amount of training data. Set this flag to skip this procedure and consider all available observations (subject to 'svm:samples'). [true,false]
    xval_out: Output file: SVM cross-validation (parameter optimization) results [output file; *.csv]
    kernel (default: 'RBF'): SVM kernel [RBF,linear]
    xval (default: '5'): Number of partitions for cross-validation (parameter optimization) [1:∞]
    log2_C (default: '[-5.0, -3.0, -1.0, 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]'): Values to try for the SVM parameter 'C' during parameter optimization. A value 'x' is used as 'C = 2^x'.
    log2_gamma (default: '[-15.0, -13.0, -11.0, -9.0, -7.0, -5.0, -3.0, -1.0, 1.0, 3.0]'): Values to try for the SVM parameter 'gamma' during parameter optimization (RBF kernel only). A value 'x' is used as 'gamma = 2^x'.
    epsilon (default: '0.001'): Stopping criterion [0.0:∞]
    cache_size (default: '100.0'): Size of the kernel cache (in MB) [1.0:∞]
    no_shrinking (default: 'false'): Disable the shrinking heuristics [true,false]
    predictors (default: 'peak_apices_sum,var_xcorr_coelution,var_xcorr_shape,var_library_sangle,var_intensity_score,sn_ratio,var_log_sn_score,var_elution_model_fit_score,xx_lda_prelim_score,var_isotope_correlation_score,var_isotope_overlap_score,var_massdev_score,main_var_xx_swath_prelim_score'): Names of OpenSWATH scores to use as predictors for the SVM (comma-separated list)
    min_prob (default: '0.0'): Minimum probability of correctness, as predicted by the SVM, required to retain a feature candidate [0.0:1.0]
+++ model: Parameters for fitting elution models to features
    type (default: 'symmetric'): Type of elution model to fit to features [symmetric,asymmetric,none]
    add_zeros (default: '0.2'): Add zero-intensity points outside the feature range to constrain the model fit. This parameter sets the weight given to these points during model fitting; '0' to disable. [0.0:∞]
    unweighted_fit (default: 'false'): Suppress weighting of mass traces according to theoretical intensities when fitting elution models [true,false]
    no_imputation (default: 'false'): If fitting the elution model fails for a feature, set its intensity to zero instead of imputing a value from the initial intensity estimate [true,false]
++++ check: Parameters for checking the validity of elution models (and rejecting them if necessary)
    min_area (default: '1.0'): Lower bound for the area under the curve of a valid elution model [0.0:∞]
    boundaries (default: '0.5'): Time points corresponding to this fraction of the elution model height have to be within the data region used for model fitting [0.0:1.0]
    width (default: '10.0'): Upper limit for acceptable widths of elution models (Gaussian or EGH), expressed in terms of modified (median-based) z-scores; '0' to disable [0.0:∞]
    asymmetry (default: '10.0'): Upper limit for acceptable asymmetry of elution models (EGH only), expressed in terms of modified (median-based) z-scores; '0' to disable [0.0:∞]