OpenMS
IDPosteriorErrorProbability

Tool to estimate the probability that peptide hits are incorrectly assigned.

Potential predecessor tools: MascotAdapter (or other ID engines)
Potential successor tools: ConsensusID
Experimental classes:
This tool has not been tested thoroughly and might not behave as expected!

By default an estimation is performed using the (inverse) Gumbel distribution for incorrectly assigned sequences and a Gaussian distribution for correctly assigned sequences. The probabilities are calculated by using Bayes' law, similar to PeptideProphet. Alternatively, a second Gaussian distribution can be used for incorrectly assigned sequences. At the moment, IDPosteriorErrorProbability is able to handle X! Tandem, Mascot, MyriMatch and OMSSA scores.
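
The Bayes' law step can be illustrated with a minimal Python sketch. It assumes the standard (maximum) Gumbel density for incorrect hits and a Gaussian density for correct hits; the function names, the parametrisation, and all parameter values are hypothetical, and the exact form fitted by OpenMS may differ.

```python
import math

def gumbel_pdf(x, loc, scale):
    # Standard Gumbel density; assumed parametrisation for illustration only.
    z = (x - loc) / scale
    return math.exp(-(z + math.exp(-z))) / scale

def gauss_pdf(x, mean, sd):
    # Gaussian (normal) density.
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_error_probability(x, prior_incorrect, incorrect_params, correct_params):
    # Bayes' law on a two-component mixture:
    # P(incorrect | x) = p0 * f_inc(x) / (p0 * f_inc(x) + (1 - p0) * f_cor(x))
    neg = prior_incorrect * gumbel_pdf(x, *incorrect_params)
    pos = (1.0 - prior_incorrect) * gauss_pdf(x, *correct_params)
    return neg / (neg + pos)

# Hypothetical mixture: incorrect hits cluster near score 1, correct hits near score 4.
pep_low = posterior_error_probability(1.0, 0.7, (1.0, 0.5), (4.0, 1.0))
pep_high = posterior_error_probability(4.0, 0.7, (1.0, 0.5), (4.0, 1.0))
print(pep_low, pep_high)
```

A score near the incorrect component yields an error probability close to 1, a score near the correct component one close to 0. With 'prob_correct' set, the reported value would correspond to 1 minus this quantity.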

No target/decoy information needs to be provided, since the model fits are done on the mixed distribution.

To validate the computed probabilities, an optional plot output can be generated. Two parameters affect the plot, 'number_of_bins' and 'out_plot'. The scores are plotted as bins. Each bin covers a score range of width '(highest_score - smallest_score) / number_of_bins' (if all scores have positive values), and the midpoint of a bin is the mean of the scores it contains. The parameter 'out_plot' should be used to give the plot a unique name. Two files are created: one with the binned scores and one with all steps of the estimation.

If the parameter 'top_hits_only' is set, only the top hits of each peptide identification are used for the estimation. If, in addition, target/decoy information is available and a FalseDiscoveryRate run was performed previously, an additional plot with target and decoy bins is generated ('out_plot' must not be empty). A peptide hit is assumed to be a target if its q-value is smaller than 'fdr_for_targets_smaller'.

The plots are saved as a Gnuplot file. The tool then attempts to call Gnuplot, which creates a PDF file containing all steps of the estimation. If this fails, the user has to run Gnuplot manually, or adjust the PATH environment variable so that Gnuplot can be found, and retry.
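
The binning rule described above can be sketched in Python as follows. This is an illustration only, not the tool's implementation: the function name is hypothetical, and placing the highest score into the last bin is an assumption.

```python
def bin_scores(scores, number_of_bins=100):
    """Return (mean_score, count) for every non-empty bin."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / number_of_bins  # bin width as given in the text
    if width == 0:
        # All scores are identical: a single bin holds everything.
        return [(lo, len(scores))]
    sums = [0.0] * number_of_bins
    counts = [0] * number_of_bins
    for s in scores:
        # Assumption: the highest score is placed into the last bin.
        idx = min(int((s - lo) / width), number_of_bins - 1)
        sums[idx] += s
        counts[idx] += 1
    # The bin "midpoint" is the mean of the scores the bin contains.
    return [(sums[i] / counts[i], counts[i])
            for i in range(number_of_bins) if counts[i]]

bins = bin_scores([0.0, 1.0, 2.0, 9.0, 10.0], number_of_bins=2)
print(bins)
```

With two bins over the range 0 to 10, the first three scores share one bin with mean 1.0 and the last two share the other with mean 9.5.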

Note
Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

IDPosteriorErrorProbability -- Estimates probabilities for incorrectly assigned peptide sequences and a set 
of search engine scores using a mixture model.
Full documentation: http://www.openms.de/doxygen/nightly/html/TOPP_IDPosteriorErrorProbability.html
Version: 3.3.0-pre-nightly-2024-11-20 Nov 21 2024, 02:34:56, Revision: decb5c8
To cite OpenMS:
 + Pfeuffer, J., Bielow, C., Wein, S. et al. OpenMS 3 enables reproducible analysis of large-scale mass
   spectrometry data. Nat Methods (2024). doi:10.1038/s41592-024-02197-7.

Usage:
  IDPosteriorErrorProbability <options>

This tool has algorithm parameters that are not shown here! Please check the ini file for a detailed
description or use the --helphelp option

Options (mandatory options marked with '*'):
  -in <file>*        Input file  (valid formats: 'idXML')
  -out <file>*       Output file  (valid formats: 'idXML')
  -out_plot <file>   Txt file (if gnuplot is available, a corresponding PDF will be created as well.) (valid 
                     formats: 'txt')
  -split_charge      The search engine scores are split by charge if this flag is set. Thus, for each charge 
                     state a new model will be computed.
  -top_hits_only     If set only the top hits of every PeptideIdentification will be used
  -ignore_bad_data   If set errors will be written but ignored. Useful for pipelines with many datasets
                     where only a few are bad, but the pipeline should run through.
  -prob_correct      If set scores will be calculated as '1 - ErrorProbabilities' and can be interpreted as 
                     probabilities for correct identifications.
                     
                     
Common TOPP options:
  -ini <file>        Use the given TOPP INI file
  -threads <n>       Sets the number of threads allowed to be used by the TOPP tool (default: '1')
  -write_ini <file>  Writes the default configuration file
  --help             Shows options
  --helphelp         Shows all options (including advanced)

The following configuration subsections are valid:
 - fit_algorithm   Algorithm parameter subsection

You can write an example INI file using the '-write_ini' option.
Documentation of subsection parameters can be found in the doxygen documentation or the INIFileEditor.
For more information, please consult the online documentation for this tool:
  - http://www.openms.de/doxygen/nightly/html/TOPP_IDPosteriorErrorProbability.html
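
As a rough illustration of how the options listed above combine, the following Python sketch assembles a command line from them. The helper name and the file names are hypothetical; only flags that appear in the help output are used.

```python
def build_command(in_file, out_file, out_plot=None,
                  top_hits_only=False, split_charge=False, prob_correct=False):
    # Mandatory options: -in and -out (both idXML).
    cmd = ["IDPosteriorErrorProbability", "-in", in_file, "-out", out_file]
    if out_plot:
        # Optional txt file for the plot data.
        cmd += ["-out_plot", out_plot]
    # Boolean options are plain flags without a value.
    if top_hits_only:
        cmd.append("-top_hits_only")
    if split_charge:
        cmd.append("-split_charge")
    if prob_correct:
        cmd.append("-prob_correct")
    return cmd

# Hypothetical file names, for illustration only.
cmd = build_command("search_results.idXML", "pep.idXML",
                    out_plot="pep_plot.txt", top_hits_only=True)
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the tool, provided the executable can be found on PATH.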

INI file documentation of this tool:

+ IDPosteriorErrorProbability: Estimates probabilities for incorrectly assigned peptide sequences and a set of search engine scores using a mixture model.
  version (value: 3.3.0-pre-nightly-2024-11-20): Version of the tool that generated this parameters file.
  ++ 1: Instance '1' section for 'IDPosteriorErrorProbability'
     in: Input file (input file; valid formats: *.idXML)
     out: Output file (output file; valid formats: *.idXML)
     out_plot: Txt file (if gnuplot is available, a corresponding PDF will be created as well.) (output file; valid formats: *.txt)
     split_charge (default: false): The search engine scores are split by charge if this flag is set. Thus, for each charge state a new model will be computed. (valid values: true, false)
     top_hits_only (default: false): If set, only the top hits of every PeptideIdentification will be used. (valid values: true, false)
     fdr_for_targets_smaller (default: 0.05): Only used when top_hits_only is set. Additionally, target/decoy information should be available. The score_type must be q-value from a previous False Discovery Rate run.
     ignore_bad_data (default: false): If set, errors will be written but ignored. Useful for pipelines with many datasets where only a few are bad, but the pipeline should run through. (valid values: true, false)
     prob_correct (default: false): If set, scores will be calculated as '1 - ErrorProbabilities' and can be interpreted as probabilities for correct identifications. (valid values: true, false)
     log: Name of log file (created only when specified)
     debug (default: 0): Sets the debug level
     threads (default: 1): Sets the number of threads allowed to be used by the TOPP tool
     no_progress (default: false): Disables progress logging to command line (valid values: true, false)
     force (default: false): Overrides tool-specific checks (valid values: true, false)
     test (default: false): Enables the test mode (needed for internal use only) (valid values: true, false)
     +++ fit_algorithm: Algorithm parameter subsection
         number_of_bins (default: 100): Number of bins used for visualization. Only needed if each iteration step of the EM-Algorithm will be visualized
         incorrectly_assigned (default: Gumbel): For 'Gumbel', the Gumbel distribution is used to plot incorrectly assigned sequences. For 'Gauss', the Gauss distribution is used. (valid values: Gumbel, Gauss)
         max_nr_iterations (default: 1000): Bounds the number of iterations for the EM algorithm when convergence is slow.
         neg_log_delta (default: 6): The negative logarithm of the convergence threshold for the likelihood increase.
         outlier_handling (default: ignore_iqr_outliers): What to do with outliers:
           - ignore_iqr_outliers: ignore outliers outside of 3*IQR from Q1/Q3 for fitting
           - set_iqr_to_closest_valid: set IQR-based outliers to the last valid value for fitting
           - ignore_extreme_percentiles: ignore everything outside 99th and 1st percentile (also removes equal values like potential censored max values in XTandem)
           - none: do nothing
           (valid values: ignore_iqr_outliers, set_iqr_to_closest_valid, ignore_extreme_percentiles, none)
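
The 'ignore_iqr_outliers' option can be pictured with the following Python sketch. It assumes that "outside of 3*IQR from Q1/Q3" means scores below Q1 - 3*IQR or above Q3 + 3*IQR; the function name is hypothetical, and the quartile method (Python's default in `statistics.quantiles`) may differ from the one OpenMS uses.

```python
import statistics

def ignore_iqr_outliers(scores, factor=3.0):
    # Quartiles via the standard library (Python 3.8+, default 'exclusive' method).
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    # Keep only scores inside the assumed fences; the rest are ignored for fitting.
    return [s for s in scores if lower <= s <= upper]

kept = ignore_iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(kept)
```

In this example Q1 is 2.5 and Q3 is 7.5, so the fences lie at -12.5 and 22.5 and only the score 100 is dropped.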

For the parameters of the algorithm section, see the algorithm's documentation:
fit_algorithm