OpenMS
IDFileConverter

Converts peptide/protein identification engine file formats.

potential predecessor tools → IDFileConverter → potential successor tools
  predecessors: TPP tools (PeptideProphet, ProteinProphet); Sequest protein identification engine
  successors: TPP tools (ProteinProphet; for conversion from idXML to pepXML)

IDFileConverter can be used to convert identification results from external tools/pipelines (like TPP, Sequest, Mascot, OMSSA, X! Tandem) into other (OpenMS-specific) formats. For search engine results, it might be advisable to use the respective TOPP Adapters (e.g. CometAdapter) to avoid the extra conversion step.

The simplest format accepted is '.tsv': a tab-separated text file that contains one or more peptide sequences per line. Each line represents one spectrum, i.e. it is stored as a PeptideIdentification with one or more PeptideHits. Lines starting with "#" are ignored by the parser.
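
A minimal sketch in Python of how such a file could be read, assuming that the peptide sequences on a line are separated by tabs and that blank lines carry no information. The function name `parse_peptide_tsv` is hypothetical and not part of OpenMS; it only illustrates the one-line-per-spectrum layout.

```python
from typing import List


def parse_peptide_tsv(text: str) -> List[List[str]]:
    """Return one list of peptide sequences per spectrum (i.e. per line)."""
    spectra = []
    for line in text.splitlines():
        # Lines starting with "#" are ignored, as described for the parser.
        if line.startswith("#"):
            continue
        # Assumption: blank lines are skipped rather than treated as empty spectra.
        if not line.strip():
            continue
        # Each tab-separated field is one candidate peptide sequence (one hit).
        hits = [field.strip() for field in line.split("\t") if field.strip()]
        spectra.append(hits)
    return spectra


example = "# top hits per spectrum\nPEPTIDEK\tPEPTLDEK\nSAMPLER\n"
for index, hits in enumerate(parse_peptide_tsv(example)):
    print(index, hits)
```

Each inner list corresponds to what would become one PeptideIdentification, and each string in it to one PeptideHit.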

Conversion from the TPP file formats pepXML and protXML to OpenMS' idXML is quite comprehensive, to the extent that the original data can be represented in the simpler idXML format.

In contrast, support for converting from idXML to pepXML is limited. The purpose here is simply to create pepXML files containing the relevant information for use with ProteinProphet. We use the following heuristic: if peptideprophet_analyzed is set, we take the scores from the idXML as is and assume the PeptideHits contain all necessary information. If peptideprophet_analyzed is not set, we only provide ProteinProphet-compatible results with probability-based scores (i.e. Percolator with PEP score or scores from IDPosteriorErrorProbability). All secondary or non-probability main scores will be written as "search_scores" only.

Support for conversion to/from mzIdentML (.mzid) is still experimental and may lose information.

The xquest.xml format is very specific to Protein-Protein Cross-Linking MS (XL-MS) applications and is only considered useful for compatibility of OpenPepXL / OpenPepXLLF with the xQuest / xProphet / xTract pipeline. It only produces useful output when converting from idXML or mzid files containing XL-MS data.

IDFileConverter also supports generating .mzML files with theoretical spectra from a FASTA input.

Details on additional parameters:

mz_file:
Some search engine output files (like pepXML, mascotXML, Sequest .out files) may not contain retention times, only scan numbers or spectrum IDs. To be able to look up the actual RT values, the raw file has to be provided using the parameter mz_file. (If the identification results should be used later to annotate feature maps or consensus maps, it is critical that they contain RT values. See also IDMapper.)
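
As an illustration, the call can be assembled in Python using only the documented options -in, -out and -mz_file. The file names are placeholders and the helper `build_command` is hypothetical; the tool is only invoked if it is installed and the placeholder input actually exists.

```python
import os
import shutil
import subprocess
from typing import List, Optional


def build_command(in_file: str, out_file: str,
                  mz_file: Optional[str] = None) -> List[str]:
    """Assemble an IDFileConverter argument list (hypothetical helper)."""
    command = ["IDFileConverter", "-in", in_file, "-out", out_file]
    if mz_file is not None:
        # The raw file is used to look up RT values for scan numbers or spectrum IDs.
        command += ["-mz_file", mz_file]
    return command


command = build_command("results.pepXML", "results.idXML", mz_file="run1.mzML")
print(" ".join(command))

# Run the conversion only when the tool and the (placeholder) input are present.
if shutil.which("IDFileConverter") and os.path.exists("results.pepXML"):
    subprocess.run(command, check=True)
```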

mz_name:
pepXML files can contain results from multiple experiments. However, the idXML format does not support this. The mz_name parameter (or mz_file, if given) thus serves to define what parts to extract from the pepXML.
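
To decide which value to pass as mz_name, it can help to list the experiments contained in a pepXML file first. The sketch below assumes that each experiment is an `msms_run_summary` element carrying a `base_name` attribute; the inline document is a made-up, heavily shortened example, and `list_base_names` is a hypothetical helper, not an OpenMS function.

```python
import xml.etree.ElementTree as ET
from typing import List

# Made-up, shortened pepXML content with two experiments.
PEPXML = """<?xml version="1.0"?>
<msms_pipeline_analysis xmlns="http://regis-web.systemsbiology.net/pepXML">
  <msms_run_summary base_name="/data/run1" raw_data=".mzML"/>
  <msms_run_summary base_name="/data/run2" raw_data=".mzML"/>
</msms_pipeline_analysis>
"""


def list_base_names(pepxml_text: str) -> List[str]:
    """Collect the 'base_name' attribute of every msms_run_summary element."""
    root = ET.fromstring(pepxml_text)
    names = []
    for element in root.iter():
        # Compare local tag names so the XML namespace does not matter.
        if element.tag.rsplit("}", 1)[-1] == "msms_run_summary":
            base_name = element.get("base_name")
            if base_name is not None:
                names.append(base_name)
    return names


print(list_base_names(PEPXML))
```

One of the listed names (here, for example, `/data/run1`) would then be a candidate for mz_name when only that experiment should be extracted.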

scan_regex:
This advanced parameter defines a spectrum reference format via a Perl-style regular expression. The reference format connects search hits to the MS2 spectra that were searched, and may be needed to look up e.g. retention times in the raw data (mz_file). See the documentation of class SpectrumLookup for details on how to specify spectrum reference formats. Note that it is not necessary to look up any information in the raw data if that information can be extracted directly from the spectrum reference, in which case mz_file is not needed.
For Mascot results exported to (Mascot) XML, scan numbers that can be used to look up retention times (via mz_file) should be given in the "pep_scan_title" XML elements, but the format can vary. Some default formats are defined in the Mascot XML reader, but if those fail to extract the scan numbers, scan_regex can be used to overwrite the defaults.
For pepXML, supplying scan_regex may be necessary for files exported from Mascot, but only if the default reference formats (same as for Mascot XML) do not match. The spectrum references to which scan_regex is applied are read from the "spectrum" attribute of the "spectrum_query" elements.
For Percolator tab-delimited output, information is extracted from the "PSMId" column. By default, extraction of scan numbers and charge states is supported for MS-GF+ Percolator results (retention times and precursor m/z values can then be looked up in the raw data via mz_file).
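
A sketch of what such a reference format does, written in Python for illustration. Two things are assumptions: the spectrum reference string is a made-up example, and the group names `SCAN` and `CHARGE` are assumed to follow the naming described in the SpectrumLookup documentation and should be checked there. Note also that Python spells named groups `(?P<NAME>...)`, whereas the Perl-style syntax is `(?<NAME>...)`.

```python
import re
from typing import Dict, Optional

# Hypothetical reference format: "<file>.<scan>.<scan>.<charge>".
SCAN_PATTERN = re.compile(r"\.(?P<SCAN>\d+)\.\d+\.(?P<CHARGE>\d+)$")


def extract_reference(spectrum_reference: str) -> Optional[Dict[str, int]]:
    """Return the scan number and charge found in a reference, or None."""
    match = SCAN_PATTERN.search(spectrum_reference)
    if match is None:
        return None
    # Convert every captured group to an integer (leading zeros are dropped).
    return {name: int(value) for name, value in match.groupdict().items()}


print(extract_reference("run1.01234.01234.2"))
print(extract_reference("title without scan information"))
```

If the pattern does not match, nothing can be extracted from the reference itself, which is the situation in which a custom scan_regex becomes necessary.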
Some information about the supported input types:

The command line parameters of this tool are:

IDFileConverter -- Converts identification engine file formats.
Full documentation: http://www.openms.de/doxygen/release/3.2.0/html/TOPP_IDFileConverter.html
Version: 3.2.0 Nov 18 2024, 16:14:00, Revision: 03223c3
To cite OpenMS:
 + Pfeuffer, J., Bielow, C., Wein, S. et al. OpenMS 3 enables reproducible analysis of large-scale mass
   spectrometry data. Nat Methods (2024). doi:10.1038/s41592-024-02197-7.

Usage:
  IDFileConverter <options>

This tool has algorithm parameters that are not shown here! Please check the ini file for a detailed
description or use the --helphelp option

Options (mandatory options marked with '*'):
  -in <path/file>*           Input file or directory containing the data to convert. This may be:
                             - a single file in OpenMS database format (.oms),
                             - a single file in a multi-purpose XML format (.idXML, .mzid, .pepXML,
                             .protXML),
                             - a single file in a search engine-specific format (Mascot: .mascotXML, OMSSA:
                             .omssaXML, X! Tandem: .xml, Percolator: .psms, xQuest: .xquest.xml),
                             - a single file in fasta format (can only be used to generate a theoretical
                             mzML),
                             ...
                              (valid formats: 'oms', 'idXML', 'mzid', 'fasta', 'pepXML', 'protXML',
                             'mascotXML', 'omssaXML', 'xml', 'psms', 'tsv', 'xquest.xml')
  -out <file>*               Output file (valid formats: 'oms', 'idXML', 'mzid', 'pepXML', 'fasta',
                             'xquest.xml', 'mzML')
  -out_type <type>           Output file type (default: determined from file extension) (valid: 'oms',
                             'idXML', 'mzid', 'pepXML', 'fasta', 'xquest.xml', 'mzML')
                             
  -mz_file <file>            [pepXML, Sequest, Mascot, X! Tandem, mzid, Percolator only] Retention times and 
                             native spectrum ids (spectrum_references) will be looked up in this file (valid 
                             formats: 'mzML', 'mzXML', 'mzData')
                             
  -mz_name <file>            [pepXML only] Experiment filename/path (extension will be removed) to match in
                             the pepXML file ('base_name' attribute). Only necessary if different from
                             'mz_file'.
  -peptideprophet_analyzed   [pepXML output only] Write output in the format of a PeptideProphet analysis 
                             result. By default a 'raw' pepXML is produced that contains only search engine 
                             results.
  -score_type <choice>       [Percolator only] Which of the Percolator scores to report as 'the' score for a 
                             peptide hit (default: 'qvalue') (valid: 'qvalue', 'PEP', 'score')
                             
Common TOPP options:
  -ini <file>                Use the given TOPP INI file
  -threads <n>               Sets the number of threads allowed to be used by the TOPP tool (default: '1')
  -write_ini <file>          Writes the default configuration file
  --help                     Shows options
  --helphelp                 Shows all options (including advanced)

The following configuration subsections are valid:
 - fasta_to_mzml   [FASTA input + MzML output only] Parameters used to adjust simulation of the theoretical 
                   spectra.

You can write an example INI file using the '-write_ini' option.
Documentation of subsection parameters can be found in the doxygen documentation or the INIFileEditor.
For more information, please consult the online documentation for this tool:
  - http://www.openms.de/doxygen/release/3.2.0/html/TOPP_IDFileConverter.html
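
The options above can be combined as in the following Python sketch, which assembles a call for Percolator results and checks the chosen values against the lists printed in the help text. The helper `percolator_to_idxml` and the file names are hypothetical; nothing is executed, the argument list is only built and printed.

```python
from typing import List

# Valid values copied from the help text above.
VALID_OUT_TYPES = {"oms", "idXML", "mzid", "pepXML", "fasta", "xquest.xml", "mzML"}
VALID_SCORE_TYPES = {"qvalue", "PEP", "score"}


def percolator_to_idxml(in_file: str, out_file: str,
                        score_type: str = "qvalue",
                        out_type: str = "idXML") -> List[str]:
    """Build an IDFileConverter argument list for Percolator input."""
    if score_type not in VALID_SCORE_TYPES:
        raise ValueError(f"unsupported score_type: {score_type!r}")
    if out_type not in VALID_OUT_TYPES:
        raise ValueError(f"unsupported out_type: {out_type!r}")
    return ["IDFileConverter", "-in", in_file, "-out", out_file,
            "-out_type", out_type, "-score_type", score_type]


print(" ".join(percolator_to_idxml("results.psms", "results.idXML", score_type="PEP")))
```

Validating the choices before calling the tool is a design convenience only; the tool itself reports invalid values as well.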

INI file documentation of this tool:

+ IDFileConverter: Converts identification engine file formats.
version: Version of the tool that generated this parameters file. (default: '3.2.0')
++ 1: Instance '1' section for 'IDFileConverter'
in: Input file or directory containing the data to convert. This may be:
  - a single file in OpenMS database format (.oms),
  - a single file in a multi-purpose XML format (.idXML, .mzid, .pepXML, .protXML),
  - a single file in a search engine-specific format (Mascot: .mascotXML, OMSSA: .omssaXML, X! Tandem: .xml, Percolator: .psms, xQuest: .xquest.xml),
  - a single file in fasta format (can only be used to generate a theoretical mzML),
  - a single text file (tab separated) with one line for all peptide sequences matching a spectrum (top N hits),
  - for Sequest results, a directory containing .out files.
  (input file; valid formats: *.oms, *.idXML, *.mzid, *.fasta, *.pepXML, *.protXML, *.mascotXML, *.omssaXML, *.xml, *.psms, *.tsv, *.xquest.xml)
out: Output file (output file; valid formats: *.oms, *.idXML, *.mzid, *.pepXML, *.fasta, *.xquest.xml, *.mzML)
out_type: Output file type (default: determined from file extension) (valid: 'oms', 'idXML', 'mzid', 'pepXML', 'fasta', 'xquest.xml', 'mzML')
mz_file: [pepXML, Sequest, Mascot, X! Tandem, mzid, Percolator only] Retention times and native spectrum ids (spectrum_references) will be looked up in this file (input file; valid formats: *.mzML, *.mzXML, *.mzData)
mz_name: [pepXML only] Experiment filename/path (extension will be removed) to match in the pepXML file ('base_name' attribute). Only necessary if different from 'mz_file'.
peptideprophet_analyzed: [pepXML output only] Write output in the format of a PeptideProphet analysis result. By default a 'raw' pepXML is produced that contains only search engine results. (default: 'false') (valid: 'true', 'false')
score_type: [Percolator only] Which of the Percolator scores to report as 'the' score for a peptide hit (default: 'qvalue') (valid: 'qvalue', 'PEP', 'score')
ignore_proteins_per_peptide: [Sequest only] Workaround to deal with .out files that contain e.g. "+1" in references column, but do not list extra references in subsequent lines (try -debug 3 or 4) (default: 'false') (valid: 'true', 'false')
scan_regex: [Mascot, pepXML, Percolator only] Regular expression used to extract the scan number or retention time. See documentation for details.
no_spectra_data_override: [+mz_file only] Avoid overriding 'spectra_data' in protein identifications if 'mz_file' is given and 'spectrum_reference's are added/updated. Use only if you are sure it is absolutely the same 'mz_file' as used for identification. (default: 'false') (valid: 'true', 'false')
no_spectra_references_override: [+mz_file only] Avoid overriding 'spectrum_reference' in peptide identifications if 'mz_file' is given and a 'spectrum_reference' is already present. (default: 'false') (valid: 'true', 'false')
add_ionmatch_annotation: [+mz_file only] Annotate the identifications with ion matches from spectra in 'mz_file' using the given tolerance (in Da). This will take quite some time. (default: '0.0')
concatenate_peptides: [FASTA output only] Will concatenate the top peptide hits to one peptide sequence, rather than write a new peptide for each hit. (default: 'false') (valid: 'true', 'false')
number_of_hits: [FASTA output only] Controls how many peptide hits will be exported. A value of 0 or less exports all hits. (default: '1')
log: Name of log file (created only when specified)
debug: Sets the debug level (default: '0')
threads: Sets the number of threads allowed to be used by the TOPP tool (default: '1')
no_progress: Disables progress logging to command line (default: 'false') (valid: 'true', 'false')
force: Overrides tool-specific checks (default: 'false') (valid: 'true', 'false')
test: Enables the test mode (needed for internal use only) (default: 'false') (valid: 'true', 'false')
+++ fasta_to_mzml: [FASTA input + MzML output only] Parameters used to adjust simulation of the theoretical spectra.
isotope_model: Model to use for isotopic peaks ('none' means no isotopic peaks are added, 'coarse' adds isotopic peaks in unit mass distance, 'fine' uses the hyperfine isotopic generator to add accurate isotopic peaks). Note that adding isotopic peaks is very slow. (default: 'none') (valid: 'none', 'coarse', 'fine')
max_isotope: Defines the maximal isotopic peak which is added if 'isotope_model' is 'coarse' (default: '2')
max_isotope_probability: Defines the maximal isotopic probability to cover if 'isotope_model' is 'fine' (default: '0.05')
add_metainfo: Adds the type of peaks as metainfo to the peaks, like y8+, [M-H2O+2H]++ (default: 'false') (valid: 'true', 'false')
add_losses: Adds common losses to those ions expected to have them; only water and ammonia losses are considered (default: 'false') (valid: 'true', 'false')
sort_by_position: Sort output by position (default: 'true') (valid: 'true', 'false')
add_precursor_peaks: Adds peaks of the unfragmented precursor ion to the spectrum (default: 'false') (valid: 'true', 'false')
add_all_precursor_charges: Adds precursor peaks with all charges in the given range (default: 'false') (valid: 'true', 'false')
add_abundant_immonium_ions: Add most abundant immonium ions (for Proline, Cysteine, Iso/Leucine, Histidine, Phenylalanine, Tyrosine, Tryptophan) (default: 'false') (valid: 'true', 'false')
add_first_prefix_ion: If set to true, e.g. b1 ions are added (default: 'false') (valid: 'true', 'false')
add_y_ions: Add peaks of y-ions to the spectrum (default: 'true') (valid: 'true', 'false')
add_b_ions: Add peaks of b-ions to the spectrum (default: 'true') (valid: 'true', 'false')
add_a_ions: Add peaks of a-ions to the spectrum (default: 'false') (valid: 'true', 'false')
add_c_ions: Add peaks of c-ions to the spectrum (default: 'false') (valid: 'true', 'false')
add_x_ions: Add peaks of x-ions to the spectrum (default: 'false') (valid: 'true', 'false')
add_z_ions: Add peaks of z-ions to the spectrum (sometimes observed in CID and for some AAs in ExD due to H abstraction) (default: 'false') (valid: 'true', 'false')
add_zp1_ions: Add peaks of z+1-radical cations (also [z+H]*^{+} or simply z*) to the spectrum (often observed in ExD) (default: 'false') (valid: 'true', 'false')
add_zp2_ions: Add peaks of z+2-radical cations (also [z+2H]*^{2+} or simply z') to the spectrum (often observed in ExD esp. with higher precursor charges >3 and smaller z-ions.) (default: 'false') (valid: 'true', 'false')
y_intensity: Intensity of the y-ions (default: '1.0') (valid range: 0.0:∞)
b_intensity: Intensity of the b-ions (default: '1.0') (valid range: 0.0:∞)
a_intensity: Intensity of the a-ions (default: '1.0') (valid range: 0.0:∞)
c_intensity: Intensity of the c-ions (default: '1.0') (valid range: 0.0:∞)
x_intensity: Intensity of the x-ions (default: '1.0') (valid range: 0.0:∞)
z_intensity: Intensity of the z-ions (default: '1.0') (valid range: 0.0:∞)
relative_loss_intensity: Intensity of loss ions, in relation to the intact ion intensity (default: '0.1') (valid range: 0.0:1.0)
precursor_intensity: Intensity of the precursor peak (default: '1.0') (valid range: 0.0:∞)
precursor_H2O_intensity: Intensity of the H2O loss peak of the precursor (default: '1.0') (valid range: 0.0:∞)
precursor_NH3_intensity: Intensity of the NH3 loss peak of the precursor (default: '1.0') (valid range: 0.0:∞)
enzyme: Enzyme used to digest the fasta proteins (default: 'Trypsin') (valid: Trypsin, Clostripain/P, elastase-trypsin-chymotrypsin, no cleavage, unspecific cleavage, Arg-C, Arg-C/P, staphylococcal protease/D, proline-endopeptidase/HKR, Glu-C+P, PepsinA + P, cyanogen-bromide, leukocyte elastase, proline endopeptidase, Asp-N, Asp-N/B, Asp-N_ambic, Chymotrypsin, Chymotrypsin/P, CNBr, Formic_acid, Lys-C, Lys-N, Lys-C/P, PepsinA, TrypChymo, Trypsin/P, V8-DE, V8-E, glutamyl endopeptidase, Alpha-lytic protease, 2-iodobenzoate, iodosobenzoate)
missed_cleavages: Number of allowed missed cleavages while digesting the fasta proteins (default: '0')
min_charge: Minimum charge (default: '1')
max_charge: Maximum charge (default: '1')
precursor_charge: Manually set precursor charge. (default: '0', meaning max_charge + 1 will be used as precursor charge)
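
To make the ion-series parameters more concrete, the Python sketch below computes b- and y-ion m/z values for a short peptide from monoisotopic residue masses. It is a textbook calculation and not the OpenMS implementation: the function `by_ions` and the reduced residue table are illustrative assumptions, and losses, isotopes, and intensities are not modeled. The `add_first_prefix_ion` argument mirrors the parameter of the same name.

```python
from typing import Dict, List, Tuple

# Monoisotopic residue masses in Da (subset, for illustration only).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "D": 115.02694, "E": 129.04259, "K": 128.09496, "R": 156.10111,
}
PROTON = 1.007276   # mass of a proton in Da
WATER = 18.010565   # monoisotopic mass of H2O in Da


def by_ions(peptide: str, charge: int = 1,
            add_first_prefix_ion: bool = False) -> Tuple[List[float], List[float]]:
    """Return (b-ion m/z list, y-ion m/z list), each ordered by fragment length."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions: List[float] = []
    y_ions: List[float] = []
    for length in range(1, len(masses)):
        # b-ion: N-terminal prefix; b1 is only added on request.
        if length > 1 or add_first_prefix_ion:
            prefix = sum(masses[:length])
            b_ions.append((prefix + charge * PROTON) / charge)
        # y-ion: C-terminal suffix plus one water.
        suffix = sum(masses[-length:]) + WATER
        y_ions.append((suffix + charge * PROTON) / charge)
    return b_ions, y_ions


b, y = by_ions("PEPTIDE")
print([round(mz, 4) for mz in b])
print([round(mz, 4) for mz in y])
```

With the defaults documented above (add_b_ions and add_y_ions enabled, min_charge and max_charge both 1), a theoretical spectrum would contain peaks of this kind for each digested peptide.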