A peptide-centric algorithm for protein inference.

Potential predecessor tools → ProteinResolver → potential successor tools
IDFilter → ProteinResolver → (external)

Experimental classes: This tool has not been tested thoroughly and might NOT behave as expected!

This tool is an implementation of

Meyer-Arendt K, Old WM, et al. (2011)
IsoformResolver: A peptide-centric algorithm for protein inference
Journal of Proteome Research 10 (7): 3060-75, DOI: 10.1021/pr200039p

The algorithm tries to assign to each protein its experimentally validated peptides (meaning you should supply peptides which have undergone FDR filtering or the like). Proteins are grouped into ISD groups (in-silico derived) and MSD groups (MS/MS derived) if they have in-silico derived or MS/MS derived peptides in common. Proteins and peptides span a bipartite graph. There is an edge between a protein node and a peptide node if and only if the protein contains the peptide. ISD groups are the connected components of the aforementioned bipartite graph. MSD groups are subgraphs of ISD groups. For further information, see the paper cited above.
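The graph description above can be illustrated with a small sketch. It is not the OpenMS implementation: the protein and peptide names are made up, `connected_groups` is a hypothetical helper, and skipping proteins without peptides is an assumption of this example. Running it on an in-silico digest gives ISD-style groups; running it on only the identified peptides gives MSD-style groups.

```python
from collections import defaultdict


def connected_groups(protein_to_peptides):
    """Return connected components of the protein-peptide bipartite graph.

    Each component is a (proteins, peptides) pair of sorted lists.
    Proteins without any peptide are skipped (assumption for this sketch).
    """
    # Invert the mapping so that peptide nodes can be traversed as well.
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    seen = set()
    groups = []
    for start in sorted(protein_to_peptides):
        if start in seen or not protein_to_peptides[start]:
            continue
        proteins, peptides = set(), set()
        stack = [start]
        seen.add(start)
        while stack:  # depth-first walk over shared peptides
            protein = stack.pop()
            proteins.add(protein)
            for peptide in protein_to_peptides[protein]:
                peptides.add(peptide)
                for neighbour in peptide_to_proteins[peptide]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
        groups.append((sorted(proteins), sorted(peptides)))
    return groups


# Hypothetical in-silico digest: protein -> peptides it contains.
in_silico = {
    "P1": {"AAK", "CCK"},
    "P2": {"CCK", "DDK"},
    "P3": {"DDK", "EEK"},
    "P4": {"FFK"},
}
# Hypothetical set of peptides that passed FDR filtering.
identified = {"AAK", "CCK", "EEK"}

isd_groups = connected_groups(in_silico)
observed = {prot: peps & identified for prot, peps in in_silico.items()}
msd_groups = connected_groups(observed)
```

In this toy data, P1, P2 and P3 form one ISD-style group, but only P1 and P2 share an identified peptide, so the MSD-style groups are smaller subgraphs of it.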

Remark: If parameter 'in' is given, 'in_path' is ignored. Parameter 'in_path' is considered only if 'in' is empty.
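The precedence rule in the remark can be written down as a short sketch. `resolve_inputs` is a hypothetical helper, and selecting files from the directory by extension is an assumption; the tool's actual directory scan may differ.

```python
from pathlib import Path


def resolve_inputs(in_files, in_path):
    """Apply the documented rule: 'in' wins; 'in_path' is used only if 'in' is empty."""
    if in_files:
        return list(in_files)
    if not in_path:
        return []
    # Assumption: files are picked up by extension, matching the accepted formats.
    return sorted(
        str(path)
        for path in Path(in_path).iterdir()
        if path.suffix in (".idXML", ".consensusXML")
    )


# 'in' is given, so the directory is never inspected.
chosen = resolve_inputs(["run1.idXML"], "some/directory")
```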

Input

Since the ProteinResolver offers two different input parameters, there are several ways to use this TOPP tool.

One single input file (in)

The ProteinResolver simply performs the protein inference for that specific file, based on the above-mentioned algorithm of Meyer-Arendt et al. (2011).

Multiple files (in or in_path)

If no experimental design file is given, each file is processed separately, as in batch processing.
If an experimental design file is provided, all files that can be mapped to the same experimental design are treated as one single input file (simply by merging them before the computation).

Output

Four possible outputs are available:

Protein groups: For each MSD group, the ISD group, the protein indices, the peptide indices, the number of peptides in the MSD group and the number of proteins in the ISD group are written to the output file.
Protein table: The resulting text file contains one protein per line.
Peptide table: The output file contains one peptide per line, together with all proteins which contain that specific peptide.
Statistics: Number of ISD groups, number of MSD groups, number of target peptides, number of decoy peptides, number of target and decoy peptides, number of peptides in MSD groups and estimated FDR for the protein list.

The results for different input files are appended and written into the same output file. In other words, no matter how many input files you have, you will end up with one single output file.

Text file format of the quantitative experimental design:

The text file has to be column-based and must contain exactly one header line. The header must specify two columns: one that holds the file name and one that holds an identifier for the experimental setup. The names of these two header columns can be set via parameters and must be unique (default: "File" and "ExperimentalSetting"). There are four options for how the columns can be separated: tab, comma, semi-colon and whitespace.

Example for text file format:

Slice	File	ExperimentalSetting
1	SILAC_2_1	S1224
4	SILAC_3_4	D1224
2	SILAC_10_2	S1224
7	SILAC_8_7	S1224

In this case, the parameters "experiment" and "file" can keep their default values, "ExperimentalSetting" and "File", respectively. If you use other column headers, you need to change these parameters.

The separator must be changed if the file is not tab-separated. Every other column (here: the first column) is ignored. Not every file mentioned in the design file has to be given as an input file, and every input file that has no match in the design file is ignored for the computation.
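The format rules above can be sketched as a small parser. This is an illustration only, not the tool's own reader: `read_design` and `SEPARATORS` are hypothetical names, and the function's keyword arguments merely mirror the documented "file", "experiment" and "separator" parameters.

```python
import csv

# Maps the documented separator choices to delimiter characters;
# "whitespace" is handled separately with str.split().
SEPARATORS = {"tab": "\t", "comma": ",", "semi-colon": ";"}


def read_design(text, file_col="File", experiment_col="ExperimentalSetting",
                separator="tab"):
    """Parse a design table and return {file name: experimental setting}."""
    lines = [line for line in text.splitlines() if line.strip()]
    if separator == "whitespace":
        rows = [line.split() for line in lines]
    else:
        rows = list(csv.reader(lines, delimiter=SEPARATORS[separator]))
    header, body = rows[0], rows[1:]
    # Both header identifiers must be present exactly once.
    for name in (file_col, experiment_col):
        if header.count(name) != 1:
            raise ValueError(f"header must contain exactly one column named {name!r}")
    file_idx = header.index(file_col)
    experiment_idx = header.index(experiment_col)
    # All other columns (such as "Slice") are ignored.
    return {row[file_idx]: row[experiment_idx] for row in body}


DESIGN = (
    "Slice\tFile\tExperimentalSetting\n"
    "1\tSILAC_2_1\tS1224\n"
    "4\tSILAC_3_4\tD1224\n"
    "2\tSILAC_10_2\tS1224\n"
    "7\tSILAC_8_7\tS1224\n"
)

design = read_design(DESIGN)
```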

Consider the following scenario:

Input files: SILAC_2_1.consensusXML, SILAC_3_4.consensusXML, SILAC_10_2.consensusXML and SILAC_8_7_.consensusXML

First step: Data from SILAC_2_1.consensusXML and SILAC_10_2.consensusXML is merged, because both files can be mapped to the same setting S1224. SILAC_8_7_.consensusXML is ignored, since SILAC_8_7_ does not match SILAC_8_7.

Second step: ProteinResolver computes results for the merged data and for the data from the file SILAC_3_4.consensusXML.

Third step: ProteinResolver writes the results for experimental setting S1224 and D1224 to the same output file.
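The first step of this scenario can be reproduced with a short sketch. It assumes that an input file is matched to the design by its file name without the extension, which is what the scenario suggests; `group_by_setting` is a hypothetical helper and the design mapping is typed in by hand from the example table.

```python
from collections import defaultdict
from pathlib import Path

# File name -> experimental setting, taken from the example design table.
design = {
    "SILAC_2_1": "S1224",
    "SILAC_3_4": "D1224",
    "SILAC_10_2": "S1224",
    "SILAC_8_7": "S1224",
}


def group_by_setting(input_files, design):
    """Group input files by experimental setting; unmatched files are ignored."""
    groups = defaultdict(list)
    ignored = []
    for name in input_files:
        stem = Path(name).stem  # assumption: match on the name without extension
        if stem in design:
            groups[design[stem]].append(name)
        else:
            ignored.append(name)
    return dict(groups), ignored


inputs = [
    "SILAC_2_1.consensusXML",
    "SILAC_3_4.consensusXML",
    "SILAC_10_2.consensusXML",
    "SILAC_8_7_.consensusXML",
]
groups, ignored = group_by_setting(inputs, design)
```

Here `groups` holds two files for S1224 and one for D1224, and `ignored` contains only SILAC_8_7_.consensusXML, because of its trailing underscore.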

Note: Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

ProteinResolver -- protein inference
Full documentation: http://www.openms.de/doxygen/release/2.7.0/html/TOPP_ProteinResolver.html
Version: 2.7.0 Sep 13 2021, 20:58:47, Revision: 9110e58
To cite OpenMS:
  Rost HL, Sachsenberg T, Aiche S, Bielow C et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Meth. 2016; 13, 9: 741-748. doi:10.1038/nmeth.3959.

Usage:
  ProteinResolver <options>

Options (mandatory options marked with '*'):
  -fasta <file>*                       Input database file (valid formats: 'fasta')
  -in <file(s)>                        Input file(s) holding experimental data (valid formats: 'idXML',
                                       'consensusXML')
  -in_path <file>                      Path to idXMLs or consensusXMLs files. Ignored if 'in' is given.
  -design <file>                       Text file containing the experimental design. See documentation for 
                                       specific format requirements (valid formats: 'txt')
  -protein_groups <file>               Output file. Contains all protein groups (valid formats: 'csv')
  -peptide_table <file>                Output file. Contains one peptide per line and all proteins which
                                       contain that peptide (valid formats: 'csv')
  -protein_table <file>                Output file. Contains one protein per line (valid formats: 'csv')

Additional options for algorithm:
  -resolver:missed_cleavages <number>  Number of allowed missed cleavages (default: '2' min: '0')
  -resolver:min_length <number>        Minimum length of peptide (default: '6' min: '1')
  -resolver:enzyme <choice>            Digestion enzyme (default: 'Trypsin' valid: 'Trypsin')

Additional options for quantitative experimental design:
  -designer:experiment <text>          Identifier for the experimental design. (default:
                                       'ExperimentalSetting')
  -designer:file <text>                Identifier for the file name. (default: 'File')
  -designer:separator <choice>         Separator, which should be used to split a row into columns (default: 
                                       'tab' valid: 'tab', 'semi-colon', 'comma', 'whitespace')

                                       
Common TOPP options:
  -ini <file>                          Use the given TOPP INI file
  -threads <n>                         Sets the number of threads allowed to be used by the TOPP tool
                                       (default: '1')
  -write_ini <file>                    Writes the default configuration file
  --help                               Shows options
  --helphelp                           Shows all options (including advanced)
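
As an illustration of how the options listed above combine, the following sketch assembles a command line without running it. All file names are hypothetical, `build_command` is not part of OpenMS, and passing several input files as separate arguments after -in is an assumption based on the usual TOPP convention for file lists.

```python
import shlex


def build_command(fasta, in_files, protein_groups=None, design=None, separator=None):
    """Assemble a ProteinResolver argument list from the documented options."""
    command = ["ProteinResolver", "-fasta", fasta, "-in", *in_files]
    if design:
        command += ["-design", design]
    if separator:
        command += ["-designer:separator", separator]
    if protein_groups:
        command += ["-protein_groups", protein_groups]
    return command


# Hypothetical file names, for illustration only.
command = build_command(
    fasta="database.fasta",
    in_files=["SILAC_2_1.consensusXML", "SILAC_10_2.consensusXML"],
    protein_groups="groups.csv",
    design="design.csv",
    separator="comma",
)
print(shlex.join(command))
```

The list form can be handed to `subprocess.run` directly if the tool is installed; `shlex.join` is used here only to display a shell-safe string.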

INI file documentation of this tool:

Legend: parameters marked with '*' are required.

ProteinResolver: protein inference
  version                  Version of the tool that generated this parameters file (default: '2.7.0')

Section '1': Instance '1' section for 'ProteinResolver'
  fasta*                   Input database file (input file, valid formats: '*.fasta')
  in                       Input file(s) holding experimental data (list of input files, valid formats:
                           '*.idXML', '*.consensusXML')
  in_path                  Path to idXMLs or consensusXMLs files. Ignored if 'in' is given.
  design                   Text file containing the experimental design. See documentation for specific
                           format requirements (input file, valid formats: '*.txt')
  protein_groups           Output file. Contains all protein groups (output file, valid formats: '*.csv')
  peptide_table            Output file. Contains one peptide per line and all proteins which contain that
                           peptide (output file, valid formats: '*.csv')
  protein_table            Output file. Contains one protein per line (output file, valid formats: '*.csv')
  additional_info          Output file for additional info (output file, valid formats: '*.csv')
  log                      Name of log file (created only when specified)
  debug                    Sets the debug level (default: '0')
  threads                  Sets the number of threads allowed to be used by the TOPP tool (default: '1')
  no_progress              Disables progress logging to command line (default: 'false' valid: 'true',
                           'false')
  force                    Overrides tool-specific checks (default: 'false' valid: 'true', 'false')
  test                     Enables the test mode (needed for internal use only) (default: 'false' valid:
                           'true', 'false')

Subsection 'resolver': Additional options for algorithm
  missed_cleavages         Number of allowed missed cleavages (default: '2' min: '0')
  min_length               Minimum length of peptide (default: '6' min: '1')
  enzyme                   Digestion enzyme (default: 'Trypsin' valid: 'Trypsin')

Subsection 'designer': Additional options for quantitative experimental design
  experiment               Identifier for the experimental design. (default: 'ExperimentalSetting')
  file                     Identifier for the file name. (default: 'File')
  separator                Separator, which should be used to split a row into columns (default: 'tab'
                           valid: 'tab', 'semi-colon', 'comma', 'whitespace')