OpenMS
Refreshes the protein references for all peptide hits from an idXML file and adds target/decoy information.
potential predecessor tools | → PeptideIndexer → | potential successor tools |
---|---|---|
IDFilter or any protein/peptide processing tool | | FalseDiscoveryRate |
PeptideIndexer refreshes target/decoy information and mapping of peptides to proteins. The target/decoy information is crucial for the FalseDiscoveryRate tool. (For FDR calculations, peptides hitting both target and decoy proteins are counted as target hits.)
PeptideIndexer allows for ambiguous amino acids (B|J|Z|X) in the protein database and peptide sequence.
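To illustrate what matching with ambiguous amino acids can look like, here is a minimal Python sketch. It is not the PeptideIndexer implementation: the function names are hypothetical, the expansion of B (D or N), J (I or L), Z (E or Q) and X (any of the 20 standard residues) follows the usual one-letter conventions, and counting one ambiguous position per compatible non-exact pair is an assumption made for this example.

```python
# Hypothetical sketch, not the OpenMS algorithm.
# Assumed expansions of the ambiguous one-letter codes.
AMBIGUOUS = {
    "B": set("DN"),
    "J": set("IL"),
    "Z": set("EQ"),
    "X": set("ACDEFGHIKLMNPQRSTVWY"),
}


def residues(code):
    """Return the set of concrete residues a one-letter code may stand for."""
    return AMBIGUOUS.get(code, {code})


def matches_at(peptide, protein, start, aaa_max=3):
    """Check whether peptide fits protein at start using at most aaa_max ambiguous positions."""
    window = protein[start:start + len(peptide)]
    if len(window) < len(peptide):
        return False
    used = 0
    for pep_aa, prot_aa in zip(peptide, window):
        if pep_aa == prot_aa and pep_aa not in AMBIGUOUS:
            continue  # exact, unambiguous match
        if not (residues(pep_aa) & residues(prot_aa)):
            return False  # incompatible residues
        used += 1  # compatible only through an ambiguous code
        if used > aaa_max:
            return False
    return True


def find_matches(peptide, protein, aaa_max=3):
    """Return all start positions where the peptide matches the protein."""
    last_start = len(protein) - len(peptide)
    return [i for i in range(last_start + 1) if matches_at(peptide, protein, i, aaa_max)]
```

With this sketch, `find_matches("PEPTIDE", "AAPEPTXDEKK", aaa_max=1)` finds the hit at position 2, while `aaa_max=0` rejects it because the match depends on one 'X'.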
Enzyme cutting rules and partial specificity are derived from input idXML automatically by default or can be specified explicitly by the user.
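The effect of the specificity setting ('full', 'semi', 'none', as described for -enzyme:specificity in the tool's help text) can be sketched in Python for trypsin, which cleaves after K or R but not before P. This is only an illustration under stated assumptions: the helper names are hypothetical, protein termini are assumed to count as valid cleavage sites, and 'semi' is read as "at least one of the two sites matches".

```python
# Hypothetical sketch of enzyme specificity checks for trypsin.
def is_tryptic_site(protein, pos):
    """True if cutting between protein[pos-1] and protein[pos] is a valid tryptic site.

    Assumption: the protein termini (pos == 0 or pos == len(protein)) count as valid.
    """
    if pos == 0 or pos == len(protein):
        return True
    return protein[pos - 1] in "KR" and protein[pos] != "P"


def hit_allowed(protein, start, length, specificity="full"):
    """Decide whether a peptide at protein[start:start+length] is acceptable."""
    if specificity == "none":
        return True  # the enzyme is irrelevant
    n_ok = is_tryptic_site(protein, start)
    c_ok = is_tryptic_site(protein, start + length)
    if specificity == "full":
        return n_ok and c_ok
    if specificity == "semi":
        return n_ok or c_ok
    raise ValueError("specificity must be 'full', 'semi' or 'none'")
```

For the made-up protein `MKAAARPEPTIDEKGGG`, the peptide `AAAR` (start 2) ends before a proline, so it passes only with 'semi' or 'none', whereas `AAARPEPTIDEK` passes with 'full'.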
All peptide and protein hits are annotated with target/decoy information, using the meta value 'target_decoy'. For proteins the possible values are "target" and "decoy", depending on whether the protein accession contains the decoy pattern (parameter decoy_string) as a prefix or suffix (see parameter decoy_string_position). Resulting protein hits appear in the order of the FASTA file, except for orphaned proteins, which will appear first with an empty 'target_decoy' metavalue. Duplicate protein accessions and sequences will not raise a warning, but will create multiple hits (PeptideIndexer reads the FASTA file piecewise for efficiency reasons, and thus might not see all accessions and sequences at once).
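The protein-level rule can be sketched in a few lines of Python. This is a simplified illustration, not the tool's code: the function name is hypothetical, the comparison is assumed to be case-sensitive, and the automatic detection used when decoy_string is empty is not modelled. The decoy strings in the usage note ("DECOY_", "_rev") are example values only.

```python
# Hypothetical sketch of the protein target/decoy rule.
def classify_protein(accession, decoy_string, position="prefix"):
    """Return 'decoy' if the accession carries the decoy pattern, else 'target'."""
    if position not in ("prefix", "suffix"):
        raise ValueError("position must be 'prefix' or 'suffix'")
    if not decoy_string:
        # The real tool auto-detects the pattern in this case; not modelled here.
        raise ValueError("this sketch needs an explicit decoy_string")
    if position == "prefix":
        is_decoy = accession.startswith(decoy_string)
    else:
        is_decoy = accession.endswith(decoy_string)
    return "decoy" if is_decoy else "target"
```

For example, `classify_protein("DECOY_P12345", "DECOY_")` yields "decoy", and `classify_protein("P12345_rev", "_rev", "suffix")` does as well.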
Peptide hits are annotated with metavalue 'protein_references', and if matched to at least one protein also with metavalue 'target_decoy'. The possible values for 'target_decoy' in peptides are "target", "decoy" and "target+decoy", depending on whether the peptide sequence is found only in target proteins, only in decoy proteins, or in both. If the peptide is unmatched the metavalue is missing.
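The peptide-level annotation follows from the labels of the matched proteins. The Python sketch below restates that rule; the function names are hypothetical, and `None` stands in for the missing metavalue of an unmatched peptide. The second helper mirrors the statement that, for FDR calculations, peptides hitting both target and decoy proteins are counted as target hits.

```python
# Hypothetical sketch of the peptide 'target_decoy' rule.
from typing import Iterable, Optional


def peptide_target_decoy(protein_labels: Iterable[str]) -> Optional[str]:
    """Derive the peptide label from the labels of all proteins it matches."""
    labels = set(protein_labels)
    unknown = labels - {"target", "decoy"}
    if unknown:
        raise ValueError(f"unexpected protein labels: {sorted(unknown)}")
    if not labels:
        return None  # unmatched peptide: the metavalue is missing
    if labels == {"target"}:
        return "target"
    if labels == {"decoy"}:
        return "decoy"
    return "target+decoy"


def counts_as_target_for_fdr(label: str) -> bool:
    """Peptides matching both target and decoy proteins count as target hits."""
    return label in ("target", "target+decoy")
```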
Runtime: PeptideIndexer is usually very fast (loading and storing the data takes the most time) and search speed can be further improved (linearly) by using more threads. Avoid allowing too many (>=4) ambiguous amino acids if your database contains long stretches of 'X' (exponential search space).
PeptideIndexer supports relative database filenames, which (when not found in the current working directory) are looked up in the directories specified by OpenMS.ini:id_db_dir. The database is by default derived from the input idXML's metainformation ('auto' setting), but can be specified explicitly.
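The lookup order for relative database filenames can be pictured with the following Python sketch. It only mirrors the order described above (current working directory first, then the configured directories); the function name and its parameters are hypothetical, and `search_dirs` stands in for whatever directories OpenMS.ini:id_db_dir lists.

```python
# Hypothetical sketch of the database lookup order.
from pathlib import Path
from typing import Iterable, Optional


def resolve_database(filename: str, search_dirs: Iterable[str],
                     cwd: Optional[str] = None) -> Optional[Path]:
    """Return the first existing file for filename, or None if it is not found."""
    path = Path(filename)
    if path.is_absolute():
        return path if path.is_file() else None
    base = Path(cwd) if cwd is not None else Path.cwd()
    candidate = base / path
    if candidate.is_file():
        return candidate  # found relative to the working directory
    for directory in search_dirs:
        candidate = Path(directory) / path
        if candidate.is_file():
            return candidate  # found in one of the configured directories
    return None
```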
The command line parameters of this tool are:
PeptideIndexer -- Refreshes the protein references for all peptide hits.
Full documentation: http://www.openms.de/doxygen/nightly/html/TOPP_PeptideIndexer.html
Version: 3.3.0-pre-nightly-2024-11-20 Nov 21 2024, 02:34:56, Revision: decb5c8
To cite OpenMS:
 + Pfeuffer, J., Bielow, C., Wein, S. et al.. OpenMS 3 enables reproducible analysis of large-scale mass spectrometry data. Nat Methods (2024). doi:10.1038/s41592-024-02197-7.

Usage:
  PeptideIndexer <options>

Options (mandatory options marked with '*'):
  -in <file>*  Input idXML file containing the identifications. (valid formats: 'idXML')
  -fasta <file>  Input sequence database in FASTA format. Leave empty for using the same DB as used for the input idXML (this might fail). Non-existing relative filenames are looked up via 'OpenMS.ini:id_db_dir' (valid formats: 'fasta')
  -out <file>*  Output idXML file. (valid formats: 'idXML')
  -decoy_string <text>  String that was appended (or prefixed - see 'decoy_string_position' flag below) to the accessions in the protein database to indicate decoy proteins. If empty (default), it's determined automatically (checking for common terms, both as prefix and suffix).
  -decoy_string_position <choice>  Is the 'decoy_string' prepended (prefix) or appended (suffix) to the protein accession? (ignored if decoy_string is empty) (default: 'prefix') (valid: 'prefix', 'suffix')
  -missing_decoy_action <choice>  Action to take if NO peptide was assigned to a decoy protein (which indicates wrong database or decoy string): 'error' (exit with error, no output), 'warn' (exit with success, warning message), 'silent' (no action is taken, not even a warning) (default: 'error') (valid: 'error', 'warn', 'silent')
  -write_protein_sequence  If set, the protein sequences are stored as well.
  -write_protein_description  If set, the protein description is stored as well.
  -keep_unreferenced_proteins  If set, protein hits which are not referenced by any peptide are kept.
  -unmatched_action <choice>  If peptide sequences cannot be matched to any protein: 1) raise an error; 2) warn (unmatched PepHits will miss target/decoy annotation with downstream problems); 3) remove the hit. (default: 'error') (valid: 'error', 'warn', 'remove')
  -aaa_max <number>  Maximal number of ambiguous amino acids (AAAs) allowed when matching to a protein database with AAAs. AAAs are 'B', 'J', 'Z' and 'X'. (default: '3') (min: '0' max: '10')
  -mismatches_max <number>  Maximal number of mismatched (mm) amino acids allowed when matching to a protein database. The required runtime is exponential in the number of mm's; apply with care. MM's are allowed in addition to AAA's. (default: '0') (min: '0' max: '10')
  -IL_equivalent  Treat the isobaric amino acids isoleucine ('I') and leucine ('L') as equivalent (indistinguishable). Also occurrences of 'J' will be treated as 'I' thus avoiding ambiguous matching.
  -allow_nterm_protein_cleavage <choice>  Allow the protein N-terminus amino acid to clip. (default: 'true') (valid: 'true', 'false')

enzyme:
  -enzyme:name <choice>  Enzyme which determines valid cleavage sites - e.g. trypsin cleaves after lysine (K) or arginine (R), but not before proline (P). Default: deduce from input (default: 'auto') (valid: 'auto', 'proline-endopeptidase/HKR', 'Glu-C+P', 'Formic_acid', 'Lys-C', 'Lys-N', 'Trypsin', 'Arg-C', 'Asp-N_ambic', 'Chymotrypsin', 'Chymotrypsin/P', 'CNBr', 'Arg-C/P', 'Asp-N', 'Asp-N/B', 'unspecific cleavage', 'Lys-C/P', 'PepsinA', 'TrypChymo', 'Trypsin/P', 'V8-DE', 'V8-E', 'leukocyte elastase', 'proline endopeptidase', 'glutamyl endopeptidase', 'Alpha-lytic protease', '2-iodobenzoate', 'iodosobenzoate', 'staphylococcal protease/D', 'PepsinA + P', 'cyanogen-bromide', 'Clostripain/P', 'elastase-trypsin-chymotrypsin', 'no cleavage')
  -enzyme:specificity <choice>  Specificity of the enzyme. Default: deduce from input. 'full': both internal cleavage sites must match. 'semi': one of two internal cleavage sites must match. 'none': allow all peptide hits no matter their context (enzyme is irrelevant). (default: 'auto') (valid: 'auto', 'full', 'semi', 'none')

Common TOPP options:
  -ini <file>  Use the given TOPP INI file
  -threads <n>  Sets the number of threads allowed to be used by the TOPP tool (default: '1')
  -write_ini <file>  Writes the default configuration file
  --help  Shows options
  --helphelp  Shows all options (including advanced)
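As a usage illustration, the Python sketch below assembles a PeptideIndexer command line from the options listed above. The helper function and the file names are hypothetical; only the flag names are taken from the help text. Actually running the command would require an OpenMS installation with PeptideIndexer on the PATH, so the sketch only builds and prints the argument list.

```python
# Hypothetical helper that assembles a PeptideIndexer call; it does not run the tool.
import shlex
from typing import List, Optional


def build_peptide_indexer_command(in_file: str, out_file: str,
                                  fasta: Optional[str] = None,
                                  decoy_string: Optional[str] = None,
                                  decoy_string_position: str = "prefix",
                                  enzyme_specificity: str = "auto",
                                  aaa_max: int = 3,
                                  threads: int = 1) -> List[str]:
    """Return the argument list for a PeptideIndexer invocation."""
    cmd = ["PeptideIndexer", "-in", in_file, "-out", out_file]
    if fasta:
        cmd += ["-fasta", fasta]  # otherwise the DB named in the idXML is used
    if decoy_string:
        # The position flag is ignored by the tool when decoy_string is empty.
        cmd += ["-decoy_string", decoy_string,
                "-decoy_string_position", decoy_string_position]
    cmd += ["-enzyme:specificity", enzyme_specificity,
            "-aaa_max", str(aaa_max),
            "-threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # Example file names and decoy string are placeholders.
    command = build_peptide_indexer_command(
        "ids.idXML", "ids_indexed.idXML",
        fasta="target_decoy.fasta", decoy_string="DECOY_", threads=4)
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where OpenMS is installed.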
INI file documentation of this tool: