OpenMS
DecoyDatabase

Create a decoy peptide database from standard FASTA databases.

Decoy databases are useful to control false discovery rates and thus estimate score cutoffs for identified spectra.

Decoys are generated by either reversing or shuffling each peptide of a sequence (peptides are defined by the given enzyme). When reversing, the N and C terminus of each peptide are kept in position by default.
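
The peptide-wise reversal described above can be sketched as follows. This is a minimal illustration, not the OpenMS implementation: the cleavage rule is a simplified Trypsin rule (cleave after K or R unless followed by P), and the function names are hypothetical.

```python
import re


def tryptic_peptides(protein: str) -> list[str]:
    """Split after K or R unless the next residue is P (simplified, assumed Trypsin rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def reverse_peptide(peptide: str, keep_n_term: bool = True, keep_c_term: bool = True) -> str:
    """Reverse a peptide, optionally leaving its first and last residue in place."""
    start = 1 if keep_n_term else 0
    end = len(peptide) - 1 if keep_c_term else len(peptide)
    if end <= start:
        return peptide  # nothing left to reverse
    return peptide[:start] + peptide[start:end][::-1] + peptide[end:]


def reverse_decoy(protein: str) -> str:
    """Build a decoy by reversing every peptide of the digested protein."""
    return "".join(reverse_peptide(p) for p in tryptic_peptides(protein))


print(tryptic_peptides("MAGKLDERPTK"))  # ['MAGK', 'LDERPTK']
print(reverse_decoy("MAGKLDERPTK"))     # MGAKLTPREDK
```

Keeping both termini fixed means the cleavage sites of the decoy stay where they are in the target, so target and decoy peptides have the same lengths and masses.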

To obtain a 'contaminants' database, see http://www.thegpm.org/crap/index.html, or find or create your own contaminant database.

Multiple databases can be provided as input, which will internally be concatenated before being used for decoy generation. This allows you to specify your target database plus a contaminant file and obtain a concatenated target-decoy database using a single call, e.g., DecoyDatabase -in human.fasta crap.fasta -out human_TD.fasta

By default, a combined database is created where target and decoy sequences are written interleaved (i.e., target1, decoy1, target2, decoy2,...). If you need all targets before the decoys, use 'only_decoy' to write a decoy-only file and concatenate the files externally.
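
The external concatenation mentioned above could look like the following sketch. It assumes a decoy-only FASTA file was already written with 'only_decoy'; the helper name and all file names are hypothetical.

```python
from pathlib import Path


def concatenate_fasta(inputs: list[str], output: str) -> None:
    """Write the given FASTA files one after another into `output`."""
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # avoid gluing the next header onto the last sequence line


# Hypothetical usage: targets first, then the decoy-only file.
# concatenate_fasta(["human.fasta", "human_decoys.fasta"], "human_TD.fasta")
```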

The tool will keep track of all protein identifiers and report duplicates.
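
To check an input file for duplicates before running the tool, a sketch along these lines may help. It assumes that the identifier is the first whitespace-delimited token of a FASTA header, which may differ from how the tool itself defines it.

```python
from collections import Counter


def fasta_identifiers(lines):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def duplicate_identifiers(lines):
    """Return all identifiers that occur more than once, sorted."""
    counts = Counter(fasta_identifiers(lines))
    return sorted(identifier for identifier, n in counts.items() if n > 1)


records = [
    ">sp|P1|A_HUMAN first copy", "PEPTIDEK",
    ">sp|P2|B_HUMAN", "MAGK",
    ">sp|P1|A_HUMAN second copy", "LDER",
]
print(duplicate_identifiers(records))  # ['sp|P1|A_HUMAN']
```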

The tool also automatically checks whether the input files already contain decoys (based on the most common prefixes and suffixes) and terminates if any are found.
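
An affix-based check of this kind can be sketched as below. The affix list is an assumption made for illustration; the list actually used by the tool is not given here.

```python
# Hypothetical affix list; the tool's own list may differ.
ASSUMED_DECOY_AFFIXES = ("decoy", "rev", "reverse")


def looks_like_decoy(identifier: str, affixes=ASSUMED_DECOY_AFFIXES) -> bool:
    """True if the identifier starts with '<affix>_' or ends with '_<affix>' (case-insensitive)."""
    lowered = identifier.lower()
    return any(lowered.startswith(a + "_") or lowered.endswith("_" + a) for a in affixes)


def find_decoys(identifiers):
    """Return the identifiers that look like decoys."""
    return [i for i in identifiers if looks_like_decoy(i)]


print(find_decoys(["sp|P1|A_HUMAN", "DECOY_sp|P1|A_HUMAN", "sp|P2|B_HUMAN_rev"]))
# ['DECOY_sp|P1|A_HUMAN', 'sp|P2|B_HUMAN_rev']
```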

Extra functionality: The Neighbor Peptide functionality (see subsection 'NeighborSearch') finds peptides (neighbors) in a given set of sequences (FASTA file) that are similar to a target peptide (also called a relevant peptide) in mass and spectral characteristics. This provides more power when searching complex samples in which only a subset of the peptides/proteins is of interest. See www.ncbi.nlm.nih.gov/pmc/articles/PMC8489664/ and NeighborSeq for details.
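
The neighbor criterion can be illustrated with a small sketch built on the 'NeighborSearch' parameters documented below ('mz_bin_size', 'pc_mass_tolerance', 'min_shared_ion_fraction'). Several details are assumptions rather than the tool's confirmed behavior: only singly charged b and y ions are considered, B1 and B2 are taken as the numbers of occupied m/z bins per peptide and B12 as the number of shared bins, and ppm is computed relative to the relevant peptide. The minimum peptide length and missed cleavages are not handled.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def by_ion_bins(peptide: str, bin_size: float = 0.05) -> set[int]:
    """m/z bin indices occupied by the singly charged b and y ions."""
    bins = set()
    for i in range(1, len(peptide)):
        b_mz = sum(MONO[aa] for aa in peptide[:i]) + PROTON
        y_mz = sum(MONO[aa] for aa in peptide[i:]) + WATER + PROTON
        bins.add(int(b_mz / bin_size))
        bins.add(int(y_mz / bin_size))
    return bins


def shared_ion_fraction(p1: str, p2: str, bin_size: float = 0.05) -> float:
    """2*B12/(B1+B2), with B1, B2 the occupied bins and B12 the shared bins."""
    b1, b2 = by_ion_bins(p1, bin_size), by_ion_bins(p2, bin_size)
    if not b1 or not b2:
        return 0.0
    return 2 * len(b1 & b2) / (len(b1) + len(b2))


def within_tolerance(m_relevant: float, m_candidate: float, tol: float = 0.01, unit: str = "Da") -> bool:
    """Precursor mass check; ppm is taken relative to the relevant peptide (assumption)."""
    diff = abs(m_relevant - m_candidate)
    return diff <= (tol * 1e-6 * m_relevant if unit == "ppm" else tol)


def is_neighbor(relevant: str, candidate: str, min_shared: float = 0.25) -> bool:
    """Candidate is a neighbor if precursor mass and fragment overlap are both close enough."""
    return (within_tolerance(peptide_mass(relevant), peptide_mass(candidate))
            and shared_ion_fraction(relevant, candidate) >= min_shared)


print(shared_ion_fraction("GAK", "AGK"))  # 0.5
print(is_neighbor("GAK", "AGK"))          # True
```

In the toy example, GAK and AGK have the same precursor mass and share the b2 and y1 ions, so two of the four occupied bins of each peptide overlap.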

The command line parameters of this tool are:

DecoyDatabase -- Creates combined target+decoy sequence database from forward sequence database.
Full documentation: http://www.openms.de/doxygen/nightly/html/TOPP_DecoyDatabase.html
Version: 3.3.0-pre-nightly-2024-11-20 Nov 21 2024, 02:34:56, Revision: decb5c8
To cite OpenMS:
 + Pfeuffer, J., Bielow, C., Wein, S. et al. OpenMS 3 enables reproducible analysis of large-scale mass
   spectrometry data. Nat Methods (2024). doi:10.1038/s41592-024-02197-7.

Usage:
  DecoyDatabase <options>

This tool has algorithm parameters that are not shown here! Please check the ini file for a detailed
description or use the --helphelp option

Options (mandatory options marked with '*'):
  -in <file(s)>*                                    Input FASTA file(s), each containing a database. It is 
                                                    recommended to include a contaminant database as well. 
                                                    (valid formats: 'fasta')
  -out <file>*                                      Output FASTA file where the decoy database (target +
                                                    decoy or only decoy, see 'only_decoy') will be written to.
                                                    (valid formats: 'fasta')
  -decoy_string <string>                            String that is combined with the accession of the protein
                                                    identifier to indicate a decoy protein.
                                                    (default: 'DECOY_')
  -decoy_string_position <choice>                   Should the 'decoy_string' be prepended (prefix) or
                                                    appended (suffix) to the protein accession?
                                                    (default: 'prefix') (valid: 'prefix', 'suffix')
  -only_decoy                                       Write only decoy proteins to the output database instead 
                                                    of a combined database.
  -type <choice>                                    Type of sequence. RNA sequences may contain modification 
                                                    codes, which will be handled correctly if this is set to 
                                                    'RNA'. (default: 'protein') (valid: 'protein', 'RNA')
  -method <choice>                                  Method by which decoy sequences are generated from target
                                                    sequences. Note that all sequences are shuffled using the
                                                    same random seed, ensuring that identical sequences
                                                    produce the same shuffled decoy sequences. Shuffled
                                                    sequences that produce highly similar output sequences are
                                                    shuffled again (see shuffle_sequence_identity_threshold).
                                                    (default: 'reverse') (valid: 'reverse', 'shuffle')
  -enzyme <enzyme>                                  Enzyme used for the digestion of the sample. Only
                                                    applicable if parameter 'type' is 'protein'.
                                                    (default: 'Trypsin') (valid: 'Asp-N_ambic',
                                                    'Chymotrypsin', 'Chymotrypsin/P', 'CNBr',
                                                    '2-iodobenzoate', 'iodosobenzoate',
                                                    'staphylococcal protease/D', 'Trypsin', 'Arg-C',
                                                    'Arg-C/P', 'Asp-N', 'Asp-N/B',
                                                    'elastase-trypsin-chymotrypsin', 'no cleavage',
                                                    'unspecific cleavage', 'Formic_acid', 'Lys-C', 'Lys-N',
                                                    'Lys-C/P', 'PepsinA', 'TrypChymo', 'Trypsin/P', 'V8-DE',
                                                    'V8-E', 'leukocyte elastase', 'proline endopeptidase',
                                                    'glutamyl endopeptidase', 'Alpha-lytic protease',
                                                    'proline-endopeptidase/HKR', 'Glu-C+P', 'PepsinA + P',
                                                    'cyanogen-bromide', 'Clostripain/P')

Parameters for neighbor peptide search ('in' holds the neighbor candidates):
  -NeighborSearch:in_relevant_proteins <file>       These are the relevant proteins, for which we seek
                                                    neighbors (valid formats: 'fasta')
  -NeighborSearch:out_neighbor <file>               Output FASTA file with neighbors of relevant peptides 
                                                    (given in 'in_relevant_proteins').
  -NeighborSearch:out_relevant <file>               Output FASTA file with target+decoy of relevant peptides
                                                    (given in 'in_relevant_proteins'). Required for downstream
                                                    filtering of search results via IDFilter and subsequent
                                                    FDR.
  -NeighborSearch:missed_cleavages <int>            Number of missed cleavages for relevant and neighbor
                                                    peptides. (default: '0')
  -NeighborSearch:mz_bin_size <num>                 Bin size for spectra m/z comparison (the original study
                                                    suggests 0.05 Th for high-res and 1.0005079 Th for low-res
                                                    spectra). (default: '0.05')
  -NeighborSearch:pc_mass_tolerance <double>        Maximal precursor mass difference (in Da or ppm; see
                                                    'pc_mass_tolerance_unit') between neighbor and relevant
                                                    peptide. (default: '0.01')
  -NeighborSearch:pc_mass_tolerance_unit <choice>   Is 'pc_mass_tolerance' in Da or ppm? (default: 'Da')
                                                    (valid: 'Da', 'ppm')
  -NeighborSearch:min_peptide_length <int>          Minimum peptide length (relevant and neighbor peptides) 
                                                    (default: '5')
  -NeighborSearch:min_shared_ion_fraction <double>  Minimal required overlap 't_i' of b/y ions shared between
                                                    neighbor candidate and a relevant peptide
                                                    (t_i <= 2*B12/(B1+B2)). Higher values result in fewer
                                                    neighbors. (default: '0.25')

                                                    
Common TOPP options:
  -ini <file>                                       Use the given TOPP INI file
  -threads <n>                                      Sets the number of threads allowed to be used by the TOPP
                                                    tool (default: '1')
  -write_ini <file>                                 Writes the default configuration file
  --help                                            Shows options
  --helphelp                                        Shows all options (including advanced)

The following configuration subsections are valid:
 - Decoy   Decoy parameters section

You can write an example INI file using the '-write_ini' option.
Documentation of subsection parameters can be found in the doxygen documentation or the INIFileEditor.
For more information, please consult the online documentation for this tool:
  - http://www.openms.de/doxygen/nightly/html/TOPP_DecoyDatabase.html
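
When calling the tool from a script, the options listed above can be assembled into an argument list. The sketch below only builds and prints the command; the helper name is hypothetical, the file names are taken from the example earlier on this page, and actually running the command requires an OpenMS installation with DecoyDatabase on the PATH.

```python
import shlex


def decoy_database_command(inputs, output, method="reverse", decoy_string="DECOY_",
                           decoy_string_position="prefix", enzyme="Trypsin",
                           only_decoy=False):
    """Build the argument list for a DecoyDatabase call from the documented options."""
    cmd = [
        "DecoyDatabase",
        "-in", *inputs,
        "-out", output,
        "-decoy_string", decoy_string,
        "-decoy_string_position", decoy_string_position,
        "-method", method,
        "-enzyme", enzyme,
    ]
    if only_decoy:
        cmd.append("-only_decoy")  # flag without a value, as shown in the help text
    return cmd


cmd = decoy_database_command(["human.fasta", "crap.fasta"], "human_TD.fasta", method="shuffle")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```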

INI file documentation of this tool:

+ DecoyDatabase: Creates combined target+decoy sequence database from forward sequence database.
  version: Version of the tool that generated this parameters file. (default: '3.3.0-pre-nightly-2024-11-20')
++ 1: Instance '1' section for 'DecoyDatabase'
  in: Input FASTA file(s), each containing a database. It is recommended to include a contaminant database as well. (default: '[]') (input file; valid formats: '*.fasta')
  out: Output FASTA file where the decoy database (target + decoy or only decoy, see 'only_decoy') will be written to. (default: '') (output file; valid formats: '*.fasta')
  decoy_string: String that is combined with the accession of the protein identifier to indicate a decoy protein. (default: 'DECOY_')
  decoy_string_position: Should the 'decoy_string' be prepended (prefix) or appended (suffix) to the protein accession? (default: 'prefix') (valid: 'prefix', 'suffix')
  only_decoy: Write only decoy proteins to the output database instead of a combined database. (default: 'false') (valid: 'true', 'false')
  type: Type of sequence. RNA sequences may contain modification codes, which will be handled correctly if this is set to 'RNA'. (default: 'protein') (valid: 'protein', 'RNA')
  method: Method by which decoy sequences are generated from target sequences. Note that all sequences are shuffled using the same random seed, ensuring that identical sequences produce the same shuffled decoy sequences. Shuffled sequences that produce highly similar output sequences are shuffled again (see shuffle_sequence_identity_threshold). (default: 'reverse') (valid: 'reverse', 'shuffle')
  shuffle_max_attempts: shuffle: maximum attempts to lower the amino acid sequence identity between target and decoy for the shuffle algorithm (default: '30')
  shuffle_sequence_identity_threshold: shuffle: target-decoy amino acid sequence identity threshold for the shuffle algorithm. If the sequence identity is above this threshold, shuffling is repeated. In case of repeated failure, individual amino acids are 'mutated' to produce a different amino acid sequence. (default: '0.5')
  seed: Random number seed (use 'time' for system time) (default: '1')
  enzyme: Enzyme used for the digestion of the sample. Only applicable if parameter 'type' is 'protein'. (default: 'Trypsin') (valid: 'Asp-N_ambic', 'Chymotrypsin', 'Chymotrypsin/P', 'CNBr', '2-iodobenzoate', 'iodosobenzoate', 'staphylococcal protease/D', 'Trypsin', 'Arg-C', 'Arg-C/P', 'Asp-N', 'Asp-N/B', 'elastase-trypsin-chymotrypsin', 'no cleavage', 'unspecific cleavage', 'Formic_acid', 'Lys-C', 'Lys-N', 'Lys-C/P', 'PepsinA', 'TrypChymo', 'Trypsin/P', 'V8-DE', 'V8-E', 'leukocyte elastase', 'proline endopeptidase', 'glutamyl endopeptidase', 'Alpha-lytic protease', 'proline-endopeptidase/HKR', 'Glu-C+P', 'PepsinA + P', 'cyanogen-bromide', 'Clostripain/P')
  log: Name of log file (created only when specified) (default: '')
  debug: Sets the debug level (default: '0')
  threads: Sets the number of threads allowed to be used by the TOPP tool (default: '1')
  no_progress: Disables progress logging to command line (default: 'false') (valid: 'true', 'false')
  force: Overrides tool-specific checks (default: 'false') (valid: 'true', 'false')
  test: Enables the test mode (needed for internal use only) (default: 'false') (valid: 'true', 'false')
+++ NeighborSearch: Parameters for neighbor peptide search ('in' holds the neighbor candidates)
  in_relevant_proteins: These are the relevant proteins, for which we seek neighbors (default: '') (input file; valid formats: '*.fasta')
  out_neighbor: Output FASTA file with neighbors of relevant peptides (given in 'in_relevant_proteins'). (default: '') (output file)
  out_relevant: Output FASTA file with target+decoy of relevant peptides (given in 'in_relevant_proteins'). Required for downstream filtering of search results via IDFilter and subsequent FDR. (default: '') (output file)
  missed_cleavages: Number of missed cleavages for relevant and neighbor peptides. (default: '0')
  mz_bin_size: Bin size for spectra m/z comparison (the original study suggests 0.05 Th for high-res and 1.0005079 Th for low-res spectra). (default: '0.05')
  pc_mass_tolerance: Maximal precursor mass difference (in Da or ppm; see 'pc_mass_tolerance_unit') between neighbor and relevant peptide. (default: '0.01')
  pc_mass_tolerance_unit: Is 'pc_mass_tolerance' in Da or ppm? (default: 'Da') (valid: 'Da', 'ppm')
  min_peptide_length: Minimum peptide length (relevant and neighbor peptides) (default: '5')
  min_shared_ion_fraction: Minimal required overlap 't_i' of b/y ions shared between neighbor candidate and a relevant peptide (t_i <= 2*B12/(B1+B2)). Higher values result in fewer neighbors. (default: '0.25')
+++ Decoy: Decoy parameters section
  non_shuffle_pattern: Residues to not shuffle (keep at a constant position when shuffling). Separate by comma, e.g. use 'K,P,R' here. (default: '')
  keepPeptideNTerm: Whether to keep peptide N terminus constant when shuffling / reversing. (default: 'true') (valid: 'true', 'false')
  keepPeptideCTerm: Whether to keep peptide C terminus constant when shuffling / reversing. (default: 'true') (valid: 'true', 'false')
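
The interplay of 'seed', 'shuffle_max_attempts', 'shuffle_sequence_identity_threshold' and the two keepPeptide*Term switches can be sketched as follows. This is an illustration under assumptions, not the tool's code: identity is taken as the fraction of positions with an identical residue, a fresh generator is seeded per peptide so that identical peptides yield identical decoys, and neither the 'mutation' fallback nor 'non_shuffle_pattern' is implemented.

```python
import random


def sequence_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree (assumed definition)."""
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shuffle_peptide(peptide: str, seed: int = 1, max_attempts: int = 30,
                    identity_threshold: float = 0.5,
                    keep_n_term: bool = True, keep_c_term: bool = True) -> str:
    """Shuffle the inner residues until identity to the target is at or below the threshold."""
    rng = random.Random(seed)  # same seed per peptide: identical input, identical decoy
    start = 1 if keep_n_term else 0
    end = len(peptide) - 1 if keep_c_term else len(peptide)
    if end - start < 2:
        return peptide  # fewer than two residues to permute
    best, best_identity = peptide, 1.0
    for _ in range(max_attempts):
        middle = list(peptide[start:end])
        rng.shuffle(middle)
        candidate = peptide[:start] + "".join(middle) + peptide[end:]
        identity = sequence_identity(peptide, candidate)
        if identity < best_identity:
            best, best_identity = candidate, identity
        if best_identity <= identity_threshold:
            break  # dissimilar enough, stop retrying
    return best  # the tool would additionally 'mutate' residues after repeated failure


decoy = shuffle_peptide("PEPTIDEK")
print(decoy, sequence_identity("PEPTIDEK", decoy))
```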