OpenMS
Compute peptide and protein abundances from annotated feature/consensus maps or from identification results.
| potential predecessor tools | → ProteinQuantifier → | potential successor tools |
|---|---|---|
| IDMapper | | external tools e.g. for statistical analysis |
| FeatureLinkerUnlabeled (or another feature grouping tool) | | |
Reference:
Weisser et al.: An automated pipeline for high-throughput label-free quantitative proteomics (J. Proteome Res., 2013, PMID: 23391308).
Input: featureXML or consensusXML
Quantification is based on the intensity values of the features in the input files. Feature intensities are first accumulated to peptide abundances, according to the peptide identifications annotated to the features/feature groups. Then, abundances of the peptides of a protein are aggregated to compute the protein abundance.
The peptide-to-protein step uses the N (e.g. 3) most abundant proteotypic peptides per protein to compute the protein abundances. This generalizes the "top 3 approach" (but only for relative quantification) described in:
Silva et al.: Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition (Mol. Cell. Proteomics, 2006, PMID: 16219938).
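As a rough illustration of the peptide-to-protein step, the following Python sketch aggregates the N most abundant peptide abundances of a single protein. It is not the OpenMS implementation: the function name protein_abundance and its arguments are hypothetical, the defaults merely mirror the documented defaults (N = 3, median), and the 'weighted_mean' option is left out because its exact weighting is not specified here.

```python
from statistics import mean, median

# Aggregation functions named after the documented choices
# ('weighted_mean' is deliberately omitted in this sketch).
AGGREGATORS = {"median": median, "mean": mean, "sum": sum}


def protein_abundance(peptide_abundances, top_n=3, aggregate="median",
                      include_all=False):
    """Aggregate the top_n most abundant peptides of one protein.

    Returns None if the protein cannot be quantified under the settings.
    top_n == 0 means "use all peptides".
    """
    ranked = sorted(peptide_abundances, reverse=True)  # most abundant first
    if top_n > 0:
        if len(ranked) < top_n and not include_all:
            return None  # too few proteotypic peptides
        ranked = ranked[:top_n]
    if not ranked:
        return None
    return AGGREGATORS[aggregate](ranked)


# Hypothetical abundances of four proteotypic peptides of one protein
peptides = [10.0, 40.0, 20.0, 30.0]
print(protein_abundance(peptides))                            # median of the top 3
print(protein_abundance(peptides, top_n=0, aggregate="sum"))  # sum of all peptides
```

With fewer than N peptides available, the sketch returns None unless include_all is set, which mirrors the documented role of that option.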
Only features/feature groups with unambiguous peptide annotation are used for peptide quantification. It is possible to resolve ambiguities before applying ProteinQuantifier using one of several equivalent mechanisms in OpenMS: IDConflictResolver, ConsensusID (algorithm 'best'), or FileFilter (option 'id:keep_best_score_id').
Similarly, only proteotypic peptides (i.e. those matching to exactly one protein) are used for protein quantification by default. Peptide/protein IDs from multiple identification runs can be handled, but will not be differentiated (i.e. protein accessions for a peptide will be accumulated over all identification runs). See section "Optional input: Protein inference/grouping results" below for exceptions to this.
Peptides with the same sequence, but with different modifications are quantified separately on the peptide level, but treated as one peptide for the protein quantification (i.e. the contributions of differently-modified variants of the same peptide are accumulated).
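The treatment of modifications can be pictured with a small Python sketch. It assumes, for illustration only, that modifications are written in parentheses inside the sequence string (e.g. "PEPM(Oxidation)IDE") and that "accumulated" means summed; the helper names are hypothetical and this is not the OpenMS code.

```python
import re
from collections import defaultdict


def strip_modifications(modified_sequence):
    # Assumption: modifications appear as "(Name)" after the modified residue.
    return re.sub(r"\([^)]*\)", "", modified_sequence)


def accumulate_variants(peptide_level):
    """Collapse differently-modified variants into one entry per bare sequence.

    peptide_level maps modified sequence -> abundance (peptide-level result).
    The return value maps unmodified sequence -> summed abundance, which is
    what the protein-level step would see in this sketch.
    """
    totals = defaultdict(float)
    for modified_sequence, abundance in peptide_level.items():
        totals[strip_modifications(modified_sequence)] += abundance
    return dict(totals)


# Peptide level: the two variants are reported separately ...
peptide_level = {"PEPM(Oxidation)IDE": 2.0, "PEPMIDE": 3.0, "OTHERK": 1.0}
# ... protein level: they contribute as a single peptide.
print(accumulate_variants(peptide_level))
```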
Input: idXML
Quantification based on identification results uses spectral counting, i.e. the abundance of each peptide is the number of times that peptide was identified from an MS2 spectrum (considering only the best hit per spectrum). Different identification runs in the input are treated as different samples; this makes it possible to quantify several related samples at once by merging the corresponding idXML files with IDMerger. Depending on the presence of multiple runs, output format and applicable parameters are the same as for featureXML and consensusXML, respectively.
The notes above regarding quantification on the protein level and the treatment of modifications also apply to idXML input. In particular, this means that 'top' should be set to 0 and 'aggregate' to 'sum' to get the "classical" spectral counting quantification on the protein level (where all identifications of all peptides of a protein are summed up).
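A minimal Python sketch of spectral counting as described above: each MS2 spectrum contributes one count to its best-scoring peptide hit, and the "classical" protein-level count sums the counts of all peptides of a protein. The data layout, the function names, and the assumption that a higher score is better are illustrative choices, not part of OpenMS.

```python
from collections import Counter


def spectral_counts(spectra):
    """Count identifications per peptide, using only the best hit per spectrum.

    spectra is a list of hit lists; each hit is a (peptide, score) tuple.
    Assumption: a higher score means a better hit.
    """
    counts = Counter()
    for hits in spectra:
        if hits:  # spectra without any hit contribute nothing
            best_peptide, _ = max(hits, key=lambda hit: hit[1])
            counts[best_peptide] += 1
    return counts


def classical_protein_count(counts, peptides_of_protein):
    # Corresponds in spirit to 'top' = 0 with 'aggregate' = 'sum':
    # all identifications of all peptides of the protein are summed up.
    return sum(counts[peptide] for peptide in peptides_of_protein)


# Hypothetical identification run with four spectra
run = [
    [("AAK", 0.9), ("CCK", 0.5)],  # best hit: AAK
    [("AAK", 0.8)],
    [("DDR", 0.7)],
    [],                            # unidentified spectrum
]
counts = spectral_counts(run)
print(counts["AAK"], counts["DDR"], classical_protein_count(counts, ["AAK", "DDR"]))
```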
Optional input: Protein inference/grouping results
By default, only proteotypic peptides (i.e. those matching exactly one protein) are used for protein quantification. However, this limitation can be overcome: protein inference results for the whole sample set can be supplied with the 'protein_groups' option (or included in a featureXML input). In that case, the peptide-to-protein references from that file are used (rather than those from 'in'), and groups of indistinguishable proteins will be quantified. Each reported protein quantity then refers to the total for the respective group.
For this to work correctly, the protein inference results must come from the same identifications that were used to annotate the quantitative data. We suggest using the OpenMS tool ProteinInference.
More information below the parameter specification.
The command line parameters of this tool are:
    ProteinQuantifier -- Compute peptide and protein abundances
    Full documentation: http://www.openms.de/doxygen/nightly/html/TOPP_ProteinQuantifier.html
    Version: 3.4.0-pre-nightly-2024-12-16 Dec 17 2024, 02:41:12, Revision: 96ad74c
    To cite OpenMS:
     + Pfeuffer, J., Bielow, C., Wein, S. et al.. OpenMS 3 enables reproducible analysis of large-scale mass spectrometry data. Nat Methods (2024). doi:10.1038/s41592-024-02197-7.

    Usage:
      ProteinQuantifier <options>

    Options (mandatory options marked with '*'):
      -in <file>*                Input file (valid formats: 'featureXML', 'consensusXML', 'idXML')
      -protein_groups <file>     Protein inference results for the identification runs that were used to annotate the input (e.g. via the ProteinInference tool). Information about indistinguishable proteins will be used for protein quantification. (valid formats: 'idXML')
      -design <file>             Input file containing the experimental design (valid formats: 'tsv')
      -out <file>                Output file for protein abundances (valid formats: 'csv')
      -peptide_out <file>        Output file for peptide abundances (valid formats: 'csv')
      -mztab <file>              Output file (mzTab) (valid formats: 'mzTab')
      -method <choice>           - top - quantify based on three most abundant peptides (number can be changed in 'top').
                                 - iBAQ (intensity based absolute quantification), calculate the sum of all peptide peak intensities divided by the number of theoretically observable tryptic peptides (https://rdcu.be/cND1J). Warning: only consensusXML or featureXML input is allowed!
                                 (default: 'top') (valid: 'top', 'iBAQ')
      -best_charge_and_fraction  Distinguish between fraction and charge states of a peptide. For peptides, abundances will be reported separately for each fraction and charge; for proteins, abundances will be computed based only on the most prevalent charge observed of each peptide (over all fractions). By default, abundances are summed over all charge states.

    Additional options for custom quantification using top N peptides.:
      -top:N <number>            Calculate protein abundance from this number of proteotypic peptides (most abundant first; '0' for all) (default: '3') (min: '0')
      -top:aggregate <choice>    Aggregation method used to compute protein abundances from peptide abundances (default: 'median') (valid: 'median', 'mean', 'weighted_mean', 'sum')
      -top:include_all           Include results for proteins with fewer proteotypic peptides than indicated by 'N' (no effect if 'N' is 0 or 1)

    Additional options for consensus maps (and identification results comprising multiple runs):
      -consensus:normalize       Scale peptide abundances so that medians of all samples are equal
      -consensus:fix_peptides    Use the same peptides for protein quantification across all samples. With 'N 0',all peptides that occur in every sample are considered. Otherwise ('N'), the N peptides that occur in the most samples (independently of each other) are selected, breaking ties by total abundance (there is no guarantee that the best co-ocurring peptides are chosen!).

      -greedy_group_resolution <choice>  Pre-process identifications with greedy resolution of shared peptides based on the protein group probabilities. (Only works with an idXML file given as protein_groups parameter). (default: 'false') (valid: 'true', 'false')
      -ratios                    Add the log2 ratios of the abundance values to the output. Format: log_2(x_0/x_0) <sep> log_2(x_1/x_0) <sep> log_2(x_2/x_0) ...
      -ratiosSILAC               Add the log2 ratios for a triple SILAC experiment to the output. Only applicable to consensus maps of exactly three sub-maps. Format: log_2(heavy/light) <sep> log_2(heavy/middle) <sep> log_2(middle/light)

    Output formatting options:
      -format:separator <sep>    Character(s) used to separate fields; by default, the 'tab' character is used
      -format:quoting <method>   Method for quoting of strings: 'none' for no quoting, 'double' for quoting with doubling of embedded quotes, 'escape' for quoting with backslash-escaping of embedded quotes (default: 'double') (valid: 'none', 'double', 'escape')
      -format:replacement <x>    If 'quoting' is 'none', used to replace occurrences of the separator in strings before writing (default: '_')

    Common TOPP options:
      -ini <file>                Use the given TOPP INI file
      -threads <n>               Sets the number of threads allowed to be used by the TOPP tool (default: '1')
      -write_ini <file>          Writes the default configuration file
      --help                     Shows options
      --helphelp                 Shows all options (including advanced)
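To show how the options listed above combine into a call, here is a small Python sketch that assembles a ProteinQuantifier command line without running it. The helper build_command and the file names are hypothetical placeholders; only options that appear in the help output are used, and '-top:include_all' is treated as a value-less flag because the help output lists it without an argument.

```python
import shlex


def build_command(in_file, out_file, top_n=3, aggregate="median",
                  include_all=False, peptide_out=None):
    """Assemble an argument list for ProteinQuantifier (illustrative helper)."""
    cmd = [
        "ProteinQuantifier",
        "-in", in_file,
        "-out", out_file,
        "-top:N", str(top_n),
        "-top:aggregate", aggregate,
    ]
    if include_all:
        cmd.append("-top:include_all")  # listed without an argument in the help
    if peptide_out:
        cmd += ["-peptide_out", peptide_out]
    return cmd


# Placeholder file names; summing over all peptides of each protein.
cmd = build_command("sample.consensusXML", "proteins.csv",
                    top_n=0, aggregate="sum", peptide_out="peptides.csv")
print(shlex.join(cmd))  # could be passed to subprocess.run(cmd, check=True)
```

Building the argument list first and printing it makes it easy to inspect the call before handing it to a process runner.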
Output format
The output files produced by this tool have a table format, with columns as described below:
Protein output (one protein/set of indistinguishable proteins per line):
[...] top).

Peptide output (one peptide or, if 'best_charge_and_fraction' is set, one charge state and fraction of a peptide per line):
[...] 'best_charge_and_fraction' was set.
[...] 'consensus:normalize' was set.

Protein quantification examples
While quantification on the peptide level is fairly straightforward, a number of options influence quantification on the protein level, especially for consensusXML input. The three parameters 'top:N', 'top:include_all' and 'consensus:fix_peptides' determine which peptides are used to quantify proteins in different samples.
As an example, consider a protein with four proteotypic peptides. Each peptide is detected in a subset of three samples, as indicated in the table below. The peptides are ranked by abundance (1: highest, 4: lowest; assuming for simplicity that the order is the same in all samples).
| | sample 1 | sample 2 | sample 3 |
|---|---|---|---|
| peptide 1 | X | | X |
| peptide 2 | X | X | |
| peptide 3 | X | X | X |
| peptide 4 | X | X | |
Different parameter combinations lead to different quantification scenarios, as shown here:
The first three columns give the parameters ("*": no effect in this case), the next three the peptides used for quantification in each sample ("(...)": not quantified here because ...), and the last an explanation.

| top | include_all | consensus:fix_peptides | sample 1 | sample 2 | sample 3 | explanation |
|---|---|---|---|---|---|---|
| 0 | * | no | 1, 2, 3, 4 | 2, 3, 4 | 1, 3 | all peptides |
| 1 | * | no | 1 | 2 | 1 | single most abundant peptide |
| 2 | * | no | 1, 2 | 2, 3 | 1, 3 | two most abundant peptides |
| 3 | no | no | 1, 2, 3 | 2, 3, 4 | (too few peptides) | three most abundant peptides |
| 3 | yes | no | 1, 2, 3 | 2, 3, 4 | 1, 3 | three or fewer most abundant peptides |
| 4 | no | * | 1, 2, 3, 4 | (too few peptides) | (too few peptides) | four most abundant peptides |
| 4 | yes | * | 1, 2, 3, 4 | 2, 3, 4 | 1, 3 | four or fewer most abundant peptides |
| 0 | * | yes | 3 | 3 | 3 | all peptides present in every sample |
| 1 | * | yes | 3 | 3 | 3 | single peptide present in most samples |
| 2 | no | yes | 1, 3 | (peptide 1 missing) | 1, 3 | two peptides present in most samples |
| 2 | yes | yes | 1, 3 | 3 | 1, 3 | two or fewer peptides present in most samples |
| 3 | no | yes | 1, 2, 3 | (peptide 1 missing) | (peptide 2 missing) | three peptides present in most samples |
| 3 | yes | yes | 1, 2, 3 | 2, 3 | 1, 3 | three or fewer peptides present in most samples |
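The scenarios in the table can be reproduced with a short Python sketch. It is a simplified model written for this example, not the OpenMS implementation: the function select_peptides is hypothetical, peptides are represented by their abundance rank (1 = most abundant), and that rank also stands in for "total abundance" when breaking ties under fix_peptides.

```python
def select_peptides(samples, top_n, include_all=False, fix_peptides=False):
    """Return {sample: list of peptides used, or None if not quantified}.

    samples maps a sample name to the set of peptide ranks detected in it.
    """
    chosen = None
    if fix_peptides:
        all_peptides = set().union(*samples.values())
        if top_n == 0:
            # all peptides that occur in every sample
            chosen = {p for p in all_peptides
                      if all(p in detected for detected in samples.values())}
        else:
            def occurrences(p):
                return sum(p in detected for detected in samples.values())
            # most samples first; ties broken by abundance (lower rank wins)
            ranked = sorted(all_peptides, key=lambda p: (-occurrences(p), p))
            chosen = set(ranked[:top_n])

    result = {}
    for name, detected in samples.items():
        candidates = sorted(detected)  # rank order equals abundance order
        if fix_peptides:
            candidates = [p for p in candidates if p in chosen]
        elif top_n > 0:
            candidates = candidates[:top_n]
        if top_n > 0 and len(candidates) < top_n and not include_all:
            result[name] = None  # "(too few peptides)" / "(peptide missing)"
        else:
            result[name] = candidates
    return result


# Detection pattern from the example table above
samples = {
    "sample 1": {1, 2, 3, 4},
    "sample 2": {2, 3, 4},
    "sample 3": {1, 3},
}
print(select_peptides(samples, top_n=3))
print(select_peptides(samples, top_n=2, fix_peptides=True))
print(select_peptides(samples, top_n=3, include_all=True, fix_peptides=True))
```

A None entry corresponds to the parenthesized cells in the table, i.e. the protein is not quantified in that sample.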
Further considerations for parameter selection
With 'best_charge_and_fraction' and 'aggregate', there is a trade-off between comparability of protein abundances within a sample and of abundances for the same protein across different samples.
Setting 'best_charge_and_fraction' may increase reproducibility between samples, but will distort the proportions of protein abundances within a sample. The reason is that ionization properties vary between peptides, but should remain constant across samples. Filtering by charge state can help to reduce the impact of feature detection differences between samples.
For 'aggregate', there is a qualitative difference between (intensity weighted) mean/median and 'sum' in the effect that missing peptide abundances have (only if 'include_all' is set or 'top' is 0). The (intensity weighted) mean and 'median' ignore missing cases, averaging only present values. If low-abundant peptides are not detected in some samples, the computed protein abundances for those samples may thus be too optimistic (i.e. overestimated). In contrast, 'sum' implicitly treats missing values as zero, so this problem does not occur and comparability across samples is ensured. However, with 'sum' the total number of peptides ("summands") available for a protein may affect the abundances computed for it (depending on 'top'), so results within a sample may become disproportionate.
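The effect of missing values on the different aggregation methods can be seen in a toy Python example. The numbers are invented, the function aggregate_present is a hypothetical helper, and the intensity weighted mean is not covered; the sketch only contrasts averaging over the present values with summing.

```python
from statistics import mean, median


def aggregate_present(abundances, method):
    """Aggregate peptide abundances; None marks a peptide that was not detected."""
    present = [a for a in abundances if a is not None]
    if not present:
        return None
    if method == "sum":
        return sum(present)  # identical to counting missing values as zero
    if method == "mean":
        return mean(present)  # averages only the detected peptides
    if method == "median":
        return median(present)
    raise ValueError(f"unknown method: {method}")


# Same protein in two samples; the low-abundance peptide is missing in sample B.
sample_a = [100.0, 50.0, 10.0]
sample_b = [100.0, 50.0, None]

for method in ("mean", "median", "sum"):
    print(method, aggregate_present(sample_a, method),
          aggregate_present(sample_b, method))
```

In this example the mean and median for sample B come out higher than for sample A although one peptide was lost, whereas the sum for sample B is lower, in line with the trade-off described above.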