OpenMS
Percolator Class Reference

In-process Percolator: semi-supervised target/decoy rescoring with q-values + PEPs. More...

#include <OpenMS/ANALYSIS/ID/Percolator.h>


Public Member Functions

 Percolator ()
 
 ~Percolator () override
 
 Percolator (const Percolator &)=delete
 
Percolator & operator= (const Percolator &)=delete
 
void rescore (std::vector< PeptideIdentification > &peptide_ids, const StringList &feature_names={})
 Rescore PSMs in place. Domain-specific convenience wrapper over rescore(RescoreInput).
 
RescoreOutput rescore (const RescoreInput &input)
 Rescore a feature matrix domain-agnostically.
 
const std::vector< std::vector< double > > & getSvmWeights () const
 SVM weights trained in the last rescore()/train() call.
 
double getPi0 () const
 Pi0 (null fraction) from the last rescore()/score() call.
 
PercolatorModel train (const RescoreInput &input)
 Train a Percolator model on feature rows. No scoring.
 
RescoreOutput score (const RescoreInput &input, const PercolatorModel &model)
 Score feature rows using a pre-trained model. No training.
 
- Public Member Functions inherited from DefaultParamHandler
 DefaultParamHandler (const std::string &name)
 Constructor with name that is displayed in error messages.
 
 DefaultParamHandler (const DefaultParamHandler &rhs)
 Copy constructor.
 
virtual ~DefaultParamHandler ()
 Destructor.
 
DefaultParamHandler & operator= (const DefaultParamHandler &rhs)
 Assignment operator.
 
virtual bool operator== (const DefaultParamHandler &rhs) const
 Equality operator.
 
void setParameters (const Param &param)
 Sets the parameters.
 
const Param & getParameters () const
 Non-mutable access to the parameters.
 
const Param & getDefaults () const
 Non-mutable access to the default parameters.
 
const std::string & getName () const
 Non-mutable access to the name.
 
void setName (const std::string &name)
 Mutable access to the name.
 
const std::vector< std::string > & getSubsections () const
 Non-mutable access to the registered subsections.
 

Static Public Member Functions

static void fillPINCompatibleFields (const std::vector< PeptideIdentification > &peptide_ids, bool flatten_hits, RescoreInput &input)
 Fill PIN-compatible optional fields on a RescoreInput.
 
static void saveModel (const PercolatorModel &model, const std::string &filename)
 Serialize a PercolatorModel to a plain-text file.
 
static PercolatorModel loadModel (const std::string &filename)
 Deserialize a PercolatorModel written by saveModel().
 
- Static Public Member Functions inherited from DefaultParamHandler
static void writeParametersToMetaValues (const Param &write_this, MetaInfoInterface &write_here, const std::string &key_prefix="")
 Writes all parameters to meta values.
 

Protected Member Functions

void updateMembers_ () override
 This method is used to update extra member variables at the end of the setParameters() method.
 
- Protected Member Functions inherited from DefaultParamHandler
void defaultsToParam_ ()
 Updates the parameters after the defaults have been set in the constructor.
 

Private Attributes

std::unique_ptr< Impl > impl_
 

Additional Inherited Members

- Protected Attributes inherited from DefaultParamHandler
Param param_
 Container for current parameters.
 
Param defaults_
 Container for default parameters. This member should be filled in the constructor of derived classes!
 
std::vector< std::string > subsections_
 Container for registered subsections. This member should be filled in the constructor of derived classes!
 
std::string error_name_
 Name that is displayed in error messages during the parameter checking.
 
bool check_defaults_
 If this member is set to false, no parameter checking is done.
 
bool warn_empty_defaults_
 If this member is set to false, no warning is emitted when defaults are empty.
 

Detailed Description

In-process Percolator: semi-supervised target/decoy rescoring with q-values + PEPs.

Wraps a vendored subset of Percolator (training and posterior estimation). The public API is grouped into two sets of entry points.

Combined training and scoring (original Percolator semantics; per-fold score normalization on the merged set):

  • rescore(std::vector<PeptideIdentification>&, ...) — overload for PSM-shaped data; writes scores back to each PeptideHit as meta values.
  • rescore(const RescoreInput&) — domain-agnostic. Accepts a feature matrix, target/decoy labels, optional CV grouping keys, and feature names. Applicable to PSM rescoring, transition rescoring, peak-group rescoring, or any other setting where a semi-supervised target/decoy classifier is appropriate. The vendored implementation requires PSM-shaped records internally; the wrapper attaches synthetic row identifiers to satisfy that requirement and these never surface in the public API.

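Separate training and scoring (for train-on-A / predict-on-B workflows; uses fold-averaged weights):

  • train(const RescoreInput&): trains a model on feature rows and returns a PercolatorModel. No scoring.
  • score(const RescoreInput&, const PercolatorModel&): scores feature rows with a pre-trained model. No training.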

rescore() and train()+score() produce scores on the same scale but are not bitwise equal on the same input; see score() for the rationale.

Preconditions

  • At least ~100 decoys and enough discriminable targets to pass SanityCheck, else InvalidValue is thrown.
  • Features are continuous-valued doubles. Categorical features must be one-hot-encoded by the caller.
  • If the input contains rows that share structure (e.g., multiple transitions per precursor), cv_group_keys must partition them so related rows go into the same CV fold. Otherwise q-values will be optimistic.
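
The grouping precondition above can be illustrated with a small sketch. It assumes each row carries some parent identifier (for example a precursor ID) and that integer keys are an acceptable representation for cv_group_keys; both the helper name makeGroupKeys and the key type are assumptions for illustration, not part of the documented API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not an OpenMS function): assign one integer key per row
// so that rows sharing a parent identifier receive the same key and can be
// kept together in one CV fold.
std::vector<std::size_t> makeGroupKeys(const std::vector<std::string>& parent_ids)
{
  std::unordered_map<std::string, std::size_t> key_of;
  std::vector<std::size_t> keys;
  keys.reserve(parent_ids.size());
  for (const std::string& id : parent_ids)
  {
    // try_emplace inserts the next unused key only if this id is new;
    // otherwise it returns the entry that already exists.
    const auto it = key_of.try_emplace(id, key_of.size()).first;
    keys.push_back(it->second);
  }
  assert(keys.size() == parent_ids.size()); // one key per input row
  return keys;
}
```

Keys are assigned in order of first appearance, so the result depends only on the input ordering, which fits the reproducibility requirement stated elsewhere on this page.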

Thread safety

A single instance is not concurrent-safe; construct one instance per worker. The vendored Percolator code additionally relies on several process-wide statics (FeatureNames::numFeatures, SanityCheck::initDefaultDir*, etc.) that are reset at the start of each rescore() / train() / score() call. Concurrent calls across different instances therefore also race on these globals; serialize at the call site if parallelism is required.
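
One way to serialize at the call site is a single process-wide mutex that every worker takes before calling into the class. The sketch below is only an illustration: percolatorCallMutex and callSerialized are hypothetical names, and the callable stands in for whatever rescore() / train() / score() call the worker would make.

```cpp
#include <mutex>
#include <utility>

// Hypothetical process-wide lock shared by all workers (function-local static,
// so initialization is thread-safe since C++11).
inline std::mutex& percolatorCallMutex()
{
  static std::mutex m;
  return m;
}

// Runs `call` while holding the lock and forwards its result. Intended to wrap
// a rescore()/train()/score() invocation so that no two run concurrently.
template <typename Callable>
auto callSerialized(Callable&& call) -> decltype(std::forward<Callable>(call)())
{
  std::lock_guard<std::mutex> lock(percolatorCallMutex());
  return std::forward<Callable>(call)();
}
```

Each worker would still own its own instance; the lock only protects the shared statics, so work outside the wrapped call (feature extraction, I/O) can remain parallel.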

Reproducibility

Results are reproducible given the same seed, thread count, and input ordering. Changing the thread count can perturb results because of FeatureMemoryPool allocation ordering.

See also
PSM rescoring example: PercolatorAdapter / ProSE.
Transition rescoring example: OpenSwath layer (to be added when needed).

Constructor & Destructor Documentation

◆ Percolator() [1/2]

Percolator ( )

◆ ~Percolator()

~Percolator ( )
override

◆ Percolator() [2/2]

Percolator ( const Percolator & )
delete

Member Function Documentation

◆ fillPINCompatibleFields()

static void fillPINCompatibleFields ( const std::vector< PeptideIdentification > &  peptide_ids,
bool  flatten_hits,
RescoreInput &  input
)
static

Fill PIN-compatible optional fields on a RescoreInput.

Populates input.scan_numbers, input.spec_file_numbers, input.exp_masses, and input.calc_masses from the given PeptideIdentifications, using the same derivation that PercolatorInfile::store would apply when writing a .pin file:

  • scan: parsed via SpectrumLookup::extractScanNumber from the PeptideIdentification's spectrum_reference (or spectrum_id meta value, or fallback to 1-based index).
  • spec_file: hashes file_origin + id_merge_index (same as the PIN SpecId prefix). Zero when single-file / unset.
  • exp_mass: pid.getMZ() (kept as m/z — Percolator doesn't convert to neutral for the sort hash).
  • calc_mass: from hit.metaValueExists("CalcMass") if present, else hit.getSequence().getMZ(hit.getCharge()) (m/z, matching the PercolatorInfile::store fallback).

This helper must be called before rescore(input) when the in-process output is required to match running the external percolator binary through the .pin / .pout pipeline on the same inputs. Without it, the row index is used in place of the scan number, producing a different CV fold split and consequently different trained weights and final scores.

The PeptideIdentifications vector passed in must parallel input.features exactly: same ordering, one row per hit per PeptideIdentification.

Parameters
peptide_ids  Source of PIN-equivalent metadata.
flatten_hits  If true, iterate all hits per PeptideIdentification (matches high-level rescore row ordering). If false, use only the first hit per PeptideIdentification.
input  Output: the four PIN-compatible fields are written here.
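
The row-ordering contract (one row per hit per PeptideIdentification, or only the first hit when flatten_hits is false) can be sketched as an index mapping. This is an illustration only: rowOrigins is a hypothetical helper, and the treatment of identifications with zero hits (they contribute no row) is an assumption rather than documented behavior.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: for each feature row, report which identification and
// which hit it would correspond to, given the number of hits per identification.
std::vector<std::pair<std::size_t, std::size_t>>
rowOrigins(const std::vector<std::size_t>& hits_per_id, bool flatten_hits)
{
  std::vector<std::pair<std::size_t, std::size_t>> rows;
  for (std::size_t p = 0; p < hits_per_id.size(); ++p)
  {
    const std::size_t n = hits_per_id[p];
    // All hits when flattening, otherwise at most the first hit.
    const std::size_t limit = flatten_hits ? n : (n > 0 ? 1 : 0);
    for (std::size_t h = 0; h < limit; ++h)
    {
      rows.emplace_back(p, h);
    }
  }
  return rows;
}
```

Comparing rows.size() with the number of rows in input.features before calling the helper is a cheap way to catch a mismatch between the two orderings.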

◆ getPi0()

double getPi0 ( ) const

Pi0 (null fraction) from the last rescore()/score() call.

The target-null fraction estimated by the vendored PosteriorEstimator::estimatePi0 on the merged, post-TDC Scores set immediately before PEP calculation. Equivalent to the value reported on stderr by the external percolator binary ("New pi_0 estimate on final list..." / "Selecting pi_0=..."). Intended for parity testing and diagnostics; not required for production scoring.

Returns
pi0 in [0, 1], or -1.0 if no rescore()/score() has run yet on this instance, or after a train() call (train() resets last_pi0 since it doesn't run scoring). Always 1.0 when the instance was configured with use_pi0=false.

◆ getSvmWeights()

const std::vector< std::vector< double > > & getSvmWeights ( ) const

SVM weights trained in the last rescore()/train() call.

A single fold-averaged weight vector in raw feature space, with the bias appended as the final element. Identical to PercolatorModel::weights, wrapped in a one-element outer vector to preserve the signature expected by older call sites that consumed per-fold weights. Intended for diagnostics and for writing out a Percolator-compatible .weights file.

Updated by rescore() and train(). score() does not modify this buffer; after a train()+score() sequence the contents reflect the train() call. Loading a model via loadModel() and then calling score() leaves the buffer unchanged (empty if this instance has never trained).

Returns
Outer size = 1 (averaged); inner size = num_features + 1. Empty until rescore()/train() has been called on this instance.
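
The layout described above (one outer element, bias as the last inner element) can be unpacked as follows. The helper name splitWeightsAndBias is hypothetical, and the sketch takes a plain nested vector rather than calling into OpenMS.

```cpp
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper: split {w_0, ..., w_{n-1}, bias} into (weights, bias).
// Expects the single fold-averaged vector layout described above.
std::pair<std::vector<double>, double>
splitWeightsAndBias(const std::vector<std::vector<double>>& svm_weights)
{
  if (svm_weights.size() != 1 || svm_weights.front().empty())
  {
    throw std::invalid_argument("expected exactly one non-empty weight vector");
  }
  const std::vector<double>& w = svm_weights.front();
  // Everything except the last element is a per-feature weight.
  return {std::vector<double>(w.begin(), w.end() - 1), w.back()};
}
```

An empty outer vector (no rescore()/train() yet on the instance) is rejected here rather than silently treated as a zero model.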

◆ loadModel()

static PercolatorModel loadModel ( const std::string &  filename)
static

Deserialize a PercolatorModel written by saveModel().

The reader is strict: unknown header keys, duplicate keys, missing required keys, invalid normalizer values, and a declared n_features that does not match the actual feature-row count are all rejected.

Exceptions
Exception::FileNotFound  if filename does not exist.
Exception::ParseError  on any format violation (unknown or duplicate header key, missing required field, unsupported format_version, invalid normalizer, or row/count mismatch).
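
The strictness rules for header keys can be sketched with standard containers. This is not the OpenMS reader: parseHeaderStrict is a hypothetical function, it throws std::runtime_error where the real reader reports Exception::ParseError, the exact "key: value" separator is an assumption, and value validation (format_version, normalizer, n_features) is left out.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical strict parser for "key: value" header lines. Rejects unknown
// keys, duplicate keys, and missing required keys.
std::map<std::string, std::string>
parseHeaderStrict(const std::vector<std::string>& lines)
{
  // Header keys named in the saveModel() description.
  static const std::set<std::string> required = {
    "format_version", "normalizer", "seed", "n_features", "bias"};

  std::map<std::string, std::string> header;
  for (const std::string& line : lines)
  {
    const std::size_t sep = line.find(": ");
    if (sep == std::string::npos)
    {
      throw std::runtime_error("malformed header line: " + line);
    }
    const std::string key = line.substr(0, sep);
    if (required.count(key) == 0)
    {
      throw std::runtime_error("unknown header key: " + key);
    }
    // emplace reports false in .second when the key was already present.
    if (!header.emplace(key, line.substr(sep + 2)).second)
    {
      throw std::runtime_error("duplicate header key: " + key);
    }
  }
  for (const std::string& key : required)
  {
    if (header.count(key) == 0)
    {
      throw std::runtime_error("missing header key: " + key);
    }
  }
  return header;
}
```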

◆ operator=()

Percolator & operator= ( const Percolator & )
delete

◆ rescore() [1/2]

RescoreOutput rescore ( const RescoreInput &  input)

Rescore a feature matrix domain-agnostically.

Parameters
input  Feature matrix, target/decoy labels, optional CV grouping keys, and feature names. See RescoreInput for the row/column contract.
Returns
Per-row SVM scores, PSM-level q-values, and PEPs, aligned 1:1 with input.features (no reordering).
Note
This method does not interpret the rows semantically. "PSM" terminology in the returned struct refers to the underlying Percolator algorithm, not to what your rows represent. For transition rescoring, the q_values/peps are transition-level.
No peptide-level roll-up is performed. If you need "best row per unique entity" aggregation, do it above this call.
Target-Decoy Competition (enabled by default via post_processing_tdc) deduplicates PSMs by (scan, expMass). Rows that lose competition are dropped from the internal score set and returned with score = 0.0, q_value = 1.0, pep = 1.0. To distinguish a TDC-eliminated row from a row that genuinely scored zero, either disable TDC or populate scan_numbers / exp_masses so that the deduplication key is informative.
Exceptions
Exception::InvalidValue  if sanity checks fail (too few decoys, no discriminative feature, malformed input dimensions).
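
The default triple returned for TDC-eliminated rows can be used to flag candidate rows after the call. The sketch below uses a hypothetical stand-in struct with parallel scores / q_values / peps vectors; as the note above says, a row matching the triple may also have genuinely scored zero, so the result is a list of candidates, not a definitive answer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-row output (parallel vectors).
struct OutputSketch
{
  std::vector<double> scores;
  std::vector<double> q_values;
  std::vector<double> peps;
};

// Returns the indices of rows whose values equal the documented defaults
// (score = 0.0, q_value = 1.0, pep = 1.0). Exact comparison is intentional:
// the defaults are assigned constants, not computed values.
std::vector<std::size_t> possiblyTdcEliminated(const OutputSketch& out)
{
  std::vector<std::size_t> candidates;
  for (std::size_t i = 0; i < out.scores.size(); ++i)
  {
    if (out.scores[i] == 0.0 && out.q_values[i] == 1.0 && out.peps[i] == 1.0)
    {
      candidates.push_back(i);
    }
  }
  return candidates;
}
```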

◆ rescore() [2/2]

void rescore ( std::vector< PeptideIdentification > &  peptide_ids,
const StringList &  feature_names = {}
)

Rescore PSMs in place. Domain-specific convenience wrapper over rescore(RescoreInput).

Each PeptideHit gets three new meta values:

  • "percolator_score" (SVM discriminant score)
  • "percolator_q_value" (PSM-level q-value)
  • "percolator_pep" (posterior error probability)
Parameters
peptide_ids  Target + decoy PSMs, mixed. Mutated in place.
feature_names  Meta-value names on each PeptideHit to use as features. Must be numeric. If empty, auto-discover from the first hit's numeric meta values (excluding a blocklist of internal keys).
Exceptions
Exception::InvalidValue  if sanity checks fail (too few decoys, no discriminative feature, etc.).

◆ saveModel()

static void saveModel ( const PercolatorModel &  model,
const std::string &  filename 
)
static

Serialize a PercolatorModel to a plain-text file.

Writes a comment header line, then the header keys format_version, normalizer, seed, n_features, and bias (one per line, in key: value form), followed by one feature_name<TAB>weight data row per feature. The bias is stored in the header rather than as a data row, so feature names are opaque strings with no reserved values. The format is intended to be human-readable and diff-friendly; it is not interoperable with the external percolator binary's multi-column .weights format.

Weights are written at std::numeric_limits<double>::max_digits10 precision so that loadModel() round-trips losslessly.

Exceptions
Exception::UnableToCreateFile  if filename cannot be opened.
Exception::InvalidValue  if the model violates its internal invariant (weights.size() == feature_names.size() + 1).

◆ score()

RescoreOutput score ( const RescoreInput &  input,
const PercolatorModel &  model
)

Score feature rows using a pre-trained model. No training.

Applies model.weights to input.features unchanged: raw weights × raw features + bias. Callers must not pre-normalize the features. The normalization transform is already folded into the raw weights by Normalizer::unnormalizeweight() at training time; reapplying it would double-count the transform.

The post-scoring pipeline (q-values, optional TDC, PEPs, SVM score rescaling) operates on the input's own target/decoy distribution. q-values and PEPs are therefore always evaluated against the scoring dataset, not the training dataset.

The returned RescoreOutput.scores are not raw dot products. Internally, the raw score of row i is the weighted feature sum plus the bias:

  raw_score(i) = Σ_j input.features[i][j] * model.weights[j] + model.weights[n_features]

Scores::normalizeScores then rescales the entire score vector linearly:

  scores[i] = (raw_score(i) - fdrScore) / (fdrScore - medianDecoyScore)

Under this transform the score at the test_fdr threshold (fdrScore) maps to 0 and the median decoy score maps to -1. The rescore() CV-merge path rescales per fold instead of once globally, so scores from rescore(X) and from score(X, train(X)) share a scale but are not bitwise equal. If raw dot-product values are required, compute them directly from model.weights.
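
The two formulas above can be written out as a short sketch. The function names rawScore and rescale are hypothetical, and fdr_score / median_decoy_score are passed in directly rather than derived from a score distribution. With the rescaling formula as stated, a raw score equal to fdrScore maps to 0 and a raw score equal to medianDecoyScore maps to -1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Raw score: dot product of one feature row with the per-feature weights,
// plus the bias stored as the last element of `weights`.
double rawScore(const std::vector<double>& row, const std::vector<double>& weights)
{
  assert(weights.size() == row.size() + 1); // n weights + 1 bias
  double s = weights.back();
  for (std::size_t j = 0; j < row.size(); ++j)
  {
    s += row[j] * weights[j];
  }
  return s;
}

// Linear rescaling as given in the formula above.
// Precondition: fdr_score != median_decoy_score.
double rescale(double raw, double fdr_score, double median_decoy_score)
{
  return (raw - fdr_score) / (fdr_score - median_decoy_score);
}
```

Callers who need raw dot products can stop after rawScore; the rescaling depends on the scoring dataset's own decoy distribution and is therefore not a property of the model alone.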

Target-Decoy Competition (enabled by default via post_processing_tdc) deduplicates rows by (scan, expMass). Rows that lose competition are returned with the defaults score = 0.0, q_value = 1.0, pep = 1.0. See the rescore(RescoreInput) note for mitigation.

Parameters
input  Scoring data. Target/decoy labels required for q-value and PEP computation. feature_names must be populated.
model  Model produced by train() (possibly from another Percolator instance, or loaded from disk). feature_names must be populated and match input.feature_names positionally; weight count must equal n_features+1.
Returns
Per-row SVM scores (rescaled per above), q-values, PEPs, aligned 1:1 with input.features.
Exceptions
Exception::InvalidValue  on malformed input, model mismatch, empty feature_names on either side, or pathological score distributions that fail Percolator's sanity checks.
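
The model/input requirements listed under Parameters can be checked up front with a small pre-flight function. The sketch below is hypothetical (whyIncompatible is not an OpenMS function) and works on plain name vectors; it mirrors the stated rules only and does not reproduce the library's own validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical pre-flight check. Returns an empty string when the model can be
// applied to the input, otherwise a short reason.
std::string whyIncompatible(const std::vector<std::string>& input_names,
                            const std::vector<std::string>& model_names,
                            std::size_t n_model_weights)
{
  if (input_names.empty() || model_names.empty())
  {
    return "feature_names is empty";
  }
  if (n_model_weights != model_names.size() + 1)
  {
    return "weight count is not n_features + 1";
  }
  if (input_names.size() != model_names.size())
  {
    return "feature count differs";
  }
  // Names must match position by position, not merely as sets.
  for (std::size_t j = 0; j < model_names.size(); ++j)
  {
    if (input_names[j] != model_names[j])
    {
      return "feature name mismatch at column " + std::to_string(j);
    }
  }
  return "";
}
```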

◆ train()

PercolatorModel train ( const RescoreInput &  input)

Train a Percolator model on feature rows. No scoring.

Runs the full semi-supervised cross-validation training on input. All rows participate in training; subset sampling, if desired, must be performed by the caller. The returned model holds fold-averaged SVM weights in raw feature space and can be passed to score(), including on a different RescoreInput.

Side effects: the weights are mirrored into getSvmWeights() and per-call scratch state in the wrapper is reset.

Parameters
input  Training data. Target/decoy labels required; CV grouping and PIN-compat fields used if supplied.
Returns
PercolatorModel with feature_names, raw-space weights+bias, recorded normalizer type and seed.
Exceptions
Exception::InvalidValue  on malformed input or training failure.

◆ updateMembers_()

void updateMembers_ ( )
override protected virtual

This method is used to update extra member variables at the end of the setParameters() method.

Also call it at the end of the derived classes' copy constructor and assignment operator.

The default implementation is empty.

Reimplemented from DefaultParamHandler.

Member Data Documentation

◆ impl_

std::unique_ptr<Impl> impl_
private