In-process Percolator: semi-supervised target/decoy rescoring with q-values + PEPs.
Wraps a vendored subset of Percolator (training and posterior estimation). The public API is grouped into two sets of entry points.
Combined training and scoring (original Percolator semantics; per-fold score normalization on the merged set):
- rescore(std::vector<PeptideIdentification>&, ...) — overload for PSM-shaped data; writes scores back to each PeptideHit as meta values.
- rescore(const RescoreInput&) — domain-agnostic. Accepts a feature matrix, target/decoy labels, optional CV grouping keys, and feature names. Applicable to PSM rescoring, transition rescoring, peak-group rescoring, or any other setting where a semi-supervised target/decoy classifier is appropriate. The vendored implementation requires PSM-shaped records internally; the wrapper attaches synthetic row identifiers to satisfy that requirement and these never surface in the public API.
Separate training and scoring (for train-on-A / predict-on-B workflows; uses fold-averaged weights):
rescore() and train()+score() produce scores on the same scale but are not bitwise equal on the same input; see score() for the rationale.
Preconditions
- At least ~100 decoys and enough discriminable targets to pass SanityCheck, else InvalidValue is thrown.
- Features are continuous-valued doubles. Categorical features must be one-hot-encoded by the caller.
- If the input contains rows that share structure (e.g., multiple transitions per precursor), cv_group_keys must partition them so related rows go into the same CV fold. Otherwise q-values will be optimistic.
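The grouping requirement can be illustrated with a minimal sketch. It is not the wrapper's actual fold assignment: `assignFolds` is a hypothetical helper, and representing the group keys as strings is an assumption made only for this example. The point is that a fold is chosen once per group, never per row.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of the documented API): map each row to a CV
// fold so that rows sharing a group key always land in the same fold.
std::vector<std::size_t> assignFolds(const std::vector<std::string>& group_keys,
                                     std::size_t n_folds)
{
  assert(n_folds > 0);
  std::unordered_map<std::string, std::size_t> fold_of_group;
  std::vector<std::size_t> folds;
  folds.reserve(group_keys.size());
  for (const std::string& key : group_keys)
  {
    auto it = fold_of_group.find(key);
    if (it == fold_of_group.end())
    {
      // First time this group is seen: give it the next fold, round-robin.
      it = fold_of_group.emplace(key, fold_of_group.size() % n_folds).first;
    }
    folds.push_back(it->second);
  }
  return folds;
}
```

With keys such as one precursor identifier per transition row, all transitions of a precursor receive the same fold index, so no fold is tested on rows related to those it was trained on.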
Thread safety
A single instance is not safe for concurrent use; construct one instance per worker. The vendored Percolator code additionally relies on several process-wide statics (FeatureNames::numFeatures, SanityCheck::initDefaultDir*, etc.) that are reset at the start of each rescore() / train() / score() call. Concurrent calls across different instances therefore also race on these globals; serialize at the call site if parallelism is required.
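One way to serialize at the call site is a single process-wide mutex shared by all workers. The sketch below is an assumption about how a caller might do this, not something the class provides; `percolatorMutex` and `runSerialized` are hypothetical names, and the callable stands in for a rescore() / train() / score() call on a worker-local instance.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Hypothetical call-site guard: one mutex for the whole process, because the
// vendored globals are shared across all instances.
inline std::mutex& percolatorMutex()
{
  static std::mutex m;
  return m;
}

// Runs `call` while holding the process-wide lock, so only one guarded call
// is in flight at any time.
template <typename Fn>
void runSerialized(Fn&& call)
{
  std::lock_guard<std::mutex> lock(percolatorMutex());
  std::forward<Fn>(call)();
}
```

Each worker keeps its own instance and wraps only the Percolator call in `runSerialized`; feature extraction and other per-worker work can stay outside the lock and run in parallel.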
Reproducibility
Results are reproducible given the same seed, thread count, and input ordering. Changing the thread count can perturb results because of FeatureMemoryPool allocation ordering.
See also
- PSM rescoring example: PercolatorAdapter / ProSE.
- Transition rescoring example: OpenSwath layer (to be added when needed).
Fill PIN-compatible optional fields on a RescoreInput.
Populates input.scan_numbers, input.spec_file_numbers, input.exp_masses, and input.calc_masses from the given PeptideIdentifications, using the same derivation that PercolatorInfile::store would apply when writing a .pin file:
- scan: parsed via SpectrumLookup::extractScanNumber from the PeptideIdentification's spectrum_reference (or the spectrum_id meta value, falling back to the 1-based index).
- spec_file: hash of file_origin + id_merge_index (same as the PIN SpecId prefix). Zero when single-file or unset.
- exp_mass: pid.getMZ() (kept as m/z; Percolator does not convert to neutral mass for the sort hash).
- calc_mass: the hit's "CalcMass" meta value if hit.metaValueExists("CalcMass"), else hit.getSequence().getMZ(hit.getCharge()) (m/z, matching the PercolatorInfile::store fallback).
This helper must be called before rescore(input) when the in-process output is required to match running the external percolator binary through the .pin / .pout pipeline on the same inputs. Without it, the row index is used in place of the scan number, producing a different CV fold split and consequently different trained weights and final scores.
The PeptideIdentifications vector passed in must parallel input.features exactly: same ordering, one row per hit per PeptideIdentification.
Parameters
| peptide_ids | Source of PIN-equivalent metadata. |
| flatten_hits | If true, iterate all hits per PeptideIdentification (matches high-level rescore row ordering). If false, use only the first hit per pid. |
| input | Output: the four PIN-compat fields are written here. |
const std::vector< std::vector< double > > & getSvmWeights() const
SVM weights trained in the last rescore()/train() call.
A single fold-averaged weight vector in raw feature space, with the bias appended as the final element. Identical to PercolatorModel::weights, wrapped in a one-element outer vector to preserve the signature expected by older call sites that consumed per-fold weights. Intended for diagnostics and for writing out a Percolator-compatible .weights file.
Updated by rescore() and train(). score() does not modify this buffer; after a train()+score() sequence the contents reflect the train() call. Loading a model via loadModel() and then calling score() leaves the buffer unchanged (empty if this instance has never trained).
Returns
Outer size = 1 (averaged); inner size = num_features + 1. Empty until rescore()/train() has been called on this instance.
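As an illustration of the layout described above (one averaged vector, bias as the final element), a caller could separate the per-feature weights from the bias as follows. `splitWeightsAndBias` is a hypothetical helper written for this example, not part of the class.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: split the documented layout into per-feature weights
// and the bias term. Expects exactly one inner vector with the bias last.
std::pair<std::vector<double>, double>
splitWeightsAndBias(const std::vector<std::vector<double>>& svm_weights)
{
  assert(svm_weights.size() == 1);   // outer size is 1 (fold-averaged)
  const std::vector<double>& w = svm_weights.front();
  assert(!w.empty());                // at least the bias must be present
  std::vector<double> feature_weights(w.begin(), w.end() - 1);
  return {feature_weights, w.back()};
}
```

Because the buffer is empty until rescore() or train() has run on the instance, a caller should check for an empty result before using a helper like this one.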
static void saveModel(const PercolatorModel & model, const std::string & filename)
Serialize a PercolatorModel to a plain-text file.
Writes a comment header line, then the header keys format_version, normalizer, seed, n_features, and bias (one per line, in key: value form), followed by one feature_name<TAB>weight data row per feature. The bias is stored in the header rather than as a data row, so feature names are opaque strings with no reserved values. The format is intended to be human-readable and diff-friendly; it is not interoperable with the external percolator binary's multi-column .weights format.
Weights are written at std::numeric_limits<double>::max_digits10 precision so that loadModel() round-trips losslessly.
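The lossless round trip can be demonstrated in isolation. The sketch below covers only the `feature_name<TAB>weight` data rows; it omits the header and is not the actual saveModel()/loadModel() implementation. `writeRows`, `readRows`, and `NamedWeights` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using NamedWeights = std::vector<std::pair<std::string, double>>;

// Write one "feature_name<TAB>weight" row per feature at max_digits10
// precision, which is enough decimal digits to identify any double uniquely.
std::string writeRows(const NamedWeights& rows)
{
  std::ostringstream out;
  out << std::setprecision(std::numeric_limits<double>::max_digits10);
  for (const auto& row : rows)
  {
    out << row.first << '\t' << row.second << '\n';
  }
  return out.str();
}

// Parse the rows back. The weight is everything after the last tab, so the
// name is treated as an opaque string.
NamedWeights readRows(const std::string& text)
{
  NamedWeights rows;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line))
  {
    const std::size_t tab = line.rfind('\t');
    assert(tab != std::string::npos);
    rows.emplace_back(line.substr(0, tab), std::stod(line.substr(tab + 1)));
  }
  return rows;
}
```

Values such as 0.1 or 1.0 / 3.0, which have no exact short decimal form, compare equal after `readRows(writeRows(rows))`. At the default stream precision of 6 digits they would not.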
- Exceptions
-
Score feature rows using a pre-trained model. No training.
Applies model.weights to input.features unchanged: raw weights × raw features + bias. Callers must not pre-normalize the features. The normalization transform is already folded into the raw weights by Normalizer::unnormalizeweight() at training time; reapplying it would double-count the transform.
The post-scoring pipeline (q-values, optional TDC, PEPs, SVM score rescaling) operates on the input's own target/decoy distribution. q-values and PEPs are therefore always evaluated against the scoring dataset, not the training dataset.
The returned RescoreOutput.scores are not raw dot products. Internally, the raw score of row i is
  raw_score(i) = Σ_j input.features[i][j] * model.weights[j] + model.weights[n_features]
Scores::normalizeScores then rescales the entire score vector so that the score at test_fdr maps to 0 and the median decoy score maps to -1:
  scores[i] = (raw_score(i) - fdrScore) / (fdrScore - medianDecoyScore)
The rescore() CV-merge path rescales per fold instead of once globally, so scores from rescore(X) and from score(X, train(X)) share a scale but are not bitwise equal. If raw dot-product values are required, compute them directly from model.weights.
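The two steps can be sketched as plain functions, using the rescaling formula exactly as written above and adding the bias as in "raw weights × raw features + bias". `rawScore` and `rescale` are hypothetical helpers. The derivation of fdrScore and medianDecoyScore is not shown, so both are taken as inputs here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// raw_score = sum_j features[j] * weights[j] + weights[n_features],
// where the bias is stored as the last element of `weights`.
double rawScore(const std::vector<double>& features,
                const std::vector<double>& weights)
{
  assert(weights.size() == features.size() + 1);
  double s = weights.back();  // start from the bias
  for (std::size_t j = 0; j < features.size(); ++j)
  {
    s += features[j] * weights[j];
  }
  return s;
}

// Linear rescaling: (raw - fdrScore) / (fdrScore - medianDecoyScore).
double rescale(double raw, double fdr_score, double median_decoy_score)
{
  assert(fdr_score != median_decoy_score);
  return (raw - fdr_score) / (fdr_score - median_decoy_score);
}
```

Under this formula, a raw score equal to fdr_score comes out as 0 and a raw score equal to median_decoy_score comes out as -1. A caller who needs raw dot products only needs the first function.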
Target-Decoy Competition (enabled by default via post_processing_tdc) deduplicates rows by (scan, expMass). Rows that lose competition are returned with the defaults score = 0.0, q_value = 1.0, pep = 1.0. See the rescore(RescoreInput) note for mitigation.
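The effect on the output rows can be pictured with a simplified sketch. It assumes the winner within each (scan, expMass) group is the highest-scoring row, which is the usual meaning of target-decoy competition. The `Row` and `Result` types and the `kept` flag are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct Row      // hypothetical row type, for illustration only
{
  int scan;
  double exp_mass;
  double score;
};

struct Result   // defaults mirror the values documented for losing rows
{
  double score = 0.0;
  double q_value = 1.0;
  double pep = 1.0;
  bool kept = false;  // hypothetical flag, not part of the documented output
};

// Keep the best-scoring row per (scan, exp_mass); every other row in the
// group keeps the default Result values.
std::vector<Result> compete(const std::vector<Row>& rows)
{
  std::map<std::pair<int, double>, std::size_t> best;  // key -> winning index
  for (std::size_t i = 0; i < rows.size(); ++i)
  {
    const auto key = std::make_pair(rows[i].scan, rows[i].exp_mass);
    auto it = best.find(key);
    if (it == best.end())
    {
      best.emplace(key, i);
    }
    else if (rows[i].score > rows[it->second].score)
    {
      it->second = i;
    }
  }
  std::vector<Result> out(rows.size());
  for (const auto& kv : best)
  {
    out[kv.second].score = rows[kv.second].score;
    out[kv.second].kept = true;  // q_value and pep would be filled in later
  }
  return out;
}
```

The output stays aligned 1:1 with the input, so rows that share a (scan, expMass) key with a better-scoring row simply carry the default values.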
Parameters
| input | Scoring data. Target/decoy labels required for q-value and PEP computation. feature_names must be populated. |
| model | Model produced by train() (possibly from another Percolator instance, or loaded from disk). feature_names must be populated and match input.feature_names positionally; weight count must equal n_features+1. |
Returns
Per-row SVM scores (rescaled as described above), q-values, and PEPs, aligned 1:1 with input.features.
Exceptions
| Exception::InvalidValue | on malformed input, model mismatch, empty feature_names on either side, or pathological score distributions that fail Percolator's sanity checks. |