PosteriorErrorProbabilityModel Class Reference

Implements a mixture model of the inverse Gumbel and the Gauss distribution, or a Gaussian mixture.

#include <OpenMS/MATH/STATISTICS/PosteriorErrorProbabilityModel.h>

Public Member Functions

 PosteriorErrorProbabilityModel ()
 default constructor
 ~PosteriorErrorProbabilityModel () override
 Destructor.
bool fit (std::vector< double > &search_engine_scores, const String &outlier_handling)
 fits the distributions to the data points (search_engine_scores). Estimated parameters for the distributions are saved in member variables, so computeProbability can be used afterwards. Uses two Gaussians for fitting, and Gauss+Gauss or Gumbel+Gauss for plotting and calculating the final probabilities.
bool fitGumbelGauss (std::vector< double > &search_engine_scores, const String &outlier_handling)
 fits the distributions to the data points (search_engine_scores). Estimated parameters for the distributions are saved in member variables, so computeProbability can be used afterwards. Uses Gumbel+Gauss for everything. Fits the Gumbel distribution by maximizing the log-likelihood.
bool fit (std::vector< double > &search_engine_scores, std::vector< double > &probabilities, const String &outlier_handling)
 fits the distributions to the data points (search_engine_scores) and writes the computed probabilities into the given vector (the second one).
void fillDensities (const std::vector< double > &x_scores, std::vector< double > &incorrect_density, std::vector< double > &correct_density)
 Writes the distribution densities for a set of scores into the two vectors. incorrect_density represents the incorrectly assigned sequences.
void fillLogDensities (const std::vector< double > &x_scores, std::vector< double > &incorrect_density, std::vector< double > &correct_density)
 Writes the log densities of the distributions for a set of scores into the two vectors. incorrect_density represents the incorrectly assigned sequences.
void fillLogDensitiesGumbel (const std::vector< double > &x_scores, std::vector< double > &incorrect_density, std::vector< double > &correct_density)
 Writes the log densities of the Gumbel and Gauss distributions for a set of scores into the two vectors. incorrect_density represents the incorrectly assigned sequences.
double computeLogLikelihood (const std::vector< double > &incorrect_density, const std::vector< double > &correct_density) const
 computes the likelihood with a log-likelihood function.
double computeLLAndIncorrectPosteriorsFromLogDensities (const std::vector< double > &incorrect_log_density, const std::vector< double > &correct_log_density, std::vector< double > &incorrect_posterior) const
std::pair< double, double > pos_neg_mean_weighted_posteriors (const std::vector< double > &x_scores, const std::vector< double > &incorrect_posteriors)
std::pair< double, double > pos_neg_sigma_weighted_posteriors (const std::vector< double > &x_scores, const std::vector< double > &incorrect_posteriors, const std::pair< double, double > &means)
GaussFitter::GaussFitResult getCorrectlyAssignedFitResult () const
 returns estimated parameters for correctly assigned sequences. fit should be called before.
GaussFitter::GaussFitResult getIncorrectlyAssignedFitResult () const
 returns estimated parameters for incorrectly assigned sequences. fit should be called before.
GumbelMaxLikelihoodFitter::GumbelDistributionFitResult getIncorrectlyAssignedGumbelFitResult () const
 returns estimated Gumbel parameters for incorrectly assigned sequences. fit should be called before.
double getNegativePrior () const
 returns the estimated negative prior probability.
double computeProbability (double score) const
TextFile initPlots (std::vector< double > &x_scores)
 initializes the plots
const String getGumbelGnuplotFormula (const GaussFitter::GaussFitResult &params) const
 returns the gnuplot formula of the fitted Gumbel distribution. Only x0 and sigma are used, as location parameter alpha and scale parameter beta, respectively.
const String getGaussGnuplotFormula (const GaussFitter::GaussFitResult &params) const
 returns the gnuplot formula of the fitted Gauss distribution.
const String getBothGnuplotFormula (const GaussFitter::GaussFitResult &incorrect, const GaussFitter::GaussFitResult &correct) const
 returns the gnuplot formula of the fitted mixture distribution.
void plotTargetDecoyEstimation (std::vector< double > &target, std::vector< double > &decoy)
 plots the estimated distribution against target and decoy hits
double getSmallestScore () const
 returns the smallest score used in the last fit
void tryGnuplot (const String &gp_file)
 try to invoke 'gnuplot' on the file to create a PDF automatically
Static Public Member Functions

static std::map< String, std::vector< std::vector< double > > > extractAndTransformScores (const std::vector< ProteinIdentification > &protein_ids, const std::vector< PeptideIdentification > &peptide_ids, const bool split_charge, const bool top_hits_only, const bool target_decoy_available, const double fdr_for_targets_smaller)
 extract and transform score types to a range and score orientation that the PEP model can handle
static void updateScores (const PosteriorErrorProbabilityModel &PEP_model, const String &search_engine, const Int charge, const bool prob_correct, const bool split_charge, std::vector< ProteinIdentification > &protein_ids, std::vector< PeptideIdentification > &peptide_ids, bool &unable_to_fit_data, bool &data_might_not_be_well_fit)
 update score entries with PEP (or 1-PEP) estimates
static double getGumbel_ (double x, const GaussFitter::GaussFitResult &params)
 computes the Gumbel density at position x with parameters params.
Private Member Functions

void processOutliers_ (std::vector< double > &x_scores, const String &outlier_handling) const
 handles outliers in x_scores as specified by outlier_handling
PosteriorErrorProbabilityModel & operator= (const PosteriorErrorProbabilityModel &rhs)
 assignment operator (not implemented)
 PosteriorErrorProbabilityModel (const PosteriorErrorProbabilityModel &rhs)
 Copy constructor (not implemented)

Static Private Member Functions

static double transformScore_ (const String &engine, const PeptideHit &hit, const String &current_score_type)
static double getScore_ (const std::vector< String > &requested_score_types, const PeptideHit &hit, const String &actual_score_type)

Private Attributes

GaussFitter::GaussFitResult incorrectly_assigned_fit_param_
 stores parameters for incorrectly assigned sequences. If the Gumbel fit was used, A can be ignored. Furthermore, in this case, x0 and sigma are the location parameter alpha and the scale parameter beta, respectively.
GumbelMaxLikelihoodFitter::GumbelDistributionFitResult incorrectly_assigned_fit_gumbel_param_
GaussFitter::GaussFitResult correctly_assigned_fit_param_
 stores Gauss parameters
double negative_prior_
 stores final prior probability for negative peptides
double max_incorrectly_
 peak of the incorrectly assigned sequences distribution
double max_correctly_
 peak of the Gauss distribution (correctly assigned sequences)
double smallest_score_
 smallest score which was used for fitting the model
const String(PosteriorErrorProbabilityModel::* getNegativeGnuplotFormula_ )(const GaussFitter::GaussFitResult &params) const
 points either to getGumbelGnuplotFormula or getGaussGnuplotFormula, depending on whether the Gumbel or the Gaussian distribution is used for incorrectly assigned sequences.
const String(PosteriorErrorProbabilityModel::* getPositiveGnuplotFormula_ )(const GaussFitter::GaussFitResult &params) const
 points to getGaussGnuplotFormula (correctly assigned sequences are modeled by the Gauss distribution)

Detailed Description

Implements a mixture model of the inverse Gumbel and the Gauss distribution, or a Gaussian mixture.

This class fits either a Gumbel distribution and a Gauss distribution, or two Gaussian distributions, to a set of data points using the EM algorithm. After fitting, the fit can be output as a gnuplot formula using getGumbelGnuplotFormula() and getGaussGnuplotFormula().

All parameters are stored in GaussFitResult. In the case of the Gumbel distribution, x0 and sigma represent the location parameter alpha and the scale parameter beta, respectively.

Todo:
- test performance and make fitGumbelGauss available via parameters
- allow charge-state-based fitting
- allow semi-supervised fitting by using decoy annotations
- allow non-parametric fitting via kernel density estimation

Parameters of this class are:

out_plot (string): If given, some output files are saved as follows: out_plot with the suffix _scores.txt for the scores, and out_plot itself, which contains the fitted values for each step of the EM algorithm. For example, out_plot = /usr/home/OMSSA123 leads to /usr/home/OMSSA123_scores.txt and /usr/home/OMSSA123 being written. If no directory is specified, e.g. just OMSSA123 instead of /usr/home/OMSSA123, the files are written into the working directory.
number_of_bins (int, default: 100): Number of bins used for visualization. Only needed if each iteration step of the EM algorithm is visualized.
incorrectly_assigned (string, default: Gumbel, allowed: Gumbel, Gauss): For 'Gumbel', the Gumbel distribution is used to plot incorrectly assigned sequences. For 'Gauss', the Gauss distribution is used.
max_nr_iterations (int, default: 1000): Bounds the number of iterations for the EM algorithm when convergence is slow.
neg_log_delta (int, default: 6): The negative logarithm of the convergence threshold for the likelihood increase.
outlier_handling (string, default: ignore_iqr_outliers, allowed: ignore_iqr_outliers, set_iqr_to_closest_valid, ignore_extreme_percentiles, none): What to do with outliers:
- ignore_iqr_outliers: ignore outliers outside of 3*IQR from Q1/Q3 for fitting
- set_iqr_to_closest_valid: set IQR-based outliers to the last valid value for fitting
- ignore_extreme_percentiles: ignore everything outside the 99th and 1st percentile (also removes equal values like potential censored max values in XTandem)
- none: do nothing
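
How max_nr_iterations and neg_log_delta might interact in an EM loop can be sketched as follows. This is an illustration only: it assumes a base-10 logarithm (so neg_log_delta = 6 means a threshold of 1e-6) and a comparison against the absolute change in log-likelihood. The helper names convergenceThreshold and shouldStop are hypothetical and not part of this class.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: turns neg_log_delta into a threshold,
// assuming a base-10 logarithm (neg_log_delta = 6 -> 1e-6).
double convergenceThreshold(int neg_log_delta)
{
  assert(neg_log_delta >= 0);
  return std::pow(10.0, -neg_log_delta);
}

// Hypothetical stopping rule: stop when the log-likelihood changed by
// less than the threshold, or when the iteration budget is used up.
bool shouldStop(double previous_ll, double current_ll, int iteration,
                int max_nr_iterations, int neg_log_delta)
{
  const bool converged =
    std::fabs(current_ll - previous_ll) < convergenceThreshold(neg_log_delta);
  const bool budget_exhausted = iteration >= max_nr_iterations;
  return converged || budget_exhausted;
}
```

With the defaults listed above, such a loop would end after at most 1000 iterations, or earlier once two consecutive log-likelihoods differ by less than 1e-6.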


Constructor & Destructor Documentation

◆ PosteriorErrorProbabilityModel() [1/2]

default constructor

◆ ~PosteriorErrorProbabilityModel()

Destructor.

◆ PosteriorErrorProbabilityModel() [2/2]

Copy constructor (not implemented)

Member Function Documentation

◆ computeLLAndIncorrectPosteriorsFromLogDensities()

double computeLLAndIncorrectPosteriorsFromLogDensities ( const std::vector< double > &  incorrect_log_density,
const std::vector< double > &  correct_log_density,
std::vector< double > &  incorrect_posterior 
) const

computes the posteriors for the data points to belong to the incorrect distribution

Parameters:
incorrect_posterior  resulting posteriors

Returns:
the log-likelihood of the model
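
The computation described here can be sketched in standalone form. This is a minimal illustration, not the class's implementation: the prior is passed as an explicit argument (an assumption; the class stores its own prior), and the function name is hypothetical. It uses the log-sum-exp trick so that very small densities do not underflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: posteriors of the incorrect component and the mixture
// log-likelihood, computed from per-point log densities.
double logLikelihoodAndIncorrectPosteriors(
  const std::vector<double>& incorrect_log_density,
  const std::vector<double>& correct_log_density,
  double negative_prior, // assumption: passed in explicitly
  std::vector<double>& incorrect_posterior)
{
  assert(incorrect_log_density.size() == correct_log_density.size());
  assert(negative_prior > 0.0 && negative_prior < 1.0);
  const double log_prior_neg = std::log(negative_prior);
  const double log_prior_pos = std::log(1.0 - negative_prior);
  incorrect_posterior.resize(incorrect_log_density.size());
  double log_likelihood = 0.0;
  for (std::size_t i = 0; i < incorrect_log_density.size(); ++i)
  {
    const double a = log_prior_neg + incorrect_log_density[i];
    const double b = log_prior_pos + correct_log_density[i];
    // log(exp(a) + exp(b)), shifted by the maximum for stability
    const double m = std::max(a, b);
    const double log_sum = m + std::log(std::exp(a - m) + std::exp(b - m));
    incorrect_posterior[i] = std::exp(a - log_sum);
    log_likelihood += log_sum;
  }
  return log_likelihood;
}
```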

◆ computeLogLikelihood()

double computeLogLikelihood ( const std::vector< double > &  incorrect_density,
const std::vector< double > &  correct_density 
) const

computes the likelihood with a log-likelihood function.
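
As a rough illustration, a two-component mixture log-likelihood can be written as below. The function name is hypothetical and the prior is an explicit argument here (an assumption); whether the class weights the densities in exactly this way is not stated on this page.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: sum over all points of log(p_neg * f_incorrect + (1 - p_neg) * f_correct).
double mixtureLogLikelihood(const std::vector<double>& incorrect_density,
                            const std::vector<double>& correct_density,
                            double negative_prior) // assumption: explicit argument
{
  assert(incorrect_density.size() == correct_density.size());
  double log_likelihood = 0.0;
  for (std::size_t i = 0; i < incorrect_density.size(); ++i)
  {
    log_likelihood += std::log(negative_prior * incorrect_density[i]
                               + (1.0 - negative_prior) * correct_density[i]);
  }
  return log_likelihood;
}
```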

◆ computeProbability()

double computeProbability ( double  score) const

Returns the computed posterior error probability for a given score.

Note: fit has to be called before using this function. Otherwise the result is meaningless.
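
The general idea of a posterior error probability in a two-component mixture can be sketched as follows. This assumes two normalized Gaussians and an explicit prior; the struct GaussParams and both function names are hypothetical stand-ins, and the class itself may treat edge cases differently.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for fitted parameters (mean x0, std. deviation sigma).
struct GaussParams
{
  double x0;
  double sigma;
};

// Normalized Gaussian density at x.
double gaussDensity(double x, const GaussParams& p)
{
  assert(p.sigma > 0.0);
  const double z = (x - p.x0) / p.sigma;
  const double pi = std::acos(-1.0);
  return std::exp(-0.5 * z * z) / (p.sigma * std::sqrt(2.0 * pi));
}

// PEP = weighted incorrect density / total weighted density.
// Note: if both densities underflow to zero, the result is not defined.
double posteriorErrorProbability(double score, const GaussParams& incorrect,
                                 const GaussParams& correct, double negative_prior)
{
  assert(negative_prior > 0.0 && negative_prior < 1.0);
  const double neg = negative_prior * gaussDensity(score, incorrect);
  const double pos = (1.0 - negative_prior) * gaussDensity(score, correct);
  return neg / (neg + pos);
}
```

In this sketch, a score close to the incorrect component's mean yields a value near 1, and a score close to the correct component's mean yields a value near 0.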

◆ extractAndTransformScores()

static std::map<String, std::vector<std::vector<double> > > extractAndTransformScores ( const std::vector< ProteinIdentification > &  protein_ids,
const std::vector< PeptideIdentification > &  peptide_ids,
const bool  split_charge,
const bool  top_hits_only,
const bool  target_decoy_available,
const double  fdr_for_targets_smaller 

extract and transform score types to a range and score orientation that the PEP model can handle

protein_idsthe protein identifications
peptide_idsthe peptide identifications
split_chargewhether different charge states should be treated separately
top_hits_onlyonly consider rank 1
target_decoy_availablewhether target decoy information is stored as meta value
fdr_for_targets_smallerfdr threshold for targets
engine (and optional charge state) id -> vector of triplets (score, target, decoy)
supported engines are: XTandem,OMSSA,MASCOT,SpectraST,MyriMatch,SimTandem,MSGFPlus,MS-GF+,Comet

◆ fillDensities()

void fillDensities ( const std::vector< double > &  x_scores,
std::vector< double > &  incorrect_density,
std::vector< double > &  correct_density
)

Writes the distribution densities for a set of scores into the two vectors. incorrect_density represents the incorrectly assigned sequences.

◆ fillLogDensities()

void fillLogDensities ( const std::vector< double > &  x_scores,
std::vector< double > &  incorrect_density,
std::vector< double > &  correct_density
)

Writes the log densities of the distributions for a set of scores into the two vectors. incorrect_density represents the incorrectly assigned sequences.
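
For the Gaussian case, filling two vectors with log densities could look like the sketch below. It assumes normalized Gaussians; GaussParams, logGaussDensity and fillLogDensitiesSketch are hypothetical names, and the parameters are passed explicitly rather than read from member variables.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for fitted parameters (mean x0, std. deviation sigma).
struct GaussParams
{
  double x0;
  double sigma;
};

// Log of the normalized Gaussian density, computed without calling exp().
double logGaussDensity(double x, const GaussParams& p)
{
  assert(p.sigma > 0.0);
  const double pi = std::acos(-1.0);
  const double z = (x - p.x0) / p.sigma;
  return -0.5 * z * z - std::log(p.sigma) - 0.5 * std::log(2.0 * pi);
}

// Fills both output vectors with one log density per score.
void fillLogDensitiesSketch(const std::vector<double>& x_scores,
                            const GaussParams& incorrect,
                            const GaussParams& correct,
                            std::vector<double>& incorrect_density,
                            std::vector<double>& correct_density)
{
  incorrect_density.resize(x_scores.size());
  correct_density.resize(x_scores.size());
  for (std::size_t i = 0; i < x_scores.size(); ++i)
  {
    incorrect_density[i] = logGaussDensity(x_scores[i], incorrect);
    correct_density[i] = logGaussDensity(x_scores[i], correct);
  }
}
```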

◆ fillLogDensitiesGumbel()

void fillLogDensitiesGumbel ( const std::vector< double > &  x_scores,
std::vector< double > &  incorrect_density,
std::vector< double > &  correct_density
)

Writes the log densities of the Gumbel and Gauss distributions for a set of scores into the two vectors. incorrect_density represents the incorrectly assigned sequences.

◆ fit() [1/2]

bool fit ( std::vector< double > &  search_engine_scores,
const String &  outlier_handling
)

fits the distributions to the data points (search_engine_scores). Estimated parameters for the distributions are saved in member variables, so computeProbability can be used afterwards. Uses two Gaussians for fitting, and Gauss+Gauss or Gumbel+Gauss for plotting and calculating the final probabilities.

Parameters:
search_engine_scores  a vector which holds the data points

Returns:
true if the algorithm has run through, false otherwise. In the latter case, no plot and no probabilities are calculated.

Note:
the vector is sorted from smallest to biggest value!

◆ fit() [2/2]

bool fit ( std::vector< double > &  search_engine_scores,
std::vector< double > &  probabilities,
const String &  outlier_handling
)

fits the distributions to the data points (search_engine_scores) and writes the computed probabilities into the given vector (the second one).

Parameters:
search_engine_scores  a vector which holds the data points
probabilities  a vector which holds the probability for each data point after running this function. Existing content is overwritten.

Returns:
true if the algorithm has run through, false otherwise. In the latter case, no plot and no probabilities are calculated.

Note:
the vectors are sorted from smallest to biggest value!

◆ fitGumbelGauss()

bool fitGumbelGauss ( std::vector< double > &  search_engine_scores,
const String &  outlier_handling
)

fits the distributions to the data points (search_engine_scores). Estimated parameters for the distributions are saved in member variables, so computeProbability can be used afterwards. Uses Gumbel+Gauss for everything. Fits the Gumbel distribution by maximizing the log-likelihood.

Parameters:
search_engine_scores  a vector which holds the data points

Returns:
true if the algorithm has run through, false otherwise. In the latter case, no plot and no probabilities are calculated.

Note:
the vector is sorted from smallest to biggest value!

◆ getBothGnuplotFormula()

const String getBothGnuplotFormula ( const GaussFitter::GaussFitResult &  incorrect,
const GaussFitter::GaussFitResult &  correct
) const

returns the gnuplot formula of the fitted mixture distribution.

◆ getCorrectlyAssignedFitResult()

GaussFitter::GaussFitResult getCorrectlyAssignedFitResult ( ) const

returns estimated parameters for correctly assigned sequences. Fit should be used before.

◆ getGaussGnuplotFormula()

const String getGaussGnuplotFormula ( const GaussFitter::GaussFitResult &  params) const

returns the gnuplot formula of the fitted gauss distribution.
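
To give an idea of what a gnuplot formula for a Gaussian with amplitude A, center x0 and width sigma can look like, here is a small sketch. The struct FitParams and the function gaussGnuplotFormulaSketch are hypothetical, and the exact string this class emits may differ; the sketch only relies on gnuplot's `**` exponentiation operator and user-defined function syntax.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for a fit result with amplitude, center and width.
struct FitParams
{
  double A;
  double x0;
  double sigma;
};

// Builds "f(x)=A * exp(-(x - x0) ** 2 / (2 * (sigma) ** 2))" with numbers filled in.
std::string gaussGnuplotFormulaSketch(const FitParams& p)
{
  assert(p.sigma != 0.0);
  std::ostringstream out;
  out << "f(x)=" << p.A << " * exp(-(x - " << p.x0 << ") ** 2 / (2 * ("
      << p.sigma << ") ** 2))";
  return out.str();
}
```

The resulting line can be pasted into a gnuplot script and plotted with `plot f(x)`.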

◆ getGumbel_()

static double getGumbel_ ( double  x,
const GaussFitter::GaussFitResult &  params
)

computes the Gumbel density at position x with parameters params.
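
A Gumbel density with location alpha and scale beta can be sketched as below, reusing the convention from this page that x0 plays the role of alpha and sigma the role of beta. The standard (maximum) Gumbel form is assumed here; FitParams and gumbelDensitySketch are hypothetical names, and the orientation used by the class is not confirmed by this page.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a fit result; A is unused for the Gumbel case.
struct FitParams
{
  double A;
  double x0;    // location parameter alpha
  double sigma; // scale parameter beta
};

// Standard Gumbel density: (1/beta) * z * exp(-z) with z = exp(-(x - alpha) / beta).
double gumbelDensitySketch(double x, const FitParams& params)
{
  assert(params.sigma > 0.0);
  const double z = std::exp((params.x0 - x) / params.sigma);
  return z * std::exp(-z) / params.sigma;
}
```

In this form the density peaks at x = alpha, where it equals exp(-1) / beta.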

◆ getGumbelGnuplotFormula()

const String getGumbelGnuplotFormula ( const GaussFitter::GaussFitResult &  params) const

returns the gnuplot formula of the fitted Gumbel distribution. Only x0 and sigma are used, as location parameter alpha and scale parameter beta, respectively.

◆ getIncorrectlyAssignedFitResult()

GaussFitter::GaussFitResult getIncorrectlyAssignedFitResult ( ) const

returns estimated parameters for incorrectly assigned sequences. fit should be called before.

◆ getIncorrectlyAssignedGumbelFitResult()

GumbelMaxLikelihoodFitter::GumbelDistributionFitResult getIncorrectlyAssignedGumbelFitResult ( ) const

returns estimated Gumbel parameters for incorrectly assigned sequences. fit should be called before.

◆ getNegativePrior()

double getNegativePrior ( ) const

returns the estimated negative prior probability.

◆ getScore_()

static double getScore_ ( const std::vector< String > &  requested_score_types,
const PeptideHit &  hit,
const String &  actual_score_type
)

gets a specific score (either main score [preferred] or metavalue)

Parameters:
requested_score_types  the requested score types in order of preference (will be tested with a "_score" suffix as well)
hit  the PeptideHit to extract from
actual_score_type  the current score type to take preference if matching

◆ getSmallestScore()

double getSmallestScore ( ) const

returns the smallest score used in the last fit

◆ initPlots()

TextFile initPlots ( std::vector< double > &  x_scores)

initializes the plots

◆ operator=()

assignment operator (not implemented)

◆ plotTargetDecoyEstimation()

void plotTargetDecoyEstimation ( std::vector< double > &  target,
std::vector< double > &  decoy
)

plots the estimated distribution against target and decoy hits

◆ pos_neg_mean_weighted_posteriors()

std::pair<double, double> pos_neg_mean_weighted_posteriors ( const std::vector< double > &  x_scores,
const std::vector< double > &  incorrect_posteriors
)

Parameters:
x_scores  Scores observed "on the x-axis"
incorrect_posteriors  Posteriors/responsibilities of belonging to the incorrect component

Returns:
New estimate for the mean of the correct (pair.first) and incorrect (pair.second) component

Note:
only for Gaussian estimates
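
A posterior-weighted mean update of this kind (the mean part of an EM M-step for two Gaussians) can be sketched as follows. The function name is hypothetical; the sketch assumes each point is weighted by its incorrect posterior for the incorrect component and by one minus that value for the correct component.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: returns (mean of correct component, mean of incorrect component).
std::pair<double, double> weightedMeansSketch(
  const std::vector<double>& x_scores,
  const std::vector<double>& incorrect_posteriors)
{
  assert(x_scores.size() == incorrect_posteriors.size());
  double pos_sum = 0.0, pos_weight = 0.0;
  double neg_sum = 0.0, neg_weight = 0.0;
  for (std::size_t i = 0; i < x_scores.size(); ++i)
  {
    const double w_neg = incorrect_posteriors[i];
    const double w_pos = 1.0 - w_neg;
    neg_sum += w_neg * x_scores[i];
    neg_weight += w_neg;
    pos_sum += w_pos * x_scores[i];
    pos_weight += w_pos;
  }
  // Both components need some total weight, otherwise the mean is undefined.
  assert(pos_weight > 0.0 && neg_weight > 0.0);
  return std::make_pair(pos_sum / pos_weight, neg_sum / neg_weight);
}
```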

◆ pos_neg_sigma_weighted_posteriors()

std::pair<double, double> pos_neg_sigma_weighted_posteriors ( const std::vector< double > &  x_scores,
const std::vector< double > &  incorrect_posteriors,
const std::pair< double, double > &  means
)

Parameters:
x_scores  Scores observed "on the x-axis"
incorrect_posteriors  Posteriors/responsibilities of belonging to the incorrect component

Returns:
New estimate for the std. deviation of the correct (pair.first) and incorrect (pair.second) component

Note:
only for Gaussian estimates
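
The matching standard deviation update can be sketched in the same way. The function name is hypothetical, and dividing by the sum of weights (rather than applying a bias correction) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: returns (sigma of correct component, sigma of incorrect component),
// given the component means as (correct, incorrect).
std::pair<double, double> weightedSigmasSketch(
  const std::vector<double>& x_scores,
  const std::vector<double>& incorrect_posteriors,
  const std::pair<double, double>& means)
{
  assert(x_scores.size() == incorrect_posteriors.size());
  double pos_sum = 0.0, pos_weight = 0.0;
  double neg_sum = 0.0, neg_weight = 0.0;
  for (std::size_t i = 0; i < x_scores.size(); ++i)
  {
    const double w_neg = incorrect_posteriors[i];
    const double w_pos = 1.0 - w_neg;
    const double d_pos = x_scores[i] - means.first;
    const double d_neg = x_scores[i] - means.second;
    pos_sum += w_pos * d_pos * d_pos;
    pos_weight += w_pos;
    neg_sum += w_neg * d_neg * d_neg;
    neg_weight += w_neg;
  }
  assert(pos_weight > 0.0 && neg_weight > 0.0);
  return std::make_pair(std::sqrt(pos_sum / pos_weight),
                        std::sqrt(neg_sum / neg_weight));
}
```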

◆ processOutliers_()

void processOutliers_ ( std::vector< double > &  x_scores,
const String &  outlier_handling
) const

handles outliers in x_scores as specified by outlier_handling (see the outlier_handling parameter of this class)
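
For the ignore_iqr_outliers option of outlier_handling, a filter that drops values more than 3*IQR below Q1 or above Q3 might look like this. It is a sketch under assumptions: quartiles are computed by linear interpolation on the sorted data (one of several common conventions), and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantile by linear interpolation between neighboring sorted values.
double quantileSketch(const std::vector<double>& sorted, double q)
{
  assert(!sorted.empty());
  assert(q >= 0.0 && q <= 1.0);
  const double pos = q * (sorted.size() - 1);
  const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
  const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
  return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
}

// Sorts the scores and removes everything outside [Q1 - 3*IQR, Q3 + 3*IQR].
void ignoreIqrOutliersSketch(std::vector<double>& x_scores)
{
  if (x_scores.empty()) return;
  std::sort(x_scores.begin(), x_scores.end());
  const double q1 = quantileSketch(x_scores, 0.25);
  const double q3 = quantileSketch(x_scores, 0.75);
  const double iqr = q3 - q1;
  const double lower = q1 - 3.0 * iqr;
  const double upper = q3 + 3.0 * iqr;
  x_scores.erase(std::remove_if(x_scores.begin(), x_scores.end(),
                                [lower, upper](double v) { return v < lower || v > upper; }),
                 x_scores.end());
}
```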

◆ transformScore_()

static double transformScore_ ( const String &  engine,
const PeptideHit &  hit,
const String &  current_score_type
)

transform different score types to a range and score orientation that the model can handle (engine string is assumed in upper-case)

Parameters:
engine  the search engine name as in the SE param object
hit  the PeptideHit to extract transformed scores from
current_score_type  the current score type of the PeptideIdentification to take precedence

◆ tryGnuplot()

void tryGnuplot ( const String &  gp_file)

try to invoke 'gnuplot' on the file to create PDF automatically

◆ updateScores()

static void updateScores ( const PosteriorErrorProbabilityModel &  PEP_model,
const String &  search_engine,
const Int  charge,
const bool  prob_correct,
const bool  split_charge,
std::vector< ProteinIdentification > &  protein_ids,
std::vector< PeptideIdentification > &  peptide_ids,
bool &  unable_to_fit_data,
bool &  data_might_not_be_well_fit
)

update score entries with PEP (or 1-PEP) estimates

Parameters:
PEP_model  the PEP model used to update the scores
search_engine  the score of search_engine will be updated
charge  identifications with the given charge will be updated
prob_correct  report 1-PEP
split_charge  if charge states have been treated separately
protein_ids  the protein identifications
peptide_ids  the peptide identifications
unable_to_fit_data  there was a problem fitting the data (probabilities are all smaller than 0 or larger than 1)
data_might_not_be_well_fit  fit was successful but of bad quality (probabilities are all smaller than 0.8 and larger than 0.2)

Note:
supported engines are: XTandem, OMSSA, MASCOT, SpectraST, MyriMatch, SimTandem, MSGFPlus, MS-GF+, Comet
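
The two diagnostic flags can be read as simple checks over the computed probabilities. The sketch below follows the wording of the parameter descriptions literally; FitDiagnostics and diagnoseSketch are hypothetical names, and the actual function sets the flags while updating the identifications.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical container for the two output flags.
struct FitDiagnostics
{
  bool unable_to_fit_data;
  bool data_might_not_be_well_fit;
};

// Sketch: derive both flags from a non-empty list of probabilities.
FitDiagnostics diagnoseSketch(const std::vector<double>& probabilities)
{
  assert(!probabilities.empty());
  FitDiagnostics d;
  // Every value lies outside the valid range [0, 1].
  d.unable_to_fit_data =
    std::all_of(probabilities.begin(), probabilities.end(),
                [](double p) { return p < 0.0 || p > 1.0; });
  // Every value is undecided, i.e. strictly between 0.2 and 0.8.
  d.data_might_not_be_well_fit =
    std::all_of(probabilities.begin(), probabilities.end(),
                [](double p) { return p > 0.2 && p < 0.8; });
  return d;
}
```

A well-separated fit would typically leave both flags false, because at least some probabilities end up close to 0 or 1.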

Member Data Documentation

◆ correctly_assigned_fit_param_

GaussFitter::GaussFitResult correctly_assigned_fit_param_

stores gauss parameters

◆ getNegativeGnuplotFormula_

const String(PosteriorErrorProbabilityModel::* getNegativeGnuplotFormula_) (const GaussFitter::GaussFitResult &params) const

points either to getGumbelGnuplotFormula or getGaussGnuplotFormula depending on whether one uses the gumbel or the gaussian distribution for incorrectly assigned sequences.

◆ getPositiveGnuplotFormula_

const String(PosteriorErrorProbabilityModel::* getPositiveGnuplotFormula_) (const GaussFitter::GaussFitResult &params) const

points to getGaussGnuplotFormula (correctly assigned sequences are modeled by the Gauss distribution)

◆ incorrectly_assigned_fit_gumbel_param_

GumbelMaxLikelihoodFitter::GumbelDistributionFitResult incorrectly_assigned_fit_gumbel_param_

◆ incorrectly_assigned_fit_param_

GaussFitter::GaussFitResult incorrectly_assigned_fit_param_

stores parameters for incorrectly assigned sequences. If the Gumbel fit was used, A can be ignored. Furthermore, in this case, x0 and sigma are the location parameter alpha and the scale parameter beta, respectively.

◆ max_correctly_

double max_correctly_

peak of the gauss distribution (correctly assigned sequences)

◆ max_incorrectly_

double max_incorrectly_

peak of the incorrectly assigned sequences distribution

◆ negative_prior_

double negative_prior_

stores final prior probability for negative peptides

◆ smallest_score_

double smallest_score_

smallest score which was used for fitting the model