OpenMS
PosteriorErrorProbabilityModel Class Reference

Implements a mixture model of the inverse gumbel and the gauss distribution or a gaussian mixture. More...

#include <OpenMS/MATH/STATISTICS/PosteriorErrorProbabilityModel.h>

Inheritance diagram for PosteriorErrorProbabilityModel:
[legend]
Collaboration diagram for PosteriorErrorProbabilityModel:
[legend]

Public Member Functions

 PosteriorErrorProbabilityModel ()
 default constructor More...
 
 ~PosteriorErrorProbabilityModel () override
 Destructor. More...
 
bool fit (std::vector< double > &search_engine_scores, const String &outlier_handling)
 fits the distributions to the data points(search_engine_scores). Estimated parameters for the distributions are saved in member variables. computeProbability can be used afterwards. Uses two Gaussians to fit. And Gauss+Gauss or Gumbel+Gauss to plot and calculate final probabilities. More...
 
bool fitGumbelGauss (std::vector< double > &search_engine_scores, const String &outlier_handling)
 fits the distributions to the data points(search_engine_scores). Estimated parameters for the distributions are saved in member variables. computeProbability can be used afterwards. Uses Gumbel+Gauss for everything. Fits Gumbel by maximizing log likelihood. More...
 
bool fit (std::vector< double > &search_engine_scores, std::vector< double > &probabilities, const String &outlier_handling)
 fits the distributions to the data points(search_engine_scores) and writes the computed probabilities into the given vector (the second one). More...
 
void fillDensities (const std::vector< double > &x_scores, std::vector< double > &incorrect_density, std::vector< double > &correct_density)
 Writes the distributions densities into the two vectors for a set of scores. Incorrect_densities represent the incorrectly assigned sequences. More...
 
void fillLogDensities (const std::vector< double > &x_scores, std::vector< double > &incorrect_density, std::vector< double > &correct_density)
 Writes the log distributions densities into the two vectors for a set of scores. Incorrect_densities represent the incorrectly assigned sequences. More...
 
void fillLogDensitiesGumbel (const std::vector< double > &x_scores, std::vector< double > &incorrect_density, std::vector< double > &correct_density)
 Writes the log distributions of gumbel and gauss densities into the two vectors for a set of scores. Incorrect_densities represent the incorrectly assigned sequences. More...
 
double computeLogLikelihood (const std::vector< double > &incorrect_density, const std::vector< double > &correct_density) const
 computes the Likelihood with a log-likelihood function. More...
 
double computeLLAndIncorrectPosteriorsFromLogDensities (const std::vector< double > &incorrect_log_density, const std::vector< double > &correct_log_density, std::vector< double > &incorrect_posterior) const
 
std::pair< double, double > pos_neg_mean_weighted_posteriors (const std::vector< double > &x_scores, const std::vector< double > &incorrect_posteriors)
 
std::pair< double, double > pos_neg_sigma_weighted_posteriors (const std::vector< double > &x_scores, const std::vector< double > &incorrect_posteriors, const std::pair< double, double > &means)
 
GaussFitter::GaussFitResult getCorrectlyAssignedFitResult () const
 returns estimated parameters for correctly assigned sequences. Fit should be used before. More...
 
GaussFitter::GaussFitResult getIncorrectlyAssignedFitResult () const
 returns estimated parameters for correctly assigned sequences. Fit should be used before. More...
 
GumbelMaxLikelihoodFitter::GumbelDistributionFitResult getIncorrectlyAssignedGumbelFitResult () const
 returns estimated parameters for correctly assigned sequences. Fit should be used before. More...
 
double getNegativePrior () const
 returns the estimated negative prior probability. More...
 
double computeProbability (double score) const
 
TextFile initPlots (std::vector< double > &x_scores)
 initializes the plots More...
 
const String getGumbelGnuplotFormula (const GaussFitter::GaussFitResult &params) const
 returns the gnuplot formula of the fitted gumbel distribution. Only x0 and sigma are used as local parameter alpha and scale parameter beta, respectively. More...
 
const String getGaussGnuplotFormula (const GaussFitter::GaussFitResult &params) const
 returns the gnuplot formula of the fitted gauss distribution. More...
 
const String getBothGnuplotFormula (const GaussFitter::GaussFitResult &incorrect, const GaussFitter::GaussFitResult &correct) const
 returns the gnuplot formula of the fitted mixture distribution. More...
 
void plotTargetDecoyEstimation (std::vector< double > &target, std::vector< double > &decoy)
 plots the estimated distribution against target and decoy hits More...
 
double getSmallestScore () const
 returns the smallest score used in the last fit More...
 
void tryGnuplot (const String &gp_file)
 try to invoke 'gnuplot' on the file to create PDF automatically More...
 
- Public Member Functions inherited from DefaultParamHandler
 DefaultParamHandler (const String &name)
 Constructor with name that is displayed in error messages. More...
 
 DefaultParamHandler (const DefaultParamHandler &rhs)
 Copy constructor. More...
 
virtual ~DefaultParamHandler ()
 Destructor. More...
 
DefaultParamHandleroperator= (const DefaultParamHandler &rhs)
 Assignment operator. More...
 
virtual bool operator== (const DefaultParamHandler &rhs) const
 Equality operator. More...
 
void setParameters (const Param &param)
 Sets the parameters. More...
 
const ParamgetParameters () const
 Non-mutable access to the parameters. More...
 
const ParamgetDefaults () const
 Non-mutable access to the default parameters. More...
 
const StringgetName () const
 Non-mutable access to the name. More...
 
void setName (const String &name)
 Mutable access to the name. More...
 
const std::vector< String > & getSubsections () const
 Non-mutable access to the registered subsections. More...
 

Static Public Member Functions

static std::map< String, std::vector< std::vector< double > > > extractAndTransformScores (const std::vector< ProteinIdentification > &protein_ids, const std::vector< PeptideIdentification > &peptide_ids, const bool split_charge, const bool top_hits_only, const bool target_decoy_available, const double fdr_for_targets_smaller)
 extract and transform score types to a range and score orientation that the PEP model can handle More...
 
static void updateScores (const PosteriorErrorProbabilityModel &PEP_model, const String &search_engine, const Int charge, const bool prob_correct, const bool split_charge, std::vector< ProteinIdentification > &protein_ids, std::vector< PeptideIdentification > &peptide_ids, bool &unable_to_fit_data, bool &data_might_not_be_well_fit)
 update score entries with PEP (or 1-PEP) estimates More...
 
static double getGumbel_ (double x, const GaussFitter::GaussFitResult &params)
 computes the gumbel density at position x with parameters params. More...
 
- Static Public Member Functions inherited from DefaultParamHandler
static void writeParametersToMetaValues (const Param &write_this, MetaInfoInterface &write_here, const String &key_prefix="")
 Writes all parameters to meta values. More...
 

Private Member Functions

void processOutliers_ (std::vector< double > &x_scores, const String &outlier_handling) const
 transform different score types to a range and score orientation that the model can handle (engine string is assumed in upper-case) More...
 
PosteriorErrorProbabilityModeloperator= (const PosteriorErrorProbabilityModel &rhs)
 assignment operator (not implemented) More...
 
 PosteriorErrorProbabilityModel (const PosteriorErrorProbabilityModel &rhs)
 Copy constructor (not implemented) More...
 

Static Private Member Functions

static double transformScore_ (const String &engine, const PeptideHit &hit, const String &current_score_type)
 
static double getScore_ (const std::vector< String > &requested_score_types, const PeptideHit &hit, const String &actual_score_type)
 

Private Attributes

GaussFitter::GaussFitResult incorrectly_assigned_fit_param_
 stores parameters for incorrectly assigned sequences. If gumbel fit was used, A can be ignored. Furthermore, in this case, x0 and sigma are the local parameter alpha and scale parameter beta, respectively. More...
 
GumbelMaxLikelihoodFitter::GumbelDistributionFitResult incorrectly_assigned_fit_gumbel_param_
 
GaussFitter::GaussFitResult correctly_assigned_fit_param_
 stores gauss parameters More...
 
double negative_prior_
 stores final prior probability for negative peptides More...
 
double max_incorrectly_
 peak of the incorrectly assigned sequences distribution More...
 
double max_correctly_
 peak of the gauss distribution (correctly assigned sequences) More...
 
double smallest_score_
 smallest score which was used for fitting the model More...
 
const String(PosteriorErrorProbabilityModel::* getNegativeGnuplotFormula_ )(const GaussFitter::GaussFitResult &params) const
 points either to getGumbelGnuplotFormula or getGaussGnuplotFormula depending on whether one uses the gumbel or the gaussian distribution for incorrectly assigned sequences. More...
 
const String(PosteriorErrorProbabilityModel::* getPositiveGnuplotFormula_ )(const GaussFitter::GaussFitResult &params) const
 points to getGumbelGnuplotFormula More...
 

Additional Inherited Members

- Protected Member Functions inherited from DefaultParamHandler
virtual void updateMembers_ ()
 This method is used to update extra member variables at the end of the setParameters() method. More...
 
void defaultsToParam_ ()
 Updates the parameters after the defaults have been set in the constructor. More...
 
- Protected Attributes inherited from DefaultParamHandler
Param param_
 Container for current parameters. More...
 
Param defaults_
 Container for default parameters. This member should be filled in the constructor of derived classes! More...
 
std::vector< Stringsubsections_
 Container for registered subsections. This member should be filled in the constructor of derived classes! More...
 
String error_name_
 Name that is displayed in error messages during the parameter checking. More...
 
bool check_defaults_
 If this member is set to false no checking if parameters in done;. More...
 
bool warn_empty_defaults_
 If this member is set to false no warning is emitted when defaults are empty;. More...
 

Detailed Description

Implements a mixture model of the inverse gumbel and the gauss distribution or a gaussian mixture.

This class fits either a Gumbel distribution and a Gauss distribution to a set of data points or two Gaussian distributions using the EM algorithm. One can output the fit as a gnuplot formula using getGumbelGnuplotFormula() and getGaussGnuplotFormula() after fitting.

Note
All parameters are stored in GaussFitResult. In the case of the Gumbel distribution x0 and sigma represent the local parameter alpha and the scale parameter beta, respectively.
Todo:

test performance and make fitGumbelGauss available via parameters.

allow charge state based fitting

allow semi-supervised by using decoy annotations

allow non-parametric via kernel density estimation

Parameters of this class are:

NameTypeDefaultRestrictionsDescription
out_plot string  If given, the some output files will be saved in the following manner: _scores.txt for the scores and which contains the fitted values for each step of the EM-algorithm, e.g., out_plot = /usr/home/OMSSA123 leads to /usr/home/OMSSA123_scores.txt, /usr/home/OMSSA123 will be written. If no directory is specified, e.g. instead of '/usr/home/OMSSA123' just OMSSA123, the files will be written into the working directory.
number_of_bins int100  Number of bins used for visualization. Only needed if each iteration step of the EM-Algorithm will be visualized
incorrectly_assigned stringGumbel Gumbel, Gaussfor 'Gumbel', the Gumbel distribution is used to plot incorrectly assigned sequences. For 'Gauss', the Gauss distribution is used.
max_nr_iterations int1000  Bounds the number of iterations for the EM algorithm when convergence is slow.
neg_log_delta int6  The negative logarithm of the convergence threshold for the likelihood increase.
outlier_handling stringignore_iqr_outliers ignore_iqr_outliers, set_iqr_to_closest_valid, ignore_extreme_percentiles, noneWhat to do with outliers:
- ignore_iqr_outliers: ignore outliers outside of 3*IQR from Q1/Q3 for fitting
- set_iqr_to_closest_valid: set IQR-based outliers to the last valid value for fitting
- ignore_extreme_percentiles: ignore everything outside 99th and 1st percentile (also removes equal values like potential censored max values in XTandem)
- none: do nothing

Note:
  • If a section name is documented, the documentation is displayed as tooltip.
  • Advanced parameter names are italic.

Constructor & Destructor Documentation

◆ PosteriorErrorProbabilityModel() [1/2]

default constructor

◆ ~PosteriorErrorProbabilityModel()

Destructor.

◆ PosteriorErrorProbabilityModel() [2/2]

Copy constructor (not implemented)

Member Function Documentation

◆ computeLLAndIncorrectPosteriorsFromLogDensities()

double computeLLAndIncorrectPosteriorsFromLogDensities ( const std::vector< double > &  incorrect_log_density,
const std::vector< double > &  correct_log_density,
std::vector< double > &  incorrect_posterior 
) const

computes the posteriors for the datapoints to belong to the incorrect distribution

Parameters
incorrect_posteriorresulting posteriors
Returns
the log-likelihood of the model

◆ computeLogLikelihood()

double computeLogLikelihood ( const std::vector< double > &  incorrect_density,
const std::vector< double > &  correct_density 
) const

computes the Likelihood with a log-likelihood function.

◆ computeProbability()

double computeProbability ( double  score) const

Returns the computed posterior error probability for a given score.

Note
: fit has to be used before using this function. Otherwise this function will compute nonsense.

◆ extractAndTransformScores()

static std::map<String, std::vector<std::vector<double> > > extractAndTransformScores ( const std::vector< ProteinIdentification > &  protein_ids,
const std::vector< PeptideIdentification > &  peptide_ids,
const bool  split_charge,
const bool  top_hits_only,
const bool  target_decoy_available,
const double  fdr_for_targets_smaller 
)
static

extract and transform score types to a range and score orientation that the PEP model can handle

Parameters
protein_idsthe protein identifications
peptide_idsthe peptide identifications
split_chargewhether different charge states should be treated separately
top_hits_onlyonly consider rank 1
target_decoy_availablewhether target decoy information is stored as meta value
fdr_for_targets_smallerfdr threshold for targets
Returns
engine (and optional charge state) id -> vector of triplets (score, target, decoy)
Note
supported engines are: XTandem,OMSSA,MASCOT,SpectraST,MyriMatch,SimTandem,MSGFPlus,MS-GF+,Comet,Sage

◆ fillDensities()

void fillDensities ( const std::vector< double > &  x_scores,
std::vector< double > &  incorrect_density,
std::vector< double > &  correct_density 
)

Writes the distributions densities into the two vectors for a set of scores. Incorrect_densities represent the incorrectly assigned sequences.

◆ fillLogDensities()

void fillLogDensities ( const std::vector< double > &  x_scores,
std::vector< double > &  incorrect_density,
std::vector< double > &  correct_density 
)

Writes the log distributions densities into the two vectors for a set of scores. Incorrect_densities represent the incorrectly assigned sequences.

◆ fillLogDensitiesGumbel()

void fillLogDensitiesGumbel ( const std::vector< double > &  x_scores,
std::vector< double > &  incorrect_density,
std::vector< double > &  correct_density 
)

Writes the log distributions of gumbel and gauss densities into the two vectors for a set of scores. Incorrect_densities represent the incorrectly assigned sequences.

◆ fit() [1/2]

bool fit ( std::vector< double > &  search_engine_scores,
const String outlier_handling 
)

fits the distributions to the data points(search_engine_scores). Estimated parameters for the distributions are saved in member variables. computeProbability can be used afterwards. Uses two Gaussians to fit. And Gauss+Gauss or Gumbel+Gauss to plot and calculate final probabilities.

Parameters
search_engine_scoresa vector which holds the data points
Returns
true if algorithm has run through. Else false will be returned. In that case no plot and no probabilities are calculated.
Note
the vector is sorted from smallest to biggest value!

◆ fit() [2/2]

bool fit ( std::vector< double > &  search_engine_scores,
std::vector< double > &  probabilities,
const String outlier_handling 
)

fits the distributions to the data points(search_engine_scores) and writes the computed probabilities into the given vector (the second one).

Parameters
search_engine_scoresa vector which holds the data points
probabilitiesa vector which holds the probability for each data point after running this function. If it has some content it will be overwritten.
Returns
true if algorithm has run through. Else false will be returned. In that case no plot and no probabilities are calculated.
Note
the vectors are sorted from smallest to biggest value!

◆ fitGumbelGauss()

bool fitGumbelGauss ( std::vector< double > &  search_engine_scores,
const String outlier_handling 
)

fits the distributions to the data points(search_engine_scores). Estimated parameters for the distributions are saved in member variables. computeProbability can be used afterwards. Uses Gumbel+Gauss for everything. Fits Gumbel by maximizing log likelihood.

Parameters
search_engine_scoresa vector which holds the data points
Returns
true if algorithm has run through. Else false will be returned. In that case no plot and no probabilities are calculated.
Note
the vector is sorted from smallest to biggest value!

◆ getBothGnuplotFormula()

const String getBothGnuplotFormula ( const GaussFitter::GaussFitResult incorrect,
const GaussFitter::GaussFitResult correct 
) const

returns the gnuplot formula of the fitted mixture distribution.

◆ getCorrectlyAssignedFitResult()

GaussFitter::GaussFitResult getCorrectlyAssignedFitResult ( ) const
inline

returns estimated parameters for correctly assigned sequences. Fit should be used before.

◆ getGaussGnuplotFormula()

const String getGaussGnuplotFormula ( const GaussFitter::GaussFitResult params) const

returns the gnuplot formula of the fitted gauss distribution.

◆ getGumbel_()

static double getGumbel_ ( double  x,
const GaussFitter::GaussFitResult params 
)
inlinestatic

computes the gumbel density at position x with parameters params.

References GaussFitter::GaussFitResult::sigma, and GaussFitter::GaussFitResult::x0.

◆ getGumbelGnuplotFormula()

const String getGumbelGnuplotFormula ( const GaussFitter::GaussFitResult params) const

returns the gnuplot formula of the fitted gumbel distribution. Only x0 and sigma are used as local parameter alpha and scale parameter beta, respectively.

◆ getIncorrectlyAssignedFitResult()

GaussFitter::GaussFitResult getIncorrectlyAssignedFitResult ( ) const
inline

returns estimated parameters for correctly assigned sequences. Fit should be used before.

◆ getIncorrectlyAssignedGumbelFitResult()

GumbelMaxLikelihoodFitter::GumbelDistributionFitResult getIncorrectlyAssignedGumbelFitResult ( ) const
inline

returns estimated parameters for correctly assigned sequences. Fit should be used before.

◆ getNegativePrior()

double getNegativePrior ( ) const
inline

returns the estimated negative prior probability.

◆ getScore_()

static double getScore_ ( const std::vector< String > &  requested_score_types,
const PeptideHit hit,
const String actual_score_type 
)
staticprivate

gets a specific score (either main score [preferred] or metavalue)

Parameters
requested_score_typesthe requested score_types in order of preference (will be tested with a "_score" suffix as well)
hitthe PeptideHit to extract from
actual_score_typethe current score type to take preference if matching

◆ getSmallestScore()

double getSmallestScore ( ) const
inline

returns the smallest score used in the last fit

◆ initPlots()

TextFile initPlots ( std::vector< double > &  x_scores)

initializes the plots

◆ operator=()

assignment operator (not implemented)

◆ plotTargetDecoyEstimation()

void plotTargetDecoyEstimation ( std::vector< double > &  target,
std::vector< double > &  decoy 
)

plots the estimated distribution against target and decoy hits

◆ pos_neg_mean_weighted_posteriors()

std::pair<double, double> pos_neg_mean_weighted_posteriors ( const std::vector< double > &  x_scores,
const std::vector< double > &  incorrect_posteriors 
)
Parameters
x_scoresScores observed "on the x-axis"
incorrect_posteriorsPosteriors/responsibilities of belonging to the incorrect component
Returns
New estimate for the mean of the correct (pair.first) and incorrect (pair.second) component
Note
only for Gaussian estimates

◆ pos_neg_sigma_weighted_posteriors()

std::pair<double, double> pos_neg_sigma_weighted_posteriors ( const std::vector< double > &  x_scores,
const std::vector< double > &  incorrect_posteriors,
const std::pair< double, double > &  means 
)
Parameters
x_scoresScores observed "on the x-axis"
incorrect_posteriorsPosteriors/responsibilities of belonging to the incorrect component
Returns
New estimate for the std. deviation of the correct (pair.first) and incorrect (pair.second) component
Note
only for Gaussian estimates

◆ processOutliers_()

void processOutliers_ ( std::vector< double > &  x_scores,
const String outlier_handling 
) const
private

transform different score types to a range and score orientation that the model can handle (engine string is assumed in upper-case)

◆ transformScore_()

static double transformScore_ ( const String engine,
const PeptideHit hit,
const String current_score_type 
)
staticprivate

transform different score types to a range and score orientation that the model can handle (engine string is assumed in upper-case)

Parameters
enginethe search engine name as in the SE param object
hitthe PeptideHit to extract transformed scores from
current_score_typethe current score type of the PeptideIdentification to take precedence

◆ tryGnuplot()

void tryGnuplot ( const String gp_file)

try to invoke 'gnuplot' on the file to create PDF automatically

◆ updateScores()

static void updateScores ( const PosteriorErrorProbabilityModel PEP_model,
const String search_engine,
const Int  charge,
const bool  prob_correct,
const bool  split_charge,
std::vector< ProteinIdentification > &  protein_ids,
std::vector< PeptideIdentification > &  peptide_ids,
bool &  unable_to_fit_data,
bool &  data_might_not_be_well_fit 
)
static

update score entries with PEP (or 1-PEP) estimates

Parameters
PEP_modelthe PEP model used to update the scores
search_enginethe score of search_engine will be updated
chargeidentifications with the given charge will be updated
prob_correctreport 1-PEP
split_chargeif charge states have been treated separately
protein_idsthe protein identifications
peptide_idsthe peptide identifications
unable_to_fit_datathere was a problem fitting the data (probabilities are all smaller 0 or larger 1)
data_might_not_be_well_fitfit was successful but of bad quality (probabilities are all smaller 0.8 and larger 0.2)
Note
supported engines are: XTandem,OMSSA,MASCOT,SpectraST,MyriMatch,SimTandem,MSGFPlus,MS-GF+,Comet

Member Data Documentation

◆ correctly_assigned_fit_param_

GaussFitter::GaussFitResult correctly_assigned_fit_param_
private

stores gauss parameters

◆ getNegativeGnuplotFormula_

const String(PosteriorErrorProbabilityModel::* getNegativeGnuplotFormula_) (const GaussFitter::GaussFitResult &params) const
private

points either to getGumbelGnuplotFormula or getGaussGnuplotFormula depending on whether one uses the gumbel or the gaussian distribution for incorrectly assigned sequences.

◆ getPositiveGnuplotFormula_

const String(PosteriorErrorProbabilityModel::* getPositiveGnuplotFormula_) (const GaussFitter::GaussFitResult &params) const
private

points to getGumbelGnuplotFormula

◆ incorrectly_assigned_fit_gumbel_param_

GumbelMaxLikelihoodFitter::GumbelDistributionFitResult incorrectly_assigned_fit_gumbel_param_
private

◆ incorrectly_assigned_fit_param_

GaussFitter::GaussFitResult incorrectly_assigned_fit_param_
private

stores parameters for incorrectly assigned sequences. If gumbel fit was used, A can be ignored. Furthermore, in this case, x0 and sigma are the local parameter alpha and scale parameter beta, respectively.

◆ max_correctly_

double max_correctly_
private

peak of the gauss distribution (correctly assigned sequences)

◆ max_incorrectly_

double max_incorrectly_
private

peak of the incorrectly assigned sequences distribution

◆ negative_prior_

double negative_prior_
private

stores final prior probability for negative peptides

◆ smallest_score_

double smallest_score_
private

smallest score which was used for fitting the model