OpenMS
DBSuitability Class Reference

This class holds the functionality of calculating the database suitability. More...

#include <OpenMS/QC/DBSuitability.h>

Inheritance diagram for DBSuitability:
[legend]
Collaboration diagram for DBSuitability:
[legend]

Classes

struct  SuitabilityData
 struct to store results More...
 

Public Member Functions

 DBSuitability ()
 
 ~DBSuitability () override=default
 Destructor. More...
 
void compute (std::vector< PeptideIdentification > &&pep_ids, const MSExperiment &exp, const std::vector< FASTAFile::FASTAEntry > &original_fasta, const std::vector< FASTAFile::FASTAEntry > &novo_fasta, const ProteinIdentification::SearchParameters &search_params)
 Computes suitability of a database used to search a mzML. More...
 
const std::vector< SuitabilityData > & getResults () const
 Returns results calculated by this metric. More...
 
- Public Member Functions inherited from DefaultParamHandler
 DefaultParamHandler (const String &name)
 Constructor with name that is displayed in error messages. More...
 
 DefaultParamHandler (const DefaultParamHandler &rhs)
 Copy constructor. More...
 
virtual ~DefaultParamHandler ()
 Destructor. More...
 
DefaultParamHandleroperator= (const DefaultParamHandler &rhs)
 Assignment operator. More...
 
virtual bool operator== (const DefaultParamHandler &rhs) const
 Equality operator. More...
 
void setParameters (const Param &param)
 Sets the parameters. More...
 
const ParamgetParameters () const
 Non-mutable access to the parameters. More...
 
const ParamgetDefaults () const
 Non-mutable access to the default parameters. More...
 
const StringgetName () const
 Non-mutable access to the name. More...
 
void setName (const String &name)
 Mutable access to the name. More...
 
const std::vector< String > & getSubsections () const
 Non-mutable access to the registered subsections. More...
 

Private Member Functions

double getDecoyDiff_ (const PeptideIdentification &pep_id) const
 Calculates the xcorr difference between the top two hits marked as decoy. More...
 
double getDecoyCutOff_ (const std::vector< PeptideIdentification > &pep_ids, double reranking_cutoff_percentile) const
 Calculates a xcorr cut-off based on decoy hits. More...
 
bool isNovoHit_ (const PeptideHit &hit) const
 Tests if a PeptideHit is considered a deNovo hit. More...
 
bool checkScoreBetterThanThreshold_ (const PeptideHit &hit, double threshold, bool higher_score_better) const
 Tests if a PeptideHit has a score better than the given threshold. More...
 
std::pair< String, ParamextractSearchAdapterInfoFromMetaValues_ (const ProteinIdentification::SearchParameters &meta_values) const
 Looks through meta values of SearchParameters to find out which search adapter was used. More...
 
void writeIniFile_ (const Param &parameters, const String &filename) const
 Writes parameters into a given file. More...
 
std::vector< PeptideIdentificationrunIdentificationSearch_ (const MSExperiment &exp, const std::vector< FASTAFile::FASTAEntry > &fasta_data, const String &adapter_name, Param &parameters) const
 Executes the workflow from search adapter, followed by PeptideIndexer and finishes with FDR. More...
 
std::vector< FASTAFile::FASTAEntrygetSubsampledFasta_ (const std::vector< FASTAFile::FASTAEntry > &fasta_data, double subsampling_rate) const
 Creates a subsampled fasta with the given subsampling rate. More...
 
void calculateSuitability_ (const std::vector< PeptideIdentification > &pep_ids, SuitabilityData &data) const
 Calculates all suitability data from a combined deNovo+database search. More...
 
void appendDecoys_ (std::vector< FASTAFile::FASTAEntry > &fasta) const
 Calculates and appends decoys to a given vector of FASTAEntry. More...
 
double extractScore_ (const PeptideHit &pep_hit) const
 Returns the cross correlation score normalized by MW (if existing), else if the 'force' flag is set the current main score is returned. More...
 
double calculateCorrectionFactor_ (const SuitabilityData &data, const SuitabilityData &data_sampled, double sampling_rate) const
 Calculates the correction factor from two suitability calculations. More...
 
UInt numberOfUniqueProteins_ (const std::vector< PeptideIdentification > &peps, UInt number_of_hits=1) const
 Determines the number of unique proteins found in the protein accessions of PeptideIdentifications. More...
 
Size getIndexWithMedianNovoHits_ (const std::vector< SuitabilityData > &data) const
 Finds the SuitabilityData object with the median number of de novo hits. More...
 
double getScoreMatchingFDR_ (const std::vector< PeptideIdentification > &pep_ids, double FDR, const String &score_name, bool higher_score_better) const
 Extracts the worst score that still passes a FDR (q-value) threshold. More...
 

Private Attributes

std::vector< SuitabilityDataresults_
 result vector More...
 
const boost::regex decoy_pattern_
 pattern for finding a decoy string More...
 

Friends

class DBSuitability_friend
 To test private member functions. More...
 

Additional Inherited Members

- Static Public Member Functions inherited from DefaultParamHandler
static void writeParametersToMetaValues (const Param &write_this, MetaInfoInterface &write_here, const String &key_prefix="")
 Writes all parameters to meta values. More...
 
- Protected Member Functions inherited from DefaultParamHandler
virtual void updateMembers_ ()
 This method is used to update extra member variables at the end of the setParameters() method. More...
 
void defaultsToParam_ ()
 Updates the parameters after the defaults have been set in the constructor. More...
 
- Protected Attributes inherited from DefaultParamHandler
Param param_
 Container for current parameters. More...
 
Param defaults_
 Container for default parameters. This member should be filled in the constructor of derived classes! More...
 
std::vector< Stringsubsections_
 Container for registered subsections. This member should be filled in the constructor of derived classes! More...
 
String error_name_
 Name that is displayed in error messages during the parameter checking. More...
 
bool check_defaults_
 If this member is set to false no checking if parameters in done;. More...
 
bool warn_empty_defaults_
 If this member is set to false no warning is emitted when defaults are empty;. More...
 

Detailed Description

This class holds the functionality of calculating the database suitability.

To calculate the suitability of a database for a specific mzML for identification search, it is vital to perform a combined deNovo+database identification search. Meaning that the database should be appended with an additional entry derived from concatenated deNovo sequences from said mzML. Currently only Comet search is supported.

This class will calculate q-values by itself and will throw an error if any q-value calculation was done beforehand.

The algorithm parameters can be set using setParams().

Allows for multiple usage of the compute function. The result of each call is stored internally in a vector. Therefore old results will not be overridden by a new call. This vector then can be returned using getResults().

This class serves as the library representation of DatabaseSuitability

Constructor & Destructor Documentation

◆ DBSuitability()

Constructor Settings are initialized with their default values: no_rerank = false, reranking_cutoff_percentile = 1, FDR = 0.01

◆ ~DBSuitability()

~DBSuitability ( )
overridedefault

Destructor.

Member Function Documentation

◆ appendDecoys_()

void appendDecoys_ ( std::vector< FASTAFile::FASTAEntry > &  fasta) const
private

Calculates and appends decoys to a given vector of FASTAEntry.

Each sequence is digested with Trypsin. The resulting peptides are reversed and appended to one another. This results in the decoy sequences. The identifier is given a 'DECOY_' prefix.

Parameters
fastareference to fasta vector where the decoys are needed

Referenced by DBSuitability_friend::appendDecoys().

◆ calculateCorrectionFactor_()

double calculateCorrectionFactor_ ( const SuitabilityData data,
const SuitabilityData data_sampled,
double  sampling_rate 
) const
private

Calculates the correction factor from two suitability calculations.

Two suitability calculations need to be done for this. One with the original data and one with data from a search with a sampled database. The number of db hits and deNovo hits behaves linear. The two searches can than be used to calculate the corresponding linear functions. The factor is calculated with the negative ratio of the db slope and the deNovo slope.

Parameters
datasuitability data from the original search
data_sampledvector of suitability data from the sampled search(s)
sampling_ratethe sampling rate used for sampled db [0,1)
Returns
correction factor

Referenced by DBSuitability_friend::calculateCorrectionFactor().

◆ calculateSuitability_()

void calculateSuitability_ ( const std::vector< PeptideIdentification > &  pep_ids,
SuitabilityData data 
) const
private

Calculates all suitability data from a combined deNovo+database search.

Counts top database and top deNovo hits.

Calculates a decoy score cut-off to compare high scoring deNovo hits with lower scoring database hits. If the score difference is smaller than the cut-off the database hit is counted and the deNovo hit ignored.

Suitability is calculated: # database hits / # all hits

Parameters
pep_idspeptide identifications coming from the combined search, each peptide identification should be sorted
dataSuitabilityData object where the result should be written into
Exceptions
MissingInformationif no target/decoy annotation is found on pep_ids
MissingInformationif no xcorr is found, this happens when another adapter than CometAdapter was used

◆ checkScoreBetterThanThreshold_()

bool checkScoreBetterThanThreshold_ ( const PeptideHit hit,
double  threshold,
bool  higher_score_better 
) const
private

Tests if a PeptideHit has a score better than the given threshold.

Parameters
hitPepHit in question
thresholdthreshold to check against
higher_score_bettertrue/false depending if a higher or a lower score is better
Returns
true/false

◆ compute()

void compute ( std::vector< PeptideIdentification > &&  pep_ids,
const MSExperiment exp,
const std::vector< FASTAFile::FASTAEntry > &  original_fasta,
const std::vector< FASTAFile::FASTAEntry > &  novo_fasta,
const ProteinIdentification::SearchParameters search_params 
)

Computes suitability of a database used to search a mzML.

Top deNovo and top database hits from a combined deNovo+database search are counted. The ratio of db hits vs all hits yields the suitability. To re-rank cases, where a de novo peptide scores just higher than the database peptide, a decoy cut-off is calculated. This functionality can be turned off. This will result in an underestimated suitability, but it can solve problems like different search engines or to few decoy hits.

Parameters can be set using the functionality of DefaultParamHandler. Parameters are: no_rerank - re-ranking can be turned off with this (will be set automatically if no cross correlation score is found) reranking_cutoff_percentile - percentile that determines which cut-off will be returned FDR - q-value that should be filtered for Preliminary tests have shown that database suitability is rather stable across common FDR thresholds from 0 - 5 % keep_search_files - temporary files created for and by the internal ID search are kept disable_correction - disables corrected suitability calculations force - forces re-ranking to be done even without a cross correlation score, in which case the default main score is used

The calculated suitability is then tried to be corrected. For this a correction factor for the number of found top deNovo hits is calculated. This is done by perfoming an additional combined identification search with a smaller sample of the database. It was observed that the number of top deNovo and db hits behave linear according to the sampling ratio of the database. This can be used to extrapolate the number of database hits that would be needed to get a suitability of 1. This number in combination with the maximum number of deNovo hits (found with an identification search where only deNovo is used as a database) can be used to calculate a correction factor like this: #database hits for suitability of 1 / #maximum deNovo hits This formula can be simplified in a way that the maximum number of deNovo hits isn't needed:

  • (database hits slope) / deNovo hits slope Both of these values can easily be calculated with the original suitability data in conjunction with the one sampled search.

Correcting the number of found top deNovo hits with this factor results in them being more comparable to the top database hits. This in return results in a more linear behaviour of the suitability according to the sampling ratio. The corrected suitability reflects what sampling ratio your database represents regarding to the theoretical 'perfect' database. Or in other words: Your database needs to be (1 - corrected suitability) bigger to get a suitability of 1.

Both the original suitability as well as the corrected one are reported in the result.

Since q-values need to be calculated the identifications are taken by copy. Since decoys need to be calculated for the fasta input those are taken by copy as well.

Result is appended to the result member. This allows for multiple usage.

Parameters
pep_idsvector containing pepIDs with target/decoy annotation coming from a deNovo+database identification search without FDR (Comet is recommended - to use other search engines either disable reranking or set the '-force' flag) vector is modified internally, and is thus copied
expMSExperiment that was searched to produce the identifications given in pep_ids
original_fastaFASTAEntries of the database used for the ID search (without decoys)
novo_fastaFASTAEntry derived from deNovo peptides
search_paramsSearchParameters object containing information which adapter was used with which settings for the identification search that resulted in pep_ids
Exceptions
MissingInformationif no target/decoy annotation is found on pep_ids
MissingInformationif no xcorr is found, this happens when another adapter than CometAdapter was used
Preconditionif a q-value is found in pep_ids

◆ extractScore_()

double extractScore_ ( const PeptideHit pep_hit) const
private

Returns the cross correlation score normalized by MW (if existing), else if the 'force' flag is set the current main score is returned.

Parameters
pep_hitPeptideHit of which the score is needed
Returns
cross correlation score normalized by MW or current score
Exceptions
MissingInformationif no xcorr is found and 'force' flag isn't set

◆ extractSearchAdapterInfoFromMetaValues_()

std::pair<String, Param> extractSearchAdapterInfoFromMetaValues_ ( const ProteinIdentification::SearchParameters meta_values) const
private

Looks through meta values of SearchParameters to find out which search adapter was used.

Checks for the following adapters: CometAdapter, MSGFPlusAdapter, MSFraggerAdapter, MyriMatchAdapter, OMSSAAdapter and XTandemAdapter

Parameters
meta_valuesSearchParameters object, since the adapters write their parameters here
Returns
A pair containing the name of the adapter and the parameters used to run it
Exceptions
MissingInformationif none of the adapters above is found in the meta values

◆ getDecoyCutOff_()

double getDecoyCutOff_ ( const std::vector< PeptideIdentification > &  pep_ids,
double  reranking_cutoff_percentile 
) const
private

Calculates a xcorr cut-off based on decoy hits.

Decoy differences of all N pepIDs are calculated. The (1-reranking_cutoff_percentile)*N highest one is returned. It is assumed that this difference accounts for 'reranking_cutoff_percentile' of the re-ranking cases.

Parameters
pep_idsvector containing the pepIDs
reranking_cutoff_percentilepercentile that determines which cut-off will be returned
Returns
xcorr cut-off
Exceptions
IllegalArgumentif reranking_cutoff_percentile isn't in range [0,1]
IllegalArgumentif reranking_cutoff_percentile is too low for a decoy cut-off to be calculated
MissingInformationif no more than 20 % of the peptide IDs have two decoys in their top ten peptide hits

◆ getDecoyDiff_()

double getDecoyDiff_ ( const PeptideIdentification pep_id) const
private

Calculates the xcorr difference between the top two hits marked as decoy.

Searches for the top two decoys hits and returns their score difference. By default the xcorr from Comet is used. If no xcorr can be found and the 'force' flag is set the main score from the peptide hit is used, else an error is thrown.

If there aren't two decoys, DBL_MAX is returned.

Parameters
pep_idpepID from where the decoy difference will be calculated
Returns
xcorr difference
Exceptions
MissingInformationif no target/decoy annotation is found
MissingInformationif no xcorr is found

◆ getIndexWithMedianNovoHits_()

Size getIndexWithMedianNovoHits_ ( const std::vector< SuitabilityData > &  data) const
private

Finds the SuitabilityData object with the median number of de novo hits.

If the median isn't distinct (e.g. two entries could be considered median) the upper one is chosen.

Parameters
datavector of SuitabilityData objects
Returns
index to object with median number of de novo hits

Referenced by DBSuitability_friend::getIndexWithMedianNovoHits().

◆ getResults()

const std::vector<SuitabilityData>& getResults ( ) const

Returns results calculated by this metric.

The returned vector contains one DBSuitabilityData object for each time compute was called. Each of these objects contains the suitability information that was extracted from the identifications used for the corresponding call of compute.

Returns
DBSuitabilityData objects in a vector

◆ getScoreMatchingFDR_()

double getScoreMatchingFDR_ ( const std::vector< PeptideIdentification > &  pep_ids,
double  FDR,
const String score_name,
bool  higher_score_better 
) const
private

Extracts the worst score that still passes a FDR (q-value) threshold.

This can be used to 'convert' a FDR threshold to a threshold for the desired score (score and FDR need to be dependent)

Parameters
pep_idsvector of PeptideIdentifications
FDRFDR threshold, hits with a worse q-value score aren't looked at
score_namename of the score to search for The score name doesn't need to be the exact metavalue name, but a metavalue key should contain it. i.e. "e-value" as metavalue "e-value_score"
higher_score_bettertrue/false depending if a higher or lower score (score_name) is better
Returns
the worst score that is still in the FDR threshold
Exceptions
IllegalArgumentif score_name isn't found in the metavalues
Preconditionif main score of pep_ids isn't 'q-value'

Referenced by DBSuitability_friend::getScoreMatchingFDR().

◆ getSubsampledFasta_()

std::vector<FASTAFile::FASTAEntry> getSubsampledFasta_ ( const std::vector< FASTAFile::FASTAEntry > &  fasta_data,
double  subsampling_rate 
) const
private

Creates a subsampled fasta with the given subsampling rate.

The subsampling is based on the number of amino acides and not on the number of fasta entries.

Parameters
fasta_datafasta of which the subsampling should be done
subsampling_ratesubsampling rate to be used [0,1]
Returns
fasta entries with total number of AA = original number of AA * subsampling_rate
Exceptions
IllegalArgumentif subsampling rate is not between 0 and 1

Referenced by DBSuitability_friend::getSubsampledFasta().

◆ isNovoHit_()

bool isNovoHit_ ( const PeptideHit hit) const
private

Tests if a PeptideHit is considered a deNovo hit.

To test this the function looks into the protein accessions. If only the deNovo protein is found, 'true' is returned. If at least one database protein is found, 'false' is returned.

This function also uses boost::regex_search to make sure the deNovo accession doesn't contain a decoy string. This is needed for 'target+decoy' hits.

Parameters
hitPepHit in question
Returns
true/false

◆ numberOfUniqueProteins_()

UInt numberOfUniqueProteins_ ( const std::vector< PeptideIdentification > &  peps,
UInt  number_of_hits = 1 
) const
private

Determines the number of unique proteins found in the protein accessions of PeptideIdentifications.

Parameters
pepsvector of PeptideIdentifications
number_of_hitsthe number of hits to search in (if this is bigger than the actual number of hits all hits are looked at)
Returns
number of unique protein accessions
Exceptions
MissingInformationif no target/decoy annotation is found on peps

Referenced by DBSuitability_friend::numberOfUniqueProteins().

◆ runIdentificationSearch_()

std::vector<PeptideIdentification> runIdentificationSearch_ ( const MSExperiment exp,
const std::vector< FASTAFile::FASTAEntry > &  fasta_data,
const String adapter_name,
Param parameters 
) const
private

Executes the workflow from search adapter, followed by PeptideIndexer and finishes with FDR.

Which adapter should run with which parameters can be controlled. Make sure the search adapter you wish to use is built on your system and the executable is on your PATH variable.

Indexing and FDR are always done the same way.

The inputs are stored in temporary files to execute the Adapter. (MSExperiment -> .mzML, vector<FASTAEntry> -> .fasta, Param -> .INI)

Parameters
expMSExperiment that will be searched
fasta_datarepresents the database that should be used to search
adapter_namename of the adapter to search with
parametersparameters for the adapter
Returns
peptide identifications with annotated q-values
Exceptions
MissingInformationif no adapter name is given
InvalidParameterif a not supported adapter name is given
InternalToolErrorif any error occures while running the adapter
InternalToolErrorif any error occures while running PeptideIndexer functionalities
InvalidParameterif the needed FDR parameters are not found

◆ writeIniFile_()

void writeIniFile_ ( const Param parameters,
const String filename 
) const
private

Writes parameters into a given file.

Parameters
parametersparameters to write
filenamename of the file where the parameters should be written to
Exceptions
UnableToCreateFileif filename isn't writable

Friends And Related Function Documentation

◆ DBSuitability_friend

friend class DBSuitability_friend
friend

To test private member functions.

Member Data Documentation

◆ decoy_pattern_

const boost::regex decoy_pattern_
private

pattern for finding a decoy string

◆ results_

std::vector<SuitabilityData> results_
private

result vector