OpenMS
IDMergerAlgorithm Class Reference

Algorithm for merging multiple protein and peptide identification runs.

#include <OpenMS/ANALYSIS/ID/IDMergerAlgorithm.h>

Inherits DefaultParamHandler and ProgressLogger.

Public Member Functions

 IDMergerAlgorithm (const String &runIdentifier="merged", bool addTimeStampToID=true)
 Constructor for the IDMergerAlgorithm.

void insertRuns (std::vector< ProteinIdentification > &&prots, PeptideIdentificationList &&peps)
 Insert runs using move semantics.

void insertRuns (const std::vector< ProteinIdentification > &prots, const PeptideIdentificationList &peps)
 Insert runs using copy semantics.

void returnResultsAndClear (ProteinIdentification &prots, PeptideIdentificationList &peps)
 Return the merged results and reset internal state.
 
- Public Member Functions inherited from DefaultParamHandler
 DefaultParamHandler (const String &name)
 Constructor with name that is displayed in error messages.

 DefaultParamHandler (const DefaultParamHandler &rhs)
 Copy constructor.

virtual ~DefaultParamHandler ()
 Destructor.

DefaultParamHandler & operator= (const DefaultParamHandler &rhs)
 Assignment operator.

virtual bool operator== (const DefaultParamHandler &rhs) const
 Equality operator.

void setParameters (const Param &param)
 Sets the parameters.

const Param & getParameters () const
 Non-mutable access to the parameters.

const Param & getDefaults () const
 Non-mutable access to the default parameters.

const String & getName () const
 Non-mutable access to the name.

void setName (const String &name)
 Mutable access to the name.

const std::vector< String > & getSubsections () const
 Non-mutable access to the registered subsections.
 
- Public Member Functions inherited from ProgressLogger
 ProgressLogger ()
 Constructor.

virtual ~ProgressLogger ()
 Destructor.

 ProgressLogger (const ProgressLogger &other)
 Copy constructor.

ProgressLogger & operator= (const ProgressLogger &other)
 Assignment operator.

void setLogType (LogType type) const
 Sets the progress log that should be used. The default type is NONE.

LogType getLogType () const
 Returns the type of progress log being used.

void setLogger (ProgressLoggerImpl *logger)
 Sets the logger to be used for progress logging.

void startProgress (SignedSize begin, SignedSize end, const String &label) const
 Initializes the progress display.

void setProgress (SignedSize value) const
 Sets the current progress.

void endProgress (UInt64 bytes_processed=0) const

void nextProgress () const
 Increments the progress by 1 (according to range begin-end).
 

Private Types

using hash_type = std::size_t(*)(const ProteinHit &)
 Type alias for the hash function.

using equal_type = bool(*)(const ProteinHit &, const ProteinHit &)
 Type alias for the equality function.
 

Private Member Functions

String getNewIdentifier_ (bool addTimeStampToID) const
 Generate a new identifier for the merged run.

bool checkOldRunConsistency_ (const std::vector< ProteinIdentification > &protRuns, const String &experiment_type) const
 Check consistency of search engines and settings across runs.

bool checkOldRunConsistency_ (const std::vector< ProteinIdentification > &protRuns, const ProteinIdentification &ref, const String &experiment_type) const
 Check consistency of search engines and settings against a reference.

void insertProteinIDs_ (std::vector< ProteinIdentification > &&old_protRuns)
 Insert protein identifications into the merged result.

void updateAndMovePepIDs_ (PeptideIdentificationList &&pepIDs, const std::map< String, Size > &runID_to_runIdx, const std::vector< StringList > &originFiles, bool annotate_origin)
 Update peptide ID references and move them to the result.

void movePepIDsAndRefProteinsToResultFaster_ (PeptideIdentificationList &&pepIDs, std::vector< ProteinIdentification > &&old_protRuns)
 Optimized method to move peptide IDs and reference proteins to result.
 

Static Private Member Functions

static void copySearchParams_ (const ProteinIdentification &from, ProteinIdentification &to)
 Copy search parameters between protein identifications.

static size_t accessionHash_ (const ProteinHit &p)
 Hash function for protein hits based on accession.

static bool accessionEqual_ (const ProteinHit &p1, const ProteinHit &p2)
 Equality function for protein hits based on accession.
 

Private Attributes

ProteinIdentification prot_result_
 The resulting merged protein identification.

PeptideIdentificationList pep_result_
 The resulting merged peptide identifications.

std::unordered_set< ProteinHit, hash_type, equal_type > collected_protein_hits_
 Set of collected protein hits using custom hash and equality functions.

bool filled_ = false
 Flag indicating whether the resulting protein ID is already filled.

std::map< String, Size > file_origin_to_idx_
 Mapping to keep track of the mzML origins of spectra.

String id_
 The new identifier string for the merged run.

bool fixed_identifier_
 Flag indicating whether the identifier should be fixed (i.e., not contain a timestamp).
 

Additional Inherited Members

- Public Types inherited from ProgressLogger
enum  LogType { CMD , GUI , NONE }
 Possible log types.

- Static Public Member Functions inherited from DefaultParamHandler
static void writeParametersToMetaValues (const Param &write_this, MetaInfoInterface &write_here, const String &key_prefix="")
 Writes all parameters to meta values.

- Protected Member Functions inherited from DefaultParamHandler
virtual void updateMembers_ ()
 This method is used to update extra member variables at the end of the setParameters() method.

void defaultsToParam_ ()
 Updates the parameters after the defaults have been set in the constructor.

- Protected Attributes inherited from DefaultParamHandler
Param param_
 Container for current parameters.

Param defaults_
 Container for default parameters. This member should be filled in the constructor of derived classes!

std::vector< String > subsections_
 Container for registered subsections. This member should be filled in the constructor of derived classes!

String error_name_
 Name that is displayed in error messages during the parameter checking.

bool check_defaults_
 If this member is set to false, no checking of parameters is done.

bool warn_empty_defaults_
 If this member is set to false, no warning is emitted when defaults are empty.
 
- Protected Attributes inherited from ProgressLogger
LogType type_
 
time_t last_invoke_
 
ProgressLoggerImpl * current_logger_
 
- Static Protected Attributes inherited from ProgressLogger
static int recursion_depth_
 

Detailed Description

Algorithm for merging multiple protein and peptide identification runs.

This class creates a new Protein ID run into which other runs can be inserted. It performs the following operations:

  • Creates a union of protein hits from all inserted runs
  • Concatenates Peptide-Spectrum Matches (PSMs) from all runs
  • Checks search engine consistency across all inserted runs
  • Maintains references between peptide IDs and their corresponding protein IDs

The algorithm differs from the IDMerger tool in two key aspects:

  1. It is implemented as an algorithm class rather than a tool
  2. It allows inserting multiple peptide hits per peptide sequence (not only the first occurrence)

The class handles the complexity of merging identification data from different sources while ensuring consistency and maintaining proper references between proteins and peptides. It can be used in workflows where identification results from multiple files or runs need to be combined into a single comprehensive result set.

The algorithm can optionally annotate the origin of each identification to maintain traceability of the merged results back to their source files.

See also
IDMerger
Todo:
Allow filtering for peptide sequence to supersede the IDMerger tool. Make it keep the best PSMs though.
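
The merge behaviour described above (union of protein hits, concatenation of PSMs, re-pointing peptide IDs to the merged run) can be illustrated with a small standard-library model. This is only a sketch under stated assumptions: `MiniRun`, `MiniPsm`, `MiniMerger` and `mergeTwoRuns` are hypothetical stand-ins invented for illustration, not OpenMS types, and the consistency checks of the real class are left out.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for ProteinIdentification / PeptideIdentification;
// the real OpenMS classes carry far more data.
struct MiniRun
{
  std::string identifier;               // run identifier
  std::vector<std::string> accessions;  // protein hits, reduced to accessions
};

struct MiniPsm
{
  std::string run_identifier;  // run this PSM refers to
  std::string sequence;        // peptide sequence
};

struct MiniMerger
{
  std::string merged_id = "merged";
  std::unordered_set<std::string> proteins;  // union of all accessions
  std::vector<MiniPsm> psms;                 // concatenated PSMs

  void insertRuns(const std::vector<MiniRun>& runs, const std::vector<MiniPsm>& peps)
  {
    std::unordered_set<std::string> known_runs;
    for (const MiniRun& run : runs)
    {
      known_runs.insert(run.identifier);
      proteins.insert(run.accessions.begin(), run.accessions.end());
    }
    for (const MiniPsm& psm : peps)
    {
      // Skip PSMs whose run is not among the inserted protein runs.
      if (known_runs.count(psm.run_identifier) == 0) continue;
      // Keep every PSM (no per-sequence filtering) and re-point it to the merged run.
      psms.push_back(MiniPsm{merged_id, psm.sequence});
    }
  }
};

// Merges two small runs that share one protein and one peptide sequence.
MiniMerger mergeTwoRuns()
{
  MiniMerger m;
  std::vector<MiniRun> runs1 = {MiniRun{"run1", {"P1", "P2"}}};
  std::vector<MiniPsm> peps1 = {MiniPsm{"run1", "PEPTIDEK"}, MiniPsm{"other", "IGNORED"}};
  std::vector<MiniRun> runs2 = {MiniRun{"run2", {"P2", "P3"}}};
  std::vector<MiniPsm> peps2 = {MiniPsm{"run2", "PEPTIDEK"}};
  m.insertRuns(runs1, peps1);
  m.insertRuns(runs2, peps2);
  assert(m.proteins.size() == 3);  // P1, P2, P3: the shared hit P2 is stored once
  return m;
}
```

In this model the peptide sequence PEPTIDEK is kept once per run, which mirrors the statement that multiple hits per peptide sequence are allowed, while the PSM pointing to an unknown run is dropped.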

Member Typedef Documentation

◆ equal_type

using equal_type = bool (*)(const ProteinHit&, const ProteinHit&)
private

Type alias for the equality function.

◆ hash_type

using hash_type = std::size_t (*)(const ProteinHit&)
private

Type alias for the hash function.

Constructor & Destructor Documentation

◆ IDMergerAlgorithm()

IDMergerAlgorithm ( const String &  runIdentifier = "merged",
bool  addTimeStampToID = true 
)
explicit

Constructor for the IDMergerAlgorithm.

Initializes a new merger with the specified run identifier.

Parameters
runIdentifier  Base identifier for the merged run (default: "merged")
addTimeStampToID  Whether to append a timestamp to the run identifier for uniqueness (default: true)

Member Function Documentation

◆ accessionEqual_()

static bool accessionEqual_ ( const ProteinHit &  p1,
const ProteinHit &  p2 
)
inline static private

Equality function for protein hits based on accession.

Parameters
p1  First protein hit to compare
p2  Second protein hit to compare
Returns
True if the accessions are equal, false otherwise

References ProteinHit::getAccession().

◆ accessionHash_()

static size_t accessionHash_ ( const ProteinHit &  p)
inline static private

Hash function for protein hits based on accession.

Parameters
p  Protein hit to hash
Returns
Hash value for the protein hit

References ProteinHit::getAccession().
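
The pairing of `hash_type` and `equal_type` with `std::unordered_set` can be sketched with the standard library alone. The following is an assumption-laden illustration: `MiniHit`, `accessionHash`, `accessionEqual`, `makeHitSet` and `demoHitSet` are hypothetical names, `MiniHit` is a reduced stand-in for ProteinHit, and the bucket count of 16 is arbitrary. It shows why a set keyed this way stores each accession only once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical, heavily reduced stand-in for ProteinHit.
struct MiniHit
{
  std::string accession;
  double score;
};

// Function-pointer aliases with the same shape as the private typedefs.
using hash_type = std::size_t (*)(const MiniHit&);
using equal_type = bool (*)(const MiniHit&, const MiniHit&);

std::size_t accessionHash(const MiniHit& p)
{
  return std::hash<std::string>{}(p.accession);
}

bool accessionEqual(const MiniHit& p1, const MiniHit& p2)
{
  return p1.accession == p2.accession;
}

std::unordered_set<MiniHit, hash_type, equal_type> makeHitSet()
{
  // A default-constructed function pointer is null, so the hash and equality
  // functions must be passed to the constructor explicitly.
  return std::unordered_set<MiniHit, hash_type, equal_type>(16, accessionHash, accessionEqual);
}

// Inserting two hits with the same accession keeps only the first one.
void demoHitSet()
{
  auto hits = makeHitSet();
  hits.insert(MiniHit{"P12345", 10.0});
  hits.insert(MiniHit{"P12345", 99.0});  // same accession: rejected
  assert(hits.size() == 1);
}
```

Because both functions look only at the accession, other fields such as the score play no role in deduplication; the first inserted hit wins.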

◆ checkOldRunConsistency_() [1/2]

bool checkOldRunConsistency_ ( const std::vector< ProteinIdentification > &  protRuns,
const ProteinIdentification &  ref,
const String &  experiment_type 
) const
private

Check consistency of search engines and settings against a reference.

Verifies that all runs have compatible search engine settings before merging, using an explicitly provided reference run.

Parameters
protRuns  The runs to check
ref  An external protein run to use as reference
experiment_type  Experiment type to allow certain mismatches (e.g., "SILAC")
Returns
True if all runs are consistent with the reference, false otherwise
Exceptions
BaseException  for disagreeing settings
Todo:
Return a merged RunDescription about what to put in the new runs (e.g., for SILAC)

◆ checkOldRunConsistency_() [2/2]

bool checkOldRunConsistency_ ( const std::vector< ProteinIdentification > &  protRuns,
const String &  experiment_type 
) const
private

Check consistency of search engines and settings across runs.

Verifies that all runs have compatible search engine settings before merging. Uses the first run as an implicit reference.

Parameters
protRuns  The runs to check (first = implicit reference)
experiment_type  Experiment type to allow certain mismatches (e.g., "SILAC")
Returns
True if all runs are consistent, false otherwise
Exceptions
BaseException  for disagreeing settings
Todo:
Return a merged RunDescription about what to put in the new runs (e.g., for SILAC)

◆ copySearchParams_()

static void copySearchParams_ ( const ProteinIdentification &  from,
ProteinIdentification &  to 
)
static private

Copy search parameters between protein identifications.

Transfers search parameters from one protein identification to another.

Parameters
from  Source protein identification
to  Destination protein identification

◆ getNewIdentifier_()

String getNewIdentifier_ ( bool  addTimeStampToID) const
private

Generate a new identifier for the merged run.

Creates a new identifier by combining the base identifier with a timestamp if requested.

Parameters
addTimeStampToID  Whether to append a timestamp to the identifier
Returns
The generated identifier string
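
How a base identifier and an optional timestamp might be combined can be sketched as follows. The separator, the timestamp format and the helper names `makeRunIdentifier` and `demoRunIdentifier` are assumptions made for illustration; the documentation does not specify the format that getNewIdentifier_() actually produces.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical helper: returns the base identifier unchanged, or appends
// "_<local time>" when a timestamp is requested. The "_" separator and the
// ISO-like format are illustrative assumptions, not the OpenMS format.
std::string makeRunIdentifier(const std::string& base, bool add_timestamp)
{
  if (!add_timestamp) return base;

  std::time_t now = std::time(nullptr);
  const std::tm* local = std::localtime(&now);  // not thread-safe; fine for a sketch
  if (local == nullptr) return base;            // fall back if the time is unavailable

  char buffer[32];
  std::size_t written = std::strftime(buffer, sizeof(buffer), "%Y-%m-%dT%H:%M:%S", local);
  return base + "_" + std::string(buffer, written);
}

// A fixed identifier is reproducible; a timestamped one extends the base.
void demoRunIdentifier()
{
  assert(makeRunIdentifier("merged", false) == "merged");
  assert(makeRunIdentifier("merged", true).size() > std::string("merged").size());
}
```

A fixed identifier is convenient for reproducible output and tests, whereas the timestamped variant reduces the chance that two merged runs share an identifier.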

◆ insertProteinIDs_()

void insertProteinIDs_ ( std::vector< ProteinIdentification > &&  old_protRuns)
private

Insert protein identifications into the merged result.

Moves and inserts protein IDs if not yet present, then clears the input.

Parameters
old_protRuns  Vector of protein identifications to insert

◆ insertRuns() [1/2]

void insertRuns ( const std::vector< ProteinIdentification > &  prots,
const PeptideIdentificationList &  peps 
)

Insert runs using copy semantics.

Inserts (copies) protein and peptide identifications into the internal merged data structures. This version preserves the source data.

Note:

  • Only inserts PeptideIdentifications that belong to runs present in prots (no-op if prots is empty)
  • Duplicates file origins if multiple (compatible) protein runs from the same spectrum file are merged

Parameters
prots  Vector of protein identifications to be merged
peps  Vector of peptide identifications to be merged

◆ insertRuns() [2/2]

void insertRuns ( std::vector< ProteinIdentification > &&  prots,
PeptideIdentificationList &&  peps 
)

Insert runs using move semantics.

Inserts (moves and clears) protein and peptide identifications into the internal merged data structures. This version uses move semantics for better performance when the source data is no longer needed.

Note:

  • Only inserts PeptideIdentifications that belong to runs present in prots (no-op if prots is empty)
  • Duplicates file origins if multiple (compatible) protein runs from the same spectrum file are merged

Parameters
prots  Vector of protein identifications to be merged
peps  Vector of peptide identifications to be merged
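
The difference between the copying and the moving overload can be illustrated with a generic pattern. This sketch assumes that the copy overload makes a private copy and delegates to the move overload; whether IDMergerAlgorithm is implemented this way is not stated in the documentation, and `MiniCollector` and `demoCollector` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical collector with one copying and one moving insert overload.
class MiniCollector
{
public:
  // Move overload: takes the elements and leaves the input empty.
  void insert(std::vector<std::string>&& items)
  {
    for (std::string& item : items) store_.push_back(std::move(item));
    items.clear();  // explicit clear, matching "moves and clears"
  }

  // Copy overload: copies first, so the caller's data stays untouched.
  void insert(const std::vector<std::string>& items)
  {
    std::vector<std::string> copy(items);
    insert(std::move(copy));
  }

  std::size_t size() const { return store_.size(); }

private:
  std::vector<std::string> store_;
};

// Copying keeps the source intact; moving empties it.
void demoCollector()
{
  MiniCollector collector;
  std::vector<std::string> source = {"x", "y"};
  collector.insert(source);             // lvalue: copy overload
  assert(source.size() == 2);
  collector.insert(std::move(source));  // rvalue: move overload
  assert(source.empty());
  assert(collector.size() == 4);
}
```

Callers that still need their identification data after merging would pick the copying form; callers that load data only to merge it can pass it with `std::move` and avoid the copy.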

◆ movePepIDsAndRefProteinsToResultFaster_()

void movePepIDsAndRefProteinsToResultFaster_ ( PeptideIdentificationList &&  pepIDs,
std::vector< ProteinIdentification > &&  old_protRuns 
)
private

Optimized method to move peptide IDs and reference proteins to result.

A faster implementation for moving peptide IDs and their referenced proteins to the result data structures.

Parameters
pepIDs  Vector of peptide identifications to move
old_protRuns  Vector of protein identifications to reference

◆ returnResultsAndClear()

void returnResultsAndClear ( ProteinIdentification &  prots,
PeptideIdentificationList &  peps 
)

Return the merged results and reset internal state.

Retrieves the merged protein and peptide identifications and clears all internal data structures, preparing the algorithm instance for potential reuse.

This method should be called after all desired runs have been inserted to obtain the final merged result.

Parameters
prots  [out] The merged protein identification containing the union of all protein hits
peps  [out] The merged peptide identifications containing all PSMs from the inserted runs
Note
After calling this method, the internal state is reset, and the algorithm can be reused for a new merging operation.
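
The "return and reset" contract can be modelled with a small accumulator. This is a sketch only: `MiniAccumulator` and `demoAccumulator` are hypothetical, and the members merely mirror the roles of prot_result_, pep_result_ and filled_ without using the OpenMS types.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature accumulator, not the OpenMS implementation.
class MiniAccumulator
{
public:
  void add(const std::string& protein, const std::string& psm)
  {
    proteins_.push_back(protein);
    psms_.push_back(psm);
    filled_ = true;
  }

  // Hands the accumulated data to the caller, then resets to a fresh state.
  void returnResultsAndClear(std::vector<std::string>& prots, std::vector<std::string>& peps)
  {
    prots = std::move(proteins_);
    peps = std::move(psms_);
    // Moved-from vectors are valid but unspecified, so clear them explicitly.
    proteins_.clear();
    psms_.clear();
    filled_ = false;
  }

  bool filled() const { return filled_; }

private:
  std::vector<std::string> proteins_;
  std::vector<std::string> psms_;
  bool filled_ = false;
};

// The same instance can serve two independent merges.
void demoAccumulator()
{
  MiniAccumulator acc;
  std::vector<std::string> prots, peps;
  acc.add("P1", "PEPTIDEK");
  acc.returnResultsAndClear(prots, peps);
  assert(!acc.filled());
  acc.add("P2", "ANOTHERK");
  acc.returnResultsAndClear(prots, peps);
  assert(prots.size() == 1);  // results of the first merge did not leak in
}
```

Moving the results out avoids copying potentially large containers, and the explicit reset is what makes reuse of the same instance safe.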

◆ updateAndMovePepIDs_()

void updateAndMovePepIDs_ ( PeptideIdentificationList &&  pepIDs,
const std::map< String, Size > &  runID_to_runIdx,
const std::vector< StringList > &  originFiles,
bool  annotate_origin 
)
private

Update peptide ID references and move them to the result.

Updates the references in peptide IDs to point to the new protein ID run, then moves the peptide IDs based on the provided mapping.

Parameters
pepIDs  Vector of peptide identifications to update and move
runID_to_runIdx  Mapping from run IDs to run indices
originFiles  List of origin files for each run
annotate_origin  Whether to annotate peptide IDs with their origin

Member Data Documentation

◆ collected_protein_hits_

std::unordered_set<ProteinHit, hash_type, equal_type> collected_protein_hits_
private

Set of collected protein hits using custom hash and equality functions.

◆ file_origin_to_idx_

std::map<String, Size> file_origin_to_idx_
private

Mapping to keep track of the mzML origins of spectra.

◆ filled_

bool filled_ = false
private

Flag indicating whether the resulting protein ID is already filled.

◆ fixed_identifier_

bool fixed_identifier_
private

Flag indicating whether the identifier should be fixed (i.e., not contain a timestamp).

◆ id_

String id_
private

The new identifier string for the merged run.

◆ pep_result_

PeptideIdentificationList pep_result_
private

The resulting merged peptide identifications.

◆ prot_result_

ProteinIdentification prot_result_
private

The resulting merged protein identification.