OpenMS
QTCluster Class Reference

A representation of a QT cluster used for feature grouping.

#include <OpenMS/DATASTRUCTURES/QTCluster.h>

Classes

class BulkData
 Class to store the bulk internal data (neighbors, annotations, etc.)
 
struct  Element
 
struct  Neighbor
 

Public Types

typedef std::multimap< double, const GridFeature * > NeighborList
 
typedef std::unordered_map< Size, NeighborList > NeighborMapMulti
 
typedef std::unordered_map< Size, Neighbor > NeighborMap
 
typedef std::vector< Element > Elements
 

Public Member Functions

 QTCluster (BulkData *const data, bool use_IDs)
 Detailed constructor of the cluster head.
 
 QTCluster ()=delete
 Default constructor is not accessible. Objects of this class should only exist with a valid BulkData* given; otherwise most of the member functions have undefined behavior or produce segfaults.
 
 QTCluster (const QTCluster &rhs)=default
 Cheap copy constructor, because most of the data lies outside of this class (BulkData*). Be very careful with this copy constructor: the copy will point to the same BulkData object as the given QTCluster, and the latter should not be used anymore. This operation is only allowed because the boost::heap interface needs it.
 
QTCluster & operator= (const QTCluster &rhs)=default
 Cheap copy assignment, see copy ctor for details.
 
 QTCluster (QTCluster &&rhs)=default
 Cheap move constructor, because most of the data lies outside of this class (BulkData*).
 
QTCluster & operator= (QTCluster &&rhs)=default
 Cheap move assignment, because most of the data lies outside of this class (BulkData*).
 
 ~QTCluster ()=default
 
const GridFeature * getCenterPoint () const
 Returns the cluster center.
 
Size getId () const
 Returns the cluster's ID.
 
double getCenterRT () const
 Returns the RT value of the cluster.
 
double getCenterMZ () const
 Returns the m/z value of the cluster center.
 
Int getXCoord () const
 Returns the x coordinate in the grid.
 
Int getYCoord () const
 Returns the y coordinate in the grid.
 
Size size () const
 Returns the size of the cluster (number of elements, incl. center)
 
bool operator< (const QTCluster &cluster) const
 Compare by quality.
 
void add (const GridFeature *const element, double distance)
 Adds a new element/neighbor to the cluster.
 
Elements getElements () const
 Gets the clustered elements meaning neighbors + cluster center.
 
bool update (const Elements &removed)
 Updates the cluster after the indicated data points are removed.
 
double getQuality ()
 Returns the cluster quality and recomputes if necessary.
 
double getCurrentQuality () const
 Returns the cluster quality without recomputing.
 
const std::set< AASequence > & getAnnotations ()
 Return the set of peptide sequences annotated to the cluster center.
 
void setInvalid ()
 Sets current cluster as invalid (also frees some memory)
 
bool isInvalid () const
 Whether current cluster is invalid.
 
void initializeCluster ()
 Has to be called before adding elements (calling QTCluster::add)
 
void finalizeCluster ()
 Has to be called after adding elements (after calling QTCluster::add one or multiple times)
 
Elements getAllNeighbors () const
 Get all current neighbors.
 

Private Member Functions

void computeQuality_ ()
 Computes the quality of the cluster.
 
double optimizeAnnotations_ ()
 Finds the optimal annotation (peptide sequences) for the cluster.
 
void makeSeqTable_ (std::map< AASequence, std::map< Size, double >> &seq_table) const
 Computes the sequence table, mapping peptides -> best distance per input map.
 
void recomputeNeighbors_ ()
 Reports the elements that are compatible with the optimal annotation.
 

Private Attributes

double quality_
 Quality of the cluster.
 
BulkData * data_
 Pointer to data members.
 
bool valid_
 Whether current cluster is valid.
 
bool changed_
 Has the cluster changed (if yes, quality needs to be recomputed)?
 
bool use_IDs_
 Keep track of peptide IDs and use them for matching?
 
bool collect_annotations_
 Whether initial collection of all neighbors is needed.
 
bool finalized_
 Whether current cluster is accepting new elements or not (if true, no more new elements allowed)
 

Detailed Description

A representation of a QT cluster used for feature grouping.

Ultimately, a cluster represents a group of corresponding features (or consensus features) from different input maps (feature maps or consensus maps).

Clusters are defined by their center points (one feature each). A cluster also stores a number of potential cluster elements (other features) from different input maps, together with their distances to the cluster center. Every feature that satisfies certain constraints with respect to the cluster center is a potential cluster element. However, since a feature group can only contain one feature from each input map, only the "best" (i.e. closest to the cluster center) such feature is considered a true cluster element. To save memory, only the "best" element for each map is stored inside a cluster.
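The "best element per map" rule above can be illustrated with a minimal, standalone sketch. It does not use the OpenMS headers: FeatureStub is a hypothetical stand-in for GridFeature, and keepIfCloser is a hypothetical helper, not a member of QTCluster. Only the Neighbor struct and the NeighborMap typedef mirror names documented on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for OpenMS::GridFeature; only its address is used here.
struct FeatureStub {};

// Mirrors the documented QTCluster::Neighbor struct (distance + feature pointer).
struct Neighbor
{
  double distance;
  const FeatureStub* feature;
};

// Mirrors the documented NeighborMap typedef: input map index -> best neighbor.
using NeighborMap = std::unordered_map<std::size_t, Neighbor>;

// Hypothetical helper: store `candidate` only if its map has no entry yet or
// the candidate is closer to the center than the stored neighbor.
// Returns true if the stored neighbor for `map_index` changed.
bool keepIfCloser(NeighborMap& best, std::size_t map_index,
                  const FeatureStub* candidate, double distance)
{
  assert(candidate != nullptr);
  auto it = best.find(map_index);
  if (it == best.end())
  {
    best.emplace(map_index, Neighbor{distance, candidate});
    return true;
  }
  if (distance < it->second.distance)
  {
    it->second = Neighbor{distance, candidate};
    return true;
  }
  return false;
}
```

With this layout the memory per cluster is bounded by the number of input maps, regardless of how many candidates were seen.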

The QT clustering algorithm has the characteristic of initially producing all possible, overlapping clusters. Iteratively, the best cluster is then extracted and the clustering is recomputed for the remaining points.

In our implementation, multiple rounds of clustering are not necessary. Instead, the clustering is updated in each iteration. This is the reason for temporarily storing all potential cluster elements: When a certain cluster is finalized, its elements have to be removed from the remaining clusters, and affected clusters change their composition. (Note that clusters can also be invalidated by this, if the cluster center is being removed.)
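The update-instead-of-recluster idea can be sketched with a toy model. Everything below is hypothetical: points are plain integers rather than GridFeature pointers, SketchCluster is not the library class, and cluster size is used as a stand-in for quality. The sketch only shows how extracting one cluster removes its points from the others and invalidates clusters whose center was taken.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy cluster: points are integer IDs.
struct SketchCluster
{
  int center;
  std::set<int> members; // neighbors, excluding the center
  bool valid = true;
  std::size_t size() const { return members.size() + 1; } // incl. center
};

// Remove `taken` points from `c`; invalidate it if its center was taken.
void removePoints(SketchCluster& c, const std::set<int>& taken)
{
  if (taken.count(c.center) > 0)
  {
    c.valid = false;
    c.members.clear();
    return;
  }
  for (int p : taken) c.members.erase(p);
}

// Greedy extraction: repeatedly take the largest valid cluster (size is a
// placeholder for quality), then update the remaining clusters in place.
std::vector<std::set<int>> extractClusters(std::vector<SketchCluster> clusters)
{
  std::vector<std::set<int>> result;
  while (true)
  {
    auto best = clusters.end();
    for (auto it = clusters.begin(); it != clusters.end(); ++it)
    {
      if (it->valid && (best == clusters.end() || it->size() > best->size())) best = it;
    }
    if (best == clusters.end()) break; // no valid cluster left

    std::set<int> taken = best->members;
    taken.insert(best->center);
    assert(taken.size() == best->size()); // center must not be listed as a member
    best->valid = false;

    for (SketchCluster& c : clusters)
    {
      if (c.valid) removePoints(c, taken);
    }
    result.push_back(taken);
  }
  return result;
}
```

A linear scan for the best cluster keeps the sketch short; a heap keyed on quality would be the natural replacement when there are many clusters.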

The quality of a cluster is the normalized average distance to the cluster center for present and missing cluster elements. The distance value for missing elements (if the cluster contains no feature from a certain input map) is the user-defined threshold that marks the maximum allowed radius of a cluster.
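As a worked illustration of this definition, the following standalone sketch computes the normalized average distance for a cluster. The function name and parameters are hypothetical, and the sketch assumes that "normalized" means division by the maximum allowed distance and that the average runs over all input maps other than the center's map. How the library maps this value onto the final quality score is not stated here and is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average distance to the center over all *other* input maps, divided by
// max_distance. Maps without a cluster element contribute max_distance.
// The result lies in [0, 1]; 0 means every other map has an element at distance 0.
double normalizedAverageDistance(const std::vector<double>& present_distances,
                                 std::size_t num_other_maps, double max_distance)
{
  assert(max_distance > 0.0);
  assert(num_other_maps > 0);
  assert(present_distances.size() <= num_other_maps);

  double total = 0.0;
  for (double d : present_distances) total += d;

  // Penalize each map that has no element in this cluster.
  const std::size_t missing = num_other_maps - present_distances.size();
  total += static_cast<double>(missing) * max_distance;

  return total / static_cast<double>(num_other_maps) / max_distance;
}
```

For example, with distances 0.2 and 0.4 present, four other maps, and a maximum distance of 1.0, the total is 0.6 + 2 * 1.0 = 2.6 and the normalized average is 2.6 / 4 = 0.65.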

When adding elements to the cluster, the client must call initializeCluster first and finalizeCluster after adding the last element. After finalizeCluster, no more elements may be added through the add function unless the client calls initializeCluster again.
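The required call order can be modeled with a small toy class. LifecycleSketch is hypothetical and does no clustering; it only enforces the initialize / add / finalize protocol. Throwing std::logic_error on misuse and starting in the "not accepting elements" state are assumptions of this sketch, not documented behavior of QTCluster.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical toy class; it only models the call order.
class LifecycleSketch
{
public:
  void initializeCluster() { finalized_ = false; }

  void add(double distance)
  {
    // Assumption: misuse is reported by an exception in this sketch.
    if (finalized_) throw std::logic_error("call initializeCluster() before add()");
    assert(distance >= 0.0);
    distances_.push_back(distance);
  }

  void finalizeCluster() { finalized_ = true; }

  std::size_t size() const { return distances_.size() + 1; } // incl. center

private:
  std::vector<double> distances_;
  bool finalized_ = true; // assumption: a new cluster does not accept elements yet
};
```

A typical sequence is initializeCluster(), one or more add() calls, then finalizeCluster(); a later add() is rejected until initializeCluster() is called again.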

If use_IDs_ is set, clusters are extended only with elements that have at least one matching ID. Quality is then computed as the best quality of all possible IDs, and this ID is then used as the only (representative) ID of the cluster. The left-out alternative IDs might be added back later based on the original features, though.
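The "at least one matching ID" test amounts to checking whether two annotation sets intersect. The sketch below is standalone and hypothetical: peptide sequences are plain strings instead of AASequence, and sharesAnnotation is not a library function. How elements without any annotation are treated is not specified here; this sketch simply reports no match for an empty set.

```cpp
#include <set>
#include <string>

// Stand-in for std::set<AASequence>; peptide sequences are plain strings here.
using Annotations = std::set<std::string>;

// True if the two annotation sets share at least one sequence.
bool sharesAnnotation(const Annotations& a, const Annotations& b)
{
  // Iterate over the smaller set and look up in the larger one.
  const Annotations& fewer = a.size() <= b.size() ? a : b;
  const Annotations& more = a.size() <= b.size() ? b : a;
  for (const std::string& seq : fewer)
  {
    if (more.count(seq) > 0) return true;
  }
  return false;
}
```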

Todo:
This implementation may benefit from two separate implementations (one considering IDs/annotations, one without). The current implementation most likely hinders speed/memory of both by trying to do both in one. The ID-based implementation could additionally benefit from ID scores and make use of ConsensusID functions.
See also
QTClusterFinder

Class Documentation

◆ OpenMS::QTCluster::Element

struct OpenMS::QTCluster::Element
Class Members
const GridFeature * feature
Size map_index

◆ OpenMS::QTCluster::Neighbor

struct OpenMS::QTCluster::Neighbor
Class Members
double distance
const GridFeature * feature

Member Typedef Documentation

◆ Elements

typedef std::vector<Element> Elements

◆ NeighborList

typedef std::multimap<double, const GridFeature*> NeighborList

◆ NeighborMap

typedef std::unordered_map<Size, Neighbor> NeighborMap

◆ NeighborMapMulti

typedef std::unordered_map<Size, NeighborList> NeighborMapMulti

Constructor & Destructor Documentation

◆ QTCluster() [1/4]

QTCluster ( BulkData *const  data,
bool  use_IDs 
)

Detailed constructor of the cluster head.

Parameters
data      Pointer to internal data
use_IDs   Use peptide annotations?

◆ QTCluster() [2/4]

QTCluster ( )
delete

Default constructor is not accessible. Objects of this class should only exist with a valid BulkData* given; otherwise most of the member functions have undefined behavior or produce segfaults.

◆ QTCluster() [3/4]

QTCluster ( const QTCluster &  rhs)
default

Cheap copy constructor, because most of the data lies outside of this class (BulkData*). Be very careful with this copy constructor: the copy will point to the same BulkData object as the given QTCluster, and the latter should not be used anymore. This operation is only allowed because the boost::heap interface needs it.
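The aliasing that this warning refers to can be shown with a minimal sketch. BulkStub and Handle are hypothetical stand-ins, not the library types; the only property taken from this page is that the class holds a raw pointer to externally stored data and uses the defaulted copy constructor.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for QTCluster::BulkData.
struct BulkStub
{
  std::vector<double> distances;
};

// Hypothetical handle that, like the documented class, stores only a raw
// pointer and relies on the compiler-generated (shallow) copy.
struct Handle
{
  explicit Handle(BulkStub* data) : data_(data) { assert(data != nullptr); }
  BulkStub* data_;
};
```

After `Handle b = a;`, both handles point to the same BulkStub, so a change made through `b` is visible through `a`. That is why the original object should not be used after copying.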

◆ QTCluster() [4/4]

QTCluster ( QTCluster &&  rhs)
default

Cheap move constructor, because most of the data lies outside of this class (BulkData*).

◆ ~QTCluster()

~QTCluster ( )
default

Member Function Documentation

◆ add()

void add ( const GridFeature *const  element,
double  distance 
)

Adds a new element/neighbor to the cluster.

Note
There is no check whether the element/neighbor already exists in the cluster!
Parameters
element    The element to be added
distance   Distance of the element to the center point

◆ computeQuality_()

void computeQuality_ ( )
private

Computes the quality of the cluster.

◆ finalizeCluster()

void finalizeCluster ( )

Has to be called after adding elements (after calling QTCluster::add one or multiple times)

◆ getAllNeighbors()

Elements getAllNeighbors ( ) const

Get all current neighbors.

◆ getAnnotations()

const std::set<AASequence>& getAnnotations ( )

Return the set of peptide sequences annotated to the cluster center.

◆ getCenterMZ()

double getCenterMZ ( ) const

Returns the m/z value of the cluster center.

◆ getCenterPoint()

const GridFeature* getCenterPoint ( ) const

Returns the cluster center.

◆ getCenterRT()

double getCenterRT ( ) const

Returns the RT value of the cluster.

◆ getCurrentQuality()

double getCurrentQuality ( ) const

Returns the cluster quality without recomputing.

◆ getElements()

Elements getElements ( ) const

Gets the clustered elements meaning neighbors + cluster center.

◆ getId()

Size getId ( ) const

Returns the cluster's ID.

◆ getQuality()

double getQuality ( )

Returns the cluster quality and recomputes if necessary.
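The difference between getQuality() and getCurrentQuality() follows the usual "dirty flag" caching pattern, which the changed_ member suggests. The class below is a hypothetical standalone sketch of that pattern: the metric (mean distance) is a placeholder rather than the library's quality formula, and recomputeCount() exists only to make the caching observable.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy class showing a dirty-flag cache.
class CachedQualitySketch
{
public:
  void setDistances(std::vector<double> distances)
  {
    assert(!distances.empty());
    distances_ = std::move(distances);
    changed_ = true; // cached value is now stale
  }

  // Non-const: recomputes and refreshes the cache if the data changed.
  double getQuality()
  {
    if (changed_) recompute_();
    return quality_;
  }

  // Const: returns whatever is cached, even if it is stale.
  double getCurrentQuality() const { return quality_; }

  std::size_t recomputeCount() const { return recomputations_; }

private:
  void recompute_()
  {
    double sum = 0.0;
    for (double d : distances_) sum += d;
    quality_ = sum / static_cast<double>(distances_.size()); // placeholder metric
    changed_ = false;
    ++recomputations_;
  }

  std::vector<double> distances_;
  double quality_ = 0.0;
  bool changed_ = false;
  std::size_t recomputations_ = 0;
};
```

Calling getQuality() twice in a row triggers only one recomputation, while getCurrentQuality() never triggers one and can therefore be const.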

◆ getXCoord()

Int getXCoord ( ) const

Returns the x coordinate in the grid.

◆ getYCoord()

Int getYCoord ( ) const

Returns the y coordinate in the grid.

◆ initializeCluster()

void initializeCluster ( )

Has to be called before adding elements (calling QTCluster::add)

◆ isInvalid()

bool isInvalid ( ) const
inline

Whether current cluster is invalid.

◆ makeSeqTable_()

void makeSeqTable_ ( std::map< AASequence, std::map< Size, double >> &  seq_table) const
private

Computes the sequence table, mapping peptides -> best distance per input map.

◆ operator<()

bool operator< ( const QTCluster &  cluster) const

Compare by quality.

◆ operator=() [1/2]

QTCluster& operator= ( const QTCluster &  rhs)
default

Cheap copy assignment, see copy ctor for details.

◆ operator=() [2/2]

QTCluster& operator= ( QTCluster &&  rhs)
default

Cheap move assignment, because most of the data lies outside of this class (BulkData*).

◆ optimizeAnnotations_()

double optimizeAnnotations_ ( )
private

Finds the optimal annotation (peptide sequences) for the cluster.

The optimal annotation is the one that results in the best quality. It is stored in annotations_.

This function is only needed when peptide ids are used and the current center point does not have any peptide id associated with it. In this case, it is not clear which peptide id the current cluster should use. The function thus iterates through all possible peptide ids and selects the one producing the best cluster.

This function needs access to all possible neighbors for this cluster and thus can only be run when tmp_neighbors_ is filled (which is during the filling of a cluster). The function thus cannot be called after finalizing the cluster.

Returns
The total distance between cluster elements and the center.
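The selection described above can be sketched in two steps: build a table of the best distance per peptide and input map, then pick the peptide with the smallest total distance. The code is a standalone approximation under stated assumptions: sequences are plain strings instead of AASequence, Candidate, makeSeqTable, and bestSequence are hypothetical names, and maps without a matching element are charged the maximum distance. Handling of unannotated neighbors is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical candidate neighbor: its input map, distance, and peptide IDs.
struct Candidate
{
  std::size_t map_index;
  double distance;
  std::vector<std::string> sequences;
};

// Same shape as the documented seq_table: peptide -> (map index -> best distance).
using SeqTable = std::map<std::string, std::map<std::size_t, double>>;

SeqTable makeSeqTable(const std::vector<Candidate>& candidates)
{
  SeqTable table;
  for (const Candidate& c : candidates)
  {
    for (const std::string& seq : c.sequences)
    {
      std::map<std::size_t, double>& per_map = table[seq];
      auto it = per_map.find(c.map_index);
      if (it == per_map.end()) per_map[c.map_index] = c.distance;
      else if (c.distance < it->second) it->second = c.distance; // keep the closest
    }
  }
  return table;
}

// Returns the sequence with the smallest total distance and that total.
// Each of the num_other_maps without a match contributes max_distance.
std::pair<std::string, double> bestSequence(const SeqTable& table,
                                            std::size_t num_other_maps,
                                            double max_distance)
{
  assert(max_distance > 0.0);
  std::pair<std::string, double> best{"", std::numeric_limits<double>::infinity()};
  for (const auto& entry : table)
  {
    assert(entry.second.size() <= num_other_maps);
    double total = 0.0;
    for (const auto& map_dist : entry.second) total += map_dist.second;
    total += static_cast<double>(num_other_maps - entry.second.size()) * max_distance;
    if (total < best.second) best = {entry.first, total};
  }
  return best;
}
```

A peptide that is seen in more maps tends to win, because every uncovered map adds the full maximum distance to its total.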

◆ recomputeNeighbors_()

void recomputeNeighbors_ ( )
private

Reports the elements that are compatible with the optimal annotation.

◆ setInvalid()

void setInvalid ( )

Sets current cluster as invalid (also frees some memory)

Note
Do not attempt to use the cluster again once it is invalid; some internal data structures have been cleared.

◆ size()

Size size ( ) const

Returns the size of the cluster (number of elements, incl. center)

◆ update()

bool update ( const Elements &  removed)

Updates the cluster after the indicated data points are removed.

Parameters
removed   The data points to be removed from the cluster
Returns
Whether the cluster composition has changed due to the update

Member Data Documentation

◆ changed_

bool changed_
private

Has the cluster changed (if yes, quality needs to be recomputed)?

◆ collect_annotations_

bool collect_annotations_
private

Whether initial collection of all neighbors is needed.

This variable stores whether we need to collect all annotations first before we can decide upon the best set of cluster points. This is usually only necessary if the center point does not have an annotation but we want to use ids.

◆ data_

BulkData* data_
private

Pointer to data members.

◆ finalized_

bool finalized_
private

Whether current cluster is accepting new elements or not (if true, no more new elements allowed)

◆ quality_

double quality_
private

Quality of the cluster.

◆ use_IDs_

bool use_IDs_
private

Keep track of peptide IDs and use them for matching?

◆ valid_

bool valid_
private

Whether current cluster is valid.