BALL
1.4.79
|
This class provides efficient similarity calculation functionality for 2D binary fingerprints. More...
#include <BALL/STRUCTURE/binaryFingerprintMethods.h>
Classes | |
struct | Default |
struct | Option |
Public Member Functions | |
Constructors and Destructors | |
BinaryFingerprintMethods () | |
BinaryFingerprintMethods (const Options &options) | |
BinaryFingerprintMethods (const Options &options, const FingerprintFeatures &lib_features) | |
BinaryFingerprintMethods (const Options &options, const FingerprintFeatures &lib_features, const FingerprintFeatures &query_features) | |
BinaryFingerprintMethods (const BinaryFingerprintMethods &bfm) | |
virtual | ~BinaryFingerprintMethods () |
Assignment | |
BinaryFingerprintMethods & | operator= (const BinaryFingerprintMethods &bfm) |
Accessors | |
bool | setLibraryFeatures (const FingerprintFeatures &lib_features) |
bool | setQueryFeatures (const FingerprintFeatures &query_features) |
unsigned int | getTargetLibrarySize () const |
unsigned int | getQueryLibrarySize () const |
const Options & | getOptions () const |
void | setVerbosityLevel (const int verbosity) |
Applications | |
bool | cutoffSearch (const float sim_cutoff, const String &outfile_name) |
bool | connectedComponents (const std::vector< unsigned int > &selection, std::vector< std::vector< unsigned int > > &ccs, std::vector< std::vector< std::pair< unsigned int, float > > > &nn_data, const float cutoff, const bool store_nns=false) |
bool | averageLinkageClustering (const std::vector< unsigned int > &selection, std::vector< std::pair< unsigned int, float > > &nn_data, std::map< unsigned int, std::vector< unsigned int > > &cluster_selection) |
bool | calculateSelectionMedoid (const std::vector< unsigned int > &selection, unsigned int &medoid_index, std::vector< float > &avg_sims) |
Static Public Member Functions | |
static bool | parseBinaryFingerprint (const String &fprint, std::vector< unsigned short > &features, unsigned int fp_type, const char *delim=",") |
This class provides efficient similarity calculation functionality for 2D binary fingerprints.
Binary Fingerprint Methods BinaryFingerprintMethods uses a blocked inverted index data structures as a representation of 2D binary fingerprints. This data structure enables efficient calculation of similarity coefficients which use shared feature counts between two fingerprints. The shared feature count is the number of on-bits which two different fingerprints have in common. Currently the class provides the Tanimoto coefficient.
Based on this similarity algorithm the class provides different useful public methods:
Cutoff Search :
Similarities of all molecules in a query library to all molecules in a target library are calculated and the pairs whose similarities exceed a predefined cutoff are written to an output file. The performance of the method depends on size of the input query. For a single molecule query, the method has minimum performance and grows exponentially with the number of query molecules reaching a maximum performance at about 400 query molecules. As a consequence it is faster to bundle cutoff searches for multiple molecules if available.
Connected Components :
For an input library all pairwise similarities are calculated. A similarity graph is generated by inserting molecule pairs as edges. A variable similarity cutoff is applied to insert only those edges whose corresponding similarity exceeds this cutoff. Thereby, a similarity network is created and the connected components are finally calculated. If desired, the nearest neighbour of every molecule can additionally be returned.
Hierarchical Clustering :
A set of input molecules is hierarhically clustered according to the average linkage criterion. Two different algorithms are implemented to make efficient use of the inverted index algorithm and the available hardware resources. Both methods are based on Reciprocal Nearest Neighbours. The parallel variant is used to handle large input library sizes. If the similarity matrix for the remaining clusters or possibly all input molecules fit into main memory, the matrix is calculated and the Nearest Neighbour Chain algorithm is used.
Finally, the method of Kelly et al. is used to select an level for cluster selection and the created clusters are returned.
Calculate Medoid :
For a set of input molecules the Medoid is calculated. The medoid is the molecule with the highest averaged similarity to all other molecules. Additionally, the method returns the averaged pairwise similarities for every molecule in the input set.
Citations:
Clustering Algorithms: F. Murtagh, Compstat Lectures vol. 4, 1985, Physica-Verlag Würzburg-Wien. (Volume titel: Multidimensional clustering algorithms).
Similarity Update: G. Lance and W. Williams, Comput J. (1967), 9, 373-380. (doi: 10.1093/comjnl/9.4.373).
Cutoff Selection: L.A. Kelley, S.P. Gardner and M.J. Sutcliffe, Protein Eng. (1996), 9, 1063-1065. (doi: 10.1093/protein/9.11.1063).
Definition at line 66 of file binaryFingerprintMethods.h.
BALL::BinaryFingerprintMethods::BinaryFingerprintMethods | ( | ) |
Default constructor.
BALL::BinaryFingerprintMethods::BinaryFingerprintMethods | ( | const Options & | options | ) |
Constructor
options | Options to use. Defaults will be overwritten. |
BALL::BinaryFingerprintMethods::BinaryFingerprintMethods | ( | const Options & | options, |
const FingerprintFeatures & | lib_features | ||
) |
Constructor
options | Options to use. Defaults will be overwritten. |
lib_features | Target library as a lists of integer fingerprint features. |
BALL::BinaryFingerprintMethods::BinaryFingerprintMethods | ( | const Options & | options, |
const FingerprintFeatures & | lib_features, | ||
const FingerprintFeatures & | query_features | ||
) |
Constructor
options | Options to use. Defaults will be overwritten. |
lib_features | Target library as lists of integer fingerprint features. |
query_features | Query library as lists of integer fingerprint features. |
BALL::BinaryFingerprintMethods::BinaryFingerprintMethods | ( | const BinaryFingerprintMethods & | bfm | ) |
Copy constructor.
|
virtual |
Destructor.
bool BALL::BinaryFingerprintMethods::averageLinkageClustering | ( | const std::vector< unsigned int > & | selection, |
std::vector< std::pair< unsigned int, float > > & | nn_data, | ||
std::map< unsigned int, std::vector< unsigned int > > & | cluster_selection | ||
) |
Average linkage clustering of a set of molecules defined by selection.
selection | Indices of molecules in lib_features_ vector for which connected components analysis should be performed |
nn_data | Nearest neighbour information for molecules in selection if already calculated, otherwise an empty vector. |
cluster_selection | Final clustering, i.e. a cluster ID is assigned to every leaf node. |
bool BALL::BinaryFingerprintMethods::calculateSelectionMedoid | ( | const std::vector< unsigned int > & | selection, |
unsigned int & | medoid_index, | ||
std::vector< float > & | avg_sims | ||
) |
Calculation of the medoid of a set of molecules defined by selection. Medoid has the highest average similarity to all other compounds.
selection | Indices of molecules in lib_features_ vector for which the medoid should be calculated. |
medoid_index | Reference to return the index of the calculated medoid in the passed molecule indices (selection). |
avg_sims | Reference to vector which finally will store the average similarity of every molecule to all others in selection. |
bool BALL::BinaryFingerprintMethods::connectedComponents | ( | const std::vector< unsigned int > & | selection, |
std::vector< std::vector< unsigned int > > & | ccs, | ||
std::vector< std::vector< std::pair< unsigned int, float > > > & | nn_data, | ||
const float | cutoff, | ||
const bool | store_nns = false |
||
) |
Calculate similarity network of a set of molecules defined by selection and return the connected components.
selection | Indices of molecules in lib_features_ array for which connected components analysis should be performed. |
ccs | Data structure to return connected components. Eeach ccs member is a single connected component which stores the indices (positions in the vector) of the connected component members in selection. |
nn_data | If store_nns = true, nn_data will store the nearest neighbour information (index into selection and similarity) for all connected component members. |
sim_cutoff | Similarity cutoff to apply for similarity network generation, i.e. only keep similarities above cutoff. |
store_nns | If true, the nearest neighbour and its similarity are stored during connected components calculation. |
bool BALL::BinaryFingerprintMethods::cutoffSearch | ( | const float | sim_cutoff, |
const String & | outfile_name | ||
) |
Perform similarity search for molecules of a query library in a target library and keep pairs with similarity exceeding sim_cutoff.
sim_cutoff | Similarity cutoff to apply for cutoff search, i.e. only keep similarities above cutoff. |
file_name | Output file name to store result pairs with similarities as space separated CSV file. |
const Options& BALL::BinaryFingerprintMethods::getOptions | ( | ) | const |
Get const reference to options.
unsigned int BALL::BinaryFingerprintMethods::getQueryLibrarySize | ( | ) | const |
Get number of molecules in query library.
unsigned int BALL::BinaryFingerprintMethods::getTargetLibrarySize | ( | ) | const |
Get number of molecules in target library.
BinaryFingerprintMethods& BALL::BinaryFingerprintMethods::operator= | ( | const BinaryFingerprintMethods & | bfm | ) |
Assignment operator
|
static |
Wrapper for different binary fingerprint parsers which returns a vector of integer features. The returned features lie in the interval [1, max_feature_id + 1]. Thus, no feature has ID = 0 which is important for inverted index implementation.
fprint | The fingerprint as a separated list of integer features. |
features | A vector reference to return the fingerprint features. The vector will be cleaned first. |
fp_type | Fingerprint encoding: 1 = Binary Bitstring, 2 = Separated list of integer features |
delim | The delimiter character of the list. |
bool BALL::BinaryFingerprintMethods::setLibraryFeatures | ( | const FingerprintFeatures & | lib_features | ) |
Add a target library containing binary fingerprints as feature lists.
lib_features | Target library as lists of integer fingerprint features. |
bool BALL::BinaryFingerprintMethods::setQueryFeatures | ( | const FingerprintFeatures & | query_features | ) |
Add a query library containing binary fingerprints as feature lists.
query_features | Query library as lists of integer fingerprint features. |
void BALL::BinaryFingerprintMethods::setVerbosityLevel | ( | const int | verbosity | ) |
Set verbosity level.
verbosity | verbosity level to use. |