BALL  1.4.79
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Properties Friends Macros Groups Pages
Classes | Static Public Member Functions | List of all members
BALL::BinaryFingerprintMethods Class Reference

This class provides efficient similarity calculation functionality for 2D binary fingerprints. More...

#include <BALL/STRUCTURE/binaryFingerprintMethods.h>

Classes

struct  Default
 
struct  Option
 

Public Member Functions

Constructors and Destructors
 BinaryFingerprintMethods ()
 
 BinaryFingerprintMethods (const Options &options)
 
 BinaryFingerprintMethods (const Options &options, const FingerprintFeatures &lib_features)
 
 BinaryFingerprintMethods (const Options &options, const FingerprintFeatures &lib_features, const FingerprintFeatures &query_features)
 
 BinaryFingerprintMethods (const BinaryFingerprintMethods &bfm)
 
virtual ~BinaryFingerprintMethods ()
 
Assignment
BinaryFingerprintMethodsoperator= (const BinaryFingerprintMethods &bfm)
 
Accessors
bool setLibraryFeatures (const FingerprintFeatures &lib_features)
 
bool setQueryFeatures (const FingerprintFeatures &query_features)
 
unsigned int getTargetLibrarySize () const
 
unsigned int getQueryLibrarySize () const
 
const OptionsgetOptions () const
 
void setVerbosityLevel (const int verbosity)
 
Applications
bool cutoffSearch (const float sim_cutoff, const String &outfile_name)
 
bool connectedComponents (const std::vector< unsigned int > &selection, std::vector< std::vector< unsigned int > > &ccs, std::vector< std::vector< std::pair< unsigned int, float > > > &nn_data, const float cutoff, const bool store_nns=false)
 
bool averageLinkageClustering (const std::vector< unsigned int > &selection, std::vector< std::pair< unsigned int, float > > &nn_data, std::map< unsigned int, std::vector< unsigned int > > &cluster_selection)
 
bool calculateSelectionMedoid (const std::vector< unsigned int > &selection, unsigned int &medoid_index, std::vector< float > &avg_sims)
 

Static Public Member Functions

static bool parseBinaryFingerprint (const String &fprint, std::vector< unsigned short > &features, unsigned int fp_type, const char *delim=",")
 

Detailed Description

This class provides efficient similarity calculation functionality for 2D binary fingerprints.

Binary Fingerprint Methods BinaryFingerprintMethods uses a blocked inverted index data structures as a representation of 2D binary fingerprints. This data structure enables efficient calculation of similarity coefficients which use shared feature counts between two fingerprints. The shared feature count is the number of on-bits which two different fingerprints have in common. Currently the class provides the Tanimoto coefficient.

Based on this similarity algorithm the class provides different useful public methods:
Cutoff Search :
Similarities of all molecules in a query library to all molecules in a target library are calculated and the pairs whose similarities exceed a predefined cutoff are written to an output file. The performance of the method depends on size of the input query. For a single molecule query, the method has minimum performance and grows exponentially with the number of query molecules reaching a maximum performance at about 400 query molecules. As a consequence it is faster to bundle cutoff searches for multiple molecules if available.
Connected Components :
For an input library all pairwise similarities are calculated. A similarity graph is generated by inserting molecule pairs as edges. A variable similarity cutoff is applied to insert only those edges whose corresponding similarity exceeds this cutoff. Thereby, a similarity network is created and the connected components are finally calculated. If desired, the nearest neighbour of every molecule can additionally be returned.
Hierarchical Clustering :
A set of input molecules is hierarhically clustered according to the average linkage criterion. Two different algorithms are implemented to make efficient use of the inverted index algorithm and the available hardware resources. Both methods are based on Reciprocal Nearest Neighbours. The parallel variant is used to handle large input library sizes. If the similarity matrix for the remaining clusters or possibly all input molecules fit into main memory, the matrix is calculated and the Nearest Neighbour Chain algorithm is used.
Finally, the method of Kelly et al. is used to select an level for cluster selection and the created clusters are returned.
Calculate Medoid :
For a set of input molecules the Medoid is calculated. The medoid is the molecule with the highest averaged similarity to all other molecules. Additionally, the method returns the averaged pairwise similarities for every molecule in the input set.

Citations:
Clustering Algorithms: F. Murtagh, Compstat Lectures vol. 4, 1985, Physica-Verlag Würzburg-Wien. (Volume titel: Multidimensional clustering algorithms).
Similarity Update: G. Lance and W. Williams, Comput J. (1967), 9, 373-380. (doi: 10.1093/comjnl/9.4.373).
Cutoff Selection: L.A. Kelley, S.P. Gardner and M.J. Sutcliffe, Protein Eng. (1996), 9, 1063-1065. (doi: 10.1093/protein/9.11.1063).

Definition at line 66 of file binaryFingerprintMethods.h.

Constructor & Destructor Documentation

BALL::BinaryFingerprintMethods::BinaryFingerprintMethods ( )

Default constructor.

BALL::BinaryFingerprintMethods::BinaryFingerprintMethods ( const Options options)

Constructor

Parameters
optionsOptions to use. Defaults will be overwritten.
BALL::BinaryFingerprintMethods::BinaryFingerprintMethods ( const Options options,
const FingerprintFeatures &  lib_features 
)

Constructor

Parameters
optionsOptions to use. Defaults will be overwritten.
lib_featuresTarget library as a lists of integer fingerprint features.
BALL::BinaryFingerprintMethods::BinaryFingerprintMethods ( const Options options,
const FingerprintFeatures &  lib_features,
const FingerprintFeatures &  query_features 
)

Constructor

Parameters
optionsOptions to use. Defaults will be overwritten.
lib_featuresTarget library as lists of integer fingerprint features.
query_featuresQuery library as lists of integer fingerprint features.
BALL::BinaryFingerprintMethods::BinaryFingerprintMethods ( const BinaryFingerprintMethods bfm)

Copy constructor.

virtual BALL::BinaryFingerprintMethods::~BinaryFingerprintMethods ( )
virtual

Destructor.

Member Function Documentation

bool BALL::BinaryFingerprintMethods::averageLinkageClustering ( const std::vector< unsigned int > &  selection,
std::vector< std::pair< unsigned int, float > > &  nn_data,
std::map< unsigned int, std::vector< unsigned int > > &  cluster_selection 
)

Average linkage clustering of a set of molecules defined by selection.

Parameters
selectionIndices of molecules in lib_features_ vector for which connected components analysis should be performed
nn_dataNearest neighbour information for molecules in selection if already calculated, otherwise an empty vector.
cluster_selectionFinal clustering, i.e. a cluster ID is assigned to every leaf node.
Returns
True if clustering was successful, false otherwise.
bool BALL::BinaryFingerprintMethods::calculateSelectionMedoid ( const std::vector< unsigned int > &  selection,
unsigned int &  medoid_index,
std::vector< float > &  avg_sims 
)

Calculation of the medoid of a set of molecules defined by selection. Medoid has the highest average similarity to all other compounds.

Parameters
selectionIndices of molecules in lib_features_ vector for which the medoid should be calculated.
medoid_indexReference to return the index of the calculated medoid in the passed molecule indices (selection).
avg_simsReference to vector which finally will store the average similarity of every molecule to all others in selection.
Returns
True if calculation has been successful, otherwise false.
bool BALL::BinaryFingerprintMethods::connectedComponents ( const std::vector< unsigned int > &  selection,
std::vector< std::vector< unsigned int > > &  ccs,
std::vector< std::vector< std::pair< unsigned int, float > > > &  nn_data,
const float  cutoff,
const bool  store_nns = false 
)

Calculate similarity network of a set of molecules defined by selection and return the connected components.

Parameters
selectionIndices of molecules in lib_features_ array for which connected components analysis should be performed.
ccsData structure to return connected components. Eeach ccs member is a single connected component which stores the indices (positions in the vector) of the connected component members in selection.
nn_dataIf store_nns = true, nn_data will store the nearest neighbour information (index into selection and similarity) for all connected component members.
sim_cutoffSimilarity cutoff to apply for similarity network generation, i.e. only keep similarities above cutoff.
store_nnsIf true, the nearest neighbour and its similarity are stored during connected components calculation.
bool BALL::BinaryFingerprintMethods::cutoffSearch ( const float  sim_cutoff,
const String outfile_name 
)

Perform similarity search for molecules of a query library in a target library and keep pairs with similarity exceeding sim_cutoff.

Parameters
sim_cutoffSimilarity cutoff to apply for cutoff search, i.e. only keep similarities above cutoff.
file_nameOutput file name to store result pairs with similarities as space separated CSV file.
const Options& BALL::BinaryFingerprintMethods::getOptions ( ) const

Get const reference to options.

Returns
Const reference to options of BinaryFingerprintMethods.
unsigned int BALL::BinaryFingerprintMethods::getQueryLibrarySize ( ) const

Get number of molecules in query library.

Returns
Number of query molecules.
unsigned int BALL::BinaryFingerprintMethods::getTargetLibrarySize ( ) const

Get number of molecules in target library.

Returns
Number of target molecules.
BinaryFingerprintMethods& BALL::BinaryFingerprintMethods::operator= ( const BinaryFingerprintMethods bfm)

Assignment operator

static bool BALL::BinaryFingerprintMethods::parseBinaryFingerprint ( const String fprint,
std::vector< unsigned short > &  features,
unsigned int  fp_type,
const char *  delim = "," 
)
static

Wrapper for different binary fingerprint parsers which returns a vector of integer features. The returned features lie in the interval [1, max_feature_id + 1]. Thus, no feature has ID = 0 which is important for inverted index implementation.

Parameters
fprintThe fingerprint as a separated list of integer features.
featuresA vector reference to return the fingerprint features. The vector will be cleaned first.
fp_typeFingerprint encoding: 1 = Binary Bitstring, 2 = Separated list of integer features
delimThe delimiter character of the list.
Returns
True if feature list has been parsed successful.
bool BALL::BinaryFingerprintMethods::setLibraryFeatures ( const FingerprintFeatures &  lib_features)

Add a target library containing binary fingerprints as feature lists.

Parameters
lib_featuresTarget library as lists of integer fingerprint features.
Returns
True if setup was successful, false if formats of input features are invalid.
bool BALL::BinaryFingerprintMethods::setQueryFeatures ( const FingerprintFeatures &  query_features)

Add a query library containing binary fingerprints as feature lists.

Parameters
query_featuresQuery library as lists of integer fingerprint features.
Returns
True if setup was successful, false if formats of input features are invalid.
void BALL::BinaryFingerprintMethods::setVerbosityLevel ( const int  verbosity)

Set verbosity level.

Parameters
verbosityverbosity level to use.