BALL::ML::MLData Class Reference

#include <BALL/MATHS/ML/MLData.h>

Public Member Functions

 MLData ()
 ~MLData ()
Predicates

bool isDataCentered () const
bool isResponseCentered () const

Protected Attributes

Attributes

VMatrix descriptor_matrix_
VMatrix Y_
VMatrix descriptor_transformations_
VMatrix y_transformations_
vector< string > column_names_
vector< string > substance_names_
std::multiset< int > invalidDescriptors_
std::multiset< int > invalidSubstances_
String data_folder_
std::map< String, int > class_names_

Accessors



vector< String > * readPropertyNames (String sd_file)
void readSDFile (const char *file)
void readSDFile (const char *file, std::multiset< int > &act, bool useExDesc=1, bool append=0, bool translate_class_labels=0)
void calculateBALLDescriptors (Molecule &m)
void displayMatrix ()
void centerData (bool center_Y=0)
void scaleAllDescriptors ()
unsigned int getNoSubstances () const
unsigned int getNoDescriptors () const
void readCSVFile (const char *file, int no_y, bool xlabels, bool ylabels, const char *sep=",", bool appendDescriptors=0, bool translate_class_labels=0)
void manipulateY (vector< String > v)
void manipulateY (String v)
void discretizeY (vector< double > thresholds)
void transformX (vector< String > v)
vector< QSARData * > partitionInputData (int p)
void saveToFile (string filename) const
void readFromFile (string filename)
vector< QSARData * > generateExternalSet (double fraction) const
vector< QSARData * > evenSplit (int no_test_splits, int current_test_split_id, int response_id=0) const
vector< double > * getSubstance (int s) const
vector< double > * getActivity (int s) const
unsigned int getNoResponseVariables () const
const vector< string > * getSubstanceNames () const
bool checkforDiscreteY () const
bool checkforDiscreteY (const char *file, std::multiset< int > &activity_IDs) const
void setDataFolder (const char *folder)
void removeHighlyCorrelatedCompounds (double &compound_cor_threshold, double &feature_cor_threshold)
void getSimilarDescriptors (int descriptor_ID, double correlation, std::list< std::pair< uint, String > > &similar_descriptor_IDs) const
void setDescriptorNames (const Molecule &m, std::multiset< int > &activity_IDs, bool useExDesc=1)
void removeInvalidDescriptors (std::multiset< int > &invalidDescriptors)
void removeInvalidSubstances (std::multiset< int > &inv)
void readMatrix (VMatrix &mat, std::ifstream &in, char seperator, unsigned int lines, unsigned int col)
void checkActivityIDs (std::multiset< int > &act, int no_properties)
void insertSubstance (const QSARData *source, int s, bool backtransformation=0)
void printMatrix (const VMatrix &mat, std::ostream &out) const

Detailed Description

QSAR

Definition at line 67 of file MLData.h.


Constructor & Destructor Documentation

BALL::ML::MLData::MLData (  ) 
BALL::ML::MLData::~MLData (  ) 

Member Function Documentation

void BALL::ML::MLData::calculateBALLDescriptors ( Molecule &  m  ) 

Calculates descriptors for one molecule and saves them as one new row of descriptor_matrix_

void BALL::ML::MLData::centerData ( bool  center_Y = 0  ) 

centers each descriptor to mean of 0 and stddev of 1

Parameters:
center_Y if ==1, activity values are also centered. Obviously this should NOT be used for classification experiments!
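
The centering described above can be sketched independently of BALL as follows. This is a minimal illustration, not the library's implementation: the Matrix alias and the function standardizeColumns are hypothetical, and the use of the sample standard deviation (n - 1) is an assumption, since the documentation does not say which estimator is used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the library's matrix type:
// rows are substances, columns are descriptors.
using Matrix = std::vector<std::vector<double>>;

// Centers every column to mean 0 and scales it to standard deviation 1.
// Returns a 2 x m matrix with the mean (row 0) and stddev (row 1) of each
// column, so that the transformation can be undone later.
Matrix standardizeColumns(Matrix& x)
{
    assert(!x.empty());
    const std::size_t n = x.size();
    const std::size_t m = x[0].size();
    Matrix stats(2, std::vector<double>(m, 0.0));
    for (std::size_t j = 0; j < m; ++j)
    {
        double mean = 0.0;
        for (std::size_t i = 0; i < n; ++i) mean += x[i][j];
        mean /= n;

        double ss = 0.0;
        for (std::size_t i = 0; i < n; ++i) ss += (x[i][j] - mean) * (x[i][j] - mean);
        // Assumption: sample standard deviation with n - 1 in the denominator.
        const double sd = n > 1 ? std::sqrt(ss / (n - 1)) : 0.0;

        stats[0][j] = mean;
        stats[1][j] = sd;
        // Constant columns (sd == 0) are set to 0 instead of dividing by zero.
        for (std::size_t i = 0; i < n; ++i)
            x[i][j] = sd > 0.0 ? (x[i][j] - mean) / sd : 0.0;
    }
    return stats;
}
```

Applying the same idea to the response columns corresponds to center_Y=1, which is why it makes no sense for class labels.
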
void BALL::ML::MLData::checkActivityIDs ( std::multiset< int > &  act,
int  no_properties 
) [protected]

checks whether the given list of activity IDs contains any values <0 or values that are larger than the number of properties in the current input file.
If such values are found, an Exception of type InvalidActivityID is thrown.

bool BALL::ML::MLData::checkforDiscreteY ( const char *  file,
std::multiset< int > &  activity_IDs 
) const

checks whether the response variables of a specified file contain only discrete values.

bool BALL::ML::MLData::checkforDiscreteY (  )  const

checks whether the response variables contain only discrete values. This can be used to check whether the current input data set is suitable for a ClassificationModel
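
A check of this kind might look like the sketch below. It assumes that "discrete" means every response value is a whole number, which the documentation does not state explicitly; the function isDiscrete and its tolerance parameter are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns true if every response value is (within tolerance) a whole number.
// Assumption: this is what "discrete" means for class labels.
bool isDiscrete(const std::vector<std::vector<double>>& y, double tolerance = 0.0)
{
    assert(tolerance >= 0.0);
    for (const std::vector<double>& row : y)
    {
        for (double value : row)
        {
            if (!std::isfinite(value)) return false;
            if (std::fabs(value - std::round(value)) > tolerance) return false;
        }
    }
    return true;
}
```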

void BALL::ML::MLData::discretizeY ( vector< double >  thresholds  ) 

Discretize the response values. If the response variable(s) of this data object have been normalized, the given thresholds will be automatically normalized accordingly.

Parameters:
thresholds d thresholds for d+1 classes that are to be created
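
How d thresholds yield d+1 classes can be illustrated with a short sketch. The functions discretize and normalizeThreshold are hypothetical. Two points are assumptions: a value equal to a threshold is assigned to the upper class, and normalization is (t - mean) / stddev, matching the centering of the response.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Maps a continuous response value to a class index in [0, d], given d
// thresholds in ascending order.
// Assumption: a value equal to a threshold belongs to the upper class.
int discretize(double y, const std::vector<double>& thresholds)
{
    assert(std::is_sorted(thresholds.begin(), thresholds.end()));
    // upper_bound finds the first threshold greater than y, so the distance
    // from begin() is the number of thresholds that y has reached.
    return static_cast<int>(
        std::upper_bound(thresholds.begin(), thresholds.end(), y) - thresholds.begin());
}

// Converts a threshold given on the original scale to the normalized scale.
double normalizeThreshold(double threshold, double mean, double stddev)
{
    assert(stddev > 0.0);
    return (threshold - mean) / stddev;
}
```

With thresholds {1.0, 5.0}, the values 0.5, 3.0 and 7.0 fall into classes 0, 1 and 2.
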
void BALL::ML::MLData::displayMatrix (  ) 

show descriptor_matrix on stdout

vector<QSARData*> BALL::ML::MLData::evenSplit ( int  no_test_splits,
int  current_test_split_id,
int  response_id = 0 
) const

Split this data set into a training set and a test set. In contrast to generateExternalSet(), compounds for the test set are *not* randomly selected. Instead, this data set is first sorted according to response values (in order to ensure equal response value ranges) and then split regularly into training and test set.

Parameters:
no_test_splits the total number of splits you want to create by successive calls of this function
current_test_split_id the split to be produced, with 0<=current_test_split_id<no_test_splits
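
One way to read "sorted according to response values and then split regularly" is sketched below. This is an assumption about the procedure, not the library's code: after sorting, every no_test_splits-th compound, starting at offset current_test_split_id, goes to the test set. The function evenSplitIndices is hypothetical and works on indices only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Returns {training indices, test indices} for one split.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
evenSplitIndices(const std::vector<double>& response,
                 int no_test_splits, int current_test_split_id)
{
    assert(no_test_splits > 0);
    assert(0 <= current_test_split_id && current_test_split_id < no_test_splits);
    const std::size_t k = static_cast<std::size_t>(no_test_splits);
    const std::size_t id = static_cast<std::size_t>(current_test_split_id);

    // Sort compound indices by ascending response value.
    std::vector<std::size_t> order(response.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&response](std::size_t a, std::size_t b)
                     { return response[a] < response[b]; });

    // Every k-th compound in sorted order goes to the test set.
    std::pair<std::vector<std::size_t>, std::vector<std::size_t>> result;
    for (std::size_t rank = 0; rank < order.size(); ++rank)
    {
        if (rank % k == id) result.second.push_back(order[rank]);
        else                result.first.push_back(order[rank]);
    }
    return result;
}
```

Calling the function once for each id in 0..no_test_splits-1 produces test sets that are disjoint and together cover all compounds.
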
vector<QSARData*> BALL::ML::MLData::generateExternalSet ( double  fraction  )  const

generates a training and an external validation set from the current QSARData object

Parameters:
fraction the fraction of the current compounds that should be used as external validation set (by random drawing)
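
Random drawing of an external validation set can be sketched as follows. The function randomExternalSplit, its seed parameter and the use of std::mt19937 are assumptions made for illustration; the documentation does not say which random generator the library uses or how the set size is rounded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Returns {training indices, external validation indices} for n compounds.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
randomExternalSplit(std::size_t n, double fraction, unsigned seed)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    // Shuffle all indices, then cut off the first part as external set.
    std::mt19937 rng(seed);
    std::shuffle(idx.begin(), idx.end(), rng);

    // Assumption: the size of the external set is rounded down.
    const std::ptrdiff_t n_ext = static_cast<std::ptrdiff_t>(fraction * n);
    std::vector<std::size_t> external(idx.begin(), idx.begin() + n_ext);
    std::vector<std::size_t> training(idx.begin() + n_ext, idx.end());
    return {training, external};
}
```
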
vector<double>* BALL::ML::MLData::getActivity ( int  s  )  const

returns a pointer to a new vector containing the UNcentered response values for the s'th substance of the current data set

unsigned int BALL::ML::MLData::getNoDescriptors (  )  const

returns the number of descriptors

unsigned int BALL::ML::MLData::getNoResponseVariables (  )  const

returns the number of response variables

unsigned int BALL::ML::MLData::getNoSubstances (  )  const

returns the number of substances

void BALL::ML::MLData::getSimilarDescriptors ( int  descriptor_ID,
double  correlation,
std::list< std::pair< uint, String > > &  similar_descriptor_IDs 
) const

Find all descriptors of the current data set that have a correlation of at least 'correlation' to the specified feature

Parameters:
descriptor_ID the ID of the descriptor for which similar features should be searched
correlation the desired minimal correlation
similar_descriptor_IDs list to which the IDs of the found descriptors will be saved as pairs of descriptor ID and descriptor name
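
The search can be illustrated with a Pearson correlation between descriptor columns. This is a sketch under assumptions: the documentation does not name the correlation measure or say whether its absolute value is compared, and the functions pearson and similarColumns are hypothetical. Only the IDs are collected here, not the descriptor names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient of two equally long value vectors.
double pearson(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size() && !a.empty());
    const double n = static_cast<double>(a.size());
    double mean_a = 0.0, mean_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { mean_a += a[i]; mean_b += b[i]; }
    mean_a /= n;
    mean_b /= n;

    double cov = 0.0, var_a = 0.0, var_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        cov   += (a[i] - mean_a) * (b[i] - mean_b);
        var_a += (a[i] - mean_a) * (a[i] - mean_a);
        var_b += (b[i] - mean_b) * (b[i] - mean_b);
    }
    const double denom = std::sqrt(var_a * var_b);
    return denom > 0.0 ? cov / denom : 0.0;  // constant columns count as uncorrelated
}

// columns[j] holds the values of descriptor j for all substances.
// Assumption: the absolute correlation is compared with the minimum.
std::vector<std::size_t> similarColumns(const std::vector<std::vector<double>>& columns,
                                        std::size_t descriptor_id, double min_correlation)
{
    assert(descriptor_id < columns.size());
    std::vector<std::size_t> hits;
    for (std::size_t j = 0; j < columns.size(); ++j)
    {
        if (j == descriptor_id) continue;
        if (std::fabs(pearson(columns[descriptor_id], columns[j])) >= min_correlation)
            hits.push_back(j);
    }
    return hits;
}
```
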
vector<double>* BALL::ML::MLData::getSubstance ( int  s  )  const

returns a pointer to a new vector containing the UNcentered descriptor values for the s'th substance of the current data set

const vector<string>* BALL::ML::MLData::getSubstanceNames (  )  const
void BALL::ML::MLData::insertSubstance ( const QSARData *  source,
int  s,
bool  backtransformation = 0 
) [protected]

appends compound no <s> taken from the given source to the data of this object.

Parameters:
backtransformation if set to true, all features of the compound are back-transformed after adding them to this object.
bool BALL::ML::MLData::isDataCentered (  )  const

tells whether the features have been centered

bool BALL::ML::MLData::isResponseCentered (  )  const

tells whether the response variables have been centered

void BALL::ML::MLData::manipulateY ( String  v  ) 

for testing purposes only: change Y-matrix according to the given equation

Parameters:
v string containing the equation, e.g. "x1+x3*5+x10^2"
void BALL::ML::MLData::manipulateY ( vector< String >  v  ) 

for testing purposes only: change Y-matrix according to the given equations

vector<QSARData*> BALL::ML::MLData::partitionInputData ( int  p  ) 

partitions the input data into p QSARData objects of (approx.) equal size.
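
A partition into p groups of approximately equal size can be sketched with a round-robin assignment. The documentation does not say how compounds are assigned to partitions, so round-robin is an assumption, and partitionIndices is a hypothetical helper that works on compound indices only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Distributes the indices 0..n-1 over p groups whose sizes differ by at most one.
std::vector<std::vector<std::size_t>> partitionIndices(std::size_t n, std::size_t p)
{
    assert(p > 0);
    std::vector<std::vector<std::size_t>> parts(p);
    for (std::size_t i = 0; i < n; ++i)
        parts[i % p].push_back(i);  // assumption: round-robin assignment
    return parts;
}
```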

void BALL::ML::MLData::printMatrix ( const VMatrix &  mat,
std::ostream &  out 
) const [protected]

prints a vector-based matrix to the given output stream

void BALL::ML::MLData::readCSVFile ( const char *  file,
int  no_y,
bool  xlabels,
bool  ylabels,
const char *  sep = ",",
bool  appendDescriptors = 0,
bool  translate_class_labels = 0 
)

Read input from a csv file.
This file should contain all descriptor values in the first columns and the activity values in the last no_y columns.

Parameters:
no_y the number of activities, i.e. the number of columns containing activity values
xlabels if ==1, names of descriptors are read from the first line of the table
ylabels if ==1, names of substances are read from the first column of the table
sep the character used to separate the cells of the table
appendDescriptors if set to 1, descriptors will be read from the file and appended as new columns to the current descriptor_matrix
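
The expected table layout (descriptors first, the last no_y columns holding activities, optional header row and name column) can be illustrated with a small parser. This is a sketch, not the library's reader: the Table struct and the functions splitLine and parseCsv are hypothetical, the separator is simplified to a single char, and it is assumed that the header row also has cells for the name column and the activity columns.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Table
{
    std::vector<std::string> column_names;     // descriptor names (if xlabels)
    std::vector<std::string> substance_names;  // row names (if ylabels)
    std::vector<std::vector<double>> X;        // descriptor values
    std::vector<std::vector<double>> Y;        // activity values
};

std::vector<std::string> splitLine(const std::string& line, char sep)
{
    std::vector<std::string> cells;
    std::stringstream stream(line);
    std::string cell;
    while (std::getline(stream, cell, sep)) cells.push_back(cell);
    return cells;
}

Table parseCsv(std::istream& in, std::size_t no_y, bool xlabels, bool ylabels, char sep = ',')
{
    Table t;
    std::string line;
    bool header_pending = xlabels;
    const std::size_t offset = ylabels ? 1 : 0;  // skip the name column if present
    while (std::getline(in, line))
    {
        if (line.empty()) continue;
        const std::vector<std::string> cells = splitLine(line, sep);
        assert(cells.size() >= offset + no_y);
        const std::size_t first_y = cells.size() - no_y;
        if (header_pending)
        {
            // Keep only the descriptor names from the header row.
            t.column_names.assign(cells.begin() + static_cast<std::ptrdiff_t>(offset),
                                  cells.begin() + static_cast<std::ptrdiff_t>(first_y));
            header_pending = false;
            continue;
        }
        if (ylabels) t.substance_names.push_back(cells[0]);
        std::vector<double> x, y;
        for (std::size_t j = offset; j < cells.size(); ++j)
            (j < first_y ? x : y).push_back(std::stod(cells[j]));
        t.X.push_back(x);
        t.Y.push_back(y);
    }
    return t;
}
```
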
void BALL::ML::MLData::readFromFile ( string  filename  ) 

reconstructs a QSARData object from a text file

void BALL::ML::MLData::readMatrix ( VMatrix &  mat,
std::ifstream &  in,
char  seperator,
unsigned int  lines,
unsigned int  col 
) [protected]

reconstructs a vector-based matrix from a file

vector<String>* BALL::ML::MLData::readPropertyNames ( String  sd_file  ) 

reads the names of the properties from the first molecule in the given sd-file

void BALL::ML::MLData::readSDFile ( const char *  file,
std::multiset< int > &  act,
bool  useExDesc = 1,
bool  append = 0,
bool  translate_class_labels = 0 
)

Fetches input from one sd-file containing all structures. The activity value for each molecule is taken from its property in the sd-file.

Parameters:
act contains the numbers of the properties that are activity-values
file the sd-file containing the input
useExDesc if set to 1, descriptors read from the sd-file will be used in addition to those calculated by BALL internally
append if set to 1, the substances read from the sd-file will be appended as new lines to the current descriptor_matrix
void BALL::ML::MLData::readSDFile ( const char *  file  ) 

Fetches input from one sd-file containing all structures and from one file containing the activities of all structures sorted in ascending order.
The latter file is assumed to have the same name as the first one, with only the extension changed to ".txt"

Parameters:
file the sd-file containing the input
void BALL::ML::MLData::removeHighlyCorrelatedCompounds ( double &  compound_cor_threshold,
double &  feature_cor_threshold 
)

removes compounds whose absolute correlation coefficient to another compound is larger than compound_cor_threshold

Parameters:
feature_cor_threshold Only features that do not have a correlation larger than this value to another feature are used to calculate the similarity of compounds (=instances).
void BALL::ML::MLData::removeInvalidDescriptors ( std::multiset< int > &  invalidDescriptors  )  [protected]

removes the columns of invalid descriptors from descriptor_matrix_

Parameters:
invalidDescriptors list containing the IDs of the columns to be deleted
void BALL::ML::MLData::removeInvalidSubstances ( std::multiset< int > &  inv  )  [protected]
void BALL::ML::MLData::saveToFile ( string  filename  )  const

saves the current QSARData object to a text file

void BALL::ML::MLData::scaleAllDescriptors (  ) 

scales each descriptor to stddev of 1

void BALL::ML::MLData::setDataFolder ( const char *  folder  ) 

allows setting the data folder necessary for the computation of descriptors without using the BALL_DATA_PATH environment variable, which is useful for standalone applications

void BALL::ML::MLData::setDescriptorNames ( const Molecule &  m,
std::multiset< int > &  activity_IDs,
bool  useExDesc = 1 
) [protected]

writes the names of all external descriptors into column_names

void BALL::ML::MLData::transformX ( vector< String >  v  ) 

Member Data Documentation

std::map<String,int> BALL::ML::MLData::class_names_ [protected]

in case of classification data sets with non-numeric class labels, this member maps the names of the individual classes to their assigned id.

Definition at line 253 of file MLData.h.
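
Translating non-numeric class labels into ids with such a map can be sketched as follows. The helper classId is hypothetical, and assigning consecutive ids starting at 0 in order of first appearance is an assumption; the documentation only states that names are mapped to assigned ids.

```cpp
#include <cassert>
#include <map>
#include <string>

// Returns the id of a class label, assigning the next free id on first use.
// Assumption: ids are consecutive and start at 0.
int classId(std::map<std::string, int>& class_names, const std::string& label)
{
    assert(!label.empty());
    const auto it = class_names.find(label);
    if (it != class_names.end()) return it->second;
    const int id = static_cast<int>(class_names.size());
    class_names[label] = id;
    return id;
}
```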

vector<string> BALL::ML::MLData::column_names_ [protected]

names of all descriptors

Definition at line 240 of file MLData.h.

String BALL::ML::MLData::data_folder_ [protected]

Definition at line 250 of file MLData.h.

VMatrix BALL::ML::MLData::descriptor_matrix_ [protected]

matrix containing the values of each descriptor for each substance

Definition at line 228 of file MLData.h.

VMatrix BALL::ML::MLData::descriptor_transformations_ [protected]

2xm dimensional matrix (m=no of descriptors) containing mean and stddev of each transformed descriptor

Definition at line 234 of file MLData.h.
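
Storing the mean and stddev per descriptor makes it possible to recover uncentered values, as returned for example by getSubstance(). The sketch below shows the arithmetic only. The function backTransform is hypothetical, and the row order (row 0 = mean, row 1 = stddev) is an assumption based on the order in which the description names them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Undoes centering and scaling for one substance: x = z * stddev + mean.
// transformations is the 2 x m matrix of means (row 0) and stddevs (row 1).
std::vector<double> backTransform(const std::vector<double>& z,
                                  const std::vector<std::vector<double>>& transformations)
{
    assert(transformations.size() == 2);
    assert(transformations[0].size() == z.size());
    assert(transformations[1].size() == z.size());
    std::vector<double> x(z.size());
    for (std::size_t j = 0; j < z.size(); ++j)
        x[j] = z[j] * transformations[1][j] + transformations[0][j];
    return x;
}
```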

std::multiset<int> BALL::ML::MLData::invalidDescriptors_ [protected]

contains the numbers of external descriptors for which invalid values (e.g. strings instead of numerical values) were encountered in some molecules

Definition at line 246 of file MLData.h.

std::multiset<int> BALL::ML::MLData::invalidSubstances_ [protected]

Definition at line 248 of file MLData.h.

vector<string> BALL::ML::MLData::substance_names_ [protected]

names of all substances

Definition at line 243 of file MLData.h.

VMatrix BALL::ML::MLData::Y_ [protected]

matrix containing the experimentally determined results (active/non-active) for each substance. Different activities are saved column-wise.

Definition at line 231 of file MLData.h.

VMatrix BALL::ML::MLData::y_transformations_ [protected]

2xc dimensional matrix (c=no of activities) containing mean and stddev of each transformed activity

Definition at line 237 of file MLData.h.

Generated by  doxygen 1.6.3