The basic concept of COSMOfrag is a careful molecular perception of the input molecule, which can be supplied in various molecular formats, including SMILES notation. During this perception, any missing hydrogens or bond orders are added. Bonds and rings are analyzed with respect to conjugation, E/Z substitution, and aromaticity. Hash coefficients are then calculated for each atom, taking into account all local information about the atom itself, its bonds, and its neighbour atoms. Step by step, higher-order hash codes are calculated that include information about an increasing number of neighbouring spheres of the atom. Thus, any two atoms that have an equivalent molecular environment up to the nth neighbouring sphere have identical hash codes up to the nth order. In fact, some more distant information may be included in the nth sphere, because information is propagated preferentially along conjugated bonds, on the assumption that similarity along such bonds is more important than similarity along single bonds. The highest order of similarity code presently taken into account is 7. Including the 0th order, this gives 8 coefficients per atom. Some additional, less local information is gathered in 2 auxiliary coefficients (by-coefficients) per atom, resulting in a total of 10 coefficients per atom.
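The step-by-step construction of higher-order hash codes can be illustrated with a minimal Python sketch. It is not the COSMOfrag implementation: the function names, the use of SHA-256, and the input format (element symbols plus a bond list) are assumptions made for illustration, and the sketch ignores the preferential propagation along conjugated bonds as well as the 2 auxiliary coefficients.

```python
import hashlib


def _h(*parts):
    """Deterministic 64-bit hash of the given parts (hypothetical helper)."""
    data = repr(parts).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def atom_hash_codes(elements, bonds, max_order=7):
    """Return, for each atom, a list of hash codes for orders 0..max_order.

    elements: list of element symbols, one per atom
    bonds:    list of (i, j, bond_order) tuples
    """
    neighbours = [[] for _ in elements]
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))

    # Order 0 (assumed here): the element and the atom's own bond orders.
    current = [
        _h(el, sorted(order for _, order in neighbours[i]))
        for i, el in enumerate(elements)
    ]
    codes = [[c] for c in current]

    # Each further order folds in the previous-order codes of the direct
    # neighbours, so order n depends on the environment up to sphere n.
    for _ in range(max_order):
        nxt = [
            _h(current[i], sorted((order, current[j]) for j, order in neighbours[i]))
            for i in range(len(elements))
        ]
        for i, c in enumerate(nxt):
            codes[i].append(c)
        current = nxt
    return codes


# Two chains that differ only in the atom four bonds away from atom 0.
chain = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1)]
codes_a = atom_hash_codes(["C", "C", "C", "C", "O"], chain)
codes_b = atom_hash_codes(["C", "C", "C", "C", "N"], chain)
# Atom 0 has identical codes up to order 3 in both chains.
same_up_to_3 = codes_a[0][:4] == codes_b[0][:4]
```

In this sketch, the first atoms of the two chains share their codes for orders 0 to 3 and differ from order 4 onwards, which is where the differing terminal atom enters the environment.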
All similarity coefficients are then converted into 5-character ASCII codes and combined into a 50-character atom code. For every atom of each molecule added to the CFDB, these strings are stored in the CFDB.txt file together with a link to the molecule itself. The search for the most similar atoms thus reduces to a simple search for the most similar atom codes. Because the CFDB is kept in alphanumeric order, the most similar atom can be found by recursive interval splitting, which is very efficient in a large database: the time required increases only with the logarithm of the database size. In this way, image atoms of maximum local similarity can be found efficiently in the database for all atoms of a new molecule. Finally, those CFDB molecules that provide the most similar image atoms for a large number of atoms of the molecule to be fragmented are chosen as partial images. With our present database of about 48000 drug-like compounds, most molecules are fragmented into 2 to 4 fragments.
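The encoding and the logarithmic-time search can be sketched as follows. This is an illustration under stated assumptions, not the actual CFDB format: the 36-character alphabet, the ordering of the coefficients within the atom code (lowest order first), the use of the shared prefix length as the similarity measure, and all function names are hypothetical. Python's standard `bisect` module performs the interval splitting.

```python
import bisect
import string

# Assumed alphabet; its characters are in ascending ASCII order, so the
# encoded strings sort in the same order as the underlying integers.
ALPHABET = string.digits + string.ascii_uppercase
WIDTH = 5  # characters per coefficient


def encode_coefficient(value):
    """Encode a non-negative integer as a fixed-width 5-character code."""
    value %= len(ALPHABET) ** WIDTH
    chars = []
    for _ in range(WIDTH):
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def atom_code(coefficients):
    """Concatenate 10 coefficient codes into one 50-character atom code."""
    if len(coefficients) != 10:
        raise ValueError("expected 10 coefficients per atom")
    return "".join(encode_coefficient(c) for c in coefficients)


def common_prefix_length(a, b):
    """Number of leading characters that two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def most_similar(sorted_codes, query):
    """Index of the code sharing the longest prefix with the query.

    In a sorted list, the longest shared prefix is always found at one of
    the two entries next to the query's insertion point, so a single
    binary search is enough.
    """
    if not sorted_codes:
        raise ValueError("empty database")
    pos = bisect.bisect_left(sorted_codes, query)
    candidates = [k for k in (pos - 1, pos) if 0 <= k < len(sorted_codes)]
    return max(candidates, key=lambda k: common_prefix_length(sorted_codes[k], query))
```

For example, with the sorted list `["AAAAA", "AABBB", "ABCDE", "ZZZZZ"]` and the query `"AABBC"`, `most_similar` returns index 1, since `"AABBB"` shares four leading characters with the query.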
Based on the 7th-order similarity coefficients of all atoms, a unique coefficient is generated for each molecule. This is converted into a 7-letter ASCII code that is unique for each molecule apart from cis/trans isomerism and stereochemical differences. Cis/trans isomerism is encoded in an 8th letter and stereochemistry in a 9th letter, giving a unique 9-letter name for each molecule. This name is used to identify molecules in the CFDB and as the name of the compressed COSMO files.
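The layout of the 9-letter name can be sketched in Python. How COSMOfrag actually combines the atom coefficients and how it derives the 8th and 9th letters is not stated here, so the hashing scheme, the 26-letter alphabet, and the integer descriptors `cis_trans_index` and `stereo_index` below are purely hypothetical placeholders.

```python
import hashlib
import string

LETTERS = string.ascii_uppercase  # assumed alphabet for the name


def _letters(value, width):
    """Encode a non-negative integer as `width` upper-case letters."""
    value %= len(LETTERS) ** width
    out = []
    for _ in range(width):
        value, digit = divmod(value, len(LETTERS))
        out.append(LETTERS[digit])
    return "".join(reversed(out))


def molecule_name(seventh_order_codes, cis_trans_index=0, stereo_index=0):
    """Build a 9-letter name: 7 letters for the molecule, 1 each for
    cis/trans isomerism and stereochemistry (hypothetical scheme)."""
    # Sorting makes the result independent of the atom numbering.
    data = repr(sorted(seventh_order_codes)).encode("utf-8")
    coefficient = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return (
        _letters(coefficient, 7)
        + _letters(cis_trans_index, 1)
        + _letters(stereo_index, 1)
    )
```

In this sketch, two isomers that differ only in cis/trans configuration share the first 7 letters and differ in the 8th, which mirrors the naming scheme described above.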
COSMOfrag efficiently fragments a new molecule in terms of molecules from the COSMOfrag database (CFDB), using the hash codes of the atoms in the molecule. The result of the fragmentation is written to a COSMOtherm meta-file (.mcos), which can then be used as a substitute for the full COSMO file in the COSMOtherm input. Thus, almost any calculation that COSMOtherm can perform with COSMO files can be performed in the same way with COSMOfrag meta-files.
The COSMOfrag database (CFDB) currently comprises about 48000 highly diverse molecules obtained by screening various databases and libraries of drug-like compounds, such as Physprop*, NCI**, and NCBI***. If your compounds are not sufficiently well represented by the CFDB, you can easily run DFT/COSMO calculations for a small number of representative compounds, add them to the database, and successively improve the fragmentations.
* Physical Properties Database by Syracuse Research Corporation
** National Cancer Institute Database
*** National Center for Biotechnology Information