Faster Fingerprints for the CDK
Mark Rijnbeek, who moved to my team last month to work on the chemistry search engine for our new chemogenomics data, has put Rajarshi’s new fingerprint implementation to the test. Mark was bored to hell by the performance of the version he had at hand, which turned out to be my old one. It had served us well for quite a while but proved unusable for the amount of data we are testing now.
So he downloaded CDK 1.04, released just a few days ago, and gave it a shot.
“Here’s what happens: fetch 1000 molfile clobs from Oracle, put them in a list, create a list of Molecule objects from that, and lastly calculate fingerprints on that last list.
Below is the Java system output; each CDK version was tested against 1000 compounds, twice.
The numbers are milliseconds [passed] since program start.
The performance increase is very significant; the older CDK fingerprinter took about a minute (see below) for 1000 fingerprints, the new one about 7 seconds.”
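The procedure Mark describes can be sketched as a small timing harness. This is only an illustration, not Mark's actual program: the class and method names (`FingerprintBenchmark`, `StageLog`, `run`) are made up for this sketch, the database steps are left out, and molfile parsing and fingerprinting are passed in as plain functions so that any CDK version could be plugged in. The stand-in functions in `main` are placeholders and do no real chemistry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class FingerprintBenchmark {

    /** Collects log lines of the form "<ms since start> - <message>". */
    static final class StageLog {
        private final long startNanos = System.nanoTime();
        private final List<String> lines = new ArrayList<>();

        void mark(String message) {
            long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
            String line = elapsedMs + " - " + message;
            lines.add(line);
            System.out.println(line);
        }

        List<String> lines() {
            return lines;
        }
    }

    /** Applies one step to every input and returns the results as a list. */
    static <I, O> List<O> mapAll(List<I> inputs, Function<I, O> step) {
        List<O> out = new ArrayList<>(inputs.size());
        for (I in : inputs) {
            out.add(step.apply(in));
        }
        return out;
    }

    /**
     * Runs the two in-memory stages from the post: build molecule objects
     * from molfile strings, then calculate a fingerprint for each molecule.
     * Both steps are hypothetical plug-in points, not CDK calls.
     */
    static <M, F> StageLog run(List<String> molfiles,
                               Function<String, M> parse,
                               Function<M, F> fingerprint) {
        StageLog log = new StageLog();
        log.mark("Start benchmark " + molfiles.size() + " compounds.");
        List<M> molecules = mapAll(molfiles, parse);
        log.mark("Molecule objects list built");
        mapAll(molecules, fingerprint);
        log.mark("Fingerprints calculated");
        return log;
    }

    public static void main(String[] args) {
        // Placeholder data standing in for 1000 molfile strings from the database.
        List<String> molfiles = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            molfiles.add("molfile " + i);
        }
        // Stand-ins: "parsing" trims the string, "fingerprinting" hashes it.
        FingerprintBenchmark.<String, Integer>run(molfiles, String::trim, String::hashCode);
    }
}
```

Because every line is stamped with the time since program start, the cost of a single stage is the difference between its timestamp and the previous one.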
The numbers for the “old code”:
0 - Start benchmark 1000 compounds.
84 - Fingerprinter set up
531 - Connected to database
120 - Resultset opened
1706 - Molfile strings retrieved from database, stored in list
3231 - Molecule objects list built
64202 - Fingerprints calculated
And then CDK 1.04:
0 - Start benchmark 1000 compounds.
77 - Fingerprinter set up
536 - Connected to database
118 - Resultset opened
909 - Molfile strings retrieved from database, stored in list
2360 - Molecule objects list built
9900 - Fingerprints calculated
These numbers are just one representative instance from multiple runs performed by Mark. They do not quite match the numbers reported by Rajarshi, but the conditions were too different to be comparable. In our case, the speed-up of the fingerprinting step is about 8-fold, which is a nice success and even better than Rajarshi’s reported 4-fold speed-up.
We plan to report soon on benchmarking a much larger dataset.
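The 8-fold figure follows from the two logs if the fingerprinting time is taken as the difference between the "Fingerprints calculated" and "Molecule objects list built" timestamps. That reading assumes the timestamps are cumulative, as Mark's note says. A small worked check, with a class name (`SpeedUp`) invented for this sketch:

```java
import java.util.Locale;

public class SpeedUp {

    /** Duration of one stage, given cumulative timestamps in ms since program start. */
    static long stageMillis(long previousStamp, long stageStamp) {
        return stageStamp - previousStamp;
    }

    /** Ratio of old to new duration. */
    static double speedUp(long oldMillis, long newMillis) {
        return (double) oldMillis / newMillis;
    }

    public static void main(String[] args) {
        // Timestamps taken from the two logs above.
        long oldFingerprinting = stageMillis(3231, 64202); // 60971 ms, "about a minute"
        long newFingerprinting = stageMillis(2360, 9900);  // 7540 ms, "about 7 seconds"
        System.out.println(String.format(Locale.ROOT,
                "old: %d ms, new: %d ms, speed-up: %.1f-fold",
                oldFingerprinting, newFingerprinting,
                speedUp(oldFingerprinting, newFingerprinting)));
    }
}
```

Dividing the raw end timestamps instead (64202 by 9900) would give only about 6.5, because both totals include the database and parsing time that the fingerprinter does not affect.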
Thanks, Rajarshi. Great stuff!