* fixed issue #2965
* added test case for issue #2965
* fixed formatting and added comment.
* update
* General Reader files
* removed dependency on boost filesystems
* removed class
* clang-format
* added-comments
* further-cleanup
* added clang-formatting
* braces-for-if-else
* changed error messages, added option for windows file path
* fixed getFileName function
* cleanup
* option for filename without path
* further-cleanup
* added tests for determineFileFormat
* cleanup, const arguments for validate function
* init
* cleanup
* cleanup
* clang-format does not work for CMake
* added RDK_TEST_MULTITHREADED option
* add-flag
* cleanup
* Delete ConcurrentQueue.h
  This PR deals with the Generalized File Reader.
* Delete testConcurrentQueue.cpp
  This PR deals with the Generalized File Reader.
* no change
* concurrent queue
* print values
* Single Producer Multiple Consumer works
* cleanup
* Producer Consumer Example
* update queue methods and tests
* cleanup
* test
* fixed tests
* cleanup, updated tests
* Delete ProducerConsumer.h
* Delete testProducerConsumer.cpp
* cleanup
* futher cleanup
* changes based on feedback
* make queue non copyable
* psuedocode
* possible implementation
* untested implementation
* change class to typename
* basic-setup
* need to fix segfault
* need to fix blocking
* need to fix blocking
* need to fix blocking
* fix indentation
* one possibility
* without lambda function
* possible fix with some test cases
* performance tests
* added support for record id and item text
* cleanup
* cleanup
* fixed memory leak and added methods with tests for getting last id and item text
* cleanup
* added more test cases with different smi files
* cleanup
* SD mol supplier
* modified the parsing for SDMolSupplier
* cleanup
* cleanup
* new file for testing
* added support for reading molecule properties with tests
* thread-safe logging and exception handling
* cleanup
* without thread safe logging
* cleanup
* cleanup, modified MultithreadedSmilesMolSupplier
* cleanup, made reader and writer functions private
* move O2.sdf
* basic python wrapper with tests
* cleanup, added new methods for python wrappers
* made changes suggested by Andrew
* file and compression formats are case-insensitive
* cannot open files with gzstream
* cleanup
* possible fix for opening compressed streams (SMILES)
* removed seekg() and tellg() methods from multithreadeded suppliers
* cleanup
* test cases for python wrappers
* some wrapper cleanup
* cleanup, removed unused functions
* update the MT tests so that they actually do some work also includes some cleanup here
* cleanup
* remove iterator_next header include
* added support for multithreaded readers
* use getNumThreadsToUse for multithreaded suppliers
* fixed documentation for multithreaded python wrappers
* commented performance test
* first draft of final evaluation report
* removed inline variables
* first draft getting started in python
* fixed typos in getting started in python
* fixed typos
* fix documentation tests
* fixed documentation tests
* added links to important files and PR
* added perfomance results
* first version of wrappers with compressed streams
* getting rid of streambuf stream method
* modified General File Reader
* make this work when building in non-threads mode
* rename a test
* rename a function in the python API
* rearrange the python test a bit
* disable the stream-based constructors in Python
* mark the multithreaded classes as experimental

Co-authored-by: greg landrum <greg.landrum@gmail.com>
Do Not Review
[GSoC 2020] General File Reader and Multithreaded Mol Supplier
Overview
The General File Reader, as the name suggests, provides the user with the appropriate MolSupplier object to parse a file of a given format. Previously, to parse a file of SMILES, say input.smi, one had to construct a SmilesMolSupplier object explicitly. With the GeneralFileReader, one passes the file name along with supplier options and obtains the appropriate MolSupplier object, determined by the file format. The General File Reader also provides an interface to the MultithreadedMolSupplier objects for the Smiles and SDF file formats. Test cases are included to demonstrate the utility of the General File Reader.
The Multithreaded Mol Supplier provides a concurrent implementation of the usual base class MolSupplier. Due to time constraints, multithreaded versions were implemented only for the Smiles and SD Mol Suppliers. The motivation for this part was parsing large Smiles or SDF files: with the current implementation the user can, for instance, construct a MultithreadedSmilesMolSupplier object to parse a Smiles file with a large number of records. Test cases are included to demonstrate the correctness and performance of the MultithreadedMolSupplier. Here is a brief summary of the performance results obtained by running the function testPerformance on @greglandrum's machine:
Duration for SmilesMolSupplier: 6256 (milliseconds)
Maximum Duration for MultithreadedSmilesMolSupplier: 6972 (milliseconds) with 1 writer thread
Minimum Duration for MultithreadedSmilesMolSupplier: 855 (milliseconds) with 15 writer threads
Duration for SDMolSupplier: 2584 (milliseconds)
Maximum Duration for MultithreadedSDMolSupplier: 2784 (milliseconds) with 1 writer thread
Minimum Duration for MultithreadedSDMolSupplier: 729 (milliseconds) with 7 writer threads
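To put the timings above in relative terms, the short sketch below computes the best observed speedup and the overhead of the single-writer configuration. The numbers are copied from the results above; the dictionary layout and the helper name speedup are illustrative only and are not part of the PR.

```python
# Timings in milliseconds, copied from the performance summary above.
timings_ms = {
    "Smiles": {"serial": 6256, "mt_slowest": 6972, "mt_fastest": 855},
    "SD": {"serial": 2584, "mt_slowest": 2784, "mt_fastest": 729},
}


def speedup(serial_ms, parallel_ms):
    """Ratio of serial to multithreaded wall time (hypothetical helper)."""
    return serial_ms / parallel_ms


for name, t in timings_ms.items():
    best = speedup(t["serial"], t["mt_fastest"])
    # The slowest multithreaded run was the one with a single writer thread.
    overhead = t["mt_slowest"] / t["serial"] - 1.0
    print(f"{name}: best speedup {best:.2f}x, single-writer overhead {overhead:.1%}")
```

On these figures the best speedup is roughly 7.3x for Smiles and 3.5x for SD files, while a single writer thread is about 11% and 8% slower than the serial suppliers, respectively.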
Implementation
Implementation of the General File Reader is quite concise and makes use of only two methods, determineFormat and getSupplier. The former determines the file format and the compression format from a given pathname, while the latter returns a pointer to a MolSupplier object given a pathname and SupplierOptions.
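The two-step idea can be illustrated with a small, pure-Python sketch. It is not the C++ code from this PR: the function names determine_format and supplier_name, the format tables, and the rule for choosing a multithreaded supplier are all assumptions made for the example, restricted to the Smiles and SDF formats named above.

```python
# Hypothetical tables for illustration; the real reader's supported formats may differ.
SUPPLIERS = {
    "smi": ("SmilesMolSupplier", "MultithreadedSmilesMolSupplier"),
    "sdf": ("SDMolSupplier", "MultithreadedSDMolSupplier"),
}
COMPRESSION_FORMATS = {"gz"}


def determine_format(path):
    """Infer (file_format, compression_format) from the pathname alone."""
    # Accept both POSIX and Windows separators when isolating the file name.
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    # Lower-casing makes file and compression formats case-insensitive.
    parts = name.lower().split(".")
    compression = ""
    if len(parts) > 1 and parts[-1] in COMPRESSION_FORMATS:
        compression = parts.pop()
    if len(parts) < 2 or parts[-1] not in SUPPLIERS:
        raise ValueError(f"unrecognized file format: {path!r}")
    return parts[-1], compression


def supplier_name(path, num_writer_threads=0):
    """Pick a supplier class name for the file (assumed selection rule)."""
    file_format, _ = determine_format(path)
    serial, multithreaded = SUPPLIERS[file_format]
    return multithreaded if num_writer_threads > 0 else serial
```

For example, determine_format("input.smi") gives ("smi", ""), and a Windows-style path such as C:\data\Mols.SDF.GZ resolves to ("sdf", "gz"). Because only the pathname is inspected, a file with a missing or misleading extension is rejected rather than sniffed.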
Regarding the implementation of the MultithreadedMolSupplier, the first step was to implement a thread-safe blocking queue of fixed capacity. This would allow us to extract and process records from the file concurrently. The concurrent queue was implemented with a single lock and two condition variables to signal whether the queue was empty or full. Test cases checking the correctness of the ConcurrentQueue are also included in the project.
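A minimal Python sketch of such a queue is shown below, assuming a fixed capacity, one lock shared by two condition variables, and a "done" flag that lets consumers stop once the producer has finished. The method names push, pop and set_done are chosen for the example and may not match the C++ ConcurrentQueue exactly.

```python
import threading
from collections import deque


class ConcurrentQueue:
    """Bounded blocking queue: one lock, two condition variables."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self._items = deque()
        self._capacity = capacity
        self._done = False
        self._lock = threading.Lock()
        # Both conditions share the same lock, so only one thread touches the deque at a time.
        self._not_full = threading.Condition(self._lock)
        self._not_empty = threading.Condition(self._lock)

    def push(self, item):
        """Block while the queue is full, then append the item."""
        with self._not_full:
            while len(self._items) >= self._capacity:
                self._not_full.wait()
            self._items.append(item)
            self._not_empty.notify()

    def pop(self):
        """Return (True, item), or (False, None) once done and drained."""
        with self._not_empty:
            while not self._items and not self._done:
                self._not_empty.wait()
            if not self._items:
                return False, None
            item = self._items.popleft()
            self._not_full.notify()
            return True, item

    def set_done(self):
        """Signal that no more items will be pushed; wakes all waiting consumers."""
        with self._lock:
            self._done = True
            self._not_empty.notify_all()
```

The waits sit inside while loops so that a woken thread re-checks its condition before proceeding, which guards against spurious wakeups and against another thread taking the slot first.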
The next step required the implementation of the base class MultithreadedMolSupplier, which manages the input and output queues. The input queue is populated by the method extractNextRecord, which reads a record from a given file/stream, whereas the output queue is populated by the method processMoleculeRecord, which first pops a record from the input queue and then processes it into an object of type ROMol *. The reader thread thus calls extractNextRecord until no more records can be read, while the writer thread(s) call processMoleculeRecord until the input queue is done and empty. The child classes MultithreadedSmilesMolSupplier and MultithreadedSDMolSupplier primarily provide implementations of the methods extractNextRecord and processMoleculeRecord. Both suppliers were tested on various files with different parameter values for input queue size, output queue size, and the number of writer threads.
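The reader/writer arrangement can be sketched in Python as follows. This is an illustration under stated assumptions, not the C++ implementation: the standard library's queue.Queue stands in for the project's ConcurrentQueue, a sentinel object replaces the "done" flag, and run_pipeline and process_record are hypothetical names.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of a queue (stand-in for a "done" flag)


def run_pipeline(lines, process_record, num_writer_threads=2, queue_size=5):
    """Yield (record_id, result) pairs; results may arrive out of order."""
    input_q = queue.Queue(maxsize=queue_size)
    output_q = queue.Queue(maxsize=queue_size)

    def reader():
        # Plays the role of extractNextRecord: one raw record per input line.
        for record_id, line in enumerate(lines, start=1):
            input_q.put((record_id, line))
        for _ in range(num_writer_threads):
            input_q.put(_STOP)  # one stop marker per writer thread

    def writer():
        # Plays the role of processMoleculeRecord: pop, process, push.
        while True:
            item = input_q.get()
            if item is _STOP:
                output_q.put(_STOP)
                return
            record_id, line = item
            output_q.put((record_id, process_record(line)))

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=writer) for _ in range(num_writer_threads)]
    for t in threads:
        t.start()

    finished = 0
    while finished < num_writer_threads:
        item = output_q.get()
        if item is _STOP:
            finished += 1
        else:
            yield item
    for t in threads:
        t.join()
```

Because several writers run concurrently, records can leave the output queue in a different order than they were read, which is why each result carries its record id. The caller is expected to consume the generator fully; abandoning it early would leave the worker threads blocked on the bounded queues.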
Further Work
Due to time constraints and the difficulty involved in debugging concurrent code, there were a few things that could not be completed.
- In cases where the file format is not well defined, it might be useful to parse the file content to discover the file format and possible Supplier options. The current implementation does not support this and uses only the pathname to determine the appropriate Supplier.
- Wrappers for the Multithreaded Smiles and SD Suppliers in other languages such as Java were not implemented in this project.
Changes made for the General File Reader and Multithreaded Mol Supplier:
List of important files added:
- GeneralFileReader.h and testGeneralFileReader.cpp
- ConcurrentQueue.h and testConcurrentQueue.cpp
- MultithreadedMolSupplier.h and MultithreadedMolSupplier.cpp
- MultithreadedSmilesMolSupplier.h, MultithreadedSmilesMolSupplier.cpp and its Python wrapper
- MultithreadedSDMolSupplier.h, MultithreadedSDMolSupplier.cpp and its Python wrapper
- testMultithreadedMolSupplier.cpp and testMultithreadedMolSupplier.py for testing Python wrappers.
Any other comments?
This project is still a work in progress.