The DocumentExtractor class extracts chemical names from text documents and converts them to chemical structures.
Example usage:
// We have a document to process
java.io.Reader document = ...;
DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format
// Iterate through the hits
for (Hit hit : x.getHits()) {
    System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}
The field hit.position contains the position of the first character of the name in the document.
Note that hit.text contains the name exactly as it appears in the source document. A cleaned-up version, with possible OCR errors and typos corrected, can be retrieved with hit.structure.getName().
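As an illustration of how hit.position and hit.text fit together, the sketch below marks each name in the original document by wrapping it in square brackets. It does not call DocumentExtractor: NameHit is a hypothetical stand-in that carries only the two fields described above, and the code assumes that positions are 0-based character offsets and that hits are sorted by position and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

class HitHighlighter {
    /** Hypothetical stand-in for a hit; not part of the library. */
    static final class NameHit {
        final int position; // assumed: 0-based offset of the first character of the name
        final String text;  // the name exactly as it appears in the document

        NameHit(int position, String text) {
            this.position = position;
            this.text = text;
        }
    }

    /** Returns a copy of the document with every hit wrapped in square brackets. */
    static String highlight(String document, List<NameHit> hits) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        // Assumption: hits are sorted by position and do not overlap.
        for (NameHit hit : hits) {
            int end = hit.position + hit.text.length();
            out.append(document, cursor, hit.position); // text before the hit
            out.append('[').append(document, hit.position, end).append(']');
            cursor = end;
        }
        out.append(document, cursor, document.length()); // text after the last hit
        return out.toString();
    }

    public static void main(String[] args) {
        String document = "Dissolve aspirin in ethanol.";
        List<NameHit> hits = new ArrayList<>();
        hits.add(new NameHit(document.indexOf("aspirin"), "aspirin"));
        hits.add(new NameHit(document.indexOf("ethanol"), "ethanol"));
        System.out.println(highlight(document, hits));
    }
}
```

Because the span of a hit is derived from the length of hit.text, the name as it appears in the source, the brackets line up with the document even when the cleaned-up name has a different length.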
This class can also be run from the command line. It expects the name of a plain text file as its first argument; when no argument is given, it reads the document from standard input. The list of hits is printed to standard output.
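The input selection described above (file argument if present, standard input otherwise) can be sketched as follows. This is not the library's own code: InputSelector, openInput and readAll are hypothetical names, UTF-8 is an assumed encoding, and the standard-input stream is passed in as a parameter so that the demonstration can run without blocking on a terminal. In a real tool the second argument would be System.in.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class InputSelector {
    /** Opens the file named by the first argument, or falls back to the given stream. */
    static Reader openInput(String[] args, InputStream standardInput) throws IOException {
        if (args.length > 0) {
            // Assumption: the plain text file is encoded as UTF-8.
            return Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
        }
        return new BufferedReader(new InputStreamReader(standardInput, StandardCharsets.UTF_8));
    }

    /** Reads the whole document into a string (used here only to show the result). */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Case 1: a file name is given as the first argument.
        Path file = Files.createTempFile("document", ".txt");
        Files.write(file, "text read from a file".getBytes(StandardCharsets.UTF_8));
        try (Reader document = openInput(new String[] { file.toString() }, System.in)) {
            System.out.println(readAll(document));
        }
        Files.delete(file);

        // Case 2: no argument; an in-memory stream stands in for standard input.
        InputStream fakeStdin =
                new ByteArrayInputStream("text read from stdin".getBytes(StandardCharsets.UTF_8));
        try (Reader document = openInput(new String[0], fakeStdin)) {
            System.out.println(readAll(document));
        }
    }
}
```

The Reader returned by openInput is the kind of object that processPlainText(document) takes in the usage example.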
See also: detailed code examples using Document to Structure in real-world situations.