Automated extraction of chemical structure information from digital raster images
© Park et al 2009
Received: 25 October 2008
Accepted: 05 February 2009
Published: 05 February 2009
To search for chemical structures in research articles, diagrams or text representing molecules need to be translated to a standard chemical file format compatible with cheminformatic search engines. However, chemical information in research articles is often presented as analog diagrams of chemical structures embedded in digital raster images. Several software systems have been developed to automate the analog-to-digital conversion of chemical structure diagrams in scientific research articles, but their algorithmic performance and utility in cheminformatic research have not been investigated.
This paper critically reviews these systems and reports our recent development of ChemReader, a fully automated tool for extracting chemical structure diagrams from research articles and converting them into standard, searchable chemical file formats. The basic algorithms for recognizing the lines and letters that represent bonds and atoms in chemical structure diagrams can be run independently and in sequence from a graphical user interface, and the algorithm parameters can be readily changed, to facilitate additional development specifically tailored to a chemical database annotation scheme. Our results indicate that ChemReader outperforms existing software programs such as OSRA, Kekule, and CLiDE on several sets of sample images from diverse sources, in terms of both the rate of correct outputs and the accuracy of extracting molecular substructure patterns.
The availability of ChemReader as a cheminformatic tool for extracting chemical structure information from digital raster images allows research and development groups to enrich their chemical structure databases by annotating the entries with published research articles. Based on its stable performance and high accuracy, ChemReader may be sufficiently accurate for annotating the chemical database with links to scientific research articles.
In the scientific literature, there is a tremendous amount of information about the interaction of small molecules with specific targets, the influence of small molecules on biochemical pathways, the phenotypic effects of small molecules in different cell types, and the relationship of small molecules, targets, pathways and phenotypes to disease processes. However, much of this information has yet to be compiled in a form that would allow a molecule's chemical structure to be used as an input to search for its potential relevance in a specific physiological, pathological or therapeutic area of interest. Two examples of information resources linking chemical structures with biomedical targets, pathways and phenotypes are PubMed, the database of the scientific literature corpus, and PubChem, a publicly available database of over 19 million chemical structures, each of which can have a cross-reference link to similar structures, bio-assay data, and bio-activity descriptions. If these resources can be used to construct a universal database encompassing all known chemical structures with links to specific targets, biochemical pathways, disease states and potential therapeutic applications, a powerful new tool for both biomedical research and drug discovery would emerge.
In general, one can envision two ways to parse scientific articles for chemical information: by searching for names or for structure diagrams of chemical agents. The chemical structure diagrams in scientific articles are typically drawn manually using a program such as ChemDraw, ISIS/Draw, DrawIt, or ACD/ChemSketch. Once a structure is drawn, the structural description can be translated into a computer-readable format, such as the ISIS, MOLfile, SMILES, or ROSDAL formats, which describes the atoms, bond orders, and connectivity patterns of atoms in molecules. However, the diagrams of chemical molecules in scientific journals and reference books are encoded as digitized images (e.g. BMP, TIFF, PNG or GIF), which in turn are embedded within lines of text in a form that is not readily translatable into a computer-readable format. Therefore, most references to chemical agents in scientific research articles cannot be easily linked to other repositories of scientific knowledge, and are thus not amenable to analysis or searching using cheminformatic software.
Comparison of existing machine vision systems for chemical structure recognition. O and X denote the availability of the key features listed in the first column: O = Positive and X = Negative.
Linux/MS Windows/OS X
Machine Vision Approaches for Digital Image Recognition
Machine vision is concerned with the theory and methods for processing image data and identifying relevant image features effectively. Machines see objects differently from human beings. Given digitized image data or multi-dimensional data, machines extract features and classify patterns by examining each digital element (pixel) of each image. In general, a machine processes an image in the following steps:
• De-noising: Removing visual artifacts that decrease the ability to extract information from the images;
• Segmentation: Separating objects in the image;
• Feature extraction: Characterization of each segmented region by extracting topological features;
• Consistency analysis: Interpreting the entire image based on extracted local features; and
• Classification/Matching: Identifying the object in the image in relation to a reference set of objects.
Many applications of machine vision have been developed and are used in various fields (e.g., automated diagnosis systems in medicine, quality control in the manufacturing industry, and security and intruder identification). There are also several applications that extract structural information from digital images of technical diagrams. Dori and Wenyin have developed the Machine Drawing Understanding System (MDUS), which can convert printed mechanical engineering drawings that are scanned and stored as raster digitized image files into standard file formats that can be read by Computer-Aided Design (CAD) software. An automated conversion system for electronic circuit diagrams has also been developed.
Machine Vision Systems for Recognition of Chemical Structure Images
The essential components of chemical structure drawing can be categorized into bond lines and atom symbols. In all systems listed below, these two components in raw image data are first separated by a segmentation algorithm. Then bond lines in graphic segments can be processed by a line detection algorithm and atom symbols in text segments can be recognized by a character recognition algorithm. Finally, a graph representing the chemical structure is built from both results, and from this the structure information can be extracted and stored as a standard chemical file format.
To extract a chemical diagram from a document and convert it to a digital chemical file, any automated machine-vision-based system would need to be able to execute all of the following tasks without manual intervention. The first step is to identify all the individual chemical diagrams in a document, and to segment these diagrams into the atoms and bonds that are connected to form an individual molecule. For this purpose, a document page containing chemical structure diagrams should be scanned to produce a digital raster image of the entire page. Before the scanned digital raster image is processed further, it is necessary to extract only the subarea of the page that contains a chemical structure diagram. Next, with the isolated chemical structure image, another algorithm is used to classify the graphic (bond) and text components in the image. A conventional connected component algorithm is typically used to segment an image into sets of pixels connected with each other, and the relative size of each component helps to distinguish graphics (bond lines) from text (atom symbols).
Once lines and text have been separated from each other, the next step is to identify the length, position and direction of the lines, and the characters of the text. There are several types of bonds used in chemical structure diagrams: single, double, triple, wedged, dotted, dashed, and dashed-wedged. Since the basic graphical elements composing such bonds are lines, the Hough transform and vectorization algorithms, which are widely used in machine-vision systems, are employed for line detection. Different bond types can be distinguished by considering detected line length, width and arrangement patterns. For character recognition, text components are passed to a character recognition engine, where they are analyzed using artificial neural networks or feature-based approaches.
The last step of chemical structure extraction involves establishing the connectivity of the atoms, in terms of which atoms are linked to each other and the number of bonds between them. Based on the results of the previous steps, a graph representing the chemical structure is constructed. From the result of character recognition, the detected chemical symbols for atom types or molecular groups are assigned to nodes. The detected lines enable the construction of the entire structure of the graph. In some cases, a character string at a node could be an abbreviation (e.g., OMe for a methoxy group). In such cases, it is necessary to interpret the chemical meaning of the abbreviation in order to build a complete standard chemical file. A database of chemical abbreviations that frequently appear in chemical structure diagrams can be used for this purpose. By looking up the abbreviation in the database, the abbreviation can be translated directly to a digital chemical representation. If there is no matching entry, the system can flag the structure as potentially misrecognized. In the final step, the compiled chemical structure graph is translated into a standard chemical file format such as a Molfile or a SMILES string.
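The segmentation step described above can be sketched as follows. This is a minimal illustration, not the code of any system reviewed here: the function names, the use of 8-connectivity, and the `max_text_extent` cutoff are assumptions chosen for the example.

```python
from collections import deque

def connected_components(image):
    """Group black pixels (value 1) of a binary image into 8-connected components."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def bounding_box_size(pixels):
    """Return (height, width) of the bounding box of a component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return max(ys) - min(ys) + 1, max(xs) - min(xs) + 1

def classify_by_size(components, max_text_extent):
    """Label a component 'text' if its bounding box is small, else 'graphic'.

    max_text_extent is a hypothetical cutoff in pixels, not a published value.
    """
    labels = []
    for pixels in components:
        height, width = bounding_box_size(pixels)
        labels.append("text" if max(height, width) <= max_text_extent else "graphic")
    return labels
```

A real system would derive the size cutoff from the image itself rather than from a fixed constant.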
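The abbreviation lookup described above can be sketched as a simple dictionary query. The table contents, the SMILES fragments chosen for each entry, and the function name are illustrative assumptions; a production system would use a much larger curated database.

```python
# Hypothetical abbreviation table mapping node labels to SMILES fragments.
ABBREVIATIONS = {
    "Me": "C",
    "Et": "CC",
    "OMe": "OC",
    "OEt": "OCC",
    "Ph": "c1ccccc1",
}

# A small subset of element symbols, for illustration only.
ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}

def interpret_node_label(label):
    """Return (fragment, flagged) for a recognized character string.

    Plain element symbols pass through unchanged, known abbreviations are
    expanded, and anything else is flagged as potentially misrecognized.
    """
    if label in ELEMENTS:
        return label, False
    if label in ABBREVIATIONS:
        return ABBREVIATIONS[label], False
    return None, True
```

Returning a flag rather than raising an error lets the caller keep processing the rest of the diagram and mark the whole structure for review.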
The first commercial program to read and interpret digital raster images of chemical structures was Kekule, developed by Joe R. McDaniel and Jason R. Balmuth of Fein-Marquart Associates Inc. in Baltimore, MD. The program requires at least a 150 dpi image resolution. In Kekule, the area of a page that contains a chemical structure diagram needs to be manually identified. In terms of interesting features, Kekule has a built-in algorithm to fix character recognition errors. For this purpose, a neural network is used for generating potential characters with scoring information estimating the likelihood that a specific character corresponds to a certain atom. Even when an incorrectly recognized character has a higher score than the correct candidate, Kekule can fix character-to-atom conversion errors by considering the valence and chemical neighbors of the atom. Still, manual correction at the post-processing step is often required, due to an average accuracy of 0.74 per structure diagram.
Optical Recognition Of Chemical graphics (OROCS)
Another program for converting chemical structure images to a computer-readable format, OROCS, was developed at the IBM Almaden Research Center, San Jose, CA. The most interesting feature of the OROCS system is that it has an algorithm for automated extraction of chemical structure diagrams from scanned document images. In order to isolate chemical structure diagrams from other elements on a page, such as text, figures and pictures, the document is segmented by a conventional connected components algorithm. If the size of a segment is larger than a threshold, it is regarded as a potential chemical structure, and the polygonal shapes of chemical structure diagrams are used to make a final decision. The methodology implemented in OROCS was granted a U.S. patent in 1992.
Chemical Literature Data Extraction (CLiDE)
Amongst the chemical structure extraction efforts to date, the Chemical Literature Data Extraction project (CLiDE) is available commercially. CLiDE aims not only at extracting chemical structures but also at abstracting chemical information from text. By employing the Document Format Description Language (DFDL), which can describe the logical relationships of objects and elements in a document, CLiDE builds logical associations between chemical structures and the text segments of a document. Like Kekule and unlike OROCS, CLiDE does not have an automated process to discriminate chemical structure diagrams from other graphical objects, so manual separation of chemical diagrams is necessary. Like Kekule and OROCS, CLiDE requires at least 300 dpi resolution in scanned images at the scanning step, and manual correction at the post-processing step, to achieve reliable output. However, drawn chemical structure diagrams are typically embedded in Word documents in GIF or JPG format, at a resolution that is usually 72–96 dpi. Therefore, these software systems might be impractical tools for fully automated extraction of chemical structure information.
Recently, a new program called chemOCR has been developed and made available. Focusing on overcoming the most common errors generated by prior systems, chemOCR adopts a chemical rule-based expert system for the extraction of chemical structure diagrams. Its most interesting feature, at the post-processing stage, is that chemOCR uses a graph-matching algorithm to select the best-matching chemical structure fragment against sub-graphs of chemical structures stored in a database. With this approach, even if several errors occur while detecting lines or recognizing characters, the errors can be corrected by simply replacing unrealistic chemical fragments of a molecule with known sub-structural motifs present in the database of chemical substructures. In its developers' own testing, chemOCR showed high correct recognition rates ranging from 67 to 97%, and thus outperformed CLiDE, which could successfully process only 25 images out of 100.
Optical Structure Recognition (OSRA)
OSRA, another recently released program, is free and open source software written by the CADD group at the National Cancer Institute. OSRA attempts to generate three output structures by varying parameters for the de-noising stage, and then picks one as its output based on its own empirical confidence function. Since most machine vision algorithms can yield quite different interpretations of the same input with slightly different parameter settings, this iterative processing of the same input could improve the overall ratio of correct outputs, so long as the confidence function is reliable enough.
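The strategy of re-running recognition under several parameter settings and keeping the most confident result can be sketched generically. This is not OSRA's code: `recognize` and `confidence` are caller-supplied placeholders, and the function name is hypothetical.

```python
def recognize_with_retries(image, recognize, confidence, denoise_settings):
    """Run `recognize` once per de-noising setting and keep the best candidate.

    recognize(image, setting) returns a candidate structure, and
    confidence(candidate) returns a number where larger means more trusted.
    Raises ValueError if denoise_settings is empty.
    """
    candidates = [recognize(image, setting) for setting in denoise_settings]
    return max(candidates, key=confidence)
```

The quality of the final output depends entirely on how well the confidence function ranks candidates, which is the caveat noted in the text.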
ChemReader – Overview
ChemReader is a software developer toolkit for translating digital raster images of chemical structures into standard chemical file formats that can be searched and analyzed with other open source or commercial cheminformatic software. It is intended to allow each step of the extraction of chemical diagrams to be tailored, in order to optimize the annotation of a database of chemical structures with references from the scientific literature, as illustrated in Figure 1. Recognizing the shortcomings of the other systems discussed in the previous section, ChemReader aims to achieve very high recognition accuracy and robust performance sufficient for fully automated processing of research articles. In addition, ChemReader possesses a graphical user interface (GUI) that allows each step of the algorithm to be tested independently.
The first step in ChemReader involves image processing for re-sizing and de-noising. Chemical structure diagrams are drawn with different settings in the drawing software, such as default bond lengths or character font sizes. Moreover, the image size and format are subject to variation while the image is transferred to its final destination, for example, a journal article or a web page. Thus it is necessary to resize and de-noise the input image so that the chemical structure diagram within it has bond lengths and character sizes optimally adjusted to ChemReader's recognition algorithms. With the first run of line detection, as explained below, the length of a single bond is estimated. If the estimated bond length is shorter or longer than a certain threshold (currently 25 pixels), the image is resized such that bonds extracted in the next stages have ChemReader's preferred length. For this purpose, GREYCstoration, a free implementation of an image regularization algorithm, is used.
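The resizing decision can be illustrated with a small helper that turns an estimated bond length into a scale factor. This is a sketch under stated assumptions: treating 25 pixels as the preferred bond length, and leaving the image untouched within a 20% tolerance, are choices made for the example and are not taken from ChemReader.

```python
def resize_factor(estimated_bond_length, preferred_length=25.0, tolerance=0.2):
    """Return the factor by which to scale the image.

    preferred_length and tolerance are assumed values for illustration.
    A factor of 1.0 means the image is left at its current size.
    """
    if estimated_bond_length <= 0:
        raise ValueError("estimated bond length must be positive")
    ratio = preferred_length / estimated_bond_length
    # Skip resizing when the bond length is already close to the preferred one.
    if abs(ratio - 1.0) <= tolerance:
        return 1.0
    return ratio
```

For example, an image whose bonds are estimated at 50 pixels would be scaled by 0.5, and one with 12.5-pixel bonds by 2.0.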
Separation of lines and characters
The second step is to decompose the image into connected components based on pixel connectivity. In ChemReader, an 8-connectivity algorithm is used. Subsequently, the connected components are classified into characters and graphics. To detect characters, a character detection algorithm searches for objects with similar heights and areas. The most populated area/height combination will, in general, represent text components. Most text components can be separated from the rest of the chemical structure using this method.
If a text component is not separated from a graphic component (e.g., because of a printer error) but is aligned with a successfully separated text component (referred to as a "seed string"), the glued character component is separated from the graphics by extending the seed string in the direction in which the seed string characters are aligned. In order to distinguish small isolated lines or circles representing bonds from text components, the relative location and horizontal/vertical run profile of each component are also checked. For example, the letter 'l' is often wrongly identified as a graphic component. However, since it always appears next to other letters, the letter 'l' can be correctly identified as a letter and not a bond by considering the relative location of each letter. If text components cannot be identified in this manner, they can often be corrected in subsequent steps.
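The idea of picking the most populated area/height combination can be sketched with a two-dimensional histogram. The bin widths and the function name below are hypothetical, and the input is assumed to be a precomputed list of (height, area) pairs, one per connected component.

```python
from collections import Counter

def find_text_components(sizes, height_bin=4, area_bin=20):
    """Return indices of components in the most populated (height, area) bin.

    sizes is a list of (height, area) tuples in pixels. The bin widths are
    illustrative values and would need tuning for real images.
    """
    bins = [(h // height_bin, a // area_bin) for h, a in sizes]
    if not bins:
        return []
    most_common_bin, _ = Counter(bins).most_common(1)[0]
    return [i for i, b in enumerate(bins) if b == most_common_bin]
```

Characters drawn in one font share similar heights and pixel areas, so they tend to fall into the same bin, whereas bond lines and rings vary much more in size.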
Line Detection Algorithm
$$x_i \cos\theta_i + y_i \sin\theta_i = r_i$$
Since a pixel corresponds to a sinusoidal curve in the Hough space, collinear pixels in the x-y space have intersecting sinusoidal curves. Therefore, all possible lines passing through every arbitrary pair of pixels in a chemical diagram image are identified by checking the intersection points of curves in the Hough space. Figures 4(c) and 4(d) show the detected line and the corresponding Hough space. The density at a point (r*, θ*) in the Hough space (Figure 4(d)) represents the likelihood of finding the corresponding line in the actual chemical diagram image (Figure 4(c)).
where $n_{ij}$ is the number of pixels lying within half the line thickness of the line connecting $P_i$ and $P_j$, $x_{ij}$ is the number of black pixels among those $n_{ij}$ pixels, and $p_0$ is the total number of black pixels in the image space divided by the image size. The pixel pairs assigned by this method can be selected randomly to reduce computational time and memory usage. Since the ends of line segments can be recognized as corner pixels, the pixels identified by the wedge-bond detection algorithm (described below) can also be used for general line detection. The line detection algorithm terminates when the assigned pixel pair results in a line segment that is short compared to the previously detected line segments.
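The voting idea behind the Hough transform can be sketched in a few lines. This is the classical accumulator form, in which every pixel votes along its own sinusoid, rather than ChemReader's pixel-pair variant; the one-degree angular resolution, the rounding of r to whole pixels, and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def hough_accumulator(points, n_theta=180):
    """Accumulate votes in (r, theta-index) cells for a list of (x, y) pixels."""
    accumulator = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            # Each pixel traces the sinusoid r = x cos(theta) + y sin(theta).
            r = x * math.cos(theta) + y * math.sin(theta)
            accumulator[(round(r), t)] += 1
    return accumulator

def strongest_line(points, n_theta=180):
    """Return (r, theta in degrees, votes) for the most voted cell."""
    accumulator = hough_accumulator(points, n_theta)
    (r, t), votes = accumulator.most_common(1)[0]
    return r, t * 180.0 / n_theta, votes
```

For 60 pixels on the horizontal line y = 3, all sinusoids meet in the cell r = 3, θ = 90°, so that cell collects 60 votes.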
Bond Type Identification
In low resolution (fuzzy) images, Hough Transformation often fails to distinguish a double or triple bond from a single bond. With thickness-based bond correction, a single line detected can be interpreted as a double or triple bond by considering the thickness of the bond, as well as the pattern of dark-white transitions perpendicular to the line.
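One way to picture the thickness-based correction is to sample pixels along a short profile perpendicular to the detected line and inspect the runs of black pixels. The sketch below is an assumption-laden illustration: the thresholds of 2.5 and 4.5 times the single-bond thickness are hypothetical, and ChemReader's actual decision rule is not given in the text.

```python
def black_runs(profile):
    """Return the lengths of consecutive runs of black pixels (truthy values)."""
    runs = []
    length = 0
    for pixel in profile:
        if pixel:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs

def bond_order(profile, single_thickness):
    """Guess the bond order from a profile taken perpendicular to the line.

    Separate strokes are counted directly. A single blurred stroke is judged
    by its thickness relative to a typical single bond (assumed thresholds).
    """
    runs = black_runs(profile)
    if not runs:
        return 0
    if len(runs) > 1:
        return min(len(runs), 3)
    ratio = runs[0] / single_thickness
    if ratio >= 4.5:
        return 3
    if ratio >= 2.5:
        return 2
    return 1
```

Each dark-white transition in the profile ends one run, so counting runs corresponds to counting the strokes that make up the bond.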
• Area of the triangle = Number of black pixels in the triangle
• Almost isosceles triangle shape
• NB1 > NB2 > NB3, where NB is the number of black pixels (see Figure 7)
In the case where a normal (non-stereochemical) bond is unusually thick or a double bond cannot be resolved as two separate lines, the wedge-bond detection can lead to a bond misrecognition error (Figure 7(c)). To correct this error, the width of wedge bonds (captured by the length between P1 and P2; see Figure 7) is compared to the average width of normal bonds after extracting the normal bonds.
Ring Structure Identification
In low resolution images, it is often observed that a detected line has a different position, length or direction from the actual bond. This is especially the case for the bonds in a hexagonal or pentagonal ring structure, because the pixels of the neighboring bonds can act as noise in the Hough Transform (HT). Accumulated errors of line detection around a ring structure would cause significant errors in constructing the topology of the chemical structure. This problem can be solved by detecting pentagonal or hexagonal ring structures directly using the Generalized Hough Transformation (GHT). With GHT, ChemReader detects ring structures as a skeleton for processing chemical structure diagrams, so the topology of molecules can be constructed more accurately and efficiently.
Text (Character) Recognition
Chemical Spell Checker
The GOCR library outputs the recognition results of each character without any chemical interpretation. The results can contain non-existing chemical symbols or valences. To correct these errors, a chemical "spell checker," a recovery process similar to the conventional OCR error correction, is implemented in ChemReader. The characters recognized by the GOCR library are grouped by their relative adjacency and each character group is regarded as a chemical word representing either an atomic symbol or chemical abbreviation, which is subject to "spell checking" based on the chemical dictionary.
where $M$ denotes the number of pixels in a character segment, and $I_X(j)$ is the normalized grayscale intensity at the $j$th pixel of the character segment $X$. Before the similarity is calculated, the candidate character being compared is always resized so that the input and candidate segments have the same size. Since the exact frequency of each chemical symbol in chemical structure diagrams is not known a priori, we assume that $P(T)$ is equi-probable for all $T \in \mathrm{Chem\_Dictionary}$, with $\sum_{T \in \mathrm{Chem\_Dictionary}} P(T) = 1$. With this chemical spell checker, the accuracy of chemical symbol recognition increased to 87%, up from 66% without spell checking.
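With equal prior probabilities, choosing the most probable dictionary word reduces to choosing the word whose characters best match the observed segments. The sketch below illustrates that selection step only: it assumes per-character similarity scores in [0, 1] have already been computed by some image comparison, and it combines them by multiplication, which is an assumption for the example and not ChemReader's published formula.

```python
def best_dictionary_word(char_scores, dictionary):
    """Pick the dictionary word with the highest combined similarity.

    char_scores holds one dict per character position, mapping a candidate
    character to its similarity with the observed segment (assumed input).
    Returns (word, score), or (None, 0.0) if no word has a non-zero score.
    """
    best_word, best_score = None, 0.0
    for word in dictionary:
        if len(word) != len(char_scores):
            continue
        score = 1.0
        for character, scores in zip(word, char_scores):
            # Characters never proposed for this position contribute zero.
            score *= scores.get(character, 0.0)
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score
```

In this toy setting, a string read as "0Me" can still be resolved to "OMe" because "O" remains a plausible candidate for the first segment and no dictionary word starts with a zero.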
Topology Construction and Data Output
Image sets for performance test.
Number of Images
Google image search
Ligand images at GLIDA database
Journals at PubMed database
The performance of chemical structure recognition is analyzed in two respects: the fraction of correct outputs and the capability to recognize chemically important substructure patterns. The first measure, the fraction of correct outputs, directly shows the accuracy of the software. However, even when an error exists in the output molecule, the output would not be regarded as totally useless if chemically significant features of interest are well recognized. For example, a misassigned atom charge or bond stereochemistry would not be so critical for finding molecules similar to the recognized structure in a database. Thus, we compute two statistical measures, the precision and recall rates, in order to evaluate the software's capability for extracting chemically significant substructure patterns. Precision is the fraction of the extracted patterns that are correct, whereas recall is the fraction of the correct patterns that are extracted. Structural patterns defined in the PubChem Substructure Fingerprint are used in this test. The identity between the original molecule and the output chemical structure is determined using an exact matching function in ChemAxon's JChem toolkit. The PubChem fingerprint is computed using open-source code provided by the NIH Chemical Genomics Center (NCGC).
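The precision and recall definitions above translate directly into set operations on fingerprint bits. In this sketch, each molecule is assumed to be represented by the set of indices of the substructure patterns it contains; the function name and the convention of returning 0.0 for an empty set are choices made for the example.

```python
def precision_recall(reference_bits, extracted_bits):
    """Compute (precision, recall) for two sets of substructure pattern indices.

    reference_bits: patterns present in the original molecule.
    extracted_bits: patterns present in the recognized output structure.
    """
    reference = set(reference_bits)
    extracted = set(extracted_bits)
    true_positives = len(reference & extracted)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

For instance, if the original molecule sets patterns {1, 2, 3, 4} and the output sets {2, 3, 4, 5, 6}, three patterns are shared, giving a precision of 3/5 and a recall of 3/4.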
Summary of performance testing results for three sets of images.
Set I (Total: 50)
Set II (Total: 100)
Set III (Total: 212)
CLiDE Lite V2.1
Kekule demo V2.0 1
The average Tanimoto similarity score can be seen as a measure of how well chemically important features are captured in the output structure. The more missed (false negative) or misinterpreted (false positive) features the output structure has, the smaller the similarity score becomes. It is noted that ChemReader's outputs show a high average similarity score, ranging from 0.74 to 0.88, even though only about 30% of outputs are perfectly correct. This indicates that ChemReader can be effective in annotating chemical structure databases by linking published research articles to relevant entries in the database. Since those links would likely be created based on molecular similarity rather than perfect matching, high similarity scores imply high accuracy in automated chemical database annotation.
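The Tanimoto coefficient between two fingerprints is the number of shared set bits divided by the number of bits set in either fingerprint. A minimal sketch follows, again assuming each fingerprint is given as a set of pattern indices; returning 1.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints are treated as identical (a convention).
        return 1.0
    return len(a & b) / len(union)
```

Both missed and spurious features enlarge the union without enlarging the intersection, which is why either kind of error lowers the score.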
Estimated Precision (P) and Recall (R) rates for classification of substructure patterns.
1. Hierarchic Element Counts
3. Simple atom pairs
4. Simple atom nearest neighbors
5. Detailed atom neighborhoods
6. Simple SMARTS patterns
7. Complex SMARTS patterns
We have examined several existing software programs that can be used to link databases of small molecules with the relevant scientific research articles, by matching the chemical structure diagrams in the articles with the structures in the database. In their current states, these programs are limited in that they generally need manual feeding of images and have significant error rates. As an alternative, we have developed ChemReader, a machine-vision-based software program designed for the development of customized chemical diagram extraction tools in industrial or academic laboratories. Compared to commercially or publicly available software, ChemReader's operation is transparent, in the sense that algorithm performance can be followed step by step. In side-by-side comparison with Kekule, CLiDE and OSRA, ChemReader produces more correct outputs and extracts chemically important substructure patterns with higher recall and precision rates.
To develop ChemReader into a fully automated system, several challenges remain to be addressed. For automated extraction of chemical structures and relevant information from scientific articles, it would be important to rapidly distinguish a diagram of a chemical structure from a non-chemical diagram or a photograph among the extracted images. Gkoutos et al. have reported a method to classify chemical images based on the use of the Kohonen network, with promising results. Such functionality still has to be incorporated into ChemReader. Finally, since the translation of a chemical structure from a raster image to a standard chemical file format is highly error-prone, as seen in the test, output structures should be thoroughly inspected before use. In addition to manual curation, which makes system operation costly, a filtering method that detects "unreadable" images or wrong outputs and filters them out at the pre-processing or post-processing stage might be effective in improving the performance of machine vision systems for recognizing chemical structures. In this manner, accuracy can be increased at the expense of throughput. Since ChemReader is already able to correctly recognize far more images than OSRA, CLiDE or Kekule, this may be a viable course of action for the future of ChemReader's development.
We postulate that, in its current state, ChemReader may be sufficiently accurate for annotating chemical-structure databases with links to scientific research articles. An error at the level of chemical-structure recognition does not necessarily lead to an error in the annotation, since incorrectly extracted molecules may not find a match in the chemical-structure databases. Furthermore, a useful database annotation scheme does not necessarily require perfect matches between database entries and scientific articles. In fact, the ability to link to similar but not identical structures may be important when the intent is to synthesize drug leads that are not identical to the molecule in question, and to identify related compounds in the scientific literature. Since not every chemical database entry may be represented as chemical-structure diagrams in published research articles, the ability to link to similar but not identical molecules may also be useful to increase the number of links between database entries and research articles.
The availability of ChemReader as a cheminformatic tool would allow research and development groups to enrich their knowledge bank of molecules and chemical structures. We plan to make ChemReader commercially available in the near future, with the open source parts such as GOCR and GREYCstoration removed. Like ChemReader, other image-based search engines are being developed in other academic disciplines. In mechanical engineering, for example, search engines are being developed for searching catalogues of three-dimensional components for mechanical products. Compared to other image-based search engines, image-based cheminformatic search engines are simpler because chemical structures are two-dimensional objects with well-defined connectivity patterns (grammars) determined by the atoms and their valences. Indeed, chemical-structure recognition algorithms may be most akin to character- and text-recognition algorithms. Like words in a dictionary, a chemical-structure database can serve as a training set of molecules that can be used to identify the most common chemical substructures present in all known chemical compounds. Based on the frequency of different substructures and using neighboring substructure information, computational techniques borrowed from statistical linguistics may be incorporated to generalize the chemical spell checker to check structural "spelling", which would further optimize ChemReader's performance.
This work was supported by NIH grant P20-HG003890 to Kerby Shedden and Gus R. Rosania.
- PubMed. [http://www.ncbi.nlm.nih.gov/entrez/query/static/overview.html#Introduction]
- PubChem. [http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_Overview]
- ChemDraw. [http://www.cambridgesoft.com/software/ChemDraw/]
- ISIS/Draw. [http://www.symyx.com/products/software/decision-support/isis-draw/index.jsp]
- DrawIt. [http://www.chemwindow.com]
- ACD/ChemSketch. [http://www.acdlabs.com/products/chem_dsn_lab/chemsketch/]
- McDaniel JR, Balmuth JR: Kekule: OCR – Optical Chemical (Structure) Recognition. J Chem Inf Comput Sci. 1992, 32: 373-378.
- Casey R, Boyer S, Healey P, Miller A, Oudot B, Zilles K: Optical Recognition of Chemical Graphics. Proceedings of the Second International Conference on Document Analysis and Recognition: 20–22 October 1993. 1993, Tsukuba, Japan, 627-632.
- Ibison P, Jacquot M, Kam F, Neville AG, Simpson RW, Tonnelier C, Venczel T, Johnson AP: Chemical Literature Data Extraction: The CLiDE Project. J Chem Inf Comput Sci. 1993, 33: 338-344.
- Rosania GR, Crippen G, Woolf P, States D, Shedden K: A Cheminformatic Toolkit for Mining Biomedical Knowledge. Pharmaceutical Research. 2007, 24: 1791-1802.
- Algorri ME, Zimmermann M, Friedrich CM, Akle S, Hofmann-Apitius M: Reconstruction of Chemical Molecules from Images. Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS): 23–26 August 2007. 2007, Lyon, France, 4609-4612.
- OSRA: Optical Structure Recognition. [http://cactus.nci.nih.gov/osra/]
- Snyder WE, Qi H: Machine Vision. 2004, New York: Cambridge University Press.
- Dori D, Wenyin L: Automated CAD Conversion with the Machine Drawing Understanding System: Concepts, Algorithms, and Performance. IEEE Transactions on Systems, Man and Cybernetics. 1999, 29: 411-416.
- Fahn CS, Wang JF, Lee JY: A Topology-Based Component Extractor for Understanding Electronic Circuit Diagrams. Computer Vision, Graphics, Image Process. 1988, 44: 119-138.
- Duda RO, Hart PE: Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM. 1972, 15: 11-15.
- Boyer SK, Casey RG, Miller AM, Oudot B, Zilles KS: Apparatus and method for optical recognition of chemical graphics. U.S. Patent No. 5,157,736. 1992.
- Gkoutos GV, Rzepa H, Clark RM, Adjei O, Johal H: Chemical Machine Vision: Automated Extraction of Chemical Metadata from Raster Image. J Chem Inf Comput Sci. 2003, 43: 1342-1355.
- GREYCstoration: open source algorithms for image denoising and interpolation. [http://cimg.sourceforge.net/greycstoration/]
- Tschumperle D: Fast Anisotropic Smoothing of Multi-Valued Images using Curvature-Preserving PDE's. International Journal of Computer Vision. 2006, 68 (1): 65-82.
- Fletcher LA, Kasturi R: A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images. IEEE Trans on Pattern Analysis and Machine Intelligence. 1988, 10 (6): 910-918.
- Tombre K, Tabbone S, Pelissier L, Lamiroy B, Dosch P: Text/Graphics Separation Revisited. Proceedings of 5th International Workshop on Document Analysis Systems: 19–21 August 2002; Princeton. 2002, 200-211.
- Yang MCK, Lee JS, Lien CC, Huang CL: Hough Transform Modified by Line Connectivity and Line Thickness. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997, 19 (8): 905-910.
- Sojka E: A New Algorithm for Detecting Corners in Digital Images. Proceedings of the 18th Spring Conference on Computer Graphics: 24–27 April 2002; Budmerice, Slovakia. 2002, Alan Chalmers: ACM, 55-62.
- Ballard DH: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition. 1981, 13 (2): 111-122.
- GOCR: Open source character recognition. [http://jocr.sourceforge.net/]
- Dalby A, Nourse JG, Hounshell D, Gushurst AKI, Grier DL, Leland BA, Laufer J: Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited. J Chem Inf Comput Sci. 1992, 32: 244-255.
- Weininger D: SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J Chem Inf Comput Sci. 1988, 28: 31-36.
- Introducing CLiDE Pro, Fall 2008 ACS National Meeting & Exposition, August 17th–21st, Philadelphia, USA. [http://www.simbiosys.ca/science/presentations/2008-acs-08/ACS_CLiDEPro.ppt]
- GLIDA: GPCR-Ligand Database. [http://pharminfo.pharm.kyoto-u.ac.jp/services/glida/]
- PubChem Substructure fingerprint. [ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt]
- JChem, ChemAxon Ltd. [http://www.chemaxon.com/]
- PubChem Fingerprint for JChem, NIH Chemical Genomics Center. [http://www.ncgc.nih.gov/pub/openhts/]