Print2CAD 2019, Artificial Intelligence
Automatic OCR Text Recognition of Non-Native PDF Text
Text in PDF files can be stored as native PDF text, as text deconstructed into lines, as text deconstructed into hatches, or as text contained in raster pictures.
To recognize the three non-native kinds of text, the program uses artificial intelligence methods of OCR (Optical Character Recognition) and symbol recognition.
Text Presented in Raster Pictures
Text Deconstructed in Hatches
Text Deconstructed in Lines
Automatic text recognition works on non-native text only if all of the text has the same direction. If the PDF drawing contains text in several different directions, automatic text recognition will fail.
For text with different directions, Print2CAD offers manual separation of the text areas with the function “Enhanced OCR Text Recognition”, found under “Artificial Intelligence Functions”.
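The idea behind separating text areas by direction can be illustrated with a minimal Python sketch. It groups text regions by rotation angle so that each group contains only one direction. The region labels, the angle values, and the 1 degree tolerance are assumptions made for this illustration; they are not Print2CAD data structures or settings.

```python
from collections import defaultdict


def group_by_direction(regions, tolerance_deg=1.0):
    """Group text regions by rotation angle.

    `regions` is a list of (label, angle_in_degrees) pairs. Both this
    structure and the tolerance are assumptions made for the sketch.
    """
    groups = defaultdict(list)
    for label, angle in regions:
        # Normalise to [0, 360) and snap to the tolerance grid so that
        # 89.8 and 90.1 degrees land in the same bucket.
        bucket = round((angle % 360) / tolerance_deg) * tolerance_deg % 360
        groups[bucket].append(label)
    return dict(groups)


def has_single_direction(regions, tolerance_deg=1.0):
    """True if all regions share one direction (the automatic case)."""
    return len(group_by_direction(regions, tolerance_deg)) <= 1


if __name__ == "__main__":
    drawing = [("title", 0.0), ("dim1", 90.1), ("dim2", 89.8), ("note", 359.9)]
    print(has_single_direction(drawing))  # False: two directions are present
    print(group_by_direction(drawing))
```

Each resulting group could then be handled as its own text area, which mirrors the manual separation described above.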
Correctly separating the text representations is very important for correct text recognition.
The text for OCR text recognition can be stored in the PDF as text deconstructed into lines or paths, as text deconstructed into hatches, or as pixel pictures containing text.
Analyze the PDF file before activating a text representation. The analysis shows, in separate pictures, which kinds of text representation are used in the input PDF file.
If you find more than one text representation, choose all of them.
Selecting the correct text language helps the program build correct words. Print2CAD uses artificial intelligence methods to check the text and an internal dictionary to eliminate unusual text combinations.
Maximum Resolution in DPI
The right resolution for OCR text recognition is very important. The resolution should be as low as possible while the text remains clear and readable. Start with 300 DPI and push the “Preview” button. If the smallest text is not readable, increase the resolution in steps of 50 DPI.
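The stepping procedure can be sketched in Python under stated assumptions. The sketch uses the fact that 1 point is 1/72 inch, so text that is h points tall covers h × DPI / 72 pixels. The readability limit of 20 pixels and the upper bound of 1200 DPI are assumptions for illustration only; in Print2CAD the check is done by eye in the preview.

```python
def pixel_height(text_height_pt, dpi):
    """Pixel height of text that is `text_height_pt` points tall (1 pt = 1/72 inch)."""
    return text_height_pt * dpi / 72.0


def lowest_readable_dpi(smallest_text_pt, min_pixels=20,
                        start_dpi=300, step=50, max_dpi=1200):
    """Return the lowest DPI, starting at 300 and rising in steps of 50,
    at which the smallest text reaches `min_pixels` in height.

    `min_pixels` and `max_dpi` are assumed values, not Print2CAD settings.
    Returns None if no DPI up to `max_dpi` is sufficient.
    """
    dpi = start_dpi
    while dpi <= max_dpi:
        if pixel_height(smallest_text_pt, dpi) >= min_pixels:
            return dpi
        dpi += step
    return None


if __name__ == "__main__":
    print(pixel_height(6.0, 300))    # 25.0
    print(lowest_readable_dpi(6.0))  # 300
    print(lowest_readable_dpi(4.0))  # 400
```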
Minimum and Maximum Text Height in Pixel
The parameters for minimum and maximum text height are very important because the preseparation of text is based on them. Push the “Preview” button. If not all of the text is separated, increase the maximum height. If many stray pixels are separated, increase the minimum height.
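How a height window can drive preseparation is sketched below in Python. The bounding-box tuples and the three-way split are assumptions for illustration; the actual preseparation inside Print2CAD is not documented here.

```python
def preseparate(boxes, min_height, max_height):
    """Split bounding boxes into text candidates and rejects by pixel height.

    Each box is a hypothetical (x, y, width, height) tuple in pixels.
    """
    candidates, too_small, too_large = [], [], []
    for box in boxes:
        height = box[3]
        if height < min_height:
            too_small.append(box)   # likely stray pixels or noise
        elif height > max_height:
            too_large.append(box)   # geometry, or text taller than the limit
        else:
            candidates.append(box)
    return candidates, too_small, too_large


if __name__ == "__main__":
    boxes = [(10, 10, 2, 2), (20, 10, 12, 18), (40, 10, 30, 44), (90, 5, 400, 300)]
    text, noise, large = preseparate(boxes, min_height=8, max_height=40)
    print(len(text), len(noise), len(large))  # 1 1 2
```

In this example the 44 pixel box is rejected as too large. Raising the maximum height to 50 would keep it, which corresponds to the advice to increase the maximum height when not all text is separated.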
If you choose raster images as the text representation, the threshold decides which pixels belong to the black group and which pixels belong to the white background. Push the “Preview” button; if the letters of the text connect to each other, decrease the threshold.
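The effect of the threshold can be shown with a small Python sketch. It assumes 8-bit grayscale values (0 is black, 255 is white) and treats every pixel darker than the threshold as black. That convention is an assumption for illustration, chosen because it matches the advice above: a lower threshold turns gray bridge pixels between letters white.

```python
def binarize(gray_rows, threshold):
    """Return rows of 1 (black) and 0 (white).

    A pixel counts as black when its gray value is below `threshold`
    (assumed convention: 0 = black, 255 = white).
    """
    return [[1 if value < threshold else 0 for value in row]
            for row in gray_rows]


if __name__ == "__main__":
    # Two dark letter strokes joined by one gray, anti-aliased pixel.
    row = [[0, 120, 0]]
    print(binarize(row, 128))  # [[1, 1, 1]] -> the letters touch
    print(binarize(row, 100))  # [[1, 0, 1]] -> the letters are separated
```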
Conversion of Native PDF texts
The native text in PDF files can be placed as strings or individual characters. The best method to find out if your PDF file contains real text is to analyze the PDF file with the analysis function of Print2CAD and see if there are any text entities indicated.
Another method is to open the PDF file in a PDF reader and zoom in on the text as far as possible. If the letters still have smooth edges, your PDF file most likely contains real text. If the edges of the letters are not smooth, Print2CAD will not convert the “text” to real text without activating the OCR function.
Parameter: Convert Native PDF Texts into Editable CAD Texts
In PDF files, text is usually defined as separate characters or groups of characters with their own insertion points. With the help of special internal methods, Print2CAD merges characters into strings and places these strings as texts in the DWG or DXF drawings.
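The merging step can be illustrated with a simplified Python sketch. It is not the internal method used by Print2CAD. It assumes each character is given as a hypothetical (char, x, y, width) tuple, groups characters that share a baseline, and joins neighbours whose horizontal gap is small. The tolerances are illustrative values.

```python
from collections import defaultdict


def merge_characters(chars, baseline_tol=0.5, max_gap=1.0):
    """Merge single characters into strings.

    `chars` is a list of (char, x, y, width) tuples, where (x, y) is the
    insertion point. Returns a list of (text, x, y) tuples, one per string.
    """
    # Group characters into lines by snapping y to a baseline grid.
    lines = defaultdict(list)
    for ch, x, y, width in chars:
        lines[round(y / baseline_tol)].append((x, y, ch, width))

    strings = []
    for key in sorted(lines):
        run = None
        for x, y, ch, width in sorted(lines[key]):  # left to right
            if run is not None and x - run["end"] <= max_gap:
                run["text"] += ch                   # continue current string
            else:
                run = {"text": ch, "x": x, "y": y, "end": x}
                strings.append(run)                 # start a new string
            run["end"] = x + width
    return [(s["text"], s["x"], s["y"]) for s in strings]


if __name__ == "__main__":
    chars = [("B", 6.0, 10.0, 5.0), ("A", 0.0, 10.0, 5.0),
             ("C", 40.0, 10.0, 5.0), ("D", 0.0, 30.0, 5.0)]
    print(merge_characters(chars))
    # [('AB', 0.0, 10.0), ('C', 40.0, 10.0), ('D', 0.0, 30.0)]
```

Each merged string keeps the insertion point of its first character, which is what a CAD text entity would need.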
Parameter: Convert Native PDF Texts into Hatches
Print2CAD converts all text into polylines with filled areas (hatches).
It is not always possible to extract text from a PDF, especially when the Unicode map is missing or “user defined”. Many construction drawings use this type of trick to stop people from extracting the data.
If it is not possible to cut and paste the correct text from Acrobat then you will have very little chance of converting the text yourself. If Acrobat cannot extract it then it is very unlikely that Print2CAD can extract the text correctly.
The only ways to handle this kind of text are to convert it into hatches or to apply the OCR functions to it.
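The copy-and-paste check can be approximated in code once the text has been copied out of the PDF. The Python sketch below is a heuristic under stated assumptions: it flags a string as unusable when a large share of its characters are private-use, unassigned, control, or replacement characters, which is a common symptom of a missing Unicode map. The 20 percent limit is an assumed value, and text that is wrongly mapped to ordinary letters would not be detected.

```python
import unicodedata


def looks_unmapped(text, max_bad_ratio=0.2):
    """Heuristic: True if `text` is empty or mostly unusable characters."""
    if not text:
        return True  # nothing could be extracted at all
    bad = 0
    for ch in text:
        category = unicodedata.category(ch)
        is_control = category == "Cc" and ch not in "\n\r\t"
        # Co = private use, Cn = unassigned, U+FFFD = replacement character
        if ch == "\ufffd" or category in ("Co", "Cn") or is_control:
            bad += 1
    return bad / len(text) > max_bad_ratio


if __name__ == "__main__":
    print(looks_unmapped("SECTION A-A"))         # False
    print(looks_unmapped("\uf041\uf042\uf043"))  # True
```

If the check returns True for text copied from the PDF, converting the text into hatches or using OCR is the likely route.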
Parameter: Visualization of a Text with Corrupt Codec
If the font codec and encoding table were created manually, Print2CAD uses artificial intelligence methods to determine the correct character codes.
Parameter: Sort Text Onto Separate Layer
When this function is activated, all native or recognized text is sorted onto a predetermined layer. If there is no real text, but only polylines, hatches, or raster images, the letters will not be recognized as text.
Parameter: Replace All Fonts With a SHX or TTF Font
When this option is enabled, all text styles are assigned the same selected SHX or TTF font.
Parameter: Scale Factors for Blank Space Width
Text in PDF files is often placed as single letters. In this case, no space characters are available.
When Print2CAD transforms letters into text, blank spaces are recognized using a substitute space width equal to the width of the letter “a”.
If space detection does not work properly, increase or reduce the substitute space factor by trial and error, using the graphic below as a guide.
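The role of the substitute space factor can be sketched in Python. The sketch assumes that a gap between two letters counts as a blank space when it is at least as wide as the letter “a” multiplied by the factor. That comparison rule and the (char, x, width) tuples are assumptions for illustration; the exact rule used by Print2CAD is not stated here.

```python
def insert_spaces(chars, a_width, space_factor=1.0):
    """Rebuild a line of text from single letters.

    `chars` is a list of (char, x, width) tuples on one line, sorted from
    left to right. A space is inserted when the gap to the previous letter
    is at least `a_width * space_factor` (assumed rule).
    """
    threshold = a_width * space_factor
    text = ""
    prev_end = None
    for ch, x, width in chars:
        if prev_end is not None and x - prev_end >= threshold:
            text += " "
        text += ch
        prev_end = x + width
    return text


if __name__ == "__main__":
    letters = [("T", 0.0, 5.0), ("O", 5.5, 5.0), ("P", 11.0, 5.0), ("V", 20.5, 5.0)]
    print(insert_spaces(letters, a_width=4.0, space_factor=1.0))  # TOP V
    print(insert_spaces(letters, a_width=4.0, space_factor=1.5))  # TOPV
    print(insert_spaces(letters, a_width=4.0, space_factor=0.1))  # T O P V
```

A factor that is too large swallows real spaces, and a factor that is too small splits words, which is why the value is tuned by trial and error.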
Parameter: Scale Factors for Text Width and Height
If Print2CAD can’t find the fonts used in the PDF in the Windows system, Print2CAD will select a similar font. In doing so, the text width may change.
A workaround for this is the use of scale factors for the text width and height. The text will be scaled by the given factor and placed left-aligned in the CAD drawing.
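One way to arrive at a width factor is sketched below in Python. It assumes you can measure the width of the same text once in the PDF and once with the substituted font in the CAD drawing; the function names and the measurements are hypothetical. Because the text is placed left-aligned, only its right end moves when the factor is applied.

```python
def width_scale_factor(pdf_text_width, substitute_text_width):
    """Factor that makes the substituted font as wide as the PDF text."""
    if substitute_text_width <= 0:
        raise ValueError("substitute_text_width must be positive")
    return pdf_text_width / substitute_text_width


def scaled_extent(x_left, text_width, factor):
    """Left and right x coordinates of left-aligned text after scaling."""
    return x_left, x_left + text_width * factor


if __name__ == "__main__":
    factor = width_scale_factor(40.0, 50.0)
    print(factor)                              # 0.8
    print(scaled_extent(100.0, 50.0, factor))  # (100.0, 140.0)
```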
Fonts in PDF files are mostly embedded, so you do not need them installed in your Windows system to display the PDF files.
Fonts cannot be embedded in DWG or DXF files. All fonts used in a DWG or DXF file must be installed in your Windows system.
Print2CAD is not able to extract PDF embedded fonts into your Windows system.
BackToCAD Technologies, LLC
601 Cleveland St, Suite 310
Clearwater, FL 33755, USA
Phone: (727) 303 0383
© Copyright 2017 BackToCAD Technologies, LLC. All rights reserved. Kazmierczak® is a registered trademark of Kazmierczak Software GmbH. Print2CAD, AzubiCAD, and CAD2Print are Trademarks of BackToCAD Technologies LLC. CADconv is a Trademark of Expert Robotics Inc. DWG is the name of Autodesk’s proprietary file format and technology used in AutoCAD® software and related products. Autodesk, the Autodesk logo, AutoCAD, DWG are registered trademarks or trademarks of Autodesk, Inc., and/or its subsidiaries and/or affiliates in the USA and/or other countries. All other brand names, product names, or trademarks belong to their respective holders. This website is independent of Autodesk, Inc., and is not authorized by, endorsed by, sponsored by, affiliated with, or otherwise approved by Autodesk, Inc. The material and software have been placed on this Internet site under the authority of the copyright owner for the sole purpose of viewing of the materials by users of this site. Users, press or journalists are not authorized to reproduce any of the materials in any form or by any means, electronic or mechanical, including data storage and retrieval systems, recording, printing or photocopying.