Reference Manual

Print2CAD 2018 x64 Artificial Intelligence

Optimization 3:

Part 1 - Automatic OCR Text Recognition

Optimization 3:

Part 2 - Native Text Recognition

Optimization 3:

Sample - OCR of Simple Direction Text

Automatic OCR Text Recognition of Non Native PDF Text

Text in PDF files can be stored as native PDF text, as text deconstructed into lines, as text deconstructed into hatches, or as text contained in raster pictures.

To recognize the non-native kinds of text (lines, hatches, and raster pictures), the program uses the artificial intelligence methods of OCR (Optical Character Recognition) and Symbol Recognition.

Text Presented in Raster Pictures

Text Deconstructed in Hatches

Text Deconstructed in Lines

Text Inclination

Automatic text recognition of non-native text works only if all of the text has the same direction. If the PDF drawing contains text in several different directions, automatic text recognition will fail.

For text with different directions, Print2CAD offers manual separation of the text areas under “Artificial Intelligence Functions” with the function “Enhanced OCR Text Recognition”.

Text Representation

Selecting the right text representation is very important for correct text recognition.

Text for OCR recognition can be stored in a PDF as text deconstructed into lines or paths, as text deconstructed into hatches, or as pixel pictures containing text.

Analyze the PDF file before activating a text representation. The analysis shows, in separate pictures, which kinds of text representation are used in the input PDF file.

If you find more than one text representation, select all of them.

Text Language

Selecting the right text language helps the program build the right words. Print2CAD uses artificial intelligence methods to check the text and an internal dictionary to eliminate unusual text combinations.

Language English
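The manual does not describe how the internal dictionary is applied. As a rough illustration of the general idea only, the following sketch replaces a misread word with its closest dictionary entry using Python's standard `difflib` module. The word list, the function name `correct_word`, and the similarity cutoff are assumptions made for this example and are not part of Print2CAD.

```python
import difflib

# Hypothetical mini dictionary; a real one would depend on the selected language.
DICTIONARY = ["section", "elevation", "detail", "scale", "drawing"]


def correct_word(word, dictionary=DICTIONARY, cutoff=0.75):
    """Return the closest dictionary word, or the input unchanged if none is close."""
    if word.lower() in dictionary:
        return word
    # get_close_matches ranks candidates by similarity and drops those below cutoff.
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


if __name__ == "__main__":
    for raw in ["sect1on", "drawlng", "xyz"]:
        print(raw, "->", correct_word(raw))
```

A word list in the wrong language would offer no close match for most words, which is one plausible reason why the language selection matters.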

Maximum Resolution in DPI

The right resolution for OCR text recognition is very important. The resolution has to be as low as possible, but the text has to be very clear and readable. Try 300 DPI first and push the button “Preview”. If the smallest text is not readable, increase the resolution in steps of 50 DPI.

OCR Resolution
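The resolution search described above (start at 300 DPI, raise by 50 DPI until the smallest text is readable) can be sketched as a simple loop. The callback `is_readable` stands in for your own visual check in the preview, and the upper limit of 600 DPI is an assumption for this example, not a documented Print2CAD limit.

```python
def find_ocr_resolution(is_readable, start_dpi=300, step_dpi=50, max_dpi=600):
    """Return the lowest tested DPI at which is_readable(dpi) is true, or None.

    is_readable is a hypothetical callback representing the manual check
    of the smallest text in the preview.
    """
    dpi = start_dpi
    while dpi <= max_dpi:
        if is_readable(dpi):
            return dpi
        dpi += step_dpi  # text still unclear: try the next 50 DPI step
    return None


if __name__ == "__main__":
    # Pretend the smallest text becomes readable at 380 DPI or more.
    print(find_ocr_resolution(lambda dpi: dpi >= 380))
```

Because the search starts low and stops at the first readable value, it returns the lowest tested resolution that works, which matches the advice to keep the resolution as low as possible.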

Minimum and Maximum Text Height in Pixels

The parameters for maximum and minimum text height are very important. The preseparation of text is based on these parameters. Push the button “Preview”. If you see that not all text is separated, increase the maximum height. If you see that a lot of stray pixels are separated, increase the minimum height.
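How Print2CAD performs the preseparation internally is not documented. The following minimal sketch only illustrates the effect of the two limits, assuming that each candidate is a bounding box around a connected group of dark pixels. The type `Blob` and the function `preseparate` are hypothetical names chosen for this example.

```python
from typing import NamedTuple


class Blob(NamedTuple):
    """Bounding box of one connected group of dark pixels (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def preseparate(blobs, min_height, max_height):
    """Keep only blobs whose pixel height lies within [min_height, max_height]."""
    return [b for b in blobs if min_height <= b.height <= max_height]


if __name__ == "__main__":
    blobs = [
        Blob(5, 5, 2, 2),       # stray speck
        Blob(10, 10, 12, 18),   # normal letter
        Blob(40, 8, 30, 60),    # large title letter
    ]
    print(len(preseparate(blobs, min_height=1, max_height=40)))   # speck kept, title lost
    print(len(preseparate(blobs, min_height=6, max_height=80)))   # speck dropped, title kept
```

Raising the minimum height removes the speck, and raising the maximum height brings the large letter back into the result.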

Image Threshold

If you choose raster images as the text representation, the threshold decides which pixels belong to the black (text) group and which pixels belong to the white background. Push the button “Preview”. If you see that the letters connect to each other, decrease the threshold.
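The effect of the threshold can be illustrated with a small sketch. It assumes 8-bit grayscale values (0 = black, 255 = white) and treats every pixel darker than the threshold as text. This convention and the function name `binarize` are assumptions for the example; the exact rule used by Print2CAD is not documented here.

```python
def binarize(gray_rows, threshold):
    """Map grayscale pixels (0 = black, 255 = white) to 1 (text) or 0 (background).

    Assumed convention: a pixel counts as text when its value is below threshold.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


if __name__ == "__main__":
    # Two dark strokes joined by a medium-gray smear.
    row = [[0, 120, 0]]
    print(binarize(row, threshold=128))  # smear counts as text: strokes connect
    print(binarize(row, threshold=100))  # smear becomes background: strokes separate
```

Under this convention a lower threshold turns fewer gray pixels into text, which is why decreasing it separates letters that appear connected.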

Conversion of Native PDF Texts

The native text in PDF files can be placed as strings or individual characters. The best method to find out if your PDF file contains real text is to analyze the PDF file with the analysis function of Print2CAD and see if there are any text entities indicated.

Another method is to open the PDF file in a PDF reader and zoom in on the text as far as possible. If the letters still have smooth edges, your PDF file most likely contains real text. If the edges of the letters are not smooth, Print2CAD will not convert the “text” to real text unless the OCR function is activated.

Parameter: Convert Native PDF Texts into Editable CAD Texts

In PDF files, text is usually defined as separate characters or groups of characters with their own insertion points. With the help of special internal methods, Print2CAD merges characters into strings and places these strings as texts in the DWG or DXF drawings.
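The internal merging methods are not described in this manual. As a simplified illustration of the idea, the sketch below groups characters that share a baseline and joins each group from left to right, keeping the insertion point of the first character. The type `Glyph`, the function `merge_into_strings`, and the baseline tolerance are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    """One character placed in the PDF with its own insertion point (hypothetical type)."""
    char: str
    x: float  # left edge of the character
    y: float  # baseline position


def merge_into_strings(glyphs, baseline_tolerance=0.5):
    """Group glyphs that share a baseline and join each group left to right.

    Returns a list of (x, y, text) tuples, one per merged string.
    """
    lines = []  # each entry: [baseline_y, [glyphs on that baseline]]
    for glyph in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(glyph.y - lines[-1][0]) <= baseline_tolerance:
            lines[-1][1].append(glyph)
        else:
            lines.append([glyph.y, [glyph]])

    merged = []
    for baseline, members in lines:
        members.sort(key=lambda g: g.x)
        text = "".join(g.char for g in members)
        merged.append((members[0].x, baseline, text))
    return merged


if __name__ == "__main__":
    glyphs = [Glyph("B", 10, 0), Glyph("A", 0, 0), Glyph("C", 0, 20.2), Glyph("D", 5, 20)]
    print(merge_into_strings(glyphs))
```

This sketch ignores rotated text and blank spaces between words; it only shows how separate insertion points can be reduced to one insertion point per string.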

Parameter: Convert Native PDF Texts into Hatches

Print2CAD converts all texts into polylines with filled areas (hatches).

It is not always possible to extract text from a PDF, especially when the Unicode map is missing or “user defined”. Many construction drawings use this technique to stop people from extracting the data.

If it is not possible to copy and paste the correct text from Acrobat, you have very little chance of converting the text yourself. If Acrobat cannot extract the text, it is very unlikely that Print2CAD can extract it correctly.

Converting this text into hatches or applying OCR functions to it is the only way to handle this kind of text.

Parameter: Visualization of a Text with Corrupt Codec

If the font codec and encoding table were created manually, Print2CAD will use artificial intelligence methods to determine the correct codes.

Parameter: Sort Text Onto Separate Layer

When this function is activated, all native or recognized text is sorted onto a predetermined layer. If there is no real text, but only polylines, hatches, or raster images, the letters will not be recognized as text.

Parameter: Replace All Fonts With a SHX or TTF Font

When this option is enabled, all text styles are assigned the same selected SHX or TTF font.

Parameter: Scale Factors for Blank Space Width

Text in PDF files is often placed as single letters. In this case the blank spaces are not stored in the file.

When Print2CAD transforms letters into text, blank spaces are recognized with the help of a substitute space width equal to the width of the letter “a”.

If the space detection does not work properly, increase or reduce the substitute space factor according to the graphic below (by trial and error):
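The role of the space factor can be sketched as follows. The example assumes that a blank is inserted whenever the gap between two neighboring letters is at least the width of the letter “a” multiplied by the factor. The exact comparison used by Print2CAD is not documented, and the function name `join_with_spaces` is hypothetical.

```python
def join_with_spaces(chars, width_of_a, space_factor=1.0):
    """Join characters, inserting a blank where the gap reaches the substitute width.

    chars: list of (char, x_left, width) tuples sorted from left to right.
    Assumed rule: gap >= width_of_a * space_factor means "blank space".
    """
    limit = width_of_a * space_factor
    pieces = []
    previous_right = None
    for char, x_left, width in chars:
        if previous_right is not None and x_left - previous_right >= limit:
            pieces.append(" ")
        pieces.append(char)
        previous_right = x_left + width
    return "".join(pieces)


if __name__ == "__main__":
    letters = [("N", 0, 6), ("O", 7, 6), ("T", 19, 6)]
    for factor in (0.1, 1.0, 2.0):
        print(factor, repr(join_with_spaces(letters, width_of_a=5, space_factor=factor)))
```

With these sample values a factor that is too small splits the word into single letters, and a factor that is too large removes the blank, which is why the factor is tuned by trial and error.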

 

Parameter: Scale Factors for Text Width and Height

If Print2CAD can’t find the fonts used in the PDF in the Windows system, Print2CAD will select a similar font. In doing so, the text width may change.

A workaround for this is the use of scale factors for the text width and height. The text will be scaled by the given factor and placed left-aligned in the CAD drawing.
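A suitable width factor can be estimated by comparing the width of a text in the original font with its width in the substitute font. The following sketch shows this arithmetic and the effect of left-aligned scaling. The function names and the sample measurements are assumptions for illustration; Print2CAD expects you to enter the factor yourself.

```python
def width_scale_factor(original_width, substitute_width):
    """Factor that brings text in the substitute font back to its original width."""
    if substitute_width <= 0:
        raise ValueError("substitute_width must be positive")
    return original_width / substitute_width


def scaled_extent(insert_x, text_width, factor):
    """Left-aligned scaling: the left edge stays fixed, only the right edge moves."""
    return insert_x, insert_x + text_width * factor


if __name__ == "__main__":
    # A label measured 40 units wide in the PDF but 50 units in the substitute font.
    factor = width_scale_factor(40.0, 50.0)
    print(factor)
    print(scaled_extent(10.0, 50.0, factor))
```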

Fonts in PDF files are mostly embedded, so you do not need them installed in your Windows system to display the PDF files.

In DWG or DXF files, fonts cannot be embedded. All fonts that are used in the DWG or DXF files must be installed in your Windows system.

Print2CAD is not able to extract PDF embedded fonts into your Windows system.

BackToCAD Technologies, LLC

601 Cleveland St, Suite 310

Clearwater, FL 33755, USA

 

Email: bc-sales@cad-pdf.com
Phone: (727) 303 0383

© Copyright 2017 BackToCAD Technologies, LLC. All rights reserved. Kazmierczak® is a registered trademark of Kazmierczak Software GmbH. Print2CAD, AzubiCAD, and CAD2Print are Trademarks of BackToCAD Technologies LLC. CADconv is a Trademark of Expert Robotics Inc.. DWG is the name of Autodesk’s proprietary file format and technology used in AutoCAD® software and related products. Autodesk, the Autodesk logo, AutoCAD, DWG are registered trademarks or trademarks of Autodesk, Inc., and/or its subsidiaries and/or affiliates in the USA and/or other countries. All other brand names, product names, or trademarks belong to their respective holders. This website is independent of Autodesk, Inc., and is not authorized by, endorsed by, sponsored by, affiliated with, or otherwise approved by Autodesk, Inc. The material and software have been placed on this Internet site under the authority of the copyright owner for the sole purpose of viewing of the materials by users of this site. Users, press or journalists are not authorized to reproduce any of the materials in any form or by any means, electronic or mechanical, including data storage and retrieval systems, recording, printing or photocopying.