author | Thomas Deutschmann <whissi@gentoo.org> | 2020-09-10 18:10:49 +0200 |
---|---|---|
committer | Thomas Deutschmann <whissi@gentoo.org> | 2020-09-11 20:06:36 +0200 |
commit | acfc02c1747065fe450c7cfeb6f1844b62335f08 (patch) | |
tree | 5887806a2e6b99bbb0255e013a9028810e230a7f /doc/Devices.htm | |
parent | Import Ghostscript 9.52 (diff) | |
download | ghostscript-gpl-patches-acfc02c1747065fe450c7cfeb6f1844b62335f08.tar.gz ghostscript-gpl-patches-acfc02c1747065fe450c7cfeb6f1844b62335f08.tar.bz2 ghostscript-gpl-patches-acfc02c1747065fe450c7cfeb6f1844b62335f08.zip |
Import Ghostscript 9.53 (ghostscript-9.53)
Signed-off-by: Thomas Deutschmann <whissi@gentoo.org>
Diffstat (limited to 'doc/Devices.htm')
-rw-r--r-- | doc/Devices.htm | 103 |
1 files changed, 91 insertions, 12 deletions
diff --git a/doc/Devices.htm b/doc/Devices.htm
index 166c4080..921211a6 100644
--- a/doc/Devices.htm
+++ b/doc/Devices.htm
@@ -1,15 +1,6 @@
 <!doctype html>
 <html>
 <head>
-<!-- Global site tag (gtag.js) - Google Analytics -->
-<script async src="https://www.googletagmanager.com/gtag/js?id=UA-54391264-2"></script>
-<script>
- window.dataLayer = window.dataLayer || [];
- function gtag(){dataLayer.push(arguments);}
- gtag('js', new Date());
-
- gtag('config', 'UA-54391264-2');
-</script>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
 <link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro" rel="stylesheet">
@@ -85,12 +76,14 @@
 <ul>
 <li><a href="#PDF">PDF file output</a></li>
 <li><a href="#PDFimage">Bitmap PDF output, PCLm output</a></li>
+<li><a href="#OCR">OCR devices</a></li>
+<li><a href="#PDFocr">Bitmap PDF output (with OCR text)</a></li>
 <li><a href="#PS">PostScript file output</a></li>
 <li><a href="#EPS">EPS file output</a></li>
 <li><a href="#PXL">PCL-XL file output</a></li>
 <li><a href="#TXT">Text output</a></li>
 </ul>
-<li><a href="#Dis play_devices">Display devices</a></li>
+<li><a href="#Display_devices">Display devices</a></li>
 <ul>
 <li><a href="#x11_devices">X Window System</a></li>
 <li><a href="#display_device">display device (MS Windows, OS/2, gtk+)</a></li>
@@ -962,6 +955,92 @@
 of 'high-level' formats. These allow Ghostscript to preserve (as much as possible)
 the drawing elements of the input file maintaining flexibility, resolution
 independence, and editability.</p>
+<h3><a name="OCR"></a>Optical Character Recognition (OCR) output</h3>
+
+<p>
+ These devices render internally in 8 bit greyscale, and then
+ feed the resultant image into an OCR engine. Currently, we
+ are using the Tesseract engine. Not only is this both free
+ and open source, it gives very good results, and supports
+ a huge number of languages/scripts.
+</p>
+<p>
+ The Tesseract engine relies on files to encapsulate each
+ language and/or script. These "traineddata" files
+ are available in different forms, including <a href="github.com/tesseract-ocr/tessdata_fast">fast</a>
+ and <a href="tesseract-ocr/tessdata_best">best</a> variants.
+ Alternatively, people can train their own data using the
+ standard Tesseract tools.
+</p>
+<p>
+ These files are looked for from a variety of places. Firstly,
+ any files placed in "Resource/Tesseract/" will be
+ included in the binary for any standard (COMPILE_INITS=1) build.
+ Secondly, files will be searched for in the current directory.
+ Thirdly, files will be searched for in the directory given by
+ the environment variable TESSDATA_PREFIX.
+</p>
+<p>
+ By default, the OCR process defaults to looking for English text,
+ using "eng.traineddata". This can be changed by using the
+ <code>-sOCRLanguage=</code> switch;
+</p>
+<blockquote>
+<dl>
+<dt><code>-sOCRLanguage=</code><b><em>language</em></b></dt>
+<dd>This sets the trained data sets to use within the Tesseract
+ OCR engine. For example, the following will use English and
+ Arabic:</dd></dl>
+<blockquote>
+<pre>
+ <kbd>gs -sDEVICE=ocr -r200 -sOCRLanguage="eng,ara" -o out.txt\
+ zlib/zlib.3.pdf</kbd>
+</pre>
+</blockquote>
+</blockquote>
+<p>
+ The first device is named ocr. It extracts data as unicode codepoints
+ and outputs them to the device as a stream of UTF-8 bytes.
+</p>
+<p>
+ The second device is named hocr. This extracts the data in
+ <a href="wikipedia.org/wiki/HOCR">hOCR</a> format.
+</p>
+<p>
+ These devices are implemented as downscaling devices, so the
+ standard parameters can be used to control this process. It
+ may seem strange to use downscaling on an image that is not
+ actually going to be output, but there are actually good reasons
+ for this. Firstly, the higher the resolution, the slower the
+ OCR process. Secondly, the way the Tesseract OCR engine works
+ means that anti-aliased images perform broadly as well as the
+ super-sampled image from which it came.
+</p>
+
+<h3><a name="PDFocr"></a>PDF image output (with OCR text)</h3>
+
+<p>
+ These devices do the same render to bitmap and wrap as a PDF process as
+ the <a name="PDFimage">PDFimage</a> devices above, but with the addition
+ of an OCR step at the end. The OCR'd text is overlaid "invisibly"
+ over the images, so searching and cut/paste should still work.
+</p>
+<p>
+ The OCR engine being used is Tesseract. For information on this
+ including how to control what language data is used, see the <a href="OCR">
+ OCR devices</a> section above.
+</p>
+<p>
+ There are three devices named pdfocr8, pdfocr24 and pdfocr32. These
+ produce valid PDF files with a colour depth of 8 (Gray), 24 (RGB) or
+ 32 (CMYK).
+</p>
+<p>
+ These devices accept all the same flags as the <a name="PDFimage">PDFimage</a>
+ devices described above.
+</p>
+<p>
+
 <h2><a name="High-level"></a>High-level devices</h2>
 <h3><a name="PDF"></a>PDF writer</h3>
@@ -2002,7 +2081,7 @@
 spot colors.</p>
 <hr>
 <p>
-<small>Copyright © 2000-2019 Artifex Software, Inc. All rights reserved.</small>
+<small>Copyright © 2000-2020 Artifex Software, Inc. All rights reserved.</small>
 <p>
 This software is provided AS-IS with no warranty, either express or
@@ -2015,7 +2094,7 @@
 or contact Artifex Software, Inc., 1305 Grant Avenue - Suite 200,
 Novato, CA 94945, U.S.A., +1(415)492-9861, for further information.
 <p>
-<small>Ghostscript version 9.52, 19 March 2020
+<small>Ghostscript version 9.53.0, 10 September 2020
 <!-- [3.0 end visible trailer] ============================================= -->
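The documentation added in this diff gives a command line only for the ocr device. A minimal sketch of the equivalent call for pdfocr8 follows, assuming the device takes the same `-sOCRLanguage=` switch (the new text points to the OCR section for language control). `pdfocr_cmd` is a hypothetical helper, not part of Ghostscript, and the file names are placeholders. `-r200` is borrowed from the ocr example in the diff, not a recommended value. The script prints the command and does not run it, so it does not need gs or any traineddata files.

```shell
#!/bin/sh
# Sketch only: prints a Ghostscript command line instead of executing it.
# pdfocr_cmd is a hypothetical helper name used for illustration.
pdfocr_cmd() {
    input=$1               # source document to render and OCR
    output=$2              # searchable PDF to write
    languages=${3:-eng}    # comma-separated traineddata names; eng is the documented default
    printf 'gs -sDEVICE=pdfocr8 -r200 -sOCRLanguage=%s -o %s %s\n' \
        "$languages" "$output" "$input"
}

# English plus Arabic, mirroring the language list used in the diff's ocr example.
pdfocr_cmd scan.pdf searchable.pdf "eng,ara"
```

To run the printed command, the matching traineddata files must be somewhere Ghostscript searches: built in from Resource/Tesseract/, in the current directory, or in the directory named by TESSDATA_PREFIX.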