<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Tasks of the Contingent Librarian</title>
<link rel="stylesheet" type="text/css" href="tasks.css">
<script src="tasks.js"></script>
</head>
<body>
<div class="cardback"><DOCUMENT_FRAGMENT><div class="mw-parser-output"><div class="thumb tright"><div class="thumbinner" style="width:152px;"><a class="image" href="File:Rereferencing_Open_work_OCR.jpeg.html"><img alt="" class="thumbimage" decoding="async" height="106" src="/mw-mediadesign/images/thumb/3/3c/Rereferencing_Open_work_OCR.jpeg/150px-Rereferencing_Open_work_OCR.jpeg" srcset="/mw-mediadesign/images/thumb/3/3c/Rereferencing_Open_work_OCR.jpeg/225px-Rereferencing_Open_work_OCR.jpeg 1.5x, /mw-mediadesign/images/thumb/3/3c/Rereferencing_Open_work_OCR.jpeg/300px-Rereferencing_Open_work_OCR.jpeg 2x" width="150"></a> <div class="thumbcaption">A bootleg copy of The Open Work by Umberto Eco. The OCR software has mistaken the page number (page 80) for the word “So”.</div></div></div>
<p>Snippets:
</p>
<h2><span class="mw-headline" id="Pre-processing_for_OCR">Pre-processing for OCR</span></h2>
<p>This script converts the input image to grayscale and then applies either Otsu thresholding or a median blur before running Tesseract OCR on it, which generally gives cleaner recognised text:
</p>
<pre># import the necessary packages
from PIL import Image
import pytesseract
import argparse
import cv2
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh",
    help="type of preprocessing to be done: 'thresh' or 'blur'")
args = vars(ap.parse_args())

# load the input image and convert it to grayscale
image = cv2.imread(args["image"])
if image is None:
    # cv2.imread returns None instead of raising when the file cannot be read
    raise SystemExit("could not read image: {}".format(args["image"]))
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# check to see if we should apply (Otsu) thresholding to preprocess the
# image
if args["preprocess"] == "thresh":
    gray = cv2.threshold(gray, 0, 255,
        cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# make a check to see if median blurring should be done to remove
# noise
elif args["preprocess"] == "blur":
    gray = cv2.medianBlur(gray, 3)

# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# load the image as a PIL/Pillow image, apply OCR, and then delete
# the temporary file
text = pytesseract.image_to_string(Image.open(filename))
os.remove(filename)
print(text)

# show the original and preprocessed images until a key is pressed
cv2.imshow("Image", image)
cv2.imshow("Output", gray)
cv2.waitKey(0)
</pre>
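<p>To make the two preprocessing modes less of a black box, here is a minimal pure-Python sketch of what they compute on a tiny list of grayscale values. It is an illustration only, not OpenCV's implementation: the function names <code>otsu_threshold</code>, <code>binarize</code> and <code>median_blur_3x3</code> are hypothetical, ties between equally good thresholds are broken by taking the lowest one, and border pixels are simply left unchanged by the blur.
</p>
<pre>
```python
from statistics import median


def otsu_threshold(pixels):
    """Pick the cut t (0-255) that maximises the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(value * count for value, count in enumerate(hist))

    def between_class_variance(t):
        # dark class: values 0..t, light class: values t+1..255
        w_dark = sum(hist[: t + 1])
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            return 0.0
        sum_dark = sum(value * hist[value] for value in range(t + 1))
        mean_dark = sum_dark / w_dark
        mean_light = (total_sum - sum_dark) / w_light
        return w_dark * w_light * (mean_dark - mean_light) ** 2

    # max() returns the first (lowest) t among equally good candidates
    return max(range(256), key=between_class_variance)


def binarize(pixels, t):
    """Map every pixel above t to white (255) and the rest to black (0)."""
    return [255 if p > t else 0 for p in pixels]


def median_blur_3x3(rows):
    """Replace each interior pixel with the median of its 3x3 neighbourhood.

    Assumption for this sketch: border pixels are copied through unchanged.
    """
    out = [list(row) for row in rows]
    for y in range(1, len(rows) - 1):
        for x in range(1, len(rows[y]) - 1):
            window = [rows[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = int(median(window))
    return out


if __name__ == "__main__":
    ink_and_paper = [10, 12, 11, 200, 205, 210]  # made-up sample values
    t = otsu_threshold(ink_and_paper)
    print(t, binarize(ink_and_paper, t))

    speck = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]  # one noisy pixel
    print(median_blur_3x3(speck))
```
</pre>
<p>Thresholding suits scans where ink and paper form two clear groups of brightness values, while the median blur is aimed at isolated specks such as the single bright pixel in the second sample.
</p>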
</div></DOCUMENT_FRAGMENT></div>
</body>
</html>