From 87b4535b83440787a63aaeff16133725981fc1ee Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pedro=20S=C3=A1=20Couto?=
Date: Thu, 30 Jan 2020 17:41:33 +0100
Subject: [PATCH] needs test

---
 .DS_Store          | Bin 0 -> 8196 bytes
 bounding_box.py    |  34 ++++++++++
 burstpdf.py        |  43 +++++++++++++
 chmod.sh           |   3 +
 merge_files.sh     |   7 ++
 merge_scans.sh     |   7 ++
 readme.md          | 155 +++++++++++++++++++++++++++++++++++++++++++++
 scans/.DS_Store    | Bin 0 -> 6148 bytes
 tesseract_ocr.py   |  22 +++++++
 workshop_stream.sh |   8 +++
 10 files changed, 279 insertions(+)
 create mode 100644 .DS_Store
 create mode 100644 bounding_box.py
 create mode 100755 burstpdf.py
 create mode 100755 chmod.sh
 create mode 100755 merge_files.sh
 create mode 100755 merge_scans.sh
 create mode 100644 readme.md
 create mode 100644 scans/.DS_Store
 create mode 100755 tesseract_ocr.py
 create mode 100755 workshop_stream.sh

diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..c480601b72d45d38c2f8a4d1a7e847409a194013
GIT binary patch
literal 8196

+# Flatbed_Scanner_Workflow
+
+## Getting started
+
+This set of scripts was written for the Text Laundrette workshop. The workshop takes place in the Publication Station, WDkA building.
Rotterdam, 03-02-2020
It is a workflow to turn the pictures from a Flatbed Scanner into a final OCRed PDF.
+
+## About the Workshop
+
+DESCRIPTION
+

We will use a home-made, DIY book scanner, and open-source software to scan, process, and add digital features to printed texts brought by the participants to the workshop. Ultimately, we will include them in the “bootleg library”, a shadow library accessible over a local network.

+ +

Shadow libraries operate outside of legal copyright frameworks, in response to decreased open access to knowledge. This workshop aims to extend our research on libraries, their sociability, and methods by which we can add provenance to texts included in public or private, legal or extra-legal collections.

+ +

Participants should bring: a printed text, which they’d like to digitize and share.

+ +

+## Dependencies
+### Brew (Mac) or apt-get (Linux)
+

You’ll need the command-line tools for Xcode installed.

+
+```bash
+xcode-select --install
+```
+
+

Afterwards, install Homebrew.

+
+```bash
+ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
+```
+
+

Run the following command once you’re done to ensure Homebrew is installed and working properly:

+
+```bash
+brew doctor
+```
+
+```bash
+sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr
+```
+
+```bash
+brew install python3 imagemagick poppler tesseract
+```
+
+### PIP3
+
+```bash
+sudo pip3 install pdf2image Pillow opencv-python pytesseract
+```
+
+
+## How to use
+

Your scans must be laid out like this for the scripts to work correctly.

+
+          RIGHT PAGE
+  —————————————————————
+ |                     |
+ |——————————           |
+ |         |           |
+ |         |           |
+ |         |           |
+ |         |           |
+ |         |           |
+ |      01 |           |
+ |——————————           |
+ |                     |
+  —————————————————————
+
+       LEFT PAGE                RIGHT PAGE
+ —————————————————————   —————————————————————
+|                     | |                     |
+|           ——————————| |——————————           |
+|           |         | |         |           |
+|           |         | |         |           |
+|           |         | |         |           |
+|           |         | |         |           |
+|           |         | |         |           |
+|           | 02      | |      03 |           |
+|           ——————————| |——————————           |
+|                     | |                     |
+ —————————————————————   —————————————————————
+
+
+

Add your pictures from the book scanner to the folder "scans/".

+ +

Make all the files executable.

+
+```bash
+sudo chmod 777 merge_scans.sh workshop_stream.sh merge_files.sh
+```
+
+

To skip any of the scripts, comment out its line in the shell script workshop_stream.sh.

+ +

Run ./workshop_stream.sh

+ + +

Wait :)

+ +

+## Additional information
+The workflow runs these scripts in the following order:
+
+### Create 5 directories
+
+```bash
+mkdir split
+mkdir rotated
+mkdir ocred
+mkdir bounding_box
+mkdir cropped
+```
+### Merge the files in the directory scans
+

All the scans will be appended to one PDF called out.pdf.

+```bash
+./merge_scans.sh
+```
+
+### Burst the pdf in scans
+

Burst this PDF into single pages, renaming the files so they can be iterated over later.
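
As a rough illustration of this step, the sketch below splits a PDF into numbered JPEGs with `pdf2image`. It is not the content of burstpdf.py: the input path `scans/out.pdf`, the output folder `split`, and the helper names `page_name` and `burst_pdf` are assumptions. Only the `page%i` naming pattern is taken from tesseract_ocr.py.

```python
from pathlib import Path


def page_name(number, suffix=".jpg"):
    """Build the file name for a page, e.g. page3.jpg (pattern used by tesseract_ocr.py)."""
    return f"page{number}{suffix}"


def burst_pdf(pdf="scans/out.pdf", out_dir="split"):
    """Write every page of the PDF as a numbered JPEG (paths are assumptions)."""
    # Imported here so page_name() stays usable without pdf2image installed.
    from pdf2image import convert_from_path  # requires poppler on the system

    Path(out_dir).mkdir(exist_ok=True)
    # convert_from_path returns one Pillow image per PDF page.
    for number, image in enumerate(convert_from_path(pdf), start=1):
        image.save(Path(out_dir) / page_name(number), "JPEG")
```

Numbering from 1 keeps the file names in step with the `i = 1` counter that the OCR script starts from.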

+```bash
+python3 burstpdf.py
+```
+
+### Rotate the pdfs
+

The book scanner takes pictures of the pages; this script iterates through the odd and even pages, rotating them back to their original orientation.
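
The odd/even logic could look roughly like the sketch below. rotation.py is not part of this patch, so the angles (90 degrees for odd pages, 270 for even pages) and the function names are placeholders to adjust to how your scanner is mounted.

```python
def rotation_angle(page_number, odd_angle=90, even_angle=270):
    """Return the counter-clockwise angle in degrees assumed for a page.

    The default angles are hypothetical; swap them if your pages come out upside down.
    """
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return odd_angle if page_number % 2 == 1 else even_angle


def rotate_page(src, dst, page_number):
    """Rotate one image file with Pillow and save the result (paths are hypothetical)."""
    # Imported here so rotation_angle() stays usable without Pillow installed.
    from PIL import Image

    with Image.open(src) as img:
        # expand=True enlarges the canvas so the rotated page is not clipped.
        img.rotate(rotation_angle(page_number), expand=True).save(dst)
```

Keeping the angle choice in its own function makes it easy to check the odd/even rule without opening any image.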

+```bash
+python3 rotation.py
+```
+
+### Cropping the bounding boxes
+

The pages are now in their original position, but they have a bounding box. This script iterates through them and crops the highest contrast area found.
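
To make the idea of a bounding box concrete, here is a small sketch in plain Python, without OpenCV. It is a simplified stand-in, not what bounding_box.py does: it treats the page as a grid of grayscale values and returns the smallest rectangle containing every pixel at or above a threshold. The function name and the threshold of 128 are assumptions.

```python
def bright_bounding_box(gray, threshold=128):
    """Return (left, top, right, bottom) around all pixels >= threshold, or None.

    `gray` is a list of rows, each a list of 0-255 values.
    """
    rows = [y for y, row in enumerate(gray) if any(v >= threshold for v in row)]
    cols = [x for row in gray for x, v in enumerate(row) if v >= threshold]
    if not rows:
        return None  # nothing bright enough on this page
    # right and bottom are exclusive, like the box passed to Pillow's Image.crop
    return (min(cols), rows[0], max(cols) + 1, rows[-1] + 1)
```

For example, a 4x4 grid whose bright pixels sit at columns 1 to 2 and rows 1 to 2 yields the box `(1, 1, 3, 3)`.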

+```bash
+python3 bounding_box.py
+```
+
+### Cropping the mirror
+

The pages are now cropped, but the mirror is still visible in the middle.

+```bash
+python3 mirror_crop.py
+```
+
+### OCR
+

In this step we OCR the JPG images, turning them into PDFs.
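
One possible variant of this step is sketched below. Instead of counting up until a file is missing, it lists the page images that exist and sorts them by number. The helper names `numbered_pages` and `ocr_pages` are hypothetical; the folders `cropped` and `ocred` and the call to `pytesseract.image_to_pdf_or_hocr` follow tesseract_ocr.py.

```python
import re
from pathlib import Path


def numbered_pages(folder, suffix=".jpg"):
    """Return the pageN files in a folder, sorted by N (page2 before page10)."""
    pages = []
    for path in Path(folder).glob(f"page*{suffix}"):
        match = re.fullmatch(r"page(\d+)", path.stem)
        if match:
            pages.append((int(match.group(1)), path))
    return [path for _, path in sorted(pages)]


def ocr_pages(src="cropped", dst="ocred", lang="eng"):
    """OCR every page image in src and write one searchable PDF per page to dst."""
    # Imported here so numbered_pages() stays usable without these packages.
    from PIL import Image
    import pytesseract  # needs the tesseract binary installed

    Path(dst).mkdir(exist_ok=True)
    for page in numbered_pages(src):
        with Image.open(page) as img:
            pdf = pytesseract.image_to_pdf_or_hocr(img, lang=lang, extension="pdf")
        (Path(dst) / f"{page.stem}.pdf").write_bytes(pdf)
```

Listing the files up front means a real OCR error surfaces as an exception rather than being read as "no more pages".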

+```bash
+python3 tesseract_ocr.py
+```
+
+### Merge all the files and create the pdf
+

The OCRed pages are now joined into the final PDF. Your book is ready :)

+```bash
+./merge_files.sh
+```
+

+## License
+The package is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
diff --git a/scans/.DS_Store b/scans/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..9465352d05d40ff1d264409149bc6c8af76b4118
GIT binary patch
literal 6148

diff --git a/tesseract_ocr.py b/tesseract_ocr.py
new file mode 100755
index 0000000..2e91780
--- /dev/null
+++ b/tesseract_ocr.py
@@ -0,0 +1,22 @@
+# import libraries
+from PIL import Image
+import pytesseract
+import time
+
+i = 1
+
+while True:
+    try:
+        img = Image.open("cropped/page%i.jpg"%i)
+        print(img)
+        pdf = pytesseract.image_to_pdf_or_hocr(img, lang="eng", extension='pdf')
+        time.sleep(1)
+        file = open(("ocred/page%i.pdf"%i), "w+b")
+        file.write(bytearray(pdf))
+        file.close()
+        i+=1
+        print(i)
+
+    except:
+        print("All pages must be ready!")
+        break
diff --git a/workshop_stream.sh b/workshop_stream.sh
new file mode 100755
index 0000000..24c5cfe
--- /dev/null
+++ b/workshop_stream.sh
@@ -0,0 +1,8 @@
+mkdir split
+mkdir ocred
+mkdir cropped
+./merge_scans.sh
+python3 burstpdf.py
+python3 bounding_box.py
+python3 tesseract_ocr.py
+./merge_files.sh