Scripts used to process a book scanned through a Flatbed Scanner.
Pedro Sá Couto 33fee06e18 Update 'readme.md' 3 months ago

Flatbed_Scanner_Workflow

Getting started

This set of scripts was written for the Text Laundrette workshop. The workshop takes place in the Publication Station, WDkA building.
Rotterdam, 03-02-2020
The scripts form a workflow that turns the pictures from a Flatbed Scanner into a final OCRed PDF.

About the Workshop

DESCRIPTION

We will use a home-made, DIY book scanner, and open-source software to scan, process, and add digital features to printed texts brought by the participants to the workshop. Ultimately, we will include them in the “bootleg library”, a shadow library accessible over a local network.

Shadow libraries operate outside of legal copyright frameworks, in response to decreased open access to knowledge. This workshop aims to extend our research on libraries, their sociability, and methods by which we can add provenance to texts included in public or private, legal or extra-legal collections.

Participants should bring: a printed text, which they’d like to digitize and share.



Dependencies

Homebrew (macOS) or apt-get (Linux)

On macOS, you’ll need the command-line tools for Xcode installed.

xcode-select --install

Then install Homebrew.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

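Once you’re done, run the following command to check that Homebrew is installed and working properly:

brew doctor

Then install the packages with Homebrew. pip3 ships with Homebrew’s python3, and pdfunite is part of poppler. pytesseract also needs the Tesseract program itself.

brew install python3 imagemagick poppler tesseract

On Linux, install the packages with apt-get instead. Here pdfunite is part of poppler-utils.

sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr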


PIP3

sudo pip3 install pdf2image Pillow opencv-python pytesseract

The time and logging modules are part of the Python standard library and do not need to be installed.
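
Before running the workflow it can help to check that everything is in place. The sketch below is not part of the repository. It assumes the command-line tools are called convert (ImageMagick), pdftoppm and pdfunite (Poppler) and tesseract, and that the Python packages import as pdf2image, PIL, cv2 and pytesseract; adjust the names if your installation differs.

```python
import importlib.util
import shutil

# Assumed names: adjust them to match your installation.
COMMANDS = ["convert", "pdftoppm", "pdfunite", "tesseract"]
MODULES = ["pdf2image", "PIL", "cv2", "pytesseract"]


def missing_commands(commands):
    """Return the commands that cannot be found on the PATH."""
    return [name for name in commands if shutil.which(name) is None]


def missing_modules(modules):
    """Return the Python modules that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def report():
    """Print anything that still needs to be installed."""
    for name in missing_commands(COMMANDS):
        print("missing command:", name)
    for name in missing_modules(MODULES):
        print("missing Python module:", name)
```

Calling report() prints one line per missing tool or module and prints nothing when everything was found.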


How to use

Your scans must be laid out like this for the scripts to work correctly.

                               RIGHT PAGE
                         —————————————————————
                        |                     |
                        |——————————           |
                        |           |         |
                        |           |         |
                        |           |         |
                        |           |         |
                        |           |         |
                        |        01 |         |
                        |——————————           |
                        |                     |
                         —————————————————————

  LEFT PAGE                RIGHT PAGE
 —————————————————————   —————————————————————
|                     | |                     |
|           ——————————| |——————————           |
|         |           | |           |         |
|         |           | |           |         |
|         |           | |           |         |
|         |           | |           |         |
|         |           | |           |         |
|         | 02        | |        03 |         |
|          —————————— | |——————————           |
|                     | |                     |
 —————————————————————   —————————————————————

Add your pictures from the book scanner to the "scans" folder.

Make all the files executable.

chmod +x merge_scans.sh workshop_stream.sh merge_files.sh

To skip any of the scripts, comment out its line in workshop_stream.sh.

Run ./workshop_stream.sh

Wait :)
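
workshop_stream.sh chains the individual scripts. As a rough illustration of that idea, and not a copy of the actual script, the sketch below drives the same steps from Python. The step labels (make_dirs, merge_scans and so on) are hypothetical, and the skip argument plays the role of commenting out a line in the shell script.

```python
import subprocess

# (label, command) pairs in the order this readme lists the steps.
# The labels are made up for this sketch.
STEPS = [
    ("make_dirs", ["mkdir", "-p", "split", "ocred", "cropped"]),
    ("merge_scans", ["./merge_scans.sh"]),
    ("burst", ["python3", "burstpdf.py"]),
    ("crop", ["python3", "bounding_box.py"]),
    ("ocr", ["python3", "tesseract_ocr.py"]),
    ("merge_files", ["./merge_files.sh"]),
]


def plan(skip=()):
    """Return the commands to run, leaving out the steps named in skip."""
    return [command for label, command in STEPS if label not in skip]


def run(skip=()):
    """Run each planned command in turn and stop at the first failure."""
    for command in plan(skip):
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

For example, run(skip={"ocr"}) would run every step except the OCR one. Using check=True stops the chain as soon as one step fails, so later steps never work on incomplete output.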



Additional information

The workflow runs these steps in the following order:

Create 3 directories

mkdir split
mkdir ocred
mkdir cropped
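
Plain mkdir reports an error when a directory already exists, which happens on every run after the first. A minimal Python sketch of the same step that can be repeated safely follows; the function name make_work_dirs is an assumption for illustration.

```python
from pathlib import Path

# The three working directories named in this readme.
WORK_DIRS = ["split", "ocred", "cropped"]


def make_work_dirs(base="."):
    """Create the working directories under base if they do not exist yet."""
    created = []
    for name in WORK_DIRS:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # no error on a second run
        created.append(path)
    return created
```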

Merge the files in the directory scans

All the scans are appended to a single PDF called out.pdf.

./merge_scans.sh
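
The contents of merge_scans.sh are not shown here. As one possible way to do the same thing in Python, the sketch below uses Pillow to write all the scans into one multi-page PDF. The natural sort, the accepted file extensions and the location of out.pdf are assumptions.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that orders 'scan2.jpg' before 'scan10.jpg'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def list_scans(folder="scans", extensions=(".jpg", ".jpeg", ".png")):
    """Return the image files in folder, in natural page order."""
    files = [p for p in Path(folder).iterdir()
             if p.suffix.lower() in extensions]
    return sorted(files, key=lambda p: natural_key(p.name))


def merge_scans(folder="scans", output="out.pdf"):
    """Append every scan to a single PDF, one scan per page."""
    from PIL import Image  # Pillow, imported here so the helpers work without it

    pages = [Image.open(p).convert("RGB") for p in list_scans(folder)]
    if not pages:
        raise ValueError("no scans found in " + str(folder))
    # save_all with append_images writes a multi-page PDF.
    pages[0].save(output, save_all=True, append_images=pages[1:])
```

The natural sort matters when the scanner numbers files without leading zeros: a plain alphabetical sort would put scan10.jpg before scan2.jpg and shuffle the pages.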

Burst the PDF in scans

Split this PDF into single pages, naming the files so that they can be iterated over later.

python3 burstpdf.py
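
A minimal sketch of what such a burst step could look like with pdf2image follows. It is an illustration under assumptions, not the contents of burstpdf.py: the input path, the split output folder, the page_NNN.jpg naming and the dpi value are all guesses.

```python
from pathlib import Path


def page_name(index, total, prefix="page", extension=".jpg"):
    """Zero-padded file name so that alphabetical order equals page order."""
    width = max(3, len(str(total)))
    return f"{prefix}_{index:0{width}d}{extension}"


def burst_pdf(pdf_path="out.pdf", out_dir="split", dpi=300):
    """Write every page of pdf_path to out_dir as a separate JPEG."""
    from pdf2image import convert_from_path  # needs Poppler installed

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    written = []
    for number, page in enumerate(pages, start=1):
        target = out / page_name(number, len(pages))
        page.save(target, "JPEG")
        written.append(target)
    return written
```

Zero-padding the page number is what makes the files easy to iterate later: sorting the names alphabetically then gives the pages in reading order.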

Cropping the bounding boxes

The pages are now in their original position, but they have a bounding box. This script iterates through the pages and crops each one to the highest-contrast area it finds.

python3 bounding_box.py
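
One common way to find a high-contrast region with OpenCV is to threshold the image and keep the largest contour. The sketch below uses that approach (Otsu thresholding plus the largest external contour); it may differ from what bounding_box.py actually does, and it assumes a light page on a darker background. The function names are hypothetical.

```python
def largest_box(boxes):
    """Pick the (x, y, w, h) box with the largest area, or None if empty."""
    return max(boxes, key=lambda box: box[2] * box[3], default=None)


def crop_page(in_path, out_path):
    """Crop an image to the largest bright region found in it."""
    import cv2  # opencv-python

    image = cv2.imread(str(in_path))
    if image is None:
        raise FileNotFoundError(in_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu's method chooses the threshold between dark and bright pixels.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # [-2] works with both the OpenCV 3 and OpenCV 4 return values.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    box = largest_box([cv2.boundingRect(c) for c in contours])
    if box is None:
        return False  # nothing found, leave the page alone
    x, y, w, h = box
    cv2.imwrite(str(out_path), image[y:y + h, x:x + w])
    return True
```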

OCR

This step runs OCR on the JPG images, turning each one into a PDF.

python3 tesseract_ocr.py
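
With pytesseract, turning an image into a PDF with a text layer is a single call to image_to_pdf_or_hocr. The sketch below shows that call in a loop. The cropped and ocred folder names follow the directories created earlier, while the function names, the *.jpg pattern and the eng language are assumptions rather than what tesseract_ocr.py necessarily uses.

```python
from pathlib import Path


def pdf_target(image_path, out_dir="ocred"):
    """Map cropped/page_001.jpg to ocred/page_001.pdf."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_folder(in_dir="cropped", out_dir="ocred", lang="eng"):
    """OCR every JPG in in_dir and write one PDF per page to out_dir."""
    import pytesseract  # needs the tesseract program on the PATH

    Path(out_dir).mkdir(exist_ok=True)
    written = []
    for image_path in sorted(Path(in_dir).glob("*.jpg")):
        # Returns the PDF as bytes, with the recognised text embedded.
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(
            str(image_path), lang=lang, extension="pdf")
        target = pdf_target(image_path, out_dir)
        target.write_bytes(pdf_bytes)
        written.append(target)
    return written
```

For a book in another language, pass the matching Tesseract language code as lang, provided that language data is installed.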

Merge all the files and create the pdf

The OCRed pages are now joined into the final PDF. Your book is ready :)

./merge_files.sh
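
pdfunite takes the input PDFs in order, followed by the output file as the last argument. A sketch of this final step in Python, assuming the pages sit in ocred with names that sort into page order, and using book.pdf as a made-up output name:

```python
import subprocess
from pathlib import Path


def pdfunite_command(pdf_paths, output="book.pdf"):
    """Build the pdfunite call: inputs in sorted order, output file last."""
    inputs = sorted(str(p) for p in pdf_paths)
    if not inputs:
        raise ValueError("no PDF pages to merge")
    return ["pdfunite", *inputs, str(output)]


def merge_pages(in_dir="ocred", output="book.pdf"):
    """Join every PDF in in_dir into one file with pdfunite."""
    command = pdfunite_command(Path(in_dir).glob("*.pdf"), output)
    subprocess.run(command, check=True)
    return output
```

Passing the arguments as a list, rather than one shell string, keeps file names with spaces intact.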



License

The package is available as open source under the terms of the MIT License.