
DIY Book Scanner Workflow

Getting started

This set of scripts was written for the Text Laundrette workshop.
It is a workflow to turn the pictures from the DIY Book Scanner into a final OCRed PDF.

To skip any of the scripts, comment out the corresponding line in the shell script workshop_stream.sh.

## Dependencies

### Brew (macOS) or apt-get (Linux)

You'll need the command-line tools for Xcode installed.

xcode-select --install

Then install Homebrew.

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Run the following command once you're done to ensure Homebrew is installed and working properly:

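brew doctor

Then install the system packages. On Linux:

sudo apt-get install python3 python3-pip imagemagick poppler-utils

On macOS:

brew install python3 imagemagick poppler

pdfunite is part of Poppler (poppler-utils on apt), so it needs no separate package, and pip3 comes with Homebrew's python3. pytesseract also needs the Tesseract engine itself (tesseract-ocr on apt, tesseract on Homebrew).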

### PIP3

sudo pip3 install pdf2image Pillow opencv-python pytesseract

time and logging are part of the Python standard library and do not need to be installed.

## How to use

Add your pictures from the book scanner to the "scans" folder.

Make all the files executable.

chmod +x merge_scans.sh workshop_stream.sh merge_files.sh

Run ./workshop_stream.sh

Wait :)

## Additional information

### Create 5 directories

mkdir split
mkdir rotated
mkdir ocred
mkdir bounding_box
mkdir cropped
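
The same five directories can also be created from Python, which has the advantage that a second run does not fail when they already exist. This is a minimal sketch: the function name `make_work_dirs` and its `base` parameter are illustrative and not part of the original scripts.

```python
from pathlib import Path

# Directory names taken from the mkdir commands above.
WORK_DIRS = ["split", "rotated", "ocred", "bounding_box", "cropped"]


def make_work_dirs(base="."):
    """Create the working directories under `base`; existing ones are left alone."""
    created = []
    for name in WORK_DIRS:
        path = Path(base) / name
        # exist_ok=True makes the call safe to repeat, like `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_work_dirs()` from the repository root gives the same result as the five `mkdir` commands.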

### Merge the files in the directory scans

All the scans will be appended to one PDF called out.pdf

```bash
./merge_scans.sh
```

### Burst the PDF in scans

Burst this PDF into single pages, renaming the files so they can be iterated over in order later.

```bash
python3 burstpdf.py
```
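
As an illustration of why the renaming matters, here is a sketch of a burst step that gives each page a zero-padded name. It assumes pdf2image does the splitting and that pages are written as JPEGs into `split`; burstpdf.py itself may work differently, and the names `page_filename` and `burst_pdf` are made up for this example.

```python
from pathlib import Path


def page_filename(index, total, prefix="page", ext="jpg"):
    """Zero-padded name so that alphabetical order equals page order."""
    width = max(3, len(str(total)))
    return f"{prefix}_{index:0{width}d}.{ext}"


def burst_pdf(pdf_path="out.pdf", out_dir="split"):
    """Write every page of `pdf_path` to `out_dir` as a numbered JPEG."""
    # Imported here so page_filename() works without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path)  # list of PIL images, one per page
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for number, page in enumerate(pages, start=1):
        page.save(out / page_filename(number, len(pages)), "JPEG")
    return len(pages)
```

Without the padding, `page_10.jpg` would sort before `page_2.jpg` and later steps would process the pages out of order.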

### Rotate the pages

The book scanner photographs the pages rotated; this script iterates through the odd and even pages, rotating them back to their original position.

```bash
python3 rotation.py
```
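
The odd/even logic can be sketched as follows. The angles are an assumption (odd pages 90 degrees counter-clockwise, even pages 90 degrees clockwise); the right values depend on how the cameras are mounted, and rotation.py may use different ones. Both function names are hypothetical.

```python
def rotation_for_page(page_number, odd_angle=90, even_angle=-90):
    """Return the counter-clockwise rotation in degrees for a 1-based page number.

    The default angles are placeholders; swap them if your pages end up upside down.
    """
    return odd_angle if page_number % 2 == 1 else even_angle


def rotate_page(src, dst, page_number):
    """Rotate the image at `src` and save the result to `dst`."""
    # Pillow is imported lazily so rotation_for_page() has no dependencies.
    from PIL import Image

    with Image.open(src) as img:
        # expand=True enlarges the canvas so a 90-degree turn does not crop the page.
        img.rotate(rotation_for_page(page_number), expand=True).save(dst)
```

Keeping the angle choice in its own function makes it easy to check the parity rule before touching any image.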

### Cropping the bounding boxes

The pages are now in their original position, but they have a bounding box. This script iterates through them and crops each one to the highest-contrast area found.

```bash
python3 bounding_box.py
```
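
To make the idea of cropping to a bounding box concrete, the sketch below finds the smallest rectangle that contains every bright pixel. It is a simplification with no dependencies: it works on a plain 2D list of grayscale values with a fixed threshold, whereas bounding_box.py presumably works on real images (opencv-python is in the dependency list) and may detect the area differently. The function names and the threshold of 128 are assumptions.

```python
def bright_bounding_box(pixels, threshold=128):
    """Return (left, top, right, bottom) around all pixels brighter than `threshold`.

    `right` and `bottom` are exclusive. Returns None if no pixel qualifies.
    """
    rows = [y for y, row in enumerate(pixels) if any(v > threshold for v in row)]
    if not rows:
        return None
    width = len(pixels[0])
    cols = [x for x in range(width) if any(row[x] > threshold for row in pixels)]
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def crop(pixels, box):
    """Cut the rectangle `box` out of the 2D list `pixels`."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]
```

The box uses the same (left, top, right, bottom) order that Pillow's `Image.crop` expects, so the result could be passed to it directly when working with real images.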

### Cropping the mirror

The pages are now cropped, but the mirror is still visible in the middle.

```bash
python3 mirror_crop.py
```

### OCR

This step runs OCR on the JPG images, turning them into PDFs.

```bash
python3 tesseract_ocr.py
```
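
A single image can be turned into a searchable PDF with pytesseract roughly like this. The directory names (`cropped` as input, `ocred` as output), the language code and both function names are assumptions; tesseract_ocr.py may be organised differently.

```python
from pathlib import Path


def ocr_output_path(image_path, out_dir="ocred"):
    """Map e.g. cropped/page_001.jpg to ocred/page_001.pdf."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_image_to_pdf(image_path, out_dir="ocred", lang="eng"):
    """Run Tesseract on one image and write a PDF with a text layer."""
    # Imported lazily; requires both pytesseract and the Tesseract binary.
    import pytesseract

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        str(image_path), extension="pdf", lang=lang
    )
    target = ocr_output_path(image_path, out_dir)
    target.write_bytes(pdf_bytes)
    return target
```

Looping over `sorted(Path("cropped").glob("*.jpg"))` and calling `ocr_image_to_pdf` on each file would produce one PDF per page, ready for merging.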

### Merge all the files and create the PDF

The OCRed pages are now joined into the final PDF. Your book is ready :)

```bash
./merge_files.sh
```
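
For reference, the merge can be sketched in Python by building a `pdfunite` call over the sorted page files. It assumes the per-page PDFs live in `ocred`; the output name `book.pdf` and the function names are hypothetical, and merge_files.sh may use other names.

```python
import subprocess
from pathlib import Path


def pdfunite_command(pdf_dir="ocred", output="book.pdf"):
    """Build the pdfunite argument list: all input PDFs in order, then the output."""
    pages = sorted(str(p) for p in Path(pdf_dir).glob("*.pdf"))
    if not pages:
        raise FileNotFoundError(f"no PDF files found in {pdf_dir}")
    return ["pdfunite", *pages, output]


def merge_pages(pdf_dir="ocred", output="book.pdf"):
    """Join the per-page PDFs into one file; raises if pdfunite fails."""
    subprocess.run(pdfunite_command(pdf_dir, output), check=True)
```

Sorting relies on the zero-padded file names, so the pages end up in reading order.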

License

The package is available as open source under the terms of the MIT License.