DIY Book Scanner Workflow

Getting started

This set of scripts was written for the Text Laundrette workshop.
It is a workflow to turn the pictures from the DIY Book Scanner into a final OCRed PDF.

To skip any of the scripts, comment out the corresponding line in the shell script workshop_stream.sh.

##Dependencies

###Brew (MAC) or apt-get (LINUX)

On a Mac, you’ll need the command-line tools for Xcode installed.

xcode-select --install

Next, install Homebrew.

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Run the following command once you’re done to ensure Homebrew is installed and working properly:

brew doctor

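On Linux, install the packages with apt-get. pdfunite ships with poppler-utils, and pytesseract needs the Tesseract binary (tesseract-ocr):

sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr

On a Mac, install them with Homebrew. pip3 comes with python3, and pdfunite comes with poppler:

brew install python3 imagemagick poppler tesseract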

###PIP3

sudo pip3 install pdf2image Pillow opencv-python pytesseract

The time and logging modules are part of the Python standard library and do not need to be installed.
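Before running the workflow, it can help to check that everything is installed. The following is a minimal sketch, not part of the original scripts: the tool names in `TOOLS` are assumptions about which executables the steps call, and `missing_tools` and `missing_modules` are hypothetical helper names.

```python
import importlib.util
import shutil

# Command-line tools the steps appear to rely on (assumed names, adjust as needed).
TOOLS = ["convert", "pdfunite", "tesseract"]

# Import names of the pip packages: Pillow imports as PIL, opencv-python as cv2.
MODULES = ["pdf2image", "PIL", "cv2", "pytesseract"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules):
    """Return the top-level modules that Python cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

Calling `missing_tools(TOOLS)` and `missing_modules(MODULES)` returns two lists; if both are empty, the dependencies listed here are in place.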

##How to use

Add your pictures from the book scanner to the folder "scans".

Make all the files executable.

chmod +x merge_scans.sh workshop_stream.sh merge_files.sh

Run ./workshop_stream.sh

Wait :)
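The contents of workshop_stream.sh are not shown here, so the following is only a sketch of the same idea in Python: run the steps in the order this README describes them, and skip some by name instead of commenting lines out. `STEPS`, `planned_steps` and `run_pipeline` are hypothetical names.

```python
import subprocess
from pathlib import Path

# Steps in the order described in this README.
STEPS = [
    ["./merge_scans.sh"],
    ["python3", "burstpdf.py"],
    ["python3", "rotation.py"],
    ["python3", "bounding_box.py"],
    ["python3", "mirror_crop.py"],
    ["python3", "tesseract_ocr.py"],
    ["./merge_files.sh"],
]


def planned_steps(skip=()):
    """Return the commands to run, leaving out scripts whose file name is in skip."""
    return [cmd for cmd in STEPS if Path(cmd[-1]).name not in skip]


def run_pipeline(skip=()):
    """Run each remaining step in order and stop at the first failure."""
    for cmd in planned_steps(skip):
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline(skip={"rotation.py"})` would run everything except the rotation step.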

##Additional information

###Create 5 directories

mkdir split
mkdir rotated
mkdir ocred
mkdir bounding_box
mkdir cropped
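The same five directories can also be created from Python. This is a small sketch using only the standard library; `create_work_dirs` is a hypothetical helper, not one of the original scripts.

```python
from pathlib import Path

# The five working directories used by the workflow.
WORK_DIRS = ["split", "rotated", "ocred", "bounding_box", "cropped"]


def create_work_dirs(base="."):
    """Create the working directories under base and return their paths."""
    created = []
    for name in WORK_DIRS:
        path = Path(base) / name
        # exist_ok=True makes the call safe to repeat on later runs.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Unlike plain `mkdir`, this does not fail if a directory already exists from an earlier run.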

###Merge the files in the directory scans

All the scans are appended to a single PDF called out.pdf.

```bash
./merge_scans.sh
```
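What merge_scans.sh does internally is not shown here. One plausible way to get the same result, assuming the scans are JPG files and ImageMagick's `convert` command is available, is sketched below. `build_merge_command` and `merge_scans` are hypothetical names.

```python
import subprocess
from pathlib import Path


def build_merge_command(scan_dir="scans", output="out.pdf"):
    """Build an ImageMagick command that appends all JPG scans to one PDF."""
    scans = sorted(
        p for p in Path(scan_dir).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )
    if not scans:
        raise FileNotFoundError(f"no JPG scans found in {scan_dir}")
    return ["convert", *map(str, scans), str(output)]


def merge_scans(scan_dir="scans", output="out.pdf"):
    """Run the merge command and raise if ImageMagick reports an error."""
    subprocess.run(build_merge_command(scan_dir, output), check=True)
```

The files are sorted by name, so page order depends on the scanner's file naming. Names without zero padding (2.jpg, 10.jpg) would sort in the wrong order.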

###Burst the pdf in scans

Burst this PDF into single pages, renaming the files so they can be iterated over later.

```bash
python3 burstpdf.py
```
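As an illustration of this step, here is a sketch that uses pdf2image to split a PDF into numbered JPG pages. It is not the code of burstpdf.py: the file naming scheme, the `dpi` value and the function names are assumptions.

```python
from pathlib import Path


def page_filename(index, prefix="page", digits=4):
    """Return a zero-padded name so the files sort in page order."""
    return f"{prefix}_{index:0{digits}d}.jpg"


def burst_pdf(pdf_path="out.pdf", out_dir="split", dpi=200):
    """Render every page of the PDF to its own JPG in out_dir."""
    # Imported here so page_filename can be used without pdf2image installed.
    from pdf2image import convert_from_path  # requires poppler

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # convert_from_path returns one PIL image per page.
    for index, page in enumerate(convert_from_path(str(pdf_path), dpi=dpi), start=1):
        target = out / page_filename(index)
        page.save(target, "JPEG")
        written.append(target)
    return written
```

Zero-padded numbers keep the pages in order when later steps iterate over a sorted file list.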

###Rotate the pdfs

The book scanner takes pictures of the pages. This script iterates through the odd and even pages, rotating them back to their original position.

```bash
python3 rotation.py
```
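The odd/even logic could look roughly like the sketch below, which uses Pillow. The angles are assumptions (odd pages turned 90 degrees one way, even pages the other); the real values depend on how the cameras are mounted, and rotation.py may work differently.

```python
def rotation_for_page(page_number, odd_angle=90, even_angle=-90):
    """Pick the rotation angle in degrees (counter-clockwise) for a page."""
    return odd_angle if page_number % 2 == 1 else even_angle


def rotate_page(src, dst, page_number):
    """Rotate one page image and save the result to dst."""
    # Imported here so rotation_for_page can be used without Pillow installed.
    from PIL import Image

    with Image.open(src) as image:
        # expand=True grows the canvas so the rotated page is not clipped.
        rotated = image.rotate(rotation_for_page(page_number), expand=True)
        rotated.save(dst)
```

If the pages come out upside down, swapping `odd_angle` and `even_angle` should correct them.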

###Cropping the bounding boxes

The pages are now in their original position, but they have a bounding box. This script iterates through them and crops the highest contrast area found.

```bash
python3 bounding_box.py
```
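One common way to approximate "crop the highest contrast area" with OpenCV is to threshold the image and crop to the largest contour. The sketch below takes that approach as an assumption; bounding_box.py may use a different method, and `largest_box` and `crop_to_page` are hypothetical names.

```python
def largest_box(boxes):
    """Return the (x, y, w, h) box with the largest area."""
    return max(boxes, key=lambda box: box[2] * box[3])


def crop_to_page(src, dst):
    """Crop src to its largest bright region and write the result to dst."""
    import cv2  # from opencv-python, imported lazily

    image = cv2.imread(str(src))
    if image is None:
        raise FileNotFoundError(f"could not read image: {src}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu's method picks the threshold between dark background and bright page.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        cv2.imwrite(str(dst), image)  # nothing found, keep the page unchanged
        return None
    x, y, w, h = largest_box([cv2.boundingRect(c) for c in contours])
    cv2.imwrite(str(dst), image[y:y + h, x:x + w])
    return (x, y, w, h)
```

This assumes a bright page on a darker background. On scans with uneven lighting the threshold may need tuning.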

###Cropping the mirror

The pages are now cropped, but the mirror is still visible in the middle.

```bash
python3 mirror_crop.py
```
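A simple way to remove the mirror is to cut a fixed strip from the inner edge of each page. The sketch below assumes the mirror appears on the left edge of odd pages and the right edge of even pages, and that a fixed fraction of the width is enough. Both are guesses, not taken from mirror_crop.py.

```python
def mirror_crop_box(width, height, page_number, strip_fraction=0.1):
    """Return a Pillow crop box (left, upper, right, lower) without the mirror strip."""
    strip = int(width * strip_fraction)
    if page_number % 2 == 1:
        # Assumed: odd pages show the mirror along their left edge.
        return (strip, 0, width, height)
    # Assumed: even pages show the mirror along their right edge.
    return (0, 0, width - strip, height)


def crop_mirror(src, dst, page_number, strip_fraction=0.1):
    """Crop the mirror strip from one page image and save it to dst."""
    from PIL import Image  # imported lazily, requires Pillow

    with Image.open(src) as image:
        box = mirror_crop_box(image.width, image.height, page_number, strip_fraction)
        image.crop(box).save(dst)
```

If the wrong side is cut, the two branches in `mirror_crop_box` can be swapped.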

###OCR

This step runs OCR on the JPG images, turning them into PDFs.

```bash
python3 tesseract_ocr.py
```
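With pytesseract, turning an image into an OCRed PDF can be done with `image_to_pdf_or_hocr`. The sketch below shows that call. The directory names follow this README, while the function names and the `lang="eng"` default are assumptions and may not match tesseract_ocr.py.

```python
from pathlib import Path


def ocr_output_path(image_path, out_dir="ocred"):
    """Map an image path to the PDF path it will be written to."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_to_pdf(image_path, out_dir="ocred", lang="eng"):
    """OCR one image with Tesseract and write the result as a PDF."""
    import pytesseract  # imported lazily, requires the tesseract binary

    # Returns the PDF as bytes, with the recognized text embedded.
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        str(image_path), extension="pdf", lang=lang
    )
    target = ocr_output_path(image_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(pdf_bytes)
    return target
```

For books in other languages, `lang` can be set to any language pack installed for Tesseract.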

###Merge all the files and create the pdf

The OCRed pages are now joined into the final PDF. Your book is ready :)

```bash
./merge_files.sh
```
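Since pdfunite is listed in the dependencies, the final merge probably relies on it. The sketch below builds such a call from Python. The output name `book.pdf` and the function names are hypothetical, and merge_files.sh itself is not shown here.

```python
import subprocess
from pathlib import Path


def build_unite_command(pdf_dir="ocred", output="book.pdf"):
    """Build a pdfunite command joining all PDFs in pdf_dir, sorted by name."""
    pages = sorted(Path(pdf_dir).glob("*.pdf"))
    if not pages:
        raise FileNotFoundError(f"no PDF pages found in {pdf_dir}")
    # pdfunite takes the source files first and the destination file last.
    return ["pdfunite", *map(str, pages), str(output)]


def merge_pages(pdf_dir="ocred", output="book.pdf"):
    """Run pdfunite and raise if it reports an error."""
    subprocess.run(build_unite_command(pdf_dir, output), check=True)
```

As with the earlier steps, the page order follows the sorted file names.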

License

The package is available as open source under the terms of the MIT License.
