Pedro Sá Couto dfa84817ca | 5 years ago | |
# DIY Book Scanner Workflow
## Getting started
This set of scripts was written for the Text Laundrette workshop.
It is a workflow to turn the pictures from the DIY Book Scanner into a final OCRed PDF.
To skip any of the scripts, comment out the corresponding line in the shell script workshop_stream.sh.
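As a rough illustration of what skipping a step means, the pipeline can be sketched in Python. The step order below follows the sections of this readme; the actual contents of workshop_stream.sh are an assumption, and `STEPS`, `plan` and `run` are hypothetical names used only for this sketch.

```python
import os
import subprocess

# Step order as described in this readme. That workshop_stream.sh runs
# exactly these commands in this order is an assumption.
STEPS = [
    ["./merge_scans.sh"],
    ["python3", "burstpdf.py"],
    ["python3", "rotation.py"],
    ["python3", "bounding_box.py"],
    ["python3", "mirror_crop.py"],
    ["python3", "tesseract_ocr.py"],
    ["./merge_files.sh"],
]


def plan(steps, skip=()):
    """Return the commands to run, leaving out any script named in skip."""
    return [cmd for cmd in steps if os.path.basename(cmd[-1]) not in skip]


def run(steps, skip=()):
    """Run the planned commands in order, stopping at the first failure."""
    for cmd in plan(steps, skip):
        subprocess.run(cmd, check=True)
```

Calling `run(STEPS, skip={"mirror_crop.py"})` would then have the same effect as commenting out that one line in the shell script.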
## Dependencies
### Brew (macOS) or apt-get (Linux)
You’ll need the command-line tools for Xcode installed.
xcode-select --install
Then install Homebrew.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Run the following command once you’re done to ensure Homebrew is installed and working properly:
brew doctor
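On Linux, install the packages with apt-get (pdfunite is part of poppler-utils, and pytesseract needs the Tesseract binary):
sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr
On macOS, install them with Homebrew (pip3 comes with python3, and pdfunite with poppler):
brew install python3 imagemagick poppler tesseract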
### PIP3
sudo pip3 install pdf2image Pillow opencv-python pytesseract
The time and logging modules ship with Python and do not need to be installed.
## How to use
Add your pictures from the book scanner to the folder "scans".
Make all the files executable.
chmod +x merge_scans.sh workshop_stream.sh merge_files.sh
Run ./workshop_stream.sh
Wait :)
## Additional information
### Create 5 directories
mkdir split
mkdir rotated
mkdir ocred
mkdir bounding_box
mkdir cropped
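The same five working directories could also be created from Python, which is convenient if the step is ever folded into one of the scripts. This is only a sketch; `WORK_DIRS` and `make_work_dirs` are hypothetical names, not part of the repository.

```python
from pathlib import Path

# The five working directories listed above.
WORK_DIRS = ["split", "rotated", "ocred", "bounding_box", "cropped"]


def make_work_dirs(base="."):
    """Create the working directories under base; existing ones are left alone."""
    created = []
    for name in WORK_DIRS:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Because of `exist_ok=True`, running it a second time does not fail, unlike a plain `mkdir` on an existing directory.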
### Merge the files in the directory scans
All the scans will be appended to one PDF called out.pdf.
```bash
./merge_scans.sh
```

### Burst the pdf in scans
Burst this PDF into single pages, renaming the files so they can be iterated over later.
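A minimal sketch of such a burst step, assuming pdf2image is used for the conversion and that the pages go to the split directory as JPEGs. The file naming scheme and the function names are hypothetical; burstpdf.py may do this differently.

```python
from pathlib import Path


def page_filename(index, out_dir="split", ext="jpg"):
    """Zero-padded name, so alphabetical order matches page order."""
    return Path(out_dir) / f"page_{index:04d}.{ext}"


def burst(pdf_path="out.pdf", out_dir="split"):
    """Write every page of pdf_path as a numbered JPEG into out_dir."""
    # Imported here so the naming helper works without pdf2image installed.
    # pdf2image itself needs poppler (see Dependencies).
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(exist_ok=True)
    for index, page in enumerate(convert_from_path(pdf_path), start=1):
        page.save(page_filename(index, out_dir), "JPEG")
```

The zero padding is what makes later iteration safe: without it, page_10 would sort before page_2.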
```bash
python3 burstpdf.py
```

### Rotate the pdfs
The book scanner takes pictures of the book's pages; this script iterates through the odd and even pages, rotating them back to their original position.
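The odd/even logic might look like the sketch below. The angles are an assumption (odd pages turned 90° counter-clockwise, even pages 90° clockwise), since they depend on how the cameras are mounted, and the function names are hypothetical. In Pillow, a positive angle rotates counter-clockwise.

```python
def rotation_angle(page_number):
    """Angle in degrees for a page; the concrete values are an assumption."""
    return 90 if page_number % 2 == 1 else -90


def rotate_page(src, dst, page_number):
    """Rotate the image at src and save the result to dst."""
    # Imported here so rotation_angle can be used without Pillow installed.
    from PIL import Image

    with Image.open(src) as img:
        # expand=True grows the canvas so the rotated page is not clipped.
        img.rotate(rotation_angle(page_number), expand=True).save(dst)
```

If the pages come out upside down, swapping the two angles in `rotation_angle` is the only change needed.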
```bash
python3 rotation.py
```

### Cropping the bounding boxes
The pages are now in their original position, but they have a bounding box. This script iterates through them and crops each page to the highest-contrast area found.
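To illustrate the idea without any dependencies, the sketch below finds the bounding box of all bright pixels in a grayscale image given as nested lists, then crops to it. It assumes a bright page on a darker background and a fixed threshold; bounding_box.py presumably works on real images with OpenCV, and both function names are hypothetical.

```python
def bright_bounding_box(gray, threshold=128):
    """Return (left, top, right, bottom) around all pixels >= threshold, or None."""
    if not gray or not gray[0]:
        return None
    rows = [y for y, row in enumerate(gray) if any(v >= threshold for v in row)]
    cols = [
        x for x in range(len(gray[0]))
        if any(row[x] >= threshold for row in gray)
    ]
    if not rows or not cols:
        return None
    # right and bottom are exclusive, like Python slices.
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(gray, box):
    """Cut the box out of the nested-list image."""
    left, top, right, bottom = box
    return [row[left:right] for row in gray[top:bottom]]
```

For example, a 4×4 image with a bright 2×2 patch in the middle yields the box `(1, 1, 3, 3)`.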
```bash
python3 bounding_box.py
```

### Cropping the mirror
The pages are now cropped, but the mirror is still visible in the middle.
```bash
python3 mirror_crop.py
```

### OCR
In this part we OCR the JPG images, turning them into PDFs.
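A minimal sketch of this step with pytesseract, whose `image_to_pdf_or_hocr` function returns the searchable PDF as bytes. The output directory ocred matches the directory list above; the function names and the default language `eng` are assumptions, and tesseract_ocr.py may be organised differently.

```python
from pathlib import Path


def ocr_output_path(image_path, out_dir="ocred"):
    """Map an input image to a PDF with the same stem inside out_dir."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_to_pdf(image_path, out_dir="ocred", lang="eng"):
    """OCR one image and write the resulting PDF into out_dir."""
    # Imported here so the path helper works without pytesseract installed.
    # pytesseract needs the Tesseract binary to be available.
    import pytesseract

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        str(image_path), extension="pdf", lang=lang
    )
    out_path = ocr_output_path(image_path, out_dir)
    out_path.write_bytes(pdf_bytes)
    return out_path
```

Keeping the page number in the file name means the OCRed PDFs sort in the same order as the images they came from.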
```bash
python3 tesseract_ocr.py
```

### Merge all the files and create the pdf
The OCRed pages are now joined into their final PDF. Your book is ready :)
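Joining the pages can be sketched as a single pdfunite call, which takes the input PDFs followed by the output file. The output name book.pdf and the function names are hypothetical, and whether merge_files.sh calls pdfunite in exactly this way is an assumption.

```python
import subprocess
from pathlib import Path


def pdfunite_command(pdf_paths, output="book.pdf"):
    """Build the pdfunite call: inputs in sorted order, output file last."""
    return ["pdfunite", *sorted(str(p) for p in pdf_paths), output]


def merge_ocred(in_dir="ocred", output="book.pdf"):
    """Join every PDF in in_dir into one file using pdfunite."""
    pages = list(Path(in_dir).glob("*.pdf"))
    subprocess.run(pdfunite_command(pages, output), check=True)
```

Sorting by file name only gives the right page order if the names are zero-padded.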
```bash
./merge_files.sh
```

## License
The package is available as open source under the terms of the MIT License.