<h1 align="center">DIY Book Scanner Workflow</h1>

## Getting started

This set of scripts was written for the Text Laundrette workshop. It is a workflow that turns the pictures taken with the DIY Book Scanner into a final OCRed PDF.

To skip any of the scripts, comment out the corresponding line in the shell script <em>workshop_stream.sh</em>.
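The contents of <em>workshop_stream.sh</em> are not reproduced in this README, so the following Python sketch is only an illustration of the same idea: the pipeline is an ordered list of steps, and skipping a script means leaving it out of the list. The step order is taken from the "Additional information" section; the `plan` and `run` helpers are hypothetical and not part of the repository.

```python
import os
import subprocess

# Step order as described in the "Additional information" section of this README.
STEPS = [
    ["./merge_scans.sh"],
    ["python3", "burstpdf.py"],
    ["python3", "rotation.py"],
    ["python3", "bounding_box.py"],
    ["python3", "mirror_crop.py"],
    ["python3", "tesseract_ocr.py"],
    ["./merge_files.sh"],
]


def plan(skip=()):
    """Return the commands to run, leaving out scripts whose file name is in `skip`."""
    return [cmd for cmd in STEPS if os.path.basename(cmd[-1]) not in skip]


def run(skip=()):
    """Run the planned steps in order and stop at the first one that fails."""
    for cmd in plan(skip):
        subprocess.run(cmd, check=True)
```

For example, `run(skip={"mirror_crop.py"})` would run every step except the mirror crop, which is the equivalent of commenting out that line in the shell script.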
## Dependencies

### Brew (macOS) or apt-get (Linux)

<p>On macOS, you’ll need the command-line tools for Xcode installed.</p>

```bash
xcode-select --install
```
<p>Afterwards, install Homebrew.</p>

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
<p>Run the following command once you’re done to ensure Homebrew is installed and working properly:</p>

```bash
brew doctor
```
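<p>On Linux, install the system packages with apt-get. <em>pdfunite</em> ships with <em>poppler-utils</em>, and <em>pytesseract</em> needs the <em>tesseract-ocr</em> binary.</p>

```bash
sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr
```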
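<p>On macOS, install the system packages with Homebrew. pip3 is included with the Homebrew Python, and <em>pdfunite</em> is part of <em>poppler</em>.</p>

```bash
brew install python3 imagemagick poppler tesseract
```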
### pip3

sudo pip3 install pdf2image Pillow opencv-python pytesseract
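Before running the workflow it can help to confirm that everything is in place. The sketch below is a hypothetical helper, not part of the repository. It checks that the Python modules can be imported and that the command-line tools are on the `PATH`. Which commands the scripts actually call is an assumption; on ImageMagick 7 the command may be `magick` rather than `convert`.

```python
import importlib.util
import shutil

# Import names differ from the pip package names: Pillow -> PIL, opencv-python -> cv2.
PYTHON_MODULES = ["pdf2image", "PIL", "cv2", "pytesseract"]
# Assumed command-line tools (ImageMagick, poppler, Tesseract).
COMMANDS = ["convert", "pdfunite", "tesseract"]


def missing_dependencies(modules=PYTHON_MODULES, commands=COMMANDS):
    """Return the names of the modules and commands that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [c for c in commands if shutil.which(c) is None]
    return missing
```

Calling `missing_dependencies()` returns an empty list when everything is installed, otherwise the names that still need attention.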
## How to use

<p>Add your pictures from the book scanner to the folder <em>scans</em>.</p>
<p>Make the shell scripts executable.</p>

```bash
chmod +x merge_scans.sh workshop_stream.sh merge_files.sh
```
<p>Run <em>./workshop_stream.sh</em>.</p>
<p>Wait :)</p>

## Additional information

### Create 5 directories
```bash
mkdir split rotated ocred bounding_box cropped
```
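The same directories can also be created from Python, which is handy if the step is repeated: unlike a plain `mkdir`, this does not fail when a directory already exists. This is a minimal sketch; `create_work_dirs` is a hypothetical helper and not one of the workshop scripts.

```python
from pathlib import Path

WORK_DIRS = ["split", "rotated", "ocred", "bounding_box", "cropped"]


def create_work_dirs(base="."):
    """Create the five working directories under `base`; existing ones are left alone."""
    created = []
    for name in WORK_DIRS:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```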
### Merge the files in the directory <em>scans</em>

<p>All the scans are appended to a single PDF called <em>out.pdf</em>.</p>

```bash
./merge_scans.sh
```
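What <em>merge_scans.sh</em> does internally is not shown here. One plausible way to get the same result, assuming ImageMagick's `convert` is used and the scans are JPEG files, is sketched below. The important detail is the page order: a natural sort keeps `img2.jpg` before `img10.jpg`. The names `natural_key` and `merge_command` are illustrative only.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that orders 'img2.jpg' before 'img10.jpg'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def merge_command(scan_dir="scans", output="out.pdf"):
    """Build an ImageMagick command that appends all scans to one PDF."""
    images = sorted(
        (p for p in Path(scan_dir).iterdir() if p.suffix.lower() in {".jpg", ".jpeg"}),
        key=lambda p: natural_key(p.name),
    )
    return ["convert", *map(str, images), output]
```

The returned list can be passed to `subprocess.run(merge_command(), check=True)`.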
### Burst the PDF in <em>scans</em>

<p>Burst this PDF into single pages, renaming the files so they can be iterated over later.</p>

```bash
python3 burstpdf.py
```
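The renaming matters because later steps walk through the pages in order. A minimal sketch of the idea, assuming <em>pdf2image</em> is used and the pages go to the <em>split</em> directory (both are assumptions, as are the function names and the `page_` prefix): zero-padded numbers make alphabetical order match page order.

```python
from pathlib import Path


def page_filename(index, total, prefix="page", ext=".jpg"):
    """Zero-padded file name so that alphabetical order equals page order."""
    width = max(len(str(total)), 3)
    return f"{prefix}_{index:0{width}d}{ext}"


def burst(pdf_path="out.pdf", out_dir="split", dpi=200):
    """Render every page of the PDF to its own JPEG and return the page count."""
    from pdf2image import convert_from_path  # third-party; requires poppler

    pages = convert_from_path(pdf_path, dpi=dpi)
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, page in enumerate(pages, start=1):
        page.save(out / page_filename(i, len(pages)), "JPEG")
    return len(pages)
```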
### Rotate the PDFs

<p>The book scanner takes pictures of the pages in a rotated position. This script iterates through the odd and even pages, rotating them back to their original position.</p>

```bash
python3 rotation.py
```
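The odd/even logic can be sketched as follows. The angles are assumptions (90 degrees for odd pages, -90 for even pages) and depend on how the cameras are mounted; the function names are hypothetical and the real <em>rotation.py</em> may work differently.

```python
def rotation_for(page_number, odd_angle=90, even_angle=-90):
    """Return the rotation angle (degrees, counter-clockwise) for a page number."""
    return odd_angle if page_number % 2 else even_angle


def rotate_page(src, dst, page_number):
    """Rotate one page image and save the result to `dst`."""
    from PIL import Image  # provided by the Pillow package

    with Image.open(src) as img:
        # expand=True enlarges the canvas so the rotated page is not clipped.
        img.rotate(rotation_for(page_number), expand=True).save(dst)
```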
### Cropping the bounding boxes

<p>The pages are now in their original position, but they have a bounding box. This script iterates through them and crops each page to the highest-contrast area found.</p>

```bash
python3 bounding_box.py
```
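One common way to find such an area with OpenCV is to threshold the image and crop to the largest contour. The sketch below assumes a bright page on a darker background and is not necessarily what <em>bounding_box.py</em> does; `largest_box` and `crop_to_content` are illustrative names.

```python
def largest_box(boxes):
    """Pick the (x, y, w, h) box with the largest area."""
    return max(boxes, key=lambda b: b[2] * b[3])


def crop_to_content(src, dst):
    """Crop the image at `src` to its largest bright region and write it to `dst`."""
    import cv2  # provided by the opencv-python package

    image = cv2.imread(src)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu's method chooses the threshold between page and background automatically.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    x, y, w, h = largest_box([cv2.boundingRect(c) for c in contours])
    cv2.imwrite(dst, image[y:y + h, x:x + w])
```

If no contour is found, `largest_box` raises a `ValueError`, so a real script would want to handle that case and keep the page uncropped.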
### Cropping the mirror

<p>The pages are now cropped, but the mirror is still visible in the middle. This script crops it out.</p>

```bash
python3 mirror_crop.py
```
### OCR

<p>In this step we OCR the JPG images, turning them into PDFs.</p>

```bash
python3 tesseract_ocr.py
```
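With <em>pytesseract</em>, turning an image into a searchable PDF takes a single call to `image_to_pdf_or_hocr`. The sketch below assumes the results are written to the <em>ocred</em> directory and that the text is in English; the helper names, the output directory and the `lang` default are assumptions, not taken from <em>tesseract_ocr.py</em>.

```python
from pathlib import Path


def ocr_output_path(image_path, out_dir="ocred"):
    """Map e.g. cropped/page_007.jpg to ocred/page_007.pdf."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_to_pdf(image_path, out_dir="ocred", lang="eng"):
    """OCR one image and write a searchable one-page PDF; return its path."""
    import pytesseract  # needs the Tesseract binary installed

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        str(image_path), extension="pdf", lang=lang
    )
    target = ocr_output_path(image_path, out_dir)
    target.write_bytes(pdf_bytes)
    return target
```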
### Merge all the files and create the PDF

<p>The OCRed pages are now joined into the final PDF. Your book is ready :)</p>

```bash
./merge_files.sh
```
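Since <em>pdfunite</em> is among the dependencies, the final merge presumably boils down to `pdfunite page1.pdf page2.pdf ... output.pdf`. A sketch of building that command from Python, assuming the single-page PDFs live in <em>ocred</em> and have zero-padded names so that a plain sort gives the page order (`unite_command` and the output name `book.pdf` are hypothetical):

```python
from pathlib import Path


def unite_command(pdf_dir="ocred", output="book.pdf"):
    """Build a pdfunite command that joins all single-page PDFs in order."""
    pages = sorted(str(p) for p in Path(pdf_dir).glob("*.pdf"))
    if not pages:
        raise FileNotFoundError(f"no PDF pages found in {pdf_dir}")
    return ["pdfunite", *pages, output]
```

The result can be run with `subprocess.run(unite_command(), check=True)`.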
## License

The package is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).