<h1 align="center">DIY Book Scanner Workflow</h1>

## Getting started
This set of scripts was written for the Text Laundrette workshop. The workshop takes place in the Publication Station, WDkA building.<br>
Rotterdam, 03-02-2020<br>
It is a workflow to turn the pictures from the DIY Book Scanner into a final OCRed PDF.<br>

<br>

## About the Workshop
<em>DESCRIPTION</em>

<p>We will use a home-made, DIY book scanner and open-source software to scan, process, and add digital features to printed texts brought by the participants to the workshop. Ultimately, we will include them in the “bootleg library”, a shadow library accessible over a local network.</p>

<p>Shadow libraries operate outside of legal copyright frameworks, in response to decreased open access to knowledge. This workshop aims to extend our research on libraries, their sociability, and methods by which we can add provenance to texts included in public or private, legal or extra-legal collections.</p>

<p>Participants should bring a printed text that they’d like to digitize and share.</p>

<br><br>

## Dependencies

### Brew (macOS) or apt-get (Linux)
<p>On macOS, you’ll need the command-line tools for Xcode installed.</p>

```bash
xcode-select --install
```

<p>Then install Homebrew. The old Ruby-based installer is deprecated; the current installer is a Bash script.</p>

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

<p>Run the following command once you’re done to ensure Homebrew is installed and working properly:</p>

```bash
brew doctor
```
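
<p>On Linux, install the system packages with apt-get. The <em>tesseract-ocr</em> package is included because pytesseract needs the Tesseract binary to run.</p>

```bash
sudo apt-get install python3 python3-pip imagemagick poppler-utils tesseract-ocr
```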
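
<p>On macOS, install them with Homebrew. The formula names differ from the apt packages: pip ships with Homebrew’s Python, and poppler-utils is called <em>poppler</em>.</p>

```bash
brew install python3 imagemagick poppler tesseract
```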

<br>

### PIP3

```bash
sudo pip3 install pdf2image Pillow opencv-python pytesseract
```
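
Before running the workflow, it can help to confirm that the dependencies above are actually installed. The following is a minimal sketch and not part of the workshop scripts. It assumes the command-line tools are on your PATH as `convert` (ImageMagick 6), `pdftoppm` (poppler) and `tesseract`, and that the Python packages import as `pdf2image`, `PIL`, `cv2` and `pytesseract`. The helper names `missing_tools` and `missing_modules` are hypothetical.

```python
import importlib.util
import shutil

# Assumed executable names; ImageMagick 7 installs `magick` instead of `convert`.
TOOLS = ["convert", "pdftoppm", "tesseract"]
# Import names of the packages installed with pip3 above.
MODULES = ["pdf2image", "PIL", "cv2", "pytesseract"]


def missing_tools(tools=TOOLS):
    """Return the command-line tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=MODULES):
    """Return the Python modules that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    print("Missing modules:", missing_modules() or "none")
```

If either list is not empty, reinstall the corresponding package before continuing.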

<br>

## How to use

<p>Add your pictures from the book scanner to the folder <em>scans</em>.</p>

<p>Make all the scripts executable. Execute permission is enough; there is no need for <em>chmod 777</em> or sudo.</p>

```bash
chmod +x merge_scans.sh workshop_stream.sh rename_scans.sh change_res.sh delete_and_start_over.sh
```

<p>If you want to skip any of the steps, comment out the corresponding line in the shell script <em>workshop_stream.sh</em>.</p>

### Increase ImageMagick resources

```bash
sudo nano /etc/ImageMagick-6/policy.xml
```

<p>ImageMagick comes with very low limits:</p>

Resource limits:<br>
Width: 16KP<br>
Height: 16KP<br>
Area: 128MP<br>
Memory: 256MiB<br>
Map: 512MiB<br>
Disk: 1GiB<br>
File: 768<br>
Thread: 4<br>
Throttle: 0<br>
Time: unlimited<br>

<p>Change /etc/ImageMagick-6/policy.xml to more sensible defaults:</p>

Resource limits:<br>
Width: 128KP<br>
Height: 128KP<br>
Area: 1.0737GP<br>
Memory: 2GiB<br>
Map: 4GiB<br>
Disk: 8GiB<br>
File: 768<br>
Thread: 4<br>
Throttle: 0<br>
Time: unlimited<br>
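
In policy.xml these limits are stored as entries of the form `<policy domain="resource" name="memory" value="256MiB"/>`. The sketch below shows which entries correspond to the values listed above, applied to a small sample string. It is an illustration only: the function name `raise_limits` is hypothetical, and because `xml.etree.ElementTree` drops comments and the DOCTYPE when it rewrites a file, editing the real file by hand remains the safer route.

```python
import xml.etree.ElementTree as ET

# Target values taken from the "more sensible defaults" listed above.
NEW_LIMITS = {
    "width": "128KP",
    "height": "128KP",
    "area": "1.0737GP",
    "memory": "2GiB",
    "map": "4GiB",
    "disk": "8GiB",
}


def raise_limits(policy_xml, limits=NEW_LIMITS):
    """Return policy_xml with the listed resource limits replaced."""
    root = ET.fromstring(policy_xml)
    for policy in root.iter("policy"):
        # Only touch resource limits; leave coder/path policies untouched.
        if policy.get("domain") == "resource" and policy.get("name") in limits:
            policy.set("value", limits[policy.get("name")])
    return ET.tostring(root, encoding="unicode")


# A shortened sample of what policy.xml contains.
sample = """<policymap>
  <policy domain="resource" name="memory" value="256MiB"/>
  <policy domain="resource" name="disk" value="1GiB"/>
  <policy domain="coder" rights="none" pattern="PS"/>
</policymap>"""

updated = raise_limits(sample)
print(updated)
```

After changing the real file, `identify -list resource` shows the limits ImageMagick actually applies.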

### RUN

<p>In a terminal, run ./workshop_stream.sh</p>

<p>Wait :)</p>

<br><br>

## Additional information

The workflow runs the following scripts, in order:

### Create 5 directories

```bash
mkdir split
mkdir rotated
mkdir ocred
mkdir bounding_box
mkdir cropped
```
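
Plain `mkdir` fails when a directory already exists, for example on a second run. As a minimal sketch of the same step in Python, the hypothetical helper `create_work_dirs` below creates the five directories and can be run again safely.

```python
from pathlib import Path

# The five working directories used by the workflow.
WORK_DIRS = ["split", "rotated", "ocred", "bounding_box", "cropped"]


def create_work_dirs(base="."):
    """Create the working directories under base and return their paths."""
    created = []
    for name in WORK_DIRS:
        path = Path(base) / name
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    for directory in create_work_dirs():
        print("ready:", directory)
```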

### Rename the files in the directory <em>scans</em>

<p>All the scans are renamed.</p>

```bash
./rename_scans.sh
```
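
The naming scheme used by `rename_scans.sh` is not documented here. As an illustration of why renaming helps, the sketch below assumes a hypothetical scheme, `scan_0001.jpg`, `scan_0002.jpg` and so on, in which zero-padded numbers keep alphabetical order and page order identical for the later steps. The function `plan_renames` is not part of the workshop scripts, and it only plans the renames without touching any files.

```python
from pathlib import Path


def plan_renames(filenames, prefix="scan_", digits=4):
    """Return (old, new) name pairs with zero-padded sequential numbers."""
    ordered = sorted(filenames)
    return [
        (old, f"{prefix}{index:0{digits}d}{Path(old).suffix.lower()}")
        for index, old in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for old, new in plan_renames(["IMG_10.JPG", "IMG_02.JPG"]):
        print(old, "->", new)
```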

### Change the resolution of the scans

<p>The resolution of the scans is reduced so that they are lighter to process.</p>

```bash
./change_res.sh
```

### Rotate the pdfs

<p>The book scanner takes pictures of the pages. This script iterates through the odd and even pages, rotating them back to their original position.</p>

```bash
python3 rotation.py
```
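
The idea behind the rotation step can be sketched as follows. This is not the code of `rotation.py`: it assumes the pages are JPEG files in `split`, that odd and even pages need opposite quarter turns, and that the results go to `rotated`. The angles and the function names are assumptions; swap the two angles if your pages come out upside down.

```python
from pathlib import Path


def rotation_angle(page_number, odd_angle=90, even_angle=-90):
    """Return the rotation in degrees for a 1-based page number."""
    return odd_angle if page_number % 2 else even_angle


def rotate_pages(src_dir="split", dst_dir="rotated"):
    """Rotate every JPEG in src_dir and save it under the same name in dst_dir."""
    from PIL import Image  # Pillow, imported here so the helper above has no dependencies

    for number, path in enumerate(sorted(Path(src_dir).glob("*.jpg")), start=1):
        with Image.open(path) as image:
            # Positive angles rotate counter-clockwise; expand=True keeps the full page.
            rotated = image.rotate(rotation_angle(number), expand=True)
            rotated.save(Path(dst_dir) / path.name)
```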

### Crop the bounding boxes

<p>The pages are now in their original position, but they have a bounding box. This script iterates through them and crops the highest contrast area found.</p>

```bash
python3 bounding_box.py
```
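
One common way to find such an area with OpenCV is to threshold the image and keep the largest outer contour. The sketch below uses that approach (Otsu thresholding plus `cv2.boundingRect`), which may differ from what `bounding_box.py` does. It assumes a bright page on a darker background, and the function names are hypothetical.

```python
def largest_box(boxes):
    """Return the (x, y, width, height) box that covers the largest area."""
    return max(boxes, key=lambda box: box[2] * box[3])


def crop_page(src_path, dst_path):
    """Crop the image at src_path to its largest bright region."""
    import cv2  # opencv-python, imported here so largest_box has no dependencies

    image = cv2.imread(str(src_path))
    if image is None:
        raise FileNotFoundError(f"cannot read image: {src_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu picks the threshold automatically; the page becomes white in the mask.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    x, y, w, h = largest_box([cv2.boundingRect(contour) for contour in contours])
    cv2.imwrite(str(dst_path), image[y:y + h, x:x + w])
```

For example, `crop_page("rotated/page.jpg", "cropped/page.jpg")` would write the cropped page; both paths are placeholders.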

### Crop the mirror

<p>The pages are now cropped, but the mirror may still be visible at the edge. This happens if the cameras are not adjusted properly. This step is commented out by default because, if the cameras are positioned correctly, it is not needed.</p>

```bash
python3 mirror_crop.py
```

### OCR

<p>In this step we OCR the jpg files, turning them into PDFs.</p>

```bash
python3 tesseract_ocr.py
```
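
With pytesseract, a single image can be turned into a searchable PDF through `image_to_pdf_or_hocr`. The sketch below shows that call. It is not the code of `tesseract_ocr.py`: the input path, the `ocred` output folder, the `eng` language and the function names are assumptions.

```python
from pathlib import Path


def pdf_path_for(image_path, out_dir="ocred"):
    """Return the PDF path that corresponds to an image path."""
    return Path(out_dir) / (Path(image_path).stem + ".pdf")


def ocr_page(image_path, out_dir="ocred", lang="eng"):
    """OCR one image and write the result as a searchable PDF."""
    import pytesseract  # needs the Tesseract binary installed on the system

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        str(image_path), extension="pdf", lang=lang
    )
    target = pdf_path_for(image_path, out_dir)
    target.write_bytes(pdf_bytes)
    return target
```

For a text in another language, pass the matching Tesseract language code, provided that language data is installed.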

### Merge all the files and create the pdf

<p>The OCRed pages are now joined into the final PDF. Your book is ready :)</p>

```bash
./merge_files.sh
```
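
The tool used inside `merge_files.sh` is not stated here. One option that comes with poppler-utils is `pdfunite`, which takes the input PDFs followed by the output file. The sketch below builds and runs that command. It assumes the page PDFs are in `ocred` and sort alphabetically into page order; the output name `book.pdf` and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdfunite_command(page_dir="ocred", output="book.pdf"):
    """Build the pdfunite command that joins all PDFs in page_dir."""
    pages = sorted(str(path) for path in Path(page_dir).glob("*.pdf"))
    if not pages:
        raise FileNotFoundError(f"no PDF pages found in {page_dir}")
    # pdfunite expects the inputs first and the output file last.
    return ["pdfunite", *pages, output]


def merge_pages(page_dir="ocred", output="book.pdf"):
    """Run pdfunite and raise an error if it fails."""
    subprocess.run(pdfunite_command(page_dir, output), check=True)
```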

## START OVER

<p>To start over, run delete_and_start_over.sh:</p>

```bash
./delete_and_start_over.sh
```

<br><br>

## License

The package is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).