Wiki to HTML pages script
Dependencies
- python3
- pip, the Python package installer
  - Install:
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
    python3 get-pip.py
- mwclient Python library
  - Install:
    pip3 install mwclient
- jinja2 Python library
  - Install:
    pip3 install jinja2
- Pillow Python library for image processing
  - Install:
    pip3 install Pillow
- pandoc
  - Install:
    - Debian/Ubuntu:
      sudo apt install pandoc
    - Mac:
      brew install pandoc
login.txt
login.txt is a local, per-user file, ignored by git, where you place your itch wiki username and password, on separate lines.
It lets mwclient access the wiki, which is closed for reading and writing.
myusername
mypassword
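As an illustration of the two-line format, here is a minimal sketch of how such a file could be read. The function name read_login is hypothetical and not taken from the scripts in this repository; the mwclient calls in the trailing comment use a placeholder host.

```python
from pathlib import Path


def read_login(path="login.txt"):
    """Return (username, password) from a two-line credentials file."""
    text = Path(path).read_text(encoding="utf-8")
    # Keep only non-empty lines: first is the username, second the password.
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError(f"{path} must hold a username and a password on separate lines")
    return lines[0], lines[1]


# Hypothetical usage with mwclient (host and path are placeholders):
#   import mwclient
#   site = mwclient.Site("wiki.example.org", path="/")
#   site.login(*read_login())
```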
Run
cd special-issue-11-wiki2html/
Run scripts together with ./run.sh
Or run one script at a time:
python3 download_imgs.py
- Downloads all images from the wiki to the images/ directory
- and stores each image's metadata in images.json
python3 query2html.py
- Performs a query with the ask API:
  - help:
    python3 query2html.py --help
  - dry run, only printing the request without executing it:
    python3 query2html.py --dry
  - build a custom query with the arguments --conditions --printouts --sort --order
  - the default query is:
    [[File:+]][[Title::+]][[Part::+]][[Date::+]]|?Title|?Date|?Part|?Partof|sort=Date,Title,Part|order=asc,asc,asc
  - custom queries:
    python3 query2html.py --conditions '[[Date::>=1970/01/01]][[Date::<=1979/12/31]]'
    python3 query2html.py --conditions '[[Creator::~*task force*]]'
Note: to avoid confusion or problems, it is better to leave the --printouts, --sort and --order arguments at their defaults.
Otherwise document parts will no longer be grouped by their Title, producing documents assembled from parts of different originals.
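To make the relation between the four arguments and the final query string concrete, here is a minimal sketch. It assumes the pieces are simply joined with "|", which is consistent with the default query shown above; the function build_ask_query and the constant names are hypothetical and may not match what query2html.py does internally.

```python
# Defaults taken from the default query documented above.
DEFAULT_CONDITIONS = "[[File:+]][[Title::+]][[Part::+]][[Date::+]]"
DEFAULT_PRINTOUTS = "?Title|?Date|?Part|?Partof"
DEFAULT_SORT = "Date,Title,Part"
DEFAULT_ORDER = "asc,asc,asc"


def build_ask_query(conditions=DEFAULT_CONDITIONS, printouts=DEFAULT_PRINTOUTS,
                    sort=DEFAULT_SORT, order=DEFAULT_ORDER):
    """Join conditions, printouts, sort and order into one ask query string."""
    return f"{conditions}|{printouts}|sort={sort}|order={order}"


# Only the conditions change; printouts, sort and order stay at their defaults.
seventies = build_ask_query(conditions="[[Date::>=1970/01/01]][[Date::<=1979/12/31]]")
print(seventies)
```

Overriding only the conditions, as in the usage lines, keeps the Title-based grouping intact.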
How does query2html.py work?
Based on the query made, the MW API sends back a number of Page titles that match the query conditions, together with their printouts (metadata property::value pairs).
For each Page:
- its locally stored image is found
- its text is retrieved from MW
- a fragment of HTML (document_part_html) is generated based on templates/document_part.html
All Pages that share the same metadata Title value will:
- gather all their HTML fragments in all_document_parts
- render templates/document.html with the content of all_document_parts
- save the rendered template to static_html/DocumentTitle.html
Once the documents are saved:
- templates/index.html is rendered with the info on each document, which has been saved into documentslist
- resulting in static_html/index.html
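The grouping step can be sketched as follows. This is an illustration under stated assumptions, not the code of query2html.py: it uses string.Template from the standard library as a stand-in for the jinja2 templates, the page dictionaries are invented sample data, and build_documents is a hypothetical name. Only document_part_html, all_document_parts and documentslist come from the description above.

```python
from collections import defaultdict
from string import Template

# Stand-ins for templates/document_part.html and templates/document.html.
PART_TEMPLATE = Template('<section><img src="$image"><p>$text</p></section>')
DOCUMENT_TEMPLATE = Template("<h1>$title</h1>$parts")


def build_documents(pages):
    """Group pages by their Title and render one HTML string per document."""
    all_document_parts = defaultdict(list)
    for page in pages:
        # One HTML fragment per Page, built from its image and text.
        document_part_html = PART_TEMPLATE.substitute(image=page["image"], text=page["text"])
        all_document_parts[page["Title"]].append(document_part_html)
    # One rendered document per distinct Title.
    return {
        title: DOCUMENT_TEMPLATE.substitute(title=title, parts="".join(parts))
        for title, parts in all_document_parts.items()
    }


# Invented sample data, already sorted by Title and Part.
pages = [
    {"Title": "Report", "Part": 1, "image": "images/report-01.jpg", "text": "First page"},
    {"Title": "Report", "Part": 2, "image": "images/report-02.jpg", "text": "Second page"},
    {"Title": "Letter", "Part": 1, "image": "images/letter-01.jpg", "text": "Only page"},
]
documents = build_documents(pages)
documentslist = sorted(documents)  # what an index page would list
print(documentslist)
```

Because fragments are appended in the order the pages arrive, the sort order of the query determines the order of the parts inside each document.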
Bulk image upload upload_imgs_dir.py
Get Help: python3 upload_imgs_dir.py --help
Edit and run via ./helper-upload_imgs_dir.sh
Convert PDFs to folder of JPGs with pdf2jpg.sh
By either:
- running it from this folder, using the absolute path to the PDF:
  ./pdf2jpg.sh "/full/path/to/2020_bantayog/PDFname.pdf"
- copying pdf2jpg.sh to 2020_bantayog/ and running it with a relative path to the PDF:
  ./pdf2jpg.sh "PDFname.pdf"
It uses ImageMagick to convert PDFs to JPGs:
convert -quality 100 -density 300 [name-of-pdf] %02d.jpg
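For anyone who prefers to drive the same conversion from Python, a minimal sketch follows. It only wraps the convert command shown above and assumes convert is on the PATH; build_convert_command and pdf_to_jpgs are hypothetical helpers, not part of this repository.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir="."):
    """Return the argument list for the convert call used above."""
    # %02d is expanded by convert into a zero-padded page index.
    out_pattern = str(Path(out_dir) / "%02d.jpg")
    return ["convert", "-quality", "100", "-density", "300", str(pdf_path), out_pattern]


def pdf_to_jpgs(pdf_path, out_dir="."):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(build_convert_command(pdf_path, out_dir), check=True)


print(build_convert_command("PDFname.pdf"))
```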