{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# WeasyPrint\n",
"\n",
"WeasyPrint is a Python library that lays out HTML (and CSS) as print pages and saves them as a PDF. This makes it a possible part of a \"web to print\" workflow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://weasyprint.readthedocs.io/en/latest/tutorial.html"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from weasyprint import HTML, CSS\n",
"# (WeasyPrint 53 and later: from weasyprint.text.fonts import FontConfiguration)\n",
"from weasyprint.fonts import FontConfiguration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## HTML()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The main class that WeasyPrint provides is HTML. It represents an HTML document and has methods to save it as a PDF (and, in older WeasyPrint releases, as a PNG). When creating an HTML object you can specify the HTML as a string of source code (via the *string* option), as a file (via the *filename* option), or even as an online page (via *url*)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"html = HTML(string='<h1>hello</h1>')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"html = HTML(filename=\"txt/words-for-the-future/LIQUID.html\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"html = HTML(url=\"https://pzwiki.wdka.nl/mediadesign/Category:WordsfortheFuture\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CSS()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The CSS class lets you add an (additional) stylesheet. Just as with the HTML class, you can give it a string, a filename, or a URL. Stylesheets passed to write_pdf() through the *stylesheets* option are applied in addition to any stylesheets the HTML document already links to or embeds."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"css = CSS(string='''\n",
"@page{\n",
" size: A4;\n",
" margin: 15mm;\n",
" background-color: lightgrey;\n",
" font-family: monospace;\n",
" font-size: 8pt;\n",
" color: #7da0d4;\n",
" border:1px dotted red;\n",
" \n",
" @top-left{\n",
" content: \"natural\";\n",
" }\n",
" @top-center{\n",
" content: \"liquid\";\n",
" }\n",
" @top-right{\n",
" content: \"bodies\";\n",
" }\n",
" /* there is no @top-middle margin box; @top-center above is the middle box of the top edge */\n",
" @left-top{\n",
" content: \"material\";\n",
" }\n",
" @right-top{\n",
" content: \"existence\";\n",
" }\n",
" @bottom-left{\n",
" content: \"flux\";\n",
" }\n",
" @bottom-center{\n",
" content: \"living\";\n",
" }\n",
" @bottom-right{\n",
" content: \"energy\";\n",
" }\n",
" }\n",
" body {\n",
" background: #f7c694;\n",
" margin: 20px;\n",
" line-height: 2;\n",
" font-family: monospace;\n",
"}\n",
"\n",
"pre {\n",
" white-space: pre-wrap;\n",
"}\n",
" \n",
"\n",
"\n",
"''')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"html.write_pdf('mydocument.pdf', stylesheets=[css])"
]
},
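{
"cell_type": "markdown",
"metadata": {},
"source": [
"The margin boxes inside `@page` have fixed names, and an unknown name is likely to be ignored. As a minimal sketch (the helper `page_rule` and the names `MARGIN_BOXES` and `page_css` are made up for this example and are not part of WeasyPrint), the rule can be built from a dictionary and checked against the 16 names that CSS Paged Media defines. The resulting string could then be passed to `CSS(string=...)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The 16 page-margin box names defined by CSS Paged Media\n",
"MARGIN_BOXES = {\n",
"    \"top-left-corner\", \"top-left\", \"top-center\", \"top-right\", \"top-right-corner\",\n",
"    \"bottom-left-corner\", \"bottom-left\", \"bottom-center\", \"bottom-right\", \"bottom-right-corner\",\n",
"    \"left-top\", \"left-middle\", \"left-bottom\",\n",
"    \"right-top\", \"right-middle\", \"right-bottom\",\n",
"}\n",
"\n",
"def page_rule(boxes, size=\"A4\", margin=\"15mm\"):\n",
"    \"\"\"Build an @page rule with one margin box per entry of `boxes` (hypothetical helper).\"\"\"\n",
"    unknown = sorted(set(boxes) - MARGIN_BOXES)\n",
"    if unknown:\n",
"        raise ValueError(f\"not a CSS margin box: {', '.join(unknown)}\")\n",
"    lines = [\"@page {\", f\"  size: {size};\", f\"  margin: {margin};\"]\n",
"    # the text is inserted as-is, so it must not contain double quotes\n",
"    for name, text in boxes.items():\n",
"        lines.append(f'  @{name} {{ content: \"{text}\"; }}')\n",
"    lines.append(\"}\")\n",
"    return \"\\n\".join(lines)\n",
"\n",
"page_css = page_rule({\"top-left\": \"natural\", \"top-center\": \"liquid\", \"top-right\": \"bodies\"})\n",
"print(page_css)"
]
},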
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using NLTK to automatically mark up a (plain) text with POS tags"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
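"import nltk\n",
"\n",
"# word_tokenize and pos_tag need NLTK's tokenizer and tagger data,\n",
"# installed once with nltk.download() (resource names vary by NLTK version)\n",
"with open('txt/LIQUID.txt', encoding='utf-8') as f:\n",
"    txt = f.read()\n",
"words = nltk.word_tokenize(txt)  # split the text into word tokens\n",
"tagged_words = nltk.pos_tag(words)  # list of (word, POS tag) pairs"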
]
},
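{
"cell_type": "markdown",
"metadata": {},
"source": [
"`nltk.pos_tag()` returns a list of `(word, tag)` pairs. To decide which tags are worth styling, it can help to count them first. The cell below is a small sketch that does not need NLTK: `sample_tagged` is a short, hand-copied list in that shape, and the names `sample_tagged` and `tag_counts` are only used for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Hand-copied (word, tag) pairs in the shape that nltk.pos_tag() returns\n",
"sample_tagged = [\n",
"    (\"Software\", \"NNP\"), (\"and\", \"CC\"), (\"language\", \"NN\"), (\"are\", \"VBP\"),\n",
"    (\"intrinsically\", \"RB\"), (\"related\", \"VBN\"), (\",\", \",\"), (\"since\", \"IN\"),\n",
"    (\"software\", \"NN\"), (\"may\", \"MD\"), (\"process\", \"VB\"), (\"language\", \"NN\"),\n",
"]\n",
"\n",
"# Count how often each tag occurs and show the most frequent ones first\n",
"tag_counts = Counter(tag for word, tag in sample_tagged)\n",
"for tag, count in tag_counts.most_common(3):\n",
"    print(tag, count)"
]
},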
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
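"from html import escape\n",
"\n",
"# collect all the pieces of HTML\n",
"content = ''\n",
"content += '<h1>Language and Software Studies, by Florian Cramer</h1>'\n",
"\n",
"# wrap every word in a <span> whose class is its POS tag;\n",
"# escape() keeps characters such as < and & from breaking the HTML\n",
"for word, tag in tagged_words:\n",
"    content += f'<span class=\"{ escape(tag) }\">{ escape(word) }</span> '\n",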
"\n",
"# write the HTML file\n",
"with open(\"txt/language.html\", \"w\", encoding=\"utf-8\") as f:\n",
" f.write(f\"\"\"<!DOCTYPE html>\n",
"<html>\n",
"<head>\n",
" <meta charset=\"utf-8\">\n",
" <link rel=\"stylesheet\" type=\"text/css\" href=\"language.css\">\n",
" <title></title>\n",
"</head>\n",
"<body>\n",
"{ content }\n",
"</body>\n",
"</html>\n",
"\"\"\")\n",
"\n",
"# write a CSS file\n",
"with open(\"txt/language.css\", \"w\", encoding=\"utf-8\") as f:\n",
" f.write(\"\"\"\n",
"\n",
"@page{\n",
" size:A4;\n",
" background-color:lightgrey;\n",
" margin:10mm;\n",
"}\n",
".JJ{\n",
" color:red;\n",
"}\n",
".VB,\n",
".VBG{\n",
" color:magenta;\n",
"}\n",
".NN,\n",
".NNP{\n",
" color:green;\n",
"}\n",
".EX{\n",
" color: blue;\n",
"}\n",
"\n",
" \"\"\")"
]
},
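{
"cell_type": "markdown",
"metadata": {},
"source": [
"The colour rules in the stylesheet follow one pattern: a list of POS tags used as class selectors, and one colour. As a sketch, the same rules could be generated from a dictionary instead of being typed by hand. The names `tag_colors`, `tag_rules` and `tag_css` are made up for this example, and the helper assumes plain alphanumeric tags (tags such as `PRP$` or `,` would need escaping before they can be used as CSS class selectors)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical mapping from a colour to the POS tags that should get it\n",
"tag_colors = {\n",
"    \"red\": [\"JJ\"],\n",
"    \"magenta\": [\"VB\", \"VBG\"],\n",
"    \"green\": [\"NN\", \"NNP\"],\n",
"    \"blue\": [\"EX\"],\n",
"}\n",
"\n",
"def tag_rules(colors):\n",
"    \"\"\"Return one CSS rule per colour, selecting its tags as classes.\"\"\"\n",
"    rules = []\n",
"    for color, tags in colors.items():\n",
"        selector = \",\\n\".join(f\".{tag}\" for tag in tags)\n",
"        rules.append(f\"{selector} {{\\n  color: {color};\\n}}\")\n",
"    return \"\\n\".join(rules)\n",
"\n",
"tag_css = tag_rules(tag_colors)\n",
"print(tag_css)"
]
},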
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# @font-face rules in a stylesheet only take effect when a FontConfiguration object is passed to CSS() and write_pdf()\n",
"font_config = FontConfiguration()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
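"html = HTML(filename=\"txt/language.html\")\n",
"# pass font_config to CSS() as well, so that any @font-face rules are registered\n",
"css = CSS(filename=\"txt/language.css\", font_config=font_config)\n",
"html.write_pdf('txt/language.pdf', stylesheets=[css], font_config=font_config)"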
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview your PDF in the notebook!\n",
"from IPython.display import IFrame\n",
"IFrame(\"txt/language.pdf\", width=900, height=400)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<h1>Language and Software Studies, by Florian Cramer</h1><span class=\"NN\">Language</span> <span class=\"JJ\">Florian</span> <span class=\"NNP\">Cramer</span> <span class=\"NNP\">Software</span> <span class=\"CC\">and</span> <span class=\"NN\">language</span> <span class=\"VBP\">are</span> <span class=\"RB\">intrinsically</span> <span class=\"VBN\">related</span> <span class=\",\">,</span> <span class=\"IN\">since</span> <span class=\"NN\">software</span> <span class=\"MD\">may</span> <span class=\"VB\">process</span> <span class=\"NN\">language</span> <span class=\",\">,</span> <span class=\"CC\">and</span> <span class=\"VBZ\">is</span> <span class=\"VBN\">constructed</span> <span class=\"IN\">in</span> <span class=\"NN\">language</span> <span class=\".\">.</span> "
]
}
],
"source": [
"# Printing the content is useful to see how the HTML is written!\n",
"print( content )"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}