From f6915bf497901b25a836fb3371399cc6003762fd Mon Sep 17 00:00:00 2001 From: Michael Murtaugh Date: Wed, 28 Oct 2020 17:19:15 +0100 Subject: [PATCH] added back print style sheet example, and moved nltk example as a 'special case' --- nltk-pos-tagger.ipynb | 105 +++++++++++++++++- weasyprint.ipynb | 246 ++++++++++++++++++++++-------------------- 2 files changed, 235 insertions(+), 116 deletions(-) diff --git a/nltk-pos-tagger.ipynb b/nltk-pos-tagger.ipynb index 1d03834..48c8798 100644 --- a/nltk-pos-tagger.ipynb +++ b/nltk-pos-tagger.ipynb @@ -1453,12 +1453,115 @@ "tagged" ] }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "words = \"in the beginning was heaven and earth and the time of the whatever\".split()" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['in',\n", + " 'the',\n", + " 'beginning',\n", + " 'was',\n", + " 'heaven',\n", + " 'and',\n", + " 'earth',\n", + " 'and',\n", + " 'the',\n", + " 'time',\n", + " 'of',\n", + " 'the',\n", + " 'whatever']" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "words" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "words.index(\"the\")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "IN\n", + "1 the\n", + "BEGINNING\n", + "WAS\n", + "HEAVEN\n", + "AND\n", + "EARTH\n", + "AND\n", + "8 the\n", + "TIME\n", + "OF\n", + "11 the\n", + "WHATEVER\n" + ] + } + ], + "source": [ + "for i, word in enumerate(words):\n", + " if word == \"the\":\n", + " print (i, word)\n", + " else:\n", + " print (word.upper())" + ] + }, { "cell_type": 
"code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "words = {}\n", + "words[\"VB\"] = []\n", + "\n", + "for ...\n", + " words[\"VB\"].append(word)\n", + " \n", + " \n", + "choice(words[\"VB\"])" + ] } ], "metadata": { diff --git a/weasyprint.ipynb b/weasyprint.ipynb index dcab8ed..061929d 100644 --- a/weasyprint.ipynb +++ b/weasyprint.ipynb @@ -4,29 +4,31 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Weasyprint" + "# Weasyprint\n", + "\n", + "Weasyprint is a Python library that lays out HTML (and CSS) as print pages and saves the result as a PDF. In this way, it can be part of a \"web to print\" workflow." ] }, { - "cell_type": "code", - "execution_count": 2, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "from weasyprint import HTML, CSS\n", - "from weasyprint.fonts import FontConfiguration" + "https://weasyprint.readthedocs.io/en/latest/tutorial.html" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 5, "metadata": {}, + "outputs": [], "source": [ - "https://weasyprint.readthedocs.io/en/latest/tutorial.html" + "from weasyprint import HTML, CSS\n", + "from weasyprint.fonts import FontConfiguration" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -41,13 +43,19 @@ "## HTML" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The main class in weasyprint is HTML. It represents an HTML document and provides methods to save it as PDF (or PNG). When creating an HTML object you can specify the source as a string (via the *string* option), a file (via the *filename* option), or even an online page (via *url*)." + ] + }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ - "# small example HTML object\n", "html = HTML(string='

hello

')" ] }, @@ -55,165 +63,173 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "or in this case let's use python + nltk to make a custom HTML page with parts of speech used as CSS classes..." + "or" ] }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ - "import nltk\n", - "\n", - "txt = open('txt/language.txt').read()\n", - "words = nltk.word_tokenize(txt)\n", - "tagged_words = nltk.pos_tag(words)" + "html = HTML(filename=\"path/to/some.html\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "or" ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ - "content = ''\n", - "content += '

Language and Software Studies, by Florian Cramer

'\n", - "\n", - "for word, tag in tagged_words:\n", - " content += f'{ word } '\n", - "\n", - "with open(\"txt/language.html\", \"w\") as f:\n", - " f.write(f\"\"\"\n", - "\n", - "\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "{content}\n", - "\n", - "\"\"\")\n", - "\n", - "html = HTML(\"txt/language.html\")" + "html = HTML(url=\"https://pzwiki.wdka.nl/mediadesign/Category:WordsfortheFuture\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Saved to [language.html](txt/language.html). Fun fact: jupyter filters HTML pages that are displayed in the notebook. To see the HTML unfiltered, use an iframe (as below), or right-click and select Open in New Tab in the file list.\n", - "\n", - "Maybe useful evt. https://stackoverflow.com/questions/23358444/how-can-i-use-word-tokenize-in-nltk-and-keep-the-spaces" + "The CSS class lets you include an (additional) CSS file. Just as with the HTML class, you can give a string, filename, or URL. If the HTML already has stylesheets, they will be combined. (is this true?)" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 12, "metadata": {}, + "outputs": [], "source": [ - "NB: The above HTML refers to the stylesheet [language.css](txt/language.css) (notice that the path is relative to the HTML page, so no need to say txt in the link)." 
+ "css = CSS(string='''\n", + "@page{\n", + " size: A4;\n", + " margin: 15mm;\n", + " background-color: lightgrey;\n", + " font-family: monospace;\n", + " font-size: 8pt;\n", + " color: red;\n", + " border:1px dotted red;\n", + " \n", + " @top-left{\n", + " content: \"natural\";\n", + " }\n", + " @top-center{\n", + " content: \"language\";\n", + " }\n", + " @top-right{\n", + " content: \"artificial\";\n", + " }\n", + " @top-middle{\n", + " content: \"\"\n", + " }\n", + " @left-top{\n", + " content: \"computer control\";\n", + " }\n", + " @right-top{\n", + " content: \"markup\";\n", + " }\n", + " @bottom-left{\n", + " content: \"formal\";\n", + " }\n", + " @bottom-center{\n", + " content: \"programming\";\n", + " }\n", + " @bottom-right{\n", + " content: \"machine\";\n", + " }\n", + " }\n", + " body{\n", + " font-family: serif;\n", + " font-size: 12pt;\n", + " line-height: 1.4;\n", + " color: magenta;\n", + " }\n", + " h1{\n", + " width: 100%;\n", + " text-align: center;\n", + " font-size: 250%;\n", + " line-height: 1.25;\n", + " color: orange;\n", + " }\n", + " strong{\n", + " color: blue;\n", + " }\n", + " em{\n", + " color: green;\n", + " }\n", + "\n", + "\n", + "''', font_config=font_config)" ] }, { "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " " - ], - "text/plain": [ - "" - ] - }, - "execution_count": 34, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from IPython.display import IFrame\n", - "IFrame(\"txt/language.html\", width=1024, height=600)" + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "html.write_pdf('mydocument.pdf', font_config=font_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Generating the PDF!\n", - "\n", - "Now let's let weasyprint do it's stuff! 
Write_pdf actually calculates the layout, behaving like a web browser to render the HTML visibly and following the CSS guidelines for page media (notice the special rules in the CSS that weasy print recognizes and uses that the browser does not). Notice that the CSS file gets mentioned again explicitly (and here we need to refer to its path relative to this folder)." + "## Using NLTK to automatically mark up a (plain) text with POS tags" ] }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ - "## If we had not linked the CSS in the HTML, you could specify it in this way\n", - "# css = CSS(\"txt/language.css\", font_config=font_config)\n", - "# html.write_pdf('txt/language.pdf', stylesheets=[css], font_config=font_config)" + "import nltk\n", + "\n", + "txt = open('txt/language.txt').read()\n", + "words = nltk.word_tokenize(txt)\n", + "tagged_words = nltk.pos_tag(words)" ] }, { "cell_type": "code", - "execution_count": 40, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ - "html.write_pdf('txt/language.pdf', font_config=font_config)" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " " - ], - "text/plain": [ - "" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from IPython.display import IFrame\n", - "IFrame(\"txt/language.pdf\", width=1024, height=600)" + "content = ''\n", + "content += '

Language and Software Studies, by Florian Cramer

'\n", + "\n", + "for word, tag in tagged_words:\n", + " content += f'{ word } '\n", + "\n", + "with open(\"txt/language.html\", \"w\") as f:\n", + " f.write(f\"\"\"\n", + "\n", + "\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "{content}\n", + "\n", + "\"\"\")" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "html = HTML(\"txt/language.html\")\n", + "html.write_pdf('txt/language.pdf', font_config=font_config)" + ] }, { "cell_type": "code",