<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Tasks of the Contingent Librarian</title>
<link rel="stylesheet" type="text/css" href="tasks.css">
<script src="tasks.js"></script>
</head>
<body>
<div class="cardback"><img class="cardbackimg" src="./IMG/QWERTY_layout.jpeg"></div>QWERTY keyboard layout
<h2><span class="mw-headline" id="11.11.19_Extracting_text_using_curl">11.11.19 Extracting text using curl</span></h2>
<p><code>curl</code> is a command-line tool that fetches the contents of a URL from the terminal. Its output can be piped into software such as pandoc to convert the text to other formats, which comes in quite handy for a workflow I'm starting to develop.
</p><p>I'm writing text on a pad and then converting it to Markdown. This extra step isn't necessary (in fact, it adds to the work), but I'm interested in using pads as multi-flow publishing tools in the future, so I'm testing it out. Writing on a pad also lets me style the text simply, using Markdown rather than HTML.
</p><p>For example, this is how I made a file from some notes on a Flusser interview about linear writing:
</p>
<pre> $ curl <a class="external free" href="https://pad.xpub.nl/p/flusser_interview_notes/export/txt" rel="nofollow">https://pad.xpub.nl/p/flusser_interview_notes/export/txt</a> | pandoc -t markdown > flusser.md
</pre>
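<p>The command above can be wrapped in a small function so that the same conversion can be reused for other pads. This is only a sketch: the names <code>PAD_BASE</code>, <code>pad_export_url</code> and <code>pad_to_markdown</code> are hypothetical, and it assumes every pad lives on the same server and exposes the same <code>/export/txt</code> path as the example.</p>

```shell
#!/bin/sh
# Hypothetical helpers; these names are illustrative and not part of the original notes.

# Base URL of the pad server used in the example (assumed to be the same for all pads).
PAD_BASE="https://pad.xpub.nl/p"

# Build the plain-text export URL for a given pad name.
pad_export_url() {
    printf '%s/%s/export/txt\n' "$PAD_BASE" "$1"
}

# Fetch a pad and write it out as Markdown.
#   $1 = pad name, $2 = output file
# curl flags: -f fails on HTTP errors, -s hides the progress meter,
# -S still prints an error message if the request fails.
pad_to_markdown() {
    curl -fsS "$(pad_export_url "$1")" | pandoc -f markdown -t markdown > "$2"
}
```

<p>With these functions loaded in the shell, <code>pad_to_markdown flusser_interview_notes flusser.md</code> should produce the same file as the one-line command. If <code>curl</code> fails, the output file is still created, just without the pad's text.</p>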
<p>I'm then storing the files in <a class="external text" href="https://git.xpub.nl/simoon/thesis" rel="nofollow">my git repository</a>, which is public. Keeping texts in git gives me version control: I can go back through earlier versions of each file and copy and paste snippets that I may want to retain in the future.
</p>
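<p>Going back over old versions can be done from the terminal as well. The following is a minimal sketch, assuming the file has been committed to a git repository; the function names <code>file_history</code> and <code>file_at</code> are hypothetical helpers, not part of git itself.</p>

```shell
#!/bin/sh
# Hypothetical helpers for looking at earlier versions of a text kept in git.

# List the commits that changed a file, newest first.
#   $1 = path to the file
file_history() {
    git log --oneline --follow -- "$1"
}

# Print a file as it was at a given commit, without changing the working copy.
#   $1 = commit hash, $2 = path relative to the repository root
file_at() {
    git show "$1:$2"
}
```

<p>For instance, <code>file_history flusser.md</code> lists the commits to pick from, and <code>file_at</code> followed by one of those hashes and <code>flusser.md</code> prints that older version, ready for copying snippets out of it.</p>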
</body>
</html>