
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Tasks of the Contingent Librarian</title>
<link rel="stylesheet" type="text/css" href="tasks.css">
<script src="tasks.js"></script>
</head>
<body>

<div class="card"><DOCUMENT_FRAGMENT><div class="mw-parser-output"><h1><span class="mw-headline" id="cleaning_up_text">cleaning up text</span></h1>
<p>see also <a href="Being_kind_to_the_reader.html" title="User:Simon/Being kind to the reader">being kind to the reader</a>, <a href="Editing.html" title="User:Simon/Editing">editing</a>, <a href="Typing.html" title="User:Simon/Typing">typing</a>
</p><p>A text found in the wild often comes with visible and invisible artefacts. The visible ones come from bad OCR, with strange characters popping up in place of the ones you expect, such as a 1 instead of an l. The invisible ones are harder to spot. The bane of the bootlegger is most definitely the line break or “soft return”, inserted by software that automatically breaks the line as you type. Screen-based formats such as EPUB don’t have the notion of a page, and flow text according to window size, so a break carried over from a fixed page can land in the middle of a sentence.
</p><p>You can either be methodical and remove each soft return manually, or use the powerful automated <i>find/replace all</i> option. A useful tactic is to first <i>find</i> every full stop (with any trailing space) that sits where a line was intentionally broken by the human writer, and <i>replace</i> each one with an arbitrary but uncommon character, such as a dagger (†). The dagger now marks the breaks you want to keep. Then do another <i>find/replace</i> that swaps every remaining soft return, along with any space next to it, for a single space. Finally, <i>replace</i> the uncommon character with a full stop and a return in one last <i>find/replace</i> command, which restores the intended breaks.
Another unwanted character that often appears is the hyphen, inserted where words break at the end of a line. Here the pruning of errant characters is trickier, and the best method is to <i>find</i> each instance and remove it manually. Running <i>find/replace all</i> can easily remove necessary hyphens too, such as those in time ranges (e.g. 9-5) and compound adjectives (e.g. inter-dependent).
</p><p>Image: Hidden characters (e.g. tabs, spaces, carriage and ‘soft’ returns)
</p>
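<p>The marker tactic described above can also be sketched as a short script. This is a minimal Python sketch rather than a definitive tool: the function names <code>remove_soft_returns</code> and <code>line_end_hyphens</code> are hypothetical, and it assumes that every return directly after a full stop is intentional and that the dagger does not already occur in the text. Hyphens at line ends are only listed, not removed, so that each one can still be checked by hand.
</p>
<pre>
```python
import re

MARKER = "\u2020"  # dagger; assumed not to occur in the text itself


def remove_soft_returns(text, marker=MARKER):
    """Join lines broken by soft returns, keeping breaks that follow a full stop."""
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another one")
    # 1. Protect intended breaks: a full stop, optional trailing spaces, a return.
    text = re.sub(r"\.[ \t]*\n", marker, text)
    # 2. Treat every remaining return as a soft return and collapse it,
    #    together with any spaces around it, into a single space.
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    # 3. Restore the protected breaks as a full stop plus a return.
    return text.replace(marker, ".\n")


def line_end_hyphens(text):
    """List (line number, left part, right part) for words split by a line-end hyphen."""
    found = []
    for match in re.finditer(r"(\w+)-\n(\w+)", text):
        line_number = text.count("\n", 0, match.start()) + 1
        found.append((line_number, match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    sample = "A text found in the\nwild often comes with\nartefacts.\nThe next paragraph\nstarts here.\n"
    print(remove_soft_returns(sample))
```
</pre>
<p>It probably makes sense to run <code>line_end_hyphens</code> before <code>remove_soft_returns</code>, since the second function turns each line-end hyphen into a hyphen followed by a space. Note that a soft return which happens to fall right after the end of a sentence is kept as a break, so the result still needs a read-through.
</p>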
</div></DOCUMENT_FRAGMENT></div>

</body>
</html>