<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Tasks of the Contingent Librarian</title>
<link rel="stylesheet" type="text/css" href="tasks.css">
<script src="tasks.js"></script>
</head>
<body>

<div class="card"><DOCUMENT_FRAGMENT><div class="mw-parser-output"><h2><span class="mw-headline" id="Extracting_text_from_a_PDF">Extracting text from a PDF</span></h2>
<p>Al Sweigart's <i>Automate the Boring Stuff with Python</i> has a nice section on PyPDF2, a Python library that allows you to work with the contents of PDFs. To begin with, I thought I'd try extracting the text from a PDF of William S. Burroughs's <i>The Electronic Revolution</i>. I chose this PDF because the only version I've found online is a 40-page document published by ubuclassics (which I suppose is the publishing house for ubuweb.com). It carries no other identifier (no ISBN etc.), and I couldn't locate any other version online. What's more, the PDF has very small text, which was uncomfortable to read once I had run the <a href="Michaels_booklet_script_for_PDF_imposition.html" title="User:Simon/Trim4/Michaels booklet script for PDF imposition">booklet.sh</a> script on it.
</p><p>I thought it would be worthwhile to lay this book out again for reading in print, and the first step is to get the text out of the PDF. Pandoc is usually my go-to for extracting text, but it can't read PDFs as input, so I tried <a class="external text" href="https://pythonhosted.org/PyPDF2/index.html" rel="nofollow">PyPDF2</a>.
</p>
<h3><span class="mw-headline" id="28.09.19">28.09.19</span></h3>
<p>I began by copying a file called electronic_revolution.pdf to a folder, then used <code>cd</code> in the terminal to move into that directory. Then I started the interactive Python interpreter with this command:
</p>
<pre>$ python3</pre>
<p>Next I wrote the following commands in Python 3 (comments above each line):
</p>
<pre># First, import the PyPDF2 module
&gt;&gt;&gt; import PyPDF2
# Then open electronic_revolution.pdf in read binary mode and store it in pdfFileObj
&gt;&gt;&gt; pdfFileObj = open('electronic_revolution.pdf', 'rb')
# To get a PdfFileReader object that represents this PDF, call PyPDF2.PdfFileReader() and pass it pdfFileObj. Store this PdfFileReader object in pdfReader
&gt;&gt;&gt; pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# The total number of pages in the document is stored in the numPages attribute of a PdfFileReader object
&gt;&gt;&gt; pdfReader.numPages
40
# The PDF has 40 pages. To extract text from a page, you need to get a Page object, which represents a single page of a PDF, from a PdfFileReader object.
# You can get a Page object by calling the getPage() method on a PdfFileReader object and passing it the page number of the page you’re interested in (in our case, 0)
&gt;&gt;&gt; pageObj = pdfReader.getPage(0)
# Once you have your Page object, call its extractText() method to return a string of the page’s text
&gt;&gt;&gt; pageObj.extractText()
'ubuclassics2005WILLIAM S. BURROUGHSTheElectronic\nRevolution'
</pre>
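<p>This returns the total page count and the text of the first page (page 0). What I want is the text of the whole document, and I have no idea how to do this! My first guess doesn't work at all: Python reads <code>0-40</code> as a subtraction, so <code>getPage()</code> still receives a single page number (-40) instead of a range of pages.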
</p>
<pre>&gt;&gt;&gt; pageObj = pdfReader.getPage(0-40)
</pre>
<h3><span class="mw-headline" id="02.10.19">02.10.19</span></h3>
<p>With Rita &amp; Pedro's help I managed to write a Python script that includes a for loop to extract text from the entire PDF:
</p>
<div class="mw-highlight mw-content-ltr" dir="ltr"><pre><span></span><span class="c1"># import the PyPDF2 module</span>
<span class="kn">import</span> <span class="nn">PyPDF2</span>

<span class="c1"># ask for the name of the PDF to read</span>
<span class="n">filename</span> <span class="o">=</span> <span class="nb">input</span><span class="p">(</span><span class="s2">"name of the file: "</span><span class="p">)</span>

<span class="c1"># open the PDF in read binary mode and input.txt for writing</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">pdf_file</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'input.txt'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">text_file</span><span class="p">:</span>
    <span class="n">read_pdf</span> <span class="o">=</span> <span class="n">PyPDF2</span><span class="o">.</span><span class="n">PdfFileReader</span><span class="p">(</span><span class="n">pdf_file</span><span class="p">)</span>
    <span class="n">number_of_pages</span> <span class="o">=</span> <span class="n">read_pdf</span><span class="o">.</span><span class="n">getNumPages</span><span class="p">()</span>
    <span class="c1"># go through every page and append its text to input.txt</span>
    <span class="k">for</span> <span class="n">page_number</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_pages</span><span class="p">):</span>   <span class="c1"># use xrange in Py2</span>
        <span class="n">page</span> <span class="o">=</span> <span class="n">read_pdf</span><span class="o">.</span><span class="n">getPage</span><span class="p">(</span><span class="n">page_number</span><span class="p">)</span>
        <span class="n">page_content</span> <span class="o">=</span> <span class="n">page</span><span class="o">.</span><span class="n">extractText</span><span class="p">()</span>
        <span class="n">text_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">page_content</span><span class="p">)</span>
</pre></div>
<p>The next step is to begin cleaning up the text by removing the empty lines. We wrote a simple shell script for this that runs the Python script, then a <code>grep</code> command that drops every empty line from input.txt and saves the result as output.txt:
</p>
<pre>$ python3 extract_text.py
$ grep -v "^$" input.txt &gt; output.txt
</pre>
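<p>The <code>grep</code> step could also be done in Python, which would keep the whole cleanup in one language. This is only a minimal sketch, assuming the extracted text fits in memory; the function name <code>drop_empty_lines</code> is made up for illustration and is not part of PyPDF2 or of the script above. Like <code>grep -v "^$"</code>, it only drops lines that are completely empty, so lines containing spaces are kept.
</p>
<pre>def drop_empty_lines(text):
    """Return text without its completely empty lines (hypothetical helper)."""
    lines = text.split("\n")
    # keep every line that has at least one character in it
    kept = [line for line in lines if line != ""]
    # put a newline back at the end of each kept line
    return "".join(line + "\n" for line in kept)


if __name__ == "__main__":
    sample = "first line\n\nsecond line\n\n\nthird line\n"
    print(drop_empty_lines(sample), end="")
</pre>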
<p>The only problem is that the text file produced is always named "input.txt". If you run this on more than one PDF, you have to move input.txt to another directory or rename it by hand before running the script again. Perhaps I could write another Python script that renames the file, and call it at the end of the shell script.
</p>
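<p>One way around the naming problem might be to derive the name of the text file from the name of the PDF, rather than renaming the file afterwards. This is a minimal sketch using <code>pathlib</code> from Python's standard library; <code>text_name_for</code> is a hypothetical helper name, not something from the script above.
</p>
<pre>from pathlib import Path


def text_name_for(pdf_name):
    """Return pdf_name with its extension replaced by .txt (hypothetical helper)."""
    return str(Path(pdf_name).with_suffix(".txt"))


if __name__ == "__main__":
    # prints electronic_revolution.txt
    print(text_name_for("electronic_revolution.pdf"))
</pre>
<p>In extract_text.py this could stand in for the fixed name, with <code>open(text_name_for(filename), 'w')</code> instead of <code>open('input.txt', 'w')</code>. The <code>grep</code> line in the shell script would then need to read from that name too.
</p>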
</div></DOCUMENT_FRAGMENT></div>

</body>
</html>