Title: Pure Python PDF to text converter
Submitter: Dirk Holtwick
Last Updated: 2007/04/12
Version no: 1.1
Category: Text
2 vote(s)
Description:
This example shows how to extract text from a PDF file without relying on system-dependent tools or code. It uses the pyPdf library from http://pybrary.net/pyPdf/
Source:
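import pyPdf

def getPDFContent(path):
    content = ""
    # Open the PDF in binary mode and read it page by page.
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        # Append each page's extracted text, one page per line.
        content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace.
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content

print getPDFContent("test.pdf")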
Discussion:
pyPdf supports many other PDF manipulations as well. Another way to extract text from PDF files is to call the Linux command "pdftotext" and capture its output.
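The "pdftotext" route mentioned above could look roughly like the sketch below. It is written for Python 3 (the recipe itself is Python 2) and assumes the pdftotext binary from Xpdf or Poppler is installed and on the PATH. The function names pdftotext_command and get_pdf_content_external are made up for this illustration. Passing "-" as the output file asks pdftotext to write to standard output, and "-enc UTF-8" requests UTF-8 text.

```python
import shutil
import subprocess


def pdftotext_command(path):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", path, "-"]


def get_pdf_content_external(path):
    """Run pdftotext on `path` and return its output as a string."""
    # Assumption: pdftotext is installed separately; fail clearly if not.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        pdftotext_command(path),
        capture_output=True,  # catch stdout and stderr instead of printing
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling get_pdf_content_external("test.pdf") would then return the text of the file, provided the tool is available. Unlike the pyPdf version, this depends on an external program.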
Number of comments: 3
Josiah Carlson, 2007/04/12
The pdftotext tool in Xpdf (http://www.foolabs.com/xpdf/download.html) does something similar, though not in Python.
backslash should be escaped, Paul Rougieux, 2007/12/06
This code doesn't work as it is here. The backslash should be escaped on this line:
content = " ".join(content.replace("\\xa0", " ").strip().split())
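To make the difference between the two spellings concrete, here is a small Python 3 sketch (an assumption, since the recipe targets Python 2). "\xa0" is a single character, the non-breaking space U+00A0, while "\\xa0" is the four characters backslash, x, a, 0, so replacing the escaped form leaves real non-breaking spaces untouched. The helper name normalize_whitespace is hypothetical. In Python 2 the original line may instead have failed because a byte string was mixed with the unicode text that extractText() returns, but that is a guess based on the later traceback.

```python
NBSP = "\xa0"      # one character: U+00A0, the non-breaking space
ESCAPED = "\\xa0"  # four characters: backslash, "x", "a", "0"


def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    return " ".join(text.replace(NBSP, " ").split())


sample = "page\xa0one \n page\xa0two"

print(len(NBSP), len(ESCAPED))                 # 1 4
print(sample.replace(ESCAPED, " ") == sample)  # True: nothing was replaced
print(normalize_whitespace(sample))            # page one page two
```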
Error found, Narendran Subra, 2008/02/20
The given code doesn't work.
This error appears when I run it on my system:
Traceback (most recent call last):
  File "pdfext.py", line 15, in <module>
    print getPDFContent("testds.pdf")
  File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xde' in position 1018: character maps to <undefined>
Can anyone solve this problem, and may I know the reason for the error?
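The traceback suggests the extraction itself succeeded and the failure happened at the print step: the console uses the cp437 code page, which has no mapping for the character u'\xde'. One possible workaround is to replace characters the console cannot show before printing. The sketch below is Python 3 and the helper name safe_for_console is made up for this illustration. It falls back to sys.stdout.encoding when no encoding is given, which is an assumption about how the script is run.

```python
import sys


def safe_for_console(text, encoding=None):
    """Return `text` with characters the target encoding lacks replaced by '?'."""
    # Assumption: when no encoding is passed, use the console's own encoding.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# U+00DE has no cp437 mapping, so it becomes "?" instead of raising an error.
print(safe_for_console("Thorn: \xde", "cp437"))  # Thorn: ?
```

Writing the text to a file opened with a UTF-8 encoding instead of printing it would avoid the lossy replacement altogether.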