ASPN ActiveState Programmer Network  
ActiveState, a division of Sophos

Title: Pure Python PDF to text converter
Submitter: Dirk Holtwick
Last Updated: 2007/04/12
Version no: 1.1
Category: Text

 

4 stars 2 vote(s)


Description:

This example shows how to extract text from a PDF file without relying on system-dependent tools or code. Just use the pyPdf library from http://pybrary.net/pyPdf/

Source:

import pyPdf

def getPDFContent(path):
    pages = []
    # Open in binary mode; the with-block closes the file when done
    with open(path, "rb") as f:
        # Load PDF into pyPdf
        pdf = pyPdf.PdfFileReader(f)
        # Iterate over pages
        for i in range(pdf.getNumPages()):
            # Extract text from each page
            pages.append(pdf.getPage(i).extractText())
    content = u"\n".join(pages)
    # Replace non-breaking spaces (unicode literal, since extractText
    # returns unicode), then collapse runs of whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").split())
    return content

print(getPDFContent("test.pdf"))
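
The whitespace-collapsing step at the end of the recipe can be tried on its own, without a PDF or pyPdf installed. The following is a minimal sketch written for Python 3; the helper name collapse_whitespace is hypothetical and not part of the recipe or of pyPdf.

```python
def collapse_whitespace(text):
    # Turn non-breaking spaces (U+00A0) into ordinary spaces first
    text = text.replace(u"\xa0", u" ")
    # split() with no arguments splits on any run of whitespace,
    # so joining with a single space collapses newlines and repeats
    return u" ".join(text.split())

print(collapse_whitespace(u"Page\xa0one \n\n  page   two\n"))
```

Here the page breaks and repeated spaces all become single spaces, which is what the recipe does to the concatenated page text.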

Discussion:

Many other PDF manipulations are possible with pyPdf. Another way to extract the text from PDF files is to call the Linux command "pdftotext" and capture its output.
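
The pdftotext approach might look roughly like the sketch below (Python 3). It assumes that a pdftotext executable is installed and on the PATH, that passing "-" as the output file makes it write to standard output, and that the "-enc UTF-8" option is available to fix the output encoding. The function names build_pdftotext_command and pdf_to_text_cli are hypothetical.

```python
import subprocess

def build_pdftotext_command(path, executable="pdftotext"):
    # "-enc UTF-8" requests UTF-8 output; "-" sends the text to stdout
    return [executable, "-enc", "UTF-8", path, "-"]

def pdf_to_text_cli(path, executable="pdftotext"):
    """Run pdftotext on path and return its output with whitespace collapsed."""
    # check_output raises CalledProcessError if pdftotext exits with an error
    raw = subprocess.check_output(build_pdftotext_command(path, executable))
    # Replace undecodable bytes instead of failing on them
    text = raw.decode("utf-8", "replace")
    return " ".join(text.split())
```

Calling pdf_to_text_cli("test.pdf") would then play the same role as getPDFContent("test.pdf") in the recipe, at the cost of depending on an external program.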




Number of comments: 3

Josiah Carlson, 2007/04/12
The pdftotext tool in Xpdf (http://www.foolabs.com/xpdf/download.html) can do a similar thing, though not in Python.

backslash should be escaped, Paul Rougieux, 2007/12/06
This code doesn't work as it is here. The backslash should be escaped on this line: content = " ".join(content.replace("\\xa0", " ").strip().split())

Error found, Narendran Subra, 2008/02/20
The given code doesn't work. This error appears when I run it on my system:

Traceback (most recent call last):
  File "pdfext.py", line 15, in <module>
    print getPDFContent("testds.pdf")
  File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xde' in position 1018: character maps to <undefined>
Could anyone solve the problem, and may I know the reason for the error?
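
The traceback above is consistent with print trying to encode the extracted unicode text for a console that uses code page 437 (cp437), which has no mapping for u'\xde'. One possible workaround, sketched below, is to replace unencodable characters before printing. This assumes that showing "?" in place of such characters is acceptable; the helper name to_console_safe is hypothetical, and the cp437 default is taken from the traceback.

```python
def to_console_safe(text, encoding="cp437"):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Encoding with "replace" substitutes '?' for unmappable characters;
    # decoding again gives back a text string that is safe to print
    return text.encode(encoding, "replace").decode(encoding)

print(to_console_safe(u"Thorn: \xde"))
```

Characters that cp437 does support, such as "é", pass through unchanged.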



© 2006 ActiveState Software Inc. All rights reserved.