This is a complete program that reads an HTML document and converts it to plain ASCII text. In the spirit of minimalism, it operates as a standard Unix filter. For example: htmltotext < foo.html > foo.txt

If the output is going to a terminal, bold text is shown in bold and HTML italics are shown as underlining on the tty. Underlining in HTML is ignored (mostly due to laziness).

Python
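#!/usr/bin/env python
# htmltotext: convert HTML on stdin to plain text on stdout.
# Python 2 only: htmllib was removed in Python 3.0 and formatter in 3.10.

import sys, os, htmllib, formatter

# Ask tput for this terminal's escape sequences.
bold = os.popen('tput bold').read()
underline = os.popen('tput smul').read()
reset = os.popen('tput sgr0').read()

class TtyFormatter(formatter.AbstractFormatter):
    """Formatter that shows bold as bold and italics as underline on a tty."""
    def __init__(self, writer):
        formatter.AbstractFormatter.__init__(self, writer)
        self.fontStack = []        # one (italic, bold) pair per open font tag
        self.fontState = (0, 0)    # attributes currently active on the tty
    def push_font(self, font):
        size, italic, isBold, tt = font    # size and tt are not used
        self.fontStack.append((italic, isBold))
        self.updateFontState()
    def pop_font(self, *args):
        if self.fontStack:
            self.fontStack.pop()
        self.updateFontState()
    def updateFontState(self):
        if self.fontStack:
            newState = self.fontStack[-1]
        else:
            newState = (0, 0)
        if self.fontState != newState:
            # Write the codes directly; "print code," could insert stray spaces.
            sys.stdout.write(reset)
            if newState[0]: sys.stdout.write(underline)
            if newState[1]: sys.stdout.write(bold)
            self.fontState = newState

# Style the output only when it goes to a terminal.
myWriter = formatter.DumbWriter()
if sys.stdout.isatty():
    myFormatter = TtyFormatter(myWriter)
else:
    myFormatter = formatter.AbstractFormatter(myWriter)
myParser = htmllib.HTMLParser(myFormatter)
myParser.feed(sys.stdin.read())
myParser.close()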
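
htmllib and formatter no longer exist in current Python 3, so the program above needs Python 2. As a rough sketch only, and not the author's code, the same idea can be approximated with the standard html.parser module. The names TextExtractor, html_to_text and main, the tag sets, and the hard-coded ANSI escape sequences (used here in place of tput) are all assumptions made for illustration. Unlike DumbWriter, this sketch does no word wrapping.

```python
import sys
from html.parser import HTMLParser

# Assumed ANSI escape sequences; the recipe asks tput for these instead.
BOLD, UNDERLINE, RESET = "\033[1m", "\033[4m", "\033[0m"

BOLD_TAGS = {"b", "strong"}
ITALIC_TAGS = {"i", "em"}          # rendered as underline, as in the recipe
BLOCK_TAGS = {"p", "br", "div", "h1", "h2", "h3", "li", "tr"}
SKIP_TAGS = {"script", "style"}    # their contents are not document text


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text and optional terminal styling."""

    def __init__(self, styled=False):
        super().__init__(convert_charrefs=True)
        self.styled = styled
        self.out = []
        self.font_stack = []       # one (italic, bold) pair per open font tag
        self.skip_depth = 0

    def _current(self):
        return self.font_stack[-1] if self.font_stack else (False, False)

    def _emit_style(self):
        if self.styled:
            italic, bold = self._current()
            self.out.append(
                RESET + (UNDERLINE if italic else "") + (BOLD if bold else "")
            )

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BOLD_TAGS or tag in ITALIC_TAGS:
            # In this sketch nested tags combine, e.g. <b><i> is both.
            italic, bold = self._current()
            self.font_stack.append(
                (italic or tag in ITALIC_TAGS, bold or tag in BOLD_TAGS)
            )
            self._emit_style()
        elif tag in BLOCK_TAGS:
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BOLD_TAGS or tag in ITALIC_TAGS:
            if self.font_stack:
                self.font_stack.pop()
            self._emit_style()

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def html_to_text(html, styled=False):
    parser = TextExtractor(styled)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def main():
    # Filter behaviour: stdin to stdout, styled only when stdout is a tty.
    sys.stdout.write(html_to_text(sys.stdin.read(), styled=sys.stdout.isatty()))
```

To use it as a filter, finish the script with a call to main().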

The Unix tput command is used to get the escape codes for the terminal. I think it is commonly available, but I haven't run it on a lot of platforms. The basic AbstractFormatter should work everywhere.
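
Where tput is missing or fails (for example when TERM is unset), one option is to fall back to empty strings so that the output is simply unstyled. A minimal Python 3 sketch of that idea, assuming the subprocess module is acceptable in place of os.popen; term_code and its tput parameter are hypothetical names, not part of the recipe.

```python
import subprocess


def term_code(capability, tput="tput"):
    """Return the escape sequence for a terminfo capability, or '' on failure."""
    try:
        result = subprocess.run(
            [tput, capability],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,   # keep tput's complaints off the tty
            check=True,                  # a non-zero exit raises CalledProcessError
        )
    except (OSError, subprocess.CalledProcessError):
        # tput is not installed, or it does not know this terminal/capability.
        return ""
    return result.stdout.decode("latin-1")


bold = term_code("bold")
underline = term_code("smul")
reset = term_code("sgr0")
```

With empty strings as the fallback, writing the codes becomes a no-op, so the rest of the program does not need a separate code path.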

2 comments

Mark Moraes 21 years, 9 months ago

minor bugfix needed? Shouldn't that be

self.fontStack.append((italic, bold))

Brent Burley (author) 20 years, 1 month ago

Fixed. The append(a,b) syntax used to work, though it probably should have been append((a,b)) from the beginning. In any case, I've fixed the bug. Thanks!
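
For reference, a small Python 3 illustration of the difference discussed in these comments: list.append takes exactly one argument, so the (italic, bold) pair has to be passed as a single tuple. The variable names are only for illustration.

```python
font_stack = []
italic, bold = 1, 0

# One argument: a single (italic, bold) tuple is pushed onto the stack.
font_stack.append((italic, bold))

# Two arguments: modern Python rejects the call with a TypeError.
try:
    font_stack.append(italic, bold)
    error = None
except TypeError as exc:
    error = str(exc)
```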