Python html escape and unescape

2019-09-27 20:19:36

Methods for escaping and unescaping HTML special characters in Python.

The cgi module that comes with older versions of Python has an escape() function (it was deprecated in Python 3.2 and removed in Python 3.8):

import cgi

s = cgi.escape("""& < >""")   # s = "&amp; &lt; &gt;"

However, it doesn't escape characters beyond &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".

Python 3.2 and later provide the html module with an html.escape() function; html.unescape() was added in Python 3.4. html.escape() differs from cgi.escape() in that quote defaults to True:

import html

s = html.escape("""& < " ' >""")   # s = '&amp; &lt; &quot; &#x27; &gt;'
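To make the effect of the quote parameter concrete, here is a minimal sketch comparing the default behaviour with quote=False, followed by a round trip through html.unescape(). The variable names are illustrative only.

```python
import html

raw = """& < " ' >"""

# quote=True is the default: double and single quotes are escaped too.
escaped_all = html.escape(raw)

# quote=False escapes only &, < and >.
escaped_min = html.escape(raw, quote=False)

# html.unescape() reverses the escaping.
restored = html.unescape(escaped_all)

print(escaped_all)
print(escaped_min)
print(restored == raw)
```

Use quote=False only for text that will never end up inside an attribute value.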

Here's a small snippet that will let you escape quotes and apostrophes as well:

html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
    }

def html_escape(text):
    """Produce entities within text."""
    # Look up each character; anything not in the table passes through unchanged.
    return "".join(html_escape_table.get(c, c) for c in text)
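The same character-by-character mapping can also be expressed with str.translate(), which is a different technique from the join-based snippet above but should give the same result. This is only a sketch; ESCAPE_MAP and html_escape_translate are hypothetical names introduced for illustration.

```python
# Hypothetical names, not part of any library.
ESCAPE_MAP = str.maketrans({
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
})

def html_escape_translate(text):
    """Escape HTML special characters in a single pass over the text."""
    return text.translate(ESCAPE_MAP)

sample = """<a href="x">Tom & 'Jerry'</a>"""
result = html_escape_translate(sample)
print(result)
```

Because each character is translated exactly once, the ampersands that are introduced by the replacements are never escaped a second time.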

You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster than the character-by-character snippet. The unescape() function of the same module takes an entities dictionary in the same way, with the mapping reversed, to decode a string.

from xml.sax.saxutils import escape, unescape
# escape() and unescape() take care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;"
}
# Reverse mapping (entity -> character) for decoding.
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)
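As a quick check of the approach, the following sketch runs a string through escape() and back through unescape() with the extra quote entities. The names quote_entities, reverse_entities and the sample string are assumptions made for illustration.

```python
from xml.sax.saxutils import escape, unescape

# Extra entities on top of &, < and >, which saxutils already handles.
quote_entities = {'"': "&quot;", "'": "&apos;"}
reverse_entities = {v: k for k, v in quote_entities.items()}

original = """<p class="x">Tom & 'Jerry'</p>"""
encoded = escape(original, quote_entities)
decoded = unescape(encoded, reverse_entities)

print(encoded)
print(decoded == original)
```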

Undoing the escaping performed by cgi.escape() isn't directly supported by the library. This can be accomplished using a fairly simple function, however:

def unescape(s):
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    # this has to be last:
    s = s.replace("&amp;", "&")
    return s
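The reason the ampersand replacement has to come last can be shown with a small sketch. The two function names below are hypothetical; the second one deliberately uses the wrong order to show the failure.

```python
def unescape_amp_last(s):
    """Correct order: &amp; is decoded after &lt; and &gt;."""
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&amp;", "&")
    return s

def unescape_amp_first(s):
    """Deliberately wrong order, for comparison only."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    return s

# Escaped form of the literal text "&lt;b&gt;" (someone writing about entities).
escaped = "&amp;lt;b&amp;gt;"

good = unescape_amp_last(escaped)
bad = unescape_amp_first(escaped)
print(good)
print(bad)
```

Decoding &amp; first produces new "&lt;" and "&gt;" sequences that are then decoded a second time, so the text is unescaped twice.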

A very easy way to transform non-ASCII characters such as German umlauts or accented letters into their HTML equivalents is to encode them from Unicode to ASCII using the xmlcharrefreplace error handler:

>>> a = "äöüßáà"
>>> a.encode('ascii', 'xmlcharrefreplace')
b'&#228;&#246;&#252;&#223;&#225;&#224;'
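In Python 3 the encode() call returns bytes, so a further decode step is needed to get a string back. The sketch below assumes Python 3.4 or later (for html.unescape()) and uses illustrative variable names; it also shows that the numeric references can be turned back into the original characters.

```python
import html

text = "äöüßáà"

# Non-ASCII characters become numeric character references.
encoded = text.encode("ascii", "xmlcharrefreplace")

# The result is pure ASCII, so decoding it as ASCII is safe.
as_str = encoded.decode("ascii")

# html.unescape() understands numeric character references.
restored = html.unescape(as_str)

print(as_str)
print(restored == text)
```

Note that this error handler only replaces characters that cannot be encoded; it does not escape &, < or >, so it complements html.escape() and does not replace it.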
