String sanitization in Python

On July 16, 2007 in development, django, python

Sometimes users want to bring text from an editor like Word into your web forms. You will often find nasty little characters hiding in the text, like ’\u2022’ (a.k.a. the notorious bullet). These characters will normally throw errors if you try to convert them to ASCII:

>>> u'\u2022'.encode('ascii')
Traceback (most recent call last):
File "<console>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode ... (yadda yadda)

To sanitize these strings and make them XML/HTML safe:

>>> u'\u2022'.encode('ascii', 'xmlcharrefreplace')
'&#8226;'

It translates the invalid characters into their XML equivalents. Woo! You can also use 'ignore' or 'replace' (replaces with ?):

>>> u'\u2022'.encode('ascii', 'ignore')
''
>>> u'\u2022'.encode('ascii', 'replace')
'?'

If you’re getting nasty Unicode errors from your templates in Django now that they’ve merged the Unicode branch, this might help as a quick fix.

5 comments Add yours…

alan, about 1 month ago

very good its a big problem i also fadce these problems

Brandon Konkle, 28 days ago

You are a genius! Thank you very much for this information – I searched for hours before finally finding this lifesaving post. :-)

develope air force ones, 18 days ago

Reference: Very helpful, thanks!!

replica handbags, 5 days ago

thanks for that cool post

cheap propecia, 21 minutes ago

a blowjob. I would go soft during
I did not find her attractive,
a blowjob. I would go soft during
the middle of sex.
more direct can you get
I was having no
Is very intristing

Post a comment