Tricky, right? Not really.
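#!/usr/bin/python
# Written for Python 2, where str holds raw bytes: each c is one iso-8859-1 byte.
import sys
for c in sys.stdin.read():
    if ord(c) < 0x80: sys.stdout.write(c)  # ASCII passes through unchanged
    elif ord(c) < 0xC0: sys.stdout.write('\xC2' + c)
    else: sys.stdout.write('\xC3' + chr(ord(c) - 64))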
To use: pipe iso-8859-1 in, get utf-8 out. Or here's a slightly nicer script which works using regular expressions, and can take filename arguments too:
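#!/usr/bin/python
"""Convert iso-8859-1 to utf-8. Sean B. Palmer."""
import sys, re

# Matches any single non-ASCII byte (Python 2 byte strings).
r_iso = re.compile('([\x80-\xFF])')

def iso2utf(s):
    def conv(m):
        c = m.group(0)
        # 0x80-0xBF: prefix 0xC2; 0xC0-0xFF: prefix 0xC3 and subtract 64.
        return ('\xC2'+c, '\xC3'+chr(ord(c) - 64))[ord(c) > 0xBF]
    return r_iso.sub(conv, s)

def main(argv=None):
    if argv is None: argv = sys.argv[1:]
    # Each named file is converted and overwritten in place.
    for fn in argv:
        s = iso2utf(open(fn).read())
        open(fn, 'w').write(s)
    # With no filenames, act as a stdin-to-stdout filter.
    if not argv: sys.stdout.write(iso2utf(sys.stdin.read()))

if __name__=="__main__":
    main()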
The main() idiom that I used is quite useful, incidentally. Let's hope that Karl finds this script easier than using iconv!
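For anyone who hasn't met the idiom: the point of main(argv=None) is that the same function works from the command line and from an interactive session or a test, because you can hand it an argument list directly. Here is a minimal sketch of that. The shout() helper and its upper-casing behaviour are made up purely for illustration and have nothing to do with the converter.

```python
import sys

def shout(words):
    # Hypothetical payload, for illustration only: upper-case and join.
    return ' '.join(w.upper() for w in words)

def main(argv=None):
    # Only fall back to the real command line when no list was passed in.
    if argv is None:
        argv = sys.argv[1:]
    print(shout(argv))
    return 0

if __name__ == "__main__":
    main()
```

Calling main(['hello', 'world']) from another module or a test then exercises exactly the code path the command line would, without touching sys.argv.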
Footnote [added 2007-01-12, after being diveintomarkdotted]: Later I published a slightly longer C version of this script. Python allows you to convert between these encodings more easily using bytes.decode('iso-8859-1').encode('utf-8') of course, but that doesn't hint at the magic underneath the covers: utf-8 is really awesome. Hopefully people who just like cool hacks will go "oh, cool hack", whereas those who are more interested in how systems work will stand up and declare to the world "lo and behold! for now is my interest in utf-8 duly piqued!"
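To hint at that magic a little: the '\xC2' and '\xC3' prefixes and the "subtract 64" in the scripts above fall straight out of utf-8's two-byte layout, 110xxxxx 10xxxxxx. Below is a rough sketch of the same conversion in Python 3, written with bit operations instead of the magic constants. The function name latin1_to_utf8 is my own invention for this example, and this is an illustration of the arithmetic, not a replacement for the built-in codecs.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert iso-8859-1 bytes to utf-8 bytes by hand (illustrative helper)."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            # ASCII range: a single byte, unchanged.
            out.append(b)
        else:
            # Two-byte form 110xxxxx 10xxxxxx.
            # Lead byte: 0xC2 for 0x80-0xBF, 0xC3 for 0xC0-0xFF.
            out.append(0xC0 | (b >> 6))
            # Trail byte: b itself for 0x80-0xBF, b - 64 for 0xC0-0xFF.
            out.append(0x80 | (b & 0x3F))
    return bytes(out)

# There are only 256 possible input bytes, so check all of them
# against Python's own codecs.
all_bytes = bytes(range(256))
assert latin1_to_utf8(all_bytes) == all_bytes.decode('iso-8859-1').encode('utf-8')
```

Because iso-8859-1 byte values are the same numbers as the first 256 Unicode code points, nothing ever needs more than two bytes, which is why two prefixes are enough.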
As for developing tests for this... I'd rather do another post explaining how and why it works, possibly making a tool to generate test cases, but really it's the pedagogy that counts. In Python this isn't production code, and even though it sure could be, as in the C example, that's not what I'm aiming at here.