Convert Text to UTF-8 Automatically
Plain text is, well, plain. It offers no standard way to declare its character encoding. If a text document uses a Unicode Transformation Format such as UTF-8 or UTF-32LE, the encoding may be indicated by a Byte Order Mark ("BOM"). However, the Unicode Standard neither requires nor recommends a BOM for UTF-8. And a byte order mark is no help at all for documents encoded in other character sets such as "Windows-1252", "ISO-8859-1" or "KOI8-R".
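To make the BOM idea concrete, here is a minimal sketch in C of how a program might check for the well-known byte order marks at the start of a buffer. The function name bom_encoding is hypothetical and chosen for illustration only; it is not part of any2utf8 or AutoUniConv.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the name of the encoding indicated by a
   leading byte order mark, or NULL if the buffer starts with none. */
const char *bom_encoding(const unsigned char *buf, size_t len)
{
    static const unsigned char utf32le[] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char utf32be[] = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char utf8[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16le[] = { 0xFF, 0xFE };
    static const unsigned char utf16be[] = { 0xFE, 0xFF };

    /* UTF-32LE must be tested before UTF-16LE because its BOM begins
       with the same two bytes (FF FE). The sequence FF FE 00 00 could
       also be UTF-16LE text starting with U+0000; this sketch simply
       prefers UTF-32LE in that case. */
    if (len >= 4 && memcmp(buf, utf32le, 4) == 0) return "UTF-32LE";
    if (len >= 4 && memcmp(buf, utf32be, 4) == 0) return "UTF-32BE";
    if (len >= 3 && memcmp(buf, utf8, 3) == 0)    return "UTF-8";
    if (len >= 2 && memcmp(buf, utf16le, 2) == 0) return "UTF-16LE";
    if (len >= 2 && memcmp(buf, utf16be, 2) == 0) return "UTF-16BE";
    return NULL; /* no BOM: the encoding has to be determined another way */
}
```

A NULL result is the common case for real-world files, which is exactly where a BOM check alone stops being useful.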
Dozens of converters are available that can transform text from one character encoding to another with ease. However, they require the document's character encoding to be known in advance so that the correct encoding parameters can be specified. So, if the character encoding of a document is known, using one of these tools is sufficient.
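To illustrate why a conventional converter needs to be told the source encoding, here is a minimal sketch in C of a conversion that is only correct under the assumption that the input is ISO-8859-1. The function name latin1_to_utf8 is hypothetical and chosen for illustration; it is not taken from any existing converter.

```c
#include <stddef.h>

/* Hypothetical helper: converts len bytes of ISO-8859-1 text to UTF-8.
   The caller must provide room for 2 * len + 1 bytes in out.
   Returns the number of bytes written, not counting the final NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII range: identical in ISO-8859-1 and UTF-8 */
            out[n++] = c;
        } else {
            /* 0x80..0xFF map to U+0080..U+00FF: two bytes in UTF-8 */
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

The mapping is hard-wired to one charset. Fed a KOI8-R document, the same function would still produce well-formed UTF-8, but the Cyrillic letters would come out as accented Latin characters, which is why the source encoding has to be right.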
Whenever the character encoding of a text document is unknown, any2utf8 is a handy tool that converts files to UTF-8 automatically, without requiring any knowledge of the underlying document charset. The automatic conversion is accomplished using the AutoUniConv software library.
AutoUniConv is a software library that automatically detects the character encoding of strings and recodes them to one of the common Unicode Transformation Formats, such as UTF-8. It can detect a wide range of charsets, including all common Unicode Transformation Formats and various encodings of the ISO-8859, Windows, Macintosh and IBM/Code Page families. Some national charsets such as Big5 and KOI8-R can be detected and converted automatically as well.
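How AutoUniConv performs its detection is not described here, and the following is not its algorithm. As a general illustration of what charset detection can build on, this minimal sketch in C shows one widely used heuristic: checking whether a buffer is well-formed UTF-8. Text in legacy 8-bit charsets that contains non-ASCII characters rarely passes this check by accident. The function name is_valid_utf8 is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
   0 otherwise. Pure ASCII input counts as well-formed. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp;
        size_t extra, k;

        if (c < 0x80) { i++; continue; }                  /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; }
        else return 0;   /* stray continuation byte or invalid lead byte */

        if (len - i <= extra) return 0;                   /* truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char t = buf[i + k];
            if ((t & 0xC0) != 0x80) return 0;   /* not a continuation byte */
            cp = (cp << 6) | (t & 0x3F);
        }

        /* Reject overlong forms, UTF-16 surrogates and values beyond
           U+10FFFF. Two-byte overlongs are already excluded because
           the lead bytes C0 and C1 are not accepted above. */
        if (extra == 2 && cp < 0x800) return 0;
        if (extra == 3 && cp < 0x10000) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        if (cp > 0x10FFFF) return 0;

        i += extra + 1;
    }
    return 1;
}
```

A check like this can only separate UTF-8 from everything else. Telling "Windows-1252" from "KOI8-R", for example, needs further analysis, which is the part a detection library has to supply.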
How to Use
any2utf8 is a tiny command line application written in ANSI C. It reads input either from a file or from the standard input stream, so it can be used to convert plain text files directly or as part of a (Unix) pipe.
The following examples give a short introduction to its usage (on a UTF-8 enabled console):
$ cat greek-windows-1253.txt | any2utf8
Ολοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα.
$ any2utf8 russian-koi8-r.txt
Все люди рождаются свободными и равными в своем достоинстве и правах.
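The two invocations above differ only in where the input comes from. The following is a minimal sketch in C of how a tool of this kind might handle both cases. It is not the actual any2utf8 source, and the names open_input and read_all are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: opens path for binary reading, or falls back to
   the standard input stream when no path is given (path == NULL). */
FILE *open_input(const char *path)
{
    return path ? fopen(path, "rb") : stdin;
}

/* Hypothetical helper: reads the whole stream into a malloc'ed buffer
   and stores the byte count in *len. Returns NULL on error; the caller
   frees the buffer. */
unsigned char *read_all(FILE *fp, size_t *len)
{
    size_t cap = 4096, n = 0, got;
    unsigned char *buf = malloc(cap), *tmp;

    if (!buf) return NULL;
    while ((got = fread(buf + n, 1, cap - n, fp)) > 0) {
        n += got;
        if (n == cap) {                 /* buffer full: double it */
            cap *= 2;
            tmp = realloc(buf, cap);
            if (!tmp) { free(buf); return NULL; }
            buf = tmp;
        }
    }
    if (ferror(fp)) { free(buf); return NULL; }
    *len = n;
    return buf;
}
```

A main function could then call open_input(argc > 1 ? argv[1] : NULL), pass the resulting stream to read_all, and hand the buffer to the detection and conversion step. Reading the whole input first is a deliberate simplification here: charset detection generally benefits from seeing as much of the text as possible before committing to a result.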
The source code of any2utf8 is available for download.
If you have not ordered a copy of the AutoUniConv software library yet, have a look at the SDK and feel free to request an evaluation version.