Convert Text to UTF-8 Automatically Using any2utf8

Plain text is, well, plain. It offers no standard way to declare its charset. If the document uses a Unicode Transformation Format such as UTF-8 or UTF-32LE, the charset in use may be indicated by a Byte Order Mark ("BOM"). However, using a BOM with UTF-8 is officially not recommended. And a byte order mark is no help at all for documents encoded in other character sets such as "Windows-1252", "ISO-8859-1" or "KOI8-R".
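As a rough illustration of the BOM check described above, here is a minimal Python sketch. The function name sniff_bom is hypothetical and this is not how any2utf8 works internally; it only looks at the leading bytes of the data and reports nothing for BOM-less charsets such as Windows-1252 or KOI8-R.

```python
import codecs

# Longer marks are tested first: the UTF-32LE BOM starts with the
# same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A result of (None, 0) says nothing about the actual charset; it only means the document carries no byte order mark.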

There are dozens of charset converters available that make it easy to transform text from one character encoding to another. If the document's charset is already known, using one of these tools is sufficient.
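When the source charset is known, the conversion itself is a two-step decode and encode. The following Python sketch shows the idea; to_utf8 is a hypothetical helper name, and the charset labels are the ones Python's standard codecs accept.

```python
def to_utf8(data: bytes, source_charset: str) -> bytes:
    """Re-encode bytes from a known source charset as UTF-8."""
    # Decoding raises UnicodeDecodeError if the data does not fit the charset.
    text = data.decode(source_charset)
    return text.encode("utf-8")


# "café" stored as Windows-1252 becomes a two-byte "é" in UTF-8.
converted = to_utf8(b"caf\xe9", "windows-1252")
```

Note that a wrong charset label often decodes without an error and silently yields garbled text, which is why the charset has to be known or identified first.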

[read on]
Posted 2011-07-15 15:51   by Alex Linke   Link: Permalink
Tags: charset  software  Unicode  AutoUniConv

How to Transliterate Russian Text

Transliterating a natural language text means converting it from one writing system to another using a set of predefined mappings between characters or character sequences. These mappings may also be context-dependent.
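A mapping-based transliterator can be sketched in a few lines of Python. The tables below are tiny and purely illustrative; they do not reproduce any published standard or the tables shipped with Lingua::Translit, and the function handles lowercase input without any context rules. Keys are tried longest first so that sequence mappings win over their own prefixes.

```python
# Illustrative tables only, not taken from any transliteration standard.
CYRILLIC_TO_LATIN = {"м": "m", "и": "i", "р": "r", "щ": "shch", "ш": "sh"}
LATIN_TO_CYRILLIC = {"shch": "щ", "sh": "ш", "i": "и"}


def transliterate(text: str, table: dict) -> str:
    """Replace each longest matching key; copy unmapped characters as-is."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mapping applies at this position.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With these tables, transliterate("мир", CYRILLIC_TO_LATIN) gives "mir", and transliterate("shchi", LATIN_TO_CYRILLIC) gives "щи" because "shch" is matched before "sh".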

Transliterations have been standardized for a variety of languages and writing systems by national and international organizations like ISO, DIN or GOST. This way, transliterated natural language text can easily be exchanged and interpreted by those not familiar with its native alphabet.

[read on]
Posted 2010-07-30 11:58   by Alex Linke   Link: Permalink
Tags: Lingua::Translit  Perl  language  transliteration  software

Using Lingua::Lid in a Threaded Application

As of version 0.02, Lingua::Lid is thread-safe if compiled against a recent version of lid (3.0.0 or higher).

This allows you to safely call Lingua::Lid's language and charset identification functions, such as lid_ffile and lid_fstr, concurrently within your application by using Perl's threads module. Because thread support in Perl is a compile-time option, you will need a thread-enabled build of Perl, as shipped by most modern Linux distributions such as Debian Lenny or Ubuntu Lucid, or ActiveState's distribution for Windows.

[read on]
Posted 2010-06-21 09:12   by Alex Linke   Link: Permalink
Tags: Perl  lid  Lingua::Lid  language-identifier  language  charset  software

Aspects of Transliteration

Transliteration is the conversion of letters from one alphabet to another, for example from Greek to Latin. But it may just as well be a simplification within a single alphabet, for example dropping the diacritics found in that alphabet or replacing special characters with a sequence of characters without diacritics.
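Both kinds of simplification within one alphabet can be sketched in Python using the standard unicodedata module. The function names and the small German substitution table are illustrative assumptions, not part of Lingua::Translit.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Illustrative substitutions: special characters become plain sequences.
SUBSTITUTIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def expand_special(text: str) -> str:
    """Replace each special character with its plain-letter sequence."""
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in text)
```

For instance, strip_diacritics("café") yields "cafe", while expand_special("Größe") yields "Groesse". Characters without a canonical decomposition, such as "ß", pass through strip_diacritics unchanged.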

[read on]
Posted 2010-01-25 16:31   by Rona Linke   Link: Permalink
Tags: transliteration  charset  Lingua::Translit  language

Introducing Lingua::Lid

Lingua::Lid is a Perl extension that implements an interface to the lid C/C++ library. As such, it makes lid's language and character encoding identification features available to any Perl application or module.

The following code snippets show a few usage examples, introducing both basic usage and Lingua::Lid's capabilities:

[read on]
Posted 2009-09-30 12:34   by Alex Linke   Link: Permalink
Tags: Perl  lid  Lingua::Lid  language-identifier  language  charset  software

lidc - A Language Identifier (Preview)

lidc is a command-line application for Unix-like operating systems (Linux, Solaris, FreeBSD) that identifies the language and character encoding of its input. Based on the lid library, it provides accurate identification results and high performance. In addition, lidc implements a significant number of new features on top of those provided by lid, namely the parsing of common input formats. These include:

[read on]
Posted 2009-09-25 16:43   by Alex Linke   Link: Permalink
Tags: lid  lidc  HTML  email  screencast  language-identifier  language  charset  software