Introducing Lingua::Lid
Lingua::Lid is a Perl extension that implements an interface to the lid C/C++ library. As such, it makes lid's language and character encoding identification features available to any Perl application or module.
The following code snippets show a few usage examples, introducing both basic usage and Lingua::Lid's capabilities:
lid_fstr Function and Results Data Structure
The following minimal example script, Lingua-Lid-example1.pl, passes a short string to Lingua::Lid's interface function lid_fstr(), which identifies the language and character encoding of a string. The identification results are returned as a hash reference containing the name of the identified language, its ISO 639-3 code, and the character encoding.
The example string contains only characters from the ASCII character set. In the next step, we recode the same string to UTF-32BE and let lid_fstr() identify it once again.
use strict;
use Lingua::Lid qw/lid_fstr/;
use Encode qw/encode/;
use Data::Dumper;

my $s = "This posting is introducing Lingua::Lid.";

# Identify the plain ASCII string.
my $res = lid_fstr($s);
print $res->{language}, " - ", $res->{encoding}, "\n";

# Recode the same string to UTF-32BE and identify it again.
$res = lid_fstr(encode("UTF-32BE", $s));
print $res->{language}, " - ", $res->{encoding}, "\n";

# Dump the complete results hash.
print Dumper $res;
As expected, the script produces the following output:
$ perl Lingua-Lid-example1.pl
English - ASCII
English - UTF-32BE
$VAR1 = {
          'isocode' => 'eng',
          'language' => 'English',
          'encoding' => 'UTF-32BE'
        };
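To see why the second call no longer reports ASCII, it can help to look at what the recoding step does to the bytes. The following sketch uses only the core Encode module (no Lingua::Lid call is involved) and compares the character count of the example string with the byte count of its UTF-32BE form:

```perl
use strict;
use warnings;
use Encode qw/encode/;

my $s = "This posting is introducing Lingua::Lid.";

# UTF-32BE stores every code point as four big-endian bytes;
# Encode's "UTF-32BE" does not prepend a byte order mark.
my $bytes = encode("UTF-32BE", $s);

printf "characters:     %d\n", length $s;
printf "UTF-32BE bytes: %d\n", length $bytes;

# Show the first encoded character ("T", U+0054) as hex bytes.
my $first = join " ", map { sprintf "%02X", ord } split //, substr($bytes, 0, 4);
print "first character: $first\n";
```

The byte string is four times as long as the original, and three of every four bytes are zero, so the input handed to lid_fstr() looks very different even though the text is the same.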
lid_ffile Function and Error Handling
The second example, Lingua-Lid-example2.pl, focuses on lid_ffile(), a function that identifies the language and character encoding of a file. In this example it is invoked on every file given as an argument on the command line.
If either lid_fstr() or lid_ffile() cannot identify its input, it returns undef and sets the package variable $Lingua::Lid::errstr to a natural-language string describing the error.
use strict;
use Lingua::Lid qw/lid_ffile/;

foreach my $file (@ARGV) {
    my $res = lid_ffile($file);
    print $file, ": ";

    # On failure lid_ffile() returns undef; report the error and move on.
    unless ($res) {
        print $Lingua::Lid::errstr, "\n";
        next;
    }

    print $res->{language}, " - ", $res->{encoding}, "\n";
}
Running the script with a mixture of text files, special files, and nonexistent files as arguments prints the following results:
$ perl Lingua-Lid-example2.pl README /dev/zero \
/dev/null /nonexistent
README: English - ASCII
/dev/zero: Binary input data
/dev/null: Insufficient input length
/nonexistent: Failed to open file
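If you prefer exceptions over checking for undef after every call, the undef/errstr convention can be wrapped in a small helper. The following is only a sketch: identify_or_die() and fake_lid_fstr() are hypothetical names that are not part of Lingua::Lid, and the stand-in function merely imitates the convention (with an arbitrary length threshold and a hard-coded result) so that the example runs without the lid library installed.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::Lid: call an identification
# function and die with $Lingua::Lid::errstr if it returns undef.
sub identify_or_die {
    my ($identify, $input) = @_;
    my $res = $identify->($input);
    die "identification failed: $Lingua::Lid::errstr\n" unless defined $res;
    return $res;
}

# Stand-in for lid_fstr() that only imitates the undef/errstr convention.
# The threshold and the returned values are made up for illustration.
sub fake_lid_fstr {
    my ($string) = @_;
    if (length($string) < 10) {
        $Lingua::Lid::errstr = "Insufficient input length";
        return undef;
    }
    return { language => 'English', isocode => 'eng', encoding => 'ASCII' };
}

for my $input ("This posting is introducing Lingua::Lid.", "Hi") {
    my $res = eval { identify_or_die(\&fake_lid_fstr, $input) };
    if ($res) {
        printf "%s (%s), %s\n", @{$res}{qw/language isocode encoding/};
    }
    else {
        print "error: $@";
    }
}
```

With the real module loaded, the same helper should work by passing \&lid_fstr or \&lid_ffile (after importing them) in place of the stand-in.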
More information, including the full set of languages and character encodings supported by the underlying lid library, a detailed man page, and a ready-to-use online demo, is available on Lingua::Lid's website.

2009-09-30 12:34