Man page of lidc(1)

NAME

lidc - identifies language and character encoding of textual input

SYNOPSIS

lidc [-i PATH] [-t TYPE] [-f FMT_STR] [-v] [-h]

DESCRIPTION

lidc identifies language and character encoding of textual input.

lidc reads its input from a file or from stdin and can handle several input types: plain text, HTML, XML and email (MIME 1.0 and RFC 822).

The results are displayed according to a user-definable format string that allows a broad range of customization.

For a list of supported languages and encodings, see the user manual.

OPTIONS

-i PATH

Set the input file; "-" denotes stdin (default: stdin).

-t TYPE [txt, html, xml, email]

Set the input file's type. The following formats are supported:

txt (default)

Plain text document (without markup).

html

Any HTML document (XHTML, HTML 4, ...).

xml

Any XML document.

email

Any email conforming to either RFC 822 or MIME (as specified by RFCs 2045-2049, 2387, 1847 and 3462). See RESTRICTIONS below.

If no TYPE is set and lidc is reading from a file (-i), lidc tries to determine the file's type automatically from the file's extension. The common extensions (.txt, .html, .htm, .xml, .eml) are supported, as are all Maildir extensions and keywords used by the Dovecot IMAP server.

If no type is set and none can be determined, "txt" is assumed.
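
The extension-based fallback can be pictured as a simple lookup. The following Python sketch is illustrative only and is not lidc's implementation: the name guess_type and the table are hypothetical, case-insensitive matching is an assumption, and the Maildir/Dovecot handling is left out.

```python
import os

# Hypothetical lookup table; only the common extensions named in the
# text are covered (Maildir extensions and keywords are omitted).
EXTENSION_TYPES = {
    ".txt": "txt",
    ".html": "html",
    ".htm": "html",
    ".xml": "xml",
    ".eml": "email",
}

def guess_type(path, explicit_type=None):
    """Return the input type for a path.

    An explicitly set type (-t) wins, then the file extension,
    then the documented default "txt".
    """
    if explicit_type is not None:
        return explicit_type
    # Assumption: extensions are compared case-insensitively.
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_TYPES.get(ext, "txt")
```

With this sketch, guess_type("german.eml") yields "email", while "-" (stdin) has no extension and falls back to "txt".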

-f FMT_STR

Set the output format string. The following flags are replaced in the output with the associated results:

%l -> identified language

%l expands to the English name of the identified language, e.g. "German", "French" or "Swedish".

%i -> ISO 639-3 language code

%i expands to the ISO 639-3 code of the identified language, e.g. "deu", "fra" or "swe".

%e -> identified encoding

%e expands to the identified encoding, e.g. "UTF-8", "ISO-8859-1", "UTF-32LE" or "Windows-1252".

%d -> declared document encoding

%d expands to the declared document encoding in lowercase letters, e.g. "utf-8", "iso-8859-1", "utf-32le" or "windows-1252".

If no document encoding could be determined or the document type does not support encoding declarations (txt), %d expands to "none".

%f -> input file's name

%f expands to the input file's name or to stdin.

Besides the flags above, the common escape sequences \n (newline), \r (carriage return), \t (tab) and \a (bell) are supported.

If no format string is set, the default output is: "%l, %i, %e\n"

The output is sent to stdout.
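
To make the substitution rules concrete, here is a minimal Python sketch of how such a format string could be expanded. It is a model of the documented behaviour, not lidc's code: expand_format is a hypothetical name, and leaving unknown sequences untouched is an assumption.

```python
import re

# Escape sequences listed as supported in the -f description.
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "a": "\a"}

def expand_format(fmt, values):
    """Expand the % flags and backslash escapes in fmt.

    values maps each flag letter (l, i, e, d, f) to its result string.
    Unknown sequences are left untouched (an assumption; lidc may differ).
    """
    def substitute(match):
        kind, letter = match.group(1), match.group(2)
        table = values if kind == "%" else ESCAPES
        return table.get(letter, match.group(0))

    # One left-to-right pass, so substituted text is never re-scanned.
    return re.sub(r"([%\\])(.)", substitute, fmt, flags=re.DOTALL)

result = {"l": "Hungarian", "i": "hun", "e": "UTF-32LE",
          "d": "utf-32le", "f": "hungarian.xml"}
# prints: hungarian.xml: Hungarian, UTF-32LE, utf-32le
print(expand_format(r"%f: %l, %e, %d\n", result), end="")
```

A single pass over the string avoids a pitfall of chained replacements: a file name that happens to contain "%l" would otherwise be expanded a second time.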

-v

Show version information.

-h

Show a short help text.

DIAGNOSTICS

If an error occurs, the application terminates with error code 1 and prints an error message to stderr.

In addition, lidc may print warnings to stderr.
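
A calling script can rely on this contract (nonzero exit status plus a message on stderr for errors, stderr output alone for warnings). The Python sketch below shows one way to do so; run_identifier is a hypothetical helper, and the command line is passed in by the caller rather than hard-coded.

```python
import subprocess

def run_identifier(argv):
    """Run a command such as ["lidc", "-i", "danish.txt"].

    Returns (stdout, warnings) on success, where warnings is whatever
    the command printed to stderr. Raises RuntimeError if the command
    exits with a nonzero status.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        message = proc.stderr.strip() or "exit status %d" % proc.returncode
        raise RuntimeError(message)
    return proc.stdout, proc.stderr
```

Keeping stdout and stderr separate means warnings never end up mixed into the formatted result.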

EXAMPLES

Identifying the language and encoding of a plain text file, using the default output format string:

 $ lidc -i danish.txt
 Danish, dan, UTF-32BE

Using lidc to identify language and encoding of an email. The correct type, email, is automatically determined by evaluating the file's extension.

 $ lidc -i german.eml
 German, deu, ISO-8859-1

Same as above, but using a pipe. The type must be set explicitly; otherwise lidc falls back to the default type, txt.

 $ cat german.eml | lidc -t email
 German, deu, ISO-8859-1

Processing a UTF-32 encoded XML file and setting a custom format string (including the declared document encoding):

 $ lidc -i hungarian.xml -f "%f: %l, %e, %d\n"
 hungarian.xml: Hungarian, UTF-32LE, utf-32le

A more complex example, providing basic XML output:

 $ lidc -i german.eml -f \
   "<email>\n\t<lang>%l</lang>\n\t<enc>%e</enc>\n</email>\n"
 <email>
     <lang>German</lang>
     <enc>ISO-8859-1</enc>
 </email>

RESTRICTIONS

o  There is no support for UTF-16 or UTF-32 encodings in emails.

o  Concerning MIME emails, only the following media types are supported:

   x  text/plain
   x  text/html
   x  message/rfc822
   x  multipart/mixed
   x  multipart/alternative
   x  multipart/digest
   x  multipart/parallel
   x  multipart/related
   x  multipart/signed
   x  multipart/report
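
To check in advance whether a message contains parts outside this list, the message can be walked with Python's standard email package. This is a sketch under assumptions: unsupported_parts is a hypothetical helper, and what lidc does with unsupported parts is not stated here.

```python
from email import message_from_string

# Media types listed as supported in RESTRICTIONS.
SUPPORTED = {
    "text/plain", "text/html", "message/rfc822",
    "multipart/mixed", "multipart/alternative", "multipart/digest",
    "multipart/parallel", "multipart/related", "multipart/signed",
    "multipart/report",
}

def unsupported_parts(raw_message):
    """Return the sorted media types in a message that are not in SUPPORTED."""
    msg = message_from_string(raw_message)
    found = {part.get_content_type() for part in msg.walk()}
    return sorted(found - SUPPORTED)

sample = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "\n"
    "Hallo Welt\n"
    "--b\n"
    "Content-Type: application/pdf\n"
    "\n"
    "(binary data omitted)\n"
    "--b--\n"
)
print(unsupported_parts(sample))  # ['application/pdf']
```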

NOTES

The declared and the identified encoding may differ. This is not necessarily an error, but it can point to a problem. Two examples:

1. If the declared encoding is ISO-8859-1 and the identified encoding is ASCII, this is correct in most cases: the characters actually used may all lie in the ASCII range, and ISO-8859-1 is a superset of ASCII.

2. If the declared encoding is UTF-8 and the identified encoding is ISO-8859-1, this may point to a problem. For example, if an HTML document declares UTF-8 but is actually encoded differently, the page may be displayed with "broken" characters.
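
Both cases can be reproduced with Python's built-in codecs. The sketch below is independent of lidc; decodes_as is a hypothetical helper and the sample strings are made up for illustration.

```python
def decodes_as(data, encoding):
    """Return True if the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

# Case 1: text that uses only ASCII characters produces the same bytes
# in ISO-8859-1, so identifying it as ASCII is consistent with an
# ISO-8859-1 declaration.
ascii_only = "Hello, world".encode("iso-8859-1")
print(decodes_as(ascii_only, "ascii"))    # True

# Case 2: bytes written as ISO-8859-1 but declared as UTF-8 are not
# valid UTF-8 once they contain non-ASCII characters.
latin1_bytes = "Grüße".encode("iso-8859-1")
print(decodes_as(latin1_bytes, "utf-8"))  # False

# A lenient reader shows replacement characters instead of ü and ß.
print(latin1_bytes.decode("utf-8", errors="replace"))
```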

SEE ALSO

User Manual (English version), Benutzerhandbuch (German version)