Man page of lid_ffile(3), lid_fstr(3), lid_fnstr(3) and lid_fwstr(3)


NAME

lid_ffile, lid_fstr, lid_fwstr, lid_fnstr - determine language and encoding of textual input from a variety of sources

SYNOPSIS

 #include <lid.h>

 lid_t * lid_ffile(const char *file);

 lid_t * lid_fstr(const char *str);
 lid_t * lid_fwstr(const wchar_t *wstr);
 lid_t * lid_fnstr(const char *bstr, size_t len);

DESCRIPTION

The functions lid_ffile(), lid_fstr(), lid_fwstr() and lid_fnstr() determine the language and encoding of their input. A list of supported languages and encodings is provided in the user manual and the software specification.

lid_ffile() reads its input from the file specified by file.

lid_fstr() uses the NUL-terminated character string pointed to by str as input, while lid_fwstr() handles the wide character string pointed to by wstr.

lid_fnstr() processes len bytes of the byte string pointed to by bstr. The caller must ensure that len does not exceed the memory allocated for bstr, because lid_fnstr() has no way to check this itself. In contrast to lid_fstr(), this function handles embedded NUL characters and is thus able to process UTF-16 and UTF-32 encoded strings.
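
The difference between lid_fstr() and lid_fnstr() comes down to how the input length is found. The following sketch uses only the standard C library, not liblid itself, to show why a NUL-terminated view of UTF-16 data comes out too short. The helper name visible_length() and the sample bytes are illustrative assumptions and are not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE: every second byte is NUL. */
static const char utf16le_hi[] = { 'H', 0x00, 'i', 0x00 };

/* Hypothetical helper: the length that a NUL-terminated API would see.
 * strlen() stops at the first NUL byte, so the buffer must contain one. */
static size_t visible_length(const char *buf, size_t len)
{
    assert(buf != NULL);
    assert(memchr(buf, '\0', len) != NULL); /* otherwise strlen() would overrun */
    return strlen(buf);
}
```

With this buffer, visible_length(utf16le_hi, sizeof utf16le_hi) yields 1 although the array holds 4 bytes. Passing the array together with sizeof utf16le_hi, as lid_fnstr() expects, covers the whole buffer and keeps len inside its bounds. Real input would of course have to be longer than this two-character sample.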

RETURN VALUE

lid_ffile(), lid_fstr(), lid_fwstr() and lid_fnstr() return a pointer to a lid_t structure, which is defined as follows:

 typedef struct lid {
     char *language;
     char *encoding;
     char *isocode;
 } lid_t;

This data structure holds the results determined from the input and consists of:

language

The determined language's name in English, e.g. "German".

encoding

The determined encoding, e.g. "UTF-8".

isocode

The determined language's ISO 639-3 code, e.g. "deu".

When the result is no longer needed, the memory occupied by the lid_t structure should be released with lid_free(3).

ERRORS

If an error occurs, the functions return NULL and set the global error indicator lid_errno(3) to an appropriate value.

If a natural-language message describing the error is also wanted, lid_strerror(3) can be used.

For convenience, macros can be used instead of the numeric error indicators. The following macros are defined:

LID_ENOERR

No error/clear state

LID_ENOMEM

Memory allocation failed

LID_EFOPEN

Error opening an input file

LID_EFCLOSE

Error closing an input file

LID_EFIO

File IO error

LID_EMATH

Math error

LID_ESHORT

Input too short

LID_EUDEC

UTF decoding failed

LID_EUENC

UTF encoding failed

LID_EUINV

Invalid UTF sequence

LID_EWCCONV

Wide character conversion error

LID_EBINARY

Binary data input

LID_EARG

Invalid argument

LID_EUNDEF

Undefined error

EXAMPLES

The following example application, lid_example, is included in the distribution. It takes a set of filenames as command-line arguments and uses lid_ffile() to determine the language and encoding of each file. Errors are checked, the results are printed, and the memory used by each result structure is freed with lid_free(3). The program stops at the first error.

 #include <stdio.h>
 #include <lid.h>

 int main(int argc, char *argv[])
 {
     lid_t *res = NULL;
     int    i   = 0;

     /* Examine every file named on the command line. */
     for (i = 1; i < argc; i++)
     {
         res = lid_ffile(argv[i]);

         /* NULL signals an error; lid_errno holds the reason. */
         if (res == NULL)
         {
             fprintf(stderr, "%s: %s\n",
                 argv[i], lid_strerror(lid_errno));
             return 1;
         }

         printf("%s: lang=%s, enc=%s, iso=%s\n",
             argv[i], res->language, res->encoding, res->isocode);

         /* Release the result before the next iteration. */
         lid_free(res);
     }

     return 0;
 }

Here is the output of an example execution of the application:

 $ ./lid_example /tmp/english.txt /tmp/german.txt /dev/null
 /tmp/english.txt: lang=English, enc=ASCII, iso=eng
 /tmp/german.txt: lang=German, enc=UTF-8, iso=deu
 /dev/null: Insufficient input length.

CAVEATS

length

The input must reach a minimum length of about 25 characters.

encoding

lid_fstr() cannot handle character strings whose encoding contains NUL bytes (UTF-16/UTF-32), because it cannot determine their length accurately. lid_fnstr() should be used instead.

format

Only input in plain text can be processed.
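
The length and encoding caveats can be checked before the library is called. The following is a minimal sketch using only the standard C library. The helper names and the MIN_INPUT_BYTES constant are assumptions made for illustration and are not part of liblid, and the threshold counts bytes, which only approximates the "about 25 characters" stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: the caveat only says "about 25 characters"; this sketch
 * counts bytes as a rough stand-in for characters. */
#define MIN_INPUT_BYTES 25

/* Hypothetical helper: true if the buffer is probably long enough. */
static bool probably_long_enough(size_t len)
{
    return len >= MIN_INPUT_BYTES;
}

/* Hypothetical helper: true if the first len bytes contain a NUL byte.
 * In that case a NUL-terminated API such as lid_fstr() would see only
 * part of the data, and lid_fnstr() is the safer choice. */
static bool has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return len > 0 && memchr(buf, '\0', len) != NULL;
}
```

A caller might use has_embedded_nul() to choose between lid_fstr() and lid_fnstr(), and probably_long_enough() to skip inputs that would likely fail with LID_ESHORT. Neither check replaces the library's own error reporting.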

NOTES

The library's version is defined as the macro LID_VERSION, which expands to the quoted version string, e.g. "2.0.2".
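
Because the version is a string, a program that wants to compare versions numerically has to parse it first. The following sketch assumes that the string always has the form "major.minor.patch", which the single example above suggests but does not guarantee. The helper parse_version() is hypothetical and not part of liblid.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: split a version string such as "2.0.2" into its
 * three numeric parts. Returns 1 on success and 0 if the string does not
 * start with three dot-separated integers. */
static int parse_version(const char *s, int *major, int *minor, int *patch)
{
    assert(s != NULL && major != NULL && minor != NULL && patch != NULL);
    return sscanf(s, "%d.%d.%d", major, minor, patch) == 3;
}
```

In a program that includes lid.h, the first argument would be LID_VERSION.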

SEE ALSO

lid_free(3), lid_strerror(3)

liblid User Manual, liblid Software Specification

"The CERT C Secure Coding Standard", "ERR05-C", p. 549ff.