Man page of tw-classify(1)

NAME

tw-classify - classifies documents

SYNOPSIS

tw-classify [db-opt] [opt] [-n NUMBER] [-x NUMBER] file(s)

DESCRIPTION

tw-classify automatically classifies input documents into a given set of learned categories.

OPTIONS

DATABASE OPTIONS

These options are required to establish a connection to the Textweiser database. They can be given on the command line, supplied in a configuration file (-f / --config), or both.

NOTE: If Textweiser uses an SQLite database backend, only the -d / --db_name option is required; the other database options are not available.

-d / --db_name database name

Name of the Textweiser database (UTF-8 encoded).

If Textweiser uses an SQLite database backend, database name is the path to the database file, which need not be UTF-8 encoded.

-s / --host hostname

Hostname of the database server.

-u / --user username

Username to connect to the database.

-w / --passwd password

Password to connect to the database.

NOTE: If no password is given as an argument on the command line and no password is set in the configuration file, you will be prompted to enter the password. The password is not echoed during input.

-p / --port port

Port of the database on hostname.

If port is not set, the default port of the database is assumed.

-t / --instance instance

Name of the Microsoft SQL Server instance on hostname.

NOTE: This option is only available if Textweiser uses the Microsoft SQL Server database backend.

-e / --encrypt

Request that communication with the database be encrypted. If the database driver cannot establish an encrypted connection, Textweiser aborts.

--trust-cert

Trust any certificate presented by the database server without validating it.

NOTE: In order to use self-signed certificates, this option has to be enabled.

NOTE: Passing this option implicitly enables communication encryption.

The database configuration may be given in a configuration file as well. For details, see CONFIGURATION FILE SYNTAX below.

-f / --config path

Path to a Textweiser database configuration file.

COMMON OPTIONS

-v / --verbose

Enable verbose output.

-V / --version

Show version information and terminate.

-h / --help

Show a short help screen and terminate.

CLASSIFICATION OPTIONS

Every invocation of tw-classify works on a single document or a set of documents and prints a user-definable number of most likely categories for each document.

-n / --show number

Show at most number of the most likely categories and their estimated probabilities per input document.

Defaults to "1" (prints the most likely category only).

-x / --threads number

Use number of threads to classify the set of input documents.

Defaults to "1".
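As a rough sketch, the thread count can be derived from the number of online CPUs. Matching threads to CPUs is a common heuristic, not a Textweiser recommendation. getconf _NPROCESSORS_ONLN is a widespread extension rather than a guaranteed interface, and example.sqlt and the email-*.txt pattern are placeholders.

```shell
#!/bin/sh
# Ask the system for the number of online CPUs; fall back to 1 thread
# (the documented default) if the query fails or yields a non-number.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || threads=1
case $threads in
    ''|*[!0-9]*) threads=1 ;;
esac
echo "using $threads thread(s)"

# Guarded so the sketch does nothing where Textweiser is not installed.
if command -v tw-classify >/dev/null 2>&1; then
    tw-classify -d example.sqlt -x "$threads" email-*.txt
fi
```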

DIAGNOSTICS

If an error occurs, the application prints an error message to stderr and terminates with an error code that depends on the operating system in use.
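In a script, the exit status can be checked as in the following sketch. It assumes the usual convention that a non-zero status signals failure, which this page does not spell out. run_checked is a hypothetical helper name.

```shell
#!/bin/sh
# Run a command; on failure, report its exit status on stderr and
# return that same status to the caller.
run_checked() {
    if "$@"; then
        return 0
    else
        status=$?
        echo "error: '$1' exited with status $status" >&2
        return "$status"
    fi
}

# Intended use (commented out so the sketch runs without Textweiser):
# run_checked tw-classify -d example.sqlt email-sales.txt || exit 1
```

Because tw-classify already prints its own message to stderr, the helper adds only the numeric status.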

CONFIGURATION FILE SYNTAX

The syntax of Textweiser configuration files follows an easy-to-use key/value scheme. Empty lines and any leading or trailing whitespace are ignored. Lines starting with the character # are considered comments.

Values may be enclosed within matching single or double quotes and are assigned to keys using the = character.

SUPPORTED KEYS

host

Hostname of the database server.

user

Username for database authentication.

passwd

Password for database authentication.

db_name

Name of the Textweiser database.

port

Port number of the database server.

instance

Name of the Microsoft SQL Server instance.

encrypt

Enable/disable communication encryption.

The following values are recognized:

"yes" or "on"

Enable encryption.

"no" or "off"

Disable encryption.

To trust a server's certificate, append the "trust-cert" token to the value, separated by a comma and/or whitespace, for example:

 encrypt = "yes, trust-cert"
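Putting the keys together, a configuration file for a networked database backend might look like the following sketch. Every value is a placeholder, and only full-line comments are used because this page does not say whether trailing comments are accepted.

```text
# Textweiser database connection; every value below is a placeholder.
host = "db.example.com"
user = "textweiser"
db_name = "textweiser"
# No passwd key: tw-classify then prompts for the password.
encrypt = "yes"
```

Saved under a hypothetical name such as textweiser.conf, the file would be passed with -f / --config.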

NOTES

o  On Microsoft Windows, an option may also be introduced by the "/" character.

o  On any Unix-like system, the common sequence "--" terminates option parsing.

o  Any configuration file specified (-f / --config) is parsed and evaluated before the other command-line arguments. As a result, arguments given on the command line override settings given in the configuration file.

o  Communication encryption is the task of the database driver. Textweiser merely instructs the driver to enable or disable encryption according to the passed options and checks whether the operation succeeded.
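The override rule and the "--" terminator can be combined as in the following sketch for a Unix-like system with the SQLite backend. The database names and the document name -draft.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical setup: a throwaway configuration file naming one database.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# database used unless -d is given on the command line
db_name = "production.sqlt"
EOF

# -d overrides db_name from the file; "--" ends option parsing, so
# "-draft.txt" is treated as a document rather than as an option.
# Guarded so the sketch does nothing where Textweiser is not installed.
if command -v tw-classify >/dev/null 2>&1; then
    tw-classify -f "$conf" -d staging.sqlt -- -draft.txt
fi
```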

EXAMPLES

For brevity, the following examples assume Textweiser is using the SQLite database backend and that a Textweiser database and a set of categories have already been created and trained. See the EXAMPLES sections of tw-admin(1) and tw-learn(1) for details.

Classify a set of documents using default settings:

 $ tw-classify -d example.sqlt email-it-projects.txt \
   email-marketing.txt email-sales.txt
 email-it-projects.txt: IT Projects
 email-marketing.txt: Marketing
 email-sales.txt: Sales

Classify the same set, showing up to 5 classification results per document and using 2 threads to speed up classification:

 $ tw-classify -d example.sqlt -n 5 -x 2 email-it-projects.txt \
   email-marketing.txt email-sales.txt
 Classification results for email-it-projects.txt:
 01:      IT Projects -> 100.00%

 Classification results for email-marketing.txt:
 01:        Marketing -> 95.22%
 02:            Sales -> 22.85%
 03:      IT Projects -> 6.94%

 Classification results for email-sales.txt:
 01:            Sales -> 100.00%
 02:        Marketing -> 19.24%
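For further processing, the single-line default output (without -n) can be converted to tab-separated records. This sketch assumes the format is exactly "file: category" as in the first example, and that the first ": " on a line separates the two fields. to_tsv is a hypothetical helper name.

```shell
#!/bin/sh
# Convert "file: Category" lines into tab-separated "file<TAB>Category".
# Assumption: the first ": " on a line separates file name from category.
# Lines without that separator are dropped.
to_tsv() {
    awk '{
        i = index($0, ": ")
        if (i > 0) print substr($0, 1, i - 1) "\t" substr($0, i + 2)
    }'
}

# Demonstration on two lines copied from the first example's output.
printf '%s\n' 'email-it-projects.txt: IT Projects' 'email-sales.txt: Sales' | to_tsv
```

In practice, the output of tw-classify would be piped into to_tsv in place of the printf demonstration.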

SEE ALSO

tw-admin(1), tw-learn(1)

Textweiser User Manual

http://www.lingua-systems.com/text-classifier/textweiser-library/