Automatic Language Detection 

Automatic Language Detection (ALD) classifies documents according to both their language and their character encoding. Identifying both is important for the subsequent indexing, processing, and filtering of documents.


ALD is not supported for all languages. If ALD does not detect a particular language, check that the language is supported. See Automatic Language Detection Languages.

Most of the data output from IDOL Connectors is in UTF-8 encoding. HPE recommends setting LangDetectUTF8 to True in your configuration file to prevent problems in scenarios where the LangDetectType fields contain only 7-bit ASCII characters, but other fields contain UTF-8 characters.
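The scenario described above can be illustrated with a short Python sketch. The field names (TITLE, CONTENT) and the document are hypothetical, and the check only shows why an ASCII-only detection field gives no evidence about the encoding. It is not how IDOL itself performs detection.

```python
def is_seven_bit_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range (0-127)."""
    return all(byte < 0x80 for byte in data)


# Hypothetical document: the field used for detection holds plain ASCII,
# while another field holds UTF-8 text with multi-byte characters.
document = {
    "TITLE": "Quarterly report".encode("utf-8"),  # assumed LangDetectType field
    "CONTENT": "Überblick über das Quartal".encode("utf-8"),  # not used for detection
}

detect_field_is_ascii = is_seven_bit_ascii(document["TITLE"])
other_field_is_ascii = is_seven_bit_ascii(document["CONTENT"])

# The detection field alone cannot distinguish UTF-8 from other
# ASCII-compatible encodings, even though the document as a whole is UTF-8.
print(detect_field_is_ascii, other_field_is_ascii)
```

In a case like this, the detection field on its own says nothing about the encoding of the rest of the document, which is the situation that setting LangDetectUTF8 to True is recommended for.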

If encoding detection fails, the document is indexed using the DefaultLanguageType.

If language detection fails, or if the language is not configured, the document is indexed using the General language type (if configured).
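The two fallback rules above can be summarized as a small decision function. This is a sketch of the documented behavior only, not IDOL code: the function name, its arguments, and the "General" label are hypothetical, and the case where the General language type is not configured is left as None because this topic does not describe it.

```python
from typing import Optional, Set


def choose_language_type(
    encoding_detected: bool,
    detected_language: Optional[str],
    configured_languages: Set[str],
    default_language_type: str,
    general_configured: bool,
) -> Optional[str]:
    """Mirror the documented fallback order for a single document."""
    # Encoding detection failed: fall back to DefaultLanguageType.
    if not encoding_detected:
        return default_language_type
    # Detection succeeded and the language is configured: use it.
    if detected_language is not None and detected_language in configured_languages:
        return detected_language
    # Language detection failed, or the language is not configured:
    # use the General language type, if it is configured.
    if general_configured:
        return "General"  # hypothetical label for the General language type
    return None  # behavior not described in this topic
```

For example, with hypothetical language type names, choose_language_type(True, "catalan", {"english", "spanish"}, "english", True) returns "General", because Catalan was detected but is not configured.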

If you know the language and encoding of the document, you can speed up indexing by specifying the language and encoding (either as an index action parameter or as a LanguageType field value).

ALD works best when given a large amount of content. Choose your LangDetectType fields appropriately, for example to avoid fields that just contain names or numbers.

The DetectLanguage action allows you to test which language a block of text is identified as. You can use this action to check if ALD is working correctly.
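As a rough illustration, the following Python sketch builds a DetectLanguage request URL. The host, port, and the Text parameter name are assumptions made for the example, so check them against the action reference for your IDOL Server version before relying on them.

```python
from urllib.parse import quote


def build_detect_language_url(host: str, port: int, text: str) -> str:
    """Build a DetectLanguage action URL with the text percent-encoded as UTF-8."""
    encoded_text = quote(text, safe="")
    # "Text" is assumed to be the parameter that carries the sample text.
    return f"http://{host}:{port}/action=DetectLanguage&Text={encoded_text}"


# Hypothetical host and port for a local IDOL Server.
url = build_detect_language_url("localhost", 9000, "Dónde está la biblioteca")
print(url)
```

Sending the resulting URL from a browser or an HTTP client gives a quick way to confirm which language a sample of text is identified as.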

A document can have only one LanguageType. If the document contains text in more than one language, with an even split between the languages, then IDOL Server might pick either language. For example, this might occur for a form where each field has a translation.

How Much Text is Needed for Detection?

IDOL Server can detect some languages (such as UTF-8 encoded languages that use a unique script) using only a few characters. However, for some languages that are similar to another language (such as Catalan and Spanish), IDOL Server might need a large paragraph of text to allow it to confidently distinguish the languages. You must always provide at least three words of input text for ALD.
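A client-side check for the three-word minimum could look like the following sketch. The function name is hypothetical, and splitting on whitespace is an assumption that does not suit languages written without spaces between words. IDOL's own word counting may differ.

```python
def has_minimum_words(text: str, minimum_words: int = 3) -> bool:
    """Return True if the text has at least `minimum_words` whitespace-separated words."""
    return len(text.split()) >= minimum_words


print(has_minimum_words("Bon dia"))           # two words: too short for ALD
print(has_minimum_words("Bon dia a tothom"))  # four words: meets the minimum
```

Meeting the minimum does not guarantee a confident result. Closely related languages can still need a much longer sample.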

The amount of text required usually depends on the type of text. For example, it is extremely hard to detect the language of lists of places, numbers, or names, and their presence means that IDOL requires more text. For natural language text, such as a news article, IDOL can usually detect the language from fewer characters.

What Fields are Used for Language Detection?

When you index documents into IDOL, it uses the LangDetectType fields for ALD. If you have not configured LangDetectType fields, it uses the SourceType fields, which in turn default to the Index fields.
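The fallback order described above can be sketched as a simple lookup. The dictionary layout and the field names (TITLE, BODY) are hypothetical and exist only for this example. They do not reflect how IDOL stores its field configuration.

```python
from typing import Dict, List


def fields_for_detection(field_config: Dict[str, List[str]]) -> List[str]:
    """Return the fields ALD would use, following the documented fallback order."""
    for field_type in ("LangDetectType", "SourceType", "Index"):
        fields = field_config.get(field_type, [])
        if fields:
            return fields
    return []


# Hypothetical configuration with no LangDetectType or SourceType fields.
config = {"LangDetectType": [], "Index": ["TITLE", "BODY"]}
print(fields_for_detection(config))
```

Here the Index fields are used because neither LangDetectType nor SourceType fields are configured.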

By default, ALD uses the first 50kB of text in those fields. It also uses the MaxLanguageDetectTerms configuration parameter to determine how many terms to use for further word-level analysis, if required. Increasing this parameter can affect speed but generally has minimal effect on accuracy, so HPE recommends that you use the default value.
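To get a feel for how much of a document falls inside the default window, the following sketch cuts a sample at 50 kB. It assumes that 1 kB is 1024 bytes, that the text is measured as UTF-8 bytes, and that field values are joined with a newline. None of these details are confirmed by this topic, and the names are hypothetical.

```python
from typing import Iterable

DEFAULT_DETECTION_BYTES = 50 * 1024  # assumption: 1 kB = 1024 bytes


def detection_sample(field_values: Iterable[str], limit: int = DEFAULT_DETECTION_BYTES) -> bytes:
    """Join the field values and keep only the first `limit` bytes."""
    joined = b"\n".join(value.encode("utf-8") for value in field_values)
    # A byte-level cut can split a multi-byte character at the boundary;
    # that is acceptable for a size estimate like this one.
    return joined[:limit]


sample = detection_sample(["a" * 60000])
print(len(sample))
```

If the informative text in your documents appears only after a long run of names or numbers, a check like this can show whether it falls outside the window.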