Token Filters

Analyzers reference Token Filters by name. Use existing ones or create variants with IndexMapping.AddCustomTokenFilter:

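var m *IndexMapping = index.Mapping()

// The stop filter reads its tokens from a named token map (see "Stop Token"),
// so register the map first. "color_stop_words" is an illustrative name, and
// token_map.Name is assumed to be the custom token map type.
err := m.AddCustomTokenMap("color_stop_words", map[string]interface{}{
    "type": token_map.Name,
    "tokens": []interface{}{
        "red",
        "green",
        "blue",
    },
})
if err != nil {
    log.Fatal(err)
}
err = m.AddCustomTokenFilter("color_stop_filter", map[string]interface{}{
    "type":           stop_tokens_filter.Name,
    "stop_token_map": "color_stop_words",
})
if err != nil {
    log.Fatal(err)
}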

This creates a new Stop Token Filter named “color_stop_filter” that removes all “red”, “green”, and “blue” tokens. Once registered, the filter can be referenced by name from a custom Analyzer.

Apostrophe

Configuration:

  • type: apostrophe_filter.Name

The Apostrophe Token Filter removes the apostrophe and all characters after it from each token.
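The behavior can be sketched in plain Go. This is an illustration using only the standard library, not the filter's actual implementation; the function name stripAfterApostrophe is hypothetical, and treating both ' and ’ as apostrophes is an assumption. The sketch also drops the apostrophe itself.

```go
package main

import (
	"fmt"
	"strings"
)

// stripAfterApostrophe returns the part of term that precedes the first
// apostrophe. The set of apostrophe characters is an assumption of this sketch.
func stripAfterApostrophe(term string) string {
	if i := strings.IndexAny(term, "'\u2019"); i >= 0 {
		return term[:i]
	}
	return term
}

func main() {
	for _, t := range []string{"Türkiye'de", "2003'te", "kitap"} {
		fmt.Println(stripAfterApostrophe(t))
	}
}
```

Tokens without an apostrophe pass through unchanged.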

Camel Case

The Camel Case Filter splits a token written in camel case into the set of tokens comprising it. For example, the token camelCase would produce camel and Case.
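A minimal sketch of the splitting rule, assuming a boundary wherever a lowercase letter is followed by an uppercase letter. The function splitCamelCase is hypothetical; the real filter may also handle digits and runs of capitals, which this sketch does not.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitCamelCase splits term at every lowercase-to-uppercase transition.
func splitCamelCase(term string) []string {
	var parts []string
	start := 0
	var prev rune
	for i, r := range term {
		if i > 0 && unicode.IsLower(prev) && unicode.IsUpper(r) {
			parts = append(parts, term[start:i])
			start = i
		}
		prev = r
	}
	if start < len(term) {
		parts = append(parts, term[start:])
	}
	return parts
}

func main() {
	fmt.Println(strings.Join(splitCamelCase("camelCase"), " ")) // prints "camel Case"
}
```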

CLD2

The CLD2 Token Filter passes the text of each token to the Compact Language Detection 2 library and replaces the token with a new token containing the detected ISO 639 language code. Input text should already be converted to lower case.

Compound Word Dictionary

The compound word dictionary filter takes a dictionary of words that combine to form compound words. When a token is made up of dictionary words, the filter lets you index those component words individually.

Edge n-gram

The edge n-gram token filter will compute n-grams just like the n-gram token filter, but all the computed n-grams are rooted at one side (either the front or the back).
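As an illustration, the following standard-library sketch computes edge n-grams for a single term. The function edgeNgrams and its parameter names are hypothetical, lengths are counted in runes, and the output order is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// edgeNgrams returns the n-grams of term with lengths minLen..maxLen that
// start at the front of the term (front=true) or end at its back (front=false).
// minLen is assumed to be at least 1.
func edgeNgrams(term string, minLen, maxLen int, front bool) []string {
	runes := []rune(term)
	var out []string
	for n := minLen; n <= maxLen && n <= len(runes); n++ {
		if front {
			out = append(out, string(runes[:n]))
		} else {
			out = append(out, string(runes[len(runes)-n:]))
		}
	}
	return out
}

func main() {
	fmt.Println(strings.Join(edgeNgrams("quick", 1, 3, true), " "))  // prints "q qu qui"
	fmt.Println(strings.Join(edgeNgrams("quick", 1, 3, false), " ")) // prints "k ck ick"
}
```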

Elision

The elision filter identifies and removes articles prefixing a term and separated by an apostrophe.

For example, in French l'avion becomes avion.

The elision filter is configured with a reference to a token map containing the articles.
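The idea can be sketched with the standard library alone. Here the token map is modeled as a plain Go map, the function elide is hypothetical, and recognizing both ' and ’ as apostrophes is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// elide removes a leading article and its apostrophe from term when the text
// before the first apostrophe is listed in articles.
func elide(term string, articles map[string]bool) string {
	i := strings.IndexAny(term, "'\u2019")
	if i < 0 || !articles[term[:i]] {
		return term
	}
	// Skip the apostrophe itself, which may be more than one byte long.
	_, size := utf8.DecodeRuneInString(term[i:])
	return term[i+size:]
}

func main() {
	articles := map[string]bool{"l": true, "d": true, "qu": true}
	fmt.Println(elide("l'avion", articles))     // prints "avion"
	fmt.Println(elide("aujourd'hui", articles)) // prints "aujourd'hui"
}
```

The second call leaves the token untouched because "aujourd" is not in the article map.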

Keyword Marker

The keyword marker filter will identify keywords and mark them as such. Keywords are then ignored by any downstream stemmer.

The keyword marker filter is configured with a token map containing the keywords.
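To illustrate how marking interacts with a later stemmer, here is a small standard-library sketch. The token type, markKeywords, and toyStem are all hypothetical; toyStem only strips a trailing "s" and stands in for a real stemmer.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a simplified stand-in for an analysis token.
type token struct {
	Term    string
	Keyword bool
}

// markKeywords flags every token whose term appears in keywords.
func markKeywords(tokens []token, keywords map[string]bool) {
	for i := range tokens {
		if keywords[tokens[i].Term] {
			tokens[i].Keyword = true
		}
	}
}

// toyStem strips a trailing "s" unless the token is marked as a keyword.
func toyStem(t token) string {
	if t.Keyword {
		return t.Term
	}
	return strings.TrimSuffix(t.Term, "s")
}

func main() {
	tokens := []token{{Term: "walks"}, {Term: "paris"}}
	markKeywords(tokens, map[string]bool{"paris": true})
	for _, t := range tokens {
		fmt.Println(toyStem(t))
	}
}
```

With "paris" marked as a keyword, only "walks" is altered by the toy stemmer.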

Length

The length filter removes tokens that are too short or too long. It takes two parameters: the minimum token length and the maximum token length.
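A minimal sketch of this behavior follows. The function filterByLength is hypothetical, and the sketch assumes that lengths are counted in runes and that both bounds are inclusive.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// filterByLength keeps only the tokens whose length in runes lies within
// [minLen, maxLen].
func filterByLength(tokens []string, minLen, maxLen int) []string {
	var kept []string
	for _, t := range tokens {
		n := utf8.RuneCountInString(t)
		if n < minLen || n > maxLen {
			continue
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	in := []string{"a", "quick", "extraordinarily", "fox"}
	fmt.Println(filterByLength(in, 2, 10)) // prints "[quick fox]"
}
```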

Lowercase

The Lowercase Token Filter will examine each input token and map all Unicode letters to their lower case.

n-gram

The n-gram token filter computes n-grams from each input token. There are two parameters, the minimum and maximum n-gram length.
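For illustration, this standard-library sketch computes the n-grams of one term. The function ngrams is hypothetical, lengths are counted in runes, and the order in which n-grams are emitted is an assumption.

```go
package main

import "fmt"

// ngrams returns every substring of term whose length in runes lies within
// [minLen, maxLen], grouped by starting position. minLen is assumed to be
// at least 1.
func ngrams(term string, minLen, maxLen int) []string {
	runes := []rune(term)
	var out []string
	for start := 0; start < len(runes); start++ {
		for n := minLen; n <= maxLen && start+n <= len(runes); n++ {
			out = append(out, string(runes[start:start+n]))
		}
	}
	return out
}

func main() {
	fmt.Println(ngrams("fox", 1, 2)) // prints "[f fo o ox x]"
}
```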

Porter Stemmer

The Porter stemmer filter applies the Porter stemming algorithm to the input tokens.

Shingle

The Shingle filter computes multi-token shingles from the input token stream. For example, the token stream the quick brown fox, when configured with a shingle minimum and maximum length of 2, would produce the tokens the quick, quick brown, and brown fox.
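The example above can be reproduced with a short standard-library sketch. The function shingles is hypothetical; joining tokens with a single space and omitting the original single tokens are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// shingles joins every run of minLen..maxLen consecutive tokens into one
// shingle. minLen is assumed to be at least 1.
func shingles(tokens []string, minLen, maxLen int) []string {
	var out []string
	for start := range tokens {
		for n := minLen; n <= maxLen && start+n <= len(tokens); n++ {
			out = append(out, strings.Join(tokens[start:start+n], " "))
		}
	}
	return out
}

func main() {
	in := []string{"the", "quick", "brown", "fox"}
	fmt.Printf("%q\n", shingles(in, 2, 2)) // prints ["the quick" "quick brown" "brown fox"]
}
```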

Stemmer

The stemmer token filter takes input terms and applies a stemming process to them.

This implementation uses libstemmer.

The supported languages are:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Porter
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish

Stop Token

Configuration:

  • type: stop_tokens_filter.Name
  • stop_token_map (string): the name of the token map identifying tokens to remove.

The Stop Token Filter is configured with a map of tokens that should be removed from the token stream.
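Conceptually, the filter behaves like the sketch below, which models the token map as a plain Go map. The function removeStopTokens is hypothetical, and exact, case-sensitive matching is an assumption.

```go
package main

import "fmt"

// removeStopTokens returns the tokens that are not present in stop.
func removeStopTokens(tokens []string, stop map[string]bool) []string {
	var kept []string
	for _, t := range tokens {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	stop := map[string]bool{"red": true, "green": true, "blue": true}
	in := []string{"the", "red", "car", "blue"}
	fmt.Println(removeStopTokens(in, stop)) // prints "[the car]"
}
```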

Truncate Token

The truncate token filter truncates each input token to a maximum token length.
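A minimal sketch, assuming the maximum length is counted in runes rather than bytes. The function truncate is hypothetical.

```go
package main

import "fmt"

// truncate shortens term to at most maxLen runes; shorter terms are
// returned unchanged.
func truncate(term string, maxLen int) string {
	runes := []rune(term)
	if len(runes) > maxLen {
		return string(runes[:maxLen])
	}
	return term
}

func main() {
	fmt.Println(truncate("internationalization", 5)) // prints "inter"
	fmt.Println(truncate("fox", 5))                  // prints "fox"
}
```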

Unicode Normalize

The Unicode normalization filter converts the input terms into the specified Unicode Normalization Form.

The supported forms are:

  • nfc
  • nfd
  • nfkc
  • nfkd