
Stop words

Describes how to enable and use stop words

Stop words are the words in a stop list (also called a negative dictionary) that are filtered out (stopped) before or after natural language text is processed because they are considered insignificant. Besides keeping unimportant words from being processed, stop words can also suppress words that are considered noise from a business or societal perspective. The search engine retrieves no matches for queried stop words. By default, Content Graph does not use stop words, but you can configure them.

Stop words with full-text search

The following English words are often considered stop words:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.

A stop word is usually a single word, which is used as a filter to stop a token from being indexed.

For example, if a field has the value "the dog is at the park" and you use this stop list, only the tokens ["dog", "park"] are indexed. You can match only on these two tokens when doing full-text search with the contains or like operators.
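The filtering described above can be sketched in a few lines of Python. This is only an illustration, not how Content Graph is implemented: the function name index_tokens is hypothetical, and splitting on whitespace is an assumption standing in for the real analyzer.

```python
# English stop list quoted from the text above.
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def index_tokens(field_value, stop_words=ENGLISH_STOP_WORDS):
    """Return the tokens that survive stop-word filtering.

    Assumption: tokens are produced by splitting on whitespace; the real
    analyzer is likely more sophisticated.
    """
    return [token for token in field_value.split() if token not in stop_words]


print(index_tokens("the dog is at the park"))  # ['dog', 'park']
```

Because the comparison is an exact string match, a capitalized The would not be removed by this list.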

❗️

Stop words apply only to searchable string fields

Stop words are supported for searchable string fields, but not for normal string, number, date, and Boolean fields. For those field types, stop words are not applied, so queries that contain stop words still return results. Content Graph supports only single-token stop words; multi-word stop words are not applied.

Store custom stop words

Stop words are stored as a text file, where each line is a single stop word.

🚧

Validation of stop words

A line in the stop list cannot be longer than 1,000 characters/bytes, and the maximum number of entries is 50,000.
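Before uploading, a stop list can be checked locally against these limits. The sketch below is a hypothetical helper, not part of any Content Graph SDK. It assumes that blank lines do not count as entries, measures each line by its UTF-8 byte length (which is never smaller than its character count), and additionally flags multi-word lines because those are not applied.

```python
MAX_LINE_LENGTH = 1_000  # per-line limit, from the text above
MAX_ENTRIES = 50_000     # maximum number of entries, from the text above


def validate_stop_list(text):
    """Return a list of problems found; an empty list means the checks pass.

    Assumptions: blank lines are skipped, and the length limit is checked
    against the UTF-8 byte length of each line.
    """
    problems = []
    entries = [line for line in text.splitlines() if line.strip()]
    if len(entries) > MAX_ENTRIES:
        problems.append(f"{len(entries)} entries exceed the limit of {MAX_ENTRIES}")
    for number, entry in enumerate(entries, start=1):
        if len(entry.encode("utf-8")) > MAX_LINE_LENGTH:
            problems.append(f"entry {number} is longer than {MAX_LINE_LENGTH} bytes")
        if len(entry.split()) > 1:
            # Multi-word stop words are not applied, so flag them early.
            problems.append(f"entry {number} contains more than one word")
    return problems


print(validate_stop_list("the\nNew York"))
```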

Stop words are treated case-sensitively at both index and query time. For example, the is different from The. This can be useful to fully index and query The Guardian (newspaper) while ignoring the in the guardian in full-text search. The following example stop list consists of single words only and is used in the query examples below:

the
Schwarzenegger
amy
Bob

You can store stop words using the REST endpoint configured in the GraphQL Gateway. It requires authorization using your HMAC key and secret.

  • PUT <GATEWAY_URL>/resources/stopwords with the following optional query string:
    • language_routing to store the custom stop words in the request body for a specific locale (default when not provided is standard, i.e., no locale)

The body should consist of stop words as previously described, or can be empty if you do not want to configure any stop words (default behavior). If you do not use a query parameter with this endpoint, the custom stop list is applied to the INVARIANT locale (index with no languages configured).
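As a rough sketch, the request described above could be assembled with the Python standard library as follows. Several details are assumptions: the function name build_stopwords_request is hypothetical, the gateway URL and the Authorization value are placeholders, the text/plain content type is a guess, and the HMAC signing itself is not shown because it is not specified here. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_stopwords_request(gateway_url, stop_words, authorization, language_routing=None):
    """Build (but do not send) the PUT request that stores a stop list.

    `authorization` is a ready-made HMAC Authorization header value; how it
    is computed is outside the scope of this sketch.
    """
    url = gateway_url.rstrip("/") + "/resources/stopwords"
    if language_routing is not None:
        url += "?" + urlencode({"language_routing": language_routing})
    body = "\n".join(stop_words).encode("utf-8")  # one stop word per line
    headers = {
        "Authorization": authorization,
        "Content-Type": "text/plain; charset=utf-8",  # assumed content type
    }
    return Request(url, data=body, headers=headers, method="PUT")


request = build_stopwords_request(
    "https://gateway.example.invalid",     # placeholder for <GATEWAY_URL>
    ["the", "Schwarzenegger", "amy", "Bob"],
    authorization="<HMAC_AUTHORIZATION>",  # placeholder, not a real value
    language_routing="en",
)
print(request.get_method(), request.full_url)
```

Passing an empty list produces an empty body, which corresponds to configuring no stop words. With real values in place, the request could be sent with urllib.request.urlopen(request).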

After you store stop words, they are automatically applied when content is synchronized and ignored when you query with Content Graph.

❗️

Stop lists need to be configured before content synchronization

You need to store your stop words in Content Graph before provisioning your account and synchronizing content. You cannot update stop lists in place after your account is provisioned. To change your stop list, you need to do the following:

  1. Upload the updated stop list with the PUT endpoint as described above.
  2. Reset the account.
  3. Synchronize content.

Query examples

For full-text search with the contains and like operators on searchable string fields, only single-token stop words are supported; multi-word stop words are not applied.

When Schwarzenegger is a stop word and it occurs as Schwarzenegger (case-sensitive) in your content, the following query does not return any results.

{
  BiographyPage(where: { Name: { contains: "Schwarzenegger" } }) {
    items {
      Name
      Die
      Born
      Language {
        DisplayName
        Name
      }
      _score
    }
  }
}

However, if the name Amy Winehouse occurs in your content and amy (note the lowercase) is defined as a stop word, the following GQL query still returns a result. The term Amy (note the uppercase) is not a stop word, so it was indexed and can be matched.

{
  BiographyPage(where: { Name: { contains: "Amy" } }) {
    items {
      Name
      Die
      Born
      Language {
        DisplayName
        Name
      }
      _score
    }
  }
}

The following query, which uses the like operator, is equivalent and also returns the result.

{
  BiographyPage(where: { Name: { like: "%Amy%" } }) {
    items {
      Name
      Die
      Born
      Language {
        DisplayName
        Name
      }
      _score
    }
  }
}

Both queries return this result:

{
  "data": {
    "BiographyPage": {
      "items": [
        {
          "Name": "Amy Winehouse",
          "Die": "2011-07-23T00:00:00Z",
          "Born": "1983-11-14T00:00:00Z",
          "Language": {
            "DisplayName": "English",
            "Name": "en"
          },
          "_score": 1.6928279
        }
      ]
    }
  }
}

Stop words are processed case-sensitively at indexing time. The following query therefore does not return any results, because the query term amy is a stop word.

{
  BiographyPage(where: { Name: { contains: "amy" } }) {
    items {
      Name
      Die
      Born
      Language {
        DisplayName
        Name
      }
      _score
    }
  }
}
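The outcomes of the query examples can be reproduced with a small toy model. This is a sketch under stated assumptions, not the search engine's actual algorithm: tokens are split on whitespace, stop words are removed with an exact, case-sensitive comparison both when indexing and when querying, and a query whose tokens are all stop words matches nothing. The field value "Arnold Schwarzenegger" is a hypothetical example.

```python
STOP_WORDS = {"the", "Schwarzenegger", "amy", "Bob"}  # example stop list from the text


def tokens(text):
    """Split on whitespace and drop stop words (case-sensitive comparison)."""
    return [token for token in text.split() if token not in STOP_WORDS]


def contains(field_value, query):
    """Toy model of `contains`: every surviving query token must be indexed."""
    indexed = set(tokens(field_value))
    wanted = tokens(query)
    # A query reduced to nothing by the stop list cannot match anything.
    return bool(wanted) and all(token in indexed for token in wanted)


print(contains("Amy Winehouse", "Amy"))                     # True
print(contains("Amy Winehouse", "amy"))                     # False
print(contains("Arnold Schwarzenegger", "Schwarzenegger"))  # False
```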