Index solution

When the Optimizely Content Management System (CMS) project references the EPiServer.Find.Cms assembly, published content is indexed automatically. Content is also reindexed when it is saved or moved, and removed from the index when it is deleted. CMS indexes each language version as a separate document.

Indexing module

The indexing module is an IInitializableModule that indexes content in response to DataFactory events. Whenever content is saved, published, moved, or deleted, the module sends an index request to the ContentIndexer.Instance object, which performs the indexing.

ContentIndexer.Instance

The ContentIndexer.Instance singleton, located in the EPiServer.Find.Cms namespace, adds support for indexing IContent and UnifiedFile objects. ContentIndexer.Instance supports reindexing the entire page tree, specific language branches, and individual content items and files. When indexing an IContent object, page files are also indexed.

Invisible mode

By default, ContentIndexer works in invisible mode when indexing objects passed by the IndexingModule. In invisible mode, indexing runs in a separate thread rather than in the DataFactory event thread, so it does not delay the save or publish action. To override this default behavior, set ContentIndexer.Instance.Invisible to false.
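As a minimal sketch of that override, the property can be set once during startup. The module name SynchronousIndexingInitialization is hypothetical, and placing the assignment in an initialization module is an assumption; any code that runs once at application startup should work the same way.

```csharp
using EPiServer.Find.Cms;
using EPiServer.Framework;
using EPiServer.Framework.Initialization;

// Hypothetical module name; only the Invisible property comes from the text above.
[InitializableModule]
[ModuleDependency(typeof(EPiServer.Web.InitializationModule))]
public class SynchronousIndexingInitialization : IInitializableModule
{
    public void Initialize(InitializationEngine context)
    {
        // Turn off invisible mode so event indexing no longer runs
        // in a separate thread.
        ContentIndexer.Instance.Invisible = false;
    }

    public void Uninitialize(InitializationEngine context) { }
}
```

With invisible mode turned off, indexing runs in the DataFactory event thread, so save and publish actions wait for it to finish.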

Bind event indexing with a dedicated instance

In Find 16.2, the processing of the content event indexing queue is tied to the scheduler web app. Without a scheduler app, the queue is processed in all instances. By default, all instances can still populate the queue.

The SchedulerOptions.Enabled option governs this behavior. See Scheduled Jobs.

To circumvent this behavior, you can enable the processing of the content event indexing queue with the following configuration:

services.Configure<FindCmsOptions>(options => {
    options.DisableScheduledPageQueue = false;
});
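For context, the SchedulerOptions.Enabled option mentioned above might be set as in the following sketch. It assumes that SchedulerOptions lives in the EPiServer.Scheduler namespace, that the application uses a standard Startup class, and that setting Enabled to false turns off the scheduler on that instance. Treat it as an illustration and check the Scheduled Jobs documentation before relying on it.

```csharp
using EPiServer.Scheduler; // assumed namespace for SchedulerOptions
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Assumption: this instance should not act as the scheduler app.
        services.Configure<SchedulerOptions>(options =>
        {
            options.Enabled = false;
        });
    }
}
```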

Customize pages to be indexed

The ContentIndexer.Instance has conventions for customizing indexing. For example, you can control which pages are indexed and dependencies between pages.

To control which content is indexed, pass a verification expression to the ShouldIndex convention. By default, CMS indexes published content.

For example, to keep a page type (such as LoginPageType) out of the index, pass the ShouldIndex convention a verification expression that evaluates to false for LoginPageType. Preferably, do this during application startup, such as in the global.asax file's Application_Start method.

//using EPiServer.Find.Cms.Conventions;

ContentIndexer.Instance.Conventions
  .ForInstancesOf<LoginPageType>()
  .ShouldIndex(x => false);

To override the default setting, add a convention for PageData and add the appropriate verification expression.

//using EPiServer.Find.Cms.Conventions;
ContentIndexer.Instance.Conventions
  .ForInstancesOf<PageData>()
  .ShouldIndex(x => true);

To exclude a property from being indexed, use the JsonIgnore attribute or add a convention for it.

//using EPiServer.Find.Cms.Conventions;
SearchClient.Instance.Conventions
  .ForInstancesOf<PageData>()
  .ExcludeField(x => x.ACL);

[JsonIgnore]
public DateInterval Interval { get; set; }

File indexing

By default, files that implement IContentMedia are indexed when they have one of the following MIME types:

text/plain
application/pdf
application/postscript
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document

using System;
using EPiServer.Core;
using EPiServer.Find.Cms;
using EPiServer.Find.Cms.Conventions;
using EPiServer.Forms.Implementation.Elements;
using EPiServer.Framework;
using EPiServer.Framework.Initialization;
using EPiServer.ServiceLocation;

// Example: an initialization module that customizes which media files are indexed.
[InitializableModule]
[ModuleDependency(typeof(EPiServer.Web.InitializationModule))]
public class FindInitialization : IInitializableModule
{
    private ContentAssetHelper contentAssetHelper;

    public void Initialize(InitializationEngine context)
    {
        contentAssetHelper = ServiceLocator.Current.GetInstance<ContentAssetHelper>();

        // Media: index a file only if ShouldIndexDocument returns true.
        ContentIndexer.Instance.Conventions
            .ForInstancesOf<MediaData>()
            .ShouldIndex(p => ShouldIndexDocument(p));
    }

    private bool ShouldIndexDocument(MediaData content)
    {
        // Do not index files that were uploaded through a Forms file upload element.
        if (contentAssetHelper.GetAssetOwner(content.ContentLink) is FileUploadElementBlock)
        {
            return false;
        }

        return !content.IsDeleted && IsNotArchived(content.StopPublish);
    }

    // True when no stop-publish date is set or the date is still in the future.
    private static bool IsNotArchived(DateTime? stopPublishDate)
    {
        return stopPublishDate == null || stopPublishDate > DateTime.Now;
    }

    public void Uninitialize(InitializationEngine context) { }
}

Change the name or namespaces of page types

If you change the name or namespace of a page type, a mismatch occurs between the types stored in the index and the new page types. This might cause errors when querying because the API cannot resolve the correct page type from the type information that the index returns. To solve this, reindex all pages using the scheduled indexing job so that the new page types are reflected in the index.

String length indexing limitation

In the underlying functionality for Optimizely Search & Navigation, the Elasticsearch ignore_above parameter is set to 8191. The value is the quotient of the maximum number of bytes per term in Lucene (32766) and the maximum number of bytes per UTF-8 character (4), rounded down: 32766 / 4 = 8191.5, which gives 8191.

If a string field is longer than ignore_above, it will not be indexed or stored. You cannot modify this setting.
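Because the setting cannot be changed, it can help to detect oversized values before they reach the index. The following self-contained sketch is not part of the Search & Navigation API: the IgnoreAboveCheck class and its ExceedsLimit method are hypothetical. It assumes the limit is compared against the string length as .NET counts it (UTF-16 code units).

```csharp
using System;

// Hypothetical helper; not part of the Optimizely Search & Navigation API.
public static class IgnoreAboveCheck
{
    // The ignore_above value described above.
    public const int IgnoreAbove = 8191;

    // Returns true when the value is longer than the limit and would
    // therefore not be indexed or stored. Assumption: length is measured
    // in UTF-16 code units, as reported by string.Length.
    public static bool ExceedsLimit(string value)
    {
        return value != null && value.Length > IgnoreAbove;
    }
}

public static class Program
{
    public static void Main()
    {
        string atLimit = new string('a', 8191);
        string overLimit = new string('a', 8192);

        Console.WriteLine(IgnoreAboveCheck.ExceedsLimit(atLimit));   // False
        Console.WriteLine(IgnoreAboveCheck.ExceedsLimit(overLimit)); // True
    }
}
```

A check like this could be used to log or truncate long values before indexing. Whether truncation is acceptable depends on how the field is queried.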

Related blog post: Why is my Find indexing job freezing and dying?