Index integrated solution
Describes the indexing of an integrated solution with Optimizely Search & Navigation and Optimizely Content Management System (CMS).
Because you reference the EPiServer.Find.Cms assembly in the Optimizely Content Management System (CMS) project, published content is automatically indexed. Content is also reindexed or deleted from the index when saved, moved, or deleted. CMS indexes each language version as a separate document.
Indexing module
The indexing module is an IInitializableModule that handles DataFactory event indexing. Whenever content is saved, published, moved, or deleted, it triggers an index request to the ContentIndexer.Instance object, which handles the indexing.
ContentIndexer.Instance
The ContentIndexer.Instance singleton, located in the EPiServer.Find.Cms namespace, adds support for indexing IContent and UnifiedFile objects. ContentIndexer.Instance supports reindexing the entire PageTree, specific language branches, and individual content items and files. When indexing an IContent object, page files are also indexed.
Invisible mode
A core feature of the ContentIndexer is its ability to work in invisible mode when indexing objects passed by the IndexingModule. Invisible mode handles indexing in a separate thread rather than in the DataFactory event thread, so indexing does not delay the save or publish action. To override this default behavior, set ContentIndexer.Instance.Invisible to false.
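As a minimal sketch, the override could be placed in an initialization module so that it runs once at startup. The module name SynchronousIndexingInitialization is hypothetical; the Invisible property and the module pattern are the ones described on this page.

```csharp
using EPiServer.Find.Cms;
using EPiServer.Framework;
using EPiServer.Framework.Initialization;

// Hypothetical module name, used only for illustration.
[InitializableModule]
[ModuleDependency(typeof(EPiServer.Web.InitializationModule))]
public class SynchronousIndexingInitialization : IInitializableModule
{
    public void Initialize(InitializationEngine context)
    {
        // Turn off invisible mode: indexing then runs on the event thread,
        // so save and publish actions wait for the index request.
        ContentIndexer.Instance.Invisible = false;
    }

    public void Uninitialize(InitializationEngine context) { }
}
```

With invisible mode off, editors may notice slower save and publish actions, so this setting is probably best limited to cases where indexing must finish before the action returns.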
Binding event indexing with a dedicated instance
In Find 16.2, the processing of the content event indexing queue is tied to the scheduler web app. Without a scheduler app, the queue is processed in all instances. By default, all instances can still populate the queue.
The SchedulerOptions.Enabled option governs this behavior. See Scheduled Jobs.
To circumvent this behavior, you can enable the processing of the content event indexing queue with the following configuration:
services.Configure<FindCmsOptions>(options => {
    options.DisableScheduledPageQueue = false;
});
Conventions
The ContentIndexer.Instance has conventions for customizing indexing. For example, you can control which pages are indexed (described below) and dependencies between pages.
Customize pages to be indexed
To control which content is indexed, pass a verification expression to the ShouldIndex convention. By default, CMS indexes published content.
For example, if you do not want to index a page type (such as the LoginPageType), pass a verification expression that evaluates to false for the LoginPageType to the ShouldIndex convention. Preferably, do this during application startup, such as in the global.asax file's Application_Start method.
//using EPiServer.Find.Cms.Conventions;
ContentIndexer.Instance.Conventions
    .ForInstancesOf<LoginPageType>()
    .ShouldIndex(x => false);
To override the default setting, add a convention for PageData and add the appropriate verification expression.
//using EPiServer.Find.Cms.Conventions;
ContentIndexer.Instance.Conventions
    .ForInstancesOf<PageData>()
    .ShouldIndex(x => true);
To exclude a property from being indexed, use the JsonIgnore attribute or add a convention for it.
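//using EPiServer.Find.Cms.Conventions;
// Option 1: exclude the property with a convention.
SearchClient.Instance.Conventions
    .ForInstancesOf<PageData>()
    .ExcludeField(x => x.ACL);

// Option 2: mark the property with the JsonIgnore attribute.
[JsonIgnore]
public DateInterval Interval { get; set; }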
File indexing
By default, files based on IContentMedia are indexed when they have one of the following MIME types:
text/plain
application/pdf
application/postscript
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
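The default list can be mirrored in a small lookup, for example to check in advance whether a file would be picked up by default. This is only a sketch: the class and method names are hypothetical and not part of the Optimizely API, and the case-insensitive comparison is an assumption rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the Optimizely API.
public static class DefaultIndexedMimeTypes
{
    // The MIME types listed above as indexed by default.
    private static readonly HashSet<string> MimeTypes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "text/plain",
            "application/pdf",
            "application/postscript",
            "application/msword",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
        };

    // Returns true when the given MIME type appears in the default list.
    public static bool Contains(string mimeType)
    {
        return !string.IsNullOrWhiteSpace(mimeType)
            && MimeTypes.Contains(mimeType.Trim());
    }
}
```

For instance, DefaultIndexedMimeTypes.Contains("application/pdf") is true, while an image type such as "image/png" is not in the list.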
Manage PII in search results
You should not index form attachments if there is a chance those attachments contain sensitive information, such as personally identifiable information (PII). For example, if visitors upload documents with PII as form attachments, Optimizely Search & Navigation indexes them even though form attachment permissions prevent viewing the files. As a result, the information might surface in search results.
To prevent this, add the following initialization code, which excludes those form attachments from the index.
using System;
using EPiServer.Core;
using EPiServer.Find.Cms;
using EPiServer.Find.Cms.Conventions;
using EPiServer.Forms.Implementation.Elements;
using EPiServer.Framework;
using EPiServer.Framework.Initialization;
using EPiServer.ServiceLocation;

[InitializableModule]
[ModuleDependency(typeof(EPiServer.Web.InitializationModule))]
public class FindInitialization : IInitializableModule
{
    private ContentAssetHelper contentAssetHelper;

    public void Initialize(InitializationEngine context)
    {
        contentAssetHelper = ServiceLocator.Current.GetInstance<ContentAssetHelper>();

        // Media: decide per item whether it should be indexed.
        ContentIndexer.Instance.Conventions
            .ForInstancesOf<MediaData>()
            .ShouldIndex(p => ShouldIndexDocument(p));
    }

    private bool ShouldIndexDocument(MediaData content)
    {
        // Do not index files that were uploaded through an Optimizely Forms file upload element.
        if (contentAssetHelper.GetAssetOwner(content.ContentLink) is FileUploadElementBlock)
        {
            return false;
        }

        return !content.IsDeleted && IsNotArchived(content.StopPublish);
    }

    // True when there is no stop-publish date or that date is still in the future.
    private static bool IsNotArchived(DateTime? stopPublishDate)
    {
        return stopPublishDate == null || stopPublishDate > DateTime.Now;
    }

    public void Uninitialize(InitializationEngine context) { }
}
Change the name or namespaces of page types
If you change the name or namespace of a page type, a mismatch occurs between the types in the index and the new page types. This might cause errors when querying because the API cannot resolve the correct page type from what the index reports. To solve this, reindex all pages using the scheduled indexing job so that the new page types are reflected in the index.
Improve search relevancy for attachments
By default, search relevancy for text inside an attachment is imperfect because attachments are indexed in the default language, which might not match the document's content. CMS content, in contrast, is indexed using the enabled languages to improve search relevancy.
Also, when you browse an attachment in the Search & Navigation explore view, the attachment text is not readable because it is indexed as its base64 representation.
To improve the search relevancy of text attachments, use the IAttachmentHelper interface, which lets developers implement their own parsing of attachments. Out of the box, Optimizely provides an implementation of IAttachmentHelper that uses Microsoft IFilter functionality. For this to work, the correct IFilters need to be installed on the client.
You should use the EPiServer.Find.Cms.AttachmentFilter package because it enhances the quality of your search.
Use the default implementation of IAttachmentHelper
- Install the EPiServer.Find.Cms.AttachmentFilter NuGet package.
- Determine which attachment file types you want to support (PDF and Microsoft Word). Each file type has a corresponding filter. The list of file types and filters is below.
- Download and install the selected filters.
- Restart.
- Add some supported file attachments to your site.
- Log into your website and go to Find > Overview > Explore.
- Find the attachments and verify their content is stored as readable text under SearchAttachmentText$$String.
Supported file formats
Using IFilters with Search & Navigation, you can parse the file types below.
File types: adw, ai, doc, docm, docx, dwg, eps, gif, html, htm, jpeg, jpg, mm, msg, odt, ods, odp, odi, one, otf, otp, pdf, png, ppt, pptm, pptx, ps, rar, sda, sdg, sdm, sfs, sgf, smf, std, sti, stw, svg, sxd, sxi, txt, vdx, vsd, vor, vss, vst, vsx, vtx, wma, wmv, xls, xlsb, xlsm, xlsx, xml, zip
For many file types, more than one filter is available. For example, you can find more filters on IFilterShop.
Some common file types and their filters are listed below.
PDF
Adobe has the PDF IFilter, although it does not work in all environments.
Microsoft Office 2010 filter packs
Microsoft's filter pack covers the file types below.
- Legacy Office Filter (97-2003; .doc, .ppt, .xls)
- Metro Office Filter (2007; .docx, .pptx, .xlsx)
- Zip Filter
- OneNote filter
- Visio Filter
- Publisher Filter
- Open Document Format Filter
String length indexing limitation
In the underlying functionality for Optimizely Search & Navigation, there is an ignore_above Elasticsearch parameter set to 8191. The value is the quotient of the maximum number of bytes per term in Lucene (32766) and the maximum number of bytes per UTF-8 character (4), rounded down: 32766 / 4 = 8191.5, which gives 8191.
If a string field is longer than ignore_above, it will not be indexed or stored. You cannot modify this setting.
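Because the setting cannot be changed, it can help to detect oversized values before they reach the index. The following is a minimal sketch with a hypothetical helper (not part of the Optimizely API). It assumes the limit is compared against the length that string.Length reports; the source does not confirm how the index counts characters.

```csharp
using System;

// Hypothetical helper, not part of the Optimizely API.
public static class IgnoreAboveGuard
{
    // 32766 bytes per term divided by 4 bytes per UTF-8 character;
    // integer division rounds down to 8191.
    public const int IgnoreAbove = 32766 / 4;

    // Assumption: the limit applies to the length reported by string.Length.
    public static bool ExceedsLimit(string value)
    {
        return value != null && value.Length > IgnoreAbove;
    }
}
```

A check like this could run before indexing to log or shorten long string fields, rather than having them left out of the index.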
Related blog post: Why is my Find indexing job freezing and dying?