Text extraction from media
Describes how to extract searchable text from PDF files and images in Optimizely Graph, so the extracted content can be queried through the GraphQL API alongside other content properties.
Text extraction in Optimizely Graph reads the textual content of media files (PDFs, Office documents, and images), then exposes that text on the content type so it can be queried through the GraphQL API. Use text extraction to make uploaded media searchable alongside structured content, without maintaining a separate index.
Prerequisites
Before you call the Extract service, confirm that you have the following:
- An Optimizely Graph account with API access.
- An HMAC key and secret, or a single key, with permission to call the POST <GATEWAY_URL>/extract REST endpoint.
- The file you want to process encoded as a base64 string in the request body, in one of the supported file formats listed in this article.
How text extraction works
Text is extracted from media such as PDF files and images. The extracted text is added to an additional text property (the Content field) on the content type, which can be used just like any other property on a content type.
The following is an example GraphQL response:
{
  "data": {
    "ImageFile": {
      "items": [
        {
          "Content": "a man wearing a suit and tie smiling at the camera",
          "ContentType": [
            "Image",
            "Media",
            "ImageFile",
            "Content"
          ],
          "Name": "ToddSlayton.jpg",
          "Url": "http://localhost:8081/globalassets/contact-portraits/toddslayton.jpg"
        }
      ]
    }
  }
}
Note: Use the match operator on the Content field because of its efficient full-text capabilities.
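As an illustration of the note above, the following Python sketch builds a GraphQL request payload that filters the Content field with the match operator. The type and field names (ImageFile, Name, Url, Content) come from the example response. The exact shape of the where and limit arguments is an assumption and should be checked against your own schema. The function name build_content_match_query is hypothetical.

```python
import json


def build_content_match_query(content_type: str, phrase: str, limit: int = 10) -> dict:
    """Build a GraphQL payload that full-text matches the extracted Content field.

    Assumption: the schema accepts `where: { Content: { match: ... } }` and a
    `limit` argument on the content type. Verify this against your schema.
    """
    # The type name is interpolated into the query text, so accept only a
    # plain identifier to avoid producing a malformed query.
    if not content_type.isidentifier():
        raise ValueError(f"invalid content type name: {content_type!r}")

    query = (
        "query SearchExtractedText($phrase: String, $limit: Int) {\n"
        f"  {content_type}(where: {{ Content: {{ match: $phrase }} }}, limit: $limit) {{\n"
        "    items { Name Url Content }\n"
        "  }\n"
        "}"
    )
    # The search phrase travels as a variable rather than inside the query text.
    return {"query": query, "variables": {"phrase": phrase, "limit": limit}}


payload = build_content_match_query("ImageFile", "man wearing a suit")
print(json.dumps(payload, indent=2))
```

The payload can then be posted to your GraphQL endpoint with whichever HTTP client and authorization you already use.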
If you use the NuGet package Optimizely.ContentGraph.Cms to synchronize content, no further action is required. Everything is integrated into this package.
Call the Extract service
Call the Extract service directly when you synchronize content outside the Optimizely.ContentGraph.Cms NuGet package, or when you want to extract text from a one-off file without going through the content sync pipeline.
Call the Extract service by using the REST endpoint configured in the GraphQL Gateway. It requires authorization. Use either the HMAC key and secret or single-key authorization.
Send POST <GATEWAY_URL>/extract with the following optional query string parameter:
detect_language:
- If the file is in a text format, this option indicates whether to detect the language of the text (the default, when not provided, is false). The detected language value is an ISO 639-1 code (for example, sv or en).
- If the file is an image, ignore this option.
The request body should consist of the content of the file to process, encoded as a base64 string.
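The request described above could be assembled as in the following Python sketch, which uses only the standard library. It is a minimal illustration under stated assumptions: the function name build_extract_request is hypothetical, the Authorization header value is supplied by the caller because its exact format depends on whether you use HMAC or single-key authorization, passing detect_language as the literal string true is assumed, and the text/plain content type is assumed.

```python
import base64
import urllib.parse
import urllib.request


def build_extract_request(gateway_url, file_bytes, authorization, detect_language=False):
    """Build (but do not send) a POST <GATEWAY_URL>/extract request."""
    url = gateway_url.rstrip("/") + "/extract"
    if detect_language:
        # Assumption: the boolean option is passed as the string "true".
        url += "?" + urllib.parse.urlencode({"detect_language": "true"})

    # The body is the file content encoded as a base64 string.
    body = base64.b64encode(file_bytes)

    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Caller supplies the full header value for HMAC or single-key auth.
            "Authorization": authorization,
            # Assumption: the service accepts a plain-text base64 body.
            "Content-Type": "text/plain",
        },
    )


request = build_extract_request(
    "https://gateway.example.com",      # placeholder for <GATEWAY_URL>
    b"hello",                           # normally the bytes read from the file
    "<AUTHORIZATION_HEADER_VALUE>",     # placeholder, not a real credential
    detect_language=True,
)
print(request.get_method(), request.full_url)
# To send the request: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before calling the service.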
The following file formats are supported:
- Text file formats – .DOC, .DOCX, .XLS, .XLSX, .TXT
- Image file formats – .JPEG, .PNG, .GIF, .BMP, .TIFF, .WEBP
The following is an example response:
{
  "type": "ocr",
  "language": "en",
  "text": "a man wearing a suit and tie smiling at the camera",
  "result": "Ok",
  "error": null
}
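A response in this shape could be handled as in the following Python sketch. The field names (type, language, text, result, error) come from the example response. Treating any result other than Ok as a failure is an assumption, and the names parse_extract_response and ExtractError are hypothetical.

```python
import json
from typing import Optional, Tuple


class ExtractError(Exception):
    """Raised when the Extract service reports a result other than "Ok"."""


def parse_extract_response(raw: str) -> Tuple[str, Optional[str]]:
    """Return (text, language) from an Extract service JSON response."""
    data = json.loads(raw)
    # Assumption: "Ok" is the only success value; anything else is an error.
    if data.get("result") != "Ok":
        raise ExtractError(str(data.get("error") or "extraction failed"))
    # The language may be missing or null in a response.
    return data.get("text", ""), data.get("language")


sample = """{
  "type": "ocr",
  "language": "en",
  "text": "a man wearing a suit and tie smiling at the camera",
  "result": "Ok",
  "error": null
}"""

text, language = parse_extract_response(sample)
print(language, text)
```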
