Text extraction field
Describes how to work with text extraction when using the GraphQL API of the Optimizely querying service to retrieve content in Optimizely solutions.
Text is extracted from media such as PDF files and images. The extracted text is added to an additional text property (the Content field) on the content type, and you can use it like any other property on that content type.
Example GraphQL response:
{
  "data": {
    "ImageFile": {
      "items": [
        {
          "Content": "a man wearing a suit and tie smiling at the camera",
          "ContentType": [
            "Image",
            "Media",
            "ImageFile",
            "Content"
          ],
          "Name": "ToddSlayton.jpg",
          "Url": "http://localhost:8081/globalassets/contact-portraits/toddslayton.jpg"
        }
      ]
    }
  }
}
Note
Use the match operator on the Content field, because of its efficient full-text search capabilities.
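As a minimal sketch of the note above, the following Python snippet builds a GraphQL request payload that applies the match operator to the Content field of ImageFile. The `where: { Content: { match: ... } }` shape, the `String` variable type, and the helper name `build_match_query` are assumptions for illustration; check them against your own schema before relying on them.

```python
import json


def build_match_query(search_text: str) -> dict:
    """Build a GraphQL payload that full-text searches the extracted Content field.

    Assumption: the schema exposes ImageFile with a `where` filter that
    accepts `match` on Content. Adjust to your own content types.
    """
    query = """
    query SearchExtractedText($text: String) {
      ImageFile(where: { Content: { match: $text } }) {
        items {
          Name
          Url
          Content
        }
      }
    }
    """
    # Pass the search phrase as a variable instead of formatting it into the query.
    return {"query": query, "variables": {"text": search_text}}


payload = build_match_query("suit and tie")
print(json.dumps(payload, indent=2))
```

The payload can then be posted to your GraphQL endpoint with any HTTP client.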
If you use the NuGet package Optimizely.ContentGraph.Cms to synchronize content, you do not need to do anything more. Everything has been integrated into this package.
You can call the Extract service by using the REST endpoint configured in the GraphQL Gateway. The endpoint requires authorization; you can use either the HMAC key and secret or single-key authorization.
POST <GATEWAY_URL>/extract
with the following optional query string parameter:
detect_language:
- If the file is in a text format, this option indicates whether to detect the language of the text (the default, when not provided, is false). The detected language value is an ISO 639-1 code (for example, sv or en).
- If the file is an image, you can ignore this option.
The request body should contain the file content to extract, encoded as a base64 string.
Supported file formats:
- Text file formats – .DOC, .DOCX, .XLS, .XLSX, .TXT
- Image file formats – .JPEG, .PNG, .GIF, .BMP, .TIFF, .WEBP
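The request described above could be assembled roughly as follows. This is a sketch under stated assumptions, not the official client: the helper name `build_extract_request`, the example gateway URL, the `text/plain` content type, sending the base64 string as the raw body, and the placeholder Authorization value are all hypothetical. Supply the header value that matches your HMAC or single-key setup.

```python
import base64
import urllib.parse
import urllib.request


def build_extract_request(
    gateway_url: str,
    file_bytes: bytes,
    auth_header: str,
    detect_language: bool = False,
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to <GATEWAY_URL>/extract."""
    # detect_language is optional and defaults to false on the service side.
    query = urllib.parse.urlencode({"detect_language": str(detect_language).lower()})
    url = f"{gateway_url.rstrip('/')}/extract?{query}"

    # The body is the file content encoded as a base64 string.
    body = base64.b64encode(file_bytes)

    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": auth_header,  # assumption: value depends on your auth scheme
            "Content-Type": "text/plain",  # assumption: not stated in the source
        },
    )


req = build_extract_request(
    "https://gateway.example.com",  # hypothetical gateway URL
    b"Hello, world",
    "<AUTH_HEADER_VALUE>",
    detect_language=True,
)
print(req.get_method(), req.full_url)
```

To send the request, pass `req` to `urllib.request.urlopen` and read the JSON response.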
Example response:
{
  "type": "ocr",
  "language": "en",
  "text": "a man wearing a suit and tie smiling at the camera",
  "result": "Ok",
  "error": null
}
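A response of this shape could be handled along these lines. The field names come from the example above; the helper name `read_extract_response` and the rule that any `result` other than "Ok" means failure are assumptions, since the source shows only a successful response.

```python
import json
from typing import Optional, Tuple


def read_extract_response(raw: str) -> Tuple[str, Optional[str]]:
    """Return (text, language) from an Extract service response body."""
    data = json.loads(raw)
    # Assumption: anything other than "Ok" signals a failed extraction.
    if data.get("result") != "Ok":
        raise RuntimeError(f"Extraction failed: {data.get('error')}")
    return data["text"], data.get("language")


sample = (
    '{"type": "ocr", "language": "en", '
    '"text": "a man wearing a suit and tie smiling at the camera", '
    '"result": "Ok", "error": null}'
)
text, language = read_extract_response(sample)
print(language, text)
```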