Understand your data
Learn how data from external sources is formatted before you sync it to Optimizely Graph using APIs.
Note
The code and sample data for this tutorial are available on GitHub.
Optimizely Graph lets you sync and query content from external data sources using APIs. You can load this data as a batch job. To demonstrate the use of external data, this tutorial uses the non-commercial datasets from IMDb.
Before you can start syncing other data sources to Optimizely Graph, you need to understand the data. This includes knowing the format, the fields used, and the properties of these fields.
- Format – The IMDb datasets are in TSV format, where the first row holds the column headers, each following row is a record, and each column holds one field value.
- Field names – Headers of CSV and TSV files can be used as field names. This tutorial uses the headers as field names.
- Types of fields – Each column in this tutorial has one of the following types:
  - String
  - Integer
  - Float
  - Boolean
  - String array
  Other types that are currently supported but not used in this tutorial are Date and object types.
- Properties of fields – Check whether any of these fields are useful for full-text search. If so, you can set them as searchable.
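The checks in this list can be sketched in a few lines of Python. The following is a minimal illustration, not part of the tutorial's code: it reads tab-separated text, treats the first row as headers, and guesses whether each column holds Integer, Float, or String values. The helper names `read_tsv` and `guess_type` are hypothetical, and the sketch assumes that `\N` marks a missing value, as it does in the IMDb sample rows. It does not try to detect Boolean or String array columns, which you would need to decide on by inspecting the data.

```python
import csv
import io

NULL_MARKER = r"\N"  # assumption: placeholder for a missing value in the IMDb files

# A few rows shaped like the names dataset (tab-separated).
SAMPLE_TSV = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000001\tFred Astaire\t1899\t1987\n"
    "nm0000003\tBrigitte Bardot\t1934\t\\N\n"
)


def read_tsv(text):
    """Return (headers, rows); the first row holds the column headers."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    headers = next(reader)
    return headers, list(reader)


def guess_type(values):
    """Guess Integer, Float, or String from the non-missing values of a column."""
    present = [v for v in values if v != NULL_MARKER]
    if not present:
        return "String"  # nothing to inspect, so fall back to the widest type
    for name, cast in (("Integer", int), ("Float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "String"


headers, rows = read_tsv(SAMPLE_TSV)
field_types = {h: guess_type([r[i] for r in rows]) for i, h in enumerate(headers)}
print(field_types)
```

A guess like this is only a starting point. Review the result against the dataset before you rely on it, because a column that happens to contain only digits in a small sample may still be better treated as a String.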
This tutorial focuses on three datasets of the IMDb data:
- Names
- Titles
- Ratings
When looking at the names, the primaryName and primaryProfession fields are useful for full-text search. To use the headers as field names and mark these two fields as searchable, append the suffix ___searchable to their column headers in the first row.
Note
The ___searchable suffix has three underscores.
nconst primaryName___searchable birthYear deathYear primaryProfession___searchable knownForTitles
nm0000001 Fred Astaire 1899 1987 soundtrack,actor,miscellaneous tt0050419,tt0031983,tt0053137,tt0072308
nm0000002 Lauren Bacall 1924 2014 actress,soundtrack tt0075213,tt0037382,tt0117057,tt0038355
nm0000003 Brigitte Bardot 1934 \N actress,soundtrack,music_department tt0049189,tt0054452,tt0057345,tt0056404
nm0000004 John Belushi 1949 1982 actor,soundtrack,writer tt0080455,tt0077975,tt0072562,tt0078723
nm0000005 Ingmar Bergman 1918 2007 writer,director,actor tt0083922,tt0050976,tt0050986,tt0069467
nm0000006 Ingrid Bergman 1915 1982 actress,soundtrack,producer tt0038787,tt0034583,tt0036855,tt0038109
nm0000007 Humphrey Bogart 1899 1957 actor,soundtrack,producer tt0037382,tt0034583,tt0042593,tt0043265
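Editing the header row by hand works for a single file, but the same change can be scripted. The sketch below is one possible way to do it, assuming the file content is already loaded as a string. The function names `mark_searchable` and `rewrite_header_line` are hypothetical helpers, not part of any Optimizely tooling.

```python
SEARCHABLE_SUFFIX = "___searchable"  # three underscores


def mark_searchable(headers, searchable_fields):
    """Append the suffix to each header named in searchable_fields."""
    wanted = set(searchable_fields)
    unknown = wanted - set(headers)
    if unknown:
        raise ValueError(f"not in header row: {sorted(unknown)}")
    return [h + SEARCHABLE_SUFFIX if h in wanted else h for h in headers]


def rewrite_header_line(tsv_text, searchable_fields):
    """Return the TSV text with only its first line rewritten."""
    first, sep, rest = tsv_text.partition("\n")
    new_first = "\t".join(mark_searchable(first.split("\t"), searchable_fields))
    return new_first + sep + rest


names_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm0000001\tFred Astaire\t1899\t1987\tsoundtrack,actor,miscellaneous\t"
    "tt0050419,tt0031983,tt0053137,tt0072308\n"
)

updated = rewrite_header_line(names_tsv, ["primaryName", "primaryProfession"])
header_line = updated.splitlines()[0]
print(header_line)
```

Raising an error for an unknown field name is a deliberate choice here: a misspelled header would otherwise be skipped silently and the field would not be marked as searchable.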
When looking at the titles, primaryTitle and genres are useful for full-text search, so mark these as searchable.
tconst titleType primaryTitle___searchable originalTitle isAdult startYear endYear runtimeMinutes genres___searchable
tt0000001 short Carmencita Carmencita 0 1894 \N 1 Documentary,Short
tt0000002 short Le clown et ses chiens Le clown et ses chiens 0 1892 \N 5 Animation,Short
tt0000003 short Pauvre Pierrot Pauvre Pierrot 0 1892 \N 4 Animation,Comedy,Romance
tt0000004 short Un bon bock Un bon bock 0 1892 \N 12 Animation,Short
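The titles rows exercise most of the field types listed earlier. As an illustration, the sketch below converts one row into typed Python values. The mapping of columns to types in `TITLE_SCHEMA` is an assumption made by reading the sample rows (for example, isAdult as Boolean and genres as a String array), and the converter names are hypothetical. It also assumes that `\N` means "no value".

```python
NULL_MARKER = r"\N"  # assumption: placeholder for a missing value


def to_str(value):
    return None if value == NULL_MARKER else value


def to_int(value):
    return None if value == NULL_MARKER else int(value)


def to_bool(value):
    # The sample data encodes the flag as 0 or 1.
    return None if value == NULL_MARKER else value == "1"


def to_str_array(value):
    # Comma-separated values become a list of strings.
    return [] if value == NULL_MARKER else value.split(",")


# Assumed type per column, keyed by the header as it appears in the file.
TITLE_SCHEMA = {
    "tconst": to_str,
    "titleType": to_str,
    "primaryTitle___searchable": to_str,
    "originalTitle": to_str,
    "isAdult": to_bool,
    "startYear": to_int,
    "endYear": to_int,
    "runtimeMinutes": to_int,
    "genres___searchable": to_str_array,
}


def convert_row(headers, row):
    """Apply the converter registered for each header to its raw value."""
    return {h: TITLE_SCHEMA[h](v) for h, v in zip(headers, row)}


headers = list(TITLE_SCHEMA)
raw = "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short"
record = convert_row(headers, raw.split("\t"))
print(record)
```

Keeping the schema explicit avoids guessing: a value such as 1 in runtimeMinutes looks the same as a Boolean flag, so only the column name can tell the two apart.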
Finally, when looking at the ratings, there are integers and floats, but no changes are required to the headers.
tconst averageRating numVotes
tt0000001 5.7 1996
tt0000002 5.8 268
tt0000003 6.5 1885
tt0000004 5.5 177
tt0000005 6.2 2670
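Because the ratings headers stay unchanged, reading this file only requires converting the two numeric columns. The following sketch shows one way to do that with Python's standard csv module. The function name `parse_ratings` is hypothetical, and the choice of Float for averageRating and Integer for numVotes is inferred from the sample rows.

```python
import csv
import io

RATINGS_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t1996\n"
    "tt0000002\t5.8\t268\n"
)


def parse_ratings(text):
    """Read tab-separated ratings and convert the numeric columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "tconst": row["tconst"],                       # String
            "averageRating": float(row["averageRating"]),  # Float
            "numVotes": int(row["numVotes"]),              # Integer
        }
        for row in reader
    ]


ratings = parse_ratings(RATINGS_TSV)
print(ratings[0])
```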