Understand your data

Learn how to assess the format of content that comes from external data sources using APIs.



The code and sample data for this tutorial are available on GitHub.

Optimizely Graph lets you synchronize and query content that comes from external data sources using APIs. You can load this data as a batch job. To demonstrate the use of external data, this tutorial uses the non-commercial datasets from IMDb.

Before you can start synchronizing other data sources to Optimizely Graph, you need to understand the data. This includes knowing the format, the fields used, and properties of these fields.

  • Format – The IMDb datasets are in TSV format, where the first row contains the column headers, each subsequent row is a record, and each tab-separated column holds a value.
  • Field names – You can use the headers of CSV or TSV files as field names, as this tutorial does.
  • Types of fields – Columns have specific types in this tutorial:
    • String
    • Integer
    • Float
    • Boolean
    • String array
      Date and object types are also currently supported but are not used in this tutorial.
  • Properties of fields – Check whether any of these fields are useful for full-text search. If so, you can set them as searchable.
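As a minimal sketch of inspecting such a file, the following Python snippet reads TSV text with the standard `csv` module and maps the header row to field names. The function name `read_tsv` and the inline sample are illustrative assumptions, not part of the tutorial code. Quoting is disabled on the assumption that quote characters in the data are literal, and the `\N` marker that IMDb uses for missing values is converted to `None`.

```python
import csv
import io

NULL_MARKER = r"\N"  # IMDb writes \N where a value is missing


def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header row (hypothetical helper)."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # assumption: quote characters in the data are literal
    )
    records = []
    for row in reader:
        # Replace the null marker with None so later type conversion can skip it.
        records.append({k: (None if v == NULL_MARKER else v) for k, v in row.items()})
    return records


# Small inline sample in the same shape as the names dataset.
sample = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000003\tBrigitte Bardot\t1934\t\\N\n"
)
records = read_tsv(sample)
print(list(records[0].keys()))  # field names taken from the header row
print(records[0])
```

At this stage every value is still a string; converting values to integers, floats, Booleans, or string arrays is a separate step.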

This tutorial focuses on three of the IMDb datasets:

  • Names
  • Titles
  • Ratings

In the names dataset, primaryName and primaryProfession are useful for full-text search. Because the headers are used as field names, you can mark these fields as searchable by appending the suffix ___searchable to their column headers in the first row.



The ___searchable suffix begins with three underscores.

nconst        primaryName___searchable    birthYear       deathYear       primaryProfession___searchable  knownForTitles
nm0000001       Fred Astaire        1899    1987    soundtrack,actor,miscellaneous  tt0050419,tt0031983,tt0053137,tt0072308
nm0000002       Lauren Bacall       1924    2014    actress,soundtrack      tt0075213,tt0037382,tt0117057,tt0038355
nm0000003       Brigitte Bardot     1934    \N      actress,soundtrack,music_department     tt0049189,tt0054452,tt0057345,tt0056404
nm0000004       John Belushi        1949    1982    actor,soundtrack,writer tt0080455,tt0077975,tt0072562,tt0078723
nm0000005       Ingmar Bergman      1918    2007    writer,director,actor   tt0083922,tt0050976,tt0050986,tt0069467
nm0000006       Ingrid Bergman      1915    1982    actress,soundtrack,producer     tt0038787,tt0034583,tt0036855,tt0038109
nm0000007       Humphrey Bogart     1899    1957    actor,soundtrack,producer       tt0037382,tt0034583,tt0042593,tt0043265
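The header change shown above could be scripted rather than done by hand. The following is a minimal sketch, assuming the header row is a tab-separated string. The function name `mark_searchable` is hypothetical and not taken from the tutorial's repository.

```python
SEARCHABLE_SUFFIX = "___searchable"  # three underscores


def mark_searchable(header_line, searchable_fields):
    """Append the searchable suffix to the selected column headers (hypothetical helper)."""
    columns = header_line.rstrip("\n").split("\t")
    marked = [
        column + SEARCHABLE_SUFFIX if column in searchable_fields else column
        for column in columns
    ]
    return "\t".join(marked)


# Original header row of the names dataset.
header = "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles"
new_header = mark_searchable(header, {"primaryName", "primaryProfession"})
print(new_header)
```

Only the first row changes. The data rows stay exactly as they are.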

In the titles dataset, primaryTitle and genres are useful for full-text search, so mark these as searchable.

tconst  titleType       primaryTitle___searchable       originalTitle   isAdult startYear       endYear runtimeMinutes  genres___searchable
tt0000001       short   Carmencita      Carmencita      0       1894    \N      1       Documentary,Short
tt0000002       short   Le clown et ses chiens  Le clown et ses chiens  0       1892    \N      5       Animation,Short
tt0000003       short   Pauvre Pierrot  Pauvre Pierrot  0       1892    \N      4       Animation,Comedy,Romance
tt0000004       short   Un bon bock     Un bon bock     0       1892    \N      12      Animation,Short
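The titles dataset covers most of the field types listed earlier. As a rough sketch, a record could be converted from raw strings to typed values as follows. The helper names are hypothetical, and the mapping of isAdult to a Boolean and of genres to a string array is an assumption based on the sample rows. How the typed values are then sent to Optimizely Graph is not covered here.

```python
NULL_MARKER = r"\N"  # IMDb writes \N where a value is missing


def nullable_int(value):
    """Return an int, or None when the value is the null marker."""
    return None if value == NULL_MARKER else int(value)


def string_array(value):
    """Split a comma-separated value into a list of strings."""
    return [] if value == NULL_MARKER else value.split(",")


def parse_title(row):
    """Convert one title record (all strings) into typed values (hypothetical helper)."""
    return {
        "tconst": row["tconst"],                                        # String
        "titleType": row["titleType"],
        "primaryTitle___searchable": row["primaryTitle___searchable"],
        "originalTitle": row["originalTitle"],
        "isAdult": row["isAdult"] == "1",                               # Boolean
        "startYear": nullable_int(row["startYear"]),                    # Integer
        "endYear": nullable_int(row["endYear"]),
        "runtimeMinutes": nullable_int(row["runtimeMinutes"]),
        "genres___searchable": string_array(row["genres___searchable"]),  # String array
    }


header = (
    "tconst\ttitleType\tprimaryTitle___searchable\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres___searchable"
)
line = "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short"
title = parse_title(dict(zip(header.split("\t"), line.split("\t"))))
print(title)
```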

Finally, the ratings dataset contains both floats (averageRating) and integers (numVotes). No changes to the headers are required.

tconst  averageRating   numVotes
tt0000001       5.7     1996
tt0000002       5.8     268
tt0000003       6.5     1885
tt0000004       5.5     177
tt0000005       6.2     2670
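Because the ratings headers are used unchanged, reading this dataset only requires converting the numeric columns. The following is a minimal sketch using Python's standard `csv` module. The function name `parse_ratings` is hypothetical, and the sketch assumes every rating row has values in both numeric columns, as in the sample above.

```python
import csv
import io


def parse_ratings(text):
    """Parse ratings TSV text into records with float and int values (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "tconst": row["tconst"],
            "averageRating": float(row["averageRating"]),  # Float
            "numVotes": int(row["numVotes"]),              # Integer
        }
        for row in reader
    ]


ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t1996\n"
    "tt0000002\t5.8\t268\n"
)
ratings = parse_ratings(ratings_tsv)
print(ratings)
```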