A dataset combines several data sources. You can apply data preparation rules and scorings to it before syncing it to other tools.
For example, you could create the following datasets:
- Master Contacts: Combine all contacts from different sources
- Master Contacts x Orders: All contacts, enriched with information such as the number of purchases, the average basket value, etc.
- Top Customers in France
- Products listing, List of stores, etc.
Datasets can be built from your raw data sources, or from another dataset.
You can create a dataset without code or in SQL.
This guide focuses on “No code” dataset creation.
1. Create the dataset from the first source
Step 1 - Create a dataset
From the “Datasets” menu on the left, click “Build dataset” at the top right.
You will be asked to choose between “No code Builder” and “SQL Builder”, and to name your dataset.
Step 2 - Choose the first source
The first step is to choose the data source of your dataset.
You can choose a table from one of the Sources defined in the “Connectors” menu. It is also possible to choose an existing dataset as a source.
For files on your FTP server, you can select several files at once by applying a regular expression to the file name.
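To illustrate how a regular expression can select several files by name, here is a minimal Python sketch. The file names and the pattern are hypothetical, and the regex dialect that Octolis applies may differ from Python's, so treat this as an illustration of the idea rather than the product's exact behavior.

```python
import re

# Hypothetical file names on an FTP server (not taken from Octolis).
file_names = [
    "orders_2023-01-01.csv",
    "orders_2023-01-02.csv",
    "contacts_2023-01-01.csv",
    "orders_archive.zip",
]

# Hypothetical pattern: daily order exports in CSV format.
pattern = re.compile(r"orders_\d{4}-\d{2}-\d{2}\.csv")

# Keep only the names that match the pattern in full.
selected = [name for name in file_names if pattern.fullmatch(name)]
print(selected)
```

Testing a pattern against a few real file names before saving the source helps avoid picking up unrelated files.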
Step 3 - Describe the first source
You are invited to:
- give a name to the first data source you have just chosen.
- choose the frequency at which the dataset will be updated from this data source. By default, Octolis checks the data source every hour for new or updated records. Depending on your plan, you can set this frequency as low as 1 minute.
- match the data source with a business category. This gives the data source a business meaning.
For real-time use cases, datasets can also be fed by API or webhooks.
Step 4 - Define the fields to import
The objective here is to choose the fields from your data source that will be imported into your dataset. You can choose all fields or select only some of them.
Octolis automatically detects most data “Types”, but it can make mistakes. Please take the time to check each data “Type”, because a wrong data “Type” may cause issues when you apply a data preparation recipe to that column.
In the “Advanced settings”, you can choose whether Octolis imports the whole data source each time or only the records updated since the last import. For performance reasons, the default and recommended option is to import only the records updated since the last import. This requires a reliable “Updated at” column in your source.
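The idea behind importing only updated records can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Octolis's implementation: the function `select_updated_records`, the field name `updated_at`, and the sample records are all hypothetical.

```python
from datetime import datetime


def select_updated_records(records, last_import_at, updated_at_field="updated_at"):
    """Return the records whose update timestamp is later than the last import."""
    return [r for r in records if r[updated_at_field] > last_import_at]


# Hypothetical source records, each carrying an "updated_at" timestamp.
source_records = [
    {"id": 1, "email": "a@example.com", "updated_at": datetime(2023, 1, 1, 9, 0)},
    {"id": 2, "email": "b@example.com", "updated_at": datetime(2023, 1, 2, 14, 30)},
    {"id": 3, "email": "c@example.com", "updated_at": datetime(2023, 1, 3, 8, 15)},
]

# Hypothetical timestamp of the previous import.
last_import_at = datetime(2023, 1, 2, 0, 0)

to_import = select_updated_records(source_records, last_import_at)
print([r["id"] for r in to_import])
```

The sketch also shows why the timestamp must be reliable: a record that changes without its “Updated at” value changing would be skipped by this filter.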
Step 5 - Map the fields with your dataset + Define the dedupe key
By default, your dataset columns have the same names as the data source columns; you can rename them in this step.
In the “Advanced settings” menu, you can define the “Dedupe” key. It can be the main ID of your data source records, an “Email” column, or a combination such as “Firstname” x “Lastname” x “Postal code”.
When two imported records have the same “Dedupe” key, they are merged as they are imported into the dataset.
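As a rough illustration of merging on a dedupe key, here is a minimal Python sketch. It rests on assumptions that this guide does not confirm: key values are compared after trimming and lowercasing, and when two records collide, the later non-empty value wins. The functions `dedupe_key` and `merge_records` and the sample records are hypothetical, and Octolis's actual merge rules may differ.

```python
def dedupe_key(record, key_fields):
    """Build a comparable key from the chosen fields.

    Assumption: values are trimmed and lowercased before comparison.
    """
    return tuple(str(record.get(f, "")).strip().lower() for f in key_fields)


def merge_records(records, key_fields):
    """Merge records that share the same dedupe key.

    Assumption: a later non-empty value overwrites an earlier one.
    """
    merged = {}
    for record in records:
        target = merged.setdefault(dedupe_key(record, key_fields), {})
        for field, value in record.items():
            if value not in (None, ""):
                target[field] = value
    return list(merged.values())


# Hypothetical incoming records; the first two share the same email.
incoming = [
    {"email": "Ana@Example.com", "firstname": "Ana", "postal_code": ""},
    {"email": "ana@example.com", "firstname": "", "postal_code": "75001"},
    {"email": "bob@example.com", "firstname": "Bob", "postal_code": "69002"},
]

result = merge_records(incoming, ["email"])
print(result)
```

A combined key works the same way in this sketch: passing `["firstname", "lastname", "postal_code"]` as `key_fields` merges only the records that agree on all three fields.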