Get Started

Prerequisites

You need Python >= 3.10.
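
If you are unsure which version you have, you can check with:

python3 --version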

Install SOIL

pip install soil-sdk

Depending on your system settings, you might need to run the above with root rights and replace pip with pip3, e.g. sudo pip3 install soil-sdk
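
Alternatively, to avoid installing with root rights, you can use a standard Python virtual environment (plain Python tooling, not SOIL-specific):

python3 -m venv .venv
source .venv/bin/activate
pip install soil-sdk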

Generate your first project

soil init your-project-name

This will generate a folder with the boilerplate elements and some example code.

[Image: structure of the generated example-app project]

Configure SOIL

In the new project, open the soil.conf file and change the contents to something like the following, replacing auth_api_key and auth_app_id with the credentials provided to you:

{
  "auth_api_key": "xxxxxxxxxxxxxxxxxxxxxxxxx",
  "auth_app_id": "yyyyyyyyyyyyyyyyyyyyyyyyyy",
  "auth_url": "https://auth.amalfianalytics.com",
  "soil_url": "https://soil.amalfianalytics.com/api"
}

Log in to the platform

soil login

This will ask you for your Amalfi credentials and store a token in the $HOME/.soil folder. The token expires after one month.

SOIL App Concepts

A SOIL application has three main concepts:

  • Scripts: the entry points to the application. They run outside the SOIL platform (e.g. on your machine) but depend on it to execute the pipelines they define.
  • Modules: they run inside the SOIL platform and contain the instructions to transform the data. They are decorated with @modulify and can be written by you or imported from the SOIL library (a minimal sketch follows below).
  • Data structures: they contain the data, the metadata, and instructions on how to serialize, deserialize, and query it.

A SOIL application consists of a set of scripts that upload or query some data, transform it and store it again.
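
For illustration, a user-written module might look like the following minimal sketch. The @modulify decorator comes from SOIL, but the exact import path, the way the input is accessed, and the convention of returning a list of outputs are assumptions here, inferred from the pipeline example below:

# Hypothetical module sketch; import path and data access are assumptions.
from soil import modulify

@modulify
def column_mean(data, aggregation_column=None):
    # Treat the input as an iterable of dict-like rows (an assumption)
    values = [row[aggregation_column] for row in data]
    # Return a list of outputs, matching the unpacking used in scripts
    return [{'mean': sum(values) / len(values)}]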

Running pipelines

A script contains one or more pipelines, which look like this:

import soil
from soil.modules.preprocessing.filters import row_filter
from soil.modules.simple_module import simple_mean

# Reference a dataset that is already stored in SOIL
patients = soil.data('my_dataset')
# Modules return their outputs as a sequence, hence the unpacking
women, = row_filter.RowFilter(patients, sex={'eql': '1'})
statistics, = simple_mean(women, aggregation_column='age')
print(statistics.data)  # triggers the pipeline run
# { 'mean': 54 }
print(statistics.metadata)

Pipelines are lazily evaluated: they do not run until the data is needed. In the example, the pipeline won't run until the line print(statistics.data). This way the data transfer is minimized. The calls on a data structure that trigger a pipeline run are ds.data, ds.metadata and ds.get_data(**kwargs). The pipeline runs only on the first of these calls, and that call is blocking: here print(statistics.data) blocks, but the subsequent print(statistics.metadata) does not.
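
Expressed as code, the trigger behavior in the example is:

statistics.data        # first trigger call: runs the pipeline and blocks
statistics.metadata    # no new pipeline run; the result is already computed
statistics.get_data()  # also a trigger call; it can receive query kwargs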

Intermediate results are not stored. If we later access women.data, for example, the partial pipeline runs again, even though the intermediate result was already computed on the way to statistics.data.
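
Continuing the example:

print(statistics.data)  # runs the full pipeline
print(women.data)       # re-runs the row_filter step; its earlier result
                        # was not stored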