Get Started

Prerequisites

You need Python >= 3.7.
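You can check which version is installed with:

python3 --version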

Install SOIL

pip install soil-sdk

Depending on your system settings you might need to run the above with root rights and replace pip with pip3, e.g. sudo pip3 install soil-sdk.
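Alternatively, installing inside a virtual environment avoids the need for root rights. A minimal sketch using Python's built-in venv module:

python3 -m venv .venv
source .venv/bin/activate
pip install soil-sdk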

Configure soil

soil configure

This will ask for your credentials. You will need an application id and an API key provided by the admin. Usually, you should leave the default URL values. If you enter something incorrectly, run soil configure --reset and reenter the required information.

An example configuration file should look as follows:

auth_api_key: ###################
auth_app_id: ###################
auth_url: https://auth.amalfianalytics.com
soil_url: https://dev.soil.amalfianalytics.com/

Log in to the platform

soil login

This will ask for your Amalfi credentials and store a token in the folder $HOME/.soil. Note that you must run soil login again after making any changes in the steps above.

Generate your first project

soil init your-project-name

This will generate a folder with the boilerplate elements and some example code.

(Screenshot: the generated example-app folder.)
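Change into the newly created folder to start working on the project:

cd your-project-name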

SOIL App Concepts

A SOIL application is built around three main concepts:

  • Scripts: they run outside the SOIL platform and are the entry points to the application.
  • Modules: they run inside the SOIL platform and contain the instructions to transform the data. They are decorated with @modulify and can be written from scratch or imported from the SOIL library (see the sketch below).
  • Data structures: they contain the data, the metadata, and the instructions on how to serialize, deserialize and query it.

A SOIL application consists of a set of scripts that upload or query some data, transform it, and store it again.
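As an illustration, a module like the simple_mean used later in this guide might look like the following sketch. This is not the library's actual implementation: the @modulify decorator comes from this guide, but the import path, using the decorator without arguments, and the assumption that the input iterates over dict-like rows are all hypothetical.

from soil import modulify  # assumed import path for the decorator

@modulify
def simple_mean(data, aggregation_column=None):
    # Hypothetical sketch: assumes `data` iterates over dict-like rows.
    values = [row[aggregation_column] for row in data]
    # Modules return a tuple of outputs, which is why callers unpack
    # results as `statistics, = simple_mean(...)`.
    return ({'mean': sum(values) / len(values)},)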

Running pipelines

A script contains one or more pipelines, which look like this:

import soil
from soil.modules.preprocessing.filters import row_filter
from soil.modules.simple_module import simple_mean

# Reference a dataset already stored in SOIL (no data is transferred yet).
patients = soil.data('my_dataset')
# Modules return tuples of results, hence the unpacking commas.
women, = row_filter.RowFilter(patients, sex={'eql': '1'})
statistics, = simple_mean(women, aggregation_column='age')
# Accessing .data triggers the pipeline run.
print(statistics.data)
# { 'mean': 54 }
print(statistics.metadata)

Pipelines are lazily evaluated: they do not run until the data is needed. In the example, the pipeline won't run until the line print(statistics.data). This way the data transfer is minimized. The calls on a data structure that trigger a pipeline run are ds.data, ds.metadata and ds.get_data(**kwargs). The pipeline runs only on the first of these calls, and that call is blocking: print(statistics.data) blocks, but the subsequent print(statistics.metadata) does not because the pipeline has already run.

Intermediate results are not stored: if we later access women.data, for example, the partial pipeline will run again, even if it was already computed as part of getting statistics.data.
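For instance, continuing the example above:

print(statistics.data)  # runs the full pipeline (blocking)
print(women.data)       # runs the partial pipeline again; the intermediate result was not kept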