The configuration file defines the resources available for Auger and the parameters of your model.
Defining Clusters for Your Project
Running an experiment requires that you configure your Auger environment with the details of your cluster size and dataset. When defining your cluster size, make sure that your instance type is large enough to support your dataset. The relevant options are:
- instance_type - the instance size for the cluster you want to train on.
- worker_nodes_count - the number of worker nodes, with a minimum of 2. The more workers deployed, the more jobs can be run in parallel.
- autoterminate_minutes - how many minutes Auger will let your cluster run without any evaluations before it automatically shuts down.
    cluster:
      worker_nodes_count: 2
      instance_type: c5.2xlarge
      autoterminate_minutes: 120
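Before starting a cluster, it can help to sanity-check these values. The sketch below is a hypothetical helper, not part of Auger: `validate_cluster` is an illustrative name, and it works on a plain dictionary holding the `cluster` section. Only the minimum of 2 workers comes from the description above; the other two checks are assumptions made for the example.

```python
def validate_cluster(cluster):
    """Return a list of problems found in a `cluster` config mapping."""
    problems = []

    # Documented rule: at least 2 worker nodes.
    workers = cluster.get("worker_nodes_count")
    if not isinstance(workers, int) or workers < 2:
        problems.append("worker_nodes_count must be an integer of at least 2")

    # Assumption for this sketch: an instance type must be named explicitly.
    if not cluster.get("instance_type"):
        problems.append("instance_type must be set")

    # Assumption for this sketch: the idle timeout is a positive whole number.
    minutes = cluster.get("autoterminate_minutes")
    if not isinstance(minutes, int) or minutes <= 0:
        problems.append("autoterminate_minutes must be a positive integer")

    return problems


cluster = {
    "worker_nodes_count": 2,
    "instance_type": "c5.2xlarge",
    "autoterminate_minutes": 120,
}
print(validate_cluster(cluster))
```

An empty list means that none of the checks failed. Whether the instance type is large enough for your dataset still has to be judged separately.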
Defining Projects and Experiments
You can optionally set the project and experiment that you would like to run evaluations on. If you leave them unset, a new project and experiment will be created for you.
You can also define the organization to run experiments on. If left unset, your organization will default to the first one available for your user.
    organization: yourorg
    project: projectlive5
    experiment: iris_api_test
The following options describe your model: where the data is stored, its features and its target.
- data_path - the remote URL of your dataset. You can use a fully qualified public URL. If you have already uploaded a file to your project, you can use its relative path.
- data_extension - the extension of your dataset file.
- data_compression - the compression type for your dataset if any.
- feature_columns - the column names that you want to use as features for training.
- target_feature - the column you want to predict.
An example YAML for this appears below:
    evaluation_options:
      data_path: files/iris_data_sample.csv
      # data_path: https://www.openml.org/data/download/8/dataset_8_liver-disorders.arff
      # data_extension: ".csv"
      # data_compression: gzip
      feature_columns:
        - sepal_length
        - sepal_width
        - petal_length
        - petal_width
      target_feature: class
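A mistyped column name in feature_columns or target_feature is easy to miss. As a rough local check, the sketch below compares the configured names with the header row of a CSV file. The helper `missing_columns` and the two-row sample are hypothetical and only illustrate the idea; they are not part of Auger.

```python
import csv
import io


def missing_columns(csv_text, feature_columns, target_feature):
    """Return the configured column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    wanted = list(feature_columns) + [target_feature]
    return [name for name in wanted if name not in header]


# Hypothetical sample that reuses the iris column names from the YAML above.
sample = (
    "sepal_length,sepal_width,petal_length,petal_width,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
)
options = {
    "feature_columns": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
    "target_feature": "class",
}
print(missing_columns(sample, options["feature_columns"], options["target_feature"]))
```

An empty result means every configured column appears in the header. For a real dataset you would read the first line of your own file instead of the inline sample.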
These options control whether you are performing regression or classification, how features are encoded, and which metric is optimized.
- classification - set to True for classification; if False, regression is assumed.
- binary_classification - set to True if you are running classification and your target has 2 classes.
- categorical_features - the list of features you want to treat as categorical. One-hot encoding will be used.
- label_encoding_features - the list of features you want to label encode.
- time_series_features - if your dataset is a time series, include at least one time series feature. This is a field with a date datatype.
- scoring - the scoring function your evaluations will optimize for. See the Auger Metrics documentation for definitions of these metrics and details on which scoring functions are available with which model types.
    categorical_features:
      - class
    datetime_features:
    time_series_features:
    classification: true
    binary_classification: false
    scoring: accuracy
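The rule for binary_classification can be written as a one-line check: it applies only when classification is enabled and the target has exactly two distinct classes. The function name and the sample target values below are illustrative assumptions, not part of Auger or taken from the sample dataset.

```python
def binary_classification_flag(classification, target_values):
    """True only when running classification on a target with exactly 2 classes."""
    return bool(classification) and len(set(target_values)) == 2


# Illustrative target column with three distinct classes.
iris_target = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
print(binary_classification_flag(True, iris_target))

# Illustrative target column with two distinct classes.
print(binary_classification_flag(True, ["yes", "no", "yes"]))
```

The first call returns False because the target has three classes, and the second returns True.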
The following options control the time, number of trials and other aspects of your evaluations.
- cross_validation_folds - the number of folds used to evaluate your trained models.
- max_total_time_mins - the total time, in minutes, that your evaluations will run before stopping.
- max_eval_time_mins - the maximum time, in minutes, that each evaluation will run before stopping.
- max_n_trials - the total number of trials run before stopping.
- use_ensemble - whether or not to run ensembling in evaluations.
    cross_validation_folds: 5
    max_total_time_mins: 60
    max_eval_time_mins: 1
    max_n_trials: 10
    use_ensemble: true
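When choosing these limits, a back-of-the-envelope estimate can show which one is likely to stop the run first. The sketch below assumes, purely for illustration, that trials run one after another and that each trial uses its full max_eval_time_mins. It ignores parallel workers, cross-validation folds and ensembling, so it does not describe how Auger actually schedules work. `binding_limit` is a hypothetical helper.

```python
def binding_limit(max_total_time_mins, max_eval_time_mins, max_n_trials):
    """Guess which limit is reached first under a sequential worst case.

    Returns the name of the limit and the estimated run time in minutes.
    """
    # Worst case for this sketch: every trial uses its full time allowance.
    worst_case_mins = max_eval_time_mins * max_n_trials
    if worst_case_mins <= max_total_time_mins:
        return "max_n_trials", worst_case_mins
    return "max_total_time_mins", max_total_time_mins


print(binding_limit(60, 1, 10))
```

With the values from the example, the ten trials need at most 10 minutes under these assumptions, so the trial limit would be reached long before the 60-minute total.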
Example Configuration File
A complete example configuration file for Auger can be found here.