Overview
Although model training applications can be complex, they typically adhere to a simple structure consisting of two main components: the training code and the user interface. The UI is a collection of Supervisely widgets that allow the user to pick data, adjust training parameters, and watch the training process. More information about widgets can be found in this section.
The implementation of the training process varies significantly with the model architecture. The code of existing Supervisely applications illustrates several different approaches to model training: HRDA, YOLOv8, MMDetection.
The sections below examine some crucial components of such applications using the following code, and show practical use cases of SDK methods and functions that help you build robust, functional applications.
Use cases and best practices
Base UI
To incorporate widgets into the application, wrap them within a supervisely.app.widgets.Container object and pass it as the layout parameter to the sly.Application constructor.
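A minimal sketch of this wiring is shown below. The choice of widgets (Text and Button), their labels, and the helper name build_app are illustrative assumptions, not taken from the application code; the SDK is imported inside the function so the snippet can be read and loaded without it installed.

```python
def build_app():
    """Build a tiny layout and hand it to the application (illustrative only)."""
    # Lazy imports: the Supervisely SDK is only needed when the function is called.
    import supervisely as sly
    from supervisely.app.widgets import Button, Container, Text

    # Hypothetical widgets standing in for the real data/parameter/progress UI.
    info = Text(text="Pick data and parameters, then start training.")
    start_button = Button(text="Start training")

    # All top-level widgets are wrapped in a single Container ...
    layout = Container(widgets=[info, start_button])
    # ... which is passed as the layout parameter of the application.
    app = sly.Application(layout=layout)
    return app, start_button
```

In a real application the returned widgets would be connected to callbacks (for example, a click handler on the start button) before the app begins serving.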
Checkpoints
Description
Throughout the training process, multiple checkpoints of the model are commonly saved: the best-performing checkpoint, the most recent checkpoint, or several of the most recent checkpoints. Upon successful training completion, all relevant files are transferred to Team Files. If training is interrupted by unforeseen circumstances, such as GPU memory limitations or server issues, it should be possible to resume training from previously generated checkpoints.
Tools
While developers are responsible for properly storing training artifacts in Team Files, interim files can be temporarily stored in the sly.app.get_synced_data_dir() directory. This path within the Docker container is linked to a corresponding host folder within the Agent Docker container. Consequently, all checkpoints remain accessible throughout the training process at Team Files/Supervisely Agents/{agent_name}/app_data.
Files saved to this directory are purged automatically when the program completes successfully. The agent keeps them only if training ends with an error; in that case they can be removed later through the Agent panel in Team Clusters.
The value of sly.app.get_synced_data_dir() is determined by the SLY_APP_DATA_DIR environment variable. In PART 2 of the code, we use distinct SLY_APP_DATA_DIR paths for production and local debugging scenarios.
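One way to switch the variable between the two scenarios is sketched below. Both directory values and the helper name select_data_dir are hypothetical placeholders; the actual paths are the ones defined in PART 2 of the application code.

```python
import os

# Hypothetical placeholder paths, for illustration only.
PRODUCTION_DATA_DIR = "/path/in/container/app_data"
LOCAL_DEBUG_DATA_DIR = "./debug/app_data"


def select_data_dir(is_production: bool, env=os.environ) -> str:
    """Export SLY_APP_DATA_DIR for the current scenario and return its value."""
    path = PRODUCTION_DATA_DIR if is_production else LOCAL_DEBUG_DATA_DIR
    # The synced data dir is resolved from this environment variable.
    env["SLY_APP_DATA_DIR"] = path
    return path


# Demo against a plain dict so the real process environment stays untouched.
demo_env = {}
debug_path = select_data_dir(is_production=False, env=demo_env)
```

Passing the environment as a parameter keeps the helper easy to test; in the app itself the default os.environ would be used.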
Example
The train_process() function uses sly.app.get_synced_data_dir() to save checkpoints for each epoch, as well as the final checkpoint. In PART 4, upon the completion or graceful cancellation (using graceful=True) of the training process, the script starts uploading artifacts to Team Files. The api.task.set_output_directory() function is used to display the training path in the output after training concludes.
Additionally, it is advisable to clear the checkpoints directory with sly.fs.remove_dir(checkpoints_dir) before shutting down to avoid data leakage, even though the agent is likely to clean up automatically.
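The checkpoint lifecycle described in this section (save per epoch, locate the latest checkpoint for a resume, clear the directory at the end) can be sketched with the standard library alone. This is an illustration under assumptions: a temporary directory stands in for sly.app.get_synced_data_dir(), shutil.rmtree plays the role of sly.fs.remove_dir, and the JSON file format and helper names are hypothetical.

```python
import json
import shutil
import tempfile
from pathlib import Path


def save_checkpoint(checkpoints_dir: Path, epoch: int, state: dict) -> Path:
    """Write one checkpoint file per epoch (zero-padded so names sort by epoch)."""
    checkpoints_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoints_dir / f"epoch_{epoch:04d}.json"
    path.write_text(json.dumps(state))
    return path


def latest_checkpoint(checkpoints_dir: Path):
    """Return the newest checkpoint path, or None if nothing was saved yet."""
    files = sorted(checkpoints_dir.glob("epoch_*.json"))
    return files[-1] if files else None


# Stand-in for the synced data directory provided by the SDK.
data_dir = Path(tempfile.mkdtemp())
checkpoints_dir = data_dir / "checkpoints"

for epoch in range(1, 4):
    save_checkpoint(checkpoints_dir, epoch, {"epoch": epoch, "loss": 1.0 / epoch})

# After an interruption, training could restart from this state.
resume_from = latest_checkpoint(checkpoints_dir)
resume_state = json.loads(resume_from.read_text())

# Once artifacts are uploaded, remove the local copies explicitly.
shutil.rmtree(checkpoints_dir)
```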
Application stop
Description
A Supervisely application can be stopped in three ways: by terminating the running Docker container via an agent request (a force stop), by terminating the internal process within the container through a request to the sly.Application object, or by letting the app complete its tasks and then terminate (the app.stop() command in PART 4). The following guidelines must be followed for the stopping process to be error-free.
Non-main threads should either be explicitly terminated on stop or configured to run in daemon mode.
The execution of training and validation cycles must be interruptible at batch boundaries.
app.stop() must always be called at the end.
Disregarding these guidelines may lead to scenarios where the application fails to terminate upon request or generates erroneous logs, despite the application's logic being sound.
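The first guideline can be illustrated with plain Python threading. This is a generic sketch, not code from the application: the heartbeat worker and the variable names are hypothetical.

```python
import threading
import time

stop_event = threading.Event()
ticks = []


def heartbeat():
    """Background worker that exits as soon as stop_event is set."""
    # Event.wait() returns True once the event is set, which ends the loop.
    while not stop_event.wait(timeout=0.01):
        ticks.append(time.monotonic())


# Option 1: explicit termination. Signal the thread and wait for it to finish.
worker = threading.Thread(target=heartbeat)
worker.start()
time.sleep(0.05)
stop_event.set()
worker.join(timeout=1.0)

# Option 2: daemon mode. The interpreter does not wait for this thread on exit.
background = threading.Thread(target=time.sleep, args=(0.01,), daemon=True)
background.start()
```

Explicit termination is preferable when the thread holds resources that need an orderly release; daemon mode suits fire-and-forget helpers.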
Tools
To ensure the application stops smoothly, consider using the following members of the app = sly.Application() object:
app.is_stopped() method returns True if the application has received a signal to stop;
app.call_before_shutdown(func) method adds a callback that will be called before the application terminates;
app.StopException error is used to exit from a training function;
app.handle_stop(graceful=True) context manager allows you to run functions while suppressing the app.StopException error. If graceful=True, the rest of the code inside the context manager is skipped when the error is raised, and the code after it continues to run until the application is stopped with the app.stop() method. Otherwise, the application is stopped immediately after the function exits.
Example
The training function is started within the app.handle_stop(graceful=True) context manager. Therefore, if a STOP task is received, the train_process() function will be interrupted after batch processing. However, the shutdown process will be on hold until app.stop() is called. This setup ensures that our application successfully uploads all saved checkpoints to Team Files before shutting down.
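The control flow of this example can be sketched as follows. The functions take the app object as a parameter, so the snippet runs without the SDK; FakeApp is a hypothetical test double that models only the graceful=True behaviour described above, and the upload step is reduced to a log entry.

```python
from contextlib import contextmanager


def train_process(app, num_epochs, batches_per_epoch, log):
    for epoch in range(num_epochs):
        for batch in range(batches_per_epoch):
            # Batch boundary: the only point where training may be interrupted.
            if app.is_stopped():
                raise app.StopException("stop requested")
            log.append(("batch", epoch, batch))


def run(app, log):
    with app.handle_stop(graceful=True):
        train_process(app, num_epochs=2, batches_per_epoch=3, log=log)
    # Reached after normal completion and after a graceful stop alike.
    log.append(("upload",))  # placeholder for uploading artifacts to Team Files
    app.stop()


class FakeApp:
    """Test double, not the SDK class; models only the graceful=True path."""

    class StopException(Exception):
        pass

    def __init__(self, stop_after_checks):
        self._checks_left = stop_after_checks
        self.stopped = False

    def is_stopped(self):
        # Report a stop request once the allowed number of checks is used up.
        self._checks_left -= 1
        return self._checks_left < 0

    @contextmanager
    def handle_stop(self, graceful=True):
        try:
            yield
        except self.StopException:
            pass  # suppress the error so the code after the block still runs

    def stop(self):
        self.stopped = True


log = []
fake_app = FakeApp(stop_after_checks=4)
run(fake_app, log)
```

With the real sly.Application object, run(app, log) would follow the same path: training stops at the next batch boundary, the upload step still executes, and app.stop() is called last.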