Iterate over a project

In this article, we will learn how to iterate through a project with annotated data in Python. It is one of the most frequent operations in Supervisely Apps and Python automation scripts.

Dataset Types

In Supervisely, datasets can be organized in two ways: flat or nested. Understanding these structures can help you with efficient data organization and management.

Flat Dataset Structure

A flat dataset is the simplest form of organization where all images and their annotations are stored in a single level. This structure is good for simple projects with straightforward organization.

You can add this dataset to your team via Supervisely Ecosystem - ⬇️Lemons (Annotated)

Example structure:

📦 Lemons (Annotated)           ← The project
 ┣ 📂 ds1                       ← The dataset
 ┃ ┣ 📂 ann                     ← Folder for annotations
 ┃ ┃ ┣ 📜 IMG_0748.jpeg.json    ← Annotation for image 0748
 ┃ ┃ ┣ 📜 IMG_1836.jpeg.json    ← Annotation for image 1836
 ┃ ┃ ┣ 📜 IMG_2084.jpeg.json
 ┃ ┃ ┣ 📜 IMG_3861.jpeg.json
 ┃ ┃ ┣ 📜 IMG_4451.jpeg.json
 ┃ ┃ ┗ 📜 IMG_8144.jpeg.json
 ┃ ┣ 📂 img                     ← Folder for images
 ┃ ┃ ┣ 🖼️ IMG_0748.jpeg         ← Image 0748
 ┃ ┃ ┣ 🖼️ IMG_1836.jpeg         ← Image 1836
 ┃ ┃ ┣ 🖼️ IMG_2084.jpeg
 ┃ ┃ ┣ 🖼️ IMG_3861.jpeg
 ┃ ┃ ┣ 🖼️ IMG_4451.jpeg
 ┃ ┃ ┗ 🖼️ IMG_8144.jpeg
 ┣ 📜 meta.json                 ← Project metadata
 ┗ 📜 README.md                 ← Optional readme file
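
As a minimal sketch, a flat dataset stored on disk in the layout above can be walked with the Python standard library alone. The helper name `iter_flat_dataset` is hypothetical, and the code assumes each annotation is stored as `<image file name>.json` in the `ann` folder, as in the tree.

```python
from pathlib import Path


def iter_flat_dataset(dataset_dir):
    """Yield (image_path, annotation_path) pairs for one flat dataset folder.

    Assumes images live in `img/`, annotations in `ann/`, and each
    annotation file is named `<image file name>.json`.
    """
    dataset_dir = Path(dataset_dir)
    for image_path in sorted((dataset_dir / "img").iterdir()):
        ann_path = dataset_dir / "ann" / (image_path.name + ".json")
        # Images without an annotation file are paired with None.
        yield image_path, (ann_path if ann_path.exists() else None)
```

For example, `list(iter_flat_dataset("Lemons (Annotated)/ds1"))` would pair every file in `img` with its JSON file in `ann`, provided the project folder exists locally.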

Nested Dataset Structure

A nested dataset structure is more advanced. It lets you create datasets inside other datasets, forming a tree-like hierarchy for your data. Nested datasets suit complex projects that require hierarchical organization or need related data grouped together.

You can add this dataset to your team via Supervisely Ecosystem - ⬇️Fruits (Annotated)

Important Note about Nested Datasets:

When working with nested datasets, keep in mind:

  • Parent datasets (like "Temperate" or "Tropical") may or may not hold images of their own; either way, their nested datasets can contain images

  • To get all images of a parent dataset, including those in nested datasets, you need to iterate through each nested dataset

Example structure:

  • The main datasets ("Temperate" and "Tropical") don't hold images or annotations directly in ann and img folders.

  • Instead, they have a datasets folder containing nested datasets (like "Apple", "Banana", etc.), and those hold the images and annotations.

  • The main datasets can also contain images, but we removed them for this example
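
The nested layout described above can be traversed recursively. The sketch below is an illustration for a project stored on disk, assuming nested datasets sit in a `datasets` subfolder of their parent and images sit in `img`; the function names `iter_dataset_tree` and `count_images` are hypothetical.

```python
from pathlib import Path


def iter_dataset_tree(dataset_dir, parents=()):
    """Yield (parents, dataset_dir) for a dataset and all nested datasets.

    `parents` is a tuple with the names of the enclosing datasets.
    Assumes nested datasets are stored in a `datasets/` subfolder.
    """
    dataset_dir = Path(dataset_dir)
    yield parents, dataset_dir
    nested_root = dataset_dir / "datasets"
    if nested_root.is_dir():
        for child in sorted(nested_root.iterdir()):
            if child.is_dir():
                yield from iter_dataset_tree(child, parents + (dataset_dir.name,))


def count_images(dataset_dir):
    """Count images in a dataset, including those in nested datasets."""
    total = 0
    for _, ds in iter_dataset_tree(dataset_dir):
        img_dir = ds / "img"
        # A parent dataset may have no `img` folder of its own.
        if img_dir.is_dir():
            total += sum(1 for p in img_dir.iterdir() if p.is_file())
    return total
```

Counting through the tree rather than looking only at the parent's own `img` folder is what makes the total include images held by nested datasets.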

Step-by-Step Guide

Everything you need to reproduce this tutorial is on GitHub: source code, Visual Studio Code configuration, and a shell script for creating a venv.

In this guide we will go through the following steps:

Step 1. Get a demo project: either the one with labeled lemons and kiwis or the fruits project with nested datasets.

Step 2. Prepare .env files with your credentials and the ID of the demo project.

Step 3. Run the Python script.

Step 4. Show possible optimizations.

1. Demo project

If you don't have any projects yet, go to the ecosystem and add the demo project 🍋 Lemons (Annotated) or 🍍 Fruits Nested (Annotated) to your current workspace.

Add demo project "Lemons (Annotated)" to your workspace

2. .env files

Create a file at ~/supervisely.env with the credentials for your Supervisely account. Learn more about environment variables here. The content should look like this:

Create a second file, local.env, and place it in the same directory as main.py. This file will contain the values used in the Python script.
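
Before running the script, it can help to confirm that the expected variables are visible to Python once the .env files are loaded. The sketch below rests on an assumption: the names `SERVER_ADDRESS`, `API_TOKEN`, and `PROJECT_ID` are the ones the Supervisely SDK conventionally reads, so verify them against the environment variables guide. `missing_env_vars` is a hypothetical helper, not part of the SDK.

```python
import os

# Assumed variable names; check them against the Supervisely documentation.
REQUIRED_VARS = ("SERVER_ADDRESS", "API_TOKEN", "PROJECT_ID")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment; an empty list means every assumed variable is set.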

3. Python script

This script illustrates only the basics. If your project is huge, with hundreds of thousands of images, downloading annotations one by one is inefficient. It is better to use batch (bulk) methods to reduce the number of API requests and significantly speed up your code. Learn more in the optimizations section below.

To start debugging you need to:

  1. Clone the repo

  2. Create a venv by running the script create_venv.sh

  3. Change the value in local.env

  4. Check that you have a ~/supervisely.env file with correct values

Source code
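
A minimal sketch of the iteration this guide describes, assuming `api` is an authenticated `sly.Api` instance (for example, created with `sly.Api.from_env()` after loading the .env files). The function name `iterate_project` is hypothetical, and objects are counted from the `objects` list of the raw annotation JSON rather than through SDK annotation classes.

```python
def iterate_project(api, project_id):
    """Print the number of labeled objects on every image of a project.

    `api` is assumed to be an authenticated `sly.Api` instance.
    Returns a dict {image id: number of labeled objects}.
    """
    project = api.project.get_info_by_id(project_id)
    if project is None:
        raise KeyError(f"Project with ID {project_id} not found in your account")
    print(f"Project info: {project.name} (id={project.id})")

    # recursive=True also returns nested datasets, not only top-level ones.
    datasets = api.dataset.get_list(project.id, recursive=True)
    print(f"There are {len(datasets)} datasets in project")

    counts = {}
    for dataset in datasets:
        print(f"Dataset {dataset.name} has {dataset.items_count} images")
        for image in api.image.get_list(dataset.id):
            # One API request per image: simple, but slow on large projects.
            ann_json = api.annotation.download_json(image.id)
            counts[image.id] = len(ann_json.get("objects", []))
            print(f"There are {counts[image.id]} objects on image {image.name}")
    return counts
```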

If you are working with nested datasets and want to get the full path to each dataset, you can use the api.dataset.tree method instead of api.dataset.get_list. It returns a generator that yields tuples (parents, dataset), where parents is a list of parent dataset names and dataset is a dataset object.

Your for loop will look like this:
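
A minimal sketch of such a loop, assuming `api` is an authenticated `sly.Api` instance and that `api.dataset.tree` yields `(parents, dataset)` tuples as described above. The function name `list_dataset_paths` and the `/` separator are illustrative choices.

```python
def list_dataset_paths(api, project_id):
    """Return the full path of every dataset, e.g. "Parent/Child"."""
    paths = []
    # Each item is (parents, dataset): parent dataset names plus the dataset info.
    for parents, dataset in api.dataset.tree(project_id):
        dataset_path = "/".join(list(parents) + [dataset.name])
        print(f"Dataset {dataset_path} has {dataset.items_count} images")
        paths.append(dataset_path)
    return paths
```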

Output

The script above produces the following output for Lemons (Annotated) project:

The script above produces the following output for Fruits Nested (Annotated) project:

4. Optimizations

The bottleneck of this script is in lines 27-28, where the annotation for each image is downloaded with a separate request:

If you have 1M images in your project, your code will send 🟡 1M requests to download annotations. This is inefficient because of the round-trip time (RTT) of each request and the large number of similar tiny requests to the Supervisely database.

It can be optimized by using the batch API method:

The Supervisely API allows downloading annotations for multiple images in a single request. The code sample below sends ✅ 50x fewer requests, which leads to a significant speed-up of our original code:
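
A minimal sketch of the batched approach, assuming `api` is an authenticated `sly.Api` instance and `images` is the list returned by `api.image.get_list`. The helpers `batched` and `download_annotations_batched` are hypothetical names written with the standard library; the batch size of 50 mirrors the 50x figure above.

```python
def batched(items, batch_size=50):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def download_annotations_batched(api, dataset_id, images, batch_size=50):
    """Download annotation JSONs with one request per batch of images.

    Returns a dict {image id: annotation JSON}.
    """
    ann_by_image_id = {}
    for batch in batched(images, batch_size):
        image_ids = [image.id for image in batch]
        # A single request returns annotations for the whole batch; they are
        # assumed to come back in the same order as image_ids.
        ann_jsons = api.annotation.download_json_batch(dataset_id, image_ids)
        for image, ann_json in zip(batch, ann_jsons):
            ann_by_image_id[image.id] = ann_json
    return ann_by_image_id
```

With 1M images and a batch size of 50, this loop issues 20,000 annotation requests instead of 1M.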

The optimized version of the original script is in main_optimized.py.
