Finding directories with specific markers

A tutorial on how to find directories with specific markers (filename) and check them for some conditions using `sly.fs.dirs_with_marker()` function.

Introduction

First of all, let's talk about the use cases of this function. When working with import apps, there are often cases where we need a specific structure for input data to ensure the application functions correctly. Frequently, it's necessary to identify a specific file marker that serves as a reference point for determining the rest of the structure. However, user-provided data can have varying structures. For instance, the target directory might not be in the root directory but nested within several other directories, or the data may contain not just one but multiple suitable directories. For example, we're trying to find a directory with a config.json file in it. The directory structure might look like this:

📂 input_dir
┣ 📂 nested_dir_1
┃ ┣ 📂 nested_dir_2
┃ ┃ ┣ 📂 nested_dir_3
┃ ┃ ┃ ┣ 📄 config.json
┃ ┃ ┃ ┗ 📄 data.csv
┣ 📂 nested_dir_4
┃ ┣ 📄 config.json
┃ ┗ 📄 data.csv

So, in this case, we need to find two directories: nested_dir_3 and nested_dir_4. It's not a problem to find them, but why implement the same logic every time? It's much easier to use a function that will do it for us everywhere we need it. And it's just a part of the problem, we're trying to solve here. Same as finding directories, we usually need to check them for some conditions. Because the file we've found maybe incorrect and a user may just forget to delete it, or the directory may not contain other needed files. If we try to work with the data from this directory, we may have an error or, much worse, incorrect results. Of course, we can write the function that will check the directory for us, and pass the directories to it one by one. Well, while using sly.fs.dirs_with_marker() we still need to have this function, if we need to check the directory for some conditions, but we can make this process more convenient and the code more readable and clear.

Function signature

sly.fs.dirs_with_marker(
    input_path: str,
    markers: Union[str, List[str]],
    check_function: Optional[Callable] = None,
    ignore_case: Optional[bool] = False,
) -> Generator[str, None, None]:

Parameters

Data Example

We prepared a short Python script, that will unpack an archive (as an example of input data from a user) and find directories with config.json files in them. Then it will check if they're valid. Conditions for checking are the following:

  • The directory must contain config.json file.

  • The config.json file must have a key valid, and its value must be true.

  • The directory must contain two other subdirectories: images and anns.

Example archive structure:

📦extracted
 ┗ 📂input_dir
 ┃ ┣ 📂subdir01
 ┃ ┃ ┣ 📂subdir11
 ┃ ┃ ┃ ┣ 📂anns
 ┃ ┃ ┃ ┣ 📂images
 ┃ ┃ ┃ ┗ 📄config.json
 ┃ ┃ ┗ 📂subdir12
 ┃ ┃ ┃ ┣ 📂anns
 ┃ ┃ ┃ ┣ 📂images
 ┃ ┃ ┃ ┗ 📄config.json
 ┃ ┗ 📂subdir02

So, we have two directories with config.json files in them, but only one of them is valid.

You can find the above demo archive in the data directory of the dirs-with-marker repo - here

Tutorial content

Everything you need to reproduce this tutorial is on GitHub: main.py.

Step 1. How to extract the archive and remove junk files

It's not a rare case, when the archive from a user contains some unnecessary files. For example, the archive may contain a __MACOSX directory or .DS_Store files, which are created by macOS. If we don't handle them, we may have an error while working with the data, so in most cases, it's much easier to delete them.

To extract the archive, we'll use the sly.fs.unpack_archive() function:

import os
import json
import supervisely as sly

DATA_DIR = "data"
ARCHIVE_PATH = os.path.join(DATA_DIR, "input_archive.zip")
EXCTRACT_PATH = os.path.join(DATA_DIR, "extracted")

sly.fs.unpack_archive(ARCHIVE_PATH, EXCTRACT_PATH, remove_junk=True)

ℹ️ If the archive was already extracted and you want to remove junk files from it, you can use the sly.fs.remove_junk_from_dir() function:

sly.fs.remove_junk_from_dir(EXCTRACT_PATH)

Now we have a directory without junk files, and we can start searching for directories with markers.

Step 2. How to find directories with markers

As we already have a directory without junk files, we need to specify the markers we want to find. In our case, it's config.json file. We can pass it as a string or as a list of strings if we need to find multiple markers. ℹ️ If you're passing a list of markers it's important to mention, that they will be searched with the OR, not AND condition. It means that if you pass a list of markers, the function will return all directories that contain at least one of the markers. If you need a condition when the directory contains several specific markers, you can implement this check in the check_function parameter, we will talk about it in next section. Now we can use the sly.fs.dirs_with_marker() function to iterate over all directories with markers:

MARKERS = "config.json"

for directory in sly.fs.dirs_with_marker(EXCTRACT_PATH, MARKERS, ignore_case=True):
    print(f"The directory with the marker '{MARKERS}' is found: '{directory}'")

Ok, we've found the directories, but how can we check them for some conditions? Let's talk about it in the next section.

Step 3. How to check directories for specific conditions

As we've already mentioned, we can pass a function to the check_function parameter, which will be used to check the directory for specific conditions. This function must return True if the directory is valid and False otherwise. So, our conditions for checking were: config.json file must have a key valid, and its value must be true, and the directory must contain two subdirectories: images and anns. Let's implement this check:

def check_function(directory: str) -> bool:
    config = json.load(open(os.path.join(directory, MARKERS)))
    images_dir = os.path.join(directory, "images")
    anns_dir = os.path.join(directory, "anns")

    return config.get("valid") is True and os.path.isdir(images_dir) and os.path.isdir(anns_dir)

Just as a reminder, the check_function must return the bool value.

Example of the final code

So, we've implemented the check function, and now we can use it in the sly.fs.dirs_with_marker() function. Here's the full code:

import os
import json

import supervisely as sly

DATA_DIR = "data"
ARCHIVE_PATH = os.path.join(DATA_DIR, "input_archive.zip")
EXCTRACT_PATH = os.path.join(DATA_DIR, "extracted")

# 1. Extracting the archive and removing junk files.
sly.fs.unpack_archive(ARCHIVE_PATH, EXCTRACT_PATH, remove_junk=True)

# 2. Specifying the marker we want to find.
MARKERS = "config.json"

# 3. Iterating over directories with markers without checking them.
for directory in sly.fs.dirs_with_marker(EXCTRACT_PATH, MARKERS, ignore_case=True):
    print(f"The directory with the marker '{MARKERS}' is found: '{directory}'")

# 4. Defining the check function.
def check_function(directory: str) -> bool:
    config = json.load(open(os.path.join(directory, MARKERS)))
    images_dir = os.path.join(directory, "images")
    anns_dir = os.path.join(directory, "anns")

    return config.get("valid") is True and os.path.isdir(images_dir) and os.path.isdir(anns_dir)

# 5. Iterating over directories with directories which contains markers and passed the check.
for checked_directory in sly.fs.dirs_with_marker(
    EXCTRACT_PATH, MARKERS, check_function=check_function, ignore_case=True
):
    print(f"The directory '{checked_directory}' is valid.")

Let's have a look on what we've got here:

  1. We've extracted the archive and removed junk files.

  2. We've specified the marker we want to find.

  3. We've iterated over directories with markers without checking them. In our test case it will print two directories, while the correct is only one.

  4. We've defined the check function that will be used to check the directory meet our requirements.

  5. We've iterated over directories that passed all checks. In our test case it will print only one directory, which is correct.

And now we can easily work with the data from the directory (or directories) we've found, knowing that it's valid.

Summary

In this tutorial, we've learned how to find directories with specific markers and check them for some conditions using sly.fs.dirs_with_marker() function. We've also learned how to extract the archive and clean it of junk files using sly.fs.unpack_archive() or sly.fs.remove_junk_from_dir() functions. We hope this tutorial was helpful for you and it will save you some time in the future, while working with import apps.

Last updated