tonic.cached_dataset#

Module Contents#

Classes#

MemoryCachedDataset

MemoryCachedDataset caches the samples to memory to substantially improve data loading speeds.

DiskCachedDataset

DiskCachedDataset caches the data samples to the hard drive for subsequent reads, thereby potentially improving data loading speeds.

CachedDataset

Deprecated class that points to DiskCachedDataset for now but will be removed in a future release.

Functions#

save_to_disk_cache(→ None)

Save data to caching path on disk in an hdf5 file. Can deal with data that is a dictionary.

load_from_disk_cache(→ Tuple)

Load data from file cache, separately for (data) and (targets).

class tonic.cached_dataset.MemoryCachedDataset[source]#

MemoryCachedDataset caches the samples to memory to substantially improve data loading speeds. However, you have to keep a close eye on memory consumption while loading your samples, which can increase rapidly when converting events to rasters/frames. If your transformed dataset doesn't fit into memory but you still want to cache samples to speed up training, consider using DiskCachedDataset instead.

Parameters:
  • dataset – Dataset to be cached to memory.

  • device – Device to cache to. This is preferably a torch device. Will cache to CPU memory if None (default).

  • transform – Transforms to be applied on the data

  • target_transform – Transforms to be applied on the label/targets

  • transforms – A callable of transforms that is applied to both data and labels at the same time.

dataset: Iterable#
device: Optional[str]#
transform: Optional[Callable]#
target_transform: Optional[Callable]#
transforms: Optional[Callable]#
samples_dict: dict#
__getitem__(index)[source]#
__len__()[source]#
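For illustration, the in-memory caching pattern can be sketched in plain Python. This is a toy sketch of the idea only, not tonic's actual implementation; the `ToyMemoryCache` class and the list-based dataset are hypothetical stand-ins. It assumes samples are cached raw on first access and the transform is applied on every retrieval:

```python
class ToyMemoryCache:
    """Toy sketch: cache raw samples in a dict, apply the transform on retrieval."""

    def __init__(self, dataset, transform=None):
        self.dataset = dataset          # any indexable dataset of (data, target) pairs
        self.transform = transform      # applied after the cached lookup
        self.samples_dict = {}          # index -> raw (data, target)

    def __getitem__(self, index):
        if index not in self.samples_dict:
            # first access: load from the underlying dataset and cache
            self.samples_dict[index] = self.dataset[index]
        data, target = self.samples_dict[index]
        if self.transform is not None:
            data = self.transform(data)
        return data, target

    def __len__(self):
        return len(self.dataset)


# Usage: wrap a small list of (data, target) pairs.
raw = [([1, 2, 3], 0), ([4, 5, 6], 1)]
cached = ToyMemoryCache(raw, transform=lambda d: [x * 2 for x in d])
first = cached[0]   # first access populates the cache
second = cached[0]  # second access is served from memory
```

Applying the transform after the cache lookup is what makes random augmentations still vary between epochs even though the raw sample is only loaded once.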
class tonic.cached_dataset.DiskCachedDataset[source]#

DiskCachedDataset caches the data samples to the hard drive for subsequent reads, thereby potentially improving data loading speeds. If dataset is None, then the length of this dataset will be inferred from the number of files in the caching folder. Pay attention to the cache path you’re providing, as DiskCachedDataset will simply check if there is a file present with the index that it is looking for. When using train/test splits, it is wise to also take that into account in the cache path.

Note

When you change the transform that is applied before caching, DiskCachedDataset cannot know about this and will present you with an old file. To avoid this, either clear your cache folder manually when needed, incorporate all transformation parameters into the cache path (which creates a tree of cache files), or use reset_cache=True.

Note

Caching PyTorch tensors will write NumPy arrays to disk, so be careful when loading a sample where you expect a tensor. The recommendation is to defer the conversion to tensor as late as possible.

Parameters:
  • dataset – Dataset to be cached to disk. Can be None, if only files in cache_path should be used.

  • cache_path – The preferred path where the cache will be written to and read from.

  • reset_cache – When True, will clear out the cache path during initialisation. Default is False.

  • transform – Transforms to be applied on the data

  • target_transform – Transforms to be applied on the label/targets

  • transforms – A callable of transforms that is applied to both data and labels at the same time.

  • num_copies – Number of copies of each sample to be cached. This is a useful parameter if the dataset is being augmented with slow, random transforms.

  • compress – Whether to apply lightweight lzf compression, default is True.

dataset: Iterable#
cache_path: str#
reset_cache: bool = False#
transform: Optional[Callable]#
target_transform: Optional[Callable]#
transforms: Optional[Callable]#
num_copies: int = 1#
compress: bool = True#
__post_init__()[source]#
__getitem__(item) Tuple[object, object][source]#
Return type:

Tuple[object, object]

__len__()[source]#
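The disk-caching behaviour described above (files looked up purely by index, stale files served after a transform change, reset_cache clearing the folder) can be sketched with the standard library. This is an illustrative toy, not tonic's implementation: the real class writes hdf5 files, while this `ToyDiskCache` sketch uses pickle, and its names are hypothetical:

```python
import pickle
import tempfile
from pathlib import Path


class ToyDiskCache:
    """Toy sketch: cache each sample to its own file on disk, keyed by index."""

    def __init__(self, dataset, cache_path, reset_cache=False):
        self.dataset = dataset
        self.cache_path = Path(cache_path)
        self.cache_path.mkdir(parents=True, exist_ok=True)
        if reset_cache:
            # clear stale files, e.g. after changing the pre-cache transform
            for f in self.cache_path.glob("*.pkl"):
                f.unlink()

    def __getitem__(self, index):
        file_path = self.cache_path / f"{index}.pkl"
        if file_path.exists():                # cache hit: read whatever file is there
            with open(file_path, "rb") as f:
                return pickle.load(f)
        sample = self.dataset[index]          # cache miss: compute and write
        with open(file_path, "wb") as f:
            pickle.dump(sample, f)
        return sample

    def __len__(self):
        return len(self.dataset)


with tempfile.TemporaryDirectory() as tmp:
    cache = ToyDiskCache([([1, 2], 0), ([3, 4], 1)], cache_path=tmp)
    first = cache[1]    # miss: writes 1.pkl
    second = cache[1]   # hit: reads 1.pkl back from disk
```

Note how the cache hit never consults the underlying dataset: this is exactly why an outdated cache path silently serves old samples, and why separate train/test splits need distinct cache paths.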
tonic.cached_dataset.save_to_disk_cache(data, targets, file_path: Union[str, pathlib.Path], compress: bool = True) None[source]#

Save data to caching path on disk in an hdf5 file. Can deal with data that is a dictionary.

Parameters:
  • data – numpy ndarray-like or a list thereof.

  • targets – same as data, can be None.

  • file_path (Union[str, pathlib.Path]) – caching file path.

  • compress (bool) – Whether to apply lightweight lzf compression. Default is True.

Return type:

None

tonic.cached_dataset.load_from_disk_cache(file_path: Union[str, pathlib.Path]) Tuple[source]#

Load data from file cache, separately for data and targets. Can assemble dictionaries back together.

Parameters:

file_path (Union[str, pathlib.Path]) – caching file path.

Returns:

data, targets

Return type:

Tuple
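The intended save/load roundtrip can be sketched with the standard library. The real functions store hdf5 files (with optional lzf compression); this pickle-based sketch, with hypothetical `toy_` names, only mirrors the documented semantics: data may be a dictionary, targets may be None, and loading returns a (data, targets) tuple:

```python
import pickle
import tempfile
from pathlib import Path


def toy_save_to_disk_cache(data, targets, file_path):
    """Sketch of save: bundle data and targets (targets may be None) into one file."""
    file_path = Path(file_path)
    file_path.parent.mkdir(parents=True, exist_ok=True)
    with open(file_path, "wb") as f:
        pickle.dump({"data": data, "targets": targets}, f)


def toy_load_from_disk_cache(file_path):
    """Sketch of load: return (data, targets), reassembling dictionaries as stored."""
    with open(file_path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["data"], bundle["targets"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "0.pkl"
    toy_save_to_disk_cache({"events": [1, 2, 3]}, targets=7, file_path=path)
    data, targets = toy_load_from_disk_cache(path)
```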

class tonic.cached_dataset.CachedDataset(*args, **kwargs)[source]#

Bases: DiskCachedDataset

Deprecated class that points to DiskCachedDataset for now but will be removed in a future release.

Please use MemoryCachedDataset or DiskCachedDataset in the future.