dival.datasets.cached_dataset module

dival.datasets.cached_dataset.generate_cache_files(dataset, cache_files, size=None, flush_interval=1000)[source]

Generate cache files for CachedDataset.

Parameters:

dataset (Dataset) – Dataset from which to cache samples.
cache_files (dict of [tuple of ] (str or None)) –
Filenames of the cache files for each part and for each component to be cached. The part ('train', …) is the key to the dict. For each part, a tuple of filenames should be provided, each of which can be None, meaning that this component should not be cached. If the dataset only provides one element per sample, the filename does not have to be packed inside a tuple. If a key is omitted, the part is not cached.

As an example, for a CT dataset with cached FBPs instead of observations for parts 'train' and 'validation':
```
{'train':      ('cache_train_fbp.npy',      None),
 'validation': ('cache_validation_fbp.npy', None)}
```
size (dict of int, optional) – Numbers of samples to cache for each dataset part. If a field is omitted or has value None, all samples are cached. Default: {}.
flush_interval (int, optional) – Number of samples to retrieve before flushing to file (using memmap). This many samples should fit into the system's main memory (RAM). If -1, each file's content is flushed only once, at the end.
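
The effect of flush_interval can be illustrated with plain NumPy. The following is a minimal sketch, not the implementation used by generate_cache_files: the helper write_cache and the fake data are hypothetical, and the sketch assumes the cache is written through numpy.lib.format.open_memmap.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def write_cache(samples, filename, shape, dtype=np.float32,
                flush_interval=1000):
    """Write equally shaped arrays to a .npy file (hypothetical helper)."""
    n = len(samples)
    # Create the .npy file on disk with its final shape up front.
    cache = open_memmap(filename, mode='w+', dtype=dtype,
                        shape=(n,) + tuple(shape))
    for i, sample in enumerate(samples):
        cache[i] = sample
        # Flush periodically so pending writes do not accumulate in RAM.
        if flush_interval != -1 and (i + 1) % flush_interval == 0:
            cache.flush()
    cache.flush()  # final flush (the only one if flush_interval == -1)
    del cache  # release the memmap


# Illustrative data: ten fake 4x4 "FBPs", flushed every three samples.
tmp_dir = tempfile.mkdtemp()
cache_path = os.path.join(tmp_dir, 'cache_train_fbp.npy')
fake_fbps = [np.full((4, 4), i, dtype=np.float32) for i in range(10)]
write_cache(fake_fbps, cache_path, shape=(4, 4), flush_interval=3)
loaded = np.load(cache_path)
```

The resulting file is an ordinary .npy array with the samples stacked along the first axis.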

class dival.datasets.cached_dataset.CachedDataset(dataset, space, cache_files, size=None)[source]

Bases: Dataset

Dataset that replaces elements of another dataset with cached data from .npy files.

The arrays in the .npy files must have shape (self.get_len(part),) + self.space[i].shape for the i-th component.
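
As a rough illustration of this shape rule, the sketch below builds the expected shape for one component. The helper expected_cache_shape and the numbers used are assumptions chosen for the example, not part of dival.

```python
import numpy as np


def expected_cache_shape(num_samples, component_shape):
    """Shape a cache array must have for one component (hypothetical)."""
    return (num_samples,) + tuple(component_shape)


# Assumed values: 8 samples in the part, component space of shape (16, 16).
num_samples = 8
component_shape = (16, 16)
cache = np.zeros(expected_cache_shape(num_samples, component_shape),
                 dtype=np.float32)
shape_ok = cache.shape == (num_samples,) + component_shape
```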

__init__(dataset, space, cache_files, size=None)[source]

Parameters:

dataset (Dataset) – Original dataset from which non-cached elements are used. Must support random access if any elements are not cached.
space ([tuple of ] odl.space.base_tensors.TensorSpace, optional) – The space(s) of the elements of samples as a tuple. This may differ from the space of the original dataset, e.g. when the cache holds the results of precomputed domain-changing operations on the elements.
cache_files (dict of [tuple of ] (str or None)) –
Filenames of the cache files for each part and for each component. The part ('train', …) is the key to the dict. For each part, a tuple of filenames should be provided, each of which can be None, meaning that this component should be fetched from the original dataset. If the dataset only provides one element per sample, the filename does not have to be packed inside a tuple. If a key is omitted, the part is fetched from the original dataset.

As an example, for a CT dataset with cached FBPs instead of observations for parts 'train' and 'validation':
```
{'train':      ('cache_train_fbp.npy',      None),
 'validation': ('cache_validation_fbp.npy', None)}
```
size (dict of int, optional) – Numbers of samples for each part. If a field is omitted or has value None, all available samples are used, which may be less than the number of samples in the original dataset if the cache contains fewer samples. Default: {}.
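
The conventions for cache_files and size can be sketched in a few lines. Both helpers below are hypothetical and only mirror the description above. In particular, resolve_len assumes that "all available samples" means the minimum of the original dataset length and the lengths of the cache arrays in use.

```python
def normalize_cache_files(cache_files, num_components):
    """Map each given part to a tuple of filenames or None (hypothetical)."""
    normalized = {}
    for part, files in cache_files.items():
        if isinstance(files, str) or files is None:
            files = (files,)  # single-element dataset: wrap in a tuple
        if len(files) != num_components:
            raise ValueError(
                'expected {} filenames for part {!r}'.format(
                    num_components, part))
        normalized[part] = tuple(files)
    return normalized


def resolve_len(requested, dataset_len, cache_lens):
    """Number of samples for one part (assumed rule, see text above)."""
    if requested is not None:
        return requested
    # No size given: use as many samples as every source can provide.
    return min([dataset_len] + list(cache_lens))


pair_files = normalize_cache_files(
    {'train': ('cache_train_fbp.npy', None)}, num_components=2)
single_files = normalize_cache_files(
    {'train': 'cache_train.npy'}, num_components=1)
n_limited_by_cache = resolve_len(None, dataset_len=1000, cache_lens=[800])
n_requested = resolve_len(500, dataset_len=1000, cache_lens=[800])
```

Parts that are missing from the dict simply do not appear in the normalized result, matching the rule that such parts are fetched from the original dataset.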

generator(part='train')[source]

Yield data.

The default implementation calls get_sample() if the dataset implements it (i.e., supports random access).

Parameters:

part ({'train', 'validation', 'test'}, optional) – Whether to yield train, validation or test data. Default is 'train'.

Yields:

data (odl element or tuple of odl elements) – Sample of the dataset.
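
A generator built on top of random access can be sketched as follows. This is only an illustration of the pattern described above: generator_sketch and the toy data are hypothetical stand-ins for a dataset object.

```python
def generator_sketch(get_sample, length):
    """Yield samples one by one via a random-access function."""
    for index in range(length):
        yield get_sample(index)


# Toy stand-in for a dataset part with three samples.
toy_data = [10, 20, 30]
yielded = list(generator_sketch(lambda i: toy_data[i], len(toy_data)))
```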

get_sample(index, part='train', out=None)[source]

Get single sample by index.

Parameters:

index (int) – Index of the sample.
part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.
out (array-like or tuple of (array-like or bool) or None) –
Array(s) (or e.g. odl element(s)) to which the sample is written. A tuple should be passed if the dataset returns two or more arrays per sample (e.g. pairs, …). If a tuple element is a bool, it has the following meaning:

True
Create a new array and return it.

False
Do not return this array, i.e. None is returned.

Returns:

sample – E.g. for a pair dataset: (array, None) if out=(True, False).

Return type:

[tuple of ] (array-like or None)
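
The semantics of out can be sketched with NumPy arrays standing in for a pair dataset. The function get_sample_sketch is hypothetical and is not the method documented here. It also assumes that out=None behaves like passing True for every component.

```python
import numpy as np


def get_sample_sketch(components, index, out=None):
    """Mimic the documented `out` semantics (hypothetical helper).

    `components` is a tuple of arrays stacked along the first axis.
    """
    if out is None:
        out = (True,) * len(components)  # assumption: return all components
    result = []
    for data, out_i in zip(components, out):
        if isinstance(out_i, bool):
            # True: create and return a new array; False: return None.
            result.append(data[index].copy() if out_i else None)
        else:
            out_i[:] = data[index]  # write into the caller's array
            result.append(out_i)
    return tuple(result)


observations = np.arange(12, dtype=float).reshape(3, 4)
ground_truth = np.arange(6, dtype=float).reshape(3, 2)
pair = (observations, ground_truth)

# Only the first component is wanted; the second is returned as None.
obs_only = get_sample_sketch(pair, 1, out=(True, False))

# Write the second component into a preallocated buffer.
buf = np.empty(2)
_, gt = get_sample_sketch(pair, 2, out=(False, buf))
```

Passing False for a component avoids loading data that the caller does not need.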

get_samples(key, part='train', out=None)[source]

Get samples by slice or range.

The default implementation calls get_sample() if the dataset implements it.

Parameters:

key (slice or range) – Indexes of the samples.
part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.
out (array-like or tuple of (array-like or bool) or None) –
Array(s) (or e.g. odl element(s)) to which the samples are written. The first dimension must match the number of samples requested. A tuple should be passed if the dataset returns two or more arrays per sample (e.g. pairs, …). If a tuple element is a bool, it has the following meaning:

True
Create a new array and return it.

False
Do not return this array, i.e. None is returned.

Returns:

samples – If the dataset has multiple arrays per sample, a tuple holding arrays is returned. E.g. for a pair dataset: (array, None) if out=(True, False). The samples are stacked in the first (additional) dimension of each array.

Return type:

[tuple of ] (array-like or None)
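
Retrieving several samples at once, stacked along a new first dimension, can be sketched in the same way. The function get_samples_sketch is a hypothetical stand-in that loops over the requested indices. As before, it assumes that out=None behaves like passing True for every component.

```python
import numpy as np


def get_samples_sketch(components, key, out=None):
    """Fetch samples by slice or range (hypothetical helper)."""
    if isinstance(key, slice):
        # Convert the slice to a range over the available samples.
        key = range(*key.indices(len(components[0])))
    if out is None:
        out = (True,) * len(components)  # assumption: return all components
    result = []
    for data, out_i in zip(components, out):
        if isinstance(out_i, bool):
            if not out_i:
                result.append(None)  # False: component is not returned
                continue
            # True: allocate a new array with one row per requested sample.
            out_i = np.empty((len(key),) + data.shape[1:], dtype=data.dtype)
        for j, index in enumerate(key):
            out_i[j] = data[index]
        result.append(out_i)
    return tuple(result)


observations = np.arange(20, dtype=float).reshape(5, 4)
ground_truth = np.arange(10, dtype=float).reshape(5, 2)
pair = (observations, ground_truth)

by_slice = get_samples_sketch(pair, slice(1, 4), out=(True, False))
by_range = get_samples_sketch(pair, range(0, 5, 2))
```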