dival.datasets package

Module contents

Implements datasets for training and evaluating learned reconstructors.

get_standard_dataset(name, **kwargs)

Return a standard dataset by name.

Dataset([space])

Dataset base class.

GroundTruthDataset([space])

Ground truth dataset base class.

ObservationGroundTruthPairDataset(…[, …])

Dataset of pairs generated from a ground truth generator by applying a forward operator and noise.

EllipsesDataset([image_size, min_pt, …])

Dataset with images of multiple random ellipses.

LoDoPaBDataset([min_pt, max_pt, …])

The LoDoPaB-CT dataset, which is documented in the Data Descriptor article https://www.nature.com/articles/s41597-021-00893-z and hosted on https://zenodo.org/record/3384092.

The function get_standard_dataset() returns fixed “standard” datasets with pairs of observation and ground truth samples. Currently the standard datasets are 'ellipses' and 'lodopab'.

The class ObservationGroundTruthPairDataset can be used, either directly or via GroundTruthDataset.create_pair_dataset(), to create a custom dataset of pairs given a ground truth dataset and a forward operator. For example:
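
A minimal sketch (assuming odl with the 'skimage' ray transform backend is available; the noise_type and noise_kwargs values shown are assumptions, cf. NoiseOperator for the supported options):

    import odl
    from dival.datasets import EllipsesDataset

    # Ground truth dataset and a parallel beam forward operator.
    ellipses = EllipsesDataset(image_size=128)
    geometry = odl.tomo.parallel_beam_geometry(ellipses.space, num_angles=30)
    ray_trafo = odl.tomo.RayTransform(ellipses.space, geometry, impl='skimage')

    # Pair dataset with 2.5% relative white Gaussian noise (assumed kwargs).
    dataset = ellipses.create_pair_dataset(
        forward_op=ray_trafo,
        noise_type='white',
        noise_kwargs={'relative_stddev': True, 'stddev': 0.025},
        noise_seeds={'train': 1, 'validation': 2, 'test': 3})

    obs, gt = next(dataset.generator(part='train'))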

dival.datasets.get_standard_dataset(name, **kwargs)[source]

Return a standard dataset by name.

The standard datasets are (currently):

'ellipses'

A typical synthetic CT dataset with ellipse phantoms.

EllipsesDataset is used as the ground truth dataset, a ray transform with parallel beam geometry using 30 angles is applied, and white Gaussian noise with a standard deviation of 2.5% (i.e. 0.025 * mean(abs(observation))) is added.

In order to avoid the inverse crime, the ground truth images of shape (128, 128) are upscaled by bilinear interpolation to a resolution of (400, 400) before the ray transform is applied, so the discretization used for the simulation differs from that of ray_trafo.

Attributes of the returned dataset:
ray_trafo : odl.tomo.RayTransform

Ray transform corresponding to the noiseless forward operator.

get_ray_trafo(**kwargs) : function

Function that returns a ray transform corresponding to the noiseless forward operator. Keyword arguments (e.g. impl) are forwarded to the RayTransform constructor.

'lodopab'

The LoDoPaB-CT dataset, which is documented in the Data Descriptor article https://www.nature.com/articles/s41597-021-00893-z and hosted on https://zenodo.org/record/3384092. It is a simulated low dose CT dataset based on real reconstructions from the LIDC-IDRI dataset.

The dataset contains 42895 pairs of images and projection data. For simulation, a ray transform with parallel beam geometry using 1000 angles and 513 detector pixels is used. Poisson noise corresponding to 4096 incident photons per pixel before attenuation is applied to the projection data.

Attributes of the returned dataset:
ray_trafo : odl.tomo.RayTransform

Ray transform corresponding to the noiseless forward operator.

Methods of the returned dataset:
get_ray_trafo(**kwargs)

Function that returns a ray transform corresponding to the noiseless forward operator. Keyword arguments (e.g. impl) are forwarded to the RayTransform constructor.

Parameters
  • name (str) – Name of the dataset.

  • kwargs (dict) –

    Keyword arguments. Supported parameters for the datasets are:

    'ellipses'
    impl : {'skimage', 'astra_cpu', 'astra_cuda'}, optional

    Implementation passed to odl.tomo.RayTransform. Default: 'astra_cuda'.

    fixed_seeds : dict or bool, optional

    Seeds to use for random ellipse generation, passed to EllipsesDataset.__init__(). Default: False.

    fixed_noise_seeds : dict or bool, optional

    Seeds to use for noise generation, passed as noise_seeds to GroundTruthDataset.create_pair_dataset(). If True is passed (the default), the seeds {'train': 1, 'validation': 2, 'test': 3} are used.

    'lodopab'
    num_angles : int, optional

    Number of angles to use from the full 1000 angles. Must be a divisor of 1000.

    observation_model : {'post-log', 'pre-log'}, optional

    The observation model to use. Default is 'post-log'.

    min_photon_count : float, optional

    Replacement value for a simulated photon count of zero. If observation_model == 'post-log', a value greater than zero is required in order to avoid undefined values. The default is 0.1, for both the 'post-log' and 'pre-log' models.

    sorted_by_patient : bool, optional

    Whether to sort the samples by patient id. Useful for resplitting the dataset. Default: False.

    impl : {'skimage', 'astra_cpu', 'astra_cuda'}, optional

    Implementation passed to odl.tomo.RayTransform. Default: 'astra_cuda'.

Returns

dataset – The standard dataset. It has an attribute standard_dataset_name that stores its name.

Return type

Dataset
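
A usage sketch (impl='skimage' is chosen here to avoid the GPU requirement of the default 'astra_cuda' backend):

    from dival.datasets import get_standard_dataset

    dataset = get_standard_dataset('ellipses', impl='skimage')
    ray_trafo = dataset.ray_trafo  # noiseless forward operator
    obs, gt = next(dataset.generator(part='train'))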

class dival.datasets.Dataset(space=None)[source]

Bases: object

Dataset base class.

Subclasses must either implement generator() or provide random access by implementing get_sample() and get_samples() (the latter has a default implementation based on get_sample()); random access support should be indicated by setting the attribute random_access = True.
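
A minimal sketch of a random-access subclass (the class name and its contents are illustrative, not part of dival):

    import odl
    from dival.datasets import Dataset

    class ConstantImageDataset(Dataset):
        """Illustrative dataset whose index-th sample is a constant image."""
        def __init__(self):
            space = odl.uniform_discr([-1., -1.], [1., 1.], (64, 64),
                                      dtype='float32')
            super().__init__(space=space)
            self.train_len = 100
            self.validation_len = 10
            self.test_len = 10
            self.num_elements_per_sample = 1
            self.random_access = True  # signal that get_sample() is provided

        def get_sample(self, index, part='train', out=None):
            if out is False:
                return None
            if out is None or out is True:
                out = self.space.zero()
            out[:] = index / self.get_len(part)  # fill with a constant value
            return out

The inherited get_samples() falls back to get_sample(), so only get_sample() needs to be implemented here.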

space

The spaces of the elements of samples as a tuple. If only one element per sample is provided, this attribute is the space of the element (i.e., no tuple). It is strongly recommended to set this attribute in subclasses, as some functionality may depend on it.

Type

[tuple of ] odl.space.base_tensors.TensorSpace or None

shape

The shapes of the elements of samples as a tuple of tuple of int. If only one element per sample is provided, this attribute is the shape of the element (i.e., not a tuple of tuple of int, but a tuple of int).

Type

[tuple of ] tuple of int, optional

train_len

Number of training samples.

Type

int, optional

validation_len

Number of validation samples.

Type

int, optional

test_len

Number of test samples.

Type

int, optional

random_access

Whether the dataset supports random access via self.get_sample and self.get_samples. Setting this attribute is the preferred way for subclasses to indicate whether they support random access.

Type

bool, optional

num_elements_per_sample

Number of elements per sample. E.g. 1 for a ground truth dataset or 2 for a dataset of pairs of observation and ground truth.

Type

int, optional

standard_dataset_name

Datasets returned by get_standard_dataset() have this attribute, which stores the name of the standard dataset.

Type

str, optional

__init__(space=None)[source]

The attributes that subclasses should set (where applicable) are: space (can also be set via the argument), shape, train_len, validation_len, test_len, random_access and num_elements_per_sample.

Parameters

space ([tuple of ] odl.space.base_tensors.TensorSpace, optional) – The spaces of the elements of samples as a tuple. If only one element per sample is provided, this attribute is the space of the element (i.e., no tuple). It is strongly recommended to set space in subclasses, as some functionality may depend on it.

generator(part='train')[source]

Yield data.

The default implementation calls get_sample() if the dataset implements it (i.e., supports random access).

Parameters

part ({'train', 'validation', 'test'}, optional) – Whether to yield train, validation or test data. Default is 'train'.

Yields

data (odl element or tuple of odl elements) – Sample of the dataset.

get_train_generator()[source]
get_validation_generator()[source]
get_test_generator()[source]
get_len(part='train')[source]

Return the number of elements the generator will yield.

Parameters

part ({'train', 'validation', 'test'}, optional) – Whether to return the number of train, validation or test elements. Default is 'train'.

get_train_len()[source]

Return the number of samples the train generator will yield.

get_validation_len()[source]

Return the number of samples the validation generator will yield.

get_test_len()[source]

Return the number of samples the test generator will yield.

get_shape()[source]

Return the shape of each element.

Returns shape if it is set. Otherwise, it is inferred from space (which is strongly recommended to be set in every subclass). If space is not set either, a NotImplementedError is raised.

Returns

shape

Return type

[tuple of ] tuple

get_num_elements_per_sample()[source]

Return number of elements per sample.

Returns num_elements_per_sample if it is set. Otherwise, it is inferred from space (which is strongly recommended to be set in every subclass). If space is not set either, a NotImplementedError is raised.

Returns

num_elements_per_sample

Return type

int

get_data_pairs(part='train', n=None)[source]

Return the first samples of a data part as a DataPairs object.

Only supports datasets with two elements per sample.

Parameters
  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • n (int, optional) – Number of pairs (from beginning). If None, all available data is used (the default).
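
For example (assuming dataset is a pair dataset, e.g. from get_standard_dataset()):

    test_data = dataset.get_data_pairs('test', 5)  # first 5 test pairs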

get_data_pairs_per_index(part='train', index=None)[source]

Return specific samples of a data part as a DataPairs object.

Only supports datasets with two elements per sample.

For datasets not supporting random access, samples are extracted from generator(), which can be computationally expensive.

Parameters
  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • index (int or list of int, optional) – Indices of the samples in the data part. Default is [0].

create_torch_dataset(part='train', reshape=None, transform=None)[source]

Create a torch dataset wrapper for one part of this dataset.

If supports_random_access() returns False, a subclass of torch.utils.data.IterableDataset is returned that fetches samples via generator(). Note: When using torch's DataLoader with multiple workers you might want to individually configure the datasets for each worker, see the PyTorch docs on IterableDataset. For this purpose it can be useful to modify the wrapped dival dataset in worker_init_fn(), which can be accessed there via torch.utils.data.get_worker_info().dataset.dataset.

If supports_random_access() returns True, a subclass of torch.utils.data.Dataset is returned that retrieves samples using get_sample().

Parameters
  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • reshape (tuple of (tuple or None), optional) – Shapes to which the elements of each sample will be reshaped. If None is passed for an element, no reshape is applied.

  • transform (callable, optional) – Transform to be applied on each sample, useful for augmentation. Default: None, i.e. no transform.

Returns

dataset – The torch dataset wrapping this dataset. The wrapped dival dataset is assigned to the attribute dataset.dataset.

Return type

torch.utils.data.Dataset or torch.utils.data.IterableDataset
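
A usage sketch, assuming torch is installed and dataset is a pair dataset (e.g. from get_standard_dataset()); the reshape argument prepends a channel dimension to both sample elements:

    from torch.utils.data import DataLoader

    torch_dataset = dataset.create_torch_dataset(
        part='train',
        reshape=((1,) + dataset.space[0].shape,
                 (1,) + dataset.space[1].shape))
    loader = DataLoader(torch_dataset, batch_size=16, num_workers=0)
    for obs_batch, gt_batch in loader:
        pass  # training step would go here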

create_keras_generator(part='train', batch_size=1, shuffle=True, reshape=None)[source]

Create a keras data generator wrapper for one part of this dataset.

If supports_random_access() returns False, a generator wrapping generator() is returned. In this case no shuffling is performed regardless of the passed shuffle parameter. Also, parallel data loading (with multiple workers) is not applicable.

If supports_random_access() returns True, a tf.keras.utils.Sequence is returned, which is implemented using get_sample(). For datasets that support parallel calls to get_sample(), the returned data generator (sequence) can be used by multiple workers.

Parameters
  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • batch_size (int, optional) – Batch size. Default is 1.

  • shuffle (bool, optional) – Whether to shuffle samples each epoch. This option has no effect if supports_random_access() returns False, since in that case samples are fetched directly from generator(). The default is True.

  • reshape (tuple of (tuple or None), optional) – Shapes to which the elements of each sample will be reshaped. If None is passed for an element, no reshape is applied.
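
A usage sketch, assuming dataset is a random-access pair dataset and model is a compiled tf.keras model whose input and output shapes match the sample elements:

    train_gen = dataset.create_keras_generator(part='train', batch_size=16)
    val_gen = dataset.create_keras_generator(part='validation', batch_size=16,
                                             shuffle=False)
    model.fit(train_gen, validation_data=val_gen, epochs=10)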

get_sample(index, part='train', out=None)[source]

Get single sample by index.

Parameters
  • index (int) – Index of the sample.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (array-like or tuple of (array-like or bool) or None) –

    Array(s) (or e.g. odl element(s)) to which the sample is written. A tuple should be passed if the dataset returns two or more arrays per sample (i.e. pairs, …). If a tuple element is a bool, it has the following meaning:

    True

    Create a new array and return it.

    False

    Do not return this array, i.e. None is returned.

Returns

sample – E.g. for a pair dataset: (array, None) if out=(True, False).

Return type

[tuple of ] (array-like or None)
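
For example, to fetch only the observation of a pair dataset sample:

    # out=(True, False): create the observation, skip the ground truth
    obs, gt = dataset.get_sample(3, part='test', out=(True, False))
    assert gt is None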

get_samples(key, part='train', out=None)[source]

Get samples by slice or range.

The default implementation calls get_sample() if the dataset implements it.

Parameters
  • key (slice or range) – Indexes of the samples.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (array-like or tuple of (array-like or bool) or None) –

    Array(s) (or e.g. odl element(s)) to which the samples are written. The first dimension must match the number of samples requested. A tuple should be passed if the dataset returns two or more arrays per sample (i.e. pairs, …). If a tuple element is a bool, it has the following meaning:

    True

    Create a new array and return it.

    False

    Do not return this array, i.e. None is returned.

Returns

samples – If the dataset has multiple arrays per sample, a tuple holding arrays is returned. E.g. for a pair dataset: (array, None) if out=(True, False). The samples are stacked in the first (additional) dimension of each array.

Return type

[tuple of ] (array-like or None)

supports_random_access()[source]

Whether random access seems to be supported.

If the object has the attribute self.random_access, its value is returned (this is the preferred way for subclasses to indicate whether they support random access). Otherwise, a simple duck-type check is performed which tries to get the first sample by random access.

Returns

supports – True if the dataset supports random access, otherwise False.

Return type

bool

class dival.datasets.GroundTruthDataset(space=None)[source]

Bases: dival.datasets.dataset.Dataset

Ground truth dataset base class.

__init__(space=None)[source]
Parameters

space (odl.space.base_tensors.TensorSpace, optional) – The space of the samples. It is strongly recommended to set space in subclasses, as some functionality may depend on it.

create_pair_dataset(forward_op, post_processor=None, noise_type=None, noise_kwargs=None, noise_seeds=None)[source]

Create an ObservationGroundTruthPairDataset from this ground truth dataset by applying forward_op and, optionally, a post-processor and noise. The parameters are a subset of those of ObservationGroundTruthPairDataset.__init__().

class dival.datasets.ObservationGroundTruthPairDataset(ground_truth_gen, forward_op, post_processor=None, train_len=None, validation_len=None, test_len=None, domain=None, noise_type=None, noise_kwargs=None, noise_seeds=None)[source]

Bases: dival.datasets.dataset.Dataset

Dataset of pairs generated from a ground truth generator by applying a forward operator and noise.

NB: This dataset class does not allow random access. Supporting random access would require restoring the same random generator state each time the same sample is accessed, in order to use a fixed noise realization for each sample.

__init__(ground_truth_gen, forward_op, post_processor=None, train_len=None, validation_len=None, test_len=None, domain=None, noise_type=None, noise_kwargs=None, noise_seeds=None)[source]
Parameters
  • ground_truth_gen (generator function) – Function returning a generator providing ground truth. Must accept a part parameter like Dataset.generator().

  • forward_op (odl operator) – Forward operator to apply on the ground truth.

  • post_processor (odl operator, optional) – Post-processor to apply on the result of the forward operator.

  • train_len (int, optional) – Number of training samples.

  • validation_len (int, optional) – Number of validation samples.

  • test_len (int, optional) – Number of test samples.

  • domain (odl space, optional) – Ground truth domain. If not specified, it is inferred from forward_op.

  • noise_type (str, optional) – Noise type. See NoiseOperator for the list of supported noise types.

  • noise_kwargs (dict, optional) – Keyword arguments passed to NoiseOperator.

  • noise_seeds (dict of int, optional) – Seeds to use for random noise generation. The part ('train', …) is the key to the dict. If a key is omitted or a value is None, no fixed seed is used for that part. By default, no fixed seeds are used.

generator(part='train')[source]

Yield data.

The default implementation calls get_sample() if the dataset implements it (i.e., supports random access).

Parameters

part ({'train', 'validation', 'test'}, optional) – Whether to yield train, validation or test data. Default is 'train'.

Yields

data (odl element or tuple of odl elements) – Sample of the dataset.

class dival.datasets.EllipsesDataset(image_size=128, min_pt=None, max_pt=None, train_len=32000, validation_len=3200, test_len=3200, fixed_seeds=False)[source]

Bases: dival.datasets.dataset.GroundTruthDataset

Dataset with images of multiple random ellipses.

This dataset uses odl.phantom.ellipsoid_phantom() to create the images. The images are normalized to have a value range of [0., 1.], with a background value of 0.

space

odl.uniform_discr(min_pt, max_pt, (image_size, image_size), dtype='float32'), with the parameters passed to __init__().

shape

(image_size, image_size), with image_size parameter passed to __init__(). Default (128, 128).

train_len

train_len parameter passed to __init__(). Default 32000.

validation_len

validation_len parameter passed to __init__(). Default 3200.

test_len

test_len parameter passed to __init__(). Default 3200.

random_access

False

num_elements_per_sample

1

__init__(image_size=128, min_pt=None, max_pt=None, train_len=32000, validation_len=3200, test_len=3200, fixed_seeds=False)[source]
Parameters
  • image_size (int, optional) – Number of pixels per image dimension. Default: 128.

  • min_pt ([int, int], optional) – Minimum values of the lp space. Default: [-image_size/2, -image_size/2].

  • max_pt ([int, int], optional) – Maximum values of the lp space. Default: [image_size/2, image_size/2].

  • train_len (int or None, optional) – Length of training set. Default: 32000. If None, infinitely many samples could be generated.

  • validation_len (int, optional) – Length of validation set. Default: 3200.

  • test_len (int, optional) – Length of test set. Default: 3200.

  • fixed_seeds (dict or bool, optional) – Seeds to use for random generation. The values of the keys 'train', 'validation' and 'test' are used. If a seed is None or omitted, it is chosen randomly. If True is passed, the seeds fixed_seeds={'train': 42, 'validation': 2, 'test': 1} are used. If False is passed (the default), all seeds are chosen randomly.

generator(part='train')[source]

Yield random ellipse phantom images using odl.phantom.ellipsoid_phantom().

Parameters

part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

Yields

image (element of space) – Random ellipse phantom image with values in [0., 1.].
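
A sketch of reproducible generation via fixed seeds (two datasets constructed with the same seeds should yield identical phantoms):

    from dival.datasets import EllipsesDataset

    d1 = EllipsesDataset(train_len=4, fixed_seeds=True)
    d2 = EllipsesDataset(train_len=4, fixed_seeds=True)
    x1 = next(d1.generator(part='train'))
    x2 = next(d2.generator(part='train'))
    assert (x1.asarray() == x2.asarray()).all()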

class dival.datasets.LoDoPaBDataset(min_pt=None, max_pt=None, observation_model='post-log', min_photon_count=None, sorted_by_patient=False, impl='astra_cuda')[source]

Bases: dival.datasets.dataset.Dataset

The LoDoPaB-CT dataset, which is documented in the Data Descriptor article https://www.nature.com/articles/s41597-021-00893-z and hosted on https://zenodo.org/record/3384092. It is a simulated low dose CT dataset based on real reconstructions from the LIDC-IDRI dataset.

The dataset contains 42895 pairs of images and projection data. For simulation, a ray transform with parallel beam geometry using 1000 angles and 513 detector pixels is used. Poisson noise corresponding to 4096 incident photons per pixel before attenuation is applied to the projection data. The images have a size of 362x362 px.

An ODL ray transform that corresponds to the noiseless forward operator can be obtained via the get_ray_trafo method of this dataset. Additionally, the ray_trafo attribute holds a ray transform instance, which is created during __init__(). Note: By default, the 'astra_cuda' implementation backend is used, which requires both astra and a CUDA-enabled GPU to be available. You can choose a different backend by passing impl='skimage' or impl='astra_cpu'.

Further functionalities:

  • converting the stored post-log observations to pre-log observations on the fly (cf. observation_model parameter of __init__())

  • sorting by patient ids (cf. sorted_by_patient parameter of __init__())

  • changing the zero photon count replacement value of 0.1 used for pre-log observations (cf. min_photon_count parameter of __init__())

space
(space[0], space[1]), where
space[0]

odl.uniform_discr([0., -0.1838], [3.1416, 0.1838], (1000, 513), dtype='float32')

space[1]

odl.uniform_discr(min_pt, max_pt, (362, 362), dtype='float32'), with the min_pt and max_pt parameters passed to __init__()

shape

(362, 362)

train_len

35820

validation_len

3522

test_len

3553

random_access

True

num_elements_per_sample

2

ray_trafo

Ray transform corresponding to the noiseless forward operator.

Type

odl.tomo.RayTransform

sorted_by_patient

Whether the samples are sorted by patient id. Default: False.

Type

bool

rel_patient_ids

Relative patient ids of the samples in the original non-sorted order for each part, as returned by LoDoPaBDataset.get_patient_ids(). None, if the csv files are not found.

Type

(dict of array) or None

__init__(min_pt=None, max_pt=None, observation_model='post-log', min_photon_count=None, sorted_by_patient=False, impl='astra_cuda')[source]
Parameters
  • min_pt ([float, float], optional) – Minimum values of the lp space. Default: [-0.13, -0.13].

  • max_pt ([float, float], optional) – Maximum values of the lp space. Default: [0.13, 0.13].

  • observation_model ({'post-log', 'pre-log'}, optional) –

    The observation model to use. The default is 'post-log'.

    'post-log'

    Observations are linearly related to the normalized ground truth via the ray transform, obs = ray_trafo(gt) + noise. Note that the scaling of the observations matches the normalized ground truth, i.e., they are divided by the linear attenuation of 3071 HU.

    'pre-log'

    Observations are non-linearly related to the ground truth, as given by the Beer-Lambert law. The model is obs = exp(-ray_trafo(gt * MU(3071 HU))) + noise, where MU(3071 HU) is the factor by which the ground truth was normalized.

  • min_photon_count (float, optional) – Replacement value for a simulated photon count of zero. If observation_model == 'post-log', a value greater than zero is required in order to avoid undefined values. The default is 0.1, for both the 'post-log' and 'pre-log' models.

  • sorted_by_patient (bool, optional) – Whether to sort the samples by patient id. Useful for resplitting the dataset. See also get_indices_for_patient(). Note that the slices of each patient are ordered randomly with respect to the z-location in any case. Default: False.

  • impl ({'skimage', 'astra_cpu', 'astra_cuda'}, optional) – Implementation passed to odl.tomo.RayTransform to construct ray_trafo.

generator(part='train')[source]

Yield pairs of low dose observations and (virtual) ground truth.

Parameters

part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

Yields

(observation, ground_truth)

observation : odl element with shape (1000, 513)

The values depend on the observation_model and min_photon_count parameters that were passed to __init__().

ground_truth : odl element with shape (362, 362)

The values lie in the range [0., 1.].

get_ray_trafo(**kwargs)[source]

Return the ray transform that is a noiseless version of the forward operator.

Parameters

impl ({'skimage', 'astra_cpu', 'astra_cuda'}, optional) – The backend implementation passed to odl.tomo.RayTransform.

Returns

ray_trafo – The ray transform that corresponds to the noiseless map from 362 x 362 images to the -log of their projections (sinograms).

Return type

odl operator
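
A sketch of an FBP baseline reconstruction (assuming the LoDoPaB data is downloaded and configured, and astra is available for the CPU backend):

    import odl
    from dival.datasets import LoDoPaBDataset

    dataset = LoDoPaBDataset(impl='astra_cpu')
    ray_trafo = dataset.get_ray_trafo(impl='astra_cpu')
    fbp_op = odl.tomo.fbp_op(ray_trafo, filter_type='Hann')
    obs, gt = dataset.get_sample(0, part='test')
    reco = fbp_op(obs)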

get_sample(index, part='train', out=None)[source]

Get a single sample of the dataset. Returns a pair of a (virtual) ground truth and its low dose observation, either of which can be omitted via the out parameter.

Parameters
  • index (int) – The index into the dataset part.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (tuple of array-likes or bools, optional) –

    out == (out_observation, out_ground_truth)

    out_observation : array-like or bool

    Shape (1000, 513). If an odl element or array is passed, the observation is written to it. If True, a new odl element holding the observation is created (the default). If False, no observation is returned.

    out_ground_truth : array-like or bool

    Shape (362, 362). If an odl element or array is passed, the ground truth is written to it. If True, a new odl element holding the ground truth is created (the default). If False, no ground truth is returned.

Returns

observation : odl element or np.ndarray or None

Depending on the value of out_observation (see parameter out), a newly created odl element, out_observation or None is returned. The observation values depend on the observation_model and min_photon_count parameters that were given to the constructor.

ground_truth : odl element or np.ndarray or None

Depending on the value of out_ground_truth (see parameter out), a newly created odl element, out_ground_truth or None is returned. The values lie in the range [0., 1.].

Return type

(observation, ground_truth)

get_samples(key, part='train', out=None)[source]

Get a slice of the dataset. Returns a pair of (virtual) ground truth data and its low dose observation data, either of which can be omitted via the out parameter.

Parameters
  • key (slice or range) – The indices into the dataset part.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (tuple of arrays or bools, optional) –

    out == (out_observation, out_ground_truth)

    out_observation : np.ndarray or bool

    If an array is passed, the observation data is written to it. If True, a new array holding the observation data is created (the default). If False, no observation data is returned.

    out_ground_truth : np.ndarray or bool

    If an array is passed, the ground truth data is written to it. If True, a new array holding the ground truth data is created (the default). If False, no ground truth data is returned.

Returns

observation : np.ndarray or None

Shape (samples, 1000, 513). Depending on the value of out_observation (see parameter out), a newly created array, out_observation or None is returned. The observation values depend on the observation_model and min_photon_count parameters that were given to the constructor.

ground_truth : np.ndarray or None

Shape (samples, 362, 362). Depending on the value of out_ground_truth (see parameter out), a newly created array, out_ground_truth or None is returned. The values lie in the range [0., 1.].

Return type

(observation, ground_truth)

get_indices_for_patient(rel_patient_id, part='train')[source]

Return the indices of the samples from one patient. If self.sorted_by_patient is True, the indices are consecutive.

Parameters
  • rel_patient_id (int) – Patient id, relative to the part.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

Returns

indices – The indices of the samples from the patient.

Return type

array

static check_for_lodopab()[source]

Fast check whether the first and last files of each dataset part exist under the configured data path.

Returns

exists – Whether LoDoPaB seems to exist.

Return type

bool

static get_num_patients(part='train')[source]

Return the number of patients in a dataset part.

Parameters

part ({'train', 'validation', 'test'}, optional) – Whether to return the number of train, validation or test patients. Default is 'train'.

static get_patient_ids(relative=True)[source]

Return the (relative) patient id for all samples of all dataset parts.

Parameters

relative (bool, optional) – Whether to use ids relative to the dataset part. The csv files store absolute indices, where “train_ids < validation_ids < test_ids”. If False, these absolute indices are returned. If True, the smallest absolute id of the part is subtracted, giving zero-based (relative) patient ids. Default: True.

Returns

ids – For each part: an array with the (relative) patient ids for all samples (length: number of samples in the corresponding part).

Return type

dict of array

Raises

OSError – An OSError is raised if one of the csv files containing the patient ids is missing in the configured data path.

static get_idx_sorted_by_patient(ids=None)[source]

Return indices that allow access to each dataset part in patient id order.

Note: in most cases this method should not be called directly. Rather specify sorted_by_patient=True to the constructor if applicable. A plausible use case of this method, however, is to access existing cache files that were created with sorted_by_patient=False. In this case, the dataset should be constructed with sorted_by_patient=False, wrapped by a CachedDataset and then reordered with ReorderedDataset using the indices returned by this method, as sketched below.

Parameters

ids (dict of array-like, optional) – Patient ids as returned by get_patient_ids(). It is not relevant to this function whether they are relative.

Returns

idx – Indices that allow access to each dataset part in patient id order. Each array value is an index into the samples in original order (as stored in the HDF5 files). I.e.: By iterating the samples with index idx[part][i] for i = 0, 1, 2, ... one first obtains all samples from one patient, then continues with the samples of the second patient, and so on.

Return type

dict of array

Raises

OSError – An OSError is raised if ids is None and one of the csv files containing the patient ids is missing in the configured data path.
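
A sketch of the cache-reordering use case described above (the cache filenames are illustrative; the caches are assumed to have been created from a dataset with sorted_by_patient=False):

    from dival.datasets import (CachedDataset, LoDoPaBDataset,
                                ReorderedDataset)

    dataset = LoDoPaBDataset(sorted_by_patient=False, impl='astra_cpu')
    cached = CachedDataset(
        dataset, dataset.space,
        {'train': ('cache_train_obs.npy', 'cache_train_gt.npy')})
    idx = LoDoPaBDataset.get_idx_sorted_by_patient()
    sorted_dataset = ReorderedDataset(cached, idx)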

class dival.datasets.AngleSubsetDataset(dataset, angle_indices, impl=None)[source]

Bases: dival.datasets.dataset.Dataset

CT dataset that selects a subset of the angles of a basis CT dataset.

__init__(dataset, angle_indices, impl=None)[source]
Parameters
  • dataset (Dataset) –

    Basis CT dataset. Requirements:

    • sample elements are (observation, ground_truth)

    • get_ray_trafo() gives corresponding ray transform.

  • angle_indices (array-like or slice) – Indices of the angles to use from the observations.

  • impl ({'skimage', 'astra_cpu', 'astra_cuda'}, optional) – Implementation passed to odl.tomo.RayTransform to construct ray_trafo.
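
For example, a sparse-angle variant of the 'lodopab' standard dataset using every 10th of the 1000 angles:

    from dival.datasets import AngleSubsetDataset, get_standard_dataset

    lodopab = get_standard_dataset('lodopab', impl='astra_cpu')
    sparse = AngleSubsetDataset(lodopab, slice(0, 1000, 10), impl='astra_cpu')
    ray_trafo = sparse.ray_trafo  # matches the 100 selected angles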

get_ray_trafo(**kwargs)[source]

Return the ray transform that matches the subset of angles specified to the constructor via angle_indices.

generator(part='train')[source]

Yield data.

The default implementation calls get_sample() if the dataset implements it (i.e., supports random access).

Parameters

part ({'train', 'validation', 'test'}, optional) – Whether to yield train, validation or test data. Default is 'train'.

Yields

data (odl element or tuple of odl elements) – Sample of the dataset.

get_sample(index, part='train', out=None)[source]

Get single sample by index.

Parameters
  • index (int) – Index of the sample.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (array-like or tuple of (array-like or bool) or None) –

    Array(s) (or e.g. odl element(s)) to which the sample is written. A tuple should be passed if the dataset returns two or more arrays per sample (i.e. pairs, …). If a tuple element is a bool, it has the following meaning:

    True

    Create a new array and return it.

    False

    Do not return this array, i.e. None is returned.

Returns

sample – E.g. for a pair dataset: (array, None) if out=(True, False).

Return type

[tuple of ] (array-like or None)

get_samples(key, part='train', out=None)[source]

Get samples by slice or range.

The default implementation calls get_sample() if the dataset implements it.

Parameters
  • key (slice or range) – Indexes of the samples.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (array-like or tuple of (array-like or bool) or None) –

    Array(s) (or e.g. odl element(s)) to which the samples are written. The first dimension must match the number of samples requested. A tuple should be passed if the dataset returns two or more arrays per sample (i.e. pairs, …). If a tuple element is a bool, it has the following meaning:

    True

    Create a new array and return it.

    False

    Do not return this array, i.e. None is returned.

Returns

samples – If the dataset has multiple arrays per sample, a tuple holding arrays is returned. E.g. for a pair dataset: (array, None) if out=(True, False). The samples are stacked in the first (additional) dimension of each array.

Return type

[tuple of ] (array-like or None)

class dival.datasets.CachedDataset(dataset, space, cache_files, size=None)[source]

Bases: dival.datasets.dataset.Dataset

Dataset that allows replacing elements of a dataset with cached data from .npy files.

The arrays in the .npy files must have shape (self.get_len(part),) + self.space[i].shape for the i-th component.

__init__(dataset, space, cache_files, size=None)[source]
Parameters
  • dataset (Dataset) – Original dataset from which non-cached elements are used. Must support random access if any elements are not cached.

  • space ([tuple of ] odl.space.base_tensors.TensorSpace, optional) – The space(s) of the elements of samples as a tuple. This may differ from the space of the original dataset, e.g. when domain-changing operations on the elements were precomputed into the cache.

  • cache_files (dict of [tuple of ] (str or None)) –

    Filenames of the cache files for each part and for each component. The part ('train', …) is the key to the dict. For each part, a tuple of filenames should be provided, each of which can be None, meaning that this component should be fetched from the original dataset. If the dataset only provides one element per sample, the filename does not have to be packed inside a tuple. If a key is omitted, the part is fetched from the original dataset.

    As an example, for a CT dataset with cached FBPs instead of observations for parts 'train' and 'validation':

    {'train':      ('cache_train_fbp.npy',      None),
     'validation': ('cache_validation_fbp.npy', None)}
    

  • size (dict of int, optional) – Numbers of samples for each part. If a field is omitted or has value None, all available samples are used, which may be less than the number of samples in the original dataset if the cache contains fewer samples. Default: {}.

generator(part='train')[source]

Yield data.

The default implementation calls get_sample() if the dataset implements it (i.e., supports random access).

Parameters

part ({'train', 'validation', 'test'}, optional) – Whether to yield train, validation or test data. Default is 'train'.

Yields

data (odl element or tuple of odl elements) – Sample of the dataset.

get_sample(index, part='train', out=None)[source]

Get single sample by index.

Parameters
  • index (int) – Index of the sample.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (array-like or tuple of (array-like or bool) or None) –

    Array(s) (or e.g. odl element(s)) to which the sample is written. A tuple should be passed if the dataset returns two or more arrays per sample (i.e. pairs, …). If a tuple element is a bool, it has the following meaning:

    True

    Create a new array and return it.

    False

    Do not return this array, i.e. None is returned.

Returns

sample – E.g. for a pair dataset: (array, None) if out=(True, False).

Return type

[tuple of ] (array-like or None)

get_samples(key, part='train', out=None)[source]

Get samples by slice or range.

The default implementation calls get_sample() if the dataset implements it.

Parameters
  • key (slice or range) – Indexes of the samples.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (array-like or tuple of (array-like or bool) or None) –

    Array(s) (or e.g. odl element(s)) to which the samples are written. The first dimension must match the number of samples requested. A tuple should be passed if the dataset returns two or more arrays per sample (i.e. pairs, …). If a tuple element is a bool, it has the following meaning:

    True

    Create a new array and return it.

    False

    Do not return this array, i.e. None is returned.

Returns

samples – If the dataset has multiple arrays per sample, a tuple holding arrays is returned. E.g. for a pair dataset: (array, None) if out=(True, False). The samples are stacked in the first (additional) dimension of each array.

Return type

[tuple of ] (array-like or None)

dival.datasets.generate_cache_files(dataset, cache_files, size=None, flush_interval=1000)[source]

Generate cache files for CachedDataset.

Parameters
  • dataset (Dataset) – Dataset from which to cache samples.

  • cache_files (dict of [tuple of ] (str or None)) –

    Filenames of the cache files for each part and for each component to be cached. The part ('train', …) is the key to the dict. For each part, a tuple of filenames should be provided, each of which can be None, meaning that this component should not be cached. If the dataset only provides one element per sample, the filename does not have to be packed inside a tuple. If a key is omitted, the part is not cached.

    As an example, for a CT dataset with cached FBPs instead of observations for parts 'train' and 'validation':

    {'train':      ('cache_train_fbp.npy',      None),
     'validation': ('cache_validation_fbp.npy', None)}
    

  • size (dict of int, optional) – Numbers of samples to cache for each dataset part. If a field is omitted or has value None, all samples are cached. Default: {}.

  • flush_interval (int, optional) – Number of samples to retrieve before flushing to file (using memmap). This number of samples should fit into the system's main memory (RAM). If -1, each file content is only flushed once at the end.
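
A sketch combining generate_cache_files() with CachedDataset (here fbp_dataset is an assumed name for a pair dataset whose first component should be cached, e.g. an FBPDataset as documented below):

    from dival.datasets import CachedDataset, generate_cache_files

    cache_files = {'train':      ('cache_train_fbp.npy',      None),
                   'validation': ('cache_validation_fbp.npy', None)}
    generate_cache_files(fbp_dataset, cache_files)
    cached = CachedDataset(fbp_dataset, fbp_dataset.space, cache_files)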

class dival.datasets.FBPDataset(dataset, ray_trafo, filter_type='Hann', frequency_scaling=1.0)[source]

Bases: dival.datasets.dataset.Dataset

Dataset computing filtered back-projections for a CT dataset on the fly.

Each sample is a pair of an FBP and a ground truth image.

__init__(dataset, ray_trafo, filter_type='Hann', frequency_scaling=1.0)[source]
Parameters
  • dataset (Dataset) – CT dataset. FBPs are computed from the observations, the ground truth is taken directly from the dataset.

  • ray_trafo (odl.tomo.RayTransform) – Ray transform from which the FBP operator is constructed.

  • filter_type (str, optional) – Filter type accepted by odl.tomo.fbp_op(). Default: 'Hann'.

  • frequency_scaling (float, optional) – Relative cutoff frequency passed to odl.tomo.fbp_op(). Default: 1.0.
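
A usage sketch (assuming the 'lodopab' standard dataset is available with the 'astra_cpu' backend):

    from dival.datasets import FBPDataset, get_standard_dataset

    lodopab = get_standard_dataset('lodopab', impl='astra_cpu')
    fbp_dataset = FBPDataset(lodopab, lodopab.ray_trafo, filter_type='Hann')
    fbp, gt = fbp_dataset.get_sample(0, part='train')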

generator(part='train')[source]

Yield data.

The default implementation calls get_sample() if the dataset implements it (i.e., supports random access).

Parameters

part ({'train', 'validation', 'test'}, optional) – Whether to yield train, validation or test data. Default is 'train'.

Yields

data (odl element or tuple of odl elements) – Sample of the dataset.

get_sample(index, part='train', out=None)[source]

Get single sample by index.

Parameters
  • index (int) – Index of the sample.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (array-like or tuple of (array-like or bool) or None) –

    Array(s) (or e.g. odl element(s)) to which the sample is written. A tuple should be passed if the dataset returns two or more arrays per sample (i.e. pairs, …). If a tuple element is a bool, it has the following meaning:

    True

    Create a new array and return it.

    False

    Do not return this array, i.e. None is returned.

Returns

sample – E.g. for a pair dataset: (array, None) if out=(True, False).

Return type

[tuple of ] (array-like or None)

get_samples(key, part='train', out=None)[source]

Get samples by slice or range.

The default implementation calls get_sample() if the dataset implements it.

Parameters
  • key (slice or range) – Indexes of the samples.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (array-like or tuple of (array-like or bool) or None) –

    Array(s) (or e.g. odl element(s)) to which the samples are written. The first dimension must match the number of samples requested. A tuple should be passed if the dataset returns two or more arrays per sample (i.e. pairs, …). If a tuple element is a bool, it has the following meaning:

    True

    Create a new array and return it.

    False

    Do not return this array, i.e. None is returned.

Returns

samples – If the dataset has multiple arrays per sample, a tuple holding arrays is returned. E.g. for a pair dataset: (array, None) if out=(True, False). The samples are stacked in the first (additional) dimension of each array.

Return type

[tuple of ] (array-like or None)

class dival.datasets.ReorderedDataset(dataset, idx)[source]

Bases: dival.datasets.dataset.Dataset

Dataset that reorders the samples of another dataset by specified index arrays for each part.

__init__(dataset, idx)[source]
Parameters
  • dataset (Dataset) – Dataset to take the samples from. Must support random access.

  • idx (dict of array-like) – Indices into the original dataset for each part. Each array-like must have (at least) the same length as the part.

get_sample(index, part='train', out=None)[source]

Get single sample by index.

Parameters
  • index (int) – Index of the sample.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (array-like or tuple of (array-like or bool) or None) –

    Array(s) (or e.g. odl element(s)) to which the sample is written. A tuple should be passed if the dataset returns two or more arrays per sample (i.e. pairs, …). If a tuple element is a bool, it has the following meaning:

    True

    Create a new array and return it.

    False

    Do not return this array, i.e. None is returned.

Returns

sample – E.g. for a pair dataset: (array, None) if out=(True, False).

Return type

[tuple of ] (array-like or None)