dival.datasets.lodopab_dataset module

Provides LoDoPaBDataset.

Provides simple access to the LoDoPaB-CT dataset documented in a Data Descriptor article.

dival.datasets.lodopab_dataset.download_lodopab()[source]
class dival.datasets.lodopab_dataset.LoDoPaBDataset(min_pt=None, max_pt=None, observation_model='post-log', min_photon_count=None, sorted_by_patient=False, impl='astra_cuda')[source]

Bases: dival.datasets.dataset.Dataset

The LoDoPaB-CT dataset, which is documented in the Data Descriptor article https://www.nature.com/articles/s41597-021-00893-z and hosted on https://zenodo.org/record/3384092. It is a simulated low dose CT dataset based on real reconstructions from the LIDC-IDRI dataset.

The dataset contains 42895 pairs of images and projection data. For simulation, a ray transform with parallel beam geometry using 1000 angles and 513 detector pixels is used. Poisson noise corresponding to 4096 incident photons per pixel before attenuation is applied to the projection data. The images have a size of 362x362 px.

An ODL ray transform that corresponds to the noiseless forward operator can be obtained via the get_ray_trafo method of this dataset. Additionally, the ray_trafo attribute holds a ray transform instance, which is created during __init__(). Note: By default, the 'astra_cuda' implementation backend is used, which requires both astra and a CUDA-enabled GPU to be available. You can choose a different backend by passing impl='skimage' or impl='astra_cpu'.

Further functionalities:

  • converting the stored post-log observations to pre-log observations on the fly (cf. observation_model parameter of __init__())

  • sorting by patient ids (cf. sorted_by_patient parameter of __init__())

  • changing the zero photon count replacement value of 0.1 used for pre-log observations (cf. min_photon_count parameter of __init__())

space
(space[0], space[1]), where
space[0]

odl.uniform_discr([0., -0.1838], [3.1416, 0.1838], (1000, 513), dtype='float32')

space[1]

odl.uniform_discr(min_pt, max_pt, (362, 362), dtype='float32'), with the min_pt and max_pt parameters passed to __init__()
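The bounds and shapes of the two spaces determine the grid spacing. The following is a small worked check in plain NumPy, assuming the default min_pt = [-0.13, -0.13] and max_pt = [0.13, 0.13] documented for __init__(). The variable names are illustrative only, and np.pi is used in place of the rounded value 3.1416.

```python
import numpy as np

# Bounds and shapes copied from the space description above.
angle_min, angle_max, num_angles = 0.0, np.pi, 1000
det_min, det_max, num_det_pixels = -0.1838, 0.1838, 513
img_min, img_max, num_img_pixels = -0.13, 0.13, 362  # default min_pt / max_pt

# Cell sizes of the uniform discretizations (extent divided by cell count).
angle_step = (angle_max - angle_min) / num_angles
det_pixel_size = (det_max - det_min) / num_det_pixels
img_pixel_size = (img_max - img_min) / num_img_pixels

# Half the length of the image diagonal, for comparison with the
# detector half-width of 0.1838.
img_half_diagonal = np.sqrt(2.0) * img_max

print(f"angle step:          {angle_step:.6f} rad")
print(f"detector pixel size: {det_pixel_size:.7f}")
print(f"image pixel size:    {img_pixel_size:.7f}")
print(f"image half-diagonal: {img_half_diagonal:.4f}")
```

The image half-diagonal 0.13 * sqrt(2) rounds to the detector half-width 0.1838, which suggests that the detector was sized to cover the full image diagonal.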

shape

(362, 362)

train_len

35820

validation_len

3522

test_len

3553

random_access

True

num_elements_per_sample

2

ray_trafo

Ray transform corresponding to the noiseless forward operator.

Type

odl.tomo.RayTransform

sorted_by_patient

Whether the samples are sorted by patient id. Default: False.

Type

bool

rel_patient_ids

Relative patient ids of the samples in the original non-sorted order for each part, as returned by LoDoPaBDataset.get_patient_ids(). None, if the csv files are not found.

Type

(dict of array) or None

__init__(min_pt=None, max_pt=None, observation_model='post-log', min_photon_count=None, sorted_by_patient=False, impl='astra_cuda')[source]
Parameters
  • min_pt ([float, float], optional) – Minimum values of the lp space. Default: [-0.13, -0.13].

  • max_pt ([float, float], optional) – Maximum values of the lp space. Default: [0.13, 0.13].

  • observation_model ({'post-log', 'pre-log'}, optional) –

    The observation model to use. The default is 'post-log'.

    'post-log'

    Observations are linearly related to the normalized ground truth via the ray transform, obs = ray_trafo(gt) + noise. Note that the scaling of the observations matches the normalized ground truth, i.e., they are divided by the linear attenuation of 3071 HU.

    'pre-log'

    Observations are non-linearly related to the ground truth, as given by the Beer-Lambert law. The model is obs = exp(-ray_trafo(gt * MU(3071 HU))) + noise, where MU(3071 HU) is the factor by which the ground truth was normalized.

  • min_photon_count (float, optional) – Replacement value for a simulated photon count of zero. If observation_model == 'post-log', a value greater than zero is required in order to avoid undefined values. The default is 0.1 for both the 'post-log' and the 'pre-log' model.

  • sorted_by_patient (bool, optional) – Whether to sort the samples by patient id. Useful to resplit the dataset. See also get_indices_for_patient(). Note that the slices of each patient are ordered randomly with respect to the z-location in any case. Default: False.

  • impl ({'skimage', 'astra_cpu', 'astra_cuda'}, optional) – Implementation passed to odl.tomo.RayTransform to construct ray_trafo.
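The relation between the two observation models can be illustrated with plain NumPy. The following is a minimal sketch, not the library's implementation: it assumes that the conversion acts elementwise as exp(-obs * mu_max), which follows from the two model formulas above (ignoring noise) because the ray transform is linear. The function names and the value of mu_max are hypothetical; the actual normalization factor MU(3071 HU) is defined by the library.

```python
import numpy as np


def postlog_to_prelog(obs_postlog, mu_max):
    """Map normalized post-log values to pre-log values, exp(-obs * mu_max)."""
    return np.exp(-np.asarray(obs_postlog, dtype=np.float64) * mu_max)


def prelog_to_postlog(obs_prelog, mu_max):
    """Inverse mapping, -log(obs) / mu_max; requires obs_prelog > 0."""
    return -np.log(np.asarray(obs_prelog, dtype=np.float64)) / mu_max


# Hypothetical normalization factor, chosen for illustration only.
mu_max = 80.0

postlog = np.array([0.0, 0.01, 0.05])
prelog = postlog_to_prelog(postlog, mu_max)
roundtrip = prelog_to_postlog(prelog, mu_max)

print(prelog)     # decreasing values, starting at 1.0 for zero attenuation
print(roundtrip)  # recovers the post-log values up to rounding
```

The inverse mapping takes a logarithm, which is why a photon count of zero has to be replaced by a positive value before post-log observations can be computed.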

generator(part='train')[source]

Yield pairs of low dose observations and (virtual) ground truth.

Parameters

part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

Yields

(observation, ground_truth)

observation (odl element, shape (1000, 513))

The values depend on the observation_model and min_photon_count parameters that were passed to __init__().

ground_truth (odl element, shape (362, 362))

The values lie in the range [0., 1.].
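Because generator() yields pairs lazily, a few samples can be inspected without iterating over a whole part. The helper below is a sketch with a hypothetical name (first_pairs); it only assumes that the object passed in offers generator(part=...) as documented above.

```python
from itertools import islice


def first_pairs(dataset, n, part='train'):
    """Collect the first n (observation, ground_truth) pairs of a part."""
    # islice stops the generator after n items instead of exhausting it.
    return list(islice(dataset.generator(part=part), n))
```

With a constructed dataset, first_pairs(dataset, 3, part='validation') would return a list of three (observation, ground_truth) tuples.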

get_ray_trafo(**kwargs)[source]

Return the ray transform that is a noiseless version of the forward operator.

Parameters

impl ({'skimage', 'astra_cpu', 'astra_cuda'}, optional) – The backend implementation passed to odl.tomo.RayTransform.

Returns

ray_trafo – The ray transform that corresponds to the noiseless map from 362 x 362 images to the -log of their projections (sinograms).

Return type

odl operator

get_sample(index, part='train', out=None)[source]

Get a single sample of the dataset. Returns a pair of low dose observation and (virtual) ground truth, either of which can be left out via the out parameter.

Parameters
  • index (int) – The index into the dataset part.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (tuple of array-likes or bools, optional) –

    out==(out_observation, out_ground_truth)

    out_observation (array-like or bool)

    Shape (1000, 513). If an odl element or array is passed, the observation is written to it. If True, a new odl element holding the observation is created (the default). If False, no observation is returned.

    out_ground_truth (array-like or bool)

    Shape (362, 362). If an odl element or array is passed, the ground truth is written to it. If True, a new odl element holding the ground truth is created (the default). If False, no ground truth is returned.

Returns

observation (odl element or np.ndarray or None)

Depending on the value of out_observation (see parameter out), a newly created odl element, out_observation or None is returned. The observation values depend on the observation_model and min_photon_count parameters that were given to the constructor.

ground_truth (odl element or np.ndarray or None)

Depending on the value of out_ground_truth (see parameter out), a newly created odl element, out_ground_truth or None is returned. The values lie in the range [0., 1.].

Return type

(observation, ground_truth)
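The out parameter can be used to skip loading one half of the pair. The following sketch uses a hypothetical helper name (get_observation_only) and relies only on the get_sample() behaviour described above: passing False for out_ground_truth means that no ground truth is returned.

```python
def get_observation_only(dataset, index, part='train'):
    """Return only the observation of one sample, skipping the ground truth."""
    # out=(True, False): create a new observation element, return no ground truth.
    observation, ground_truth = dataset.get_sample(
        index, part=part, out=(True, False))
    # According to the description above, ground_truth is None here.
    return observation
```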

get_samples(key, part='train', out=None)[source]

Get a slice of the dataset. Returns a pair of low dose observation data and (virtual) ground truth data, either of which can be left out via the out parameter.

Parameters
  • key (slice or range) – The indices into the dataset part.

  • part ({'train', 'validation', 'test'}, optional) – The data part. Default is 'train'.

  • out (tuple of arrays or bools, optional) –

    out==(out_observation, out_ground_truth)

    out_observation (np.ndarray or bool)

    If an array is passed, the observation data is written to it. If True, a new array holding the observation data is created (the default). If False, no observation data is returned.

    out_ground_truth (np.ndarray or bool)

    If an array is passed, the ground truth data is written to it. If True, a new array holding the ground truth data is created (the default). If False, no ground truth data is returned.

Returns

observation (np.ndarray or None)

Shape (samples, 1000, 513). Depending on the value of out_observation (see parameter out), a newly created array, out_observation or None is returned. The observation values depend on the observation_model and min_photon_count parameters that were given to the constructor.

ground_truth (np.ndarray or None)

Shape (samples, 362, 362). Depending on the value of out_ground_truth (see parameter out), a newly created array, out_ground_truth or None is returned. The values lie in the range [0., 1.].

Return type

(observation, ground_truth)
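Passing preallocated arrays as out lets the caller control where a batch is stored. The sketch below uses a hypothetical helper name (load_batch). The array shapes follow the return shapes documented above; the float32 dtype is an assumption based on the dtype of the dataset spaces.

```python
import numpy as np


def load_batch(dataset, start, stop, part='train'):
    """Load samples start..stop-1 of a part into preallocated arrays."""
    n = stop - start
    # Shapes (samples, 1000, 513) and (samples, 362, 362) as documented above;
    # float32 is assumed to match the dtype of the dataset spaces.
    observations = np.empty((n, 1000, 513), dtype=np.float32)
    ground_truths = np.empty((n, 362, 362), dtype=np.float32)
    # The arrays passed via out are filled by get_samples().
    dataset.get_samples(range(start, stop), part=part,
                        out=(observations, ground_truths))
    return observations, ground_truths
```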

get_indices_for_patient(rel_patient_id, part='train')[source]

Return the indices of the samples from one patient. If self.sorted_by_patient is True, the indices will be consecutive.

Parameters
  • rel_patient_id (int) – Patient id, relative to the part.

  • part ({'train', 'validation', 'test'}, optional) – The data part that the patient belongs to. Default is 'train'.

Returns

indices – The indices of the samples from the patient.

Return type

array
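The lookup can be pictured as a search in the per-sample patient id array of one part. The following NumPy sketch uses a hypothetical function name and made-up toy ids; it is not the library's implementation.

```python
import numpy as np


def indices_for_patient(rel_patient_ids_part, rel_patient_id):
    """Indices of all samples whose relative patient id matches."""
    ids = np.asarray(rel_patient_ids_part)
    return np.nonzero(ids == rel_patient_id)[0]


# Toy ids for six samples of one part (made up for illustration).
unsorted_ids = np.array([2, 0, 1, 0, 2, 2])
sorted_ids = np.sort(unsorted_ids)

print(indices_for_patient(unsorted_ids, 2))  # scattered indices: [0 4 5]
print(indices_for_patient(sorted_ids, 2))    # consecutive indices: [3 4 5]
```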

static check_for_lodopab()[source]

Fast check whether first and last file of each dataset part exist under the configured data path.

Returns

exists – Whether LoDoPaB seems to exist.

Return type

bool

static get_num_patients(part='train')[source]

Return the number of patients in a dataset part.

Parameters

part ({'train', 'validation', 'test'}, optional) – Whether to return the number of train, validation or test patients. Default is 'train'.

static get_patient_ids(relative=True)[source]

Return the (relative) patient id for all samples of all dataset parts.

Parameters

relative (bool, optional) – Whether to use ids relative to the dataset part. The csv files store absolute indices, where “train_ids < validation_ids < test_ids”. If False, these absolute indices are returned. If True, the smallest absolute id of the part is subtracted, giving zero-based (relative) patient ids. Default: True.

Returns

ids – For each part: an array with the (relative) patient ids for all samples (length: number of samples in the corresponding part).

Return type

dict of array

Raises

OSError – An OSError is raised if one of the csv files containing the patient ids is missing in the configured data path.
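The conversion from absolute to relative ids described for the relative parameter amounts to subtracting the smallest id of each part. The sketch below uses a hypothetical function name and made-up toy ids; it does not read the csv files.

```python
import numpy as np


def to_relative_ids(absolute_ids):
    """Subtract the smallest absolute id of each part to get zero-based ids."""
    return {part: np.asarray(ids) - np.min(ids)
            for part, ids in absolute_ids.items()}


# Toy absolute ids with train_ids < validation_ids < test_ids (made up).
absolute_ids = {
    'train': np.array([0, 0, 1, 2]),
    'validation': np.array([3, 3, 4]),
    'test': np.array([5, 6, 6]),
}
relative_ids = to_relative_ids(absolute_ids)
print(relative_ids['validation'])  # [0 0 1]
```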

static get_idx_sorted_by_patient(ids=None)[source]

Return indices that allow access to each dataset part in patient id order.

Note: In most cases this method should not be called directly. Rather, pass sorted_by_patient=True to the constructor if applicable. A plausible use case of this method, however, is to access existing cache files that were created with sorted_by_patient=False. In this case, the dataset should be constructed with sorted_by_patient=False, wrapped by a CachedDataset and then reordered with ReorderedDataset using the indices returned by this method.

Parameters

ids (dict of array-like, optional) – Patient ids as returned by get_patient_ids(). It is not relevant to this function whether they are relative.

Returns

idx – Indices that allow access to each dataset part in patient id order. Each array value is an index into the samples in original order (as stored in the HDF5 files). That is, by iterating the samples with index idx[part][i] for i = 0, 1, 2, ... one first obtains all samples from one patient, then continues with the samples of the second patient, and so on.

Return type

dict of array

Raises

OSError – An OSError is raised if ids is None and one of the csv files containing the patient ids is missing in the configured data path.
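Indices with this property can be obtained by sorting the patient ids of each part. The following is a NumPy sketch with a hypothetical function name and made-up toy ids, not the library's implementation. Using a stable sort, so that samples of the same patient keep their original relative order, is an assumption made here.

```python
import numpy as np


def idx_sorted_by_patient(ids):
    """For each part, indices that visit the samples in patient id order."""
    # kind='stable' keeps the original order among samples of one patient.
    return {part: np.argsort(np.asarray(part_ids), kind='stable')
            for part, part_ids in ids.items()}


# Toy patient ids for five samples of one part (made up for illustration).
ids = {'train': np.array([1, 0, 1, 0, 2])}
idx = idx_sorted_by_patient(ids)

print(idx['train'])                # [1 3 0 2 4]
print(ids['train'][idx['train']])  # [0 0 1 1 2]
```

Only the ordering of the ids matters for the sort, which is consistent with the remark above that it is irrelevant whether the ids are relative.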