.. _dataset_and_dataloader_configuration:
Dataset and dataloader configuration
====================================
This section describes the dataset configuration used to load experiments into the Experanto dataloaders. It includes global settings, modality-specific configurations, and dataloader parameters.
Default YAML configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: yaml

    dataset:
      global_sampling_rate: null
      global_chunk_size: null
      add_behavior_as_channels: false
      replace_nans_with_means: false
      cache_data: false
      out_keys:
        - screen
        - responses
        - eye_tracker
        - treadmill
      modality_config:
        screen:
          keep_nans: false
          sampling_rate: 30
          chunk_size: 60
          valid_condition:
            tier: train
          offset: 0
          sample_stride: 1
          include_blanks: true
          transforms:
            normalization: normalize
            Resize:
              _target_: torchvision.transforms.v2.Resize
              size:
                - 144
                - 256
          interpolation:
            rescale: true
            rescale_size:
              - 144
              - 256
        responses:
          keep_nans: false
          sampling_rate: 8
          chunk_size: 16
          offset: 0.0
          transforms:
            normalization: normalize_variance_only
          interpolation:
            interpolation_mode: nearest_neighbor
          filters:
            nan_filter:
              __target__: experanto.filters.common_filters.nan_filter
              __partial__: true
              vicinity: 0.05
        eye_tracker:
          keep_nans: false
          sampling_rate: 30
          chunk_size: 60
          offset: 0
          transforms:
            normalization: normalize
          interpolation:
            interpolation_mode: nearest_neighbor
          filters:
            nan_filter:
              __target__: experanto.filters.common_filters.nan_filter
              __partial__: true
              vicinity: 0.05
        treadmill:
          keep_nans: false
          sampling_rate: 30
          chunk_size: 60
          offset: 0
          transforms:
            normalization: normalize
          interpolation:
            interpolation_mode: nearest_neighbor
          filters:
            nan_filter:
              __target__: experanto.filters.common_filters.nan_filter
              __partial__: true
              vicinity: 0.05
    dataloader:
      batch_size: 16
      shuffle: true
      num_workers: 2
      pin_memory: true
      drop_last: true
      prefetch_factor: 2

Viewing the configuration
^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python

    from omegaconf import OmegaConf, open_dict
    from experanto.configs import DEFAULT_CONFIG as cfg

    print(OmegaConf.to_yaml(cfg))

Modifying the configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can change parameters programmatically:
.. code-block:: python

    cfg.dataset.modality_config.screen.include_blanks = True
    cfg.dataset.modality_config.screen.valid_condition = {"tier": "train"}
    cfg.dataloader.num_workers = 8

Configuration options
^^^^^^^^^^^^^^^^^^^^^
Dataset options
"""""""""""""""
``global_sampling_rate``
Overrides the sampling rate for all modalities. Set to ``None`` (``null`` in
YAML) to use per-modality rates.
``global_chunk_size``
Overrides the chunk size (number of time steps/data points) for all
modalities. Set to ``None`` (``null`` in YAML) to use per-modality sizes.
The time window covered by a chunk is ``chunk_size / sampling_rate``, so
what a shared chunk size means depends on whether ``global_sampling_rate``
is set:
- **With** ``global_sampling_rate`` set: all modalities share the same
output rate, so a single ``global_chunk_size`` unambiguously gives every
modality the same time window.
- **Without** ``global_sampling_rate`` (per-modality rates active):
different modalities have different rates, so the same sample count
produces different durations. In this case, leave ``global_chunk_size``
as ``None`` and set ``chunk_size`` per modality instead.
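
The mismatch in the second case can be checked with a few lines of arithmetic. This is a plain-Python sketch, not Experanto code: the rates are the per-modality defaults from this page, and the helper name ``chunk_duration`` is made up for illustration.

.. code-block:: python

    # Per-modality output rates (Hz) taken from the default configuration.
    rates = {"screen": 30, "responses": 8, "eye_tracker": 30, "treadmill": 30}


    def chunk_duration(chunk_size, sampling_rate):
        """Time window in seconds covered by one chunk."""
        return chunk_size / sampling_rate


    # One shared sample count applied while per-modality rates are active.
    global_chunk_size = 60
    durations = {
        name: chunk_duration(global_chunk_size, rate) for name, rate in rates.items()
    }
    print(durations)

With these numbers the 30 Hz modalities cover 2 s while ``responses`` covers 7.5 s, which is why a shared chunk size only makes sense together with a shared rate.
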
``add_behavior_as_channels``
If ``True``, concatenate behavioral data (e.g., eye tracker, treadmill) as
additional channels to the screen data.
``replace_nans_with_means``
If ``True``, replace NaN values with the mean of non-NaN values.
``cache_data``
If ``True``, cache interpolated data in memory for faster access.
``out_keys``
List of modality keys to include in the output dictionary.
``normalize_timestamps``
If ``True``, normalize timestamps to start from 0.
Modality options
""""""""""""""""
Each modality (e.g., screen, responses, eye_tracker, treadmill) supports:
``keep_nans``
Whether to keep NaN values in the output.
``sampling_rate``
Controls the spacing of the time points that the dataset constructs and
passes to :meth:`~experanto.experiment.Experiment.interpolate`. Concretely,
each item in the dataset requests values at times
``start, start + 1/sampling_rate, start + 2/sampling_rate, …``. The
interpolator then interpolates the stored raw samples at those points.
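
The spacing rule can be written out as a short sketch. The function name ``query_times`` is hypothetical and this is not the Experanto implementation; it only reproduces the ``start + k / sampling_rate`` pattern described above, with the number of points given by ``chunk_size``.

.. code-block:: python

    def query_times(start, sampling_rate, chunk_size):
        """Time points (in seconds) requested for one dataset item."""
        return [start + k / sampling_rate for k in range(chunk_size)]


    # Four points at 8 Hz are spaced 0.125 s apart.
    times = query_times(start=10.0, sampling_rate=8, chunk_size=4)
    print(times)  # [10.0, 10.125, 10.25, 10.375]
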
``chunk_size``
Number of **time steps/data points** returned per item for this modality.
Internally, ``sampling_rate`` defines the spacing of the time points passed
to the interpolator, so the covered time window is:

.. math::

    \text{duration (s)} = \frac{\text{chunk\_size}}{\text{sampling\_rate}}

Note that ``sampling_rate`` here is the output rate, that is, the *spacing*
of the time points requested from the underlying experiment (see
``sampling_rate`` above). The native acquisition rate of the signal does not
enter this calculation: the interpolator resamples the stored values at each
requested time, for example by looking up the stored sample closest to it.
When per-modality output rates differ, ``chunk_size`` must be set per
modality to cover the same time window. The default configuration keeps
all modalities at a 2-second window while using different output rates:

============  =============  ==========  ========
Modality      sampling_rate  chunk_size  Duration
============  =============  ==========  ========
screen        30 Hz          60          2 s
eye_tracker   30 Hz          60          2 s
treadmill     30 Hz          60          2 s
responses     8 Hz           16          2 s
============  =============  ==========  ========

If you unify all rates with ``global_sampling_rate``, set
``global_chunk_size`` instead; this per-modality value is then ignored.
In general: ``chunk_size = desired_duration_seconds * sampling_rate``.
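
As a worked example of that rule, the sketch below derives per-modality chunk sizes from a target duration. The helper ``chunk_size_for`` is hypothetical, and rejecting durations that do not give a whole number of samples is a design choice of this sketch, not documented Experanto behaviour.

.. code-block:: python

    def chunk_size_for(duration_s, sampling_rate):
        """Number of samples needed to cover duration_s at sampling_rate."""
        n = duration_s * sampling_rate
        if abs(n - round(n)) > 1e-9:
            raise ValueError(
                f"{duration_s} s at {sampling_rate} Hz is not a whole number of samples"
            )
        return int(round(n))


    # A 2-second window at the default screen and responses rates.
    sizes = {name: chunk_size_for(2.0, rate) for name, rate in {"screen": 30, "responses": 8}.items()}
    print(sizes)  # {'screen': 60, 'responses': 16}
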
``offset``
Time offset in seconds applied to the time points constructed for this
modality. For example, if the screen is queried at times
``[t, t + 1/sampling_rate, …]``, setting ``offset = 0.1`` on responses
means responses are queried at ``[t + 0.1, t + 0.1 + 1/sampling_rate, …]``.
Useful for aligning modalities with known temporal delays relative to the
screen stimulus.
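
The effect of ``offset`` on two modalities of the same item can be sketched as follows. The function name is hypothetical and the values are illustrative; only the shift by ``offset`` seconds is taken from the description above.

.. code-block:: python

    def query_times_with_offset(t, sampling_rate, chunk_size, offset=0.0):
        """Time points for one modality, shifted by its offset in seconds."""
        return [t + offset + k / sampling_rate for k in range(chunk_size)]


    # Same item start time t, but responses are queried 0.1 s later.
    screen_times = query_times_with_offset(5.0, sampling_rate=30, chunk_size=3)
    response_times = query_times_with_offset(5.0, sampling_rate=8, chunk_size=3, offset=0.1)
    print(screen_times[0], response_times[0])
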
``transforms``
Dictionary of transforms to apply at the dataset level. This is
modality-specific: not all modalities support the same set of transforms.
Examples include ``"normalize"`` for sequences such as eye_tracker, and
``"normalize_variance_only"`` for responses.
To understand how transforms are loaded and applied internally, refer to
:meth:`experanto.datasets.ChunkDataset.initialize_transforms`. If you need
to implement a custom transform, we recommend following the same pattern
used there. In particular, note how each entry in the ``transforms``
dictionary is checked and, when it is a config ``dict``, instantiated via
Hydra before being added to the transform pipeline.
You can point Experanto to any callable (function or class) by using
Hydra's ``_target_`` key, which triggers ``hydra.utils.instantiate``
under the hood (e.g., ``_target_: my_package.my_module.MyTransform``).
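
A minimal sketch of what such a custom transform and its config entry could look like is shown below. Everything named here is hypothetical: the ``ScaleBy`` class, the ``scale`` key, and the dotted path ``my_package.my_module.ScaleBy`` are placeholders. To stay self-contained the sketch does not call Hydra; it constructs the object directly from the remaining keys, which is roughly what instantiation does for a simple, non-nested entry.

.. code-block:: python

    class ScaleBy:
        """Hypothetical custom transform: multiplies its input by a constant."""

        def __init__(self, factor=1.0):
            self.factor = factor

        def __call__(self, x):
            return x * self.factor


    # A ``transforms`` dictionary as it might look once loaded into Python.
    # The dotted path is a placeholder for wherever the class actually lives.
    transforms_cfg = {
        "normalization": "normalize",
        "scale": {"_target_": "my_package.my_module.ScaleBy", "factor": 0.5},
    }

    # Every key except ``_target_`` is passed to the callable as a keyword argument.
    kwargs = {k: v for k, v in transforms_cfg["scale"].items() if k != "_target_"}
    transform = ScaleBy(**kwargs)
    print(transform(4.0))  # 2.0
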
``interpolation``
Interpolation settings. These are modality-specific: not all modalities
support the same set of interpolation options. Examples include
``"rescale"`` for the screen and ``"interpolation_mode"`` (e.g.,
``"nearest_neighbor"``) for sequences.
``filters``
Dictionary of filter functions to apply to the data.
Dataloader options
""""""""""""""""""
All standard ``torch.utils.data.DataLoader`` options are supported. See the
PyTorch DataLoader documentation for the full list of available parameters.