.. _loading_dataset:

Loading a dataset object
========================

Dataset objects organize experimental data (from the :class:`~experanto.experiment.Experiment` class) for machine learning tasks, offering project-specific and configurable access for training and evaluation. They often serve as a source for creating dataloaders (see :func:`~experanto.dataloaders.get_multisession_dataloader`).

.. note::

   The key distinction between :class:`~experanto.experiment.Experiment` and a
   dataset object is one of **time discretization**.
   :class:`~experanto.experiment.Experiment` is a low-level interface: you hand
   it any array of time points and it returns values at those points via a
   lookup into the raw stored data. :class:`~experanto.datasets.ChunkDataset` 
   is used on top of it and imposes a specific time structure. For each item,
   it constructs a separate ``times`` array per modality using that modality's
   configured ``sampling_rate`` and ``chunk_size`` (``times = start + np.arange(chunk_size) / sampling_rate``),
   then calls :meth:`~experanto.experiment.Experiment.interpolate` for each
   modality independently. This is how all modalities end up covering the same
   time window with compatible shapes in a batch.
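
As a rough illustration of the time grids described in the note above, the following NumPy sketch builds two per-modality ``times`` arrays from that formula. It is not Experanto's internal code: the helper ``chunk_times`` and the example rates and chunk sizes are hypothetical, chosen so that ``chunk_size / sampling_rate`` equals 2 seconds for both modalities.

.. code-block:: python

    import numpy as np

    def chunk_times(start, chunk_size, sampling_rate):
        """Return chunk_size time points spaced 1 / sampling_rate apart."""
        return start + np.arange(chunk_size) / sampling_rate

    start = 10.0  # chunk start time in seconds (made-up value)

    # Hypothetical per-modality settings: both cover a 2-second window
    screen_times = chunk_times(start, chunk_size=60, sampling_rate=30)
    response_times = chunk_times(start, chunk_size=16, sampling_rate=8)

    print(screen_times.shape, response_times.shape)  # (60,) (16,)
    print(screen_times[0], response_times[0])        # 10.0 10.0

Both grids start at the same time point but differ in length and spacing, which is why each modality is interpolated separately.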

Key features of dataset objects
-------------------------------

Dataset objects provide several essential features:

- **Sampling Rate**: Defines the spacing of the time points that the dataset
  constructs and hands to the underlying :class:`~experanto.experiment.Experiment`
  for each item (``time_delta = 1 / sampling_rate``). The experiment then does
  a lookup into the raw stored data at those points.
- **Chunk Size**: Determines the number of values returned when calling the ``__getitem__`` method. This is crucial, for example, for deep learning models that use 3D convolutions over time, where single elements or small chunk sizes are insufficient to capture meaningful temporal patterns.
- **Modality Configuration**: Specifies the details of the interpolation, including:

  - The **interpolation method** used.
  - **Conditions** that the data must fulfill.
  - **Transformations** applied to the data (e.g., normalization, resizing, cropping, greyscale conversion).
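
To make the interplay of sampling rate, chunk size, and interpolation more concrete, here is a minimal NumPy sketch of a nearest-neighbor lookup on a regular time grid. It is an illustration under stated assumptions, not the library's implementation: the function ``nearest_neighbor_lookup`` and the toy timestamps and values are hypothetical.

.. code-block:: python

    import numpy as np

    def nearest_neighbor_lookup(raw_times, raw_values, query_times):
        """For each query time, return the raw value with the closest timestamp."""
        # Index of the first raw timestamp >= each query time, kept in bounds
        right = np.searchsorted(raw_times, query_times)
        right = np.clip(right, 1, len(raw_times) - 1)
        left = right - 1
        # Choose whichever neighbor is closer in time
        use_left = (query_times - raw_times[left]) <= (raw_times[right] - query_times)
        return raw_values[np.where(use_left, left, right)]

    # Irregularly sampled raw data (made-up values)
    raw_times = np.array([0.0, 0.13, 0.21, 0.38, 0.52])
    raw_values = np.array([10, 11, 12, 13, 14])

    sampling_rate = 10  # Hz, so time_delta = 0.1 s
    chunk_size = 5
    query_times = 0.0 + np.arange(chunk_size) / sampling_rate

    values = nearest_neighbor_lookup(raw_times, raw_values, query_times)
    print(values)  # [10 11 12 13 13]

The returned array always has ``chunk_size`` entries, regardless of how densely the raw data were recorded.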

Loading a dataset
-----------------
The following example adjusts the default configuration, selects the modalities to load, and creates a single-session dataset:

.. code-block:: python

    from experanto.datasets import ChunkDataset
    from torch.utils.data import DataLoader
    from omegaconf import OmegaConf
    from experanto.configs import DEFAULT_CONFIG as cfg

    cfg.dataset.modality_config.screen.transforms.Resize.size = [144, 144] 
    cfg.dataset.modality_config.screen.interpolation.rescale_size = [144, 144]
    cfg.dataset.modality_config.screen.transforms.greyscale = True
    modality_cfg = cfg.dataset.modality_config

    # Keep only 'screen' and 'responses' (add other modalities if necessary) for single-session loading
    selected_modalities = OmegaConf.create({
        'screen': modality_cfg.screen,
        'responses': modality_cfg.responses
    })

    root_folder = '../data/allen_data'
    sampling_rate = 60
    chunk_size = 60  # video data is loaded in multi-frame chunks so that temporal dynamics are captured

    train_dataset = ChunkDataset(
        root_folder=f'{root_folder}/experiment_951980471_train',
        global_sampling_rate=sampling_rate,
        global_chunk_size=chunk_size,
        modality_config=selected_modalities,
    )

This configuration ensures that:

- **Screen data** is preprocessed with normalization (by default), resizing, and greyscale conversion.
- **Response data** undergoes standardization and nearest-neighbor interpolation (by default).

Other modalities can be defined in the same manner as **Responses**. If your desired modalities do not match our existing data structures and config layout, you will need to implement them yourself.
We appreciate contributions to Experanto in the form of pull requests via GitHub to make more modalities accessible.

Sampling data from the dataset
------------------------------
To confirm that the dataset was created correctly, sample some data from it.
Sampling is done by indexing into the dataset. For example, to retrieve the first data chunk:

.. code-block:: python

    # Interpolation showcase using the dataset object
    sample = train_dataset[0]

    # Print the keys and their respective shapes
    print(sample.keys())
    for key in sample.keys():
        print(f'This is shape {sample[key].shape} for modality {key}')

This will output something like:

.. code-block:: text

    dict_keys(['screen', 'responses'])
    This is shape torch.Size([1, 60, 144, 144]) for modality screen
    This is shape torch.Size([60, 12]) for modality responses

Defining dataloaders
---------------------
Once the dataset is verified, we can define `DataLoader <https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader>`_ objects for training or other purposes. This allows easy batch processing during training:

.. code-block:: python

    # Define a DataLoader for the training set
    train_data_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

Verifying dataloader functionality
----------------------------------
To confirm that the **DataLoader** works as expected, we can iterate over it and inspect the batch data. For example, to check the shapes of the data in each batch:

.. code-block:: python

    # Interpolation showcase using the DataLoader
    for batch_idx, batch_data in enumerate(train_data_loader):
        # batch_data is a dictionary with keys 'screen', 'responses'
        screen_data = batch_data['screen']
        responses_data = batch_data['responses']
        
        # Print or inspect the batch
        print(f"Batch {batch_idx}:")
        print("Screen Data:", screen_data.shape)
        print("Responses:", responses_data.shape)
        break

This will output something like:

.. code-block:: text

    Batch 0:
    Screen Data: torch.Size([32, 1, 60, 144, 144])
    Responses: torch.Size([32, 60, 12])
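
The leading batch dimension in these shapes comes from stacking the per-sample dictionaries key by key, which is what PyTorch's default collate function does for dictionary samples. The sketch below emulates that step with NumPy arrays instead of tensors. The helper ``collate_dicts``, the zero-filled samples, and the batch size of 4 are hypothetical and serve only to show how the shapes arise.

.. code-block:: python

    import numpy as np

    def collate_dicts(samples):
        """Stack a list of per-sample dicts into one dict of batched arrays."""
        return {key: np.stack([s[key] for s in samples]) for key in samples[0]}

    # Dummy samples with the per-item shapes printed earlier in this guide
    samples = [
        {
            "screen": np.zeros((1, 60, 144, 144), dtype=np.float32),
            "responses": np.zeros((60, 12), dtype=np.float32),
        }
        for _ in range(4)
    ]

    batch = collate_dicts(samples)
    for key, value in batch.items():
        print(key, value.shape)
    # screen (4, 1, 60, 144, 144)
    # responses (4, 60, 12)

Stacking only works when every sample of a modality has the same shape, which a fixed chunk size per modality provides.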