.. _data_model:

The canopy data model
=====================

The Field object
----------------

.. currentmodule:: canopy.core.field

**Canopy** stores data in a self-describing container called a :class:`Field`. A :class:`Field` is meant to represent a spatio-temporally varying DGVM diagnostic or output (e.g., annual GPP, stored carbon, LAI...). Each quantity may be further disaggregated into *layers* (e.g., GPP can be disaggregated into PFTs, carbon storage into different carbon pools...).

A :class:`Field` object consists of:

.. currentmodule:: canopy.core.grid.grid_abc

- A `pandas DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_ with a spatio-temporal index to store the field's data
- A :class:`Grid` object describing the grid associated with the data
- A metadata dictionary
- A history dictionary
- Methods for basic data slicing and reduction.

.. currentmodule:: canopy.core.field

To create a :class:`Field`, the pandas DataFrame and the associated grid must be supplied. For the :class:`Field` object to be successfully instantiated, it must fulfill the specifications below, and the data and the grid must be compatible.

The DataFrame
-------------

The data table is a pandas `DataFrame` containing the spatio-temporally indexed data.

- Layers are stored as columns. The column identifiers (names) must be of `str` type.
- The index must contain at least one `time` level of `pandas PeriodIndex <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.PeriodIndex.html>`_ type.
- It may additionally contain one or two spatial levels (e.g. `lon` and `lat`), of `float` type.
- One additional, redundant `label` level is allowed, but not required, for site-based data (or, in general, any data associated with an unstructured grid).

The index levels can therefore have one of the following forms:

- `[label, x_1, x_2, time]`: Fully spatio-temporal site-based data, with a *redundant* index level `label`.
- `[x_1, x_2, time]`: Fully spatio-temporal data, with two spatial dimensions (`x_1` and `x_2` could be, e.g., `lon` and `lat`).
- `[x_1, time]`: Spatio-temporal data, dimension `x_2` has been reduced.
- `[x_2, time]`: Spatio-temporal data, dimension `x_1` has been reduced.
- `[time]`: Time series. It can represent purely temporal data or data whose spatial dimensions have both been reduced.
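
The rules above can be checked with plain pandas. The following is only a minimal sketch under the assumptions stated in this section: ``check_frame`` is a hypothetical helper written for illustration, it is not part of **canopy**, and the library's own validation may differ.

.. code-block:: python

   import pandas as pd


   def check_frame(df):
       """Hypothetical helper: list the ways ``df`` departs from the rules above."""
       problems = []
       # Layers are columns, and their names must be strings.
       if not all(isinstance(name, str) for name in df.columns):
           problems.append("column names must be of str type")
       names = list(df.index.names)
       # A 'time' level is mandatory.
       if "time" not in names:
           problems.append("index has no 'time' level")
       spatial_count = 0
       for position, name in enumerate(names):
           level = df.index.get_level_values(position)
           if name == "time":
               # The 'time' level must be a PeriodIndex.
               if not isinstance(level, pd.PeriodIndex):
                   problems.append("'time' level is not a PeriodIndex")
           elif name != "label":
               # Any level other than 'time' and 'label' is treated as spatial.
               spatial_count += 1
               if not pd.api.types.is_float_dtype(level):
                   problems.append(f"spatial level {name!r} is not of float type")
       if spatial_count > 2:
           problems.append("more than two spatial levels")
       return problems


   # A frame of the form [x_1, time] that follows the rules.
   years = pd.period_range("1999", "2001", freq="Y")
   index = pd.MultiIndex.from_product([[13.25, 13.75], years], names=["lon", "time"])
   good = pd.DataFrame({"Grass": range(len(index))}, index=index)

   # A frame that breaks several rules at once.
   bad = pd.DataFrame({0: [1.0, 2.0]}, index=pd.Index([1999, 2000], name="year"))

   print(check_frame(good))
   print(check_frame(bad))

An empty list means the frame satisfies every rule checked here. Compatibility between the frame and a grid is a separate question that this sketch does not address.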

.. image:: _static/frame_general.png

.. note::
  *Entry* refers to a single spatio-temporal location (a row of the table).

.. note::
  Currently, canopy does not support a vertical coordinate or subgrid indexing (but it will in the future).

The Grid object
---------------

.. currentmodule:: canopy.core.grid.grid_abc

:class:`Grid` objects describe different types of grids. These are implemented as subclasses of the abstract base class :class:`Grid` (described below). Instances of a :class:`Grid` subclass provide all the information needed to perform the spatial reduction operations (grid operations, or *gridops*) associated with that grid type.

Currently, canopy comes with two grid types and their associated operations:

- ``lonlat``: describes a standard geographical (longitude-latitude) grid. The grid spacing is constant, but can be different for each axis.
- ``sites``: describes an unstructured grid, such as the one associated with a collection of sites.

There is a special type of grid, ``empty``, which is used for pure time series (no spatial dimensions).

All grids are derived from the abstract base class :class:`Grid`, and override the following methods:

- :meth:`Grid.from_frame` (class method): constructs a :class:`Grid` object from a pandas `DataFrame`
- :meth:`Grid.validate_frame`: checks that the grid describes the data contained in the supplied `DataFrame` correctly (i.e., if the frame's index is compatible with the grid).
- :meth:`Grid.crop`: returns a cropped :class:`Grid` of the same type, or :class:`GridEmpty` if the supplied dataframe is empty or outside the grid's domain.
- :meth:`Grid.reduce`: returns a reduced :class:`Grid` of the same type, or :class:`GridEmpty` if both axes are reduced.
- :meth:`Grid.is_compatible`: checks for compatibility of two grids, i.e., whether two grids can be added together. All grids are compatible with the type :class:`GridEmpty`. As an example, two :class:`GridLonLat` grids are compatible if the grids extended to cover the globe overlap perfectly (i.e., they are subsets of the same global grid).
- :meth:`Grid.__add__`: returns the sum of two :class:`Grid` objects, which is another :class:`Grid` object. Adding :class:`GridEmpty` to a grid object simply returns a deep copy of the object.
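
The compatibility criterion given for longitude-latitude grids can be illustrated for a single axis. This is a sketch of the idea only, not the code used by :class:`GridLonLat`: the function name, its arguments and the tolerance are all assumptions made for the example. Two regular axes are treated as subsets of the same global axis when they share the same step and their starting points differ by a whole number of steps.

.. code-block:: python

   import math


   def axes_share_global_grid(start_a, step_a, start_b, step_b, tol=1e-9):
       """Hypothetical check: do two regular axes lie on the same global axis?"""
       # Different spacings can never overlap perfectly.
       if not math.isclose(step_a, step_b, rel_tol=0.0, abs_tol=tol):
           return False
       # The offset between the starting points must be a whole number of steps.
       shift = (start_b - start_a) / step_a
       return math.isclose(shift, round(shift), rel_tol=0.0, abs_tol=tol)


   # Same step, starting points 15 steps apart.
   aligned = axes_share_global_grid(13.25, 0.5, 20.75, 0.5)
   # Same step, but shifted by half a cell.
   shifted = axes_share_global_grid(13.25, 0.5, 13.5, 0.5)
   # Different step.
   rescaled = axes_share_global_grid(13.25, 0.5, 13.25, 0.25)
   print(aligned, shifted, rescaled)

For a two-dimensional grid the same test would have to hold for both axes.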

Additionally, all subclasses define the following class attributes:

.. currentmodule:: canopy.core.grid.spatial_axis

- ``_grid_type`` (``str``): The name under which the grid subclass will be registered in the grid registry.
- ``_xaxis`` (:class:`SpatialAxis`): Named tuple of type :class:`SpatialAxis` defining the grid's *X* axis attributes.
- ``_yaxis`` (:class:`SpatialAxis`): Named tuple of type :class:`SpatialAxis` defining the grid's *Y* axis attributes.
- ``_xaxis_key``: Short key to refer to the grid's *X* axis (e.g. in grid operations).
- ``_yaxis_key``: Short key to refer to the grid's *Y* axis (e.g. in grid operations).

.. currentmodule:: canopy.core.grid.grid_abc

A derived class can override the :class:`Grid` base class :meth:`__init__` method. However, the overridden :meth:`__init__` must still call the base class's :meth:`__init__` by using the :func:`super` function.
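
The subclassing pattern described in this section can be sketched in plain Python. Every name below (``GridSketch``, ``AxisSketch``, ``LonLatSketch``, ``get_grid_sketch``, the registry dictionary and the fields of the named tuple) is invented for illustration; the real :class:`Grid` base class, its registry and the fields of ``SpatialAxis`` are not reproduced here and may look quite different.

.. code-block:: python

   from abc import ABC, abstractmethod
   from typing import NamedTuple


   class AxisSketch(NamedTuple):
       """Stand-in for a spatial-axis named tuple; the fields are assumptions."""
       name: str
       units: str


   _REGISTRY = {}  # hypothetical registry: grid type name -> grid class


   class GridSketch(ABC):
       """Stand-in abstract base class, not canopy's Grid."""

       def __init_subclass__(cls, **kwargs):
           super().__init_subclass__(**kwargs)
           # Register every subclass that declares its own _grid_type.
           grid_type = cls.__dict__.get("_grid_type")
           if grid_type:
               _REGISTRY[grid_type] = cls

       def __init__(self):
           # Work shared by all grid types lives in the base initialiser.
           self.axis_names = (self._xaxis.name, self._yaxis.name)

       @abstractmethod
       def is_compatible(self, other):
           """Return True if the two grids can be added together."""


   class LonLatSketch(GridSketch):
       _grid_type = "lonlat_sketch"
       _xaxis = AxisSketch("lon", "degrees_east")
       _yaxis = AxisSketch("lat", "degrees_north")
       _xaxis_key = "lon"
       _yaxis_key = "lat"

       def __init__(self, lon_step, lat_step):
           # The overriding __init__ still calls the base class initialiser.
           super().__init__()
           self.lon_step = lon_step
           self.lat_step = lat_step

       def is_compatible(self, other):
           # Simplified rule for the sketch: same type and same spacing.
           return (isinstance(other, LonLatSketch)
                   and (self.lon_step, self.lat_step) == (other.lon_step, other.lat_step))


   def get_grid_sketch(grid_type):
       """Look up a grid class by its registered name."""
       return _REGISTRY[grid_type]


   grid = get_grid_sketch("lonlat_sketch")(0.5, 0.5)
   print(grid.axis_names)

If ``super().__init__()`` were omitted in ``LonLatSketch``, the ``axis_names`` attribute set by the base class would never exist, which is the kind of breakage the rule above is meant to prevent.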

Creating a Field
----------------

.. currentmodule:: canopy.core.field

In order to gain insight into the structure of a :class:`Field` object, we will create one from scratch. We encourage you to try this example on your own. This is, of course, a dummy example with randomly generated data. Normally, the Field is created by data-reader functions from model output files or from an observational dataset. See :ref:`reading_files` and :ref:`data_sources`.
   
First we need a multi-indexed DataFrame. Let's create one purporting to hold annual plant transpiration by PFT, in mm, for a small domain of 2x2 gridcells between 1999 and 2001 on a **lonlat** grid.

.. code-block:: python

   import pandas as pd
   import numpy as np
   import canopy as cp

   pfts = ['Conifer', 'Broadleaf', 'Grass', ]
   years = [pd.Period(year=x, freq='Y') for x in [1999, 2000, 2001]]
   lons = [13.25, 13.75]
   lats = [40.75, 41.25]
   index = pd.MultiIndex.from_product([lons, lats, years], names=['lon', 'lat', 'time'])
   np.random.seed(10)
   data = 200*np.random.random(len(index)*len(pfts)).reshape([len(index), len(pfts)])
   data = pd.DataFrame(data, index=index, columns=pfts)
   print(data)

.. code-block:: console

                         Conifer   Broadleaf       Grass
    lon   lat   time                                    
    13.25 40.75 1999  154.264129    4.150390  126.729647
                2000  149.760777   99.701402   44.959329
                2001   39.612573  152.106142   33.822167
          41.25 1999   17.667963  137.071964  190.678669
                2000    0.789653  102.438453  162.524192
                2001  122.505213  144.351063   58.375214
    13.75 40.75 1999  183.554825  142.915157  108.508874
                2000   28.434010   74.668152  134.826723
                2001   88.366635   86.802799  123.553396
          41.25 1999  102.627649  130.079436  120.207791
                2000  161.044639  104.329430  181.729776
                2001   63.847218   18.091870   60.140011

.. currentmodule:: canopy.core.grid.grid_abc

Now, let's create a :class:`Grid` object associated with this data. The grid specifications can be inferred from the DataFrame of interest by invoking the :meth:`Grid.from_frame` constructor as follows:

.. code-block:: python

   grid = cp.grid.get_grid('lonlat').from_frame(data)
   print(grid)

.. code-block:: console

    Longitude: 
    13.25 to 13.75 (step: 0.5)
    Latitude: 
    40.75 to 41.25 (step: 0.5)
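
To give a feeling for what inferring a grid from a frame involves, the sketch below derives the start, end and step of one regular axis from an index level using only NumPy and pandas. ``infer_axis`` is a hypothetical function written for this page; it does not reproduce the internals of :meth:`Grid.from_frame`, which may handle tolerances and edge cases differently.

.. code-block:: python

   import numpy as np
   import pandas as pd


   def infer_axis(df, level):
       """Hypothetical helper: return (start, end, step) of a regular axis."""
       # Sorted unique coordinates found along the requested index level.
       values = np.unique(df.index.get_level_values(level).to_numpy(dtype=float))
       if values.size < 2:
           raise ValueError(f"cannot infer a step from a single value on {level!r}")
       steps = np.diff(values)
       # A regular axis has one constant spacing.
       if not np.allclose(steps, steps[0]):
           raise ValueError(f"spacing along {level!r} is not constant")
       return float(values[0]), float(values[-1]), float(steps[0])


   years = pd.period_range("1999", "2001", freq="Y")
   index = pd.MultiIndex.from_product(
       [[13.25, 13.75], [40.75, 41.25], years], names=["lon", "lat", "time"]
   )
   frame = pd.DataFrame({"Grass": np.zeros(len(index))}, index=index)

   lon_axis = infer_axis(frame, "lon")
   lat_axis = infer_axis(frame, "lat")
   print(lon_axis, lat_axis)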


Finally, we construct the Field object. It may look like not much is going on, but the Field constructor verifies that the DataFrame conforms to the **canopy** data model described above.

.. code-block:: python

   # Annual transpiration
   aaet = cp.Field(grid, data)
   print(aaet)

.. code-block:: console

   Data
   ----
   name: [no name]
   units: [no units]
   description: [no description]

   Grid: lonlat
   ------------
   Longitude: 
   13.25 to 13.75 (step: 0.5)
   Latitude: 
   40.75 to 41.25 (step: 0.5)

   Time series
   -----------
   Span: 1999-01-01 00:00:00 - 2001-12-31 23:59:59.999999999
   Frequency: Y-DEC

   History
   -------

To examine the Field's data, one can use:

.. code-block:: python

   print(f"Field's layers: {aaet.layers}")
   print("Field's data:")
   print(aaet.data)


.. code-block:: console

   Field's layers: ['Conifer', 'Broadleaf', 'Grass']
   Field's data:
                        Conifer   Broadleaf       Grass
   lon   lat   time                                    
   13.25 40.75 1999  154.264129    4.150390  126.729647
               2000  149.760777   99.701402   44.959329
               2001   39.612573  152.106142   33.822167
         41.25 1999   17.667963  137.071964  190.678669
               2000    0.789653  102.438453  162.524192
               2001  122.505213  144.351063   58.375214
   13.75 40.75 1999  183.554825  142.915157  108.508874
               2000   28.434010   74.668152  134.826723
               2001   88.366635   86.802799  123.553396
         41.25 1999  102.627649  130.079436  120.207791
               2000  161.044639  104.329430  181.729776
               2001   63.847218   18.091870   60.140011
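
The table printed above is an ordinary pandas DataFrame, so the reduced index forms listed in the DataFrame section can be illustrated with pandas alone. The sketch below rebuilds a frame of the same shape with made-up values and uses unweighted means, purely to show the resulting index levels; it does not use or stand in for the :class:`Field` slicing and reduction methods.

.. code-block:: python

   import numpy as np
   import pandas as pd

   years = pd.period_range("1999", "2001", freq="Y")
   index = pd.MultiIndex.from_product(
       [[13.25, 13.75], [40.75, 41.25], years], names=["lon", "lat", "time"]
   )
   frame = pd.DataFrame(
       np.arange(36, dtype=float).reshape(12, 3),
       index=index,
       columns=["Conifer", "Broadleaf", "Grass"],
   )

   # [lon, time]: the 'lat' dimension is averaged out.
   by_lon_time = frame.groupby(level=["lon", "time"]).mean()
   # [time]: both spatial dimensions are averaged out, leaving a time series.
   time_series = frame.groupby(level="time").mean()
   # One year selected on the 'time' level; the result is indexed by [lon, lat].
   year_2000 = frame.xs(pd.Period("2000", freq="Y"), level="time")

   print(by_lon_time.index.names, time_series.index.names, year_2000.index.names)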


Notice that our hand-made Field does not yet have metadata. In a normal **canopy** workflow, the metadata is added upon reading from disk by the reader function or the `Source` object (see :ref:`data_manipulation`), as long as the data source is registered. Metadata can be added or reset manually as follows:

.. code-block:: python

   # The commented-out call below fails because the entry 'name' already exists by default in every Field.
   # aaet.add_md('name', 'aaet')
   # For existing entries, like the three default ones, we use Field.set_md()
   aaet.set_md('name', 'aaet')
   aaet.set_md('description', 'Annual transpiration by PFT')
   aaet.set_md('units', 'mm')
   # We can add any metadata we want with Field.add_md()
   aaet.add_md('scenario', 'SSP1-2.6')
   # We can also manually add entries to the history log
   aaet.log('Field created manually with bogus data')
   print(aaet)

.. code-block:: console

   Data
   ----
   name: aaet
   units: mm
   description: Annual transpiration by PFT
   scenario: SSP1-2.6

   Grid: lonlat
   ------------
   Longitude: 
   13.25 to 13.75 (step: 0.5)
   Latitude: 
   40.75 to 41.25 (step: 0.5)

   Time series
   -----------
   Span: 1999-01-01 00:00:00 - 2001-12-31 23:59:59.999999999
   Frequency: Y-DEC

   History
   -------
   2025-05-12 19:20:48: Field created manually with bogus data