.. _data_model:

The canopy data model
=====================

The Field object
----------------

.. currentmodule:: canopy.core.field

**Canopy** stores data in a self-describing container called a :class:`Field`. A :class:`Field` represents a spatio-temporally varying diagnostic or output of a dynamic global vegetation model (DGVM), e.g., annual GPP, stored carbon, or LAI. Each quantity may be further disaggregated into *layers* (e.g., GPP can be disaggregated by PFT, and carbon storage into different carbon pools).

A :class:`Field` object consists of:

.. currentmodule:: canopy.core.grid.grid_abc

- A pandas `DataFrame` with a spatio-temporal index to store the field's data
- A :class:`Grid` object describing the grid associated with the data
- A metadata dictionary
- A history dictionary
- Methods for basic data slicing and reduction.

.. currentmodule:: canopy.core.field

To create a :class:`Field`, the pandas DataFrame and the associated grid must be supplied. For the :class:`Field` object to be successfully instantiated, the DataFrame must fulfill the specifications below, and the data and the grid must be compatible.

The DataFrame
-------------

The data table is a pandas `DataFrame` containing the spatio-temporally indexed data.

- Layers are stored as columns. The column identifiers (names) must be of `str` type.
- The index must contain at least one `time` level of pandas `PeriodIndex` type.
- It may additionally contain one or two spatial levels (e.g. `lon` and `lat`), of `float` type.
- One additional, redundant `label` level is allowed, but not required, for site-based data (or, in general, any data associated with an unstructured grid).

The index levels can therefore have one of the following forms:

- `[label, x_1, x_2, time]`: Fully spatio-temporal site-based data, with a *redundant* index level `label`.
- `[x_1, x_2, time]`: Fully spatio-temporal data, with two spatial dimensions (`x_1` and `x_2` could be, e.g., `lon` and `lat`).
- `[x_1, time]`: Spatio-temporal data in which dimension `x_2` has been reduced.
- `[x_2, time]`: Spatio-temporal data in which dimension `x_1` has been reduced.
- `[time]`: Time series. It can represent purely temporal data or data whose spatial dimensions have both been reduced.

.. image:: _static/frame_general.png

.. note::
   *Entry* refers to a single spatio-temporal location (a row of the table).

.. note::
   Currently, canopy does not support a vertical coordinate or subgrid indexing (but it will in the future).

The Grid object
---------------

.. currentmodule:: canopy.core.grid.grid_abc

:class:`Grid` objects describe different types of grids. These are implemented as subclasses of the abstract base class `Grid` (described below). Instances of a :class:`Grid` subclass provide all the information needed to perform the spatial reduction operations (grid operations or *gridops*) associated with that grid type.

Currently, canopy comes with two grid types and their associated operations:

- ``lonlat``: describes a standard geographical (longitude-latitude) grid. The grid spacing is constant along each axis, but can differ between the two axes.
- ``sites``: describes an unstructured grid, such as the one associated with a collection of sites.

There is a special type of grid, ``empty``, which is used for pure time series (no spatial dimensions).

All grids are derived from the abstract base class :class:`Grid`, and override the following methods:

- :meth:`Grid.from_frame` (class method): constructs a :class:`Grid` object from a pandas `DataFrame`.
- :meth:`Grid.validate_frame`: checks that the grid correctly describes the data contained in the supplied `DataFrame` (i.e., that the frame's index is compatible with the grid).
- :meth:`Grid.crop`: returns a cropped :class:`Grid` of the same type, or :class:`GridEmpty` if the supplied dataframe is empty or outside the grid's domain.
- :meth:`Grid.reduce`: returns a reduced :class:`Grid` of the same type, or :class:`GridEmpty` if both axes are reduced.
- :meth:`Grid.is_compatible`: checks whether two grids are compatible, i.e., whether they can be added together. All grids are compatible with the type :class:`GridEmpty`. As an example, two :class:`GridLonLat` grids are compatible if, when extended to cover the globe, they overlap perfectly (i.e., they are subsets of the same global grid).
- :meth:`Grid.__add__`: returns the sum of two :class:`Grid` objects, which is another :class:`Grid` object. Adding :class:`GridEmpty` to a grid object simply returns a deep copy of the object.

Additionally, all subclasses define the following class attributes:

.. currentmodule:: canopy.core.grid.spatial_axis

- ``_grid_type`` (``str``): The name under which the grid subclass will be registered in the grid registry.
- ``_xaxis`` (``SpatialAxis``): Named tuple of type :class:`SpatialAxis` defining the grid's *X* axis attributes.
- ``_yaxis`` (``SpatialAxis``): Named tuple of type :class:`SpatialAxis` defining the grid's *Y* axis attributes.
- ``_xaxis_key``: Short key to refer to the grid's *X* axis (e.g. in grid operations).
- ``_yaxis_key``: Short key to refer to the grid's *Y* axis (e.g. in grid operations).

.. currentmodule:: canopy.core.grid.grid_abc

The derived class can override the :class:`Grid` base class :meth:`__init__` method. However, the overridden :meth:`__init__` must still call the base class's :meth:`__init__` by using the :func:`super` function.

Creating a Field
----------------

.. currentmodule:: canopy.core.field

To gain insight into the structure of a :class:`Field` object, we will create one from scratch. We encourage you to try this example on your own. This is, of course, a dummy example with randomly generated data. Normally, the Field is created by data-reader functions from model output files or from an observational dataset. See :ref:`reading_files` and :ref:`data_sources`.

First we need a multi-indexed DataFrame.
Let's create one purporting to hold annual plant transpiration by PFT, in mm, for a small 2x2 grid-cell domain between 1999 and 2001 on a **lonlat** grid.

.. code-block:: python

    import pandas as pd
    import numpy as np
    import canopy as cp

    pfts = ['Conifer', 'Broadleaf', 'Grass']
    years = [pd.Period(year=x, freq='Y') for x in [1999, 2000, 2001]]
    lons = [13.25, 13.75]
    lats = [40.75, 41.25]

    # One row per (lon, lat, year) combination: 2 x 2 x 3 = 12 entries
    index = pd.MultiIndex.from_product([lons, lats, years], names=['lon', 'lat', 'time'])

    # Reproducible random values, one column per PFT
    np.random.seed(10)
    data = 200*np.random.random(len(index)*len(pfts)).reshape([len(index), len(pfts)])
    data = pd.DataFrame(data, index=index, columns=pfts)
    print(data)

.. code-block:: console

                         Conifer   Broadleaf       Grass
    lon   lat   time
    13.25 40.75 1999  154.264129    4.150390  126.729647
                2000  149.760777   99.701402   44.959329
                2001   39.612573  152.106142   33.822167
          41.25 1999   17.667963  137.071964  190.678669
                2000    0.789653  102.438453  162.524192
                2001  122.505213  144.351063   58.375214
    13.75 40.75 1999  183.554825  142.915157  108.508874
                2000   28.434010   74.668152  134.826723
                2001   88.366635   86.802799  123.553396
          41.25 1999  102.627649  130.079436  120.207791
                2000  161.044639  104.329430  181.729776
                2001   63.847218   18.091870   60.140011

.. currentmodule:: canopy.core.grid.grid_abc

Now, let's create a :class:`Grid` object associated with this data. The grid specifications can be inferred from the DataFrame of interest by invoking the :meth:`Grid.from_frame` constructor as follows:

.. code-block:: python

    grid = cp.grid.get_grid('lonlat').from_frame(data)
    print(grid)

.. code-block:: console

    Longitude: 13.25 to 13.75 (step: 0.5)
    Latitude: 40.75 to 41.25 (step: 0.5)

Finally, we construct the Field object. It may look as if not much is going on, but the Field constructor verifies that the DataFrame conforms to the **canopy** data model described above.

.. code-block:: python

    # Annual transpiration
    aaet = cp.Field(grid, data)
    print(aaet)

.. code-block:: console

    Data
    ----
    name: [no name]
    units: [no units]
    description: [no description]

    Grid: lonlat
    ------------
    Longitude: 13.25 to 13.75 (step: 0.5)
    Latitude: 40.75 to 41.25 (step: 0.5)

    Time series
    -----------
    Span: 1999-01-01 00:00:00 - 2001-12-31 23:59:59.999999999
    Frequency: Y-DEC

    History
    -------

To examine the Field's data, one can use:

.. code-block:: python

    print(f"Field's layers: {aaet.layers}")
    print("Field's data:")
    print(aaet.data)

.. code-block:: console

    Field's layers: ['Conifer', 'Broadleaf', 'Grass']
    Field's data:
                         Conifer   Broadleaf       Grass
    lon   lat   time
    13.25 40.75 1999  154.264129    4.150390  126.729647
                2000  149.760777   99.701402   44.959329
                2001   39.612573  152.106142   33.822167
          41.25 1999   17.667963  137.071964  190.678669
                2000    0.789653  102.438453  162.524192
                2001  122.505213  144.351063   58.375214
    13.75 40.75 1999  183.554825  142.915157  108.508874
                2000   28.434010   74.668152  134.826723
                2001   88.366635   86.802799  123.553396
          41.25 1999  102.627649  130.079436  120.207791
                2000  161.044639  104.329430  181.729776
                2001   63.847218   18.091870   60.140011

Notice that our hand-made Field does not yet have metadata. In a normal **canopy** workflow, the metadata is added upon reading from disk by the reader function or the `Source` object (see :ref:`data_manipulation`), as long as the data source is registered. Metadata can be added or reset manually as follows:

.. code-block:: python

    # Fails because the entry 'name' already exists by default in every Field.
    # aaet.add_md('name', 'aaet')

    # For existing entries, like the three default ones, we use Field.set_md()
    aaet.set_md('name', 'aaet')
    aaet.set_md('description', 'Annual transpiration by PFT')
    aaet.set_md('units', 'mm')

    # We can add any metadata we want with Field.add_md()
    aaet.add_md('scenario', 'SSP1-2.6')

    # We can also manually add entries to the history log
    aaet.log('Field created manually with bogus data')

    print(aaet)

.. code-block:: console

    Data
    ----
    name: aaet
    units: mm
    description: Annual transpiration by PFT
    scenario: SSP1-2.6

    Grid: lonlat
    ------------
    Longitude: 13.25 to 13.75 (step: 0.5)
    Latitude: 40.75 to 41.25 (step: 0.5)

    Time series
    -----------
    Span: 1999-01-01 00:00:00 - 2001-12-31 23:59:59.999999999
    Frequency: Y-DEC

    History
    -------
    2025-05-12 19:20:48: Field created manually with bogus data
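
Before handing a hand-built DataFrame to the Field constructor, it can be useful to check it against the index rules listed in the DataFrame section. The sketch below does this with plain pandas only. The helper ``check_frame_layout`` is hypothetical and is not part of **canopy**; it only mirrors the rules stated on this page (``str`` column names, a `time` level of `PeriodIndex` type, at most two `float` spatial levels, an optional `label` level) and makes no claim about how the Field constructor or :meth:`Grid.validate_frame` are actually implemented.

.. code-block:: python

    import pandas as pd


    def check_frame_layout(frame: pd.DataFrame) -> list[str]:
        """Return the layout problems found; an empty list means none.

        Hypothetical helper for illustration only, not a canopy function.
        """
        problems = []

        # Layers are stored as columns, and column names must be str.
        if not all(isinstance(col, str) for col in frame.columns):
            problems.append("column names must be of str type")

        # The index must contain a 'time' level holding pandas Periods.
        names = list(frame.index.names)
        if "time" not in names:
            problems.append("index has no 'time' level")
        else:
            time_values = frame.index.get_level_values(names.index("time"))
            if not isinstance(time_values, pd.PeriodIndex):
                problems.append("'time' level is not a PeriodIndex")

        # Every level other than 'time' and the optional 'label' is treated
        # as spatial: at most two are allowed, and they must be float.
        n_spatial = 0
        for pos, name in enumerate(names):
            if name in ("time", "label"):
                continue
            n_spatial += 1
            values = frame.index.get_level_values(pos)
            if not pd.api.types.is_float_dtype(values):
                problems.append(f"spatial level {name!r} is not of float type")
        if n_spatial > 2:
            problems.append("more than two spatial levels")

        return problems


    # A frame that follows the [x_1, x_2, time] form
    time = pd.period_range("1999", "2001", freq="Y")
    index = pd.MultiIndex.from_product(
        [[13.25, 13.75], [40.75, 41.25], time], names=["lon", "lat", "time"]
    )
    good = pd.DataFrame({"Grass": 1.0}, index=index)

    # A frame with an integer 'lon' level and timestamps instead of periods
    bad_index = pd.MultiIndex.from_product(
        [[13, 14], pd.date_range("1999", periods=2, freq="D")],
        names=["lon", "time"],
    )
    bad = pd.DataFrame({"Grass": 1.0}, index=bad_index)

    print(check_frame_layout(good))  # no problems reported
    print(check_frame_layout(bad))   # reports the 'time' and 'lon' levels

Returning a list of messages instead of raising on the first failure is a design choice made for this sketch, so that all layout problems of a frame can be seen at once.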