datafold documentation


What is datafold?

datafold is a Python package that provides data-driven models for point clouds to find an explicit manifold parametrization and to identify non-linear dynamical systems on these manifolds. Informally, a manifold is a (usually unknown) geometric structure on which data is sampled. For high-dimensional point clouds, a typical use case is to parametrize an intrinsic low-dimensional manifold with non-linear dimension reduction. For time series data, the underlying dynamical system is assumed to have a phase space that is a manifold.

For a longer introduction to datafold, please go to the introduction page; for a mathematically thorough introduction, we refer to the cited references.

The source code is distributed under the MIT license.

Any contribution (code/tutorials/documentation improvements) and feedback are very welcome. Either use the issue tracker or the service desk email. Please also see the “Contributing” section further below.

Note

The project is under active development in a research-driven environment.

  • Code quality varies, ranging from “experimental/early stage” to “well-tested”. In general, well-tested classes are listed in the software documentation and are directly accessible through the package levels pcfold, dynfold or appfold (e.g. from datafold.dynfold import ...). Experimental code is only accessible via “deep imports” (e.g. from datafold.dynfold.outofsample import ...) and may raise a warning when used.

  • There is no deprecation cycle. Backwards compatibility is indicated by the package version, where we use a semantic versioning policy [major].[minor].[patch], i.e.

    • major - making incompatible changes in the (documented) API

    • minor - adding functionality in a backwards-compatible manner

    • patch - backwards-compatible bug fixes

    We do not intend to indicate a feature complete milestone with version 1.0.

Highlights

datafold includes:

  • Data structures to handle point clouds on manifolds (PCManifold) and time series collections (TSCDataFrame). The data structures are used both internally and for model input/outputs (if applicable).

  • An efficient implementation of the DiffusionMaps model to parametrize a manifold from point cloud data or to approximate the eigenfunctions of the Laplace-Beltrami operator. An arbitrary kernel can be set in the model, for example a standard Gaussian kernel, a continuous k-nearest-neighbor kernel, or a dynamics-adapted cone kernel (see the sketch after this list).

  • Out-of-sample methods such as the (auto-tuned) Laplacian Pyramids or Geometric Harmonics to interpolate general function values on manifold point clouds.

  • (Extended-) Dynamic Mode Decomposition (e.g. DMDFull or EDMD), which are data-driven models of dynamical systems built from time series data. To improve a model’s accuracy, the available data can be transformed with a variety of functions. This includes scaling heterogeneous time series features, representing the time series in another coordinate system (e.g. eigenfunctions of the Laplace-Beltrami operator) or reconstructing a diffeomorphic copy of the phase space with time delay embedding (cf. Takens’ theorem).

  • EDMDCV allows the model parameters (including the transformation model parameters) to be optimized with cross-validation and also accounts for time series splitting.
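
The following minimal sketch illustrates the manifold parametrization highlight on a toy dataset. It is a hedged example, not a reference: the names DiffusionMaps, n_eigenpairs and eigenvectors_ follow the datafold documentation and tutorials and may differ between versions.

from sklearn.datasets import make_s_curve

from datafold.dynfold import DiffusionMaps

# point cloud sampled on an S-shaped two-dimensional manifold embedded in 3D
X, _ = make_s_curve(n_samples=1000, random_state=1)

# compute eigenpairs of the diffusion operator with the default Gaussian kernel;
# the leading non-trivial eigenvectors parametrize the intrinsic manifold
dmap = DiffusionMaps(n_eigenpairs=5).fit(X)
embedding = dmap.eigenvectors_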

Cite

If you use datafold in your research, please cite the paper published in the Journal of Open Source Software (JOSS).

@article{Lehmberg2020,
         doi       = {10.21105/joss.02283},
         url       = {https://doi.org/10.21105/joss.02283},
         year      = {2020},
         publisher = {The Open Journal},
         volume    = {5},
         number    = {51},
         pages     = {2283},
         author    = {Daniel Lehmberg and Felix Dietrich and Gerta K{\"o}ster and Hans-Joachim Bungartz},
         title     = {datafold: data-driven models for point clouds and time series on manifolds},
         journal   = {Journal of Open Source Software}}

How to get it?

Installation of datafold requires Python>=3.6 with pip and setuptools installed (both packages usually ship with a standard Python installation). The datafold package dependencies are listed in the next section and install automatically.

There are two ways to install datafold:

  1. PyPI: installs the datafold core package (without tutorials and tests). To download the tutorial files separately please visit the Tutorials page.

  2. Source: downloads the entire repository. This is only recommended if you want access to the latest (but potentially unstable) development, plan to contribute to datafold, or wish to run the tests.

From PyPI

datafold is hosted on the official Python package index (PyPI) (https://pypi.org/project/datafold/). To install datafold and its dependencies use pip:

pip install datafold

Use pip3 if pip is reserved for Python<3.

Note

If you use Python with Anaconda, please also go to Installation with Anaconda.

From source

  1. Download the git repository

    a. If you wish to contribute code, git must be installed. Clone the repository with

    git clone https://gitlab.com/datafold-dev/datafold.git
    

    b. Download the repository (zip, tar.gz, tar.bz2, tar)

  2. Install datafold from the root folder of the repository with

    python setup.py install
    

    Add the --user flag to install the software for the current user only.

  3. Optionally, run the tests locally. Because the tests have additional dependencies, they have to be installed separately with the requirements-dev.txt file

    pip install -r requirements-dev.txt
    python setup.py test
    

    Use python3 if python is reserved for Python<3.

Dependencies

The datafold package dependencies are managed in the requirements.txt file and are installed with the package manager pip if the requirement is not already fulfilled. The tests and some tutorials require further dependencies, which are managed in the requirements-dev.txt file.

The datafold software integrates with common packages from the Python scientific computing stack. Specifically, these are:

  • NumPy

    The data structure PCManifold in datafold subclasses from NumPy’s ndarray to model a point cloud sampled on a manifold. A PCManifold is associated with a PCManifoldKernel that describes the data locality and hence the geometry. NumPy is used throughout datafold and is the default for numerical data and algorithms.

  • pandas

    datafold addresses time series data with the data structure TSCDataFrame, which subclasses pandas’ rich data structure DataFrame. Internally, the data is again stored in a NumPy array, but a data frame can index time values, multiple time series and multiple features. The available time series data can then be captured in a single object with easy data slicing and dedicated time series functionality (see the sketch after this list).

  • scikit-learn

    All datafold algorithms that are part of the “machine learning pipeline” align with the scikit-learn API. This is done by deriving the models from BaseEstimator and appropriate mixins. datafold also defines its own base classes that align with scikit-learn in a duck-typing fashion to allow processing time series data in a TSCDataFrame object.

  • SciPy

    The package is used for elementary numerical algorithms and data structures in conjunction with NumPy. Examples in datafold include (sparse) linear least squares regression, (sparse) eigenpair solvers and sparse matrices as an optional data structure for kernel matrices.
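
To make the role of the two data structures concrete, here is a minimal sketch of both; the constructor names PCManifold and TSCDataFrame.from_single_timeseries are taken from the datafold documentation and tutorials and may differ between versions.

import numpy as np
import pandas as pd

from datafold.pcfold import PCManifold, TSCDataFrame

# a point cloud as a PCManifold (a subclass of numpy.ndarray with an attached kernel)
pcm = PCManifold(np.random.rand(100, 3))

# a single time series with two features, indexed by its time values
time = np.linspace(0, 1, 5)
df = pd.DataFrame({"x": np.sin(time), "y": np.cos(time)}, index=time)
tsc = TSCDataFrame.from_single_timeseries(df)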

How does it compare to other software?

This section only includes other Python packages, and does not compare the size (e.g. active developers) of the projects.

  • scikit-learn

    provides algorithms for the entire machine learning pipeline. The main class of models in scikit-learn maps feature inputs to a fixed number of target outputs, for tasks like regression or classification. datafold integrates into the scikit-learn API and focuses on manifold learning algorithms. Furthermore, datafold includes a model class that can process time series data from dynamical systems. Here the number of outputs may vary: a user provides an initial condition (the input) together with an arbitrary sampling frequency and prediction horizon (see the sketch after this list).

  • PyDMD

    provides many variants of the Dynamic Mode Decomposition (DMD). Some of the DMD models are special cases of the Extended Dynamic Mode Decomposition with a corresponding dictionary, while other DMD variants are currently not covered in datafold. datafold.dynfold.dmd.py includes an (experimental) wrapper for the PyDMD package to make the missing DMD models available. However, a limitation of PyDMD is that it only allows a single time series as input (numpy.ndarray), see PyDMD issue 86. datafold addresses this with the data structure TSCDataFrame.

  • PySINDy

    specializes in the sparse identification of dynamical systems to infer governing equations. SINDy is essentially a DMD variant and is within the scope of datafold, but it is not yet included. PySINDy also provides time series transformations, referred to as a library; this matches the definition of a dictionary in the Extended Dynamic Mode Decomposition. PySINDy supports multiple time series, but these are managed in lists and not in a single data structure.

  • TensorFlow

    allows data-driven regression/prediction with its main model type, (deep) neural networks. For manifold learning, (variational) autoencoders are suitable, and for time series prediction, recurrent networks such as Long Short-Term Memory (LSTM) networks are a good choice. In general, neural networks lack a mathematical background theory and are black-box models with a non-deterministic learning process that requires medium- to large-sized datasets. Nonetheless, the models are very successful in many applications. The models in datafold, in contrast, have a strong mathematical background, can often be used as part of the analysis, produce deterministic results and can handle smaller datasets.
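
As a concrete sketch of the prediction workflow mentioned in the scikit-learn comparison above, the following minimal example fits an EDMD model to a single synthetic time series and predicts from an initial condition over a freely chosen time horizon. The names TSCTakensEmbedding, from_single_timeseries, initial_states and n_samples_ic_ follow the datafold documentation and tutorials and may differ between versions.

import numpy as np
import pandas as pd

from datafold.appfold import EDMD
from datafold.dynfold import DMDFull, TSCTakensEmbedding
from datafold.pcfold import TSCDataFrame

# synthetic training data: a single damped oscillation with two features
time = np.linspace(0, 10, 101)
df = pd.DataFrame({"x": np.exp(-0.1 * time) * np.cos(time),
                   "y": np.exp(-0.1 * time) * np.sin(time)}, index=time)
tsc_data = TSCDataFrame.from_single_timeseries(df)

# dictionary: time delay embedding, followed by a full DMD model
edmd = EDMD(dict_steps=[("delay", TSCTakensEmbedding(delays=2))],
            dmd_model=DMDFull()).fit(tsc_data)

# initial condition (the input) and an arbitrary prediction horizon (the output)
ic = tsc_data.initial_states(edmd.n_samples_ic_)
prediction = edmd.predict(ic, time_values=np.linspace(0, 20, 201))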

Contributing

Bug reports, feature requests and user questions

Any contribution (code/tutorials/documentation changes) and feedback are very welcome. For all correspondence regarding the software, please open a new issue in the datafold issue tracker or write an email if you do not have a gitlab account (this opens a confidential issue in gitlab).

All code contributors are listed in the contributors file.

Setting up datafold for development

This section describes all steps to set up datafold for code development and should be read before contributing. The datafold repository must be cloned via git (see section “From source” above).

Quick set up

The following bash commands include all steps described in detail below for a quick set up.

# Clone repository (replace FORK_NAMESPACE after forking datafold)
git clone git@gitlab.com:[FORK_NAMESPACE]/datafold.git
cd ./datafold/

# Optional: set up virtual environment
# Note: if you use Python with Anaconda create a conda environment instead and
#       install pip in it
#       https://datafold-dev.gitlab.io/datafold/conda_install_info.html
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

# Optional: install datafold
#   not required if repository path is included in PYTHONPATH
python setup.py install

# Install development dependencies and code
pip install -r requirements-dev.txt

# Optional: install and run code formatting tools
pre-commit install
pre-commit run --all-files

# Optional: run tests
python setup.py test

# Optional: build documentation
sphinx-apidoc -f -o ./doc/source/_apidoc/ ./datafold/
sphinx-build -b html ./doc/source/ ./public/

Fork and create merge requests to datafold

Please read and follow the steps of gitlab’s “Project forking workflow”.

Note

We have set up a “Continuous Integration” (CI) pipeline. However, the worker (a gitlab-runner) of the parent repository is not available for forked projects (for reasons see here).

After you have created a fork you can clone the repository with

git clone git@gitlab.com:[FORK_NAMESPACE]/datafold.git

Install developer dependencies

The file requirements-dev.txt in the root directory of the repository contains all development dependencies and is readable with pip.

The recommended (but optional) way is to install all dependencies into a virtual environment to avoid conflicts with other installed packages. To set up a virtual environment, run from the root directory:

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements-dev.txt

Use python3 if python is reserved for Python<3.

Note

If you are using Python with Anaconda, please see Installation with Anaconda (https://datafold-dev.gitlab.io/datafold/conda_install_info.html) to set up a conda environment instead of a virtualenv.

To install the dependencies without a virtual environment simply execute:

pip install -r requirements-dev.txt

Use pip3 if pip is reserved for Python<3.

Install git pre-commit hooks

The datafold source code is automatically formatted with

  • black for general code formatting

  • isort for sorting Python import statements alphabetically and in sections.

  • nbstripout for removing potentially large binary formatted output cells in a Jupyter notebook before the content gets into the git history.

It is highly recommended to let the tools inspect and format the code before it is committed to the git history. The tools alter the source code in a deterministic way: each tool should only format the code once to obtain the desired format. None of the tools should break the code or alter its behaviour.

The most convenient way to set up the tools is to install the git commit-hooks via pre-commit (which installs with the development dependencies). To install the git-hooks, run from the root directory:

pre-commit install

The installed git-hooks then run automatically prior to each git commit. To execute the formatting on the current source code without a commit (e.g., for testing purposes or during development), run from the root directory of the repository:

pre-commit run --all-files

Run tests

The tests are executed with the Python package nose (which installs with the development dependencies).

To execute all datafold unit tests locally run from the root directory of the repository:

python setup.py test

Alternatively, you can execute the tests with nosetests, which provides further options (see nosetests --help):

nosetests datafold/ -v

To execute the tutorials (tests check only if an error occurs in the tutorial) run from the root directory:

nosetests tutorials/ -v

All tests (unit and tutorials) can also be executed remotely in a gitlab “Continuous Integration” (CI) setup. The pipeline runs for every push to the repository.

Visit “gitlab pipelines” for an introduction. datafold’s pipeline configuration is located in the file .gitlab-ci.yml.

Compile and build documentation

The documentation is built with Sphinx and various Sphinx extensions (all install with the development dependencies). The source code is documented with numpydoc style.

Additional dependencies for building the documentation (not contained in requirements-dev.txt):

  • LaTeX to render maths equations,

  • mathjax to display the LaTeX equations in the browser,

  • graphviz to render class dependency graphs, and

  • pandoc to convert between formats (required by nbsphinx extension that includes the tutorials into the web page documentation).

On Linux, install the packages with

apt install libjs-mathjax fonts-mathjax dvipng pandoc graphviz

(This excludes the LaTeX installation; see the available texlive packages.)

To build the documentation run from the root folder of the repository:

sphinx-apidoc -f -o ./doc/source/_apidoc/ ./datafold/
sphinx-build -b html ./doc/source/ ./public/

The page entry is then located at ./public/index.html. Please make sure that the Sphinx installation is included in the PATH environment variable.


Affiliations

Contributors

  • Daniel Lehmberg (1,2). DL is supported by the German Research Foundation (DFG), grant no. KO 5257/3-1, and thanks the research office (FORWIN) of Munich University of Applied Sciences and CeDoSIA of TUM Graduate School at the Technical University of Munich for their support.

  • Felix Dietrich (2). FD thanks the Technical University of Munich for their support.

1. Munich University of Applied Sciences

Faculty of Computer Science and Mathematics, research group “Pedestrian Dynamics” (www.vadere.org)


2. Technical University of Munich

Chair of Scientific Computing in Computer Science
