PHATE - Potential of Heat-diffusion for Affinity-based Trajectory Embedding

PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding) is a tool for visualizing high-dimensional data. PHATE uses a novel conceptual framework for learning and visualizing the manifold to preserve both local and global distances.

To see how PHATE can be applied to datasets such as facial images and single-cell data from human embryonic stem cells, check out our Nature Biotechnology publication.

Moon, van Dijk, Wang, Gigante et al. **Visualizing Transitions and Structure for Biological Data Exploration**. 2019. *Nature Biotechnology*.

Quick Start

If you have loaded a data matrix named data in Python (cells in rows, genes in columns), you can run PHATE as follows:

import phate
phate_op = phate.PHATE()
data_phate = phate_op.fit_transform(data)

PHATE accepts the following data types: numpy.ndarray, scipy.sparse.spmatrix, pandas.DataFrame and anndata.AnnData.

Usage

To run PHATE on your dataset, create a PHATE operator and run fit_transform. Here we show an example with an artificial tree:

import phate
tree_data, tree_clusters = phate.tree.gen_dla()
phate_operator = phate.PHATE(knn=15, t=100)
tree_phate = phate_operator.fit_transform(tree_data)
phate.plot.scatter2d(phate_operator, c=tree_clusters)
# or phate.plot.scatter2d(tree_phate, c=tree_clusters)
phate.plot.rotate_scatter3d(phate_operator, c=tree_clusters)

Help

If you have any questions or require assistance using PHATE, please contact us at https://krishnaswamylab.org/get-help

class phate.PHATE(n_components=2, knn=5, decay=40, n_landmark=2000, t='auto', gamma=1, n_pca=100, mds_solver='sgd', knn_dist='euclidean', knn_max=None, mds_dist='euclidean', mds='metric', n_jobs=1, random_state=None, verbose=1, **kwargs)

PHATE operator which performs dimensionality reduction.

Potential of Heat-diffusion for Affinity-based Trajectory Embedding (PHATE) embeds high-dimensional single-cell data into two or three dimensions for visualization of biological progressions, as described in Moon et al., 2017 [1].

Parameters:

n_components (int, optional, default: 2) – number of dimensions in which the data will be embedded
knn (int, optional, default: 5) – number of nearest neighbors on which to build kernel
decay (int, optional, default: 40) – sets decay rate of kernel tails. If None, alpha decaying kernel is not used
n_landmark (int, optional, default: 2000) – number of landmarks to use in fast PHATE
t (int or 'auto', optional, default: 'auto') – power to which the diffusion operator is raised. This sets the level of diffusion. If ‘auto’, t is selected according to the knee point in the von Neumann entropy of the diffusion operator
gamma (float, optional, default: 1) – Informational distance constant between -1 and 1. gamma=1 gives the PHATE log potential, gamma=0 gives a square root potential.
n_pca (int, optional, default: 100) – Number of principal components to use for calculating neighborhoods. For extremely large datasets, using n_pca < 20 allows neighborhoods to be calculated in roughly log(n_samples) time.
mds_solver ({'sgd', 'smacof'}, optional (default: 'sgd')) – which solver to use for metric MDS. SGD is substantially faster, but produces slightly less optimal results. Note that SMACOF was used for all figures in the PHATE paper.
knn_dist (string, optional, default: 'euclidean') – distance metric for building the kNN graph. Recommended values: ‘euclidean’, ‘cosine’, ‘precomputed’. Any metric from scipy.spatial.distance can be used. Custom distance functions of the form f(x, y) = d are also accepted. If ‘precomputed’, data should be an n_samples x n_samples distance or affinity matrix. Distance matrices are assumed to have zeros down the diagonal, while affinity matrices are assumed to have non-zero values down the diagonal. This is detected automatically using data[0,0]. You can override this detection with knn_dist=‘precomputed_distance’ or knn_dist=‘precomputed_affinity’.
knn_max (int, optional, default: None) – Maximum number of neighbors for which alpha decaying kernel is computed for each point. For very large datasets, setting knn_max to a small multiple of knn can speed up computation significantly.
mds_dist (string, optional, default: 'euclidean') – Distance metric for MDS. Recommended values: ‘euclidean’ and ‘cosine’. Any metric from scipy.spatial.distance can be used. Custom distance functions of the form f(x, y) = d are also accepted
mds (string, optional, default: 'metric') – choose from [‘classic’, ‘metric’, ‘nonmetric’]. Selects which MDS algorithm is used for dimensionality reduction
n_jobs (integer, optional, default: 1) – The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used
random_state (integer or numpy.random.RandomState, optional, default: None) – The generator used to initialize SMACOF (metric, nonmetric) MDS. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator
verbose (int or boolean, optional (default: 1)) – If True or > 0, print status messages
potential_method (deprecated) – Use gamma=1 for log transformation and gamma=0 for square root transformation.
kwargs (additional arguments for graphtools.Graph) –
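
To see how knn and decay interact, the following sketch evaluates one side of the alpha-decaying kernel from the PHATE paper, exp(-(d / eps)^alpha), where eps is the distance from a point to its knn-th nearest neighbor and alpha is the decay. It is a pure-Python illustration under that assumption, not the graphtools implementation (which also symmetrizes the kernel); the helper name alpha_decay_affinity is hypothetical.

```python
import math

def alpha_decay_affinity(dist, bandwidth, decay):
    """One-sided alpha-decaying kernel value (hypothetical helper).

    dist      -- distance between a point x and another point y
    bandwidth -- distance from x to its knn-th nearest neighbor
    decay     -- the exponent alpha; larger values give sharper kernel tails
    """
    return math.exp(-((dist / bandwidth) ** decay))

# With the bandwidth fixed at 1.0, compare a gentle and a steep decay at
# distances inside, at, and beyond the knn-th neighbor distance.
for decay in (2, 40):
    values = [alpha_decay_affinity(d, 1.0, decay) for d in (0.5, 1.0, 1.5)]
    print(decay, [round(v, 4) for v in values])
```

A larger decay pushes affinities inside the bandwidth toward 1 and those outside toward 0, so the kernel behaves more like a hard kNN cutoff; a larger knn widens the bandwidth itself.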

X

Type:	array-like, shape=[n_samples, n_dimensions]

embedding

Stores the position of the dataset in the embedding space

Type:	array-like, shape=[n_samples, n_components]

graph

The graph built on the input data

Type:	graphtools.base.BaseGraph

optimal_t

The automatically selected t, when t = ‘auto’. When t is given, optimal_t is None.

Type:	int
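
As a rough illustration of how t = ‘auto’ could select a value, the sketch below computes the von Neumann entropy of a toy spectrum for t = 1..t_max and picks a knee point. The spectrum, the helper names, and the chord-distance knee heuristic are all assumptions made for illustration; PHATE's own knee detection may differ.

```python
import math

def von_neumann_entropy(eigenvalues, t_max):
    """Entropy of the normalized spectrum raised to powers t = 1..t_max."""
    entropies = []
    for t in range(1, t_max + 1):
        powered = [abs(ev) ** t for ev in eigenvalues]
        total = sum(powered)
        probs = [p / total for p in powered]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return entropies

def knee_index(values):
    """Index of the point farthest from the line joining the two endpoints.

    A simple stand-in heuristic, not the knee detection used by PHATE.
    """
    n = len(values)
    x0, y0, x1, y1 = 0, values[0], n - 1, values[-1]

    def distance(i):
        # Unnormalized perpendicular distance from (i, values[i]) to the chord
        return abs((y1 - y0) * (i - x0) - (x1 - x0) * (values[i] - y0))

    return max(range(n), key=distance)

# Toy spectrum (assumed): one stationary eigenvalue and a decaying tail.
spectrum = [1.0, 0.9, 0.8, 0.5, 0.3, 0.1]
entropy = von_neumann_entropy(spectrum, t_max=50)
chosen_t = knee_index(entropy) + 1  # entropy[0] corresponds to t = 1
print(chosen_t)
```

The entropy falls quickly while small eigenvalues are being suppressed and then flattens; the knee marks the t beyond which further diffusion removes little additional noise.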

Examples

>>> import phate
>>> import matplotlib.pyplot as plt
>>> tree_data, tree_clusters = phate.tree.gen_dla(n_dim=100, n_branch=20,
...                                               branch_length=100)
>>> tree_data.shape
(2000, 100)
>>> phate_operator = phate.PHATE(knn=5, decay=20, t=150)
>>> tree_phate = phate_operator.fit_transform(tree_data)
>>> tree_phate.shape
(2000, 2)
>>> phate.plot.scatter2d(tree_phate, c=tree_clusters)

References

[1]	Moon KR, van Dijk D, Wang Z, et al. (2017), PHATE: A Dimensionality Reduction Method for Visualizing Trajectory Structures in High-Dimensional Biological Data, BioRxiv.

diff_op

The diffusion operator built from the graph

Type:	array-like, shape=[n_samples, n_samples] or [n_landmark, n_landmark]

diff_potential

Interpolates the PHATE potential to one entry per cell

This is equivalent to calculating infinite-dimensional PHATE, or running PHATE without the MDS step.

Returns:	diff_potential
Return type:	ndarray, shape=[n_samples, min(n_landmark, n_samples)]
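
The effect of gamma on the potential can be sketched for a single diffused transition probability p. This sketch assumes the power-family form (p^c) / c with c = (1 - gamma) / 2, which yields a scaled square root at gamma=0, and treats gamma=1 as the log potential; the helper name and the small offset inside the logarithm are assumptions, so check the library source before relying on the exact constants.

```python
import math

def diffusion_potential(p, gamma):
    """Transform one diffused transition probability (hypothetical helper)."""
    if not -1 <= gamma <= 1:
        raise ValueError("gamma must lie between -1 and 1")
    if gamma == 1:
        # Log potential; the small offset (assumed value) avoids log(0)
        return -math.log(p + 1e-7)
    if gamma == -1:
        # No transformation: the diffused probability is used directly
        return p
    c = (1 - gamma) / 2
    return (p ** c) / c

for gamma in (1, 0, -1):
    print(gamma, diffusion_potential(0.25, gamma))
```

Applying the transformation entry-wise to the powered diffusion operator gives the potential matrix described above, on which distances are then computed for MDS.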

fit(X)

Computes the diffusion operator

Parameters:	X (array, shape=[n_samples, n_features]) – input data with n_samples samples and n_features features. Accepted data types: numpy.ndarray, scipy.sparse.spmatrix, pd.DataFrame, anndata.AnnData. If knn_dist is ‘precomputed’, data should be an n_samples x n_samples distance or affinity matrix
Returns:	phate_operator (PHATE) – The estimator object
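
When knn_dist is ‘precomputed’, the knn_dist documentation states that the matrix type is inferred from its first diagonal entry. A minimal sketch of that rule, using nested lists in place of an array and a hypothetical helper name, might look like this:

```python
def detect_precomputed_kind(matrix):
    """Guess whether a square matrix holds distances or affinities.

    Follows the documented rule: a zero in the first diagonal entry
    suggests distances, a non-zero value suggests affinities.
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("precomputed data must be n_samples x n_samples")
    if matrix[0][0] == 0:
        return "precomputed_distance"
    return "precomputed_affinity"

distances = [[0.0, 2.0], [2.0, 0.0]]
affinities = [[1.0, 0.1], [0.1, 1.0]]
print(detect_precomputed_kind(distances))   # precomputed_distance
print(detect_precomputed_kind(affinities))  # precomputed_affinity
```

If a matrix does not follow this diagonal convention, passing knn_dist=‘precomputed_distance’ or knn_dist=‘precomputed_affinity’ explicitly avoids relying on the check.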

fit_transform(X, **kwargs)

Computes the diffusion operator and the position of the cells in the embedding space

Parameters:

X (array, shape=[n_samples, n_features]) – input data with n_samples samples and n_features features. Accepted data types: numpy.ndarray, scipy.sparse.spmatrix, pd.DataFrame, anndata.AnnData. If knn_dist is ‘precomputed’, data should be an n_samples x n_samples distance or affinity matrix
kwargs (further arguments for PHATE.transform()) – Keyword arguments as specified in transform()

Returns:	embedding – The cells embedded in a lower dimensional space using PHATE
Return type:	array, shape=[n_samples, n_components]

reset_mds(**kwargs)

Deprecated. Reset parameters related to multidimensional scaling

Parameters:

n_components (int, optional, default: None) – If given, sets number of dimensions in which the data will be embedded
mds (string, optional, default: None) – choose from [‘classic’, ‘metric’, ‘nonmetric’]. If given, sets which MDS algorithm is used for dimensionality reduction
mds_dist (string, optional, default: None) – recommended values: ‘euclidean’ and ‘cosine’. Any metric from scipy.spatial.distance can be used. If given, sets the distance metric for MDS

reset_potential(**kwargs)

Deprecated. Reset parameters related to the diffusion potential

Parameters:

t (int or 'auto', optional, default: None) – Power to which the diffusion operator is raised. If given, sets the level of diffusion
potential_method (string, optional, default: None) – choose from [‘log’, ‘sqrt’]. If given, sets which transformation of the diffusion operator is used to compute the diffusion potential

set_params(**params)

Set the parameters on this estimator.

Any parameters not given as named arguments will be left at their current value.

Parameters:

n_components (int, optional, default: 2) – number of dimensions in which the data will be embedded
knn (int, optional, default: 5) – number of nearest neighbors on which to build kernel
decay (int, optional, default: 40) – sets decay rate of kernel tails. If None, alpha decaying kernel is not used
n_landmark (int, optional, default: 2000) – number of landmarks to use in fast PHATE
t (int or 'auto', optional, default: 'auto') – power to which the diffusion operator is raised. This sets the level of diffusion. If ‘auto’, t is selected according to the knee point in the von Neumann entropy of the diffusion operator
gamma (float, optional, default: 1) – Informational distance constant between -1 and 1. gamma=1 gives the PHATE log potential, gamma=0 gives a square root potential.
n_pca (int, optional, default: 100) – Number of principal components to use for calculating neighborhoods. For extremely large datasets, using n_pca < 20 allows neighborhoods to be calculated in roughly log(n_samples) time.
mds_solver ({'sgd', 'smacof'}, optional (default: 'sgd')) – which solver to use for metric MDS. SGD is substantially faster, but produces slightly less optimal results. Note that SMACOF was used for all figures in the PHATE paper.
knn_dist (string, optional, default: 'euclidean') – distance metric for building the kNN graph. Recommended values: ‘euclidean’, ‘cosine’, ‘precomputed’. Any metric from scipy.spatial.distance can be used. Custom distance functions of the form f(x, y) = d are also accepted. If ‘precomputed’, data should be an n_samples x n_samples distance or affinity matrix. Distance matrices are assumed to have zeros down the diagonal, while affinity matrices are assumed to have non-zero values down the diagonal. This is detected automatically using data[0,0]. You can override this detection with knn_dist=‘precomputed_distance’ or knn_dist=‘precomputed_affinity’.
knn_max (int, optional, default: None) – Maximum number of neighbors for which alpha decaying kernel is computed for each point. For very large datasets, setting knn_max to a small multiple of knn can speed up computation significantly.
mds_dist (string, optional, default: 'euclidean') – distance metric for MDS. Recommended values: ‘euclidean’ and ‘cosine’. Any metric from scipy.spatial.distance can be used
mds (string, optional, default: 'metric') – choose from [‘classic’, ‘metric’, ‘nonmetric’]. Selects which MDS algorithm is used for dimensionality reduction
n_jobs (integer, optional, default: 1) – The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used
random_state (integer or numpy.random.RandomState, optional, default: None) – The generator used to initialize SMACOF (metric, nonmetric) MDS. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator
verbose (int or boolean, optional (default: 1)) – If True or > 0, print status messages
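
The n_jobs convention described above can be written out as a small helper. This is a sketch of the arithmetic only: the function name is hypothetical, and both the rejection of n_jobs=0 and the floor of one worker are assumptions borrowed from common practice rather than documented PHATE behavior.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Number of workers implied by n_jobs on a machine with n_cpus CPUs."""
    if n_jobs == 0:
        # Assumption: zero workers is treated as an invalid request
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on (assumed floor of 1)
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs

for requested in (1, 4, -1, -2):
    print(requested, resolve_n_jobs(requested, n_cpus=8))
```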

Examples

>>> import phate
>>> import matplotlib.pyplot as plt
>>> tree_data, tree_clusters = phate.tree.gen_dla(n_dim=50, n_branch=5,
...                                               branch_length=50)
>>> tree_data.shape
(250, 50)
>>> phate_operator = phate.PHATE(knn=5, decay=20, t=150)
>>> tree_phate = phate_operator.fit_transform(tree_data)
>>> tree_phate.shape
(250, 2)
>>> phate_operator.set_params(n_components=10)
PHATE(decay=20, knn=5, knn_dist='euclidean', mds='metric',
   mds_dist='euclidean', n_components=10, n_jobs=1, n_landmark=2000,
   n_pca=100, potential_method='log', random_state=None, t=150,
   verbose=1)
>>> tree_phate = phate_operator.transform()
>>> tree_phate.shape
(250, 10)
>>> # plt.scatter(tree_phate[:,0], tree_phate[:,1], c=tree_clusters)
>>> # plt.show()

Returns:	self

transform(X=None, t_max=100, plot_optimal_t=False, ax=None)

Computes the position of the cells in the embedding space

Parameters:

X (array, optional, shape=[n_samples, n_features]) – input data with n_samples samples and n_features features. Not required, since PHATE does not currently embed cells not given in the input matrix to PHATE.fit(). Accepted data types: numpy.ndarray, scipy.sparse.spmatrix, pd.DataFrame, anndata.AnnData. If knn_dist is ‘precomputed’, data should be an n_samples x n_samples distance or affinity matrix
t_max (int, optional, default: 100) – maximum t to test if t is set to ‘auto’
plot_optimal_t (boolean, optional, default: False) – If true and t is set to ‘auto’, plot the Von Neumann entropy used to select t
ax (matplotlib.axes.Axes, optional) – If given and plot_optimal_t is true, plot will be drawn on the given axis.

Returns:

embedding (array, shape=[n_samples, n_components]) – The cells embedded in a lower-dimensional space using PHATE