Downstream Analysis Models
Identifying cellular compartments / tissue zones (sklearn NMF)
Colocated cell combination model: de novo factorisation of cell type density using sklearn NMF.

class cell2location.models.downstream.CoLocatedGroupsSklearnNMF.CoLocatedGroupsSklearnNMF(n_fact: int, X_data: numpy.ndarray, n_iter=10000, verbose=True, var_names=None, var_names_read=None, obs_names=None, fact_names=None, sample_id=None, init='random', random_state=0, alpha=0.1, l1_ratio=0.5, nmf_kwd_args={})
Bases: cell2location.models.base.base_model.BaseModel
Colocated cell combination model: de novo factorisation of cell type density using sklearn NMF.
This model takes the absolute cell density inferred by cell2location as input to non-negative matrix factorisation (NMF) to identify groups of cell types with similar locations, or 'tissue zones'.
If you want to find the most distinct cell type combinations, use a small number of factors.
If you want to find very strong colocation signal and assume that most cell types are on their own, use a lot of factors (> 30).
To perform this analysis we initialise the model and train it several times to evaluate consistency. This class wraps around scikit-learn NMF to perform training, visualisation and export of the results.
Note
Factors are exchangeable so while you find factors with consistent cell type composition, every time you train the model you get those factors in a different order.
This analysis is most revealing for tissues (such as lymph node) and cell types (such as glial cells) where signals between cell types mediate their location patterns. In the mouse brain locations of neurones are determined during development so most neurones stand alone in their location pattern.
Density \(w_{sf}\) of each cell type f across locations s is modelled as an additive function of the cell combinations (microenvironments) r. This means the density of one cell type in one location can be explained by two or more distinct combinations r.
Cell type density is therefore a function of the following nonnegative components:
\[w_{sf} = \sum_{r} ({i_{sr} \: k_{rf} \: m_{f}})\]
Components
\(k_{rf}\) represents the proportion of cells of each type (regulatory programmes) f that correspond to each colocated combination r, normalised for the total abundance of each cell type \(m_{f}\).
\(m_{f}\), the cell type budget, accounts for the difference in abundance between cell types, thus focusing the interpretation of \(k_{rf}\) on cell colocation.
\(i_{sr}\) is proportional to the number of cells from each neighbourhood r in each location s, and shows the abundance of combinations r in locations s.
In practice \(q_{rf} = k_{rf} \: m_{f}\) is obtained from scikit-learn NMF and normalised by the sum across combinations r to obtain \(k_{rf}\):
\[k_{rf} = q_{rf} / (\sum_{r} q_{rf})\]
Note
So, the model reports the proportion of cells of each type that belong to each combination (parameter called ‘cell_type_fractions’). For example, 81% of Astro_2 are found in fact_28. This way we account for the absolute abundance of each cell type.
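The normalisation above can be illustrated with a small NumPy sketch. The matrix `q` below is made-up data standing in for the cell type loadings returned by NMF (rows are combinations r, columns are cell types f); it is not output from cell2location. Because each column of \(k_{rf}\) sums to 1, \(m_{f}\) is simply the column sum of \(q_{rf}\).

```python
import numpy as np

# Hypothetical NMF cell type loadings q[r, f] = k[r, f] * m[f]
# (2 combinations x 3 cell types); values are invented for illustration.
q = np.array([
    [8.0, 1.0, 0.0],
    [2.0, 1.0, 5.0],
])

# Since sum_r k[r, f] = 1, the cell type budget is m[f] = sum_r q[r, f].
m = q.sum(axis=0)

# Fraction of each cell type assigned to each combination.
k = q / m

print(m)  # budget per cell type
print(k)  # each column sums to 1
```

Here 80% of the first cell type belongs to the first combination, regardless of how abundant that cell type is overall.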
 Parameters
n_fact – Maximum number of cell type groups, or factors
X_data – NumPy array of cell abundance, with cell types in columns and locations in rows
n_iter – number of training iterations
verbose, var_names, var_names_read, obs_names, fact_names, sample_id – see parent class BaseModel for details
init, random_state, alpha, l1_ratio – arguments for sklearn.decomposition.NMF with sensible defaults; see help(sklearn.decomposition.NMF) for more details
nmf_kwd_args – dictionary with more keyword arguments for sklearn.decomposition.NMF

fit(n=3, n_type='restart')
Find parameters using sklearn.decomposition.NMF, optionally restart several times, and export parameters to self.samples['post_sample_means'].
 Parameters
n – number of independent initialisations (Default value = 3)
n_type – type of repeated initialisation (Default value = 'restart'):
'restart' to pick a different initial value;
'cv' for molecular cross-validation, which splits counts into n datasets (for now, only n=2 is implemented);
'bootstrap' for fitting the model to multiple downsampled datasets. Run mod.bootstrap_data() to generate variants of the data.
 Returns
exported parameters in self.samples['post_sample_means']
 Return type
None

evaluate_stability(node_name, align=True, n_samples=1000)
Evaluate stability of the solution between training initialisations (correlates the values of factors between training initialisations).
 Parameters
node_name – name of the parameter to evaluate, see self.samples['post_sample_means'].keys(). Factors should be in columns.
align – boolean, whether to match factors between training restarts using linear_sum_assignment (Default value = True)
n_samples – does nothing; kept to preserve call signature consistency with Bayesian models
 Returns
plots comparing all training initialisations to initialisation 1.
 Return type
None
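Because factors come out in a different order on every restart, they need to be matched before they can be correlated. The sketch below shows one way to do that matching on synthetic data, assuming the `linear_sum_assignment` mentioned above is `scipy.optimize.linear_sum_assignment`. The helper `align_factors` and the random matrices are hypothetical and are not part of cell2location.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_factors(ref, other):
    """Reorder the columns of `other` to best match the columns of `ref`.

    Both arrays have factors in columns. Returns the reordered array and
    the correlation of each matched pair.
    """
    n = ref.shape[1]
    # Pearson correlation of every ref factor with every other factor.
    corr = np.corrcoef(ref.T, other.T)[:n, n:]
    # Minimising negative correlation maximises total matched correlation.
    row_ind, col_ind = linear_sum_assignment(-corr)
    return other[:, col_ind], corr[row_ind, col_ind]


# Synthetic "restart": the same factors, shuffled and slightly perturbed.
rng = np.random.default_rng(0)
ref = rng.random((50, 4))
perm = np.array([2, 0, 3, 1])
other = ref[:, perm] + 0.01 * rng.random((50, 4))

aligned, matched_corr = align_factors(ref, other)
print(matched_corr)  # close to 1 when the solution is stable
```

Matched correlations well below 1 would indicate factors that are not reproduced between initialisations.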

sample_posterior(node='all', n_samples=1000, save_samples=False, return_samples=True, mean_field_slot='init_1')
This function does nothing; it is kept to preserve the call signature of future Bayesian versions of the model.

compute_expected_fact(fact_ind=None)
Compute the expected abundance of each cell type in each location that comes from a subset of factors, e.g. the expressed factors in self.fact_filt.
 Parameters
fact_ind – (Default value = None)
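As a rough illustration of what restricting the expectation to a subset of factors means, the sketch below sums \(i_{sr} \: q_{rf}\) over selected factors only. The function `expected_from_factors` and its arguments are hypothetical; the library method may differ in details such as how `fact_ind=None` is handled (here it is assumed to mean all factors).

```python
import numpy as np


def expected_from_factors(loc_loadings, ct_loadings, fact_ind=None):
    """Expected cell type abundance per location from selected factors.

    loc_loadings: (locations x factors) array, i_sr
    ct_loadings:  (factors x cell types) array, q_rf = k_rf * m_f
    fact_ind:     indices of factors to include (assumed: None = all)
    """
    if fact_ind is None:
        fact_ind = np.arange(loc_loadings.shape[1])
    fact_ind = np.asarray(fact_ind)
    # Sum over the selected factors only.
    return loc_loadings[:, fact_ind] @ ct_loadings[fact_ind, :]


# Toy loadings: 2 locations, 2 factors, 2 cell types.
i_sr = np.array([[1.0, 2.0],
                 [0.0, 3.0]])
q_rf = np.array([[1.0, 0.0],
                 [2.0, 1.0]])

total = expected_from_factors(i_sr, q_rf)
only_second = expected_from_factors(i_sr, q_rf, fact_ind=[1])
print(total)
print(only_second)
```

Since the model is additive, the contributions of disjoint factor subsets add up to the full expectation.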

plot_posterior_mu_vs_data(mu_node_name='mu', data_node='X_data')
Plot the expected value (of cell density) of the model against the observed input data as a 2D histogram, where each point corresponds to one entry of the input data matrix.
 Parameters
mu_node_name – name of the object slot containing the expected value (Default value = 'mu')
data_node – name of the object slot containing the data (Default value = 'X_data')
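The kind of comparison this plot makes can be sketched with `numpy.histogram2d`, which bins every (observed, expected) pair of matrix entries. The arrays below are synthetic stand-ins for the `X_data` and `mu` slots, not values produced by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: observed densities and a noisy "model expectation".
data = rng.gamma(shape=2.0, scale=1.0, size=(200, 5))
mu = data * rng.normal(loc=1.0, scale=0.1, size=data.shape)

# Each entry of the matrix contributes one point to the 2D histogram.
counts, x_edges, y_edges = np.histogram2d(data.ravel(), mu.ravel(), bins=30)

print(counts.shape)       # (30, 30)
print(int(counts.sum()))  # number of matrix entries
```

A good fit concentrates the counts along the diagonal; the binned counts can be drawn with any plotting library, for example `matplotlib.pyplot.pcolormesh`.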

sample2df(node_name='nUMI_factors', ct_node_name='cell_type_factors')
Export cell combinations and their profiles across locations as pandas data frames.
 Parameters
node_name – name of the location loadings model parameter to be exported (Default value = 'nUMI_factors')
ct_node_name – name of the cell type loadings model parameter to be exported (Default value = 'cell_type_factors')
 Returns
8 pandas data frames added to the model object: .cell_type_loadings, .cell_factors_sd, .cell_factors_q05, .cell_factors_q95, .gene_loadings, .gene_loadings_sd, .gene_loadings_q05, .gene_loadings_q95
 Return type
None
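The methods documented above could be chained roughly as follows. This is a sketch based only on the signatures on this page: the wrapper `run_colocation_sketch` is hypothetical, the import path is taken from the class path shown above, and the choice of `'cell_type_factors'` as the `node_name` for `evaluate_stability` is an assumption (check `self.samples['post_sample_means'].keys()` after fitting). The import is guarded so the snippet still loads when cell2location, or this module, is not installed.

```python
try:
    from cell2location.models.downstream.CoLocatedGroupsSklearnNMF import (
        CoLocatedGroupsSklearnNMF,
    )
except ImportError:  # package (or this module) is not available
    CoLocatedGroupsSklearnNMF = None


def run_colocation_sketch(X_data, var_names, obs_names, n_fact=5, n_restarts=3):
    """Hypothetical wrapper: initialise, fit with restarts, check, export.

    X_data: array of cell abundance, locations in rows, cell types in columns.
    """
    if CoLocatedGroupsSklearnNMF is None:
        raise ImportError("cell2location downstream models are not installed")

    mod = CoLocatedGroupsSklearnNMF(
        n_fact, X_data, var_names=var_names, obs_names=obs_names
    )
    # Train several times from different initial values.
    mod.fit(n=n_restarts, n_type="restart")
    # Compare restarts; the node name here is an assumption.
    mod.evaluate_stability("cell_type_factors", align=True)
    # Export the loadings as pandas data frames on the model object.
    mod.sample2df(node_name="nUMI_factors", ct_node_name="cell_type_factors")
    return mod
```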
Archetypal Analysis
Archetypal tissue zones.

class cell2location.models.downstream.ArchetypalAnalysis.ArchetypalAnalysis(n_fact: int, X_data: numpy.ndarray, n_iter=5000, verbose=True, var_names=None, var_names_read=None, obs_names=None, fact_names=None, sample_id=None, random_state=0, pcha_kwd_args={})
Bases: cell2location.models.base.base_model.BaseModel
This model identifies archetypal tissue zones using the PCHA algorithm.
To use this class, first run pip install py_pcha to install the dependency.
This model takes the absolute cell density inferred by cell2location as input to archetypal analysis, which aims to find a set of the most distinct tissue zones. Unlike in standard clustering, these zones can be spatially interlaced.
To perform this analysis we initialise the model and train it several times to evaluate consistency. This class wraps around the py_pcha package to perform training, visualisation and export of the results.
For more details on Archetypal Analysis using Principal Convex Hull Analysis (PCHA) see https://github.com/ulfaslak/py_pcha.
Note
Archetypes are exchangeable so while you find archetypes with consistent cell type composition, every time you train the model you get those archetypes in a different order.
Density \(w_{sf}\) of each cell type f across locations s is modelled as an additive function of the archetypes r. This means the density of one cell type in one location can be explained by two or more distinct archetypes r.
Cell type density is therefore a function of the following nonnegative components:
\[w_{sf} = \sum_{r} ({i_{sr} \: k_{rf} \: m_{f}})\]
Components
\(k_{rf}\) represents the proportion of cells of each type f that correspond to each colocated combination r, normalised for the total abundance of each cell type \(m_{f}\).
\(m_{f}\) is the total abundance of each cell type.
\(i_{sr}\) represents the contribution of each archetype r in each location s, constrained as follows:
\[\sum_{r} i_{sr} = 1\]
In practice \(q_{rf} = k_{rf} \: m_{f}\) is obtained by performing archetypal analysis and normalised by the sum across combinations r to obtain \(k_{rf}\):
\[k_{rf} = q_{rf} / (\sum_{r} q_{rf})\]
Note
So, the model reports the proportion of cells of each type that belong to each combination (parameter called ‘cell_type_fractions’). For example, 81% of Astro_2 are found in fact_28. This way we account for the absolute abundance of each cell type.
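The constraint \(\sum_{r} i_{sr} = 1\) means each location is a convex mixture of archetype profiles. The toy NumPy sketch below illustrates this with invented numbers (it does not call py_pcha): reconstructed densities always lie between the smallest and largest archetype value for each cell type.

```python
import numpy as np

# Hypothetical archetype profiles q[r, f] (2 archetypes x 3 cell types).
q = np.array([[10.0, 0.0, 2.0],
              [0.0, 6.0, 2.0]])

# Archetype weights i[s, r] for 3 locations; every row sums to 1.
i = np.array([[1.0, 0.0],
              [0.25, 0.75],
              [0.5, 0.5]])

# w[s, f] = sum_r i[s, r] * q[r, f]
w = i @ q
print(w)

# Convexity: every location lies within the range spanned by the archetypes.
within_hull_bounds = bool(
    (w >= q.min(axis=0)).all() and (w <= q.max(axis=0)).all()
)
print(within_hull_bounds)
```

The first location coincides with the first archetype, while the other two are mixtures, which is how archetypal zones can be spatially interlaced.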
 Parameters
n_fact – Maximum number of archetypes
X_data – NumPy array of cell abundance, with cell types in columns and locations in rows
n_iter – number of training iterations
verbose, var_names, var_names_read, obs_names, fact_names, sample_id – see parent class BaseModel for details
random_state – random seed (init, alpha and l1_ratio are sklearn.decomposition.NMF arguments and do not appear in this class's signature)
pcha_kwd_args – dictionary with more keyword arguments for py_pcha.PCHA

fit(n=3, n_type='restart')
Find parameters using py_pcha.PCHA, optionally restart several times, and export parameters to self.samples['post_sample_means'].
 Parameters
n – number of independent initialisations (Default value = 3)
n_type – type of repeated initialisation (Default value = 'restart'):
'restart' to pick a different initial value;
'cv' for molecular cross-validation, which splits counts into n datasets (for now, only n=2 is implemented);
'bootstrap' for fitting the model to multiple downsampled datasets. Run mod.bootstrap_data() to generate variants of the data.
 Returns
exported parameters in self.samples['post_sample_means']
 Return type
None

evaluate_stability(node_name, align=True, n_samples=1000)
Evaluate stability of the solution between training initialisations (correlates the values of archetypes between training initialisations).
 Parameters
node_name – name of the parameter to evaluate, see self.samples['post_sample_means'].keys(). Factors should be in columns.
align – boolean, whether to match factors between training restarts using linear_sum_assignment (Default value = True)
n_samples – does nothing; kept to preserve call signature consistency with Bayesian models
 Returns
plots comparing all training initialisations to initialisation 1.
 Return type
None

sample_posterior(node='all', n_samples=1000, save_samples=False, return_samples=True, mean_field_slot='init_1')
This function does nothing; it is kept to preserve the call signature of future Bayesian versions of the model.

compute_expected_fact(fact_ind=None)
Compute the expected abundance of each cell type in each location that comes from a subset of archetypes.
 Parameters
fact_ind – (Default value = None)

plot_posterior_mu_vs_data(mu_node_name='mu', data_node='X_data')
Plot the expected value (of cell density) of the model against the observed input data as a 2D histogram, where each point corresponds to one entry of the input data matrix.
 Parameters
mu_node_name – name of the object slot containing the expected value (Default value = 'mu')
data_node – name of the object slot containing the data (Default value = 'X_data')

sample2df(node_name='nUMI_factors', ct_node_name='cell_type_factors')
Export archetypes and their profiles across locations as pandas data frames.
 Parameters
node_name – name of the location loadings model parameter to be exported (Default value = 'nUMI_factors')
ct_node_name – name of the cell type loadings model parameter to be exported (Default value = 'cell_type_factors')
 Returns
8 pandas data frames added to the model object: .cell_type_loadings, .cell_factors_sd, .cell_factors_q05, .cell_factors_q95, .gene_loadings, .gene_loadings_sd, .gene_loadings_q05, .gene_loadings_q95
 Return type
None