with Joseph Kliegman
Explore an HCA data set in Scanpy
%matplotlib inline import loompy import matplotlib.pyplot as plt import numpy as np import pandas as pd import scanpy.api as sc sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3) sc.settings.set_figure_params(dpi=80, color_map='Greys') sc.logging.print_versions() results_file = '/results/pancreas.h5ad'
1. Read the expression matrix file downloaded from the HCA
The file shown here is from the study Single-Cell Analysis of Human Pancreas Reveals Transcriptional Signatures of Aging and Somatic Mutation Patterns by Enge, et al., available in this repository at ./pancreas.loom
adata = sc.read_loom(pancreas.loom, sparse=True, cleanup=False, X_name='spliced', obs_names='CellID', var_names='Gene')
2. Query for genes of interest
HCA uses Gencode nomenclature as gene identifiers. A list of these identifiers and their corresponding gene symbols is included in this repository as ./ENSG_to_name.csv
gene_names = pd.read_csv(ENSG_to_name.csv, index_col = 0) named_list = adata.var.join(gene_names)
named_list.query('name == ["GCG"]') ## replace GCG with gene of interest
name | |
---|---|
ENSG00000115263.14 | GCG |
3. Filter and normalize
- Filter out genes that are present in few cells, cells that contain few genes
- Transform the sum of (remaining) gene expression counts in each cell to equal the same number (1e4)
- Keep 500 highest expressed genes from each cell (this threshold used in initial publication)
- Transform the sum of (remaining) gene expression counts in each cell to equal the same number (1e4)
- Keep 500 highest expressed genes from each cell (this threshold used in initial publication)
sc.pp.filter_genes(adata, min_cells=3) sc.pp.filter_cells(adata, min_genes=200) adata.obs['n_counts'] = np.sum(adata.X, axis=1).A1 dat = sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4, copy=True) # here we'll make "top", which is the union of the top 500 expressed genes from each cell top = [] for i in range(dat.shape[0]): indices = np.asarray(np.argsort(dat.X[i,:].todense())[0,:]).flatten()[::-1][0:500] top = np.union1d(top, indices) # update adata to include only the list "top" adata = adata[:, top.astype('int')]
4. Perform dimensionality reduction and clustering
- Compute the principal components
- Generate the neighborhood graph
- Determine clusters using Louvain method
- Perform dimensionality reduction with UMAP
- Perform dimensionality reduction with tSNE
- Generate the neighborhood graph
- Determine clusters using Louvain method
- Perform dimensionality reduction with UMAP
- Perform dimensionality reduction with tSNE
4.1. Determine weighted expression of select genes
This will be used in a few figures below
genes = ['ENSG00000254647.6', 'ENSG00000115263.14', 'ENSG00000157005.3', 'ENSG00000204983.13', #INS #GCG #SST #PRSS1 'ENSG00000007062.11','ENSG00000154096.13', 'ENSG00000108849.7'] #PRSS1 #PROM1 #THY1 #PPY colors = np.asarray([[226,43,34], [67,114,175], [48,178,78], [139,47,175], [244,152,46], [249,246,49], [119,73,27]]) weights = adata[:,genes].X.todense() weights = weights / (weights.sum(axis=1) + 0.0001) c = np.dot(weights, colors)
5. Embed cells using tSNE
- INS: ENSG00000254647.6
- GCG: ENSG00000115263.14
- PRSS1: ENSG00000204983.13
- SST: ENSG00000157005.3
- PROM1: ENSG00000007062.11
- THY1: ENSG00000154096.13
- PPY: ENSG00000108849.7
- GCG: ENSG00000115263.14
- PRSS1: ENSG00000204983.13
- SST: ENSG00000157005.3
- PROM1: ENSG00000007062.11
- THY1: ENSG00000154096.13
- PPY: ENSG00000108849.7
5.1. The quick way to plot in tSNE space
sc.pl.tsne(adata, color=['ENSG00000254647.6', 'ENSG00000115263.14', 'ENSG00000204983.13'], wspace=0.3, size=150, linewidths=1, edgecolors='black');
5.2. A slightly prettier custom plot
def clean_plot(): plt.grid(False) plt.yticks([]) plt.xticks([]) plt.xlabel('tSNE 1') plt.ylabel('tSNE 2') plt.figure(figsize=[15,4.5]) ids = ['ENSG00000254647.6', 'ENSG00000115263.14', 'ENSG00000204983.13'] genes = ['INS', 'GCG', 'PRSS1'] plots = [131, 132, 133] for i, g, p in zip(ids, genes, plots): plt.subplot(p); plt.scatter(adata.obsm['X_tsne'][:,0], adata.obsm['X_tsne'][:,1], c=adata[:,i].X, linewidths=0.3, edgecolors='black'); plt.title(g) clean_plot()
5.3. Cells colored by weighted expression of select genes
plt.figure(figsize=[8, 8]) plt.scatter(adata.obsm['X_tsne'][:,0], adata.obsm['X_tsne'][:,1], c=c/255, alpha=0.3); clean_plot()
6. Embed cells using UMAP
The same clusters and markers can be viewed using alternative embeddings:
6.1. Cells colored by louvain clustering
#sc.pl.umap(adata, color='louvain', wspace=0.3, size=150, linewidths=0.3, edgecolors='black'); ## skipping this for now -- working on some louvain issues above
6.2. Cells colored by weighted expression of select genes
plt.figure(figsize=[8, 8]) plt.scatter(adata.obsm['X_umap'][:,0], adata.obsm['X_umap'][:,1], c=c/255, alpha=0.3); plt.grid(False) plt.yticks([]) plt.xticks([]) plt.xlabel('UMAP 1') plt.ylabel('UMAP 2');
7. Where to go from here
There are many more ways to explore the data you've downloaded from the HCA -- see the scanpy tutorials for ideas https://scanpy.readthedocs.io/en/latest/examples.html#