/ Apr 26 2019

Explore an HCA data set in Scanpy

ENSG_to_name.csv
pancreas.loom

This notebook shows a few quick steps to start exploring an example HCA data sets using scanpy (citation).

%matplotlib inline

import loompy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy.api as sc

sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80, color_map='Greys')
sc.logging.print_versions()
results_file = '/results/pancreas.h5ad'

1. Read the expression matrix file downloaded from the HCA

The file shown here is from the study Single-Cell Analysis of Human Pancreas Reveals Transcriptional Signatures of Aging and Somatic Mutation Patterns by Enge, et al., available in this repository at ./pancreas.loom

adata = sc.read_loom(
pancreas.loom
, sparse=True, cleanup=False, X_name='spliced', obs_names='CellID', var_names='Gene')

2. Query for genes of interest

HCA uses Gencode nomenclature as gene identifiers. A list of these identifiers and their corresponding gene symbols is included in this repository as ./ENSG_to_name.csv

gene_names = pd.read_csv(
ENSG_to_name.csv
, index_col = 0) named_list = adata.var.join(gene_names)
named_list.query('name == ["GCG"]') ## replace GCG with gene of interest 
name
ENSG00000115263.14 GCG

3. Filter and normalize

  • Filter out genes that are present in few cells, cells that contain few genes
  • Transform the sum of (remaining) gene expression counts in each cell to equal the same number (1e4)
  • Keep 500 highest expressed genes from each cell (this threshold used in initial publication)
  • Transform the sum of (remaining) gene expression counts in each cell to equal the same number (1e4)
  • Keep 500 highest expressed genes from each cell (this threshold used in initial publication)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)

adata.obs['n_counts'] = np.sum(adata.X, axis=1).A1
dat = sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4, copy=True)

# here we'll make "top", which is the union of the top 500 expressed genes from each cell
top = []
for i in range(dat.shape[0]):
    indices = np.asarray(np.argsort(dat.X[i,:].todense())[0,:]).flatten()[::-1][0:500]
    top = np.union1d(top, indices)
    
# update adata to include only the list "top"
adata = adata[:, top.astype('int')]

4. Perform dimensionality reduction and clustering

  • Compute the principal components
  • Generate the neighborhood graph
  • Determine clusters using Louvain method
  • Perform dimensionality reduction with UMAP
  • Perform dimensionality reduction with tSNE
  • Generate the neighborhood graph
  • Determine clusters using Louvain method
  • Perform dimensionality reduction with UMAP
  • Perform dimensionality reduction with tSNE
sc.pp.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata)
#sc.tl.louvain(adata, resolution=0.15)
sc.tl.tsne(adata)
sc.tl.umap(adata)
adata.write(results_file)
pancreas.h5ad

4.1. Determine weighted expression of select genes

This will be used in a few figures below

genes = ['ENSG00000254647.6', 'ENSG00000115263.14', 'ENSG00000157005.3', 'ENSG00000204983.13', #INS #GCG #SST #PRSS1
         'ENSG00000007062.11','ENSG00000154096.13', 'ENSG00000108849.7'] #PRSS1 #PROM1 #THY1 #PPY
colors = np.asarray([[226,43,34], [67,114,175], [48,178,78], 
                     [139,47,175], [244,152,46], [249,246,49], [119,73,27]])
weights = adata[:,genes].X.todense()
weights = weights / (weights.sum(axis=1) + 0.0001)
c = np.dot(weights, colors)

5. Embed cells using tSNE

  • INS: ENSG00000254647.6
  • GCG: ENSG00000115263.14
  • PRSS1: ENSG00000204983.13
  • SST: ENSG00000157005.3
  • PROM1: ENSG00000007062.11
  • THY1: ENSG00000154096.13
  • PPY: ENSG00000108849.7
  • GCG: ENSG00000115263.14
  • PRSS1: ENSG00000204983.13
  • SST: ENSG00000157005.3
  • PROM1: ENSG00000007062.11
  • THY1: ENSG00000154096.13
  • PPY: ENSG00000108849.7

5.1. The quick way to plot in tSNE space

sc.pl.tsne(adata, color=['ENSG00000254647.6', 'ENSG00000115263.14', 'ENSG00000204983.13'], 
           wspace=0.3, size=150, linewidths=1, edgecolors='black');

5.2. A slightly prettier custom plot

def clean_plot():
    plt.grid(False)
    plt.yticks([])
    plt.xticks([])
    plt.xlabel('tSNE 1')
    plt.ylabel('tSNE 2')

plt.figure(figsize=[15,4.5])
ids = ['ENSG00000254647.6', 'ENSG00000115263.14', 'ENSG00000204983.13']
genes = ['INS', 'GCG', 'PRSS1']
plots = [131, 132, 133]
for i, g, p in zip(ids, genes, plots):
    plt.subplot(p);
    plt.scatter(adata.obsm['X_tsne'][:,0], adata.obsm['X_tsne'][:,1], c=adata[:,i].X,
                linewidths=0.3, edgecolors='black');
    plt.title(g)
    clean_plot()

5.3. Cells colored by weighted expression of select genes

plt.figure(figsize=[8, 8])
plt.scatter(adata.obsm['X_tsne'][:,0], adata.obsm['X_tsne'][:,1], c=c/255, alpha=0.3);
clean_plot()

6. Embed cells using UMAP

The same clusters and markers can be viewed using alternative embeddings:

6.1. Cells colored by louvain clustering

#sc.pl.umap(adata, color='louvain', wspace=0.3, size=150, linewidths=0.3, edgecolors='black');
## skipping this for now -- working on some louvain issues above

6.2. Cells colored by weighted expression of select genes

plt.figure(figsize=[8, 8])
plt.scatter(adata.obsm['X_umap'][:,0], adata.obsm['X_umap'][:,1], c=c/255, alpha=0.3);
plt.grid(False)
plt.yticks([])
plt.xticks([])
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2');

7. Where to go from here

There are many more ways to explore the data you've downloaded from the HCA -- see the scanpy tutorials for ideas https://scanpy.readthedocs.io/en/latest/examples.html#