Parens for Python - UMAP & Trimap

We are going to explore some more Python libraries through the use of libpython-clj.

This time, we are going to focus on a couple of dimensionality reduction libraries called UMAP and Trimap. We will need a few supporting libraries installed to go through the examples. First, the Clojure dependencies in deps.edn:

{:deps
 {org.clojure/clojure {:mvn/version "1.10.1"}
  cnuernber/libpython-clj {:mvn/version "1.36"}}}

Then install the Python dependencies:

pip3 install seaborn
pip3 install matplotlib
pip3 install scikit-learn
pip3 install numpy
pip3 install pandas
pip3 install umap-learn
pip3 install trimap

We also need to set up a plotting namespace that wraps matplotlib:

(ns gigasquid.plot
  (:require [libpython-clj.require :refer [require-python]]
            [libpython-clj.python :as py :refer [py. py.. py.-]]))

First, we define a quick macro to render plots on our local system. It switches matplotlib (the library that seaborn is built on) to the headless Agg backend, so figures can be drawn without a display and saved as PNG files in the results/ directory, which must already exist.

;;;; have to set the headless mode before requiring pyplot
(def mplt (py/import-module "matplotlib"))
(py. mplt "use" "Agg")
(require-python 'matplotlib.pyplot)
(require-python 'matplotlib.backends.backend_agg)
(defmacro with-show
  "Takes forms with matplotlib.pyplot to then show locally"
  [& body]
  `(let [_# (matplotlib.pyplot/clf)
         fig# (matplotlib.pyplot/figure)
         agg-canvas# (matplotlib.backends.backend_agg/FigureCanvasAgg fig#)]
     ~(cons 'do body)
     (py. agg-canvas# "draw")
     ;; call (gensym) so each plot gets a fresh file name
     (matplotlib.pyplot/savefig (str "results/" (gensym) ".png"))))
gigasquid.plot/with-show
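
The shape of this macro (a setup step, the user's body, then a save step) can be tried without Python at all. The following is a minimal sketch, not the plotting macro itself: `with-steps` and the `events` atom are hypothetical names used only for illustration, and the "save" step just records a keyword instead of writing a file.

```clojure
;; Hypothetical log of what happened, standing in for matplotlib side effects.
(def events (atom []))

(defmacro with-steps
  "Runs a setup step, then body, then a save step.
  Returns a fresh file name for each use."
  [& body]
  `(let [file-name# (str "results/" (gensym "plot") ".png")]
     (swap! events conj :setup)   ; stands in for clf / figure
     ~(cons 'do body)             ; the caller's plotting forms
     (swap! events conj :save)    ; stands in for savefig
     file-name#))

(def first-file (with-steps (swap! events conj :body)))
(def second-file (with-steps (swap! events conj :body)))
```

Each use of the macro runs the three steps in order, and calling `(gensym "plot")` at runtime gives every use its own file name.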

UMAP

UMAP is a dimensionality reduction library. That may sound like a mouthful, but the idea is simple: it takes a complicated dataset with many variables and reduces it to something much simpler while trying to preserve its fundamental characteristics.

(ns gigasquid.umap
  (:require [libpython-clj.require :refer [require-python]]
            [libpython-clj.python :as py :refer [py. py.. py.-]]
            [gigasquid.plot :as plot]))
;;;; you will need all these things below installed
;;; with pip or something else
;;; What is umap? - dimensionality reduction library
(require-python '[seaborn :as sns])
(require-python '[matplotlib.pyplot :as pyplot])
(require-python '[sklearn.datasets :as sk-data])
(require-python '[sklearn.model_selection :as sk-model])
(require-python '[numpy :as numpy])
(require-python '[pandas :as pandas])
(require-python '[umap :as umap])
:ok

Next we are going to follow along the code tutorial from https://umap-learn.readthedocs.io/en/latest/basic_usage.html

Next, we set up the defaults for plotting and get some data to work with. We'll look at the Iris dataset. It isn't very representative of real-world data, since both the number of points and the number of features are small, but it will illustrate what is going on with dimensionality reduction.

;;; set the defaults for plotting
(sns/set)
(def iris (sk-data/load_iris))
(py.- iris DESCR)

We define a data frame and a series for the data set and can then plot the species.

(def iris-df (pandas/DataFrame (py.- iris data) :columns (py.- iris feature_names)))
(py/att-type-map iris-df)
(def iris-name-series (let [iris-name-map (zipmap (range 3) (py.- iris target_names))]
                        (pandas/Series (map (fn [item]
                                              (get iris-name-map item))
                                            (py.- iris target)))))
(py. iris-df __setitem__ "species" iris-name-series)
(py/get-item iris-df "species")
(plot/with-show
  (sns/pairplot iris-df :hue "species"))
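
The name lookup used to build the species series can be tried in plain Clojure. This is a small sketch with hypothetical stand-in data: `target-names` mirrors the three Iris species names, and `targets` is a made-up handful of integer labels rather than the real 150-element target array.

```clojure
;; Stand-ins for (py.- iris target_names) and (py.- iris target)
(def target-names ["setosa" "versicolor" "virginica"])
(def targets [0 0 1 2 1])

;; zipmap pairs each integer code with its species name
(def iris-name-map (zipmap (range 3) target-names))

;; translate every integer label into a readable name
(def species (mapv #(get iris-name-map %) targets))
```

The resulting vector of names is what gets wrapped in a pandas Series and attached to the data frame as the "species" column.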

Now it's time to reduce! First we define a reducer and then train it to learn about the manifold. The fit_transform function first fits the data and then transforms it, returning the result as a numpy array.

(def reducer (umap/UMAP))
(def embedding (py. reducer fit_transform (py.- iris data)))
(py.- embedding shape) ;=>  (150, 2)
;;; 150 samples with 2 columns. Each row of the array is a 2-dimensional
;;; representation of the corresponding flower, so we can plot the embedding
;;; as a standard scatterplot and color it by the target array (the
;;; transformed data is in the same order as the original).
(str (first embedding)) ;=> e.g. [12.449954  -6.0549345]; exact values vary between runs since no random_state is set
"[14.31796 -4.056695]"
(let [colors (mapv #(py/get-item (sns/color_palette) %)
                   (py.- iris target))
      x (mapv first embedding)
      y (mapv last embedding)]
 (plot/with-show
   (pyplot/scatter x y :c colors)
   (py. (pyplot/gca) set_aspect "equal" "datalim")
   (pyplot/title "UMAP projection of the Iris dataset" :fontsize 24)))
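
The plotting code splits the embedding into x and y columns with `mapv first` and `mapv last`. Here is a small sketch of that step on its own, using a hypothetical `sample-embedding` of three made-up rows in place of the real (150, 2) array.

```clojure
;; Made-up 2-dimensional rows standing in for the UMAP embedding
(def sample-embedding [[12.4 -6.1] [14.3 -4.1] [-3.2 8.8]])

;; first column becomes the x coordinates, last column the y coordinates
(def xs (mapv first sample-embedding))
(def ys (mapv last sample-embedding))
```

Because every row has exactly two elements, `last` picks out the second column here.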

UMAP with Digits Data

Now let's use a dataset with more complicated data. The handwritten digit set we all know and love.

(def digits (sk-data/load_digits))
(str (py.- digits DESCR))

Let's take a look at the images to see what we are dealing with:

(plot/with-show
  (let [[fig ax-array] (pyplot/subplots 20 20)
        axes (py. ax-array flatten)]
    (doall (map-indexed (fn [i ax]
                          (py. ax imshow (py/get-item (py.- digits images) i) :cmap "gray_r"))
                        axes))
    (pyplot/setp axes :xticks [] :yticks [] :frame_on false)
    (pyplot/tight_layout :h_pad 0.5 :w_pad 0.01)))

Now, let's do a scatterplot matrix of the first 10 of the 64 dimensions. Each dimension is the grayscale value of one pixel.

(def digits-df (pandas/DataFrame (mapv #(take 10 %) (py.- digits data))))
(def digits-target-series (pandas/Series (mapv #(str "Digit " %) (py.- digits target))))
(py. digits-df __setitem__ "digit" digits-target-series)
(plot/with-show
  (sns/pairplot digits-df :hue "digit" :palette "Spectral"))
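
The data preparation for this plot is ordinary sequence work, so it can be sketched in plain Clojure. The names `sample-rows` and `sample-targets` below are hypothetical stand-ins: two made-up 64-element rows and two made-up labels instead of the real digits data.

```clojure
;; Two made-up rows of 64 values each, standing in for (py.- digits data)
(def sample-rows [(vec (range 64)) (vec (range 100 164))])
;; Made-up labels standing in for (py.- digits target)
(def sample-targets [0 7])

;; keep only the first 10 of the 64 dimensions in every row
(def first-ten (mapv #(vec (take 10 %)) sample-rows))

;; turn each integer label into a readable hue category
(def labels (mapv #(str "Digit " %) sample-targets))
```

Cutting the rows down to 10 columns keeps the pair plot to a 10 by 10 grid of panels rather than 64 by 64.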

Let's reduce it!

;;;; use umap with the fit instead
(def reducer (umap/UMAP :random_state 42))
(py. reducer fit (py.- digits data))
;;; now we can look at the embedding_ attribute on the reducer or call transform on the original data
(def embedding (py. reducer transform (py.- digits data)))
(str (py.- embedding shape))
"(1797, 2)"

We now have a dataset with 1797 rows but only 2 columns. We can plot the resulting embedding, coloring the data points by the class to which they belong (the digit).

(plot/with-show
  (let [x (mapv first embedding)
        y (mapv last embedding)
        colors (py.- digits target)
        bounds (numpy/subtract (numpy/arange 11) 0.5)
        ticks (numpy/arange 10)]
    (pyplot/scatter x y :c colors :cmap "Spectral" :s 5)
    (py. (pyplot/gca) set_aspect "equal" "datalim")
    (py. (pyplot/colorbar :boundaries bounds) set_ticks ticks)
    (pyplot/title "UMAP projection of the Digits dataset" :fontsize 24)))
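
The colorbar arguments are worth a second look: `bounds` is 0 through 10 shifted down by 0.5, which places each digit's tick in the middle of its color band. The sketch below reproduces that arithmetic in plain Clojure, with `range` and `mapv` standing in for the numpy calls.

```clojure
;; ticks 0..9, one per digit class (stands in for numpy/arange 10)
(def ticks (vec (range 10)))

;; 11 boundaries from -0.5 to 9.5 (stands in for arange 11 minus 0.5)
(def bounds (mapv #(- % 0.5) (range 11)))

;; every tick t lies midway between boundary t and boundary t+1
(def centered?
  (every? (fn [t]
            (== t (/ (+ (nth bounds t) (nth bounds (inc t))) 2)))
          ticks))
```

Ten classes need eleven boundaries, which is why the two `arange` calls use 11 and 10.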

Trimap

Trimap is another dimensionality reduction library that uses a different algorithm: https://pypi.org/project/trimap/

(ns gigasquid.trimap
  (:require [libpython-clj.require :refer [require-python]]
            [libpython-clj.python :as py :refer [py. py.. py.-]]
            [gigasquid.plot :as plot]))
(require-python '[trimap :as trimap])
(require-python '[sklearn.datasets :as sk-data])
(require-python '[matplotlib.pyplot :as pyplot])
;;; numpy is used for the colorbar bounds and ticks in the plot below
(require-python '[numpy :as numpy])
:ok

We can do the digit example using it too.

(def digits (sk-data/load_digits))
(def digits-data (py.- digits data))
(def embedding (py. (trimap/TRIMAP) fit_transform digits-data))
(str (py.- embedding shape))
"(1797, 2)"

Finally, we can visualize it as before:

(plot/with-show
  (let [x (mapv first embedding)
        y (mapv last embedding)
        colors (py.- digits target)
        bounds (numpy/subtract (numpy/arange 11) 0.5)
        ticks (numpy/arange 10)]
    (pyplot/scatter x y :c colors :cmap "Spectral" :s 5)
    (py. (pyplot/gca) set_aspect "equal" "datalim")
    (py. (pyplot/colorbar :boundaries bounds) set_ticks ticks)
    (pyplot/title "Trimap projection of the Digits dataset" :fontsize 24)))

I hope that you have enjoyed this example and that it will spur your curiosity to try Python interop for yourself. You can find this code example, along with others, here: https://github.com/gigasquid/libpython-clj-examples
