Clojure Spec for Data Science

At the SciCloj meetup in Berlin, the idea came up to use clojure.spec to validate input data and then use generators derived from the same specs to replace the invalid values. Here we explore a proof of concept.

deps.edn:

{:deps
 {org.clojure/clojure {:mvn/version "1.10.1"}
  org.clojure/test.check {:mvn/version "0.10.0-alpha3"}}}

(require '[clojure.spec.alpha :as s])
(require '[clojure.spec.gen.alpha :as gen])

Let's start with a small example data set that mixes small integers with a couple of suspicious values.

(def data [1 2 1 4 5 1 999 ""])
user/data

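Define specs for what we consider valid data. Here each value must be an integer from 1 to 5 (the upper bound of s/int-in is exclusive), and the input is a collection of such values.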

(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))
:user/input

Now check whether our input data is considered valid.

(s/valid? ::input data)
false
(s/explain-data ::input data)
Map {:clojure.spec.alpha/problems: List(2), :clojure.spec.alpha/spec: :user/input, :clojure.spec.alpha/value: Vector(8)}

The data is not valid, and clojure.spec reports exactly what it found: each problem records the offending value and its path (:in) within the input.

(def explained *1)
Map {:clojure.spec.alpha/problems: List(2), :clojure.spec.alpha/spec: :user/input, :clojure.spec.alpha/value: Vector(8)}
(get-in data (-> explained ::s/problems first :in))
999
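
To see every problem at once rather than only the first, the problem maps can be trimmed down to the keys used in this post. The following is a minimal sketch that repeats the definitions above so it can run on its own; the name problem-summary is introduced here for illustration only.

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))

(def data [1 2 1 4 5 1 999 ""])

;; Keep only the keys we care about for each problem:
;;   :in  - path to the failing value inside the input
;;   :val - the failing value itself
;;   :via - chain of named specs that led to the failure
(def problem-summary
  (mapv #(select-keys % [:in :val :via])
        (::s/problems (s/explain-data ::input data))))

problem-summary
```

The last entry of :via names the spec of the failing element, which is what the fill-in step later relies on.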

To generate a value, pass the name of the spec to s/gen to obtain a generator, then call gen/generate on it.

(gen/generate (s/gen ::n))
3
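
As a quick sanity check, it can help to draw several values and confirm they all satisfy the spec they came from. This sketch uses gen/sample from clojure.spec.gen.alpha and assumes test.check is on the classpath, as in the deps.edn above; the name samples is illustrative.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::n (s/int-in 1 6))

;; s/gen turns the spec into a test.check generator;
;; gen/sample draws the requested number of values from it.
(def samples (gen/sample (s/gen ::n) 10))

;; Every sampled value should conform to the spec it was generated from.
(every? #(s/valid? ::n %) samples)
```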

We can put this together: validate the input data and automatically replace each value that failed validation with a generated one.

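;; For each problem, replace the value at its :in path with a value
;; generated from the most specific spec that failed (the last one in :via).
(reduce (fn [d p]
          (update-in d
                     (:in p)
                     (fn [_] (gen/generate (s/gen (-> p :via last))))))
        data
        (::s/problems explained))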
Vector(8) [1, 2, 1, 4, 5, 1, 5, 4]
(def valid-data *1)
(s/valid? ::input valid-data)
true

And our data is now valid!
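
The same steps could be wrapped in a small reusable function. This is only a sketch: fill-invalid is a hypothetical helper name, and it assumes the input is associative (such as a vector) and that the last entry of each problem's :via names the spec of the failing element.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))

(defn fill-invalid
  "Returns coll with every value that fails spec replaced by a generated one.
  Hypothetical helper: assumes coll is associative (e.g. a vector) and that
  the last entry of each problem's :via names the failing element's spec."
  [spec coll]
  (reduce (fn [d {:keys [in via]}]
            (assoc-in d in (gen/generate (s/gen (last via)))))
          coll
          ;; explain-data returns nil for valid input, so reduce
          ;; then simply returns coll unchanged.
          (::s/problems (s/explain-data spec coll))))

(def repaired (fill-invalid ::input [1 2 1 4 5 1 999 ""]))

(s/valid? ::input repaired)
```

Values that were already valid are left untouched; only the positions reported by explain-data are overwritten.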

TODO:

  • write a custom generator which analyzes the valid input data and generates the most common value
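
One possible starting point for this TODO, offered as a rough sketch rather than a finished design: compute the most frequent valid value and wrap it in a constant generator with gen/return. The names most-common-valid and mode-gen are hypothetical, and ties between equally frequent values are resolved arbitrarily.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::n (s/int-in 1 6))

(defn most-common-valid
  "Returns the most frequent value in coll that satisfies spec,
  or nil when no value is valid. Hypothetical helper."
  [spec coll]
  (some->> (filter #(s/valid? spec %) coll)
           seq                  ; nil (and short-circuit) when nothing is valid
           frequencies
           (apply max-key val)  ; map entry with the highest count
           key))

(def data [1 2 1 4 5 1 999 ""])

;; gen/return builds a generator that always yields the given value.
(def mode-gen (gen/return (most-common-valid ::n data)))

(gen/generate mode-gen)
```

Such a generator could be used in place of (s/gen ::n) in the fill-in step, so invalid entries are replaced by the mode of the valid data instead of a random valid value.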
