Julia version of Goldberg's "The unreasonable effectiveness of Character-level Language Models"

Introduction

The Julia version of the Python code in Yoav Goldberg's original notebook is not intended to be a line-by-line copy, though I tried to keep it similar enough to be comparable. If you are curious about Julia, see the last section: A Note on Julia. Python is not Julia, nor vice versa, and there is a more Julian way to structure the program.

I will not copy the content of Goldberg's blog post, which you should read (link in the title; it in turn relies on the Andrej Karpathy blog post "The Unreasonable Effectiveness of Recurrent Neural Networks", which you should read too). What I will do is take a brief tour of what unreasonable effectiveness means, its relevance to modern linguistic theory, and why presenting an example program of an n-gram model is appropriate for this topic.

Unreasonably Effective

The unreasonable effectiveness trope derives its current lineage (as far as I can tell) from a paper written by Eugene Wigner:

Wigner, Eugene. February 1960. "The Unreasonable Effectiveness of Mathematics in the Natural Sciences". Communications on Pure and Applied Mathematics, vol. 13, No. 1.

Wigner's paper reminds me of a post by Peter Norvig: On Chomsky and the Two Cultures of Statistical Learning, which is an argument against Noam Chomsky's insistence that statistical patterns of language use are not fundamentally important data for discovering the patterns of natural language construction. Chomsky takes it even further by declaring that the work of compiling language use statistics is like that of someone who would "collect butterflies and make many observations" (Chomsky, Noam. 1979. Language and Responsibility. Harvester Press). That is, language use is a surface phenomenon that distracts us from the underlying principles; and it is the underlying principles of the structure of nature that scientists are really after. For Chomsky, the job of the theoretical linguist is to build a formal model (i.e., a mathematical theory) of human language from "first principles". Norvig's position is that gathering data on observable (surface) phenomena and constructing statistical and/or probabilistic models from that data is legitimate science and has plenty of precedent.

Wigner

Wigner's point about the unreasonable effectiveness of math in the natural sciences is an attempt to frame an observation about the history of natural philosophy (i.e., the physical sciences, namely physics). That observation, simply put: as history and ideas progress through time, mathematics has often been correct about physical phenomena it had not tried to empirically explain through experimentation. Instead, mathematicians followed the course of their research and instincts in developing purely mathematical ideas or models. Those ideas and/or models often end up being foundational to predicting the outcome of a major scientific experiment; they become mathematical theories about the physical world. Of course, the math needn't always be an accident. Plenty of examples exist where the math was worked out to explain or predict some phenomena without (much) empirical or experimental validation. [One example Wigner gives is Newton's universal law of gravitation. The mathematical theory, which unified the work of Kepler and Galileo, was constructed before we knew how to accurately observe and measure falling bodies and planetary motion. Only later did we confirm the accuracy of the predictions.]

Wigner goes further than claiming that math helps to "explain" physical theories. He goes so far as to propose an epistemological law (page 7).

The preceding [...] examples, which could be multiplied almost indefinitely, should illustrate the appropriateness and accuracy of the mathematical formulation of the laws of nature in terms of concepts chosen for their manipulability, the “laws of nature” being of almost fantastic accuracy but of strictly limited scope. I propose to refer to the observation which these examples illustrate as the empirical law of epistemology.

He follows that up with a common proclamation about using math to discover how the mind works (page 9). (His proclamation is "common" in a historical sense: from the point of view of language philosophy and the history of ideas, it is common to read the hopeful sentiment that some formalism will lead to a discovery of how the human mind works. See Formigari, Lia. 2004. A History of Language Philosophies. John Benjamins Publishing Company.)

A much more difficult and confusing situation would arise if we could, some day, establish a theory of the phenomena of consciousness, or of biology, which would be as coherent and convincing as our present theories of the inanimate world.

Certainly, this is where many theoretical linguists, cognitive scientists, and artificial intelligence researchers hope to find themselves.

Hamming

A follow-up to Wigner's paper, and the second published use of the term unreasonable effectiveness in reference to science and math (at least that I could find), is a 1979 talk by Richard Hamming (yes, that Hamming, who invented the Hamming distance metric).

Hamming, Richard W. February 1980. "The Unreasonable Effectiveness of Mathematics". The American Mathematical Monthly, vol. 87, No. 2.

Hamming cites Wigner, and picks up where Wigner left off. He tries to shed light on why math is so effective by proposing four explanations. I'm only concerned with one of those explanations here (number 2, on page 89):

2. We select the kind of mathematics to use. Mathematics does not always work. When we found that scalars did not work for forces, we invented a new mathematics, vectors. And going further we have invented tensors. [...] Thus my second explanation is that we select the mathematics to fit the situation, and it is simply not true that the same mathematics works every place.

Mathematics is effective at constructing theories that can predict physical phenomena because we can, and do, extend the basic operations of simpler systems to solve new measurement problems. For example, from ancient Greece we get the extension to the rationals (fractions), which dealt with what appeared to be contradictions in the then-current discrete number system when measuring certain geometric objects, like the square. There are many other examples (e.g., zero, negative numbers, complex numbers, quaternions, etc.).

Hamming touches on the notion of invariance (following Wigner's cue). Simpler number systems were not fundamentally changed when new (extended) number systems arose. That is, their operations were invariant. Adding two discrete numbers did not fundamentally change when we needed to extend the operation to the rationals. One could still measure an object in the simpler, less precise number system. And while some revision of previous proofs usually happened, the fundamental concepts stayed the same. Hamming says (pages 85-86):

To summarize, from simple counting using the God-given integers, we made various extensions of the ideas of numbers to include more things. Sometimes the extensions were made for what amounted to aesthetic reasons, and often we gave up some property of the earlier number system. Thus we came to a number system that is unreasonably effective even in mathematics itself; [...] one of the main strands of mathematics is the extension, the generalization, the abstraction -- they are all more or less the same thing -- of well-known concepts to new situations. [...] old proofs of theorems may become false proofs. The old proofs no longer cover the newly defined things. The miracle is that almost always the theorems are still true; it is merely a matter of fixing up the proofs.

Shifting our perspective on the notion of invariance: mathematical theories make successful predictions about the physical world, and those predictions do not vary with context. That is, mathematical predictions about gravity hold regardless of context. They work on Earth and they work on the moon. They work in September and in July as well, whether we have a Republican president or not. If and when mathematics extends even further than the present, we expect that the mathematical foundation of the theory of gravity will still work. Newton's work did not negate Galileo's, nor did Einstein's negate Newton's.

A mathematical theory of human language will work whether you speak English or Urdu, Tucano or Japanese. It will work whether we live on Earth or in space. And much like number systems that get extended, language theory has found that extending some basic mathematical concepts about language statistics (e.g., distributional semantics) has been a useful enterprise. These extensions do not change previous results: traditional n-gram models are limited in how much history of seen sequences they can use, whereas more modern and sophisticated models have extended this "history"; the extended "history" operation doesn't change n-gram models, just as the "addition" operation on rationals has not changed addition on integers. And so it seems reasonable to expect that a mathematical theory of human language needs to be invariant in both of these senses: under extension of the theory, and across the contexts of its use.

The story looking back

Before moving on it's worth noting that it's easy to look back on history and see a rational unfolding of events. For every successful mathematical model, for every successful example Wigner and Hamming highlight, it stands to reason there are parallel failures, unknown or forgotten. A proper treatment is way beyond the scope here.

Karpathy and Goldberg

The phrase, unreasonable effectiveness, then, echoed by Karpathy and Goldberg, has a trajectory along the points I've drawn: collecting data on observable things may seem like it won't have much to reveal to us about the underlying nature of the thing in question. Unless, of course, in the collection process we are simultaneously constructing a mathematical model for it.

Depending on how you look at it:

  1. Building language models is an exercise in mathematics and doesn't have much to do with the empirical data it uses. The modeling data could just as well be any sequentially ordered text that has some "pattern", as long as the resulting model provides sufficient explanatory power. Using the model to generate acceptable artificial text is secondary; what is crucial is the mathematical theory. One might broadly call this a rationalist (Chomskyan) approach.

  2. On the other hand, the data itself is driving the mathematics; we want language models that can generate intelligible text and predict patterns confirmed as acceptable by human judgement. And if a model does not perform well (according to human judgements) then we are free to change the mathematical theory in order to yield a model that generates more intelligible output. One might call this an empiricist approach.

Karpathy's post was written in 2015, and Goldberg's a little later that year (the last commit on the notebook code was May 22, 2015). Four years later, the basic ideas (as well as extensions of them) are still providing plenty of interesting results:

  • distributional semantics (a pretty old mathematical concept, at least 60 years old)

  • language models (LM)

  • recurrent neural networks (RNN)

  • long short-term memory neural networks (LSTM)

  • character-level (and variable sequence length) encodings

The bigger question, from a history of science point of view, is this: Are programs that converge on discoverable patterns in written and spoken text revealing patterns that are "invariant" in some sense? That is, are they patterns worthy of building a new science on (e.g., a mathematical theory of human language), or are they artifacts of a combination of shifting variables dependent on context? The answer to this question remains to be seen, and likely requires resolving a tension implicit in language theory, highlighted by views 1 and 2 above.

This tension is superbly catalogued by Formigari in her book A History of Language Philosophies. She maps out a consistent path of conflicting explanations and/or origins of human language falling into dichotomies that reflect the religious-scientific understanding of the time: spiritual or bestial, mind or body, reason or emotion. A more modern split would be the difference between a rationalist mathematical theory (spiritual/reason) and an empiricist social theory (bestial/emotion) of language. The fact that modern natural language theory (theoretical linguistics, cognitive linguistics, NLP, computational linguistics, etc., as well as both commercial and academic enterprises) still struggles with this tension is evidence that we have not made the fundamental breakthrough (e.g., in commercial industry the tension between 1 and 2 above shows up as a focus on algorithms [1] versus data collection [2]). Progress marches on, and we get better at understanding, but we are still locked into the same paradigmatic thinking that has framed our understanding for centuries.

Put another way, does creating a program that perfectly mimics talking to a human reveal anything about the nature of language (is it even a useful product)? Does research in modern machine learning techniques and architectures (e.g., RNNs, LSTMs, etc.) tell us anything about language (can we really commercialize this tech long term, such as building it into an operating system)? Do current mathematical techniques underlying machine learning methods count as a theory of human language (will these techniques be useful in industry in 10 years; should I invest in building a huge, stable, performant code base of these techniques)? We don't know. Primarily we don't know because we have yet to build programs good enough to meet our standards. And when we do, will those programs, language models, architectures, etc. generalize across human languages?

Meanwhile, the ever-expanding playground of building language models continues, and there is no reason to artificially limit what can be accomplished through experimentation. It remains to be seen whether RNNs, LSTMs, and other mathematical models will have an unreasonably effective impact, and we won't know unless we keep going.

Much of the modern AI around language comes from extending the basic concepts captured by an n-gram language model (for example, see Mikolov, Tomas. 2012. "Statistical Language Models Based on Neural Networks", PhD diss., Brno University of Technology.); it's appropriate that an n-gram implementation accompany this article.

The Code

I'll be using Julia 1.3.0.

"$VERSION"
"1.3.0"

Include DataStructures so we can use DefaultDict, Accumulator, and counter.

import Pkg
Pkg.add("DataStructures")
import DataStructures.Accumulator
import DataStructures.DefaultDict
import DataStructures.counter
using Printf

I strayed from the original by creating some helper functions: one for the URL, one for getting the file name from the URL, one for building the BEGIN padding for the n-gram, and one for getting the last order characters of the history.

#helper functions for: urls
shakespeare_input() = "http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt"
#get file name from url
file_namer(url) = split(url,"/")[end]
#define the BEGIN offset for n-gram
padding(chr,num) = repeat(chr, num)
#find BEGIN offset of n-gram
rewind(str,len) = str[end-(len-1):end]
rewind (generic function with 1 method)
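
A quick illustration of the two string helpers (these example calls are mine, not from the original notebook):

#pad a history of order 4 with the BEGIN marker
padding("~", 4)                #"~~~~"
#keep only the last 4 characters of a history
rewind("First Citizen", 4)     #"izen"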

We will need to sum the counts in an Accumulator and then use a list comprehension to build tuples of each character and the ratio of that character's count to the total count in the Accumulator. A simple example:

#for a dict with char=>int
d = Dict{Char,Int}('a'=>2, 'z'=>1)
println("Dict: ", d)
#built as an accumulator
a = Accumulator(d)
println("Accumulator(d): ", a)
#sum the items
s = float(sum(a))
println("sum(a): ", s)
#build tuple of (char,ratio)
char_ratio = [(char,float(sum(count))/s) for (char,count) in a]
println("char and ratio: ", char_ratio)

The normalize function wraps the above functionality.

#Goldberg uses a nested function for this; I've separated it out.
function normalize(accum)
   s = float(sum(accum))
   [(char,float(sum(count))/s) for (char,count) in accum]
end
normalize (generic function with 1 method)
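
Applying it to the toy accumulator from the earlier example (my call, not from the original notebook):

normalize(Accumulator(Dict{Char,Int}('a'=>2, 'z'=>1)))
#a 2-element array of (char, ratio) tuples: 'a' maps to 2/3 and 'z' to 1/3 (in whatever order the Dict iterates)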

Julia's DefaultDict, giving each history string its own Accumulator{Char,Int}, seems like a decent replacement for Python's defaultdict(Counter).
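
A minimal sketch of the pattern in isolation (the toy key and counts here are mine, not from the original notebook): each missing history gets its own fresh character counter.

#a fresh Accumulator{Char,Int} is created for every new key
dd = DefaultDict{String,Accumulator{Char,Int}}(() -> counter(Char))
dd["hell"]['o'] += 1
dd["hell"]['o'] += 1
dd["hell"]     #an Accumulator with 'o' => 2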

function train_char_lm(url;order=4)
   file = file_namer(url)
   if !isfile(file)
      download(url, file)
   end
   data = read(file, String)
   pad = padding("~",order)
   data = pad * data
   #a fresh Accumulator{Char,Int} per unseen history (like Python's defaultdict(Counter))
   tlp = DefaultDict{String,Accumulator{Char,Int}}(() -> counter(Char))
   #julia is 1-indexed (like R and Matlab)
   for i in 1:(length(data)-order)
      hist,curr = data[i:(i-1)+order],data[i+order]
      tlp[hist][curr]+=1
   end
   Dict(hist => normalize(chars) for (hist,chars) in tlp)
end
train_char_lm (generic function with 1 method)

Generate a single letter by looking at the last order characters of the history and sampling randomly from the model's distribution for that history.

function generate_letter(model, history, order)::Char
   history = rewind(history,order)
   dist = model[history]
   x = rand()
   for (c,v) in dist
      x = x - v
      if x <= 0.0
         return c
      end
   end
   #floating point rounding can leave x slightly above zero; fall back to the last character
   dist[end][1]
end
generate_letter (generic function with 1 method)
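
To see the sampling in isolation, here is a call against a tiny hand-built model (the toy model and history are mine, for illustration only):

#a toy model: after "ab", 'c' should appear about 75% of the time and 'd' about 25%
toy_model = Dict("ab" => [('c', 0.75), ('d', 0.25)])
generate_letter(toy_model, "xab", 2)     #'c' roughly three times out of four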

Generate text of nletters characters by passing the initial BEGIN padding as the initial history and running generate_letter in a loop, updating the history on each pass.

function generate_text(model; order=4, nletters=1000)
   history = padding("~", order)
   out = String[]
   for i in 1:nletters
      c = string(generate_letter(model, history, order))
      history = rewind(history,order) * c
      push!(out,c)
   end
   join(out,"")
end
generate_text (generic function with 1 method)
lm = train_char_lm(shakespeare_input(), order=4)
lm["ello"]
11-element Array{Tuple{Char,Float64},1}: ('n', 0.00170358) ('!', 0.00681431) ('w', 0.817717) ('\'', 0.0170358) ('?', 0.00681431) (':', 0.00511073) ('r', 0.0596252) (',', 0.0272572) (' ', 0.0136286) ('.', 0.00681431) ('u', 0.0374787)
lm["Firs"]
1-element Array{Tuple{Char,Float64},1}: ('t', 1.0)
lm["rst "]
43-element Array{Tuple{Char,Float64},1}: ('w', 0.024077) ('E', 0.00160514) ('o', 0.0304976) ('B', 0.00963082) ('h', 0.0192616) ('i', 0.0168539) ('r', 0.00722311) ('q', 0.00160514) ('P', 0.0144462) ('K', 0.00802568) ⋮ ('y', 0.0024077) ('k', 0.00401284) ('F', 0.0120385) ('N', 0.000802568) ('m', 0.0224719) ('S', 0.162921) ('L', 0.106742) ('g', 0.011236) ('l', 0.0104334)
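
As a quick sanity check (my addition, not part of the original notebook), the ratios stored for any history should sum to approximately 1:

sum(ratio for (char, ratio) in lm["ello"])     #1.0, up to floating point error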

Generate Shakespeare from different order models

Shakespeare is cheating a bit. It's young enough English to still be intelligible, but old enough to sound weird. In this way, generated Shakespeare-like text gets a bit of a free pass, since it is supposed to sound odd to modern English speakers. But it is still interesting to see that, as the order increases, so does the appearance of Shakespeare-like prose.

order 4

length(lm)
82652
generate_text(lm, order=4)

order 6

lm = train_char_lm(shakespeare_input(), order=6);
length(lm)
653260
generate_text(lm, order=6)

order 7

lm = train_char_lm(shakespeare_input(), order=7);
length(lm)
1190536
generate_text(lm, order=7)

order 9

lm = train_char_lm(shakespeare_input(), order=9);
length(lm)
2437686
generate_text(lm, order=9)

A Note on Julia

Julia is a dynamically typed (with optional type annotations) programming language with multiple dispatch, garbage collection (GC), and a Just-In-Time (JIT) compiler built on LLVM. It was designed as a general-purpose programming language with a focus on scientific computing: simple syntax, interactive software development, ad hoc data exploration, and high performance without having to manage memory.

One trade-off with Julia is the interactive latency caused by the JIT compiler (if all you are doing is one-off scripts or programs that don't need to run the same function many times, Julia is not for you). Another trade-off is that Julia relies on the developer to write type-stable code so the compiler can generate efficient, performant code. If you are interested in learning more, there are a number of high-quality tutorials for Julia; Julia recently hit version 1.0, so make sure any resource you find is for that version (or above).
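
To illustrate what type-stable means, here is a toy example of mine (not from the original): a function whose return type depends only on its argument types is easy for the compiler to optimize, while one whose return type depends on runtime values is not.

#type-unstable: one branch returns an Int, the other a Float64
unstable(x) = x > 0 ? 1 : 0.5
#type-stable: both branches return a Float64
stable(x) = x > 0 ? 1.0 : 0.5
#@code_warntype unstable(1) highlights the Union{Float64, Int64} return type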

Why use Julia and not Python? A legitimate question, and a bit of a troll. If you want to dig in your heels with Python and then have me convince you to move, then anything I say will be in vain. Otherwise, my reasons for using Julia:

  1. native concurrency and parallelism

  2. the amount of time and skill needed to write performant Julia is less than what it takes for the inevitable Python/C++ combination

  3. I don't have to "vectorize all things", as Julia's compiler can make for loops (and other patterns) efficient (see the small loop sketch after this list)

  4. near C/C++ speeds with orders of magnitude less effort (though often it won't be the first way you write something)

  5. I came from a LISP background and Julia is nice for me

  6. I've used Ruby for a long time and Julia is nice for me

  7. mathematicians, academics, and scientists make up the majority of the community

  8. fantastic modern type system that supports both fine-grained control and generic code

  9. I love and use Go and Rust; Julia is a great complement

  10. Python was never "readable, intuitive, and easy" for me
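
On point 3, a tiny sketch of mine (not from the original): a plain for loop in Julia compiles to efficient native code, so there is no need to rewrite it in a vectorized style just for performance.

#an explicit accumulation loop; no vectorized rewrite required for speed
function mysum(xs)
   total = zero(eltype(xs))
   for x in xs
      total += x
   end
   total
end
mysum(rand(10^6))     #the loop is compiled to efficient native code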
