Tensorflow 1.15.2

Default environment for Tensorflow w/ Keras and TFLearn

This notebook builds a reusable environment for Tensorflow, based on the Python 3 environment. Tensorflow is compiled here, to make use of SIMD instruction sets and the cuDNN, NCCL, and TensorRT CUDA libraries.

If the end state of the runtime in which Tensorflow was compiled is needed, the Build Py3 TF environment is also exported. In addition, the wheel installation file of this compiled Tensorflow is available for download here: tensorflow-1.15.2-cp37-cp37m-linux_x86_64.whl

Showcase

Plain Tensorflow

We'll follow the deep convolutional generative adversarial networks (DCGAN) example by Aymeric Damien, from the Tensorflow Examples project, to generate digit images from a noise distribution.

Reference paper: Unsupervised representation learning with deep convolutional generative adversarial networks. A Radford, L Metz, S Chintala. arXiv:1511.06434.

First, parameters.

# Training Params
num_steps = 5000
batch_size = 32
# Network Params
image_dim = 784 # 28*28 pixels * 1 channel
gen_hidden_dim = 256
disc_hidden_dim = 256
noise_dim = 200 # Noise data points

Define networks.

# Generator Network
# Input: Noise, Output: Image
def generator(x, reuse=False):
    with tf.variable_scope('Generator', reuse=reuse):
        # TensorFlow Layers automatically create variables and calculate their
        # shape, based on the input.
        x = tf.layers.dense(x, units=6 * 6 * 128)
        x = tf.nn.tanh(x)
        # Reshape to a 4-D array of images: (batch, height, width, channels)
        # New shape: (batch, 6, 6, 128)
        x = tf.reshape(x, shape=[-1, 6, 6, 128])
        # Deconvolution, image shape: (batch, 14, 14, 64)
        x = tf.layers.conv2d_transpose(x, 64, 4, strides=2)
        # Deconvolution, image shape: (batch, 28, 28, 1)
        x = tf.layers.conv2d_transpose(x, 1, 2, strides=2)
        # Apply sigmoid to squash values into the (0, 1) range
        x = tf.nn.sigmoid(x)
        return x
# Discriminator Network
# Input: Image, Output: Prediction Real/Fake Image
def discriminator(x, reuse=False):
    with tf.variable_scope('Discriminator', reuse=reuse):
        # Typical convolutional neural network to classify images.
        x = tf.layers.conv2d(x, 64, 5)
        x = tf.nn.tanh(x)
        x = tf.layers.average_pooling2d(x, 2, 2)
        x = tf.layers.conv2d(x, 128, 5)
        x = tf.nn.tanh(x)
        x = tf.layers.average_pooling2d(x, 2, 2)
        x = tf.contrib.layers.flatten(x)
        x = tf.layers.dense(x, 1024)
        x = tf.nn.tanh(x)
        # Output 2 classes: Real and Fake images
        x = tf.layers.dense(x, 2)
    return x
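
The shape comments in the generator can be checked by hand. The sketch below assumes the default 'valid' padding of tf.layers.conv2d, tf.layers.conv2d_transpose and tf.layers.average_pooling2d; the helper names are illustrative and not part of the original example.

```python
def conv_transpose_out(size, kernel, stride):
    """Output length of a transposed convolution with 'valid' padding."""
    return (size - 1) * stride + kernel


def conv_out(size, kernel, stride=1):
    """Output length of a convolution or pooling window with 'valid' padding."""
    return (size - kernel) // stride + 1


# Generator: 6 -> 14 -> 28, matching the comments in generator().
g1 = conv_transpose_out(6, kernel=4, stride=2)
g2 = conv_transpose_out(g1, kernel=2, stride=2)

# Discriminator: 28 -> conv 5x5 -> pool 2x2 -> conv 5x5 -> pool 2x2.
d = conv_out(28, 5)       # after the first convolution
d = conv_out(d, 2, 2)     # after the first average pooling
d = conv_out(d, 5)        # after the second convolution
d = conv_out(d, 2, 2)     # after the second average pooling
flat = d * d * 128        # features entering the first dense layer

print(g1, g2, d, flat)
```

Under these assumptions the discriminator's flatten step hands a 4 x 4 x 128 feature map to the 1024-unit dense layer.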

Network setup.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
# Import MNIST data (http://yann.lecun.com/exdb/mnist/)
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
# Build Networks
# Network Inputs
noise_input = tf.placeholder(tf.float32, shape=[None, noise_dim])
real_image_input = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
# Build Generator Network
gen_sample = generator(noise_input)
# Build 2 Discriminator Networks (one from real image input, one from generated samples)
disc_real = discriminator(real_image_input)
disc_fake = discriminator(gen_sample, reuse=True)
disc_concat = tf.concat([disc_real, disc_fake], axis=0)
# Build the stacked generator/discriminator
stacked_gan = discriminator(gen_sample, reuse=True)
# Build Targets (real or fake images)
disc_target = tf.placeholder(tf.int32, shape=[None])
gen_target = tf.placeholder(tf.int32, shape=[None])
# Build Loss
disc_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=disc_concat, labels=disc_target))
gen_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=stacked_gan, labels=gen_target))
# Build Optimizers
optimizer_gen = tf.train.AdamOptimizer(learning_rate=0.001)
optimizer_disc = tf.train.AdamOptimizer(learning_rate=0.001)
# Training Variables for each optimizer
# By default in TensorFlow, all variables are updated by each optimizer, so we
# need to specify for each one of them which variables to update.
# Generator Network Variables
gen_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='Generator')
# Discriminator Network Variables
disc_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='Discriminator')
# Create training operations
train_gen = optimizer_gen.minimize(gen_loss, var_list=gen_vars)
train_disc = optimizer_disc.minimize(disc_loss, var_list=disc_vars)
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()
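
To make the two losses concrete, here is a plain-Python sketch of what sparse softmax cross-entropy computes for the two-class discriminator output (class 0 is fake, class 1 is real). The logit values are made up for illustration, and the helpers are stand-ins rather than the TensorFlow implementation.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Loss for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def mean_loss(batch_logits, labels):
    """Average per-example loss, like tf.reduce_mean over the batch."""
    losses = [sparse_softmax_cross_entropy(l, y)
              for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Hypothetical discriminator logits for two real and two generated images.
real_logits = [[-2.0, 2.0], [0.0, 0.0]]
fake_logits = [[1.5, -1.5], [0.0, 0.0]]

# Discriminator targets follow the concat order: real first, then fake.
disc_targets = [1, 1, 0, 0]
# The generator wants its samples to be classified as real.
gen_targets = [1, 1]

disc_loss = mean_loss(real_logits + fake_logits, disc_targets)
gen_loss = mean_loss(fake_logits, gen_targets)
print(disc_loss, gen_loss)
```

Both losses read the same discriminator output for the generated samples; only the target differs, which is why a confident "fake" prediction is cheap for the discriminator and expensive for the generator.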

Finally, training.

# Start training
sess = tf.Session()
# Run the initializer
sess.run(init)
for step in range(1, num_steps+1):
	# Prepare Input Data
	# Get the next batch of MNIST data (only images are needed, not labels)
	batch_x, _ = mnist.train.next_batch(batch_size)
	batch_x = np.reshape(batch_x, newshape=[-1, 28, 28, 1])
	# Generate noise to feed to the generator
	z = np.random.uniform(-1., 1., size=[batch_size, noise_dim])
	# Prepare Targets (Real image: 1, Fake image: 0)
	# The first half of the data fed to the discriminator are real images,
	# the other half are fake images (coming from the generator).
	batch_disc_y = np.concatenate(
		[np.ones([batch_size]), np.zeros([batch_size])], axis=0)
	# Generator tries to fool the discriminator, thus targets are 1.
	batch_gen_y = np.ones([batch_size])
	# Training
	feed_dict = {real_image_input: batch_x, noise_input: z,
				 disc_target: batch_disc_y, gen_target: batch_gen_y}
	_, _, gl, dl = sess.run([train_gen, train_disc, gen_loss, disc_loss],
							feed_dict=feed_dict)
	if step % 1000 == 0 or step == 1:
		print('Step %i: Generator Loss: %f, Discriminator Loss: %f' % (step, gl, dl))
		
		# Generate images from noise, using the generator network.
		f, a = plt.subplots(4, 10, figsize=(10, 4))
		for i in range(10):
			# Noise input.
			z = np.random.uniform(-1., 1., size=[4, noise_dim])
			g = sess.run(gen_sample, feed_dict={noise_input: z})
			for j in range(4):
				# Generate image from noise. Extend to 3 channels for matplot figure.
				img = np.reshape(np.repeat(g[j][:, :, np.newaxis], 3, axis=2),
					newshape=(28, 28, 3))
				a[j][i].imshow(img)
				
		#f.show()
		plt.suptitle("Step {}".format(step))
		plt.savefig("/results/step-{}.svg".format(step))
		plt.close()

Keras

Adapted from mnist_mlp.py in the Keras examples collection. It runs on either CPU or GPU, depending on what the runtime's Machine Type is set to.

Trains a simple deep NN on the MNIST dataset. Gets to 98.40% test accuracy after 20 epochs (there is *a lot* of margin for parameter tuning). 2 seconds per epoch on a K520 GPU.

Imports and settings.

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
batch_size = 128
num_classes = 10
epochs = 20

Data.

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
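
For readers unfamiliar with the "binary class matrix" step, the following is a small pure-Python stand-in for what to_categorical produces, together with the pixel scaling used above. The function name one_hot is illustrative; this is a sketch, not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 entry."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


# Two example digit labels encoded for a 10-class problem.
encoded = one_hot([3, 0], num_classes=10)
print(encoded[0])

# Pixel values 0..255 are divided by 255 so they fall in [0, 1].
scaled = [p / 255 for p in (0, 128, 255)]
print(scaled)
```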

Define the model.

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
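
The parameter counts that model.summary() reports for this network can be reproduced by hand: each Dense layer holds one weight per input-unit pair plus one bias per unit, and Dropout adds no parameters. A short sketch, with helper names chosen for illustration:

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


# Input width followed by the width of each Dense layer in the model.
layer_sizes = [784, 512, 512, 10]
per_layer = [dense_params(i, o)
             for i, o in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```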

Training. We can save our result to a file at the end.

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
model.save("/results/mnist.kerasave")

In a new runtime, load the test data and the saved, trained model.

import keras
from keras.datasets import mnist
from keras.models import load_model
num_classes = 10
(_,_), (x_test, y_test) = mnist.load_data()
x_test = x_test.reshape(10000, 784)
x_test = x_test.astype('float32')
x_test /= 255
y_test = keras.utils.to_categorical(y_test, num_classes)
# Path to the model file saved by the training cell; adjust it for your runtime.
model = load_model("mnist.kerasave")

Evaluate.

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

TFLearn

From the TFLearn + Tensorflow layers.py example.

from __future__ import print_function
import tensorflow as tf
import tflearn
# --------------------------------------
# High-Level API: Using TFLearn wrappers
# --------------------------------------
# Using MNIST Dataset
import tflearn.datasets.mnist as mnist
mnist_data = mnist.read_data_sets(one_hot=True)
# User defined placeholders
with tf.Graph().as_default():
    # Placeholders for data and labels
    X = tf.placeholder(shape=(None, 784), dtype=tf.float32)
    Y = tf.placeholder(shape=(None, 10), dtype=tf.float32)
    net = tf.reshape(X, [-1, 28, 28, 1])
    # Using TFLearn wrappers for network building
    net = tflearn.conv_2d(net, 32, 3, activation='relu')
    net = tflearn.max_pool_2d(net, 2)
    net = tflearn.local_response_normalization(net)
    net = tflearn.dropout(net, 0.8)
    net = tflearn.conv_2d(net, 64, 3, activation='relu')
    net = tflearn.max_pool_2d(net, 2)
    net = tflearn.local_response_normalization(net)
    net = tflearn.dropout(net, 0.8)
    net = tflearn.fully_connected(net, 128, activation='tanh')
    net = tflearn.dropout(net, 0.8)
    net = tflearn.fully_connected(net, 256, activation='tanh')
    net = tflearn.dropout(net, 0.8)
    net = tflearn.fully_connected(net, 10, activation='linear')
    # Defining other ops using Tensorflow
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=net, labels=Y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)
    # Initializing the variables
    init = tf.global_variables_initializer()
    # Launch the graph
    with tf.Session() as sess:
        sess.run(init)
        batch_size = 128
        for epoch in range(2):  # 2 epochs
            avg_cost = 0.
            total_batch = int(mnist_data.train.num_examples / batch_size)
            for i in range(total_batch):
                batch_xs, batch_ys = mnist_data.train.next_batch(batch_size)
                sess.run(optimizer, feed_dict={X: batch_xs, Y: batch_ys})
                cost = sess.run(loss, feed_dict={X: batch_xs, Y: batch_ys})
                avg_cost += cost / total_batch
                if i % 20 == 0:
                    print("Epoch:", '%03d' % (epoch + 1), "Step:", '%03d' % i,
                          "Loss:", str(cost))

Setup

Build Tensorflow

Building Tensorflow from source allows use of SIMD CPU enhancements like AVX. CUDA 10.2 supports GCC only up to version 8, so we make GCC 7 the default compiler. To get the Nvidia CUDA libraries we must set the environment variable NEXTJOURNAL_MOUNT_CUDA in the runtime configuration. Tensorflow can see some speedups if we give it libjemalloc.

apt-get -qq update
apt-get install --no-install-recommends \
  xutils-dev zlib1g-dev libjemalloc-dev
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 25
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 25
echo "/usr/local/cuda/extras/CUPTI/lib64" > /etc/ld.so.conf.d/cupti.conf
ldconfig

Install dependencies for the pip package build, listed here.

conda install \
  absl-py astor gast google-pasta opt_einsum protobuf termcolor wrapt \
  tensorboard tensorflow-estimator keras-applications keras-preprocessing

Download TensorRT. The link needs to be pulled from the console when downloading off the Nvidia website.

Install TensorRT from the tarfile above. The Python install needs a workaround because the bundled wheel files are specific to a Python minor version, so we pick the cp37 wheel explicitly.

cd /usr/local
tar -zxf TensorRT-7.0.0.11.Ubuntu-18.04.x86_64-gnu.cuda-10.2.cudnn7.6.tar.gz
ln -sf TensorRT* tensorrt
echo '/usr/local/tensorrt/lib' > /etc/ld.so.conf.d/tensorrt.conf
ldconfig
cd /usr/local/tensorrt
pip install python/tensorrt*cp37*.whl \
  uff/uff*.whl graphsurgeon/graphsurgeon*.whl

The Tensorflow compilation configure script is hardcoded to look for libnccl.so in <nccl_install_dir>/lib, but we have /lib64, so we need to set up some links to redirect it.

mkdir -p /usr/local/nccl_redir
cd /usr/local/nccl_redir
for i in `ls /usr/local/cuda`; do ln -s /usr/local/cuda/$i ./; done
ln -s lib64 lib
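
The redirect trick is easier to see on a throwaway directory tree. The Python sketch below mirrors the shell loop above using temporary directories in place of /usr/local/cuda; the function name and the fake file layout are assumptions made for the demonstration.

```python
import os
import tempfile


def mirror_with_lib_alias(src_dir, dst_dir):
    """Symlink every entry of src_dir into dst_dir, then alias lib -> lib64."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        os.symlink(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    # Relative link, equivalent to `ln -s lib64 lib` run inside dst_dir.
    os.symlink("lib64", os.path.join(dst_dir, "lib"))


# Build a stand-in for the CUDA directory with a library under lib64.
root = tempfile.mkdtemp()
fake_cuda = os.path.join(root, "cuda")
os.makedirs(os.path.join(fake_cuda, "lib64"))
os.makedirs(os.path.join(fake_cuda, "include"))
open(os.path.join(fake_cuda, "lib64", "libnccl.so"), "w").close()

redir = os.path.join(root, "nccl_redir")
mirror_with_lib_alias(fake_cuda, redir)

# The library is now reachable through <redir>/lib, as configure expects.
found = os.path.exists(os.path.join(redir, "lib", "libnccl.so"))
print(found)
```

The real CUDA directory is left untouched; only the redirect directory gains the extra lib entry.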

Install Bazel. Tensorflow 1.15 builds with Bazel 0.26.1.

export BAZEL_VERSION=0.26.1
export BAZEL_FILE=bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
wget --progress=dot:giga \
  https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/$BAZEL_FILE
chmod +x $BAZEL_FILE
./$BAZEL_FILE

Clone the source and checkout the release.

git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout v1.15.2

This configure script uses environment variables to do a non-interactive config. The march flag set through CC_OPT_FLAGS is of particular interest for CPU-only computation, as it controls which SIMD instruction sets Tensorflow will use, which can have large performance impacts. Some important flag values:

  • nehalem: Core-i family (circa 2008) supports MMX, SSE1-4.2, and POPCNT, equivalent to the corei7 march flag pre-GCC5.

  • sandybridge: Adds AVX (large potential speedups), AES and PCLMUL, and is the oldest family that the Google Cloud runs (2011). Requires GCC5+.

  • skylake: Adds a wide variety of SIMD instructions, including AVX2, and is currently the newest family the Google Cloud has. Requires GCC6+.
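
Before settling on one of these values it can help to check which instruction sets the target CPU actually reports. The sketch below parses the text of /proc/cpuinfo on Linux; the mapping from flags to a march value is a rough heuristic based on the list above, not an official GCC rule, and the sample text is invented for the demonstration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def suggest_march(flags):
    """Pick the most capable of the two older families the flags allow."""
    if {"avx", "aes", "pclmulqdq"} <= flags:
        return "sandybridge"
    if {"sse4_2", "popcnt"} <= flags:
        return "nehalem"
    return None


# Invented sample; on a real machine pass open("/proc/cpuinfo").read().
sample = ("processor\t: 0\n"
          "flags\t\t: fpu mmx sse sse2 ssse3 sse4_1 sse4_2 popcnt "
          "aes pclmulqdq avx\n")
print(suggest_march(cpu_flags(sample)))
```

The heuristic deliberately stops at sandybridge: AVX2 alone does not imply a Skylake CPU, so choosing skylake should be a decision about the known target hardware.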

Also of interest for CPU computation is TF_NEED_MKL. Enabling this compiles Tensorflow to use the Intel Math Kernel Library, which is highly optimized for any CPU the Google Cloud will provide. In Tensorflow the MKL and CUDA are mutually exclusive—MKL is reserved for CPU-optimized builds.

cd /tensorflow
export TF_ROOT="/opt/tensorflow"
export PYTHON_BIN_PATH="/opt/conda/bin/python"
export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
export PYTHONPATH=${TF_ROOT}/lib
export PYTHON_ARG=${TF_ROOT}/lib
export TF_NEED_GCP=1   # Google Cloud
export TF_NEED_HDFS=1  # Hadoop Filesystem access
export TF_NEED_S3=1    # Amazon S3
export TF_NEED_AWS=0   # Amazon AWS
export TF_NEED_IGNITE=1
export TF_NEED_KAFKA=1 # Apache KAFKA
export TF_NEED_JEMALLOC=1 # Alternative malloc
export TF_NEED_GDR=0   # GPU Direct RDMA
export TF_NEED_VERBS=0 # VERBS RDMA
export TF_NEED_CUDA=1
export CUDA_TOOLKIT_PATH=/usr/local/cuda
export TF_CUDA_VERSION="$($CUDA_TOOLKIT_PATH/bin/nvcc --version | sed -n 's/^.*release \(.*\),.*/\1/p')"
export TF_CUDA_COMPUTE_CAPABILITIES=7.0,6.1,6.0,3.7 # V100, P4, P100, K80
export CUDNN_INSTALL_PATH=/usr/local/cuda
export TF_CUDNN_VERSION="$(sed -n 's/^#define CUDNN_MAJOR\s*\(.*\).*/\1/p' $CUDNN_INSTALL_PATH/include/cudnn.h)"
export TF_NEED_TENSORRT=1  # Nvidia TensorRT
export TENSORRT_INSTALL_PATH=/usr/local/tensorrt
export NCCL_INSTALL_PATH=/usr/local/nccl_redir # Nvidia NCCL
export TF_NCCL_VERSION="$(sed -n 's/^#define NCCL_MAJOR\s*\(.*\).*/\1/p' $NCCL_INSTALL_PATH/include/nccl.h)"
export TF_CUDA_CLANG=0    # Use clang compiler instead of nvcc
export TF_NEED_OPENCL=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_ROCM=0
export TF_ENABLE_XLA=0    # Accelerated Linear Algebra JIT compiler
export TF_NEED_MKL=0       # Intel Math Kernel Library
export TF_DOWNLOAD_MKL=0
export TF_NEED_MPI=0       # Message Passing Interface
export TF_SET_ANDROID_WORKSPACE=0
export GCC_HOST_COMPILER_PATH=$(which gcc)
export CC_OPT_FLAGS="-march=sandybridge"
./configure
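
The sed expressions above pull version numbers out of the nvcc banner and the cuDNN and NCCL headers. As an illustration of what they match, here is an equivalent sketch using Python regular expressions; the sample strings are assumed typical contents, not output captured from this build.

```python
import re


def cuda_release(nvcc_output):
    """Extract the text between 'release ' and the following comma."""
    match = re.search(r"release ([^,]+),", nvcc_output)
    return match.group(1) if match else None


def major_version(header_text, macro):
    """Extract the number from a '#define <macro> <number>' header line."""
    pattern = r"^#define %s\s+(\d+)" % re.escape(macro)
    match = re.search(pattern, header_text, re.MULTILINE)
    return match.group(1) if match else None


# Assumed sample inputs for the demonstration.
nvcc_sample = "Cuda compilation tools, release 10.2, V10.2.89"
cudnn_sample = "#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 6\n"
nccl_sample = "#define NCCL_MAJOR 2\n"

print(cuda_release(nvcc_sample))
print(major_version(cudnn_sample, "CUDNN_MAJOR"))
print(major_version(nccl_sample, "NCCL_MAJOR"))
```

If one of these lookups comes back empty, the corresponding TF_*_VERSION variable would be blank, so it is worth echoing the variables before running the configure script.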

Finally, the build—this takes about 11 hours.

export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/nvidia/lib64"
export CUDNN_INCLUDE_DIR="/usr/local/cuda/include"
export CUDNN_LIBRARY="/usr/local/cuda/lib64/libcudnn.so"
export TMP="/tmp"
cd /tensorflow
bazel build --experimental_ui_limit_console_output=32 \
  --config=opt --config=cuda --verbose_failures --jobs="auto" \
  --action_env="LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" \
  --action_env="CUDNN_INCLUDE_DIR=${CUDNN_INCLUDE_DIR}" \
  --action_env="CUDNN_LIBRARY=${CUDNN_LIBRARY}" \
  //tensorflow/tools/pip_package:build_pip_package

We'll export this environment just in case anyone wants to play with the compiled result, but the important part here is the creation of a .whl wheel file which can be installed via pip.

cd /tensorflow
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
cp /tmp/tensorflow_pkg/tensorflow*.whl /results/
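
The name of the resulting wheel encodes which interpreter and platform it targets, which is why it only installs on a matching Python. A small sketch that splits the filename into its tags, assuming the common five-field form without an optional build tag (the function name is illustrative):

```python
def parse_wheel_name(filename):
    """Split a wheel filename into name, version and compatibility tags."""
    stem = filename[:-len(".whl")]
    parts = stem.split("-")
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],    # interpreter, e.g. CPython 3.7
        "abi_tag": parts[-2],       # binary interface of that interpreter
        "platform_tag": parts[-1],  # operating system and architecture
    }


info = parse_wheel_name("tensorflow-1.15.2-cp37-cp37m-linux_x86_64.whl")
print(info)
```

Here cp37 means the wheel was built against CPython 3.7, so a runtime with a different Python minor version would need its own build.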

Install Tensorflow and Frontends to Environment

Finally, we'll install the package we created in a clean environment, plus the TFLearn and standalone Keras frontends.

conda install \
  absl-py astor google-pasta opt_einsum protobuf termcolor wrapt \
  mock pbr h5py grpcio markdown werkzeug cython jemalloc \
  pyyaml graphviz pydot # for use with Keras
conda clean -qtipy
echo "/usr/local/cuda/extras/CUPTI/lib64" > /etc/ld.so.conf.d/cupti.conf
cd /usr/local
tar -zxf TensorRT-7.0.0.11.Ubuntu-18.04.x86_64-gnu.cuda-10.2.cudnn7.6.tar.gz
ln -sf TensorRT* tensorrt
echo '/usr/local/tensorrt/lib' > /etc/ld.so.conf.d/tensorrt.conf
ldconfig

Sometimes pip fails here, saying the Python interpreter is bad. I don't know why. Occasionally (rarely) running python -V beforehand prevents it. I don't know why. More often, running python -V after the error shows up...fixes it. I don't know why.

python -V
cd /usr/local/tensorrt
pip install python/tensorrt*cp37*.whl \
            uff/uff*.whl graphsurgeon/graphsurgeon*.whl

We need to cap the tensorboard and tensorflow-estimator versions below 2 here, since pip may otherwise pull in releases that do not match Tensorflow 1.15.

pip install tensorflow-1.15.2-cp37-cp37m-linux_x86_64.whl \
  'tensorboard<2' 'tensorflow-estimator<2' \
  keras keras-applications keras-preprocessing \
  git+https://github.com/tflearn/tflearn.git
du -hsx /