David Schmudde / Oct 07 2019

Adventures in Immutable Python

https://blog.quiltdata.com/declarative-etl-with-pandas-and-pyarrow-a779de13ef9d

Mutability

a = 5
b = 5
print("IDs a: %s, b: %s" % (id(a), id(b)))
print("a == b %s, a is b %s" % (a == b, a is b))
a = 5000
b = 5000
print("IDs a: %s, b: %s" % (id(a), id(b)))
print("a == b %s, a is b %s" % (a == b, a is b))

Value immutability (42) vs. Object mutation ([42])

print(" 42  ==  42  %s,  42  is  42  %s" % (42 == 42, 42 is 42))
print("[42] == [42] %s, [42] is [42] %s" % ([42] == [42], [42] is [42]))

More in the official docs: Assignment Statements and Comparisons.
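A minimal sketch of the distinction above, using only built-ins (the variable names are illustrative). Augmented assignment on an int rebinds the name to a different object, while the same operator on a list changes the existing object, so every alias sees the change.

```python
# Integers are immutable: += produces a new object and rebinds the name.
n = 5000
n_id_before = id(n)
n += 1
int_rebound = id(n) != n_id_before

# Lists are mutable: += extends the existing object in place.
items = [42]
alias = items                      # second name for the same list
items_id_before = id(items)
items += [43]
list_same_object = id(items) == items_id_before

print(int_rebound, list_same_object, alias)
```

The alias is the part that bites in practice: any code still holding a reference to a mutable object sees changes it never asked for.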

Arrow

One of the goals of Apache Arrow is to serve as a common data layer enabling zero-copy data exchange between multiple frameworks. A key component of this vision is the use of off-heap memory management (via Plasma) for storing and sharing Arrow-serialized objects between applications.

Arrays are immutable once created. Implementations can provide APIs to mutate an array, but applying mutations will require a new array data structure to be built.

Plasma

Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries. In light of the trend toward larger and larger multicore machines, Plasma enables critical performance optimizations in the big data regime.

Using Plasma plus Arrow, the data being operated on would be placed in the Plasma store once, and all of the workers would read the data without copying or deserializing it (the workers would map the relevant region of memory into their own address spaces). The workers would then put the results of their computation back into the Plasma store, which the driver could then read and aggregate without copying or deserializing the data.

nohup plasma_store -m 60397977 -s /tmp/plasma &> /tmp/plasma.log & sleep 1
tail /tmp/plasma.log
import pyarrow.plasma as plasma
client = plasma.connect("/tmp/plasma")
client
<pyarrow._pla...x7f241b2c0fb8>
# Create an object.
object_id = plasma.ObjectID(20 * b'a')
object_size = 1000
buffer = memoryview(client.create(object_id, object_size))

# Write to the buffer.
for i in range(1000):
    buffer[i] = i % 128

# Seal the object making it immutable and available to other clients.
client.seal(object_id)

# The character "a" is encoded as 61 in hex.
object_id
ObjectID(6161...1616161616161)
# Create a python object.
object_id = client.put("hello, world")

# Get the object.
client.get(object_id)
'hello, world'
# Create a different client. Note that this second client could be
# created in the same or in a separate, concurrent Python session.
client2 = plasma.connect("/tmp/plasma")

# Get the object in the second client. This blocks until the object has been sealed.
object_id2 = plasma.ObjectID(20 * b"a")
[buffer2] = client2.get_buffers([object_id2])
buffer2
<pyarrow.lib....x7f2414dd50a0>

If the object has not been sealed yet, then the call to client.get_buffers will block until the object has been sealed by the client constructing the object. Using the timeout_ms argument to get, you can specify a timeout for this (in milliseconds). After the timeout, the interpreter will yield control back.

print(str(buffer) + " --- " + str(buffer[1]))
print(str(buffer2) + " --- " + str(buffer2[1]) + " (unsealed)")
print(bytes(buffer[1:4]))
print(bytes(buffer2[1:4]))
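The create, write, seal sequence above can be imitated with the standard library alone. This is only an analogy and not how Plasma is implemented: a bytearray stands in for the shared buffer, and a read-only memoryview stands in for the sealed object. It needs Python 3.8 or later for memoryview.toreadonly().

```python
# Stdlib analogy only: a bytearray stands in for a Plasma buffer.
raw = bytearray(1000)              # "create": allocate a writable buffer
writable = memoryview(raw)
for i in range(1000):
    writable[i] = i % 128          # "write": fill the buffer

sealed = writable.toreadonly()     # "seal": hand out a read-only view

# Readers holding the sealed view cannot write through it.
try:
    sealed[0] = 1
    write_allowed = True
except TypeError:
    write_allowed = False

print(bytes(sealed[1:4]), write_allowed)
```

Unlike Plasma, the original bytearray in this sketch stays writable for whoever still holds it, so the guarantee is only as strong as the discipline of the creating code.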

Arrow Tables

The equivalent to a pandas DataFrame in Arrow is a Table. Both consist of a set of named columns of equal length. While pandas only supports flat columns, the Table also provides nested columns, thus it can represent more data than a DataFrame, so a full conversion is not always possible.

Conversion from a Table to a DataFrame is done by calling pyarrow.Table.to_pandas(). The inverse is then achieved by using pyarrow.Table.from_pandas().

~ DataFrames via arrow.apache.org

from pyarrow import csv

table = csv.read_csv("artwork_data.csv")
print(table)
df_mutable = table.to_pandas()
df_mutable.head()

Pandas

Storing a Pandas DataFrame still follows the create-then-seal process of storing an object in the Plasma store; however, one cannot write the DataFrame directly to Plasma with Pandas alone. Plasma also needs to know the size of the DataFrame in order to allocate a buffer for it.

import pyarrow as pa
import pandas as pd

# Create a Pandas DataFrame
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Convert the Pandas DataFrame into a PyArrow RecordBatch
record_batch = pa.RecordBatch.from_pandas(df)

Creating the Plasma object requires an ObjectID and the size of the data. Now that we have converted the Pandas DataFrame into a PyArrow RecordBatch, use the MockOutputStream to determine the size of the Plasma object.

# Useful utility for generating random object store ids
import numpy as np

def random_object_id():
  return plasma.ObjectID(np.random.bytes(20))

random_object_id()
ObjectID(0d70...be00dc10b7173)
# Create the Plasma object from the PyArrow RecordBatch. Most of the work here
# is done to determine the size of buffer to request from the object store.
object_id = random_object_id()
mock_sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(mock_sink, record_batch.schema)
stream_writer.write_batch(record_batch)
stream_writer.close()
data_size = mock_sink.size()
buf = client.create(object_id, data_size)

The DataFrame can now be written to the buffer as follows.

# Write the PyArrow RecordBatch to Plasma
stream = pa.FixedSizeBufferWriter(buf)
stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema)
stream_writer.write_batch(record_batch)
stream_writer.close()

Finally, seal the finished object for use by all clients:

# Seal the Plasma object
client.seal(object_id)
# Fetch the Plasma object
[data] = client.get_buffers([object_id])  # Get PlasmaBuffer from ObjectID
df_buffer = pa.BufferReader(data)
# Convert object back into an Arrow RecordBatch
reader = pa.RecordBatchStreamReader(df_buffer)
record_batch = reader.read_next_batch()
# Convert back into Pandas
result = record_batch.to_pandas()

StaticFrame

Pandas

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability where sensible.

import numpy as np
import pandas as pd

artwork_data = pd.read_csv("artwork_data.csv")

# Without inplace=True, drop returns a new DataFrame and artwork_data keeps every column.
# With inplace=True, only id, artist, title, medium, and year would remain.
artwork_data.drop(columns=["accession_number", "artistRole", "artistId", "dateText", "acquisitionYear", "dimensions", "width", "height", "depth", "creditLine", "units", "inscription", "thumbnailCopyright", "thumbnailUrl", "url"])
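The two halves of the pandas quote can be checked on a toy frame. This is a sketch with made-up values rather than the artwork data: values can be overwritten in place, while a method such as drop hands back a new object and leaves its input alone.

```python
import pandas as pd

# Toy stand-in for the artwork data; the values are invented.
frame = pd.DataFrame({"id": [1, 2], "artist": ["A", "B"], "url": ["u1", "u2"]})

# Value-mutable: a cell can be overwritten in place.
frame.loc[0, "artist"] = "Z"

# Most methods return a new object instead of changing the input.
trimmed = frame.drop(columns=["url"])

print(list(frame.columns), list(trimmed.columns))
```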

Rationale

Immutable data structures reduce opportunities for error and promote the design of pure functions, offering programs that are easier to reason about and maintain. While Pandas is used in many domains where such benefits are highly desirable, there is no way to enforce immutability in Pandas.

~ Documentation

One of the main things you learn when you start with scientific computing in Python is that you should not write for-loops over your data. Instead you are advised to use the vectorized functions provided by packages like numpy. The major share of computations can be represented as a combination of fast NumPy operations.

~ https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html
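To make the advice in the quote concrete, here is a minimal comparison, assuming NumPy is installed. Both versions compute the same total; the second pushes the arithmetic into NumPy instead of stepping through elements in the interpreter.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Explicit Python loop: one interpreter step per element.
loop_total = 0.0
for v in values:
    loop_total += v * 2.0

# Vectorized: the multiply and the sum both run inside NumPy.
vector_total = float((values * 2.0).sum())

print(loop_total == vector_total)
```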

In Action

StaticFrame aspires to have comparable or better performance than Pandas. While this is already the case for some core operations (See Performance), some important functions are far more performant in Pandas (such as reading delimited text files via pd.read_csv).

import static_frame as sf

df = sf.Frame.from_pandas(artwork_data)

print(df.shape)
df.dtypes

StaticFrame interfaces for extracting data will be familiar to Pandas users, though with a number of interface refinements to remove redundancies and increase consistency. On a Frame, __getitem__ is (exclusively) a column selector; loc and iloc are (with one argument) row selectors or (with two arguments) row and column selectors.

df['artist': 'year'].tail()

Instead of in-place assignment, an assign interface object (similar to the Frame.astype interface) is provided to expose __getitem__, loc, and iloc interfaces that, when called with an argument, return a new object with the desired changes. These interfaces expose the full range of expressive assignment-like idioms found in Pandas and NumPy. Arguments can be single values, or Series and Frame objects, where assignment will align on the Index.

StaticFrame immutability:

def inc(x):
  x+=1
  return x

print("Original: " + str(df.loc[69196, 'acquisitionYear']))
df.assign.loc[69196, 'acquisitionYear'](inc(df.loc[69196, 'acquisitionYear']))
df.loc[69196, 'acquisitionYear']
2013.0

Updating a StaticFrame structure requires creating a new one:

print("Original: " + str(df.loc[69196, 'acquisitionYear']))
df_updated = df.assign.loc[69196, 'acquisitionYear'](inc(df.loc[69196, 'acquisitionYear']))
df_updated.loc[69196, 'acquisitionYear']
2014.0

Pandas mutability:

print("Original: " + str(artwork_data.loc[69196, 'acquisitionYear']))
artwork_data.at[69196, 'acquisitionYear'] = inc(artwork_data.at[69196, 'acquisitionYear'])
artwork_data.loc[69196, 'acquisitionYear']
2014.0

When the cell is run again, the value of artwork_data.at[69196, 'acquisitionYear'] has already been mutated.

print("Original: " + str(artwork_data.loc[69196, 'acquisitionYear']))
artwork_data.at[69196, 'acquisitionYear'] = inc(artwork_data.at[69196, 'acquisitionYear'])
artwork_data.loc[69196, 'acquisitionYear']
2015.0

Documentation

StaticFrame does not implement its own types or numeric computation routines, relying entirely on NumPy. NumPy offers desirable stability in performance and interface. For working with SciPy and related tools, StaticFrame exposes easy access to NumPy arrays.

The static_frame.Series and static_frame.Frame store data in immutable NumPy arrays. Once created, array values cannot be changed. StaticFrame manages NumPy arrays, setting the ndarray.flags.writeable attribute to False on all managed and returned NumPy arrays.
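The NumPy mechanism named in that paragraph is easy to try on its own. The sketch below uses plain NumPy and is not StaticFrame's code: once the writeable flag is off, assignment raises, and an update has to produce a new array.

```python
import numpy as np

arr = np.array([2013.0, 2014.0])
arr.flags.writeable = False        # the flag the quote refers to

# Writing to a read-only array raises ValueError.
try:
    arr[0] = 1999.0
    write_allowed = True
except ValueError:
    write_allowed = False

# "Updating" means building a new array; the original is untouched.
updated = arr + 1

print(write_allowed, arr.tolist(), updated.tolist())
```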

Hand Spun

Official Docs

I would argue that it is better style to pass that variable in as an argument to the function, or create a class that contains that variable and the function. Using globals in Python is usually a bad idea.

via Joop

class Bla(object):
    def __init__(self):
        self._df = pd.DataFrame(index=[1,2,3])

    @property
    def df(self):
        return self._df.copy()

a = [0,1]
a.append(2)
print(a)
a_new = a + [3]
print(a)
a_new
[0, 1, 2, 3]

import pandas as pd
test_s = pd.Series([1,2,3])
print("1st: %s %s Length: %s" % (id(test_s), id(test_s.array), len(test_s)))
test_s[3] = 37
print("2nd: %s %s Length: %s" % (id(test_s), id(test_s.array), len(test_s)))

Appending and deleting are allowed, but that doesn't necessarily imply the Series is mutable.

Series and DataFrames are internally backed by NumPy arrays, which are fixed in size (their values can change, but their length cannot). This allows a more compact memory representation and better performance.

When you assign to a label that does not yet exist in a Series, you're actually calling Series.__setitem__, which delegates to the .loc indexer and builds a new, larger array. This new array is then assigned back to the same Series (of course, as the end user, you don't get to see this), giving you the illusion that the Series grew in place.
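One way to see the swap is to hold on to the backing array before the assignment and compare afterwards. This is a sketch assuming pandas and NumPy are installed. It compares memory rather than id(test_s.array), since it is safer not to rely on .array returning the same wrapper object on every access.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
series_id = id(s)
before = s.to_numpy()              # the array backing the Series right now

s[3] = 37                          # label 3 does not exist yet, so the Series grows

after = s.to_numpy()
same_series = id(s) == series_id
shares_buffer = np.shares_memory(before, after)

print(same_series, shares_buffer, len(before), len(s))
```

The Series object keeps its identity, but the data behind it has been replaced by a longer array, while the old three-element array is left as it was.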