Adventures in Immutable Python
https://blog.quiltdata.com/declarative-etl-with-pandas-and-pyarrow-a779de13ef9d
Mutability
a = 5
b = 5
print("IDs a: %s, b: %s" % (id(a), id(b)))
print("a == b %s, a is b %s" % (a == b, a is b))
a = 5000
b = 5000
print("IDs a: %s, b: %s" % (id(a), id(b)))
print("a == b %s, a is b %s" % (a == b, a is b))
Value immutability (42) vs. object mutation ([42])
print(" 42 == 42 %s,  42 is 42 %s" % (42 == 42, 42 is 42))
print("[42] == [42] %s, [42] is [42] %s" % ([42] == [42], [42] is [42]))
More in the official docs Assignment Statement and Comparisons.
Arrow
One of the goals of Apache Arrow is to serve as a common data layer enabling zero-copy data exchange between multiple frameworks. A key component of this vision is the use of off-heap memory management (via Plasma) for storing and sharing Arrow-serialized objects between applications.
Arrays are immutable once created. Implementations can provide APIs to mutate an array, but applying mutations will require a new array data structure to be built.
Plasma
Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries. In light of the trend toward larger and larger multicore machines, Plasma enables critical performance optimizations in the big data regime.
Using Plasma plus Arrow, the data being operated on would be placed in the Plasma store once, and all of the workers would read the data without copying or deserializing it (the workers would map the relevant region of memory into their own address spaces). The workers would then put the results of their computation back into the Plasma store, which the driver could then read and aggregate without copying or deserializing the data.
nohup plasma_store -m 60397977 -s /tmp/plasma &> /tmp/plasma.log &
sleep 1
tail /tmp/plasma.log
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")
client
# Create an object.
object_id = plasma.ObjectID(20 * b'a')
object_size = 1000
buffer = memoryview(client.create(object_id, object_size))

# Write to the buffer.
for i in range(1000):
    buffer[i] = i % 128

# Seal the object, making it immutable and available to other clients.
client.seal(object_id)

# The character "a" is encoded as 61 in hex.
object_id
# Create a python object.
object_id = client.put("hello, world")

# Get the object.
client.get(object_id)
# Create a different client. Note that this second client could be
# created in the same or in a separate, concurrent Python session.
client2 = plasma.connect("/tmp/plasma")

# Get the object in the second client. This blocks until the object has been sealed.
object_id2 = plasma.ObjectID(20 * b"a")
[buffer2] = client2.get_buffers([object_id2])
buffer2
If the object has not been sealed yet, then the call to client.get_buffers will block until the object has been sealed by the client constructing the object. Using the timeout_ms argument to get, you can specify a timeout for this (in milliseconds). After the timeout, the interpreter will yield control back.
print(str(buffer) + " --- " + str(buffer[1]))
print(str(buffer2) + " --- " + str(buffer2[1]) + " (unsealed)")
print(bytes(buffer[1:4]))
print(bytes(buffer2[1:4]))
Arrow Tables
The equivalent to a pandas DataFrame in Arrow is a Table. Both consist of a set of named columns of equal length. While pandas only supports flat columns, the Table also provides nested columns, thus it can represent more data than a DataFrame, so a full conversion is not always possible.
Conversion from a Table to a DataFrame is done by calling pyarrow.Table.to_pandas(). The inverse is achieved by using pyarrow.Table.from_pandas().
~ DataFrames via arrow.apache.org
from pyarrow import csv

table = csv.read_csv("artwork_data.csv")
print(table)

df_mutable = table.to_pandas()
df_mutable.head()
Pandas
Storing a Pandas DataFrame still follows the create-then-seal process of storing an object in the Plasma store; however, one cannot directly write the DataFrame to Plasma with Pandas alone. Plasma also needs to know the size of the DataFrame in order to allocate a buffer for it.
import pyarrow as pa
import pandas as pd

# Create a Pandas DataFrame
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Convert the Pandas DataFrame into a PyArrow RecordBatch
record_batch = pa.RecordBatch.from_pandas(df)
Creating the Plasma object requires an ObjectID and the size of the data. Now that we have converted the Pandas DataFrame into a PyArrow RecordBatch, use a MockOutputStream to determine the size of the Plasma object.
# Useful utility for generating random object store ids
import numpy as np

def random_object_id():
    return plasma.ObjectID(np.random.bytes(20))

random_object_id()
# Create the Plasma object from the PyArrow RecordBatch. Most of the work here
# is done to determine the size of buffer to request from the object store.
object_id = random_object_id()
mock_sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(mock_sink, record_batch.schema)
stream_writer.write_batch(record_batch)
stream_writer.close()
data_size = mock_sink.size()
buf = client.create(object_id, data_size)
The DataFrame can now be written to the buffer as follows.
# Write the PyArrow RecordBatch to Plasma
stream = pa.FixedSizeBufferWriter(buf)
stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema)
stream_writer.write_batch(record_batch)
stream_writer.close()
Finally, seal the finished object for use by all clients:
# Seal the Plasma object
client.seal(object_id)
# Fetch the Plasma object
[data] = client.get_buffers([object_id])  # Get PlasmaBuffer from ObjectID
df_buffer = pa.BufferReader(data)
# Convert object back into an Arrow RecordBatch
reader = pa.RecordBatchStreamReader(df_buffer)
record_batch = reader.read_next_batch()
# Convert back into Pandas
result = record_batch.to_pandas()
StaticFrame
Pandas
All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability where sensible.
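For example (a minimal sketch, not from the quoted docs): sort_values returns a new Series and leaves the input untouched, illustrating the "methods produce new objects" convention.

```python
import pandas as pd

s = pd.Series([3, 1, 2])
s_sorted = s.sort_values()  # returns a new Series; s is not reordered

print(s.tolist())         # input untouched: [3, 1, 2]
print(s_sorted.tolist())  # new object: [1, 2, 3]
```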
import numpy as np
import pandas as pd

artwork_data = pd.read_csv("artwork_data.csv")
artwork_data.drop(columns=["accession_number", "artistRole", "artistId", "dateText",
                           "acquisitionYear", "dimensions", "width", "height", "depth",
                           "creditLine", "units", "inscription", "thumbnailCopyright",
                           "thumbnailUrl", "url"])
# (inplace=True) leaves id, artist, title, medium, year
Rationale
Immutable data structures reduce opportunities for error and promote the design of pure functions, offering programs that are easier to reason about and maintain. While Pandas is used in many domains where such benefits are highly desirable, there is no way to enforce immutability in Pandas.
~ Documentation
One of the main things you learn when you start with scientific computing in Python is that you should not write for-loops over your data. Instead you are advised to use the vectorized functions provided by packages like numpy. The major share of computations can be represented as a combination of fast NumPy operations.
~ https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html
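The point of the quote can be sketched in a couple of lines: the vectorized NumPy expression replaces an explicit Python-level loop over the same data.

```python
import numpy as np

data = np.arange(5)

# Python-level loop: one interpreter iteration per element.
squared_loop = [x ** 2 for x in data]

# Vectorized: a single fast NumPy operation over the whole array.
squared = data ** 2

print(squared.tolist())  # [0, 1, 4, 9, 16]
```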
In Action
StaticFrame aspires to have comparable or better performance than Pandas. While this is already the case for some core operations (see Performance), some important functions are far more performant in Pandas (such as reading delimited text files via pd.read_csv).
import static_frame as sf

df = sf.Frame.from_pandas(artwork_data)
print(df.shape)
df.dtypes
StaticFrame interfaces for extracting data will be familiar to Pandas users, though with a number of interface refinements to remove redundancies and increase consistency. On a Frame, __getitem__ is (exclusively) a column selector; loc and iloc are (with one argument) row selectors or (with two arguments) row and column selectors.
df['artist': 'year'].tail()
Instead of in-place assignment, an assign interface object (similar to the Frame.astype interface shown above) is provided to expose __getitem__, loc, and iloc interfaces that, when called with an argument, return a new object with the desired changes. These interfaces expose the full range of expressive assignment-like idioms found in Pandas and NumPy. Arguments can be single values, or Series and Frame objects, where assignment will align on the Index.
StaticFrame immutability:
def inc(x):
    x += 1
    return x

print("Original: " + str(df.loc[69196, 'acquisitionYear']))
df.assign.loc[69196, 'acquisitionYear'](inc(df.loc[69196, 'acquisitionYear']))
df.loc[69196, 'acquisitionYear']
Updating a StaticFrame structure requires creating a new one:
print("Original: " + str(df.loc[69196, 'acquisitionYear']))
df_updated = df.assign.loc[69196, 'acquisitionYear'](inc(df.loc[69196, 'acquisitionYear']))
df_updated.loc[69196, 'acquisitionYear']
Pandas mutability:
print("Original: " + str(artwork_data.loc[69196, 'acquisitionYear']))
artwork_data.at[69196, 'acquisitionYear'] = inc(artwork_data.at[69196, 'acquisitionYear'])
artwork_data.loc[69196, 'acquisitionYear']
When the cell is run again, the value of artwork_data.at[69196, 'acquisitionYear'] has already been mutated.
print("Original: " + str(artwork_data.loc[69196, 'acquisitionYear']))
artwork_data.at[69196, 'acquisitionYear'] = inc(artwork_data.at[69196, 'acquisitionYear'])
artwork_data.loc[69196, 'acquisitionYear']
Documentation
StaticFrame does not implement its own types or numeric computation routines, relying entirely on NumPy. NumPy offers desirable stability in performance and interface. For working with SciPy and related tools, StaticFrame exposes easy access to NumPy arrays.
The static_frame.Series and static_frame.Frame store data in immutable NumPy arrays. Once created, array values cannot be changed. StaticFrame manages NumPy arrays by setting the ndarray.flags.writeable attribute to False on all managed and returned NumPy arrays.
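The same mechanism is easy to demonstrate on a bare NumPy array (a sketch; StaticFrame sets this flag for you on every managed array):

```python
import numpy as np

arr = np.array([1, 2, 3])
arr.flags.writeable = False  # the flag StaticFrame sets on managed arrays

try:
    arr[0] = 99
except ValueError as exc:
    # NumPy refuses in-place assignment on a read-only array.
    print("mutation blocked:", exc)
```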
Hand Spun
Official Docs
- Objects, Values, and Types: lists mutable and immutable types
I would argue that it is better style to pass that variable in as an argument to the function, or to create a class that contains that variable and the function. Using globals in Python is usually a bad idea.
via Joop
class Bla(object):
    def __init__(self):
        self._df = pd.DataFrame(index=[1, 2, 3])

    def df(self):
        return self._df.copy()
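A usage sketch (with a hypothetical "v" column added so there is something to mutate) shows why the .copy() matters: callers can freely mutate what they are handed without ever touching the internal state.

```python
import pandas as pd

class Bla(object):
    def __init__(self):
        # Hypothetical column, not in the original snippet.
        self._df = pd.DataFrame({"v": [1, 2, 3]}, index=[1, 2, 3])

    def df(self):
        # Hand out a defensive copy, never the internal frame.
        return self._df.copy()

holder = Bla()
external = holder.df()
external["v"] = 0  # mutates only the returned copy

print(holder.df()["v"].tolist())  # internal state is unchanged: [1, 2, 3]
```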
a = [0, 1]
a.append(2)
print(a)

a_new = a + [3]
print(a)
a_new
import pandas as pd

test_s = pd.Series([1, 2, 3])
print("1st: %s %s Length: %s" % (id(test_s), id(test_s.array), len(test_s)))
test_s[3] = 37
print("2nd: %s %s Length: %s" % (id(test_s), id(test_s.array), len(test_s)))
Appending and deleting are allowed, but that doesn't necessarily imply the Series is mutable.
Series/DataFrames are internally represented by NumPy arrays which are immutable (fixed size) to allow a more compact memory representation and better performance.
When you assign to a Series, you're actually calling Series.__setitem__ (which then delegates to NDFrame.__loc__), which creates a new array. This new array is then assigned back to the same Series (of course, as the end user, you don't get to see this), giving you the illusion of mutability.