How to use Python Polars copy-on-write principle

python
Ethan Jackson

I come from the C++ and R world and have just started using Polars. It is a great library. I want to confirm my understanding of its copy-on-write principle:

import polars as pl

x = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

y = x
# Here, y is semantically a copy of x and
# users shall treat y as a copy of x, but under the
# hood, y is currently still just a "reference" to x.

y = y.with_columns(
    pl.Series([7, 8, 9]).alias('b')
)
# Copy-on-write occurs, but only a new column 'b' is created.

z = y
# Users shall treat z as a copy of y, but
# under the hood, z is currently still just a
# "reference" to y.

# Create row index for conditional operations
z = z.with_row_index("row_idx")

z = z.with_columns(
    pl.when(pl.col("row_idx") == 0)
    .then(10)
    .otherwise(pl.col("b"))
    .alias("b")
)
# Copy-on-write kicks in. The entire
# column 'b' is copied and then the first element is
# changed to 10.

z = z.with_columns(
    pl.when(pl.col("row_idx") == 1)
    .then(11)
    .otherwise(pl.col("b"))
    .alias("b")
)
# The 2nd element is changed in-place to 11.

# Remove the temporary row index column
z = z.drop("row_idx")

And at this point, x, y and z are semantically independent data frames, and users shall treat them as such. But under the hood, only one column 'a' exists right now. In other words, here are the data actually existing in memory:

[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
[10, 11, 9]

Are all my statements and code comments correct? If not, what should be the correct ones?

Edit: I threw my post into ChatGPT and it says:

"The 2nd element is changed in-place to 11." ❌ Not in-place. Polars is immutable. Even if you just change one element, the result is a new Series (and potentially a new DataFrame). So z.with_columns(...) always creates a new DataFrame and any modified Series is new memory.

Am I right, or is the AI right? Is there any part of the official documentation that answers this authoritatively?

Answer

All of your statements are correct except the last one:

"But under the hood, only one column 'a' exists right now"

If you meant that a single column chunk for 'a' is shared across those DataFrames, then you are entirely correct.

Polars uses copy-on-write (COW), so a column that no operation modifies keeps sharing the same memory across clones and transformations. In your example:

- You never modified column 'a' in y or z.

- So all three DataFrames (x, y, and z) still reference the exact same column chunk for 'a'.

Polars tracks column ownership at the chunk level. When you call something like .with_columns(...) involving 'a', it checks whether the expression affects the original buffer. If it does, the chunk is cloned; if not, it is reused.

In short, as long as a column is not modified, the DataFrames keep referencing the same chunk.
