-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy() #61925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy() #61925
Conversation
…y in .array/EA.to_numpy()
| # strictly less than 2000 to be below Index.__pandas_priority__. | ||
| __pandas_priority__ = 1000 | ||
|
|
||
| _readonly = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use arr.flags.writeable to be consistent with numpy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this was easier for a quick POC ;)
It would indeed keep it more consistent in usage, so that might be a reason to add a flags attribute, so code that needs to work with both ndarray or EA can use one code path. But I don't think we would ever add any of the other flags that numpy has, so not sure it would then be worth to add a nested attribute for this.
| elif self._readonly and astype_is_view(self.dtype, result.dtype): | ||
| # If the ExtensionArray is readonly, make the numpy array readonly too | ||
| result = result.view() | ||
| result.flags.writeable = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be done below the setting of na_value on L616?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, because in that case the result array is already a copy, so no need to take a read-only view in that case
|
i get why .values and .array are made read-only, but why are we bothering with to_numpy? |
|
That's a good question, I didn't really think about it deeply .. But so for the non-extension dtypes, we also did it for I do think there is value in being consistent in those different ways to get a numpy array from the pandas object. So could also ask, why not for |
I don't feel strongly about this, but asked in the first place because it seems most of the code complexity in this PR is driven by to_numpy changes. Without that, most of this is just boilerplate edits to The main reason i can think of to treat to_numpy different from .array and .values is that it has an explicit |
Looking at the diff again, I think it is a bit 50/50 between
Yeah, we could potentially also make the default of From previous discussions (maybe #52823), I seem to remember that we at some point did bring up whether it would be worth having a keyword to control this behaviour, i.e. so there would be a way that you could ask for a numpy array that was guaranteed to be mutable. Of course you could do instead of adding a keyword like (but this is probably a discussion for #52823) |
Good point to verify. For the EA class itself, it seems we have some methods that can act inplace:
For DataFrame et al, there are indeed some methods that allow inplace modifications, and that gives a problem with underlying read-only arrays, but that is not specific for EAs. That is also an issue already on main for default numpy dtypes if then end up using read-only arrays. This is essentially #62144 but then not specifically for |
mroeschke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jbrockmendel if you want to have another look
pandas/core/arrays/base.py
Outdated
| """ | ||
| raise AbstractMethodError(self) | ||
|
|
||
| def _getitem_returns_view(self, key) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we expect anyone to override this? i.e. does it need to be a method? or just convenient to put it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess this makes it easy for subclass authors to use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently don't override it anywhere, I just put it here in the base class because it gets used in the __getitem__ implementation of various subclasses. But so indeed does not need to be a method.
I don't think I expect EA authors to override the method (or if override it, they essentially just define it for their own, it's not that the overriding will influence something else on the base class).
They might want to use it for their own getitem implementation, and at that point it is a bit easier to have it as a method instead of a helper function somewhere. But let's start conservative and not expose it to EA authors (we can always add it later)
| result = data.fillna(data_missing[1]) | ||
| assert result[0] == data_missing[1] | ||
|
|
||
| # copy=False is ignored -> so same result as above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment (in a few places) confused me until (i think) i figured out it refers to the keyword in the fillna method. could clarify for future-me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking back at it, the first comment above "copy=False keyword is not ignored by SparseArray.fillna" also seems wrong, because SparseArray.fillna just completely ignores copy.
Attempted to clarify this.
|
small comments, no objections |
| super().test_fillna_no_op_returns_copy(data) | ||
|
|
||
| def test_fillna_readonly(self, data_missing): | ||
| # copy=False keyword is not ignored by SparseArray.fillna |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessarily for this PR, but should we add an EA attribute specifying whether copy=False is ignored? that way we could avoid overriding tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or could also be an attribute on the Test class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah either way it can be saved for a follow-up
|
thanks @jorisvandenbossche |
Addresses one of the remaining TODO items from #48998
Similar as #51082 and some follow-up PRs, ensuring we also mark EAs as read-only like we do for numpy arrays, when the user gets the underlying EA from a pandas object.
For that purpose, added a
_readonlyattribute to the EA class that is False by default.Still need to add more tests and fix a bunch of tests
closes #58007