I have a Pandas DataFrame representing data grouped by an ID, and I need to update values in one column based on conditions that involve values in other rows within the same group. Specifically, I want to update a column status
to 'resolved' under these conditions:
- The
status
column for the current row is initially 'pending'. - There is at least one other row within the same
group_id
wherestatus
is 'active' and therelated_id
matches theid
of the current row.
import pandas as pd
data = {'id': [1, 2, 3, 4, 5, 6, 7],
'group_id': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
'status': ['pending', 'active', 'pending', 'pending', 'active', 'pending', 'pending'],
'related_id': [None, None, 1, None, None, 4, 4]}
df = pd.DataFrame(data)
print(df)
id group_id status related_id
0 1 A pending None
1 2 A active None
2 3 A pending 1
3 4 B pending None
4 5 B active None
5 6 B pending 4
6 7 B pending 4
In this example, rows with id
3, 6, and 7 should have their status
updated to 'resolved' because they are 'pending', belong to the same group_id
as a row with 'active' status, and their related_id
matches the id
of that 'active' row (row 1 for id 3, row 4 for id 6 & 7).
The output I want:
id group_id status related_id
0 1 A pending None
1 2 A active None
2 3 A resolved 1
3 4 B pending None
4 5 B active None
5 6 B resolved 4
6 7 B resolved 4
I tried using groupby()
and apply()
, but I'm struggling to apply the condition check across rows within each group.
Answer
A possible solution:
m = (df.groupby('group_id')
.apply(lambda x:
x['related_id'].isin(x.index[x['status'].eq('active')]),
include_groups=False)
.reset_index(drop=True))
df.loc[m, 'status'] = 'resolved'
Output:
id group_id status related_id
0 1 A pending NaN
1 2 A active NaN
2 3 A resolved 1.0
3 4 B pending NaN
4 5 B active NaN
5 6 B resolved 4.0
6 7 B resolved 4.0