Python Pandas Dataframe - 基于条件的Groupby和Average(Python Pandas Dataframe - Groupby and Average based on Condition)

我有一个如下所示的数据框:

id start end diff mindiff 1 2015-01-02 2015-07-01 180 57 2 2015-02-03 2015-05-12 98 56 3 2015-01-15 2015-01-20 5 5 4 2015-02-04 2015-04-15 70 55 5 2015-03-15 2015-05-01 47 46 6 2015-02-22 2015-03-01 7 7 7 2015-03-21 2015-04-12 22 22 8 2015-04-11 2015-06-15 65 50 9 2015-04-11 2015-05-01 20 20 10 2015-03-30 2015-04-01 2 2 11 2015-04-28 2015-06-15 48 33 12 2015-05-01 2015-06-01 31 31 13 2015-05-10 2015-06-09 30 30 14 2015-05-19 2015-07-01 43 42 15 2015-06-01 2015-06-06 5 5 16 2015-06-02 2015-06-29 27 27 17 2015-04-29 2015-05-21 22 22 18 2015-05-25 2015-07-01 37 36 19 2015-06-04 2015-06-26 22 22 20 2015-06-21 2015-07-01 10 10 21 2015-05-30 2015-06-06 7 7 22 2015-06-30 2015-07-01 1 1

字段是id,start(日期),end(日期),diff(开始和结束之间的天数),mindiff(min(差异和开始的最后一天x个月)。

在这种情况下x是1(因此比“开始日期晚一个月”)

我想要完成的是找到Mindiff的平均值(平均值),按照“结束”的年/月分组,但只对每个组平均具有“开始”年/月x(上面定义)的记录几个月回到groupedby月份。 从上面的数据集中的示例,id 1将仅以年/月2015/1和2015/1 + x(2015/2)进行平均。

这是一个标记每条记录的表格,以及我希望平均的月份:

Months id 1 2 3 4 5 6 7 1 1 1 2 1 1 3 1 4 1 1 5 1 1 6 1 1 7 1 1 8 1 1 9 1 1 10 1 1 11 1 1 12 1 1 13 1 1 14 1 1 15 1 16 1 17 1 1 18 1 1 19 1 20 1 1 21 1 1 22 1 1

以下是我正在寻找的思维导致的AVG /月份:

Months id 1 2 3 4 5 6 7 1 57 57 2 56 56 3 5 4 55 55 5 46 46 6 7 7 7 22 22 8 50 50 9 20 20 10 2 2 11 33 33 12 31 31 13 30 30 14 42 42 15 5 16 27 17 22 22 18 36 36 19 22 20 10 10 21 7 7 22 1 1 AVG 31 43.8 31.3 27.9 30.1 21.1 5.5

最后,这是我正在寻找的数据帧:

Month Avg Diff Trailing x months 2015-01 31 2015-02 43.75 2015-03 31.33333333 2015-05 27.85714286 2015-05 30.11111111 2015-06 21.1 2015-07 5.5

我知道这可能是一个循环,但我的直觉说GROUPBY更pythonic,可能更有效。 但是,我如何才能得到“开始”月份的具体滚动思维值,以便在“年末/月”的组合中取平均值。 谢谢您的帮助。

I have a dataframe that looks like this:

id start end diff mindiff 1 2015-01-02 2015-07-01 180 57 2 2015-02-03 2015-05-12 98 56 3 2015-01-15 2015-01-20 5 5 4 2015-02-04 2015-04-15 70 55 5 2015-03-15 2015-05-01 47 46 6 2015-02-22 2015-03-01 7 7 7 2015-03-21 2015-04-12 22 22 8 2015-04-11 2015-06-15 65 50 9 2015-04-11 2015-05-01 20 20 10 2015-03-30 2015-04-01 2 2 11 2015-04-28 2015-06-15 48 33 12 2015-05-01 2015-06-01 31 31 13 2015-05-10 2015-06-09 30 30 14 2015-05-19 2015-07-01 43 42 15 2015-06-01 2015-06-06 5 5 16 2015-06-02 2015-06-29 27 27 17 2015-04-29 2015-05-21 22 22 18 2015-05-25 2015-07-01 37 36 19 2015-06-04 2015-06-26 22 22 20 2015-06-21 2015-07-01 10 10 21 2015-05-30 2015-06-06 7 7 22 2015-06-30 2015-07-01 1 1

The fields are id, start (date), end (date), diff(the number of days between start and end), mindiff(the min(diff and the last day x months from start).

x in this case is 1 (so one month "later than" the start date)

What I am trying to accomplish is to find the Average(mean) of mindiff, grouped by the year/month of 'end', but ONLY averaging for each group the records which have a 'start' year/month x (defined above) months back through the groupedby month. Example from the data set above, id 1 would only be averaged in year/month 2015/1 and 2015/1+x (2015/2).

Here is a table flagging each record and which month I would like to average in:

Months id 1 2 3 4 5 6 7 1 1 1 2 1 1 3 1 4 1 1 5 1 1 6 1 1 7 1 1 8 1 1 9 1 1 10 1 1 11 1 1 12 1 1 13 1 1 14 1 1 15 1 16 1 17 1 1 18 1 1 19 1 20 1 1 21 1 1 22 1 1

Here are the mindiffs and resulting AVG/month I am looking for:

Months id 1 2 3 4 5 6 7 1 57 57 2 56 56 3 5 4 55 55 5 46 46 6 7 7 7 22 22 8 50 50 9 20 20 10 2 2 11 33 33 12 31 31 13 30 30 14 42 42 15 5 16 27 17 22 22 18 36 36 19 22 20 10 10 21 7 7 22 1 1 AVG 31 43.8 31.3 27.9 30.1 21.1 5.5

Finally, here is the dataframe I am looking for as a result:

Month Avg Diff Trailing x months 2015-01 31 2015-02 43.75 2015-03 31.33333333 2015-05 27.85714286 2015-05 30.11111111 2015-06 21.1 2015-07 5.5

I know this is possible with a loop, but my gut says that a GROUPBY is more pythonic and likely more efficient. But how do I get only the specific rolling mindiff values for months of 'start' to be averaged within the groupby of the 'end' year/month. Thanks for the help.

最满意答案

首先,我创建了不同年份的测试数据,并将最后一行的开始设置为12月。 然后我将start和end列转换为句点 - periodS和periodE cols。

我按列month使用函数groupby ,从列Avg计算平均值:

g = df1.groupby('months')['Avg'].mean().reset_index() import pandas as pd import numpy as np import io temp=u"""id;start;end 1;2014-01-02;2014-07-01 2;2014-02-03;2014-05-12 3;2014-01-15;2014-01-20 4;2014-02-04;2014-04-15 5;2014-03-15;2014-05-01 6;2014-02-22;2014-03-01 7;2015-03-21;2015-04-12 8;2015-04-11;2015-06-15 9;2015-04-11;2015-05-01 10;2015-03-30;2015-04-01 11;2015-04-28;2015-06-15 12;2015-05-01;2015-06-01 13;2015-05-10;2015-06-09 14;2016-05-19;2016-07-01 15;2016-06-01;2016-06-06 16;2016-06-02;2016-06-29 17;2016-04-29;2016-05-21 18;2016-05-25;2016-07-01 19;2017-06-04;2017-06-26 20;2017-06-21;2017-07-01 21;2017-05-30;2017-06-06 22;2017-12-30;2018-02-01""" df = pd.read_csv(io.StringIO(temp), sep=";", index_col=[0]) print df def last_day_of_next_month(any_day): next_month = any_day.replace(day=28) + pd.Timedelta(days=36) # this will never fail return next_month - pd.Timedelta(days=next_month.day) df['mindiff'] = (pd.to_datetime(df['start']).apply(last_day_of_next_month) - pd.to_datetime(df['start'])).astype('timedelta64[D]') df['diff'] = (pd.to_datetime(df['end']) - pd.to_datetime(df['start'])).astype('timedelta64[D]') df['mindiff'] = df[['mindiff', 'diff']].apply(lambda x: min(x), axis=1) #print df #set day of start and end to periodindex df['periodS'] = pd.to_datetime(df['start']).dt.to_period('M') df['periodE'] = pd.to_datetime(df['end']).dt.to_period('M') #if period end is higher as period start, add one month else NaN df['period'] = np.where(df['periodE'] > df['periodS'],df['periodS'] + 1, np.nan) #print df #df from subset df1 = df[['mindiff', 'periodS', 'period']] #pivot data (from rows to columns) df1 = df1.set_index('mindiff').stack().reset_index() #rename columns names df1.columns = ['Avg', 'tmp', 'months'] #groupby by column month and count mean from column Avg g = df1.groupby('months')['Avg'].mean().reset_index() print g # months Avg #0 2014-01 31.000000 #1 2014-02 43.750000 #2 2014-03 41.000000 #3 2014-04 46.000000 #4 2015-03 12.000000 #5 2015-04 25.400000 #6 2015-05 32.800000 #7 2015-06 30.500000 #8 2016-04 22.000000 #9 2016-05 33.333333 #10 2016-06 27.500000 #11 2017-05 7.000000 #12 2017-06 13.000000 #13 2017-07 10.000000 #14 2017-12 32.000000 #15 2018-01 32.000000

First I created testing data with different years and start in last row is set to December. Then I convert start and end columns to periods - periodS and periodE cols.

I use function groupby by column month and count mean from column Avg:

g = df1.groupby('months')['Avg'].mean().reset_index() import pandas as pd import numpy as np import io temp=u"""id;start;end 1;2014-01-02;2014-07-01 2;2014-02-03;2014-05-12 3;2014-01-15;2014-01-20 4;2014-02-04;2014-04-15 5;2014-03-15;2014-05-01 6;2014-02-22;2014-03-01 7;2015-03-21;2015-04-12 8;2015-04-11;2015-06-15 9;2015-04-11;2015-05-01 10;2015-03-30;2015-04-01 11;2015-04-28;2015-06-15 12;2015-05-01;2015-06-01 13;2015-05-10;2015-06-09 14;2016-05-19;2016-07-01 15;2016-06-01;2016-06-06 16;2016-06-02;2016-06-29 17;2016-04-29;2016-05-21 18;2016-05-25;2016-07-01 19;2017-06-04;2017-06-26 20;2017-06-21;2017-07-01 21;2017-05-30;2017-06-06 22;2017-12-30;2018-02-01""" df = pd.read_csv(io.StringIO(temp), sep=";", index_col=[0]) print df def last_day_of_next_month(any_day): next_month = any_day.replace(day=28) + pd.Timedelta(days=36) # this will never fail return next_month - pd.Timedelta(days=next_month.day) df['mindiff'] = (pd.to_datetime(df['start']).apply(last_day_of_next_month) - pd.to_datetime(df['start'])).astype('timedelta64[D]') df['diff'] = (pd.to_datetime(df['end']) - pd.to_datetime(df['start'])).astype('timedelta64[D]') df['mindiff'] = df[['mindiff', 'diff']].apply(lambda x: min(x), axis=1) #print df #set day of start and end to periodindex df['periodS'] = pd.to_datetime(df['start']).dt.to_period('M') df['periodE'] = pd.to_datetime(df['end']).dt.to_period('M') #if period end is higher as period start, add one month else NaN df['period'] = np.where(df['periodE'] > df['periodS'],df['periodS'] + 1, np.nan) #print df #df from subset df1 = df[['mindiff', 'periodS', 'period']] #pivot data (from rows to columns) df1 = df1.set_index('mindiff').stack().reset_index() #rename columns names df1.columns = ['Avg', 'tmp', 'months'] #groupby by column month and count mean from column Avg g = df1.groupby('months')['Avg'].mean().reset_index() print g # months Avg #0 2014-01 31.000000 #1 2014-02 43.750000 #2 2014-03 41.000000 #3 2014-04 46.000000 #4 2015-03 12.000000 #5 2015-04 25.400000 #6 2015-05 32.800000 #7 2015-06 30.500000 #8 2016-04 22.000000 #9 2016-05 33.333333 #10 2016-06 27.500000 #11 2017-05 7.000000 #12 2017-06 13.000000 #13 2017-07 10.000000 #14 2017-12 32.000000 #15 2018-01 32.000000

更多推荐