分享
三行代码  ›  专栏  ›  技术社区  ›  est

Python Dataframe:基于特定条件删除重复项

  •  1
  • est  · 技术社区  · 1 周前

    具有重复的车间ID的数据帧,其中某些车间ID发生两次,而有些发生三次:
    我只想根据分配给其区域的最短店铺距离保留唯一的店铺ID。

        Area  Shop Name  Shop Distance  Shop ID   
    
    0   AAA   Ly         86             5d87790c46a77300
    1   AAA   Hi         230            5ce5522012138400
    2   BBB   Hi         780            5ce5522012138400
    3   CCC   Ly         450            5d87790c46a77300
    ...
    91  MMM   Ju         43             4f76d0c0e4b01af7
    92  MMM   Hi         1150           5ce5522012138400
    ...
    

    使用pandas drop_duplicates drop the row duplicates,但条件基于第一个/最后一个出现的店铺ID,这不允许我按距离排序:

    shops_df = shops_df.drop_duplicates(subset='Shop ID', keep= 'first')
    

    我也试着按商店ID分组然后排序,但是sort返回错误:重复

    bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
    bbtshops_new.sort_values(by=['C'], axis=1)
    

    到目前为止,我试着一直做到这个阶段:

    # filter all the duplicates into a new df
    df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep= False)]
    
    # create a mask for all unique Shop ID
    mask = df_toclean['Shop ID'].value_counts()
    
    # create a mask for the Shop ID that occurred 2 times
    shop_2 = mask[mask==2].index
    
    # create a mask for the Shop ID that occurred 3 times
    shop_3 = mask[mask==3].index
    
    # create a mask for the Shops that are under radius 750 
    dist_1 = df_toclean['Shop Distance']<=750
    
    # returns results for all the Shop IDs that appeared twice and under radius 750
    bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]
    
    * if i use df_toclean['Shop Distance'].min() instead of dist_1 it returns 0 results
    
    

    我想我已经做了很长的一段时间了,但仍然没有找到删除副本的方法,有谁知道如何用更短的方式解决这个问题?我是python新手,谢谢你的帮助!

    2 回复  |  直到 1 周前
        1
  •  3
  •   Alexander    1 周前

    尝试首先根据距离对数据帧进行排序,然后删除重复的商店。

    df = shops_df.sort_values('Distance')
    df = df[~df['Shop ID'].duplicated()]  # The tilda (~) inverts the boolean mask.
    

    或者只是一个链式表达式(每个注释来自@chmielcode)。

    df = (
        shops_df
        .sort_values('Distance')
        .drop_duplicates(subset='Shop ID', keep= 'first')
        .reset_index(drop=True)  # Optional.
    )
    
        2
  •  0
  •   oppressionslayer    1 周前

    您可以使用idxmin:

    df.loc[df.groupby('Area')['Shop Distance'].idxmin()]
    
      Area Shop Name  Shop  Distance              Shop ID
    0  AAA        Ly              86     5d87790c46a77300
    2  BBB        Hi             780     5ce5522012138400
    3  CCC        Ly             450     5d87790c46a77300
    4  MMM        Ju              43     4f76d0c0e4b01af7