分享
三行代码  ›  专栏  ›  技术社区  ›  Jordan Wrong

按dplyr/data.table分组的滚动平均值 - Rolling Mean By Group Dplyr/data.table

  •  0
  • Jordan Wrong  · 技术社区  · 3 月前

    我有一个大的data.frame,有350k行。目前我正在尝试在data.table中创建lag_close列的滚动平均值。自从 我说得更流利 dplyr 比在 data.table 我试图用 DPLYR 我犯了一个错误。

    df = df %>% na.omit()%>%  group_by(ticker) %>% mutate(avg10 = rollapplyr(lag_close, 10, mean))
    

    错误:列 avg10 必须是长度577(组大小)或1,而不是 五百六十七

    我的下一个尝试是使用data.table路由,因为我看到人们取得了成功。 没有返回任何错误,但输出非常奇怪且不准确。

    df = as.data.table(df)
    df[, laggedRets := rollapplyr(lag_close, 10, mean, na.rm = T), by = ticker]
    tail(df, 10)
    ticker ret.adjusted.prices ret.closing.prices lag_close         jump laggedRets
     1:   MELI         0.002731904        0.002731904    501.48  0.020174791    507.849
     2:   MELI         0.007258614        0.007258614    502.85  0.001966823    507.528
     3:   MELI        -0.002606134       -0.002606134    506.50 -0.011437339    507.085
     4:   MELI         0.018132218        0.018132218    505.18  0.003832865    506.282
     5:   MELI        -0.020628408       -0.020628408    514.34 -0.014373574    505.262
     6:   MELI         0.001568257        0.001568257    503.73  0.003428456         NA
     7:   MELI         0.018155879        0.018155879    504.52 -0.002957652         NA
     8:   MELI        -0.022815765       -0.022815765    513.68  0.002663470         NA
     9:   MELI        -0.007132015       -0.007132015    501.96 -0.005273231         NA
    10:   MELI        -0.012801487       -0.012801487    498.38 -0.016467802         NA
    
    head(df, 10)
      ticker ret.adjusted.prices ret.closing.prices lag_close         jump laggedRets
     1:     AA        0.0496011100       0.0496011100     28.83  0.006568736         NA
     2:     AA        0.0128883014       0.0128883014     30.26 -0.002647255         NA
     3:     AA        0.0009787928       0.0009787928     30.65  0.001304239         NA
     4:     AA       -0.0391134289      -0.0391134289     30.68  0.006497716         NA
     5:     AA        0.0508819539       0.0508819539     29.48  0.011802531     31.154
     6:     AA        0.0319560684       0.0319560684     30.98  0.003222691     31.752
     7:     AA        0.0334689407       0.0334689407     31.97  0.031709630     32.268
     8:     AA       -0.0009080811      -0.0009080811     33.04 -0.006680893     32.769
     9:     AA       -0.0112086950      -0.0112086950     33.01 -0.001212398     33.330
    10:     AA        0.0664829064       0.0664829064     32.64  0.009452792     34.132
    

    似乎在data.table的两边都有NA,但情况并非如此。而且这些数字也不准确。我不确定我遗漏了什么。 这是一个样品,你们可以测试一下。

    df =structure(list(ticker = c("AA", "AA", "AA", "AA", "AA", "AA", 
    "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", 
    "AA", "AA", "AA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", 
    "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", 
    "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AAL", "AAL", 
    "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", 
    "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", 
    "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", 
    "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", 
    "AAOI", "AAOI", "AAOI", "AAOI", "AAPL", "AAPL", "AAPL", "AAPL", 
    "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", 
    "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", 
    "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", 
    "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", 
    "ABBV", "ABBV", "ABBV", "ABBV"), lag_close = c(28.83, 30.26, 
    30.65, 30.68, 29.48, 30.98, 31.969999, 33.040001, 33.009998, 
    32.639999, 34.810001, 35.419998, 35.66, 36.290001, 37.5, 36.529999, 
    36.299999, 36.669998, 36.549999, 36.450001, 38.900002, 40.060001, 
    41.34, 41.23, 41.34, 42.299999, 42.59, 42.110001, 42.27, 41.990002, 
    42.029999, 42.09, 42.049999, 42.400002, 43.900002, 44.939999, 
    44.549999, 44.419998, 43.93, 44.07, 46.299999, 46.700001, 45.889999, 
    46.209999, 47.080002, 48.48, 48.639999, 48.099998, 47.650002, 
    46.75, 47.639999, 47.259998, 48, 46.939999, 47.540001, 47.91, 
    49.59, 46.950001, 44.900002, 44.25, 23.1, 23.51, 22.809999, 22.42, 
    22.24, 22.870001, 22.719999, 29.860001, 28.219999, 30.440001, 
    30.84, 30.48, 30.58, 31.290001, 32, 31.709999, 31.280001, 31.15, 
    30.41, 30.76, 116.150002, 116.019997, 116.610001, 117.910004, 
    118.989998, 119.110001, 119.75, 119.25, 119.040001, 120, 119.989998, 
    119.779999, 120, 120.080002, 119.970001, 121.879997, 121.940002, 
    121.949997, 121.629997, 121.349998, 62.41, 63.290001, 63.77, 
    63.790001, 64.209999, 64.07, 61.139999, 61.279999, 61.990002, 
    61.860001, 61.66, 61.380001, 61.150002, 60.959999, 60.580002, 
    61.380001, 61.27, 60, 60.43, 61.110001)), row.names = c(NA, -120L
    ), class = "data.frame", .Names = c("ticker", "lag_close"))
    
    1 回复  |  直到 3 月前
        1
  •  1
  •   Cettt    3 月前

    你可以使用 map purrr 打包并应用于 1:n() :

    df = df %>% 
      na.omit() %>% 
      group_by(ticker) %>% 
      mutate(avg10 = map_dbl(1:n(), ~mean(lag_close[(max(.x-9, 1)):.x], na.rm =T))
    

    当然,您必须对前9行中少于10个观察值的情况进行谴责。在我的解中,第1行到第9行包含最后1到9个观测值的平均值。