Hello,
I'm trying to use the groupby.apply method in a Time Series Forecasting application, but I'm facing a severe performance issue that seems to be related to the processes generated/CPU performance.
Basically, what I need to do is group by each one of my time series and apply the regressor to those groups, which is done by the code below. I'm already annotating the return types of the DataFrame my function creates, and I'm currently using the 'distributed-sequence' index, since this is a time series task and the index sequence is relevant.
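The snippet referenced above is not included in the discussion; as a rough illustration of the pattern being described, a per-group forecast via groupby.apply might look like the sketch below. It uses plain pandas (Koalas exposes the same groupby.apply call on its own DataFrame), and the `series_id`/`ds`/`y` column names and the naive last-value "forecaster" are all assumptions standing in for the real regressor:

```python
import pandas as pd

def forecast_group(group: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in forecaster: predict the next value of one
    # series as its last observed value (placeholder for a real regressor).
    group = group.sort_values("ds")
    group["yhat"] = group["y"].iloc[-1]
    return group

# Assumed layout: one row per (series_id, timestamp) observation.
df = pd.DataFrame({
    "series_id": ["a", "a", "b", "b"],
    "ds": pd.to_datetime(["2021-01-01", "2021-01-02",
                          "2021-01-01", "2021-01-02"]),
    "y": [1.0, 2.0, 10.0, 20.0],
})

# In Koalas the equivalent call would be
# ks.DataFrame(df).groupby("series_id").apply(forecast_group),
# ideally with return-type hints on forecast_group so the output schema
# does not have to be inferred by sampling (an extra pass over the data).
result = df.groupby("series_id", group_keys=False).apply(forecast_group)
```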
Whenever I run this code, the number of processes increases sharply and my cluster's load stays above 100% (actually around 400%, though I don't think that by itself is a real problem) for a long period during which nothing seems to advance (the stages/jobs simply run forever without any progress, which I think is the actual problem). The load is only on the workers, which does make sense to me.
I've tried a couple of cluster configurations, but none of them changed the behavior:
- 10 workers, F16 type, 16 cores, 32g memory (I tested this one because, after the auto-scaling process, the load was not distributed across the new workers)
- 1 worker, F72s, 72 cores, 144g memory
I imagine the problem here is the Koalas DataFrame -> pandas DataFrame -> Koalas DataFrame conversion that groupby.apply performs, but I don't see another way to solve the problem (considering that I need to group by each time series and apply the forecaster only to that series, I could not use the other apply/transform methods described on the Transform and Apply a Function page).
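Conceptually, the conversion described above behaves like the pandas-only sketch below: each group is materialized as a pandas DataFrame, the Python function runs on it, and the results are reassembled. (This is illustrative only; the real Koalas implementation distributes the groups across Spark executors and serializes each one through Arrow, the same pattern PySpark's own `groupBy(...).applyInPandas(func, schema)` follows. The mean-based placeholder regressor and column names are assumptions.)

```python
import pandas as pd

def forecast_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder regressor: forecast each series with its own mean.
    pdf = pdf.copy()
    pdf["yhat"] = pdf["y"].mean()
    return pdf

df = pd.DataFrame({"series_id": ["a", "a", "b"], "y": [1.0, 3.0, 10.0]})

# What groupby.apply does in spirit: split the frame into one pandas
# DataFrame per group, run the Python function on each group, then
# reassemble a single frame from the per-group results.
pieces = [forecast_group(pdf) for _, pdf in df.groupby("series_id")]
result = pd.concat(pieces, ignore_index=True)
```

The per-group serialization is why the cost grows with the number of groups rather than just the number of rows.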
Has anyone had this kind of problem before? Is there another way to implement this? Any thoughts/help would be appreciated.