Hello,
I'm trying to use the groupby.apply method in a Time Series Forecasting application, but I'm facing a severe performance issue that seems to be related to the processes generated/CPU performance.
Basically, what I need to do is group by each one of my time series and apply the regressor to those groups, which is done by the code below. I'm already annotating the return types of the DataFrame my function creates, and I'm currently using the 'distributed-sequence' index, since this is a time series task and the index sequence is relevant.
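The snippet referenced above is not included in the discussion; as a rough illustration of the pattern being described, a per-group forecast via groupby.apply might look like the sketch below. It uses plain pandas (Koalas exposes the same groupby.apply call on its own DataFrame), and the `series_id`/`ds`/`y` column names and the naive last-value "forecaster" are all assumptions standing in for the real regressor:

```python
import pandas as pd

def forecast_group(group: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in forecaster: predict the next value of one
    # series as its last observed value (placeholder for a real regressor).
    group = group.sort_values("ds")
    group["yhat"] = group["y"].iloc[-1]
    return group

# Assumed layout: one row per (series_id, timestamp) observation.
df = pd.DataFrame({
    "series_id": ["a", "a", "b", "b"],
    "ds": pd.to_datetime(["2021-01-01", "2021-01-02",
                          "2021-01-01", "2021-01-02"]),
    "y": [1.0, 2.0, 10.0, 20.0],
})

# In Koalas the equivalent call would be
# ks.DataFrame(df).groupby("series_id").apply(forecast_group),
# ideally with return-type hints on forecast_group so the output schema
# does not have to be inferred by sampling (an extra pass over the data).
result = df.groupby("series_id", group_keys=False).apply(forecast_group)
```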
Whenever I run this code, the number of processes increases sharply and my cluster's load stays above 100% (actually around 400%, though I don't think that by itself is a real problem) for a long period during which nothing seems to advance (the stages/jobs simply run forever without any progress, which I think is the actual problem). The load is only on the workers, which does make sense to me.
I've tried a couple of cluster configurations, but none of them changed the behavior:
- 10 workers, F16 type, 16 cores, 32g memory (I tested this one because, after the auto-scaling process, the load was not distributed across the new workers)
- 1 worker, F72s, 72 cores, 144g memory
I imagine the problem here is the Koalas DataFrame -> pandas DataFrame -> Koalas DataFrame conversion that groupby.apply performs, but I don't see another way to solve the problem (considering that I need to group by each time series and apply the forecaster only to that series, I could not use the other apply/transform methods described on the Transform and Apply a Function page).
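Conceptually, the conversion described above behaves like the pandas-only sketch below: each group is materialized as a pandas DataFrame, the Python function runs on it, and the results are reassembled. (This is illustrative only; the real Koalas implementation distributes the groups across Spark executors and serializes each one through Arrow, the same pattern PySpark's own `groupBy(...).applyInPandas(func, schema)` follows. The mean-based placeholder regressor and column names are assumptions.)

```python
import pandas as pd

def forecast_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder regressor: forecast each series with its own mean.
    pdf = pdf.copy()
    pdf["yhat"] = pdf["y"].mean()
    return pdf

df = pd.DataFrame({"series_id": ["a", "a", "b"], "y": [1.0, 3.0, 10.0]})

# What groupby.apply does in spirit: split the frame into one pandas
# DataFrame per group, run the Python function on each group, then
# reassemble a single frame from the per-group results.
pieces = [forecast_group(pdf) for _, pdf in df.groupby("series_id")]
result = pd.concat(pieces, ignore_index=True)
```

The per-group serialization is why the cost grows with the number of groups rather than just the number of rows.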
Has anyone had this kind of problem before? Is there another way to implement this? Any thoughts/help would be appreciated.