UnionRDD $plus$plus taking a long time #28
Comments
Welcome to big data ;-) Spark requires that all data fit in memory across the cluster, so 16 GB per node may be undersized; how big is your data when exported? Also, the read and write operations often take the most time, and you can't always trust what the GUI says about which code is executing. The Spark GUI does its best to show what is slow, but it can only get down to the resolution of a closure; inside a closure there are many lines of code, some of which trigger lazy evaluation of many other things in the Spark DAG. The spike may be a bottleneck, but since it completes quickly and (I assume) with no errors, is that a problem? The long time at low load can often be improved with better partitioning, which is included in UR v0.5.0 and PIO-0.10.0, the Apache versions of things.

FYI, this repo is being deprecated since we merged our version of PIO with Apache PIO. The UR template has moved to https://github.com/actionml/universal-recommender
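To make the partitioning suggestion concrete, here is a minimal plain-Spark sketch (not code from the UR template): repartitioning after a union spreads the data so the downstream stage isn't left with a few oversized or skewed partitions. The helper name and partition count are illustrative only.

```scala
import org.apache.spark.rdd.RDD

// Illustrative helper, not part of the UR template: union two RDDs, then
// redistribute the result over a chosen number of partitions so downstream
// stages get reasonable parallelism. A common starting point is 2-3x the
// total executor cores; the right value depends on your data.
def unionRepartitioned[T](a: RDD[T], b: RDD[T], numPartitions: Int): RDD[T] =
  (a ++ b).repartition(numPartitions)

// usage (names are made up): val combined = unionRepartitioned(rddA, rddB, 24)
```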
Thanks 😄 I can't export the data right now, so I can't give a precise figure, but HBase has 3.1G used. The bottleneck here is really the long time at low load, probably due to the unoptimized partitioning. Is there a way to improve the partitioning without changing PIO's version? I tried the Apache version a while ago, but it was still too quirky.
I noticed that my driver doesn't have as much memory as the nodes themselves...
I increased the memory to 14GB (which it doesn't fully make use of), and the plateau now lasts for 1h. |
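For anyone tuning the same knobs: with a standard PredictionIO setup, driver and executor memory can be raised at training time by passing spark-submit arguments after `--`; the sizes below are placeholders to adjust to your own cluster, not recommendations.

```
pio train -- --driver-memory 14g --executor-memory 12g
```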
This is not a supported template; please leave questions at https://groups.google.com/forum/#!forum/actionml-user and use the template for PIO-0.10.0 here: https://github.com/actionml/universal-recommender

Why do you find the behavior wrong? I see nothing unusual about it without access to the GUI. Training often takes hours, but since the query and input servers continue to operate without interruption, this is normal behavior.
It's taking a very long time compared to everything else.
I have a lot of new items coming in all the time, and this stage in particular (Stage 35) gets stuck for 1h.
I'm using a Spark cluster with three 4-core nodes, each with around 16 GB of RAM, but it seems I can't speed up this particular stage (which is the bottleneck for the whole training).
The training CPU usage shows a long plateau at low load; that plateau is Stage 35, more precisely the $plus$plus at URModel.scala:86.
Any idea on how to get over this bottleneck?
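For context on the stage name: `$plus$plus` is the JVM-mangled name of RDD's `++` operator, i.e. `RDD.union`. Union itself does no shuffle; it just concatenates the partition lists of its inputs, so the unioned RDD inherits whatever partition counts and skew those inputs had. A self-contained sketch (with made-up stand-in RDDs, not the ones URModel actually unions) that shows this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionPartitionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("union-check").setMaster("local[*]"))

    // Stand-ins for the RDDs combined by ++ at URModel.scala:86.
    val left  = sc.parallelize(1 to 1000000, numSlices = 4)
    val right = sc.parallelize(1 to 1000000, numSlices = 8)

    val unioned = left ++ right   // $plus$plus == RDD.union

    // Union concatenates partitions: expect 4 + 8 = 12 here. If the real
    // inputs have few or very uneven partitions, a handful of tasks end up
    // doing all the work, which shows as a long plateau at low load.
    println(s"left=${left.partitions.length} right=${right.partitions.length} union=${unioned.partitions.length}")

    sc.stop()
  }
}
```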