
UnionRDD $plus$plus taking a long time #28

Open
mamoit opened this issue May 4, 2017 · 5 comments

@mamoit commented May 4, 2017

UnionRDD [64]
$plus$plus at URModel.scala:86

is taking a very long time compared to everything else.
I have a lot of new items coming in all the time, and this stage in particular (stage 35) gets stuck for 1h.
I'm using a Spark cluster with 3 nodes, 4 cores and around 16GB of RAM each, but I can't seem to speed up this stage in particular (which is the bottleneck for the whole training run).

The training CPU usage looks something like this:
[screenshot: CPU usage graph during training]
That plateau is stage 35, more precisely the $plus$plus at URModel.scala:86.
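
(For context: $plus$plus is just the JVM name of Scala's ++ operator, i.e. RDD.union, so this stage is a UnionRDD being materialized. A minimal, hypothetical sketch of that call shape, not the actual URModel.scala code:)

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionSketch {
  def main(args: Array[String]): Unit = {
    // Local context just for illustration; the real job runs on the cluster.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("union-sketch"))

    // Hypothetical stand-ins for the RDDs being combined; the real ones come
    // out of the UR's model-building pipeline, not from parallelize.
    val existing = sc.parallelize(Seq("item-1" -> 1.0, "item-2" -> 0.5))
    val fresh    = sc.parallelize(Seq("item-3" -> 0.8))

    // "++" is RDD.union: it builds a UnionRDD lazily and shows up in the
    // Spark UI as "$plus$plus at <file>:<line>".
    val combined = existing ++ fresh

    // Nothing is computed until an action runs; the time is charged to the
    // stage that materializes the union, which can include upstream work.
    println(combined.count())

    sc.stop()
  }
}
```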

Any idea on how to get over this bottleneck?

@pferrel (Owner) commented May 4, 2017

Welcome to big data ;-)

Spark requires that all data be in memory across the cluster, so 16GB per node may be undersized. How big is your data when exported?

Also, the read and write operations often take the most time, and you can't always trust what it says about the code being executed. The Spark GUI does its best to say what is slow, but it can only get to the resolution of a closure; inside, there are many lines of code, some of which trigger lazy evaluation of many other things in the Spark DAG.
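
As a minimal sketch of that lazy-evaluation point (hypothetical data and names, not UR code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("lazy-sketch"))

    val events = sc.parallelize(1 to 1000000)

    // These lines return immediately: they only extend the DAG.
    val heavy   = events.map(x => math.pow(x.toDouble, 2)) // stand-in for expensive upstream steps
    val unioned = heavy ++ sc.parallelize(Seq(0.0))

    // The cost of everything above is paid here, and the UI points at whichever
    // closure happened to trigger the evaluation, not necessarily the slow code.
    println(unioned.count())

    sc.stop()
  }
}
```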

The spike may be a bottleneck, but since it completes quickly and (I assume) without errors, is that a problem? The long time at low load can often be improved with better partitioning, which is included in UR v0.5.0 and PIO-0.10.0, the Apache versions of things.
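
As a rough, hypothetical illustration of the partitioning point (numbers are made up, not the UR or PIO code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("partition-sketch"))

    // A stage with too few partitions keeps most cores idle, which looks
    // exactly like a long, low-CPU plateau.
    val data = sc.parallelize(1 to 1000000, numSlices = 2)
    println(s"before: ${data.getNumPartitions} partitions")

    // 3 nodes x 4 cores = 12 cores; 2-4x that is a common starting point for
    // spark.default.parallelism or an explicit repartition before the heavy step.
    val spread = data.repartition(24)
    println(s"after: ${spread.getNumPartitions} partitions")

    println(spread.map(_.toLong * 2).sum())
    sc.stop()
  }
}
```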

FYI, this repo is being deprecated since we merged our version of PIO into Apache PIO. The UR template has moved to https://github.com/actionml/universal-recommender

@mamoit (Author) commented May 5, 2017

Thanks 😄

I can't export the data right now, so I can't give a precise figure, but HBase has 3.1G used.

The bottleneck here is really the long time at low load, probably due to unoptimized partitioning.
It has been getting worse with the number of items that can be recommended, not with the number of events in general. We had a big inventory spike, which resulted in this plateau.
The nodes' network is quite idle during that time, so it is not a problem of accessing HBase or ES.

Is there a way to improve the partitioning without changing PIO's version? I tried the Apache version a while ago, but it was still too quirky.

@mamoit (Author) commented May 5, 2017

I noticed that my driver doesn't have as much memory as the nodes themselves...
I'll try to increase it.

@mamoit (Author) commented May 9, 2017

I increased the memory to 14GB (which it doesn't fully make use of), and the plateau now lasts for 1h.
Way shorter, but still quite long.

@pferrel (Owner) commented May 9, 2017

This is not a supported template; please leave questions at https://groups.google.com/forum/#!forum/actionml-user

And use the template for PIO-0.10.0 here: https://github.com/actionml/universal-recommender

Why do you find the behavior wrong? I see nothing unusual about it without access to the GUI. Training often takes hours, but since the query and input servers continue to operate without interruption, this is normal behavior.
