Any chance to make ur have no requirement on single executor machine's memory? #37

WayneWang12 · 2017-09-27T05:20:35Z

I'm researching this model and it is really awesome for small companies like us.
I trained a model easily with 10 million trading orders. However, when I increase the number to 100 million, the model cannot be trained.

We actually have a cluster with about 1TB of memory in total, but this model is limited by the memory of a single machine. My cluster has 20 nodes with 64GB of memory each, which is obviously not enough for 100 million orders. I'm wondering whether the model could be changed so that it places no requirement on a single machine's memory. I think 1TB in total should be quite enough; the bottleneck is the memory of a single machine.

The driver is OK: I can find a temporary machine with 128GB or 256GB for a day. But I can't do the same for the executor machines, because they are permanent and I might have to upgrade all of them.

Or is there any way to make the executors run on high-memory machines?

pferrel · 2017-09-27T17:39:35Z

Spark needs to have all data in memory, spread across the cluster, and some data structures have to be copied to each machine. Therefore it is a memory hog. The good news is that if you use AWS, we can start large machines, train, and shut them down afterward. When using the UR, Spark is needed only during training. This means that a permanent Spark cluster is wasted most of the time unless you are sharing it, and sharing is not recommended if you want guaranteed model update times. So yes, memory is required; our solution is big but temporary machines in AWS.

BTW, please join the Google group for questions so others can benefit: https://groups.google.com/forum/#!forum/actionml-user
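The point about data structures that are copied to each machine can be illustrated with a rough back-of-the-envelope check. This is only a sketch: the function name and the data sizes (80GB replicated, 400GB partitioned) are hypothetical and were not measured from the UR, and it ignores JVM and Spark overhead. Only the cluster shape (20 nodes, 64GB each) comes from the thread.

```python
def fits_on_each_node(replicated_gb, partitioned_gb, nodes, node_memory_gb):
    """Rough check: can every node hold a full copy of the replicated data
    plus its even share of the partitioned data?

    Replicated data does not shrink as nodes are added, so it is bounded
    by the memory of a single node, not by the cluster total.
    """
    per_node_need_gb = replicated_gb + partitioned_gb / nodes
    return per_node_need_gb <= node_memory_gb


# Cluster from the thread: 20 nodes with 64GB each.
nodes, node_memory_gb = 20, 64
cluster_total_gb = nodes * node_memory_gb  # aggregate memory of the cluster

# Hypothetical sizes, chosen only for illustration.
replicated_gb, partitioned_gb = 80, 400

# The aggregate is far larger than the data, yet each node would need
# 80 + 400 / 20 = 100GB, which exceeds 64GB.
small_nodes_ok = fits_on_each_node(replicated_gb, partitioned_gb, nodes, node_memory_gb)

# The same job on temporary 128GB machines would fit under these assumptions.
big_nodes_ok = fits_on_each_node(replicated_gb, partitioned_gb, nodes, 128)
```

Under these assumed sizes, adding more 64GB nodes would not help, because the replicated part alone is larger than one node; only larger machines would.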

WayneWang12 · 2017-09-28T06:06:50Z

OK, I see. But why does the training stage "map at package.scala:96" take so long? The training has already been running for 16.4 hours and we have seen no result yet. It also uses only 12 cores, while I have more than 400.

Is there a way to make it quicker? I'll also post this to the group.
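One general Spark behavior that may be relevant to the core count: a stage runs one task per partition, and each task takes `spark.task.cpus` cores (1 by default), so a stage can never keep more cores busy than it has partitions. Whether that is the cause here is an assumption; the thread does not say how many partitions the stage has, and other settings can also cap core usage. The helper names below are hypothetical.

```python
import math


def max_concurrent_tasks(num_partitions, total_cores, cores_per_task=1):
    """Upper bound on tasks of one stage that can run at the same time.

    One task per partition; each task occupies `cores_per_task` cores.
    """
    task_slots = total_cores // cores_per_task
    return min(num_partitions, task_slots)


def scheduling_waves(num_partitions, total_cores, cores_per_task=1):
    """Number of rounds needed to run all tasks of the stage
    (assumes at least one partition and one task slot)."""
    concurrent = max_concurrent_tasks(num_partitions, total_cores, cores_per_task)
    return math.ceil(num_partitions / concurrent)


# If the stage had only 12 partitions, 400 cores could not speed it up.
busy_with_12_partitions = max_concurrent_tasks(12, 400)

# With more partitions than cores, every core can be used.
busy_with_1200_partitions = max_concurrent_tasks(1200, 400)
waves_with_1200_partitions = scheduling_waves(1200, 400)
```

If the partition count does turn out to be the limit, repartitioning the input or raising `spark.default.parallelism` are the usual Spark-level knobs, though how the UR exposes them is not covered in this thread.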

WayneWang12 changed the title ~~Any chance to make ur have no request on single machine memory?~~ Any chance to make ur have no requirement on single machine memory? Sep 27, 2017

WayneWang12 changed the title ~~Any chance to make ur have no requirement on single machine memory?~~ Any chance to make ur have no requirement on single executor machine's memory? Sep 27, 2017
