optimize writes in genomix data using nulls #57

jakebiesinger · 2013-11-08T22:43:04Z

This commit builds on #56 and carries it further into Node: only values that are non-null, and collections of length > 0, are saved into the output stream.

Outside of Node, the only visible difference is that there is no default value for averageCoverage (it's null by default, and trying to use it without setting it first will give an NPE).

Inside of Node, anywhere a get() is called, we initialize the requested value if it doesn't already exist. There's also a fair bit of extra code involved in checking for null (curse you, Java), especially compared with primitive types.

The first commit is kind of unrelated: it removes EnumSet.allOf() wherever it can be replaced by enum.values(), which has much less overhead.

jakebiesinger · 2013-11-12T18:37:17Z

Any comments on this one? It's been open for 4 days now o_O

JavierJia · 2013-11-12T18:50:50Z

genomix/genomix-data/src/main/java/edu/uci/ics/genomix/type/Node.java

+        if (n.internalKmer != null && n.internalKmer.getKmerLetterLength() > 0) {
+            n.internalKmer.write(out);
+        }
+        if (n.averageCoverage != null) {


Should this write() block be consistent with setByCopy(), in terms of n.getActiveFields & NODE_FIELDS.XXXX?

Yes, they should be consistent. n.getActiveFields writes a single byte whose bits are defined in NODE_FIELDS and are 0 if the particular values are null or size() == 0. The setByCopy() function might be a little more permissive in that if size() == 0, it will create a new list of length 0 rather than being null. Shouldn't affect behavior but I could change it to be more consistent if we want.

I'd like to make it more consistent. Because there are several fields to write and read, the order becomes critical. If the two stay consistent, it will be easier to change or add fields in the future.

Sorry, maybe I should be more clear:

setAsCopy(byte[] data, int offset) reads the very first byte from the byte[] just as write() writes it and readFields() reads it. There is no difference there and they should have the exact same effect. They should be completely consistent with NODE_FIELDS and getActiveFields().

setAsReference(byte[] data, int offset) is almost an exact copy of that function, with the exception that VKmer instances (in the EdgeMaps and the internalKmer) will be references rather than copies; everything else must be a copy.

setAsCopy(Node other) is what I meant when I said they're not consistent-- for that function, when other has a list whose size() == 0, this.<list> could either be null or of size() == 0 as well. Right now, it's size() == 0, just like the Node that's being copied. But this really shouldn't affect anything practically.
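
For readers following this thread, the write()/readFields() layout discussed above can be sketched roughly as below. This is only an illustration, not the code in Node.java: the class name, the flag constants, and the String/List stand-ins for VKmer and EdgeMap are all hypothetical. It assumes a single leading byte of flags, followed by only the fields whose bit is set, read back in the same order they were written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SparseNodeSketch {
    // Hypothetical flag bits standing in for NODE_FIELDS.
    static final byte HAS_KMER = 1 << 0;
    static final byte HAS_COVERAGE = 1 << 1;
    static final byte HAS_EDGES = 1 << 2;

    String internalKmer;     // stand-in for VKmer
    Float averageCoverage;   // null until explicitly set
    List<Integer> edges;     // stand-in for an edge collection

    // A bit is set only if the field is non-null and, for collections, non-empty.
    byte getActiveFields() {
        byte flags = 0;
        if (internalKmer != null && internalKmer.length() > 0) flags |= HAS_KMER;
        if (averageCoverage != null) flags |= HAS_COVERAGE;
        if (edges != null && !edges.isEmpty()) flags |= HAS_EDGES;
        return flags;
    }

    // Writes the flag byte first, then only the active fields, in a fixed order.
    void write(DataOutput out) throws IOException {
        byte flags = getActiveFields();
        out.writeByte(flags);
        if ((flags & HAS_KMER) != 0) out.writeUTF(internalKmer);
        if ((flags & HAS_COVERAGE) != 0) out.writeFloat(averageCoverage);
        if ((flags & HAS_EDGES) != 0) {
            out.writeInt(edges.size());
            for (int e : edges) out.writeInt(e);
        }
    }

    // Must mirror write(): same flag byte, same field order.
    void readFields(DataInput in) throws IOException {
        internalKmer = null;
        averageCoverage = null;
        edges = null;
        byte flags = in.readByte();
        if ((flags & HAS_KMER) != 0) internalKmer = in.readUTF();
        if ((flags & HAS_COVERAGE) != 0) averageCoverage = in.readFloat();
        if ((flags & HAS_EDGES) != 0) {
            int n = in.readInt();
            edges = new ArrayList<>(n);
            for (int i = 0; i < n; i++) edges.add(in.readInt());
        }
    }

    static byte[] serialize(SparseNodeSketch node) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            node.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static SparseNodeSketch deserialize(byte[] data) {
        try {
            SparseNodeSketch node = new SparseNodeSketch();
            node.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return node;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a list with size() == 0 is not written at all, so it comes back as null after a round trip. That is the same null-versus-empty distinction raised for setAsCopy(Node other), and it should not matter as long as getters never hand a null back to callers.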

JavierJia · 2013-11-12T19:44:14Z

LGTM 👍

jakebiesinger · 2013-11-12T19:46:15Z

@anbangx the test cases in travis are currently failing because the pregelix input files are out of date. Sure would be nice to have that automatic generation tool you've been working on 😄.

jakebiesinger · 2013-11-12T19:48:48Z

BTW we should have better performance after this is merged in, thanks to dc90e19. According to #20, the total time spent creating and iterating over those EnumSets was 10% of the total runtime (!)

Nan-Zhang · 2013-11-12T22:47:44Z

So members always stay null until we need them. I think I can modify nodeTest and add more 'null' cases to test some functions.
Why does enum.values() have much less overhead compared to EnumSet.allOf()?

jakebiesinger · 2013-11-12T22:59:50Z

@Nan-Zhang yes, we could return a null in some cases instead of creating the object on the fly. The problem with that approach is when you want to iterate, you have to check for null values:

if (node.getEdgeMap(FF) != null)
    for (EdgeMap m : node.getEdgeMap(FF))
         // do something with m

which makes the interface ugly. So without changing the interface at all (we never return null, as we never did before), we can just generate the values on the fly when requested. Thus, algorithms that don't use certain values won't have them written.
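
A minimal sketch of that "generate on request" idea, under stated assumptions: the class, field, and method names here are hypothetical, and a plain List<String> stands in for an EdgeMap. The getter never returns null, while a separate check mirrors what the write side would look at.

```java
import java.util.ArrayList;
import java.util.List;

class LazyNodeSketch {
    // Hypothetical field; stays null until someone asks for it.
    private List<String> forwardEdges;

    // Never returns null: the list is created on the first request.
    List<String> getForwardEdges() {
        if (forwardEdges == null) {
            forwardEdges = new ArrayList<>();
        }
        return forwardEdges;
    }

    // Write-side check: a null or empty list means there is nothing to write.
    boolean hasForwardEdges() {
        return forwardEdges != null && !forwardEdges.isEmpty();
    }
}
```

Callers can then loop with `for (String e : node.getForwardEdges())` and no null check. A getter call on an unused field does allocate an empty list, but because the write-side check also treats empty as absent, nothing extra ends up in the stream.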

jakebiesinger · 2013-11-12T23:01:17Z

Why does enum.values() have much less overhead compared to EnumSet.allOf()?

.values() just returns a copy of a static final array. EnumSet.allOf() has to create a new object and inspect the values of the enum.
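
To make the comparison concrete, here is a small sketch of the two iteration styles. The enum and its constants other than FF are hypothetical and not taken from the genomix code. Holding the array in a static final field is an extra step assumed here for hot loops; it is not necessarily what dc90e19 does.

```java
import java.util.EnumSet;

class EnumIterationSketch {
    // Hypothetical enum standing in for the real one.
    enum Dir { FF, FR, RF, RR }

    // Fetched once and reused by the array loop below.
    private static final Dir[] ALL_DIRS = Dir.values();

    // Builds a new EnumSet and an iterator on every call.
    static int countWithEnumSet() {
        int n = 0;
        for (Dir d : EnumSet.allOf(Dir.class)) {
            n++;
        }
        return n;
    }

    // Plain array loop over the cached constants.
    static int countWithValues() {
        int n = 0;
        for (Dir d : ALL_DIRS) {
            n++;
        }
        return n;
    }
}
```

Both methods visit the same constants in declaration order; the difference is only in how much work and allocation happens per call.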

Nan-Zhang · 2013-11-13T00:34:42Z

LGTM


jakebiesinger added 4 commits November 8, 2013 14:31

use <enum>.values() instead of EnumSet.allOf()

dc90e19

Node uses null values and writes less to stream

ba2d6fa

udpate hyracks test cases to use null's

c2ff5eb

set coverage in split repeat to 1/2 the node's coverage

9eabec7

JavierJia reviewed Nov 12, 2013

fix for null value in GraphStatistics

b29e168

jakebiesinger added a commit that referenced this pull request Nov 14, 2013

Merge pull request #57 from uci-cbcl/wbiesing/optimize-writes-in-geno…

4b32544

…mix-data-using-nulls optimize writes in genomix data using nulls

jakebiesinger merged commit 4b32544 into wbiesing/pregelix-messages-have-null-values Nov 14, 2013

jakebiesinger deleted the wbiesing/optimize-writes-in-genomix-data-using-nulls branch November 14, 2013 00:29
