
dfs-datastores should use Hadoop's serialization mechanism #3

Open
sritchie opened this issue Dec 29, 2011 · 4 comments

@sritchie
Collaborator

Rather than requiring custom serialization methods in the PailStructure, dfs-datastores should provide a way to hook into Hadoop's io.serializations system. It's much easier to write pail structures that implement Serializable if they don't have to carry around a serializer. This also allows for mixed-type pails.
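For reference, serializing through Hadoop's pluggable mechanism looks roughly like the sketch below -- this is plain Hadoop API, not dfs-datastores code, and the helper class and method names are made up for illustration:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;

public class IoSerializationsSketch {
    // Serialize obj using whatever Serialization registered in io.serializations
    // accepts its class, instead of a serialize() hard-wired into PailStructure.
    public static <T> byte[] toBytes(Configuration conf, Class<T> klass, T obj)
            throws IOException {
        SerializationFactory factory = new SerializationFactory(conf);
        Serializer<T> serializer = factory.getSerializer(klass);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.open(out);
        serializer.serialize(obj);
        serializer.close();
        return out.toByteArray();
    }
}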

@nathanmarz
Owner

I don't agree with this. The advantage of tying serialization to Pails is that they enforce the types within them. When I read data from a pail, I'm confident in exactly what I'll be getting out (this is also why mixed-type pails are a non-goal).

Additionally, pail should be usable outside of a M/R context, which means it must know its own serializers.

The problem we have now is that to use a pail within a M/R job, we need to implement the serializer for the pail as well as for io.serializations. Rather than have pail use io.serializations, I think it makes more sense to have M/R jobs that use pail automatically add pail's serializers into io.serializations.
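A rough sketch of that last suggestion -- a job-setup hook that appends the pail's serializer to io.serializations before the job is submitted (the helper class and method here are hypothetical, not existing dfs-datastores API):

import org.apache.hadoop.mapred.JobConf;

public class PailJobSetup {
    // Append serializationClass to io.serializations if it isn't already listed,
    // so the M/R framework can (de)serialize the pail's records.
    public static void addSerialization(JobConf conf, String serializationClass) {
        String existing = conf.get("io.serializations",
                "org.apache.hadoop.io.serializer.WritableSerialization");
        if (!existing.contains(serializationClass)) {
            conf.set("io.serializations", existing + "," + serializationClass);
        }
    }
}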

@sritchie
Collaborator Author

I think that makes a lot of sense. What I meant is that it would be advantageous if Pail could serialize objects using Hadoop's Serialization interface, rather than specifying serialize and deserialize in PailStructure. PailStructure could replace serialize and deserialize with getSerializations; the pail would then enforce the order of the supplied serializations. We could also include a method to prepare the JobConf by appending these to io.serializations, as you suggested.

I agree that a huge benefit of Pail is the ability to enforce the internal structure of its data, but I'm not convinced that type is the way to do this. For example, if we replace hfs-seqfile with a Clojure data pail, we'll probably specify the type as clojure.lang.PersistentVector, or even just java.util.List, so that we can store tuples with multiple fields. This ends up enforcing nothing about the internal data.

Additionally, Pail doesn't use the type information to actually enforce anything. (You can dump strings into an Integer pail, for example.) The type's just a check between two pails before an absorb or copy. This isn't good enough for safety, since two PailStructures with the same name and type but different getTarget implementations will corrupt each other upon combination.

What do you think of:

public interface PailStructure<T> extends Serializable {
    public boolean isValidTarget(String... dirs);
    public String getSerialization();
    public List<String> getTarget(T object);
    public String structureID();
}

structureID would return a user-specified identifier used to compare two pails. This would allow pails to be dynamically generated while still enforcing their structure on copy or absorb calls.

pail.meta would then contain:

--- 
format: SequenceFile
serialization: cascading.kryo.KryoSerialization
structure_id: forma.hadoop.pail.DataChunkPailStructure
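To make that concrete, here is a hypothetical implementation of the proposed interface that would produce the pail.meta above -- DataChunk stands in for whatever record type the pail holds, and none of this is existing dfs-datastores code:

import java.util.Collections;
import java.util.List;

public class DataChunkPailStructure implements PailStructure<DataChunk> {
    public boolean isValidTarget(String... dirs) {
        // A real structure would check dirs against its partitioning scheme.
        return true;
    }

    public String getSerialization() {
        return "cascading.kryo.KryoSerialization";
    }

    public List<String> getTarget(DataChunk object) {
        // No vertical partitioning in this sketch.
        return Collections.emptyList();
    }

    public String structureID() {
        return "forma.hadoop.pail.DataChunkPailStructure";
    }
}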

@sritchie
Collaborator Author

sritchie commented Jan 7, 2012

Update -- we're thinking:

public interface PailStructure<T> extends Serializable {
    public boolean isValidTarget(String... dirs);
    public String getSerializations();
    public List<String> getTarget(T object);
    public String getType();
    public Map<String, String> getMetadata();
}
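Relative to the earlier DataChunkPailStructure sketch, the new methods would be getType and getMetadata, roughly like this (the return values are purely illustrative):

    public String getType() {
        // Fully-qualified class name of the record type, recorded in pail.meta.
        return DataChunk.class.getName();
    }

    public Map<String, String> getMetadata() {
        // Structure-specific key/value pairs written into pail.meta alongside
        // format and serialization.
        return Collections.emptyMap();
    }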

@sritchie
Collaborator Author

Pretty clear that Bijection would be a good thing here.

jschwellach added a commit to jschwellach/dfs-datastores that referenced this issue Jan 9, 2017
new version 2.0.0 with Spark support and improved file handling for AWS S3 file systems