Mapping CDXJ <-> NDJSON #3
Although this is an application-specific issue and has nothing to do with the format and its semantics, we can still discuss it here, which might help us shape the format better. I would say that if the knowledge of the key fields is present as application out-of-band context, then there is no issue in transforming them back and forth. Here are a few points in this regard:
So, the above example can very well be written as: @keys ["urlkey", "timestamp"]
com,google)/ 20150125034709 {"urlkey": "com,google)/", "timestamp": "20150125034709", "url": "http://www.google.com/", "length": "7674", "filename": "common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422118059355.87/warc/CC-MAIN-20150124164739-00123-ip-10-180-212-252.ec2.internal.warc.gz", "digest": "S3K4ZKZALJ4DB4RL2IQ6D233IW7XXLVO", "offset": "725344446"}
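To illustrate the out-of-band key idea above, here is a minimal Python sketch (the function name and the shortened sample line are my own, not from any CDXJ tool) that splits a CDXJ line into its prefix and JSON block and merges the declared key fields back in:

```python
import json

def parse_cdxj_line(line, key_fields):
    # Split the sortable plain-text prefix from the JSON block,
    # then merge the prefix values back in under the declared fields.
    prefix, json_block = line.split("{", 1)
    record = json.loads("{" + json_block)
    for field, value in zip(key_fields, prefix.split()):
        record[field] = value
    return record

line = ('com,google)/ 20150125034709 '
        '{"url": "http://www.google.com/", "length": "7674"}')
record = parse_cdxj_line(line, ["urlkey", "timestamp"])
```

With this, the `@keys ["urlkey", "timestamp"]` meta line is all the context a reader needs to recover the full dict.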
Yes, you're right that there is no requirement to remove those entries from the JSON dict, but it would very likely be desirable for any application, to save space. And you're right that it's very application specific. For example, given
and so on... Hmm, this makes me think that there should be an enforcement of no duplication, e.g. if "timestamp" is used as a key it can't be a value, since JSON does not allow duplicate keys in a dict...
Yeah, the more I think about it, the more it makes sense to think of this format as a transformation on NDJSON. To that effect, backwards compatibility with NDJSON may be desirable, as much as is possible. Instead of
This way, a regular NDJSON parser can read the first two lines. This data has a clear and unambiguous equivalent in NDJSON form,
Original CDX already has a precedent of using a space to indicate a format header, so this is not too surprising.
I would strongly oppose using leading spaces to signify the special meta blocks, as boundary spaces are prone to damage and should not be relied upon. Also, making just the meta portion NDJSON compatible does not buy us anything, because the rest of the document will not be compatible anyway. Also, having an empty object as a separator is not useful, for two reasons: 1) we don't really need a separating line if we have a way to identify meta lines using the currently used
And here are the lines from the example it refers to: @meta {"name": "Internet Archive", "year": 1996}
@meta {"updated_at": "2015-09-03T13:27:52Z"}
I am thinking of ways to reduce the spec to the very minimum, separating data from metadata. How about just
Also, I think that the JSON value data should only be a dict {}. If the value is a list, then it breaks the NDJSON equivalency I mentioned above, which I'd like to maintain. A JSON list can be represented as a single value in a dict anyway. Alternatively, perhaps any field that starts with
That way, you can have support for
This is exactly what is intended and described in the blog post. ORS reserves
I don't see a compelling reason why we should struggle to keep some sort of NDJSON equivalency. NDJSON does not have a metadata provision, and our data portion is way more flexible than NDJSON's.
I think the main compelling reason for this format (and why someone would use this over NDJSON) is the prefix sorting capability. If the sorting is not needed, it is much better to use an existing format like NDJSON. There are several existing tools that work with JSON data, including NDJSON. jq in particular (https://stedolan.github.io/jq/) is well established and provides various unix-like tools for JSON data, including newline-delimited data. If anyone is going to do any custom processing with CDXJ, the easiest solution is to convert it back to NDJSON and then pass it to an existing tool. The metadata fields can be filtered out as needed by such tools as well.
Sorting is not the only compelling reason; filtering, grouping, and distributing data (such as with MapReduce) are some other examples of what can be done more efficiently in ORS than by making each line a valid JSON object. Pushing the prefix keys back inside the JSON block kills the purpose of this format. NDJSON may be good and "well" supported as the authors advertise it, but it has its limitations where it just can't be used.

On the other hand, there are many tools that use ORS-like formats both for generation and consumption. Logentries has something called KVP (key-value pair) that is very similar, except it does not enforce the single-line aspect (but it will happily parse the tighter version). Fluentd, for example, collects logs from various sources and consolidates them by default in an ORS-like format (strict single-line entries), but can be configured for other formats as well. It can then send that data to many other tools for visualization, event notification, and other log analysis activities. I have already mentioned Docker logs, which are ORS-like by default. I don't think there is a pressing need to look for opportunistic NDJSON similarity; in that case, why would we put effort into standardizing yet another format?
It is a good tool, but the performance is not free. It may be a good tool to search in a small file, but when dealing with data streams at scale, tools like this will fail to perform well. Essentially, they will require loading the whole file (at once or in stream mode) in order to perform the lookup for every single query. You are missing the point of CDXJ, for example, where we want the lookup keys to be placed outside the JSON block so that we can perform plain-text processing instead of parsing and loading individual objects, which would be a performance nightmare.
I don't think converting back to NDJSON will be the easiest solution. If the values used in the lookup prefixes are desired to be present in the object when loaded, then keep them in the JSON object and surface a copy of their values in the prefix. Now use the prefixes for lookup and pass the JSON value block to a widely supported, natural JSON parser (no need to bring NDJSON into the mix). However, if the tool knows what the key prefix fields are, then it can use those values without requiring the duplicate data in the value object. Alternatively, the key values can be injected (merged) into the object created by parsing the value block, as opposed to injecting the key fields back into the marshaled JSON and then parsing it.
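A minimal sketch of the "keep them in the JSON object and surface a copy in the prefix" option described above (the function name and the fixed key list are hypothetical, standing in for an out-of-band key definition):

```python
import json

KEY_FIELDS = ["urlkey", "timestamp"]  # assumed out-of-band key definition

def write_cdxj_line(record):
    # Surface a copy of the key values in the sortable prefix while
    # keeping them inside the JSON block, so any plain JSON parser can
    # read the value without knowing about the prefix convention.
    prefix = " ".join(record[f] for f in KEY_FIELDS)
    return prefix + " " + json.dumps(record, sort_keys=True)

line = write_cdxj_line({"urlkey": "com,google)/",
                        "timestamp": "20150125034709",
                        "url": "http://www.google.com/"})
```

The cost is the duplicated key data on each line; the benefit is that the value block parses on its own with no merging step.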
I think perhaps this is where we disagree... My original intention for using this format is to support sorted, line-oriented data (CDX) with a variable number of fields. Here's how my use case works: when writing out in this format, the internal representation is a stream of JSON dicts (NDJSON), but the prefix is pulled out and written first, for sorting. Same for reading: the prefix is read, then the JSON dict, and the prefix is merged back in. (I'm using a specific prefix, but this can be generalized for generic NDJSON->CDXJ conversion.) This is a hybrid format specifically optimized for sorting; using it for anything else will likely be error prone, because multiple formats are mixed on the same line (space-delimited keys and a JSON value). I certainly do not have interest (or time) to create custom processing tools for this format! Instead, I would focus on creating well-defined conversions to existing formats:
If fast tabular processing is needed, you can convert to a CSV-like format with specific columns and then pass it to other tools (though this will not be "lossless", as CDXJ is not tabular, so some data may be dropped). If JSON structure parsing is needed on the value, then converting to a full JSON object and passing it to jq or other JSON processing tools is the best approach. If key-value processing for Hadoop is needed, then converting to an existing format designed for this, such as the MRJob format, is the best approach.

The conversion options could also specify which value fields should be used, or whether just the key, just the value, or both should be used. The conversion tools should be made as flexible as possible to address all the use cases and convert to any existing format for additional processing. This is a domain-specific format designed to solve a specific problem: sorting and merging line-oriented data with a varied number of fields, while other formats are better suited for other use cases and have much better tooling. Thus, we should make the conversion process easy and well defined. When sorting line-oriented data, it is sometimes useful to ensure certain lines are always sorted first, and that is the reason for specifying that such lines should start with

As for Docker, Fluentd, etc., as mentioned before, I do not know their use cases, and just because this format happens to be a superset of their log formats is not a compelling enough reason for a new format. Unless those communities are involved in building another standard format, and tools around that format, I would be very cautious of putting any weight on this argument. If the primary use case is merging and sorting line-oriented log files, then that is already covered. :)
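The NDJSON->CDXJ direction sketched above (pull the key fields out of each dict and write them first, for sorting) could look roughly like this; `to_cdxj` and the field list are illustrative assumptions, not an existing tool:

```python
import json

def to_cdxj(ndjson_lines, key_fields):
    # Pull the key fields out of each JSON dict and emit them as a
    # plain-text prefix, so the result sorts with plain unix sort.
    for line in ndjson_lines:
        record = json.loads(line)
        prefix = " ".join(str(record.pop(f)) for f in key_fields)
        yield prefix + " " + json.dumps(record, sort_keys=True)

lines = ['{"urlkey": "com,google)/", "timestamp": "20150125034709", '
         '"url": "http://www.google.com/"}']
cdxj = sorted(to_cdxj(lines, ["urlkey", "timestamp"]))
```

Reading is the mirror image: split off the prefix, parse the JSON dict, and merge the prefix fields back in.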
If you disagree, please provide some examples, not related to sorting, where processing CDXJ directly has any advantage over converting to existing format :) |
What about something like this for an NDJSON representation? {"@keys": ["surt", "timestamp"]}
{"_key": ["com,google)/", "20150125034709"], "timestamp": "20150125034709", "url": "http://www.google.com/", "length": "7674", "filename": "file1.warc.gz", "digest": "S3K4ZKZALJ4DB4RL2IQ6D233IW7XXLVO", "offset": "725344446"}

The requirement is that every data line starts with a well-defined JSON key like '_key', which is not allowed as a JSON key for the rest of the data. The value pointed to by '_key' is an array corresponding to the definition in the metadata line. This will be sortable with unix tools and readable by NDJSON tools. As the reserved JSON key '_key' starts with an underscore, it is sorted after the metadata lines. One more requirement is that every line must follow the same pattern for whitespace, otherwise sorting will be broken.
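A minimal sketch of this '_key' transformation (the function name is hypothetical). Since '_' (0x5F) sorts after '@' (0x40) in ASCII, the {"@keys": ...} metadata line stays first when the output lines are sorted:

```python
import json

def cdxj_to_ndjson(line):
    # Move the space-delimited prefix into a reserved '_key' entry,
    # inserted first so serialization keeps it at the front of the line
    # (dicts preserve insertion order in Python 3.7+).
    prefix, json_block = line.split("{", 1)
    record = {"_key": prefix.split()}
    record.update(json.loads("{" + json_block))
    return json.dumps(record)

nd = cdxj_to_ndjson('com,google)/ 20150125034709 '
                    '{"url": "http://www.google.com/"}')
```

Note this only guarantees consistent whitespace and field order when every line goes through the same serializer, which is exactly the caveat raised above.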
Interesting, yeah, I think this could make a lot of sense. I am less concerned about sorting order in the NDJSON representation than about an unambiguous mapping to allow easy, well-defined conversion. Unfortunately, there's not really a way to guarantee field order or spacing consistency across different JSON serializers. But that is not essential, as sorting should be done before converting to NDJSON, or, if needed, by converting back to CDXJ after filtering. This provides such an unambiguous mapping with the
Since there are already parsers for newline-delimited JSON, it may be useful to map CDXJ to this format and vice versa. It would be useful to have a specific one-to-one mapping from CDXJ to JSON Lines.
For example, the pywb cdx server already supports `output=json` (and soon `output=cdxj`), which can return the same data in JSON.

A CDXJ line looks like this:

while an equivalent JSON line currently looks like this:

This can be done with the following restrictions:

- `urlkey` is added to the JSON dict, and `urlkey` can not otherwise be used in the JSON dictionary in CDXJ
- `@context`, `@id`, `@keys` are added to the JSON the same way, and can not be used in the JSON dict of CDXJ

Perhaps instead of `urlkey`, it should be `@urlkey` or `@key` or some other clearly defined name... This would then allow for unambiguously converting from JSON back to CDXJ, if needed.
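A hedged sketch of the JSON -> CDXJ direction under these restrictions (assuming only `urlkey` is pulled out as the prefix; the helper name is made up for illustration):

```python
import json

RESERVED = ("@context", "@id", "@keys")

def json_to_cdxj(json_line):
    # Pull the reserved 'urlkey' field out of the dict and write it as
    # the plain-text sort prefix; the other reserved '@' names must not
    # appear in the remaining dict, per the restrictions above.
    record = json.loads(json_line)
    urlkey = record.pop("urlkey")
    assert not any(name in record for name in RESERVED)
    return urlkey + " " + json.dumps(record, sort_keys=True)

cdxj = json_to_cdxj('{"urlkey": "com,google)/", '
                    '"url": "http://www.google.com/"}')
```

Because `urlkey` is removed from the dict before serialization, the round trip back to JSON is unambiguous: re-adding it recreates the original object exactly.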