Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define how localization tagging works #5

Open
jeremydw opened this issue Dec 16, 2020 · 18 comments
Open

Define how localization tagging works #5

jeremydw opened this issue Dec 16, 2020 · 18 comments
Milestone

Comments

@jeremydw
Copy link
Member

jeremydw commented Dec 16, 2020

We are abandoning Grow.dev's automatic localization untagging in order to limit memory usage and improve performance. After all, most content isn't localized, so it doesn't make sense to penalize all templates if content isn't localized to begin with.

We want to support the feature but require developers to use it explicitly rather than automatically.

Some thoughts:

Option 1 - Explicit !a.Localized YAML tag.

# In YAML
foo: !a.Localized
  default: Default content
  de_DE: de_DE content

# In templates (add "L" filter consistent with "T" translation filter)
{{doc.foo|l}}  -> Returns "Default content" or "de_DE content" depending on the doc's locale

Option 2 - Preserve the same syntax as Grow.dev

# In YAML
foo: Default content
foo@de_DE: de_DE content

# In templates:
{{doc|l('foo')}} -> Returns "default content" or "de_DE content" depending on the doc's locale
{{l('doc.foo')}} -> Same as above

Option 3 - Something else I'm not thinking of?

I have a preference for Option 1 because the usage is clearer in templates and having the a.Localized object likely makes it more scalable in the future in case we need to add more features to that.

# For nested structures
{{doc.foo|l().bar}}
@Zoramite
Copy link
Member

Is this specifically for translation, or are you going to also use it for general overridden data in the document?

@jeremydw
Copy link
Member Author

Proposing not using this for translation at all -- just localized content. (For example, if we want to show/hide modules on a per-locale basis.) An easy to understand example is with localized YouTube videos (i.e. same module, different YouTube video IDs on a per-locale basis).

Translations would be handled with a similar tag to gettext, e.g.:

{{doc.foo|t}} -> Pulls from `/content/locales/<lang>.yaml

@Zoramite
Copy link
Member

Zoramite commented Dec 16, 2020

Ok, so it becomes an explicit localization, the template has to know that the content can be localized before it can get the localized version?

In grow it is implicit since the data just changes for the template and it doesn't 'know' about the other possible data unless you manually load the yaml with a different locale.

This seems like it would cause some extra code turn when you decide to add localized content. For example, if you decide to localize the videos, you would update the yaml to use the constructor, but then you would have to also update any template code that uses the video information to make it manually call the l filter.

@stevenle
Copy link
Member

Agree with @Zoramite, having to add |t or |l or whatever everywhere is cumbersome. Is untagging a yaml really that slow? You're just walking the JSON tree right? It doesn't add memory pressure unless you're caching that somewhere, just the CPU cost of traversing through the values of the tree.

@jeremydw
Copy link
Member Author

  • Yes, this proposal creates code turn when adding localized content for the first time -- but the proposed usage is more or less identical to translations -- where we already explicitly say that we need a translation via the {{_(...)}} or {{...|t}} tag.
  • The reason why the memory increases is because when it creates each document's fields object, it has to store an untagged version of the YAML file in memory for that document. So one document ends up taking "document * X locales" memory.

@Zoramite
Copy link
Member

Zoramite commented Dec 17, 2020

Here is a thought:

  • Parse the yaml creating a base object with the Localized and Translation objects part of the structure.
  • When you need to use the data, you process the tree for a given locale. The Document then holds onto this localized version and it is cached with the Document. (Or you could cache it separately if desired)
  • Get rid of the |t and |l filters completely.
    • The reason we use the |t is to get the translated value, but if the process of localizing from the base yaml already gets the correct value for the translation, you don't need a filter to do it.
    • The reason we use the |l is to get the localized value, but if the process of localizing from the base yaml already gets the correct value for the localization, there is no need for a filter.
    • The other use of the _() for is to do string replacing, which could be still be done using a filter or tag when needed.
  • On memory usage, it would use more memory, but it would make working with templates a lot easier since you no longer have to care if the value needs to be localized or if it is a translation, you just output the value. And I think that the memory usage for storing an object would not be very large even if you are storing thousands of them. It would be nice to have a memory monitoring in the build process to do 'snapshots' of the memory while in use to be able to watch the memory usage over time.

@Zoramite
Copy link
Member

For the mechanics, I would suggest a toLocale(locale) method on the classes that can be localized, then walking the json is just looking for a toLocale method on the object and calling it or return the object itself if it doesn't exist.

@jeremydw
Copy link
Member Author

I'm all for not having the developer specify a {{..|l}} value (or a {{...|t}} value if we can get away with it). But I am paranoid about memory usage as the untagging was both a performance hit and a memory hit in Grow.dev. If we can implement this in a clean way without impacting performance or memory I'm all for it. :)

By the way, we should have a way to access the untranslated/unlocalized values too. In Grow.dev, developers couldn't really access the localized values of other locales within a document, without getting the entire other document. In my previous proposal, it'd be possible for a template to access the other locales without fetching the other document entirely.

Example use case for translations:

<button data-analytics-label="{{button.label}}">{{button.label|t}}</button>

In the above example we want the data attribute to be in English but the visible label to be in the translated language.

@jeremydw
Copy link
Member Author

(By the way, not blaming anyone other than myself for the performance/memory hit in Grow. That was a design that I spearheaded and I wrote the original untagging... so it was a problem I created for us.) :)

@Zoramite
Copy link
Member

One of the things I was going to test in Grow was having the translation be an object instead of just the translated string during untagging. So that the __str__ would return the localized version using the doc's current locale but you could still access the original string, etc.

We could try something similar here where the toLocale() would return a new instance of the same class but stores the locale. In Nunchucks, I believe it uses toString() for displaying objects. For example: to do {{label}} it would call label.toString() if it is an object and exists. Then the toString() can use the this.locale to show the correct value. But you could still use the other methods on the object, like label.default or label.toLocale('es'), etc.

@stevenle
Copy link
Member

We could try something similar here where the toLocale() would return a new instance of the same class but stores the locale. In Nunchucks, I believe it uses toString() for displaying objects. For example: to do {{label}} it would call label.toString() if it is an object and exists. Then the toString() can use the this.locale to show the correct value. But you could still use the other methods on the object, like label.default or label.toLocale('es'), etc.

This is a pretty clever solution if !a.Localized has access to the document's locale.

(By the way, not blaming anyone other than myself for the performance/memory hit in Grow. That was a design that I spearheaded and I wrote the original untagging... so it was a problem I created for us.) :)

Re: memory concerns, most memory issues are caused by keeping objects around longer than they're needed. Grow and Ama both seem to be aggressively caching YAML files for performance reasons, but I wonder if it's worth looking into turning that off in server mode (which avoids us having to have a file watcher to check for YAML changes). An alternative would be to use some caching object that monitors its own memory usage and evicts cache keys by LRU.

@jeremydw
Copy link
Member Author

(Wrote this while Steven was replying...)

Yeah; it's worth testing. One complication/challenge I've found is that the locale is only determined from the context...

In other words, for our discussion, the toString would need to know which locale to convert to. The |t filter does that (because Nunjucks/Jinja filters can provide the context and the document and therefore the locale to the toString method.

I wasn't able to think of a way that the locale could both be auto-determined without access to the template rendering context's locale, as well as without having to create a copy of the YAML file for each locale that the Doc is in. But if there's a way that doesn't impact memory usage let's do it.

@jeremydw
Copy link
Member Author

Regarding the cache: I found that Node's performance with parsing YAML is incredibly fast. So we may not need the cache as it exists currently in the prototype. In my benchmarks and tests I've found the current approach works well (with up to ~10,000 documents) but you start to see slowdown after that in my simple tests.

Note that the slowdown is still measured in milliseconds (i.e. 500ms) and nothing as bad as it was with Grow (i.e. 1-5+ seconds for huge sites with many locales).

BTW: Node's file watcher is also really, really good (based on my tests).

@Zoramite
Copy link
Member

I agree, for the server, we don't really need to cache the yaml at all with how fast this seems to process the yaml files. The reason we cache so heavily in Grow is that it is a slow process for python to parse yaml. If that is no longer a concern the server can just read everything fresh and parse as needed.

We may want to cache during the build command, but that would be more of an optimization move for processing a lot of files and it would only make sense if you had multiple locales and didn't want to keep parsing the same yaml file if you can just localize the base yaml extraction for each locale.

Re: the access to the locale, the fields comes from the document correct? Once the document is created it should have a locale already as part of the constructor, so when it first loads the fields you would use the base and call toLocale(). So the document fields is always the localized version of the yaml. Then when it gets to the template it already has the localized version of the class, you do not need to pass anything.

@Zoramite
Copy link
Member

Oh, the other use case for the caching--besides the multiple locale documents--is the usage of !g.doc or !g.yaml. That is an optimization that can come later if we find that the constructors having to reload all the other references is slowing it down. Such as for often used files.

But we could be smart about that and detect which ones are referenced the most often and keep them in a cache without caching everything. Ex, the loader can track the number of times each path is loaded and the ones that start to be referenced more often (ex: > x times) can get cached instead of reread.

@jeremydw
Copy link
Member Author

Without getting too off track from the localization issue at hand, the LRU cache sounds great to me. That would help with memory usage.

How do we all feel about this approach?

# In YAML
foo: !a.Localized
  default: foo
  de_DE: German localized content

# In templates (it just works, no |L needed for localizations)
{{doc.foo}} (or {{doc.fields.foo}} as it's currently implemented)

If the above can be done in a performant way I fully approve it. :)

@stevenle
Copy link
Member

Regarding the cache: I found that Node's performance with parsing YAML is incredibly fast. So we may not need the cache as it exists currently in the prototype. In my benchmarks and tests I've found the current approach works well (with up to ~10,000 documents) but you start to see slowdown after that in my simple tests.

If YAML parsing is fast enough to serve requests without noticeable latency, I'd probably say removing it would simplify the architecture since a file watcher wouldn't need to be running on a child process. The file watcher, while useful, adds considerable overhead in determining when certain cache objects need to be cleared.

Oh, the other use case for the caching--besides the multiple locale documents--is the usage of !g.doc or !g.yaml. That is an optimization that can come later if we find that the constructors having to reload all the other references is slowing it down. Such as for often used files.

But we could be smart about that and detect which ones are referenced the most often and keep them in a cache without caching everything. Ex, the loader can track the number of times each path is loaded and the ones that start to be referenced more often (ex: > x times) can get cached instead of reread.

If the cache is needed for perf reasons, maybe using a per-request cache would be better. I'm assuming this may be tough to inject into the current codebase though since some request context would need to be added to anything that calls readYaml.

@jeremydw
Copy link
Member Author

We could easily use a per request cache by passing the request object into the Pod when it's instantiated. All readYaml calls go through the Pod so they would have access to the request that way.

Regarding perf, on the current prototype the dev server responds with page renders in 3-5ms. It's really fast (even for sites with 5000 pages). The only "slow" part of the system is reading all the files to produce the full list of paths. (It only gets slow after having ~5000 files each defining their own path.) But even then it creates the list in about 500ms.

Also, we could require or suggest sites that are larger than X to use routes in the podspec equivalent (like Oak) and then we could use a normal trie structure like Oak and do everything lazily. I'm also OK with mandating all sites do this (even small ones) but then it would be harder for specific pages to override or customize their URL structure on a case by case basis.

Zoramite added a commit that referenced this issue Dec 17, 2020
Localized the document fields when loading fields for a document.
Uses the document's locale to get the correct localized value when `toString` is called.

Part of #5
Zoramite added a commit that referenced this issue Dec 22, 2020
Localized the document fields when loading fields for a document.
Uses the document's locale to get the correct localized value when `toString` is called.

Related to #5

The content body part still uses null since it can be used to show that the document has been parsed, but there is no body, different from when it is not parsed and is just undefined.
@jeremydw jeremydw added this to the 1.0.0 milestone Apr 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants