Compare templating approaches #39

Closed
ruflin opened this issue Dec 13, 2022 · 5 comments

Comments

@ruflin
Contributor

ruflin commented Dec 13, 2022

There are currently three open PRs related to templating for the corpus generation tool: #36, #37, and #38.

I would like to use this issue to discuss the similarities and differences between the approaches.

My current understanding is that #37 uses Go's text/template as its template language, and #38 supports JavaScript expressions as part of the field names. What about #36? What are the other core differences between the approaches?

@ruflin
Contributor Author

ruflin commented Dec 13, 2022

An additional question I have is about the performance concerns discussed in the pull requests. Memory seems to be one of the main concerns. I assume this is less about loading and reading the templates themselves and more about data production?

@endorama
Member

endorama commented Dec 13, 2022

I'm adding some context here, as all the PRs are a bit lacking in it.

#36 is the initial PR @aspacca has been working on to add templating to the generator.
Without this PR, data generation is always based on fields.yml, so the tool can only generate events and not raw data.

Generating data from a template allows the tool to produce other data formats, like "source" data (e.g. source VPC Flow logs to be fed to our data collection).

#36 brings in the initial refactoring of the generator, with the new generator able to produce any output based on a template. By default, a backward-compatible implementation is used.

This generator:

  • uses the fields.yml file to define the available fields
  • uses a configuration file (*.conf.yml) to define boundaries for fields (e.g. low/high boundaries for scalar values or an enum for keyword values)
  • renders the template, generating field values within the constraints (if present)
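A minimal Go sketch of what generating values within configured boundaries could look like. The FieldConfig type, its fields, and the helper names are hypothetical and are not taken from any of the PRs; they only illustrate the low/high and enum constraints listed above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// FieldConfig is a hypothetical stand-in for one entry of a *.conf.yml file.
type FieldConfig struct {
	Min, Max int      // inclusive low/high boundaries for scalar fields
	Enum     []string // allowed values for keyword fields
}

// newRand returns a seeded generator so runs are reproducible.
func newRand(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// genInt returns a value in [cfg.Min, cfg.Max]. It assumes cfg.Max >= cfg.Min.
func genInt(r *rand.Rand, cfg FieldConfig) int {
	return cfg.Min + r.Intn(cfg.Max-cfg.Min+1)
}

// genKeyword picks one of the allowed enum values. It assumes a non-empty Enum.
func genKeyword(r *rand.Rand, cfg FieldConfig) string {
	return cfg.Enum[r.Intn(len(cfg.Enum))]
}

func main() {
	r := newRand(1)
	packets := FieldConfig{Min: 0, Max: 100}
	action := FieldConfig{Enum: []string{"ACCEPT", "REJECT"}}
	fmt.Println(genInt(r, packets), genKeyword(r, action))
}
```

Each field is generated independently here, which is exactly why this style cannot express one value in terms of another.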

This generator has a limitation: it does not support generating values based on other values, nor any conditional expression.

We wanted to enable generating data the way this Go implementation does (the interesting features are in the comments):

func (v *Vpcflow) randomize() {
	// [...]
	v.End = time.Now().Unix()
	v.Start = v.End - int64(rand.Intn(60)) // refer to a previously generated field
	v.Action = actions[rand.Intn(2)]       // select a value from a set
	if v.Packets == 0 {                    // perform boolean evaluation to select a value
		v.LogStatus = statuses[2]
	} else {
		v.LogStatus = statuses[rand.Intn(2)]
	}
}

This led Andrea to create #38, which is #36 plus JS-based expressions.

The code for #36 implements a very basic template engine (it uses regexp to "parse" the template and extract fields). This implementation is more error prone and, in Andrea's initial tests, outperformed the original implementation only slightly.
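To illustrate what a regexp-driven template engine looks like in general, here is a small sketch. It is not the code from #36: the `{{field}}` placeholder syntax, the `placeholder` pattern, and the `render` function are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {{fieldName}}. The syntax is an assumption for this sketch.
var placeholder = regexp.MustCompile(`\{\{\s*([A-Za-z0-9_.]+)\s*\}\}`)

// render replaces every placeholder with the value gen returns for that field name.
func render(tmpl string, gen func(field string) string) string {
	return placeholder.ReplaceAllStringFunc(tmpl, func(m string) string {
		name := placeholder.FindStringSubmatch(m)[1]
		return gen(name)
	})
}

func main() {
	values := map[string]string{"action": "ACCEPT", "packets": "12"}
	out := render("{{action}} {{ packets }}", func(f string) string { return values[f] })
	fmt.Println(out)
}
```

A malformed placeholder such as `{{action` simply fails to match and is copied to the output unchanged, with no error reported. That is one way in which this kind of "parsing" tends to be error prone.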

This led me to think about using Go's text/template instead of our custom implementation. This approach brings in support for more complex templates, but we were not sure how large the performance impact would be.
Related work is included in #37.
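As a rough illustration of why text/template helps with the limitation described earlier, the sketch below expresses a derived field and a conditional inside a template. The `flow` struct, the `sub` helper function, and the status strings are hypothetical and are not taken from #37.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// flow is a hypothetical input record; the field names are illustrative only.
type flow struct {
	End      int64
	Duration int64
	Packets  int
}

// funcs exposes a subtraction helper, since text/template has no arithmetic built in.
var funcs = template.FuncMap{
	"sub": func(a, b int64) int64 { return a - b },
}

// flowTmpl derives the start time from End and picks a status based on Packets.
var flowTmpl = template.Must(template.New("vpcflow").Funcs(funcs).Parse(
	`{{ $start := sub .End .Duration }}{{ $start }} {{ .End }} {{ if eq .Packets 0 }}NODATA{{ else }}OK{{ end }}`))

// renderFlow executes the template for one record and returns the rendered line.
func renderFlow(f flow) (string, error) {
	var sb strings.Builder
	if err := flowTmpl.Execute(&sb, f); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	for _, f := range []flow{
		{End: 1000, Duration: 30, Packets: 0},
		{End: 1000, Duration: 30, Packets: 5},
	} {
		line, err := renderFlow(f)
		if err != nil {
			panic(err)
		}
		fmt.Println(line)
	}
}
```

The template variable `$start` plays the role of "refer to a previously generated field", and the `if`/`else` action covers the boolean evaluation.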

We decided to push the experiments as they were and postpone the discussion and review of results and trade-offs.

Some considerations:

  • we can conclude that generator with template #36 is the best performer, but lacks some features we consider useful
  • we can conclude that generator with text/template package #37 is decent on CPU performance (better than the original generator) but far worse on memory usage, which prevents it from being the only implementation given the memory constraints of some use cases
  • we may want to merge multiple implementations, offering one that is more performant but less flexible and one that is less performant but more flexible
  • I remember that Support javascript expression #38 was not finished, or that there were some issues to discuss, but we did not go into much detail, so we need to wait until Andrea is back from PTO
  • generator with text/template package #37 demonstrated that it is possible to use a third-party template library, and we may want to explore further in that direction by testing other template engines; there are others that exhibit great improvements on the memory side (with the advantage of reducing complexity and maintenance burden)
  • text/template uses reflection quite a lot, which I suspect is the cause of the higher memory usage (I would expect somewhat more, but the benchmark suggests 18x more memory usage 😅).
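One way to get a feel for the allocation gap is to count allocations per render for a text/template execution and for hand-written string handling. The sketch below does not reproduce the benchmark from the PRs; the `event` type and both render functions are assumptions, and the numbers it prints depend on the Go version and the template.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"testing"
	"text/template"
)

// event is a hypothetical record used only for this comparison.
type event struct {
	Action  string
	Packets int
}

// tmpl is parsed once; only execution is measured below.
var tmpl = template.Must(template.New("event").Parse("{{.Action}} {{.Packets}}"))

// renderTemplate goes through text/template, which resolves fields via reflection.
func renderTemplate(e event) (string, error) {
	var sb strings.Builder
	if err := tmpl.Execute(&sb, e); err != nil {
		return "", err
	}
	return sb.String(), nil
}

// renderDirect produces the same output with plain string handling.
func renderDirect(e event) string {
	return e.Action + " " + strconv.Itoa(e.Packets)
}

func main() {
	e := event{Action: "ACCEPT", Packets: 12}
	// testing.AllocsPerRun reports the average number of allocations per call.
	viaTemplate := testing.AllocsPerRun(1000, func() { _, _ = renderTemplate(e) })
	direct := testing.AllocsPerRun(1000, func() { _ = renderDirect(e) })
	fmt.Printf("allocs per render: text/template=%.0f direct=%.0f\n", viaTemplate, direct)
}
```

This counts allocations rather than bytes, so it is only a coarse indicator; a proper comparison would use `go test -bench` with `-benchmem` on the real generators.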

Hope this makes the overall context a bit clearer.

@endorama
Member

I assume this is less about loading and reading the templates itself but during data production?

Yes. In general we can assume that the template only needs to be parsed once, while data generation happens at each iteration.
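A short sketch of that split, assuming text/template is the engine: parsing happens once up front and only execution sits in the loop. The `generate` function and the `Seq` key are hypothetical names used for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// generate parses tmplText once, then executes the parsed template n times.
func generate(tmplText string, n int) ([]string, error) {
	tmpl, err := template.New("event").Parse(tmplText) // one-time cost
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, n)
	var sb strings.Builder
	for i := 0; i < n; i++ {
		sb.Reset()
		// Per-iteration cost: this is where memory usage during generation matters.
		if err := tmpl.Execute(&sb, map[string]int{"Seq": i}); err != nil {
			return nil, err
		}
		out = append(out, sb.String())
	}
	return out, nil
}

func main() {
	events, err := generate("event {{.Seq}}", 3)
	if err != nil {
		panic(err)
	}
	for _, e := range events {
		fmt.Println(e)
	}
}
```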

@ruflin
Contributor Author

ruflin commented Dec 14, 2022

Thanks for all the details @endorama, super helpful. In elastic/elastic-package#984 (comment) I put together some thoughts on how things could work end-to-end in the context of elastic-package with these changes.

@endorama
Member

I'm going to close this, as the linked PRs have been closed and superseded by #41, where we make the final implementation using Go's text/template.
