Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very large form conversion fails due to timeout #171

Closed
yanokwa opened this issue Oct 13, 2020 · 14 comments
Closed

Very large form conversion fails due to timeout #171

yanokwa opened this issue Oct 13, 2020 · 14 comments
Labels

Comments

@yanokwa
Copy link
Member

yanokwa commented Oct 13, 2020

When I upload a 1.6MB XLSForm on the test server, it fails with "Something went wrong: the server returned an invalid error."

The pyxform and service containers don't show any errors, but the nginx container does.

nginx | 2020/10/13 21:14:23 [error] 55#55: *30951 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 99.145.234.152, server: , request: "POST /v1/projects/102/forms HTTP/1.1", upstream: "http://172.18.0.7:8383/v1/projects/102/forms", host: "test.central.getodk.org", referrer: "https://test.central.getodk.org/"

The XML version uploads just fine.

Obviously something is timing out, but I don't know yet where the timeout is happening. Findings so far...

  • pyxform 1.2.0 converts the form just fine on my local machine. Slow, but works.
  • pyform-http (pyxform 1.2.0/waitress 1.4.4) on my local machine running in Docker works great with an HTTP post. Takes 6 minutes, but still works.
@yanokwa yanokwa changed the title Very large form conversion fails due to timeout. Very large form conversion fails due to timeout Oct 13, 2020
@issa-tseng
Copy link
Member

6 minutes is bigger than the default tcp timeout interval of 5 minutes, is probably why. possibly the solution is to change the proxy_read_timeout for the internal nginx pyxform proxy to a much larger number.

@matthew-white
Copy link
Member

Closing after discussion with @lognaturel. We can always reopen if it's still an issue.

@yanokwa
Copy link
Member Author

yanokwa commented Oct 19, 2024

I've tried with v2024.2.1 on three large forms: 15.6MB, 11.9MB, and 9.7MB. All three upload, convert, and preview just fine.

@yanokwa
Copy link
Member Author

yanokwa commented Nov 12, 2024

Just ran into a 496KB form with 6728 rows that uploads but fails conversion with a 504.

2024/11/12 14:01:55 [error] 327933#327933: *1519075 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 99.145.234.152, server: sandbox.getodk.cloud, request: "POST /v1/projects/1/forms HTTP/1.1", upstream: "http://172.18.0.2:8383/v1/projects/1/forms", host: "sandbox.getodk.cloud", referrer: "https://sandbox.getodk.cloud/"
99.145.234.152 - - [12/Nov/2024:14:01:55 +0000] "POST /v1/projects/1/forms HTTP/1.1" 504 562 "https://sandbox.getodk.cloud/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36" "-"

I can confirm that increasing proxy_read_timeout in files/nginx/odk.conf.template from 2m to 5m works. Any danger in making that the default?

@yanokwa yanokwa reopened this Nov 12, 2024
@matthew-white
Copy link
Member

matthew-white commented Nov 12, 2024

I can confirm that increasing proxy_read_timeout in files/nginx/odk.conf.template from 2m to 5m works.

That's interesting to me, as the proxy_read_timeout is set on Backend (paths that start with /v1). To me, that means that the timeout isn't happening in the communication between Backend and pyxform-http, but between the client and Backend. I wonder how much time pyxform is taking to convert the form vs. how much time Backend is taking to do other things like creating the form schema. (I'm reminded a little of getodk/central-backend#662.)

Any danger in making that the default?

I don't know the answer to that offhand, but it's certainly something we could look into. We've seen a 504 in another issue recently as well, #691.

496KB form with 6728 rows

Just wanted to note that I'm a little surprised by this size! If you've been able to upload a 15MB form, I'm surprised that 496KB would be a problem.

@lognaturel
Copy link
Member

I wonder how much time pyxform is taking to convert the form vs. how much time Backend is taking to do other things like creating the form schema

I think that's a great question.

@yanokwa could you please send the form to me and @lindsay-stevens so we can see what the pyxfrom perf is like?

@lindsay-stevens
Copy link

File size can be unrelated to the amount of useful data. For example having lots of direct cell formatting can result in lots of "styles" saved for the document, or there could be lots of comments, or used-but-empty cells beyond the main table range, or other XLSX features. File size can also be unrelated to XLSForm complexity - it could be a very large but simple form, or a smaller and complex one. For pyxform, complexity generally means lots of things that have to be parsed, checked, cross-referenced / looked up: e.g. pyxform reference replacements (like ${other_item}), nested repeats, pulldata/instance calls, translations/media, etc.

As for the 496KB form with 6728 rows, it has: ~1000 answerable questions, ~2000 notes, ~500 calculate items, ~1000 groups (up to 4 levels of nesting), ~2000 constraints, ~2000 pyxform references, ~50 choice lists with a total ~500 options. No repeats, media, or translations. Uncompressed file size is ~4MB, of which ~2MB is the survey sheet, and ~200KB for ~400 document comments. On the survey sheet, when saved as CSV (excluding ",) ~20% of the 2M characters are non-latin unicode script (I think this may be why it's sluggish to open and use in Excel/LibreOffice). The converted XForm document size is 5MB.

When I run XLS2XForm conversion with pyxform master (b65e7277 ~v2.2.0), the 496KB file takes ~5 minutes. It takes ~6 seconds to read the data and prepare the initial internal form structure, and the rest is spent in the survey.to_xml step. Of that, ODK Validate takes ~6s; most of that time is explosive recursion of the internal survey nested dict structures e.g. SurveyElement get_lineage (go self.parent recursively, 8M calls), iter_descendants (self.children recursively, 60M calls), getattr (self.get recursion, compounded by a lot of dict.copy(), 152M calls). Much of which is initiated by get_xpath() (8M calls).

So, definitely room for improvement in pyxform in regards to processing large forms. These kinds of issues don't become obvious until there's a certain quantity of e.g. cross references, groups, nesting levels, etc. Although I think the docs advise against designing large forms? Or at least limiting complexity.

@matthew-white
Copy link
Member

Super useful information. Thank you, @lindsay-stevens!

One thing I'm wondering is whether we should wait to see whether changes to pyxform would improve the situation, or whether we should we go ahead and up the Central timeout. I'm also wondering, could different endpoints have different timeouts? Maybe it's not a bad idea to keep the timeout lower for most endpoints, but I also don't see much harm in increasing the timeout for the form upload endpoint.

I also want to make sure I understood what @yanokwa wrote here:

Just ran into a 496KB form with 6728 rows that uploads but fails conversion with a 504.

Does this mean that Central was able to create the form eventually? That is, sometime after the 504 was received, you could see the form in the form list? I think it's probably surprising to users that an error message doesn't mean that the request failed. If others think so too, I could file a separate issue about that. We could change how that error is messaged on Frontend ("Your request took too long, and Central stopped waiting for it. It may eventually complete."). Or we could set up Backend to actually stop working and roll back the database transaction after 2 minutes (or whatever we decide for the request timeout). That way, an error response would mean that the request was unsuccessful / had no effect.

@lognaturel
Copy link
Member

Of that, ODK Validate takes ~6s

Is that maybe a typo? Given the calls you list, I'd expect that to be higher.

I think it's probably surprising to users that an error message doesn't mean that the request failed. If others think so too, I could file a separate issue about that

That's a good idea.

Your request took too long, and Central stopped waiting for it. It may eventually complete.

That's a little bit of a terrifying message? But I can't immediately think of something better...

@yanokwa
Copy link
Member Author

yanokwa commented Nov 14, 2024

Does this mean that Central was able to create the form eventually? That is, sometime after the 504 was received, you could see the form in the form list?

Nope, I never saw the form after the 504.

Whether we should wait to see whether changes to pyxform would improve the situation, or whether we should we go ahead and up the Central timeout.

I'd prefer to wait to see how quickly @lindsay-stevens can get a performance increase shipped. We have proxy_read_timeout as workaround if someone needs it.

@lindsay-stevens
Copy link

Is that maybe a typo?

Validate really took 6 seconds. It's pyxform doing some weird stuff. Currently working on a branch here - that form is now producing the same output in about 20 seconds, including Validate time.

The first commit tidied up some dict management and cached each element's XPath which brought it to about 45s total. The second commit is starting to tackle the main issue which seems to be Survey._var_repl_function. I think it is resulting in something like, for every pyxform reference, check for common ancestor groups that are repeats, by iterating the survey from the top down and compare the target/context xpath (separately and repeatedly) against each element's xpath (it seems to have been this way for many years). The second commit avoids that for this form by checking from the bottom up if there are any common ancestor repeats. So for complex forms with repeats it would be a problem still.

@yanokwa / @lognaturel let me know if you want to ship this pronto and I'll prioritise adding test cases (so far just been using that form and cross checking with existing test suite). Otherwise I could keep going to optimise for large forms with repeats and many references - might help XLSForm/pyxform#453 as well.

@lognaturel
Copy link
Member

that form is now producing the same output in about 20 seconds, including Validate time.

🕺 💃 🪩

I could keep going to optimise for large forms with repeats and many references

Let's keep going a bit and see what you find! I think we'll aim to release pyxform the first week of December so that's a good target. Thanks!

@lognaturel
Copy link
Member

Let's close this issue for now. If we get more similar form conversion timeouts, let's look at them from a pyxform perspective first.

@matthew-white
Copy link
Member

I think it's probably surprising to users that an error message doesn't mean that the request failed. If others think so too, I could file a separate issue about that

That's a good idea.

It sounds like the request actually did fail for @yanokwa, so maybe we don't need to make any big change here. I still think a better error message would be nice though, something other than "Something went wrong: error code 504." I have a PR up for that at getodk/central-frontend#1052.

Your request took too long, and Central stopped waiting for it. It may eventually complete.

That's a little bit of a terrifying message? But I can't immediately think of something better...

If 504s do sometimes fail, then maybe we don't need to worry our users unnecessarily. I've removed the second sentence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants