Performance improvement. Process raw JSON data incrementally. #7

Open
vampy opened this issue Sep 12, 2015 · 6 comments

vampy commented Sep 12, 2015

Currently, every time maint_graphics.py generates the report data, it removes all the old data and processes all the data again from scratch.
This is quite inefficient, as it takes a lot of CPU time and memory (especially on low-RAM machines, cough DigitalOcean VPS).

A better solution would be:

  1. Every time we run the generation script, record the last ID in the userreport table that was processed.
  2. On each subsequent run, process only the rows between the remembered ID and the current latest ID.
  3. Maintain the user count for each device by looking up whether the device is already in the table (https://github.com/supertuxkart/stk-stats/blob/master/userreport/maint.py#L48).
    Perhaps add another column holding the SHA1 sum of all the columns that identify the device, so that we do not have to add non-clustered indexes on every column in the graphicsdevice table.
    I am not sure which is faster, though: a single SHA1-sum column (one non-clustered index) OR non-clustered indexes on all the columns that uniquely identify a device entry.
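The three steps above could be sketched roughly as follows. This is only an illustration against a plain SQLite schema; the table and column names here (`generation_state`, the `digest` column, and the identifying columns `gl_vendor`/`gl_renderer`/`os`) are hypothetical stand-ins, not the project's actual schema:

```python
import hashlib
import sqlite3


def device_digest(ident_columns):
    """SHA1 over the columns that identify a device (step 3).

    A single indexed digest column avoids maintaining a composite
    index over every identifying column in graphicsdevice.
    """
    return hashlib.sha1(
        "|".join(str(v) for v in ident_columns).encode("utf-8")
    ).hexdigest()


def process_new_reports(conn):
    """Process only userreport rows newer than the last remembered ID
    (steps 1 and 2), upserting per-device user counts keyed by digest."""
    cur = conn.cursor()
    last_id = cur.execute(
        "SELECT value FROM generation_state WHERE key = 'last_userreport_id'"
    ).fetchone()[0]
    new_rows = cur.execute(
        "SELECT id, gl_vendor, gl_renderer, os FROM userreport WHERE id > ?",
        (last_id,),
    ).fetchall()
    for report_id, *ident in new_rows:
        digest = device_digest(ident)
        # Increment the count if the device is already known, else insert it.
        cur.execute(
            """INSERT INTO graphicsdevice (digest, gl_vendor, gl_renderer, os, usercount)
               VALUES (?, ?, ?, ?, 1)
               ON CONFLICT(digest) DO UPDATE SET usercount = usercount + 1""",
            (digest, *ident),
        )
        last_id = max(last_id, report_id)
    # Step 1: remember where we stopped, so the next run starts from here.
    cur.execute(
        "UPDATE generation_state SET value = ? WHERE key = 'last_userreport_id'",
        (last_id,),
    )
    conn.commit()
```

With this shape, a run over an unchanged table is a near no-op: the `id > ?` query returns nothing and the remembered ID stays put, so the daily cron pass only ever touches rows submitted since the previous pass.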
@qwertychouskie

Pertinent discussion on IRC:

@hiker
Whow - do we really have between 10k and 50k android installs???
@Arthur_D
Possibly
@hiker
I am impressed :)
QwertyChouskie
http://addons.supertuxkart.net/stats/
187,865!
@hiker
Yes, but that's #files downloaded, not #installations
We should have a look at our hardware stats :)
@Arthur_D
Issue is that they take too much resources on the server to generate
@hiker
Yes, I know :(
That's why I wrote 'should'
not 'look here' ;)
Maybe we could add some less detailed statistic - just OS and numbers or so
@Arthur_D
From what leyyin told me the main issue is that there's no "diff" feature, so it has to compile everything every time
@hiker
You mean you can't just add more data in by just analysing new data?
@Arthur_D
And that eats up RAM like no tomorrow
Yep, unless I misunderstood
@hiker
We should do something about this :P
@Arthur_D
Indeed
Stragus
What about compiling the data and updating the pages once per day or something?
Then it doesn't matter if it takes a while
@hiker
That's what we are already doing
QwertyChouskie
Yeah, leyyin[m] you should do that ;P
@hiker
We can't run the daily cron process to update anymore
@Arthur_D
Well we stopped doing that since it crashes

  • Stragus scratches his head pondering how that could be slow
  • hiker is actually wondering the same :)

@Arthur_D
Too high RAM usage
QwertyChouskie
How much RAM does the server have?
@Arthur_D
It's probably doing something incredibly inefficient
1GB if I remember correctly
@hiker
We got the whole stats package from 0ad (iirc)
@Arthur_D
Yep
Stragus
It really doesn't make much sense, you don't have billions of records
@hiker
Maybe we would need to rewrite it ... if we only had the time
Stragus
(and if you did, then you should stream from disk while compiling summaries, not caching the whole thing...)
Oh well, it's probably not written in C or C++ so I couldn't really have a look
Stragus
It's very good to have an online reference of GL extension supports, for a lot of people, and the 0 A.D. database is very outdated
(Where is the code of that thing?)
@Arthur_D
I think we have it in a GitHub repo for the add-ons site
@hiker
https://github.com/supertuxkart/stk-stats
QwertyChouskie
https://github.com/supertuxkart/stk-stats
Stragus
I'm sure a lot of folks in ##OpenGL would love an updated online database, especially if you have Android in there
QwertyChouskie
oops too late :)
Stragus
Eh, Python
@hiker
grin we might have android in there :P We just don't know
Stragus
So it's a SQL database, that could be read from a C program that outputs web pages
Stragus
It's the SQL -> static_web_pages step that fails, correct?

  • Stragus is trying to find that in the source

@hiker
I don't know any of the details, leyyon would know
Kitoko
i like how android made this sudden jump to #1 usage hehe
QwertyChouskie
Stragus #7 may have some useful info
Stragus
Compiling the pages from scratch every time is no big deal, I think the Python is just broken
Stragus
The code doesn't look catastrophically inefficient, except for buffering everything before writing, but I'm sure you don't have gigabytes of data...
leper
that might be wrong though :P at least last time someone asked we (0ad that is) had something about 60GB for that thing
Stragus
Oh :) Okay, so it needs to be accumulated/compiled while it's being read from the SQL database, not all buffered
leper
so that might very well be the issue
(we also have someone interested in updating and improving our own copy of that, since the one maintaining that isn't active anymore, nor has been for quite a while)
(and I guess some queries might be improved a lot by changing the DB schema, or just storing the submitted data somewhere, and extracting what is needed into some table)
(which allows fast queries for a few things, and if something else is needed the raw data is still there, it might just take some time to parse it into something that can be queried)
Stragus
I can't write Python but that seems to be pretty simple, to stream the reports from SQL in the _save_devices() loop
leper`
likely, though I'd still suggest (at least that's what I did to that one mentioned above) to store the data in a different way, so quite a few queries could be done with simple sql queries
(eg counting OSs)
@hiker
+1 :)
Stragus
The only way to reach gigabytes of data would be if identical reports aren't consolidated
(merge an identical report to the existing one, increment some counter)
leper
last I checked the whole submitted data is some JSON blob that's stored in some field
@hiker
Stragus: at least for stk it should report one installation only once
leper
(with our version in some mysql instance that likes to corrupt the db, repairing which takes huge amounts of ram and ages)
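The approach Stragus and leper converge on in the log above (accumulate summaries while reading from the database, instead of buffering every report in RAM) could be sketched like this. The function names and the single-column schema are hypothetical illustrations, not the project's actual code:

```python
import sqlite3
from collections import Counter


def iter_rows(cursor, batch_size=1000):
    """Stream rows from a DB-API cursor in fixed-size batches via
    fetchmany(), so memory use stays bounded regardless of table size
    (unlike fetchall(), which buffers the whole result set)."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


def count_os(cursor):
    """Accumulate a per-OS tally while streaming, never holding more
    than one batch of rows in RAM at a time."""
    counts = Counter()
    for (os_name,) in iter_rows(cursor):
        counts[os_name] += 1
    return counts
```

This is also the shape of the "less detailed statistic, just OS and numbers" idea from the log: a streaming pass with a small accumulator per summary, rather than loading every report before writing any page.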

@qwertychouskie

https://github.com/supertuxkart/stk-stats/blob/master/userreport/maint.py#L109 is interesting here; it looks like someone already tried to fix this. Does anyone know whether it worked, and if not, why not? @leyyin


vampy commented Feb 13, 2019

To be honest I can't remember exactly, but afaik it was slowing down the script a lot.

@qwertychouskie

Better slow than not working, right?


vampy commented Feb 16, 2019

Slow as in it will hog the server for about half a day, so it is effectively not working.
It needs further improvements, as described in the issue above.

@qwertychouskie

Could you work on this soon then please? Or maybe @Alayan-stk-2 wants to learn python :P
