Performance improvement. Process raw JSON data incrementally. #7

Open
vampy opened this issue Sep 12, 2015 · 6 comments

vampy commented Sep 12, 2015

Currently, every time maint_graphics.py generates the report data, it removes all the old data and processes all the data again from scratch.
This is quite inefficient, as it takes a lot of CPU time and memory (especially on low-RAM machines, cough DigitalOcean VPS).

A better solution would be:

  1. Every time we run the generation script, record the last ID in the userreport table that was processed.
  2. On each subsequent run, process only the rows between the remembered ID and the current latest ID.
  3. Maintain the user count for each device by looking up whether the device is already in the table (https://github.com/supertuxkart/stk-stats/blob/master/userreport/maint.py#L48).
    Perhaps add another column holding the SHA1 sum of all the columns that identify the device, so that we do not have to add non-clustered indexes on every column in the graphicsdevice table.
    I am not sure which is faster, though: a single SHA1-sum column (one non-clustered index) OR non-clustered indexes on all the columns that uniquely identify a device entry.
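The three steps above could be sketched roughly as follows. This is only an illustration against a plain SQLite schema; the table and column names here (`generation_state`, the `digest` column, and the identifying columns `gl_vendor`/`gl_renderer`/`os`) are hypothetical stand-ins, not the project's actual schema:

```python
import hashlib
import sqlite3


def device_digest(ident_columns):
    """SHA1 over the columns that identify a device (step 3).

    A single indexed digest column avoids maintaining a composite
    index over every identifying column in graphicsdevice.
    """
    return hashlib.sha1(
        "|".join(str(v) for v in ident_columns).encode("utf-8")
    ).hexdigest()


def process_new_reports(conn):
    """Process only userreport rows newer than the last remembered ID
    (steps 1 and 2), upserting per-device user counts keyed by digest."""
    cur = conn.cursor()
    last_id = cur.execute(
        "SELECT value FROM generation_state WHERE key = 'last_userreport_id'"
    ).fetchone()[0]
    new_rows = cur.execute(
        "SELECT id, gl_vendor, gl_renderer, os FROM userreport WHERE id > ?",
        (last_id,),
    ).fetchall()
    for report_id, *ident in new_rows:
        digest = device_digest(ident)
        # Increment the count if the device is already known, else insert it.
        cur.execute(
            """INSERT INTO graphicsdevice (digest, gl_vendor, gl_renderer, os, usercount)
               VALUES (?, ?, ?, ?, 1)
               ON CONFLICT(digest) DO UPDATE SET usercount = usercount + 1""",
            (digest, *ident),
        )
        last_id = max(last_id, report_id)
    # Step 1: remember where we stopped, so the next run starts from here.
    cur.execute(
        "UPDATE generation_state SET value = ? WHERE key = 'last_userreport_id'",
        (last_id,),
    )
    conn.commit()
```

With this shape, a run over an unchanged table is a near no-op: the `id > ?` query returns nothing and the remembered ID stays put, so the daily cron pass only ever touches rows submitted since the previous pass.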
@qwertychouskie

Pertinent discussion on IRC:

@hiker
Whow - do we really have between 10k and 50k android installs???
@Arthur_D
Possibly
@hiker
I am impressed :)
QwertyChouskie
http://addons.supertuxkart.net/stats/
187,865!
@hiker
Yes, but that's #files downloaded, not #installations
We should have a look at our hardware stats :)
@Arthur_D
Issue is that they take too much resources on the server to generate
@hiker
Yes, I know :(
That's why I wrote 'should'
not 'look here' ;)
Maybe we could add some less detailed statistic - just OS and numbers or so
@Arthur_D
From what leyyin told me the main issue is that there's no "diff" feature, so it has to compile everything every time
@hiker
You mean you can't just add more data in by just analysing new data?
@Arthur_D
And that eats up RAM like no tomorrow
Yep, unless I misunderstood
@hiker
We should do something about this :P
@Arthur_D
Indeed
Stragus
What about compiling the data and updating the pages once per day or something?
Then it doesn't matter if it takes a while
@hiker
That's what we are already doing
QwertyChouskie
Yeah, leyyin[m] you should do that ;P
@hiker
We can't run the daily cron process to update anymore
@Arthur_D
Well we stopped doing that since it crashes

  • Stragus scratches his head pondering how that could be slow
  • hiker is actually wondering the same :)

@Arthur_D
Too high RAM usage
QwertyChouskie
How much RAM does the server have?
@Arthur_D
It's probably doing something incredibly inefficient
1GB if I remember correctly
@hiker
We got the whole stats package from 0ad (iirc)
@Arthur_D
Yep
Stragus
It really doesn't make much sense, you don't have billions of records
@hiker
Maybe we would need to rewrite it ... if we only had the time
Stragus
(and if you did, then you should stream from disk while compiling summaries, not caching the whole thing...)
Oh well, it's probably not written in C or C++ so I couldn't really have a look
Stragus
It's very good to have an online reference of GL extension supports, for a lot of people, and the 0 A.D. database is very outdated
(Where is the code of that thing?)
@Arthur_D
I think we have it in a GitHub repo for the add-ons site
@hiker
https://github.com/supertuxkart/stk-stats
QwertyChouskie
https://github.com/supertuxkart/stk-stats
Stragus
I'm sure a lot of folks in ##OpenGL would love an updated online database, especially if you have Android in there
QwertyChouskie
oops too late :)
Stragus
Eh, Python
@hiker
grin we might have android in there :P We just don't know
Stragus
So it's a SQL database, that could be read from a C program that outputs web pages
Stragus
It's the SQL -> static_web_pages step that fails, correct?

  • Stragus is trying to find that in the source

@hiker
I don't know any of the details, leyyon would know
Kitoko
i like how android made this sudden jump to #1 usage hehe
QwertyChouskie
Stragus #7 may have some useful info
Stragus
Compiling the pages from scratch every time is no big deal, I think the Python is just broken
Stragus
The code doesn't look catastrophically inefficient, except for buffering everything before writing, but I'm sure you don't have gigabytes of data...
leper
that might be wrong though :P at least last time someone asked we (0ad that is) had something about 60GB for that thing
Stragus
Oh :) Okay, so it needs to be accumulated/compiled while it's being read from the SQL database, not all buffered
leper
so that might very well be the issue
(we also have someone interested in updating and improving our own copy of that, since the one maintaining that isn't active anymore, nor has been for quite a while)
(and I guess some queries might be improved a lot by changing the DB schema, or just storing the submitted data somewhere, and extracting what is needed into some table)
(which allows fast queries for a few things, and if something else is needed the raw data is still there, it might just take some time to parse it into something that can be queried)
Stragus
I can't write Python but that seems to be pretty simple, to stream the reports from SQL in the _save_devices() loop
leper`
likely, though I'd still suggest (at least that's what I did to that one mentioned above) to store the data in a different way, so quite a few queries could be done with simple sql queries
(eg counting OSs)
@hiker
+1 :)
Stragus
The only way to reach gigabytes of data would be if identical reports aren't consolidated
(merge an identical report to the existing one, increment some counter)
leper
last I checked the whole submitted data is some JSON blob that's stored in some field
@hiker
Stragus: at least for stk it should report one installation only once
leper
(with our version in some mysql instance that likes to corrupt the db, repairing which takes huge amounts of ram and ages)
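The approach Stragus and leper converge on in the log above (accumulate summaries while reading from the database, instead of buffering every report in RAM) could be sketched like this. The function names and the single-column schema are hypothetical illustrations, not the project's actual code:

```python
import sqlite3
from collections import Counter


def iter_rows(cursor, batch_size=1000):
    """Stream rows from a DB-API cursor in fixed-size batches via
    fetchmany(), so memory use stays bounded regardless of table size
    (unlike fetchall(), which buffers the whole result set)."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


def count_os(cursor):
    """Accumulate a per-OS tally while streaming, never holding more
    than one batch of rows in RAM at a time."""
    counts = Counter()
    for (os_name,) in iter_rows(cursor):
        counts[os_name] += 1
    return counts
```

This is also the shape of the "less detailed statistic, just OS and numbers" idea from the log: a streaming pass with a small accumulator per summary, rather than loading every report before writing any page.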

@qwertychouskie

https://github.com/supertuxkart/stk-stats/blob/master/userreport/maint.py#L109 is interesting here; it looks like someone already tried to fix this. Does anyone know whether it worked, and if not, why not? @leyyin


vampy commented Feb 13, 2019

To be honest I can't remember exactly, but afaik it was slowing down the script a lot.

@qwertychouskie

Better slow than not working, right?


vampy commented Feb 16, 2019

Slow as in it will hog the server for about half a day, so it is effectively not working.
It needs further improvements, as described in the issue above.

@qwertychouskie

Could you work on this soon then please? Or maybe @Alayan-stk-2 wants to learn python :P
