Support for mod_fcgid execution models #6
@lazy404 wrote:
Using an SMA from processes not forked from a common ancestor, say by using MAP_FIXED, is problematic on *nix, especially as there is no guarantee that the same address window will be available; this is especially true where the PHP instances may be loading different extensions that could malloc different storage areas during extension startup (which occurs before zend_extension startup). This is the reason that MLC OPcache abandons absolute addressing for file-based content. Compiled modules are stored in a position-independent format in the file and relocated on reloading. I know from the timing instrumentation in the code that this is a low-cost operation in relative terms. Also, since the "SMA" only needs to store the modules for a single request, it doesn't grow very large. I've added a JIT mmap allocator to complement this, so for example executing the index.php script for MediaWiki reads about 200 compiled scripts from its index.opcache file, about 3 MB read largely serially from a single file. This is LZ4H compressed, and it expands to ~30 MB in the mmap cache. In fact this LZ4 expansion is the biggest overhead in using the cache, but I chose LZ4 because its expansion is fast, and this is still only ~10% of the compile cost. The true shared memory cache being used here is the (Linux) Virtual File System cache, and the density there is a lot higher as the compiled script content is compressed at an application level.
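For readers unfamiliar with the pattern, here is a minimal C sketch of loading such a compressed cache file into a freshly mmapped region. The file layout (an 8-byte uncompressed-size header), the function name and the error handling are illustrative assumptions, not MLC OPcache's actual format or code; only the liblz4 call is real.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <lz4.h>                   /* LZ4_decompress_safe() from liblz4 */

/* Read a compressed cache file and expand it into an anonymous mmap
 * region whose address the kernel chooses -- hence the need for the
 * relocation pass discussed in this thread. */
void *load_cache(const char *path, size_t *out_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 8) { close(fd); return NULL; }

    char *comp = malloc(st.st_size);           /* one serial file read */
    ssize_t n = read(fd, comp, st.st_size);
    close(fd);
    if (n != st.st_size) { free(comp); return NULL; }

    uint64_t raw_len;                /* assumed 8-byte size header */
    memcpy(&raw_len, comp, sizeof raw_len);

    char *region = mmap(NULL, raw_len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { free(comp); return NULL; }

    int done = LZ4_decompress_safe(comp + 8, region,
                                   (int)(st.st_size - 8), (int)raw_len);
    free(comp);
    if (done < 0) { munmap(region, raw_len); return NULL; }

    *out_len = raw_len;
    return region;     /* caller relocates pointers before executing */
}
```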
Currently we use MAP_FIXED with some high address base, which seems safe; it's a big ugly hack but works most of the time (collisions of an ~100 MB window in a 47-bit address space are rare ;). Looking at the kernel code (http://xorl.wordpress.com/2011/01/16/linux-kernel-aslr-implementation/), it's possible to extend the gap between the randomized stack space and the mmap space by setting a static gap value of e.g. TASK_SIZE/6*4; 1/6 of TASK_SIZE gives us over 0.5 GB on 32-bit systems and something gigantic on a 64-bit system. The relevant kernel constant is:

```c
#define MAX_GAP (TASK_SIZE/6*5)
```

This should make sharing the cache in fcgid mode much easier to complete; we just need to take care of the initial reconnecting (it looks like shared_alloc_win32.c is doing something very similar). I will try digging deeper into this. Another idea could be taken from (http://stackoverflow.com/questions/6446101/how-do-i-choose-a-fixed-address-for-mmap):

"An aside: MAP_FIXED can be used safely if you first make a single mapping without MAP_FIXED, then use MAP_FIXED to create various mappings at fixed addresses relative to one another over top of the original throw-away mapping that was at a kernel-assigned location. With a sufficiently large mapping size, this might solve OP's problem. – R.. Jun 23 '11 at 5:24"

So if the mod_fcgid process manager mmaps 1 GB of anonymous memory, all its children can use MAP_FIXED inside this region (the address can be passed to PHP); see the sketch below. After a restart, the first running process might have to remove the cache, or relocate it just as MLC is doing after loading the data from file. The compression sounds very interesting; even if it is still only 10% overhead, in our env we will have to consider the additional performance hit on cache updates. I think a non-shared SMA can be dangerous in a shared hosting env: a small 30 MB uncompressed cache file times 300 processes gets us roughly 9 GB of used memory. I understand that we can use opcache.cache_pattern to make the cache sets more local and smaller, and the MLC OPcache will then pre-seed ZOP with the much smaller set.
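A minimal sketch of the reservation trick quoted above, assuming the process manager reserves the region before forking its children and publishes the base address to them; the sizes, the shm object name, and the hand-off mechanism are all illustrative assumptions:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define RESERVE_SIZE (1UL << 30)    /* 1 GB window, as suggested above */
#define SMA_SIZE     (64UL << 20)   /* illustrative cache size */

int main(void)
{
    /* Throw-away PROT_NONE reservation: consumes address space only.
     * Children forked by the process manager inherit it, so `base`
     * is the same in all of them. */
    void *base = mmap(NULL, RESERVE_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) return 1;

    /* Shared backing object ("/php_sma" is an illustrative name). */
    int fd = shm_open("/php_sma", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SMA_SIZE) != 0) return 1;

    /* MAP_FIXED is safe here because it only replaces our own
     * reservation, never an unrelated live mapping. */
    void *sma = mmap(base, SMA_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_FIXED, fd, 0);
    if (sma == MAP_FAILED) return 1;

    printf("reservation at %p, shared SMA fixed at %p\n", base, sma);
    return 0;
}
```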
I also did a trawl on this, and most of the *nix-based solutions are a hack. At best you can do mapping-collision detection, and the subtleties are very platform / kernel specific. You might get it working for your particular stack, but as a general solution I am just not convinced. Going to a position-independent approach added a lot of work for me, so I only abandoned an absolute address for the SMA with reluctance.
I guess there's no free lunch. The LZ4H compression rate is a lot slower than expansion, but surely the update rates are minuscule compared to read rates?
Hummnnn, but what about the security vulnerabilities of a shared SMA in a shared hosting environment? At the end of the day, the SMA is just a memory address range that is directly accessible to any extension in any process mapping onto it. An unscrupulous user with extension development skills could easily use this to replace any compiled script with his or her own and hack other users. Sharing SMAs for a given UID is safe, but these could be forked off a common parent and have the same absolute address anyway. BTW, the 30 MB is for a MediaWiki full page generation; scripts like a phpBB viewtopic create a much smaller cache. And if you want to reduce the op_array overhead, encourage your users to switch to PHP 5.4+, as this uses a denser op_array format that cuts opcode sizes by about 30%.
By the non-shared SMA being dangerous, I was referring to the potential memory consumption increase (every process having its own copy of the cache). Currently every UID has its own cache which is shared by all of its fcgi processes, so there are no security implications. I wonder if I could go the other way: not forking new processes directly from the process manager, but using something similar to what php-fpm does with a dynamic number of processes. The cost would be one extra PHP instance per active vhost, which is acceptable for us.
@lazy404, so it's a case of +1 on the sharing of SMAs :-). This is all complicated and not very intuitive. It's taken me a year to get my head around the true performance characteristics of the Zend execution environment. For example, I have a PHP build which is -Ofast without debug, but with the main instruction dispatch in the Zend VM

```c
if ((ret = OPLINE->handler(execute_data TSRMLS_CC)) > 0) {
```

replaced by

```c
op_ndx = OPLINE->opcode*25 + op_type_decode[OPLINE->op1_type] * 5
                           + op_type_decode[OPLINE->op2_type];
op_timer = _rdtsc();
ret = OPLINE->handler(execute_data TSRMLS_CC);
op_timer = _rdtsc() - op_timer;
op_timer_table[op_ndx].count++;
op_timer_table[op_ndx].timer += op_timer;
if (ret > 0) {
```

which collects counts and CPU cycles per Zend opline handler. Doing this sort of stuff, stracing, etc., gives you a real grasp of where execution cycles and resources are getting consumed for real apps like MediaWiki, WordPress and phpBB. This is all the subject of a separate article / thread, but my point is that getting the right balance is hard. For example, these three apps use smart helpers for CSS furniture etc., so when a user queries a Wiki page the browser does a GET /index.php/somepage and a GET /load.php?.... These two requests load a common core of 50 or so MW classes, but index.php can load another 150 or so, whilst load.php loads a different 10. This sort of pattern of overlapping module use is repeated in the other apps. Having a per-request "SMA" keeps it as small as is practical. The true SMA here is the Linux filesystem cache, and the OPcache SMA is really a null compiled-module store for the script execution. This is still a real cycle saver though, as it significantly reduces the number of emallocs done in loading compiled scripts. Doing a per-UID true SMA adds a whole bunch of complications, and I am not sure that we will tip into the domain where this will be an overall net benefit. Got to go now -- will come back on this thread later.
Interesting approach. I am a little worried about code duplication impacting the VM cache in scripts which include the same files from different starting points, and about cases where code inclusion is based on some script parameters; in those cases the cache will be modified and compressed on each request until all used options are in the cache. In our env IOPS are the main enemy: we want to move the storage to NFS without losing much performance. Heck, for us caching data in separate files with a 1-1 mapping could be satisfying (the cache will be on a cheap local SSD, while the PHP code will be located on NFS with occasional stat calls to check if there are any modifications); this will prevent code duplication and decrease the compression penalty, though of course it will be slower than loading the whole cache from one file. Also, I have tried to make a true shared cache on OPcache; of course it doesn't work :), it segfaults on globals initialization. Maybe you have some pointers about the initialization process, and how to make it a two-stage process?
I currently use a per-request cache, so in the case of MediaWiki, for example, wiki/index.opcache will contain the set of compiled scripts cached in executing the wiki/index.php request, wiki/load.opcache will contain the set for executing wiki/load.php, etc. The cache files are read-only, but if a request compiles any additional scripts, then a new cache is moved over the original; this is created by concatenating the index, the old set of scripts and the newly compiled ones. If a cache is actually invalid (e.g. something has changed) then it is trashed completely and rebuilt from scratch. This works very well for low-volatility apps.
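As a minimal sketch of the "new cache moved over the original" step, assuming POSIX rename() semantics; the directory layout and helper name are illustrative, not the fork's actual code:

```c
#include <stdio.h>

/* Write the new cache (index + old scripts + newly compiled ones) to a
 * temporary file, then rename() it over the original.  rename() is
 * atomic on POSIX filesystems, so concurrent readers see either the old
 * or the new cache, never a partial one. */
int replace_cache(const char *dir)
{
    char tmp[4096], final[4096];
    snprintf(tmp,   sizeof tmp,   "%s/index.opcache.tmp", dir);
    snprintf(final, sizeof final, "%s/index.opcache",     dir);

    /* ... write index, existing compiled scripts, then new ones to tmp ... */

    return rename(tmp, final);   /* atomic move over the live cache */
}
```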
Do an strace on the execution of a request cached and uncached. The getattr calls and other I/Os are materially fewer.
The opcode hierarchies use absolute addresses. If you want to save them to a file, you must actively relocate these absolute addresses. This is what zend_persist_prepare.c does, together with the zend_accel_script_relocate() function in zend_accelerator_file_utils.c.
There's about a year's learning curve on making APC and then OPcache work reasonably embedded in this fork. Rasmus Lerdorf's view was that getting this working would be infeasible, as all previous attempts realised minimal (say ~5%) improvement. This gets roughly 80% of a pure SMA-based cache. Hacking will just get bullet holes in your feet. Have a read through the wiki pages and try to understand the code as a whole. If I can understand your requirements properly, and it's a mainstream shared-hosting one, then I will be happy to work out how to make it happen.
Yes, I agree, in the fcgi scenario it should work more or less the same; maybe we can clear the old cache lazily, hoping we will get a request for the same code.
I did. I also hacked PHP to not check for symlinks (the symlink owner check is done at the kernel level, so we don't have to worry about security implications, and there can't be any race scenarios: PHP can access a symlink only if the symlink owner is the same as the owner of the symlink target; we use open_basedir, and PHP then constantly stats every directory in the path).
This would automagically align caches with sites; in our env we can't guarantee that docroots are at a certain directory level.
I would be riding a wheelchair by now ;) "my" APC is serving millions of requests daily with fewer than 100 segfaults.
Thank you, I will start reading and will get back to you if I have some more substantial questions. I have already read some code, but those structure-expanding ZS* macros are not helping. BTW, did you experiment with OPcache's optimizer? If yes, is it any good at least at reducing the code size?
There's a flaw in this open_basedir code in that it's terribly sub-optimal and craters NFS performance. I had an exchange with Rasmus over this on Bugzilla. I can understand where he's coming from, as he just doesn't see shared hosting environments as a priority. I should really put fixing this on my TODO list.
Yes, it reduces the op_array size a bit, maybe 5-10%, but it actually has minimal impact on runtime. The reason for this only became clear when I started analysing the op_timer_table stats that I discussed above (I need to write a general article on this): in fact the Zend VM is pretty well optimized to work with the gcc compiler. The threaded interpreter only takes ~100 clocks for instructions like the JUMPs and simple arithmetic ops on temporary and local variables, and this is the stuff that the optimizer tends to clean up. The killers are compilation, function-call overheads, all the dynamic storage housekeeping and built-in functions (because they do most of the work anyway).
@lazy404, an off-the-wall thought: it would be fairly easy to swap out the per-script file-based cache with, say, a memcached instance listening only on a socket. This could be shared across whatever scope the sysadmin cared to configure. I need to think more about the pros and cons, but this could fit well in large-scale shared-hosting templates. And no messy opcache files in the user file space.
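To make the idea concrete, here is a minimal sketch using libmemcached over a Unix socket; the socket path, the key scheme and the helper names are illustrative assumptions, not a worked-out design:

```c
#include <string.h>
#include <libmemcached/memcached.h>

static memcached_st *cache_connect(void)
{
    memcached_st *memc = memcached_create(NULL);
    /* Listen only on a socket, as suggested: no TCP exposure. */
    memcached_server_add_unix_socket(memc, "/var/run/opcache.sock");
    return memc;
}

/* Store one compiled-script blob under its script path. */
static int cache_put(memcached_st *memc, const char *path,
                     const char *blob, size_t blob_len)
{
    return memcached_set(memc, path, strlen(path),
                         blob, blob_len, 0 /* no expiry */, 0)
           == MEMCACHED_SUCCESS ? 0 : -1;
}

/* Fetch a compiled-script blob; the caller free()s the result. */
static char *cache_get(memcached_st *memc, const char *path, size_t *len)
{
    uint32_t flags;
    memcached_return_t rc;
    return memcached_get(memc, path, strlen(path), len, &flags, &rc);
}
```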
@TerryE For us, I think a file-backed cache would be better.
I have one maybe silly question: is the compiled PHP opcode position-dependent? If I understand correctly, the conversion in your fork involves the cache structure, not the opcode itself. On memcache pros and cons, the per-request costs compare roughly as:
- file-based: open() + mmap() + close(), or open() + malloc() + LZ4 expand + close()
- memcache: connect()
I did some updates on https://github.com/lazy404/ZendOptimizerPlus/tree/fcgid, and it seems to more or less work (a shared, reattachable cache). For some reason it won't cache the main PHP file, but all others are cached; there is about a 50% performance gain, and I can't break it using concurrent ab against a WordPress install. Of course the locks aren't working (ZOP is using flocks, but this can be easily fixed to use pthread locks), so it might explode while concurrently updating the cache. This will also work for cli. It uses the hardcoded cache file /apc/dupa (sorry) and the 0x00100000000 memory location, so make sure the directory /apc is present and writable.
IMO, what you are really pointing out here is that *nix already has a high-performing SMA with UID/GID-based security built in: it is called the Virtual Filesystem Cache. Why try to replicate this? That was my rationale in proposing a file-based cache alternative in the first place.
Agreed, but IMO the way to do this is to build a strong case for getting this functionality merged back into core.
You need to be careful with the assumption that compiling a given source code will always produce the same compiled output; in fact, by default this isn't the case. However, for PHP 5.3 or later, OPcache modifies some compiler options in ZendAccelerator.c to make sure this is the case:

```c
#if ZEND_EXTENSION_API_NO >= PHP_5_3_X_API_NO
    orig_compiler_options = CG(compiler_options);
    CG(compiler_options) |= ZEND_COMPILE_HANDLE_OP_ARRAY;
    CG(compiler_options) |= ZEND_COMPILE_IGNORE_INTERNAL_CLASSES;
    CG(compiler_options) |= ZEND_COMPILE_DELAYED_BINDING;
    CG(compiler_options) |= ZEND_COMPILE_NO_CONSTANT_SUBSTITUTION;
#endif
```

Re memcached pros and cons: if you don't mind, I will collect this thread into a separate issue once I've thought it through. I will also scan your fork, and if I have any feedback issues I will raise them at your fork.
I've rewritten my article The Zend Engine and opcode caching, which explains this, and you should read it to understand how standard OPcache interacts with the Zend compiler and executor. The issue here is that any ZVAL target, JMP target, or the HashTables inside the compiled scripts use internal (absolute) address pointers which are only correct if the compiled script is based at the same absolute address in every process which maps the SMA.

In the MLC version, I use the module zend_persist_prepare.c to either (i) tag an address as needing to be relocated, or (ii) in the case of redundant links in HashTables and standard opcode handlers, set these addresses to zero. Tagged addresses then have the compiled-script base address subtracted to make them base-relative. This extra pass is only done once, during compilation. The relocation vector is saved with the compiled script (it is <5% overhead on the script size). When the compiled script is reloaded into a new process (typically at a new base address), the preparation function calls zend_accel_script_relocate() in zend_accelerator_file_utils.c to adjust all of these addresses to the correct values for the new base address. This relocation pass is fast, and the relocated version of the compiled script then works fine at its new base address.

This is robust: I've run it against the entire PHP test suite, and the only bugs that I found were in core OPcache. What this technique does is allow you to make compiled scripts position-independent in a cheap and robust way.
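A minimal sketch of the relocation scheme described above: tagged absolute pointers are rewritten as base-relative offsets at persist time, and a fast pass rebases them on reload. The structures and names here are illustrative, not the actual zend_persist_prepare.c / zend_accel_script_relocate() code.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t *reloc;       /* offsets (within the blob) of pointer slots */
    size_t    reloc_count; /* saved alongside the compiled script        */
} reloc_vector;

/* Persist time, run once: each tagged pointer slot becomes an offset
 * relative to old_base, making the blob position-independent. */
static void make_relative(char *blob, const reloc_vector *rv, uintptr_t old_base)
{
    for (size_t i = 0; i < rv->reloc_count; i++) {
        uintptr_t *slot = (uintptr_t *)(blob + rv->reloc[i]);
        *slot -= old_base;          /* absolute -> base-relative */
    }
}

/* Reload time, at a new address: one fast pass turns the offsets back
 * into valid absolute pointers for this process. */
static void relocate(char *blob, const reloc_vector *rv, uintptr_t new_base)
{
    for (size_t i = 0; i < rv->reloc_count; i++) {
        uintptr_t *slot = (uintptr_t *)(blob + rv->reloc[i]);
        *slot += new_base;          /* base-relative -> absolute */
    }
}
```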
Great writeup; this is more complicated than I thought, but I am getting there.
I see that relocation is also done in zend_persist_op_array_ex(). The compression ratio of the shared file is great: 16 MB of WordPress code compresses to about 3 MB; I hope the ratio will also be good when I compress single scripts. I think it is possible to store compressed code in shared memory alongside its original size and address, and after decompression do the same relocation procedure as in zend_persist_op_array_ex(), or use your zend_accel_script_relocate() on each local decompressed copy. This should still be much faster than compilation, and we can cache five times the data in memory. Memory usage was a problem for us, and disk-backed was too slow (we didn't test the SSD).
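A minimal sketch of this idea: keep each compiled script LZ4-compressed in shared memory together with its original size and base address, and expand a private copy per process before relocating it (see the relocation sketch above). The struct layout and helper names are illustrative assumptions; only the liblz4 calls are real.

```c
#include <stdint.h>
#include <stdlib.h>
#include <lz4.h>      /* LZ4_decompress_safe(), LZ4_compressBound() */
#include <lz4hc.h>    /* LZ4_compress_HC(), the "LZ4H" high-compression mode */

typedef struct {
    size_t    raw_len;    /* uncompressed size */
    uintptr_t old_base;   /* address the script was persisted at */
    size_t    comp_len;
    char      data[];     /* LZ4-compressed compiled script */
} shm_script;

/* Compress one compiled script into a freshly allocated entry; in a
 * real cache this entry would live in the shared segment. */
static shm_script *pack(const char *raw, size_t raw_len, uintptr_t base)
{
    int bound = LZ4_compressBound((int)raw_len);
    shm_script *s = malloc(sizeof *s + bound);
    s->raw_len  = raw_len;
    s->old_base = base;
    s->comp_len = LZ4_compress_HC(raw, s->data, (int)raw_len, bound, 9);
    return s;
}

/* Per process: expand a private copy, which is then relocated from
 * s->old_base to wherever this copy landed. */
static char *unpack(const shm_script *s)
{
    char *raw = malloc(s->raw_len);
    if (LZ4_decompress_safe(s->data, raw, (int)s->comp_len,
                            (int)s->raw_len) < 0) { free(raw); return NULL; }
    return raw;
}
```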
Yup, the compression ratio is good. I've pretty much worked out how to do fast standard support for mod_fcgid, but this really depends on Dmitry et al. being comfortable with zendtech#114. However, I am on a walking holiday for the next week and won't have much access to the Internet, so any more dialogue will have to wait until I get back.
I think I have ironed out the last bug preventing my fork from running correctly on PHP 5.3+. So far it has had very limited testing. The latest sources are on https://github.com/lazy404/ZendOptimizerPlus/tree/fcgid. It's best that the cache file (opcache.mmap_prefix) is located on tmpfs; then on cache restarts the memory is returned to the system. The performance gain is as expected: about 100% more WordPress requests served in a given time. Everything should work as expected.
@lazy404 on #3 and @TvdW on zendtech#108 both raised the issue of creating OPcache models for use with mod_fcgid.
Standard OPcache already supports php-fpm and php-cgi (with a functional SMA, so long as it is built with --enable-fastcgi and run with PHP_FCGI_CHILDREN > 0). However, these solutions don't scale to the typical shared-hosting infrastructure templates where thousands of vhosts are typically configured per server.
Scaling to 1,000+ vhosts requires each vhost to have its own associated UID, with UID-based access control enforced.
IMO, this second requirement means that the PHP processes cannot be forked from a common ancestor (e.g. using mod_ruid2) without introducing exploitable security vulnerabilities; see the cPanel Apache Module: Ruid2 documentation.
MLC OPcache currently assumes one request per PHP image activation, and that the "SMA" is private to the image and effectively restarted between requests. The file-based caches are (by default) per request (e.g. all modules loaded for a request /user/fred/wiki/index.php could come from /user/fred/wiki/index.opcache). So whilst these need to be read each request and are discarded at request shutdown, the cost of reading and LZ4-expanding the compiled content is ~10% of the processing load of the equivalent compile; the I/O load is significantly less and is typically VFS-cached for repeated requests.
Extending MLC OPcache to support php-cgi / fastcgi would be straightforward and (from my timings) still deliver 80% of the benefits of a pure SMA-based approach. However, it might be worth considering a true persistent SMA model, so that where multiple requests to the same image use the same modules, the SMA cache content would be preserved from request to request. There are further integrity issues that would need to be addressed here, however.