
Instrumentation stage



Scope of instrumentation stage

This stage was a simple variant of OPcache acting as a demonstrator of cache persistence in a CLI test environment. My aim here was to achieve a set of objectives with the minimum code development:

  • OPcache could be easily modified in compliance with my implementation principles.
  • Some of the file-based persistence techniques that I developed in my earlier APC-based LPC extension could be layered into OPcache.
  • This could act as a vehicle for gdb-based and built-in diagnostic analysis of both the cache-miss and cache-hit execution paths, in order to investigate and understand the workings of OPcache.
  • The full PHP core ($PHPDIR/tests), Zend core ($PHPDIR/Zend/tests) and standard function ($PHPDIR/ext/standard/tests) test suites could be employed to exercise good coverage of the execution engine / OPcache interaction and to identify critical sensitivities and failures in this approach.

This stage achieved these objectives and has now been archived. As such, this page is obsolete and retained only as a historical record.

Implementation issues

Implementing persistence

The main issue for opcode caching in a CLI environment is that OPcache uses a single persistent shared memory area (SMA) across all PHP worker processes as a working cache. It can use this approach because the PHP processes are persistent, can service multiple requests, and are forked from a master process. By contrast, in the CLI and CGI-based SAPI execution models, each PHP request is executed within its own separate process, with no memory content (including any cache) being preserved across initiations. This implementation addresses this content-preservation issue by saving the SMA to a disk file at image shutdown and reloading it at the next startup. Any save / reload algorithms can be optimized for reload, because this file-based cache will have extremely low volatility in virtually all use cases and the save path is rarely executed. The opcode formats are designed for execution efficiency and have sparse content; the cache is therefore compressed for saving to file and uncompressed on restore. The additional runtime overhead of uncompression is effectively offset by the runtime gains from the reduced size of the cache file to be read (~8x smaller for PHP 5.3 and ~5x for PHP 5.4). This code is based on the LPC equivalents.
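
As an illustration of the save path, the following minimal sketch compresses the SMA image and writes it, prefixed by its uncompressed length, to the cache file. It assumes zlib for the compression step, and the names save_sma_to_file, sma_base and sma_used are placeholders rather than the actual OPcache symbols.

 #include <stdio.h>
 #include <stdlib.h>
 #include <zlib.h>

 /* Hypothetical save path: compress the whole SMA and write it to the cache
  * file, prefixed by the uncompressed length needed by the reload path. */
 static int save_sma_to_file(const char *path, const void *sma_base, size_t sma_used)
 {
     uLongf comp_len = compressBound(sma_used);
     Bytef *comp_buf = malloc(comp_len);
     FILE  *fp       = fopen(path, "wb");
     int    ok       = 0;

     if (comp_buf && fp &&
         compress(comp_buf, &comp_len, (const Bytef *)sma_base, sma_used) == Z_OK) {
         ok = fwrite(&sma_used, sizeof sma_used, 1, fp) == 1 &&
              fwrite(comp_buf, 1, comp_len, fp) == comp_len;
     }
     if (fp) fclose(fp);
     free(comp_buf);
     return ok;   /* non-zero on success */
 }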

By default each initiating script has its own cache file, with OPcache INI parameters enabling regexp substitution based on the resolved scriptname, so that the caches can be named and placed in a separate R/W directory from the script directory. However, the SAPI doesn't make this resolved scriptname available until request startup, so the region-binding code needs to be deferred to request startup. Again, this code is based on its LPC equivalent.

Relative addressing for file-based data-hierarchies

The PHP opcode model uses absolute addressing for internal data structure references. OPcache, like the other mainstream opcode caches, uses an SMA that is located at the same absolute address within each PHP worker process. This is achieved on Linux platforms by forking the PHP interpreter after the anonymous shared region has been allocated, so that all child workers use the same SMA base address. This method doesn't work for CLI models, as Linux makes no guarantees about address consistency between activations, and in practice the SMA base address can vary almost randomly, even if the SMA is allocated soon after startup. This would cause any internal addresses within the SMA to be invalid on reload. The solution here is to identify all internal absolute addresses within the SMA and convert them to relative form, by subtracting the base address of the SMA prior to saving, along with a relocation vector which identifies these addresses; the addresses are then relocated to the new base on reload. This is based on a similar approach and code that was developed for LPC. This instrumentation version simply identifies any size_t value in the address range of the SMA as an SMA-internal pointer.
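
The sketch below shows the essence of this naive tagging, under the assumption that the SMA can be scanned as an array of size_t values; the names make_relative, sma_base and reloc are illustrative, not the actual instrumentation code.

 #include <stddef.h>

 /* Scan the SMA word by word; any value that lies within the SMA's own address
  * range is treated as an internal pointer, recorded in the relocation vector
  * and converted to an offset from the SMA base.  (This in-range test is the
  * source of the false positives discussed below.) */
 static size_t make_relative(void *sma_base, size_t sma_size,
                             size_t *reloc, size_t reloc_max)
 {
     size_t *p    = (size_t *)sma_base;
     size_t *end  = (size_t *)((char *)sma_base + sma_size);
     size_t  base = (size_t)sma_base;
     size_t  n    = 0;

     for (; p < end; p++) {
         if (*p >= base && *p < base + sma_size) {
             if (n < reloc_max) {
                 reloc[n] = (size_t)p - base;  /* offset of the pointer slot itself */
             }
             n++;
             *p -= base;                       /* store as an offset, not an address */
         }
     }
     return n;   /* number of tagged locations; reload adds the new base back */
 }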

Unfortunately, this simplistic approach is vulnerable to false alarms (see below), but (as the test results below show) it works well as a feasibility demonstrator for 64-bit architectures, since the SMA address is a pretty high 64-bit value (e.g. 0x00007f90cbdf3000) which won't occur in the SMA unless it is an internal pointer. The LPC implementation used intelligent tagging of internal pointers, analysing the known structure of the op_array to identify valid pointers for relocation, and this approach will need to be adopted in future versions.

Built-in diagnostic instrumentation

This implementation also includes initial embedded instrumentation to monitor execution. Additional DEBUGn() and ENTER(module) macros have been added to achieve this. These macros generate active code if --enable-opcache-debug is set, and otherwise generate empty source strings. A sketch of possible definitions of these macros is given after the list below.

  • The DEBUGn() macros provide a clear and simple printf-style method of adding low-footprint built-in diagnostics. Their first argument is a mask value which determines when the diagnostic category is enabled by the opcache.debug_flags INI setting. For example, -d opcache.debug_flags=32 turns on function entry tracing. And of course individual logging categories can be dynamically enabled / disabled during execution by bit-diddling ZCG(accel_directives.debug_flags).
  • All functions have the ENTER() macro appended to their opening brace, so for example in zend_shared_alloc() this becomes: {ENTER(zend_shared_alloc). Having this type of diagnostic available is invaluable because the use of gdb itself can affect the execution paths in the code. An example here is that running multiple runs of the PHP interpreter in the same gdb execution results in the same SMA address being allocated across runs. As discussed above, this isn't the case for separate runs in separate (non-debugger) executions. Diagnosing some aspects of this type of issue really relies on this instrumentation.
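
A minimal sketch of how such macros could be defined follows; the OPCACHE_DEBUG guard name and the exact message formatting are assumptions for illustration rather than the actual definitions.

 #include <stdio.h>

 #ifdef OPCACHE_DEBUG   /* set by --enable-opcache-debug (guard name assumed) */
 # define DEBUG1(mask, fmt, a1)                                                    \
     do {                                                                          \
         if (ZCG(accel_directives.debug_flags) & (mask)) {                         \
             fprintf(stderr, "OPcache %s:%d " fmt "\n", __FILE__, __LINE__, (a1)); \
         }                                                                         \
     } while (0)
 # define ENTER(fn) DEBUG1(0x20, "enter %s", #fn);   /* 0x20 = 32, function-entry tracing */
 #else                  /* debug disabled: macros expand to nothing */
 # define DEBUG1(mask, fmt, a1)
 # define ENTER(fn)
 #endif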

Interim conclusions

Having developed this build through unit test and run it against the core PHP test suite, I am now in a position to answer the following questions.

  • Does this approach work? Well, for all bar a couple of the some 4,800 test cases in the PHP core, Zend core and standard function tests the answer seems to be yes. This is described in further detail in the following section.
  • Does it perform well? I haven't looked at this in depth as this isn't an objective of this build; it's a waypoint, not an endpoint. However, doing some quick time tests of some MediaWiki maintenance scripts, this cached implementation (compiled with debug) roughly halved the image startup and compile overhead of running the scripts, that is, going from a 40:60% split of image activation:PHP compilation to an 80:20% split of image activation:PHP cache file load. This isn't a rigorous benchmark, but it at least indicates that a file-based OPcache would still deliver real benefits in terms of both CPU and file I/O overheads.
  • Do I understand the shortfalls of this demonstration and know what to address in a full pre-production development? Yes, and I summarise this in the last section.

First results from the PHP core and Zend test suites

I have developed a simple wrapper script which executes a variant of the run-tests.php invocation generated by the appropriate make test command. However, the tests are run three times:

  • The first time through tests OPcache in much the same way as the classical opcache.enable_cli=1 does. However, with this implementation, each script execution generates a file-based cache for use on subsequent executions.
  • The second time through tests the script retrieved from the cache.
  • The third time through is a repeat for consistency.
 -----------------------------------------------------------------------------------------
                    Priming Cache             1st run using Cache      2nd run using Cache
 -----------------------------------------------------------------------------------------
 Number of tests : 5147                       4831
 Tests skipped   :  316 (  6.1%) --------
 Tests warned    :    0 (  0.0%) (  0.0%)
 Tests failed    :    7 (  0.1%) (  0.1%)        6 (  0.1%) (  0.1%)      7 (  0.1%) (  0.1%)
 Expected fail   :    6 (  0.1%) (  0.1%)
 Tests passed    : 4818 ( 93.6%) ( 99.7%)     4819 ( 93.6%) ( 99.8%)   4818 ( 93.6%) ( 99.7%)
 -----------------------------------------------------------------------------------------
 Time taken      :  314 seconds              284 seconds               297 seconds
 -----------------------------------------------------------------------------------------

At this stage, the points that I want to emphasise are:

  • This simple approach already works for all bar a few of the test cases, and provides a solid basis for further development.
  • I now have a convenient basis for test vectors and for doing gdb-based debugging of the cached execution path within OPcache.
  • Examination of this small variation in test failures shows that they are all consistent with false hits during relocation address tagging.

Changes that will be needed for pre-production

This implementation has achieved the objectives that I set for it. However, it contains a number of design compromises that keep it simple, but that I want to address properly in a production implementation. Nonetheless, this simple version gives me enough confidence to start work on a pre-production implementation.

File cache strategy

The current strategy of saving the entire memory cache to file as a single image will have unacceptable performance overheads for some applications; for example, MediaWiki adopts a unified entry point through its index.php for all interactive requests, and this means that the corresponding file cache will contain some 1,000 cached PHP opcode modules, yet fewer than 100 are included during the execution path for guest queries of cached pages. A full load of the 1,000 vs the 100 adds roughly a factor of 10 to the I/O + decompression overheads, as well as to the memory overhead of the memory cache itself.

A sounder approach would be to adopt a proper multi-level cache, with a lazy on-demand fetch of individual modules from the file cache into the memory cache, so that the I/O and load overheads become directly proportional to the number of modules included during the execution. I used a similar approach in LPC, but via a separate CacheDB extension. I don't want to introduce a dependency on a separate extension into OPcache, so I will incorporate a stripped-down version of this code. This broadly uses a cdb-style file-based (piecewise) constant D/B.
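
As a sketch of the intended lazy fetch (with hypothetical helper names, since none of these functions exist yet in OPcache), the lookup would be layered roughly as follows:

 #include <stddef.h>

 typedef struct cached_script cached_script;

 /* Hypothetical helpers: an in-memory (SMA) cache index and a cdb-style
  * file-cache fetch that returns one serialised module on demand. */
 cached_script *memory_cache_find(const char *key);
 cached_script *memory_cache_insert(const char *key, void *blob, size_t len);
 int            file_cache_fetch(const char *key, void **blob, size_t *len);

 static cached_script *get_script(const char *key)
 {
     cached_script *script = memory_cache_find(key);
     if (script) {
         return script;              /* hot path: module already in the memory cache */
     }

     void  *blob;
     size_t len;
     if (file_cache_fetch(key, &blob, &len) != 0) {
         return NULL;                /* full miss: caller falls back to compiling */
     }

     /* Only modules actually included by this request are read and decompressed,
      * so I/O cost is proportional to what the script really uses. */
     return memory_cache_insert(key, blob, len);
 }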

An incidental and beneficial side-effect is that using a multi-level cache actually minimises the changes needed to the existing OPcache code as there is no longer the need to prime the memory cache from file at startup.

SMA Implementation

Using, or rather abusing, an mmapped SMA implementation needs to go. The cleanest approach here is to echo the shared_alloc_shm.c approach of having multiple memory segments, but simply mallocing each rather than using a true shared memory API. Note that OPcache traverses the compiler's opcode structures twice. The first pass computes the total memory requirement for the opcode module; this enables the memory to store the opcode module to be allocated as a single block, using simple first fit from the available memory segments, with one call to the shared allocator.

One slight complication with this new shared_alloc_malloc.c is that in the case where the block can't be allocated from one of the existing segments, any new segment will be allocated on a lazy basis. This is to minimise the memory footprint of the PHP process.

On the second pass, the opcode structures are copied into the opcode module with individual storage elements being allocated using a simple serial allocator.
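
A rough sketch of what such a shared_alloc_malloc.c allocator might look like follows; the structure names and the fixed segment table are simplifications for illustration only.

 #include <stdlib.h>

 #define MAX_SEGMENTS 16

 typedef struct {
     char  *base;
     size_t size;
     size_t used;
 } malloc_segment;

 static malloc_segment segments[MAX_SEGMENTS];
 static int            nsegments;
 static size_t         segment_size = 8 * 1024 * 1024;  /* arbitrary for the sketch */

 /* First-fit allocation of one opcode-module block from the malloc'd segments;
  * a new segment is only malloc'd lazily when no existing segment has room,
  * keeping the process memory footprint down. */
 static void *segment_alloc(size_t len)
 {
     int i;

     for (i = 0; i < nsegments; i++) {
         if (segments[i].size - segments[i].used >= len) {
             void *p = segments[i].base + segments[i].used;
             segments[i].used += len;
             return p;
         }
     }
     if (nsegments < MAX_SEGMENTS && len <= segment_size) {
         segments[nsegments].base = malloc(segment_size);
         if (segments[nsegments].base) {
             segments[nsegments].size = segment_size;
             segments[nsegments].used = len;
             return segments[nsegments++].base;
         }
     }
     return NULL;   /* out of cache memory */
 }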

Module relocation

In fact OPcache already includes relocation code in zend_persist.c and a HashTable xlat_table of old-to-new address mappings, but this source file is going to require quite a lot of changes to support relocation tagging, as it doesn't currently track the locations which have been changed, so that, for example:

 op_array->scope = zend_shared_alloc_get_xlat_entry(op_array->scope);

would need to be replaced by

 op_array->scope = zend_shared_alloc_get_xlat_entry(op_array->scope);
 TAG_PTR(op_array->scope);

etc. where TAG_PTR() is a macro of the form:

 #define TAG_PTR(p) if (ZCG(use_file_cache) && p) zend_shared_tag_ptr(&p)

The main downside of this is that it would require major changes to zend_persist.c. However, the Zend team have already adopted a two-pass architecture with a second module, zend_persist_calc.c, mirroring its structure to calculate the persistent script storage requirements. I will follow this model by moving the relocation computation into its own third pass, zend_persist_prepare.c, again mirroring this structure. The relocation functionality will, as with LPC and the current instrumentation implementation, generate a relocation byte vector (rbvec) which can be used to carry out efficient address relocation on script read-in, though with one rbvec per opcode module rather than one for the entire cache. The rbvecs only need to be stored in the file cache.
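
To illustrate why the rbvec approach is cheap on the read-in path, here is a sketch of applying a per-module relocation byte vector; the encoding assumed (successive deltas between tagged pointer slots, in sizeof(size_t) units, 0-terminated, with larger gaps needing an escape code that is omitted here) is an assumption for illustration, not the actual LPC/OPcache format.

 #include <stddef.h>

 /* Walk the relocation byte vector and convert each tagged offset back into an
  * absolute address based at the module's new location in the memory cache. */
 static void relocate_module(void *module_base, const unsigned char *rbvec)
 {
     size_t *slot = (size_t *)module_base;

     while (*rbvec) {
         slot  += *rbvec++;               /* step to the next tagged pointer slot */
         *slot += (size_t)module_base;    /* stored offset -> absolute address */
     }
 }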

For PHP > 5.3, relocation will also need to handle de-interning of strings.

Footnotes

  1. An example of the false detections to which this simple scan approach is vulnerable occurred in zend_shared_alloc_startup(), which stack-allocates a temporary zend_smm_shared_globals structure and subsequently copies this to the SMA. Due to field ordering, this contains an uninitialised 4-byte aligned filler, and stepping through the debugger showed that this contained the most significant half of the SMA address. As the SMA is the default 128M in length, the adjacent 4 bytes fall within the SMA range with probability roughly 128M/4G, so there is about a 1-in-32 chance that this 8-byte sequence will falsely match an SMA address; this caused ~3% of the tests to fail, and I had to explicitly zero this temp structure to get the above results. This is why I need to use explicit tagging in any production version.