PHP: The Dark Magic of Optimization
I recently cut a PHP script's execution time to a hundredth of the original by changing just a few characters in the source code. How is that possible? The dramatic speedup comes from using references and assignments appropriately. I'll show you how it works. Don't believe the sensational headline; there is no black magic involved. You just need to understand how PHP works internally. But don't worry, it's nothing too complicated.
In-depth Reference Counting
The PHP core stores variable names separately from their values in memory. An anonymous value is described by the structure zval. Besides the raw data, it holds information about the type (boolean, string, etc.) and two additional items: refcount and is_ref. Yes, refcount is exactly the counter for the aforementioned reference counting.
$abc = 'La Trine';
What does this code actually do? It creates a new zval value in memory, whose data section holds the 8 characters La Trine and whose type is marked as string. At the same time, a new entry abc is added to the variable table, referring to this zval. Additionally, the refcount counter in the zval structure is initialized to one, because exactly one variable ($abc) points to it.
// 10MB string
$sA = str_repeat(' ', 1e7);
$sB = $sA;
How does PHP handle the assignment on the second line? Of course, it creates a new record sB in the variable table. Now watch: the record will refer to the same zval that sA already refers to. It also increments the refcount.
This is great! There's no need to take up another 10MB of memory, no time-consuming data copying. The operation is lightning-fast.
But from the perspective of a PHP programmer, these are two different variables. What if I change one?
$sB .= 'the end';
No worries, everything is taken care of. When a write to the variable is requested, PHP looks at the referenced zval and checks the refcount. If refcount > 1, the entire zval value is duplicated and sB will refer to this copy. Of course, the refcount of the original zval is also decremented.
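The effect described above can be observed from userland. The following is a minimal sketch, not a benchmark: the function name measure_copy_on_write is made up for illustration, and the exact byte counts reported by memory_get_usage() will differ between PHP builds. It only assumes that the assignment itself is cheap and that the first write pays for the copy.

```php
<?php
// Hypothetical helper: reports how memory use changes after an
// assignment and after the first write to the assigned variable.
function measure_copy_on_write()
{
    $start = memory_get_usage();
    $sA = str_repeat(' ', 10000000);   // 10MB string
    $afterCreate = memory_get_usage();

    $sB = $sA;                          // assignment only: the value is shared
    $afterAssign = memory_get_usage();

    $sB .= 'the end';                   // first write: $sB gets its own copy
    $afterWrite = memory_get_usage();

    return array(
        'create' => $afterCreate - $start,
        'assign' => $afterAssign - $afterCreate,
        'write'  => $afterWrite - $afterAssign,
        'lenA'   => strlen($sA),        // the original must stay untouched
        'lenB'   => strlen($sB),
    );
}

print_r(measure_copy_on_write());
```

On a typical build the 'assign' delta should be close to zero, while 'write' should be roughly the size of the string.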
For completeness, I'll add that the command unset($sB) will remove the sB record from the variable table and decrement the respective refcount. Once the refcount drops to zero, the zval structure is freed from memory, as no variable refers to it anymore.
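To make the bookkeeping tangible, here is a toy model of the variable table and the refcount written in PHP itself. Everything in it (the class ToyVars and its methods) is invented for illustration; the real structures live in C inside the engine and are more involved. It only mirrors the three rules described so far: assignment shares the value, a write separates it when refcount > 1, and unset frees the value when refcount reaches zero.

```php
<?php
// Toy model only: a hypothetical userland imitation of the engine's bookkeeping.
class ToyVars
{
    public $zvals = array();   // id => array('value' => ..., 'refcount' => int)
    public $table = array();   // variable name => zval id
    private $nextId = 0;

    // Models: $name = 'literal';
    public function set($name, $value)
    {
        $this->drop($name);
        $id = $this->nextId++;
        $this->zvals[$id] = array('value' => $value, 'refcount' => 1);
        $this->table[$name] = $id;
    }

    // Models: $dst = $src;  (shares the zval and increments refcount)
    public function assign($dst, $src)
    {
        $id = $this->table[$src];
        $this->zvals[$id]['refcount']++;   // increment first, so $a = $a is safe
        $this->drop($dst);
        $this->table[$dst] = $id;
    }

    // Models: $name .= $suffix;  (separates first when the zval is shared)
    public function append($name, $suffix)
    {
        $id = $this->table[$name];
        if ($this->zvals[$id]['refcount'] > 1) {
            $this->zvals[$id]['refcount']--;
            $copy = $this->nextId++;
            $this->zvals[$copy] = array(
                'value'    => $this->zvals[$id]['value'],
                'refcount' => 1,
            );
            $this->table[$name] = $copy;
            $id = $copy;
        }
        $this->zvals[$id]['value'] .= $suffix;
    }

    // Models: unset($name);
    public function drop($name)
    {
        if (!isset($this->table[$name])) {
            return;
        }
        $id = $this->table[$name];
        unset($this->table[$name]);
        if (--$this->zvals[$id]['refcount'] === 0) {
            unset($this->zvals[$id]);      // nobody refers to it anymore
        }
    }
}
```

Calling set('sA', ...), assign('sB', 'sA') and then append('sB', 'the end') on such an object walks through the same states as the example with $sA and $sB.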
Classic References in Depth
Is everything clear so far? Let's move on to the second lesson and show how the core deals with classic references.
$a = 'La Trine';
$b = & $a;
You already know how PHP executes the first line. But what happens under the hood in the case of the second line? When I described the zval structure, I mentioned is_ref. It's a boolean indicating whether the zval value is a reference or not. And right now, its moment to shine has come.
PHP creates the variable $b just as in the example without a reference, but additionally sets is_ref to true. At this point, both $a and $b (both!) become references as we know them.
The significant difference comes when we try to change one of the variables. Because is_ref is true, the test on refcount is skipped along with the entire duplication mechanism. The common zval value is modified directly. Although… but we'll get to that soon.
We can create additional references with $xyz = & $a or cancel them with unset($b); the principle remains the same. The core works with the variable table and updates the refcount.
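The visible side of this behavior can be checked with a few lines. This is a small sketch; the function name reference_demo is made up for illustration, and it shows only what a script can observe, not the internal fields.

```php
<?php
// Hypothetical demo: several names bound to one value by reference.
function reference_demo()
{
    $a = 'La Trine';
    $b = &$a;            // $a and $b now share one value as references
    $b .= '!';           // written through $b, visible through $a
    $afterAppend = $a;

    $xyz = &$a;          // a third name for the same value
    unset($b);           // removes only the name $b, the value stays
    $xyz = 'changed';    // written through $xyz, visible through $a

    return array(
        'afterAppend' => $afterAppend,
        'a'           => $a,
        'bExists'     => isset($b),
    );
}

print_r(reference_demo());
```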
Is everything still clear? If not, try reading the article again more slowly, because maximum concentration is needed now.
The Charm Slowly Disappears
Think about how PHP executes the following code:
$a = 'La Trine';
$b = & $a;
$c = $a;
Variables $a and $c refer to the same zval, which has is_ref unset. But variables $a and $b need to have is_ref set. This can only be resolved by having two zval values. In other words, line No. 3 must duplicate the zval value.
The algorithm for creating new variables must therefore be supplemented with a condition: if refcount > 1 and the required is_ref “does not match”, then the value is simply duplicated, no questions asked.
Similarly, duplication will also occur in this case:
$a = 'I love La Trine :-)';
$b = $a;
$c = & $a;
See that? Creating a reference duplicates the variable's value. The copy, with is_ref set, will be referred to by variables $a and $c (just for completeness, refcount = 2).
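Both orderings can be tried out with the sketch below. The function names are invented for illustration. The first function shows only the observable semantics: the plain copy stays independent while the reference pair moves together. The second one is a probe that reports how many bytes the plain assignment from a referenced variable costs on your own build; it deliberately assumes no particular number, since the internals described here are those of an engine whose zval carries refcount and is_ref.

```php
<?php
// Hypothetical demo: a reference pair plus a plain copy of the same value.
function mix_reference_and_copy()
{
    $a = 'La Trine';
    $b = &$a;        // reference pair ($a, $b)
    $c = $a;         // plain copy, must not follow later changes to $a
    $c .= ' (c)';
    $b .= ' (b)';    // changes $a as well

    return array($a, $b, $c);
}

// Hypothetical probe: memory delta of "$c = $a" when $a is a reference.
function probe_copy_from_reference()
{
    $a = str_repeat(' ', 10000000);   // 10MB string
    $b = &$a;
    $before = memory_get_usage();
    $c = $a;
    return memory_get_usage() - $before;
}

print_r(mix_reference_and_copy());
echo probe_copy_from_reference(), " bytes\n";
```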
You might now be wondering what kind of madness this is and why the PHP core is so poorly designed. Trust me, it's not. It's the common issue of shared vs. exclusive access, just under a different name. It could be avoided, but changing the design would complicate variable handling so much that it would be counterproductive overall.
Script Optimization
Finally, I can explain the trick behind the optimization of the mentioned script. It included the following code:
...
$arr = &$this->table;
foreach ($ngram as $token) {
    // if (!array_key_exists($token, $arr)) {
    //     $arr[$token] = array();
    // }
    $arr = &$arr[$token];
}
...
It might seem that the success is due to removing the function array_key_exists, which is probably so terribly slow that it dragged everything down. Just for fun: whoever thought that owes me a jar of Nutella. Nope. The problem is buried elsewhere.
Now you know that the passed variable $arr refers to a zval with the is_ref bit set and refcount = 2 (the value is referred to by $arr and simultaneously by the array element itself). What is crucial is that this zval holds a huge array.
When it is passed to the function array_key_exists, which takes its parameters by value, the inevitable happens: the zval must be duplicated. That literally slams the brakes on the running script. If instead a function that takes its parameter by reference were called, such as key(), or if we used the forbidden call-time pass-by-reference syntax to force the argument by reference, array_key_exists($token, &$arr), no copying would occur. And the script would speed up by 600×.
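The loop without the check can be sketched as a standalone function. The name insert_ngram and the plain array standing in for $this->table are assumptions made for illustration; the original class is not shown in the article. The sketch relies on the fact that taking a reference to a missing array element creates that element, so the walk works without any existence test.

```php
<?php
// Hypothetical standalone version of the loop: walks (and builds) a
// nested array along the tokens of one n-gram, using references only.
function insert_ngram(array &$table, array $ngram)
{
    $arr = &$table;
    foreach ($ngram as $token) {
        // Taking a reference to a missing element creates it (as null);
        // indexing into that null on the next step turns it into an array.
        $arr = &$arr[$token];
    }
}

$table = array();
insert_ngram($table, array('a', 'b', 'c'));
insert_ngram($table, array('a', 'b', 'd'));
print_r($table);
```

If an explicit existence check is wanted, isset($arr[$token]) is one option to consider: it is a language construct rather than a function, so the array is not handed over as a function argument. Note that isset() reports false for elements holding null.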
White Magic of Optimization
My goal was to dispel superstitions and myths around references. That they're like pointers, that they speed up code. The truth is that all variables are essentially pointers. They just differ in how the PHP core handles them.
If you understand these principles, you can use them to your advantage (I emphasize the word “can”). You can handle strings or arrays more efficiently. Once they get into your blood, you will use them subconsciously, and they will become part of your coding standard.