To escape or not to escape

I was trying to find a better way to store strings in my database, and after hours of researching string escaping, mysql_real_escape_string, preg_quote and other variants, I decided to rethink the whole process. I needed a solution that would completely eliminate the escaping issue and would work with any database (SQL or NoSQL, such as MongoDB).
On the first pass I came up with this line:
preg_replace("/([^A-Za-z ,]{1})/eu", "' ' . ord(substr('\\1', strlen('\\1')-1, 1)) . ' '", $text);

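It converts every character other than letters, spaces and commas to its numeric code, surrounded by spaces. By taking only the last byte of each match, it also undoes the escaping of double quote characters that preg_replace performs on backreferences when the /e modifier is used. Note that /e was deprecated in PHP 5.5 and removed in PHP 7.0, so this line only runs on older versions.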
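As a rough illustration of what the replacement expression computes for a single match, here is a minimal sketch. The helper last_byte_code is hypothetical (it is not part of the one-liner); it only mirrors the ord(substr(...)) expression so the result can be inspected for an ASCII character, an escaped quote and a multi-byte character.

```php
<?php
// last_byte_code() is a hypothetical helper that mirrors the expression
// ord(substr($match, strlen($match) - 1, 1)) used in the one-liner.
function last_byte_code($match) {
    return ord(substr($match, strlen($match) - 1, 1));
}

// An ASCII character is one byte, so its code is captured completely.
$bang = last_byte_code('!');                 // 33

// With /e a double quote arrives escaped as \" ; the last byte is the quote itself.
$quote = last_byte_code('\\"');              // 34

// "é" in UTF-8 is two bytes (0xC3 0xA9); only the second one is kept.
$e_acute = last_byte_code("\xC3\xA9");       // 169
$all_bytes = array_map('ord', str_split("\xC3\xA9"));  // array(195, 169)
```

The last two values show the gap: the character is made of bytes 195 and 169, but only 169 ends up in the output.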
However, as elegant and simple as it may appear, this solution works only for ASCII-encoded strings: for a multi-byte character it keeps just the last byte. To make it more universal I needed to handle UTF-8. So here comes stage two:
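/*
 * Replace every character other than A-Za-z, space and comma with numeric codes.
 * A single byte becomes three digits; a multi-byte UTF-8 character becomes
 * 0y<three digits per byte>y0. Pass $decode = true to reverse the mapping.
 */
function remap_string( $text, $decode=false ) {
    if ($decode) return preg_replace_callback(
            // patterns are applied in order: multi-byte groups first, then single bytes
            array("/(0y([0-9]{3})+y0)/","/([0-9]{3})/"),
            function($matches) {
                // only the first pattern has a second capture group
                if (isset($matches[2])) {
                    // strip the 0y / y0 markers and rebuild the original bytes
                    $code = substr($matches[0],2,strlen($matches[0])-4);
                    $r = '';
                    foreach(str_split($code, 3) as $c) $r .= chr((int)$c);
                    return $r;
                } else return chr((int)$matches[0]);
            },
            $text);

    return preg_replace_callback(
            "/([^A-Za-z ,]{1})/u",
            function($matches) {
                $l = strlen($matches[0]);
                // single-byte character: one zero-padded, three-digit code
                if (1==$l) return str_pad(ord($matches[0]), 3, '0', STR_PAD_LEFT);
                // multi-byte character: one code per byte, wrapped in markers
                $a = '';
                for($i=0;$i<$l;$i++) $a.=str_pad(ord($matches[0][$i]), 3, '0', STR_PAD_LEFT);
                return "0y" . $a . "y0";
            },
            $text);
}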
To my surprise, when I did some benchmarking, preg_replace_callback turned out to be a lot faster than the regular preg_replace with the /e modifier.


To encode your data, call the function as
$encoded_string = remap_string( $your_string );
To decode it, pass true as the optional second parameter ($decode):
$decoded_string = remap_string( $encoded_string, true );
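For a feel of what the encoded form looks like, here is a minimal sketch of the per-character mapping. The helper encode_match is hypothetical: it mirrors what the encoder callback does for one matched character and is not part of the function above.

```php
<?php
// encode_match() is a hypothetical stand-in for the encoder callback:
// it takes one matched character (one or more bytes) and returns its code.
function encode_match($ch) {
    $codes = '';
    foreach (str_split($ch) as $byte) {
        // every byte becomes a zero-padded, three-digit decimal code
        $codes .= str_pad((string) ord($byte), 3, '0', STR_PAD_LEFT);
    }
    // single bytes stay bare; multi-byte characters get the 0y ... y0 markers
    return strlen($ch) === 1 ? $codes : '0y' . $codes . 'y0';
}

$bang    = encode_match('!');          // "033"
$digit   = encode_match('7');          // "055"
$e_acute = encode_match("\xC3\xA9");   // "0y195169y0" (é as UTF-8)

// Decoding a single three-digit group is just chr() on its integer value.
$back = chr((int) $bang);              // "!"
```

Digits are encoded like any other non-letter, which is what lets the decoder treat every run of three digits as a character code.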
Discussion
The problems with escaping come from the surface where strings and parsers intersect. If parsers had a reserved code to mark the start and end of a string, we wouldn't have all this mess in the first place. Much like the old-timer uuencode, the remap_string function reduces this surface to a range that is safe for any type of transmission. Unlike uuencode, it does not alter spaces. This allows you to encode any string out there without breaking word boundaries, which means you can still run fancy search queries (including regexp in MongoDB) without a problem. ... I think.
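The word-boundary point can be sketched as follows. This is only an illustration under assumptions: encode_sketch is a hypothetical, encode-only stand-in for remap_string, and the search is a plain strpos on a PHP string rather than a real database query.

```php
<?php
// encode_sketch() is a hypothetical, encode-only stand-in for remap_string().
function encode_sketch($text) {
    return preg_replace_callback('/[^A-Za-z ,]/u', function ($m) {
        $codes = '';
        foreach (str_split($m[0]) as $byte) {
            $codes .= str_pad((string) ord($byte), 3, '0', STR_PAD_LEFT);
        }
        return strlen($m[0]) === 1 ? $codes : '0y' . $codes . 'y0';
    }, $text);
}

// "Café menu: tea, coffee & cake" with é written as UTF-8 bytes.
$stored = encode_sketch("Caf\xC3\xA9 menu: tea, coffee & cake");

// Spaces survive, so the stored value splits into the same six words.
$words = explode(' ', $stored);

// Encode the search term the same way before matching it against stored data.
$needle = encode_sketch("Caf\xC3\xA9");
$found  = strpos($stored, $needle) === 0;
```

The key design point is that the search term has to go through the same encoding as the stored value; searching for the raw term would not match.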
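The encoded output is plain ASCII (letters, digits, spaces and commas), so you can store pretty much anything in this format. Two caveats apply to calling it binary safe. First, the /u modifier makes preg_replace_callback return NULL on input that is not valid UTF-8, so for arbitrary binary data you would drop /u and let every byte be encoded on its own. Second, because the letter y is left unencoded, an input such as 2y1y2 encodes to 050y049y050, which the decoder misreads as containing a 0y...y0 group. In my development, however, I mostly deal with input from users, which is primarily text, hence my approach of leaving the alpha text as is and using numbers for the rest. The data is 90 to 98% composed of A-Za-z, commas and other punctuation signs. This means I can keep the added growth to a minimum (each encoded byte costs three characters), so indexing performance shouldn't suffer much.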
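To put a number on the growth for your own data, a rough estimate can be computed without encoding anything. The helper estimated_growth below is hypothetical and only applies the size rules of the encoder: a kept character stays one byte, an encoded single byte becomes three digits, and an encoded multi-byte character becomes three digits per byte plus four bytes for the markers.

```php
<?php
// estimated_growth() is a hypothetical helper, not part of the original code.
// It returns encoded length divided by original length for valid UTF-8 input.
function estimated_growth($text) {
    $original = strlen($text);
    if ($original === 0) {
        return 1.0;
    }
    // same character class as the encoder: everything matched here gets encoded
    preg_match_all('/[^A-Za-z ,]/u', $text, $m);
    $encoded = $original;
    foreach ($m[0] as $ch) {
        $n = strlen($ch);
        // 1 byte -> 3 digits (+2); n bytes -> 3n digits + "0y" + "y0" (+2n+4)
        $encoded += ($n === 1) ? 2 : (2 * $n + 4);
    }
    return (float) $encoded / $original;
}

$plain = estimated_growth("hello world, again");  // nothing to encode
$mixed = estimated_growth("Hi there!");           // one encoded byte out of nine
```

Text made only of letters, spaces and commas does not grow at all, while a string made entirely of two-byte characters grows fivefold, so the estimate is worth checking against a sample of real input.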
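Of course, you can alter the character class in the encoder's regular expression to let some other characters through unchanged, as long as digits stay encoded, because the decoder treats every run of three digits as a character code. I left only A-Za-z, space and comma, simply because it suits my goals.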
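One way to experiment with a different set of unchanged characters is to build the pattern from a parameter and look at what it would encode. This is a sketch under assumptions: build_encoder_pattern and chars_to_encode are hypothetical helpers, and $keep is pasted into the character class as is, so the caller is responsible for escaping characters such as ], ^, - and backslash.

```php
<?php
// build_encoder_pattern() is a hypothetical helper. $keep is the inside of a
// regex character class listing the characters to leave unchanged.
// Digits must not be added to $keep: the decoder relies on them being encoded.
function build_encoder_pattern($keep = 'A-Za-z ,') {
    return '/[^' . $keep . ']/u';
}

// chars_to_encode() lists the characters the encoder would replace.
function chars_to_encode($text, $keep = 'A-Za-z ,') {
    preg_match_all(build_encoder_pattern($keep), $text, $m);
    return $m[0];
}

// Default class: the three dots and the question mark would be encoded.
$default = chars_to_encode("Wait... really?");

// Relaxed class that also keeps '.' and '?': nothing is left to encode.
$relaxed = chars_to_encode("Wait... really?", 'A-Za-z ,.?');
```

Whatever is added to the kept set reaches the database unencoded, so it only makes sense for characters that are harmless to the parser you are targeting.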
