Category Archives: Code

Under the hood of my Hack24 entry

I competed by myself at Hack24. I don’t see competing alone as a big problem personally, as it cuts the communication overhead down enormously. 😉

For those that haven’t heard of hackathons before, the concept is fairly simple:

  • Teams of 1-4 people.
  • Multiple challenges (at least in this event).
  • 24 hours to build a solution.
  • Only code you have written during the event (or Open Source libraries/apps) can be used.

It is quite gruelling to compete in these events; time is limited and you need to prioritise it carefully across features and components so you don’t end up with, say, a complete backend but no frontend.

Generally you omit anything non-essential, such as login systems, logging or some error checks.

I used the event to test some of the ideas and applications (microservices, RabbitMQ, Riak, Nginx, php-fpm etc.) I have been using on larger projects in a smaller, time-constrained project – and I think the outcome proves that the approach works.

I am a big fan of microservices; breaking your app into small, testable components with fixed interfaces is great for quick building and easy debugging – to test whether something is working, inject a message into the input and see what comes out. Once a component works you can mostly treat it as a single black box and move on to the next piece. It keeps your mind focused and, at least for me, reduces stress.
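To make that concrete, here is a minimal sketch of the idea (the worker and message shape below are invented for illustration, not taken from the hack24 code): each worker is just a transformation from an input message to an output message, so testing it means injecting a message and inspecting the result.

```php
<?php
// Hypothetical filter worker: fixed interface in, fixed interface out.
// In the real system the message would arrive via RabbitMQ; here we call it directly.
function filterWorker(array $message)
{
    // Transform the body, then route the message on to the next worker.
    $message['body']  = strtoupper($message['body']);
    $message['route'] = 'sender';
    return $message;
}

// Testing is just: inject a message, inspect the output.
$out = filterWorker(array('body' => 'hello', 'route' => 'filter'));
var_dump($out['body']); // string(5) "HELLO"
```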

I am a long-time PHP coder, so when it came to quick prototyping it made sense to use PHP with the lightweight Slim framework – both for the frontend and for all the workers on the backend, all running on top of a single Vultr VPS [disclosure: url contains affiliate link].

Screenshot

SMS proxy – my hack24 entry

I broke the project into 4 main parts:

  • Frontend (website plus the SMS callback hook for receiving messages from Esendex, the challenge sponsor)
  • Receiving worker (takes messages from the callback, looks up entries in Riak and routes them to the next processing step – either the filter or directly to the sending worker)
  • Filter (responsible for transforming the message before routing it to the sending worker)
  • Sending worker (responsible for sending the message out via Esendex)

I am open-sourcing the code as a learning tool for other people (and perhaps so I can abuse it in later events…)

You are welcome to ask any questions and I will do my best to answer them.

If you want to contact me, please use twitter or email mike @ technomonk . com

10 things any newbie web developer needs to know

PHP or other language

I’m not going to preach about the pros and cons of different languages; suffice it to say that any sufficiently advanced website (indistinguishable from magic, as the saying goes) requires more than just static files to make it work. PHP is my language of choice (and has been for over a decade) but there are plenty of other options out there (Python, Ruby or even (shudder) .net).

Which you choose depends more on what is most comfortable to you and what support network you have to call upon. It isn’t generally a good idea to learn Python in a Microsoft shop for example unless there are other pressing reasons to do so.

Whatever you end up learning, you need to be competent in its use, make use of the extensions available (no point reinventing the wheel when there are dozens already made for you) and know how to use the online documentation. Coding style helps, but as long as you are consistent it really isn’t that pressing unless you are working with others.

XHTML/CSS

Even if you don’t deal with much front-end code yourself (perhaps you use templates), you still need to understand at least the basics of XHTML and CSS to make it easier for the front-end developers to deal with your output. At the very least your code needs to output well-formed, correctly nested markup. We mean

<div><p><b><i>hello world</i></b></p></div>

rather than

<div><p><i><b>hello world</i></b></div>

(note the missing closing </p> tag and the improperly nested <i> and <b> tags). An XHTML validator will help, but only on the finished output; validators don’t work on fragments of code like this.
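If you do want to check a fragment programmatically, PHP’s bundled DOM extension can be pressed into service. This is just a sketch: it wraps the fragment in a dummy root element so loadXML will accept it, and it checks well-formedness only, not validity.

```php
<?php
// Returns true if the fragment is well formed (properly nested, all tags closed).
function isWellFormed($fragment)
{
    libxml_use_internal_errors(true);   // collect parse errors instead of printing warnings
    $dom = new DOMDocument();
    $ok  = $dom->loadXML('<root>' . $fragment . '</root>');
    libxml_clear_errors();
    return $ok === true;
}

var_dump(isWellFormed('<div><p><b><i>hello world</i></b></p></div>')); // bool(true)
var_dump(isWellFormed('<div><p><i><b>hello world</i></b></div>'));     // bool(false)
```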

MVCs, frameworks and libraries

As stated above, and especially if you are getting paid on a per-project basis, you want to reduce the number of times you reinvent the wheel. This is known as Don’t Repeat Yourself, or DRY. There is no point writing yet another PDF library, for example, when FPDF, Zend_PDF, Pdflib and many others already exist (or docbook/ps via converters etc) and do an adequate job for most tasks.

MVCs, or Model-View-Controllers to give them their full name, are a design pattern that lets you separate the data access parts from the actual logic and the templates used in your projects. This is useful in a team-based environment because the designer can design, the coders can code, and the coders and DBAs can control the models and how they interact with the databases. It separates the roles more cleanly and makes modifications at a later date easier. If you want to add sharding to the database, you can simply modify the models while keeping their interface the same, and suddenly your app is cluster aware. It’s really a win-win situation.
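The sharding point deserves a tiny sketch (all class names here are invented for illustration): as long as the model keeps the same interface, the controller never has to change.

```php
<?php
// Hypothetical model interface: the controller only ever sees this.
interface UserModel
{
    public function find($id);
}

// Original single-database implementation (lookup stubbed out here).
class DbUserModel implements UserModel
{
    public function find($id)
    {
        return array('id' => $id, 'source' => 'main');
    }
}

// Later, a shard-aware implementation: same interface, different internals.
class ShardedUserModel implements UserModel
{
    public function find($id)
    {
        $shard = $id % 4; // pick a shard by id
        return array('id' => $id, 'source' => 'shard' . $shard);
    }
}

// The controller code is identical either way.
function showUser(UserModel $model, $id)
{
    return $model->find($id);
}
```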

I use Zend Framework under PHP to provide an MVC and other libraries. I’ve tried CodeIgniter and Kohana and while they work, they either feel dated (in the case of CodeIgniter) or are just poorly documented (in the case of Kohana). Zend Framework, while not trivial to get up and running for the novice, worked well for the way I work.

OOP

OOP is one of those things you either love or hate. I personally tolerate it with mild disdain, but that is just me (I predate the whole OOP revolution). Put simply, OOP allows a set of related variables and functions to be pulled together into a single package called a class. You can then either extend the class in your own projects, overriding or extending the various functions (called methods), or use it as is. The biggest advantage, especially when you are importing a lot of libraries, is avoiding name clashes: unless you are careful, two libraries may have a function or variable with the same name. (This is solved with namespaces in the latest PHP versions.) At best the interpreter or compiler throws an error; at worst one library may end up calling a function from another, or trampling over its data, unintentionally.
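In PHP terms, a minimal illustration (the classes are invented for the example):

```php
<?php
// A class pulls related data and functions (methods) into one package.
class Logger
{
    protected $prefix = 'log';

    public function format($msg)
    {
        return $this->prefix . ': ' . $msg;
    }
}

// Extending a class lets you override methods without touching the original.
class ErrorLogger extends Logger
{
    protected $prefix = 'error';

    // Override the parent method while still reusing it.
    public function format($msg)
    {
        return strtoupper(parent::format($msg));
    }
}

$log = new ErrorLogger();
echo $log->format('disk full') . "\n"; // ERROR: DISK FULL
```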

Regex

At its simplest level, a regex is a way of matching patterns within strings. Used to their full ability, a regex can sometimes replace 10-20 lines of non-regex code with a single line. They are often used in mod_rewrite rules, input validation and searching within large quantities of text. Learn them, use them and master them – they will save you a lot of time.
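A quick input-validation example in PHP (the pattern is deliberately simple; real date validation needs more than a regex):

```php
<?php
// Validate a YYYY-MM-DD date string with one regex instead of
// splitting the string and checking each piece by hand.
function looksLikeDate($input)
{
    return preg_match('/^\d{4}-\d{2}-\d{2}$/', $input) === 1;
}

var_dump(looksLikeDate('2011-12-25')); // bool(true)
var_dump(looksLikeDate('25/12/2011')); // bool(false)
```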

Source Control

Even when working on your own projects, you sometimes wish you could turn back time and undo something stupid you did a month ago in some code. Your backups have been recycled and a lot of changes have been made since; it sounds like a lost cause. This is where version control comes in. Every time you are happy with a set of modifications, you check them in. The difference between the old version and the current version is worked out and saved. At a later date you can go back and see who made a change, when it was made and exactly what was changed. If you want to check out a copy of the code as it was at a given time, you can do that. This is almost essential in a team-based environment, as it allows auditing of code in a more detailed way than simply looking at backups from a particular point in time. What if one of the coders inserted a backdoor into your code? Wouldn’t you want to be able to identify every other change they made and check it for similar nefariousness?

Google-fu

As a developer you will often find questions and problems that you don’t have the answers for. You can try to puzzle through it yourself, but there is a whole world of people out there, many of whom have already solved similar issues and put their answers online. Being able to search effectively puts these answers at your fingertips.

Accessibility

Not everyone in this world has perfect vision, perfect hearing or the ability to use a mouse, yet many sites are designed as if every user does. If your site assumes that all visitors can read your fixed-size font, will have JavaScript enabled and will be reading the page rather than listening to it through a screen reader, then it probably won’t be accessible. There are many tools and techniques that take little effort to apply and make your site much easier for users with impaired abilities.

Monitoring and SEO

A site is a success if it reaches its goals, and it is hard to know whether you are hitting them if you have no way of measuring what the goals specify. Are you being seen by 100 people a day? Are you selling at least 3 widgets from your site a week? Does your advertising campaign result in increased sales and traffic? You need to measure anything you set goals on – a simple idea, but it’s surprising how few people ‘get it’.

SEO is related to monitoring your traffic in much the same way that energy efficiency is related to watching how much power you use. Unless you know where you started from and where you are now, you can’t know how much you have improved by – and if you are paying for AdWords or similar advertising, how much each extra visitor or sale is costing you.
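The arithmetic itself is trivial; the hard part is having the baseline numbers to plug in. For example (all figures invented):

```php
<?php
// Hypothetical figures: what each extra sale from an ad campaign costs.
$adSpend      = 300.0; // monthly ad spend
$salesWith    = 45;    // sales during the campaign
$salesWithout = 30;    // baseline sales, known from prior monitoring

$extraSales  = $salesWith - $salesWithout;
$costPerSale = $adSpend / $extraSales;
printf("Each extra sale cost %.2f\n", $costPerSale); // Each extra sale cost 20.00
```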

JQuery and other JS toolkits

It is all well and good being able to produce dynamic pages and process forms on the server side, but sometimes you really need something extra on the client side as well. Enter JavaScript.

JavaScript, or JS, has been around since the mid 90s and is standard on all major browsers. The big problem is that certain features exposed to JavaScript (Ajax, storage, geolocation, the DOM/BOM etc) vary from browser to browser.

The easy solution is to use a library that hides these differences and makes it easy to write code once that will run on most browsers. jQuery is one option, but not the only one.

In addition, there are many libraries, shims and other utilities that make doing more complex things easier. jQuery, for example, has the jQuery UI library, which makes tasks such as tabs and date pickers trivial to add to your page.

This isn’t an exhaustive list and I’m sure there are many more things that could be added. If you feel I have missed something, please add it below in the comments section.


Mongo Benchmarking #1

These aren’t perfect benchmarks – far from it in fact – but I just wanted to get a rough idea of the relative tradeoffs between fsync and safe over normal unsafe writes…

Time taken for 10,000 inserts – (no indexes, journalling on MongoDB 2.0.1 on Debian (Rackspace CloudServer 256Meg))

default      - time taken: 0.2401921749115 seconds
fsync        - time taken: 358.55523014069 seconds
safe=true    - time taken: 1.1818060874939 seconds

Edit: and for journalling disabled and smallfiles=true in mongo.conf

default      - time taken: 0.15036606788635 seconds 
fsync        - time taken: 34.175970077515 seconds 
safe=true    - time taken: 1.0593159198761 seconds

The results aren’t perfect, but do show how big the difference is…

Source:

<?php
$mongo = new Mongo();
$db = $mongo->selectDB('bench');
$collection = new MongoCollection($db,'bench');

$start = microtime(TRUE);
for ($i=0;$i<10000;$i++) {
  $collection->insert(array('data' => sha1(rand()) ));
}
$end = microtime(TRUE);
echo 'default      - time taken: '.($end-$start)." seconds \n";

$start = microtime(TRUE);
for ($i=0;$i<10000;$i++) {
  $collection->insert(array('data' => sha1(rand()) ),array('fsync' => true));
}
$end = microtime(TRUE);
echo 'fsync        - time taken: '.($end-$start)." seconds \n";

$start = microtime(TRUE);
for ($i=0;$i<10000;$i++) {
  $collection->insert(array('data' => sha1(rand()) ),array('safe' => true));
}
$end = microtime(TRUE);
echo 'safe=true    - time taken: '.($end-$start)." seconds \n";
?>

I’m not sure that the existing number of records makes a massive difference, other than through the pre-allocation of files (which we have little control over anyway) – and it doesn’t look like there is an increase between runs even when there are a lot of entries. (Perhaps we’d see more with indexes enabled.)

Each run adds an extra 20,000 entries to the collection with little perceptible slowdown.

root@test:/var/www/test# php bench1.php
default      - time taken: 0.53534507751465 seconds
safe=true    - time taken: 1.2793118953705 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.203537940979 seconds
safe=true    - time taken: 1.2887620925903 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.22933197021484 seconds
safe=true    - time taken: 1.6565799713135 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.19606184959412 seconds
safe=true    - time taken: 1.5315411090851 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.2510199546814 seconds
safe=true    - time taken: 1.2419080734253 seconds

Testing on a cloud server is hard, as you are at the mercy of other users impacting the available bandwidth and processor utilisation, but you can at least see trends. I hope this has been enlightening and I hope to expand on it in future…

Edit: for the one person that asked me about storage efficiency… here goes…

 > db.bench.stats()
{
    "ns" : "bench.bench",
    "count" : 140001,
    "size" : 10640108,
    "avgObjSize" : 76.00022856979594,
    "storageSize" : 21250048,
    "numExtents" : 7,
    "nindexes" : 1,
    "lastExtentSize" : 10067968,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 4562208,
    "indexSizes" : {
        "_id_" : 4562208
    },
    "ok" : 1
}

So based on this we can work out that size/storageSize ≈ 50% efficiency – on this dataset MongoDB is using about the same again on top of the raw data for storage.

If we add in the indexes, size/(storageSize+totalIndexSize) comes out at only about 41% efficiency. Personally, I think this is a reasonable tradeoff for the raw speed it gives.
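For reference, the same calculation in PHP, using the numbers from the stats output above:

```php
<?php
// Figures taken from the db.bench.stats() output above.
$size           = 10640108; // bytes of actual data
$storageSize    = 21250048; // bytes allocated for the data
$totalIndexSize = 4562208;  // bytes allocated for indexes

printf("data only:    %.0f%%\n", 100 * $size / $storageSize);                     // 50%
printf("with indexes: %.0f%%\n", 100 * $size / ($storageSize + $totalIndexSize)); // 41%
```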

Nginx virtual host ordering

Just a quick note.

Nginx includes configuration files in ASCII order. This means that if you have two files, default (with the default config) and cloud, cloud will be included before default and will answer for any undefined hostnames.

The easy way to solve this (and obvious if you have been around Linux for a while) is to prefix all config files with a number. For example:

00-default
10-wildcard.example.com
10-wildcard.foo.com

This will result in 00-default being included in the configuration first.
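You can see the byte-wise ordering for yourself; PHP’s sort() compares strings in much the same way the include ordering works:

```php
<?php
// 'cloud' sorts before 'default' byte-wise, so it would be included first...
$unprefixed = array('default', 'cloud');
sort($unprefixed);
print_r($unprefixed); // 'cloud' comes first

// ...while numeric prefixes force the order you actually want.
$prefixed = array('10-wildcard.example.com', '00-default', '10-wildcard.foo.com');
sort($prefixed);
print_r($prefixed); // '00-default' now comes first
```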

I hope this is useful for someone out there.

FastCGI PHP and supervisord

Nginx won’t auto-spawn FastCGI workers if they don’t exist, so you need to start them outside of Nginx. Many people use the spawn-fcgi script or some other startup script, but the smart people use a process monitor.

A lot of people use supervisord to keep their FastCGI PHP workers running, and most of them are, in my opinion, doing it wrong. Not very wrong – there are just a few things that could be done better.

If you simply want PHP running as a certain user, suphp and similar solutions are perfectly fine. FastCGI is slightly faster but more complex to manage: you need to keep it running yourself. The way most people do it is something like the following:

[fcgi-program:php5-cgi]
socket=tcp://127.0.0.1:9000
command=/usr/bin/php5-cgi
numprocs=5
priority=999
process_name=%(program_name)s_%(process_num)02d
user=www-data
autorestart=true
autostart=true
startsecs=1
startretries=3
stopsignal=QUIT
stopwaitsecs=10
redirect_stderr=true
stdout_logfile=/var/log/php5-cgi.log
stdout_logfile_maxbytes=10MB

This spawns several separate PHP processes, and it works. The problem is that opcode caches such as APC (and other in-memory caches) are kept per interpreter, so if you spawn 20 PHP processes you end up with 20 separate opcode caches, each holding much the same information.

A better way is to use the ability of the FastCGI PHP binary to manage its own children:

[fcgi-program:php-cgi]
command=/usr/bin/php-cgi -b 127.0.0.1:9000
socket=tcp://127.0.0.1:9000
process_name=%(program_name)s
user=www-data
numprocs=1
priority=999
autostart=true
autorestart=true
startsecs=1
startretries=3
exitcodes=0,2
stopsignal=QUIT
stopwaitsecs=10
redirect_stderr=true
stdout_logfile=/var/log/supervisor/php.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
environment=PHP_FCGI_CHILDREN=50,PHP_FCGI_MAX_REQUESTS=500

If you use APC or a similar opcode cache this is much more efficient, as only one cache is kept for the whole pool, and it has the side effect of much faster restarts within supervisord.

Gearman coalescing with the unique id

Many people, on both the mailing lists and across the net, seem slightly confused by gearman’s coalescing feature. I will try to explain what it is and how it works here.

If you use gearman to generate data for memcache cache misses on a web page, then when many people hit that page at once you will get many jobs all requesting the same thing.

Without coalescing, all these jobs would run one after the other, each setting the same keys in memcache. This is obviously not ideal.

However, with coalescing, as long as all these jobs have the same unique id, they will all be merged into a single job and the result given back to all waiting clients.

An example might help.

<?php

/* create our object */
$gmclient= new GearmanClient();

/* add the default server */
$gmclient->addServer();

/* start some background jobs and save the handles */
$handles = array();
$handles[0] = $gmclient->doBackground("reverse", "Hello World!");
$handles[1] = $gmclient->doLowBackground("reverse", "Aardvarks!");
$handles[2] = $gmclient->doHighBackground("reverse", "Foo");
$handles[3] = $gmclient->doLowBackground("reverse", "Foo");
$handles[4] = $gmclient->doBackground("reverse", "Foo");
$handles[5] = $gmclient->doHighBackground("reverse", "Foo");
$handles[6] = $gmclient->doHighBackground("reverse", "Foo");

$gmclient->setStatusCallback("reverse_status");

/* Poll the server to see when those background jobs finish; */
/* a better method would be to use event callbacks */
do
{
   /* Use the context variable to track how many tasks have completed */
   $done = 0;
   for ($i = 0; $i < count($handles); $i++) {
      /* NB: call-time pass-by-reference; requires PHP 5.3 or earlier */
      $gmclient->addTaskStatus($handles[$i], &$done);
   }
   $gmclient->runTasks();
   echo "Done: $done\n";
   sleep(1);
}
while ($done != count($handles));

function reverse_status($task, $done)
{
   if (!$task->isKnown())
      $done++;
}

?>

As you’d expect, with the prior code, the client sees each request come back one by one (including the ‘Foo’ tasks).

Now let us contrast that to the following, which is almost identical except the ‘Foo’ tasks all have the same unique identifier.

 root@debian:~# cat gmclient3.php
<?php

/* create our object */
$gmclient= new GearmanClient();

/* add the default server */
$gmclient->addServer();

/* start some background jobs and save the handles */
$handles = array();
$handles[0] = $gmclient->doBackground("reverse", "Hello World!");
$handles[1] = $gmclient->doLowBackground("reverse", "Aardvarks!");
$handles[2] = $gmclient->doHighBackground("reverse", "Foo",'foo');
$handles[3] = $gmclient->doLowBackground("reverse", "Foo",'foo');
$handles[4] = $gmclient->doBackground("reverse", "Foo",'foo');
$handles[5] = $gmclient->doHighBackground("reverse", "Foo",'foo');
$handles[6] = $gmclient->doHighBackground("reverse", "Foo",'foo');

$gmclient->setStatusCallback("reverse_status");

/* Poll the server to see when those background jobs finish; */
/* a better method would be to use event callbacks */
do
{
   /* Use the context variable to track how many tasks have completed */
   $done = 0;
   for ($i = 0; $i < count($handles); $i++) {
      /* NB: call-time pass-by-reference; requires PHP 5.3 or earlier */
      $gmclient->addTaskStatus($handles[$i], &$done);
   }
   $gmclient->runTasks();
   echo "Done: $done\n";
   sleep(1);
}
while ($done != count($handles));

function reverse_status($task, $done)
{
   if (!$task->isKnown())
      $done++;
}

?>

Now when this is executed all the Foo tasks come back at the same time – they have been coalesced into a single task and the result copied to each client.

The drawback, though, is that priorities don’t always work the way you expect them to. The priority (low, normal or high) of the first task submitted is the one given to all coalesced tasks, so if a job is urgent and low-priority tasks with the same unique id are already queued, you might want to avoid coalescing to get it out as fast as possible, at the risk of duplicating work (unless you also have workers check memcache too).

I hope this is enlightening to those who have been wondering about the feature; feel free to email any comments to mike *at* technomonk *dot* com

Gearman admin interface

I have recently been playing with gearman for a big project for one of our clients here at Sysdom. Getting it up and running was super easy, but we needed to integrate monitoring into our admin pages, so here is a little piece of code to get you up and running…

<?php
error_reporting(E_ALL);

echo "Queues\n";

$service_port = 4730;

//$address = gethostbyname('www.example.com');
$address='127.0.0.1';

$socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
if ($socket === false) {
    echo "socket_create() failed: reason: " . socket_strerror(socket_last_error()) . "\n";
    exit(1);
}

//echo "Attempting to connect to '$address' on port '$service_port'...";
$result = socket_connect($socket, $address, $service_port);
if ($result === false) {
    echo "socket_connect() failed.\nReason: ($result) " . socket_strerror(socket_last_error($socket)) . "\n";
    exit(1);
}

$in = "status\n";
$out = '';

socket_write($socket, $in, strlen($in));

while ($out = socket_read($socket, 2048,PHP_NORMAL_READ)) {
    echo "$out";
    if ($out==".\n") break;
}

socket_close($socket);
?>

Yes, it is a quick and dirty proof of concept, but I hope this will help other people with monitoring gearman in their own apps.

Output is pretty simple:

root@debian:~# php gmstatus.php 
Queues
test    1    0    0
reverse    0    0    1
.

with the function name, total jobs in queue, running jobs and available workers separated by tabs.
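If you want the admin page to do more than dump raw text, each row can be split into a small array (a sketch; the field names are my own):

```php
<?php
// Parse one line of gearman 'status' output: FUNCTION\tTOTAL\tRUNNING\tAVAILABLE
function parseStatusLine($line)
{
    list($function, $total, $running, $workers) = explode("\t", trim($line));
    return array(
        'function' => $function,
        'queued'   => (int)$total,
        'running'  => (int)$running,
        'workers'  => (int)$workers,
    );
}

print_r(parseStatusLine("reverse\t0\t0\t1"));
```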

Note: yes, I know there is a PEAR library for doing the same, but pulling it in when you are already using the PECL extension is an extra dependency we’d rather not have.

The gearman admin protocol is available here, but I’ll repost the snippet so people have an easy reference to it.

Administrative Protocol
-----------------------

The Gearman job server also supports a text-based protocol to pull
information and run some administrative tasks. This runs on the same
port as the binary protocol, and the server differentiates between
the two by looking at the first character. If it is a NULL (\0),
then it is binary; if it is non-NULL, then it attempts to parse it
as a text command. The following commands are supported:

workers

This sends back a list of all workers, their file descriptors,
their IPs, their IDs, and a list of registered functions they can
perform. The list is terminated with a line containing a single
'.' (period). The format is:

FD IP-ADDRESS CLIENT-ID : FUNCTION ...

Arguments:
- None.

status

This sends back a list of all registered functions. Next to
each function is the number of jobs in the queue, the number of
running jobs, and the number of capable workers. The columns are
tab separated, and the list is terminated with a line containing
a single '.' (period). The format is:

FUNCTION\tTOTAL\tRUNNING\tAVAILABLE_WORKERS

Arguments:
- None.

maxqueue

This sets the maximum queue size for a function. If no size is
given, the default is used. If the size is negative, then the queue
is set to be unlimited. This sends back a single line with "OK".

Arguments:
- Function name.
- Optional maximum queue size.

shutdown

Shutdown the server. If the optional "graceful" argument is used,
close the listening socket and let all existing connections
complete.

Arguments:
- Optional "graceful" mode.

version

Send back the version of the server.

Arguments:
- None.

The Perl version also has a 'gladiator' command that uses the
'Devel::Gladiator' Perl module and is used for debugging.

I hope someone finds this useful and feel free to send any comments to mike *at* technomonk *dot* com


Tools

Those that have known me for a while know that I can produce a fair amount of useful web-based tools from time to time. While some of them have been lost to the sands of time, others are still knocking around in various forms and deserve to be resurrected.

To this end I am starting to move some of the tools over to a new page on this site.

If you know of any of the tools I have produced and would like to see them published here, please let me know at mike at mikepreston.org.

Roman Numerals in PHP

I needed a function to convert dates to their Roman numeral equivalents. The rules are pretty easy to understand, so I thought I’d knock up a function.