Tag Archives: servers

Instant Messaging

I was asked after my last post what I used for instant messaging. The answer is kinda interesting in my opinion.

As some of you know, I am a privacy advocate, but I also try to balance that with ease of use and ease of integrating with other people.

I run my own XMPP server, which I share with a number of colleagues. It runs on Prosody, an open source XMPP server. We use the MUC module to host a number of conference rooms where our bots and other notifications are sent.

In addition to this I also use Skype, not because I trust Microsoft to not sell my contact info out to the NSA, but because it reduces the friction of contacting me for a number of people. Forcing them to use my private XMPP server (even through federation) is too big a hurdle for them to jump.

10 things any newbie web developer needs to know

PHP or other language

I’m not going to preach about the pros and cons of different languages; suffice to say that any sufficiently advanced website requires more than just static files to make it work. PHP is my language of choice (and has been for over a decade), but there are plenty of other options out there (Python, Ruby or even (shudder) .NET).

Which you choose depends more on what is most comfortable to you and what support network you have to call upon. It isn’t generally a good idea to learn Python in a Microsoft shop for example unless there are other pressing reasons to do so.

Whatever you end up learning, you need to be competent in its use, make use of the extensions available (there is no point reinventing the wheel when there are dozens already made for you) and know how to use the online documentation. Coding style helps, but as long as you are consistent it really isn’t that pressing unless you are working with others.

XHTML/CSS

Even if you don’t deal with much front-end code yourself (perhaps you use templates), you still need to understand at least the basics of XHTML and CSS to make it easier for the front-end developers to deal with your output. At the very least your code needs to output well-formed, correctly nested markup. I mean

<div><p><b><i>hello world</i></b></p></div>

rather than

<div><p><i><b>hello world</i></b></div>

(Note the missing closing </p> tag and the improperly nested <i> and <b> tags.) An XHTML validator will help, but only on the finished output; validators don’t work on fragments of code like this.
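Fragments can still be given a rough well-formedness check before they reach a validator. The following is only a minimal sketch, assuming PHP 7.1+ with the SimpleXML extension available; `isWellFormedFragment` is a hypothetical helper name, not part of any library. It checks nesting and closing tags only, not XHTML validity, and it will also reject HTML entities such as `&nbsp;` because no DTD is loaded.

```php
<?php
// Hypothetical helper: returns true when $fragment parses as well-formed XML.
// The fragment is wrapped in a dummy root element so that it forms a document.
function isWellFormedFragment(string $fragment): bool
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string('<root>' . $fragment . '</root>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc !== false;
}

// The two fragments from the text above.
var_dump(isWellFormedFragment('<div><p><b><i>hello world</i></b></p></div>'));
var_dump(isWellFormedFragment('<div><p><i><b>hello world</i></b></div>'));
```

The first call should report true and the second false, since the parser stops at the first mismatched closing tag.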

MVCs, frameworks and libraries

As stated above, and especially if you are getting paid on a per-project basis, you want to reduce the number of times you reinvent the wheel. This is known as Don’t Repeat Yourself, or DRY. There is no point writing yet another PDF library, for example, when FPDF, Zend_PDF, PDFlib and many others already exist (or DocBook/PS via converters etc.) and do an adequate job for most tasks.

MVC, or Model-View-Controller to give it its full name, is a concept (or design pattern) that allows you to separate the data access parts from the actual logic and the templates used in your projects. This is useful in a team-based environment because the designers can design, the coders can code, and the coders and DBAs can control the models and how they interact with the databases. It just tends to separate the roles more cleanly and makes modifications at a later date easier. If you want to add sharding to the database, you can simply modify the models while keeping the interface the same, and suddenly your app can be cluster aware. It’s really a win-win situation.
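To make the separation concrete, here is a deliberately tiny sketch in plain PHP (7.1+), not tied to Zend Framework or any other framework. Every name in it (`ArticleModel`, `ArrayArticleModel`, `renderArticle`, `showArticle`) is made up for illustration. The point is that the controller and view only ever see the model's interface, so a database-backed or sharded model could be swapped in without touching them.

```php
<?php
// Model: the only layer that knows where the data lives.
interface ArticleModel
{
    public function find(int $id): ?array;
}

// Hypothetical in-memory model, standing in for a real database-backed one.
class ArrayArticleModel implements ArticleModel
{
    private $rows;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    public function find(int $id): ?array
    {
        return $this->rows[$id] ?? null;
    }
}

// View: turns data into output and does nothing beyond escaping it.
function renderArticle(array $article): string
{
    return '<h1>' . htmlspecialchars($article['title']) . '</h1>';
}

// Controller: glues a request to the model and the view.
function showArticle(ArticleModel $model, int $id): string
{
    $article = $model->find($id);
    return $article === null ? 'Not found' : renderArticle($article);
}

$model = new ArrayArticleModel([1 => ['title' => 'Hello & welcome']]);
echo showArticle($model, 1), "\n"; // prints <h1>Hello &amp; welcome</h1>
```

A real framework adds routing, templating and much more, but the division of responsibilities is the same.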

I use Zend Framework under PHP to provide an MVC and other libraries. I’ve tried CodeIgniter and Kohana, and while they work, they either feel dated (in the case of CodeIgniter) or are just poorly documented (in the case of Kohana). Zend Framework, while not trivial for the novice to get up and running, works well for the way I work.

OOP

OOP is one of those things you either love or hate. I personally tolerate it with a mild disdain, but that is just me. (I predate the whole OOP revolution.) Put simply, OOP allows a set of related variables and functions to be pulled together into a single package called a class. You can then either use the class as is or extend it in your own projects, overriding or extending the various functions (called methods). The biggest advantage shows up when you are importing a lot of libraries. Without classes, unless you are careful, two libraries may define a function or variable with the same name. (This is also solved with namespaces in the latest PHP versions.) At best the interpreter or compiler throws an error; at worst one library may end up calling a function from another, or trampling over its data, unintentionally.
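A short sketch may help here. It assumes PHP 7.1+, and the class names (`Logger`, `PrefixedLogger`, `Cache`) are invented for illustration rather than taken from any library. It shows a class being extended with one method overridden, and two unrelated classes that both have a `write()` method without clashing.

```php
<?php
// A class bundles related data ($lines) and functions (methods) together.
class Logger
{
    protected $lines = [];

    public function write(string $message): void
    {
        $this->lines[] = $this->format($message);
    }

    protected function format(string $message): string
    {
        return $message;
    }

    public function lines(): array
    {
        return $this->lines;
    }
}

// Extending the class: only format() is overridden, the rest is inherited.
class PrefixedLogger extends Logger
{
    protected function format(string $message): string
    {
        return '[app] ' . $message;
    }
}

// An unrelated class with its own write(); as plain functions these would collide.
class Cache
{
    private $data = [];

    public function write(string $key, string $value): void
    {
        $this->data[$key] = $value;
    }

    public function read(string $key): ?string
    {
        return $this->data[$key] ?? null;
    }
}

$log = new PrefixedLogger();
$log->write('started');
$cache = new Cache();
$cache->write('greeting', 'hello');
print_r($log->lines());
```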

Regex

At its simplest level, a regex is a way of matching patterns within strings. When used to their full ability, regexes can sometimes replace 10-20 lines of non-regex code with a single line. They are often used in mod_rewrite rules, input validation and searching for certain things within quantities of text. Learn them, use them and master them – they will save you a lot of time.
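As a small illustration of the validation and searching uses, here is a sketch using PHP's built-in PCRE functions. The helper names `looksLikeIsoDate` and `extractTags` are made up for this example, and the date check only tests the shape of the string, not whether the date really exists.

```php
<?php
// Input validation: accept only strings shaped like 2011-12-25.
// \z anchors at the very end, so a trailing newline is rejected too.
function looksLikeIsoDate(string $input): bool
{
    return preg_match('/^\d{4}-\d{2}-\d{2}\z/', $input) === 1;
}

// Searching: pull every #tag out of a block of text.
function extractTags(string $text): array
{
    preg_match_all('/#(\w+)/', $text, $matches);
    return $matches[1]; // contents of the first capture group for each match
}

var_dump(looksLikeIsoDate('2011-12-25'));            // bool(true)
var_dump(looksLikeIsoDate('25/12/2011'));            // bool(false)
print_r(extractTags('Testing #mongo on #debian'));   // mongo, debian
```

Doing the same by hand with `strpos()` and `substr()` loops is where those 10-20 lines come from.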

Source Control

Even when working on your own projects, you sometimes wish you could turn back time and undo something stupid you did a month ago in some code. Your backups have been recycled and a lot of changes have been made since, so it sounds like a lost cause. This is where version control comes in. Every time you are happy with a set of modifications, you check them in. The differences between the old version and the current version are worked out and saved. At a later date you can go back and see who made a change, when it was done and exactly what was changed. If you want to check out a copy of the code as it was at that time, you can do that. This is almost essential in a team-based environment, as it allows auditing of code in a more detailed way than simply looking at backups from a particular point in time. What if one of the coders inserted a backdoor into your code? Wouldn’t you want to be able to identify every other change they made and check them for similar nefariousness?

Google-fu

As a developer you will often find questions and problems that you don’t have the answers for. You can try to puzzle through it yourself, but there is a whole world of people out there, many of whom have already solved similar issues and put their answers online. Being able to search effectively puts these answers at your fingertips.

Accessibility

Not everyone in this world has perfect vision, perfect hearing or the ability to use a mouse. Many sites are designed in such a way that every user is assumed to have adequate abilities. If your site assumes that all readers can use your fixed-size font, will have JavaScript enabled and will read the page rather than listen to it through a screen reader, then it probably won’t be accessible. There are many tools and concepts that take little effort and make your site easier for users with impaired abilities; some are listed below.

Monitoring and SEO

A site is a success if it reaches its goals. It is hard to know if you are hitting them if you have no way of measuring the values the goals specify. Are you being seen by 100 people a day? Are you selling at least 3 widgets from your site a week? Does your advertising campaign result in increased sales and traffic? You need to measure anything you are going to set goals on – a simple idea, but it’s surprising how few people ‘get it’.

SEO is related to monitoring your traffic in much the same way that energy efficiency is related to watching how much power you use. Unless you know where you started from and where you are now, you can’t know how much you have improved by or, if you are paying for AdWords or similar advertising, how much each extra visitor or sale is costing you.
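The arithmetic behind "how much each extra visitor or sale is costing you" is simple once you have a before and an after figure. This is only a sketch: `costPerExtra` is a hypothetical helper and the numbers are made up, not taken from any real campaign.

```php
<?php
// Hypothetical helper: advertising spend divided by the extra visitors (or sales) gained.
// Returns null when there was no improvement, since the cost is then undefined.
function costPerExtra(float $spend, int $before, int $after): ?float
{
    $extra = $after - $before;
    return $extra > 0 ? $spend / $extra : null;
}

// Made-up figures: 50.00 spent, weekly visitors went from 700 to 900.
echo costPerExtra(50.0, 700, 900), "\n"; // prints 0.25
```

Without the "before" number from your monitoring, the function has nothing to work with, which is the whole point.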

jQuery and other JS toolkits

It is all well and good being able to produce dynamic pages and process forms on the server side, but sometimes you really need something extra on the client side as well. Enter JavaScript.

JavaScript, or JS, has been around since the mid 90s and is standard on all major browsers. The big problem is that certain features exposed to JavaScript (Ajax, Storage, Location, DOMs/BOMs etc.) vary from browser to browser.

The easy solution is to use a library that hides these differences and makes it easy to write code once that will run on most browsers. jQuery is one option, but not the only one.

In addition to this, there are many libraries, shims and other utilities that make doing more complex things easier. jQuery, for example, has the UI library, which makes certain tasks such as tabs and date pickers trivial to add to your page.

This isn’t an exhaustive list and I’m sure there are many things that could be added to it. If you feel I have missed something, please add it below in the comments section.

 

Mongo Benchmarking #1

These aren’t perfect benchmarks – far from it in fact – but I just wanted to get a rough idea of the relative tradeoffs between fsync and safe over normal unsafe writes…

Time taken for 10,000 inserts – (no indexes, journalling on; MongoDB 2.0.1 on Debian; Rackspace Cloud Server, 256MB)

default      - time taken: 0.2401921749115 seconds
fsync        - time taken: 358.55523014069 seconds
safe=true    - time taken: 1.1818060874939 seconds

Edit: and for journalling disabled and smallfiles=true in mongo.conf

default      - time taken: 0.15036606788635 seconds 
fsync        - time taken: 34.175970077515 seconds 
safe=true    - time taken: 1.0593159198761 seconds

The results aren’t perfect, but do show how big the difference is…
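To put a number on that difference, the timings above can be turned into ratios against the default mode. This is just a quick sketch: the figures are copied from the two result sets above, while the `slowdown` helper and the array labels are made up for illustration.

```php
<?php
// Seconds taken for 10,000 inserts, copied from the results above.
$timings = [
    'journalling on' => [
        'default'   => 0.2401921749115,
        'fsync'     => 358.55523014069,
        'safe=true' => 1.1818060874939,
    ],
    'journalling off, smallfiles' => [
        'default'   => 0.15036606788635,
        'fsync'     => 34.175970077515,
        'safe=true' => 1.0593159198761,
    ],
];

// Hypothetical helper: each mode's time as a multiple of the 'default' time.
function slowdown(array $run): array
{
    $ratios = [];
    foreach ($run as $mode => $seconds) {
        $ratios[$mode] = $seconds / $run['default'];
    }
    return $ratios;
}

foreach ($timings as $label => $run) {
    foreach (slowdown($run) as $mode => $ratio) {
        // 10,000 inserts divided by the elapsed time gives inserts per second.
        printf("%-28s %-10s %8.1fx default time, %9.0f inserts/sec\n",
            $label, $mode, $ratio, 10000 / $run[$mode]);
    }
}
```

On these figures fsync comes out at roughly 1,500 times the default with journalling on and a little over 200 times with it off, while safe=true costs somewhere between 5 and 7 times the default.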

Source for the benchmark:

<?php
// Connect with the legacy PHP Mongo driver and pick the collection to write to.
$mongo = new Mongo();
$db = $mongo->selectDB('bench');
$collection = new MongoCollection($db,'bench');

// Default: fire-and-forget inserts; the driver does not wait for a reply.
$start = microtime(TRUE);
for ($i=0;$i<10000;$i++) {
  $collection->insert(array('data' => sha1(rand()) ));
}
$end = microtime(TRUE);
echo 'default      - time taken: '.($end-$start)." seconds \n";

// fsync: each insert waits until the write has been synced to disk.
$start = microtime(TRUE);
for ($i=0;$i<10000;$i++) {
  $collection->insert(array('data' => sha1(rand()) ),array('fsync' => true));
}
$end = microtime(TRUE);
echo 'fsync        - time taken: '.($end-$start)." seconds \n";

// safe: each insert waits for the server to acknowledge the write.
$start = microtime(TRUE);
for ($i=0;$i<10000;$i++) {
  $collection->insert(array('data' => sha1(rand()) ),array('safe' => true));
}
$end = microtime(TRUE);
echo 'safe=true    - time taken: '.($end-$start)." seconds \n";
?>

I’m not sure that the existing number of records will make a massive amount of difference, other than through the pre-allocation of files, which we have little control over anyway. It doesn’t look like there is an increase between runs even when there are a lot of entries… (perhaps we’d see more with indexes enabled)

Each run adds an extra 20,000 entries to the collection with little perceptible slowdown.

root@test:/var/www/test# php bench1.php
default      - time taken: 0.53534507751465 seconds
safe=true    - time taken: 1.2793118953705 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.203537940979 seconds
safe=true    - time taken: 1.2887620925903 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.22933197021484 seconds
safe=true    - time taken: 1.6565799713135 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.19606184959412 seconds
safe=true    - time taken: 1.5315411090851 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.2510199546814 seconds
safe=true    - time taken: 1.2419080734253 seconds

It is hard testing on a cloud server as you are at the mercy of other users impacting the available bandwidth and processor utilisation, but you can at least see trends. I hope this has been enlightening and I hope to expand on this in future…

Edit: for the one person that asked me about storage efficiency… here goes…

 > db.bench.stats()
{
    "ns" : "bench.bench",
    "count" : 140001,
    "size" : 10640108,
    "avgObjSize" : 76.00022856979594,
    "storageSize" : 21250048,
    "numExtents" : 7,
    "nindexes" : 1,
    "lastExtentSize" : 10067968,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 4562208,
    "indexSizes" : {
        "_id_" : 4562208
    },
    "ok" : 1
}

So based on this… we can work out that size/storageSize = approx 50% efficiency… so MongoDB on this dataset is using about the same again in overhead for the data storage.

If we add in the indexes, size/(storageSize+totalIndexSize), then the result is only about 41% efficient. Personally, I think this is a reasonable tradeoff for the raw speed it gives…
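For anyone who wants to repeat the sums on their own collection, here is the same calculation as a small sketch. The three values are copied from the stats() output above; `storageEfficiency` is a made-up helper name, not a MongoDB or driver function.

```php
<?php
// Values copied from the db.bench.stats() output above (all in bytes).
$stats = [
    'size'           => 10640108,
    'storageSize'    => 21250048,
    'totalIndexSize' => 4562208,
];

// Hypothetical helper: fraction of the allocated space that holds actual document data.
function storageEfficiency(array $stats, bool $includeIndexes = false): float
{
    $allocated = $stats['storageSize'];
    if ($includeIndexes) {
        $allocated += $stats['totalIndexSize'];
    }
    return $stats['size'] / $allocated;
}

printf("data only:    %.1f%%\n", 100 * storageEfficiency($stats));       // prints 50.1%
printf("with indexes: %.1f%%\n", 100 * storageEfficiency($stats, true)); // prints 41.2%
```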

Case Study: Optimising a Cloud Application

I was recently brought in to examine the infrastructure of a small startup. This wasn’t anything really special; I do it quite often for various reasons. What was different was that they didn’t particularly have issues with scaling out – they had that working well with their shared-nothing web application and MongoDB backend. What they were having issues with was their infrastructure costs.

I normally work through a 6-step process that has been built up over time –

  1. Monitor and gather stats/measure the problem,
  2. Standardise on a reference architecture,
  3. Add configuration management and version control,
  4. Start to define a playbook of how to do things (like up/downscale or provision new machines and clusters) and start to automate them,
  5. Bring everything to reference architecture/consolidate underutilised servers and eliminate unused infrastructure,
  6. Consider architecture changes to make it more efficient,
  7. …and repeat

I will take you through a case study showing how this process was used to lower their monthly costs. Names and details have been changed in places to protect the guilty… 😉

Rackspace Huddles

I’m not a Rackspace expert – far from it. However, I do use the Rackspace cloud often: as a personal customer, as a business customer and for various clients. I will try to lay out how I believe it all works and how this impacts you, the end user.

From time to time ‘huddles’ get mentioned. A huddle is essentially a cluster of physical servers all working together to provide the cloud servers service. The Rackspace cloud consists of hundreds of separate huddles, all under one administration system.

Generally this is unimportant to end users. Rackspace, while not exactly secretive about huddles, are a little cagey about them, as they start to show some of the details of what goes on behind the scenes and ruin the illusion of completely scalable infrastructure.

Each area (London, Chicago or Dallas-Fort Worth) consists of one or more data centres for cloud servers, and each data centre will have one or more huddles. According to The Register:

One Huddle consists of 210 servers across 14 cabinets. The basic server spec is a dual Hex core, 12GB of RAM and three 500GB hard disks on a RAID5 configuration.

I believe this is slightly inaccurate. The cloud servers go up to 15872MB of RAM, which obviously can’t be accommodated on 12GB of RAM. Also, according to the cpuinfo on one of my Chicago cloud servers, the server it is sitting on has

model name    : Quad-Core AMD Opteron(tm) Processor 2374 HE
cpu MHz        : 2176668.674

So, a 2.2GHz quad-core Opteron rather than the dual hex-core.

model name    : Quad-Core AMD Opteron(tm) Processor 2374 HE
cpu MHz        : 2409004.973

and a 2.4GHz quad-core Opteron on one of my London cloud servers.

Not a big issue, I’m probably just on older hardware.

So what does a huddle mean to the end user? Generally nothing. It only seems to be important on two occasions. The first is when there is a fault with the control plane of the huddle. If your servers are all on the same huddle, then they may all be affected at the same time. I don’t know how common this is, but for redundancy you are better off spreading your servers over different huddles.

Rackspace do say to use one of the other zones (LON/ORD/DFW) for redundancy and this might be a better idea if your application is mission critical as network issues (or massive disasters) could knock out multiple huddles at once in a single datacentre. The problem with this is that different datacentres don’t share a common servicenet, so for some applications spreading your servers across different huddles in the same zone is the best you can get…

The second time it is important is when you need to use a shared IP. Shared IPs can only be shared with other servers in the same huddle. This isn’t as much of a problem as it sounds: if you use the API to launch the servers, then you need to launch every server except the first into a shared IP group, which will automagically be on the same huddle. However, via the web interface there doesn’t seem to be any way to specify it, other than opening a chat with support and letting them toggle your default huddle and add shared IPs for you.

Much of this is guesswork based on the available documentation and conversations with support from time to time. If you have any additional information or corrections, please get in touch: mike -at- technomonk . com