AD7six.com

File Caching - Preventing Running Out of Space 7 Jan 2015 12:26 AM

One project I take care of creates a large number of sizable files (which can be reproduced on demand). Each file takes about a minute to create and is typically only accessed once. However, some files are more popular than others and, once created, are accessed repeatedly. In the early days of this project I had 1TB of space available, didn’t think about what could happen and left the project to run. Gradually all of the space was eaten up by these stored files, eventually leading to the question:

How to efficiently store as many files as possible without running out of space?

If you’re familiar with linux commands the answer may be obvious, but if you’re not (or maybe even if you are and it’s pre-coffee time) this is what I’d like to document today.

In the beginning

Let’s call all these files what they are: a file cache. They are files which can be reproduced (at a cost, time) but are left lying around to reduce/prevent duplicate effort. When this file cache occupies 1% of the available space on a server it’s hardly important to think about how to get rid of them, however when this file cache occupies 100% of the available space on a server it’s a different matter.

In my own case the rise from 1% to say 50% was slow - it took months. However the rise from 50% to 100% was overnight, caught me a little by surprise and caused “some problems” which were quickly solved with a call to:

# Delete all files not accessed in a couple of weeks
find /some/path -type f -atime +14 | xargs rm

Pop that in a daily crontab and done. Fixed. Time goes by and whuh-oh the server has had another blip of traffic and it’s filling up with files again faster than they can be deleted. So 2 weeks is too long, let’s go for 1 week:

# Delete all files not accessed in a week
find /some/path -type f -atime +7 | xargs rm

Done. Fixed. Whuh-oh, holy moly, another blip of traffic… going to have to get more space and change that crontab to use a much shorter time…

# Delete all files not accessed in 2 days
find /some/path -type f -atime +2 | xargs rm

Done. Fixed.

Boo, traffic has gone down again and now there’s 95% free space available; it’s not necessary to be so aggressive deleting those cached files - but increasing the time they are allowed to hang around risks filling the disk and running out of space again.

There has to be a better way.

A better way

After some thought on the matter, this was the process I needed:

  1. Determine which physical drive a folder is on
  2. Derive the amount of free space needed as a % of drive size
  3. Construct a list of files ordered by last access time
  4. Construct a list of files to delete
  5. Delete files until free space reaches the threshold

Each of those tasks is very simple, and some googling will turn up a few useful results and plenty that probably aren’t.

Determine which physical drive a folder is on

I’ve used df countless times yet didn’t know until addressing this task that you can use it to tell you which drive a folder is on:

$ df -h /some/path/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       2.0T  1.4T  494G  75% /some/path

Well that was easy!

We won’t be wanting human-readable output, so drop the -h flag; and to strip off the first line of the response, we can simply use awk:

$ df /some/path/ | awk 'NR>1{print}'
/dev/sdb1      2113786796 1488804468 517608200  75% /some/path

Derive the amount of free space needed as a % of drive size

The previous command gives a result which is easily parsable.

To get the drive size (note that df reports 1 KiB blocks by default, not bytes):

$ df /some/path/ | awk 'NR>1{print}' | awk '{print $2}'
2113786796

To get the current free space, also in 1 KiB blocks:

$ df /some/path/ | awk 'NR>1{print}' | awk '{print $4}'
517608200
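
As an aside, a new enough GNU df can print just the columns we care about and skip the awk parsing entirely - this assumes coreutils 8.21 or later for the --output flag, and the values are still 1 KiB blocks:

$ df --output=size,avail /some/path/ | tail -n 1
2113786796  517608200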

Let’s say we want 30% of the drive free; using a couple of variables, that becomes:

$ STATS=`df /some/path/ | awk 'NR>1{print}'`
$ SIZE=`echo $STATS | awk '{print $2}'`
$ FREE=`echo $STATS | awk '{print $4}'`
$ TARGET_FREE=`expr $SIZE \* 3 / 10`

$ echo $TARGET_FREE
634136038

How many bytes to delete? Since find (used below) reports file sizes in bytes while df reports 1 KiB blocks, convert the difference to bytes:

$ BYTES_TO_DELETE=`expr \( $TARGET_FREE - $FREE \) \* 1024`

$ echo $BYTES_TO_DELETE
119324506112

Rock and roll, only need to delete 119,324,506,112 bytes - roughly 111 GiB - today.

Construct a list of files ordered by last access time

This isn’t quite as simple as it sounds if you want to do it efficiently. However, a read of find’s man page yields the following:

$ # find
$ #     %A+ - Last access time (YYYY-MM-DD+HH:MM:SS.N)
$ #     %p  - filename
$ #     %s  - Size in bytes
$ find /some/path -type f -printf "%A+::%p::%s\n"
2015-01-04+00:28:05.7512236840::/some/path/52/52/some-file.zip::3803767
...

Pipe that through sort and we have a list of files ordered by last access time, together with each file’s path and size in bytes:

$ find /some/path -type f -printf "%A+::%p::%s\n" | sort > /tmp/all-files

Construct a list of files to delete

With a parsable list of files there are many ways in which to trim it down to just the files to delete - one intriguing post I found did this with awk, and adapting it to this use case was easy:

$ cat /tmp/all-files | awk -v bytesToDelete="$BYTES_TO_DELETE" -F "::" '
  BEGIN { bytesDeleted=0; }
  {
  bytesDeleted += $3;
  if (bytesDeleted < bytesToDelete) { print $2; }
  }
  ' > /tmp/files-to-delete

Looking at the contents of /tmp/files-to-delete, it should be the same list of files as the input, truncated at the point where enough bytes have been accounted for to reach the desired amount of free space.
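
Before deleting anything it’s easy to sanity check the selection - a sketch assuming GNU stat and xargs are available - by confirming the selected files add up to roughly the number of bytes that need freeing:

$ # Total size, in bytes, of the files selected for deletion
$ xargs -d '\n' stat -c %s < /tmp/files-to-delete | awk '{ total += $1 } END { print total }'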

Delete files until free space reaches the threshold

The final step is very simple:

$ cat /tmp/files-to-delete | xargs rm

Tada, free space is now 30% of the drive.

Complete script

If you’d like a drop-in script that does what is described here, here you go:

#!/bin/bash
################################################################################
#
# Delete least used files
#
################################################################################

PROGNAME=${0##*/}
PROGDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
VERSION="0.1"
DRYRUN=0

error_exit() {
  echo -e "${PROGNAME}: ${1:-"Unknown Error"}" >&2
  exit 1
}

graceful_exit() {
    exit
}

usage() {
    echo -e "Usage: $PROGNAME [-h|--help|-n|--dryrun] directory"
}

help_message() {
  cat <<- _EOF_
  $PROGNAME ver. $VERSION
  Find least-accessed files, and delete them

  Will ensure that the relevant disk is less than 70%
  full

  $(usage)

  Example:
    $PROGNAME /some/path/
    Drive /dev/sdb1 has xxx KiB free
     need yyy KiB available
    Finding files to delete ...
    Deleting the following files:

    /some/path/foo.zip
    /some/path/bar.zip

  Options:
  -h, --help   Display this help message and exit.
  -n, --dryrun Simulate only

_EOF_
  return
}

function driveSize {
    stats=`df -P $DIR | awk 'NR>1{print}'`

    DRIVE=`echo $stats | awk '{print $1}'`
    DRIVESIZE=`echo $stats | awk '{print $2}'`

    DRIVEUSED=`echo $stats | awk '{print $3}'`
    TARGETUSED=`expr $DRIVESIZE \* 7 / 10`

    DRIVEFREE=`expr $DRIVESIZE - $DRIVEUSED`
    TARGETFREE=`expr $DRIVESIZE \* 3 / 10`

    # df reports usage in 1 KiB blocks, but find reports file sizes in bytes
    BYTESTODELETE=`expr \( $DRIVEUSED - $TARGETUSED \) \* 1024`
}

function findToDelete {
    FILEPATH=`mktemp`

    # find
    #     %A+ - Last access time (YYYY-MM-DD+HH:MM:SS.N)
    #     %p  - filename
    #     %s  - Size in bytes
    find "$DIR" -printf "%A+::%p::%s\n" \
    | sort > "$FILEPATH.raw"

    cat "$FILEPATH.raw" | awk -v todelete="$BYTESTODELETE" -F "::" '
      BEGIN { deleted=0; }
      {
      deleted += $3;
      if (deleted < todelete) { print $2; }
      }
      ' > "$FILEPATH.processed"
}

function main {
    driveSize;
    echo "Drive $DRIVE has $DRIVEFREE free bytes"
    echo " need $TARGETFREE bytes available"

    if [ $DRIVEFREE -lt $TARGETFREE ];
        then
        echo "Finding files to delete ..."
        findToDelete
        if [ $DRYRUN == 1 ];
        then
            echo "The following files would be deleted:"
            echo ""
            cat "$FILEPATH.processed"
        else
            echo "Deleting the following files:"
            echo ""

            cat "$FILEPATH.processed" | tee | rm
        fi
    else
        echo "No action required at this time"
    fi
}

# Parse command-line
while [[ -n $1 ]]; do
  case $1 in
    -n | --dryrun)
      DRYRUN=1;;
    -h | --help)
      help_message; graceful_exit ;;
    -* | --*)
      usage
      error_exit "Unknown option $1" ;;
    *)
      DIR=$1;;
  esac
  shift
done

if [ "$DIR" == "" ];
then
      usage;
      graceful_exit;
fi

main $DIR

It’s functionally the same as the steps described above (each step is performed separately, mostly to make it easier to understand and/or to see what the script will do or did), with a few niceties thrown in (such as a dry-run mode, and doing nothing at all unless necessary), and it makes a handy crontab addition:

# Ensure there's 30% free space on the drive every hour
0 * * * * /usr/local/sbin/purgePath /some/path
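
Before trusting it to cron, the dry-run flag reports what would be deleted without removing anything (assuming the script has been installed at the path used in the crontab above):

$ /usr/local/sbin/purgePath --dryrun /some/path

Redirecting the cron job’s output to a file (for example appending >> /var/log/purge-path.log 2>&1 to the crontab line - the log path is just an example) also makes it easy to review afterwards what was deleted and when.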

Conclusion

It would be very easy to lazily continue to use a “delete anything not accessed in x days” approach, as I’m sure I’ve done many times in the past for similar scenarios - but using tools available on any linux system (and a few Useless Uses of cat), any folder used as a file cache can now be used optimally without the risk of swallowing all disk space and killing a server.

Logrotate: Rotate Your Log Files 25 Oct 2014 2:54 AM

Every once in a while, I’ve found myself looking at a simple, stable app deployed on a server that hasn’t been touched in months and being asked (or wondering myself) one of these things:

  1. Why has the app become so slow?
  2. Why has the server run out of disk space?

A common and repeating cause for both problems is: fat log files. There’s a simple solution which is possibly often forgotten and that’s what I’d like to write about today.

How do log files slow your app down?

Appending to a file involves finding the end of the file first. Appending to a big file can mean spending more time trying to find the end of a file than it does to actually write the bytes, which is one cause for an otherwise stable application to gradually grind to a halt.

In my own experience, there are 2 basic causes for log files to get large enough to be a problem:

  1. The application is generating a large number of error messages
  2. The application is no longer being developed/modified

Both of which hint at the root “crime” being committed: The log files aren’t being read by anybody.
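
A one-liner makes it easy to spot the log files nobody is reading - a sketch that assumes the /var/www/*/tmp/logs layout used later in this post, and GNU sort’s -h flag:

$ # Largest application log files, biggest first
$ du -ah /var/www/*/tmp/logs/*.log 2>/dev/null | sort -rh | head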

Doesn’t CakePHP take care of that for you?

Since CakePHP version 2.4 the problems caused by (cough) not reading your log files are limited to running out of space. That’s because as of 2.4 the file logger has a basic log rotation feature which will rename log files by appending a timestamp whenever they get too big, and create a new empty log file to write to.

Running out of space doesn’t normally happen overnight, but if nobody is checking for that things will work perfectly right up until there’s no space left at which point things crash hard. If you’re using log files, my recommendation to you is to not rely on the built in log rotation feature.

What can you do? Well, the logical choice is to not use files for logging at all in production and instead log to Syslog. However, the simplicity and convenience of the file log engine means it’s probably still the most common form of logging in any given (CakePHP) application, so on with the next best thing.

Logrotate

If you use linux at all, you are probably already using logrotate.

Logrotate, as described by its own man page:

logrotate is designed to ease administration of systems that generate large numbers of log files. It allows automatic rotation, compression, removal, and mailing of log files. Each log file may be handled daily, weekly, monthly, or when it grows too large.

Take a look inside /etc/logrotate.d/ and you’ll find a few example logrotate config files which will give you an idea of what it is already doing for you. These config files are pretty simple, so how about setting up logrotate to automatically clean up your application log files?

If log files are being written to a consistent location, that’s pretty trivial to achieve. Consider numerous CakePHP 3.x applications all deployed into /var/www/:domain; that would mean the following pattern describes where all of the php application log files are:

/var/www/*/tmp/logs
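
Before writing the config it’s worth confirming the pattern actually matches something (the example.com path here is simply the deployment that shows up in the logrotate output further down):

$ ls -d /var/www/*/tmp/logs
/var/www/example.com/tmp/logs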

Here’s an example CakePHP application log rotate config file, which will rotate all log files daily, compress them, and keep up to 12 old log files:

# /etc/logrotate.d/php-apps
/var/www/*/tmp/logs/*.log {
   rotate 12
   daily
   missingok
   notifempty
   compress
   delaycompress
}

With this config in place, a server is safe from unloved log files gradually taking it over until it collapses.

It’s not necessary to wait for logrotate to run on its normal schedule to verify that the config file does what’s expected - it can be launched directly via the cli:

$ logrotate --force -v /etc/logrotate.d/php-apps
reading config file /etc/logrotate.d/php-apps

Handling 1 logs

rotating pattern: /var/www/*/tmp/logs/*.log  forced from command line (12 rotations)
empty log files are not rotated, old logs are removed
considering log /var/www/example.com/tmp/logs/debug.log
...

Conclusion

You should probably be using syslog, but if you’re not, don’t leave your log files lying around as a (slow) time bomb. Even on a development machine, I’d recommend to configure logrotate to keep things tidy; it at least eliminates one cause for a server needing urgent attention.

CakePHP 3.x - Entity Routing 16 Sep 2014 1:17 AM

One of the most satisfying bits of code to write is a simple solution to a simple (or overlooked) problem. The problem I’m going to write about today is Routing.

What’s wrong with routing?

Nothing.

However, that’s not to say it can’t be improved. To set the scene take a look at the 2.x docs for routing and you’ll find the following example:

Example standard route
Router::connect(
    '/:controller/:id',
    ['action' => 'view'],
    ['id' => '[0-9]+']
);

No surprises there. So the above route would be used like so:

Example standard link
echo $this->Html->link(
  $post['title'],
  ['controller' => 'posts', 'action' => 'view', 'id' => $post['id']]
);

// outputting a link to /posts/view/123

Unless your application is trivial, your view files are quite likely to be full of code like this, and there’s nothing wrong with that: it’s the standard way to write a url using reverse routing (taking an array and returning a string).

However, it’s normally when you already have a chunk of code written that a curveball comes your way, such as:

Whoops.

We forgot about seo - we need to put some of that magic sauce in the url. All urls.

So now, your route (and relevant controller code if necessary) gets updated to e.g.:

Example slug route
Router::connect(
    '/:controller/:slug',
    ['action' => 'view'],
    ['slug' => '[0-9a-z-]+']
);

And view code (and controller redirects) needs to change to match like so:

Example slug link
echo $this->Html->link(
  $post['title'],
  ['controller' => 'posts', 'action' => 'view', 'slug' => $post['slug']]
);

Rinse and repeat a few times and this gets pretty tedious. What if there was a better way?

A better way

Let’s jump straight into designing a solution and then make it work.

I’m going to make use of named routes as this both reduces verbosity and is a minor speed up - since it means telling the router explicitly which route to use, instead of asking it to iterate over all route definitions looking for a match.

This is the end result we’re going to achieve:

entity route
echo $this->Html->link(
  $post['title'],
  ['_name' => 'postsView', '_entity' => $post]
);

// outputting a link to /posts/view/123
//                   or /posts/123
//                   or /posts/cakephp-3-0-entity-routing
//                   or /whatever/you/want

Note: Entity Routing IS NOT a CakePHP 3.x core feature - the above will not “just work”

Now let’s think about that for a second:

Can you feel the awesome?

Making it work

We only need 2 things for the above to work:

1. An appropriate route definition

We’ll need a named route definition, and to specify a different route class:

entity route definition
$routes->connect(
  '/posts/:slug',
  ['controller' => 'Posts', 'action' => 'view'],
  ['_name' => 'postsView', 'routeClass' => 'EntityRoute']
);

2. The entity-route class

And, to create the route class:

App\Routing\Route\EntityRoute.php
<?php
namespace App\Routing\Route;

use Cake\Routing\Route\Route;

class EntityRoute extends Route {

    public function match(array $url, array $context = []) {

        if (isset($url['_entity'])) {

            $entity = $url['_entity'];
            preg_match_all('@:(\w+)@', $this->template, $matches);

            foreach($matches[1] as $field) {
                $url[$field] = $entity[$field];
            }

        }

        return parent::match($url, $context);
    }

}

All the above route class does is use the route template (/posts/:slug) to identify which properties (here, slug) to read from the passed entity, if one is present; otherwise it acts exactly the same as a standard route.

With the above route definition and route class all of these would return the same string:

entity route
echo Router::url(['controller' => 'posts', 'action' => 'view', 'slug' => $post['slug']]);
echo Router::url(['_name' => 'postsView', 'slug' => $post['slug']]);
echo Router::url(['_name' => 'postsView', '_entity' => $post]);

But only one of the above calls is immune to “significant” route changes.

Wrapping up

Entity routing is a rather simple change.

One thing it has encouraged me to do is consider creating bespoke route classes more often. For example, have some sort of “make sure this parameter is always set” logic in your app controller’s beforeFilter? Why not put that logic in the route definition and forget about it?

That’s all for now, let me know what you think =).

A Clean Slate 11 Sep 2014 4:30 AM

Let’s start again, shall we?

Many moons ago, I would be eager to write, but there would always be something between me and writing that got in the way. One of those things was looking at the collected tangle of code/projects that I’ve accumulated over the years, so today I’ve cleaned house and archived a load of my stale git projects on my git archive.

Another barrier to writing was simply looking at my obsolete posts - which were written mostly for/about CakePHP version 1. CakePHP version 3.0 is right around the corner, so to open the way to dumping code and ideas about that, I’ve archived my old posts, leaving a freshly combed sandpit ready to be jumped into.

I’m hopeful that over the next few months I’ll get back into the habit of writing regularly, and publish a few tutorials based on my experiences building applications with CakePHP 3.0.
