GarfieldTech - Technical thoughts, tutorials, and musings
A survey of data modeling

There are many different ways of modeling data. They all have their place, and all have places where they are a poor fit.

The spectrum of options below is defined mainly by the degree to which each differentiates between read and write models, and correspondingly how powerful-but-also-complex each is. "Model" in this case usually corresponds to a class, or a class with one or more composed classes.

Varieties of data modeling

Arbitrary SQL

In this case, there is no formal data definition beyond the SQL (or other database) schema. The application just runs arbitrary SQL queries, both read and write, wherever it sees fit.

In a slightly better variant, SQL queries are all confined to selected objects that act as an API to the rest of the application. Arbitrary code does not call SQL, but it can call a method on this object that will call SQL.

The SQL could be hand-crafted, use a query builder of one kind or another, or a little of each.

This approach may work at a very small scale, where building something more formal isn't worth the effort. However, the tipping point where it is worth the effort comes very, very early.

CRUD

The most widely used approach is known as "Create Read Update Delete" (CRUD). Those are the four standard operations. In this case, the system models a series of data objects called Entities. While technically Entities do not need to correspond 1:1 to a particular database table, in practice that is often the case. An entity could also have dependent tables, the details of which are mostly hidden.

CRUD is usually managed by an ORM, or Object-Relational Mapper. The ORM attempts to hide all SQL logic from the user, providing a consistent interface pattern. A user Reads (loads) an Entity by ID, possibly Updates it (edits some value), and then saves it back to the database. The user only interacts with the Entity object.

There are two main variants of ORM: Active Record, in which the Entity object has direct access to the database connection to load and save itself, and Data Mapper, in which the Entity is ignorant of its storage and a separate service (a mapper, or repository, or various other names) is responsible for the loading and saving. Active Record is often easier to implement from scratch, so it is popular with RAD-oriented tools (like Ruby on Rails or Laravel). It is, however, a vastly inferior design as it severely hinders testing, encapsulation, and more advanced cases. The effort to set up a Data Mapper is almost always worth it, as the effort is not substantially higher for a skilled developer.
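As a minimal sketch of the difference (hypothetical classes, not any particular framework's API):

// Active Record: the Entity holds the database connection and saves itself.
class Product
{
    public function __construct(
        private \PDO $db,   // storage concern baked into the domain object
        public int $id,
        public string $name,
    ) {}

    public function save(): void
    {
        $stmt = $this->db->prepare('UPDATE products SET name = ? WHERE id = ?');
        $stmt->execute([$this->name, $this->id]);
    }
}

// Data Mapper: the Entity is plain data; a separate service persists it.
class ProductRecord
{
    public function __construct(
        public int $id,
        public string $name,
    ) {}
}

class ProductMapper
{
    public function __construct(private \PDO $db) {}

    public function save(ProductRecord $product): void
    {
        $stmt = $this->db->prepare('UPDATE products SET name = ? WHERE id = ?');
        $stmt->execute([$product->name, $product->id]);
    }
}

Because the Data Mapper entity never touches the connection, it can be constructed and tested without a database at all.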

CRUD falls down in three key areas:

  • It assumes the "read model" and "write model" are the same. While in simple cases that may be true, they often have different validation requirements. For example, "last updated" or "last login time" are likely fields that are not needed on the write model, as the system manages them directly; they are either absent or optional. On the read model, however, we expect them to be always present. That difference cannot be easily captured in a single unified object. (Some workarounds do exist, but they are workarounds only.)
  • Relationships. A key value of SQL is data being "relational." That is, Entity A may have a "contains" or "uses" or "is parent of" relationship with Entity B, or with another Entity A. In SQL, this is almost always captured using a foreign key, and many-to-many relationships are captured with an extra join table. Mapping that into objects is often difficult, especially for complex data where an Entity spans multiple tables. It can also lead to severe performance problems, especially the "SELECT N+1 Problem," in which a series of Entity A objects are loaded, then as they are used each one lazily loads its related Entity B, resulting in "N+1" queries (see the sketch after this list).
  • Listing and Querying. SQL is very good at building searches across arbitrary data fields. That's what it was built for. Object models frequently are not. They are fine for straight read/write operations, but less so for "find all products that cost at least $100 bought by a customer over the age of 50 in the last 3 months." That generally requires either dropping down to manual SQL to get a list of entity IDs, then doing a bulk-read on them, or a complex query builder syntax (either method calls or a custom string syntax) that translates high-level object relationships into low-level SQL. Tools like Hibernate (Java) or Doctrine ORM (PHP) take the latter approach, which is one reason they are so large and complex.
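A minimal sketch of the N+1 trap, assuming a lazy-loading ORM with a hypothetical findBy() repository API:

// One query to load all open orders...
$orders = $orderRepository->findBy(['status' => 'open']);

foreach ($orders as $order) {
    // ...then one more query per order, fired lazily the first time
    // the relationship is accessed. With N orders, that's N+1 queries.
    echo $order->getCustomer()->getName(), PHP_EOL;
}

The usual mitigation is an eager or join-based load, so the ORM fetches the orders and their customers together up front.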

An ORM, in concept, also does not offer any native way to create compound views (showing a subset of fields from three different related entities, for example). Some ORMs provide a mechanism of some sort, but rarely are they as capable or efficient as just writing SQL.

The impedance mismatch between object models and relational models has been called "The Vietnam of Computer Science": the more effort you pour into bridging it, the worse it gets. Simple ORMs are straightforward to build, but have an upper bound on complexity before they become too unwieldy.

CRAP

There is a variant of CRUD known as Create Read Archive Purge (CRAP), which does not get anywhere near as much use as it should. In this approach, each Entity is not updated in place when modified. Instead, an entirely new copy of the Entity is stored in the database, along with some version identifier. That gives each Entity a history of its state over time, with a built-in ability to review that history and revert to an earlier state.

No Entity is deleted; if an entity needs to be deleted, a new version of it is saved that has a "deleted" flag set to true. Any SQL that interacts with the Entity must then be written to exclude older versions and deleted versions, unless specifically instructed not to.
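For example, a read query against a CRAP-style table might look like this (a sketch, assuming a PDO connection in $db and a pages table with is_current and deleted columns; real schemas vary):

// Every query that touches a versioned Entity must exclude
// superseded revisions and soft-deleted records by default.
$sql = 'SELECT id, title, body
        FROM pages
        WHERE is_current = TRUE
          AND deleted = FALSE';

$pages = $db->query($sql)->fetchAll(\PDO::FETCH_ASSOC);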

If the historical data of a given Entity is no longer valuable, or is not valuable after a period of time, a separate Purge command can remove old revisions, including removing deleted entities entirely. The time frame for such purges and whether they can be user-triggered varies with the implementation.

The advantage is, of course, the history and rollback ability. It's also relatively easy to extend it to include forward revisions, which are revisions that will become the active revision at some point in the future (either upon editorial approval or some time trigger).

The downside is the extra tracking required, which means every bit of SQL that interacts with a CRAP Entity needs to be aware of its CRAPpiness. Writing arbitrary custom SQL becomes more problematic in this case, as a query that forgets to account for old revisions or deleted entities could result in unexpected data. That is especially true with more complex relationships. It also implies questions like "should Entity A getting a new revision cause Entity B to get a new revision, too? Should Entity A point to Entity B, or a specific revision of Entity B?" All possible answers to those questions are valid in some situations but not others. There may also be performance considerations if there are many revisions of many Entities, although that is a solvable problem with smart database design.

Nonetheless, I would argue CRAP is still superior to CRUD in most editorial-centric environments (news websites, company sites, etc.).

Projections

An extension available to both CRUD and CRAP is Projections. Usually Projections are discussed in the context of CQRS or Event Sourcing (see below), but there's no requirement that they only be used there.

A Projection is the fancy name for stored data that is derived from other stored data. When the primary data is updated, an automated process causes the projection to be updated as well. That automation could be in application logic or SQL triggers/stored procedures; I would even consider an SQL View (either virtual or materialized) to be a form of Projection.

Projections are useful when you want the read version of the data structured very differently than the write version, or want it presented in some way that is expensive to compute on-the-fly.

For example, if you want a list of all sales people, their weekly sales numbers, the percentage change from last week, ordered by sales numbers, that could be expensive to compute on the fly. It could also be complex, if that data has to be derived from individual sale records and those sale records are spread across multiple tables, and sales team information is similarly well normalized across multiple tables. Instead, either on a schedule or whenever a sale record is updated, some process can compute that data (either the whole table or just update the one record it needs to) and save it to a sales_leaderboard table. Viewing that information is then a super simple, super fast single-table SELECT query.
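Here's a sketch of such a projector in application code, assuming PostgreSQL and hypothetical sales and sales_leaderboard tables, with a unique key on salesperson_id:

class SalesLeaderboardProjector
{
    public function __construct(private \PDO $db) {}

    // Recompute one salesperson's projection row from the primary sale records.
    public function update(int $salespersonId): void
    {
        $sql = <<<'SQL'
            INSERT INTO sales_leaderboard (salesperson_id, this_week, last_week)
            SELECT salesperson_id,
                   COALESCE(SUM(amount) FILTER (
                       WHERE sold_at >= NOW() - INTERVAL '7 days'), 0),
                   COALESCE(SUM(amount) FILTER (
                       WHERE sold_at >= NOW() - INTERVAL '14 days'
                         AND sold_at <  NOW() - INTERVAL '7 days'), 0)
            FROM sales
            WHERE salesperson_id = :id
            GROUP BY salesperson_id
            ON CONFLICT (salesperson_id) DO UPDATE
                SET this_week = EXCLUDED.this_week,
                    last_week = EXCLUDED.last_week
            SQL;

        $this->db->prepare($sql)->execute(['id' => $salespersonId]);
    }
}

Reading the leaderboard is then a plain SELECT on sales_leaderboard, with the percentage change computed from the two columns.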

If that table ever becomes corrupted or out of date, or we just want to change its structure, the data can just be wiped and rebuilt from the existing primary data. Projections are always expendable. If not, they're not Projections.

A system can use as many or as few Projections as needed, built in a variety of ways. As usual, there's more than one way to feed a cat. If heavily used, Projections form essentially the entire read model; there's no need to read Entities from the primary data, except for update purposes.

Technically, any search index (Elasticsearch, Solr, Meilisearch, etc.) is a Projection. There is no requirement that the Projection even be in SQL, just that it is expendable, rebuildable data in a form that is optimized for how it's going to be read.

CQRS

The next level in read/write separation is Command Query Responsibility Segregation (CQRS). CQRS works from the assumption that the read and write models are always separate.

Often, though not always, the write models are structured as command objects rather than as an Entity per se. That could be low-level (UpdateProduct command with the fields to change) or high-level (ApprovePost with a post ID).
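For instance, those two flavors of command might look like this (hypothetical names):

// Low-level command: update specific fields of a product.
readonly class UpdateProduct
{
    public function __construct(
        public int $productId,
        public ?string $name = null,
        public ?int $priceCents = null,
    ) {}
}

// High-level command: express intent, not field changes.
readonly class ApprovePost
{
    public function __construct(public int $postId) {}
}

// A handler turns the command into whatever writes are needed,
// possibly touching several entities or tables.
class ApprovePostHandler
{
    public function __construct(private \PDO $db) {}

    public function handle(ApprovePost $command): void
    {
        $stmt = $this->db->prepare(
            "UPDATE posts SET status = 'approved' WHERE id = ?"
        );
        $stmt->execute([$command->postId]);
    }
}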

The read models could be structured in an Entity-like way, but do not have to be. CQRS does not require using Projections, though they do fit well.

The advantage of CQRS is, of course, the flexibility that comes with having fully independent read and write models. That allows using the type system to enforce write invariants while having completely separate immutable read models. It also allows separating both read and writes from the underlying Entity definitions; a single update command may impact multiple entities, and a read/lookup can easily span entities.

The downside of CQRS is the added complexity that keeping track of separate read and write models entails. It requires great care to ensure you don't end up with a disjointed mess. Martin Fowler recommends only using it within one Bounded Context rather than the system as a whole (though he does not go into detail about what that means). If the read and write models are "close enough," CRUD with an occasional Projection may have less conceptual overhead to manage.

Event Sourcing

The most aggressive separation between read and write models is Event Sourcing. In Event Sourcing, there is no stored model. The primary data that gets written is just a history of "Events" that have happened. The entire data store is just a history of event objects, with some indexing support.

When loading an object (or "Aggregate" in Event Sourcing speak), the relevant Events are loaded from the store and a "current status" object is built on-the-fly and returned. In practice, in a well-designed system this process can be surprisingly fast. The Event stream also acts as a built-in log of all actions taken, ever.

Event Sourcing also leans very heavily on Projections. Projections can represent the current state of the system as of the most recent event, in whatever form is desired. Storing an Event can trigger a handler that updates Projections, sends emails, enqueues jobs, or anything else.

Importantly, events can be replayed. That means, for example, creating a new Projection requires only writing the routine that creates the projection, then rerunning the entire Event stream on it. It will then build the Projection appropriately. If the Projection is updated, migrating a projected database table is simple: Delete the old one, create the new one, rerun the Event stream. Every database table, search index, etc., except for the Event stream itself, is disposable and can be thrown out and recreated at will.
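A sketch of the idea, with hypothetical event classes; rebuilding current state is just a fold over the event history:

interface Event {}

readonly class FundsDeposited implements Event
{
    public function __construct(public string $accountId, public int $amountCents) {}
}

readonly class FundsWithdrawn implements Event
{
    public function __construct(public string $accountId, public int $amountCents) {}
}

// The "current balance" is never stored; it is derived by
// replaying the account's events in order.
function currentBalance(iterable $events): int
{
    $balance = 0;
    foreach ($events as $event) {
        $balance += match (true) {
            $event instanceof FundsDeposited => $event->amountCents,
            $event instanceof FundsWithdrawn => -$event->amountCents,
            default => 0,
        };
    }
    return $balance;
}

Bootstrapping a new Projection works the same way: feed it the full event history once, then keep it updated as new events arrive.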

The downside is that Event Sourcing, like CQRS, requires careful planning. It's a very different mental model, and not a good fit for all situations. Banking is the classic example of where it fits, and where a history of actions taken is the most important data. A typical editorial CMS, however, would be a generally poor fit for Event Sourcing, as most of what it's doing is very CRUD-ish. Nearly all events would be some variation on PostUpdated.

Depending on the complexity of the data, building reasonable Projections could be a challenge. If Entities/Aggregates are loaded from the Event stream, they may be easy or complex to reconstitute.

General advice

(This section is, of course, quite subjective.)

All of these models have their trade-offs, and pros/cons. For most standard applications, I would argue that CRUD-with-Projections is the least-bad approach. The ecosystem and known best practices are well established. Edge cases where the read and write models need to differ can often be handled as one-offs, if the system is designed with that in mind. That sort of edges it into CQRS space in limited areas, which is both helpful and risky if viewed as a slippery slope.

Even in a CRUD-based approach, it's possible to have slightly different objects for read and write. If the language supports it, the read objects can be immutable, while the write objects are mutable aside from select key fields (primary key, last-updated timestamp, etc.), which may even be omitted. The line between this split-CRUD approach and CQRS is somewhat fuzzy, though, so be mindful that you don't over-engineer CRUD when you should just use CQRS.
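In PHP, that split might look like this (hypothetical entity):

// Read model: immutable, with system-managed fields always present.
readonly class Article
{
    public function __construct(
        public int $id,
        public string $title,
        public \DateTimeImmutable $lastUpdated,
    ) {}
}

// Write model: editable fields are mutable; system-managed
// fields (like the last-updated timestamp) are simply omitted.
class ArticleDraft
{
    public function __construct(
        public readonly int $id,
        public string $title,
    ) {}
}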

For workflow-heavy applications (like change-approval, or scheduled publishing, etc.), CRAP is likely worth the effort. The ability to have forward and backward revisions greatly simplifies many workflow approaches, and provides a nice audit trail.

Regardless of the approach chosen, it is virtually always worth the effort to define formal, well-typed data objects in your application to represent the models. Using anonymous objects, hashes, or arrays (depending on the language) is almost always going to cause maintenance issues sooner rather than later. Even if using CQRS, or queries that bypass a CRUD ORM, every set of records read from the database should be mapped into a well-defined, well-typed object. That is inherently self-documenting, eliminates (or at least highlights as needing attention) many edge cases, provides a common, central place for in-memory handling of those edge cases (e.g., null handling), and so forth.

Additionally, any database interaction should be confined to select, dedicated services that have exclusive responsibility for interacting with the database and turning results into proper model objects. This is true regardless of the model used.
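Combined, those two pieces of advice look roughly like this (a sketch; the names are illustrative, and a PDO connection is assumed):

readonly class Customer
{
    public function __construct(
        public int $id,
        public string $name,
        public ?string $email,  // one central place to decide how null is handled
    ) {}
}

// The only class allowed to talk to the customers table.
class CustomerRepository
{
    public function __construct(private \PDO $db) {}

    /** @return list<Customer> */
    public function findByName(string $name): array
    {
        $stmt = $this->db->prepare('SELECT id, name, email FROM customers WHERE name = ?');
        $stmt->execute([$name]);

        // Every record leaves this class as a well-typed object, never a raw array.
        return array_map(
            fn (array $row) => new Customer($row['id'], $row['name'], $row['email']),
            $stmt->fetchAll(\PDO::FETCH_ASSOC),
        );
    }
}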

If doing CRUD or CQRS, it may be tempting to optimize updates to write only the fields that actually changed, rather than the entire Entity at once. I would argue that, in most cases, this is a waste of effort. Modern SQL databases are quite fast and almost certainly smarter than you are when it comes to performance. If you are using a well-established ORM that already does that, fine, but if rolling your own, the effort involved is rarely worth it. At that point, you're almost merging CRUD and CQRS commands anyway.

Larry 19 August 2025 - 12:30pm
PHP
OOP
SQL


Crell/Serde 1.5 released

It's amazing what you can do when someone is willing to pay for the time!

There have been two new releases of Crell/Serde recently, leading to the latest, Serde 1.5. This is an important release, not because of how much is in it, but because of the major features it contains.

That's right, Serde now has support for union, intersection, and compound types! And it includes "array serialized" objects, too.

mixed fields

A key design feature of Serde is that it is driven by the PHP type definitions of the class being serialized/deserialized. That works reasonably well most of the time, and is very efficient, but it can be a problem when a type is mixed. When serializing, we can just ignore the type of the property and use the type of the value. Easy enough. When deserializing, though, what do you do? Because Serde supports non-normalized formats, like streaming formats, the incoming data is opaque; there is no type information to inspect.

The solution is to allow Deformatters to declare, via an interface, that they can derive the type of the value for you. Not all Deformatters can do that, depending on the format, but all of the array-oriented Deformatters (json, yaml, toml, array) can, and that's the lion's share of format targets. Then, when deserializing, if we hit a mixed field, Serde delegates to the Deformatter to tell it what the type is. Nice.

Sometimes that's not enough, though. Especially if you're trying to deserialize into a typed object, just knowing that the incoming data is array-ish doesn't help. Serde 1.4 therefore introduced a new type field for mixed values: #[MixedField]. MixedField takes one argument, $suggestedType, which is the object type that should be used for deserialization. If the Deformatter says the data is an array, then it will be upcast to the specified object type.

class Message
{
    public string $message;
    #[MixedField(Point::class)]
    public mixed $result;
}

When serializing, the $result field will serialize as whatever value it happens to be. When deserializing, scalars will be used as is while an array will get converted to a Point class.
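For example, using Serde's standard serialize/deserialize API (the JSON payloads here are illustrative, and Point is assumed to have x and y properties, as in the later example):

use Crell\Serde\SerdeCommon;

$serde = new SerdeCommon();

// A scalar passes through as-is: $result will be the int 42.
$a = $serde->deserialize('{"message": "hi", "result": 42}',
    from: 'json', to: Message::class);

// Array-ish data is upcast to the suggested type: $result
// will be a Point object built from the nested values.
$b = $serde->deserialize('{"message": "hi", "result": {"x": 1, "y": 2}}',
    from: 'json', to: Message::class);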

Unions and compound types

PHP has supported union types since 8.0, and intersection types since 8.1, and mixing the two since 8.2. But they pose a similar challenge to serialization.

The way Serde 1.5 now handles that is to simply fold compound types down to mixed. As far as Serde is concerned, anything complex is just "mixed," and we just defined above how that should be handled. That's... remarkably easy. Neat.

If the type is a union, specifically, then there's a little more we can do.

First, if a union type doesn't specify a suggestedType but the value is array-ish, it will iterate through the listed types and pick the first class or interface listed. That won't always be correct, but since the most common union type will likely be something like string|array or string|SomeObject, it should be sufficient in most cases. If not, specifying the $suggestedType explicitly is recommended.

Second, a separate #[UnionField] attribute extends MixedField and adds the ability to specify a nested TypeField for each of the types in the list. The most common use for that would be for an array, like so:

class Record
{
    public function __construct(
        #[UnionField('array', [
            'array' => new DictionaryField(Point::class, KeyType::String),
        ])]
        public string|array $values,
    ) {}
}

In this case, if the deserialized value is a string, it gets read as a string. If it's an array, then it will be read as though it were an array field with the specified #[DictionaryField] on it instead. That allows upcasting the array to a list of Point objects (in this case), and validating that the keys are strings.

Improved flattening, now for the top level

Another unrelated but very cool change fixes a long-standing bug when flattening array-of-object properties. Previously, their type was not respected. Now it is. What that means in practice is that you can now do this:

[
    {"x": 1, "y": 2},
    {"x": 3, "y": 4}
]

class PointList
{
    public function __construct(
        #[SequenceField(arrayType: Point::class)]
        public array $points,
    ) {}
}

$json = $serde->serialize($pointList, format: 'json');
$serde->deserialize($json, from: 'json', to: PointList::class);

Boom. Instant top-level array. Previously, this behavior was only available when serializing to/from CSV, which had special handling for it. Now it's available to all formats.

New version requirements

Because compound types were only introduced in PHP 8.2, Serde 1.5 now requires PHP 8.2 to run. It will not run on 8.1 anymore. Technically it would have been possible to adjust it in a way that would still run on 8.1, but it was a hassle, and according to the Packagist stats for Crell/Serde the only PHP 8.1 user left is my own CI runner. So, yeah, this shouldn't hurt anyone. :-)

Special thanks

These improvements were sponsored by my employer, MakersHub. Quite simply, we needed them, so I added them. One of the advantages of eating your own dogfood: You have an incentive to make it better.

Is your company using an OSS library? Need improvements made? Sponsor them. Either submit a PR yourself or contract the maintainer to do so, or just hire the maintainer. All of this great free code costs time and money to make. Kudos to those companies that already do sponsor their Open Source tool chain.

Larry 15 July 2025 - 1:01pm
PHP
Serde
Serialization
Open source
release


Mildly Dynamic websites are back

I am pleased to report that my latest side project, MiDy, is now available for alpha testing!

MiDy is short for Mildly Dynamic. Inspired by this blog post, MiDy tries to sit "in between" static site generators and full-on blogging systems. It is optimized for sites that are mostly static and only, well, "mildly dynamic": SMB websites, blogs, agency sites, and other use cases where, frankly, 90% of what you need is markdown files and a template engine... but you still need that other 10% for dynamic listings, form submission, and so on.

MiDy offers four kinds of pages:

  1. Markdown pages. These should be familiar to anyone that's worked with any "edit file on disk" publishing tool before.
  2. Latte templates. Latte is used as the main template engine for MiDy, but you can also create arbitrary pages as Latte templates. Want to have a one-off page where you control the HTML and CSS, but still inherit the overall page layout and theme? Great! Put a Latte template file in your routes folder and you're done.
  3. Static files. Self-explanatory.
  4. PHP. For the few cases where you need custom logic, you have a PHP class that can do whatever you'd like. It's still routed based on the file system path, just like any other file, but you get separate handlers for each HTTP method, and full arbitrary DI support as well.

The README covers more details, though as it's only at version 0.2.0 the documentation is still a work in progress. And of course, it's built for PHP 8.4 and takes full advantage of many new features of the language, like property hooks and asymmetric visibility. Naturally.

I will be converting this site over to MiDy soon. Gotta dog-food my own site, of course. (And finally get rid of Drupal.)

While I wouldn't yet recommend it as production ready, it's definitely ready for folks to try out and give feedback on, and to run test sites or personal sites on. I don't expect any API changes that would impact content at this point, but like I said, it's still alpha so caveat developor.

If you have feedback, please either open an issue or reach out to me on the PHPC Discord server. If you want to send a PR of your own, please open an issue first to discuss it.

I'll be posting more blog posts on MiDy coming up. Whether before or after I move this site to it, we'll see. :-)

Larry 16 March 2025 - 12:55am
PHP
MiDy
release


Self hosted photo albums

I've long kept my photo backups off of Google Cloud. I've never trusted them to keep them safe, and I've never trusted them to not do something with them I didn't want. Like, say, ingest them into AI training without telling me. (Which, now, everyone is doing.) Instead, I've backed up my photos to my own Nextcloud server, manually organized them, and let them get backed up from there.

More recently, I've decided I really need a proper photo album tool to carry around "wallet photos" of family and such to show people. A few years back I started building my own application for that in Symfony 4, but I ran into some walls and eventually abandoned the effort. This time, I figured I'd see what was available on the market for self-hosted photo albums for me and my family to use.

Strap yourself in, because this is a really depressing story (with a happy ending, at least).

I reviewed 7 self-hosted photo album tools, after checking various review sites for their top-ten lists. Of those 7:

  • 3 were in PHP, 2 were in JavaScript or TypeScript, and 2 were in Go.
  • 2 used the MIT license, 2 used the GPL, 1 used the AGPL, and 2 had broken non-free licenses.
  • I managed to get one working. 1. Uno.
  • Most really pushed you to install via their Docker Compose setup, none of which actually worked.

Let's have a look at the mess directly.

PiGallery 2 (https://bpatrik.github.io/pigallery2/)

Language: TypeScript
License: MIT

PiGallery 2 is intended as a light-weight, directory-based photo album. The recommended way to install it is to use their Docker compose file and nginx conf file... which you have to just manually copy out of Git. (Seriously?) And when I tried to get that to run locally, I could never connect to it successfully. There was something weird with the port configuration, and I wasn't able to quickly figure it out. If I can't get the "easy" install to work, I'm not interested.

Piwigo (https://piwigo.org/)

Language: PHP/MySQL
License: GPLv2

Unlike many on here, it doesn't provide a Docker image, which is fine, so I set one up using phpdocker.io. Unfortunately, its net installer crashed when I tried to use it, without useful errors. Trying to install manually resulted in PHP null-value errors from the install script. When I looked at the install script, I found dozens upon dozens of file system operations with the @ operator on them to hide errors.

At that point I gave up on Piwigo.

Coppermine (https://coppermine-gallery.net/)

Language: PHP/MySQL
License: GPL, version unspecified

When I first visited the Coppermine website, I got an error that their TLS certificate had expired a week and a half before. How reassuring.

Skipping past that, I was greeted with a website with minuscule text, with a design dating from the Clinton presidency. How reassuring.

Right on the home page, it says Coppermine is compatible all the way down to PHP 4.2, and supposedly up to 8.2. For those not familiar with PHP, 4.2 was released in 2002, only slightly after the Clinton presidency. PHP has evolved, um, a lot in 22 years, and most developers today view PHP 4 as an embarrassment to be forgotten. If their code is still designed to run on 4.2, it means they're ignoring literally 20 years of language improvements, including security improvements. How reassuring.

Oh, and the installation instructions, linked in the menu, are a direct link to some random forum post from 2017. How reassuring.

At this point I was so reassured that I Noped right out and didn't even bother trying to install it.

Lomorage (https://lomorage.com/)

Language: JavaScript. (Not TypeScript, raw JS as far as I can tell.)
License: None specified.

Although this app showed up on a few top-ten lists, its license is not specified, and installation only offers Windows and Mac. (Really?) The "others" section eventually lets you get to an Ubuntu section, where their recommendation is to install it via... an apt repository. Which is an interesting choice.

It has a GitHub repo, but that has no license listed at all. Which technically means it's not licensed at all, and so downloading it is a felony. (Yes, copyright law is like that.)

Being a good Netizen, I reached out to the company through their Contact form to ask them to clarify. They eventually responded that, despite some parts of the code being in public GitHub repos, none of it is Open Source.

Noping right out of that one.

PhotoPrism (https://www.photoprism.app/)

Language: Go
License: It's complicated

I actually managed to get this one to run! This one also "installs" via Docker Compose, but it actually worked. This is the only one of the apps I reviewed that I could get to work. Mind you, as a Go app I cannot fathom why it needs a container to run, since Go compiles to a single binary.

Their system requirements are absurdly high. Quoting from their site, "you should host PhotoPrism on a server with at least 2 cores, 3 GB of physical memory, and a 64-bit operating system." What the heck are they doing? It's Go, not the JVM.

In quick experimentation, it seemed decent enough. The interface is snappy and supports uploading directly from the browser.

However, I then ran into a pickle. The GitHub repository says the license is AGPL, which I am fine with. However, in the app itself is a License page that is not even remotely close to Free Software anything, listing mainly all the ways you cannot modify or redistribute the code.

I filed an issue on their repository about it, and got back a rather blunt comment that only the "Community Edition" is AGPL, which is a different download. The supported version is not.

Noping right out of this one, too.

Photoview (https://photoview.github.io/)

Language: Go, with TypeScript for front-end
License: AGPLv3

Another app that wants you to install via Docker Compose. And when I tried to do so, I got a bunch of errors about undefined environment variables. The install documentation says nothing about setting them, and it's not clear how to do so, so at this point I gave up.

Lychee (https://github.com/LycheeOrg/Lychee)

Language: PHP
License: MIT

Lychee is built with Laravel, which I don't care for, but I have used very good Laravel-based apps in the past, so I had high hopes. It talks about using Docker, but unlike the others here it doesn't provide a docker-compose file, just some very long docker run commands.

Their primary instructions are to git-clone the project, then run composer install and npm. Unfortunately, phpdocker.io is still built using Ubuntu 22.04, which has an ancient version of npm in it, and I didn't want to bother trying to figure out how to upgrade it.

Lychee did offer a demo container, which uses SQLite. That I was able to get to run successfully. However, for unclear reasons it wouldn't actually show any images.

At this point, I gave up.

So what now?

Rather disappointed in the state of the art, I decided to take a different approach. As I mentioned, I use Nextcloud to store all my images. Nextcloud has a photo app, but the last time I used it, it was very basic, and pretty bad. That was a few years ago, though, so I went searching.

Turns out, not only has Nextcloud Photos improved considerably, there's also another extension app on it called Memories. On paper, it looks like it does everything I'm after. A timeline feed, custom albums that don't require duplicating files, you can edit the Exif data of the image to show a title and description, plus some fancy extras like mapping geo information to OpenStreetMap and AI-based tagging, if you have the right additional apps installed. So would it work?

Turns out... yes. The setup was slightly fiddly, but mostly because it took a while to download all the map data and index a half-million photos. Once it did that, though... it just worked. It does almost everything I was looking for. I haven't figured out how to reorder albums or pictures within an album, and it looks like it doesn't support sub-albums. But otherwise, it does what I need. It even has a mobile app (free) that lets me show off selected pictures on my phone, which is what I was ultimately after.

I have always had a love/hate relationship with Nextcloud. In concept, I love it. Self-hosted file server and application hub? Sign me up! Despite being a PHP dev of 25 years, I've never quite understood why PHP made sense for it, though. And upgrades have always been a pain, and frequently break. But its functionality is just so useful. Apps are hit or miss, ranging from first-rate (like Memories) to meh.

But in this case, it ended up being both the cleanest and most capable option, as well as the easiest to get going, provided I already had a Nextcloud server. So, solution found. I am now a Memories user, and will be setting up accounts for the rest of the family, too.

Larry 12 November 2024 - 4:22pm
PHP
Nextcloud
POSSE


Property hooks in practice

Two of the biggest features in the upcoming PHP 8.4 are property hooks and asymmetric visibility (or "aviz" for short). Ilija Tovilo and I worked on them over the course of two years, and they're finally almost here!

OK, so now what?

Rather than just reiterate what's in their respective RFCs (there are many blog posts that do that already), today I want to walk through a real-world application I'm working on as a side project, where I just converted a portion of it to use hooks and aviz. Hopefully that will give a better understanding of the practical benefits of these tools, and where there may be a rough edge or two still left.

One of the primary use cases for hooks is to not use them: They're there in case you need them, so you don't need to make boilerplate getter/setter methods "just in case." However, that's not their only use. They're also really nice when combined with interface properties, and delegation. Let's have a look.

The use case

The project I'm working on includes a component that represents a file system, where each Folder contains one or more Page objects. Pages are keyed by the file base name, and may be composed of one or more PageFiles, which correspond to a physical file on disk.

So, for instance, form.latte and form.php would both be represented by PageFiles, and grouped together into an AggregatePage, form. (Do those file names suggest what I'm doing...?) However, if there's only a single news.html file, then it would be just a PageFile on its own. AggregatePage and PageFile both implement the same Page interface, which includes various metadata derived from the file (title, summary, tags, last-modified time, etc.)

Additionally, a Folder can be represented by a page inside it named index. That means a Folder also implements Page. As you can imagine, this makes the Page interface rather important. But it's actually two interfaces, because there's also PageInformation, which has the bare metadata and a child interface, Page, which adds logic around the file multiplexing. The data about a folder is also lazy-loaded and cached for performance, which means we need to handle that lazy-loading transparently.

(Why am I doing something so weird? It makes routing easier. Stay tuned for more details.)

The 8.3 version

This is exactly the situation where interfaces shine. However, in PHP 8.3, interfaces are limited to methods. That means in PHP 8.3, the various interfaces look like this:

interface Hidable
{
   public function hidden(): bool;
}

interface PageInformation extends Hidable
{
   public function title(): string;
   public function summary(): string;
   public function tags(): array;
   public function slug(): ?string;

   public function hasAnyTag(string ...$tags): bool;
   public function hasAllTags(string ...$tags): bool;
}

interface Page extends PageInformation
{
   public function routable(): bool;
   public function path(): string;

   /**
    * @return array<Page>
    */
   public function variants(): array;
   public function variant(string $ext): ?Page;
   public function getTrailingPath(string $fullPath): array;
}

Several of those are quite reasonable. However, nearly all of the methods that have no arguments... don't really need to be methods. Conceptually, the "title" of a page is just data about it. It's an aspect of the page, not an operation. We're used to capturing that as an operation (method), because that's all PHP let us do historically: Properties are basic, and if you expose them directly you lose a lot of flexibility, as well as safety. You cannot have interesting logic for them, and you cannot prevent someone from setting them externally (unless you make them readonly, which has its own challenges). The tools don't let us do it right.

For example, I have a degenerate case implementation called BasicPageInformation, like so:

readonly class BasicPageInformation implements PageInformation
{
   public function __construct(
       public string $title = '',
       public string $summary = '',
       public array $tags = [],
       public ?string $slug = null,
       public bool $hidden = false,
   ) {}

   public function title(): string
   {
       return $this->title;
   }

   public function summary(): string
   {
       return $this->summary;
   }

   public function tags(): array
   {
       return $this->tags;
   }

   public function slug(): ?string
   {
       return $this->slug;
   }

   public function hidden(): bool
   {
       return $this->hidden;
   }

   public function hasAnyTag(string ...$tags): bool { ... }

   public function hasAllTags(string ...$tags): bool { ... }
}

That's... a lot of code. 5 methods that do nothing but expose a primitive property. Of course, I also have the properties public, as the class is readonly. But I cannot rely on that because the interface cannot guarantee the presence of the properties, only the methods. So even though I could just have public properties in this case, they're still not reliable.

Enter Interface Properties

A part of the property hooks RFC, interface properties really deserve to be billed as their own third feature. They integrate well with hooks and aviz, and make those better, but they're a standalone feature.

The change in this case is pretty simple:

interface Hidable
{
   public bool $hidden { get; }
}

interface PageInformation extends Hidable
{
   public string $title { get; }
   public string $summary { get; }
   public array $tags { get; }
   public ?string $slug { get; }
   public bool $hidden { get; }

   public function hasAnyTag(string ...$tags): bool;
   public function hasAllTags(string ...$tags): bool;
}

interface Page extends PageInformation
{
   public bool $routable { get; }
   public string $path { get; }

   public function variants(): array;
   public function variant(string $ext): ?Page;
   public function getTrailingPath(string $fullPath): array;
}

Now, instead of read-only methods to implement, the interfaces require readable properties. In this case we don't need to set anything, so the properties are marked to only require a get operation. Whether we satisfy that requirement with a public property, a public readonly property, a public private(set) property, or a virtual property with just a get hook is entirely up to us. In fact, we'll do all of the above.

Right off the bat, that makes BasicPageInformation shorter and easier:

readonly class BasicPageInformation implements PageInformation
{
   public function __construct(
       public string $title = '',
       public string $summary = '',
       public array $tags = [],
       public ?string $slug = null,
       public bool $hidden = false,
   ) {}

   public function hasAnyTag(string ...$tags): bool { ... }

   public function hasAllTags(string ...$tags): bool { ... }
}

In this case, simple readonly properties are all we need. We can conform to the interface with about 10 fewer lines of boring, boilerplate code. Neat.

Where it gets more interesting is the other Page implementations.

The PageFile

In 8.3, PageFile looks like this (stripping out irrelevant bits for now to save space):

readonly class PageFile implements Page
{
   public function __construct(
       public string $physicalPath,
       public string $logicalPath,
       public string $ext,
       public int $mtime,
       public PageInformation $info,
   ) {}

   public function title(): string
   {
       return $this->info->title()
           ?: ucfirst(pathinfo($this->logicalPath, PATHINFO_FILENAME));
   }

   public function summary(): string
   {
       return $this->info->summary();
   }
  
   // tags(), slug(), and hidden() all look exactly the same.

   public function path(): string
   {
       return $this->logicalPath;
   }

   public function routable(): true
   {
       return true;
   }

   // ...
}

The PageFile delegates to an inner PageInformation object, and handles some defaults and extra logic. It works, but as you'll note, it's so verbose I didn't want to ask you to read such a long code sample.

In 8.4, we can remove those methods and instead use properties.

class PageFile implements Page
{
   public private(set) string $title {
       get => $this->title ??=
           $this->info->title
           ?: ucfirst(pathinfo($this->logicalPath, PATHINFO_FILENAME));
   }
   public string $summary { get => $this->info->summary; }
   public array $tags { get => $this->info->tags; }
   public string $slug { get => $this->info->slug ?? ''; }
   public bool $hidden { get => $this->info->hidden; }

   public private(set) bool $routable = true;
   public string $path { get => $this->logicalPath; }

   public function __construct(
       public readonly string $physicalPath,
       public readonly string $logicalPath,
       public readonly string $ext,
       public readonly int $mtime,
       public readonly PageInformation $info,
   ) {}

   // The boring methods omitted.
}

Much more compact, much more readable, much easier to digest. In this case, we're using hooks to create virtual properties, which have no internal storage at all. There is no "slug" slot in the memory of PageFile. Internally to the engine, it still looks and acts like a method. Because most of the properties are virtual, we don't need to bother with the set side, as it will be an engine error to even try. There are two special cases, however.

First, $routable is hard-coded to true. We can do that. Just... not with readonly, which cannot have a default value. We'd have to define it un-initialized and then manually initialize it in the constructor, which is too much work. Now, however, we can set it to public private(set) and give it a default value. In theory the class could still modify that property internally, but it's my class and I know I'm not doing that, so there's nothing to worry about.

Second, $title has some non-trivial default value logic. I don't want to run that multiple times, so it's cached onto the property itself. On subsequent calls, $this->title will have a value, so it will just get returned. That makes $title a "backed property," meaning there is a set operation. But we don't want anyone to set the title externally, so again we make it private(set).

Also note that hooked properties cannot be readonly. That means the class cannot be readonly, and the individual promoted constructor properties need to be marked readonly instead. (We could just as easily have made them private(set). It would have the same effect in this case.)

The Folder

The Folder object is even more interesting. It does a number of things that are off-topic for us here, so I'll hand-wave over them and focus on the property refactoring.

In PHP 8.3, Folder works roughly like this:

class Folder implements Page, PageSet, \IteratorAggregate
{
   public const string IndexPageName = 'index';

   private FolderData $folderData;

   public function __construct(
       public readonly string $physicalPath,
       public readonly string $logicalPath,
       protected readonly FolderParser $parser,
   ) {}

   public function routable(): bool
   {
       return $this->indexPage() !== null;
   }

   public function path(): string
   {
       return str_replace('/index', '', $this->indexPage()?->path() ?? $this->logicalPath);
   }
  
   public function variants(): array
   {
       return $this->indexPage()?->variants() ?? [];
   }

   public function variant(string $ext): ?Page
   {
       return $this->indexPage()?->variant($ext);
   }

   public function title(): string
   {
       return $this->indexPage()?->title()
           ?? ucfirst(pathinfo($this->logicalPath, PATHINFO_FILENAME));
   }

   public function summary(): string
   {
       return $this->indexPage()?->summary() ?? '';
   }
  
   // tags(), slug(), and hidden() omitted as they're just like summary().

   public function all(): iterable
   {
       return $this->folderData()->all();
   }

   public function indexPage(): ?Page
   {
       return $this->folderData()->indexPage;
   }

   protected function folderData(): FolderData
   {
       return $this->folderData ??= $this->parser->loadFolder($this);
   }

   // Various other methods omitted.
}

(Although not relevant here, PageSet is an interface for a collection of pages. It extends Countable and Traversable, and adds a few other operations like filter() and paginate(). None of its methods are relevant to hooks, though, so we will skip over that.)

That's a lot of code for what is ultimately a very simple design: A folder is given a path that it represents. (Ignore the physical vs logical paths for now, that's also not relevant.) It lazily builds a folderData value that is a collection of Pages the Folder contains. One of those pages may be an index page, in which case the Folder can be treated the same as its index page. If not, there's reasonable defaults.

But that's a lot of dancing around. Let's see if we can simplify it using PHP 8.4.

class Folder implements Page, PageSet, \IteratorAggregate
{
   public const string IndexPageName = 'index';

   protected FolderData $folderData { get => $this->folderData ??= $this->parser->loadFolder($this); }
   public ?Page $indexPage { get => $this->folderData->indexPage; }

   public private(set) string $title {
       get => $this->title ??=
           $this->indexPage?->title
           ?? ucfirst(pathinfo($this->logicalPath, PATHINFO_FILENAME));
       }
   public private(set) string $summary { get => $this->summary ??= $this->indexPage?->summary ?? ''; }
   public private(set) array $tags { get => $this->tags ??= $this->indexPage?->tags ?? []; }
   public private(set) string $slug { get => $this->slug ??= $this->indexPage?->slug ?? ''; }
   public private(set) bool $hidden { get => $this->hidden ??= $this->indexPage?->hidden ?? true; }

   public bool $routable { get => $this->indexPage !== null; }
   public private(set) string $path { get => $this->path ??= str_replace('/index', '', $this->indexPage?->path ?? $this->logicalPath); }

   public function __construct(
       public readonly string $physicalPath,
       public readonly string $logicalPath,
       protected readonly FolderParser $parser,
   ) {}

   public function count(): int
   {
       return count($this->folderData);
   }

   public function variants(): array
   {
       return $this->indexPage?->variants() ?? [];
   }

   public function variant(string $ext): ?Page
   {
       return $this->indexPage?->variant($ext);
   }

   public function all(): iterable
   {
       return $this->folderData->all();
   }
  
   // Various other methods omitted.
}

Now, we've done a few things.

  1. folderData was already a property, and a method. You had to do that if you wanted caching. Now, they're combined into a single lazy-initializing, caching property. It's still protected, though.
  2. The indexPage was always just a silly little wrapper around folderData. Now that wrapper is even thinner, in a property. Code calling it can just blindly assume it's there and use it safely.
  3. The various other simple data from Page/PageInformation are also now just properties. Also, it's super easy for us to cache them so defaults don't need to be handled again in the future. As before, we make the properties private(set) so they're read-only to the outside world without any of the shenanigans of readonly.
  4. Features like null-coalesce assignment, null-safe method calls, and shortened ternaries make the code overall really nice and compact. (That's not new in PHP 8.4, I just like them.)

In the end, we have less code, more self-descriptive code, and no loss in flexibility. Score! The performance should be about a wash; hooks cost very slightly more than a method call, but not enough that you'll notice a difference.

Declaration interfaces

Another place where interface properties came in handy is in my "File Handlers." The interface for those in PHP 8.3 looks like this:

interface PageHandler
{
    public function supportedMethods(): array;

    public function supportedExtensions(): array;

    public function handle(ServerRequestInterface $request, Page $page, string $ext): ?RouteResult;
}

supportedMethods() and supportedExtensions() are both, well, boring. Those methods will, 95% of the time, just return a static array value. However, the other 5% of the time they will need some minimal logic. That means they cannot be attributes, and have to be methods.

Which means most implementations have this verbose nonsense:

readonly class MarkdownLatteHandler implements PageHandler
{
    public function __construct( /* ... */) {}

    public function supportedMethods(): array
    {
        return ['GET'];
    }

    public function supportedExtensions(): array
    {
        return ['md'];
    }

    // ...
}

Which is like... why?

In PHP 8.4, interface properties let us shorten both the interface and implementations to this:

interface PageHandler
{
    public array $supportedMethods { get; }
    public array $supportedExtensions { get; }

    public function handle(ServerRequestInterface $request, Page $page, string $ext): ?RouteResult;
}

class MarkdownLatteHandler implements PageHandler
{
    public private(set) array $supportedMethods = ['GET'];
    public private(set) array $supportedExtensions = ['md'];

    public function __construct(/* ... */) {}

    // ...
}

Much shorter and easier! We can just declare the properties directly, with values, and keep them private-set, then never set them. It's marginally faster, too, as there's no function call involved (though in practice it doesn't matter). We don't even need hooks most of the time, just aviz!

And in that other 5%, well, we can use hooks just as well:

class StaticFileHandler implements PageHandler
{
    public private(set) array $supportedMethods = ['GET'];
    public array $supportedExtensions {
        get => array_keys($this->config->allowedExtensions);
    }

    public function __construct(
        /* ... */
        private readonly StaticRoutes $config,
    ) {}
}

One more thing...

There's one other place where PageInformation gets used, and where PHP 8.4's new features help out in hilarious ways.

Another task this project does is loading Markdown files off disk, with YAML frontmatter (which is, you guessed it, PageInformation's properties). The way I'm doing so is to load the file, rip off the YAML frontmatter, and deserialize that into a MarkdownPage object using Crell/Serde. Serde creates an object by bypassing the constructor and then populating it, but one thing that won't be populated is the content property. That gets set by just writing to it afterward.

The relevant loading code looks like this (abbreviated):

   public function load(string $file): MarkdownPage|MarkdownError
   {
       $fileSource = file_get_contents($file);

       if ($fileSource === false) {
           return MarkdownError::FileNotFound;
       }

       [$header, $content] = $this->extractFrontMatter($fileSource);

       $document = $this->serde->deserialize($header, from: 'yaml', to: MarkdownPage::class);
       $document->{$this->documentStructure->contentField} = $content;

       return $document;
   }

(The property to write the content to is configurable via attributes, for reasons unrelated to the topic at hand.) Problem: That means the content property needs to be publicly writable, which is generally not ideal. Technically we could use a bound closure to dance around that and set it from private scope, but PHP 8.4 lets us do something even more wild:

class MarkdownPage implements PageInformation
{
   public function __construct(
       #[Content]
       public(set) readonly string $content,
       public readonly string $title = '',
       public private(set) string $summary = '' { get => $this->summary ?: $this->summarize(); },
       public readonly string $template = '',
       public readonly array $tags = [],
       public readonly ?string $slug = null,
       public readonly bool $hidden = false,
       public readonly array $other = [],
   ) {}

   private function summarize(): string { ... }
  
   // And other stuff.
}

(I'm skipping over the fact that in 8.3 we needed a bunch of extra do-nothing methods, as we've already discussed those benefits.)

That's right. I have found a use case for public(set) readonly! Really, no one is more surprised at this than I am. With this configuration, $content can be set only once, but it can be set externally. Trying to set it a second time, from anywhere, results in an error. (Yes, we could have just used a bound closure, but this is more fun.)

Also note that most properties are just public readonly, which fully satisfies the interface. The exception is $summary, which has more interesting default logic, and thus uses a hook, and thus uses private(set) instead of readonly. Nothing especially new here.

Conclusion

I am overall happy with the result. I think it makes the code cleaner, more compact, and easier to extend. When adding more properties to the PageInformation interface, as I expect I will, adding that property to all the places it gets used will be less work, too.

The one complaint I have is that I do miss the double-short syntax that we removed from the hooks RFC, as it had too much pushback. Since the property hooks above are all get-only, they could have been abbreviated even further to (to use the Folder example):

public private(set) array $tags => $this->tags ??= $this->indexPage?->tags ?? [];
public private(set) string $slug => $this->slug ??= $this->indexPage?->slug ?? '';
public private(set) bool $hidden => $this->hidden ??= $this->indexPage?->hidden ?? true;

I find that perfectly readable, and with less visual noise of the { get wrapped around it. If folks agree, maybe we can try to re-add it in a future version.

So there we are: Interface properties, hooks, and asymmetric visibility, all dovetailing together to make code shorter, tidier, and more flexible. Welcome to PHP 8.4!

You can see a complete diff of all the PHP 8.4 upgrades I made as well. Looks like it shaved off around 150 lines of code, too.

(Note: If you're reading this article in the future, the code this is from will almost certainly have evolved further. This represents the code at the time of this blog post.)

Larry 22 October 2024 - 11:32pm
PHP
PHP8.4
Property Hooks


Tukio 2.0 released - Event Dispatcher for PHP

I've just released version 2.0 of Crell/Tukio! Available now from your favorite Packagist.org. Tukio is a feature-complete, easy to use, robust Event Dispatcher for PHP, following PSR-14. It began life as the PSR-14 reference implementation.

Tukio 2.0 is almost a rewrite, given the amount of cleanup that was done. But the final result is a library that is vastly more robust and vastly easier to use than version 1, while still producing near-instant listener lookups.

Some of the major improvements include:

  • It now uses topological sorting internally, rather than priority sorting. Both are still supported, but the internal representation has changed. The main benefits are cycle detection and support for multiple before/after rules per listener.
  • The API has been greatly simplified, thanks to PHP 8 and named arguments. It's now down to essentially two methods -- listener() and listenerService(), both of which should be used with named arguments for maximum effect. The old API methods are still supported, but deprecated to allow users to migrate to the new API.
  • Tukio can now auto-derive more information about your listeners, making registration even easier.
  • It now uses the powerful Crell/AttributeUtils library for handling attribute-based registration. That greatly simplified a lot of code while making several new features easy.
  • Attributes are now supported at the class level, not just the method level. That makes building single-method listener services trivially easy.

Listener classes

The last point bears extra mention. While Tukio supports numerous ways of organizing and configuring your listeners, the recommended way to register a listener is now this attribute-based pattern:

#[ListenerPriority(priority: 5)]
#[ListenerAfter(OtherListener::class)]
class CollectListener
{
    public function __construct(public readonly Dep $someDependency) {}

    public function __invoke(CollectingEvent $event): void
    {
        $event->add(static::class);
    }
}

$provider->listenerService(CollectListener::class);

Now, ensure that CollectListener and OtherListener are both registered with your DI container using their class names. That's it, that's all, you're done. The __invoke() method will be registered as the listener method to call, while you can specify any dependencies it requires in the constructor. The DI container should auto-wire them for you. (If it doesn't, get a better DI container.) Now, any time the CollectingEvent event is fired, the CollectListener service will be loaded, given its dependencies, and then invoked.
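
For example, with an autowiring container such as PHP-DI, the setup is roughly this (a sketch; any autowiring PSR-11 container works the same way, and how the container gets handed to the provider is covered in the Tukio docs):

$builder = new \DI\ContainerBuilder();
$builder->useAutowiring(true);
$container = $builder->build();

// When CollectingEvent fires, Tukio asks the container for
// CollectListener::class; autowiring builds it, Dep and all.
$provider->listenerService(CollectListener::class);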

Event Optimization: not new, but so so cool!

Tukio supports both runtime and compilable listener providers using the same API. In most cases, you'll want to use the compiled provider for better performance. However, you can get even more performance by telling Tukio ahead of time which events it should expect to see. (Odds are this can be automated by a scan of your codebase, but a manual list also works.) It will then build a direct lookup table in the compiled provider. The result is a constant-time simple array lookup for those events, also known as "virtually instantaneous." For example:

use Crell\Tukio\ProviderBuilder;
use Crell\Tukio\ProviderCompiler;

$builder = new ProviderBuilder();

$builder->listener('listenerA', priority: 100);
$builder->listener('listenerB', after: 'listenerA');
$builder->listener([Listen::class, 'listen']);
$builder->listenerService(MyListener::class);
$builder->addSubscriber('subscriberId', Subscriber::class);

// Here's where you specify what events you know you will have.
// Returning the listeners for these events will be near instant.
$builder->optimizeEvent(EventOne::class);
$builder->optimizeEvent(EventTwo::class);

$compiler = new ProviderCompiler();

// Write the generated compiler out to a file.
$filename = 'MyCompiledProvider.php';
$out = fopen($filename, 'w');

$compiler->compileAnonymous($builder, $out);

fclose($out);
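
Loading the compiled provider back in looks roughly like this (a sketch; check the Tukio README for the exact pattern, and note that compiled providers holding service listeners also need your PSR-11 container):

// Assumption: the anonymous-class file returns the provider when required.
$provider = require 'MyCompiledProvider.php';

$dispatcher = new \Crell\Tukio\Dispatcher($provider);
$dispatcher->dispatch(new EventOne()); // Constant-time listener lookup.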

Backward compatibility

Tukio v2 should be 99% a drop-in replacement for Tukio v1. I deliberately tried to keep the old API intact for now to make upgrading easier, though it is marked @deprecated to encourage developers to migrate to the more robust new API methods. If something isn't a clean drop-in, let me know on GitHub and I'll see if it's resolvable.

There is one potential BC challenge: In Tukio v1, if two listeners were specified without any ordering information, they would almost always end up triggering in the order in which they were added. That was not guaranteed, however, and the documentation warned against relying on it. Tukio v2 uses a new, topological-sort-based algorithm that is considerably more robust; however, the predictability of "lexically first, fires first" is no longer there. The order in which un-ordered listeners will trigger is unpredictable, though it should be stable. If you find you were inadvertently relying on the implicit ordering before, the fix is to add before/after rules to your listeners to make the intended ordering explicit.
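
Making the order explicit is a one-line change per listener (a sketch using the named-argument API shown earlier; the listener names are hypothetical):

// Before: implicit, unreliable ordering.
$builder->listener('logRequest');
$builder->listener('handleRequest');

// After: the intent is spelled out.
$builder->listener('logRequest');
$builder->listener('handleRequest', after: 'logRequest');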

Give it a try in your project today!

Larry 14 April 2024 - 2:24pm
PHP
PHP-FIG
PSR-14


Tangled Threads 26 Dec 2023 1:19 PM (last year)

Tangled Threads

Erin Kissane recently posted an excellent writeup about the coming integration of Threads into the Mastodon/ActivityPub/Fediverse world. It is recommended reading for anyone who cares even slightly about the state of the Internet, and especially any server admins and moderators on Mastodon et al. I agree with most of it, although there is one important point on which I think we differ slightly. I want to expand on that here.

Before I continue, full disclosure: I am a moderator (but not admin) on phpc.social, the Mastodon server for anyone even tangentially related to or interested in the PHP programming language or its community. However, this post is my own and has not been endorsed by the other moderators (though I know from conversation at least some of them agree with the general gist). This is not a policy statement from phpc.social, but a recommendation from me.

The problem

Erin's points, if I may summarize for the purposes of this response, largely boil down to:

  1. Meta is a fundamentally and irredeemably immoral company.
  2. Meta happily hosts irredeemably immoral and evil people, as long as it makes them profit.
  3. Just allowing Threads users to connect to Mastodon users helps provide Meta with value, if only in telemetry on who talks to whom (which they can then sell to advertisers).
  4. The current tools for dealing with Threads are not great.
  5. There are no easy answers.

As Erin notes, some server admins have taken to proactively blocking Threads from their servers. Individual users can also block a domain, but that only sometimes works. Some think it's a great thing to have the 800 lb gorilla of social media connecting to the Fediverse; others are afraid of being sat on.

I fully agree with her about the threat that Meta poses, both to the Fediverse and to the world at large. However, I do not agree that hard-blocking Threads is the way to go. In fact, that would be the worst possible option.

The state of play

Let's start with a few observations.

As of December 2023, Threads has 141 million users. As of August 2023, it had 10.3 million daily active users.

As of December 2023, Mastodon has around 8 million users. Its daily active user count is around 1.7 million.

No matter how you slice it, Threads already dwarfs Mastodon by a massive degree. Mastodon is a small fry, and the increased network effect potential of connecting Threads and Mastodon... helps Mastodon more than Threads. (It can help Threads in more nefarious ways through telemetry, but that's a different matter.)

Defederating Threads is effectively a boycott. How effective are boycotts? Maybe 1% of the time? They are most effective when they threaten reputation, not revenue. Well, Meta has no reputation to begin with, and is too big to avoid. Boycotting Meta is guaranteed to accomplish... exactly nothing. (I say this as someone who does not and has never had a Facebook, Instagram, or WhatsApp account. Yes, I've been boycotting them. You see how much that has accomplished.)

What do we do?

So does that mean we totally should welcome Threads in? Not quite. Due to Meta's laughably pathetic stance on content moderation, a substantial portion of those 10.3 million active users are openly racist, transphobic, sexist fascists. Giving them access to 1.7 million new targets (who for historical reasons tend to skew toward all the groups the fascists love to attack) is... not great.

However, that also means giving our 1.7 million users access to 10 million new potential followers/friends/audience members. That's not nothing, and especially for indie artists that rely on social media to promote themselves, that's "willing to stay on Mastodon at all" levels of massive.

Fortunately, it doesn't have to be an either/or. This is what is unique and special about the Fedi/Mastoverse: Individual server mods can aggressively police not just their own users, not just other servers, but users from other servers.

What we want in the world is "Threads minus the Nazis." That would be a good thing for humanity. What we want to allow to connect to Mastodon, to our servers, is "Threads users, but not the Nazis."

Most well-behaved servers help each other out by policing their own users to kick out the racists, misogynists, disinformation spreaders, and fascists. If a server doesn't do a good enough job of that, servers can defederate; although some admins have been almost comically over-aggressive on that front rather than engaging to improve matters. The problem is that we know with certainty that Threads will not be well-behaved, but they also represent (or will soon represent) the largest ActivityPub server by an order of magnitude. If you already struggle to defederate mastodon.social over its weak moderation because of its sheer size, Threads poses the same problem, only much larger.

Unfortunately, the rules really are different for the too-big-to-fail. Defederating Gab or Truth Social can be effective. Threads is a different beast and requires a different response.

If we cannot expect Meta to create Threads-minus-Nazis, we'll have to do it ourselves. That means allowing Threads to federate with our servers... and then aggressively, proactively banning individual misbehaving Threaders so they don't have access to our users. If you run a Mastodon server and don't proactively block LibsOfTikTok on sight, that's just irresponsible.

Yes, this will be a lot of work for mods. As a long-time moderator of numerous fora, I have to say... tough. This is what we signed up for: To protect our users. No one said it would be simple or easy. But that is our responsibility.

So why go to all that work?

Meta isn't integrating with ActivityPub out of the goodness of their heart. I don't believe Zuckerberg's protestations that he's always believed in decentralization for a second. (Oceania has always been at war with Eurasia.) However, allowing users an off-ramp is necessary to avoid antitrust issues, now that governments (mainly in Europe where they still have a functional government) are finally, finally starting to question Meta's practices.

But we can make use of that off-ramp to encourage users to migrate from Threads, which has fascists on it, to another ActivityPub server, where the fascists are blocked. They can do so without cutting off their non-fascist friends on Threads. That's the secret weapon. That's our only secret weapon: A safe off-ramp from one server to another. The process is not as clean as it could be, but... that's a technical problem we can fix.

The purpose of different federated servers isn't really for topic-centric discussion. That's always been a pointless distinction. Federated servers let you shop around for a moderator team you like. That's the real value. We have an opportunity to let Threads users shop around for new moderators without losing their friends.

And that is super important. One of the main reasons people stay in cults is because the cult becomes their only source of social support, and leaving becomes an all-or-nothing proposition that makes it nearly impossible to transition out.

Having a new social network that accepts them that they can transition to safely is how people are able to leave cults.

Having a new social network that has fewer Nazis that they can transition to safely without losing their non-fascist friends is how people can leave Threads.

We can make that off-ramp into an off-expressway. And we do it by very aggressively following the Mastodon Server Covenant:

Active moderation against racism, sexism, homophobia and transphobia: Users must have the confidence that they are joining a safe space, free from white supremacy, anti-semitism and transphobia of other platforms.

If we don't allow our users to talk to Threads (if they want to), we all but guarantee that Mastodon remains a footnote in the social media landscape and the promise of a decentralized, federated Internet moves even further away.

Sounds risky

It is! I won't deny that. This is a lot of work for a questionable payoff, and one that will almost certainly be only partially successful at best. But it's the best option we have.

Refusing to engage with Threads only helps Threads. They can turn around to regulators and say "see, we tried to enable federation, but none of our users are taking advantage of it, so you can't hold us accountable." No one wins in that outcome, other than Meta. That's how network effects work.

Of course, this will necessarily entail growing the population of the Fediverse, which has already gotten a lot of people upset. A lot of Mastodon old-timers miss their small, out of the way pub, from before the crowds arrived and ruined the vibe.

I get it. Really, I do. I've been on both sides of that process many times. Eternal September is a bitch. But every community either goes through Eternal September or dies. There are no other options. The little underground venue by the tracks is already gone. Trying to keep Mastodon small, tone policing people from Threads or Twitter until they leave, only hurts Mastodon. (This has already happened, of course. Black Twitter kept trying to migrate to Mastodon and got shut down hard by the "I'm not racist but" crowd, so they ended up on BlueSky instead. This is not a win. This is a massive loss.)

Yes, this will include cultural evolution. That will happen regardless. We'd best be prepared for it.

If boycotts and passive action don't work, and we know they don't, that leaves only active action. Yes, I am suggesting that we declare war on Threads, not by blocking them but by actively trying to bleed users off of Threads and onto other Mastodon servers, by making Mastodon the "Threads minus the Nazis" platform.

It's a long shot. I fully agree. It may be unsuccessful. But we already know that every other strategy is unsuccessful.

Let Threads federate. Aggressively block the Nazis. Be the moderators, stewards, and guardians that Meta will never be.

And if it ultimately fails, we can always block Threads in the future with a few button clicks.

Larry 26 December 2023 - 4:19pm
Mastodon
Fediverse
Moderation
Community


Cutting through the static 29 Nov 2023 1:28 PM (2 years ago)

Cutting through the static

Static methods and properties have a storied and controversial history in PHP. Some love them, some hate them, some love having something to fight about (naturally).

In practice, I find them useful in very narrow situations. They're not common, but they do exist. Today, I want to go over some guidelines on when PHP developers should, and shouldn't, use statics.

In full transparency, I will say that the views expressed here are not universal within the PHP community. They do, however, represent what I believe to be the substantial majority opinion, especially among those who are well-versed in automated testing.

What's wrong with statics?

First off, what is the problem with statics? They're a language feature, aren't they? They are, but that doesn't necessarily make them good. (Anyone else remember register_globals?) Their main downside is that they inhibit testing, because they inhibit mocking.

One of the key principles of good automated testing is to isolate small parts of the system and test them independently. That requires having key points where you can "cut" one piece of code off from another (using a mock, fake, stub, or whatever) and test one of them, without worrying about the other. The main "cut" point that PHP offers is object instances, especially object instances passed in through the constructor (aka, Dependency Injection; yes, that's all DI means: passing stuff in via the constructor). If you're working with code that does that reliably, it is generally pretty easy to test. If it does anything else, it is generally pretty hard to test.

Consider this example:

// BAD EXAMPLE
class Product
{
    public static function findOneById(int $id): self
    {
        // Some DB logic.
        $record = Database::select(/* ... */);
        
        if (!$record) {
            throw ProductNotFound::forId($id);
        }
        // More robust code than this, please.
        return new self(...$record);
    }
}

This is a simple database lookup. It's easy to call from literally anywhere in the code base using Product::findOneById(5). But... that's the problem. Actually, there are several problems.

  1. If some other code calls Product::findOneById(5), that code cannot be separated from this method. There is no way to test it without also testing Product. Product cannot be mocked/faked/stubbed. Your other code will forever and always need Product.
  2. It's not just a testing issue; if you ever want to use a different version of Product::findOneById() — say because requirements have changed, you need multi-tenancy now, or whatever — you're stuck.
  3. findOneById() needs a database connection to work. But since you cannot inject values into a static method (there's no constructor), how can you get a connection to it? All you've got is static calls. So that requires some code like Database::select('Some SQL here'). That, in turn, hard-couples Product to the Database class.
  4. That, in turn, means whatever calls Product is also hard-coupled to Database, and presumably therefore to an actual database connection. You now cannot test one piece of code far-removed from the database without a for-reals database running. That's... not good.

Compare with the constructor-injected version:

readonly class ProductRepository
{
    // This is injectable, and thus trivially testable.
    public function __construct(
        private Connection $conn,
    ) {}
    
    public function findOneById(int $id): Product
    {
        $record = $this->conn->select(/* ... */);
        
        if (!$record) {
            throw ProductNotFound::forId($id);
        }
        // More robust code than this, please.
        return new Product(...$record);
    }
}

Now, we've split Product into a data object (Product) and a mapper/loader (ProductRepository). The repository is a normal service object. It requires ("depends on") a Connection object, which is passed to it. Because that's an object, not a class, we can pass anything we want to it as long as it conforms to the class type: A real database connection, a fake one, a MySQL one, a Postgres one (within reason), etc. Most notably, we can pass a mock and test ProductRepository without having to even install a database.

And that same benefit extends to the code that uses the repository: It will accept a ProductRepository constructor argument, which can similarly be the real repository or a mock. We can now test that client code without needing a real repository instance.
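
For instance, a test might look like this (a hypothetical PHPUnit sketch; it assumes Connection is mockable and that select() returns a falsy value when nothing matches):

use PHPUnit\Framework\TestCase;

class ProductRepositoryTest extends TestCase
{
    public function testMissingProductThrows(): void
    {
        // No database anywhere in sight.
        $conn = $this->createMock(Connection::class);
        $conn->method('select')->willReturn(false);

        $repo = new ProductRepository($conn);

        $this->expectException(ProductNotFound::class);
        $repo->findOneById(5);
    }
}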

But isn't passing those constructor arguments around manually a lot of work? Yes, it is. Which is why no one does that anymore! Nearly all modern Dependency Injection containers support auto-wiring, whereby most (80%+) services can be auto-detected and auto-configured, so that the right constructor arguments are passed in. With Constructor Property Promotion available in all supported PHP versions, accepting a dependency via the constructor is trivial. (Prior to PHP 8.0 it was a lot more annoyingly verbose; that problem no longer exists.) The combination of auto-wiring containers and Constructor Promotion has virtually eliminated all previously-legitimate arguments against using DI. It's usually even easier than trying to set up an alternative.

But doesn't that mean it's harder to instantiate an object one-off with services? Yes. And that's good! You should rarely be doing that; making it clunky encourages you to refactor your code to not need it. If you really do need dynamic creation of a service, that's what the Factory Pattern is for. (In short: you call an object whose only job is to create the service for you, with all the wiring in one common location.)
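
To illustrate, a factory is just another injectable service (a sketch; the names are hypothetical):

// The factory receives the dependencies; callers just ask it for a repository.
readonly class RepositoryFactory
{
    public function __construct(private Connection $conn) {}

    public function productRepository(): ProductRepository
    {
        return new ProductRepository($this->conn);
    }
}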

Static types

So if statics suck for testing, are they ever valid to use? Yes! Statics are valid when the context they are operating within is a type, not an object instance. PHP has no meaningful way to swap out an entire type (there are some hacky ways that kind of work, which we'll ignore for now), so not being able to mock the type doesn't hurt anything.

In practice, the only case where the type is a relevant context is object creation. There are probably others, but this is the only one I ever really see. In the previous example, we had this line:

throw ProductNotFound::forId($id);

That is using the "named constructor" technique, using a static method. I use this approach a lot on exceptions, in fact, as it can be more self-documenting, and allows things like the error message to be incorporated into the class definition itself.

class ProductNotFound extends \InvalidArgumentException
{
    public readonly int $productId;
    public readonly array $query;
    
    public static function forId(int $id): self
    {
        $new = new self();
        $new->productId = $id;
        
        $message = 'Product %d not found.';
        $new->message = sprintf($message, $id);
        
        return $new;
    }
    
    public static function forQuery(array $query): self
    {
        $new = new self();
        $new->query = $query;
        
        $message = 'No product found for query: %s';
        $new->message = sprintf($message, implode(',', $query));
        
        return $new;
    }
}

Note here that we're offering two different named constructors; that's perfectly fine. In this case, the alternative is an inline new call in ProductRepository, which is no more or less mockable. So a static method here is fine. However, note that the static methods are both pure: They store no state (the properties are saved on the object, not the class), and do no IO.

This does mean a hard-coupling of findOneById() to ProductNotFound, but... that's OK. ProductNotFound is an exception, and therefore a value object. Value objects rarely if ever need to be mocked in the first place, as they can be trivially faked. Consider:

class Color
{
    private string $color;
    
    private function __construct() {}
    
    public function isPale(): bool
    {
        // ...
    }

    public static function fromRGB(int $red, int $green, int $blue): self
    {
        $new = new self();
        // sprintf() zero-pads each channel; bare dechex() would turn 5 into '5', not '05'.
        $new->color = sprintf('#%02x%02x%02x', $red, $green, $blue);
        return $new;
    }
    
    public static function fromHex(string $color): self
    {
        $new = new self();
        $new->color = '#' . $color;
        return $new;
    }
    
    public static function fromHSV(int $hue, int $sat, int $value): self
    {
        [$r, $g, $b] = self::hsv2rgb($hue, $sat, $value);
        return self::fromRGB($r, $g, $b);
    }

    private static function hsv2rgb(int $hue, int $sat, int $val): array
    {
        // ...
    }
}

This value object represents a color. It doesn't really make sense to mock, any more than mocking an integer would. Just... pass a different integer, or a different Color instance. Its constructor is private, so the only way to create it is through the static named constructors. fromHex() is the main one, and the simplest. fromRGB() is also pretty straightforward, and produces the same object by different means. These are all perfectly reasonable uses of static methods, because they relate to the type Color rather than to any particular instance. And note again, they're all pure functions.

fromHSV() does a little bit more, in that it has a utility method to convert HSV colors to RGB colors. (The algorithm to do so is fairly standard and easy to Duck Duck Go, hence omitted for now.) Because fromHSV() is static, the utility must be static as well, as there's no instance context to work from. This is also an acceptable use of statics; note, however, that hsv2rgb() is private. It's an internal implementation detail.
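
Usage is then as simple as picking whichever named constructor fits the data you have (give or take rounding, these all produce the same teal):

$a = Color::fromHex('008080');
$b = Color::fromRGB(0, 128, 128);
$c = Color::fromHSV(180, 100, 50);

// All three are plain values; pass them around or compare them as data.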

Static registration

Historically, another common use of static methods has been registration: that is, when there is some kind of plugin system in a framework and you need a runtime way to "register" some extension/instance/object/hook/whatever with the framework. In general, there are four ways that can be done.

  1. Externally from the class being registered, which we don't care about for now.
  2. A static method
  3. A static property
  4. Attributes

For example:

// Best option in modern code
#[Command(name: 'product:create')]
class CreateProductCommand implements Command
{
    // Externally mutable, which is not good.
    public static string $name = 'product:create';
    
    // More verbose, but more flexible if some logic is needed.
    public static function name(): string
    {
        return "product:create";
    }
    
    public function run(): void
    {
      // ...
    }
}

The static name() method is presumably part of the Command interface. In this case, again, the context being named isn't an object instance; it's the class/type, and thus a static method is reasonable. A method in this case is arguably overkill, as it's just returning a static value. That's why some frameworks instead have a magically named static property, like $name above. That's simpler, but comes at the cost of not being part of the interface (though hopefully that will change) and being public and mutable. Remember the rule for statics we said above: Pure and stateless! A mutable value is neither pure nor stateless. It works, but I wouldn't recommend it.

As of PHP 8, though, I'd argue attributes provide a better alternative. They can only capture fixed-at-compile-time information, but that's also true for the static method. They're a native language feature purpose-built for this kind of work, which means all the tooling knows how to parse it. It is more compact compared to multiple static metadata methods. And it logically fits the use case: Registration like this is "metadata" about the type/class. Attributes are metadata, and clearly so from the syntax as they're distinct from the runtime code. They can also be used on both classes and methods, depending on what needs to be registered, with the exact same syntax.
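
For the curious, the framework side is just a few lines of plain reflection (a sketch; the Command attribute class itself, with a public $name property, is assumed to be defined by the framework):

$ref = new \ReflectionClass(CreateProductCommand::class);
foreach ($ref->getAttributes(Command::class) as $attribute) {
    $command = $attribute->newInstance();
    // Map the command name to the class that handles it.
    $registry[$command->name] = CreateProductCommand::class;
}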

In the end, I wouldn't say using a static method here is bad, per se. There are just better options for modern code to use.

Functional code

What if we want the hsv2rgb() routine above to be available to more than just the Color class? It's a reasonable utility method that may have more generic use. What shall we do with it then?

Make it a function.

That's it, just a normal, plain, boring function.

// color_util.php
namespace Crell\MyApp\Colors;

function hsv2rgb(int $hue, int $sat, int $val): array
{
    // ...
}

It should be namespaced, of course, and functions support namespaces just fine. This way it can be used anywhere, and we can unit test the function individually without any other context. That's good! But there are rules here, too:

  1. The function must be pure.
  2. The function must not do any IO, even indirectly.
  3. The function must not call a service-locator.
  4. The function must not access globals.
  5. Did I mention the function must be pure?

Why are we so strict on functional purity for functions? Because functions, like static methods, are not mockable. If your code calls a function, then all of its tests will also always call that function. There is no way around that. Your smallest "unit" of code includes the code you're testing and all functions and static methods it calls, recursively. So if your code calls a function, which calls a static method, which calls another static method, which runs an SQL query... guess what, your code now cannot be tested without a fully populated database. Welcome to testing hell.

However, if your code calls a pure function, which calls a pure function, which calls a pure function... Sure, you're running more code in your tests, but it's all still CPU cycles in the same process. There's no logical reason to mock them out; there may be a performance reason, but not a logical one. It doesn't really hurt the code's testability.

Does that mean you should make all of your code free-floating pure functions? No! You still want to mock things, and you still need to have some context and input somewhere in your application. (It's not a particularly useful application otherwise.) Most of your application should still live in well-designed objects. But when you have a stand-alone, pure, utility routine that doesn't fit anywhere else... A function is fine.
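
Calling it from elsewhere is then just an import away (assuming the file defining it has been loaded; more on that in a moment):

use function Crell\MyApp\Colors\hsv2rgb;

[$r, $g, $b] = hsv2rgb(180, 100, 50); // Pure: same inputs, same outputs, no IO.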

Autoloading

There's an old PHP habit from the PHP 5.2 and earlier days of using static methods instead of functions as a sort of cheap knock-off namespace. That somewhat made sense before PHP had namespaces, but we've had namespaces for 14 years now. We don't need cheap knock-off namespaces when we have real namespaces that work just fine for functions, thank you very much.

// BAD EXAMPLE
class ColorUtils
{
    public static function hsv2rgb(int $hue, int $sat, int $val): array
    {
        // ...
    }
}

The other purported advantage of static utility classes is autoloading. PHP doesn't yet support function autoloading, and while it's been discussed many times there are some technical implementation challenges that make it harder than it sounds.

But... does that matter?

If you're using Composer (and if you're not, why?), then you can use the files autoload directive in composer.json:

{
    "autoload": {
        "psr-4": {
          "App\\": "app/"
        },
        "files": [
          "app/utilities/color_util.php"
        ]
    }
}

Now, Composer will automatically require() color_util.php on every page load. Boom, the function is loaded and you can just use it.

But doesn't that use up a lot of resources to load all that code? Not really! It used to, back before PHP 5.5. But since PHP 5.5, we've had an always-on opcache that stores loaded code in shared memory and just relinks it each time the file is require()ed. Since we're dealing with functions, that relinking is basically zero cost. So while it will marginally increase the shared memory baseline, it has no meaningful effect on the per-process memory.

If you want to go even further, as of PHP 7.4 you can use preloading to pull the code into memory once at server-boot and never think about it again. That may or may not have a measurable performance impact, so do your own benchmarks.
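
The preload script itself is plain PHP (a sketch; the paths are hypothetical):

<?php

// preload.php, named in php.ini as:
//   opcache.preload=/var/www/app/preload.php
// Every file required here is compiled once at server start and kept
// in shared memory for all requests thereafter.
require __DIR__ . '/app/utilities/color_util.php';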

So in practice, the lack of function autoloading support is... not a big issue. If you have enough functions that it becomes an issue, seriously consider if they shouldn't be methods closer to the code that actually uses them in the first place.

Edit: Nikita Popov wrote something along similar lines over a decade ago.

Flip a coin

There's one final situation to consider when discussing static methods. That's when you have a method that is itself pure, and doesn't need a $this reference, but gets used from object code. For example:

class ProductLookup
{
    public function __construct(private Connection $conn) {}

    public function findProduct(string $id): Product
    {
        [$deptId, $productId] = $this->splitId($id);
        
        $this->conn->query("Select * Fom products where department=? AND pid=?", $deptId, $productId);
        // ...
    }
    
    private function splitId(string $id): array
    {
        return explode('-', $id);
    }
}

In this (trivial) example, splitId() is pure. It has no context, it has no $this, it has no dependencies. (These are all good things.) That means it would work effectively the same as a method, as a static method, or even as a function. You're not really going to want to mock it (nor would you be able to) in any case. So which should you use?

My argument is that you should default to an object method (as shown above), unless there's a compelling reason to do otherwise.

  1. Object methods can call static methods, but static methods cannot call object methods. (Static methods are "colored", much like JavaScript's async functions.) So using an object method gives you more flexibility as the code evolves.
  2. Since most of the time you want to be using object methods anyway, it's a good habit to get into just using object methods unless there's a very good reason to do otherwise. Keep that muscle memory going.
  3. The odds of it actually being useful elsewhere as a general utility and being large enough that it's worth factoring out to a common utility at all are low, and you don't know that initially. If you decide later that it makes more sense to split off to a stand-alone function, that's future-you's job.
  4. When dealing with static methods that call static methods mixed with inheritance, there's an extra layer of complexity about self vs static that you have to worry about, as the sketch after this list shows. That confusion doesn't exist with object methods.
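
Here's the self-vs-static wrinkle from point 4 in a nutshell (hypothetical classes):

class Registry
{
    public static function create(): static
    {
        // `new static()` honors late static binding. `new self()` here
        // would always produce a Registry, even when called on a subclass.
        return new static();
    }
}

class CachedRegistry extends Registry {}

$r = CachedRegistry::create(); // A CachedRegistry instance.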

I have seen people argue for static by default in these cases, on the grounds of "if you don't need a $this, make it static." That's a defensible position, but I disagree with it for the reasons above. I still firmly hold that you should avoid statics in most cases, which means if it's a toss up, stick with object methods.

Conclusion

In summary, when should you use static methods?

  1. If the relevant context is a type, not instance, and is a pure function, use a static method. Named constructors are the most common instance of that.
  2. If it's a general-purpose utility, with no context beyond its arguments, large enough to be worth centralizing instead of repeating, and also a pure function, use a stand-alone function.
  3. Else, use an object method. (And make most of those pure functions, too.)

This will give you the most maintainable, most testable outcome possible. And that's what we're really after, isn't it?

Larry 29 November 2023 - 4:28pm
PHP
OOP


Announcing Crell/Serde 1.0.0 9 Nov 2023 4:39 PM (2 years ago)

Announcing Crell/Serde 1.0.0

I am pleased to announce that the trio of libraries I built while at TYPO3 have now reached a fully stable release. In particular, Crell/Serde is now the most robust, powerful, and performant serialization library available for PHP today!

Serde is inspired by the Rust library of the same name, and driven almost entirely by PHP Attributes, with entirely pure-function object-oriented code. It's easy to configure, easy to use, and rock solid.

For a full overview, I gave a presentation at Longhorn PHP 2023 that went into its capabilities in detail. Even then, I didn't have time to cover everything! Have a look at the README for a complete list of all the options and features available.

<iframe width="560" height="315" src="https://www.youtube.com/embed/GB5Vfi68ToY?si=lGs2etNle9uMcTR4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

Serde is backed by two other libraries:

  • Crell/fp is a simple functional programming utility library, mainly aimed at enabling functional pipes.
  • Crell/AttributeUtils is a fully-featured attribute management library that builds on PHP's native attributes and adds a metric ton of functionality. A lot of the functionality of Serde is driven directly by AttributeUtils.

Give all three a try, and see how powerful modern PHP has become!

Larry 9 November 2023 - 7:39pm
PHP
Serialization
Serde
Web development


Upgrading a RAID array in-flight 24 Jul 2023 1:30 PM (2 years ago)

Upgrading a RAID array in-flight

I have a home server I built several years ago. It used to be a mail server, but now it's mainly just a home file server. Still, it runs a three-drive RAID 5 array for safety. Recently, one of the drives failed and I decided it was time to replace the whole array (as it was old spinning disk drives and those are so early-2010s).

You'd think this would be easy, and it would be, if the documentation for it were any good. The best I could find skipped over some rather important details, which I had to figure out from extensive Duck Duck Going and the friendly folks in the #ubuntu channel on irc.libera.chat. Much of this is derived from this tutorial, but with a lot more detail here.

So that I don't have to re-research all of this next time, and hopefully to help someone else in a similar situation, I'm going to document the whole process here in detail, with descriptions. This isn't quite a blow-by-blow, since I am not going to include all of my missteps along the way, but it's close.

Buckle up.

The setup

My old configuration was three 1 TB spinning disk drives, in RAID 5 configuration using Linux software RAID. It was set up by the Ubuntu installer somewhere on the order of 8 years ago, at least. For the past few months, the whole system has been extremely sluggish on anything involving disk IO. After some pondering and checking for things like hidden trojan infections, I concluded that one of the drives was dying. Given the age, I figured it was probably worth the time to port the whole thing over to larger SSD drives.

The system is running Ubuntu 22.04, although the hardware is quite old at this point.

The way software RAID works in Linux, each hard drive is divided into partitions (as always), and then a RAID array is defined that spans multiple partitions on (presumably) separate disks. Then the OS can mount the RAID array as a drive like any other drive.

On my system, I had three physical drives: /dev/sda, /dev/sdb, and /dev/sdc. Each had a partition 1 of 8 GB for swap (which is also RAIDed), an extended partition 2, and the rest of the space in a logical partition 5 inside the extended partition. (If that's all greek to you, hard drive partitioning is utterly weird and still based on what 386 computers could handle, leading to weirdness like primary/extended/logical partitions. Sorry. Welcome to computers.) The RAID arrays are /dev/md0 (the swap partitions) and /dev/md1 (mounted at /).

If you're not sure how your server is set up, the lsblk command will give an overview of what devices are defined and how they're configured, RAID-wise.

Note that all commands listed below are run as root; you could also use sudo for all of them if you prefer. Any time I refer to /dev/sdX, the X is for any of a, b, or c.

The diagnostics

The first step was to confirm my theory. To do that involved the smartctl utility, which on Debian-based distributions is included in the smartmontools package.

# apt-get install smartmontools

smartctl is a hard disk checking utility. (SMART is a drive-diagnostics standard.) Running smartctl -a /dev/sda shows the health status of that drive. It can also run diagnostics, in both "short" and "long" versions. I ran the following:

# smartctl -t short /dev/sda
# smartctl -t short /dev/sdb
# smartctl -t short /dev/sdc

The diagnostics run in the background. On sdb and sdc, they finished within seconds. On sda, it ran for a half hour and never actually finished. That confirmed my suspicion that the issue was a dying drive, which is exactly what RAID is there for.

The new hardware

My motherboard can handle up to four SATA devices, but first I needed to confirm what kind of SATA I have, since like everything else SATA comes in multiple versions. The most straightforward way I found (that someone in IRC told me) to determine that was:

# sudo dmesg | grep -i sata | grep 'link up'
[    3.019890] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    3.031035] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    3.042040] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

The 6.0 Gbps part indicates that I'm using SATA 3.0, since that's the version that handles that speed. After a little searching of review sites (mainly Wirecutter), I settled on the Western Digital 2 TB WD Blue SA510 SATA. Or, rather, three of them.

Serial problems

One important issue I ran into during the process is that every time I added or removed a drive, the sdX letters for each drive changed. Those assignments are given out in the order of the ports on the motherboard, not bound to the drive. Since everything is RAID there's no issue with data getting lost, just with me needing to keep track of what drive was which every time I booted. The way to find out is with this command:

# udevadm info --query=all --name=/dev/sda | grep ID_SERIAL
E: ID_SERIAL=WDC_WD10EZEX-00BN5A0_WD-WCC3F7UX7FFR
E: ID_SERIAL_SHORT=WD-WCC3F7UX7FFR

That tells me the serial number of whatever is connected to sda, and then I could match up the serial number with what's printed on the physical drive to know which is which. I won't repeat that step each time, but I did have to run that over again every time I changed the hardware configuration.

Fail safely

Software RAID is controlled by a command called mdadm, available in a package of the same name.

I'm not sure if it's "better" to physically install the new drive before or after removing the old one. I did it by adding it first, which is probably part of why I ended up with my sdX letters moving around on me so much. Other tutorials say to remove first, so pick your preference.

The status of the arrays can be checked at any time by examining the mdstat pseudo-file (because in Linux, half the diagnostics are available by reading a file in /proc).

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdc1[2] sdb1[1] sdd1[3]
  	15612928 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
 	 
md1 : active raid5 sdc5[2] sdb5[1] sdd5[3]
  	1937634304 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

Which shows the two arrays.

Once I had the first SSD installed (it got assigned /dev/sdd), it was time to disable the old drive. That needs to be done for both RAID arrays (md0 and md1), as it's per-partition.

# mdadm --manage /dev/md0 --fail /dev/sda1
# mdadm --manage /dev/md1 --fail /dev/sda5

The first command tells the md0 array that its member partition sda1 is bad and it should feel bad, so stop using it. The second does the same for the md1 array's sda5 partition. Checking mdstat again shows an F next to each disabled partition.

Now that the partitions are marked "failed" and the array is no longer using them, it's time to remove them from the array:

# mdadm --manage /dev/md0 --remove /dev/sda1
# mdadm --manage /dev/md1 --remove /dev/sda5

(You may need to run swapoff to disable swap for the swap partition to let you remove sda1. I did it, but I'm not sure if that's because I messed up some other things along the way. If you do, remember to run swapon when you're all done to re-enable it.)

Set up the new drive

I already had the first new drive installed, but if you don't, this is the time to physically install it. Then, it needs an identical partition table to the other drives. The easiest way to copy that over is with the sfdisk command, piped to itself:

# sfdisk -d /dev/sda | sfdisk /dev/sdd

You can use any of the old drives as the copy source. How you plugged in the drives determines whether the new drive is sdd or something else; check the serial numbers (as above) to be sure.

Note that this will set up a 1 TB configuration on the new drive, even though it's a 2 TB drive. That's OK. It has to be, as all parts of a RAID 5 configuration have to be the same size. We'll be able to resize it when we're done.

Now, add the new drive's partitions to the RAID array:

# mdadm --manage /dev/md0 --add /dev/sdd1
# mdadm --manage /dev/md1 --add /dev/sdd5

(Again, your letters may vary.)

As soon as the partition is added, the RAID software controller will begin reshuffling data around to fill it up. This can take anywhere from a few seconds to a few hours, depending on how much data there is. Once again, checking mdstat will report on the progress.

# cat /proc/mdstat  
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]  
md0 : active raid5 sda1[4] sdc1[2] sdd1[3]
     15612928 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [U_U]
     [=============>.......]  recovery = 67.7% (5286012/7806464) finish=0.2min speed=167805K/sec
      
md1 : active raid5 sda5[4] sdc5[2] sdd5[3]
     1937634304 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [U_U]
       resync=DELAYED

(The above shows the md0 recovery 2/3 of the way done, and the md1 resync waiting for it to finish.)

Again and again

That done, I checked and disk IO was nice and fast again, confirming that it was a hardware failure. I removed the dead drive and set it aside for later disposal.

Then, repeat the process for the other two old drives. It's identical, but if you were as sloppy as I was about which drives went into which ports on the motherboard, you'll need to recheck which serial number maps to which device each time.

Boot problems

Once I swapped out the last drive, the machine refused to boot. Strange. I put back in one of the old drives and it booted again, even though the booted configuration was using only the new SSD drives. I finally figured out (by attaching a monitor to the server and seeing the boot process go to the BIOS configuration) that the issue was that copying over the partitions did not copy over the Master Boot Record (MBR), so the computer couldn't find boot instructions on any of the available devices. That's because it scans each SATA connection in turn looking for a device with a working MBR, and wasn't finding one unless I had one of the old drives plugged into /dev/sdd. Oops.

Fortunately, the solution was to just reinstall grub, the Linux boot loader ("GRand Unified Bootloader"), using the configuration it could already derive from my existing system. So, with the old drive installed so that I could boot, I ran:

# dpkg-reconfigure grub-pc

It interactively asked which devices to install to. Just to be safe, I had it install to sda, sdb, and sdc, all three of the new drives. That way the computer can boot with any of them installed; the MBR is entirely independent of RAID and comes into play long before RAID is even loaded. (Note: grub will give you the option of installing to md, but will fail if you try.)

Now I was able to shut down, remove the old HDD, reboot, and it booted correctly from the new drives. Huzzah!

Growth mindset

The final step is to expand the RAID array configuration to use the extra 3 TB worth of space I now have. There are actually three different layers that have to be grown, in order; fortunately, each one is very fast.

Grow the partitions

Linux has an annoying number of possible tools to use here, ranging from the minimalist growpart to the hyper-versatile parted. In my case, growpart was the easiest option as I only wanted to expand the existing partitions at the end of the disk, not move any partitions around. (Had I wanted to do that, I would have needed parted.) It's found in the cloud-guest-utils package, which makes no sense to me at all but then I'm not a distribution packager.

Fortunately, growpart has a dry-run mode. When I told it to expand sda5, it revealed that it would give me a larger sda5 than the sda2 it technically lives in. (As I said, partition design is ancient, complicated, and dumb.) Fortunately, expanding sda2 first is easy enough.

# growpart /dev/sda 2
# growpart /dev/sda 5
# growpart /dev/sdb 2
# growpart /dev/sdb 5
# growpart /dev/sdc 2
# growpart /dev/sdc 5

That expanded the partitions to fill all remaining space after them, which is what we wanted. (I'm not resizing the swap partitions.)

Grow the array

The next step is to tell the RAID array itself that it should be bigger. Fortunately, that's one quick command:

# mdadm --grow /dev/md1 --size max

Which tells md1 to grow to use all available space on its partitions.

Grow the file system

Finally, we need to expand the file system itself on the RAID array. That is, again, fortunately a simple command.

# resize2fs /dev/md1

Which, again, just tells the file system to grow as big as it can.

The result

All that finally done, I now have twice as much space available as I used to:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           782M  1.5M  781M   1% /run
/dev/md1        3.6T  1.1T  2.4T  32% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /run/qemu
tmpfs           782M  8.0K  782M   1% /run/user/1000

Yay! Remember, RAID 5 stores parity-check versions of all data across the drives, so the available disk space you get is (n-1) times the capacity of one drive, where n is the number of drives. That's why my 6 TB of disk space only gives 3.6 TB of usable space. Still plenty for now, and if I ever need more I still have space to add one more drive. And since I've actually written down all these instructions, I'll even know how to do it! Yay!

I hope this was helpful to someone else as well.

Larry 24 July 2023 - 4:30pm
RAID
Sysadmin
GNU/Linux
Ubuntu
