Spent about half of yesterday setting up Aditya’s darcsden patches on the dev instance of hub.darcs.net, testing them, and exploring db migration issues.
Following BSRK’s instructions, I got the dev instance authenticating via Google’s OAuth servers. Good progress. The UI flow I saw needs a bit more work, e.g. logging in with Google seemed to want me to register a new account. Or there may be a problem with my setup at Google (wrong callback URLs?). I will have to review it with BSRK.
My dev instance has so far been using the same database as the live production instance. This is partly because I don’t yet know how to run a second CouchDB instance, partly to reduce complexity, partly to be able to compare old and new code with the same realistic data set.
This of course can lead to trouble, if old and new code require different schemas. darcsden uses CouchDB, a “schemaless” database, but of course there is an implicit schema required by the application code, even if couch doesn’t enforce one. I got more clarity on this when I noticed my dev instance experiments causing errors on the production app.
New darcsden code may include changes to the (implicit) db schema. In this case, there’s a change to the user’s password field. I need to notice such schema changes, and if I want to exercise them on the dev instance, I should first deploy them to the production instance too. Or use a separate couchdb instance. Or use separate databases in the same couchdb instance. Or possibly use separate views in the couchdb databases?
E.g., here BSRK made the code read user documents (db records) in either the old or the new schema. Before testing it on the shared db, I should have deployed that patch to production as well as dev.
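A minimal sketch of what "reading either schema" can look like, written in Python for brevity rather than darcsden's Haskell. The field shapes are assumptions made up for illustration (a bare hash string in the old documents, an object with `hash` and `salt` in the new ones); the real darcsden user documents may differ.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    name: str
    password_hash: str
    password_salt: Optional[str]  # None when the document uses the old shape


def read_user(doc):
    """Parse a user document in either the (hypothetical) old or new shape."""
    password = doc["password"]
    if isinstance(password, str):
        # Assumed old shape: the password field is a bare hash string.
        return User(doc["name"], password, None)
    if isinstance(password, Mapping):
        # Assumed new shape: the password field is an object.
        return User(doc["name"], password["hash"], password["salt"])
    # Anything else is a schema variation this reader does not know about.
    raise ValueError("unrecognised password field in user %r" % doc.get("name"))
```

A reader that only understands the old shape has nothing like the second branch, which is why the tolerant reader has to be running everywhere that shares the db before any new-shape document is written.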
Looking ahead, is this approach (including code to deal with all old schemas) the best way to handle this? Maybe. It makes things work and seems convenient, at least for now. But it also reminds me of years working with Zope’s ZODB (a schemaless Python object database), the layers of on-the-fly schema updating that built up, and the uncounted runtime bugs hunted down due to schema variations in individual objects.
While recovering from this, I learned some more about managing couchdb, schema migration, and current couchdb alternatives.
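One of the options mentioned earlier, a separate database inside the same CouchDB instance, only needs two HTTP calls: `PUT /{db}` creates a database and `POST /_replicate` copies documents into it. The sketch below just builds those requests. The database names and base URL are hypothetical, and a real server will most likely also want admin credentials.

```python
import json
from urllib.request import Request


def create_db_request(base, db):
    """Build the request that creates an empty database named `db`."""
    return Request("%s/%s" % (base, db), method="PUT")


def replicate_request(base, source, target):
    """Build the request for a one-off copy of `source` into `target`."""
    body = json.dumps({
        "source": "%s/%s" % (base, source),
        "target": "%s/%s" % (base, target),
    }).encode("utf-8")
    return Request(
        "%s/_replicate" % base,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical names; 5984 is CouchDB's default port.
BASE = "http://localhost:5984"
create = create_db_request(BASE, "darcsden-dev")
copy = replicate_request(BASE, "darcsden", "darcsden-dev")
# Passing each request to urllib.request.urlopen would actually send it.
```

With the dev app pointed at the copy, new-schema documents written during testing never reach the production app.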
Couch has some really good and unusual qualities, and I feel I’m only scratching the surface of its power. Even so, I’m starting to feel a schema-ful, relational database is a better fit for darcsden/darcs hub. Replacing couch has been a topic of discussion on #darcs for some time, for other reasons. Here are some reasons to replace it:
darcsden (more particularly, the instance running darcs hub, which has a lot of long-lived data) works best when all records have the same shape. It gains nothing from the flexibility of a loose schema; in fact it will break, at runtime and unpredictably, unless you have extra code that handles all variations perfectly (a hard thing to test).
couchdb makes darcsden harder to set up, e.g. on Windows. This makes it less successful in its goal to be an easy single-user UI for local darcs repos. It also reduces the number of darcsden hackers.
it adds complexity by embedding application code in the db. Instead of all logic being in Haskell, the darcsden developer also has to deal with design documents and JavaScript map/reduce functions, and manage the state of those within the db.
it adds complexity by being less familiar to most people than an RDBMS, and by having less mature tools.
persistent, the likely alternative, would more easily support both large installations (eg postgres for darcs hub) and single-user ones (sqlite) with less code.
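To make the "application code in the db" point concrete, here is roughly what a CouchDB design document looks like, written as a Python dict. The design document name, view name and document fields are invented for illustration and are not darcsden's actual views. The map function is JavaScript source kept in a string and stored in the database, so it is versioned and deployed separately from the application.

```python
import json
from urllib.parse import quote, urlencode

# Hypothetical design document: the view logic lives in the db as JavaScript.
REPOS_DESIGN_DOC = {
    "_id": "_design/repositories",
    "language": "javascript",
    "views": {
        "by_owner": {
            "map": "function (doc) { if (doc.owner) { emit(doc.owner, doc.name); } }",
            "reduce": "_count",  # one of CouchDB's built-in reduce functions
        }
    },
}


def view_url(base, db, design, view, key):
    """Build the URL that queries one view for one key, without reducing."""
    # View keys are JSON values, so a string key is sent with its quotes.
    query = urlencode({"key": json.dumps(key), "reduce": "false"})
    return "%s/%s/_design/%s/_view/%s?%s" % (
        base, quote(db, safe=""), design, view, query)


url = view_url("http://localhost:5984", "darcsden",
               "repositories", "by_owner", "alice")
```

Changing the query means updating this document in every database the app talks to, which is the extra state to manage that the list above complains about.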
Some reasons not to:
Don’t replace working code!
Replacing it could be wasted effort, better spent fixing end-user bugs on darcs hub.
The migration issue can easily be worked around. It’s not that big a deal for this instance.
Don’t disrupt the GSOC in progress!