Proposal:Distributed backup of Wikimedia content

From Strategic Planning
Wikimedia Servers


It is possible to make a distributed backup of Wikimedia content, and in particular a distributed backup of images.


Instead of relying on static backups of the content, it is possible to make a distributed backup of [presently] ~4TB of images and ~2TB of other data.

While it is not reasonable to expect that everyone would be willing to host ~4TB of random images, it is possible to build sets of images categorized by place, region, ontological category and so on. So we could create, let's say, a ~10GB set of images of Paris. That set may then be backed up by any number of people interested in having pictures and images of Paris.
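As a rough illustration of how such sets could be cut, here is a minimal sketch in Python. It assumes each image already carries a single category and a size; the function name `build_sets`, the sample file names and the tiny size cap are hypothetical and only stand in for real Commons metadata and a ~10GB limit.

```python
from collections import defaultdict

def build_sets(images, max_bytes):
    """Group images by category and split a category into numbered sets
    whenever adding a file would push the current set over max_bytes."""
    by_category = defaultdict(list)
    for name, category, size in images:
        by_category[category].append((name, size))

    sets = {}
    for category, files in sorted(by_category.items()):
        part, current, used = 1, [], 0
        for name, size in sorted(files):
            # Close the current set before it would exceed the cap.
            if current and used + size > max_bytes:
                sets[f"{category}-{part}"] = current
                part, current, used = part + 1, [], 0
            current.append(name)
            used += size
        if current:
            sets[f"{category}-{part}"] = current
    return sets

# Hypothetical sample data: (file name, category, size in arbitrary units).
images = [
    ("Eiffel_Tower.jpg", "Paris", 6),
    ("Louvre.jpg", "Paris", 5),
    ("Notre_Dame.jpg", "Paris", 4),
    ("Kalemegdan.jpg", "Belgrade", 3),
]
sets = build_sets(images, max_bytes=10)
```

A real categorization would have to deal with images that belong to several categories; this sketch sidesteps that by assuming one category per image.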

This proposal opens interesting possibilities:

  • Connecting the database dumps with the files is the only task that necessarily has to be done as a WMF-sponsored project.
  • It is possible to build client software for image collections, so that users could make their own local copies of the sets, and more than that: with such software we would be able to share all images one by one. Users would be able to categorize images and describe them in their native languages. Optionally, WMF may choose a strategic partner for that task (Flickr/Yahoo and/or Picasa/Google) and develop a free software program for automatic upload of images and for their categorization, licensing, description and similar tasks.
  • If some entity, let's say the local government of a small town, wants the images relevant to it to be categorized properly, it may employ people (probably some Wikimedians, but not necessarily) to do the task. As such categorizations should be synchronized [at some level] with Commons and other WM repositories, in the future we would have much better categorization, description and maintenance of such images.
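The first point above, connecting database dumps with files, could look roughly like the following sketch. It assumes that the dump yields a file name and a checksum per image, and that files are stored under MediaWiki's hashed upload layout (directories taken from the first hexadecimal characters of the MD5 of the file name). The function names and the manifest structure are hypothetical.

```python
import hashlib

def hashed_path(file_name):
    """Relative storage path under the assumed hashed upload layout:
    <first hex char>/<first two hex chars>/<name with underscores>."""
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{name}"

def link_dump_to_files(dump_rows):
    """Build a manifest from (file_name, checksum) rows of a database dump.
    The manifest tells a backup peer where each file lives and how to verify it."""
    return {
        name: {"path": hashed_path(name), "checksum": checksum}
        for name, checksum in dump_rows
    }

# Hypothetical dump row; the checksum string is a placeholder, not real data.
manifest = link_dump_to_files([("Eiffel Tower.jpg", "placeholder-checksum")])
```

Such a manifest is what a set of images would need to ship with, so that anyone holding the set can tell which files it should contain.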

The same may be done for articles.

And, again, that program shouldn't work as a client-server application, but as a P2P application.


After arranging the second backup of upload.wikimedia content (in Belgrade), I talked with various people about possible ways to use that content sensibly. Mike Dupont suggested a similar way of backing up images. I think that this kind of distributed backup would significantly increase the number of copies of the images.

Various companies and institutions would be interested in hosting backups of certain geographical regions or tags. For example, a nature conservation agency might want to host animal pictures, but not pictures of electronics.

The Linux mirror system allows many parties to mirror the data that they find interesting. ISPs are interested in mirroring data that their users are likely to want. The same approach should be used for Wikimedia content: an intelligent mirroring system that allows parties to mirror only the data they are interested in.
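A minimal sketch of that kind of selective mirroring, assuming each published set is described by a list of tags and each mirror declares the tags it cares about. The catalog contents and the name `select_sets` are hypothetical.

```python
def select_sets(catalog, interests):
    """Return the names of sets whose tags overlap the mirror's interests."""
    wanted = set(interests)
    return sorted(name for name, tags in catalog.items() if wanted & set(tags))

# Hypothetical catalog of published sets and their tags.
catalog = {
    "paris-1": {"Paris", "France", "architecture"},
    "birds-1": {"animals", "birds"},
    "circuits-1": {"electronics"},
}

# A nature conservation agency would mirror animal pictures only.
agency_sets = select_sets(catalog, ["animals"])
```

With this sample catalog the agency ends up with `birds-1` and skips the electronics set.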

Key Questions

  • How can we make a safe backup of Wikimedia content?
  • How can we be certain that a backup of everything is made?
  • How can we know what level of redundancy there is in the backup?
  • How can we make that backup useful, too?
  • How do we restore when necessary?
  • How do we test the restoration process regularly?
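Some of these questions can be made concrete. The sketch below assumes that every peer reports which sets it holds and that a checksum is known for every file. It then lists sets with no copy at all, sets below a chosen redundancy level, and restored files whose content does not match. All names, the reporting format and the choice of SHA-1 are assumptions made for illustration.

```python
import hashlib
from collections import Counter

def redundancy_report(all_sets, holdings, min_copies=2):
    """Count how many peers hold each set and flag gaps in coverage."""
    copies = Counter()
    for held in holdings.values():
        # Ignore set names that are not part of the official list.
        copies.update(set(held) & set(all_sets))
    return {
        "copies": {s: copies[s] for s in sorted(all_sets)},
        "missing": sorted(s for s in all_sets if copies[s] == 0),
        "under_replicated": sorted(
            s for s in all_sets if 0 < copies[s] < min_copies
        ),
    }

def verify_restore(expected_sha1, restored_files):
    """Return the names of files that are absent after a restore
    or whose bytes do not match the expected SHA-1 checksum."""
    bad = []
    for name, sha1 in sorted(expected_sha1.items()):
        data = restored_files.get(name)
        if data is None or hashlib.sha1(data).hexdigest() != sha1:
            bad.append(name)
    return bad

# Hypothetical peer reports.
all_sets = ["belgrade-1", "paris-1", "paris-2"]
holdings = {"peer-a": ["paris-1", "paris-2"], "peer-b": ["paris-1"]}
report = redundancy_report(all_sets, holdings, min_copies=2)
```

Here `belgrade-1` is reported as missing and `paris-2` as under-replicated. Running `verify_restore` on a sample of sets at regular intervals would be one way to test the restoration process.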

Potential Costs

  • Connecting the database dumps with the file repository.
  • Connecting articles with images.
  • Creating P2P infrastructure on the WMF side.
  • Probably, creating client software for image/file manipulation.
  • Other tasks should be handled case by case, and probably not in direct connection with the main WMF efforts on this issue.


