Proposal:Distributed backup of Wikimedia content

The status of this proposal is:
Request for Discussion / Sign-Ups

Every proposal should be tied to one of the strategic priorities below.

Achieve continued growth in readership
Focus on quality content
Increase Participation
Stabilize and improve the infrastructure
Encourage Innovation

It has been suggested that this page be merged with Proposal:Distribute Infrastructure. (Discuss)

It has been suggested that this page be merged with Proposal:Distributed Wikipedia. (Discuss)

Summary

It is possible to make a distributed backup of Wikimedia content, and of images in particular.

Proposal

Instead of relying on static backups of the content, it is possible to make a distributed backup of [presently] ~4TB of images and ~2TB of other data.

While it is not rational to expect that everyone would be willing to hold ~4TB of random images, it is possible to build sets of images categorized by place, region, ontological category and so on. So we could create, let's say, a ~10GB set of images of Paris. That set could then be backed up by a number of people interested in having pictures and images of Paris.
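The grouping described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the catalog tuples, the function name `build_sets`, the set labels such as "Paris-1", and the tiny byte sizes are all hypothetical and do not come from any existing Wikimedia tool.

```python
from collections import defaultdict


def build_sets(images, max_bytes):
    """Group images by category, then split each category into capped sets.

    `images` is an iterable of (name, category, size_bytes) tuples
    (a hypothetical catalog format). Returns a dict mapping a set label
    such as "Paris-1" to the list of image names in that set.
    """
    by_category = defaultdict(list)
    for name, category, size in images:
        by_category[category].append((name, size))

    sets = {}
    for category, items in by_category.items():
        part, used, current = 1, 0, []
        for name, size in sorted(items):
            # Close the current set when this image would push it over the cap.
            # An image larger than the cap still gets a set of its own.
            if current and used + size > max_bytes:
                sets[f"{category}-{part}"] = current
                part, used, current = part + 1, 0, []
            current.append(name)
            used += size
        if current:
            sets[f"{category}-{part}"] = current
    return sets


# Toy catalog; sizes are tiny stand-ins for real file sizes.
catalog = [
    ("Eiffel_Tower.jpg", "Paris", 6),
    ("Louvre.jpg", "Paris", 5),
    ("Notre_Dame.jpg", "Paris", 4),
    ("Red_fox.jpg", "Animals", 3),
]
image_sets = build_sets(catalog, max_bytes=10)
```

With a real catalog, `max_bytes` would be on the order of 10GB, and the category could be any of the groupings mentioned above (place, region, ontological category).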

This proposal opens interesting possibilities:

Connecting database dumps with files is the only necessary task that should be done as a WMF-sponsored project.
It is possible to build client software for image collections, so that users can make their own local copies of the sets, and more besides. With such software, we would be able to share all images one by one. Users would be able to categorize images and describe them in their native languages. Optionally, WMF may choose a strategic partner for that task (Flickr/Yahoo and/or Picasa/Google) and develop a free software program for automatic upload of images, along with their categorization, licensing, description and similar.
If some entity, say the local government of a small town, wants images relevant to it to be categorized properly, it may employ people (probably some Wikimedians, but not necessarily) to do the task. As such categorizations should be synchronized [at some level] with Commons and other WM repositories, in the future we will have much better categorization, description and maintenance of such images.

The same may be done for articles.

And, again, that program should work not as a client-server application, but as a P2P application.

Motivation

After making an arrangement for the second backup of upload.wikimedia content (in Belgrade), I talked with various people about ways to use that content sensibly. Mike Dupont suggested a similar way of backing up images. I think that such a distributed backup of images would significantly increase the number of copies.

Various companies and institutions would be interested in hosting backups of certain geographical regions or tags. For example, a nature conservation agency might want to host animal pictures, but not electronics.

The Linux mirror system allows many parties to mirror the data that they find interesting. ISPs are interested in mirroring data that their users are likely to want. The same approach should be used for Wikimedia content: an intelligent mirroring system that allows parties to mirror the data they are interested in.

Key Questions

How can we make a safe backup of Wikimedia content?
How can we be certain that a backup of everything is made?
How can we know what level of redundancy there is in the backup?

How can we make that backup useful, too?
How do we restore when necessary?
How do we test the restoration process regularly?
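As one possible way to approach the completeness and redundancy questions above, the sketch below counts how many peers report holding each set and flags sets with too few copies. Everything here is an assumption made for illustration: the function `redundancy_report`, the peer names, the set labels, and the idea that peers report their holdings as plain lists.

```python
def redundancy_report(all_sets, peer_holdings, min_copies):
    """Count copies of each set across peers and list under-replicated sets.

    `all_sets` is the full list of set labels that should be backed up.
    `peer_holdings` maps a peer name to the set labels that peer reports
    holding (a hypothetical reporting format).
    Returns (copies, under), where `copies` maps each label to its copy
    count and `under` is the sorted list of labels below `min_copies`.
    """
    copies = {label: 0 for label in all_sets}
    for held in peer_holdings.values():
        # Use set() so a peer listing a label twice counts only once.
        for label in set(held):
            if label in copies:
                copies[label] += 1
    under = sorted(label for label, n in copies.items() if n < min_copies)
    return copies, under


all_sets = ["Paris-1", "Paris-2", "Animals-1"]
peer_holdings = {
    "peer-a": ["Paris-1", "Paris-2"],
    "peer-b": ["Paris-1", "Animals-1"],
    "peer-c": ["Paris-1"],
}
copies, under = redundancy_report(all_sets, peer_holdings, min_copies=2)
```

A set with zero copies shows that the backup is not yet complete, and the `under` list shows where more volunteers are needed. Self-reported holdings say nothing about whether the files are intact, so a real system would also need some form of integrity check before a restore could be trusted.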

Potential Costs

Connecting database dumps with the file repository.
Connecting articles with images.
Creating P2P infrastructure on the WMF side.
Probably, creating client software for image/file manipulation.
Other tasks should be handled separately, and probably not in direct connection with the main WMF efforts on this issue.

References

David Gerard, Disaster recovery planning (with comments) (other copy, different comments)
Wikimedia editor
[Foundation-l] Long-term archiving of Wikimedia content
Proposal:Host Wikipedia from Space

Community Discussion

Do you have a thought about this proposal? A suggestion? Discuss this proposal by going to Proposal Talk:Distributed backup of Wikimedia content.

Want to work on this proposal?

.. Sign your name here!