Proposal:Distributed Wikipedia

Status

The status of this proposal is:
Request for Discussion / Sign-Ups

This proposal is associated with the bolded strategic priorities below.


  1. Achieve continued growth in readership.
  2. Focus on quality content
  3. Increase Participation
  4. Ensure that the underlying project infrastructure is secure, stable, and sufficient to guarantee the permanence of the projects and support ongoing growth.
  5. Encourage Innovation



This is a featured proposal.

Summary

Imagine being able to take parts of Wikipedia with you, offline. Imagine being able to create small Wikipedia communities in outlying areas with little or no Internet connectivity (such as OLPC students in emerging countries). Imagine being able to transfer Wikipedia entries and edits by postal service or courier (on USB memory sticks) between communities. All of these things become possible when the Wikipedia database is digitally signed and distributed.

Proposal

  • Provide peer-to-peer communication between Wikipedia database nodes.
  • Provide GPG or other digital signatures on submissions of entries; the "current" infrastructure - the wikipedia.org system - would sign submissions automatically and implicitly.
  • Provide a means by which the submission of digitally signed "offline" entries can be validated prior to acceptance into the "main" Wikipedia distributed database.
  • Provide git as an option for the storage of copies of database entries (as a full peer with other distributed nodes, not as a gimmick). This option would make it easier to implement the "offline" transfer of Wikipedia contributions, by utilising git's offline transfer tooling (such as git bundle and git format-patch, which exist precisely to communicate batches of repository commits via "offline" storage media and protocols).
  • Provide a means by which communities can add their own "authorised digital signatures" to pages (the implicit case being that "self" is automatically "authoritative" for viewing one's own edits and contributions).

In this way, pages and modifications can be obtained and cached (online or offline) and verified using the digital signatures. Communities can then decide whom to regard as "authoritative". In other words, the entire Wikipedia database could in theory be forked entirely, and democratically. In this way, much of the criticism of Wikipedia's process simply melts away.

Note: two-way offline transfer by solid-state storage provides an average transfer rate far in excess of anything that could be affordably provided in such remote outlying areas, with perfectly acceptable latencies for people for whom "The Internet" is as much a myth or a pipe-dream as anything else.

Note: two-way offline transfer by solid-state storage also allows firewalls to be bypassed in countries with extreme censorship. The massive storage capacity and small physical size make the smuggling of very large (and possibly significant) sets of Wikipedia knowledge-base entries and queries both easy to do and difficult to detect or stop, within a country and across international borders.

Proposal Implementation Possibilities and Observations

It may not be necessary to turn the existing Wikipedia system (existing servers, existing code, existing database) into peer-to-peer nodes. If a sufficiently functional "connectivity" API such as that proposed by Proposal:RPC-based_facebook-esque_API is provided, then a peer-to-peer service, fully compliant with the rough specification in this proposal, could be implemented as nothing more than an external third-party application.

For this to happen, the issue of authentication - the use of external authentication mechanisms such as OpenID, GPG and other digital signatures - is of paramount importance. It would be completely unacceptable for user passwords to be relayed in the clear over uncontrolled third-party peer-to-peer infrastructure, and it would be equally unacceptable to relay HTTP authentication cookies (from a wikipedia.org HTTP login), because of the real-time HTTP access that such cookies imply.
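
As a rough illustration of the kind of signed-request scheme intended here, the following minimal sketch signs a submission and verifies it without any password or cookie ever crossing the peer-to-peer path. It uses Ed25519 from the Python "cryptography" package purely as a stand-in for GPG detached signatures; all names are illustrative assumptions, not part of the proposal.

  # Minimal sketch: detached signatures instead of passwords/cookies on the P2P path.
  # Ed25519 (from the 'cryptography' package) stands in for GPG; names are illustrative.
  import json
  from cryptography.exceptions import InvalidSignature
  from cryptography.hazmat.primitives.asymmetric.ed25519 import (
      Ed25519PrivateKey, Ed25519PublicKey,
  )

  def sign_submission(private_key: Ed25519PrivateKey, payload: dict) -> dict:
      """Attach a detached signature to a submission; no credentials are included."""
      body = json.dumps(payload, sort_keys=True).encode("utf-8")
      return {"payload": payload, "signature": private_key.sign(body).hex()}

  def verify_submission(public_key: Ed25519PublicKey, envelope: dict) -> bool:
      """A receiving node checks the signature against the contributor's public key."""
      body = json.dumps(envelope["payload"], sort_keys=True).encode("utf-8")
      try:
          public_key.verify(bytes.fromhex(envelope["signature"]), body)
          return True
      except InvalidSignature:
          return False

  key = Ed25519PrivateKey.generate()
  envelope = sign_submission(key, {"page": "Example", "edit": "add one line"})
  print(verify_submission(key.public_key(), envelope))   # True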

Motivation

  • The load on the Wikipedia database is increasing.
  • The dependence on _financial_ contributions is reduced; anyone can simply set up a peer node, caching digitally signed pages and making them available.
  • The creation of a full interaction API (a la the Facebook API) extends the reach of Wikipedia beyond the limitations of its current web-only interface.
    • Example: in combination with an alternate front-end user interface, an affordable embedded system (or a mobile phone application) made available in third-world countries may simply not have the resources to run JavaScript (or even a web browser). A simpler interface can be designed which stores edits on the device, and the distributed peer-to-peer background service running on the device uploads the edits into the peer-to-peer network as and when the device's resources permit (a sketch of such an edit queue follows this list).
  • The provision of "offline" Wikipedia edits and page queries allows for the subversion of automated firewalls such as those known to be deployed in China and in Middle Eastern countries, with their strict and insane policies of cutting off access to articles about the counties of "Essex" and "Middlesex" in the United Kingdom, and to articles about "Wide Area Network Knowledge-bases".
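
A minimal sketch of the device-local edit queue described in the example above (the file name, format and upload hook are illustrative assumptions): edits are appended to a local JSON file and flushed into the peer-to-peer service whenever the device has the resources or connectivity to do so.

  # Minimal sketch of a device-local edit queue; file name and structure are illustrative.
  import json
  import os

  QUEUE_PATH = "pending_edits.json"

  def queue_edit(page: str, diff: str) -> None:
      """Store an edit locally; nothing is sent over the network here."""
      edits = json.load(open(QUEUE_PATH)) if os.path.exists(QUEUE_PATH) else []
      edits.append({"page": page, "diff": diff})
      with open(QUEUE_PATH, "w") as f:
          json.dump(edits, f)

  def flush_edits(upload) -> None:
      """Run by the background service when resources/connectivity permit;
      'upload' is whatever function pushes one edit into the P2P network."""
      if not os.path.exists(QUEUE_PATH):
          return
      remaining = []
      for edit in json.load(open(QUEUE_PATH)):
          try:
              upload(edit)
          except OSError:           # no connectivity yet: keep the edit queued
              remaining.append(edit)
      with open(QUEUE_PATH, "w") as f:
          json.dump(remaining, f)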

Key Questions

  • Wikipedia is an important worldwide resource. Why is so much power and control over Wikipedia placed in the hands of those who control the domain name, those who control the database, and those who host the database and the front-end?
  • Why is there so much reliance on "real-time" Internet access? What happens to all the knowledge collated if there is a shift of international power and trust (such as might be created by another world war) such that the Internet becomes temporarily or permanently fragmented?
  • Why is there an assumption that "real-time" Internet access, in remote outlying areas of the world, will become as ubiquitous as it is in the first world, or that it will become affordable? With the massive increase in storage capacity of portable solid-state technology such as USB memory sticks, it is far preferable and vastly more cost-effective to place requests for Wikipedia pages and contributions/edits onto portable offline storage, ship them to a location where the queries can be uploaded, and place the responses back onto the USB memory stick. Yes, there is a real-world use case for the "USB memory stick attached to a carrier pigeon", as demonstrated by the recent Slashdot story of a company in South Africa that was sick to the back teeth of the high cost, unreliability and slow data rate of their ISP link over a mere 80 km. The pigeon won, by a massive margin.
  • Wikipedia's charter apparently has the requirement that, before data can be handed over to individuals, the individual must provide proof of physical identity, in the form of faxing a copy of a passport or driver's licence to the Wikipedia office. In the context of a distributed peer-to-peer DW node being effectively nothing more than a highly advanced "robot", the "physical identification" requirement seems ridiculous at best, and a complete show-stopper at worst. Discussion, clarification and resolution of this issue is of paramount importance to the workability and effectiveness of the DW proposal. Discuss.

Potential Costs

Once the development is done, the running costs are actually reduced. Whilst the "authoritative" database is still needed, the provision of a stand-alone application (even if the stand-alone application is itself a web service) which serves cached copies of pages would dramatically reduce the load on the Wikipedia servers, at least for viewing.

Appendix: Rough Outline of Inter-Node (Distributed) API

This API is for the peer-to-peer communication between nodes.

Please be advised that prior reading on asynchronous APIs, distributed systems, SQL queries, DHT algorithms, the mlDonkey / eMule distributed search algorithm, git, KeyNote, Advogato trust metrics, Slashdot moderation, Google Wave and JSON-RPC is a good idea in order to best understand the relevance of this API.

Definitions:

  • GUID: Globally Unique Identifier, i.e. a 128- or 160-bit hash.
  • Tuple: see the Python definition.
  • List: see the Python definition.
  • Dictionary: aka "associative array"; see the Python definition.
  • Metadata: anything that is linked to a specific GUID (in practical terms, it corresponds to the fields of a row in a SQL table).
  • Property: the name given to a particular piece of metadata (in practical terms, it corresponds to a SQL table's field name).
  • "Well known": indicates that there is a predefined, unique, published and absolutely unchangeable constant (most likely a GUID) associated with, and uniquely identifying, a particular concept (such as "a user" or "a page"). All objects of a particular type, such as users, will have a "well known" type property associated with them.

The API needs to be asynchronous for practical purposes (distributed peer-to-peer protocols always are; consider DNS). It is therefore necessary to return sufficient information in each command such that the initial call can be "tied up" with the response. The exact details are yet to be determined, but JSON-RPC already provides a "unique id per call" system, which may prove sufficient.
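
As a minimal sketch of how the JSON-RPC "unique id per call" mechanism could tie asynchronous responses back to their requests (the method name, parameters and bookkeeping structures here are illustrative assumptions):

  # Minimal sketch: correlating asynchronous responses with requests via JSON-RPC ids.
  # The "lookup" method name and its params are illustrative, not a fixed part of the spec.
  import itertools
  import json

  _next_id = itertools.count(1)
  _pending = {}   # id -> the original request, so a late response can be "tied up" with it

  def make_request(method: str, params: dict) -> str:
      req = {"jsonrpc": "2.0", "id": next(_next_id), "method": method, "params": params}
      _pending[req["id"]] = req
      return json.dumps(req)

  def handle_response(raw: str):
      resp = json.loads(raw)
      original = _pending.pop(resp["id"], None)   # match the response to its request
      return original, resp.get("result")

  wire = make_request("lookup", {"name": "Main Page", "type": "well-known-page-type-guid"})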

Note: as well as the actual response to the query, all responses from all queries can also include a "please REDIRECT to" response, which will contain a list of contact addresses (containing at least IP, port and protocol) for other distributed nodes which may be able to provide additional responses. The query should be repeated to each of the other nodes, taking into account the following factors (a sketch of the required bookkeeping follows this list):

  • A maximum sensible limit should be set on the number of outstanding queries (regardless of depth) being submitted at any one time (to be determined / discussed, but something like 16-20 per user seems reasonable). For example: if the first query is sent to 10 nodes and one of them responds with 20 "REDIRECT" contacts, only 11 of those contacts are ALLOWED to be contacted (if the limit is set to 20), because there are 9 queries still awaiting responses. Assuming that one of THOSE responds with a further 20 "REDIRECT" contacts, and none of the other 19 have yet responded, only ONE contact out of the cumulative list (the 9 as-yet-uncontacted addresses from the first response PLUS the 20 from the second response) can be contacted. It is entirely up to implementations how to pick from this cumulative list, but the limit on the number of outstanding queries MUST be respected.
  • Implementations MUST keep track of the tuples (addresses) to which a specific outstanding query has already been sent, and NOT send duplicate queries just because two or more nodes happened to give out the same contact details.
  • If the protocol is UDP or otherwise datagram-based, repeats of queries are acceptable (and expected!) but MUST be kept within sensible limits (such as the behaviour followed by TCP, which is something like 3 x 1 second, 3 x 4 seconds, then 9 seconds, then give up entirely).
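
A minimal sketch of that bookkeeping (the limit value and the data structures are illustrative assumptions): the implementation keeps, per query, a set of addresses already contacted and a count of outstanding requests, and only follows as many REDIRECT contacts as the limit allows.

  # Minimal sketch of REDIRECT handling: outstanding-query limit plus duplicate suppression.
  # The limit of 20 and the data structures are illustrative, not part of the spec.
  MAX_OUTSTANDING = 20

  class QueryState:
      def __init__(self):
          self.contacted = set()    # (ip, port, protocol) tuples already sent this query
          self.outstanding = 0      # queries sent but not yet answered

      def handle_redirect(self, contacts, send):
          """'contacts' is the REDIRECT list; 'send' transmits the query to one address."""
          for addr in contacts:
              if self.outstanding >= MAX_OUTSTANDING:
                  break                   # respect the outstanding-query limit
              if addr in self.contacted:
                  continue                # never duplicate a query to the same address
              self.contacted.add(addr)
              self.outstanding += 1
              send(addr)

      def handle_response(self, addr):
          self.outstanding -= 1           # frees a slot for further REDIRECT contacts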

There are several reasons for providing a "please REDIRECT to" system:

  • An overloaded server can simply shift into "redirect" mode, sending all queries to alternate servers without having to process the query or provide any details, thus reducing load. The load is reduced for two reasons: firstly, processing of the query is not necessary; secondly, by _actually_ responding, the need for retries (if the protocol is UDP or datagram-based) from a given node goes away.
  • A DHT-based query can provide any responses that it has, and also redirect the query to "nearest neighbours" according to the DHT search criteria being deployed. Thus, the "REDIRECT" system implicitly provides the ground framework on which to build the peer-to-peer infrastructure.

Lookup

  • input: "name" and "well-known type"
  • returns: GUID

This command translates "name" and "type" into a GUID. It is an asynchronous event; its implementation results in a distributed search query (see eMule and other file-sharing DHT algorithms and implementations for details). Results that are already in the local cache will of course provide an immediate response. Only EXACT MATCHES against the "name"/"type" tuple are returned. Although many (identical) responses may be received, because the mapping between "name" + "type" and the GUID is one-to-one and onto, the query can (in theory) cease as soon as the first response is received.

Note: not entirely. There is always the possibility that the distributed network is attacked and poisoned, which is why the digital signatures are so absolutely critical. Therefore, on receipt of the response, implementations are STRONGLY advised to make a metadata query for the digital signatures and verify them. (The action to be taken if the signatures do not match is yet to be discussed, but it will include creating a permanent record of the incident and initiating real-time investigation procedures immediately. The policies followed by e.g. BitTorrent and other file-sharing networks should be investigated.)

Examples of the "well-known type" argument would be a GUID corresponding to "User" or to "Wikipedia Page".
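
A minimal sketch of the Lookup call from the querying side (the well-known type constant, the cache structure and the transport hook are illustrative assumptions):

  # Minimal sketch of the Lookup command; the well-known type constant is a placeholder.
  WELL_KNOWN_PAGE_TYPE = "well-known-page-type-guid"   # stands in for the published constant

  local_cache = {}   # ("name", type_guid) -> guid

  def lookup(name, type_guid, send_query):
      """Return the GUID for (name, type) from the local cache, or issue a distributed query."""
      key = (name, type_guid)
      if key in local_cache:
          return local_cache[key]    # cached exact matches answer immediately
      send_query({"method": "lookup", "params": {"name": name, "type": type_guid}})
      return None                    # the answer arrives asynchronously; verify its signatures

  lookup("Main Page", WELL_KNOWN_PAGE_TYPE, send_query=print)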

Edit/Changes List Lookup

  • input: GUID of the object being queried (e.g. a page); range-limiting arguments (date, branch-point GUID etc.)
  • returns: list of tuples of:
    • the GUID of the edit / change
    • the parent that the edit is to be applied to; if this is the first change (e.g. the initial page creation), the parent will be the object's GUID

This command obtains the list of GUIDs which will need to be queried in order to obtain the details of the edits / changes to an object (and other metadata). The closest analogy is a series of git commits. (Note: to obtain information about the _type_ of the change, e.g. add lines or delete lines, see the Metadata "changetype" property.)
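
A minimal sketch of how a client might walk the returned (edit GUID, parent GUID) tuples into a linear history, git-log style (this assumes an unbranched chain for simplicity; all names are illustrative):

  # Minimal sketch: ordering (edit_guid, parent_guid) tuples from the Edit/Changes List
  # Lookup into a linear history. Assumes no branches; names are illustrative.
  def order_changes(page_guid, changes):
      """changes: list of (edit_guid, parent_guid); the first edit's parent is the page GUID."""
      by_parent = {parent: edit for edit, parent in changes}
      history, current = [], page_guid
      while current in by_parent:        # follow the chain from the page's creation onward
          current = by_parent[current]
          history.append(current)
      return history                     # oldest edit first

  edits = [("e2", "e1"), ("e1", "page-guid"), ("e3", "e2")]
  print(order_changes("page-guid", edits))   # ['e1', 'e2', 'e3']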

Query Attachments

TODO: get the list of attachments associated with a page, by GUID. Some search criteria may be needed here as well; input from experienced Wikipedia developers is needed.

Note: a query for the actual attachment is done as a metadata query; the attachment's data itself is just another metadata property.

Note: special consideration of images needs to be thought through. Caching of smaller versions of images should NOT be considered "authoritative". It is up to implementors to decide whether to number-crunch and create smaller images from larger ones, but they SHOULD NOT burden other nodes with the cost of requesting smaller versions of images if they already have a larger one which they can number-crunch locally (a sketch follows).
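
A minimal sketch of that "number-crunch locally rather than re-request" suggestion, using the Pillow imaging library as an assumed dependency (file names and sizes are illustrative):

  # Minimal sketch: derive a smaller, non-authoritative rendition from a locally cached
  # full-size image instead of asking another node for it. Uses Pillow; paths illustrative.
  from PIL import Image

  def local_thumbnail(cached_path, out_path, max_size=(320, 320)):
      img = Image.open(cached_path)
      img.thumbnail(max_size)    # shrinks in place, preserving aspect ratio
      img.save(out_path)         # cache the derived copy; never mark it "authoritative"

  local_thumbnail("cache/some-image-guid.png", "cache/some-image-guid.thumb.png")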

Metadata Query

  • input: list of GUIDs (of anything, of any type); list of properties (as strings) to be returned
  • returns: list of dictionaries (of metadata on objects known to the node).

This command obtains metadata (where wikimedia data itself is considered to be metadata) for anything - any kind of object. The order of the responses, and the number of responses given, is arbitrary (i.e. between 0 and the full length of the list of GUIDs). As the responses are asynchronous, a special metadata property named "id" MUST always be included in each dictionary in the list of responses, regardless of whether the "id" property was included in the list of properties to be returned.

Note: a node MUST NOT return data that is not in its own cache. A node SHOULD pass on a list of "REDIRECTs" to nearest neighbours, according to the DHT algorithm, which may also have copies of the metadata being requested.
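
A minimal sketch of a node answering a Metadata Query from its local cache only (the cache structure and its contents are illustrative): the mandatory "id" property is always included, and GUIDs that are not cached simply produce no entry.

  # Minimal sketch of a node-side Metadata Query handler. Only locally cached objects
  # are answered, and the mandatory "id" property is always added. Data is illustrative.
  cache = {
      "guid-123": {"pagename": "Main Page", "data": "== welcome ==", "changetype": "create new page"},
  }

  def metadata_query(guids, properties):
      responses = []
      for guid in guids:
          obj = cache.get(guid)
          if obj is None:
              continue               # MUST NOT answer for data we do not hold
          row = {"id": guid}         # "id" is always returned, requested or not
          for prop in properties:
              if prop in obj:
                  row[prop] = obj[prop]
          responses.append(row)
      return responses

  print(metadata_query(["guid-123", "guid-999"], ["pagename"]))
  # -> [{'id': 'guid-123', 'pagename': 'Main Page'}]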

Metadata tags

  • The "id" property will always be the GUID corresponding to the object being queried.
  • The "pagename" property is the exact name of the page. It is included as a metadata tag for search purposes.
  • The "data" property holds the actual wikimedia data.
  • The "changetype" property indicates the type of change, e.g. "add lines", "delete lines", "create new page", "archive point begins here".
  • The "creator signature" property is the GPG (or other) digital signature.
  • The "seconding signature" property is a *list* of GPG signatures "seconding", i.e. "authorising", the wikimedia data change. Think in terms of "I second the motion!". It is up to implementations to make best use of this property; implementors are advised to look up KeyNote and Advogato trust metrics for suggestions.

In the case of "User" objects, metadata tags can include (illustrative example dictionaries follow below):

  • the "username"
  • the "homepage"
  • the "real name"
  • the "public signature" e.g. the GPG public key
  • the "openid"

etc.
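
For illustration only (every value below is invented), a page-edit object and a user object might carry metadata dictionaries along these lines:

  # Purely illustrative metadata dictionaries; every value here is invented.
  page_edit = {
      "id": "guid-of-this-edit",
      "type": "well-known-page-edit-type-guid",
      "pagename": "Main Page",
      "changetype": "add lines",
      "data": "+ A new paragraph of wiki text.",
      "creator signature": "-----BEGIN PGP SIGNATURE----- (detached signature) -----END PGP SIGNATURE-----",
      "seconding signature": ["-----BEGIN PGP SIGNATURE----- (seconder 1) -----END PGP SIGNATURE-----"],
  }

  user = {
      "id": "guid-of-this-user",
      "type": "well-known-user-type-guid",
      "username": "example_editor",
      "homepage": "https://example.org/~editor",
      "real name": "A. N. Example",
      "public signature": "-----BEGIN PGP PUBLIC KEY BLOCK----- (public key) -----END PGP PUBLIC KEY BLOCK-----",
      "openid": "https://openid.example.org/editor",
  }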

Metadata Search

  • input: search criteria as a series of (type, property, criteria) tuples, optionally joined by BOOLEAN statements (format to be determined / discussed).
  • returns: list of GUIDs of objects (known to the node) that match the search criteria

This command results in asynchronous peer-to-peer distributed searches. eMule and other file-sharing search algorithms are the inspiration behind this command, where the queries can be quite sophisticated, e.g. "FILE_SIZE < 1Mb AND AUDIO_BITRATE = 128k". Implementations must be prepared to merge incoming responses over a period of up to 30 seconds. Implementations must also be prepared to make SECONDARY metadata queries to obtain the full details (if not already cached), to compare the responses against the initial search criteria, and to make their OWN decision on how to sort the returned results (and re-sort them as more search results arrive).

Note: for the metadata search command to work, all objects must have a well-known published type (GUID). The search criteria must include the type of the object, otherwise the properties have no context. In SQL terminology (a sketch of the criteria format follows this list):

  • "type" maps, conceptually, to "database table name"
  • "property" maps, conceptually, to "database field"
  • "criteria" maps, conceptually, usually to the right hand side of WHERE clause comparisons, including the operator.
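
For illustration only (the wire format is explicitly still to be determined / discussed; the constants and operators below are invented), a search expression of (type, property, criteria) tuples joined by boolean operators might look like this, with its conceptual SQL equivalent:

  # Illustrative only: one possible shape for (type, property, criteria) tuples joined
  # by boolean operators. The real format is still to be determined / discussed.
  PAGE_TYPE = "well-known-page-type-guid"

  criteria = [
      (PAGE_TYPE, "pagename", ("LIKE", "%Distributed%")),
      "AND",
      (PAGE_TYPE, "changetype", ("=", "create new page")),
  ]
  # Conceptual SQL equivalent:
  #   SELECT id FROM Page
  #   WHERE pagename LIKE '%Distributed%' AND changetype = 'create new page';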

Note: for page searches, AT LEAST ONE of the search criteria MUST include a q-gram, and q-grams of 4 characters or more are recommended. Other restrictions may have to be added. It may be necessary, in order to perform proper DHT searches, to take the MD5 hash of the q-gram(s) and to use that hash in the DHT "search" (see the sketch below).
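
A minimal sketch of extracting 4-character q-grams from a page-name search term and hashing them with MD5 to form DHT lookup keys (the function names are illustrative):

  # Minimal sketch: 4-character q-grams of a search term, hashed with MD5 into DHT keys.
  import hashlib

  def qgrams(term, q=4):
      term = term.lower()
      return [term[i:i + q] for i in range(len(term) - q + 1)]

  def dht_keys(term):
      return [hashlib.md5(g.encode("utf-8")).hexdigest() for g in qgrams(term)]

  print(qgrams("wikipedia"))        # ['wiki', 'ikip', 'kipe', 'iped', 'pedi', 'edia']
  print(dht_keys("wikipedia")[0])   # 128-bit hex digest used as the DHT lookup key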

Note: a node MUST NOT return data that is not in its own cache.

Metadata Submit

This API call is how data (including wikimedia changes and "new user creation") gets "pushed" to the Wikipedia servers, to any other servers explicitly added to the "push" list according to the user's personal preferences (and/or the node administrator's criteria), and also to "nearest neighbours" in the DHT participation graph (for a particular hash / search criterion, exactly as is done in a "standard" peer-to-peer distributed network).

  • input: list of dictionaries of metadata being requested to be stored/cached by the recipient node
  • returns: list of "ok" or "sorry, not accepted" responses.

This command asks the recipient to perform the conceptual equivalent of a list of SQL "REPLACE INTO" commands. Each metadata dictionary MUST contain a "type" property indicating the well-known object type (such as "Page", "Page Edit" or "User"), and it MUST contain an "id" property (as a GUID). For a "create", the node must randomly create a new GUID; for an "update", the same GUID must be utilised. Implementations are also STRONGLY ADVISED to add "digital signature" properties to all submissions; in the case of anonymous Wikipedia contributions, a server should STILL add a "secondary signature" to the submission. Note: for "updates", the submission MUST be digitally signed, and nodes MUST respect the policies for "updates" on particular object types (policies to be determined / discussed). Failure to respect "update" policies will result in nodes being instantly blacklisted, to avoid the propagation of corrupted data.

The responses are a list of dictionaries containing the original "id" property and a "status" property, which will be the number 0 on success (error values, whether strings or error codes, are yet to be determined / discussed, but they WILL be nice human-readable messages).
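
A minimal sketch of a recipient node handling a Metadata Submit (the acceptance policy and the non-zero status strings are illustrative assumptions; only "0 on success" comes from the description above):

  # Minimal sketch of a recipient node handling Metadata Submit. The acceptance policy
  # and the non-zero status strings are illustrative; only "0 on success" is specified.
  store = {}   # guid -> metadata dictionary (the node's local storage / cache)

  def accepts(entry):
      """Site-discretionary policy stub; a 'leeching' stub would simply return False."""
      return True

  def metadata_submit(entries):
      responses = []
      for entry in entries:
          if "id" not in entry or "type" not in entry:
              responses.append({"id": entry.get("id"), "status": "sorry, not accepted"})
          elif not accepts(entry):
              responses.append({"id": entry["id"], "status": "sorry, not accepted"})
          else:
              store[entry["id"]] = entry   # conceptual SQL "REPLACE INTO"
              responses.append({"id": entry["id"], "status": 0})
      return responses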

Note: in the strictest sense, the "Metadata Submit" API call is the one which some implementations can consider to be optional. A "stub" implementation would be to never issue this API call, and to always respond with a list of "sorry, not accepted". However, other implementations SHOULD take note and keep track of the number of "not accepted" responses, because it is an indication that (in BitTorrent terminology) the node is "leeching" from the network, by not participating in the storage and retrieval of Wikipedia pages and other metadata for the benefit of the whole network.

It is, however, impractical for all nodes to accept *all* requests to store/cache data. It is therefore ENTIRELY at the discretion of the administrator of the recipient node to determine whether or not to accept the requests for storage. In the case of Wikipedia itself, the policy MUST map to the EXACT same policy as applies to the EXISTING web-based wikipedia.org interface (where practical). In the case of all other nodes, the policy is entirely site-discretionary, but sites are advised that the consequences of NOT participating could include other nodes throttling responses or simply banning a node outright, and publishing (broadcasting, into the distributed database) the digital signature of the throttled or banned node.

This is the primary reason why all submissions are strongly advised to have a site's "digital signature" added: it is a declaration of intent, on behalf of the site administrator, that the node under their responsibility will participate with integrity in the cooperative and collaborative maintenance of Wikipedia pages.


Community Discussion

Do you have a thought about this proposal? A suggestion? Discuss this proposal by going to Proposal talk:Distributed Wikipedia.

Want to work on this proposal?

  • ... Sign your name here!

Jackchristopher (talk) 01:57, 18 July 2012 (UTC)