Proposal:Make Wikisource scale

Summary

Full disclosure: this is a visionary proposal.

This is a sub-proposal of Make Wikimedia projects scale: start a huge program to transcribe the millions of books that have been mass-digitized around the world, from images to proofread text.

Proposal

Make sure that we can use the scans produced by mass-digitization programs: images and OCR text must be in the public domain, yet Google, Microsoft and many others claim to hold "transcription rights" or other copyrights on them. We could either:
1. persuade them to voluntarily release all works in the public domain without claiming any copyright,
2. or lobby for the law to state that.
  1. That has never been a problem: we copy all these scans to Commons or Wikisource, and tag them as public domain, and nobody has ever complained. Adding a copyright mention on something has never made that work copyrighted.
Develop vital tools to transcribe:
1. DjVu,
2. metadata extension for bibliographic records (DublinCore, MARC),
3. OCR system.
Study current systems for proofreading OCRed works, like Distributed Proofreaders or the Australian Newspapers Digitisation program (where people seem to proofread huge quantities of text), and determine which is the most user-friendly and efficient one (which may or may not be a wiki system: see Stop using wikis for tasks for which wikis are not suitable).
1. The current system at Wikisource can certainly be improved, but it is not too bad.
Develop the chosen system to bring it to its best. E.g., if it were MediaWiki with Proofread Page:
- automatic conversion of all existing scans (pdf, tiff or whatever) into DjVu format and automatic uploading to Wikimedia Commons,
- automatic, efficient OCR and creation of the index and pages for each book (with text from OCR),
  - That already exists: the text is automatically imported when creating a new page.
- automatic creation of meta-pages (sources etc.),
- user-friendly editing interface.
Run a huge, worldwide campaign to get hundreds of thousands of volunteers to work on those texts.
1. You can imagine an alliance of the Wikimedia Foundation with UNESCO and Open Archive Foundation, which would then be developed by Wikimedia chapters on a local basis, to get students to work for free as in the Civilian Service, or as part of their classes (an example is De' matematici italiani anteriori all'invenzione della stampa, a historical 1860 book digitized and transcribed by Aubrey as part of his Bachelor's degree thesis). We could involve elderly people as well, but this seems more difficult.
2. Alternatively, we could pay students directly, as was done in the Slovene Wikisource.

Motivation

Educated people have normally spent much more time reading books than looking things up in encyclopedias. So, although Wikipedia is much vaster than a normal encyclopedia, if Wikisource is a real digital library then it's ridiculous that it has only about 0.3 % of Wikipedia's page views.

For readers of Wikisource and other projects which can use Wikisource as a source ("Project Sourceberg"), the main advantage of Wikisource is that it's hypertext:
- books are readable as plain-text pages (instead of huge PDF files or other formats),
- single words, sentences, paragraphs etc. can link and be linked to original documents, sources and references, pages (articles of Wikipedia or other books) to expand on a subject, etc.
The problem is:
- essentially only old texts (public domain) can be on Wikisource,
- Wikisource has a small number of contributors and a very small number of books compared to traditional libraries (while Wikipedia is a huge and high-quality encyclopedia compared to traditional ones),
- people often don't like to read books online.
The features of Wikisource that are completely unique on the Net are:
- proofreading, with DjVu files (example),
- the possibility to compare a text side by side with the same text in another language (see e.g. the Main Page and click the ⇔ beside one of the "other languages").

Mass digitization of books and other works is highly technology-intensive, capital-intensive and professional: Google, Microsoft, Yahoo, Open Archive, and the French (Gallica, Europeana) and Japanese (National library^[1]) governments are investing hundreds upon hundreds of millions of dollars. This is a big industry, and we don't have the money to do that; moreover, volunteers can't do such professional work, which requires a high division of labour and Taylorization of the workers who operate the large automated digitization machines.

The Google Book Search Settlement Agreement will let Google digitize several million books; if the settlement is not approved, or if it is approved and the public sector or some countries don't want to participate, governments will be obliged to do such mass digitization themselves. Either way, images will probably be available quite soon.

But nobody is planning a mass transcription of these digitized books, because it is highly labour-intensive. Wikimedia has the know-how to organize the work of huge numbers of people towards a constructive goal: we can take a huge step forward compared with Project Gutenberg. Moreover, OCR of very old texts is completely impossible, so manual transcription is the only way.^[2]

In most developed countries conscription no longer exists, so a lot of young manpower is available; as Clay Shirky said,

If I had to pick the critical technology for the 20th century, the bit of social lubricant without which the wheels would've come off the whole enterprise, I'd say it was the sitcom. [...] For the first time, society forced onto an enormous number of its citizens the requirement to manage something they had never had to manage before--free time.

And what did we do with that free time? Well, mostly we spent it watching TV. [...]

And it's only now, as we're waking up from that collective bender, that we're starting to see the cognitive surplus as an asset rather than as a crisis. We're seeing things being designed to take advantage of that surplus, to deploy it in ways more engaging than just having a TV in everybody's basement.

Another issue, related to the aims and objectives of Wikisource, is the problem of metadata.

Right now, active Wikisource users lack the librarianship and information science skills needed to give proper attention to the metadata and bibliographic records of the books hosted on Wikisource.

A related proposal addresses the issue of building a book catalog as a Wikimedia project. Developing an extension that lets users enter metadata in an accepted standard (Dublin Core, for example) could improve our quality and trustworthiness, not to mention the possibility of harvesting data from different catalogs and databases and synchronizing our book-related information.
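
To give an idea of what a standard bibliographic record could look like, here is a minimal sketch in Python that serializes a few Dublin Core elements as XML. It is only an illustration, not the design of any existing extension: the function name `dublin_core_record` and the `metadata` wrapper element are hypothetical, and the language code "it" is an assumption based on the title of the book mentioned above.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names to an XML string.

    Hypothetical helper for illustration; the <metadata> wrapper is an
    arbitrary choice, not something prescribed by Dublin Core.
    """
    root = ET.Element("metadata")
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Title and year come from the example book cited in this proposal;
# the language code is an assumption.
record = dublin_core_record({
    "title": "De' matematici italiani anteriori all'invenzione della stampa",
    "date": "1860",
    "language": "it",
    "type": "Text",
})
print(record)
```

A record in this shape could in principle be exchanged with library catalogs that understand Dublin Core, which is the kind of harvesting and synchronization mentioned above.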

Key Questions

If we don't make old texts really accessible, who will do that?
What's the best tool to reach this goal? Is MediaWiki suitable? Is this the way to go for Wikisource (maybe Building a database of all books ever published is better)?
- That is a different objective: Wikisource can only host public domain works. However, a database of all works, even copyrighted ones, is the aim of this other proposal.
Can a private charity run such a huge program, or should a public institution do it?
How can we raise enough funds?
...
...

Potential Costs

Very, very, very, very rough estimates!

Lobbying will require us to hire at least one program manager for the USA and another for the EU, for two or three years (to be optimistic): say, 150 000 $. This would be useful even without the rest of the program, but it will be easier to obtain such a result if we show people the goal for which it's vital (transcription of n millions of books).
A sort of "Wikisource usability initiative": say, 500 000 $ (compared to 870 k$ for Wikipedia and 300 k$ for Commons).
Tools:
1. developing DjVu: 50 000–100 000 $??
2. developing free OCR software Tesseract or buying a commercial professional one: 100 000 $????
Labour: for one million books, 500 000 volunteers, to be trained in groups of 40 for 6 hours by an expert Wikisourceror at 9 $/hour: 675 000 $. Moreover, one of:
1. Civilian Service/students' classes: hire at least one project manager per country to persuade institutions and follow the program; say 100 000 students per country, then 10 project managers, i.e. five times the first estimate above (lobbying): 750 000 $;
2. paid work: the Slovene project spent 10 000 € for 380 texts (200 still waiting for correction); if we improve tools we can reduce costs by, say, 40 %: 1 000 000 / 380 × 10 000 × 0.6 = 15 789 473 €, about 22.5 M$
  Note: this may be a heavy underestimate, since Slovenian labour is supposedly cheaper.

Grand total: 2 225 000–24 000 000 $.
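
The arithmetic behind these figures can be checked with a short Python sketch. All amounts are the rough estimates given above; the exchange rate `EUR_TO_USD` is an assumption, chosen only because it is the rate implied by "15 789 473 €, about 22.5 M$", and the variable and function names are made up for this illustration.

```python
# All amounts in US dollars unless noted; values are the rough estimates above.
LOBBYING = 150_000
USABILITY = 500_000
DJVU_LOW, DJVU_HIGH = 50_000, 100_000
OCR = 100_000

# Assumption: rate implied by "15 789 473 EUR, about 22.5 M$".
EUR_TO_USD = 1.425


def training_cost(volunteers=500_000, group_size=40, hours=6, rate=9):
    """Cost of training volunteers in groups led by one paid expert."""
    groups = volunteers // group_size
    return groups * hours * rate


def paid_transcription_eur(books=1_000_000, sample_texts=380,
                           sample_cost_eur=10_000, saving=0.40):
    """Scale the Slovene project's cost per text to one million books."""
    return books / sample_texts * sample_cost_eur * (1 - saving)


# Option 1: ten project managers, five times the two-manager lobbying estimate.
civilian_service = 5 * LOBBYING
# Option 2: paid work, converted to dollars.
paid_work = paid_transcription_eur() * EUR_TO_USD

common_low = LOBBYING + USABILITY + DJVU_LOW + OCR + training_cost()
common_high = LOBBYING + USABILITY + DJVU_HIGH + OCR + training_cost()

low = common_low + civilian_service
high = common_high + paid_work

print(f"training: {training_cost():,} $")
print(f"paid work: {paid_transcription_eur():,.0f} EUR")
print(f"total: {low:,} $ to {high:,.0f} $")
```

Under these assumptions the low end uses the cheaper DjVu figure with the Civilian Service option, and the high end uses the dearer DjVu figure with paid work, which lands slightly above the rounded 24 000 000 $.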

Footnotes

[1] Japanese national library to digitize books; distribution uncertain.
[2] Wikisource transcribes eye-witness report on French invasion of Russia, Wikimedia Foundation blog, August 1st, 2008.

References

Make Wikimedia scale
[Foundation-l] [ol-discuss] Open Library, Wikisource, and cleaning and translating OCR of Classics: six mainsets of features/operations for OCR proofreading.
- Complete discussion: wikisource-l, foundation-l and ol-discuss, then foundation-l again.
Proposal:Building a database of all books ever published

Community Discussion

Do you have a thought about this proposal? A suggestion? Discuss this proposal by going to Proposal talk:Make Wikisource scale.

