On digitizing public-domain Chinese books and other resources

In the China TF recommendations, we gave this advice:

  • Expand coverage on topics more relevant to Chinese users, such as Chinese people, Chinese culture, Chinese geography, etc.
    • Many works in the public domain may be used to expand the coverage in a systematic way
    • Project to enhance geographic entries with semantic format support

China has a history of thousands of years, and hence many Chinese books and other resources are available in the public domain. Digitizing them into Chinese Wikisource or Chinese Wikipedia is one option for enhancing our Chinese content, but we should first select the books and resources carefully, and then put our limited resources into this area.

I can see several advantages to this:

  • improve our position in the competition with Baidu and Hudong; Baidu and Hudong enhance their repositories systematically by inputting copyrighted resources.
  • if we carefully select some valuable resources and present them in a good format, we can attract more academics into our projects and form a healthy ecosystem.
  • benefit all readers

On the technical side of this idea, we can try various OCR tools or have people digitize the texts by hand; the cost of labor in China is relatively low.

We have discussed these ideas in the Chinese community, and one of our questions is:

  • Could this plan be formally supported by the WMF, perhaps even with some financial support?
Mountain09:27, 7 April 2010

Would substantial political issues be raised by some of the material? Will Google's relationship with China create problems of any sort? How would Wx ensure that international copyright standards are upheld (positing that some material in Chinese may violate international copyright law)?

And with some older books containing characters not readily understood by modern readers, how would they be presented?

As a "strategic" issue, I think you are saying that "increasing material from China" is a specific goal to be presented?

Collect10:12, 7 April 2010

Sorry, maybe I should have put this thread on the discussion page for the China TF; I only want to track the recent progress of the China TF recommendations.

As for your questions, my answers are below:

  • Would substantial political issues be raised by some of the material? Will Google's relationship with China create problems of any sort? How would Wx ensure that international copyright standards are upheld (positing that some material in Chinese may violate international copyright law)?
    • I don't think these materials could cause any political or legal issues, because their authors all died hundreds or even thousands of years ago. This knowledge belongs to all humankind.
  • And with some older books containing characters not readily understood by modern readers, how would they be presented?
    • Unicode supports most Chinese characters; in fact, Unicode drew on the Kangxi Dictionary, which was compiled hundreds of years ago, for these characters.
    • Chinese characters have been stable for thousands of years; well-educated users, especially from Hong Kong and Taiwan, can recognize most of them without any difficulty.
    • Some classic books have been republished by modern publishers, who have already normalized the text, so no special characters remain.
    • Some material, especially from oracle bones and bronze ware, is not supported by Unicode; we need not digitize it right now.
Mountain11:38, 7 April 2010
 

Why don't you ask Ting Chen on the Board (Board of Trustees or)?

It seems that the wikilink to the Chinese user page doesn't show, but the preview does show it. That's odd.

Goldzahn11:30, 7 April 2010

Thanks for the advice. I have already pinged him.

Mountain11:40, 7 April 2010
 

Speaking only for myself, I think the Foundation would absolutely consider giving a grant to support this kind of content partnership. This would be possible now through WMF Chapter grants (perhaps through Wikimedia HK?), and I know that User:Eloquence would like to eventually open up the grants process beyond Wikimedia Chapters.

Eekim21:06, 7 April 2010

I suggest the grant would have to be quite major, indeed. Say on the order of $50 million to begin such a substantial project, making it a large capital expense.

Collect23:30, 7 April 2010

Given that the WMF's projected operating budget for this year is about $14 million, this will not happen. :-) The whole grants process is for small grants: on the order of thousands, not millions.

But, you raise a good point. The next step, Mountain, would be to actually put together a proposal and pitch it. You should rope Yu-yu into the conversation. The application deadline is coming soon, so now would be a good time to start thinking about it.

Eekim23:35, 7 April 2010

I rather thought so. The key, however, is the issue of "commercial sponsorships." I would suggest that, even before the strategy package is over, WMF, of its own volition and quite speedily, contact Amazon or Google (each of which has vast experience in digitizing books - though with mixed legal results) to ascertain whether either would provide seed money for what are posited to be Chinese books well out of copyright. The amount of money is within either of their budgets, and would likely make worldwide news.

Collect00:04, 8 April 2010

I'd strongly encourage you to read Wikimedia Foundation/Feb 2010 Letter to the Board. I think it's a great idea to pursue these kinds of partnerships, but it's not the role of the Foundation. One of the conversations that didn't happen as successfully as it might have on this wiki is around movement roles: understanding whose role it is to pursue different things. Perhaps this discussion (as well as Michael Snow's recent email to foundation-l) can be a kick-off to have this discussion here.

Eekim17:58, 12 April 2010
 

For example, de:wikisource gets 2000 Euro each year from the German chapter for digitising. Each digitising project costs between 30 and 100 Euro. Source

Goldzahn05:25, 8 April 2010

With a ballpark estimate of 1 million older Chinese works to be digitized - figure on 30 million Euro?

Collect10:11, 8 April 2010

That is not the point. The problem is finding people who will do the work for free. And there is far more work to do than just placing a book on a scanner. I guess that one person can proofread no more than 10 pages a day. There is a second problem. I know that there are maybe 50,000 Chinese characters, but most people know just a few thousand. If you don't know a character, you can't tell whether the scan is correct. That means you need highly skilled people. Those people don't like to work for free, and there are not many of them.

Goldzahn11:39, 8 April 2010

As for the problem of Chinese characters, that is not the case. In fact, with some input methods (for example, the Wubi method), you can input characters by their structure rather than their pronunciation. And Wubi is popular in China.

We are just at the stage of moving from ideas to a concrete plan. I will draft the details of the plan soon, and I don't think it is a big project.

Mountain12:18, 8 April 2010

Just be sure not to underestimate what is involved - I am sure Amazon and Google have a strong interest here.

For Goldzahn - I already raised the issue of older words. The Kangxi dictionary has under 50,000 characters - which likely covers most of those to be found in out-of-copyright material. Apparently people are considered proficient with a knowledge of about 7,000 characters (which combine to form a much larger number of words). I therefore would defer to Mountain that the task is doable.

Collect13:09, 8 April 2010