User talk:JakobVoss

Category Metadata

I am interested in category mapping for WMF data reuse and access. (I am the progenitor of the DynamicPageList extensions, and recently wrote n:en:User:Amgine/Google News Sitemap, a form of companion piece.)

I apologize that I do not read German, nor am I a Wikipedian. I'm not sure how I may be of assistance to you.

I have been in contact with UDC regarding the use of UDC for WMF content:

Dear Amgine,

----- Forwarded message from [redacted] -----

Date: Tue, 24 Nov 2009 19:54:37 -0800

From: Amgine [redacted]

Reply-To: Amgine [redacted]

Subject: UDC Licensure, availability

To: [redacted]

Hi!

I'm working with the Wikimedia Foundation on strategic planning for the future. In particular, I'm working in the area of offline use of WMF content, particularly in regions with limited internet accessibility. One of the elements of this is considering the metadata related to articles as represented by topical categories.

I'm wondering what the copyright status or licensing of the UDC might be. It may be useful in trying to apply a standard to a large collection of articles, but we would not wish to cause any problems or infringements. In what formats or tools can the UDC be accessed?

A licence is not needed for *UDC use* (read more at http://www.udcc.org/licence.htm).

You would not need a licence to use UDC or to apply it to classify any collection. You would simply need access to the UDC schedules, which are available either in print or online (see http://www.udcc.org/bibliography.htm). In principle you can even borrow a book with UDC schedules from a library and use it to index your collection, and that would not cost you anything. Please note that some of the online editions may be accessed for free, and that you can even purchase a licence for UDC MRF use.

You would, however, need a licence if you were publishing or distributing UDC schedules: in print, online or embedded in software.

Should you wish to use the UDC Master Reference File - which is the UDC database - you would also need a licence for this (http://www.udcc.org/licence.htm). This may be of interest should you like to develop some kind of automatic classification tool built on the UDC data. [Emphasis by Amgine]

Reading your query, it appears that the following information may be of interest to you. In October we released over 2000 UDC numbers under a Creative Commons licence: http://www.udcc.org/udcsummary/php/index.php (to see the conditions of use, read 'about').

This set of UDC numbers can be used for non-commercial purposes free of charge. It is presented in such a way that it can be used as a demonstrator of the UDC, and it will have materials attached to it that explain the way the UDC system works.

We are developing this application as we speak, with the following aims:

- to provide this set in as many languages as possible (Dutch will be available in December, German either at the end of December or in January, and other languages will follow as they are completed)

- to link each notation with controlled verbal access (alphabetical index/chain index/thesaurus), which means that there will be many search terms attached to each class to support searching

- to provide mappings to other systems (currently this is mapped to DDC, and we hope to have further mappings available)

- to make exports of this set available for download in the following formats: simple structured text, XML, MARC, SKOS - we shall be working on these in the following months

We would very much like to encourage developers to use this UDC Summary in various kinds of linked data projects, and we would like the UDC Summary eventually to be linked to dbpedia. The UDC Consortium does not have the resources or the necessary XML/RDF expertise to undertake such a project, but we would provide the UDC-specific expertise necessary for this to happen and we would try to help share it with the wider web community.

If such a classification system were to be applied to any WMF project it would probably entail a sizeable education effort to train community members in its use. Does UDC have any materials related to such skills-building?

I'm probably not asking the right questions, so if you can think of any I should have asked could you tell me what they are and their answers?

UDC was originally designed as a classification for articles, so unlike library classifications (such as Dewey, LCC, etc.) it allows different content elements to be combined into one expression, e.g. topic, place, persons, ethnic group, material, time, audience, type of text, language, and form of presentation (analysis, criticism, historical).

Each of these elements can be combined to express what an article is about, and each individual element can be easily decomposed and attached to verbal expressions.

In addition, with UDC you can express relationships between different fields of knowledge, e.g. "application of nanotechnology in construction industry", ethical issues and biotechnology, philosophy and politics, religion and history of art, etc.
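[Note added for illustration, not part of the email.] The decomposition described above can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the function name `decompose` and the regular expressions are hypothetical, the auxiliary indicators (parentheses for place, double quotes for time, `=` for language, `(0...)` for form, `(=...)` for ethnic grouping, colon for relation) reflect my reading of the common UDC notation, and the sample numbers are illustrative and should be checked against the schedules.

```python
import re

# Assumed indicators for common auxiliaries; verify against the UDC schedules.
# Order matters: "(0...)" and "(=...)" must be tried before the general "(...)" place pattern,
# and "(=...)" before the bare "=..." language pattern.
AUXILIARY_PATTERNS = [
    ("form", re.compile(r"\(0[^)]*\)")),
    ("ethnic_grouping", re.compile(r"\(=[^)]*\)")),
    ("place", re.compile(r"\([1-9][^)]*\)")),
    ("time", re.compile(r'"[^"]*"')),
    ("language", re.compile(r"=[0-9.]+")),
]


def decompose(expression):
    """Split a UDC expression into related components and their elements.

    Returns one dict per colon-separated component. Each dict holds the
    auxiliaries that were found plus whatever remains as the "main" number.
    This is a simplification: it ignores "+", "/", "::" and special auxiliaries.
    """
    components = []
    for component in expression.split(":"):
        elements = {}
        rest = component
        for name, pattern in AUXILIARY_PATTERNS:
            found = pattern.findall(rest)
            if found:
                elements[name] = found
                rest = pattern.sub("", rest)
        elements["main"] = rest
        components.append(elements)
    return components


# Illustrative expressions (numbers assumed, not taken from the email):
print(decompose("1:32"))             # two related main numbers
print(decompose('94(44)"17"=111'))   # main number with place, time and language
```

Each element returned this way could then be attached to its verbal expression, which is the point made in the email.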

Learning the system itself usually comprises the following:

- being informed about vocabulary content and structure - the number and specificity of concepts available, and where to find them

- being informed about syntax i.e. how to express a complex document content using UDC

There are many books on UDC, the scheme is also taught in library schools, and there may already be some workshops organized in your region. The UDC Consortium can also organize a 1-2 day workshop at your site. For the first such 1-day workshop you would only need to cover the travel costs of a UDC Consortium trainer. You may also choose to arrange some kind of cascade training, based on these initial workshops, that would be specific to your collection. Typically UDC is very logical, and once the first basic rules are grasped there won't be any need for further training.

But if I understand correctly, within Wikimedia you would be dealing with a large community of non-professionals who would need to be assisted in indexing. If this is so, then the best way is the following:

Creating a subject authority file - i.e. preparing an appropriate tool to support indexing of your collection that you would populate with UDC numbers. This is the most efficient way of ensuring consistency, accuracy and speed in indexing. In addition to assisting indexing, a subject authority file is also a source of terms for searching and browsing. You may initially populate this authority file with pre-prepared UDC terminology from the UDC Summary or UDC MRF, and you can gradually expand it as you go. Eventually, if you decide to implement automatic indexing or machine-assisted indexing, such an authority file would come in very handy.

A central vocabulary management tool is good practice in metadata management, irrespective of the thesaurus or classification that may be used. An authority file containing all classification numbers that are used and the search terms associated with them, together with the middleware necessary to link it to the metadata tools, can be designed in different ways. Ideally (but depending on how sophisticated the interface would be), any person indexing articles would type a word and the system would come back offering suitable expressions, and the person would be prompted to choose further elements: time, place, person, material, other discipline. The person may see only words, and the system would automatically translate these into numbers. With some programming effort an authority file can be turned into a UDC number builder, which would produce regular UDC expressions. All numbers in UDC can be combined with any other numbers, and algorithms for parsing numbers already exist. In any case, further consultancy support can be arranged through the UDC Consortium.
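[Note added for illustration, not part of the email.] The workflow described above (type a word, get suggestions, let the system translate the chosen words into numbers) might look roughly like the following Python sketch. Everything here is an assumption made for the example: the `AUTHORITY_FILE` contents, the function names `suggest` and `build_expression`, and the rule of joining main numbers with a colon and appending auxiliaries in the order chosen. A real builder would follow the citation order and syntax rules in the UDC schedules.

```python
# Hypothetical, tiny authority file: search term -> (kind of element, UDC notation).
# The notations are illustrative and should be checked against the UDC Summary.
AUTHORITY_FILE = {
    "philosophy": ("main", "1"),
    "politics": ("main", "32"),
    "history": ("main", "94"),
    "france": ("place", "(44)"),
    "english": ("language", "=111"),
}


def suggest(prefix):
    """Return the authority-file terms that start with what the indexer typed."""
    prefix = prefix.lower()
    return sorted(term for term in AUTHORITY_FILE if term.startswith(prefix))


def build_expression(terms):
    """Translate chosen terms into one UDC-style expression.

    Simplification: main numbers are related with ":" and all auxiliaries
    are appended at the end, in the order the indexer chose them.
    Unknown terms raise KeyError so that nothing is silently invented.
    """
    mains, auxiliaries = [], []
    for term in terms:
        kind, notation = AUTHORITY_FILE[term.lower()]
        if kind == "main":
            mains.append(notation)
        else:
            auxiliaries.append(notation)
    return ":".join(mains) + "".join(auxiliaries)


print(suggest("p"))                                       # terms offered to the indexer
print(build_expression(["philosophy", "politics"]))       # related subjects
print(build_expression(["history", "France", "English"])) # subject with place and language
```

The indexer only ever sees the words; the notation stays behind the interface, which is why the same file can also serve as a source of search terms.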

I hope you got the answers you needed. If you need any further clarification, please do not hesitate to contact us.

We look forward to hearing from you.

Kind regards

Aida Slavic, PhD

UDC Associate Editor

UDC Consortium

Email: [redacted]