Proposal:Structured Data
Every proposal should be tied to one of the strategic priorities below.
- Achieve continued growth in readership
- Focus on quality content
- Increase Participation
- Stabilize and improve the infrastructure
- Encourage Innovation
It has been suggested that this page be merged with the following proposals:
- Proposal:Data.wikimedia.org
- Proposal:Global templates
- Proposal:A central repository of all language independent data
- Proposal:Alignment
- Proposal:Base de donnée interlangue
- Proposal:To create standard basic table template accross Wiki
- Proposal:A 'common knowledge' database - like 'Cyc'
- Proposal:Building a database of all books ever published
- Proposal:Templates.wikimedia.org
- Proposal:Unification
- Proposal:Data-driven content
- Proposal:Data Driven Journalism
- See also: meta:Wikidata (2)
Summary
Structure the data in Wikipedia entries so that it can be queried[1], for example to find "all European authors between 1910 and 1925" or "all A/V products released by Sony from 2001 to 2003".
Semantics is the key to Web 3.0. Do we want it? Do we want it inside the articles? A lot of software scans the web for information and tries to interpret it in order to extract knowledge. We can do better: we can embed the semantics in the information itself.
Proposal
This is not a technical proposal. Today we have some extensions, like Semantic MediaWiki, and they are quite mature. But the question is: do we want semantics in Wikipedia?
Add standard "field=value" data to all articles. Both the field names and the value formats must be standardized so that the data can fit into an SQL database. For example:
Name=Johann Schiller
FullName=Johann Christoph Friedrich von Schiller
Birthdate=1759-11-10
Deathdate=1805-05-09 (Note: ANSI date)
Sex=M
Type=Person
Profession=Poet|Dramatist
Nationality=DE (Note: ISO 2-letter country code)
PlaceOfBirth=Marbach, DE
etc.
The field names should be standardized across all languages. The values should also be standardized across languages as far as possible, e.g. by using ISO standard codes, so that users of any language can query for "country=Deutschland" or "country=Germany", whatever wiki they are in.
The data fields should be stored, like images, in a language-independent namespace, as they should transcend language as much as possible.
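The field=value scheme can be illustrated with a short Python sketch. It is only a rough illustration under stated assumptions: the `parse_record` and `find_by_country` helpers, the `person` table layout, and the `COUNTRY_NAMES` lookup are all hypothetical and do not correspond to any existing MediaWiki extension.

```python
import sqlite3

# Hypothetical mapping from localized country names to ISO 2-letter codes.
COUNTRY_NAMES = {"Germany": "DE", "Deutschland": "DE", "Allemagne": "DE"}

RECORD = """\
Name=Johann Schiller
FullName=Johann Christoph Friedrich von Schiller
Birthdate=1759-11-10
Deathdate=1805-05-09
Sex=M
Type=Person
Profession=Poet|Dramatist
Nationality=DE
PlaceOfBirth=Marbach, DE
"""

def parse_record(text):
    """Turn 'Field=value' lines into a dict of strings."""
    record = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        field, value = line.split("=", 1)
        record[field.strip()] = value.strip()
    return record

def find_by_country(conn, country_name):
    """Resolve a localized country name to its code, then query by code."""
    code = COUNTRY_NAMES[country_name]
    rows = conn.execute(
        "SELECT name FROM person WHERE nationality = ? ORDER BY name", (code,)
    )
    return [name for (name,) in rows]

# Assumed table layout: one row per person, dates stored as ISO-style text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (name TEXT, birthdate TEXT, deathdate TEXT, nationality TEXT)"
)
rec = parse_record(RECORD)
conn.execute(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    (rec["Name"], rec["Birthdate"], rec["Deathdate"], rec["Nationality"]),
)

print(find_by_country(conn, "Deutschland"))
print(find_by_country(conn, "Germany"))
```

Because the stored value is the language-neutral code, both queries return the same record regardless of which wiki's language the query is written in.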
Motivation
To make it easier to extract similar data from multiple Wikipedia entries. To look for strings in specific contexts.
Try to answer these questions: "Tell me the 10 highest mountains in Italy", "Who was the president of the USA in 1950?", "Tell me the highest city in Argentina with a population greater than 1000 inhabitants", "Tell me the first Afro-American woman coming from Europe who was Olympic champion in the 100 m", ... With semantics we can do this. For example, look at wikipedia:Wolfram Alpha: 10 highest mountains in Italy, USA president in 1950. You can also answer these questions with Wikipedia, because the information is there, but it is very difficult! Today we have many lists that can help, for example wikipedia:List_of_Presidents_of_the_United_States, but they are not the right tool. With semantics inside Wikipedia's articles, such questions could be answered easily.
Tools like Wolfram Alpha have a very large database whose data are linked by semantic relations. Wikipedia has a lot of information, but no semantics.
Semantics could also be tied to the creation of a shared external database.
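To show why structured fields help with a question such as "Who was the president of the USA in 1950?", here is a minimal Python sketch. The record layout (`name`, `start`, `end`) and the `in_office_during` helper are assumptions made for illustration, not an existing Wikipedia data format.

```python
from datetime import date

# Structured term-of-office records; the field names are hypothetical.
PRESIDENTS = [
    {"name": "Franklin D. Roosevelt", "start": date(1933, 3, 4), "end": date(1945, 4, 12)},
    {"name": "Harry S. Truman", "start": date(1945, 4, 12), "end": date(1953, 1, 20)},
    {"name": "Dwight D. Eisenhower", "start": date(1953, 1, 20), "end": date(1961, 1, 20)},
]

def in_office_during(records, year):
    """Return the names whose term overlaps any day of the given year."""
    first, last = date(year, 1, 1), date(year, 12, 31)
    return [r["name"] for r in records if r["start"] <= last and r["end"] >= first]

print(in_office_during(PRESIDENTS, 1950))
print(in_office_during(PRESIDENTS, 1945))
```

With a plain list article a reader has to scan the dates by hand; with typed start and end fields the same question becomes a one-line filter.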
Key Questions
- Do we want semantics, or do we simply want a free encyclopedia?
- How should semantics be introduced? Inside the article? For example, something like:
'''Mr X''' was born in [[semantic:born_year:1950]] in [[semantic:born_location:California]] and he's the husband of [[semantic:husband_of:Mrs Y]].
OK, this is quite complicated for a newbie, but we could also use templates to add semantics.
- Just how far do you take this? Who would volunteer for the very dull task of tidying the existing data? Perhaps it could be partially automated by programs created by volunteers.
- Do we want an external database? This would mean that we could write something like:
'''Proton''' is a [[semantic:be:particle]] with mass [[semantic:particle_mass]]
and the software goes to the database, looks for the table called "particle", finds the entry called "proton" inside it, and returns its "particle_mass". The advantages are:
- a database shared across all the projects
- up-to-date data
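The two annotation examples above can be sketched in Python as follows. This is a hedged illustration only: the `[[semantic:...]]` syntax is the one proposed on this page, not an existing MediaWiki feature, and the `SHARED_DB` structure, the helper names, the rule that the `be` property names the table, and the approximate mass value are all assumptions.

```python
import re

# Matches [[semantic:property]] and [[semantic:property:value]].
ANNOTATION = re.compile(r"\[\[semantic:([^:\]]+)(?::([^\]]+))?\]\]")
# The article subject is assumed to be the first bold ('''...''') phrase.
SUBJECT = re.compile(r"'''(.+?)'''")

# Hypothetical shared database: table -> entry -> field (approximate value).
SHARED_DB = {"particle": {"proton": {"particle_mass": "1.6726e-27 kg"}}}

def extract_facts(wikitext):
    """Return (subject, property, value) triples; value is None when the
    annotation names only a property to be looked up elsewhere."""
    subject = SUBJECT.search(wikitext).group(1)
    return [
        (subject, prop, value or None)
        for prop, value in ANNOTATION.findall(wikitext)
    ]

def render(wikitext, db=SHARED_DB):
    """Replace annotations with plain text, filling value-less annotations
    from the shared database."""
    subject = SUBJECT.search(wikitext).group(1).lower()
    table = None

    def substitute(match):
        nonlocal table
        prop, value = match.group(1), match.group(2)
        if value is not None:
            if prop == "be":  # assumption: 'be' names the table to use
                table = value
            return value
        return db[table][subject][prop]

    return ANNOTATION.sub(substitute, wikitext)

person = ("'''Mr X''' was born in [[semantic:born_year:1950]] in "
          "[[semantic:born_location:California]] and he's the husband of "
          "[[semantic:husband_of:Mrs Y]].")
proton = "'''Proton''' is a [[semantic:be:particle]] with mass [[semantic:particle_mass]]"

print(extract_facts(person))
print(render(proton))
```

Updating the single entry in the shared table would change the rendered value in every article and project that references it, which is the advantage listed above.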
Potential Costs
People might be less willing to provide information if they are forced to format it in a structured form. Semantics is not very easy for newbies, but I think that we can't continue without it. Semantics will be the key concept of the new web, and we need to evolve. Today there are tools, like wikipedia:Google_Squared, that scan the web (and mainly Wikipedia) to extract semantics. Look here: highest mountain in Italy. Google does not hold the information itself, but it can get it from the web (and from Wikipedia). Google's software needs to understand the information; it has to extract the semantics from it. We can do better: we can insert the semantics inside the Wikipedia articles.
References
Community Discussion
Do you have a thought about this proposal? A suggestion? Discuss this proposal by going to Proposal talk:Structured Data.
Want to work on this proposal?
- Kozuch 11:32, 24 November 2010 (UTC)