Proposal talk:A 'common knowledge' database - like 'Cyc'

From Strategic Planning

Auto-Ontology Classifier Required

There is a type of software that helps to recognise common sections of text, common phraseology and so on. It's typically used by Intelligence Agencies and in the Defense Industry to help de-duplicate documents that are effectively discussing the same topic (but using different words or even different languages).

Typically, these auto-ontology classifier "assistants" cannot actually say to you, "These two bits of text are both discussing dogs salivating over a bone, and thus belong in specific category 'XYZABC'". They can only say, "These two bits of text have the same sentence structure and use similar words", and thus help to draw the human reader's attention to the parallels.

Drawing the attention of a human to such similarities is the first step in identifying the knowledge behind the words. Once a human has created the category containing the two (similar) sentences, thus linking them together (with an ontology classification e.g. "dogs and their bones"), the software can then further assist by later identifying a third, a fourth phrase or paragraph and so on. In this way, the process of creating "common knowledge" is semi-automated.

It would seem to me that this proposal could do with access to such software. If this proposal is given serious consideration, contact me for further information and I can be of assistance in this regard. Lkcl 21:06, 11 October 2009 (UTC)
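
To make the "draw the human's attention to similar text" step above concrete, here is a minimal sketch. It is not the software described in this section: it simply assumes word overlap (Jaccard similarity) as the measure, and the function names and the 0.3 threshold are made up for illustration.

```python
import re
from itertools import combinations


def tokens(text):
    """Return the set of lower-cased words in a snippet."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Fraction of distinct words that two snippets share (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def suggest_pairs(snippets, threshold=0.3):
    """List index pairs that look similar enough to show a human reviewer."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(snippets), 2)
        if jaccard(a, b) >= threshold
    ]


snippets = [
    "The dog salivated over the bone",
    "A dog was salivating over a bone",
    "Paris is the capital of France",
]
```

Calling suggest_pairs(snippets) flags only the two dog sentences. The program still has no idea that they are about "dogs and their bones"; naming that category remains the human's job, as described above.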

Incorporating existing Wikipedia information base

I really like this proposal, though does it seem possible to incorporate all the information already present on Wikipedia, and in other Wikimedia projects? If possible, this would seriously jumpstart this idea. 8bit 11:48, 14 August 2009 (UTC)

See also http://freebase.com/ for a commercial service which tries to automatically summarize the linkages in Wikipedia, among other sources. 99.25.114.234 18:20, 14 August 2009 (UTC)

The problem is that the proposed database would require a formal description language. You wouldn't be able to simply enter a big chunk of English prose and expect it to work. If you could do that then you wouldn't need this proposal because you could just push all of Wikipedia into your software - problem solved! So, in general, it requires a human to parse sentences like "Bill Clinton was once president of the United States" and turn them into things like the 'Cyc' sentence:


(#$isa #$BillClinton #$UnitedStatesPresident)
If this project were to take off, and assuming that we could hide the ugly underlying syntax with some kind of a friendly GUI interface, then I'd hope that people would enthusiastically read Wikipedia articles and other data sources - and convert them into this formal structure. The consequences of this formality would be that software could automatically answer questions, find unexpected connections, highlight contradictions and so forth. We could have answers to questions which would be almost impossible to answer by consulting the encyclopedia in conventional ways. ("What cartoon characters have something in common with Bill Clinton?"..."Ned Flanders and Bill Clinton are both left-handed, Simba (The Lion King) and Bill Clinton were both heads of state, ...")
This is not about simply connecting pages (we already do that with links and categories) - it's about converting the vast store of knowledge that is Wikipedia into a form that computers can understand and reason about.
SteveBaker 01:53, 15 August 2009 (UTC)
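
As a rough illustration of the "what do they have in common" query described above, here is a toy sketch in Python. It is not Cyc and does no inference: the facts are plain (subject, predicate, object) triples, the predicate names are invented for the example, and "head of state" is asserted directly, whereas a real knowledge base would presumably derive it from other facts.

```python
# Toy fact store: each fact is a (subject, predicate, object) triple.
facts = {
    ("BillClinton", "isa", "UnitedStatesPresident"),
    ("BillClinton", "handedness", "left"),
    ("BillClinton", "role", "HeadOfState"),
    ("NedFlanders", "isa", "CartoonCharacter"),
    ("NedFlanders", "handedness", "left"),
    ("Simba", "isa", "CartoonCharacter"),
    ("Simba", "role", "HeadOfState"),
}


def properties(subject):
    """All (predicate, object) pairs asserted about a subject."""
    return {(p, o) for s, p, o in facts if s == subject}


def in_common(subject, category):
    """Map each member of `category` to the properties it shares with `subject`."""
    target = properties(subject)
    result = {}
    for s, p, o in facts:
        if p == "isa" and o == category and s != subject:
            shared = properties(s) & target
            if shared:
                result[s] = shared
    return result
```

Here in_common("BillClinton", "CartoonCharacter") pairs NedFlanders with the left-handedness fact and Simba with the head-of-state fact, which is the kind of answer sketched in the comment above.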
Check out Semantic MediaWiki. It is an extension that lets that kind of structure be added in the wikitext of an article. Installing it on the Wikimedia projects would be really good, but I think there is still some work to be done getting it ready for such a high profile use. --Tango 02:55, 15 August 2009 (UTC)

This is what DBpedia has been doing for the past two years with substantial effort (10+ developer years). It's part of a Linked Data effort that connects knowledge from a multitude of domains. There are many researchers and companies working in the field to make more knowledge open and interconnected, so it may be a very good idea to hook into these projects. I'm glad to help out in that regard. --Beckr 10:18, 30 September 2009 (UTC)

Regarding the Key question on how Cyc and Wikipedia are linked - (Open)Cyc did the interlinking and provides links to Wikipedia and DBpedia from the relevant concept's pages in both human- and machine-readable fashion, e.g. http://sw.opencyc.org/2009/04/07/concept/en/Game. More details on the links are available from http://wiki.dbpedia.org/OpenCyc.

Related to this proposal

I would like all "data" in Wikipedia to be stored in a real database. I mean data like the population of towns. Why do we have to update it on each Wikipedia? We should be able to store it somewhere, like Commons for the images, and call it from there, with a syntax like data:France:Paris:Population:Last... When it's updated, it is updated on all wikis. The same goes for a lot of other information. 80.125.172.60 09:02, 15 August 2009 (UTC)
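
A minimal sketch of how such a shared reference might be resolved, assuming one central store keyed by the path in the reference. The store layout, the resolve function and the numbers are all hypothetical placeholders (not real census figures); only the data:France:Paris:Population:Last shape comes from the comment above.

```python
# Hypothetical central store: path -> list of (year, value) observations.
# The values are placeholders, not real population data.
central_store = {
    ("France", "Paris", "Population"): [(2000, 100), (2005, 120)],
}


def resolve(reference):
    """Resolve a 'data:Country:City:Field:Selector' style reference."""
    scheme, *path, selector = reference.split(":")
    if scheme != "data":
        raise ValueError(f"not a data reference: {reference!r}")
    series = central_store[tuple(path)]
    if selector == "Last":
        # Tuples sort by year first, so max() picks the latest observation.
        return max(series)[1]
    raise ValueError(f"unknown selector: {selector!r}")
```

Every wiki that embeds data:France:Paris:Population:Last would then show 120, and adding a newer observation to the central store would update all of them at once.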


That would certainly be a beneficial 'fall-out' of this proposal. It's easy to query a Cyc database for the population of a particular city - or to extract a table of city populations. But doing the (lesser) thing that you suggest is not without difficulties. While it's certainly possible to create database-like representations of the population, lat/long coordinate, parent country, etc for every city - and it's possible to create a database of the atomic number, melting point, color, etc for every chemical element - and the first broadcast date of every Simpsons episode - what you have at the end are tens of thousands of little databases. Does the database of broadcast dates of the Simpsons episodes work with the database of broadcast dates of other shows? Can you use that database to find out which show was the first one with Ned Flanders in it? Even finding the database you want would be a tough thing in many cases. Using a formal language like Cyc allows for these easily tabulated facts to be stored in the same 'database' as things like "Fire is hot" and "Hitler was a man". SteveBaker 12:55, 21 August 2009 (UTC)
You said that it would be difficult to find the right database, but surely all that is required is a kind of contents database, or sensible filing structure. Of course rules would have to be set up and decisions made, but as long as it is done logically and consistently, and the 'AI' suggested is informed of the locations of relevant databases at the end of an entry, then there should be no problems with the suggestion. Eddy 1000 11:36, 26 August 2009 (UTC)
Rather than an all or nothing “common knowledge” database from scratch, how about a relational database, incrementally grown over time, with a postponed option to convert it to a “common knowledge” database in the future? The choice of keys and values could be geared toward an ontology of knowledge. Keys in the relational database could hold values of the common data mentioned above, like population.
Related categories of information would be contained in the same database: people, geographical data, events, social-political-economic-religious-scientific-artistic movements, etc. One database that contains information on people, including Hitler, would then hold values for professions, nationalities, political and religious affiliations, dates of birth and death, etc. A user could then query for a list of contemporaries of Mozart, for instance, or his Chinese composer contemporaries. As a separate database, this would not directly impact Wikipedia access speeds. However, depending on what the user is looking for, a relational database lets them locate relevant information on a desired subject faster than surfing the web or running content searches.
In a relational database query, a user could select implemented keys as they become available. Dates could be in a range or relative to contemporary people or events. My personal interest is in comparative analysis. A capitalist is to an economy as a monarch is to a country. On a scale running from tyranny to anarchy, democracy is in the middle, capitalism and socialism are both near tyranny but cooperatives are near democracy. That type of query would require a key with values running from tyranny to anarchy for all man made organizations, which is an ontological classification.
For “Fire is hot” a temperature scale would be a key. Hot is relative, but paper burns at Fahrenheit 451. “Inhuman” does not mean “not human.” “Man’s inhumanity to man,” is a rule not an exception. For a computer to confuse the two it would have to be incorrectly programmed to make that mistake. --Gistmass 00:14, 29 September 2009 (UTC)
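
For what it's worth, the "contemporaries of Mozart" query mentioned above could look roughly like this in a relational database. This is only a sketch using Python's built-in sqlite3 module; the table layout and column names are assumptions, and "contemporary" is taken to mean simply that two lifespans overlap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "name TEXT PRIMARY KEY, profession TEXT, born INTEGER, died INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Wolfgang Amadeus Mozart", "composer", 1756, 1791),
        ("Joseph Haydn", "composer", 1732, 1809),
        ("Ludwig van Beethoven", "composer", 1770, 1827),
        ("Johann Sebastian Bach", "composer", 1685, 1750),
        ("Immanuel Kant", "philosopher", 1724, 1804),
    ],
)


def contemporaries(name, profession):
    """People of a given profession whose lifespan overlaps that of `name`."""
    rows = conn.execute(
        """
        SELECT other.name
        FROM people AS other
        JOIN people AS target ON target.name = ?
        WHERE other.name != target.name
          AND other.profession = ?
          AND other.born <= target.died
          AND other.died >= target.born
        ORDER BY other.name
        """,
        (name, profession),
    ).fetchall()
    return [row[0] for row in rows]
```

With these rows, contemporaries("Wolfgang Amadeus Mozart", "composer") returns Haydn and Beethoven but not Bach, who died before Mozart was born. A nationality column would allow the "Chinese composer contemporaries" variant in the same way.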

Proposals for new projects

For those in support, please have a look at m:proposals for new projects, for current policy, procedure, and organization of processing proposals for new projects. Dedalus 14:04, 15 August 2009 (UTC)

A very simple and useless form of Wikipedia

The information "Fire is hot" and "Hitler was a man" can also be read in the Wikipedia articles. What's the use of having a Wikipedia with a lot of articles that each hold a single piece of information? Your "common knowledge" has a better place in Wikipedia. --90.146.217.210 14:05, 17 August 2009 (UTC)

Tell me: Where exactly does Wikipedia say "Fire is hot"? I just looked through a bunch of likely articles - and I can't find that anywhere in the encyclopedia, and I bet you can't either. I can read from our fire article that "Fire is the rapid oxidation of a combustible material releasing heat,..." - but perhaps if it releases heat, the fire itself is cold? After all, my refrigerator releases heat too - and you don't describe refrigerators as "hot". You and I know what the article means because we already have this 'database' of common sense built into our heads - but the article doesn't actually say that.
You also have to infer that Hitler was a man by noting that we use the personal pronoun "He" in the second sentence of the summary - or by looking at the photo. At no point in the article do we actually say "The object known as 'Hitler' was a male of the species homo sapiens".
But the point of this proposal is not to add a bunch of bloody obvious statements for people to read that would make the Simple English Wikipedia sound complicated! The point is to make Wikipedia understandable by computer programs. Let me emphasise that: 'understandable'...not just 'readable'. A body of knowledge that can be operated on using formal symbolic manipulation inside a computer. That means making statements like "Fire is hot" absolutely explicit - not just something that a human can theoretically infer by careful parsing of the text. Besides, there are a large number of 'common-sense' things that we don't even mention. Was Hitler a human? That is nowhere stated in the article - and I can't find anywhere in Wikipedia where it's even hinted that he might have been. In fact, we could probably find a statement that Hitler was inhuman - which would definitely give our computer reader reason to suspect that he wasn't human! As far as the computer is concerned, Hitler could be any object to which the male personal pronoun could be applied. A pet goldfish perhaps? It's worse still for females because we call ships by the female personal pronoun - the second sentence of the RMS Titanic article says "For her time, she was..." - perhaps leading our theoretical computer program to deduce that Titanic was a woman. Worse still, we have articles such as Godwin's law that mention Hitler in the most obscure context that could result in considerable confusion to our computerized reader.
The intent of this proposal is to gradually transform statements in nice, flowery human-readable English (and in all of the other languages we support) into statements of (hopefully, formally referenceable) fact in a formal language that can be operated on symbolically. This would allow all sorts of clever applications to be created. For example: Do the same formal facts exist in both the English and Chinese versions of Wikipedia? That would be one way to do formal consistency checking between languages - and thereby detect culturally-induced errors in the encyclopedia. If the Outer Mongolian Wiki says "Adolf Hitler is an Aardvark living in a zoo in Kabul" - of which no mention is made in English Wikipedia - then we can go and look to see whether this little known fact is missing from English Wiki - or whether the Outer Mongolian Wiki has some horrific misunderstanding of the history of the second world war!
216.136.51.242 12:23, 21 August 2009 (UTC)
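
The cross-language consistency check described above boils down to comparing two sets of formal facts. A minimal sketch, assuming each edition's facts have already been converted to (subject, predicate, object) triples with shared identifiers, which is of course the hard part; the function and variable names are made up for illustration.

```python
def compare_editions(facts_a, facts_b):
    """Split two editions' fact sets into shared and edition-specific facts."""
    return {
        "shared": facts_a & facts_b,
        "only_in_a": facts_a - facts_b,
        "only_in_b": facts_b - facts_a,
    }


english = {
    ("AdolfHitler", "isa", "Human"),
    ("Fire", "hasProperty", "Hot"),
}
other_wiki = {
    ("AdolfHitler", "isa", "Aardvark"),
    ("Fire", "hasProperty", "Hot"),
}

report = compare_editions(english, other_wiki)
```

The "only_in_b" entry then contains the aardvark claim, which a human can review to decide whether it is a missing fact or an error. The comparison itself cannot tell the two cases apart.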

Searching for articles by typing a question

Isn't that something that belongs to en:Expert system? And I remember that someone once tried to create an expert system with common knowledge, because in those days they learned that common knowledge is more complex than expert knowledge. Well, a few years ago Brockhaus Enzyklopädie digital provided a new search function which could answer questions like "Why is the sky blue?" by linking to an article. I did some googling and I think it was this. To get an article not by typing the lemma, but by typing your question, is wonderful. I think they did this by providing a knowledge database. Well, I never saw that digital Enzyklopädie. --Goldzahn 13:11, 21 August 2009 (UTC)

Computer asking questions back

Once this idea is established there would obviously be a long period whilst it doesn't have all the relevant information it needs to answer many of the questions it is asked (this may last as long as it does!). It would therefore be necessary for the AI to be able to say 'I don't know' and to have a page where it can ask people to submit answers it really doesn't know, and to confirm things it may have extracted from a source to answer a question. As is said in the section 'A very simple and useless form of Wikipedia', it might think that the Titanic was female, and then someone would be able to correct it. It may also be possible for the person who asked the question to give it the answers themselves to enable it to finish their question, which would also speed up the entering of data. Eddy 1000 11:53, 26 August 2009 (UTC)

This sounds similar to the page you receive on Wikipedia when you enter in a value for a non-existing page. This would work if you asked something like, "What color is the sky", in which the database would ask you to add a color field, or if you asked "How are George Bush and King George similar?" and the database doesn't have an entry for one or the other, however, I'm not sure how that would work for a question like "Why is the sky the color that it is?", or a question which makes an incorrect assumption, such as "Why is the sky red?" 8bit 14:47, 29 September 2009 (UTC)
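
A small sketch of the two behaviours discussed in this section: saying "I don't know" while logging the gap for editors to fill, and spotting a question whose premise contradicts a stored fact. It only handles simple attribute lookups and has nothing to say about "why" questions; all names are hypothetical.

```python
known_facts = {("sky", "color"): "blue"}
open_questions = []  # gaps that editors could be asked to fill


def ask(subject, attribute):
    """Return the stored value, or None after logging the gap."""
    key = (subject, attribute)
    if key in known_facts:
        return known_facts[key]
    if key not in open_questions:
        open_questions.append(key)
    return None


def check_premise(subject, attribute, assumed):
    """Classify a question's built-in assumption against the stored facts."""
    known = ask(subject, attribute)
    if known is None:
        return "unknown"
    return "consistent" if known == assumed else "contradicted"
```

With this store, the premise of "Why is the sky red?" comes back as "contradicted", while a question about an unrecorded attribute comes back as "unknown" and is queued in open_questions.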

Impact?

Some proposals will have massive impact on end-users, including non-editors. Some will have minimal impact. What will be the impact of this proposal on our end-users? -- Philippe 00:04, 3 September 2009 (UTC)

The impact of something like this would seem immense, if done correctly, and if it became as expansive as, say, English Wikipedia. A tool like this could be used to directly answer an individual's question, but could also be used as a base for common knowledge for other machines to operate on intelligently. This could also be used to keep articles up to date, and to check for inconsistencies, so not only is it interesting from a new product point of view, but it can also potentially improve Wikipedia itself. 8bit 14:40, 29 September 2009 (UTC)