Proposal talk:A MediaWiki Parser in C

From Strategic Planning

MediaWiki parser could be developed rather easily

A MediaWiki parser could be developed rather easily using automatic parser generators like Bison. It would also be interesting to have a Java parser (JavaCC could be used). If there is enough interest, such parsers can be developed, with some official storage for them provided within the Wikipedia framework. Audriusa 10:17, 16 August 2009 (UTC)

Thanks, Audriusa. In fact, if you check the history of the mailing lists, you will find that there has been a lot of discussion. It is not so easy, and might be difficult. MediaWiki markup itself is not context-free, which means a single parsing pass is not enough. So far I am not very familiar with the difficulties, but I want to give it a try. Writing a toy parser is easy, but writing a parser for real use is difficult. Many people have tried, but most of them gave up. --Mountain 12:59, 16 August 2009 (UTC)
I have made a few attempts, and it is pretty tricky, since the markup is partly character-oriented and partly line-oriented. Another problem is using Unicode with C, which requires <wchar.h>, and that combined with <stdio.h> is stateful. (I cannot provide any code since my experiment failed and I threw it away.) I think the simplest way would be to find the relevant PHP piece in the MediaWiki software (Free!! Free download! Open readable code!! Yippee!!) and translate it to C, in order to get a parser that behaves exactly like the MediaWiki parser. Rursus 06:00, 18 August 2009 (UTC)
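To illustrate the mix of line-oriented and character-oriented markup mentioned above, here is a minimal C sketch. The names classify_line and count_bold_markers and the line_kind enum are hypothetical and not taken from MediaWiki, and the rules are heavily simplified: only == headings, * bullets and ''' bold markers are recognised.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical block kinds, for illustration only. */
typedef enum { LINE_PARAGRAPH, LINE_HEADING, LINE_BULLET } line_kind;

/* Line-oriented step: the start (and end) of a whole line decides
 * which kind of block it is. Assumption: a heading is any line that
 * begins and ends with "==" and has at least one character between. */
line_kind classify_line(const char *line) {
    assert(line != NULL);
    size_t n = strlen(line);
    if (line[0] == '*') {
        return LINE_BULLET;
    }
    if (n >= 5 && line[0] == '=' && line[1] == '=' &&
        line[n - 2] == '=' && line[n - 1] == '=') {
        return LINE_HEADING;
    }
    return LINE_PARAGRAPH;
}

/* Character-oriented step: scan inside a line and count ''' markers.
 * Simplification: longer apostrophe runs (e.g. five apostrophes for
 * bold plus italic) are not treated specially here. */
size_t count_bold_markers(const char *text) {
    assert(text != NULL);
    size_t count = 0;
    const char *p = text;
    while ((p = strstr(p, "'''")) != NULL) {
        count++;
        p += 3; /* skip past the marker just found */
    }
    return count;
}
```

Both functions work on plain bytes. Because the markers used here are ASCII, UTF-8 text passes through untouched, which sidesteps the wide-character question for this small case only.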
Thanks, Rursus. I will write your two approaches into the proposal. --Mountain 08:19, 19 August 2009 (UTC)
Could somebody provide a link to the relevant part of the MediaWiki software where the current parser code can be found? I would love to see it. Thanks. --Marozols 01:39, 20 August 2009 (UTC)

Better idea than it looks like

Simply writing a parser in C might sound like a very specialised idea, but it could promote the future production of software adapted to Wikipedia browsing and editing. The parser should be a library (.so, .a or .DLL, as relevant to the respective OS; note that .a is a static rather than a dynamic library) and provide Wikimarkup <--> XML or HTML conversion. Everything integrates with C, so C++ and Ada programmers, at least, will profit from a C version. It is not easy to implement everything supported by MediaWiki, so I cannot rate the feasibility as high. Rursus 11:27, 18 August 2009 (UTC)
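As a rough idea of what the C interface of such a library might look like, here is a minimal sketch. The function name wikic_line_to_html is hypothetical, and it converts only a single line: a "== Heading ==" line becomes an h2 element and anything else becomes a paragraph. There is no HTML escaping, no inline markup and no template handling, so this is nowhere near the real MediaWiki behaviour.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical entry point: convert one line of wiki markup to HTML.
 * Like snprintf, it returns the number of bytes the full result needs
 * (excluding the terminating NUL), so callers can detect truncation.
 * Assumption: only level-2 headings and plain paragraphs exist. */
int wikic_line_to_html(const char *line, char *out, size_t out_size) {
    assert(line != NULL);
    assert(out != NULL || out_size == 0);

    size_t n = strlen(line);
    if (n >= 5 && strncmp(line, "==", 2) == 0 &&
        strcmp(line + n - 2, "==") == 0) {
        /* Strip the surrounding "==" and any spaces next to them. */
        const char *start = line + 2;
        const char *end = line + n - 2;
        while (start < end && *start == ' ') {
            start++;
        }
        while (end > start && end[-1] == ' ') {
            end--;
        }
        return snprintf(out, out_size, "<h2>%.*s</h2>",
                        (int)(end - start), start);
    }
    return snprintf(out, out_size, "<p>%s</p>", line);
}
```

A plain C signature with a caller-supplied buffer is one possible choice here because it is easy to call from C++, Ada or other languages with a C foreign-function interface; a real library would more likely work on whole documents rather than single lines.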

Agreed, that is why I said it is a fundamental enabling project. --Mountain 09:09, 19 August 2009 (UTC)
A wiki-to-XML parser written in C++ can be found here. In the /php subdirectory there you can find w2x.php, which does the same, written in PHP. Dedalus 13:39, 28 September 2009 (UTC)


Some proposals will have massive impact on end-users, including non-editors. Some will have minimal impact. What will be the impact of this proposal on our end-users? -- Philippe 00:04, 3 September 2009 (UTC)

A standard is the most important thing

One of the reasons developers are reluctant to work on a parser is that the wiki language is not officially standardized, and various details can change at any time in unpredictable directions. Such a change can wipe out a lot of work spent on C programming, making the written parser unusable. I think a parser from the community can only be expected if Wikipedia publishes an official specification of the wiki language and then maintains at least backward compatibility. This condition is especially important for more difficult (though maybe more efficient, once the code is written) languages like C. Audriusa 16:29, 4 September 2009 (UTC)

Audriusa asserts that MediaWiki wiki markup is not officially standardized. This is a remarkable fact. The Wikimedia Movement values free and open: it loves open content, open works, open source. Most would also prefer open source as regards software, and open standards, especially for file formats. For example, .ogg is allowed, .mp3 is rejected. MediaWiki is one of many wiki engines, and it looks like each wiki engine has its own flavor of wiki markup. There simply isn't a standard, let alone an open standard, for how the content/works on Wikipedia, for example, are stored. The strategic implication of the proposal for a MediaWiki parser in C is to have wiki markup standardized. Maybe one day there will be an IETF RFC for wiki markup to ensure interoperability. Maybe rename this proposal to Proposal:Open standard for wiki markup, which subsequently might be called WML (Wiki Markup Language) or MWML (MediaWiki Markup Language). I believe that could be something for the quality department to work on: in five years' time, we'll have wiki markup standardized. Dedalus 18:32, 20 September 2009 (UTC)

Good idea, Dedalus. Meanwhile, I know of some other endeavors toward markup standardization, for example WikiCreole at WikiSym 2006, or WikiModel, derived from DOM. I will consider your advice; maybe I will open another proposal, Proposal:Open standard for wiki markup. Thanks. --Mountain 02:09, 21 September 2009 (UTC)
Browsing Meta, I've found multiple attempts at documenting (Media)Wiki markup, probably dating back to 2003. Somehow projects to document wiki markup start ambitiously and energetically, only to be abandoned within a year or so. A proposal for markup standardization should address such organizational issues, including having enough of the right people involved to get the project going and see it through to completion. Dedalus 05:55, 21 September 2009 (UTC)

Explain it to a non-programmer...

As someone who doesn't know programming, it is hard for me to tell what use this activity is, even having read the proposal. What benefit would I see from it? --Bodnotbod 13:35, 10 September 2009 (UTC)

You are likely to get user-friendly tools (other than the browser) to work with Wikipedia content in various ways. Some of these tools will most probably be FOSS, so you will get them for free or even pre-installed. Audriusa