Talk:Task force/Recommendations/Community health 5

From Strategic Planning

Thread title                     Replies   Last modified
New York Times Citation Parser   3         01:24, 7 March 2010
Reflinks                         2         00:46, 18 February 2010
Website Citation Parser          10        14:53, 10 February 2010
Interface barriers               0         15:53, 6 February 2010
Google Books Citation Tool       3         22:26, 5 February 2010

New York Times Citation Parser

I've written up a rather simple PHP parser for making citations out of New York Times URLs. There is an official API, which someone with more time than me can use to write a decent citation tool... The following is copied from w:User_talk:Apoc2400#Citation_tool

Citation tool

Are you planning to add support for other sites? Could you, for example, add support for the New York Times, which prints author/publishing data in the metadata of its webpages? (Great tool, by the way!) Smallman12q (talk) 02:08, 5 February 2010 (UTC)

It would be neat, but Google Books is easier since Google provides a public data API. I guess the New York Times does not, and if I start parsing their regular website they might get suspicious that I am scraping their content. Also it would break whenever they change layout. I will see if it is possible though. --Apoc2400 (talk) 23:22, 8 February 2010 (UTC)
It's fairly simple to do in PHP, given that most of their data is in the metadata, so one could use get_meta_tags and that would pretty much be it...I'll write up a sample PHP script. The legality though...that I don't know. Smallman12q (talk) 02:36, 9 February 2010 (UTC)
Whatever their bot policy is, the most they can legally do is block you. Plain facts are not protected by copyright. If you're seriously worried, why not ask them? Who knows, maybe they have no API because they didn't think of it? Paradoctor (talk) 08:22, 9 February 2010 (UTC)
Well, they have two APIs and also a Silverlight SDK. Looking at the article search API, they do have a "url" GET field, and you could get all the data in a more efficient manner...but I'm only writing this as a proof of concept. So, if you're more interested, feel free to use their API (the limit is 5k queries per day)...I just wanted to prove that it could be done... Smallman12q (talk) 22:42, 9 February 2010 (UTC)

Sample Script

Here is a fairly simple sample script...


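<?php
/**
 * Builds a {{cite news}} template from the meta tags of a New York Times article.
 *
 * @author Smallman
 * @copyright 2010
 * @license Public Domain
 * @version 1.0.0
 */
$url = ''; // the article URL

// get_meta_tags() returns an array keyed by lowercased meta tag names, or false on failure
$tags = ($url !== '') ? get_meta_tags($url) : false;
if ($tags === false) {
    exit("Could not read meta tags from the URL\n");
}

// default every tag used below to '' so a missing tag does not raise a notice
foreach (array('hdl', 'byl', 'cre', 'geo', 'articleid', 'dat') as $name) {
    if (!isset($tags[$name])) {
        $tags[$name] = '';
    }
}

// build the citenewstext (the text of the cite news template)
$citenewstext = "{{cite news\n";
$citenewstext .= "|title= ".$tags['hdl']."\n";
$citenewstext .= "|author= ".$tags['byl']."\n"; // remove the "By" and correct the case later
$citenewstext .= "|url= ".$url."\n";
$citenewstext .= "|newspaper= The New York Times\n";
$citenewstext .= "|publisher = ".$tags['cre']."\n";
$citenewstext .= "|location =  ".$tags['geo']."\n";
$citenewstext .= "|id=  articleid".$tags['articleid']."\n";
$citenewstext .= "|date=  ".$tags['dat']."\n";
$citenewstext .= "|accessdate= ".date('d-m-Y')."\n"; // you can change the format
$citenewstext .= " }}";

// send it back
echo $citenewstext;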




Gives a sample result of ...

{{cite news |title= Paperwork Hinders Airlifts of Ill Haitian Children |author= By IAN URBINA |url= |newspaper= The New York Times |publisher = The New York Times |location = Haiti |id= articleid1247466931716 |date= February 9, 2010 |accessdate= 08-01-2010 }}

which looks like... Template:Cite news
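The |author= value in the sample result still carries the "By" prefix and the all-caps name that the script's comment says to fix later. A minimal sketch of that cleanup might look like the following; clean_byline is a hypothetical helper name, not part of the original script, and it assumes a single plain-ASCII name (names such as "McDonald" or bylines with several authors would need extra handling).

```php
<?php
// Hypothetical helper, not part of the original script.
function clean_byline($byline)
{
    // Drop a leading "By " regardless of case.
    $name = preg_replace('/^\s*by\s+/i', '', trim($byline));
    // Lowercase everything, then capitalise the first letter of each word.
    return ucwords(strtolower($name));
}

echo clean_byline('By IAN URBINA'), "\n"; // prints "Ian Urbina"
```

In the script above, this would wrap the 'byl' tag before it is written into the |author= line.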

Let me know what you think.Smallman12q (talk) 03:24, 9 February 2010 (UTC)

22:50, 9 February 2010

Looks pretty impressive. Is there any way for me to try it on a URL?

14:52, 10 February 2010

I'm going to write one that uses the New York Times API rather than the meta tags; it seems fairly simple to do...

00:18, 11 February 2010

I've written a fully working script at . Currently it only handles articles from the NYT, though I'm going to add blog support via the Times Newswire API. Let me know what you think.

01:24, 7 March 2010

Reflinks

There's also another citation tool, Reflinks, which is hosted on the toolserver, but it doesn't do a good job of pulling titles, and it can't pull the author...more info at w:User:Dispenser/Reflinks

19:28, 11 February 2010

Not sure how this one is supposed to work. But it's reassuring to see that editors recognize the value of this kind of tool.

05:14, 17 February 2010

Well you put in the Wikipedia page, and it checks its citation templates...

00:46, 18 February 2010

Website Citation Parser

Rather than asking volunteers to fill in citation details, for most news and other major websites a simple web parser that requires nothing more than a URL would suffice. The web parser could extract the author, date, subject, publisher, etc.

13:36, 23 January 2010

I think that this is a great idea. I do not know if there is a standard that media websites use from which we could easily extract data, or if the parser would have to be taught to recognize each site individually. There should be some sort of manual override that does not require editing the inserted markup. This could also be used to verify that the extracted information is accurate. Some form of date checking (translating the numbers into text) would also be helpful, given the different ways of denoting dates found in an international project. The tool could also check the source's reliability against a whitelist and a blacklist.

Philosophically, I support this idea because it makes life easier for newbies and veterans alike. I think the best way to implement the newbie-helping tools is to cater to the veterans, because it will reduce the social opposition to the changes. Show them that many of their power tools will still work, or maybe even be superseded by native functionality. I think the instant communication prior to a revert is also incredibly potent in breaking down the barriers between newbies and veterans.
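The date-checking suggestion above (translating the numbers into text) can be sketched roughly as follows. This is only an illustration under stated assumptions: spell_out_date is a hypothetical helper, the caller is assumed to know which numeric format a given site uses, and the output format "j F Y" is just one possible choice.

```php
<?php
// Hypothetical helper, not from the thread: re-renders a numeric date with the month spelled out.
function spell_out_date($value, $format)
{
    // The leading "!" resets unparsed fields so the current time does not leak in.
    $date = DateTime::createFromFormat('!' . $format, $value);
    // Reject unparseable input and overflowing dates such as 31-02-2010,
    // which PHP would otherwise roll over into March (the round trip then differs).
    if ($date === false || $date->format($format) !== $value) {
        return null;
    }
    return $date->format('j F Y');
}

echo spell_out_date('08-01-2010', 'd-m-Y'), "\n"; // 8 January 2010
echo spell_out_date('01/08/2010', 'm/d/Y'), "\n"; // 8 January 2010
var_dump(spell_out_date('31-02-2010', 'd-m-Y'));  // NULL
```

Spelling the month out removes the day/month ambiguity between regional formats, and a null result gives the manual override a natural place to step in.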

18:38, 24 January 2010

Yeah, I think this idea makes a lot of sense too. If we simplify the work that veterans do, we effectively shrink the gap between veterans and newbies. The on-ramp for new editors would be much easier to climb.

18:48, 24 January 2010

I don't believe there is a standard...RSS doesn't provide all the details. However, it would be fairly easy to write a web parser for the top 500 sites (New York Times, Reuters, gov sites...etc), and some kind of "assist" web parser for the remaining ones.

There is no reason to waste hours filling in citation forms when a simple parser can do the trick...I'll probably write a simple one up when I'm done writing some other tools.

03:18, 26 January 2010

That would be great. Hopefully we could set up a system where specific sites can be requested to be added by WikiProjects or individuals who use a certain source a lot. I agree the best way to get this implemented into the interface is to show that it works - and the wiki environment is very conducive to that.

01:00, 27 January 2010

If you can get one that works with even one big news site (e.g.: CNN), I'm sure we could expand it site by site. We only need a proof of concept.

05:17, 27 January 2010

Proof of concept...right, I'll try to do that with a New York Times article...

22:52, 27 January 2010

Looking forward to it!

Try to make it scale though. Like, make it modular so you can keep expanding it. Not sure how to do it. Maybe have an authorlocation.(site) = (string). So authorlocation.NYT = "by:", authorlocation.CNN = "by -"... and so on.
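One way to read the modular idea above is a per-site lookup table that the parser consults by hostname, so adding a site means adding one entry. The sketch below is an assumption-laden illustration, not a finished design: the 'hdl', 'byl' and 'dat' names come from the NYT sample script, while $siteProfiles, build_citation, the news.example.com entry, its meta tag names and the article URL are all hypothetical placeholders.

```php
<?php
// Hypothetical per-site map: host => which meta tag feeds which template parameter.
$siteProfiles = array(
    'www.nytimes.com' => array(
        'newspaper' => 'The New York Times',
        'fields' => array('title' => 'hdl', 'author' => 'byl', 'date' => 'dat'),
    ),
    // Placeholder entry showing how a second site would be added.
    'news.example.com' => array(
        'newspaper' => 'Example News',
        'fields' => array('title' => 'headline', 'author' => 'author', 'date' => 'pubdate'),
    ),
);

// Builds a cite news template, or returns null when the site has no profile yet.
function build_citation($url, array $tags, array $siteProfiles)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || !isset($siteProfiles[$host])) {
        return null; // unknown site: fall back to manual entry
    }
    $profile = $siteProfiles[$host];
    $out = "{{cite news\n|url= " . $url . "\n|newspaper= " . $profile['newspaper'] . "\n";
    foreach ($profile['fields'] as $param => $metaName) {
        $value = isset($tags[$metaName]) ? $tags[$metaName] : '';
        $out .= "|" . $param . "= " . $value . "\n";
    }
    return $out . "}}";
}

// Tags are hard-coded here so the sketch runs without fetching anything.
$tags = array('hdl' => 'Paperwork Hinders Airlifts of Ill Haitian Children',
              'byl' => 'By IAN URBINA', 'dat' => 'February 9, 2010');
echo build_citation('http://www.nytimes.com/some/article.html', $tags, $siteProfiles), "\n";
```

Keeping the site knowledge in data rather than code is what would let WikiProjects request new sites without anyone touching the parser itself.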

03:12, 28 January 2010

I've added a simple PHP scraping script for New York Times articles, but it's probably best to write a dedicated service using their API.

01:11, 10 February 2010

PHP scraping may be less elegant, but it's also more general. If this tool expands one website at a time, it's eventually going to need to work on reliable sites with no API.

14:53, 10 February 2010

Interface barriers

Sorry for the late input, but comments from my side trickle in slowly due to time constraints. So, I want to share some of my experience back not so long ago when I was still a novice editor. Here we go.

Welcome message: To start with, I never got a welcome message introducing me to the basic concepts of wiki editing. This may very well be because I usurped my user name, but I consider this a mistake that needs to be corrected in the future. Nonetheless, such basic information would be crucial for non-registered contributors as well, and a link to such information could be added to the sidebar. Edit links could even be added to the head of every edit page when contributing anonymously.

Toolbars: I found it very confusing that there are toolbars on the top (for your account), on the bottom (info box on the user contributions page), and in the left sidebar (e.g., the toolbox). It took me ages to find the darn toolbox! This inconsistency is also present when editing a page: there is a toolbar on the top of the edit box, containing a special characters menu, but there is also a toolbar on the bottom, which also allows me to insert special characters and, unlike the top one, finally allows me to insert math symbols, which I had also been looking for for ages! Remove the lower toolbar, and integrate it into the top one!!!

Syntax checking: It would be very handy if the wiki, upon saving a page, would first check for any wiki syntax errors (especially in tables or math) and, instead of saving, warn the user about the broken page! It is far too common that inexperienced editors break more complex tables while others continue editing the article text, sometimes making intermediate reverts impossible without manual intervention.

Minor edits: These are utterly useless, as some people don't use them at all, while others have all their edits configured as minor edits.

Templates: There is a myriad of templates, but again it took me ages to find a list of these. Granted, one can find such a list by reading thoroughly through the welcome message references (which I didn't receive). Instead, there should be a drop-down menu in the edit toolbar that lets you select and insert common templates (and refers you to the wiki article for more information).

WP:QWEPOX: WTF? It is absolutely frustrating for beginners to see unintelligible wiki abbreviations all over. WP:AGF? WP:MOS? Huh? Experienced editors love to use them in their edit summaries. :) Maybe the wiki site backend could automatically grep such expressions and reference them properly, e.g., replace WP:AGF by Assume good faith.
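The shortcut-expansion idea above could be prototyped along these lines. This is a sketch under assumptions: $shortcuts is a hypothetical two-entry lookup table (a real one would presumably come from the wiki's own redirect data), expand_shortcuts is an invented name, and the pattern only matches shortcuts made of capital letters.

```php
<?php
// Hypothetical lookup table; only the two shortcuts named in the post are included.
$shortcuts = array(
    'WP:AGF' => 'Assume good faith',
    'WP:MOS' => 'Manual of Style',
);

// Replaces known WP: shortcuts with their full titles; unknown ones are left untouched.
function expand_shortcuts($summary, array $shortcuts)
{
    return preg_replace_callback(
        '/\bWP:[A-Z]+\b/',
        function ($m) use ($shortcuts) {
            return isset($shortcuts[$m[0]]) ? $shortcuts[$m[0]] : $m[0];
        },
        $summary
    );
}

echo expand_shortcuts('rv per WP:AGF and WP:QWEPOX', $shortcuts), "\n";
// prints "rv per Assume good faith and WP:QWEPOX"
```

Leaving unknown shortcuts as they are keeps the rewrite safe when the table is incomplete.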

Watchlist: It would be very convenient if each line in the watch list could additionally either

  • list the number of intermediate edits since the last login, and/or provide a diff to that last edit, or
  • list the number of intermediate edits since the last personal edit (if available), and/or provide a diff to that last edit.

I usually want to see all the changes between my last edit and the current top edit.

Diffs: It would make looking up differences between revisions so much easier if the line numbers given were clickable, allowing one to jump straight to a line/paragraph affected by changes.

Hm, sorry, I wrote this rather hastily. Anyway, comments welcome. Nageh 15:53, 6 February 2010 (UTC)

15:53, 6 February 2010

Google Books Citation Tool

w:User:Apoc2400 has written a citation tool for use with Google Books at . It should serve as a simple enough proof of concept, in which you only need a URL and a bot/parser can automatically derive the rest of the information. Hope this helps!

02:10, 5 February 2010

This is amazing.


21:29, 5 February 2010

It should certainly suffice as a proof of concept. The coding really isn't that hard...it's just that nobody seems to want to do it =(.

22:26, 5 February 2010

That's just freakin' COOL.

22:26, 5 February 2010