New York Times Citation Parser

I've written up a rather simple PHP parser for making citations out of New York Times URLs. There is an official API, which someone with more time than me could use to write a decent citation tool. The following is copied from w:User_talk:Apoc2400#Citation_tool

Citation tool

Are you planning to add support for other sites? Could you, for example, add support for the New York Times, which prints author/publishing data in the metadata of their webpages? (Great tool by the way!) Smallman12q (talk) 02:08, 5 February 2010 (UTC)

It would be neat, but Google Books is easier since Google provides a public data API. I guess the New York Times does not, and if I start parsing their regular website they might get suspicious that I am scraping their content. Also it would break whenever they change layout. I will see if it is possible though. --Apoc2400 (talk) 23:22, 8 February 2010 (UTC)
It's fairly simple to do in PHP, given that most of their data is in the page's meta tags, so one could use get_meta_tags and that would pretty much be it... I'll write up a sample PHP script. The legality, though... that I don't know. Smallman12q (talk) 02:36, 9 February 2010 (UTC)
Whatever their bot policy is, the most they can legally do is block you. Plain facts are not protected by copyright. If you're seriously worried, why not ask them? Who knows, maybe they have no API because they didn't think of it? Paradoctor (talk) 08:22, 9 February 2010 (UTC)
Well, they have two APIs and also a Silverlight SDK. Looking at the article search API, they do have a "url" GET field, and you could get all the data in a more efficient manner... but I'm only writing this as a proof of concept. So, if you're more interested, feel free to use their API (the limit is 5k queries per day)... I just wanted to prove that it could be done... Smallman12q (talk) 22:42, 9 February 2010 (UTC)

Sample Script

Here is a fairly simple sample script...

<?php

/**
 * @author Smallman
 * @copyright 2010
 * @license Public Domain
 * @version 1.0.0
 */
 
$url = 'http://www.nytimes.com/';//the url


$citenewstext = '';//the variable where the template info will be stored
$tags = get_meta_tags($url);//get the meta tags

//build the citenewstext
$citenewstext = "{{cite news\n";
$citenewstext .= "|title= ".$tags['hdl']."\n";
$citenewstext .= "|author= ".$tags['byl']."\n";//remove the By and correct case later
$citenewstext .= "|url= ".$url."\n";
$citenewstext .= "|newspaper= The New York Times\n";
$citenewstext .= "|publisher = ".$tags['cre']."\n";
$citenewstext .= "|location =  ".$tags['geo']."\n";
$citenewstext .= "|id=  articleid".$tags['articleid']."\n";
$citenewstext .= "|date=  ".$tags['dat']."\n";
$citenewstext .= "|accessdate= ".date('d-m-Y')."\n";//you can change format
$citenewstext .= " }}";

//send it back
echo $citenewstext;

exit;

//resources
//http://www.php.net/manual/en/function.date.php
//http://php.net/manual/en/function.get-meta-tags.php
//https://secure.wikimedia.org/wikipedia/en/wiki/Template:Cite_news

?>

Gives a sample result of ...

{{cite news
|title= Paperwork Hinders Airlifts of Ill Haitian Children
|author= By IAN URBINA
|url= http://www.nytimes.com/2010/02/09/world/americas/09airlift.html
|newspaper= The New York Times
|publisher = The New York Times
|location = Haiti
|id= articleid1247466931716
|date= February 9, 2010
|accessdate= 08-01-2010
 }}

which looks like... Template:Cite news
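The author field in the sample result still carries the "By" prefix and the upper-case name that the script's comment says should be fixed later. A minimal sketch of that cleanup follows; the helper name clean_byline is hypothetical and not part of the original script, and it assumes the 'byl' value always looks like "By FIRSTNAME LASTNAME".

```php
<?php
// Hypothetical helper (not part of the original script): turns a raw
// 'byl' meta value such as "By IAN URBINA" into "Ian Urbina".
function clean_byline($byline)
{
    // Drop a leading "By " in any letter case, plus surrounding whitespace.
    $name = preg_replace('/^\s*by\s+/i', '', trim($byline));

    // Lower-case everything, then capitalise the first letter of each word.
    return ucwords(strtolower($name));
}

echo clean_byline('By IAN URBINA'), "\n"; // prints "Ian Urbina"
```

In the script this could be used as clean_byline($tags['byl']) when building the |author= line. It is only a rough pass: a joint byline would get a capitalised "And", and names like "McDonald" or hyphenated surnames would come out with the wrong case.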

Let me know what you think. Smallman12q (talk) 03:24, 9 February 2010 (UTC)

Smallman12q 22:50, 9 February 2010

Looks pretty impressive. Is there any way for me to try it on a URL?

Randomran 14:52, 10 February 2010
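One way to try the sample script on an arbitrary URL would be to read the address from a query parameter instead of hard-coding $url. The sketch below is only an illustration under assumptions: the parameter name url and the helper is_nytimes_url are hypothetical, and the check merely makes sure the script is not pointed at other sites.

```php
<?php
// Hypothetical helper: accepts only http(s) URLs whose host is
// nytimes.com or one of its subdomains (e.g. www.nytimes.com).
function is_nytimes_url($url)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), array('http', 'https'), true)) {
        return false;
    }
    $host = strtolower($parts['host']);
    // '.nytimes.com' is 12 characters long.
    return $host === 'nytimes.com' || substr($host, -12) === '.nytimes.com';
}

// 'url' is an assumed query parameter name, e.g. script.php?url=...
$url = isset($_GET['url']) ? $_GET['url'] : '';

if (is_nytimes_url($url)) {
    // At this point $url could be passed to get_meta_tags() as in the sample script.
    echo 'OK: ', htmlspecialchars($url), "\n";
} else {
    echo "Please supply a nytimes.com article URL.\n";
}
```

Checking the parsed host, rather than searching the whole string for "nytimes.com", avoids accepting addresses that only mention the domain in their path.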
 

I'm going to write one up that uses the New York Times API rather than scraping meta tags... it seems fairly simple to do...

Smallman12q 00:18, 11 February 2010
 

I've written a fully working script at http://smallin.freeiz.com/beta/cite1.php . Currently it only handles articles from the NYT, though I'm going to add blog support via the Times Newswire API. Let me know what you think.

Smallman12q 01:24, 7 March 2010