Website Citation Parser

Rather than asking volunteers to fill in citation details by hand, for most news and other major websites a simple web parser that requires nothing more than a URL would suffice. The parser could extract the author, date, subject, publisher, etc.

Smallman12q 13:36, 23 January 2010
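
For illustration, a minimal PHP sketch of what such a URL-only extractor might look like, pulling citation fields from common <meta> tags. The tag names and fallbacks used here are assumptions, not a standard; real sites vary and any field may come back empty.

<?php
// Best-effort citation extraction from a URL using common <meta> tags.
// Tag coverage differs per site; every field may be null.
function extractCitation(string $url): array {
    $html = @file_get_contents($url);
    if ($html === false) {
        return [];
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                  // suppress warnings from messy real-world markup
    $xpath = new DOMXPath($doc);

    // Collect every <meta name=...> / <meta property=...> into one map.
    $meta = [];
    foreach ($xpath->query('//meta[@name or @property]') as $node) {
        $key = strtolower($node->getAttribute('name') ?: $node->getAttribute('property'));
        $meta[$key] = $node->getAttribute('content');
    }

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    return [
        'title'     => $meta['og:title'] ?? ($titleNode ? trim($titleNode->textContent) : null),
        'author'    => $meta['author'] ?? $meta['article:author'] ?? null,
        'date'      => $meta['article:published_time'] ?? $meta['date'] ?? null,
        'publisher' => $meta['og:site_name'] ?? parse_url($url, PHP_URL_HOST),
        'url'       => $url,
    ];
}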

I think that this is a great idea. I do not know if there is a standard that media websites use that we could easily extract data from, or if the parser would have to be taught to recognize each site individually. There should be some sort of manual override that does not require editing the inserted markup; this could also be used to verify that the extracted information is accurate. Some form of date checking (translating the numbers into text) would also be helpful, given the different ways of denoting dates found in an international project. The parser could also check the source's reliability against a whitelist and blacklist.

Philosophically, I support this idea because it makes life easier for newbies and veterans alike. I think the best way to implement the newbie-helping tools is to cater to the veterans, because it will reduce the social opposition to the changes. Show them that many of their power tools will still work, or maybe even be superseded by native functionality. I think the instant communication prior to a revert is also incredibly potent in breaking down the barriers between newbies and veterans.

HereToHelp (talk) 18:38, 24 January 2010
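
A rough sketch of the two checks suggested above: matching the source's domain against a whitelist/blacklist, and converting an all-numeric date into unambiguous "day Month year" text. The domain lists and the input date format are placeholders; because of the international ambiguity mentioned (01/02/2010 could be January or February), the expected format has to be known per site rather than guessed.

<?php
// Check the source domain against placeholder whitelist/blacklist entries.
$whitelist = ['nytimes.com', 'reuters.com'];   // presumed reliable (example values)
$blacklist = ['example-blog.invalid'];         // flagged or rejected (example values)

function sourceStatus(string $url, array $whitelist, array $blacklist): string {
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    $host = preg_replace('/^www\./', '', $host);
    if (in_array($host, $blacklist, true)) return 'blacklisted';
    if (in_array($host, $whitelist, true)) return 'whitelisted';
    return 'unknown';
}

// Turn a numeric date into unambiguous text, e.g. "2010-01-23" -> "23 January 2010".
// The input format must be supplied per site; 'Y-m-d' is only an example.
function normalizeDate(string $raw, string $format = 'Y-m-d'): ?string {
    $dt = DateTime::createFromFormat($format, $raw);
    return $dt ? $dt->format('j F Y') : null;
}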
 

Yeah, I think this idea makes a lot of sense too. If we simplify the work that veterans do, we effectively shrink the gap between veterans and newbies. The on-ramp for new editors would be much easier to climb.

Randomran 18:48, 24 January 2010
 

I don't believe there is a standard... RSS doesn't provide all the details. However, it would be fairly easy to write a web parser for the top 500 sites (New York Times, Reuters, gov sites, etc.), and some kind of "assist" web parser for the remaining ones.

There is no reason to waste hours filling in citation forms when a simple parser can do the trick...I'll probably write a simple one up when I'm done writing some other tools.

Smallman12q 03:18, 26 January 2010
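
One way the "dedicated parsers for the top sites, generic assist for the rest" split might be wired up: a map from hostname to a site-specific parser, falling back to a generic extractor. The function names here (parseNyt, parseReuters, and the extractCitation sketch earlier in the thread) are hypothetical.

<?php
// Hypothetical registry of dedicated, site-specific parsers.
$siteParsers = [
    'nytimes.com' => 'parseNyt',
    'reuters.com' => 'parseReuters',
];

function parseCitation(string $url, array $siteParsers): array {
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    $host = preg_replace('/^www\./', '', $host);

    if (isset($siteParsers[$host])) {
        return call_user_func($siteParsers[$host], $url);   // dedicated parser for a known site
    }
    return extractCitation($url);   // generic "assist" parser for everything else
}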
 

That would be great. Hopefully we could set up a system where WikiProjects or individuals who use a certain source a lot can request that specific sites be added. I agree the best way to get this implemented into the interface is to show that it works - and the wiki environment is very conducive to that.

HereToHelp (talk) 01:00, 27 January 2010
 

If you can get one that works with even one big news site (e.g., CNN), I'm sure we could expand it site by site. We only need a proof of concept.

Randomran 05:17, 27 January 2010
 

Proof of concept... right, I'll try and do that with a New York Times article...

Smallman12q 22:52, 27 January 2010
 

Looking forward to it!

Try to make it scale though. Like, make it modular so you can keep expanding it. Not sure how to do it. Maybe have an authorlocation.(site) = (string). So authorlocation.NYT = "by:", authorlocation.CNN = "by -"... and so on.

Randomran 03:12, 28 January 2010
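
A minimal sketch of that per-site author-marker idea: a table of byline prefixes keyed by hostname, and a lookup that grabs whatever text follows the marker. The markers shown are guesses and would need checking against each site's actual pages; adding a new site is just adding an entry.

<?php
// Per-site byline markers, expandable one entry at a time (values are guesses).
$authorMarker = [
    'nytimes.com' => 'By ',
    'cnn.com'     => 'By ',
];

// Return the text following the site's byline marker, or null if unknown / not found.
function findAuthor(string $pageText, string $host, array $authorMarker): ?string {
    if (!isset($authorMarker[$host])) {
        return null;                              // no rule for this site yet
    }
    $pos = strpos($pageText, $authorMarker[$host]);
    if ($pos === false) {
        return null;
    }
    $start = $pos + strlen($authorMarker[$host]);
    $rest  = substr($pageText, $start, 80);       // short window after the marker
    return trim(strtok($rest, "\n"));             // keep only the rest of that line
}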
 

I've added a simple PHP scraping script for New York Times articles, but it's probably best to write a dedicated service using their API.

Smallman12q 01:11, 10 February 2010
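
For comparison, a sketch of the API route: querying the NYT Article Search API for the article and reading citation fields from the response. The endpoint, parameters, and field names here follow the v2 Article Search API as best understood and should be verified against the current NYT developer documentation; the API key is a placeholder you would register for.

<?php
// Hypothetical lookup of citation data via the NYT Article Search API (v2).
// Verify the endpoint and response fields against the official developer docs.
function nytCitationFromApi(string $articleUrl, string $apiKey): ?array {
    $endpoint = 'https://api.nytimes.com/svc/search/v2/articlesearch.json';
    $query = http_build_query([
        'fq'      => 'web_url:("' . $articleUrl . '")',   // filter on the article URL
        'api-key' => $apiKey,
    ]);
    $json = @file_get_contents($endpoint . '?' . $query);
    if ($json === false) {
        return null;
    }
    $data = json_decode($json, true);
    $doc  = $data['response']['docs'][0] ?? null;
    if ($doc === null) {
        return null;
    }
    return [
        'title'     => $doc['headline']['main'] ?? null,
        'author'    => $doc['byline']['original'] ?? null,
        'date'      => $doc['pub_date'] ?? null,
        'publisher' => 'The New York Times',
        'url'       => $doc['web_url'] ?? $articleUrl,
    ];
}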
 

PHP scraping may be less elegant, but it's also more robust. If this tool expands one website at a time, it's eventually going to need to work on reliable sites that have no API.

Randomran 14:53, 10 February 2010