Summary:Talk:Task force/Wikipedia Quality/Defining quality

Starting point

Woodwalker considered how quality might be defined or measured. The starting point was his essay "On Quality", covering assessment of quality, edits, and editors, and the concept of "content erosion":

The criteria of the Dutch Writing Contest list seven factors for assessing quality (link): lede ("lead"), article structure, page layout, article content, style, verifiability, and findability.

Woodwalker proposes a more general set of factors:

  1. Content requirements
    • encyclopedicity, verifiability, neutrality, balance
  2. Reader requirements
    • interesting, completeness, depth/specialism level/reader level, relevance of content to subject matter
  3. Requirements of form
    • correct (or good) language, encyclopaedic tone, text style and readability, clear and easy article structure, clear and easy layout
  4. Broader project requirements
    • findability of topics, and coverage, balance, consistency, and completeness of topics in general (across the project).

He also divides edits and editors by their impact on quality:

  • Edits that only add quality, those that only remove it, those that are neutral (e.g. US English to UK English), and those that add quality in some ways but remove it in others.
  • Editors who add quality by adding new content; editors who add quality by maintaining it (preventing degradation and making minor improvements to existing content, with a continuum between the two); editors who remove quality; and "problematic users" who add quality in some ways or work hard generally, but who overall come to be seen as harmful in quality terms.

He then describes the (important) concept of content erosion:

"If we assume the rate of change and the percentages of all four edit types to stay constant, the rates of quality increase and quality decrease are constant too. This means any page in the project is subject to a slow decay in quality, which I call wiki-erosion. Quality is guarded by the community though. The ability to revert destructive edits of all kinds is thus related to the amount of knowledge in the community. This means destruction can only go as far as a certain quality level. If the community is larger, that level will be higher, if it is smaller, the level will be lower. Thus, in the long term, quality of any sort will stand a larger risk of being destroyed at small projects, even though the wiki-erosion rate is much smaller."

Woodwalker feels current analysis of quality is lacking, but perhaps by analyzing these components the statistical team can find better ways to assess quality. It may also lead to practical ways to improve specific quality factors, which projects can tailor to their culture, rather than unhelpful generalized suggestions.

General discussion of quality measurement

Piotrus argued metrics are overrated. Virtually all scholarly publications agree quality is high and rising, and internal categorizations (Featured/Good/assessed/stub etc.) are adequate for the rest, for now. He drew attention to limitations of this scheme, including articles outside WikiProjects (and hence not rated) and low-activity WikiProjects without adequate discussion or members to support rating. He sees more (and more active) WikiProjects as a core quality tool.

FT2 states three kinds of metrics are useful and attainable: 1/ crude metrics such as computer assessments based on tagging, cite-to-word ratios, etc. (and some calculation based on these), which can be used to crudely identify major issues and assess articles up to a simple baseline; 2/ article progression and conversion metrics based upon article standing (new -> baseline quality -> good -> featured), the time taken between these stages, and article stability; 3/ assessments based on user and reader feedback ("rate this article").
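As a concrete illustration of the first kind of metric, the sketch below computes a crude score from simple wikitext signals such as the cite-to-word ratio, section count, and cleanup tags. The weights, thresholds, and tag names are assumptions made for illustration, not a formula the taskforce agreed on.

  import re

  # Crude, automated baseline score from simple wikitext signals
  # (weights, thresholds and tag names are illustrative assumptions, not an agreed formula).
  def crude_quality_score(wikitext):
      text = wikitext.lower()
      words = len(re.findall(r"\w+", text))
      cites = len(re.findall(r"<ref[ >]", text)) + text.count("{{cite")
      cleanup = sum(text.count("{{" + tag)
                    for tag in ("citation needed", "cleanup", "unreferenced", "pov"))
      sections = text.count("\n==")

      cite_ratio = cites / max(words, 1)     # the "cite-to-word ratio"
      score = 0.0
      score += min(words / 1000.0, 4.0)      # reward length, capped
      score += min(cite_ratio * 500.0, 3.0)  # reward citation density, capped
      score += min(sections * 0.5, 2.0)      # reward having some structure
      score -= min(cleanup * 0.5, 3.0)       # penalize flagged problems
      return round(max(score, 0.0), 1)       # a crude number like "5.4"

  print(crude_quality_score("A short, unsourced stub. {{Citation needed}}"))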

Baseline quality and the "low fruit"

FT2 observed that "the 'low fruit' is appealing [to focus upon] -- metrics relating to substandard articles that don't meet an agreed baseline for quality, or measuring how long they take -- because there's lots of them, they make a big impression, their issues are easy to identify and quantify, and they are easy to fix. Maybe for now, we should recommend focusing on that."

Incentivizing and promoting quality

FT2 explored this topic in several thread posts. Comments included:

  • Adding a baseline quality and crude automated ratings would "capture basic issues that are a concern and could flag them to the author and the community. If we take care of the worst articles then over time the average will improve. Nobody is more motivated to work on an article than those who have already edited it, so they may be interested in a simple 'score' plus information on why it's low."
  • Giving an editor even a crude rating on an article ("This article is rated as 5.4, click here to see what's needed to improve it") will incentivize and stretch users ("We need quality things to be pushed, incentivized, fun, enjoyable, and desirable to go for.... Suggesting incremental ways to do better... [even] a crude automated evaluation of an article's weaknesses [can help]"). A wizard/popup was suggested to embody this approach [1].
  • Other major organizations (McDonald's, Coke, Nike, etc.) promote by "mak[ing] it simple, easy, intuitive -- and plaster things (tastefully) wherever they can that channel people towards the ways that help that organization. We're no different in a way. We want readers to be nudged to check out possible corrections and facts to cite, and we want to make that really easy and obvious... we want editors who write an article to have it made really simple and attractive to revisit it to get it one more notch up a crude quality number... and so on."
  • AndyZ's assessment tool, one of several quality rating tools identified by Piotrus, "has real potential if it could prioritize the key issues and suggest them, and if it was made simple with an integrated interface thing that was 'one click away' on each page. Every last article that's not GA/FA could have a little tasteful slow-blink icon saying 'Improvements we want on this article', listing 2 or 3 selected improvements the article needed and a 'Let's fix it!' button... That would get the wider public's involvement." (A rough sketch of this idea follows the list.)
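A minimal sketch of the "prioritize and suggest" idea, assuming a handful of invented checks and priorities; it is not how AndyZ's tool actually works, only an illustration of selecting two or three suggestions to show alongside a crude score.

  # Turning a few simple checks into two or three prioritized suggestions.
  # The checks, priorities and wording are invented for illustration;
  # they are not how AndyZ's tool actually works.
  def suggest_improvements(wikitext, max_items=3):
      checks = [
          # (priority, problem present?, suggestion shown to the editor)
          (1, "<ref" not in wikitext, "Add citations to reliable sources."),
          (2, "==" not in wikitext, "Split the text into sections with headings."),
          (3, len(wikitext.split()) < 300, "Expand the article; it is very short."),
          (4, "[[Category:" not in wikitext, "Add at least one category."),
      ]
      return [tip for _, failed, tip in sorted(checks) if failed][:max_items]

  # e.g. for a stub with no references, headings or categories:
  for tip in suggest_improvements("A very short unsourced article."):
      print("-", tip)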

Piotrus felt this sounded good, advocated simplicity, and asked whether Andy's tool (or one like it) should be recommended for future development.

File uploads

There was a short subthread about file uploads. Bhneihouse commented "the fact that anyone on this team said they 'hope xxx works' is a huge statement about reliability factors on Wikipedia. That is a quality statement right there".

Branding

Bhneihouse discussed the idea of "branding" in the broad sense of purpose, mission or "being-ness" (as opposed to just "visual identity"), and how the brand (roughly, what Wikipedia is) drives what Wikipedia does:

"What we are talking about here is really about what Wikipedia is, and thus how it does what it does... we cannot have a conversation about quality without starting the conversation with brand... Brand is intangible but is expressed through that which is tangible, whether it be a mark/logo, or the way customer service responds to a customer or the way that a user experiences Wikipedia".

She added that "a consistent framework would serve Wikipedia's goals" and remove guesswork. She stated:

"[W]henever Wikipedia allows that which is not consistent with its brand to exist as Wikipedia, it dilutes the brand... Wikipedia is about accurate knowledge. Standards in keeping with Wikipedia's core ideals and values keep Wikipedia being Wikipedia".

User feedback mechanisms

Woodwalker asked about obtaining reader or user feedback, and suggested neutral, balanced, complete, and well cited as the four axes. FT2 noted that Flagged revisions already has such a tool, and suggested "balanced, sources, coverage (completeness), up to date" as axes, adding that capturing the reader's knowledge level on the topic (casual editor | knowledgeable | very knowledgeable | formally qualified) would be extremely valuable (it shows both the ratings that readers at different levels give the article and the existing knowledge levels of its readers).
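A minimal sketch of how such feedback could be recorded and broken down by reader knowledge level is shown below; the axis names follow the thread, but the record layout, the 1-5 scale, and the function are assumptions, not the Flagged revisions tool.

  from collections import defaultdict
  from statistics import mean

  # Reader feedback records with a self-reported knowledge level
  # (casual editor / knowledgeable / very knowledgeable / formally qualified).
  AXES = ("balanced", "sources", "coverage", "up_to_date")

  def summarize_by_level(feedback):
      """Average each axis separately for each reader knowledge level."""
      buckets = defaultdict(list)
      for record in feedback:
          buckets[record["level"]].append(record)
      return {level: {axis: round(mean(r[axis] for r in records), 2) for axis in AXES}
              for level, records in buckets.items()}

  sample = [
      {"level": "casual editor", "balanced": 4, "sources": 3, "coverage": 4, "up_to_date": 5},
      {"level": "formally qualified", "balanced": 3, "sources": 2, "coverage": 2, "up_to_date": 4},
  ]
  print(summarize_by_level(sample))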

There was some agreement (FT2, Woodwalker) that "rate this" popups would be seen as "spammy" compared to a toolbar, and about the value of obtaining the reader's knowledge level as a measure of the "actual audience".

Article rating

Woodwalker later stated that assessment may not help smaller projects lacking the skills to rate quality, and that rating systems vary between projects, summarizing: "having more editors is important, but let's be fair: for quality we especially love to have more 'quality' editors". Bhneihouse stated that she felt the quality framework needed consistency and buy-in across projects, not a "pick and choose" structure. Woodwalker felt WMF's ability to force change on all projects is very limited (e.g. BLP; see below), and that offering projects options if they wanted to improve a particular aspect was more respectful and likely to obtain higher uptake. He noted that "all projects should in the end have the same quality goals, but they may be in different project phases and therefore need different approaches".

WMF's abilities to force change and working in the "real world"

This was a significant point of taskforce philosophy.

Woodwalker felt WMF has only a very limited ability to force change on communities (see: BLP). FT2 agreed, expressing concern that some ideas would be "dead in the water" in any practical sense, and that only certain things can be effectively shaped or altered.

Woodwalker commented that he was "philosophizing for perfect world", and that what projects actually did with it (or were able to do in practice) "isn't up to us".

FT2 argued the taskforce's aim was to maximize effect, which meant designing for the real world of the projects, not a perfect world ("the criteria is 'what's most likely to deliver given these things' "). Anticipating or considering possible issues and practicalities that could affect likely productivity was part of the job:

"[I]t has to be a path that's got a good chance of people following it, otherwise it's pointless. So our optimum result might be categorized somewhat openly as the best path that has a good chance of enough people following it to make the necessary difference. Human nature, variety of views, and inertia, will ultimately limit what we can achieve in any given "bite" at the quality cherry. Best to respect there are limits on the achievable (although not giving in lightly), and see what's the most we can progress quality for this time."

Decision making

Woodwalker stated that paralysis in decision making within projects meant that new ways needed to be found, rather than just trying to modify old ways, but felt that frustration with 'red tape' would eventually force progress.

Sjc commented that "It's not just the amount and complexity of 'red tape', it is also the fact that 'red tape' is a tool which can be bent to purpose by all and sundry. In fact, a creative editor can make the red tape mean exactly what he or she wants it to mean. Red tape is in this respect more of a liability than an asset, where edit wars can be won by the editor most adept at bending the laws of reality to their own intent than others". He advocated removal of policy (which "appears to be a fundamental enemy of content and quality of content") in favour of unhindered good-faith editing.

(Sjc's essay on this and related points)

Summing up

Woodwalker summed up the thread to date, commenting that:

  1. We've found that quality isn't easy to measure by simple metrics; perhaps it's impossible unless we have some form of feedback from the reader.
    Piotrus states this isn't impossible; rather, there are several different metrics for doing that, and what may be impossible is selecting the "best one".
    Woodwalker disagreed, noting that while subjective feedback helps, there is no clear way to quantify key goals like "completeness", "balance" or "structure"; even reverting can be for both good and bad reasons.
  2. We've come across some good ideas about how this feedback can be obtained. It is also important to know who gives the feedback (expert/interested person/school kid?).
  3. Piotrus suggests WikiProjects can play a part and we need more of them; Bhneihouse suggests Wikipedia can only become a quality brand when there is a consistent basic level of quality across Wikimedia projects (I assume this is not just about Wikipedia); FT2 thinks our recommendations should be in the form of realistic suggestions likely to have the biggest positive effect on quality as they perpetuate, allowing for where the communities are today [as amended]. I suggest this feedback idea could become our second practical recommendation (after 1. creating more manuals/wizards).