Open Web Analytics

From Strategic Planning

Open Web Analytics (OWA) is an open source, PHP-based framework for web analytics. It is already integrated into MediaWiki as well as other PHP-based open source projects.

OWA has a large user community but only two core engineers; Peter Adams, the creator, would like to expand the developer community. There are many bug reports and fixes, with about 4-5 patches per release. The work is self-funded through consulting.

  • DOMstream recording for reviewing mouse tracks, with a controllable recording frequency
  • Extensible data warehousing star schema: automatically tracks every request that's made, across every dimension
  • Can be configured to write to a local database (flat file) or to HTTP POST to another server running OWA (for high volume; this sets up an event queue)
  • Pluggable filters (hooks for which we can write custom filters)
  • Throughput: in asynchronous event queue mode it could handle very high throughput, but at some point the queue has to be played into the database in batches
  • Largest known deployment: 1,000,000 pageviews a day (somebody in France; Peter can give us pointers for learning more)
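
The event queue pattern described above (cheap writes at request time, batch playback into the database later) can be sketched as follows. This is a minimal illustration in Python, not OWA's actual PHP implementation; the function names, the `pageview` table, and the JSON-lines queue format are all assumptions made for the example.

```python
import json
import sqlite3
from pathlib import Path


def log_event(queue_path, event):
    """Append one tracking event to a flat-file queue (no database hit)."""
    with open(queue_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def replay_queue(queue_path, conn, batch_size=1000):
    """Play queued events into the database in batches.

    Returns the number of rows written. The table layout is hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pageview (visitor TEXT, url TEXT, ts INTEGER)"
    )
    path = Path(queue_path)
    if not path.exists():
        return 0

    written = 0
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            e = json.loads(line)
            batch.append((e["visitor"], e["url"], e["ts"]))
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO pageview VALUES (?, ?, ?)", batch)
                conn.commit()
                written += len(batch)
                batch = []
    if batch:
        # Flush the final, partially filled batch.
        conn.executemany("INSERT INTO pageview VALUES (?, ?, ?)", batch)
        conn.commit()
        written += len(batch)

    path.unlink()  # queue has been drained
    return written
```

The request path only ever calls `log_event`; `replay_queue` would run from a scheduled job. The batch size is the main tuning knob for the "batch write-out" concern raised later in these notes.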

Current version:

  • currently just page views
  • tracks every hit within a session (30-minute period of activity)
  • uses PHP hooks we already have in our pages
  • reporting interface is integrated as a special page
  • instrumentation comes out of the extension...no need for page tagging
  • heat maps of clicks on the page
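
As an illustration of the session rule above, here is a small Python sketch that groups hits into sessions. It assumes "30-minute period of activity" means a session ends after 30 minutes of inactivity, which is a common convention but is not confirmed by these notes; the function name and data shapes are hypothetical.

```python
SESSION_TIMEOUT = 30 * 60  # seconds; assumed to be an inactivity timeout


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (visitor, timestamp) hits into per-visitor sessions.

    A new session starts when a visitor has been idle longer than `timeout`.
    Returns a dict mapping visitor -> list of sessions (lists of timestamps).
    """
    sessions = {}
    for visitor, ts in sorted(hits, key=lambda h: (h[0], h[1])):
        visitor_sessions = sessions.setdefault(visitor, [])
        if visitor_sessions and ts - visitor_sessions[-1][-1] <= timeout:
            # Still within the timeout of the previous hit: same session.
            visitor_sessions[-1].append(ts)
        else:
            visitor_sessions.append([ts])
    return sessions
```

For example, two hits ten minutes apart fall into one session, while a third hit more than 30 minutes after the second opens a new one.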

Enhancements in 1.3:

  • canvas based reporting
  • unlimited action events - set of events that allow you to track any action you want
  • user roles
  • REST support

Stuff that's not being done yet:

  • aggregate path analysis needs a better interface
  • distributed processing
  • continuous summarisation for custom reports
  • heat maps - supports time ranges but not at a granularity smaller than 24 hours currently
  • DOMstreams are per user ID, but OWA doesn't track that yet
  • segmentation (though custom dimensions can be added at the visitor level)
  • vary the sample rate based on geo - would need to write an administrative interface to restrict invocation of tracker
  • Validate ComScore: Unique visitors per month, would we have to sample or could we handle typical volume? - Would depend on how we scale MySQL
  • No summary engine. This is nice because you never have to dig beneath a summary, but really it is missing because OWA has never been used at the scale of WMF. The event queue is the only answer for now.
  • Pathing and larger scale are both on the horizon for OWA. According to Peter, the biggest issue will be scale. There's no problem staging data with OWA; the issues will be handling the volume of data (batch write-out) and then whether we can get meaningful reports out of so much data.
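
One way the geo-based sampling item in the list above could work is a deterministic per-visitor sampling decision, so that a given visitor is either always tracked or never tracked. The Python sketch below is only an illustration under that assumption; the rate table, country codes, and function name are hypothetical and not part of OWA.

```python
import hashlib

# Hypothetical per-country sampling rates (fraction of visitors tracked).
SAMPLE_RATES = {"US": 0.01, "FR": 0.1}
DEFAULT_RATE = 1.0


def should_track(visitor_id, country, rates=SAMPLE_RATES, default=DEFAULT_RATE):
    """Decide whether to invoke the tracker for this visitor.

    The visitor ID is hashed into a stable bucket in [0, 1), so the
    decision is consistent across requests from the same visitor.
    """
    rate = rates.get(country, default)
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

An administrative interface, as the list item suggests, would then only need to edit the rate table.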

Problems Peter ran into with MediaWiki: there is no install event to trigger creation of the schema.

Caveats: since nobody has used OWA at Wikimedia scale, we're not really sure of its performance characteristics. We would need to pilot it. The system will also generate a *lot* of data, and we'll have to create an architecture to deal with that. On the other hand, we won't have to renegotiate our privacy policy, because we'd be hosting the data ourselves.

Known Issues

  • Disables caching (why? Can we work around this?)
  • Permissions: writes to its extension directory from the web
  • Doesn't work well with Vector UI
  • Privacy: many privacy-violating options are on by default, and the rest are easy to turn on if you're an "admin" with this interface
    • Need to set up user classes that have options to edit, options to view

Work that would be needed

  • anonymization of data before releasing it
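
A minimal Python sketch of what such a pre-release step might look like follows. It assumes two common techniques, truncating IP addresses and replacing user IDs with keyed hashes; neither is prescribed by these notes, the function names are hypothetical, and keyed hashing is pseudonymization rather than full anonymization, so it would need review against the privacy policy.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ip(ip):
    """Zero the host part of an address (IPv4 keeps /24, IPv6 keeps /48)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)


def pseudonymize(user_id, secret):
    """Replace a user ID with a keyed hash; `secret` (bytes) must stay private."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

For instance, `anonymize_ip("192.168.12.34")` yields `"192.168.12.0"`.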