Interoperable interactive geometry for Europe
SPONSORS
This platform is brought to you by the intergeo project, funded under the eContent Plus programme of the European commission and by partners

D4.7 Usage Analysis and Platform Adjustments

This is the document preparing our deliverable to be submitted at the end of November. This document is at http://i2geo.net/xwiki/bin/view/About/D47-UsageAnalysis and is currently in internal review.

Introduction

This deliverable presents the usage analysis that has been available thus far in i2geo, a new generic logging infrastructure, and describes the enhancements needed to obtain a better tracking that matches the expected indicators of the Intergeo platform.

Web-logging is a traditional task for which several off-the-shelf solutions exist. Google Analytics is a web-service-based offer, while many log analyzers work directly on the logs of the web-server (e.g. WebAlizer, AWStats). In principle, they allow the owners of web-sites to perform fine grained tracking of the accesses to the site, and thus to profile the website better, improve the usability of the information presentation, or better guide the users to the expected outcomes.

In practice, the web-interactions of contemporary web-servers are quite rich; the server logs show the volumes very accurately, but only fine grained analysis yields useful indicators. By fine grained, we mean analysis of page hits on particular paths, analysis of sequences of hits (clickstreams), or analysis of external data.

Outline

This deliverable presents the indicators we are interested in, how far we could get with simple analysis, the enhanced logging and statistics architecture, and the methods to compute the indicators. Finally, we propose how to derive from these the performance indicators described in the Intergeo description of work.

Definitions

I2geo is a web-based platform derived from the Curriki platform, which is itself based on the XWiki web framework. It processes resources, which are stored as XWiki documents within the space of the contributing user.

A resource is a document with attached XWiki objects of multiple possible types. The basic types include all the standard metadata of a resource while the specialized types describe the content of the resources:

  • an attachment asset is a resource with an attached file; i2geo, since Nov 2009, can display some file types, for example play directly some of the interactive geometry files; these types are more specialized asset-types
  • an external asset is a resource made of a URL to a resource outside i2geo
  • a curriki-asset is a resource made of a web-page that is edited directly on i2geo
  • a collection asset is a resource made by collecting other resources, for example under a given topic of interest
I2geo allows users to have their own profiles and their blogs, which they can display publicly at will. Groups, with their own documents and resources, are also available and allow a tighter collaboration.

Finally, i2geo allows reviews: judgements about resources, filed as a series of agreement statements (see deliverable D4.3).

Indicators of Interest

The Intergeo description of work, section 5.2, describes a few performance indicators:

  • content aggregated
  • increase in access
  • increase in reuse
  • QA resources
  • registered web site users
  • curriculum mapping countries
  • school coverage
These are high-level indicators, while this deliverable is of a technical nature. The last section explains how to estimate them from the numbers collected.

Indicators of interest are the following:

  • content indicators:
    • total number of resources
      • number of external-links
      • simple wiki resources (lesson plan, ...)
      • collections
    • number of uploaded files with a detail per system
    • users
    • groups
    • users with blog
    • blog-entries
    • messages in groups
    • number of reviews
      • with a decomposition by overall-result
    • number of external links to i2geo
  • usage indicators:
    • number of resources played in the browser
    • number of resource files downloaded
    • number of external links followed
    • number of search queries
    • number of saves of reviews (creation, edition)
    • number of saves (creation, edition) of resources
    • number of deletions of resources
    • number of branches of resources (copy action for the user to appropriate)

Access policies

Access to the logs could be considered "purely statistical" and thus relatively neutral in nature. Most laws, however, prohibit the publication of these statistics without explicit user-consent. The privacy policies of the platform indicate that data is collected.

The i2geo editorial and development team has protected access to the logs: they are all available from http://stat.i2geo.net/ with the user-names and passwords of the platform, provided the user is in the right groups. This gives the team access to all logs; in principle this would allow a very fine grained tracking of individuals, but it is what makes offline analysis possible.

Analysis of Past Interactions

The i2geo platform is based on Curriki, a web-platform to share educational resources. Curriki has made use of Google Analytics, a free service that the Google corporation offers to web-masters: a small script snippet is inserted into each page, so that every display of that page in a browser is tracked on the Google Analytics web-site. Web-masters can then consult a presentation of the access statistics there.

The big issue with the Google Analytics service is the management of available sockets: a contemporary web-browser typically opens no more than 4 sockets at a time to deliver a web-page; each socket is connected to only one server but can serve many requests in succession. Relying on the Google Analytics service thus requires a socket to be opened specially for the tracking call, which is made on each web-page. Our observations so far seem to indicate that these calls are relatively slow, so we estimated that this socket was often occupied for its full time. For this reason we decided to drop Google Analytics, hoping to gain back the 25% of connection space (one socket out of four).

An account was opened in Google Analytics (username IntergeoGA@gmail.com, user code UA-6685035-1). Two obvious advantages of Google Analytics are that it is easy to configure and that it provides attractive reports with nice graphics. However, in addition to the slowdown it causes, it did not have the information necessary for the fine grained analysis we need. For instance, we attempted to configure a "conversion", a sequence of URLs that would detect the edition and saving of a resource, but it did not work: clicks in the old Currikulum Builder (now obsolete) were handled by Javascript within the same page instead of visiting new pages with different URLs, so the sequence of pages visited (clickstream) did not contain enough information to distinguish this event from others, such as editing a resource without saving it. It was suggested that this problem could be fixed by "hacking" the original Curriki code so that additional messages would be sent to Google Analytics, but it was considered that using AWStats would be easier to implement, less problematic in the future, and would allow for faster page download, which was our real concern.

Data provided by Google Analytics includes, among others, referrer (we are getting most of our new visits through Google.com), browser, screen size, and IP, from which location can be deduced.

The Apache HTTPD server used on the i2geo platform itself produces logs in traditional access_log files. This is the source we have used to compute the indicators provided in the progress reports of Intergeo, by simple filtered counting with the grep and wc Unix commands. For example, there were 363737 hits from January to September 2007.
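The filtered counting described above can be sketched in a few lines of Python. This is only an illustration of the `grep pattern access_log | wc -l` idea: the function name `count_matching` is hypothetical, and the three sample lines (with a documentation-only IP address) are invented for the example and are not taken from the i2geo logs.

```python
import re
from typing import Iterable


def count_matching(lines: Iterable[str], pattern: str) -> int:
    """Count the lines that match a regular expression (grep ... | wc -l)."""
    rx = re.compile(pattern)
    return sum(1 for line in lines if rx.search(line))


# Invented sample lines in the traditional access_log layout.
sample = [
    '192.0.2.1 - - [24/Nov/2009:16:31:19 +0100] "GET /xwiki/bin/view/Main/WebHome HTTP/1.1" 200 5120',
    '192.0.2.1 - - [24/Nov/2009:16:31:20 +0100] "GET /xwiki/bin/download/Coll_a/b/c.ggb HTTP/1.1" 200 20480',
    '192.0.2.1 - - [24/Nov/2009:16:31:21 +0100] "GET /favicon.ico HTTP/1.1" 404 213',
]

total_hits = count_matching(sample, r"")  # the empty pattern matches every line
download_hits = count_matching(sample, r'"GET /xwiki/bin/download/')
```

On a real log, the same function could be applied to an open file object instead of the `sample` list, since it only needs an iterable of lines.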

A generic web-log report based on these logs for the period from the platform opening till November 2009 can be seen at: http://stat.i2geo.net/awstats_old/

The Enhanced Web-Logging Infrastructure

Clearly, more was needed to obtain all the indicators and, unfortunately, the necessary implementations only happened in November 2009, hence we cannot offer here long historical runs of data as would be desirable.

A number of the indicators above cannot be tracked with the strategies described so far: for example, asset-saves were not differentiated from most other saves, including temporary ones; the external links were not trackable at all; and the content statistics were not included.

Several aspects that we wanted to measure could not be measured because the corresponding service did not exist in the i2geo platform until November 2009. Among others, the ability to play a construction within web-browsers was not there: one could only download the file and open it in a desktop player. The upgrade of the platform came along with the upgrade of the Curriki platform to its currently stable version, the 1.8 branch. This branch provided several fixes to issues we had, but also introduced the usage of the tracker javascript object described below.

We built the logging infrastructure split in three levels:

Enhanced Apache Logs

The first enhancement is of a low-level nature and enables the other processing, in particular the log pre-processing: the Apache logs are enhanced to contain the language of the delivered resource, the session-identifier (which allows tracking the actions of a single browser), a new field containing a duplicate of the date, which was needed for the AWStats configuration, and the %D field, a fine grained measure of the time taken to serve each request (in microseconds).

This is done through a simple configuration of the Apache virtual host with the following pattern (according to the Apache Httpd LogFormat directive):

LogFormat "%h %l %u %t \"%r\" %>s %b \
\"%{Referer}i\" \"%{User-Agent}i\" \
%D \"%{Content-Language}o\" \"%{JSESSIONID}C\" %{%Y_%m_%d}t"
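To illustrate how lines written in this format can be read back, here is a minimal Python sketch of a parser. The regular expression, the field names (`host`, `micros`, `session`, `day`, ...) and the function `parse_line` are assumptions made for this example and are not part of the i2geo tooling; the sample line is the first tracker call quoted in the Tracker Calls section of this document.

```python
import re
from typing import Optional

# One named group per field of the LogFormat pattern, in the same order.
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'(?P<micros>\d+) "(?P<language>[^"]*)" "(?P<session>[^"]*)" (?P<day>\S+)$'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one enhanced log line, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])  # %b logs "-" for 0
    rec["micros"] = int(rec["micros"])  # %D field
    return rec


SAMPLE = (
    '93.222.240.103 - - [24/Nov/2009:16:31:19 +0100] '
    '"GET /static/tracker.png/features/resources/add/Copy/Coll_cdording/'
    'Sommedesanglesduntriangle?time=Tue%20Nov%2024%202009%2016:31:42%20GMT+0100%20(CET) HTTP/1.1" '
    '404 1213 '
    '"http://i2geo.net/xwiki/bin/view/Coll_cdording/Sommedesanglesduntriangle'
    '?bc=;Coll_adminPolx.Mylittletestitems" '
    '"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; fr; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5" '
    '97836 "-" "-" 2009_11_24'
)

record = parse_line(SAMPLE)
```

The sketch assumes that no quoted field contains an unescaped double quote; lines that break this assumption simply yield `None` and can be counted separately.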

The latest logs (updated every two minutes) are reachable at http://stat.i2geo.net/raw-logs/access_log-tail (i2geo editor access needed).

Tracker Calls

The Curriki development team has, since almost its start, relied on the Google Analytics service. The Intergeo project, since the latest upgrade, considered it could gain in performance by avoiding this, relying on a server internal log instead.

However, the (optional) embedding of Google Analytics in Curriki comes with javascript logging-statements, since version 1.8, which are efficient to track browser-only actions such as the start or end of a copy resource action, or of a resource addition.

We have implemented a local version of the tracker javascript object, which launches requests as the Google tracker does; these requests are therefore recorded in the Apache logs. An example is the line indicating the start of a copy of a resource:

93.222.240.103 - - [24/Nov/2009:16:31:19 +0100] "GET /static/tracker.png/features/resources/add/Copy/Coll_cdording/Sommedesanglesduntriangle?time=Tue%20Nov%2024%202009%2016:31:42%20GMT+0100%20(CET) HTTP/1.1" 404 1213 "http://i2geo.net/xwiki/bin/view/Coll_cdording/Sommedesanglesduntriangle?bc=;Coll_adminPolx.Mylittletestitems" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; fr; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5" 97836 "-" "-" 2009_11_24

and its conclusion:

93.222.240.103 - - [24/Nov/2009:16:34:39 +0100] "GET /static/tracker.png/features/resources/add/Copy2/Document/Coll_adminPolx/acopyforme?time=Tue%20Nov%2024%202009%2016:35:02%20GMT+0100%20(CET) HTTP/1.1" 404 1201 "http://i2geo.net/xwiki/bin/view/Coll_cdording/Sommedesanglesduntriangle?bc=;Coll_adminPolx.Mylittletestitems" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; fr; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5" 14751 "-" "-" 2009_11_24

Even though it is not guaranteed that these calls are faithfully sent (because they require a connection which may be temporarily failing), this facility provides the possibility for a very fine grained logging at the place the closest to where the user actually starts and ends his actions. Most of the content modification actions in Curriki are, indeed, based on dialogs run in several stages and piloted by a javascript layer.

Indicators such as the number of resource creations, updates, duplications, and deletions can easily be tracked by counting lines like the examples above.
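Counting such tracker lines can be sketched as follows in Python. The function `tracker_event` and the choice of keeping the first four path segments as the event name are assumptions made for this example, based only on the two sample lines above; the two tracker requests are copied from those lines, and the third request is an invented non-tracker request.

```python
from collections import Counter
from typing import Optional

TRACKER_PREFIX = "/static/tracker.png/"


def tracker_event(request: str, depth: int = 4) -> Optional[str]:
    """Return the tracked event name of a request line, or None for other requests."""
    parts = request.split(" ")
    if len(parts) < 2:
        return None
    path = parts[1].split("?", 1)[0]  # drop the ?time=... query string
    if not path.startswith(TRACKER_PREFIX):
        return None
    segments = path[len(TRACKER_PREFIX):].split("/")
    return "/".join(segments[:depth])  # assumed: the first segments name the action


requests = [
    "GET /static/tracker.png/features/resources/add/Copy/Coll_cdording/"
    "Sommedesanglesduntriangle?time=Tue%20Nov%2024%202009%2016:31:42%20GMT+0100%20(CET) HTTP/1.1",
    "GET /static/tracker.png/features/resources/add/Copy2/Document/Coll_adminPolx/"
    "acopyforme?time=Tue%20Nov%2024%202009%2016:35:02%20GMT+0100%20(CET) HTTP/1.1",
    "GET /xwiki/bin/view/Main/WebHome HTTP/1.1",  # invented, not a tracker call
]

events = Counter(e for e in map(tracker_event, requests) if e is not None)
```

Note that both sample lines carry the HTTP status 404, so a count of tracker calls should not be restricted to requests answered with status 200.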

Logs Pre-Processing

In order to insert into the Apache logs the content statistics of each day, and in order to inject deduced information into the logs, the raw access logs of Apache are pre-processed. The pre-processor is a program which reads the raw access logs and does the following:

  • inject in the log, at the time it's taken, a line with each content-indicator, the amount being reported as the volume
  • TODO:Marion: (in process) replace the field number xxx with the longitude and latitude of the calling browser
  • replace the field number xxx-TODO-xxx of the log with the name of the logged-user, read by associating the login action with the session identifier (??? thinkme… useful?? I believe it would help helpers actually track what's being done by a helpee)
The result is an access-log which can be processed by classical log-analysis tools. The latest pre-processing result, run each day, can be seen at http://stat.i2geo.net/target-logs-for-view (i2geo editor access needed).
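The injection of a content-indicator into the log can be pictured with the following Python sketch. It only illustrates the idea of reporting the amount in the volume (bytes) field of a synthetic line in the enhanced format: the function name `indicator_line`, the host `127.0.0.1` and the path `/content-indicator/...` are hypothetical choices made for this example, and the actual layout of the lines injected by the i2geo pre-processor is not specified here.

```python
from datetime import datetime, timedelta, timezone

# Month abbreviations as used in Apache timestamps (independent of the locale).
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def indicator_line(name: str, amount: int, when: datetime) -> str:
    """Build a synthetic log line whose bytes field carries a content count."""
    stamp = "{:02d}/{}/{:04d}:{:02d}:{:02d}:{:02d} {}".format(
        when.day, MONTHS[when.month - 1], when.year,
        when.hour, when.minute, when.second, when.strftime("%z"))
    day = when.strftime("%Y_%m_%d")  # duplicate of the date, as in the LogFormat
    # Hypothetical host and path; the amount is written where %b normally goes.
    return ('127.0.0.1 - - [{}] "GET /content-indicator/{} HTTP/1.1" 200 {} '
            '"-" "-" 0 "-" "-" {}').format(stamp, name, amount, day)


when = datetime(2009, 11, 24, 0, 0, 0, tzinfo=timezone(timedelta(hours=1)))
line = indicator_line("resources", 1823, when)
```

A log analyzer that sums the volume per URL would then report the count of each indicator for the day on which the line was injected.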

AWStats on Enhanced Logs

The pre-processing produces access-logs which combine all the indicators mentioned above and which just need to be read. In order to make them readable, we employed the classical AWStats system: it is an open-source set of Perl scripts that produces a graphical analysis of the network accesses. This report is visible at http://stat.i2geo.net/awstats/ (i2geo editor access needed).

This report collects information from the last days and months and presents a few classical views of them.

The particular indicators that interest us are, at the time of this writing, extra sections of the AWStats report: they list the last values for the ongoing month.

The extra section feature of AWStats was not originally designed to provide historic data, but rather lists of most visited places, so we had to make a few modifications to its code (AWStats is open source and written in Perl). The changes implemented are:

  • Make the list of results sortable by date. In order to do this, sorttable.js has been used (details available at http://www.kryogenix.org/code/browser/sorttable/).
  • Add a new field, LOGLINE, to check for unusual conditions, in particular the HTTP method.
  • Replace the OR operator in the counting condition with AND.
It is the intention of the i2geo development team to share this adaptation with the AWStats community by posting the code on the AWStats Enhancements and Extensions page, http://www.antezeta.com/blog/awstats (our code has already been submitted).

Concrete Indicators Implementation

In this section we provide the technical details that enable the statistics of each of the indicators of interest.

Content indicators

Counting the content is done through queries to the database or to the index of the platform. We chose to use the index for its ease of access. In this report we provide the query executed for each count. Each of these queries is run every day and its result is saved in a text file which the log pre-processor uses.

The electronic version of this document presents these search results as hyperlinks which, when clicked by an Intergeo editor, list all the documents that have been counted.

One should note that some of these totals do not add up to the total of their parent section. Most of the time, this is due to invalid content; it may also be related to types that we do not consider in this list (for example PDF or Word files), as well as to resources which are marked as private by their owner, typically because they are not ready for public view.

TODO: Paul: I suppose we need to externalize this away from XWiki into a static document though… I don't feel it's necessary until TeX-ification.

Usage Indicators

From the access logs, possibly pre-processed, we can compute:

  • number of plays: every time a user plays a construction in his browser, we count it. In the access_log, it is recorded by lines like: "GET /xwiki/bin/view/Systems/Display**".
  • number of downloads: if a user downloads the attachments of a resource, we recognize this by counting "/xwiki/bin/download/Coll_firstname/title_of_resource/name_of_attachment?force%2Ddownload=1".
  • number of search queries (not implemented)
  • number of external links followed: external links are redirected to "/xwiki/bin/view/Main/ExternalLink?url=external_link"; counting every appearance of this line in the access_log therefore gives the number of external links followed.
Usage of Reviews:
  • number of review creations: a new review starts with a request of the form "GET /xwiki/bin/view/QF/CreateReview?resource=Coll_*.* HTTP/1.1".
  • number of review editions: edits of reviews can be counted by "GET /xwiki/bin/lock/QR/Coll_**__*_0_**?action=inline&ajax=1& HTTP/1.1".
  • number of review deletions: every deletion of a review is shown in the access_log by "GET /xwiki/bin/delete/QR/Coll_*__*_*?confirm=1&language=de HTTP/1.1".
Usage of Resources:
  • number of resource creations: new resources are shown by "PUT /xwiki/curriki/assets/AssetTemp".
  • number of resource modifications: if a user decides to modify one of his resources, it leads to a "POST /xwiki/bin/view/CurrikiCode/AssetSaveService" in the access_log.
  • number of resource copies: users can copy resources of other users in order to adapt them to their own needs. As mentioned above, a tracker is used to recognize this action. The lines look like:
"GET /static/tracker.png/features/resources/add/Copy/MyCurriki/Contributions?time=Mon%20Nov%2016%202009%2009:13:03%20GMT+0100%20(CET) HTTP/1.1"
  • number of resource additions: resources can be added to a collection or a group; this is counted by "GET /xwiki/curriki/assets/Coll_*.*/subasset?_dc*"
  • number of resource deletions: "GET /xwiki/bin/view/XWiki/DeleteDocument?confirm=*" shows a deletion.
  • number of page editions: not only resources can be edited; there are other pages too, for example group documentation. The user can finish his modification either with a "save and view" button or with a "save and continue" button. Therefore we count two types of lines:
    "GET /xwiki/bin/lock/*"
    "POST /xwiki/bin/save/*"
  • number of group creations: a group creation is shown by "POST /xwiki/bin/view/Groups/CreateNewGroup".
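
The patterns listed above can be turned into a small classifier. The following Python sketch is one possible way to do so and is not the configuration actually used with AWStats: the indicator names, the rule table `RULES` and the first-match policy are assumptions made for this example, and the sample requests are built from the patterns above with placeholder names.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# (indicator name, pattern searched in the request line); the first match wins.
RULES = [
    ("plays", re.compile(r"^GET /xwiki/bin/view/Systems/Display")),
    ("downloads", re.compile(r"/xwiki/bin/download/.*force%2Ddownload=1")),
    ("external_links", re.compile(r"/xwiki/bin/view/Main/ExternalLink\?url=")),
    ("review_creations", re.compile(r"^GET /xwiki/bin/view/QF/CreateReview\?resource=Coll_")),
    ("review_editions", re.compile(r"^GET /xwiki/bin/lock/QR/Coll_")),
    ("review_deletions", re.compile(r"^GET /xwiki/bin/delete/QR/Coll_")),
    ("resource_creations", re.compile(r"^PUT /xwiki/curriki/assets/AssetTemp")),
    ("resource_modifications", re.compile(r"^POST /xwiki/bin/view/CurrikiCode/AssetSaveService")),
    ("resource_deletions", re.compile(r"^GET /xwiki/bin/view/XWiki/DeleteDocument\?confirm=")),
    ("group_creations", re.compile(r"^POST /xwiki/bin/view/Groups/CreateNewGroup")),
    ("page_editions", re.compile(r"^(GET /xwiki/bin/lock/|POST /xwiki/bin/save/)")),
]


def classify(request: str) -> Optional[str]:
    """Return the name of the first indicator whose pattern matches, else None."""
    for name, rx in RULES:
        if rx.search(request):
            return name
    return None


def count_indicators(requests: Iterable[str]) -> Counter:
    """Count the requests per usage indicator, ignoring unclassified ones."""
    return Counter(n for n in map(classify, requests) if n is not None)


sample_requests = [
    "GET /xwiki/bin/view/Systems/DisplayExample HTTP/1.1",  # placeholder page name
    "GET /xwiki/bin/download/Coll_firstname/title_of_resource/"
    "name_of_attachment?force%2Ddownload=1 HTTP/1.1",
    "GET /xwiki/bin/lock/QR/Coll_a__b_0_c?action=inline&ajax=1& HTTP/1.1",
    "POST /xwiki/bin/save/Coll_a/b HTTP/1.1",
    "GET /xwiki/bin/view/Main/WebHome HTTP/1.1",  # matches no rule
]

counts = count_indicators(sample_requests)
```

With the first-match policy, a lock request on a review is counted as a review edition and not as a generic page edition; whether the two counts should overlap is a design decision that the list above leaves open.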

Estimating the High-level Indicators

Content aggregated

This indicator is the amount of interactive geometry resources that are stored in the platform, the first content-indicator to be implemented.

It should be noted that strong differences appear in the nature of resources that are stored in i2geo: some are simple interactive geometry files which are easy to re-use, count, and identify, some are archives containing multiple activities (sometimes of different levels), some are zipped courses, and some are external links to single resources or to big collections of interactive geometry constructions for which the number of constructions is not asked. For this reason, we have refined this indicator by subdividing by type of resource.

This number has undergone an important change. During the first months, the first-steps-phase, we allowed a trace to report its item count, and we had some 3,500 resources. We stopped this count, however, because it was not in the model of Curriki and because we expected the users to migrate their resources to internal resources, so that they would cease to be just links. We did, indeed, request everyone to convert their traces. Since this move, the number of i2geo resources has been displayed on the front page; we had ~300 resources in October 2008, about 1000 in March 2009, and 1823 resources today. Currently we have 1182 external links.

Increase in Access

This can be partially measured by counting the number of hits on the web-pages of resources (until Nov 2009); since then, we also have new data on browser-play, which can be further subdivided.

A closely related measure is the total number of web-page hits, for which the global history for this year gives the following graphics:

Increase in Reuse

Measuring reuse is only partially possible: indeed, reuse outside the platform is certainly considerable (e.g. download, modify, and later upload). As long as the users do not find i2geo the easiest place to start reusing a resource, the estimates of reuse remain fuzzy.

Three major counts can measure reuse:

  • the copy-action is a perfect reuse-for-appropriation: it allows an i2geo user to first copy the resource into his own space, keeping track of its origin, then to modify it as he wishes and make it visible as a contribution of his own.
  • the collection inclusion is a perfect reuse-as-is: it allows an i2geo user to simply include a resource within a collection of resources, such as a lesson plan. The number of collections is now measured.
  • linking directly to the browser player is a perfect form of reuse-by-linking: with it, the user of any web-server, for example a school learning management system, can invite his students to use the interactive construction file by simply linking to the player.
These three counting methods were not available before November 2009, thus we cannot provide data.

QA Resources

The number of reviews that users file by using the i2geo review system is probably the best indicator to measure the number of QA resources. It is one of the content indicators which we have listed only since November 2009.

Registered Web Site Users

The number of users registered on the platform is a simple content indicator and it is displayed on the front page of i2geo. It was 200 at the end of Year 1 of the project and is 593 as of this writing.

Curriculum Mapping Countries

This indicator is not part of the indicators mentioned above and, indeed, none of the statistics can lead to it. This is because the approach taken to cross-curriculum-search has been a practical approach to a research problem on which no other project has been working (to our surprise) and for which only textual resources are available.

The number of educational regions encoded in the internationalized ontology GeoSkills went beyond the number of European countries very early. However, the number of regions for which the details (educational pathways and educational levels) are encoded has stayed close to the number of curriculum texts encoded in the platform, that is, the educational regions where a complete curriculum has been or is being encoded and hyperlinked in the sense of the report on curriculum-categorisation. This list is that of the project partners (France, Germany, Spain, Holland, Czech Republic, Luxembourg), while it started as the list of members of the curriculum-categorization work-package of the project (France, Spain, Holland).

It is probable that the task of curriculum encoding remains an enterprise of great depth for which the Intergeo practice has been a pioneer practice which only demands generalization.

School Coverage

This indicator attempts to measure the number of schools aware of interactive geometry in Europe. None of the indicators above tackles this problem; basically, IP addresses are used to put accesses on a map. The platform development team is currently investigating the usage of Maxmind GeoIP, which we could actually leverage for other services (e.g. distribution of points for a given educational level, or a given set of topics).

Conclusion

In this report we have described the achievements in generic logging capabilities, how we have refined these, and how we can estimate the high-level indicators described in the Intergeo description of work.

Several other indicators could be extracted with our new infrastructure as described above (Lucene or Hibernate queries for the content, AWStats and Apache logs injection for the tracking):

  • The number of resources attached as favourites which would enable us to measure the amount of preference marking.
  • Depth of collections: we already count the number of folders and collections, which measures the content organization. We could also count sub-folders, sub-sub-folders, etc., which would reflect how many people are doing some serious structuring and organization of contents.
  • Also, it would be interesting to know whether a small number of users has produced a lot of resources; a histogram of the number of resources per user could throw some light on this. In fact, the same could be applied to all events: how many resource creations each user makes, etc. This indicator could be called the productivity by user.
  • The distribution of resources per applicable educational region and educational level.
Discussions and usage of the logging infrastructure will show whether such indicators (or others) are useful and/or needed.

This report has shown the growing usage of the i2geo platform and has described the many indicators that we are now able to extract about the platform usage. We have good hopes of seeing some indicators become a major sign of the quality usage of the platform:

  • the number of QA resources
  • the number of resource uploads and their editions: isolated file-based-resources which, when uploaded to the platform, have a larger guarantee of survival than on many other sites (see D6.2 for an argumentation of this form as best practice)
  • the number of direct usages such as the invocation of the play function
One avenue worth exploring to stimulate the raising of these indicators is showing these statistics to the users, or publishing them, in the belief that "activity calls activity". This practice has been exercised since the very start for very general counts viewable on the front page in the "watch i2geo grow" panel. We shall probably add to this panel the number of reviews and links, a view of the counts of the files for each file type, and the reviews for each score.