Interoperable interactive geometry for Europe

This platform is brought to you by the intergeo project, funded under the eContent Plus programme of the European commission and by partners

Query Expansion Description

The objective of this page is to describe to a curriculum encoder how a query is formed from the words typed by a user into the search field, so that the encoder understands the importance of providing enough semantics to the competencies and topics within GeoSkills.

The query process aims at retrieving the resources that correspond to a query text, on the basis of the record that describes each resource. It is a two-phase process. First, the user types words into a field that returns corresponding topics or competencies (or text). Then, the resources that correspond to the selected topic or competency (or text) are found.

The first phase uses the skillsTextBox facility, which searches for names in the GeoSkills ontology to define the search elements.

The second phase uses Lucene technology as a black box to process a query from the search element to the resources.

The query process therefore depends on the Lucene data structures and API (Application Programming Interface), on the query itself, and on the resource metadata (resource annotation).

This page describes how the metadata is filled in for future queries and how the query is constructed in the second phase.


A record is a description of a resource. It is made of a series of fields. In Lucene terminology, each record corresponds to a “document” and represents the resource for the query process.
A field contains a string, which Lucene considers to be a sequence of words.
A word is an element of a string that Lucene recognises as atomic and separate from the others. Stopwords are usually eliminated. A word can be either a simple word or a URI.
A query is a set of clauses. It tells Lucene to search for this set of clauses in the document database (an OR of these clauses).
A clause is composed of a field, a word and a weight. It tells Lucene to search for this word in this field with this weight.

The text of a resource is taken as the full text of the resource: its title, its page name (as in the URL), its description, its authors, … For some of the attached documents, the text is also extracted (HTML, DOC and PDF for now).
Text fragment
A text fragment is a word taken from a text.


An example of a field is ft.stemmed, which contains the full text of the resource, its authors, etc., with plurals and inflections removed.

Let us take an example resource. For the French resource Introduction Thales, the full text includes the words: activite pour introduire la petite propriete de thales activitethales.pdf et une animation pour illustrer la propriete de proportionnalite des longueurs dans un triangle thales.ggb exercices preparatoires completer des tableaux de proportionnalite exercicesthales.pdf.

Another example is the field trainedTopicsAndCompetencies which contains the URIs of competencies and topics. Each URI is a word.

For the same resource it contains the words #InterceptTheorem #Apply_ratio_and_proportion #Proportional_r #identify-ratio #Calculate_missing_values_in_a_table_of_proportional_data #Know_proportionality_of_sidelengths_of_triangles_in_particular_configuration .

A query could be for the topic #Proportional_r, or for the words longueurs and thalès in the field ft.stemmed, with weights = 100%.
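
The vocabulary above (record, field, word, query, clause) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the platform's code: the Clause class and the matching_clauses helper are hypothetical names, the ft.stemmed words are copied unstemmed from the full text quoted above, and the clause for cercle is added purely to show a clause that does not match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """A clause: search for `word` in `field` with `weight` (1.0 = 100%)."""
    field: str
    word: str
    weight: float


# A record ("document") maps each field name to the words it contains.
# Simplified excerpt of the example resource; words are not really stemmed.
record = {
    "ft.stemmed": ["propriete", "thales", "longueurs", "triangle"],
    "trainedTopicsAndCompetencies": ["#InterceptTheorem", "#Proportional_r"],
}

# A query is a set of clauses combined with OR.
query = [
    Clause("trainedTopicsAndCompetencies", "#Proportional_r", 1.0),
    Clause("ft.stemmed", "longueurs", 1.0),
    Clause("ft.stemmed", "cercle", 1.0),  # illustrative clause with no match
]


def matching_clauses(query, record):
    """Return the clauses whose word occurs in the named field of the record."""
    return [c for c in query if c.word in record.get(c.field, [])]


matched = matching_clauses(query, record)
```

Here the record matches the first two clauses, which is enough for it to be retrieved, since the query is a disjunction.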



The system indexes the words that are present within a field of a record (= resource = “document”). That is, for each word found in a record, it creates an entry in a table and a link to each record that contains it.
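
The table described above is usually called an inverted index. A minimal sketch, assuming records are plain dictionaries and using made-up record identifiers (r1, r2); Lucene's real index structures are considerably more elaborate:

```python
from collections import defaultdict


def build_index(records):
    """Map each (field, word) pair to the set of record ids containing it."""
    index = defaultdict(set)
    for record_id, fields in records.items():
        for field, words in fields.items():
            for word in words:
                index[(field, word)].add(record_id)
    return index


# Hypothetical records, keyed by a made-up identifier.
records = {
    "r1": {"ft": ["triangle", "thales"]},
    "r2": {"ft": ["triangle", "cercle"]},
}
index = build_index(records)
```

Looking up ("ft", "triangle") then gives both records directly, without scanning every record at query time.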

Filling the resource fields

The resource fields are filled with words that Lucene will use to index and retrieve the resources corresponding to a query.

  • the field ft is filled with the words of all the text found for the resource, while ft.stemmed is filled with the same words after stemming (that is, each word is reduced to its stem in the given language, so that doing becomes do).
  • the field trainedTopicsAndCompetencies of a resource contains the URIs of the GeoSkills nodes (competencies or topics) that this resource refers to. Each of these URIs is considered a word and is indexed by Lucene.
  • the field eduLevelFine contains region and pathway nodes of GeoSkills.
The last two fields are manually authored for each resource.

Two other fields are filled in automatically, in preparation for query expansion. They are not visible to the user, who cannot edit them.

  • the field inferredTopicsAndCompetencies is filled with the competency and topic nodes inferred from those in trainedTopicsAndCompetencies.
    • for a competency, the inferred nodes are its included topics and all their ancestors (including Topic)
    • for a topic, the inferred nodes are all its ancestor topics
  • the field inferredEduLevelFine is filled with the region and pathway nodes inferred from those in eduLevelFine. It contains all the ancestors of these nodes in GeoSkills.
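
The inference rules for inferredTopicsAndCompetencies can be sketched as follows. The ontology fragment is entirely hypothetical (the node names, the parents map and the included_topics map are invented for illustration and are not taken from GeoSkills); only the two rules themselves come from the description above.

```python
def ancestors(node, parents):
    """Return every ancestor of `node`, following the parent links upwards."""
    found = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def inferred_nodes(node, parents, included_topics):
    """Competency: its included topics plus all their ancestors.
    Topic: all its ancestor topics."""
    if node in included_topics:  # the node is a competency
        inferred = set()
        for topic in included_topics[node]:
            inferred.add(topic)
            inferred |= ancestors(topic, parents)
        return inferred
    return ancestors(node, parents)


# Hypothetical ontology fragment: topic -> parent topics.
parents = {
    "#Right_triangle": ["#Triangle"],
    "#Triangle": ["#Polygon"],
    "#Polygon": ["#Topic"],
}
# Hypothetical competency -> topics it includes.
included_topics = {"#Construct_right_triangle": ["#Right_triangle"]}

topic_inferred = inferred_nodes("#Right_triangle", parents, included_topics)
competency_inferred = inferred_nodes(
    "#Construct_right_triangle", parents, included_topics
)
```

Because every chain of ancestors ends at the root, high-level nodes such as Topic end up in the inferred field of a great many resources.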

First phase, defining the search "words"

Simple search.

Simple queries are input using skills-text-box. The set of words typed by the user provides

  • a text composed of the typed words
  • several corresponding GeoSkills nodes
After the user has entered words, skills-text-box suggests possible completions, which will define the query: the user can choose to query for this text or for a (unique) GeoSkills node.

The GeoSkills nodes are retrieved by skills-text-box via Lucene, using the node names.

For example, if the user text is "Right-angled triangle", skills-text-box suggests

  • "Right-angled triangle"
  • Topic "right-angled triangle"
  • Topic "triangle"
  • Topic "isosceles triangle"
  • Topic "right angle"
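
To give a feel for how such suggestions can arise, here is a rough sketch that ranks node names by the number of words they share with the user text. It is a deliberately simpler technique than the Lucene lookup that skills-text-box actually performs, and the list of node names, including the non-matching "circle", is assumed for the example.

```python
import re


def tokens(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def suggest(user_text, node_names):
    """Offer the raw text first, then node names sharing at least one token,
    ordered by the number of shared tokens (ties keep their input order)."""
    wanted = tokens(user_text)
    scored = [(len(wanted & tokens(name)), name) for name in node_names]
    hits = [(count, name) for count, name in scored if count > 0]
    hits.sort(key=lambda pair: -pair[0])  # stable sort
    return [user_text] + [name for _, name in hits]


# Assumed topic names for the example.
node_names = [
    "triangle",
    "right-angled triangle",
    "circle",
    "isosceles triangle",
    "right angle",
]
suggestions = suggest("Right-angled triangle", node_names)
```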

Advanced search.


Second phase, constructing the query (query expansion)

The search engine returns the records (“documents”) that match the query words (a GeoSkills node or a text).

From the selected GeoSkills node or query text, the server must now construct a query to send to Lucene. This query will determine the set of resources retrieved and their respective ranks.

First step of query expansion: constructing ranked queries

First case: the user has selected a Geoskills node.

  • queries for GeoSkills nodes of type topic or competency are turned into queries on the field trainedTopicsAndCompetencies
  • queries for GeoSkills nodes of type level, region*, and pathway* are turned into queries on the field eduLevelFine
Second case: the user has selected the text (set of words).

The clauses are converted into several others, in decreasing order of rank:

  • clauses of raw words in the title
  • clauses of raw words in the full-text
  • clauses of stemmed words (in the language of the user) in the title
  • clauses of stemmed words (in the language of the user) in the full-text
  • clauses of words in the full-text with error tolerance*
  • clauses of the text fragment in the three GeoSkills nodes that match best (as in skills-text-box) (to check with Paul)
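
The first four items of this list can be sketched as below. Everything numeric or named here beyond ft and ft.stemmed is an assumption: the field names title and title.stemmed, the four weights, and the toy_stem function (which only strips a final "s") are placeholders chosen to show the decreasing ranks, not the platform's values. The error-tolerant and GeoSkills-node clauses are left out.

```python
def toy_stem(word):
    """Placeholder stemmer: strips a final 's' from words longer than 3 letters."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# (field, weight, stemmed?) in decreasing order of rank; all values assumed.
RANKED_FIELDS = [
    ("title", 1.0, False),
    ("ft", 0.8, False),
    ("title.stemmed", 0.6, True),
    ("ft.stemmed", 0.4, True),
]


def expand_text(text):
    """Turn the user's text into (field, word, weight) clauses."""
    words = text.lower().split()
    clauses = []
    for field, weight, stemmed in RANKED_FIELDS:
        for word in words:
            clauses.append((field, toy_stem(word) if stemmed else word, weight))
    return clauses


text_clauses = expand_text("longueurs triangle")
```

Two typed words therefore yield eight clauses, from raw words in the title down to stemmed words in the full text.
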
Second step of query expansion: Knowledge exploitation on Geoskills nodes

For each query for a GeoSkills node, we now expand:

  • a clause for a topic/competency is enriched with a clause for this ontology node in the inferredTopicsAndCompetencies field (with 80% weight); this is the last query in the second case
  • a clause for an educational level is enriched with a clause for its ancestors, region and pathway in the inferredEduLevelFine field (with 50% weight)*
The result of the expansion is added to the query set of clauses.
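
A minimal sketch of the topic/competency expansion, representing clauses as (field, word, weight) tuples. The function name is hypothetical, and applying the 80% as a factor on the original clause weight is an assumption, since the text does not say whether the weight is relative or absolute. The educational-level expansion, marked as future work, is not shown.

```python
INFERRED_FIELD = "inferredTopicsAndCompetencies"
INFERRED_FACTOR = 0.8  # the 80% weight quoted above, assumed to be relative


def expand_node_clause(clause):
    """Keep the original clause and, for a topic/competency clause, add the
    same node in the inferred field with a reduced weight."""
    field, node, weight = clause
    expanded = [clause]
    if field == "trainedTopicsAndCompetencies":
        expanded.append((INFERRED_FIELD, node, weight * INFERRED_FACTOR))
    return expanded


node_clauses = expand_node_clause(
    ("trainedTopicsAndCompetencies", "#Proportional_r", 1.0)
)
```

A resource annotated directly with the node thus matches the stronger clause, while a resource annotated with a descendant node matches only the weaker one.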

Last step: extra weighting.

The set of clauses is sent to Lucene as a weighted disjunction (union of the clauses). Each clause is thus answered by an ordered list of matching documents, each with its own matching score: the more clauses of the set match a resource, the better the score the resource gets.

The last step of the query expansion adds clauses that only influence the score. They are called boosting clauses and are the following:

  • a clause for a resource of higher quality (with its own criteria weighting*)
  • a clause for a resource in the user's own language, or any other browser-accepted language
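
The effect of a weighted disjunction with boosting clauses can be approximated as follows. This is a simplified model and not Lucene's scoring formula: the score is just the sum of the weights of the matching clauses, a boosting clause is only counted when at least one ordinary clause matches (an assumption about what "only influence the score" means), and the language field name, the 0.3 boost and the four records are invented.

```python
def score(record, clauses, boosts):
    """Sum the weights of matching clauses; add boosts only if something matched."""
    base = sum(w for f, word, w in clauses if word in record.get(f, []))
    if base == 0:
        return 0.0
    bonus = sum(w for f, word, w in boosts if word in record.get(f, []))
    return base + bonus


def search(records, clauses, boosts):
    """Return the ids of matching records, best score first."""
    scored = [(score(rec, clauses, boosts), rid) for rid, rec in records.items()]
    hits = [(s, rid) for s, rid in scored if s > 0]
    return [rid for s, rid in sorted(hits, key=lambda pair: (-pair[0], pair[1]))]


clauses = [
    ("trainedTopicsAndCompetencies", "#Proportional_r", 1.0),
    ("inferredTopicsAndCompetencies", "#Proportional_r", 0.8),
]
boosts = [("language", "en", 0.3)]  # hypothetical language boost

records = {
    "r1": {"trainedTopicsAndCompetencies": ["#Proportional_r"], "language": ["fr"]},
    "r2": {"trainedTopicsAndCompetencies": ["#Proportional_r"], "language": ["en"]},
    "r3": {"inferredTopicsAndCompetencies": ["#Proportional_r"], "language": ["en"]},
    "r4": {"language": ["en"]},
}
ranking = search(records, clauses, boosts)
```

In this toy data the boost lifts r2 above r1 and r3 above r1, while r4, which matches only the boosting clause, is not retrieved at all.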

Results Presentation

The result of the search is a list of resources matching the query; the list is ordered from the best ranking to the lowest, according to the combined score of the query's clauses.

Currently, this scoring is affected by:

  • the clauses in different fields being given different coefficients (a title match counts more than a text match, etc.)
  • the inverse document frequency of each term, as explained in tf-idf, which means that uncommon names carry a lot of weight while overly common terms have almost no effect. In our case, this implies that high-level topics or competency processes have low weights because they will be present in the inferredTopicsAndCompetencies fields of many resources.
  • queries that are weight-oriented, such as the language preferences or the quality preferences
The score is presented in the search results as a gradient of squares from dark grey to almost white. The search results always start with the best ranking, and we expect users to be mostly interested in the first page of results, preferring to refine their query rather than browse the following pages. This ranking is a distinctive feature of a retrieval engine as opposed to an ontology or SQL-oriented matching process.
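
The inverse document frequency effect mentioned in the list above can be illustrated with the textbook formula idf = log(N / df), where N is the number of records and df the number of records containing the term. This is the classroom form, not necessarily the exact variant Lucene applies, and the four records are invented.

```python
import math


def idf(word, field, records):
    """Textbook inverse document frequency of `word` within `field`."""
    n = len(records)
    df = sum(1 for rec in records.values() if word in rec.get(field, []))
    return math.log(n / df) if df else 0.0


FIELD = "inferredTopicsAndCompetencies"
# Hypothetical records: every one of them has the root node in its inferred field.
idf_records = {
    "r1": {FIELD: ["#Topic", "#Triangle", "#InterceptTheorem"]},
    "r2": {FIELD: ["#Topic", "#Triangle"]},
    "r3": {FIELD: ["#Topic", "#Circle"]},
    "r4": {FIELD: ["#Topic"]},
}

root_idf = idf("#Topic", FIELD, idf_records)
rare_idf = idf("#InterceptTheorem", FIELD, idf_records)
```

The root node, present everywhere, gets a weight of zero, whereas the node found in a single record gets the highest weight, which is the behaviour described for high-level topics.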

Further material

(Legend: ??? means: to be checked; * means will be applied in the future.)

See it at work on the platform (normally also at the top of this page). Typing something in the text field on the right triggers the skills-text-box search; choosing a suggestion then queries for that choice, be it a text or a node. The expansion result can be seen under (details...).

You could also look at the Java source that does this process: