The corpus

The corpus consists of 160 Arabic documents and their keyphrases. The keyphrases were obtained through a large-scale crowdsourcing experiment.

Each document was processed by 10 different workers, each of whom extracted a set of 10 keyphrases. Workers could analyze more than one document, but no worker could process the same document twice.

Therefore, for every document we collected 100 keyphrases extracted by 10 different workers. Workers were allowed to select any phrase as a keyphrase and were unaware of each other's work. For this reason, the number of distinct keyphrases for each document is usually smaller than 100, since different workers could select the same keyphrase.
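
The overlap between workers can be measured with a few lines of Python. The following is only a minimal sketch: the input layout (one list of phrases per worker), the function name `distinct_keyphrases`, and the whitespace-only normalization are assumptions made for illustration, not the format or procedure used for this corpus. English placeholder strings stand in for Arabic keyphrases.

```python
from collections import Counter

def distinct_keyphrases(worker_sets):
    """Return, for one document, how many workers chose each distinct keyphrase.

    worker_sets: one list of keyphrase strings per worker (hypothetical layout).
    """
    counts = Counter()
    for phrases in worker_sets:
        # Collapse repeated whitespace; a worker counts at most once per phrase.
        normalized = {" ".join(p.split()) for p in phrases}
        counts.update(normalized)
    return counts

# Toy data: 3 workers with 3 keyphrases each (made up, not taken from the corpus).
toy_document = [
    ["economy", "labour market", "unemployment"],
    ["unemployment", "economy", "education"],
    ["labour  market", "health", "economy"],
]

counts = distinct_keyphrases(toy_document)
total_selected = sum(len(phrases) for phrases in toy_document)

print(total_selected)  # 9 phrases selected in total
print(len(counts))     # 5 distinct phrases after merging duplicates
```

On real data, stricter normalization (for example, unifying Arabic letter variants or removing diacritics) would merge more phrases; whether that is desirable depends on the intended use.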

We selected the documents from 4 different sources:

The documents belong to different categories:

The creation of the corpus

In total, 226 unique workers extracted 16,000 keyphrases from 160 Arabic documents. Below we present some statistics and details about the collection process.

Graphical Interface

The following image shows the graphical interface we provided to the workers.


The mean time a worker spent extracting 10 keyphrases from a document was 302 seconds (5 min 2 s), while the median time was 222 seconds (3 min 42 s).

We ran the whole task in two sessions of about 12 hours each, held on 3 and 6 February 2016.


A total of 226 unique workers from 16 different countries participated in the project. On average, every worker analyzed 7.07 documents.
The chart shows the countries of origin of the 16,000 collected keyphrases (kps).
The chart shows the countries of origin of the 226 unique workers.


We executed the experiment on

The whole experiment cost $134, of which $112 went to paying workers and $26 to platform fees. Every worker earned $0.07 for each document analyzed. Since several workers analyzed more than one document, the average amount earned per worker was $0.49.
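
The worker payment figures follow from the task design, as the short check below illustrates. It is a sketch only: the constant names are made up for illustration, and the average of 7.07 documents per worker is taken as reported above rather than recomputed.

```python
DOCUMENTS = 160
WORKERS_PER_DOCUMENT = 10
PAY_PER_DOCUMENT_CENTS = 7       # $0.07 per analyzed document
AVG_DOCUMENTS_PER_WORKER = 7.07  # average reported in this document

# Every document was analyzed by 10 workers, so there were 1,600 paid assignments.
assignments = DOCUMENTS * WORKERS_PER_DOCUMENT
worker_pay_dollars = assignments * PAY_PER_DOCUMENT_CENTS / 100
avg_earning_dollars = AVG_DOCUMENTS_PER_WORKER * PAY_PER_DOCUMENT_CENTS / 100

print(assignments)                   # 1600
print(f"{worker_pay_dollars:.2f}")   # 112.00
print(f"{avg_earning_dollars:.2f}")  # 0.49
```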