This page gives an overview of the Arabic Keyphrase Extraction Corpus (AKEC) built by Muhammad Helmy, Marco Basaldella, Eddy Maddalena, Stefano Mizzaro and Gianluca Demartini.
The corpus is freely available for download at this address: https://github.com/ailab-uniud/akec.
The corpus and the process we used to build it are described in detail in the paper "Towards Building a Standard Dataset for Arabic Keyphrase Extraction Evaluation", presented at the 20th International Conference on Asian Language Processing (IALP 2016), held in Tainan, Taiwan, from November 21 to 23, 2016.
The corpus consists of 160 Arabic documents and their keyphrases. These keyphrases were obtained through a large-scale crowdsourcing experiment.
For every document, 10 different workers each extracted a set of 10 keyphrases. Workers could analyze more than one document, but they could not process the same document twice.
Therefore, for every document we collected 100 keyphrases extracted by 10 different workers. Workers were allowed to select any phrase as a keyphrase and were unaware of each other's work. For this reason, the number of distinct keyphrases for each document is smaller than 100, since several workers could select the same keyphrase.
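To illustrate why the set of distinct keyphrases is smaller than the number of collected keyphrases, here is a minimal Python sketch that merges per-worker selections and counts how many workers chose each phrase. The function name, the whitespace normalization, and the sample phrases (English placeholders, three workers with three phrases each) are assumptions made for this example; they do not reflect the corpus file format or the processing described in the paper.

```python
from collections import Counter


def aggregate_keyphrases(worker_selections):
    """Count, for each distinct keyphrase, how many workers selected it.

    worker_selections: one list of keyphrases per worker (hypothetical layout).
    """
    counts = Counter()
    for selection in worker_selections:
        # Strip surrounding whitespace and count each phrase once per worker.
        counts.update({phrase.strip() for phrase in selection})
    return counts


# Toy data: 3 workers x 3 keyphrases (the corpus uses 10 x 10 per document).
worker_selections = [
    ["machine learning", "neural network", "data"],
    ["data", "machine learning", "corpus"],
    ["corpus", "data", "evaluation"],
]

counts = aggregate_keyphrases(worker_selections)
total_collected = sum(len(selection) for selection in worker_selections)

print(f"collected: {total_collected}, distinct: {len(counts)}")
for phrase, votes in counts.most_common():
    print(f"{votes} worker(s): {phrase}")
```

In this toy input, 9 collected keyphrases collapse to 5 distinct ones. The per-phrase counts could also serve as a rough agreement signal, although how agreement should be used for evaluation is left to the paper.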
We selected the documents from 4 different sources:
The documents belong to different categories:
The mean time a worker spent extracting 10 keyphrases from a document is 302 seconds (5 min 2 s), while the median time is 222 seconds (3 min 42 s).
We performed the whole task in two sessions, each of which required about 12 hours. The sessions were carried out on February 3 and February 6, 2016, respectively.
We executed the experiment on CrowdFlower.com.
The whole experiment cost $134, of which $112 went to paying workers and $26 to platform fees. Every worker earned $0.07 for analyzing a single document. Since several workers analyzed more than one document, the average amount a worker earned was $0.49.
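The worker payout follows directly from the figures above, as this short Python sketch shows. It uses integer cents to avoid floating-point rounding. The variable names are ours, and the worker count it derives is only an estimate based on the rounded average earning, not a figure reported on this page.

```python
DOCUMENTS = 160
WORKERS_PER_DOCUMENT = 10
PAY_PER_DOCUMENT_CENTS = 7      # $0.07 per analyzed document
AVERAGE_EARNED_CENTS = 49       # $0.49 average per worker

# One assignment = one worker analyzing one document.
assignments = DOCUMENTS * WORKERS_PER_DOCUMENT

# Total paid to workers, in cents.
worker_payout_cents = assignments * PAY_PER_DOCUMENT_CENTS

# Average number of documents each worker analyzed.
avg_documents_per_worker = AVERAGE_EARNED_CENTS / PAY_PER_DOCUMENT_CENTS

# Rough estimate only: derived from the rounded $0.49 average.
estimated_workers = round(assignments / avg_documents_per_worker)

print(f"assignments: {assignments}")
print(f"worker payout: ${worker_payout_cents / 100:.2f}")
print(f"average documents per worker: {avg_documents_per_worker:.1f}")
print(f"estimated number of workers: {estimated_workers}")
```

The result matches the $112 paid to workers and implies that a worker analyzed 7 documents on average.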