ABOUT THE CORPUS

The Arabic Learner Corpus (ALC) is a collection of written and spoken materials produced by learners of Arabic in Saudi Arabia.

 

The ALC data was collected in 2012 and 2013. It comprises 282,732 words in 1,585 materials (written and spoken), produced by 942 students of 67 nationalities and 66 different L1 backgrounds. The average text length is 178 words.

 

Here are more details about the corpus content:

 

Files format

Corpus partitions

Learners' gender

Data medium

Learners' study level

Learners' language nativeness

Text's genre

Place of data production

First languages with number of texts and words

Nationalities with number of texts and words

Five types of non-annotated files have been generated for the corpus:

(1) TXT files with no metadata

(2) TXT files with metadata in Arabic

(3) TXT files with metadata in English

(4) XML files with metadata in Arabic

(5) XML files with metadata in English

 

The metadata enables researchers to identify the characteristics of each text and its producer in every transcription, which adds depth to the data analysis.

 

The original handwritten sheets have been scanned and are also available as PDF files.

 

Audio recordings totalling 3 hours, 22 minutes, and 59 seconds, from learners who granted permission to publish their recordings online, are also available for download in MP3 format.

 

All corpus files are named using a scheme that indicates the basic characteristics of the text and its author (e.g. S038_T2_M_Pre_NNAS_W_C). The fields are, in order: student identifier number, text number, author gender, level of study, nativeness, text mode, and place of text production.
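
The naming scheme lends itself to automatic processing. The following is a minimal Python sketch, not part of the corpus tools, that splits a file name into its seven fields. The class name ALCFileName, the function parse_alc_name, and the field names are hypothetical, chosen here for illustration. The sketch also assumes that the fields are always separated by underscores and that a file extension, if present, can be dropped. The codes themselves (such as Pre or NNAS) are kept as-is rather than interpreted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ALCFileName:
    """The seven fields of an ALC file name, kept as raw codes (hypothetical class)."""
    student_id: str    # e.g. "S038"
    text_number: str   # e.g. "T2"
    gender: str        # e.g. "M"
    study_level: str   # e.g. "Pre"
    nativeness: str    # e.g. "NNAS"
    text_mode: str     # e.g. "W"
    place: str         # e.g. "C"


def parse_alc_name(name: str) -> ALCFileName:
    """Split an ALC file name into its fields, in the documented order."""
    # Assumption: any file extension (".txt", ".xml", ...) is not part of the scheme.
    stem = name.rsplit(".", 1)[0] if "." in name else name
    parts = stem.split("_")
    if len(parts) != 7:
        raise ValueError(f"expected 7 underscore-separated fields, got {len(parts)}: {name!r}")
    return ALCFileName(*parts)


info = parse_alc_name("S038_T2_M_Pre_NNAS_W_C")
print(info.student_id, info.study_level, info.nativeness)
```

Parsed this way, file names can be used to group or filter texts (for example by study level or text mode) without opening each file's metadata.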