ABOUT THE CORPUS
The Arabic Learner Corpus (ALC) is a project comprising a collection of written and spoken materials from learners of Arabic in Saudi Arabia.
The ALC data was collected in 2012 and 2013. It contains 282,732 words in 1,585 materials (written and spoken), produced by 942 students of 67 nationalities with 66 different L1 backgrounds. The average length of a text is 178 words.
Here are more details about the corpus content:
File formats
Corpus partitions
Learners' gender
Data medium
Learners' study level
Learners' language nativeness
Text genre
Place of data production
First languages with number of texts and words
Nationalities with number of texts and words
Five types of non-annotated files have been generated for the corpus:
(1) TXT files with no metadata
(2) TXT files with metadata in Arabic
(3) TXT files with metadata in English
(4) XML files with metadata in Arabic
(5) XML files with metadata in English
The metadata enables researchers to identify the characteristics of each text and its producer, which adds depth to the data analysis.
The original hand-written sheets are also available as scanned PDF files.
The audio recordings (3 hours, 22 minutes, and 59 seconds) of learners who granted permission to publish their recordings online are also available for download in MP3 format.
All corpus files are named according to a scheme that indicates the basic characteristics of the text and its author (e.g. S038_T2_M_Pre_NNAS_W_C). The fields are, in order: student identifier number, text number, author gender, level of study, nativeness, text mode, and place of text production.
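The naming scheme can be unpacked programmatically. The following is a minimal Python sketch, not part of the corpus distribution: the names `ALCFileName` and `parse_alc_name` are hypothetical, and it assumes that every file name has exactly seven underscore-separated fields, as in the example above, optionally followed by a file extension. The field codes themselves (such as `Pre` or `NNAS`) are kept as raw strings, since their meanings are documented elsewhere in the corpus details.

```python
from typing import NamedTuple


class ALCFileName(NamedTuple):
    """Fields of an ALC file name, in the order given in the text (hypothetical type)."""
    student_id: str    # e.g. "S038"
    text_number: str   # e.g. "T2"
    gender: str        # e.g. "M"
    study_level: str   # e.g. "Pre"
    nativeness: str    # e.g. "NNAS"
    text_mode: str     # e.g. "W"
    place: str         # e.g. "C"


def parse_alc_name(name: str) -> ALCFileName:
    """Split an ALC file name into its seven fields (assumed layout)."""
    # Drop a file extension if one is present (e.g. ".txt" or ".xml").
    stem = name.rsplit(".", 1)[0] if "." in name else name
    parts = stem.split("_")
    if len(parts) != 7:
        raise ValueError(f"expected 7 underscore-separated fields, got {len(parts)}: {name!r}")
    return ALCFileName(*parts)


record = parse_alc_name("S038_T2_M_Pre_NNAS_W_C")
print(record.student_id, record.study_level, record.text_mode)
```

Returning a named tuple keeps the fields in the documented order while allowing access by name, which makes it straightforward to filter files by, for example, study level or text mode.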