The Fourth Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT4) with Shared Task on Offensive Language Detection
Marseille, France. 12th May 2020. Co-located with LREC 2020
Given the success of the first, second, and third workshops on Open-Source Arabic Corpora and Corpora Processing Tools (OSACT) at LREC 2014, LREC 2016, and LREC 2018, the fourth workshop aims to encourage researchers and practitioners of Arabic language technologies, including computational linguistics (CL), natural language processing (NLP), and information retrieval (IR), to share and discuss their research efforts, corpora, and tools. The workshop will also pay special attention to Human Language Technologies based on AI/Machine Learning, one of the hot topics of LREC 2020. In addition to the general topics of CL, NLP, and IR, the workshop will place special emphasis on the Offensive Language Detection shared task.
Offensive speech (vulgar or targeted offense), an expression of heightened polarization in public discourse, has been on the rise. This is due in part to the wide adoption of social media platforms, which allow such polarization to spread. The shared task aims to detect such speech in Arabic social media.
In subtask A, we will use the SemEval 2020 Arabic offensive language dataset (OffensEval2020, Subtask A), which contains 10,000 tweets that were manually annotated for offensiveness (labels are: OFF or NOT_OFF). Offensive tweets contain explicit or implicit insults or attacks against other people, or inappropriate language. We will use the same splits of OffensEval2020 for train (70% of all tweets), dev (10%), and test (20%).
Example: يا مقرف يا جبان للأسف هذه تسمى خسة من شخص أحمق (roughly: "You disgusting coward, sadly this is what is called vileness from a foolish person")
In addition to Subtask A, there will be another subtask for detecting Hate Speech (Subtask B) over the whole dataset. If a tweet contains insults or threats targeting a group based on their nationality, ethnicity, gender, political or sports affiliation, religious belief, or other common characteristics, it is considered Hate Speech (labels are: HS or NOT_HS). Subtasks A and B share the same splits.
Example: الله يقلعكم يالبدو يا مجرمين يا خراب المجتمعات (roughly: "May God take you away, you Bedouins, you criminals, you ruiners of societies")
Subtask B is more challenging than Subtask A: only 5% of the tweets are labeled as hate speech, while 19% of the tweets are labeled as offensive. We encourage submissions to both subtasks.
Note: User mentions are replaced with @USER, URLs are replaced with URL, and empty lines in original tweets are replaced with <LF>.
The purpose of this shared task is to intensify research on the identification of offensive content and hate speech in Arabic language Twitter posts. One goal of the workshop is to define shared challenges using this dataset. We encourage submissions describing experiments for research tasks on the dataset.
Data:
The data is retrieved from Twitter and distributed in tab separated format as follows:
tweet_text \t OFF (or NOT_OFF) \t HS (or NOT_HS)\n
Ex: @USER اخرص يا أعرابي يا وقح فلن تعدو قدرك يا سافل \t OFF \t HS \n
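The tab-separated layout above can be read with a few lines of Python. The sketch below is illustrative, not an official loader; the function name and the returned field names are our own choices, and it assumes exactly the three tab-separated fields described above.

```python
def parse_osact_lines(lines):
    """Parse shared-task lines of the form: tweet_text \t OFF/NOT_OFF \t HS/NOT_HS.

    `lines` is any iterable of strings (e.g. an open file). Empty lines are
    skipped. Field names in the returned dicts are illustrative choices.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Tweets may contain spaces and punctuation, but tabs only separate fields.
        text, off_label, hs_label = line.split("\t")
        rows.append({"text": text, "offensive": off_label, "hate_speech": hs_label})
    return rows
```

Reading a downloaded split is then just `parse_osact_lines(open("train.tsv", encoding="utf-8"))`, with the filename depending on how you saved the data.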
Download training/dev data from here: Training Set, Development Set.
Evaluation Criteria:
Classification systems will be evaluated using the macro-averaged F1-score for Subtasks A and B.
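For reference, the macro-averaged F1-score is the unweighted mean of the per-class F1 scores, so the minority class (e.g. HS) counts as much as the majority class. The re-implementation below is a sketch for sanity-checking your own scores; the official scoring code is not reproduced here, and a library call such as scikit-learn's `f1_score(gold, pred, average="macro")` computes the same quantity.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        # Per-class counts, treating `label` as the positive class.
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

Because each class contributes equally, a system that always predicts NOT_HS scores poorly on Subtask B despite high accuracy, which is why macro F1 suits this imbalanced task.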
Gold-standard labels for the test data can now be downloaded from here: Task A, Task B.
Submission Format:
Classifications of the test and dev datasets (labels only) should be submitted as separate files in the following format, with one label per corresponding tweet (i.e. the label on line x of the submission file corresponds to the tweet on line x of the test file):
For Subtask A:
OFF (or NOT_OFF)\n
For Subtask B:
HS (or NOT_HS)\n
Participants can submit up to two system results (a primary submission for the best result and a secondary submission for the second-best result).
Official results will consider primary submissions for ranking different teams, and results of secondary submissions will be reported for guidance. All participants are required to report on the development and test sets in their papers.
Submission filenames should follow this format:
ParticipantName_Subtask<A/B>_<test/dev>_<1/2>.zip (a plain .txt file inside each .zip file)
Ex: QCRI_SubtaskA_test_1.zip (the best results for Subtask A for test dataset from QCRI team)
Ex: KSU_SubtaskB_dev_2.zip (the 2nd best results for Subtask B for dev dataset from KSU team)
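Putting the format and naming rules together, a submission can be produced as in the sketch below. It is an illustrative helper, not official tooling; the function name and the name of the .txt file placed inside the .zip are our own choices (the call only requires a plain .txt file inside the .zip).

```python
import zipfile

def write_submission(labels, team, subtask, split, rank):
    """Write one label per line and zip it under the required filename pattern.

    labels:  list of "OFF"/"NOT_OFF" (Subtask A) or "HS"/"NOT_HS" (Subtask B)
    team:    participant/team name; subtask: "A" or "B"
    split:   "test" or "dev"; rank: 1 (primary) or 2 (secondary)
    """
    stem = f"{team}_Subtask{subtask}_{split}_{rank}"
    txt_name, zip_name = f"{stem}.txt", f"{stem}.zip"
    with open(txt_name, "w", encoding="utf-8") as f:
        f.write("\n".join(labels) + "\n")  # one label per line, as required
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(txt_name)
    return zip_name
```

For instance, `write_submission(predictions, "QCRI", "A", "test", 1)` would produce `QCRI_SubtaskA_test_1.zip` matching the first example above.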
The shared task is hosted on CODALAB using the following links for each subtask:
Subtask A: CODALAB link.
Subtask B: CODALAB link.
Test Set: now released on CODALAB. Please get it from there.
Contact:
For any questions related to the shared task, please contact the organizers using this email address: hmubarak@hbku.edu.qa
Results:
Please find below the results of the participating teams, sorted by macro-averaged F1-score.
Teams:
In the NLP, CL, and IR communities, Arabic is considered relatively resource-poor compared to English. This situation was thought to be the reason for the limited number of corpus-based studies in Arabic. However, recent years have witnessed the emergence of a considerable number of new free Modern Standard Arabic (MSA) corpora and, to a lesser extent, Arabic processing tools.
This workshop follows in the footsteps of previous OSACT editions, providing a forum for researchers to share and discuss their ongoing work. It is timely given the continued rise in research projects focusing on Arabic language resources.
All deadlines are at 23:59 in the UTC-10 (Hawaii) time zone.
The language of the workshop is English, and submissions should follow the LREC 2020 paper submission instructions. All papers will be peer reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format through the START system.
When submitting a paper via the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation experiments).
Describing your LRs in the LRE Map is now standard practice in the LREC submission procedure (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 on "Sharing LRs" (data, tools, web services, etc.), authors will have the possibility, when submitting a paper, to upload LRs to a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new "regular" feature of conferences in our field, thus contributing to a common repository where everyone can deposit and share data.
As scientific work requires accurate citations of referenced work to allow the community to understand the whole context and to replicate the experiments conducted by other researchers, LREC 2020 endorses the need to uniquely identify LRs through the use of the International Standard Language Resource Number (ISLRN), a persistent unique identifier assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.
Submission link: START page