Welcome to OSACT4

The 4th Workshop on Open-Source Arabic Corpora and Processing Tools

with Shared Task on Offensive Language Detection

Marseille, France. 12 May 2020. Co-located with LREC 2020

Accepted Papers

  • AN ARABIC TWEETS SENTIMENT ANALYSIS DATASET (ATSAD) USING DISTANT SUPERVISION AND SELF TRAINING
    Kathrein Abu Kwaik, Stergios Chatzikyriakidis, Simon Dobnik, Motaz Saad and Richard Johansson
  • ARABERT: TRANSFORMER-BASED MODEL FOR ARABIC LANGUAGE UNDERSTANDING
    Wissam Antoun, Fady Baly and Hazem Hajj
  • ARANET: A DEEP LEARNING TOOLKIT FOR ARABIC SOCIAL MEDIA
    Muhammad Abdul-Mageed, Chiyu Zhang, Azadeh Hashemi and El Moatez Billah Nagoudi
  • BUILDING A QATARI HERITAGE EXPRESSIONS CORPUS
    Sara Al-Mulla and Wajdi Zaghouani
  • FROM ARABIC SENTIMENT ANALYSIS TO SARCASM DETECTION: THE ARSARCASM DATASET
    Ibrahim Abu Farha and Walid Magdy
  • UNDERSTANDING AND DETECTING DANGEROUS SPEECH IN SOCIAL MEDIA
    Ali Alshehri, El Moatez Billah Nagoudi and Muhammad Abdul-Mageed

Shared-Task Papers

  • OVERVIEW OF OSACT4 ARABIC OFFENSIVE LANGUAGE DETECTION SHARED TASK
    Hamdy Mubarak, Kareem Darwish, Walid Magdy, Tamer Elsayed and Hend Al-Khalifa
  • ALT SUBMISSION FOR OSACT SHARED TASK ON OFFENSIVE LANGUAGE DETECTION
    Sabit Hassan, Younes Samih, Hamdy Mubarak, Ahmed Abdelali, Ammar Rashed and Shammur Absar Chowdhury
  • ARABIC OFFENSIVE LANGUAGE DETECTION WITH ATTENTION-BASED DEEP NEURAL NETWORKS
    Bushr Haddad, Zoher Orabe, Anas Al-Abood and Nada Ghneim
  • ASU_OPTO AT OSACT4 - OFFENSIVE LANGUAGE DETECTION FOR ARABIC TEXT
    Amr Keleg, Samhaa R. El-Beltagy and Mahmoud Khalil
  • COMBINING CHARACTER AND WORD EMBEDDINGS FOR THE DETECTION OF OFFENSIVE LANGUAGE IN ARABIC
    Abdullah Alharbi and Mark Lee
  • LEVERAGING AFFECTIVE BIDIRECTIONAL TRANSFORMERS FOR OFFENSIVE LANGUAGE DETECTION
    AbdelRahim Elmadany, Chiyu Zhang, Muhammad Abdul-Mageed and Azadeh Hashemi
  • MULTI-TASK LEARNING USING ARABERT FOR OFFENSIVE LANGUAGE DETECTION
    Marc Djandji, Fady Baly, Wissam Antoun and Hazem Hajj
  • MULTITASK LEARNING FOR ARABIC OFFENSIVE LANGUAGE AND HATE-SPEECH DETECTION
    Ibrahim Abu Farha and Walid Magdy
  • OSACT4 SHARED TASKS: ENSEMBLED STACKED CLASSIFICATION FOR OFFENSIVE AND HATE SPEECH IN ARABIC TWEETS
    Hafiz Hassaan Saeed, Toon Calders and Faisal Kamiran
  • OSACT4 SHARED TASK ON OFFENSIVE LANGUAGE DETECTION: INTENSIVE PREPROCESSING BASED APPROACH
    Fatemah Husain
  • QUICK AND SIMPLE APPROACH FOR DETECTING HATE SPEECH IN ARABIC TWEETS
    Abeer Abuzayed and Tamer Elsayed
  • OFFENSIVE LANGUAGE DETECTION IN ARABIC USING ULMFIT
    Mohamed Abdellatif and Ahmed Elgammal

Workshop Description

Following the success of the first, second, and third workshops on Open-Source Arabic Corpora and Processing Tools (OSACT) at LREC 2014, LREC 2016, and LREC 2018, the fourth workshop aims to encourage researchers and practitioners of Arabic language technologies, including computational linguistics (CL), natural language processing (NLP), and information retrieval (IR), to share and discuss their research efforts, corpora, and tools. The workshop will also give special attention to Human Language Technologies based on AI/Machine Learning, one of the hot topics of LREC 2020. In addition to the general topics of CL, NLP, and IR, the workshop places special emphasis on the Offensive Language Detection shared task.

Shared Task on Offensive Language Detection

Offensive speech (vulgarity or targeted offense), as an expression of heightened polarization in public discourse, has been on the rise, due in part to the wide adoption of social media platforms. This shared task aims to detect such speech in Arabic social media.
In Subtask A, we will use the SemEval 2020 Arabic offensive language dataset (OffensEval2020, Subtask A), which contains 10,000 tweets manually annotated for offensiveness (labels: OFF or NOT_OFF). Offensive tweets contain explicit or implicit insults or attacks against other people, or inappropriate language. We will use the same OffensEval2020 splits: train (70% of all tweets), dev (10%), and test (20%).

Example: يا مقرف يا جبان للأسف هذه تسمى خسة من شخص أحمق (English: "You disgusting coward, sadly this is called vileness from a foolish person.")

In addition to Subtask A, there is a second subtask for detecting hate speech (Subtask B) over the whole dataset. A tweet is considered hate speech if it contains insults or threats targeting a group based on nationality, ethnicity, gender, political or sports affiliation, religious belief, or another common characteristic (labels: HS or NOT_HS). Subtasks A and B share the same splits.

Example: الله يقلعكم يالبدو يا مجرمين يا خراب المجتمعات (English: "May God rid us of you, you Bedouins, you criminals, you destroyers of societies.")

Subtask B is more challenging than Subtask A, since only 5% of the tweets are labeled as hate speech, compared to 19% labeled as offensive. We encourage submissions to both subtasks.
Note: User mentions are replaced with @USER, URLs are replaced with URL, and empty lines in original tweets are replaced with <LF>.
The purpose of this shared task is to intensify research on the identification of offensive content and hate speech in Arabic-language Twitter posts. One goal of the workshop is to define shared challenges using this dataset, and we encourage submissions describing experiments on it.
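As an illustration of the normalization described in the note above, here is a minimal Python sketch; the regular expressions and the function name are our own assumptions, not the organizers' exact procedure.

    import re

    def normalize_tweet(text):
        # Hypothetical re-implementation of the dataset normalization:
        # mentions -> @USER, URLs -> URL, line breaks inside a tweet -> <LF>.
        text = re.sub(r"@\w+", "@USER", text)
        text = re.sub(r"https?://\S+", "URL", text)
        text = text.replace("\n", " <LF> ")
        return text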

Data:
The data is retrieved from Twitter and distributed in a tab-separated format as follows:
tweet_text \t OFF (or NOT_OFF) \t HS (or NOT_HS)\n
Ex: @USER اخرص يا أعرابي يا وقح فلن تعدو قدرك يا سافل \t OFF \t HS \n
(English gloss of the example tweet: "@USER shut up, you insolent Bedouin; you will never rise above your station, you lowlife.")
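For illustration, a minimal Python sketch for reading this format (the function name is hypothetical; pass the path of the released data file):

    def load_dataset(path):
        # Each line: tweet text, offensive label, hate-speech label, tab-separated.
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                tweet, off_label, hs_label = line.rstrip("\n").split("\t")
                examples.append((tweet, off_label, hs_label))
        return examples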

Download training/dev data from here: Training Set, Development Set.

Evaluation Criteria:
Classification systems will be evaluated using the macro-averaged F1-score for Subtasks A and B.
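For reference, the metric can be computed with scikit-learn; a minimal sketch with toy labels:

    from sklearn.metrics import f1_score

    gold = ["OFF", "NOT_OFF", "OFF", "NOT_OFF"]
    pred = ["OFF", "NOT_OFF", "NOT_OFF", "NOT_OFF"]

    # Macro averaging takes the unweighted mean of the per-class F1 scores,
    # so the rare class (e.g. HS in Subtask B) counts as much as the majority class.
    print(f1_score(gold, pred, average="macro"))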

Gold-standard labels for the test data can now be downloaded from here: Task A, Task B.

Submission Format:
Classifications of the test and dev datasets (labels only) should be submitted as separate files in the following format, with one label per corresponding tweet (i.e., the label on line x of the submission file corresponds to the tweet on line x of the test file):
For Subtask A:
        OFF (or NOT_OFF)\n
For Subtask B:
        HS (or NOT_HS)\n
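A minimal sketch of producing such a file from a list of predicted labels (names are illustrative):

    def write_labels(labels, path):
        # One label per line, in the same order as the tweets in the input file.
        with open(path, "w", encoding="utf-8") as f:
            for label in labels:
                f.write(label + "\n")

    write_labels(["OFF", "NOT_OFF"], "predictions.txt")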

Participants may submit up to two runs per subtask: a primary submission for their best result and a secondary submission for their second-best result.
Official rankings of teams will be based on primary submissions; results of secondary submissions will be reported for guidance. All participants are required to report results on both the development and test sets in their papers.

Submission filenames should follow this format:
ParticipantName_Subtask<A/B>_<test/dev>_<1/2>.zip (a plain .txt file inside each .zip file)
Ex: QCRI_SubtaskA_test_1.zip (the best results for Subtask A for test dataset from QCRI team)
Ex: KSU_SubtaskB_dev_2.zip (the 2nd best results for Subtask B for dev dataset from KSU team)
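The .zip wrapper can be created, for example, with Python's standard zipfile module (the team and file names below are illustrative):

    import zipfile

    # Package the plain .txt predictions under the required naming convention.
    with zipfile.ZipFile("MyTeam_SubtaskA_test_1.zip", "w") as zf:
        zf.write("predictions.txt")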

The shared task is hosted on CODALAB using the following links for each subtask:
Subtask A: CODALAB link.
Subtask B: CODALAB link.

Test Set: now released on CODALAB; please download it from there.

Contact:
For any questions related to the shared task, please contact the organizers using this email address: hmubarak@hbku.edu.qa

Results:
Please find below the results of the participating teams, sorted by F1-score.

Teams:

Motivation and Topics of interest

In the NLP, CL, and IR communities, Arabic is considered relatively resource-poor compared to English, which has long been seen as the reason for the limited number of corpus-based studies in Arabic. However, recent years have witnessed the emergence of new, freely available Modern Standard Arabic (MSA) corpora and, to a lesser extent, Arabic processing tools.
This workshop follows in the footsteps of the previous OSACT editions in providing a forum for researchers to share and discuss their ongoing work. It is timely given the continued rise in research projects focusing on Arabic language resources.

Corpora

  • Surveying and critiquing the design of available Arabic corpora and their associated processing tools.
  • Making new annotated corpora available for NLP and IR applications such as named entity recognition, machine translation, sentiment analysis, text classification, and language learning.
  • Evaluating the use of crowdsourcing platforms for Arabic data annotation.

Tools and Technologies

  • Language education, e.g., L1 and L2.
  • Language modeling and word embeddings.
  • Tokenization, normalization, word segmentation, morphological analysis, part-of-speech tagging, etc.
  • Sentiment analysis, dialect identification, and text classification.
  • Dialect translation.
  • Fake news detection.
  • Web and social media search and analytics.

Issues in the design, construction and use of Arabic LRs: text, speech, sign, gesture, image, in single or multimodal/multimedia data

  • Guidelines, standards, best practices and models for LRs interoperability
  • Methodologies and tools for LRs construction and annotation
  • Methodologies and tools for extraction and acquisition of knowledge
  • Ontologies, terminology and knowledge representation
  • LRs and Semantic Web (including Linked Data, Knowledge Graphs, etc.)

Important Dates

Main Workshop

  • Paper submission deadline: 28 Feb 2020 (extended from 25 Feb 2020)
  • Notification of acceptance: 13 March 2020
  • Final submission of manuscripts: 25 March 2020
  • Workshop date: 12 May 2020

Shared-task

  • Shared-task train/dev set release: 21 Jan 2020
  • Shared-task test set release: 13 Feb 2020
  • Runs submission: 18 Feb 2020
  • Announcing runs results: 20 Feb 2020
  • Shared-task paper submission deadline: 28 Feb 2020 (extended from 25 Feb 2020)
  • Notification of acceptance: 13 March 2020
  • Final submission of manuscripts: 25 March 2020

All deadlines are at 23:59 in the UTC-10 (Hawaii) time zone.

Submission guidelines

The language of the workshop is English, and submissions must follow the LREC 2020 paper submission instructions. All papers will be peer-reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format through the START system.
When submitting a paper through the START page, authors will be asked to provide essential information about the resources (in a broad sense, i.e., technologies, standards, evaluation kits, etc.) that were used in the work described in the paper or that are a new result of the research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluations).

Identify, Describe and Share your LRs!

Describing your LRs in the LRE Map is now standard practice in the LREC submission procedure (introduced in 2010 and since adopted by other conferences). To continue the efforts initiated at LREC 2014 on "Sharing LRs" (data, tools, web services, etc.), authors will have the option, when submitting a paper, to upload LRs to a special LREC repository. This effort to share LRs, linked to the LRE Map for their description, may become a regular feature of conferences in our field, contributing to a common repository where everyone can deposit and share data.

Since scientific work requires accurate citations of referenced work, so that the community can understand the whole context and replicate the experiments conducted by other researchers, LREC 2020 endorses the need to uniquely identify LRs through the International Standard Language Resource Number (ISLRN), a persistent unique identifier assigned to each language resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.

Submission link: START page

Organizing Committee

Programme Committee

  • Nizar Habash, New York University Abu Dhabi, UAE
  • Wajdi Zaghouani, Hamad Bin Khalifa University, Qatar
  • Wassim El-Hajj, American University of Beirut, Lebanon
  • Ayah Zirikly, George Washington University, USA
  • Irina Temnikova, Sofia, Bulgaria
  • Shady Elbassuoni, American University of Beirut, Lebanon
  • Nora Al-Twairesh, King Saud University, KSA
  • Abeer Aldayel, University of Edinburgh, UK
  • Khaled Shaalan, The British University in Dubai, UAE
  • Almoataz B. Elsaid, Cairo University, Egypt
  • Ahmed Mourad, RMIT University, Australia
  • Hassan Sawaf, Amazon, USA
  • Fethi Bougares, Université du Maine, France
  • Nada Ghneim, Higher Institute for Applied Science and Technology, Syria
  • Maha Althobaiti, Taif University, KSA
  • Nasser Zalmout, New York University Abu Dhabi, UAE
  • Mohammad Salameh, University of Alberta, Canada
  • Alexis Nasr, Université Aix Marseille, France
  • AbdelRahim Elmadany, The University of British Columbia, Canada
  • Muhammad Abdul-Mageed, The University of British Columbia, Canada
  • Ahmed Ali, Qatar Computing Research Institute, Qatar
  • Haithem Afli, Cork Institute of Technology, Ireland
  • Preslav Nakov, Qatar Computing Research Institute, Qatar
  • Fahim Dalvi, Qatar Computing Research Institute, Qatar
  • Salam Khalifa, NYU-AD, UAE
  • Hassan Sajjad, Qatar Computing Research Institute, Qatar
  • Maha Alamri, Bangor University, UK
  • Sarah Kohail, University of Hamburg, Germany
  • Azzeddine Mazroui, Université Mohammed Premier, Morocco
  • Bassam Haddad, University of Petra, Jordan
  • Younes Samih, Qatar Computing Research Institute, Qatar
  • Khaled Shaban, Qatar University, Qatar
  • Reem Suwaileh, Qatar University, Qatar
  • Mucahid Kutlu, TOBB University, Turkey
  • Maram Hasanain, Qatar University, Qatar
  • Raghad Alshalaan, Imam Abdulrahman Bin Faisal University, KSA
  • Shahad Alshalaan, Imam Abdulrahman Bin Faisal University, KSA
  • Maha Alrabiah, Al Imam Mohammad Ibn Saud Islamic, KSA
  • Ibrahim Abu Farha, University of Edinburgh, UK