Welcome to OSACT3

The 3rd Workshop on Open-Source Arabic Corpora and Processing Tools

with ArabicWeb16 Data Challenge

Miyazaki, Japan. 8th May 2018. Co-located with LREC 2018

Workshop Proceedings

Workshop Programme

    09:00 – 09:10 - Welcome and Introduction by Workshop Chairs

    09:10 – 10:30 - Opening Session

  • Cross lingual modeling for low resource languages with a case application to Arabic Dialects (Keynote Talk)
    Mona Diab
  • Learning Subjective Language: Feature Engineered vs. Deep Models
    Muhammad Abdul-Mageed

    10:30 – 11:00 - Coffee Break

    11:00 – 13:00 - Session 1

  • An Arabic Dependency Treebank in the Travel Domain
    Dima Taji, Jamila El Gizuli and Nizar Habash
  • ArSentD-LEV: A Multi-Topic Corpus for Target-based Sentiment Analysis in Arabic Levantine Tweets
    Ramy Baly, Hazem Hajj, Wassim El-Hajj and Khaled Shaban
  • ARLEX: A Large Scale Comprehensive Lexical Inventory for Modern Standard Arabic
    Sawsan Alqahtani, Mona Diab and Wajdi Zaghouani
  • ArSAS: An Arabic Speech-Act and Sentiment Corpus of Tweets
    AbdelRahim Elmadany, Hamdy Mubarak and Walid Magdy
  • iArabicWeb16: Making a Large Web Collection More Accessible for Research
    Khaled Yasser, Reem Suwaileh, Abdelrahman Shouman, Yassmine Barkallah, Mucahid Kutlu and Tamer Elsayed
  • Building a Causation Annotated Corpus: The Salford Arabic Causal Bank - Proclitics
    Jawad Sadek and Farid Meziane

    13:00 – 14:30 - Lunch Break

    14:30 – 16:00 - Session 2

  • Creating an Arabic Dialect Text Corpus by Exploring Twitter, Facebook, and Online Newspapers
    Areej Alshutayri and Eric Atwell
  • Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach
    Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Younes Samih and Mohammed Attia
  • Dial2MSA: A Tweets Corpus for Converting Dialectal Arabic to Modern Standard Arabic
    Hamdy Mubarak
  • ARC-WMI: Towards Building Arabic Readability Corpus for Written Medicine Information
    Abeer AL-Dayel, Hend Al-Khalifa, Sinaa Alaqeel, Norah Abanmy, Maha Al-Yahya and Mona Diab

    16:00 – 16:30 - Coffee Break

    16:30 – 17:00 - Closing Session

  • ArSEL: A Large Scale Arabic Sentiment and Emotion Lexicon
    Gilbert Badaro, Hussein Jundi, Hazem Hajj, Wassim El-Hajj and Nizar Habash
  • Guidelines and Annotation Framework for Arabic Author Profiling
    Wajdi Zaghouani and Anis Charfi

Workshop Description

Following the success of the first and second workshops on Open-Source Arabic Corpora and Corpora Processing Tools (OSACT) at LREC 2014 and LREC 2016, whose papers have received 77 citations to date, the third workshop encourages researchers and practitioners of Arabic language technologies, including computational linguistics (CL), natural language processing (NLP), and information retrieval (IR), to share and discuss their research efforts, corpora, and tools. The workshop will also give special attention to the wide variety of initiatives for the creation, use, and evaluation of Arabic resources, as part of Asian Language Resources and Technologies, which is one of the LREC 2018 hot topics. In addition to the general topics of CL, NLP, and IR, the workshop will place special emphasis on a new Arabic Data Challenge track.

Data Challenge Track

This year, we are introducing ArabicWeb16, a new Web dataset that is suitable for many research projects. ArabicWeb16 is a public Web crawl of 150M Arabic Web pages, crawled over the month of January 2016, with high coverage of dialectal Arabic (about 21%) as well as Modern Standard Arabic (MSA). One goal of the workshop is to define shared challenges using this dataset. We encourage submissions describing experiments for research tasks on the dataset. These tasks include (but are not limited to) training word embeddings, deduplication, cross-dialect search, question answering, dialect detection, knowledge-base population, entity search, blog search, text classification, and spam detection. Further details, including instructions on how to obtain the dataset, can be found here.

Topics of interest


  • Surveying and critiquing the design of available Arabic corpora and their associated processing tools.
  • Making available new annotated corpora for NLP and IR applications such as named entity recognition, machine translation, sentiment analysis, text classification, and language learning.
  • Evaluating the use of crowdsourcing platforms for Arabic data annotation.

Tools and Technologies

  • Language education, e.g., L1 and L2.
  • Language modeling and word embeddings.
  • Tokenization, normalization, word segmentation, morphological analysis, part-of-speech tagging, etc.
  • Sentiment analysis, dialect identification, and text classification.
  • Dialect translation.

ArabicWeb16 Data Challenge

  • Language modeling and word embeddings.
  • Dialect detection and cross-dialect search.
  • Entity search, blog search, deduplication, and spam detection.
  • Question answering and knowledge-base population.
  • Text classification.

Important Dates

  • Submission deadline: 22 January 2018 (by 23:59 UTC-10 Hawaii timezone)
  • Notification of acceptance: 15 February 2018
  • Final submission of manuscripts: 25 February 2018
  • Workshop date: 8 May 2018

Submission guidelines

We invite both long papers (8 pages plus 2 pages of references) and short papers (4 pages plus 2 pages of references), formatted according to the LREC guidelines.

The language of the workshop is English, and submissions should follow the LREC 2018 paper submission instructions. All papers will be peer reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format to the START system. When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation ones).

Identify, Describe and Share your LRs!

Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about "Sharing LRs" (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs in a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new "regular" feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.

As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2018 endorses the need to uniquely identify LRs through the use of the International Standard Language Resource Number (ISLRN), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.

Submission link: START page

Organizing Committee

Programme Committee

  • Nizar Habash, New York University Abu Dhabi, UAE
  • Mona Diab, George Washington University, USA
  • Waleed Ammar, Allen Institute for Artificial Intelligence, USA
  • Wajdi Zaghouani, Carnegie Mellon University, Qatar
  • Mahmoud El-Haj, Lancaster University, UK
  • Khaled Bashir Shaban, Qatar University, Qatar
  • Wassim El-Hajj, American University of Beirut, Lebanon
  • Ayah Zirikly, George Washington University, USA
  • Irina Temnikova, Qatar Computing Research Institute, Qatar
  • Shady Elbassuoni, American University of Beirut, Lebanon
  • Nora Al-Twairesh, King Saud University, KSA
  • Abeer Aldayel, King Saud University, KSA
  • Khaled Shaalan, The British University in Dubai, UAE
  • Almoataz B. Elsaid, Cairo University, Egypt
  • Ahmed Mourad, RMIT University, Australia
  • Hassan Sawaf, Amazon, USA
  • Fethi Bougares, Université du Maine, Avenue Laënnec, France
  • Nada Ghneim, Higher Institute for Applied Science and Technology, Syria
  • Maha Althobaiti, Taif University, KSA
  • Ghassan Mourad, Lebanese University, Lebanon
  • Nadi Tomeh, Université Paris 13, France
  • Nasser Zalmout, New York University Abu Dhabi, UAE
  • Mohammad Salameh, University of Alberta, Canada
  • Hamdy Mubarak, Qatar Computing Research Institute, Qatar
  • Ahmed Abdelali, Qatar Computing Research Institute, Qatar
  • Alexis Nasr, Université Aix Marseille, France
  • Amal Alsaif, Al-Imam Mohammad Ibn Saud Islamic University, KSA
  • Ali Jaoua, Qatar University, Qatar
  • Mohsen Rashwan, Cairo University, Egypt
  • AbdelRahim Elmadany, Jazan University, KSA
  • Mohamed Abdelmageed, The University of British Columbia, Canada
  • Ahmed Ali, Qatar Computing Research Institute, Qatar
  • Alberto Barrón-Cedeño, Qatar Computing Research Institute, Qatar
  • Alexander Koller, Saarland University, Germany
  • Areeb Alowisheq, Al-Imam Mohammad Ibn Saud Islamic University, KSA
  • Azzeddine Mazroui, Université Mohammed Premier, Morocco
  • Bassam Haddad, University of Petra, Jordan
  • Eshrag Refaee, Heriot-Watt University, UK
  • Haithem Afli, Dublin City University, Ireland
  • Hany Hassan, Microsoft, USA
  • Hassan Sajjad, Qatar Computing Research Institute, Qatar
  • Hazem Hajj, American University of Beirut, Lebanon
  • Houda Bouamor, CMU-Q, Qatar
  • Kemal Oflazer, CMU-Q, Qatar
  • Maha Alamri, Bangor University, UK
  • Mucahid Kutlu, Qatar University, Qatar
  • Preslav Nakov, Qatar Computing Research Institute, Qatar
  • Fahim Dalvi, Qatar Computing Research Institute, Qatar
  • Salam Khalifa, NYU-AD, UAE
  • Sarah Kohail, University of Hamburg, Germany
  • Tim Buckwalter, University of Maryland, USA
  • Violetta Cavalli-Sforza, Al Akhawayn University in Ifrane, Morocco
  • Younes Samih, Universität Düsseldorf, Germany
  • Szymon Roziewski, Information Processing Institute, Warsaw, Poland