The 3rd Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT): with ArabicWeb16 Data Challenge

Workshop Proceedings

Workshop Programme

09:00 – 09:10 - Welcome and Introduction by Workshop Chairs

09:10 – 10:30 - Opening Session

Cross lingual modeling for low resource languages with a case application to Arabic Dialects (Keynote Talk)
Mona Diab
Learning Subjective Language: Feature Engineered vs. Deep Models
Muhammad Abdul-Mageed

10:30 – 11:00 - Coffee Break

11:00 – 13:00 - Session 1

An Arabic Dependency Treebank in the Travel Domain
Dima Taji, Jamila El Gizuli and Nizar Habash
ArSentD-LEV: A Multi-Topic Corpus for Target-based Sentiment Analysis in Arabic Levantine Tweets
Ramy Baly, Hazem Hajj, Wassim El-Hajj and Khaled Shaban
ARLEX: A Large Scale Comprehensive Lexical Inventory for Modern Standard Arabic
Sawsan Alqahtani, Mona Diab and Wajdi Zaghouani
ArSAS: An Arabic Speech-Act and Sentiment Corpus of Tweets
AbdelRahim Elmadany, Hamdy Mubarak and Walid Magdy
iArabicWeb16: Making a Large Web Collection More Accessible for Research
Khaled Yasser, Reem Suwaileh, Abdelrahman Shouman, Yassmine Barkallah, Mucahid Kutlu and Tamer Elsayed
Building a Causation Annotated Corpus: The Salford Arabic Causal Bank - Proclitics
Jawad Sadek and Farid Meziane

13:00 – 14:30 - Lunch Break

14:30 – 16:00 - Session 2

Creating an Arabic Dialect Text Corpus by Exploring Twitter, Facebook, and Online Newspapers
Areej Alshutayri and Eric Atwell
Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach
Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Younes Samih and Mohammed Attia
Dial2MSA: A Tweets Corpus for Converting Dialectal Arabic to Modern Standard Arabic
Hamdy Mubarak
ARC-WMI: Towards Building Arabic Readability Corpus for Written Medicine Information
Abeer AL-Dayel, Hend Al-Khalifa, Sinaa Alaqeel, Norah Abanmy, Maha Al-Yahya and Mona Diab

16:00 – 16:30 - Coffee Break

16:30 – 17:00 - Closing Session

ArSEL: A Large Scale Arabic Sentiment and Emotion Lexicon
Gilbert Badaro, Hussein Jundi, Hazem Hajj, Wassim El-Hajj and Nizar Habash
Guidelines and Annotation Framework for Arabic Author Profiling
Wajdi Zaghouani and Anis Charfi

Workshop Description

Given the success of the first and second workshops on Open-Source Arabic Corpora and Corpora Processing Tools (OSACT) at LREC 2014 and LREC 2016, whose papers have received 77 citations to date, the third workshop aims to encourage researchers and practitioners of Arabic language technologies, including computational linguistics (CL), natural language processing (NLP), and information retrieval (IR), to share and discuss their research efforts, corpora, and tools. The workshop will also pay special attention to the wide variety of initiatives for the creation, use, and evaluation of Arabic resources and technologies as part of Asian Language Resources and Technologies, one of the LREC 2018 hot topics. In addition to the general topics of CL, NLP, and IR, the workshop will place special emphasis on a new Arabic Data Challenge track.

Data Challenge Track

This year, we are introducing ArabicWeb16, a new Web dataset that is suitable for many research projects. ArabicWeb16 is a public Web crawl of 150M Arabic Web pages, crawled over the month of January 2016, with high coverage of dialectal Arabic (about 21%) as well as Modern Standard Arabic (MSA). One goal of the workshop is to define shared challenges using this dataset. We encourage submissions describing experiments for research tasks on the dataset. These include (but are not limited to) training word embeddings, deduplication, cross-dialect search, question answering, dialect detection, knowledge-base population, entity search, blog search, text classification, and spam detection. Further details, including instructions on how to obtain the dataset, can be found here.

Topics of interest

Corpora

Surveying and critiquing the design of available Arabic corpora and their associated processing tools.
Making available new annotated corpora for NLP and IR applications such as named entity recognition, machine translation, sentiment analysis, text classification, and language learning.
Evaluating the use of crowdsourcing platforms for Arabic data annotation.

Tools and Technologies

Language education, e.g., L1 and L2.
Language modeling and word embeddings.
Tokenization, normalization, word segmentation, morphological analysis, part-of-speech tagging, etc.
Sentiment analysis, dialect identification, and text classification.
Dialect translation.

ArabicWeb16 Data Challenge

Language modeling, word embeddings.
Dialect detection, cross-dialect search.
Entity search, blog search, deduplication, spam detection.
Question answering, knowledge-base population.
Text classification.

Submission guidelines

We invite both long papers (8 pages plus 2 pages of references, formatted according to the LREC guidelines) and short papers (4 pages plus 2 pages of references).

The language of the workshop is English, and submissions should follow the LREC 2018 paper submission instructions. All papers will be peer reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format to the START system. When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e., technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation ones).

Identify, Describe and Share your LRs!

Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about "Sharing LRs" (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs in a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new "regular" feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.

As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2016 endorses the need to uniquely Identify LRs through the use of the International Standard Language Resource Number (ISLRN), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.

Submission link: START page

Organizing Committee

Hend Al-Khalifa, King Saud University, KSA
Walid Magdy, University of Edinburgh, UK
Kareem Darwish, Qatar Computing Research Institute, Qatar
Tamer Elsayed, Qatar University, Qatar

Programme Committee

Nizar Habash, New York University Abu Dhabi, UAE
Mona Diab, George Washington University, USA
Waleed Ammar, Allen Institute for Artificial Intelligence, USA
Wajdi Zaghouani, Carnegie Mellon University, Qatar
Mahmoud El-Haj, Lancaster University, UK
Khaled Bashir Shaban, Qatar University, Qatar
Wassim El-Hajj, American University of Beirut, Lebanon
Ayah Zirikly, George Washington University, USA
Irina Temnikova, Qatar Computing Research Institute, Qatar
Shady Elbassuoni, American University of Beirut, Lebanon
Nora Al-Twairesh, King Saud University, KSA
Abeer Aldayel, King Saud University, KSA
Khaled Shaalan, The British University in Dubai, UAE
Almoataz B. Elsaid, Cairo University, Egypt
Ahmed Mourad, RMIT University, Australia
Hassan Sawaf, Amazon, USA
Fethi Bougares, Université du Maine, Avenue Laënnec, France
Nada Ghneim, Higher Institute for Applied Science and Technology, Syria
Maha Althobaiti, Taif University, KSA
Ghassan Mourad, Lebanese University, Lebanon
Nadi Tomeh, Université Paris 13, France
Nasser Zalmout, New York University Abu Dhabi, UAE
Mohammad Salameh, University of Alberta, Canada
Hamdy Mubarak, Qatar Computing Research Institute, Qatar
Ahmed Abdelali, Qatar Computing Research Institute, Qatar
Alexis Nasr, Université Aix Marseille, France
Amal Alsaif, Al-Imam Muhammad ibn Saud Islamic University, KSA
Ali Jaoua, Qatar University, Qatar
Mohsen Rashwan, Cairo University, Egypt
AbdelRahim Elmadany, Jazan University, KSA
Mohamed Abdelmageed, The University of British Columbia, Canada
Ahmed Ali, Qatar Computing Research Institute, Qatar
Alberto Barrón-Cedeño, Qatar Computing Research Institute, Qatar
Alexander Koller, Saarland University, Germany
Areeb Alowisheq, Al-Imam Mohammad Ibn Saud Islamic University, KSA
Azzeddine Mazroui, Université Mohammed Premier, Morocco
Bassam Haddad, University of Petra, Jordan
Eshrag Refaee, Heriot-Watt University, UK
Haithem Afli, Dublin City University, Ireland
Hany Hassan, Microsoft, USA
Hassan Sajjad, Qatar Computing Research Institute, Qatar
Hazem Hajj, American University of Beirut, Lebanon
Houda Bouamor, CMU-Q, Qatar
Kemal Oflazer, CMU-Q, Qatar
Maha Alamri, Bangor University, UK
Mucahid Kutlu, Qatar University, Qatar
Preslav Nakov, Qatar Computing Research Institute, Qatar
Fahim Dalvi, Qatar Computing Research Institute, Qatar
Salam Khalifa, NYU-AD, UAE
Sarah Kohail, University of Hamburg, Germany
Tim Buckwalter, University of Maryland, USA
Violetta Cavalli-Sforza, Al Akhawayn University in Ifrane, Morocco
Younes Samih, Universität Düsseldorf, Germany
Szymon Roziewski, Information Processing Institute, Warsaw, Poland

Welcome to OSACT3