DiaLex: A Benchmark for Evaluating Multidialectal Arabic Word Embeddings


Publication: WANLP 2021: Sixth Arabic Natural Language Processing Workshop

Summary: We describe DiaLex, a benchmark for intrinsic evaluation of dialectal Arabic word embeddings. DiaLex covers five important Arabic dialects: Algerian, Egyptian, Lebanese, Syrian, and Tunisian. Across these dialects, DiaLex provides a testbank for six syntactic and semantic relations, namely male to female, singular to dual, singular to plural, antonym, comparative, and genitive to past tense. DiaLex thus consists of a collection of word pairs representing each of the six relations in each of the five dialects. To demonstrate the utility of DiaLex, we use it to evaluate a set of existing Arabic word embeddings alongside new ones that we developed.

Learn more: https://aclanthology.org/2021.wanlp-1.2
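
Illustration: the sketch below shows one common way a collection of word pairs for a single relation can be turned into an intrinsic evaluation, namely the vector-offset analogy test (a : b :: c : ?). It is a minimal sketch under stated assumptions and not the paper's exact protocol: the function names, the toy 2-d vectors, and the English placeholder words are all hypothetical and are not taken from DiaLex.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Return the word closest to (b - a + c), excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, pairs):
    """Form a : b :: c : d questions from every ordered pair of distinct word pairs."""
    correct = total = 0
    for a, b in pairs:
        for c, d in pairs:
            if (a, b) == (c, d):
                continue
            total += 1
            correct += solve_analogy(emb, a, b, c) == d
    return correct / total if total else 0.0


# Hypothetical toy embeddings: the second coordinate encodes the relation.
toy_emb = {
    "king": [1.0, 0.0], "queen": [1.0, 1.0],
    "man": [2.0, 0.0], "woman": [2.0, 1.0],
    "boy": [3.0, 0.0], "girl": [3.0, 1.0],
}
# Hypothetical word pairs for one relation (male to female).
male_to_female = [("king", "queen"), ("man", "woman"), ("boy", "girl")]

accuracy = analogy_accuracy(toy_emb, male_to_female)
```

In a real setting, `toy_emb` would be replaced by trained dialectal embeddings and the pair list by the benchmark's pairs for one relation in one dialect, with accuracy reported per relation and per dialect.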




Chi Squared Feature Selection over Apache Spark


Publication: IDEAS '19: Proceedings of the 23rd International Database Applications & Engineering Symposium | June 2019, Article No. 41, Pages 1–5

Summary: We implemented a fully distributed version of the Chi-Square (χ²) feature selection algorithm using the Apache Spark framework. Our implementation takes into account the pitfalls of the built-in algorithm and attempts to avoid them. We then benchmarked our algorithm on datasets with different characteristics across multiple system configurations. The benchmarks showed that our algorithm achieves significant gains over the default algorithm in stability, execution time, and resource management.

Learn more: https://doi.org/10.1145/3331076.3331110
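
Illustration: for readers unfamiliar with χ² feature selection, the sketch below computes the χ² statistic of independence between each categorical feature and the class label, then keeps the highest-scoring features. It is a single-machine sketch in plain Python, not the distributed Spark implementation described in the paper; the function names and toy data are hypothetical.

```python
from collections import Counter


def chi_square(feature, labels):
    """Chi-square statistic of independence between one feature and the labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # contingency table counts
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(labels)            # column totals
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


def select_top_k(columns, labels, k):
    """Rank features by chi-square score and return the names of the top k."""
    scores = {name: chi_square(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: one feature tracks the label, the other does not.
labels = [0, 0, 1, 1]
columns = {
    "informative": [0, 0, 1, 1],
    "noise": [0, 1, 0, 1],
}

selected = select_top_k(columns, labels, k=1)
```

A larger statistic indicates stronger dependence between the feature and the label, which is why the top-scoring features are the ones retained.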