Olga Zamaraeva: Typologically-driven Modeling of wh-Questions in a Grammar Engineering Framework
Studying language typology and studying syntactic structure formally are both ways to learn about the range of variation in human languages, yet the two are often pursued separately. Furthermore, assembling the complex and fragmented hypotheses about different syntactic phenomena along multiple typological dimensions becomes intractable without computational aid. In response to these issues, the Grammar Matrix grammar engineering framework combines typology and syntactic theory within a computational paradigm, offering robust scaffolding for testing linguistic hypotheses in interaction and with respect to a clear area of applicability. In this talk, I will present my recent work on modeling the syntactic structure of constituent (wh-)questions across a typologically attested range of variation, within the Grammar Matrix framework. The presented system of syntactic analyses is associated with grammar artifacts that can parse and generate sentences, which allowed me to test the analyses rigorously on test suites from diverse languages. The grammars can be extended directly in the future to cover more phenomena and more lexical items. More broadly, the Grammar Matrix framework is intended for creating implemented grammars for many languages of the world, particularly endangered languages. In computational linguistics, the formalized syntactic representations produced by such grammars play a crucial role in creating annotations, which are then used to evaluate NLP system performance and could also be used to augment training data in low-resource settings. Such grammars have also been shown to be useful in applications such as grammar coaching, and advancing this line of research can contribute to educational and language revitalization efforts.
The talk comprises four parts (one hour in total), with a Q&A session after each: 1) Introduction (focusing on NLP and language variation); 2) Computational syntax with HPSG; 3) Assembling typologically diverse analyses; 4) Future directions of research.
Jon Rawski: Typology Emerges from Computability
Typology, from the ancient Sanskrit grammarians through to Alexander von Humboldt, is known to require two databases: an "encyclopedia of categories" and an "encyclopedia of types". The mathematical study of computable functions gives a rich encyclopedia of categories, and the processes of natural language give a rich encyclopedia of types. This talk will connect the two, especially in morphology and phonology. Jon will: 1) overview classes of string-to-string functions (polyregular, regular, rational, and subsequential); 2) use them to determine the scope and limits of linguistic processes; 3) analytically connect them to classes of transducers (and acceptors, using algebraic semirings); 4) show their usefulness for seq2seq interpretability experiments, and implications for ML in NLP generally.
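To make the function classes above concrete, here is a minimal sketch of a subsequential function, the most restrictive class mentioned: a deterministic machine that reads the input left to right, emits output as it goes, and appends a state-dependent output at the end of the word. The toy process modeled (word-final devoicing of 'b' to 'p') and the alphabet are my own illustration, not an analysis from the talk; it is subsequential precisely because the machine must delay output until it knows whether the segment is word-final.

```python
def devoice_final(word):
    """Toy subsequential transducer: devoice word-final 'b' to 'p'.

    State 0: nothing pending. State 1: a 'b' has been read but not yet
    emitted, because it may turn out to be word-final.
    """
    state, out = 0, []
    for ch in word:
        if state == 0:
            if ch == "b":
                state = 1            # hold the 'b': is it final?
            else:
                out.append(ch)       # emit immediately
        else:                        # state 1: a 'b' is pending
            if ch == "b":
                out.append("b")      # the earlier 'b' was not final
            else:
                out.append("b")      # pending 'b' surfaces voiced
                out.append(ch)
                state = 0
    # final output function: a 'b' still pending at word end devoices
    if state == 1:
        out.append("p")
    return "".join(out)

print(devoice_final("ab"))   # -> "ap"
print(devoice_final("aba"))  # -> "aba"
```

The per-state "final output" (here, emitting 'p' when the word ends in state 1) is exactly what distinguishes subsequential functions from plain deterministic transduction.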
Tiago Pimentel: An Informative Exploration of the Lexicon
During my PhD, I've been exploring the lexicon through the lens of information theory. In this talk, I'll give an overview of results detailing the distribution of information in words (are initial or final positions more informative?) and cross-linguistic compensations (if a language has more information per character, are its words shorter?). I'll also present two new information-theoretic operationalisations (of systematicity and lexical ambiguity) which allow us to analyse computational linguistics questions through corpus analyses, relying only on natural (unsupervised) data.
Maria Ryskina: Informal Romanization Across Languages and Scripts
Informal romanization is an idiosyncratic way of typing non-Latin-script languages in the Latin alphabet, commonly used in online communication. Although the character substitution choices vary between users, they are typically grounded in shared notions of visual and phonetic similarity between characters. In this talk, I will focus on the task of converting such romanized text into its native orthography and present experimental results for Russian, Arabic, and Kannada, highlighting the differences specific to writing systems. I will also show how similarity-encoding inductive bias helps in the absence of parallel data, present comparative error analysis for unsupervised finite-state and seq2seq models for this task, and explore how combinations of the two model classes can leverage their different strengths.
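A greedy longest-match substitution table gives a feel for the de-romanization task. The partial Russian table below (including the visually motivated '4' for 'ч' alongside the phonetic 'ch') is my own toy example; real informal romanization is ambiguous and user-specific, which is exactly why the models in the talk are probabilistic rather than a fixed lookup like this.

```python
# Toy de-romanization table for Russian (illustrative, incomplete).
TABLE = {"shch": "щ", "ch": "ч", "sh": "ш", "zh": "ж", "ya": "я",
         "yu": "ю", "4": "ч",
         "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е",
         "z": "з", "i": "и", "k": "к", "l": "л", "m": "м", "n": "н",
         "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у",
         "f": "ф"}

def deromanize(text):
    """Greedy longest-match conversion of romanized text to Cyrillic."""
    keys = sorted(TABLE, key=len, reverse=True)  # try longest keys first
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(TABLE[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # pass unknown characters through
            i += 1
    return "".join(out)

print(deromanize("privet"))  # -> "привет"
print(deromanize("4to"))     # -> "что"
```

Even this sketch shows the core difficulty: multi-character substitutions overlap ('sh' vs 'shch'), and a single Latin character can stand for different Cyrillic ones depending on the user's habits.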
Shruti Rijhwani: Cross-Lingual Entity Linking for Low-Resource Languages
Entity linking is the task of associating a named entity with its corresponding entry in a structured knowledge base (such as Wikipedia or Freebase). While entity linking systems for languages such as English and Spanish are well-developed, the performance of these methods on low-resource languages is significantly worse.
In this talk, I first discuss existing methods for cross-lingual entity linking and the associated challenges of adapting them to low-resource languages. Then, I present a suite of methods developed for entity linking that do not rely on resources in the target language. The success of our proposed methods is demonstrated with experiments on multiple languages, including extremely low-resource languages such as Tigrinya, Oromo, and Lao. Additionally, this talk will show how information from entity linking can be used with state-of-the-art neural models to improve low-resource named entity recognition.
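The first step of entity linking is candidate retrieval: scoring a mention against knowledge-base entry names. As a rough illustration of a zero-resource baseline, character-bigram overlap can rank candidates without any target-language training data. The tiny "knowledge base" and the Jaccard scorer below are invented for illustration; the methods in the talk are substantially more sophisticated.

```python
def bigrams(s):
    """Set of character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard similarity between the bigram sets of two strings."""
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B) if A | B else 0.0

# Hypothetical KB entry names, for illustration only.
KB = ["Addis Ababa", "Asmara", "Vientiane"]

def link(mention, kb=KB):
    """Return the KB entry whose name is most similar to the mention."""
    return max(kb, key=lambda entry: jaccard(mention, entry))

print(link("Adis Abeba"))  # -> "Addis Ababa"
```

Surface similarity of this kind breaks down precisely in the cross-lingual, cross-script settings the talk targets (e.g. a Tigrinya mention of an English-named entity), which motivates learned transliteration- and embedding-based candidate scoring instead.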