Time zone: America/New_York
By SIGTYP2020 Organizing Committee
Opening remarks: general comments about SIGTYP development, SIGTYP2020 submissions, shared task, etc.
✻ Keynote Talks ✻
Richard is a computational linguist and a Research Scientist at Google, formerly in New York, now in Tokyo. From January 2009 through October 2012, he was a professor at the Center for Spoken Language Understanding at the Oregon Health and Science University. At Google he has been mostly working on text normalization, where his former group has been developing various machine learning approaches to the problem of normalizing non-standard words in text, and he has been particularly interested in the promise (and limitations) of approaches using recurrent neural nets. As of September 2019, he has moved to Google Tokyo, and is working on end-to-end speech understanding. Richard continues to maintain some "side-bar interests" including computational models of the early evolution of writing, the statistical properties of non-linguistic symbol systems, and collaborating on a translation of Wolfgang von Kempelen's Mechanismus der menschlichen Sprache, which was published in 2017. In this talk, Richard will present his most recent joint work with Alexander Gutkin on a taxonomy of writing systems and computational approaches to evaluating how logographic a system is.
Miriam Butt is Professor of Linguistics at the Department of Linguistics at the University of Konstanz. Currently, Miriam is concentrating on the history and distribution of case and complex predicates in South Asian languages. She is also interested in issues of grammar architecture and investigates interface issues (syntax-semantics, morphology-syntax/semantics, prosody-syntax) from both a theoretical and a computational perspective.
✻ Shared Task Session ✻
By Johannes Bjerva, Elizabeth Salesky, Sabrina J. Mielke, Aditi Chaudhary, Celano Giuseppe, Edoardo Maria Ponti, Ekaterina Vylomova, Ryan Cotterell and Isabelle Augenstein
Typological knowledge bases (KBs) such as WALS (Dryer and Haspelmath, 2013) contain information about linguistic properties of the world's languages. They have been shown to be useful for downstream applications, including cross-lingual transfer learning and linguistic probing. A major drawback hampering broader adoption of typological KBs is that they are sparsely populated, in the sense that most languages only have annotations for some features, and skewed, in that few features have wide coverage. As typological features often correlate with one another, it is possible to predict them and thus automatically populate typological KBs, which is also the focus of this shared task. Overall, the task attracted 8 submissions from 5 teams, out of which the most successful methods make use of such feature correlations. However, our error analysis reveals that even the strongest submitted systems struggle with predicting feature values for languages where few features are known.
By Ritesh Kumar, Deepak Alok, Akanksha Bansal, Bornini Lahiri and Atul Kr. Ojha
This paper describes the KMI-Panlingua-IITKGP team's submissions to the SIGTYP 2020 Shared Task on the prediction of typological features. The task entailed the prediction of missing feature values for a particular language, given the name of the language family, its genus, its location (latitude and longitude coordinates and the name of the country where it is spoken) and a set of known feature-value pairs. The team submitted three systems: two rule-based and one hybrid. Of these, one rule-based system produced the best performance on the test set. All the systems were `constrained' in the sense that no dataset or information beyond that provided by the organisers was used to develop them.
By Alexander Gutkin and Richard Sproat
This paper describes the NEMO submission to the SIGTYP 2020 shared task (Bjerva et al., 2020), which deals with the prediction of linguistic typological features for multiple languages using data derived from the World Atlas of Language Structures (WALS). We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features. We describe two submitted ridge-regression-based configurations, which ranked second and third overall in the constrained task. Our best configuration achieved a micro-averaged accuracy of 0.66 on the 149 test languages.
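The general recipe the abstract describes (training a per-feature multi-class estimator on encodings of correlated features, here via ridge regression) can be sketched as follows. This is a minimal illustration with invented toy data, not the authors' actual NEMO code:

```python
import numpy as np

def train_ridge_classifier(X, y, n_classes, alpha=1.0):
    """One-vs-rest ridge regression: regress one-hot targets on the
    encoded sibling features, with L2 penalty alpha."""
    Y = np.eye(n_classes)[y]                       # (n_samples, n_classes) one-hot
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
    return W

def predict(W, X):
    """Predict the class whose regression score is highest."""
    return np.argmax(X @ W, axis=1)

# Toy setup: predict a target feature's value (3 classes) from the
# one-hot encoding of one strongly correlated feature plus noise columns.
rng = np.random.default_rng(0)
latent = rng.integers(0, 3, size=200)              # shared latent value
X = np.hstack([np.eye(3)[latent],                  # correlated "known" feature
               rng.random((200, 2))])              # uninformative noise
y = latent                                         # target feature value

W = train_ridge_classifier(X, y, n_classes=3)
acc = (predict(W, X) == y).mean()
print(f"training accuracy: {acc:.2f}")             # high, since features correlate
```

Because typological features correlate strongly, even such a simple linear estimator can recover many missing values once the known features are encoded as inputs.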
By Martin Vastl, Daniel Zeman and Rudolf Rosa
We present our submission to the SIGTYP 2020 Shared Task on the prediction of typological features. We submit a constrained system, predicting typological features only based on the WALS database. We investigate two approaches. The simpler of the two is a system based on estimating correlation of feature values within languages by computing conditional probabilities and mutual information. The second approach is to train a neural predictor operating on precomputed language embeddings based on WALS features. Our submitted system combines the two approaches based on their self-estimated confidence scores. We reach the accuracy of 70.7% on the test data and rank first in the shared task.
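The combination step the abstract describes (choosing, per feature, whichever of the two predictors reports higher self-estimated confidence) might look like the following sketch; the feature names, values and confidence scores are invented for illustration:

```python
def combine(pred_a, conf_a, pred_b, conf_b):
    """Per feature, keep the prediction whose model is more confident."""
    return {f: (pred_a[f] if conf_a[f] >= conf_b[f] else pred_b[f])
            for f in pred_a}

# Hypothetical per-feature predictions and self-estimated confidences
# from a correlation-based model and a neural model:
corr_pred = {"Order of Subject and Verb": "SV", "Number of Genders": "Three"}
corr_conf = {"Order of Subject and Verb": 0.9, "Number of Genders": 0.4}
nn_pred   = {"Order of Subject and Verb": "VS", "Number of Genders": "Two"}
nn_conf   = {"Order of Subject and Verb": 0.6, "Number of Genders": 0.7}

print(combine(corr_pred, corr_conf, nn_pred, nn_conf))
# {'Order of Subject and Verb': 'SV', 'Number of Genders': 'Two'}
```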
By Gerhard Jäger
This paper describes a workflow to impute missing values in a typological database, a subset of the World Atlas of Language Structures (WALS). Using a world-wide phylogeny derived from lexical data, the model assumes a phylogenetic continuous-time Markov chain governing the evolution of typological values. Data imputation is performed via Maximum Likelihood estimation on the basis of this model. As a back-off model for languages whose phylogenetic position is unknown, a k-nearest-neighbour classification based on geographic distance is performed.
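The geographic back-off can be sketched as a majority vote among the k nearest languages by great-circle distance. The coordinates and feature values below are invented for illustration and are not from the paper's data:

```python
import math
from collections import Counter

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def knn_backoff(target, neighbours, k=3):
    """Majority vote over the k geographically nearest languages
    that have a known value for the feature."""
    ranked = sorted(neighbours,
                    key=lambda n: haversine_km(target[0], target[1], n[0], n[1]))
    votes = [value for _, _, value in ranked[:k]]
    return Counter(votes).most_common(1)[0][0]

# (lat, lon, feature value) for languages with known values:
known = [(48.0, 9.0, "SVO"), (46.0, 2.0, "SVO"), (40.0, -3.0, "SVO"),
         (35.0, 135.0, "SOV"), (28.0, 77.0, "SOV")]
print(knn_backoff((50.0, 8.0), known, k=3))   # nearest three neighbours vote "SVO"
```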
By Chinmay Choudhary
This paper describes a multitask self-attention-based approach to the constrained sub-task within the SIGTYP 2020 Shared Task. Our model is a simple neural architecture inspired by the Transformer model. It uses multitask learning to compute the values of all WALS features for a given input language simultaneously. Results show that our approach performs on par with the baseline approaches, even though it requires only phylogenetic and geographical attributes, namely longitude, latitude, genus index, family index and country index, and does not use any of the known WALS features of the input language to compute its missing WALS features.
✻ Keynote Talks ✻
✻ Oral Session 1 ✻
By Sonia Cristofaro and Guglielmo Inglese
This paper presents the Pavia Diachronic Emergence of Alignment (DEmA) database. This new resource has been devised to provide a comprehensive open-access database of the known possible sources and processes that account for the emergence of different (split-)alignment patterns cross-linguistically. The first goal of the database is to provide a systematized and searchable empirical basis for the study of the diachronic typology of alignment systems. More generally, the unique architecture of DEmA may also provide a suitable model for future diachronic typological resources.
By Barend Beekhuizen
This work presents a novel dataset and a metric for evaluating methods for the automated extraction of translation-equivalent expressions in massively parallel corpora for the purposes of lexical semantic typology. Patterns in the annotation and the evaluation of the extraction methods are discussed, and directions for future research are indicated.
By Harald Hammarström
Nearly all global-level databases with structured information about the languages of the world have been constructed manually (see, e.g., the listing at languagegoldmine.com). Manual data collection by humans is a time-expensive enterprise — a database treating a single linguistic topic for some 200 languages is typically the size of a PhD project, whereas the world has 7 000 languages and there is grammatical information for over 4 500 (see glottolog.org, the remaining 2 500 being largely undocumented). Furthermore, human-curated data is typically less transparent and allows for inconsistencies. It is conceivable that at least some features for such databases can be extracted automatically from either raw-text data or linguistic descriptions originally written for humans. The present work addresses the latter case, in its simplest form: extraction of information — more specifically typological features — of a language from digitized full-text grammatical descriptions. In particular, we focus on the prospects of keyword extraction, i.e., extracting information which is signalled by a specific keyword. For example, keywords like classifier, suffix, preposition or inverse signal the existence of the corresponding grammatical element, and the existence of the grammatical element is signalled, perhaps not exclusively, but at least very frequently, with the term in question. In contrast, other grammatical features, such as whether the verb agrees with the agent in person, may be expressed in a myriad of different ways across grammars and cannot be associated with a specific keyword. Keyword-signalled features are, of course, far simpler to extract, but not completely trivial, and hence the focus of this work.
By Aryaman Arora and Nathan Schneider
The use of specific case markers and adpositions for particular semantic roles is idiosyncratic to every language. This poses problems in many natural language processing tasks such as machine translation and semantic role labelling. Models for these tasks rely on human-annotated corpora as training data. There is a lack of corpora in South Asian languages for such tasks. Even Hindi, despite being a resource-rich language, is limited in available labelled data. This extended abstract presents the in-progress annotation of case markers and adpositions in a Hindi corpus, employing the cross-lingual scheme proposed by Schneider et al. (2017), Semantic Network of Adposition and Case Supersenses (SNACS). The SNACS guidelines we developed also apply to Urdu. We hope to finalize this corpus and develop NLP tools making use of the dataset, as well as promote NLP for typologically similar South Asian languages.
By Michael Richter and Tariq Yousef
Based on Shannon’s coding theorem, we hypothesize that aspectual coding asymmetries of verbs in Russian can be predicted by the verbal feature Average Information Content. We employ the novel Topic Context Model (TCM), which calculates the verbal information content from the number of topics in the target words’ larger discourses and their local discourses. In contrast to a previous study, TCM yielded disappointing results in this study, which is, as we conclude, mainly due to the small number of local contexts we utilized.
By Limor Raviv, Antje Mayer and Shiri Lev-Ari
Why are there so many different languages in the world? How much do languages differ from each other in terms of their linguistic structure? And how do such differences come about? One possibility is that linguistic diversity stems from differences in the social environments in which languages evolve. Specifically, it has been suggested that small, tightly knit communities can maintain high levels of linguistic complexity, while bigger and sparser communities tend to have languages that are structurally simpler, i.e., languages with more regular and more systematic grammars. However, to date this hypothesis has not been tested experimentally. Moreover, community size and network structure are typically confounded in the real-world, making it hard to evaluate the unique contribution of each social factor to this pattern of variation. To address this issue, we used a novel group communication paradigm. This experimental paradigm allowed us to look at the live formation of new languages that were created in the lab by different micro-societies under different social conditions. By analyzing the emerging languages, we could tease apart the causal role of community size and network structure, and see how the process of language evolution and change is shaped by the fact that languages develop in communities of different sizes and different social structures.
✻ Findings 1 ✻
By Philipp Dufter, Masoud Jalili Sabet, François Yvon, Hinrich Schütze
Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings – both static and contextualized – for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners – even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.
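In its simplest form, embedding-based alignment links each source token to the target token whose embedding is most similar. The sketch below illustrates that idea with tiny invented embedding vectors; it is a toy variant, not the authors' actual method:

```python
import numpy as np

def align(src_emb, tgt_emb):
    """Align each source token to the target token with the highest
    cosine similarity between their embeddings."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = s @ t.T                        # (len_src, len_tgt) similarity matrix
    return [(i, int(j)) for i, j in enumerate(np.argmax(sim, axis=1))]

# Toy embeddings: source tokens 0/1/2 lie near target tokens 1/0/2.
src = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]])
tgt = np.array([[0.0, 1.0, 0.1], [1.0, 0.0, 0.1], [0.0, 0.1, 1.0]])
print(align(src, tgt))                   # [(0, 1), (1, 0), (2, 2)]
```

Real systems refine this greedy argmax with symmetrization and, for contextualized embeddings, per-sentence representations, but the similarity-matrix core is the same.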
By Ethan C. Chau, Lucy H. Lin, Noah A. Smith
Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models’ pretraining data and target language varieties.
✻ Oral Session 2 ✻
By Ahmet Üstün, Arianna Bisazza, Gosse Bouma and Gertjan van Noord
Recent work has shown that a single multilingual model with typologically informed parameter sharing can improve the performance in dependency parsing on both high-resource and zero-shot conditions. In this work, we investigate whether such improvements are also observed in the POS, NER and morphological tagging tasks.
By Isabel Papadimitriou, Ethan A. Chi, Richard Futrell and Kyle Mahowald
We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a "subject") is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecthood of mBERT embeddings in transitive sentences (which do not contain overt information about morphosyntactic alignment) and then evaluate them zero-shot on intransitive sentences (where subjecthood classification depends on alignment), within and across languages. We find that the resulting classifier distributions reflect the morphosyntactic alignment of their training languages. Our results demonstrate that mBERT representations are influenced by high-level grammatical features that are not manifested in any one input sentence, and that this is robust across languages. Further examining the characteristics that our classifiers rely on, we find that features such as passive voice, animacy and case strongly correlate with classification decisions, suggesting that mBERT does not encode a purely syntactic subjecthood, but a continuous subjecthood as is proposed in much of the functional linguistics literature. Together, these results provide insight into how grammatical features manifest in contextual embedding spaces, at a level of abstraction not covered by previous work.
By Alexander Gutkin, Martin Jansche and Lucy Skidmore
This extended abstract summarizes our past and ongoing work on assessing the quality of multilingual phoneme inventories derived from typological resources, on inducing phonological inventories using distinctive feature representations from speech data, and on the important role phonological typology plays in these approaches.
By Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni and Giulia Venturi
This contribution presents the results of a method for typological feature identification in multilingual treebanks. The results are exemplified on Italian and English subject relations. Applications of the method for multilingual dependency parsing evaluation are discussed.
By Yushi Hu, Shane Settle and Karen Livescu
Acoustic word embeddings (AWEs) are vector representations of spoken word segments. AWEs can be learned jointly with embeddings of character sequences, to generate phonetically meaningful embeddings of written words, or acoustically grounded word embeddings (AGWEs). Such embeddings have been used to improve speech retrieval, recognition, and spoken term discovery. In this work, we extend this idea to multiple low-resource languages. We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages. The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages. We also investigate distinctive features, as an alternative to phone labels, to better share cross-lingual information. We test our models on word discrimination tasks for twelve languages while varying the amount of target language training data, and find significant benefits to the proposed multilingual approach.
✻ Findings 2 ✻
By Saurabh Kulshreshtha, José Luis Redondo García, Ching-Yun Chang
Multilingual BERT (mBERT) has shown reasonable capability for zero-shot cross-lingual transfer when fine-tuned on downstream tasks. Since mBERT is not pre-trained with explicit cross-lingual supervision, transfer performance can further be improved by aligning mBERT with cross-lingual signal. Prior work proposes several approaches to align contextualised embeddings. In this paper we analyse how different forms of cross-lingual supervision and various alignment methods influence the transfer capability of mBERT in a zero-shot setting. Specifically, we compare parallel-corpus vs. dictionary-based supervision and rotational vs. fine-tuning-based alignment methods. We evaluate the performance of different alignment methodologies across eight languages on two tasks: Named Entity Recognition and Semantic Slot Filling. In addition, we propose a novel normalisation method which consistently improves the performance of rotation-based alignment, including a notable 3% F1 improvement for distant and typologically dissimilar languages. Importantly, we identify the biases of the alignment methods with respect to the type of task and the proximity to the transfer language. We also find that supervision from a parallel corpus is generally superior to dictionary alignments.
✻ Keynote Talks ✻
Yulia is an Assistant Professor in the Language Technologies Institute, School of Computer Science at Carnegie Mellon University. Her research interests are at or near the intersection of natural language processing, machine learning, linguistics, and social science. Her research is motivated by a unified goal: to extend the capabilities of human language technology beyond individual cultures and across language boundaries, thereby enabling NLP for disadvantaged groups, the users that need it most.
Bill is Professor of Linguistics at the University of New Mexico, USA. His interests in language are broad, but his central interests are in how meaning and function are encoded in grammatical form, and in the variation, diversity and evolution of languages. He takes a functional-typological approach to the analysis of grammar, drawing on the insights of construction grammar and cognitive linguistics. He has also developed an evolutionary framework for understanding language change, and collaborates in modeling language change processes in this framework.
By SIGTYP2020 Organizing Committee
Closing Remarks and SIGTYP 2021 Announcement!