András Aponyi

NLP Research Analyst at TAUS with a background in linguistics and natural language processing. My mission is to follow the latest trends in NLP and use them to enrich the TAUS data toolkit.

3 Jan 2022

Bilingual, NLP-driven word clouds are now available in TAUS Data Marketplace. In this article, we discuss what word clouds are and what they can tell us about the contents of a document containing bilingual text data.

19 Aug 2021

There is a vast collection of textual data on the internet and in various organizational databases today, the overwhelming majority of which is not structured in an easily accessible manner. Natural language processing (NLP) can be used to make sense of unstructured data collections in a way that allows the automation of important decision-making processes that would otherwise require a significant investment of time and effort.

10 Feb 2021

Embeddings have radically transformed the field of natural language processing (NLP) in recent years by making it possible to encode pieces of text as fixed-size vectors. One of the most recent breakthroughs born out of this way of representing textual data is a collection of methods for creating sentence embeddings, also known as sentence vectors. These embeddings make it possible to represent longer pieces of text numerically as vectors that computer algorithms, such as machine learning (ML) models, can handle directly. In this article, we will discuss the key ideas behind this technique, list some of its possible applications, and provide an overview of some of the state-of-the-art sentence embedding approaches commonly used in NLP research and the language industry.

7 Sep 2020

In another article, we discussed automatic machine translation (MT) evaluation metrics such as BLEU, NIST, METEOR, and TER. These metrics assign a score to a machine-translated segment by comparing it to a reference translation, which is a verified, human-generated translation of the source text. As a result, they are easy to understand and work with, but severely limited by their reliance on references, which are not always available in a translation production setting.