3 Mar 2022

The AI scene of the 2010s was shaped by breakthroughs in vision-enabled technologies, from advanced image search to computer vision systems for medical image analysis and for detecting defective parts in manufacturing and assembly. The 2020s, however, are expected to be all about natural language technologies and language-based AI tasks. NLP, NLG, NLQ, NLU… the list of abbreviations starting with NL (Natural Language) seems to grow each day. Whatever the technology domain, natural language technologies look set to shape a variety of fields, from business intelligence and healthcare to fintech.

3 Jan 2022

Bilingual, NLP-driven word clouds are now available in TAUS Data Marketplace. In this article, we discuss what word clouds are and what they can tell us about the contents of a document containing bilingual text data.
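
To give a concrete sense of how such clouds are built, here is a minimal sketch using the open-source `wordcloud` Python package on a few made-up segment pairs; it is an illustration of the general idea, not the Data Marketplace implementation.

```python
# A minimal sketch: one word cloud per side of a bilingual corpus.
from collections import Counter

from wordcloud import WordCloud

# A few made-up source/target segment pairs standing in for a real corpus.
pairs = [
    ("the quick brown fox", "der schnelle braune Fuchs"),
    ("the fox jumps over the lazy dog", "der Fuchs springt über den faulen Hund"),
    ("the dog sleeps", "der Hund schläft"),
]

for name, segments in [("source", [s for s, _ in pairs]),
                       ("target", [t for _, t in pairs])]:
    freqs = Counter(" ".join(segments).lower().split())   # naive token frequencies
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(freqs)
    cloud.to_file(f"{name}_cloud.png")                    # frequent words render largest
```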

2 Dec 2021

This is the third article in my series on Translation Economics of the 2020s. In the first article published in Multilingual, I sketched the evolution of the translation industry driven by technological breakthroughs from an economic perspective. In the second article, Reconfiguring the Translation Ecosystem, I laid out the emerging new business models and ended with the observation that new smarter models still need to be invented. This is where I will now pick up the thread and introduce you to the next logical translation solution. I call it: Data-Enhanced Machine Translation.

1 Dec 2021

Technologies such as Natural Language Processing (NLP), deep learning, and computer vision have been thriving since data science became well-established as a field of study and expertise. These developments have paved the way for machine learning (ML) to deliver on the promise of artificial intelligence (AI). The transformative effects of these technologies can be observed in our daily lives at an ever-increasing pace as we move into 2022.

18 Nov 2021

The Google Research team recently published a paper titled Data Cascades in High-Stakes AI. Its six authors, Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora Aroyo, shed light on a pervasive pattern of data undervaluation in high-stakes fields where AI models are critical and prevalent. They conclude that although there is great interest in creating MT and ML models, there is far less interest in doing the actual data work.

4 Nov 2021

What is Explainable AI? 

As AI becomes more prominent in high-stakes industries like healthcare, education, construction, environment, autonomous machines, and law enforcement, we are finding an increased need to trust the decision-making process. These predictions often need to be extremely accurate, for example in life-or-death situations in healthcare. Because of the critical and direct impact AI has on our day-to-day lives, decision-makers need more insight and visibility into the mechanics of AI systems and the prediction process. At present, often only technical experts such as data scientists or engineers understand the backend processes and algorithms being used, such as highly complex deep neural networks. This lack of interpretability has proven to be a source of disconnect between technical and non-technical practitioners. In an effort to make AI systems more transparent, the field of Explainable AI (XAI) came into existence.
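
As an illustration of what XAI can look like in practice, here is a minimal sketch of one widely used technique, SHAP feature attributions, applied to a generic scikit-learn model; the dataset and model are stand-ins, not tied to any particular system discussed here.

```python
# A minimal sketch of SHAP feature attributions: quantify how much each
# input feature pushed a model's prediction up or down.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # tree-aware explainer
sv = explainer.shap_values(X.iloc[:200])     # one attribution per feature per row
importance = abs(sv).mean(axis=0)            # mean |impact| per feature
for feature, score in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{feature}: {score:.3f}")         # human-readable ranking of what drove predictions
```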

4 Oct 2021

A machine learning algorithm uses data to learn and make decisions. The algorithm develops confidence in its decisions by capturing the underlying patterns, relationships, and structures within a training dataset. The higher the quality of the training data, the better the algorithm will perform. So what exactly is training data?
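
A minimal sketch of that loop with scikit-learn on a built-in dataset: fit a model on training data, then check how well the learned patterns generalize to examples it has never seen.

```python
# Fit on labeled training data, then evaluate on a held-out split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # learn patterns from training data
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")   # quality of what was learned
```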

4 Oct 2021

Training data is one of the most integral pieces of machine learning and artificial intelligence; without it, neither would be possible. Models could not learn, make predictions, or extract useful information without training data to learn from. It's safe to say that training data is the backbone of machine learning and artificial intelligence.

4 Oct 2021

Training data is used in three primary types of machine learning: supervised, unsupervised, and semi-supervised learning. In supervised learning, the training data must be labeled, which allows the model to learn a mapping from features to their associated labels. In unsupervised learning, no labels are required: the model looks for underlying structures in the features of the training set to form generalized groupings or predictions. In semi-supervised learning, the training dataset contains a mix of labeled and unlabeled examples.
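
The three settings can be contrasted in a few lines of scikit-learn; in its semi-supervised API, unlabeled examples are marked with -1. A toy sketch:

```python
# Contrasting supervised, unsupervised, and semi-supervised learning on toy data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Supervised: every training example carries a label.
supervised = LogisticRegression().fit(X, y)

# Unsupervised: no labels; group examples by structure in the features alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Semi-supervised: a few labels, the rest marked -1 (unlabeled).
y_partial = y.copy()
y_partial[30:] = -1
semi = SelfTrainingClassifier(LogisticRegression()).fit(X, y_partial)
```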

4 Oct 2021

The amount of training data you need depends on many variables: the model you use, the task you perform, the performance you wish to achieve, the number of available features, the noise in the data, the complexity of the model, and more.
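
One practical, if rough, way to answer the question empirically is a learning curve: train on progressively larger subsets and watch where validation performance plateaus. A sketch with scikit-learn on a built-in dataset:

```python
# Estimate data needs empirically: score vs. training-set size.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} training examples -> cv accuracy {score:.3f}")  # look for the plateau
```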

4 Oct 2021

Data cleaning is an essential step in machine learning that takes place before model training. It matters because your machine learning model will produce results only as good as the data you feed it. If your dataset contains too much noise, your model will capture that noise in its results, and messy data can break your model outright or drag its accuracy down. Common data cleaning techniques include removing syntax errors, normalizing data, removing duplicates, detecting and removing outliers, and fixing encoding issues.
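
A minimal pandas sketch of several of these techniques on a tiny, made-up dataset:

```python
# Data cleaning on a toy dataset: a duplicate, a typo, inconsistent casing, an outlier.
import pandas as pd

df = pd.DataFrame({
    "price": ["10.5", "10.5", "abc", "12.0", "11.0", "10.0", "9.5",
              "12.5", "11.5", "10.8", "9.9", "11.2", "99999"],
    "city":  ["berlin", "berlin", "paris", "PARIS", "paris", "berlin", "paris",
              "berlin", "paris", "berlin", "paris", "berlin", "paris"],
})

df = df.drop_duplicates()                                   # duplicate removal
df["price"] = pd.to_numeric(df["price"], errors="coerce")   # syntax errors become NaN
df = df.dropna(subset=["price"])                            # drop unparseable rows
df["city"] = df["city"].str.lower()                         # normalize inconsistent casing

# Outlier detection/removal: drop rows more than 3 standard deviations out.
z = (df["price"] - df["price"].mean()) / df["price"].std()
df = df[z.abs() <= 3]
print(df)
```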

4 Oct 2021

Training data can be sourced from many different places, depending on your machine learning application. Data can be found just about anywhere: from free, publicly available datasets to privately held data available for purchase to crowdsourced data. Such datasets are known as organic data or naturally occurring datasets.

1 Oct 2021

Acquiring high-quality parallel corpora is essential for training well-performing MT engines. There are a number of publicly available multilingual corpora, such as the proceedings of the European Parliament (Europarl) or transcribed TED Talks (OPUS). Owing to their size and confirmed high quality, these have been used by researchers as sources of large-scale parallel language data.
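
For illustration, the Hugging Face `datasets` library offers one common way to pull such corpora into a pipeline; the dataset id and calling convention below are assumptions to verify against the current Hub listing for the corpus you need.

```python
# A sketch of loading a public parallel corpus via Hugging Face `datasets`.
# The id "europarl_bilingual" and its lang1/lang2 arguments are assumptions;
# check the Hub for the current id of the corpus you want.
from datasets import load_dataset

corpus = load_dataset("europarl_bilingual", lang1="en", lang2="fr", split="train")
pair = corpus[0]["translation"]        # e.g. {"en": "...", "fr": "..."}
print(pair["en"], "->", pair["fr"])
```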

7 Sep 2021

What is Intent Recognition and Why is it Important? 

As our society continues to rely on technologies such as social network apps, email, chatbots, and more, the volume and availability of text data continue to multiply. With the popularity of online and phone services, companies have had a difficult time keeping up. Intent recognition models have come to the rescue, helping to flag and sort through this vast body of text data.
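
Under the hood, intent recognition is usually framed as supervised text classification. A toy sketch with scikit-learn, using made-up utterances and intent labels:

```python
# Intent recognition as text classification: TF-IDF features + linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "I want to reset my password", "forgot my login credentials",
    "where is my package", "track my recent order",
    "cancel my subscription", "stop billing my card",
]
intents = ["account", "account", "shipping", "shipping", "billing", "billing"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(utterances, intents)
print(model.predict(["my order never arrived"]))  # -> likely ["shipping"]
```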

19 Aug 2021

There is a vast collection of textual data on the internet and in various organizational databases today, the overwhelming majority of which is not structured in an easily accessible manner. Natural language processing (NLP) can be used to make sense of unstructured data collections in a way that allows the automation of important decision-making processes that would otherwise require a significant investment of time and effort to carry out manually.
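
As a small illustration of turning free text into structured records, here is a sketch using spaCy's pretrained named-entity recognizer; the example sentence is made up, and the model requires a one-time `python -m spacy download en_core_web_sm`.

```python
# Pull structured (text, type) records out of unstructured text with spaCy NER.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("TAUS was founded in Amsterdam and works with partners across Europe.")

# Each entity becomes a record a downstream process can act on automatically.
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. TAUS ORG, Amsterdam GPE, Europe LOC
```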

7 Jun 2021

Sentiment analysis is a subfield of Natural Language Processing (NLP) in which the overall sentiment is learned from a body of text. It is primarily used to gauge customer satisfaction, social popularity, and public feedback on a product by monitoring social or public data. Several different types of sentiment analysis are common in the real world.
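
As a taste of the simplest variety, here is a sketch of rule-based polarity scoring with NLTK's VADER analyzer; the reviews are made up, and the lexicon needs a one-time `nltk.download("vader_lexicon")`.

```python
# Rule-based sentiment scoring with NLTK's VADER lexicon.
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for review in ["Great product, works perfectly!", "Terrible support, never again."]:
    scores = analyzer.polarity_scores(review)
    print(review, "->", scores["compound"])  # compound > 0 positive, < 0 negative
```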

3 Jun 2021

The amount of content produced worldwide that needs translation has been surging for years. For the vast majority of players in the language industry, the COVID-19 pandemic didn't slow things down but rather accelerated them. According to Nimdzi, the language services industry reached USD 55B in 2020 and is on a growth path expected to hit a whopping USD 73.6B by 2025. This will only be possible if the industry keeps showing the same resilience and adaptability, embracing new technologies and digital transformation.

1 Jun 2021

Perhaps the most pivotal step in your machine learning application is the data preparation phase. On average, data scientists spend more time prepping and transforming datasets than on any other task, including training the model itself. Each machine learning algorithm requires the data to be in a certain format, and transforming the data into that format has a massive impact on the model's performance and predictive power. Understanding and implementing the proper data preparation methods will strengthen any machine learning application. Below is a brief guide to common data preparation techniques.
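
To make this concrete, here is a minimal scikit-learn sketch of two staple preparation steps, scaling numeric features and one-hot encoding categorical ones, wrapped in a pipeline so the same transformations apply at training and prediction time; the column names are illustrative assumptions.

```python
# Common preparation steps wired into a single reusable pipeline.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

prep = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),                  # normalize numeric ranges
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # encode categories
])
model = make_pipeline(prep, LogisticRegression())
# model.fit(df[["age", "income", "country"]], df["label"])  # hypothetical DataFrame
```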

4 May 2021

People create vast amounts of data daily through the many touchpoints of their IoT (Internet of Things) devices, often without even realizing it. Think of all the apps you use, the messages you send, the pictures you take and share. And these are just the byproducts of your leisure activities (ones you might not necessarily want to use elsewhere). Now try to imagine how much data you create as part of your work. If you are a translator or a language service provider, the amount of language data you have generated over time while working on projects, building glossaries or translation memories, or even just translating your favorite song or a paragraph from a book for fun, is immense. More importantly, even if years have passed and those specific lines of text are no longer used for their original purpose, they still have value as training data for ML applications.

29 Apr 2021

Machine learning (ML) is all around us. From spam detection in your mail inbox to text and document classification, self-driving cars, speech recognition, and more, machine learning algorithms are present in our everyday lives. A machine learning model is an algorithm that learns from data and makes predictions automatically, without being explicitly programmed. In this article, we will review the main factors to consider when selecting the right ML model for your application and data.
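
One standard starting point for that selection is to compare cross-validated scores of a few candidate models on the same data before committing to one. A minimal scikit-learn sketch:

```python
# Compare candidate models with 5-fold cross-validation on the same dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "random forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")  # pick with error bars in view
```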