Language Data Ownership and Copyright in Translation
26/02/2020
5 minute read
Let's investigate copyright scenarios in a typical translation supply chain, ask who owns language data, and define data ownership in translation.

The digital age and new technologies are giving intellectual property (copyright) and data ownership laws a hard time. Practically any idea, document or page on the web can be copied instantaneously. Authored texts can be translated using Google Translate or some other machine translation (MT) technology. In fact, browser add-ons do that automatically. When an original text is copied, the breach is evident, but what about translation? A translation of an original is automatically protected by copyright, but does it matter that MT was used?

In localization, there are evident uncertainties around language data ownership, control and liabilities. Intellectual property and data protection laws vary from country to country, and they have yet to catch up with AI technologies. We hope that once they do, they will strike a balance between copyright protection and the benefits of the new technologies.

For now, although there are no all-encompassing rules, we list some useful common considerations around copyright in translation.

Can Anyone Own Language (Data)?

The question of language ownership is part of an old debate with two main legal views: one that refers to “language” as a general means of communication, and one that looks at “a language” as the property of its native speakers (see Who owns language? Mother tongues as intellectual property and the conceptualization of human linguistic diversity by Christopher Hutton for reference). What is the difference? Language as such is considered natural, unstructured and no one's private possession: it is a common good for people to use. A language, by contrast, can be structured and belong to the closed group of its native speakers, who then have privileged access to it. While this privilege is rarely claimed, it is justified for the purposes of copyright and trademark law.

So language as such cannot be owned by anybody. The exception is specific words of a language that are deemed structured constructs.

Copyright Scenarios in Typical Translation Supply Chain

What are the steps in the translation process and how do they relate to copyright?

Let’s simplify it for the sake of explanation. First, there is the original document written in the source language. Copyright to this document belongs to the author or to the company that commissions the author. Then there is the translation of the source document. A translation is considered a derivative work, protected by the same copyright as the original and owned and controlled by the author. If the author has granted permission for translation, the translator now also owns copyright. That copyright is generally transferred to the company that hires the translator to provide the translation service. This is all pretty simple so far, right? But what happens when we add technology to the mix?

Copyright and Translation Memory

Copyright is not linked to any particular form or shape: it can apply just as well to one chapter of a 1,000-page book as to the entire book. In that respect, copyright to the individual sentences in a translation memory (source and target) still belongs to the author, the translator, or the company that employed them or paid for their services. However, unless one can tell that an individual sentence is the work of a specific author, claiming copyright is difficult. If a segment consists of short, commonplace phrases and the author's “own intellectual creation” is not recognizable, it is not protected by copyright.

Copyright and Machine Translation

The case of MT, on the other hand, is more complex. An unauthorized machine translation of an original would be considered infringement under the current copyright regime. At the same time, we can’t deny the usefulness of machine translation systems, both public and proprietary. No translator has so far been found liable for the single act of running a (machine) translation engine on a work that was publicly available on the internet.

Let’s Summarize! When Does Copyright Apply to Language Data?

  • For the source text, when it has been written by a human being and the “hand of the author” can be distinguished.
  • For the target text, if it is a reproduction of an original work that carries copyright. Both the owner of the original and the translator hold copyright in the translation.
  • For the segment, both source and target, only if it is recognizable as the creation of the author.

Should a copyright owner find out that a translator has copied and translated their work and consider this problematic, they could demand that the language data be deleted.

Author
Milica Panić

Milica is a marketing professional with over 10 years in the field. As TAUS Head of Product Marketing she manages the positioning and commercialization of TAUS data services and products, as well as the development of taus.net. Before joining TAUS in 2017, she worked in various roles at Booking.com, including localization management, project management, and content marketing. Milica holds two MAs in Dutch Language and Literature, from the University of Belgrade and Leiden University. She is passionate about continuously inventing new ways to teach languages.
