TAUS Data Summit 2017

2 November, 2017 | Hosted by Google

Capitalizing on Language Data

Overview

Non-Zero-Sum

Data are in great demand in every industry these days, and the translation industry is no exception. Translation memory data have been the translator’s best resource for ensuring consistency and enhancing productivity since the early nineties. The use of translation memory data has always remained ‘private’. Using translation memories outside your own company, practice or product not only raised issues of data ownership and confidentiality; it was also clear that no real benefits could be obtained. Translation memory technology was in essence an invention by and for translators.

This changed with the breakthrough of Statistical Machine Translation, which was first demonstrated by IBM in the early nineties and became really popular with the arrival of the open-source Moses SMT engine and the opening of the translation market to machine translation around 2010. Google Translate launched in 2006, quickly became popular, and was joined in quick succession by Microsoft’s Bing Translator and the Yandex and Baidu translators. Together, these events caused a ‘hunt’ for translation data.

The recent breakthrough in machine translation technology - the arrival of neural networks - changes the perspective on data. Very large quantities of data, often harvested from the web, belong to the past. Neural MT thrives on smaller volumes of high-quality language data. New rules apply in this new reality. We are evolving from the private use of data to a level playing field, where everyone can benefit: the non-zero-sum game.

Objective of the Data Summit

The TAUS Data Summit will bring owners and producers of language data together with the ‘power users’ and MT developers to learn from each other and to find common ground and ways to collaborate. The objective of the one-day event is to define an open market for language data in which all parties can benefit.

Who should participate?

The TAUS Data Summit welcomes everyone who is concerned about the importance and value of language data, both producers and users of language data. The producers include everyone in the translation supply chain:

  • Translation buyers that may have their in-house production resources or outsource to translation vendors;
  • Language service providers with internal translation resources and external freelance resources;
  • Independent language professionals who provide translation, transcreation, editing, post-editing and consultancy services (ask TAUS for special freelance registration fees);
  • Power users. Power users - in our definition - are the companies that have internal translation and language technology resources;
  • Dedicated MT and language technology development companies. 

Human Language Project

In 2008 TAUS and its members created a data-sharing platform, the TAUS Data Cloud. In 2012 the TAUS Data Cloud was revamped as the Human Language Project, pursuing the vision that ‘democratic’ access to all human language data would ultimately help us to break down language barriers. Between 2012 and 2016 the TAUS Data Cloud experienced significant growth in volume. The current volume is 73 billion words in 2,300 language pairs. This now forms the foundation for a further modernization of the TAUS Data Cloud. The TAUS members would like to pursue a strategy that transforms the data platform into a Data Market that can help both users and producers of data to capitalize on their data and create a level playing field.

Focus and format of the Summit

The focus of the summit is on the challenges and opportunities of managing and sharing language data in the translation industry. It will be a highly interactive meeting, with short presentations of use cases and experiences, followed by discussions with the participants. The discussions will be moderated by TAUS. The aim of the meeting is to deliver concrete and useful results. We plan to draft a roadmap of features and business rules for the new TAUS Data Market based on the discussions.

Data Market Virtuous Cycle

With the TAUS Data Market we want to create a virtuous cycle that benefits both providers and users of language data. The opportunity to earn royalties on language data draws data providers into the Data Market, which increases data coverage for more languages and domains. The resulting improvement in MT quality increases MT usage and generates more demand for data, attracting even more data providers.

Preparation

Participants are requested to prepare a short presentation of at most three slides, giving an overview of their organization and the challenges and opportunities they are facing with regard to language data, machine translation and quality management. We also suggest you read the Data Market White Paper that was published in June 2017.

Agenda

9:00 / Word of welcome by the host, Mengmeng Niu (Google)

Setting the Goals of the Day

Data have become vital not only for the large machine translation providers, but for every translation business. The goal of the TAUS Data Summit is to explore ways in which we can collaborate more and better in language data collection and sharing.

9:15 / Short introductions of the participants and agenda overview, Jaap van der Meer (TAUS)

The Translation Data Landscape

In this opening session we will get an overview of experiences and use cases with corpus building and data sharing in support of language technology projects and translation automation.

9:30 / TAUS Translation Data Landscape Report, Jaap van der Meer (TAUS)

9:40 / Tmxmall TM P2P sharing and trading model in China, Jing Zhang (Tmxmall)

9:50 / Experience with corpus building projects and the current status of corpus quality and data set in Korea, Brian Cho (HansemEUG, Corp.)

10:00 / Serving China's One Belt & One Road Initiative, Henry Wang (UTH International)

10:10 / Tilde’s work on collecting and processing EU language resources for the CEF eTranslation platform, Rihards Kalnins (Tilde)

10:20 / Data Acquisition for the Modern Machine Translation Project, Achim Ruopp (TAUS)

10:30 / Discussion and Q&A with all speakers

10:45 / Coffee break

Use Cases of Language Data

In this session we will learn about use cases of language data for language technologies and translation automation.

11:15 / Microsoft's experiences with consuming shared data and with sharing data, Chris Wendt (Microsoft)

11:25 / eBay's use case for data in the training of MT systems, Jose Luis Bonilla Sanchez (eBay)

11:35 / Managing both translation data and translation models, Spence Green (Lilt)

11:45 / The importance of clean data in machine learning, Tony O’Dowd (KantanMT)

11:55 / Discussion and Q&A with all speakers

12:15 / Lunch break

Language Data for the Greater Good

Language data have the potential power to help break language barriers. In this session we will learn about both visions and real-life rescue stories.

13:15 / The Common Language Initiative of Translators Without Borders, Aimee Ansari (TWB)

13:30 / The Google Translate Community, Mengmeng Niu (Google)

13:45 / The Human Language Project, Jaap van der Meer (TAUS)

14:00 / Discussion and Q&A with all speakers

14:15 / Refreshment break

World Cafe

In the last session we will follow a meeting format known as ‘The World Cafe’ (see: http://www.theworldcafe.com/). The participants will meet in small groups of four (max five) at small cafe-style tables to discuss specific questions. There will be progressive rounds of conversations, approximately 20 minutes each. Hosts will be invited to stay at the tables to welcome the next group and brief them on what happened in the previous round. Notes will be taken on the tablecloths, index cards and flip charts. The small table conversations are followed by a plenary discussion to synthesize the answers and discoveries.

Some of the questions that will be addressed in the World Cafe are:

  • How do we price data? The non-existence of a language data market means that we have no experience with setting prices for data. How do we set prices? Should prices differ depending on language, domain, type, quality, format?
  • Quality of language data. How do we determine and define the quality of language data? Are there tools to automate quality assessment? Is peer review working?
  • Type of data. What type of data do we need? Do we need to extend the Data Market to also aggregate and exchange speech data?
  • Business rules for sharing and trading of data. Exchange of ideas for the business rules. What’s in it for the translator? And for the language service provider? And for the translation customer?
  • Clarifying copyright on language data. How does existing copyright legislation apply to the Data Market? How do our Terms of Use relate to daily practices and behaviors of sharing?

More questions may come up during the day and will be written down on flip charts.

14:45 / First round of conversations

15:05 / Second round of conversations

15:25 / Third round of conversations

16:05 / Plenary meeting to discuss the outcomes of the World Cafe conversations

17:00 / Adjourn

Speakers

Aimee Ansari | Translators Without Borders

Aimee is the Executive Director of Translators without Borders. She brings over 20 years of experience in international aid. She has worked in several humanitarian crises from the Tajik civil war to the earthquake in Haiti, the conflicts in the Balkans to the Syrian refugee crisis and the conflict in South Sudan. Prior to joining Translators without Borders in 2016, Aimee worked with Care, Oxfam, Save the Children and the United Nations.


Brian Cho | HansemEUG, Corp.


Spence Green | Lilt


Rihards Kalnins | Tilde

Rihards Kalnins is the Head of MT Solutions at Tilde, a leading European language technology and localization services company that specializes in custom machine translation. At Tilde, Kalnins manages key accounts and strategic partnerships, steers the MT product development roadmap, and coordinates the implementation of custom MT solutions for global customers. He is currently overseeing development of a Neural MT service for the 2017-2018 EU Council Presidencies and is helping the European Commission extend its automated translation platform CEF eTranslation. A former Fulbright scholar with a degree in Philosophy, Kalnins has written about language and multilingual policy for EurActiv.com and The Guardian.


Mengmeng Niu | Google


Tony O’Dowd | KantanMT

Tony O’Dowd is the CEO and Chief Architect of KantanMT.com. He has over 25 years’ experience working in the localization industry and has held positions at Lotus Development Corporation, Symantec Corporation, Corel Corporation Ltd., and Alchemy Software Development, which he founded in 2000. Tony holds a BSc. in Computer Science from Trinity College Dublin, and a Fellowship from the Localisation Research Institute at the University of Limerick.


Achim Ruopp | TAUS

Achim specializes in translation automation, internationalization and multi-lingual natural language processing. He believes that making MT widely available breaks down language barriers. As part of the TAUS team Achim leads technology development and implementation.


Jose Sanchez | eBay

Jose Luis Bonilla Sánchez manages the Machine Translation Language Specialist team at eBay. After graduating in Translation and Interpreting at the University of Granada, he worked in different roles (linguist, PM, LQA and knowledge engineer) in Spain, The Netherlands and the US, before taking charge of human QA for eBay's MT services in 2013.


Jaap van der Meer | TAUS

Jaap van der Meer founded TAUS in 2004. He is a language industry pioneer and visionary, who started his first translation company, INK, in The Netherlands in 1980. Jaap is a regular speaker at conferences and the author of many articles about technologies, translation and globalization trends.


Henry Wang | UTH International

Henry Wang is the executive vice president of UTH International, a leading disruptive innovation company in the language technology solutions industry worldwide. Prior to this role, he was president of WordTech International, which was one of the top 30 globalization and language solution providers in Asia (#11) and was ranked #81 globally in 2014. He oversaw the company’s legal, IP, technical and marketing language solutions, as well as the operations of the company’s localization and globalization service offerings and innovation.


Jing Zhang | Tmxmall

Jing Zhang is the Founder & CEO of Tmxmall and Deputy Director of the Association of Language Service Providers (ALSP). A graduate of Northwestern Polytechnical University with a Master's degree from Tianjin University, Jing previously worked for Baidu. Tmxmall is one of the leading providers focusing on TM data production, sharing and trading. Its products include a cloud-based platform for translation memory exchange, online alignment, private cloud-based TM solutions, and a TM marketplace for sharing and trading.


Program Committee

Nick Lambson | MediaLocate


Tony O’Dowd | KantanMT

Tony O’Dowd is the CEO and Chief Architect of KantanMT.com. He has over 25 years’ experience working in the localization industry and has held positions at Lotus Development Corporation, Symantec Corporation, Corel Corporation Ltd., and Alchemy Software Development, which he founded in 2000. Tony holds a BSc. in Computer Science from Trinity College Dublin, and a Fellowship from the Localisation Research Institute at the University of Limerick.


Jose Sanchez | eBay

Jose Luis Bonilla Sánchez manages the Machine Translation Language Specialist team at eBay. After graduating in Translation and Interpreting at the University of Granada, he worked in different roles (linguist, PM, LQA and knowledge engineer) in Spain, The Netherlands and the US, before taking charge of human QA for eBay's MT services in 2013.


Jean Senellart | SYSTRAN

Jean Senellart is the Chief Scientist of SYSTRAN and has been driving the development of SYSTRAN hybrid technology and the new generation of SYSTRAN products. Jean graduated from the Paris Ecole Polytechnique and holds a PhD in Computational Linguistics from the University of Paris VII – LADL. He began his career as a researcher and has been teaching Natural Language Processing at Ecole Polytechnique. With a double passion for natural and computing languages, he is a strong believer in the value of big data combined with language analysis.


Venue

The TAUS Data Summit will be hosted by Google at their Mountain View Campus. The meeting room is called Hearty and is located in Building 1055. The address is:   

1055 Joaquin Rd
Mountain View, CA 94043
United States

Registration fees

Registration fee for the Data Summit is €350 for TAUS members and €700 for non-members.

Event Properties

Event Date 02-11-2017
Event End Date 02-11-2017
Capacity Unlimited
Registered 19
Individual Price EUR700.00
Created By dimitris@taus.net
Hosted by Google
Secondary text Capitalizing on Language Data