Jul 03, 2019
What You See is What You Say
Teaching a computer meaningful associations between words and videos typically requires training on tens of thousands of videos painstakingly annotated by hand. That’s both labor-intensive and prone to inconsistency due to the often abstract relationship between video imagery and soundtrack.