πŸ“– About Me

Hello, I am a Senior Algorithm Expert in the Multimodal Group at 01.AI. My research interests cover computer vision, vision-and-language, and multimodal generation. Specifically, I focus on image, video, and speech generation with fine-grained control through text, image, and speech conditions. Before joining 01.AI, I was the head of the AI department at Xinhua Zhiyun, an Alibaba-affiliated company; before that, I was the co-founder and CTO of UniUbi. I received my master’s degree in 2015 from the Institute of Automation, Chinese Academy of Sciences, under the supervision of Professor Stan Z. Li. Throughout my career, I have been dedicated to advancing AI research and have shipped several AI applications, particularly for content creation and enhancement. If you are interested in collaborating with me on next-generation multimodal models, please feel free to contact me via email at (wangtaomarvel at gmail dot com).

πŸ’» Projects


Text-to-Image Generation Based on Multimodal Large Models

  • Project Duration: 2024
  • An any-to-any multimodal LLM project that supports using images and text as conditions simultaneously.
  • By using a powerful MLLM as the text encoder, it achieves stronger text understanding and better prompt following than open-source text-to-image models (a toy sketch of the conditioning scheme follows this list).
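
The implementation is not public, so below is only a minimal toy sketch of the idea the bullets describe: run an interleaved image+text prompt through an MLLM-style encoder and let a diffusion denoiser cross-attend to the resulting hidden states. Every class, dimension, and name here is my own stand-in, not the project's code.

```python
# Toy sketch: MLLM hidden states as the conditioning sequence for a diffusion denoiser.
import torch
import torch.nn as nn

class ToyMLLMEncoder(nn.Module):
    """Stand-in for a frozen MLLM: encodes mixed image/text tokens into hidden states."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)
        self.img_proj = nn.Linear(768, dim)  # project ViT-style patch features into the LLM space
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, image_feats):
        tokens = torch.cat([self.img_proj(image_feats), self.text_emb(text_ids)], dim=1)
        return self.backbone(tokens)  # (B, L, dim): one conditioning vector per prompt token

class ToyDenoiser(nn.Module):
    """One denoising step that cross-attends from noisy latents to the MLLM states."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latents, cond):
        attended, _ = self.attn(noisy_latents, cond, cond)  # query=latents, key/value=cond
        return self.out(attended)  # predicted noise

# Usage: an image (49 patch features) and a text prompt (16 tokens) condition one step.
enc, den = ToyMLLMEncoder(), ToyDenoiser()
cond = enc(torch.randint(0, 1000, (1, 16)), torch.randn(1, 49, 768))
print(den(torch.randn(1, 64, 256), cond).shape)  # torch.Size([1, 64, 256])
```

Routing both modalities through one encoder puts image and text conditions into a single shared sequence, which is what makes simultaneous image+text conditioning natural.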

Intelligent Image Editing Based on Multimodal Large Models

  • Project Duration: 2024
  • Supports a range of instruction-driven intelligent image editing tasks.
  • Uses an MLLM to automatically generate the Edit Type, Mask Prompt, and Output Image Prompt from a user instruction (a toy sketch of this flow follows the list).
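
The project's interfaces are not public, so here is a minimal toy sketch of the flow the second bullet describes: prompt the MLLM for a structured plan, then hand the fields to downstream grounding and inpainting modules. The JSON schema and the ground/inpaint helpers are hypothetical.

```python
# Toy sketch: parse a structured edit plan emitted by the MLLM.
import json
from dataclasses import dataclass

@dataclass
class EditPlan:
    edit_type: str      # e.g. "inpaint", "replace", "style_transfer"
    mask_prompt: str    # text used to locate the region to edit
    output_prompt: str  # text describing the desired edited content

def parse_plan(mllm_response: str) -> EditPlan:
    """Turn the MLLM's JSON reply into an edit plan; a real system would validate it."""
    d = json.loads(mllm_response)
    return EditPlan(d["edit_type"], d["mask_prompt"], d["output_prompt"])

# Example reply for the instruction "make the dog wear a red hat":
reply = ('{"edit_type": "inpaint", "mask_prompt": "the dog\'s head", '
         '"output_prompt": "a dog wearing a red hat"}')
plan = parse_plan(reply)
print(plan)
# Downstream (hypothetical): mask = ground(image, plan.mask_prompt)
#                            edited = inpaint(image, mask, plan.output_prompt)
```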

Multimodal-Driven Digital Human Model for Thousands of Users

  • Project Duration: 2021-2023
  • Supports customization of digital humans for more than 1,000 users within a single model.
  • Supports multiple driving inputs, including speech, singing, and images.
  • Fast inference: roughly 10x real-time video synthesis on an RTX 4090 (a toy sketch of the shared-backbone idea follows the demos).
Demos: speech-driven video synthesis · music-driven video synthesis
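
Since only the bullets above are public, the following is a toy sketch of one way to realize ">1,000 users in a single model": a shared audio-to-motion backbone plus a learned per-user embedding table. All names and dimensions are my own assumptions.

```python
# Toy sketch: one shared backbone, many users, selected by an embedding lookup.
import torch
import torch.nn as nn

class MultiUserAvatar(nn.Module):
    def __init__(self, num_users=1024, dim=256, audio_dim=80, motion_dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)  # one learned row per enrolled user
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.backbone = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, motion_dim)  # e.g. facial-motion coefficients per frame

    def forward(self, audio_feats, user_id):
        # Add the user embedding to every audio frame so the shared backbone
        # produces user-specific motion from the same weights.
        x = self.audio_proj(audio_feats) + self.user_emb(user_id)[:, None, :]
        h, _ = self.backbone(x)
        return self.head(h)

model = MultiUserAvatar()
mel = torch.randn(2, 100, 80)                # 100 frames of mel features (speech or singing)
motion = model(mel, torch.tensor([3, 517]))  # two different users served by one model
print(motion.shape)  # torch.Size([2, 100, 64])
```

Isolating per-user identity in an embedding row keeps the cost of adding a new user to a single extra table entry, with all users sharing one set of weights at serving time.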

Self-Supervised Multi-Speaker TTS Model

  • Project Duration: 2022-2023
  • Seamlessly integrates phoneme prediction, phoneme alignment, and the vocoder, enabling genuinely self-supervised training that scales with large unlabeled corpora.
  • Supports multi-speaker training and zero-shot voice cloning.
  • Non-autoregressive design for fast inference (a toy sketch of the parallel decoding flow follows the demos).

Demos: all speakers in one model · zero-shot original voice vs. zero-shot synthesized voice
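
The model itself is not released, so this is a toy, FastSpeech-style sketch of what the non-autoregressive bullet implies: durations are predicted per phoneme, phoneme states are expanded by a length regulator, and all mel frames are decoded in one parallel pass, with a reference-speaker vector enabling zero-shot cloning. Every module here is a stand-in.

```python
# Toy sketch: parallel (non-autoregressive) TTS with a reference-speaker embedding.
import torch
import torch.nn as nn

class NarTTS(nn.Module):
    def __init__(self, n_phonemes=100, dim=256, mel_dim=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)
        self.dur_pred = nn.Linear(dim, 1)       # predicted duration (frames) per phoneme
        self.spk_enc = nn.Linear(mel_dim, dim)  # toy reference-speaker encoder
        self.decoder = nn.Linear(dim, mel_dim)  # decodes all frames in parallel

    def forward(self, phonemes, ref_mel):
        h = self.phoneme_emb(phonemes)                                 # (1, P, dim)
        dur = self.dur_pred(h).squeeze(-1).exp().round().clamp(min=1)  # frames per phoneme
        # Length regulator: expand each phoneme state to its predicted duration.
        expanded = h[0].repeat_interleave(dur[0].long(), dim=0)        # (T, dim)
        spk = self.spk_enc(ref_mel.mean(dim=1))                        # (1, dim) speaker vector
        return self.decoder(expanded[None] + spk[:, None, :])          # (1, T, mel_dim), one pass

tts = NarTTS()
mel = tts(torch.randint(0, 100, (1, 12)), torch.randn(1, 50, 80))  # 12 phonemes, unseen speaker
print(mel.shape)  # (1, T, 80) with T = total predicted frames
```

Because no frame depends on the previous one, synthesis is a single forward pass, which is where the fast-inference claim in the bullets comes from.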

πŸ“ Publications

  • Tao Wang, Jianwei Yang, Zhen Lei, Shengcai Liao, Stan Z. Li. β€œFace Liveness Detection Using 3D Structure Recovered from a Single Camera”. ICB 2013, Madrid, Spain, June 4–7, 2013. Citations: 162

πŸŽ– Honors and Awards

  • Recipient of the National Scholarship three times
  • Champion of the first Alibaba Tianchi Big Data Competition, with a prize of 200,000 RMB
  • Champion of the 2014 Double 11 Tmall Recommendation Algorithm Challenge, with a prize of 1,000,000 RMB