📖 About Me

Hello, I am a Senior Algorithm Expert in the Multimodal Group at 01.AI. My research interests cover computer vision, vision and language, and multimodal generation. Specifically, I focus on image, video, and speech generation, and on controlling these processes precisely through text, images, and speech. Before joining 01.AI, I was the head of the AI department at Xinhua Zhiyun, an Alibaba-affiliated company. Prior to that, I was the co-founder and CTO of UniUbi. I received my master’s degree in 2015 from the Institute of Automation, Chinese Academy of Sciences, under the supervision of Professor Stan Z. Li. Throughout my career, I have been dedicated to advancing AI research and have successfully deployed several AI applications, particularly in content creation and enhancement. If you are interested in collaborating with me on next-generation multimodal models, please feel free to contact me by email at (wangtaomarvel at gmail dot com).

💻 Projects

Text-to-Image Generation Based on Multimodal Large Models

Project Duration: 2024
An Any-to-Any Multimodal LLM (MLLM) project that accepts images and text together as conditions.
By leveraging the MLLM, it achieves stronger text encoding and better prompt following than open-source models.

Intelligent Image Editing Based on Multimodal Large Models

Project Duration: 2024
Supports a variety of intelligent image editing tasks.
Uses the MLLM to automatically generate the Edit Type, Mask Prompt, and Output Image Prompt.

Multimodal-Driven Digital Human Model for Thousands of Users

Project Duration: 2021-2023
Supports customization of digital humans for multiple users (>1000) within a single model.
Supports multiple input types, including voice, singing, and images.
Extremely fast inference: video synthesis runs at 10x speed on an RTX 4090.

Speech-Driven Video Synthesis

Music-Driven Video Synthesis

Self-Supervised Multi-Speaker TTS Model

Project Duration: 2022-2023
Integrates phoneme prediction, phoneme alignment, and the vocoder into a single model, achieving true self-supervised training that can take advantage of large-scale data.
Supports multi-speaker training and zero-shot voice cloning.
Non-autoregressive design enables fast inference.

All Speakers in One Model

Zero Shot Original Voice

Zero Shot Synthesized

📝 Publications

Tao Wang, Jianwei Yang, Zhen Lei, Shengcai Liao, Stan Z. Li. “Face Liveness Detection Using 3D Structure Recovered from a Single Camera”. ICB 2013, Madrid, Spain, June 4-7, 2013. Citations: 162

🎖 Honors and Awards

Three-time recipient of the National Scholarship
Champion of the first Alibaba Tianchi Big Data Competition, with a prize of 200,000 RMB
Champion of the 2014 Double 11 Tmall Recommendation Algorithm Challenge, with a prize of 1,000,000 RMB

Tao Wang
