Overview

Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) Extreme Compression. By reducing both the number of quantizer layers and the temporal resolution of the discrete codec, one second of audio at a 24 kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) Improved Subjective Quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores, and its tokens inherently contain richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as by introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also evaluated semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer.
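To make the compression figures concrete, the sketch below works out how many waveform samples each token covers and what bitrate the 40 and 75 tokens-per-second settings imply. The 24 kHz sampling rate, the single quantizer, and the token rates come from the text above. The codebook size of 4096 entries (12 bits per token) is an assumption that is not stated in this section; it is chosen because it is consistent with the 0.5 kbps and 0.9 kbps labels used for the samples on this page. The function names are illustrative only.

```python
import math

SAMPLE_RATE_HZ = 24_000        # from the text: 24 kHz audio
ASSUMED_CODEBOOK_SIZE = 4096   # assumption: not stated in this section


def samples_per_token(sample_rate_hz: int, tokens_per_second: int) -> float:
    """Number of waveform samples summarized by one discrete token."""
    return sample_rate_hz / tokens_per_second


def bitrate_kbps(tokens_per_second: int, codebook_size: int,
                 num_quantizers: int = 1) -> float:
    """Bitrate implied by the token rate and codebook size, in kbps."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * num_quantizers * bits_per_token / 1000


if __name__ == "__main__":
    for tps in (40, 75):
        hop = samples_per_token(SAMPLE_RATE_HZ, tps)
        rate = bitrate_kbps(tps, ASSUMED_CODEBOOK_SIZE)
        print(f"{tps} tokens/s: {hop:.0f} samples per token, {rate:.2f} kbps")
```

Under this assumption, 40 tokens per second corresponds to 0.48 kbps (rounded to 0.5 kbps in the labels) and 75 tokens per second to 0.90 kbps.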

Performance Comparison

Comparison between WavTokenizer and state-of-the-art acoustic codec models. The vertical axis shows UTMOS, where a higher score indicates reconstruction quality closer to human auditory perception. The horizontal axis shows the bitrate in kbps, which reflects the level of audio compression. The size of each circle represents the number of discrete tokens per second.

Figure 1. Performance Comparison.

Generated Samples

Text: The dull light fell more faintly upon the page whereon another equation began to unfold itself slowly and to spread abroad its widening tail. It was his own soul going forth to experience, unfolding itself sin by sin, spreading abroad the bale fire of its burning stars and folding back upon itself, fading slowly, quenching its own lights and fires. They were quenched: and the cold darkness filled chaos.

GT | WavTokenizer 0.5kbps (Ours) | WavTokenizer 0.9kbps (Ours)


Text: This was the call of life to his soul not the dull gross voice of the world of duties and despair, not the inhuman voice that had called him to the pale service of the altar.

Text: One way led to the left and the other to the right-straight up the mountain.

Text: No one saw him do this, for all were looking at the Powder of Life; but soon the woman remembered what she had been doing, and came back to the cupboard.

Text: She groped for new ways to teach colored brains and marshal colored thoughts and the result was puzzling both to teacher and student.

Text: This may answer, perhaps, in a small place where the manager can gauge pretty closely from actual observation what each customer does; but even then there are elements of risk and waste; and obviously in a large city such a method would soon be likely to result in financial disaster to the plant.

Text: There was an average cost per lamp for meter operation of twenty two cents a year, and each meter took care of an average of seventeen lamps.

Text: To prevent these electrolytes from freezing we had in each meter a strip of metal.

Text: But before it has reached us the rain cloud parts asunder, the sea boils, and the electric fires are brought into violent action by a mighty chemical power that descends from the higher regions.

Text: How long are we likely to be separated? Why are we to be denied each other's society?

Text: She makes effort after effort, trembling with eagerness, and when she fails to reproduce what she sees, she works herself into a frenzy of grief and disappointment.

Text: 'You know Captain Lake?' said Lord Chelford, addressing me.

Text: For the ideas and generalizations thus mainly formed from the images and impressions received in childhood become, in later years, the elements of the machinery, so to speak, by which all his mental operations are performed.

Text: He was such a big boy that he wore high boots and carried a jack knife. He gazed and gazed at the cap, and could not keep from fingering the blue tassel.

Text: This was a formidable array of advantages; slavery was playing with loaded dice.

Text: Great pains were taken by the Scots (and the English complied with their pretended delicacy) to make this estimation and payment of arrears appear a quite different transaction from that for the delivery of the king's person: but common sense requires that they should be regarded as one and the same.

Text: The study of any topic is like the continued observation of an object which is approaching us along a road: what is certain to begin with is the quite vague knowledge that there is some object on the road.

Text: At any rate, my eloquence was altogether stopped. The gentleman was named Sir Ferdinando Brown.

Text: These plumes waved gracefully in the air with every mincing step the Princesses took.

Text: Then they rolled the frame in position underneath the Great Knife and Trot held in her hand the cord which would release it.
