Former OpenAI chief scientist Ilya Sutskever once stated, “We have reached the data ceiling. There will be no more.” He believes that “there’s only one internet,” and current Large Language Models (LLMs) will hit a development bottleneck due to data exhaustion.
In my opinion, however, the potential of textual data is far from fully unlocked. The core issue is not how many “articles” we feed the model before training, but how we convert words into data: the representations we currently use are still too limited.
In fact, current LLMs first convert words into high-dimensional numerical vectors before training. They then use a “self-attention” mechanism to integrate contextual information, and finally generate text based on probability.
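That pipeline can be sketched in miniature. The toy example below (all matrices randomly initialized, dimensions purely illustrative, no training involved) walks through the three stages: an embedding lookup, scaled dot-product self-attention, and a softmax that yields a probability distribution over the next word:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hyperparameters; real models use tens of thousands of tokens
# and thousands of dimensions.
vocab_size, d_model, seq_len = 10, 8, 4

# 1. Each token id is mapped to a high-dimensional vector via an embedding table.
embedding = rng.normal(size=(vocab_size, d_model))
token_ids = np.array([1, 4, 2, 7])           # a 4-token input sequence
x = embedding[token_ids]                     # shape: (seq_len, d_model)

# 2. Self-attention integrates context: every position attends to every
#    other position through query/key/value projections.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)          # (seq_len, seq_len) similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ V                        # context-aware representations

# 3. A projection back to vocabulary size plus a softmax gives the
#    probability of each candidate next token.
W_out = rng.normal(size=(d_model, vocab_size))
logits = context[-1] @ W_out                 # predict from the last position
probs = np.exp(logits) / np.exp(logits).sum()
next_token = int(np.argmax(probs))
```

Note that every quantity here is a plain numeric vector; nothing in the representation itself carries sound, rhythm, or feeling.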
The trap here is that the high-dimensional vectors we provide to the model encode far too narrow a slice of linguistic information. They treat language as purely logical, sequential information, ignoring the much richer non-textual features often hidden behind it. We need the vector representing each word to encode more diverse modalities of information.
A writer’s work is not a simple arrangement of words based on their meanings and most common combinations, as is the case with LLMs. When genuine authors write, consciously or subconsciously, their words have rhythm, cadence, length, and even a sense of color and texture. This kind of combination, like music and painting, is what gives their work its strong artistic and emotional power.
In other words, the “textual data” we give models is a flattened, two-dimensional plane, while the language humans actually use is a rich, multi-dimensional sensory experience. This is precisely why current models excel at tasks like programming but often fall short in literary creation.
Programming languages are highly abstract and logical. They rely on strict syntax and rules, with no need for emotional or auditory information. Current models are very good at processing this kind of “lower-dimensional” information. In essence, the information dimension of the task perfectly matches the data.
However, literary language, and even daily conversation, requires higher-dimensional information: emotion, intuition, association, sensory experience, even environmental context. Under the current training paradigm, this information isn’t provided in the data. Instead, models must try to discover it on their own during training, and for them that is simply too hard.
Perhaps we need to fundamentally change the data representations and the training and inference paradigms of traditional large models. In the training phase, we could explicitly encode key features like acoustics, length, and emotion directly into the vectors, allowing the model to learn from this high-dimensional, multi-modal data. Furthermore, during both training and inference, the token, the smallest unit of language in a model, would become a dynamic, evolving state. When generating text, the model would then consider not only the probability of word combinations but also use its internal representations to infer elements like pitch, duration, accent, and rhyme.
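As an illustration only: one naive way to realize the first half of this proposal would be to concatenate explicit, externally supplied features onto the usual semantic embedding. The feature names and values below (syllable count, relative spoken duration, emotional valence) are hypothetical placeholders, not an established scheme; in practice such features might come from a pronunciation lexicon, a TTS front end, or a sentiment lexicon:

```python
import numpy as np

rng = np.random.default_rng(1)

d_semantic = 8   # dimensions of a conventional meaning-based embedding

# Hypothetical explicit features per word:
# [syllable_count, relative_duration, emotional_valence]
# The values are hand-picked for illustration, not measured.
prosody_features = {
    "moon":      np.array([1.0, 0.4, 0.6]),
    "light":     np.array([1.0, 0.3, 0.7]),
    "murmuring": np.array([3.0, 0.9, 0.2]),
}

# Stand-in for a learned embedding table (random here, untrained).
semantic_embedding = {w: rng.normal(size=d_semantic) for w in prosody_features}

def enriched_vector(word: str) -> np.ndarray:
    """Concatenate the semantic embedding with explicit prosodic features,
    so each token's input vector carries acoustic/emotional dimensions."""
    return np.concatenate([semantic_embedding[word], prosody_features[word]])

v = enriched_vector("murmuring")   # shape: (8 + 3,) = (11,)
```

The point of the sketch is only that the extra dimensions are supplied in the data, rather than left for the model to infer on its own.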
Just as a writer refines sentences to provide the best possible work, future models in literary creation should provide a high-dimensional, deterministic optimal choice. The model should truly understand and capture the inherent essence of literature and language, thereby precisely selecting the most expressive word for the current context, rather than relying on cheap tricks like “temperature” to introduce some randomness. This is true understanding of human language.
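For reference, this is what the “temperature” trick amounts to in standard samplers: the logits are rescaled before the softmax, so a low temperature collapses the choice toward a deterministic argmax, while a high temperature flattens the distribution toward uniform randomness. A minimal sketch:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Standard temperature sampling: divide logits by the temperature
    before the softmax, then draw a token from the resulting distribution."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs)), probs

logits = [2.0, 1.0, 0.5, 0.1]                      # illustrative values

# Near-zero temperature: the distribution concentrates on the top token,
# so sampling is effectively a deterministic argmax.
_, cold = sample_next_token(logits, temperature=0.05)

# High temperature: probabilities spread out; word choice becomes erratic
# rather than expressive.
_, hot = sample_next_token(logits, temperature=5.0)
```

Randomness of this kind is a substitute for judgment: it varies the output without any model of why one word fits better than another.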
It’s like how human eyes can only see the visible light spectrum. But after we discovered radio waves, we could use radar to see places our eyes couldn’t. Likewise, by imbuing our data with more dimensional information, we might be able to make models more intelligent and more soulful.

