Few of humanity's long-standing ambitions run deeper than understanding how the human brain works and building machines that can simulate perception and intuition. Despite significant progress in processing data and language, AI models have long lacked the ability to intuitively understand the physical world, an ability that infants acquire easily through observation.
Recent research suggests that this gap is narrowing rapidly, thanks to models that can register surprise when the rules of physics appear to break, that is, when physically implausible events occur. The V-JEPA model represents a significant step toward giving AI an innate understanding of the world, with far-reaching implications for robotics and autonomous vehicles.
AI mimics infant perception: In a notable scientific advance, researchers at Meta have developed an AI model that grasps the basic physical principles of the world, a capacity often described as innate intuition, which infants acquire through observation. The result is the V-JEPA model, which registers surprise when it encounters physically impossible events, such as an object disappearing without reason, mimicking the reaction of six-month-old infants in object-permanence experiments.
V-JEPA does not rely on pre-programmed physical rules; it learns by watching millions of videos, much as human minds learn through experience. According to Meta's tests, the model predicts what will happen next in a video using latent representations: abstract encodings that condense thousands of pixels into the essential information about objects, their motion, and their location. When what actually unfolds contradicts these expectations, the model produces a large prediction error, analogous to the feeling of surprise in infants.
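To make the idea of prediction error as "surprise" concrete, here is a minimal sketch in Python. It assumes the model already exposes a predicted latent vector and an observed latent vector for each frame; the function name and array shapes are hypothetical, not part of Meta's published code.

```python
import numpy as np

def surprise_scores(predicted_latents, observed_latents):
    """Per-frame prediction error between predicted and observed latents.

    Both arrays have shape (T, D): T frames, each summarized by a
    D-dimensional latent vector. A large error on a frame suggests the
    scene violated the model's expectations.
    """
    return np.linalg.norm(predicted_latents - observed_latents, axis=1)

# Toy example: 5 frames with 8-dimensional latents (values are illustrative).
rng = np.random.default_rng(0)
observed = rng.normal(size=(5, 8))
predicted = observed.copy()
predicted[3] += 2.0              # frame 3 deviates sharply from the prediction
print(surprise_scores(predicted, observed))   # frame 3 stands out as "surprising"
```

In this toy run, the error is near zero for every frame except the one where the observation departs from the prediction, which is the kind of spike the researchers interpret as surprise.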
But how does V-JEPA differ from traditional models in understanding scenes? AI engineers, especially those building autonomous driving systems, face a fundamental challenge: enabling machines to understand the visual world with reliability comparable to human perception. For a long time, systems designed to analyze video, whether to classify it or to identify surrounding objects, operated in what is called pixel space. In this space, every colour point (pixel) in the scene carries the same weight, much as if the brain received every sensory input without filtering or prioritization.
Effective as it is in certain contexts, however, this approach suffers from a perceptual blind spot. Imagine a complex scene: a street full of cars and traffic lights. If the model insists on processing fine, non-essential details such as the movement of leaves or the shifting of shadows, it risks overlooking more important information, such as the colour of the traffic light or the exact positions of nearby cars. As the researchers explain, working in pixel space means dealing with a large amount of detail that does not necessarily need to be modeled, which hinders efficiency and the ability to make fast, informed decisions.
To address this shortcoming, Meta developed the Video Joint Embedding Predictive Architecture (V-JEPA), launched in 2024, with the aim of reproducing an essential part of human perception: selective abstraction. Traditional models hide parts of video frames and train a network to predict the values of the missing pixels. V-JEPA takes a radically different path: it uses the same masking process, but instead of predicting what lies behind the mask pixel by pixel, it predicts the masked content at a higher level of abstraction, in the space of latent representations. This is the philosophical and technical core that mimics human perception.
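The difference between the two training objectives can be illustrated with a short, hypothetical PyTorch sketch. This is not Meta's code: the module names (context_encoder, target_encoder, predictor), the use of simple linear layers, and the crude pooled context are all assumptions made for brevity, whereas the real V-JEPA uses a transformer backbone and a more elaborate masking scheme. The point is only where the loss is computed: in latent space rather than pixel space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH_DIM, LATENT_DIM = 16 * 16 * 3, 64   # flattened patch size, embedding size

context_encoder = nn.Linear(PATCH_DIM, LATENT_DIM)   # sees only visible patches
target_encoder = nn.Linear(PATCH_DIM, LATENT_DIM)    # sees every patch (not trained here)
predictor = nn.Linear(LATENT_DIM, LATENT_DIM)        # fills in masked embeddings

def jepa_style_loss(patches, mask):
    """patches: (N, PATCH_DIM) flattened patch pixels; mask: (N,) bool, True = hidden."""
    with torch.no_grad():
        targets = target_encoder(patches)             # latent of every patch
    visible = context_encoder(patches[~mask])         # encode visible patches only
    context = visible.mean(dim=0, keepdim=True)       # crude pooled context vector
    predicted = predictor(context).expand(int(mask.sum()), -1)
    # The loss lives in latent space: compare predicted vs. target embeddings
    # of the masked patches, never their raw pixel values.
    return F.mse_loss(predicted, targets[mask])

# A pixel-space model would instead reconstruct the hidden pixels themselves,
# e.g. something like: F.mse_loss(decoder(context), patches[mask])

patches = torch.randn(20, PATCH_DIM)                  # 20 toy patches
mask = torch.zeros(20, dtype=torch.bool)
mask[5:10] = True                                     # hide five of them
print(jepa_style_loss(patches, mask))
```

The commented-out pixel-space alternative shows what the sketch is contrasting: reconstructing raw pixels forces the network to model every detail, while the latent-space objective only asks it to get the abstract description of the hidden region right.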
The model relies on an encoder that converts frames into a small set of numerical values representing essential features: the shape, size, location, and motion of objects, and the relationships between elements of the scene. Instead of thousands of pixels, the system deals only with the essence of the scene, much as the brain processes visual input by discarding noise and focusing on useful information. Quentin Garrido, a research scientist at Meta, emphasizes that the model's strength lies in its ability to filter data: "This mechanism allows the model to discard unnecessary details and focus only on the most essential and important aspects of the filmed scene. Efficient elimination of redundant information is a pivotal goal that the V-JEPA model seeks to achieve with maximum effectiveness."
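To give a sense of the scale of this compression, here is a small hypothetical frame encoder in PyTorch. It is a toy convolutional network, not Meta's architecture: it simply reduces a 224x224 RGB frame, roughly 150,000 pixel values, to a 128-dimensional feature vector.

```python
import torch
import torch.nn as nn

# Toy frame encoder: pixels in, a compact feature vector out.
frame_encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # 224 -> 112
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 112 -> 56
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                                 # global average pooling
    nn.Flatten(),
    nn.Linear(64, 128),                                      # 128-dimensional latent
)

frame = torch.randn(1, 3, 224, 224)            # one RGB frame, 150,528 values
latent = frame_encoder(frame)
print(frame.numel(), "->", latent.numel())     # 150528 -> 128
```

Every downstream prediction then operates on those 128 numbers rather than on the full grid of pixels, which is what makes the filtering Garrido describes computationally worthwhile.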
This shift from modeling pixels to modeling meaning gives V-JEPA a strong ability to generalize, high accuracy in interpreting new scenes, and notable efficiency in complex settings such as autonomous driving and robotics. Thus, …