Title: Multimodal AI in 2025: Expanding Horizons in Healthcare, eCommerce, and Beyond
In the near future, multimodality is poised to revolutionize how businesses use AI. Picture an AI that isn't limited to deciphering text but can also comprehend images, audio, and various sensor data. Humans, after all, are naturally multimodal, though we're limited in how much input we can process at once. Take healthcare as an example. During my stint at Google Health, I witnessed numerous instances where doctors were overwhelmed by an excess of data.
Suppose a patient with atrial fibrillation (AFIB) brings five years of intricate sleep data gathered from their smartwatch, or a cancer patient arrives with a 20-pound stack of medical records documenting past treatments. For healthcare professionals, the challenge is to separate the significant information from the noise. What's required is an AI that can condense and emphasize the critical points. Large language models like ChatGPT can already do this with text, and we can train AI to achieve similar results with other types of data, such as images, time series, or lab results.
How Multimodal AI Operates
To grasp the inner workings of multimodal AI, it's essential to recognize that AI needs data for both training and making predictions. Multimodal AI is designed to work with diverse data sources - text, images, audio, video, and even time-series data - all at once. By merging these inputs, multimodal AI offers a more holistic and comprehensive understanding of problems.
Multimodal AI acts as a discovery tool: data from the different modalities is stored inside the AI in a common representation. When a new data point is input, the AI identifies stored items that are closely related to it. For instance, by inputting a patient's sleep data from their smartwatch alongside data about their AFIB episodes, doctors might discover signs of sleep apnea.
It's worth noting that this is based on "closeness," not correlation. Multimodal AI can be likened to the recommendation system Amazon popularized: "people who bought this item also bought that item." In this case, it's more like: "people with this sleep pattern have also been diagnosed with AFIB."
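To make "closeness" concrete, here is a minimal sketch in Python. It assumes modality-specific encoders have already turned a new patient's sleep pattern and a set of previously seen cases into fixed-length vectors (the numbers and labels below are invented for illustration); cosine similarity then surfaces the nearest cases.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embedding vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings produced by modality-specific encoders.
# Real embeddings have hundreds of dimensions; 4 keeps the sketch readable.
new_patient_sleep = np.array([0.9, 0.1, 0.8, 0.2])

known_cases = {
    "sleep apnea + AFIB": np.array([0.85, 0.15, 0.75, 0.25]),
    "healthy sleep":      np.array([0.10, 0.90, 0.20, 0.80]),
    "insomnia":           np.array([0.40, 0.60, 0.50, 0.30]),
}

# Rank known cases by how close they sit to the new patient's sleep embedding.
ranked = sorted(known_cases,
                key=lambda label: cosine_similarity(new_patient_sleep, known_cases[label]),
                reverse=True)
print(ranked)  # "sleep apnea + AFIB" comes out on top
```

Closeness here is purely geometric: it says these cases look alike in the latent space, not that one condition causes the other.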

Multimodal AI: Encoders, Fusion, and Decoders
A multimodal AI system is made up of three primary components: Encoders, Fusion, and Decoders.
Encoding Any Modality
Encoders transform raw data - like text, images, or sound - into a representation that the AI can handle. These representations are called vectors and are stored in a latent space. Think of the process as storing items in a warehouse (the latent space), where each item gets a specific location (its vector). Encoders can process many forms of data - images, text, sound, video, log files, IoT (sensor) data, and time series, to name a few.
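As a rough sketch of what "storing items in the warehouse" looks like in code, the snippet below uses the sentence-transformers library with a CLIP-style checkpoint that embeds both images and text into the same latent space. The model name and the image file are illustrative assumptions, not a prescription.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# A CLIP-style model that maps images and text into one shared latent space
# (model choice is an assumption; any joint image-text encoder would do).
encoder = SentenceTransformer("clip-ViT-B-32")

# Encode a product photo and a text description into vectors ("warehouse locations").
image_vector = encoder.encode(Image.open("emerald_ring.jpg"))  # hypothetical file
text_vector = encoder.encode("a green emerald ring on a gold band")

print(image_vector.shape, text_vector.shape)  # both vectors live in the same 512-dim space
```

Because both vectors live in one space, a text query can later be compared directly against image vectors, which is what makes cross-modal search possible.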
Fusion Mechanism: Combining Modalities

When working with one type of data, like images, encoding alone is enough. With multiple types - images, sound, text, or time-series data - we need a fusion step that combines the separate representations into one joint representation, so the AI can reason across modalities and surface what's most relevant.
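There are many fusion strategies - concatenation, weighted averaging, cross-attention, and more. The snippet below shows one of the simplest: a weighted average of already-encoded modality vectors into a single fused vector. The weights are made-up values standing in for whatever a real fusion layer would learn.

```python
import numpy as np

def fuse(modality_vectors: dict[str, np.ndarray],
         weights: dict[str, float]) -> np.ndarray:
    """Late fusion: weighted average of per-modality embeddings."""
    fused = sum(weights[name] * vec for name, vec in modality_vectors.items())
    return fused / np.linalg.norm(fused)  # keep the result on the unit sphere

# Hypothetical embeddings for one patient, produced by separate encoders.
patient = {
    "sleep_time_series": np.random.rand(512),
    "ecg":               np.random.rand(512),
    "clinical_notes":    np.random.rand(512),
}

# Illustrative weights; a production system would learn them (or use attention instead).
fused_vector = fuse(patient, {"sleep_time_series": 0.4, "ecg": 0.4, "clinical_notes": 0.2})
print(fused_vector.shape)  # one joint representation for downstream search or decoding
```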
Decoders: Delivering Understandable Outputs
Decoders "decode" the information from the latent space - aka the warehouse - and present it in a format that humans can comprehend. For instance, finding an image of a "house."
If you're interested in learning more about encoding, decoding, and reranking, consider enrolling in my eCornell Online Certificate course, "Designing and Building AI Solutions." This no-coding program delves into the various aspects of AI solution development.
Transforming eCommerce with Multimodality

Let's consider another application: eCommerce. Amazon's interface has remained largely the same for 25 years - type a keyword, scroll through results, and hope for the best. Multimodality can modernize this user experience by letting users describe a product, upload an image, or provide context to find their perfect match.
Improving Search with Multimodal AI
At r2decide, a company founded by a few Cornellians and myself, we're using multimodality to merge Search, Browse, and Chat into a seamless flow. Our clients are eCommerce companies losing revenue because users can't find what they need. At the core of our solution is multimodal AI.
For instance, in an online jewelry store, a user searching for "green" would traditionally only see green jewelry if the word "green" appears in the product text. Since r2decide's AI also encodes images into the shared latent space (the "warehouse" from earlier), it recognizes "green" across all modalities. The results are then re-ranked based on the user's past searches and clicks, so they see the most relevant "green" options.
Users can also search for broader contexts, like "wedding," "red dress," or "gothic." The AI encodes these inputs into the latent space, matches them with suitable products, and displays the most relevant results. This capability even extends to brand names like "Swarovski," revealing relevant items - even if the shop doesn't offer Swarovski products officially.
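As a hedged sketch of this retrieve-then-rerank flow (not r2decide's actual implementation), the snippet below first ranks products by how close their embeddings sit to the query embedding, then boosts items whose attributes match what the user has clicked before. The catalog, embeddings, and boost weight are all invented for illustration.

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical catalog: each product has a fused text+image embedding plus a few
# attributes used for lightweight personalization.
catalog = {
    "emerald ring": {"vec": np.array([0.9, 0.2, 0.1]), "tags": {"green", "ring"}},
    "jade pendant": {"vec": np.array([0.8, 0.3, 0.2]), "tags": {"green", "pendant"}},
    "ruby ring":    {"vec": np.array([0.2, 0.9, 0.1]), "tags": {"red", "ring"}},
}

def search(query_vec: np.ndarray, clicked_tags: set[str], boost: float = 0.15) -> list[str]:
    """Retrieve by embedding closeness, then re-rank using the user's click history."""
    def score(item):
        return cos(query_vec, item["vec"]) + boost * len(item["tags"] & clicked_tags)
    return sorted(catalog, key=lambda name: score(catalog[name]), reverse=True)

# A shopper who searched "green" (query embedding is illustrative) and tends to click rings.
print(search(np.array([0.85, 0.25, 0.15]), clicked_tags={"ring"}))
# -> ['emerald ring', 'jade pendant', 'ruby ring']
```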

AI-Driven Nudges for a Chat-like Experience
In addition to search results, r2Decide also generates AI-driven nudges - contextual recommendations or prompts designed to enhance the user experience. These nudges are powered by AI agents, as discussed in my blog post on "agentic AI." Their purpose is to guide users effortlessly towards the most relevant options, making the search process intuitive, engaging, and effective.
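As a hedged illustration (not r2Decide's actual pipeline), a nudge generator can be as simple as handing the shopper's query and top results to an LLM and asking for a one-line suggestion. The snippet below uses the OpenAI Python SDK; the model name, prompt, and wiring are assumptions made for the sketch.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK and an API key in the environment

client = OpenAI()

def generate_nudge(query: str, top_results: list[str]) -> str:
    """Ask an LLM for a short, contextual suggestion based on the current search."""
    prompt = (
        f"A shopper searched for '{query}' and is looking at: {', '.join(top_results)}. "
        "Write one short, friendly suggestion that helps them narrow down or complete the look."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_nudge("green", ["emerald ring", "jade pendant"]))
# e.g. "Love green stones? Matching emerald studs would complete the look." (hypothetical output)
```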
Multimodality in 2025: Unlimited Potential for Enterprises
Multimodality is touching various industries, from healthcare to eCommerce. Startups like TC Labs are utilizing multimodal AI to streamline engineering workflows, while Toyota uses it for interactive, personalized customer assistance.
2025 will be the year multimodal AI changes how businesses operate. To stay updated on my 2025 AI predictions, follow me here or on LinkedIn.