Andrew Ng’s Shift From Big Data to Data-Centric AI

Andrew Ng argues the next AI leap won’t come from bigger models alone, but from “data-centric AI”—better data, faster iteration, and tools that let businesses fix bias and drift.
Andrew Ng has helped set the pace for modern AI, from pushing GPU-powered deep learning to co-founding Google Brain and later leading AI work at Baidu. Now, in conversation, he’s signaling a different kind of momentum—one that doesn’t treat data as a backdrop, but as the main engineering problem.
The phrase anchoring his argument is simple: data-centric AI. After a decade in which progress often meant scaling up models and pouring in more data, Ng says the bottleneck is increasingly about getting the right data, not just more of it. He points to foundation models and larger training efforts, especially in text, while also warning that the scale recipe doesn’t translate cleanly to industries that don’t have “billions of users” feeding massive datasets.
Ng’s biggest claim is that foundation models will expand beyond language, but that some modalities, especially video, face real constraints. Video is heavy: the compute bandwidth and cost required to process long, continuous streams are far higher than for tokenized text. He argues that this is why “foundation” approaches arrived earlier in NLP and may arrive later in vision and video. Still, he suggests the path is clear if the hardware and processing capacity catch up.
That’s only one half of the story, though. The other half is what Ng calls “small data” solutions: approaches designed for settings where datasets are limited or inconsistent. In his view, the last decade largely solved neural network architecture for many real-world tasks. For practical deployments, he says, it’s often more productive to keep the model structure stable and engineer the data pipeline: how examples are collected, labeled, validated, and corrected over time.
A useful way to understand the movement is through his company, Landing AI. Ng frames LandingLens as a platform aimed at manufacturers using computer vision for inspection and quality control. The pitch isn’t just “train a better model.” It’s to give teams tools to choose the right training images, label them consistently, and then iterate efficiently when performance reveals specific weaknesses, such as certain defect types that are underperforming.
This is where the human side becomes noticeable. In factories, a model isn’t a classroom experiment; it’s tied to real operations. If labeling varies between annotators, the system can learn that confusion as truth. If lighting changes or product design shifts, the model can degrade in ways that feel sudden to operators on a night shift. Ng emphasizes engineering workflows that let customers correct data and retrain quickly, because waiting for experts to “start over” isn’t an option when a line needs to keep running.
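To make the labeling point concrete, here is a minimal sketch of how a team might flag images whose annotators disagree, so those examples can be reviewed before the model learns the confusion. The record layout (`item_id`, `annotator`, `label`), the function name, and the sample data are all assumptions made for illustration; this is not LandingLens’s actual workflow.

```python
from collections import Counter, defaultdict


def find_inconsistent_labels(annotations, min_agreement=1.0):
    """Return {item_id: label counts} for items whose annotators disagree.

    annotations: iterable of (item_id, annotator, label) tuples
                 (a hypothetical schema chosen for this example).
    min_agreement: share of votes the most common label must reach;
                   items below it are flagged for review.
    """
    votes = defaultdict(Counter)
    for item_id, _annotator, label in annotations:
        votes[item_id][label] += 1

    flagged = {}
    for item_id, counts in votes.items():
        majority_share = max(counts.values()) / sum(counts.values())
        if majority_share < min_agreement:
            flagged[item_id] = dict(counts)
    return flagged


# Made-up annotations for three inspection images.
annotations = [
    ("img_001", "ann_a", "scratch"),
    ("img_001", "ann_b", "scratch"),
    ("img_002", "ann_a", "scratch"),
    ("img_002", "ann_b", "dent"),
    ("img_003", "ann_a", "ok"),
]

print(find_inconsistent_labels(annotations))
# {'img_002': {'scratch': 1, 'dent': 1}}
```

Only `img_002` is flagged, because its two annotators chose different defect types. Lowering `min_agreement` would tolerate some disagreement on items with many annotators.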
Ng also ties data-centric AI to fairness and bias in a more grounded way than many broad discussions do. Biased systems aren’t caused by data alone, he says, but biased data is a major ingredient. He describes how engineering a targeted subset (finding where performance breaks, and then fixing the underlying data inconsistencies) can be more effective than trying to rewrite the entire model. That subset strategy matters for bias because bias often concentrates in particular slices of data rather than spreading evenly.
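The subset idea can be sketched as a per-slice error analysis: score the model separately on each slice of the data and surface the slices that fall below a target. The slice names, the 0.8 threshold, and the evaluation records below are invented for illustration and do not come from Ng or Landing AI.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute accuracy per slice from (slice_name, y_true, y_pred) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        totals[slice_name] += 1
        correct[slice_name] += int(y_true == y_pred)
    return {name: correct[name] / totals[name] for name in totals}


def weakest_slices(records, threshold):
    """Return slice names with accuracy below threshold, worst first."""
    scores = accuracy_by_slice(records)
    weak = [name for name, acc in scores.items() if acc < threshold]
    return sorted(weak, key=scores.get)


# Made-up evaluation records: (slice, true label, predicted label).
records = [
    ("scratch", 1, 1), ("scratch", 1, 1), ("scratch", 0, 0), ("scratch", 1, 1),
    ("dent", 1, 0), ("dent", 1, 0), ("dent", 0, 0), ("dent", 1, 0),
    ("discoloration", 1, 1), ("discoloration", 1, 0),
]

print(accuracy_by_slice(records))
# {'scratch': 1.0, 'dent': 0.25, 'discoloration': 0.5}
print(weakest_slices(records, threshold=0.8))
# ['dent', 'discoloration']
```

The same breakdown applies when the slices are demographic or site-level groups rather than defect types, which is where it becomes relevant to bias.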
Synthetic data enters the conversation as another tool, not a magic replacement. Ng views it as part of a “tool chest” for iterative machine learning development. Instead of only using synthetic data to bulk up datasets generally, he argues it can help generate more examples where error analysis shows the system struggles, such as a particular defect category in visual inspection. Even then, he prefers simpler steps first in many cases, including data augmentation, improving labeling consistency, or collecting additional data where it actually helps.
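As a rough illustration of the “simpler steps first” approach, the sketch below augments only the classes that error analysis has flagged as weak, rather than enlarging the whole dataset. A horizontal flip stands in for any augmentation and is only appropriate when a flipped image still deserves the same label. The function names and the tiny nested-list “images” are assumptions made for this example.

```python
def flip_horizontal(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def augment_weak_classes(dataset, weak_labels):
    """Return the dataset plus one flipped copy of each example
    whose label is in weak_labels; other classes are left as they are."""
    augmented = list(dataset)
    for image, label in dataset:
        if label in weak_labels:
            augmented.append((flip_horizontal(image), label))
    return augmented


# Made-up 2x2 "images" with defect labels.
dataset = [
    ([[1, 2], [3, 4]], "dent"),
    ([[5, 6], [7, 8]], "scratch"),
]

result = augment_weak_classes(dataset, weak_labels={"dent"})
print(len(result))
# 3
print(result[-1])
# ([[2, 1], [4, 3]], 'dent')
```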
There’s also an economic argument running beneath all of this. Ng notes that scaling in consumer internet comes from large user bases and massive centralized training loops. Manufacturing and other sectors look different: you might have many organizations building customized systems with limited local data. He warns that expecting every industry actor, like hospitals dealing with varied healthcare records or factories running unique product lines, to redesign neural architectures from scratch is unrealistic. The operational answer, in his view, is tooling that lets customers express domain knowledge through data engineering rather than bespoke model invention.
Ultimately, Ng’s message reads like a strategic bet for the coming decade: the biggest shift in AI may not be deeper models, faster GPUs, or cleverer architectures alone, but the discipline of systematically engineering data. As AI moves from labs into messy, high-stakes environments, the ability to detect label drift, flag inconsistencies, and retrain with focused corrections may matter as much as, or more than, raw scaling.