Data augmentation expands a training set by modifying existing examples or generating new ones from them: rephrasing sentences, adding noise, or creating synthetic variations. This can improve model robustness and performance without collecting large amounts of new data.
It is particularly useful when labeled training data is scarce, or when the model needs to handle more variety in real-world inputs than the original dataset provides. Instead of collecting thousands of new examples, teams augment what they have to produce a more representative training set.
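As a rough illustration of the "adding noise" idea for text, the sketch below produces a few perturbed copies of a labeled sentence by randomly dropping and swapping words. The function names (`random_deletion`, `random_swap`, `augment`), the parameter defaults, and the sample sentence are all hypothetical choices made for this example. It is a minimal sketch using only the Python standard library, not a recommended recipe. It also assumes the label stays valid after the perturbation, which should be checked for any real task.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def augment(sentence, label, n_variants=3, p_delete=0.1, n_swaps=1, seed=0):
    """Return the original (text, label) pair followed by n_variants noisy copies."""
    rng = random.Random(seed)  # local generator so results are reproducible
    tokens = sentence.split()
    examples = [(sentence, label)]
    for _ in range(n_variants):
        noisy = random_swap(random_deletion(tokens, p_delete, rng), n_swaps, rng)
        examples.append((" ".join(noisy), label))
    return examples


if __name__ == "__main__":
    for text, label in augment("the delivery arrived two days late", "complaint"):
        print(label, "|", text)
```

Each call returns the untouched original plus the noisy variants, so the augmented set always contains the source example. Word-level noise like this is crude. Rephrasing or synthetic generation usually needs a paraphrasing model or task-specific rules, which are outside the scope of this sketch.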