
In today’s data-driven investment environment, the quality, availability, and specificity of data can make or break a strategy. Yet investment professionals routinely run into limitations: historical datasets do not cover every scenario, alternative data is often incomplete or prohibitively expensive, and open-source models and datasets are skewed toward major markets and English-language content.
As firms look for more adaptable and forward-looking tools, synthetic data, especially when derived from generative AI (GenAI), is emerging as a strategic asset that offers new opportunities for simulating market scenarios, training machine learning models, and backtesting investment strategies. This article examines how GenAI-powered synthetic data is reshaping investment workflows, from simulating asset correlations to improving sentiment models, and what practitioners need to know to evaluate its usefulness and limitations.
What exactly is synthetic data, how is it generated by GenAI models, and why is it increasingly relevant for investment use cases?
Consider two common challenges. A portfolio manager who wants to optimize performance across different market regimes is constrained by historical data, which cannot account for “what-if” scenarios that have not yet occurred. Similarly, a data scientist monitoring sentiment in German-language news for small-cap stocks may find that most available datasets are in English and focus on large-cap companies, which limits both coverage and relevance. In both cases, synthetic data offers a practical solution.
What sets GenAI synthetic data apart, and why it matters now
Synthetic data refers to artificially generated datasets that replicate the statistical properties of real data. The concept is not new: techniques such as Monte Carlo simulation and bootstrapping have supported financial analysis for a long time. What has changed is how the data is generated.
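To make the two traditional techniques concrete, here is a minimal sketch using only the Python standard library. The return values and function names are hypothetical and chosen for illustration. Monte Carlo draws from a distribution the analyst has to assume (a normal distribution here), while bootstrapping can only reshuffle values that were already observed.

```python
import random
import statistics


def monte_carlo_returns(mu, sigma, n, seed=0):
    """Parametric Monte Carlo: draw n returns from an ASSUMED normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def bootstrap_returns(history, n, seed=0):
    """Bootstrapping: resample n returns with replacement from the observed history."""
    rng = random.Random(seed)
    return rng.choices(history, k=n)


# Hypothetical daily returns, for illustration only.
history = [0.004, -0.012, 0.007, 0.001, -0.003, 0.015, -0.021, 0.006]

mc = monte_carlo_returns(statistics.mean(history), statistics.stdev(history), n=1000)
bs = bootstrap_returns(history, n=1000)

# Bootstrapped samples never leave the set of observed values;
# Monte Carlo samples can, but only in the way the assumed distribution allows.
print(min(bs) >= min(history), max(bs) <= max(history))
print(min(mc), max(mc))
```

Both approaches are useful, but each is limited either by the assumed distribution or by the observed history.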
GenAI refers to a class of deep learning models that can generate high-fidelity synthetic data across modalities such as text, tabular, image, and time series. Unlike traditional methods, GenAI models learn complex real-world distributions directly from data, which removes the need for rigid assumptions about the underlying generative process. This capability opens up powerful use cases in investment management, especially in areas where real data is scarce, complex, incomplete, or constrained by cost, language, or regulation.
Common GenAI models
There are different types of GenAI models. Variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion-based models, and large language models (LLMs) are the most common. Each is built on neural network architectures, although they differ in size and complexity. These methods have already shown potential to improve certain data-centric workflows in the industry. For example, VAEs have been used to create synthetic volatility surfaces to improve options trading (Bergeron, 2021). GANs have proven useful for portfolio optimization and risk management (Zhu, Mariani and Li, 2020; Cont, 2023). Diffusion-based models have proven useful for simulating asset correlation matrices under various market regimes (Kubiak, 2024). And LLMs have proven useful for market simulations (Li, 2024).
Table 1. Approaches to synthetic data generation.
| Method | Types of data it generates | Example applications | Generative? |
| Monte Carlo | Time series | Portfolio optimization, risk management | No |
| Copula-based functions | Time series, tabular | Credit risk analysis, asset correlation modeling | No |
| Autoregressive models | Time series | Volatility forecasting, asset return simulation | No |
| Bootstrapping | Time series, tabular, textual | Creating confidence intervals, stress testing | No |
| Variational autoencoders | Tabular, time series, audio, images | Simulating volatility surfaces | Yes |
| Generative adversarial networks | Tabular, time series, audio, images | Portfolio optimization, risk management, model training | Yes |
| Diffusion models | Tabular, time series, audio, images | Correlation modeling, portfolio optimization | Yes |
| Large language models | Text, tabular, images, audio | Sentiment analysis, market simulation | Yes |
Evaluating the quality of synthetic data
Synthetic data should be realistic and match the statistical properties of the real data it is based on. Existing evaluation methods fall into two categories: quantitative and qualitative.
Qualitative approaches involve visual comparisons between real and synthetic datasets. Examples include plotting distributions and comparing scatterplots between pairs of variables, time series paths, and correlation matrices. For example, a GAN trained to simulate asset returns for estimating value-at-risk should successfully reproduce the heavy tails of the distribution. A diffusion model designed to produce synthetic correlation matrices under different market regimes should adequately capture asset co-movements.
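Visual checks of the tails are often easier to read next to a couple of summary numbers. The sketch below is an illustration under stated assumptions rather than a prescribed method. It uses made-up data: a heavy-tailed mixture stands in for real returns and a variance-matched normal sample stands in for a generator that misses the tails. The helper names `excess_kurtosis` and `historical_var` are hypothetical.

```python
import random
import statistics


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; roughly 0 for normally distributed data."""
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / len(xs)
    m4 = sum((x - mean) ** 4 for x in xs) / len(xs)
    return m4 / m2 ** 2 - 3.0


def historical_var(returns, level=0.99):
    """Empirical value-at-risk: the loss exceeded in roughly (1 - level) of observations."""
    cuts = statistics.quantiles(returns, n=100)  # 99 percentile cut points
    return -cuts[round((1 - level) * 100) - 1]


rng = random.Random(42)
# Stand-in for real returns: mostly calm, with occasional large moves (heavy tails).
real = [rng.gauss(0, 3 if rng.random() < 0.1 else 1) for _ in range(20000)]
# Stand-in for a weak generator: similar variance, but plain normal tails.
thin = [rng.gauss(0, 1.34) for _ in range(20000)]

for name, sample in [("real", real), ("thin-tailed synthetic", thin)]:
    print(name, round(excess_kurtosis(sample), 2), round(historical_var(sample), 2))
```

A synthetic sample whose excess kurtosis and tail quantile sit far from those of the real data will understate value-at-risk, even if its histogram looks plausible at first glance.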
Quantitative approaches include statistical measures for comparing distributions, such as the Kolmogorov-Smirnov test, the population stability index, and the Jensen-Shannon divergence. Each produces a statistic that indicates how similar two distributions are. For example, the Kolmogorov-Smirnov test outputs a p-value; if it is lower than 0.05, the two distributions are considered significantly different. This gives a more concrete measure of similarity between two distributions than visualizations do.
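The three measures named above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the p-value uses the large-sample Kolmogorov approximation, the bin count and the small floor `eps` are arbitrary choices, and the data is randomly generated. In practice, `scipy.stats.ks_2samp` provides the Kolmogorov-Smirnov test directly.

```python
import math
import random
from bisect import bisect_right


def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic with an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # Largest gap between the two empirical CDFs, checked at every observed value.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the series below converges slowly here; the p-value is ~1
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def histogram(xs, lo, hi, bins=10):
    """Share of observations in each of `bins` equal-width buckets on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in xs:
        i = min(max(int((x - lo) / width), 0), bins - 1)
        counts[i] += 1
    return [c / len(xs) for c in counts]


def psi(p, q, eps=1e-6):
    """Population stability index between two binned distributions."""
    pairs = [(max(pi, eps), max(qi, eps)) for pi, qi in zip(p, q)]
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in pairs)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 means identical, 1 is the maximum."""
    def kl(u, v):
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


# Randomly generated stand-ins: "good" matches the real distribution, "bad" is shifted.
real = [random.Random(1).gauss(0.0, 1.0) for _ in range(1)]  # placeholder, replaced below
rng_real, rng_good, rng_bad = random.Random(1), random.Random(2), random.Random(3)
real = [rng_real.gauss(0.0, 1.0) for _ in range(2000)]
good = [rng_good.gauss(0.0, 1.0) for _ in range(2000)]
bad = [rng_bad.gauss(0.5, 1.0) for _ in range(2000)]

lo, hi = min(real + good + bad), max(real + good + bad)
p_real = histogram(real, lo, hi)
for name, sample in [("good", good), ("bad", bad)]:
    d, p_value = ks_two_sample(real, sample)
    q = histogram(sample, lo, hi)
    print(name, round(d, 3), round(p_value, 4), round(psi(p_real, q), 4),
          round(js_divergence(p_real, q), 4))
```

All three measures should move in the same direction: the shifted sample shows a larger Kolmogorov-Smirnov statistic, a p-value below 0.05, and higher index and divergence values than the well-matched sample.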
Another approach is “train-on-synthetic, test-on-real”, in which a model is trained on synthetic data and tested on real data. Its performance can then be compared with that of a model trained and tested on real data. If the synthetic data successfully replicates the properties of the real data, the two models should perform similarly.
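The comparison can be sketched in a few lines. Everything here is a stand-in chosen for illustration: the data is randomly generated, the "synthetic" set is simply a second draw from the same process, and the model is a deliberately simple nearest-centroid classifier rather than anything used in the case study.

```python
import random


def make_data(n, shift, seed):
    """Hypothetical two-class data: one feature centered at -shift (class 0) or +shift (class 1)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(shift if label else -shift, 1.0), label))
    return rows


def fit_centroids(rows):
    """Nearest-centroid classifier: remember the mean feature value of each class."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in rows if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def accuracy(centroids, rows):
    """Share of rows whose nearest centroid matches the true label."""
    hits = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y for x, y in rows
    )
    return hits / len(rows)


real_train = make_data(500, 1.0, seed=1)
real_test = make_data(500, 1.0, seed=2)
synthetic_train = make_data(500, 1.0, seed=3)  # stand-in for generator output

trtr = accuracy(fit_centroids(real_train), real_test)       # train on real, test on real
tstr = accuracy(fit_centroids(synthetic_train), real_test)  # train on synthetic, test on real
print(f"TRTR accuracy: {trtr:.3f}  TSTR accuracy: {tstr:.3f}")
```

A small gap between the two scores suggests the synthetic data carries the signal the model needs; a large gap points to properties of the real data that the generator failed to reproduce.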
In action: Improving financial sentiment analysis with GenAI synthetic data
To put this into practice, I fine-tuned a small open-source LLM, Qwen3-0.6B, for financial sentiment analysis using a public dataset of financial headlines and social media content known as FiQA-SA[1]. The dataset consists of 822 training examples, with most sentences classified as “positive” or “negative”.
I then used GPT-4o to generate 800 synthetic training examples. The synthetic dataset generated by GPT-4o was more diverse than the original training data, covering more companies and sentiment classes (Figure 1). Increasing the diversity of the training data gives the LLM more examples from which to learn to identify sentiment in text, which can improve model performance on unseen data.
Figure 1. Distribution of sentiment classes for the real (left), synthetic (right), and augmented (center) training datasets. The augmented dataset consists of real and synthetic data.
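A class distribution like the one summarized in Figure 1 can be tabulated with a few lines of Python. The label lists below are made up for illustration and are not the FiQA-SA or GPT-4o counts; `class_shares` is a hypothetical helper name.

```python
from collections import Counter


def class_shares(labels):
    """Return each sentiment class's share of the dataset, largest class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Made-up label lists for illustration only.
real_labels = ["positive"] * 6 + ["negative"] * 3 + ["neutral"] * 1
synthetic_labels = ["positive"] * 4 + ["negative"] * 3 + ["neutral"] * 3
augmented_labels = real_labels + synthetic_labels

for name, labels in [
    ("real", real_labels),
    ("synthetic", synthetic_labels),
    ("augmented", augmented_labels),
]:
    print(name, class_shares(labels))
```

Comparing the three outputs shows at a glance whether the synthetic examples fill in under-represented classes or simply repeat the imbalance of the real data.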

Table 2. Example sentences from the real and synthetic training datasets.
| Sentence | Class | Data |
| Slump in Weir leads FTSE down from record high. | Negative | Real |
| AstraZeneca wins FDA approval for key new lung cancer pill. | Positive | Real |
| Shell and BG shareholders to vote on deal at end of January. | Neutral | Real |
| Tesla’s quarterly report shows an increase in vehicle deliveries by 15%. | Positive | Synthetic |
| PepsiCo is holding a press conference to address the recent product recall. | Neutral | Synthetic |
| Home Depot’s CEO abruptly resigns amid internal controversy. | Negative | Synthetic |
After fine-tuning a second model on a combination of real and synthetic data using the same training procedure, the F1-score on the validation dataset increased by almost 10 percentage points, with a final F1-score of 82.37% on the test dataset.
Table 3. Model performance on the FiQA-SA validation dataset.
| Model | Weighted F1-score |
| Model 1 (real) | 75.29% |
| Model 2 (real + synthetic) | 85.17% |
I found that increasing the proportion of synthetic data too far had a negative impact. There is a Goldilocks zone between too much and too little synthetic data for achieving optimal results.
Not a silver bullet, but a valuable tool
Synthetic data is not a replacement for real data, but it is worth experimenting with. Choose a method, evaluate the quality of the synthetic data, and run A/B tests in a sandbox environment in which you compare workflows with and without different proportions of synthetic data. You may be surprised by the findings.
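One way to set up such an experiment is to build several training sets that differ only in their share of synthetic rows. The following is a minimal sketch, not the procedure used in the case study: the `mix` helper, the placeholder rows, and the chosen shares are assumptions, and only the dataset sizes (822 real, 800 synthetic) come from the example above.

```python
import random


def mix(real, synthetic, synthetic_share, seed=0):
    """Keep every real row and add enough synthetic rows to reach the requested share."""
    if not 0 <= synthetic_share < 1:
        raise ValueError("synthetic_share must be in [0, 1)")
    n_syn = round(len(real) * synthetic_share / (1 - synthetic_share))
    if n_syn > len(synthetic):
        raise ValueError("not enough synthetic rows for this share")
    rng = random.Random(seed)
    rows = list(real) + rng.sample(list(synthetic), n_syn)
    rng.shuffle(rows)
    return rows


# Placeholder rows; in practice these would be (sentence, label) pairs.
real = [("real", i) for i in range(822)]
synthetic = [("synthetic", i) for i in range(800)]

for share in (0.0, 0.1, 0.25, 0.4):
    train = mix(real, synthetic, share)
    n_syn = sum(source == "synthetic" for source, _ in train)
    print(f"share={share:.2f}  rows={len(train)}  synthetic={n_syn}")
    # Each mix would then be used to train and validate a model with the same procedure.
```

Keeping the real rows fixed and varying only the synthetic share makes it easier to attribute any change in validation performance to the synthetic data itself.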
You can view all the code and datasets in the RPC Labs GitHub repository and take a deeper dive into the LLM case study in the Research and Policy Center’s “Synthetic Data in Investment Management” research report.
[1] The dataset can be downloaded here: https://huggingface.co/datastets/thefinai/fiqa-sentiment-Classification
