In our last blog post, we discussed ways in which WiL, or any other VC firm, can optimize its outreach and follow-up process. The solution we found most useful was a combination of filters, which curate a larger pool of companies into a more manageable set, and an email platform that adds structure and visibility. After optimizing this system, we wondered if we could take our process a step further and leverage machine learning (ML) to prioritize the companies we’ve curated.
Rather than relying on heuristics, we can train a model on a wide range of features to predict when a company will raise its next round, giving us a clearer sense of urgency for making new connections and building relationships. Below are our findings in the form of an empirical ML white paper. Note that the content includes core ML concepts and terms that may be challenging for non-technical readers to fully grasp.
Abstract
In venture, predicting whether a company will raise funds within a given timeframe is extremely valuable. Accurate predictions enable investors to prioritize relationship-building efforts, allocate resources, and increase their likelihood of participating in deals. Traditional approaches for identifying companies likely to raise rely on guesswork and intimate company familiarity, neither of which is consistent or scalable. This paper explores an ML approach that predicts whether a company will next raise within 12 months or later, using features that capture investor centrality, traction metrics, and executive profiles. We evaluate multiple ML models with the F0.5 score, weighting precision twice as much as recall to penalize false positives, which in our framing correspond to missed deals. This metric prioritizes avoiding missed deals over avoiding time spent on companies unlikely to raise soon.
Related Work
Existing white papers in the data-driven VC space often focus on valuation increases or the likelihood of an exit to public markets. These types of outcomes, while important, are more challenging to predict for a number of reasons:
- highly idiosyncratic nature of company data;
- prolonged feedback cycles;
- and exogenous factors—such as industry trends and market conditions—that affect valuation trajectories.
Additionally, the fundamentals of ML don’t necessarily apply well to this problem space given the nature of venture investing. While supervised learning works by learning patterns from training data that generalize to unseen records, venture investing is about finding disruptors and emerging players that by definition are not common or representative.
Moreover, investors need to believe in a company’s thesis and have conviction in the team’s ability to execute, regardless of what a model says about its potential. As a result, our approach diverges by focusing specifically on predicting a near-term fundraising event, a task that may be more feasible given the higher frequency of fundraising activity relative to rarer events like achieving a 10x valuation or going public.
Methodology
Data sources
Feature engineering
- Investor centrality: We hypothesize that companies with well-connected investors are likely to raise sooner because having these investors on a cap table can generate more market interest and attract preemptive offers. To capture this connectivity, we constructed a graph network by creating pairwise combinations of investors for each deal and then computed eigenvector centrality on a quarterly basis.
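The construction above can be sketched in a few lines. This is a stdlib-only illustration with hypothetical investor names and a simple power-iteration implementation of eigenvector centrality; in practice a graph library (e.g., networkx) would be used, and the computation would be repeated per quarter.

```python
from itertools import combinations

# Hypothetical deals: each entry lists the investors on one round.
deals = [
    ["Investor A", "Investor B", "Investor C"],
    ["Investor B", "Investor D"],
    ["Investor A", "Investor B"],
]

# Build the co-investment graph from pairwise combinations per deal;
# edge weights count how often two investors appear on the same deal.
graph = {}
for investors in deals:
    for a, b in combinations(investors, 2):
        graph.setdefault(a, {}).setdefault(b, 0)
        graph.setdefault(b, {}).setdefault(a, 0)
        graph[a][b] += 1
        graph[b][a] += 1

def eigenvector_centrality(g, iters=100):
    """Power iteration: an investor scores highly when connected
    to other well-connected investors."""
    scores = {node: 1.0 for node in g}
    for _ in range(iters):
        new = {
            node: sum(w * scores[nbr] for nbr, w in g[node].items())
            for node in g
        }
        norm = sum(v * v for v in new.values()) ** 0.5
        scores = {node: v / norm for node, v in new.items()}
    return scores

centrality = eigenvector_centrality(graph)
```

Here "Investor B" co-invests most broadly, so it receives the highest centrality score, matching the intuition that well-connected investors sit at the center of the network.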
- Headcount growth: The rate at which a company is growing or contracting its headcount serves as a proxy for operations and performance. Because hiring talent is resource intensive, rapid headcount growth can also correlate with increased funding needs in the near future.
- Executive profiles: We encode key background and experience to capture information likely to increase investor confidence and make fundraising more probable. Research by Talaia, Pisoni, and Onetti on “Factors influencing the fund raising process for innovative new ventures” supports our hypothesis that characteristics like founder education have a strong relationship with the ability to attract funding.
- Web traffic growth: Traffic and engagement metrics measure public interest, and segmenting this data by geography can help identify potential market opportunities. Increased interest and expansion both signal possible fundraising needs in order to meet rising demand.
Data preprocessing
The initial dataset of companies and their financing rounds consisted of 32,575 records covering 6,908 distinct companies. After filtering for venture rounds, companies in our geographic focus, and completed deals, 22,895 records remained. Joining against additional data sources to build a robust feature set left 10,032 records. Finally, we removed records that (a) belonged to companies with only a single deal or (b) represented a company’s most recent deal when the company had multiple deals; because these records have no subsequent deal from which to derive a label, they can only be used for inference, not training. The final dataset consists of 4,579 records.
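The filtering steps can be sketched as follows. The records, field names, and values here are illustrative (the geographic filter is omitted); the point is the final step, which drops single-deal companies and each company’s most recent deal since those rows have no next deal to label against.

```python
from collections import defaultdict

# Hypothetical deal records with the fields used in filtering.
records = [
    {"company": "Acme", "type": "venture", "status": "completed", "date": "2019-05-01"},
    {"company": "Acme", "type": "venture", "status": "completed", "date": "2021-02-01"},
    {"company": "Beta", "type": "venture", "status": "completed", "date": "2020-07-01"},
    {"company": "Gamma", "type": "debt", "status": "completed", "date": "2020-01-01"},
]

# Keep completed venture deals.
venture = [r for r in records
           if r["type"] == "venture" and r["status"] == "completed"]

# Group deals by company, then drop single-deal companies and each
# company's most recent deal -- those rows are inference-only, since
# no subsequent deal exists to derive a label from.
by_company = defaultdict(list)
for r in venture:
    by_company[r["company"]].append(r)

trainable = [r
             for deals in by_company.values() if len(deals) > 1
             for r in sorted(deals, key=lambda d: d["date"])[:-1]]
```

In this toy example only Acme’s earlier deal survives: Gamma is filtered out as a non-venture round, Beta has a single deal, and Acme’s latest deal has no successor to label against.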
Numeric features were log-transformed to smooth skewed distributions, and categorical features were reduced to a smaller set of groups that capture a high percentage of the data before being one-hot encoded. By grouping categorical values to cover 70% of the data, for example, we avoided excessive dimensionality while minimizing information loss.
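The categorical grouping step might look like the following stdlib-only sketch, with a hypothetical industry column; the 70% coverage threshold is the one described above.

```python
from collections import Counter

# Hypothetical categorical column (e.g., industry vertical).
values = ["SaaS", "SaaS", "Fintech", "SaaS", "Fintech", "Biotech",
          "SaaS", "Fintech", "Robotics", "SaaS"]

def group_rare(values, coverage=0.7):
    """Keep the most frequent categories until `coverage` of the data
    is covered; map everything else to 'Other' before one-hot encoding."""
    counts = Counter(values)
    kept, cum = set(), 0
    for cat, n in counts.most_common():
        if cum / len(values) >= coverage:
            break
        kept.add(cat)
        cum += n
    return [v if v in kept else "Other" for v in values]

grouped = group_rare(values)
categories = sorted(set(grouped))
# One-hot encode the grouped values: one indicator column per category.
one_hot = [[int(v == c) for c in categories] for v in grouped]
```

Long-tail categories ("Biotech", "Robotics") collapse into a single "Other" column, so the encoded width stays small regardless of how many rare values appear.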
The data was split using a proportion-based grouped time series approach rather than a pure time-based approach with a cutoff date, since the grouped split lowers the risk of overfitting. By keeping all of a company’s deals within a single split, we avoid leaking company-specific patterns across the train, validation, and test sets, which could otherwise arise from deals of the same company appearing in more than one set. The resulting distribution was 74% train, 13% validation, and 13% test.
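The grouping aspect of the split can be sketched as below (stdlib-only, with synthetic records; the time-series ordering used in the actual split is not shown). Each company is assigned wholly to one split, so no company’s deals leak across sets.

```python
import random

# Hypothetical records: (company_id, deal_date) pairs.
records = [(f"company_{i % 20}", f"2020-{(i % 12) + 1:02d}") for i in range(100)]

def grouped_split(records, train=0.74, val=0.13, seed=42):
    """Assign each company (group) wholly to one split so that no
    company's deals appear in more than one of train/val/test."""
    companies = sorted({c for c, _ in records})
    random.Random(seed).shuffle(companies)
    n = len(companies)
    n_train = round(n * train)
    n_val = round(n * val)
    split_of = {}
    for i, c in enumerate(companies):
        if i < n_train:
            split_of[c] = "train"
        elif i < n_train + n_val:
            split_of[c] = "val"
        else:
            split_of[c] = "test"
    return {
        name: [r for r in records if split_of[r[0]] == name]
        for name in ("train", "val", "test")
    }

splits = grouped_split(records)
```

Because proportions apply to companies rather than records, the record-level split ratios only approximate 74/13/13 when companies have different deal counts.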
Labeling
We framed the problem as a binary classification task: predict whether a company will raise within the next 12 months (label = 0) or after that period (label = 1). We considered longer horizons such as 18 or 24 months, but the 12-month threshold was chosen to minimize class imbalance and the resulting need for under- or over-sampling in model training.
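The labeling rule reduces to binarizing the gap between consecutive deal dates, as sketched below (the 365-day constant is our approximation of the 12-month horizon).

```python
from datetime import date

HORIZON_DAYS = 365  # 12-month threshold used for binarization

def label(current_deal: date, next_deal: date) -> int:
    """Return 0 if the next raise falls within 12 months of the
    current deal, 1 if it comes later."""
    return 0 if (next_deal - current_deal).days <= HORIZON_DAYS else 1
```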
Models and Evaluation
Model Selection
Despite the various filters applied in preprocessing, the final dataset had a moderate share of nulls across its features. This is an artifact of working with venture data from third-party vendors, where data points may not be publicly known (e.g., company valuation) or may not have been tracked long enough to form a usable time series (e.g., headcount or web traffic). To proceed without dropping a substantial number of records, we chose three classification models that natively handle null feature values:
- HistGradientBoostingClassifier
- XGBClassifier
- CatBoostClassifier
Evaluation
We selected the F-score to balance two risks: missing deals that occur within the next 12 months (false positives) and spending time with companies that end up raising later than expected (false negatives). Because we are more concerned about missing potential deals than about allocating resources to companies that won’t raise soon, we penalize false positives more heavily, weighting precision twice as much as recall. Our evaluation metric is therefore the F0.5 score.
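For concreteness, the F-beta formula underlying this choice is sketched below (equivalent to scikit-learn’s `fbeta_score` with `beta=0.5`; this helper is illustrative, not the evaluation code we ran).

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta = 0.5 weights precision twice as much as recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def f05_from_counts(tp: int, fp: int, fn: int) -> float:
    """F0.5 computed directly from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return f_beta(precision, recall, beta=0.5)
```

With beta below 1, a classifier with high precision and modest recall scores better than one with the reverse profile, which is exactly the trade-off we want.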
In training and hyperparameter tuning, we experimented with several model configurations: using only the features deemed important via feature permutation, upsampling the minority class, and downsampling the majority class. Due to the dataset’s small size, upsampling led to overfitting as models “memorized” duplicated data points, while downsampling reduced performance through information loss. Our final model was the HistGradientBoostingClassifier trained on only the important features.
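Feature permutation works by shuffling one feature column at a time and measuring the drop in a model’s score; features whose shuffling barely hurts performance can be discarded. Below is a stdlib-only sketch of the idea with a toy "model" (a lambda that reads only feature 0), not the scikit-learn `permutation_importance` routine itself.

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Average score drop when one feature column is shuffled; a large
    drop means the model relies on that feature."""
    rng = random.Random(seed)
    base = metric(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [col[i]] + row[j + 1:]
                      for i, row in enumerate(X)]
            drops.append(base - metric(y, [predict(r) for r in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: feature 0 fully determines the label, feature 1 is noise,
# so feature 1's importance should be exactly zero.
X = [[i % 2, (i * 7) % 5] for i in range(40)]
y = [row[0] for row in X]
importances = permutation_importance(lambda row: row[0], X, y, accuracy)
```

Shuffling feature 0 destroys the toy model’s accuracy while shuffling feature 1 changes nothing, so the importances separate the informative feature from the noise feature.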
In comparing model performance, we aimed to beat a baseline that always predicts the majority class: that a company will raise in over 12 months. The baseline achieved an F0.5 score of 0.542 on the test set, while the model reached only 0.309. With the baseline outscoring the model by nearly 75%, we are much better off guessing the majority class than using ML.
Fully aware that we had failed to fit a useful model, we filtered the dataset to just software companies, not to hack our way to a better metric but out of curiosity about what a tighter industry focus would yield. On this refined dataset, the baseline achieved an F0.5 score of 0.514 versus the model’s 0.510. While more encouraging, this result suggests that our model still underperforms naive guessing even with a narrower problem space and careful feature engineering.
Key Learnings and Impact
This paper highlights both the potential and limits of using ML to predict when venture-backed (software) companies will next raise funds. While features like investor centrality, functional headcount growth, and executive profiles bring our model close to baseline, the complex nature of venture data makes predicting timing incredibly difficult. Similar to predicting valuation increase or exits to public markets, exogenous factors—macro conditions, market trends, and preemptions—play a major role and challenge the ability of models to learn patterns that generalize to unseen records.
Additionally, we derived labels for each record by calculating the difference in days between a company’s current and next deal date and then binarizing the values. However, because deal date information comes from various sources (e.g., press releases, direct company or investor disclosures) and represents different transaction stages (e.g., announced vs. closed), there’s no single source of truth. This lack of standardization introduces significant variability to the labels, adding noise that makes it difficult for a model to detect consistent patterns.
Given these constraints, a conviction-driven, research-backed approach may be more practical. In cases where model performance falls short, investors can rely on deep qualitative insights and strategic relationships, focusing resources on companies they have a strong belief in rather than timing predictions.
In conclusion, with the current data sources and our specific industry and geographic filters, precise fundraising predictions may be challenging to achieve through modeling alone. This paper posits that investors can benefit more from using data to inform strategic focus areas, blending quantitative insights with qualitative research and judgment for effective decision making about which companies to prioritize.
Read more insights from Max in our Unleashing the Power of Data in VC blog series.
You can also find Max on LinkedIn.
This material does not constitute an offer, solicitation or recommendation to sell or an offer to buy any securities, investment products or investment advisory services. Any offer or solicitation will be made only pursuant to a confidential private placement memorandum and subscription documents (the “Offering Materials”) and will be subject to the terms and conditions contained in such Offering Materials.