Authors: Allie Giddings and Chhaya Methani
In our last blog post, we explained how news is surfaced in Supply Chain Insights and how it can be useful for having better risk visibility. Since then, we’ve made two major updates to Supply Chain Insights News. First, we added a tag describing whether the news article has an immediate, future, or positive impact on the supply chain. Second, we added tags for the category of news. These updates can help supply chain managers quickly see the most impactful and relevant news for them, as well as filter to relevant categories of news.
What are immediate, future, and positive impact articles?
Immediate impact articles are those with an expected negative effect on a supply chain in the near term. Here are two examples of articles with immediate impact:
- More than 900 layoffs planned at
- plant in suspends operations, adjusts production line to minimize impact
Future impact articles are those with an expected negative effect on a supply chain further out rather than in the near term. Here's an example of a future impact article:
- lockdown will not have a big impact on production
Positive impact articles are those with an expected positive effect on a supply chain. Here are two examples of articles with positive impact:
- acquires Business
- in talks with and to setup local semiconductor plants
What data does the model use to learn?
The model is trained on news articles from recent months, each labeled as immediate impact, future impact, positive impact, or no impact. The key challenge is the data imbalance among these classes: immediate and future impact articles tend to make up roughly 3% of all news related to a company. While this justifies the need for an AI model to filter and surface these infrequent articles, it also makes it hard to collect enough examples of these classes to train the model effectively. Sampling articles at random would mean labeling a large number of articles, at high cost, to get enough examples of the rare classes.
To help with the data imbalance, we combined several techniques to generate a representative sample set to be labeled through crowdsourcing. We bootstrapped the process with a few labeled articles, then used automation to find similar articles by measuring contextual similarity in an embedding space into which all articles are mapped.
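As a rough illustration of the similarity step, the sketch below ranks unlabeled articles by their cosine similarity to a handful of labeled seed articles. The function names, the toy 3-dimensional embeddings, and the choice of cosine similarity are assumptions made for this example; the post does not say which similarity measure or encoder was used.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def shortlist_for_labeling(seed_embeddings, pool, top_k):
    """Rank unlabeled articles by their best similarity to any labeled seed."""
    scored = []
    for article_id, embedding in pool.items():
        best = max(cosine_similarity(embedding, seed) for seed in seed_embeddings)
        scored.append((best, article_id))
    scored.sort(reverse=True)
    return [article_id for _, article_id in scored[:top_k]]

# Toy embeddings (hypothetical); real ones would come from a text encoder.
seeds = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
pool = {
    "layoffs_article": [0.85, 0.15, 0.05],
    "sports_article": [0.0, 0.1, 0.95],
    "plant_closure_article": [0.7, 0.3, 0.1],
}
shortlist = shortlist_for_labeling(seeds, pool, top_k=2)
print(shortlist)
```

Only the shortlisted articles would then be sent to judges, which concentrates labeling effort on articles likely to belong to the rare classes.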
How does the model learn?
For the impact model, we trained a multi-class classifier that selects one of four categories: 1) immediate impact, 2) future impact, 3) no impact, and 4) positive impact. We used a combination of statistical NLP features and contextual representations of the article text as features. A deep neural network uses these features to train an impact classifier made up of four softmax classifiers, one for each class. With this one-vs-all strategy, each class gets its own classifier, which gave us the best performance for each class. We selected a threshold for each category with the help of an ROC curve.
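To make the threshold step concrete, here is a minimal sketch of picking a cut-off for one class from its ROC curve. It assumes the threshold is chosen by maximizing TPR minus FPR (Youden's J statistic); that criterion, the function names, and the scores are illustrative assumptions, since the post only says an ROC curve was used.

```python
def roc_points(scores, labels):
    """Return (threshold, tpr, fpr) for each distinct score used as a cut-off.

    Assumes labels contain at least one positive (1) and one negative (0).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points

def pick_threshold(scores, labels):
    """Choose the cut-off maximizing TPR - FPR (an assumed selection criterion)."""
    best = max(roc_points(scores, labels), key=lambda p: p[1] - p[2])
    return best[0]

# Hypothetical one-vs-all scores for a single class, with 1 = article is in the class.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
threshold = pick_threshold(scores, labels)
print(threshold)
```

Repeating this for each of the four classes yields one threshold per class. A different criterion, such as a minimum precision target, could be substituted in `pick_threshold` without changing the rest.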
What features does the model use?
As mentioned above, we used a combination of statistical, linguistic, and contextual features. The statistical features include term frequency and text length, among others. We used a sentiment classifier to score the sentiment of the article, which helps identify its tone. Additionally, we used contextual representations to capture the overall meaning of the article, produced by deep models trained on long text such as news articles.
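The sketch below shows one way such features could be combined into a single input vector. Everything in it is illustrative: the vocabulary, the word lists, and the lexicon-based sentiment score are hypothetical stand-ins for the real sentiment classifier, and the embedding is a placeholder for the output of a contextual model.

```python
from collections import Counter

# Hypothetical vocabulary and sentiment word lists, for illustration only.
VOCABULARY = ["layoffs", "strike", "acquires", "shutdown"]
NEGATIVE_WORDS = {"layoffs", "strike", "shutdown"}
POSITIVE_WORDS = {"acquires", "expands"}

def statistical_features(text):
    """Term frequencies over a fixed vocabulary plus the text length in tokens."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0] * (len(VOCABULARY) + 1)
    counts = Counter(tokens)
    term_frequencies = [counts[word] / len(tokens) for word in VOCABULARY]
    return term_frequencies + [float(len(tokens))]

def sentiment_score(text):
    """Crude lexicon score in [-1, 1]; a stand-in for a trained sentiment classifier."""
    tokens = text.lower().split()
    positive = sum(1 for t in tokens if t in POSITIVE_WORDS)
    negative = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total

def build_features(text, contextual_embedding):
    """Concatenate statistical, sentiment, and contextual features into one vector."""
    return statistical_features(text) + [sentiment_score(text)] + list(contextual_embedding)

headline = "Supplier announces layoffs after plant shutdown"
embedding = [0.12, -0.40, 0.33]  # placeholder for a real encoder output
features = build_features(headline, embedding)
print(len(features))
```

The resulting vector is what a downstream classifier would consume; in practice the contextual part would have hundreds of dimensions rather than three.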
What are the categories?
The categories are related to supply chain topics and are shown in the list below.
- Bankruptcy, acquisition, and collaboration: Contains information about mergers and acquisitions, bankruptcy, or new or reduced collaborations with other companies or suppliers.
- Company: Contains information relevant to the company, such as change in leadership or important personnel, new investment areas, or awards.
- Company financial: Contains information about the growth and financial outlook of an individual company.
- Disruption and weather: Contains information about events causing direct supply disruption, such as factory fires, explosions, leaks, Suez Canal blockage, or natural disasters such as forest fires or hurricanes.
- Health: Contains information about human and animal epidemics and pandemics, such as COVID-19, Ebola, or H1N1.
- Industry financial: Contains information that focuses on the growth or financial outlook of an entire industry.
- Industry supplier: Contains information about other suppliers in the same industry, such as top supplier lists or general supplier risk articles.
- Infrastructure: Contains information about general infrastructure improvements that could benefit a specific supplier.
- Politics and government: Contains information such as government investigations, government collaborations, discounts/deals, lobbying, litigation, or regulations.
- Product: Contains information about new or old products of the company, such as new technologies used in existing products or removal of product lines.
- Quality: Contains information about supplier quality or quality control issues.
- Sustainability: Contains information such as new or existing sustainability efforts or environmental impacts.
- Workforce: Contains information affecting employees, such as strikes or workplace conditions.
What data does the model use to learn?
The category model is trained on recent news articles collected using the Bing News API. Each article can have multiple category labels. For example, an article about a workforce strike caused by local political matters belongs to two categories. This adds complexity to labeling, because missing a dominant label for an article confuses the classifier, so it is important to assign all of an article's dominant categories as labels. In our experience, however, judges tend to miss some categories when asked to choose all applicable labels from the full list of categories, and this hurt classifier performance.
To get a relatively complete set of labels, we modified the labeling task. We asked the judges to select the dominant categories in the article from a subset of categories suggested by a rules-based classifier. A rules-based classifier is prone to mistakes, but these are useful: when judges reject a wrongly suggested category, the result is a good negative example for the classifier to learn from. One issue with the rules-based classifier is recall: if it never suggests a category for an article, judges cannot select it, and the classifier cannot learn from examples that never appear in the dataset. Hence, it is important to add some randomly selected categories to the list of options presented to the judges. Following this approach, we obtained the set of labels used to train the classifier.
The classifier's performance varied across categories. To improve the weaker categories, we decided to add more samples. However, extracting the most useful samples for specific categories is challenging because of data imbalance: each category accounts for a relatively small number of articles overall. To collect more data, we followed an approach similar to the one described above for the impact classifier, shortlisting candidates by their similarity to a small representative set. This improved model performance.
By combining these techniques, we collected a sizable set of articles for every category and added more samples to the categories where the model needed more examples to resolve ambiguities and perform well.
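The modified labeling task can be sketched as follows: a rules-based classifier suggests categories, and a few randomly chosen extra categories are added to counter its limited recall. The keyword rules, the function names, and the number of extra categories are hypothetical; only the category names come from the list above.

```python
import random

CATEGORIES = [
    "Bankruptcy, acquisition, and collaboration", "Company", "Company financial",
    "Disruption and weather", "Health", "Industry financial", "Industry supplier",
    "Infrastructure", "Politics and government", "Product", "Quality",
    "Sustainability", "Workforce",
]

# Hypothetical keyword rules standing in for the real rules-based classifier.
KEYWORD_RULES = {
    "Workforce": ["strike", "workers"],
    "Politics and government": ["government", "regulation"],
    "Health": ["pandemic", "epidemic"],
}

def rule_based_categories(text, keyword_rules):
    """Suggest every category that has at least one keyword in the text."""
    lowered = text.lower()
    return [cat for cat, words in keyword_rules.items() if any(w in lowered for w in words)]

def options_for_judges(text, keyword_rules, extra=2, rng=None):
    """Rule-based suggestions plus a few random categories to improve recall."""
    rng = rng or random.Random()
    suggested = rule_based_categories(text, keyword_rules)
    remaining = [c for c in CATEGORIES if c not in suggested]
    sampled = rng.sample(remaining, min(extra, len(remaining)))
    return suggested + sampled

article = "Workers strike at the plant over new government regulation"
options = options_for_judges(article, KEYWORD_RULES, extra=2, rng=random.Random(0))
print(options)
```

Judges then accept or reject each option. Rejected rule-based suggestions become negative examples, while the random extras give categories the rules missed a chance to be selected.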
How does the model learn?
The model learns by fine-tuning a RoBERTa model. We add a softmax layer for each class and assign every category whose score is above a threshold determined empirically for that category. In this manner, we can assign multiple labels to an article that belongs to several categories. To evaluate performance, we considered each assigned category individually and computed the F1 score for each category.
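The assignment and evaluation steps can be illustrated with a small sketch. The scores, thresholds, and true labels below are made up for the example, and the helper names are hypothetical; the scores stand in for the per-category outputs of the fine-tuned model.

```python
def assign_categories(scores, thresholds):
    """Keep every category whose score reaches its own threshold."""
    return {cat for cat, score in scores.items() if score >= thresholds[cat]}

def f1_for_category(category, predicted_sets, true_sets):
    """F1 score for one category, treating it as a binary label per article."""
    pairs = list(zip(predicted_sets, true_sets))
    tp = sum(1 for p, t in pairs if category in p and category in t)
    fp = sum(1 for p, t in pairs if category in p and category not in t)
    fn = sum(1 for p, t in pairs if category not in p and category in t)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-category thresholds and model scores for three articles.
thresholds = {"Workforce": 0.5, "Politics and government": 0.6, "Health": 0.7}
article_scores = [
    {"Workforce": 0.9, "Politics and government": 0.7, "Health": 0.1},
    {"Workforce": 0.6, "Politics and government": 0.2, "Health": 0.8},
    {"Workforce": 0.3, "Politics and government": 0.65, "Health": 0.2},
]
true_labels = [
    {"Workforce", "Politics and government"},
    {"Health"},
    {"Workforce", "Politics and government"},
]

predicted = [assign_categories(s, thresholds) for s in article_scores]
workforce_f1 = f1_for_category("Workforce", predicted, true_labels)
print(predicted[0], workforce_f1)
```

Because each category has its own threshold and is scored on its own, a weak category shows up directly in its F1 score and can be targeted with more training samples.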
What features does the model use?
The features are the articles' representations in a high-dimensional contextual (semantic) space, where each article is a data point. The softmax classification layer uses these representations to learn the boundaries between classes.
Supply Chain News Model in Supply Chain Insights
In Supply Chain Insights, customers can see news that impacts their partners and collaborate with them to reduce the risks that the news surfaces. These news articles are selected by a machine learning model and tagged with their impact and category, so only the most important information is surfaced. In this post, we explained how impact and categories are determined so that you can act on risks to your supply chain more effectively.
Check out our documentation for more information on this news feature and Supply Chain Insights. Which categories are most relevant to you? Share your thoughts on our forum: http://aka.ms/sci-forum. Please send any feedback on news article relevance to [email protected].