Using Python to unearth a goldmine of threat intelligence from leaked chat logs
Sifting through large volumes of data is time-consuming, and Python can help analysts sort information and extract the data most relevant to their investigation. The open-source library MSTICPy, for example, is a Python tool dedicated to threat intelligence. It aims to help threat analysts acquire, enrich, analyze, and visualize data.
This blog provides a workflow for deeper data analysis and visualization using Python, as well as for extraction and analysis of indicators of compromise (IOCs) using MSTICPy. Data sets from the February 2022 leak of data from the ransomware-as-a-service (RaaS) operation known as “Conti” are used as a case study.
An interactive Jupyter notebook with related data is also available for analysts interested in further data exploration.
This blog aims to show a research methodology that may help other analysts apply Python to threat intelligence. Analysts can reuse the code and continue to explore the extracted information. Additionally, it offers an out-of-the-box methodology for analyzing chat logs, extracting IOCs, and improving threat intelligence and defense processes using Python.
Using Python to analyze the Conti network
On February 28, 2022, a Twitter account named @ContiLeaks (allegedly run by a Ukrainian researcher) began posting leaked Conti data. The leaked data sets, which were posted over several months, consisted of chat logs, source code, and backend applications.
For this research, we focused our analysis on the chat logs, which revealed crucial information about the Conti group’s operating methods, infrastructure, and organizational structure.
Compiling and translating chat logs
The leaked chat logs are written in Russian. To make the analysis more accessible, we adopted the methodology published here and translated the logs into English.
The chat logs revealed that the Conti group uses the messaging application Jabber to communicate among members. Because the raw Jabber logs are saved as one file per day, they can be compiled into a single JSON file that is easy to manipulate with Python. Once the data is merged, it can be translated using the deep-translator library. After the logs are translated and saved to a new file, the data can be loaded into a dataframe for manipulation and exploration:
import pandas as pd
df = pd.read_json('translated_Log2.json', encoding='utf-8')
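The merge step that precedes this load can be sketched with the standard library. This is a minimal sketch, not the notebook's code: it assumes each per-day file has a `.json` extension and holds JSON objects written one after another, each with an ISO 8601 `ts` field. The helper names `iter_json_objects` and `merge_daily_logs` are hypothetical.

```python
import json
from pathlib import Path


def iter_json_objects(text):
    """Yield every JSON object found in a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def merge_daily_logs(log_dir, out_file):
    """Merge all per-day log files in log_dir into one JSON array on disk."""
    messages = []
    for path in sorted(Path(log_dir).glob("*.json")):
        messages.extend(iter_json_objects(path.read_text(encoding="utf-8")))
    # ISO 8601 timestamps sort chronologically when compared as strings.
    messages.sort(key=lambda m: m["ts"])
    Path(out_file).write_text(
        json.dumps(messages, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(messages)
```

Writing the merged file outside `log_dir` keeps it from being picked up as input on a later run.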
Russian slang words that the automated process does not translate properly can be handled with a custom dictionary. In this case, a dictionary based on a list proposed here was used to correctly translate the slang.
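A dictionary-based correction pass might look like the sketch below. The three entries are illustrative assumptions, showing literal machine translations of common Russian slang, and are not the list used in the original analysis. `SLANG` and `fix_slang` are hypothetical names.

```python
import re

# Illustrative entries only: literal machine translations of Russian slang
# mapped to the meaning an analyst would expect. Replace with a vetted list.
SLANG = {
    "toad": "Jabber",          # from "жаба"
    "cue ball": "Bitcoin",     # from "биток"
    "wheelbarrow": "machine",  # from "тачка"
}

# One case-insensitive pattern; longer phrases come first so they take priority.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(SLANG, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def fix_slang(text):
    """Replace every known slang term in text with its intended meaning."""
    return _PATTERN.sub(lambda m: SLANG[m.group(1).lower()], text)
```

The word boundaries keep the substitution from touching longer words that merely contain a slang term.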
Analyzing the chat activity timeline
One way to get insights from chat logs is to plot a timeline of the number of discussions per day. The Bokeh library can be used to build an interactive diagram and explore the loaded dataframe.
Using the data from Conti chat logs generates the following diagram, which shows the volume of Jabber discussions over time:
Visualizing the data as a timeline shows some peaks of activity that align with certain events. In the case of the Conti leaks, for example:
- July 7, 2021 (615 discussions): Ransomware attack by REvil against software company Kaseya
- August 10, 2021 (853 discussions): Ransomware attack by Conti against Meyer Corporation
- August 27, 2021 (1,289 discussions): The playbook of a specific Conti affiliate was leaked
- August 31, 2021 (1,156 discussions): FBI and CISA advisory on ransomware and Labor Day
It’s interesting that no peak in chat activity was observed within the Conti group after the first leak, which could indicate that the breach was ignored or not known by the group at that time.
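The per-day counts behind such a timeline can be computed before any plotting. The sketch below assumes each message is a dict with an ISO 8601 `ts` string. The function names are hypothetical, and the result could be handed to a plotting library such as Bokeh.

```python
from collections import Counter


def messages_per_day(messages):
    """Count messages per calendar day from ISO 8601 'ts' strings."""
    # The first 10 characters of an ISO timestamp are the date (YYYY-MM-DD).
    return Counter(m["ts"][:10] for m in messages)


def busiest_days(messages, top=5):
    """Return the top (date, count) pairs, busiest first."""
    return messages_per_day(messages).most_common(top)
```

Listing the busiest days first gives a short set of dates to compare against public events.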
Analyzing the level of user activity
When analyzing chat logs, identifying the number of users and analyzing the most active ones can provide insight into the size of the group and roles of users within it. Using Python, the list of users can be extracted and saved in a text file:
Running the script above using the Conti chat logs yielded a list of 346 unique accounts. This list can then be used to create a graph and show which users sent the most messages.
Based on the graph, the users named defender, stern, driver, bio, and mango have the largest number of discussions. Check Point published extensive research on the structure of the organization and correlated the user discussions with several roles and services like human resources, coders, crypters, offensive team, SysAdmins, and more.
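User extraction and ranking can be approximated as follows. This sketch is not the script used for the results above. It assumes each message carries `from` and `to` fields holding Jabber IDs of the form `user@host/resource`, and the helper names are hypothetical.

```python
from collections import Counter


def username(jid):
    """Strip the domain and resource from a Jabber ID such as 'user@host/res'."""
    return jid.split("@", 1)[0].lower()


def user_activity(messages):
    """Return (sorted unique users, Counter of messages sent per user)."""
    sent = Counter(username(m["from"]) for m in messages)
    # Include users who only ever received messages.
    users = set(sent) | {username(m["to"]) for m in messages}
    return sorted(users), sent
```

Calling `sent.most_common(10)` on the returned counter gives the ten most active senders, which is the input a bar chart needs.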
Mapping the users’ connections
Another way to analyze chat log data is to visualize the users’ connections. This can be done by creating a dynamic network graph that highlights the connections between users. The Barnes-Hut algorithm and the Pyvis library can be used to visualize this data.
Dynamic visualization shows a graphical overview of the network and allows zooming in to closely analyze the connections within it. Bigger points represent the most active users, and it’s possible to highlight a user to analyze their connections. Additionally, hovering over a user shows which other users they had conversations with.
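The data behind such a graph is a weighted edge list plus a size for each node. The sketch below builds both with the standard library, assuming messages with `from` and `to` fields. `build_edges` and `node_weights` are hypothetical helpers, and rendering is left to a graph library.

```python
from collections import Counter


def build_edges(messages):
    """Count messages exchanged by each unordered pair of users."""
    edges = Counter()
    for m in messages:
        a, b = m["from"], m["to"]
        if a != b:  # ignore messages a user sent to themselves
            edges[tuple(sorted((a, b)))] += 1
    return edges


def node_weights(edges):
    """Total message volume per user, usable as a node size."""
    weights = Counter()
    for (a, b), count in edges.items():
        weights[a] += count
        weights[b] += count
    return weights
```

With Pyvis, these values could be passed as the `value` option of `Network.add_node` and `Network.add_edge`, with `Network.barnes_hut()` selecting the layout.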
Searching for other topics of interest
Since reading data sets can be time-consuming, a simple search engine can be built to search for specific strings in the chat logs or to filter for topics of interest. For the Conti leak data, examples of these include Bitcoin, usernames, malware names, exploits, and CVEs, to name a few.
The following code snippet provides a simple search engine using the TextSearch library:
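As a dependency-free stand-in (it does not use TextSearch), a keyword filter over the message bodies can be sketched as below. The `body` field name is an assumption about the log format, and `search` is a hypothetical helper.

```python
def search(messages, *terms):
    """Return messages whose body contains every term, ignoring case."""
    needles = [t.casefold() for t in terms]
    return [
        m
        for m in messages
        # Treat a missing or empty body as an empty string.
        if all(n in (m.get("body") or "").casefold() for n in needles)
    ]
```

For example, `search(messages, "cve", "exploit")` would return only the messages that mention both terms.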
Using MSTICPy to extract and analyze IOCs
Besides processing chat logs to analyze user activity and connections, Python can also be used to extract and analyze threat intelligence. This section shows how the MSTICPy library can be used to extract IOCs and how it can be used for additional threat hunting and intelligence.
Extracting IOCs
MSTICPy is a Python library used for threat investigation and threat hunting. The library can connect to several threat intelligence providers, as well as Microsoft tools like Microsoft Sentinel. It can be used to query logs and to enrich data. It’s particularly convenient for analyzing IOCs and adding more threat contextualization.
After installing MSTICPy, the first thing to do is to initialize the notebook. This allows the loading of several modules that can be used to extract and enrich the data. External resources like VirusTotal or OTX can also be added by configuring msticpyconfig.yaml and adding the API keys.
The IoCExtract module from MSTICPy offers a convenient way to extract IOCs using predefined regex. The code automatically extracts IOCs such as DNS, URLs, IP addresses, and hashes and then reports them in a new dataframe.
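To illustrate regex-based extraction without relying on MSTICPy's API, the sketch below applies a few simplified patterns to a list of strings. The patterns and the `extract_iocs` helper are assumptions made for illustration, and they are far less thorough than the library's predefined expressions.

```python
import re

# Simplified patterns for illustration; real extractors handle many more cases.
IOC_PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "url": re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def extract_iocs(texts):
    """Return {ioc_type: sorted unique matches} across an iterable of strings."""
    found = {name: set() for name in IOC_PATTERNS}
    for text in texts:
        for name, pattern in IOC_PATTERNS.items():
            found[name].update(pattern.findall(text))
    return {name: sorted(values) for name, values in found.items()}
```

Collecting matches in sets removes duplicates as they are found. False positives still need a manual review.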
A custom regex can be added to extract IOC types beyond those the IOC extraction module handles by default. For example, the regex below extracts Bitcoin addresses from the Conti chat logs:
After extracting IOCs, the dataframe can be cleaned to remove false positives and duplicates. The final dataframe from the processed Conti chat logs contains the following counts of unique IOCs (these require additional analysis, as not all of them are malicious):
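One pattern that could serve this purpose is sketched here. It is an assumption and not necessarily the expression used in the original analysis. It matches the general shape of legacy (Base58, starting with 1 or 3) and Bech32 (starting with bc1) addresses, and it does not verify checksums.

```python
import re

# Shape-only match: Base58 addresses starting with 1 or 3, or Bech32 "bc1" ones.
BTC_ADDRESS = re.compile(
    r"\b(?:bc1[ac-hj-np-z02-9]{11,71}|[13][a-km-zA-HJ-NP-Z1-9]{25,34})\b"
)


def extract_btc_addresses(texts):
    """Return the sorted unique Bitcoin-like addresses found in the texts."""
    seen = set()
    for text in texts:
        seen.update(BTC_ADDRESS.findall(text))
    return sorted(seen)
```

Because only the shape is checked, some random strings will match, so the results need the same false-positive review as other IOCs.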
URL | DNS | IPV4 | Bitcoin | MD5 | SHA-256 |
1,137 | 474 | 317 | 175 | 106 | 16 |
Investigating IP addresses
The threat intelligence lookup module TILookup in MSTICPy can be used to get more information on IOCs such as IP addresses. In the case of the Conti leak, 317 unique IP addresses were identified. Not all of these addresses are malicious, but they can still reveal relevant information.
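Before sending addresses to a threat intelligence provider, it can help to drop invalid, private, and reserved addresses so that lookups are spent on routable hosts. The sketch below uses only the standard `ipaddress` module. `refang` and `public_ipv4` are hypothetical helpers and are not part of MSTICPy.

```python
import ipaddress


def refang(value):
    """Undo common defanging such as 1[.]2[.]3[.]4."""
    return value.replace("[.]", ".").replace("(.)", ".").strip()


def public_ipv4(candidates):
    """Return unique, valid, globally routable IPv4 addresses in numeric order."""
    keep = set()
    for raw in candidates:
        try:
            addr = ipaddress.IPv4Address(refang(raw))
        except ipaddress.AddressValueError:
            continue  # not a valid IPv4 address
        if addr.is_global:
            keep.add(addr)
    return [str(a) for a in sorted(keep)]
```

Sorting the address objects rather than their strings puts 8.8.8.8 ahead of 109.230.199.73.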
The configuration file can be specified to load the TILookup module, along with other threat intelligence providers such as VirusTotal, GreyNoise, and OTX.
Running the module generates a new dataframe with more context for every IP address provided.
The module can also look up information for a single observable.
The browser provided by MSTICPy can also be used to explore the IOCs previously enriched. The interactive Jupyter notebook includes this view of the IOCs.
In addition, MSTICPy has an embedded module that looks up the geolocation of IP addresses using MaxMind, which can be used to create a map of the IP addresses previously extracted.
Investigating URLs
Extracted URLs from IOC lists can provide details about targets, tools used to exchange information, and the infrastructure used to deploy attacks. A total of 1,137 unique URLs were extracted from the Conti leak dataset, but not all of them are usable for threat intelligence. The following code snippet shows how to filter for URLs.
A filter can be created to get details on executables, DLLs, ZIP files, and other files related to the extracted URLs. This can provide interesting insights and can be extracted for further research.
Using the same technique for filtering, .onion URLs can also be identified from the URL list. This proved particularly useful in this case, since the Conti group used the Tor network for some of their infrastructure.
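Both filters can be sketched with the standard library. The extension set and the `split_urls` helper below are assumptions chosen for illustration and are not the filter used in the original notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical set of file types worth a closer look.
FILE_EXTENSIONS = {".exe", ".dll", ".zip", ".7z", ".rar", ".ps1", ".bat"}


def split_urls(urls):
    """Group URLs into those pointing at files of interest and .onion hosts."""
    files, onions = [], []
    for url in urls:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        suffix = PurePosixPath(parsed.path).suffix.lower()
        if suffix in FILE_EXTENSIONS:
            files.append(url)
        if host.endswith(".onion"):
            onions.append(url)
    return files, onions
```

Parsing each URL first means the extension check looks only at the path, so query strings and hostnames containing ".exe" are not misclassified.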
Pivoting extracted IOCs using VirusTotal
The pivot functions in the MSTICPy library make it possible to enrich data and discover additional infrastructure and IOCs. This is particularly useful for threat intelligence and threat actor tracking. The next sections demonstrate the use of the VirusTotal module VTLookupV3 in MSTICPy to obtain intelligence about an IP address extracted from the Conti leak dataset that was used to deliver additional malware.
The following code initiates VTLookupV3 in MSTICPy:
The VirusTotal module can be used to get data related to a particular IOC. The code below searches for files downloaded from a particular IP address from the Conti leak dataset:
The results show that the IP address 109[.]230[.]199[.]73 delivers several strains of malware.
The VirusTotal module can then be used to pivot and extract more information about these hashes. The table below shows information about the first hash on the list:
Attributes | |
authentihash | 0d10a35c1bed8d5a4516a2e704d43f10d47ffd2aabd9ce9e04fb3446f62168bf |
creation_date | 1624910154 |
crowdsourced_ids_results | [{[TRUNCATED]’alert_context’: [{‘dest_ip’: ‘8.8.8.8’, ‘dest_port’: 53}, {‘dest_ip’: ‘193.204.114.232’, ‘dest_port’: 123}], ‘rule_url’: ‘https://www.snort.org/downloads/#rule-downloads’, ‘rule_source’: ‘Snort registered user ruleset’, ‘rule_id’: ‘1:527’}, {‘rule_category’: ‘not-suspicious’, ‘alert_severity’: ‘low’, ‘rule_msg’: ‘TAG_LOG_PKT’, ‘rule_raw’: ‘alert ( gid:2; sid:1; rev:1; msg:”TAG_LOG_PKT”; metadata:rule-type preproc; classtype:not-suspicious; )’, ‘alert_context’: [{‘dest_ip’: ‘107.181.161.197’, ‘dest_port’: 443}], ‘rule_url’: ‘https://www.snort.org/downloads/#rule-downloads’, ‘rule_source’: ‘Snort registered user ruleset’, ‘rule_id’: ‘2:1’}] |
crowdsourced_ids_stats | {‘info’: 0, ‘high’: 0, ‘medium’: 2, ‘low’: 1} |
downloadable | TRUE |
exiftool | {‘MIMEType’: ‘application/octet-stream’, ‘Subsystem’: ‘Windows GUI’, ‘MachineType’: ‘AMD AMD64’, ‘TimeStamp’: ‘2021:06:28 19:55:54+00:00’, ‘FileType’: ‘Win64 DLL’, ‘PEType’: ‘PE32+’, ‘CodeSize’: ‘115712’, ‘LinkerVersion’: ‘14.16’, ‘ImageFileCharacteristics’: ‘Executable, Large address aware, DLL’, ‘FileTypeExtension’: ‘dll’, ‘InitializedDataSize’: ‘69632’, ‘SubsystemVersion’: ‘6.0’, ‘ImageVersion’: ‘0.0’, ‘OSVersion’: ‘6.0’, ‘EntryPoint’: ‘0x139c4’, ‘UninitializedDataSize’: ‘0’} |
first_submission_date | 1624917754 |
last_analysis_date | 16365918529 |
last_analysis_results | { [TRUNCATED] ‘20211110’}, ‘Tencent’: {‘category’: ‘undetected’, ‘engine_name’: ‘Tencent’, ‘engine_version’: ‘1.0.0.1’, ‘result’: None, ‘method’: ‘blacklist’, ‘engine_update’: ‘20211111’}, ‘Ad-Aware’: {‘category’: ‘malicious’, Edition’: {‘category’: ‘malicious’, ‘engine_name’: ‘McAfee-GW-Edition’, ‘engine_version’: ‘v2019.1.2+3728’, ‘result’: ‘RDN/CobaltStrike’, ‘method’: ‘blacklist’, ‘engine_update’: ‘20211110’}, ‘Trapmine’: {‘category’: ‘type-unsupported’, ‘engine_name’: ‘Trapmine’, ‘engine_version’: ‘3.5.0.1023’, ‘result’: None, ‘method’: ‘blacklist’, ‘engine_update’: ‘20200727’}, ‘CMC’: {‘category’: ‘undetected’, ‘engine_name’: ‘CMC’, ‘engine_version’: ‘2.10.2019.1’, ‘result’: None, ‘method’: ‘blacklist’, ‘engine_update’: ‘20211026’}, ‘Sophos’: {‘category’: ‘malicious’, ‘engine_name’: ‘Sophos’, ‘engine_version’: ‘1.4.1.0’, ‘result’: |
last_analysis_stats | {‘harmless’: 0, ‘type-unsupported’: 6, ‘suspicious’: 0, ‘confirmed-timeout’: 1, ‘timeout’: 0, ‘failure’: 0, ‘malicious’: 47, ‘undetected’: 19} |
last_modification_date | 1646895757 |
last_submission_date | 1624917754 |
magic | PE32+ executable for MS Windows (DLL) (GUI) Mono/.Net assembly |
md5 | 55646b7df1d306b0414d4c8b3043c283 |
meaningful_name | 197.dll |
names | [197.dll, iduD2A1.tmp] |
pe_info | [TRUNCATED] {‘exports’: [‘StartW’, ‘7c908697e85da103e304d57e0193d4cf’}, {‘name’: ‘.rsrc’, ‘chi2’: 51663.55, ‘virtual_address’: 196608, ‘entropy’: 5.81, ‘raw_size’: 1536, ‘flags’: ‘r’, ‘virtual_size’: 1128, ‘md5’:, ‘GetStringTypeW’, ‘RtlUnwindEx’, ‘GetOEMCP’, ‘TerminateProcess’, ‘GetModuleHandleExW’, ‘IsValidCodePage’, ‘WriteFile’, ‘CreateFileW’, ‘FindClose’, ‘TlsGetValue’, ‘GetFileType’, ‘TlsSetValue’, ‘HeapAlloc’, ‘GetCurrentThreadId’, ‘SetLastError’, ‘LeaveCriticalSection’]}], ‘entry_point’: 80324} |
popular_threat_classification | {‘suggested_threat_label’: ‘trojan.bulz/shelma’, ‘popular_threat_category’: [{‘count’: 22, ‘value’: ‘trojan’}, {‘count’: 6, ‘value’: ‘downloader’}, {‘count’: 2, ‘value’: ‘dropper’}], ‘popular_threat_name’: [{‘count’: 6, ‘value’: ‘bulz’}, {‘count’: 6, ‘value’: ‘shelma’}, {‘count’: 3, ‘value’: ‘cobaltstrike’}]} |
reputation | 0 |
sandbox_verdicts | {‘Zenbox’: {‘category’: ‘malicious’, ‘sandbox_name’: ‘Zenbox’, ‘malware_classification’: [‘MALWARE’, ‘TROJAN’, ‘EVADER’]}, ‘C2AE’: {‘category’: ‘undetected’, ‘sandbox_name’: ‘C2AE’, ‘malware_classification’: [‘UNKNOWN_VERDICT’]}, ‘Yomi Hunter’: {‘category’: ‘malicious’, ‘sandbox_name’: ‘Yomi Hunter’, ‘malware_classification’: [‘MALWARE’]}, ‘Lastline’: {‘category’: ‘malicious’, ‘sandbox_name’: ‘Lastline’, ‘malware_classification’: [‘MALWARE’]}} |
sha1 | ddf0214fbf92240bc60480a37c9c803e3ad06321 |
sha256 | cf0a85f491146002a26b01c8aff864a39a18a70c7b5c579e96deda212bfeec58 |
sigma_analysis_stats | {‘high’: 0, ‘medium’: 1, ‘critical’: 1, ‘low’: 0} |
sigma_analysis_summary | {‘Sigma Integrated Rule Set (GitHub)’: {‘high’: 0, ‘medium’: 0, ‘critical’: 1, ‘low’: 0}, ‘SOC Prime Threat Detection Marketplace’: {‘high’: 0, ‘medium’: 1, ‘critical’: 0, ‘low’: 0}} |
size | 181248 |
ssdeep | 3072:fck3rwbtOsN4X1JmKSol6LZVZgBPruYgr3Ig/XZO9:fck3rwblqPgokNgBPr9gA |
tags | [assembly, invalid-rich-pe-linker-version, detect-debug-environment, long-sleeps, 64bits, pedll] |
times_submitted | 1 |
tlsh | T110049E14B2A914FBEE6A82B984935611B07174624338DFEF03A4C375DE0E7E15A3EF25 |
total_votes | {‘harmless’: 0, ‘malicious’: 0} |
trid | [{‘file_type’: ‘Win64 Executable (generic)’, ‘probability’: 48.7}, {‘file_type’: ‘Win16 NE executable (generic)’, ‘probability’: 23.3}, {‘file_type’: ‘OS/2 Executable (generic)’, ‘probability’: 9.3}, {‘file_type’: ‘Generic Win/DOS Executable’, ‘probability’: 9.2}, {‘file_type’: ‘DOS Executable Generic’, ‘probability’: 9.2}] |
type_description | Win32 DLL |
type_extension | dll |
type_tag | pedll |
unique_sources | 1 |
vhash | 115076651d155d15555az43=z55 |
The results indicate that the file is a Cobalt Strike loader, which means that Conti affiliates also use the penetration testing tool as part of their infrastructure during their operations.
In addition, the VirusTotal module can provide details such as detection rate, type, description, and other information related to the hashes. The code snippet below generates the list of domains that the files connect to.
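Two fields in the table above are easier to read after a small transformation: the epoch timestamps and the per-engine verdict counts. The sketch below is a minimal example. The helper names are hypothetical, and the ratio shown is one possible convention that counts only engines that returned a verdict.

```python
from datetime import datetime, timezone


def detection_ratio(stats):
    """Share of engines with a verdict that flagged the file as malicious."""
    verdicts = ("harmless", "suspicious", "malicious", "undetected")
    total = sum(stats.get(key, 0) for key in verdicts)
    return stats.get("malicious", 0) / total if total else 0.0


def epoch_to_utc(seconds):
    """Convert a Unix timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()
```

Applied to the table, the creation_date value 1624910154 converts to 2021-06-28T19:55:54+00:00, which agrees with the TimeStamp reported in the exiftool field.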
Doing this kind of analysis on the Conti leak data or similar data sets can lead to the discovery of possibly related domains that were not in the initial data sets.
Conclusion
This blog outlines how Python can be used to find valuable threat intelligence from data sets such as chat logs. It also presents details on how processing data using the MSTICPy library can be useful for enriching and hunting within environments, as well as collecting additional threat context. The interactive notebook provides additional code snippets that can also be used to continue log exploration.
The types of information extracted in this blog provide insights into the various elements of the criminal ecosystem that were coordinating their activities. Threat intelligence from research like this informs products and services like Microsoft 365 Defender, translating knowledge into real-world protection for customers. More importantly, the methodology described in this blog can be adapted to specific threat intelligence services, and the broader community is invited to use it for further analysis, enrichment of data, and intelligence sharing for the benefit of all.
Thomas Roccia
Microsoft 365 Defender Research Team
References
- https://krebsonsecurity.com/2022/03/conti-ransomware-group-diaries-part-i-evasion/
- https://research.checkpoint.com/2022/leaks-of-conti-ransomware-group-paint-picture-of-a-surprisingly-normal-tech-start-up-sort-of/
- https://therecord.media/conti-leaks-the-panama-papers-of-ransomware/
- https://www.breachquest.com/conti-leaks-insight-into-a-ransomware-unicorn/
- https://www.forescout.com/resources/analysis-of-conti-leaks/
- https://github.com/Res260/conti_202202_leak_procedures
- https://readme.security/the-conti-leaks-first-rumble-of-the-ukraine-earthquake-thats-rattling-the-cybercrime-underground-7abb23b0fb04
- https://medium.com/@arnozobec/analyzing-conti-leaks-without-speaking-russian-only-methodology-f5aecc594d1b
- https://github.com/soufianetahiri/ContiLeaks/blob/main/cobaltsrike_lolbins
- https://twitter.com/TheDFIRReport/status/1498656118746365952
- https://www.clearskysec.com/wp-content/uploads/2021/02/Conti-Ransomware.pdf
- https://blog.bushidotoken.net/2022/04/lessons-from-conti-leaks.html
- https://www.trellix.com/en-au/about/newsroom/stories/threat-labs/conti-leaks-examining-the-panama-papers-of-ransomware.html
- https://msticpy.readthedocs.io/en/latest/getting_started/Introduction.html