Table of Contents
- Quick and Dirty Cheatsheet (Start here if needed)
1. Quick and Dirty Cheatsheet (Start here if needed)
The following is a VERY short form of the procedure in section six that will help you get things under control quickly. It isn’t meant to be used on its own, as it skips essential steps, so please refer to section six. Note that these instructions assume the incident occurs during a workday. If it’s after hours, do your best to limit the damage and call in your own staff if available, but expect delays: the University is not a 24/7 work environment, and attackers will pick times when the response may be limited.
- DON’T PANIC. Panicking causes more problems, so take a deep breath, relax, and proceed as methodically as possible.
- If things are actively happening, we want to reduce the ‘Blast Radius’ of the attack as quickly as possible, as this will limit the damage and lessen the recovery time and effort required.
- Typically when a ransomware attack is complete, a message will appear on the screen of the device. If this is the case, take a photo and physically or logically disconnect the device from the network. Now. Try not to turn it off unless you absolutely have to, as this can damage forensic evidence.
- If you are sure or strongly suspect a device is infected with ransomware, but there is no message yet, physically or logically disconnect it from the network. Now. Try not to turn it off unless you absolutely have to, as this can damage forensic evidence.
- If many devices are infected, it may be easier to isolate everything by turning off a network switch or wifi access point rather than dealing with each device individually.
- Disconnect any network shares used by any confirmed or suspected devices until the ransomware is contained. Now. Include any mirrors or disaster recovery versions as well.
If you have to ask first before acting, proceed to step 3, get that permission and then make the changes needed. Note that the University Chief Information Security Officer expects that infected devices will be taken off the network immediately and will support you if you take immediate action.
- Let other people know what is happening: be ready with a preliminary scope, but don’t spend time compiling a detailed list of devices and files. If you have an incident response team, tell them; otherwise, let your boss and your Unit Administrator(s) know. Please let Central Incident Response know by sending an email to email@example.com.
- Start collecting and protecting information about the infection. This information should include a list of affected devices and file stores, local or network logs, system images or malicious executables, examples of encrypted files, and screenshots of infected devices.
- Go back to the procedure in section five and work through that, verifying any steps you have already completed.
This playbook is provided by Information Technologies Services – Information Security (ITS-IS) to give a framework and typical workflow to help with recovering from a ransomware attack.
Ransomware is a form of malware used to perpetrate a cryptoviral extortion attack. In the attack, the malware encrypts the victim’s files, making them inaccessible, and an attacker demands a ransom payment to decrypt them. Additionally, the attackers may export the data before encrypting it and add the threat of public distribution of the data if the ransom is not paid. There can be other slight variations on the attack, but these are the most prevalent.
A ransomware attack in the context of this playbook is one where one or more university-owned devices have been infected with malware that has encrypted files, and a ransom demand has been issued.
Typically, ransomware starts on workstations (desktops and laptops) but may propagate to servers. The intended scope of this playbook is all persons who own or manage workstations or servers associated with the University of Toronto. The devices may be physically located on the University’s grounds (on-prem) or located remotely (in the cloud or wherever a person is working). The devices could exist either physically or virtually in either environment.
This document assumes access to the physical devices that are or may have been infected. Additionally, access to network devices to contain infected devices may be required. If virtual machines are involved, then administrator-level access to the hypervisor is assumed.
Familiarity with the server’s Operating System (OS), software, hardware, hypervisor, and network environment is also assumed.
An overarching Security Incident Response Plan should be in place to define roles and responsibilities and what communication about the incident is expected.
The actions described will primarily be completed by subject matter experts (SMEs) with the access and skills required. Other non-technical persons will be involved with the process, including but not limited to Operational management and other administrative staff.
Please see the Security Incident Response Plan for your unit/environment for the list of roles and responsibilities. The University’s Security Incident Response Plan, which provides guidance for individual plans, can be found here: Information Security Response Plan.
The following procedure is organized into logical steps more for organization purposes than as a strict timeline of when things must happen. Many of the steps in the following process can and will happen simultaneously, and this is okay. If it makes sense to perform a step before others, then do that so long as all relevant actions happen. For example, initial containment typically occurs before identification is completed.
A security incident can be a stressful exercise, and it is essential to proceed calmly and methodically to ensure that dealing with the situation does not make things worse. This is not the first time this has happened and is likely not the last. If you have any questions or concerns on how to proceed or who to notify, please start by contacting the Security Incident Response (SIR) team at firstname.lastname@example.org.
SIR will help with these issues and also help manage the incident and connect you with any additional internal or external people or services you may need for the remediation.
For a ransomware attack caused by a random infection of a single machine, the timing will typically also be random. If the attackers have compromised systems in your network and maintained a foothold (what’s known as an Advanced Persistent Threat (APT)), then the attack will often be launched when it can do the most damage before it is noticed. This is typically after hours or over the weekend when no one is around. This type of attack will cause outages, and working calmly and methodically is the best way to recover and identify the root cause to help prevent a recurrence.
Identifying what has been compromised and getting the right people working on it quickly is essential. Ransomware can propagate rapidly through a network, so acting quickly can help to limit the ‘blast radius’ of affected devices.
- Identify what devices have been affected by the attack and act on those first. Typically when a ransomware attack is complete, a message will appear on the screen of the device. If this is the case, take a photo and physically or logically disconnect the device from the network as soon as possible, preferably immediately. Try not to turn it off unless you absolutely have to, as this can damage forensic evidence.
- If you are sure or strongly suspect a device is infected with ransomware, but there is no message yet, then physically or logically disconnect it from the network as soon as possible, preferably immediately. Try not to turn it off unless you absolutely have to, as this can damage forensic evidence.
- If many devices are infected, it may be easier to isolate everything by turning off a network switch or wifi access point rather than dealing with each device individually.
- Disconnect any network shares used by any confirmed or suspected devices until the ransomware is contained. Do this as soon as possible, preferably immediately. Include any mirrors or disaster recovery versions as well.
- Identify what data type(s) exist on the device(s), file shares, or other systems to which they have a direct connection.
- Based on that information and the number of affected devices, determine the severity of the incident. The Security Incident Response Plan will help with this determination. If the severity is uncertain, go with a higher severity: it can be lowered after further review, but the incident may not get the focus it needs if it is initially rated too low.
- With the severity identified, begin to notify the persons required. Ideally, your unit will already have an incident response team identified, and you can tell them; otherwise, let your boss and your Unit Administrator(s) know. Depending on the severity, a local or University-wide CSIRT may be activated, which may engage other services, such as a breach coach or forensics services.
- Restoring from backup is the easiest recovery solution for ransomware. Identify what backups are available of the data affected and also validate that the backups are usable.
- Continue to identify the Who, What, When, Where, Why, and How of the incident to the best level possible. Use local operating system and application logs and network device logs to find as much information as possible about the attack.
- If external resources will be needed, or there is public visibility, then resources to find information should be mobilized as soon as possible.
Documenting the incident is essential to keep everyone involved aware of what has happened or is happening and to help everyone stay calm. Recording your actions also helps protect you and the University if there is ever a question of whether an incident was handled adequately. If there is a chance that this will end in legal proceedings, follow proper evidence handling procedures; the SIR team can help.
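The severity guidance in the list above (when in doubt, start high and lower it later) can be sketched in Python as follows. This is only an illustration: the `Severity` levels and the `initial_severity` helper are hypothetical names, and the actual levels and criteria come from your Security Incident Response Plan.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; use the levels defined in your own plan."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def initial_severity(plausible_levels):
    """Return the highest severity that is still considered plausible.

    Starting high is the safer default: a rating can be lowered after
    review, but an incident rated too low may not get the focus it needs.
    """
    levels = list(plausible_levels)
    if not levels:
        raise ValueError("at least one plausible severity level is required")
    return max(levels)


# Example: responders cannot yet tell whether this is MEDIUM or HIGH.
print(initial_severity([Severity.MEDIUM, Severity.HIGH]).name)  # prints HIGH
```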
The containment stage is primarily concerned with limiting the damage, preventing further damage, and retaining data for further review or possible use in legal proceedings. With ransomware, short-term containment typically happens at the same time as identification of the attack.
- If not already complete, physically or logically disconnect all infected or suspected devices from the network.
- Keeping devices powered on but disconnected is the ideal state. In some cases, the ransomware unlock keys remain resident in memory and can be used to restore the device easily.
- Forensic images of affected devices may be required to understand the root cause of the attack. Ideally, these images are collected BEFORE any mitigation efforts occur on the system(s); however, this may not always be possible, so please endeavour to collect them in as pristine (unchanged) a state as possible.
- For virtual systems, take a snapshot and ensure that the snapshot cannot be accidentally deleted. Include the memory state as well as the data at rest if possible.
- For physical systems, a clone of the physical drives is typically required. Ideally, this will be captured in a running state with a forensics tool; however, an offline clone is acceptable if using a forensics tool is not possible. If possible, retain the original hard drive and rebuild the device with a new drive.
- Collect and review evidence from other sources. These can include remotely collected system logs and network device logs (firewalls, IDS, etc.).
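One way to keep collected evidence in a verifiably unchanged state is to record a hash of every file at collection time and re-check it later. The following Python sketch shows the idea using only the standard library. The function names (`sha256_of`, `build_manifest`, `changed_files`) and the manifest layout are assumptions made for illustration; they are not part of any University tooling and do not replace a proper forensics tool or chain-of-custody procedure.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir):
    """List every file under evidence_dir with its size and SHA-256 hash."""
    root = Path(evidence_dir)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "file": str(path.relative_to(root)),
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
        "files": entries,
    }


def changed_files(evidence_dir, manifest):
    """Return files that are missing or whose hash no longer matches."""
    root = Path(evidence_dir)
    return [
        entry["file"]
        for entry in manifest["files"]
        if not (root / entry["file"]).is_file()
        or sha256_of(root / entry["file"]) != entry["sha256"]
    ]


if __name__ == "__main__":
    # Demonstration with a throwaway directory and a made-up log file.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "firewall.log").write_bytes(b"example log line\n")
        manifest = build_manifest(tmp)
        print(json.dumps(manifest, indent=2))
        print(changed_files(tmp, manifest))  # empty list: nothing has changed
```

Storing the manifest somewhere other than the evidence directory itself (for example, with the incident documentation) makes it harder for the two to be altered together.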
The Eradication step deals with the actual removal of malware or other methods attackers have used to gain or maintain a foothold in the affected systems. Restoration of systems and finalizing the detailed timelines and scope of the incident also happen here.
- Where the scope and severity of the incident warrant it, engage a third-party forensics firm to perform a more thorough review of the affected systems. This will typically be at the direction of the CSIRT and will use one of the pre-identified services already vetted and contracted to provide these services to the University as a whole. Ransomware attacks have been known to recur, so it is essential to identify the root cause of the infection to limit the chances of this happening.
- Finalize a formal timeline of the incident with as many details as possible, including:
- Who accessed the systems, and what accounts were changed, accessed, or created.
- What changes were made to the systems by the attackers? What code, malicious or otherwise, was installed or used by the attackers? What data was affected, accessed, or exfiltrated by the attackers?
- When did the attack occur, and for how long were the attackers accessing the systems? When did the method the attackers used become public knowledge (if at all)?
- Where were the systems that were affected, and what other systems shared the same environment? Where did the attackers access the systems from, and where did they enter the network and systems?
- Why were the systems attacked? Was this a targeted attack or just a random attack on a vulnerable system? Why didn’t Antivirus or other tools detect or stop the attack?
- How did the attackers access the system? Did they use a misconfiguration, a known or 0-day vulnerability, or are these deliberate actions by a malicious insider?
- Once the complete timeline and details of the incident are known, rebuild and repair the systems to prepare them to return to operation.
- Ideally, compromised systems should never be returned to production as there is always a chance that some remnant from the attackers remains and could compromise the systems. The best practice is to build replacement systems from scratch and fully patch all software on them.
- If the replacement systems will be entirely new, it is unnecessary to wait for the review to complete before starting this process.
- The only exception to starting clean is where the exact timeline of a compromise is known, and a backup that is known clean from before this time can be used. In this case, the system must still be fully patched to correct the method of attack before it can be put back in production.
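The exception above depends on comparing backup times against the incident timeline. A minimal Python sketch of that comparison follows. The `TimelineEvent` and `Backup` records and the `latest_clean_backup` helper are hypothetical names for illustration, the sample data is made up, and the sketch assumes the timeline is complete enough that its earliest event marks the start of the compromise. A backup chosen this way must still be fully patched before it returns to production.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TimelineEvent:
    when: datetime   # when the attacker activity occurred (UTC)
    who: str         # account or actor involved
    what: str        # change made, code installed, or data touched
    where: str       # affected system or entry point


@dataclass(frozen=True)
class Backup:
    label: str
    taken_at: datetime  # when the backup was completed (UTC)


def latest_clean_backup(events, backups) -> Optional[Backup]:
    """Return the newest backup completed before the earliest known event."""
    if not events:
        raise ValueError("timeline is empty; the compromise start is unknown")
    compromise_start = min(event.when for event in events)
    candidates = [b for b in backups if b.taken_at < compromise_start]
    return max(candidates, key=lambda b: b.taken_at, default=None)


# Made-up example data, for illustration only.
utc = timezone.utc
events = [
    TimelineEvent(datetime(2024, 3, 4, 2, 15, tzinfo=utc),
                  "svc-backup", "new admin account created", "fileserver01"),
    TimelineEvent(datetime(2024, 3, 2, 23, 40, tzinfo=utc),
                  "jdoe", "phishing attachment opened", "workstation17"),
]
backups = [
    Backup("nightly-03-01", datetime(2024, 3, 1, 1, 0, tzinfo=utc)),
    Backup("nightly-03-02", datetime(2024, 3, 2, 1, 0, tzinfo=utc)),
    Backup("nightly-03-03", datetime(2024, 3, 3, 1, 0, tzinfo=utc)),
]
print(latest_clean_backup(events, backups).label)  # prints nightly-03-02
```

If no backup predates the earliest event, the helper returns `None`, which corresponds to the default case of rebuilding from scratch.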
Recovery is the safe redeployment of affected systems back into the production environment. The fastest and easiest method of recovering from a ransomware attack is to restore from known good backups. However, the root cause of the attack needs to be addressed in addition to replacing the damaged files. This helps keep the chance of re-compromise as low as possible and prevents the same attack vector from succeeding again.
- All systems being deployed need to be fully patched before redeployment. It is not sufficient to only patch the software that was the root cause of the compromise; all software on the system should be patched. If software cannot be patched, it must have compensating controls applied to protect it, be listed in the University Risk Registry (contact email@example.com), and have a fixed date for when it will be corrected.
- Systems should be hardened to an industry standard to minimize initial attack surfaces and limit the chances of weak or default configurations making it to production. The Center for Internet Security (CIS) Benchmarks are free hardening standards that provide good baseline security and have been vetted by multiple organizations.
- Before deployment, the systems must be scanned for vulnerabilities. Where possible, this should be an authenticated scan as this provides a better level of assurance than a simple remote scan. For workstations, a single representative device can be scanned, assuming all devices are built to the same standard.
- If the systems are being deployed with entirely new applications, then the standard risk review process for the University should be performed on those applications.
- If there was a failure (or lack) of backups that resulted in data that could not be recovered, ensure that your backup process has been improved to include ransomware resistance.
- If there was a failure of endpoint protection (or a lack of it) that permitted the attack to happen or expand, then ensure that those gaps have been closed or mitigated before redeploying systems. Ideally, modern ‘Next Generation’ endpoint protection, which uses machine learning and process monitoring rather than signatures alone to identify malware, will be deployed.
- Once the final review is complete, any public notifications that the University is legally required to make should happen. These should include as much information as needed to clearly and concisely let an affected person know what happened and what, if anything, they need to do. The wording should be chosen to minimize the chance of causing panic. The CSIRT will identify the notifications that need to be sent. The CSIRT will also provide the notification text with input from the FIPP office and the affected Unit stakeholders. The CSIRT must be kept updated during the restoration process.
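The patching requirement at the start of the list above can be expressed as a simple pre-deployment check. The Python sketch below is illustrative only: the `SoftwareItem` fields and the `deployment_blockers` helper are hypothetical, the sample inventory is made up, and how patch status and Risk Registry entries are actually tracked will depend on your unit's tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SoftwareItem:
    name: str
    fully_patched: bool
    compensating_controls: str = ""   # description of the controls, if any
    in_risk_registry: bool = False    # listed in the University Risk Registry
    fix_due: Optional[date] = None    # fixed date for correcting the gap


def deployment_blockers(items: List[SoftwareItem]) -> List[str]:
    """Return reasons a system is not ready to redeploy (empty list = ready)."""
    blockers = []
    for item in items:
        if item.fully_patched:
            continue
        if not item.compensating_controls.strip():
            blockers.append(f"{item.name}: unpatched with no compensating controls")
        if not item.in_risk_registry:
            blockers.append(f"{item.name}: unpatched and not in the Risk Registry")
        if item.fix_due is None:
            blockers.append(f"{item.name}: unpatched with no fixed correction date")
    return blockers


# Made-up inventory for illustration.
inventory = [
    SoftwareItem("web-server", fully_patched=True),
    SoftwareItem("legacy-lab-app", fully_patched=False,
                 compensating_controls="isolated VLAN, no internet access",
                 in_risk_registry=True),
]
for reason in deployment_blockers(inventory):
    print(reason)  # prints the one remaining gap: no fixed correction date
```

Returning every gap, rather than stopping at the first, gives the system owner a complete list to work through before redeployment.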
Learning from an incident is critical to help prevent others from occurring in future. The purpose of this phase is to complete any documentation still outstanding and use the collected report to find places where improvements to controls and processes can help minimize the risk of future similar incidents.
- Ensure the complete timeline for the incident has been documented and any evidence that may be needed for future use is safely stored. Retention for this is typically no more than 24 months, but the CSIRT or FIPP office can confirm if the timeframe is different in this case.
- Schedule one or more lessons learned sessions to collect feedback about the incident. The session should cover the following information:
- When was the problem first detected, and by whom?
- What was the scope of the incident?
- How was the incident contained and eradicated?
- What were the actions taken to return the systems to production?
- What can be improved?
- Based on the information collected from the lessons learned session(s), any opportunities to improve should be acted on to reduce the risk of another similar incident and improve the incident management process. Specific things that should be considered are:
- What improvements can be made to system management?
- Are there access controls that can improve security?
- Are there any architectural changes that can minimize the amount of data at risk?
- How are vulnerabilities identified and patches applied?
- What worked well or poorly in the incident response process?
Security Incident Response Plan
The University has recently published its Security Incident Response Plan (Incident Security Response Plan | Information Security and Enterprise Architecture, utoronto.ca).
Terms used in this SOP:
A Computer Security Incident Response Team (CSIRT) is an institutional entity responsible for coordinating and supporting a computer security incident response. It comprises a mixture of technical and business staff from the University and the affected unit.
The Freedom of Information and Protection of Privacy (FIPP) Office supports the protection of personal privacy and access to University records in support of transparency and accountability. They are aware of the University’s legal requirements for disclosure regarding privacy and other legislation the University is subject to.
A hypervisor (or virtual machine monitor, VMM, virtualizer) is computer software, firmware or hardware that creates and runs virtual machines. (https://en.wikipedia.org/wiki/Hypervisor).
University Risk Registry:
A listing of known Information technology risks at the University of Toronto with ownership and timelines for reducing or eliminating those risks identified.