Updates on known issues affecting the ePortfolio
A failure with the networking infrastructure at our hosting provider (specifically, the Microsoft Azure UKSouth region) resulted in the NHS ePortfolios website at http://www.nhseportfolios.org becoming unavailable/unresponsive to all users on three occasions on Thursday the 20th July 2017:
- At 21:49 to 21:55 (6 minutes)
- At 22:05 to 22:10 (5 minutes)
- At 22:29 to 00:58 (149 minutes)
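The durations listed above can be checked with a short calculation. The following is a minimal Python sketch and not part of the original incident report: the function name outage_minutes is hypothetical, and it assumes that an end time earlier than the start time falls on the following day (as with the third window, which ran past midnight).

```python
from datetime import datetime, timedelta


def outage_minutes(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM clock times.

    Assumption: if the end time is earlier than the start time,
    the window crossed midnight and ended on the next day.
    """
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds() // 60)


# The three outage windows reported for 20 July 2017.
windows = [("21:49", "21:55"), ("22:05", "22:10"), ("22:29", "00:58")]
durations = [outage_minutes(s, e) for s, e in windows]
total = sum(durations)

print(durations, total)  # [6, 5, 149] 160
```

Taken together, the three windows amount to 160 minutes of unavailability.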
Following identification of the cause of the outage, updates were provided via the @NHSePortfolios Twitter account between 23:13 and 01:05.
No data was damaged / compromised as a result of this incident.
Update, 25 July 2017: Root Cause Analysis provided by hosting provider:
RCA – Network Infrastructure – UK South
Summary of impact: Between July 20, 2017 21:41 UTC and July 21, 2017 1:40 UTC, a subset of customers may have encountered connectivity failures for their resources deployed in the UK South region. Customers would have experienced errors or timeouts while accessing their resources. Upon investigation, the Azure Load Balancing team found that the data plane for one of the instances of Azure Load Balancing service in UK South region was down. A single instance of Azure Load Balancing service has multiple instances of data plane. It was noticed that all data plane instances went down in quick succession and failed repeatedly whilst trying to self-recover. The team immediately started working on the mitigation to fail over from the offending Azure Load Balancing instance to another instance of Azure Load Balancing service. This failover process was delayed due to the fact that VIP address of Azure authentication service used to secure access to any Azure production service in that region was also being served by the Azure Load Balancing service instance that went down. The Engineering teams resolved the access issue and then recovered the impacted Azure Load Balancing service instance by failing over the impacted customers to another instance of Azure Load Balancing service. The dependent services recovered gradually once the underlying load balancing service instance was recovered. Full recovery by all of the affected services was confirmed by 01:40 UTC on 21 July 2017.
Workaround: Customers who had deployed their services across multiple regions could fail out of UK South region.
Root cause and mitigation: The issue occurred when one of the instances of Azure Load Balancing service went down in the UK South region. The root cause of the issue was a bug in the Azure Load Balancing service. The issue was exposed due to a specific combination of configurations on this load balancing instance combined with a deployment specification that caused the data plane of the load balancing service to crash. There are multiple instances of data plane in a particular instance of Azure Load Balancing Service. However, due to this bug, the crash cascaded through multiple instances. The issue was recovered by failing over from the specific load balancing instance to another load balancing instance. The software bug was not detected in deployments in prior regions because it only manifested under specific combinations of the configuration in Azure Load Balancing services. The combination of configurations that exposed this bug was addressed by recovering the Azure Load Balancing service instance.
Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, we will:
1. Roll out a fix to the bug which caused Azure Load Balancing instance data plane to crash. In the interim a temporary mitigation has been applied to prevent this bug from resurfacing in any other region.
2. Improve test coverage for the specific combination of configuration that exposed the bug.
3. Address operational issues for Azure Authentication services break-glass scenarios.
UPDATE: 9th November 2016, 8pm. We have now received notification from e-LfH that the OpenID endpoint has been restored and as such users will once again be able to establish links between their NHS ePortfolios user account and e-LfH user account.
The NHS ePortfolios team are aware that users will presently encounter issues when attempting to establish a new link between an NHS ePortfolios account and an e-LfH user account to enable the exchange of Learning Activity data.
When attempting to establish the links, users currently see an error message within NHS ePortfolios stating “No OpenID endpoint found”.
This issue has arisen following an application upgrade at e-LfH at the end of last week and the e-LfH team are working to resolve this as quickly as they can.
If you have already established a link between your NHS ePortfolios account and your e-LfH user account, the ongoing exchange of Learning Activity data will not be affected by this issue.
On Friday the 8th May 2015, for three minutes from 2:07pm and again for nine minutes from 2:19pm to 2:28pm, a failure of equipment at our hosting provider resulted in the NHS ePortfolios website at http://www.nhseportfolios.org becoming unavailable/unresponsive to some users.
During this time, users would have encountered disruption / errors as our systems transferred users away from failing devices to other, still functional, devices.
At 2:24pm, for four minutes, the NHS ePortfolios website was online, but user requests were being serviced by only one of three devices that are normally responsible for this task. At the time, around 600 users were making approximately 2,500 requests per minute. Response times from a single device would have resulted in an unacceptable user experience at this time.
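To illustrate why a single device was insufficient, the figures above can be turned into a rough per-device load estimate. This is a minimal sketch only: it assumes requests are normally spread evenly across the three devices, which the incident notes do not state, and the names used are hypothetical.

```python
# Figures quoted in the incident description.
REQUESTS_PER_MINUTE = 2500
ACTIVE_USERS = 600
NORMAL_DEVICES = 3


def per_device_load(total_requests: float, devices: int) -> float:
    """Requests per minute handled by each device.

    Assumption: load is shared evenly between devices.
    """
    return total_requests / devices


normal_load = per_device_load(REQUESTS_PER_MINUTE, NORMAL_DEVICES)
degraded_load = per_device_load(REQUESTS_PER_MINUTE, 1)
load_ratio = degraded_load / normal_load

print(f"Normal load per device:   {normal_load:.0f} requests/minute")
print(f"Load on a single device:  {degraded_load:.0f} requests/minute")
print(f"Increase over normal:     {load_ratio:.1f}x")
```

Under that assumption, each device normally handles roughly 833 requests per minute, so the one remaining device was carrying about three times its usual share.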
Intermittent outages affecting a single device continued to be experienced until 3:23pm.
Shortly after midnight, a configuration change was performed by our hosting provider and the disruption has not recurred since.
It has come to our attention that a number of users, instead of clicking a bookmark to the site or typing https://www.nhseportfolios.org into the address bar of their browser, access our site by searching for terms such as “NHS ePortfolio” in search engines such as Google or Bing, and then clicking links in the search results page these sites choose to provide. When navigating to the site in this way, the results page provided by these third party sites can often include links to alternate instances of our main site in which users' login credentials do not work, but which look very similar to our main site, causing confusion/frustration to users. We use these sites to support the ePortfolio: they are used for training, pre-release testing and many other important processes. (Please note, the search results vary on a per user basis and are not consistent in their content).
This situation arose previously, in May/June 2013, when Google substantially changed the algorithm used to prioritise search results. These alternate sites can be easily identified as they have different addresses (e.g. http://qa.nhseportfolios.org) which will appear in the address bar of the browser when accessing the site and because all but one of the alternate sites now contain the following information message on the homepage: “This is not the main NHS ePortfolio site – you may have reached this site by mistake. You are viewing xxx.nhseportfolios.org, were you looking for www.nhseportfolios.org?” Whilst the contents of the results pages in the search engines are not controlled by NES, we have contacted the two major global search engines – Google and Bing – to request the removal of links to alternate instances of our site to reduce this possible source of confusion.
We are currently investigating the options available to us to ensure that the non-inclusion of alternative sites is made permanent. In the interim, please ensure that you and your trainees are visiting the main site at the following address: https://www.nhseportfolios.org. If you continue to experience login issues, please first confirm that you are visiting https://www.nhseportfolios.org and, if problems persist, please provide a full-screen screen grab where possible to assist with identifying the issue.
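One commonly used option for keeping alternate instances out of search results is to send an "X-Robots-Tag: noindex" response header from every host other than the main one; Google and Bing both document support for this header. The sketch below is illustrative only and is not a description of what NES adopted: the function name robots_headers and the constant CANONICAL_HOST are hypothetical, and how the header would be attached to responses depends on the web server in use.

```python
# Assumption: www.nhseportfolios.org is the only host that should be indexed.
CANONICAL_HOST = "www.nhseportfolios.org"


def robots_headers(host: str) -> dict:
    """Return extra response headers for a request to the given host.

    The main site gets no extra headers. Any other host (for example a
    training or QA instance) is marked as not to be indexed.
    """
    # Normalise case and drop any port suffix such as ":443".
    hostname = host.lower().split(":")[0]
    if hostname == CANONICAL_HOST:
        return {}
    return {"X-Robots-Tag": "noindex, nofollow"}


print(robots_headers("www.nhseportfolios.org"))
print(robots_headers("qa.nhseportfolios.org"))
```

A header-based approach applies to every page on the alternate host without editing individual pages, although search engines only act on it once they next crawl the site.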
At 18:18 on Tuesday the 28th January 2014, a failure of equipment at our hosting provider resulted in the NHS ePortfolio website at http://www.nhseportfolios.org becoming unavailable to all users. Visitors to the site received only an error page (a 502 error) with no indication as to why the site was unavailable or how long it would take to recover.
3 hours and 3 minutes later, at 21:21, access to the site was restored and users were able to successfully login once more. Users of our mobile app in offline mode were able to continue to create ticket requests and reflection forms within the app during this period but were unable to synchronise these with the website until after the site returned at 21:21.
Whilst all users were able to login as of 21:21, some users may have experienced delays in receipt of email messages from the site and would have received error messages onscreen when attempting to access files in their personal library whilst we restored all services. All services were restored to fully operational status by 22:55 and no email messages remain unsent at the time of writing (01:32, Wednesday the 29th January 2014).
During the period of downtime, we were unfortunately unable to post a message at http://www.nhseportfolios.org indicating that the site was offline or providing an ETA for the resumption of service. We were, however, able to answer requests received from users by email to firstname.lastname@example.org (2 users) and via the @neseportfolio Twitter account (7 users).
A full investigation into the failure will be performed in order that we can determine how this situation can be avoided in the future and as part of this, we will investigate options to allow us to provide appropriate user feedback should a similar situation recur.
The UKFPO would like to apologise to users of the NES e-portfolio product for the recent CS access issue. As per the Reference Guide 2010 and 2012, clinical supervisor access to the portfolio is for the “period of supervision and for a period of three months following the end of the placement. Read only access is indefinite”.
There was a typographic error within the e-portfolio specification issued to NES which inadvertently caused this error. We sincerely apologise for any inconvenience caused. We understand that NES has now resolved this issue.
UK Foundation Programme Office
Regus House, Falcon Drive,
Cardiff, CF10 4RU