Hello, everyone! Because of the quarantine in Russia, demand for government services in electronic form has grown, and not every portal that provides such services has turned out to be ready. We decided to share our experience of testing a public services portal and helping to prepare it for peak load.

In Russia, public services are delivered through a two-level model: the federal portal gosuslugi.ru and regional portals that provide local services (most often social assistance and the services of local state and municipal institutions, such as schools and medical facilities).

A popular portal: the President of the Russian Federation has announced support payments for each child (there are 27 million children in Russia), and parents must apply for them on the federal public services portal gosuslugi.ru. You can imagine what happened when the parents of all these children rushed to fill out applications on the portal at the same time!

Watching all this, we decided to recall the experience of our ION DV.Framework team [iondv.com] in testing and participating in the preparation of a regional (as at the state or province level) portal providing the ‘School admission’ service.

How did we pass this challenge, and what were we preparing for? Read more about our experience below, which will give you an opportunity to look at the issue of public service portals during peak periods.

Let’s start by getting to know each other a little. We are a team of IONDV developers and analysts with more than six years of experience on projects that we build on our own platform (read about it in another article). Several teams took part in preparing for the peak school admission period: the data center operator, the operator and the developer of the regional portal information system, the operator and the developer of the integrated education information system, and technical support for users of the regional portal. Our IONDV team handled the last of these: we supported users and independently monitored the health of the portal.

Our role in the preparation was to organize load testing, recommend caching settings for the Nginx instance in front of the information system, and prepare instructions for users with the recommended ‘behavior.’

But Why Is This Such a Big Deal?

(For those who don’t know about the problem of school admission in Russia)

School education in Russia is free for all children. Children are distributed to schools according to the place of official registration of residence of parents (and children). But not all schools provide the same quality of education. In many large cities in Russia, there are several municipal schools that are difficult to enroll in because of too many applicants.

The reasons vary. Some schools consistently produce children with high scores on the final exam (similar to the SAT); others follow special educational programs or teach skills that parents value highly (e.g., English or a focus on math), and some are simply known for a good primary school teacher. In other cases, a school is small while the number of houses assigned to it is large, and the next nearest school is across a large highway or a half-hour walk away for a child.

For a child to be enrolled in such a school, he or she must be registered in one of the houses assigned to it. If a family lives in a house that is not assigned to the desired school, this does not always stop the parents: they are ready to pay to register in someone else’s housing (a rental on paper, without actually living there) or even to buy an apartment at the right address and register there. The excitement is therefore also fueled by the time and money already spent.

In many regions, school admission is organized in electronic form in parallel with paper applications, but with some difference in timing. For example, in Khabarovsk this year, electronic applications opened at 00:00 on January 26, and schools started accepting paper applications at 10:00 a.m. Obviously, applying in person means being at the end of the line. Last year, both started simultaneously at 10:00 a.m., but this did not solve the problem of queues: parents still lined up, fearing that the portal would go down or something would break.

Traditionally, enrolling in person meant round-the-clock shifts and queuing. Sometimes this even provoked fights between parents in the queues. The prospect of many hours or even days of waiting at the school doorstep in winter, at 30 degrees below zero, makes applying online an essential option.

Popular Service Resource As a Technical Task

The problem with services like ‘school admission’ is that the load profile (or query probability) tends toward a delta function (δ-function, Dirac function), which is clearly visible on the charts as a peak. At this point, the number of requests multiplies within a short period.

Query Statistics Example

Our experience shows that the main task of preparation is not so much to add resources as to minimize the potential number of requests per second, stretch them over a certain period, and prepare the system for the remaining load. At the same time, it is necessary to find and eliminate bottlenecks, which gives the greatest effect according to the theory of constraints (Goldratt’s principles). Otherwise, this very bottleneck is what will break. The whole system should be paced by it, in line with Goldratt’s measures of throughput, operating expense, and inventory.

It is physically impossible to pour all the sand of a 10-minute hourglass through in 1 minute; obviously, it would break. The same applies to the provision of services. No one is surprised by a queue, or even quarrels, in front of a service center at peak hour, but everyone wonders why the portal went down.

There are different models of load handling behavior from Queuing theory:

  • You can put users on hold, i.e. increase the queue;

  • You can simply refuse to service those who came later, for example, until the previous ones are processed;

  • You can try to increase productivity indefinitely.

Expediency is somewhere in between. After all, for a service with a limited resource provided by regional or state government, not only speed is important, but also, first of all, the preservation of social justice – equal conditions for everyone. At the same time, if a user doesn’t get what they need, they initiate a new request. In this case, requests grow like an avalanche, forming the ‘dog-pile effect’ attack model (cache stampede, hit-miss storm). It happens when a user has already canceled a request and initiated a new one, while the previous one is still in the queue for processing.
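The dog-pile effect can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model of the real portal: the function name, the tick-based server, and all numbers are hypothetical. Every user sends one request at the start; an impatient user who has waited `patienceTicks` cancels and re-sends, while the old request stays in the queue and still consumes server capacity.

```javascript
// Toy model: a FIFO server that handles `capacityPerTick` requests per tick.
// A user counts as served only if the processed request is their latest one.
function simulateRush({ users, capacityPerTick, patienceTicks, horizonTicks }) {
  const queue = [];                                // FIFO of { user, attempt }
  const latestAttempt = new Array(users).fill(0);
  const lastSentAt = new Array(users).fill(0);
  const served = new Set();
  let wasted = 0;                                  // requests processed for nothing

  for (let user = 0; user < users; user++) queue.push({ user, attempt: 0 });

  for (let tick = 0; tick < horizonTicks; tick++) {
    // Impatient users cancel and re-send; the stale request stays queued.
    for (let user = 0; user < users; user++) {
      if (!served.has(user) && tick - lastSentAt[user] >= patienceTicks) {
        latestAttempt[user] += 1;
        lastSentAt[user] = tick;
        queue.push({ user, attempt: latestAttempt[user] });
      }
    }
    // The server works through the queue in arrival order.
    for (let i = 0; i < capacityPerTick && queue.length > 0; i++) {
      const request = queue.shift();
      const isLatest = request.attempt === latestAttempt[request.user];
      if (!served.has(request.user) && isLatest) served.add(request.user);
      else wasted += 1;
    }
  }
  return { served: served.size, wasted, backlog: queue.length };
}

const base = { users: 100, capacityPerTick: 10, horizonTicks: 20 };
const patient = simulateRush({ ...base, patienceTicks: Infinity });
const impatient = simulateRush({ ...base, patienceTicks: 3 });
console.log({ patient, impatient });
```

In this toy setup, patient users are all served within the horizon, while impatient ones keep the server busy with stale requests and the backlog grows instead of shrinking.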

This is complicated by the fact that whole families take part in the submission: moms and dads fill out applications at the same time and often submit them several times to be safe, frequently from several tabs and several browsers. It therefore usually makes sense to multiply the expected peak load by 2-3 relative to the number of people who actually apply.

The Organization of the Service Provision

We calculated the expected number of applicants from the total number of applications submitted last year combined with minute-by-minute data from other regions. The peak of applications usually comes 5-10 minutes after submission opens. This is partly because portals barely respond during the first three to five minutes, and after that users need 1 to 5 minutes to fill out the form (do not be surprised: many fill it out from a phone even in such ‘nervous’ conditions).

The approximate calculation model for a nominal 1,000 applications per hour is as follows:

  • The peak falls between 5 and 10 minutes after the start, and, following the Pareto rule, 80% of applications are submitted during it;

  • We therefore plan for approximately 160 applications per minute, or about 3 applications per second.
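The arithmetic behind this estimate can be sketched as follows. The helper is illustrative only: the function name and parameter names are assumptions, the 5-minute window is the length of the 5-to-10-minute peak, and `duplicateFactor` stands for the 2-3x multiplier for repeated family submissions mentioned earlier.

```javascript
// Rough peak estimate: a share of all applications lands in a short window.
function estimatePeak({ applicationsPerHour, peakShare, peakWindowMinutes, duplicateFactor = 1 }) {
  const inPeak = applicationsPerHour * peakShare * duplicateFactor;
  const perMinute = inPeak / peakWindowMinutes;
  return { perMinute, perSecond: perMinute / 60 };
}

// 1000 applications per hour, 80% of them within a 5-minute peak.
const plan = estimatePeak({ applicationsPerHour: 1000, peakShare: 0.8, peakWindowMinutes: 5 });
console.log(plan.perMinute, plan.perSecond.toFixed(1)); // 160 per minute, about 2.7 per second

// The same peak if every family submits three times.
const worstCase = estimatePeak({ applicationsPerHour: 1000, peakShare: 0.8, peakWindowMinutes: 5, duplicateFactor: 3 });
console.log(worstCase.perMinute, worstCase.perSecond);
```

Rounding 2.7 up gives the planning figure of 3 applications per second.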

In fact, the first submission came one minute and 45 seconds after the start, and the peak began at the fourth minute. To reduce the load on the authentication system and on session creation in the portal, our instructions suggested that users log in in advance, and the session lifetime was extended. As a result, 50% of users were logged in an hour before submission opened, and about 90% half an hour before. We had previously encountered a situation where users started accessing the portal 10 minutes before the start of the service, and authorization became unstable. It’s hard to say why. Perhaps the reason is that technical work is carried out at night in Moscow, when the working day in Khabarovsk is just beginning.

A Digression Into Instructions and Organizational Measures

Banners with a direct link to the form and instructions were placed on the main page.

All direct links to the form were spelled out in the manual itself in order to avoid the use of ‘resource-intensive resources’ of search and directories with descriptions. In other words, users were organizationally routed directly to the form. At the same time, we clarified all controversial and complex issues in order to reduce the number of requests for technical support.
It is impossible to remove the ‘delta function’ when the form is reloaded at 00:00. The whole point of this procedure is that the service appears at the specified time. But it’s possible to try to reduce the number of browser requests on all the expected routes of users and thus leave the load on the system only from the necessary ones: the form, dynamic directories, and sending a statement.

The Nginx settings themselves are fairly standard. What matters more is choosing limits that the system can withstand. You can tune them to start queuing requests when the server is expected to reach the limit of its capabilities, or to start rejecting requests when overload is guaranteed.

And of course the basics: it is necessary to force caching (proxy_cache) and increase the lifetime of data (expires) in Nginx for all paths of static resources and, where possible, for dynamic pages that carry no session. A common caching mistake is writing to the cache data (sometimes even static files) in which someone else’s session is saved. The way out is to strip these cookies from the headers in Nginx if the server itself cannot separate these types of data.
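One way to catch the session-in-cache mistake before the peak is to scan the responses recorded during a test run. The sketch below is an assumption-laden illustration, not the check the team used: the function name, the shape of the `responses` array, the example URLs, and the `/static/` prefix are all hypothetical. It flags responses on cached paths that still carry a Set-Cookie header.

```javascript
// Returns URLs on cached paths whose responses still set a cookie.
function findCookieLeaks(responses, cachedPathPrefixes) {
  return responses
    .filter(({ url, headers }) => {
      const path = new URL(url).pathname;
      const isCachedPath = cachedPathPrefixes.some((prefix) => path.startsWith(prefix));
      // Header names are case-insensitive, so compare in lower case.
      const hasSetCookie = Object.keys(headers).some((name) => name.toLowerCase() === 'set-cookie');
      return isCachedPath && hasSetCookie;
    })
    .map(({ url }) => url);
}

// Hypothetical capture of three responses.
const captured = [
  { url: 'https://portal.example/static/app.css', headers: { 'Content-Type': 'text/css', 'Set-Cookie': 'JSESSIONID=abc' } },
  { url: 'https://portal.example/static/logo.png', headers: { 'content-type': 'image/png' } },
  { url: 'https://portal.example/lk/profile', headers: { 'set-cookie': 'JSESSIONID=def' } },
];
const leaks = findCookieLeaks(captured, ['/static/']);
console.log(leaks);
```

Only the stylesheet is reported here: the profile page sets a cookie too, but it is not on a cached path.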

In the user’s browser, this looks like updating pages from files downloaded from disk or memory. But even when the user gets them from the server — they are taken from the Nginx cache. The directories themselves are of course cached in the system itself.

Directories Cached in the System

This reduced the number of potential requests from 89 to 14 and the volume from 2.1 MB (for 1000 users who refresh the page, this is a potential peak of 4-8Gb/s) to 38 KB (we all remember about webpack, but for enterprise platforms this is not always easy to do). Based on this year’s results, it would be advisable to cache in Nginx, and not only in the system, part of the directories from the form and the dynamic classifiers not used at the peak moment, and to force their lifetime. As the load grows, it makes sense to replace the main page with a completely static page that routes users to the desired service, or to set up a separate resource for the service.

To reduce the load from submissions, two features were disabled: drafts of previously entered data and automatic filling of the child’s data from the user profile. Users enter data at different speeds, so nobody opens a form that is already complete and only needs a click on the ‘send’ button. This avoids a delta function in application submissions, with the entire 1000 arriving in one minute. Social justice is maintained, although, of course, complaints appear.
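The flattening effect can be shown with a small calculation. This is a hedged sketch: the function name is hypothetical, and the fill times are an assumed even spread over the 1 to 5 minutes mentioned earlier, not measured data.

```javascript
// Counts submissions per minute and returns the busiest minute.
// `fillSeconds(i, n)` gives the time user i needs before pressing 'send'.
function peakSubmissionsPerMinute(users, fillSeconds) {
  const perMinute = new Map();
  for (let i = 0; i < users; i++) {
    const minute = Math.floor(fillSeconds(i, users) / 60);
    perMinute.set(minute, (perMinute.get(minute) || 0) + 1);
  }
  return Math.max(...perMinute.values());
}

// Prefilled form: everyone only clicks 'send' a few seconds after opening.
const prefilledPeak = peakSubmissionsPerMinute(1000, () => 5);

// Manual entry: fill times spread evenly between 60 and 299 seconds.
const manualPeak = peakSubmissionsPerMinute(1000, (i, n) => 60 + Math.floor((i * 240) / n));

console.log(prefilledPeak, manualPeak); // 1000 versus 250 in the busiest minute
```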

We will not describe the optimization of the system itself — during load testing, bottlenecks were identified — mainly in DBMS queries; indexes and queries themselves were optimized.

Perhaps the most important optimization is to simplify the form of electronic services. What affects the speed most when implementing in a form?

  • File loading. Uploads significantly increase the load on the channel and on the system, especially when large scans are uploaded. The math is simple: a typical smartphone photo now takes 5-10 MB (hello to owners of new iPhones, which simply do not support a low camera resolution), so 5 documents from one user amount to about 375 Mbit of traffic (1 byte is approximately equal to 10 bits in traffic, although with application/x-www-form-urlencoded file encoding it is about 20 bits), and 100 users per minute need 625 Mbit/s. In regions where leased channels to data centers rarely exceed 100 Mbit/s, this may come as a surprise, as denials of service due to timeouts will begin. Users will get nervous and refresh the form, which leads to the ‘dog-pile effect.’ The first question that comes up: why do we need these files? If the original documents still have to be brought in person, the scanned copies can often be omitted. And what legal significance do these copies have at all?

  • Complex directories. The load is usually increased by the FIAS or KLADR address directories. The problem is their size: FIAS takes up to 40 GB in the database, and searching it takes time. It is only tenths of a second, but multiplied by 1000 simultaneous requests it will load any system. Without special preparation, possibly a separate web service on a separate resource, such a load is difficult to withstand. Therefore, a simple text field is often used for the address.
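The file upload arithmetic from the first item can be reproduced with a short sketch. The function and parameter names are illustrative, and 7.5 MB per document is an assumption (the midpoint of the 5-10 MB range), chosen because it matches the 375 Mbit figure.

```javascript
// Rule of thumb from the text: one byte costs about 10 bits on the wire.
const BITS_PER_BYTE_ON_WIRE = 10;

function uploadLoad({ documents, megabytesPerDocument, usersPerMinute }) {
  const megabitsPerUser = documents * megabytesPerDocument * BITS_PER_BYTE_ON_WIRE;
  const megabitsPerSecond = (megabitsPerUser * usersPerMinute) / 60;
  return { megabitsPerUser, megabitsPerSecond };
}

const load = uploadLoad({ documents: 5, megabytesPerDocument: 7.5, usersPerMinute: 100 });
console.log(load.megabitsPerUser, load.megabitsPerSecond); // 375 Mbit per user, 625 Mbit/s

// How many times this exceeds a 100 Mbit/s leased channel.
const overloadFactor = load.megabitsPerSecond / 100;
console.log(overloadFactor);
```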

Well, let’s move on to the tests themselves.

Load Testing in Preparation

We ran the tests with puppeteer, emulating user actions in the Chromium browser. Yandex.Tank and JMeter trip the attack protection because they generate many requests of the same type. Besides, such tests do not match the profile of real requests when the system’s behavior changes under load. In addition, servers cache requests, and some processes are difficult to reproduce with these tools (for example, authorization).

To begin with, we compiled a user behavior profile and divided the procedure into the key stages:

  1. Mass authorization in the Unified identification and authentication system.
  2. One-time update of the service form.
  3. Mass submission of applications.

For each of the stages we did a separate test.

Testing authorization is a difficult task: it is an external federal system, so protection against attacks is triggered and authorization gets banned. In addition, there is captcha protection.

However, it is possible to create a test profile to test precisely the bottlenecks of the system being tested – usually, this is the number of simultaneously authorized sessions and planned authorization values per minute, which can be regulated by recommendations.

In the test itself, a wrapper for running a task in multiple threads is important. We use ‘puppeteer-cluster’. The harder part, however, is usually handling exceptions and the changes in the portal’s behavior under load: layout elements often show up twice, or do not appear at all if some data failed to load as expected. These are all errors that users will see under load, and they will refresh the page, creating additional load. There are two ways out: implement exception handling in the test or modify the portal.

The test itself is simple. Below is a fragment from clicking on the ‘Login’ button on the services portal to entering data into the Unified identification and authentication system.

// Wait until the portal shows that authorization is available
await page.waitForSelector(AUTH_AVAIL, {timeout: OPT_ELEM_WAIT_TIME});
const needAuth = await page.$(ELEM_AUTH_IN);
if (!needAuth) throw new Error('Login element not found');

// Click 'Login' and wait for the redirect to the authentication system
await page.waitForSelector(AUTH_BUT, OPT_ELEMENT_VISIBLE);
await page.click(AUTH_BUT);
await waitNewUrl(page, 'https://esia.gosuslugi.ru/idp/rlogin?cc=bp', OPT_PAGE_WAIT_TIME);
await page.waitForSelector('#mobileOrEmail', OPT_ELEMENT_VISIBLE);

// The login page may remember a previous user; read the remembered login
const REMEMBERED_USER = '#authnFrm > div.login-slils-box > div > div.detected > div.left > div.this-user';
let text = await elemGetText(page, REMEMBERED_USER);
// Strip spaces, dashes, and parentheses (presumably the intent of the original pattern)
if (text) text = text.replace(/[ \-()]/g, '');
if (text && text.indexOf(user) === -1) {
  // A different user is remembered: switch to entering another login
  await page.click('div.click-to-another > a');
  await page.waitForSelector(REMEMBERED_USER, OPT_ELEMENT_INVISIBLE);
}

// Enter the credentials and submit
await page.waitForSelector('#password', OPT_ELEMENT_VISIBLE);
await page.type('#mobileOrEmail', user);
await page.type('#password', pwd);
await page.click('#loginByPwdButton');

The second test checks that the application form refreshes while users are waiting for enrollment to open. The refresh test is essentially a one-step process, but it is important to check the types of errors that are returned (a network problem, an nginx error, a server error) and whether the form meets the criteria. The difficulty is to generate the maximum number of requests in the shortest time without tripping the protection limits (they can be changed for the duration of the tests; on the other hand, the test also checks the settings of the network and server infrastructure, as well as the WAF).
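Sorting refresh results into these error types can be sketched as a small pure function. Everything here is an assumption for illustration: the function names, the result shape, the idea that 502/503/504 come from the proxy layer while other 5xx codes come from the backend, and the `id="createForm"` marker used as the form criterion. A real test would adapt the mapping to the portal’s actual error pages.

```javascript
// Status codes treated as proxy-level (nginx) failures; an assumed mapping.
const GATEWAY_STATUSES = new Set([502, 503, 504]);

function classifyRefresh({ error = null, status = null, html = '' }, formMarker) {
  if (error || status === null) return 'network';    // no HTTP response at all
  if (GATEWAY_STATUSES.has(status)) return 'nginx';  // the proxy answered instead of the backend
  if (status >= 500) return 'server';                // the backend returned its own error
  if (status >= 400) return 'client';
  return html.includes(formMarker) ? 'ok' : 'form';  // loaded, but is it the expected form?
}

// Counts results per category for a whole test run.
function tally(results, formMarker) {
  const counts = {};
  for (const result of results) {
    const kind = classifyRefresh(result, formMarker);
    counts[kind] = (counts[kind] || 0) + 1;
  }
  return counts;
}

// Hypothetical results of five refresh attempts.
const summary = tally([
  { status: 200, html: '<form id="createForm"></form>' },
  { status: 200, html: '<h1>Please try again later</h1>' },
  { status: 502 },
  { status: 500 },
  { error: new Error('net::ERR_CONNECTION_RESET') },
], 'id="createForm"');
console.log(summary);
```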

Such tests on puppeteer require a lot of resources to work with. De facto, it turned out that you need at least 2 cores against the 1 core of the frontend subsystem and a very wide channel. But if you rent them in the cloud, this is quite affordable. We used Yandex.Cloud.

In the test, authorization in the Unified identification and authentication system is first performed for each thread separately. After that, a separate browser is launched for each thread, and the specified number of refreshes is carried out within one instance. Then the instance is restarted. The check itself can include a typical path, such as the main page or the service form. But more often it is enough to fully refresh the service form and check the directory that shows the application can be submitted, just as in the instructions for users.

Step 1

A fragment of the test for opening the main page and updating the page with the form:

// Open the portal, restore the authorized session, and open the form
try {
  await page.setViewport(PUP_OPT);
  await page.goto(BASE_URL);
  await page.setCookie(...cookies[worker.id]);
  await page.goto(`${BASE_URL}/nd/lk/form/dnv.htm`);
  rdyRefresh++;
} catch (err) {
  console.error(`# Error opening the portal or the form ${data}: ${err.message}`);
  getErr++;
  await page.screenshot({path: filename});
}
// Refresh the form the remaining number of times
for (let i = 0; i < AMOUNT_REFRESH - 1; i++) {
  const filenameIter = path.join(BASE_DIR, PIC_DIR, `${data}-${i}.png`);
  try {
    await page.reload({waitUntil: ['networkidle0', 'domcontentloaded']});
    rdyRefresh++;
  } catch (err) {
    // Failures caused by the browser itself are not counted as portal errors
    if (!err.message.includes('Navigation failed because browser')) {
      console.error(`# Error refreshing the page ${data}-${i}: ${err.message}`);
      getErr++;
      await page.screenshot({path: filenameIter});
    }
  }
}

For the load of sending applications, the entire verification cycle was implemented – with the form reloading and checking the input of all data.

Fragment

for (let i = 0; i < AMOUNT_RESEND; i++) {
  const filename = path.join(BASE_DIR, PIC_DIR, `${data}-${i}.png`);
  // Open the form page
  try {
    await page.goto('https://uslugi27.ru/nd/lk/form/dnv.htm');
  } catch (err) {
    console.error(`# Error opening the first-grade form page ${data}-${i}: ${err.message}`);
    await page.screenshot({path: filename});
    getErr++;
    continue;
  }
  // Fill in the form fields
  try {
    const FORM_PREF = '#createForm > div:nth-child(4) > ';
    await clickDelayed(page, `${FORM_PREF}fieldset.petgroup.ungroupped-attrs > div > div:nth-child(4) > div.col-md-9.attr-data`);
    // <…>
    // 'ТестФамилия' is a test surname
    await page.type(`${FORM_PREF}fieldset:nth-child(2) > div > div:nth-child(1) > div.col-md-9.attr-data > input`, 'ТестФамилия');
    // <…>
  } catch (err) {
    console.error(`# Error filling in the form data ${data}-${i}: ${err.message}`);
    await page.screenshot({path: filename});
    continue;
  }
  // Go to the next step and submit the application
  try {
    await page.click('#createForm > div.col_100.controls > button.btn.btn-primary.pull-right.next');
    await clickDelayed(page, '#createForm > div:nth-child(5) > fieldset > div > div:nth-child(1) > div > div');
    await page.click('#createForm > div:nth-child(5) > fieldset > div > div:nth-child(2) > div > div');
    await page.click('#createForm > div.col_100.controls > button.btn.btn-success.pull-right.submit');
  } catch (err) {
    console.error(`# Error submitting the form ${data}-${i}: ${err.message}`);
    await page.screenshot({path: filename});
    sendErr++;
    continue;
  }
}

By the way, you can speed up the test if you enter the data not from puppeteer with the await page.type construct but move this logic into the browser itself. The difficulty of catching errors then increases, though. For example:

document.querySelector('#createForm > div:nth-child(4) > fieldset.petgroup.ungroupped-attrs > div > div:nth-child(4) > div.col-md-9.attr-data').click();
 document.querySelector('#createForm > div:nth-child(4) > fieldset:nth-child(2) > div > div:nth-child(1) > div.col-md-9.attr-data > input').value = 'ТестФамилия';

During the tests, we performed several thousand authorizations in the federal system and sent about 16 thousand applications. How the production education information system was restored after that many test applications, do not even ask. That is a completely different story.

The main visible result of this process was that local media were completely bored during the school admission days: no queues, no conflicts. The service dropped out of the news.

In parallel, we made a dashboard for monitoring the performance of the form based on Grafana: the number of applications, the number of calls, web analytics data, etc. But let’s leave this topic for the next time.


