COVID-19 – Francesco's Crumbs

What could go wrong by extrapolating a fit into the future? We’re talking about a very good fit, after all. Just to show how good, here’s the relative error (this graph includes the additional data point for March 21st).

Every day, the discrepancy between the data and the fitted line is no more than about 10%. That’s pretty good for something as trivially simple as drawing a parabola through the logarithm of a bunch of data. Now, let’s say that we had a reason to believe that the fit really represented the true dynamics of the infection (we don’t… but allow me, for the sake of the argument). Then what would be the reason for discrepancies with the data? Errors! Errors in the sense a natural scientist would use the word: any unforeseen event that makes you measure one number while reality is represented by a different one.

Finding and counting infected individuals is not easy. The incubation period is variable. Some infected people get tested while they are still asymptomatic and are counted early; others may delay alerting healthcare personnel and end up being counted later. Testing facilities may be overworked. The tests themselves may not be 100% reliable. Finally, the information may reach Protezione Civile too late to be counted for that day. For example, the number of newly infected individuals for March 10th is about 600 short; those cases were counted on the next day. I fixed that in the plotted data. But other errors are not fixable, because no one can quantify their magnitude with certainty.

But “errors”, if they really are such, have a nice characteristic: they’re independent of each other. They should appear as random numbers added to or subtracted from the real data, without showing any pattern, bias or trend. And that is how the relative error behaves in this case. I haven’t run any sophisticated statistical test, but, by just eyeballing the graph, I can’t see anything patently off. Sure, the largest errors occur at the beginning of the series, but that’s consistent with a system taken by surprise by the initial burst of the epidemic. Personnel had to be gathered, procedures had to be set in place, the entire public safety machine had to be set in motion. A 10% error in such a contingency is amazingly low. Kudos.
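As a rough illustration of the quantity discussed above, here is a minimal sketch of a relative error computation. It assumes the error is measured relative to the fitted value (the text does not say which normalization was used), and it runs on synthetic numbers, not on the Protezione Civile data. The function name `relative_error` and all the constants are hypothetical.

```python
import numpy as np

def relative_error(data, fit):
    """Signed relative error of the data with respect to the fitted values."""
    data = np.asarray(data, dtype=float)
    fit = np.asarray(fit, dtype=float)
    return (data - fit) / fit

# Synthetic example (NOT the Protezione Civile figures): a smooth curve
# perturbed by independent noise of at most 10%.
rng = np.random.default_rng(0)
days = np.arange(27)
fit = np.exp(5.0 + 0.35 * days - 0.005 * days**2)
data = fit * (1 + rng.uniform(-0.10, 0.10, size=days.size))

err = relative_error(data, fit)
print("largest |relative error|:", np.max(np.abs(err)))
print("mean relative error (bias):", np.mean(err))
```

A mean close to zero and the absence of long runs of same-sign errors are the kind of thing one would look for when eyeballing such a graph; neither is a substitute for a proper statistical test.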

So, why am I scared? Yes, the last three days aligned in an upward trend, but that still amounts to no more than a 5% error (on today’s fit…).

Well, here is the trouble: today’s fit peaks on March 31st. Yesterday’s fit peaked on March 30th. And three days ago the peak was forecast to occur on March 29th. In the last three days the date of the peak has noticeably shifted forward as new data piled up.

That’s bad, and gives me pause. Is the fit wrong? No. Is it a bad fit? No. But it wasn’t a bad fit two days ago, either. The relative errors changed little or nothing…

And here’s the hint that sheds light on the problem: could it be that fits which are not the best, but are still believable, may extrapolate to wildly different peaks?

Let’s then play this game: we keep assuming that the infection will have a parabolic trend (in log-scale), but we impose by hand the date of the maximum. Namely, I select all the parabolas that peak at a given day, and use only the remaining two free parameters to fit the data. And here’s the dramatic result.

The blue line is yesterday’s best fit, which peaks 35 days after day zero (February 24th). It still looks good as a fit today. By forcing the peak to occur on day 45, we obtain the orange line. That doesn’t look like a bad fit either! The green line has a peak on day 55. Even that is pretty darn close to the data. And so are the parabolas peaking at 65 and 75 days. They all look very close to each other in the region covered by data. But they are wildly different in the future: the last line peaks above two million infected people!
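The constrained fits described above can be sketched as follows. This is a minimal illustration, not necessarily how the figures were produced: it assumes the log-parabola is written as log I = h - k (t - t_peak)², so that with the peak day fixed only h and k remain free and the problem is linear least squares. The function name `fit_with_fixed_peak`, the curvature 0.006, and the data are hypothetical and synthetic.

```python
import numpy as np

def fit_with_fixed_peak(days, infected, t_peak):
    """Least-squares fit of log(infected) = h - k * (days - t_peak)**2.

    Only h (log of the peak height) and k (curvature) are free.
    """
    days = np.asarray(days, dtype=float)
    y = np.log(np.asarray(infected, dtype=float))
    # The model is linear in (h, k), so ordinary least squares is enough.
    X = np.column_stack([np.ones_like(days), -(days - t_peak) ** 2])
    (h, k), *_ = np.linalg.lstsq(X, y, rcond=None)
    return h, k

# Synthetic data (NOT real figures): an exact log-parabola peaking on day 35.
days = np.arange(27)
infected = np.exp(np.log(57000.0) - 0.006 * (days - 35.0) ** 2)

for t_peak in (35, 45, 55, 65, 75):
    h, k = fit_with_fixed_peak(days, infected, t_peak)
    model = np.exp(h - k * (days - t_peak) ** 2)
    worst = np.max(np.abs(infected - model) / model)
    print(f"peak on day {t_peak}: height {np.exp(h):,.0f}, "
          f"largest relative error {worst:.1%}")
```

Even on noise-free synthetic data, moving the imposed peak later raises the extrapolated height considerably while the mismatch inside the data window stays comparatively small, which is the effect described above.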

Here are the relative errors of those constrained fits.

Obviously the blue line has the best-looking errors: the smallest and the most evenly spread out. The other lines tend to have a negative bias at the beginning and the end of the data set, and a positive one in the middle. And yet, they are not that far off: nearly all data points are less than 20% off the fit. The bias is not surprising given that there’s no theoretical reason to expect the dynamics to follow exactly a parabola. We could only argue in favor of something whose slope decreases in time.

So, I’m baffled by two unexpected outcomes of this exercise. The first is that the data seem to be quite consistent with a simple parabolic fit (which I didn’t expect, because I saw no stringent reason for that, other than the little hand waving of the previous post). The second is that, among vastly different parabolic fits, the data seem terribly ineffective at ruling out alternative hypotheses (which I didn’t expect, because the best fit seemed so good).

But the conclusions are scary. We know little to nothing about the dynamics of this epidemic. There’s no reason for it to follow the parabolic trend lines that I am using as a fit. But the numbers can’t swing from one trend to another too quickly. The incubation period is about one week, and this suggests that hoping for a sudden stop of the growth of the infection is wishful thinking. Even if contagion were to stop now, we’d still be counting new cases for about a week. The current best fit forecasts a peak in 10 days. That’s about as quick a slow-down as there can possibly be. Given that the epidemic is still growing at a rampant rate (the last doubling took little more than 5 days) and that it’s impossible to keep everyone in isolation (for good reasons: essential services mustn’t be shut down), hoping for anything faster seems unrealistic to me.

If the data won’t follow the current line of best fit, all realistic alternative scenarios involve reaching the peak of the epidemic later in April, or even after that (and it would be a tragedy). The trend might remain parabolic, but the data may end up following a much higher curve than the current best fit. Or the parabolic trend may be broken, with data shifting to a different regime, which, nevertheless, won’t be one that slows down rapidly.

Thus, here we are, at the twelfth battle of Isonzo, with its outcome hanging in the balance. It can still be won, but it has to happen in the next few days. And yes, if Caporetto has to be, then later Vittorio Veneto will come. But the additional devastation will be staggering.

Ten days ago I started looking at the epidemiological data of the coronavirus infection in my country. The Italian government had just declared a national lockdown, following an escalation of measures (the first of which occurred on February 21st) that affected a progressively larger number of “red zones” in the North of the country, and culminated with emergency measures for the entire region of Lombardia, along with selected additional northern provinces. It was clear that those attempts at containing the epidemic had not been effective enough to preserve the rest of the country.

Data about the COVID-19 epidemic in Italy are released by the National Department of “Protezione Civile” and are available here. Similar sites reporting data for other countries like to post the total number of infected individuals. Italy’s Protezione Civile highlights the number of currently infected people. The difference is that the latter counts only those who are carriers of the virus at the date of reporting, while the former also includes healed and deceased persons. Looking at the number of currently infected individuals seems more sensible to me, because they are the ones who can spread the contagion. Therefore I decided to stick to that metric. Anyhow, Protezione Civile makes all the relevant figures available for download (look in the lower right pane of the same page).

Epidemics in their initial phase are supposed to grow exponentially. Exponential growth appears as a straight line in graphs whose vertical axis has a logarithmic scale. Here’s what I get when I plot the country-wide, daily number of currently infected individuals.

The epidemic appeared to have grown with a doubling time of 2.3 days during the first week of data availability, and to have slowed down to a doubling time of 3.4 days in the following week.
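For reference, a doubling time can be read off the slope of a straight-line fit in log scale: if log I grows with slope r per day, the doubling time is ln 2 / r. The sketch below assumes that this is the definition in use and runs on a synthetic series built to double every 2.3 days; the function name `doubling_time` and the starting value 200 are hypothetical.

```python
import numpy as np

def doubling_time(days, infected):
    """Doubling time (in days) from a straight-line fit to log(infected)."""
    slope, _intercept = np.polyfit(days, np.log(infected), 1)
    return np.log(2) / slope

# Synthetic series that doubles exactly every 2.3 days (not real data).
days = np.arange(7)
infected = 200 * 2 ** (days / 2.3)

print(round(doubling_time(days, infected), 2))  # 2.3
```

Fitting the two weeks separately, as described above, would simply mean calling the function once per week on the corresponding slice of the data.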

It is difficult to say what the cause of the slow-down may be. One may hope that the credit goes to the containment measures. But it is difficult to justify a sudden change in slope from one day to the next: containment measures were ramped up gradually over those two weeks. Even though both fits look good, the sudden change in slope might be a statistical artifact…

The simplest model for an epidemic is the following differential equation $$\dot{I} = r(t)I$$ where $I$ is the number of infected individuals, $\dot{I}$ is its rate of change, and $r(t)$ is the growth rate of the epidemic at time $t$. The model is exceptionally accurate… if you know what the growth rate is. (Bear with me: mathematicians, including applied ones, have befuddling ways to declare their own ignorance…) Figuring out the growth rate is the tricky part, and the reason why forecasting epidemics is even more uncertain than forecasting the next thunderstorm, stock market fluctuations, or the results of sport events.

But one can make hypotheses and attempt to draw conclusions from them. Here the obvious hypothesis is that the ramp-up of containment measures has gradually decreased the growth rate. As a consequence, the data representing the currently infected people should not appear as a straight line in our logarithmic plot, but as a curve whose slope decreases gradually. This notion was reinforced in the subsequent days, when the new data points on the graph systematically appeared below the green straight line. What could that curve be? If we stick to polynomials, the next in line after a straight line is a parabola. So, here’s my attempt at a parabolic fit. In the figure below I include all of the data available as of today.
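One way to see where a parabola could come from is to assume, purely for illustration, that the growth rate decreases linearly in time, $r(t) = r_0 - kt$, where $r_0 > 0$ and $k > 0$ are hypothetical constants and not measured quantities. Dividing the model equation by $I$ and integrating gives $$\frac{d}{dt}\ln I = \frac{\dot{I}}{I} = r_0 - kt \quad\Longrightarrow\quad \ln I(t) = \ln I(0) + r_0 t - \frac{k}{2}t^2,$$ which is a parabola in the logarithmic plot. Its maximum falls at $t^* = r_0/k$, the time at which the growth rate $r(t^*)$ vanishes. In the limit $k = 0$ the growth rate is constant, the plot is a straight line of slope $r_0$, and the doubling time is $\ln 2 / r_0$.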

Indeed, the red parabolic curve fits the whole dataset very well. The idea that the sudden change in slope was just a fluke now seems more substantiated. In fact, the fit is so good that one feels authorized to do what should never be done: extrapolate a fit well beyond the region covered by data, and use that trend as a forecast.

If you accept such a gamble, the parabolic fit forecasts that the maximum of the epidemics will be reached on March 30th, with a total of 57000 infected people. All in all, it doesn’t seem to be bad news! Ten more days and the worst will be behind us… Given the world situation, it would seem like a good deal.
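A minimal sketch of how such a forecast could be extracted from a parabolic fit in log scale: fit log I with a second-degree polynomial and read the peak off its vertex. This is an illustration under stated assumptions, not the code behind the figures. The data are synthetic, built as an exact log-parabola that peaks on day 35 at 57000 with an arbitrarily chosen curvature, and the function name `parabolic_peak` is hypothetical.

```python
import numpy as np

def parabolic_peak(days, infected):
    """Fit log(infected) with a parabola; return (peak_day, peak_value)."""
    a, b, c = np.polyfit(days, np.log(infected), 2)
    if a >= 0:
        raise ValueError("fitted parabola opens upward: no peak")
    t_peak = -b / (2 * a)               # vertex of a*t**2 + b*t + c
    log_peak = c - b**2 / (4 * a)       # value of the parabola at the vertex
    return t_peak, np.exp(log_peak)

# Synthetic data (NOT real figures): 27 days of an exact log-parabola.
days = np.arange(27)
infected = np.exp(np.log(57000.0) - 0.006 * (days - 35.0) ** 2)

t_peak, peak = parabolic_peak(days, infected)
print(f"forecast peak on day {t_peak:.1f} with about {peak:,.0f} infected")
```

On noise-free input the vertex is recovered exactly. With real, noisy data the vertex lies outside the fitted window, so small changes in the coefficients can move it noticeably.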

Up to two days ago I was quite optimistic, and I confess I had started believing in this little extrapolation. The last two data points and some further analysis scared me pale. Things are on the edge, forecasts are an uncertain business, and the data ain’t looking good. When I find the time for another post I’ll show what I mean.


COVID-19 in Italy: (II) the twelfth battle of Isonzo

COVID-19 in Italy: (I) an exercise in data fitting.