Research Paper: Burden and characteristics of COVID-19 in the United States during 2020

Hi everyone, we are happy to share our work on the burden and characteristics of COVID-19 in the US during 2020, which used SafeGraph POI data to inform changes in human mobility across county borders in 2020. Please find the paper published in Nature. :point_left:

What’s the overall burden and changing characteristics of COVID-19 in the US during 2020? In our modeling study, we used mathematical models and synthesized multiple datasets to answer this question.

Key questions:

  1. How many people were infected in 2020 in the US?
  2. What’s the prevalence of contagious people in the community over time?
  3. How has IFR evolved over the year?

We developed a data-driven transmission model at the county level informed by human mobility and calibrated this model to daily case counts in 3142 counties. Model calibration was cross-validated using out-of-sample seroprevalence adjusted for antibody waning. This model-inference system allows the estimation of county-level ascertainment rates, the community prevalence of infectious people and, when coupled with death data and the line-list data publicly available from the CDC, the IFR.
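To make the idea of a mobility-informed county model concrete, here is a minimal sketch. It is not the model from the paper: it is a toy deterministic SIR model for a handful of counties, and the function name `simulate_counties`, the mixing matrix, and every parameter value are assumptions made up for illustration. It only shows how a mobility matrix can couple counties and how an ascertainment rate turns infections into reported cases.

```python
import numpy as np

def simulate_counties(pop, mobility, beta, gamma, ascertainment, seed, days):
    """Toy deterministic SIR metapopulation model; all parameters are hypothetical."""
    pop = np.asarray(pop, dtype=float)
    mobility = np.asarray(mobility, dtype=float)
    # mix[i, j]: share of county i residents' contacts that happen in county j.
    mix = mobility / mobility.sum(axis=1, keepdims=True)

    infectious = np.asarray(seed, dtype=float)
    susceptible = pop - infectious
    reported = np.zeros((days, pop.size))

    for day in range(days):
        # Infectious prevalence among the people present in each county.
        prevalence_here = (mix.T @ infectious) / (mix.T @ pop)
        # Daily infection hazard for residents, averaged over where they spend time.
        hazard = beta * (mix @ prevalence_here)
        new_infections = susceptible * (1.0 - np.exp(-hazard))
        susceptible = susceptible - new_infections
        infectious = infectious + new_infections - gamma * infectious
        # Only a fraction of infections shows up as confirmed cases.
        reported[day] = ascertainment * new_infections

    return susceptible, infectious, reported
```

In a calibration setting, the reported-case series produced this way would be compared with observed daily case counts while the transmission and ascertainment parameters are adjusted. That inference step is deliberately left out of this sketch.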


  1. Overall, 31% of the US population had been infected in 2020. Only 1 in 4.6 infections was confirmed. The ascertainment rate increased from 11% during March to 25% during December. The estimated number of infections is much higher than the reported number; however, the large majority of the US population was still susceptible to SARS-CoV-2 by the end of 2020. Large spatial variations exist. New England had a lower attack rate. In contrast, the upper Midwest (e.g., the Dakotas) was hit hard, with estimated susceptibility below 40% by year-end.

  2. The community infectious rate, the percentage of people harboring a contagious infection, rose above 0.8% nationally before the end of the year and was as high as 2.4% in certain metro areas like LA. That means 1 in 42 people was contagious in LA at the winter peak. This metric, which usually cannot be directly observed, is key to planning intervention policies.

  3. The cumulative IFR dropped from 1% to 0.4% over the year, possibly driven by the changing age profile of infection and improvement of care and treatment. Again, there are large spatial variations. The instantaneous IFR rose again in winter when hospitals were stressed.
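The headline numbers above are related by simple arithmetic, sketched here as a quick check. The input values come from the summary itself; the variable names and the helper `one_in` are just illustrative.

```python
# Figures quoted in the summary above.
attack_rate = 0.31         # share of the US population infected in 2020
infections_per_case = 4.6  # "1 in 4.6 infections was confirmed"

# Ascertainment rate: the fraction of infections that were confirmed.
ascertainment = 1 / infections_per_case

# Confirmed and unconfirmed infections as shares of the whole population.
confirmed_share = attack_rate * ascertainment
unconfirmed_share = attack_rate - confirmed_share

def one_in(prevalence):
    """Convert a prevalence (fraction of people) into a '1 in N' figure."""
    return 1 / prevalence

print(f"ascertainment: {ascertainment:.1%}")
print(f"confirmed share of population: {confirmed_share:.1%}")
print(f"unconfirmed share of population: {unconfirmed_share:.1%}")
print(f"LA peak (2.4% contagious): 1 in {one_in(0.024):.0f}")
```

So an annual ascertainment rate of about 22% means that confirmed infections correspond to roughly 7% of the population, with the remaining infected share (about 24% of the population) never confirmed.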

This study used mathematical models to combine different types of datasets, including county-level case and death surveillance data, census data on commuting, SafeGraph POI data, serological survey data, and CDC line-list data of confirmed cases. The SafeGraph data support the construction of the model and are essential to the study. We are grateful to SafeGraph for sharing high-quality human mobility data with academia, which has supported numerous important studies on COVID-19 and beyond.

This topic was automatically generated from Slack. You can find the original thread here.

Hey @Sen_Pei ! Thanks for sharing this publication. This is fantastic!

Regarding your first takeaway, wow! If I’m understanding correctly, only ~22% of infections were confirmed while 31% of the US population had been infected. That means roughly a quarter of the US population had COVID-19 but was never confirmed. Certainly demonstrates how devastating COVID-19 can be. I was reading through your paper and didn’t see any speculation on what might have led to the upper Midwest being hit hard. Any thoughts on why that might be? Maybe I skimmed past it.

I’m sure you might have already looked at this, but how does this change going into 2021, with vaccines becoming available in the early winter and more widespread by the spring (at least where I’m currently based in the Midwest)? Have you already done any preliminary analysis for 2021? Obviously, IFR will decrease, but any ideas on the magnitude of the change? I’m sure the Delta variant will likely affect this as well. As we get further into 2021 and 2022, I’m guessing booster shots will influence this too.

Also, love all the different datasets that you shared. Going to drop them in this conversation as we’re always trying to beef up our list of data sources (both SafeGraph and non-SafeGraph) for all Community members.

COVID-19 surveillance data from JHU
Commuting data from US Census Bureau
Seroprevalence data from CDC
I don’t think I’ve run across other members using commuting data from the US Census Bureau. Will have to keep that one in mind! I’ll be sure to loop you in if others in the Community have questions about these datasets.

I’d love to share some other research by our Community members. Have you met @Song_Gao_UW-Madison? Song published this research titled Intracounty modeling of COVID-19 infection with human mobility: Assessing spatial heterogeneity with business traffic, age, and race. His work focuses on two counties in Wisconsin. Here’s a link to the full publication:

@Song_Gao_UW-Madison Have you checked out this publication by @Sen_Pei ? Interesting to see some parallels and valuable implications for helping shape public health policies.

Hey @Kwang_il_Yoo ! Just wanted to pull you into this thread. Lots of overlap between your current line of research and Sen’s publication! Have you given it a read?

Hey @xueming_Chen ! Wanted to flag this publication for you. Might be helpful in your current line of research. Can you share with Sen what you’re currently working on?

Again, thanks for sharing!

Hi Niki @Niki_Kaz Thanks for pointing to those interesting studies. I’ll look into the details. For 2021, we haven’t extended the analysis, since the age-based vaccination roll-out adds more complexity and our model does not have an age structure. The new variants, with possible breakthrough infections or re-infections, also add to the difficulty.

In our follow-up studies, we will also use SafeGraph data, and we are truly grateful that SG made it easy to get access to the data. We will share more studies in the community in the future. :grinning:

My first response is that this is an incredible paper; you’ve done so much work here to create a fine-grained model of COVID-19 infections in the United States during 2020. I can very much see how this got published in Nature. A few comments/questions:

  1. I believe you use confirmed COVID-19 deaths to calculate both the CFR and IFR. Did you consider using excess deaths rather than confirmed COVID-19 deaths? This seems like it might be useful particularly early in the pandemic, during the first wave, when many people died who were never confirmed cases. I recognize that this might cause more issues than it solves later in the pandemic, when excess deaths are more likely capturing people who did not seek medical care for issues other than COVID-19, so there might be some other solution (perhaps using medical record data to capture all pneumonia deaths, for instance; I know the COVID-19 Research Database has ICD-10 codes).
  2. In extended data figure 4, you show models with 25% and 50% movement between counties. Did you validate what percentage of individuals move between counties in the SafeGraph data to determine which was a better approximation of reality? I wasn’t quite sure since extended data fig. 4 doesn’t seem to be referenced in the text. Did R_t vary with reported mobility?
  3. How was the waning of antibodies determined? It wasn’t clear to me where the 17.5% and 15% numbers came from. Furthermore, did this mean that 15% of the population lost antibodies each month, or that 15% of those with antibodies lost them each month?
  4. The last fifteen days of data would include the beginning of the vaccine rollout; did you exclude this in your seroprevalence validation, or was there not enough sampling of the vaccinated within that two-week window for it to be a problem?

Hi Lauren, thanks for your kind words. Those are all good questions (and apologies for my delay).

  1. We haven’t considered excess deaths, so the estimated IFR may be biased low. CFR quantifies the mortality rate among confirmed cases, so it is less affected by unidentified deaths outside the surveillance system.
  2. We assume that 25% or 50% of INFECTED individuals can move across counties, as a sensitivity analysis on the model structure. Since we don’t know the infection status of SafeGraph users, we were not able to validate this assumption. But the results seem robust to it.
  3. We used a previously published method to estimate the antibody waning rate. We adapted this method to seroprevalence data in NYC. Since NYC experienced large outbreaks early, the effect of waning is more significant and detectable there. The estimated monthly waning rate is about 10%, which means that about 10% of the infected population with antibodies will no longer be detected after one month (but they still have immunity from other components of the immune system). For other locations, we find the antibody waning rate that is closest to 10% and can explain the observed seroprevalence data. In some locations, sampling bias may lead to a much faster waning rate, which we believe is not realistic. In those cases, we set a maximum waning rate as a cut-off to eliminate highly biased data. The 15% and 17.5% are the cut-offs we tested (also as a sensitivity analysis).
  4. The number of vaccinated people during 2020 was limited. We therefore did not consider vaccination in the model.
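For readers wondering how a monthly waning rate interacts with seroprevalence, here is a rough sketch of the reading given in point 3: each month, a fixed fraction of people who still have detectable antibodies drop below detection. This is an illustration under simplifying assumptions (monthly time steps, no delay between infection and seroconversion), not the published estimation method. The function names and the example numbers are hypothetical.

```python
def detectable_seroprevalence(monthly_new_infections, population, monthly_waning=0.10):
    """Share of the population with detectable antibodies at the end of each month."""
    detectable = 0.0
    series = []
    for new_infections in monthly_new_infections:
        # A fraction of previously detectable people wane; new infections are added.
        detectable = detectable * (1.0 - monthly_waning) + new_infections
        series.append(detectable / population)
    return series

def capped_waning(fitted_rate, max_rate=0.15):
    """Apply a maximum waning rate as a cut-off (15% or 17.5% were the values tested)."""
    return min(fitted_rate, max_rate)

# Hypothetical example: 100 infections in month 1 in a population of 1000, none afterwards.
print(detectable_seroprevalence([100, 0, 0], population=1000))
```

With a 10% monthly waning rate, the measured seroprevalence in this toy example declines by 10% of its current value each month even though the cumulative share infected stays at 10%, which is why raw serosurveys can understate cumulative infections.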