Can anyone give me ideas on how to verify if the data I am receiving from my state health officials is accurate if I have a historical repository of data?

hi there, can anyone give me ideas on how to verify if the data I am receiving from my state health officials is accurate if I have a historical repository of data?

Hi @Jennifer_Larsen_University_of_Central_Florida, accurate how? As in, does that data match up with SafeGraph data?

The first quick trick I would try is to find data that is as similar to the SafeGraph data as possible and do a simple count(). If that comes out the same, they are probably getting the data from SafeGraph. If not, you can make some simple graphs of weekly or monthly data and do a visual comparison. If the two more or less match, you are good to go. If they look similar but are still off enough to cause concern, you will need to do a more in-depth analysis.
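For the weekly comparison, here is a minimal sketch in plain Python. It assumes each source can be reduced to a list of (date, value) pairs; the function names, the 5% tolerance, and the sample numbers are made up for illustration and are not from any real dataset.

```python
from collections import Counter
from datetime import date

def weekly_totals(records):
    """Sum daily values into ISO (year, week) buckets."""
    totals = Counter()
    for day, value in records:
        iso = day.isocalendar()
        totals[(iso[0], iso[1])] += value
    return dict(totals)

def compare_weekly(a, b, tolerance=0.05):
    """Return the weeks where two sources differ by more than `tolerance`.

    The difference is relative to the larger of the two weekly totals.
    The default tolerance is an arbitrary assumption; tune it to your data.
    """
    wa, wb = weekly_totals(a), weekly_totals(b)
    flagged = {}
    for week in sorted(set(wa) | set(wb)):
        va, vb = wa.get(week, 0), wb.get(week, 0)
        base = max(va, vb)
        if base and abs(va - vb) / base > tolerance:
            flagged[week] = (va, vb)
    return flagged

# Illustrative values only.
state_data = [(date(2020, 4, 6), 10), (date(2020, 4, 7), 12), (date(2020, 4, 13), 20)]
other_data = [(date(2020, 4, 6), 10), (date(2020, 4, 7), 12), (date(2020, 4, 13), 30)]

print(len(state_data) == len(other_data))      # the quick row-count check
print(compare_weekly(state_data, other_data))  # weeks that disagree
```

Any week that shows up in the result is a candidate for the more in-depth look.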

Towards Data Science is an amazing resource for this field. I will add a link to an article below… if you get blocked by a paywall, just hit Ctrl + Shift + N to open an incognito window and paste the URL there, or just open an account. They deserve it :wink:

link: How to Quickly Compare Data Sets. How to get a quick summary of any… | by Costas Andreou | Towards Data Science

Added to general here: Workspace Deleted | Slack

So one approach would be to compare the Florida provided data with another similar set of data. The question then becomes what data would be similar enough to Florida to work for that?

If you have a schema or list of column names, I might be able to help you pick the data here that fits best.

Hi Mr. Lindsay! Looping in @Ben_Sawyer_University_of_Central_Florida so he sees this. I have the file format Florida uses to report its deaths as a case line and also the format used to report the county data overall for tests, deaths, etc. Would that be sufficient? A state that structures its data similarly could be useful.

OK, great. Are you looking to see if the department of health is reporting data consistent with other data sources, or if everyone's data is accurate?

Well, we know other people (news agencies, Johns Hopkins, etc.) are using the data provided by Florida, so it should all match because it’s from the same source. So we are looking to see if the data reported by FDOH is consistent over time.

OK. So SafeGraph is traffic data. Correct me if I am wrong, but I believe what you are looking for is cases per area, deaths, etc., not movement data. Is that the case?

@Jennifer_Larsen_University_of_Central_Florida

These don’t have much to do with SafeGraph data, but I have 2 ideas.

1st, if you have access to daily data, you could check the distribution of the daily numbers against things like Benford’s law or other measures of variance or number distribution, and compare them to other states. This is purely about finding patterns in the numbers that are less random than you’d expect from a truly random process.

2nd, one rigorous approach would be to explore case count growth models and see whether the case count growth in Florida is particularly different than in other places, after controlling for things like shelter-in-place, timing of stay-at-home orders, weather, etc. That’s a big undertaking.

If after controlling for those factors, growth in Florida doesn’t look particularly different than other places, then that would be evidence in favor of the numbers being real.

If the FL data does look very different than other states, that doesn’t prove the data is falsified, but it certainly would raise a lot of questions about why FL case growth is so different than other places (if that is what the data showed).
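As a much smaller first step than a full growth model, you could compare raw growth rates between states before controlling for anything. The sketch below assumes you have cumulative case counts per day for each state; the function names and the example series are hypothetical, and it does none of the controlling for policy or weather described above.

```python
import math

def daily_growth_rates(cumulative):
    """Log growth rate between consecutive days of cumulative counts."""
    return [
        math.log(b / a)
        for a, b in zip(cumulative, cumulative[1:])
        if a > 0 and b > 0
    ]

def doubling_time(cumulative):
    """Days needed to double at the average observed growth rate."""
    rates = daily_growth_rates(cumulative)
    if not rates:
        return float("inf")
    mean_rate = sum(rates) / len(rates)
    return math.log(2) / mean_rate if mean_rate > 0 else float("inf")

# Made-up series for illustration, not real state data.
state_a = [100, 200, 400, 800]
state_b = [100, 110, 121]

print(doubling_time(state_a))
print(doubling_time(state_b))
```

A large gap in doubling time between otherwise similar states is only a prompt to dig further, not evidence of anything by itself.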

Here are some case growth models from other members of the consortium:

https://safegraphcovid19.slack.com/archives/C0114RJA0BW/p1588116315070600?thread_ts=1588017254.038000&cid=C0114RJA0BW

I have done some research in the past on case numbers etc., and a great place to find data on that is here: https://rapidapi.com/collection/coronavirus-covid-19

If you decide to push further with this, you can combine the RapidAPI data with SafeGraph data to create a real superset.

You guys are amazing, thank you! I had to make dinner and chase after kids right after asking, so I appreciate the patience and input. I had originally looked into Benford’s law and used it to mess around with testing numbers already to see if they were legit. I didn’t know if I could use it with deaths, however, because it is such a tiny set of data
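For anyone following along, a Benford's law check can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not a validated test: it assumes the inputs are non-negative whole-number counts, and the function names are made up. The expected share of leading digit d is log10(1 + 1/d), and the chi-square statistic is compared against 15.507, the 5% critical value for 8 degrees of freedom.

```python
import math
from collections import Counter

def leading_digit(n):
    """First digit of a whole-number count (assumes an integer-like value)."""
    return int(str(abs(int(n)))[0])

def benford_expected(d):
    """Expected share of leading digit d under Benford's law."""
    return math.log10(1 + 1 / d)

def benford_chi_square(values):
    """Return (chi-square statistic, number of non-zero values used)."""
    digits = [leading_digit(v) for v in values if int(v) != 0]
    n = len(digits)
    observed = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * benford_expected(d)
        chi2 += (observed.get(d, 0) - expected) ** 2 / expected
    return chi2, n

# Powers of 2 are a classic series that follows Benford's law closely.
chi2, n = benford_chi_square([2 ** k for k in range(100)])
print(n, chi2 < 15.507)
```

On the small-sample worry: the expected share for digit 9 is under 5%, so with only a few dozen death counts several digits would have expected counts below the usual rule-of-thumb minimum of about 5, and the chi-square result would not be very trustworthy.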

@Jennifer_Larsen_University_of_Central_Florida you could at least compare it to other states

I am quite overwhelmed because this wasn’t what I set out to look at, so I appreciate the ideas on where to start! I am going to look at the UT at Austin set Mortality Modeling FAQ, but I suspect it is far, far outside of my level of skill :slightly_smiling_face: