Example of Data Cleaning in Physics

When I was working on my PhD, there was one part of my analysis work that was the bane of my existence. Not because what I was ultimately trying to accomplish was all that difficult, but because the specific set of data I was pulling from was an absolute mess. I was working with a virgin data stream within the experiment; no one had used it for analysis at any point before. The stream had been passed from person to person to finish its development and was still full of issues and bugs. But I needed information from it that wasn't available in the normal data stream, so I became the guinea pig for the data set.

In physics, most of the data collected is garbage. Experiments typically use some form of filtering (physicists call it triggering) for the base event selection. The trigger is generally tuned so that >99% of the desired events are kept while most of the background noise is cut out. I was looking for specific types of events that were purposely filtered out of the traditional data stream but were not removed from the new one.

Before I get into the nitty gritty, here's a list of some of the challenges that needed to be solved just to look at the data: some of the data was purposely duplicated, some was accidentally duplicated, some was out of place, some was missing entirely, some types of events were missing, and the timestamps were not unique.

The new data stream used a PC farm to do the initial event processing, and a final organizing PC then put all the information together. Since a farm was used, the data was split into blocks, with individual blocks sent out to the PCs for processing. The blocks purposely overlapped so that any data of interest would be wholly contained in at least one block. Once a block was processed, the organizing PC (hopefully) put it all back together (spoiler alert: it had its issues).
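
To make the block idea concrete, here is a rough Python sketch of what that kind of overlapping split could look like. The block and overlap sizes (and the idea of slicing a plain list) are made up for illustration; the experiment's own framework handled this differently.

```python
# Rough sketch: split a stream of events into overlapping blocks so that
# anything shorter than the overlap is wholly contained in at least one block.
# Block and overlap sizes here are invented for illustration.
def split_into_blocks(events, block_size=100_000, overlap=1_000):
    start = 0
    while start < len(events):
        yield events[start:start + block_size]
        # Step forward by less than a full block so consecutive blocks overlap.
        start += block_size - overlap

# Each block would then go off to a farm PC, and an organizing process would
# stitch the results back together afterwards (and occasionally get it wrong).
```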

Outside of physics, there generally is not much need for ultra-precise timestamps. Where most people might look for something down to the minute or second, I needed precision down to the microsecond. I was looking for a form of cause and effect within the data, but a third of the "cause" events were missing from the new data, while the "effect" events were located only in the new data. The old data stream would have to be used to identify the "cause" events, which then had to be matched to the new data. So I had to time-sort an entire day of data to be able to match the two completely. Some of the timestamps were mismatched between the streams, and the only viable method was troublesome to say the least. Luckily, this was a reasonable enough fix for the missing types of events.
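
For a sense of what that matching might look like once everything is time-sorted, here is a hypothetical sketch: each "effect" event from the new stream gets paired with the latest "cause" event from the old stream that precedes it within some window. The layout and the 500-microsecond window are invented for illustration, not the experiment's actual values.

```python
import bisect

def match_effects_to_causes(cause_times_us, effect_times_us, max_gap_us=500):
    """Pair each effect with the latest cause that precedes it by at most
    max_gap_us microseconds. Both lists are assumed to be time-sorted."""
    pairs = []
    for t_eff in effect_times_us:
        i = bisect.bisect_right(cause_times_us, t_eff) - 1  # latest cause <= effect
        if i >= 0 and t_eff - cause_times_us[i] <= max_gap_us:
            pairs.append((cause_times_us[i], t_eff))
    return pairs
```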

Rollover

When trying to match the two streams, you need to have a grip on time. A hardware clock was the only reliable means of measuring time in the new data and was the way to match it to the old set. The issue was that it used an unsigned 32-bit integer, with each count representing 17 microseconds. Data was collected in runs of up to 24 hours, and if you do the math, 2^32 counts at 17 microseconds apiece comes out to roughly 20 hours. That is a 20-hour clock for a 24-hour day, meaning about a third of the timestamps overlapped. In order to sort things, I checked whether there was at least a large-scale order to the data. If the disorder only occurs on a short enough time scale, you can sort things on a long time scale and then finish the sorting locally. With this kind of clock rollover, the duplication is located at the beginning and end. As long as you can keep the two separate, you may just make it.
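
If you want to check the math yourself, it is a one-liner:

```python
# The math behind the "20-hour clock": an unsigned 32-bit counter ticking
# every 17 microseconds wraps around well before a 24-hour run is over.
counts_before_rollover = 2**32
tick_us = 17
period_hours = counts_before_rollover * tick_us / 1e6 / 3600
print(f"clock period is about {period_hours:.1f} hours")  # ~20.3 hours
```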

You can go in a circle, with 00 < 01 < 10 < 11 < 00… for the highest two bits. So long as you don't have too much disorder, you can keep the different quadrants separate.

You can sort on a macro scale by sorting on the highest bits of the clock, and once that is done, sort on the rest of the lower bits within each quadrant. It doesn't matter if it is a 20-hour, 20-minute, or 20-day clock; so long as your data is ordered to within a quarter of your time scale, you can split it up accordingly. In my case, as long as data was not more than 5 hours out of place, it could always find its correct spot. This also killed two of the initial issues I listed before: the non-unique timestamps and the out-of-place data.
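
Here is a simplified Python sketch of that two-stage sort, assuming each record carries its raw 32-bit clock value in a clock field (my invention for the example). It covers a stretch of data that does not re-enter its starting quadrant; for a full 24-hour run, the overlapping beginning and end would still have to be kept apart first, as described above.

```python
def sort_rolling_clock(records, start_quadrant):
    """Two-stage sort for nearly-ordered data on a wrapping 32-bit clock.

    Assumptions, in line with the conditions above: no record is more than a
    quarter of the clock period out of place, and the stretch sorted here does
    not re-enter its starting quadrant (i.e. less than one trip around the clock).
    """
    buckets = {q: [] for q in range(4)}
    for rec in records:
        buckets[rec["clock"] >> 30].append(rec)      # top two bits -> quadrant

    ordered = []
    for step in range(4):                            # circular quadrant order
        q = (start_quadrant + step) % 4
        # Within a quadrant the top bits are equal, so sorting on the full
        # clock value is the same as sorting on the lower 30 bits.
        ordered.extend(sorted(buckets[q], key=lambda r: r["clock"]))
    return ordered
```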

Seeing Double

There were two forms of duplicate data I had to deal with: that which should be there, and that which shouldn't. Against your intuition, the duplicate data that shouldn't be there can be the easier kind to get rid of.

Handling duplicate data depends on the circumstance. It is easy enough to remove duplicates if the two copies are identical, but what if they are not? How do you identify them? How do you choose between the two? If they were two of the same thing, it could be solved with a simple if statement. Sometimes you have to engineer something to make a quick check possible.

In my case, the unintended data was fully duplicated, while the intended duplicate data was similar but tended to have small differences. There weren't two variables where you could just say, if A == B discard B… One thing that helps when trying to process data is to understand the data. There is almost always some limitation or relation you can lean on. The physics involved and the way the detector worked meant that duplicate events couldn't necessarily be cut in time OR space, but they could be cut in time AND space. Recognizing this made it possible to check whether two events differed by looking at just four variables instead of up to a thousand. That matters when up to 1 TB of data was collected a day, with tens of millions of events. From there I had no reason to prefer one copy over the other, so I could arbitrarily remove one of them; in other situations that may not be the case, and deciding how to proceed is up to you.
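
As an illustration, here is roughly what a time-AND-space duplicate cut could look like. The field names (t_us, x, y, z) and window sizes are invented; the real selection used detector-specific variables and cuts.

```python
def is_same_event(ev_a, ev_b, dt_us=50, dxyz_cm=2.0):
    """Treat two events as duplicates only if they agree in BOTH time and
    space within small windows. Field names and windows are placeholders."""
    close_in_time = abs(ev_a["t_us"] - ev_b["t_us"]) <= dt_us
    close_in_space = all(abs(ev_a[k] - ev_b[k]) <= dxyz_cm for k in ("x", "y", "z"))
    return close_in_time and close_in_space

def drop_intended_duplicates(events):
    """Walk a time-sorted event list, compare each event with the last one
    kept, and drop it if it looks like the same underlying event."""
    kept = []
    for ev in events:
        if kept and is_same_event(kept[-1], ev):
            continue            # arbitrarily keep the earlier copy
        kept.append(ev)
    return kept
```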

On the other end, we had the duplicate data that shouldn't be there. This was the organizing PC's fault; it would get confused and make a mistake. We still love it though. There were two cases of data overlap for the unintended duplicates. Think of two chunks of data, A and B, with A starting before B, and how they can relate to each other. In Case 1, A completely covers B; in Case 2, A ends before B ends. With Case 1 I could simply discard B, and with Case 2 I had to choose where to stop using A and start using B. I was more inclined to keep more of the longer stream of data, whether it was A or B, to avoid the fickleness of the organizer.

The three kinds of overlap between disjointed chunks of data.
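
Here is a rough sketch of how those cases could be stitched together, assuming each chunk is a (start, end, events) tuple with time-sorted events; the layout is invented, just to show the logic. The third branch is the gap case, which comes up next.

```python
def stitch(chunk_a, chunk_b):
    """Combine two chunks where A starts no later than B.

    Case 1: A completely covers B      -> drop B.
    Case 2: A and B partially overlap  -> keep the longer chunk across the
                                          overlap, then switch to the other.
    Case 3: a gap between A and B      -> nothing to recover; just concatenate.
    """
    a_start, a_end, a_events = chunk_a
    b_start, b_end, b_events = chunk_b

    if b_end <= a_end:                       # Case 1
        return a_events

    if b_start <= a_end:                     # Case 2
        a_is_longer = (a_end - a_start) >= (b_end - b_start)
        cut = a_end if a_is_longer else b_start
        return ([e for e in a_events if e["t"] < cut] +
                [e for e in b_events if e["t"] >= cut])

    return a_events + b_events               # Case 3
```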

Missing in Action

Using the A and B analogy, there is a possible third case, which isn't really a case of duplicate data but brings up the final point on the list: missing data. There is no real solution for this. You can't make it up; you can't will it into existence. It is either there or it isn't, and sometimes it isn't. You have to make do with what you have. Part of the job of being a physicist or a data scientist is making sense of what you were able to get. To end this on a slightly happier note, I was ultimately able to accomplish what I set out to do. I made sense of the data I wanted to, and was able to produce a first-of-its-kind discovery for an experiment of this nature. But that came much later, after doing much, much more data cleaning.
