[Blog #3] Exploring Data Quality

These past few weeks have been arguably my favorite so far. As I've progressed further into my projects, I've learned a lot about how the platform works, partly by listening in on paper discussions and through formal explanations, but even more so by working with the different elements of the platform and exploring them for myself. I have continued working with DISTIL, with a more specific focus on data quality checks. One of the first things I worked on was changing how the distillers output data quality information. Originally, all of the data quality checks output a 1 at timestamps where an issue was found and a 0 at every point where there was no issue. However, this means that the data quality stream would take up as much memory as the input stream, which would be really inefficient if we wanted to run multiple data quality checks on each stream. Therefore, we decided to change the distillers to output nothing when no issues are found, reducing the size of the output significantly.
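The difference between the two encodings can be sketched in a few lines of Python. This is just an illustration of the idea, not DISTIL's actual code; the function names and the flagging rule (here, "value is exactly zero") are invented for the example.

```python
from datetime import datetime, timedelta

def dense_quality_flags(points, is_issue):
    # old behavior: one output flag per input point, 0 or 1
    return [(t, 1 if is_issue(v) else 0) for t, v in points]

def sparse_quality_flags(points, is_issue):
    # new behavior: emit a point only where an issue is found
    return [(t, 1) for t, v in points if is_issue(v)]

t0 = datetime(2021, 7, 1)
points = [(t0 + timedelta(seconds=i), v)
          for i, v in enumerate([3.2, 0.0, 5.1, 0.0])]

dense = dense_quality_flags(points, lambda v: v == 0.0)
sparse = sparse_quality_flags(points, lambda v: v == 0.0)
# dense has one entry per input point; sparse keeps only the flagged ones
```

When issues are rare (the common case for a healthy sensor), the sparse stream is a tiny fraction of the size of the input.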


The next task I worked on was figuring out how to narrow down the computation required for a data quality check by first looking through the aggregate input data. As an example, for the distiller meant to find outputs of exactly zero, we can greatly reduce the computation required to identify the zeros by first filtering out the larger windows of data where the minimum value is greater than zero. While we knew this was possible, we didn't have any examples of distillers that used it, so I spent a good amount of time looking through the DISTIL codebase to figure out how to implement it. After figuring out how these aggregate stages worked and implementing them in the zero-detection distiller, I wrote up a description of how to implement them more generally.
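As a rough sketch of the pruning idea (again with invented names, not the real DISTIL aggregate API): each window carries a precomputed minimum, and the raw points are only fetched and scanned when that minimum leaves zero as a possibility.

```python
def find_zeros(windows, fetch_raw):
    """windows: (start, end, min_value) aggregate summaries.
    fetch_raw: returns the raw (time, value) points for a range."""
    zero_times = []
    for start, end, min_value in windows:
        if min_value > 0:
            # the aggregate already rules out zeros: skip the raw scan
            continue
        zero_times.extend(t for t, v in fetch_raw(start, end) if v == 0.0)
    return zero_times

# Toy data: only the middle window can possibly contain a zero,
# so only its raw points need to be fetched.
windows = [(0, 10, 1.5), (10, 20, 0.0), (20, 30, 2.4)]
raw = {(10, 20): [(12, 0.0), (15, 3.0), (18, 0.0)]}
result = find_zeros(windows, lambda s, e: raw[(s, e)])
# → [12, 18]
```

For a mostly nonzero stream, almost every window is skipped, so the cost scales with the number of windows rather than the number of raw points.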


In addition to the checks we had already written for zeros and repeated values, we wanted to add checks to identify gaps in the data and places where the sensor output two data points at the exact same time. I was able to build the distiller that checks for points with duplicate timestamps, but it ended up being more difficult than expected to build a distiller to check for gaps. My initial implementation was able to identify smaller gaps, but wasn't able to identify gaps longer than a day, as shown in the pictures below. The red graph represents a sample input data stream, and the blue represents the gaps that were detected. While the gap in the first image is completely detected (because it is only a few hours long), the gap in the second image is only partially detected.
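The core logic of both checks is simple once you have the points in hand; the hard part, as described below, is what data the distiller actually gets to see. Here is a minimal standalone sketch of the two checks (hypothetical helper names, integer timestamps for brevity):

```python
def duplicate_timestamps(points):
    # flag any timestamp that appears more than once in the stream
    seen, dupes = set(), []
    for t, _ in points:
        if t in seen:
            dupes.append(t)
        else:
            seen.add(t)
    return dupes

def find_gaps(points, max_spacing):
    # flag consecutive points spaced farther apart than max_spacing
    return [(t1, t2)
            for (t1, _), (t2, _) in zip(points, points[1:])
            if t2 - t1 > max_spacing]

points = [(1, 2.0), (2, 3.1), (2, 3.3), (9, 0.4)]
# timestamp 2 appears twice; the jump from t=2 to t=9 is a gap
```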


I wasn’t sure why this was happening, so I spent a lot of time digging around in the code that makes DISTIL run. Eventually, I figured out that the issue was with how DISTIL decides which input data should be processed. In order to reflect real-time updates in the input data, DISTIL figures out which chunks of data in the input have changed since they were last processed, and sends only those time ranges to the distiller to be processed. However, if there is a large gap in the data, the time range that falls within that gap will never have any changes (since no data was ever written there), and therefore that section of the input will never be processed by the distiller. After sharing these conclusions with other people at the company, we decided that DISTIL was not the best tool to use for gap detection, and that this should probably be implemented separately.
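A toy model makes the failure mode concrete. This is not DISTIL's actual scheduling code, just an illustration of change-driven processing: only chunks that contain changed points are ever queued, so a span with no writes is never handed to the distiller at all.

```python
def changed_ranges(changed_times, chunk_size):
    """Return the chunk-aligned time ranges containing changed points."""
    chunks = {t // chunk_size for t in changed_times}
    return sorted((c * chunk_size, (c + 1) * chunk_size) for c in chunks)

# Data was written at t=50 and t=350, with nothing in between.
ranges = changed_ranges([50, 350], chunk_size=100)
# → [(0, 100), (300, 400)]
# The empty span (100, 300) contains no changes, so it is never queued;
# a gap detector built as a distiller can never "see" the gap it should flag.
```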


The last task that I worked on in this period was making it easier to write distillers. When writing my first distiller, I was able to copy an existing distiller and essentially pattern-match to make it work. However, pattern-matching alone doesn’t make it clear what options exist for customizing a distiller, and it’s hard to distinguish between optional components and required ones. Therefore, I decided to create a template for writing distillers. By this point, I had written a good number of distillers and learned a lot about how they work, so I felt equipped to write a general outline of what needs to be included in a distiller and a summary of the additional options that can be included.
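The general shape of such a template might look something like the skeleton below. To be clear, every name here is invented for illustration; the real DISTIL interface may look quite different. The point is the separation it enforces between required and optional pieces.

```python
class ExampleDistiller:
    # --- required components ---
    version = 1  # bump when the logic changes so old output is recomputed

    def process(self, points):
        """Core logic: return sparse (time, flag) pairs where issues are found."""
        return [(t, 1) for t, v in points if v == 0.0]

    # --- optional components ---
    def prefilter(self, window_stats):
        """Aggregate-level pruning hook; the default processes everything."""
        return True
```

With a template like this, a new author fills in `process`, and only reaches for the optional hooks when they need them, instead of reverse-engineering which parts of a copied distiller are load-bearing.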


In my final two weeks at PingThings, I hope to make the work on data quality that I’ve done so far more useful. I plan to work with Michael, another member of the data science team, on making it possible to easily work with the data quality information generated by distillers through the Python API, which is how most users interface with the data on PingThings’ platform. I also hope to run these data quality distillers on data from real sensors and try to get some useful results. I'm really excited about all of the work that I've done so far, and I've learned so much about how DISTIL works in just these past few weeks.