In my last two weeks at PingThings, I focused primarily on wrapping up my work on data quality distillers. This meant first applying the data quality checks to real data, and second developing a draft API that lets users easily access the available data quality information for a given stream.
One of the major goals of the data quality project was to identify data quality issues in information gathered through the National Infrastructure for Artificial Intelligence on the grid, or NI4AI. NI4AI is an initiative from the Department of Energy, led by PingThings in collaboration with UC Berkeley, to make it possible to use artificial intelligence to analyze grid data. The initiative hosts thousands of data streams from all kinds of sensors, which can be viewed publicly on ni4ai.org by creating an account.

Running the data quality checks on all of the streams would have required too much computation and memory, so I was given a subset of 16 streams to test the checks on. First, I had to figure out how to run each data quality check on multiple inputs at the same time, which led to some unexpected issues that I had to investigate and debug. It took me a couple of days to figure out exactly what was going on, but eventually I found the main problem and applied the appropriate fix. Another issue I ran into was that the distiller that was supposed to detect zeros would output the same value multiple times whenever it found a zero. After some investigation, I realized that this was because the zeros distiller used window stages (as discussed in a previous blog post), and when testing on real data with different sample rates, this sometimes led to significant overlap between the processed windows. To fix this issue, I changed the distiller to generate the window sizes based on the sampling rate of the input stream.
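The idea behind that fix can be sketched roughly like this: derive the window width from the stream's sampling rate so that consecutive windows tile the time range instead of overlapping. The names here (`windows`, `points_per_window`) are my own illustration, not the actual DISTIL code:

```python
def windows(start_ns, end_ns, sample_rate_hz, points_per_window=100):
    """Yield (start, end) windows, in nanoseconds, that tile the range.

    The window width is derived from the stream's sampling rate so that
    each window covers roughly `points_per_window` samples; because each
    window starts exactly where the previous one ended, no point is
    processed twice regardless of the stream's sample rate.
    """
    width = int(points_per_window / sample_rate_hz * 1e9)
    t = start_ns
    while t < end_ns:
        yield (t, min(t + width, end_ns))
        t += width
```

With a fixed window size instead, a 30 Hz stream and a 120 Hz stream would need very different widths to hold the same number of points, which is how the mismatched, overlapping windows crept in.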
After running the distillers on the real data, we also realized that the data quality check flagging when the same value was output multiple times in a row wasn't always a reliable indicator of an actual issue with the data. There were a significant number of instances where the exact same value was output twice in a row, but based on the context of the surrounding data, the repeated values appeared to be correct measurements. To make this check more useful, I added a parameter to choose how many repeated values count as a data quality issue. We found that detecting when a sensor output the same value four times in a row was a much better indicator of a genuine data quality issue.
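The parameterized repeat check can be sketched in a few lines of Python; the function name and signature are illustrative, not the distiller's actual interface:

```python
def flag_repeats(values, times, min_repeats=4):
    """Return (start_time, end_time) pairs for every run where the same
    value appears at least `min_repeats` times consecutively.

    `values` and `times` are parallel lists of measurements and their
    timestamps; a higher `min_repeats` trades sensitivity for fewer
    false positives on legitimately flat signals.
    """
    flags = []
    run_start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the data or when the value changes.
        if i == len(values) or values[i] != values[run_start]:
            if i - run_start >= min_repeats:
                flags.append((times[run_start], times[i - 1]))
            run_start = i
    return flags
```

With `min_repeats=2` this flags every doubled reading, including the benign ones we observed; raising it to 4 keeps only the long flat runs that actually signaled a stuck sensor.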
Alongside running these data quality checks on real data, we also had to come up with a way to make this information actually usable. We decided to build on top of the existing Python API for interacting with these data streams, specifically to match given streams to their corresponding data quality information and to make it easy to check whether there are any potential data quality issues in a given time range. While Michael, one of the data scientists, did most of the work on this, I was able to help design the overall structure of the API (e.g. which classes we might need to add and how they should inherit from existing classes) and implement some of the new functions. I also added attributes to each of the data quality streams that linked them directly back to their source stream, so that the API could match a given stream with its data quality information.
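As a rough sketch of the idea (the class, method, and attribute names here are hypothetical illustrations, not the actual API we built): a data stream gets paired with its linked quality streams, and checking a time range just means asking whether any of those quality streams recorded an event there.

```python
class StubQualityStream:
    """Stand-in for a data quality event stream, for demonstration only.

    In the real system, each quality stream carries attributes linking
    it back to its source data stream; here we just hold event times.
    """
    def __init__(self, event_times_ns):
        self.event_times_ns = event_times_ns

    def count_events(self, start_ns, end_ns):
        return sum(start_ns <= t < end_ns for t in self.event_times_ns)


class QualityAwareStream:
    """Hypothetical wrapper pairing a data stream with its quality streams."""
    def __init__(self, stream, quality_streams):
        self.stream = stream
        # e.g. {"zeros": <quality stream>, "repeats": <quality stream>}
        self.quality_streams = quality_streams

    def has_issues(self, start_ns, end_ns):
        # The range is suspect if any quality check flagged anything in it.
        return any(
            qs.count_events(start_ns, end_ns) > 0
            for qs in self.quality_streams.values()
        )
```

The point of the design is that a user never has to locate the quality streams themselves: the wrapper resolves them from the source-linking attributes and exposes one simple yes/no question per time range.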
On the last day of the internship, we had a Q&A between the interns and the other members of the company. It was a great opportunity not only to reflect on how the internship went and talk about our experiences, but also to ask the other people at the company about their own backgrounds and experiences. Many of them had worked at very large companies or for the government before moving to PingThings, so it was interesting to hear how working at a startup contrasted with those environments.
While there is definitely still work left to do in adjusting how these data quality checks work, how they can be accessed from the API, and how DISTIL works overall, I feel very satisfied with what I've accomplished in only two months at PingThings. It's really rewarding to know that over the course of the summer, I've helped develop the idea of data quality checks into something usable. Along the way, I learned so much not only about data science and data quality, but also about how the power grid works and the overall process of software engineering. Moreover, I really enjoyed the work I was doing at PingThings, so I have decided to continue working on power grid data analytics by doing research at Berkeley with Prof. Sascha von Meier, who leads the UC Berkeley branch of NI4AI. I'm really excited to have the opportunity to continue doing this kind of work and to contrast the research perspective with the experience of working on this at a startup. Overall, I had an amazing experience this summer at PingThings, and I am so grateful to Cal Energy Corps for giving me this opportunity.