[Final Blog] Abstract & Reflections

Abstract:

 

The electric grid is a massive system of interconnected parts that all need to work simultaneously to bring us electricity. Usually, these parts work together very well, but when something goes wrong, the consequences can be disastrous, as the prevalence of wildfires caused by downed utility lines in just the past few years makes clear. One way to avoid such consequences is to have real-time information about what is happening in the power grid, so that faults can be detected and corrected, ideally before they become a serious concern. PingThings’ PredictiveGrid™ platform enables this kind of grid monitoring: it ingests both historical and real-time data, and its time series windowing makes it efficient to visualize and analyze that data in aggregate.

While this system is well equipped to deal with power grid data, in practice the resulting analytics are only as good as the data streaming in from the various sensors. When sensors stop reporting data accurately (if they go offline, for example), they often emit erroneous values that can significantly throw off any analytic calculations. For example, a sensor sometimes outputs the same value repeatedly when it is taken offline, which can make it appear as though the phase angle difference between two locations in a network is changing rapidly when in fact it is staying roughly the same. To avoid such incorrect conclusions, it is necessary to have a way of differentiating genuine data points from erroneous ones. Furthermore, this information about the quality of the data must be easily accessible, so that users of the platform can efficiently apply it to their specific use case, and must be generated fast enough to keep up with real-time data as well as historical data. Finally, the data quality information should ideally take up a relatively small amount of storage compared to the actual data being qualified.
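To make this failure mode concrete, here is a toy Python illustration with made-up numbers (not real grid data): two sensors whose reported angles advance together, so their true difference stays near 10 degrees, until one goes offline and starts repeating its last value.

```python
import numpy as np

# Toy example with invented numbers: two sensors report phase angles
# that normally advance together, so their difference stays near 10 degrees.
angle_a = np.array([0.0, 12.0, 24.0, 36.0, 48.0, 60.0])
angle_b = angle_a - 10.0

# If sensor B goes offline after the third sample and keeps repeating
# its last value, the apparent angle difference grows rapidly even
# though nothing actually changed in the grid.
angle_b_stuck = angle_b.copy()
angle_b_stuck[3:] = angle_b_stuck[2]

print(angle_a - angle_b)        # [10. 10. 10. 10. 10. 10.]
print(angle_a - angle_b_stuck)  # [10. 10. 10. 22. 34. 46.]
```

A data quality flag on the repeated values would warn an analyst not to trust the apparent swing in the difference.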

In order to pick out data that may be of low quality, it was first necessary to determine how to recognize potentially erroneous data. We picked our criteria based on previous research from the data science team at PingThings, which identified some common signs of incorrect data. Additionally, we spent a good amount of time learning about DISTIL, a system for efficiently processing synchrophasor data in the PredictiveGrid™, to determine whether it would be a good fit for the data quality checks. DISTIL is especially useful in that it can both analyze large amounts of historical data relatively quickly and keep up with real-time changes to the data.

Our general approach to the problem was to use DISTIL to generate separate streams containing data quality information for each stream we wanted to analyze. The data quality checks we attempted to implement were detecting repeated values, duplicate timestamps, outputs of exactly zero, and gaps within the data. For a given stream of synchrophasor data, a series of data quality checks is run on the input, each producing a new output stream that is updated as new data streams in. Each output stream contains data points only where an issue has been detected, to minimize the storage footprint of these auxiliary streams, and carries an attribute linking it back to the original stream. To make this information easily accessible, we developed a Python API specifically for this data quality information, which lets users interested in the data quality of a particular stream see which checks have been run on that stream and query a specific time window for any potential issues flagged in that interval.
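To give a concrete feel for what one of these checks does, here is a simplified, self-contained Python sketch of the repeated-value check. This is not the actual DISTIL implementation or the real Python API; the function name and logic are illustrative stand-ins that mirror the key idea: scan a window of input points and emit output points only where an issue was detected.

```python
import numpy as np

def flag_repeated_values(times, values, min_repeats=3):
    """Illustrative sketch of a repeated-value check (not the real DISTIL code).

    Scans one window of (time, value) pairs and returns only the
    timestamps where the same value appears at least `min_repeats`
    times in a row, mirroring the sparse output streams described
    above: points exist only where an issue was detected.
    """
    times = np.asarray(times)
    values = np.asarray(values)
    flagged = []
    run_start = 0
    for i in range(1, len(values) + 1):
        # A run ends when the value changes or the window ends.
        if i == len(values) or values[i] != values[run_start]:
            if i - run_start >= min_repeats:
                flagged.extend(times[run_start:i])
            run_start = i
    return np.array(flagged)

# Example: a frequency sensor stuck at 59.98 Hz for four samples.
t = np.arange(8)
v = np.array([59.99, 60.01, 59.98, 59.98, 59.98, 59.98, 60.00, 60.02])
print(flag_repeated_values(t, v))  # -> [2 3 4 5]
```

In the actual system, each such check runs inside DISTIL over the changed portions of the input stream, and its flags land in a separate, sparse output stream linked back to the source stream, which is what the Python API then queries.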

Overall, this approach was successful at generating streams that flagged data quality issues in both historical and streaming data, and at making it simple to interact with that information through Python. However, while the other data quality checks worked, we were unable to implement gap detection. This is because DISTIL operates by detecting where the input stream has changed, while a gap represents a portion of the input stream that has never changed from its initial, empty state. Implementing gap detection within the framework of our approach would require fairly significant changes to DISTIL, which was outside the scope of this project.

Based on these results, one next step would be to investigate how to implement gap detection. Additionally, although we have some functioning data quality checks, this is by no means an exhaustive set of criteria, so further research is needed to build a more robust set of checks. Finally, while data quality logs exist for some streams using the C37.118 protocol, the information in them is somewhat difficult to parse, as a range of different types of issues are all combined into a single log. It could be useful to split the information in these logs into a set of separate data quality streams, making it available in a more usable format.

 

Personal Reflection:

 

Over the course of this project, I learned so much, both in terms of technical knowledge and general skills. One of the primary skills I developed over the internship was the ability to work independently. Of course, I had plenty of support from the people at PingThings, and every day I would check in with the team on what I was doing. However, because PingThings is an entirely remote company, I spent most of each day working by myself at home. Because I wasn’t in the same room as my colleagues, I had to be much more self-guided. For example, when I got stuck, I became much better at judging whether the problem was something I could work through on my own or something I needed to ask a more experienced person about. Learning this balance was crucial: I often learned a lot from investigating and resolving a problem on my own, but sometimes it made more sense to ask someone who could resolve it in a few minutes rather than spend hours trying to work through it myself without getting anywhere. Additionally, because I generally didn’t have specific deadlines or anyone directly looking over my shoulder, I had to be much more self-motivated than I would be in a school environment, which is a skill that’s really important for research.

Even though I often worked independently, the project was really a team effort that involved collaboration among many people with different backgrounds. Because of this, another skill I developed was communicating what I was working on and what I had concluded. I had to be able to discuss topics with people who were experts in that field, and also with people who were fairly unfamiliar with the specifics of my work. Communicating across such a range of topics and areas of expertise was sometimes difficult, but it was a crucial skill to develop.

Beyond general skills, I also gained a lot of specific technical knowledge. For example, I became much more comfortable browsing a large GitHub repository, and I gained experience working with test clusters. I also learned a lot about how the power grid works and what some of the current concerns are in the world of power systems. I found that I was genuinely interested in this field, so since my PingThings internship ended, I have continued doing power systems research, specifically working with phasor and harmonic estimation of point-on-wave data from the grid. I credit Cal Energy Corps for this research opportunity, as my summer internship both sparked my interest in power systems and connected me with researchers in this area at UC Berkeley.