[Blog #1] Getting into the Ping of Things

I began my internship at PingThings two weeks ago. PingThings provides a platform for energy companies to process their large amounts of sensor data efficiently through its uniquely structured database, which uses a tree structure to make it very fast to calculate aggregate metrics (e.g. maximum, minimum, mean) of a stream of data over time windows of varying lengths. My internship is completely remote because PingThings has always been an entirely remote company, so I don’t need to worry about going into a physical office. Because everything is remote, I am able to schedule my work pretty flexibly, with the exception of a couple of meetings per day. Even so, I still feel that I’ve gotten to know people. PingThings is (at least currently) a relatively small company, and every morning we all meet to discuss what each of us will be working on that day, so even when I don’t work with someone directly, I get a sense of the work that they do on a daily basis. Every other week, we also have a team social with everyone in the company, so I’ve been able to get to know people on a friendlier level as well.

As the machine learning intern, my role is generally focused on analyzing this sensor data, so I spent the first couple of days getting comfortable with processing data in the database and learning a bit more about how the energy grid works, so that I could better understand what the data means in a physical sense. I didn’t really have any prior experience with power systems coming into the internship, but the people I work with at PingThings have taught me a lot. One of the things that I learned about was the grid frequency. The energy grid uses alternating current, which nominally operates at a frequency of either 50 or 60 Hz, depending on the region. However, as energy supply and load on the grid change, the actual frequency varies over the course of the day. Usually these fluctuations are very small (no more than +/- 0.05 Hz). If the frequency changes significantly, though, it can damage equipment and cause serious problems. Therefore, accurate detection of changes in frequency is really important.

For my first project, I worked on detecting such events. At face value, the problem is relatively simple: because the database already tracks the minimum and maximum data point over intervals of varying sizes, it is possible to efficiently find events even in a very large stream of data by first looking at the aggregate data over large intervals, and then recursively zooming into only those intervals that have a maximum or minimum outside of the normal range. Unfortunately, not all of the frequency data is useful. Whenever a sensor goes offline and then comes back online, the first frequency value that it outputs is usually very incorrect and far outside the range of normal frequency values. Therefore, using this method on the raw frequency data would generate a lot of false positives for frequency events at timestamps where the sensor came back online.
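The zoom-in search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the PingThings API: the `agg` function is a hypothetical stand-in for the database's precomputed min/max lookup (here it simply scans a list), and the window bounds, thresholds, and names are all assumptions.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an aggregate query: returns (min, max) over [start, end).
AggFn = Callable[[int, int], Tuple[float, float]]


def find_events(agg: AggFn, start: int, end: int,
                low: float, high: float, min_width: int = 1) -> List[Tuple[int, int]]:
    """Return the smallest windows whose min or max falls outside [low, high]."""
    window_min, window_max = agg(start, end)
    if low <= window_min and window_max <= high:
        return []  # everything in this window is normal, so skip it entirely
    if end - start <= min_width:
        return [(start, end)]  # cannot zoom in further: report this window
    mid = (start + end) // 2
    # Zoom into both halves; in-range halves are pruned by the check above.
    return (find_events(agg, start, mid, low, high, min_width)
            + find_events(agg, mid, end, low, high, min_width))


# Toy data: 1024 samples at 60 Hz with one dip at index 300.
data = [60.0] * 1024
data[300] = 59.7


def toy_agg(s: int, e: int) -> Tuple[float, float]:
    # A real tree-structured database would return stored aggregates here.
    return min(data[s:e]), max(data[s:e])


events = find_events(toy_agg, 0, len(data), low=59.95, high=60.05)
print(events)  # [(300, 301)]
```

With stored aggregates, each in-range window is rejected after a single lookup, which is why the search only does real work along the path down to each event.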

To account for this problem, I decided that it would be easiest to first find and delete these erroneous points from the stream, and then perform the above analysis. Deleting these points from the stream altogether (instead of ignoring them) allows one to perform multiple types of analysis on the data without having to specifically work around these outliers in every computation. Deleting the points can take a while to run if there is a lot of data and the sensor dropped offline many times, but once the erroneous points are gone, finding frequency events is very efficient: it takes only a couple of seconds to find an event within a month’s worth of data.
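As a rough sketch of how such points might be located, assume that a dropout shows up as an unusually long gap between consecutive timestamps, and that the bad reading is the first point after the gap. Both of those are assumptions made for illustration, as are the function name, the gap threshold, and the normal band; the actual cleanup may use different criteria.

```python
from typing import List, Sequence


def reconnect_outliers(timestamps: Sequence[float], values: Sequence[float],
                       max_gap: float, low: float, high: float) -> List[int]:
    """Return indices of points that follow a gap AND lie outside [low, high]."""
    flagged = []
    for i in range(1, len(timestamps)):
        follows_gap = timestamps[i] - timestamps[i - 1] > max_gap
        out_of_range = not (low <= values[i] <= high)
        # Requiring both conditions avoids discarding good readings after a
        # gap and avoids discarding genuine events in continuous data.
        if follows_gap and out_of_range:
            flagged.append(i)
    return flagged


# Toy stream sampled once per second with two dropouts.
ts = [0, 1, 2, 10, 11, 12, 30, 31]
freq = [60.0, 60.01, 59.99, 12.3, 60.0, 60.02, 60.0, 59.98]

bad = set(reconnect_outliers(ts, freq, max_gap=1.5, low=59.0, high=61.0))
cleaned = [(t, f) for i, (t, f) in enumerate(zip(ts, freq)) if i not in bad]
print(sorted(bad))  # [3]
```

Here only the reading at index 3 is removed; the point after the second gap is kept because its value looks normal.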

As a next step, I plan to find a way to efficiently and accurately generate frequency data for sensors that do not report it. Many of the sensors currently in use do not collect data on the frequency of the grid and only report voltage and current. However, the frequency of the grid can be approximated from the derivative of the voltage phase measured at each sensor. To do this computation efficiently, I will use DISTIL, a data processing system specifically designed to process data from power grid sensors. Over the next few weeks, I will learn how to use DISTIL and try to apply it to the problem of generating frequency data.
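To make the phase-derivative idea concrete, here is a minimal sketch in plain Python (it does not use DISTIL). It assumes the phase angle is reported in degrees relative to a reference rotating at the nominal frequency, so that the frequency is roughly the nominal value plus (1/360) times the rate of change of the angle. The function name, the sampling rate, and the finite-difference approach are illustrative assumptions.

```python
from typing import List, Sequence


def frequency_from_phase(times: Sequence[float], angles_deg: Sequence[float],
                         nominal_hz: float = 60.0) -> List[float]:
    """Estimate frequency (Hz) from consecutive phase-angle samples in degrees."""
    freqs = []
    for i in range(1, len(times)):
        dtheta = angles_deg[i] - angles_deg[i - 1]
        # Angles wrap at +/-180 degrees, so map the difference into [-180, 180).
        dtheta = (dtheta + 180.0) % 360.0 - 180.0
        dt = times[i] - times[i - 1]
        # 360 degrees of phase drift per second equals a 1 Hz deviation.
        freqs.append(nominal_hz + dtheta / (360.0 * dt))
    return freqs


# Toy signal: a grid running at 60.02 Hz, sampled 30 times per second.
rate = 30
true_hz = 60.02
times = [i / rate for i in range(200)]
angles = [((360.0 * (true_hz - 60.0) * t + 179.0) + 180.0) % 360.0 - 180.0
          for t in times]

estimates = frequency_from_phase(times, angles)
```

A simple difference like this amplifies measurement noise, so a real pipeline would probably smooth the angle or the resulting frequency before using it.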