
Summer 2014 Blog - Herman Wu

Herman Wu is spending the summer at the Energy Biosciences Institute

August 8, 2014

Just when the summer is nearly over, I'm starting to see where all my work and research is heading. Flow data from the Brazilian government, coupled with rainfall and forest cover loss data from Earth Engine, can be used to estimate hydropower potential. The end goal is to see how changes in forest cover affect the amount of hydropower generated. The head calculations from a while back supplied just one part of the equation; flow and turbine efficiency are the other two. With all three in hand for a particular site, simple physics gives the amount of power a small hydropower plant could generate there. Of course, water flow is closely coupled to rainfall, so data on that is highly relevant as well. I have yet to gather the forest cover data, but there is already enough to estimate hydropower potential.
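To make the physics concrete, here is a minimal sketch in R of the standard hydropower equation; the site numbers and the efficiency value are made up for illustration, not taken from our actual data.

```r
# Hydropower potential from flow, head, and turbine efficiency:
#   P = eta * rho * g * Q * H
#   eta : turbine efficiency (dimensionless)
#   rho : density of water (kg/m^3)
#   g   : gravitational acceleration (m/s^2)
#   Q   : flow rate (m^3/s)
#   H   : head, the vertical drop (m)
hydro_power_kw <- function(flow_m3s, head_m, efficiency = 0.85) {
  rho <- 1000   # kg/m^3
  g   <- 9.81   # m/s^2
  watts <- efficiency * rho * g * flow_m3s * head_m
  watts / 1000  # convert W to kW
}

# Example with made-up numbers: 2 m^3/s of flow over a 15 m drop
hydro_power_kw(flow_m3s = 2, head_m = 15)  # roughly 250 kW
```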

My most recently completed task involved the ground station gauges and the watershed basins in our study region. It was a subjective task, and thus difficult, likely impossible, to automate via programming. Fortunately there weren't too many gauges to go through (though the reported locations tend to miss the actual rivers). The task was essentially figuring out (and recording) which watersheds fed water into a particular gauge. Thanks to a special numbering system, I only needed to note the watershed with the lowest identification number that fed into a gauge. On several occasions, a gauge fell in the middle of a basin, with smaller streams flowing into the larger river both upstream and downstream of the gauge location. All I could do was flag such sites.

On another note, I've also been reading papers on the climate conditions in the study region for growing certain crops, namely corn, cotton, and soybean. Brazil is a tropical country with both wet and dry seasons. Recently, through agricultural developments on several fronts, farmers in the region have been able to cultivate crops on a type of landscape called the cerrado (a savanna) that was previously believed to be unusable. It was a fascinating process, nothing short of a miracle, and there is hope that lessons learned in Brazil could be put to use in Africa, which has a similar ecological profile. On my end, though, I've just been looking at the ideal climate conditions for crops grown in the cerrado. It is difficult to find papers dedicated to the topic, especially since optimal conditions can vary from species to species. Temperature has been the easiest factor to find data on so far.


July 25, 2014

Lab work has been going slowly, but that has given me more time to think about my poster project. The high spatial resolution of the MODIS image sets means I have to pull data from a much smaller region than with TRMM. Moreover, the MODIS data collections are not as complete as TRMM's. Cloud cover is the main issue, preventing the satellites from collecting useful data. For the most part, I've been working with MODIS and researching how it actually collects its temperature data.

There are two satellites, Terra and Aqua. Both carry the same instrument and collect the same types of data, but they have different orbits, passing over our region of interest in the Amazon basin at different times. Fortunately, the MODIS data sets available on Earth Engine include these overpass times as their own fields. The satellites can also pass over the region twice in a day; the data sets have separate measurements for daytime and nighttime temperature, with times for each.

Due to the finer spatial resolution, it is infeasible to collect and extract data at a pixel level the way I did with TRMM. Eventually, we went down to a small enough area that I could extract data in the same format as TRMM (every column a pixel, every row a date). But it was clear that doing this over the entire study region would not be reasonable.

As with TRMM, we wanted to compare satellite readings with ground station readings. The ground station temperature readings came from a Brazilian government agency, INMET. I browsed their website (with help from Google Translate, since it is in Portuguese) and downloaded the necessary data. Morgan had code to clean it up, as the INMET data wasn't in the tidiest format, and code to compare it to MODIS. Since pulling MODIS data for the whole region is impractical, for a first comparison I only extracted the satellite temperature values at the INMET sites where I had been able to pull ground readings. That came to 36 locations, compared to over 3,000 pixels for the full sub-basin and nearly a billion for the entire study region, so it was easy. For the most part, the data matches up, though Aqua's daytime readings seem a tad high. MODIS includes quality control codes indicating how "good" the data is, so we can use those to correct or throw out odd values.
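Once both sets of readings are in data frames, the comparison itself is short. Here is a hypothetical sketch in R; the column names and the quality-code cutoff are assumptions for illustration, not the code Morgan actually wrote.

```r
# Assumed inputs (hypothetical column names):
#   modis: station_id, date, lst_day_c (daytime land surface temp, C), qc_day
#   inmet: station_id, date, temp_c    (ground station reading, C)
merged <- merge(modis, inmet, by = c("station_id", "date"))

# Keep only observations MODIS flags as good quality (which QC codes count as
# "good" depends on the product documentation; 0 is used here as a placeholder)
good <- merged[merged$qc_day == 0, ]

# How far off is the satellite from the station?
good$diff <- good$lst_day_c - good$temp_c
summary(good$diff)

# Scatter the two against each other, with a 1:1 reference line
plot(good$temp_c, good$lst_day_c,
     xlab = "INMET station temperature (C)",
     ylab = "MODIS daytime LST (C)")
abline(0, 1)
```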

In the near future, I may be focusing back on the hydropower aspect, or moving on to something new like pulling forest cover data.


July 11, 2014

The work has slowed down considerably since I finished extracting the precipitation data from Google Earth Engine. That particular task took a while due to Earth Engine's limitations on iteration, but time was the only limiting factor. In the interim, I have been reading more papers on hydropower, in particular on the intricacies of the turbine generators themselves. I have also been playing around more with R and ArcMap. The earlier task of calculating the maximum possible head at each site was one process I was able to do in Arc. It took several steps, stringing together the program's built-in tools with Python code, but it worked. Unfortunately, one step was computationally expensive, and I had over 200 sites to run the process on. After getting the code to work and loop properly for a small group of ten, I set the program going for all 200+ sites and let the computer run overnight. The hands-on time wasn't much, and we processed all the sites for the current study, but Morgan and I started considering other programs, like R.

On that end, Morgan sent me code and some links for R's spatial data capabilities to play around with. As it turns out, many of the tools available in Arc have analogues in R's various spatial analysis packages. The one downside is that it is considerably harder to visualize the results, unlike in Arc, where layers can simply be manipulated interactively. For the task of calculating head, though, visualization is not as important (in fact the end product for a site is just a number), so running it on another platform would be worth it if it sped up the computations.
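For a sense of what the R analogue can look like, here is an illustrative sketch using the raster and sp packages. This is not the actual Arc workflow: the file names are made up, and treating the highest elevation within a 1 km buffer as the upper end of the drop is just a stand-in assumption for whatever the real head definition is.

```r
library(raster)
library(sp)

dem   <- raster("dem.tif")              # digital elevation model (hypothetical file)
sites <- read.csv("sites.csv")          # hypothetical: one row per site, columns lon, lat
pts   <- SpatialPoints(sites[, c("lon", "lat")], proj4string = crs(dem))

# Elevation at each candidate site
site_elev <- extract(dem, pts)

# Highest elevation within 1 km of each site, as a rough proxy for the
# upstream point that defines the maximum available drop
max_elev <- extract(dem, pts, buffer = 1000, fun = max)

head_m <- max_elev - site_elev          # maximum possible head, in meters
```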

After the head calculations were completed, the next step was temperature, namely the extraction of temperature data. The desired end product is a full temperature data set constructed from ground station measurements and satellite data. The ground station data would come from a Brazilian government site, but there were several choices for the satellite data. A modeled data set would avoid the problem of cloud cover preventing the satellites from obtaining proper data, but would not be as exact as one that directly measured temperature. For the most part, Earth Engine offered temperature images for our region of interest only from MODIS; MODIS data is often obscured by cloud cover, so the resulting images have many blank regions where the satellite just couldn't pick up anything. To make things more confusing, there are two satellites that collect MODIS data, each in a different orbit. An Earth Engine search for temperature turns up image collections in pairs: two sets with nearly identical descriptions and resolutions, but using different satellites (and hence yielding different data and spatial/temporal coverage).


July 19, 2014

My work continues on Google Earth Engine, even as the World Cup gets into full swing. I've worked out all the kinks in the code, and now it's just a matter of hitting run and waiting for the data to be computed. For now, my primary task has been to extract daily data at a pixel level. That is easier said than done, especially with over 2000 pixels in the region of interest in Brazil. After a bit of experimentation, I've found the optimal size for the spreadsheets to be downloaded: 200 pixels over 3 months. As it turns out, combining all the smaller spreadsheets into a larger one (all pixels over an entire year) takes significantly less time than I initially thought. I've already written an R script to do so, and it runs smoothly and quickly. I've more or less settled into a routine workflow: download all the "chunks" of data for a 3-month period, combine those, repeat for the other months, then combine all of those into a spreadsheet containing data for all pixels in a single year.
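The combining script itself is fairly short. Here is a sketch of what that step can look like in R; the folder layout and file names are hypothetical, and it assumes each chunk CSV shares a date column with one row per day.

```r
# Combine the ~200-pixel chunk CSVs for one 3-month block into a single table
combine_quarter <- function(dir) {
  files  <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  chunks <- lapply(files, read.csv)
  # join the pixel chunks side by side on their shared date column
  Reduce(function(a, b) merge(a, b, by = "date"), chunks)
}

# Stack the four 3-month blocks into the full year (hypothetical folder names)
quarters   <- c("chunks/2005_Q1", "chunks/2005_Q2", "chunks/2005_Q3", "chunks/2005_Q4")
year_table <- do.call(rbind, lapply(quarters, combine_quarter))

write.csv(year_table, "trmm_2005_all_pixels.csv", row.names = FALSE)
```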

Currently, the data I am extracting is satellite imagery from TRMM, the Tropical Rainfall Measuring Mission, over the Amazon basin. Rather than extract the raw images (which were collected at 3-hour intervals), I first aggregate them into daily images to get daily rainfall measurements, in image form. This is done by taking all the images for a particular day, tripling the values (each image reports a rain rate per hour, so tripling converts it into a 3-hour total), then summing them up to produce one image of precipitation values for that particular day. I need to do this for every day from 1998 (when TRMM data begins) to 2013. The next step is transforming those images and precipitation values into a spreadsheet format (the format of choice is CSV, comma-separated values, which is easier for R to work with than Excel spreadsheets and still viewable in Excel). This initially took some work, as the desired format for the spreadsheets was to have every row be the values at every pixel for a particular day (so days as rows and pixels as columns). For this step, I needed to convert an image (any image at TRMM's resolution) into a collection of features (points in this case). That was done in R, with code provided by Morgan. Once I had that collection, it was a simple matter of mapping each daily image over the pixel collection to produce one row in the resulting data table.
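The aggregation itself runs in Earth Engine, but the arithmetic is easy to show on one made-up pixel in R. The rain-rate values below are invented for illustration.

```r
# Eight 3-hourly rain-rate readings (mm/hr) for one pixel on one day, made up
rates_mm_hr <- c(0.0, 0.0, 1.2, 4.5, 2.0, 0.3, 0.0, 0.0)

# Each reading covers a 3-hour window, so multiply by 3 and sum for the daily total
daily_mm <- sum(rates_mm_hr * 3)
daily_mm  # 24 mm of rain for that day
```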

Once we have the daily data, the next step is to compare the satellite measurements against rain-gauge data gathered by ground-based stations. The reason for this comparison is that TRMM does not directly measure rainfall; instead it estimates rainfall from other data gathered by remote sensing (it is a satellite-based system). In exchange for giving up exact measurements, TRMM covers a significantly larger area than ground stations can, which in a region like the Amazon is a huge advantage. Because of that estimation, though, we may need to perform a bias correction on the TRMM data, depending on how well it matches up with the rain-gauge data. On this end, I've already produced complete data tables for 2005-2013, while Morgan is working on code to do the comparisons.
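The comparison code is Morgan's, but below is a hypothetical sketch of what a simple check and correction could look like in R; the data frame and column names are assumptions for illustration.

```r
# Assumed input: `daily` has one row per gauge per day, with columns
#   gauge_mm (station measurement) and trmm_mm (satellite estimate at that pixel)
bias <- mean(daily$trmm_mm - daily$gauge_mm, na.rm = TRUE)
bias  # positive means TRMM overestimates rainfall on average

# One simple form of bias correction: regress the gauge values on the TRMM
# values and use the fitted relationship to adjust the satellite estimates
fit <- lm(gauge_mm ~ trmm_mm, data = daily)
daily$trmm_corrected <- predict(fit, newdata = daily)
```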


July 10, 2014

In the past week I’ve learned/re-learned R, a statistical programming language. It is designed to not only compute statistics like mean, median, and standard deviation, but also to run statistical analyses like regression. On top of that, you can make very impressive graphs, charts, and diagrams with just one or two lines of code. Like Python, R is open-source and extremely flexible; you can import third-party or official packages into R just like in Python. The language and syntax turned out to be very similar to Python as well, so picking up the basics was not very difficult.
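To give a flavor of those one- or two-line results, here is a tiny example using the iris data set that ships with R:

```r
# Summary statistics in one call
summary(iris$Sepal.Length)

# A presentable boxplot, also in one line
boxplot(Sepal.Length ~ Species, data = iris)

# And a quick regression
lm(Sepal.Length ~ Petal.Length, data = iris)
```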

On the other hand, Google Earth Engine has proven to be a bit of a challenge. It is technically still in beta, so not only is there a dearth of material on how to use it, you never know whether an error in your code is due to a bug on your end or on the developers' end. Fortunately, being in the beta testers' group means you can pretty much ask the developers directly for help or debugging. I've spent the past week tinkering with JavaScript code, posting questions to the developers, and in general taking the code I've written and making sure it accomplishes its intended purpose. Already, I've begun running into the limitations of the system. For one, there is no effective way to download an entire collection of images at once (the best I've been able to do is download 7 images one by one), at least not yet. Luckily, for my current work I only need one image.

Aside from learning R via a 2-day intensive workshop, I've spent the past week writing and debugging JavaScript code for the Earth Engine API. The raw data available was obtained at 3-hour intervals, so first I needed to convert that data into daily data, which took a bit of clever thinking. Next, I had to run monthly statistics on the converted data. I thought this wouldn't be too difficult, considering that I had already written a similar script for monthly statistics on the raw data, but as it turns out, this two-step process (aggregate by day, then aggregate by month) requires a lot of computation time. Fortunately, all the hard work is done on the Google servers, which leaves my computer free for other tasks. Nevertheless, the long wait is tedious; in particular, the API does not notify you when a computation is complete. All you can do is hit run every so often and hope your results come back. I can get results quickly if I limit each computation to a single month, but that has the downside of producing many different files that would need to be downloaded separately and linked together. So there is a trade-off: export a single large spreadsheet of data that requires no further processing at the cost of runtime, or export several smaller but more rapidly computed spreadsheets that require further processing in R. For now, I am choosing the first option; the wait isn't so bad.


May 30, 2014

I started more or less as soon as spring courses ended (that is, right after finals). On the first day I went through the standard orientation items: building safety, setting up a workspace at the Energy Biosciences Institute, and a small tour. I had already met my supervisors, Avery Cohn and Morgan Levy, at the start of the month, and we had discussed a general outline of what I would be doing, what I was expected to learn in terms of skills, and of course where I would be working. I would be splitting my time between EBI and the Thompson Hydrology Lab in Davis Hall.

For the first week, I was working in EBI with Avery. Perhaps "working" is the wrong word, as I was mainly familiarizing myself with the ArcGIS software that Avery works with and reading papers in his (and Morgan's) field of research to get a sense of the terminology. The research Avery and Morgan do isn't lab-based or experiment-based but analysis-based: take data and play around with it to see whether you can draw any conclusions. Avery is working with census data from Brazil, while Morgan is working with satellite images, also from Brazil.

ArcGIS is not the only software I've been learning these past two weeks. I already know Python and Java as programming languages, and ArcGIS conveniently has Python integration, allowing one to pipe data into and out of ArcGIS, build new processing tools (and add them for later use), or use the built-in geoprocessing tools in stand-alone scripts. I'm also learning R for Morgan's lab, which is quite easy to pick up with a Python background (just some different syntax). On top of that, I've started using Google Earth Engine (still in beta) for more geospatial analysis. This Thursday, I attended a remote tutorial session, and the capabilities of Google Earth Engine turned out to be greater than Morgan expected: it can not only pull in data but also perform a decent amount of statistical analysis before passing results on to more specialized programs like R. Even better is the inclusion of a JavaScript API, which I've picked up quickly and easily thanks to its similarity to Java.

It's only been two weeks and I've already picked up a lot of useful technical skills, mostly with the new software. On top of that, I've been applying the programming knowledge from my classes. Most of this falls outside my usual coursework (except for the computer science courses, obviously), so it is a refreshing experience. I've taken several biology and chemistry lab courses, and there it's all running experiments or reactions and recording the data, with the occasional dose of analysis. Now, though, I'm taking huge amounts of data other people gathered and running statistical tests to see what conclusions or trends I can draw. There is certainly plenty of data out there from Google Earth Engine alone.


