Paths to data-driven science

For as long as I can remember, I have been driven by my desire to understand how the world works. One question stuck in my mind during my junior year of high school: “What are the differences between living and non-living things?” It seemed to me that if you knew where all of the atoms were and how the atoms interacted, you should be able to predict the behavior of any physical system and that this should be true of living and non-living things alike. None of these ideas were new, of course. I was beaten to the punch by at least 200 years, but these kinds of questions continue to motivate my intellectual pursuits even today.

When I was in high school, it was typical for students to first take biology, then to take chemistry, and then to take physics. By sheer luck, the school that I went to had started an experiment during my freshman year. Students were allowed to take two years of physics, then to take chemistry, and then to take biology. These high school science classes, the logical order in which the subjects were presented to me, and the math and computer programming classes that I took during elementary school and high school, set the tone for my intellectual life at an early stage. I was beginning to (re)discover what scientists have known for a long time now, namely that “What is life?” is a very difficult question to answer, but also that attempting to answer that question can be a fruitful way of understanding living and non-living things alike.

During my time at the University of Wisconsin, I found that there were so many interesting classes to take that I decided to stay for five years and to add majors in mathematics, molecular biology, and astronomy to my physics major. Where there was extra time, I crammed in chemistry and computer science courses. After graduation, I enrolled in the Physics PhD program at the University of California, San Diego. When it came time to look for research groups, I went to meet with the faculty members at the Center for Theoretical Biological Physics (which is now located at Rice University, where I ended up finishing my PhD) and joined the group of Peter Wolynes. Joining Peter’s group was the start of a ten-year dive into the field of molecular biophysics.

Molecular biophysics, and theoretical molecular biophysics in particular, is a wonderful field for someone with my combination of interests. Whether it is math, computation, physics, chemistry, or biology, learning about any and all of these subjects is considered a good use of time when you are studying theoretical molecular biophysics. Since then, I have been involved in research into molecular simulation techniques, protein folding, protein structure prediction, evolution, and the molecular mechanisms that underlie disease including, notably, neurodegenerative diseases. After completing my PhD, I joined a group at Aarhus University in Denmark where I developed new experimental methods for studying protein biophysics. My work in the area of molecular biophysics has resulted in more than 30 publications that have been cited more than 600 times.

During my time as a molecular biophysicist, I have enjoyed formulating and testing models that attempt to explain how the world works and that can be used to make predictions about the future. Relatively simple physical processes sometimes admit of mathematically compact descriptions that reliably make predictions with high precision. Important examples of these kinds of theories include Newtonian mechanics, electromagnetism, and the special and general theories of relativity. More complex systems, which typically involve many interacting particles or agents, often admit only of statistical descriptions that yield probabilistic predictions. The most prominent theory of this type is statistical mechanics. The problems that admit only of statistical descriptions, including almost all problems that are of practical interest, vastly outnumber the problems that admit of compact non-probabilistic descriptions.

Two related innovations have, for the first time in history, enabled scientists to make usefully accurate statistical predictions about nearly arbitrarily complicated physical systems. The first is the development of technologies that allow us to collect and store very large amounts of data. The second is the development of methods to infer predictive models directly from large data sets without having to explicitly formulate theories by thinking about the physics of the constituent parts of the system. Suddenly, it is no longer necessary to understand how a system works in order to be able to make usefully accurate predictions of how it will behave in the future. Instead of doing the thinking ourselves, we get the machines to do the thinking for us (and we spend our time thinking about how to make the machines think). Hence the names of some of the hottest and most disruptive innovations in recent human history: machine learning and artificial intelligence.

Many subproblems of estimation and classification within more traditional scientific frameworks are being approximately solved using machine learning techniques. One example that is particularly close to my heart is the inference of energy functions for protein folding from protein structural data. Machine learning is also taking over much of the commercial and industrial world: wherever data exists, attempts are being made to exploit it to derive valuable insights. My path to studying data science and machine learning has taken me through a traditional theoretical science background and to a place where I am now trying my hand at predicting the behavior of complex systems by performing inference on data.

I am interested in practical and fundamental problems alike. The use of artificial intelligence raises many interesting ethical questions. Some of these ethical questions are related to the interpretability, or lack thereof, of the algorithms that are being employed. There are also connections between neural networks (a commonly employed machine learning technique) and aspects of physics, including the physics of protein folding.

Schematic representation of a protein energy landscape

Written on May 1, 2019