Why do we choose the tools we do? If I'm asked to hang a picture in my son’s room, I grab a hammer and a nail or hook. If I’m asked to write a program, then I grab a programming language (compiler) and toolset. Equally easy, no? The challenge is in choosing the right tool—programming language—for the job. A framing hammer would not work to install trim, for instance; a dryer would ruin fine linen. And so it goes.
Why choose one tool over another? Just about any hammer can be used to hang a picture, and any Turing-complete programming language can solve the same set of problems that any other Turing-complete language can. So why choose one and not another?
A DIY website discussing which hammer to choose offers good advice on this very topic.
This advice is equally good for deciding which tools to use when developing a software package to accomplish a task. However, unlike the simple task of choosing the right hammer and nail combination, external factors can affect decisions about which tools teams select to tackle a large undertaking.
An anecdote from a friend’s job illustrates this effect. The company had a large software package written 100% in Fortran (being the first high-level programming language has given it 67 years of optimization and speed tweaks, an advantage unique to this language). One day my friend inquired why Fortran had been chosen over C (development had begun in the mid-1980s, so there were far fewer choices). The answer was simple: the group had a copy of a Fortran compiler and did not want to shell out the thousands of dollars a C compiler would have cost—how times have changed!
Early Steps in the Decision-Making Process
An engineer would want to optimize the tools they use, but one question must be answered first: optimized for what? Developers know that when compiling software there are often two optimization targets, speed and size, and you rarely get both. The same decision must be made when choosing tools. In developing the Protenus privacy platform, had we wanted to optimize for pure speed, we would have used C and/or Fortran with raw MPI. That path would have come at the expense of developer productivity—and sanity (MPI_ERR_UNKNOWN: remote execution returned Cthulhu).
Were we concerned about size, then different choices would have been made. Since we run our applications in AWS, storage space isn’t such an issue. Instead, our focus as we chose our tools was on developer productivity/happiness and on delivering value to our customers; everything we do is meant to optimize our way to our goals.
Choosing Tools for Multiple Aims: Protenus Privacy Development Group
Further complications arise when a given group has multiple goals. I have spent the past 14 years of my career in mixed groups where research into new algorithms and models was carried out alongside production work. This adds a wrinkle because researchers (often trained as scientists or engineers, not necessarily as developers) are concerned less with the memory efficiency of an algorithm and more with whether some out-there idea will work at all.
On the flip side of that problem are those tasked with turning research ideas into production-quality code, where one needs to focus on memory use, runtimes, and thread safety. The solution to this problem is to let team members use the tools that work best for them. If that means supporting multiple languages and platforms, then the overhead associated with that can be worth it.
For example, the Protenus privacy development group uses three major languages to accomplish our goal of detecting privacy breaches: Scala, Python, and shell scripting. Each language plays a slightly different role, letting team members accomplish their own tasks while we all work toward our common goal of alerting customers about suspicious behavior, potential privacy breaches, and cases of drug diversion. Even within each language, multiple tools are used to further enhance our productivity and allow us to get more done. If the goal is to be productive, then allowing people to use the tools that let them do their job quickly and efficiently is key.
Our privacy application, which our customers rely on every day, is written in Scala. Using this language gives us strong type safety, memory safety, access to the JVM tool ecosystem, and a native implementation of the Apache Spark library. Given the volume of data we process for each client every day (200,000,000 accesses per client per month), a distributed architecture is a necessity. While it may not be the fastest option—the Fortran/C-with-MPI stack mentioned above has the potential to be much faster—using a popular library such as Spark allows us to be more agile and make large-scale changes with far less effort. We can focus on the business problems at hand rather than fighting with a framework. By trading small losses in performance for the benefits gained by using this language/library combination, we are able to quickly meet customers' needs and requests.
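To give a flavor of what that workload looks like, the core per-user aggregation that Spark distributes across a cluster can be sketched at toy scale in plain Python. Everything here—the record shape, the user IDs, and the threshold—is illustrative, not our production logic:

```python
from collections import Counter

# Hypothetical access records as (user_id, patient_id) pairs drawn from
# audit logs. In production this would be a distributed Spark DataFrame
# with hundreds of millions of rows, not an in-memory list.
accesses = [
    ("u1", "p1"), ("u1", "p2"), ("u1", "p3"),
    ("u2", "p1"), ("u2", "p1"),
]

# The aggregation Spark parallelizes for us: count accesses per user.
counts = Counter(user for user, _ in accesses)

# Flag users whose access volume exceeds a made-up threshold; real
# detection uses learned models, not a fixed cutoff.
THRESHOLD = 2
flagged = sorted(u for u, n in counts.items() if n > THRESHOLD)

print(counts["u1"], flagged)  # → 3 ['u1']
```

The shape of the computation—group, count, filter—is the same whether it runs on one laptop or a Spark cluster; the library's value is making the distributed version look almost this simple.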
For our team of researchers, made up of data scientists and engineers, even the speed with which we are able to make changes and test ideas in our main Scala-based application is not fast enough. They need to be able to prototype and test at the speed at which their brains come up with ideas. For these tasks, we readily leverage the Python programming language and supporting ecosystem. This lets our world-class scientists and engineers rapidly make changes to explore a problem that to date may not have any solution, much less an efficient one.
For data scientists and engineers, the ability to visualize data is important, so another favorite tool is matplotlib, a Python plotting library that makes it simple to create graphs of all types and lets these brilliant minds tell a story about the data we are interested in. In combination with interactive tools such as Jupyter notebooks, our researchers can quickly probe data sets and determine how we can improve the inner workings of our machine learning offerings. Rapid prototyping and data exploration also matter for our data engineers, whose audience is not customers but our data scientists and internal stakeholders. Python allows them to quickly generate charts and images that highlight key performance indicators (KPIs) showing how well our system is performing and where we can best prioritize our research energy.
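A minimal sketch of that kind of KPI chart, using matplotlib's off-screen backend as a batch job would (the metric name and numbers are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, as a scheduled report job would
import matplotlib.pyplot as plt

# Hypothetical weekly KPI: alerts resolved per week (illustrative numbers).
weeks = [1, 2, 3, 4]
alerts_resolved = [120, 135, 128, 150]

fig, ax = plt.subplots()
ax.plot(weeks, alerts_resolved, marker="o")
ax.set_xlabel("Week")
ax.set_ylabel("Alerts resolved")
ax.set_title("Hypothetical KPI: alert resolution over time")
fig.savefig("kpi.png")
```

In a Jupyter notebook the same few lines render inline, which is what makes the explore-plot-refine loop fast enough to keep up with an idea.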
While our researchers are on the leading edge of machine learning in the healthcare privacy and analytics space, that doesn’t mean we can’t use tools that others have developed to go even further. Scientific progress is built “by standing on the shoulders of giants”—something that is exceptionally true in software development. Data scientists in the privacy group have leveraged machine learning work that has come out of Amazon. We have a core technology that powers our privacy AI, but that doesn’t mean it is the best possible one. Part of our work is being on the lookout for other technologies that can aid us. To this end, data scientists in the privacy research group have used Amazon’s SageMaker to rapidly test whether a different machine learning technology would outperform our current system (implications of the no-free-lunch theorem aside). Again, using technology and tooling suited to the task at hand allows us to continuously deliver new value to our customers.
Using the right tool for the job allows each employee and group (within reason, and within what devops and external constraints can support) to be their most productive. It is also a straightforward application of one of the principles behind the Agile Manifesto: build projects around motivated individuals, give them the environment and support they need, and trust them to get the job done.
Choosing tools that allow individuals to contribute as best they can is one of the many ways in which Protenus R&D groups contribute to the success of the company and maintain our competitive edge.
Protenus is an exciting place to work—and we are hiring.