Blog 04 - Beyond the Buzzwords: Life as a data scientist at Rosa Biotech
Data science and artificial intelligence (AI) have become hugely popular in recent years. There has been considerable excitement about the possibility that they will lead to a new industrial revolution, but also questions about the threat of AI taking over human civilisation. I don’t believe we need to worry about the latter yet - instead, data science and AI can provide business and research with tools that were never possible before. As Rosa Biotech’s data scientist, I will try to cut through the noise and shed some light on what the day-to-day life of a data scientist looks like.
As a quick introduction, I use machine learning, a subset of AI that allows systems to learn relationships from experience, where the experience is the data supplied to them. It comes in different forms, commonly called models, whose structure ranges from probability equations (e.g. Gaussian Naive Bayes) to simulations of the neurons in a brain (e.g. neural networks).
With great data, comes great responsibility
My job is to extract meaning from and interpret data, and working with data comes with great responsibility. Rosa’s wet-lab scientists (Jordan Fletcher, Arne Scott and Ulrike Obst) have refined the company’s sensing platform with their years of experience. Each experiment is carefully performed to produce as clean and informative a dataset as possible. Therefore, if there is relevant information in this dataset, I had better be sure I find it! This is the most exciting part - a fresh new dataset is full of possibility and you don’t always know what insights you will find until you start digging.
My data analysis pipeline generally looks something like this:
1. Looking at the data to understand it in as many ways as possible
2. Cleaning the data to refine the information we are looking for
3. Applying transformations, which may include adjusting a dataset to a normal distribution, standardisation or dimensionality reduction
4. If necessary, training machine learning models on the data for classification or prediction, and don’t forget about the extensive model validation!
A typical machine learning process is illustrated below.
I usually iterate through the first three steps until I can almost separate the data by eye. This results in an informative, clean dataset, which is vital - a machine learning model is only as good as the data it is provided with.
Rosa’s sensing platform is based on detecting the displacement of a dye from an array of alpha-helical barrel proteins. We use machine learning to recognise patterns of displacement that are characteristic of a particular single analyte or complex mixture, which we call a ‘fingerprint’. In this way, we can interpret high-dimensional data in a way our eyes alone could not. To do this, we use a lot of statistical and interpretable machine learning models, which can largely be described through maths and statistics. These are often favoured in medical and biological applications, as we are able to illustrate how a model has come to a particular decision.
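To make the steps above a little more concrete, here is a minimal sketch in plain Python: standardise the features, train a Gaussian Naive Bayes classifier and validate it on held-out data. The toy two-feature dataset, the class labels "A" and "B" and the function names (standardise, fit_gaussian_nb, predict) are all assumptions made for illustration, not the code or data we use at Rosa.

```python
import math
import random
import statistics

def standardise(rows, means=None, stdevs=None):
    """Scale each feature (column) to zero mean and unit spread."""
    columns = list(zip(*rows))
    if means is None:
        means = [statistics.fmean(col) for col in columns]
        stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    scaled = [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]
    return scaled, means, stdevs

def fit_gaussian_nb(rows, labels):
    """Estimate a prior, plus a mean and variance per feature, for each class."""
    model = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        columns = list(zip(*members))
        model[label] = {
            "log_prior": math.log(len(members) / len(rows)),
            "means": [statistics.fmean(c) for c in columns],
            # Small constant keeps the variance away from zero.
            "variances": [statistics.pvariance(c) + 1e-9 for c in columns],
        }
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(params):
        total = params["log_prior"]
        for x, mean, var in zip(row, params["means"], params["variances"]):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))

# Hypothetical toy data: two well-separated classes with two features each.
rng = random.Random(0)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], "A") for _ in range(40)]
data += [([rng.gauss(6, 1), rng.gauss(6, 1)], "B") for _ in range(40)]
rng.shuffle(data)
train, test = data[:60], data[60:]

# Fit the scaling on the training set only, then reuse it for the test set.
train_x, means, stdevs = standardise([x for x, _ in train])
test_x, _, _ = standardise([x for x, _ in test], means, stdevs)

model = fit_gaussian_nb(train_x, [y for _, y in train])
correct = sum(predict(model, x) == y for x, (_, y) in zip(test_x, test))
accuracy = correct / len(test)
print(f"Hold-out accuracy: {accuracy:.2f}")
```

The scaling parameters are learnt from the training set alone, so no information from the held-out samples leaks into the model. In practice a library such as scikit-learn would handle most of this, and validation would usually go further than a single train/test split.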
Illustration of the general machine learning pipeline for a classification problem.
Communication is key
Once data analysis and modelling are complete, the way the results are communicated to the rest of the team is just as important as the results themselves. It is vital as a data scientist to have a solid understanding of the mathematical and statistical principles beneath the data analysis and modelling processes. Whether a result is good or bad, we need to understand and communicate, as far as we can, why this is the case. Problems such as ‘overfitting’ - in which a model performs extraordinarily well on the data it was trained on, but does not generalise to unseen data (the test set, or future data applied to the model) - can have big consequences when business decisions depend on the outcome!
Strong data visualisation is a big part of communication too. I am constantly trying to find new and better ways to illustrate the data and the results that come with it, from colourful graphics that explain abstract methods, such as dimensionality reduction with principal component analysis (shown below), to simple box plots that focus on individual signals from our array. The clearer and more informative, the better.
Illustration of Principal Component Analysis (PCA), a dimensionality reduction technique that lets us view high-dimensional data along the few directions in which it varies most. It also makes for a cool graphic!
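Overfitting is easy to demonstrate with a toy example. The sketch below is an assumption-laden illustration (made-up one-feature data with 20% of labels deliberately flipped, and hypothetical function names): a model that simply memorises its training points scores perfectly on them, yet does worse on fresh data drawn in the same way.

```python
import random

rng = random.Random(1)

def make_samples(n):
    """One feature; the true label is x > 0, but 20% of labels are flipped."""
    samples = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        label = x > 0
        if rng.random() < 0.2:
            label = not label  # label noise
        samples.append((x, label))
    return samples

train, test = make_samples(50), make_samples(200)

def memoriser(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train, key=lambda sample: abs(sample[0] - x))[1]

def simple_rule(x):
    """A much simpler model: predict from the sign of x alone."""
    return x > 0

def accuracy(model, samples):
    """Fraction of samples for which the model predicts the stored label."""
    return sum(model(x) == y for x, y in samples) / len(samples)

train_acc = accuracy(memoriser, train)
test_acc = accuracy(memoriser, test)
print(f"Memoriser:   train {train_acc:.2f}, test {test_acc:.2f}")
print(f"Simple rule: train {accuracy(simple_rule, train):.2f}, "
      f"test {accuracy(simple_rule, test):.2f}")
```

The gap between the memoriser's training and test scores is the warning sign to look for. The simple rule cannot memorise the noise, so its two scores should typically sit much closer together.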
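For the curious, the idea behind PCA can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not a production implementation: it finds only the first principal component, uses power iteration rather than a full eigendecomposition, and runs on a made-up three-feature dataset in which the second feature is roughly twice the first. The function name first_principal_component is hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return the unit vector along which the centred data varies most."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centred = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centred) / n for j in range(dims)]
           for i in range(dims)]
    # Power iteration: repeatedly applying the covariance matrix pulls a
    # vector towards the eigenvector with the largest eigenvalue.
    # (Assumes the starting vector is not orthogonal to that eigenvector.)
    vector = [1.0] * dims
    for _ in range(iterations):
        vector = [sum(cov[i][j] * vector[j] for j in range(dims))
                  for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vector))
        vector = [v / norm for v in vector]
    return vector, centred

# Hypothetical data: feature 2 is about twice feature 1, feature 3 is small noise.
rows = [[t, 2 * t + 0.1 * (-1) ** k, 0.1 * ((k % 3) - 1)]
        for k, t in enumerate(range(-3, 4))]

component, centred = first_principal_component(rows)

# Project every sample onto the component: three features become one score.
scores = [sum(c * v for c, v in zip(row, component)) for row in centred]

# Share of the total variance captured by that single score.
total_variance = sum(sum(v * v for v in row) for row in centred) / len(rows)
explained = (sum(s * s for s in scores) / len(rows)) / total_variance
print([round(v, 3) for v in component], round(explained, 4))
```

Because almost all of the variation in this toy data lies along one direction, a single score per sample retains nearly all of the information. That is what makes the technique so useful for plotting high-dimensional data.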
If it ain’t broke, fix it anyway!
Of course, to get all this done, the vast majority of my time is spent coding. As a data scientist, I use code as a tool to perform my data analysis and to build machine learning models. Python is my language of choice, as I believe it provides a wealth of extra functionality over other languages available in the data science field. In a business setting, robust and optimised code is also very important. As the only computational scientist at Rosa, it is important that I go beyond simply using code as a tool - I need to create beautiful code as well. Coming from a physical sciences background rather than a computer science one, this change in perspective has been the area in which I have learnt the most (shoutout to Stack Overflow!). A lot of my time is spent refining the code to make it as robust as possible, as well as employing strict unit testing, continuous integration and version control. On this journey to better coding, one of my mantras is:
‘If it ain’t broke, fix it anyway!’
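As a small taste of what unit testing looks like in practice, here is a sketch using Python's built-in unittest module. The function under test, standardise_column, is a hypothetical helper invented for this example rather than anything from Rosa's codebase. The tests cover the normal case plus two edge cases that are easy to forget.

```python
import statistics
import unittest

def standardise_column(values):
    """Rescale a list of numbers to zero mean and unit spread."""
    if not values:
        raise ValueError("cannot standardise an empty sequence")
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        # A constant column carries no information; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]

class StandardiseColumnTests(unittest.TestCase):
    def test_output_has_zero_mean_and_unit_spread(self):
        result = standardise_column([2.0, 4.0, 6.0])
        self.assertAlmostEqual(statistics.fmean(result), 0.0)
        self.assertAlmostEqual(statistics.pstdev(result), 1.0)

    def test_constant_input_does_not_divide_by_zero(self):
        self.assertEqual(standardise_column([5.0, 5.0, 5.0]), [0.0, 0.0, 0.0])

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            standardise_column([])

# Run the tests programmatically so the script carries on afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(StandardiseColumnTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project these tests would normally live in their own file and run automatically under continuous integration every time the code changes.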
The learning doesn’t stop
Data science and machine learning are among the most popular and rapidly evolving fields in the world right now. To ensure I do my job to the best of my ability, it is important to keep up with everything going on in the space. Every day, I make sure I read material from the field, both related and unrelated to my current projects. This keeps my knowledge of the mathematical principles in data science fresh so I can communicate better with others, helps me constantly discover new ways to analyse and visualise the data, and improves my coding skills along the way.
This also allows me to keep one eye fixed on the future. In a startup environment, I need to bear in mind where I would like to see the software stack a year or more from now. Therefore, I am constantly learning: to prepare for the future, and to decide in what direction we should go. For example, at Rosa, my plans include moving some data processing to a database in an effort to automate our platform and, longer term, migrating this platform to the cloud.
Data science is a career I would recommend highly, and applying it to novel technology here at Rosa Biotech is an exciting opportunity. Watch this space for great things to come!