
SE Radio 641 Catherine Nelson on Machine Learning in Data Science



Metadata

Highlights

  • Notebooks vs. Git
    • Jupyter Notebooks are great for initial exploration and data interaction.
    • Refactor into a Git repository when retraining models or needing robustness (a minimal refactoring sketch follows this highlight).
    • Transcript:
      Philip Winston: I'd like to talk a little bit about the use of notebooks like Google's Colab in data science. This is a technique or a method that I think is more common in data science than in software engineering at large. So I'm wondering what are the pros, and if there are any cons, of doing your work inside of a notebook.
      Catherine Nelson: Definitely. I'm a huge fan of Jupyter Notebooks. I love being able to get instant feedback on what my code is doing, which is particularly useful when I'm looking at data all the time. I can print that data, I can plot a small graph of that data, I can really interact with that data while I'm coding. I find them incredibly useful when I'm starting a project: I don't quite know where things are going, I'm really exploring around and trying to see what the data I'm working with can do for me, or I'm starting with a basic machine learning model and seeing if it learns anything about the problem that I'm working on.
      Philip Winston: What sorts of signs are there that maybe you need to switch to just a traditional Git repository? What starts to become difficult with a notebook?
      Catherine Nelson: For me, I refactor when I'm at the point where I want to train that model repeatedly. So in a machine learning problem, I have chosen the features that I want to work with, I've chosen the data that I want to work with, and I've trained an initial model. It's getting a reasonable result, but then I want to train it repeatedly, optimize the hyperparameters, and eventually move towards deploying it into production. So I think that's the main difference: when I just have code that I may only run once, I don't know where I'm going, and I don't know exactly what the final code base will look like, that's when I'm happiest in a Jupyter notebook. (Time 0:07:00)
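A minimal sketch of the kind of refactoring Nelson describes: pulling exploratory notebook cells into a small, version-controlled module with a parameterized train function, so retraining is repeatable. The file layout, column names, model choice, and paths are my own illustrative assumptions, not details from the episode.

```python
# train.py -- hypothetical module extracted from exploratory notebook cells.
# Paths, feature names, and model choice below are illustrative assumptions.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

FEATURES = ["feature_a", "feature_b"]   # chosen during notebook exploration
TARGET = "label"

def load_data(path: str) -> pd.DataFrame:
    """Replaces the ad hoc read_csv cell at the top of the notebook."""
    return pd.read_csv(path)

def train(df: pd.DataFrame, n_estimators: int = 100, random_state: int = 42):
    """One parameterized function instead of scattered cells, so the model
    can be retrained repeatedly with different hyperparameters."""
    X_train, X_val, y_train, y_val = train_test_split(
        df[FEATURES], df[TARGET], test_size=0.2, random_state=random_state
    )
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_val, model.predict(X_val))
    return model, accuracy

if __name__ == "__main__":
    model, accuracy = train(load_data("data/training_data.csv"))
    print(f"validation accuracy: {accuracy:.3f}")
    joblib.dump(model, "model.joblib")   # artifact a pipeline could pick up later
```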
  • Data Scientist vs. ML Engineer
    • Data scientists explore problems and determine ML suitability.
    • Machine learning engineers take over for productionization and monitoring.
    • Transcript:
      Philip Winston: In a little while, we're going to talk through the steps in a machine learning workflow, focusing on what it would be like to make an automated, reusable pipeline out of them. But let's talk a little bit more about roles. I think you mentioned data analysts relative to data scientists. Let's talk about a machine learning engineer, which certainly comes up. What is your feeling about their role and how it differs from either data scientist or software engineer?
      Catherine Nelson: At many companies, I think it's the data scientists that will make some initial explorations and take a fresh problem and say: is this even a problem that we should be solving with machine learning? What type of algorithms are suitable for this particular problem? Train an initial model, prove that that model is going to answer the question that's under consideration. And then it's the machine learning engineer that takes over when that has been established, when those initial experiments have been done, and then puts that model into production. Then they look more at the side of monitoring that model: checking its performance, checking that the inference happens in the right amount of time, and so on. (Time 0:09:19)
  • Key Skills for Data Scientists
    • Data scientists should learn to write tests and use version control (a small pytest sketch follows this highlight).
    • These practices ensure code robustness and maintainability.
    • Transcript:
      Catherine Nelson: Where I see there's often a gap is in writing tests. That's often something that's not familiar to people from a data science background, and that's because data science projects can be so ad hoc, so exploratory. It's not obvious when to add tests. You can't add tests to every single piece of code that you're writing in a data science project, because half of them you're going to throw away when you find that that particular line of inquiry goes nowhere. There's not really a culture of going back and adding those tests later. But if you then move on from that exploratory code to putting your machine learning model into production, it's a problem if your code's not tested. Another one is that, again, it comes from this exploratory nature: often data scientists are reluctant to use version control when it's just an individual project. It seems like it's more hassle than it's worth. It's not obvious what the benefits of that are until you start working on a larger code base. (Time 0:13:35)
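As a concrete illustration of the testing gap she describes, here is a hedged sketch of the kind of test a data scientist might add once exploratory code starts moving toward production. The clean_amounts helper and its behavior are hypothetical examples, not something discussed in the episode.

```python
# features.py -- a hypothetical preprocessing helper worth keeping (and testing)
import pandas as pd

def clean_amounts(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a missing amount and convert the column to float."""
    cleaned = df.dropna(subset=["amount"]).copy()
    cleaned["amount"] = cleaned["amount"].astype(float)
    return cleaned


# test_features.py -- run with `pytest`
def test_clean_amounts_drops_missing_and_casts():
    raw = pd.DataFrame({"amount": ["1.5", None, "2.0"]})
    cleaned = clean_amounts(raw)
    assert len(cleaned) == 2                    # the row with a missing amount is gone
    assert cleaned["amount"].dtype == float     # string values were cast to float
```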
  • Data Ingestion Process
    • Data ingestion involves taking data from company infrastructure and feeding it into the pipeline.
    • It may include splitting data into training and validation sets (a minimal sketch follows this highlight).
    • Transcript:
      Philip Winston: Basically, what sorts of tools or techniques should we keep in mind for each step? The first one I have down is data ingestion. I guess there are a lot of different projects, but what are some things we might be ingesting, and what are we feeding it into?
      Catherine Nelson: Yeah, this step is when you take your data from wherever it's stored in your company's infrastructure and feed it into the rest of the pipeline. This is the point where you might also make the split into training data and validation data. It's picking up that data from whatever format it's stored in and then potentially transforming it into a format that can move through the rest of that pipeline. (Time 0:17:36)
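A minimal ingestion sketch under stated assumptions: reading a table from a Parquet file (standing in for whatever storage the company actually uses) and making the train/validation split she mentions. The paths, file format, and 80/20 ratio are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical source: in practice this might be a warehouse query or object-store read.
raw = pd.read_parquet("s3://company-data/events.parquet")

# Split once at ingestion time so every downstream step sees the same partitions.
train_df, val_df = train_test_split(raw, test_size=0.2, random_state=42)

# Persist in the format the rest of the pipeline expects.
train_df.to_parquet("pipeline/train.parquet")
val_df.to_parquet("pipeline/validation.parquet")
```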
  • Pipeline Retraining
    • Pipelines are rerun when data changes or model performance degrades (a retraining-trigger sketch follows this highlight).
    • Retraining usually involves rerunning the entire pipeline, so the deployable model artifact is actually updated.
    • Transcript:
      Philip Winston: Let's pause for a second and talk about when we would rerun the pipeline, or why we're rerunning the pipeline. If this were just a one-off exploratory investigation, we'd create a model, produce a visualization, and that would be the end. But in this case, we're talking about building a pipeline. So when is it that we rerun this pipeline? Is it because we have new data? Is it because we're trying to train a better model, or what situations? And related to that: do we rerun the entire thing, or can we rerun portions of it?
      Catherine Nelson: For many business problems, the data doesn't stay static. The data changes through time, people behave differently with your product, and so on. That causes the model performance to degrade with time, because if you've trained a model at a specific point in time, it's been trained on that data, and then as your usage pattern changes, that model is not quite so relevant to the new data. So the performance drops, and that's the time when you might want to retrain that model. And usually you'd want to run the entire training pipeline all the way through. If you just run part of it, you don't actually change anything, because the artifact that you get at the end of the pipeline is what you're going to deploy into production. (Time 0:20:35)
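A hedged sketch of the retraining trigger she describes: compare the live model's recent accuracy against its accuracy at deployment time, and kick off a full pipeline run when it degrades past a threshold. The metric values, threshold, and run_full_pipeline function are all assumptions for illustration.

```python
# Illustrative retraining trigger; the metric source and threshold are assumptions.
ACCURACY_AT_DEPLOYMENT = 0.91   # recorded when the current model was validated
MAX_ALLOWED_DROP = 0.05         # retrain if accuracy falls more than 5 points

def should_retrain(recent_accuracy: float) -> bool:
    """Data drift shows up as a drop in live accuracy relative to deployment time."""
    return recent_accuracy < ACCURACY_AT_DEPLOYMENT - MAX_ALLOWED_DROP

def run_full_pipeline() -> None:
    # Placeholder: a real implementation would trigger the whole pipeline
    # (ingestion -> preprocessing -> training -> analysis -> deployment),
    # since only a complete run produces a new deployable artifact.
    print("Triggering full pipeline run...")

recent_accuracy = 0.84          # e.g. computed from recently labeled production data
if should_retrain(recent_accuracy):
    run_full_pipeline()
```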
  • Data Preprocessing vs. Validation
    • Data preprocessing, or feature engineering, transforms raw data into usable model inputs (a small sketch follows this highlight).
    • This differs from data validation, which describes and checks data integrity.
    • Transcript:
      Philip Winston: Let me read the next four steps, so we have some idea where this is going and maybe what to talk about at which step. I have next data pre-processing, then model training, then model analysis and validation, and then deployment. So let's talk about data pre-processing next. I don't know if this is an official step or if it depends on the workflow, but how is pre-processing different from the previous steps?
      Catherine Nelson: Pre-processing is often synonymous with feature engineering. So that's translating the raw data into something that you can use to train the model. (Time 0:25:00)
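A minimal feature-engineering sketch in the spirit of that answer: scale the numeric columns and one-hot encode a categorical one so the raw table becomes model-ready inputs. The column names and the choice of scikit-learn transformers are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data straight out of the ingestion step.
raw = pd.DataFrame({
    "age": [34, 52, 29],
    "monthly_spend": [120.0, 80.5, 310.0],
    "country": ["US", "DE", "US"],
})

# Feature engineering: numeric columns get scaled, categoricals get one-hot encoded.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "monthly_spend"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

features = preprocess.fit_transform(raw)   # inputs ready for model training
print(features.shape)                      # (3, 4): two scaled columns + two country columns
```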
  • Collaboration in Pipeline Development
    • Data scientists should be involved in initial pipeline setup, especially data validation criteria, hyperparameter selection, and model analysis.
    • Software engineers are valuable for debugging, integration, testing, and robust code.
    • Transcript:
      Philip Winston: I guess, taking a step back for a second: during all of these steps of creating a pipeline, in what cases are we able to just hand this over to software engineers and give them the information about the model? And in what cases do you feel the data scientist needs to be involved? What's the trade-off between a handoff situation and a collaboration, a side-by-side situation?
      Catherine Nelson: Part of this is going to depend on the team that you have and the skill sets available. But I would say it's very useful to have the data scientist involved in setting up the initial pipeline. In particular, things like: what are the criteria for the data validation step? What is a sensible distribution of your data? What are the hyperparameters that you should be considering when you're training the model? And particularly in the step that we haven't talked about yet, which is the model analysis step, I think that's where the data scientist has a really crucial part to play. I think any data scientist can learn the skills that they need to deploy a pipeline, but often being able to debug that complex system, being able to set it up so that it interfaces with the rest of the product, making sure that it's well tested and so on, is where software engineers are especially valuable. (Time 0:30:19)
  • Model Analysis and Validation
    • Model analysis validates model performance (accuracy, precision, recall, bias); a per-slice metrics sketch follows this highlight.
    • It's a final go/no-go step before production deployment.
    • Transcript:
      Philip Winston: ... scales and so on. You mentioned model analysis and validation; that's the next step. Because the word's the same, how is this different from data validation? I guess it's a question of what we are validating.
      Catherine Nelson: Yeah, so this is where we're looking at the performance of the model in terms of how accurate it is, what's the precision and recall, and also sometimes splitting that accuracy down into finer-grained sectors. So if you had a model that you were deploying in lots of different countries, does it perform equally well on the data from all those countries? That's something that you could do with your validation data, which is the split of your data into training data and validation or test data. And I know that we're using the word validation way too many times here, but that seems to be the way the terminology has gone. So analysis is looking at that accuracy across different aspects. This is the point where you might look for bias in your model as well. Is it providing better performance for certain groups? Is it providing better performance on your female users versus your male users? That would be something you'd want to look for at this step. And then the validation part is that the model should only be deployed if it is acceptable on all the analysis criteria. (Time 0:32:20)
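A hedged sketch of the analysis-and-validation step she describes: compute precision and recall per country slice of the validation data, then approve deployment only if every slice clears a threshold. The column names, threshold, and overall structure are illustrative assumptions.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical validation results: true labels, model predictions, and a slicing column.
results = pd.DataFrame({
    "country":    ["US", "US", "DE", "DE", "DE", "US"],
    "label":      [1, 0, 1, 1, 0, 1],
    "prediction": [1, 0, 0, 1, 0, 1],
})

MIN_RECALL = 0.75   # go/no-go criterion, chosen for illustration

def slice_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Break overall performance down into finer-grained sectors (here: per country)."""
    rows = []
    for country, group in df.groupby("country"):
        rows.append({
            "country": country,
            "precision": precision_score(group["label"], group["prediction"]),
            "recall": recall_score(group["label"], group["prediction"]),
        })
    return pd.DataFrame(rows)

metrics = slice_metrics(results)
print(metrics)

# Validation: deploy only if the model is acceptable on every slice.
approved = bool((metrics["recall"] >= MIN_RECALL).all())
print("deploy" if approved else "do not deploy")
```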
  • LLM for Flight Details Extraction
    • Catherine found LLMs incredibly effective for extracting flight details from emails (a hedged prompt sketch follows this highlight).
    • A five-line prompt outperformed previous, far more complex solutions.
    • Transcript:
      Philip Winston: Continuing to wrap up, what are you excited about looking ahead, in machine learning projects you're working on or that you see in the wider industry?
      Catherine Nelson: Having relatively recently started working with LLMs, I'm just so blown away by the capabilities at the moment. I've been working on a project to showcase a good example of a use of an LLM for a startup I'm working with, and the project we decided to choose was extracting people's flight details out of an email. So you send the service an email with your flight details, and it will extract the origin, the destination, the time of departure, the time of arrival, and so on, and populate those into whatever kind of app you want to work on. I've worked on similar projects before and seen things like big piles of regular expressions, or all this complex feature engineering, to get this out of the email. But now I can do it in a five-line prompt to OpenAI, and it works better than all those previous, incredibly complicated solutions. And you can even do things like ask it for the airport code instead of the name of the city, and even if the airport code isn't in the email, you can still get it, because the LLM has that context. (Time 0:44:39)
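In that spirit, here is a hedged sketch of the kind of short prompt she describes, using the OpenAI Python client to pull structured flight details (including an airport code that may not literally appear in the email) out of free text. The model name, prompt wording, and JSON field names are my assumptions, not hers.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

email_body = """Hi! Just booked my trip: departing Seattle on June 3 at 9:15 AM,
arriving in Amsterdam on June 4 at 6:30 AM. See you there!"""

# The "five-line prompt" idea: ask directly for structured fields, including airport
# codes the LLM can infer even when they aren't written in the email.
prompt = (
    "Extract the flight details from the email below. "
    "Return JSON with keys: origin_airport_code, destination_airport_code, "
    "departure_time, arrival_time.\n\n" + email_body
)

response = client.chat.completions.create(
    model="gpt-4o-mini",                      # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # ask for parseable JSON back
)

flight = json.loads(response.choices[0].message.content)
print(flight)   # e.g. {"origin_airport_code": "SEA", "destination_airport_code": "AMS", ...}
```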