2024
- Dec 29, 2024
LLM Research Papers: The 2024 List
I want to share my running bookmark list of many fascinating (mostly LLM-related) papers I stumbled upon in 2024. It's just a list, but maybe it will come in handy for those who are interested in finding some gems to read for the holidays.
- Nov 3, 2024
Understanding Multimodal LLMs An Introduction to the Main Techniques and Latest Models
There has been a lot of new research on the multimodal LLM front, including the latest Llama 3.2 vision models, which employ diverse architectural strategies to integrate various data types like text and images. For instance, the decoder-only method uses a single stack of decoder blocks to process all modalities sequentially. On the other hand, cross-attention methods (used, for example, in Llama 3.2) involve separate encoders for different modalities with a cross-attention layer that allows these encoders to interact. This article explains how these different types of multimodal LLMs function. Additionally, I will review and summarize roughly a dozen other recent multimodal papers and models published in recent weeks to compare their approaches.
- Sep 21, 2024
Building A GPT-Style LLM Classifier From Scratch Finetuning a GPT Model for Spam Classification
This article shows you how to transform pretrained large language models (LLMs) into strong text classifiers. But why focus on classification? First, finetuning a pretrained model for classification offers a gentle yet effective introduction to model finetuning. Second, many real-world and business challenges revolve around text classification: spam detection, sentiment analysis, customer feedback categorization, topic labeling, and more.
- Sep 1, 2024
Building LLMs from the Ground Up: A 3-hour Coding Workshop
This tutorial is aimed at coders interested in understanding the building blocks of large language models (LLMs), how LLMs work, and how to code them from the ground up in PyTorch. We will kick off this tutorial with an introduction to LLMs, recent milestones, and their use cases. Then, we will code a small GPT-like LLM, including its data input pipeline, core architecture components, and pretraining code ourselves. After understanding how everything fits together and how to pretrain an LLM, we will learn how to load pretrained weights and finetune LLMs using open-source libraries.
- Aug 17, 2024
New LLM Pre-training and Post-training Paradigms -- A Look at How Modern LLMs Are Trained
There are hundreds of LLM papers each month proposing new techniques and approaches. However, one of the best ways to see what actually works well in practice is to look at the pre-training and post-training pipelines of the most recent state-of-the-art models. Luckily, four major new LLMs have been released in the last months, accompanied by relatively detailed technical reports. In this article, I focus on the pre-training and post-training pipelines of the following models: Alibaba's Qwen 2, Apple Intelligence Foundation Language Models, Google's Gemma 2, Meta AI's Llama 3.1.
- Jul 20, 2024
Instruction Pretraining LLMs -- The Latest Research in Instruction Finetuning
This article covers a new, cost-effective method for generating data for instruction finetuning LLMs; instruction finetuning from scratch; pretraining LLMs with instruction data; and an overview of what's new in Gemma 2.
- Jun 2, 2024
Developing an LLM: Building, Training, Finetuning A Deep Dive into the Lifecycle of LLM Development
This is an overview of the LLM development process. This one-hour talk focuses on the essential three stages of developing an LLM: coding the architecture, implementing pretraining, and fine-tuning the LLM. Lastly, we also discuss the main ways LLMs are evaluated, along with the caveats of each method.
- Jun 2, 2024
LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments? Discussing the Latest Model Releases and AI Research in May 2024
This article covers three new papers related to instruction finetuning and parameter-efficient finetuning with LoRA in large language models (LLMs). I work with these methods on a daily basis, so it's always exciting to see new research that provides practical insights.
- May 12, 2024
How Good Are the Latest Open LLMs? And Is DPO Better Than PPO? Discussing the Latest Model Releases and AI Research in April 2024
What a month! We had four major open LLM releases: Mixtral, Meta AI's Llama 3, Microsoft's Phi-3, and Apple's OpenELM. In my new article, I review and discuss all four of these major transformer-based LLM model releases, followed by new research on reinforcement learning with human feedback methods for instruction finetuning using PPO and DPO algorithms.
- Apr 20, 2024
Using and Finetuning Pretrained Transformers
What are the different ways to use and finetune pretrained large language models (LLMs)? The three most common ways to use and finetune pretrained LLMs include a feature-based approach, in-context prompting, and updating a subset of the model parameters. First, most pretrained LLMs or language transformers can be utilized without the need for further finetuning. For instance, we can employ a feature-based method to train a new downstream model, such as a linear classifier, using embeddings generated by a pretrained transformer. Second, we can showcase examples of a new task within the input itself, which means we can directly exhibit the expected outcomes without requiring any updates or learning from the model. This concept is also known as prompting. Finally, it’s also possible to finetune all or just a small number of parameters to achieve the desired outcomes. This article discusses these types of approaches in greater depth.
- Mar 31, 2024
Tips for LLM Pretraining and Evaluating Reward Models Research Papers in March 2024
It's another month in AI research, and it's hard to pick favorites. This month, I am going over a paper that discusses strategies for the continued pretraining of LLMs, followed by a discussion of reward modeling used in reinforcement learning with human feedback (a popular LLM alignment method), along with a new benchmark. Continued pretraining for LLMs is an important topic because it allows us to update existing LLMs, for instance, ensuring that these models remain up-to-date with the latest information and trends. Also, it allows us to adapt them to new target domains without having to retrain them from scratch. Reward modeling is important because it allows us to align LLMs more closely with human preferences and, to some extent, helps with safety. But beyond human preference optimization, it also provides a mechanism for learning and adapting LLMs to complex tasks by providing instruction-output examples where explicit programming of correct behavior is challenging or impractical.
- Mar 3, 2024
Research Papers in February 2024 — A LoRA Successor, Small Finetuned LLMs Vs Generalist LLMs, and Transparent LLM Research
Once again, this has been an exciting month in AI research. This month, I'm covering two new openly available LLMs, insights into small finetuned LLMs, and a new parameter-efficient LLM finetuning technique. The two LLMs mentioned above stand out for several reasons. One LLM (OLMo) is completely open source, meaning that everything from the training code to the dataset to the log files is openly shared. The other LLM (Gemma) also comes with openly available weights but achieves state-of-the-art performance on several benchmarks and outperforms popular LLMs of similar size, such as Llama 2 7B and Mistral 7B, by a large margin.
- Feb 18, 2024
Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch
Low-rank adaptation (LoRA) is a machine learning technique that modifies a pretrained model (for example, an LLM or vision transformer) to better suit a specific, often smaller, dataset by adjusting only a small, low-rank subset of the model's parameters. In this article, we will take a look at both LoRA and DoRA, which is a new promising alternative to LoRA.
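The low-rank idea described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the article's implementation: the names `matmul`, `lora_forward`, `W`, `A`, `B`, and `alpha` are assumptions made for this example, and the row-vector convention `x @ W + (alpha / r) * (x @ A @ B)` with a zero-initialized `B` is one common way to write the update.

```python
# Minimal LoRA-style forward pass in plain Python (illustrative sketch only).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, alpha):
    """Compute x @ W + (alpha / r) * (x @ A @ B).

    W (d_in x d_out) is the frozen pretrained weight matrix.
    A (d_in x r) and B (r x d_out) are the only trainable matrices.
    """
    r = len(B)  # rank = number of rows of B
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    scale = alpha / r
    return [
        [b + scale * u for b, u in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]


d_in, d_out, r = 4, 3, 1
W = [[0.1 * (i + j) for j in range(d_out)] for i in range(d_in)]  # frozen weights
A = [[0.5] for _ in range(d_in)]        # d_in x r, would be trained
B = [[0.0] * d_out for _ in range(r)]   # r x d_out, zero-initialized (assumption)
x = [[1.0, 2.0, 3.0, 4.0]]

out = lora_forward(x, W, A, B, alpha=2.0)

# Number of trainable values with and without the low-rank factorization
trainable_lora = d_in * r + r * d_out
trainable_full = d_in * d_out
```

With `B` initialized to zeros, the adapted layer starts out identical to the pretrained one, and only the entries of `A` and `B` need gradients. The savings are small in this toy example but grow quickly with the layer size.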
2023
- Sep 15, 2023
Optimizing LLMs From a Dataset Perspective
This article focuses on improving the modeling performance of LLMs by finetuning them using carefully curated datasets. Specifically, this article highlights strategies that involve modifying, utilizing, or manipulating the datasets for instruction-based finetuning rather than altering the model architecture or training algorithms (the latter will be topics of a future article). This article will also explain how you can prepare your own datasets to finetune open-source LLMs.
- Aug 10, 2023
The NeurIPS 2023 LLM Efficiency Challenge Starter Guide
Large language models (LLMs) offer one of the most interesting opportunities for developing more efficient training methods. A few weeks ago, the NeurIPS 2023 LLM Efficiency Challenge launched to focus on efficient LLM finetuning, and this guide is a short walkthrough explaining how to participate in this competition. This article covers everything you need to know, from setting up the coding environment to making the first submission.
- Jul 1, 2023
Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch
Peak memory consumption is a common bottleneck when training deep learning models such as vision transformers and LLMs. This article provides a series of techniques that can lower memory consumption by approximately 20x without sacrificing modeling performance and prediction accuracy.
- Jun 14, 2023
Finetuning Falcon LLMs More Efficiently With LoRA and Adapters
Finetuning allows us to adapt pretrained LLMs in a cost-efficient manner. But which method should we use? This article compares different parameter-efficient finetuning methods for the latest top-performing open-source LLM, Falcon. Using parameter-efficient finetuning methods outlined in this article, it's possible to finetune an LLM in 1 hour on a single GPU instead of a day on 6 GPUs.
- May 11, 2023
Accelerating Large Language Models with Mixed-Precision Techniques
Training and using large language models (LLMs) is expensive due to their large compute requirements and memory footprints. This article will explore how leveraging lower-precision formats can enhance training and inference speeds up to 3x without compromising model accuracy.
- Apr 26, 2023
Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA)
Pretrained large language models are often referred to as foundation models for a good reason: they perform well on various tasks, and we can use them as a foundation for finetuning on a target task. As an alternative to updating all layers, which is very expensive, parameter-efficient methods such as prefix tuning and adapters have been developed. Let's talk about one of the most popular parameter-efficient finetuning techniques: Low-rank adaptation (LoRA). What is LoRA? How does it work? And how does it compare to the other popular finetuning approaches? Let's answer all these questions in this article!
- Apr 12, 2023
Understanding Parameter-Efficient Finetuning of Large Language Models: From Prefix Tuning to LLaMA-Adapters
In the rapidly evolving field of artificial intelligence, utilizing large language models in an efficient and effective manner has become increasingly important. Parameter-efficient finetuning stands at the forefront of this pursuit, allowing researchers and practitioners to reuse pretrained models while minimizing their computational and resource footprints. This article explains the broad concept of finetuning and discusses popular parameter-efficient alternatives like prefix tuning and adapters. Finally, we will look at the recent LLaMA-Adapter method and see how we can use it in practice.
- Mar 28, 2023
Finetuning Large Language Models On A Single GPU Using Gradient Accumulation
Previously, I shared an article using multi-GPU training strategies to speed up the finetuning of large language models. Several of these strategies include mechanisms such as model or tensor sharding that distributes the model weights and computations across different devices to work around GPU memory limitations. However, many of us don't have access to multi-GPU resources. So, this article illustrates a simple technique that works as a great workaround to train models with larger batch sizes when GPU memory is a concern: gradient accumulation.
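To make the idea of gradient accumulation concrete, here is a small sketch in plain Python, assuming a one-parameter linear model `y = w * x` with a mean squared error loss. The function names `mse_grad` and `accumulated_grad` are hypothetical, and the sketch assumes the dataset size is divisible by the micro-batch size.

```python
# Gradient accumulation on a toy model (illustrative sketch only).

def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients to emulate one large-batch gradient."""
    n_steps = len(xs) // micro_batch_size  # assumes an even split
    grad = 0.0
    for step in range(n_steps):
        lo = step * micro_batch_size
        hi = lo + micro_batch_size
        # Divide by the number of accumulation steps so that the running sum
        # equals the mean gradient over the full batch.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_steps
    return grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5

full_batch = mse_grad(w, xs, ys)
accumulated = accumulated_grad(w, xs, ys, micro_batch_size=2)
```

Both values agree up to floating-point rounding, which is why the weight update can be deferred until several small batches have been processed while only one small batch has to fit in memory at a time.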
- Mar 23, 2023
Keeping Up With AI Research And News
When it comes to productivity workflows, there are a lot of things I'd love to share. However, the one topic many people ask me about is how I keep up with machine learning and AI at large, and how I find interesting papers.
- Feb 23, 2023
Some Techniques To Make Your PyTorch Models Train (Much) Faster
This blog post outlines techniques for improving the training performance of your PyTorch model without compromising its accuracy. To do so, we will wrap a PyTorch model in a LightningModule and use the Trainer class to enable various training optimizations. By changing only a few lines of code, we can reduce the training time on a single GPU from 22.53 minutes to 2.75 minutes while maintaining the model's prediction accuracy. Yes, that's an 8x performance boost!
- Feb 9, 2023
Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch
In this article, we are going to understand how self-attention works from scratch. This means we will code it ourselves one step at a time. Since its introduction via the original transformer paper, self-attention has become a cornerstone of many state-of-the-art deep learning models, particularly in the field of Natural Language Processing. Since self-attention is now everywhere, it's important to understand how it works.
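As a rough sketch of what coding self-attention one step at a time can look like, the following plain-Python example computes scaled dot-product self-attention for three tokens. The helper names (`matmul`, `softmax`, `self_attention`) and the toy projection matrices `W_q`, `W_k`, and `W_v` are assumptions for illustration, not the article's code.

```python
# Scaled dot-product self-attention in plain Python (illustrative sketch only).
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, W_q, W_k, W_v):
    """Return (context vectors, attention weights) for the input tokens x."""
    q = matmul(x, W_q)  # queries
    k = matmul(x, W_k)  # keys
    v = matmul(x, W_v)  # values
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of the key matrix
    scores = matmul(q, k_t)               # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three tokens with two-dimensional embeddings (toy values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[0.5, 0.1], [0.2, 0.3]]
W_k = [[0.4, 0.6], [0.1, 0.9]]
W_v = [[1.0, 0.0], [0.0, 1.0]]

context, weights = self_attention(x, W_q, W_k, W_v)
```

Each row of `weights` sums to 1, so every context vector is a weighted average of the value vectors of all tokens in the sequence.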
- Feb 7, 2023
Understanding Large Language Models -- A Transformative Reading List
Since transformers have such a big impact on everyone's research agenda, I wanted to flesh out a short reading list for machine learning researchers and practitioners getting started with large language models.
- Feb 1, 2023
What Are the Different Approaches for Detecting Content Generated by LLMs Such As ChatGPT? And How Do They Work and Differ?
Since the release of the AI Classifier by OpenAI made big waves yesterday, I wanted to share a few details about the different approaches for detecting AI-generated text. This article briefly outlines four approaches to identifying AI-generated content.
- Jan 29, 2023
Comparing Different Automatic Image Augmentation Methods in PyTorch
Data augmentation is a key tool in reducing overfitting, whether it's for images or text. This article compares three Auto Image Data Augmentation techniques in PyTorch: AutoAugment, RandAugment, and TrivialAugment.
- Jan 16, 2023
Curated Resources and Trustworthy Experts: The Key Ingredients for Finding Accurate Answers to Technical Questions in the Future
Conversational chatbots such as ChatGPT probably will not be able to replace traditional search engines and expert knowledge anytime soon. With the vast amount of misinformation available on the internet, the ability to distinguish between credible and unreliable sources remains challenging and crucial.
- Jan 15, 2023
Training an XGBoost Classifier Using Cloud GPUs Without Worrying About Infrastructure
Imagine you want to quickly train a few machine learning or deep learning models on the cloud but don't want to deal with cloud infrastructure. This short article explains how we can get our code up and running in seconds using the open source lightning library.
- Jan 5, 2023
Open Source Highlights 2022 for Machine Learning & AI
Recently, I shared the top 10 papers that I read in 2022. As a follow-up, I am compiling a list of my favorite 10 open-source releases that I discovered, used, or contributed to in 2022.
- Jan 3, 2023
Influential Machine Learning Papers Of 2022
Every day brings something new and exciting to the world of machine learning and AI, from the latest developments and breakthroughs in the field to emerging trends and challenges. To mark the start of the new year, below is a short review of the top ten papers I've read in 2022.
2022
- Oct 15, 2022
Ahead Of AI, And What's Next?
About monthly machine learning musings, and other things I am currently working on ...
- Jul 24, 2022
A Short Chronology Of Deep Learning For Tabular Data
Occasionally, I share research papers proposing new deep learning approaches for tabular data on social media, which is typically an excellent discussion starter. Often, people ask for additional methods or counterexamples. So, with this short post, I aim to briefly summarize the major papers on deep tabular learning I am currently aware of. However, I want to emphasize that no matter how interesting or promising deep tabular methods look, I still recommend using a conventional machine learning method as a baseline. There is a reason why I cover conventional machine learning before deep learning in my books.
- Jul 5, 2022
No, We Don't Have to Choose Batch Sizes As Powers Of 2
Regarding neural network training, I think we are all guilty of doing this: we choose our batch sizes as powers of 2, that is, 64, 128, 256, 512, 1024, and so forth. There are some valid theoretical justifications for this, but how does it pan out in practice? We had some discussions about that in the last couple of days, and here I want to write down some of the take-aways so I can reference them in the future. I hope you'll find this helpful as well!
- Jun 30, 2022
Sharing Deep Learning Research Models with Lightning Part 2: Leveraging the Cloud
In this article, we will deploy a Super Resolution App on the cloud using lightning.ai. The primary goal here is to see how easy it is to create and share a research demo. However, the cloud is for more than just model sharing: we will also learn how we can tap into additional GPU resources for model training.
- Jun 17, 2022
Sharing Deep Learning Research Models with Lightning Part 1: Building A Super Resolution App
In this post, we will build a Lightning App. Why? Because it is 2022, and it is time to explore a more modern take on interacting with, presenting, and sharing our deep learning models. We are going to tackle this in three parts. In this first part, we will learn what a Lightning App is and how we build a Super Resolution GAN demo.
- Jun 12, 2022
Taking Datasets, DataLoaders, and PyTorch’s New DataPipes for a Spin
The PyTorch team recently announced TorchData, a prototype library focused on implementing composable and reusable data loading utilities for PyTorch. In particular, the TorchData library is centered around DataPipes, which are meant to be a DataLoader-compatible replacement for the existing Dataset class.
- May 18, 2022
Running PyTorch on the M1 GPU
Today, PyTorch officially introduced GPU support for Apple's ARM M1 chips. This is an exciting day for Mac users out there, so I spent a few minutes trying it out in practice. In this short blog post, I will summarize my experience and thoughts with the M1 chip for deep learning tasks.
- Apr 25, 2022
Creating Confidence Intervals for Machine Learning Classifiers
Developing good predictive models hinges upon accurate performance evaluation and comparisons. However, when evaluating machine learning models, we typically have to work around many constraints, including limited data, independence violations, and sampling biases. Confidence intervals are no silver bullet, but at the very least, they can offer an additional glimpse into the uncertainty of the reported accuracy and performance of a model. This article outlines different methods for creating confidence intervals for machine learning models. Note that these methods also apply to deep learning.
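One widely used way to attach a confidence interval to a test-set accuracy is the normal approximation interval. The sketch below assumes a hypothetical helper named `normal_approx_interval`, a 95% confidence level (`z = 1.96`), and independent test examples. It is only one of several possible methods and is not necessarily the one the article recommends.

```python
# Normal approximation confidence interval for accuracy (illustrative sketch only).
import math


def normal_approx_interval(accuracy, n_test, z=1.96):
    """Return (lower, upper) bounds for a test-set accuracy.

    accuracy: fraction of correct predictions on the test set
    n_test:   number of test examples
    z:        z-score of the confidence level (1.96 for roughly 95%)
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return accuracy - z * standard_error, accuracy + z * standard_error


# Hypothetical numbers: 90% accuracy measured on 1,000 test examples
lower, upper = normal_approx_interval(0.9, 1000)
```

For these hypothetical numbers, the interval spans roughly 0.88 to 0.92. The same accuracy measured on only 100 examples would give a noticeably wider interval.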
- Apr 4, 2022
Losses Learned -- Optimizing Negative Log-Likelihood and Cross-Entropy in PyTorch (Part 1)
The cross-entropy loss is our go-to loss for training deep learning-based classifiers. In this article, I am giving you a quick tour of how we usually compute the cross-entropy loss and how we compute it in PyTorch. There are two parts to it, and here we will look at a binary classification context first. You may wonder why bother writing this article; computing the cross-entropy loss should be relatively straightforward!? Yes and no. We can compute the cross-entropy loss in one line of code, but there's a common gotcha due to numerical optimizations under the hood. (And yes, when I am not careful, I sometimes make this mistake, too.) So, in this article, let me tell you a bit about deep learning jargon, improving numerical performance, and what could go wrong.
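As one possible illustration of why numerical details matter here, the sketch below compares a naive binary cross-entropy computed from a sigmoid probability with a rearranged version computed directly from the logit. The function names `bce_naive` and `bce_from_logits` are hypothetical, and this is an assumption about the kind of issue involved rather than a summary of the article's example.

```python
# Naive versus logit-based binary cross-entropy (illustrative sketch only).
import math


def bce_naive(logit, target):
    """Binary cross-entropy computed from the sigmoid probability."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_from_logits(logit, target):
    """The same loss rearranged so that it never evaluates log(0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# For moderate logits, both versions agree
moderate_naive = bce_naive(2.0, 1.0)
moderate_stable = bce_from_logits(2.0, 1.0)

# For an extreme logit, the sigmoid rounds to exactly 1.0 and the naive version fails
try:
    extreme_naive = bce_naive(800.0, 0.0)
except ValueError:
    extreme_naive = None  # math.log(0.0) raises a domain error
extreme_stable = bce_from_logits(800.0, 0.0)
```

This is the reason loss functions that accept raw logits are usually preferred over applying a sigmoid or softmax first and then taking the logarithm.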
- Mar 24, 2022
TorchMetrics -- How do we use it, and what's the difference between .update() and .forward()?
TorchMetrics is a really nice and convenient library that lets us compute the performance of models in an iterative fashion. It's designed with PyTorch (and PyTorch Lightning) in mind, but it is a general-purpose library compatible with other libraries and workflows. This iterative computation is useful if we want to track a model during iterative training or evaluation on minibatches (and optionally across multiple GPUs). In deep learning, that's essentially *all the time*. However, when using TorchMetrics, one common question is whether we should use `.update()` or `.forward()`. (And that's also a question I certainly had when I started using it.) Here's a hands-on example and explanation.
- Feb 25, 2022
Machine Learning with PyTorch and Scikit-Learn -- The *new* Python Machine Learning Book
Machine Learning with PyTorch and Scikit-Learn has been a long time in the making, and I am excited to finally get to talk about the release of my new book. Initially, this project started as the 4th edition of Python Machine Learning. However, we made so many changes to the book that we thought it deserved a new title to reflect that. So, what's new, you may wonder? In this post, I am excited to tell you all about it.
2021
- Dec 29, 2021
Introduction to Machine Learning -- Video Lectures about Python Basics, Tree-based Methods, Model Evaluation, and Feature Selection
About half a year ago, I organized all my deep learning-related videos in a handy blog post to have everything in one place. Since many people liked this post, and because I like to use my winter break to get organized, I thought I could free two birds with one key by compiling this list below. Here, you find a list of approximately 90 machine learning lectures I recorded in 2020 and 2021! Once again, I hope this is useful to you!
- Jul 9, 2021
Introduction to Deep Learning -- 170 Video Lectures from Adaptive Linear Neurons to Zero-shot Classification with Transformers
I just sat down this morning and organized all deep learning related videos I recorded in 2021. I am sure this will be a useful reference for my future self, but I am also hoping it might be useful for one or the other person out there. PS: All code examples are in PyTorch :)
- Feb 11, 2021
Datasets for Machine Learning and Deep Learning -- Some of the Best Places to Explore
With the semester being in full swing, I recently shared this set of dataset repositories with my deep learning class. However, I thought that beyond using this list for finding inspiration for interesting student class projects, these are also good places to look for additional benchmark datasets for your model.
- Jan 21, 2021
Book Review: Deep Learning With PyTorch -- A Practical Deep Learning Guide With a Computer Vision Focus and an Interesting Structure
After its release in August 2020, Deep Learning with PyTorch had been sitting on my shelf until I finally got a chance to read it during this winter break. It turned out to be the perfect easy-going reading material for a bit of productivity after the relaxing holidays. As promised last week, here are my thoughts.
- Jan 3, 2021
How I Keep My Projects Organized
Since I started my undergraduate studies in 2008, I have been obsessed with productivity tips, notetaking solutions, and todo-list management. Over the years, I tried many, many workflows and hundreds of (mostly digital) tools to keep my life, projects, and notes organized. Occasionally, I exchange ideas with friends and colleagues, and upon request, I talked about my workflow a couple of times on Twitter. After today's 2021-edition of this discussion, I thought that writing a quick and informal blogpost makes sense, making it easier to read and having a quick reference if someone asks about it again :).
2020
- Sep 27, 2020
Scientific Computing in Python: Introduction to NumPy and Matplotlib -- Including Video Tutorials
Since many students in my Stat 451 (Introduction to Machine Learning and Statistical Pattern Classification) class are relatively new to Python and NumPy, I recently devoted a lecture to the latter. Since the course notes are based on an interactive Jupyter notebook file, which I used as a basis for the lecture videos, I thought it would be worthwhile to reformat it as a blog article with the embedded 'narrated content' -- the video recordings.
- Aug 26, 2020
Interpretable Machine Learning -- Book Review and Thoughts about Linear and Logistic Regression as Interpretable Models
In this blog post, I am (briefly) reviewing Christoph Molnar's *Interpretable Machine Learning Book*. Then, I am writing about two classic generalized linear models, linear and logistic regression. Mainly, this blog post explains the relationship between feature weights and predictions and demonstrates how to construct confidence intervals via Python.
- Aug 5, 2020
Chapter 1: Introduction to Machine Learning and Deep Learning
The first chapter (draft) of the Introduction to Deep Learning book, which is a book based on my lecture notes and slides.
- Jan 6, 2020
Book Review: Architects of Intelligence by Martin Ford
A brief review of Martin Ford's book that features interviews with 23 of the most well-known and brightest minds working on AI.
2019
- Dec 12, 2019
What's New in the 3rd Edition
A brief summary of what's new in the 3rd edition of Python Machine Learning.
- May 24, 2019
My First Year at UW-Madison and a Gallery of Awesome Student Projects
Not too long ago, in the summer of 2018, I was super excited to join the Department of Statistics at the University of Wisconsin-Madison after obtaining my Ph.D. after ~5 long and productive years. Now, two semesters later and after finals week, I finally found some quiet days to look back on what's happened since then. In this post, I am sharing a short reflection as well as some of the exciting projects my students were working on.
2018
- Nov 10, 2018
Model evaluation, model selection, and algorithm selection in machine learning Part IV - Comparing the performance of machine learning models and algorithms using statistical tests and nested cross-validation
This final article in the series *Model evaluation, model selection, and algorithm selection in machine learning* presents overviews of several statistical hypothesis testing approaches, with applications to machine learning model and algorithm comparisons. This includes statistical tests based on target predictions for independent test sets (the downsides of using a single test set for model comparisons were discussed in previous articles) as well as methods for algorithm comparisons by fitting and evaluating models via cross-validation. Lastly, this article will introduce *nested cross-validation*, which has become a common and recommended method of choice for algorithm comparisons for small to moderately-sized datasets.
- Aug 2, 2018
Generating Gender-Neutral Face Images with Semi-Adversarial Neural Networks to Enhance Privacy
I thought that it would be nice to have short and concise summaries of recent projects handy, to share them with a more general audience, including colleagues and students. So, I challenged myself to use fewer than 1000 words without getting distracted by the nitty-gritty details and technical jargon. In this post, I mainly cover some of my recent research in collaboration with the [iPRoBe Lab](http://iprobe.cse.msu.edu) that falls under the broad category of developing approaches to hide specific information in face images. The research discussed in this post is about "maximizing privacy while preserving utility."
2016
- Oct 2, 2016
Model evaluation, model selection, and algorithm selection in machine learning Part III - Cross-validation and hyperparameter tuning
Almost every machine learning algorithm comes with a large number of settings that we, the machine learning researchers and practitioners, need to specify. These tuning knobs, the so-called hyperparameters, help us control the behavior of machine learning algorithms when optimizing for performance, finding the right balance between bias and variance. Hyperparameter tuning for performance optimization is an art in itself, and there are no hard-and-fast rules that guarantee best performance on a given dataset. In Part I and Part II, we saw different holdout and bootstrap techniques for estimating the generalization performance of a model. We learned about the bias-variance trade-off, and we computed the uncertainty of our estimates. In this third part, we will focus on different methods of cross-validation for model evaluation and model selection. We will use these cross-validation techniques to rank models from several hyperparameter configurations and estimate how well they generalize to independent datasets.
- Aug 13, 2016
Model evaluation, model selection, and algorithm selection in machine learning Part II - Bootstrapping and uncertainties
In the second part of this series, we will look at some advanced techniques for model evaluation and techniques to estimate the uncertainty of our estimated model performance as well as its variance and stability. Then, in the next article, we will shift the focus onto another task that is one of the main pillars of successful, real-world machine learning applications -- Model Selection.
- Jun 11, 2016
Model evaluation, model selection, and algorithm selection in machine learning Part I - The basics
Machine learning has become a central part of our life -- as consumers, customers, and hopefully as researchers and practitioners! Whether we are applying predictive modeling techniques to our research or business problems, I believe we have one thing in common: We want to make good predictions! Fitting a model to our training data is one thing, but how do we know that it generalizes well to unseen data? How do we know that it doesn't simply memorize the data we fed it and fail to make good predictions on future samples, samples that it hasn't seen before? And how do we select a good model in the first place? Maybe a different learning algorithm could be better-suited for the problem at hand? Model evaluation is certainly not just the end point of our machine learning pipeline.
Before we handle any data, we want to plan ahead and use techniques that are suited for our purposes. In this article, we will go over a selection of these techniques, and we will see how they fit into the bigger picture, a typical machine learning workflow.
2015
- Sep 24, 2015
Writing 'Python Machine Learning' – A Reflection on a Journey
It's been about time. I am happy to announce that "Python Machine Learning" was finally released today! Sure, I could just send an email around to all the people who were interested in this book. On the other hand, I could put down those 140 characters on Twitter (minus what it takes to insert a hyperlink) and be done with it. Even so, writing "Python Machine Learning" really was quite a journey for a few months, and I would like to sit down in my favorite coffeehouse once more to say a few words about this experience.
- Aug 24, 2015
Python, Machine Learning, and Language Wars – A Highly Subjective Point of View
This has really been quite a journey for me lately. And regarding the frequently asked question “Why did you choose Python for Machine Learning?” I guess it is about time to write my script. In this article, I really don’t mean to tell you why you or anyone else should use Python. But read on if you are interested in my opinion.
- Mar 24, 2015
Single-Layer Neural Networks and Gradient Descent
This article offers a brief glimpse of the history and basic concepts of machine learning. We will take a look at the first algorithmically described neural network and the gradient descent algorithm in the context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for modern multilayer neural networks in future articles.
- Jan 27, 2015
Principal Component Analysis in 3 Simple Steps
Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique that is used in numerous applications, such as stock market predictions, the analysis of gene expression data, and many more. In this tutorial, we will see that PCA is not just a “black box”, and we are going to unravel its internals in 3 basic steps.
- Jan 11, 2015
Implementing a Weighted Majority Rule Ensemble Classifier in scikit-learn
Here, I want to present a simple and conservative approach to implementing a weighted majority rule ensemble classifier in scikit-learn that yielded remarkably good results when I tried it in a Kaggle competition. For me personally, Kaggle competitions are just a nice way to try out and compare different approaches and ideas -- basically an opportunity to learn in a controlled environment with nice datasets.
2014
- Dec 5, 2014
MusicMood – A Machine Learning Model for Classifying Music by Mood Based on Song Lyrics
In this article, I want to share my experience with a recent data mining project, which was probably one of my favorite hobby projects so far. It's all about building a classification model that can automatically predict the mood of music based on song lyrics.
- Nov 28, 2014
Turn Your Twitter Timeline into a Word Cloud – using Python
Last week, I posted some visualizations in the context of my Happy Rock Song data mining project, and some people were curious about how I created the word clouds. Learn how to create a word cloud from YOUR personal Twitter timeline!
- Oct 4, 2014
Naive Bayes and Text Classification – Introduction and Theory
Naive Bayes classifiers, a family of classifiers that are based on the popular Bayes’ probability theorem, are known for creating simple yet well performing models, especially in the fields of document classification and disease prediction. In this first part of a series, we will take a look at the theory of naive Bayes classifiers and introduce the basic concepts of text classification. In following articles, we will implement those concepts to train a naive Bayes spam filter and apply naive Bayes to song classification based on lyrics.
- Sep 14, 2014
Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA
The focus of this article is to briefly introduce the idea of kernel methods and to implement a Gaussian radial basis function (RBF) kernel that is used to perform nonlinear dimensionality reduction via RBF kernel principal component analysis (kPCA).
- Aug 25, 2014
Predictive modeling, supervised machine learning, and pattern classification — the big picture
When I was working on my next pattern classification application, I realized that it might be worthwhile to take a step back and look at the big picture of pattern classification in order to put my previous topics into context and to provide an introduction for the future topics that are going to follow.
- Aug 3, 2014
Linear Discriminant Analysis – Bit by Bit
I received a lot of positive feedback about the step-wise Principal Component Analysis (PCA) implementation. Thus, I decided to write a little follow-up about Linear Discriminant Analysis (LDA) — another useful linear transformation technique. Both LDA and PCA are commonly used dimensionality reduction techniques in statistics, pattern classification, and machine learning applications. By implementing the LDA step-by-step in Python, we will see and understand how it works, and we will compare it to a PCA to see how it differs.
- Jul 19, 2014
Dixon's Q test for outlier identification – A questionable practice
I recently faced the impossible task of identifying outliers in a dataset with very, very small sample sizes, and Dixon's Q test caught my attention. Honestly, I am not a big fan of this statistical test, but since Dixon's Q test is still quite popular in certain scientific fields (e.g., chemistry), it is important to understand its principles in order to draw your own conclusions about the research data that you might stumble upon in research articles or scientific talks.
- Jul 11, 2014
About Feature Scaling and Normalization – and the effect of standardization for machine learning algorithms
I received a couple of questions in response to my previous article (Entry point: Data) where people asked me why I used Z-score standardization as the feature scaling method prior to the PCA. I added additional information to the original article. However, I thought that it might be worthwhile to write a few more lines about this important topic in a separate article.
- Jun 27, 2014
Entry Point Data – Using Python's sci-packages to prepare data for Machine Learning tasks and other data analyses
In this short tutorial I want to provide a short overview of some of my favorite Python tools for common procedures as entry points for general pattern classification and machine learning tasks, and various other data analyses.
- Jun 26, 2014
Molecular docking, estimating free energies of binding, and AutoDock's semi-empirical force field
Discussions and questions about methods, approaches, and tools for estimating (relative) binding free energies of protein-ligand complexes are quite popular, and even the simplest tools can be quite tricky to use. Here, I want to briefly summarize the idea of molecular docking and give a short overview of how we can use AutoDock 4.2's hybrid approach for evaluating binding affinities.
- Jun 20, 2014
An introduction to parallel programming using Python's multiprocessing module
The default Python interpreter was designed with simplicity in mind and has a thread-safe mechanism, the so-called "GIL" (Global Interpreter Lock). In order to prevent conflicts between threads, it allows only one thread to execute Python bytecode at a time (so-called serial processing, or single-threading). In this introduction to Python's multiprocessing module, we will see how we can spawn multiple subprocesses to avoid some of the GIL's disadvantages and make the best use of the multiple cores in our CPU.
- Jun 19, 2014
Kernel density estimation via the Parzen-Rosenblatt window method – explained using Python
The Parzen-window method (also known as the Parzen-Rosenblatt window method) is a widely used non-parametric approach to estimate a probability density function *p(**x**)* at a specific point ***x*** from a sample ***x**n* that doesn't require any knowledge or assumption about the underlying distribution.
- Jun 19, 2014
Numeric matrix manipulation – The cheat sheet for MATLAB, Python NumPy, R, and Julia
At its core, this article is about a simple cheat sheet for basic operations on numeric matrices, which can be very useful if you are working and experimenting with some of the most popular languages that are used for scientific computing, statistics, and data analysis.
- Jun 1, 2014
The key differences between Python 2.7.x and Python 3.x with examples
Many beginning Python users are wondering with which version of Python they should start. My answer to this question is usually something along the lines of 'just go with the version your favorite tutorial was written in, and check out the differences later on.' But what if you are starting a new project and have the choice to pick? I would say there is currently no 'right' or 'wrong' as long as both Python 2.7.x and Python 3.x support the libraries that you are planning to use. However, it is worthwhile to have a look at the major differences between those two most popular versions of Python to avoid common pitfalls when writing the code for either one of them, or if you are planning to port your project.
- May 28, 2014
5 simple steps for converting Markdown documents into HTML and adding Python syntax highlighting
In this little tutorial, I want to show you in 5 simple steps how easy it is to add code syntax highlighting to your blog articles.
- May 20, 2014
Creating a table of contents with internal links in IPython Notebooks and Markdown documents
Many people have asked me how I create the table of contents with internal links for my IPython Notebooks and Markdown documents on GitHub. Well, no (IPython) magic is involved, it is just a little bit of HTML, but I thought it might be worthwhile to write this little how-to tutorial.
- May 12, 2014
A Beginner's Guide to Python's Namespaces, Scope Resolution, and the LEGB Rule
A short tutorial about Python's namespaces and the scope resolution for variable names using the LEGB-rule with little quiz-like exercises.
- Apr 21, 2014
Diving deep into Python – the not-so-obvious language parts
Some while ago, I started to collect some of the not-so-obvious things I encountered when I was coding in Python. I thought that it was worthwhile sharing them and encourage you to take a brief look at the section-overview and maybe you'll find something that you do not already know - I can guarantee you that it'll likely save you some time at one or the other tricky debugging challenge.
- Apr 13, 2014
Implementing a Principal Component Analysis (PCA) – in Python, step by step
In this article I want to explain how a Principal Component Analysis (PCA) works by implementing it in Python step by step. At the end we will compare the results to the more convenient Python PCA() classes that are available through the popular matplotlib and scipy libraries and discuss how they differ.
- Mar 13, 2014
Installing Scientific Packages for Python3 on MacOS 10.9 Mavericks
I just went through some pain (again) when I wanted to install some of Python's scientific libraries on my second Mac. I summarized the setup and installation process for future reference. If you encounter any different or additional obstacles, let me know, and please feel free to make any suggestions to improve this short walkthrough.
- Mar 7, 2014
A thorough guide to SQLite database operations in Python
After I wrote the initial teaser article "SQLite - Working with large data sets in Python effectively" about how awesome SQLite databases are via sqlite3 in Python, I wanted to delve a little bit more into the SQLite syntax and provide you with some more hands-on examples.
- Feb 23, 2014
Using OpenEye software for substructure alignments and best-matching low-energy conformer overlays
This is a quick guide showing how to use OpenEye software command line tools to align target molecules to a query based on substructure matches and how to retrieve the best molecule overlay from two sets of low-energy conformers.
2013
- Dec 14, 2013
Unit testing in Python – Why we want to make it a habit
Let’s be honest, code testing is anything but a joyful task. However, a good unit testing framework makes this process as smooth as possible. Eventually, testing becomes a regular and continuous process, accompanied by the assurance that our code will operate as exactly and seamlessly as Swiss clockwork.
- Dec 8, 2013
A short tutorial for decent heat maps in R
I received many questions from people who want to visualize their data via heat maps, ideally as quickly as possible. This is a major issue in exploratory data analysis, since we often don’t have the time to digest whole books about the particular techniques in different software packages to just get the job done. But once we are happy with our initial results, it might be worthwhile to dig deeper into the topic in order to further customize our plots and maybe even polish them for publication. In this post, my aim is to briefly introduce one of R’s several heat map libraries for a simple data analysis. I chose R because it is one of the most popular free statistical software packages around. Of course, there are many more tools out there to produce similar results (and even in R there are many different packages for heat maps), but I will leave this as an open topic for another time.
- Nov 3, 2013
SQLite – Working with large data sets in Python effectively
My new project confronted me with the task of screening a massive set of large data files in text format with billions of entries each. I will have to retrieve data repeatedly and frequently in the future. Thus, I was tempted to find a better solution than brute-force scanning through ~20 separate 1-column text files with ~6 billion entries every time line by line.